Optimization#
Modeling#
A \(p_x\)-dimensional input vector \(\mathbf{x}_i\)
An associated target vector \(\mathbf{y}_i\) for each of the \(i = 1, \dots, n\) data points.
We create a model \(f\) with a set of \(p_\theta\) parameters \(\mathbf{\theta}\) as:
\(\hat{\mathbf{y}}_i = f(\mathbf{x}_i; \mathbf{\theta})\)
Our target vector \(\mathbf{y}_i\) can be
A single continuous output value
Membership in one of \(k\) classes
To find a solution to our prediction problem, we have to find \(\mathbf{\theta}\)
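To make the notation concrete, here is a minimal sketch in Python, assuming a simple linear form for \(f\); the linear choice (and the dimensions used) are illustrative assumptions, not the only option:

```python
import numpy as np

# A minimal sketch, assuming f is linear; any parametric model would do.
def f(X, theta):
    """Map an (n, p_x) input matrix to predictions via p_theta parameters."""
    return X @ theta  # linear model: one illustrative choice of f

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))       # n = 100 points, p_x = 3 features
theta = np.array([1.0, -2.0, 0.5])  # p_theta = 3 parameters
y_hat = f(X, theta)                 # predictions for each data point
```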
Finding \(\theta\)#
To find \(\mathbf{\theta}\), we must construct a loss (objective) function that describes the distance between \(\mathbf{y}_i\) and \(f(\mathbf{x}_i; \mathbf{\theta})\). In the first example, we can use the squared error loss
\(Q(\mathbf{\theta}) = \sum_{i=1}^{n} \left( \mathbf{y}_i - f(\mathbf{x}_i; \mathbf{\theta}) \right)^2\)
In the second example, we can use the cross-entropy loss
\(Q(\mathbf{\theta}) = -\sum_{i=1}^{n} \sum_{c=1}^{k} y_{ic} \log \hat{y}_{ic}\)
In either case, we seek a set of model parameters that minimize the loss function \(Q\)
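As a hedged sketch, both losses can be written in a few lines of NumPy; the function names and the `eps` guard against `log(0)` are illustrative assumptions, not fixed notation from the text:

```python
import numpy as np

def squared_error_loss(y, y_hat):
    """Q(theta) = sum_i (y_i - f(x_i; theta))^2, for continuous targets."""
    return np.sum((y - y_hat) ** 2)

def cross_entropy_loss(y_onehot, p_hat, eps=1e-12):
    """Q(theta) = -sum_i sum_c y_ic log p_ic, for class-membership targets."""
    return -np.sum(y_onehot * np.log(p_hat + eps))  # eps avoids log(0)
```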
A note on classification
Similar to Generalized Linear Models, we must consider that we are predicting a probability \(p \in (0,1)\), so we should ensure that whatever comes out of our model maps to that interval
Typically, we will let models learn on whatever space they want and then squeeze the outputs into \((0,1)\) with the softmax function
This function takes the linear output for each class and scales it to \((0,1)\) so that the results sum to 1: \(\mathrm{softmax}(\mathbf{z})_c = e^{z_c} / \sum_{j} e^{z_j}\)
In the binary case, this is just the sigmoid
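A minimal sketch of both maps, assuming NumPy; the max-subtraction in `softmax` is a standard numerical-stability trick, not part of the definition:

```python
import numpy as np

def softmax(z):
    """Scale a vector of linear outputs to (0, 1) so they sum to 1."""
    z = z - np.max(z)          # stability: exp() of large values overflows
    e = np.exp(z)
    return e / e.sum()

def sigmoid(z):
    """Binary case: softmax over two classes collapses to 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# Usage: three class scores become probabilities that sum to 1
print(softmax(np.array([2.0, 0.5, -1.0])))
```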
Finding a suitable \(\theta\)#
Goal: Find the set of parameters \(\mathbf{\theta}\) that minimizes the loss function \(Q\)
Idea: start with a guess and iterate, using the gradient as a guide (a code sketch follows after this list)
Steps
\(\hat{\mathbf{\theta}}^{(j+1)} = \hat{\mathbf{\theta}}^{(j)} - \epsilon_j \mathbf{A}_j \frac{\partial Q (\hat{\mathbf{\theta}}^{(j)} ) }{\partial \mathbf{\theta}}\)
where
\(\frac{\partial Q (\hat{\mathbf{\theta}}^{(j)} ) }{\partial \mathbf{\theta}}\) is the gradient of the objective function \(Q(\mathbf{\theta})\) evaluated at \(\hat{\mathbf{\theta}}^{(j)}\)
Tells us the direction each parameter should move in to reduce the loss
Also denoted as \(\nabla_\theta Q (\hat{\mathbf{\theta}}^{(j)})\)
\(\epsilon_j\) is the learning rate, which allows us to stretch or shrink the magnitude of the movement
\(\mathbf{A}_j\) is a positive definite matrix
Typically the inverse of the squared Jacobian \((\mathbf{J}^\top \mathbf{J})^{-1}\) (Gauss-Newton) or of the Hessian (Newton-Raphson)
Allows the optimization to have a deeper understanding of the surface being optimized.
Basically, you incorporate more information so that you can reach your goal faster
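Putting the pieces together, here is a minimal sketch of the update rule above, assuming a user-supplied gradient function `grad_Q` (hypothetical); \(\mathbf{A}_j\) defaults to the identity, which recovers plain gradient descent, and a fixed learning rate stands in for \(\epsilon_j\):

```python
import numpy as np

def optimize(theta0, grad_Q, eps=0.01, n_steps=1000, A=None):
    """Iterate theta^(j+1) = theta^(j) - eps * A @ grad Q(theta^(j))."""
    theta = theta0.astype(float).copy()
    if A is None:
        A = np.eye(theta.size)   # identity: plain gradient descent
    for _ in range(n_steps):
        theta -= eps * A @ grad_Q(theta)
    return theta

# Usage: Q(theta) = ||theta - 1||^2 has gradient 2 * (theta - 1),
# so the iterates should converge to the vector of ones.
theta_hat = optimize(np.zeros(3), lambda t: 2 * (t - 1))  # approx [1, 1, 1]
```

Supplying a curvature-aware \(\mathbf{A}_j\) (e.g., an inverse Hessian) in place of the identity is what turns this loop into the Gauss-Newton or Newton-Raphson variants described above.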