Optimization#

Modeling#

Suppose we observe a \(p_x\)-dimensional input vector \(\mathbf{x}_i\) and an associated target vector \(\mathbf{y}_i\) for each of the \(i = 1, \dots, n\) data points.

We create a model \(f\) with a set of \(p_\theta\) parameters \(\mathbf{\theta}\) as:

\[ f(\mathbf{x}_i; \mathbf{\theta}) \]

Our target vector \(\mathbf{y}_i\) can be

  1. A single continuous output value

  2. Membership in one of \(k\) classes

To find a solution to our prediction problem, we have to find \(\mathbf{\theta}\)
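For concreteness, here is a minimal sketch of such a model in Python, assuming a simple linear form \(f(\mathbf{x}_i; \mathbf{\theta}) = \mathbf{\theta}'\mathbf{x}_i\) (one of many possible choices of \(f\); the dimensions below are made up for illustration):

```python
import numpy as np

# A minimal sketch of a model f(x_i; theta), assuming a linear form.
# The data dimensions (n = 4 points, p_x = 2 features) are illustrative.

def f(x, theta):
    return x @ theta  # linear model: theta' x for each row of x

X = np.random.default_rng(0).normal(size=(4, 2))  # n = 4 inputs, p_x = 2
theta = np.array([0.5, -1.0])                     # p_theta = 2 parameters
predictions = f(X, theta)                         # one prediction per point
```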

Finding \(\theta\)#

To find \(\mathbf{\theta}\), we must construct a loss (objective) function that describes the distance between \(\mathbf{y}_i\) and \(f(\mathbf{x}_i; \mathbf{\theta})\). In the first case, we can use the squared error loss

\[ Q(f(\mathbf{x}); y; \mathbf{\theta}) = \sum_{i = 1}^n (y_i - f(\mathbf{x}_i; \mathbf{\theta}))^2 \]
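As a quick sketch, this loss is straightforward to compute; the arrays below are illustrative placeholders, not data from these notes:

```python
import numpy as np

# A minimal sketch of the squared error loss Q, summing squared
# residuals between targets y and predictions y_hat.

def squared_error_loss(y, y_hat):
    return np.sum((y - y_hat) ** 2)

y = np.array([1.0, 2.0, 3.0])        # targets y_i
y_hat = np.array([1.1, 1.9, 3.2])    # model outputs f(x_i; theta)
print(squared_error_loss(y, y_hat))  # sum of squared residuals
```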

In the second case, we can use the cross-entropy loss

\[ Q(f(\mathbf{x}); y ; \mathbf{\theta}) = - \frac{1}{n} \sum_{i=1}^n \sum_{c=1}^{k} y_{i, c} \log (f(\mathbf{x}_i; \mathbf{\theta})_c) \]
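A minimal sketch of this loss, assuming the targets are one-hot encoded and the model already outputs class probabilities (both arrays below are made up for illustration):

```python
import numpy as np

# A minimal sketch of the cross-entropy loss, assuming `y` is one-hot
# encoded (n x k) and `p_hat` holds predicted class probabilities (n x k).

def cross_entropy_loss(y, p_hat):
    return -np.mean(np.sum(y * np.log(p_hat), axis=1))

y = np.array([[1, 0, 0],
              [0, 1, 0]])            # two points, k = 3 classes
p_hat = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])  # model output after softmax
print(cross_entropy_loss(y, p_hat))
```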

In either case, we seek the set of model parameters that minimizes the loss function \(Q\):

\[ \hat{\mathbf{\theta}} = \underset{\mathbf{\theta}}{\operatorname{argmin}} \, Q(f(\mathbf{x}); y; \mathbf{\theta}) \]
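In practice we find \(\hat{\mathbf{\theta}}\) numerically. As a hedged illustration, here is one generic way to do so with scipy's off-the-shelf `minimize` (not the gradient-based procedure developed below); the linear model and data are made up for the sketch:

```python
import numpy as np
from scipy.optimize import minimize

# Hand the loss Q to a generic numerical optimizer. The model and the
# data here are illustrative, not from these notes.

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0]])
y = np.array([5.0, 4.0, 9.0])

def Q(theta):
    return np.sum((y - X @ theta) ** 2)  # squared error loss

result = minimize(Q, x0=np.zeros(2))     # start from an initial guess
print(result.x)                          # theta-hat = argmin Q
```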

A note on classification

  • As with Generalized Linear Models, we are predicting a probability \(p \in (0,1)\), so we should ensure that whatever comes out of our model maps to that interval

  • Typically, we will let models learn on whatever space they want and then squeeze the outputs into \((0,1)\) with the softmax function (see the sketch after this list)

\[ p(y_i = k |\mathbf{x}_i, \mathbf{\theta}) = \frac{\exp(\mathbf{\theta}_k' \mathbf{x}_i)}{\sum_{l = 1}^K \exp(\mathbf{\theta}_l' \mathbf{x}_i)} \]
  • This function takes a linear output for each class and scales the outputs into \((0,1)\) so that their sum is 1

  • In the binary case, this would just be the sigmoid function
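A minimal softmax sketch in Python (the class scores below are made-up values, not the output of any model in these notes):

```python
import numpy as np

# A minimal sketch of the softmax map from raw linear scores to
# class probabilities.

def softmax(z):
    z = z - np.max(z)            # subtract the max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()   # scales into (0, 1) and sums to 1

scores = np.array([2.0, 1.0, -1.0])  # one linear output per class
probs = softmax(scores)
print(probs, probs.sum())            # probabilities summing to 1
```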

Finding a suitable \(\theta\)#

  • Goal: Find a set of parameters \(\mathbf{\theta}\) that minimizes the loss function \(Q\)

  • Idea: start with a guess and iterate, using the gradient as a guide

Steps

\[ \hat{\mathbf{\theta}}^{(j + 1)} = \hat{\mathbf{\theta}}^{(j)} - \epsilon_j \mathbf{A}_j \frac{\partial Q (\hat{\mathbf{\theta}}^{(j)} ) }{\partial \mathbf{\theta}} \]

where

  • \(\frac{\partial Q (\hat{\mathbf{\theta}}^{(j)} ) }{\partial \mathbf{\theta}}\) is the gradient of the objective function \(Q(\mathbf{\theta})\) at \(\hat{\mathbf{\theta}}^{(j)}\)

    • Tells us the direction each parameter should move to reduce the loss

    • Also denoted as \(\nabla_\theta Q (\hat{\mathbf{\theta}}^{(j)})\)

  • \(\epsilon_j\) is the learning rate, which allows us to stretch or shrink the magnitude of the movement

  • \(\mathbf{A}_j\) is a positive definite matrix

    • Typically the inverse of the squared Jacobian (Gauss-Newton) or of the Hessian (Newton-Raphson)

    • Allows the optimization to have a deeper understanding of the surface being optimized.

    • Basically, you incorporate more information so that you can reach your goal faster
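Putting the pieces together, here is a minimal sketch of the update rule for the squared error loss with a linear model, taking \(\mathbf{A}_j\) to be the identity (plain gradient descent) and holding the learning rate constant; the data, learning rate, and step count are illustrative choices:

```python
import numpy as np

# Plain gradient descent (A_j = identity, constant epsilon) for the
# squared error loss Q with a linear model f(x; theta) = X @ theta.
# Data, learning rate, and step count are illustrative.

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                # n = 100 points, p_x = 3
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta + rng.normal(scale=0.1, size=100)

theta = np.zeros(3)                          # initial guess theta^(0)
epsilon = 0.001                              # learning rate epsilon_j

for j in range(1000):
    grad = -2 * X.T @ (y - X @ theta)        # dQ/dtheta for squared error
    theta = theta - epsilon * grad           # the update rule above

print(theta)                                 # should be close to true_theta
```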