92 Binary Logistic Regression

Binary logistic regression is the workhorse of supervised classification. It models the probability that an observation belongs to one of two classes as a smooth function of a linear combination of features. Despite its simplicity, it remains a default baseline in industry and a conceptual cornerstone, because it cleanly connects linear models, probability theory, information theory, and convex optimization. This chapter develops the model from first principles, derives its training objective through maximum likelihood, shows that this objective coincides with cross entropy, and works through both gradient based and Newton (IRLS) optimization. We close with decision boundaries and the interpretation of fitted coefficients.

92.1 1. From Linear Scores to Probabilities

92.1.1 1.1 The classification setting

We observe a dataset of $n$ labeled examples $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, where $\mathbf{x}_i \in \mathbb{R}^{d}$ is a feature vector and $y_i \in \{0, 1\}$ is a binary label. Our goal is to learn a function that maps a feature vector to an estimate of $P(y = 1 \mid \mathbf{x})$, the conditional probability that the positive class occurs.

A naive approach would fit ordinary linear regression directly to the labels, predicting $\hat{y} = \mathbf{w}^{\top}\mathbf{x} + b$. This fails for two reasons. First, the linear score ranges over all of $\mathbb{R}$, so it cannot be read as a probability bounded in $[0, 1]$. Second, the squared error loss implied by linear regression is a poor match for a Bernoulli outcome, and it heavily penalizes confident correct predictions that lie far from the boundary. We need a model that squashes a real valued score into a valid probability.

92.1.2 1.2 The sigmoid function

The logistic sigmoid accomplishes exactly this squashing. Define the linear score, often called the logit or activation,

\[ z = \mathbf{w}^{\top}\mathbf{x} + b, \]

and pass it through

\[ \sigma(z) = \frac{1}{1 + e^{-z}}. \]

The sigmoid is a monotonically increasing function from $\mathbb{R}$ onto the open interval $(0, 1)$. It satisfies $\sigma(0) = 1/2$, $\sigma(z) \to 1$ as $z \to +\infty$, and $\sigma(z) \to 0$ as $z \to -\infty$. Two algebraic identities are used constantly:

\[ 1 - \sigma(z) = \sigma(-z), \qquad \sigma'(z) = \sigma(z)\bigl(1 - \sigma(z)\bigr). \]

The derivative identity is what makes logistic regression so clean to optimize. The model probability is then

\[ p(\mathbf{x}) = P(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^{\top}\mathbf{x} + b). \]

To lighten notation we absorb the bias into the weight vector by appending a constant feature of $1$ to every $\mathbf{x}$, so that $z = \mathbf{w}^{\top}\mathbf{x}$ with the convention that $w_0 = b$ and $x_0 = 1$. From here on $\mathbf{w} \in \mathbb{R}^{d+1}$.

Code

import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    # numerically stable form avoids overflow for large negative z
    return np.where(z >= 0, 1 / (1 + np.exp(-z)),
                    np.exp(z) / (1 + np.exp(z)))

z = np.linspace(-8, 8, 400)
fig, ax = plt.subplots(figsize=(6, 3.5))
ax.plot(z, sigmoid(z), color="C0", lw=2)
ax.axhline(0.5, color="gray", ls="--", lw=1)
ax.axvline(0.0, color="gray", ls="--", lw=1)
ax.set_xlabel("logit z")
ax.set_ylabel("sigmoid(z)")
ax.set_title("Logistic sigmoid squashes a real score into (0, 1)")
plt.show()

print("sigmoid(0) =", sigmoid(0.0))

sigmoid(0) = 0.5

92.2 2. The Log Odds Model

92.2.1 2.1 Odds and the logit link

The quantity $p / (1 - p)$ is the odds of the positive class. A probability of $0.8$ corresponds to odds of $4$, meaning the event is four times as likely to occur as not. Taking the logarithm gives the log odds, also called the logit:

\[ \operatorname{logit}(p) = \log \frac{p}{1 - p}. \]

The logit is the inverse of the sigmoid. Substituting $p = \sigma(z)$ and using $1 - \sigma(z) = \sigma(-z)$,

\[ \log \frac{\sigma(z)}{1 - \sigma(z)} = \log \frac{\sigma(z)}{\sigma(-z)} = \log e^{z} = z. \]

This is the defining structural assumption of logistic regression: the log odds of the positive class are a linear function of the features.

\[ \log \frac{P(y = 1 \mid \mathbf{x})}{P(y = 0 \mid \mathbf{x})} = \mathbf{w}^{\top}\mathbf{x}. \]

This is why logistic regression is a generalized linear model. The response is Bernoulli distributed, and the canonical link function connecting the mean of the response to the linear predictor is the logit. Linearity lives on the log odds scale, not on the probability scale, where the relationship is the characteristic S shaped curve.

92.2.2 2.2 Why the log odds scale is natural

Working on the log odds scale has three practical benefits. It maps the bounded interval $(0, 1)$ onto the entire real line, so the linear predictor never produces an invalid probability. It is symmetric: swapping the roles of the two classes flips the sign of the logit but leaves the structure intact. And it gives coefficients a clean multiplicative meaning in terms of odds ratios, which we develop in Section 7. The log odds representation also reveals that the decision boundary, where the two classes are equally probable, is exactly the set where $\mathbf{w}^{\top}\mathbf{x} = 0$, a hyperplane.

92.3 3. Maximum Likelihood Estimation

92.3.1 3.1 The Bernoulli likelihood

Logistic regression estimates $\mathbf{w}$ by maximum likelihood. Each label is modeled as a Bernoulli draw with success probability $p_i = \sigma(\mathbf{w}^{\top}\mathbf{x}_i)$. The probability mass function for a single observation, written compactly for both outcomes, is

\[ P(y_i \mid \mathbf{x}_i; \mathbf{w}) = p_i^{\,y_i} (1 - p_i)^{1 - y_i}. \]

Assuming the observations are independent given the features, the likelihood of the full dataset is the product

\[ L(\mathbf{w}) = \prod_{i=1}^{n} p_i^{\,y_i} (1 - p_i)^{1 - y_i}. \]

92.3.2 3.2 The log likelihood

Products of many small probabilities underflow numerically and are awkward to differentiate, so we maximize the log likelihood instead. Because logarithm is monotonic, the maximizer is unchanged.

\[ \ell(\mathbf{w}) = \sum_{i=1}^{n} \Bigl[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \Bigr]. \]

Maximizing $\ell(\mathbf{w})$ is equivalent to minimizing its negative. The negative log likelihood, averaged over the dataset, is the loss we actually minimize:

\[ J(\mathbf{w}) = -\frac{1}{n} \sum_{i=1}^{n} \Bigl[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \Bigr]. \]

92.4 4. Cross Entropy and the Connection to Information Theory

92.4.1 4.1 Cross entropy as the same objective

The loss $J(\mathbf{w})$ is precisely the binary cross entropy between the empirical label distribution and the model. For a single example, treat the true label as a degenerate distribution placing mass $y_i$ on class $1$ and $1 - y_i$ on class $0$, and the prediction as the distribution $(p_i, 1 - p_i)$. The cross entropy between these two distributions is

\[ H(y_i, p_i) = -\bigl[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \bigr], \]

which is exactly the per example term in $J$. So maximum likelihood for the Bernoulli model and cross entropy minimization are two names for the same thing. This is no coincidence. Minimizing cross entropy is equivalent to minimizing the Kullback Leibler divergence from the model to the data generating distribution, and the maximum likelihood estimator is the one whose model distribution is closest to the empirical distribution in KL divergence.

92.4.2 4.2 Why not squared error

One might ask why we do not minimize $\sum_i (y_i - p_i)^2$ instead. The answer is twofold. First, with $p_i = \sigma(z_i)$, the squared error is non convex in $\mathbf{w}$, so gradient descent can stall in poor local minima. Cross entropy composed with the sigmoid is convex, as we show next. Second, squared error produces vanishing gradients when a prediction is confidently wrong, because the sigmoid saturates and its derivative tends to zero, whereas the cross entropy gradient remains well behaved. These two properties, convexity and healthy gradients, are why cross entropy is the standard objective for classification.

92.4.3 4.3 Convexity

The objective $J(\mathbf{w})$ is convex in $\mathbf{w}$. Each term is a composition of the convex softplus function $\log(1 + e^{z})$ with an affine map, since after simplification the per example loss for label $y_i$ can be written $\log(1 + e^{z_i}) - y_i z_i$ with $z_i = \mathbf{w}^{\top}\mathbf{x}_i$. The Hessian, derived in Section 6, is positive semidefinite everywhere, which confirms convexity. Convexity guarantees that any local minimum is global, so first and second order methods both converge to the same optimum, assuming it exists.

92.5 5. Gradient Based Optimization

92.5.1 5.1 The gradient

To minimize $J(\mathbf{w})$ we need its gradient. Start from a single example and use the chain rule together with the sigmoid derivative identity. With $p_i = \sigma(z_i)$ and $z_i = \mathbf{w}^{\top}\mathbf{x}_i$, the derivative of the loss with respect to the logit collapses to a strikingly simple form:

\[ \frac{\partial}{\partial z_i} \Bigl[ -y_i \log p_i - (1 - y_i)\log(1 - p_i) \Bigr] = p_i - y_i. \]

The intermediate $\sigma'$ terms cancel exactly. The residual $p_i - y_i$ is the difference between the predicted probability and the observed label. By the chain rule, $\partial z_i / \partial \mathbf{w} = \mathbf{x}_i$, so the full gradient of the averaged loss is

\[ \nabla_{\mathbf{w}} J = \frac{1}{n} \sum_{i=1}^{n} (p_i - y_i)\,\mathbf{x}_i = \frac{1}{n} \mathbf{X}^{\top}(\mathbf{p} - \mathbf{y}), \]

where $\mathbf{X}$ is the $n \times (d+1)$ design matrix whose rows are the $\mathbf{x}_i^{\top}$, and $\mathbf{p}$, $\mathbf{y}$ are the stacked vectors of predictions and labels. This is the same gradient form as linear regression, with the residual now defined through the sigmoid.

92.5.2 5.2 Gradient descent

Gradient descent updates the weights against the gradient with a step size, or learning rate, $\eta > 0$:

\[ \mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \eta \, \nabla_{\mathbf{w}} J\bigl(\mathbf{w}^{(t)}\bigr). \]

Because $J$ is convex with Lipschitz continuous gradient, gradient descent converges to the global minimum for a sufficiently small fixed step size. In practice the features should be standardized so that the curvature is similar in all directions, which lets a single learning rate work well. Stochastic and mini batch variants replace the full sum with an estimate from a subset of examples, trading gradient accuracy for cheaper and more frequent updates, which is essential at scale.

Code

# Two Gaussian blobs, one per class, with a bias column appended to X.
rng = np.random.default_rng(0)
n = 200
X0 = rng.normal([-1.5, -1.0], 1.0, size=(n // 2, 2))
X1 = rng.normal([1.5, 1.0], 1.0, size=(n // 2, 2))
X = np.vstack([X0, X1])
y = np.concatenate([np.zeros(n // 2), np.ones(n // 2)])
Xb = np.hstack([np.ones((len(X), 1)), X])  # column of ones is the bias feature

def train_gd(X, y, eta=0.1, steps=500):
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(steps):
        p = sigmoid(X @ w)
        grad = X.T @ (p - y) / n
        w -= eta * grad
    return w

w_gd = train_gd(Xb, y)
acc_gd = ((sigmoid(Xb @ w_gd) >= 0.5) == y).mean()
print("gradient descent weights [bias, w1, w2]:", np.round(w_gd, 3))
print("training accuracy:", acc_gd)

gradient descent weights [bias, w1, w2]: [0.058 2.38  1.363]
training accuracy: 0.96

92.5.3 5.3 Regularization

When features are numerous or collinear, or when the classes are linearly separable, the unregularized maximum likelihood weights can grow without bound, because pushing the logits toward $\pm\infty$ drives the loss arbitrarily low. Adding an L2 penalty $\tfrac{\lambda}{2}\lVert \mathbf{w} \rVert^2$ to the objective keeps the weights finite, strictly convexifies the problem so the solution is unique, and corresponds to a Gaussian prior on the weights in a Bayesian reading. The penalized gradient simply adds $\lambda \mathbf{w}$, usually excluding the bias term from the penalty.

92.6 6. Newton’s Method and Iteratively Reweighted Least Squares

92.6.1 6.1 The Hessian

Second order methods use curvature to take better steps. Differentiating the gradient again gives the Hessian. Define the diagonal weight matrix $\mathbf{S}$ with entries $s_{ii} = p_i(1 - p_i)$, the variance of each Bernoulli prediction. Then

\[ \mathbf{H} = \nabla^2_{\mathbf{w}} J = \frac{1}{n} \sum_{i=1}^{n} p_i (1 - p_i)\, \mathbf{x}_i \mathbf{x}_i^{\top} = \frac{1}{n}\mathbf{X}^{\top}\mathbf{S}\,\mathbf{X}. \]

Because each $s_{ii} \ge 0$, the matrix $\mathbf{X}^{\top}\mathbf{S}\mathbf{X}$ is positive semidefinite, which proves the convexity claimed earlier. It is positive definite, and the minimizer unique, when $\mathbf{X}$ has full column rank and no prediction is exactly $0$ or $1$.

92.6.2 6.2 The Newton update

Newton’s method minimizes the local quadratic approximation of $J$ at each step, giving the update

\[ \mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \mathbf{H}^{-1} \nabla_{\mathbf{w}} J. \]

Near the optimum Newton’s method converges quadratically, often reaching high accuracy in well under ten iterations, far fewer than gradient descent needs. The cost per iteration is higher because it forms and inverts the $(d+1) \times (d+1)$ Hessian, an $O(d^3)$ operation, which makes it attractive for moderate dimensional problems and less so for very high dimensional ones.

92.6.3 6.3 IRLS as weighted least squares

The Newton update for logistic regression has an elegant interpretation as iteratively reweighted least squares. Substituting the gradient and Hessian and rearranging, each Newton step is the solution of a weighted least squares problem,

\[ \mathbf{w}^{(t+1)} = \bigl(\mathbf{X}^{\top}\mathbf{S}\mathbf{X}\bigr)^{-1}\mathbf{X}^{\top}\mathbf{S}\,\mathbf{r}, \]

where the working response, or adjusted dependent variable, is

\[ \mathbf{r} = \mathbf{X}\mathbf{w}^{(t)} + \mathbf{S}^{-1}(\mathbf{y} - \mathbf{p}). \]

Each iteration thus fits a weighted linear regression of the working response $\mathbf{r}$ on the features, with weights $s_{ii} = p_i(1 - p_i)$, then recomputes the predictions and weights and repeats. Observations near the decision boundary, where $p_i \approx 1/2$, carry the largest weight, while confident predictions near $0$ or $1$ contribute little, since the model has already explained them. This is the algorithm used by classical statistical software to fit generalized linear models.

Code

def train_irls(X, y, iters=10):
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = sigmoid(X @ w)
        s = p * (1 - p)
        # working response and weighted normal equations
        z = X @ w + (y - p) / np.clip(s, 1e-8, None)
        H = X.T @ (s[:, None] * X)
        w = np.linalg.solve(H, X.T @ (s * z))
    return w

w_irls = train_irls(Xb, y)
print("IRLS weights [bias, w1, w2]:", np.round(w_irls, 3))

# The decision boundary is the line w0 + w1*x1 + w2*x2 = 0.
fig, ax = plt.subplots(figsize=(6, 5))
ax.scatter(X[y == 0, 0], X[y == 0, 1], s=15, color="C0", label="class 0")
ax.scatter(X[y == 1, 0], X[y == 1, 1], s=15, color="C3", label="class 1")
xx = np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200)
yy = -(w_irls[0] + w_irls[1] * xx) / w_irls[2]
ax.plot(xx, yy, "k-", lw=2, label="decision boundary")
ax.set_xlabel("x1")
ax.set_ylabel("x2")
ax.set_title("Linear decision boundary fit by IRLS")
ax.legend()
plt.show()

IRLS weights [bias, w1, w2]: [0.032 2.922 1.707]

92.7 7. Decision Boundaries and Interpretation

92.7.1 7.1 The decision boundary

To turn probabilities into class predictions we apply a threshold $\tau$, predicting class $1$ when $p(\mathbf{x}) \ge \tau$. The default threshold $\tau = 1/2$ corresponds to predicting the more probable class. Since $\sigma$ is monotone and $\sigma(0) = 1/2$, the threshold of $1/2$ on probability is equivalent to a threshold of $0$ on the logit, so the decision boundary is the hyperplane

\[ \mathbf{w}^{\top}\mathbf{x} = 0. \]

This is a linear boundary in feature space. Logistic regression is therefore a linear classifier: it can only separate classes with a flat boundary in the given features. The orientation of the boundary is set by the weight vector, which is normal to the hyperplane, and the signed distance of a point from the boundary is proportional to its logit, so points far from the boundary receive confident probabilities near $0$ or $1$. Nonlinear boundaries are obtained by engineering nonlinear features, such as polynomial or interaction terms, after which the model is still linear in the expanded feature space.

92.7.2 7.2 Choosing the threshold

The threshold need not be $1/2$. When classes are imbalanced or when the costs of false positives and false negatives differ, the operating threshold should be chosen to optimize the relevant criterion, such as expected cost, the F1 score, or a target on precision or recall. Crucially, the fitted probabilities themselves do not change with the threshold; only the classification rule applied to them does. Tools such as the receiver operating characteristic curve and the precision recall curve summarize performance across all thresholds at once.

92.7.3 7.3 Interpreting coefficients

The coefficients carry a precise meaning on the odds scale. Holding all other features fixed, increasing feature $j$ by one unit adds $w_j$ to the log odds, which multiplies the odds by $e^{w_j}$. The quantity $e^{w_j}$ is the odds ratio associated with that feature. A coefficient of $0.7$, for instance, means each unit increase in the feature multiplies the odds of the positive class by about $e^{0.7} \approx 2.0$, doubling them. A coefficient of $0$ means the feature leaves the odds unchanged. The sign tells the direction of the effect, and the magnitude, once features are standardized, indicates relative importance.

Two cautions apply. The effect is constant on the log odds scale but not on the probability scale, where the same coefficient produces a larger change near $p = 1/2$ and a smaller change in the saturated tails. And these are associations within the model, not causal effects, and they depend on which other features are included, since logistic coefficients are adjusted for the other covariates in the model.

92.7.4 7.4 Calibration

A valuable property of logistic regression trained by maximum likelihood is that its predicted probabilities tend to be well calibrated: among examples assigned probability around $0.3$, roughly $30$ percent are truly positive. This follows from the fact that minimizing cross entropy is a proper scoring rule, rewarding honest probabilities rather than merely correct rankings. Calibration makes the outputs directly usable in downstream decision making, expected value calculations, and risk estimates, which is one reason logistic regression endures even where more flexible classifiers are available.

92.8 8. Summary

Binary logistic regression assumes the log odds of the positive class are linear in the features, which is equivalent to passing a linear score through the sigmoid to obtain a probability. The model is fit by maximum likelihood, whose negative log likelihood is exactly the binary cross entropy, a convex objective with a clean gradient equal to the feature weighted residual $p - y$. Optimization can proceed by gradient descent, which is cheap and scalable, or by Newton’s method, which appears as iteratively reweighted least squares and converges in a handful of iterations for moderate dimensions. The resulting classifier has a linear decision boundary, coefficients interpretable as odds ratios, and probability outputs that are typically well calibrated. These properties make it the canonical baseline for binary classification and the building block from which softmax regression, generalized linear models, and the output layers of neural networks are constructed.

92.9 References

Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning, 2nd ed. Springer, 2009. https://hastie.su.domains/ElemStatLearn/
Bishop, C. M. Pattern Recognition and Machine Learning. Springer, 2006. https://www.microsoft.com/en-us/research/publication/pattern-recognition-machine-learning/
McCullagh, P., and Nelder, J. A. Generalized Linear Models, 2nd ed. Chapman and Hall, 1989. https://doi.org/10.1201/9780203753736
Murphy, K. P. Probabilistic Machine Learning: An Introduction. MIT Press, 2022. https://probml.github.io/pml-book/book1.html
Ng, A. CS229 Lecture Notes: Supervised Learning and Logistic Regression. Stanford University. https://cs229.stanford.edu/main_notes.pdf
Boyd, S., and Vandenberghe, L. Convex Optimization. Cambridge University Press, 2004. https://web.stanford.edu/~boyd/cvxbook/
Pedregosa, F., et al. Scikit learn: Logistic Regression User Guide. https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

# Binary Logistic Regression Binary logistic regression is the workhorse of supervised classification. It models the probability that an observation belongs to one of two classes as a smooth function of a linear combination of features. Despite its simplicity, it remains a default baseline in industry and a conceptual cornerstone, because it cleanly connects linear models, probability theory, information theory, and convex optimization. This chapter develops the model from first principles, derives its training objective through maximum likelihood, shows that this objective coincides with cross entropy, and works through both gradient based and Newton (IRLS) optimization. We close with decision boundaries and the interpretation of fitted coefficients. ## 1. From Linear Scores to Probabilities ### 1.1 The classification setting We observe a dataset of $n$ labeled examples $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, where $\mathbf{x}_i \in \mathbb{R}^{d}$ is a feature vector and $y_i \in \{0, 1\}$ is a binary label. Our goal is to learn a function that maps a feature vector to an estimate of $P(y = 1 \mid \mathbf{x})$, the conditional probability that the positive class occurs. A naive approach would fit ordinary linear regression directly to the labels, predicting $\hat{y} = \mathbf{w}^{\top}\mathbf{x} + b$. This fails for two reasons. First, the linear score ranges over all of $\mathbb{R}$, so it cannot be read as a probability bounded in $[0, 1]$. Second, the squared error loss implied by linear regression is a poor match for a Bernoulli outcome, and it heavily penalizes confident correct predictions that lie far from the boundary. We need a model that squashes a real valued score into a valid probability. ### 1.2 The sigmoid function The logistic sigmoid accomplishes exactly this squashing. Define the linear score, often called the logit or activation, $$ z = \mathbf{w}^{\top}\mathbf{x} + b, $$ and pass it through $$ \sigma(z) = \frac{1}{1 + e^{-z}}. $$ The sigmoid is a monotonically increasing function from $\mathbb{R}$ onto the open interval $(0, 1)$. It satisfies $\sigma(0) = 1/2$, $\sigma(z) \to 1$ as $z \to +\infty$, and $\sigma(z) \to 0$ as $z \to -\infty$. Two algebraic identities are used constantly: $$ 1 - \sigma(z) = \sigma(-z), \qquad \sigma'(z) = \sigma(z)\bigl(1 - \sigma(z)\bigr). $$ The derivative identity is what makes logistic regression so clean to optimize. The model probability is then $$ p(\mathbf{x}) = P(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^{\top}\mathbf{x} + b). $$ To lighten notation we absorb the bias into the weight vector by appending a constant feature of $1$ to every $\mathbf{x}$, so that $z = \mathbf{w}^{\top}\mathbf{x}$ with the convention that $w_0 = b$ and $x_0 = 1$. From here on $\mathbf{w} \in \mathbb{R}^{d+1}$. ```{python} import numpy as np import matplotlib.pyplot as plt def sigmoid(z): # numerically stable form avoids overflow for large negative z return np.where(z >= 0, 1 / (1 + np.exp(-z)), np.exp(z) / (1 + np.exp(z))) z = np.linspace(-8, 8, 400) fig, ax = plt.subplots(figsize=(6, 3.5)) ax.plot(z, sigmoid(z), color="C0", lw=2) ax.axhline(0.5, color="gray", ls="--", lw=1) ax.axvline(0.0, color="gray", ls="--", lw=1) ax.set_xlabel("logit z") ax.set_ylabel("sigmoid(z)") ax.set_title("Logistic sigmoid squashes a real score into (0, 1)") plt.show() print("sigmoid(0) =", sigmoid(0.0)) ``` ## 2. The Log Odds Model ### 2.1 Odds and the logit link The quantity $p / (1 - p)$ is the odds of the positive class. A probability of $0.8$ corresponds to odds of $4$, meaning the event is four times as likely to occur as not. Taking the logarithm gives the log odds, also called the logit: $$ \operatorname{logit}(p) = \log \frac{p}{1 - p}. $$ The logit is the inverse of the sigmoid. Substituting $p = \sigma(z)$ and using $1 - \sigma(z) = \sigma(-z)$, $$ \log \frac{\sigma(z)}{1 - \sigma(z)} = \log \frac{\sigma(z)}{\sigma(-z)} = \log e^{z} = z. $$ This is the defining structural assumption of logistic regression: the log odds of the positive class are a linear function of the features. $$ \log \frac{P(y = 1 \mid \mathbf{x})}{P(y = 0 \mid \mathbf{x})} = \mathbf{w}^{\top}\mathbf{x}. $$ This is why logistic regression is a generalized linear model. The response is Bernoulli distributed, and the canonical link function connecting the mean of the response to the linear predictor is the logit. Linearity lives on the log odds scale, not on the probability scale, where the relationship is the characteristic S shaped curve. ### 2.2 Why the log odds scale is natural Working on the log odds scale has three practical benefits. It maps the bounded interval $(0, 1)$ onto the entire real line, so the linear predictor never produces an invalid probability. It is symmetric: swapping the roles of the two classes flips the sign of the logit but leaves the structure intact. And it gives coefficients a clean multiplicative meaning in terms of odds ratios, which we develop in Section 7. The log odds representation also reveals that the decision boundary, where the two classes are equally probable, is exactly the set where $\mathbf{w}^{\top}\mathbf{x} = 0$, a hyperplane. ## 3. Maximum Likelihood Estimation ### 3.1 The Bernoulli likelihood Logistic regression estimates $\mathbf{w}$ by maximum likelihood. Each label is modeled as a Bernoulli draw with success probability $p_i = \sigma(\mathbf{w}^{\top}\mathbf{x}_i)$. The probability mass function for a single observation, written compactly for both outcomes, is $$ P(y_i \mid \mathbf{x}_i; \mathbf{w}) = p_i^{\,y_i} (1 - p_i)^{1 - y_i}. $$ Assuming the observations are independent given the features, the likelihood of the full dataset is the product $$ L(\mathbf{w}) = \prod_{i=1}^{n} p_i^{\,y_i} (1 - p_i)^{1 - y_i}. $$ ### 3.2 The log likelihood Products of many small probabilities underflow numerically and are awkward to differentiate, so we maximize the log likelihood instead. Because logarithm is monotonic, the maximizer is unchanged. $$ \ell(\mathbf{w}) = \sum_{i=1}^{n} \Bigl[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \Bigr]. $$ Maximizing $\ell(\mathbf{w})$ is equivalent to minimizing its negative. The negative log likelihood, averaged over the dataset, is the loss we actually minimize: $$ J(\mathbf{w}) = -\frac{1}{n} \sum_{i=1}^{n} \Bigl[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \Bigr]. $$ ## 4. Cross Entropy and the Connection to Information Theory ### 4.1 Cross entropy as the same objective The loss $J(\mathbf{w})$ is precisely the binary cross entropy between the empirical label distribution and the model. For a single example, treat the true label as a degenerate distribution placing mass $y_i$ on class $1$ and $1 - y_i$ on class $0$, and the prediction as the distribution $(p_i, 1 - p_i)$. The cross entropy between these two distributions is $$ H(y_i, p_i) = -\bigl[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \bigr], $$ which is exactly the per example term in $J$. So maximum likelihood for the Bernoulli model and cross entropy minimization are two names for the same thing. This is no coincidence. Minimizing cross entropy is equivalent to minimizing the Kullback Leibler divergence from the model to the data generating distribution, and the maximum likelihood estimator is the one whose model distribution is closest to the empirical distribution in KL divergence. ### 4.2 Why not squared error One might ask why we do not minimize $\sum_i (y_i - p_i)^2$ instead. The answer is twofold. First, with $p_i = \sigma(z_i)$, the squared error is non convex in $\mathbf{w}$, so gradient descent can stall in poor local minima. Cross entropy composed with the sigmoid is convex, as we show next. Second, squared error produces vanishing gradients when a prediction is confidently wrong, because the sigmoid saturates and its derivative tends to zero, whereas the cross entropy gradient remains well behaved. These two properties, convexity and healthy gradients, are why cross entropy is the standard objective for classification. ### 4.3 Convexity The objective $J(\mathbf{w})$ is convex in $\mathbf{w}$. Each term is a composition of the convex softplus function $\log(1 + e^{z})$ with an affine map, since after simplification the per example loss for label $y_i$ can be written $\log(1 + e^{z_i}) - y_i z_i$ with $z_i = \mathbf{w}^{\top}\mathbf{x}_i$. The Hessian, derived in Section 6, is positive semidefinite everywhere, which confirms convexity. Convexity guarantees that any local minimum is global, so first and second order methods both converge to the same optimum, assuming it exists. ## 5. Gradient Based Optimization ### 5.1 The gradient To minimize $J(\mathbf{w})$ we need its gradient. Start from a single example and use the chain rule together with the sigmoid derivative identity. With $p_i = \sigma(z_i)$ and $z_i = \mathbf{w}^{\top}\mathbf{x}_i$, the derivative of the loss with respect to the logit collapses to a strikingly simple form: $$ \frac{\partial}{\partial z_i} \Bigl[ -y_i \log p_i - (1 - y_i)\log(1 - p_i) \Bigr] = p_i - y_i. $$ The intermediate $\sigma'$ terms cancel exactly. The residual $p_i - y_i$ is the difference between the predicted probability and the observed label. By the chain rule, $\partial z_i / \partial \mathbf{w} = \mathbf{x}_i$, so the full gradient of the averaged loss is $$ \nabla_{\mathbf{w}} J = \frac{1}{n} \sum_{i=1}^{n} (p_i - y_i)\,\mathbf{x}_i = \frac{1}{n} \mathbf{X}^{\top}(\mathbf{p} - \mathbf{y}), $$ where $\mathbf{X}$ is the $n \times (d+1)$ design matrix whose rows are the $\mathbf{x}_i^{\top}$, and $\mathbf{p}$, $\mathbf{y}$ are the stacked vectors of predictions and labels. This is the same gradient form as linear regression, with the residual now defined through the sigmoid. ### 5.2 Gradient descent Gradient descent updates the weights against the gradient with a step size, or learning rate, $\eta > 0$: $$ \mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \eta \, \nabla_{\mathbf{w}} J\bigl(\mathbf{w}^{(t)}\bigr). $$ Because $J$ is convex with Lipschitz continuous gradient, gradient descent converges to the global minimum for a sufficiently small fixed step size. In practice the features should be standardized so that the curvature is similar in all directions, which lets a single learning rate work well. Stochastic and mini batch variants replace the full sum with an estimate from a subset of examples, trading gradient accuracy for cheaper and more frequent updates, which is essential at scale. ```{python} # Two Gaussian blobs, one per class, with a bias column appended to X. rng = np.random.default_rng(0) n = 200 X0 = rng.normal([-1.5, -1.0], 1.0, size=(n // 2, 2)) X1 = rng.normal([1.5, 1.0], 1.0, size=(n // 2, 2)) X = np.vstack([X0, X1]) y = np.concatenate([np.zeros(n // 2), np.ones(n // 2)]) Xb = np.hstack([np.ones((len(X), 1)), X]) # column of ones is the bias feature def train_gd(X, y, eta=0.1, steps=500): w = np.zeros(X.shape[1]) n = len(y) for _ in range(steps): p = sigmoid(X @ w) grad = X.T @ (p - y) / n w -= eta * grad return w w_gd = train_gd(Xb, y) acc_gd = ((sigmoid(Xb @ w_gd) >= 0.5) == y).mean() print("gradient descent weights [bias, w1, w2]:", np.round(w_gd, 3)) print("training accuracy:", acc_gd) ``` ### 5.3 Regularization When features are numerous or collinear, or when the classes are linearly separable, the unregularized maximum likelihood weights can grow without bound, because pushing the logits toward $\pm\infty$ drives the loss arbitrarily low. Adding an L2 penalty $\tfrac{\lambda}{2}\lVert \mathbf{w} \rVert^2$ to the objective keeps the weights finite, strictly convexifies the problem so the solution is unique, and corresponds to a Gaussian prior on the weights in a Bayesian reading. The penalized gradient simply adds $\lambda \mathbf{w}$, usually excluding the bias term from the penalty. ## 6. Newton's Method and Iteratively Reweighted Least Squares ### 6.1 The Hessian Second order methods use curvature to take better steps. Differentiating the gradient again gives the Hessian. Define the diagonal weight matrix $\mathbf{S}$ with entries $s_{ii} = p_i(1 - p_i)$, the variance of each Bernoulli prediction. Then $$ \mathbf{H} = \nabla^2_{\mathbf{w}} J = \frac{1}{n} \sum_{i=1}^{n} p_i (1 - p_i)\, \mathbf{x}_i \mathbf{x}_i^{\top} = \frac{1}{n}\mathbf{X}^{\top}\mathbf{S}\,\mathbf{X}. $$ Because each $s_{ii} \ge 0$, the matrix $\mathbf{X}^{\top}\mathbf{S}\mathbf{X}$ is positive semidefinite, which proves the convexity claimed earlier. It is positive definite, and the minimizer unique, when $\mathbf{X}$ has full column rank and no prediction is exactly $0$ or $1$. ### 6.2 The Newton update Newton's method minimizes the local quadratic approximation of $J$ at each step, giving the update $$ \mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \mathbf{H}^{-1} \nabla_{\mathbf{w}} J. $$ Near the optimum Newton's method converges quadratically, often reaching high accuracy in well under ten iterations, far fewer than gradient descent needs. The cost per iteration is higher because it forms and inverts the $(d+1) \times (d+1)$ Hessian, an $O(d^3)$ operation, which makes it attractive for moderate dimensional problems and less so for very high dimensional ones. ### 6.3 IRLS as weighted least squares The Newton update for logistic regression has an elegant interpretation as iteratively reweighted least squares. Substituting the gradient and Hessian and rearranging, each Newton step is the solution of a weighted least squares problem, $$ \mathbf{w}^{(t+1)} = \bigl(\mathbf{X}^{\top}\mathbf{S}\mathbf{X}\bigr)^{-1}\mathbf{X}^{\top}\mathbf{S}\,\mathbf{r}, $$ where the working response, or adjusted dependent variable, is $$ \mathbf{r} = \mathbf{X}\mathbf{w}^{(t)} + \mathbf{S}^{-1}(\mathbf{y} - \mathbf{p}). $$ Each iteration thus fits a weighted linear regression of the working response $\mathbf{r}$ on the features, with weights $s_{ii} = p_i(1 - p_i)$, then recomputes the predictions and weights and repeats. Observations near the decision boundary, where $p_i \approx 1/2$, carry the largest weight, while confident predictions near $0$ or $1$ contribute little, since the model has already explained them. This is the algorithm used by classical statistical software to fit generalized linear models. ```{python} def train_irls(X, y, iters=10): w = np.zeros(X.shape[1]) for _ in range(iters): p = sigmoid(X @ w) s = p * (1 - p) # working response and weighted normal equations z = X @ w + (y - p) / np.clip(s, 1e-8, None) H = X.T @ (s[:, None] * X) w = np.linalg.solve(H, X.T @ (s * z)) return w w_irls = train_irls(Xb, y) print("IRLS weights [bias, w1, w2]:", np.round(w_irls, 3)) # The decision boundary is the line w0 + w1*x1 + w2*x2 = 0. fig, ax = plt.subplots(figsize=(6, 5)) ax.scatter(X[y == 0, 0], X[y == 0, 1], s=15, color="C0", label="class 0") ax.scatter(X[y == 1, 0], X[y == 1, 1], s=15, color="C3", label="class 1") xx = np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200) yy = -(w_irls[0] + w_irls[1] * xx) / w_irls[2] ax.plot(xx, yy, "k-", lw=2, label="decision boundary") ax.set_xlabel("x1") ax.set_ylabel("x2") ax.set_title("Linear decision boundary fit by IRLS") ax.legend() plt.show() ``` ## 7. Decision Boundaries and Interpretation ### 7.1 The decision boundary To turn probabilities into class predictions we apply a threshold $\tau$, predicting class $1$ when $p(\mathbf{x}) \ge \tau$. The default threshold $\tau = 1/2$ corresponds to predicting the more probable class. Since $\sigma$ is monotone and $\sigma(0) = 1/2$, the threshold of $1/2$ on probability is equivalent to a threshold of $0$ on the logit, so the decision boundary is the hyperplane $$ \mathbf{w}^{\top}\mathbf{x} = 0. $$ This is a linear boundary in feature space. Logistic regression is therefore a linear classifier: it can only separate classes with a flat boundary in the given features. The orientation of the boundary is set by the weight vector, which is normal to the hyperplane, and the signed distance of a point from the boundary is proportional to its logit, so points far from the boundary receive confident probabilities near $0$ or $1$. Nonlinear boundaries are obtained by engineering nonlinear features, such as polynomial or interaction terms, after which the model is still linear in the expanded feature space. ### 7.2 Choosing the threshold The threshold need not be $1/2$. When classes are imbalanced or when the costs of false positives and false negatives differ, the operating threshold should be chosen to optimize the relevant criterion, such as expected cost, the F1 score, or a target on precision or recall. Crucially, the fitted probabilities themselves do not change with the threshold; only the classification rule applied to them does. Tools such as the receiver operating characteristic curve and the precision recall curve summarize performance across all thresholds at once. ### 7.3 Interpreting coefficients The coefficients carry a precise meaning on the odds scale. Holding all other features fixed, increasing feature $j$ by one unit adds $w_j$ to the log odds, which multiplies the odds by $e^{w_j}$. The quantity $e^{w_j}$ is the odds ratio associated with that feature. A coefficient of $0.7$, for instance, means each unit increase in the feature multiplies the odds of the positive class by about $e^{0.7} \approx 2.0$, doubling them. A coefficient of $0$ means the feature leaves the odds unchanged. The sign tells the direction of the effect, and the magnitude, once features are standardized, indicates relative importance. Two cautions apply. The effect is constant on the log odds scale but not on the probability scale, where the same coefficient produces a larger change near $p = 1/2$ and a smaller change in the saturated tails. And these are associations within the model, not causal effects, and they depend on which other features are included, since logistic coefficients are adjusted for the other covariates in the model. ### 7.4 Calibration A valuable property of logistic regression trained by maximum likelihood is that its predicted probabilities tend to be well calibrated: among examples assigned probability around $0.3$, roughly $30$ percent are truly positive. This follows from the fact that minimizing cross entropy is a proper scoring rule, rewarding honest probabilities rather than merely correct rankings. Calibration makes the outputs directly usable in downstream decision making, expected value calculations, and risk estimates, which is one reason logistic regression endures even where more flexible classifiers are available. ## 8. Summary Binary logistic regression assumes the log odds of the positive class are linear in the features, which is equivalent to passing a linear score through the sigmoid to obtain a probability. The model is fit by maximum likelihood, whose negative log likelihood is exactly the binary cross entropy, a convex objective with a clean gradient equal to the feature weighted residual $p - y$. Optimization can proceed by gradient descent, which is cheap and scalable, or by Newton's method, which appears as iteratively reweighted least squares and converges in a handful of iterations for moderate dimensions. The resulting classifier has a linear decision boundary, coefficients interpretable as odds ratios, and probability outputs that are typically well calibrated. These properties make it the canonical baseline for binary classification and the building block from which softmax regression, generalized linear models, and the output layers of neural networks are constructed. ## References 1. Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning, 2nd ed. Springer, 2009. https://hastie.su.domains/ElemStatLearn/ 2. Bishop, C. M. Pattern Recognition and Machine Learning. Springer, 2006. https://www.microsoft.com/en-us/research/publication/pattern-recognition-machine-learning/ 3. McCullagh, P., and Nelder, J. A. Generalized Linear Models, 2nd ed. Chapman and Hall, 1989. https://doi.org/10.1201/9780203753736 4. Murphy, K. P. Probabilistic Machine Learning: An Introduction. MIT Press, 2022. https://probml.github.io/pml-book/book1.html 5. Ng, A. CS229 Lecture Notes: Supervised Learning and Logistic Regression. Stanford University. https://cs229.stanford.edu/main_notes.pdf 6. Boyd, S., and Vandenberghe, L. Convex Optimization. Cambridge University Press, 2004. https://web.stanford.edu/~boyd/cvxbook/ 7. Pedregosa, F., et al. Scikit learn: Logistic Regression User Guide. https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression