143 Independent Component Analysis

Independent Component Analysis (ICA) is a computational framework for recovering a set of underlying source signals from observed mixtures of those signals, given only the mixtures and a few weak statistical assumptions. It is the canonical solution to the blind source separation problem, where the word blind signals that we know almost nothing about either the sources or the mixing process. This chapter develops the theory rigorously, beginning with the generative model, then clarifying the crucial distinction between statistical independence and mere uncorrelatedness, establishing why non-Gaussianity is the engine that makes separation possible, and finally deriving the FastICA algorithm that has become the practical workhorse of the field.

143.2 2. Statistical Independence Versus Uncorrelatedness

143.2.1 2.1 Two Distinct Notions

The heart of ICA is a careful distinction between two conditions that are often conflated. Two random variables $y_1$ and $y_2$ are uncorrelated if their covariance vanishes:

\[ \mathbb{E}[y_1 y_2] - \mathbb{E}[y_1]\,\mathbb{E}[y_2] = 0 . \]

They are statistically independent if their joint density factorizes into the product of the marginals:

\[ p(y_1, y_2) = p_1(y_1)\, p_2(y_2) . \]

Independence is strictly stronger. It implies that for any pair of measurable functions $g$ and $h$,

\[ \mathbb{E}[g(y_1)\, h(y_2)] = \mathbb{E}[g(y_1)]\,\mathbb{E}[h(y_2)] . \]

Uncorrelatedness is the special case where $g$ and $h$ are both the identity. Independence constrains all higher order moments and not merely the second order cross moment. Independent variables are always uncorrelated, but uncorrelated variables can be strongly dependent through their higher order statistics.

143.2.2 2.2 Why the Distinction Matters

Methods such as Principal Component Analysis (PCA) and whitening only impose uncorrelatedness. They diagonalize the covariance matrix and thereby remove all second order dependence, but they leave higher order structure untouched. Because second order statistics are insufficient to pin down the mixing matrix, decorrelation alone cannot solve blind source separation. There are infinitely many uncorrelated representations of a given dataset, related to one another by orthogonal rotations, and only special rotations yield genuinely independent components. ICA must therefore exploit information beyond the covariance, which is precisely the higher order information encoded in non-Gaussianity.

143.2.3 2.3 Whitening as Preprocessing

Although decorrelation is not sufficient, it is an extremely useful first step. Whitening transforms the observed data so that its components are uncorrelated and have unit variance, giving an identity covariance matrix. If $\mathbf{x}$ has covariance $\mathbf{C} = \mathbb{E}[\mathbf{x}\mathbf{x}^\top] = \mathbf{E}\mathbf{D}\mathbf{E}^\top$ in its eigendecomposition, the whitening transform is

\[ \mathbf{z} = \mathbf{D}^{-1/2}\mathbf{E}^\top \mathbf{x}, \qquad \mathbb{E}[\mathbf{z}\mathbf{z}^\top] = \mathbf{I}. \]

After whitening, the remaining mixing matrix is orthogonal. The proof is short: if $\mathbf{z} = \tilde{\mathbf{A}}\mathbf{s}$ with unit variance independent sources, then $\mathbf{I} = \mathbb{E}[\mathbf{z}\mathbf{z}^\top] = \tilde{\mathbf{A}}\,\mathbb{E}[\mathbf{s}\mathbf{s}^\top]\,\tilde{\mathbf{A}}^\top = \tilde{\mathbf{A}}\tilde{\mathbf{A}}^\top$, so $\tilde{\mathbf{A}}$ is orthogonal. Whitening thus reduces the number of free parameters from $n^2$ to the $n(n-1)/2$ that parameterize an orthogonal matrix, which substantially simplifies the search.

143.3 3. Non-Gaussianity as the Key to Separation

143.3.1 3.1 The Central Limit Theorem Intuition

Why should non-Gaussianity matter? Consider a single estimated component $y = \mathbf{w}^\top \mathbf{x} = \mathbf{w}^\top \mathbf{A}\mathbf{s} = \mathbf{q}^\top \mathbf{s}$, where $\mathbf{q} = \mathbf{A}^\top \mathbf{w}$. The estimate $y$ is a linear combination of the independent sources. The Central Limit Theorem says that a sum of independent random variables tends toward a Gaussian distribution as the number of terms grows. Consequently, $\mathbf{q}^\top \mathbf{s}$ is generally more Gaussian than any single source $s_j$, and it is least Gaussian precisely when it equals one of the sources, meaning $\mathbf{q}$ has a single nonzero entry. This suggests a striking principle: maximizing the non-Gaussianity of $\mathbf{w}^\top \mathbf{x}$ recovers an independent component. Each local maximum of non-Gaussianity over the unit sphere corresponds to one of the sources.

143.3.2 3.2 The Gaussian Degeneracy

This principle also reveals the fundamental limitation of ICA. If two or more sources are Gaussian, ICA cannot separate them. For Gaussian variables, uncorrelatedness already implies independence, so any orthogonal rotation of jointly Gaussian whitened data remains jointly Gaussian and independent. The likelihood is invariant under such rotations, and there is no preferred basis to recover. ICA therefore requires that at most one source be Gaussian. This is the mirror image of the usual statistical preference for Gaussianity: here non-Gaussianity is not a nuisance but the very signal that drives identifiability.

143.3.3 3.3 Measuring Non-Gaussianity with Kurtosis

The classical measure of non-Gaussianity is the fourth order cumulant, kurtosis. For a zero mean random variable $y$ with unit variance,

\[ \mathrm{kurt}(y) = \mathbb{E}[y^4] - 3\big(\mathbb{E}[y^2]\big)^2 = \mathbb{E}[y^4] - 3 . \]

Kurtosis is zero for a Gaussian variable. It is positive for supergaussian or leptokurtic variables, which have heavy tails and a sharp peak, such as a Laplacian distribution, and negative for subgaussian or platykurtic variables, which are flatter than Gaussian, such as a uniform distribution. Non-Gaussianity is measured by $|\mathrm{kurt}(y)|$ or $\mathrm{kurt}(y)^2$. Kurtosis is computationally simple and theoretically transparent, but it is sensitive to outliers, because a few extreme samples in the tail can dominate the fourth power. This fragility motivates more robust contrast functions.

143.3.4 3.4 Negentropy and Robust Contrast Functions

A more principled measure is negentropy, grounded in information theory. Among all distributions with a fixed variance, the Gaussian has maximum differential entropy. Negentropy is defined as the entropy gap between a variable and a Gaussian $y_{\mathrm{gauss}}$ of identical covariance:

\[ J(y) = H(y_{\mathrm{gauss}}) - H(y), \qquad H(y) = -\int p(y)\log p(y)\, dy . \]

Negentropy is always nonnegative, is zero only for a Gaussian, and is invariant under invertible linear transformations, which makes it a theoretically optimal contrast. Its drawback is that computing $J(y)$ exactly requires the unknown density. Practical ICA therefore uses approximations of the form

\[ J(y) \approx \big( \mathbb{E}[G(y)] - \mathbb{E}[G(\nu)] \big)^2, \]

where $\nu$ is a standard Gaussian and $G$ is a smooth nonquadratic function chosen to grow slowly so as to resist outliers. Common choices are

\[ G_1(u) = \frac{1}{a_1}\log\cosh(a_1 u), \qquad G_2(u) = -\exp\!\left(-\frac{u^2}{2}\right), \]

with $1 \le a_1 \le 2$. The function $\log\cosh$ behaves like the absolute value for large arguments and so down-weights extreme samples relative to the fourth power used by kurtosis.

143.4 4. The FastICA Algorithm

143.4.1 4.1 A Fixed Point Iteration

FastICA, introduced by Hyvarinen and Oja, is a fixed point algorithm that maximizes the negentropy approximation with remarkable efficiency. It operates on whitened data $\mathbf{z}$ and seeks a unit weight vector $\mathbf{w}$ that maximizes the non-Gaussianity of $\mathbf{w}^\top \mathbf{z}$. Let $g$ denote the derivative of the contrast function $G$, and $g'$ its second derivative. For the choices above,

\[ g_1(u) = \tanh(a_1 u), \qquad g_2(u) = u\,\exp\!\left(-\frac{u^2}{2}\right). \]

The maxima of the negentropy approximation occur at points satisfying a Lagrangian stationarity condition. Applying an approximate Newton method to that condition yields the FastICA update for a single component:

\[ \mathbf{w}^{+} \leftarrow \mathbb{E}\big[\mathbf{z}\, g(\mathbf{w}^\top \mathbf{z})\big] - \mathbb{E}\big[g'(\mathbf{w}^\top \mathbf{z})\big]\,\mathbf{w}, \]

\[ \mathbf{w} \leftarrow \frac{\mathbf{w}^{+}}{\lVert \mathbf{w}^{+} \rVert}. \]

The first expectation is the gradient term, and the second is a scalar approximation of the Hessian that makes the step a Newton step rather than a plain gradient ascent. The normalization keeps $\mathbf{w}$ on the unit sphere, which is legitimate because the data is whitened. In practice the expectations are replaced by sample averages over the observed data.

143.4.2 4.2 Pseudocode for One Component

Input: whitened data Z (samples in columns), contrast g and its derivative g'
Initialize w to a random unit vector
repeat
    wp = mean over samples of [ Z * g(w^T Z) ]  -  mean[ g'(w^T Z) ] * w
    w  = wp / norm(wp)
until |w_new dot w_old| is close to 1
return w

The convergence criterion checks whether the new and old weight vectors point in essentially the same direction, since the sign is irrelevant. Convergence is cubic for the kurtosis based contrast and at least quadratic in general, far faster than the linear convergence of gradient methods, and it requires no learning rate to tune.

143.4.3 4.3 Extracting Several Components

To recover all $n$ sources, FastICA estimates several vectors $\mathbf{w}_1, \ldots, \mathbf{w}_n$ and constrains them to be orthogonal in the whitened space, since distinct independent components correspond to orthogonal directions after whitening. Two schemes are common. Deflationary orthogonalization extracts components one at a time, and after each new $\mathbf{w}_p$ is updated it is orthogonalized against the previously found vectors by a Gram-Schmidt step:

\[ \mathbf{w}_p \leftarrow \mathbf{w}_p - \sum_{j=1}^{p-1} (\mathbf{w}_p^\top \mathbf{w}_j)\, \mathbf{w}_j, \qquad \mathbf{w}_p \leftarrow \frac{\mathbf{w}_p}{\lVert \mathbf{w}_p \rVert}. \]

Symmetric orthogonalization instead updates all vectors in parallel and then orthogonalizes the whole matrix $\mathbf{W} = (\mathbf{w}_1, \ldots, \mathbf{w}_n)^\top$ at once:

\[ \mathbf{W} \leftarrow \big(\mathbf{W}\mathbf{W}^\top\big)^{-1/2}\, \mathbf{W} . \]

The symmetric scheme avoids the accumulation of estimation errors that the sequential deflation can suffer, and it does not privilege any component over another, so it is often preferred. Symmetric orthogonalization has a clean closed form. Write the inverse symmetric square root through the eigendecomposition $\mathbf{W}\mathbf{W}^\top = \mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^\top$, giving

\[ \big(\mathbf{W}\mathbf{W}^\top\big)^{-1/2} = \mathbf{U}\,\boldsymbol{\Lambda}^{-1/2}\,\mathbf{U}^\top , \]

so a single symmetric eigendecomposition of the small $n \times n$ Gram matrix suffices at every iteration. The reference implementations below use a self-contained cyclic Jacobi eigensolver for this step, which makes the whole pipeline reproducible bit for bit across Python, Julia and Rust.

143.4.4 4.4 The Complete Pipeline

A full ICA analysis proceeds in three stages. First, center the data by subtracting the mean. Second, whiten it using the eigendecomposition of the covariance matrix, which both decorrelates the components and reduces the problem to a search over orthogonal matrices. Third, run FastICA with symmetric orthogonalization to find the unmixing rotation. The recovered sources are then $\mathbf{y} = \mathbf{W}\mathbf{z}$, and the columns of the estimated mixing matrix follow by inverting the composed transformation. Dimensionality reduction can be folded into the whitening step by retaining only the leading principal components, which is valuable when the number of mixtures exceeds the number of sources of interest.

143.4.5 4.5 A Reference Implementation

The companion aiinaction libraries ship a small, deterministic FastICA in Python, Julia, and Rust. The three share one source of truth for their test fixtures and agree numerically to floating-point tolerance. The public API is intentionally tiny: fit_ica(X, n_components, max_iter) returns a fitted model carrying the per-signal mean, the whitening matrix $\mathbf{K}$, the orthogonal unmixing rotation $\mathbf{W}$, the full unmixing operator components $= \mathbf{W}\mathbf{K}$, and the estimated mixing matrix; transform(model, X) recovers the sources $\mathbf{S} = (\mathbf{X} - \boldsymbol{\mu})\,\text{components}^\top$.

To keep results reproducible across languages the implementations make three deliberate choices. The unmixing matrix is initialized to the identity rather than to a random rotation; the fixed-point iteration runs a fixed max_iter count instead of stopping at a tolerance; and whitening and symmetric orthogonalization both use the same cyclic Jacobi eigensolver. The recovered components are then ordered by descending sample variance and each row’s largest-magnitude entry is forced positive, which pins down the permutation and sign ambiguities deterministically.

The example below mixes two synthetic sources through a fixed $2 \times 2$ matrix $\mathbf{A} = \left[\begin{smallmatrix} 2 & 1 \\ 1 & 3 \end{smallmatrix}\right]$ and recovers them. Because ICA identifies sources only up to scale and order, we check separation by confirming that the recovered components are mutually uncorrelated and that the estimated mixing operator inverts the unmixing operator.

Code

import numpy as np
from aiinaction.ch138_fastica import fit_ica, transform

# Two synthetic sources mixed by a fixed matrix A.
X = np.array([
    [ 1.0, -2.0], [ 0.0,  5.0], [ 4.0,  2.0], [-2.0, -6.0],
    [-3.0,  1.0], [ 4.0,  7.0], [-3.0, -4.0], [ 1.0,  3.0],
])

model = fit_ica(X, n_components=2, max_iter=200)
S = transform(model, X)

print("unmixing rotation W:")
print(np.round(model.unmixing, 4))
print("recovered source (row 0):", np.round(S[0], 4))

# Recovered sources are uncorrelated; mixing inverts the unmixing operator.
cov = np.cov(S, rowvar=False)
print("off-diagonal source covariance:", round(float(cov[0, 1]), 12))
print("components @ mixing =\n", np.round(model.components @ model.mixing, 9))

unmixing rotation W:
[[ 0.7846  0.62  ]
 [-0.62    0.7846]]
recovered source (row 0): [0.2606 1.0281]
off-diagonal source covariance: -0.0
components @ mixing =
 [[ 1. -0.]
 [ 0.  1.]]

using AIInAction.Ch138Fastica

X = [
    1.0 -2.0; 0.0 5.0; 4.0 2.0; -2.0 -6.0;
    -3.0 1.0; 4.0 7.0; -3.0 -4.0; 1.0 3.0
]

model = fit_ica(X; n_components=2, max_iter=200)
S = transform(model, X)

println("unmixing rotation W:")
println(round.(model.unmixing; digits=4))
println("recovered source (row 1): ", round.(S[1, :]; digits=4))

# Recovered sources are uncorrelated; mixing inverts the unmixing operator.
n = size(X, 1)
Sc = S .- (sum(S; dims=1) ./ n)
println("off-diagonal source covariance: ",
        round(sum(Sc[:, 1] .* Sc[:, 2]) / (n - 1); digits=12))
println("components * mixing =")
println(round.(model.components * model.mixing; digits=9))

use aiinaction::ch138_fastica::{fit_ica, transform, Matrix};

fn main() {
    let x = Matrix::from_rows(&[
        vec![1.0, -2.0], vec![0.0, 5.0], vec![4.0, 2.0], vec![-2.0, -6.0],
        vec![-3.0, 1.0], vec![4.0, 7.0], vec![-3.0, -4.0], vec![1.0, 3.0],
    ])
    .unwrap();

    let model = fit_ica(&x, Some(2), 200).unwrap();
    let s = transform(&model, &x).unwrap();

    println!("unmixing rotation W:");
    println!("{:?}", model.unmixing);
    println!("recovered source (row 0): {:?}", s[0]);

    // components @ mixing = identity for the square full-rank case.
    let c = &model.components;
    let m = &model.mixing;
    for i in 0..2 {
        for j in 0..2 {
            let v: f64 = (0..2).map(|t| c[i][t] * m[t][j]).sum();
            print!("{:.6} ", v);
        }
        println!();
    }
}

143.5 5. Connections, Assumptions, and Practice

143.5.1 5.1 Maximum Likelihood and Infomax

The non-Gaussianity view is one of several equivalent formulations. ICA can also be derived by maximum likelihood, where one posits source densities and maximizes the likelihood of the observed mixtures over the unmixing matrix, and by the Infomax principle, which maximizes the output entropy of a nonlinear neural network. Both lead to update rules closely related to FastICA, and the nonlinearity $g$ plays the role of the score function of the assumed source density. A further equivalent objective is the minimization of mutual information among the outputs, which is the most direct formalization of independence and connects negentropy maximization to information theory.

143.5.2 5.2 Assumptions and Limitations

The standard model carries several assumptions worth stating plainly. The mixing is linear and instantaneous, with no time delays or convolution, which is violated by realistic room acoustics and requires convolutive ICA extensions. The mixing matrix is square and invertible, so the number of sensors equals the number of sources; the underdetermined case where sources outnumber sensors needs sparse methods. At most one source may be Gaussian. The sources are assumed stationary and mutually independent. Real applications must check these assumptions or adopt extensions that relax them.

143.5.3 5.3 Applications

ICA has found wide use precisely where independent generative sources mix linearly. In biomedical signal processing it separates artifacts such as eye blinks and heartbeat from electroencephalography and magnetoencephalography recordings. In neuroimaging it identifies spatially independent networks in functional MRI data. It serves in financial time series analysis, telecommunications for separating interfering transmissions, and feature extraction for natural images, where the learned independent components resemble the oriented edge detectors found in early visual cortex. Across these domains the same mathematical core operates: exploit non-Gaussianity to undo an unknown linear mixing.

143.6 6. Summary

Independent Component Analysis solves blind source separation by recovering independent sources from their linear mixtures. Its conceptual pivot is the recognition that statistical independence is far stronger than uncorrelatedness, so that second order methods like PCA and whitening, while useful as preprocessing, cannot finish the job. The missing ingredient is non-Gaussianity, which the Central Limit Theorem links to identifiability: the most non-Gaussian linear projections of the mixtures are the original sources. Measured by kurtosis or, more robustly, by negentropy, non-Gaussianity becomes an objective that the FastICA fixed point iteration maximizes with fast and reliable convergence. Within its assumptions of linearity, a square mixing matrix, and at most one Gaussian source, ICA delivers a powerful and widely deployed tool for uncovering hidden structure in mixed signals.

143.7 References

Hyvarinen, A., and Oja, E. (2000). Independent Component Analysis: Algorithms and Applications. Neural Networks, 13(4), 411 to 430. https://www.cs.helsinki.fi/u/ahyvarin/papers/NN00new.pdf
Hyvarinen, A. (1999). Fast and Robust Fixed-Point Algorithms for Independent Component Analysis. IEEE Transactions on Neural Networks, 10(3), 626 to 634. https://www.cs.helsinki.fi/u/ahyvarin/papers/TNN99new.pdf
Hyvarinen, A., Karhunen, J., and Oja, E. (2001). Independent Component Analysis. John Wiley and Sons. https://www.cs.helsinki.fi/u/ahyvarin/papers/bookfinal_ICA.pdf
Bell, A. J., and Sejnowski, T. J. (1995). An Information-Maximization Approach to Blind Separation and Blind Deconvolution. Neural Computation, 7(6), 1129 to 1159. https://www.cnl.salk.edu/~tony/ptr/bell.pdf
Comon, P. (1994). Independent Component Analysis, A New Concept? Signal Processing, 36(3), 287 to 314. https://hal.science/hal-00417283/document
Cardoso, J. F. (1998). Blind Signal Separation: Statistical Principles. Proceedings of the IEEE, 86(10), 2009 to 2025. https://www2.iap.fr/users/cardoso/papers/ProcIEEE.pdf
Stone, J. V. (2004). Independent Component Analysis: A Tutorial Introduction. MIT Press. https://mitpress.mit.edu/9780262693158/independent-component-analysis/

# Independent Component Analysis Independent Component Analysis (ICA) is a computational framework for recovering a set of underlying source signals from observed mixtures of those signals, given only the mixtures and a few weak statistical assumptions. It is the canonical solution to the blind source separation problem, where the word blind signals that we know almost nothing about either the sources or the mixing process. This chapter develops the theory rigorously, beginning with the generative model, then clarifying the crucial distinction between statistical independence and mere uncorrelatedness, establishing why non-Gaussianity is the engine that makes separation possible, and finally deriving the FastICA algorithm that has become the practical workhorse of the field. ## 1. The Blind Source Separation Problem ### 1.1 The Generative Model Suppose there are $n$ statistically independent source signals collected into a random vector $\mathbf{s} = (s_1, s_2, \ldots, s_n)^\top$. We do not observe these sources directly. Instead we observe $n$ linear mixtures collected into a vector $\mathbf{x} = (x_1, \ldots, x_n)^\top$, formed by an unknown invertible mixing matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$: $$ \mathbf{x} = \mathbf{A}\mathbf{s}, \qquad x_i = \sum_{j=1}^{n} a_{ij}\, s_j . $$ The goal of ICA is to estimate an unmixing matrix $\mathbf{W}$ such that $$ \mathbf{y} = \mathbf{W}\mathbf{x} $$ recovers the sources, ideally with $\mathbf{W} = \mathbf{A}^{-1}$ up to acceptable ambiguities. The defining difficulty is that we know neither $\mathbf{A}$ nor $\mathbf{s}$. All inference must proceed from the observed samples of $\mathbf{x}$ alone, leaning entirely on the structural assumption that the components of $\mathbf{s}$ are mutually independent. ### 1.2 The Cocktail Party Problem The intuition pump for ICA is the cocktail party problem. Imagine $n$ people speaking simultaneously in a room containing $n$ microphones. Each microphone records a different weighted sum of all the voices, because each speaker sits at a different distance from each microphone. The recordings $x_i(t)$ are the mixtures, the speech signals $s_j(t)$ are the sources, and the attenuation coefficients form the matrix $\mathbf{A}$. A human listener can attend to a single conversation amid the babble. ICA gives a machine an analogous ability: from the raw microphone recordings alone, it reconstructs the individual voices without knowing where anyone is standing or what they are saying. ### 1.3 Inherent Ambiguities Two ambiguities cannot be resolved by ICA and are usually harmless. First, the variances of the independent components are not identifiable. Since both $\mathbf{s}$ and $\mathbf{A}$ are unknown, any scalar multiplying a source $s_j$ can be absorbed by dividing the corresponding column of $\mathbf{A}$: $$ \mathbf{x} = \sum_j \left( \tfrac{1}{\alpha_j}\mathbf{a}_j \right)\left( \alpha_j s_j \right). $$ By convention we fix this scale ambiguity by assuming each source has unit variance, $\mathbb{E}[s_j^2] = 1$, which still leaves an unresolved sign. Second, the order of the components is arbitrary, because we can permute the sources and apply the inverse permutation to the columns of $\mathbf{A}$ without changing $\mathbf{x}$. Formally, ICA recovers $\mathbf{W} = \mathbf{P}\mathbf{D}\mathbf{A}^{-1}$ for some permutation matrix $\mathbf{P}$ and diagonal scaling matrix $\mathbf{D}$. ## 2. Statistical Independence Versus Uncorrelatedness ### 2.1 Two Distinct Notions The heart of ICA is a careful distinction between two conditions that are often conflated. Two random variables $y_1$ and $y_2$ are uncorrelated if their covariance vanishes: $$ \mathbb{E}[y_1 y_2] - \mathbb{E}[y_1]\,\mathbb{E}[y_2] = 0 . $$ They are statistically independent if their joint density factorizes into the product of the marginals: $$ p(y_1, y_2) = p_1(y_1)\, p_2(y_2) . $$ Independence is strictly stronger. It implies that for any pair of measurable functions $g$ and $h$, $$ \mathbb{E}[g(y_1)\, h(y_2)] = \mathbb{E}[g(y_1)]\,\mathbb{E}[h(y_2)] . $$ Uncorrelatedness is the special case where $g$ and $h$ are both the identity. Independence constrains all higher order moments and not merely the second order cross moment. Independent variables are always uncorrelated, but uncorrelated variables can be strongly dependent through their higher order statistics. ### 2.2 Why the Distinction Matters Methods such as Principal Component Analysis (PCA) and whitening only impose uncorrelatedness. They diagonalize the covariance matrix and thereby remove all second order dependence, but they leave higher order structure untouched. Because second order statistics are insufficient to pin down the mixing matrix, decorrelation alone cannot solve blind source separation. There are infinitely many uncorrelated representations of a given dataset, related to one another by orthogonal rotations, and only special rotations yield genuinely independent components. ICA must therefore exploit information beyond the covariance, which is precisely the higher order information encoded in non-Gaussianity. ### 2.3 Whitening as Preprocessing Although decorrelation is not sufficient, it is an extremely useful first step. Whitening transforms the observed data so that its components are uncorrelated and have unit variance, giving an identity covariance matrix. If $\mathbf{x}$ has covariance $\mathbf{C} = \mathbb{E}[\mathbf{x}\mathbf{x}^\top] = \mathbf{E}\mathbf{D}\mathbf{E}^\top$ in its eigendecomposition, the whitening transform is $$ \mathbf{z} = \mathbf{D}^{-1/2}\mathbf{E}^\top \mathbf{x}, \qquad \mathbb{E}[\mathbf{z}\mathbf{z}^\top] = \mathbf{I}. $$ After whitening, the remaining mixing matrix is orthogonal. The proof is short: if $\mathbf{z} = \tilde{\mathbf{A}}\mathbf{s}$ with unit variance independent sources, then $\mathbf{I} = \mathbb{E}[\mathbf{z}\mathbf{z}^\top] = \tilde{\mathbf{A}}\,\mathbb{E}[\mathbf{s}\mathbf{s}^\top]\,\tilde{\mathbf{A}}^\top = \tilde{\mathbf{A}}\tilde{\mathbf{A}}^\top$, so $\tilde{\mathbf{A}}$ is orthogonal. Whitening thus reduces the number of free parameters from $n^2$ to the $n(n-1)/2$ that parameterize an orthogonal matrix, which substantially simplifies the search. ## 3. Non-Gaussianity as the Key to Separation ### 3.1 The Central Limit Theorem Intuition Why should non-Gaussianity matter? Consider a single estimated component $y = \mathbf{w}^\top \mathbf{x} = \mathbf{w}^\top \mathbf{A}\mathbf{s} = \mathbf{q}^\top \mathbf{s}$, where $\mathbf{q} = \mathbf{A}^\top \mathbf{w}$. The estimate $y$ is a linear combination of the independent sources. The Central Limit Theorem says that a sum of independent random variables tends toward a Gaussian distribution as the number of terms grows. Consequently, $\mathbf{q}^\top \mathbf{s}$ is generally more Gaussian than any single source $s_j$, and it is least Gaussian precisely when it equals one of the sources, meaning $\mathbf{q}$ has a single nonzero entry. This suggests a striking principle: maximizing the non-Gaussianity of $\mathbf{w}^\top \mathbf{x}$ recovers an independent component. Each local maximum of non-Gaussianity over the unit sphere corresponds to one of the sources. ### 3.2 The Gaussian Degeneracy This principle also reveals the fundamental limitation of ICA. If two or more sources are Gaussian, ICA cannot separate them. For Gaussian variables, uncorrelatedness already implies independence, so any orthogonal rotation of jointly Gaussian whitened data remains jointly Gaussian and independent. The likelihood is invariant under such rotations, and there is no preferred basis to recover. ICA therefore requires that at most one source be Gaussian. This is the mirror image of the usual statistical preference for Gaussianity: here non-Gaussianity is not a nuisance but the very signal that drives identifiability. ### 3.3 Measuring Non-Gaussianity with Kurtosis The classical measure of non-Gaussianity is the fourth order cumulant, kurtosis. For a zero mean random variable $y$ with unit variance, $$ \mathrm{kurt}(y) = \mathbb{E}[y^4] - 3\big(\mathbb{E}[y^2]\big)^2 = \mathbb{E}[y^4] - 3 . $$ Kurtosis is zero for a Gaussian variable. It is positive for supergaussian or leptokurtic variables, which have heavy tails and a sharp peak, such as a Laplacian distribution, and negative for subgaussian or platykurtic variables, which are flatter than Gaussian, such as a uniform distribution. Non-Gaussianity is measured by $|\mathrm{kurt}(y)|$ or $\mathrm{kurt}(y)^2$. Kurtosis is computationally simple and theoretically transparent, but it is sensitive to outliers, because a few extreme samples in the tail can dominate the fourth power. This fragility motivates more robust contrast functions. ### 3.4 Negentropy and Robust Contrast Functions A more principled measure is negentropy, grounded in information theory. Among all distributions with a fixed variance, the Gaussian has maximum differential entropy. Negentropy is defined as the entropy gap between a variable and a Gaussian $y_{\mathrm{gauss}}$ of identical covariance: $$ J(y) = H(y_{\mathrm{gauss}}) - H(y), \qquad H(y) = -\int p(y)\log p(y)\, dy . $$ Negentropy is always nonnegative, is zero only for a Gaussian, and is invariant under invertible linear transformations, which makes it a theoretically optimal contrast. Its drawback is that computing $J(y)$ exactly requires the unknown density. Practical ICA therefore uses approximations of the form $$ J(y) \approx \big( \mathbb{E}[G(y)] - \mathbb{E}[G(\nu)] \big)^2, $$ where $\nu$ is a standard Gaussian and $G$ is a smooth nonquadratic function chosen to grow slowly so as to resist outliers. Common choices are $$ G_1(u) = \frac{1}{a_1}\log\cosh(a_1 u), \qquad G_2(u) = -\exp\!\left(-\frac{u^2}{2}\right), $$ with $1 \le a_1 \le 2$. The function $\log\cosh$ behaves like the absolute value for large arguments and so down-weights extreme samples relative to the fourth power used by kurtosis. ## 4. The FastICA Algorithm ### 4.1 A Fixed Point Iteration FastICA, introduced by Hyvarinen and Oja, is a fixed point algorithm that maximizes the negentropy approximation with remarkable efficiency. It operates on whitened data $\mathbf{z}$ and seeks a unit weight vector $\mathbf{w}$ that maximizes the non-Gaussianity of $\mathbf{w}^\top \mathbf{z}$. Let $g$ denote the derivative of the contrast function $G$, and $g'$ its second derivative. For the choices above, $$ g_1(u) = \tanh(a_1 u), \qquad g_2(u) = u\,\exp\!\left(-\frac{u^2}{2}\right). $$ The maxima of the negentropy approximation occur at points satisfying a Lagrangian stationarity condition. Applying an approximate Newton method to that condition yields the FastICA update for a single component: $$ \mathbf{w}^{+} \leftarrow \mathbb{E}\big[\mathbf{z}\, g(\mathbf{w}^\top \mathbf{z})\big] - \mathbb{E}\big[g'(\mathbf{w}^\top \mathbf{z})\big]\,\mathbf{w}, $$ $$ \mathbf{w} \leftarrow \frac{\mathbf{w}^{+}}{\lVert \mathbf{w}^{+} \rVert}. $$ The first expectation is the gradient term, and the second is a scalar approximation of the Hessian that makes the step a Newton step rather than a plain gradient ascent. The normalization keeps $\mathbf{w}$ on the unit sphere, which is legitimate because the data is whitened. In practice the expectations are replaced by sample averages over the observed data. ### 4.2 Pseudocode for One Component ```text Input: whitened data Z (samples in columns), contrast g and its derivative g' Initialize w to a random unit vector repeat wp = mean over samples of [ Z * g(w^T Z) ] - mean[ g'(w^T Z) ] * w w = wp / norm(wp) until |w_new dot w_old| is close to 1 return w ``` The convergence criterion checks whether the new and old weight vectors point in essentially the same direction, since the sign is irrelevant. Convergence is cubic for the kurtosis based contrast and at least quadratic in general, far faster than the linear convergence of gradient methods, and it requires no learning rate to tune. ### 4.3 Extracting Several Components To recover all $n$ sources, FastICA estimates several vectors $\mathbf{w}_1, \ldots, \mathbf{w}_n$ and constrains them to be orthogonal in the whitened space, since distinct independent components correspond to orthogonal directions after whitening. Two schemes are common. Deflationary orthogonalization extracts components one at a time, and after each new $\mathbf{w}_p$ is updated it is orthogonalized against the previously found vectors by a Gram-Schmidt step: $$ \mathbf{w}_p \leftarrow \mathbf{w}_p - \sum_{j=1}^{p-1} (\mathbf{w}_p^\top \mathbf{w}_j)\, \mathbf{w}_j, \qquad \mathbf{w}_p \leftarrow \frac{\mathbf{w}_p}{\lVert \mathbf{w}_p \rVert}. $$ Symmetric orthogonalization instead updates all vectors in parallel and then orthogonalizes the whole matrix $\mathbf{W} = (\mathbf{w}_1, \ldots, \mathbf{w}_n)^\top$ at once: $$ \mathbf{W} \leftarrow \big(\mathbf{W}\mathbf{W}^\top\big)^{-1/2}\, \mathbf{W} . $$ The symmetric scheme avoids the accumulation of estimation errors that the sequential deflation can suffer, and it does not privilege any component over another, so it is often preferred. Symmetric orthogonalization has a clean closed form. Write the inverse symmetric square root through the eigendecomposition $\mathbf{W}\mathbf{W}^\top = \mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^\top$, giving $$ \big(\mathbf{W}\mathbf{W}^\top\big)^{-1/2} = \mathbf{U}\,\boldsymbol{\Lambda}^{-1/2}\,\mathbf{U}^\top , $$ so a single symmetric eigendecomposition of the small $n \times n$ Gram matrix suffices at every iteration. The reference implementations below use a self-contained cyclic Jacobi eigensolver for this step, which makes the whole pipeline reproducible bit for bit across Python, Julia and Rust. ### 4.4 The Complete Pipeline A full ICA analysis proceeds in three stages. First, center the data by subtracting the mean. Second, whiten it using the eigendecomposition of the covariance matrix, which both decorrelates the components and reduces the problem to a search over orthogonal matrices. Third, run FastICA with symmetric orthogonalization to find the unmixing rotation. The recovered sources are then $\mathbf{y} = \mathbf{W}\mathbf{z}$, and the columns of the estimated mixing matrix follow by inverting the composed transformation. Dimensionality reduction can be folded into the whitening step by retaining only the leading principal components, which is valuable when the number of mixtures exceeds the number of sources of interest. ### 4.5 A Reference Implementation The companion `aiinaction` libraries ship a small, deterministic FastICA in Python, Julia, and Rust. The three share one source of truth for their test fixtures and agree numerically to floating-point tolerance. The public API is intentionally tiny: `fit_ica(X, n_components, max_iter)` returns a fitted model carrying the per-signal `mean`, the `whitening` matrix $\mathbf{K}$, the orthogonal `unmixing` rotation $\mathbf{W}$, the full unmixing operator `components` $= \mathbf{W}\mathbf{K}$, and the estimated `mixing` matrix; `transform(model, X)` recovers the sources $\mathbf{S} = (\mathbf{X} - \boldsymbol{\mu})\,\text{components}^\top$. To keep results reproducible across languages the implementations make three deliberate choices. The unmixing matrix is initialized to the identity rather than to a random rotation; the fixed-point iteration runs a fixed `max_iter` count instead of stopping at a tolerance; and whitening and symmetric orthogonalization both use the same cyclic Jacobi eigensolver. The recovered components are then ordered by descending sample variance and each row's largest-magnitude entry is forced positive, which pins down the permutation and sign ambiguities deterministically. The example below mixes two synthetic sources through a fixed $2 \times 2$ matrix $\mathbf{A} = \left[\begin{smallmatrix} 2 & 1 \\ 1 & 3 \end{smallmatrix}\right]$ and recovers them. Because ICA identifies sources only up to scale and order, we check separation by confirming that the recovered components are mutually uncorrelated and that the estimated mixing operator inverts the unmixing operator. ::: {.panel-tabset} ## Python ```{python} import numpy as np from aiinaction.ch138_fastica import fit_ica, transform # Two synthetic sources mixed by a fixed matrix A. X = np.array([ [ 1.0, -2.0], [ 0.0, 5.0], [ 4.0, 2.0], [-2.0, -6.0], [-3.0, 1.0], [ 4.0, 7.0], [-3.0, -4.0], [ 1.0, 3.0], ]) model = fit_ica(X, n_components=2, max_iter=200) S = transform(model, X) print("unmixing rotation W:") print(np.round(model.unmixing, 4)) print("recovered source (row 0):", np.round(S[0], 4)) # Recovered sources are uncorrelated; mixing inverts the unmixing operator. cov = np.cov(S, rowvar=False) print("off-diagonal source covariance:", round(float(cov[0, 1]), 12)) print("components @ mixing =\n", np.round(model.components @ model.mixing, 9)) ``` ## Julia ```julia using AIInAction.Ch138Fastica X = [ 1.0 -2.0; 0.0 5.0; 4.0 2.0; -2.0 -6.0; -3.0 1.0; 4.0 7.0; -3.0 -4.0; 1.0 3.0 ] model = fit_ica(X; n_components=2, max_iter=200) S = transform(model, X) println("unmixing rotation W:") println(round.(model.unmixing; digits=4)) println("recovered source (row 1): ", round.(S[1, :]; digits=4)) # Recovered sources are uncorrelated; mixing inverts the unmixing operator. n = size(X, 1) Sc = S .- (sum(S; dims=1) ./ n) println("off-diagonal source covariance: ", round(sum(Sc[:, 1] .* Sc[:, 2]) / (n - 1); digits=12)) println("components * mixing =") println(round.(model.components * model.mixing; digits=9)) ``` ## Rust ```rust use aiinaction::ch138_fastica::{fit_ica, transform, Matrix}; fn main() { let x = Matrix::from_rows(&[ vec![1.0, -2.0], vec![0.0, 5.0], vec![4.0, 2.0], vec![-2.0, -6.0], vec![-3.0, 1.0], vec![4.0, 7.0], vec![-3.0, -4.0], vec![1.0, 3.0], ]) .unwrap(); let model = fit_ica(&x, Some(2), 200).unwrap(); let s = transform(&model, &x).unwrap(); println!("unmixing rotation W:"); println!("{:?}", model.unmixing); println!("recovered source (row 0): {:?}", s[0]); // components @ mixing = identity for the square full-rank case. let c = &model.components; let m = &model.mixing; for i in 0..2 { for j in 0..2 { let v: f64 = (0..2).map(|t| c[i][t] * m[t][j]).sum(); print!("{:.6} ", v); } println!(); } } ``` ::: ## 5. Connections, Assumptions, and Practice ### 5.1 Maximum Likelihood and Infomax The non-Gaussianity view is one of several equivalent formulations. ICA can also be derived by maximum likelihood, where one posits source densities and maximizes the likelihood of the observed mixtures over the unmixing matrix, and by the Infomax principle, which maximizes the output entropy of a nonlinear neural network. Both lead to update rules closely related to FastICA, and the nonlinearity $g$ plays the role of the score function of the assumed source density. A further equivalent objective is the minimization of mutual information among the outputs, which is the most direct formalization of independence and connects negentropy maximization to information theory. ### 5.2 Assumptions and Limitations The standard model carries several assumptions worth stating plainly. The mixing is linear and instantaneous, with no time delays or convolution, which is violated by realistic room acoustics and requires convolutive ICA extensions. The mixing matrix is square and invertible, so the number of sensors equals the number of sources; the underdetermined case where sources outnumber sensors needs sparse methods. At most one source may be Gaussian. The sources are assumed stationary and mutually independent. Real applications must check these assumptions or adopt extensions that relax them. ### 5.3 Applications ICA has found wide use precisely where independent generative sources mix linearly. In biomedical signal processing it separates artifacts such as eye blinks and heartbeat from electroencephalography and magnetoencephalography recordings. In neuroimaging it identifies spatially independent networks in functional MRI data. It serves in financial time series analysis, telecommunications for separating interfering transmissions, and feature extraction for natural images, where the learned independent components resemble the oriented edge detectors found in early visual cortex. Across these domains the same mathematical core operates: exploit non-Gaussianity to undo an unknown linear mixing. ## 6. Summary Independent Component Analysis solves blind source separation by recovering independent sources from their linear mixtures. Its conceptual pivot is the recognition that statistical independence is far stronger than uncorrelatedness, so that second order methods like PCA and whitening, while useful as preprocessing, cannot finish the job. The missing ingredient is non-Gaussianity, which the Central Limit Theorem links to identifiability: the most non-Gaussian linear projections of the mixtures are the original sources. Measured by kurtosis or, more robustly, by negentropy, non-Gaussianity becomes an objective that the FastICA fixed point iteration maximizes with fast and reliable convergence. Within its assumptions of linearity, a square mixing matrix, and at most one Gaussian source, ICA delivers a powerful and widely deployed tool for uncovering hidden structure in mixed signals. ## References 1. Hyvarinen, A., and Oja, E. (2000). Independent Component Analysis: Algorithms and Applications. Neural Networks, 13(4), 411 to 430. https://www.cs.helsinki.fi/u/ahyvarin/papers/NN00new.pdf 2. Hyvarinen, A. (1999). Fast and Robust Fixed-Point Algorithms for Independent Component Analysis. IEEE Transactions on Neural Networks, 10(3), 626 to 634. https://www.cs.helsinki.fi/u/ahyvarin/papers/TNN99new.pdf 3. Hyvarinen, A., Karhunen, J., and Oja, E. (2001). Independent Component Analysis. John Wiley and Sons. https://www.cs.helsinki.fi/u/ahyvarin/papers/bookfinal_ICA.pdf 4. Bell, A. J., and Sejnowski, T. J. (1995). An Information-Maximization Approach to Blind Separation and Blind Deconvolution. Neural Computation, 7(6), 1129 to 1159. https://www.cnl.salk.edu/~tony/ptr/bell.pdf 5. Comon, P. (1994). Independent Component Analysis, A New Concept? Signal Processing, 36(3), 287 to 314. https://hal.science/hal-00417283/document 6. Cardoso, J. F. (1998). Blind Signal Separation: Statistical Principles. Proceedings of the IEEE, 86(10), 2009 to 2025. https://www2.iap.fr/users/cardoso/papers/ProcIEEE.pdf 7. Stone, J. V. (2004). Independent Component Analysis: A Tutorial Introduction. MIT Press. https://mitpress.mit.edu/9780262693158/independent-component-analysis/