61 Exploratory Data Analysis

Exploratory data analysis (EDA) is the disciplined practice of interrogating a dataset before committing to a model. It is less a fixed recipe than a stance, one that treats every column, distribution, and relationship as a claim that must be examined rather than assumed. John Tukey, whose 1977 book of the same name established the field, framed EDA as detective work: the analyst looks for clues, follows hunches, and remains willing to be surprised. This chapter develops the EDA mindset and then works through univariate, bivariate, and multivariate exploration, the assessment of distributions and missingness, and the formation of hypotheses that will guide later modeling.

61.1 1. The EDA Mindset

61.1.1 1.1 Why Explore Before Modeling

Modeling is an act of compression. A learned function reduces a complex dataset to a small set of parameters and a single objective. If the data violate the assumptions baked into that objective, the compression discards exactly the structure that matters. EDA exists to surface those violations early, while they are cheap to fix. A skewed target that should be log transformed, a leakage column that encodes the label, a cluster of duplicated rows, a subgroup with a reversed sign of effect: each of these is invisible to a cross validation score until it has already corrupted the conclusions.

The central goal is to build an accurate mental model of the data generating process. You want to know how the data were collected, what each variable means, which values are possible, and where the recording could have gone wrong. This understanding is what lets you distinguish signal from artifact later. A correlation of $0.9$ is exciting only until you discover that both variables were derived from the same source measurement.

61.1.2 1.2 Confirmatory Versus Exploratory Analysis

It helps to separate two modes of inquiry. Confirmatory data analysis tests a prespecified hypothesis with a fixed protocol and reports a calibrated error rate. Exploratory analysis generates hypotheses by searching the data for patterns. The two are complementary, but conflating them is dangerous. If you look at the same data that suggested a hypothesis and then test that hypothesis on those same data, the resulting p value is not interpretable. This is the multiple comparisons problem in disguise, sometimes called the garden of forking paths.

The mechanism is worth stating precisely. Suppose you scan $m$ independent candidate relationships, each genuinely null, and test each at level $\alpha$. The probability that at least one crosses the threshold by chance is

\[ P(\text{at least one false positive}) = 1 - (1 - \alpha)^m, \]

which for $\alpha = 0.05$ and $m = 20$ already exceeds $0.64$. Exploration routinely examines far more than twenty views, often implicitly, because each choice of subset, transform, or binning is another comparison. The point is not that exploration is illegitimate but that the significance attached to whatever it surfaces is not the nominal $\alpha$.
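The growth of this probability with $m$ is easy to tabulate. The following is a minimal sketch of the formula above; the function name `familywise_error_rate` is an illustrative choice, and the calculation assumes the tests are independent, as the text does.

```python
def familywise_error_rate(alpha: float, m: int) -> float:
    """Probability that at least one of m independent null tests rejects at level alpha."""
    return 1.0 - (1.0 - alpha) ** m


# Tabulate the rate for a few numbers of exploratory views
for m in (1, 5, 20, 100):
    rate = familywise_error_rate(0.05, m)
    print(f"m = {m:>3}: P(at least one false positive) = {rate:.3f}")
```

Real exploratory views are rarely independent, so the numbers should be read as an order of magnitude rather than an exact error rate.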

The practical discipline is to keep the roles of the data separate. Reserve a holdout that you do not look at during exploration, or at minimum record which patterns were discovered exploratorily so that any later inferential claim about them is treated as tentative. EDA earns its freedom to look everywhere precisely by refusing to attach formal significance to what it finds.
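One way to reserve such a holdout is to fix a random partition of row positions before any plotting begins. The sketch below is only an illustration: the helper `split_holdout`, the 20% fraction, and the seed are assumptions, and a purely random split presumes that rows are exchangeable (time-ordered or grouped data would need a different scheme).

```python
import numpy as np


def split_holdout(n_rows, holdout_fraction=0.2, seed=0):
    """Return (explore_idx, holdout_idx): disjoint, shuffled row positions."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)
    n_holdout = int(round(holdout_fraction * n_rows))
    return order[n_holdout:], order[:n_holdout]


explore_idx, holdout_idx = split_holdout(n_rows=1_000)
print(len(explore_idx), "rows to explore,", len(holdout_idx), "rows held back")
```

Fixing the seed makes the partition reproducible, so the same rows stay unseen across exploratory sessions.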

61.1.3 1.3 An Iterative Loop

EDA proceeds as a loop: form a question, generate a view that answers it, and let the answer raise the next question. Wickham, Cetinkaya-Rundel, and Grolemund describe this as a cycle of transforming, visualizing, and modeling in service of understanding. The loop terminates not when you run out of plots but when your questions stop producing surprises. A useful habit is to write each question down before plotting, because an unwritten question is easy to retrofit to whatever the plot happens to show.

flowchart LR
    Q["Ask a question"] --> V["Generate a view"]
    V --> A["Read the answer"]
    A -->|"new surprise"| Q
    A -->|"no surprise"| H["Record a hypothesis"]

Figure 61.1: The exploratory loop. Each answer either raises a sharper question or, once surprises run out, hardens into a documented hypothesis.

61.2 2. Univariate Exploration

61.2.1 2.1 Summary Statistics and Their Limits

Begin one variable at a time. For a numeric variable, the basic summary is a location measure, a spread measure, and a shape description. The mean $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and standard deviation $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$ are familiar but fragile, because both are sensitive to outliers.

Fragility can be made precise through the breakdown point, the smallest fraction of observations that can be moved to arbitrary values before the estimator itself becomes arbitrary. The mean has a breakdown point of $0$, since a single point sent to infinity drags it without bound. The median, by contrast, has a breakdown point of $1/2$: up to half the data can be corrupted before it loses meaning. The interquartile range (IQR), the distance between the $25$th and $75$th percentiles, is robust in the same spirit, though with a lower breakdown point of $1/4$, and gives a more stable picture of spread for skewed or heavy tailed data.
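The contrast between the mean and the median can be seen by corrupting a single observation. This is a small sketch on made-up numbers (the values in `clean` are purely illustrative).

```python
import statistics

clean = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1]
corrupted = clean[:-1] + [1e6]  # replace one observation with a wild value

# The mean follows the corrupted point; the median does not move
print("mean  :", statistics.mean(clean), "->", statistics.mean(corrupted))
print("median:", statistics.median(clean), "->", statistics.median(corrupted))
```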

Summary statistics alone are never sufficient. Anscombe’s quartet and the more recent Datasaurus dozen demonstrate that radically different datasets can share identical means, variances, and correlations to two decimal places while differing wildly in shape. The lesson is permanent: always plot the distribution rather than trusting a handful of summary numbers to characterize it.

61.2.2 2.2 Visualizing a Single Distribution

The histogram is the workhorse for numeric variables, but its appearance depends heavily on bin width. Too few bins hide structure such as bimodality; too many bins turn the plot into noise. Several rules of thumb exist for a starting bin count, including Sturges’ rule of $\lceil \log_2 n + 1 \rceil$ bins and the Freedman-Diaconis width $2 \cdot \text{IQR} \cdot n^{-1/3}$, which adapts to spread and is more robust for skewed data. Treat any such rule as a default to override once the eye sees structure, not as a final answer.
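Both rules are short enough to compute by hand. The sketch below applies them to a synthetic right skewed sample (the lognormal data and variable names are assumptions for illustration); NumPy also exposes the same rules through `np.histogram_bin_edges(x, bins="sturges")` and `bins="fd"`.

```python
import math

import numpy as np

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=2_000)  # synthetic right skewed sample

n = x.size
sturges_bins = math.ceil(math.log2(n) + 1)

# Freedman-Diaconis: width from the IQR, then the bin count from the data range
q25, q75 = np.percentile(x, [25, 75])
fd_width = 2.0 * (q75 - q25) * n ** (-1.0 / 3.0)
fd_bins = math.ceil((x.max() - x.min()) / fd_width)

print("Sturges bins:", sturges_bins)
print("Freedman-Diaconis width:", round(float(fd_width), 3), "bins:", fd_bins)
```

On long tailed data the two rules can disagree by an order of magnitude, which is itself a hint that the default needs a second look.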

A kernel density estimate smooths the histogram into a continuous curve,

\[ \hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right), \]

where $K$ is a kernel (often Gaussian) and the bandwidth $h$ plays the role bin width plays for a histogram. Small $h$ undersmooths and shows spurious wiggles; large $h$ oversmooths and erases real modes. A boxplot compresses the distribution into quartiles and flags outliers using the $1.5 \times \text{IQR}$ rule, while the empirical cumulative distribution function (ECDF),

\[ \hat{F}(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{x_i \le x\}, \]

avoids binning and bandwidth choices entirely. The ECDF is excellent for reading off quantiles directly and for comparing groups, since two ECDFs can be overlaid without the visual artifacts that overlapping histograms introduce.

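# Inspect one numeric column from several angles (df is an existing pandas DataFrame)
import matplotlib.pyplot as plt

print(df["amount"].describe())
fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(10, 4))  # one panel per view
df["amount"].plot(kind="hist", bins=50, ax=ax_hist)
df["amount"].plot(kind="box", ax=ax_box)
plt.show()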
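The ECDF defined above needs nothing more than a sort. The following sketch uses two synthetic groups (an assumption for illustration) and reads a median directly off each curve; the helper `ecdf` is a hypothetical name, and its output equals $\hat{F}$ at each point only when the values are distinct.

```python
import numpy as np


def ecdf(values):
    """Return sorted values and cumulative shares i/n (equals the ECDF when values are distinct)."""
    x = np.sort(np.asarray(values, dtype=float))
    f = np.arange(1, x.size + 1) / x.size
    return x, f


rng = np.random.default_rng(1)
group_a = rng.normal(loc=0.0, scale=1.0, size=500)
group_b = rng.normal(loc=0.5, scale=1.0, size=500)

xa, fa = ecdf(group_a)
xb, fb = ecdf(group_b)

# Read a quantile off the curve: the first x whose cumulative share reaches 0.5
median_a = xa[np.searchsorted(fa, 0.5)]
median_b = xb[np.searchsorted(fb, 0.5)]
print("approximate medians:", round(float(median_a), 3), round(float(median_b), 3))
```

Plotting `xa, fa` and `xb, fb` as step curves on the same axes gives the overlaid comparison described above.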

For categorical variables, the analogue is a frequency table or bar chart. Watch for rare categories, near constant columns that carry almost no information, and high cardinality fields such as identifiers that masquerade as features.
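These three warning signs can be screened for in one pass. The sketch below uses a hypothetical toy frame; the column names and the 0.85 threshold for "near constant" are illustrative assumptions rather than recommended values.

```python
import pandas as pd

# Hypothetical toy frame; column names are illustrative only
df = pd.DataFrame({
    "user_id": [f"u{i}" for i in range(8)],
    "country": ["US", "US", "US", "US", "US", "US", "US", "DE"],
    "plan": ["free", "pro", "free", "free", "pro", "free", "pro", "free"],
})

summary = pd.DataFrame({
    "n_distinct": df.nunique(),
    # share of rows taken by the most frequent level
    "top_share": df.apply(lambda s: s.value_counts(normalize=True).iloc[0]),
})
summary["looks_like_id"] = summary["n_distinct"] == len(df)  # every value unique
summary["near_constant"] = summary["top_share"] >= 0.85      # assumed threshold
print(summary)
```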

61.2.3 2.3 Shape, Skew, and Transformation

The shape of a distribution suggests how to treat it. Skewness measures asymmetry,

\[ g_1 = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^3}{\left(\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\right)^{3/2}}, \]

with $g_1 > 0$ indicating a long right tail, while kurtosis measures tail heaviness relative to the normal. Right skewed quantities such as income, transaction size, or counts often benefit from a logarithmic or square root transform, which can stabilize variance and make linear relationships visible. More generally, the Box-Cox family

\[ x^{(\lambda)} = \begin{cases} \dfrac{x^{\lambda} - 1}{\lambda}, & \lambda \neq 0,\\[2mm] \ln x, & \lambda = 0, \end{cases} \qquad x > 0, \]

interpolates smoothly between the log ($\lambda = 0$), square root ($\lambda = 1/2$), and identity ($\lambda = 1$) transforms, the latter two up to a shift and rescaling, and the value of $\lambda$ under which the transformed data look most nearly normal can be chosen by maximum likelihood. Both skewness and kurtosis are worth computing, but the eye reading a histogram usually decides faster. When a variable spans several orders of magnitude, plotting it on a log scale frequently turns an uninformative spike near zero into a clean, interpretable curve.
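The effect of the transform on skewness can be checked numerically. This sketch implements $g_1$ and the Box-Cox formula as written above and applies them to synthetic lognormal data (the data and the names `skewness` and `box_cox` are assumptions for illustration). It tries three fixed values of $\lambda$ rather than estimating $\lambda$ by maximum likelihood.

```python
import numpy as np


def skewness(x):
    """Moment-based sample skewness g1."""
    d = np.asarray(x, dtype=float) - np.mean(x)
    return np.mean(d ** 3) / np.mean(d ** 2) ** 1.5


def box_cox(x, lam):
    """Box-Cox transform for strictly positive x."""
    x = np.asarray(x, dtype=float)
    return np.log(x) if lam == 0 else (x ** lam - 1.0) / lam


rng = np.random.default_rng(2)
amounts = rng.lognormal(mean=3.0, sigma=1.0, size=5_000)  # synthetic right skewed data

for lam in (1.0, 0.5, 0.0):
    print(f"lambda = {lam}: g1 = {skewness(box_cox(amounts, lam)):.2f}")
```

Because the sample is lognormal by construction, the log transform should bring $g_1$ close to zero; real data will seldom be so tidy.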

61.3 3. Bivariate Exploration

61.3.1 3.1 Numeric Versus Numeric

With two numeric variables, the scatter plot is the primary tool. It reveals the form of the relationship (linear, curved, or absent), its direction, its strength, and any points that sit far from the bulk. The Pearson correlation coefficient

\[ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \]

quantifies only the linear component and ranges in $[-1, 1]$. A near zero $r$ does not mean independence: a perfect parabola $y = x^2$ on data symmetric about zero has $r \approx 0$ despite an exact functional relationship. For monotone but nonlinear relationships, Spearman’s rank correlation, which is Pearson’s $r$ applied to the ranks of the data rather than the raw values, is more faithful and is invariant to any monotone transformation of either variable.
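Both claims can be verified on a few lines of synthetic data. In this sketch the helper names are illustrative, and the rank function assumes there are no ties, which holds for the strictly increasing grid used in the Spearman call.

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation coefficient r."""
    return float(np.corrcoef(x, y)[0, 1])


def ranks(v):
    """Ranks 0..n-1; assumes no ties."""
    return np.argsort(np.argsort(v))


def spearman(x, y):
    """Spearman correlation: Pearson's r applied to ranks."""
    return pearson(ranks(x), ranks(y))


x = np.linspace(-3.0, 3.0, 201)  # symmetric about zero
print("parabola  Pearson :", round(pearson(x, x ** 2), 3))
print("exp curve Pearson :", round(pearson(x, np.exp(x)), 3))
print("exp curve Spearman:", round(spearman(x, np.exp(x)), 3))
```

The exponential curve is perfectly monotone, so its rank correlation is one even though its Pearson coefficient falls short of it.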

Overplotting is a frequent hazard with large datasets. When millions of points pile on top of one another, switch to hexagonal binning, two dimensional density contours, or transparency so that the dense regions remain legible.

61.3.2 3.2 Numeric Versus Categorical

To compare a numeric variable across the levels of a category, place distributions side by side. Grouped boxplots, violin plots, or overlaid density curves let you see whether the groups differ in location, in spread, or in shape. A difference in medians is interesting, but so is a difference in variance, which a comparison of means alone would miss. Faceting, drawing one small panel per group on a shared scale, scales gracefully to many categories and keeps comparisons honest because every panel shares the same axes.

61.3.3 3.3 Categorical Versus Categorical

For two categorical variables, the contingency table of joint counts is the foundation. Converting counts to row or column proportions exposes conditional structure, and a mosaic plot or heatmap renders the same information visually. The question is whether the conditional distribution of one variable changes across the levels of the other. Under exact independence, each cell count $O_{ij}$ would match the expected count $E_{ij} = \frac{R_i C_j}{N}$ formed from the row total $R_i$, column total $C_j$, and grand total $N$. The size of the departures, summarized informally by how far the $O_{ij}$ stray from the $E_{ij}$, is exactly what a mosaic plot makes visible. If the conditional distribution does not change, the variables are roughly independent in this sample; if it does, you have found a relationship worth carrying into the modeling stage.
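The expected counts follow directly from the margins. The sketch below works through a hypothetical $2 \times 3$ table (the counts and the row and column meanings are invented for illustration).

```python
import numpy as np

# Hypothetical 2x3 table of joint counts (rows: plan, columns: region)
observed = np.array([[30, 50, 20],
                     [20, 50, 30]], dtype=float)

row_totals = observed.sum(axis=1, keepdims=True)  # R_i
col_totals = observed.sum(axis=0, keepdims=True)  # C_j
grand_total = observed.sum()                      # N

expected = row_totals * col_totals / grand_total  # E_ij = R_i * C_j / N
row_proportions = observed / row_totals           # conditional distribution within each row

print("expected under independence:\n", expected)
print("observed minus expected:\n", observed - expected)
print("row proportions:\n", row_proportions)
```

The departures `observed - expected` always sum to zero along every row and column, so it is their pattern, not their total, that carries the information.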

61.4 4. Multivariate Exploration

61.4.1 4.1 Beyond Two Dimensions

Real effects rarely live in a single pair of variables. Multivariate EDA asks how three or more variables interact, and it requires techniques that compress dimensionality without destroying the patterns you care about. A correlation heatmap across all numeric features gives a fast overview of redundancy and multicollinearity, flagging clusters of variables that move together and may be near duplicates.

The pair plot, a grid of scatter plots for every variable pairing with histograms on the diagonal, is dense but powerful for moderate numbers of features. Adding color for a categorical variable layers a third dimension onto each panel. Beyond a handful of features, however, the grid becomes unreadable (it grows as the square of the feature count) and you turn to projection methods.

61.4.2 4.2 Confounding and Simpson’s Paradox

The defining risk of multivariate data is that a relationship visible in aggregate can reverse within subgroups. Simpson’s paradox is the canonical example: a treatment can appear beneficial overall yet harmful in every stratum once a confounder is controlled. Algebraically the paradox is no contradiction at all, only a reminder that a ratio of sums is not the sum of ratios. For positive quantities,

\[ \frac{a_1}{b_1} > \frac{c_1}{d_1} \quad\text{and}\quad \frac{a_2}{b_2} > \frac{c_2}{d_2} \quad\text{can coexist with}\quad \frac{a_1 + a_2}{b_1 + b_2} < \frac{c_1 + c_2}{d_1 + d_2}, \]

which can happen only when the subgroup sizes $b_i$ and $d_i$ are distributed differently across the two arms. The 1973 Berkeley graduate admissions case is the classic illustration: an apparent bias against women at the university level dissolved once department was taken into account, because women applied disproportionately to more selective departments.

A concrete worked example makes the arithmetic vivid. Two treatments are each tried on patients split into mild and severe cases.

| Case severity | Treatment A | Treatment B |
|---|---|---|
| Mild cases | 81 / 87 (93%) | 234 / 270 (87%) |
| Severe cases | 192 / 263 (73%) | 55 / 80 (69%) |
| Combined | 273 / 350 (78%) | 289 / 350 (83%) |

Treatment A wins in mild cases and in severe cases, yet loses overall. The reversal is driven entirely by the allocation: A was given mostly to the hard, low recovery severe cases, while B was given mostly to the easy mild cases. The combined rate confounds the treatment effect with case severity, and only conditioning separates them.
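The reversal can be reproduced from the counts alone. This sketch copies the figures from the table above, reading each cell as recovered patients over treated patients; the dictionary layout and the helper name `rate` are illustrative choices.

```python
# (recovered, treated) counts copied from the table above
counts = {
    "A": {"mild": (81, 87), "severe": (192, 263)},
    "B": {"mild": (234, 270), "severe": (55, 80)},
}


def rate(recovered, treated):
    return recovered / treated


# Within each stratum, A has the higher rate
for stratum in ("mild", "severe"):
    a = rate(*counts["A"][stratum])
    b = rate(*counts["B"][stratum])
    print(f"{stratum:>8}: A = {a:.3f}, B = {b:.3f}")

# Pooling sums numerators and denominators separately, which is where the weights enter
combined = {}
for arm, strata in counts.items():
    recovered = sum(r for r, _ in strata.values())
    treated = sum(t for _, t in strata.values())
    combined[arm] = rate(recovered, treated)
print(f"combined: A = {combined['A']:.3f}, B = {combined['B']:.3f}")
```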

The practical defense is to condition. Before trusting any bivariate relationship, ask which third variable might be driving both sides of it, and then redraw the relationship within levels of that variable. If the pattern holds across strata, it is more credible; if it flips, you have learned something essential that no aggregate plot could have told you.

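# Check whether an aggregate relationship survives conditioning
# (assumes df has columns "segment", "price", and "demand")
overall = df["price"].corr(df["demand"])
by_segment = df.groupby("segment")[["price", "demand"]].apply(
    lambda g: g["price"].corr(g["demand"])  # correlation within one segment
)
print(overall, by_segment, sep="\n")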

61.4.3 4.3 Dimensionality Reduction for Exploration

When the feature space is wide, linear projection helps. Principal component analysis (PCA) finds orthogonal directions of maximal variance: the first principal component is the unit vector $w_1$ maximizing $\operatorname{Var}(Xw)$, equivalently the leading eigenvector of the covariance matrix, and each subsequent component is the variance maximizing direction orthogonal to those before it. Plotting the first two components, which capture more of the total variance than any other two dimensional linear projection, often reveals clusters, gradients, or outliers that were invisible in any single feature. Inspecting the fraction of variance explained by each component, as displayed in a scree plot, indicates how many directions are needed to summarize the data.
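The eigenvector description translates almost line for line into NumPy. The sketch below is a minimal illustration on synthetic data driven by two latent factors (an assumption made so that the answer is known in advance); in practice a library implementation such as scikit-learn's `PCA` would normally be used instead.

```python
import numpy as np

rng = np.random.default_rng(3)
# Synthetic data: 300 rows, 5 features driven mostly by 2 latent factors
latent = rng.normal(size=(300, 2))
loadings = rng.normal(size=(2, 5))
X = latent @ loadings + 0.1 * rng.normal(size=(300, 5))

Xc = X - X.mean(axis=0)                      # center each feature
cov = np.cov(Xc, rowvar=False)               # 5 x 5 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)       # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]            # reorder from largest to smallest
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = eigvals / eigvals.sum()          # the values shown in a scree plot
scores = Xc @ eigvecs[:, :2]                 # coordinates on the first two components

print("explained variance ratio:", np.round(explained, 3))
```

Features measured on very different scales are usually standardized before this step, since otherwise the largest scale dominates the leading component.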

For nonlinear structure, neighborhood embedding methods such as t-SNE and UMAP can expose clusters that PCA misses, though their layouts must be read with care: they preserve local neighborhoods, so distances and densities between separated clusters in the embedding are not reliable, and the apparent size of a cluster carries little meaning. Used as exploratory aids rather than as conclusions, these projections are valuable for forming hypotheses about latent groups, not for measuring them.

61.5 5. Distributions and Data Quality

61.5.1 5.1 Reading a Distribution Critically

Every distribution tells a story about how the data were produced. A spike at exactly zero may encode a default value or a sentinel for missing data. A hard ceiling suggests censoring, where values above a threshold were clipped or never recorded. A suspicious mode at a conspicuous value such as $99$ or $-1$ often signals a placeholder. Bimodality frequently means two populations were mixed, for example two device types or two time periods, and disentangling them may matter more than any model you fit afterward.

61.5.2 5.2 Outliers and Anomalies

Not every extreme value is an error, and not every error is extreme. An outlier is simply a point that is improbable under your working model of the data, and the right response depends on its cause. A measurement glitch should be corrected or removed; a genuine rare event should be kept and perhaps studied directly.

A robust way to flag candidates is the modified z score built on the median absolute deviation,

\[ \text{MAD} = \operatorname{median}_i\,\lvert x_i - \tilde{x}\rvert, \qquad z_i = \frac{0.6745\,(x_i - \tilde{x})}{\text{MAD}}, \]

where $\tilde{x}$ is the median and the constant $0.6745$ rescales the MAD so that it estimates the standard deviation of a normal distribution. Because both the center and the scale are medians, the very outliers you are hunting cannot inflate the threshold, which is the failure mode of the ordinary z score that divides by a non robust standard deviation. A common convention flags points with $\lvert z_i \rvert > 3.5$. The plain $1.5 \times \text{IQR}$ rule of the boxplot is a simpler alternative with the same robustness logic. The decision to drop a point should always be recorded, because silent deletion is a common source of irreproducible results.
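The following sketch implements the modified z score as defined above and contrasts it with the ordinary z score on a small made-up sample. The readings and the function name are illustrative, and the guard against a zero MAD is an added assumption about how to handle that degenerate case.

```python
import numpy as np


def modified_z_scores(x):
    """Modified z scores based on the median and the MAD."""
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    if mad == 0:
        raise ValueError("MAD is zero; at least half the values equal the median")
    return 0.6745 * (x - med) / mad


readings = np.array([10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 9.7, 55.0])

z_robust = modified_z_scores(readings)
z_ordinary = (readings - readings.mean()) / readings.std(ddof=1)

print("flagged by modified z (> 3.5):", readings[np.abs(z_robust) > 3.5])
print("largest ordinary |z|:", round(float(np.abs(z_ordinary).max()), 2))
```

In a sample this small the ordinary z score cannot reach $3$ at all, because the extreme reading inflates the very standard deviation it is divided by.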

61.5.3 5.3 Duplicates, Types, and Units

Mundane data quality checks repay the time spent. Look for duplicated rows, which inflate apparent sample size and bias estimates. Verify that numeric columns are stored as numbers rather than strings, that dates parse correctly, and that categorical labels are consistent in spelling and case. Confirm units, since a column silently mixing dollars and cents, or kilometers and miles, will quietly poison every downstream calculation. These checks are unglamorous but they catch the errors that derail projects most often. Free, open source tools such as pandas, ydata-profiling, and Great Expectations can automate much of this scan and turn ad hoc checks into a repeatable, versioned suite of data quality assertions.
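Several of these checks take one line each in pandas. The sketch below runs them on a hypothetical raw extract; the column names and values are invented for illustration, and unit consistency is deliberately left out because it usually requires domain knowledge rather than code.

```python
import pandas as pd

# Hypothetical raw extract; column names and values are illustrative only
raw = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "amount": ["19.99", "5.00", "5.00", "n/a"],
    "placed_at": ["2024-01-05", "2024-01-06", "2024-01-06", "not recorded"],
    "status": ["Paid", "paid", "paid", "PAID "],
})

n_duplicate_rows = int(raw.duplicated().sum())  # exact repeats of an earlier row

# errors="coerce" turns unparseable entries into NaN / NaT so they can be counted
amount = pd.to_numeric(raw["amount"], errors="coerce")
placed_at = pd.to_datetime(raw["placed_at"], errors="coerce")
status = raw["status"].str.strip().str.lower()  # normalize whitespace and case

print("duplicate rows:", n_duplicate_rows)
print("unparseable amounts:", int(amount.isna().sum()))
print("unparseable dates:", int(placed_at.isna().sum()))
print("status labels before/after cleaning:", raw["status"].nunique(), status.nunique())
```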

61.6 6. Missingness Patterns

61.6.1 6.1 The Three Mechanisms

Missing data are not all alike, and Rubin’s taxonomy provides the vocabulary. Let $X$ denote the complete data, partitioned into observed and missing parts $X_{\text{obs}}$ and $X_{\text{mis}}$, and let $M$ be the indicator matrix of which entries are missing. The mechanism is characterized by how the distribution of $M$ depends on the data.

Missing completely at random (MCAR): $P(M \mid X) = P(M)$. Missingness is independent of all values, observed or not.
Missing at random (MAR): $P(M \mid X) = P(M \mid X_{\text{obs}})$. Missingness depends only on observed variables, for example when older respondents skip a question and age is recorded.
Missing not at random (MNAR): $P(M \mid X)$ depends on $X_{\text{mis}}$ even after conditioning on $X_{\text{obs}}$, as when high earners decline to report income precisely because it is high.

The mechanism matters because it dictates which handling strategies are valid: listwise deletion can be unbiased under MCAR but actively misleading under MNAR, and standard imputation methods assume MAR.
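A small simulation makes the consequences concrete. The sketch below generates fully observed synthetic data, deletes income values under each of the three mechanisms, and compares the complete case mean with the true mean. All numbers, including the missingness probabilities, are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 20_000
age = rng.uniform(20, 70, size=n)
income = 20_000 + 800 * age + rng.normal(0, 10_000, size=n)  # synthetic, fully observed


def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))


# Probability that income is missing under each mechanism
p_missing = {
    "MCAR": np.full(n, 0.3),                                   # ignores all values
    "MAR": sigmoid((age - 45) / 8),                            # depends on observed age
    "MNAR": sigmoid((income - income.mean()) / income.std()),  # depends on income itself
}

true_mean = income.mean()
bias = {}
for name, p in p_missing.items():
    missing = rng.random(n) < p
    bias[name] = income[~missing].mean() - true_mean
    print(f"{name}: complete case mean minus true mean = {bias[name]:,.0f}")
```

In this setup the complete case mean is also biased under MAR, because age drives both income and missingness. The difference from MNAR is that the MAR bias can in principle be removed by adjusting for the observed age, whereas the MNAR bias cannot be corrected from the observed data alone.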

61.6.2 6.2 Visualizing Missingness

Treat missingness as data to be explored in its own right. A bar chart of the fraction missing per column shows where the problem concentrates. A matrix plot of the missingness pattern, with rows as records and columns as fields, reveals whether absences cluster together, which suggests a shared cause such as a form section that was added later. A correlation of missingness indicators tells you whether two fields tend to go missing in tandem, a sign of a shared cause worth tracing, although co-occurrence alone does not contradict MCAR, which requires only that missingness be unrelated to the data values.

# Quantify and locate missingness
df.isna().mean().sort_values(ascending=False)
df.isna().corr()  # do absences co-occur?

A useful diagnostic for ruling out MCAR is to compare the observed values of one column across records where a second column is present versus absent. Under MCAR those two conditional distributions should be indistinguishable, so a clear difference is evidence against it. The free, open source missingno library renders the matrix and correlation views directly for pandas data frames.
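The present versus absent comparison is a one line group-by. The sketch below builds a synthetic survey in which income is more often missing for older respondents (MAR by construction, an assumption made so the diagnostic has something to find) and then compares the observed `age` column across the two groups.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
n = 5_000
# Synthetic survey: income is more often missing for older respondents
age = rng.uniform(20, 70, size=n)
income = 20_000 + 800 * age + rng.normal(0, 10_000, size=n)
income_missing = rng.random(n) < np.where(age > 50, 0.6, 0.1)
df = pd.DataFrame({"age": age, "income": np.where(income_missing, np.nan, income)})

# Compare the observed column (age) across rows where income is absent vs present
comparison = df.groupby(df["income"].isna())["age"].describe()
comparison.index = comparison.index.map({False: "income present", True: "income absent"})
print(comparison[["count", "mean", "50%"]])
```

A gap of this size between the two groups is evidence against MCAR; a formal test could follow, but for exploration the side by side summary is usually enough.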

61.6.3 6.3 From Pattern to Strategy

The point of the missingness exploration is to choose a defensible strategy, not to fill blanks reflexively. If a column is missing for a structural reason, an explicit missing indicator may carry real signal. If missingness appears MAR, multiple imputation or model based imputation can recover information that listwise deletion would throw away. If you suspect MNAR, you may need to model the missingness mechanism directly or, at minimum, flag the resulting estimates as sensitive to untestable assumptions. EDA cannot prove the mechanism, since MAR and MNAR are formally indistinguishable from the observed data alone, but it can rule out the naive defaults and force an honest choice.

61.7 7. From Exploration to Hypotheses

61.7.1 7.1 Turning Observations into Testable Claims

The output of EDA is a short list of concrete, falsifiable hypotheses, each tied to a specific observation. A vague note that “price seems related to demand” is less useful than a sharpened claim such as “within each customer segment, demand declines roughly linearly with log price, and the slope is steeper for premium segments.” The sharpened version names the variables, the functional form, the conditioning structure, and the expected direction, which makes it something a later analysis can confirm or refute.

61.7.2 7.2 Guarding Against Self Deception

The danger throughout is that the analyst will see patterns that are artifacts of noise, of selection, or of the very flexibility that makes exploration powerful. Several habits guard against this. Hold out data so that promising patterns can be checked on records that did not suggest them. Prefer effects that are large, stable across reasonable subsets, and mechanistically plausible over those that are merely statistically striking. Treat any relationship discovered by extensive searching as a candidate rather than a finding. The freedom to look everywhere is exactly why exploratory conclusions must be confirmed before they are believed.

61.7.3 7.3 Documenting the Exploration

A disciplined EDA leaves a written trail: the questions asked, the views generated, the surprises encountered, the decisions made about outliers and missing values, and the hypotheses carried forward. This record is what makes the analysis reproducible and what lets a collaborator understand why the eventual model was built the way it was. The data dictionary you assemble, the transformations you settle on, and the open questions you flag become the bridge between exploration and the modeling chapters that follow.

61.8 8. When to Use and Common Pitfalls

EDA belongs at the start of every analysis that touches unfamiliar data, after any major change to data collection, and whenever a model behaves in a way you cannot explain. It is time well spent precisely when the cost of a hidden defect is high, which is to say almost always. The main way to overspend is to keep generating plots after the questions have stopped producing surprises; the loop should end when conditioning and re-slicing no longer change your picture of the data.

A few pitfalls recur often enough to name:

Attaching formal p values to patterns the same data suggested, which the garden of forking paths makes meaningless.
Trusting summary statistics without a plot, the failure that Anscombe’s quartet was built to expose.
Reading a near zero correlation as independence, when it only rules out a linear relationship.
Accepting an aggregate relationship without conditioning on plausible confounders, leaving the door open to Simpson’s paradox.
Deleting rows with missing values by default, which silently assumes MCAR and can bias every estimate that follows.
Over interpreting the geometry of a t-SNE or UMAP layout, whose between cluster distances are not meaningful.

61.9 9. Summary

Exploratory data analysis is the stage where you earn the right to model. By examining variables one at a time, in pairs, and in combination, by reading distributions and missingness patterns critically, and by conditioning on potential confounders, you build an accurate picture of the data generating process and surface the problems that would otherwise corrupt downstream work. The deliverable is not a gallery of plots but a small set of sharpened, conditioned hypotheses, accompanied by an honest account of the data’s flaws and the discipline to confirm exploratory findings before trusting them.

61.10 References

Tukey, J. W. (1977). Exploratory Data Analysis. Addison-Wesley. https://archive.org/details/exploratorydataa0000tuke
Wickham, H., Cetinkaya-Rundel, M., and Grolemund, G. (2023). R for Data Science, 2nd ed. O’Reilly. https://r4ds.hadley.nz/
Anscombe, F. J. (1973). Graphs in Statistical Analysis. The American Statistician, 27(1), 17-21. https://doi.org/10.1080/00031305.1973.10478966
Matejka, J. and Fitzmaurice, G. (2017). Same Stats, Different Graphs (the Datasaurus Dozen). CHI 2017. https://doi.org/10.1145/3025453.3025912
Gelman, A. and Loken, E. (2014). The Garden of Forking Paths. http://www.stat.columbia.edu/~gelman/research/unpublished/forking.pdf
Bickel, P. J., Hammel, E. A., and O’Connell, J. W. (1975). Sex Bias in Graduate Admissions: Data from Berkeley. Science, 187(4175), 398-404. https://doi.org/10.1126/science.187.4175.398
Box, G. E. P. and Cox, D. R. (1964). An Analysis of Transformations. Journal of the Royal Statistical Society, Series B, 26(2), 211-252. https://doi.org/10.1111/j.2517-6161.1964.tb00553.x
Iglewicz, B. and Hoaglin, D. C. (1993). How to Detect and Handle Outliers. ASQC Quality Press. https://asq.org/quality-press/display-item?item=E0498
Rubin, D. B. (1976). Inference and Missing Data. Biometrika, 63(3), 581-592. https://doi.org/10.1093/biomet/63.3.581
van der Maaten, L. and Hinton, G. (2008). Visualizing Data using t-SNE. JMLR, 9, 2579-2605. https://www.jmlr.org/papers/v9/vandermaaten08a.html
McInnes, L., Healy, J., and Melville, J. (2018). UMAP: Uniform Manifold Approximation and Projection. https://arxiv.org/abs/1802.03426
VanderPlas, J. (2016). Python Data Science Handbook. O’Reilly. https://jakevdp.github.io/PythonDataScienceHandbook/

# Exploratory Data Analysis Exploratory data analysis (EDA) is the disciplined practice of interrogating a dataset before committing to a model. It is less a fixed recipe than a stance, one that treats every column, distribution, and relationship as a claim that must be examined rather than assumed. John Tukey, who coined the term in his 1977 book, framed EDA as detective work: the analyst looks for clues, follows hunches, and remains willing to be surprised. This chapter develops the EDA mindset and then works through univariate, bivariate, and multivariate exploration, the assessment of distributions and missingness, and the formation of hypotheses that will guide later modeling. ## 1. The EDA Mindset ### 1.1 Why Explore Before Modeling Modeling is an act of compression. A learned function reduces a complex dataset to a small set of parameters and a single objective. If the data violate the assumptions baked into that objective, the compression discards exactly the structure that matters. EDA exists to surface those violations early, while they are cheap to fix. A skewed target that should be log transformed, a leakage column that encodes the label, a cluster of duplicated rows, a subgroup with a reversed sign of effect: each of these is invisible to a cross validation score until it has already corrupted the conclusions. The central goal is to build an accurate mental model of the data generating process. You want to know how the data were collected, what each variable means, which values are possible, and where the recording could have gone wrong. This understanding is what lets you distinguish signal from artifact later. A correlation of $0.9$ is exciting only until you discover that both variables were derived from the same source measurement. ### 1.2 Confirmatory Versus Exploratory Analysis It helps to separate two modes of inquiry. Confirmatory data analysis tests a prespecified hypothesis with a fixed protocol and reports a calibrated error rate. 
Exploratory analysis generates hypotheses by searching the data for patterns. The two are complementary, but conflating them is dangerous. If you look at the same data that suggested a hypothesis and then test that hypothesis on those same data, the resulting p value is not interpretable. This is the multiple comparisons problem in disguise, sometimes called the garden of forking paths. The mechanism is worth stating precisely. Suppose you scan $m$ independent candidate relationships, each genuinely null, and test each at level $\alpha$. The probability that at least one crosses the threshold by chance is $$ P(\text{at least one false positive}) = 1 - (1 - \alpha)^m, $$ which for $\alpha = 0.05$ and $m = 20$ already exceeds $0.64$. Exploration routinely examines far more than twenty views, often implicitly, because each choice of subset, transform, or binning is another comparison. The point is not that exploration is illegitimate but that the significance attached to whatever it surfaces is not the nominal $\alpha$. The practical discipline is to keep the roles of the data separate. Reserve a holdout that you do not look at during exploration, or at minimum record which patterns were discovered exploratorily so that any later inferential claim about them is treated as tentative. EDA earns its freedom to look everywhere precisely by refusing to attach formal significance to what it finds. ### 1.3 An Iterative Loop EDA proceeds as a loop: form a question, generate a view that answers it, and let the answer raise the next question. Wickham, Cetinkaya-Rundel, and Grolemund describe this as a cycle of transforming, visualizing, and modeling in service of understanding. The loop terminates not when you run out of plots but when your questions stop producing surprises. A useful habit is to write each question down before plotting, because an unwritten question is easy to retrofit to whatever the plot happens to show. 
```{mermaid} %%| label: fig-eda-loop %%| fig-cap: "The exploratory loop. Each answer either raises a sharper question or, once surprises run out, hardens into a documented hypothesis." flowchart LR Q["Ask a question"] --> V["Generate a view"] V --> A["Read the answer"] A -->|"new surprise"| Q A -->|"no surprise"| H["Record a hypothesis"] ``` ## 2. Univariate Exploration ### 2.1 Summary Statistics and Their Limits Begin one variable at a time. For a numeric variable, the basic summary is a location measure, a spread measure, and a shape description. The mean $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and standard deviation $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$ are familiar but fragile, because both are sensitive to outliers. Fragility can be made precise through the breakdown point, the smallest fraction of observations that can be moved to arbitrary values before the estimator itself becomes arbitrary. The mean has a breakdown point of $0$, since a single point sent to infinity drags it without bound. The median, by contrast, has a breakdown point of $1/2$: up to half the data can be corrupted before it loses meaning. The interquartile range (IQR), the distance between the $25$th and $75$th percentiles, inherits the same robustness and gives a more stable picture of spread for skewed or heavy tailed data. Summary statistics alone are never sufficient. Anscombe's quartet and the more recent Datasaurus dozen demonstrate that radically different datasets can share identical means, variances, and correlations to two decimal places while differing wildly in shape. The lesson is permanent: always plot the distribution rather than trusting a five number summary to characterize it. ### 2.2 Visualizing a Single Distribution The histogram is the workhorse for numeric variables, but its appearance depends heavily on bin width. Too few bins hide structure such as bimodality; too many bins turn the plot into noise. 
Several rules of thumb exist for a starting bin count, including Sturges' rule of $\lceil \log_2 n + 1 \rceil$ bins and the Freedman-Diaconis width $2 \cdot \text{IQR} \cdot n^{-1/3}$, which adapts to spread and is more robust for skewed data. Treat any such rule as a default to override once the eye sees structure, not as a final answer.

A kernel density estimate smooths the histogram into a continuous curve,

$$
\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right),
$$

where $K$ is a kernel (often Gaussian) and the bandwidth $h$ plays the role bin width plays for a histogram. Small $h$ undersmooths and shows spurious wiggles; large $h$ oversmooths and erases real modes.

A boxplot compresses the distribution into quartiles and flags outliers using the $1.5 \times \text{IQR}$ rule, while the empirical cumulative distribution function (ECDF),

$$
\hat{F}(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{x_i \le x\},
$$

avoids binning and bandwidth choices entirely. The ECDF is excellent for reading off quantiles directly and for comparing groups, since two ECDFs can be overlaid without the visual artifacts that overlapping histograms introduce.

```python
# Inspect one numeric column from several angles
df["amount"].describe()
df["amount"].plot(kind="hist", bins=50)
df["amount"].plot(kind="box")
```

For categorical variables, the analogue is a frequency table or bar chart. Watch for rare categories, near constant columns that carry almost no information, and high cardinality fields such as identifiers that masquerade as features.

### 2.3 Shape, Skew, and Transformation

The shape of a distribution suggests how to treat it. Skewness measures asymmetry,

$$
g_1 = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^3}{\left(\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\right)^{3/2}},
$$

with $g_1 > 0$ indicating a long right tail, while kurtosis measures tail heaviness relative to the normal.
Right skewed quantities such as income, transaction size, or counts often benefit from a logarithmic or square root transform, which can stabilize variance and make linear relationships visible. More generally, the Box-Cox family

$$
x^{(\lambda)} =
\begin{cases}
\dfrac{x^{\lambda} - 1}{\lambda}, & \lambda \neq 0,\\[2mm]
\ln x, & \lambda = 0,
\end{cases}
\qquad x > 0,
$$

interpolates smoothly between the log ($\lambda = 0$), square root ($\lambda = 1/2$), and identity ($\lambda = 1$) transforms, each up to a shift and rescaling, and the value of $\lambda$ that makes the transformed data most nearly normal can be chosen by maximum likelihood.

Both skewness and kurtosis are worth computing, but the eye reading a histogram usually decides faster. When a variable spans several orders of magnitude, plotting it on a log scale frequently turns an uninformative spike near zero into a clean, interpretable curve.

## 3. Bivariate Exploration

### 3.1 Numeric Versus Numeric

With two numeric variables, the scatter plot is the primary tool. It reveals the form of the relationship (linear, curved, or absent), its direction, its strength, and any points that sit far from the bulk. The Pearson correlation coefficient

$$
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
$$

quantifies only the linear component and ranges in $[-1, 1]$. A near zero $r$ does not mean independence: a perfect parabola $y = x^2$ on data symmetric about zero has $r \approx 0$ despite an exact functional relationship. For monotone but nonlinear relationships, Spearman's rank correlation, which is Pearson's $r$ applied to the ranks of the data rather than the raw values, is more faithful and is invariant to any strictly increasing transformation of either variable.

Overplotting is a frequent hazard with large datasets.
When millions of points pile on top of one another, switch to hexagonal binning, two dimensional density contours, or transparency so that the dense regions remain legible.

### 3.2 Numeric Versus Categorical

To compare a numeric variable across the levels of a category, place distributions side by side. Grouped boxplots, violin plots, or overlaid density curves let you see whether the groups differ in location, in spread, or in shape. A difference in medians is interesting, but so is a difference in variance, which a comparison of means alone would miss. Faceting, drawing one small panel per group on a shared scale, scales gracefully to many categories and keeps comparisons honest because every panel shares the same axes.

### 3.3 Categorical Versus Categorical

For two categorical variables, the contingency table of joint counts is the foundation. Converting counts to row or column proportions exposes conditional structure, and a mosaic plot or heatmap renders the same information visually. The question is whether the conditional distribution of one variable changes across the levels of the other.

Under exact independence, each cell count $O_{ij}$ would match the expected count $E_{ij} = \frac{R_i C_j}{N}$ formed from the row total $R_i$, column total $C_j$, and grand total $N$. The size of the departures, summarized informally by how far the $O_{ij}$ stray from the $E_{ij}$, is exactly what a mosaic plot makes visible. If the conditional distribution does not change, the variables are roughly independent in this sample; if it does, you have found a relationship worth carrying into the modeling stage.

## 4. Multivariate Exploration

### 4.1 Beyond Two Dimensions

Real effects rarely live in a single pair of variables. Multivariate EDA asks how three or more variables interact, and it requires techniques that compress dimensionality without destroying the patterns you care about.
A correlation heatmap across all numeric features gives a fast overview of redundancy and multicollinearity, flagging clusters of variables that move together and may be near duplicates.

The pair plot, a grid of scatter plots for every variable pairing with histograms on the diagonal, is dense but powerful for moderate numbers of features. Adding color for a categorical variable layers a third dimension onto each panel. Beyond a handful of features, however, the grid becomes unreadable (it grows as the square of the feature count) and you turn to projection methods.

### 4.2 Confounding and Simpson's Paradox

The defining risk of multivariate data is that a relationship visible in aggregate can reverse within subgroups. Simpson's paradox is the canonical example: a treatment can appear beneficial overall yet harmful in every stratum once a confounder is controlled. Algebraically the paradox is no contradiction at all, only a reminder that a ratio of sums is not the sum of ratios. For positive quantities,

$$
\frac{a_1}{b_1} > \frac{c_1}{d_1}
\quad\text{and}\quad
\frac{a_2}{b_2} > \frac{c_2}{d_2}
\quad\text{can coexist with}\quad
\frac{a_1 + a_2}{b_1 + b_2} < \frac{c_1 + c_2}{d_1 + d_2},
$$

which can happen only when the subgroup weights $b_i$ and $d_i$ are distributed differently across the two arms. The 1973 Berkeley graduate admissions case is the classic illustration: an apparent bias against women at the university level dissolved once department was taken into account, because women applied disproportionately to more selective departments.

A concrete worked example makes the arithmetic vivid. Two treatments are each tried on patients split into mild and severe cases. Each cell gives recovered / treated patients, with the recovery rate in parentheses.

| Case severity | Treatment A | Treatment B |
|---|---|---|
| Mild cases | 81 / 87 (93%) | 234 / 270 (87%) |
| Severe cases | 192 / 263 (73%) | 55 / 80 (69%) |
| Combined | 273 / 350 (78%) | 289 / 350 (83%) |

Treatment A wins in mild cases and in severe cases, yet loses overall.
The reversal is driven entirely by the allocation: A was given mostly to the hard, low recovery severe cases, while B was given mostly to the easy mild cases. The combined rate confounds the treatment effect with case severity, and only conditioning separates them.

The practical defense is to condition. Before trusting any bivariate relationship, ask which third variable might be driving both sides of it, and then redraw the relationship within levels of that variable. If the pattern holds across strata, it is more credible; if it flips, you have learned something essential that no aggregate plot could have told you.

```python
# Check whether an aggregate relationship survives conditioning
# (df is assumed to have "segment", "price", and "demand" columns)
df["price"].corr(df["demand"])  # pooled correlation, for comparison
df.groupby("segment")[["price", "demand"]].apply(
    lambda g: g["price"].corr(g["demand"])  # correlation within each segment
)
```

### 4.3 Dimensionality Reduction for Exploration

When the feature space is wide, linear projection helps. Principal component analysis (PCA) finds orthogonal directions of maximal variance: the first principal component is the unit vector $w_1$ maximizing $\operatorname{Var}(Xw)$, equivalently the leading eigenvector of the covariance matrix, and each subsequent component is the variance maximizing direction orthogonal to those before it. Plotting the first two components, which together capture the largest share of total variance available to any two dimensional linear projection, often reveals clusters, gradients, or outliers that were invisible in any single feature. Inspecting the fraction of variance explained by each component, the scree plot, indicates how many directions are needed to summarize the data.

For nonlinear structure, neighborhood embedding methods such as t-SNE and UMAP can expose clusters that PCA misses, though their layouts must be read with care: they preserve local neighborhoods, so distances and densities between separated clusters in the embedding are not reliable, and the apparent size of a cluster carries little meaning.
Used as exploratory aids rather than as conclusions, these projections are valuable for forming hypotheses about latent groups, not for measuring them.

## 5. Distributions and Data Quality

### 5.1 Reading a Distribution Critically

Every distribution tells a story about how the data were produced. A spike at exactly zero may encode a default value or a sentinel for missing data. A hard ceiling suggests censoring, where values above a threshold were clipped or never recorded. A suspicious mode at a conspicuous value such as $99$ or $-1$ often signals a placeholder. Bimodality frequently means two populations were mixed, for example two device types or two time periods, and disentangling them may matter more than any model you fit afterward.

### 5.2 Outliers and Anomalies

Not every extreme value is an error, and not every error is extreme. An outlier is simply a point that is improbable under your working model of the data, and the right response depends on its cause. A measurement glitch should be corrected or removed; a genuine rare event should be kept and perhaps studied directly.

A robust way to flag candidates is the modified z score built on the median absolute deviation,

$$
\text{MAD} = \operatorname{median}_i\,\lvert x_i - \tilde{x}\rvert,
\qquad
z_i = \frac{0.6745\,(x_i - \tilde{x})}{\text{MAD}},
$$

where $\tilde{x}$ is the median and the constant $0.6745$ rescales the MAD so that it estimates the standard deviation of a normal distribution. Because both the center and the scale are medians, the very outliers you are hunting cannot inflate the threshold, which is the failure mode of the ordinary z score that divides by a non robust standard deviation. A common convention flags points with $\lvert z_i \rvert > 3.5$. The plain $1.5 \times \text{IQR}$ rule of the boxplot is a simpler alternative with the same robustness logic.

The decision to drop a point should always be recorded, because silent deletion is a common source of irreproducible results.
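The modified z score rule above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `modified_z_scores`, the sample readings, and the choice to return NaN when the MAD is zero are all assumptions made for this example.

```python
import numpy as np

def modified_z_scores(x):
    """Modified z scores built on the median and the median absolute deviation."""
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    if mad == 0:
        # Assumption for this sketch: when more than half the values are
        # identical the MAD is zero and the score is undefined, so return NaN.
        return np.full_like(x, np.nan)
    return 0.6745 * (x - med) / mad

# Hypothetical sample: nine ordinary readings and one suspicious one
values = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 25.0])
scores = modified_z_scores(values)
flagged = values[np.abs(scores) > 3.5]  # conventional cutoff from the text
print(flagged)
```

Flagged points are candidates for inspection, not automatic deletions; writing the flag to a column, rather than filtering rows in place, keeps the decision visible and recorded.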
### 5.3 Duplicates, Types, and Units

Mundane data quality checks repay the time spent. Look for duplicated rows, which inflate apparent sample size and bias estimates. Verify that numeric columns are stored as numbers rather than strings, that dates parse correctly, and that categorical labels are consistent in spelling and case. Confirm units, since a column silently mixing dollars and cents, or kilometers and miles, will quietly poison every downstream calculation.

These checks are unglamorous but they catch the errors that derail projects most often. Free, open source tools such as pandas, ydata-profiling, and Great Expectations can automate much of this scan and turn ad hoc checks into a repeatable, versioned suite of data quality assertions.

## 6. Missingness Patterns

### 6.1 The Three Mechanisms

Missing data are not all alike, and Rubin's taxonomy provides the vocabulary. Let $X$ denote the complete data, partitioned into observed and missing parts $X_{\text{obs}}$ and $X_{\text{mis}}$, and let $M$ be the indicator matrix of which entries are missing. The mechanism is characterized by how the distribution of $M$ depends on the data.

- Missing completely at random (MCAR): $P(M \mid X) = P(M)$. Missingness is independent of all values, observed or not.
- Missing at random (MAR): $P(M \mid X) = P(M \mid X_{\text{obs}})$. Missingness depends only on observed variables, for example when older respondents skip a question and age is recorded.
- Missing not at random (MNAR): $P(M \mid X)$ depends on $X_{\text{mis}}$ even after conditioning on $X_{\text{obs}}$, as when high earners decline to report income precisely because it is high.

The mechanism matters because it dictates which handling strategies are valid: listwise deletion can be unbiased under MCAR but actively misleading under MNAR, and standard imputation methods assume MAR.

### 6.2 Visualizing Missingness

Treat missingness as data to be explored in its own right.
A bar chart of the fraction missing per column shows where the problem concentrates. A matrix plot of the missingness pattern, with rows as records and columns as fields, reveals whether absences cluster together, which suggests a shared cause such as a form section that was added later. A correlation of missingness indicators tells you whether two fields tend to go missing in tandem, which points to a common cause and is a prompt to check whether that cause is related to the data values.

```python
# Quantify and locate missingness
df.isna().mean().sort_values(ascending=False)
df.isna().corr()  # do absences co-occur?
```

A useful diagnostic for ruling out MCAR is to compare the observed values of one column across records where a second column is present versus absent. Under MCAR those two conditional distributions should be indistinguishable, so a clear difference is evidence against it. The free, open source missingno library renders the matrix and correlation views directly for pandas data frames.

### 6.3 From Pattern to Strategy

The point of the missingness exploration is to choose a defensible strategy, not to fill blanks reflexively. If a column is missing for a structural reason, an explicit missing indicator may carry real signal. If missingness appears MAR, multiple imputation or model based imputation can recover information that listwise deletion would throw away. If you suspect MNAR, you may need to model the missingness mechanism directly or, at minimum, flag the resulting estimates as sensitive to untestable assumptions. EDA cannot prove the mechanism, since MAR and MNAR are formally indistinguishable from the observed data alone, but it can rule out the naive defaults and force an honest choice.

## 7. From Exploration to Hypotheses

### 7.1 Turning Observations into Testable Claims

The output of EDA is a short list of concrete, falsifiable hypotheses, each tied to a specific observation.
A vague note that "price seems related to demand" is less useful than a sharpened claim such as "within each customer segment, demand declines roughly linearly with log price, and the slope is steeper for premium segments." The sharpened version names the variables, the functional form, the conditioning structure, and the expected direction, which makes it something a later analysis can confirm or refute.

### 7.2 Guarding Against Self Deception

The danger throughout is that the analyst will see patterns that are artifacts of noise, of selection, or of the very flexibility that makes exploration powerful. Several habits guard against this. Hold out data so that promising patterns can be checked on records that did not suggest them. Prefer effects that are large, stable across reasonable subsets, and mechanistically plausible over those that are merely statistically striking. Treat any relationship discovered by extensive searching as a candidate rather than a finding. The freedom to look everywhere is exactly why exploratory conclusions must be confirmed before they are believed.

### 7.3 Documenting the Exploration

A disciplined EDA leaves a written trail: the questions asked, the views generated, the surprises encountered, the decisions made about outliers and missing values, and the hypotheses carried forward. This record is what makes the analysis reproducible and what lets a collaborator understand why the eventual model was built the way it was. The data dictionary you assemble, the transformations you settle on, and the open questions you flag become the bridge between exploration and the modeling chapters that follow.

## 8. When to Use and Common Pitfalls

EDA belongs at the start of every analysis that touches unfamiliar data, after any major change to data collection, and whenever a model behaves in a way you cannot explain. It is time well spent precisely when the cost of a hidden defect is high, which is to say almost always.
The main way to overspend is to keep generating plots after the questions have stopped producing surprises; the loop should end when conditioning and re-slicing no longer change your picture of the data.

A few pitfalls recur often enough to name:

- Attaching formal p values to patterns the same data suggested, which the garden of forking paths makes meaningless.
- Trusting summary statistics without a plot, the failure that Anscombe's quartet was built to expose.
- Reading a near zero correlation as independence, when it only rules out a linear relationship.
- Accepting an aggregate relationship without conditioning on plausible confounders, leaving the door open to Simpson's paradox.
- Deleting rows with missing values by default, which silently assumes MCAR and can bias every estimate that follows.
- Over interpreting the geometry of a t-SNE or UMAP layout, whose between cluster distances are not meaningful.

## 9. Summary

Exploratory data analysis is the stage where you earn the right to model. By examining variables one at a time, in pairs, and in combination, by reading distributions and missingness patterns critically, and by conditioning on potential confounders, you build an accurate picture of the data generating process and surface the problems that would otherwise corrupt downstream work. The deliverable is not a gallery of plots but a small set of sharpened, conditioned hypotheses, accompanied by an honest account of the data's flaws and the discipline to confirm exploratory findings before trusting them.

## References

1. Tukey, J. W. (1977). Exploratory Data Analysis. Addison-Wesley. https://archive.org/details/exploratorydataa0000tuke
2. Wickham, H., Cetinkaya-Rundel, M., and Grolemund, G. (2023). R for Data Science, 2nd ed. O'Reilly. https://r4ds.hadley.nz/
3. Anscombe, F. J. (1973). Graphs in Statistical Analysis. The American Statistician, 27(1), 17-21. https://doi.org/10.1080/00031305.1973.10478966
4. Matejka, J. and Fitzmaurice, G. (2017).
Same Stats, Different Graphs (the Datasaurus Dozen). CHI 2017. https://doi.org/10.1145/3025453.3025912
5. Gelman, A. and Loken, E. (2014). The Garden of Forking Paths. http://www.stat.columbia.edu/~gelman/research/unpublished/forking.pdf
6. Bickel, P. J., Hammel, E. A., and O'Connell, J. W. (1975). Sex Bias in Graduate Admissions: Data from Berkeley. Science, 187(4175), 398-404. https://doi.org/10.1126/science.187.4175.398
7. Box, G. E. P. and Cox, D. R. (1964). An Analysis of Transformations. Journal of the Royal Statistical Society, Series B, 26(2), 211-252. https://doi.org/10.1111/j.2517-6161.1964.tb00553.x
8. Iglewicz, B. and Hoaglin, D. C. (1993). How to Detect and Handle Outliers. ASQC Quality Press. https://asq.org/quality-press/display-item?item=E0498
9. Rubin, D. B. (1976). Inference and Missing Data. Biometrika, 63(3), 581-592. https://doi.org/10.1093/biomet/63.3.581
10. van der Maaten, L. and Hinton, G. (2008). Visualizing Data using t-SNE. JMLR, 9, 2579-2605. https://www.jmlr.org/papers/v9/vandermaaten08a.html
11. McInnes, L., Healy, J., and Melville, J. (2018). UMAP: Uniform Manifold Approximation and Projection. https://arxiv.org/abs/1802.03426
12. VanderPlas, J. (2016). Python Data Science Handbook. O'Reilly. https://jakevdp.github.io/PythonDataScienceHandbook/