61 Exploratory Data Analysis
Exploratory data analysis (EDA) is the disciplined practice of interrogating a dataset before committing to a model. It is less a fixed recipe than a stance, one that treats every column, distribution, and relationship as a claim that must be examined rather than assumed. John Tukey, whose 1977 book gave the field its name, framed EDA as detective work: the analyst looks for clues, follows hunches, and remains willing to be surprised. This chapter develops the EDA mindset and then works through univariate, bivariate, and multivariate exploration, the assessment of distributions and missingness, and the formation of hypotheses that will guide later modeling.
61.1 1. The EDA Mindset
61.1.1 1.1 Why Explore Before Modeling
Modeling is an act of compression. A learned function reduces a complex dataset to a small set of parameters and a single objective. If the data violate the assumptions baked into that objective, the compression discards exactly the structure that matters. EDA exists to surface those violations early, while they are cheap to fix. A skewed target that should be log transformed, a leakage column that encodes the label, a cluster of duplicated rows, a subgroup with a reversed sign of effect: each of these is invisible to a cross validation score until it has already corrupted the conclusions.
The central goal is to build an accurate mental model of the data generating process. You want to know how the data were collected, what each variable means, which values are possible, and where the recording could have gone wrong. This understanding is what lets you distinguish signal from artifact later. A correlation of \(0.9\) is exciting only until you discover that both variables were derived from the same source measurement.
61.1.2 1.2 Confirmatory Versus Exploratory Analysis
It helps to separate two modes of inquiry. Confirmatory data analysis tests a prespecified hypothesis with a fixed protocol and reports a calibrated error rate. Exploratory analysis generates hypotheses by searching the data for patterns. The two are complementary, but conflating them is dangerous. If you look at the same data that suggested a hypothesis and then test that hypothesis on those same data, the resulting p value is not interpretable. This is the multiple comparisons problem in disguise, sometimes called the garden of forking paths.
The practical discipline is to keep the roles of the data separate. Reserve a holdout that you do not look at during exploration, or at minimum record which patterns were discovered exploratorily so that any later inferential claim about them is treated as tentative. EDA earns its freedom to look everywhere precisely by refusing to attach formal significance to what it finds.
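A minimal sketch of the holdout discipline, assuming the data fit in a pandas DataFrame. The toy frame, the 20% fraction, and the names `explore` and `holdout` are illustrative choices, not prescriptions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(size=1000), "y": rng.normal(size=1000)})

# Set aside 20% of rows before any plotting; the seed makes the split reproducible.
holdout = df.sample(frac=0.2, random_state=42)
explore = df.drop(index=holdout.index)

# All exploratory work uses `explore`; `holdout` stays untouched until a
# hypothesis has been written down and is ready to be checked.
print(len(explore), len(holdout))
```

For time-ordered or grouped data a purely random split may leak information between the two parts, so splitting by date or by group is often the safer choice.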
61.1.3 1.3 An Iterative Loop
EDA proceeds as a loop: form a question, generate a view that answers it, and let the answer raise the next question. Wickham and Grolemund describe this as a cycle of transforming, visualizing, and modeling in service of understanding. The loop terminates not when you run out of plots but when your questions stop producing surprises. A useful habit is to write each question down before plotting, because an unwritten question is easy to retrofit to whatever the plot happens to show.
61.2 2. Univariate Exploration
61.2.1 2.1 Summary Statistics and Their Limits
Begin one variable at a time. For a numeric variable, the basic summary is a location measure, a spread measure, and a shape description. The mean \(\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i\) and standard deviation \(s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}\) are familiar but fragile, because both are sensitive to outliers. The median and the interquartile range (IQR), the distance between the \(25\)th and \(75\)th percentiles, give a more robust picture for skewed or heavy tailed data.
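The fragility of the mean and standard deviation can be seen with a small made-up sample. The sketch below adds one recording error to ten ordinary values and compares the four summaries before and after; the numbers are hypothetical.

```python
import numpy as np

# Hypothetical sample: ten ordinary values plus one recording error.
clean = np.array([48, 50, 51, 49, 52, 50, 47, 53, 50, 50], dtype=float)
dirty = np.append(clean, 5000.0)

def summarize(x):
    q1, q3 = np.percentile(x, [25, 75])
    return {
        "mean": x.mean(),
        "sd": x.std(ddof=1),       # n - 1 denominator, matching the formula above
        "median": np.median(x),
        "iqr": q3 - q1,
    }

before, after = summarize(clean), summarize(dirty)
for key in before:
    print(f"{key:>6}: {before[key]:10.2f} -> {after[key]:10.2f}")
```

A single bad value moves the mean by an order of magnitude and inflates the standard deviation far more, while the median stays put and the IQR barely changes.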
Summary statistics alone are never sufficient. Anscombe’s quartet and the more recent Datasaurus dozen demonstrate that radically different datasets can share nearly identical means, variances, and correlations. The lesson is permanent: always plot the distribution rather than trusting a handful of summary statistics to characterize it.
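The point can be checked numerically. The sketch below uses the quartet values as transcribed from Anscombe (1973); treat the transcription as something to verify against the paper rather than as an authoritative copy.

```python
import numpy as np

# Anscombe's quartet, transcribed from the 1973 paper.
x123 = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
quartet = {
    "I":   (x123, np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])),
    "II":  (x123, np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])),
    "III": (x123, np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])),
    "IV":  (np.array([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8], dtype=float),
            np.array([6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89])),
}

# Same three summaries for every dataset.
summary = {
    name: {
        "mean_y": y.mean(),
        "var_y": y.var(ddof=1),
        "r": np.corrcoef(x, y)[0, 1],
    }
    for name, (x, y) in quartet.items()
}
for name, stats in summary.items():
    print(name, {k: round(float(v), 3) for k, v in stats.items()})
```

The printed summaries agree to about two decimal places, yet scatter plots of the four datasets look nothing alike: one is linear, one is curved, one is linear with a single outlier, and one is a vertical stack with a single leverage point.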
61.2.2 2.2 Visualizing a Single Distribution
The histogram is the workhorse for numeric variables, but its appearance depends heavily on bin width. Too few bins hide structure such as bimodality; too many bins turn the plot into noise. A kernel density estimate smooths the histogram and exposes shape, while a boxplot compresses the distribution into quartiles and flags outliers using the \(1.5 \times \text{IQR}\) rule. The empirical cumulative distribution function (ECDF) avoids binning entirely and is excellent for reading off quantiles and comparing groups.
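# Inspect one numeric column from several angles (df is an existing DataFrame)
import matplotlib.pyplot as plt

print(df["amount"].describe())
fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(10, 4))
df["amount"].plot(kind="hist", bins=50, ax=ax_hist)  # shape of the distribution
df["amount"].plot(kind="box", ax=ax_box)             # quartiles and flagged outliers

For categorical variables, the analogue is a frequency table or bar chart. Watch for rare categories, near constant columns that carry almost no information, and high cardinality fields such as identifiers that masquerade as features.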
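The ECDF and the \(1.5 \times \text{IQR}\) rule described above are both simple enough to compute by hand. The following is a sketch on a synthetic right skewed column; the lognormal parameters and the variable name `amount` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
amount = rng.lognormal(mean=3.0, sigma=1.0, size=5000)  # hypothetical right skewed column

# ECDF: sort the values; the i-th smallest has cumulative proportion i / n.
x_sorted = np.sort(amount)
ecdf = np.arange(1, len(x_sorted) + 1) / len(x_sorted)

# Read a quantile straight off the ECDF: the smallest value with ECDF >= 0.9.
p90 = x_sorted[np.searchsorted(ecdf, 0.9)]

# Boxplot fences under the 1.5 x IQR rule.
q1, q3 = np.percentile(amount, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
flagged = amount[(amount < lower) | (amount > upper)]
print(f"90th percentile ~ {p90:.1f}; {len(flagged)} points beyond the fences")
```

On skewed data the rule flags many points in the long tail that are not errors at all, which is a reminder that the fences mark candidates for inspection rather than values to delete.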
61.2.3 2.3 Shape, Skew, and Transformation
The shape of a distribution suggests how to treat it. Right skewed quantities such as income, transaction size, or counts often benefit from a logarithmic or square root transform, which can stabilize variance and make linear relationships visible. Skewness measures asymmetry and kurtosis measures tail heaviness; both are worth computing, but the eye reading a histogram usually decides faster. When a variable spans several orders of magnitude, plotting it on a log scale frequently turns an uninformative spike near zero into a clean, interpretable curve.
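As a rough illustration of how a log transform tames right skew, the sketch below computes sample skewness before and after the transform. The `income` column is simulated from a lognormal distribution, which is an assumption that makes the log transform work unusually well.

```python
import numpy as np

def skewness(x):
    """Sample skewness: the third standardized moment."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

rng = np.random.default_rng(2)
income = rng.lognormal(mean=10.0, sigma=1.0, size=10_000)  # hypothetical, strictly positive

skew_raw = skewness(income)
skew_log = skewness(np.log(income))  # np.log requires positive values
print(f"skewness raw: {skew_raw:.2f}, after log: {skew_log:.2f}")
```

Real columns with zeros or negative values need a different treatment, for example `np.log1p` for nonnegative counts, and the transformed skewness will rarely land this close to zero.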
61.3 3. Bivariate Exploration
61.3.1 3.1 Numeric Versus Numeric
With two numeric variables, the scatter plot is the primary tool. It reveals the form of the relationship (linear, curved, or absent), its direction, its strength, and any points that sit far from the bulk. The Pearson correlation coefficient \(r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2}\sqrt{\sum (y_i - \bar{y})^2}}\) quantifies only the linear component and ranges in \([-1, 1]\). A near zero \(r\) does not mean independence; a perfect parabola \(y = x^2\) sampled symmetrically about \(x = 0\) has \(r = 0\) even though \(y\) is completely determined by \(x\). For monotone but nonlinear relationships, Spearman’s rank correlation is more faithful because it compares ranks rather than raw values.
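Both caveats can be verified directly. The sketch below implements Pearson's \(r\) from the formula above and Spearman's coefficient as Pearson on ranks; the ranking shortcut assumes there are no ties, which holds for the strictly increasing example used here.

```python
import numpy as np

def pearson(x, y):
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

def spearman(x, y):
    # Rank transform, then Pearson on the ranks (valid only when there are no ties).
    def rank(v):
        return np.argsort(np.argsort(v)).astype(float)
    return pearson(rank(x), rank(y))

# A symmetric parabola: fully dependent, yet no linear component.
x = np.linspace(-1.0, 1.0, 201)
r_parabola = pearson(x, x ** 2)

# A monotone but strongly curved relationship.
t = np.linspace(0.0, 10.0, 200)
r_exp = pearson(t, np.exp(t))
rho_exp = spearman(t, np.exp(t))

print(f"parabola r = {r_parabola:.3f}")
print(f"exponential: Pearson = {r_exp:.3f}, Spearman = {rho_exp:.3f}")
```

For data with ties, a library routine such as `scipy.stats.spearmanr`, which assigns average ranks, is the safer choice.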
Overplotting is a frequent hazard with large datasets. When millions of points pile on top of one another, switch to hexagonal binning, two dimensional density contours, or transparency so that the dense regions remain legible.
61.3.2 3.2 Numeric Versus Categorical
To compare a numeric variable across the levels of a category, place distributions side by side. Grouped boxplots, violin plots, or overlaid density curves let you see whether the groups differ in location, in spread, or in shape. A difference in medians is interesting, but so is a difference in variance, which a comparison of means alone would miss. Faceting, drawing one small panel per group on a shared scale, scales gracefully to many categories and keeps comparisons honest.
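A numeric companion to the side by side plots is a grouped summary that reports spread as well as location. In this sketch the two groups are simulated with the same center and different variance; the column names `group` and `value` are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
# Hypothetical data: two groups with equal centers but different spread.
df = pd.DataFrame({
    "group": np.repeat(["a", "b"], 2000),
    "value": np.concatenate([rng.normal(50, 2, 2000), rng.normal(50, 10, 2000)]),
})

def iqr(s):
    return s.quantile(0.75) - s.quantile(0.25)

# One row per group: location (median, mean) next to spread (std, iqr).
by_group = df.groupby("group")["value"].agg(["median", "mean", "std", iqr])
print(by_group.round(2))
```

A comparison of means alone would report no difference here, while the `std` and `iqr` columns show that group `b` is several times more variable.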
61.3.3 3.3 Categorical Versus Categorical
For two categorical variables, the contingency table of joint counts is the foundation. Converting counts to row or column proportions exposes conditional structure, and a mosaic plot or heatmap renders the same information visually. The question is whether the conditional distribution of one variable changes across the levels of the other. If it does not, the variables are roughly independent in this sample; if it does, you have found a relationship worth carrying into the modeling stage.
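In pandas, `pd.crosstab` builds the contingency table, and its `normalize` argument converts counts to proportions. The records below are invented for illustration.

```python
import pandas as pd

# Hypothetical records: plan type versus whether the customer churned.
df = pd.DataFrame({
    "plan":    ["basic"] * 6 + ["premium"] * 4,
    "churned": ["yes", "yes", "yes", "no", "no", "no", "yes", "no", "no", "no"],
})

counts = pd.crosstab(df["plan"], df["churned"])                         # joint counts
row_props = pd.crosstab(df["plan"], df["churned"], normalize="index")   # P(churned | plan)

print(counts)
print(row_props)
```

Each row of `row_props` is the conditional distribution of `churned` within one plan. If the rows look alike the variables are roughly independent in this sample; here they differ, although ten records are far too few to conclude anything.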
61.4 4. Multivariate Exploration
61.4.1 4.1 Beyond Two Dimensions
Real effects rarely live in a single pair of variables. Multivariate EDA asks how three or more variables interact, and it requires techniques that compress dimensionality without destroying the patterns you care about. A correlation heatmap across all numeric features gives a fast overview of redundancy and multicollinearity, flagging clusters of variables that move together and may be near duplicates.
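Alongside the heatmap, it can help to list the strongly correlated pairs explicitly. This sketch does so with NumPy on simulated features; the 0.9 threshold and the feature names are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500
a = rng.normal(size=n)
features = {
    "a": a,
    "a_rescaled": 3.0 * a + rng.normal(scale=0.05, size=n),  # near duplicate of a
    "b": rng.normal(size=n),
    "c": rng.normal(size=n),
}
names = list(features)
corr = np.corrcoef(np.column_stack([features[k] for k in names]), rowvar=False)

# Walk the upper triangle only so each pair is reported once.
i_idx, j_idx = np.triu_indices(len(names), k=1)
redundant = [
    (names[i], names[j], round(float(corr[i, j]), 3))
    for i, j in zip(i_idx, j_idx)
    if abs(corr[i, j]) > 0.9
]
print(redundant)
```

A pair on this list is a prompt to look at the data dictionary: the two columns may be the same measurement in different units, or one may be derived from the other.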
The pair plot, a grid of scatter plots for every variable pairing with histograms on the diagonal, is dense but powerful for moderate numbers of features. Adding color for a categorical variable layers a third dimension onto each panel. Beyond a handful of features, however, the grid becomes unreadable and you turn to projection methods.
61.4.2 4.2 Confounding and Simpson’s Paradox
The defining risk of multivariate data is that a relationship visible in aggregate can reverse within subgroups. Simpson’s paradox is the canonical example: a treatment can appear beneficial overall yet harmful in every stratum once a confounder is controlled. The 1973 Berkeley graduate admissions case is the classic illustration, where an apparent bias against women at the university level dissolved or reversed once department was taken into account.
The practical defense is to condition. Before trusting any bivariate relationship, ask which third variable might be driving both sides of it, and then redraw the relationship within levels of that variable. If the pattern holds across strata, it is more credible; if it flips, you have learned something essential that no aggregate plot could have told you.
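To see the reversal concretely, the sketch below simulates data in which demand falls with price inside every segment while the pooled correlation is strongly positive. The segments, baselines, and slope are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
frames = []
# Hypothetical segments: pricier segments also have higher baseline demand.
for segment, base_price, base_demand in [("budget", 10, 100), ("mid", 20, 200), ("premium", 30, 300)]:
    price = base_price + rng.uniform(-3, 3, 300)
    # Within a segment, demand falls by 5 units per unit of price.
    demand = base_demand - 5.0 * (price - base_price) + rng.normal(0, 5, 300)
    frames.append(pd.DataFrame({"segment": segment, "price": price, "demand": demand}))
df = pd.concat(frames, ignore_index=True)

overall_r = df["price"].corr(df["demand"])
within_r = {seg: g["price"].corr(g["demand"]) for seg, g in df.groupby("segment")}

print(f"pooled correlation: {overall_r:.2f}")
print({seg: round(r, 2) for seg, r in within_r.items()})
```

Here `segment` drives both price and demand, so the pooled plot mostly shows differences between segments rather than the effect of price.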
# Check whether an aggregate relationship survives conditioning
# (selecting the two columns first keeps the grouping key out of the applied frame)
df.groupby("segment")[["price", "demand"]].apply(
    lambda g: g["price"].corr(g["demand"])
)
61.4.3 4.3 Dimensionality Reduction for Exploration
When the feature space is wide, linear projection helps. Principal component analysis (PCA) finds orthogonal directions of maximal variance, and plotting the first two components often reveals clusters, gradients, or outliers that were invisible in any single feature. For nonlinear structure, neighborhood embedding methods such as t-SNE and UMAP can expose clusters that PCA misses, though their layouts must be read with care because distances between clusters in the embedding are not always meaningful. Used as exploratory aids rather than as conclusions, these projections are valuable for forming hypotheses about latent groups.
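PCA can be written in a few lines with a singular value decomposition. The following is a sketch on simulated data driven by two latent factors; standardizing the features first is a design choice, appropriate when columns are on different scales.

```python
import numpy as np

rng = np.random.default_rng(6)
# Hypothetical wide data: 10 features driven mostly by 2 latent factors.
latent = rng.normal(size=(400, 2))
loadings = rng.normal(size=(2, 10))
X = latent @ loadings + 0.1 * rng.normal(size=(400, 10))

# Center and scale each feature, then take the SVD of the standardized matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
U, S, Vt = np.linalg.svd(Z, full_matrices=False)

explained = S ** 2 / np.sum(S ** 2)   # share of total variance per component
scores = Z @ Vt[:2].T                 # coordinates on the first two components

print("variance explained by first two components:", round(float(explained[:2].sum()), 3))
```

Scatter plotting the two columns of `scores`, colored by a categorical variable, is the usual exploratory view. Library implementations such as scikit-learn's `PCA` wrap the same computation.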
61.5 5. Distributions and Data Quality
61.5.1 5.1 Reading a Distribution Critically
Every distribution tells a story about how the data were produced. A spike at exactly zero may encode a default value or a sentinel for missing data. A hard ceiling suggests censoring, where values above a threshold were clipped or never recorded. A suspicious mode at a conspicuous value such as \(99\) or \(-1\) often signals a placeholder. Bimodality frequently means two populations were mixed, for example two device types or two time periods, and disentangling them may matter more than any model you fit afterward.
61.5.2 5.2 Outliers and Anomalies
Not every extreme value is an error, and not every error is extreme. An outlier is simply a point that is improbable under your working model of the data, and the right response depends on its cause. A measurement glitch should be corrected or removed; a genuine rare event should be kept and perhaps studied directly. The robust way to flag candidates is with the IQR rule or with a modified z score based on the median absolute deviation, which does not let the very outliers you are hunting inflate the threshold. The decision to drop a point should always be recorded, because silent deletion is a common source of irreproducible results.
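The modified z score mentioned above can be sketched as follows. The 0.6745 constant rescales the median absolute deviation to match the standard deviation under normality, and the cutoff of 3.5 is a common convention rather than a rule. The readings are invented.

```python
import numpy as np

def modified_z(x):
    """Modified z score: 0.6745 * (x - median) / MAD. Assumes MAD > 0."""
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    return 0.6745 * (x - med) / mad

x = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 55.0])  # hypothetical readings
mz = modified_z(x)
classic_z = (x - x.mean()) / x.std(ddof=1)

flag_mad = np.abs(mz) > 3.5       # conventional cutoff; tune per dataset
flag_z = np.abs(classic_z) > 3    # classic rule, for comparison

print("flagged by modified z:", x[flag_mad])
print("flagged by classic z: ", x[flag_z])
```

In this sample the classic z score misses the value 55 entirely, because that one point inflates the mean and standard deviation used to judge it. The median based version is not affected.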
61.5.3 5.3 Duplicates, Types, and Units
Mundane data quality checks repay the time spent. Look for duplicated rows, which inflate apparent sample size and bias estimates. Verify that numeric columns are stored as numbers rather than strings, that dates parse correctly, and that categorical labels are consistent in spelling and case. Confirm units, since a column silently mixing dollars and cents, or kilometers and miles, will quietly poison every downstream calculation. These checks are unglamorous but they catch the errors that derail projects most often.
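These checks translate into a few pandas calls. The sketch below runs them on a small invented extract. Coercing unparseable values to missing, rather than raising, is one reasonable choice during exploration because it lets you count the failures.

```python
import pandas as pd

# Hypothetical raw extract with typical problems.
raw = pd.DataFrame({
    "id":     [1, 2, 2, 3, 4],
    "amount": ["10.5", "7", "7", "n/a", "12"],
    "signup": ["2024-01-05", "2024-02-30", "2024-02-30", "2024-03-01", "2024-03-09"],
    "region": ["North", "north ", "north ", "SOUTH", "South"],
})

n_dupes = int(raw.duplicated().sum())     # rows identical to an earlier row
clean = raw.drop_duplicates().copy()

# Coerce instead of raising, then count what failed to parse.
clean["amount"] = pd.to_numeric(clean["amount"], errors="coerce")
clean["signup"] = pd.to_datetime(clean["signup"], format="%Y-%m-%d", errors="coerce")
clean["region"] = clean["region"].str.strip().str.lower()   # unify spelling and case

bad_amounts = int(clean["amount"].isna().sum())
bad_dates = int(clean["signup"].isna().sum())
regions = sorted(clean["region"].unique())

print(n_dupes, bad_amounts, bad_dates, regions)
```

Unit problems cannot be detected this mechanically. They usually show up as a bimodal distribution or as values that differ by a constant factor between sources.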
61.6 6. Missingness Patterns
61.6.1 6.1 The Three Mechanisms
Missing data are not all alike, and Rubin’s taxonomy provides the vocabulary. Data are missing completely at random (MCAR) when the probability of being missing is independent of all values, observed or not. Data are missing at random (MAR) when missingness depends only on observed variables, for example when older respondents skip a question and age is recorded. Data are missing not at random (MNAR) when missingness depends on the unobserved value itself, as when high earners decline to report income. The mechanism matters because it dictates which handling strategies are valid: simple deletion can be unbiased under MCAR but actively misleading under MNAR.
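A small simulation makes the three mechanisms tangible. In this sketch `income` depends on `age`, and the same income column is masked in three different ways; all parameters are invented.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 20_000
age = rng.uniform(20, 70, n)
income = 30_000 + 800 * (age - 20) + rng.normal(0, 8_000, n)  # hypothetical relation

# Each mask marks which income values go missing.
mcar = rng.random(n) < 0.3                                    # unrelated to anything
mar = rng.random(n) < np.where(age > 50, 0.6, 0.1)            # depends on observed age
mnar = rng.random(n) < np.where(income > np.median(income), 0.6, 0.1)  # depends on income itself

true_mean = income.mean()
observed_mean = {
    name: income[~mask].mean()
    for name, mask in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]
}

# Under MAR the bias is recoverable by conditioning on the observed driver (age):
# fit income ~ age on the complete cases and fill the gaps with predictions.
slope, intercept = np.polyfit(age[~mar], income[~mar], 1)
mar_adjusted = np.where(mar, intercept + slope * age, income).mean()

print(f"true mean: {true_mean:,.0f}")
print({k: f"{v:,.0f}" for k, v in observed_mean.items()})
print(f"MAR after regression fill-in: {mar_adjusted:,.0f}")
```

Deletion leaves the MCAR mean essentially unbiased but biases both the MAR and MNAR means downward. Only the MAR bias can be repaired from the observed data, because its driver was recorded.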
61.6.2 6.2 Visualizing Missingness
Treat missingness as data to be explored in its own right. A bar chart of the fraction missing per column shows where the problem concentrates. A matrix plot of the missingness pattern, with rows as records and columns as fields, reveals whether absences cluster together, which suggests a shared cause such as a form section that was added later. A correlation of missingness indicators tells you whether two fields tend to go missing in tandem, which again points to a shared cause. To probe MCAR itself, check whether a field's missingness indicator is associated with the observed values of other columns; if it is, the mechanism is not MCAR.
# Quantify and locate missingness
df.isna().mean().sort_values(ascending=False)
df.isna().corr()  # do absences co-occur?
61.6.3 6.3 From Pattern to Strategy
The point of the missingness exploration is to choose a defensible strategy, not to fill blanks reflexively. If a column is missing for a structural reason, an explicit missing indicator may carry real signal. If missingness appears MAR, multiple imputation or model based imputation can recover information that listwise deletion would throw away. If you suspect MNAR, you may need to model the missingness mechanism directly or, at minimum, flag the resulting estimates as sensitive to untestable assumptions. EDA cannot prove the mechanism, since MAR and MNAR are indistinguishable from the observed data alone, but it can rule out the naive defaults and force an honest choice.
61.7 7. From Exploration to Hypotheses
61.7.1 7.1 Turning Observations into Testable Claims
The output of EDA is a short list of concrete, falsifiable hypotheses, each tied to a specific observation. A vague note that “price seems related to demand” is less useful than a sharpened claim such as “within each customer segment, demand declines roughly linearly with log price, and the slope is steeper for premium segments.” The sharpened version names the variables, the functional form, the conditioning structure, and the expected direction, which makes it something a later analysis can confirm or refute.
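The sharpened claim above maps directly onto a calculation: a separate straight line fit of demand on log price within each segment. The sketch below simulates data consistent with that claim; the segment names and slopes are illustrative assumptions, not findings.

```python
import numpy as np

rng = np.random.default_rng(8)
# Hypothetical exploration data; the true slopes are chosen for illustration only.
true_slopes = {"budget": -20.0, "standard": -30.0, "premium": -45.0}
data = {}
for segment, slope in true_slopes.items():
    price = rng.uniform(5, 100, 400)
    demand = 300 + slope * np.log(price) + rng.normal(0, 10, 400)
    data[segment] = (price, demand)

# Fit demand = intercept + slope * log(price) separately within each segment.
fitted = {}
for segment, (price, demand) in data.items():
    slope, intercept = np.polyfit(np.log(price), demand, 1)
    fitted[segment] = float(slope)

print({seg: round(s, 1) for seg, s in fitted.items()})
```

Each part of the hypothesis is now checkable: the sign of every slope, their ordering across segments, and, from a residual plot, whether a line in log price is an adequate form. Because the claim came from exploration, those checks belong on data that did not suggest it.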
61.7.2 7.2 Guarding Against Self Deception
The danger throughout is that the analyst will see patterns that are artifacts of noise, of selection, or of the very flexibility that makes exploration powerful. Several habits guard against this. Hold out data so that promising patterns can be checked on records that did not suggest them. Prefer effects that are large, stable across reasonable subsets, and mechanistically plausible over those that are merely statistically striking. Treat any relationship discovered by extensive searching as a candidate rather than a finding. The freedom to look everywhere is exactly why exploratory conclusions must be confirmed before they are believed.
61.7.3 7.3 Documenting the Exploration
A disciplined EDA leaves a written trail: the questions asked, the views generated, the surprises encountered, the decisions made about outliers and missing values, and the hypotheses carried forward. This record is what makes the analysis reproducible and what lets a collaborator understand why the eventual model was built the way it was. The data dictionary you assemble, the transformations you settle on, and the open questions you flag become the bridge between exploration and the modeling chapters that follow.
61.8 8. Summary
Exploratory data analysis is the stage where you earn the right to model. By examining variables one at a time, in pairs, and in combination, by reading distributions and missingness patterns critically, and by conditioning on potential confounders, you build an accurate picture of the data generating process and surface the problems that would otherwise corrupt downstream work. The deliverable is not a gallery of plots but a small set of sharpened, conditioned hypotheses, accompanied by an honest account of the data’s flaws and the discipline to confirm exploratory findings before trusting them.
61.9 References
- Tukey, J. W. (1977). Exploratory Data Analysis. Addison-Wesley. https://archive.org/details/exploratorydataa0000tuke
- Wickham, H., Çetinkaya-Rundel, M., and Grolemund, G. (2023). R for Data Science, 2nd ed. O’Reilly. https://r4ds.hadley.nz/
- Anscombe, F. J. (1973). Graphs in Statistical Analysis. The American Statistician, 27(1), 17-21. https://www.tandfonline.com/doi/abs/10.1080/00031305.1973.10478966
- Matejka, J. and Fitzmaurice, G. (2017). Same Stats, Different Graphs (the Datasaurus Dozen). CHI 2017. https://www.research.autodesk.com/publications/same-stats-different-graphs/
- Gelman, A. and Loken, E. (2014). The Garden of Forking Paths. http://www.stat.columbia.edu/~gelman/research/unpublished/forking.pdf
- Bickel, P. J., Hammel, E. A., and O’Connell, J. W. (1975). Sex Bias in Graduate Admissions: Data from Berkeley. Science, 187(4175), 398-404. https://www.science.org/doi/10.1126/science.187.4175.398
- Rubin, D. B. (1976). Inference and Missing Data. Biometrika, 63(3), 581-592. https://academic.oup.com/biomet/article/63/3/581/270932
- van der Maaten, L. and Hinton, G. (2008). Visualizing Data using t-SNE. JMLR, 9, 2579-2605. https://www.jmlr.org/papers/v9/vandermaaten08a.html
- McInnes, L., Healy, J., and Melville, J. (2018). UMAP: Uniform Manifold Approximation and Projection. https://arxiv.org/abs/1802.03426
- VanderPlas, J. (2016). Python Data Science Handbook. O’Reilly. https://jakevdp.github.io/PythonDataScienceHandbook/