60 Data Validation and Testing
Machine learning systems inherit the quality of the data that flows through them. A model trained on corrupted features, a feature store populated with stale records, or a serving pipeline fed malformed input will produce confident and wrong predictions. Unlike a software bug that throws an exception, a data quality defect often passes silently through every layer of a system until it surfaces as degraded business metrics weeks later. Data validation and testing exist to make these defects loud, early, and cheap to fix.
This chapter treats data as a first class artifact that deserves the same engineering discipline applied to source code. We examine the spectrum of validation techniques, from rigid schema enforcement to probabilistic statistical checks, survey the dominant open source tooling, and discuss how validation behaves differently in batch and streaming production pipelines.
60.1 1. Why Data Validation Matters
60.1.1 1.1 The Silent Failure Mode
Traditional software fails loudly. A null pointer dereference crashes, a type mismatch refuses to compile, and a failed assertion halts execution. Data pipelines fail quietly. A column that should range from zero to one hundred suddenly contains values in the thousands because an upstream service changed its units from dollars to cents. The pipeline does not crash. It computes averages, trains models, and serves predictions, all built on a silent defect.
The cost of a data defect grows with the distance it travels before detection. Catching a malformed record at ingestion costs a retry. Catching it after it has trained a production model costs a retraining cycle, a redeployment, and possibly an incident review. Validation pushes detection as close to the source as possible.
60.1.2 1.2 Data Quality Dimensions
Practitioners commonly decompose data quality into several dimensions. Completeness asks whether required values are present. Validity asks whether values conform to their expected type and format. Consistency asks whether related values agree with each other across tables or time. Accuracy asks whether values reflect reality, which is the hardest dimension to test automatically. Timeliness asks whether data arrived within its expected window. Uniqueness asks whether records that should be distinct are in fact distinct.
A mature validation strategy maps each dimension to concrete, automatable checks. Completeness becomes a null count threshold. Validity becomes a type and regex check. Uniqueness becomes a primary key duplication check. Accuracy usually requires a trusted reference source or human review, so teams often approximate it with plausibility bounds.
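The mapping from dimensions to checks can be made concrete with a few lines of plain Python. The sketch below is illustrative only: the helper names, the sample rows, and the deliberately loose email pattern are assumptions for this example, not part of any library.

```python
import re

# Deliberately loose format check, used only for illustration.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def null_fraction(rows, column):
    """Completeness: share of rows where the column is missing or None."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(column) is None)
    return missing / len(rows)

def invalid_values(rows, column, pattern):
    """Validity: non-null values that do not match the expected format."""
    return [
        row[column]
        for row in rows
        if row.get(column) is not None and not pattern.match(str(row[column]))
    ]

def duplicate_keys(rows, key):
    """Uniqueness: key values that appear more than once."""
    seen, dupes = set(), set()
    for row in rows:
        value = row.get(key)
        if value in seen:
            dupes.add(value)
        seen.add(value)
    return dupes

rows = [
    {"user_id": 1, "email": "a@example.com"},
    {"user_id": 2, "email": None},
    {"user_id": 2, "email": "not-an-email"},
    {"user_id": 3, "email": "c@example.com"},
]

report = {
    "email_null_fraction": null_fraction(rows, "email"),
    "email_invalid": invalid_values(rows, "email", EMAIL_PATTERN),
    "user_id_duplicates": duplicate_keys(rows, "user_id"),
}
```

Each helper returns evidence (a fraction or the offending values) rather than a bare pass or fail, which leaves the decision about thresholds to the caller.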
60.2 2. Schema Validation
60.2.1 2.1 What a Schema Captures
A schema is a contract that describes the structure of a dataset: the set of columns, their names, their data types, their nullability, and sometimes their permissible value ranges. Schema validation verifies that an incoming dataset honors this contract before any downstream code touches it.
Schema validation is the cheapest and highest leverage form of data testing because it catches an entire class of structural defects with a single declarative specification. When an upstream team renames a column, drops a field, or changes an integer to a string, a schema check fails immediately rather than letting the change propagate.
60.2.2 2.2 Explicit Schemas with Pandera
Pandera lets you express a schema as code over pandas, Polars, or PySpark frames. A DataFrameSchema declares columns with types and constraints, and validation either returns the validated frame or raises a SchemaError describing exactly which rows and checks failed.
    import pandera as pa
    from pandera import Column, Check

    schema = pa.DataFrameSchema({
        "user_id": Column(int, Check.gt(0), unique=True),
        "age": Column(int, Check.in_range(0, 120), nullable=False),
        "email": Column(str, Check.str_matches(r".+@.+\..+")),
        "signup_country": Column(str, Check.isin(["US", "CA", "GB", "DE"])),
    })

    # raw_df is the incoming DataFrame to validate
    validated = schema.validate(raw_df, lazy=True)

The lazy=True flag is important in practice. Without it, validation stops at the first failure. With it, Pandera collects every failing check across the frame and reports them together, which dramatically shortens the debugging loop when an unfamiliar dataset arrives.
60.2.3 2.3 Class Based Schemas and Type Integration
Pandera also offers a class based API using DataFrameModel, which reads like a dataclass and integrates with static type checkers. This style keeps the schema definition close to the rest of your typed code and supports inheritance for shared field definitions.
    import pandera as pa
    from pandera import DataFrameModel, Field

    class Transaction(DataFrameModel):
        amount: float = Field(ge=0, le=1_000_000)
        currency: str = Field(isin=["USD", "EUR", "GBP"])
        timestamp: pa.DateTime = Field(nullable=False)

        class Config:
            strict = True  # reject unexpected columns

The strict = True configuration is a deliberate choice. It rejects any column not declared in the schema, which protects against silent additions of unexpected fields. The default, permissive validation (strict = False), is more forgiving but lets schema drift accumulate unnoticed.
60.2.4 2.4 Serialized Schemas for Cross Language Pipelines
When a pipeline spans multiple languages or services, an in code Python schema does not travel well. Serialized schema formats such as JSON Schema, Avro, and Protobuf describe structure in a language neutral way. A producer written in Java and a consumer written in Python can share one Avro schema and enforce it at the boundary. Schema registries, common in Kafka deployments, store these definitions centrally and enforce compatibility rules so that a producer cannot publish a breaking change without explicit approval.
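The idea of a compatibility rule can be sketched in a few lines. The function below is a deliberately simplified illustration, not the resolution rules of Avro or of any schema registry: it assumes a schema is just a mapping from field name to type name and treats a removed field or a changed type as breaking for existing consumers. The name breaking_changes and the sample schemas are hypothetical.

```python
def breaking_changes(old_fields, new_fields):
    """List changes that would break a consumer built against old_fields.

    Both arguments map field name -> type name. Simplified rule set:
    removing a field or changing its type is breaking, adding a field is not.
    """
    problems = []
    for name, old_type in old_fields.items():
        if name not in new_fields:
            problems.append(f"field removed: {name}")
        elif new_fields[name] != old_type:
            problems.append(
                f"type changed: {name} {old_type} -> {new_fields[name]}"
            )
    return problems

v1 = {"user_id": "long", "amount": "double", "currency": "string"}
v2_additive = {**v1, "channel": "string"}          # adds a field only
v2_breaking = {"user_id": "string", "amount": "double"}  # retypes and removes

additive_report = breaking_changes(v1, v2_additive)
breaking_report = breaking_changes(v1, v2_breaking)
```

A registry applies rules of this general kind at publish time, so the producer learns about the break before any consumer does.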
60.3 3. Constraint Checks
60.3.1 3.1 Beyond Structure
Schema validation confirms that a column is an integer. Constraint checks confirm that the integer makes sense. A column may be a perfectly valid float and still be wrong if it represents an age of negative five or a probability of one hundred. Constraint checks encode domain knowledge that no type system can express.
Common constraint categories include range checks (a value falls within known bounds), set membership checks (a category belongs to an allowed list), uniqueness checks (a key does not repeat), referential checks (a foreign key exists in a parent table), and cross column checks (a discount price does not exceed a list price).
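Three of these categories are sketched below in plain Python over a list of dictionaries. The function names, the sample orders, and the bounds are assumptions chosen for illustration; a real pipeline would more likely express the same rules in a validation library or in SQL.

```python
def out_of_range(rows, column, low, high):
    """Range check: rows whose value falls outside [low, high]."""
    return [row for row in rows if not (low <= row[column] <= high)]

def not_in_set(rows, column, allowed):
    """Set membership check: rows whose value is not in the allowed list."""
    return [row for row in rows if row[column] not in allowed]

def orphaned_keys(child_rows, fk_column, parent_keys):
    """Referential check: foreign key values with no matching parent key."""
    return sorted({row[fk_column] for row in child_rows} - set(parent_keys))

customer_ids = [101, 102, 103]
orders = [
    {"order_id": 1, "customer_id": 101, "quantity": 2, "status": "paid"},
    {"order_id": 2, "customer_id": 999, "quantity": 0, "status": "paid"},
    {"order_id": 3, "customer_id": 102, "quantity": 5, "status": "unknown"},
]

bad_quantity = out_of_range(orders, "quantity", 1, 100)
bad_status = not_in_set(orders, "status", {"paid", "refunded"})
orphans = orphaned_keys(orders, "customer_id", customer_ids)
```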
60.3.2 3.2 Cross Field and Cross Row Constraints
The most valuable constraints often span multiple fields. A record where end_date precedes start_date is structurally valid but logically impossible. A row where total does not equal the sum of its line items signals a computation bug upstream. These relationships are invisible to per column validation and require checks that reason about the whole record.
    schema = pa.DataFrameSchema(
        columns={
            "start_date": Column(pa.DateTime),
            "end_date": Column(pa.DateTime),
        },
        checks=Check(
            lambda df: df["end_date"] >= df["start_date"],
            error="end_date must not precede start_date",
        ),
    )

60.3.3 3.3 Choosing Hard versus Soft Constraints
Not every constraint violation should halt a pipeline. A hard constraint, such as a null primary key, justifies rejecting the batch outright because downstream joins will break. A soft constraint, such as a slightly elevated null rate in an optional field, may warrant a warning and a logged metric rather than a full stop. Encoding this distinction explicitly prevents two failure modes: a brittle pipeline that halts on trivial issues, and a permissive pipeline that ignores real corruption. Most validation tools support both raising errors and emitting warnings, and the choice should reflect the genuine downstream impact of each rule.
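One way to encode the distinction is to attach a severity to every rule and let a small runner decide what to do. The sketch below uses only the standard library; HardConstraintError, run_checks, and the 25 percent null rate limit are hypothetical names and values chosen for illustration.

```python
import warnings

class HardConstraintError(Exception):
    """Raised when a rule marked as hard is violated."""

def run_checks(rows, rules):
    """Apply (name, predicate, severity) rules to a batch.

    A hard violation aborts the batch. A soft violation emits a warning and
    is returned so the caller can log it as a metric.
    """
    soft_failures = []
    for name, predicate, severity in rules:
        if predicate(rows):
            continue
        if severity == "hard":
            raise HardConstraintError(name)
        warnings.warn(f"soft constraint violated: {name}")
        soft_failures.append(name)
    return soft_failures

def no_null_ids(rows):
    return all(row["id"] is not None for row in rows)

def nickname_mostly_present(rows):
    nulls = sum(1 for row in rows if row["nickname"] is None)
    return nulls / len(rows) <= 0.25

RULES = [
    ("id is never null", no_null_ids, "hard"),
    ("nickname null rate <= 25%", nickname_mostly_present, "soft"),
]

batch = [
    {"id": 1, "nickname": None},
    {"id": 2, "nickname": None},
    {"id": 3, "nickname": "sam"},
]

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # keep the demo quiet
    soft = run_checks(batch, RULES)
```

Keeping severity in the rule table, rather than in scattered try/except blocks, makes the policy reviewable in one place.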
60.4 4. Statistical Validation
60.4.1 4.1 From Deterministic Rules to Distributions
Schema and constraint checks are deterministic: a value either satisfies the rule or it does not. Many real defects are statistical. The data is individually valid but collectively wrong. A feature whose mean shifts by three standard deviations, a categorical column whose distribution suddenly skews toward one value, or a join that quietly drops half its rows all produce data that passes every per row check while being deeply broken.
Statistical validation compares the shape of incoming data against an expected profile, usually derived from historical reference data. Rather than asking “is this value valid,” it asks “does this dataset look like the data we expect.”
60.4.2 4.2 Distributional Checks and Drift
The core technique is to summarize a column with statistics (mean, standard deviation, quantiles, null fraction, and cardinality) and then compare those summaries against a baseline. When the deviation exceeds a threshold, the check fails or raises an alert.
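A minimal version of this profile and compare loop fits in the standard library. The sketch below is an illustration under stated assumptions: the profile and deviations helpers are hypothetical, the tolerances are arbitrary example values rather than recommendations, and quantiles are omitted for brevity.

```python
import statistics

def profile(values):
    """Summarize a column: null fraction, cardinality, mean, and std dev."""
    present = [v for v in values if v is not None]
    return {
        "null_fraction": 1 - len(present) / len(values),
        "cardinality": len(set(present)),
        "mean": statistics.fmean(present),
        "stdev": statistics.pstdev(present),
    }

def deviations(baseline, current, tolerances):
    """Return the statistics whose absolute change exceeds its tolerance."""
    return {
        name: (baseline[name], current[name])
        for name, limit in tolerances.items()
        if abs(current[name] - baseline[name]) > limit
    }

reference = [10, 12, 11, 13, 12, None, 11, 12]
incoming = [10, None, None, 31, 12, None, 30, 29]

# Illustrative tolerances only.
TOLERANCES = {"null_fraction": 0.10, "mean": 2.0}

flagged = deviations(profile(reference), profile(incoming), TOLERANCES)
```

Every value in incoming would pass a per row type check, yet both the null fraction and the mean move well outside tolerance.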
For detecting whether two distributions differ, practitioners use tests and metrics such as the Kolmogorov Smirnov statistic for continuous columns, the chi squared test for categorical columns, the population stability index for binned distributions, and the Jensen Shannon divergence for comparing probability distributions. These quantify drift between a reference window and a current window.
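Of the metrics just listed, the population stability index is simple enough to write out by hand. The sketch below assumes fixed bin edges chosen in advance and a small epsilon to avoid taking the logarithm of zero for empty bins; both are illustrative choices, and the function name psi is hypothetical.

```python
import math

def psi(reference, current, edges, eps=1e-6):
    """Population stability index over bins defined by ascending edges.

    Values below edges[0] land in the first bin and values at or above
    edges[-1] land in the last. eps replaces empty bin shares so the
    logarithm stays defined.
    """
    def shares(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            # bin index = number of edges the value has reached
            counts[sum(1 for e in edges if v >= e)] += 1
        return [max(c / len(values), eps) for c in counts]

    ref, cur = shares(reference), shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))

reference_window = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
shifted_window = [v + 6 for v in reference_window]
edges = [4, 7]

psi_same = psi(reference_window, reference_window, edges)
psi_shifted = psi(reference_window, shifted_window, edges)
```

Identical windows score zero, and the score grows as mass moves between bins, which is what makes it usable as a drift signal.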
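    from scipy.stats import ks_2samp

    class DataDriftError(Exception):
        """Raised when a feature's distribution differs from the reference."""

    # reference and current are DataFrames holding the two windows to compare
    stat, p_value = ks_2samp(reference["feature"], current["feature"])
    if p_value < 0.01:
        raise DataDriftError(f"feature distribution shifted, KS p={p_value:.4f}")

60.4.3 4.3 The Threshold Problem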
Statistical validation introduces a tuning challenge absent from deterministic checks. Set thresholds too tight and the pipeline alerts on normal seasonal variation, training the team to ignore alarms. Set them too loose and genuine drift slips through. There is no universally correct threshold. Effective teams derive thresholds empirically by replaying historical data and measuring how often each candidate threshold would have fired, then tuning so that alerts correlate with real incidents rather than noise. Statistical checks should be treated as monitoring signals that often warrant investigation rather than hard gates that block deployment.
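The replay procedure can be sketched as a small backtest. The example below assumes a list of historical daily drift scores and a set of days known to be real incidents; the scores, the incident days, and the candidate thresholds are all made up for illustration.

```python
def replay(history, incident_days, candidates):
    """For each candidate threshold, count alerts and compare to incidents."""
    results = []
    for threshold in candidates:
        fired = {day for day, score in enumerate(history) if score > threshold}
        results.append({
            "threshold": threshold,
            "alerts": len(fired),
            "true_alerts": len(fired & incident_days),
            "missed_incidents": len(incident_days - fired),
        })
    return results

# Hypothetical daily drift scores; days 3 and 8 are known incidents.
daily_scores = [0.02, 0.05, 0.04, 0.31, 0.06, 0.12, 0.03, 0.09, 0.27, 0.05]
incidents = {3, 8}

table = replay(daily_scores, incidents, candidates=[0.04, 0.10, 0.30])
```

In this toy history the tightest threshold fires on seven of ten days, the loosest misses an incident, and the middle one catches both incidents with a single extra alert.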
60.4.4 4.4 Validating Against Reference Data
A powerful pattern compares a derived dataset against an independent source of truth. If a feature pipeline computes daily active users, an independent analytics warehouse should report a closely matching count. Reconciliation checks of this kind catch logic bugs that no internal consistency check could find, because they compare two independent computations of the same quantity.
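A reconciliation check usually reduces to comparing two numbers within a relative tolerance. The sketch below assumes a one percent tolerance and hypothetical daily active user counts; the right tolerance depends on how the two systems define and deduplicate the quantity.

```python
def reconcile(pipeline_value, reference_value, rel_tol=0.01):
    """Compare two independent computations of the same quantity.

    Returns (ok, relative_difference).
    """
    if reference_value == 0:
        # Relative difference is undefined; only an exact zero match passes.
        ok = pipeline_value == 0
        return ok, (0.0 if ok else float("inf"))
    diff = abs(pipeline_value - reference_value) / abs(reference_value)
    return diff <= rel_tol, diff

# Hypothetical counts from the feature pipeline and the analytics warehouse.
ok_close, diff_close = reconcile(10_050, 10_000)
ok_far, diff_far = reconcile(10_120, 10_000)
```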
60.5 5. Data Unit Tests
60.5.1 5.1 Treating Data Logic Like Code
The transformations that produce data are code, and code deserves unit tests. A data unit test runs a transformation against a small, fixed input and asserts an exact expected output. Unlike validation, which runs against live production data, a data unit test runs in continuous integration against synthetic fixtures and verifies that the transformation logic itself is correct.
    import pandas as pd

    # compute_revenue is the transformation under test, imported from pipeline code
    def test_revenue_excludes_refunds():
        input_df = pd.DataFrame({
            "order_id": [1, 2, 3],
            "amount": [100, 50, 30],
            "status": ["paid", "refunded", "paid"],
        })
        result = compute_revenue(input_df)
        assert result == 130  # refunded order excluded

This test pins the business rule that refunds do not count toward revenue. If a future change accidentally includes refunds, the test fails in continuous integration before the bad logic ever reaches production data.
60.5.2 5.2 Fixtures, Edge Cases, and Property Based Testing
Good data unit tests deliberately include edge cases: empty inputs, single row inputs, nulls in every nullable field, duplicate keys, and boundary values. Property based testing libraries such as Hypothesis generate many randomized inputs and check that invariants hold across all of them, which surfaces edge cases a human would not think to enumerate.
A useful property to assert is that a transformation never introduces nulls into a non nullable output column, or that an aggregation never produces a count larger than its input row count. These structural invariants hold regardless of the specific input values.
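The invariants above can be exercised with a hand rolled randomized loop, shown here as a standard library stand-in for what Hypothesis automates (Hypothesis additionally shrinks failing inputs to minimal examples). The aggregation count_by_status and the generator random_rows are hypothetical and exist only for this illustration.

```python
import random

def count_by_status(rows):
    """Example aggregation under test: number of rows per status."""
    counts = {}
    for row in rows:
        counts[row["status"]] = counts.get(row["status"], 0) + 1
    return counts

def random_rows(rng, max_rows=20):
    """Generate a random batch, including the empty batch."""
    statuses = ["paid", "refunded", "pending"]
    return [{"status": rng.choice(statuses)} for _ in range(rng.randint(0, max_rows))]

def check_invariants(trials=200, seed=0):
    """Assert structural invariants over many random inputs."""
    rng = random.Random(seed)  # seeded so failures are reproducible
    for _ in range(trials):
        rows = random_rows(rng)
        counts = count_by_status(rows)
        # No group count can exceed the number of input rows.
        assert all(c <= len(rows) for c in counts.values())
        # Group counts must add up to the input row count.
        assert sum(counts.values()) == len(rows)
        # The aggregation must not invent a null group key.
        assert None not in counts
    return trials

trials_run = check_invariants()
```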
60.5.3 5.3 The dbt Testing Model
In analytics engineering, dbt has popularized declarative data tests defined in YAML alongside model definitions. Built in generic tests cover the most common assertions, and custom tests are expressed as SQL queries that should return zero rows, where any returned row represents a failure.
    models:
      - name: orders
        columns:
          - name: order_id
            tests:
              - unique
              - not_null
          - name: status
            tests:
              - accepted_values:
                  values: ["placed", "shipped", "delivered", "returned"]

This model blurs the line between unit testing and production validation. The same assertions run during development against sample data and in production against the full warehouse, which is one of dbt’s most practical strengths.
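The zero rows convention for custom tests is easy to reproduce outside dbt. The sketch below uses Python's built in sqlite3 module with an in-memory table; the table, the query, and the run_data_test helper are assumptions for illustration and do not reflect how dbt executes tests internally.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, status TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "placed", 20.0), (2, "shipped", 35.5), (3, "returned", -5.0)],
)

# A custom test is a query that selects the rows violating the rule.
NEGATIVE_AMOUNTS = "SELECT order_id, amount FROM orders WHERE amount < 0"

def run_data_test(connection, query):
    """Return the failing rows; an empty list means the test passed."""
    return connection.execute(query).fetchall()

failures = run_data_test(conn, NEGATIVE_AMOUNTS)
```

Because the query returns the violating rows themselves, a failure arrives with the evidence needed to debug it.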
60.6 6. Tooling: Great Expectations and Pandera
60.6.1 6.1 Great Expectations
Great Expectations is a validation framework built around the concept of an Expectation, a declarative assertion about data such as expect_column_values_to_not_be_null or expect_column_mean_to_be_between. Expectations are grouped into Expectation Suites, validated against data through Checkpoints, and the results render as readable Data Docs that serve as both documentation and a test report.
    import great_expectations as gx

    # Written against the fluent API of the 0.16 to 0.18 releases;
    # later major versions reorganized these entry points.
    context = gx.get_context()
    validator = context.sources.pandas_default.read_dataframe(df)  # df: a pandas DataFrame

    validator.expect_column_values_to_not_be_null("user_id")
    validator.expect_column_values_to_be_between("age", 0, 120)
    validator.expect_column_values_to_be_in_set("country", ["US", "CA", "GB"])

    results = validator.validate()

Great Expectations excels at communication. Its Data Docs turn a validation run into a shareable artifact that non engineers can read, which makes data quality visible across an organization. A distinctive feature is automated profiling, which can inspect a reference dataset and suggest an initial suite of expectations, giving teams a starting point rather than a blank page.
60.6.2 6.2 Pandera
Pandera, introduced earlier, takes a lighter and more code centric approach. It integrates directly into Python data code, validates pandas, Polars, and PySpark frames, and raises exceptions that fit naturally into existing error handling. Pandera is well suited to validation that lives inside a transformation function, where you want a schema check to run inline and fail fast.
60.6.3 6.3 Choosing Between Them
The tools occupy different niches. Great Expectations favors organizational visibility, rich reporting, and a large catalog of prebuilt expectations, at the cost of more setup and a heavier conceptual footprint. Pandera favors developer ergonomics and tight integration into Python code, at the cost of less elaborate reporting. Many teams use both: Pandera for inline schema enforcement inside pipeline code, and Great Expectations for scheduled, reportable validation at dataset boundaries. The decision should follow the audience. If the primary consumers of validation results are engineers, lean toward Pandera. If they include analysts, data stewards, and managers, lean toward Great Expectations.
60.6.4 6.4 Adjacent Tooling
The ecosystem extends beyond these two. Soda offers a declarative checks language aimed at data reliability monitoring. Deequ, built on Spark, brings constraint verification and data profiling to large scale datasets and can suggest constraints automatically. Evidently focuses on drift and model monitoring with rich visual reports. The right combination depends on data scale, the prevailing language stack, and whether the emphasis is on batch validation or continuous monitoring.
60.7 7. Validating Data in Production Pipelines
60.7.1 7.1 Where to Place Validation Gates
Validation in production is about placement as much as technique. The standard pattern places gates at every boundary where data changes ownership: at ingestion, where external data enters the system; after each major transformation, where bugs are introduced; and before serving, where bad data would directly harm users or models. Validating only at the end leaves you knowing that something broke without knowing where. Validating at each boundary localizes failures to a specific stage.
A widely used architecture routes data through quality zones, often named for metals. Raw data lands in a bronze zone with minimal validation. Cleaned and conformed data moves to a silver zone after passing schema and constraint checks. Business level aggregates reach a gold zone only after statistical validation confirms they look reasonable. Each promotion between zones is a validation gate.
60.7.2 7.2 Fail Fast versus Quarantine
When validation fails in production, the pipeline must decide what to do with the offending data. Two strategies dominate. The fail fast strategy halts the pipeline, alerts an operator, and prevents bad data from propagating, which is appropriate when downstream consumers cannot tolerate any corruption. The quarantine strategy diverts failing records to a separate location for later inspection while allowing valid records to proceed, which keeps the pipeline flowing when partial data is better than none.
The right choice depends on the use case. A financial reconciliation pipeline should fail fast because a wrong number is worse than a late number. A recommendation feature pipeline might quarantine a small fraction of malformed events and continue, because serving slightly incomplete recommendations beats serving none. Encoding this policy explicitly, rather than letting it emerge from where exceptions happen to propagate, is a mark of a mature pipeline.
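Making the policy an explicit parameter is one way to keep it from emerging by accident. The sketch below is a minimal illustration; BatchRejected, apply_policy, and the sample events are hypothetical names and data.

```python
class BatchRejected(Exception):
    """Raised under the fail fast policy when any record is invalid."""

def apply_policy(records, is_valid, policy):
    """Split records into (valid, invalid) according to an explicit policy.

    'fail_fast' raises as soon as the batch is known to contain bad records.
    'quarantine' returns the bad records separately so valid ones can proceed.
    """
    if policy not in ("fail_fast", "quarantine"):
        raise ValueError(f"unknown policy: {policy}")
    valid, invalid = [], []
    for record in records:
        (valid if is_valid(record) else invalid).append(record)
    if invalid and policy == "fail_fast":
        raise BatchRejected(f"{len(invalid)} invalid record(s); batch halted")
    return valid, invalid

def has_user_id(event):
    return event.get("user_id") is not None

events = [
    {"user_id": 1, "item": "a"},
    {"user_id": None, "item": "b"},
    {"user_id": 3, "item": "c"},
]

served, quarantined = apply_policy(events, has_user_id, "quarantine")
```

Calling the same function with "fail_fast" on these events raises BatchRejected instead of returning a partial result.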
60.7.3 7.3 Batch versus Streaming Validation
Batch pipelines validate a whole dataset at once, which makes statistical checks straightforward because the full distribution is available. Streaming pipelines validate one record or a small window at a time, which constrains the kinds of checks that are feasible. A per record schema check is trivial in a stream. A distributional drift check is not, because no single record reveals a distribution.
Streaming systems address this by maintaining rolling windows and incremental statistics, computing approximate distributions over a recent window and comparing them against a baseline. Per record validation handles structural and constraint checks at the edge, while windowed aggregation handles statistical checks slightly behind the live edge. Records that fail structural checks are typically routed to a dead letter queue, the streaming analog of quarantine.
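The split between per record checks and windowed statistics can be sketched with a bounded deque and a running sum. This is a simplified illustration, assuming a single numeric field named value, a fixed baseline mean, and an in-memory list standing in for a dead letter queue; WindowedMonitor is a hypothetical class.

```python
from collections import deque

class WindowedMonitor:
    """Per record structural check plus a rolling mean compared to a baseline."""

    def __init__(self, baseline_mean, tolerance, window_size):
        self.baseline_mean = baseline_mean
        self.tolerance = tolerance
        self.window = deque(maxlen=window_size)
        self.total = 0.0          # running sum of the values in the window
        self.dead_letters = []    # stand-in for a dead letter queue

    def process(self, record):
        """Handle one record; return True if the window mean has drifted."""
        value = record.get("value")
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            self.dead_letters.append(record)  # structural failure
            return False
        if len(self.window) == self.window.maxlen:
            self.total -= self.window[0]      # value about to be evicted
        self.window.append(value)
        self.total += value
        if len(self.window) < self.window.maxlen:
            return False                      # window not yet full
        mean = self.total / len(self.window)
        return abs(mean - self.baseline_mean) > self.tolerance

monitor = WindowedMonitor(baseline_mean=10.0, tolerance=2.0, window_size=3)
stream = [
    {"value": 10}, {"value": "oops"}, {"value": 11},
    {"value": 9}, {"value": 20}, {"value": 21},
]
alerts = [monitor.process(record) for record in stream]
```

The malformed record is diverted immediately, while the drift alert only appears once enough shifted values have filled the window, which is the lag the text describes.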
60.7.4 7.4 Validation as Monitoring
In production, validation results are themselves a stream of metrics. The null rate of a column, the number of quarantined records, the drift score against baseline, and the pass rate of each expectation suite should all be emitted to a monitoring system and tracked over time. This transforms validation from a binary gate into an observable signal. A slow upward creep in a null rate, invisible to any single threshold check, becomes obvious on a time series chart and lets teams intervene before the rate crosses a failure boundary.
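A slow creep of this kind can be quantified with an ordinary least squares slope over the metric's recent history. The sketch below assumes a hypothetical series of daily null rates and a failure boundary of 0.05; every individual value passes the boundary, yet the trend is clearly upward.

```python
def trend_slope(series):
    """Least squares slope of a series against its index (units per step)."""
    n = len(series)
    mean_x = (n - 1) / 2
    mean_y = sum(series) / n
    numerator = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(series))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator

# Hypothetical daily null rates for one column.
null_rates = [0.010, 0.012, 0.015, 0.017, 0.021, 0.024, 0.028]
FAILURE_BOUNDARY = 0.05  # illustrative threshold

slope = trend_slope(null_rates)
all_pass_today = all(rate < FAILURE_BOUNDARY for rate in null_rates)
days_until_breach = (FAILURE_BOUNDARY - null_rates[-1]) / slope
```

Extrapolating a straight line is crude, but it is enough to turn a chart that someone has to look at into a number that can drive a ticket.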
Effective teams alert on validation metrics with the same rigor they apply to service latency and error rates. A failing expectation suite should page someone, a rising quarantine rate should open a ticket, and a sustained drift signal should trigger an investigation into whether a model needs retraining.
60.7.5 7.5 Versioning Expectations with Data
Validation logic evolves as understanding of the data deepens. New constraints get added, thresholds get tuned, and obsolete checks get removed. Because validation suites are code, they belong in version control alongside the pipeline they protect, reviewed through the same pull request process. When a validation rule changes, the change history explains why, which is invaluable during an incident review months later. Treating expectations as versioned artifacts also enables a disciplined workflow: propose a new constraint, test it against historical data to estimate its failure rate, and only then promote it to a blocking gate.
60.8 8. Building a Validation Strategy
60.8.1 8.1 Start at the Boundaries
A team cannot validate everything at once, and attempting to do so produces a brittle thicket of checks that nobody maintains. The pragmatic starting point is to enforce schemas at the boundaries where external data enters the system, because that is where most defects originate and where a single check protects the most downstream code. Schema enforcement at ingestion delivers the highest return for the least effort.
60.8.2 8.2 Layer in Depth Over Time
Once boundary schemas are in place, layer additional validation where incidents reveal it is needed. Each production data incident is a lesson that should become a permanent check, so that the same defect can never recur silently. This incident driven growth keeps the validation suite focused on real risks rather than hypothetical ones, and it ensures that every check earns its maintenance cost by guarding against a failure that actually happened.
60.8.3 8.3 Balancing Coverage and Maintenance
Every validation rule is code that must be maintained, and an overzealous suite that fires constantly on benign variation trains a team to ignore it, which is worse than having no validation at all. The goal is not maximal coverage but calibrated coverage, where each check that fires reliably signals a real problem worth a human’s attention. Periodically auditing which checks have fired, which have never fired, and which fire so often they are ignored keeps the suite healthy and trusted.
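Such an audit can start from nothing more than a count of how often each check fired. The sketch below assumes those counts are already available as a dictionary; the check names, the run count, and the 50 percent cutoff for calling a check noisy are illustrative assumptions.

```python
def audit_checks(fire_counts, total_runs, noisy_share=0.5):
    """Bucket checks by how often they fired across total_runs executions."""
    report = {"never_fired": [], "noisy": [], "informative": []}
    for name in sorted(fire_counts):
        fired = fire_counts[name]
        if fired == 0:
            report["never_fired"].append(name)
        elif fired / total_runs >= noisy_share:
            report["noisy"].append(name)
        else:
            report["informative"].append(name)
    return report

# Hypothetical fire counts over the last 90 pipeline runs.
history = {
    "order_id_unique": 0,
    "amount_range": 3,
    "email_format": 81,
    "country_in_set": 0,
}

summary = audit_checks(history, total_runs=90)
```

A check that never fires is not necessarily useless, since it may guard a rare but severe defect, so the never fired bucket is a prompt for review rather than a deletion list.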
60.9 References
- Great Expectations Documentation. https://docs.greatexpectations.io/
- Pandera Documentation. https://pandera.readthedocs.io/
- dbt Tests Documentation. https://docs.getdbt.com/docs/build/data-tests
- Soda Core Documentation. https://docs.soda.io/
- Amazon Deequ: Unit Tests for Data. https://github.com/awslabs/deequ
- Evidently AI Documentation. https://docs.evidentlyai.com/
- Hypothesis Property Based Testing. https://hypothesis.readthedocs.io/
- Schelter et al., “Automating Large Scale Data Quality Verification,” VLDB 2018. https://www.vldb.org/pvldb/vol11/p1781-schelter.pdf
- Polyzotis et al., “Data Validation for Machine Learning,” MLSys 2019. https://mlsys.org/Conferences/2019/doc/2019/167.pdf
- Apache Avro Schema Specification. https://avro.apache.org/docs/current/specification/
- Confluent Schema Registry Documentation. https://docs.confluent.io/platform/current/schema-registry/index.html
- Great Expectations, “Data Docs.” https://docs.greatexpectations.io/docs/reference/learn/terms/data_docs/