52 Data Types and Structures

Every machine learning system begins with a decision that is too often made implicitly: what kind of data are we working with, and how will it be represented in memory and on disk? This decision shapes everything downstream. The choice of model architecture, the loss function, the evaluation metric, the storage format, and even the hardware budget all follow from the nature of the data. A model is, in a real sense, a hypothesis about the structure of its inputs. A convolutional network assumes spatial locality. A transformer assumes that order matters and that long range dependencies are worth modeling. A gradient boosted tree assumes that the world can be carved up by axis aligned thresholds on tabular features. When the assumption matches the data, learning is efficient. When it does not, extra compute is an expensive and often inadequate substitute.

This chapter develops a working taxonomy of data types, examines the structural properties that distinguish them, surveys the storage formats that practitioners actually use, and shows how each data type drives concrete modeling choices.

52.0.1 A Unifying View: Data as Functions on Index Sets

A productive way to see past the surface variety is to treat every data type as a function defined on some index set. Tabular data is a function on an unordered set of columns. Text is a function on a totally ordered set of positions, $x : \{1, \ldots, T\} \to V$ for a vocabulary $V$. An image is a function on a two dimensional grid, $x : \{1, \ldots, H\} \times \{1, \ldots, W\} \to \mathbb{R}^C$. Audio is a function on a regularly spaced one dimensional grid. A graph is a function on a vertex set equipped with an adjacency relation rather than a fixed coordinate system. Under this lens, the central question for any data type is the symmetry group of its index set, the set of transformations that leave the meaning of the data unchanged. The right model is the one whose hypothesis class is invariant or equivariant under exactly that group. This is the modern geometric deep learning perspective (reference 11), and it explains in one sentence why convolutions suit images (translation symmetry), permutation invariant aggregators suit sets and graphs (no canonical ordering), and causal attention suits sequences (a directed order that must be respected).

A symmetry is a transformation $g$ of the index set such that the prediction target is unchanged when the input is transformed by $g$. A function $f$ is invariant if $f(g \cdot x) = f(x)$ and equivariant if $f(g \cdot x) = g \cdot f(x)$. Image classification wants invariance to translation, since a cat shifted ten pixels right is still a cat. Image segmentation wants equivariance, since the output mask must shift with the input. Matching the model's invariances and equivariances to this group structure is among the most reliable sources of inductive bias in applied machine learning.
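The two definitions can be checked numerically on a toy example. The sketch below assumes a cyclic one dimensional grid so that translation is well defined at the edges; the function names `shift`, `global_max`, and `local_diff` are hypothetical and exist only for this illustration.

```python
# Toy check of invariance and equivariance under cyclic translation.
# All names are illustrative assumptions, not taken from any library.

def shift(x, s):
    """Cyclic translation g: move every entry s positions to the right."""
    n = len(x)
    return [x[(i - s) % n] for i in range(n)]

def global_max(x):
    """A translation invariant readout: f(g . x) == f(x)."""
    return max(x)

def local_diff(x):
    """A translation equivariant map: difference with the left neighbour (cyclic)."""
    n = len(x)
    return [x[i] - x[(i - 1) % n] for i in range(n)]

x = [0, 0, 5, 1, 0, 0]
gx = shift(x, 2)

# Invariance: the readout ignores where the pattern sits.
invariant_ok = global_max(gx) == global_max(x)

# Equivariance: transforming the input transforms the output the same way.
equivariant_ok = local_diff(gx) == shift(local_diff(x), 2)

print(invariant_ok, equivariant_ok)
```

A classifier head behaves like `global_max` (the label does not move), while a segmentation head behaves like `local_diff` (the output moves with the input).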

52.1 1. Structured Versus Unstructured Data

52.1.1 1.1 The Classic Dichotomy

The oldest and coarsest distinction is between structured and unstructured data. Structured data has a predefined schema. Each record conforms to a fixed set of fields, each field has a declared type, and the meaning of a value is given by its position in the schema. A relational database table is the canonical example. Unstructured data lacks such a schema at the level of raw bytes. A photograph is a grid of pixel intensities, a document is a sequence of characters, and an audio clip is a sequence of amplitude samples. The semantic content is present, but it is not laid out in named, typed fields.

The dichotomy is useful but leaky. A great deal of practical data is better called semi structured. JSON and XML documents carry structure through nesting and keys, yet the schema may vary from record to record and may be deeply hierarchical rather than flat. Log lines, emails, and HTML pages all mix rigid structure with free text. For this reason it is more productive to think of a spectrum of structure rather than a binary, and to ask of any dataset: where does the structure live, and how much of the predictive signal it contains is exposed by that structure?

It helps to separate three layers that are often conflated. The physical layer is the byte layout on disk or wire (CSV rows, a PNG, a Parquet column chunk). The logical layer is the schema or type: a relation with named typed columns, a tensor of a given shape, a graph with typed edges. The semantic layer is the meaning a human or model assigns to the values. Structured data is precisely data whose logical layer is explicit and fixed in advance. Unstructured data has a trivial logical layer (a flat array of bytes or samples) but rich semantics that a model must recover. Semi structured data has a logical layer that is present but variable. Most engineering pain comes from confusing these layers, for example treating a JSON blob as a single opaque string when its keys in fact carry the schema.

52.1.2 1.2 Why the Distinction Matters for Modeling

The practical consequence is the amount of representation learning required. Structured tabular data arrives with features that are already meaningful: an age, a price, a category. A model can operate directly on these. Unstructured data requires a representation learning stage that maps raw bytes into a vector space where the relevant geometry reflects semantic similarity. This is precisely the work that deep networks do well and that classical methods do poorly. The historical dominance of deep learning in vision, speech, and language, contrasted with the continued strength of tree ensembles on tabular problems, is a direct reflection of this split. Where features are engineered by the world, classical models compete. Where features must be learned from raw signal, depth wins.

52.2 2. Tabular Data

52.2.1 2.1 Structure and Semantics

Tabular data is the workhorse of applied analytics. It is organized as a table of $n$ rows and $d$ columns, conventionally written as a matrix $X \in \mathbb{R}^{n \times d}$ once every column has been encoded numerically, where each row is an observation and each column is a feature. Critically, the columns are heterogeneous. Some are continuous numeric values, some are integer counts, some are categorical with no natural order, some are ordinal, and some are dates or identifiers. There is no meaningful notion of locality between adjacent columns; permuting the column order changes nothing about the information content. This permutation invariance is a defining property, and it is exactly why architectures built on locality assumptions, such as convolutions, are a poor fit.

52.2.2 2.2 Modeling Implications

The heterogeneity of columns demands per feature preprocessing. Continuous features are typically standardized to zero mean and unit variance, $z = (x - \mu) / \sigma$, or scaled to a bounded range. Categorical features are encoded, whether by one hot expansion, target encoding, or learned embeddings. Missing values must be handled explicitly through imputation or through models that natively support them.
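As a minimal sketch of the standardization step, the snippet below fits $\mu$ and $\sigma$ on observed values and applies $z = (x - \mu) / \sigma$. The mean imputation of missing entries and the helper names `fit_standardizer` and `standardize` are assumptions made for illustration; other imputation strategies are equally valid.

```python
from statistics import mean, pstdev

def fit_standardizer(values):
    """Estimate mu and sigma from the observed (non-missing) values only."""
    observed = [v for v in values if v is not None]
    mu = mean(observed)
    sigma = pstdev(observed)
    if sigma == 0:
        sigma = 1.0  # constant column: avoid division by zero
    return mu, sigma

def standardize(values, mu, sigma):
    """Apply z = (x - mu) / sigma; missing values are imputed with mu, so z = 0."""
    return [((mu if v is None else v) - mu) / sigma for v in values]

# Statistics are fit on training rows, then reused unchanged on new rows.
income_train = [58000, 92000, 41000]
mu, sigma = fit_standardizer(income_train)

z = standardize(income_train + [None], mu, sigma)
print([round(v, 3) for v in z])
```

Fitting the statistics once and reusing them at inference time is the design choice that matters here: recomputing $\mu$ and $\sigma$ on new data would silently change the meaning of every feature.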

For prediction on tabular data, gradient boosted decision trees, as implemented in XGBoost, LightGBM, and CatBoost, remain the default choice and frequently outperform deep networks on medium sized datasets. The reason is structural. Trees naturally handle mixed types, are invariant to monotone transformations of individual features, are robust to uninformative features, and capture the axis aligned, non smooth decision boundaries that tabular targets often exhibit. Deep tabular models such as TabNet and various transformer variants have narrowed the gap but have not decisively overturned this picture for the typical mid sized industrial dataset.

```text
row_id  age  income   region   churned
1       34   58000    west     0
2       51   92000    east     1
3       29   41000    south    0
```

52.2.3 2.3 Worked Example: Encoding a Categorical Column

Consider the region column above, a nominal categorical with no natural order. The three common encodings illustrate the trade space. One hot encoding maps each of $K$ categories to a unit basis vector in $\mathbb{R}^K$, so west, east, and south become $(1,0,0)$, $(0,1,0)$, $(0,0,1)$. This is lossless and order free but explodes dimensionality when $K$ is large and produces a sparse design matrix. Target encoding replaces a category with a statistic of the target conditioned on that category, for instance the smoothed mean churn rate

\[ \hat{\theta}_k = \frac{n_k \bar{y}_k + \alpha \bar{y}}{n_k + \alpha}, \]

where $n_k$ is the count of rows in category $k$, $\bar{y}_k$ is the in category target mean, $\bar{y}$ is the global mean, and $\alpha$ is a smoothing strength that shrinks small categories toward the global prior. This keeps dimensionality at one column but leaks the target and must therefore be fit inside cross validation folds, never on the full data, to avoid optimistic bias. Learned embeddings map each category to a trainable vector in $\mathbb{R}^m$ with $m \ll K$, which a neural network can tune end to end and which lets similar categories sit near one another. The pitfall common to all three is the unseen category at inference time: a one hot encoder must either raise an error or emit an all zeros vector, target encoding falls back to the global prior, and embeddings reserve an unknown slot. CatBoost sidesteps much of this by handling categoricals natively with an ordered target statistic, which computes each row's encoding only from rows that precede it in a random permutation and thereby reduces target leakage.
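The first two encodings can be sketched on the three rows of the example table. The smoothing strength $\alpha = 2$ and the function names are assumptions chosen for illustration, and the all zeros convention for unseen categories is one option among several.

```python
from collections import defaultdict

def one_hot(value, categories):
    """Unit basis vector for a known category; all zeros for an unseen one
    (one common convention, assumed here)."""
    return [1 if value == c else 0 for c in categories]

def fit_target_encoding(cats, targets, alpha):
    """Smoothed mean: (n_k * ybar_k + alpha * ybar) / (n_k + alpha)."""
    global_mean = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for c, y in zip(cats, targets):
        sums[c] += y      # sums[c] equals n_k * ybar_k
        counts[c] += 1    # counts[c] equals n_k
    table = {c: (sums[c] + alpha * global_mean) / (counts[c] + alpha)
             for c in counts}
    return table, global_mean

region = ["west", "east", "south"]
churned = [0, 1, 0]

categories = sorted(set(region))           # ['east', 'south', 'west']
print(one_hot("west", categories))         # known category
print(one_hot("north", categories))        # unseen category

# alpha = 2 is an assumed smoothing strength, not a recommended default.
table, prior = fit_target_encoding(region, churned, alpha=2.0)
encode = lambda c: table.get(c, prior)     # unseen categories fall back to the prior
print(round(encode("east"), 4), round(encode("west"), 4), round(encode("north"), 4))
```

With one row per category, every estimate is pulled strongly toward the global churn rate of $1/3$: `east` lands at $5/9$ rather than $1$, and `west` at $2/9$ rather than $0$. In a real pipeline `fit_target_encoding` would be called on the training fold only.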

52.3 3. Text Data

52.3.1 3.1 Structure and Semantics

Text is a sequence of discrete tokens drawn from a finite vocabulary. Two structural properties dominate. First, order carries meaning; “dog bites man” and “man bites dog” share tokens but not sense. Second, dependencies can span long distances, so the resolution of a pronoun may depend on a noun many sentences earlier. Text is also discrete and high dimensional in its raw form, since a vocabulary may contain tens of thousands of tokens.

52.3.2 3.2 Modeling Implications

The first step is tokenization, the segmentation of raw text into units. Modern systems use subword schemes such as byte pair encoding or WordPiece, which balance vocabulary size against the ability to represent rare words. Tokens are then mapped to dense vectors through an embedding matrix $E \in \mathbb{R}^{|V| \times h}$, where $|V|$ is the vocabulary size and $h$ the embedding dimension.

The transformer architecture, built on self attention, is now the standard model for text. Self attention computes, for every pair of positions, a weighting

\[ \text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V, \]

which lets each token aggregate information from any other token regardless of distance, directly addressing the long range dependency problem. Here $Q$, $K$, and $V$ are the query, key, and value projections of the input, and $d_k$ is the key dimension, with the scaling by $\sqrt{d_k}$ keeping the dot products in a range where the softmax has usable gradients. The cost is $O(T^2 d)$ in time and $O(T^2)$ in memory for sequence length $T$, which is the central engineering tension and motivates much of the research into efficient and sparse attention variants. Self attention is permutation equivariant by construction, so positional information must be injected separately through positional encodings; without them the model would treat a sentence as a bag of tokens. The discreteness of text also means that generation is framed as classification over the vocabulary at each step, optimized with a cross entropy loss whose minimization is equivalent to maximizing the likelihood of the training corpus.
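A dependency free sketch of the attention formula makes two of these claims concrete: the score matrix has $T \times T$ entries, and without positional encodings the operation is permutation equivariant. The example assumes identity projections (so $Q = K = V = X$), which is a simplification for illustration rather than how a trained layer behaves.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, written with plain lists of rows."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # One row of the T x T score matrix: this query against every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Three tokens with two dimensional embeddings; projections assumed to be identity.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = attention(X, X, X)

# Permute the tokens: the outputs are permuted in exactly the same way.
perm = [2, 0, 1]
Xp = [X[i] for i in perm]
out_p = attention(Xp, Xp, Xp)
max_gap = max(abs(a - b)
              for row_p, i in zip(out_p, perm)
              for a, b in zip(row_p, out[i]))
print(max_gap < 1e-9)
```

Because reordering the input merely reorders the output, nothing in the computation distinguishes "dog bites man" from "man bites dog" until positional information is added to `X`.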

52.4 4. Image Data

52.4.1 4.1 Structure and Semantics

An image is a regular grid of pixels. A color image is a tensor of shape $H \times W \times C$, with height, width, and channels, the channels usually being red, green, and blue. The defining structural property is spatial locality: nearby pixels are strongly correlated, and meaningful features such as edges and textures are local patterns that can appear anywhere in the frame. This gives rise to a second property, translation symmetry: an object retains its identity when shifted across the image, so a feature detector should respond in the same way wherever the pattern appears.

52.4.2 4.2 Modeling Implications

These two properties motivate the convolution. A convolutional layer applies the same small filter across all spatial positions. For a single channel input $x$ and a $k \times k$ kernel $w$, the output at position $(i, j)$ is

\[ (x * w)_{i,j} = \sum_{a=1}^{k} \sum_{b=1}^{k} x_{i+a,\, j+b}\, w_{a,b}, \]

and because the same weights $w$ are reused at every location, the layer is translation equivariant: away from the image boundary, shifting the input shifts the output identically. This weight sharing is also what makes convolutions parameter efficient, since a layer needs only $k^2$ weights per channel pair regardless of image size, whereas a fully connected layer mapping an $H \times W$ image to an output of the same size would need parameters proportional to $H^2 W^2$. Stacking convolutions with pooling builds a hierarchy from edges to textures to parts to objects, and the effective receptive field grows with depth so that deep layers see large regions of the image. Convolutional networks dominated computer vision for a decade.
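The sum above, and the equivariance it implies, can be verified on a small integer image. The sketch uses zero based indices and no padding, and the `shift_right` helper and the specific kernel are assumptions for illustration. Note that the formula as written (no kernel flip) is the cross correlation that deep learning libraries conventionally call convolution.

```python
def conv2d_valid(x, w):
    """Slide a k x k kernel over x without padding (zero based indices)."""
    k = len(w)
    H, W = len(x), len(x[0])
    return [[sum(x[i + a][j + b] * w[a][b] for a in range(k) for b in range(k))
             for j in range(W - k + 1)]
            for i in range(H - k + 1)]

def shift_right(x, s):
    """Translate the image s pixels to the right, filling with zeros on the left."""
    return [[0] * s + row[:len(row) - s] for row in x]

x = [[0, 0, 0, 0, 0, 0],
     [0, 1, 2, 0, 0, 0],
     [0, 3, 4, 0, 0, 0],
     [0, 0, 0, 0, 0, 0],
     [0, 0, 0, 0, 0, 0]]
w = [[1, -1],
     [1, -1]]  # a simple vertical edge detector

y = conv2d_valid(x, w)
y_shifted = conv2d_valid(shift_right(x, 1), w)

# Equivariance: every output column of the shifted image equals the previous
# output column of the original image.
equivariant = all(y_shifted[i][j] == y[i][j - 1]
                  for i in range(len(y)) for j in range(1, len(y[0])))

H, W, k = len(x), len(x[0]), len(w)
conv_params = k * k            # independent of image size
dense_params = (H * W) ** 2    # fully connected map from H x W to H x W
print(equivariant, conv_params, dense_params)
```

Even on this 5 by 6 image the fully connected alternative needs 900 weights against 4 for the kernel, and the gap grows quadratically with the pixel count.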

The Vision Transformer reframed images as sequences by splitting them into fixed size patches, embedding each patch, and feeding the sequence to a transformer. With sufficient data and pretraining, this approach matches or exceeds convolutional networks, trading the strong locality prior for greater flexibility and scale. The lesson is general: strong architectural priors help when data is scarce, while flexible architectures win when data is abundant. Either way, the high dimensionality of pixels, where a modest image holds hundreds of thousands of values, makes representation learning essential, since raw pixels are a poor feature space for any classical method.

52.5 5. Audio Data

52.5.1 5.1 Structure and Semantics

Audio is a one dimensional signal: a sequence of amplitude samples taken at a fixed sampling rate, commonly 16 kHz for speech or 44.1 kHz for music. The structure is temporal, and the information is carried largely in the frequency content as it evolves over time. Sampling rate sets the highest representable frequency through the Nyquist limit, $f_{\max} = f_s / 2$, so a 16 kHz rate captures frequencies up to 8 kHz, sufficient for intelligible speech.
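The Nyquist limit has a concrete consequence that a few lines of arithmetic can show: a tone above $f_s / 2$ produces exactly the same samples as a lower frequency alias. The sketch below assumes a 10 kHz cosine sampled at 16 kHz; the helper names are hypothetical.

```python
import math

def nyquist_limit(fs_hz):
    """Highest frequency representable at sampling rate fs."""
    return fs_hz / 2

def sample_cosine(freq_hz, fs_hz, n_samples):
    """Samples of cos(2 pi f t) taken at t = n / fs."""
    return [math.cos(2 * math.pi * freq_hz * n / fs_hz) for n in range(n_samples)]

fs = 16_000
tone = 10_000                      # above the 8 kHz limit
alias = fs - tone                  # 6 kHz, the frequency it folds down to

high = sample_cosine(tone, fs, 32)
low = sample_cosine(alias, fs, 32)

# The two sample sequences are indistinguishable up to rounding error.
max_gap = max(abs(a - b) for a, b in zip(high, low))
print(nyquist_limit(fs), alias, max_gap < 1e-9)
```

Once sampled, no model can tell the two tones apart, which is why audio is low pass filtered before downsampling and why a sampling rate mismatch between training and inference is so damaging.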

52.5.2 5.2 Modeling Implications

Raw waveforms are dense and high rate, with tens of thousands of samples per second. A common and powerful move is to convert the waveform into a time frequency representation, the spectrogram, computed by the short time Fourier transform. The mel spectrogram further warps the frequency axis to match human perception. This transforms a one dimensional signal into a two dimensional image like array, which means that convolutional and transformer architectures developed for vision and sequences transfer naturally. End to end models that learn directly from the waveform also exist, but spectrogram front ends remain a robust and widely used default in speech recognition and audio classification.
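A magnitude spectrogram can be sketched with a naive discrete Fourier transform over windowed frames. This is deliberately slow and omits the mel warping; the sampling rate of 8 kHz, the frame length of 64, the hop of 32, and the function name `stft_magnitude` are all assumptions for illustration, and a real pipeline would use an FFT library.

```python
import cmath
import math

def stft_magnitude(signal, frame_len, hop):
    """Magnitude of the short time Fourier transform, one list of bins per frame."""
    # Periodic Hann window to reduce spectral leakage.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        mags = []
        for k in range(frame_len // 2 + 1):   # non negative frequencies only
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            mags.append(abs(acc))
        frames.append(mags)
    return frames

fs = 8_000
signal = [math.sin(2 * math.pi * 1_000 * n / fs) for n in range(256)]

spec = stft_magnitude(signal, frame_len=64, hop=32)

# Bin spacing is fs / frame_len = 125 Hz, so a 1 kHz tone should peak at bin 8.
peak_bins = [max(range(len(f)), key=lambda k: f[k]) for f in spec]
print(len(spec), len(spec[0]), peak_bins)
```

The result is a frames by bins array, which is the two dimensional, image like object that vision architectures can consume directly.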

52.6 6. Video Data

52.6.1 6.1 Structure and Semantics

Video adds a temporal axis to images, yielding a tensor of shape $T \times H \times W \times C$ across $T$ frames. It therefore inherits spatial locality within each frame and adds temporal locality across frames, since consecutive frames are highly redundant. This redundancy is both a burden, because of the sheer data volume, and an opportunity, because motion itself is a rich signal.

52.6.2 6.2 Modeling Implications

The central challenge is scale. A few seconds of video contains far more raw values than a single image, which makes naive processing of every frame at full resolution prohibitive. Practical systems use three dimensional convolutions that span space and time, two stream designs that process appearance and motion separately, or video transformers with factorized spatial and temporal attention to control cost. Temporal redundancy is exploited by sampling frames sparsely rather than processing every one. Modeling choices here are dominated by the compute and memory budget as much as by accuracy, and efficiency is a first class concern rather than an afterthought.
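The scale argument is easy to quantify. The sketch below assumes a three second clip at 30 frames per second and 224 by 224 RGB frames, figures chosen only for illustration, and shows one simple sparse sampling rule that takes the centre frame of each equal length segment.

```python
def video_values(T, H, W, C):
    """Number of raw values in a T x H x W x C video tensor."""
    return T * H * W * C

def sample_frame_indices(num_frames, num_samples):
    """Pick the centre frame of each of num_samples equal segments."""
    segment = num_frames / num_samples
    return [int(segment * (i + 0.5)) for i in range(num_samples)]

# Assumed clip: 3 seconds at 30 fps, 224 x 224 RGB.
T, H, W, C = 90, 224, 224, 3

per_image = video_values(1, H, W, C)
per_clip = video_values(T, H, W, C)

indices = sample_frame_indices(T, 8)
sampled = video_values(len(indices), H, W, C)

print(per_image, per_clip, indices, per_clip / sampled)
```

Keeping 8 of 90 frames cuts the input by a factor of about 11 before any model is involved, which is often a larger saving than any architectural change.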

52.7 7. Graph Data

52.7.1 7.1 Structure and Semantics

A graph $G = (V, E)$ consists of nodes and edges, with optional features attached to each. Graphs represent relational data: social networks, molecules, knowledge bases, transportation systems, and recommendation interactions. The defining structural property is that there is no fixed ordering of nodes and no fixed neighborhood size. A given node may have two neighbors or two thousand. Any valid model must be permutation invariant or equivariant, producing the same output regardless of how the nodes happen to be numbered.

52.7.2 7.2 Modeling Implications

This irregularity rules out architectures that assume a grid or a sequence. The dominant approach is the graph neural network, which operates by message passing. Each node iteratively updates its representation by aggregating messages from its neighbors:

\[ h_v^{(l+1)} = \phi\!\left(h_v^{(l)}, \; \bigoplus_{u \in \mathcal{N}(v)} \psi\big(h_v^{(l)}, h_u^{(l)}\big)\right), \]

where $\bigoplus$ is a permutation invariant aggregator such as sum or mean. After $k$ rounds, each node’s representation reflects its $k$ hop neighborhood. The permutation invariant aggregator is what encodes the absence of node ordering directly into the architecture. Sparsity is the key efficiency property, since real graphs have far fewer edges than the $|V|^2$ of a dense adjacency matrix, and efficient implementations exploit this.
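A single round of message passing fits in a few lines when the graph is stored as an adjacency list. The sketch assumes the simplest possible choices, $\psi(h_v, h_u) = h_u$, sum aggregation, and $\phi(h_v, m) = h_v + m$, on scalar features; real layers use learned functions in place of these.

```python
def message_passing_round(h, neighbors):
    """One update: h_v <- phi(h_v, sum over neighbours of psi(h_v, h_u))."""
    new_h = {}
    for v in h:
        aggregated = sum(h[u] for u in neighbors[v])  # psi = h_u, aggregator = sum
        new_h[v] = h[v] + aggregated                  # phi = h_v + aggregated
    return new_h

# Path graph a - b - c - d stored sparsely: only existing edges are listed.
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
h0 = {"a": 1, "b": 0, "c": 0, "d": 0}   # signal starts at node a only

h1 = message_passing_round(h0, neighbors)
h2 = message_passing_round(h1, neighbors)
h3 = message_passing_round(h2, neighbors)
print(h1, h2, h3)

# Listing the neighbours in a different order changes nothing.
reordered = {"a": ["b"], "b": ["c", "a"], "c": ["d", "b"], "d": ["c"]}
same = message_passing_round(h0, reordered) == h1
print(same)
```

Node `d` sits three hops from `a` and stays at zero until the third round, which illustrates the statement that $k$ rounds expose a $k$ hop neighborhood.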

52.8 8. Time Series Data

52.8.1 8.1 Structure and Semantics

A time series is a sequence of observations indexed by time, $x_1, x_2, \ldots, x_T$, which may be univariate or multivariate. It shares temporal ordering with text and audio, but it carries distinctive structure: trend, the long run direction; seasonality, periodic patterns at fixed intervals; and autocorrelation, the dependence of a value on its own recent past. Time series may be regularly or irregularly sampled, and they raise the issue of stationarity, whether the statistical properties are constant over time.

52.8.2 8.2 Modeling Implications

The temporal ordering imposes a hard constraint that distinguishes time series work from most other data: information must not leak from the future into the past. Train and test splits must respect chronology, and cross validation must use forward chaining rather than random folds. Classical statistical models such as ARIMA model the series, after differencing to remove trend, as a combination of autoregressive and moving average terms; they remain strong baselines, particularly for univariate problems, and seasonal variants such as SARIMA handle clear seasonality. Modern approaches apply recurrent networks, temporal convolutions, and transformers adapted to long horizons. Across methods, feature engineering of lagged values, rolling statistics, and calendar effects often contributes as much as the model class. The recurring lesson is that respecting the arrow of time in both modeling and evaluation matters more than the choice of architecture.
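Forward chaining can be sketched as an expanding training window followed by a test block that always lies strictly later in time. The function name, the minimum training size, and the equal fold width are assumptions for this illustration.

```python
def forward_chaining_splits(n_samples, n_folds, min_train):
    """Expanding window splits: each test block follows its training block in time."""
    fold_size = (n_samples - min_train) // n_folds
    splits = []
    for i in range(n_folds):
        train_end = min_train + i * fold_size
        test_end = train_end + fold_size
        train_idx = list(range(train_end))            # everything up to the cut
        test_idx = list(range(train_end, test_end))   # the block right after it
        splits.append((train_idx, test_idx))
    return splits

splits = forward_chaining_splits(n_samples=10, n_folds=3, min_train=4)
for train_idx, test_idx in splits:
    print(train_idx, "->", test_idx)

# No training index is ever later than a test index.
no_leak = all(max(train_idx) < min(test_idx) for train_idx, test_idx in splits)
print(no_leak)
```

A random shuffle of the same ten indices would almost certainly place later observations in the training set of an earlier test point, which is the leak the text warns against.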

52.9 9. Storage Formats

The logical type of data is one concern; how it is serialized to disk and moved between systems is another, and it has large practical consequences for speed, size, and interoperability.

52.9.1 9.1 CSV

Comma separated values is the lowest common denominator. It is a plain text, row oriented format that is human readable and universally supported. Its weaknesses are equally well known. It carries no type information, so every field is text until parsed, and conventions for encoding, quoting, and null values vary between producers (RFC 4180 describes a common dialect but is not universally followed). It is verbose, has no built in compression, and must be scanned in full even to read a single column. CSV is excellent for small data and interchange and poor for analytical workloads at scale.
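The lack of types and the ambiguity of nulls are visible as soon as a file is parsed. The sketch below uses Python's standard `csv` module on an in memory string; the `schema` dictionary and the rule that an empty field means missing are assumptions that a real pipeline would have to agree on with the producer.

```python
import csv
import io

raw = "row_id,age,region\n1,34,west\n2,,east\n3,29,south\n"

# The parser returns every field as a string, including the numbers.
rows = list(csv.DictReader(io.StringIO(raw)))
print(rows[0])
print(repr(rows[1]["age"]))   # the missing age is just an empty string

# Types and null handling must be supplied from outside the file.
schema = {"row_id": int, "age": int, "region": str}   # assumed schema

def parse_row(row):
    """Cast each field; treat the empty string as a missing value (a convention)."""
    return {name: (None if row[name] == "" else cast(row[name]))
            for name, cast in schema.items()}

typed = [parse_row(r) for r in rows]
print(typed[1])
```

Formats that embed a schema, such as Parquet, remove this step and the class of bugs that comes with it.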

52.9.2 9.2 Parquet

Apache Parquet is a columnar, binary format designed for analytics. Storing data by column rather than by row yields two major advantages. First, a query that touches a few columns reads only those columns, which slashes input and output for wide tables. Second, values within a column share a type and tend to be similar, which makes compression and encoding schemes such as dictionary and run length encoding highly effective. Parquet also stores a schema and per column statistics that allow whole blocks to be skipped during filtering. It is the standard for data lakes and large scale batch processing.
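The reason columnar layout compresses well can be shown with toy versions of dictionary and run length encoding. This sketch illustrates the idea only; it does not reproduce Parquet's actual on disk encodings, and the function names are hypothetical.

```python
def dictionary_encode(column):
    """Replace each distinct value with a small integer code."""
    dictionary, index, codes = [], {}, []
    for value in column:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes

def run_length_encode(codes):
    """Collapse consecutive repeats into (code, run length) pairs."""
    runs = []
    for code in codes:
        if runs and runs[-1][0] == code:
            runs[-1][1] += 1
        else:
            runs.append([code, 1])
    return [tuple(r) for r in runs]

def decode(dictionary, runs):
    """Invert both encodings to recover the original column."""
    return [dictionary[code] for code, length in runs for _ in range(length)]

region = ["west", "west", "west", "east", "east",
          "south", "south", "south", "south"]

dictionary, codes = dictionary_encode(region)
runs = run_length_encode(codes)
print(dictionary, codes, runs)
print(decode(dictionary, runs) == region)
```

Nine strings shrink to a three entry dictionary and three pairs. The same values interleaved with ages and identifiers in a row oriented file would offer no such runs to exploit.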

```text
CSV (row oriented):        Parquet (column oriented):
1,34,west                  [1,2,3]
2,51,east                  [34,51,29]
3,29,south                 [west,east,south]
```

52.9.3 9.3 JSON

JavaScript Object Notation is the standard for semi structured and hierarchical data. It is text based, human readable, and represents nested objects and arrays naturally, which makes it ubiquitous in web APIs and document stores. Its costs are verbosity and parsing overhead, and like CSV it is row oriented and untyped beyond a few primitives. For high volume pipelines, a binary or columnar relative is usually preferred, but JSON remains the default for configuration, interchange, and any data whose shape varies record to record.
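Record to record variation in shape is what makes JSON awkward to feed to a tabular model. The sketch below parses two hypothetical records with the standard `json` module and flattens nested objects into dotted column names; the record contents and the `flatten` helper are invented for illustration.

```python
import json

# Two hypothetical records whose shapes differ: the second has no "plan" key.
lines = [
    '{"user": {"id": 1, "plan": "pro"}, "events": [{"type": "login"}]}',
    '{"user": {"id": 2}, "events": []}',
]
records = [json.loads(line) for line in lines]

def flatten(obj, prefix=""):
    """Turn nested objects into dotted keys; arrays are kept as values."""
    flat = {}
    for key, value in obj.items():
        name = prefix + key
        if isinstance(value, dict):
            flat.update(flatten(value, name + "."))
        else:
            flat[name] = value
    return flat

flat = [flatten(r) for r in records]

# The union of keys defines the columns; absent keys become None.
columns = sorted({key for f in flat for key in f})
table = [[f.get(c) for c in columns] for f in flat]
print(columns)
print(table)
```

The schema here is discovered from the data rather than declared up front, which is the defining trait of semi structured data and the reason a missing key must be treated as a first class case.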

52.9.4 9.4 Arrow

Apache Arrow is a columnar, in memory format rather than primarily a disk format. Its purpose is to provide a single, language independent representation of columnar data in RAM so that different tools and languages can share data without serialization and copying. Where Parquet optimizes data at rest, Arrow optimizes data in motion and in compute. Its zero copy design and cache friendly columnar layout enable vectorized processing and let Python, Java, R, and others operate on the same buffers. Parquet and Arrow are complementary and frequently paired: Parquet on disk, Arrow in memory.

52.9.5 9.5 Choosing a Format

The practical guidance follows from the access pattern. Use CSV or JSON for small data, interchange, and human inspection. Use Parquet for analytical storage, columnar access, and large datasets at rest. Use Arrow as the in memory backbone for fast, cross language data processing pipelines. The cost of a wrong choice is rarely correctness, but it is often an order of magnitude in storage and processing time.

52.10 10. How Data Type Drives Modeling Choices

Pulling the threads together, the data type determines a chain of decisions. The structural properties of the data, locality, ordering, permutation invariance, and dimensionality, dictate which architectural priors are appropriate. Locality favors convolutions, ordering favors sequence models, the absence of ordering favors permutation invariant aggregation, and abundant data favors flexible attention based models over strong priors. The same properties dictate evaluation: time series demand chronological splits, while independently sampled tabular rows permit random ones. They dictate preprocessing: tabular data needs per column treatment, text needs tokenization, audio benefits from spectral transforms.

The mature practitioner reads a dataset’s type as a specification. It tells them which inductive biases to build in, which storage format will keep the pipeline fast, which leakage risks to guard against, and which family of models is likely to repay the effort of tuning. The most common and costly errors in applied machine learning are not subtle failures of optimization but mismatches between the structure of the data and the assumptions of the method. Getting the data type right, in both representation and storage, is the foundation on which everything else is built.

The following diagram traces the chain from a structural property of the data to a default modeling family. It is a heuristic starting point, not a law, and abundant data plus pretraining can always justify replacing a strong prior with a more flexible architecture.

```mermaid
flowchart TD
    A["What is the index structure of the data"] --> B["Unordered fields, mixed types"]
    A --> C["Total order over positions"]
    A --> D["Regular spatial grid"]
    A --> E["Vertices with adjacency"]
    B --> B1["Tabular: gradient boosted trees"]
    C --> C1["Text or time series: sequence models and attention"]
    D --> D1["Image or audio spectrogram: convolutions or patch transformers"]
    E --> E1["Graph: message passing networks"]
```

52.10.1 10.1 When to Use and Common Pitfalls

A compact decision table summarizes the defaults and the traps. The recurring theme is that the index structure and its symmetry group fix the inductive bias, while the access pattern fixes the storage format.

| Data type | Default model family | Storage default | Most common pitfall |
|---|---|---|---|
| Tabular | Gradient boosted trees | Parquet | Target leakage from encodings or aggregations fit on full data |
| Text | Transformer with subword tokens | JSON or Parquet | Tokenization mismatch between training and inference |
| Image | Convolutional net or patch transformer | Binary formats with metadata | Forgetting normalization statistics used at training time |
| Audio | Spectrogram front end plus vision or sequence model | Compressed binary | Sampling rate mismatch, and aliasing of content above the Nyquist limit |
| Video | Three dimensional convolutions or factorized attention | Compressed binary | Processing every frame at full resolution, exhausting the compute budget |
| Graph | Message passing network | Adjacency lists or edge tables | Materializing a dense adjacency matrix when the graph is sparse, or ignoring permutation invariance |
| Time series | ARIMA baseline, then recurrent, convolutional, or transformer | Parquet | Random splits that leak the future into the past |

The single most expensive mistake across all types is leakage, the silent contamination of training with information unavailable at prediction time. It hides in target encodings fit on the whole dataset, in features computed across a train test boundary, and in random shuffles of inherently ordered data. When a model scores well in offline evaluation and then fails in production, a leak is among the first causes to rule out, alongside distribution shift. Auditing the data pipeline for leakage repays more than nearly any architectural refinement.
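The target encoding leak is easy to reproduce in its most extreme form. The sketch below assumes an identifier like column where every category occurs once, with no smoothing, so that encoding on the full data simply copies the label into the feature; the data and helper name are invented for illustration.

```python
def fit_means(cats, ys):
    """Per category target mean plus the global mean as a fallback."""
    sums, counts = {}, {}
    for c, y in zip(cats, ys):
        sums[c] = sums.get(c, 0) + y
        counts[c] = counts.get(c, 0) + 1
    return {c: sums[c] / counts[c] for c in sums}, sum(ys) / len(ys)

# Hypothetical high cardinality column: one category per row.
cats = ["u1", "u2", "u3", "u4", "u5", "u6"]
ys = [0, 1, 0, 1, 1, 0]

# Leaky: statistics fit on all rows, then applied to the same rows.
table, prior = fit_means(cats, ys)
leaky = [table[c] for c in cats]
print(leaky)   # identical to the labels

# Out of fold: each row is encoded using only the other fold.
folds = [[0, 1, 2], [3, 4, 5]]
oof = [None] * len(cats)
for held_out in folds:
    train = [i for i in range(len(cats)) if i not in held_out]
    fold_table, fold_prior = fit_means([cats[i] for i in train],
                                       [ys[i] for i in train])
    for i in held_out:
        oof[i] = fold_table.get(cats[i], fold_prior)
print([round(v, 3) for v in oof])
```

The leaky feature would give any model a perfect offline score and nothing in production, while the out of fold version correctly reveals that the column carries no usable signal.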

52.11 References

1. Apache Parquet documentation. https://parquet.apache.org/docs/
2. Apache Arrow documentation. https://arrow.apache.org/docs/
3. Vaswani, A. et al. Attention Is All You Need. 2017. https://arxiv.org/abs/1706.03762
4. Dosovitskiy, A. et al. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. 2020. https://arxiv.org/abs/2010.11929
5. Grinsztajn, L. et al. Why Do Tree Based Models Still Outperform Deep Learning on Tabular Data? 2022. https://arxiv.org/abs/2207.08815
6. Chen, T. and Guestrin, C. XGBoost: A Scalable Tree Boosting System. 2016. https://arxiv.org/abs/1603.02754
7. Kipf, T. and Welling, M. Semi Supervised Classification with Graph Convolutional Networks. 2016. https://arxiv.org/abs/1609.02907
8. Hyndman, R. and Athanasopoulos, G. Forecasting: Principles and Practice. https://otexts.com/fpp3/
9. Radford, A. et al. Robust Speech Recognition via Large Scale Weak Supervision (Whisper). 2022. https://arxiv.org/abs/2212.04356
10. The JSON Data Interchange Standard (ECMA 404). https://www.json.org/
11. Bronstein, M., Bruna, J., Cohen, T., and Velickovic, P. Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges. 2021. https://arxiv.org/abs/2104.13478

# Data Types and Structures Every machine learning system begins with a decision that is too often made implicitly: what kind of data are we working with, and how will it be represented in memory and on disk? This decision shapes everything downstream. The choice of model architecture, the loss function, the evaluation metric, the storage format, and even the hardware budget all follow from the nature of the data. A model is, in a real sense, a hypothesis about the structure of its inputs. A convolutional network assumes spatial locality. A transformer assumes that order matters and that long range dependencies are worth modeling. A gradient boosted tree assumes that the world can be carved up by axis aligned thresholds on tabular features. When the assumption matches the data, learning is efficient. When it does not, no amount of compute will rescue the result. This chapter develops a working taxonomy of data types, examines the structural properties that distinguish them, surveys the storage formats that practitioners actually use, and shows how each data type drives concrete modeling choices. ### A Unifying View: Data as Functions on Index Sets A productive way to see past the surface variety is to treat every data type as a function defined on some index set. Tabular data is a function on an unordered set of columns. Text is a function on a totally ordered set of positions, $x : \{1, \ldots, T\} \to V$ for a vocabulary $V$. An image is a function on a two dimensional grid, $x : \{1, \ldots, H\} \times \{1, \ldots, W\} \to \mathbb{R}^C$. Audio is a function on a regularly spaced one dimensional grid. A graph is a function on a vertex set equipped with an adjacency relation rather than a fixed coordinate system. Under this lens, the central question for any data type is the symmetry group of its index set, the set of transformations that leave the meaning of the data unchanged. 
The right model is the one whose hypothesis class is invariant or equivariant under exactly that group. This is the modern geometric deep learning perspective (reference 11), and it explains in one sentence why convolutions suit images (translation symmetry), permutation invariant aggregators suit sets and graphs (no canonical ordering), and causal attention suits sequences (a directed order that must be respected). A symmetry is a transformation $g$ of the index set such that the prediction target is unchanged when the input is transformed by $g$. A function $f$ is *invariant* if $f(g \cdot x) = f(x)$ and *equivariant* if $f(g \cdot x) = g \cdot f(x)$. Image classification wants invariance to translation, since a cat shifted ten pixels right is still a cat. Image segmentation wants equivariance, since the output mask must shift with the input. Matching this group structure to the model is the single most reliable inductive bias in applied machine learning. ## 1. Structured Versus Unstructured Data ### 1.1 The Classic Dichotomy The oldest and coarsest distinction is between structured and unstructured data. Structured data has a predefined schema. Each record conforms to a fixed set of fields, each field has a declared type, and the meaning of a value is given by its position in the schema. A relational database table is the canonical example. Unstructured data lacks such a schema at the level of raw bytes. A photograph is a grid of pixel intensities, a document is a sequence of characters, and an audio clip is a sequence of amplitude samples. The semantic content is present, but it is not laid out in named, typed fields. The dichotomy is useful but leaky. A great deal of practical data is better called semi structured. JSON and XML documents carry structure through nesting and keys, yet the schema may vary from record to record and may be deeply hierarchical rather than flat. Log lines, emails, and HTML pages all mix rigid structure with free text. 
For this reason it is more productive to think of a spectrum of structure rather than a binary, and to ask of any dataset: where does the structure live, and how much of the predictive signal it contains is exposed by that structure? It helps to separate three layers that are often conflated. The *physical* layer is the byte layout on disk or wire (CSV rows, a PNG, a Parquet column chunk). The *logical* layer is the schema or type: a relation with named typed columns, a tensor of a given shape, a graph with typed edges. The *semantic* layer is the meaning a human or model assigns to the values. Structured data is precisely data whose logical layer is explicit and fixed in advance. Unstructured data has a trivial logical layer (a flat array of bytes or samples) but rich semantics that a model must recover. Semi structured data has a logical layer that is present but variable. Most engineering pain comes from confusing these layers, for example treating a JSON blob as a single opaque string when its keys in fact carry the schema. ### 1.2 Why the Distinction Matters for Modeling The practical consequence is the amount of representation learning required. Structured tabular data arrives with features that are already meaningful: an age, a price, a category. A model can operate directly on these. Unstructured data requires a representation learning stage that maps raw bytes into a vector space where the relevant geometry reflects semantic similarity. This is precisely the work that deep networks do well and that classical methods do poorly. The historical dominance of deep learning in vision, speech, and language, contrasted with the continued strength of tree ensembles on tabular problems, is a direct reflection of this split. Where features are engineered by the world, classical models compete. Where features must be learned from raw signal, depth wins. ## 2. Tabular Data ### 2.1 Structure and Semantics Tabular data is the workhorse of applied analytics. 
It is organized as $n$ rows and $d$ columns, where each row is an observation and each column is a feature; once every column has been encoded numerically it can be written as a matrix $X \in \mathbb{R}^{n \times d}$. Critically, the columns are heterogeneous. Some are continuous numeric values, some are integer counts, some are categorical with no natural order, some are ordinal, and some are dates or identifiers. There is no meaningful notion of locality between adjacent columns; permuting the column order changes nothing about the information content. This permutation invariance is a defining property, and it is exactly why architectures built on locality assumptions, such as convolutions, are a poor fit.

### 2.2 Modeling Implications

The heterogeneity of columns demands per feature preprocessing. Continuous features are typically standardized to zero mean and unit variance, $z = (x - \mu) / \sigma$, or scaled to a bounded range. Categorical features are encoded, whether by one hot expansion, target encoding, or learned embeddings. Missing values must be handled explicitly through imputation or through models that natively support them.

For prediction on tabular data, gradient boosted decision trees, as implemented in XGBoost, LightGBM, and CatBoost, remain the default choice and frequently outperform deep networks on medium sized datasets. The reason is structural. Trees naturally handle mixed types, are invariant to monotone transformations of individual features, are robust to uninformative features, and capture the axis aligned, non smooth decision boundaries that tabular targets often exhibit. Deep tabular models such as TabNet and various transformer variants have narrowed the gap but have not decisively overturned this picture for the typical mid sized industrial dataset.

```text
row_id   age   income   region   churned
1        34    58000    west     0
2        51    92000    east     1
3        29    41000    south    0
```

### 2.3 Worked Example: Encoding a Categorical Column

Consider the `region` column above, a nominal categorical with no natural order.
The three common encodings illustrate the trade space.

One hot encoding maps each of $K$ categories to a unit basis vector in $\mathbb{R}^K$, so `west`, `east`, and `south` become $(1,0,0)$, $(0,1,0)$, $(0,0,1)$. This is lossless and order free but explodes dimensionality when $K$ is large and produces a sparse design matrix.

Target encoding replaces a category with a statistic of the target conditioned on that category, for instance the smoothed mean churn rate

$$
\hat{\theta}_k = \frac{n_k \bar{y}_k + \alpha \bar{y}}{n_k + \alpha},
$$

where $n_k$ is the count of rows in category $k$, $\bar{y}_k$ is the in category target mean, $\bar{y}$ is the global mean, and $\alpha$ is a smoothing strength that shrinks small categories toward the global prior. This keeps dimensionality at one column but leaks the target and must therefore be fit inside cross validation folds, never on the full data, to avoid optimistic bias.

Learned embeddings map each category to a trainable vector in $\mathbb{R}^m$ with $m \ll K$, which a neural network can tune end to end and which lets similar categories sit near one another.

The pitfall common to all three is the unseen category at inference time, which one hot drops, target encoding sends to the global prior, and embeddings handle with a reserved unknown slot. Tree ensembles such as CatBoost sidestep much of this by handling categoricals natively with an ordered target statistic that is robust to leakage.

## 3. Text Data

### 3.1 Structure and Semantics

Text is a sequence of discrete tokens drawn from a finite vocabulary. Two structural properties dominate. First, order carries meaning; "dog bites man" and "man bites dog" share tokens but not sense. Second, dependencies can span long distances, so the resolution of a pronoun may depend on a noun many sentences earlier. Text is also discrete and high dimensional in its raw form, since a vocabulary may contain tens of thousands of tokens.
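
The claim that order carries meaning can be checked in a few lines. The following is a minimal sketch, assuming naive whitespace tokenization as a stand in for a real tokenizer; the helper name `bag_of_words` is illustrative and not taken from any library.

```python
from collections import Counter


def bag_of_words(sentence: str) -> Counter:
    """Count tokens while discarding their order (whitespace split, for illustration only)."""
    return Counter(sentence.lower().split())


a = "dog bites man"
b = "man bites dog"

# An order free representation cannot tell the two sentences apart ...
same_bag = bag_of_words(a) == bag_of_words(b)
# ... while the token sequences themselves differ.
same_sequence = a.split() == b.split()

print(same_bag, same_sequence)  # True False
```

Any model that sees only the bag, and not the positions, is forced to give both sentences the same prediction.
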
### 3.2 Modeling Implications

The first step is tokenization, the segmentation of raw text into units. Modern systems use subword schemes such as byte pair encoding or WordPiece, which balance vocabulary size against the ability to represent rare words. Tokens are then mapped to dense vectors through an embedding matrix $E \in \mathbb{R}^{|V| \times h}$, where $|V|$ is the vocabulary size and $h$ the embedding dimension.

The transformer architecture, built on self attention, is now the standard model for text. Self attention computes, for every pair of positions, a weighting

$$
\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V,
$$

which lets each token aggregate information from any other token regardless of distance, directly addressing the long range dependency problem. Here $Q$, $K$, and $V$ are the query, key, and value projections of the input, and $d_k$ is the key dimension, with the scaling by $\sqrt{d_k}$ keeping the dot products in a range where the softmax has usable gradients. The cost is $O(T^2 d)$ in time and $O(T^2)$ in memory for sequence length $T$, which is the central engineering tension and motivates much of the research into efficient and sparse attention variants.

Self attention is permutation equivariant by construction, so positional information must be injected separately through positional encodings; without them the model would treat a sentence as a bag of tokens. The discreteness of text also means that generation is framed as classification over the vocabulary at each step, optimized with a cross entropy loss whose minimization is equivalent to maximizing the likelihood of the training corpus.

## 4. Image Data

### 4.1 Structure and Semantics

An image is a regular grid of pixels. A color image is a tensor of shape $H \times W \times C$, with height, width, and channels, the channels usually being red, green, and blue.
The defining structural property is spatial locality: nearby pixels are strongly correlated, and meaningful features such as edges and textures are local patterns that can appear anywhere in the frame. This gives rise to a second property, translation symmetry: an object retains its identity when shifted across the image, so a useful feature detector should respond in the same way wherever its pattern appears.

### 4.2 Modeling Implications

These two properties motivate the convolution. A convolutional layer applies the same small filter across all spatial positions. For a single channel input $x$ and a $k \times k$ kernel $w$, the output at position $(i, j)$ is

$$
(x * w)_{i,j} = \sum_{a=1}^{k} \sum_{b=1}^{k} x_{i+a,\, j+b}\, w_{a,b},
$$

and because the same weights $w$ are reused at every location, the layer is exactly translation equivariant: shifting the input shifts the output identically. This weight sharing is also what makes convolutions parameter efficient, since a layer needs only $k^2$ weights per channel pair regardless of image size, whereas a fully connected layer over an $H \times W$ image would need parameters proportional to $H^2 W^2$. Stacking convolutions with pooling builds a hierarchy from edges to textures to parts to objects, and the effective receptive field grows with depth so that deep layers see large regions of the image.

Convolutional networks dominated computer vision for a decade. The Vision Transformer reframed images as sequences by splitting them into fixed size patches, embedding each patch, and feeding the sequence to a transformer. With sufficient data and pretraining, this approach matches or exceeds convolutional networks, trading the strong locality prior for greater flexibility and scale. The lesson is general: strong architectural priors help when data is scarce, while flexible architectures win when data is abundant.
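
The equivariance and parameter count claims for the convolution can be verified numerically. The sketch below is an illustration under stated assumptions rather than a production layer: it uses plain Python lists, zero based offsets in place of the $a, b = 1, \ldots, k$ indexing of the formula, no padding, and a small patch kept away from the image border so that shifting loses no content. The helper names `correlate2d` and `shift` are hypothetical.

```python
import random


def correlate2d(x, w):
    """out[i][j] = sum over a, b of x[i+a][j+b] * w[a][b], with no padding."""
    k = len(w)
    rows, cols = len(x) - k + 1, len(x[0]) - k + 1
    return [
        [sum(x[i + a][j + b] * w[a][b] for a in range(k) for b in range(k))
         for j in range(cols)]
        for i in range(rows)
    ]


def shift(img, dy, dx):
    """Move content down by dy and right by dx, filling vacated cells with zeros."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            if 0 <= i + dy < h and 0 <= j + dx < w:
                out[i + dy][j + dx] = img[i][j]
    return out


random.seed(0)
H = W = 12
k = 3
kernel = [[random.uniform(-1, 1) for _ in range(k)] for _ in range(k)]

# A random 3 x 3 patch on an otherwise empty image.
image = [[0.0] * W for _ in range(H)]
for i in range(2, 5):
    for j in range(2, 5):
        image[i][j] = random.uniform(0, 1)

dy, dx = 2, 3
filter_then_shift = shift(correlate2d(image, kernel), dy, dx)
shift_then_filter = correlate2d(shift(image, dy, dx), kernel)

# Equivariance: both orders of operations give the same feature map.
equivariant = all(
    abs(p - q) < 1e-12
    for row_p, row_q in zip(filter_then_shift, shift_then_filter)
    for p, q in zip(row_p, row_q)
)

conv_params = k * k            # shared weights, independent of image size
dense_params = (H * W) ** 2    # one weight per input pixel per output pixel

print(equivariant, conv_params, dense_params)  # True 9 20736
```

Even on a 12 by 12 image, the fully connected alternative needs over two thousand times as many weights as the shared kernel, and the gap grows with the fourth power of the side length.
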

Whichever architecture is used, the high dimensionality of pixels, where a modest image holds hundreds of thousands of values, makes representation learning essential, since raw pixels are a poor feature space for any classical method.

## 5. Audio Data

### 5.1 Structure and Semantics

Audio is a one dimensional signal: a sequence of amplitude samples taken at a fixed sampling rate, commonly 16 kHz for speech or 44.1 kHz for music. The structure is temporal, and the information is carried largely in the frequency content as it evolves over time. Sampling rate sets the highest representable frequency through the Nyquist limit, $f_{\max} = f_s / 2$, so a 16 kHz rate captures frequencies up to 8 kHz, sufficient for intelligible speech.

### 5.2 Modeling Implications

Raw waveforms are dense and high rate, with tens of thousands of samples per second. A common and powerful move is to convert the waveform into a time frequency representation, the spectrogram, computed by the short time Fourier transform. The mel spectrogram further warps the frequency axis to match human perception. This transforms a one dimensional signal into a two dimensional image like array, which means that convolutional and transformer architectures developed for vision and sequences transfer naturally. End to end models that learn directly from the waveform also exist, but spectrogram front ends remain a robust and widely used default in speech recognition and audio classification.

## 6. Video Data

### 6.1 Structure and Semantics

Video adds a temporal axis to images, yielding a tensor of shape $T \times H \times W \times C$ across $T$ frames. It therefore inherits spatial locality within each frame and adds temporal locality across frames, since consecutive frames are highly redundant. This redundancy is both a burden, because of the sheer data volume, and an opportunity, because motion itself is a rich signal.

### 6.2 Modeling Implications

The central challenge is scale.
A few seconds of video contains far more raw values than a single image, which makes naive processing of every frame at full resolution prohibitive. Practical systems use three dimensional convolutions that span space and time, two stream designs that process appearance and motion separately, or video transformers with factorized spatial and temporal attention to control cost. Temporal redundancy is exploited by sampling frames sparsely rather than processing every one. Modeling choices here are dominated by the compute and memory budget as much as by accuracy, and efficiency is a first class concern rather than an afterthought.

## 7. Graph Data

### 7.1 Structure and Semantics

A graph $G = (V, E)$ consists of nodes and edges, with optional features attached to each. Graphs represent relational data: social networks, molecules, knowledge bases, transportation systems, and recommendation interactions. The defining structural property is that there is no fixed ordering of nodes and no fixed neighborhood size. A given node may have two neighbors or two thousand. Any valid model must be permutation invariant or equivariant, producing the same output regardless of how the nodes happen to be numbered.

### 7.2 Modeling Implications

This irregularity rules out architectures that assume a grid or a sequence. The dominant approach is the graph neural network, which operates by message passing. Each node iteratively updates its representation by aggregating messages from its neighbors:

$$
h_v^{(l+1)} = \phi\!\left(h_v^{(l)}, \; \bigoplus_{u \in \mathcal{N}(v)} \psi\big(h_v^{(l)}, h_u^{(l)}\big)\right),
$$

where $\bigoplus$ is a permutation invariant aggregator such as sum or mean. After $k$ rounds, each node's representation reflects its $k$ hop neighborhood. The permutation invariant aggregator is what encodes the absence of node ordering directly into the architecture.
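
One round of the update above can be sketched in plain Python. This is a deliberately simplified instance, not a reference implementation: the aggregator $\bigoplus$ is a sum, the message function $\psi$ is assumed to return the neighbor state $h_u$ unchanged, and $\phi$ is assumed to be a ReLU applied to a linear combination of the node's own state and the aggregate. The names `message_passing_round`, `w_self`, and `w_nbr` are hypothetical.

```python
import random


def matvec(vec, mat):
    """Row vector times matrix."""
    return [sum(v * mat[r][c] for r, v in enumerate(vec)) for c in range(len(mat[0]))]


def message_passing_round(h, edges, w_self, w_nbr):
    """h: list of node feature vectors; edges: list of (source, target) pairs."""
    n, d = len(h), len(h[0])
    agg = [[0.0] * d for _ in range(n)]
    for u, v in edges:                      # sum aggregation over incoming messages
        for c in range(d):
            agg[v][c] += h[u][c]
    out = []
    for v in range(n):
        own = matvec(h[v], w_self)
        nbr = matvec(agg[v], w_nbr)
        out.append([max(0.0, a + b) for a, b in zip(own, nbr)])  # ReLU update
    return out


random.seed(0)
n, d = 5, 4
h = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
undirected = [(0, 1), (1, 2), (2, 3), (3, 4), (0, 4), (1, 3)]
edges = undirected + [(v, u) for u, v in undirected]   # store both directions
w_self = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]
w_nbr = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]

out = message_passing_round(h, edges, w_self, w_nbr)

# Renumber the nodes: new node i is old node perm[i].
perm = list(range(n))
random.shuffle(perm)
inv = [0] * n
for new, old in enumerate(perm):
    inv[old] = new
h_perm = [h[old] for old in perm]
edges_perm = [(inv[u], inv[v]) for u, v in edges]
out_perm = message_passing_round(h_perm, edges_perm, w_self, w_nbr)

# Equivariance: renumbering the input renumbers the output in the same way.
equivariant = all(
    abs(a - b) < 1e-9
    for new, old in enumerate(perm)
    for a, b in zip(out_perm[new], out[old])
)
print(equivariant)  # True
```

The work per round is proportional to the number of edges in the list, not to the square of the number of nodes.
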
Sparsity is the key efficiency property, since real graphs have far fewer edges than the $|V|^2$ of a dense adjacency matrix, and efficient implementations exploit this.

## 8. Time Series Data

### 8.1 Structure and Semantics

A time series is a sequence of observations indexed by time, $x_1, x_2, \ldots, x_T$, which may be univariate or multivariate. It shares temporal ordering with text and audio, but it carries distinctive structure: trend, the long run direction; seasonality, periodic patterns at fixed intervals; and autocorrelation, the dependence of a value on its own recent past. Time series may be regularly or irregularly sampled, and they raise the issue of stationarity, whether the statistical properties are constant over time.

### 8.2 Modeling Implications

The temporal ordering imposes a hard constraint that distinguishes time series work from most other data: information must not leak from the future into the past. Train and test splits must respect chronology, and cross validation must use forward chaining rather than random folds.

Classical statistical models such as ARIMA, which combine differencing with autoregressive and moving average terms, remain strong baselines, particularly for univariate problems; seasonal variants cover series with clear seasonality. Modern approaches apply recurrent networks, temporal convolutions, and transformers adapted to long horizons. Across methods, feature engineering of lagged values, rolling statistics, and calendar effects often contributes as much as the model class. The recurring lesson is that respecting the arrow of time in both modeling and evaluation matters more than the choice of architecture.

## 9. Storage Formats

The logical type of data is one concern; how it is serialized to disk and moved between systems is another, and it has large practical consequences for speed, size, and interoperability.

### 9.1 CSV

Comma separated values is the lowest common denominator.
It is a plain text, row oriented format that is human readable and universally supported. Its weaknesses are equally well known. It has no type information, so every field is text until parsed, and there is no universally followed standard for encoding, quoting, or null values. It is verbose, has no built in compression, and must be scanned in full even to read a single column. CSV is excellent for small data and interchange and poor for analytical workloads at scale.

### 9.2 Parquet

Apache Parquet is a columnar, binary format designed for analytics. Storing data by column rather than by row yields two major advantages. First, a query that touches a few columns reads only those columns, which slashes input and output for wide tables. Second, values within a column share a type and tend to be similar, which makes compression and encoding schemes such as dictionary and run length encoding highly effective. Parquet also stores a schema and per column statistics that allow whole blocks to be skipped during filtering. It is the standard for data lakes and large scale batch processing.

```text
CSV (row oriented):    Parquet (column oriented):
1,34,west              [1,2,3]
2,51,east              [34,51,29]
3,29,south             [west,east,south]
```

### 9.3 JSON

JavaScript Object Notation is the standard for semi structured and hierarchical data. It is text based, human readable, and represents nested objects and arrays naturally, which makes it ubiquitous in web APIs and document stores. Its costs are verbosity and parsing overhead, and like CSV it is row oriented and untyped beyond a few primitives. For high volume pipelines, a binary or columnar relative is usually preferred, but JSON remains the default for configuration, interchange, and any data whose shape varies record to record.

### 9.4 Arrow

Apache Arrow is a columnar, in memory format rather than primarily a disk format.
Its purpose is to provide a single, language independent representation of columnar data in RAM so that different tools and languages can share data without serialization and copying. Where Parquet optimizes data at rest, Arrow optimizes data in motion and in compute. Its zero copy design and cache friendly columnar layout enable vectorized processing and let Python, Java, R, and others operate on the same buffers. Parquet and Arrow are complementary and frequently paired: Parquet on disk, Arrow in memory.

### 9.5 Choosing a Format

The practical guidance follows from the access pattern. Use CSV or JSON for small data, interchange, and human inspection. Use Parquet for analytical storage, columnar access, and large datasets at rest. Use Arrow as the in memory backbone for fast, cross language data processing pipelines. The cost of a wrong choice is rarely correctness, but it is often an order of magnitude in storage and processing time.

## 10. How Data Type Drives Modeling Choices

Pulling the threads together, the data type determines a chain of decisions. The structural properties of the data, locality, ordering, permutation invariance, and dimensionality, dictate which architectural priors are appropriate. Locality favors convolutions, ordering favors sequence models, the absence of ordering favors permutation invariant aggregation, and abundant data favors flexible attention based models over strong priors. The same properties dictate evaluation: time series demand chronological splits, while independently sampled tabular rows permit random ones. They dictate preprocessing: tabular data needs per column treatment, text needs tokenization, audio benefits from spectral transforms.

The mature practitioner reads a dataset's type as a specification. It tells them which inductive biases to build in, which storage format will keep the pipeline fast, which leakage risks to guard against, and which family of models is likely to repay the effort of tuning.
The most common and costly errors in applied machine learning are not subtle failures of optimization but mismatches between the structure of the data and the assumptions of the method. Getting the data type right, in both representation and storage, is the foundation on which everything else is built.

The following diagram traces the chain from a structural property of the data to a default modeling family. It is a heuristic starting point, not a law, and abundant data plus pretraining can always justify replacing a strong prior with a more flexible architecture.

```{mermaid}
flowchart TD
    A["What is the index structure of the data"] --> B["Unordered fields, mixed types"]
    A --> C["Total order over positions"]
    A --> D["Regular spatial grid"]
    A --> E["Vertices with adjacency"]
    B --> B1["Tabular: gradient boosted trees"]
    C --> C1["Text or time series: sequence models and attention"]
    D --> D1["Image or audio spectrogram: convolutions or patch transformers"]
    E --> E1["Graph: message passing networks"]
```

### 10.1 When to Use and Common Pitfalls

A compact decision table summarizes the defaults and the traps. The recurring theme is that the index structure and its symmetry group fix the inductive bias, while the access pattern fixes the storage format.

| Data type | Default model family | Storage default | Most common pitfall |
|---|---|---|---|
| Tabular | Gradient boosted trees | Parquet | Target leakage from encodings or aggregations fit on full data |
| Text | Transformer with subword tokens | JSON or Parquet | Tokenization mismatch between training and inference |
| Image | Convolutional net or patch transformer | Binary formats with metadata | Forgetting normalization statistics used at training time |
| Audio | Spectrogram front end plus vision or sequence model | Compressed binary | Sampling rate mismatch and aliasing above the Nyquist limit |
| Video | Three dimensional convolutions or factorized attention | Compressed binary | Processing every frame at full resolution, exhausting the compute budget |
| Graph | Message passing network | Adjacency lists or edge tables | Materializing a dense adjacency matrix when the graph is sparse, or ignoring permutation invariance |
| Time series | ARIMA baseline, then recurrent, convolutional, or transformer | Parquet | Random splits that leak the future into the past |

The single most expensive mistake across all types is leakage, the silent contamination of training with information unavailable at prediction time. It hides in target encodings fit on the whole dataset, in features computed across a train test boundary, and in random shuffles of inherently ordered data. A model that scores well in offline evaluation and then fails in production has often learned from a leak. Auditing the data pipeline for leakage repays more than nearly any architectural refinement.

## References

1. Apache Parquet documentation. https://parquet.apache.org/docs/
2. Apache Arrow documentation. https://arrow.apache.org/docs/
3. Vaswani, A. et al. Attention Is All You Need. 2017. https://arxiv.org/abs/1706.03762
4. Dosovitskiy, A. et al. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. 2020. https://arxiv.org/abs/2010.11929
5. Grinsztajn, L. et al.
   Why Do Tree Based Models Still Outperform Deep Learning on Tabular Data? 2022. https://arxiv.org/abs/2207.08815
6. Chen, T. and Guestrin, C. XGBoost: A Scalable Tree Boosting System. 2016. https://arxiv.org/abs/1603.02754
7. Kipf, T. and Welling, M. Semi Supervised Classification with Graph Convolutional Networks. 2016. https://arxiv.org/abs/1609.02907
8. Hyndman, R. and Athanasopoulos, G. Forecasting: Principles and Practice. https://otexts.com/fpp3/
9. Radford, A. et al. Robust Speech Recognition via Large Scale Weak Supervision (Whisper). 2022. https://arxiv.org/abs/2212.04356
10. The JSON Data Interchange Standard (ECMA 404). https://www.json.org/
11. Bronstein, M., Bruna, J., Cohen, T., and Velickovic, P. Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges. 2021. https://arxiv.org/abs/2104.13478