52  Data Types and Structures

Every machine learning system begins with a decision that is too often made implicitly: what kind of data are we working with, and how will it be represented in memory and on disk? This decision shapes everything downstream. The choice of model architecture, the loss function, the evaluation metric, the storage format, and even the hardware budget all follow from the nature of the data. A model is, in a real sense, a hypothesis about the structure of its inputs. A convolutional network assumes spatial locality. A transformer assumes that order matters and that long range dependencies are worth modeling. A gradient boosted tree assumes that the world can be carved up by axis aligned thresholds on tabular features. When the assumption matches the data, learning is efficient. When it does not, no amount of compute will rescue the result.

This chapter develops a working taxonomy of data types, examines the structural properties that distinguish them, surveys the storage formats that practitioners actually use, and shows how each data type drives concrete modeling choices.

52.1 1. Structured Versus Unstructured Data

52.1.1 1.1 The Classic Dichotomy

The oldest and coarsest distinction is between structured and unstructured data. Structured data has a predefined schema. Each record conforms to a fixed set of fields, each field has a declared type, and the meaning of a value is given by its position in the schema. A relational database table is the canonical example. Unstructured data lacks such a schema at the level of raw bytes. A photograph is a grid of pixel intensities, a document is a sequence of characters, and an audio clip is a sequence of amplitude samples. The semantic content is present, but it is not laid out in named, typed fields.

The dichotomy is useful but leaky. A great deal of practical data is better called semi structured. JSON and XML documents carry structure through nesting and keys, yet the schema may vary from record to record and may be deeply hierarchical rather than flat. Log lines, emails, and HTML pages all mix rigid structure with free text. For this reason it is more productive to think of a spectrum of structure rather than a binary, and to ask of any dataset: where does the structure live, and how much of the predictive signal does that structure expose?
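As a minimal sketch of what semi structured means in practice, the following flattens two JSON records with different shapes into one table. The records, the field names, and the `flatten` helper are hypothetical and chosen only for illustration.

```python
import json

def flatten(record, prefix=""):
    """Flatten nested dicts into a single level dict with dotted keys."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value  # lists and scalars are kept as they are
    return flat

# Two hypothetical records whose schemas differ.
raw = [
    '{"user": {"id": 1, "plan": "pro"}, "events": ["login", "export"]}',
    '{"user": {"id": 2}, "referrer": "search"}',
]
rows = [flatten(json.loads(line)) for line in raw]

# The table schema is the union of all keys seen in any record.
columns = sorted({key for row in rows for key in row})

# Fields absent from a record become None, which makes the gaps explicit.
table = [[row.get(col) for col in columns] for row in rows]
```

The structure lives in the keys and the nesting, and flattening exposes it as columns. The list valued `events` field has no natural flat form, which is exactly where the schema stops helping.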

52.1.2 1.2 Why the Distinction Matters for Modeling

The practical consequence is the amount of representation learning required. Structured tabular data arrives with features that are already meaningful: an age, a price, a category. A model can operate directly on these. Unstructured data requires a representation learning stage that maps raw bytes into a vector space where the relevant geometry reflects semantic similarity. This is precisely the work that deep networks do well and that classical methods do poorly. The historical dominance of deep learning in vision, speech, and language, contrasted with the continued strength of tree ensembles on tabular problems, is a direct reflection of this split. Where features are engineered by the world, classical models compete. Where features must be learned from raw signal, depth wins.

52.2 2. Tabular Data

52.2.1 2.1 Structure and Semantics

Tabular data is the workhorse of applied analytics. It is organized as a table of \(n\) rows and \(d\) columns, where each row is an observation and each column is a feature; once every column has been encoded numerically, it can be treated as a matrix \(X \in \mathbb{R}^{n \times d}\). Critically, the columns are heterogeneous. Some are continuous numeric values, some are integer counts, some are categorical with no natural order, some are ordinal, and some are dates or identifiers. There is no meaningful notion of locality between adjacent columns; permuting the column order changes nothing about the information content. This permutation invariance is a defining property, and it is exactly why architectures built on locality assumptions, such as convolutions, are a poor fit.

52.2.2 2.2 Modeling Implications

The heterogeneity of columns demands per feature preprocessing. Continuous features are typically standardized to zero mean and unit variance, \(z = (x - \mu) / \sigma\), or scaled to a bounded range. Categorical features are encoded, whether by one hot expansion, target encoding, or learned embeddings. Missing values must be handled explicitly through imputation or through models that natively support them.

For prediction on tabular data, gradient boosted decision trees, as implemented in XGBoost, LightGBM, and CatBoost, remain the default choice and frequently outperform deep networks on medium sized datasets. The reason is structural. Trees naturally handle mixed types, are invariant to monotone transformations of individual features, are robust to uninformative features, and capture the axis aligned, non smooth decision boundaries that tabular targets often exhibit. Deep tabular models such as TabNet and various transformer variants have narrowed the gap but have not decisively overturned this picture for the typical mid sized industrial dataset.
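The invariance to monotone transformations can be checked on a single threshold split. The sketch below is a toy decision stump, not the algorithm of any of the libraries named above, and the income and churn values are made up. Because the split depends only on the rank order of the feature, taking a logarithm leaves the resulting partition unchanged.

```python
import math

def best_split(xs, ys):
    """Find the threshold split of one feature that minimizes 0/1 error.

    Returns the set of row indices sent to the left branch.
    Assumes all feature values are distinct.
    """
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best_err, best_left = None, None
    for cut in range(1, len(order)):
        left, right = order[:cut], order[cut:]
        err = 0
        for side in (left, right):
            # Each side predicts its majority label; count the mistakes.
            ones = sum(ys[i] for i in side)
            err += min(ones, len(side) - ones)
        if best_err is None or err < best_err:
            best_err, best_left = err, frozenset(left)
    return best_left

# Hypothetical feature and binary target.
income = [41000, 58000, 92000, 120000, 30000, 75000]
churned = [0, 0, 1, 1, 0, 1]

left_raw = best_split(income, churned)
left_log = best_split([math.log(x) for x in income], churned)
```

Both calls send the same rows to the left branch, so a tree built from such splits makes the same predictions whether or not the feature was log scaled. A linear model or a neural network would not share this property.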

row_id  age  income   region   churned
1       34   58000    west     0
2       51   92000    east     1
3       29   41000    south    0
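The per feature preprocessing described in this section can be sketched on the toy table above. This is an illustrative version using only the standard library; the helper names are assumptions, and a real pipeline would estimate the mean and standard deviation on the training split only.

```python
import statistics

rows = [
    {"age": 34, "income": 58000, "region": "west", "churned": 0},
    {"age": 51, "income": 92000, "region": "east", "churned": 1},
    {"age": 29, "income": 41000, "region": "south", "churned": 0},
]

def standardize(values):
    """Apply z = (x - mu) / sigma using the population standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]

def one_hot(values):
    """Return the sorted category list and one indicator vector per value."""
    categories = sorted(set(values))
    return categories, [[1 if v == c else 0 for c in categories] for v in values]

age_z = standardize([r["age"] for r in rows])
income_z = standardize([r["income"] for r in rows])
categories, region_1hot = one_hot([r["region"] for r in rows])

# Encoded feature matrix: two standardized columns plus one column per region.
X = [[a, i, *r] for a, i, r in zip(age_z, income_z, region_1hot)]
y = [r["churned"] for r in rows]
```

Note that the identifier column `row_id` is dropped rather than encoded, since it carries no predictive meaning.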

52.3 3. Text Data

52.3.1 3.1 Structure and Semantics

Text is a sequence of discrete tokens drawn from a finite vocabulary. Two structural properties dominate. First, order carries meaning; “dog bites man” and “man bites dog” share tokens but not sense. Second, dependencies can span long distances, so the resolution of a pronoun may depend on a noun many sentences earlier. Text is also discrete and high dimensional in its raw form, since a vocabulary may contain tens of thousands of tokens.

52.3.2 3.2 Modeling Implications

The first step is tokenization, the segmentation of raw text into units. Modern systems use subword schemes such as byte pair encoding or WordPiece, which balance vocabulary size against the ability to represent rare words. Tokens are then mapped to dense vectors through an embedding matrix \(E \in \mathbb{R}^{|V| \times h}\), where \(|V|\) is the vocabulary size and \(h\) the embedding dimension.
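To make the subword idea concrete, here is a toy sketch of byte pair encoding that learns merges from a tiny word list. It is a simplification under stated assumptions: it works on characters rather than bytes, omits the end of word marker that practical implementations commonly add, and uses a made up corpus.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of pair by the concatenated symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(words, num_merges):
    """Learn an ordered list of merges from a list of words."""
    vocab = Counter(tuple(w) for w in words)  # each word starts as characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        merged = Counter()
        for symbols, freq in vocab.items():
            merged[merge_pair(symbols, best)] += freq
        vocab = merged
    return merges

def tokenize(word, merges):
    """Apply the learned merges, in order, to a new word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = learn_bpe(corpus, num_merges=4)
tokens = tokenize("lowest", merges)
```

The word "lowest" never appears in the corpus, yet it is represented by pieces learned from other words. This is the balance between vocabulary size and coverage of rare words that subword schemes aim for.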

The transformer architecture, built on self attention, is now the standard model for text. Self attention computes, for every pair of positions, a weighting

\[ \text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V, \]

where \(Q\), \(K\), and \(V\) are the query, key, and value matrices obtained by linear projections of the token representations, and \(d_k\) is the dimension of the keys. This lets each token aggregate information from any other token regardless of distance, directly addressing the long range dependency problem. The quadratic cost of attention in sequence length is the central engineering tension, and it motivates much of the research into efficient and sparse attention variants. The discreteness of text also means that generation is framed as classification over the vocabulary at each step, with a cross entropy loss.
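The attention formula can be written out directly for small inputs. The following is a minimal single head sketch on plain Python lists, with no learned projections, masking, or batching; the three toy token vectors are arbitrary.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot product attention for lists of row vectors."""
    d_k = len(K[0])
    weights = []
    for q in Q:
        # One score per key: the n x n score matrix is the quadratic cost.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights.append(softmax(scores))
    out = [
        [sum(w * v[j] for w, v in zip(row, V)) for j in range(len(V[0]))]
        for row in weights
    ]
    return out, weights

# Three tokens with two dimensional representations (arbitrary values).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = attention(X, X, X)
```

Each row of `weights` is a probability distribution over all positions, so every output vector is a weighted average of the value vectors. The weight matrix has one entry per pair of positions, which is where the quadratic cost comes from.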

52.4 4. Image Data

52.4.1 4.1 Structure and Semantics

An image is a regular grid of pixels. A color image is a tensor of shape \(H \times W \times C\), with height, width, and channels, the channels usually being red, green, and blue. The defining structural property is spatial locality: nearby pixels are strongly correlated, and meaningful features such as edges and textures are local patterns that can appear anywhere in the frame. This gives rise to a second property, translation equivariance: an object retains its identity when shifted across the image, so the features that describe it should shift along with it rather than change.

52.4.2 4.2 Modeling Implications

These two properties motivate the convolution. A convolutional layer applies the same small filter across all spatial positions, which builds in translation equivariance and shares parameters efficiently. Stacking convolutions with pooling builds a hierarchy from edges to textures to parts to objects. Convolutional networks dominated computer vision for a decade.
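Translation equivariance can be verified numerically. The sketch below implements a single channel convolution (strictly a cross correlation, as in most deep learning libraries) without padding, and checks that shifting the input shifts the output by the same amount. The image and kernel values are arbitrary, and the image has an empty border so that nothing is lost at the edges.

```python
def conv2d(image, kernel):
    """Valid 2D cross correlation of a single channel image with a kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

def shift_right(grid, k):
    """Shift every row right by k columns, filling with zeros on the left."""
    return [[0] * k + row[: len(row) - k] for row in grid]

image = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 2, 0, 0, 0],
    [0, 3, 4, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
kernel = [[1, 0], [0, -1]]  # the same four weights are reused at every position

shifted_then_convolved = conv2d(shift_right(image, 1), kernel)
convolved_then_shifted = shift_right(conv2d(image, kernel), 1)
```

The two results are identical, and the layer has only four parameters regardless of the image size. A fully connected layer on the same input would need a separate weight for every pixel and output pair and would have no such guarantee.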

The Vision Transformer reframed images as sequences by splitting them into fixed size patches, embedding each patch, and feeding the sequence to a transformer. With sufficient data and pretraining, this approach matches or exceeds convolutional networks, trading the strong locality prior for greater flexibility and scale. The lesson is general: strong architectural priors help when data is scarce, while flexible architectures win when data is abundant. Either way, the high dimensionality of pixels, where a modest image holds hundreds of thousands of values, makes representation learning essential, since raw pixels are a poor feature space for any classical method.
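The patch splitting step is simple enough to sketch directly. The helper below is an illustration on nested lists, not the reference implementation; the patch embedding and position encoding that follow it in a real model are omitted. The closing arithmetic assumes a 224 by 224 RGB input with 16 by 16 patches.

```python
def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened square patches."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = [
                value
                for r in range(top, top + patch)
                for c in range(left, left + patch)
                for value in image[r][c]  # channel values of one pixel
            ]
            patches.append(flat)
    return patches

# A tiny 4 x 4 image with 3 channels whose values encode pixel position.
image = [[[r, c, 0] for c in range(4)] for r in range(4)]
patches = patchify(image, patch=2)

# Sequence length and token size for the assumed 224 x 224 x 3 input.
num_patches = (224 // 16) * (224 // 16)
patch_dim = 16 * 16 * 3
```

Under these assumptions the image becomes a sequence of 196 tokens, each a vector of 768 raw values, which a transformer can then process like a sentence.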

52.5 5. Audio Data

52.5.1 5.1 Structure and Semantics

Audio is a one dimensional signal: a sequence of amplitude samples taken at a fixed sampling rate, commonly 16 kHz for speech or 44.1 kHz for music. The structure is temporal, and the information is carried largely in the frequency content as it evolves over time. Sampling rate sets the highest representable frequency through the Nyquist limit, \(f_{\max} = f_s / 2\), so a 16 kHz rate captures frequencies up to 8 kHz, sufficient for intelligible speech.
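The Nyquist limit has a concrete consequence that is easy to demonstrate: a tone above \(f_s / 2\) produces exactly the same samples as a tone below it. The sketch samples a 9 kHz sine and a 7 kHz sine at 16 kHz; the frequencies are chosen only to straddle the 8 kHz limit.

```python
import math

def sample_sine(freq_hz, rate_hz, n_samples):
    """Sample a unit amplitude sine wave at a fixed sampling rate."""
    return [math.sin(2 * math.pi * freq_hz * n / rate_hz) for n in range(n_samples)]

rate = 16_000
nyquist = rate / 2

below = sample_sine(7_000, rate, 64)   # representable at this rate
above = sample_sine(9_000, rate, 64)   # above the Nyquist limit

# The 9 kHz tone yields the same samples as a 7 kHz tone with flipped sign,
# so after sampling the two cannot be told apart (aliasing).
max_gap = max(abs(a + b) for a, b in zip(above, below))
```

This is why recordings are low pass filtered before sampling: content above the limit is not simply lost, it is folded back onto lower frequencies.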

52.5.2 5.2 Modeling Implications

Raw waveforms are dense and high rate, with tens of thousands of samples per second. A common and powerful move is to convert the waveform into a time frequency representation, the spectrogram, computed by the short time Fourier transform. The mel spectrogram further warps the frequency axis to match human perception. This transforms a one dimensional signal into a two dimensional image like array, which means that convolutional and transformer architectures developed for vision and sequences transfer naturally. End to end models that learn directly from the waveform also exist, but spectrogram front ends remain a robust and widely used default in speech recognition and audio classification.
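A magnitude spectrogram can be sketched with nothing more than a windowed discrete Fourier transform per frame. The version below is deliberately naive (a direct DFT rather than an FFT) and stops before the mel warping step; the frame length, hop size, and test tone are assumptions chosen for illustration.

```python
import cmath
import math

def stft_magnitudes(signal, frame_len, hop):
    """Magnitude spectrogram from a naive DFT on Hann windowed frames."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    n_bins = frame_len // 2 + 1  # keep the non negative frequencies only
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        spectrum = [
            abs(sum(x * cmath.exp(-2j * math.pi * k * n / frame_len)
                    for n, x in enumerate(frame)))
            for k in range(n_bins)
        ]
        frames.append(spectrum)
    return frames  # shape: time frames x frequency bins

rate = 16_000
tone = [math.sin(2 * math.pi * 1_000 * n / rate) for n in range(1024)]
spec = stft_magnitudes(tone, frame_len=256, hop=128)

# Locate the strongest frequency bin in the first frame and convert it to Hz.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * rate / 256
```

The result is a two dimensional array indexed by time and frequency, which is the image like input that vision style architectures consume. In practice an FFT based routine from a numerical library would replace the inner loop.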

52.6 6. Video Data

52.6.1 6.1 Structure and Semantics

Video adds a temporal axis to images, yielding a tensor of shape \(T \times H \times W \times C\) across \(T\) frames. It therefore inherits spatial locality within each frame and adds temporal locality across frames, since consecutive frames are highly redundant. This redundancy is both a burden, because of the sheer data volume, and an opportunity, because motion itself is a rich signal.

52.6.2 6.2 Modeling Implications

The central challenge is scale. A few seconds of video contains far more raw values than a single image, which makes naive processing of every frame at full resolution prohibitive. Practical systems use three dimensional convolutions that span space and time, two stream designs that process appearance and motion separately, or video transformers with factorized spatial and temporal attention to control cost. Temporal redundancy is exploited by sampling frames sparsely rather than processing every one. Modeling choices here are dominated by the compute and memory budget as much as by accuracy, and efficiency is a first class concern rather than an afterthought.
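A short calculation shows the scale problem, and a few lines show the sparse sampling remedy. The clip length, frame rate, and resolution below are assumptions chosen for illustration, as is the helper that picks evenly spaced frames.

```python
def raw_values(frames, height, width, channels=3):
    """Number of raw values in a T x H x W x C tensor."""
    return frames * height * width * channels

def uniform_frame_indices(total_frames, num_samples):
    """Pick num_samples frame indices spread evenly over the clip."""
    step = total_frames / num_samples
    # Take the frame at the centre of each of the num_samples segments.
    return [int(step * i + step / 2) for i in range(num_samples)]

image_values = raw_values(1, 224, 224)    # one 224 x 224 RGB image
clip_values = raw_values(90, 224, 224)    # 3 seconds at 30 frames per second

sampled = uniform_frame_indices(90, 8)
sampled_values = raw_values(len(sampled), 224, 224)
```

Three seconds of video at this modest resolution holds 90 times the values of one image, and keeping 8 of the 90 frames cuts the input by more than a factor of ten while still spanning the whole clip.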

52.7 7. Graph Data

52.7.1 7.1 Structure and Semantics

A graph \(G = (V, E)\) consists of nodes and edges, with optional features attached to each. Graphs represent relational data: social networks, molecules, knowledge bases, transportation systems, and recommendation interactions. The defining structural property is that there is no fixed ordering of nodes and no fixed neighborhood size. A given node may have two neighbors or two thousand. Any valid model must be permutation invariant or equivariant, producing the same output regardless of how the nodes happen to be numbered.

52.7.2 7.2 Modeling Implications

This irregularity rules out architectures that assume a grid or a sequence. The dominant approach is the graph neural network, which operates by message passing. Each node iteratively updates its representation by aggregating messages from its neighbors:

\[ h_v^{(l+1)} = \phi\!\left(h_v^{(l)}, \; \bigoplus_{u \in \mathcal{N}(v)} \psi\big(h_v^{(l)}, h_u^{(l)}\big)\right), \]

where \(\psi\) computes the message sent along each edge, \(\bigoplus\) is a permutation invariant aggregator such as sum or mean, and \(\phi\) updates the node state from its previous value and the aggregated messages. After \(k\) rounds, each node’s representation reflects its \(k\) hop neighborhood. The permutation invariant aggregator is what encodes the absence of node ordering directly into the architecture. Sparsity is the key efficiency property, since real graphs have far fewer edges than the \(|V|^2\) of a dense adjacency matrix, and efficient implementations exploit this.
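One round of message passing can be sketched on an adjacency list. The choices here are assumptions made for brevity: the message \(\psi\) is simply the neighbor's state, the aggregator is the mean, the update \(\phi\) adds the aggregate to the node's own state, and there are no learned weights. The four node path graph is made up.

```python
def message_passing_round(features, neighbors):
    """One round: each node adds the mean of its neighbors' states to its own."""
    updated = {}
    for node, h_v in features.items():
        nbrs = neighbors[node]
        if nbrs:
            agg = [sum(features[u][i] for u in nbrs) / len(nbrs)
                   for i in range(len(h_v))]
        else:
            agg = [0.0] * len(h_v)
        updated[node] = [a + b for a, b in zip(h_v, agg)]
    return updated

# Path graph a - b - c - d, stored sparsely as an adjacency list.
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
h0 = {"a": [1.0], "b": [0.0], "c": [0.0], "d": [0.0]}  # signal starts at a

h1 = message_passing_round(h0, neighbors)
h2 = message_passing_round(h1, neighbors)
h3 = message_passing_round(h2, neighbors)

# Listing the neighbors in a different order must not change the result.
reordered = {v: list(reversed(ns)) for v, ns in neighbors.items()}
h1_reordered = message_passing_round(h0, reordered)
```

Node `d` is three hops from `a`, so its state stays at zero through two rounds and only becomes nonzero in the third. The adjacency list stores one entry per edge rather than a \(|V| \times |V|\) matrix, which is the sparsity that efficient implementations rely on.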

52.8 8. Time Series Data

52.8.1 8.1 Structure and Semantics

A time series is a sequence of observations indexed by time, \(x_1, x_2, \ldots, x_T\), which may be univariate or multivariate. It shares temporal ordering with text and audio, but it carries distinctive structure: trend, the long run direction; seasonality, periodic patterns at fixed intervals; and autocorrelation, the dependence of a value on its own recent past. Time series may be regularly or irregularly sampled, and they raise the issue of stationarity, whether the statistical properties are constant over time.
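Autocorrelation is straightforward to compute, and it exposes seasonality directly. The sketch below uses a synthetic series with a linear trend and a period of four; both the series and the helper are assumptions for illustration.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a series at a given lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum((series[t] - mean) * (series[t - lag] - mean)
              for t in range(lag, n))
    return cov / var

# Synthetic series: upward trend plus a seasonal pattern that repeats every 4 steps.
season = [0.0, 2.0, 0.0, -2.0]
series = [0.1 * t + season[t % 4] for t in range(48)]

acf_at_period = autocorrelation(series, 4)       # aligned with the season
acf_at_half_period = autocorrelation(series, 2)  # out of phase with it
```

The autocorrelation is high at the seasonal lag and close to zero at half the period, where the seasonal pattern is out of phase with itself. Peaks at regular lags in such a plot are a common first hint of the seasonal period.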

52.8.2 8.2 Modeling Implications

The temporal ordering imposes a hard constraint that distinguishes time series work from most other data: information must not leak from the future into the past. Train and test splits must respect chronology, and cross validation must use forward chaining rather than random folds. Classical statistical models such as ARIMA combine differencing with autoregressive and moving average terms and remain strong baselines, particularly for univariate problems; their seasonal variants handle series with clear seasonality. Modern approaches apply recurrent networks, temporal convolutions, and transformers adapted to long horizons. Across methods, feature engineering of lagged values, rolling statistics, and calendar effects often contributes as much as the model class. The recurring lesson is that respecting the arrow of time in both modeling and evaluation matters more than the choice of architecture.
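Forward chaining can be sketched as a generator of index splits with an expanding training window. The function name and its parameters are hypothetical; libraries offer equivalent utilities with their own conventions.

```python
def forward_chaining_splits(n_samples, n_folds, min_train):
    """Yield (train, test) index lists where each test fold follows its training data."""
    fold_size = (n_samples - min_train) // n_folds
    for fold in range(n_folds):
        train_end = min_train + fold * fold_size
        test_end = train_end + fold_size
        # Training always uses everything before the test window, never after.
        yield list(range(train_end)), list(range(train_end, test_end))

splits = list(forward_chaining_splits(n_samples=10, n_folds=3, min_train=4))
```

With ten observations this yields three folds: train on the first four and test on the next two, then train on six and test on two, then train on eight and test on two. In every fold the latest training index precedes the earliest test index, which is the property that random folds violate.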

52.9 9. Storage Formats

The logical type of data is one concern; how it is serialized to disk and moved between systems is another, and it has large practical consequences for speed, size, and interoperability.

52.9.1 9.1 CSV

Comma separated values is the lowest common denominator. It is a plain text, row oriented format that is human readable and universally supported. Its weaknesses are equally well known. It has no type information, so every field is text until parsed, and there is no universally followed standard for encoding, quoting, or null values (RFC 4180 describes a common dialect, but many files deviate from it). It is verbose, uncompressed unless wrapped in an external codec such as gzip, and must be scanned in full even to read a single column. CSV is excellent for small data and interchange and poor for analytical workloads at scale.
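The absence of type information is easy to see with the standard library reader. The file contents and the convention of treating an empty field as missing are assumptions for this sketch; another producer might write `NA` or `null` instead.

```python
import csv
import io

text = "row_id,age,region\n1,34,west\n2,,east\n"
rows = list(csv.DictReader(io.StringIO(text)))

# Every parsed field is a string, and the missing age is an empty string.
first_age = rows[0]["age"]
missing_age = rows[1]["age"]

def parse_age(field):
    """Convert a CSV field to int, treating the empty string as missing."""
    return int(field) if field != "" else None

ages = [parse_age(row["age"]) for row in rows]
```

The types and the meaning of an empty field live in the reading code rather than in the file, so two consumers of the same CSV can disagree about both.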

52.9.2 9.2 Parquet

Apache Parquet is a columnar, binary format designed for analytics. Storing data by column rather than by row yields two major advantages. First, a query that touches a few columns reads only those columns, which slashes input and output for wide tables. Second, values within a column share a type and tend to be similar, which makes compression and encoding schemes such as dictionary and run length encoding highly effective. Parquet also stores a schema and per column statistics that allow whole blocks to be skipped during filtering. It is the standard for data lakes and large scale batch processing.

CSV (row oriented):        Parquet (column oriented):
1,34,west                  [1,2,3]
2,51,east                  [34,51,29]
3,29,south                 [west,east,south]
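The compression advantage of a columnar layout can be illustrated with the two encodings named above. This sketch shows the idea only and is not Parquet's actual on disk encoding; the rows are hypothetical and extend the toy layout with repeated region values.

```python
def dictionary_encode(column):
    """Replace each value by an integer code into a list of distinct values."""
    dictionary, codes, index = [], [], {}
    for value in column:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes

def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return [tuple(run) for run in runs]

rows = [
    (1, 34, "west"), (2, 51, "west"), (3, 29, "west"),
    (4, 40, "east"), (5, 38, "east"),
]

# Transpose the row oriented records into one sequence per column.
row_id, age, region = zip(*rows)

dictionary, codes = dictionary_encode(region)
runs = run_length_encode(codes)
```

Five strings shrink to a two entry dictionary and two runs. The same trick applied across a row would gain nothing, because adjacent values in a row have different types and rarely repeat.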

52.9.3 9.3 JSON

JavaScript Object Notation is the standard for semi structured and hierarchical data. It is text based, human readable, and represents nested objects and arrays naturally, which makes it ubiquitous in web APIs and document stores. Its costs are verbosity and parsing overhead, and like CSV it is row oriented and untyped beyond a few primitives. For high volume pipelines, a binary or columnar relative is usually preferred, but JSON remains the default for configuration, interchange, and any data whose shape varies record to record.
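The phrase "untyped beyond a few primitives" can be made concrete with the standard library. The record below is hypothetical; the point is that numbers, strings, booleans, null, arrays, and objects survive a round trip, while a date has no native representation.

```python
import json
from datetime import date

record = {"id": 7, "score": 0.5, "tags": ["a", "b"], "active": True, "note": None}
round_trip = json.loads(json.dumps(record))

# A date is not one of the JSON primitives, so encoding it directly fails.
try:
    json.dumps({"signup": date(2024, 1, 31)})
    date_supported = True
except TypeError:
    date_supported = False

# The usual workaround is a string convention such as ISO 8601,
# which the reader must know about in order to parse it back.
encoded = json.dumps({"signup": date(2024, 1, 31).isoformat()})
```

As with CSV, part of the schema ends up as a convention shared between writer and reader rather than as something the format itself records.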

52.9.4 9.4 Arrow

Apache Arrow is a columnar, in memory format rather than primarily a disk format. Its purpose is to provide a single, language independent representation of columnar data in RAM so that different tools and languages can share data without serialization and copying. Where Parquet optimizes data at rest, Arrow optimizes data in motion and in compute. Its zero copy design and cache friendly columnar layout enable vectorized processing and let Python, Java, R, and others operate on the same buffers. Parquet and Arrow are complementary and frequently paired: Parquet on disk, Arrow in memory.
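The zero copy idea can be illustrated without Arrow itself. The sketch below uses Python's built in buffer protocol as an analogy only: it is not the Arrow API, and Arrow's format additionally standardizes the layout across languages and processes. Two handles refer to one contiguous typed buffer, while a converted list is an independent copy.

```python
from array import array

column = array("q", [34, 51, 29])   # contiguous buffer of 64 bit integers
view = memoryview(column)           # a second handle on the same memory
copied = list(column)               # an independent, converted copy

column[0] = 35                      # write through the original handle

seen_by_view = view[0]              # the view observes the change
seen_by_copy = copied[0]            # the copy does not
bytes_shared = view.nbytes
```

Sharing the buffer costs nothing as the column grows, whereas the copy costs time and memory proportional to its size. A common columnar layout lets different tools and languages do the former instead of the latter.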

52.9.5 9.5 Choosing a Format

The practical guidance follows from the access pattern. Use CSV or JSON for small data, interchange, and human inspection. Use Parquet for analytical storage, columnar access, and large datasets at rest. Use Arrow as the in memory backbone for fast, cross language data processing pipelines. The cost of a wrong choice is rarely correctness, but it is often an order of magnitude in storage and processing time.

52.10 10. How Data Type Drives Modeling Choices

Pulling the threads together, the data type determines a chain of decisions. The structural properties of the data (locality, ordering, permutation invariance, and dimensionality) dictate which architectural priors are appropriate. Locality favors convolutions, ordering favors sequence models, the absence of ordering favors permutation invariant aggregation, and abundant data favors flexible attention based models over strong priors. The same properties dictate evaluation: time series demand chronological splits, while independently sampled tabular rows permit random ones. They also dictate preprocessing: tabular data needs per column treatment, text needs tokenization, and audio benefits from spectral transforms.

The mature practitioner reads a dataset’s type as a specification. It tells them which inductive biases to build in, which storage format will keep the pipeline fast, which leakage risks to guard against, and which family of models is likely to repay the effort of tuning. The most common and costly errors in applied machine learning are not subtle failures of optimization but mismatches between the structure of the data and the assumptions of the method. Getting the data type right, in both representation and storage, is the foundation on which everything else is built.

52.11 References

  1. Apache Parquet documentation. https://parquet.apache.org/docs/
  2. Apache Arrow documentation. https://arrow.apache.org/docs/
  3. Vaswani, A. et al. Attention Is All You Need. 2017. https://arxiv.org/abs/1706.03762
  4. Dosovitskiy, A. et al. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. 2020. https://arxiv.org/abs/2010.11929
  5. Grinsztajn, L. et al. Why Do Tree-Based Models Still Outperform Deep Learning on Tabular Data? 2022. https://arxiv.org/abs/2207.08815
  6. Chen, T. and Guestrin, C. XGBoost: A Scalable Tree Boosting System. 2016. https://arxiv.org/abs/1603.02754
  7. Kipf, T. and Welling, M. Semi-Supervised Classification with Graph Convolutional Networks. 2016. https://arxiv.org/abs/1609.02907
  8. Hyndman, R. and Athanasopoulos, G. Forecasting: Principles and Practice. https://otexts.com/fpp3/
  9. Radford, A. et al. Robust Speech Recognition via Large-Scale Weak Supervision (Whisper). 2022. https://arxiv.org/abs/2212.04356
  10. The JSON Data Interchange Standard (ECMA-404). https://www.json.org/