69 Feature Engineering for Temporal Data

Temporal data carries information that ordinary tabular data does not. The order of observations matters, the spacing between them matters, and the relationship between a value and its own past matters. A model that ignores these properties throws away much of the signal that makes forecasting and time aware classification work. This chapter develops the core vocabulary of temporal feature engineering: lag features, rolling and expanding windows, calendar and seasonality encodings, and time since event features. It closes with the single most consequential issue in the entire discipline, lookahead leakage, and the validation discipline that prevents it.

Throughout, we assume a data set indexed by time $t$, with a target $y_t$ and possibly a panel of entities such as users, stores, or sensors. The recurring question is simple to state and easy to get wrong: at the moment a prediction must be made, which information is genuinely available, and which would only be known later.

To make this question precise, it helps to fix a small amount of notation up front. Let $\mathcal{F}_t$ denote the information set available at time $t$, that is, the collection of all quantities whose values are known by the clock time at which the prediction for $t$ is issued. A feature $\phi$ is admissible for predicting at time $t$ if and only if $\phi$ is measurable with respect to $\mathcal{F}_t$, written $\phi \in \sigma(\mathcal{F}_t)$. Every construction in this chapter is, in the end, a recipe for building columns that stay inside $\mathcal{F}_t$. Leakage, the subject of Section 6, is exactly the event that a feature escapes $\mathcal{F}_t$ and reaches into the future. Holding this single criterion in mind unifies what would otherwise look like a grab bag of tricks.

69.1 1. Why Temporal Features Are Different

69.1.1 1.1 The independence assumption fails

Most supervised learning rests on the assumption that examples are independent and identically distributed. Time series violate both halves of this. Observations are autocorrelated, meaning $y_t$ is correlated with $y_{t-1}$, $y_{t-2}$, and so on. The distribution also drifts: the mean, variance, and seasonal structure of a series in January can differ from the same series in July, and the structure this year can differ from last year. Feature engineering for temporal data is largely the craft of turning this dependence into explicit, model readable columns, so that an otherwise time blind learner such as gradient boosted trees can exploit it.

It is worth being precise about what dependence we are exploiting. A process is weakly stationary if its mean $\mathbb{E}[y_t] = \mu$ is constant, its variance is finite and constant, and its autocovariance $\gamma(k) = \operatorname{Cov}(y_t, y_{t-k})$ depends only on the lag $k$ and not on $t$. The autocorrelation function (ACF) is the normalized version, $\rho(k) = \gamma(k) / \gamma(0)$. Many real series are not stationary in level, but become approximately so after differencing or after the seasonal and trend components are stripped out by the calendar features of Section 4. The practical payoff is that, once a series is rendered roughly stationary, $\rho(k)$ is stable enough to read off which lags carry signal, and a learner fed those lags can recover much of the predictive content of a classical autoregressive model without being told the model form in advance.

69.1.2 1.2 Information availability defines correctness

The defining constraint is causality of information. A feature computed for the row at time $t$ may only use data observed at or before $t$. If a feature secretly incorporates $y_{t+1}$ or any quantity derived from the future, the model will appear to perform brilliantly in offline evaluation and then collapse in production. This is not a tuning problem but a correctness problem, and it is the reason Section 6, the last substantive section of this chapter, is also the longest. Every transformation below is presented with its availability boundary made explicit.

69.2 2. Lag Features

69.2.1 2.1 Definition and intuition

A lag feature is simply a past value of a series, shifted forward so it lines up with the current row. The lag $k$ feature is

\[ x^{(k)}_t = y_{t-k}. \]

Because $y_{t-k}$ was observed $k$ steps before $t$, any lag with $k \geq 1$ is safe with respect to the present, provided $y_{t-k}$ was actually known by time $t$. Lags convert autocorrelation into columns. If demand today is strongly correlated with demand yesterday and demand seven days ago, then lags of 1 and 7 give a tree based model direct access to that relationship.

# pandas sketch: assumes df is a DataFrame sorted by time with a target column "y"
df["lag_1"]  = df["y"].shift(1)
df["lag_7"]  = df["y"].shift(7)
df["lag_28"] = df["y"].shift(28)

69.2.2 2.2 Choosing which lags to include

Three sources guide lag selection. First, domain knowledge: daily retail data warrants lags at 1, 7, and 28 to capture day over day, weekly, and roughly monthly structure. Second, the autocorrelation function (ACF) and partial autocorrelation function (PACF), which quantify the linear correlation between $y_t$ and $y_{t-k}$. The ACF $\rho(k)$ measures the total correlation at lag $k$, including the indirect part that flows through the intermediate lags. The PACF, written $\phi_{kk}$, isolates the direct contribution of lag $k$ after removing the influence of all shorter lags, and is defined as the coefficient on $y_{t-k}$ in the linear regression of $y_t$ on $y_{t-1}, \ldots, y_{t-k}$. The two functions read out different model structures. For a pure autoregressive process of order $p$, the PACF cuts off sharply to zero beyond lag $p$ while the ACF tails off geometrically. For a moving average process of order $q$, the roles reverse: the ACF cuts off beyond lag $q$ and the PACF tails off. A sharp PACF cutoff at lag $p$ therefore suggests including lags $1$ through $p$. As a rough significance guide, sample autocorrelations of white noise of length $n$ have standard error near $1/\sqrt{n}$, so spikes exceeding roughly $2/\sqrt{n}$ in magnitude are candidates worth keeping. Third, the forecast horizon. If you must predict $h$ steps ahead and only data up to time $t$ is available, then the smallest usable lag relative to the target $y_{t+h}$ is $h$ itself. Lags shorter than the horizon are not available at prediction time and must be excluded, a point we return to in Section 6.
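The ACF screening rule described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: the series is synthetic (a weekly cycle plus noise), the helper name `sample_acf` and the horizon value `h = 3` are hypothetical, and the $2/\sqrt{n}$ band is only the rough white noise guide mentioned in the text.

```python
import numpy as np

def sample_acf(y, max_lag):
    """Sample autocorrelation rho(k) for k = 0..max_lag."""
    y = np.asarray(y, dtype=float)
    d = y - y.mean()
    gamma0 = np.dot(d, d) / len(y)
    return np.array([
        np.dot(d[k:], d[:len(y) - k]) / len(y) / gamma0
        for k in range(max_lag + 1)
    ])

# Synthetic daily series with a weekly cycle (illustrative data only).
rng = np.random.default_rng(0)
n = 400
t = np.arange(n)
y = 10 + 3 * np.sin(2 * np.pi * t / 7) + rng.normal(0, 1, n)

rho = sample_acf(y, max_lag=28)
threshold = 2 / np.sqrt(n)                      # rough white noise band
candidate_lags = [k for k in range(1, 29) if abs(rho[k]) > threshold]

# Keep only lags that are available at the forecast horizon (h is hypothetical).
h = 3
usable_lags = [k for k in candidate_lags if k >= h]
```

The final filter applies the third criterion: a lag can carry strong signal and still be unusable if it is shorter than the horizon.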

69.2.3 2.3 Lags in panel data

With multiple entities, lags must be computed within each entity, never across the boundary between them. Grouping is mandatory.

# rows must already be sorted by time within each store
df["lag_1"] = df.groupby("store_id")["y"].shift(1)

Without the group by, the first row of one store would borrow the last value of the previous store, fabricating a relationship that does not exist. This silent cross contamination is among the most common temporal feature bugs.
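To see the contamination concretely, the following sketch builds a tiny two store panel (toy values, with a hypothetical `day` column) and compares the ungrouped and grouped lag on the first row of the second store.

```python
import pandas as pd

df = pd.DataFrame({
    "store_id": ["A", "A", "A", "B", "B", "B"],
    "day":      [1, 2, 3, 1, 2, 3],
    "y":        [10.0, 11.0, 12.0, 500.0, 510.0, 520.0],
}).sort_values(["store_id", "day"]).reset_index(drop=True)

df["lag_1_wrong"] = df["y"].shift(1)                    # crosses the store boundary
df["lag_1"] = df.groupby("store_id")["y"].shift(1)      # stays within each store

first_b = df[df["store_id"] == "B"].iloc[0]
```

In `first_b`, the ungrouped lag holds store A's last value, while the grouped lag is missing, which is the honest answer because store B has no earlier observation.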

69.3 3. Rolling and Expanding Windows

69.3.1 3.1 Rolling windows

A rolling, or moving, window aggregates the most recent $w$ observations into a summary statistic. The rolling mean over window $w$ is

\[ \bar{x}_{t,w} = \frac{1}{w} \sum_{i=1}^{w} y_{t-i}. \]

Note the index runs from $t-1$ back to $t-w$, deliberately excluding $y_t$. Including the current value would leak the target into its own feature. In practice this is enforced by shifting before rolling, or by using a window that is closed on the left only and therefore excludes the right endpoint $t$ (the closed="left" option of pandas rolling windows).

# shift first, then roll: the window ends at t-1
roll = df["y"].shift(1).rolling(window=7)
df["roll_mean_7"] = roll.mean()
df["roll_std_7"]  = roll.std()
df["roll_max_7"]  = roll.max()
df["roll_min_7"]  = roll.min()

Common rolling statistics include the mean (local level), the standard deviation (local volatility), the minimum and maximum (recent extremes), the median (robust level), and quantiles. Rolling counts of events, such as the number of purchases in the trailing 30 days, are equally useful for behavioral data.
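For irregularly spaced behavioral data, a trailing count can be taken over a time based window rather than a fixed number of rows. The sketch below assumes a hypothetical purchase log indexed by timestamp and uses a 30 day window closed on the left, so the current row is never counted in its own feature.

```python
import pandas as pd

# Hypothetical purchase log for one customer, one row per purchase.
purchases = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2024-01-01", "2024-01-10", "2024-01-20", "2024-03-01"]
    ),
    "purchase": [1, 1, 1, 1],
}).set_index("timestamp")

# Window [t - 30 days, t): earlier purchases only, the current row is excluded.
trailing = purchases["purchase"].rolling("30D", closed="left").sum()

# An empty window yields a missing value, which here means "no prior purchases".
purchases["purchases_prev_30d"] = trailing.fillna(0).astype(int)
```

With several customers, the same rolling call would be applied within a group by on the customer key, for the reasons given in Section 2.3.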

69.3.2 3.2 Window size and the bias variance tradeoff

The window length $w$ controls a smoothing tradeoff. Short windows react quickly to change but are noisy. Long windows are stable but lag behind genuine shifts in the series. Rather than choosing a single $w$, it is common to compute the same statistic over several windows, for example 7, 14, 30, and 90 days, and let the model decide which resolution matters. Each additional window is another column, so be mindful of dimensionality and correlation among the resulting features.
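A multi resolution feature set of this kind can be generated with a short loop. The sketch uses a toy series and the window lengths named above; the column naming scheme is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"y": np.arange(1.0, 121.0)})    # toy series 1, 2, ..., 120

past = df["y"].shift(1)                             # every window will end at t-1
for w in (7, 14, 30, 90):
    df[f"roll_mean_{w}"] = past.rolling(window=w).mean()
    df[f"roll_std_{w}"] = past.rolling(window=w).std()
```

Rows with fewer than $w$ past observations are left missing rather than filled with a partial window, which keeps each column's meaning constant across rows.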

69.3.3 3.3 Weighted and exponential windows

A flat rolling mean weights every observation in the window equally, which is rarely ideal because recent observations usually carry more information. The exponentially weighted moving average (EWMA) addresses this by decaying older observations geometrically:

\[ s_t = \alpha\, y_{t-1} + (1 - \alpha)\, s_{t-1}, \qquad 0 < \alpha \leq 1. \]

A larger $\alpha$ means faster adaptation and shorter effective memory. Unrolling the recursion exposes the geometric weighting directly:

\[ s_t = \alpha \sum_{j=1}^{\infty} (1 - \alpha)^{j-1}\, y_{t-j}, \]

a weighted average whose weights $\alpha (1-\alpha)^{j-1}$ sum to one and decay by a constant factor $(1-\alpha)$ at each step further into the past. This makes the connection to the flat rolling window precise: where the window of Section 3.1 assigns weight $1/w$ to the last $w$ observations and zero to everything older, the EWMA assigns smoothly decaying weight to the entire history. The effective memory of the EWMA is often summarized by its center of mass, the weighted mean age of its inputs, $\sum_{j\geq 1} j\,\alpha(1-\alpha)^{j-1} = 1/\alpha$ steps before $t$, or equivalently $(1-\alpha)/\alpha$ steps behind the most recent input $y_{t-1}$. A flat window of length $w$ has mean age $(w+1)/2$, so matching the two gives $\alpha = 2/(w+1)$, which lets one translate a chosen $\alpha$ into an equivalent window length and vice versa. The EWMA thus has the practical advantage of an unbounded but decaying memory, summarizing all past data in a single recursively updated value, and it again excludes $y_t$ when defined with $y_{t-1}$ as the most recent input.
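The recursion can be checked against pandas on a toy series. The sketch below makes two assumptions that the text leaves open: the recursion is seeded with $s_1 = y_0$, and the pandas smoother is run with `adjust=False` and then shifted by one row so that row $t$ sees data only through $t-1$. The last line uses the common span convention $\alpha = 2/(w+1)$ to express $\alpha$ as a window length.

```python
import numpy as np
import pandas as pd

y = pd.Series([3.0, 5.0, 4.0, 8.0, 6.0, 7.0])      # toy data
alpha = 0.3

# Manual recursion s_t = alpha * y_{t-1} + (1 - alpha) * s_{t-1}, seeded with s_1 = y_0.
s = np.full(len(y), np.nan)
s[1] = y.iloc[0]
for t in range(2, len(y)):
    s[t] = alpha * y.iloc[t - 1] + (1 - alpha) * s[t - 1]

# Same values from pandas: smooth first, then shift so the current value is excluded.
s_pandas = y.ewm(alpha=alpha, adjust=False).mean().shift(1)

# Window length whose mean age matches this alpha (span convention).
equivalent_window = 2 / alpha - 1
```

Shifting after smoothing, rather than before, sidesteps any question about how the smoother treats a leading missing value.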

69.3.4 3.4 Expanding windows

An expanding window grows to include all history up to the current point rather than a fixed span:

\[ \bar{x}^{\text{exp}}_t = \frac{1}{t-1} \sum_{i=1}^{t-1} y_{t-i}. \]

Expanding aggregates capture long run level, cumulative counts, and running records such as the historical maximum to date. They are appropriate when the early history remains relevant indefinitely, for instance a customer’s lifetime average order value. As with rolling windows, the summation must stop at $t-1$.

exp = df["y"].shift(1).expanding(min_periods=1)
df["cum_mean"] = exp.mean()
df["cum_max"]  = exp.max()

69.3.5 3.5 Target encoding over time

A subtle and powerful case is encoding a categorical variable by the historical mean of the target within each category, computed only from past rows. This is an expanding group statistic and is highly prone to leakage if computed naively over the whole data set. The safe construction computes, for each row, the mean of the target over earlier rows sharing the same category, optionally smoothed toward the global mean to stabilize categories with little history. A standard smoothed form is

\[ \hat{e}_{c,t} = \frac{n_{c,t}\,\bar{y}_{c,t} + m\,\bar{y}_t}{n_{c,t} + m}, \]

where $\bar{y}_{c,t}$ is the mean target over past rows in category $c$, $n_{c,t}$ is the count of those rows, $\bar{y}_t$ is the running global mean, and $m$ is a smoothing weight expressed in pseudo counts. When a category has plenty of history ($n_{c,t} \gg m$) the estimate trusts the category mean, and when history is thin it shrinks toward the global mean, which guards against overconfident encodings of rare levels.

When to use and what to watch for. Time aware target encoding shines when a high cardinality categorical (product id, zip code, user id) carries strong signal that one hot encoding would render too sparse to learn. The pitfalls are twofold. First, the encoding must be expanding and strictly backward looking; the global, whole data set mean per category is a textbook leak. Second, early rows in each category are noisy because $n_{c,t}$ is small, so the smoothing weight $m$ should be chosen with that in mind, and the very first occurrence of a level necessarily falls back entirely to the global mean.
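A minimal sketch of the smoothed, strictly backward looking encoding follows. It assumes the rows are already in time order, uses toy data, and picks $m = 2$ arbitrarily; the column names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "cat": ["a", "b", "a", "a", "b"],
    "y":   [1.0, 0.0, 3.0, 5.0, 2.0],
})                                                   # rows already in time order
m = 2.0                                              # smoothing weight in pseudo counts

# Count and sum of the target over EARLIER rows of the same category.
n_c = df.groupby("cat").cumcount()
sum_c = df.groupby("cat")["y"].cumsum() - df["y"]

# Running global mean over earlier rows only.
global_mean = df["y"].shift(1).expanding().mean()

# n * category_mean equals sum_c, so the smoothed form simplifies to this ratio.
df["cat_enc"] = (sum_c + m * global_mean) / (n_c + m)
```

The very first row is missing because no history exists at all, and the first occurrence of category b receives exactly the running global mean, as the text describes.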

69.4 4. Calendar and Seasonality Features

69.4.1 4.1 Decomposing the timestamp

A raw timestamp is nearly useless to a model, but the calendar fields extracted from it are rich. Standard extractions include the hour of day, day of week, day of month, week of year, month, quarter, and year. Boolean flags for weekend, month start, month end, quarter end, and holiday capture structural breaks in human activity. These features let a model learn, for example, that traffic peaks on weekday mornings or that sales spike at month end.

ts = df["timestamp"]
df["hour"]       = ts.dt.hour
df["dayofweek"]  = ts.dt.dayofweek
df["month"]      = ts.dt.month
df["is_weekend"] = ts.dt.dayofweek >= 5

Calendar features are computed entirely from the timestamp of the row being predicted, so they are always available at prediction time and never leak. The future calendar is known in advance, which is precisely why these features are so valuable for forecasting.

69.4.2 4.2 Cyclical encoding

Calendar fields are cyclical: hour 23 is adjacent to hour 0, and December is adjacent to January, yet a naive integer encoding places them maximally far apart. The standard remedy is to map a cyclical variable with period $P$ onto a circle using sine and cosine:

\[ x_{\sin} = \sin\!\left(\frac{2\pi\, c}{P}\right), \qquad x_{\cos} = \cos\!\left(\frac{2\pi\, c}{P}\right), \]

where $c$ is the raw cyclical value, such as the hour, and $P$ is its period, such as 24. The pair $(x_{\sin}, x_{\cos})$ encodes proximity correctly, so that 23:00 and 00:00 sit close together. This encoding mainly benefits models that rely on smooth distance in feature space, such as linear models and neural networks. Tree based models, which split on thresholds, often handle raw integer calendar fields adequately, though cyclical features can still help.
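The sine and cosine mapping takes only a few lines. The sketch below encodes the hour of day with $P = 24$ and includes a small hypothetical helper, `circle_distance`, to confirm that adjacent hours are equally close wherever they fall on the clock.

```python
import numpy as np
import pandas as pd

hours = pd.DataFrame({"hour": np.arange(24)})
P = 24                                               # period of the cycle
angle = 2 * np.pi * hours["hour"] / P
hours["hour_sin"] = np.sin(angle)
hours["hour_cos"] = np.cos(angle)

def circle_distance(a, b):
    """Euclidean distance between two hours in the encoded (sin, cos) plane."""
    pa = hours.loc[a, ["hour_sin", "hour_cos"]].to_numpy(dtype=float)
    pb = hours.loc[b, ["hour_sin", "hour_cos"]].to_numpy(dtype=float)
    return float(np.linalg.norm(pa - pb))
```

For instance, `circle_distance(23, 0)` equals `circle_distance(0, 1)`, whereas the raw integers 23 and 0 sit 23 units apart.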

69.4.3 4.3 Fourier terms for seasonality

When seasonality is strong or multi period, a richer representation uses several Fourier harmonics of the seasonal cycle:

\[ \sum_{k=1}^{K} \left[ a_k \sin\!\left(\frac{2\pi k t}{P}\right) + b_k \cos\!\left(\frac{2\pi k t}{P}\right) \right]. \]

Increasing the number of harmonics $K$ lets the model represent more complex, non sinusoidal seasonal shapes. Fourier features are central to additive forecasting models such as Prophet and are a compact way to inject yearly seasonality, with $P = 365.25$, into a model that operates at daily resolution.
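As a sketch, the harmonics can be generated as plain columns, leaving the coefficients $a_k$ and $b_k$ for the downstream model to learn. The function name `fourier_features` and its column names are hypothetical, and $t$ is assumed to be an integer day index.

```python
import numpy as np
import pandas as pd

def fourier_features(t, period, K):
    """Return sin and cos columns for harmonics k = 1..K of the given period."""
    t = np.asarray(t, dtype=float)
    cols = {}
    for k in range(1, K + 1):
        cols[f"sin_{k}"] = np.sin(2 * np.pi * k * t / period)
        cols[f"cos_{k}"] = np.cos(2 * np.pi * k * t / period)
    return pd.DataFrame(cols)

weekly = fourier_features(np.arange(28), period=7, K=2)
yearly = fourier_features(np.arange(730), period=365.25, K=3)
```

Each harmonic adds two columns, so $K$ trades seasonal flexibility against dimensionality in the same way the window count does in Section 3.2.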

69.4.4 4.4 Holidays and events

Holidays, promotions, sporting finals, and product launches create deviations that pure calendar arithmetic cannot capture. These are typically supplied through an external calendar and encoded as binary indicators, often with leading and trailing windows to model anticipation and aftermath, for example a flag for the three days before a major holiday. Because such events are usually scheduled in advance, they remain available at prediction time. Events that are not known in advance must be treated with care, since using them would constitute leakage.
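A holiday indicator with a three day anticipation window might be built as follows. The holiday dates here are a hypothetical stand in for an external calendar, and the column names are illustrative.

```python
import pandas as pd

# Hypothetical holiday calendar; in practice this comes from an external source.
holidays = pd.to_datetime(["2024-12-25", "2025-01-01"])

df = pd.DataFrame({"date": pd.date_range("2024-12-20", "2025-01-02", freq="D")})
df["is_holiday"] = df["date"].isin(holidays)

# Flag the three days before any holiday (anticipation window).
pre_days = set()
for hol in holidays:
    pre_days.update(hol - pd.Timedelta(days=d) for d in (1, 2, 3))
df["pre_holiday_3d"] = df["date"].isin(list(pre_days))
```

A trailing window for the aftermath follows the same pattern with the sign of the offset reversed.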

69.5 5. Time Since Event Features

69.5.1 5.1 Elapsed time and recency

Many problems hinge not on calendar position but on elapsed time since something happened. The time since last event feature measures, at each row, how long it has been since the most recent occurrence of an event of interest:

\[ \Delta_t = t - \max\{\, s \leq t : \text{an event occurred at } s \,\}. \]

Examples include days since a customer’s last purchase, hours since the last login, time since the last fault on a machine, and tenure since account creation. These features encode recency and are often among the strongest predictors in churn, conversion, and survival style problems. Because the construction looks only at the most recent past event, it respects the availability boundary by definition.
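One way to compute the elapsed time feature is to carry the timestamp of the latest event forward and subtract it from the current timestamp. The sketch assumes a single entity, daily rows, and a hypothetical boolean `event` column.

```python
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=8, freq="D"),
    "event":     [False, True, False, False, True, False, False, False],
})

# Keep the timestamp only on event rows, then carry the latest one forward.
last_event_time = df["timestamp"].where(df["event"]).ffill()

# Elapsed days since the most recent event at or before each row.
df["days_since_event"] = (df["timestamp"] - last_event_time).dt.days
```

Rows before the first event stay missing. If the event at time $t$ is itself part of what is being predicted, shifting the `event` column by one row before this computation restricts the feature to strictly earlier events, and with several entities the forward fill belongs inside a group by.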

69.5.2 5.2 Time until a known future event

When an event is scheduled and therefore known ahead of time, a symmetric time until next event feature is legitimate. Days until the next public holiday or until a contract renewal are known at prediction time and carry strong signal. The crucial distinction is knowability. Time until the next purchase is not known in advance and would leak the future, whereas time until the next scheduled holiday is fixed on the calendar and is safe.
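For a scheduled calendar, the time until the next event can be computed by comparing each row's date with every scheduled date and keeping the smallest non negative gap. The schedule below is hypothetical, and the pairwise comparison is chosen for clarity rather than speed.

```python
import numpy as np
import pandas as pd

scheduled = pd.to_datetime(["2024-12-25", "2025-01-01"])   # hypothetical schedule
df = pd.DataFrame({"date": pd.date_range("2024-12-23", "2025-01-02", freq="D")})

# Pairwise gaps in days: rows are dates, columns are scheduled events.
gap_days = (scheduled.values[None, :] - df["date"].values[:, None]) / np.timedelta64(1, "D")

# Ignore events already in the past, then take the nearest upcoming one.
gap_days = np.where(gap_days >= 0, gap_days, np.inf)
nearest = gap_days.min(axis=1)
df["days_until_event"] = np.where(np.isfinite(nearest), nearest, np.nan)
```

Dates after the last scheduled event are left missing, which signals that the calendar needs extending before those rows can be scored.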

69.5.3 5.3 Counts and decay since events

Recency can be combined with frequency. Useful constructions include the count of events in a trailing window, the cumulative count of events to date, and a decayed event intensity that weights recent events more heavily. A common decay form is $\exp(-\lambda \Delta_t)$, which assigns high weight to events that occurred recently and negligible weight to those in the distant past. The decay rate $\lambda$ becomes a tunable hyperparameter governing how quickly the influence of an event fades.
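The decay form is easy to parameterize through a half life, since a weight of one half is reached when $\lambda \Delta_t = \ln 2$. The sketch below uses a hypothetical 14 day half life and also sums the decayed contributions of several past events into a single intensity; all numbers are illustrative.

```python
import numpy as np

half_life_days = 14.0                          # hypothetical tuning choice
lam = np.log(2) / half_life_days               # decay rate implied by the half life

# Recency weight exp(-lambda * delta) for a range of elapsed times.
days_since_event = np.array([0.0, 1.0, 7.0, 14.0, 60.0])
recency_weight = np.exp(-lam * days_since_event)

# Decayed intensity: every past event contributes, recent ones the most.
event_days = np.array([2.0, 9.0, 10.0])        # days on which events occurred
now = 12.0
past = event_days[event_days <= now]           # only events already observed
intensity = np.exp(-lam * (now - past)).sum()
```

Expressing $\lambda$ as a half life tends to make the hyperparameter easier to discuss with domain experts than a raw rate.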

69.6 6. Avoiding Lookahead Leakage

69.6.1 6.1 What leakage is

Lookahead leakage, also called temporal leakage or data leakage from the future, occurs when a feature for the row at time $t$ depends on information that would not have been available until after $t$. The model learns from this illicit information during training, reports excellent metrics during offline evaluation, and then degrades sharply in production because the future information is genuinely absent when predictions are made in real time. Leakage is insidious because it inflates exactly the metrics practitioners trust, so it is rarely caught by looking at validation scores alone.

69.6.2 6.2 Common sources of leakage

A handful of patterns account for most temporal leakage in practice.

The first is fitting preprocessing on the full data set. Computing a scaler’s mean and variance, a target encoding, or an imputation value over all rows, including future ones, lets statistics from the future contaminate the past. Every such transform must be fit on training data only and then applied to later data.

The second is centered or symmetric windows. A rolling statistic that is centered on $t$ uses observations from both sides, meaning it reads from the future. Temporal features must use trailing windows that end at or before $t$, and strictly before $t$ when the window is taken over the target itself, never centered ones.

The third is forgetting to exclude the current value. A rolling mean that includes $y_t$ leaks the target into its own feature. Shift before aggregating, as shown earlier.

The fourth is ignoring the forecast horizon. If predictions for $y_{t+h}$ are made using only data up to $t$, then any feature requiring data between $t+1$ and $t+h$ is unavailable. The minimum usable lag is $h$, and rolling windows must be offset by the horizon accordingly.

The fifth is publication and ingestion delay. Even past data may not have been recorded yet at prediction time. A daily sales figure may only be finalized two days later, so a feature must respect the latency with which each input actually arrives, not merely its nominal timestamp. Modeling this with an explicit availability timestamp per field is the rigorous solution.
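The availability timestamp idea can be sketched with an as of join: each prediction picks up the latest record whose availability time is at or before the prediction time, regardless of the date the record describes. The table and column names are hypothetical, and the two day delay mirrors the example in the text.

```python
import pandas as pd

# Each sales figure carries the date it describes and the time it became available.
sales = pd.DataFrame({
    "sales_date":   pd.to_datetime(["2024-03-01", "2024-03-02", "2024-03-03"]),
    "available_at": pd.to_datetime(["2024-03-03", "2024-03-04", "2024-03-05"]),
    "sales":        [100.0, 120.0, 90.0],
})
predictions = pd.DataFrame({
    "prediction_time": pd.to_datetime(["2024-03-02", "2024-03-04", "2024-03-06"]),
})

# Join on availability, not on the nominal sales date.
features = pd.merge_asof(
    predictions.sort_values("prediction_time"),
    sales.sort_values("available_at"),
    left_on="prediction_time",
    right_on="available_at",
    direction="backward",
)
```

The prediction made on 2 March receives no sales figure at all, even though a sale dated 1 March exists, because that figure was not yet published.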

69.6.3 6.3 Validation that respects time

Standard $k$ fold cross validation shuffles rows and trains on data that is, in time, both before and after the validation rows. For temporal data this is invalid because it trains on the future to predict the past. The correct approach uses time ordered splits.

In forward chaining, also called rolling origin evaluation, the data is split at successive time points. The model trains on everything up to a cutoff and is evaluated on the following block, then the cutoff advances and the process repeats. Two common variants exist. An expanding window keeps all history in each training fold, while a sliding window uses a fixed length training period that moves forward, which is preferable when the relationship between features and target drifts over time.

Fold 1: train [..t1]              test (t1..t2]
Fold 2: train [..t2]              test (t2..t3]
Fold 3: train [..t3]              test (t3..t4]

A further refinement inserts a gap, sometimes called an embargo, between the end of training and the start of testing. The gap equals the forecast horizon plus any window over which feature and label construction could otherwise overlap, and it prevents information from bleeding across the boundary through overlapping windows. This embargo discipline is standard in domains such as quantitative finance where overlapping labels are common (see reference 6, Lopez de Prado). The following diagram shows a single fold with its embargo gap, which is the unit that the rolling origin procedure repeats as the cutoff advances.

flowchart LR
    A["Train window (history up to cutoff)"] --> B["Embargo gap (horizon plus overlap)"]
    B --> C["Test window (held out future block)"]

Figure 69.1: One fold of time ordered validation with an embargo gap separating train and test.

The embargo width is not arbitrary. If labels are built from a window of length $L$ ending at the prediction time, and forecasts are issued $h$ steps ahead, then a training row near the cutoff and a test row just past it can share underlying observations unless the boundary is widened by at least $h + L$ time steps. Setting the gap to that value purges the overlap and restores the property that the test block is genuinely out of sample.
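The fold construction with an embargo can be sketched as a small index generator. This is an illustrative helper, not a library API: the function name and its parameters are assumptions, rows are assumed to be in time order, and `max_train_size` switches from an expanding to a sliding training window.

```python
import numpy as np

def rolling_origin_splits(n_samples, n_folds, test_size, gap=0, max_train_size=None):
    """Yield (train_idx, test_idx) pairs for time ordered rows 0..n_samples-1."""
    for fold in range(n_folds):
        test_start = n_samples - (n_folds - fold) * test_size
        train_end = test_start - gap               # exclusive; gap rows are dropped
        if train_end <= 0:
            raise ValueError("not enough history for the requested folds and gap")
        if max_train_size is None:
            train_start = 0                        # expanding window
        else:
            train_start = max(0, train_end - max_train_size)   # sliding window
        yield (np.arange(train_start, train_end),
               np.arange(test_start, test_start + test_size))

h, L = 3, 1                                        # horizon and label window
splits = list(rolling_origin_splits(n_samples=100, n_folds=3, test_size=10, gap=h + L))
```

scikit-learn's `TimeSeriesSplit` offers comparable behavior through its `gap` and `max_train_size` parameters, so a hand written generator is mainly useful when the fold boundaries must follow calendar dates or entity specific rules.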

69.6.4 6.4 A worked example: horizon aware features

Concrete numbers make the availability boundary vivid. Suppose a retailer forecasts daily demand $y_t$ for each store and must commit replenishment orders three days ahead, so the horizon is $h = 3$. The prediction issued on the morning of day $t$ targets $y_{t+3}$, and because the sales total for a day is only finalized one day after it closes, the most recent figure available at decision time is $y_{t-1}$, not $y_t$.

Now consider three candidate features for the row whose target is $y_{t+3}$.

A lag of one day relative to the target, $y_{t+2}$, is not admissible. Day $t+2$ has not yet occurred when the order is placed on day $t$, so $y_{t+2} \notin \mathcal{F}_t$. The smallest admissible lag is the one ending at the latest known observation, $y_{t-1}$, which sits four steps before the target. Relative to the target index $t+3$, that is a lag of $4 = h + 1$, the horizon plus the one day ingestion delay.

A trailing seven day rolling mean is admissible only if its window ends at $y_{t-1}$ or earlier. A window that runs through $y_{t+2}$ would silently incorporate three unobserved days. The fix is the same offset that protected the lag: shift the window back so its right endpoint is $y_{t-1}$.

A calendar flag for “is day $t+3$ a public holiday” is admissible, because the holiday schedule is fixed in advance and is therefore in $\mathcal{F}_t$ even though it concerns a future date. This is the asymmetry of Section 5.2 in action: future values are forbidden, but future known facts are allowed.

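For validation, the embargo follows directly. With $h = 3$ and a label that aggregates a single day ($L = 1$), the gap between train and test must be at least $h + L = 4$ days. This ensures that the target of every training row has been realized and recorded before the first test prediction is issued, so the model is never fitted on outcomes that would still be unknown at that point.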
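The three candidate features of the worked example can be written out directly. The sketch uses a toy demand series, a hypothetical holiday date, and one row per decision day $t$, with $h = 3$ and a one day delay as in the text.

```python
import numpy as np
import pandas as pd

h, delay = 3, 1
days = pd.date_range("2024-01-01", periods=20, freq="D")
y = pd.Series(np.arange(20.0), index=days)          # toy demand: 0, 1, ..., 19

holidays = pd.DatetimeIndex(["2024-01-14"])         # hypothetical holiday calendar

rows = pd.DataFrame(index=days)                     # one row per decision day t
rows["target"] = y.shift(-h)                        # y_{t+3}, the value to forecast
rows["latest_known"] = y.shift(delay)               # y_{t-1}, lag h + delay from target
rows["roll_mean_7"] = y.shift(delay).rolling(7).mean()    # window ends at y_{t-1}
rows["target_is_holiday"] = (days + pd.Timedelta(days=h)).isin(holidays)

# Rows lacking a full feature window or a realized target cannot be used for training.
train_rows = rows.dropna()
```

Every feature on a row is anchored to the decision day, so the distance between `target` and `latest_known` is always $h + 1 = 4$ steps, while the holiday flag legitimately looks three days ahead.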

69.6.5 6.5 A practical leakage checklist

Before trusting any temporal model, audit each feature against the following questions. Does the feature use only data observed at or before the prediction time, accounting for ingestion delay? Were all preprocessing statistics fit on training data alone? Are all windows trailing rather than centered, and do they exclude the current target? Is the smallest lag at least as large as the forecast horizon? Does the validation scheme preserve time order with an appropriate gap? A feature that fails any of these is a leakage candidate and must be corrected before the model is considered valid. A useful sanity check is that suspiciously high offline accuracy, far above what the problem plausibly allows, is more often a symptom of leakage than of a breakthrough.
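Part of this audit can be automated with a truncation test: recompute a feature on a copy of the series that stops at row $t$ and confirm that the value at $t$ is unchanged. The helper below is one possible sketch, with a hypothetical name and synthetic data. It detects dependence on rows after $t$ only; it cannot tell whether the current value $y_t$ was wrongly included, nor whether ingestion delay was respected.

```python
import numpy as np
import pandas as pd

def leaks_future(feature_fn, y, check_points):
    """Return True if the feature at any checked row changes once later data is hidden."""
    full = feature_fn(y)
    for t in check_points:
        truncated = feature_fn(y.iloc[: t + 1])     # pretend the series stops at row t
        a, b = full.iloc[t], truncated.iloc[t]
        if not (np.isclose(a, b) or (pd.isna(a) and pd.isna(b))):
            return True
    return False

rng = np.random.default_rng(0)
y = pd.Series(rng.normal(size=60))                  # synthetic series

def trailing(s):
    return s.shift(1).rolling(7).mean()             # window ends at t-1

def centered(s):
    return s.rolling(7, center=True).mean()         # reads three steps ahead
```

Here `leaks_future(trailing, y, range(10, 50))` is false, while the centered window is flagged because its value at $t$ disappears once rows after $t$ are hidden.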

69.7 7. Summary

Temporal feature engineering is the practice of making the structure of time explicit so that general purpose learners can use it. Lag features expose autocorrelation, rolling and expanding windows summarize recent and full history, calendar and Fourier features encode known periodicity, and time since event features capture recency and decay. Each of these is only as good as its respect for the availability boundary. The discipline that ties everything together is the relentless question of what is known when a prediction is made, enforced through trailing windows, horizon aware lags, train only preprocessing, and time ordered validation with an embargo. Master that question, and the rest of temporal feature engineering follows naturally.

69.8 References

1. Hyndman, R. J., and Athanasopoulos, G. Forecasting: Principles and Practice (3rd ed.). OTexts. https://otexts.com/fpp3/
2. Kuhn, M., and Johnson, K. Feature Engineering and Selection: A Practical Approach for Predictive Models. https://bookdown.org/max/FES/
3. scikit-learn Developers. TimeSeriesSplit and time series cross validation. https://scikit-learn.org/stable/modules/cross_validation.html#time-series-split
4. pandas Developers. Window operations: rolling, expanding, and exponentially weighted. https://pandas.pydata.org/docs/user_guide/window.html
5. Taylor, S. J., and Letham, B. Forecasting at Scale (Prophet). https://peerj.com/preprints/3190/
6. Lopez de Prado, M. Advances in Financial Machine Learning (purged cross validation and embargo). https://www.wiley.com/en-us/Advances+in+Financial+Machine+Learning-p-9781119482086
7. Kaufman, S., Rosset, S., and Perlich, C. Leakage in Data Mining: Formulation, Detection, and Avoidance. https://dl.acm.org/doi/10.1145/2020408.2020496
8. Bergmeir, C., and Benitez, J. M. On the use of cross validation for time series predictor evaluation. https://doi.org/10.1016/j.ins.2011.12.028

# Feature Engineering for Temporal Data Temporal data carries information that ordinary tabular data does not. The order of observations matters, the spacing between them matters, and the relationship between a value and its own past matters. A model that ignores these properties throws away much of the signal that makes forecasting and time aware classification work. This chapter develops the core vocabulary of temporal feature engineering: lag features, rolling and expanding windows, calendar and seasonality encodings, and time since event features. It closes with the single most consequential issue in the entire discipline, lookahead leakage, and the validation discipline that prevents it. Throughout, we assume a data set indexed by time $t$, with a target $y_t$ and possibly a panel of entities such as users, stores, or sensors. The recurring question is simple to state and easy to get wrong: at the moment a prediction must be made, which information is genuinely available, and which would only be known later. To make this question precise, it helps to fix a small amount of notation up front. Let $\mathcal{F}_t$ denote the information set available at time $t$, that is, the collection of all quantities whose values are known by the clock time at which the prediction for $t$ is issued. A feature $\phi$ is *admissible* for predicting at time $t$ if and only if $\phi$ is measurable with respect to $\mathcal{F}_t$, written $\phi \in \sigma(\mathcal{F}_t)$. Every construction in this chapter is, in the end, a recipe for building columns that stay inside $\mathcal{F}_t$. Leakage, the subject of Section 6, is exactly the event that a feature escapes $\mathcal{F}_t$ and reaches into the future. Holding this single criterion in mind unifies what would otherwise look like a grab bag of tricks. ## 1. 
Why Temporal Features Are Different ### 1.1 The independence assumption fails Most supervised learning rests on the assumption that examples are independent and identically distributed. Time series violate both halves of this. Observations are autocorrelated, meaning $y_t$ is correlated with $y_{t-1}$, $y_{t-2}$, and so on. The distribution also drifts: the mean, variance, and seasonal structure of a series in January can differ from the same series in July, and the structure this year can differ from last year. Feature engineering for temporal data is largely the craft of turning this dependence into explicit, model readable columns, so that an otherwise time blind learner such as gradient boosted trees can exploit it. It is worth being precise about what dependence we are exploiting. A process is *weakly stationary* if its mean $\mathbb{E}[y_t] = \mu$ is constant, its variance is finite and constant, and its autocovariance $\gamma(k) = \operatorname{Cov}(y_t, y_{t-k})$ depends only on the lag $k$ and not on $t$. The autocorrelation function (ACF) is the normalized version, $\rho(k) = \gamma(k) / \gamma(0)$. Many real series are not stationary in level, but become approximately so after differencing or after the seasonal and trend components are stripped out by the calendar features of Section 4. The practical payoff is that, once a series is rendered roughly stationary, $\rho(k)$ is stable enough to read off which lags carry signal, and a learner fed those lags can recover much of the predictive content of a classical autoregressive model without being told the model form in advance. ### 1.2 Information availability defines correctness The defining constraint is causality of information. A feature computed for the row at time $t$ may only use data observed at or before $t$. If a feature secretly incorporates $y_{t+1}$ or any quantity derived from the future, the model will appear to perform brilliantly in offline evaluation and then collapse in production. 
This is not a tuning problem, it is a correctness problem, and it is the reason the final section of this chapter is the longest. Every transformation below is presented with its availability boundary made explicit. ## 2. Lag Features ### 2.1 Definition and intuition A lag feature is simply a past value of a series, shifted forward so it lines up with the current row. The lag $k$ feature is $$ x^{(k)}_t = y_{t-k}. $$ Because $y_{t-k}$ was observed $k$ steps before $t$, any lag with $k \geq 1$ is safe with respect to the present, provided $y_{t-k}$ was actually known by time $t$. Lags convert autocorrelation into columns. If demand today is strongly correlated with demand yesterday and demand seven days ago, then lags of 1 and 7 give a tree based model direct access to that relationship. ```python # pandas-style pseudocode, not executed df["lag_1"] = df["y"].shift(1) df["lag_7"] = df["y"].shift(7) df["lag_28"] = df["y"].shift(28) ``` ### 2.2 Choosing which lags to include Three sources guide lag selection. First, domain knowledge: daily retail data warrants lags at 1, 7, and 28 to capture day over day, weekly, and roughly monthly structure. Second, the autocorrelation function (ACF) and partial autocorrelation function (PACF), which quantify the linear correlation between $y_t$ and $y_{t-k}$. The ACF $\rho(k)$ measures the *total* correlation at lag $k$, including the indirect part that flows through the intermediate lags. The PACF, written $\phi_{kk}$, isolates the *direct* contribution of lag $k$ after removing the influence of all shorter lags, and is defined as the coefficient on $y_{t-k}$ in the linear regression of $y_t$ on $y_{t-1}, \ldots, y_{t-k}$. The two functions read out different model structures. For a pure autoregressive process of order $p$, the PACF cuts off sharply to zero beyond lag $p$ while the ACF tails off geometrically. For a moving average process of order $q$, the roles reverse: the ACF cuts off beyond lag $q$ and the PACF tails off. 
A sharp PACF cutoff at lag $p$ therefore suggests including lags $1$ through $p$. As a rough significance guide, sample autocorrelations of white noise of length $n$ have standard error near $1/\sqrt{n}$, so spikes exceeding roughly $2/\sqrt{n}$ in magnitude are candidates worth keeping. Third, the forecast horizon. If you must predict $h$ steps ahead and only data up to time $t$ is available, then the smallest usable lag relative to the target $y_{t+h}$ is $h$ itself. Lags shorter than the horizon are not available at prediction time and must be excluded, a point we return to in Section 6. ### 2.3 Lags in panel data With multiple entities, lags must be computed within each entity, never across the boundary between them. Grouping is mandatory. ```python df["lag_1"] = df.groupby("store_id")["y"].shift(1) ``` Without the group by, the first row of one store would borrow the last value of the previous store, fabricating a relationship that does not exist. This silent cross contamination is among the most common temporal feature bugs. ## 3. Rolling and Expanding Windows ### 3.1 Rolling windows A rolling, or moving, window aggregates the most recent $w$ observations into a summary statistic. The rolling mean over window $w$ is $$ \bar{x}_{t,w} = \frac{1}{w} \sum_{i=1}^{w} y_{t-i}. $$ Note the index runs from $t-1$ back to $t-w$, deliberately excluding $y_t$. Including the current value would leak the target into its own feature. In practice this is enforced by shifting before rolling, or by using a closed interval that excludes the right endpoint. ```python # shift first, then roll: the window ends at t-1 roll = df["y"].shift(1).rolling(window=7) df["roll_mean_7"] = roll.mean() df["roll_std_7"] = roll.std() df["roll_max_7"] = roll.max() df["roll_min_7"] = roll.min() ``` Common rolling statistics include the mean (local level), the standard deviation (local volatility), the minimum and maximum (recent extremes), the median (robust level), and quantiles. 
Rolling counts of events, such as the number of purchases in the trailing 30 days, are equally useful for behavioral data.

### 3.2 Window size and the bias variance tradeoff

The window length $w$ controls a smoothing tradeoff. Short windows react quickly to change but are noisy. Long windows are stable but lag behind genuine shifts in the series. Rather than choosing a single $w$, it is common to compute the same statistic over several windows, for example 7, 14, 30, and 90 days, and let the model decide which resolution matters. Each additional window is another column, so be mindful of dimensionality and correlation among the resulting features.

### 3.3 Weighted and exponential windows

A flat rolling mean weights every observation in the window equally, which is rarely ideal because recent observations usually carry more information. The exponentially weighted moving average (EWMA) addresses this by decaying older observations geometrically:

$$
s_t = \alpha\, y_{t-1} + (1 - \alpha)\, s_{t-1}, \qquad 0 < \alpha \leq 1.
$$

A larger $\alpha$ means faster adaptation and shorter effective memory. Unrolling the recursion exposes the geometric weighting directly:

$$
s_t = \alpha \sum_{j=1}^{\infty} (1 - \alpha)^{j-1}\, y_{t-j},
$$

a weighted average whose weights $\alpha (1-\alpha)^{j-1}$ sum to one and decay by a constant factor $(1-\alpha)$ at each step further into the past. This makes the connection to the flat rolling window precise: where the window of Section 3.1 assigns weight $1/w$ to the last $w$ observations and zero to everything older, the EWMA assigns smoothly decaying weight to the entire history. The *effective memory* of the EWMA is often summarized by the mean lag of its weights, $\sum_{j\geq 1} j\,\alpha(1-\alpha)^{j-1} = 1/\alpha$, or equivalently by its center of mass $(1-\alpha)/\alpha$, which measures the same average age starting from the most recent input $y_{t-1}$ rather than from $t$. Either quantity lets one translate a chosen $\alpha$ into an equivalent window length and vice versa.
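The agreement between the recursion and its unrolled form is easy to verify by hand or in code. The following plain Python sketch is illustrative only: the function names and the toy series are assumptions, and the recursion is seeded with $s_0 = 0$ before any data exist, which corresponds to truncating the infinite sum at the start of the series.

```python
def ewma_recursive(y, alpha):
    """Return [s_0, ..., s_n] with s_t = alpha * y[t-1] + (1 - alpha) * s_{t-1}."""
    s = [0.0]  # s_0 = 0: no history yet (this seed is an assumption)
    for t in range(1, len(y) + 1):
        s.append(alpha * y[t - 1] + (1 - alpha) * s[-1])
    return s

def ewma_unrolled(y, alpha, t):
    """alpha * sum_{j=1..t} (1 - alpha)**(j - 1) * y[t - j], truncated at the data start."""
    return alpha * sum((1 - alpha) ** (j - 1) * y[t - j] for j in range(1, t + 1))

y = [10.0, 12.0, 9.0, 11.0, 15.0, 14.0]  # toy series y_0, ..., y_5
alpha = 0.3

s = ewma_recursive(y, alpha)
for t in range(1, len(y) + 1):
    # s[t] uses y_0..y_{t-1} only, so it never sees y_t.
    print(t, s[t], ewma_unrolled(y, alpha, t))

# The geometric weights approach a total of one as more history is included.
weights = [alpha * (1 - alpha) ** (j - 1) for j in range(1, 201)]
print(sum(weights))
```

With a finite history the weights at step $t$ sum to $1 - (1-\alpha)^t$ rather than exactly one, so early values are biased toward the zero seed. In pandas, `df["y"].shift(1).ewm(alpha=alpha, adjust=False).mean()` should follow the same recursion but seeds it with the first available value instead of zero.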
The EWMA thus has the practical advantage of an unbounded but decaying memory, summarizing all past data in a single recursively updated value, and it again excludes $y_t$ when defined with $y_{t-1}$ as the most recent input.

### 3.4 Expanding windows

An expanding window grows to include all history up to the current point rather than a fixed span:

$$
\bar{x}^{\text{exp}}_t = \frac{1}{t-1} \sum_{i=1}^{t-1} y_{t-i}.
$$

Expanding aggregates capture long run level, cumulative counts, and running records such as the historical maximum to date. They are appropriate when the early history remains relevant indefinitely, for instance a customer's lifetime average order value. As with rolling windows, the summation must stop at $t-1$.

```python
# shift first so the expanding window ends at t-1
exp = df["y"].shift(1).expanding(min_periods=1)
df["cum_mean"] = exp.mean()
df["cum_max"] = exp.max()
```

### 3.5 Target encoding over time

A subtle and powerful case is encoding a categorical variable by the historical mean of the target within each category, computed only from past rows. This is an expanding group statistic and is highly prone to leakage if computed naively over the whole data set. The safe construction computes, for each row, the mean of the target over earlier rows sharing the same category, optionally smoothed toward the global mean to stabilize categories with little history. A standard smoothed form is

$$
\hat{e}_{c,t} = \frac{n_{c,t}\,\bar{y}_{c,t} + m\,\bar{y}_t}{n_{c,t} + m},
$$

where $\bar{y}_{c,t}$ is the mean target over past rows in category $c$, $n_{c,t}$ is the count of those rows, $\bar{y}_t$ is the running global mean, and $m$ is a smoothing weight expressed in pseudo counts. When a category has plenty of history ($n_{c,t} \gg m$) the estimate trusts the category mean, and when history is thin it shrinks toward the global mean, which guards against overconfident encodings of rare levels.

When to use and what to watch for.
Time aware target encoding shines when a high cardinality categorical (product id, zip code, user id) carries strong signal that one hot encoding would render too sparse to learn. The pitfalls are twofold. First, the encoding must be expanding and strictly backward looking; the global, whole data set mean per category is a textbook leak. Second, early rows in each category are noisy because $n_{c,t}$ is small, so the smoothing weight $m$ should be chosen with that in mind, and the very first occurrence of a level necessarily falls back entirely to the global mean.

## 4. Calendar and Seasonality Features

### 4.1 Decomposing the timestamp

A raw timestamp is nearly useless to a model, but the calendar fields extracted from it are rich. Standard extractions include the hour of day, day of week, day of month, week of year, month, quarter, and year. Boolean flags for weekend, month start, month end, quarter end, and holiday capture structural breaks in human activity. These features let a model learn, for example, that traffic peaks on weekday mornings or that sales spike at month end.

```python
ts = df["timestamp"]
df["hour"] = ts.dt.hour
df["dayofweek"] = ts.dt.dayofweek  # Monday=0, ..., Sunday=6
df["month"] = ts.dt.month
df["is_weekend"] = ts.dt.dayofweek >= 5  # Saturday or Sunday
```

Calendar features are computed entirely from the timestamp of the row being predicted, so they are always available at prediction time and never leak. The future calendar is known in advance, which is precisely why these features are so valuable for forecasting.

### 4.2 Cyclical encoding

Calendar fields are cyclical: hour 23 is adjacent to hour 0, and December is adjacent to January, yet a naive integer encoding places them maximally far apart.
The standard remedy is to map a cyclical variable with period $P$ onto a circle using sine and cosine:

$$
x_{\sin} = \sin\!\left(\frac{2\pi\, c}{P}\right), \qquad x_{\cos} = \cos\!\left(\frac{2\pi\, c}{P}\right),
$$

where $c$ is the raw cyclical value, such as the hour, and $P$ is its period, such as 24. The pair $(x_{\sin}, x_{\cos})$ encodes proximity correctly, so that 23:00 and 00:00 sit close together. This encoding mainly benefits models that rely on smooth distance in feature space, such as linear models and neural networks. Tree based models, which split on thresholds, often handle raw integer calendar fields adequately, though cyclical features can still help.

### 4.3 Fourier terms for seasonality

When seasonality is strong or multi period, a richer representation uses several Fourier harmonics of the seasonal cycle:

$$
\sum_{k=1}^{K} \left[ a_k \sin\!\left(\frac{2\pi k t}{P}\right) + b_k \cos\!\left(\frac{2\pi k t}{P}\right) \right].
$$

Increasing the number of harmonics $K$ lets the model represent more complex, non sinusoidal seasonal shapes. Fourier features are central to additive forecasting models such as Prophet and are a compact way to inject yearly seasonality, with $P = 365.25$, into a model that operates at daily resolution.

### 4.4 Holidays and events

Holidays, promotions, sporting finals, and product launches create deviations that pure calendar arithmetic cannot capture. These are typically supplied through an external calendar and encoded as binary indicators, often with leading and trailing windows to model anticipation and aftermath, for example a flag for the three days before a major holiday. Because such events are usually scheduled in advance, they remain available at prediction time. Events that are not known in advance must be treated with care, since using them would constitute leakage.

## 5. Time Since Event Features

### 5.1 Elapsed time and recency

Many problems hinge not on calendar position but on elapsed time since something happened. The time since last event feature measures, at each row, how long it has been since the most recent occurrence of an event of interest:

$$
\Delta_t = t - \max\{\, s \leq t : \text{an event occurred at } s \,\}.
$$

Examples include days since a customer's last purchase, hours since the last login, time since the last fault on a machine, and tenure since account creation. These features encode recency and are often among the strongest predictors in churn, conversion, and survival style problems. Because the construction looks only at the most recent past event, it respects the availability boundary by definition.

### 5.2 Time until a known future event

When an event is scheduled and therefore known ahead of time, a symmetric time until next event feature is legitimate. Days until the next public holiday or until a contract renewal are known at prediction time and carry strong signal. The crucial distinction is knowability. Time until the next purchase is not known in advance and would leak the future, whereas time until the next scheduled holiday is fixed on the calendar and is safe.

### 5.3 Counts and decay since events

Recency can be combined with frequency. Useful constructions include the count of events in a trailing window, the cumulative count of events to date, and a decayed event intensity that weights recent events more heavily. A common decay form is $\exp(-\lambda \Delta_t)$, which assigns high weight to events that occurred recently and negligible weight to those in the distant past. The decay rate $\lambda$ becomes a tunable hyperparameter governing how quickly the influence of an event fades.

## 6. Avoiding Lookahead Leakage

### 6.1 What leakage is

Lookahead leakage, also called temporal leakage or data leakage from the future, occurs when a feature for the row at time $t$ depends on information that would not have been available until after $t$. The model learns from this illicit information during training, reports excellent metrics during offline evaluation, and then degrades sharply in production because the future information is genuinely absent when predictions are made in real time. Leakage is insidious because it inflates exactly the metrics practitioners trust, so it is rarely caught by looking at validation scores alone.

### 6.2 Common sources of leakage

A handful of patterns account for most temporal leakage in practice.

The first is fitting preprocessing on the full data set. Computing a scaler's mean and variance, a target encoding, or an imputation value over all rows, including future ones, lets statistics from the future contaminate the past. Every such transform must be fit on training data only and then applied to later data.

The second is centered or symmetric windows. A rolling statistic that is centered on $t$ uses observations from both sides, meaning it reads from the future. Temporal features must use trailing windows that end at or before $t$, never centered ones.

The third is forgetting to exclude the current value. A rolling mean that includes $y_t$ leaks the target into its own feature. Shift before aggregating, as shown earlier.

The fourth is ignoring the forecast horizon. If predictions for $y_{t+h}$ are made using only data up to $t$, then any feature requiring data between $t+1$ and $t+h$ is unavailable. The minimum usable lag is $h$, and rolling windows must be offset by the horizon accordingly.

The fifth is publication and ingestion delay. Even past data may not have been recorded yet at prediction time.
A daily sales figure may only be finalized two days later, so a feature must respect the latency with which each input actually arrives, not merely its nominal timestamp. Modeling this with an explicit availability timestamp per field is the rigorous solution.

### 6.3 Validation that respects time

Standard $k$ fold cross validation assigns rows to folds without regard to time order, so each model trains on data that is, in time, both before and after the validation rows. For temporal data this is invalid because it trains on the future to predict the past. The correct approach uses time ordered splits.

In forward chaining, also called rolling origin evaluation, the data is split at successive time points. The model trains on everything up to a cutoff and is evaluated on the following block, then the cutoff advances and the process repeats. Two common variants exist. An expanding window keeps all history in each training fold, while a sliding window uses a fixed length training period that moves forward, which is preferable when the relationship between features and target drifts over time.

```text
Fold 1: train [..t1]   test (t1..t2]
Fold 2: train [..t2]   test (t2..t3]
Fold 3: train [..t3]   test (t3..t4]
```

A further refinement inserts a gap, sometimes called an embargo, between the end of training and the start of testing. The gap equals the forecast horizon plus any window over which feature and label construction could otherwise overlap, and it prevents information from bleeding across the boundary through overlapping windows. This embargo discipline is standard in domains such as quantitative finance where overlapping labels are common (see reference 6, Lopez de Prado).

The following diagram shows a single fold with its embargo gap, which is the unit that the rolling origin procedure repeats as the cutoff advances.

```{mermaid}
%%| label: fig-embargo
%%| fig-cap: "One fold of time ordered validation with an embargo gap separating train and test."
flowchart LR
    A["Train window (history up to cutoff)"] --> B["Embargo gap (horizon plus overlap)"]
    B --> C["Test window (held out future block)"]
```

The embargo width is not arbitrary. If labels are built from a window of length $L$ ending at the prediction time, and forecasts are issued $h$ steps ahead, then a training row near the cutoff and a test row just past it can share underlying observations unless the boundary is widened by at least $h + L$ time steps. Setting the gap to that value purges the overlap and restores the property that the test block is genuinely out of sample.

### 6.4 A worked example: horizon aware features

Concrete numbers make the availability boundary vivid. Suppose a retailer forecasts daily demand $y_t$ for each store and must commit replenishment orders three days ahead, so the horizon is $h = 3$. The prediction issued on the morning of day $t$ targets $y_{t+3}$, and because the sales total for a day is only finalized one day after it closes, the most recent figure available at decision time is $y_{t-1}$, not $y_t$.

Now consider three candidate features for the row whose target is $y_{t+3}$.

A lag of one day relative to the target, $y_{t+2}$, is *not admissible*. Day $t+2$ has not yet occurred when the order is placed on day $t$, so $y_{t+2} \notin \mathcal{F}_t$. The smallest admissible lag is the one ending at the latest known observation, $y_{t-1}$, which sits four steps before the target. Relative to the target index $t+3$, that is a lag of $4 = h + 1$, the horizon plus the one day ingestion delay.

A trailing seven day rolling mean is admissible only if its window ends at $y_{t-1}$ or earlier. A window that runs through $y_{t+2}$ would silently incorporate three unobserved days. The fix is the same offset that protected the lag: shift the window back so its right endpoint is $y_{t-1}$.
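The two offsets just described can be sketched in pandas. This is a minimal illustration under stated assumptions: the constants `H` and `DELAY`, the column names, and the single store toy series are hypothetical, and each row is taken to be indexed by the target day, so every feature is shifted relative to the day being predicted.

```python
import numpy as np
import pandas as pd

H = 3               # forecast horizon in days
DELAY = 1           # ingestion delay in days (assumed, as in the example above)
OFFSET = H + DELAY  # smallest admissible lag relative to the target day

# Toy daily series for one store; y on day number d equals d, so shifts are easy to read.
days = pd.date_range("2024-01-01", periods=20, freq="D")
df = pd.DataFrame({"y": np.arange(20, dtype=float)}, index=days)

# Each row is a target day. Features may only use values at least OFFSET days older.
shifted = df["y"].shift(OFFSET)
df["lag_min"] = shifted                                # y_{t-1} for target y_{t+3}
df["roll_mean_7"] = shifted.rolling(window=7).mean()   # window ends at y_{t-1}

print(df.tail(3))
```

With several stores, the same shift would be applied within each store through a group by, as in Section 2.3.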
A calendar flag for "is day $t+3$ a public holiday" is *admissible*, because the holiday schedule is fixed in advance and is therefore in $\mathcal{F}_t$ even though it concerns a future date. This is the asymmetry of Section 5.2 in action: future *values* are forbidden, but future *known facts* are allowed.

For validation, the embargo follows directly. With $h = 3$ and a label that aggregates a single day ($L = 1$), the gap between train and test must be at least $h + L = 4$ days, ensuring no training row's feature window can reach into a test day's target.

### 6.5 A practical leakage checklist

Before trusting any temporal model, audit each feature against the following questions. Does the feature use only data observed at or before the prediction time, accounting for ingestion delay? Were all preprocessing statistics fit on training data alone? Are all windows trailing rather than centered, and do they exclude the current target? Is the smallest lag at least as large as the forecast horizon? Does the validation scheme preserve time order with an appropriate gap?

A feature that fails any of these is a leakage candidate and must be corrected before the model is considered valid. A useful sanity check is that suspiciously high offline accuracy, far above what the problem plausibly allows, is more often a symptom of leakage than of a breakthrough.

## 7. Summary

Temporal feature engineering is the practice of making the structure of time explicit so that general purpose learners can use it. Lag features expose autocorrelation, rolling and expanding windows summarize recent and full history, calendar and Fourier features encode known periodicity, and time since event features capture recency and decay. Each of these is only as good as its respect for the availability boundary.
The discipline that ties everything together is the relentless question of what is known when a prediction is made, enforced through trailing windows, horizon aware lags, train only preprocessing, and time ordered validation with an embargo. Master that question, and the rest of temporal feature engineering follows naturally.

## References

1. Hyndman, R. J., and Athanasopoulos, G. Forecasting: Principles and Practice (3rd ed.). OTexts. https://otexts.com/fpp3/
2. Kuhn, M., and Johnson, K. Feature Engineering and Selection: A Practical Approach for Predictive Models. https://bookdown.org/max/FES/
3. scikit-learn Developers. TimeSeriesSplit and time series cross validation. https://scikit-learn.org/stable/modules/cross_validation.html#time-series-split
4. pandas Developers. Window operations: rolling, expanding, and exponentially weighted. https://pandas.pydata.org/docs/user_guide/window.html
5. Taylor, S. J., and Letham, B. Forecasting at Scale (Prophet). https://peerj.com/preprints/3190/
6. Lopez de Prado, M. Advances in Financial Machine Learning (purged cross validation and embargo). https://www.wiley.com/en-us/Advances+in+Financial+Machine+Learning-p-9781119482086
7. Kaufman, S., Rosset, S., and Perlich, C. Leakage in Data Mining: Formulation, Detection, and Avoidance. https://dl.acm.org/doi/10.1145/2020408.2020496
8. Bergmeir, C., and Benitez, J. M. On the use of cross validation for time series predictor evaluation. https://doi.org/10.1016/j.ins.2011.12.028