Split Samples in Time Series
Simulated Dataset:
We generate 100 customers, each with data over 24 months.
The dataset includes two random features (feature_1 and feature_2), and a binary default outcome, with a default rate of 20%.
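Rolling-Window Cross-Validation (recommended for R&D):
We use TimeSeriesSplit from sklearn.model_selection, which splits by row position: in every fold, all training rows come before all test rows. Earlier months are therefore used for training and later months for testing only if the rows are sorted chronologically.
With the default settings, the test window moves forward fold by fold while the training set grows to include all earlier rows; set max_train_size to cap the training set and obtain a fixed-length window that "rolls" forward.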
Expanding Window Cross-Validation (for production use):
In the expanding window approach, the training set grows as more data becomes available.
We define a function expanding_window_split that expands the training set window while testing on the next available step.
Model:
We use a RandomForestClassifier for demonstration, though you can replace this with any model.
Accuracy Calculation:
The accuracy of the model for each fold is printed out for both rolling-window and expanding window cross-validation.
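To make the difference between the two window types concrete, the following minimal sketch prints the train and test indices that TimeSeriesSplit produces on a toy array of twelve time-ordered rows. The toy array and the helper describe_splits are illustrative assumptions and are not part of the simulation below; max_train_size is a documented TimeSeriesSplit parameter.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve time-ordered observations (toy data: think of one row per month).
X_toy = np.arange(12).reshape(-1, 1)


def describe_splits(splitter, X):
    """Return the (train, test) index lists of every fold (hypothetical helper)."""
    return [(train.tolist(), test.tolist()) for train, test in splitter.split(X)]


# Default behaviour: the training set grows with every fold (expanding window).
expanding = describe_splits(TimeSeriesSplit(n_splits=3), X_toy)

# max_train_size caps the training set, giving a fixed-length rolling window.
rolling = describe_splits(TimeSeriesSplit(n_splits=3, max_train_size=3), X_toy)

for name, folds in [("expanding", expanding), ("rolling", rolling)]:
    print(name)
    for train, test in folds:
        print(f"  train={train} test={test}")
```

In both cases every training index precedes every test index; only the start of the training window differs.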
import pandas as pd
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# Simulate example dataset
np.random.seed(42)
# Parameters for the simulation
n_customers = 100
n_months = 24
# Create a DataFrame for customers across months
customer_ids = np.repeat(np.arange(1, n_customers + 1), n_months)
months = np.tile(np.arange(1, n_months + 1), n_customers)
default = np.random.binomial(1, 0.2, n_customers * n_months) # 20% default rate
# Simulate some features (you can add more complex features)
feature_1 = np.random.randn(n_customers * n_months) # Random feature
feature_2 = np.random.randn(n_customers * n_months) # Another random feature
# Create the DataFrame
df = pd.DataFrame({
    'customer_id': customer_ids,
    'month': months,
    'feature_1': feature_1,
    'feature_2': feature_2,
    'default': default
})
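# Sort by month first so that row order is chronological across all customers;
# TimeSeriesSplit splits by row position, not by the value of the month column
df = df.sort_values(by=['month', 'customer_id']).reset_index(drop=True)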
# Prepare features (X) and target (y)
X = df[['feature_1', 'feature_2']]
y = df['default']
# ---------------------------------
# Rolling-Window Cross-Validation
# ---------------------------------
print("Rolling-Window Cross-Validation:")
tscv = TimeSeriesSplit(n_splits=5)
model = RandomForestClassifier() # Example model
# Cross-validation loop for rolling-window
for fold, (train_index, test_index) in enumerate(tscv.split(X), 1):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    # Fit the model on the training set
    model.fit(X_train, y_train)
    # Predict on the testing set
    y_pred = model.predict(X_test)
    # Evaluate the model
    accuracy = accuracy_score(y_test, y_pred)
    print(f'Fold {fold} Accuracy: {accuracy:.4f}')
print('End')
Rolling-Window Cross-Validation:
Fold 1 Accuracy: 0.7400
Fold 2 Accuracy: 0.7250
Fold 3 Accuracy: 0.7600
Fold 4 Accuracy: 0.7950
Fold 5 Accuracy: 0.7600
End
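These accuracies are best read against a majority-class baseline. Because roughly 80% of the simulated labels are non-defaults and the features are pure noise, a model that always predicts "no default" already scores about 0.80. The sketch below illustrates this with sklearn's DummyClassifier on a separately simulated sample; the variable names, the sample size, and the 80/20 split point are assumptions chosen for illustration and are not taken from the code above.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Simulated data in the same spirit as the example: noise features, ~20% defaults.
n_rows = 2000
X_sim = rng.standard_normal((n_rows, 2))
y_sim = rng.binomial(1, 0.2, n_rows)

# Train on the first 80% of rows and evaluate on the last 20% (assumed split point).
split = 1600
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_sim[:split], y_sim[:split])

# The baseline always predicts the majority class seen during training.
baseline_acc = accuracy_score(y_sim[split:], baseline.predict(X_sim[split:]))
print(f"Majority-class baseline accuracy: {baseline_acc:.4f}")
```

A model is only adding value when its fold accuracy clearly exceeds this baseline; for imbalanced default data, metrics such as ROC AUC or recall are often more informative than accuracy.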