Split Samples in Time Series


Simulated Dataset:

We generate 100 customers, each observed over 24 months (2,400 rows in total).
The dataset includes two random features (feature_1 and feature_2) and a binary default outcome drawn independently with a 20% default rate, so the features carry no real signal about the target.

Rolling-Window Cross-Validation (recommended for R&D):

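We use TimeSeriesSplit from sklearn.model_selection. It splits on row position, so earlier months are used for training and later months for testing only when the rows are sorted chronologically (by month first).
The test window moves forward with each fold. With the default settings the training set grows from fold to fold; passing max_train_size caps its length so that the training window also "rolls" forward through the data.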
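
The difference between the default behaviour and a capped training window is easiest to see on a toy example. The sketch below is illustrative only: the 12-row array and the names X_demo, expanding_folds and rolling_folds are assumptions introduced here, not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve time-ordered observations, e.g. one row per month (toy data)
X_demo = np.arange(12).reshape(-1, 1)

# Default settings: the training set grows with every fold
expanding = TimeSeriesSplit(n_splits=3)
# max_train_size caps the training set at 3 rows, so the window rolls forward
rolling = TimeSeriesSplit(n_splits=3, max_train_size=3)

expanding_folds = [(tr.tolist(), te.tolist()) for tr, te in expanding.split(X_demo)]
rolling_folds = [(tr.tolist(), te.tolist()) for tr, te in rolling.split(X_demo)]

for name, folds in [("default", expanding_folds), ("max_train_size=3", rolling_folds)]:
    print(name)
    for train_idx, test_idx in folds:
        print(f"  train={train_idx} test={test_idx}")
```

In both cases every training index precedes every test index; only the start of the training window differs.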

Expanding Window Cross-Validation (for production use):

In the expanding window approach, the training set grows as more data becomes available.
We define a function expanding_window_split that extends the training window one step at a time and tests on the next available step.
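
One possible shape for such a function is sketched below. Only the name expanding_window_split comes from the text; the signature, the min_train_months and test_months parameters, and the toy month list are assumptions made for illustration. It splits on the month values themselves, so it does not depend on how the rows are sorted.

```python
from typing import Iterator, List, Sequence, Tuple


def expanding_window_split(
    months: Sequence[int], min_train_months: int = 12, test_months: int = 1
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx) row positions with a growing training window.

    Hypothetical signature: the first fold trains on the earliest
    `min_train_months` months and tests on the following `test_months`;
    each later fold extends the training window by `test_months`.
    """
    unique_months = sorted(set(months))
    last_start = len(unique_months) - test_months
    for end in range(min_train_months, last_start + 1, test_months):
        train_set = set(unique_months[:end])
        test_set = set(unique_months[end:end + test_months])
        train_idx = [i for i, m in enumerate(months) if m in train_set]
        test_idx = [i for i, m in enumerate(months) if m in test_set]
        yield train_idx, test_idx


# Toy data: two customers observed over six months, ordered by customer then month
toy_months = [1, 2, 3, 4, 5, 6] * 2
toy_splits = list(expanding_window_split(toy_months, min_train_months=3))

for fold, (train_idx, test_idx) in enumerate(toy_splits, 1):
    train_seen = sorted({toy_months[i] for i in train_idx})
    test_seen = sorted({toy_months[i] for i in test_idx})
    print(f"Fold {fold}: train months {train_seen}, test months {test_seen}")
```

The returned positions can be passed to X.iloc[...] and y.iloc[...] in the same way as the indices produced by TimeSeriesSplit.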

Model:

We use a RandomForestClassifier for demonstration, though you can replace this with any model.

Accuracy Calculation:

The model's accuracy on each fold is printed for both the rolling-window and the expanding window cross-validation.
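
When reading these accuracies, it helps to keep the majority-class baseline in mind: with a 20% default rate, always predicting "no default" already scores about 80%. The following minimal sketch uses a hand-made label list and a hypothetical accuracy helper (both assumptions for illustration, not the notebook's data) to show the arithmetic.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy labels with a 20% default rate: 8 non-defaults and 2 defaults
y_true = [0] * 8 + [1] * 2

# A trivial "model" that always predicts the majority class (no default)
y_majority = [0] * len(y_true)

baseline = accuracy(y_true, y_majority)
print(f"Majority-class baseline accuracy: {baseline:.2f}")
```

A fold accuracy near or below this baseline suggests the model has learned little beyond the class balance, which is expected here because the simulated features are pure noise.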

import pandas as pd
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Simulate example dataset
np.random.seed(42)

# Parameters for the simulation
n_customers = 100
n_months = 24

# Create a DataFrame for customers across months
customer_ids = np.repeat(np.arange(1, n_customers + 1), n_months)
months = np.tile(np.arange(1, n_months + 1), n_customers)
default = np.random.binomial(1, 0.2, n_customers * n_months)  # 20% default rate

# Simulate some features (you can add more complex features)
feature_1 = np.random.randn(n_customers * n_months)  # Random feature
feature_2 = np.random.randn(n_customers * n_months)  # Another random feature

# Create the DataFrame
df = pd.DataFrame({
    'customer_id': customer_ids,
    'month': months,
    'feature_1': feature_1,
    'feature_2': feature_2,
    'default': default
})

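# Sort by month, then customer_id, so that row order is chronological
# (TimeSeriesSplit splits on row position, not on the month column)
df = df.sort_values(by=['month', 'customer_id']).reset_index(drop=True)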

# Prepare features (X) and target (y)
X = df[['feature_1', 'feature_2']]
y = df['default']

# ---------------------------------
# Rolling-Window Cross-Validation
# ---------------------------------
print("Rolling-Window Cross-Validation:")
tscv = TimeSeriesSplit(n_splits=5)
model = RandomForestClassifier()  # Example model

# Cross-validation loop for rolling-window
for fold, (train_index, test_index) in enumerate(tscv.split(X), 1):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    # Fit the model on the training set
    model.fit(X_train, y_train)

    # Predict on the testing set
    y_pred = model.predict(X_test)

    # Evaluate the model
    accuracy = accuracy_score(y_test, y_pred)
    print(f'Fold {fold} Accuracy: {accuracy:.4f}')
print('End')

Example output (the random forest is not seeded, so exact values vary between runs):

Rolling-Window Cross-Validation:
Fold 1 Accuracy: 0.7400
Fold 2 Accuracy: 0.7250
Fold 3 Accuracy: 0.7600
Fold 4 Accuracy: 0.7950
Fold 5 Accuracy: 0.7600
End