Multi-Armed Bandits


Step 1: Simulate the Dataset

We’ll create a dataset where each potential client has a set of features (context) and can receive one of several possible messages.

The goal is to learn which message works best for different segments based on their context.

Key Components:

  • Contextual features: These could be behavioral data such as the number of SMSs read, the average time taken to respond, etc.

  • Actions: The different messages that can be sent.

  • Rewards: A binary reply indicator (1 if the client responded to the message, 0 otherwise). Averaged over clients, it gives the reply rate.

import numpy as np
import pandas as pd

# Set seed for reproducibility
np.random.seed(42)

# Parameters
n_customers = 1000  # Number of customers
n_messages = 5  # Number of different messages
n_features = 4  # Number of contextual features

# Simulate customer features (contexts)
X = np.random.rand(n_customers, n_features)

# Simulate rewards for each message (action)
# We assume that different contexts have different optimal messages
true_coefficients = np.random.rand(n_messages, n_features)
noise = np.random.randn(n_customers, n_messages) * 0.1
rewards = X @ true_coefficients.T + noise

# Convert rewards to probabilities (between 0 and 1)
reply_probabilities = 1 / (1 + np.exp(-rewards))

# Generate actual replies (binary rewards) based on probabilities
y = np.random.binomial(1, reply_probabilities)

# Create a DataFrame to store the dataset
columns = [f'feature_{i+1}' for i in range(n_features)] + [f'message_{i+1}' for i in range(n_messages)]
data = np.hstack((X, y))
df = pd.DataFrame(data, columns=columns)

# Display the first few rows of the dataset
df.head()
   feature_1  feature_2  feature_3  feature_4  message_1  message_2  message_3  message_4  message_5
0   0.374540   0.950714   0.731994   0.598658        1.0        1.0        1.0        0.0        1.0
1   0.156019   0.155995   0.058084   0.866176        1.0        0.0        1.0        1.0        1.0
2   0.601115   0.708073   0.020584   0.969910        0.0        0.0        0.0        1.0        1.0
3   0.832443   0.212339   0.181825   0.183405        0.0        0.0        1.0        1.0        0.0
4   0.304242   0.524756   0.431945   0.291229        0.0        0.0        0.0        1.0        1.0

Step 2: Test the Contextual Bandit Model

Assuming your team’s model is implemented as a function contextual_bandit_predict(X), which takes customer features and predicts the best message for each customer, we can simulate running the model on this dataset.

def contextual_bandit_predict(X):
    """
    Dummy implementation for the sake of testing.
    Replace this with your team's actual model.
    """
    # For simplicity, score each message with the true simulation coefficients
    # (an oracle that a real model would not have) and pick the highest score
    predicted_rewards = X @ true_coefficients.T
    return np.argmax(predicted_rewards, axis=1)

# Simulate running the bandit model
predicted_messages = contextual_bandit_predict(X)

# Look up the observed reply for the message chosen for each customer
actual_rewards = y[np.arange(n_customers), predicted_messages]

# Evaluate the performance: average reward
average_reward = np.mean(actual_rewards)
print(f'Average Reward: {average_reward:.4f}')
Average Reward: 0.7440
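To put a number like this in context, it helps to compute a few reference points on the same kind of data. The block below is a minimal sketch under stated assumptions: it regenerates a dataset with the same shapes using np.random.default_rng (so the values will not match the output above), and the names random_reward, best_fixed_reward and oracle_reward are illustrative, not part of any existing code.

```python
import numpy as np

# Regenerate a dataset with the same structure as Step 1.
# Assumption: a fresh Generator is used here, so numbers differ from the run above.
rng = np.random.default_rng(42)
n_customers, n_messages, n_features = 1000, 5, 4

X = rng.random((n_customers, n_features))
true_coefficients = rng.random((n_messages, n_features))
noise = rng.normal(scale=0.1, size=(n_customers, n_messages))
reply_probabilities = 1 / (1 + np.exp(-(X @ true_coefficients.T + noise)))
y = rng.binomial(1, reply_probabilities)

rows = np.arange(n_customers)

# Baseline 1: send every customer a uniformly random message
random_messages = rng.integers(0, n_messages, size=n_customers)
random_reward = y[rows, random_messages].mean()

# Baseline 2: send everyone the single message with the best overall reply rate
best_fixed_reward = y.mean(axis=0).max()

# Reference: the oracle policy from Step 2, which uses the true coefficients
oracle_messages = np.argmax(X @ true_coefficients.T, axis=1)
oracle_reward = y[rows, oracle_messages].mean()

print(f'Random policy:      {random_reward:.4f}')
print(f'Best fixed message: {best_fixed_reward:.4f}')
print(f'Oracle policy:      {oracle_reward:.4f}')
```

A contextual model is only adding value if it beats the best fixed message, since that baseline ignores the customer features entirely.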

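The higher the average reward, the better. Note that the random baseline here is not 0.5: it is the average reply rate across all messages, y.mean(). Because the features and the coefficients are both drawn from [0, 1), the simulated reply probabilities mostly sit above 0.5, so a policy that picks messages at random already scores well above 0.5. Compare the model against that baseline rather than against 0.5.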
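The dummy contextual_bandit_predict reads true_coefficients, which exist only inside a simulation. As a rough sketch of what a learned stand-in could look like, the block below runs an epsilon-greedy policy with one ridge-regression reward estimate per message, updating it from the single reply observed for each customer. This is an illustrative technique chosen for brevity, not your team's model; epsilon, A, b, chosen and learned_predict are hypothetical names, and the data is regenerated with np.random.default_rng, so results will differ from the output above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_customers, n_messages, n_features = 1000, 5, 4

# Simulated environment, same structure as Step 1
X = rng.random((n_customers, n_features))
true_coefficients = rng.random((n_messages, n_features))
reply_probabilities = 1 / (1 + np.exp(-(X @ true_coefficients.T)))
y = rng.binomial(1, reply_probabilities)

epsilon = 0.1  # fraction of customers who get a random message (exploration)

# Per-message ridge regression statistics: A[k] = I + sum(x x^T), b[k] = sum(r x)
A = np.tile(np.eye(n_features), (n_messages, 1, 1))
b = np.zeros((n_messages, n_features))

def estimate_theta(A, b):
    """Solve A[k] @ theta[k] = b[k] for every message k at once."""
    return np.linalg.solve(A, b[:, :, None])[:, :, 0]

chosen = np.empty(n_customers, dtype=int)
observed = np.empty(n_customers)

for t in range(n_customers):
    x = X[t]
    if rng.random() < epsilon:
        a = int(rng.integers(n_messages))          # explore
    else:
        a = int(np.argmax(estimate_theta(A, b) @ x))  # exploit current estimates
    r = y[t, a]  # only the reply to the message actually sent is observed
    A[a] += np.outer(x, x)
    b[a] += r * x
    chosen[t], observed[t] = a, r

def learned_predict(X_new):
    """Pick, for each row, the message with the highest estimated reward."""
    theta = estimate_theta(A, b)
    return np.argmax(X_new @ theta.T, axis=1)

average_reward = observed.mean()
print(f'Average reward while learning: {average_reward:.4f}')
```

Unlike the oracle, this policy pays a cost for exploration and for its early, poorly informed choices, so its average reward would normally be read alongside the random and oracle reference points rather than in isolation.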