Multi-Armed Bandits

Step 1: Simulate the Dataset

We’ll create a dataset where each potential client has a set of features (context) and can receive one of several possible messages.

The goal is to learn which message works best for different segments based on their context.

Key Components:

Contextual features: In practice, behavioral data such as the number of SMSs read or the average time taken to respond; in this simulation, four random values in [0, 1).
Actions: The different messages that can be sent.
Rewards: Whether the client replied to the message (1) or not (0). Averaged over clients, this binary reward is the reply rate.
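Formally, the simulation in Step 1 corresponds to the model below. The notation is ours: x_i is customer i's context (X in the code), θ_k is the coefficient vector of message k (a row of true_coefficients), p_ik is reply_probabilities and y_ik is y. The last line shows why picking the highest linear score is the best choice when only the context is known: the noise does not depend on the context, and the logistic function σ is increasing.

```latex
\begin{aligned}
s_{ik} &= x_i^\top \theta_k + \varepsilon_{ik}, \qquad \varepsilon_{ik} \sim \mathcal{N}(0,\, 0.1^2), \qquad x_i, \theta_k \in [0,1]^4 \\
p_{ik} &= \sigma(s_{ik}) = \frac{1}{1 + e^{-s_{ik}}} \\
y_{ik} &\sim \operatorname{Bernoulli}(p_{ik}) \\
\pi^*(x_i) &= \arg\max_{k} \; \mathbb{E}\left[\sigma(s_{ik}) \mid x_i\right] = \arg\max_{k} \; x_i^\top \theta_k
\end{aligned}
```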

import numpy as np
import pandas as pd

# Set seed for reproducibility
np.random.seed(42)

# Parameters
n_customers = 1000  # Number of customers
n_messages = 5  # Number of different messages
n_features = 4  # Number of contextual features

# Simulate customer features (contexts)
X = np.random.rand(n_customers, n_features)

# Simulate a linear score (logit) for each customer-message pair
# We assume that different contexts have different optimal messages
true_coefficients = np.random.rand(n_messages, n_features)
noise = np.random.randn(n_customers, n_messages) * 0.1
logits = X @ true_coefficients.T + noise

# Convert logits to reply probabilities in (0, 1) with the logistic (sigmoid) function
reply_probabilities = 1 / (1 + np.exp(-logits))

# Generate actual replies (binary rewards: 1 = replied, 0 = no reply) from the probabilities
y = np.random.binomial(1, reply_probabilities)

# Create a DataFrame to store the dataset
columns = [f'feature_{i+1}' for i in range(n_features)] + [f'message_{i+1}' for i in range(n_messages)]
data = np.hstack((X, y))
df = pd.DataFrame(data, columns=columns)

# Display the first few rows of the dataset
df.head()

	feature_1	feature_2	feature_3	feature_4	message_1	message_2	message_3	message_4	message_5
0	0.374540	0.950714	0.731994	0.598658	1.0	1.0	1.0	0.0	1.0
1	0.156019	0.155995	0.058084	0.866176	1.0	0.0	1.0	1.0	1.0
2	0.601115	0.708073	0.020584	0.969910	0.0	0.0	0.0	1.0	1.0
3	0.832443	0.212339	0.181825	0.183405	0.0	0.0	1.0	1.0	0.0
4	0.304242	0.524756	0.431945	0.291229	0.0	0.0	0.0	1.0	1.0
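Step 1 assumes that different contexts have different optimal messages. One quick way to check this, sketched below, is to count how often each message is the best one for a customer and to compare per-message reply rates. The block re-creates the dataset with the same seed and call order so it runs on its own; the names optimal_message, optimal_counts and reply_rate_per_message are illustrative choices of ours.

```python
import numpy as np

# Re-create the Step 1 simulation (same seed, same call order) so this block is self-contained
np.random.seed(42)
n_customers, n_messages, n_features = 1000, 5, 4
X = np.random.rand(n_customers, n_features)
true_coefficients = np.random.rand(n_messages, n_features)
noise = np.random.randn(n_customers, n_messages) * 0.1
logits = X @ true_coefficients.T + noise
reply_probabilities = 1 / (1 + np.exp(-logits))
y = np.random.binomial(1, reply_probabilities)

# Best message for each customer according to the noiseless linear score
optimal_message = np.argmax(X @ true_coefficients.T, axis=1)

# Number of customers for whom each message is the optimal one
optimal_counts = np.bincount(optimal_message, minlength=n_messages)

# Observed reply rate of each message across all customers
reply_rate_per_message = y.mean(axis=0)

for k in range(n_messages):
    print(f"message_{k+1}: optimal for {optimal_counts[k]} customers, "
          f"reply rate {reply_rate_per_message[k]:.3f}")
```

If a single message turns out to be optimal for nearly every customer, the problem is close to a non-contextual bandit, and the context adds little to the decision.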

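Step 2: Test the Contextual Bandit Model

Assuming your team’s model is implemented as a function contextual_bandit_predict(X), which takes customer features and returns the index of the predicted best message for each customer, we can simulate running the model on this dataset. Because the simulation records a reply outcome for every message, we can look up the reward of whichever message the model picks.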

def contextual_bandit_predict(X):
    """
    Dummy implementation for the sake of testing.
    Replace this with your team's actual model.
    """
    # This dummy reads the true coefficients, so it acts as an oracle. Picking the highest linear
    # score is the same as picking the highest reply probability, since the sigmoid is increasing
    predicted_rewards = X @ true_coefficients.T
    return np.argmax(predicted_rewards, axis=1)

# Simulate running the bandit model
predicted_messages = contextual_bandit_predict(X)

# Look up the observed reply for the message chosen for each customer
actual_rewards = y[np.arange(n_customers), predicted_messages]

# Evaluate the performance: average reward
average_reward = np.mean(actual_rewards)
print(f'Average Reward: {average_reward:.4f}')

Average Reward: 0.7440
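Before judging that number, it helps to compare it with simple baselines computed on the same data: a uniformly random policy and the best single message sent to every customer. The sketch below re-creates the Step 1 data with the same seed; the helper policy_value and the separate generator seeded with 0 for the random policy are our own illustrative choices.

```python
import numpy as np

# Re-create the Step 1 simulation (same seed, same call order) so this block is self-contained
np.random.seed(42)
n_customers, n_messages, n_features = 1000, 5, 4
X = np.random.rand(n_customers, n_features)
true_coefficients = np.random.rand(n_messages, n_features)
noise = np.random.randn(n_customers, n_messages) * 0.1
reply_probabilities = 1 / (1 + np.exp(-(X @ true_coefficients.T + noise)))
y = np.random.binomial(1, reply_probabilities)

rows = np.arange(n_customers)


def policy_value(chosen):
    """Average observed reply when customer i receives message chosen[i]."""
    return y[rows, chosen].mean()


# Oracle: best message under the true (noiseless) linear score, as in the dummy model
oracle = np.argmax(X @ true_coefficients.T, axis=1)

# Uniform random policy, drawn from a separate generator so the data above is unchanged
rng = np.random.default_rng(0)
random_choice = rng.integers(0, n_messages, size=n_customers)

results = {
    "oracle": policy_value(oracle),
    "random": policy_value(random_choice),
    "random (expected)": y.mean(),  # average over all messages = expected random-policy value
}
# Best single message sent to everyone, chosen in hindsight
results["best fixed message"] = max(
    policy_value(np.full(n_customers, k)) for k in range(n_messages)
)

for name, value in results.items():
    print(f"{name:>20}: {value:.4f}")
```

The gap between the oracle and these baselines is the most that any context-aware model can gain on this simulated data.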

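A higher average reward is better. However, a random policy does not score 0.5 on this data: because every feature and every true coefficient is non-negative, nearly all simulated reply probabilities lie above 0.5, so the relevant baseline is the reply rate obtained by choosing messages at random, which is approximately y.mean(). Also note that the dummy model uses true_coefficients directly, so 0.7440 is closer to an upper bound for this simulation than to a realistic score for a learned model.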
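To estimate what a model that must learn from its own choices might score (unlike the dummy above, which reads true_coefficients), one option is disjoint LinUCB, sketched below as a stand-in for your team's model rather than as its implementation. Each message keeps its own ridge-regression estimate of the reply rate, and each customer receives the message with the highest estimate plus an uncertainty bonus scaled by alpha. The intercept column, alpha = 1.0, and a single pass over customers in order are our assumptions.

```python
import numpy as np

# Re-create the Step 1 simulation (same seed, same call order) so this block is self-contained
np.random.seed(42)
n_customers, n_messages, n_features = 1000, 5, 4
X = np.random.rand(n_customers, n_features)
true_coefficients = np.random.rand(n_messages, n_features)
noise = np.random.randn(n_customers, n_messages) * 0.1
reply_probabilities = 1 / (1 + np.exp(-(X @ true_coefficients.T + noise)))
y = np.random.binomial(1, reply_probabilities)


def linucb_replay(X, y, alpha=1.0):
    """Run disjoint LinUCB once over the customers in order.

    Only the reward of the chosen message is revealed at each step
    (bandit feedback), even though y holds outcomes for every message.
    """
    n, n_arms = y.shape
    Xb = np.hstack([np.ones((n, 1)), X])              # intercept column (our assumption)
    d = Xb.shape[1]
    A = np.stack([np.eye(d) for _ in range(n_arms)])  # per-arm ridge Gram matrices
    b = np.zeros((n_arms, d))                         # per-arm reward-weighted feature sums
    chosen = np.empty(n, dtype=int)
    for t in range(n):
        x = Xb[t]
        scores = np.empty(n_arms)
        for a in range(n_arms):
            A_inv = np.linalg.inv(A[a])
            theta = A_inv @ b[a]                      # ridge estimate for arm a
            # Estimated reply rate plus an exploration bonus for uncertain arms
            scores[a] = theta @ x + alpha * np.sqrt(x @ A_inv @ x)
        a_t = int(np.argmax(scores))
        r = y[t, a_t]                                 # observe only the chosen message's reward
        A[a_t] += np.outer(x, x)
        b[a_t] += r * x
        chosen[t] = a_t
    return chosen


chosen = linucb_replay(X, y, alpha=1.0)
linucb_reward = y[np.arange(n_customers), chosen].mean()
print(f"LinUCB average reward (single pass): {linucb_reward:.4f}")
```

Unlike the evaluation in Step 2, this replay only looks at the reward of the message it actually chose, which is what a deployed bandit would observe. Because it spends early rounds exploring and sees only 1,000 customers, its score is best read against both the oracle and the random baseline rather than against 0.5.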