15  Beta Estimation

This chapter introduces one of the most fundamental concepts in financial economics: the exposure of an individual stock to systematic market risk. According to the Capital Asset Pricing Model (CAPM) developed by Sharpe (1964), Lintner (1965), and Mossin (1966), cross-sectional variation in expected asset returns should be determined by the covariance between an asset’s excess return and the excess return on the market portfolio. The regression coefficient that captures this relationship (commonly known as market beta) serves as the cornerstone of modern portfolio theory and remains widely used in practice for cost of capital estimation, performance attribution, and risk management.

In this chapter, we develop a complete framework for estimating market betas for Vietnamese stocks. We begin with a conceptual overview of the CAPM and its empirical implementation. We then demonstrate beta estimation using ordinary least squares regression, first for individual stocks, and then scale the procedure to the entire market using rolling-window estimation. To handle the computational demands of estimating betas for hundreds of stocks across many time periods, we introduce parallelization techniques that dramatically reduce processing time. Finally, we compare beta estimates derived from monthly versus daily returns and examine how betas vary across industries and over time in the Vietnamese market.

The chapter leverages several important computational concepts that extend beyond beta estimation itself. Rolling-window estimation is a technique applicable to any time-varying parameter, while parallelization provides a general solution for computationally intensive tasks that can be divided into independent subtasks.

15.1 Theoretical Foundation

15.1.1 The Capital Asset Pricing Model

The CAPM provides a theoretical framework linking expected returns to systematic risk. Under the model’s assumptions—including mean-variance optimizing investors, homogeneous expectations, and frictionless markets—the expected excess return on any asset \(i\) is proportional to its covariance with the market portfolio:

\[ E[r_i - r_f] = \beta_i \cdot E[r_m - r_f] \]

where \(r_i\) is the return on asset \(i\), \(r_f\) is the risk-free rate, \(r_m\) is the return on the market portfolio, and \(\beta_i\) is defined as:

\[ \beta_i = \frac{\text{Cov}(r_i, r_m)}{\text{Var}(r_m)} \]

The market beta \(\beta_i\) measures the sensitivity of asset \(i\)’s returns to market movements. A beta greater than one indicates the asset amplifies market movements, while a beta less than one indicates dampened sensitivity. A beta of zero would imply no systematic risk exposure, leaving only idiosyncratic risk that can be diversified away.
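This definition can be checked numerically: with simulated data, the ratio of sample covariance to market variance coincides with the least-squares slope of asset returns on market returns. A minimal sketch using synthetic returns (the true beta of 1.5 is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000
r_m = rng.normal(0.01, 0.05, n)           # simulated market excess returns
r_i = 1.5 * r_m + rng.normal(0, 0.02, n)  # asset with true beta of 1.5

# Beta as covariance over variance (matching ddof conventions)
beta_cov = np.cov(r_i, r_m)[0, 1] / np.var(r_m, ddof=1)

# Beta as the slope of a least-squares fit
beta_ols = np.polyfit(r_m, r_i, deg=1)[0]

print(round(beta_cov, 3), round(beta_ols, 3))  # both close to 1.5
```

The two numbers agree up to floating-point error, since the OLS slope is algebraically identical to the covariance-variance ratio.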

15.1.2 Empirical Implementation

In practice, we estimate beta by regressing excess stock returns on excess market returns:

\[ r_{i,t} - r_{f,t} = \alpha_i + \beta_i(r_{m,t} - r_{f,t}) + \varepsilon_{i,t} \tag{15.1}\]

where \(\alpha_i\) represents abnormal return (Jensen’s alpha), \(\beta_i\) is the market beta we seek to estimate, and \(\varepsilon_{i,t}\) is the idiosyncratic error term. Under the CAPM, \(\alpha_i\) should equal zero for all assets—any non-zero alpha represents a deviation from the model’s predictions.

Several practical considerations affect beta estimation:

  1. Estimation Window: Longer windows provide more observations and thus more precise estimates, but may include outdated information if betas change over time. Common choices range from 36 to 60 months for monthly data.

  2. Return Frequency: Monthly returns reduce noise but provide fewer observations. Daily returns offer more data points but may introduce microstructure effects and non-synchronous trading biases.

  3. Market Proxy: The theoretical market portfolio includes all assets, but in practice we use a broad equity index. For Vietnam, we use the value-weighted market return constructed from our stock universe.

  4. Minimum Observations: Requiring a minimum number of observations (e.g., 48 out of 60 months) helps avoid unreliable estimates from sparse data.

15.2 Setting Up the Environment

We begin by loading the necessary Python packages. The core packages handle data manipulation, statistical modeling, and database operations. We also import parallelization tools that will be essential when scaling our estimation to the full market.

import pandas as pd
import numpy as np
import sqlite3
import statsmodels.formula.api as smf
from scipy.stats.mstats import winsorize

from plotnine import *
from mizani.formatters import percent_format, comma_format
from joblib import Parallel, delayed, cpu_count
from dateutil.relativedelta import relativedelta

We connect to our SQLite database containing the processed Vietnamese financial data from previous chapters.

tidy_finance = sqlite3.connect(database="data/tidy_finance_python.sqlite")

15.3 Loading and Preparing Data

15.3.1 Stock Returns Data

We load the monthly stock returns data prepared in the Datacore chapter. The data includes excess returns (returns minus the risk-free rate) for all Vietnamese listed stocks.

prices_monthly = pd.read_sql_query(
    sql="""
        SELECT symbol, date, ret_excess 
        FROM prices_monthly
    """,
    con=tidy_finance,
    parse_dates={"date"}
)

# Add year for merging with fundamentals
prices_monthly["year"] = prices_monthly["date"].dt.year

print(f"Loaded {len(prices_monthly):,} monthly observations")
print(f"Covering {prices_monthly['symbol'].nunique():,} unique stocks")
print(f"Date range: {prices_monthly['date'].min():%Y-%m} to {prices_monthly['date'].max():%Y-%m}")
Loaded 209,495 monthly observations
Covering 1,837 unique stocks
Date range: 2010-01 to 2025-05

prices_daily = pd.read_sql_query(
    sql="""
        SELECT symbol, date, ret_excess 
        FROM prices_daily
    """,
    con=tidy_finance,
    parse_dates={"date"}
)

15.3.2 Company Information

We load company information to enable industry-level analysis of beta estimates.

comp_vn = pd.read_sql_query(
    sql="""
        SELECT symbol, datadate, icb_name_vi 
        FROM comp_vn
    """,
    con=tidy_finance,
    parse_dates={"datadate"}
)

# Extract year for merging
comp_vn["year"] = comp_vn["datadate"].dt.year

print(f"Company data: {comp_vn['symbol'].nunique():,} firms")
Company data: 1,502 firms

15.3.3 Market Excess Returns

For the market portfolio proxy, we use the value-weighted market excess return. If you have constructed Fama-French factors in a previous chapter, load them here. Otherwise, we can construct a simple market return from our stock data.

# Option 1: Load pre-computed market factor
factors_ff3_monthly = pd.read_sql_query(
    sql="SELECT date, mkt_excess FROM factors_ff3_monthly",
    con=tidy_finance,
    parse_dates={"date"}
)

# Option 2: Construct market return from stock data (if factors not available)
# This computes the value-weighted average return across all stocks
def compute_market_return(prices_df):
    """
    Compute value-weighted market return from individual stock returns.
    
    Parameters
    ----------
    prices_df : pd.DataFrame
        Stock returns with mktcap_lag for weighting
        
    Returns
    -------
    pd.DataFrame
        Monthly market excess returns
    """
    market_return = (prices_df
        .groupby("date")
        .apply(lambda x: np.average(x["ret_excess"], weights=x["mktcap_lag"]))
        .reset_index(name="mkt_excess")
    )
    return market_return
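The weighting logic can be checked on a toy example with three hypothetical stocks in a single month (all numbers are invented for illustration); the largest stock dominates the average:

```python
import numpy as np
import pandas as pd

# Three hypothetical stocks in one month, with lagged market caps as weights
toy = pd.DataFrame({
    "date": ["2024-01-31"] * 3,
    "ret_excess": [0.02, -0.01, 0.05],
    "mktcap_lag": [100.0, 300.0, 600.0],
})

# Same groupby logic as compute_market_return above
mkt = (toy
    .groupby("date")
    .apply(lambda x: np.average(x["ret_excess"], weights=x["mktcap_lag"]))
    .reset_index(name="mkt_excess")
)

print(mkt)  # mkt_excess = (2 - 3 + 30) / 1000 = 0.029
```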

15.3.4 Merging Datasets

We combine the stock returns with market returns and company information to create our estimation dataset.

# Merge stock returns with market returns
prices_monthly = prices_monthly.merge(
    factors_ff3_monthly, 
    on="date", 
    how="left"
)

# Merge with company information for industry classification
prices_monthly = prices_monthly.merge(
    comp_vn[["symbol", "year", "icb_name_vi"]], 
    on=["symbol", "year"], 
    how="left"
)

# Remove observations with missing data
prices_monthly = prices_monthly.dropna(subset=["ret_excess", "mkt_excess"])

print(f"Final estimation sample: {len(prices_monthly):,} observations")
Final estimation sample: 169,983 observations

15.3.5 Handling Outliers

Extreme returns can unduly influence regression estimates. We apply winsorization to limit the impact of outliers while preserving the general distribution of returns. Winsorization at the 1% level replaces values below the 1st percentile with the 1st percentile value, and values above the 99th percentile with the 99th percentile value.
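A toy example makes the mechanics concrete: winsorizing the integers 1 through 100 at the 1% level clips exactly one value in each tail, replacing it with the nearest surviving value.

```python
import numpy as np
from scipy.stats.mstats import winsorize

x = np.arange(1, 101, dtype=float)      # 1, 2, ..., 100
w = winsorize(x, limits=(0.01, 0.01))   # clip lowest and highest 1%

print(w.min(), w.max())  # 2.0 99.0
```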

def winsorize_returns(df, columns, limits=(0.01, 0.01)):
    """
    Apply winsorization to return columns to limit outlier influence.
    
    Parameters
    ----------
    df : pd.DataFrame
        DataFrame containing return columns
    columns : list
        Column names to winsorize
    limits : tuple
        Lower and upper percentile limits for winsorization
        
    Returns
    -------
    pd.DataFrame
        DataFrame with winsorized columns
    """
    df = df.copy()
    for col in columns:
        df[col] = winsorize(df[col], limits=limits)
    return df

prices_monthly = winsorize_returns(
    prices_monthly, 
    columns=["ret_excess", "mkt_excess"],
    limits=(0.01, 0.01)
)

print("Return distributions after winsorization:")
print(prices_monthly[["ret_excess", "mkt_excess"]].describe().round(4))
Return distributions after winsorization:
        ret_excess   mkt_excess
count  169983.0000  169983.0000
mean        0.0011      -0.0102
std         0.1548       0.0579
min        -0.4078      -0.1794
25%        -0.0700      -0.0384
50%        -0.0033      -0.0084
75%         0.0531       0.0219
max         0.6117       0.1221

15.4 Estimating Beta for Individual Stocks

15.4.1 Single Stock Example

Before scaling to the full market, we demonstrate beta estimation for a single well-known Vietnamese stock. We use Vingroup (VIC), one of the largest conglomerates in Vietnam with significant exposure to real estate, retail, and automotive sectors.

# Filter data for Vingroup
vic_data = prices_monthly.query("symbol == 'VIC'").copy()

print(f"VIC observations: {len(vic_data)}")
print(f"Date range: {vic_data['date'].min():%Y-%m} to {vic_data['date'].max():%Y-%m}")
VIC observations: 150
Date range: 2011-07 to 2023-12

We estimate the CAPM regression using ordinary least squares via the statsmodels package. The formula interface provides a convenient way to specify regression models.

# Estimate CAPM for Vingroup
model_vic = smf.ols(
    formula="ret_excess ~ mkt_excess",
    data=vic_data
).fit()

# Display regression results
print(model_vic.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:             ret_excess   R-squared:                       0.153
Model:                            OLS   Adj. R-squared:                  0.147
Method:                 Least Squares   F-statistic:                     26.67
Date:                Sat, 14 Feb 2026   Prob (F-statistic):           7.66e-07
Time:                        07:51:19   Log-Likelihood:                 131.96
No. Observations:                 150   AIC:                            -259.9
Df Residuals:                     148   BIC:                            -253.9
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -0.0075      0.008     -0.895      0.372      -0.024       0.009
mkt_excess     0.7503      0.145      5.164      0.000       0.463       1.037
==============================================================================
Omnibus:                       39.111   Durbin-Watson:                   2.039
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              107.620
Skew:                          -1.015   Prob(JB):                     4.27e-24
Kurtosis:                       6.619   Cond. No.                         17.6
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The regression output provides several important pieces of information:

  • Beta (mkt_excess coefficient): The estimated market sensitivity. VIC’s estimate of about 0.75 is below one, indicating dampened sensitivity to market movements.
  • Alpha (Intercept): The abnormal return not explained by market exposure. Under CAPM, this should be zero.
  • R-squared: The proportion of return variation explained by market movements.
  • t-statistics: Test whether coefficients differ significantly from zero.

# Extract key estimates
coefficients = model_vic.summary2().tables[1]

print("\nKey estimates for Vingroup (VIC):")
print(f"  Beta:  {coefficients.loc['mkt_excess', 'Coef.']:.3f}")
print(f"  Alpha: {coefficients.loc['Intercept', 'Coef.']:.4f}")
print(f"  R²:    {model_vic.rsquared:.3f}")

Key estimates for Vingroup (VIC):
  Beta:  0.750
  Alpha: -0.0075
  R²:    0.153

15.4.2 CAPM Estimation Function

We create a reusable function that estimates the CAPM and returns results in a standardized format. The function includes a minimum observations requirement to avoid unreliable estimates from sparse data.

def estimate_capm(data, min_obs=48):
    """
    Estimate CAPM regression and return coefficients.
    
    This function regresses excess stock returns on excess market returns
    and extracts the coefficient estimates along with t-statistics.
    
    Parameters
    ----------
    data : pd.DataFrame
        DataFrame with 'ret_excess' and 'mkt_excess' columns
    min_obs : int
        Minimum number of observations required for estimation
        
    Returns
    -------
    pd.DataFrame
        DataFrame with coefficient estimates and t-statistics,
        or empty DataFrame if insufficient observations
    """
    if len(data) < min_obs:
        return pd.DataFrame()
    
    try:
        # Estimate OLS regression
        model = smf.ols(
            formula="ret_excess ~ mkt_excess", 
            data=data
        ).fit()
        
        # Extract coefficient table
        coef_table = model.summary2().tables[1]
        
        # Format results
        results = pd.DataFrame({
            "coefficient": ["alpha", "beta"],
            "estimate": [
                coef_table.loc["Intercept", "Coef."],
                coef_table.loc["mkt_excess", "Coef."]
            ],
            "t_statistic": [
                coef_table.loc["Intercept", "t"],
                coef_table.loc["mkt_excess", "t"]
            ],
            "r_squared": model.rsquared
        })
        
        return results
        
    except Exception:
        # Return empty DataFrame if estimation fails
        return pd.DataFrame()
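To sanity-check this regression setup, one can simulate returns with a known beta and confirm that the fitted slope recovers it (synthetic data; the true beta of 1.3 and the noise levels are arbitrary choices for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 60  # five years of monthly observations

# Simulate market returns and an asset with true beta of 1.3 and zero alpha
sim = pd.DataFrame({"mkt_excess": rng.normal(0.005, 0.05, n)})
sim["ret_excess"] = 1.3 * sim["mkt_excess"] + rng.normal(0, 0.03, n)

fit = smf.ols("ret_excess ~ mkt_excess", data=sim).fit()
beta_hat = fit.params["mkt_excess"]
print(round(beta_hat, 2))  # close to the true beta of 1.3
```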

15.5 Rolling-Window Estimation

15.5.1 Motivation for Rolling Windows

Stock betas are not constant over time. A company’s business mix, leverage, and operating environment evolve, causing its systematic risk exposure to change. To capture this time variation, we use rolling-window estimation: at each point in time, we estimate beta using only data from a fixed lookback period (e.g., the past 60 months).

Rolling-window estimation involves a trade-off:

  • Longer windows provide more observations and thus more precise estimates, but may include stale information.
  • Shorter windows are more responsive to changes but produce noisier estimates.

A common choice in academic research is 60 months (5 years) of monthly data, requiring at least 48 valid observations for estimation.

15.5.2 Rolling Window Implementation

The following function implements rolling-window CAPM estimation. For each month in the sample, it looks back over the specified window and estimates beta using all available data within that window.

def roll_capm_estimation(data, look_back=60, min_obs=48):
    """
    Perform rolling-window CAPM estimation.
    
    This function slides a window across time, estimating the CAPM
    regression at each point using the most recent 'look_back' months
    of data.
    
    Parameters
    ----------
    data : pd.DataFrame
        DataFrame with 'date', 'ret_excess', and 'mkt_excess' columns
    look_back : int
        Number of months in the estimation window
    min_obs : int
        Minimum observations required within each window
        
    Returns
    -------
    pd.DataFrame
        Time series of coefficient estimates with dates
    """
    # Ensure data is sorted by date
    data = data.sort_values("date").copy()
    
    # Get unique dates
    dates = data["date"].drop_duplicates().sort_values()
    
    # Container for results
    results = []
    
    # Slide window across dates
    for i in range(look_back - 1, len(dates)):
        # Define window boundaries
        end_date = dates.iloc[i]
        start_date = end_date - relativedelta(months=look_back - 1)
        
        # Extract data within window
        window_data = data.query("date >= @start_date and date <= @end_date")
        
        # Estimate CAPM for this window
        window_results = estimate_capm(window_data, min_obs=min_obs)
        
        if not window_results.empty:
            window_results["date"] = end_date
            results.append(window_results)
    
    # Combine all results
    if results:
        return pd.concat(results, ignore_index=True)
    else:
        return pd.DataFrame()
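The window arithmetic deserves a quick check: subtracting look_back − 1 months from the window’s end date yields a window spanning exactly look_back calendar months, inclusive of both endpoints.

```python
import pandas as pd
from dateutil.relativedelta import relativedelta

end_date = pd.Timestamp("2023-12-31")
start_date = end_date - relativedelta(months=60 - 1)

# January 2019 through December 2023 is exactly 60 calendar months
print(start_date.date())  # 2019-01-31
```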

15.5.3 Example: Rolling Betas for Selected Stocks

We demonstrate rolling-window estimation for several well-known Vietnamese stocks spanning different industries.

# Define example stocks
examples = pd.DataFrame({
    "symbol": ["FPT", "VNM", "VIC", "HPG", "VCB"],
    "company": [
        "FPT Corporation",      # Technology
        "Vinamilk",             # Consumer goods
        "Vingroup",             # Real estate/conglomerate
        "Hoa Phat Group",       # Steel/materials
        "Vietcombank"           # Banking
    ]
})

# Check data availability for each example
data_availability = (prices_monthly
    .query("symbol in @examples['symbol']")
    .groupby("symbol")
    .agg(
        n_obs=("date", "count"),
        first_date=("date", "min"),
        last_date=("date", "max")
    )
    .reset_index()
)

print("Data availability for example stocks:")
print(data_availability)
Data availability for example stocks:
  symbol  n_obs first_date  last_date
0    FPT    150 2011-07-31 2023-12-31
1    HPG    150 2011-07-31 2023-12-31
2    VCB    150 2011-07-31 2023-12-31
3    VIC    150 2011-07-31 2023-12-31
4    VNM    150 2011-07-31 2023-12-31

# Estimate rolling betas for example stocks
example_data = prices_monthly.query("symbol in @examples['symbol']")

capm_examples = (example_data
    .groupby("symbol", group_keys=True)
    .apply(lambda x: roll_capm_estimation(x), include_groups=False)
    .reset_index()
    .drop(columns="level_1", errors="ignore")
)

# Filter to beta estimates only
beta_examples = (capm_examples
    .query("coefficient == 'beta'")
    .merge(examples, on="symbol")
)

print(f"Rolling beta estimates: {len(beta_examples):,} observations")
Rolling beta estimates: 455 observations

15.5.4 Visualizing Rolling Betas

Figure 15.1 displays the time series of beta estimates for our example stocks. The figure reveals how systematic risk exposure evolves differently across industries.

rolling_beta_figure = (
    ggplot(
        beta_examples,
        aes(x="date", y="estimate", color="company")
    )
    + geom_line(size=0.8)
    + geom_hline(yintercept=1, linetype="dashed", color="gray", alpha=0.7)
    + labs(
        x="",
        y="Beta",
        color="",
        title="Rolling Beta Estimates (60-Month Window)"
    )
    + scale_x_datetime(date_breaks="2 years", date_labels="%Y")
    + theme_minimal()
    + theme(legend_position="bottom")
)
rolling_beta_figure.show()
Line chart showing time series of beta estimates for five Vietnamese stocks from different industries.
Figure 15.1: Monthly rolling beta estimates for selected Vietnamese stocks using a 60-month estimation window. Different industries exhibit distinct patterns of market sensitivity over time.

Several patterns emerge from the figure:

  1. Industry differences: Technology and banking stocks may exhibit different beta patterns than real estate or consumer goods companies.

  2. Time variation: Betas are not constant. They respond to changes in business conditions, leverage, and market regimes.

  3. Crisis periods: Episodes of market stress (e.g., the 2020 COVID-19 shock) often see beta estimates change as correlations across stocks increase.

15.6 Parallelized Estimation for the Full Market

15.6.1 The Computational Challenge

Estimating rolling betas for all stocks in our database is computationally intensive. With hundreds of stocks, each requiring rolling estimation across many time periods, sequential processing would take considerable time. Fortunately, beta estimation for different stocks is independent (i.e., the estimate for stock A does not depend on the estimate for stock B). This independence makes the problem ideal for parallelization.
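The pattern we use below can be seen in miniature: Parallel distributes independent calls across workers, delayed defers each call until a worker picks it up, and results come back in the order the inputs were submitted. A minimal example (the square function is a stand-in for per-stock estimation):

```python
from joblib import Parallel, delayed

def square(x):
    # A stand-in for per-stock estimation: each call is independent
    return x * x

# Two workers split five independent tasks; result order matches input order
results = Parallel(n_jobs=2)(delayed(square)(x) for x in range(5))
print(results)  # [0, 1, 4, 9, 16]
```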

15.6.2 Setting Up Parallel Processing

We use the joblib library to distribute computation across multiple CPU cores. The Parallel class manages worker processes, while delayed wraps function calls for deferred execution.

# Determine available cores (reserve one for system operations)
n_cores = max(1, cpu_count() - 1)
print(f"Available cores for parallel processing: {n_cores}")
Available cores for parallel processing: 3

15.6.3 Parallel Beta Estimation

The following code estimates rolling betas for all stocks in parallel. Each stock is processed independently by a separate worker.

def estimate_all_betas_parallel(data, n_cores, look_back=60, min_obs=48):
    """
    Estimate rolling betas for all stocks using parallel processing.
    
    Parameters
    ----------
    data : pd.DataFrame
        Full dataset with all stocks
    n_cores : int
        Number of CPU cores to use
    look_back : int
        Months in estimation window
    min_obs : int
        Minimum observations required
        
    Returns
    -------
    pd.DataFrame
        Beta estimates for all stocks and dates
    """
    # Group data by stock
    grouped = data.groupby("symbol", group_keys=False)
    
    # Define worker function
    def process_stock(name, group):
        result = roll_capm_estimation(group, look_back=look_back, min_obs=min_obs)
        if not result.empty:
            result["symbol"] = name
        return result
    
    # Execute in parallel
    results = Parallel(n_jobs=n_cores, verbose=1)(
        delayed(process_stock)(name, group) 
        for name, group in grouped
    )
    
    # Combine results
    results = [r for r in results if not r.empty]
    if results:
        return pd.concat(results, ignore_index=True)
    else:
        return pd.DataFrame()

# Estimate betas for all stocks
print("Estimating rolling betas for all stocks...")
capm_monthly = estimate_all_betas_parallel(
    prices_monthly, 
    n_cores=n_cores,
    look_back=60,
    min_obs=48
)

print(f"\nCompleted: {len(capm_monthly):,} coefficient estimates")
print(f"Unique stocks: {capm_monthly['symbol'].nunique():,}")

15.6.4 Storing Results

We save the CAPM estimates to our database for use in subsequent chapters.

capm_monthly.to_sql(
    name="capm_monthly",
    con=tidy_finance,
    if_exists="replace",
    index=False
)

print("CAPM estimates saved to database.")

For subsequent analysis, we load the pre-computed estimates:

capm_monthly = pd.read_sql_query(
    sql="SELECT * FROM capm_monthly",
    con=tidy_finance,
    parse_dates={"date"}
)

print(f"Loaded {len(capm_monthly):,} CAPM estimates")
Loaded 161,580 CAPM estimates

15.7 Beta Estimation Using Daily Returns

While monthly returns are standard in academic research, some applications benefit from higher-frequency data:

  • Shorter estimation windows: Daily data allows meaningful estimation over shorter periods (e.g., 3 months rather than 5 years).
  • More responsive estimates: Daily betas capture changes more quickly.
  • Event studies: High-frequency betas are useful for analyzing market reactions to specific events.

However, daily data introduces additional challenges:

  • Microstructure noise: Bid-ask bounce and other trading frictions add noise to returns.
  • Non-synchronous trading: Less liquid stocks may not trade every day, biasing beta estimates downward.
  • Computational burden: Daily data is roughly 21 times larger than monthly data.

15.7.1 Batch Processing for Daily Data

Given the size of daily data, we process stocks in batches to manage memory constraints. This approach loads and processes a subset of stocks, saves results, and proceeds to the next batch.

def compute_market_return_daily(tidy_finance):
    """
    Compute daily value-weighted market excess return from stock data.
    """
    # Load daily prices with market cap for weighting
    prices_daily_full = pd.read_sql_query(
        sql="""
            SELECT p.symbol, p.date, p.ret_excess, m.mktcap_lag
            FROM prices_daily p
            LEFT JOIN prices_monthly m ON p.symbol = m.symbol 
                AND strftime('%Y-%m', p.date) = strftime('%Y-%m', m.date)
        """,
        con=tidy_finance,
        parse_dates={"date"}
    )
    
    # Compute value-weighted market return each day
    mkt_daily = (prices_daily_full
        .dropna(subset=["ret_excess", "mktcap_lag"])
        .groupby("date")
        .apply(lambda x: np.average(x["ret_excess"], weights=x["mktcap_lag"]))
        .reset_index(name="mkt_excess")
    )
    
    return mkt_daily


def roll_capm_estimation_daily(data, look_back_days=1260, min_obs=1000):
    """
    Perform rolling-window CAPM estimation using daily data.
    
    Parameters
    ----------
    data : pd.DataFrame
        DataFrame with 'date', 'ret_excess', and 'mkt_excess' columns
    look_back_days : int
        Number of trading days in the estimation window
    min_obs : int
        Minimum daily observations required within each window
        
    Returns
    -------
    pd.DataFrame
        Time series of coefficient estimates with dates
    """
    data = data.sort_values("date").copy()
    dates = data["date"].drop_duplicates().sort_values().reset_index(drop=True)
    
    results = []
    
    for i in range(look_back_days - 1, len(dates)):
        end_date = dates.iloc[i]
        start_idx = max(0, i - look_back_days + 1)
        start_date = dates.iloc[start_idx]
        
        window_data = data.query("date >= @start_date and date <= @end_date")
        window_results = estimate_capm(window_data, min_obs=min_obs)
        
        if not window_results.empty:
            window_results["date"] = end_date
            results.append(window_results)
    
    if results:
        return pd.concat(results, ignore_index=True)
    else:
        return pd.DataFrame()


def estimate_daily_betas_batch(symbols, tidy_finance, n_cores, batch_size=500, 
                                look_back_days=1260, min_obs=1000):
    """
    Estimate rolling betas from daily data using batch processing.
    """
    # First, compute or load market return
    print("Computing daily market excess returns...")
    mkt_daily = compute_market_return_daily(tidy_finance)
    print(f"Market returns: {len(mkt_daily)} days")
    
    n_batches = int(np.ceil(len(symbols) / batch_size))
    all_results = []
    
    for j in range(n_batches):
        batch_start = j * batch_size
        batch_end = min((j + 1) * batch_size, len(symbols))
        batch_symbols = symbols[batch_start:batch_end]
        
        symbol_list = ", ".join(f"'{s}'" for s in batch_symbols)
        
        query = f"""
            SELECT symbol, date, ret_excess
            FROM prices_daily
            WHERE symbol IN ({symbol_list})
        """
        
        prices_daily_batch = pd.read_sql_query(
            sql=query,
            con=tidy_finance,
            parse_dates={"date"}
        )
        
        # Merge with market excess return
        prices_daily_batch = prices_daily_batch.merge(
            mkt_daily, 
            on="date", 
            how="inner"
        )
        
        # Group by symbol and estimate betas
        grouped = prices_daily_batch.groupby("symbol", group_keys=False)
        
        # Parallel estimation
        batch_results = Parallel(n_jobs=n_cores)(
            delayed(lambda name, group: 
                roll_capm_estimation_daily(group, look_back_days=look_back_days, min_obs=min_obs)
                .assign(symbol=name)
            )(name, group)
            for name, group in grouped
        )
        
        batch_results = [r for r in batch_results if r is not None and not r.empty]
        
        if batch_results:
            all_results.append(pd.concat(batch_results, ignore_index=True))
        
        print(f"Batch {j+1}/{n_batches} complete")
    
    if all_results:
        return pd.concat(all_results, ignore_index=True)
    else:
        return pd.DataFrame()

symbols = prices_monthly["symbol"].unique().tolist()

capm_daily = estimate_daily_betas_batch(
    symbols=symbols,
    tidy_finance=tidy_finance,
    n_cores=n_cores,
    batch_size=500,
    look_back_days=1260,  # ~5 years of trading days
    min_obs=1000
)

print(f"Daily beta estimates: {len(capm_daily):,}")

capm_daily.to_sql(
    name="capm_daily",
    con=tidy_finance,
    if_exists="replace",
    index=False
)

print("CAPM estimates saved to database.")

For subsequent analysis, we load the pre-computed estimates:

capm_daily = pd.read_sql_query(
    sql="SELECT * FROM capm_daily",
    con=tidy_finance,
    parse_dates={"date"}
)

print(f"Loaded {len(capm_daily):,} CAPM estimates")
Loaded 3,394,490 CAPM estimates

15.8 Analyzing Beta Estimates

15.8.1 Extracting Beta Estimates

We extract the beta coefficient estimates from our CAPM results for analysis.

# Extract monthly betas
beta_monthly = (capm_monthly
    .query("coefficient == 'beta'")
    .rename(columns={"estimate": "beta"})
    [["symbol", "date", "beta"]]
    .assign(frequency="monthly")
)

# Save to database
beta_monthly.to_sql(
    name="beta_monthly",
    con=tidy_finance,
    if_exists="replace",
    index=False
)

print(f"Monthly betas: {len(beta_monthly):,} observations")
print(f"Unique stocks: {beta_monthly['symbol'].nunique():,}")
Monthly betas: 80,790 observations
Unique stocks: 1,383

# Load pre-computed betas
beta_monthly = pd.read_sql_query(
    sql="SELECT * FROM beta_monthly",
    con=tidy_finance,
    parse_dates={"date"}
)

15.8.2 Summary Statistics

We examine the distribution of beta estimates to verify their reasonableness.

print("Beta Summary Statistics:")
print(beta_monthly["beta"].describe().round(3))

# Additional diagnostics
print(f"\nStocks with negative average beta: {(beta_monthly.groupby('symbol')['beta'].mean() < 0).sum()}")
print(f"Stocks with beta > 2: {(beta_monthly.groupby('symbol')['beta'].mean() > 2).sum()}")
Beta Summary Statistics:
count    80790.000
mean         0.501
std          0.539
min         -1.345
25%          0.130
50%          0.447
75%          0.832
max          2.678
Name: beta, dtype: float64

Stocks with negative average beta: 177
Stocks with beta > 2: 5

15.8.3 Beta Distribution Across Industries

Different industries have different exposures to systematic market risk based on their business models, operating leverage, and financial leverage. Figure 15.2 shows the distribution of firm-level average betas across Vietnamese industries.

# Merge betas with industry information
beta_with_industry = (beta_monthly
    .merge(
        prices_monthly[["symbol", "date", "icb_name_vi"]].drop_duplicates(),
        on=["symbol", "date"],
        how="left"
    )
    .dropna(subset=["icb_name_vi"])
)

# Compute firm-level average beta by industry
beta_by_industry = (beta_with_industry
    .groupby(["icb_name_vi", "symbol"])["beta"]
    .mean()
    .reset_index()
)

# Order industries by median beta
industry_order = (beta_by_industry
    .groupby("icb_name_vi")["beta"]
    .median()
    .sort_values()
    .index.tolist()
)

# Select top 10 industries by number of firms for clearer visualization
top_industries = (beta_by_industry
    .groupby("icb_name_vi")
    .size()
    .nlargest(10)
    .index.tolist()
)

beta_by_industry_filtered = beta_by_industry.query("icb_name_vi in @top_industries")
beta_industry_figure = (
    ggplot(
        beta_by_industry_filtered,
        aes(x="icb_name_vi", y="beta")
    )
    + geom_boxplot(fill="steelblue", alpha=0.7)
    + geom_hline(yintercept=1, linetype="dashed", color="red", alpha=0.7)
    + coord_flip()
    + labs(
        x="",
        y="Beta",
        title="Beta Distribution by Industry"
    )
    + theme_minimal()
)
beta_industry_figure.show()
Box plots showing beta distributions by industry, ordered by median beta.
Figure 15.2: Distribution of firm-level average betas across Vietnamese industries. Box plots show the median, interquartile range, and outliers for each industry.

15.8.4 Time Variation in Cross-Sectional Beta Distribution

Betas vary not only across stocks but also over time. Figure 15.3 shows how the cross-sectional distribution of betas has evolved in the Vietnamese market.

# Compute monthly quantiles (round before casting to int: np.arange produces
# values like 0.6 stored as 0.5999..., which would otherwise truncate to 59)
beta_quantiles = (beta_monthly
    .groupby("date")["beta"]
    .quantile(q=np.arange(0.1, 1.0, 0.1))
    .reset_index()
    .rename(columns={"level_1": "quantile"})
    .assign(quantile=lambda x: (x["quantile"] * 100).round().astype(int).astype(str) + "%")
)

beta_quantiles_figure = (
    ggplot(
        beta_quantiles,
        aes(x="date", y="beta", color="quantile")
    )
    + geom_line(alpha=0.8)
    + geom_hline(yintercept=1, linetype="dashed", color="gray")
    + labs(
        x="",
        y="Beta",
        color="Quantile",
        title="Cross-Sectional Distribution of Betas Over Time"
    )
    + scale_x_datetime(date_breaks="2 years", date_labels="%Y")
    + theme_minimal()
)
beta_quantiles_figure.show()
Line chart showing time series of beta deciles, illustrating how the distribution of betas has changed over time.
Figure 15.3: Monthly quantiles of beta estimates over time. Each line represents a decile of the cross-sectional beta distribution.

The figure reveals several interesting patterns:

  1. Level shifts: The entire distribution of betas can shift over time, reflecting changes in market-wide correlation.

  2. Dispersion changes: During market stress, the spread between high and low beta stocks may change as correlations move.

  3. Trends: Some periods show trending behavior in betas, possibly reflecting structural changes in the economy.
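
The dispersion point can be made concrete by tracking the 90%-10% quantile spread of the cross-section each month. A sketch on simulated betas (the panel, dates, and parameters below are illustrative; the chapter's `beta_monthly` frame could be substituted for `panel`):

```python
import numpy as np
import pandas as pd

# Simulated panel of monthly betas (hypothetical data): 200 stocks per month
rng = np.random.default_rng(0)
dates = pd.date_range("2015-01-01", periods=60, freq="MS")
panel = pd.DataFrame({
    "date": np.repeat(dates, 200),
    "beta": rng.normal(0.5, 0.5, size=60 * 200),
})

# Cross-sectional dispersion as the 90%-10% quantile spread per month
dispersion = (panel
    .groupby("date")["beta"]
    .agg(q10=lambda s: s.quantile(0.10), q90=lambda s: s.quantile(0.90))
    .assign(spread=lambda x: x["q90"] - x["q10"])
)
print(dispersion["spread"].describe().round(3))
```

Plotting the resulting spread series over time makes compressions and expansions of the beta distribution easy to spot.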

15.8.5 Coverage Analysis

We verify that our estimation procedure produces reasonable coverage across the sample. Figure 15.4 shows the fraction of stocks with available beta estimates over time.

# Count stocks with and without betas
coverage = (prices_monthly
    .groupby("date")["symbol"]
    .nunique()
    .reset_index(name="total_stocks")
    .merge(
        beta_monthly.groupby("date")["symbol"].nunique().reset_index(name="with_beta"),
        on="date",
        how="left"
    )
    .fillna(0)
    .assign(coverage=lambda x: x["with_beta"] / x["total_stocks"])
)

coverage_figure = (
    ggplot(coverage, aes(x="date", y="coverage"))
    + geom_line(color="steelblue", size=1)
    + labs(
        x="",
        y="Share with Beta Estimate",
        title="Beta Estimation Coverage Over Time"
    )
    + scale_y_continuous(labels=percent_format(), limits=(0, 1))
    + scale_x_datetime(date_breaks="2 years", date_labels="%Y")
    + theme_minimal()
)
coverage_figure.show()
Line chart showing the percentage of stocks with beta estimates available each month.
Figure 15.4: Share of stocks with available beta estimates over time. Coverage increases as more stocks accumulate sufficient return history.

Coverage is lower in early years because stocks need sufficient return history (at least 48 months) before their betas can be estimated. As the market matures and stocks accumulate longer histories, coverage approaches 100%.
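
The mechanics behind this coverage pattern can be illustrated with a toy panel in which a stock only becomes eligible once it has accumulated 48 months of history (the symbols and dates below are hypothetical):

```python
import pandas as pd

# Toy panel: one seasoned stock and one recent listing (hypothetical data)
dates = pd.date_range("2016-01-01", periods=60, freq="MS")
panel = pd.concat([
    pd.DataFrame({"symbol": "OLD", "date": dates}),
    pd.DataFrame({"symbol": "NEW", "date": dates[-12:]}),  # listed 12 months ago
], ignore_index=True)

# A stock becomes eligible once it has accumulated 48 months of returns
panel["history"] = panel.groupby("symbol").cumcount() + 1
panel["eligible"] = panel["history"] >= 48

# Share of listed stocks with sufficient history, by month
coverage_share = panel.groupby("date")["eligible"].mean()
print(coverage_share.iloc[[0, 47, 59]])
```

Coverage starts at zero, reaches 100% once the seasoned stock passes the 48-month mark, and drops again when the new listing enters the panel without sufficient history.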

15.9 Comparing Monthly and Daily Beta Estimates

When both monthly and daily beta estimates are available, we can compare them to understand how estimation frequency affects results.

# Combine monthly and daily estimates
beta_daily = (capm_daily
    .query("coefficient == 'beta'")
    .rename(columns={"estimate": "beta"})
    [["symbol", "date", "beta"]]
    .assign(frequency="daily")
)

beta_combined = pd.concat([beta_monthly, beta_daily], ignore_index=True)
# Filter to example stocks
beta_comparison = (beta_combined
    .merge(examples, on="symbol")
    .query("symbol in ['VIC', 'FPT']")  # Select two for clarity
)

comparison_figure = (
    ggplot(
        beta_comparison,
        aes(x="date", y="beta", color="frequency", linetype="frequency")
    )
    + geom_line(size=0.8)
    + facet_wrap("~company", ncol=1)
    + labs(
        x="",
        y="Beta",
        color="Data Frequency",
        linetype="Data Frequency",
        title="Monthly vs Daily Beta Estimates"
    )
    + scale_x_datetime(date_breaks="2 years", date_labels="%Y")
    + theme_minimal()
    + theme(legend_position="bottom")
)
comparison_figure.show()
Line chart comparing monthly and daily beta estimates over time for example stocks.
Figure 15.5: Comparison of beta estimates using monthly versus daily returns for selected stocks. Daily estimates are smoother due to more observations per estimation window.

The comparison reveals that daily-based estimates are generally smoother due to the larger number of observations in each window. However, the level and trend of estimates are similar across frequencies, providing validation that both approaches capture the same underlying systematic risk exposure.
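
The smoothness claim can be quantified by comparing the standard deviation of period-to-period changes in each estimate series. A sketch on simulated beta paths (the noise levels are illustrative, chosen so that monthly windows carry more sampling error than daily ones):

```python
import numpy as np

# Simulated slow-moving true beta plus estimation noise (hypothetical data)
rng = np.random.default_rng(1)
n = 120
true_beta = np.cumsum(rng.normal(0, 0.02, n)) + 1.0

# Monthly windows contain far fewer observations, so their estimates
# carry more sampling noise than daily-based estimates
beta_monthly_est = true_beta + rng.normal(0, 0.20, n)
beta_daily_est = true_beta + rng.normal(0, 0.05, n)

def roughness(series):
    """Std of first differences: higher means a choppier estimate path."""
    return np.std(np.diff(series))

print(f"Monthly roughness: {roughness(beta_monthly_est):.3f}")
print(f"Daily roughness:   {roughness(beta_daily_est):.3f}")
```

Applying the same statistic to the actual `beta_combined` series, stock by stock, would make the visual impression from Figure 15.5 numerically precise.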

# Correlation between monthly and daily estimates
correlation_data = (beta_combined
    .pivot_table(index=["symbol", "date"], columns="frequency", values="beta")
    .dropna()
)

print(f"Correlation between monthly and daily betas: {correlation_data.corr().iloc[0,1]:.3f}")
Correlation between monthly and daily betas: 0.745
Table 15.1: Theoretical Reasons for Imperfect Correlation

Factor                       Effect
Non-synchronous trading      Daily betas can be biased downward for illiquid stocks
Microstructure noise         Bid-ask bounce adds noise to daily estimates
Different effective windows  Same calendar period, but roughly 20x more observations for daily
Mean reversion speed         Daily estimates capture faster-moving risk dynamics

Table 15.1 shows several reasons why we might observe imperfect correlation.
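
The non-synchronous trading issue has a standard remedy due to Dimson (1979): regress returns on contemporaneous and lagged market returns and sum the slopes. A minimal sketch on simulated data (the single-lag structure and parameters are illustrative):

```python
import numpy as np

# Simulated illiquid stock whose price reacts to the market with a one-day
# delay (hypothetical data): half the true exposure shows up at lag one
rng = np.random.default_rng(7)
n = 1000
true_beta = 1.2
mkt = rng.normal(0.0, 0.01, n)
mkt_lag = np.concatenate(([0.0], mkt[:-1]))
ret = 0.5 * true_beta * mkt + 0.5 * true_beta * mkt_lag + rng.normal(0.0, 0.005, n)

# Drop the first observation, which has no valid lag
ret, mkt, mkt_lag = ret[1:], mkt[1:], mkt_lag[1:]

# Contemporaneous OLS beta misses the delayed component...
X = np.column_stack([np.ones(len(mkt)), mkt])
beta_ols = np.linalg.lstsq(X, ret, rcond=None)[0][1]

# ...while the Dimson beta sums the contemporaneous and lagged slopes
X_dimson = np.column_stack([np.ones(len(mkt)), mkt, mkt_lag])
coefs = np.linalg.lstsq(X_dimson, ret, rcond=None)[0]
beta_dimson = coefs[1] + coefs[2]

print(f"Contemporaneous OLS beta: {beta_ols:.2f}")
print(f"Dimson beta:              {beta_dimson:.2f}")
```

For thinly traded Vietnamese stocks, such a lag adjustment can recover exposure that a plain daily regression attributes to noise.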

15.10 Key Takeaways

  1. CAPM beta measures a stock’s sensitivity to systematic market risk and is fundamental to modern portfolio theory, cost of capital estimation, and risk management.

  2. Rolling-window estimation captures time variation in betas, which reflects changes in companies’ business models, leverage, and market conditions.

  3. Parallelization dramatically reduces computation time for large-scale estimation tasks by distributing work across multiple CPU cores.

  4. Estimation choices matter: Window length, return frequency, and minimum observation requirements all affect beta estimates. Researchers should choose parameters appropriate for their specific application.

  5. Industry patterns: Vietnamese stocks show systematic differences in market sensitivity across industries, with cyclical sectors exhibiting higher betas than defensive sectors.

  6. Time variation: The cross-sectional distribution of betas in Vietnam has evolved over time, with notable shifts during market stress periods.

  7. Frequency comparison: Monthly and daily beta estimates are positively correlated but not identical. Daily estimates are smoother while monthly estimates may better capture lower-frequency variation.

  8. Data quality checks: Coverage analysis and summary statistics help identify potential issues in estimation procedures before using results in downstream analyses.