17  Fama-French Factors

This chapter provides a replication of the Fama-French factor portfolios for the Vietnamese stock market. The Fama-French factor models represent a cornerstone of empirical asset pricing, originating from the seminal work of Fama and French (1992) and later extended in Fama and French (2015). These models have transformed how academics and practitioners understand the cross-section of expected stock returns, moving beyond the single-factor Capital Asset Pricing Model to incorporate multiple sources of systematic risk.

We construct both the three-factor and five-factor models at monthly and daily frequencies. The monthly factors serve as the foundation for most asset pricing tests and portfolio analyses, while the daily factors enable higher-frequency applications including short-horizon event studies, market microstructure research, and daily beta estimation. By constructing factors at both frequencies, we create a complete toolkit for empirical finance research in the Vietnamese market.

The chapter proceeds as follows. We first discuss the theoretical motivation for each factor and the economic intuition behind the Fama-French methodology. We then prepare the necessary data, merging stock returns with accounting characteristics. Next, we implement the portfolio sorting procedures that form the basis of factor construction, carefully following the original Fama-French protocols while adapting them for Vietnamese market characteristics. We construct the three-factor model (market, size, and value) before extending to the five-factor model (adding profitability and investment). Finally, we construct daily factors and validate our replicated factors through various diagnostic checks.

17.1 Theoretical Background

17.1.1 The Evolution from CAPM to Multi-Factor Models

The Capital Asset Pricing Model of Sharpe (1964) posits that a single factor—the market portfolio—should explain all cross-sectional variation in expected returns. However, decades of empirical research have documented persistent patterns that CAPM cannot explain. Fama and French (1992) demonstrated that two firm characteristics—size and book-to-market ratio—capture substantial variation in average returns that the market beta leaves unexplained.

Small firms tend to earn higher average returns than large firms, a pattern known as the size effect. Similarly, firms with high book-to-market ratios (value stocks) tend to outperform firms with low book-to-market ratios (growth stocks), a pattern known as the value premium. The three-factor model formalizes these observations by constructing tradeable factor portfolios:

\[ r_{i,t} - r_{f,t} = \alpha_i + \beta_i^{MKT}(r_{m,t} - r_{f,t}) + \beta_i^{SMB} \cdot SMB_t + \beta_i^{HML} \cdot HML_t + \varepsilon_{i,t} \tag{17.1}\]

where:

  • \(r_{i,t} - r_{f,t}\) is the excess return on asset \(i\)
  • \(r_{m,t} - r_{f,t}\) is the market excess return
  • \(SMB_t\) (Small Minus Big) is the size factor
  • \(HML_t\) (High Minus Low) is the value factor
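As a concrete illustration, Equation 17.1 can be estimated by ordinary least squares once factor series are available. The sketch below uses simulated factor returns and a hypothetical stock with made-up loadings; the chapter constructs the actual Vietnamese factors later.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated monthly factor returns (illustration only)
rng = np.random.default_rng(42)
n = 120  # ten years of monthly observations
factors = pd.DataFrame({
    "mkt_excess": rng.normal(0.005, 0.05, n),
    "smb": rng.normal(0.002, 0.03, n),
    "hml": rng.normal(0.003, 0.03, n),
})

# A hypothetical stock whose excess return loads on all three factors
factors["ret_excess"] = (
    0.8 * factors["mkt_excess"]
    + 0.4 * factors["smb"]
    + 0.2 * factors["hml"]
    + rng.normal(0, 0.02, n)
)

# Estimate Equation 17.1; the coefficients recover the factor loadings
model = smf.ols("ret_excess ~ mkt_excess + smb + hml", data=factors).fit()
print(model.params.round(2))
```

With enough observations, the estimated betas cluster tightly around the true loadings of 0.8, 0.4, and 0.2.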

17.1.2 The Five-Factor Extension

Fama and French (2015) extended the model to include two additional factors motivated by the dividend discount model. Firms with higher profitability should have higher expected returns (all else equal), and firms with aggressive investment policies should have lower expected returns:

\[ r_{i,t} - r_{f,t} = \alpha_i + \beta_i^{MKT}MKT_t + \beta_i^{SMB}SMB_t + \beta_i^{HML}HML_t + \beta_i^{RMW}RMW_t + \beta_i^{CMA}CMA_t + \varepsilon_{i,t} \tag{17.2}\]

where:

  • \(RMW_t\) (Robust Minus Weak) is the profitability factor
  • \(CMA_t\) (Conservative Minus Aggressive) is the investment factor

17.1.3 Factor Construction Methodology

The Fama-French methodology constructs factors through double-sorted portfolios:

  1. Size sorts: Stocks are divided into Small and Big groups based on median market capitalization.

  2. Characteristic sorts: Within each size group, stocks are sorted into terciles based on book-to-market (for HML), operating profitability (for RMW), or investment (for CMA).

  3. Factor returns: Factors are computed as the difference between average returns of portfolios with high versus low characteristic values, averaging across size groups to neutralize size effects.

  4. Timing: Portfolios are formed at the end of June each year using accounting data from the prior fiscal year, ensuring all information was publicly available at formation.
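The first three steps can be sketched on a toy cross-section. Everything below is hypothetical and serves only to make the mechanics concrete; the full implementation with real Vietnamese data follows in this chapter.

```python
import numpy as np
import pandas as pd

# Hypothetical cross-section of stocks at one formation date
rng = np.random.default_rng(0)
stocks = pd.DataFrame({
    "mktcap": rng.lognormal(5, 1, 300),
    "bm": rng.lognormal(0, 0.5, 300),
    "ret": rng.normal(0.01, 0.05, 300),
})

# Step 1: size sort at the median market capitalization
stocks["size_group"] = np.where(
    stocks["mktcap"] <= stocks["mktcap"].median(), "small", "big"
)

# Step 2: book-to-market terciles at the 30th/70th percentiles
stocks["bm_group"] = pd.cut(
    stocks["bm"],
    bins=[-np.inf, stocks["bm"].quantile(0.3),
          stocks["bm"].quantile(0.7), np.inf],
    labels=["low", "neutral", "high"],
)

# Step 3: value-weighted return of each of the six portfolios,
# then high-minus-low differences averaged across size groups
stocks["w_ret"] = stocks["ret"] * stocks["mktcap"]
grouped = stocks.groupby(["size_group", "bm_group"], observed=True)
vw = grouped["w_ret"].sum() / grouped["mktcap"].sum()

hml = (vw.xs("high", level="bm_group").mean()
       - vw.xs("low", level="bm_group").mean())
smb = (vw.xs("small", level="size_group").mean()
       - vw.xs("big", level="size_group").mean())
print(f"Toy SMB: {smb:.4f}, toy HML: {hml:.4f}")
```

Because the returns here are pure noise, the toy factor premiums are small and carry no economic meaning; the point is only the sorting and averaging mechanics.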

17.2 Setting Up the Environment

We load the required Python packages for data manipulation, statistical analysis, and visualization.

import pandas as pd
import numpy as np
import sqlite3
import statsmodels.formula.api as smf
from scipy.stats.mstats import winsorize
import matplotlib.pyplot as plt

from plotnine import *
from mizani.formatters import percent_format, comma_format

We connect to our SQLite database containing the processed Vietnamese financial data.

tidy_finance = sqlite3.connect(database="data/tidy_finance_python.sqlite")

17.3 Data Preparation

17.3.1 Loading Stock Returns

We load the monthly stock returns data, which includes excess returns, market capitalization, and the risk-free rate. These variables are essential for computing value-weighted portfolio returns and factor premiums.

prices_monthly = pd.read_sql_query(
    sql="""
        SELECT symbol, date, ret_excess, mktcap, mktcap_lag, risk_free
        FROM prices_monthly
    """,
    con=tidy_finance,
    parse_dates={"date"}
).dropna()

print(f"Monthly returns: {len(prices_monthly):,} observations")
print(f"Unique stocks: {prices_monthly['symbol'].nunique():,}")
print(f"Date range: {prices_monthly['date'].min():%Y-%m} to {prices_monthly['date'].max():%Y-%m}")
Monthly returns: 165,499 observations
Unique stocks: 1,457
Date range: 2010-02 to 2023-12

17.3.2 Loading Company Fundamentals

We load the company fundamentals data containing book equity, operating profitability, and investment—the characteristics needed for constructing the Fama-French factors.

comp_vn = pd.read_sql_query(
    sql="""
        SELECT symbol, datadate, be, op, inv
        FROM comp_vn
    """,
    con=tidy_finance,
    parse_dates={"datadate"}
).dropna()

print(f"Fundamentals: {len(comp_vn):,} firm-year observations")
print(f"Unique firms: {comp_vn['symbol'].nunique():,}")
Fundamentals: 18,108 firm-year observations
Unique firms: 1,496

17.3.3 Constructing Sorting Variables

Following Fama-French conventions, we construct the sorting variables with careful attention to timing. The key principles are:

  1. Size (June Market Cap): We use market capitalization at the end of June of year \(t\) to sort stocks into size groups. This ensures we capture the firm’s size at the moment of portfolio formation.

  2. Book-to-Market Ratio: We use book equity from fiscal year \(t-1\) divided by market equity at the end of December \(t-1\). This creates a six-month gap between the accounting data and portfolio formation, ensuring the information was publicly available.

  3. Portfolio Formation Date: Portfolios are formed on July 1st and held for twelve months until the following June.
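The date arithmetic implied by these rules can be verified directly with pandas offsets, as a quick standalone sanity check: a month-end date plus `MonthBegin(n)` first rolls forward to the next month's start and then advances by whole months.

```python
import pandas as pd

# June month-end plus one MonthBegin lands on July 1 of the same year
june = pd.Timestamp("2019-06-30")
print(june + pd.offsets.MonthBegin(1))  # 2019-07-01

# December month-end plus seven MonthBegins lands on July 1 of year t,
# mapping December t-1 market equity to the July t formation date
december = pd.Timestamp("2018-12-31")
print(december + pd.offsets.MonthBegin(7))  # 2019-07-01
```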

def construct_sorting_variables(prices_monthly, comp_vn):
    """
    Construct sorting variables following Fama-French methodology.
    
    Parameters
    ----------
    prices_monthly : pd.DataFrame
        Monthly stock returns with market cap
    comp_vn : pd.DataFrame
        Company fundamentals with book equity, profitability, investment
        
    Returns
    -------
    pd.DataFrame
        Sorting variables aligned with July 1st formation dates
    """
    
    # 1. Size: June market capitalization
    # Portfolio formation is July 1st, so we use June market cap
    size = (prices_monthly
        .query("date.dt.month == 6")
        .assign(
            sorting_date=lambda x: x["date"] + pd.offsets.MonthBegin(1)
        )
        [["symbol", "sorting_date", "mktcap"]]
        .rename(columns={"mktcap": "size"})
    )
    
    print(f"Size observations: {len(size):,}")
    
    # 2. Market Equity: December market cap for B/M calculation
    # December t-1 market cap is used with fiscal year t-1 book equity
    # This is then used for July t portfolio formation
    market_equity = (prices_monthly
        .query("date.dt.month == 12")
        .assign(
            # December year t-1 maps to July year t formation
            sorting_date=lambda x: x["date"] + pd.offsets.MonthBegin(7)
        )
        [["symbol", "sorting_date", "mktcap"]]
        .rename(columns={"mktcap": "me"})
    )
    
    print(f"Market equity observations: {len(market_equity):,}")
    
    # 3. Book-to-Market and other characteristics
    # Fiscal year t-1 data is used for July t portfolio formation
    book_to_market = (comp_vn
        .assign(
            # Fiscal year-end + 6 months = July formation
            sorting_date=lambda x: pd.to_datetime(
                (x["datadate"].dt.year + 1).astype(str) + "-07-01"
            )
        )
        .merge(market_equity, on=["symbol", "sorting_date"], how="inner")
        .assign(
            # Scale book equity to match market equity units:
            # BE is in VND, ME is in billions VND
            bm=lambda x: x["be"] / (x["me"] * 1e9)
        )
        [["symbol", "sorting_date", "me", "bm", "op", "inv"]]
    )
    
    print(f"Book-to-market observations: {len(book_to_market):,}")
    
    # 4. Merge size with characteristics
    sorting_variables = (size
        .merge(book_to_market, on=["symbol", "sorting_date"], how="inner")
        .dropna()
        .drop_duplicates(subset=["symbol", "sorting_date"])
    )
    
    return sorting_variables

sorting_variables = construct_sorting_variables(prices_monthly, comp_vn)

print(f"\nFinal sorting variables: {len(sorting_variables):,} stock-years")
print(f"Sorting date range: {sorting_variables['sorting_date'].min():%Y-%m} to {sorting_variables['sorting_date'].max():%Y-%m}")
Size observations: 13,756
Market equity observations: 14,286
Book-to-market observations: 13,389

Final sorting variables: 12,046 stock-years
Sorting date range: 2011-07 to 2023-07

17.3.4 Validating Sorting Variables

Before proceeding, we validate that our sorting variables have reasonable distributions. The book-to-market ratio should center around 1.0 for a typical market, though emerging markets may differ.

print("Sorting Variable Summary Statistics:")
print(sorting_variables[["size", "bm", "op", "inv"]].describe().round(4))

# Check for extreme values that might indicate data issues
print(f"\nB/M Median: {sorting_variables['bm'].median():.4f}")
print(f"B/M 1st percentile: {sorting_variables['bm'].quantile(0.01):.4f}")
print(f"B/M 99th percentile: {sorting_variables['bm'].quantile(0.99):.4f}")
Sorting Variable Summary Statistics:
              size          bm          op         inv
count   12046.0000  12046.0000  12046.0000  12046.0000
mean     2225.5648      1.7033      0.1852      0.1322
std     14680.4225      3.8683      0.2782      2.5410
min         0.4864      0.0014     -0.7529     -0.9569
25%        62.7556      0.7595      0.0309     -0.0495
50%       182.6410      1.1849      0.1367      0.0342
75%       641.8896      1.8853      0.2952      0.1582
max    426020.9817    272.1893      1.4256    261.3355

B/M Median: 1.1849
B/M 1st percentile: 0.1710
B/M 99th percentile: 8.0262

17.3.5 Handling Outliers

Extreme values in sorting characteristics can distort portfolio assignments and factor returns. We apply winsorization to limit the influence of outliers while preserving the general ranking of stocks.

# Check BEFORE winsorization
print("BEFORE Winsorization:")
print(sorting_variables[["size", "bm", "op", "inv"]].describe().round(4))

# Apply winsorization
def winsorize_characteristics(df, columns, limits=(0.01, 0.99)):
    """
    Apply winsorization using pandas clip.
    """
    df = df.copy()
    for col in columns:
        if col in df.columns:
            lower = df[col].quantile(limits[0])
            upper = df[col].quantile(limits[1])
            df[col] = df[col].clip(lower=lower, upper=upper)
            print(f"  {col}: clipped to [{lower:.4f}, {upper:.4f}]")
    return df

sorting_variables = winsorize_characteristics(
    sorting_variables,
    columns=["bm", "op", "inv"],  # Don't winsorize size
    limits=(0.01, 0.99)
)

# Check AFTER winsorization
print("\nAFTER Winsorization:")
print(sorting_variables[["size", "bm", "op", "inv"]].describe().round(4))
BEFORE Winsorization:
              size          bm          op         inv
count   12046.0000  12046.0000  12046.0000  12046.0000
mean     2225.5648      1.7033      0.1852      0.1322
std     14680.4225      3.8683      0.2782      2.5410
min         0.4864      0.0014     -0.7529     -0.9569
25%        62.7556      0.7595      0.0309     -0.0495
50%       182.6410      1.1849      0.1367      0.0342
75%       641.8896      1.8853      0.2952      0.1582
max    426020.9817    272.1893      1.4256    261.3355
  bm: clipped to [0.1710, 8.0262]
  op: clipped to [-0.7319, 1.2192]
  inv: clipped to [-0.3990, 1.5195]

AFTER Winsorization:
              size          bm          op         inv
count   12046.0000  12046.0000  12046.0000  12046.0000
mean     2225.5648      1.5544      0.1837      0.0894
std     14680.4225      1.2843      0.2705      0.2721
min         0.4864      0.1710     -0.7319     -0.3990
25%        62.7556      0.7595      0.0309     -0.0495
50%       182.6410      1.1849      0.1367      0.0342
75%       641.8896      1.8853      0.2952      0.1582
max    426020.9817      8.0262      1.2192      1.5195

17.4 Portfolio Assignment Functions

17.4.1 The Portfolio Assignment Function

We create a flexible function for assigning stocks to portfolios based on quantile breakpoints. Because the breakpoints are computed from whatever data the function receives, it supports both independent sorts (applied to all stocks at once) and dependent sorts (applied within subgroups such as size buckets).

def assign_portfolio(data, sorting_variable, percentiles):
    """Assign stocks to portfolio bins according to a sorting variable."""
    
    # Get the non-missing values of the sorting variable
    values = data[sorting_variable].dropna()
    
    if len(values) == 0:
        return pd.Series([np.nan] * len(data), index=data.index)
    
    # Calculate breakpoints
    breakpoints = values.quantile(percentiles, interpolation="linear")
    
    # Handle duplicate breakpoints by keeping only unique values
    unique_breakpoints = np.unique(breakpoints)
    
    # If all values are identical, assign everything to portfolio 1
    if len(unique_breakpoints) <= 1:
        return pd.Series([1] * len(data), index=data.index)
    
    # Widen the outer boundaries to -inf and +inf so no stock falls
    # outside the bins
    unique_breakpoints[0] = -np.inf
    unique_breakpoints[-1] = np.inf
    
    # Assign to bins; the number of labels must equal the number of bins,
    # which is one less than the number of unique breakpoints
    assigned = pd.cut(
        data[sorting_variable],
        bins=unique_breakpoints,
        labels=range(1, len(unique_breakpoints)),
        include_lowest=True,
        right=False
    )
    
    return assigned

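To see why the breakpoints must be deduplicated before binning, consider a toy series in which most stocks share the same characteristic value (hypothetical data). Passing duplicate bin edges to pd.cut raises a ValueError, so we keep only the unique breakpoints and widen the outer bins to infinity.

```python
import numpy as np
import pandas as pd

# 70 identical values: the 0th and 30th percentiles coincide at 0.0
values = pd.Series([0.0] * 70 + list(np.linspace(0.1, 1.0, 30)))

breakpoints = values.quantile([0, 0.3, 0.7, 1.0])
print(breakpoints.tolist())  # first two breakpoints are both 0.0

# Keep only unique breakpoints and widen the outer bins
unique_breakpoints = np.unique(breakpoints)
unique_breakpoints[0] = -np.inf
unique_breakpoints[-1] = np.inf

assigned = pd.cut(
    values,
    bins=unique_breakpoints,
    labels=range(1, len(unique_breakpoints)),
    include_lowest=True,
    right=False,
)
print(assigned.value_counts().sort_index())
```

Here the intended tercile sort collapses into two bins: the 70 tied stocks land together in the bottom portfolio rather than crashing the assignment.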
# Check the distribution of characteristics BEFORE portfolio assignment
print("Operating Profitability Distribution:")
print(sorting_variables["op"].describe())
print(f"\nUnique OP values: {sorting_variables['op'].nunique()}")

print("\nInvestment Distribution:")
print(sorting_variables["inv"].describe())
print(f"\nUnique INV values: {sorting_variables['inv'].nunique()}")

# Check breakpoints for a specific date
test_date = sorting_variables["sorting_date"].iloc[0]
test_data = sorting_variables.query("sorting_date == @test_date")

print(f"\nBreakpoints for {test_date}:")
print(f"OP 30th percentile: {test_data['op'].quantile(0.3):.4f}")
print(f"OP 70th percentile: {test_data['op'].quantile(0.7):.4f}")
print(f"INV 30th percentile: {test_data['inv'].quantile(0.3):.4f}")
print(f"INV 70th percentile: {test_data['inv'].quantile(0.7):.4f}")
Operating Profitability Distribution:
count    12046.000000
mean         0.183738
std          0.270509
min         -0.731888
25%          0.030913
50%          0.136675
75%          0.295185
max          1.219223
Name: op, dtype: float64

Unique OP values: 11804

Investment Distribution:
count    12046.000000
mean         0.089388
std          0.272147
min         -0.399042
25%         -0.049497
50%          0.034157
75%          0.158155
max          1.519474
Name: inv, dtype: float64

Unique INV values: 11805

Breakpoints for 2019-07-01 00:00:00:
OP 30th percentile: 0.0541
OP 70th percentile: 0.2566
INV 30th percentile: -0.0343
INV 70th percentile: 0.1116

17.4.2 Assigning Portfolios for Three-Factor Model

For the three-factor model, we perform independent double sorts on size and book-to-market. Size is split at the median (2 groups), and book-to-market is split at the 30th and 70th percentiles (3 groups), creating 6 portfolios.

def assign_ff3_portfolios(sorting_variables):
    """
    Assign portfolios for Fama-French three-factor model.
    Independent 2x3 sort on size and book-to-market.
    """
    df = sorting_variables.copy()
    
    # Independent size sort (median split)
    df["portfolio_size"] = df.groupby("sorting_date")["size"].transform(
        lambda x: pd.qcut(x, q=[0, 0.5, 1], labels=[1, 2], duplicates='drop')
    )
    
    # Independent B/M sort (30/70 split)
    df["portfolio_bm"] = df.groupby("sorting_date")["bm"].transform(
        lambda x: pd.qcut(x, q=[0, 0.3, 0.7, 1], labels=[1, 2, 3], duplicates='drop')
    )
    
    return df

# Assign portfolios
portfolios_ff3 = assign_ff3_portfolios(sorting_variables)

print("Three-Factor Portfolio Assignments:")
print(portfolios_ff3[["symbol", "sorting_date", "portfolio_size", "portfolio_bm"]].head(10))
Three-Factor Portfolio Assignments:
  symbol sorting_date portfolio_size portfolio_bm
0    A32   2019-07-01              1            2
1    A32   2020-07-01              1            2
2    A32   2021-07-01              1            2
3    A32   2022-07-01              1            3
4    A32   2023-07-01              1            2
5    AAA   2011-07-01              2            2
6    AAA   2012-07-01              2            3
7    AAA   2013-07-01              2            3
8    AAA   2014-07-01              2            2
9    AAA   2015-07-01              2            3

17.4.3 Validating Portfolio Assignments

We verify that the portfolio assignments create the expected 2×3 grid with reasonable stock counts in each cell.

# Check portfolio distribution for most recent year
latest_date = portfolios_ff3["sorting_date"].max()

portfolio_counts = (portfolios_ff3
    .query("sorting_date == @latest_date")
    .groupby(["portfolio_size", "portfolio_bm"], observed=True)
    .size()
    .unstack(fill_value=0)
)

print(f"Portfolio Counts for {latest_date:%Y-%m}:")
print(portfolio_counts)

# Verify characteristic monotonicity
print("\nBook-to-Market by Portfolio (should be increasing):")
print(portfolios_ff3.groupby("portfolio_bm", observed=True)["bm"].median().round(4))
Portfolio Counts for 2023-07:
portfolio_bm      1    2    3
portfolio_size               
1               113  271  263
2               275  246  125

Book-to-Market by Portfolio (should be increasing):
portfolio_bm
1    0.5836
2    1.1891
3    2.4552
Name: bm, dtype: float64

Portfolio assignments are made once per year, so a stock keeps its portfolio for the full twelve-month holding period and can only change buckets at the July rebalancing. We trace a single stock across formation dates to illustrate this.

# Trace a single symbol (e.g., 'A32') across a formation window
persistence_check = (portfolios_ff3
    .query("symbol == 'A32' & sorting_date >= '2022-01-01' & sorting_date <= '2023-12-31'")
    .sort_values("sorting_date")
    [['symbol', 'sorting_date', 'portfolio_size', 'portfolio_bm']]
)
print("\nTemporal Persistence Check (Symbol A32):")
print(persistence_check.head(15))

Temporal Persistence Check (Symbol A32):
  symbol sorting_date portfolio_size portfolio_bm
3    A32   2022-07-01              1            3
4    A32   2023-07-01              1            2

17.5 Fama-French Three-Factor Model (Monthly)

17.5.1 Merging Portfolios with Returns

We merge the portfolio assignments with monthly returns. The key insight is that portfolios formed in July of year \(t\) are held through June of year \(t+1\). We implement this by computing a sorting_date for each monthly return observation.

def merge_portfolios_with_returns(prices_monthly, portfolio_assignments):
    """
    Merge portfolio assignments with monthly returns.
    
    Portfolios formed in July t are held through June t+1.
    
    Parameters
    ----------
    prices_monthly : pd.DataFrame
        Monthly stock returns
    portfolio_assignments : pd.DataFrame
        Portfolio assignments with sorting_date
        
    Returns
    -------
    pd.DataFrame
        Returns merged with portfolio assignments
    """
    portfolios = (prices_monthly
        .assign(
            # Map each return month to its portfolio formation date
            sorting_date=lambda x: pd.to_datetime(
                np.where(
                    x["date"].dt.month <= 6,
                    (x["date"].dt.year - 1).astype(str) + "-07-01",
                    x["date"].dt.year.astype(str) + "-07-01"
                )
            )
        )
        .merge(
            portfolio_assignments,
            on=["symbol", "sorting_date"],
            how="inner"
        )
    )
    
    return portfolios

portfolios_monthly_ff3 = merge_portfolios_with_returns(
    prices_monthly,
    portfolios_ff3[["symbol", "sorting_date", "portfolio_size", "portfolio_bm"]]
)


print(f"Merged observations: {len(portfolios_monthly_ff3):,}")
Merged observations: 136,444

17.5.2 Computing Value-Weighted Portfolio Returns

We compute value-weighted returns for each of the six portfolios. Value-weighting uses lagged market capitalization to avoid look-ahead bias.

def compute_portfolio_returns(data, grouping_vars):
    """
    Compute value-weighted portfolio returns.
    
    Parameters
    ----------
    data : pd.DataFrame
        Returns data with portfolio assignments and mktcap_lag
    grouping_vars : list
        Variables defining portfolio groups
        
    Returns
    -------
    pd.DataFrame
        Value-weighted returns for each portfolio-date
    """
    portfolio_returns = (data
        .groupby(grouping_vars + ["date"], observed=True)
        .apply(lambda x: pd.Series({
            "ret": np.average(x["ret_excess"], weights=x["mktcap_lag"]),
            "n_stocks": len(x)
        }), include_groups=False)
        .reset_index()
    )
    
    return portfolio_returns


# Compute portfolio returns
portfolio_returns_ff3 = compute_portfolio_returns(
    portfolios_monthly_ff3,
    ["portfolio_size", "portfolio_bm"]
)

print("Portfolio Returns Summary:")
print(portfolio_returns_ff3.groupby(["portfolio_size", "portfolio_bm"], observed=True)["ret"].describe().round(4))
Portfolio Returns Summary:
                             count    mean     std     min     25%     50%  \
portfolio_size portfolio_bm                                                  
1              1             150.0 -0.0052  0.0379 -0.1268 -0.0262 -0.0058   
               2             150.0 -0.0025  0.0414 -0.1080 -0.0236 -0.0053   
               3             150.0  0.0039  0.0601 -0.1612 -0.0269 -0.0004   
2              1             150.0 -0.0124  0.0594 -0.2222 -0.0449 -0.0107   
               2             150.0 -0.0021  0.0671 -0.1701 -0.0403 -0.0046   
               3             150.0  0.0024  0.0879 -0.2359 -0.0527 -0.0012   

                                75%     max  
portfolio_size portfolio_bm                  
1              1             0.0161  0.0849  
               2             0.0180  0.1195  
               3             0.0314  0.2015  
2              1             0.0210  0.1741  
               2             0.0317  0.1770  
               3             0.0437  0.2124  

17.5.3 Constructing SMB and HML Factors

We now construct the SMB and HML factors from the portfolio returns.

SMB (Small Minus Big): Average return of three small portfolios minus average return of three big portfolios.

HML (High Minus Low): Average return of two high B/M portfolios minus average return of two low B/M portfolios.
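The arithmetic reduces to simple averages over the 2×3 grid. With hypothetical value-weighted portfolio returns for a single month:

```python
import pandas as pd

# Hypothetical excess returns of the six portfolios for one month
# (rows: size group, columns: book-to-market group)
rets = pd.DataFrame(
    {"low": [0.010, 0.004], "neutral": [0.012, 0.006], "high": [0.020, 0.010]},
    index=["small", "big"],
)

# SMB: average of the three small portfolios minus the three big ones
smb = rets.loc["small"].mean() - rets.loc["big"].mean()

# HML: average of the two high-B/M portfolios minus the two low-B/M ones,
# which neutralizes the size dimension
hml = rets["high"].mean() - rets["low"].mean()

print(f"SMB = {smb:.4f}, HML = {hml:.4f}")  # SMB = 0.0073, HML = 0.0080
```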

def construct_ff3_factors(portfolio_returns):
    """
    Construct Fama-French three factors from portfolio returns.
    
    Parameters
    ----------
    portfolio_returns : pd.DataFrame
        Value-weighted returns for 2x3 portfolios
        
    Returns
    -------
    pd.DataFrame
        Monthly SMB and HML factors
    """
    factors = (portfolio_returns
        .groupby("date")
        .apply(lambda x: pd.Series({
            # SMB: Small minus Big (average across B/M groups)
            "smb": (
                x.loc[x["portfolio_size"] == 1, "ret"].mean() -
                x.loc[x["portfolio_size"] == 2, "ret"].mean()
            ),
            # HML: High minus Low B/M (average across size groups)
            "hml": (
                x.loc[x["portfolio_bm"] == 3, "ret"].mean() -
                x.loc[x["portfolio_bm"] == 1, "ret"].mean()
            )
        }), include_groups=False)
        .reset_index()
    )
    
    return factors

factors_smb_hml = construct_ff3_factors(portfolio_returns_ff3)

print("SMB and HML Factors:")
print(factors_smb_hml.head(10))
SMB and HML Factors:
        date       smb       hml
0 2011-07-31 -0.007768  0.002754
1 2011-08-31 -0.067309  0.011474
2 2011-09-30  0.014884  0.022854
3 2011-10-31 -0.003743  0.001631
4 2011-11-30  0.063234  0.009103
5 2011-12-31  0.014571  0.015280
6 2012-01-31 -0.026080  0.009672
7 2012-02-29 -0.035721  0.005474
8 2012-03-31 -0.002344  0.032477
9 2012-04-30 -0.033391  0.074191

17.5.4 Computing the Market Factor

The market factor is the value-weighted return of all stocks minus the risk-free rate. We compute this independently from the sorted portfolios.

def compute_market_factor(prices_monthly):
    """
    Compute value-weighted market excess return.
    
    Parameters
    ----------
    prices_monthly : pd.DataFrame
        Monthly stock returns with mktcap_lag
        
    Returns
    -------
    pd.DataFrame
        Monthly market excess return
    """
    market_factor = (prices_monthly
        .groupby("date")
        .apply(lambda x: pd.Series({
            "mkt_excess": np.average(x["ret_excess"], weights=x["mktcap_lag"]),
            "n_stocks": len(x)
        }), include_groups=False)
        .reset_index()
    )
    
    return market_factor

market_factor = compute_market_factor(prices_monthly)

print("Market Factor Summary:")
print(market_factor["mkt_excess"].describe().round(4))
Market Factor Summary:
count    167.0000
mean      -0.0123
std        0.0595
min       -0.2149
25%       -0.0394
50%       -0.0106
75%        0.0200
max        0.1677
Name: mkt_excess, dtype: float64

17.5.5 Combining Three Factors

We combine SMB, HML, and the market factor into the complete three-factor dataset.

factors_ff3_monthly = (factors_smb_hml
    .merge(market_factor[["date", "mkt_excess"]], on="date", how="inner")
)

# Add risk-free rate for completeness
rf_monthly = (prices_monthly
    .groupby("date")["risk_free"]
    .first()
    .reset_index()
)

factors_ff3_monthly = factors_ff3_monthly.merge(rf_monthly, on="date", how="left")

print("Fama-French Three Factors (Monthly):")
print(factors_ff3_monthly.head(10))

print("\nFactor Summary Statistics:")
print(factors_ff3_monthly[["mkt_excess", "smb", "hml"]].describe().round(4))
Fama-French Three Factors (Monthly):
        date       smb       hml  mkt_excess  risk_free
0 2011-07-31 -0.007768  0.002754   -0.078748   0.003333
1 2011-08-31 -0.067309  0.011474    0.029906   0.003333
2 2011-09-30  0.014884  0.022854   -0.002173   0.003333
3 2011-10-31 -0.003743  0.001631   -0.014005   0.003333
4 2011-11-30  0.063234  0.009103   -0.179410   0.003333
5 2011-12-31  0.014571  0.015280   -0.094802   0.003333
6 2012-01-31 -0.026080  0.009672    0.081273   0.003333
7 2012-02-29 -0.035721  0.005474    0.069655   0.003333
8 2012-03-31 -0.002344  0.032477    0.029005   0.003333
9 2012-04-30 -0.033391  0.074191    0.048791   0.003333

Factor Summary Statistics:
       mkt_excess       smb       hml
count    150.0000  150.0000  150.0000
mean      -0.0101    0.0027    0.0120
std        0.0586    0.0420    0.0535
min       -0.2149   -0.1599   -0.1284
25%       -0.0380   -0.0175   -0.0160
50%       -0.0095    0.0070    0.0043
75%        0.0214    0.0261    0.0340
max        0.1677    0.1175    0.1618

17.5.6 Saving Three-Factor Data

factors_ff3_monthly.to_sql(
    name="factors_ff3_monthly",
    con=tidy_finance,
    if_exists="replace",
    index=False
)

print("Three-factor monthly data saved to database.")
Three-factor monthly data saved to database.
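Once stored, the factors can be loaded in any later session with pd.read_sql_query. The sketch below uses an in-memory database with toy values so that it is self-contained; in practice you would connect to data/tidy_finance_python.sqlite as above.

```python
import sqlite3
import pandas as pd

# In-memory stand-in for data/tidy_finance_python.sqlite (toy values)
con = sqlite3.connect(":memory:")
toy = pd.DataFrame({
    "date": ["2020-01-31", "2020-02-29", "2020-03-31"],
    "smb": [0.002, 0.001, -0.003],
    "hml": [0.004, -0.001, 0.002],
    "mkt_excess": [0.010, -0.020, 0.030],
    "risk_free": [0.003, 0.003, 0.003],
})
toy.to_sql("factors_ff3_monthly", con, index=False)

# The read-back pattern used throughout the book
factors_ff3_monthly = pd.read_sql_query(
    sql="SELECT * FROM factors_ff3_monthly",
    con=con,
    parse_dates={"date"},
)
print(factors_ff3_monthly.dtypes)
```

Note that SQLite stores dates as text, so parse_dates is needed to recover a proper datetime column on the way back in.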

17.6 Fama-French Five-Factor Model (Monthly)

17.6.1 Portfolio Assignments with Dependent Sorts

For the five-factor model, we use dependent sorts: size is sorted independently, while book-to-market, profitability, and investment are each sorted into terciles within size groups. This controls for the correlation between size and these characteristics.

def assign_ff5_portfolios(sorting_variables):
    """
    Assign portfolios for Fama-French five-factor model.
    """
    df = sorting_variables.copy()
    
    # Independent size sort
    df["portfolio_size"] = df.groupby("sorting_date")["size"].transform(
        lambda x: pd.qcut(x, q=[0, 0.5, 1], labels=[1, 2], duplicates='drop')
    )
    
    # Dependent sorts within size groups
    df["portfolio_bm"] = df.groupby(["sorting_date", "portfolio_size"])["bm"].transform(
        lambda x: pd.qcut(x, q=[0, 0.3, 0.7, 1], labels=[1, 2, 3], duplicates='drop')
    )
    
    df["portfolio_op"] = df.groupby(["sorting_date", "portfolio_size"])["op"].transform(
        lambda x: pd.qcut(x, q=[0, 0.3, 0.7, 1], labels=[1, 2, 3], duplicates='drop')
    )
    
    df["portfolio_inv"] = df.groupby(["sorting_date", "portfolio_size"])["inv"].transform(
        lambda x: pd.qcut(x, q=[0, 0.3, 0.7, 1], labels=[1, 2, 3], duplicates='drop')
    )
    
    return df

# Run
portfolios_ff5 = assign_ff5_portfolios(sorting_variables)

17.6.2 Validating Five-Factor Portfolios

# Check characteristic monotonicity for each dimension
print("Profitability by Portfolio (should be increasing):")
print(portfolios_ff5.groupby("portfolio_op", observed=True)["op"].median().round(4))

print("\nInvestment by Portfolio (should be increasing):")
print(portfolios_ff5.groupby("portfolio_inv", observed=True)["inv"].median().round(4))

# Check portfolio counts
print("\nStocks per Size/Profitability Bin:")
print(portfolios_ff5.groupby(["portfolio_size", "portfolio_op"], observed=True).size().unstack(fill_value=0))

# Check number of unique firms per year
(portfolios_ff5
 .groupby("sorting_date")["symbol"]
 .nunique())
Profitability by Portfolio (should be increasing):
portfolio_op
1   -0.0053
2    0.1366
3    0.4098
Name: op, dtype: float64

Investment by Portfolio (should be increasing):
portfolio_inv
1   -0.1012
2    0.0329
3    0.2568
Name: inv, dtype: float64

Stocks per Size/Profitability Bin:
portfolio_op       1     2     3
portfolio_size                  
1               1812  2403  1811
2               1811  2401  1808
sorting_date
2011-07-01     556
2012-07-01     632
2013-07-01     643
2014-07-01     650
2015-07-01     669
2016-07-01     737
2017-07-01     842
2018-07-01    1073
2019-07-01    1183
2020-07-01    1224
2021-07-01    1258
2022-07-01    1286
2023-07-01    1293
Name: symbol, dtype: int64

17.6.3 Merging and Computing Portfolio Returns

# Merge with returns
portfolios_monthly_ff5 = merge_portfolios_with_returns(
    prices_monthly,
    portfolios_ff5[["symbol", "sorting_date", "portfolio_size", 
                    "portfolio_bm", "portfolio_op", "portfolio_inv"]]
)

print(f"Five-factor merged observations: {len(portfolios_monthly_ff5):,}")
Five-factor merged observations: 136,444

17.6.4 Constructing All Five Factors

We construct each factor from the appropriate portfolio sorts.
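In the standard 2×3 notation (S/B for small/big; H/L for high/low B/M; R/W for robust/weak profitability; C/A for conservative/aggressive investment), the factors computed below correspond to:

```latex
\begin{aligned}
\mathit{HML} &= \tfrac{1}{2}(SH + BH) - \tfrac{1}{2}(SL + BL)\\
\mathit{RMW} &= \tfrac{1}{2}(SR + BR) - \tfrac{1}{2}(SW + BW)\\
\mathit{CMA} &= \tfrac{1}{2}(SC + BC) - \tfrac{1}{2}(SA + BA)\\
\mathit{SMB} &= \tfrac{1}{3}\left(\mathit{SMB}_{B/M} + \mathit{SMB}_{OP} + \mathit{SMB}_{Inv}\right)
\end{aligned}
```

where each SMB component is the average return of the small portfolios minus the average return of the big portfolios within the corresponding sort.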

def construct_ff5_factors(portfolios_monthly):
    """
    Construct Fama-French five factors from portfolio data.
    
    Parameters
    ----------
    portfolios_monthly : pd.DataFrame
        Monthly returns with all portfolio assignments
        
    Returns
    -------
    pd.DataFrame
        Monthly five-factor returns
    """
    
    # HML: Value factor from B/M sorts
    portfolios_bm = (portfolios_monthly
        .groupby(["portfolio_size", "portfolio_bm", "date"], observed=True)
        .apply(lambda x: pd.Series({
            "ret": np.average(x["ret_excess"], weights=x["mktcap_lag"])
        }))
        .reset_index()
    )
    
    factors_hml = (portfolios_bm
        .groupby("date")
        .apply(lambda x: pd.Series({
            "hml": (x.loc[x["portfolio_bm"] == 3, "ret"].mean() -
                   x.loc[x["portfolio_bm"] == 1, "ret"].mean())
        }))
        .reset_index()
    )
    
    # RMW: Profitability factor from OP sorts
    portfolios_op = (portfolios_monthly
        .groupby(["portfolio_size", "portfolio_op", "date"], observed=True)
        .apply(lambda x: pd.Series({
            "ret": np.average(x["ret_excess"], weights=x["mktcap_lag"])
        }))
        .reset_index()
    )
    
    factors_rmw = (portfolios_op
        .groupby("date")
        .apply(lambda x: pd.Series({
            "rmw": (x.loc[x["portfolio_op"] == 3, "ret"].mean() -
                   x.loc[x["portfolio_op"] == 1, "ret"].mean())
        }))
        .reset_index()
    )
    
    # CMA: Investment factor from INV sorts
    # Note: CMA is Conservative minus Aggressive (low inv - high inv)
    portfolios_inv = (portfolios_monthly
        .groupby(["portfolio_size", "portfolio_inv", "date"], observed=True)
        .apply(lambda x: pd.Series({
            "ret": np.average(x["ret_excess"], weights=x["mktcap_lag"])
        }))
        .reset_index()
    )
    
    factors_cma = (portfolios_inv
        .groupby("date")
        .apply(lambda x: pd.Series({
            "cma": (x.loc[x["portfolio_inv"] == 1, "ret"].mean() -
                   x.loc[x["portfolio_inv"] == 3, "ret"].mean())
        }))
        .reset_index()
    )
    
    # SMB: Size factor (average across all characteristic portfolios)
    all_portfolios = pd.concat([portfolios_bm, portfolios_op, portfolios_inv])
    
    factors_smb = (all_portfolios
        .groupby("date")
        .apply(lambda x: pd.Series({
            "smb": (x.loc[x["portfolio_size"] == 1, "ret"].mean() -
                   x.loc[x["portfolio_size"] == 2, "ret"].mean())
        }))
        .reset_index()
    )
    
    # Combine all factors
    factors = (factors_smb
        .merge(factors_hml, on="date", how="outer")
        .merge(factors_rmw, on="date", how="outer")
        .merge(factors_cma, on="date", how="outer")
    )
    
    return factors

factors_ff5 = construct_ff5_factors(portfolios_monthly_ff5)

# Add market factor
factors_ff5_monthly = (factors_ff5
    .merge(market_factor[["date", "mkt_excess"]], on="date", how="inner")
    .merge(rf_monthly, on="date", how="left")
)

print("Fama-French Five Factors (Monthly):")
print(factors_ff5_monthly.head(10))

print("\nFactor Summary Statistics:")
print(factors_ff5_monthly[["mkt_excess", "smb", "hml", "rmw", "cma"]].describe().round(4))
Fama-French Five Factors (Monthly):
        date       smb       hml       rmw       cma  mkt_excess  risk_free
0 2011-07-31 -0.015907 -0.002812  0.060525  0.045291   -0.078748   0.003333
1 2011-08-31 -0.061842  0.006189 -0.022700 -0.023177    0.029906   0.003333
2 2011-09-30  0.014387  0.024301 -0.006005  0.003588   -0.002173   0.003333
3 2011-10-31 -0.006958 -0.006940  0.026694  0.003649   -0.014005   0.003333
4 2011-11-30  0.074369  0.015617 -0.058766  0.044214   -0.179410   0.003333
5 2011-12-31  0.006687  0.022494  0.062655  0.052444   -0.094802   0.003333
6 2012-01-31 -0.016254  0.010513 -0.042191 -0.067170    0.081273   0.003333
7 2012-02-29 -0.026606  0.024465 -0.030849 -0.036383    0.069655   0.003333
8 2012-03-31  0.005096  0.050930 -0.018441  0.043488    0.029005   0.003333
9 2012-04-30  0.000712  0.058214 -0.061434  0.009233    0.048791   0.003333

Factor Summary Statistics:
       mkt_excess       smb       hml       rmw       cma
count    150.0000  150.0000  150.0000  150.0000  150.0000
mean      -0.0101    0.0077    0.0115   -0.0047    0.0083
std        0.0586    0.0419    0.0518    0.0477    0.0335
min       -0.2149   -0.1522   -0.1283   -0.2126   -0.0814
25%       -0.0380   -0.0137   -0.0126   -0.0308   -0.0131
50%       -0.0095    0.0104    0.0046    0.0010    0.0067
75%        0.0214    0.0316    0.0323    0.0178    0.0289
max        0.1677    0.1284    0.1510    0.1297    0.1331
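
The groupby-plus-apply weighted averages used above are convenient but slow on large panels, because every group invokes a Python-level lambda. A vectorized equivalent (a sketch on toy data, not the chapter's dataset) computes sum(weight × return) / sum(weight) per group directly:

```python
import pandas as pd

# Value-weighted mean per portfolio without a per-group Python lambda:
# precompute weight * return, then take the ratio of grouped sums.
df = pd.DataFrame({
    "portfolio": [1, 1, 2, 2],
    "ret_excess": [0.02, -0.01, 0.03, 0.01],
    "mktcap_lag": [100.0, 300.0, 200.0, 200.0],
})
df["wret"] = df["ret_excess"] * df["mktcap_lag"]
grouped = df.groupby("portfolio")
vw_ret = grouped["wret"].sum() / grouped["mktcap_lag"].sum()
print(vw_ret)
```

This matches np.average(..., weights=...) up to floating-point error and scales far better on millions of daily rows.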

17.6.5 Factor Correlations

We examine correlations between the factors. Low correlations indicate that the factors capture distinct sources of risk; in our sample, the sizable negative correlations between the market and SMB and between HML and RMW point to overlapping exposures, which is worth keeping in mind when interpreting multi-factor regressions.

print("Factor Correlation Matrix:")
correlation_matrix = factors_ff5_monthly[["mkt_excess", "smb", "hml", "rmw", "cma"]].corr()
print(correlation_matrix.round(3))
Factor Correlation Matrix:
            mkt_excess    smb    hml    rmw    cma
mkt_excess       1.000 -0.712  0.230 -0.006 -0.104
smb             -0.712  1.000  0.256 -0.373  0.246
hml              0.230  0.256  1.000 -0.694  0.479
rmw             -0.006 -0.373 -0.694  1.000 -0.352
cma             -0.104  0.246  0.479 -0.352  1.000

17.6.6 Saving Five-Factor Data

factors_ff5_monthly.to_sql(
    name="factors_ff5_monthly",
    con=tidy_finance,
    if_exists="replace",
    index=False
)

print("Five-factor monthly data saved to database.")
Five-factor monthly data saved to database.

17.7 Daily Fama-French Factors

17.7.1 Motivation for Daily Factors

Daily factors are essential for several applications:

  1. Daily beta estimation: CAPM regressions using daily data require daily market excess returns.
  2. Event studies: Measuring abnormal returns around corporate events requires daily factor adjustments.
  3. High-frequency research: Market microstructure studies need daily or intraday factor data.

The construction methodology mirrors the monthly approach, but we compute portfolio returns at daily frequency while maintaining the same annual portfolio formation dates.
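
The mapping from a daily observation date to its July formation date can be sketched in vectorized form (toy dates for illustration; the merge code later in this section uses an equivalent per-row apply):

```python
import pandas as pd

# July-June holding year: dates in Jul-Dec map to July 1 of the same year,
# dates in Jan-Jun map to July 1 of the previous year.
dates = pd.Series(pd.to_datetime(["2023-03-15", "2023-07-03", "2022-12-30"]))
years = dates.dt.year.where(dates.dt.month > 6, dates.dt.year - 1)
sorting_dates = pd.to_datetime(years.astype(str) + "-07-01")
print(sorting_dates.dt.strftime("%Y-%m-%d").tolist())
```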

17.7.2 Loading Daily Returns

# Load daily price data
prices_daily = pd.read_sql_query(
    sql="""
        SELECT symbol, date, ret_excess
        FROM prices_daily
    """,
    con=tidy_finance,
    parse_dates=["date"]
)

print(f"Daily returns: {len(prices_daily):,} observations")
print(f"Date range: {prices_daily['date'].min():%Y-%m-%d} to {prices_daily['date'].max():%Y-%m-%d}")
Daily returns: 3,462,157 observations
Date range: 2010-01-05 to 2023-12-29

17.7.3 Adding Market Cap for Daily Weighting

For value-weighted daily returns, we need market capitalization. We use the lagged monthly market capitalization (mktcap_lag, the prior month-end value) as the weight for all daily returns within a month, which keeps the weights free of look-ahead bias.

# Get monthly market cap to use as weights for daily returns
mktcap_monthly = (prices_monthly
    [["symbol", "date", "mktcap_lag"]]
    .assign(year_month=lambda x: x["date"].dt.to_period("M"))
)

# Add year_month to daily data for merging
prices_daily = (prices_daily
    .assign(year_month=lambda x: x["date"].dt.to_period("M"))
    .merge(
        mktcap_monthly[["symbol", "year_month", "mktcap_lag"]],
        on=["symbol", "year_month"],
        how="left"
    )
    .dropna(subset=["ret_excess", "mktcap_lag"])
)

print(f"Daily returns with weights: {len(prices_daily):,} observations")
Daily returns with weights: 3,443,815 observations

17.7.4 Merging Daily Returns with Portfolios

We use the same portfolio assignments (formed annually in July) for daily returns.

# Step 1: Ensure portfolios_ff3 has correct format
portfolios_ff3_clean = portfolios_ff3[["symbol", "sorting_date", "portfolio_size", "portfolio_bm"]].copy()
portfolios_ff3_clean["sorting_date"] = pd.to_datetime(portfolios_ff3_clean["sorting_date"])

print("Portfolio sorting dates:")
print(portfolios_ff3_clean["sorting_date"].unique()[:5])

# Step 2: Create sorting_date for daily data
prices_daily_with_sort = prices_daily.copy()
prices_daily_with_sort["sorting_date"] = prices_daily_with_sort["date"].apply(
    lambda x: pd.Timestamp(f"{x.year}-07-01") if x.month > 6 else pd.Timestamp(f"{x.year - 1}-07-01")
)

print("\nDaily sorting dates:")
print(prices_daily_with_sort["sorting_date"].unique()[:5])

# Step 3: Merge
portfolios_daily_ff3 = prices_daily_with_sort.merge(
    portfolios_ff3_clean,
    on=["symbol", "sorting_date"],
    how="inner"
)

print(f"\nMerged daily observations: {len(portfolios_daily_ff3):,}")
print(f"Unique dates: {portfolios_daily_ff3['date'].nunique():,}")

# Step 4: Verify portfolio distribution
print("\nPortfolio distribution in daily data:")
print(portfolios_daily_ff3.groupby(["portfolio_size", "portfolio_bm"], observed=True).size().unstack(fill_value=0))
Portfolio sorting dates:
<DatetimeArray>
['2019-07-01 00:00:00', '2020-07-01 00:00:00', '2021-07-01 00:00:00',
 '2022-07-01 00:00:00', '2023-07-01 00:00:00']
Length: 5, dtype: datetime64[us]

Daily sorting dates:
<DatetimeArray>
['2018-07-01 00:00:00', '2019-07-01 00:00:00', '2020-07-01 00:00:00',
 '2021-07-01 00:00:00', '2022-07-01 00:00:00']
Length: 5, dtype: datetime64[us]

Merged daily observations: 2,843,570
Unique dates: 3,126

Portfolio distribution in daily data:
portfolio_bm         1       2       3
portfolio_size                        
1               218040  585561  617327
2               636152  552114  234376
# Diagnostic: Check the daily portfolio merge
print("="*50)
print("DIAGNOSTIC: Daily Portfolio Merge")
print("="*50)

print(f"\nDaily prices rows: {len(prices_daily):,}")
print(f"Daily FF3 portfolios rows: {len(portfolios_daily_ff3):,}")
print(f"Match rate: {len(portfolios_daily_ff3)/len(prices_daily)*100:.1f}%")

# Check portfolio distribution in daily data
print("\nDaily portfolio distribution:")
print(portfolios_daily_ff3.groupby(["portfolio_size", "portfolio_bm"], observed=True).size().unstack(fill_value=0))

# Check a specific date
test_date = portfolios_daily_ff3["date"].iloc[1000]
print(f"\nSample date: {test_date}")
print(portfolios_daily_ff3.query("date == @test_date").groupby(["portfolio_size", "portfolio_bm"], observed=True).size().unstack(fill_value=0))
==================================================
DIAGNOSTIC: Daily Portfolio Merge
==================================================

Daily prices rows: 3,443,815
Daily FF3 portfolios rows: 2,843,570
Match rate: 82.6%

Daily portfolio distribution:
portfolio_bm         1       2       3
portfolio_size                        
1               218040  585561  617327
2               636152  552114  234376

Sample date: 2023-06-28 00:00:00
portfolio_bm      1    2    3
portfolio_size               
1                93  232  312
2               291  280   67

17.7.5 Computing Daily Three Factors

def compute_daily_ff3_factors(portfolios_daily):
    """
    Compute daily Fama-French three factors.
    
    Parameters
    ----------
    portfolios_daily : pd.DataFrame
        Daily returns with portfolio assignments
        
    Returns
    -------
    pd.DataFrame
        Daily SMB and HML factors
    """
    # Compute daily portfolio returns
    portfolio_returns = (portfolios_daily
        .groupby(["portfolio_size", "portfolio_bm", "date"], observed=True)
        .apply(lambda x: pd.Series({
            "ret": np.average(x["ret_excess"], weights=x["mktcap_lag"])
        }))
        .reset_index()
    )
    
    # Compute factors
    factors = (portfolio_returns
        .groupby("date")
        .apply(lambda x: pd.Series({
            "smb": (x.loc[x["portfolio_size"] == 1, "ret"].mean() -
                   x.loc[x["portfolio_size"] == 2, "ret"].mean()),
            "hml": (x.loc[x["portfolio_bm"] == 3, "ret"].mean() -
                   x.loc[x["portfolio_bm"] == 1, "ret"].mean())
        }))
        .reset_index()
    )
    
    return factors

factors_daily_smb_hml = compute_daily_ff3_factors(portfolios_daily_ff3)

print(f"Daily factor observations: {len(factors_daily_smb_hml):,}")
print(factors_daily_smb_hml.head(10))
Daily factor observations: 3,126
        date       smb       hml
0 2011-07-01  0.008587  0.000967
1 2011-07-04  0.005099 -0.001099
2 2011-07-05 -0.009088  0.010152
3 2011-07-06  0.004875 -0.003918
4 2011-07-07 -0.011239 -0.000584
5 2011-07-08  0.005636 -0.008003
6 2011-07-11  0.003940  0.006172
7 2011-07-12  0.003205  0.006543
8 2011-07-13 -0.000097 -0.001134
9 2011-07-14 -0.001248  0.001669

17.7.6 Computing Daily Market Factor

def compute_daily_market_factor(prices_daily):
    """
    Compute daily value-weighted market excess return.
    
    Parameters
    ----------
    prices_daily : pd.DataFrame
        Daily returns with mktcap_lag
        
    Returns
    -------
    pd.DataFrame
        Daily market excess return
    """
    market_daily = (prices_daily
        .groupby("date")
        .apply(lambda x: pd.Series({
            "mkt_excess": np.average(x["ret_excess"], weights=x["mktcap_lag"])
        }))
        .reset_index()
    )
    
    return market_daily

market_factor_daily = compute_daily_market_factor(prices_daily)

print(f"Daily market factor: {len(market_factor_daily):,} days")
Daily market factor: 3,474 days

17.7.7 Combining Daily Three Factors

factors_ff3_daily = (factors_daily_smb_hml
    .merge(market_factor_daily, on="date", how="inner")
)

# Add risk-free rate (approximate as the annual rate divided by 252 trading
# days; an actual daily risk-free series could be used instead if available)
factors_ff3_daily["risk_free"] = 0.04 / 252

print("Daily Fama-French Three Factors:")
print(factors_ff3_daily.head(10))

print("\nDaily Factor Summary Statistics:")
print(factors_ff3_daily[["mkt_excess", "smb", "hml"]].describe().round(6))
Daily Fama-French Three Factors:
        date       smb       hml  mkt_excess  risk_free
0 2011-07-01  0.008587  0.000967   -0.019862   0.000159
1 2011-07-04  0.005099 -0.001099   -0.000633   0.000159
2 2011-07-05 -0.009088  0.010152    0.013314   0.000159
3 2011-07-06  0.004875 -0.003918   -0.008045   0.000159
4 2011-07-07 -0.011239 -0.000584    0.003391   0.000159
5 2011-07-08  0.005636 -0.008003    0.000218   0.000159
6 2011-07-11  0.003940  0.006172   -0.013393   0.000159
7 2011-07-12  0.003205  0.006543   -0.017505   0.000159
8 2011-07-13 -0.000097 -0.001134    0.000767   0.000159
9 2011-07-14 -0.001248  0.001669   -0.000695   0.000159

Daily Factor Summary Statistics:
        mkt_excess          smb          hml
count  3126.000000  3126.000000  3126.000000
mean     -0.000479     0.000236     0.000594
std       0.011269     0.008488     0.008585
min      -0.070268    -0.032671    -0.039418
25%      -0.005074    -0.004882    -0.003941
50%       0.000350    -0.000106     0.000522
75%       0.005531     0.004307     0.005233
max       0.043386     0.042686     0.083889
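
The flat 0.04 / 252 used above splits the annual rate evenly across trading days; a geometric conversion that accounts for compounding gives a slightly smaller figure (a sketch, assuming the same 4% annual rate):

```python
# Convert a 4% annual risk-free rate to a daily rate over 252 trading days.
annual_rate = 0.04
simple_daily = annual_rate / 252                      # even split
geometric_daily = (1 + annual_rate) ** (1 / 252) - 1  # compounded split
print(f"simple: {simple_daily:.6f}, geometric: {geometric_daily:.6f}")
```

At these magnitudes the two differ by a small fraction of a basis point per day, so the simple approximation is harmless for most applications.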

17.7.8 Computing Daily Five Factors

# Step 1: Clean portfolios
portfolios_ff5_clean = portfolios_ff5[["symbol", "sorting_date", "portfolio_size", 
                                        "portfolio_bm", "portfolio_op", "portfolio_inv"]].copy()
portfolios_ff5_clean["sorting_date"] = pd.to_datetime(portfolios_ff5_clean["sorting_date"])

# Step 2: Merge with daily prices
portfolios_daily_ff5 = prices_daily_with_sort.merge(
    portfolios_ff5_clean,
    on=["symbol", "sorting_date"],
    how="inner"
)

print(f"FF5 Daily merged observations: {len(portfolios_daily_ff5):,}")
FF5 Daily merged observations: 2,843,570
def compute_daily_ff5_factors(portfolios_daily):
    """Compute daily Fama-French five factors."""
    
    # HML from B/M sorts
    portfolios_bm = (portfolios_daily
        .groupby(["portfolio_size", "portfolio_bm", "date"], observed=True)
        .apply(lambda x: pd.Series({
            "ret": np.average(x["ret_excess"], weights=x["mktcap_lag"])
        }))
        .reset_index()
    )
    
    factors_hml = (portfolios_bm
        .groupby("date")
        .apply(lambda x: pd.Series({
            "hml": (x.loc[x["portfolio_bm"] == 3, "ret"].mean() -
                   x.loc[x["portfolio_bm"] == 1, "ret"].mean())
        }))
        .reset_index()
    )
    
    # RMW from OP sorts
    portfolios_op = (portfolios_daily
        .groupby(["portfolio_size", "portfolio_op", "date"], observed=True)
        .apply(lambda x: pd.Series({
            "ret": np.average(x["ret_excess"], weights=x["mktcap_lag"])
        }))
        .reset_index()
    )
    
    factors_rmw = (portfolios_op
        .groupby("date")
        .apply(lambda x: pd.Series({
            "rmw": (x.loc[x["portfolio_op"] == 3, "ret"].mean() -
                   x.loc[x["portfolio_op"] == 1, "ret"].mean())
        }))
        .reset_index()
    )
    
    # CMA from INV sorts (note: low minus high)
    portfolios_inv = (portfolios_daily
        .groupby(["portfolio_size", "portfolio_inv", "date"], observed=True)
        .apply(lambda x: pd.Series({
            "ret": np.average(x["ret_excess"], weights=x["mktcap_lag"])
        }))
        .reset_index()
    )
    
    factors_cma = (portfolios_inv
        .groupby("date")
        .apply(lambda x: pd.Series({
            "cma": (x.loc[x["portfolio_inv"] == 1, "ret"].mean() -
                   x.loc[x["portfolio_inv"] == 3, "ret"].mean())
        }))
        .reset_index()
    )
    
    # SMB from all sorts
    all_portfolios = pd.concat([portfolios_bm, portfolios_op, portfolios_inv])
    
    factors_smb = (all_portfolios
        .groupby("date")
        .apply(lambda x: pd.Series({
            "smb": (x.loc[x["portfolio_size"] == 1, "ret"].mean() -
                   x.loc[x["portfolio_size"] == 2, "ret"].mean())
        }))
        .reset_index()
    )
    
    # Combine
    factors = (factors_smb
        .merge(factors_hml, on="date", how="outer")
        .merge(factors_rmw, on="date", how="outer")
        .merge(factors_cma, on="date", how="outer")
    )
    
    return factors

# Compute daily FF5 factors
factors_daily_ff5 = compute_daily_ff5_factors(portfolios_daily_ff5)

# Add market factor
factors_ff5_daily = (factors_daily_ff5
    .merge(market_factor_daily, on="date", how="inner")
)
factors_ff5_daily["risk_free"] = 0.04 / 252

print("Daily Fama-French Five Factors:")
print(factors_ff5_daily.head(10))

print("\nDaily Five-Factor Summary Statistics:")
print(factors_ff5_daily[["mkt_excess", "smb", "hml", "rmw", "cma"]].describe().round(6))
Daily Fama-French Five Factors:
        date       smb       hml       rmw       cma  mkt_excess  risk_free
0 2011-07-01  0.006295  0.002515  0.013140  0.007680   -0.019862   0.000159
1 2011-07-04  0.002880 -0.002875  0.006560 -0.004886   -0.000633   0.000159
2 2011-07-05 -0.004260  0.009864 -0.012158 -0.004470    0.013314   0.000159
3 2011-07-06  0.001544 -0.009847  0.012977  0.006286   -0.008045   0.000159
4 2011-07-07 -0.009789 -0.003988 -0.000197 -0.006995    0.003391   0.000159
5 2011-07-08  0.001537 -0.006700  0.010841 -0.007661    0.000218   0.000159
6 2011-07-11  0.005396  0.004747  0.000655  0.013375   -0.013393   0.000159
7 2011-07-12  0.004759  0.007367  0.001989  0.014669   -0.017505   0.000159
8 2011-07-13 -0.000009  0.001110 -0.002052 -0.001633    0.000767   0.000159
9 2011-07-14 -0.001668  0.002916 -0.005427  0.005388   -0.000695   0.000159

Daily Five-Factor Summary Statistics:
        mkt_excess          smb          hml          rmw          cma
count  3126.000000  3126.000000  3126.000000  3126.000000  3126.000000
mean     -0.000479     0.000484     0.000549    -0.000136     0.000413
std       0.011269     0.008033     0.008312     0.008538     0.006756
min      -0.070268    -0.036283    -0.039155    -0.154013    -0.047698
25%      -0.005074    -0.004122    -0.003681    -0.004212    -0.003364
50%       0.000350     0.000105     0.000384     0.000030     0.000162
75%       0.005531     0.004358     0.004819     0.004036     0.003800
max       0.043386     0.060307     0.086269     0.102001     0.089907

17.7.9 Saving Daily Factors

factors_ff3_daily.to_sql(
    name="factors_ff3_daily",
    con=tidy_finance,
    if_exists="replace",
    index=False
)

factors_ff5_daily.to_sql(
    name="factors_ff5_daily",
    con=tidy_finance,
    if_exists="replace",
    index=False
)

print("Daily factor data saved to database.")
Daily factor data saved to database.

17.8 Factor Validation and Diagnostics

# Verify all tables are in database
print("\n" + "="*50)
print("DATABASE SUMMARY")
print("="*50)

tables = ["factors_ff3_monthly", "factors_ff5_monthly", 
          "factors_ff3_daily", "factors_ff5_daily"]

for table in tables:
    df = pd.read_sql_query(f"SELECT COUNT(*) as n FROM {table}", con=tidy_finance)
    print(f"{table}: {df['n'].iloc[0]:,} observations")

# Correlation check: Monthly vs Daily (aggregated)
print("\n" + "="*50)
print("MONTHLY VS DAILY CONSISTENCY CHECK")
print("="*50)

factors_daily_agg = (factors_ff3_daily
    .assign(year_month=lambda x: x["date"].dt.to_period("M"))
    .groupby("year_month")[["mkt_excess", "smb", "hml"]]
    .sum()
    .reset_index()
)

factors_monthly_check = (factors_ff3_monthly
    .assign(year_month=lambda x: x["date"].dt.to_period("M"))
)

comparison = factors_monthly_check.merge(
    factors_daily_agg, on="year_month", suffixes=("_monthly", "_daily")
)

for factor in ["mkt_excess", "smb", "hml"]:
    corr = comparison[f"{factor}_monthly"].corr(comparison[f"{factor}_daily"])
    print(f"{factor}: Monthly-Daily correlation = {corr:.4f}")

==================================================
DATABASE SUMMARY
==================================================
factors_ff3_monthly: 150 observations
factors_ff5_monthly: 150 observations
factors_ff3_daily: 3,126 observations
factors_ff5_daily: 3,126 observations

==================================================
MONTHLY VS DAILY CONSISTENCY CHECK
==================================================
mkt_excess: Monthly-Daily correlation = 0.9980
smb: Monthly-Daily correlation = 0.9953
hml: Monthly-Daily correlation = 0.9936

17.8.1 Cumulative Factor Returns

We visualize the cumulative performance of each factor to assess whether the factors generate meaningful premiums over time.

# Compute cumulative returns
factors_cumulative = (factors_ff5_monthly
    .set_index("date")
    [["mkt_excess", "smb", "hml", "rmw", "cma"]]
    .add(1)
    .cumprod()
)

# Plot
fig, ax = plt.subplots(figsize=(12, 6))
factors_cumulative.plot(ax=ax)
ax.set_title("Cumulative Factor Returns (Vietnam)")
ax.set_xlabel("")
ax.set_ylabel("Growth of $1")
ax.legend(title="Factor")
ax.axhline(y=1, color='gray', linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()
Figure 17.1: Cumulative returns of Fama-French factors for the Vietnamese market. The figure shows the growth of $1 invested in each factor portfolio.

17.8.2 Average Factor Premiums

We compute annualized average factor premiums and their statistical significance.

# Annualized average returns (monthly returns * 12)
factor_premiums = (factors_ff5_monthly
    [["mkt_excess", "smb", "hml", "rmw", "cma"]]
    .mean() * 12 * 100  # Annualized percentage
)

# Standard errors of the annualized mean (monthly SE * 12, matching the
# annualization of the mean so the t-statistic equals the monthly one)
factor_se = (factors_ff5_monthly
    [["mkt_excess", "smb", "hml", "rmw", "cma"]]
    .std() / np.sqrt(len(factors_ff5_monthly)) * 12 * 100
)

# T-statistics
factor_tstat = factor_premiums / factor_se

print("Annualized Factor Premiums (%):")
print(factor_premiums.round(2))

print("\nT-Statistics:")
print(factor_tstat.round(2))
Annualized Factor Premiums (%):
mkt_excess   -12.09
smb            9.19
hml           13.75
rmw           -5.69
cma            9.94
dtype: float64

T-Statistics:
mkt_excess   -2.11
smb           2.24
hml           2.71
rmw          -1.22
cma           3.03
dtype: float64

17.8.3 Comparing Monthly and Daily Factors

We verify consistency between monthly and daily factors by computing correlations.

# Aggregate daily factors to monthly for comparison
factors_daily_monthly = (factors_ff5_daily
    .assign(year_month=lambda x: x["date"].dt.to_period("M"))
    .groupby("year_month")
    [["mkt_excess", "smb", "hml", "rmw", "cma"]]
    .sum()  # Sum daily returns to approximate the monthly return (exact only for log returns)
    .reset_index()
)

# Merge with actual monthly factors
comparison = (factors_ff5_monthly
    .assign(year_month=lambda x: x["date"].dt.to_period("M"))
    .merge(
        factors_daily_monthly,
        on="year_month",
        suffixes=("_monthly", "_daily")
    )
)

# Correlations
for factor in ["mkt_excess", "smb", "hml", "rmw", "cma"]:
    corr = comparison[f"{factor}_monthly"].corr(comparison[f"{factor}_daily"])
    print(f"{factor}: Monthly-Daily correlation = {corr:.4f}")
mkt_excess: Monthly-Daily correlation = 0.9980
smb: Monthly-Daily correlation = 0.9950
hml: Monthly-Daily correlation = 0.9948
rmw: Monthly-Daily correlation = 0.9929
cma: Monthly-Daily correlation = 0.9884
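
The correlations sit just below one partly because summing daily returns only approximates monthly compounding (the sum is exact for log returns, not simple returns). A small sketch of the gap:

```python
import pandas as pd

# Monthly return from daily returns: compounded product vs simple sum.
daily = pd.Series([0.01, -0.005, 0.02, 0.003])
compounded = (1 + daily).prod() - 1
summed = daily.sum()
print(f"compounded: {compounded:.6f}, summed: {summed:.6f}")
```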

17.9 Key Takeaways

  1. Factor Models Explained: The Fama-French three-factor model adds size (SMB) and value (HML) factors to the CAPM, while the five-factor model further includes profitability (RMW) and investment (CMA) factors.

  2. Construction Methodology: Factors are constructed through double-sorted portfolios with careful attention to timing. Portfolios are formed in July using accounting data from the prior fiscal year to ensure information was publicly available.

  3. Independent vs. Dependent Sorts: The three-factor model uses independent sorts on size and book-to-market, creating a 2×3 grid. The five-factor model uses dependent sorts where characteristics are sorted within size groups.

  4. Value-Weighted Returns: Portfolio returns are computed using value-weighting with lagged market capitalization to avoid look-ahead bias.

  5. Daily Factors: Daily factors use the same annual portfolio assignments but compute returns at daily frequency, enabling higher-frequency applications like daily beta estimation.

  6. Market Factor: The market factor is computed independently as the value-weighted return of all stocks minus the risk-free rate.

  7. Validation: Factor quality can be assessed through characteristic monotonicity, portfolio diversification, cumulative returns, and consistency between daily and monthly frequencies.

  8. Vietnamese Market Adaptation: While following the original Fama-French methodology, we adapt for Vietnamese market characteristics including VAS accounting standards, reporting timelines, and currency units.