import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from scipy import stats
from itertools import product
import warnings
warnings.filterwarnings('ignore')
plt.rcParams.update({
'figure.figsize': (12, 6),
'figure.dpi': 150,
'font.size': 11,
'axes.spines.top': False,
'axes.spines.right': False
})
19 Factor Construction Principles
In this chapter, we develop a general-purpose factor construction engine for the Vietnamese equity market. We cover every methodological decision in the pipeline, including universe definition, breakpoint computation, portfolio formation, weighting, rebalancing, and factor return calculation, and demonstrate how each choice affects the resulting factor.
The previous chapters introduced specific asset pricing models, including the CAPM, the Fama-French three-factor model, and momentum. Each of those chapters presented its factor as given. This chapter steps behind the curtain and addresses the engineering question: how exactly do you build a factor? The question matters because seemingly minor methodological decisions (where to set breakpoints, whether to value-weight or equal-weight, how to handle missing accounting data, which stocks to exclude) can alter the magnitude, statistical significance, and even the sign of a factor premium.
In the U.S. context, Fama and French (1993) established a canonical procedure: sort stocks independently on size and a characteristic, form six value-weighted portfolios from 2×3 intersections, and define the factor as the average return of the two high-characteristic portfolios minus the average return of the two low-characteristic portfolios. This procedure has been replicated thousands of times. But it was designed for the U.S. market circa 1990, with its deep liquidity, broad cross-section, and CRSP/Compustat data infrastructure. Applying it mechanically to Vietnam, a market with 700 listed stocks, extreme illiquidity in the bottom tercile, high concentration in the top decile, and accounting data that arrives with variable lags, requires careful adaptation.
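The arithmetic of the 2×3 construction is worth making explicit: averaging the value spread within each size group neutralizes the size tilt, which a raw univariate high-minus-low sort would not. A toy example with hypothetical portfolio returns:

```python
# Hypothetical monthly returns for the six 2x3 portfolios
# (S = small, B = big; L/M/H = low/medium/high book-to-market)
six = {
    'SL': 0.010, 'SM': 0.014, 'SH': 0.021,
    'BL': 0.008, 'BM': 0.011, 'BH': 0.015,
}

# HML: average of the two value portfolios minus the two growth portfolios
hml = (six['SH'] + six['BH']) / 2 - (six['SL'] + six['BL']) / 2

# SMB: average of the three small portfolios minus the three big ones
smb = (six['SL'] + six['SM'] + six['SH']) / 3 \
    - (six['BL'] + six['BM'] + six['BH']) / 3

print(f"HML = {hml:.4f}, SMB = {smb:.4f}")
```

Because the long and short legs each contain one small and one big portfolio, HML is (approximately) size-neutral by construction, and SMB is value-neutral for the analogous reason.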
Hou, Xue, and Zhang (2020) replicated 452 anomalies from the U.S. literature and found that over half fail to replicate even in U.S. data with minor methodological variations. The replication crisis in empirical asset pricing makes it essential that researchers understand and document every construction choice. This chapter provides the tools to do so transparently.
19.1 The Factor Construction Pipeline
Every tradeable factor follows the same logical pipeline:
- Define the universe: Select the eligible securities (e.g., common stocks with liquidity and size filters).
- Compute the signal: Calculate the characteristic of interest (e.g., value, momentum, profitability).
- Set breakpoints: Determine how stocks will be sorted (e.g., median, quintiles, deciles).
- Assign portfolios: Group stocks into high and low (or multiple) portfolios based on the signal.
- Compute returns: Calculate portfolio returns (equal- or value-weighted).
- Construct the factor: Take Long (high) - Short (low).
- Validate: Test performance, significance, and robustness.
Each step involves choices that interact with each other. A breakpoint that works well for a liquid universe may be inappropriate for the full cross-section. A weighting scheme that reduces noise in the U.S. may amplify it in Vietnam. We address each step systematically.
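Before building the full engine, the pipeline can be previewed end to end on a toy one-month cross-section. The data below are synthetic and the tercile sort and value weighting are deliberately minimal; the full implementation follows in the rest of the chapter:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Steps 1-2: a toy universe with a sorting signal and one month of returns
toy = pd.DataFrame({
    'mcap': rng.lognormal(mean=6, sigma=1.5, size=n),
    'signal': rng.normal(size=n),
})
toy['ret'] = 0.01 * toy['signal'] + rng.normal(scale=0.05, size=n)

# Step 3: tercile breakpoints from the signal distribution
bps = toy['signal'].quantile([1 / 3, 2 / 3]).values

# Step 4: assign each stock to a portfolio (0 = low, 2 = high)
toy['group'] = np.searchsorted(bps, toy['signal'].values)

# Step 5: value-weighted return of each portfolio
vw = toy.groupby('group').apply(
    lambda g: np.average(g['ret'], weights=g['mcap'])
)

# Step 6: the factor is the high-minus-low spread
factor = vw.loc[2] - vw.loc[0]
print(f"toy factor return: {factor:.4f}")
```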
19.2 Data Construction
from datacore import DataCoreClient
client = DataCoreClient()
# Monthly returns (survivorship-bias-free)
monthly = client.get_monthly_returns(
exchanges=['HOSE', 'HNX'],
start_date='2008-01-01',
end_date='2024-12-31',
include_delisted=True,
fields=[
'ticker', 'month_end', 'monthly_return', 'market_cap',
'shares_outstanding', 'volume_avg_20d', 'turnover_value_avg_20d',
'n_zero_volume_days', 'exchange'
]
)
# Annual accounting data
accounting = client.get_fundamentals(
exchanges=['HOSE', 'HNX'],
start_date='2006-01-01',
end_date='2024-12-31',
include_delisted=True,
frequency='annual',
fields=[
'ticker', 'fiscal_year', 'filing_date',
'total_assets', 'total_equity', 'book_equity',
'net_income', 'revenue', 'gross_profit',
'operating_profit', 'total_debt', 'retained_earnings',
'dividends_paid', 'capex', 'depreciation',
'shares_outstanding_fy'
]
)
# Daily prices for momentum and volatility signals
daily = client.get_daily_prices(
exchanges=['HOSE', 'HNX'],
start_date='2008-01-01',
end_date='2024-12-31',
include_delisted=True,
fields=['ticker', 'date', 'adjusted_close', 'volume', 'turnover_value']
)
monthly['month_end'] = pd.to_datetime(monthly['month_end'])
monthly = monthly.sort_values(['ticker', 'month_end'])
print(f"Monthly returns: {len(monthly):,} firm-months")
print(f"Accounting: {len(accounting):,} firm-years")
print(f"Unique tickers: {monthly['ticker'].nunique()}")
19.3 Step 1: Universe Definition
The first and most consequential choice is which stocks enter the factor construction universe. The universe definition determines what population the factor premium describes and whether it is implementable.
19.3.1 The Universe Problem in Vietnam
Vietnam presents a specific challenge: the cross-section is small (600-800 stocks on HOSE and HNX combined), and the size distribution is extremely skewed. The top 10 stocks by market capitalization account for roughly 50% of the total market cap on HOSE. The bottom tercile consists of micro-cap stocks that often trade fewer than 5 days per month. Including these stocks inflates apparent factor premia because their prices are noisy and stale, but excluding them shrinks the already small cross-section.
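The concentration claim is straightforward to verify from the monthly panel. A sketch of the computation on a synthetic cross-section (the `market_cap` and `month_end` field names mirror the monthly panel above; the numbers themselves are illustrative only):

```python
import numpy as np
import pandas as pd

# Synthetic one-month cross-section standing in for the monthly panel
rng = np.random.default_rng(42)
snap = pd.DataFrame({
    'ticker': [f'VN{i:03d}' for i in range(300)],
    'month_end': pd.Timestamp('2023-06-30'),
    'market_cap': rng.lognormal(mean=6, sigma=2.0, size=300),
})

def top_n_share(df, n=10, mcap_col='market_cap'):
    """Per-month share of total market cap held by the n largest stocks."""
    return df.groupby('month_end')[mcap_col].apply(
        lambda g: g.nlargest(n).sum() / g.sum()
    )

conc = top_n_share(snap)
print(f"Top-10 market-cap share: {conc.iloc[0]:.1%}")
```

Run on the actual panel, the same function gives a monthly time series of concentration, which is useful context for interpreting value-weighted factor returns.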
def apply_universe_filters(df, filters='standard'):
"""
Apply universe filters to the monthly return panel.
Parameters
----------
filters : str
'none': all stocks
'minimal': exclude zero market cap and extreme returns
'standard': + minimum listing age + positive volume
'strict': + minimum market cap + minimum turnover
Returns
-------
Filtered DataFrame with 'in_universe' column
"""
d = df.copy()
d['in_universe'] = True
# Always: remove missing returns and market cap
d.loc[d['monthly_return'].isna(), 'in_universe'] = False
d.loc[d['market_cap'].isna() | (d['market_cap'] <= 0), 'in_universe'] = False
# Minimal: exclude returns above 100% in absolute value (likely data errors)
if filters in ['minimal', 'standard', 'strict']:
d.loc[d['monthly_return'].abs() > 1.0, 'in_universe'] = False
# Standard: listing age >= 6 months
if filters in ['standard', 'strict']:
d['listing_age'] = (
d.groupby('ticker').cumcount() + 1
)
d.loc[d['listing_age'] < 6, 'in_universe'] = False
# Require active trading: at most 12 zero-volume days in the month
d.loc[d['n_zero_volume_days'] > 12, 'in_universe'] = False
# Strict: minimum market cap (20th percentile of HOSE,
# applied to all stocks, analogous to NYSE size screens)
if filters == 'strict':
hose_p20 = (
d[d['exchange'] == 'HOSE']
.groupby('month_end')['market_cap']
.quantile(0.20)
)
d['mcap_threshold'] = d['month_end'].map(hose_p20)
d.loc[d['market_cap'] < d['mcap_threshold'], 'in_universe'] = False
# Minimum average daily turnover (VND 200 million)
d.loc[d['turnover_value_avg_20d'] < 2e8, 'in_universe'] = False
return d
# Apply all filter levels and compare
filter_summary = {}
for level in ['none', 'minimal', 'standard', 'strict']:
filtered = apply_universe_filters(monthly, filters=level)
in_univ = filtered[filtered['in_universe']]
filter_summary[level] = {
'Firm-months': len(in_univ),
'Avg stocks/month': in_univ.groupby('month_end')['ticker'].nunique().mean(),
'Avg MCap coverage (%)': (
in_univ.groupby('month_end')['market_cap'].sum()
/ filtered.groupby('month_end')['market_cap'].sum()
).mean() * 100
}
filter_df = pd.DataFrame(filter_summary).T
print("Universe Filter Effects:")
print(filter_df.round(1).to_string())
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
colors_filter = {
'none': '#BDC3C7', 'minimal': '#3498DB',
'standard': '#2C5F8A', 'strict': '#C0392B'
}
for level in ['none', 'minimal', 'standard', 'strict']:
filtered = apply_universe_filters(monthly, filters=level)
counts = (
filtered[filtered['in_universe']]
.groupby('month_end')['ticker']
.nunique()
)
axes[0].plot(counts.index, counts.values,
color=colors_filter[level], linewidth=1.5, label=level)
axes[0].set_ylabel('Number of Stocks')
axes[0].set_title('Panel A: Universe Size')
axes[0].legend()
# Panel B: Market cap coverage
for level in ['minimal', 'standard', 'strict']:
filtered = apply_universe_filters(monthly, filters=level)
total_mcap = filtered.groupby('month_end')['market_cap'].sum()
filtered_mcap = (
filtered[filtered['in_universe']]
.groupby('month_end')['market_cap']
.sum()
)
coverage = (filtered_mcap / total_mcap * 100).dropna()
axes[1].plot(coverage.index, coverage.values,
color=colors_filter[level], linewidth=1.5, label=level)
axes[1].set_ylabel('Market Cap Coverage (%)')
axes[1].set_title('Panel B: Market Capitalization Coverage')
axes[1].legend()
axes[1].set_ylim([60, 102])
plt.tight_layout()
plt.show()
19.4 Step 2: Signal Construction
19.4.1 Point-in-Time Accounting Data
As discussed in the missing data chapter, accounting signals must be aligned with their public availability date to avoid look-ahead bias. We implement a general-purpose point-in-time merge:
def pit_merge_accounting(monthly_df, accounting_df, lag_months=4):
"""
Merge accounting data with monthly returns respecting
the point-in-time availability constraint.
Vietnamese annual reports are due within 90 days of fiscal
year-end. We use a conservative 4-month lag.
Parameters
----------
lag_months : int
Number of months after fiscal year-end before data
are assumed to be publicly available.
"""
acc = accounting_df.copy()
# Accounting data becomes available lag_months after FY end
# If filing_date is available, use it; otherwise use FY-end + lag
if 'filing_date' in acc.columns:
acc['filing_date'] = pd.to_datetime(acc['filing_date'])
acc['available_date'] = acc['filing_date']
# Fallback for missing filing dates
acc['fy_end'] = pd.to_datetime(
acc['fiscal_year'].astype(str) + '-12-31'
)
acc['available_date'] = acc['available_date'].fillna(
acc['fy_end'] + pd.DateOffset(months=lag_months)
)
else:
acc['available_date'] = pd.to_datetime(
acc['fiscal_year'].astype(str) + '-12-31'
) + pd.DateOffset(months=lag_months)
# For each firm-month, map the latest publicly available fiscal year.
# We use a calendar rule: FY t-1 data are assumed available from month
# (lag_months + 1) of year t onward. The per-firm available_date above
# could support a finer merge but is not needed for annual rebalancing.
merged = monthly_df.copy()
merged['year'] = merged['month_end'].dt.year
merged['month'] = merged['month_end'].dt.month
# Map: if month >= (lag_months + 1), use current year's FY-1 data
# Otherwise, use FY-2 data
merged['data_fy'] = np.where(
merged['month'] >= lag_months + 1,
merged['year'] - 1,
merged['year'] - 2
)
# Merge
acc_cols = [c for c in acc.columns if c not in
['filing_date', 'available_date', 'fy_end']]
merged = merged.merge(
acc[acc_cols].rename(columns={'fiscal_year': 'data_fy'}),
on=['ticker', 'data_fy'],
how='left'
)
return merged
# Apply point-in-time merge
panel = pit_merge_accounting(monthly, accounting, lag_months=4)
# Construct common signals
panel['log_mcap'] = np.log(panel['market_cap'].clip(lower=1))
# Book-to-market
panel['bm'] = panel['book_equity'] / panel['market_cap']
panel.loc[panel['bm'] <= 0, 'bm'] = np.nan # Negative BE firms
# Gross profitability (Novy-Marx 2013)
panel['gp_at'] = panel['gross_profit'] / panel['total_assets']
# Operating profitability (Fama-French 2015)
panel['op'] = panel['operating_profit'] / panel['book_equity']
# Investment (asset growth): year-over-year change in total assets,
# computed across fiscal years. A monthly pct_change on the merged
# panel would be wrong: total_assets changes only once a year there,
# so most monthly changes would be exactly zero.
fy = (panel[['ticker', 'data_fy', 'total_assets']]
.drop_duplicates().sort_values(['ticker', 'data_fy']))
fy['investment'] = fy.groupby('ticker')['total_assets'].pct_change()
panel = panel.merge(fy[['ticker', 'data_fy', 'investment']],
on=['ticker', 'data_fy'], how='left')
# Leverage
panel['leverage'] = panel['total_debt'] / panel['total_assets']
print("Signal Coverage:")
for sig in ['bm', 'gp_at', 'op', 'investment', 'leverage']:
pct = panel[sig].notna().mean()
print(f" {sig:<15}: {pct:.1%}")
19.4.2 Momentum and Volatility Signals
Price-based signals require return history, not accounting data, so they have different timing requirements.
# Past returns for momentum signals
panel = panel.sort_values(['ticker', 'month_end'])
# Momentum: cumulative return from month t-12 to t-2 (skip most recent month)
panel['ret_12_2'] = (
panel.groupby('ticker')['monthly_return']
.transform(lambda x: x.shift(2).rolling(11).apply(
lambda r: (1 + r).prod() - 1, raw=True))
)
# Short-term reversal: month t-1 return
panel['ret_1'] = panel.groupby('ticker')['monthly_return'].shift(1)
# Return volatility (rolling 60-day std of daily returns, annualized).
# Total volatility is used here as a simple proxy for idiosyncratic
# volatility; a stricter measure would use factor-model residuals.
daily['date'] = pd.to_datetime(daily['date'])
daily['daily_return'] = daily.groupby('ticker')['adjusted_close'].pct_change()
ivol = (
daily.groupby('ticker')
.apply(lambda g: g.set_index('date')['daily_return']
.rolling(60, min_periods=40).std() * np.sqrt(252))
.reset_index(name='ivol')
)
# Roll each date to calendar month end so the merge key matches the panel
ivol['month_end'] = ivol['date'] + pd.offsets.MonthEnd(0)
ivol_monthly = (
ivol.groupby(['ticker', 'month_end'])['ivol']
.last()
.reset_index()
)
panel = panel.merge(ivol_monthly, on=['ticker', 'month_end'], how='left')
print("Price Signal Coverage:")
for sig in ['ret_12_2', 'ret_1', 'ivol']:
pct = panel[sig].notna().mean()
print(f" {sig:<15}: {pct:.1%}")
19.5 Step 3: Breakpoint Computation
19.5.1 The Breakpoint Decision
Breakpoints determine which stocks are “high” versus “low” on a given characteristic. The two key choices are:
- Breakpoint universe: Should breakpoints be computed from all stocks or from a subset (e.g., HOSE only)?
- Number of groups: 2×3 (Fama-French standard), 5×5 (for finer sorts), or independent terciles/quintiles?
Fama and French (1993) use NYSE breakpoints for U.S. sorts because this prevents the large number of small Nasdaq/AMEX stocks from dominating the breakpoint distribution. The analog in Vietnam is to use HOSE breakpoints, since HOSE lists the larger, more liquid firms and HNX lists smaller firms. Using all-stock breakpoints would place most HOSE stocks in the upper size groups and most HNX stocks in the lower groups, producing mechanically different results.
def compute_breakpoints(df, signal_col, n_groups, bp_universe='hose',
exchange_col='exchange'):
"""
Compute cross-sectional breakpoints for portfolio sorting.
Parameters
----------
bp_universe : str
'all': use all stocks in universe
'hose': use only HOSE stocks (analogous to NYSE breakpoints)
n_groups : int
Number of groups (2, 3, 5, or 10)
Returns
-------
Series of breakpoints (quantiles)
"""
if bp_universe == 'hose':
signal = df.loc[df[exchange_col] == 'HOSE', signal_col].dropna()
else:
signal = df[signal_col].dropna()
quantiles = np.linspace(0, 1, n_groups + 1)[1:-1]
breakpoints = signal.quantile(quantiles)
return breakpoints
# Example: compare HOSE vs all-stock breakpoints for book-to-market
example_month = panel[panel['month_end'] == '2023-06-30'].copy()
example_month = example_month[example_month['bm'].notna()]
bp_hose = compute_breakpoints(example_month, 'bm', 3, bp_universe='hose')
bp_all = compute_breakpoints(example_month, 'bm', 3, bp_universe='all')
print("BM Tercile Breakpoints (June 2023):")
print(f" HOSE-only: {bp_hose.values.round(3)}")
print(f" All stocks: {bp_all.values.round(3)}")
print(f"\n Difference: HOSE breakpoints are "
f"{'higher' if bp_hose.values[0] > bp_all.values[0] else 'lower'} "
f"than all-stock breakpoints")
# Compute breakpoints for every month
bp_comparison = []
for month, group in panel.dropna(subset=['bm']).groupby('month_end'):
bp_h = compute_breakpoints(group, 'bm', 3, bp_universe='hose')
bp_a = compute_breakpoints(group, 'bm', 3, bp_universe='all')
# Count stocks in each tercile under each rule
for bp_name, bp_vals in [('HOSE', bp_h), ('All', bp_a)]:
low = (group['bm'] <= bp_vals.iloc[0]).sum()
mid = ((group['bm'] > bp_vals.iloc[0]) &
(group['bm'] <= bp_vals.iloc[1])).sum()
high = (group['bm'] > bp_vals.iloc[1]).sum()
bp_comparison.append({
'month_end': month, 'bp_rule': bp_name,
'median_bp': bp_vals.iloc[0],
'n_low': low, 'n_mid': mid, 'n_high': high
})
bp_df = pd.DataFrame(bp_comparison)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Panel A: Median breakpoint over time
for rule, color in [('HOSE', '#2C5F8A'), ('All', '#C0392B')]:
subset = bp_df[bp_df['bp_rule'] == rule]
axes[0].plot(subset['month_end'], subset['median_bp'],
color=color, linewidth=1.5, label=f'{rule} breakpoints')
axes[0].set_ylabel('Lower Tercile Breakpoint (BM)')
axes[0].set_title('Panel A: Breakpoint Time Series')
axes[0].legend()
# Panel B: Number of stocks in high BM group
for rule, color in [('HOSE', '#2C5F8A'), ('All', '#C0392B')]:
subset = bp_df[bp_df['bp_rule'] == rule]
axes[1].plot(subset['month_end'], subset['n_high'],
color=color, linewidth=1.5, label=f'{rule} breakpoints')
axes[1].set_ylabel('Stocks in High BM Group')
axes[1].set_title('Panel B: High BM Portfolio Size')
axes[1].legend()
plt.tight_layout()
plt.show()
19.6 Step 4: Portfolio Formation
19.6.1 The Generic Factor Engine
We implement a general-purpose factor construction function that takes any signal column and produces a long-short factor return series. The function encapsulates all methodological choices as parameters, making it easy to test sensitivity.
def construct_factor(
panel_df,
signal_col,
size_col='market_cap',
return_col='monthly_return',
date_col='month_end',
exchange_col='exchange',
formation_month=6,
rebalance_freq='annual',
n_signal_groups=3,
n_size_groups=2,
weighting='value',
bp_universe='hose',
independent_sorts=True,
long_group='high',
min_stocks_per_portfolio=5,
signal_lag=0,
universe_filter='standard'
):
"""
Construct a tradeable factor following the Fama-French methodology.
Parameters
----------
signal_col : str
Column with the sorting variable.
formation_month : int
Month of year for portfolio formation (6 = June for FF).
rebalance_freq : str
'annual' (FF standard), 'quarterly', or 'monthly'.
n_signal_groups : int
Number of signal groups (3 for FF standard, 5 for quintiles).
n_size_groups : int
Number of size groups (2 for FF standard).
weighting : str
'value' (VW) or 'equal' (EW).
bp_universe : str
'hose' or 'all' for breakpoint computation.
independent_sorts : bool
True for independent double sorts (FF standard).
long_group : str
'high' or 'low'—which signal group is the long leg.
signal_lag : int
Additional months to lag the signal beyond the
standard point-in-time alignment.
Returns
-------
Dictionary with 'factor_returns', 'portfolio_returns',
'diagnostics'.
"""
df = panel_df.copy()
# Apply universe filter
df = apply_universe_filters(df, filters=universe_filter)
df = df[df['in_universe']].copy()
# Lag the signal if requested
if signal_lag > 0:
df[signal_col] = (
df.groupby('ticker')[signal_col].shift(signal_lag)
)
# Determine formation dates
if rebalance_freq == 'annual':
# Form portfolios in formation_month, hold for 12 months
# Month-end of formation_month (a hard-coded day=30 would
# break for February)
df['formation_date'] = df[date_col].apply(
lambda d: pd.Timestamp(
year=d.year if d.month >= formation_month else d.year - 1,
month=formation_month, day=1
) + pd.offsets.MonthEnd(0)
)
elif rebalance_freq == 'monthly':
df['formation_date'] = df[date_col] - pd.DateOffset(months=1)
elif rebalance_freq == 'quarterly':
df['formation_date'] = df[date_col].apply(
lambda d: pd.Timestamp(
year=d.year,
month=((d.month - 1) // 3) * 3 + 1,
day=1
) - pd.DateOffset(days=1)
)
# Assign signal and size groups at each formation date. Membership is
# fixed using each stock's first observation in the holding period and
# held constant until the next rebalance; re-sorting every month on
# updated signals would contaminate the holding-period returns.
all_portfolios = []
formation_dates = sorted(df['formation_date'].unique())
for f_date in formation_dates:
holding = df[df['formation_date'] == f_date].copy()
first_obs = (
holding.sort_values(date_col)
.drop_duplicates('ticker', keep='first')
.dropna(subset=[signal_col, size_col])
)
if len(first_obs) < min_stocks_per_portfolio * n_signal_groups * n_size_groups:
continue
# Compute breakpoints from formation-date observations
size_bp = compute_breakpoints(
first_obs, size_col, n_size_groups, bp_universe
)
signal_bp = compute_breakpoints(
first_obs, signal_col, n_signal_groups, bp_universe
)
# Assign groups once, then apply to all holding-period months
assign = first_obs[['ticker']].copy()
assign['size_group'] = np.searchsorted(
size_bp.values, first_obs[size_col].values
)
assign['signal_group'] = np.searchsorted(
signal_bp.values, first_obs[signal_col].values
)
all_portfolios.append(holding.merge(assign, on='ticker', how='inner'))
if not all_portfolios:
return None
portfolios = pd.concat(all_portfolios, ignore_index=True)
# Compute portfolio returns
def weighted_return(group):
if weighting == 'value':
if group[size_col].sum() > 0:
return np.average(group[return_col], weights=group[size_col])
else:
return group[return_col].mean()
else:
return group[return_col].mean()
port_returns = (
portfolios
.groupby([date_col, 'size_group', 'signal_group'])
.apply(weighted_return)
.reset_index(name='port_return')
)
# Construct factor: average of high-signal portfolios minus
# average of low-signal portfolios (across size groups)
high_label = n_signal_groups - 1 if long_group == 'high' else 0
low_label = 0 if long_group == 'high' else n_signal_groups - 1
high_ports = port_returns[port_returns['signal_group'] == high_label]
low_ports = port_returns[port_returns['signal_group'] == low_label]
high_avg = high_ports.groupby(date_col)['port_return'].mean()
low_avg = low_ports.groupby(date_col)['port_return'].mean()
factor_returns = (high_avg - low_avg).to_frame('factor_return')
# Diagnostics
port_counts = (
portfolios
.groupby([date_col, 'size_group', 'signal_group'])['ticker']
.nunique()
.reset_index(name='n_stocks')
)
diagnostics = {
'avg_stocks_per_portfolio': port_counts['n_stocks'].mean(),
'min_stocks_per_portfolio': port_counts['n_stocks'].min(),
'ann_return': factor_returns['factor_return'].mean() * 12,
'ann_vol': factor_returns['factor_return'].std() * np.sqrt(12),
'sharpe': (factor_returns['factor_return'].mean()
/ factor_returns['factor_return'].std() * np.sqrt(12)),
't_stat': (factor_returns['factor_return'].mean()
/ (factor_returns['factor_return'].std()
/ np.sqrt(len(factor_returns)))),
'n_months': len(factor_returns)
}
return {
'factor_returns': factor_returns,
'portfolio_returns': port_returns,
'portfolios': portfolios,
'diagnostics': diagnostics
}
19.6.2 Building the Core Factors
We now use the engine to construct the standard Fama-French factors for Vietnam:
# SMB (Size): small minus big
# Signal = market cap; long_group = 'low' (small stocks)
smb_result = construct_factor(
panel, signal_col='log_mcap', long_group='low',
formation_month=6, rebalance_freq='annual',
n_signal_groups=2, n_size_groups=1, # No double sort for size itself
weighting='value', bp_universe='hose'
)
# HML (Value): high BM minus low BM
hml_result = construct_factor(
panel, signal_col='bm', long_group='high',
formation_month=6, rebalance_freq='annual',
n_signal_groups=3, n_size_groups=2,
weighting='value', bp_universe='hose'
)
# RMW (Profitability): robust minus weak
rmw_result = construct_factor(
panel, signal_col='op', long_group='high',
formation_month=6, rebalance_freq='annual',
n_signal_groups=3, n_size_groups=2,
weighting='value', bp_universe='hose'
)
# CMA (Investment): conservative minus aggressive
cma_result = construct_factor(
panel, signal_col='investment', long_group='low',
formation_month=6, rebalance_freq='annual',
n_signal_groups=3, n_size_groups=2,
weighting='value', bp_universe='hose'
)
# WML (Momentum): winners minus losers
wml_result = construct_factor(
panel, signal_col='ret_12_2', long_group='high',
formation_month=6, rebalance_freq='monthly',
n_signal_groups=3, n_size_groups=2,
weighting='value', bp_universe='hose'
)
# Summary table
print("Vietnamese Factor Summary:")
print(f"{'Factor':<8} {'Ann. Ret':>10} {'Ann. Vol':>10} {'Sharpe':>8} "
f"{'t-stat':>8} {'Avg N':>8}")
print("-" * 54)
for name, result in [('SMB', smb_result), ('HML', hml_result),
('RMW', rmw_result), ('CMA', cma_result),
('WML', wml_result)]:
if result is None:
continue
d = result['diagnostics']
print(f"{name:<8} {d['ann_return']:>10.4f} {d['ann_vol']:>10.4f} "
f"{d['sharpe']:>8.2f} {d['t_stat']:>8.2f} "
f"{d['avg_stocks_per_portfolio']:>8.1f}")
fig, ax = plt.subplots(figsize=(14, 6))
factor_colors = {
'SMB': '#2C5F8A', 'HML': '#C0392B', 'RMW': '#27AE60',
'CMA': '#E67E22', 'WML': '#8E44AD'
}
for name, result in [('SMB', smb_result), ('HML', hml_result),
('RMW', rmw_result), ('CMA', cma_result),
('WML', wml_result)]:
if result is None:
continue
fr = result['factor_returns']
cum = (1 + fr['factor_return']).cumprod()
ax.plot(cum.index, cum.values, color=factor_colors[name],
linewidth=2, label=name)
ax.axhline(y=1, color='gray', linewidth=0.5)
ax.set_ylabel('Cumulative Return')
ax.set_xlabel('Date')
ax.set_title('Vietnamese Factor Cumulative Returns (2×3 VW, HOSE Breakpoints)')
ax.legend(ncol=5)
ax.set_yscale('log')
plt.tight_layout()
plt.show()19.7 Step 5: Sensitivity to Construction Choices
The most important lesson in factor construction is that the resulting factor premium is not uniquely determined by the economic hypothesis; it depends substantially on implementation choices. We systematically vary each choice and examine how the factor changes.
19.7.1 Weighting: Value-Weighted vs. Equal-Weighted
sensitivity_results = {}
for name, signal, long_grp in [
('HML', 'bm', 'high'), ('RMW', 'op', 'high'), ('WML', 'ret_12_2', 'high')
]:
for wt in ['value', 'equal']:
result = construct_factor(
panel, signal_col=signal, long_group=long_grp,
weighting=wt, bp_universe='hose',
rebalance_freq='annual' if name != 'WML' else 'monthly'
)
if result:
d = result['diagnostics']
sensitivity_results[f"{name}_{wt}"] = {
'Factor': name, 'Weighting': wt,
'Ann. Return': d['ann_return'],
't-stat': d['t_stat']
}
sens_df = pd.DataFrame(sensitivity_results).T
print("VW vs EW Factor Returns:")
print(sens_df.round(3).to_string())
19.7.2 Breakpoint Universe: HOSE vs. All Stocks
for name, signal, long_grp in [
('HML', 'bm', 'high'), ('WML', 'ret_12_2', 'high')
]:
for bp in ['hose', 'all']:
result = construct_factor(
panel, signal_col=signal, long_group=long_grp,
bp_universe=bp, weighting='value',
rebalance_freq='annual' if name != 'WML' else 'monthly'
)
if result:
d = result['diagnostics']
sensitivity_results[f"{name}_bp_{bp}"] = {
'Factor': name, 'Breakpoints': bp,
'Ann. Return': d['ann_return'],
't-stat': d['t_stat'],
'Avg N': d['avg_stocks_per_portfolio']
}
bp_sens = pd.DataFrame({k: v for k, v in sensitivity_results.items()
if 'bp_' in k}).T
print("\nBreakpoint Universe Sensitivity:")
print(bp_sens.round(3).to_string())
19.7.3 Number of Groups: 2×3 vs. 5×5 vs. Deciles
group_configs = [
(2, 3, '2x3 (FF standard)'),
(2, 5, '2x5 (quintiles)'),
(1, 10, '1x10 (deciles)')
]
group_results = {}
for n_size, n_signal, label in group_configs:
result = construct_factor(
panel, signal_col='bm', long_group='high',
n_size_groups=n_size, n_signal_groups=n_signal,
weighting='value', bp_universe='hose',
rebalance_freq='annual'
)
if result:
d = result['diagnostics']
group_results[label] = {
'Ann. Return': d['ann_return'],
't-stat': d['t_stat'],
'Avg N per port': d['avg_stocks_per_portfolio'],
'Min N per port': d['min_stocks_per_portfolio']
}
print("HML: Sorting Granularity Sensitivity:")
print(pd.DataFrame(group_results).T.round(3).to_string())
19.7.4 Rebalancing Frequency
rebal_results = {}
for freq in ['annual', 'quarterly', 'monthly']:
result = construct_factor(
panel, signal_col='bm', long_group='high',
rebalance_freq=freq, weighting='value', bp_universe='hose'
)
if result:
d = result['diagnostics']
rebal_results[freq] = {
'Ann. Return': d['ann_return'],
'Ann. Vol': d['ann_vol'],
't-stat': d['t_stat']
}
print("HML: Rebalancing Frequency Sensitivity:")
print(pd.DataFrame(rebal_results).T.round(4).to_string())
19.7.5 Comprehensive Sensitivity Summary
# Systematic grid search for HML
configs = list(product(
['value', 'equal'], # Weighting
['hose', 'all'], # Breakpoint universe
[(2, 3), (2, 5), (1, 10)] # (n_size, n_signal)
))
grid_results = []
for wt, bp, (ns, nsig) in configs:
result = construct_factor(
panel, signal_col='bm', long_group='high',
weighting=wt, bp_universe=bp,
n_size_groups=ns, n_signal_groups=nsig,
rebalance_freq='annual'
)
if result:
d = result['diagnostics']
grid_results.append({
'Weighting': 'VW' if wt == 'value' else 'EW',
'Breakpoints': bp.upper(),
'Sort': f'{ns}x{nsig}',
'Ann. Return': d['ann_return'],
't-stat': d['t_stat']
})
grid_df = pd.DataFrame(grid_results)
print("HML Factor: Full Sensitivity Grid:")
print(grid_df.to_string(index=False))
# Pivot for heatmap
pivot = grid_df.pivot_table(
values='Ann. Return', index=['Weighting', 'Breakpoints'],
columns='Sort'
)
fig, ax = plt.subplots(figsize=(8, 5))
sns.heatmap(pivot * 100, annot=True, fmt='.1f', cmap='RdYlGn',
center=0, linewidths=0.5, ax=ax,
cbar_kws={'label': 'Ann. Return (%)'})
ax.set_title('HML Premium: Sensitivity to Construction Choices')
plt.tight_layout()
plt.show()
19.8 Factor Correlation Structure
# Merge all factor return series
factor_panel = pd.DataFrame()
for name, result in [('SMB', smb_result), ('HML', hml_result),
('RMW', rmw_result), ('CMA', cma_result),
('WML', wml_result)]:
if result is None:
continue
fr = result['factor_returns'].rename(columns={'factor_return': name})
if factor_panel.empty:
factor_panel = fr
else:
factor_panel = factor_panel.merge(fr, left_index=True,
right_index=True, how='outer')
corr = factor_panel.corr()
fig, ax = plt.subplots(figsize=(7, 6))
mask = np.triu(np.ones_like(corr, dtype=bool), k=1)
sns.heatmap(corr, mask=mask, annot=True, fmt='.2f', cmap='RdBu_r',
center=0, vmin=-1, vmax=1, square=True,
linewidths=0.5, ax=ax)
ax.set_title('Factor Return Correlations')
plt.tight_layout()
plt.show()
19.9 Factor Validation
A well-constructed factor should pass several diagnostic tests before being used in asset pricing research.
19.9.1 Diagnostic Checklist
def validate_factor(factor_result, name):
"""
Run standard diagnostic tests on a constructed factor.
"""
fr = factor_result['factor_returns']['factor_return']
diag = factor_result['diagnostics']
ports = factor_result['portfolios']
tests = {}
# 1. Statistical significance (t > 2)
tests['t-stat'] = diag['t_stat']
tests['t > 2'] = abs(diag['t_stat']) > 2.0
# 2. Economic magnitude
tests['Ann. Return'] = diag['ann_return']
tests['Ann. Vol'] = diag['ann_vol']
tests['Sharpe'] = diag['sharpe']
# 3. Adequate portfolio diversification
tests['Avg N per portfolio'] = diag['avg_stocks_per_portfolio']
tests['Min N per portfolio'] = diag['min_stocks_per_portfolio']
tests['Min N >= 5'] = diag['min_stocks_per_portfolio'] >= 5
# 4. Not dominated by a single month
tests['Max monthly return'] = fr.max()
tests['Min monthly return'] = fr.min()
tests['Fraction > 0'] = (fr > 0).mean()
# 5. Persistence (ACF at lag 1)
if len(fr) > 12:
tests['ACF(1)'] = fr.autocorr(lag=1)
# 6. Consistency across subperiods
mid = len(fr) // 2
first_half = fr.iloc[:mid]
second_half = fr.iloc[mid:]
tests['Return (1st half)'] = first_half.mean() * 12
tests['Return (2nd half)'] = second_half.mean() * 12
tests['Same sign both halves'] = (
np.sign(first_half.mean()) == np.sign(second_half.mean())
)
return tests
print("Factor Validation Summary:")
print("=" * 70)
for name, result in [('SMB', smb_result), ('HML', hml_result),
('RMW', rmw_result), ('CMA', cma_result),
('WML', wml_result)]:
if result is None:
continue
tests = validate_factor(result, name)
print(f"\n{name}:")
for test, value in tests.items():
if isinstance(value, bool):
status = 'PASS' if value else 'FAIL'
print(f" {test:<30}: {status}")
elif isinstance(value, float):
print(f" {test:<30}: {value:.4f}")
else:
print(f" {test:<30}: {value}")
19.9.2 Monotonicity Test
A factor built from a characteristic sort should produce monotonically increasing (or decreasing) average returns across quantiles. Violations of monotonicity suggest the signal-return relationship is nonlinear or absent.
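A simple numeric complement to the visual inspection is the Spearman rank correlation between decile rank and mean decile return: values near +1 (or -1 for signals where low is the long leg) indicate a monotone pattern. A sketch with hypothetical decile means:

```python
import numpy as np
from scipy import stats

# Hypothetical annualized mean returns (%) for decile portfolios D1..D10
decile_means = np.array([2.1, 3.0, 2.8, 4.5, 5.1, 5.0, 6.2, 7.4, 8.0, 9.3])

# Rank correlation between decile index and mean return:
# +1 is perfectly monotone increasing; values well below 1 flag violations
rho, pval = stats.spearmanr(np.arange(len(decile_means)), decile_means)
print(f"Spearman rho = {rho:.3f} (p = {pval:.4f})")
```

More formal alternatives exist, such as the Patton and Timmermann (2010) monotonic relation test, but the rank correlation is a quick first check that works directly on the decile means produced by the engine.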
fig, axes = plt.subplots(2, 3, figsize=(16, 10))
axes = axes.flatten()
signals = [
    ('bm', 'Book-to-Market', 'high'),
    ('op', 'Operating Profitability', 'high'),
    ('investment', 'Investment', 'low'),
    ('ret_12_2', 'Momentum (12-2)', 'high'),
    ('ivol', 'Idio. Volatility', 'low'),
]
for i, (sig, label, long_grp) in enumerate(signals):
    result = construct_factor(
        panel, signal_col=sig, long_group=long_grp,
        n_size_groups=1, n_signal_groups=10,
        weighting='value', bp_universe='hose',
        rebalance_freq='monthly' if sig == 'ret_12_2' else 'annual'
    )
    if result is None:
        continue
    port_ret = result['portfolio_returns']
    decile_means = (
        port_ret.groupby('signal_group')['port_return']
        .mean() * 12 * 100
    )
    colors_mono = plt.cm.RdYlGn_r(np.linspace(0.1, 0.9, len(decile_means)))
    axes[i].bar(range(len(decile_means)), decile_means.values,
                color=colors_mono, edgecolor='white')
    axes[i].set_xticks(range(len(decile_means)))
    axes[i].set_xticklabels([f'D{d+1}' for d in range(len(decile_means))],
                            fontsize=8)
    axes[i].set_ylabel('Ann. Return (%)')
    axes[i].set_title(label)
    axes[i].axhline(y=0, color='gray', linewidth=0.5)
# Only five signals: hide the unused sixth panel
axes[5].set_visible(False)
plt.suptitle('Decile Portfolio Returns by Signal', fontsize=14)
plt.tight_layout()
plt.show()

19.9.3 Spanning Tests
Does a new factor add information beyond existing factors? We test this by regressing each factor on all of the other factors and examining the intercept (alpha). A significant alpha indicates the factor earns average returns that cannot be explained by a linear combination of the others:
factor_names = [c for c in factor_panel.columns
                if factor_panel[c].notna().sum() > 24]
spanning_data = factor_panel[factor_names].dropna()
print("Spanning Tests (alpha = intercept when regressed on other factors):")
print(f"{'Factor':<8} {'Alpha (ann.)':>12} {'t-stat':>8} {'R²':>6}")
print("-" * 36)
for target in factor_names:
    others = [f for f in factor_names if f != target]
    y = spanning_data[target]
    X = sm.add_constant(spanning_data[others])
    model = sm.OLS(y, X).fit(cov_type='HAC', cov_kwds={'maxlags': 6})
    alpha_ann = model.params['const'] * 12
    alpha_t = model.tvalues['const']
    r2 = model.rsquared
    print(f"{target:<8} {alpha_ann:>12.4f} {alpha_t:>8.2f} {r2:>6.3f}")

19.10 Vietnamese-Specific Considerations
19.10.1 State-Owned Enterprise Classification
A unique feature of Vietnamese equities is the high proportion of state-owned enterprises (SOEs). The state retains majority or significant minority stakes in many listed firms, which affects governance, information environment, and trading dynamics. Factors may behave differently within SOE and non-SOE subsamples:
# Get SOE classification
firm_info = client.get_firm_info(
    exchanges=['HOSE', 'HNX'],
    fields=['ticker', 'state_ownership_pct', 'is_soe']
)
panel_soe = panel.merge(firm_info[['ticker', 'is_soe']], on='ticker', how='left')
for label, subset in [('SOE', panel_soe[panel_soe['is_soe'] == True]),
                      ('Non-SOE', panel_soe[panel_soe['is_soe'] == False])]:
    hml = construct_factor(
        subset, signal_col='bm', long_group='high',
        weighting='value', bp_universe='all',
        rebalance_freq='annual'
    )
    if hml is not None:
        d = hml['diagnostics']
        print(f"HML ({label}): Ann = {d['ann_return']:.4f}, "
              f"t = {d['t_stat']:.2f}, N = {d['avg_stocks_per_portfolio']:.0f}")

19.10.2 Foreign Ownership Limits
Foreign ownership caps (49% for most sectors, 30% for banking) affect the investable universe for international investors. Factors constructed from the full universe may not be achievable by foreign investors if the long leg concentrates in stocks at the foreign ownership limit:
fol = client.get_foreign_ownership(
    exchanges=['HOSE', 'HNX'],
    fields=['ticker', 'month_end', 'foreign_pct', 'foreign_limit_pct']
)
# Merge with HML portfolios
if hml_result is not None:
    hml_ports = hml_result['portfolios'].merge(
        fol, on=['ticker', 'month_end'], how='left'
    )
    # Flag stocks with less than 5 percentage points of foreign room left
    hml_ports['near_limit'] = (
        (hml_ports['foreign_limit_pct'] - hml_ports['foreign_pct']) < 5
    )
    fol_by_group = (
        hml_ports.groupby('signal_group')
        .agg(
            avg_foreign_pct=('foreign_pct', 'mean'),
            pct_near_limit=('near_limit', 'mean')
        )
    )
    print("Foreign Ownership in HML Portfolios:")
    print(fol_by_group.round(3))

19.11 Recommended Factor Specifications for Vietnam
Based on the sensitivity analysis, we recommend the following baseline specifications (Table 19.1).
| Choice | Recommendation | Rationale |
|---|---|---|
| Universe | Standard filter (listing age ≥ 6m, volume > 0) | Excludes shell firms without losing too much breadth |
| Breakpoints | HOSE stocks only | Prevents HNX micro-caps from dominating breakpoints |
| Size groups | 2 (median split) | Sufficient control with limited cross-section |
| Signal groups | 3 (terciles) for factor construction; 5 or 10 for portfolio analysis | A 2×3 sort keeps enough stocks in each portfolio for adequate diversification |
| Weighting | Value-weighted | Investable; less noisy than EW |
| Rebalancing | Annual (June) for accounting signals; monthly for momentum | Standard; consistent with Fama-French |
| Accounting lag | 4 months (available by April for Dec FY) | Conservative PIT alignment |
| Minimum stocks | ≥ 5 per portfolio per month | Below this, single-stock idiosyncratic risk dominates |
Always report results under the baseline and at least one alternative specification (e.g., EW, all-stock breakpoints) to demonstrate robustness.
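Reporting multiple specifications is easiest to organize as a small grid. The sketch below builds such a grid with itertools.product; the commented loop shows where the chapter's construct_factor() would be called for each cell (the specific grid values are illustrative, taken from Table 19.1's baseline and alternatives).

```python
from itertools import product
import pandas as pd

# Specification grid: baseline plus the alternatives from Table 19.1
grid = {
    'weighting': ['value', 'equal'],
    'bp_universe': ['hose', 'all'],
    'n_signal_groups': [3, 5],
}
specs = [dict(zip(grid, combo)) for combo in product(*grid.values())]
specs_df = pd.DataFrame(specs)
print(f"{len(specs_df)} specifications to evaluate")
print(specs_df)

# In practice, each row would be fed through the chapter's pipeline, e.g.:
# for spec in specs:
#     result = construct_factor(panel, signal_col='bm', long_group='high',
#                               rebalance_freq='annual', **spec)
#     # ...record result['diagnostics']['ann_return'] and ['t_stat']
```

Tabulating the annualized premium and t-statistic across all eight cells makes fragility immediately visible: a premium that survives only one row is not robust.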
19.12 Summary
This chapter has developed a modular, transparent factor construction framework for Vietnamese equities. The key insights are:
The Fama and French (1993) 2×3 methodology translates well to Vietnam with one critical adaptation: breakpoints should be computed from HOSE stocks only. Using all-stock breakpoints allows HNX micro-caps to dominate the small-stock groups, inflating apparent premia with economically untradeable returns.
Value weighting produces more conservative (and more implementable) factor premia than equal weighting. The difference is particularly large for signals correlated with size (BM, investment), where EW overweights the smallest stocks that contribute most to the premium but least to investable returns.
No single construction choice determines whether a factor “exists.” A factor premium that appears only under one specific combination of breakpoints, weighting, and rebalancing frequency is fragile and should be treated with skepticism. Robust factors survive a grid of specifications. The sensitivity analysis framework developed here (e.g., varying weighting, breakpoints, sort granularity, and rebalancing simultaneously) should be standard practice.
The construct_factor() function developed in this chapter is designed for reuse throughout the book. Any anomaly variable can be fed through the same pipeline, producing a tradeable factor with full diagnostics. This ensures methodological consistency across chapters and makes it easy to compare premia on an apples-to-apples basis.