import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from scipy import stats
from itertools import product
import warnings
warnings.filterwarnings('ignore')
plt.rcParams.update({
'figure.figsize': (12, 6),
'figure.dpi': 150,
'font.size': 11,
'axes.spines.top': False,
'axes.spines.right': False
})
19 Factor Construction Principles
In this chapter, we develop a general-purpose factor construction engine for the Vietnamese equity market. We cover every methodological decision in the pipeline, including universe definition, breakpoint computation, portfolio formation, weighting, rebalancing, and factor return calculation, and demonstrate how each choice affects the resulting factor.
The previous chapters introduced specific asset pricing models, including the CAPM, the Fama-French three-factor model, and momentum. Each of those chapters presented its factor as given. This chapter steps behind the curtain and addresses the engineering question: how exactly do you build a factor? The question matters because seemingly minor methodological decisions (where to set breakpoints, whether to value-weight or equal-weight, how to handle missing accounting data, which stocks to exclude) can alter the magnitude, statistical significance, and even the sign of a factor premium.
In the U.S. context, Fama and French (1993) established a canonical procedure: sort stocks independently on size and a characteristic, form six value-weighted portfolios from 2×3 intersections, and define the factor as the average return of the two high-characteristic portfolios minus the average return of the two low-characteristic portfolios. This procedure has been replicated thousands of times. But it was designed for the U.S. market circa 1990, with its deep liquidity, broad cross-section, and CRSP/Compustat data infrastructure. Applying it mechanically to Vietnam, a market with 700 listed stocks, extreme illiquidity in the bottom tercile, high concentration in the top decile, and accounting data that arrives with variable lags, requires careful adaptation.
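The arithmetic of the 2×3 construction is worth making explicit: averaging the value spread within each size group neutralizes the size tilt, which a raw univariate high-minus-low sort would not. A toy example with hypothetical portfolio returns:

```python
# Hypothetical monthly returns for the six 2x3 portfolios
# (S = small, B = big; L/M/H = low/medium/high book-to-market)
six = {
    'SL': 0.010, 'SM': 0.014, 'SH': 0.021,
    'BL': 0.008, 'BM': 0.011, 'BH': 0.015,
}

# HML: average of the two value portfolios minus the two growth portfolios
hml = (six['SH'] + six['BH']) / 2 - (six['SL'] + six['BL']) / 2

# SMB: average of the three small portfolios minus the three big ones
smb = (six['SL'] + six['SM'] + six['SH']) / 3 \
    - (six['BL'] + six['BM'] + six['BH']) / 3

print(f"HML = {hml:.4f}, SMB = {smb:.4f}")
```

Because the long and short legs each contain one small and one big portfolio, HML is (approximately) size-neutral by construction, and SMB is value-neutral for the analogous reason.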
Hou, Xue, and Zhang (2020) replicated 452 anomalies from the U.S. literature and found that over half fail to replicate even in U.S. data with minor methodological variations. The replication crisis in empirical asset pricing makes it essential that researchers understand and document every construction choice. This chapter provides the tools to do so transparently.
19.1 The Factor Construction Pipeline
Every tradeable factor follows the same logical pipeline:
- Define the universe: Select the eligible securities (e.g., common stocks with liquidity and size filters).
- Compute the signal: Calculate the characteristic of interest (e.g., value, momentum, profitability).
- Set breakpoints: Determine how stocks will be sorted (e.g., median, quintiles, deciles).
- Assign portfolios: Group stocks into high and low (or multiple) portfolios based on the signal.
- Compute returns: Calculate portfolio returns (equal- or value-weighted).
- Construct the factor: Take Long (high) - Short (low).
- Validate: Test performance, significance, and robustness.
Each step involves choices that interact with each other. A breakpoint that works well for a liquid universe may be inappropriate for the full cross-section. A weighting scheme that reduces noise in the U.S. may amplify it in Vietnam. We address each step systematically.
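Before building the full engine, the pipeline can be previewed end to end on a toy one-month cross-section. The data below are synthetic and the tercile sort and value weighting are deliberately minimal; the full implementation follows in the rest of the chapter:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Steps 1-2: a toy universe with a sorting signal and one month of returns
toy = pd.DataFrame({
    'mcap': rng.lognormal(mean=6, sigma=1.5, size=n),
    'signal': rng.normal(size=n),
})
toy['ret'] = 0.01 * toy['signal'] + rng.normal(scale=0.05, size=n)

# Step 3: tercile breakpoints from the signal distribution
bps = toy['signal'].quantile([1 / 3, 2 / 3]).values

# Step 4: assign each stock to a portfolio (0 = low, 2 = high)
toy['group'] = np.searchsorted(bps, toy['signal'].values)

# Step 5: value-weighted return of each portfolio
vw = toy.groupby('group').apply(
    lambda g: np.average(g['ret'], weights=g['mcap'])
)

# Step 6: the factor is the high-minus-low spread
factor = vw.loc[2] - vw.loc[0]
print(f"toy factor return: {factor:.4f}")
```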
19.2 Data Construction
from datacore import DataCoreClient
client = DataCoreClient()
# Monthly returns (survivorship-bias-free)
monthly = client.get_monthly_returns(
exchanges=['HOSE', 'HNX'],
start_date='2008-01-01',
end_date='2024-12-31',
include_delisted=True,
fields=[
'ticker', 'month_end', 'monthly_return', 'market_cap',
'shares_outstanding', 'volume_avg_20d', 'turnover_value_avg_20d',
'n_zero_volume_days', 'exchange'
]
)
# Annual accounting data
accounting = client.get_fundamentals(
exchanges=['HOSE', 'HNX'],
start_date='2006-01-01',
end_date='2024-12-31',
include_delisted=True,
frequency='annual',
fields=[
'ticker', 'fiscal_year', 'filing_date',
'total_assets', 'total_equity', 'book_equity',
'net_income', 'revenue', 'gross_profit',
'operating_profit', 'total_debt', 'retained_earnings',
'dividends_paid', 'capex', 'depreciation',
'shares_outstanding_fy'
]
)
# Daily prices for momentum and volatility signals
daily = client.get_daily_prices(
exchanges=['HOSE', 'HNX'],
start_date='2008-01-01',
end_date='2024-12-31',
include_delisted=True,
fields=['ticker', 'date', 'adjusted_close', 'volume', 'turnover_value']
)
monthly['month_end'] = pd.to_datetime(monthly['month_end'])
monthly = monthly.sort_values(['ticker', 'month_end'])
print(f"Monthly returns: {len(monthly):,} firm-months")
print(f"Accounting: {len(accounting):,} firm-years")
print(f"Unique tickers: {monthly['ticker'].nunique()}")
19.3 Step 1: Universe Definition
The first and most consequential choice is which stocks enter the factor construction universe. The universe definition determines what population the factor premium describes and whether it is implementable.
19.3.1 The Universe Problem in Vietnam
Vietnam presents a specific challenge: the cross-section is small (600-800 stocks on HOSE and HNX combined), and the size distribution is extremely skewed. The top 10 stocks by market capitalization account for roughly 50% of the total market cap on HOSE. The bottom tercile consists of micro-cap stocks that often trade fewer than 5 days per month. Including these stocks inflates apparent factor premia because their prices are noisy and stale, but excluding them shrinks the already small cross-section.
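The concentration claim is straightforward to verify from the monthly panel. A sketch of the computation on a synthetic cross-section (the `market_cap` and `month_end` field names mirror the monthly panel above; the numbers themselves are illustrative only):

```python
import numpy as np
import pandas as pd

# Synthetic one-month cross-section standing in for the monthly panel
rng = np.random.default_rng(42)
snap = pd.DataFrame({
    'ticker': [f'VN{i:03d}' for i in range(300)],
    'month_end': pd.Timestamp('2023-06-30'),
    'market_cap': rng.lognormal(mean=6, sigma=2.0, size=300),
})

def top_n_share(df, n=10, mcap_col='market_cap'):
    """Per-month share of total market cap held by the n largest stocks."""
    return df.groupby('month_end')[mcap_col].apply(
        lambda g: g.nlargest(n).sum() / g.sum()
    )

conc = top_n_share(snap)
print(f"Top-10 market-cap share: {conc.iloc[0]:.1%}")
```

Run on the actual panel, the same function gives a monthly time series of concentration, which is useful context for interpreting value-weighted factor returns.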
def apply_universe_filters(df, filters='standard'):
"""
Apply universe filters to the monthly return panel.
Parameters
----------
filters : str
'none': all stocks
'minimal': exclude zero market cap and extreme returns
'standard': + minimum listing age + positive volume
'strict': + minimum market cap + minimum turnover
Returns
-------
Filtered DataFrame with 'in_universe' column
"""
d = df.copy()
d['in_universe'] = True
# Always: remove missing returns and market cap
d.loc[d['monthly_return'].isna(), 'in_universe'] = False
d.loc[d['market_cap'].isna() | (d['market_cap'] <= 0), 'in_universe'] = False
# Minimal: exclude returns above 100% in absolute value (likely data errors)
if filters in ['minimal', 'standard', 'strict']:
d.loc[d['monthly_return'].abs() > 1.0, 'in_universe'] = False
# Standard: listing age >= 6 months
if filters in ['standard', 'strict']:
d['listing_age'] = (
d.groupby('ticker').cumcount() + 1
)
d.loc[d['listing_age'] < 6, 'in_universe'] = False
# Require active trading: at most 12 zero-volume days in the month
d.loc[d['n_zero_volume_days'] > 12, 'in_universe'] = False
# Strict: minimum market cap (20th percentile of HOSE,
# applied to all stocks, analogous to NYSE size screens)
if filters == 'strict':
hose_p20 = (
d[d['exchange'] == 'HOSE']
.groupby('month_end')['market_cap']
.quantile(0.20)
)
d['mcap_threshold'] = d['month_end'].map(hose_p20)
d.loc[d['market_cap'] < d['mcap_threshold'], 'in_universe'] = False
# Minimum average daily turnover (VND 200 million)
d.loc[d['turnover_value_avg_20d'] < 2e8, 'in_universe'] = False
return d
# Apply all filter levels and compare
filter_summary = {}
for level in ['none', 'minimal', 'standard', 'strict']:
filtered = apply_universe_filters(monthly, filters=level)
in_univ = filtered[filtered['in_universe']]
filter_summary[level] = {
'Firm-months': len(in_univ),
'Avg stocks/month': in_univ.groupby('month_end')['ticker'].nunique().mean(),
'Avg MCap coverage (%)': (
in_univ.groupby('month_end')['market_cap'].sum()
/ filtered.groupby('month_end')['market_cap'].sum()
).mean() * 100
}
filter_df = pd.DataFrame(filter_summary).T
print("Universe Filter Effects:")
print(filter_df.round(1).to_string())
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
colors_filter = {
'none': '#BDC3C7', 'minimal': '#3498DB',
'standard': '#2C5F8A', 'strict': '#C0392B'
}
for level in ['none', 'minimal', 'standard', 'strict']:
filtered = apply_universe_filters(monthly, filters=level)
counts = (
filtered[filtered['in_universe']]
.groupby('month_end')['ticker']
.nunique()
)
axes[0].plot(counts.index, counts.values,
color=colors_filter[level], linewidth=1.5, label=level)
axes[0].set_ylabel('Number of Stocks')
axes[0].set_title('Panel A: Universe Size')
axes[0].legend()
# Panel B: Market cap coverage
for level in ['minimal', 'standard', 'strict']:
filtered = apply_universe_filters(monthly, filters=level)
total_mcap = filtered.groupby('month_end')['market_cap'].sum()
filtered_mcap = (
filtered[filtered['in_universe']]
.groupby('month_end')['market_cap']
.sum()
)
coverage = (filtered_mcap / total_mcap * 100).dropna()
axes[1].plot(coverage.index, coverage.values,
color=colors_filter[level], linewidth=1.5, label=level)
axes[1].set_ylabel('Market Cap Coverage (%)')
axes[1].set_title('Panel B: Market Capitalization Coverage')
axes[1].legend()
axes[1].set_ylim([60, 102])
plt.tight_layout()
plt.show()
19.4 Step 2: Signal Construction
19.4.1 Point-in-Time Accounting Data
As discussed in the missing data chapter, accounting signals must be aligned with their public availability date to avoid look-ahead bias. We implement a general-purpose point-in-time merge:
def pit_merge_accounting(monthly_df, accounting_df, lag_months=4):
"""
Merge accounting data with monthly returns respecting
the point-in-time availability constraint.
Vietnamese annual reports are due within 90 days of fiscal
year-end. We use a conservative 4-month lag.
Parameters
----------
lag_months : int
Number of months after fiscal year-end before data
are assumed to be publicly available.
"""
acc = accounting_df.copy()
# Accounting data becomes available lag_months after FY end
# If filing_date is available, use it; otherwise use FY-end + lag
if 'filing_date' in acc.columns:
acc['filing_date'] = pd.to_datetime(acc['filing_date'])
acc['available_date'] = acc['filing_date']
# Fallback for missing filing dates
acc['fy_end'] = pd.to_datetime(
acc['fiscal_year'].astype(str) + '-12-31'
)
acc['available_date'] = acc['available_date'].fillna(
acc['fy_end'] + pd.DateOffset(months=lag_months)
)
else:
acc['available_date'] = pd.to_datetime(
acc['fiscal_year'].astype(str) + '-12-31'
) + pd.DateOffset(months=lag_months)
# For each firm-month, map the latest publicly available fiscal year.
# We use a calendar rule: FY t-1 data are assumed available from month
# (lag_months + 1) of year t onward. The per-firm available_date above
# could support a finer merge but is not needed for annual rebalancing.
merged = monthly_df.copy()
merged['year'] = merged['month_end'].dt.year
merged['month'] = merged['month_end'].dt.month
# Map: if month >= (lag_months + 1), use current year's FY-1 data
# Otherwise, use FY-2 data
merged['data_fy'] = np.where(
merged['month'] >= lag_months + 1,
merged['year'] - 1,
merged['year'] - 2
)
# Merge
acc_cols = [c for c in acc.columns if c not in
['filing_date', 'available_date', 'fy_end']]
merged = merged.merge(
acc[acc_cols].rename(columns={'fiscal_year': 'data_fy'}),
on=['ticker', 'data_fy'],
how='left'
)
return merged
# Apply point-in-time merge
panel = pit_merge_accounting(monthly, accounting, lag_months=4)
# Construct common signals
panel['log_mcap'] = np.log(panel['market_cap'].clip(lower=1))
# Book-to-market
panel['bm'] = panel['book_equity'] / panel['market_cap']
panel.loc[panel['bm'] <= 0, 'bm'] = np.nan # Negative BE firms
# Gross profitability (Novy-Marx 2013)
panel['gp_at'] = panel['gross_profit'] / panel['total_assets']
# Operating profitability (Fama-French 2015)
panel['op'] = panel['operating_profit'] / panel['book_equity']
# Investment (asset growth): year-over-year change in total assets,
# computed across fiscal years. A monthly pct_change on the merged
# panel would be wrong: total_assets changes only once a year there,
# so most monthly changes would be exactly zero.
fy = (panel[['ticker', 'data_fy', 'total_assets']]
.drop_duplicates().sort_values(['ticker', 'data_fy']))
fy['investment'] = fy.groupby('ticker')['total_assets'].pct_change()
panel = panel.merge(fy[['ticker', 'data_fy', 'investment']],
on=['ticker', 'data_fy'], how='left')
# Leverage
panel['leverage'] = panel['total_debt'] / panel['total_assets']
print("Signal Coverage:")
for sig in ['bm', 'gp_at', 'op', 'investment', 'leverage']:
pct = panel[sig].notna().mean()
print(f" {sig:<15}: {pct:.1%}")
19.4.2 Momentum and Volatility Signals
Price-based signals require return history, not accounting data, so they have different timing requirements.
# Past returns for momentum signals
panel = panel.sort_values(['ticker', 'month_end'])
# Momentum: cumulative return from month t-12 to t-2 (skip most recent month)
panel['ret_12_2'] = (
panel.groupby('ticker')['monthly_return']
.transform(lambda x: x.shift(2).rolling(11).apply(
lambda r: (1 + r).prod() - 1, raw=True))
)
# Short-term reversal: month t-1 return
panel['ret_1'] = panel.groupby('ticker')['monthly_return'].shift(1)
# Return volatility (rolling 60-day std of daily returns, annualized).
# Total volatility is used here as a simple proxy for idiosyncratic
# volatility; a stricter measure would use factor-model residuals.
daily['date'] = pd.to_datetime(daily['date'])
daily['daily_return'] = daily.groupby('ticker')['adjusted_close'].pct_change()
ivol = (
daily.groupby('ticker')
.apply(lambda g: g.set_index('date')['daily_return']
.rolling(60, min_periods=40).std() * np.sqrt(252))
.reset_index(name='ivol')
)
# Roll each date to calendar month end so the merge key matches the panel
ivol['month_end'] = ivol['date'] + pd.offsets.MonthEnd(0)
ivol_monthly = (
ivol.groupby(['ticker', 'month_end'])['ivol']
.last()
.reset_index()
)
panel = panel.merge(ivol_monthly, on=['ticker', 'month_end'], how='left')
print("Price Signal Coverage:")
for sig in ['ret_12_2', 'ret_1', 'ivol']:
pct = panel[sig].notna().mean()
print(f" {sig:<15}: {pct:.1%}")
19.5 Step 3: Breakpoint Computation
19.5.1 The Breakpoint Decision
Breakpoints determine which stocks are “high” versus “low” on a given characteristic. The two key choices are:
- Breakpoint universe: Should breakpoints be computed from all stocks or from a subset (e.g., HOSE only)?
- Number of groups: 2×3 (Fama-French standard), 5×5 (for finer sorts), or independent terciles/quintiles?
Fama and French (1993) use NYSE breakpoints for U.S. sorts because this prevents the large number of small Nasdaq/AMEX stocks from dominating the breakpoint distribution. The analog in Vietnam is to use HOSE breakpoints, since HOSE lists the larger, more liquid firms and HNX lists smaller firms. Using all-stock breakpoints would place most HOSE stocks in the upper size groups and most HNX stocks in the lower groups, producing mechanically different results.
def compute_breakpoints(df, signal_col, n_groups, bp_universe='hose',
exchange_col='exchange'):
"""
Compute cross-sectional breakpoints for portfolio sorting.
Parameters
----------
bp_universe : str
'all': use all stocks in universe
'hose': use only HOSE stocks (analogous to NYSE breakpoints)
n_groups : int
Number of groups (2, 3, 5, or 10)
Returns
-------
Series of breakpoints (quantiles)
"""
if bp_universe == 'hose':
signal = df.loc[df[exchange_col] == 'HOSE', signal_col].dropna()
else:
signal = df[signal_col].dropna()
quantiles = np.linspace(0, 1, n_groups + 1)[1:-1]
breakpoints = signal.quantile(quantiles)
return breakpoints
# Example: compare HOSE vs all-stock breakpoints for book-to-market
example_month = panel[panel['month_end'] == '2023-06-30'].copy()
example_month = example_month[example_month['bm'].notna()]
bp_hose = compute_breakpoints(example_month, 'bm', 3, bp_universe='hose')
bp_all = compute_breakpoints(example_month, 'bm', 3, bp_universe='all')
print("BM Tercile Breakpoints (June 2023):")
print(f" HOSE-only: {bp_hose.values.round(3)}")
print(f" All stocks: {bp_all.values.round(3)}")
print(f"\n Difference: HOSE breakpoints are "
f"{'higher' if bp_hose.values[0] > bp_all.values[0] else 'lower'} "
f"than all-stock breakpoints")
# Compute breakpoints for every month
bp_comparison = []
for month, group in panel.dropna(subset=['bm']).groupby('month_end'):
bp_h = compute_breakpoints(group, 'bm', 3, bp_universe='hose')
bp_a = compute_breakpoints(group, 'bm', 3, bp_universe='all')
# Count stocks in each tercile under each rule
for bp_name, bp_vals in [('HOSE', bp_h), ('All', bp_a)]:
low = (group['bm'] <= bp_vals.iloc[0]).sum()
mid = ((group['bm'] > bp_vals.iloc[0]) &
(group['bm'] <= bp_vals.iloc[1])).sum()
high = (group['bm'] > bp_vals.iloc[1]).sum()
bp_comparison.append({
'month_end': month, 'bp_rule': bp_name,
'median_bp': bp_vals.iloc[0],
'n_low': low, 'n_mid': mid, 'n_high': high
})
bp_df = pd.DataFrame(bp_comparison)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Panel A: Median breakpoint over time
for rule, color in [('HOSE', '#2C5F8A'), ('All', '#C0392B')]:
subset = bp_df[bp_df['bp_rule'] == rule]
axes[0].plot(subset['month_end'], subset['median_bp'],
color=color, linewidth=1.5, label=f'{rule} breakpoints')
axes[0].set_ylabel('Lower Tercile Breakpoint (BM)')
axes[0].set_title('Panel A: Breakpoint Time Series')
axes[0].legend()
# Panel B: Number of stocks in high BM group
for rule, color in [('HOSE', '#2C5F8A'), ('All', '#C0392B')]:
subset = bp_df[bp_df['bp_rule'] == rule]
axes[1].plot(subset['month_end'], subset['n_high'],
color=color, linewidth=1.5, label=f'{rule} breakpoints')
axes[1].set_ylabel('Stocks in High BM Group')
axes[1].set_title('Panel B: High BM Portfolio Size')
axes[1].legend()
plt.tight_layout()
plt.show()
19.6 Step 4: Portfolio Formation
19.6.1 The Generic Factor Engine
We implement a general-purpose factor construction function that takes any signal column and produces a long-short factor return series. The function encapsulates all methodological choices as parameters, making it easy to test sensitivity.
def construct_factor(
panel_df,
signal_col,
size_col='market_cap',
return_col='monthly_return',
date_col='month_end',
exchange_col='exchange',
formation_month=6,
rebalance_freq='annual',
n_signal_groups=3,
n_size_groups=2,
weighting='value',
bp_universe='hose',
independent_sorts=True,
long_group='high',
min_stocks_per_portfolio=5,
signal_lag=0,
universe_filter='standard'
):
"""
Construct a tradeable factor following the Fama-French methodology.
Parameters
----------
signal_col : str
Column with the sorting variable.
formation_month : int
Month of year for portfolio formation (6 = June for FF).
rebalance_freq : str
'annual' (FF standard), 'quarterly', or 'monthly'.
n_signal_groups : int
Number of signal groups (3 for FF standard, 5 for quintiles).
n_size_groups : int
Number of size groups (2 for FF standard).
weighting : str
'value' (VW) or 'equal' (EW).
bp_universe : str
'hose' or 'all' for breakpoint computation.
independent_sorts : bool
True for independent double sorts (FF standard).
long_group : str
'high' or 'low'—which signal group is the long leg.
signal_lag : int
Additional months to lag the signal beyond the
standard point-in-time alignment.
Returns
-------
Dictionary with 'factor_returns', 'portfolio_returns',
'diagnostics'.
"""
df = panel_df.copy()
# Apply universe filter
df = apply_universe_filters(df, filters=universe_filter)
df = df[df['in_universe']].copy()
# Lag the signal if requested
if signal_lag > 0:
df[signal_col] = (
df.groupby('ticker')[signal_col].shift(signal_lag)
)
# Determine formation dates
if rebalance_freq == 'annual':
# Form portfolios in formation_month, hold for 12 months
# Month-end of formation_month (a hard-coded day=30 would
# break for February)
df['formation_date'] = df[date_col].apply(
lambda d: pd.Timestamp(
year=d.year if d.month >= formation_month else d.year - 1,
month=formation_month, day=1
) + pd.offsets.MonthEnd(0)
)
elif rebalance_freq == 'monthly':
df['formation_date'] = df[date_col] - pd.DateOffset(months=1)
elif rebalance_freq == 'quarterly':
df['formation_date'] = df[date_col].apply(
lambda d: pd.Timestamp(
year=d.year,
month=((d.month - 1) // 3) * 3 + 1,
day=1
) - pd.DateOffset(days=1)
)
# Assign signal and size groups at each formation date. Membership is
# fixed using each stock's first observation in the holding period and
# held constant until the next rebalance; re-sorting every month on
# updated signals would contaminate the holding-period returns.
all_portfolios = []
formation_dates = sorted(df['formation_date'].unique())
for f_date in formation_dates:
holding = df[df['formation_date'] == f_date].copy()
first_obs = (
holding.sort_values(date_col)
.drop_duplicates('ticker', keep='first')
.dropna(subset=[signal_col, size_col])
)
if len(first_obs) < min_stocks_per_portfolio * n_signal_groups * n_size_groups:
continue
# Compute breakpoints from formation-date observations
size_bp = compute_breakpoints(
first_obs, size_col, n_size_groups, bp_universe
)
signal_bp = compute_breakpoints(
first_obs, signal_col, n_signal_groups, bp_universe
)
# Assign groups once, then apply to all holding-period months
assign = first_obs[['ticker']].copy()
assign['size_group'] = np.searchsorted(
size_bp.values, first_obs[size_col].values
)
assign['signal_group'] = np.searchsorted(
signal_bp.values, first_obs[signal_col].values
)
all_portfolios.append(holding.merge(assign, on='ticker', how='inner'))
if not all_portfolios:
return None
portfolios = pd.concat(all_portfolios, ignore_index=True)
# Compute portfolio returns
def weighted_return(group):
if weighting == 'value':
if group[size_col].sum() > 0:
return np.average(group[return_col], weights=group[size_col])
else:
return group[return_col].mean()
else:
return group[return_col].mean()
port_returns = (
portfolios
.groupby([date_col, 'size_group', 'signal_group'])
.apply(weighted_return)
.reset_index(name='port_return')
)
# Construct factor: average of high-signal portfolios minus
# average of low-signal portfolios (across size groups)
high_label = n_signal_groups - 1 if long_group == 'high' else 0
low_label = 0 if long_group == 'high' else n_signal_groups - 1
high_ports = port_returns[port_returns['signal_group'] == high_label]
low_ports = port_returns[port_returns['signal_group'] == low_label]
high_avg = high_ports.groupby(date_col)['port_return'].mean()
low_avg = low_ports.groupby(date_col)['port_return'].mean()
factor_returns = (high_avg - low_avg).to_frame('factor_return')
# Diagnostics
port_counts = (
portfolios
.groupby([date_col, 'size_group', 'signal_group'])['ticker']
.nunique()
.reset_index(name='n_stocks')
)
diagnostics = {
'avg_stocks_per_portfolio': port_counts['n_stocks'].mean(),
'min_stocks_per_portfolio': port_counts['n_stocks'].min(),
'ann_return': factor_returns['factor_return'].mean() * 12,
'ann_vol': factor_returns['factor_return'].std() * np.sqrt(12),
'sharpe': (factor_returns['factor_return'].mean()
/ factor_returns['factor_return'].std() * np.sqrt(12)),
't_stat': (factor_returns['factor_return'].mean()
/ (factor_returns['factor_return'].std()
/ np.sqrt(len(factor_returns)))),
'n_months': len(factor_returns)
}
return {
'factor_returns': factor_returns,
'portfolio_returns': port_returns,
'portfolios': portfolios,
'diagnostics': diagnostics
}
19.6.2 Building the Core Factors
We now use the engine to construct the standard Fama-French factors for Vietnam:
# SMB (Size): small minus big
# Signal = market cap; long_group = 'low' (small stocks)
smb_result = construct_factor(
panel, signal_col='log_mcap', long_group='low',
formation_month=6, rebalance_freq='annual',
n_signal_groups=2, n_size_groups=1, # No double sort for size itself
weighting='value', bp_universe='hose'
)
# HML (Value): high BM minus low BM
hml_result = construct_factor(
panel, signal_col='bm', long_group='high',
formation_month=6, rebalance_freq='annual',
n_signal_groups=3, n_size_groups=2,
weighting='value', bp_universe='hose'
)
# RMW (Profitability): robust minus weak
rmw_result = construct_factor(
panel, signal_col='op', long_group='high',
formation_month=6, rebalance_freq='annual',
n_signal_groups=3, n_size_groups=2,
weighting='value', bp_universe='hose'
)
# CMA (Investment): conservative minus aggressive
cma_result = construct_factor(
panel, signal_col='investment', long_group='low',
formation_month=6, rebalance_freq='annual',
n_signal_groups=3, n_size_groups=2,
weighting='value', bp_universe='hose'
)
# WML (Momentum): winners minus losers
wml_result = construct_factor(
panel, signal_col='ret_12_2', long_group='high',
formation_month=6, rebalance_freq='monthly',
n_signal_groups=3, n_size_groups=2,
weighting='value', bp_universe='hose'
)
# Summary table
print("Vietnamese Factor Summary:")
print(f"{'Factor':<8} {'Ann. Ret':>10} {'Ann. Vol':>10} {'Sharpe':>8} "
f"{'t-stat':>8} {'Avg N':>8}")
print("-" * 54)
for name, result in [('SMB', smb_result), ('HML', hml_result),
('RMW', rmw_result), ('CMA', cma_result),
('WML', wml_result)]:
if result is None:
continue
d = result['diagnostics']
print(f"{name:<8} {d['ann_return']:>10.4f} {d['ann_vol']:>10.4f} "
f"{d['sharpe']:>8.2f} {d['t_stat']:>8.2f} "
f"{d['avg_stocks_per_portfolio']:>8.1f}")
fig, ax = plt.subplots(figsize=(14, 6))
factor_colors = {
'SMB': '#2C5F8A', 'HML': '#C0392B', 'RMW': '#27AE60',
'CMA': '#E67E22', 'WML': '#8E44AD'
}
for name, result in [('SMB', smb_result), ('HML', hml_result),
('RMW', rmw_result), ('CMA', cma_result),
('WML', wml_result)]:
if result is None:
continue
fr = result['factor_returns']
cum = (1 + fr['factor_return']).cumprod()
ax.plot(cum.index, cum.values, color=factor_colors[name],
linewidth=2, label=name)
ax.axhline(y=1, color='gray', linewidth=0.5)
ax.set_ylabel('Cumulative Return')
ax.set_xlabel('Date')
ax.set_title('Vietnamese Factor Cumulative Returns (2×3 VW, HOSE Breakpoints)')
ax.legend(ncol=5)
ax.set_yscale('log')
plt.tight_layout()
plt.show()19.7 Step 5: Sensitivity to Construction Choices
The most important lesson in factor construction is that the resulting factor premium is not uniquely determined by the economic hypothesis; it depends substantially on implementation choices. We systematically vary each choice and examine how the factor changes.
19.7.1 Weighting: Value-Weighted vs. Equal-Weighted
sensitivity_results = {}
for name, signal, long_grp in [
('HML', 'bm', 'high'), ('RMW', 'op', 'high'), ('WML', 'ret_12_2', 'high')
]:
for wt in ['value', 'equal']:
result = construct_factor(
panel, signal_col=signal, long_group=long_grp,
weighting=wt, bp_universe='hose',
rebalance_freq='annual' if name != 'WML' else 'monthly'
)
if result:
d = result['diagnostics']
sensitivity_results[f"{name}_{wt}"] = {
'Factor': name, 'Weighting': wt,
'Ann. Return': d['ann_return'],
't-stat': d['t_stat']
}
sens_df = pd.DataFrame(sensitivity_results).T
print("VW vs EW Factor Returns:")
print(sens_df.round(3).to_string())
19.7.2 Breakpoint Universe: HOSE vs. All Stocks
for name, signal, long_grp in [
('HML', 'bm', 'high'), ('WML', 'ret_12_2', 'high')
]:
for bp in ['hose', 'all']:
result = construct_factor(
panel, signal_col=signal, long_group=long_grp,
bp_universe=bp, weighting='value',
rebalance_freq='annual' if name != 'WML' else 'monthly'
)
if result:
d = result['diagnostics']
sensitivity_results[f"{name}_bp_{bp}"] = {
'Factor': name, 'Breakpoints': bp,
'Ann. Return': d['ann_return'],
't-stat': d['t_stat'],
'Avg N': d['avg_stocks_per_portfolio']
}
bp_sens = pd.DataFrame({k: v for k, v in sensitivity_results.items()
if 'bp_' in k}).T
print("\nBreakpoint Universe Sensitivity:")
print(bp_sens.round(3).to_string())
19.7.3 Number of Groups: 2×3 vs. 5×5 vs. Deciles
group_configs = [
(2, 3, '2x3 (FF standard)'),
(2, 5, '2x5 (quintiles)'),
(1, 10, '1x10 (deciles)')
]
group_results = {}
for n_size, n_signal, label in group_configs:
result = construct_factor(
panel, signal_col='bm', long_group='high',
n_size_groups=n_size, n_signal_groups=n_signal,
weighting='value', bp_universe='hose',
rebalance_freq='annual'
)
if result:
d = result['diagnostics']
group_results[label] = {
'Ann. Return': d['ann_return'],
't-stat': d['t_stat'],
'Avg N per port': d['avg_stocks_per_portfolio'],
'Min N per port': d['min_stocks_per_portfolio']
}
print("HML: Sorting Granularity Sensitivity:")
print(pd.DataFrame(group_results).T.round(3).to_string())
19.7.4 Rebalancing Frequency
rebal_results = {}
for freq in ['annual', 'quarterly', 'monthly']:
result = construct_factor(
panel, signal_col='bm', long_group='high',
rebalance_freq=freq, weighting='value', bp_universe='hose'
)
if result:
d = result['diagnostics']
rebal_results[freq] = {
'Ann. Return': d['ann_return'],
'Ann. Vol': d['ann_vol'],
't-stat': d['t_stat']
}
print("HML: Rebalancing Frequency Sensitivity:")
print(pd.DataFrame(rebal_results).T.round(4).to_string())
19.7.5 Comprehensive Sensitivity Summary
# Systematic grid search for HML
configs = list(product(
['value', 'equal'], # Weighting
['hose', 'all'], # Breakpoint universe
[(2, 3), (2, 5), (1, 10)] # (n_size, n_signal)
))
grid_results = []
for wt, bp, (ns, nsig) in configs:
result = construct_factor(
panel, signal_col='bm', long_group='high',
weighting=wt, bp_universe=bp,
n_size_groups=ns, n_signal_groups=nsig,
rebalance_freq='annual'
)
if result:
d = result['diagnostics']
grid_results.append({
'Weighting': 'VW' if wt == 'value' else 'EW',
'Breakpoints': bp.upper(),
'Sort': f'{ns}x{nsig}',
'Ann. Return': d['ann_return'],
't-stat': d['t_stat']
})
grid_df = pd.DataFrame(grid_results)
print("HML Factor: Full Sensitivity Grid:")
print(grid_df.to_string(index=False))
# Pivot for heatmap
pivot = grid_df.pivot_table(
values='Ann. Return', index=['Weighting', 'Breakpoints'],
columns='Sort'
)
fig, ax = plt.subplots(figsize=(8, 5))
sns.heatmap(pivot * 100, annot=True, fmt='.1f', cmap='RdYlGn',
center=0, linewidths=0.5, ax=ax,
cbar_kws={'label': 'Ann. Return (%)'})
ax.set_title('HML Premium: Sensitivity to Construction Choices')
plt.tight_layout()
plt.show()
19.8 Factor Correlation Structure
# Merge all factor return series
factor_panel = pd.DataFrame()
for name, result in [('SMB', smb_result), ('HML', hml_result),
('RMW', rmw_result), ('CMA', cma_result),
('WML', wml_result)]:
if result is None:
continue
fr = result['factor_returns'].rename(columns={'factor_return': name})
if factor_panel.empty:
factor_panel = fr
else:
factor_panel = factor_panel.merge(fr, left_index=True,
right_index=True, how='outer')
corr = factor_panel.corr()
fig, ax = plt.subplots(figsize=(7, 6))
mask = np.triu(np.ones_like(corr, dtype=bool), k=1)
sns.heatmap(corr, mask=mask, annot=True, fmt='.2f', cmap='RdBu_r',
center=0, vmin=-1, vmax=1, square=True,
linewidths=0.5, ax=ax)
ax.set_title('Factor Return Correlations')
plt.tight_layout()
plt.show()
19.9 Factor Validation
A well-constructed factor should pass several diagnostic tests before being used in asset pricing research.
19.9.1 Diagnostic Checklist
def validate_factor(factor_result, name):
"""
Run standard diagnostic tests on a constructed factor.
"""
fr = factor_result['factor_returns']['factor_return']
diag = factor_result['diagnostics']
ports = factor_result['portfolios']
tests = {}
# 1. Statistical significance (t > 2)
tests['t-stat'] = diag['t_stat']
tests['t > 2'] = abs(diag['t_stat']) > 2.0
# 2. Economic magnitude
tests['Ann. Return'] = diag['ann_return']
tests['Ann. Vol'] = diag['ann_vol']
tests['Sharpe'] = diag['sharpe']
# 3. Adequate portfolio diversification
tests['Avg N per portfolio'] = diag['avg_stocks_per_portfolio']
tests['Min N per portfolio'] = diag['min_stocks_per_portfolio']
tests['Min N >= 5'] = diag['min_stocks_per_portfolio'] >= 5
# 4. Not dominated by a single month
tests['Max monthly return'] = fr.max()
tests['Min monthly return'] = fr.min()
tests['Fraction > 0'] = (fr > 0).mean()
# 5. Persistence (ACF at lag 1)
if len(fr) > 12:
tests['ACF(1)'] = fr.autocorr(lag=1)
# 6. Consistency across subperiods
mid = len(fr) // 2
first_half = fr.iloc[:mid]
second_half = fr.iloc[mid:]
tests['Return (1st half)'] = first_half.mean() * 12
tests['Return (2nd half)'] = second_half.mean() * 12
tests['Same sign both halves'] = (
np.sign(first_half.mean()) == np.sign(second_half.mean())
)
return tests
print("Factor Validation Summary:")
print("=" * 70)
for name, result in [('SMB', smb_result), ('HML', hml_result),
('RMW', rmw_result), ('CMA', cma_result),
('WML', wml_result)]:
if result is None:
continue
tests = validate_factor(result, name)
print(f"\n{name}:")
for test, value in tests.items():
if isinstance(value, bool):
status = 'PASS' if value else 'FAIL'
print(f" {test:<30}: {status}")
elif isinstance(value, float):
print(f" {test:<30}: {value:.4f}")
else:
print(f" {test:<30}: {value}")
19.9.2 Monotonicity Test
A factor built from a characteristic sort should produce monotonically increasing (or decreasing) average returns across quantiles. Violations of monotonicity suggest the signal-return relationship is nonlinear or absent.
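A simple numeric complement to the visual inspection is the Spearman rank correlation between decile rank and mean decile return: values near +1 (or -1 for signals where low is the long leg) indicate a monotone pattern. A sketch with hypothetical decile means:

```python
import numpy as np
from scipy import stats

# Hypothetical annualized mean returns (%) for decile portfolios D1..D10
decile_means = np.array([2.1, 3.0, 2.8, 4.5, 5.1, 5.0, 6.2, 7.4, 8.0, 9.3])

# Rank correlation between decile index and mean return:
# +1 is perfectly monotone increasing; values well below 1 flag violations
rho, pval = stats.spearmanr(np.arange(len(decile_means)), decile_means)
print(f"Spearman rho = {rho:.3f} (p = {pval:.4f})")
```

More formal alternatives exist, such as the Patton and Timmermann (2010) monotonic relation test, but the rank correlation is a quick first check that works directly on the decile means produced by the engine.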
fig, axes = plt.subplots(2, 3, figsize=(16, 10))
axes = axes.flatten()
signals = [
    ('bm', 'Book-to-Market', 'high'),
    ('op', 'Operating Profitability', 'high'),
    ('investment', 'Investment', 'low'),
    ('ret_12_2', 'Momentum (12-2)', 'high'),
    ('ivol', 'Idio. Volatility', 'low'),
]
for i, (sig, label, long_grp) in enumerate(signals):
    result = construct_factor(
        panel, signal_col=sig, long_group=long_grp,
        n_size_groups=1, n_signal_groups=10,
        weighting='value', bp_universe='hose',
        rebalance_freq='monthly' if sig == 'ret_12_2' else 'annual'
    )
    if result is None:
        continue
    port_ret = result['portfolio_returns']
    decile_means = (
        port_ret.groupby('signal_group')['port_return']
        .mean() * 12 * 100
    )
    colors_mono = plt.cm.RdYlGn_r(np.linspace(0.1, 0.9, len(decile_means)))
    axes[i].bar(range(len(decile_means)), decile_means.values,
                color=colors_mono, edgecolor='white')
    axes[i].set_xticks(range(len(decile_means)))
    axes[i].set_xticklabels([f'D{d+1}' for d in range(len(decile_means))],
                            fontsize=8)
    axes[i].set_ylabel('Ann. Return (%)')
    axes[i].set_title(label)
    axes[i].axhline(y=0, color='gray', linewidth=0.5)
# Only five signals: hide the unused sixth panel
axes[5].set_visible(False)
plt.suptitle('Decile Portfolio Returns by Signal', fontsize=14)
plt.tight_layout()
plt.show()

19.9.3 Spanning Tests
Does a new factor add information beyond existing factors? We test this by regressing each factor on all of the other factors and examining the intercept (alpha). A significant alpha indicates the factor earns average returns that cannot be explained by a linear combination of the others:
factor_names = [c for c in factor_panel.columns
                if factor_panel[c].notna().sum() > 24]
spanning_data = factor_panel[factor_names].dropna()
print("Spanning Tests (alpha = intercept when regressed on other factors):")
print(f"{'Factor':<8} {'Alpha (ann.)':>12} {'t-stat':>8} {'R²':>6}")
print("-" * 36)
for target in factor_names:
    others = [f for f in factor_names if f != target]
    y = spanning_data[target]
    X = sm.add_constant(spanning_data[others])
    model = sm.OLS(y, X).fit(cov_type='HAC', cov_kwds={'maxlags': 6})
    alpha_ann = model.params['const'] * 12
    alpha_t = model.tvalues['const']
    r2 = model.rsquared
    print(f"{target:<8} {alpha_ann:>12.4f} {alpha_t:>8.2f} {r2:>6.3f}")

19.10 Vietnamese-Specific Considerations
19.10.1 State-Owned Enterprise Classification
A unique feature of Vietnamese equities is the high proportion of state-owned enterprises (SOEs). The state retains majority or significant minority stakes in many listed firms, which affects governance, information environment, and trading dynamics. Factors may behave differently within SOE and non-SOE subsamples:
# Get SOE classification
firm_info = client.get_firm_info(
    exchanges=['HOSE', 'HNX'],
    fields=['ticker', 'state_ownership_pct', 'is_soe']
)
panel_soe = panel.merge(firm_info[['ticker', 'is_soe']], on='ticker', how='left')
for label, subset in [('SOE', panel_soe[panel_soe['is_soe'] == True]),
                      ('Non-SOE', panel_soe[panel_soe['is_soe'] == False])]:
    hml = construct_factor(
        subset, signal_col='bm', long_group='high',
        weighting='value', bp_universe='all',
        rebalance_freq='annual'
    )
    if hml is not None:
        d = hml['diagnostics']
        print(f"HML ({label}): Ann = {d['ann_return']:.4f}, "
              f"t = {d['t_stat']:.2f}, N = {d['avg_stocks_per_portfolio']:.0f}")

19.10.2 Foreign Ownership Limits
Foreign ownership caps (49% for most sectors, 30% for banking) affect the investable universe for international investors. Factors constructed from the full universe may not be achievable by foreign investors if the long leg concentrates in stocks at the foreign ownership limit:
fol = client.get_foreign_ownership(
    exchanges=['HOSE', 'HNX'],
    fields=['ticker', 'month_end', 'foreign_pct', 'foreign_limit_pct']
)
# Merge with HML portfolios
if hml_result is not None:
    hml_ports = hml_result['portfolios'].merge(
        fol, on=['ticker', 'month_end'], how='left'
    )
    # Flag stocks with less than 5 percentage points of foreign room left
    hml_ports['near_limit'] = (
        (hml_ports['foreign_limit_pct'] - hml_ports['foreign_pct']) < 5
    )
    fol_by_group = (
        hml_ports.groupby('signal_group')
        .agg(
            avg_foreign_pct=('foreign_pct', 'mean'),
            pct_near_limit=('near_limit', 'mean')
        )
    )
    print("Foreign Ownership in HML Portfolios:")
    print(fol_by_group.round(3))

19.11 Recommended Factor Specifications for Vietnam
Based on the sensitivity analysis, we recommend the following baseline specifications (Table 19.1).
| Choice | Recommendation | Rationale |
|---|---|---|
| Universe | Standard filter (listing age ≥ 6m, volume > 0) | Excludes shell firms without losing too much breadth |
| Breakpoints | HOSE stocks only | Prevents HNX micro-caps from dominating breakpoints |
| Size groups | 2 (median split) | Sufficient control with limited cross-section |
| Signal groups | 3 (terciles) for factor construction; 5 or 10 for portfolio analysis | A 2×3 sort keeps enough stocks in each portfolio for adequate diversification |
| Weighting | Value-weighted | Investable; less noisy than EW |
| Rebalancing | Annual (June) for accounting signals; monthly for momentum | Standard; consistent with Fama-French |
| Accounting lag | 4 months (available by April for Dec FY) | Conservative PIT alignment |
| Minimum stocks | ≥ 5 per portfolio per month | Below this, single-stock idiosyncratic risk dominates |
Always report results under the baseline and at least one alternative specification (e.g., EW, all-stock breakpoints) to demonstrate robustness.
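Reporting multiple specifications is easiest to organize as a small grid. The sketch below builds such a grid with itertools.product; the commented loop shows where the chapter's construct_factor() would be called for each cell (the specific grid values are illustrative, taken from Table 19.1's baseline and alternatives).

```python
from itertools import product
import pandas as pd

# Specification grid: baseline plus the alternatives from Table 19.1
grid = {
    'weighting': ['value', 'equal'],
    'bp_universe': ['hose', 'all'],
    'n_signal_groups': [3, 5],
}
specs = [dict(zip(grid, combo)) for combo in product(*grid.values())]
specs_df = pd.DataFrame(specs)
print(f"{len(specs_df)} specifications to evaluate")
print(specs_df)

# In practice, each row would be fed through the chapter's pipeline, e.g.:
# for spec in specs:
#     result = construct_factor(panel, signal_col='bm', long_group='high',
#                               rebalance_freq='annual', **spec)
#     # ...record result['diagnostics']['ann_return'] and ['t_stat']
```

Tabulating the annualized premium and t-statistic across all eight cells makes fragility immediately visible: a premium that survives only one row is not robust.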
19.12 Summary
This chapter has developed a modular, transparent factor construction framework for Vietnamese equities. The key insights are:
The Fama and French (1993) 2×3 methodology translates well to Vietnam with one critical adaptation: breakpoints should be computed from HOSE stocks only. Using all-stock breakpoints allows HNX micro-caps to dominate the small-stock groups, inflating apparent premia with economically untradeable returns.
Value weighting produces more conservative (and more implementable) factor premia than equal weighting. The difference is particularly large for signals correlated with size (BM, investment), where EW overweights the smallest stocks that contribute most to the premium but least to investable returns.
No single construction choice determines whether a factor “exists.” A factor premium that appears only under one specific combination of breakpoints, weighting, and rebalancing frequency is fragile and should be treated with skepticism. Robust factors survive a grid of specifications. The sensitivity analysis framework developed here (e.g., varying weighting, breakpoints, sort granularity, and rebalancing simultaneously) should be standard practice.
The construct_factor() function developed in this chapter is designed for reuse throughout the book. Any anomaly variable can be fed through the same pipeline, producing a tradeable factor with full diagnostics. This ensures methodological consistency across chapters and makes it easy to compare premia on an apples-to-apples basis.