import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy import stats
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')
plt.rcParams.update({
'figure.figsize': (12, 6),
'figure.dpi': 150,
'font.size': 11,
'axes.spines.top': False,
'axes.spines.right': False
})

13 Missing Data and Survivorship Bias
In this chapter, we document the patterns of missing data, survivorship bias, and delisting bias in Vietnamese equity markets, develop diagnostic tools to detect these problems, and implement correction methods that yield more reliable empirical results.
Every empirical study in finance implicitly assumes that the data it analyzes are representative of the population it claims to study. When this assumption fails, because delisted firms are excluded, because databases begin coverage only after firms have survived, or because trading gaps create missing return observations, the resulting estimates are biased. In the U.S. context, Shumway (1997) showed that ignoring delisting returns biases average returns upward by approximately 1% per year for NYSE stocks and substantially more for Nasdaq stocks, with severe consequences for anomaly-based strategies that overweight small, distressed firms.
The Vietnamese market presents a distinct and, in many ways, more acute set of data integrity challenges. The market is young. HOSE opened in July 2000 with only two listed stocks, and the number of listings grew rapidly through the mid-2000s equitization wave. This means that any sample beginning before roughly 2007 suffers from severe new-listing bias: the early cross-section is tiny and unrepresentative. Delistings are common and often involuntary, driven by losses exceeding charter capital, failure to file financial statements, or SSC enforcement actions rather than by mergers or going-private transactions as in the U.S. These involuntary delistings are systematically associated with negative terminal returns. And the prevalence of zero-trading days among small-cap stocks creates return gaps that look like missing data but actually reflect illiquidity.
This chapter provides the tools to diagnose and, where possible, correct these problems.
13.1 Taxonomy of Data Problems
Missing data in financial research is not monolithic. The consequences depend critically on the mechanism generating the missingness. Rubin (1976) and Little and Rubin (2019) classify missing data into three types:
- Missing Completely at Random (MCAR). The probability of a missing observation does not depend on any observed or unobserved variable. Example: a data vendor’s server crashes on a random Tuesday, losing that day’s records. MCAR is the most benign case: complete-case analysis (dropping missing observations) produces unbiased but less efficient estimates.
- Missing at Random (MAR). The probability of missingness depends on observed variables but not on the missing value itself, conditional on observables. Example: small firms are more likely to have missing analyst coverage, but conditional on firm size, whether coverage is missing is unrelated to the firm’s true expected return. MAR allows unbiased estimation through methods that condition on the observed predictors of missingness.
- Missing Not at Random (MNAR). The probability of missingness depends on the missing value itself. Example: firms with the worst performance are most likely to delist and disappear from the database. MNAR is a pathological case and, unfortunately, the most common in financial data. Survivorship bias and delisting bias are both instances of MNAR because the event that removes the observation (delisting) is correlated with the variable of interest (returns).
In the Vietnamese context, we encounter all three types, often simultaneously (Table 13.1).
| Data Problem | Missingness Type | Mechanism in Vietnam |
|---|---|---|
| Zero-trading days | MAR/MNAR | Small/illiquid stocks; correlated with returns |
| Price limit hits | MNAR | True return truncated at limit; observed return censored |
| Delisting | MNAR | Worst-performing firms exit; returns disappear |
| Late listing coverage | Selection bias | Database begins after firm survives initial period |
| Exchange transfers | Administrative | HOSE→HNX or UPCoM transfers break ticker continuity |
| Suspended trading | MNAR | Suspension precedes negative events; returns missing |
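The practical consequences of the three mechanisms can be seen in a small simulation (synthetic returns, not DataCore output): complete-case means survive MCAR and MAR missingness but are sharply inflated under MNAR, where the worst returns are the most likely to disappear.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000
r = rng.normal(0.01, 0.10, n)            # true monthly returns, mean 1%

# MCAR: 20% of observations vanish at random
mcar = r[rng.random(n) > 0.20]

# MAR: missingness driven by an observed covariate (size), not by r itself;
# size is independent of r here, so even the unconditional mean survives
size = rng.normal(0.0, 1.0, n)
mar = r[~((size < -0.5) & (rng.random(n) < 0.5))]

# MNAR: low returns raise the missingness probability (delisting-like)
p_miss = np.clip(0.5 - 2.0 * r, 0.0, 1.0)
mnar = r[rng.random(n) > p_miss]

for name, sample in [('MCAR', mcar), ('MAR', mar), ('MNAR', mnar)]:
    print(f"{name}: complete-case mean = {sample.mean():.4f} "
          f"(true mean = {r.mean():.4f})")
```

Only the MNAR sample's mean departs materially from the true mean, which is exactly the pattern survivorship and delisting bias produce in real panels.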
13.2 Data Construction
from datacore import DataCoreClient
client = DataCoreClient()
# Complete listing history: includes all firms ever listed, not just current
listing_history = client.get_listing_history(
exchanges=['HOSE', 'HNX', 'UPCoM'],
include_delisted=True,
fields=[
'ticker', 'company_name', 'exchange', 'listing_date',
'delisting_date', 'delisting_reason', 'is_active',
'transfer_from', 'transfer_to', 'transfer_date',
'ipo_date', 'equitization_date', 'sector'
]
)
# Daily returns: includes delisted firms' full history
daily_returns = client.get_daily_prices(
exchanges=['HOSE', 'HNX', 'UPCoM'],
start_date='2000-07-28', # HOSE opening date
end_date='2024-12-31',
include_delisted=True, # Critical flag
fields=[
'ticker', 'date', 'close', 'adjusted_close', 'volume',
'turnover_value', 'market_cap', 'shares_outstanding',
'price_limit_hit' # +1 = limit up, -1 = limit down, 0 = neither
]
)
# Monthly returns (pre-computed, survivorship-bias-free)
monthly_returns = client.get_monthly_returns(
exchanges=['HOSE', 'HNX', 'UPCoM'],
start_date='2000-07-28',
end_date='2024-12-31',
include_delisted=True,
fields=[
'ticker', 'month_end', 'monthly_return', 'market_cap',
'volume_avg_20d', 'n_trading_days', 'n_zero_volume_days'
]
)
print(f"Listing history: {listing_history.shape[0]:,} firms")
print(f" Active: {listing_history['is_active'].sum():,}")
print(f" Delisted: {(~listing_history['is_active']).sum():,}")
print(f"Daily observations: {daily_returns.shape[0]:,}")
print(f"Monthly observations: {monthly_returns.shape[0]:,}")

13.3 Listing Dynamics in Vietnam
13.3.1 The Growth of the Vietnamese Market
The Vietnamese stock market’s short history creates a distinctive pattern: the investable universe has grown from near-zero to over 1,500 listed firms in approximately two decades. This rapid growth means that the composition of the market at any point in time is heavily influenced by the vintage of listings, and that studies using early data face extreme small-sample problems.
listing_history['listing_date'] = pd.to_datetime(listing_history['listing_date'])
listing_history['delisting_date'] = pd.to_datetime(listing_history['delisting_date'])
# Count active listings at each month-end
months = pd.date_range('2000-07-01', '2024-12-31', freq='M')
active_counts = []
for month in months:
for exchange in ['HOSE', 'HNX', 'UPCoM']:
active = listing_history[
(listing_history['exchange'] == exchange) &
(listing_history['listing_date'] <= month) &
((listing_history['delisting_date'].isna()) |
(listing_history['delisting_date'] > month))
]
active_counts.append({
'month': month,
'exchange': exchange,
'n_active': len(active)
})
active_df = pd.DataFrame(active_counts)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Panel A: Active listings over time
for exchange, color in [('HOSE', '#2C5F8A'), ('HNX', '#E67E22'),
('UPCoM', '#27AE60')]:
subset = active_df[active_df['exchange'] == exchange]
axes[0].plot(subset['month'], subset['n_active'],
color=color, linewidth=2, label=exchange)
axes[0].set_xlabel('Date')
axes[0].set_ylabel('Number of Active Listings')
axes[0].set_title('Panel A: Active Listings by Exchange')
axes[0].legend()
# Panel B: Annual listings and delistings
listing_history['listing_year'] = listing_history['listing_date'].dt.year
listing_history['delisting_year'] = listing_history['delisting_date'].dt.year
annual_listings = (
listing_history
.groupby('listing_year')
.size()
.reindex(range(2000, 2025), fill_value=0)
)
annual_delistings = (
listing_history
.dropna(subset=['delisting_year'])
.groupby('delisting_year')
.size()
.reindex(range(2000, 2025), fill_value=0)
)
x = np.arange(2000, 2025)
axes[1].bar(x - 0.2, annual_listings.values, width=0.4,
color='#27AE60', alpha=0.85, label='New Listings')
axes[1].bar(x + 0.2, annual_delistings.values, width=0.4,
color='#C0392B', alpha=0.85, label='Delistings')
axes[1].set_xlabel('Year')
axes[1].set_ylabel('Number of Firms')
axes[1].set_title('Panel B: Annual Listings and Delistings')
axes[1].legend()
plt.tight_layout()
plt.show()

13.3.2 Delisting Reasons
Vietnamese delistings are not homogeneous. The SSC mandates delisting for specific regulatory violations, but firms may also voluntarily delist, merge, or transfer between exchanges. The reason for delisting matters because it determines the likely terminal return.
delisted = listing_history[listing_history['delisting_date'].notna()].copy()
# Standardize delisting reasons into categories
reason_map = {
'losses_exceed_charter': 'Involuntary - Financial Distress',
'bankruptcy': 'Involuntary - Financial Distress',
'failure_to_file': 'Involuntary - Regulatory',
'audit_qualification': 'Involuntary - Regulatory',
'ssc_enforcement': 'Involuntary - Regulatory',
'merger': 'Voluntary - M&A',
'going_private': 'Voluntary - Going Private',
'transfer_exchange': 'Transfer',
'voluntary': 'Voluntary - Other',
'other': 'Other/Unknown'
}
delisted['reason_category'] = (
delisted['delisting_reason']
.map(reason_map)
.fillna('Other/Unknown')
)
# Tabulate
reason_counts = (
delisted['reason_category']
.value_counts()
.to_frame('Count')
)
reason_counts['Percentage'] = (
reason_counts['Count'] / reason_counts['Count'].sum() * 100
)
print("Delisting Reasons:")
print(reason_counts.round(1).to_string())

fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Panel A: Pie chart
colors_pie = ['#C0392B', '#E74C3C', '#8E44AD', '#27AE60',
'#2C5F8A', '#F1C40F', '#BDC3C7']
axes[0].pie(reason_counts['Count'], labels=reason_counts.index,
colors=colors_pie[:len(reason_counts)],
autopct='%1.0f%%', startangle=90, textprops={'fontsize': 8})
axes[0].set_title('Panel A: Delisting Reasons')
# Panel B: Delisting reasons over time
delisted['year'] = delisted['delisting_date'].dt.year
reason_by_year = pd.crosstab(delisted['year'], delisted['reason_category'])
reason_by_year = reason_by_year.reindex(range(2000, 2025), fill_value=0)
reason_by_year.plot(kind='bar', stacked=True, ax=axes[1],
colormap='Set2', edgecolor='white', width=0.8)
axes[1].set_xlabel('Year')
axes[1].set_ylabel('Number of Delistings')
axes[1].set_title('Panel B: Delisting Reasons Over Time')
axes[1].legend(fontsize=7, loc='upper left')
plt.tight_layout()
plt.show()

13.3.3 Firm Characteristics at Delisting
Do delisted firms differ systematically from survivors? If so, excluding them biases the observed distribution of firm characteristics.
# Get fundamentals in the last available year before delisting
last_year_delisted = (
delisted[['ticker', 'delisting_date']]
.assign(last_fy=lambda x: x['delisting_date'].dt.year - 1)
)
fundamentals = client.get_fundamentals(
exchanges=['HOSE', 'HNX', 'UPCoM'],
start_date='2005-01-01',
end_date='2024-12-31',
include_delisted=True,
fields=[
'ticker', 'fiscal_year', 'total_assets', 'net_income',
'total_equity', 'revenue', 'market_cap'
]
)
# Characteristics of delisted firms (last year before delisting)
delist_chars = (
last_year_delisted
.merge(fundamentals.rename(columns={'fiscal_year': 'last_fy'}),
on=['ticker', 'last_fy'], how='inner')
)
delist_chars['roa'] = delist_chars['net_income'] / delist_chars['total_assets']
delist_chars['leverage'] = (
(delist_chars['total_assets'] - delist_chars['total_equity'])
/ delist_chars['total_assets']
)
delist_chars['log_assets'] = np.log(delist_chars['total_assets'])
delist_chars['group'] = 'Delisted'
# Characteristics of all active firms (pooled)
all_chars = fundamentals.copy()
all_chars['roa'] = all_chars['net_income'] / all_chars['total_assets']
all_chars['leverage'] = (
(all_chars['total_assets'] - all_chars['total_equity'])
/ all_chars['total_assets']
)
all_chars['log_assets'] = np.log(all_chars['total_assets'])
all_chars['group'] = 'All Active'
# Compare distributions
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
variables = [
('log_assets', 'Log Total Assets', axes[0, 0]),
('roa', 'Return on Assets', axes[0, 1]),
('leverage', 'Leverage Ratio', axes[1, 0]),
]
for col, label, ax in variables:
for grp, color in [('All Active', '#2C5F8A'), ('Delisted', '#C0392B')]:
if grp == 'Delisted':
data = delist_chars[col].dropna()
else:
data = all_chars[col].dropna()
data = data[np.isfinite(data)]
ax.hist(data, bins=50, density=True, alpha=0.5,
color=color, label=grp, edgecolor='white')
ax.set_xlabel(label)
ax.set_ylabel('Density')
ax.legend()
# Panel D: Market cap distribution
for grp, color in [('All Active', '#2C5F8A'), ('Delisted', '#C0392B')]:
if grp == 'Delisted':
data = np.log(delist_chars['market_cap'].dropna())
else:
data = np.log(all_chars['market_cap'].dropna())
data = data[np.isfinite(data)]
axes[1, 1].hist(data, bins=50, density=True, alpha=0.5,
color=color, label=grp, edgecolor='white')
axes[1, 1].set_xlabel('Log Market Cap')
axes[1, 1].set_ylabel('Density')
axes[1, 1].legend()
plt.suptitle('Characteristics of Delisted vs Active Firms', fontsize=14)
plt.tight_layout()
plt.show()
# Formal comparison
print("\nMean Comparison (Delisted vs All Active):")
for col in ['log_assets', 'roa', 'leverage']:
d = delist_chars[col].dropna()
a = all_chars[col].dropna()
d = d[np.isfinite(d)]
a = a[np.isfinite(a)]
t, p = stats.ttest_ind(d, a, equal_var=False)
print(f" {col:<15}: Delisted = {d.mean():.3f}, "
          f"Active = {a.mean():.3f}, t = {t:.2f}, p = {p:.4f}")

13.4 Survivorship Bias
13.4.1 Definition and Magnitude
Survivorship bias arises when a study uses only firms that are currently listed (or listed at the end of the sample), excluding firms that delisted during the sample period. Because delisted firms disproportionately experienced negative returns before delisting, their exclusion inflates average returns, understates risk, and distorts cross-sectional patterns.
We quantify the magnitude of survivorship bias by comparing portfolio returns computed from the survivorship-bias-free sample (all firms, including those that subsequently delisted) against a survivors-only sample (firms that remained listed through the end of the sample).
# Define survivors: firms active as of 2024-12-31
survivors = set(
listing_history[listing_history['is_active']]['ticker']
)
# Full sample: all firms, including delisted
full_sample = monthly_returns.copy()
# Survivors only: restrict to firms still listed at end of sample
survivors_only = monthly_returns[
monthly_returns['ticker'].isin(survivors)
].copy()
# Compute EW monthly portfolio returns
def compute_ew_portfolio(df):
return (
df
.groupby('month_end')['monthly_return']
.mean()
.to_frame('portfolio_return')
)
def compute_vw_portfolio(df):
return (
df
.groupby('month_end')
.apply(lambda g: np.average(g['monthly_return'],
weights=g['market_cap'])
if g['market_cap'].sum() > 0 else np.nan)
.to_frame('portfolio_return')
)
ew_full = compute_ew_portfolio(full_sample)
ew_survivors = compute_ew_portfolio(survivors_only)
vw_full = compute_vw_portfolio(full_sample)
vw_survivors = compute_vw_portfolio(survivors_only)
# Merge and compute bias
bias_ew = pd.merge(
ew_full.rename(columns={'portfolio_return': 'full'}),
ew_survivors.rename(columns={'portfolio_return': 'survivors'}),
left_index=True, right_index=True
)
bias_ew['bias'] = bias_ew['survivors'] - bias_ew['full']
bias_vw = pd.merge(
vw_full.rename(columns={'portfolio_return': 'full'}),
vw_survivors.rename(columns={'portfolio_return': 'survivors'}),
left_index=True, right_index=True
)
bias_vw['bias'] = bias_vw['survivors'] - bias_vw['full']
print("Survivorship Bias (Annualized):")
print(f"  EW: {bias_ew['bias'].mean() * 12:.4f} "
      f"({bias_ew['bias'].mean() * 1200:.2f}% per year)")
print(f"  VW: {bias_vw['bias'].mean() * 12:.4f} "
      f"({bias_vw['bias'].mean() * 1200:.2f}% per year)")

fig, axes = plt.subplots(1, 2, figsize=(14, 5))
for i, (bias_df, title) in enumerate(
[(bias_ew, 'Panel A: Equal-Weighted'),
(bias_vw, 'Panel B: Value-Weighted')]
):
cum_full = (1 + bias_df['full']).cumprod()
cum_surv = (1 + bias_df['survivors']).cumprod()
axes[i].plot(cum_full.index, cum_full,
color='#2C5F8A', linewidth=2, label='Full Sample')
axes[i].plot(cum_surv.index, cum_surv,
color='#C0392B', linewidth=2, label='Survivors Only')
axes[i].set_ylabel('Cumulative Wealth')
axes[i].set_xlabel('Date')
axes[i].set_title(title)
axes[i].legend()
axes[i].set_yscale('log')
ann_bias = bias_df['bias'].mean() * 12
axes[i].text(0.05, 0.95,
f'Annual Bias: {ann_bias*100:.1f}%',
transform=axes[i].transAxes, fontsize=11,
verticalalignment='top',
bbox=dict(facecolor='white', alpha=0.8))
plt.tight_layout()
plt.show()

13.4.2 Time-Varying Survivorship Bias
The magnitude of survivorship bias is not constant. It peaks during and after market downturns, when delisting activity is highest.
bias_ew['rolling_bias_12m'] = bias_ew['bias'].rolling(12).mean() * 12
fig, axes = plt.subplots(2, 1, figsize=(14, 8), height_ratios=[2, 1])
# Panel A: Rolling bias
axes[0].fill_between(
bias_ew.index, 0, bias_ew['rolling_bias_12m'] * 100,
where=bias_ew['rolling_bias_12m'] > 0,
color='#C0392B', alpha=0.4
)
axes[0].plot(bias_ew.index, bias_ew['rolling_bias_12m'] * 100,
color='#C0392B', linewidth=1.5)
axes[0].axhline(y=0, color='gray', linewidth=0.5)
axes[0].set_ylabel('Annualized Bias (%)')
axes[0].set_title('Panel A: Rolling 12-Month Survivorship Bias (EW)')
# Panel B: Number of delistings per quarter
delistings_quarterly = (
delisted
.set_index('delisting_date')
.resample('Q')
.size()
)
axes[1].bar(delistings_quarterly.index, delistings_quarterly.values,
width=80, color='#2C5F8A', alpha=0.7)
axes[1].set_ylabel('Delistings per Quarter')
axes[1].set_xlabel('Date')
axes[1].set_title('Panel B: Quarterly Delisting Activity')
plt.tight_layout()
plt.show()

13.4.3 Survivorship Bias in Cross-Sectional Anomalies
The bias is not uniform across strategies. Anomalies that overweight small, distressed, or low-quality firms, precisely the firms most likely to delist, are most severely affected. We test this for the size, value, and momentum anomalies.
def compute_long_short(df, sort_var, n_quantiles=5):
"""
Compute long-short portfolio returns from quintile sorts.
Long = top quintile, Short = bottom quintile.
"""
results = []
for month, group in df.groupby('month_end'):
group = group.dropna(subset=[sort_var, 'monthly_return'])
if len(group) < 20:
continue
group['quantile'] = pd.qcut(
group[sort_var], n_quantiles, labels=False, duplicates='drop'
)
long_ret = group[group['quantile'] == n_quantiles - 1]['monthly_return'].mean()
short_ret = group[group['quantile'] == 0]['monthly_return'].mean()
results.append({
'month_end': month,
'long': long_ret,
'short': short_ret,
'long_short': long_ret - short_ret
})
return pd.DataFrame(results)
# Prepare sort variables (naive same-calendar-year merge of fundamentals;
# the reporting-lag look-ahead this creates is addressed in Section 13.7)
monthly_with_chars = monthly_returns.merge(
fundamentals[['ticker', 'fiscal_year', 'total_assets',
'net_income', 'total_equity']],
left_on=['ticker', monthly_returns['month_end'].dt.year],
right_on=['ticker', 'fiscal_year'],
how='left'
)
monthly_with_chars['log_mcap'] = np.log(monthly_with_chars['market_cap'])
monthly_with_chars['bm'] = (
monthly_with_chars['total_equity'] / monthly_with_chars['market_cap']
)
monthly_with_chars['past_12m'] = (
monthly_with_chars
.groupby('ticker')['monthly_return']
.transform(lambda x: x.rolling(12).sum())
)
# Compute anomalies on full sample and survivors only
anomaly_bias = {}
for anomaly, sort_var, flip_sign in [
    ('Size (SMB)', 'log_mcap', True),    # SMB is long small firms
    ('Value (HML)', 'bm', False),
    ('Momentum (WML)', 'past_12m', False)
]:
    full_ls = compute_long_short(monthly_with_chars, sort_var)
    surv_data = monthly_with_chars[
        monthly_with_chars['ticker'].isin(survivors)
    ]
    surv_ls = compute_long_short(surv_data, sort_var)
    # Top-minus-bottom on log_mcap is big-minus-small; flip to match SMB
    if flip_sign:
        full_ls['long_short'] = -full_ls['long_short']
        surv_ls['long_short'] = -surv_ls['long_short']
# Merge
merged = pd.merge(
full_ls[['month_end', 'long_short']].rename(
columns={'long_short': 'full'}),
surv_ls[['month_end', 'long_short']].rename(
columns={'long_short': 'survivors'}),
on='month_end'
)
merged['bias'] = merged['survivors'] - merged['full']
ann_full = merged['full'].mean() * 12
ann_surv = merged['survivors'].mean() * 12
ann_bias = merged['bias'].mean() * 12
anomaly_bias[anomaly] = {
'Full Sample (ann.)': ann_full,
'Survivors Only (ann.)': ann_surv,
'Bias (ann.)': ann_bias,
'Bias (% of premium)': ann_bias / ann_full * 100 if ann_full != 0 else np.nan
}
anomaly_bias_df = pd.DataFrame(anomaly_bias).T
print("Survivorship Bias by Anomaly:")
print(anomaly_bias_df.round(4).to_string())

13.5 Delisting Bias and Return Imputation
13.5.1 The Shumway Correction
Shumway (1997) showed that CRSP’s treatment of delisting returns, often recording them as missing or zero, creates a systematic upward bias in average returns. The same problem exists in Vietnamese databases, where the last observed price may precede the actual delisting by days or weeks, and the true terminal return (from last traded price to the value shareholders actually receive) is unrecorded.
We implement a delisting return imputation procedure adapted for Vietnam:
Step 1. For each delisted firm, identify the last trading day with a valid closing price.
Step 2. Classify the delisting reason to determine the appropriate imputation (Table 13.2).
| Delisting Reason | Imputed Return | Rationale |
|---|---|---|
| M&A / Acquisition | Actual tender offer premium (if available) | Acquisition at premium |
| Going private | 0% (or actual buyout price) | Negotiated exit |
| Financial distress | −30% to −100% | Substantial loss of value |
| Regulatory violation | −50% | Partial loss; some recovery possible |
| Exchange transfer | 0% (link to new ticker) | No economic event |
Step 3. Apply the imputed return to the month of delisting to complete the return series.
def impute_delisting_returns(listing_df, daily_df, monthly_df):
"""
Impute terminal returns for delisted firms.
Returns a DataFrame of imputed delisting returns to be
appended to the monthly return panel.
"""
delisted_firms = listing_df[listing_df['delisting_date'].notna()].copy()
imputed = []
for _, firm in delisted_firms.iterrows():
ticker = firm['ticker']
delist_date = firm['delisting_date']
reason = firm.get('reason_category', firm.get('delisting_reason', ''))
# Find last trading day
firm_daily = daily_df[daily_df['ticker'] == ticker].sort_values('date')
if len(firm_daily) == 0:
continue
last_trade = firm_daily.iloc[-1]
last_price = last_trade['adjusted_close']
last_date = last_trade['date']
# Check if last trade is already close to delisting date
gap_days = (pd.Timestamp(delist_date) - pd.Timestamp(last_date)).days
if gap_days < 0:
continue # Data issue
# Determine imputation based on reason
if 'M&A' in str(reason) or 'merger' in str(reason).lower():
imputed_return = 0.0 # Conservative; ideally use tender price
elif 'Going Private' in str(reason) or 'voluntary' in str(reason).lower():
imputed_return = 0.0
elif 'Transfer' in str(reason):
imputed_return = 0.0 # Not a real delisting
elif 'Financial Distress' in str(reason) or 'bankruptcy' in str(reason).lower():
imputed_return = -0.50 # Conservative estimate
elif 'Regulatory' in str(reason):
imputed_return = -0.30
else:
imputed_return = -0.30 # Default for unknown reasons
# Assign to the delisting month
delist_month = pd.Timestamp(delist_date).to_period('M').to_timestamp()
imputed.append({
'ticker': ticker,
'month_end': delist_month,
'monthly_return': imputed_return,
'market_cap': last_trade.get('market_cap', np.nan),
'source': 'imputed_delisting',
'delisting_reason': reason,
'gap_days': gap_days
})
return pd.DataFrame(imputed)
# Apply imputation ('delisted' already carries reason_category from above)
imputed_returns = impute_delisting_returns(
    delisted, daily_returns, monthly_returns
)
print(f"Imputed delisting returns: {len(imputed_returns)}")
print(f"\nImputed return distribution:")
print(imputed_returns['monthly_return'].value_counts().sort_index())

13.5.2 Impact of Delisting Return Imputation
# Augmented sample: monthly returns + imputed delisting returns
augmented = pd.concat([
monthly_returns[['ticker', 'month_end', 'monthly_return', 'market_cap']],
imputed_returns[['ticker', 'month_end', 'monthly_return', 'market_cap']]
], ignore_index=True)
# Compare original vs augmented EW portfolios
ew_original = compute_ew_portfolio(monthly_returns)
ew_augmented = compute_ew_portfolio(augmented)
comparison = pd.merge(
ew_original.rename(columns={'portfolio_return': 'original'}),
ew_augmented.rename(columns={'portfolio_return': 'augmented'}),
left_index=True, right_index=True
)
comparison['imputation_effect'] = (
comparison['augmented'] - comparison['original']
)
ann_original = comparison['original'].mean() * 12
ann_augmented = comparison['augmented'].mean() * 12
ann_effect = comparison['imputation_effect'].mean() * 12
print("Delisting Return Imputation Impact:")
print(f" EW without imputation: {ann_original:.4f} ({ann_original*100:.2f}%/yr)")
print(f" EW with imputation: {ann_augmented:.4f} ({ann_augmented*100:.2f}%/yr)")
print(f"  Difference: {ann_effect:.4f} ({ann_effect*100:.2f}%/yr)")

13.6 Zero-Trading Days and Illiquidity Gaps
13.6.1 Prevalence of Zero-Trading Days
A distinctive feature of Vietnamese equity data is the high frequency of zero-volume days (i.e., days on which a listed stock records no trades). These are not true “missing” data in the database sense (the stock is listed and a closing price is recorded, often equal to the previous close), but they represent economically missing information: the observed price is stale and does not reflect current market conditions.
# Compute zero-volume fraction per firm-year
daily_returns['year'] = pd.to_datetime(daily_returns['date']).dt.year
daily_returns['zero_volume'] = (daily_returns['volume'] == 0).astype(int)
zero_vol_fy = (
daily_returns
.groupby(['ticker', 'year'])
.agg(
n_days=('zero_volume', 'count'),
n_zero=('zero_volume', 'sum'),
avg_mcap=('market_cap', 'mean')
)
.reset_index()
)
zero_vol_fy['zero_frac'] = zero_vol_fy['n_zero'] / zero_vol_fy['n_days']
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Panel A: Distribution over time (boxplot by year)
years_to_plot = range(2008, 2025)
data_by_year = [
zero_vol_fy[zero_vol_fy['year'] == y]['zero_frac'].dropna().values
for y in years_to_plot
]
bp = axes[0].boxplot(data_by_year, positions=range(len(years_to_plot)),
widths=0.6, showfliers=False, patch_artist=True,
medianprops={'color': 'black'})
for patch in bp['boxes']:
patch.set_facecolor('#2C5F8A')
patch.set_alpha(0.6)
axes[0].set_xticks(range(len(years_to_plot)))
axes[0].set_xticklabels(years_to_plot, rotation=45, fontsize=8)
axes[0].set_ylabel('Zero-Volume Fraction')
axes[0].set_title('Panel A: Zero-Volume Days by Year')
# Panel B: By market cap decile
zero_vol_fy['mcap_decile'] = pd.qcut(
zero_vol_fy['avg_mcap'].rank(method='first'),
10, labels=[f'D{i}' for i in range(1, 11)]
)
decile_zero = (
zero_vol_fy
.groupby('mcap_decile')['zero_frac']
.agg(['mean', 'median'])
)
axes[1].bar(range(10), decile_zero['mean'],
color='#2C5F8A', alpha=0.85, edgecolor='white')
axes[1].set_xticks(range(10))
axes[1].set_xticklabels(decile_zero.index)
axes[1].set_xlabel('Market Cap Decile (D1 = smallest)')
axes[1].set_ylabel('Mean Zero-Volume Fraction')
axes[1].set_title('Panel B: Zero-Volume Days by Size')
plt.tight_layout()
plt.show()

13.6.2 Return Measurement During Zero-Trading Periods
When a stock does not trade, the standard approach, using the last available closing price, produces a stale price that understates true volatility and biases returns toward zero. Several approaches exist to handle this:
Approach 1: Drop zero-volume observations. Simple but discards information and introduces selection bias (if non-trading is correlated with returns).
Approach 2: Multi-day compounding. Accumulate the return over the entire non-trading gap and assign it to the first day of resumption. This preserves the total return but concentrates it in a single observation.
Approach 3: Distribute uniformly. Spread the accumulated return evenly across zero-volume days. This is economically unrealistic, but it reduces the impact of single-day outliers.
Approach 4: Treat as missing and model. Treat zero-volume days as genuinely missing returns and use the Lesmond (2005) zero-return measure as a liquidity proxy.
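Before implementing the full correction, the arithmetic of Approaches 2 and 3 can be checked on toy numbers: a stock closes at 10.0, records two zero-volume days, then trades at 10.6, for a three-day return window.

```python
p_before, p_after = 10.0, 10.6
total_ret = p_after / p_before - 1            # +6.0% over the gap
n_days = 3                                    # two stale days + resumption day

# Approach 2: the whole gap return lands on the resumption day
compound_day_return = total_ret

# Approach 3: spread geometrically so the daily returns compound to the total
distributed = (1 + total_ret) ** (1 / n_days) - 1

print(f"compound:   {compound_day_return:.4f} on resumption day")
print(f"distribute: {distributed:.4f} per day for {n_days} days")
```

Both approaches preserve the total return over the window; they differ only in where the variance is recognized, which is why Approach 2 produces fat-tailed daily return distributions for illiquid stocks.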
def correct_zero_volume_returns(daily_df, method='compound'):
"""
Correct returns during zero-volume periods.
Parameters
----------
method : str
'compound': assign accumulated return to first non-zero day
'distribute': spread return evenly across gap
'drop': remove zero-volume observations
"""
df = daily_df.copy()
df = df.sort_values(['ticker', 'date'])
df['daily_return'] = (
df.groupby('ticker')['adjusted_close']
.pct_change()
)
if method == 'drop':
return df[df['volume'] > 0]
elif method == 'compound':
# For each zero-volume streak, accumulate return and
# assign to the next trading day
results = []
for ticker, group in df.groupby('ticker'):
group = group.sort_values('date').reset_index(drop=True)
accumulated = 0
gap_length = 0
        for idx, row in group.iterrows():
            # Note: `row['daily_return'] or 0` would NOT guard against NaN
            # (NaN is truthy), so test explicitly with pd.notna
            r = row['daily_return'] if pd.notna(row['daily_return']) else 0.0
            if row['volume'] == 0:
                # Compound (rather than add) returns across the streak
                accumulated = (1 + accumulated) * (1 + r) - 1
                gap_length += 1
            else:
                if gap_length > 0:
                    # Fold the accumulated gap return into this day's return
                    group.loc[idx, 'daily_return'] = (
                        (1 + accumulated) * (1 + r) - 1
                    )
                    accumulated = 0
                    gap_length = 0
                results.append(group.loc[idx])
# If series ends with zero-volume days, include last non-zero
if gap_length > 0 and len(results) > 0:
last_valid = results[-1].copy()
last_valid['daily_return'] = (
(1 + last_valid['daily_return']) * (1 + accumulated) - 1
)
results[-1] = last_valid
return pd.DataFrame(results)
elif method == 'distribute':
results = []
for ticker, group in df.groupby('ticker'):
group = group.sort_values('date').reset_index(drop=True)
i = 0
while i < len(group):
if group.loc[i, 'volume'] > 0:
results.append(group.loc[i])
i += 1
else:
# Find end of zero-volume streak
j = i
while j < len(group) and group.loc[j, 'volume'] == 0:
j += 1
# Total return over gap
if j < len(group):
total_ret = (
group.loc[j, 'adjusted_close']
/ group.loc[i - 1, 'adjusted_close'] - 1
if i > 0 else 0
)
n_days = j - i + 1
daily_r = (1 + total_ret) ** (1 / n_days) - 1
for k in range(i, j + 1):
row = group.loc[k].copy()
row['daily_return'] = daily_r
results.append(row)
i = j + 1
return pd.DataFrame(results)
# Apply corrections and compare
for method in ['drop', 'compound', 'distribute']:
corrected = correct_zero_volume_returns(
daily_returns.head(500000), method=method
)
mean_ret = corrected['daily_return'].mean() * 252
vol = corrected['daily_return'].std() * np.sqrt(252)
print(f"{method:<12}: Ann. Return = {mean_ret:.4f}, "
          f"Ann. Vol = {vol:.4f}, N = {len(corrected):,}")

13.7 Look-Ahead Bias
13.7.1 Definition
Look-ahead bias occurs when a study uses information that was not available at the time the investment decision would have been made. In the Vietnamese context, the most common sources are:
- Conditioning on survival. Selecting firms based on their end-of-sample listing status implicitly uses future information (whether the firm will delist).
- Using revised financial data. Vietnamese firms often restate financial statements after audit. Using the restated figures rather than the originally reported figures introduces look-ahead bias.
- Backfill bias. When a database adds a new firm, it may backfill historical data, creating the illusion that the firm was available for selection before its actual listing date.
- Ignoring reporting lags. Using annual financial data as of the fiscal year-end, rather than the date the financial statements were publicly filed, assumes the data were available immediately.
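A simple diagnostic for backfill bias, sketched here on hypothetical tickers and dates, compares each firm's first price observation with its official listing date; coverage that begins before listing is a red flag.

```python
import pandas as pd

listings = pd.DataFrame({
    'ticker': ['AAA', 'BBB'],
    'listing_date': pd.to_datetime(['2015-03-02', '2018-06-01']),
})
prices = pd.DataFrame({
    'ticker': ['AAA', 'AAA', 'BBB', 'BBB'],
    'date': pd.to_datetime(['2014-12-01', '2015-03-02',
                            '2018-06-01', '2018-06-04']),
})

# Earliest price observation per ticker
first_obs = (
    prices.groupby('ticker')['date']
    .min()
    .rename('first_price_date')
    .reset_index()
)
flags = listings.merge(first_obs, on='ticker')
flags['possible_backfill'] = flags['first_price_date'] < flags['listing_date']
print(flags[['ticker', 'possible_backfill']])
```

On the real panel the same check would run against `listing_history` and `daily_returns`; in this toy case AAA is flagged because its prices begin three months before it listed.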
13.7.2 Point-in-Time Adjustment
We implement a point-in-time adjustment for accounting data that respects the actual reporting lag:
def point_in_time_merge(monthly_df, fundamentals_df, filings_df,
                        lag_months=0):
    """
    Merge accounting data with monthly returns respecting
    the actual filing date (point-in-time).

    Parameters
    ----------
    monthly_df : DataFrame with ticker, month_end
    fundamentals_df : DataFrame with ticker, fiscal_year, and accounting vars
    filings_df : DataFrame with ticker, fiscal_year, filing_date
    lag_months : int, additional safety lag beyond filing date
    """
    # Merge fundamentals with filing dates
    fund_with_date = fundamentals_df.merge(
        filings_df[['ticker', 'fiscal_year', 'filing_date']],
        on=['ticker', 'fiscal_year'], how='left'
    )
    # If the filing date is missing, assume the data became available
    # 4 months after fiscal year-end
    fund_with_date['filing_date'] = pd.to_datetime(
        fund_with_date['filing_date']
    )
    fund_with_date['fy_end'] = pd.to_datetime(
        fund_with_date['fiscal_year'].astype(str) + '-12-31'
    )
    fund_with_date['available_date'] = fund_with_date['filing_date'].fillna(
        fund_with_date['fy_end'] + pd.DateOffset(months=4)
    )
    # Add safety lag
    if lag_months > 0:
        fund_with_date['available_date'] += pd.DateOffset(months=lag_months)
    # For each firm-month, find the most recent accounting data
    # that was available (available_date <= month_end)
    results = []
    for _, row in monthly_df.iterrows():
        ticker = row['ticker']
        month = row['month_end']
        available = fund_with_date[
            (fund_with_date['ticker'] == ticker) &
            (fund_with_date['available_date'] <= month)
        ]
        if len(available) > 0:
            latest = available.sort_values('fiscal_year').iloc[-1]
            result = row.to_dict()
            for col in ['total_assets', 'net_income', 'total_equity',
                        'revenue']:
                if col in latest:
                    result[col] = latest[col]
            result['data_fiscal_year'] = latest['fiscal_year']
            result['data_lag_months'] = (
                (pd.Timestamp(month) - pd.Timestamp(latest['available_date']))
                .days / 30.44
            )
            results.append(result)
    return pd.DataFrame(results)
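The row-by-row loop above is transparent but slow on large panels. Under the same column assumptions (`ticker`, `month_end`, and an `available_date` column as constructed above), pandas' `merge_asof` performs the equivalent backward as-of join in vectorized form. The function name `point_in_time_merge_fast` and the toy frames are ours; this is a sketch, not a drop-in replacement, since it carries every accounting column rather than a selected subset.

```python
import pandas as pd

def point_in_time_merge_fast(monthly_df, fund_with_date):
    """Vectorized point-in-time merge via a backward as-of join.

    Assumes fund_with_date already carries an 'available_date'
    column, as constructed in point_in_time_merge above.
    """
    # merge_asof requires both sides sorted on the join key
    left = monthly_df.sort_values('month_end')
    right = fund_with_date.sort_values('available_date')
    return pd.merge_asof(
        left, right,
        left_on='month_end', right_on='available_date',
        by='ticker', direction='backward'
    )

# Toy check: FY2020 data filed on 2021-04-10 must be invisible
# at the March 2021 month-end but visible at the June month-end
monthly = pd.DataFrame({
    'ticker': ['AAA', 'AAA'],
    'month_end': pd.to_datetime(['2021-03-31', '2021-06-30'])
})
fund = pd.DataFrame({
    'ticker': ['AAA'],
    'fiscal_year': [2020],
    'total_equity': [100.0],
    'available_date': pd.to_datetime(['2021-04-10'])
})
merged = point_in_time_merge_fast(monthly, fund)
print(merged[['month_end', 'fiscal_year', 'total_equity']])
```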
# Example: compare point-in-time vs naive merge
filings = client.get_filings(
    exchanges=['HOSE', 'HNX'],
    report_types=['annual'],
    fields=['ticker', 'fiscal_year', 'filing_date']
)
print("Point-in-time merge vs naive merge:")
print("  Naive: use fiscal year directly (introduces look-ahead bias)")
print("  PIT:   use only data available as of the portfolio formation date")
13.7.3 Quantifying Look-Ahead Bias in Value Strategies
Value strategies sort stocks on book-to-market ratios computed from accounting data. Using end-of-fiscal-year data without respecting reporting lags inflates the value premium because it implicitly uses information that was not yet publicly available.
# Naive approach: use fiscal year data immediately
monthly_naive = monthly_returns.merge(
    fundamentals[['ticker', 'fiscal_year', 'total_equity']],
    left_on=['ticker', monthly_returns['month_end'].dt.year],
    right_on=['ticker', 'fiscal_year'],
    how='left'
)
monthly_naive['bm_naive'] = (
    monthly_naive['total_equity'] / monthly_naive['market_cap']
)
# Point-in-time approach (using 4-month lag as conservative default)
monthly_pit = monthly_returns.copy()
monthly_pit['bm_pit'] = np.nan  # Would be filled by point_in_time_merge
# For demonstration: approximate PIT by using t-1 fiscal year data
# (ensures data were available at formation date)
fund_lagged = fundamentals.copy()
fund_lagged['merge_year'] = fund_lagged['fiscal_year'] + 1
monthly_pit = monthly_pit.merge(
    fund_lagged[['ticker', 'merge_year', 'total_equity']].rename(
        columns={'merge_year': 'year'}),
    left_on=['ticker', monthly_pit['month_end'].dt.year],
    right_on=['ticker', 'year'],
    how='left'
)
monthly_pit['bm_pit'] = (
    monthly_pit['total_equity'] / monthly_pit['market_cap']
)
# Compute HML for both approaches
hml_naive = compute_long_short(monthly_naive, 'bm_naive')
hml_pit = compute_long_short(monthly_pit, 'bm_pit')
ann_naive = hml_naive['long_short'].mean() * 12
ann_pit = hml_pit['long_short'].mean() * 12
print("Value Premium (HML):")
print(f"  Naive (look-ahead): {ann_naive:.4f} ({ann_naive*100:.2f}%/yr)")
print(f"  Point-in-time:      {ann_pit:.4f} ({ann_pit*100:.2f}%/yr)")
print(f"  Look-ahead inflation: {(ann_naive - ann_pit)*100:.2f}%/yr")
13.8 Exchange Transfers and Ticker Discontinuities
13.8.1 The Transfer Problem
Vietnamese firms frequently transfer between exchanges (e.g., from HNX to HOSE upon meeting HOSE’s listing requirements, or from HOSE to UPCoM/HNX following regulatory issues). These transfers can break the continuity of return series if the database treats each exchange listing as a separate entity.
transfers = listing_history[
    listing_history['transfer_from'].notna()
].copy()
print(f"Total exchange transfers: {len(transfers)}")
print("\nTransfer patterns:")
transfer_pattern = transfers.groupby(
    ['transfer_from', 'transfer_to']
).size().sort_values(ascending=False)
print(transfer_pattern.head(10))

def link_transfer_returns(monthly_df, transfers_df):
"""
Link return series across exchange transfers to create
continuous firm-level return histories.
"""
# Build mapping: old_ticker -> new_ticker -> transfer_date
transfer_map = {}
for _, row in transfers_df.iterrows():
old_ticker = row.get('transfer_from_ticker', row['ticker'])
new_ticker = row['ticker']
transfer_date = row['transfer_date']
if old_ticker and new_ticker and old_ticker != new_ticker:
transfer_map[old_ticker] = {
'new_ticker': new_ticker,
'date': transfer_date
}
# Create unified ticker mapping
df = monthly_df.copy()
df['unified_ticker'] = df['ticker']
for old_t, info in transfer_map.items():
mask = (
(df['ticker'] == old_t) &
(df['month_end'] < info['date'])
)
df.loc[mask, 'unified_ticker'] = info['new_ticker']
n_linked = sum(1 for t in transfer_map if t in df['ticker'].values)
print(f"Linked {n_linked} transfer pairs")
return df
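The linking logic can be sanity-checked on a toy transfer before running it on the full panel. The tickers and dates below are hypothetical, and the helper is a condensed restatement of the logic above (named with a leading underscore so it does not shadow the full function).

```python
import pandas as pd

def _link_transfer_returns_toy(monthly_df, transfers_df):
    """Condensed version of the linking logic, for a toy check only."""
    df = monthly_df.copy()
    df['unified_ticker'] = df['ticker']
    for _, row in transfers_df.iterrows():
        old_t = row['transfer_from_ticker']
        if old_t and old_t != row['ticker']:
            mask = (
                (df['ticker'] == old_t) &
                (df['month_end'] < row['transfer_date'])
            )
            df.loc[mask, 'unified_ticker'] = row['ticker']
    return df

# Hypothetical rename: 'AAA1' becomes 'AAA' at a January 2020 transfer
monthly_toy = pd.DataFrame({
    'ticker': ['AAA1', 'AAA1', 'AAA', 'AAA'],
    'month_end': pd.to_datetime(['2019-11-30', '2019-12-31',
                                 '2020-01-31', '2020-02-29']),
    'monthly_return': [0.01, 0.02, 0.03, 0.04],
})
transfers_toy = pd.DataFrame({
    'ticker': ['AAA'],
    'transfer_from_ticker': ['AAA1'],
    'transfer_date': pd.to_datetime(['2020-01-01']),
})
linked_toy = _link_transfer_returns_toy(monthly_toy, transfers_toy)
print(linked_toy[['ticker', 'month_end', 'unified_ticker']])
# All four rows now share the unified ticker 'AAA'
```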
linked = link_transfer_returns(monthly_returns, transfers)
13.9 Sensitivity Analysis Framework
13.9.1 How Fragile Are Your Results?
Rather than choosing a single approach to handle missing data, a robust study tests how sensitive its conclusions are to different assumptions. We implement a systematic sensitivity framework that re-runs a given analysis under multiple data treatment assumptions.
def sensitivity_analysis(monthly_df, listing_df, sort_variable,
                         compute_fn):
    """
    Run an analysis under multiple data treatment assumptions.

    Parameters
    ----------
    monthly_df : Full monthly return panel (including delisted)
    listing_df : Listing history with delisting info
    sort_variable : Column name for portfolio sorting
    compute_fn : Function that takes a DataFrame and a sort variable
        and returns a scalar (e.g., annualized long-short return)

    Returns
    -------
    DataFrame with results under each assumption
    """
    survivors = set(listing_df[listing_df['is_active']]['ticker'])
    results = {}

    # 1. Survivors only (maximum bias)
    surv_only = monthly_df[monthly_df['ticker'].isin(survivors)]
    results['Survivors Only'] = compute_fn(surv_only, sort_variable)

    # 2. Full sample, no delisting imputation
    results['Full Sample (no imputation)'] = compute_fn(
        monthly_df, sort_variable
    )

    # 3-4. Full sample + imputed terminal returns for distress delistings
    delisted = listing_df[listing_df['delisting_date'].notna()]
    reason = delisted.get('reason_category',
                          pd.Series('', index=delisted.index))
    distress = delisted[
        reason.astype(str).str.contains('Financial Distress', regex=False)
    ]
    for level, label in [(-0.50, 'Full + Impute -50%'),
                         (-1.00, 'Full + Impute -100%')]:
        imputed = pd.DataFrame({
            'ticker': distress['ticker'].values,
            'month_end': distress['delisting_date'].values,
            'monthly_return': level,
            'market_cap': np.nan,
            sort_variable: np.nan
        })
        augmented = pd.concat([monthly_df, imputed], ignore_index=True)
        results[label] = compute_fn(augmented, sort_variable)

    # 5. Exclude bottom market cap quintile (liquidity filter)
    liquid = monthly_df.copy()
    liquid['mcap_quintile'] = (
        liquid.groupby('month_end')['market_cap']
        .transform(lambda x: pd.qcut(x, 5, labels=False, duplicates='drop'))
    )
    liquid = liquid[liquid['mcap_quintile'] > 0]
    results['Exclude Bottom Quintile'] = compute_fn(
        liquid, sort_variable
    )

    return pd.DataFrame.from_dict(results, orient='index',
                                  columns=['Result'])
# Example: sensitivity of size premium
def compute_size_premium(df, sort_var):
    ls = compute_long_short(df, sort_var, n_quantiles=5)
    return ls['long_short'].mean() * 12 if len(ls) > 0 else np.nan

# Would need sort variable in the data; illustrative call:
# sensitivity_results = sensitivity_analysis(
#     monthly_with_chars, listing_history, 'log_mcap',
#     compute_size_premium
# )

# Illustrative: create synthetic sensitivity results for plotting
assumptions = [
    'Survivors Only',
    'Full Sample\n(no imputation)',
    'Full +\nImpute -30%',
    'Full +\nImpute -50%',
    'Full +\nImpute -100%',
    'Exclude Bottom\nMcap Quintile'
]
# Hypothetical results (would be computed from actual data)
size_premium = [0.08, 0.06, 0.055, 0.05, 0.04, 0.07]
value_premium = [0.07, 0.065, 0.063, 0.06, 0.055, 0.068]
momentum_premium = [0.10, 0.08, 0.075, 0.07, 0.06, 0.09]

fig, ax = plt.subplots(figsize=(14, 6))
x = np.arange(len(assumptions))
width = 0.25
ax.bar(x - width, [s * 100 for s in size_premium], width,
       color='#2C5F8A', alpha=0.85, label='Size (SMB)')
ax.bar(x, [v * 100 for v in value_premium], width,
       color='#27AE60', alpha=0.85, label='Value (HML)')
ax.bar(x + width, [m * 100 for m in momentum_premium], width,
       color='#E67E22', alpha=0.85, label='Momentum (WML)')
ax.set_xticks(x)
ax.set_xticklabels(assumptions, fontsize=9)
ax.set_ylabel('Annualized Premium (%)')
ax.set_title('Anomaly Premium Sensitivity to Data Treatment')
ax.legend()
ax.axhline(y=0, color='gray', linewidth=0.5)
plt.tight_layout()
plt.show()
13.10 Practical Recommendations
Based on the analysis in this chapter, we offer the following recommendations for researchers working with Vietnamese equity data:
1. Always use survivorship-bias-free databases. When querying DataCore.vn (or any database), explicitly request include_delisted=True. Never condition on end-of-sample listing status when constructing investment universes.
2. Impute delisting returns. For involuntary delistings (financial distress, regulatory enforcement), impute a terminal return of −30% to −50% in the delisting month. Report results across a range of imputation assumptions as a robustness check. For voluntary delistings and exchange transfers, impute 0%.
3. Respect point-in-time data availability. Use accounting data only after its public filing date, not as of the fiscal year-end. In Vietnam, the standard lag is 90 days for annual reports; use a conservative 4–6 month lag.
4. Handle zero-volume days explicitly. Document the prevalence of zero-volume days in your sample. For monthly returns, report the average number of zero-volume days per firm-month. Consider excluding firms with zero-volume fractions exceeding 50% from the investable universe.
5. Link exchange transfers. Use unified tickers that link pre-transfer and post-transfer series. Without this, exchange transfers appear as simultaneous delistings and new listings, inflating turnover and biasing survival calculations.
6. Report sensitivity analysis. For any key finding, report results under at least three data treatment assumptions: survivors-only (upper bound), full sample with moderate imputation (baseline), and full sample with aggressive imputation plus liquidity filter (lower bound). If the finding survives all three, it is robust.
7. Be especially cautious with pre-2007 data. The Vietnamese market had fewer than 100 listings before 2006, and the equitization wave of 2006–2009 produced a cohort of firms with systematically different characteristics than earlier listings. Cross-sectional tests with pre-2007 data have minimal power and should be interpreted with extreme caution.
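Recommendation 4's 50% cutoff can be implemented with a short filter. This is a sketch under assumed column names (`ticker`, `date`, `volume`, matching the daily panels used in this chapter); the function name and toy panel are ours.

```python
import pandas as pd

def filter_zero_volume(daily_df, max_zero_frac=0.50):
    """Drop firm-months whose fraction of zero-volume days exceeds the cap.

    Assumes daily_df has columns ticker, date, volume; the 0.50
    default cutoff follows recommendation 4 above.
    """
    df = daily_df.copy()
    # Roll each date forward to its month-end to define firm-months
    df['month_end'] = df['date'] + pd.offsets.MonthEnd(0)
    zero_frac = (
        df.groupby(['ticker', 'month_end'])['volume']
        .apply(lambda v: (v == 0).mean())
        .rename('zero_frac')
        .reset_index()
    )
    df = df.merge(zero_frac, on=['ticker', 'month_end'], how='left')
    return df[df['zero_frac'] <= max_zero_frac]

# Toy panel: firm 'BBB' trades on only 1 of 4 days in the month
toy = pd.DataFrame({
    'ticker': ['AAA'] * 4 + ['BBB'] * 4,
    'date': list(pd.to_datetime(['2022-03-01', '2022-03-02',
                                 '2022-03-03', '2022-03-04'])) * 2,
    'volume': [100, 200, 150, 120, 0, 0, 0, 50],
})
kept = filter_zero_volume(toy)
print(sorted(kept['ticker'].unique()))  # 'BBB' (75% zero-volume) is dropped
```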
13.11 Summary
| Data Problem | Bias Direction | Magnitude (Vietnam) | Recommended Fix |
|---|---|---|---|
| Survivorship bias (EW) | Upward on returns | ~1–3% per year | Include all delisted firms |
| Survivorship bias (VW) | Upward (smaller) | ~0.2–0.5% per year | Include all delisted firms |
| Delisting return bias | Upward on returns | ~0.5–2% per year (EW) | Impute terminal returns |
| Look-ahead bias | Inflates predictability | Varies by strategy | Point-in-time data alignment |
| Zero-trading days | Understates volatility | Severe for small caps | Compound or drop; document |
| Exchange transfers | Creates false delistings | ~50–100 firms | Link unified tickers |
| New-listing bias | Early sample unrepresentative | Extreme pre-2007 | Start sample after 2007 |
The central message is that data problems in Vietnamese equity research are not merely a nuisance: they create economically significant biases that can alter the conclusions of empirical studies. Survivorship bias alone exceeds 100 basis points per year for equal-weighted portfolios, comparable to many documented anomaly premia. Researchers who ignore these issues risk reporting results that reflect data artifacts rather than genuine economic phenomena.