# Measuring Divergence of Investor Opinion
A foundational question in financial economics concerns how differences in investor beliefs affect asset prices and trading activity. In markets where investors hold heterogeneous expectations about a firm's future cash flows, the aggregation of these divergent views into a single market price becomes a non-trivial exercise with profound implications for asset valuation, return predictability, and market efficiency. The concept of **divergence of investor opinion** (hereafter DIVOP) has emerged as a central construct in both the accounting and finance literatures, serving as a lens through which researchers examine the information environment of firms, the dynamics of uncertainty resolution, and the nature of market reactions to news.
The theoretical foundations of the DIVOP literature trace back to @miller1977risk, who proposed that when investors disagree about the value of a security and short-sale constraints prevent pessimistic investors from fully expressing their views, the market price will reflect the valuation of the most optimistic investors. This leads to systematic overpricing that is increasing in the degree of opinion divergence. The overpricing persists until information events, such as earnings announcements, reduce disagreement and prices converge toward fundamental values [@berkman2009sell]. @varian1985divergence offers an alternative perspective in which divergence of opinion represents an additional risk factor, leading to *higher* rather than lower expected returns, creating a theoretical tension that has motivated extensive empirical investigation.
The empirical literature on DIVOP has expanded considerably since these seminal contributions. Researchers have documented that divergence of opinion helps explain a range of asset pricing anomalies, including post-earnings announcement drift [@garfinkel2006volume; @anderson2007opinion], the cross-sectional return difference between value and growth stocks [@doukas2004divergent], short- and long-run post-IPO returns [@houge2001divergence], pre- and post-acquisition stock returns [@alexandridis2007divergence], takeover premia [@chatterjee2012takeovers], and the broad cross-section of stock returns [@diether2002differences; @doukas2006divergence]. The explanatory power of DIVOP has been demonstrated using a rich set of empirical proxies, ranging from analyst forecast dispersion and abnormal trading volume to bid-ask spreads and idiosyncratic volatility.
Despite the maturity of the DIVOP literature in developed markets, particularly the United States, its application to emerging markets remains remarkably thin. This gap is especially notable given that the theoretical conditions under which divergence of opinion matters most (namely, binding short-sale constraints, information asymmetry, and heterogeneous investor sophistication) are arguably *more* prevalent in emerging markets than in their developed counterparts. The Vietnamese equity market presents a compelling laboratory for studying investor disagreement. The market is characterized by several features that amplify the relevance of the DIVOP framework:
1. **Binding short-sale constraints.** Short selling was not permitted in Vietnam until January 2025, and even after its introduction, the mechanism remains restricted to a limited set of securities with significant regulatory constraints on execution. This closely mirrors the theoretical setting of @miller1977risk, where pessimistic investors are unable to fully express their views through short positions.
2. **Dominance of retail investors.** Individual investors account for approximately 80-85% of daily trading volume on HOSE and HNX, compared to roughly 25% in the United States. Retail investors are more susceptible to behavioral biases, sentiment-driven trading, and information processing limitations that naturally give rise to heterogeneous beliefs [@phan2023role].
3. **Information asymmetry and transparency challenges.** Despite improvements in disclosure standards, Vietnam's regulatory framework for corporate reporting remains less stringent than those in developed markets. Selective disclosure, delayed filing of financial statements, and limited enforcement of insider trading regulations create an environment in which investors operate with substantially different information sets [@vo2017further].
4. **Foreign ownership limits.** Caps on foreign ownership (currently 49% for most sectors, with exceptions) create a segmented market where domestic and foreign investors may hold systematically different views about firm value, amplifying the divergence of opinion.
5. **Thin analyst coverage.** Whereas a typical S&P 500 firm is followed by 15-25 sell-side analysts, coverage of Vietnamese equities is concentrated among a relatively small number of domestic brokerages and a handful of international research houses. This limits the informativeness of traditional analyst-based DIVOP measures and necessitates greater reliance on market-based proxies.
This chapter provides a methodology for constructing multiple proxies for divergence of investor opinion adapted to the institutional characteristics of the Vietnamese market. We draw on the methodological frameworks established by @garfinkel2009measuring and @diether2002differences, while introducing modifications that account for the microstructure of Vietnamese exchanges, the $T+2$ settlement cycle, the absence (until recently) of short selling, and the availability of data through domestic financial platforms. Specifically, we construct and analyze the following DIVOP proxies:
- **Unexplained Volume (DTO):** Market-adjusted turnover detrended by its rolling median, capturing abnormal trading activity attributable to disagreement after controlling for liquidity and market-wide effects.
- **Standardized Unexplained Volume (SUV):** A regression-based measure that explicitly controls for the informedness and liquidity components of volume by modeling turnover as a function of signed returns.
- **Stock Return Volatility (VOLATILITY):** The standard deviation of daily returns over a rolling estimation window, serving as a proxy for the dispersion of investor valuations.
- **Bid-Ask Spread (BASPREAD):** The proportional quoted spread, reflecting the adverse selection component associated with heterogeneous information among market participants.
- **Analyst Forecast Dispersion (DISP):** The cross-sectional standard deviation of individual analyst earnings forecasts, directly measuring disagreement among informed market participants.
- **Idiosyncratic Volatility (IVOL):** The residual volatility from a factor model regression, isolating the firm-specific component of return variation that reflects divergent investor interpretations of firm-level information.
- **Amihud Illiquidity (ILLIQ):** The price impact ratio proposed by @amihud2002illiquidity, which captures the information asymmetry dimension of disagreement through the price response to order flow.
For each proxy, we describe the theoretical motivation, the data requirements, the construction methodology adapted for Vietnamese data, the empirical properties observed in the Vietnamese cross-section, and the practical considerations that researchers should bear in mind when employing these measures. We pay particular attention to issues that are specific to emerging markets, including thin trading, corporate action adjustments, exchange-specific microstructure effects, and the interplay between foreign ownership constraints and measures of investor disagreement.
------------------------------------------------------------------------
# Theoretical Framework {#sec-theoretical-framework}
## The Miller (1977) Overpricing Hypothesis
The canonical model of divergence of opinion and asset pricing begins with @miller1977risk. Miller's central insight is simple: in a market where investors hold heterogeneous beliefs about the future payoffs of a risky asset and short-sale constraints prevent some investors from acting on their pessimistic views, the equilibrium price will be set by the subset of investors who are most optimistic about the asset's value. The severity of overpricing is increasing in both the degree of opinion divergence and the stringency of short-sale constraints. Formally, if investor $i$ assigns a valuation $V_i$ to a security, the market price $P$ satisfies:
$$
P = E[V_i \mid V_i \geq V^*]
$$
where $V^*$ is the marginal investor's valuation, which exceeds the unconditional mean valuation $E[V_i]$ whenever short-sale constraints bind for some investors. The degree of overpricing is:
$$
\text{Overpricing} = P - E[V_i] = E[V_i \mid V_i \geq V^*] - E[V_i]
$$
which is positive and increasing in the dispersion of the distribution of $V_i$ (i.e., divergence of opinion) and in $V^*$ (i.e., the severity of short-sale constraints).
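The comparative statics of the overpricing expression can be checked numerically. The sketch below is illustrative only: it assumes normally distributed valuations and sets $V^*$ at a chosen quantile of the valuation distribution, neither of which is specified by Miller's model, and the function name `miller_overpricing` is hypothetical.

```python
import numpy as np

def miller_overpricing(sigma, quantile=0.7, mean=100.0, n=200_000, seed=0):
    """Monte Carlo estimate of E[V | V >= V*] - E[V].

    Assumptions (for illustration only): valuations are normal with
    standard deviation `sigma`, and V* is the `quantile`-th quantile
    of the valuation distribution.
    """
    rng = np.random.default_rng(seed)
    v = rng.normal(mean, sigma, size=n)
    v_star = np.quantile(v, quantile)   # marginal investor's valuation
    price = v[v >= v_star].mean()       # P = E[V | V >= V*]
    return price - v.mean()             # overpricing relative to E[V]

# Overpricing for increasing dispersion of valuations
for sigma in (5.0, 10.0, 20.0):
    print(f"sigma={sigma:5.1f}  overpricing={miller_overpricing(sigma):.2f}")
```

Under these assumptions, raising `sigma` (more disagreement) or `quantile` (a higher $V^*$) both increase the gap between the price and the mean valuation, in line with the two comparative statics stated above.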
Miller's model generates several testable predictions:
- **Cross-sectional prediction:** Stocks with **higher divergence of opinion should have *lower* subsequent returns** as prices gradually correct toward fundamental values.
- **Time-series prediction:** Information events that reduce disagreement (e.g., earnings announcements) should be associated with negative abnormal returns for high-DIVOP stocks, as the "optimism premium" dissipates.
- **Interaction prediction:** The overpricing effect should be strongest among stocks that simultaneously exhibit high divergence of opinion *and* binding short-sale constraints.
## Alternative Theoretical Perspectives
@varian1985divergence proposes an alternative framework in which divergence of opinion acts as a risk factor. If investors are risk-averse and disagreement represents genuine uncertainty about future payoffs, then **higher dispersion of beliefs should be associated with *higher* expected returns** as compensation for bearing the additional risk. This creates a sharp empirical dichotomy: the Miller hypothesis predicts a negative DIVOP-return relation, whereas the Varian model predicts a positive relation.
The distinction between these theories hinges critically on the market microstructure and institutional setting (@tbl-divop-theories).
| Theoretical Framework | Short-Sale Constraints | DIVOP-Return Relation | Key Mechanism |
|:-----------------|:-----------------|:-----------------|:-----------------|
| @miller1977risk | Binding | Negative | Optimistic bias in price |
| @varian1985divergence | Non-binding | Positive | Risk premium for uncertainty |
| @hong2003differences | Binding, gradual info | Negative, time-varying | Slow diffusion of bearish views |
| @scheinkman2003overconfidence | Binding, overconfidence | Negative | Speculative bubble premium |
: Summary of theoretical predictions for the DIVOP-return relation under different assumptions {#tbl-divop-theories}
@hong2003differences extend Miller's framework by incorporating gradual information diffusion. In their model, bearish information is impounded into prices more slowly than bullish information because short-sale constraints raise the cost of acting on negative views. This generates momentum-like patterns in which high-DIVOP stocks exhibit positive short-run returns (as optimists push prices up) followed by negative long-run returns (as bearish information eventually reaches the market).
@scheinkman2003overconfidence introduce an additional dimension by noting that when investors are overconfident about their private signals *and* short-sale constraints bind, stock prices contain a "speculative bubble" component that reflects the option value of reselling the asset to a future investor who may be even more optimistic. This model predicts that both high trading volume and high price volatility should be associated with overpricing, providing a theoretical basis for using volume-based and volatility-based DIVOP proxies.
## Relevance to the Vietnamese Market
The Vietnamese equity market provides an unusually clean setting for testing the Miller hypothesis. The market operated without any short-selling mechanism from its inception in 2000 through January 2025, a full quarter-century during which the first necessary condition of Miller's model (binding short-sale constraints) was satisfied by regulation rather than by market frictions. Even after the introduction of covered short selling in 2025, the mechanism remains restricted to securities meeting specific liquidity and market capitalization thresholds, and the regulatory environment imposes borrowing requirements that significantly raise the cost of shorting relative to developed markets.
The dominance of retail investors amplifies the second necessary condition (i.e., heterogeneous beliefs). Research on the Vietnamese market has documented significant herding behavior [@vo2017further; @vo2015foreign], sentiment-driven trading [@phan2023role; @nguyen2018search], and information asymmetry between domestic and foreign investors [@vo2017foreign]. These behavioral characteristics naturally generate wider dispersion of investor valuations compared to markets dominated by institutional investors with access to similar analytical frameworks and information sources.
@tbl-divop-vietnam-us compares key institutional features relevant to the DIVOP framework between Vietnam and the United States.
| Feature | Vietnam (HOSE/HNX) | United States (NYSE/NASDAQ) |
|:-----------------------|:-----------------------|:-----------------------|
| Short selling | Introduced Jan 2025 (limited) | Permitted (Reg SHO since 2005) |
| Retail investor share of volume | \~80-85% | \~25% |
| Settlement cycle | T+2 (T+1 planned for 2026) | T+1 (since May 2024) |
| Daily price limits | $\pm$ 7% (HOSE), $\pm$ 10% (HNX) | None |
| Foreign ownership cap | 49% (most sectors) | None |
| Average analyst coverage (VN30) | 5-10 analysts | 15-25 analysts |
| Mandatory quarterly reporting | Yes (since 2012) | Yes |
| Options/derivatives market | VN30 Index Futures (since 2017) | Extensive options/futures |
: Institutional comparison of Vietnam and the United States relevant to divergence of opinion {#tbl-divop-vietnam-us}
The presence of daily price limits ($\pm$ 7% on HOSE and $\pm$ 10% on HNX) creates an additional mechanism through which divergence of opinion can be amplified. When a stock hits its price limit, investors who wish to trade in the direction of the limit are unable to do so, leading to accumulated unfilled orders and delayed price discovery. This institutional feature may create short-term spikes in measured DIVOP that reflect limit-induced friction rather than genuine disagreement. We address this issue in our empirical methodology by flagging limit-hit days and conducting robustness checks that exclude these observations.
# Data Sources and Sample Construction {#sec-data}
## Data Sources
The construction of DIVOP proxies for the Vietnamese market requires daily stock-level trading data and, for the analyst dispersion measures, individual analyst forecast data. We source all data from [DataCore.vn](https://datacore.vn/en), which provides coverage of all securities listed on HOSE, HNX, and the UPCoM (Unlisted Public Company Market) exchange. @tbl-divop-data-sources summarizes the datasets and key variables used in this study.
| Dataset | Key Variables | Frequency |
|:-----------------------|:-----------------------|:-----------------------|
| Daily Stock Trading | Close price, high, low, open, volume, shares outstanding, adjusted price, bid, ask | Daily |
| Corporate Actions | Dividends, stock splits, bonus issues, rights offerings | Event-based |
| Company Information | Exchange code, industry classification (ICB), listing date, delisting date | Static/Periodic |
| Analyst Forecasts | Individual analyst EPS forecasts, announcement dates, fiscal period end, analyst ID, broker name | Per estimate |
| Market Index | VN-Index daily returns, VN30 returns, HNX-Index returns | Daily |
| Foreign Ownership | Foreign buy/sell volume, foreign ownership percentage, remaining foreign room | Daily |
: Data sources and key variables for DIVOP proxy construction {#tbl-divop-data-sources}
## Sample Construction
We first collect the sample period, estimation windows, and filter thresholds in a single configuration block that is used throughout the chapter:
```{python}
#| label: sample-construction
#| code-summary: "Sample construction and initial data loading"
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
from sklearn.linear_model import LinearRegression
from scipy import stats as scipy_stats
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
# =============================================================================
# Configuration Parameters
# =============================================================================
# Users can modify these parameters to adjust the methodology
CONFIG = {
    # Sample period
    'beg_date': '2007-01-01',
    'end_date': '2024-12-31',
    # Estimation windows (in trading days)
    'est_window': 60,            # Rolling window for SUV and volatility
    'detrend_window': 180,       # Window for DTO detrending median
    'lag': 7,                    # Lag for DTO detrending
    'gap': 5,                    # Gap between estimation period and event date
    # Filters
    'min_price': 1000,           # Minimum price in VND
    'min_volume_days': 0.8,      # Min fraction of non-zero volume days in window
    'min_analysts': 3,           # Minimum number of analysts for DISP
    'max_spread_pct': 0.50,      # Maximum bid-ask spread as fraction of midpoint
    'forecast_carry_days': 105,  # Days to carry forward stale analyst forecasts
    # Exchange identifiers
    'exchanges': ['HOSE', 'HNX'],
    # Price limit thresholds (for flagging)
    'price_limit_hose': 0.07,
    'price_limit_hnx': 0.10,
}
print("Configuration parameters loaded successfully.")
print(f"Sample period: {CONFIG['beg_date']} to {CONFIG['end_date']}")
print(f"Estimation window: {CONFIG['est_window']} trading days")
print(f"Detrending window: {CONFIG['detrend_window']} trading days")
```
The sample universe includes all common stocks (ordinary shares) listed on HOSE and HNX during the period January 2007 through December 2024. We begin in 2007 rather than at market inception (2000 for HOSE, 2005 for HNX) for two reasons. First, the early years of the Vietnamese market were characterized by an extremely small number of listed firms (fewer than 30 on HOSE through 2005), making cross-sectional analysis unreliable. Second, data quality and consistency improve substantially after the market expansion of 2006-2007, during which the number of listed firms on HOSE grew from approximately 40 to over 100.
We apply the following filters to construct the analysis sample:
1. **Security type filter.** We retain only common stocks (ordinary shares), excluding preferred shares, exchange-traded funds (ETFs), covered warrants, and certificates of deposit. This is analogous to the standard filter in the U.S. literature that restricts to CRSP share codes 10 and 11.
2. **Exchange filter.** We include stocks listed on HOSE and HNX but exclude UPCoM securities in our baseline analysis. UPCoM is a registration-based trading venue with less stringent listing requirements and substantially lower liquidity, which may introduce noise into volume-based and spread-based measures. We include UPCoM in robustness checks.
3. **Price filter.** We exclude stock-day observations with closing prices below 1,000 VND. This threshold serves the same purpose as the "penny stock" exclusion common in U.S. studies (typically \$1 or \$5 thresholds) and helps mitigate the influence of extreme percentage returns and spreads at very low price levels.
4. **Minimum trading activity.** For volume-based measures, we require that a stock has non-zero trading volume on at least 80% of trading days within each estimation window. This filter eliminates the most thinly traded securities for which turnover-based measures would be unreliable.
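The minimum trading activity requirement in the last filter can be sketched as a rolling share of non-zero-volume days per stock. This is a minimal illustration, not the chapter's implementation: the function name `flag_active_windows` is hypothetical, and it assumes that zero-volume days appear as rows with `volume == 0` rather than as missing rows.

```python
import pandas as pd

def flag_active_windows(df, window=60, min_frac=0.8):
    """Flag stock-days whose trailing window has enough non-zero volume days.

    Assumes one row per ticker per trading day, with zero-volume days
    recorded as volume == 0 (an assumption about the data layout).
    """
    df = df.sort_values(['ticker', 'date']).copy()
    traded = (df['volume'] > 0).astype(float)
    # Rolling fraction of days with positive volume, computed per ticker
    df['nonzero_frac'] = traded.groupby(df['ticker']).transform(
        lambda s: s.rolling(window, min_periods=window).mean()
    )
    # Incomplete windows have NaN fractions and are flagged as inactive
    df['active'] = df['nonzero_frac'] >= min_frac
    return df

# Toy data: AAA trades every day, BBB trades on one day out of five
toy = pd.DataFrame({
    'ticker': ['AAA'] * 5 + ['BBB'] * 5,
    'date': list(pd.date_range('2024-01-02', periods=5)) * 2,
    'volume': [100, 50, 200, 300, 400, 0, 0, 50, 0, 0],
})
out = flag_active_windows(toy, window=5, min_frac=0.8)
print(out[['ticker', 'date', 'nonzero_frac', 'active']])
```

With the chapter's settings, `window` would correspond to `CONFIG['est_window']` and `min_frac` to `CONFIG['min_volume_days']`.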
```{python}
#| label: load-and-filter
#| code-summary: "Load and filter daily stock data"
def load_daily_data(config):
    """
    Load daily stock trading data from DataCore.vn.

    In practice, this function connects to the DataCore API or reads
    from a local database/CSV. Here we document the expected schema.

    Expected columns:
    - ticker: str, stock ticker symbol (e.g., 'VCB', 'HPG', 'VNM')
    - date: datetime, trading date
    - open, high, low, close: float, daily OHLC prices (VND)
    - volume: int, trading volume (shares)
    - shares_outstanding: int, total shares outstanding
    - adjusted_close: float, price adjusted for corporate actions
    - adj_factor: float, cumulative adjustment factor
    - bid, ask: float, best bid/ask at close
    - exchange: str, exchange code ('HOSE', 'HNX', 'UPCOM')
    - industry_icb: str, ICB industry classification code
    - foreign_buy_vol, foreign_sell_vol: int, foreign investor volumes
    - foreign_ownership_pct: float, foreign ownership percentage
    """
    # =========================================================================
    # Replace with actual DataCore API call:
    #   from datacore import Client
    #   client = Client(api_key='YOUR_KEY')
    #   df = client.daily_stock(
    #       start=config['beg_date'], end=config['end_date'],
    #       exchanges=config['exchanges']
    #   )
    # =========================================================================
    print("Connect to DataCore.vn and load daily stock data.")
    print("Expected schema: ticker, date, open, high, low, close, volume,")
    print("  shares_outstanding, adjusted_close, adj_factor, bid, ask,")
    print("  exchange, industry_icb, foreign_buy_vol, foreign_sell_vol,")
    print("  foreign_ownership_pct")
    return None  # Replace with actual data


def apply_sample_filters(df, config):
    """Apply sequential sample construction filters."""
    print("\n=== Sample Construction ===")
    n0 = len(df)

    # Date filter
    df = df[(df['date'] >= config['beg_date']) &
            (df['date'] <= config['end_date'])].copy()
    print(f"[1] Date filter: {len(df):,} obs (from {n0:,})")

    # Exchange filter
    df = df[df['exchange'].isin(config['exchanges'])].copy()
    print(f"[2] Exchange filter ({config['exchanges']}): {len(df):,} obs")

    # Compute daily returns from adjusted prices BEFORE the price filter,
    # so that pct_change never spans stock-days removed by that filter
    df = df.sort_values(['ticker', 'date'])
    df['ret'] = df.groupby('ticker')['adjusted_close'].pct_change()

    # Price filter
    df = df[df['close'] >= config['min_price']].copy()
    print(f"[3] Price >= {config['min_price']:,} VND: {len(df):,} obs")

    # Flag price limit hits (small tolerance for tick-size rounding)
    df['limit_hit'] = (
        ((df['exchange'] == 'HOSE') &
         (df['ret'].abs() >= config['price_limit_hose'] - 0.001)) |
        ((df['exchange'] == 'HNX') &
         (df['ret'].abs() >= config['price_limit_hnx'] - 0.001))
    )

    n_tickers = df['ticker'].nunique()
    print(f"\nFinal sample: {len(df):,} stock-day obs, "
          f"{n_tickers} unique tickers")
    print(f"Limit-hit days: {df['limit_hit'].sum():,} "
          f"({100*df['limit_hit'].mean():.2f}%)")
    return df
```
## Corporate Action Adjustments {#sec-corp-actions}
Proper adjustment for corporate actions is critical for volume-based DIVOP measures, because events such as stock splits, bonus share issues, and rights offerings change the number of shares outstanding and can create artificial spikes in measured turnover. We use cumulative adjustment factors that account for stock dividends (bonus shares), stock splits, and rights offerings; cash dividends affect the price adjustment only. These factors are applied to construct adjusted volume and adjusted shares outstanding:
$$
\text{AdjVolume}_{i,t} = \text{Volume}_{i,t} \times \text{CumAdjFactor}_{i,t}
$$
$$
\text{AdjSharesOut}_{i,t} = \text{SharesOut}_{i,t} \times \text{CumAdjFactor}_{i,t}
$$
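This keeps volume and shares outstanding on a common share basis across corporate action events. Because both series are scaled by the same factor, the factor cancels in a single stock's same-day turnover ratio; the adjustment matters when adjusted volume and adjusted shares are summed across stocks (as in the market-wide turnover of @sec-dto) or compared across dates.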
```{python}
#| label: corp-action
#| code-summary: "Corporate action adjustment"
def adjust_for_corporate_actions(df):
    """Apply cumulative adjustment factors to volume and shares outstanding."""
    df = df.copy()
    df['adj_volume'] = df['volume'] * df['adj_factor']
    df['adj_shares_out'] = df['shares_outstanding'] * df['adj_factor']

    # Daily turnover ratio (NaN when shares outstanding is missing or zero)
    df['turnover'] = np.where(
        df['adj_shares_out'] > 0,
        df['adj_volume'] / df['adj_shares_out'],
        np.nan
    )

    # Treat extreme turnover (> 50% of shares outstanding) as a data error
    extreme = df['turnover'] > 0.50
    if extreme.any():
        print(f"Warning: {extreme.sum()} obs with turnover > 50%, set to NaN")
        df.loc[extreme, 'turnover'] = np.nan
    return df
```
## Trading Calendar Construction {#sec-calendar}
The rolling regression approach for SUV and volatility requires a trading calendar that ensures each estimation window contains exactly the specified number of trading days. We construct this directly from observed trading dates.
```{python}
#| label: trading-calendar
#| code-summary: "Build trading calendar for rolling estimation windows"
def build_trading_calendar(df, config):
    """
    Map each trading date to its estimation window [est_start, est_end].

    For date t, the estimation window runs from
    t - gap - est_window to t - gap - 1 (in trading-day terms).
    """
    trading_dates = sorted(df['date'].unique())
    trading_dates = pd.Series(trading_dates)
    est_window = config['est_window']
    gap = config['gap']
    offset = est_window + gap

    records = []
    for i in range(offset, len(trading_dates)):
        records.append({
            'date': trading_dates.iloc[i],
            'est_start': trading_dates.iloc[i - gap - est_window],
            'est_end': trading_dates.iloc[i - gap - 1]
        })
    calendar = pd.DataFrame(records)
    print(f"Trading calendar: {len(calendar)} dates, "
          f"{calendar['date'].min()} to {calendar['date'].max()}")
    return calendar
```
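The same date-to-window mapping can also be expressed with `shift`, which avoids the explicit loop and makes the window arithmetic easy to verify on a toy calendar. This is a sketch under stated assumptions: the function name `build_calendar_vectorized` is hypothetical, and the business-day range below is only a stand-in for the actual Vietnamese trading calendar.

```python
import pandas as pd

def build_calendar_vectorized(dates, est_window=60, gap=5):
    """Map each date t to [t - gap - est_window, t - gap - 1] in trading days."""
    d = (pd.Series(pd.to_datetime(dates))
           .drop_duplicates()
           .sort_values()
           .reset_index(drop=True))
    cal = pd.DataFrame({
        'date': d,
        'est_start': d.shift(gap + est_window),  # first day of the window
        'est_end': d.shift(gap + 1),             # last day of the window
    })
    # Early dates without a full window have missing bounds and are dropped
    return cal.dropna().reset_index(drop=True)

# Toy calendar: 12 business days, 3-day window, 2-day gap
toy_dates = pd.bdate_range('2024-01-01', periods=12)
cal = build_calendar_vectorized(toy_dates, est_window=3, gap=2)
print(cal.head())

# Each window should span exactly est_window trading days
n_in_window = ((toy_dates >= cal.loc[0, 'est_start']) &
               (toy_dates <= cal.loc[0, 'est_end'])).sum()
print("Days in first window:", n_in_window)
```

Counting the dates between `est_start` and `est_end` is a quick way to confirm that the window length and the gap are applied as intended before running the more expensive rolling regressions.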
# Volume-Based DIVOP Proxies {#sec-volume-based}
## Theoretical Motivation
Trading volume has long been recognized as a natural proxy for divergence of investor opinion. In the rational expectations framework of @milgrom1982information, the "no-trade theorem" shows that rational investors who share common priors will not trade on private information alone. Read in the contrapositive, observed trading volume must reflect some form of heterogeneous beliefs. @harris1993differences and @kandel1995differential formalize this intuition, showing that trading volume is positively related to the dispersion of investors' prior beliefs and to the degree to which public information is differentially interpreted.
The challenge in using raw trading volume as a DIVOP proxy is that volume is also driven by factors unrelated to disagreement, including portfolio rebalancing, liquidity needs, tax-loss selling, and index reconstitution effects. @garfinkel2009measuring proposes two approaches to extract the disagreement component from raw volume. The first, **Unexplained Volume (DTO)**, removes market-wide volume effects and secular trends. The second, **Standardized Unexplained Volume (SUV)**, additionally controls for the information content of returns through a firm-specific time-series regression, isolating the "pure disagreement" component of trading activity.
## Unexplained Volume (DTO) {#sec-dto}
### Construction Methodology
The construction of the Unexplained Volume measure proceeds in four steps.
**Step 1: Compute firm-level daily turnover.** For each stock $i$ on day $t$:
$$
\text{Turn}_{i,t} = \frac{\text{AdjVolume}_{i,t}}{\text{AdjSharesOut}_{i,t}}
$$
**Step 2: Compute market-wide turnover.** We calculate aggregate turnover across all common stocks as a share-weighted average, that is, total adjusted volume divided by total adjusted shares outstanding:
$$
\text{MktTurn}_{t} = \frac{\sum_{i} \text{AdjVolume}_{i,t}}{\sum_{i} \text{AdjSharesOut}_{i,t}}
$$
Unlike the U.S. methodology that computes market turnover across NYSE/AMEX stocks only and applies a scaling adjustment for NASDAQ securities [following @anderson2005market], we compute market turnover across all HOSE and HNX common stocks without any exchange-specific volume scaling. Both Vietnamese exchanges operate as order-driven markets (HOSE uses continuous order matching; HNX uses a combination of continuous matching and periodic call auctions) without the dealer-market double-counting issue that necessitates the NASDAQ volume adjustment in U.S. studies.
**Step 3: Compute market-adjusted turnover.**
$$
\text{MATO}_{i,t} = \text{Turn}_{i,t} - \text{MktTurn}_{t}
$$
**Step 4: Detrend by rolling median.** To remove secular trends in firm-specific trading activity:
$$
\text{DTO}_{i,t} = \text{MATO}_{i,t} - \text{Median}_{180}(\text{MATO}_{i,t-7})
$$
where $\text{Median}_{180}(\text{MATO}_{i,t-7})$ is the median of market-adjusted turnover over the 180-trading-day window ending 7 trading days before date $t$. The 7-day lag prevents the current day's turnover, and that of the days immediately preceding it, from influencing the detrending baseline.
```{python}
#| label: dto-construction
#| code-summary: "Construct the Unexplained Volume (DTO) measure"
def compute_market_turnover(df):
    """Compute daily market-wide turnover across all stocks."""
    agg = df.groupby('date')[['adj_volume', 'adj_shares_out']].sum()
    # Guard against division by zero on dates with no shares recorded
    mkt_turn = agg['adj_volume'] / agg['adj_shares_out'].where(
        agg['adj_shares_out'] > 0
    )
    return mkt_turn.rename('market_turnover').reset_index()


def compute_dto(df, config):
    """
    Construct Unexplained Volume (DTO).

    Steps:
    1. Subtract market turnover -> MATO
    2. Rolling 180-day median of MATO (lagged 7 days) -> trend
    3. DTO = MATO - trend
    """
    detrend_window = config['detrend_window']
    lag = config['lag']

    # Market turnover
    mkt_turn = compute_market_turnover(df)
    df = df.merge(mkt_turn, on='date', how='left')

    # Market-adjusted turnover
    df['mato'] = df['turnover'] - df['market_turnover']

    # Rolling median with lag, computed per stock; transform() keeps the
    # result aligned with the original rows
    df = df.sort_values(['ticker', 'date'])
    df['mato_trend'] = df.groupby('ticker')['mato'].transform(
        lambda s: s.rolling(
            window=detrend_window,
            min_periods=int(detrend_window * 0.5)
        ).median().shift(lag)
    )

    # DTO
    df['dto'] = df['mato'] - df['mato_trend']
    print("DTO construction complete.")
    print(f"  Non-missing: {df['dto'].notna().sum():,}")
    print(f"  Mean: {df['dto'].mean():.6f}, Std: {df['dto'].std():.6f}")
    return df
```
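The lag convention in the detrending step is easy to get wrong by one period, so it can help to check it on a toy series. The example below is only an illustration with a 3-day window and a 2-day lag; the series values are made up.

```python
import pandas as pd

# Toy market-adjusted turnover: the value on day k is k + 1
mato = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
window, lag = 3, 2

# Rolling median over `window` days, then shifted so that the baseline
# for day t is the median of the window ending on day t - lag
trend = mato.rolling(window=window, min_periods=window).median().shift(lag)
dto = mato - trend

print(pd.DataFrame({'mato': mato, 'trend': trend, 'dto': dto}))
```

For day 4, the baseline is the median of days 0 to 2, so neither the current observation nor the `lag - 1` days before it enter the trend.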
### Vietnam-Specific Considerations for DTO
Several features of the Vietnamese market require attention when constructing DTO:
1. **No NASDAQ-type volume adjustment needed.** Both HOSE and HNX are order-driven auction markets. The double-counting adjustment applied to NASDAQ securities in the U.S. literature is not necessary.
2. **Thinly traded stocks.** A substantial fraction of listed Vietnamese stocks, particularly on HNX, may have zero volume on many trading days. For stocks with intermittent trading, the rolling median may be biased toward zero, making DTO less informative. We require at least 80% non-zero volume days in each estimation window.
3. **Price limit effects on volume.** When a stock hits its daily price limit, unfilled orders accumulate and recorded volume may understate true clearing volume. The following day often shows a "catch-up" effect. Researchers should consider flagging limit-hit days.
4. **Foreign investor trading decomposition.** DataCore provides volume by investor type (foreign versus domestic). Researchers may wish to construct separate DTO measures for foreign and domestic volume, or use the foreign-to-domestic volume ratio as an additional dimension of disagreement.
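The foreign versus domestic decomposition mentioned in the last item might be sketched as follows, using the `foreign_buy_vol` and `foreign_sell_vol` columns from the data schema. The averaging of the buy and sell sides, the capping at total volume, and the function name `split_turnover_by_investor` are assumptions made for illustration, not a convention taken from the literature.

```python
import numpy as np
import pandas as pd

def split_turnover_by_investor(df):
    """Split daily turnover into foreign and domestic components.

    Assumption: every trade has one buyer and one seller, so foreign
    participation is measured as the average of foreign buy and sell volume.
    """
    df = df.copy()
    foreign_vol = 0.5 * (df['foreign_buy_vol'] + df['foreign_sell_vol'])
    foreign_vol = foreign_vol.clip(upper=df['volume'])  # guard against bad data
    shares = df['shares_outstanding'].where(df['shares_outstanding'] > 0)
    df['turnover_foreign'] = foreign_vol / shares
    df['turnover_domestic'] = (df['volume'] - foreign_vol) / shares
    # Foreign share of total volume; NaN on zero-volume days
    df['foreign_share'] = foreign_vol / df['volume'].where(df['volume'] > 0)
    return df

toy = pd.DataFrame({
    'ticker': ['AAA', 'AAA'],
    'volume': [1000, 0],
    'foreign_buy_vol': [200, 0],
    'foreign_sell_vol': [100, 0],
    'shares_outstanding': [100_000, 100_000],
})
split = split_turnover_by_investor(toy)
print(split[['turnover_foreign', 'turnover_domestic', 'foreign_share']])
```

The two components sum to total turnover by construction, so each can be passed through the same market-adjustment and detrending steps used for DTO.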
## Standardized Unexplained Volume (SUV) {#sec-suv}
### Construction Methodology
The Standardized Unexplained Volume measure, proposed by @garfinkel2009measuring, isolates the disagreement component of volume by explicitly controlling for the information content of returns. The insight is that trading volume has both a **liquidity** component and an **informedness** component correlated with the magnitude and sign of returns. By regressing turnover on signed returns and extracting the standardized residual, SUV captures volume attributable to disagreement after controlling for both liquidity trends and information-driven trading.
For each stock $i$, on each trading date $t$, we estimate using data from the estimation window $[\tau_1, \tau_2]$:
$$
\text{Turn}_{i,s} = \alpha_i + \beta_i^{+} \cdot \text{RetPos}_{i,s} + \beta_i^{-} \cdot \text{RetNeg}_{i,s} + \epsilon_{i,s}, \quad s \in [\tau_1, \tau_2]
$$ {#eq-suv-regression}
where $\text{RetPos}_{i,s} = |r_{i,s}| \cdot \mathbf{1}(r_{i,s} > 0)$ and $\text{RetNeg}_{i,s} = |r_{i,s}| \cdot \mathbf{1}(r_{i,s} < 0)$.
The Standardized Unexplained Volume on date $t$ is:
$$
\text{SUV}_{i,t} = \frac{\text{Turn}_{i,t} - \hat{\text{Turn}}_{i,t}}{\hat{\sigma}_{\epsilon,i}}
$$ {#eq-suv}
where $\hat{\text{Turn}}_{i,t}$ is the predicted turnover and $\hat{\sigma}_{\epsilon,i}$ is the RMSE from @eq-suv-regression.
The asymmetric specification with separate coefficients for positive and negative returns reflects that the volume-return relation differs by return sign. In the U.S., buying pressure tends to generate more volume than selling pressure due to short-sale frictions. In Vietnam, where short selling was unavailable until 2025, this asymmetry should be even more pronounced because all selling activity was constrained to existing shareholders.
```{python}
#| label: suv-construction
#| code-summary: "Construct Standardized Unexplained Volume (SUV)"
def compute_suv(df, calendar, config):
    """
    Compute Standardized Unexplained Volume via rolling regressions.

    For each stock-date, regress Turn on RetPos and RetNeg over the
    estimation window, then compute SUV = (actual - predicted) / RMSE.
    """
    est_window = config['est_window']
    min_obs = int(est_window * config['min_volume_days'])

    # Signed return components. A missing return stays missing rather
    # than being silently treated as a zero-return day.
    df = df.copy()
    df['ret_pos'] = df['ret'].clip(lower=0)
    df['ret_neg'] = df['ret'].clip(upper=0).abs()
    cols = ['turnover', 'ret_pos', 'ret_neg']

    results = []
    grouped = {t: g for t, g in df.groupby('ticker')}

    for _, cal_row in calendar.iterrows():
        dt = cal_row['date']
        est_s, est_e = cal_row['est_start'], cal_row['est_end']

        for ticker, tdata in grouped.items():
            # Estimation window
            est = tdata[
                (tdata['date'] >= est_s) & (tdata['date'] <= est_e)
            ].dropna(subset=cols)
            if len(est) < min_obs:
                continue

            # Event date (requires both turnover and a return)
            evt = tdata[tdata['date'] == dt].dropna(subset=cols)
            if evt.empty:
                continue

            # OLS: Turn = alpha + beta_pos * RetPos + beta_neg * RetNeg
            X = est[['ret_pos', 'ret_neg']].values
            y = est['turnover'].values
            reg = LinearRegression().fit(X, y)
            y_hat = reg.predict(X)
            rmse = np.sqrt(np.mean((y - y_hat) ** 2))
            if rmse <= 0:
                continue

            # Predict and standardize for event date
            X_evt = evt[['ret_pos', 'ret_neg']].values
            pred = reg.predict(X_evt)[0]
            actual = evt['turnover'].values[0]
            suv = (actual - pred) / rmse

            results.append({
                'ticker': ticker, 'date': dt,
                'suv': suv,
                'predicted_turnover': pred,
                'rmse_turn': rmse,
                'n_est': len(est),
                'alpha_turn': reg.intercept_,
                'beta_pos': reg.coef_[0],
                'beta_neg': reg.coef_[1],
            })

    suv_df = pd.DataFrame(results)
    if suv_df.empty:
        print("SUV: no stock-date obs met the data requirements")
        return suv_df
    print(f"SUV: {len(suv_df):,} stock-date obs")
    print(f"  Mean: {suv_df['suv'].mean():.4f}, "
          f"Median: {suv_df['suv'].median():.4f}")
    return suv_df
```
### Interpreting the SUV Regression Coefficients
The estimated coefficients from @eq-suv-regression are informative about market microstructure. @garfinkel2009measuring reports $\hat{\beta}^{+} > \hat{\beta}^{-}$ for most U.S. stocks. In Vietnam, we expect this asymmetry to be even stronger because:
- **No short selling (pre-2025):** All selling is by existing shareholders, limiting volume response to negative returns.
- **T+2 settlement:** Investors cannot immediately reinvest sale proceeds, further dampening sell-side volume.
- **Price limits:** The $\pm$ 7% (HOSE) and $\pm$ 10% (HNX) daily limits truncate the return distribution, compressing the range of both regressors.
Researchers should report summary statistics of $(\hat{\alpha}, \hat{\beta}^{+}, \hat{\beta}^{-}, R^2)$ across the cross-section and over time.
```{python}
#| label: suv-diagnostics
#| code-summary: "Diagnostic statistics for SUV turnover regressions"
def suv_diagnostics(suv_df):
    """Report cross-sectional summary of SUV regression parameters."""
    print("\n=== SUV Regression Diagnostics ===")
    params = ['alpha_turn', 'beta_pos', 'beta_neg']
    print(suv_df[params].describe(
        percentiles=[.05, .25, .50, .75, .95]
    ).T.to_string(float_format='{:.6f}'.format))

    # Asymmetry test
    diff = suv_df['beta_pos'] - suv_df['beta_neg']
    print(f"\nbeta_pos - beta_neg: mean = {diff.mean():.6f}, "
          f"frac > 0 = {(diff > 0).mean():.3f}")
```
# Volatility-Based DIVOP Proxies {#sec-volatility}
## Total Return Volatility {#sec-total-vol}
### Theoretical Motivation
Stock return volatility serves as a proxy for divergence of opinion through several channels. @shalen1993volume develops a model in which both volume and volatility are increasing in the dispersion of investor beliefs. @scheinkman2003overconfidence predict that higher volatility reflects the speculative trading component driven by overconfident investors who disagree about value. Empirically, @boehme2006short and @chatterjee2012takeovers use idiosyncratic volatility as a DIVOP proxy and find it positively correlated with other disagreement measures and negatively associated with subsequent returns when short-sale constraints bind.
### Construction
Total return volatility is the standard deviation of daily returns over the rolling estimation window:
$$
\text{VOLATILITY}_{i,t} = \sqrt{\frac{1}{N_i - 1} \sum_{s \in [\tau_1, \tau_2]} (r_{i,s} - \bar{r}_i)^2}
$$ {#eq-volatility}
where $N_i$ is the number of non-missing return observations for stock $i$ in the window $[\tau_1, \tau_2]$.
## Idiosyncratic Volatility (IVOL) {#sec-ivol}
Idiosyncratic volatility isolates firm-specific return variation by removing the systematic component explained by market movements. We compute IVOL from the residuals of a market model:
$$
r_{i,s} = \alpha_i + \beta_i \cdot r_{m,s} + \epsilon_{i,s}, \quad s \in [\tau_1, \tau_2]
$$ {#eq-market-model}
$$
\text{IVOL}_{i,t} = \text{Std}(\hat{\epsilon}_{i,s})
$$ {#eq-ivol}
Researchers may extend this to the @fama1993common three-factor model, or to a five-factor specification, using Vietnamese factor portfolios constructed elsewhere in this book. A richer factor model yields IVOL estimates that better isolate truly idiosyncratic disagreement, at the cost of requiring factor portfolio construction.
```{python}
#| label: volatility-construction
#| code-summary: "Construct total and idiosyncratic volatility"
def compute_volatility(df, calendar, config):
    """
    Compute total return volatility and idiosyncratic volatility
    via rolling estimation windows.

    Total vol = std(returns) in window.
    IVOL = std(residuals) from market model regression.
    """
    est_window = config['est_window']
    min_obs = int(est_window * config['min_volume_days'])

    # Value-weighted market return. Weights use same-day market cap;
    # lagged market cap is the stricter choice when it is available.
    def _vw_ret(g):
        valid = g.dropna(subset=['ret', 'adj_shares_out', 'close'])
        w = valid['adj_shares_out'] * valid['close']
        if valid.empty or w.sum() <= 0:
            return np.nan
        return np.average(valid['ret'], weights=w)

    mkt_ret = (
        df.groupby('date')[['ret', 'adj_shares_out', 'close']]
        .apply(_vw_ret).rename('mkt_ret').reset_index()
    )
    df = df.merge(mkt_ret, on='date', how='left')

    results = []
    grouped = {t: g for t, g in df.groupby('ticker')}

    for _, cal_row in calendar.iterrows():
        dt = cal_row['date']
        est_s, est_e = cal_row['est_start'], cal_row['est_end']

        for ticker, tdata in grouped.items():
            est = tdata[
                (tdata['date'] >= est_s) & (tdata['date'] <= est_e)
            ].dropna(subset=['ret', 'mkt_ret'])
            if len(est) < min_obs:
                continue

            # Total volatility
            total_vol = est['ret'].std()

            # Market model -> IVOL
            X = est[['mkt_ret']].values
            y = est['ret'].values
            reg = LinearRegression().fit(X, y)
            resid = y - reg.predict(X)
            ivol = np.std(resid, ddof=1)

            results.append({
                'ticker': ticker, 'date': dt,
                'total_volatility': total_vol,
                'idio_volatility': ivol,
                'market_beta': reg.coef_[0],
                'market_alpha': reg.intercept_,
                'r_squared_mm': reg.score(X, y),
                'n_vol': len(est),
            })

    vol_df = pd.DataFrame(results)
    if vol_df.empty:
        print("Volatility: no stock-date obs met the data requirements")
        return vol_df
    print(f"Volatility: {len(vol_df):,} stock-date obs")
    print(f"  Total vol (ann. mean): "
          f"{vol_df['total_volatility'].mean() * np.sqrt(252):.4f}")
    print(f"  IVOL (ann. mean): "
          f"{vol_df['idio_volatility'].mean() * np.sqrt(252):.4f}")
    return vol_df
```
### Vietnam-Specific Considerations for Volatility
1. **Price limits compress measured volatility.** Daily limits of $\pm$ 7% (HOSE) and $\pm$ 10% (HNX) mechanically truncate the return distribution, leading to underestimation of true volatility. On limit-hit days, the true equilibrium return may exceed the observed return. Researchers should be aware that volatility-based DIVOP measures may be downward-biased for stocks that frequently hit limits.
2. **VN-Index concentration.** The VN-Index is highly concentrated: the top 10 stocks often account for 50-60% of index weight. For small- and mid-cap stocks, an equal-weighted market return or a composite HOSE+HNX index may provide a better market factor in @eq-market-model.
3. **Thin trading and non-synchronous returns.** For thinly traded stocks, consecutive zero-return days can depress measured volatility. The @dimson1979risk adjustment (including lagged and lead market returns in the market model) may help correct for non-synchronous trading bias in the beta estimate, though its effect on IVOL is typically small.
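The Dimson-style adjustment mentioned in item 3 can be sketched as below. This is a minimal illustration under stated assumptions: `ret` and `mkt` are aligned daily return Series for one stock and the market within one estimation window, the function name `dimson_beta_ivol` is hypothetical, and one lag and one lead are used by default. The adjusted beta is taken as the sum of the slope coefficients, and IVOL as the residual standard deviation from the augmented regression.

```python
import numpy as np
import pandas as pd

def dimson_beta_ivol(ret, mkt, n_lags=1, n_leads=1):
    """Market model augmented with lagged and lead market returns.

    Returns (dimson_beta, ivol): the sum of the slope coefficients and
    the degrees-of-freedom-adjusted residual standard deviation.
    """
    # shift(-k) with k < 0 gives lagged market returns, k > 0 gives leads
    cols = {f'mkt_{k:+d}': mkt.shift(-k)
            for k in range(-n_lags, n_leads + 1)}
    X = pd.DataFrame(cols)
    data = pd.concat([ret.rename('ret'), X], axis=1).dropna()

    A = np.column_stack([np.ones(len(data)), data[X.columns].to_numpy()])
    y = data['ret'].to_numpy()
    coefs, *_ = np.linalg.lstsq(A, y, rcond=None)

    resid = y - A @ coefs
    dof = len(y) - A.shape[1]
    ivol = np.sqrt(resid @ resid / dof)
    return coefs[1:].sum(), ivol
```

Comparing the output with the plain market-model estimates for thinly traded stocks gives a quick sense of how much non-synchronous trading matters in a given sample.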
# Spread-Based and Liquidity DIVOP Proxies {#sec-spread}
## Bid-Ask Spread (BASPREAD) {#sec-baspread}
### Theoretical Motivation
The bid-ask spread reflects the adverse selection costs faced by limit order providers. When investors hold heterogeneous beliefs, each trade is more likely to convey private information, raising the adverse selection component of the spread. @handa2003quote show that in order-driven markets the spread widens when divergence of opinion increases because limit order providers face greater risk of being picked off by informed traders. @chung2014simple demonstrate that closing bid-ask spreads from daily data provide a reliable approximation to intraday effective spreads.
### Construction
We compute the proportional bid-ask spread using end-of-day quote data:
$$
\text{BASPREAD}_{i,t} = \frac{\text{Ask}_{i,t} - \text{Bid}_{i,t}}{\text{Midpoint}_{i,t}}
$$ {#eq-baspread}
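where $\text{Midpoint}_{i,t} = (\text{Ask}_{i,t} + \text{Bid}_{i,t}) / 2$. When end-of-day bid and ask are unavailable, we fall back on the daily high-low range scaled by its own midpoint. This fallback is coarse, since the high-low range also reflects intraday volatility, so the two sources should be kept distinguishable in the data. Following @chung2014simple, we delete observations where both Bid and Ask are zero, and where the spread exceeds 50% of the midpoint.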
## Amihud Illiquidity (ILLIQ) {#sec-amihud}
The @amihud2002illiquidity ratio measures the price impact of order flow:
$$
\text{ILLIQ}_{i,t} = \frac{|r_{i,t}|}{\text{DolVol}_{i,t}}
$$ {#eq-amihud}
where $\text{DolVol}_{i,t} = \text{Volume}_{i,t} \times \text{Price}_{i,t}$ (in billions VND for scaling). Higher ILLIQ indicates a larger price move per unit of trading value, which is typically read as lower liquidity and greater information asymmetry. We average daily ratios over monthly horizons and use the log transformation due to heavy right skew.
```{python}
#| label: spread-illiq
#| code-summary: "Construct bid-ask spread and Amihud illiquidity"
def compute_spread_and_illiq(df, config):
    """Compute bid-ask spread (BASPREAD) and Amihud illiquidity."""
    df = df.copy()

    # --- Bid-Ask Spread ---
    df['midpoint_ba'] = (df['ask'] + df['bid']) / 2
    df['baspread_ba'] = np.where(
        (df['ask'] > 0) & (df['bid'] > 0) & (df['midpoint_ba'] > 0),
        (df['ask'] - df['bid']) / df['midpoint_ba'], np.nan
    )

    # Fallback: high/low range
    df['midpoint_hl'] = (df['high'] + df['low']) / 2
    df['baspread_hl'] = np.where(
        (df['high'] > 0) & (df['low'] > 0) & (df['midpoint_hl'] > 0),
        (df['high'] - df['low']) / df['midpoint_hl'], np.nan
    )
    df['baspread'] = df['baspread_ba'].fillna(df['baspread_hl'])
    df['midpoint'] = df['midpoint_ba'].fillna(df['midpoint_hl'])

    # Chung & Zhang (2014) filters: drop missing, oversized,
    # and negative (crossed-quote) spreads
    bad = (
        df['baspread'].isna()
        | (df['baspread'] > config['max_spread_pct'])
        | (df['baspread'] < 0)
    )
    df.loc[bad, 'baspread'] = np.nan

    # --- Amihud Illiquidity ---
    df['dollar_vol'] = df['volume'] * df['close'] / 1e9  # billions VND
    df['amihud_daily'] = np.where(
        df['dollar_vol'] > 0,
        np.abs(df['ret']) / df['dollar_vol'], np.nan
    )

    print(f"BASPREAD: {df['baspread'].notna().sum():,} valid obs, "
          f"mean = {df['baspread'].mean():.6f}")
    print(f"AMIHUD: {df['amihud_daily'].notna().sum():,} valid obs, "
          f"mean = {df['amihud_daily'].mean():.6f}")
    return df


def compute_amihud_monthly(df):
    """Monthly Amihud = mean daily |ret|/dollar_vol (min 15 days)."""
    df = df.copy()
    df['ym'] = df['date'].dt.to_period('M')
    agg = df.groupby(['ticker', 'ym']).agg(
        illiq_mean=('amihud_daily', 'mean'),
        n_days=('amihud_daily', 'count'),
    ).reset_index()
    agg = agg[agg['n_days'] >= 15].copy()
    # Small constant keeps the log finite when the monthly mean is zero
    agg['log_illiq'] = np.log(agg['illiq_mean'] + 1e-10)
    return agg
```
### Vietnam-Specific Considerations for Spread and Liquidity
1. **Tick size schedule.** Vietnam uses variable tick sizes: 10 VND (prices \< 10,000), 50 VND (10,000--49,950), and 100 VND (≥ 50,000) on HOSE. These impose a floor on quoted spreads for low-priced stocks. Researchers should be cautious interpreting cross-price-decile spread variation as reflecting opinion divergence rather than tick-size mechanics.
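2. **Order-driven market structure.** Both HOSE and HNX are pure order-driven markets where public limit orders provide liquidity. Closing quotes therefore come directly from the limit order book, which makes the @chung2014simple closing-spread approximation (originally validated on U.S. CRSP data) a natural fit.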
3. **Lot size requirements.** HOSE requires 100-share standard lots for continuous trading. For high-priced stocks, the standard lot represents a large capital commitment, potentially inflating quoted spreads relative to effective trading costs.
4. **Call auction effects.** Opening and closing sessions on HOSE use periodic call auctions, which can produce bid-ask quotes that differ substantially from continuous-trading spreads.
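The tick-size floor described in item 1 can be made concrete with a small helper. The sketch below encodes the HOSE schedule quoted above and returns the smallest possible quoted spread (one tick) relative to price. The function names `hose_tick_size` and `min_relative_spread` are hypothetical, and the schedule should be checked against the rules in force for the sample period.

```python
import numpy as np

def hose_tick_size(price):
    """Tick size in VND under the HOSE schedule quoted in the text."""
    price = np.atleast_1d(np.asarray(price, dtype=float))
    return np.select(
        [price < 10_000, price < 50_000],  # evaluated in order
        [10.0, 50.0],
        default=100.0,
    )

def min_relative_spread(price):
    """Smallest possible quoted spread (one tick) as a fraction of price."""
    price = np.atleast_1d(np.asarray(price, dtype=float))
    return hose_tick_size(price) / price
```

One possible use is to subtract this floor from `baspread` before comparing spreads across price deciles, so that the remaining variation is less mechanically tied to the tick schedule.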
# Analyst Forecast Dispersion {#sec-analyst}
## Theoretical Motivation
Analyst forecast dispersion, the cross-sectional standard deviation of individual analysts' earnings forecasts, is the most direct measure of divergence of opinion. Unlike market-based proxies that capture disagreement indirectly, forecast dispersion directly measures disagreement among informed market participants. @abarbanell1995analysts establish the theoretical basis, and @diether2002differences demonstrate that stocks with higher analyst forecast dispersion earn lower subsequent returns, consistent with the Miller overpricing hypothesis.
## Data Challenges in Vietnam
Constructing analyst forecast dispersion in Vietnam presents substantial challenges relative to the U.S.:
- **Coverage breadth.** While I/B/E/S covers over 4,000 U.S. companies, only 100--150 Vietnamese firms typically have coverage by at least 3 analysts, concentrated among VN30 constituents.
- **Data sources.** Analyst forecasts are available from DataCore.vn, FiinPro, Bloomberg, and Refinitiv. The choice of source affects coverage and timeliness.
- **Forecast staleness.** With limited coverage, forecasts may go unrevised for months. Following I/B/E/S methodology, we carry each forecast forward for a maximum of 105 days.
## Construction Methodology
The construction proceeds as follows:
1. **Clean individual forecasts.** Remove observations where the announcement date falls after the review date. Keep only annual EPS forecasts. For each analyst-ticker-fiscal period, retain only the latest forecast per calendar month.
2. **Handle stopped and excluded estimates.** Remove forecasts where the analyst has left the brokerage or the estimate has been excluded from consensus.
3. **Carry forward with staleness control.** Each forecast is valid until the earlier of: (a) the next forecast by the same analyst, (b) 105 days after the announcement, or (c) the actual earnings announcement date.
4. **Expand to monthly frequency.** For each ticker-month, identify all valid outstanding forecasts and compute dispersion.
5. **Compute scaled measures:**
$$
\text{DISP1}_{i,m} = \frac{\text{Std}(\hat{\text{EPS}}_{i,m}^{(a)})}{|\text{Mean}(\hat{\text{EPS}}_{i,m}^{(a)})|}
\qquad
\text{DISP2}_{i,m} = \frac{\text{Std}(\hat{\text{EPS}}_{i,m}^{(a)})}{\bar{P}_{i,m}}
$$
```{python}
#| label: analyst-dispersion
#| code-summary: "Construct analyst forecast dispersion (DISP1, DISP2)"
def construct_analyst_dispersion(forecasts_df, price_df, config):
    """
    Construct analyst forecast dispersion measures.

    Parameters
    ----------
    forecasts_df : pd.DataFrame
        Individual analyst forecasts with: ticker, analyst_id, broker,
        fpedats, anndats, revdats, value (EPS), anndats_act.
    price_df : pd.DataFrame
        Monthly price: ticker, month, mean_price.
    config : dict
        With min_analysts, forecast_carry_days, beg_date, end_date.
    """
    carry_days = config['forecast_carry_days']
    min_analysts = config['min_analysts']

    # Keep forecasts announced on or before their review date
    df = forecasts_df.copy()
    df = df[df['anndats'] <= df['revdats']].copy()
    df = df.dropna(subset=['fpedats', 'anndats', 'value'])

    # Latest forecast per analyst-month
    df['ym'] = df['anndats'].dt.to_period('M')
    df = df.sort_values(
        ['ticker', 'fpedats', 'analyst_id', 'ym', 'anndats', 'revdats']
    )
    df = df.groupby(['ticker', 'fpedats', 'analyst_id', 'ym']).tail(1)

    # Carry-forward end date. Sort ascending so that shift(-1) picks up
    # the analyst's NEXT (later) forecast for the same fiscal period.
    df = df.sort_values(['ticker', 'analyst_id', 'fpedats', 'anndats'])
    df['next_ann'] = df.groupby(
        ['ticker', 'analyst_id', 'fpedats']
    )['anndats'].shift(-1)

    def _carry_end(row):
        candidates = [row['anndats'] + pd.Timedelta(days=carry_days)]
        if pd.notna(row.get('next_ann')):
            candidates.append(row['next_ann'])
        if pd.notna(row.get('anndats_act')):
            candidates.append(row['anndats_act'])
        return min(candidates)

    df['carry_end'] = df.apply(_carry_end, axis=1)

    # Monthly expansion
    months = pd.period_range(config['beg_date'], config['end_date'], freq='M')
    records = []
    for month in months:
        me = month.to_timestamp(how='end')
        valid = df[(df['anndats'] <= me) & (df['carry_end'] > me)].copy()
        valid = valid[valid['fpedats'] > me]
        # Use only the nearest unreported fiscal period per ticker so
        # that forecasts for different horizons are not pooled
        nearest = valid.groupby('ticker')['fpedats'].transform('min')
        valid = valid[valid['fpedats'] == nearest]
        valid = valid.sort_values(['ticker', 'analyst_id', 'anndats'])
        valid = valid.groupby(['ticker', 'analyst_id']).tail(1)

        disp = valid.groupby('ticker').agg(
            n_analysts=('analyst_id', 'nunique'),
            mean_fcst=('value', 'mean'),
            std_fcst=('value', 'std'),
        ).reset_index()
        disp['month'] = month
        records.append(disp)

    if not records:
        return pd.DataFrame()
    disp_df = pd.concat(records, ignore_index=True)

    # Scaled measures
    disp_df['disp1'] = np.where(
        disp_df['mean_fcst'].abs() > 0,
        disp_df['std_fcst'] / disp_df['mean_fcst'].abs(), np.nan
    )
    disp_df = disp_df.merge(price_df, on=['ticker', 'month'], how='left')
    disp_df['disp2'] = np.where(
        disp_df['mean_price'] > 0,
        disp_df['std_fcst'] / disp_df['mean_price'], np.nan
    )
    disp_df['disp_raw'] = disp_df['std_fcst']

    out = disp_df[disp_df['n_analysts'] >= min_analysts].copy()
    print(f"DISP: {len(out):,} ticker-months (>= {min_analysts} analysts)")
    print(f"  Mean analysts: {out['n_analysts'].mean():.1f}")
    return out
```
## Scaling Considerations
Following @cheong2011eps, we note that each scaling choice has pitfalls. DISP1 (scaled by absolute mean forecast) can produce extreme values when the mean forecast approaches zero---common for Vietnamese firms near breakeven. DISP2 (scaled by price) introduces a mechanical negative correlation between price and scaled dispersion. We recommend reporting all three versions (DISP1, DISP2, and unscaled DISP_RAW with $\ln(\text{Price})$ as an additional control), and winsorizing DISP1 at the 1st and 99th percentiles.
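The winsorization step recommended above can be sketched as follows. This is a minimal illustration that assumes a ticker-month table with a `month` column and a `disp1` column, as produced by the dispersion code in this chapter. Clipping within each month rather than over the pooled sample is an assumption made here so that the cutoffs do not use future information; the function name `winsorize_by_month` is hypothetical.

```python
import pandas as pd

def winsorize_by_month(df, col='disp1', lower=0.01, upper=0.99):
    """Clip `col` at its within-month `lower` and `upper` quantiles."""
    out = df.copy()
    grp = out.groupby('month')[col]
    lo = grp.transform(lambda s: s.quantile(lower))
    hi = grp.transform(lambda s: s.quantile(upper))
    # Missing values stay missing; only the tails are pulled in
    out[col + '_w'] = out[col].clip(lower=lo, upper=hi)
    return out
```

The same helper can be applied to `disp2` or `disp_raw` by changing `col`, which keeps the treatment of outliers consistent across the three versions.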
::: callout-warning
## Caution on Analyst Dispersion in Thin-Coverage Markets
With typical coverage of 5--10 analysts per firm in Vietnam (versus 15--25 in the U.S.), forecast dispersion is estimated with substantially greater noise. A dispersion measure from 3 analysts has a very different sampling distribution than one from 20. Always include the number of analysts as a control and test robustness with varying minimum-analyst thresholds (3, 5, 7).
:::
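To see how much noisier dispersion becomes with few analysts, a small simulation can help. The sketch below draws forecasts as independent normal variables, which is purely an illustrative assumption, and reports how widely the estimated standard deviation scatters around the true value for different numbers of analysts. The function name `dispersion_sampling_spread` and all parameter values are hypothetical.

```python
import numpy as np

def dispersion_sampling_spread(n_analysts, true_sd=1.0,
                               n_sims=20_000, seed=0):
    """Scatter of the sample std across simulated analyst panels.

    Returns the standard deviation of the estimated dispersion,
    expressed as a fraction of the true dispersion `true_sd`.
    """
    rng = np.random.default_rng(seed)
    draws = rng.normal(0.0, true_sd, size=(n_sims, n_analysts))
    est = draws.std(axis=1, ddof=1)  # one dispersion estimate per panel
    return est.std(ddof=1) / true_sd

# Compare the thresholds discussed in the callout with a U.S.-style panel
for n in (3, 5, 7, 20):
    print(n, round(dispersion_sampling_spread(n), 3))
```

The relative scatter falls as the number of analysts rises, which is why the number of analysts belongs among the controls and why results should be checked at several minimum-coverage thresholds.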
# Cross-Sectional Correlations Among DIVOP Proxies {#sec-correlation}
An important empirical question is the degree to which the various DIVOP proxies capture the same underlying construct. If divergence of opinion is a well-defined latent variable, we expect positive correlations among all proxies, though correlations need not be high since each captures a different facet of disagreement.
```{python}
#| label: correlation-analysis
#| code-summary: "Spearman rank correlations among DIVOP proxies"
def compute_divop_correlations(merged_df, proxies=None):
    """
    Compute and visualize Spearman correlations among DIVOP proxies.
    We use rank correlations because many proxies are right-skewed.
    """
    if proxies is None:
        proxies = [
            'dto', 'suv', 'total_volatility', 'idio_volatility',
            'baspread', 'amihud_daily', 'disp1', 'disp2'
        ]
    available = [p for p in proxies if p in merged_df.columns]
    data = merged_df[available].dropna()

    n = len(available)
    rho_mat = np.eye(n)
    p_mat = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            rho, p = scipy_stats.spearmanr(
                data[available[i]], data[available[j]]
            )
            rho_mat[i, j] = rho_mat[j, i] = rho
            p_mat[i, j] = p_mat[j, i] = p

    labels = {'dto': 'DTO', 'suv': 'SUV',
              'total_volatility': 'VOL', 'idio_volatility': 'IVOL',
              'baspread': 'SPREAD', 'amihud_daily': 'ILLIQ',
              'disp1': 'DISP1', 'disp2': 'DISP2'}
    pretty = [labels.get(c, c) for c in available]
    corr_df = pd.DataFrame(rho_mat, index=pretty, columns=pretty)

    # Heatmap
    fig, ax = plt.subplots(figsize=(9, 7))
    mask = np.triu(np.ones_like(corr_df, dtype=bool), k=1)
    sns.heatmap(
        corr_df, mask=mask, annot=True, fmt='.3f',
        cmap='RdBu_r', center=0, vmin=-0.4, vmax=0.7,
        square=True, linewidths=0.5,
        cbar_kws={'shrink': 0.8, 'label': 'Spearman ρ'}, ax=ax
    )
    ax.set_title('Spearman Correlations Among DIVOP Proxies\n'
                 'Vietnamese Equity Market', fontsize=13, fontweight='bold')
    plt.tight_layout()
    plt.savefig('divop_correlations.png', dpi=300, bbox_inches='tight')
    plt.show()
    return corr_df
```
### Expected Correlation Patterns
Based on U.S. evidence and theory, we expect:
| Pair | Expected | Rationale |
|:-----------------------|:-----------------------|:-----------------------|
| DTO × SUV | High positive | Both capture abnormal volume; SUV refines DTO |
| VOL × IVOL | High positive | IVOL is the residual component of total volatility |
| SPREAD × ILLIQ | Moderate-high positive | Both capture information asymmetry |
| Volume × Volatility | Moderate positive | @shalen1993volume links both to belief dispersion |
| Analyst × Market-based | Weak-moderate positive | Different investor populations |
: Expected correlation structure among DIVOP proxies {#tbl-expected-corr}
# Descriptive Statistics and Cross-Sectional Properties {#sec-empirical}
## Summary Statistics
```{python}
#| label: descriptive-stats
#| code-summary: "Descriptive statistics for all DIVOP proxies"
def descriptive_statistics(merged_df):
    """Comprehensive descriptive statistics for DIVOP proxies."""
    proxies = {
        'dto': 'Unexplained Volume (DTO)',
        'suv': 'Std Unexplained Volume (SUV)',
        'total_volatility': 'Total Return Volatility',
        'idio_volatility': 'Idiosyncratic Volatility',
        'baspread': 'Bid-Ask Spread',
        'amihud_daily': 'Amihud Illiquidity',
        'disp1': 'Analyst Disp (mean-scaled)',
        'disp2': 'Analyst Disp (price-scaled)',
    }
    avail = {k: v for k, v in proxies.items() if k in merged_df.columns}

    rows = []
    for col, label in avail.items():
        s = merged_df[col].dropna()
        rows.append({
            'Proxy': label, 'N': f'{len(s):,}',
            'Mean': f'{s.mean():.6f}', 'Std': f'{s.std():.6f}',
            'P5': f'{s.quantile(.05):.6f}',
            'Median': f'{s.median():.6f}',
            'P95': f'{s.quantile(.95):.6f}',
            'Skew': f'{s.skew():.2f}',
            'Kurt': f'{s.kurtosis():.2f}',
        })

    stats = pd.DataFrame(rows).set_index('Proxy')
    print("\n" + "=" * 90)
    print("Descriptive Statistics of DIVOP Proxies")
    print("Vietnamese Equity Market, HOSE and HNX")
    print("=" * 90)
    print(stats.to_string())
    return stats
```
## DIVOP by Firm Characteristics
```{python}
#| label: by-characteristics
#| code-summary: "DIVOP by size and exchange"
def divop_by_size(merged_df):
    """Mean DIVOP proxies by market-cap quintile."""
    df = merged_df.copy()
    df['mkt_cap'] = df['close'] * df['shares_outstanding']
    labels = ['Q1 Small', 'Q2', 'Q3', 'Q4', 'Q5 Large']

    def _quintile(x):
        # Rank first so tied market caps cannot collapse bin edges;
        # skip dates with too few stocks to form five groups
        if x.notna().sum() < 5:
            return pd.Series(np.nan, index=x.index)
        return pd.qcut(x.rank(method='first'), 5, labels=False)

    codes = df.groupby('date')['mkt_cap'].transform(_quintile)
    df['size_q'] = codes.map(dict(enumerate(labels)))

    proxies = ['dto', 'suv', 'total_volatility', 'idio_volatility',
               'baspread', 'amihud_daily']
    avail = [p for p in proxies if p in df.columns]
    tab = df.groupby('size_q')[avail].mean()
    print("\n=== Mean DIVOP by Size Quintile ===")
    print(tab.to_string(float_format='{:.6f}'.format))
    return tab


def divop_by_exchange(merged_df):
    """Compare mean DIVOP across HOSE and HNX."""
    proxies = ['dto', 'suv', 'total_volatility', 'idio_volatility',
               'baspread', 'amihud_daily']
    avail = [p for p in proxies if p in merged_df.columns]
    tab = merged_df.groupby('exchange')[avail].mean()
    print("\n=== Mean DIVOP by Exchange ===")
    print(tab.to_string(float_format='{:.6f}'.format))
    return tab
```
## Time-Series Evolution
```{python}
#| label: time-series-plot
#| code-summary: "Time-series evolution of DIVOP proxies"
def plot_divop_timeseries(merged_df):
    """Plot monthly cross-sectional median DIVOP with crisis shading."""
    df = merged_df.copy()
    df['ym'] = df['date'].dt.to_period('M')
    proxies = ['dto', 'suv', 'total_volatility', 'baspread']
    avail = [p for p in proxies if p in df.columns]
    monthly = df.groupby('ym')[avail].median()
    monthly.index = monthly.index.to_timestamp()

    fig, axes = plt.subplots(len(avail), 1,
                             figsize=(13, 3.5 * len(avail)), sharex=True)
    if len(avail) == 1:
        axes = [axes]

    labels = {'dto': 'DTO', 'suv': 'SUV',
              'total_volatility': 'Volatility', 'baspread': 'Spread'}
    colors = ['#1976D2', '#388E3C', '#F57C00', '#D32F2F']

    for i, (proxy, ax) in enumerate(zip(avail, axes)):
        ax.plot(monthly.index, monthly[proxy],
                color=colors[i], linewidth=1.3)
        ax.set_ylabel(labels.get(proxy, proxy), fontsize=10)
        ax.grid(True, alpha=0.25)
        # Shade crisis episodes
        for s, e, c in [('2008-01', '2009-06', 'red'),
                        ('2020-01', '2020-12', 'orange'),
                        ('2022-09', '2023-06', 'purple')]:
            ax.axvspan(pd.Timestamp(s), pd.Timestamp(e),
                       alpha=0.1, color=c)

    axes[0].set_title(
        'Time-Series of DIVOP Proxies\n'
        'Monthly Cross-Sectional Median, HOSE & HNX',
        fontsize=13, fontweight='bold')

    from matplotlib.patches import Patch
    axes[-1].legend(handles=[
        Patch(facecolor='red', alpha=.2, label='GFC 2008-09'),
        Patch(facecolor='orange', alpha=.2, label='COVID-19'),
        Patch(facecolor='purple', alpha=.2, label='Bond Crisis 2022-23'),
    ], loc='upper right', fontsize=8)
    plt.tight_layout()
    plt.savefig('divop_timeseries.png', dpi=300, bbox_inches='tight')
    plt.show()
```
# Putting It All Together {#sec-pipeline}
```{python}
#| label: merge-all
#| code-summary: "Master pipeline: build full DIVOP dataset"
def build_divop_dataset(config):
    """
    Master pipeline: load data, construct all DIVOP proxies,
    merge into a single stock-date panel.
    """
    df = load_daily_data(config)
    df = apply_sample_filters(df, config)
    df = adjust_for_corporate_actions(df)
    calendar = build_trading_calendar(df, config)

    df = compute_dto(df, config)
    suv_df = compute_suv(df, calendar, config)
    vol_df = compute_volatility(df, calendar, config)
    df = compute_spread_and_illiq(df, config)

    # Merge
    base = df[['ticker', 'date', 'ret', 'close', 'volume',
               'shares_outstanding', 'exchange', 'industry_icb',
               'foreign_ownership_pct', 'turnover',
               'mato', 'dto', 'baspread', 'amihud_daily',
               'limit_hit']].copy()
    if not suv_df.empty:
        base = base.merge(
            suv_df[['ticker', 'date', 'suv', 'predicted_turnover']],
            on=['ticker', 'date'], how='left')
    if not vol_df.empty:
        base = base.merge(
            vol_df[['ticker', 'date', 'total_volatility',
                    'idio_volatility', 'market_beta']],
            on=['ticker', 'date'], how='left')

    print("\n=== Final DIVOP Dataset ===")
    print(f"Shape: {base.shape}")
    print(f"Tickers: {base['ticker'].nunique()}")
    return base
```
# Empirical Applications {#sec-applications}
## Application 1: DIVOP and the Cross-Section of Returns
The fundamental test of the Miller hypothesis is whether stocks with higher divergence of opinion earn lower subsequent returns. We implement Fama-MacBeth cross-sectional regressions:
$$
r_{i,t+1:t+h} = \gamma_{0,t} + \gamma_{1,t} \cdot \text{DIVOP}_{i,t} + \gamma_{2,t}' \mathbf{X}_{i,t} + \varepsilon_{i,t}
$$
where $\mathbf{X}_{i,t}$ includes controls for market beta, log market capitalization, and log book-to-market ratio. The Miller hypothesis predicts $\bar{\gamma}_1 < 0$.
```{python}
#| label: fama-macbeth
#| code-summary: "Fama-MacBeth regressions of returns on DIVOP"
def fama_macbeth_divop(merged_df, divop_proxy='suv',
                       controls=None, horizon=21):
    """
    Fama-MacBeth cross-sectional regressions.

    Miller predicts gamma_1 < 0; Varian predicts gamma_1 > 0.
    Daily cross-sections of an h-day forward return overlap, so the
    t-statistic uses Newey-West standard errors with h - 1 lags.
    """
    if controls is None:
        controls = ['market_beta', 'log_mktcap']

    df = merged_df.sort_values(['ticker', 'date']).copy()
    # Forward return: sum of daily returns over t+1 ... t+horizon
    df['fwd_ret'] = df.groupby('ticker')['ret'].transform(
        lambda x: x.shift(-1).rolling(horizon).sum().shift(-(horizon - 1))
    )
    df['log_mktcap'] = np.log(
        df['close'] * df['shares_outstanding'] + 1
    )

    X_cols = [divop_proxy] + [c for c in controls if c in df.columns]
    df_reg = df[['ticker', 'date', 'fwd_ret'] + X_cols].dropna()

    results = []
    for date, cross in df_reg.groupby('date'):
        if len(cross) < 30:
            continue
        y = cross['fwd_ret'].values
        X = np.column_stack([np.ones(len(cross)), cross[X_cols].values])
        try:
            coefs, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
        except np.linalg.LinAlgError:
            continue
        results.append({
            'date': date, 'intercept': coefs[0],
            f'gamma_{divop_proxy}': coefs[1], 'n': len(cross),
        })

    fm = pd.DataFrame(results)
    if len(fm) < 2:
        print("Fama-MacBeth: too few cross-sections to estimate")
        return fm

    gc = f'gamma_{divop_proxy}'
    mu = fm[gc].mean()

    # Newey-West (Bartlett kernel) variance of the mean coefficient
    dev = fm[gc].values - mu
    T = len(dev)
    lags = min(horizon - 1, T - 1)
    lrv = dev @ dev / T
    for lag in range(1, lags + 1):
        w = 1 - lag / (lags + 1)
        lrv += 2 * w * (dev[lag:] @ dev[:-lag]) / T
    se = np.sqrt(lrv / T)
    t = mu / se

    print(f"\n=== Fama-MacBeth: {divop_proxy} -> "
          f"{horizon}-day fwd returns ===")
    print(f"  Mean gamma: {mu:.6f}, t-stat (Newey-West): {t:.3f}")
    if t < -1.96:
        print("  -> Consistent with Miller (1977)")
    elif t > 1.96:
        print("  -> Consistent with Varian (1985)")
    else:
        print("  -> Inconclusive at 5%")
    return fm
```
## Application 2: DIVOP and Earnings Announcements
Following @berkman2009sell, we test whether high-DIVOP stocks experience negative abnormal returns around earnings announcements, as uncertainty resolution reduces the optimism premium.
```{python}
#| label: ea-event
#| code-summary: "Event study: DIVOP and earnings announcement returns"
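def divop_earnings_event(merged_df, ea_dates_df,
                         divop_proxy='suv', window=(-1, 3)):
    """
    Sort announcements into DIVOP quintiles using the proxy observed
    before the earnings announcement (EA).

    Miller predicts: Q5 (high DIVOP) has lower CAR than Q1 (low DIVOP).
    The CAR over `window` is not computed here; merge it onto the
    returned frame by ticker and ea_date.
    """
    df = merged_df[['ticker', 'date', divop_proxy]].dropna()
    df = df.sort_values('date')
    ea = ea_dates_df.copy()

    # Pre-EA DIVOP: latest trading day at least 5 calendar days before
    # the EA. An exact date match would drop events whose lookback
    # date falls on a weekend or holiday.
    ea['pre_date'] = ea['ea_date'] - pd.Timedelta(days=5)
    ea = pd.merge_asof(
        ea.sort_values('pre_date'), df,
        left_on='pre_date', right_on='date', by='ticker',
        direction='backward', tolerance=pd.Timedelta(days=10)
    ).dropna(subset=[divop_proxy])
    ea = ea.rename(columns={'date': 'divop_date'})

    # Rank first so ties cannot collapse the quintile edges
    ea['divop_q'] = pd.qcut(
        ea[divop_proxy].rank(method='first'), 5,
        labels=['Q1 Low', 'Q2', 'Q3', 'Q4', 'Q5 High']
    )

    print(f"\n=== EA Event Study by {divop_proxy} quintile ===")
    print(f"  Window: ({window[0]}, {window[1]}) days")
    print("  Miller predicts: Q5 has lower CAR than Q1")
    return ea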
```
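The cumulative abnormal return used in this test can be computed in several ways. The sketch below shows one simple possibility, a market-adjusted CAR measured in trading days around each announcement. It assumes a daily panel with columns `ticker`, `date`, `ret`, and `mkt_ret` (the market-return column is an assumption, since the merged panel built earlier does not carry it by default) and an events table with `ticker` and `ea_date`. The function name `market_adjusted_car` is hypothetical, and a market-model or factor-adjusted benchmark may be preferable in practice.

```python
import numpy as np
import pandas as pd

def market_adjusted_car(panel, events, window=(-1, 3)):
    """Sum of (ret - mkt_ret) over trading days [a, b] around each EA.

    Day 0 is the first trading day on or after `ea_date`. Events whose
    window is not fully observed are skipped.
    """
    a, b = window
    rows = []
    for ticker, ev in events.groupby('ticker'):
        p = panel[panel['ticker'] == ticker].sort_values('date')
        dates = p['date'].to_numpy()
        ar = (p['ret'] - p['mkt_ret']).to_numpy()
        for ea_date in ev['ea_date']:
            t0 = np.searchsorted(
                dates, pd.Timestamp(ea_date).to_datetime64(), side='left'
            )
            lo, hi = t0 + a, t0 + b
            if t0 >= len(dates) or lo < 0 or hi >= len(dates):
                continue
            # A missing return inside the window makes the CAR missing
            rows.append({'ticker': ticker, 'ea_date': ea_date,
                         'car': ar[lo:hi + 1].sum()})
    return pd.DataFrame(rows, columns=['ticker', 'ea_date', 'car'])
```

Merging the result onto the quintile assignments by `ticker` and `ea_date` and averaging `car` within each `divop_q` group gives the Q1 versus Q5 comparison that the Miller hypothesis speaks to.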
## Application 3: Composite DIVOP Index via PCA
When a single summary measure of disagreement is needed, PCA on the battery of standardized proxies extracts the common "disagreement factor."
```{python}
#| label: pca-composite
#| code-summary: "Composite DIVOP index via PCA"
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA


def composite_divop_pca(merged_df, proxies=None):
    """Extract first principal component from standardized DIVOP proxies."""
    if proxies is None:
        proxies = ['dto', 'suv', 'total_volatility', 'idio_volatility',
                   'baspread', 'amihud_daily']
    avail = [p for p in proxies if p in merged_df.columns]
    data = merged_df[['ticker', 'date'] + avail].dropna().copy()

    X = StandardScaler().fit_transform(data[avail])
    # PCA cannot return more components than there are inputs
    n_comp = min(3, len(avail))
    pca = PCA(n_components=n_comp)
    factors = pca.fit_transform(X)
    components = pca.components_.copy()

    # Orient PC1 so that higher values mean more disagreement: flip
    # scores and loadings together if the loadings are net negative
    if components[0].sum() < 0:
        factors[:, 0] *= -1
        components[0] *= -1
    data['divop_composite'] = factors[:, 0]

    loadings = pd.DataFrame(
        components.T, index=avail,
        columns=[f'PC{k + 1}' for k in range(n_comp)]
    )
    print("\n=== PCA Composite DIVOP ===")
    print(f"Variance explained: "
          f"{pca.explained_variance_ratio_.round(3)}")
    print(f"\nLoadings:\n{loadings.to_string(float_format='{:.4f}'.format)}")
    return data[['ticker', 'date', 'divop_composite']], loadings
```
# Conclusion and Practical Recommendations
This chapter has provided a comprehensive methodology for constructing seven distinct proxies for divergence of investor opinion adapted to the Vietnamese equity market. We conclude with practical recommendations:
**1. Prefer multiple proxies.** No single DIVOP measure is without limitations. We recommend constructing and reporting results for at least three proxies spanning different economic channels (volume, volatility, spreads, or analyst forecasts).
**2. Account for Vietnam-specific microstructure.** Daily price limits, T+2 settlement, foreign ownership constraints, and the order-driven market structure all affect DIVOP properties. Flag limit-hit days, include exchange fixed effects, and control for foreign ownership.
**3. Vietnam as a natural laboratory for Miller (1977).** The absence of short selling through 2024 and the dominance of retail investors create conditions that closely match Miller's theoretical setting. The introduction of short selling in 2025 creates a natural experiment for examining how relaxation of short-sale constraints affects the DIVOP-return relation.
**4. Control for analyst coverage when using DISP measures.** With typical coverage of 5--10 analysts per firm, forecast dispersion is estimated with greater noise than in developed markets. Always include the number of analysts as a control variable and conduct robustness checks with varying minimum-analyst thresholds.
**5. Consider constructing a composite index.** When researchers need a single summary measure of disagreement, the PCA-based composite index described in @sec-applications provides a principled approach to aggregating information across the individual proxies. The first principal component typically explains 30--50% of the common variation in the battery of DIVOP measures.
**6. Winsorize aggressively.** Several DIVOP proxies (particularly DISP1, Amihud ILLIQ, and SUV) exhibit extreme outliers in the Vietnamese data. Winsorization at the 1st and 99th percentiles (or even 2nd and 98th for DISP1) is essential for obtaining reliable regression results.
**7. Be cautious about causal inference.** DIVOP proxies are endogenous: they respond to the same firm characteristics (size, leverage, growth) that also affect returns. Researchers should use appropriate controls, consider instrumental variables where feasible, and be explicit about the limitations of their identification strategy.
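The winsorization step in recommendation 6 can be sketched as follows. This is a minimal illustration and not the chapter's own implementation: the helper name `winsorize_by_group`, the `year` and `disp1` column names, and the option to compute cutoffs within each period are assumptions introduced here, and the demo data are simulated.

```python
import numpy as np
import pandas as pd

def winsorize_by_group(df, cols, by=None, lower=0.01, upper=0.99):
    """Clip each column in `cols` at its lower/upper quantiles.

    If `by` is given (for example a year column), the cutoffs are computed
    separately within each group; otherwise they are pooled.
    Hypothetical helper for illustration only.
    """
    out = df.copy()
    for col in cols:
        if by is None:
            lo, hi = out[col].quantile([lower, upper])
            out[col] = out[col].clip(lower=lo, upper=hi)
        else:
            grp = out.groupby(by)[col]
            lo = grp.transform(lambda s: s.quantile(lower))
            hi = grp.transform(lambda s: s.quantile(upper))
            out[col] = out[col].clip(lower=lo, upper=hi)
    return out

# Simulated heavy-tailed data standing in for DISP1 and Amihud ILLIQ
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    'year': np.repeat([2022, 2023], 500),
    'disp1': rng.standard_t(df=2, size=1000),
    'amihud_daily': rng.lognormal(size=1000),
})

# 1st/99th percentiles for both proxies, tighter 2nd/98th cutoffs for DISP1
w = winsorize_by_group(demo, ['disp1', 'amihud_daily'], by='year')
w_tight = winsorize_by_group(demo, ['disp1'], by='year', lower=0.02, upper=0.98)

print(pd.DataFrame({
    'raw': demo['disp1'].agg(['min', 'max']),
    'p1_p99': w['disp1'].agg(['min', 'max']),
    'p2_p98': w_tight['disp1'].agg(['min', 'max']),
}))
```

Whether cutoffs should be pooled or computed period by period is a design choice. Period-by-period cutoffs keep a single volatile year from setting the bounds for the whole sample, at the cost of making the thresholds time-varying.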
The DIVOP framework is particularly relevant for the Vietnamese market at this point in its development. As the market matures toward potential FTSE Emerging Market reclassification, as short selling becomes more widely available, and as institutional investor participation grows, the dynamics of opinion divergence and its pricing implications are likely to evolve significantly. The methodology presented in this chapter provides researchers with the tools to document and analyze these changes as they unfold.