6 Constructing and Analyzing Equity Return Series
This chapter develops a practical framework for transforming raw equity price records into return series suitable for empirical financial analysis. The focus is on methodological clarity and reproducibility, with particular attention to data issues that are prevalent in emerging equity markets such as Vietnam.
The discussion proceeds from individual stocks to a broad market cross-section, using constituents of the VN30 index as the primary empirical setting.
6.1 Data Access and Preparation
We begin by loading the core numerical and data manipulation libraries. These provide all functionality required for return construction without relying on specialized financial wrappers.
import pandas as pd
import numpy as np
For this project, we retrieve our historical price data using the DataCore API. If you wish to replicate this analysis or use the dataset for your own work, you will need to access the data through their platform.
6.1.1 Prerequisites for API Access
To run the code below, you need to configure a few things first:
- Obtain an API Key: You must subscribe to the relevant dataset on the DataCore platform to receive a unique API key.
- Whitelist Your IP Address: The API requires your IP address to be whitelisted for security.
- Local Machine: If you are running this code on your personal computer, you generally need to whitelist your public IP address.
- Cloud or Remote Sessions (e.g., HPC Open OnDemand): If you are using a remote server such as those provided by DataCore, the server’s IP address will change with each new session. You must retrieve the server’s private/public IP for that specific session and whitelist it in your DataCore account settings before running the script.
- Set Environment Variables: To keep your credentials secure, do not hardcode your API key into your scripts. Instead, save it as an environment variable named datacore_api on your machine.
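Once the environment variable is set, it can be read from Python like this. (An illustrative sketch; the helper function is not part of any DataCore client, and only the variable name `datacore_api` comes from the setup step above.)

```python
import os

def get_api_key(env=os.environ):
    """Return the DataCore API key from the environment, failing loudly if absent.

    Illustrative helper: the variable name 'datacore_api' follows the setup
    step described above.
    """
    key = env.get("datacore_api")
    if key is None:
        raise RuntimeError("Set the 'datacore_api' environment variable first.")
    return key
```

Failing loudly at startup is preferable to sending unauthenticated requests and debugging opaque API errors later.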
Note: If you only want to test the code performance, DataCore provides a preview endpoint that does not require an API key, though the data returned is limited.
import requests
import pandas as pd
url = "https://gateway.datacore.vn/data/ds/preview"
params = {
    "dataSetCode": "fundamental_annual",
    "pageSize": 10000
}
headers = {
    "Accept": "application/json",
    "Origin": "https://datacore.vn",
    "Referer": "https://datacore.vn/"
}
response = requests.get(url, params=params, headers=headers)
data = response.json()
columns = data['data']['fields']
rows = data['data']['dataDetail']
df = pd.DataFrame(rows, columns=columns)
print(df.head())
print("Total rows:", len(df))
  symbol  year  total_current_asset ca_fin        ca_cce       ca_cash  \
0 CLL 2011 1.078338e+11 None 8.313178e+10 4.131776e+09
1 CLL 2012 2.360575e+10 None 8.003560e+09 4.003560e+09
2 CLL 2013 5.764370e+10 None 3.496426e+10 9.964256e+09
3 CLL 2014 4.973590e+10 None 1.718744e+10 1.718744e+10
4 CLL 2015 2.389115e+11 None 1.790364e+11 2.403638e+10
ca_cash_inbank ca_cash_attransit ca_cash_equivalent ca_fin_invest ... \
0 None None 7.900000e+10 0.000000e+00 ...
1 None None 4.000000e+09 0.000000e+00 ...
2 None None 2.500000e+10 0.000000e+00 ...
3 None None 0.000000e+00 1.000000e+09 ...
4 None None 1.550000e+11 1.000000e+09 ...
operating_margin roe roa sector_pe sector_pb sector_ps \
0 0.35473 0.19560 0.10649 3.00213 0.26108 0.22170
1 0.45369 0.20337 0.13018 2.40335 0.26211 0.21745
2 0.45997 0.23459 0.16452 3.11089 0.41013 0.32436
3 0.40554 0.19984 0.14747 3.25886 0.46823 0.42146
4 0.33774 0.16525 0.12633 6.77337 0.81110 0.68401
sector_eps sector_ros sector_roe sector_roa
0 3341.54903 0.07385 0.08753 0.05039
1 4296.47005 0.09048 0.11392 0.06346
2 4024.74648 0.10427 0.13645 0.07415
3 5100.81019 0.12933 0.14956 0.08087
4 5216.38499 0.10098 0.11815 0.06234
[5 rows x 308 columns]
Total rows: 10
6.1.2 Checking Your IP Address
If you need to verify the IP address of the machine running your code (to whitelist it), you can use the following Python snippets.
To find your Public IP:
import requests

try:
    public_ip = requests.get("https://api.ipify.org").text
    print(f"Public IP: {public_ip}")
except requests.exceptions.RequestException as e:
    print(f"Could not retrieve Public IP: {e}")
To find your Private IP (useful for specific remote server setups):
import socket

def get_private_ip():
    try:
        # Open a UDP socket toward a public address to learn which local
        # interface the OS would route through; no packets are actually sent.
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        s.connect(("8.8.8.8", 80))
        private_ip = s.getsockname()[0]
        s.close()
        return private_ip
    except Exception as e:
        return f"Error: {e}"

print(f"Private IP: {get_private_ip()}")
6.1.3 Fetching the Dataset
The following script demonstrates how to securely authenticate and paginate through the DataCore API to retrieve the full dataset_historical_price dataset.
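A minimal sketch of such an authenticated, paginated download is given below. Note the assumptions: the `/query` path and the `page` and `apiKey` parameter names are modeled on the preview call shown earlier, not documented DataCore behavior, so consult the platform's API documentation for the actual scheme.

```python
import pandas as pd

# Base URL taken from the preview example above; the "/query" suffix is an
# assumption, not a documented DataCore endpoint.
BASE_URL = "https://gateway.datacore.vn/data/ds"

def build_params(dataset_code, page, page_size=10000, api_key=None):
    """Assemble query parameters for one page of results.

    The "page" and "apiKey" names are hypothetical; the actual parameter
    names depend on the DataCore API specification.
    """
    params = {"dataSetCode": dataset_code, "page": page, "pageSize": page_size}
    if api_key is not None:
        params["apiKey"] = api_key
    return params

def fetch_all_pages(dataset_code, api_key, page_size=10000):
    """Request pages until an empty batch signals the end of the dataset."""
    import requests  # imported here so the helper above works without it
    frames, page = [], 1
    while True:
        resp = requests.get(
            f"{BASE_URL}/query",
            params=build_params(dataset_code, page, page_size, api_key),
        )
        resp.raise_for_status()
        payload = resp.json()["data"]
        if not payload["dataDetail"]:
            break
        frames.append(
            pd.DataFrame(payload["dataDetail"], columns=payload["fields"])
        )
        page += 1
    return pd.concat(frames, ignore_index=True)

# Usage (requires a valid key in the environment):
# import os
# prices = fetch_all_pages("dataset_historical_price",
#                          api_key=os.environ["datacore_api"])
```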
# Convert the date column to proper datetime objects
prices["date"] = pd.to_datetime(prices["date"])
# Ensure price and ratio columns are numeric before calculation
prices["close_price"] = pd.to_numeric(prices["close_price"])
prices["adj_ratio"] = pd.to_numeric(prices["adj_ratio"])
# Calculate the adjusted close price
prices["adjusted_close"] = prices["close_price"] * prices["adj_ratio"]
# Rename columns to match standard conventions
prices = prices.rename(
    columns={
        "vol_total": "volume",
        "open_price": "open",
        "low_price": "low",
        "high_price": "high",
        "close_price": "close",
    }
)
# Sort the dataset logically by symbol and date
prices = prices.sort_values(["symbol", "date"])
print("Data manipulation complete. The dataset is ready for analysis.")
Data manipulation complete. The dataset is ready for analysis.
Adjusted closing prices incorporate mechanical changes due to corporate actions such as cash dividends and stock splits. Using adjusted prices ensures that subsequent return calculations reflect investor-relevant performance rather than accounting artifacts.
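A toy example (hypothetical numbers, assuming a backward-adjustment convention in which pre-split ratios are below one) shows why this matters: a 2-for-1 split halves the raw close, and a return computed on raw prices would report a spurious 50 percent loss.

```python
import pandas as pd

# Hypothetical two-day history around a 2-for-1 split: the raw close halves
# on the split day, while the adjustment ratio of 0.5 on the pre-split day
# makes the adjusted series smooth.
toy = pd.DataFrame({
    "close_price": [100.0, 50.0],   # pre-split, post-split
    "adj_ratio":   [0.5, 1.0],
})
toy["adjusted_close"] = toy["close_price"] * toy["adj_ratio"]

raw_ret = toy["close_price"].pct_change().iloc[-1]     # spurious -50%
adj_ret = toy["adjusted_close"].pct_change().iloc[-1]  # 0%, as intended
```

The actual convention (backward vs. forward adjustment) depends on the data provider, but the principle is the same: the adjusted series should be free of mechanical jumps.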
6.2 Examining a Single Equity
To ground the discussion, we isolate the trading history of a single large-cap stock, FPT, over a long sample period.
import datetime as dt
start = pd.Timestamp("2000-01-01")
end = pd.Timestamp(dt.date.today().year - 1, 12, 31)
fpt = prices.loc[
    (prices["symbol"] == "FPT")
    & (prices["date"] >= start)
    & (prices["date"] <= end),
    ["date", "symbol", "volume", "open", "low", "high", "close", "adjusted_close"],
].copy()
This subset contains the standard daily market variables required for most empirical studies. Before computing returns, it is good practice to visually inspect the price series.
from plotnine import ggplot, aes, geom_line, labs

(
    ggplot(fpt, aes(x="date", y="adjusted_close"))
    + geom_line()
    + labs(title="Adjusted price path of FPT", x="", y="")
)
6.3 From Prices to Returns
Most empirical asset pricing models are formulated in terms of returns rather than price levels. The simple daily return is defined as
\[ r_t = \frac{p_t}{p_{t-1}} - 1, \]
where \(p_t\) denotes the adjusted closing price at the end of trading day \(t\).
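As a quick sanity check of the definition, `pct_change` reproduces this formula on a toy price series (hypothetical numbers):

```python
import pandas as pd

# Adjusted closes over three days: 100 -> 105 -> 84.
p = pd.Series([100.0, 105.0, 84.0])
r = p.pct_change()
# r[1] = 105/100 - 1 = 0.05; r[2] = 84/105 - 1 = -0.20; r[0] is missing.
```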
Before computing returns, we must address invalid price observations. In Vietnamese equity data, adjusted prices occasionally take the value zero. These entries typically arise from IPO placeholders, trading suspensions, or historical backfilling conventions and cannot be used to compute meaningful returns.
prices.loc[prices["adjusted_close"] <= 0, ["symbol", "date", "adjusted_close"]].head()

| | symbol | date | adjusted_close |
|---|---|---|---|
| 33886 | ADP | 2010-02-09 | 0.0 |
| 33887 | ADP | 2010-02-24 | 0.0 |
| 33888 | ADP | 2010-03-01 | 0.0 |
| 33889 | ADP | 2010-03-03 | 0.0 |
| 33890 | ADP | 2010-03-12 | 0.0 |
We therefore exclude non-positive adjusted prices and compute returns stock by stock. Correct chronological ordering is essential.
returns = (
    prices
    .loc[prices["adjusted_close"] > 0]
    .sort_values(["symbol", "date"])
    .assign(ret=lambda x: x.groupby("symbol")["adjusted_close"].pct_change())
    [["symbol", "date", "ret"]]
)

returns = returns.dropna(subset=["ret"])
The initial return for each stock is missing by construction, since no lagged price is available. These observations are mechanical and can safely be removed in most applications.
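A toy panel (hypothetical numbers) confirms that the group-wise `pct_change` leaves the first return of every symbol missing, not just the first row of the frame:

```python
import pandas as pd

# Two symbols with two observations each.
toy = pd.DataFrame({
    "symbol": ["A", "A", "B", "B"],
    "adjusted_close": [10.0, 11.0, 20.0, 19.0],
})
toy["ret"] = toy.groupby("symbol")["adjusted_close"].pct_change()
# The first observation of each symbol has a missing return.
```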
6.4 Limiting the Influence of Extreme Returns
Daily return series often contain extreme observations driven by data errors, thin trading, or abrupt price adjustments. A common approach is to winsorize returns using cross-sectional percentile cutoffs.
def winsorize_cs(df, column="ret", lower_q=0.01, upper_q=0.99):
    lo = df[column].quantile(lower_q)
    hi = df[column].quantile(upper_q)
    out = df.copy()
    out[column] = out[column].clip(lo, hi)
    return out

returns = winsorize_cs(returns)
Applying winsorization across the full cross-section limits the impact of extreme market-wide observations while preserving relative differences between firms. Winsorizing within each stock is rarely appropriate in panel settings and can severely distort illiquid securities.
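The effect of the percentile clipping is easy to see on a small hypothetical series with one extreme observation:

```python
import pandas as pd

# Four ordinary returns plus one implausible +50% outlier.
s = pd.Series([-0.02, -0.01, 0.0, 0.01, 0.5])
lo, hi = s.quantile(0.01), s.quantile(0.99)
clipped = s.clip(lo, hi)
# The outlier is pulled down to the 99th-percentile cutoff, while values
# in the interior of the distribution are left untouched.
```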
6.5 Distributional Features of Returns
We next examine the empirical distribution of daily returns for FPT. The figure below also marks the historical 5 percent quantile, which provides a simple, non-parametric measure of downside risk.
from mizani.formatters import percent_format
from plotnine import geom_histogram, geom_vline, scale_x_continuous

fpt_ret = returns.loc[returns["symbol"] == "FPT"].copy()
q05 = fpt_ret["ret"].quantile(0.05)

(
    ggplot(fpt_ret, aes(x="ret"))
    + geom_histogram(bins=100)
    + geom_vline(xintercept=q05, linetype="dashed")
    + scale_x_continuous(labels=percent_format())
    + labs(title="Distribution of daily FPT returns", x="", y="")
)
Summary statistics offer a compact description of return behavior and should always be inspected before formal modeling.
returns["ret"].describe().round(3)
count    4305063.000
mean 0.000
std 0.035
min -0.125
25% -0.004
50% 0.000
75% 0.003
max 0.130
Name: ret, dtype: float64
Computing these statistics by calendar year can reveal periods of elevated volatility or structural change.
(
    returns
    .assign(year=lambda x: x["date"].dt.year)
    .groupby("year")["ret"]
    .describe()
    .round(3)
)

| year | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| 2010 | 131548.0 | -0.001 | 0.036 | -0.125 | -0.021 | 0.0 | 0.018 | 0.13 |
| 2011 | 166826.0 | -0.003 | 0.033 | -0.125 | -0.020 | 0.0 | 0.011 | 0.13 |
| 2012 | 177938.0 | 0.000 | 0.033 | -0.125 | -0.012 | 0.0 | 0.015 | 0.13 |
| 2013 | 180417.0 | 0.001 | 0.033 | -0.125 | -0.004 | 0.0 | 0.008 | 0.13 |
| 2014 | 181907.0 | 0.001 | 0.034 | -0.125 | -0.008 | 0.0 | 0.011 | 0.13 |
| 2015 | 197881.0 | 0.000 | 0.033 | -0.125 | -0.006 | 0.0 | 0.005 | 0.13 |
| 2016 | 227896.0 | 0.000 | 0.035 | -0.125 | -0.005 | 0.0 | 0.003 | 0.13 |
| 2017 | 283642.0 | 0.001 | 0.034 | -0.125 | -0.002 | 0.0 | 0.001 | 0.13 |
| 2018 | 329887.0 | 0.000 | 0.035 | -0.125 | 0.000 | 0.0 | 0.000 | 0.13 |
| 2019 | 352754.0 | 0.000 | 0.033 | -0.125 | 0.000 | 0.0 | 0.000 | 0.13 |
| 2020 | 369367.0 | 0.001 | 0.035 | -0.125 | 0.000 | 0.0 | 0.000 | 0.13 |
| 2021 | 379415.0 | 0.002 | 0.038 | -0.125 | -0.005 | 0.0 | 0.007 | 0.13 |
| 2022 | 387050.0 | -0.001 | 0.038 | -0.125 | -0.008 | 0.0 | 0.004 | 0.13 |
| 2023 | 391605.0 | 0.001 | 0.034 | -0.125 | -0.002 | 0.0 | 0.002 | 0.13 |
| 2024 | 400379.0 | 0.000 | 0.031 | -0.125 | -0.002 | 0.0 | 0.000 | 0.13 |
| 2025 | 146551.0 | 0.000 | 0.037 | -0.125 | -0.004 | 0.0 | 0.002 | 0.13 |
6.6 Expanding to a Market Cross-Section
The same procedures apply naturally to a larger universe of stocks. We now restrict attention to the constituents of the VN30 index.
vn30 = [
    "ACB","BCM","BID","BVH","CTG","FPT","GAS","GVR","HDB","HPG",
    "MBB","MSN","MWG","PLX","POW","SAB","SHB","SSB","STB","TCB",
    "TPB","VCB","VHM","VIB","VIC","VJC","VNM","VPB","VRE","EIB",
]
prices_vn30 = prices.loc[prices["symbol"].isin(vn30)]
from plotnine import theme

(
    ggplot(prices_vn30, aes(x="date", y="adjusted_close", color="symbol"))
    + geom_line()
    + labs(title="Adjusted prices of VN30 constituents", x="", y="")
    + theme(legend_position="none")
)
Returns for the VN30 universe are computed analogously.
returns_vn30 = (
    prices_vn30
    .sort_values(["symbol", "date"])
    .assign(ret=lambda x: x.groupby("symbol")["adjusted_close"].pct_change())
    [["symbol", "date", "ret"]]
    .dropna()
)

returns_vn30.groupby("symbol")["ret"].describe().round(3)

| symbol | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| ACB | 3822.0 | -0.000 | 0.023 | -0.407 | -0.006 | 0.0 | 0.007 | 0.097 |
| BCM | 1795.0 | 0.001 | 0.027 | -0.136 | -0.010 | 0.0 | 0.010 | 0.159 |
| BID | 2811.0 | 0.000 | 0.024 | -0.369 | -0.010 | 0.0 | 0.011 | 0.070 |
| BVH | 3825.0 | 0.000 | 0.024 | -0.097 | -0.012 | 0.0 | 0.012 | 0.070 |
| CTG | 3825.0 | 0.000 | 0.024 | -0.376 | -0.010 | 0.0 | 0.010 | 0.070 |
| EIB | 3825.0 | -0.000 | 0.022 | -0.302 | -0.008 | 0.0 | 0.008 | 0.070 |
| FPT | 3825.0 | -0.000 | 0.024 | -0.439 | -0.008 | 0.0 | 0.009 | 0.070 |
| GAS | 3236.0 | 0.000 | 0.022 | -0.289 | -0.009 | 0.0 | 0.010 | 0.070 |
| GVR | 1775.0 | 0.001 | 0.030 | -0.137 | -0.014 | 0.0 | 0.016 | 0.169 |
| HDB | 1828.0 | -0.001 | 0.028 | -0.391 | -0.009 | 0.0 | 0.010 | 0.070 |
| HPG | 3825.0 | -0.001 | 0.032 | -0.581 | -0.010 | 0.0 | 0.011 | 0.070 |
| MBB | 3371.0 | -0.000 | 0.023 | -0.473 | -0.008 | 0.0 | 0.008 | 0.069 |
| MSN | 3825.0 | 0.000 | 0.024 | -0.553 | -0.010 | 0.0 | 0.010 | 0.070 |
| MWG | 2701.0 | -0.000 | 0.035 | -0.751 | -0.009 | 0.0 | 0.011 | 0.070 |
| PLX | 2009.0 | -0.000 | 0.021 | -0.140 | -0.010 | 0.0 | 0.010 | 0.070 |
| POW | 1784.0 | 0.000 | 0.023 | -0.071 | -0.012 | 0.0 | 0.011 | 0.102 |
| SAB | 2100.0 | -0.000 | 0.024 | -0.745 | -0.008 | 0.0 | 0.007 | 0.070 |
| SHB | 3824.0 | -0.000 | 0.028 | -0.338 | -0.013 | 0.0 | 0.013 | 0.100 |
| SSB | 1029.0 | -0.000 | 0.023 | -0.292 | -0.005 | 0.0 | 0.004 | 0.070 |
| STB | 3825.0 | 0.000 | 0.024 | -0.321 | -0.010 | 0.0 | 0.010 | 0.070 |
| TCB | 1732.0 | -0.000 | 0.035 | -0.884 | -0.009 | 0.0 | 0.010 | 0.070 |
| TPB | 1761.0 | -0.001 | 0.029 | -0.477 | -0.009 | 0.0 | 0.009 | 0.070 |
| VCB | 3825.0 | -0.000 | 0.024 | -0.539 | -0.009 | 0.0 | 0.009 | 0.070 |
| VHM | 1744.0 | -0.000 | 0.024 | -0.419 | -0.009 | 0.0 | 0.008 | 0.070 |
| VIB | 2072.0 | -0.000 | 0.031 | -0.489 | -0.009 | 0.0 | 0.010 | 0.109 |
| VIC | 3825.0 | -0.000 | 0.027 | -0.673 | -0.008 | 0.0 | 0.008 | 0.070 |
| VJC | 2046.0 | -0.000 | 0.020 | -0.455 | -0.007 | 0.0 | 0.006 | 0.070 |
| VNM | 3825.0 | -0.000 | 0.023 | -0.547 | -0.007 | 0.0 | 0.007 | 0.070 |
| VPB | 1927.0 | -0.000 | 0.033 | -0.678 | -0.010 | 0.0 | 0.010 | 0.070 |
| VRE | 1871.0 | -0.000 | 0.024 | -0.295 | -0.012 | 0.0 | 0.011 | 0.070 |
6.7 Aggregating Returns Across Time
Financial variables are observed at different frequencies. While equity prices are recorded daily, many empirical questions require monthly or annual returns. Lower-frequency returns are constructed by compounding higher-frequency observations.
returns_monthly = (
    returns_vn30
    .assign(month=lambda x: x["date"].dt.to_period("M").dt.to_timestamp())
    .groupby(["symbol", "month"], as_index=False)
    .agg(ret=("ret", lambda x: np.prod(1 + x) - 1))
)
Comparing daily and monthly return distributions illustrates how aggregation dampens volatility and alters tail behavior.
from plotnine import facet_wrap

fpt_d = returns_vn30.loc[returns_vn30["symbol"] == "FPT"].assign(freq="Daily")
fpt_m = returns_monthly.loc[returns_monthly["symbol"] == "FPT"].assign(freq="Monthly")
fpt_both = pd.concat([
    fpt_d[["ret", "freq"]],
    fpt_m[["ret", "freq"]],
])

(
    ggplot(fpt_both, aes(x="ret"))
    + geom_histogram(bins=50)
    + scale_x_continuous(labels=percent_format())
    + labs(title="FPT returns at different frequencies", x="", y="")
    + facet_wrap("freq", scales="free")
)
6.8 Aggregation Across Firms: Trading Activity
Aggregation is not limited to time. In some settings, it is informative to aggregate variables across firms. As an illustration, we compute total daily trading value for VN30 stocks by multiplying share volume by adjusted prices and summing across firms.
trading_value = (
    prices_vn30
    .assign(value=lambda x: x["volume"] * x["adjusted_close"] / 1e9)
    .groupby("date")["value"]
    .sum()
    .reset_index()
    .assign(value_lag=lambda x: x["value"].shift(1))
)
(
    ggplot(trading_value, aes(x="date", y="value"))
    + geom_line()
    + labs(title="Aggregate VN30 trading value (billion VND)", x="", y="")
)
Finally, we assess persistence in trading activity by comparing trading value on consecutive days.
from plotnine import geom_point, geom_abline

(
    ggplot(trading_value, aes(x="value_lag", y="value"))
    + geom_point()
    + geom_abline(intercept=0, slope=1, linetype="dashed")
    + labs(
        title="Persistence in VN30 trading value",
        x="Previous day",
        y="Current day",
    )
)
The first day of the sample is dropped from the scatter because its lagged value is missing. A strong alignment along the 45-degree line indicates that high-activity trading days tend to be followed by similarly active days, a common empirical regularity in equity markets.
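As a numeric complement to the scatter plot, first-order autocorrelation summarizes this persistence in a single statistic, and pandas computes it directly via `Series.autocorr`. A toy illustration with hypothetical values:

```python
import pandas as pd

# A slowly-moving hypothetical activity series exhibits strong positive
# lag-1 autocorrelation, the numeric counterpart of clustering along the
# 45-degree line.
v = pd.Series([1.0, 1.1, 1.2, 1.3, 1.2, 1.1, 1.0, 0.9, 1.0, 1.1])
rho = v.autocorr(lag=1)
```

Applied to the `value` column of `trading_value` above, the same call gives the empirical persistence of aggregate VN30 trading activity.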
6.9 Summary
This chapter established a reproducible workflow for transforming raw price data into return series, diagnosing common data issues, and aggregating information across time and firms. These steps provide the empirical foundation for subsequent analyses of risk, return predictability, and market dynamics in Vietnam’s equity market.