10  Size Sorts

In this chapter, we continue with portfolio sorts in a univariate setting, but now use firm size as the sorting variable, which gives rise to a well-known return factor: the size premium. The size premium arises from buying small stocks and selling large stocks. Prominently, Fama and French (1993) include it as a factor in their three-factor model. Apart from that, asset managers commonly use size as a key firm characteristic when making investment decisions.

We also introduce new choices in the formation of portfolios. In particular, we discuss listing exchanges, industries, weighting regimes, and periods. These choices matter for the portfolio returns and result in different size premiums (see Hasler (2021), Soebhag, Van Vliet, and Verwijmeren (2022), and Walter, Weber, and Weiss (2022) for more insights into decision nodes and their effect on premiums).

import pandas as pd
import numpy as np
import sqlite3

from plotnine import *
from mizani.formatters import percent_format
from itertools import product
from joblib import Parallel, delayed, cpu_count

10.1 Data Preparation

First, we retrieve the relevant data from our SQLite database introduced in Accessing and Managing Financial Data and DataCore Data. Firm size is defined as market equity in most asset pricing applications. We further use the Fama-French factor returns for performance evaluation.

tidy_finance = sqlite3.connect(
  database="data/tidy_finance_python.sqlite"
)

prices_monthly = pd.read_sql_query(
  sql="SELECT * FROM prices_monthly", 
  con=tidy_finance, 
  parse_dates={"date"}
)

factors_ff3_monthly = pd.read_sql_query(
  sql="SELECT * FROM factors_ff3_monthly", 
  con=tidy_finance, 
  parse_dates={"date"}
)

10.2 Size Distribution

Before we build our size portfolios, we investigate the distribution of the variable firm size. Visualizing the data is a valuable starting point to understand the input to the analysis. Figure 10.1 shows the fraction of total market capitalization concentrated in the largest firm. To produce this graph, we create monthly indicators that track whether a stock belongs to the largest x percent of the firms. Then, we aggregate the firms within each bucket and compute the buckets’ share of total market capitalization.

Figure 10.1 shows that the largest 1 percent of firms cover up to 50 percent of the total market capitalization, and holding just the 25 percent largest firms in the universe essentially replicates the market portfolio. The distribution of firm size thus implies that the largest firms of the market dominate many small firms whenever we use value-weighted benchmarks.

market_cap_concentration = (prices_monthly
  .dropna(subset=["mktcap"])
  .groupby("date")
  .apply(lambda x: pd.Series({
    "Largest 1%": x.loc[x["mktcap"] >= x["mktcap"].quantile(0.99), "mktcap"].sum() / x["mktcap"].sum(),
    "Largest 5%": x.loc[x["mktcap"] >= x["mktcap"].quantile(0.95), "mktcap"].sum() / x["mktcap"].sum(),
    "Largest 10%": x.loc[x["mktcap"] >= x["mktcap"].quantile(0.90), "mktcap"].sum() / x["mktcap"].sum(),
    "Largest 25%": x.loc[x["mktcap"] >= x["mktcap"].quantile(0.75), "mktcap"].sum() / x["mktcap"].sum()
  }), include_groups=False)
  .reset_index()
  .melt(id_vars="date", var_name="name", value_name="value")
)

market_cap_concentration_figure = (
  ggplot(
    market_cap_concentration, 
    aes(x="date", y="value", color="name", linetype="name")
  ) +
  geom_line() +
  scale_y_continuous(labels=percent_format()) +
  scale_x_date(name="", date_labels="%Y") +
  labs(
    x="", y="", color="", linetype="",
    title="Percentage of total market capitalization in largest stocks"
  ) +
  theme(legend_title=element_blank())
)
market_cap_concentration_figure.show()
Figure 10.1: Percentage of total market capitalization in the largest stocks. We report the aggregate market capitalization of all stocks that belong to the largest 1, 5, 10, and 25 percent of firms in the monthly cross-section, relative to the total market capitalization of all stocks during the month. The four lines are relatively stable over the sample period: the largest 1 percent of stocks comprise around 40 percent of total market capitalization on average, and the largest 25 percent around 90 percent.

Next, firm sizes also differ across listing exchanges. The primary listings of stocks were important in the past and are potentially still relevant today.

market_cap_share = (prices_monthly
  .groupby(["date", "exchange"])
  .aggregate({"mktcap": "sum"})
  .reset_index()
  .assign(
    share=lambda x: x["mktcap"]/x.groupby("date")["mktcap"].transform("sum")
  )
)

plot_market_cap_share = (
  ggplot(market_cap_share, 
         aes(x="date", y="share", 
             fill="exchange", color="exchange")) +
  geom_area(position="stack", stat="identity", alpha=0.5) +
  geom_line(position="stack") +
  scale_y_continuous(labels=percent_format()) +
  scale_x_date(name="", date_labels="%Y") +
  labs(x="", y="", fill="", color="",
       title="Share of total market capitalization per listing exchange") +
  theme(legend_title=element_blank())
)
plot_market_cap_share.show()
Figure 10.2: The figure shows the share of total market capitalization per listing exchange over time.

Finally, we consider the distribution of firm size across listing exchanges and create summary statistics. The function describe() does not include all statistics we are interested in, which is why we create the function compute_summary() that adds the standard deviation and the number of observations. Then, we apply it to the most current month of our data on each listing exchange. We also add a row with the overall summary statistics. In the following, we use this distinction to update our portfolio sort procedure.

def compute_summary(data, variable, filter_variable, percentiles):
    """Compute summary statistics for a given variable and percentiles."""
    
    summary = (data[[filter_variable, variable]]
      .groupby(filter_variable)
      .describe(percentiles=percentiles)
    ) 
    
    summary.columns = summary.columns.droplevel(0)
    
    summary_overall = (data[variable]
      .describe(percentiles=percentiles)
    )
    
    summary.loc["Overall", :] = summary_overall
    
    return summary.round(0)

compute_summary(
  prices_monthly[prices_monthly["date"] == prices_monthly["date"].max()],
  variable="mktcap",
  filter_variable="exchange",
  percentiles=[0.05, 0.5, 0.95]
)

10.3 Univariate Size Portfolios with Flexible Breakpoints

In Univariate Portfolio Sorts, we construct portfolios with a varying number of breakpoints and different sorting variables. Here, we extend the framework such that we compute breakpoints on a subset of the data, for instance, based on selected listing exchanges.

We introduce exchanges as an argument in our assign_portfolio() function from Univariate Portfolio Sorts. The argument enters the filter via data.query(f"exchange in {exchanges}"). For example, if exchanges=["HOSE"] is specified, only stocks listed on HOSE are used to compute the breakpoints. Alternatively, you could specify exchanges=["HOSE", "HNX", "UPCoM"], which keeps all stocks listed on any of these exchanges.

def assign_portfolio(data, exchanges, sorting_variable, n_portfolios):
    """Assign portfolio for a given sorting variable."""
    
    breakpoints = (data
      .query(f"exchange in {exchanges}")
      .get(sorting_variable)
      .quantile(np.linspace(0, 1, num=n_portfolios+1), 
                interpolation="linear")
      .drop_duplicates()
    )
    breakpoints.iloc[[0, -1]] = [-np.inf, np.inf]
    
    assigned_portfolios = pd.cut(
      data[sorting_variable],
      bins=breakpoints,
      labels=range(1, breakpoints.size),
      include_lowest=True,
      right=False
    )
    
    return assigned_portfolios
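To make the breakpoint logic concrete, here is a minimal, self-contained sketch (with hypothetical numbers, not our database) that mirrors what assign_portfolio() does for two portfolios when breakpoints come from HOSE stocks only:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature cross-section: six stocks on three exchanges.
demo = pd.DataFrame({
    "exchange": ["HOSE", "HOSE", "HOSE", "HNX", "HNX", "UPCoM"],
    "mktcap_lag": [100.0, 50.0, 10.0, 5.0, 2.0, 1.0],
})

# Breakpoints from HOSE stocks only: the median of (100, 50, 10) is 50.
breakpoints = (demo
  .query("exchange in ['HOSE']")
  .get("mktcap_lag")
  .quantile(np.linspace(0, 1, num=3), interpolation="linear")
  .drop_duplicates()
)
# Replace the outer breakpoints so no stock falls outside the bins.
breakpoints.iloc[[0, -1]] = [-np.inf, np.inf]

portfolios = pd.cut(
    demo["mktcap_lag"],
    bins=breakpoints,
    labels=range(1, breakpoints.size),
    include_lowest=True,
    right=False,
)
print(portfolios.tolist())  # [2, 2, 1, 1, 1, 1]
```

Because the median is computed from HOSE stocks only, every stock on the smaller exchanges falls into the bottom portfolio, which is exactly why the choice of breakpoint exchanges matters.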

10.4 Weighting Schemes for Portfolios

Apart from computing breakpoints on different samples, researchers often use different portfolio weighting schemes. So far, we weighted each portfolio constituent by its relative market equity of the previous period. This protocol is called value-weighting. The alternative protocol is equal-weighting, which assigns each stock’s return the same weight, i.e., a simple average of the constituents’ returns. Notice that equal-weighting is difficult in practice as the portfolio manager needs to rebalance the portfolio monthly, while value-weighting is a truly passive investment.

We implement the two weighting schemes in the function compute_portfolio_returns(), which takes a boolean argument value_weighted to weight the returns by firm value. Additionally, the long-short portfolio is long in the smallest firms and short in the largest firms, consistent with research showing that small firms outperform their larger counterparts. Apart from these two changes, the function is similar to the procedure in Univariate Portfolio Sorts.

def calculate_returns(data, value_weighted):
    """Calculate (value-weighted) returns."""
    
    if value_weighted:
      return np.average(data["ret_excess"], weights=data["mktcap_lag"])
    else:
      return data["ret_excess"].mean()
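As a quick illustration of how much the weighting scheme can matter, consider a hypothetical portfolio with one dominant firm (the numbers below are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical three-stock portfolio: one large firm, two small firms.
demo = pd.DataFrame({
    "ret_excess": [0.01, 0.10, 0.10],
    "mktcap_lag": [98.0, 1.0, 1.0],
})

# Value-weighted: the large firm dominates the portfolio return.
vw = np.average(demo["ret_excess"], weights=demo["mktcap_lag"])
# Equal-weighted: each stock contributes one third.
ew = demo["ret_excess"].mean()

print(round(vw, 4), round(ew, 4))  # 0.0118 0.07
```

The equal-weighted return is almost six times larger here because the two small (high-return) firms receive the same weight as the large firm, a pattern that foreshadows the larger equal-weighted size premiums below.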
          
def compute_portfolio_returns(n_portfolios=10, 
                              exchanges=["HOSE", "HNX", "UPCoM"], 
                              value_weighted=True, 
                              data=prices_monthly):
    """Compute (value-weighted) portfolio returns."""
    
    returns = (data
      .groupby("date")
      .apply(lambda x: x.assign(
        portfolio=assign_portfolio(x, exchanges, 
                                   "mktcap_lag", n_portfolios))
      )
      .reset_index(drop=True)
      .groupby(["portfolio", "date"])
      .apply(lambda x: x.assign(
        ret=calculate_returns(x, value_weighted))
      )
      .reset_index(drop=True)
      .groupby("date")
      .apply(lambda x: 
        pd.Series({"size_premium": x.loc[x["portfolio"].idxmin(), "ret"]-
          x.loc[x["portfolio"].idxmax(), "ret"]}))
      .reset_index(drop=True)
      .aggregate({"size_premium": "mean"})
    )
    
    return returns

To see how the function compute_portfolio_returns() works, we consider a simple median breakpoint example with value-weighted returns. We are interested in the effect of restricting listing exchanges on the estimation of the size premium. In the first function call, we compute returns based on breakpoints from all listing exchanges. Then, we compute returns based on breakpoints from HOSE-listed stocks only.

ret_all = compute_portfolio_returns(
  n_portfolios=2,
  exchanges=["HOSE", "HNX", "UPCoM"],
  value_weighted=True,
  data=prices_monthly
)

ret_HOSE = compute_portfolio_returns(
  n_portfolios=2,
  exchanges=["HOSE"],
  value_weighted=True,
  data=prices_monthly
)

data = pd.DataFrame([ret_all*100, ret_HOSE*100], 
                    index=["HOSE, HNX & UPCoM", "HOSE"])
data.columns = ["Premium"]
data.round(2)

10.5 P-Hacking and Non-Standard Errors

Since the choice of the listing exchange has a significant impact, the next step is to investigate the effect of other data processing decisions researchers have to make along the way. In particular, any portfolio sort analysis has to decide at least on the number of portfolios, the listing exchanges to form breakpoints, and equal- or value-weighting. Further, one may exclude firms that are active in the finance industry or restrict the analysis to some parts of the time series. All of the variations of these choices that we discuss here are part of scholarly articles published in the top finance journals. We refer to Walter, Weber, and Weiss (2022) for an extensive set of other decision nodes at the discretion of researchers.

The intention of this application is to show that the different ways to form portfolios result in different estimated size premiums. Despite the effects of this multitude of choices, there is no single correct way, and none of the procedures is outright wrong. The aim is simply to illustrate the variation that can arise from the evidence-generating process (Menkveld et al., n.d.). The term non-standard errors refers to this variation due to the (defensible) choices made by researchers. Interestingly, in a large-scale study, Menkveld et al. (n.d.) find that the magnitude of non-standard errors is similar to the estimation uncertainty based on a chosen model. This shows how consequential seemingly innocent choices in the data preparation and evaluation workflow can be. Moreover, this methodology-related uncertainty should arguably be embraced rather than hidden away.

From a malicious perspective, these modeling choices give the researcher multiple chances to find statistically significant results. This practice is known as p-hacking, and it renders statistical inference invalid due to multiple testing (Harvey, Liu, and Zhu 2016).
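A back-of-the-envelope calculation (a sketch, not part of the chapter's empirical analysis) shows why multiple testing matters here: with the 48 specifications we try below (3 portfolio counts, 2 exchange sets, 2 weighting schemes, 4 samples), even if every true premium were zero, the chance of at least one spuriously significant result at the 5 percent level would be large, and a Bonferroni correction shrinks the per-test threshold accordingly:

```python
def family_wise_error(m, alpha=0.05):
    """Probability of at least one false positive across m independent tests."""
    return 1 - (1 - alpha)**m

m = 3 * 2 * 2 * 4  # number of specifications in the p-hacking setup below
print(round(family_wise_error(m), 3))  # roughly 0.915
print(0.05 / m)  # Bonferroni-adjusted per-test significance level
```

In other words, running 48 nominally valid tests at the 5 percent level yields a false discovery with more than 90 percent probability, so a single significant specification proves very little.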

Nevertheless, the multitude of options creates a problem: there is no single correct way of sorting portfolios, so how can researchers convince readers that their results are not the outcome of a p-hacking exercise? To address this dilemma, academics are encouraged to present evidence from different sorting schemes as robustness tests and to report multiple approaches, showing that a result does not depend on a single choice. The robustness of premiums is thus a key feature.

Below, we conduct a series of robustness tests, which could also be interpreted as a p-hacking exercise. To do so, we examine the size premium across the different specifications collected in p_hacking_setup. The function itertools.product() produces all possible combinations of its arguments. Note that we vary the argument data to exclude financial firms and to truncate the time series.

n_portfolios = [2, 5, 10]
exchanges = [["HOSE"], ["HOSE", "HNX", "UPCoM"]]
value_weighted = [True, False]
data = [
  prices_monthly,
  prices_monthly[prices_monthly["industry"] != "Finance"],
  prices_monthly[prices_monthly["date"] < "1990-06-01"],
  prices_monthly[prices_monthly["date"] >= "1990-06-01"],
]
p_hacking_setup = list(
  product(n_portfolios, exchanges, value_weighted, data)
)

To speed up the computation, we parallelize the (many) different sorting procedures with joblib. Finally, we report the resulting size premiums in descending order. There are indeed substantial size premiums possible in our data, in particular when we use equal-weighted portfolios.

n_cores = cpu_count()-1
p_hacking_results = pd.concat(
  Parallel(n_jobs=n_cores)
  (delayed(compute_portfolio_returns)(x, y, z, w) 
   for x, y, z, w in p_hacking_setup)
)
p_hacking_results = (p_hacking_results
  .reset_index(name="size_premium")
  .sort_values("size_premium", ascending=False)
)

10.6 Size-Premium Variation

We provide a graph in Figure 10.3 that shows the different premiums. The figure also relates them to the average Fama-French SMB (small minus big) premium used in the literature, which we include as a dashed vertical line.

p_hacking_results_figure = (
  ggplot(
    p_hacking_results, 
    aes(x="size_premium")
  )
  + geom_histogram(bins=len(p_hacking_results))
  + scale_x_continuous(labels=percent_format())
  + labs(
      x="", y="",
      title="Distribution of size premiums for various sorting choices"
    )
  + geom_vline(
      aes(xintercept=factors_ff3_monthly["smb"].mean()), linetype="dashed"
    )
)
p_hacking_results_figure.show()
Figure 10.3: Distribution of size premiums for various sorting choices. The dashed vertical line indicates the average Fama-French SMB premium.

10.7 Key Takeaways

  • Firm size is a crucial factor in asset pricing, and sorting stocks by size reveals the size premium, where small-cap stocks tend to outperform large-cap stocks.
  • Portfolio returns are sensitive to research design choices like exchange filters, weighting schemes, and the number of portfolios—decisions that can meaningfully shift results.
  • Methodological flexibility can lead to non-standard errors and potential p-hacking.
  • Validate results by varying assumptions and show that findings hold across multiple specifications.