11 Value and Bivariate Sorts

In this chapter, we extend the univariate portfolio analysis of Univariate Portfolio Sorts to bivariate portfolio sorting, in which stocks are assigned to portfolios based on two characteristics. Bivariate sorts are commonly used in the academic asset pricing literature and underpin the factors in the Fama-French three-factor model. However, some scholars also use sorts with three grouping variables. Conceptually, portfolio sorts are easily applicable in higher dimensions.

We form portfolios on firm size and the book-to-market ratio. Calculating book-to-market ratios requires accounting data, which necessitates additional steps during portfolio formation. In the end, we demonstrate how to form portfolios on two sorting variables using so-called independent and dependent portfolio sorts.

import pandas as pd
import numpy as np
import datetime as dt
import sqlite3

11.1 Data Preparation

First, we load the necessary data from our SQLite database introduced in Accessing and Managing Financial Data and DataCore Data. We conduct portfolio sorts based on our sample but keep only the necessary columns in our memory. We use the same data sources for firm size as in Size Sorts and P-Hacking.

tidy_finance = sqlite3.connect(database="data/tidy_finance_python.sqlite")

prices_monthly = (pd.read_sql_query(
    sql=("SELECT symbol, date, ret_excess, mktcap, " 
         "mktcap_lag, exchange FROM prices_monthly"),
    con=tidy_finance,
    parse_dates={"date"})
  .dropna()
)

Further, we utilize accounting data. We only need book equity data in this application, which we select from our database. Additionally, we convert the variable datadate to its monthly value, as we only consider monthly returns here and do not need to account for the exact date.

book_equity = (pd.read_sql_query(
    sql="SELECT symbol, datadate, be FROM datacore",
    con=tidy_finance, 
    parse_dates={"datadate"})
  .dropna()
  .assign(
    date=lambda x: (
      pd.to_datetime(x["datadate"]).dt.to_period("M").dt.to_timestamp()
    )
  )
)
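
To make the month conversion concrete, the following minimal sketch applies the same to_period("M").dt.to_timestamp() chain to two made-up fiscal-year-end dates. The frame toy and its values are assumptions for illustration only and are not taken from the database.

```python
import pandas as pd

# Hypothetical fiscal-year-end dates; symbols and values are illustrative only.
toy = pd.DataFrame({
    "symbol": ["AAA", "BBB"],
    "datadate": pd.to_datetime(["2020-12-31", "2021-03-31"]),
})

# Map each exact date to the first day of its month, so that it can be
# matched to monthly observations that are also stamped at month start.
toy["date"] = toy["datadate"].dt.to_period("M").dt.to_timestamp()

print(toy)
```

Both dates end up on the first day of their respective month, which is the granularity used for all subsequent merges.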

11.2 Book-to-Market Ratio

A fundamental problem in handling accounting data is the look-ahead bias: when forming a portfolio, we must not use data that was not yet publicly available at the time. Of course, researchers have more information when looking into the past than agents actually had at that moment. However, abnormal excess returns from a trading strategy should not rely on an information advantage because the differential cannot be the result of informed agents’ trades. Hence, we have to lag accounting information.

As in the previous chapter, we continue to lag firm size by one month. Then, we compute the book-to-market ratio, which relates a firm’s book equity to its market equity. Firms with high (low) book-to-market ratio are called value (growth) firms. After matching the accounting and market equity information from the same month, we lag book-to-market by six months. This is a sufficiently conservative approach because accounting information is usually released well before six months pass. However, in the asset pricing literature, even longer lags are used as well.

Having both variables, i.e., firm size lagged by one month and book-to-market lagged by six months, we merge these sorting variables to our returns using the sorting_date column created for this purpose. The final step in our data preparation deals with differences in the frequency of our variables. Returns and firm size are recorded monthly, whereas the accounting information is only released on an annual basis. Hence, we only match book-to-market to one month per year and have eleven empty observations. To solve this frequency issue, we carry the latest book-to-market ratio of each firm forward to the subsequent months, i.e., we fill the missing observations with the most recent report. This is done by forward-filling (ffill) after sorting by firm (which we identify by symbol) and date, and on a firm basis (which we do by .groupby() as usual). We then filter out all observations whose accounting information is no longer current, i.e., whose accounting_date lies twelve months or more before the return month. As the last step, we remove all rows with missing entries because the returns cannot be matched to any annual report.

size = (prices_monthly
  .assign(sorting_date=lambda x: x["date"]+pd.DateOffset(months=1))
  .rename(columns={"mktcap": "size"})
  .get(["symbol", "sorting_date", "size"])
)
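
The one-month lag works by shifting the date forward and then merging on the shifted date. A minimal sketch on a made-up three-month panel (the frame toy_prices and its numbers are assumptions for illustration) shows that each return month is matched with the previous month's market capitalization:

```python
import pandas as pd

# Toy monthly panel for one hypothetical firm; values are illustrative only.
toy_prices = pd.DataFrame({
    "symbol": ["AAA"] * 3,
    "date": pd.to_datetime(["2021-01-01", "2021-02-01", "2021-03-01"]),
    "mktcap": [100.0, 110.0, 121.0],
})

# Shift the date one month forward: the January value becomes usable
# for sorting in February, and so on.
toy_size = (toy_prices
  .assign(sorting_date=lambda x: x["date"] + pd.DateOffset(months=1))
  .rename(columns={"mktcap": "size"})
  .get(["symbol", "sorting_date", "size"])
)

# Merge the lagged size back onto the return months.
toy_merged = toy_prices.merge(
    toy_size,
    how="left",
    left_on=["symbol", "date"],
    right_on=["symbol", "sorting_date"],
)

print(toy_merged[["symbol", "date", "mktcap", "size"]])
```

The first month has no lagged value and therefore a missing size, which is why rows with missing entries are dropped at the end of the data preparation.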

# Book-to-market: book equity over market equity from the same month,
# made available for sorting six months later.
bm = (book_equity
  .merge(prices_monthly, how="inner", on=["symbol", "date"])
  .assign(bm=lambda x: x["be"]/x["mktcap"],
          sorting_date=lambda x: x["date"]+pd.DateOffset(months=6))
  .assign(accounting_date=lambda x: x["sorting_date"])
  .get(["symbol", "sorting_date", "accounting_date", "bm"])
)

# Attach both lagged sorting variables to the monthly returns.
data_for_sorts = (prices_monthly
  .merge(bm, 
         how="left", 
         left_on=["symbol", "date"], 
         right_on=["symbol", "sorting_date"])
  .merge(size, 
         how="left", 
         left_on=["symbol", "date"], 
         right_on=["symbol", "sorting_date"])
  .get(["symbol", "date", "ret_excess", 
        "mktcap_lag", "size", "bm", "exchange", "accounting_date"])
)

# Carry the latest report forward within each firm, then drop stale
# accounting data and rows with missing entries.
data_for_sorts = (data_for_sorts
  .sort_values(by=["symbol", "date"])
  .assign(
    bm=lambda x: x.groupby("symbol")["bm"].ffill(),
    accounting_date=lambda x: x.groupby("symbol")["accounting_date"].ffill()
  )
  .assign(threshold_date=lambda x: (x["date"]-pd.DateOffset(months=12)))
  .query("accounting_date > threshold_date")
  .drop(columns=["accounting_date", "threshold_date"])
  .dropna()
)
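
The interplay of forward-filling and the staleness filter is easiest to see on a toy example. The sketch below assumes a single hypothetical firm with one report that becomes usable in January 2020 and no later report; the frame toy and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# One hypothetical firm, 15 monthly rows, a single usable report in month 1.
n = 15
dates = pd.date_range("2020-01-01", periods=n, freq="MS")
toy = pd.DataFrame({
    "symbol": ["AAA"] * n,
    "date": dates,
    "bm": [0.8] + [np.nan] * (n - 1),
    "accounting_date": [dates[0]] + [pd.NaT] * (n - 1),
})

filled = (toy
  .sort_values(["symbol", "date"])
  # Carry the latest report forward within the firm.
  .assign(
    bm=lambda x: x.groupby("symbol")["bm"].ffill(),
    accounting_date=lambda x: x.groupby("symbol")["accounting_date"].ffill(),
  )
  # Keep a row only if its report is less than twelve months old.
  .assign(threshold_date=lambda x: x["date"] - pd.DateOffset(months=12))
  .query("accounting_date > threshold_date")
  .drop(columns=["accounting_date", "threshold_date"])
)

print(filled[["date", "bm"]])
```

In this toy panel, the ratio is carried through December 2020 and the rows from January 2021 onward are dropped, because by then the only available report is twelve months old.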

The last step of preparation for the portfolio sorts is the computation of breakpoints. We continue to use the same function, allowing for the specification of exchanges to be used for the breakpoints. Additionally, we reintroduce the argument sorting_variable into the function for defining different sorting variables.

def assign_portfolio(data, exchanges, sorting_variable, n_portfolios):
    """Assign portfolio for a given sorting variable."""
    
    breakpoints = (data
      .query(f"exchange in {exchanges}")
      .get(sorting_variable)
      .quantile(np.linspace(0, 1, num=n_portfolios+1), 
                interpolation="linear")
      .drop_duplicates()
    )
    breakpoints.iloc[0] = -np.inf
    breakpoints.iloc[breakpoints.size-1] = np.inf
    
    assigned_portfolios = pd.cut(
      data[sorting_variable],
      bins=breakpoints,
      labels=range(1, breakpoints.size),
      include_lowest=True,
      right=False
    )
    
    return assigned_portfolios
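
As a usage illustration, the sketch below repeats the function so that the block runs on its own and applies it to eight made-up stocks. The toy data (four large HOSE stocks, four small HNX stocks) is an assumption chosen to show how the choice of exchanges moves the breakpoint.

```python
import numpy as np
import pandas as pd

def assign_portfolio(data, exchanges, sorting_variable, n_portfolios):
    """Copy of the function above, repeated to keep this sketch self-contained."""
    breakpoints = (data
      .query(f"exchange in {exchanges}")
      .get(sorting_variable)
      .quantile(np.linspace(0, 1, num=n_portfolios+1),
                interpolation="linear")
      .drop_duplicates()
    )
    breakpoints.iloc[0] = -np.inf
    breakpoints.iloc[breakpoints.size-1] = np.inf
    return pd.cut(
      data[sorting_variable],
      bins=breakpoints,
      labels=range(1, breakpoints.size),
      include_lowest=True,
      right=False
    )

# Hypothetical cross-section for a single month; values are illustrative only.
toy = pd.DataFrame({
    "symbol": list("ABCDEFGH"),
    "exchange": ["HOSE"] * 4 + ["HNX"] * 4,
    "size": [100.0, 200.0, 300.0, 400.0, 10.0, 20.0, 30.0, 40.0],
})

# Breakpoint from HOSE stocks only versus from all stocks.
toy["portfolio_hose"] = assign_portfolio(toy, ["HOSE"], "size", 2)
toy["portfolio_all"] = assign_portfolio(toy, ["HOSE", "HNX"], "size", 2)

print(toy)
```

With the HOSE-only breakpoint, all four small HNX stocks fall into the bottom portfolio together with the two smaller HOSE stocks, so the two portfolios hold six and two stocks. With the breakpoint computed from all stocks, the split is four and four.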

After these data preparation steps, we present bivariate portfolio sorts on an independent and dependent basis.

11.3 Independent Sorts

Bivariate sorts create portfolios within a two-dimensional space spanned by two sorting variables. It is then possible to assess the return impact of either sorting variable by the return differential from a trading strategy that invests in the portfolios at either end of the respective variable's spectrum. We create a five-by-five matrix using book-to-market and firm size as sorting variables in our example below. We end up with 25 portfolios. Since we are interested in the value premium (i.e., the return differential between high and low book-to-market firms), we go long the five portfolios of the highest book-to-market firms and short the five portfolios of the lowest book-to-market firms. The five portfolios at each end are due to the size splits we employed alongside the book-to-market splits.

To implement the independent bivariate portfolio sort, we assign monthly portfolios for each of our sorting variables separately to create the variables portfolio_bm and portfolio_size, respectively. Each pair of values of these two variables then identifies one of the 25 combined portfolios, so we group by both of them rather than storing a separate combined label. After assigning the portfolios, we compute the average return within each portfolio for each month. Keeping portfolio_bm as its own column makes the computation of the value premium easier because we do not have to disaggregate a combined label in a separate step. Notice that we weight the stocks within each portfolio by their lagged market capitalization (mktcap_lag), i.e., we decide to value-weight our returns.

value_portfolios = (data_for_sorts
  .groupby("date")
  .apply(lambda x: x.assign(
      portfolio_bm=assign_portfolio(
        data=x, sorting_variable="bm", n_portfolios=5, exchanges=["HOSE"]
      ),
      portfolio_size=assign_portfolio(
        data=x, sorting_variable="size", n_portfolios=5, exchanges=["HOSE"]
      )
    )
  )
  .reset_index(drop=True)
  .groupby(["date", "portfolio_bm", "portfolio_size"])
  .apply(lambda x: pd.Series({
      "ret": np.average(x["ret_excess"], weights=x["mktcap_lag"])
    })
  )
  .reset_index()
)
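
To see what value-weighting does inside a single portfolio, consider the following minimal sketch with two made-up stocks. The frame toy_cell and its numbers are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Two hypothetical stocks in the same portfolio and month.
toy_cell = pd.DataFrame({
    "ret_excess": [0.10, -0.02],
    "mktcap_lag": [900.0, 100.0],
})

# Value-weighted return: weights proportional to lagged market capitalization.
ret_vw = np.average(toy_cell["ret_excess"], weights=toy_cell["mktcap_lag"])

# Equal-weighted return for comparison.
ret_ew = toy_cell["ret_excess"].mean()

print(round(ret_vw, 4), round(ret_ew, 4))
```

The large stock carries 90 percent of the weight, so the value-weighted return (0.088) sits much closer to its return than the equal-weighted return (0.04) does.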

Equipped with our monthly portfolio returns, we are ready to compute the value premium. However, we still have to decide how to invest in the five high and the five low book-to-market portfolios. The most common approach is to weight these portfolios equally, but this is yet another researcher’s choice. Then, we compute the return differential between the high and low book-to-market portfolios and show the average value premium.

value_premium = (value_portfolios
  .groupby(["date", "portfolio_bm"])
  .aggregate({"ret": "mean"})
  .reset_index()
  .groupby("date")
  .apply(lambda x: pd.Series({
    "value_premium": (
        x.loc[x["portfolio_bm"] == x["portfolio_bm"].max(), "ret"].mean() - 
          x.loc[x["portfolio_bm"] == x["portfolio_bm"].min(), "ret"].mean()
      )
    })
  )
  .aggregate({"value_premium": "mean"})
)
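
The two aggregation steps can be traced on a toy two-by-two sort for a single month. The sketch below uses made-up portfolio returns and an alternative formulation with unstack() instead of the nested apply() above; it is meant as an illustration of the arithmetic, not as a replacement for the code in this chapter.

```python
import pandas as pd

# Hypothetical returns of four portfolios (bm x size) in one month.
toy_portfolios = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-01"] * 4),
    "portfolio_bm": [1, 1, 2, 2],
    "portfolio_size": [1, 2, 1, 2],
    "ret": [0.01, 0.02, 0.04, 0.03],
})

# Step 1: equal-weight across size buckets within each bm portfolio.
by_bm = (toy_portfolios
  .groupby(["date", "portfolio_bm"])["ret"]
  .mean()
  .unstack("portfolio_bm")
)

# Step 2: long the highest, short the lowest bm portfolio in each month.
toy_premium = by_bm[by_bm.columns.max()] - by_bm[by_bm.columns.min()]

print(toy_premium)
```

Here the high book-to-market side averages 0.035 and the low side 0.015, so the toy premium for the month is 0.02.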

11.4 Dependent Sorts

In the previous exercise, we assigned the portfolios without considering the second variable in the assignment. This protocol is called independent portfolio sorts. The alternative, i.e., dependent sorts, creates portfolios for the second sorting variable within each bucket of the first sorting variable. In our example below, we sort firms into five size buckets, and within each of those buckets, we assign firms to five book-to-market portfolios. Hence, we have monthly breakpoints that are specific to each size group. The decision between independent and dependent portfolio sorts is another choice for the researcher. Notice that dependent sorts guarantee that portfolios have roughly equal numbers of stocks when breakpoints are computed from all exchanges. However, if breakpoints are based only on HOSE stocks, portfolio counts will generally be uneven — reflecting the large presence of small-cap stocks on HNX and UPCoM (see Exercise below).

To implement the dependent sorts, we first create the size portfolios by calling assign_portfolio() with sorting_variable="size". Then, we group our data again by month and by the size portfolio before assigning the book-to-market portfolio. The rest of the implementation is the same as before. Finally, we compute the value premium.

value_portfolios = (data_for_sorts
  .groupby("date")
  .apply(lambda x: x.assign(
      portfolio_size=assign_portfolio(
        data=x, sorting_variable="size", n_portfolios=5, exchanges=["HOSE"]
      )
    )
  )
  .reset_index(drop=True)
  .groupby(["date", "portfolio_size"])
  .apply(lambda x: x.assign(
      portfolio_bm=assign_portfolio(
        data=x, sorting_variable="bm", n_portfolios=5, exchanges=["HOSE"]
      )
    )
  )
  .reset_index(drop=True)
  .groupby(["date", "portfolio_bm", "portfolio_size"])
  .apply(lambda x: pd.Series({
      "ret": np.average(x["ret_excess"], weights=x["mktcap_lag"])
    })
  )
  .reset_index()
)

value_premium = (value_portfolios
  .groupby(["date", "portfolio_bm"])
  .aggregate({"ret": "mean"})
  .reset_index()
  .groupby("date")
  .apply(lambda x: pd.Series({
    "value_premium": (
        x.loc[x["portfolio_bm"] == x["portfolio_bm"].max(), "ret"].mean() -
          x.loc[x["portfolio_bm"] == x["portfolio_bm"].min(), "ret"].mean()
      )
    })
  )
  .aggregate({"value_premium": "mean"})
)
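
The difference between the two protocols shows up in the number of stocks per portfolio. The sketch below uses simulated data in which book-to-market is negatively correlated with size (an assumption made purely for illustration) and, as a simplification, computes breakpoints from all stocks with pd.qcut() rather than with assign_portfolio().

```python
import numpy as np
import pandas as pd

# Simulated cross-section; the correlation structure is an assumption.
rng = np.random.default_rng(0)
n = 1000
sim_size = rng.normal(size=n)
sim_bm = -0.8 * sim_size + 0.6 * rng.normal(size=n)
toy = pd.DataFrame({"size": sim_size, "bm": sim_bm})

# Independent sorts: each variable is split over the full cross-section.
toy["size_q"] = pd.qcut(toy["size"], 5, labels=False) + 1
toy["bm_indep"] = pd.qcut(toy["bm"], 5, labels=False) + 1

# Dependent sorts: bm is split within each size bucket.
toy["bm_dep"] = (toy
  .groupby("size_q")["bm"]
  .transform(lambda s: pd.qcut(s, 5, labels=False) + 1)
)

# Number of stocks in each of the 25 portfolios under both protocols.
counts_indep = pd.crosstab(toy["size_q"], toy["bm_indep"])
counts_dep = pd.crosstab(toy["size_q"], toy["bm_dep"])

print(counts_indep)
print(counts_dep)
```

Under the dependent sort, every portfolio holds exactly 40 of the 1,000 simulated stocks. Under the independent sort, the counts differ across portfolios because the two sorting variables are correlated.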

Overall, we show how to conduct bivariate portfolio sorts in this chapter. In one case, we sort the portfolios independently of each other. Yet we also discuss how to create dependent portfolio sorts. As in Size Sorts and P-Hacking, we see how many choices a researcher has to make to implement portfolio sorts, and bivariate sorts increase the number of choices.

11.5 Key Takeaways

Bivariate portfolio sorts assign stocks based on two characteristics, such as firm size and book-to-market ratio, to better capture return patterns in asset pricing.
Independent sorts treat each variable separately, while dependent sorts condition the second sort on the first.
Proper handling of accounting data, especially lagging the book-to-market ratio, is essential to avoid look-ahead bias and ensure valid backtesting.
Value premiums are derived by comparing returns of high versus low book-to-market portfolios, with results sensitive to sorting choices and weighting schemes.
