Tidy Asset Pricing - Part IV: The Value Premium

Update 2022-02-11: check out our new book Tidy Finance with R for an updated and improved implementation of the content below.

In this note, I explore the positive relation between the book-to-market (BM) ratio and expected stock returns – called the value premium – following Bali, Engle and Murray. In three earlier notes, I prepared the CRSP sample in R, estimated and analyzed market betas, and analyzed the relation between firm size and stock returns using the same data. I do not go into the theoretical foundations of the value premium, but rather focus on its estimation in a tidy manner. The usual disclaimer applies: the text below reflects my own opinion, is for information purposes only, and is not intended as investment advice. The code below replicates most of the results of Bali et al. up to a few basis points if I restrict my data to their sample period. If you spot any mistakes or have suggestions for a better implementation, just drop a comment below.

I mainly use the following packages throughout this note:

library(tidyverse)
library(lubridate)  # for working with dates
library(broom)      # for tidying up estimation results
library(kableExtra) # for nicer html tables

Calculating the Book-to-Market Ratio

The BM ratio is defined as the book value of a firm’s common equity (BE) divided by the market value of the firm’s equity (ME), where the book value comes from the firm’s balance sheet and the market value is equal to the market capitalization of the firm as provided in the CRSP data. To get the BE data, I download the CRSP/Compustat Merged (CCM) data from WRDS. Importantly, this data already contains links of balance sheet information to the stock identifiers of CRSP.

First, I extract the following variables from the CCM sample:

ccmfunda <- read_csv("raw/ccmfunda.csv")
colnames(ccmfunda) <- tolower(colnames(ccmfunda))

# select and parse relevant variables
compustat <- ccmfunda %>%
  transmute(
    permno = as.integer(lpermno),       # stock identifier
    datadate = ymd(datadate),           # fiscal year end date
    linktype = as.character(linktype),  # link type
    linkenddt = ymd(linkenddt),         # link end date
    seq = as.numeric(seq),              # stockholders' equity
    txdb = as.numeric(txdb),            # deferred taxes
    itcb = as.numeric(itcb),            # investment tax credit
    pstkrv = as.numeric(pstkrv),        # preferred stock redemption value
    pstkl = as.numeric(pstkl),          # preferred stock liquidating value
    pstk = as.numeric(pstk),            # preferred stock par value
    indfmt = as.character(indfmt),      # industry format
    datafmt = as.character(datafmt)     # data format
  )

Then, I apply a number of consistency checks and filters to the data. First, I need to make sure that only industry formats equal to "INDL" and data formats equal to "STD" are used.

compustat <- compustat %>%
  filter(indfmt == "INDL" & datafmt == "STD") %>%
  select(-c(indfmt, datafmt))

Second, I should only use links that are established by comparing CUSIP values in the Compustat and CRSP database ("LU") or links that have been researched and verified ("LC").

compustat <- compustat %>%
  filter(linktype %in% c("LU", "LC"))

Third, I verify that links are still active at the time a firm’s accounting information is computed.

compustat <- compustat %>%
  filter(datadate <= linkenddt | is.na(linkenddt))

Fourth, I make sure that there are no duplicate observations.

compustat <- compustat %>%
  distinct()

Fifth, I check that there is only one observation for each stock-date pair.

compustat %>%
  group_by(permno, datadate) %>%
  filter(n() > 1) %>%
  nrow() == 0

After these checks, I can proceed to calculate BE. The calculation is described in Bali et al. and translates to the following code, where bvps refers to the book value of preferred stock.

compustat <- compustat %>%
  mutate(bvps = case_when(!is.na(pstkrv) ~ pstkrv,
                          is.na(pstkrv) & !is.na(pstkl) ~ pstkl,
                          is.na(pstkrv) & is.na(pstkl) & !is.na(pstk) ~ pstk,
                          TRUE ~ as.numeric(0)),
         be = seq + txdb + itcb - bvps,
         be = if_else(be < 0, as.numeric(NA), be)) %>%
  select(permno, datadate, be)

To match the BE variable to the CRSP data and compute the BM ratio, I follow the procedure of Fama and French (1992), which is the standard approach in the literature. Fama and French compute BM at the end of each calendar year and use that information starting in June of the next calendar year until the following May.

Firms have different fiscal year ends that indicate the timing of the accounting information as reported in the column datadate. However, this date does not indicate when the accounting information was publicly available. Moreover, in some cases, firms change the month in which their fiscal year ends, resulting in multiple entries in the Compustat data per calendar year $$y$$. The Fama-French procedure first ensures that there is only one entry per calendar year $$y$$ by taking the last available information.

compustat <- compustat %>%
  mutate(year = year(datadate)) %>%
  group_by(permno, year) %>%
  filter(datadate == max(datadate)) %>%
  ungroup()

At this stage, I also drop rows with missing values in the BE column. These observations would only lead to missing values in the upcoming matching procedure anyway.

compustat <- compustat %>%
  na.omit()
nrow(compustat)

Fama and French assume that data from the calendar year $$y$$ is not known until the end of June of year $$y+1$$. I hence define the corresponding reference date for each observation.

compustat <- compustat %>%
  mutate(reference_date = ymd(paste0(year + 1, "-06-01"))) %>%
  select(-year)

For each return observation in year $$y+1$$, the relevant accounting information is either still from the end of year $$y-1$$ until May, or computed at the end of year $$y$$ starting from June. I match the BE variable following this rationale in the next code chunk.
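Before running the actual matching, a quick sanity check of this date logic with two made-up return months; the helper below is a dependency-free base-R restatement of the mapping described above, not part of the actual pipeline.

```r
# base-R sketch of the reference-date logic: accounting data for fiscal year y
# is assumed unknown until June of year y + 1
reference_date <- function(date) {
  y <- as.integer(format(date, "%Y"))
  m <- as.integer(format(date, "%m"))
  as.Date(ifelse(m < 6, paste0(y - 1, "-06-01"), paste0(y, "-06-01")))
}

reference_date(as.Date("1996-05-01"))  # May 1996 still uses the 1995-06-01 reference date
reference_date(as.Date("1996-06-15"))  # from June 1996 on, 1996-06-01 applies
```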

crsp <- read_rds("data/crsp.rds")

crsp <- crsp %>%
  mutate(year = year(date),
         month = month(date),
         reference_date = ymd(if_else(month < 6,
                                      paste0(year - 1, "-06-01"),
                                      paste0(year, "-06-01"))))

crsp <- crsp %>%
  left_join(compustat, by = c("permno", "reference_date"))

As mentioned above, the ME variable is determined at the end of each calendar year, but it is used to calculate BM starting in June of the following year.

crsp_me <- crsp %>%
  filter(month(date_start) == 12) %>%
  mutate(reference_date = ymd(paste0(year(date_start) + 1, "-06-01"))) %>%
  select(permno, reference_date, me = mktcap)

crsp <- crsp %>%
  left_join(crsp_me, by = c("permno", "reference_date"))

Next, I also add an estimate for each stock's beta, which I calculated in this note. I use the beta based on daily data with a 12-month estimation window, as is common in the literature according to Bali et al.

betas <- read_rds("data/betas_daily.rds")
crsp <- crsp %>%
  left_join(betas %>%
              select(permno, date_start, beta = beta_12m),
            by = c("permno", "date_start"))

Finally, I can calculate the BM ratio and define additional variables that later enter the regression analysis.

crsp <- crsp %>%
  mutate(bm = be / me,
         log_bm = if_else(bm == 0, as.numeric(NA), log(bm)),
         size = log(mktcap))

The earliest date in my Compustat sample is in October 1960, while CRSP already starts in 1926. It is thus interesting to look at the coverage of the variables of interest over the full sample period.

crsp %>%
  group_by(date) %>%
  summarize(share_with_me = sum(!is.na(me)) / n() * 100,
            share_with_be = sum(!is.na(be)) / n() * 100,
            share_with_bm = sum(!is.na(bm)) / n() * 100) %>%
  ggplot(aes(x = date)) +
  geom_line(aes(y = share_with_me, color = "Market Equity")) +
  geom_line(aes(y = share_with_be, color = "Book Equity")) +
  geom_line(aes(y = share_with_bm, color = "Book-to-Market")) +
  labs(x = "", y = "Share of Stocks with Information (in %)", color = "Variable") +
  scale_x_date(expand = c(0, 0), date_breaks = "10 years", date_labels = "%Y") +
  theme_classic()

There seem to be a few periods in which market and book equity data are temporarily unavailable for a large share of firms. However, I currently have no explanation for this pattern.

I follow the convention outlined in Bali et al. and start the sample in June 1963. Apart from the lack of accounting information, Bali et al. also mention that AMEX stocks were only added to CRSP in July 1962 as a further reason (see this post).

crsp <- crsp %>%
  filter(date >= "1963-06-01")

Summary Statistics

I proceed to present summary statistics for the different measures I use in the following analyses. I first compute summary statistics for each month and then average them across all months. Note that I call the moments package to compute skewness and kurtosis. Since I repeatedly use a similar procedure, I define the following function, which uses the dot-dot-dot argument and the curly-curly operator to compute summary statistics for arbitrary columns grouped by some variable.

summary_statistics <- function(data, ..., by) {
  data %>%
    select({{ by }}, ...) %>%
    pivot_longer(cols = -{{ by }}, names_to = "measure") %>%
    na.omit() %>%
    group_by(measure, {{ by }}) %>%
    summarize(mean = mean(value),
              sd = sd(value),
              skew = moments::skewness(value),
              kurt = moments::kurtosis(value),
              min = min(value),
              q05 = quantile(value, 0.05),
              q25 = quantile(value, 0.25),
              q50 = quantile(value, 0.50),
              q75 = quantile(value, 0.75),
              q95 = quantile(value, 0.95),
              max = max(value),
              n = n()) %>%
    ungroup()
}

The following table provides the time series averages of summary statistics for all observations with a non-missing BM.

crsp_ts <- crsp %>%
  filter(!is.na(bm)) %>%
  summary_statistics(bm, log_bm, be, me, mktcap, by = date)

crsp_ts %>%
  select(-date) %>%
  group_by(measure) %>%
  summarize_all(list(mean)) %>%
  kable(digits = 2) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
measure mean sd skew kurt min q05 q25 q50 q75 q95 max n
be 856.15 4097.58 16.99 453.04 0.10 4.94 27.75 98.09 384.36 3227.64 99261.72 3779.89
bm 0.93 1.30 13.48 476.48 0.01 0.15 0.41 0.72 1.14 2.25 45.29 3779.89
log_bm -0.48 0.87 -0.73 5.72 -6.41 -2.00 -0.95 -0.38 0.07 0.75 3.30 3779.84
me 1801.58 8290.59 15.38 349.23 0.88 8.62 46.20 187.48 781.51 6892.07 212293.66 3779.89
mktcap 1941.59 8921.13 15.16 338.47 0.71 8.21 46.79 198.64 853.28 7512.70 229728.20 3766.14

BM, BE and ME all exhibit a high positive skewness which justifies the log transformation in the regression analysis later. The average and median of BM are below one which indicates that the book equity values tend to be smaller than market equity valuations.
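A small simulation illustrates why the log transformation helps here. The numbers are made up, and a manual skewness helper stands in for moments::skewness to keep the sketch dependency-free.

```r
# manual skewness helper (moments::skewness is used in the post itself)
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3

# simulated right-skewed "BM-like" variable
set.seed(123)
bm_sim <- exp(rnorm(10000, mean = -0.5, sd = 0.9))

skew(bm_sim)       # strongly positive, like BM in the table above
skew(log(bm_sim))  # roughly symmetric after the log transformation
```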

Correlations

I now proceed to examine the cross-sectional correlations between the variables of interest. To do so, I define the following function which takes an arbitrary number of columns as input.

compute_correlations <- function(data, ..., upper = TRUE) {
  cor_matrix <- data %>%
    select(...) %>%
    cor(use = "complete.obs", method = "pearson")
  if (upper == TRUE) {
    cor_matrix[upper.tri(cor_matrix, diag = FALSE)] <- NA
  }
  return(cor_matrix)
}

The table below provides the results.

correlations <- crsp %>%
  filter(!is.na(bm)) %>%
  compute_correlations(bm, log_bm, be, me, mktcap, size, beta)

correlations %>%
  kable(digits = 2) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
bm log_bm be me mktcap size beta
bm 1.00
log_bm 0.51 1.00
be 0.00 0.01 1.00
me -0.04 -0.10 0.74 1.00
mktcap -0.04 -0.09 0.72 0.96 1.00
size -0.18 -0.29 0.31 0.35 0.35 1.00
beta -0.09 -0.23 0.05 0.06 0.06 0.35 1.00

As expected, the cross-sectional correlation between BM and its log transformation is quite high. Moreover, BM is negatively correlated with both market beta and size. The negative correlation between size and BM is somewhat mechanical, as market capitalization appears in the denominator of BM.
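A tiny simulation shows how mechanical this is: even when book and market equity are drawn independently (made-up numbers), the shared ME term produces a negative correlation between log(BM) and size.

```r
set.seed(42)

# book and market equity drawn independently, so any correlation between
# log(BM) = log(be) - log(me) and size = log(me) comes purely from the
# shared ME term in the denominator
be <- exp(rnorm(10000))
me <- exp(rnorm(10000))

cor(log(be / me), log(me))  # clearly negative despite independent inputs
```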

Persistence

In this section, I examine the persistence of BM and its log transformation over time via the following function.

compute_persistence <- function(data, var, tau) {
  dates <- data %>%
    distinct(date) %>%
    arrange(date) %>%
    mutate(date_lag = lag(date, tau))

  correlation <- data %>%
    select(permno, date, {{ var }}) %>%
    left_join(dates, by = "date") %>%
    left_join(data %>%
                select(permno, date, {{ var }}) %>%
                rename("{{ var }}_lag" := {{ var }}),
              by = c("permno", "date_lag" = "date")) %>%
    select(contains(rlang::quo_text(enquo(var)))) %>%
    cor(use = "pairwise.complete.obs", method = "pearson")

  return(correlation[1, 2])
}

persistence <- tribble(
  ~tau, ~bm, ~log_bm,
  12,  compute_persistence(crsp, bm, 12),  compute_persistence(crsp, log_bm, 12),
  24,  compute_persistence(crsp, bm, 24),  compute_persistence(crsp, log_bm, 24),
  36,  compute_persistence(crsp, bm, 36),  compute_persistence(crsp, log_bm, 36),
  48,  compute_persistence(crsp, bm, 48),  compute_persistence(crsp, log_bm, 48),
  60,  compute_persistence(crsp, bm, 60),  compute_persistence(crsp, log_bm, 60),
  120, compute_persistence(crsp, bm, 120), compute_persistence(crsp, log_bm, 120)
)

The next table provides the persistence results.

persistence %>%
  kable(digits = 2) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
tau bm log_bm
12 0.61 0.79
24 0.51 0.66
36 0.49 0.59
48 0.48 0.53
60 0.33 0.48
120 0.24 0.38

The table indicates that both BM and log(BM) exhibit substantial persistence, but with decaying intensity over time. Regardless of whether BM measures factor sensitivity or some sort of mispricing, it seems to persist for an extended period of time for the average stock.

Portfolio Analysis

Now, I turn to the cross-sectional relation between the book-to-market ratio and future stock returns. To conduct the univariate portfolio analysis, I need a couple of functions that I introduced in an earlier post.

weighted_mean <- function(x, w, ..., na.rm = FALSE) {
  if (na.rm) {
    x1 <- x[!is.na(x) & !is.na(w)]
    w <- w[!is.na(x) & !is.na(w)]
    x <- x1
  }
  weighted.mean(x, w, ..., na.rm = FALSE)
}

get_breakpoints_funs <- function(var, n_portfolios = 10) {
  # get relevant percentiles
  percentiles <- seq(0, 1, length.out = (n_portfolios + 1))
  percentiles <- percentiles[percentiles > 0 & percentiles < 1]

  # construct set of named quantile functions
  percentiles_names <- map_chr(percentiles, ~paste0(rlang::quo_text(enquo(var)), "_q", .x * 100))
  percentiles_funs <- map(percentiles, ~partial(quantile, probs = .x, na.rm = TRUE)) %>%
    set_names(nm = percentiles_names)
  return(percentiles_funs)
}

get_portfolio <- function(x, breakpoints) {
  portfolio <- as.integer(1 + findInterval(x, unlist(breakpoints)))
  return(portfolio)
}
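As a quick sanity check of the sorting logic: findInterval counts how many breakpoints lie weakly below a value, so adding one yields the portfolio number. The breakpoints below are made up for illustration.

```r
# with hypothetical breakpoints at the 25th, 50th and 75th percentiles, a
# value of 0.5 lies above two of them and lands in portfolio 3 of 4
breakpoints <- c(q25 = 0.2, q50 = 0.4, q75 = 0.8)
1 + findInterval(0.5, breakpoints)  # portfolio 3
```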

The next function is an adapted version from the size post as it features the additional option of whether to use only NYSE stocks or all stocks to determine the breakpoints for portfolio sorts.

construct_univariate_portfolios <- function(data, var, n_portfolios = 10, nyse_breakpoints = TRUE) {
  data <- data %>%
    filter(!is.na({{ var }}))

  # determine breakpoints based on NYSE stocks only or on all stocks
  if (nyse_breakpoints == TRUE) {
    data_quantiles <- data %>%
      filter(exchange == "NYSE") %>%
      select(date, {{ var }})
  } else {
    data_quantiles <- data %>%
      select(date, {{ var }})
  }

  var_funs <- get_breakpoints_funs({{ var }}, n_portfolios)
  quantiles <- data_quantiles %>%
    group_by(date) %>%
    summarize_at(vars({{ var }}), lst(!!!var_funs)) %>%
    group_by(date) %>%
    nest() %>%
    rename(quantiles = data)

  # independently sort all stocks into portfolios based on breakpoints
  portfolios <- data %>%
    left_join(quantiles, by = "date") %>%
    mutate(portfolio = map2_dbl({{ var }}, quantiles, get_portfolio)) %>%
    select(-quantiles)

  # compute average portfolio characteristics
  portfolios_ts <- portfolios %>%
    group_by(portfolio, date) %>%
    summarize(ret_ew = mean(ret_adj_excess_f1, na.rm = TRUE),
              ret_vw = weighted_mean(ret_adj_excess_f1, mktcap, na.rm = TRUE),
              ret_mkt = mean(mkt_ff_excess_f1, na.rm = TRUE),
              {{ var }} := mean({{ var }}, na.rm = TRUE),
              log_bm = mean(log_bm, na.rm = TRUE),
              mktcap = mean(mktcap, na.rm = TRUE),
              beta = mean(beta, na.rm = TRUE),
              n_stocks = n(),
              n_stocks_nyse = sum(exchange == "NYSE", na.rm = TRUE)) %>%
    na.omit() %>%
    ungroup()

  ## long-short portfolio
  portfolios_ts_ls <- portfolios_ts %>%
    select(portfolio, date, ret_ew, ret_vw, ret_mkt) %>%
    filter(portfolio %in% c(max(portfolio), min(portfolio))) %>%
    pivot_wider(names_from = portfolio, values_from = c(ret_ew, ret_vw)) %>%
    mutate(ret_ew = .[[4]] - .[[3]],
           ret_vw = .[[6]] - .[[5]],
           portfolio = paste0(n_portfolios, "-1")) %>%
    select(portfolio, date, ret_ew, ret_vw, ret_mkt)

  ## combine everything
  out <- portfolios_ts %>%
    mutate(portfolio = as.character(portfolio)) %>%
    bind_rows(portfolios_ts_ls) %>%
    mutate(portfolio = factor(portfolio, levels = c(as.character(seq(1, 10, 1)), paste0(n_portfolios, "-1"))))

  return(out)
}

The table below provides the summary statistics for each portfolio using the above functions.

portfolios_bm_nyse <- construct_univariate_portfolios(crsp, bm)
portfolios_bm_all <- construct_univariate_portfolios(crsp, bm, nyse_breakpoints = FALSE)

portfolios_table_nyse <- portfolios_bm_nyse %>%
  filter(portfolio != "10-1") %>%
  group_by(portfolio) %>%
  summarize(bm_mean = mean(bm),
            log_bm_mean = mean(log_bm),
            mktcap_mean = mean(mktcap),
            n = mean(n_stocks),
            n_nyse = mean(n_stocks_nyse / n_stocks) * 100,
            beta = mean(beta)) %>%
  t()

portfolios_table_nyse <- portfolios_table_nyse[-1, ] %>%
  as.numeric() %>%
  matrix(., ncol = 10)
colnames(portfolios_table_nyse) <- seq(1, 10, 1)
rownames(portfolios_table_nyse) <- c("BM", "Log(BM)", "MktCap", "Number of Stocks", "% NYSE Stocks", "Beta")

portfolios_table_all <- portfolios_bm_all %>%
  filter(portfolio != "10-1") %>%
  group_by(portfolio) %>%
  summarize(bm_mean = mean(bm),
            log_bm_mean = mean(log_bm),
            mktcap_mean = mean(mktcap),
            n = mean(n_stocks),
            n_nyse = mean(n_stocks_nyse / n_stocks) * 100,
            beta = mean(beta)) %>%
  t()

portfolios_table_all <- portfolios_table_all[-1, ] %>%
  as.numeric() %>%
  matrix(., ncol = 10)
colnames(portfolios_table_all) <- seq(1, 10, 1)
rownames(portfolios_table_all) <- c("BM", "Log(BM)", "MktCap", "Number of Stocks", "% NYSE Stocks", "Beta")

# combine to table and print html
rbind(portfolios_table_nyse, portfolios_table_all) %>%
  kable(digits = 2) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  pack_rows("NYSE Breakpoints", 1, 6) %>%
  pack_rows("All Stocks Breakpoints", 7, 12)
1 2 3 4 5 6 7 8 9 10
NYSE Breakpoints
BM 0.17 0.34 0.46 0.57 0.68 0.79 0.92 1.09 1.34 2.68
Log(BM) -1.98 -1.15 -0.84 -0.63 -0.45 -0.29 -0.13 0.03 0.25 0.79
MktCap 3303.06 3008.75 2590.34 2031.89 1768.70 1595.45 1276.90 1132.78 915.48 618.58
Number of Stocks 555.87 381.03 344.00 326.27 323.71 325.71 330.87 350.88 378.26 473.68
% NYSE Stocks 30.89 38.77 42.16 43.75 44.36 44.58 43.75 41.80 39.25 32.43
Beta 1.07 0.97 0.90 0.86 0.82 0.78 0.74 0.70 0.68 0.63
All Stocks Breakpoints
BM 0.14 0.29 0.42 0.54 0.66 0.79 0.95 1.14 1.45 2.95
Log(BM) -2.20 -1.31 -0.95 -0.69 -0.48 -0.29 -0.11 0.08 0.31 0.87
MktCap 3178.81 3136.80 2738.19 2105.06 1787.95 1479.09 1228.90 1063.14 880.77 603.66
Number of Stocks 379.38 378.87 379.04 378.82 378.81 379.19 378.90 378.85 378.96 379.47
% NYSE Stocks 28.50 35.84 40.43 42.75 43.66 44.23 41.99 40.58 37.64 31.47
Beta 1.08 1.00 0.92 0.88 0.83 0.79 0.74 0.70 0.66 0.63

By construction, the average BM ratio increases across portfolios, while market capitalization and betas decrease. The number of stocks per portfolio is roughly constant when breakpoints are based on all stocks, but U-shaped for NYSE breakpoints. Apart from that, the patterns are the same regardless of which breakpoints are used.

Given the portfolio returns, I want to evaluate whether a portfolio exhibits on average positive or negative excess returns. The following function estimates the mean excess return and CAPM alpha for each portfolio and computes the corresponding Newey and West (1987) $$t$$-statistics (using six lags, as is common in the literature), testing the null hypothesis that the average portfolio excess return or CAPM alpha is zero. I just recycle the function I wrote in this post.
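The mean test boils down to regressing the return series on a constant, whose intercept equals the sample mean, and replacing the usual standard error with a Newey-West one. A minimal sketch on simulated returns (made-up numbers):

```r
library(sandwich)

# hypothetical monthly excess returns
set.seed(1)
ret <- rnorm(240, mean = 0.5, sd = 2)

# constant-only regression: the intercept is the sample mean of ret
model <- lm(ret ~ 1)

# Newey-West standard error with six lags, as in the function below
nw_se <- sqrt(diag(NeweyWest(model, lag = 6)))
nw_tstat <- unname(coef(model) / nw_se)
```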

estimate_portfolio_returns <- function(data, ret) {
  # estimate average returns per portfolio
  average_ret <- data %>%
    group_by(portfolio) %>%
    arrange(date) %>%
    do(model = lm(paste0(rlang::as_name(enquo(ret)), " ~ 1"), data = .)) %>%
    mutate(nw_stderror = sqrt(diag(sandwich::NeweyWest(model, lag = 6)))) %>%
    broom::tidy(model) %>%
    ungroup() %>%
    mutate(nw_tstat = estimate / nw_stderror) %>%
    select(estimate, nw_tstat) %>%
    t()

  # estimate capm alpha per portfolio
  average_capm_alpha <- data %>%
    group_by(portfolio) %>%
    arrange(date) %>%
    do(model = lm(paste0(rlang::as_name(enquo(ret)), " ~ 1 + ret_mkt"), data = .)) %>%
    mutate(nw_stderror = sqrt(diag(sandwich::NeweyWest(model, lag = 6))[1])) %>%
    broom::tidy(model) %>%
    ungroup() %>%
    filter(term == "(Intercept)") %>%
    mutate(nw_tstat = estimate / nw_stderror) %>%
    select(estimate, nw_tstat) %>%
    t()

  # construct output table
  out <- rbind(average_ret, average_capm_alpha)
  colnames(out) <- c(as.character(seq(1, 10, 1)), "10-1")
  rownames(out) <- c("Excess Return", "t-Stat", "CAPM Alpha", "t-Stat")

  return(out)
}

The following table provides the results of the univariate portfolio sorts using NYSE breakpoints.

rbind(estimate_portfolio_returns(portfolios_bm_nyse, ret_ew),
      estimate_portfolio_returns(portfolios_bm_nyse, ret_vw)) %>%
  kable(digits = 2) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  pack_rows("NYSE Breakpoints - Equal-Weighted Portfolio Returns", 1, 4) %>%
  pack_rows("NYSE Breakpoints - Value-Weighted Portfolio Returns", 5, 8)
1 2 3 4 5 6 7 8 9 10 10-1
NYSE Breakpoints - Equal-Weighted Portfolio Returns
Excess Return 0.29 0.57 0.67 0.76 0.82 0.89 0.92 1.00 1.12 1.28 0.99
t-Stat 0.94 2.13 2.64 3.04 3.37 3.76 3.97 4.18 4.45 4.13 5.30
CAPM Alpha -0.43 -0.08 0.05 0.17 0.25 0.35 0.41 0.49 0.59 0.73 1.16
t-Stat -2.67 -0.64 0.38 1.40 2.03 2.85 3.27 3.78 4.14 3.78 5.95
NYSE Breakpoints - Value-Weighted Portfolio Returns
Excess Return 0.45 0.58 0.52 0.57 0.55 0.58 0.66 0.66 0.79 0.82 0.37
t-Stat 2.11 3.18 2.82 3.06 3.04 3.35 3.34 3.58 4.13 3.78 1.97
CAPM Alpha -0.11 0.05 0.00 0.06 0.06 0.11 0.18 0.19 0.30 0.30 0.41
t-Stat -1.24 0.92 0.02 0.78 0.67 1.52 1.80 1.73 2.76 2.33 2.08

Excess returns are monotonically increasing across BM portfolios, where stocks with high BM ratios seem to exhibit statistically significant excess returns. It is hence not surprising that the difference portfolio yields statistically significant positive excess returns. Adjusting the returns using the CAPM risk model has little effect on this result. Using value-weighted returns leads to a weaker relation between BM and expected returns, but the overall patterns are the same.

The next table provides the results of the univariate portfolio sorts using all stocks to calculate breakpoints.

rbind(estimate_portfolio_returns(portfolios_bm_all, ret_ew),
      estimate_portfolio_returns(portfolios_bm_all, ret_vw)) %>%
  kable(digits = 2) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  pack_rows("All Stocks Breakpoints - Equal-Weighted Portfolio Returns", 1, 4) %>%
  pack_rows("All Stocks Breakpoints - Value-Weighted Portfolio Returns", 5, 8)
1 2 3 4 5 6 7 8 9 10 10-1
All Stocks Breakpoints - Equal-Weighted Portfolio Returns
Excess Return 0.17 0.46 0.59 0.71 0.81 0.90 0.95 1.04 1.15 1.33 1.16
t-Stat 0.52 1.57 2.31 2.85 3.29 3.76 3.96 4.24 4.52 4.17 5.69
CAPM Alpha -0.57 -0.23 -0.04 0.11 0.23 0.36 0.42 0.52 0.63 0.78 1.35
t-Stat -3.21 -1.64 -0.31 0.90 1.87 2.85 3.26 3.84 4.25 3.84 6.27
All Stocks Breakpoints - Value-Weighted Portfolio Returns
Excess Return 0.46 0.42 0.55 0.53 0.55 0.63 0.65 0.71 0.80 0.87 0.41
t-Stat 1.93 2.16 2.96 2.78 2.93 3.63 3.31 3.77 4.09 3.79 1.93
CAPM Alpha -0.13 -0.13 0.03 0.01 0.04 0.16 0.16 0.24 0.29 0.34 0.47
t-Stat -1.15 -1.95 0.49 0.14 0.50 2.10 1.56 2.28 2.45 2.46 2.10

Perhaps unsurprisingly, the patterns are essentially the same as in the case of using only NYSE stocks to calculate breakpoints.

This positive relation between the book-to-market ratio and expected stock returns is known as the value premium as stocks with high BM ratios (value stocks) tend to outperform stocks with low BM ratios (growth stocks).

Regression Analysis

As a last step, I analyze the relation between the BM ratio and stock returns using Fama and MacBeth (1973) regression analysis. Each month, I perform a cross-sectional regression of one-month-ahead excess stock returns on the BM ratio (or its log) and a set of controls. Time-series averages over these cross-sectional regressions then provide the desired results.
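The two-step logic can be sketched on simulated data (a hypothetical premium of 0.3, no winsorization or Newey-West correction, plain standard errors):

```r
set.seed(2)

# step 1: one cross-sectional regression of returns on a characteristic per month
monthly_slopes <- sapply(1:120, function(m) {
  bm  <- rnorm(100)
  ret <- 0.3 * bm + rnorm(100)  # assumed true premium of 0.3 per unit of bm
  coef(lm(ret ~ bm))["bm"]
})

# step 2: average the monthly slopes; their time-series variation
# delivers the standard error of the estimate
fm_estimate <- mean(monthly_slopes)
fm_tstat <- fm_estimate / (sd(monthly_slopes) / sqrt(length(monthly_slopes)))
```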

Again, I essentially recycle functions that I developed in an earlier post. Note that I depart from Bali et al. and condition on all three explanatory variables being defined in the data, regardless of which measure I am using. I do this to ensure comparability across results, i.e. I want to run the same set of regressions across different specifications. Otherwise, I cannot evaluate whether adding a covariate, using a different measure, or a change in sample composition leads to different results (or all of them).

winsorize <- function(x, cut = 0.005) {
  cut_point_top <- quantile(x, 1 - cut, na.rm = TRUE)
  cut_point_bottom <- quantile(x, cut, na.rm = TRUE)
  i <- which(x >= cut_point_top)
  x[i] <- cut_point_top
  j <- which(x <= cut_point_bottom)
  x[j] <- cut_point_bottom
  return(x)
}

fama_macbeth_regression <- function(data, model, cut = 0.005) {
  # prepare and winsorize data
  data_nested <- data %>%
    filter(!is.na(ret_adj_f1) & !is.na(size) & !is.na(beta) & !is.na(bm) & !is.na(log_bm)) %>%
    group_by(date) %>%
    mutate_at(vars(size, beta, bm, log_bm), ~winsorize(., cut = cut)) %>%
    nest()

  # perform cross-sectional regressions for each month
  cross_sectional_regs <- data_nested %>%
    mutate(model = map(data, ~lm(enexpr(model), data = .x))) %>%
    mutate(tidy = map(model, broom::tidy),
           glance = map(model, broom::glance),
           n = map(model, stats::nobs))

  # extract average coefficient estimates
  fama_macbeth_coefs <- cross_sectional_regs %>%
    unnest(tidy) %>%
    group_by(term) %>%
    summarize(coefficient = mean(estimate))

  # compute newey-west standard errors of average coefficient estimates
  newey_west_std_errors <- cross_sectional_regs %>%
    unnest(tidy) %>%
    group_by(term) %>%
    arrange(date) %>%
    group_modify(~enframe(sqrt(diag(sandwich::NeweyWest(lm(estimate ~ 1, data = .x), lag = 6))))) %>%
    select(term, nw_std_error = value)

  # put coefficient estimates and standard errors together and compute t-statistics
  fama_macbeth_coefs <- fama_macbeth_coefs %>%
    left_join(newey_west_std_errors, by = "term") %>%
    mutate(nw_t_stat = coefficient / nw_std_error) %>%
    select(term, coefficient, nw_t_stat) %>%
    pivot_longer(cols = c(coefficient, nw_t_stat), names_to = "statistic") %>%
    mutate(statistic = paste(term, statistic, sep = " ")) %>%
    select(-term)

  # extract average r-squared and average number of observations
  fama_macbeth_stats <- cross_sectional_regs %>%
    unnest(c(glance, n)) %>%
    ungroup() %>%
    summarize(adj_r_squared = mean(adj.r.squared),
              n = mean(n)) %>%
    pivot_longer(cols = everything(), names_to = "statistic")

  # combine desired output and return results
  out <- rbind(fama_macbeth_coefs, fama_macbeth_stats)
  return(out)
}

The following table provides the regression results for various specifications.

m1 <- fama_macbeth_regression(crsp, ret_adj_f1 ~ bm) %>% rename(m1 = value)
m2 <- fama_macbeth_regression(crsp, ret_adj_f1 ~ bm + beta) %>% rename(m2 = value)
m3 <- fama_macbeth_regression(crsp, ret_adj_f1 ~ bm + size) %>% rename(m3 = value)
m4 <- fama_macbeth_regression(crsp, ret_adj_f1 ~ bm + beta + size) %>% rename(m4 = value)
m5 <- fama_macbeth_regression(crsp, ret_adj_f1 ~ log_bm) %>% rename(m5 = value)
m6 <- fama_macbeth_regression(crsp, ret_adj_f1 ~ log_bm + beta) %>% rename(m6 = value)
m7 <- fama_macbeth_regression(crsp, ret_adj_f1 ~ log_bm + size) %>% rename(m7 = value)
m8 <- fama_macbeth_regression(crsp, ret_adj_f1 ~ log_bm + beta + size) %>% rename(m8 = value)

regression_table <- list(m1, m2, m3, m4, m5, m6, m7, m8) %>%
  reduce(full_join, by = "statistic") %>%
  right_join(tibble(statistic = c("bm coefficient", "bm nw_t_stat",
                                  "log_bm coefficient", "log_bm nw_t_stat",
                                  "beta coefficient", "beta nw_t_stat",
                                  "size coefficient", "size nw_t_stat",
                                  "(Intercept) coefficient", "(Intercept) nw_t_stat",
                                  "adj_r_squared", "n")),
             by = "statistic")

regression_table %>%
  kable(digits = 2) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  pack_rows("coefficient estimates", 1, 10) %>%
  pack_rows("summary statistics", 11, 12)
statistic m1 m2 m3 m4 m5 m6 m7 m8
coefficient estimates
bm coefficient 0.37 0.30 0.26 0.21
bm nw_t_stat 5.05 4.79 3.44 3.37
log_bm coefficient 0.38 0.34 0.28 0.25
log_bm nw_t_stat 5.73 5.77 3.75 4.08
beta coefficient -0.21 -0.09 -0.16 -0.04
beta nw_t_stat -1.69 -0.53 -1.39 -0.25
size coefficient -0.11 -0.11 -0.10 -0.11
size nw_t_stat -2.46 -2.02 -2.29 -2.03
(Intercept) coefficient 0.87 1.08 1.41 1.51 1.35 1.45 1.74 1.77
(Intercept) nw_t_stat 3.37 5.25 3.21 3.90 5.39 6.63 4.26 4.75
summary statistics
adj_r_squared 0.01 0.02 0.02 0.04 0.01 0.02 0.02 0.04
n 3730.47 3730.47 3730.47 3730.47 3730.47 3730.47 3730.47 3730.47

Across all specifications, I find a strong positive relation between BM and future excess returns, regardless of which set of controls is included or whether BM or its log-transformed version is used. Consistent with results from earlier posts, I detect no relation between beta and future stock returns and a strong negative relation between firm size and future stock returns. These results provide strong evidence of a value premium and no evidence that this premium is a manifestation of the relations between beta, market capitalization and future stock returns.

Christoph Scheuch
Director of Product

My interests include Product Management, Data Science, R and FinTech-related stuff.