# Tidy Asset Pricing - Part III: Size and Stock Returns

Update 2022-02-11: check out our new book Tidy Finance with R for an updated and improved implementation of the content below.

In this note, I explore the relation between firm size and stock returns, following the methodology outlined in Bali, Engle, and Murray. In two earlier notes, I first prepared the CRSP sample in R and then estimated and analyzed market betas using the same data. I do not go into the theoretical foundations of the size effect but rather focus on its estimation in a tidy manner. The size effect refers to the observation that stocks with large market capitalizations tend to have lower returns than stocks with small market capitalizations (e.g., Fama and French, 1992). The usual disclaimer applies: the text below reflects my opinion, is for information purposes only, and is not intended as investment advice. The code below replicates most of the results of Bali et al. up to a few basis points if I restrict my data to their sample period. If you spot any mistakes or want to share suggestions for a better implementation, just drop a comment below.

I mainly use the following packages throughout this note:

```r
library(tidyverse)
library(lubridate)  # for working with dates
library(broom)      # for tidying up estimation results
library(kableExtra) # for nicer html tables
```

## Calculating Firm Size

First, I load the data that I prepared earlier and add the market betas. I follow Bali et al. and use the betas estimated from daily return data with a twelve-month estimation window as the measure of a stock's covariance with the market.

```r
crsp <- read_rds("data/crsp.rds")

crsp <- crsp %>%
  left_join(betas %>%
              select(permno, date_start, beta = beta_12m),
            by = c("permno", "date_start")) # note: date_start refers to beginning of month
```

There are two approaches to measuring the market capitalization of a stock. The simplest approach is to take the share price and the number of shares outstanding as of the end of each month for which the market capitalization is measured. This is the approach I used in the sample construction post to define the variable mktcap. Alternatively, Fama and French calculate market capitalization as of the last trading day of June in each year and hold the value constant for the months from June of the same year until May of the next year. The benefit of this approach is that the market capitalization measure does not vary with short-term movements in the stock price, which might otherwise induce unwanted correlation with stock returns. I implement the Fama-French approach below by defining a reference date for each month and then adding the June market cap as an additional column.

```r
crsp <- crsp %>%
  mutate(year = year(date),
         month = month(date),
         reference_date = if_else(month < 6,
                                  paste(year - 1, 6, 30, sep = "-"),
                                  paste(year, 6, 30, sep = "-")))

crsp_mktcap_ff <- crsp %>%
  filter(month == 6) %>%
  select(permno, reference_date, mktcap_ff = mktcap)

crsp <- crsp %>%
  left_join(crsp_mktcap_ff, by = c("permno", "reference_date"))
```
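As a quick sanity check of the reference-date logic, here is a self-contained toy example (the dates are made up for illustration): months before June map to June 30 of the previous year, while June and later months map to June 30 of the current year.

```r
library(dplyr)
library(lubridate)

# toy dates: January and May 2018 should map to 2017's June reference date,
# June and December 2018 to 2018's June reference date
toy <- tibble(date = as.Date(c("2018-01-31", "2018-05-31", "2018-06-30", "2018-12-31"))) %>%
  mutate(year = year(date),
         month = month(date),
         reference_date = if_else(month < 6,
                                  paste(year - 1, 6, 30, sep = "-"),
                                  paste(year, 6, 30, sep = "-")))

toy$reference_date
# "2017-6-30" "2017-6-30" "2018-6-30" "2018-6-30"
```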

As it turns out, which approach I use has little impact on the results of the empirical analyses, as the two measures are highly correlated. While the timing of the calculation thus matters little, the shape of the distribution does: as I show below, the distribution of market capitalization is highly skewed, with a small number of stocks whose market capitalization is very large. For regression analyses, I hence use the log of market capitalization, which I denote as size. For a more meaningful interpretation of the market capitalization measures, I also calculate inflation-adjusted values (in terms of end-of-2018 dollars).

```r
crsp <- crsp %>%
  mutate(mktcap_cpi = mktcap / cpi,
         mktcap_ff_cpi = mktcap_ff / cpi,
         size = log(mktcap),
         size_cpi = log(mktcap_cpi),
         size_ff = log(mktcap_ff),
         size_ff_cpi = log(mktcap_ff_cpi))
```

## Summary Statistics

I proceed to present summary statistics for the different measures of stock size using the full sample period from December 1925 until December 2018. I first compute the summary statistics for each month and then average each statistic across all months. Note that I use the moments package to compute skewness and kurtosis.

```r
# compute summary statistics for each month
crsp_ts <- crsp %>%
  select(date, mktcap, mktcap_ff:size_ff_cpi) %>%
  pivot_longer(cols = mktcap:size_ff_cpi, names_to = "measure") %>%
  na.omit() %>%
  group_by(measure, date) %>%
  summarize(mean = mean(value),
            sd = sd(value),
            skew = moments::skewness(value),
            kurt = moments::kurtosis(value),
            min = min(value),
            q05 = quantile(value, 0.05),
            q25 = quantile(value, 0.25),
            q50 = quantile(value, 0.50),
            q75 = quantile(value, 0.75),
            q95 = quantile(value, 0.95),
            max = max(value),
            n = n()) %>%
  ungroup()
```

```r
# average summary statistics across all months
crsp_ts %>%
  select(-date) %>%
  group_by(measure) %>%
  summarize_all(list(mean)) %>%
  kable(digits = 2) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
```
| measure | mean | sd | skew | kurt | min | q05 | q25 | q50 | q75 | q95 | max | n |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| mktcap | 1099.38 | 5186.43 | 14.57 | 323.08 | 0.52 | 5.76 | 29.35 | 114.91 | 476.70 | 4169.19 | 140802.80 | 3164.76 |
| mktcap_cpi | 1974.27 | 8993.83 | 14.56 | 323.01 | 2.86 | 20.31 | 82.73 | 267.93 | 982.34 | 7382.44 | 238170.44 | 3164.34 |
| mktcap_ff | 1092.84 | 5091.25 | 14.34 | 311.15 | 0.67 | 6.17 | 30.18 | 116.82 | 479.06 | 4170.68 | 135991.85 | 3057.32 |
| mktcap_ff_cpi | 1964.99 | 8857.12 | 14.34 | 311.16 | 3.35 | 20.99 | 83.71 | 270.10 | 983.78 | 7377.41 | 231163.32 | 3056.89 |
| size | 3.94 | 1.81 | 0.30 | 2.99 | -1.34 | 1.18 | 2.66 | 3.82 | 5.12 | 7.10 | 10.40 | 3164.76 |
| size_cpi | 5.47 | 1.81 | 0.30 | 2.99 | 0.19 | 2.71 | 4.19 | 5.35 | 6.65 | 8.63 | 11.93 | 3164.34 |
| size_ff | 3.97 | 1.79 | 0.33 | 2.98 | -1.02 | 1.24 | 2.69 | 3.84 | 5.13 | 7.11 | 10.39 | 3057.32 |
| size_ff_cpi | 5.49 | 1.79 | 0.33 | 2.98 | 0.51 | 2.77 | 4.21 | 5.37 | 6.65 | 8.63 | 11.91 | 3056.89 |

To further investigate the distribution of market capitalization, I examine the share of total market capitalization accounted for by the largest stocks. The figure below plots the percentage of total stock market capitalization that is captured by the largest $$x$$% of stocks over the full sample period.

```r
crsp_top <- crsp %>%
  select(permno, date, mktcap) %>%
  na.omit() %>%
  group_by(date) %>%
  mutate(top01 = if_else(mktcap >= quantile(mktcap, 0.99), 1L, 0L),
         top05 = if_else(mktcap >= quantile(mktcap, 0.95), 1L, 0L),
         top10 = if_else(mktcap >= quantile(mktcap, 0.90), 1L, 0L),
         top25 = if_else(mktcap >= quantile(mktcap, 0.75), 1L, 0L)) %>%
  summarize(total = sum(mktcap),
            top01 = sum(mktcap[top01 == 1]) / total,
            top05 = sum(mktcap[top05 == 1]) / total,
            top10 = sum(mktcap[top10 == 1]) / total,
            top25 = sum(mktcap[top25 == 1]) / total) %>%
  select(-total) %>%
  pivot_longer(cols = top01:top25, names_to = "group")
```

```r
crsp_top %>%
  mutate(group = case_when(group == "top01" ~ "Largest 1% of Stocks",
                           group == "top05" ~ "Largest 5% of Stocks",
                           group == "top10" ~ "Largest 10% of Stocks",
                           group == "top25" ~ "Largest 25% of Stocks"),
         group = factor(group, levels = c("Largest 1% of Stocks", "Largest 5% of Stocks",
                                          "Largest 10% of Stocks", "Largest 25% of Stocks"))) %>%
  ggplot(aes(x = date, y = value, group = group)) +
  geom_line(aes(color = group, linetype = group)) +
  scale_y_continuous(labels = scales::percent, breaks = scales::pretty_breaks(),
                     limits = c(0, 1)) +
  labs(x = "", y = "Percentage of Total Market Capitalization",
       fill = "", color = "", linetype = "") +
  scale_x_date(expand = c(0, 0), date_breaks = "10 years", date_labels = "%Y") +
  theme_classic()
```

## Correlations

The next table presents the (Pearson product-moment) correlations between the different measures of market capitalization and market beta, computed over the full panel.

```r
cor_matrix <- crsp %>%
  select(mktcap, size, mktcap_ff, size_ff, beta) %>%
  cor(use = "complete.obs", method = "pearson")

cor_matrix[upper.tri(cor_matrix, diag = TRUE)] <- NA

cor_matrix[-1, -5] %>%
  kable(digits = 2) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
```
|  | mktcap | size | mktcap_ff | size_ff |
|---|---:|---:|---:|---:|
| size | 0.34 |  |  |  |
| mktcap_ff | 0.98 | 0.33 |  |  |
| size_ff | 0.34 | 0.98 | 0.34 |  |
| beta | 0.05 | 0.28 | 0.05 | 0.28 |

Indeed, the different approaches are highly correlated, mostly because the total market capitalization of a firm in most cases changes very little over the course of a year (see next section). The correlation between beta and size indicates that larger stocks tend to have higher market betas.

## Persistence

Next, I turn to the persistence analysis, i.e. the correlation of each variable with its lagged values over time. To compute these correlations, I define the following function where I use the fairly new curly-curly operator which simplifies writing functions around tidyverse pipelines.

```r
compute_persistence <- function(var, tau) {
  dates <- crsp %>%
    distinct(date) %>%
    arrange(date) %>%
    mutate(date_lag = lag(date, tau))

  correlation <- crsp %>%
    select(permno, date, {{ var }}) %>%
    left_join(dates, by = "date") %>%
    left_join(crsp %>%
                select(permno, date, {{ var }}) %>%
                rename("{{ var }}_lag" := {{ var }}),
              by = c("permno", "date_lag" = "date")) %>%
    select(contains(rlang::quo_text(enquo(var)))) %>%
    cor(use = "pairwise.complete.obs", method = "pearson")

  return(correlation[1, 2])
}
```

```r
persistence <- tribble(~tau, ~size, ~size_ff,
                       1, compute_persistence(size, 1), compute_persistence(size_ff, 1),
                       3, compute_persistence(size, 3), compute_persistence(size_ff, 3),
                       6, compute_persistence(size, 6), compute_persistence(size_ff, 6),
                       12, compute_persistence(size, 12), compute_persistence(size_ff, 12),
                       24, compute_persistence(size, 24), compute_persistence(size_ff, 24),
                       36, compute_persistence(size, 36), compute_persistence(size_ff, 36),
                       48, compute_persistence(size, 48), compute_persistence(size_ff, 48),
                       60, compute_persistence(size, 60), compute_persistence(size_ff, 60),
                       120, compute_persistence(size, 120), compute_persistence(size_ff, 120))
```

The following table provides the results of the persistence analysis. Indeed, the size measures exhibit a high persistence even up to 10 years. As the Fama-French measure is only updated every June, it mechanically exhibits a slightly higher persistence, but the impact is in fact minimal compared to the monthly updated variable.

```r
persistence %>%
  kable(digits = 2) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
```
| tau | size | size_ff |
|---:|---:|---:|
| 1 | 1.00 | 1.00 |
| 3 | 0.99 | 0.99 |
| 6 | 0.98 | 0.98 |
| 12 | 0.96 | 0.97 |
| 24 | 0.93 | 0.94 |
| 36 | 0.91 | 0.92 |
| 48 | 0.89 | 0.90 |
| 60 | 0.88 | 0.89 |
| 120 | 0.83 | 0.84 |

## Portfolio Analysis

To analyze the cross-sectional relation between market capitalization and future stock returns, I start with univariate portfolio analysis. The idea is to sort all stocks into portfolios based on quantiles calculated using NYSE stocks only and then calculate equal-weighted and value-weighted average returns for each portfolio.

It turns out that I need to adjust the base weighted.mean function to drop observations with missing weights, because it would otherwise return missing values. This matters whenever a stock's return is defined while its market capitalization is not (e.g., due to a missing number of shares outstanding).

```r
weighted_mean <- function(x, w, ..., na.rm = FALSE){
  if(na.rm){
    x1 <- x[!is.na(x) & !is.na(w)]
    w <- w[!is.na(x) & !is.na(w)]
    x <- x1
  }
  weighted.mean(x, w, ..., na.rm = FALSE)
}
```
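To see the difference, compare the two functions on a toy vector with one missing weight (made-up numbers; the definition is repeated so the snippet is self-contained):

```r
weighted_mean <- function(x, w, ..., na.rm = FALSE){
  if(na.rm){
    x1 <- x[!is.na(x) & !is.na(w)]
    w <- w[!is.na(x) & !is.na(w)]
    x <- x1
  }
  weighted.mean(x, w, ..., na.rm = FALSE)
}

x <- c(0.01, 0.02, 0.03) # toy returns
w <- c(100, NA, 300)     # toy market caps with one missing weight

weighted.mean(x, w, na.rm = TRUE) # NA, since na.rm only drops missing returns, not missing weights
weighted_mean(x, w, na.rm = TRUE) # 0.025 = (0.01 * 100 + 0.03 * 300) / 400
```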

The next function is a helper that returns a list of quantile functions that I can pass to the summarize command to get a list of breakpoints for any given number of portfolios. The code snippet below is a modified version of the code found here. Most importantly, the function uses purrr::partial to fill in some arguments of the quantile function. As an example, I apply the function to return breakpoints for 3 market cap portfolios.

```r
get_breakpoints_funs <- function(var, n_portfolios = 10) {
  # get relevant percentiles
  percentiles <- seq(0, 1, length.out = (n_portfolios + 1))
  percentiles <- percentiles[percentiles > 0 & percentiles < 1]

  # construct set of named quantile functions
  percentiles_names <- map_chr(percentiles, ~paste0(rlang::quo_text(enquo(var)), "_q", .x * 100))
  percentiles_funs <- map(percentiles, ~partial(quantile, probs = .x, na.rm = TRUE)) %>%
    set_names(nm = percentiles_names)
  return(percentiles_funs)
}
get_breakpoints_funs(mktcap, n_portfolios = 3)
## $mktcap_q33.3333333333333
## <partialised>
## function (...)
## quantile(probs = .x, na.rm = TRUE, ...)
##
## $mktcap_q66.6666666666667
## <partialised>
## function (...)
## quantile(probs = .x, na.rm = TRUE, ...)
```

Given a list of breakpoints, I want to sort stocks into the corresponding portfolio. This turns out to be a bit tricky if I want to write a function that does the sorting for a flexible number of portfolios. Fortunately, my brilliant colleague Stefan Voigt pointed out the findInterval function to me.

```r
get_portfolio <- function(x, breakpoints) {
  portfolio <- as.integer(1 + findInterval(x, unlist(breakpoints)))
  return(portfolio)
}
```
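A toy example makes the sorting rule concrete (the breakpoints 10 and 50 are made up for illustration; the definition is repeated so the snippet is self-contained): findInterval returns 0 for values below the first breakpoint, so adding 1 yields portfolio indices starting at 1.

```r
get_portfolio <- function(x, breakpoints) {
  portfolio <- as.integer(1 + findInterval(x, unlist(breakpoints)))
  return(portfolio)
}

breakpoints <- list(q30 = 10, q70 = 50) # hypothetical breakpoints for 3 portfolios
get_portfolio(c(5, 10, 25, 80), breakpoints)
# 1 2 2 3: values below the first breakpoint land in portfolio 1,
# and a value equal to a breakpoint goes to the higher portfolio
```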

The next function finally constructs univariate portfolios for any sorting variable using NYSE breakpoints. The rationale behind using only NYSE stocks to calculate quantiles is that, for a large portion of the sample period, NYSE stocks tended to be much larger than stocks listed on AMEX or NASDAQ. If the breakpoints were calculated using all stocks in the CRSP sample, they would effectively just separate NYSE stocks from all other stocks. Note, however, that this procedure does not ensure an equal number of stocks in each portfolio, as I show below. Again, I leverage the new curly-curly operator and its associated naming capabilities.

```r
construct_univariate_portfolios <- function(data, var, n_portfolios = 10) {
  # keep only observations where the sorting variable is defined
  data <- data %>%
    filter(!is.na({{ var }}))

  # determine breakpoints based on NYSE stocks only
  data_nyse <- data %>%
    filter(exchange == "NYSE") %>%
    select(date, {{ var }})

  var_funs <- get_breakpoints_funs({{ var }}, n_portfolios)
  quantiles <- data_nyse %>%
    group_by(date) %>%
    summarize_at(vars({{ var }}), lst(!!! var_funs)) %>%
    group_by(date) %>%
    nest() %>%
    rename(quantiles = data)

  # independently sort all stocks into portfolios based on NYSE breakpoints
  portfolios <- data %>%
    left_join(quantiles, by = "date") %>%
    mutate(portfolio = map2_dbl({{ var }}, quantiles, get_portfolio)) %>%
    select(-quantiles)

  # compute average portfolio characteristics
  portfolios_ts <- portfolios %>%
    group_by(portfolio, date) %>%
    summarize(ret_ew = mean(ret_adj_excess_f1, na.rm = TRUE),
              ret_vw = weighted_mean(ret_adj_excess_f1, mktcap, na.rm = TRUE),
              ret_mkt = mean(mkt_ff_excess_f1, na.rm = TRUE),
              "{{ var }}_mean" := mean({{ var }}, na.rm = TRUE),
              "{{ var }}_sum" := sum({{ var }}, na.rm = TRUE),
              beta = mean(beta, na.rm = TRUE),
              n_stocks = n(),
              n_stocks_nyse = sum(exchange == "NYSE", na.rm = TRUE)) %>%
    na.omit() %>%
    ungroup()

  ## long-short portfolio
  portfolios_ts_ls <- portfolios_ts %>%
    select(portfolio, date, ret_ew, ret_vw, ret_mkt) %>%
    filter(portfolio %in% c(max(portfolio), min(portfolio))) %>%
    pivot_wider(names_from = portfolio, values_from = c(ret_ew, ret_vw)) %>%
    mutate(ret_ew = .[[4]] - .[[3]],
           ret_vw = .[[6]] - .[[5]],
           portfolio = paste0(n_portfolios, "-1")) %>%
    select(portfolio, date, ret_ew, ret_vw, ret_mkt)

  ## combine everything (note: use n_portfolios instead of a hardcoded 10
  ## so that the factor levels match the actual number of portfolios)
  out <- portfolios_ts %>%
    mutate(portfolio = as.character(portfolio)) %>%
    bind_rows(portfolios_ts_ls) %>%
    mutate(portfolio = factor(portfolio, levels = c(as.character(seq(1, n_portfolios, 1)),
                                                    paste0(n_portfolios, "-1"))))

  return(out)
}
```

```r
portfolios_mktcap <- construct_univariate_portfolios(crsp, mktcap)
portfolios_mktcap_ff <- construct_univariate_portfolios(crsp, mktcap_ff)
```

The table below provides the summary statistics for each portfolio.

```r
# compute market cap totals
mktcap_totals <- crsp %>%
  group_by(date) %>%
  summarize(mktcap_total = sum(mktcap, na.rm = TRUE),
            mktcap_ff_total = sum(mktcap_ff, na.rm = TRUE))

# portfolio characteristics
portfolios_mktcap_table <- portfolios_mktcap %>%
  filter(portfolio != "10-1") %>%
  left_join(mktcap_totals, by = "date") %>%
  group_by(portfolio) %>%
  summarize(mktcap_mean = mean(mktcap_mean),
            mktcap_share = mean(mktcap_sum / mktcap_total) * 100,
            n = mean(n_stocks),
            n_nyse = mean(n_stocks_nyse / n_stocks) * 100,
            beta = mean(beta)) %>%
  t()

portfolios_mktcap_table <- portfolios_mktcap_table[-1, ] %>%
  as.numeric() %>%
  matrix(., ncol = 10)
colnames(portfolios_mktcap_table) <- seq(1, 10, 1)
rownames(portfolios_mktcap_table) <- c("Market Cap (in $B)", "% of Total Market Cap",
                                       "Number of Stocks", "% NYSE Stocks", "Beta")

# repeat for the fama-french measure
portfolios_mktcap_ff_table <- portfolios_mktcap_ff %>%
  filter(portfolio != "10-1") %>%
  left_join(mktcap_totals, by = "date") %>%
  group_by(portfolio) %>%
  summarize(mktcap_ff_mean = mean(mktcap_ff_mean),
            mktcap_ff_share = mean(mktcap_ff_sum / mktcap_ff_total) * 100,
            n = mean(n_stocks),
            n_nyse = mean(n_stocks_nyse / n_stocks) * 100,
            beta = mean(beta)) %>%
  t()

portfolios_mktcap_ff_table <- portfolios_mktcap_ff_table[-1, ] %>%
  as.numeric() %>%
  matrix(., ncol = 10)
colnames(portfolios_mktcap_ff_table) <- seq(1, 10, 1)
rownames(portfolios_mktcap_ff_table) <- c("Market Cap (in $B)", "% of Total Market Cap",
                                          "Number of Stocks", "% NYSE Stocks", "Beta")
```

```r
# combine into one table and print html
rbind(portfolios_mktcap_table, portfolios_mktcap_ff_table) %>%
  kable(digits = 2) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  pack_rows("Market Cap (Bali-Engle-Murray)", 1, 5) %>%
  pack_rows("Market Cap (Fama-French)", 6, 10)
```
|  | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| **Market Cap (Bali-Engle-Murray)** |  |  |  |  |  |  |  |  |  |  |
| Market Cap (in $B) | 25.43 | 100.24 | 181.84 | 288.66 | 439.11 | 656.88 | 1000.36 | 1678.64 | 3353.75 | 16646.88 |
| % of Total Market Cap | 1.13 | 1.16 | 1.44 | 1.89 | 2.51 | 3.33 | 4.77 | 7.52 | 13.76 | 62.48 |
| Number of Stocks | 1417.84 | 387.86 | 266.60 | 217.65 | 186.92 | 162.65 | 150.42 | 141.72 | 134.15 | 129.71 |
| % NYSE Stocks | 43.28 | 56.31 | 64.16 | 69.63 | 74.92 | 81.14 | 85.10 | 88.52 | 92.13 | 94.54 |
| Beta | 0.76 | 0.99 | 1.02 | 1.02 | 1.02 | 1.01 | 1.02 | 1.01 | 1.00 | 1.01 |
| **Market Cap (Fama-French)** |  |  |  |  |  |  |  |  |  |  |
| Market Cap (in $B) | 26.30 | 102.10 | 182.39 | 288.65 | 435.70 | 648.76 | 985.41 | 1654.13 | 3306.93 | 16286.20 |
| % of Total Market Cap | 1.14 | 1.15 | 1.43 | 1.88 | 2.48 | 3.31 | 4.75 | 7.54 | 13.81 | 62.52 |
| Number of Stocks | 1360.21 | 363.88 | 255.62 | 209.18 | 180.04 | 158.78 | 146.58 | 139.24 | 131.72 | 127.46 |
| % NYSE Stocks | 44.19 | 57.60 | 65.18 | 70.70 | 76.04 | 81.80 | 85.78 | 88.71 | 92.37 | 94.76 |
| Beta | 0.76 | 1.00 | 1.02 | 1.03 | 1.03 | 1.01 | 1.02 | 1.01 | 1.00 | 1.01 |

The numbers are again very similar for both measures. The table also illustrates the effect of using NYSE breakpoints: the share of NYSE stocks increases monotonically across portfolios. Moreover, the portfolio of the largest firms accounts for more than 60% of total market capitalization although it contains, on average, only about 130 firms.

Given the portfolio returns, I want to evaluate whether a portfolio exhibits, on average, positive or negative excess returns. The following function estimates the mean excess return and CAPM alpha for each portfolio and computes corresponding Newey and West (1987) $$t$$-statistics (using six lags, as is common in the literature) testing the null hypothesis that the average portfolio excess return or CAPM alpha is zero. I simply recycle the function I wrote in an earlier post.
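Before applying the function to the portfolio data, the mechanics can be illustrated on simulated returns (toy data, not the actual sample): the mean excess return is the intercept of a constant-only regression, and the Newey-West $$t$$-statistic divides that estimate by its HAC standard error.

```r
library(sandwich)

set.seed(42)
ret <- rnorm(120, mean = 0.5, sd = 2) # 120 simulated monthly portfolio returns (toy data)

model <- lm(ret ~ 1)                              # constant-only regression: intercept = mean return
nw_stderror <- sqrt(diag(NeweyWest(model, lag = 6))) # HAC standard error with six lags
nw_tstat <- coef(model) / nw_stderror             # t-statistic for the mean return
```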

```r
estimate_portfolio_returns <- function(data, ret) {
  # estimate average returns per portfolio
  average_ret <- data %>%
    group_by(portfolio) %>%
    arrange(date) %>%
    do(model = lm(paste0(rlang::as_name(enquo(ret)), " ~ 1"), data = .)) %>%
    mutate(nw_stderror = sqrt(diag(sandwich::NeweyWest(model, lag = 6)))) %>%
    broom::tidy(model) %>%
    ungroup() %>%
    mutate(nw_tstat = estimate / nw_stderror) %>%
    select(estimate, nw_tstat) %>%
    t()

  # estimate capm alpha per portfolio
  average_capm_alpha <- data %>%
    group_by(portfolio) %>%
    arrange(date) %>%
    do(model = lm(paste0(rlang::as_name(enquo(ret)), " ~ 1 + ret_mkt"), data = .)) %>%
    mutate(nw_stderror = sqrt(diag(sandwich::NeweyWest(model, lag = 6))[1])) %>%
    broom::tidy(model) %>%
    ungroup() %>%
    filter(term == "(Intercept)") %>%
    mutate(nw_tstat = estimate / nw_stderror) %>%
    select(estimate, nw_tstat) %>%
    t()

  # construct output table
  out <- rbind(average_ret, average_capm_alpha)
  colnames(out) <- c(as.character(seq(1, 10, 1)), "10-1")
  rownames(out) <- c("Excess Return", "t-Stat", "CAPM Alpha", "t-Stat")

  return(out)
}
```

```r
rbind(estimate_portfolio_returns(portfolios_mktcap, ret_ew),
      estimate_portfolio_returns(portfolios_mktcap_ff, ret_ew),
      estimate_portfolio_returns(portfolios_mktcap, ret_vw),
      estimate_portfolio_returns(portfolios_mktcap_ff, ret_vw)) %>%
  kable(digits = 2) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  pack_rows("Market Cap (Bali-Engle-Murray) - Equal-Weighted Portfolio Returns", 1, 4) %>%
  pack_rows("Market Cap (Fama-French) - Equal-Weighted Portfolio Returns", 5, 8) %>%
  pack_rows("Market Cap (Bali-Engle-Murray) - Value-Weighted Portfolio Returns", 9, 12) %>%
  pack_rows("Market Cap (Fama-French) - Value-Weighted Portfolio Returns", 13, 16)
```
|  | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 10-1 |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| **Market Cap (Bali-Engle-Murray) - Equal-Weighted Portfolio Returns** |  |  |  |  |  |  |  |  |  |  |  |
| Excess Return | 1.62 | 1.01 | 0.95 | 0.92 | 0.88 | 0.84 | 0.79 | 0.81 | 0.72 | 0.60 | -1.01 |
| t-Stat | 3.90 | 3.41 | 3.35 | 3.72 | 3.73 | 3.81 | 3.69 | 4.03 | 3.84 | 3.62 | -3.25 |
| CAPM Alpha | 0.65 | 0.08 | 0.04 | 0.07 | 0.06 | 0.05 | 0.02 | 0.07 | 0.02 | -0.04 | -0.68 |
| t-Stat | 2.89 | 0.65 | 0.40 | 0.84 | 0.75 | 0.85 | 0.42 | 1.59 | 0.65 | -1.67 | -2.90 |
| **Market Cap (Fama-French) - Equal-Weighted Portfolio Returns** |  |  |  |  |  |  |  |  |  |  |  |
| Excess Return | 1.46 | 1.04 | 1.00 | 0.92 | 0.89 | 0.88 | 0.82 | 0.79 | 0.73 | 0.61 | -0.85 |
| t-Stat | 3.90 | 3.32 | 3.62 | 3.62 | 3.75 | 3.85 | 3.79 | 3.90 | 3.88 | 3.59 | -3.16 |
| CAPM Alpha | 0.52 | 0.10 | 0.11 | 0.07 | 0.06 | 0.07 | 0.05 | 0.05 | 0.03 | -0.04 | -0.56 |
| t-Stat | 2.65 | 0.75 | 1.05 | 0.75 | 0.82 | 1.04 | 0.82 | 0.95 | 0.66 | -1.45 | -2.65 |
| **Market Cap (Bali-Engle-Murray) - Value-Weighted Portfolio Returns** |  |  |  |  |  |  |  |  |  |  |  |
| Excess Return | 1.34 | 1.00 | 0.96 | 0.92 | 0.88 | 0.84 | 0.80 | 0.81 | 0.71 | 0.60 | -0.74 |
| t-Stat | 3.40 | 3.41 | 3.38 | 3.73 | 3.74 | 3.80 | 3.72 | 4.08 | 3.83 | 3.73 | -2.50 |
| CAPM Alpha | 0.36 | 0.08 | 0.05 | 0.07 | 0.06 | 0.05 | 0.03 | 0.07 | 0.02 | -0.01 | -0.37 |
| t-Stat | 1.83 | 0.62 | 0.51 | 0.86 | 0.75 | 0.83 | 0.51 | 1.69 | 0.68 | -0.41 | -1.73 |
| **Market Cap (Fama-French) - Value-Weighted Portfolio Returns** |  |  |  |  |  |  |  |  |  |  |  |
| Excess Return | 1.08 | 0.94 | 0.97 | 0.90 | 0.88 | 0.87 | 0.81 | 0.79 | 0.72 | 0.60 | -0.48 |
| t-Stat | 3.25 | 3.21 | 3.62 | 3.70 | 3.79 | 3.91 | 3.85 | 4.01 | 3.90 | 3.75 | -2.08 |
| CAPM Alpha | 0.15 | 0.03 | 0.10 | 0.08 | 0.07 | 0.08 | 0.06 | 0.07 | 0.03 | -0.01 | -0.16 |
| t-Stat | 0.93 | 0.26 | 0.99 | 0.86 | 0.97 | 1.23 | 1.05 | 1.44 | 0.91 | -0.34 | -0.88 |

The results show that all portfolios deliver positive excess returns, but the returns decrease as firm size increases. As a consequence, the portfolio that is long the largest firms and short the smallest firms yields statistically significant negative excess returns. Even though the CAPM alphas of the individual portfolios are statistically indistinguishable from zero, the long-short portfolios still yield negative alphas, which are statistically significant for the equal-weighted portfolios but weaker for the value-weighted ones. Taken together, the results of the univariate portfolio analysis indicate a negative relation between firm market capitalization and future stock returns.

## Regression Analysis

As a last step, I analyze the relation between firm size and stock returns using Fama and MacBeth (1973) regression analysis. Each month, I perform a cross-sectional regression of one-month-ahead excess stock returns on the given measure of firm size. Time-series averages over these cross-sectional regressions then provide the desired results.
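Schematically, the two-step procedure looks as follows (a generic sketch with a single characteristic; the actual specifications below also include beta). Each month $$t$$, I estimate the cross-sectional regression

$$r_{i,t+1} = \delta_{0,t} + \delta_{1,t} \, \mathit{size}_{i,t} + \epsilon_{i,t+1},$$

and then average the monthly slope estimates over the $$T$$ months of the sample,

$$\hat{\delta}_1 = \frac{1}{T} \sum_{t=1}^{T} \hat{\delta}_{1,t},$$

where the $$t$$-statistic of $$\hat{\delta}_1$$ is computed from Newey-West standard errors of the time series of monthly coefficient estimates.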

Again, I essentially recycle functions that I developed in an earlier post. Note that I depart from Bali et al. and condition on all three explanatory variables being defined in the data, regardless of which measure I am using. I do this to ensure comparability across results, i.e., I want to run the same set of regressions across the different specifications. Otherwise, I could not evaluate whether differences in results are driven by adding a covariate, by using a different measure, or by changes in sample composition (or all of the above).

```r
winsorize <- function(x, cut = 0.005){
  cut_point_top <- quantile(x, 1 - cut, na.rm = TRUE)
  cut_point_bottom <- quantile(x, cut, na.rm = TRUE)
  i <- which(x >= cut_point_top)
  x[i] <- cut_point_top
  j <- which(x <= cut_point_bottom)
  x[j] <- cut_point_bottom
  return(x)
}
```
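Applied to a toy vector with one extreme observation (made-up numbers; the definition is repeated so the snippet is self-contained), winsorizing pulls the outlier in to the upper quantile instead of dropping it:

```r
winsorize <- function(x, cut = 0.005){
  cut_point_top <- quantile(x, 1 - cut, na.rm = TRUE)
  cut_point_bottom <- quantile(x, cut, na.rm = TRUE)
  i <- which(x >= cut_point_top)
  x[i] <- cut_point_top
  j <- which(x <= cut_point_bottom)
  x[j] <- cut_point_bottom
  return(x)
}

x <- c(1:99, 1000) # one extreme outlier
x_w <- winsorize(x, cut = 0.01)

max(x_w) # capped at the upper 1% quantile, far below 1000
min(x_w) # raised slightly above the original minimum of 1
```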

```r
fama_macbeth_regression <- function(data, model, cut = 0.005) {
  # prepare and winsorize data
  data_nested <- data %>%
    filter(!is.na(ret_adj_f1) & !is.na(size) & !is.na(size_ff) & !is.na(beta)) %>%
    group_by(date) %>%
    mutate_at(vars(size, size_ff, beta), ~winsorize(., cut = cut)) %>%
    nest()

  # perform cross-sectional regressions for each month
  cross_sectional_regs <- data_nested %>%
    mutate(model = map(data, ~lm(enexpr(model), data = .x))) %>%
    mutate(tidy = map(model, broom::tidy),
           glance = map(model, broom::glance),
           n = map(model, stats::nobs))

  # extract average coefficient estimates
  fama_macbeth_coefs <- cross_sectional_regs %>%
    unnest(tidy) %>%
    group_by(term) %>%
    summarize(coefficient = mean(estimate))

  # compute newey-west standard errors of average coefficient estimates
  newey_west_std_errors <- cross_sectional_regs %>%
    unnest(tidy) %>%
    group_by(term) %>%
    arrange(date) %>%
    group_modify(~enframe(sqrt(diag(sandwich::NeweyWest(lm(estimate ~ 1, data = .x), lag = 6))))) %>%
    select(term, nw_std_error = value)

  # put coefficient estimates and standard errors together and compute t-statistics
  fama_macbeth_coefs <- fama_macbeth_coefs %>%
    left_join(newey_west_std_errors, by = "term") %>%
    mutate(nw_t_stat = coefficient / nw_std_error) %>%
    select(term, coefficient, nw_t_stat) %>%
    pivot_longer(cols = c(coefficient, nw_t_stat), names_to = "statistic") %>%
    mutate(statistic = paste(term, statistic, sep = " ")) %>%
    select(-term)

  # extract average adjusted r-squared and average number of observations
  fama_macbeth_stats <- cross_sectional_regs %>%
    unnest(c(glance, n)) %>%
    ungroup() %>%
    summarize(adj_r_squared = mean(adj.r.squared) * 100,
              n = mean(n)) %>%
    pivot_longer(cols = everything(), names_to = "statistic")

  # combine desired output and return results
  out <- rbind(fama_macbeth_coefs, fama_macbeth_stats)
  return(out)
}
```

```r
m1 <- fama_macbeth_regression(crsp, ret_adj_f1 ~ size) %>% rename(m1 = value)
m2 <- fama_macbeth_regression(crsp, ret_adj_f1 ~ size + beta) %>% rename(m2 = value)
m3 <- fama_macbeth_regression(crsp, ret_adj_f1 ~ size_ff) %>% rename(m3 = value)
m4 <- fama_macbeth_regression(crsp, ret_adj_f1 ~ size_ff + beta) %>% rename(m4 = value)
```

The following table provides the regression results for various specifications.

```r
regression_table <- m1 %>%
  full_join(m2, by = "statistic") %>%
  full_join(m3, by = "statistic") %>%
  full_join(m4, by = "statistic") %>%
  right_join(tibble(statistic = c("(Intercept) coefficient", "(Intercept) nw_t_stat",
                                  "size coefficient", "size nw_t_stat",
                                  "size_ff coefficient", "size_ff nw_t_stat",
                                  "beta coefficient", "beta nw_t_stat",
                                  "adj_r_squared", "n")),
             by = "statistic")

colnames(regression_table) <- c("statistic", "m1", "m2", "m3", "m4")
regression_table[, 1] <- c("intercept", "t-stat",
                           "size", "t-stat", "size_ff", "t-stat",
                           "beta", "t-stat",
                           "adj. R2 (in %)", "n")

regression_table %>%
  kable(digits = 2) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
```
| statistic | m1 | m2 | m3 | m4 |
|---|---:|---:|---:|---:|
| intercept | 1.73 | 1.76 | 1.64 | 1.70 |
| t-stat | 4.74 | 5.64 | 4.68 | 5.59 |
| size | -0.17 | -0.17 |  |  |
| t-stat | -3.71 | -3.17 |  |  |
| size_ff |  |  | -0.15 | -0.14 |
| t-stat |  |  | -3.41 | -2.86 |
| beta |  | -0.09 |  | -0.14 |
| t-stat |  | -0.67 |  | -1.05 |