Tidy Asset Pricing - Part V: The Fama-French 3-Factor Model

Update 2022-02-11: check out our new book Tidy Finance with R for an updated and improved implementation of the content below.

This note is an effort to replicate the famous Fama-French three-factor model, in particular the three factors that enter its equation: the market factor, the size factor (small firms outperform large firms), and the value factor (firms with high book-to-market equity ratios outperform firms with low ratios). Throughout this note, I build on earlier posts that explored empirical asset pricing approaches in a similar direction: I constructed the CRSP stock return sample, estimated stock-specific market betas, analyzed the relation between stock returns and size, and replicated the value premium. I keep the description of the individual steps very short and refer to these notes for more details.
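
In its usual notation, the model explains the excess return of stock \(i\) in month \(t\) through its exposures to the three factors:

\[ r_{i,t} - r_{f,t} = \alpha_i + \beta_i \, (r_{m,t} - r_{f,t}) + s_i \, SMB_t + h_i \, HML_t + \varepsilon_{i,t} \]

where \(r_{f,t}\) denotes the risk-free rate, \(r_{m,t}\) the market return, and \(SMB_t\) and \(HML_t\) the size and value factors constructed below.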

At the end of this post, I demonstrate that my effort yields a decent result: my version of the mkt factor exhibits a 99% correlation with the market factor provided on Kenneth French’s website. Similarly, my smb and hml factors show a 97% and 95% correlation, respectively, with the original data. Wayne Chang provides an R script on his website that supposedly achieves higher correlations. However, I have to admit that I have not yet tried to run his code because I find it very hard to grasp what is going on by just glancing over it. Maybe I’ll devote some more time to his code at a later stage.

The usual disclaimer applies: the text below reflects my opinion, is for informational purposes only, and is not intended to provide any investment advice.

I mainly use the following packages throughout this note:

library(tidyverse)  # for grammar
library(scales)     # for nicer axes of figures
library(viridis)    # for nicer colors in figures
library(vroom)      # for fast csv reading
library(lubridate)  # for working with dates
library(moments)    # for skewness and kurtosis 
library(kableExtra) # for nicer html tables

Prepare CRSP Data

This note is self-contained in the sense that everything starts from the raw data downloaded from WRDS. Note that your replication efforts might end up differently if you use another data source! First, I need to construct the sample of monthly stock returns. The code chunk below exhibits two differences from the earlier post: I drop observations before 1962, as most papers in empirical asset pricing start in 1963 due to data availability, and I drop stocks that are listed on exchanges other than NYSE, AMEX, or NASDAQ, since Fama and French focus their seminal paper on these stocks.

# Read raw monthly crsp file
tbl.crsp_msf <- vroom("raw/crsp_msf.csv",
                       col_types = cols(.default = col_character()))
colnames(tbl.crsp_msf) <- str_to_lower(colnames(tbl.crsp_msf))

# Parse only relevant variables
tbl.crsp <- tbl.crsp_msf %>%
  transmute(permno = as.integer(permno),     # security identifier
            date = ymd(date),                # month identifier
            ret = as.numeric(ret) * 100,     # return (converted to percent)
            shrout = as.numeric(shrout),     # shares outstanding (in thousands)
            altprc = as.numeric(altprc),     # last traded price in a month
            exchcd = as.integer(exchcd),     # exchange code
            shrcd = as.integer(shrcd),       # share code
            siccd = as.integer(siccd),       # industry code
            dlret = as.numeric(dlret) * 100, # delisting return (converted to percent)
            dlstcd = as.integer(dlstcd)      # delisting code
  ) 

# Analysis of Fama and French (1993) starts in 1963, so I only need data after 1962
tbl.crsp <- tbl.crsp %>%
  filter(date >= "1962-01-01")

# Keep only US-based common stocks (10 and 11)
tbl.crsp <- tbl.crsp %>%
  filter(shrcd %in% c(10, 11)) %>%
  select(-shrcd)

# Keep only distinct observations to avoid multiple counting
tbl.crsp <- tbl.crsp %>%
  distinct(permno, date, .keep_all = TRUE) # remove duplicates 

# Compute market cap
# Note: altprc is the negative of the average of bid and ask
#       whenever there is no last traded price in a month
tbl.crsp <- tbl.crsp %>%
  mutate(mktcap = abs(shrout * altprc) / 1000, # in millions of dollars
         mktcap = if_else(mktcap == 0, as.numeric(NA), mktcap)) 
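
As a quick illustration of this sign convention, consider the following sketch with hypothetical values: a stock with a traded price of 10 and one whose altprc is a bid-ask midpoint reported as -10 end up with the same market cap.

# Hypothetical values: both rows yield a market cap of 20 million dollars
tibble(shrout = c(2000, 2000), altprc = c(10, -10)) %>%
  mutate(mktcap = abs(shrout * altprc) / 1000)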

# Define exchange labels and keep only NYSE, AMEX and NASDAQ stocks
tbl.crsp <- tbl.crsp %>%
  mutate(exchange = case_when(exchcd %in% c(1, 31) ~ "NYSE",
                              exchcd %in% c(2, 32) ~ "AMEX",
                              exchcd %in% c(3, 33) ~ "NASDAQ",
                              TRUE ~ "Other")) %>%
  filter(exchange != "Other")

# Adjust delisting returns (see Shumway, 1997)
tbl.crsp <- tbl.crsp %>%
  mutate(ret_adj = case_when(is.na(dlstcd) ~ ret,
                             !is.na(dlstcd) & !is.na(dlret) ~ dlret,
                             dlstcd %in% c(500, 520, 580, 584) |
                               (dlstcd >= 551 & dlstcd <= 574) ~ -30,
                             TRUE ~ -100)) %>%
  select(-c(dlret, dlstcd))
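
To see the adjustment logic in action, here is a minimal sketch with four hypothetical observations: no delisting event, a delisting with a recorded return, a performance-related delisting without a recorded return, and any other delisting without a recorded return.

# Hypothetical observations: expected ret_adj is 2, -15, -30, -100
tibble(ret = c(2, 2, 2, 2),
       dlret = c(NA, -15, NA, NA),
       dlstcd = c(NA, 500L, 551L, 400L)) %>%
  mutate(ret_adj = case_when(is.na(dlstcd) ~ ret,
                             !is.na(dlstcd) & !is.na(dlret) ~ dlret,
                             dlstcd %in% c(500, 520, 580, 584) |
                               (dlstcd >= 551 & dlstcd <= 574) ~ -30,
                             TRUE ~ -100))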

Prepare CRSP/Compustat Merged Data

The next code chunk is explained in this post with two additions: I do not include firms until they have appeared in Compustat for two years, just as in Fama and French (1993). However, this extra filter only has a negligible impact on the overall results. Moreover, the construction of book equity is now based on the instructions provided on Kenneth French’s website as opposed to the slightly simpler implementation in Bali, Engle and Murray (2016).

# Read raw CRSP/Compustat merged file
tbl.ccmfund <- vroom("raw/ccmfund.csv",
                     col_types = cols(.default = col_character()))
colnames(tbl.ccmfund) <- str_to_lower(colnames(tbl.ccmfund))

# Select and parse relevant variables
tbl.compustat <- tbl.ccmfund %>%
  transmute(
    gvkey = as.integer(gvkey),         # firm identifier
    permno = as.integer(lpermno),      # stock identifier
    datadate = ymd(datadate),          # date of report
    linktype = as.character(linktype), # link type
    linkenddt = ymd(linkenddt),        # date when link ceases to be valid
    seq = as.numeric(seq),             # stockholders' equity
    ceq = as.numeric(ceq),             # total common/ordinary equity
    at = as.numeric(at),               # total assets
    lt = as.numeric(lt),               # total liabilities
    txditc = as.numeric(txditc),       # deferred taxes and investment tax credit
    txdb = as.numeric(txdb),           # deferred taxes
    itcb = as.numeric(itcb),           # investment tax credit
    pstkrv = as.numeric(pstkrv),       # preferred stock redemption value
    pstkl = as.numeric(pstkl),         # preferred stock liquidating value
    pstk = as.numeric(pstk),           # preferred stock par value
    indfmt = as.character(indfmt),     # industry format
    datafmt = as.character(datafmt)    # data format
  ) 

# Make sure that only correct industry and data format is used
tbl.compustat <- tbl.compustat %>%
  filter(indfmt == "INDL" & datafmt == "STD") %>%
  select(-c(indfmt, datafmt))

# Check that only valid links are used
tbl.compustat <- tbl.compustat %>%
  filter(linktype %in% c("LU", "LC"))

# Check that links are still active at time of information release
tbl.compustat <- tbl.compustat %>%
  filter(datadate <= linkenddt | is.na(linkenddt))

# Keep only distinct observations
tbl.compustat <- tbl.compustat %>%
  distinct()

# Calculate book value of preferred stock and equity
tbl.compustat <- tbl.compustat %>%
  mutate(be = coalesce(seq, ceq + pstk, at - lt) + 
           coalesce(txditc, txdb + itcb, 0) - 
           coalesce(pstkrv, pstkl, pstk, 0),
         be = if_else(be < 0, as.numeric(NA), be)) %>%
  select(gvkey, permno, datadate, be) 
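
The coalesce() calls implement the fallback hierarchy from Kenneth French's website: each book equity component uses the first non-missing alternative. A minimal sketch with hypothetical values:

# Hypothetical firm: seq and txditc are missing, so the fallbacks kick in
# and be = (80 + 5) + (3 + 1) - 5 = 84
tibble(seq = NA_real_, ceq = 80, pstk = 5, at = 100, lt = 20,
       txditc = NA_real_, txdb = 3, itcb = 1,
       pstkrv = NA_real_, pstkl = NA_real_) %>%
  mutate(be = coalesce(seq, ceq + pstk, at - lt) +
           coalesce(txditc, txdb + itcb, 0) -
           coalesce(pstkrv, pstkl, pstk, 0))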

# Determine year for matching and keep only last observation per year
tbl.compustat <- tbl.compustat %>%
  mutate(year = year(datadate)) %>%
  group_by(permno, year) %>%
  filter(datadate == max(datadate)) %>%
  ungroup()

# Keep only observations once a firm has been included for two years
tbl.compustat <- tbl.compustat %>%
  group_by(gvkey) %>%
  mutate(first_year = min(year)) %>%
  ungroup() %>%
  filter(year > first_year + 1) %>%
  select(-c(gvkey, first_year))

# Kick out unnecessary rows with missing values
tbl.compustat <- tbl.compustat %>%
  na.omit()

# Determine reference date for matching (i.e. June of next calendar year)
tbl.compustat <- tbl.compustat %>%
  mutate(reference_date = ymd(paste0(year + 1, "-06-01"))) %>%
  select(-year)

Construct Stock Sample

The stock sample used for the construction of the factors also builds on this post. Note that the relevant returns for the factor construction are the adjusted returns in the current month. In the end, only stocks that have valid book equity data from year \(y-1\), market equity data at the end of year \(y-1\) and market capitalization in June of year \(y\) enter the factor construction procedure. Moreover, the market capitalization of June of year \(y\) is used to construct value-weighted returns until May of year \(y+1\). I also restrict the sample to start in 1963, consistent with Fama and French.

# Select relevant variables
tbl.stocks <- tbl.crsp %>%
  select(permno, date, exchange, ret = ret_adj, mktcap) %>%
  na.omit()

# Define reference date for each stock (i.e. new sorting starts in June of year y)
tbl.stocks <- tbl.stocks %>%
  mutate(reference_date = ymd(if_else(month(date) < 6, 
                                      paste0(year(date) - 1, "-06-01"), 
                                      paste0(year(date), "-06-01")))) 
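
For intuition, this mapping implies that monthly returns from June of year \(y\) through May of year \(y+1\) are all assigned to the sorting of June of year \(y\). A small sketch with hypothetical dates:

# The first three dates map to 1994-06-01, the last one to 1995-06-01
tibble(date = ymd(c("1994-06-30", "1994-12-30", "1995-05-31", "1995-06-30"))) %>%
  mutate(reference_date = ymd(if_else(month(date) < 6,
                                      paste0(year(date) - 1, "-06-01"),
                                      paste0(year(date), "-06-01"))))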

# Add book equity data for year y-1 which is used starting in June of year y
tbl.stocks <- tbl.stocks %>%
  left_join(tbl.compustat, by = c("permno", "reference_date"))

# Add market equity data from the end of year y-1 which is used for bm ratio in June of year y
tbl.stocks_me <- tbl.stocks %>%
  filter(month(date) == 12) %>%
  mutate(reference_date = ymd(paste0(year(date) + 1, "-06-01"))) %>%
  select(permno, reference_date, me = mktcap)

tbl.stocks <- tbl.stocks %>%
  left_join(tbl.stocks_me, by = c("permno", "reference_date"))

# Compute book-to-market ratio
tbl.stocks <- tbl.stocks %>%
  mutate(bm = be / me)

# Add market cap of June of year y which is used for value-weighted returns
tbl.stocks_weight <- tbl.stocks %>%
  filter(month(date) == 6) %>%
  select(permno, reference_date, mktcap_weight = mktcap)

tbl.stocks <- tbl.stocks %>%
  left_join(tbl.stocks_weight, by = c("permno", "reference_date"))

# Only keep stocks that have all the necessary data 
# (i.e. market equity data for December y-1 and June y, and book equity data for y-1)
tbl.stocks <- tbl.stocks %>%
  na.omit() %>%
  filter(date >= "1963-01-01") # Fama-French paper starts here

Size Sorts

To construct the size portfolios, Fama and French sort all stocks into two portfolios (independently of the value sort below) according to the median market capitalization of all NYSE stocks in June of each year. Stocks then remain in these portfolios until June of the following year. This is exactly what happens in the code chunk below.

# In June of each year, all NYSE stocks are ranked on size to get the median
tbl.size_breakpoints <- tbl.stocks %>%
  filter(month(date) == 6 & exchange == "NYSE") %>%
  select(reference_date, mktcap) %>%
  group_by(reference_date) %>%
  summarize(size_median = median(mktcap))

# Also in June, all stocks are sorted into 2 portfolios
tbl.size_sorts <- tbl.stocks %>%
  filter(month(date) == 6) %>%
  left_join(tbl.size_breakpoints, by = "reference_date") %>%
  mutate(size_portfolio = case_when(mktcap > size_median ~ "B",
                                    mktcap <= size_median ~ "S",
                                    TRUE ~ as.character(NA))) %>%
  select(permno, reference_date, size_portfolio)

# Add size portfolio assignment back to stock data
tbl.stocks <- tbl.stocks %>% 
  left_join(tbl.size_sorts, by = c("permno", "reference_date"))

Each stock in my sample is now assigned to a size portfolio: either big ("B") or small ("S").
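
As a quick sanity check (not part of the original procedure), one can tabulate how many stocks fall into each size portfolio at every sorting date; small stocks should far outnumber big ones since the median is computed using NYSE stocks only.

# Number of stocks per size portfolio at each sorting date
tbl.stocks %>%
  filter(month(date) == 6) %>%
  count(reference_date, size_portfolio)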

Value Sorts

Fama and French determine the value portfolios based on 30% and 70% breakpoints, again computed using NYSE stocks only in June of each year. Each stock is then sorted (independently of the size sort) into high book-to-market ("H"), medium book-to-market ("M"), or low book-to-market ("L").

# Calculate value breakpoints using NYSE stocks
tbl.value_breakpoints <- tbl.stocks %>%
  filter(month(date) == 6 & exchange == "NYSE") %>%
  select(reference_date, bm) %>%
  group_by(reference_date) %>%
  summarize(value_q30 = quantile(bm, 0.3),
            value_q70 = quantile(bm, 0.7))

# Also in June, all stocks are sorted into 3 portfolios
tbl.value_sorts <- tbl.stocks %>%
  filter(month(date) == 6) %>%
  left_join(tbl.value_breakpoints, by = "reference_date") %>%
  mutate(value_portfolio = case_when(bm > value_q70 ~ "H",
                                     bm <= value_q70 & bm > value_q30 ~ "M", 
                                     bm <= value_q30 ~ "L",
                                     TRUE ~ as.character(NA))) %>%
  select(permno, reference_date, value_portfolio)

# Add value portfolio assignment back to stock data
tbl.stocks <- tbl.stocks %>% 
  left_join(tbl.value_sorts, by = c("permno", "reference_date"))

Construct Factor Portfolios

To construct the smb and hml factors, I compute value-weighted returns for all six combinations of size and value portfolios and then take the differences of the resulting portfolio returns as described by Fama and French.
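
In formulas, with the six portfolio labels defined in the code below:

\[ SMB = \tfrac{1}{3}\,(S/H + S/M + S/L) - \tfrac{1}{3}\,(B/H + B/M + B/L) \]

\[ HML = \tfrac{1}{2}\,(S/H + B/H) - \tfrac{1}{2}\,(S/L + B/L) \]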

tbl.portfolios <- tbl.stocks %>%
  group_by(date, size_portfolio, value_portfolio) %>%
  summarize(ret_vw = weighted.mean(ret, mktcap_weight)) %>%
  ungroup() %>%
  mutate(portfolio = paste0(size_portfolio, "/", value_portfolio))

tbl.factors <- tbl.portfolios %>%
  group_by(date) %>%
  summarize(smb = mean(ret_vw[portfolio %in% c("S/H", "S/M", "S/L")]) - 
              mean(ret_vw[portfolio %in% c("B/H", "B/M", "B/L")]),
            hml = mean(ret_vw[portfolio %in% c("S/H", "B/H")]) - 
              mean(ret_vw[portfolio %in% c("S/L", "B/L")]))

Next, I add the mkt factor as the monthly value-weighted average return across all stocks. I drop the last month in my sample as it exhibits a return close to -100% for whatever reason.

tbl.factors <- tbl.factors %>%
  left_join(tbl.stocks %>%
              group_by(date) %>%
              summarize(mkt = weighted.mean(ret, mktcap_weight)), by = "date") %>%
  select(date, mkt, smb, hml) %>% # rearrange columns
  filter(date < "2019-12-01") # something weird is going on with returns in the last month

Finally, I also add the corresponding factors from Ken French’s website.

tbl.factors_ff <- read_csv("raw/F-F_Research_Data_Factors.csv", skip = 3)

tbl.factors_ff <- tbl.factors_ff %>%
  transmute(date = ymd(paste0(X1, "01")),
            rf_ff = as.numeric(RF),
            mkt_ff = as.numeric(`Mkt-RF`) + as.numeric(RF),
            smb_ff = as.numeric(SMB),
            hml_ff = as.numeric(HML)) %>%
  filter(date <= max(tbl.crsp$date)) %>%
  mutate(date = floor_date(date, "month")) %>%
  select(-rf_ff)

tbl.factors <- tbl.factors %>%
  mutate(date = floor_date(date, "month")) %>%
  left_join(tbl.factors_ff, by = "date")
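
A quick check (a sketch, assuming the chunks above ran through) verifies that the join did not introduce any missing months:

# Count months and unmatched observations after the join
tbl.factors %>%
  summarize(months = n(), unmatched = sum(is.na(mkt_ff)))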

Replication Results

In the last section, I analyze how well I am able to replicate the original Fama-French factors. First, let us look at summary statistics across all 678 months.

tab.summary <- tbl.factors %>%
  select(-date) %>%
  pivot_longer(cols = everything(), names_to = "measure") %>%
  group_by(measure) %>%
  summarize(mean = mean(value),
            sd = sd(value),
            skew = skewness(value),
            kurt = kurtosis(value),
            min = min(value),
            q05 = quantile(value, 0.05),
            q25 = quantile(value, 0.25),
            q50 = quantile(value, 0.50),
            q75 = quantile(value, 0.75),
            q95 = quantile(value, 0.95),
            max = max(value),
            n = n()) %>%
  ungroup() %>%
  kable(digits = 2) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
tab.summary
measure   mean    sd   skew  kurt     min    q05    q25   q50   q75   q95    max    n
hml       0.25  2.84  -0.09  6.54  -15.10  -3.88  -1.29  0.21  1.65  4.78  14.94  678
hml_ff    0.31  2.81   0.19  4.90  -11.06  -4.02  -1.31  0.27  1.71  5.18  12.60  678
mkt       0.99  4.39  -0.41  4.98  -21.86  -6.52  -1.38  1.22  3.76  7.58  17.10  678
mkt_ff    0.91  4.37  -0.52  4.97  -22.64  -6.68  -1.65  1.25  3.74  7.43  16.61  678
smb       0.23  2.93   0.52  5.76  -12.23  -4.06  -1.55  0.10  1.98  4.66  16.40  678
smb_ff    0.18  3.04   0.44  7.88  -16.72  -4.22  -1.58  0.11  2.00  4.89  21.13  678

The distributions of my factors and the original ones seem to be quite similar for all three factors, although the match is far from perfect. Kolmogorov-Smirnov tests do not reject the null hypothesis that the replicated and original factors follow the same distribution.

# Kolmogorov-Smirnov test
ks.test(tbl.factors$mkt, tbl.factors$mkt_ff)
## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  tbl.factors$mkt and tbl.factors$mkt_ff
## D = 0.022124, p-value = 0.9964
## alternative hypothesis: two-sided
ks.test(tbl.factors$smb, tbl.factors$smb_ff)
## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  tbl.factors$smb and tbl.factors$smb_ff
## D = 0.020649, p-value = 0.9987
## alternative hypothesis: two-sided
ks.test(tbl.factors$hml, tbl.factors$hml_ff)
## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  tbl.factors$hml and tbl.factors$hml_ff
## D = 0.025074, p-value = 0.9834
## alternative hypothesis: two-sided

I am even more interested in the correlations of the factors, which I provide in the next table.

fun.compute_correlations <- function(data, ..., upper = TRUE) {
  cor_matrix <- data %>%
    select(...) %>%
    cor(use = "complete.obs", method = "pearson") 
  if (upper) {
    cor_matrix[upper.tri(cor_matrix, diag = FALSE)] <- NA
  }
  return(cor_matrix)
}
tab.correlations <- fun.compute_correlations(tbl.factors, mkt, mkt_ff, smb, smb_ff, hml, hml_ff) %>% 
  kable(digits = 2) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
tab.correlations
         mkt  mkt_ff   smb  smb_ff   hml  hml_ff
mkt     1.00
mkt_ff  0.99    1.00
smb     0.27    0.31  1.00
smb_ff  0.25    0.30  0.97    1.00
hml    -0.23   -0.25 -0.14   -0.19  1.00
hml_ff -0.22   -0.25 -0.12   -0.19  0.95    1.00

The mkt factor exhibits a correlation of 99% with its original counterpart, which is a pretty good fit. The smb factor also shows a satisfying fit with a correlation of 97%. The hml factor only reaches a correlation of 95%, which might be driven by differences in the sample construction procedure that I am not yet aware of, or by ex-post changes in the CRSP or Compustat data that my sample captures but the original Fama-French factors do not.
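
Beyond correlations, one could regress each replicated factor on its original counterpart; a slope close to one and an intercept close to zero would indicate a clean replication. A minimal sketch for the hml factor:

# Regress the replicated hml factor on its original counterpart
lm(hml ~ hml_ff, data = tbl.factors) %>%
  summary()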

As a last step, I present the time series of cumulative returns over the full sample period.

fig.factors <- tbl.factors %>%
  pivot_longer(-date, names_to = "portfolio", values_to = "return") %>%
  group_by(portfolio) %>%
  arrange(date) %>%
  mutate(cum_return = cumsum(log(1+return/100))) %>%
  ungroup() %>%
  mutate(portfolio = str_to_upper(portfolio),
         portfolio = str_replace(portfolio, "_FF", " (Fama-French)")) %>%
  ggplot(aes(x = date, y = cum_return, color = portfolio)) +
  geom_line() +
  labs(x = NULL, y = NULL, color = NULL,
       title = "Cumulative log returns of original and replicated Fama-French factors") + 
  scale_x_date(expand = c(0, 0), date_breaks = "10 years", date_labels = "%Y") +
  scale_y_continuous(labels = percent, breaks = pretty_breaks()) + 
  theme_classic() +
  scale_color_viridis_d()

There seems to be considerable divergence between my factors and the original data over time, which is less severe for the smb factor. Please feel free to share any suggestions in the comments below on how to improve the fit. Overall, I am nonetheless somewhat surprised by how quickly one can set up a decent replication of the three famous Fama-French factors.
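
To locate where the divergence accumulates, one possible diagnostic (a sketch, not part of the replication above) is the mean monthly difference between replicated and original factors by decade:

# Mean monthly differences between replicated and original factors by decade
tbl.factors %>%
  mutate(decade = 10 * (year(date) %/% 10)) %>%
  group_by(decade) %>%
  summarize(mkt_diff = mean(mkt - mkt_ff),
            smb_diff = mean(smb - smb_ff),
            hml_diff = mean(hml - hml_ff))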
