Tidy Asset Pricing - Part I: The CRSP Sample in R

Update 2022-02-11: check out our new book Tidy Finance with R for an updated and improved implementation of the content below.

As a newbie to the empirical asset pricing literature, I found it quite hard to fully grasp how results come about solely relying on the data descriptions in the corresponding papers. Fortunately, Bali, Engle and Murray (BEM) provide an excellent textbook that not only provides a comprehensive overview of the empirical research on the cross-section of expected returns, but also offers detailed description of common practices in sample construction and methodology. In this series of notes, I replicate the main results of their textbook in R. Moreover, I try to implement all analyses in a tidy manner using mainly packages beloning to the tidyverse family since it’s fun and fairly transparent.

In this first note I start out with the raw monthly CRSP data, compute some descriptive statistics and construct the main sample I use in later analyses. In the following notes, I use the constructed sample to investigate the size effect. As a start, here is the list of packages that I use throughout the note.

library(tidyverse) # for grammar
library(scales)    # for nicer axes of figures
library(viridis)   # for nicer colors in figures
library(vroom)     # for fast csv reading
library(lubridate) # for working with dates

I just downloaded the full monthly CRSP sample from WRDS. The full monthly CRSP sample from December 1925 until December 2019 has about 1.2GB. After reading in the data using the incredibly fast vroom package, I convert all column names to lower case out of a personal preference (alternatively, I could also use janitor::clean_names()). Note that I only read in the required columns and that I enforce the character format because I prefer to parse the column types correctly as a next step.

tbl.crsp_msf <- vroom("raw/crsp_msf.csv", 
                       col_types = cols(.default = col_character()),
                       col_select = c(PERMNO, date, RET, SHROUT, ALTPRC, 
                                      EXCHCD, SHRCD, SICCD, DLRET, DLSTCD))
colnames(tbl.crsp_msf) <- str_to_lower(colnames(tbl.crsp_msf))

Next, I make sure that all relevant variables are correctly parsed.

tbl.crsp_monthly <- tbl.crsp_msf %>%
  transmute(permno = as.integer(permno),       # security identifier
            date = ymd(date),                  # date of the observation
            month = floor_date(date, "month"), # month identifier
            ret = as.numeric(ret) * 100,       # return (in percent)
            shrout = as.numeric(shrout),       # shares outstanding (in thousands)
            altprc = as.numeric(altprc),       # last traded price in a month
            exchcd = as.integer(exchcd),       # exchange code
            shrcd = as.integer(shrcd),         # share code
            siccd = as.integer(siccd),         # industry code
            dlret = as.numeric(dlret) * 100,   # delisting return (in percent)
            dlstcd = as.integer(dlstcd)        # delisting code
  ) 
rm(tbl.crsp_msf)

As common in the literature, I focus on US-based common stocks in the CRSP database. US-based common stocks are identified with share codes 10 and 11, where the first digit pins down ordinary common shares and the second digit indicates that the secudity has not been further defined (0) or does not need to be further defined (1). I refer to the original CRSP documentation on a full description of the meaning of this distinction.

tbl.crsp_monthly <- tbl.crsp_monthly %>%
  filter(shrcd %in% c(10, 11)) %>%
  select(-shrcd)

I apply the following two checks to most data sets I work with. I make sure to keep only distinct observations and that there is only one observation by date for each security (namely, the first row for any permno-date combination). The main reason is that I want to work with a nice panel of unique security-date combinations where no pair appears more than once in the data.

tbl.crsp_monthly <- tbl.crsp_monthly %>%
  distinct(permno, date, .keep_all = TRUE) # remove duplicates 

Following BEM, I compute the market capitalization as the absolute value of the product of shares outstanding and the price of the stock as of the end of the last trading day of a given month. I have to take the absolute value since altprc is the negative of average of bid and ask from last traded price for which the data is available if there is no last traded price. Since the shares outstanding are reported in thousands of shares, I divide by 1000 such that the resulting measure is in millions of dollars.

tbl.crsp_monthly <- tbl.crsp_monthly %>%
  mutate(mktcap = abs(shrout * altprc) / 1000, # in millions of dollars
         mktcap = if_else(mktcap == 0, as.numeric(NA), mktcap)) 

As a first glance at the data, I look at the stock exchange composition of the CRSP sample. Stocks listed on NYSDE, AMEX, and NASDAQ are indicated with values of 1 or 31, 2 or 32, and 3 or 33, respectively, in the exchange code filed. I collect stocks traded on all other exchange (e.g., Arca Stock Market, Boston Stock Exchange, Chicago Stock Exchange, etc.) in a separate category.

tbl.crsp_monthly <- tbl.crsp_monthly %>%
  mutate(exchange = case_when(exchcd %in% c(1, 31) ~ "NYSE",
                              exchcd %in% c(2, 32) ~ "AMEX",
                              exchcd %in% c(3, 33) ~ "NASDAQ",
                              TRUE ~ "Other"))

The figure below plots the number of stocks per exchange. NYSE has the longest history in the data, but NASDAQ exhibits a considerable large number of stocks. Interestingly, the number of stocks on AMEX is decreasing steadily over the last couple of decades. By the end of 2019, there are 2,227 stocks on NASDAQ, 1,274 on NYSE, 167 on AMEX and 3 belonging to the other category.

fig.securities <- tbl.crsp_monthly %>%
  count(exchange, date) %>%
  ggplot(aes(x = date, y = n, color = exchange, linetype = exchange)) +
  geom_line() +
  labs(x = NULL, y = NULL, color = NULL, linetype = NULL,
       title = "Monthly number of securities by exchange") + 
  scale_x_date(expand = c(0, 0), date_breaks = "10 years", date_labels = "%Y") + 
  scale_y_continuous(labels = comma, breaks = pretty_breaks()) +
  theme_classic() +
  scale_color_viridis_d()

Next, I look at the market capitalization per exchange. To do so, I adjust the market capitalization values for inflation using the using the consumer price index from the US Bureau of Labor Statistics as provided by the FRED database. All values are in end of 2019 dollars for intertemporal comparability. NYSE has by far the largest market capitalization, followed by NASDAQ. At the end of 2019, total market value on NYSE was 19,123 billion and 12,536 billion on NASDAQ. AMEX only plays a minor role by now with a total market capitalization of 42 billion at the end of 2019, while other exchanges only make up 13 billion.

tbl.cpi_raw <- read_csv("raw/CPIAUCNS.csv")

tbl.cpi <- tbl.cpi_raw %>%
  filter(DATE <= max(tbl.crsp_monthly$date)) %>%
  transmute(month = floor_date(DATE, "month"),
            cpi = CPIAUCNS) %>% 
  arrange(month) %>% 
  mutate(cpi = cpi / last(cpi))

tbl.crsp_monthly <- tbl.crsp_monthly %>%
  left_join(tbl.cpi, by = c("month")) %>% 
  mutate(mktcap_cpi = mktcap / cpi)

fig.mktcap <- tbl.crsp_monthly %>%
  group_by(exchange, date) %>%
  summarize(mktcap = sum(mktcap_cpi, na.rm = TRUE) / 1000) %>%
  ungroup() %>%
  ggplot(aes(x = date, y = mktcap, color = exchange, linetype = exchange)) +
  geom_line() +
  labs(x = NULL, y = NULL, color = NULL, linetype = NULL,
       title = "Monthly total market value (billions of Dec 2019 Dollars) by exchange") + 
  scale_x_date(expand = c(0, 0), date_breaks = "10 years", date_labels = "%Y") +
  scale_y_continuous(labels = comma, breaks = pretty_breaks()) + 
  theme_classic() +
  scale_color_viridis_d()

Now, I turn to the industry composition of the CRSP sample. I follow the below convention on defining industries using the standard industry classification (SIC) codes.

tbl.crsp_monthly <- tbl.crsp_monthly %>%
  mutate(industry = case_when(siccd >= 1 & siccd <= 999 ~ "Agriculture",
                              siccd >= 1000 & siccd <= 1499 ~ "Mining",
                              siccd >= 1500 & siccd <= 1799 ~ "Construction",
                              siccd >= 2000 & siccd <= 3999 ~ "Manufacturing",
                              siccd >= 4000 & siccd <= 4999 ~ "Transportation",
                              siccd >= 5000 & siccd <= 5199 ~ "Wholesale",
                              siccd >= 5200 & siccd <= 5999 ~ "Retail",
                              siccd >= 6000 & siccd <= 6799 ~ "Finance",
                              siccd >= 7000 & siccd <= 8999 ~ "Services",
                              siccd >= 9000 & siccd <= 9999 ~ "Public",
                              TRUE ~ "Missing"))

The figure below plots the number of stocks in the sample in each of the SIC industries. Most of the stocks are apparently in Manufacturing albeit the number peaked somewhere in the 90ies. The number of public administration stocks seems to the the only category on the rise in recent years.

fig.industries <- tbl.crsp_monthly %>%
  filter(!is.na(industry)) %>%
  count(industry, date) %>%
  ggplot(aes(x = date, y = n, color = industry, linetype = industry)) +
  geom_line() +
  labs(x = NULL, y = NULL, color = NULL, linetype = NULL,
       title = "Monthly number of securities by industry") + 
  scale_x_date(expand = c(0, 0), date_breaks = "10 years", date_labels = "%Y") + 
  scale_y_continuous(labels = comma, breaks = pretty_breaks()) + 
  theme_classic() +
  scale_color_viridis_d()

I repeat the exercise by computing the market value of all stocks belonging to the respective industries. All values are again in terms of billions of end of 2019 dollars. At all points in time, manufacturing firms comprise of the largest portion of market capitalization. Towards the end of the sample financial firms and services begin to make up a substantial portion of the market value.

fig.industries_mktcap <- tbl.crsp_monthly %>%
  filter(!is.na(industry)) %>%
  group_by(industry, date) %>%
  summarize(mktcap = sum(mktcap_cpi, na.rm = TRUE) / 1000) %>%
  ungroup() %>%
  ggplot(aes(x = date, y = mktcap, color = industry, linetype = industry)) +
  geom_line() +
  labs(x = NULL, y = NULL, color = NULL, linetype = NULL,
       title = "Monthly total market value (billions of Dec 2019 Dollars) by industry") + 
  scale_x_date(expand = c(0, 0), date_breaks = "10 years", date_labels = "%Y") +
  scale_y_continuous(labels = comma, breaks = pretty_breaks()) + 
  theme_classic() +
  scale_color_viridis_d()

Next, I turn to the calculation of returns. Monthly stock returns are in the ret field (i.e., the return realized by holding the stock from its last trade in the previous month to its last trade in the current month). However, I have to adjust for delistings which are fortunately recorded by CRSP. The adjustment procedure below follows Shumway (1997). Apparently, the adjustment kicks in on both tails of the return distribution as I can see from the summary statistics.

tbl.crsp_monthly <- tbl.crsp_monthly %>%
  mutate(ret_adj = case_when(is.na(dlstcd) ~ ret,
                              !is.na(dlstcd) & !is.na(dlret) ~ dlret,
                              dlstcd %in% c(500, 520, 580, 584) | 
                                               (dlstcd >= 551 & dlstcd <= 574) ~ -30,
                              TRUE ~ -100)) %>%
  select(-c(dlret, dlstcd))

summary(select(tbl.crsp_monthly, ret, ret_adj)) 
##       ret             ret_adj       
##  Min.   : -99.36   Min.   :-100.00  
##  1st Qu.:  -6.40   1st Qu.:  -6.45  
##  Median :   0.00   Median :   0.00  
##  Mean   :   1.18   Mean   :   1.02  
##  3rd Qu.:   6.99   3rd Qu.:   6.95  
##  Max.   :2400.00   Max.   :6366.67  
##  NA's   :123119    NA's   :104382

As a main reference point for each stock return, I want to consider the return on the market. According to the Capital Asset Pricing Model (CAPM), cross-sectional variation in expected asset returns should be a function of the covariance between the return of the asset and the return on the market portfolio. The value-weighted portfolio of all US-based common stocks in the CRSP database is one of the main proxies used in the empirical asset pricing literature which I also get from WRDS.

tbl.crsp_si <- read_csv("raw/crsp_si.csv")
colnames(tbl.crsp_si) <- str_to_lower(colnames(tbl.crsp_si))

tbl.crsp_monthly_market <- tbl.crsp_si %>%
  transmute(month = floor_date(ymd(date), "month"),
            vwretd = vwretd * 100,
            ewretd = ewretd * 100)

tbl.crsp_monthly <- tbl.crsp_monthly %>%
  left_join(tbl.crsp_monthly_market, by = "month")

In most cases, it makes sense to use stock returns in excess of the risk-free security over the same period. I take the monthly risk-free rate from Ken French’s data library and also add the other Fama-French factors while I am already at it. In particular, I also add another measure for the return on the market portfolio provided by Fama and French. The main difference to the CRSP measure lies in the fact that Fama and French exclude firms that are not based in the US, closed-end funds and other securities that are not common stocks. I discuss the other factors in later notes where I try to replicate them.

tbl.factors_ff_monthly <- read_csv("raw/F-F_Research_Data_Factors.csv", skip = 3) %>%
  transmute(month = floor_date(ymd(paste0(X1, "01")), "month"),
            rf_ff = as.numeric(RF),
            mkt_ff = as.numeric(`Mkt-RF`) + as.numeric(RF),
            smb_ff = as.numeric(SMB),
            hml_ff = as.numeric(HML))

tbl.crsp_monthly <- tbl.crsp_monthly %>%
  left_join(tbl.factors_ff_monthly, by = "month") 

As a final descriptive statistic, I plot the cumulative log returns of our market measures (i.e., equal-weighted, value-weighted and the Fama-French market factor). Clearly, the value-weighted CRSP index and the Fama-French market factor are highly correlated, while the equal-weighted CRSP index stands out due its high returns apparently driven by smaller stocks.

tbl.market <- tbl.crsp_monthly %>%
  distinct(date, vwretd, ewretd, mkt_ff) %>%
  pivot_longer(-date, names_to = "portfolio", values_to = "return") %>%
  na.omit() %>%
  group_by(portfolio) %>%
  arrange(date) %>%
  mutate(cum_return = cumsum(log(1+return/100))) %>%
  ungroup() 

fig.market <- tbl.market %>%
  mutate(portfolio = case_when(portfolio == "vwretd" ~ "CRSP (Value-Weighted)",
                               portfolio == "ewretd" ~ "CRSP (Equal-Weighted)",
                               portfolio == "mkt_ff" ~ "MKT (Fama-French)")) %>%
  ggplot(aes(x = date, y = cum_return, color = portfolio, linetype = portfolio)) +
  geom_line() +
  labs(x = NULL, y = NULL, color = NULL, linetype = NULL,
       title = "Cumulative log return for different market factors") + 
  scale_x_date(expand = c(0, 0), date_breaks = "10 years", date_labels = "%Y") +
  scale_y_continuous(labels = percent, breaks = pretty_breaks()) + 
  theme_classic() +
  scale_color_viridis_d()

The focus of empirical asset pricing is to examine the ability of different stock characteristics to predict the cross section of future stock returns. This is why I add the (excess) returns for the subsequent month of each stock to the sample. Note that I use beginning of month dates to ensure correct matching of dates (you may in principle also use consistent end-of-month dates, but subtracting months from the first day of month is easier).

tbl.crsp_monthly_f1 <- tbl.crsp_monthly %>%
  transmute(permno = permno,
            month = floor_date(date, "month") %m-% months(1),
            ret_excess_f1 = ret - rf_ff,
            ret_adj_excess_f1 = ret_adj - rf_ff,
            mkt_ff_excess_f1 = mkt_ff - rf_ff)

tbl.crsp_monthly <- tbl.crsp_monthly %>%
  left_join(tbl.crsp_monthly_f1, by = c("permno", "month"))

To wrap up this data set, I arrange everything by security and date before saving the sample and provide some summary statistics of all the variables that were used above. I end up with a monthly CRSP file that has about 700 MB.

tbl.crsp_monthly <- tbl.crsp_monthly %>%
  arrange(permno, date)

write_rds(tbl.crsp_monthly, "data/crsp_monthly.rds")

summary(tbl.crsp_monthly)
##      permno           date                month                 ret         
##  Min.   :10000   Min.   :1925-12-31   Min.   :1925-12-01   Min.   : -99.36  
##  1st Qu.:21960   1st Qu.:1977-04-29   1st Qu.:1977-04-01   1st Qu.:  -6.40  
##  Median :49373   Median :1990-12-31   Median :1990-12-01   Median :   0.00  
##  Mean   :50206   Mean   :1988-03-18   Mean   :1988-02-18   Mean   :   1.18  
##  3rd Qu.:78749   3rd Qu.:2002-04-30   3rd Qu.:2002-04-01   3rd Qu.:   6.99  
##  Max.   :93436   Max.   :2019-12-31   Max.   :2019-12-01   Max.   :2400.00  
##                                                            NA's   :123119   
##      shrout             altprc             exchcd           siccd     
##  Min.   :       0   Min.   : -1832.5   Min.   :-2.000   Min.   :   0  
##  1st Qu.:    2175   1st Qu.:     1.6   1st Qu.: 1.000   1st Qu.:3079  
##  Median :    6774   Median :    10.8   Median : 2.000   Median :3842  
##  Mean   :   40744   Mean   :    27.0   Mean   : 2.107   Mean   :4511  
##  3rd Qu.:   23212   3rd Qu.:    25.2   3rd Qu.: 3.000   3rd Qu.:6030  
##  Max.   :29206400   Max.   :339590.0   Max.   :33.000   Max.   :9999  
##  NA's   :3743       NA's   :72367                       NA's   :2411  
##      mktcap            exchange              cpi            mktcap_cpi       
##  Min.   :      0.0   Length:3645522     Min.   :0.04903   Min.   :      0.0  
##  1st Qu.:     15.9   Class :character   1st Qu.:0.23349   1st Qu.:     47.6  
##  Median :     66.8   Mode  :character   Median :0.52068   Median :    180.9  
##  Mean   :   1490.3                      Mean   :0.49551   Mean   :   2233.2  
##  3rd Qu.:    357.8                      3rd Qu.:0.69968   3rd Qu.:    812.0  
##  Max.   :1304764.7                      Max.   :1.00145   Max.   :1304764.7  
##  NA's   :72504                                            NA's   :72504      
##    industry            ret_adj            vwretd             ewretd       
##  Length:3645522     Min.   :-100.00   Min.   :-29.1731   Min.   :-31.275  
##  Class :character   1st Qu.:  -6.45   1st Qu.: -1.7553   1st Qu.: -1.987  
##  Mode  :character   Median :   0.00   Median :  1.2963   Median :  1.402  
##                     Mean   :   1.02   Mean   :  0.9308   Mean   :  1.123  
##                     3rd Qu.:   6.95   3rd Qu.:  3.8775   3rd Qu.:  4.231  
##                     Max.   :6366.67   Max.   : 39.4143   Max.   : 66.594  
##                     NA's   :104382    NA's   :497        NA's   :497      
##      rf_ff           mkt_ff            smb_ff            hml_ff       
##  Min.   :-0.06   Min.   :-29.100   Min.   :-16.720   Min.   :-13.280  
##  1st Qu.: 0.15   1st Qu.: -1.750   1st Qu.: -1.690   1st Qu.: -1.320  
##  Median : 0.39   Median :  1.290   Median :  0.080   Median :  0.270  
##  Mean   : 0.37   Mean   :  0.949   Mean   :  0.128   Mean   :  0.366  
##  3rd Qu.: 0.51   3rd Qu.:  3.920   3rd Qu.:  1.860   3rd Qu.:  1.780  
##  Max.   : 1.35   Max.   : 38.950   Max.   : 36.700   Max.   : 35.460  
##  NA's   :3627    NA's   :3627      NA's   :3627      NA's   :3627     
##  ret_excess_f1     ret_adj_excess_f1 mkt_ff_excess_f1 
##  Min.   : -99.52   Min.   :-101.31   Min.   :-29.130  
##  1st Qu.:  -6.77   1st Qu.:  -6.82   1st Qu.: -2.020  
##  Median :  -0.36   Median :  -0.37   Median :  1.020  
##  Mean   :   0.81   Mean   :   0.65   Mean   :  0.578  
##  3rd Qu.:   6.65   3rd Qu.:   6.62   3rd Qu.:  3.520  
##  Max.   :2399.66   Max.   :6365.91   Max.   : 38.850  
##  NA's   :126291    NA's   :107577    NA's   :29162

Finally, here is the code chunk for setting up the daily return data from CRSP. Note that the raw daily CRSP sample has about 24GB, whereas the final sample is about 3 GB. I just use a scientific cluster with plenty of memory to run this part.

# Read in data
tbl.crsp_dsf <- vroom("raw/crsp_dsf.csv", 
                       col_types = cols(.default = col_character()),
                       col_select = c(PERMNO, date, RET, SHRCD))
colnames(tbl.crsp_dsf) <- str_to_lower(colnames(tbl.crsp_dsf))

# Keep only common stocks and relevant columns
tbl.crsp_daily <- tbl.crsp_dsf %>%
  mutate(shrcd = as.integer(shrcd)) %>% 
  filter(shrcd %in% c(10, 11)) %>%
  transmute(permno = as.integer(permno),     
            date = ymd(date),               
            ret = as.numeric(ret) * 100    
  )
rm(tbl.crsp_dsf)

# Keep only distinct observations
tbl.crsp_daily <- tbl.crsp_daily %>%
  distinct(permno, date, .keep_all = TRUE)

# Add fama french factors (https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html)
tbl.factors_ff_daily <- read_csv("raw/F-F_Research_Data_Factors_daily.csv", skip = 3) %>%
  transmute(date = ymd(X1),
            rf_ff = as.numeric(RF),
            mkt_ff = as.numeric(`Mkt-RF`) + as.numeric(RF)) %>%
  filter(date <= max(tbl.crsp_daily$date))

tbl.crsp_daily <- tbl.crsp_daily %>%
  left_join(tbl.factors_ff_daily, by = "date")

# Drop rows with missing values
tbl.crsp_daily <- tbl.crsp_daily %>%
  na.omit()

# Set reference month
tbl.crsp_daily <- tbl.crsp_daily %>%
  mutate(month = floor_date(date, "month")) %>% 
  select(permno, date, month, ret, rf_ff, mkt_ff)

# Save sample
write_rds(tbl.crsp_daily, "data/crsp_daily.rds")
Christoph Scheuch
Christoph Scheuch
Director of Product

My interests include Product Management, Data Science, R and FinTech-related stuff. matter.

Related