# Tidy Asset Pricing - Part I: The CRSP Sample in R

Update 2022-02-11: check out our new book Tidy Finance with R for an updated and improved implementation of the content below.

As a newbie to the empirical asset pricing literature, I found it quite hard to fully grasp how results come about solely relying on the data descriptions in the corresponding papers. Fortunately, Bali, Engle and Murray (BEM) provide an excellent textbook that not only provides a comprehensive overview of the empirical research on the cross-section of expected returns, but also offers detailed description of common practices in sample construction and methodology. In this series of notes, I replicate the main results of their textbook in R. Moreover, I try to implement all analyses in a tidy manner using mainly packages beloning to the tidyverse family since it’s fun and fairly transparent.

In this first note I start out with the raw monthly CRSP data, compute some descriptive statistics and construct the main sample I use in later analyses. In the following notes, I use the constructed sample to investigate the size effect. As a start, here is the list of packages that I use throughout the note.

library(tidyverse) # for grammar
library(scales)    # for nicer axes of figures
library(viridis)   # for nicer colors in figures
library(vroom)     # for fast csv reading
library(lubridate) # for working with dates

I just downloaded the full monthly CRSP sample from WRDS. The full monthly CRSP sample from December 1925 until December 2019 has about 1.2GB. After reading in the data using the incredibly fast vroom package, I convert all column names to lower case out of a personal preference (alternatively, I could also use janitor::clean_names()). Note that I only read in the required columns and that I enforce the character format because I prefer to parse the column types correctly as a next step.

tbl.crsp_msf <- vroom("raw/crsp_msf.csv",
col_types = cols(.default = col_character()),
col_select = c(PERMNO, date, RET, SHROUT, ALTPRC,
EXCHCD, SHRCD, SICCD, DLRET, DLSTCD))
colnames(tbl.crsp_msf) <- str_to_lower(colnames(tbl.crsp_msf))

Next, I make sure that all relevant variables are correctly parsed.

tbl.crsp_monthly <- tbl.crsp_msf %>%
transmute(permno = as.integer(permno),       # security identifier
date = ymd(date),                  # date of the observation
month = floor_date(date, "month"), # month identifier
ret = as.numeric(ret) * 100,       # return (in percent)
shrout = as.numeric(shrout),       # shares outstanding (in thousands)
altprc = as.numeric(altprc),       # last traded price in a month
exchcd = as.integer(exchcd),       # exchange code
shrcd = as.integer(shrcd),         # share code
siccd = as.integer(siccd),         # industry code
dlret = as.numeric(dlret) * 100,   # delisting return (in percent)
dlstcd = as.integer(dlstcd)        # delisting code
)
rm(tbl.crsp_msf)

As common in the literature, I focus on US-based common stocks in the CRSP database. US-based common stocks are identified with share codes 10 and 11, where the first digit pins down ordinary common shares and the second digit indicates that the secudity has not been further defined (0) or does not need to be further defined (1). I refer to the original CRSP documentation on a full description of the meaning of this distinction.

tbl.crsp_monthly <- tbl.crsp_monthly %>%
filter(shrcd %in% c(10, 11)) %>%
select(-shrcd)

I apply the following two checks to most data sets I work with. I make sure to keep only distinct observations and that there is only one observation by date for each security (namely, the first row for any permno-date combination). The main reason is that I want to work with a nice panel of unique security-date combinations where no pair appears more than once in the data.

tbl.crsp_monthly <- tbl.crsp_monthly %>%
distinct(permno, date, .keep_all = TRUE) # remove duplicates 

Following BEM, I compute the market capitalization as the absolute value of the product of shares outstanding and the price of the stock as of the end of the last trading day of a given month. I have to take the absolute value since altprc is the negative of average of bid and ask from last traded price for which the data is available if there is no last traded price. Since the shares outstanding are reported in thousands of shares, I divide by 1000 such that the resulting measure is in millions of dollars.

tbl.crsp_monthly <- tbl.crsp_monthly %>%
mutate(mktcap = abs(shrout * altprc) / 1000, # in millions of dollars
mktcap = if_else(mktcap == 0, as.numeric(NA), mktcap)) 

As a first glance at the data, I look at the stock exchange composition of the CRSP sample. Stocks listed on NYSDE, AMEX, and NASDAQ are indicated with values of 1 or 31, 2 or 32, and 3 or 33, respectively, in the exchange code filed. I collect stocks traded on all other exchange (e.g., Arca Stock Market, Boston Stock Exchange, Chicago Stock Exchange, etc.) in a separate category.

tbl.crsp_monthly <- tbl.crsp_monthly %>%
mutate(exchange = case_when(exchcd %in% c(1, 31) ~ "NYSE",
exchcd %in% c(2, 32) ~ "AMEX",
exchcd %in% c(3, 33) ~ "NASDAQ",
TRUE ~ "Other"))

The figure below plots the number of stocks per exchange. NYSE has the longest history in the data, but NASDAQ exhibits a considerable large number of stocks. Interestingly, the number of stocks on AMEX is decreasing steadily over the last couple of decades. By the end of 2019, there are 2,227 stocks on NASDAQ, 1,274 on NYSE, 167 on AMEX and 3 belonging to the other category.

fig.securities <- tbl.crsp_monthly %>%
count(exchange, date) %>%
ggplot(aes(x = date, y = n, color = exchange, linetype = exchange)) +
geom_line() +
labs(x = NULL, y = NULL, color = NULL, linetype = NULL,
title = "Monthly number of securities by exchange") +
scale_x_date(expand = c(0, 0), date_breaks = "10 years", date_labels = "%Y") +
scale_y_continuous(labels = comma, breaks = pretty_breaks()) +
theme_classic() +
scale_color_viridis_d()

Next, I look at the market capitalization per exchange. To do so, I adjust the market capitalization values for inflation using the using the consumer price index from the US Bureau of Labor Statistics as provided by the FRED database. All values are in end of 2019 dollars for intertemporal comparability. NYSE has by far the largest market capitalization, followed by NASDAQ. At the end of 2019, total market value on NYSE was 19,123 billion and 12,536 billion on NASDAQ. AMEX only plays a minor role by now with a total market capitalization of 42 billion at the end of 2019, while other exchanges only make up 13 billion.

tbl.cpi_raw <- read_csv("raw/CPIAUCNS.csv")

tbl.cpi <- tbl.cpi_raw %>%
filter(DATE <= max(tbl.crsp_monthly$date)) %>% transmute(month = floor_date(DATE, "month"), cpi = CPIAUCNS) %>% arrange(month) %>% mutate(cpi = cpi / last(cpi)) tbl.crsp_monthly <- tbl.crsp_monthly %>% left_join(tbl.cpi, by = c("month")) %>% mutate(mktcap_cpi = mktcap / cpi) fig.mktcap <- tbl.crsp_monthly %>% group_by(exchange, date) %>% summarize(mktcap = sum(mktcap_cpi, na.rm = TRUE) / 1000) %>% ungroup() %>% ggplot(aes(x = date, y = mktcap, color = exchange, linetype = exchange)) + geom_line() + labs(x = NULL, y = NULL, color = NULL, linetype = NULL, title = "Monthly total market value (billions of Dec 2019 Dollars) by exchange") + scale_x_date(expand = c(0, 0), date_breaks = "10 years", date_labels = "%Y") + scale_y_continuous(labels = comma, breaks = pretty_breaks()) + theme_classic() + scale_color_viridis_d() Now, I turn to the industry composition of the CRSP sample. I follow the below convention on defining industries using the standard industry classification (SIC) codes. tbl.crsp_monthly <- tbl.crsp_monthly %>% mutate(industry = case_when(siccd >= 1 & siccd <= 999 ~ "Agriculture", siccd >= 1000 & siccd <= 1499 ~ "Mining", siccd >= 1500 & siccd <= 1799 ~ "Construction", siccd >= 2000 & siccd <= 3999 ~ "Manufacturing", siccd >= 4000 & siccd <= 4999 ~ "Transportation", siccd >= 5000 & siccd <= 5199 ~ "Wholesale", siccd >= 5200 & siccd <= 5999 ~ "Retail", siccd >= 6000 & siccd <= 6799 ~ "Finance", siccd >= 7000 & siccd <= 8999 ~ "Services", siccd >= 9000 & siccd <= 9999 ~ "Public", TRUE ~ "Missing")) The figure below plots the number of stocks in the sample in each of the SIC industries. Most of the stocks are apparently in Manufacturing albeit the number peaked somewhere in the 90ies. The number of public administration stocks seems to the the only category on the rise in recent years. fig.industries <- tbl.crsp_monthly %>% filter(!is.na(industry)) %>% count(industry, date) %>% ggplot(aes(x = date, y = n, color = industry, linetype = industry)) + geom_line() + labs(x = NULL, y = NULL, color = NULL, linetype = NULL, title = "Monthly number of securities by industry") + scale_x_date(expand = c(0, 0), date_breaks = "10 years", date_labels = "%Y") + scale_y_continuous(labels = comma, breaks = pretty_breaks()) + theme_classic() + scale_color_viridis_d() I repeat the exercise by computing the market value of all stocks belonging to the respective industries. All values are again in terms of billions of end of 2019 dollars. At all points in time, manufacturing firms comprise of the largest portion of market capitalization. Towards the end of the sample financial firms and services begin to make up a substantial portion of the market value. fig.industries_mktcap <- tbl.crsp_monthly %>% filter(!is.na(industry)) %>% group_by(industry, date) %>% summarize(mktcap = sum(mktcap_cpi, na.rm = TRUE) / 1000) %>% ungroup() %>% ggplot(aes(x = date, y = mktcap, color = industry, linetype = industry)) + geom_line() + labs(x = NULL, y = NULL, color = NULL, linetype = NULL, title = "Monthly total market value (billions of Dec 2019 Dollars) by industry") + scale_x_date(expand = c(0, 0), date_breaks = "10 years", date_labels = "%Y") + scale_y_continuous(labels = comma, breaks = pretty_breaks()) + theme_classic() + scale_color_viridis_d() Next, I turn to the calculation of returns. Monthly stock returns are in the ret field (i.e., the return realized by holding the stock from its last trade in the previous month to its last trade in the current month). However, I have to adjust for delistings which are fortunately recorded by CRSP. The adjustment procedure below follows Shumway (1997). Apparently, the adjustment kicks in on both tails of the return distribution as I can see from the summary statistics. tbl.crsp_monthly <- tbl.crsp_monthly %>% mutate(ret_adj = case_when(is.na(dlstcd) ~ ret, !is.na(dlstcd) & !is.na(dlret) ~ dlret, dlstcd %in% c(500, 520, 580, 584) | (dlstcd >= 551 & dlstcd <= 574) ~ -30, TRUE ~ -100)) %>% select(-c(dlret, dlstcd)) summary(select(tbl.crsp_monthly, ret, ret_adj))  ## ret ret_adj ## Min. : -99.36 Min. :-100.00 ## 1st Qu.: -6.40 1st Qu.: -6.45 ## Median : 0.00 Median : 0.00 ## Mean : 1.18 Mean : 1.02 ## 3rd Qu.: 6.99 3rd Qu.: 6.95 ## Max. :2400.00 Max. :6366.67 ## NA's :123119 NA's :104382 As a main reference point for each stock return, I want to consider the return on the market. According to the Capital Asset Pricing Model (CAPM), cross-sectional variation in expected asset returns should be a function of the covariance between the return of the asset and the return on the market portfolio. The value-weighted portfolio of all US-based common stocks in the CRSP database is one of the main proxies used in the empirical asset pricing literature which I also get from WRDS. tbl.crsp_si <- read_csv("raw/crsp_si.csv") colnames(tbl.crsp_si) <- str_to_lower(colnames(tbl.crsp_si)) tbl.crsp_monthly_market <- tbl.crsp_si %>% transmute(month = floor_date(ymd(date), "month"), vwretd = vwretd * 100, ewretd = ewretd * 100) tbl.crsp_monthly <- tbl.crsp_monthly %>% left_join(tbl.crsp_monthly_market, by = "month") In most cases, it makes sense to use stock returns in excess of the risk-free security over the same period. I take the monthly risk-free rate from Ken French’s data library and also add the other Fama-French factors while I am already at it. In particular, I also add another measure for the return on the market portfolio provided by Fama and French. The main difference to the CRSP measure lies in the fact that Fama and French exclude firms that are not based in the US, closed-end funds and other securities that are not common stocks. I discuss the other factors in later notes where I try to replicate them. tbl.factors_ff_monthly <- read_csv("raw/F-F_Research_Data_Factors.csv", skip = 3) %>% transmute(month = floor_date(ymd(paste0(X1, "01")), "month"), rf_ff = as.numeric(RF), mkt_ff = as.numeric(Mkt-RF) + as.numeric(RF), smb_ff = as.numeric(SMB), hml_ff = as.numeric(HML)) tbl.crsp_monthly <- tbl.crsp_monthly %>% left_join(tbl.factors_ff_monthly, by = "month")  As a final descriptive statistic, I plot the cumulative log returns of our market measures (i.e., equal-weighted, value-weighted and the Fama-French market factor). Clearly, the value-weighted CRSP index and the Fama-French market factor are highly correlated, while the equal-weighted CRSP index stands out due its high returns apparently driven by smaller stocks. tbl.market <- tbl.crsp_monthly %>% distinct(date, vwretd, ewretd, mkt_ff) %>% pivot_longer(-date, names_to = "portfolio", values_to = "return") %>% na.omit() %>% group_by(portfolio) %>% arrange(date) %>% mutate(cum_return = cumsum(log(1+return/100))) %>% ungroup() fig.market <- tbl.market %>% mutate(portfolio = case_when(portfolio == "vwretd" ~ "CRSP (Value-Weighted)", portfolio == "ewretd" ~ "CRSP (Equal-Weighted)", portfolio == "mkt_ff" ~ "MKT (Fama-French)")) %>% ggplot(aes(x = date, y = cum_return, color = portfolio, linetype = portfolio)) + geom_line() + labs(x = NULL, y = NULL, color = NULL, linetype = NULL, title = "Cumulative log return for different market factors") + scale_x_date(expand = c(0, 0), date_breaks = "10 years", date_labels = "%Y") + scale_y_continuous(labels = percent, breaks = pretty_breaks()) + theme_classic() + scale_color_viridis_d() The focus of empirical asset pricing is to examine the ability of different stock characteristics to predict the cross section of future stock returns. This is why I add the (excess) returns for the subsequent month of each stock to the sample. Note that I use beginning of month dates to ensure correct matching of dates (you may in principle also use consistent end-of-month dates, but subtracting months from the first day of month is easier). tbl.crsp_monthly_f1 <- tbl.crsp_monthly %>% transmute(permno = permno, month = floor_date(date, "month") %m-% months(1), ret_excess_f1 = ret - rf_ff, ret_adj_excess_f1 = ret_adj - rf_ff, mkt_ff_excess_f1 = mkt_ff - rf_ff) tbl.crsp_monthly <- tbl.crsp_monthly %>% left_join(tbl.crsp_monthly_f1, by = c("permno", "month")) To wrap up this data set, I arrange everything by security and date before saving the sample and provide some summary statistics of all the variables that were used above. I end up with a monthly CRSP file that has about 700 MB. tbl.crsp_monthly <- tbl.crsp_monthly %>% arrange(permno, date) write_rds(tbl.crsp_monthly, "data/crsp_monthly.rds") summary(tbl.crsp_monthly) ## permno date month ret ## Min. :10000 Min. :1925-12-31 Min. :1925-12-01 Min. : -99.36 ## 1st Qu.:21960 1st Qu.:1977-04-29 1st Qu.:1977-04-01 1st Qu.: -6.40 ## Median :49373 Median :1990-12-31 Median :1990-12-01 Median : 0.00 ## Mean :50206 Mean :1988-03-18 Mean :1988-02-18 Mean : 1.18 ## 3rd Qu.:78749 3rd Qu.:2002-04-30 3rd Qu.:2002-04-01 3rd Qu.: 6.99 ## Max. :93436 Max. :2019-12-31 Max. :2019-12-01 Max. :2400.00 ## NA's :123119 ## shrout altprc exchcd siccd ## Min. : 0 Min. : -1832.5 Min. :-2.000 Min. : 0 ## 1st Qu.: 2175 1st Qu.: 1.6 1st Qu.: 1.000 1st Qu.:3079 ## Median : 6774 Median : 10.8 Median : 2.000 Median :3842 ## Mean : 40744 Mean : 27.0 Mean : 2.107 Mean :4511 ## 3rd Qu.: 23212 3rd Qu.: 25.2 3rd Qu.: 3.000 3rd Qu.:6030 ## Max. :29206400 Max. :339590.0 Max. :33.000 Max. :9999 ## NA's :3743 NA's :72367 NA's :2411 ## mktcap exchange cpi mktcap_cpi ## Min. : 0.0 Length:3645522 Min. :0.04903 Min. : 0.0 ## 1st Qu.: 15.9 Class :character 1st Qu.:0.23349 1st Qu.: 47.6 ## Median : 66.8 Mode :character Median :0.52068 Median : 180.9 ## Mean : 1490.3 Mean :0.49551 Mean : 2233.2 ## 3rd Qu.: 357.8 3rd Qu.:0.69968 3rd Qu.: 812.0 ## Max. :1304764.7 Max. :1.00145 Max. :1304764.7 ## NA's :72504 NA's :72504 ## industry ret_adj vwretd ewretd ## Length:3645522 Min. :-100.00 Min. :-29.1731 Min. :-31.275 ## Class :character 1st Qu.: -6.45 1st Qu.: -1.7553 1st Qu.: -1.987 ## Mode :character Median : 0.00 Median : 1.2963 Median : 1.402 ## Mean : 1.02 Mean : 0.9308 Mean : 1.123 ## 3rd Qu.: 6.95 3rd Qu.: 3.8775 3rd Qu.: 4.231 ## Max. :6366.67 Max. : 39.4143 Max. : 66.594 ## NA's :104382 NA's :497 NA's :497 ## rf_ff mkt_ff smb_ff hml_ff ## Min. :-0.06 Min. :-29.100 Min. :-16.720 Min. :-13.280 ## 1st Qu.: 0.15 1st Qu.: -1.750 1st Qu.: -1.690 1st Qu.: -1.320 ## Median : 0.39 Median : 1.290 Median : 0.080 Median : 0.270 ## Mean : 0.37 Mean : 0.949 Mean : 0.128 Mean : 0.366 ## 3rd Qu.: 0.51 3rd Qu.: 3.920 3rd Qu.: 1.860 3rd Qu.: 1.780 ## Max. : 1.35 Max. : 38.950 Max. : 36.700 Max. : 35.460 ## NA's :3627 NA's :3627 NA's :3627 NA's :3627 ## ret_excess_f1 ret_adj_excess_f1 mkt_ff_excess_f1 ## Min. : -99.52 Min. :-101.31 Min. :-29.130 ## 1st Qu.: -6.77 1st Qu.: -6.82 1st Qu.: -2.020 ## Median : -0.36 Median : -0.37 Median : 1.020 ## Mean : 0.81 Mean : 0.65 Mean : 0.578 ## 3rd Qu.: 6.65 3rd Qu.: 6.62 3rd Qu.: 3.520 ## Max. :2399.66 Max. :6365.91 Max. : 38.850 ## NA's :126291 NA's :107577 NA's :29162 Finally, here is the code chunk for setting up the daily return data from CRSP. Note that the raw daily CRSP sample has about 24GB, whereas the final sample is about 3 GB. I just use a scientific cluster with plenty of memory to run this part. # Read in data tbl.crsp_dsf <- vroom("raw/crsp_dsf.csv", col_types = cols(.default = col_character()), col_select = c(PERMNO, date, RET, SHRCD)) colnames(tbl.crsp_dsf) <- str_to_lower(colnames(tbl.crsp_dsf)) # Keep only common stocks and relevant columns tbl.crsp_daily <- tbl.crsp_dsf %>% mutate(shrcd = as.integer(shrcd)) %>% filter(shrcd %in% c(10, 11)) %>% transmute(permno = as.integer(permno), date = ymd(date), ret = as.numeric(ret) * 100 ) rm(tbl.crsp_dsf) # Keep only distinct observations tbl.crsp_daily <- tbl.crsp_daily %>% distinct(permno, date, .keep_all = TRUE) # Add fama french factors (https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html) tbl.factors_ff_daily <- read_csv("raw/F-F_Research_Data_Factors_daily.csv", skip = 3) %>% transmute(date = ymd(X1), rf_ff = as.numeric(RF), mkt_ff = as.numeric(Mkt-RF) + as.numeric(RF)) %>% filter(date <= max(tbl.crsp_daily$date))

tbl.crsp_daily <- tbl.crsp_daily %>%
left_join(tbl.factors_ff_daily, by = "date")

# Drop rows with missing values
tbl.crsp_daily <- tbl.crsp_daily %>%
na.omit()

# Set reference month
tbl.crsp_daily <- tbl.crsp_daily %>%
mutate(month = floor_date(date, "month")) %>%
select(permno, date, month, ret, rf_ff, mkt_ff)

# Save sample
write_rds(tbl.crsp_daily, "data/crsp_daily.rds")
##### Christoph Scheuch
###### Director of Product

My interests include Product Management, Data Science, R and FinTech-related stuff. matter.