Scraping ESG Data from Yahoo Finance with R

In this post, I describe my effort to scrape Environmental, Social and Governance (ESG) information from Yahoo Finance (e.g., for Apple). In particular, Yahoo Finance provides total ESG scores, environment, social and governance risk scores, as well as controversy levels, all compiled by Sustainalytics, which is now owned by Morningstar. My code builds on the walk-through by Kyle Ruden, which I adapted to the current page structure of Yahoo Finance and to my own coding style. In addition, I added a few steps that I, as a web scraping newbie, had to look up while going through his guide.

To begin with, I want to urge you to read at least the legal and ethical considerations put forward by Kyle. Most importantly, when performing web scraping tasks, it is both good practice and often required to set a custom user agent request header to identify yourself, and to send requests at a modest rate so that you 'smell like a human'. I consider both of these key aspects in my code below.
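
As a small illustration of what these two practices can look like in R, here is a sketch with a hypothetical helper (not the function used later in this post) that attaches a custom user agent and pauses before each request:

library(httr)

# hypothetical helper: identify yourself via a custom user agent header
# and pause for a few seconds before each request to mimic human browsing
polite_get <- function(url, agent) {
  Sys.sleep(runif(1, 2, 5))   # random pause between 2 and 5 seconds
  GET(url, user_agent(agent)) # attach the custom user agent to the request
}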

Throughout this note, I rely on the following packages:

library(tidyverse) # overall grammar
library(tidytext)  # only for reorder_within function
library(scales)    # only for percent function
library(httr)      # http verbs
library(rvest)     # wrapper around xml2 and httr
library(robotstxt) # only for paths_allowed

Get Symbols

First, we want to get some companies for which we want to scrape ESG information from Yahoo Finance. Let us get a table of symbols and industry information of the S&P 500 constituents from Wikipedia. The function read_html parses the page into a valid XML document. html_nodes then allows us to point to the exact table on the page using a CSS selector. html_table finally parses the HTML table into a data frame. Note that, as one of the last steps, we need to replace all dots in the symbols with dashes to get the symbols used by Yahoo Finance.

tbl.Symbols <- read_html("https://en.wikipedia.org/wiki/List_of_S%26P_500_companies") %>%
  html_nodes(css = "table[id='constituents']") %>%
  html_table() %>%
  data.frame() %>%
  as_tibble() %>% 
  select(Symbol, Company = Security, Sector = GICS.Sector, Industry = GICS.Sub.Industry) %>%
  mutate(Symbol = str_replace(Symbol, "[.]", "-")) %>%
  arrange(Symbol)

The following chunk prints what we got from Wikipedia. We will use the sector information in the last section of this post where we take a quick look at the scraped data.

tbl.Symbols
## # A tibble: 505 x 4
##    Symbol Company             Sector             Industry                       
##    <chr>  <chr>               <chr>              <chr>                          
##  1 A      Agilent Technologi~ Health Care        Health Care Equipment          
##  2 AAL    American Airlines ~ Industrials        Airlines                       
##  3 AAP    Advance Auto Parts  Consumer Discreti~ Automotive Retail              
##  4 AAPL   Apple Inc.          Information Techn~ Technology Hardware, Storage &~
##  5 ABBV   AbbVie Inc.         Health Care        Pharmaceuticals                
##  6 ABC    AmerisourceBergen ~ Health Care        Health Care Distributors       
##  7 ABMD   ABIOMED Inc         Health Care        Health Care Equipment          
##  8 ABT    Abbott Laboratories Health Care        Health Care Equipment          
##  9 ACN    Accenture plc       Information Techn~ IT Consulting & Other Services 
## 10 ADBE   Adobe Inc.          Information Techn~ Application Software           
## # ... with 495 more rows

Locate ESG Information

The point where I struggled when trying to replicate other guides was finding the exact location of the information I want to scrape (and the fact that the old locations seemed to have changed). After some trial and error, it turns out that it is really easy. Once you download a web page, you can in principle use either CSS selectors or XPath expressions to extract information with html_nodes, as above. However, the CSS classes on Yahoo Finance have a weird structure that is apparently not straightforward to use in this function. Fortunately, XPath works perfectly! Google will explain what these terms mean; I only demonstrate how to find the relevant paths, which we use in the scraping function below.
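
To illustrate the difference between the two approaches on a toy example (just an inline HTML snippet, not Yahoo Finance), the same element can be selected either via a CSS selector or via an XPath:

# parse a toy HTML snippet and extract the same element in two ways
toy_page <- read_html('<div class="score"><span>17</span></div>')

toy_page %>% html_node(css = "div.score span") %>% html_text()               # CSS selector
toy_page %>% html_node(xpath = '//div[@class="score"]/span') %>% html_text() # XPath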

Let us stick to Apple as our main example and go to the sustainability tab on Yahoo Finance. If we right-click on the ESG score (e.g., using Google Chrome), we see the option to 'Inspect'.

Once you click on it, a panel opens to the right where you see the underlying code. Even more useful is the fact that the browser highlights the corresponding elements on the website as you hover over the code. This way, it is really easy to locate the information we are after. So we right-click on the relevant element in the inspector and copy the XPath.

So the location of the total ESG score on the page is:

'//*[@id="Col1-0-Sustainability-Proxy"]/section/div[1]/div/div[1]/div/div[2]/div[1]'

We can just point-and-click on all the items we want to scrape and collect the relevant XPaths. Once we have downloaded a page, we just tell html_node where to look for the information we want (and afterwards how to parse it).
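
As a quick interactive check, we can verify that the path from above indeed returns Apple's total ESG score. This is just a sketch for a one-off test; it uses read_html directly without the custom user agent, which the full function below sets properly:

# download Apple's sustainability page and extract the total ESG score
page_aapl <- read_html("https://finance.yahoo.com/quote/AAPL/sustainability?p=AAPL")

page_aapl %>%
  html_node(xpath = '//*[@id="Col1-0-Sustainability-Proxy"]/section/div[1]/div/div[1]/div/div[2]/div[1]') %>%
  html_text() %>%
  parse_number()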

Define Scraping Function

My function to scrape ESG data takes two inputs: the stock symbol and your user agent. We got the symbols from Wikipedia, but we need to define our own user agent. For instance, I use an agent that looks like this:

agent <- "Your Name (your@email.com). Doing personal research."

The main function then proceeds as follows:

  1. Construct the link of the page we want to download.
  2. Check if scraping is allowed.
  3. Store the time stamp of our scraping effort.
  4. Download the page.
  5. Store individual information using the XML paths we manually extracted following the point-and-click procedure from above.
  6. Collect all information in a table.

Note that the XPaths might change at a later point, e.g., because of a new website design of Yahoo Finance, so you might have to update them manually in case you try to use my code.
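
If you want to notice such a change early, a small sanity check along the following lines might help. This is a hypothetical helper that is not part of the original walk-through; it relies on html_node returning a missing node when an XPath does not match anything:

# hypothetical helper: warn if an XPath no longer matches anything on the page
check_xpath <- function(page, xpath) {
  node <- html_node(page, xpath = xpath)
  if (inherits(node, "xml_missing")) {
    warning(str_c("XPath not found, the page layout may have changed: ", xpath))
  }
  invisible(node)
}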

scrap_esg_data <- function(symbol, agent) {
  # construct link
  link <- str_c("https://finance.yahoo.com/quote/", symbol, "/sustainability?p=", symbol)
  
  # check if scraping is allowed
  check <- suppressMessages(paths_allowed(link))
  
  if (check == TRUE) {
    # store scrap date
    scrap_date <- Sys.time()
    
    # download page
    page <- GET(link, user_agent(agent)) %>% 
      read_html()
    
    # store total ESG score
    total_esg_score <- page %>% 
      html_node(xpath = '//*[@id="Col1-0-Sustainability-Proxy"]/section/div[1]/div/div[1]/div/div[2]/div[1]') %>%
      html_text() %>% parse_number()
    
    # store total ESG percentile (extracted here, but not included in the output tibble below)
    total_esg_percentile <- page %>% 
      html_node(xpath = '//*[@id="Col1-0-Sustainability-Proxy"]/section/div[1]/div/div[1]/div/div[2]/div[2]/span') %>%
      html_text() %>% parse_number()
    
    # store environment risk score
    environment_risk_score <- page %>%
      html_node(xpath = '//*[@id="Col1-0-Sustainability-Proxy"]/section/div[1]/div/div[2]/div/div[2]/div[1]') %>%
      html_text() %>% parse_number()
    
    # store social risk score
    social_risk_score <- page %>%
      html_node(xpath = '//*[@id="Col1-0-Sustainability-Proxy"]/section/div[1]/div/div[3]/div/div[2]/div[1]') %>%
      html_text() %>% parse_number() 
    
    # store governance risk score
    governance_risk_score <- page %>%
      html_node(xpath = '//*[@id="Col1-0-Sustainability-Proxy"]/section/div[1]/div/div[4]/div/div[2]/div[1]') %>%
      html_text() %>% parse_number()
    
    # store controversy level
    controversy_level <- page %>%
      html_node(xpath = '//*[@id="Col1-0-Sustainability-Proxy"]/section/div[2]/div[2]/div/div/div/div[1]/div') %>%
      html_text() %>% parse_number()
    
    # store last update date (currently in %m/%Y format)
    last_update_date <- page %>%
      html_node(xpath = '//*[@id="Col1-0-Sustainability-Proxy"]/section/div[3]/span[2]/span') %>%
      html_text() %>% str_remove(., "Last updated on ")
    
    # collect output in tibble
    output <- tibble(
      Symbol = symbol,
      Link = link,
      ScrapDate = scrap_date,
      TotalESGScore = total_esg_score,
      EnvironmentRiskScore = environment_risk_score,
      SocialRiskScore = social_risk_score,
      GovernanceRiskScore = governance_risk_score,
      ControversyLevel = controversy_level,
      LastUpdateDate = last_update_date
    )
    return(output)
  } else {
    # if scraping is not allowed, throw an error
    stop(str_c("No bots allowed on page '", link ,"'!"))
  }
}
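
Before looping over all symbols, we can test the function with a single symbol, for instance our running example (the exact values naturally depend on when you run it):

# quick test with a single symbol, using the agent defined above
scrap_esg_data("AAPL", agent)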

Scrape ESG Data

For output data of this size, I prefer to initialize an empty list that I sequentially fill using a simple loop and then bind the rows into a table once the loop is done. Moreover, we have to incorporate (random) waiting times before each request to Yahoo Finance such that our scraping resembles the normal clicking behavior of a human. If we run into a timeout or get blocked, the scraped page contains no ESG information, so we tell the loop to try again, but wait a little longer. After 10 attempts, the loop gives up and assumes that there is no ESG information for the particular symbol, which is in fact the case most of the time.

# initialize output list
lst.ESGInfo <- vector("list", nrow(tbl.Symbols))

# loop over symbols
for (j in 1:nrow(tbl.Symbols)) {
  # extract ESG information
  tbl.Results <- NULL
  attempt <- 1
  
  # possibly use multiple attempts to fetch data (with max 10)
  while (is.null(tbl.Results) & attempt <= 10) {
    
    # wait random amount of time to avoid timeouts with increasing waiting time
    Sys.sleep(runif(1, attempt*5, attempt*10))
    attempt <- attempt + 1
    
    # try to extract ESG information
    tbl.Results <- try(
      scrap_esg_data(tbl.Symbols$Symbol[j], agent)
    )
    
    # the call might have failed or the ESG score might be missing b/c we were blocked,
    # so discard the result and try again just in case
    if (inherits(tbl.Results, "try-error") || is.na(tbl.Results$TotalESGScore)) {
      tbl.Results <- NULL
    }
  }
  
  # put results into list once done trying to get the data
  lst.ESGInfo[[j]] <- tbl.Results
  
  # print progress
  cat(as.character(Sys.time()), tbl.Symbols$Symbol[j], "done!\n")
}
tbl.ESGInfo <- tbl.Symbols %>%
  left_join(bind_rows(lst.ESGInfo), by = c("Symbol"))

The loop from above takes a couple of hours in the current specification because of the increasing waiting times. I am sure that there are better ways to solve the timeout problem, so feel free to drop a comment below.
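
Given the long runtime, it can also be worth caching each intermediate result to disk so that a crash or an interruption does not force you to start over. Here is a minimal sketch of this idea (assuming a local folder cache/ exists; this is not part of the loop above):

# inside the loop, once a symbol returned data, write its result to disk, e.g.
# write_rds(tbl.Results, str_c("cache/", tbl.Symbols$Symbol[j], ".rds"))

# after the loop (or after a restart), rebuild the table from the cached files
lst.Cached <- list.files("cache", full.names = TRUE) %>%
  map(read_rds)
tbl.ESGInfo.cached <- bind_rows(lst.Cached)

The whole table then looks like this and also includes our initial example Apple: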

tbl.ESGInfo
## # A tibble: 505 x 12
##    Symbol Company Sector Industry Link  ScrapDate           TotalESGScore
##    <chr>  <chr>   <chr>  <chr>    <chr> <dttm>                      <dbl>
##  1 A      Agilen~ Healt~ Health ~ http~ 2020-08-16 09:35:22            17
##  2 AAL    Americ~ Indus~ Airlines http~ 2020-08-16 09:35:33            31
##  3 AAP    Advanc~ Consu~ Automot~ http~ 2020-08-16 09:35:39            12
##  4 AAPL   Apple ~ Infor~ Technol~ http~ 2020-08-16 09:35:47            24
##  5 ABBV   AbbVie~ Healt~ Pharmac~ http~ 2020-08-16 09:35:55            30
##  6 ABC    Ameris~ Healt~ Health ~ http~ 2020-08-16 09:36:06            18
##  7 ABMD   ABIOME~ Healt~ Health ~ <NA>  NA                             NA
##  8 ABT    Abbott~ Healt~ Health ~ http~ 2020-08-16 09:42:52            30
##  9 ACN    Accent~ Infor~ IT Cons~ http~ 2020-08-16 09:42:58            11
## 10 ADBE   Adobe ~ Infor~ Applica~ http~ 2020-08-16 09:43:06            14
## # ... with 495 more rows, and 5 more variables: EnvironmentRiskScore <dbl>,
## #   SocialRiskScore <dbl>, GovernanceRiskScore <dbl>, ControversyLevel <dbl>,
## #   LastUpdateDate <chr>

Quick Evaluation of ESG Scores

Let us take a quick look at the data we collected. First, let us check the overall coverage of our sample:

percent(nrow(na.omit(tbl.ESGInfo)) / nrow(tbl.ESGInfo))
## [1] "93%"

This is not too bad. I believe that for most of the companies without ESG scores in my sample, Yahoo Finance does not provide any data. Admittedly, I should check manually at some point, but for the purpose of this note, this is definitely a success. To analyze sector-level breakdowns, I construct a summary table which I use as the main source for the following figures.

tab.ESGScoresBySectors <- tbl.ESGInfo %>%
  group_by(Sector) %>%
  summarize(Companies = n(),
            Coverage = sum(!is.na(TotalESGScore)) / n(),
            `Total ESG Score` = mean(TotalESGScore, na.rm = TRUE),
            `Environment Risk Score` = mean(EnvironmentRiskScore, na.rm = TRUE),
            `Social Risk Score` = mean(SocialRiskScore, na.rm = TRUE),
            `Governance Risk Score` = mean(GovernanceRiskScore, na.rm = TRUE),
            `Controversy Level` = mean(ControversyLevel, na.rm = TRUE)) %>%
  arrange(-Coverage)

The first figure gives us the coverage per sector. All real estate companies have ESG scores, while only a bit more than three quarters of communication services companies feature this information.

tab.ESGScoresBySectors %>%
  mutate(Labels = str_c(Companies*Coverage, " out of ", Companies)) %>%
  ggplot(aes(y = reorder(Sector, Coverage), 
             x = Coverage, fill = factor(round(Coverage, 0)))) +
  geom_col(show.legend = FALSE) + 
  theme_minimal() + 
  geom_text(aes(label = Labels), hjust = 1.1, color = "white") +
  coord_cartesian(xlim = c(0, 1)) +
  scale_x_continuous(labels = percent) +
  labs(x = NULL, y = NULL,
       title = "How many companies have ESG scores per sector?",
       subtitle = "Based on Yahoo Finance and S&P 500 data as of August 2020")

Next, I want to look at average ESG scores by sector (s/o to the amazing Julia Silge for pointing out the reorder_within function which makes facet grids even more fun). For instance, the real estate sector has the lowest total ESG score, indicating that its business value is least at risk from environmental, social and governance issues. Financials exhibit the lowest environmental risk, while the energy sector (at least the part included in the S&P 500) has the highest exposure to environmental risks.

tab.ESGScoresBySectors %>%
  pivot_longer(cols = c(`Total ESG Score`, `Environment Risk Score`,
                        `Social Risk Score`, `Governance Risk Score`),
               names_to = "Types", values_to = "Scores") %>%
  mutate(Types = factor(Types, levels = c("Total ESG Score", "Environment Risk Score",
                                          "Social Risk Score", "Governance Risk Score")),
         Sector = reorder_within(Sector, -Scores, Types)) %>%
  ggplot(aes(y = Sector, x = Scores, fill = Types)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~Types, scales = "free_y") +
  theme_minimal() + 
  scale_y_reordered() +
  geom_text(aes(label = round(Scores, 0)), hjust = 1.1, color = "white") +
  labs(y = NULL, x = NULL,
       title = "What are the average ESG scores per sector?",
       subtitle = "Based on Yahoo Finance and S&P 500 data as of August 2020")

Finally, I am also interested in the average controversy level, which measures the degree to which companies are involved in incidents and events that may negatively impact stakeholders, the environment or their operations. I decided to plot the controversy of each sector relative to the average overall controversy. Real estate and information technology seem to be far less controversial than consumer staples and communication services.

tab.ESGScoresBySectors %>%
  mutate(RelativeControversy = `Controversy Level` - mean(`Controversy Level`)) %>%
  ggplot(aes(y = reorder(Sector, -RelativeControversy), 
             x = RelativeControversy, fill = (RelativeControversy < 0))) +
  geom_bar(stat = "identity", position = "dodge") +
  theme_minimal() + theme(legend.position = "none") + 
  coord_cartesian(xlim = c(-1.5, 1.5)) +
  labs(y = NULL, x = NULL,
       title = "What is the average sector-level controversy relative to overall controversy?",
       subtitle = "Based on Yahoo Finance and S&P 500 data as of August 2020")

I think there is a lot more interesting stuff to uncover using the ESG scores, but for now I'll leave it at that. Please feel free to share any suggestions in the comments below on how to improve my proposed scraping procedure. I am nonetheless surprised by how easy it is to scrape information from websites using these amazing packages.
