Financial Modeling Prep (FMP) is a newer service that offers a wide range of financial metrics through an API. Use fmpcloudr to access the data and analyze it in R.
Financial Modeling Prep and its sister site FMP Cloud provide access to a wide range of financial data points. This article uses the fmpcloudr package to pull data from the API into R. The analysis looks at historical returns, trade volume, and the composition of the S&P 500. One thing to note: because FMP is still relatively new, some historical data points are unavailable. This will be highlighted within the analysis.
First, set up the R environment:
# Load the necessary libraries
library(fmpcloudr)
library(dplyr)
library(lubridate)
library(plotly)
# Set options and create if.na function
options(scipen=999)
options(dplyr.width = Inf)
if.na <- function(x, y) ifelse(is.na(x), y, x)
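# e.g. if.na(c(1, NA, 3), 0) returns 1 0 3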
# Set the FMP token - stored in a local location on server
fmpKey = readRDS('/home/rstudio/Secure/fmp.rds')
fmpcloudr::fmpc_set_token(fmpKey, noBulkWarn = TRUE)
FMP provides access to index data, which is not the case for many of the free and inexpensive pricing APIs available. For many of the Total Return indexes, data only goes back to April 2020, but the Price Return indexes have a much deeper history. Note: Total Return includes dividend reinvestment.
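If you want to confirm how deep the history goes for a given index, one quick check is to request a very early start date and inspect the earliest date the API actually returns. A minimal sketch (the object name trCheck and the 1990 start date are illustrative):
# Check how far back the S&P 500 Total Return index actually goes
trCheck = fmpc_price_history('^SP500TR', startDate = '1990-01-01')
min(trCheck$date) # the API simply returns whatever history it has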
First, we will pull index data for the three major indexes (S&P 500, Nasdaq, and Dow Jones) combined with Gold. We can search for these symbols using fmpc_symbols_by_market.
# Search for available symbols for indexes and commodities
AvailIndx = fmpc_symbols_by_market(c('index','commodity'))
# Filter for Gold, S&P, Dow and Nasdaq
AvailIndx %>%
mutate(lowerName = tolower(name)) %>%
filter(grepl('gold',lowerName) | grepl('s&p 500',lowerName) |
grepl('dow jones ind',lowerName) | grepl('nasdaq',lowerName)) %>%
select(1,2) -> symbName
knitr::kable(symbName)
symbol | name |
---|---|
GCUSD | Gold Apr 21 |
ZGUSD | Gold 100 oz. Apr 20 |
^DJI | Dow Jones Industrial Average |
^IXIC | NASDAQ Composite |
^GSPC | S&P 500 |
^SP500TR | S&P 500 (TR) |
^DJITR | Dow Jones Industrial Average TR |
^XNDX | NASDAQ 100 Total Return Index |
^NDX | NASDAQ 100 |
^VXN | CBOE NASDAQ 100 Volatility |
^GVZ | CBOE Gold Volatility Index |
Once we have the symbols, we can use fmpc_price_history to pull the price history for each index. We will pull the Price Return indexes due to their longer history. Luckily, the S&P 500 Total Return index is also available.
# Pull data for price return indexes, Gold, and S&P TR back to 2000
IndxHist = fmpc_price_history(c('^NDX','^DJT','^GSPC','^SP500TR','GCUSD'),
startDate = '2000-01-01')
# Calculate the cumulative return of each symbol
IndxHist %>%
# Bring in Naming data
left_join(select(AvailIndx,symbol,name), by = 'symbol') %>%
group_by(symbol) %>%
# Calculate the cumulative return for each symbol
mutate(return = if.na(log(adjClose/lag(adjClose)),0), # Calculate log returns for each index
name = ifelse(symbol == 'GCUSD','Gold',
ifelse(symbol == '^DJT','Dow Jones (PR)',
ifelse(symbol == '^NDX','Nasdaq 100 (PR)',name))),
cumReturn = exp(cumsum(return))-1) -> IndxReturn
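As a quick sanity check on the return math, exponentiating the cumulative sum of log returns recovers the simple price ratio. A minimal sketch with a made-up price series:
# Toy example: log returns compound back to the raw price ratio
p = c(100, 105, 103, 110)
r = if.na(log(p/lag(p)), 0) # first day has no prior close, so set to 0
exp(cumsum(r)) - 1          # 0.00 0.05 0.03 0.10, which matches p/p[1] - 1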
Finally, we can plot the data to see how the cumulative return looks over time.
# use plotly to create interactive plot
plot_ly(IndxReturn) %>%
add_trace(x = ~date, y = ~cumReturn, color = ~name, type = 'scatter', mode = 'lines') %>%
layout(title = 'Major Index Performance since Jan 2000',
hovermode = 'compare',
legend = list(orientation = "h", xanchor = "center", x = 0.5),
xaxis = list(title = ''),
yaxis = list(title = 'Return', tickformat = '.0%'))
As can be seen above, gold has performed very well since 2000 despite a massive bear market from 2011 to 2015. The other averages have all moved in line with each other. The reason the Nasdaq appears to underperform over this period is the huge drawdown during the tech crash in the early 2000s.
As mentioned above and in a previous article, the price return and total return differ due to the reinvestment of dividends. While dividend yields typically range between 1% and 2%, the compounding effect can be quite dramatic.
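To see why, consider a hypothetical index with a 7% annual price return and a 1.5% dividend yield, compounded over the 21 years from 2000 through 2020. The figures below are illustrative assumptions, not FMP data:
# Hypothetical illustration of dividend compounding over 21 years
years = 21
pr = (1 + 0.07)^years - 1                 # price return only: ~314%
tr = ((1 + 0.07)*(1 + 0.015))^years - 1   # dividends reinvested: ~466%
round(c(priceReturn = pr, totalReturn = tr, gap = tr - pr), 2)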
In this analysis, we will compare the price return indexes to the total return performance of the ETFs. As a point of comparison, I have also included the S&P 500 Total Return index to see how SPY performs against the index. The SPY return will be slightly below the index due to ETF fees.
# Pull the ETF data for S&P and Nasdaq - QQQ and SPY
ETFHist = fmpc_price_history(c('QQQ','SPY'), startDate = '2000-01-01')
# Calculate the returns
ETFHist %>%
group_by(symbol) %>%
mutate(return = if.na(log(adjClose/lag(adjClose)),0),
cumReturn = exp(cumsum(return))-1) %>%
filter(date == max(date)) -> ETFReturn
# Pull the Index data from the analysis above
NasSP = IndxReturn %>%
filter(symbol %in% c('^GSPC','^NDX','^SP500TR')) %>%
filter(date == max(date))
# Create a name map
map = tibble(symbol = c('^GSPC','^NDX','^SP500TR','QQQ','SPY'),
plotname = c('SP 500 PR','Nasdaq PR','SP 500 TR','Nasdaq ETF TR','SP 500 ETF TR'))
# Stack the data and take the max date which has the cumulative return for the period
rbind(NasSP,ETFReturn) %>%
filter(date == max(date)) %>%
left_join(map, by = 'symbol') -> PlotTRPR
Once the returns have been calculated, we can plot the data using a bar chart to show the cumulative return over the time period. The tech-heavy QQQ is not known for high dividends, and this can be seen in the chart below. The S&P 500 TR vs PR difference is much greater than the same difference for the Nasdaq (a 108% difference versus only 34% for the Nasdaq).
# Plot the ETF Data in a bar chart
plot_ly(PlotTRPR) %>%
add_trace(x = ~plotname, y = ~cumReturn, color = ~plotname, type = 'bar') %>%
layout(title = 'Price Return vs Total Return',
hovermode = 'compare',
showlegend = FALSE,
xaxis = list(title = ''),
yaxis = list(title = 'Return', tickformat = '.0%'))
ETFs have become extremely popular over the past twenty years. We can use trading in S&P 500 ETFs as a proxy for the increase in activity. Daily trade volume is a great indicator, along with total assets. There are three main ETFs that track the S&P 500: SPY, IVV, and VOO. We can analyze the trade volume of all three to see how it has trended over time. Below we will pull the data and plot it as a stacked bar chart.
# Pull the three tickers back to SPY inception
SPETF = fmpc_price_history(c('IVV','VOO','SPY'), startDate = '1993-01-01')
# Calculate notional volume, and take daily average for each quarter
SPETF %>%
mutate(notionVol = adjClose * unadjustedVolume) %>% # Notional amount is price * share volume
mutate(qtrEnd = ceiling_date(date, unit = 'quarter')-1) %>% # Calculate qtr end date
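  # e.g. ceiling_date(as.Date('2021-02-15'), 'quarter') - 1 is 2021-03-31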
  group_by(qtrEnd, symbol) %>% # Group qtr/symbol to calculate average
summarise(avgDlyvol = mean(notionVol)) %>%
arrange(qtrEnd) -> ETFVol
# Plot the data in a stacked bar chart
plot_ly(ETFVol) %>%
add_trace(x = ~qtrEnd, y = ~avgDlyvol, color = ~symbol, type = 'bar') %>%
layout(title = 'Average Daily Volume',
hovermode = 'compare',
barmode = 'stack',
legend = list(orientation = "h", xanchor = "center", x = 0.5),
xaxis = list(title = ''),
yaxis = list(title = 'Notional Daily Volume'))
As can be seen above, trade volume in S&P 500 ETFs has increased dramatically since the inception of the first ETF (SPY) back in 1993. One glaring takeaway is how much SPY dominates the trade volume. Interestingly, this is not due entirely to size. In the pull below, we can see that even though SPY has the largest market cap (almost 50% of the total), its trade volume dominates, garnering over 90% of the total. SPY has always been the most popular ETF and is used by many institutions due to its liquidity.
# Profiles show the current market cap
knitr::kable(fmpc_security_profile(c('SPY','IVV','VOO')) %>%
mutate(mktCap = format(mktCap, big.mark = ','),
volAvg = format(volAvg, big.mark = ',')) %>%
select(symbol, volAvg, mktCap, companyName), "simple")
symbol | volAvg | mktCap | companyName |
---|---|---|---|
SPY | 63,201,003 | 357,980,865,000 | SPDR S&P 500 |
IVV | 4,072,066 | 243,324,404,000 | iShares Core S&P 500 |
VOO | 3,010,931 | 111,153,118,580 | Vanguard S&P 500 |
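Using the figures from the profile table, a back-of-envelope check confirms the shares quoted above:
# Quick check of SPY's share of volume vs market cap (numbers from the table)
vol = c(SPY = 63201003, IVV = 4072066, VOO = 3010931)
cap = c(SPY = 357980865000, IVV = 243324404000, VOO = 111153118580)
round(vol/sum(vol), 2) # SPY ~0.90 of average daily share volume
round(cap/sum(cap), 2) # SPY ~0.50 of combined market cap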
The S&P 500 index was formed in 1957 and has changed dramatically over that time. Many companies have been added and removed based on certain selection criteria. Recent articles have highlighted how the concentration in the top 5 companies has never been greater. This analysis will attempt to replicate those findings using FMP.
Reconstructing the S&P 500 historically is a bit tricky, but FMP offers API endpoints that allow you to reconstruct the index historically and chart the concentration.
As with any data exercise, the data is often not perfect. Unfortunately that is the case here, so the data must be cleaned and modified. The FMP support staff is extremely responsive and very helpful; they have been great at correcting any data issues I bring to their attention. These issues are a bit low on the priority list, so I have not raised them, but I also think the data is good enough for exploration. Additionally, it could be challenging to pull historical market cap data for delisted companies.
The goal of this section will be to analyze the top 5 companies in the S&P 500 by doing the following:

- Reconstruct the S&P 500 membership at each year-end using the historical constituent data
- Pull historical market cap data for each ticker
- Use plotly to create an overlay of the S&P 500 over time

# Pull the current list of the S&P 500
spcur = fmpc_symbols_index(period = 'current')
# Pull the historical list of companies that have been added and removed
sphist = fmpc_symbols_index(period = 'historical', index = 'sp500') %>%
mutate(date = as.Date(date))
The data goes back to 1994, but as mentioned above, it is not perfectly clean and consistent. For example, Morgan Stanley appears in the current S&P 500 list, but the historical list shows it as removed back in 1997 without ever being added again.
symbol | name | sector | subSector | headQuarter | dateFirstAdded | cik | founded |
---|---|---|---|---|---|---|---|
MS | Morgan Stanley | Financials | Investment Banking & Brokerage | New York, New York | | 0000895421 | 1935 |
dateAdded | addedSecurity | removedTicker | removedSecurity | date | reason | symbol |
---|---|---|---|---|---|---|
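The inconsistency is easy to see directly from the two pulls above. A quick look-up, using the spcur and sphist objects defined earlier and the column names shown in the tables:
# Morgan Stanley appears in the current list...
spcur %>% filter(symbol == 'MS')
# ...but the history only shows a removal, never a subsequent addition
sphist %>% filter(removedTicker == 'MS' | symbol == 'MS')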
I don’t want to get too caught up with all the data issues, so instead I will make notes about them in the comments.
To simplify the output, I only want to show a snapshot of the S&P 500 at the end of each year. The data set starts in 1994, so I will take the S&P 500 as of 1994 and loop through each year up to the current year, adding and removing tickers as indicated by the data set. Because of the issue highlighted above with Morgan Stanley, I will also force-add any ticker currently in the S&P 500 that has an add date on or before the year in the loop.
Note: I recognize that loops are frowned upon in R (it is usually better to use a function within map or apply), but I thought this would be easier to read. A functional sketch of the same update is shown below for reference.
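For those who prefer the functional style, the same year-by-year update can be expressed with purrr::accumulate. This is only a sketch that mirrors the loop below and assumes the sphist, spcur, sp500yr, and yrloop objects defined in this section; it is not a tested drop-in replacement:
library(purrr)
# One step of the update: take the previous year's snapshot and apply year i's changes
stepYear = function(prev, i) {
  rmvd = sphist %>% filter(year(date) == i, removedSecurity != '') %>% pull(symbol)
  added = sphist %>% filter(year(date) == i, addedSecurity != '') %>%
    transmute(symbol, date = as.Date(paste0(i, '-12-31')))
  current = spcur %>% filter(substr(dateFirstAdded, 1, 4) <= i) %>%
    transmute(symbol, date = as.Date(paste0(i, '-12-31')))
  prev %>% mutate(date = as.Date(paste0(i, '-12-31'))) %>%
    bind_rows(added, current) %>%
    filter(!(symbol %in% rmvd)) %>%
    distinct()
}
# accumulate() keeps every intermediate snapshot; bind them into one frame
sp500yr_alt = bind_rows(accumulate(yrloop, stepYear, .init = sp500yr))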
# Filter for the first year of data, 1994
sp500yr = sphist %>%
filter(year(date)==1994) %>%
mutate(date = as.Date('1994-12-31')) %>%
select(symbol,date)
# Make this the "previous year" to start the loop
prevYear = sp500yr
# Create a vector of years to loop through
yrloop = 1995:year(Sys.Date()-20) # subtracting 20 days keeps a brand-new year out until late January
# Loop through each year from 1995 to the current year
for (i in yrloop) {
# Identify securities that have been removed
rmvd = sphist %>%
filter(year(date)==i, removedSecurity != '') %>%
pull(symbol)
# Identify securities that have been added
added = sphist %>%
filter(year(date)==i, addedSecurity != '') %>%
select(symbol) %>%
mutate(date = as.Date(paste0(i,'-12-31')))
# Use the previous year tickers as a starting point for this year and update the date
Curyear = prevYear %>%
mutate(date = as.Date(paste0(i,'-12-31')))
# For current S&P 500 constituents, include any tickers that were added on or before the loop year
# This is to correct for the Morgan Stanley issue noted above
spcur %>% filter(substr(dateFirstAdded,1,4)<=i) %>%
mutate(date = as.Date(paste0(i,'-12-31'))) %>%
select(symbol,date) -> currentCheck
# Bind them together, filter out removed tickers, and make unique
Curyear = rbind(Curyear,added,currentCheck) %>%
filter(!(symbol %in% rmvd)) %>%
distinct()
# The current year becomes the new previous year
prevYear = Curyear
# Append each year to the next in the loop
sp500yr = rbind(sp500yr,Curyear)
}
Once the loop is complete, we can look at the data to see how it turned out:
table(sp500yr$date)
1995-12-31 1996-12-31 1997-12-31 1998-12-31 1999-12-31 2000-12-31
193 194 204 208 218 228
2001-12-31 2002-12-31 2003-12-31 2004-12-31 2005-12-31 2006-12-31
235 245 249 253 260 271
2007-12-31 2008-12-31 2009-12-31 2010-12-31 2011-12-31 2012-12-31
291 313 331 348 365 383
2013-12-31 2014-12-31 2015-12-31 2016-12-31 2017-12-31 2018-12-31
401 415 436 453 477 492
2019-12-31 2020-12-31 2021-12-31
506 517 519
As can be seen, the data is not perfect, but it is close enough. There are 476 tickers starting in 1994, building up to 500. There are currently 505 tickers in the S&P 500, but you can also see that some years had more than this. The data will be cleaned up more in later steps.
FMP also provides an endpoint to collect historical market cap data. Unfortunately, the history here is not great: for any company that was not actively traded after January 2020, the history is unavailable, which includes many removed names.
This API request will fetch a large amount of data for over 1,000 securities. Additionally, time spacers are placed between each request to ensure the API calls do not exceed the FMP limit. The total run time will be about 3-5 minutes.
Note: This is an example of using lapply instead of a for loop
# Get a list of unique symbols
unqsym = unique(sp500yr$symbol)
# 5 minutes to run
# Use the bind rows function to collapse each pull
mktcapSP = bind_rows(lapply(unqsym, function(x) {
  # Pull the market cap for up to 7,000 trading days (back to ~1993)
  tick = fmpc_security_mrktcap(x, limit = 7000)
if(is.null(tick)) return()
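  # (bind_rows() silently drops NULL list elements, so missing symbols fall out of the result)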
# Modify the pull to shrink and clean the data
tick %>%
mutate(year = year(date)) %>%
group_by(year,symbol) %>%
filter(date <= '2020-09-30') %>% # Ignore the most recent data due to issues in mkt cap
filter(date == max(date)) %>% # Take last day available for each year
ungroup
}))
We can aggregate the data by grouping into the top 5 companies at each year-end. Calculating the percentage held by the top 5 then becomes relatively easy.
# Create market cap percentage for each year where data is available
sp500yr %>%
mutate(year = year(date)) %>%
left_join(select(mktcapSP,-date), by = c('symbol','year')) %>% # join in market cap data
mutate(marketCap = ifelse(is.na(marketCap),0,marketCap)) %>% # Set NA's to 0
group_by(year) %>%
arrange(desc(marketCap)) %>%
mutate(grp = ifelse(row_number()>5,'Outside top 5','Top 5')) %>%
group_by(year,grp) %>%
summarise(marketCap = sum(marketCap)) %>%
mutate(percentage = marketCap/sum(marketCap)) ->
mktcapYear
top5Perc = filter(mktcapYear, grp == 'Top 5')
plot_ly(mktcapYear) %>%
add_trace(x = ~year, y = ~marketCap, name = ~grp, color = ~grp, type = 'bar') %>%
  add_trace(data = top5Perc, x = ~year, y = ~percentage, name = 'Top 5 companies market cap %', yaxis = "y2",
            type = 'scatter', mode = 'lines', line = list(color = 'black')) %>%
layout(title = 'Market Cap and Percentage of top 5 companies',
barmode = 'group',
hovermode = 'compare',
xaxis = list(title=''),
margin = list(l = 75, r = 75, b = 50, t = 50, pad = 4),
         yaxis2 = list(side = 'right', overlaying = 'y', title = 'Top 5 Market Cap %', zeroline = F, showgrid = F, tickformat = '.1%'),
         yaxis = list(side = 'left', title = 'Market Cap', zeroline = F, showgrid = T),
legend = list(traceorder='reversed',orientation = "h"))
Based on the chart above, the peak was hit back in 2006-2009, even though concentration is rising rapidly again. I suspect this may be due to missing market cap data the further back we go. Let's see below.
The data above does not align with recently published articles about the S&P 500 reaching historic concentration in its top 5 securities. To better understand what is going on, it will be beneficial to use an animated plot.
In order to plot the data correctly in an animation, there must be an equal number of groups in each frame of the time lapse. Therefore, if there are 400 securities in one year and 500 in another, you must use the smallest yearly count across all years.
# Check for valid market caps by year
sp500yr %>%
mutate(year = year(date)) %>%
left_join(select(mktcapSP,-date), by = c('symbol','year')) %>% # join in market cap data
mutate(marketCap = ifelse(is.na(marketCap),0,marketCap)) %>% # Set NA's to 0
  filter(marketCap != 0) -> # Filter out 0's
valid_mkt_cap
table(valid_mkt_cap$year)
1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008
182 183 194 201 209 218 225 235 237 240 248 259 277 298
2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020
312 325 338 353 372 385 409 434 459 475 494 505
As can be seen in the table above, our smallest yearly count is 203. No surprise this is back in 1994, because the data gets more scarce the farther back you go. This essentially implies that roughly 60% of the 1994 S&P 500 no longer trades!
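The arithmetic behind that claim, using ~500 as the index size:
# Share of 1994 constituents with no market cap data
round(1 - 203/500, 2) # ~0.59, i.e. roughly 60% no longer trade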
In a perfect world, we would have all 500 tickers for each year, but this is not the case. That being said, the most likely reasons for removal from the S&P 500 are low market cap, acquisition, or ceasing to trade. Generally speaking, this affects smaller companies more often, so it hopefully should not invalidate our findings. We can confirm with a second plot.
In the next step we will reduce our population to the 200 largest companies by market cap for each year in the analysis. We will then bucket these into 40 groups based on market cap, so that each group will have exactly 5 tickers.
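As a quick illustration of why 40 buckets gives exactly 5 tickers each, dplyr's ntile() splits a ranked vector into equal-sized groups:
# 200 values into 40 buckets means every bucket has exactly 5 members
table(ntile(1:200, 40))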
After this step, we can plot the data.
valid_mkt_cap %>%
group_by(year) %>% # Group the data for each year
arrange(desc(marketCap)) %>% # Arrange in descending order
filter(row_number()<=200) %>% # Take the top 200 funds by year
mutate(ngroup = paste0('grp:',ntile(marketCap,40))) %>% # Create 40 buckets in each year
group_by(year,ngroup) %>% # Group by year and bucket
summarise(marketCap = sum(marketCap)) %>% # Sum the market cap of each bucket
group_by(year) %>% # Group by year
mutate(mktcapPer = marketCap/sum(marketCap)) %>% # Calc the % of each bucket by year
ungroup() -> plotReady
Once the data has been set so that each year has 40 buckets sorted by market cap, we can plot the data into a plotly animation chart.
plot_ly(
plotReady,
x = ~marketCap, y = ~mktcapPer,
size = ~marketCap, name = 'Mrkt Cap Bucket',
frame = ~year, hoverinfo = "text",
text = ~paste0(ngroup,'\nMkt Cap ($M):',format(round(marketCap/1e6,0),big.mark=","),
'\nMkt Cap %:',round(mktcapPer*100,2),'%'),
type = 'scatter', mode = 'markers'
) %>%
layout(
title = 'Concentration of the S&P 500 over time',
xaxis = list(title = 'Market Cap'),
yaxis = list(title = 'Market Cap % of total', tickformat = '.1%')
)
The full playback looks good. Based on this chart, we are not at peak concentration. As can be seen, the 5 largest companies crossed over 35% in 2008 and currently sit below 30%. That being said, we are only looking at 200 of the 500 companies, which could be very misleading.
To confirm our findings, let's move the start of our time lapse up to 2006. By doing so, we can keep the top 340 companies for each year instead of 200 (as shown above). This means we will have 68 groups rather than 40. First we run our manipulation, followed by the plot.
valid_mkt_cap %>%
filter(year>=2006) %>%
group_by(year) %>% # Group the data for each year
arrange(desc(marketCap)) %>% # Arrange in descending order
filter(row_number()<=340) %>% # Take the top 340 funds by year
mutate(ngroup = paste0('grp:',ntile(marketCap,68))) %>% # Create 68 buckets in each year
group_by(year,ngroup) %>% # Group by year and bucket
summarise(marketCap = sum(marketCap)) %>% # Sum the market cap of each bucket
group_by(year) %>% # Group by year
mutate(mktcapPer = marketCap/sum(marketCap)) %>% # Calc the % of each bucket by year
ungroup() -> plotReady
Plotting again:
plot_ly(
plotReady,
x = ~marketCap, y = ~mktcapPer,
size = ~marketCap, name = 'Mrkt Cap Bucket',
frame = ~year, hoverinfo = "text",
text = ~paste0(ngroup,'\nMkt Cap ($M):',format(round(marketCap/1e6,0),big.mark=","),
'\nMkt Cap %:',round(mktcapPer*100,2),'%'),
type = 'scatter', mode = 'markers'
) %>%
layout(
title = 'Concentration of the S&P 500 over time since 2006',
xaxis = list(title = 'Market Cap'),
yaxis = list(title = 'Market Cap % of total', tickformat = '.1%')
)
Even after expanding from 200 to 340 companies, we are still showing 2007 as having higher concentration than today. Unfortunately, it looks like our analysis is not correct. There are many articles showing exactly the opposite of our findings here.
What could be causing the discrepancy? As mentioned several times, incomplete data. Although the bottom 200 or 300 companies will not be as large as the upper end, there could be enough market cap among them to skew the weightings. I do not blame FMP for this; they only started collecting market cap data as of 2020. In order to get a complete picture, we would most likely have to get data from a paid source.
In this article, we used R and several packages to analyze financial stock data. We looked at price history, trade volume, and S&P 500 concentration. This merely scratches the surface of the data available from FMP. I highly recommend digging into the fmpcloudr package to see what else is available.
Disclosure: The content herein is my own opinion and should not be considered financial
advice or recommendations. I am not receiving compensation for any materials produced.
I have no business relationship with any companies mentioned.