By Nathaniel Pereyra and Tara Choudhary
Welcome to our CMSC320 Final Tutorial! This tutorial uses data found here on kaggle.com that details New York Stock Exchange information from 2012 to 2016.
We will use this data to discover trends in stock prices over time, determine which financial sectors perform better than others, and attempt to narrow down the important indicators of high-value stocks and explain them in an understandable way.
Acknowledgements: data in prices.csv was fetched from Yahoo Finance, and data in fundamentals.csv is from Nasdaq Financials, extended by some fields from EDGAR SEC databases.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
from sklearn import linear_model
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn import metrics
from bs4 import BeautifulSoup as bs
import requests
import statsmodels.api as sm
import statsmodels.formula.api as smf
# Allow up to 50 open figures before matplotlib warns (we plot one chart per company below)
mpl.rcParams['figure.max_open_warning'] = 50
# Company fundamentals: balance-sheet and income-statement fields for each company
url = 'https://raw.githubusercontent.com/ndpereyra/CMSC320Final-Tutorial/main/Datasets/fundamentals.csv?token=AQ2IACJRMVYGB5ARTLUIWXTBYOU34'
fund = pd.read_csv(url)
# Daily prices adjusted for stock splits
url = 'https://raw.githubusercontent.com/ndpereyra/CMSC320Final-Tutorial/main/Datasets/prices-split-adjusted.csv?token=AQ2IACNSNIKEM4UF2ZTRAQLBYOU5Y'
prices_split = pd.read_csv(url)
# Raw (unadjusted) daily prices
url = 'https://raw.githubusercontent.com/ndpereyra/CMSC320Final-Tutorial/main/Datasets/prices.csv?token=AQ2IACOTBNAQV44AMJZO2HTBYOU7Q'
prices = pd.read_csv(url)
# Company metadata: ticker symbol, name, sector, and so on
url = 'https://raw.githubusercontent.com/ndpereyra/CMSC320Final-Tutorial/main/Datasets/securities.csv?token=AQ2IACO6GXWT4KUMH6IAZQLBYOVBW'
securities = pd.read_csv(url)
The code above imports the necessary libraries and loads the files containing stock prices over time, along with fundamentals about each company in the data (i.e., the number of shares traded, the company's outstanding debt, its net revenue, etc.).
# Fetch the rendered GitHub page for the S&P 500 constituents list
r = requests.get("https://github.com/datasets/s-and-p-500-companies/blob/master/data/constituents.csv")
# pd.read_html (using the BeautifulSoup backend) extracts every HTML table on the page;
# the first one is the rendered constituents.csv
lst = pd.read_html(r.content, header = 0, flavor = 'bs4')
lst[0].columns = ['index', 'symbol', 'name', 'sector']
sp500 = lst[0]
sp500 = sp500.drop(['index'], axis = 1)
However, over 500 unique companies appear in those four datasets, too many to work with. As such, this code block scrapes a list of the companies included in the S&P 500 financial index.
The S&P 500 (Standard & Poor's 500) is an index that tracks the performance of 500 large corporations listed on U.S. stock exchanges. Companies included in this index are Amazon, Apple, Netflix, Microsoft, and other large corporations.
This code block scrapes a GitHub page that lists these companies, along with their associated stock ticker symbols (e.g., AAPL, MSFT).
# Keep the first 50 S&P 500 constituents and restrict the price data to those tickers
top50 = sp500.head(50)
narrowed_prices = prices.merge(top50, how='inner', on= 'symbol')
Using the scraped data along with the prices file loaded above, the set of companies we are interested in can be narrowed from over 500 down to about 40: the first 50 S&P 500 constituents, minus any whose tickers do not appear in the prices data. Thus, we keep only data on companies in the S&P 500 index and eliminate the rest.
From this, we can analyze high-value corporations and draw trends from their data.
On a similar note, the merge also tells us which company is associated with each stock ticker, since prices.csv does not include the company name.
# Preview the merged data: each row now has the company name and sector alongside the prices
narrowed_prices.head()
# Convert the date strings to datetime objects so they can be plotted on a time axis
narrowed_prices['date'] = pd.to_datetime(narrowed_prices['date'])
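As a quick illustration of that ticker-to-company lookup, here is a minimal sketch of a helper that returns the company name for a given ticker. The name company_for_ticker is hypothetical, and the sketch assumes the ticker survived the merge above:
# Hypothetical helper: map a ticker symbol to its company name using the merged frame
def company_for_ticker(ticker):
    match = narrowed_prices.loc[narrowed_prices['symbol'] == ticker, 'name']
    return match.iloc[0] if not match.empty else None

company_for_ticker('AAPL')  # returns the company name if AAPL is among the narrowed tickers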
import matplotlib.dates as mpl_dates

lst = []
for symbol, company in narrowed_prices.groupby('symbol'):
    plt.figure()
    x = company['date']
    y = company['close']
    plt.title('{} - {}'.format(symbol, company['name'].iloc[0]))
    plt.xlabel('Time')
    plt.ylabel('Share Price')
    plt.plot(x, y)
    # Fit a least-squares line to the closing prices; its slope is the average daily change
    x = x.apply(mpl_dates.date2num)
    m, b = np.polyfit(x, y, 1)
    plt.plot(x, m*x + b, 'r')
    lst.append((company['name'].iloc[0], m))
    plt.show()

slope = pd.DataFrame(lst, columns = ['Company', 'Slope'])
slope.sort_values(by='Slope', ascending=False)
After converting the date column from a string to a datetime that can be plotted, we can chart each company's stock price over time. As the plots show, the stock prices of these companies generally increased over time, with a few notable exceptions: Google, Alliant Energy, Amphenol, and Alaska Air.
The sheer drop in these companies was not due to a sudden devaluation, but rather to a stock split. In layman's terms, that means the number of shares of the company is increased; the price of an individual share decreases, but the overall value of the company stays the same.
However, the y-axes of these graphs are not on a common scale. Amazon ranges from about 100 to 800, while Applied Materials only ranges from about 19 to 40. Companies therefore cannot be compared directly, because of this difference in share price ranges.
The slope of each line of best fit (the red line on each graph), however, can be used to determine which companies experienced the most growth over time.
As can be seen in the table, Amazon experienced the most growth, with an increase of about 23 cents in share price each day. That may not sound like much, but a holding of 1,000 Amazon shares would gain about 230 dollars in value per day on average. Assuming the market holds steady and these companies keep growing at a constant rate, these slopes give a rough indicator of worthwhile companies to invest in.
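The arithmetic behind that claim, as a small sketch (the 0.23 dollars/day slope for Amazon is taken from the table above and assumed here, not recomputed):
# Back-of-the-envelope check of the "230 dollars per day" figure
slope_per_day = 0.23   # assumed average daily increase in Amazon's share price, in dollars
shares_held = 1000
print(shares_held * slope_per_day)   # ~230 dollars of average daily paper gain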
for sector_name, sector in narrowed_prices.groupby('sector'):
    mean = []
    for day, daily in sector.groupby('date'):
        mean.append((day, daily['close'].mean()))
    plt.plot(*zip(*mean))
    plt.xlabel('Year')
    plt.ylabel('Share Price')
    plt.title(sector_name)
    plt.show()
Here, rather than plotting each company individually, the overall performance of each sector is plotted. The variation in share price is easy to notice, with high outliers throwing off the average value for a sector. For example, the Information Technology sector includes Google, and because of Google's high valuation compared to other IT companies, the sector average is heavily affected by changes in Google; the sharp drop in share price in mid-2014 is when Google's stock split occurred.
As such, the data must be transformed to keep this variation in share price from skewing the results.
np_scaled = narrowed_prices.copy()
# Standardize each price/volume column: subtract the column mean and divide by its standard deviation
for column in ['open', 'close', 'low', 'high', 'volume']:
    np_scaled[column] = (np_scaled[column] - np_scaled[column].mean())/np_scaled[column].std()
np_scaled.head()
To keep this variation from skewing the results, each price and volume column is standardized (converted to z-scores): the column mean is subtracted and the result is divided by the column's standard deviation.
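The same standardization can also be done with the StandardScaler imported earlier; a minimal sketch, not used in the rest of the tutorial:
# Equivalent z-scoring with scikit-learn's StandardScaler (sketch only; np_scaled above is what we use)
# (StandardScaler divides by the population standard deviation, so values differ negligibly from the pandas version)
cols = ['open', 'close', 'low', 'high', 'volume']
scaler = StandardScaler()
np_scaled_alt = narrowed_prices.copy()
np_scaled_alt[cols] = scaler.fit_transform(narrowed_prices[cols])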
for sector_name, sector in np_scaled.groupby('sector'):
    mean = []
    for day, daily in sector.groupby('date'):
        mean.append((day, daily['close'].mean()))
    plt.plot(*zip(*mean))
    plt.xlabel('Year')
    plt.ylabel('Standardized Share Price')
    plt.title(sector_name)
    plt.show()
Now that the data has been standardized, trends can be compared across sectors.
Looking at the graphs, some market sectors consistently have higher standardized share prices than others, visible just from the y-axis. While every sector's average share price increases over time, sectors like Communication Services sit above the overall market mean (positive standardized values), while others, like Energy, sit below it.
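One quick way to back up that observation numerically is to average the standardized closing price within each sector; a small sketch:
# Average standardized closing price per sector; positive means above the overall market mean
np_scaled.groupby('sector')['close'].mean().sort_values(ascending=False)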
# Overall market trend: average standardized closing price across all of the narrowed companies
mean = []
for day, daily in np_scaled.groupby('date'):
    mean.append((day, daily['close'].mean()))
plt.plot(*zip(*mean))
plt.xlabel('Year')
plt.ylabel('Standardized Share Price')
plt.title('Share Price over Time')
plt.show()
Now, let's move on to determining which company indicators matter most for a company's Earnings Per Share (EPS) value.
In layman's terms, a company's Earnings Per Share is its profit divided by the number of outstanding shares. The higher the EPS, the more profit the company makes per share, and thus the more money it should theoretically make for shareholders, barring outside circumstances.
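As a worked example of that definition (the numbers below are made up purely for illustration):
# EPS in its simplest form: profit divided by shares outstanding (ignoring preferred dividends)
net_income = 10_000_000         # hypothetical annual profit, in dollars
shares_outstanding = 5_000_000  # hypothetical share count
eps = net_income / shares_outstanding
print(eps)                      # 2.0 dollars of profit attributable to each share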
fund.columns
Above are all the variables included in the fundamentals.csv file; as can be seen, there are many. Our goal is to use this data to see which variables most directly affect a company's Earnings Per Share.
We are using EPS because we believe it to be the best single measure of a company's worth. Using linear regression, we will fit a regression equation and determine which variables are significant to the regression and which are not.
# Replace missing fundamentals with 0 so the regression can be fit on every row
fund = fund.fillna(0)
Linear Regression
Below, we run an initial linear regression model and analyze its summary chart. We drop the non-numerical columns of the fund data, fit an ordinary least squares linear regression with scikit-learn, and then refit the same model with statsmodels to get a summary table of coefficients and p-values.
reg = linear_model.LinearRegression()
y = 'Earnings Per Share'
# Use every numeric fundamentals column except the target and the identifier columns
x = fund.drop([y, 'Ticker Symbol', 'Period Ending', 'Unnamed: 0', 'For Year'], axis = 1)
reg.fit(x, fund[y])
# Refit with statsmodels (adding an intercept term) to get p-values for each predictor
x2 = sm.add_constant(x)
est = sm.OLS(fund[y], x2)
est2 = est.fit()
est2.summary()
Using a significance level of 0.05, we will drop all variables whose p-values (the P>|t| column in the chart above) are greater than 0.05. A p-value above that threshold means the variable is not significant to the regression as a whole and can be removed from the equation. As a source from Princeton explains, "with a P value of 5% (or .05) there is only a 5% chance that results you are seeing would have come up in a random distribution, so we can say with a 95% probability of being correct that the variable is having some effect, assuming your model is specified correctly."
Thus, that removes Accounts Payable, Accounts Receivable, Add'l income/expense items, After Tax ROE, Capital Surplus, Cash and Cash Equivalents, Common Stocks, Current Ratio, Deferred Asset Charges, Deferred Liability Charges, Earnings before Interest and Tax, Earnings before Tax, Effect of Exchange Rate, Equity Earnings/Loss Unconsolidated Subsidiary, Fixed Assets, Goodwill, Gross Profit, Intangible Assets, Interest Expense, Inventory, Investments, Liabilities, Long-Term Debt, Long-Term Investments, Minority Interest, Misc. Stocks, Net Borrowings, Net Cash Flow, Net Cash Flow-Operating, Net Cash Flows-Financing, Net Cash Flows-Investing, Net Receivables, Non-Recurring Items, Other Assets, Other Current Assets, Other Current Liabilities, Other Equity, Other Financing Activities, Other Investing Activities, Other Liabilities, Other Operating Activities, Pre-Tax Margin, Pre-Tax ROE, Profit Margin, Quick Ratio, Research and Development, Retained Earnings, Sale and Purchase of Stock, Short-Term Debt / Current Portion of Long-Term Debt, Short-Term Investments, Total Assets, Total Current Assets, and Treasury Stock as insignificant to the regression equation.
Thus, by using a bit of code and math, we were able to eliminate more than half of the variables, narrowing down the number of significant variables to the ones listed below.
Capital Expenditures, Cash Ratio, Changes in Inventories, Cost of Revenue, Depreciation, Gross Margin, Income Tax, Net Income, Net Income Adjustments, Net Income Applicable to Common Shareholders, Net Income-Cont Operations, Operating Income, Operating Margin, Other Operating Items, Sales General & Admin, Total Current Liabilities, Total Equity, Total Liabilities, Total Liabilities & Equity, Total Revenue, & Estimated Shares Outstanding.
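Rather than reading the p-values off the summary table by hand, the same filtering can be done programmatically; a small sketch using the fitted statsmodels result from above:
# Predictors whose p-values exceed the 0.05 significance level (excluding the intercept)
pvals = est2.pvalues.drop('const')
insignificant = pvals[pvals > 0.05].index.tolist()
significant = pvals[pvals <= 0.05].index.tolist()
print(insignificant)
print(significant)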
reg = linear_model.LinearRegression()
# Only the predictors that were significant in the first fit
arr = ['Capital Expenditures', 'Cash Ratio', 'Changes in Inventories', 'Cost of Revenue', 'Depreciation', 'Gross Margin', 'Income Tax',
       'Net Income', 'Net Income Adjustments', 'Net Income Applicable to Common Shareholders', 'Net Income-Cont. Operations', 'Operating Income',
       'Operating Margin', 'Other Operating Items', 'Sales, General and Admin.', 'Total Current Liabilities', 'Total Equity', 'Total Liabilities',
       'Total Liabilities & Equity', 'Total Revenue', 'Estimated Shares Outstanding']
y = 'Earnings Per Share'
reg.fit(fund[arr], fund[y])
# Refit with statsmodels on the reduced set of predictors
x2 = sm.add_constant(fund[arr])
est = sm.OLS(fund[y], x2)
est2 = est.fit()
est2.summary()
After eliminating the insignificant variables, we run the regression again to see if any more can be removed.
This time, Gross Margin and Net Income-Cont. Operations can be dropped as well. Their lack of significance was masked in the first fit by the noise from all of the other insignificant variables; with those removed, their p-values rise above 0.05, so they too can be removed.
reg = linear_model.LinearRegression()
y = 'Earnings Per Share'
# Final set of predictors, with Gross Margin and Net Income-Cont. Operations removed
x = ['Capital Expenditures', 'Cash Ratio', 'Changes in Inventories', 'Cost of Revenue', 'Depreciation', 'Income Tax',
     'Net Income', 'Net Income Adjustments', 'Net Income Applicable to Common Shareholders', 'Operating Income',
     'Operating Margin', 'Other Operating Items', 'Sales, General and Admin.', 'Total Current Liabilities', 'Total Equity', 'Total Liabilities',
     'Total Liabilities & Equity', 'Total Revenue', 'Estimated Shares Outstanding']
reg.fit(fund[x], fund[y])
x2 = sm.add_constant(fund[x])
est = sm.OLS(fund[y], x2)
est2 = est.fit()
est2.summary()
Finally, we arrive at the reduced regression equation. To read it, look at the coef column in the summary above: those are the coefficients of each variable in the equation. For example, if Cash Ratio increases by 1, the Earnings Per Share value increases by 0.0021, holding all other variables constant.
Thus, a predicted EPS value can be obtained by multiplying each variable's value by its coefficient and summing the results, along with the constant (intercept) term.
Many of these coefficients are small because the variables themselves take very large values. To produce an EPS value, which usually lies in the single digits, small coefficients are needed to counteract the large magnitudes of many of these variables.
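To make that concrete, here is a minimal sketch that rebuilds the predicted EPS for a single company-year term by term from the fitted coefficients (it assumes est2 and the predictor list x from the cell above are still in scope):
# Predicted EPS for the first row of the fundamentals data, built term by term
row = sm.add_constant(fund[x]).iloc[0]       # the intercept column plus the 19 predictor values
predicted_eps = (est2.params * row).sum()    # coefficient * value for every term, then summed
print(predicted_eps, fund['Earnings Per Share'].iloc[0])   # compare with the actual EPS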
Now, some of the variables that affect the EPS value the most are Cash Ratio, Operating Margin, Total Equity, and Total Liabilities.
The cash ratio is a measure of a company's liquidity. The numerator holds cash and cash equivalents, anything that is extremely liquid, which, in simple terms, means that even in the worst market conditions someone will be willing to buy it. The denominator is the company's current liabilities, the obligations due in the near term. A cash ratio of at least 1 means the company could cover those liabilities with cash on hand alone, which is generally what you want to see.
Operating margin measures how much of a company's revenue is left after paying the variable costs of production, such as making the product itself, but before interest and taxes (including income tax). It is calculated by dividing operating income by total revenue. In general, the higher the operating margin, the better off the company is.
Total equity is arguably the most important factor in our experiment, as it is used to determine a company's net worth. Total equity is the difference between a company's total assets and its total liabilities. If equity is positive, the company has enough assets to cover its liabilities; if it is negative, the opposite is true, and if that pattern goes unresolved it can lead to insolvency.
Total liabilities are a company's obligations to outside parties, usually in the form of debts. They are what the company owes, as opposed to its assets, which are what it owns. Liabilities are rarely useful on their own; they are usually combined with other measures to form more meaningful figures (such as the cash ratio).
For an even more in-depth explanation of these terms and their relevance to our tutorial, we recommend taking a look at Investopedia.
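These measures can also be recomputed directly from the raw fundamentals columns; a rough sketch, assuming the column names below match fundamentals.csv and ignoring rows where a denominator is zero:
# Recomputing cash ratio, operating margin, and total equity from the raw columns (sketch only)
checks = pd.DataFrame({
    'cash_ratio_recomputed': fund['Cash and Cash Equivalents'] / fund['Total Current Liabilities'],
    'operating_margin_recomputed': fund['Operating Income'] / fund['Total Revenue'],
    'total_equity_recomputed': fund['Total Assets'] - fund['Total Liabilities'],
})
checks.head()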
# Hold out 20% of the rows as a test set
x_train, x_test, y_train, y_test = train_test_split(fund[x], fund[y], test_size = 0.2, random_state = 0)
reg = linear_model.LinearRegression()
reg.fit(x_train, y_train)
# Refit with statsmodels on the training data only, to get the summary for this split
x2 = sm.add_constant(x_train)
est = sm.OLS(y_train, x2)
est2 = est.fit()
est2.summary()
# Predict EPS for the held-out rows and line the predictions up against the actual values
y_pred = reg.predict(x_test)
diff = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
diff
# Residuals: actual minus predicted EPS for each held-out row
residuals = diff['Actual'] - diff['Predicted']
plt.scatter(residuals, y_pred)
plt.xlabel('Residual Value')
plt.ylabel('Predicted Value')
plt.title('Residuals vs Predicted Value')
plt.show()
import scipy as sp
import scipy.stats   # ensure the stats submodule is loaded
# Q-Q (probability) plot of the residuals against a normal distribution;
# r**2 is the squared correlation of the plot, a rough measure of how normal the residuals look
fig, ax = plt.subplots(figsize = (6,2.5))
_, (_, _, r) = sp.stats.probplot(residuals, plot=ax, fit=True)
r**2
The past few sections of code measure how accurate the regression is when it has to predict EPS values from the rest of the data.
According to the resulting output, around 77% of the variation in Earnings Per Share can be explained by the model (an R-squared of about 0.77). That does not mean the model is right 77 percent of the time; it means the model captures most, but not all, of the variation in EPS.
The scatterplot displays the residuals, the differences between the actual EPS values and the predicted ones. Most of the points cluster around a residual of 0, indicating that the predictions were mostly accurate, with a few outliers. The probability plot below the scatterplot tells a similar story: for the most part the residuals follow a normal distribution, but there are more extreme residuals than a normal distribution would produce, which is why the plot curves away from the line at both ends. In statistical terms, the residuals have "heavy tails"; the model's worst misses tend to happen for companies whose fundamentals lie far from the average.
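An out-of-sample version of this check can be computed on the held-out test set with the metrics module imported at the top; a quick sketch (the value will not exactly match the in-sample R-squared):
# R^2 and mean absolute error of the linear model on the held-out test set
print(metrics.r2_score(y_test, y_pred))
print(metrics.mean_absolute_error(y_test, y_pred))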
fund_scaled = fund.copy()
x = ['Capital Expenditures', 'Cash Ratio', 'Changes in Inventories', 'Cost of Revenue', 'Depreciation', 'Income Tax',
     'Net Income', 'Net Income Adjustments', 'Net Income Applicable to Common Shareholders', 'Operating Income',
     'Operating Margin', 'Other Operating Items', 'Sales, General and Admin.', 'Total Current Liabilities', 'Total Equity', 'Total Liabilities',
     'Total Liabilities & Equity', 'Total Revenue', 'Estimated Shares Outstanding']
# Standardize each predictor (but not EPS itself), so the coefficients become comparable
for column in x:
    fund_scaled[column] = (fund_scaled[column] - fund_scaled[column].mean())/fund_scaled[column].std()
fund_scaled.head()
y = 'Earnings Per Share'
x2 = sm.add_constant(fund_scaled[x])
est = sm.OLS(fund_scaled[y], x2)
est2 = est.fit()
est2.summary()
By standardizing the predictors, the incredibly large values are reduced to a common scale, and the coefficients become easier to compare: each one now tells us how much EPS changes when that variable rises by one standard deviation, holding the others constant. According to the data, Net Income plays a large role in pushing Earnings Per Share up, with a coefficient of about 12, while Net Income Applicable to Common Shareholders pushes EPS down as it rises, with a coefficient of about -10.
Thus, according to the data, the signs of an increase in Earnings Per Share are Cash Ratio, Cost of Revenue, Net Income, Operating Income, Other Operating Items, Sales, General and Admin., and Total Equity: these variables are directly related to EPS, and when they increase, so does EPS.
On the other hand, Earnings Per Share falls when Capital Expenditures, Changes in Inventories, Depreciation, Income Tax, Net Income Adjustments, Net Income Applicable to Common Shareholders, Operating Margin, Total Current Liabilities, Total Liabilities, Total Liabilities & Equity, Total Revenue, and Estimated Shares Outstanding rise.
In one sentence: a one-standard-deviation increase in Net Income raises predicted EPS by about 12, while a one-standard-deviation increase in Net Income Applicable to Common Shareholders lowers it by about 10.
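The ranking described in the last few paragraphs can be read straight out of the standardized fit; a small sketch (est2 here is the fit on fund_scaled from the previous cell):
# Standardized coefficients sorted from most negative to most positive effect on EPS
est2.params.drop('const').sort_values()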
import seaborn as sns
# Correlations among the predictors with the largest standardized coefficients
x2 = ['Cost of Revenue', 'Depreciation', 'Income Tax',
      'Net Income', 'Net Income Applicable to Common Shareholders', 'Operating Income',
      'Operating Margin', 'Sales, General and Admin.', 'Total Revenue']
sns.heatmap(fund[x2].corr(), cmap='Blues', annot=True)
plt.title('Relevant Independent Factors vs. Other Relevant Independent Factors')
plt.show()
In this correlation heatmap, you can see how the predictors correlate with each other, rather than their effect on EPS, which is what we have mostly discussed throughout this tutorial. Keeping only the predictors whose standardized coefficients were reasonably large (roughly outside the range -1 to 1), we see some strong correlations.
For example, Operating Margin does not correlate strongly with any of the other factors, yet it still affects EPS, so it is very much an independent factor in determining EPS trends. On the other hand, Net Income has correlations of .92, .88, and .67 with Operating Income, Income Tax, and Total Revenue, respectively. As we know from the linear regression summary above, Net Income has a big impact on Earnings Per Share, and these strong correlations tie those factors to EPS as well (this is fairly obvious if you look at how Net Income is calculated, but seeing the data helps solidify the finding).
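The specific numbers quoted above can be pulled straight out of the correlation matrix rather than read off the heatmap; a quick sketch:
# Correlations of every plotted factor with Net Income, strongest first
fund[x2].corr()['Net Income'].sort_values(ascending=False)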
import tensorflow as tf
from tensorflow import keras
from keras.models import Sequential
from keras.layers import Dense

# Hold out 8% of the rows for testing and standardize the predictors
x_train, x_test, y_train, y_test = train_test_split(fund[x], fund[y], test_size = 0.08, random_state = 0)
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)
y_test = y_test.reset_index(drop = True)

# A small fully connected network: 19 inputs -> three hidden layers of 32 ReLU units -> 1 output (EPS)
model = Sequential()
model.add(Dense(32, activation = 'relu', input_dim = 19))
model.add(Dense(units = 32, activation = 'relu'))
model.add(Dense(units = 32, activation = 'relu'))
model.add(Dense(units = 1))
model.compile(optimizer = 'adam', loss = 'mean_squared_error')
model.fit(x_train, y_train, batch_size = 10, epochs = 100)

# Predict EPS for the held-out rows and plot the predictions against the actual values
y_pred = model.predict(x_test)
plt.plot(y_test, color = 'red', label = 'Real data')
plt.plot(y_pred, color = 'blue', label = 'Predicted data')
plt.title('Prediction')
plt.legend()
plt.show()
However, the linear regression alone does not fully explain the variation in Earnings Per Share, so the relationships between the 19 variables and EPS may not be linear. To explore this, we created a neural network to try to predict Earnings Per Share from the same data.
The predictors are standardized first, to keep variables with very large values from dominating the training and to make the inputs easier for the network to handle.
Then the neural network is built and trained. Training proceeds in mini-batches of 10 samples, and each epoch is one full pass over the training data; the loss decreases as training goes on because the optimizer keeps adjusting the network's weights toward values that fit the data better. In other words, the network learns over time.
After that, the model is evaluated on the held-out test data, and its predictions are compared to the actual values. As can be seen in the graph above, the model does a rather good job of predicting Earnings Per Share from the varied fundamentals in fundamentals.csv. With that, we believe the neural network can predict EPS values more accurately than the linear model, so it could be used on new data to predict future Earnings Per Share values.
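To put a number on that visual comparison, the held-out predictions can be scored with the same metrics module used earlier; a short sketch (model.predict returns a column vector, so it is flattened first):
# Mean absolute error and R^2 of the neural network on the held-out test set
nn_pred = model.predict(x_test).flatten()
print(metrics.mean_absolute_error(y_test, nn_pred))
print(metrics.r2_score(y_test, nn_pred))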