Open In Colab

New York Stock Exchange Analysis

By Nathaniel Pereyra and Tara Choudhary

Welcome to our CMSC320 Final Tutorial! This tutorial uses data found here on kaggle.com that details New York Stock Exchange Information from 2012 to 2016.

We will use this data to discover trends in stock prices over time, which financial sectors do better overall compared to other sectors, as well as attempt to narrow down the important indicators of high-value stocks and explain them in an understandable way.

Acknowledgements: data in prices.csv was fetched from Yahoo Finance, and data in fundamentals.csv is from Nasdaq Financials, extended by some fields from EDGAR SEC databases.

In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
from sklearn import linear_model
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn import metrics
from bs4 import BeautifulSoup as bs
import requests
import statsmodels.api as sm
import statsmodels.formula.api as smf
/usr/local/lib/python3.7/dist-packages/statsmodels/tools/_testing.py:19: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
  import pandas.util.testing as tm
In [ ]:
mpl.rcParams['figure.max_open_warning'] = 50
In [ ]:
url = 'https://raw.githubusercontent.com/ndpereyra/CMSC320Final-Tutorial/main/Datasets/fundamentals.csv?token=AQ2IACJRMVYGB5ARTLUIWXTBYOU34'
fund = pd.read_csv(url)
url = 'https://raw.githubusercontent.com/ndpereyra/CMSC320Final-Tutorial/main/Datasets/prices-split-adjusted.csv?token=AQ2IACNSNIKEM4UF2ZTRAQLBYOU5Y'
prices_split = pd.read_csv(url)
url = 'https://raw.githubusercontent.com/ndpereyra/CMSC320Final-Tutorial/main/Datasets/prices.csv?token=AQ2IACOTBNAQV44AMJZO2HTBYOU7Q'
prices = pd.read_csv(url)
url = 'https://raw.githubusercontent.com/ndpereyra/CMSC320Final-Tutorial/main/Datasets/securities.csv?token=AQ2IACO6GXWT4KUMH6IAZQLBYOVBW'
securities = pd.read_csv(url)

The above pieces of code import the necessary libraries, as well as import the files containing the data about stock prices over time, as well as fundamentals about each company mentioned in the data (ie the number of shares traded, the outstanding debt of the company, their net revenue, etc)

In [ ]:
r = requests.get("https://github.com/datasets/s-and-p-500-companies/blob/master/data/constituents.csv")
soup = bs(r.content, 'html.parser')
table = soup.find("table").find("tbody")
lst = pd.read_html(r.content, header = 0, flavor = 'bs4')
lst[0].columns = ['index', 'symbol', 'name', 'sector']

sp500 = lst[0]
sp500 = sp500.drop(['index'], axis = 1)

However, over 500 unique companies are listed in those 4 datasets, too many to work with. As such, this code block scrapes data that lists the companies included in the S&P500 financial index.

Known as the Standard's and Poor 500, the S&P 500 is an index that tracks the performance of 500 corporations and companies listed on the NYSE. Some companies included in this index are Amazon, Apple, Netflix, Microsoft, and other large corporations.

This code block scrapes a Github site that lists these companies, along with their associated stock ticker value (ie AAPL, MSFT)

In [ ]:
top50 = sp500.head(50)
In [ ]:
narrowed_prices = prices.merge(top50, how='inner', on= 'symbol')

Using the scrapped data, along with the prices file obtained above, the number of companies that we are interested in can be narrowed down from 500 to 40 companies. Thus, we have obtained only the data on companies mentioned in the S&P 500 index, and eliminated the rest.

From this, we can analyze high value corporations and draw trends from their data.

On a similar note, we can now determine what companies are associated with a particular stock ticker, as the prices.csv file did not include the company name in the data

In [ ]:
narrowed_prices.head()
Out[ ]:
date symbol open close low high volume name sector
0 2010-01-04 A 31.389999 31.300001 31.130000 31.630001 3815500.0 Agilent Technologies Health Care
1 2010-01-05 A 31.209999 30.960001 30.760000 31.220001 4186000.0 Agilent Technologies Health Care
2 2010-01-06 A 30.850001 30.850001 30.760000 31.000001 3243700.0 Agilent Technologies Health Care
3 2010-01-07 A 30.780001 30.809999 30.500000 30.820001 3095100.0 Agilent Technologies Health Care
4 2010-01-08 A 30.640000 30.800000 30.399999 30.850001 3733900.0 Agilent Technologies Health Care
In [ ]:
narrowed_prices['date'] = pd.to_datetime(narrowed_prices['date'])
In [ ]:
import matplotlib.dates as mpl_dates
lst = []
for i, row in narrowed_prices.groupby('symbol'):
    plt.figure()
    x = row['date']
    y = row['close']
    plt.title('{} - {}'.format(i, row['name'].any()))
    plt.xlabel('Time')
    plt.ylabel('Share Price')
    plt.plot(x, y)
    x = x.apply(mpl_dates.date2num)
    m,b = np.polyfit(x, y, 1)
    plt.plot(x, m*x + b, 'r')
    lst.append((row['name'].any(), m))
    plt.show
In [ ]:
slope = pd.DataFrame(lst, columns = ['Company', 'Slope'])
slope.sort_values(by='Slope', ascending=False)
Out[ ]:
Company Slope
27 Amazon 0.232741
37 Alphabet (Class A) 0.082971
36 Alphabet (Class C) 0.072215
2 Advance Auto Parts 0.055262
24 Amgen 0.054489
39 3M 0.043415
28 Anthem 0.041811
25 Ameriprise Financial 0.034668
31 Air Products & Chemicals 0.033481
8 Adobe 0.032541
29 Aon 0.031997
5 AmerisourceBergen 0.031539
7 Accenture 0.030937
26 American Tower 0.028056
34 American Water Works 0.020809
21 Allegion 0.020282
20 Allstate Corp 0.020133
1 American Airlines Group 0.018608
40 Altria Group 0.018563
4 AbbVie 0.016291
35 American Express 0.015733
9 Analog Devices 0.014275
17 Akamai Technologies 0.013708
15 American International Group 0.013643
12 American Electric Power 0.013418
33 Activision Blizzard 0.011847
19 Alaska Air Group 0.011409
16 Arthur J. Gallagher & Co. 0.010426
38 Alliant Energy 0.009699
11 Ameren Corp 0.009684
14 Aflac 0.009355
18 Albemarle Corporation 0.008524
10 ADM 0.007787
32 Amphenol 0.007666
22 Applied Materials 0.005718
23 Ametek 0.004311
0 Agilent Technologies 0.004060
13 AES Corp 0.000208
6 Abbott Laboratories -0.005761
30 APA Corporation -0.025184
3 Apple -0.120091

After converting the date column from a string to one that can be used with graphs, we can now plot the stock price of a company over time. As can be seen merely from the data, the stock prices of these companies generally increased over time, with a few notable exceptions - Google, Alliant Energy, Amphenol, and Alaska Air

The shear drop in these companies was not due to a sudden devaluation of the companies, but rather due to a stock split occuring. In layman's rerms, that means that the number of shares of a company are increased. This causes the price of an individual share to decrease, but the overall value of the company remains the same.

However, when looking at the data, the y-axis of the graphs are not standard. Amazon ranges from 100 to 800, while Applied Materials only has a range from 19 to around 40. Thus, companies cannot be easily compared to one another due to this range in share prices.

However, the slope of the lines of best fit, or the red line on each graph, can be used to determine which companies experienced the most growth over time.

As can be seen in the table, Amazon experienced the most growth, with an increase of abour 23 cents in stock price each day. Although that does not seem like a lot, an investment of 1000 dollars in Amazon stock would yield 230 dollars per day on average. Furthermore, assuming the market holds steady, and all these companies experience constant growth, these values can be used as a good indicator of worthwhile companies to invest in.

In [ ]:
for i, sector in narrowed_prices.groupby('sector'):
    mean = []
    for j, date in sector.groupby('date'):
        mean.append((j, date['close'].mean()))
    plt.plot(*zip(*mean))
    plt.xlabel('Year')
    plt.ylabel('Share Price')
    plt.title(i)
    plt.show()

Here, rather tha plotting each company individually, the overall preformance of each sector is plotted. However, here the variation in share price can be noticed easily, with high outliers throwing off the average value for a sector. For example, the Information Technology included Google, and due to the high valuation of Google compared to other IT related companies, the sector is greatly affected by changes in Google, as can be seen with the high drop in share price in mid 2014, as that is when the stock split occured for Google.

As such, the data must be operated upon to prevent the variation in share price from skewing the data

In [ ]:
np_scaled = narrowed_prices.copy()
for column in ['open', 'close', 'low', 'high','volume']:
    np_scaled[column] = (np_scaled[column] - np_scaled[column].mean())/np_scaled[column].std()
np_scaled.head()
Out[ ]:
date symbol open close low high volume name sector
0 2010-01-04 A -0.449923 -0.450725 -0.449470 -0.450657 -0.147011 Agilent Technologies Health Care
1 2010-01-05 A -0.451073 -0.452899 -0.451856 -0.453256 -0.126100 Agilent Technologies Health Care
2 2010-01-06 A -0.453373 -0.453602 -0.451856 -0.454651 -0.179282 Agilent Technologies Health Care
3 2010-01-07 A -0.453820 -0.453858 -0.453533 -0.455792 -0.187669 Agilent Technologies Health Care
4 2010-01-08 A -0.454715 -0.453921 -0.454178 -0.455602 -0.151616 Agilent Technologies Health Care

To prevent this variation from skewing the data, the data is normalized along each column

In [ ]:
for i, sector in np_scaled.groupby('sector'):
    mean = []
    for j, date in sector.groupby('date'):
        mean.append((j, date['close'].mean()))
    plt.plot(*zip(*mean))
    plt.xlabel('Year')
    plt.ylabel('Share Price')
    plt.title(i)
    plt.show()

Now, since the data has been normalized, trends can be easily determined from the data.

When looking at the graphs, for example, it can be determined that some market sectors often have higher share prices than others, just by looking at the y-axis. While all sectors, on average, increase in share price over time, sectors like Communication Services have positive values for share prices, while others like Energy, do not.

In [ ]:
mean = []
for j, date in sector.groupby('date'):
    mean.append((j, date['close'].mean()))
plt.plot(*zip(*mean))
plt.xlabel('Year')
plt.ylabel('Share Price')
plt.title('Share Price over Time')
plt.show()

Determining the most important factors when researching stocks

Now, let's move onto determining which company indicators are the most important in determining a company's Earnings Per Share value.

In layman's terms, a company's Earnings Per Share is a company's profit divide dby the number of outstanding shares. Basically, the higher the EPS value, the more profit a company makes, and thus the more money a company should theoretically make for shareholders, baring outside cirumstances

In [ ]:
fund.columns
Out[ ]:
Index(['Unnamed: 0', 'Ticker Symbol', 'Period Ending', 'Accounts Payable',
       'Accounts Receivable', 'Add'l income/expense items', 'After Tax ROE',
       'Capital Expenditures', 'Capital Surplus', 'Cash Ratio',
       'Cash and Cash Equivalents', 'Changes in Inventories', 'Common Stocks',
       'Cost of Revenue', 'Current Ratio', 'Deferred Asset Charges',
       'Deferred Liability Charges', 'Depreciation',
       'Earnings Before Interest and Tax', 'Earnings Before Tax',
       'Effect of Exchange Rate',
       'Equity Earnings/Loss Unconsolidated Subsidiary', 'Fixed Assets',
       'Goodwill', 'Gross Margin', 'Gross Profit', 'Income Tax',
       'Intangible Assets', 'Interest Expense', 'Inventory', 'Investments',
       'Liabilities', 'Long-Term Debt', 'Long-Term Investments',
       'Minority Interest', 'Misc. Stocks', 'Net Borrowings', 'Net Cash Flow',
       'Net Cash Flow-Operating', 'Net Cash Flows-Financing',
       'Net Cash Flows-Investing', 'Net Income', 'Net Income Adjustments',
       'Net Income Applicable to Common Shareholders',
       'Net Income-Cont. Operations', 'Net Receivables', 'Non-Recurring Items',
       'Operating Income', 'Operating Margin', 'Other Assets',
       'Other Current Assets', 'Other Current Liabilities', 'Other Equity',
       'Other Financing Activities', 'Other Investing Activities',
       'Other Liabilities', 'Other Operating Activities',
       'Other Operating Items', 'Pre-Tax Margin', 'Pre-Tax ROE',
       'Profit Margin', 'Quick Ratio', 'Research and Development',
       'Retained Earnings', 'Sale and Purchase of Stock',
       'Sales, General and Admin.',
       'Short-Term Debt / Current Portion of Long-Term Debt',
       'Short-Term Investments', 'Total Assets', 'Total Current Assets',
       'Total Current Liabilities', 'Total Equity', 'Total Liabilities',
       'Total Liabilities & Equity', 'Total Revenue', 'Treasury Stock',
       'For Year', 'Earnings Per Share', 'Estimated Shares Outstanding'],
      dtype='object')

Above are all the variables included in the fundamentals.csv file. As can be seen, there are many different variables. Our goal is to use this data, and see what variables most directly affect a company's Earnings Per Share value.

We are using EPS, as we believe it to be the best measure of a company's worth. Then, using linear regression, we will create a regression equation and determine which variables are significant to the regression, and which are not

In [ ]:
fund = fund.fillna(0)

Linear Regression

Below, we are running an initial linear regression model + analyzing its summary chart. We opted to drop the non-numerical columns of the fund data and used SKLearn to run an ordinary least squares Linear Regression, and then fit it to a linear model.

In [ ]:
reg = linear_model.LinearRegression()
y = 'Earnings Per Share'
x = fund.drop([y, 'Ticker Symbol', 'Period Ending', 'Unnamed: 0', 'For Year'], axis = 1)
reg.fit(x, fund[y])
Out[ ]:
LinearRegression()
In [ ]:
x2 = sm.add_constant(x)
est = sm.OLS(fund[y], x2)
est2 = est.fit()
est2.summary()
Out[ ]:
OLS Regression Results
Dep. Variable: Earnings Per Share R-squared: 0.353
Model: OLS Adj. R-squared: 0.326
Method: Least Squares F-statistic: 12.77
Date: Mon, 20 Dec 2021 Prob (F-statistic): 1.20e-114
Time: 16:43:45 Log-Likelihood: -4830.6
No. Observations: 1781 AIC: 9809.
Df Residuals: 1707 BIC: 1.022e+04
Df Model: 73
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 3.2541 0.312 10.438 0.000 2.643 3.866
Accounts Payable -6.602e-11 4.91e-11 -1.345 0.179 -1.62e-10 3.02e-11
Accounts Receivable 9.919e-11 2.14e-10 0.464 0.642 -3.2e-10 5.18e-10
Add'l income/expense items 7.186e-11 2.59e-10 0.277 0.782 -4.37e-10 5.8e-10
After Tax ROE -0.0012 0.003 -0.466 0.641 -0.006 0.004
Capital Expenditures -2.52e-10 1.21e-10 -2.087 0.037 -4.89e-10 -1.52e-11
Capital Surplus 1.115e-10 7.66e-11 1.454 0.146 -3.89e-11 2.62e-10
Cash Ratio 0.0083 0.003 2.927 0.003 0.003 0.014
Cash and Cash Equivalents -1.038e-11 2e-11 -0.521 0.603 -4.95e-11 2.87e-11
Changes in Inventories -5.174e-10 2.66e-10 -1.947 0.052 -1.04e-09 3.72e-12
Common Stocks 1.351e-10 8.18e-11 1.652 0.099 -2.53e-11 2.95e-10
Cost of Revenue 3.313e-10 1.25e-10 2.643 0.008 8.55e-11 5.77e-10
Current Ratio 0.0002 0.002 0.077 0.938 -0.004 0.005
Deferred Asset Charges 9.154e-11 8.15e-11 1.122 0.262 -6.84e-11 2.51e-10
Deferred Liability Charges 3.327e-11 6.83e-11 0.487 0.626 -1.01e-10 1.67e-10
Depreciation -5.807e-10 1.41e-10 -4.117 0.000 -8.57e-10 -3.04e-10
Earnings Before Interest and Tax -8.351e-10 2.27e-09 -0.369 0.712 -5.28e-09 3.61e-09
Earnings Before Tax 1.454e-09 2.26e-09 0.642 0.521 -2.99e-09 5.9e-09
Effect of Exchange Rate -1.879e-09 9.85e-10 -1.907 0.057 -3.81e-09 5.36e-11
Equity Earnings/Loss Unconsolidated Subsidiary -6.354e-11 4.77e-10 -0.133 0.894 -1e-09 8.73e-10
Fixed Assets 2.76e-11 3.05e-11 0.905 0.366 -3.22e-11 8.74e-11
Goodwill 4.657e-11 3.71e-11 1.255 0.210 -2.62e-11 1.19e-10
Gross Margin 0.0117 0.005 2.206 0.028 0.001 0.022
Gross Profit 5.869e-11 9.9e-11 0.593 0.553 -1.35e-10 2.53e-10
Income Tax -1.143e-09 4.71e-10 -2.427 0.015 -2.07e-09 -2.19e-10
Intangible Assets 3.952e-11 3.93e-11 1.006 0.315 -3.75e-11 1.17e-10
Interest Expense 1.004e-09 2.23e-09 0.450 0.653 -3.37e-09 5.38e-09
Inventory 7.635e-11 6.83e-11 1.117 0.264 -5.77e-11 2.1e-10
Investments 3.389e-11 3.63e-11 0.935 0.350 -3.72e-11 1.05e-10
Liabilities 1.665e-11 8.07e-11 0.206 0.837 -1.42e-10 1.75e-10
Long-Term Debt -6.964e-11 4.56e-11 -1.527 0.127 -1.59e-10 1.98e-11
Long-Term Investments -2.844e-12 2.46e-11 -0.116 0.908 -5.1e-11 4.53e-11
Minority Interest -1.469e-10 9.32e-11 -1.576 0.115 -3.3e-10 3.59e-11
Misc. Stocks -2.384e-10 5.41e-10 -0.440 0.660 -1.3e-09 8.24e-10
Net Borrowings -7.611e-11 1.48e-10 -0.514 0.608 -3.67e-10 2.15e-10
Net Cash Flow 2.472e-10 4.09e-10 0.604 0.546 -5.55e-10 1.05e-09
Net Cash Flow-Operating -2.811e-10 4.04e-10 -0.697 0.486 -1.07e-09 5.1e-10
Net Cash Flows-Financing -2.383e-10 4.38e-10 -0.545 0.586 -1.1e-09 6.2e-10
Net Cash Flows-Investing -3.126e-10 4.1e-10 -0.762 0.446 -1.12e-09 4.92e-10
Net Income 5.477e-09 1.55e-09 3.523 0.000 2.43e-09 8.53e-09
Net Income Adjustments -1.202e-10 6.08e-11 -1.975 0.048 -2.4e-10 -8.51e-13
Net Income Applicable to Common Shareholders -4.486e-09 1.54e-09 -2.917 0.004 -7.5e-09 -1.47e-09
Net Income-Cont. Operations -9.252e-10 4.46e-10 -2.076 0.038 -1.8e-09 -5.12e-11
Net Receivables 6.704e-11 4.76e-11 1.407 0.160 -2.64e-11 1.6e-10
Non-Recurring Items -1.168e-10 1.52e-10 -0.770 0.442 -4.15e-10 1.81e-10
Operating Income 4.939e-10 2.19e-10 2.259 0.024 6.5e-11 9.23e-10
Operating Margin -0.0563 0.012 -4.513 0.000 -0.081 -0.032
Other Assets -1.66e-11 3.24e-11 -0.513 0.608 -8.01e-11 4.69e-11
Other Current Assets -2.899e-11 9.31e-11 -0.311 0.755 -2.12e-10 1.54e-10
Other Current Liabilities 1.628e-11 3.65e-11 0.447 0.655 -5.52e-11 8.78e-11
Other Equity 1.736e-10 9.03e-11 1.923 0.055 -3.44e-12 3.51e-10
Other Financing Activities -8.584e-11 1.48e-10 -0.579 0.562 -3.76e-10 2.05e-10
Other Investing Activities 5.596e-13 4.92e-11 0.011 0.991 -9.59e-11 9.7e-11
Other Liabilities 9.984e-13 3.83e-11 0.026 0.979 -7.41e-11 7.61e-11
Other Operating Activities -1.654e-11 6.83e-11 -0.242 0.809 -1.51e-10 1.18e-10
Other Operating Items 2.259e-10 1.04e-10 2.170 0.030 2.17e-11 4.3e-10
Pre-Tax Margin 0.0298 0.016 1.887 0.059 -0.001 0.061
Pre-Tax ROE 0.0008 0.002 0.454 0.650 -0.003 0.004
Profit Margin -0.0190 0.016 -1.200 0.230 -0.050 0.012
Quick Ratio -0.0059 0.004 -1.619 0.106 -0.013 0.001
Research and Development -7.897e-11 1.6e-10 -0.494 0.621 -3.92e-10 2.35e-10
Retained Earnings 1.344e-10 7.62e-11 1.763 0.078 -1.51e-11 2.84e-10
Sale and Purchase of Stock -6.602e-11 1.59e-10 -0.416 0.677 -3.77e-10 2.45e-10
Sales, General and Admin. 2.349e-10 9.49e-11 2.475 0.013 4.87e-11 4.21e-10
Short-Term Debt / Current Portion of Long-Term Debt 4.58e-11 4.64e-11 0.986 0.324 -4.53e-11 1.37e-10
Short-Term Investments -2.811e-11 5.55e-11 -0.506 0.613 -1.37e-10 8.08e-11
Total Assets -7.968e-10 4.33e-10 -1.840 0.066 -1.65e-09 5.27e-11
Total Current Assets 5.761e-11 3.63e-11 1.585 0.113 -1.37e-11 1.29e-10
Total Current Liabilities -1.357e-10 4.17e-11 -3.253 0.001 -2.18e-10 -5.39e-11
Total Equity 3.341e-05 3.2e-06 10.429 0.000 2.71e-05 3.97e-05
Total Liabilities 3.341e-05 3.2e-06 10.429 0.000 2.71e-05 3.97e-05
Total Liabilities & Equity -3.341e-05 3.2e-06 -10.428 0.000 -3.97e-05 -2.71e-05
Total Revenue -3.23e-10 1.25e-10 -2.590 0.010 -5.68e-10 -7.84e-11
Treasury Stock 1.103e-10 7.76e-11 1.421 0.156 -4.2e-11 2.62e-10
Estimated Shares Outstanding -1.05e-09 1.84e-10 -5.698 0.000 -1.41e-09 -6.89e-10
Omnibus: 1750.027 Durbin-Watson: 0.980
Prob(Omnibus): 0.000 Jarque-Bera (JB): 140866.472
Skew: 4.490 Prob(JB): 0.00
Kurtosis: 45.634 Cond. No. 2.25e+15


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.25e+15. This might indicate that there are
strong multicollinearity or other numerical problems.

Using a significance value of 0.05, we will drop all variables with p-values (P>|t| in the chart above) that are greater than 0.05. This means that that variable is not significant to the regression as a whole, and can be removed from the equation. As a source from Princeton explains, "with a P value of 5% (or .05) there is only a 5% chance that results you are seeing would have come up in a random distribution, so we can say with a 95% probability of being correct that the variable is having some effect, assuming your model is specified correctly."

Thus, that removes Accounts Payable, Accounts Recievable, Add'l income/expense items, After Tax ROE, Capital Surplus, Cash and Cash Equivalents, Common Stocks, Current Ratio, Deferred Asset Charges, Deferred Liability Charges, Earnings before Interest and Tax, Earnings before Tax, Effect of Exchange Rate, Equity Earnings/Loss Unconsolidated Subsidiary, Fixed Assets, Goodwill, Gross Profit, Intangible Assets, Interest Expense, Inventory, Investments, Liabilities, Long-Term Debt, Long-Term Investments, Minority Interest, Misc. Stocks, Net Borrowings, Net Cash Flow, Net Cash Flow-Operating, Net Cash Flows-Financing, Net Cash Flows-Investing, Net Receivables, Net-Recurring Items, Other Assets, Other Current Assets, Other Current Liabilities, Other Equity, Other Financing Activities, Other Investing Activities, Other Liabilities, Other Operating Activities, Pre-Tax Margin, Pre-Tax ROE, Profit Margin, Quick Ratio, Research and Development, Retained Earnings, Sale and Purchase of Stock, Short-Term Debt / Current Portion of Long-Term Debt, Short-Term Investments, Total Assets, Total Current Assets, and Treasury Stock as insignificant to the regression equation.

Thus, by using a bit of code and math, we were able to eliminate more than half of the variables, narrowing down the number of significant variables to the ones listed below.

Capital Expenditures, Cash Ratio, Changes in Inventories, Cost of Revenue, Depreciation, Gross Margin, Income Tax, Net Income, Net Income Adjustments, Net Income Applicable to Common Shareholders, Net Income-Cont Operations, Operating Income, Operating Margin, Other Operating Items, Sales General & Admin, Total Current Liabilities, Total Equity, Total Liabilities, Total Liabilities & Equity, Total Revenue, & Estimated Shares Outstanding.

In [ ]:
reg = linear_model.LinearRegression()
arr = ['Capital Expenditures', 'Cash Ratio', 'Changes in Inventories', 'Cost of Revenue', 'Depreciation', 'Gross Margin', 'Income Tax',
       'Net Income', 'Net Income Adjustments', 'Net Income Applicable to Common Shareholders', 'Net Income-Cont. Operations', 'Operating Income', 
       'Operating Margin', 'Other Operating Items', 'Sales, General and Admin.', 'Total Current Liabilities', 'Total Equity', 'Total Liabilities',
       'Total Liabilities & Equity', 'Total Revenue', 'Estimated Shares Outstanding']
y = 'Earnings Per Share'
x = fund.drop([y, 'Ticker Symbol', 'Period Ending', 'Unnamed: 0', 'For Year'], axis = 1)
reg.fit(fund[arr], fund[y])
Out[ ]:
LinearRegression()
In [ ]:
x2 = sm.add_constant(fund[arr])
est = sm.OLS(fund[y], x2)
est2 = est.fit()
est2.summary()
Out[ ]:
OLS Regression Results
Dep. Variable: Earnings Per Share R-squared: 0.313
Model: OLS Adj. R-squared: 0.305
Method: Least Squares F-statistic: 40.13
Date: Mon, 20 Dec 2021 Prob (F-statistic): 7.21e-128
Time: 16:43:46 Log-Likelihood: -4884.0
No. Observations: 1781 AIC: 9810.
Df Residuals: 1760 BIC: 9925.
Df Model: 20
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 3.4717 0.219 15.830 0.000 3.042 3.902
Capital Expenditures -1.755e-10 7.26e-11 -2.416 0.016 -3.18e-10 -3.3e-11
Cash Ratio 0.0020 0.001 2.044 0.041 8.21e-05 0.004
Changes in Inventories -5.222e-10 2.45e-10 -2.127 0.034 -1e-09 -4.08e-11
Cost of Revenue 2.032e-10 4.83e-11 4.211 0.000 1.09e-10 2.98e-10
Depreciation -5.501e-10 1.04e-10 -5.299 0.000 -7.54e-10 -3.47e-10
Gross Margin 0.0039 0.005 0.869 0.385 -0.005 0.013
Income Tax -4.693e-10 1.36e-10 -3.460 0.001 -7.35e-10 -2.03e-10
Net Income 3.425e-09 1.12e-09 3.051 0.002 1.22e-09 5.63e-09
Net Income Adjustments -1.719e-10 4.34e-11 -3.958 0.000 -2.57e-10 -8.67e-11
Net Income Applicable to Common Shareholders -2.607e-09 1.11e-09 -2.351 0.019 -4.78e-09 -4.32e-10
Net Income-Cont. Operations -1.738e-10 1.52e-10 -1.141 0.254 -4.72e-10 1.25e-10
Operating Income 3.584e-10 6.53e-11 5.489 0.000 2.3e-10 4.87e-10
Operating Margin -0.0489 0.005 -9.300 0.000 -0.059 -0.039
Other Operating Items 2.138e-10 6.31e-11 3.390 0.001 9.01e-11 3.37e-10
Sales, General and Admin. 2.1e-10 5.13e-11 4.095 0.000 1.09e-10 3.11e-10
Total Current Liabilities -4.978e-11 2.03e-11 -2.449 0.014 -8.96e-11 -9.91e-12
Total Equity 6.676e-05 4.22e-06 15.827 0.000 5.85e-05 7.5e-05
Total Liabilities 6.676e-05 4.22e-06 15.827 0.000 5.85e-05 7.5e-05
Total Liabilities & Equity -6.676e-05 4.22e-06 -15.827 0.000 -7.5e-05 -5.85e-05
Total Revenue -2.009e-10 4.8e-11 -4.190 0.000 -2.95e-10 -1.07e-10
Estimated Shares Outstanding -7.285e-10 1.41e-10 -5.178 0.000 -1e-09 -4.53e-10
Omnibus: 1707.352 Durbin-Watson: 0.932
Prob(Omnibus): 0.000 Jarque-Bera (JB): 125766.153
Skew: 4.337 Prob(JB): 0.00
Kurtosis: 43.243 Cond. No. 4.18e+15


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.18e+15. This might indicate that there are
strong multicollinearity or other numerical problems.

After eliminating the unnecessary variables, we run the regression again, to see if any more variables can be removed.

Upon running the regression, Gross Margin and Net Income-Cont. Operations can be removed from the equation. Due to all of the other insignificant variables skewing the regression, these two variables went unseen in terms of their significance, but since those variables are removed, their true significance can be found in the regression, and thus can be removed.

In [ ]:
reg = linear_model.LinearRegression()
y = 'Earnings Per Share'
x = ['Capital Expenditures', 'Cash Ratio', 'Changes in Inventories', 'Cost of Revenue', 'Depreciation', 'Income Tax',
       'Net Income', 'Net Income Adjustments', 'Net Income Applicable to Common Shareholders', 'Operating Income', 
       'Operating Margin', 'Other Operating Items', 'Sales, General and Admin.', 'Total Current Liabilities', 'Total Equity', 'Total Liabilities',
       'Total Liabilities & Equity', 'Total Revenue', 'Estimated Shares Outstanding']
reg.fit(fund[x], fund[y])
Out[ ]:
LinearRegression()
In [ ]:
x2 = sm.add_constant(fund[x])
est = sm.OLS(fund[y], x2)
est2 = est.fit()
est2.summary()
Out[ ]:
OLS Regression Results
Dep. Variable: Earnings Per Share R-squared: 0.312
Model: OLS Adj. R-squared: 0.305
Method: Least Squares F-statistic: 44.46
Date: Mon, 20 Dec 2021 Prob (F-statistic): 4.63e-129
Time: 16:43:46 Log-Likelihood: -4885.1
No. Observations: 1781 AIC: 9808.
Df Residuals: 1762 BIC: 9912.
Df Model: 18
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 3.6384 0.145 25.085 0.000 3.354 3.923
Capital Expenditures -1.579e-10 6.94e-11 -2.276 0.023 -2.94e-10 -2.18e-11
Cash Ratio 0.0021 0.001 2.194 0.028 0.000 0.004
Changes in Inventories -5.247e-10 2.45e-10 -2.138 0.033 -1.01e-09 -4.33e-11
Cost of Revenue 1.94e-10 4.75e-11 4.081 0.000 1.01e-10 2.87e-10
Depreciation -5.547e-10 1.02e-10 -5.445 0.000 -7.55e-10 -3.55e-10
Income Tax -5.345e-10 1.26e-10 -4.235 0.000 -7.82e-10 -2.87e-10
Net Income 3.206e-09 1.11e-09 2.897 0.004 1.04e-09 5.38e-09
Net Income Adjustments -1.736e-10 4.33e-11 -4.006 0.000 -2.59e-10 -8.86e-11
Net Income Applicable to Common Shareholders -2.564e-09 1.11e-09 -2.314 0.021 -4.74e-09 -3.91e-10
Operating Income 3.767e-10 6.32e-11 5.959 0.000 2.53e-10 5.01e-10
Operating Margin -0.0472 0.005 -9.792 0.000 -0.057 -0.038
Other Operating Items 2.118e-10 6.23e-11 3.400 0.001 8.96e-11 3.34e-10
Sales, General and Admin. 2.077e-10 4.97e-11 4.179 0.000 1.1e-10 3.05e-10
Total Current Liabilities -4.866e-11 2.03e-11 -2.402 0.016 -8.84e-11 -8.93e-12
Total Equity 1.506e-05 6.01e-07 25.081 0.000 1.39e-05 1.62e-05
Total Liabilities 1.506e-05 6.01e-07 25.081 0.000 1.39e-05 1.62e-05
Total Liabilities & Equity -1.506e-05 6.01e-07 -25.081 0.000 -1.62e-05 -1.39e-05
Total Revenue -1.943e-10 4.72e-11 -4.114 0.000 -2.87e-10 -1.02e-10
Estimated Shares Outstanding -7.321e-10 1.4e-10 -5.222 0.000 -1.01e-09 -4.57e-10
Omnibus: 1706.213 Durbin-Watson: 0.935
Prob(Omnibus): 0.000 Jarque-Bera (JB): 125843.382
Skew: 4.332 Prob(JB): 0.00
Kurtosis: 43.259 Cond. No. 4.89e+15


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.89e+15. This might indicate that there are
strong multicollinearity or other numerical problems.

Finally, the reduced regression equation is found. To understand this equation, look at the coef column in the summary above. Those are the coefficient values for each variable in the equation. So, if Cash Ratio increases by a value of 1, then the Earnings Per Share value will also increase by a value of 0.0021, holding all other variables constant.

Thus, the EPS value can be determined by plugging in all values for these variables, multiplying them by the associated coefficient value, and adding the results together.

The reason for many of these coefficients being small is due to the large numbers associated with those variables. In order to output an EPS value, which often lie in the single digits, small coefficients are needed to counteract the large values for many of these variables.

Now, some of the variables that affect the EPS value the most are Cash Ratio, Operating Margin, Total Equity, and Total Liabilities.

Cash ratio is the measure of the liquidity of a company. The numerator holds the cash and cash equivalance, basically anything that is super liquid, which, in simple terms, means that even in the worst market conditions, someone will be willing to buy it. The denominator is the net liabilities, which is the liabilities - assets. You always want this to be >= 1.

Operating margin is the measure of a company's total revenue on sales after they pay for variable costs of production, such as on making the product itself, but disregarding interest or tax, including income tax. To calculate this, we take operating income and divide it by total revenue. In general, the higher the operating margin, the better off the company is.

Total equity is arguably the most important factor in our experiment, as it is used to determine a company's net worth. To calculate total equity, we would take the net difference between a company's total assets and its total liabilities. If equity is positive, this is a good thing and the company has enough assets to cover its liabilities. However, if negative, the opposite is true, and if this is an unresolved pattern, it can lead to insolvency.

Total liabilities are a company's obligations to sources outside of their domain, usually found in the form of debts. It is what they owe, as opposed to their assets, which is what they own. Liabilities aren't usually helpful as a solo measure, rather they are combined with other measures to form more meaningful data (such as cash ratio).

For an even more in depth explanation of these terms and their relevance to our tutorial, we recommend you take a look at Investopedia.

In [ ]:
x_train, x_test, y_train, y_test = train_test_split(fund[x], fund[y], test_size = 0.2, random_state = 0)
reg = linear_model.LinearRegression()
reg.fit(x_train, y_train)
x2 = sm.add_constant(x_train)
est = sm.OLS(y_train, x2)
est2 = est.fit()
est2.summary()
Out[ ]:
OLS Regression Results
Dep. Variable: Earnings Per Share R-squared: 0.334
Model: OLS Adj. R-squared: 0.326
Method: Least Squares F-statistic: 39.18
Date: Mon, 20 Dec 2021 Prob (F-statistic): 1.93e-110
Time: 16:43:46 Log-Likelihood: -3946.4
No. Observations: 1424 AIC: 7931.
Df Residuals: 1405 BIC: 8031.
Df Model: 18
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 3.5808 0.167 21.418 0.000 3.253 3.909
Capital Expenditures -1.95e-10 7.83e-11 -2.490 0.013 -3.49e-10 -4.14e-11
Cash Ratio 0.0027 0.001 2.379 0.018 0.000 0.005
Changes in Inventories -5.183e-10 2.79e-10 -1.861 0.063 -1.06e-09 2.81e-11
Cost of Revenue 1.899e-10 5.41e-11 3.512 0.000 8.38e-11 2.96e-10
Depreciation -6.071e-10 1.14e-10 -5.325 0.000 -8.31e-10 -3.83e-10
Income Tax -6.107e-10 1.37e-10 -4.472 0.000 -8.79e-10 -3.43e-10
Net Income 2.816e-09 1.24e-09 2.267 0.024 3.79e-10 5.25e-09
Net Income Adjustments -2.73e-10 5.56e-11 -4.906 0.000 -3.82e-10 -1.64e-10
Net Income Applicable to Common Shareholders -2.278e-09 1.24e-09 -1.835 0.067 -4.71e-09 1.57e-10
Operating Income 4.931e-10 7.78e-11 6.337 0.000 3.4e-10 6.46e-10
Operating Margin -0.0428 0.006 -7.763 0.000 -0.054 -0.032
Other Operating Items 2e-10 6.85e-11 2.918 0.004 6.56e-11 3.34e-10
Sales, General and Admin. 2.05e-10 5.58e-11 3.675 0.000 9.56e-11 3.14e-10
Total Current Liabilities -3.921e-11 2.31e-11 -1.701 0.089 -8.44e-11 6.01e-12
Total Equity 1.913e-05 8.93e-07 21.419 0.000 1.74e-05 2.09e-05
Total Liabilities 1.913e-05 8.93e-07 21.419 0.000 1.74e-05 2.09e-05
Total Liabilities & Equity -1.913e-05 8.93e-07 -21.419 0.000 -2.09e-05 -1.74e-05
Total Revenue -1.911e-10 5.34e-11 -3.579 0.000 -2.96e-10 -8.64e-11
Estimated Shares Outstanding -9.812e-10 1.68e-10 -5.851 0.000 -1.31e-09 -6.52e-10
Omnibus: 1397.256 Durbin-Watson: 2.044
Prob(Omnibus): 0.000 Jarque-Bera (JB): 105072.995
Skew: 4.458 Prob(JB): 0.00
Kurtosis: 44.127 Cond. No. 5.27e+15


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 5.27e+15. This might indicate that there are
strong multicollinearity or other numerical problems.
In [ ]:
y_pred = reg.predict(x_test)
diff = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
diff
Out[ ]:
Actual Predicted
1337 -0.20 3.491553
1411 5.42 3.175234
1182 2.22 3.357733
962 -2.97 2.753236
1739 0.71 2.760649
... ... ...
853 1.14 2.824613
192 0.36 -13.690136
260 3.66 3.414873
1525 3.61 3.182921
1564 2.61 2.712609

357 rows × 2 columns

In [ ]:
residuals = []
for i, row in diff.iterrows():
  d = row['Actual'] - row['Predicted']
  residuals.append(d)


plt.scatter(residuals, y_pred)
plt.xlabel('Residual Value')
plt.ylabel('Actual Value')
plt.title('Residuals vs Actual Value')
Out[ ]:
Text(0.5, 1.0, 'Residuals vs Actual Value')
In [ ]:
import scipy as sp
fig, ax = plt.subplots(figsize = (6,2.5))
_, (_, _, r) = sp.stats.probplot(residuals, plot=ax, fit=True)
r**2
Out[ ]:
0.7718068650561637

The past few sections of code have dealt with determining accurate the regression is when it has to predict EPS values, given the rest of the data.

According to the resulting output, around 77% of the variation in Earning's Per Share can be explained by the data. Basically, the regression is correct 77 percent of the time.

Also, the scatterplot displays the difference between the predicted EPS value and the actual value, also known as the residual. As can be seen, most of the points lie around (0,0) on the graph, indicating that the predicted values were mostly accurate, with a few outliers. The Probability plot below the scatterplot displays the same thing. For the most part, the residuals follow a normal distribution, except there are more outliers, or extreme differences in the predicted values for EPS, and the actual values. Since the regression was only correct 77% of the time, 33% of the data resulted in incorrect values for EPS, most likely when the data was lied far from the average. Thus, that is why above scatterplot curves at both ends, due to the high number of extreme values. In statistical terms, this plot suffers from "heavy tails"

In [ ]:
fund_scaled = fund.copy()
x = ['Capital Expenditures', 'Cash Ratio', 'Changes in Inventories', 'Cost of Revenue', 'Depreciation', 'Income Tax',
       'Net Income', 'Net Income Adjustments', 'Net Income Applicable to Common Shareholders', 'Operating Income', 
       'Operating Margin', 'Other Operating Items', 'Sales, General and Admin.', 'Total Current Liabilities', 'Total Equity', 'Total Liabilities',
       'Total Liabilities & Equity', 'Total Revenue', 'Estimated Shares Outstanding']
for column in x:
    fund_scaled[column] = (fund_scaled[column] - fund_scaled[column].mean())/fund_scaled[column].std()
fund_scaled.head()
Out[ ]:
Unnamed: 0 Ticker Symbol Period Ending Accounts Payable Accounts Receivable Add'l income/expense items After Tax ROE Capital Expenditures Capital Surplus Cash Ratio Cash and Cash Equivalents Changes in Inventories Common Stocks Cost of Revenue Current Ratio Deferred Asset Charges Deferred Liability Charges Depreciation Earnings Before Interest and Tax Earnings Before Tax Effect of Exchange Rate Equity Earnings/Loss Unconsolidated Subsidiary Fixed Assets Goodwill Gross Margin Gross Profit Income Tax Intangible Assets Interest Expense Inventory Investments Liabilities Long-Term Debt Long-Term Investments Minority Interest Misc. Stocks Net Borrowings Net Cash Flow Net Cash Flow-Operating Net Cash Flows-Financing Net Cash Flows-Investing Net Income Net Income Adjustments Net Income Applicable to Common Shareholders Net Income-Cont. Operations Net Receivables Non-Recurring Items Operating Income Operating Margin Other Assets Other Current Assets Other Current Liabilities Other Equity Other Financing Activities Other Investing Activities Other Liabilities Other Operating Activities Other Operating Items Pre-Tax Margin Pre-Tax ROE Profit Margin Quick Ratio Research and Development Retained Earnings Sale and Purchase of Stock Sales, General and Admin. Short-Term Debt / Current Portion of Long-Term Debt Short-Term Investments Total Assets Total Current Assets Total Current Liabilities Total Equity Total Liabilities Total Liabilities & Equity Total Revenue Treasury Stock For Year Earnings Per Share Estimated Shares Outstanding
0 0 AAL 2012-12-31 3.068000e+09 -222000000.0 -1.961000e+09 23.0 -0.213451 4.695000e+09 -0.091987 1.330000e+09 0.173895 127000000.0 -0.060930 78.0 0.000000e+00 223000000.0 -0.038038 -1.813000e+09 -2.445000e+09 0.0 0.0 1.340200e+10 0.000000e+00 58.0 1.435600e+10 -0.676664 8.690000e+08 632000000.0 5.800000e+08 3.060000e+08 4.730000e+08 7.116000e+09 0.0 0.0 0.0 -1.020000e+09 197000000.0 1.285000e+09 4.830000e+08 -1.571000e+09 -0.896859 0.687092 -0.900636 -4.084000e+09 1.124000e+09 3.860000e+08 -0.419046 -0.837781 2.167000e+09 6.260000e+08 4.524000e+09 -2.980000e+09 1.509000e+09 11000000.0 1.514700e+10 -141000000.0 -0.008181 10.0 31.0 8.0 72.0 0.0 -9.462000e+09 0.000000e+00 0.992361 1.419000e+09 3.412000e+09 2.351000e+10 7.072000e+09 0.451951 -0.769046 -0.104088 -0.190738 0.111430 -367000000.0 2012.0 -5.60 -0.177687
1 1 AAL 2013-12-31 4.975000e+09 -93000000.0 -2.723000e+09 67.0 -0.624865 1.059200e+10 0.133942 2.175000e+09 0.173895 5000000.0 -0.043832 104.0 0.000000e+00 935000000.0 -0.029356 -1.324000e+09 -2.180000e+09 0.0 0.0 1.925900e+10 4.086000e+09 59.0 1.572400e+10 -0.554816 2.311000e+09 856000000.0 1.012000e+09 -1.181000e+09 -2.350000e+08 1.535300e+10 0.0 0.0 0.0 2.208000e+09 660000000.0 6.750000e+08 3.799000e+09 -3.814000e+09 -0.886341 0.620641 -0.890022 -4.489000e+09 1.560000e+09 5.590000e+08 -0.171884 -0.642698 2.299000e+09 1.465000e+09 7.385000e+09 -2.032000e+09 1.711000e+09 481000000.0 1.491500e+10 -56000000.0 -0.005427 8.0 80.0 7.0 96.0 0.0 -1.129600e+10 0.000000e+00 0.985301 1.446000e+09 8.111000e+09 4.227800e+10 1.432300e+10 0.954633 -0.565675 0.006631 -0.065965 0.157531 0.0 2013.0 -11.25 -0.335735
2 2 AAL 2014-12-31 4.668000e+09 -160000000.0 -1.500000e+08 143.0 -1.362123 1.513500e+10 -0.020101 1.768000e+09 0.173895 7000000.0 0.107446 88.0 0.000000e+00 829000000.0 0.117791 4.099000e+09 3.212000e+09 0.0 0.0 2.308400e+10 4.091000e+09 63.0 2.703000e+10 -0.185445 2.240000e+09 887000000.0 1.004000e+09 1.799000e+09 -1.026000e+09 1.604300e+10 0.0 0.0 0.0 1.700000e+08 -146000000.0 3.080000e+09 -3.150000e+08 -2.911000e+09 0.294607 0.120946 0.301792 2.882000e+09 1.771000e+09 8.000000e+08 0.391197 -0.398845 2.060000e+09 8.980000e+08 7.059000e+09 -4.559000e+09 8.170000e+08 601000000.0 1.092800e+10 -500000000.0 0.146748 8.0 159.0 7.0 80.0 0.0 -8.562000e+09 -1.052000e+09 1.842739 1.677000e+09 6.309000e+09 4.322500e+10 1.175000e+10 0.912489 -0.381804 -0.014310 -0.061308 0.545947 0.0 2014.0 4.02 0.173294
3 3 AAL 2015-12-31 5.102000e+09 352000000.0 -7.080000e+08 135.0 -1.644005 1.159100e+10 -0.112527 1.085000e+09 0.173895 6000000.0 -0.041301 73.0 2.477000e+09 667000000.0 0.184052 5.496000e+09 4.616000e+09 0.0 0.0 2.751000e+10 4.091000e+09 73.0 2.989400e+10 -2.001700 2.249000e+09 880000000.0 8.630000e+08 4.430000e+08 -6.330000e+08 1.833000e+10 0.0 0.0 0.0 2.856000e+09 -604000000.0 6.249000e+09 -1.259000e+09 -5.594000e+09 1.478560 -1.081927 1.496640 7.610000e+09 1.425000e+09 1.051000e+09 0.777450 -0.154992 2.103000e+09 7.480000e+08 6.272000e+09 -4.732000e+09 9.600000e+07 114000000.0 1.017800e+10 95000000.0 0.170503 11.0 82.0 19.0 67.0 0.0 -1.230000e+09 -3.846000e+09 1.907711 2.231000e+09 5.864000e+09 4.841500e+10 9.985000e+09 0.933561 -0.241967 -0.005637 -0.035787 0.505413 0.0 2015.0 11.39 0.128460
4 4 AAP 2012-12-29 2.409453e+09 -89482000.0 6.000000e+05 32.0 0.329112 5.202150e+08 -0.400073 5.981110e+08 -0.492960 7000.0 -0.303975 124.0 0.000000e+00 0.0 -0.408854 6.579150e+08 6.240740e+08 0.0 0.0 1.292547e+09 7.638900e+07 50.0 3.098036e+09 -0.236586 2.884500e+07 33841000.0 2.308609e+09 0.000000e+00 4.263230e+08 6.044610e+08 0.0 0.0 0.0 1.774450e+08 540210000.0 6.852810e+08 1.279070e+08 -2.729780e+08 -0.330006 -0.073785 -0.328568 3.876700e+08 2.298660e+08 0.000000e+00 -0.318420 -0.350075 3.183300e+07 4.761400e+07 1.495580e+08 2.667000e+06 -3.349900e+07 -1796000.0 2.390210e+08 8213000.0 -0.299102 10.0 52.0 6.0 34.0 0.0 7.149000e+08 -1.860000e+07 -0.169893 6.270000e+05 0.000000e+00 4.613814e+09 3.184200e+09 -0.224376 -0.413157 -0.222346 -0.251173 -0.343964 -27095000.0 2012.0 5.29 -0.418205
In [ ]:
y = 'Earnings Per Share'
x2 = sm.add_constant(fund_scaled[x])
est = sm.OLS(fund_scaled[y], x2)
est2 = est.fit()
est2.summary()
Out[ ]:
OLS Regression Results
Dep. Variable: Earnings Per Share R-squared: 0.312
Model: OLS Adj. R-squared: 0.305
Method: Least Squares F-statistic: 44.46
Date: Mon, 20 Dec 2021 Prob (F-statistic): 4.63e-129
Time: 16:43:47 Log-Likelihood: -4885.1
No. Observations: 1781 AIC: 9808.
Df Residuals: 1762 BIC: 9912.
Df Model: 18
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 2.9413 0.090 32.851 0.000 2.766 3.117
Capital Expenditures -0.4706 0.207 -2.276 0.023 -0.876 -0.065
Cash Ratio 0.2079 0.095 2.194 0.028 0.022 0.394
Changes in Inventories -0.2048 0.096 -2.138 0.033 -0.393 -0.017
Cost of Revenue 5.9009 1.446 4.081 0.000 3.065 8.737
Depreciation -1.2139 0.223 -5.445 0.000 -1.651 -0.777
Income Tax -0.9783 0.231 -4.235 0.000 -1.431 -0.525
Net Income 12.8014 4.419 2.897 0.004 4.135 21.468
Net Income Adjustments -0.4624 0.115 -4.006 0.000 -0.689 -0.236
Net Income Applicable to Common Shareholders -10.1477 4.385 -2.314 0.021 -18.748 -1.547
Operating Income 1.9068 0.320 5.959 0.000 1.279 2.534
Operating Margin -0.9686 0.099 -9.792 0.000 -1.163 -0.775
Other Operating Items 0.6152 0.181 3.400 0.001 0.260 0.970
Sales, General and Admin. 1.8829 0.451 4.179 0.000 0.999 2.767
Total Current Liabilities -0.4642 0.193 -2.402 0.016 -0.843 -0.085
Total Equity 0.2196 0.326 0.674 0.500 -0.419 0.858
Total Liabilities -0.3265 0.167 -1.960 0.050 -0.653 0.000
Total Liabilities & Equity -0.2638 0.118 -2.229 0.026 -0.496 -0.032
Total Revenue -7.9565 1.934 -4.114 0.000 -11.750 -4.163
Estimated Shares Outstanding -0.7966 0.153 -5.222 0.000 -1.096 -0.497
Omnibus: 1706.213 Durbin-Watson: 0.935
Prob(Omnibus): 0.000 Jarque-Bera (JB): 125843.382
Skew: 4.332 Prob(JB): 0.00
Kurtosis: 43.259 Cond. No. 5.64e+15


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 4.79e-28. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

By normalizing the data, the incredibly large values can be reduced into smaller, more understandable data. Also, the coefficients can be analyzed easier. According to the data, Net Income plays a large role in causing Earning's Per Share to increase, with a coefficient of 12, while Net Income Applicable causes EPS to decrease when it rises, with a coefficient of -10.

Thus, according to the data, signs of an increase in Earnings Per Share are Cash Ratio, Cost of Revenue, Net Income, Operating Income, Other Operating Items, Sales, Generam and Admin, and Total Equity all are directly related to Earnings Per Share. When these variables experience an increase, so does EPS.

On the other hand, Earnings per share falls when Capital Expenditures, Changes in Inventories, Depreciation, Income Tax, Net Income Adjustments, Net Income Applicable to Common Shareholders, Operating Margin, Total Current Liabilities, Total Liabilities, Total Liabilities and Equity, Total Revenue, and Extimates Shares Outstanding all rise.

So, in one sentence, when Net Income is high, Earnings Per Share will rise by 12 percent, but if Net Income Adjustments is high, EPS will fall by 10 percent.

In [ ]:
import seaborn as sns


x2 = ['Cost of Revenue', 'Depreciation', 'Income Tax',
       'Net Income', 'Net Income Applicable to Common Shareholders', 'Operating Income', 
       'Operating Margin', 'Sales, General and Admin.', 'Total Revenue']
sns.heatmap(fund[x2].corr(), cmap='Blues', annot=True)
plt.title('Relevant Independent Factors vs. Other Relevant Independent Factors')
plt.show()

In this correlation heatmap, you are able to see the correlation the data has with other data values, rather than its effect on EPS, as we have mostly discussed throughout this tutorial. Cutting out data where the coefficients weren't particularly large (roughly < -1 and > 1), we see some strong correlations.

For example, we can tell that since Operating Margin doesn't have a high absolute value coefficient with any of the other factors, but still affects EPS, that it is very much an independent factor in determining EPS trends. On the contrary, we see that Net Income has coefficients of .92, .88, and .67 in regards to Operating Income, Income Tax, and Total Revenue, respectably. As we know from the Linear Regression summary above, Net Income has a big impact on earnings per share, but this indicates that these factors are strongly correlated with Net Income as well, which increases their significance in terms of EPS (this is fairly obvious if you look at how to calculate Net Income, however seeing the data helps to really solidify these findings).

In [ ]:
import tensorflow as tf
from tensorflow import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import Flatten
In [ ]:
x_train, x_test, y_train, y_test = train_test_split(fund[x], fund[y], test_size = 0.08, random_state = 0)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)
In [ ]:
y_test = y_test.reset_index(drop = True)
In [ ]:
model = Sequential()
model.add(Dense(32, activation = 'relu', input_dim = 19))
model.add(Dense(units = 32, activation = 'relu'))
model.add(Dense(units = 32, activation = 'relu'))
model.add(Dense(units = 1))
In [ ]:
model.compile(optimizer = 'adam',loss = 'mean_squared_error')
In [ ]:
model.fit(x_train, y_train, batch_size = 10, epochs = 100)
Epoch 1/100
164/164 [==============================] - 1s 2ms/step - loss: 25.8686
Epoch 2/100
164/164 [==============================] - 0s 2ms/step - loss: 19.3298
Epoch 3/100
164/164 [==============================] - 0s 2ms/step - loss: 15.4320
Epoch 4/100
164/164 [==============================] - 0s 2ms/step - loss: 13.4524
Epoch 5/100
164/164 [==============================] - 0s 2ms/step - loss: 12.7696
Epoch 6/100
164/164 [==============================] - 0s 2ms/step - loss: 12.5535
Epoch 7/100
164/164 [==============================] - 0s 2ms/step - loss: 12.3524
Epoch 8/100
164/164 [==============================] - 0s 2ms/step - loss: 12.1208
Epoch 9/100
164/164 [==============================] - 0s 2ms/step - loss: 11.8820
Epoch 10/100
164/164 [==============================] - 0s 2ms/step - loss: 11.8122
Epoch 11/100
164/164 [==============================] - 0s 2ms/step - loss: 11.7124
Epoch 12/100
164/164 [==============================] - 0s 2ms/step - loss: 11.5251
Epoch 13/100
164/164 [==============================] - 0s 2ms/step - loss: 11.3785
Epoch 14/100
164/164 [==============================] - 0s 2ms/step - loss: 11.2018
Epoch 15/100
164/164 [==============================] - 0s 2ms/step - loss: 11.2789
Epoch 16/100
164/164 [==============================] - 0s 2ms/step - loss: 10.8348
Epoch 17/100
164/164 [==============================] - 0s 2ms/step - loss: 10.6868
Epoch 18/100
164/164 [==============================] - 0s 2ms/step - loss: 10.6430
Epoch 19/100
164/164 [==============================] - 0s 2ms/step - loss: 10.2268
Epoch 20/100
164/164 [==============================] - 0s 2ms/step - loss: 9.9523
Epoch 21/100
164/164 [==============================] - 0s 2ms/step - loss: 9.7200
Epoch 22/100
164/164 [==============================] - 0s 2ms/step - loss: 9.5784
Epoch 23/100
164/164 [==============================] - 0s 2ms/step - loss: 9.2709
Epoch 24/100
164/164 [==============================] - 0s 2ms/step - loss: 9.2671
Epoch 25/100
164/164 [==============================] - 0s 2ms/step - loss: 8.8424
Epoch 26/100
164/164 [==============================] - 0s 2ms/step - loss: 8.6599
Epoch 27/100
164/164 [==============================] - 0s 2ms/step - loss: 7.9926
Epoch 28/100
164/164 [==============================] - 0s 2ms/step - loss: 7.7569
Epoch 29/100
164/164 [==============================] - 0s 2ms/step - loss: 7.4639
Epoch 30/100
164/164 [==============================] - 0s 2ms/step - loss: 7.1294
Epoch 31/100
164/164 [==============================] - 0s 2ms/step - loss: 7.0412
Epoch 32/100
164/164 [==============================] - 0s 2ms/step - loss: 6.8786
Epoch 33/100
164/164 [==============================] - 0s 2ms/step - loss: 6.4331
Epoch 34/100
164/164 [==============================] - 0s 2ms/step - loss: 6.2220
Epoch 35/100
164/164 [==============================] - 0s 2ms/step - loss: 6.0230
Epoch 36/100
164/164 [==============================] - 0s 2ms/step - loss: 5.8345
Epoch 37/100
164/164 [==============================] - 0s 2ms/step - loss: 5.7645
Epoch 38/100
164/164 [==============================] - 0s 2ms/step - loss: 5.5776
Epoch 39/100
164/164 [==============================] - 0s 2ms/step - loss: 5.7595
Epoch 40/100
164/164 [==============================] - 0s 2ms/step - loss: 5.3950
Epoch 41/100
164/164 [==============================] - 0s 2ms/step - loss: 5.3471
Epoch 42/100
164/164 [==============================] - 0s 2ms/step - loss: 5.1581
Epoch 43/100
164/164 [==============================] - 0s 2ms/step - loss: 5.0787
Epoch 44/100
164/164 [==============================] - 0s 2ms/step - loss: 5.0146
Epoch 45/100
164/164 [==============================] - 0s 2ms/step - loss: 4.8552
Epoch 46/100
164/164 [==============================] - 0s 2ms/step - loss: 4.6035
Epoch 47/100
164/164 [==============================] - 0s 2ms/step - loss: 4.9631
Epoch 48/100
164/164 [==============================] - 0s 2ms/step - loss: 4.6153
Epoch 49/100
164/164 [==============================] - 0s 2ms/step - loss: 4.3588
Epoch 50/100
164/164 [==============================] - 0s 2ms/step - loss: 4.4514
Epoch 51/100
164/164 [==============================] - 0s 2ms/step - loss: 4.3179
Epoch 52/100
164/164 [==============================] - 0s 2ms/step - loss: 4.0161
Epoch 53/100
164/164 [==============================] - 0s 2ms/step - loss: 4.4694
Epoch 54/100
164/164 [==============================] - 0s 2ms/step - loss: 4.0939
Epoch 55/100
164/164 [==============================] - 0s 2ms/step - loss: 4.1917
Epoch 56/100
164/164 [==============================] - 0s 2ms/step - loss: 4.0338
Epoch 57/100
164/164 [==============================] - 0s 2ms/step - loss: 3.8432
Epoch 58/100
164/164 [==============================] - 0s 2ms/step - loss: 3.9286
Epoch 59/100
164/164 [==============================] - 0s 2ms/step - loss: 3.7860
Epoch 60/100
164/164 [==============================] - 0s 2ms/step - loss: 3.9047
Epoch 61/100
164/164 [==============================] - 0s 2ms/step - loss: 3.5743
Epoch 62/100
164/164 [==============================] - 0s 2ms/step - loss: 3.5587
Epoch 63/100
164/164 [==============================] - 0s 2ms/step - loss: 3.4964
Epoch 64/100
164/164 [==============================] - 0s 2ms/step - loss: 3.4554
Epoch 65/100
164/164 [==============================] - 0s 2ms/step - loss: 3.3530
Epoch 66/100
164/164 [==============================] - 0s 2ms/step - loss: 3.3728
Epoch 67/100
164/164 [==============================] - 0s 2ms/step - loss: 3.3322
Epoch 68/100
164/164 [==============================] - 0s 2ms/step - loss: 3.4949
Epoch 69/100
164/164 [==============================] - 0s 2ms/step - loss: 3.2444
Epoch 70/100
164/164 [==============================] - 0s 2ms/step - loss: 3.3594
Epoch 71/100
164/164 [==============================] - 0s 2ms/step - loss: 3.0016
Epoch 72/100
164/164 [==============================] - 0s 2ms/step - loss: 3.0807
Epoch 73/100
164/164 [==============================] - 0s 2ms/step - loss: 3.2715
Epoch 74/100
164/164 [==============================] - 0s 2ms/step - loss: 3.2105
Epoch 75/100
164/164 [==============================] - 0s 2ms/step - loss: 2.8688
Epoch 76/100
164/164 [==============================] - 0s 2ms/step - loss: 2.9450
Epoch 77/100
164/164 [==============================] - 0s 2ms/step - loss: 2.8859
Epoch 78/100
164/164 [==============================] - 0s 2ms/step - loss: 2.8188
Epoch 79/100
164/164 [==============================] - 0s 2ms/step - loss: 2.7412
Epoch 80/100
164/164 [==============================] - 0s 2ms/step - loss: 2.6194
Epoch 81/100
164/164 [==============================] - 0s 2ms/step - loss: 2.9750
Epoch 82/100
164/164 [==============================] - 0s 2ms/step - loss: 2.7639
Epoch 83/100
164/164 [==============================] - 0s 2ms/step - loss: 2.6185
Epoch 84/100
164/164 [==============================] - 0s 2ms/step - loss: 2.5259
Epoch 85/100
164/164 [==============================] - 0s 2ms/step - loss: 2.3649
Epoch 86/100
164/164 [==============================] - 0s 2ms/step - loss: 2.4412
Epoch 87/100
164/164 [==============================] - 0s 2ms/step - loss: 2.3231
Epoch 88/100
164/164 [==============================] - 0s 2ms/step - loss: 2.4465
Epoch 89/100
164/164 [==============================] - 0s 2ms/step - loss: 2.5104
Epoch 90/100
164/164 [==============================] - 0s 2ms/step - loss: 2.3716
Epoch 91/100
164/164 [==============================] - 0s 2ms/step - loss: 2.3570
Epoch 92/100
164/164 [==============================] - 0s 2ms/step - loss: 2.3715
Epoch 93/100
164/164 [==============================] - 0s 2ms/step - loss: 2.2564
Epoch 94/100
164/164 [==============================] - 0s 2ms/step - loss: 2.4695
Epoch 95/100
164/164 [==============================] - 0s 2ms/step - loss: 2.4565
Epoch 96/100
164/164 [==============================] - 0s 2ms/step - loss: 2.8448
Epoch 97/100
164/164 [==============================] - 0s 2ms/step - loss: 2.2191
Epoch 98/100
164/164 [==============================] - 0s 2ms/step - loss: 2.0584
Epoch 99/100
164/164 [==============================] - 0s 2ms/step - loss: 2.1709
Epoch 100/100
164/164 [==============================] - 0s 2ms/step - loss: 1.9946
Out[ ]:
<keras.callbacks.History at 0x7f776c1b3d50>
In [ ]:
y_pred = model.predict(x_test)
plt.plot(y_test, color = 'red', label = 'Real data')
plt.plot(y_pred, color = 'blue', label = 'Predicted data')
plt.title('Prediction')
plt.legend()
plt.show()

However, the linear regression alone does not accurately explain the variation in the values for Earnings Per Share. So, the relationships between the 19 different variables and EPS may not be linear. Instead, we created a neural network to try and predict Earning's Per Share, given the data.

The data is standardized, to prevent outliers from skewing the data, as well as to reduce the number of large values, and make the results more understandable and easy to interpret.

Then, the neural network is created and run, with each epoch operating on a subset of the data, and learning from the prior results. Thus, loss in the data is reduced as the neural net runs, as it adjusts coefficient values for each variable until the most optimal ones are found. Basically, the neural net learns over time.

After that, the model is tested using the testing data, and its predictions are compared to the actual results. As can be seen in the graph about, the model does a rather good job in predicting Earnings Per Share given the varied fundamental data found in fundamentals.csv. With that, we believe the neural net can more accurately predict EPS values, and so this model could be used on new data to predict values for Earning's Per Share in the future.