Impact of News Headlines on Stock Indices — Part 1

The aim of this project is to identify how global stock indices are impacted by the publication of news articles.

In part 1 of this post I explain how I obtained the data for the stock indices and how to connect to The Guardian API to obtain news articles. Hope you enjoy it 🙂

Introduction

A stock market index is a hypothetical portfolio of investment holdings that represents a segment of the financial market. A number of factors influence the stock market, including interest rates, economic growth and behavioural economics. In principle, an investor with superior information could ‘beat’ the market. According to the Efficient Market Hypothesis, however, share prices already reflect all available information, so neither fundamental nor technical analysis can consistently generate excess returns.

Motivation

Because conventional methods of trying to beat the stock market are already widely used by traders, a different approach is needed to have a chance of outperforming the market and producing excess returns. Traditional technical and fundamental analysis may give investors insight into long-term investing, but an unconventional approach such as applying machine learning algorithms may improve the chances of generating excess returns. Investigating the impact of news headlines on stock indices, or even incorporating it into a trading strategy, may allow an investor to anticipate price mismatches or market movements.

Data

Stock Index Data

Two major stock indices were used for this project.

  • FTSE 100, a major UK index, which consists of the 100 largest companies by market capitalisation.
FTSE 100 data frame obtained from investing.com

  • S&P 500, a major US index, which consists of 500 leading US companies and covers approximately 80% of US market capitalisation.

The index values were obtained with a date range of 01/01/2004 to 01/12/2019.

The FTSE 100 index values were obtained from investing.com and the S&P 500 values were obtained using the pandas data reader and yfinance.

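    from pandas_datareader import data as pdr
    import yfinance as yf

    # Assumption: pdr.get_data_yahoo is routed through yfinance, as in older yfinance releases
    yf.pdr_override()

    SP500 = pdr.get_data_yahoo('^GSPC', start='2004-01-01', end='2019-12-01')
    SP500.reset_index(inplace=True)
    SP500.head()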
S&P 500 data frame from pandas data reader and yfinance

To superimpose the time series plots of the FTSE 100 index and the S&P 500 index in a common currency, the USD/GBP exchange rate was obtained from investing.com.

USD/GBP data frame obtained from investing.com
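
One way to put both indices in the same currency before plotting is sketched below. It assumes the USD/GBP rate is quoted as pounds per one US dollar. The column names (`Date`, `Close`, `usd_gbp`) and all sample values are made up for illustration and are not taken from the real data frames.

```python
import pandas as pd

# Hypothetical sample rows; the real frames come from yfinance and investing.com.
sp500 = pd.DataFrame({
    "Date": pd.to_datetime(["2019-11-25", "2019-11-26", "2019-11-27"]),
    "Close": [3000.0, 3100.0, 3200.0],
})
usd_gbp = pd.DataFrame({
    "Date": pd.to_datetime(["2019-11-25", "2019-11-27"]),
    "usd_gbp": [0.5, 0.75],  # pounds per one US dollar (illustrative values)
})


def convert_to_gbp(index_df, fx_df):
    """Join index closes with the exchange rate by date and convert to GBP."""
    # An inner join keeps only dates present in both frames.
    merged = index_df.merge(fx_df, on="Date", how="inner")
    merged["Close_GBP"] = merged["Close"] * merged["usd_gbp"]
    return merged


sp500_gbp = convert_to_gbp(sp500, usd_gbp)
print(sp500_gbp)
```

The inner join drops any date that is missing from either frame, which matters because the UK and US markets do not share the same holidays.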

News Headlines Data

The Guardian API was used to extract headlines from 01/01/2004 to 01/12/2019. The headlines cover UK, Business and World News, and will be used to try to predict the stock indices.

The Guardian is a leading UK media outlet, so its coverage may be biased towards news related to the UK. Furthermore, The Guardian is generally considered a centre-left publication that is supportive of the Labour Party and pro-EU.

Questions to be answered:

The objective is to identify the impact of news headlines on stock market indices.

The following questions will be answered:

  • Are the news headlines predominantly positive, neutral or negative?
  • Can news headlines be classified into topics reliably?
  • Is there any correlation between news headlines and stock index price?
  • The news articles have been published by The Guardian but do they have an impact on other global stock indices?
  • Can news headlines be used to predict stock index values?

Plan

In order to answer the above questions the following plan has been developed:

  • Connect to The Guardian API and extract data
  • Clean data: to prepare useful and robust data to work with
  • Feature engineering: e.g. sentiment analysis
  • Exploratory data analysis and modelling e.g. LDA (Latent Dirichlet Allocation)
  • Predict Stock Price
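
To make the sentiment analysis step of the plan concrete, here is a toy sketch that labels a headline as positive, neutral or negative by counting words from two tiny hand-made lists. The word lists, the function name `label_headline` and the example headlines are all invented for illustration. This is a stand-in for a proper sentiment model, not the method used in this project.

```python
# Tiny illustrative word lists; a real analysis would use a proper sentiment model.
POSITIVE = {"gain", "gains", "growth", "record", "rally"}
NEGATIVE = {"loss", "losses", "crisis", "fall", "falls"}


def label_headline(headline):
    """Return 'positive', 'negative' or 'neutral' from simple word counts."""
    words = [w.strip(".,!?:;'\"").lower() for w in headline.split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


examples = [
    "Shares rally as growth beats forecasts",
    "Banking crisis deepens",
    "Council meets on Tuesday",
]
for text in examples:
    print(label_headline(text), "|", text)
```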

Guardian API

The Guardian provides an API for requesting news articles from its website and allows 5,000 API calls per day. The code below makes at least one request for each day of coverage, and the range from 01/01/2004 to 31/12/2019 spans more than 5,000 days, so the code was run in batches over 3 days to stay under the daily limit.
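
The batching arithmetic can be sketched as follows. The helper name `split_into_batches` and the batch size of 2,500 days are assumptions chosen for illustration. The batch size leaves headroom below the daily limit because a day with more than one page of results needs more than one call.

```python
from datetime import date, timedelta

MAX_DAYS_PER_BATCH = 2500  # hypothetical headroom below the 5,000-call daily limit


def split_into_batches(start, end, max_days):
    """Yield (batch_start, batch_end) pairs covering start..end inclusive."""
    batch_start = start
    while batch_start <= end:
        batch_end = min(batch_start + timedelta(days=max_days - 1), end)
        yield batch_start, batch_end
        batch_start = batch_end + timedelta(days=1)


start, end = date(2004, 1, 1), date(2019, 12, 31)
total_days = (end - start).days + 1
batches = list(split_into_batches(start, end, MAX_DAYS_PER_BATCH))

print(total_days, "days split into", len(batches), "batches")
for first, last in batches:
    print(first, "to", last)
```

Each batch can then be used as the `start_date` and `end_date` of one day's run.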

To connect to The Guardian API, the user first needs to register as a developer with The Guardian to receive a unique API key.

The code that was released by Dan Nguyen was adapted to request news articles from The Guardian API.

    import json
    from datetime import date, timedelta
    from os import makedirs
    from os.path import exists, join

    import requests

    guardian_api = '########################'  # your Guardian API key
    # Alternatively, read the key from a file:
    # guardian_api = open('creds_guardian.txt').read().strip()

    # Create a temporary directory to save the news articles
    ARTICLES_DIR = join('tempdata', 'newsarticles')
    makedirs(ARTICLES_DIR, exist_ok=True)

    # Sample URL, obtained from The Guardian, to search for news articles:
    # http://content.guardianapis.com/search?q=news&api-key=your-api-key-goes-here
    API_ENDPOINT = 'http://content.guardianapis.com/search?q=news'
    my_params = {
        'from-date': "",
        'to-date': "",
        'order-by': "newest",
        'show-fields': 'all',
        'page-size': 200,
        'api-key': guardian_api,
    }

    # Use the API to gather news article information from The Guardian, one day at a time.
    # Day iteration from here:
    # http://stackoverflow.com/questions/7274267/print-all-day-dates-between-two-dates
    start_date = date(2004, 1, 1)
    end_date = date(2010, 12, 31)  # only 5,000 API requests allowed per day, so split into batches
    dayrange = range((end_date - start_date).days + 1)
    for daycount in dayrange:
        dt = start_date + timedelta(days=daycount)
        datestr = dt.strftime('%Y-%m-%d')
        fname = join(ARTICLES_DIR, datestr + '.json')
        if not exists(fname):
            # then let's download it
            print("Downloading", datestr)
            all_results = []
            my_params['from-date'] = datestr
            my_params['to-date'] = datestr
            current_page = 1
            total_pages = 1
            while current_page <= total_pages:
                print("...page", current_page)
                my_params['page'] = current_page
                resp = requests.get(API_ENDPOINT, params=my_params)
                data = resp.json()
                all_results.extend(data['response']['results'])
                # if there is more than one page
                current_page += 1
                total_pages = data['response']['pages']
            # create a separate JSON file for each day in the tempdata/newsarticles/ folder
            with open(fname, 'w') as f:
                print("Writing to", fname)
                # re-serialize it for pretty indentation
                f.write(json.dumps(all_results, indent=2))
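
The paging loop in the code above can be isolated into a small function, which makes it possible to test the logic without spending API calls. This is a sketch rather than the original code. `fetch_page` is a hypothetical callable standing in for the `requests.get(...).json()` call, and the fake below only mimics the `response`, `results` and `pages` keys that the loop reads.

```python
def collect_all_pages(fetch_page):
    """Call fetch_page(page_number) until every page has been read."""
    results = []
    current_page, total_pages = 1, 1
    while current_page <= total_pages:
        response = fetch_page(current_page)["response"]
        results.extend(response["results"])
        # Each reply reports the total page count, so the loop bound updates here.
        total_pages = response["pages"]
        current_page += 1
    return results


def fake_fetch_page(page):
    """Stand-in for the API: two pages of made-up results."""
    pages = {1: ["headline A", "headline B"], 2: ["headline C"]}
    return {"response": {"results": pages[page], "pages": 2}}


all_headlines = collect_all_pages(fake_fetch_page)
print(all_headlines)
```

In the real script, `fetch_page` would set `my_params['page']` and return the parsed JSON from the request.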

The articles returned for each day are saved as a separate JSON file. I read these JSON files into a pandas data frame using the following code:

    # Inspired by the DataCamp tutorial on data cleaning:
    # https://campus.datacamp.com/courses/cleaning-data-in-python/combining-data-for-analysis?ex=5
    # Iterate over each JSON file in the tempdata/newsarticles/ folder, create a
    # separate pandas dataframe for each day, then concatenate them into one big dataframe.
    from glob import glob

    import pandas as pd

    # Assumed definition: json_files is not defined in the original snippet
    json_files = sorted(glob('tempdata/newsarticles/*.json'))

    # Create an empty list called frames
    frames = []
    # Iterate over json_files
    for json_file in json_files:
        # Read the JSON into a dataframe called newsdata
        newsdata = pd.read_json(json_file)
        # Append newsdata to frames
        frames.append(newsdata)
    # Concatenate frames into a single dataframe called newsarticles
    newsarticles = pd.concat(frames)
    # Print the shape of newsarticles
    print(newsarticles.shape)
Pandas Data frame of the news articles returned from The Guardian API
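
Because the request uses `show-fields`, each saved article is expected to carry a nested `fields` entry, which a plain read leaves as a single column of dictionaries. A minimal sketch of flattening it is shown below. The field names (`webPublicationDate`, `sectionName`, `fields.headline`) are assumed from the API's usual response shape and should be checked against the saved files. The records themselves are made up.

```python
import pandas as pd

# Simplified, made-up records shaped like entries in one saved daily file.
sample_results = [
    {"webPublicationDate": "2004-01-01T10:00:00Z",
     "sectionName": "Business",
     "fields": {"headline": "Example headline one"}},
    {"webPublicationDate": "2004-01-01T15:30:00Z",
     "sectionName": "World news",
     "fields": {"headline": "Example headline two"}},
]

# json_normalize expands the nested "fields" dict into "fields.<name>" columns.
flat = pd.json_normalize(sample_results)
flat["date"] = pd.to_datetime(flat["webPublicationDate"]).dt.date
headlines = flat[["date", "sectionName", "fields.headline"]]
print(headlines)
```

Keeping one row per headline with a plain `date` column makes it straightforward to join the headlines to the daily index values later.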

In the next part of this post I will show some exploratory analysis on the data and how sentiment analysis is used to try and predict the stock index.

Thank you for reading this post!

Link to my GitHub repository on this project: https://github.com/Hishok/Impact-of-News-Headlines-on-Stock-Indices

Follow me on Medium for more posts and stay tuned for Part 2.
