Category Archives: Data Analysis

Text Mining in Excel

This morning, I assembled a quick text-parsing tool to help me analyze the contents of a text document.

Process

  1. Select the Third Avenue Funds 2015 Semi-Annual Report
  2. Parse word frequency using the WriteWords Word Frequency Counter
  3. Input the values into the xls
  4. Source a stopwords [1] list from Ranks NL
  5. Input the stopwords into the xls
  6. Exclude stopwords from the parsed text (see the sketch after this list)
  7. Sort words by frequency
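For anyone reproducing steps 2–7 outside of Excel, here is a minimal Python sketch of the same parse-and-filter logic. The file name and the short stopword set are placeholders, not the actual inputs I used.

# Count word frequency, drop stopwords, and sort by frequency (steps 2-7).
import re
from collections import Counter

stopwords = {"the", "and", "of", "to", "a", "in"}  # substitute the Ranks NL list here

with open("report.txt", encoding="utf-8") as f:    # placeholder file name
    words = re.findall(r"[a-z']+", f.read().lower())

counts = Counter(w for w in words if w not in stopwords)

for word, freq in counts.most_common(25):          # sorted by frequency
    print(word, freq)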

Output

[Image: word-frequency output for the Third Avenue Funds report]

Use Cases

  • Input N pdfs to analyze key themes across a range of reports (eg, sector commentary (energy, technology, healthcare); specific equity commentary)
  • Conduct a look-back analysis to understand the frequency with which key themes occur over a specific period of time (eg, corporate actions (restructuring programs, buybacks, dividends)) to understand key drivers of performance going forward
  • Correlation analysis: which themes correlate with the greatest increases in a stock and the greatest drawdowns? (eg, mentions of poor weather conditions affecting motorcycle sales for Harley Davidson)
  • Quickly source and isolate specific equities mentioned over a period of time for a specific fund (eg, analyze investor reports from Q1 2010 to Q4 2014) and understand how investment strategies evolve over time

Back-solving

  • Input a word or phrase and observe its frequency of occurrence over a chosen time period across a range of selected documents (see the sketch below)
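A minimal sketch of the back-solve, assuming each document is a plain-text file named by period; the phrase and file names are placeholders.

# Count how often a given phrase appears in each document over time.
phrase = "share repurchase"                                # placeholder phrase
documents = ["report_2013Q4.txt", "report_2014Q1.txt", "report_2014Q2.txt"]

for path in documents:
    with open(path, encoding="utf-8") as f:
        print(path, f.read().lower().count(phrase))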

[1] Also a good reference: Text Classification for Sentiment Analysis – Stopwords and Collocations

Automated Search Filtering for Potential Equity Positions

Criteria

  • Topic
  • Date: Early
  • Date: Late

Input criteria in interface & conduct batch search.

Use Case

  • Aggregate a list of 20 equities
  • Observe the top 3 drawdown days in a pre-defined trading period (see the sketch after this list)
  • Run a news search for each drawdown
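A minimal pandas sketch of the drawdown-day step, assuming a CSV of daily closing prices with "date", "ticker", and "close" columns (all placeholder names).

# Find the 3 worst single-day returns for each ticker over the period.
import pandas as pd

prices = pd.read_csv("prices.csv", parse_dates=["date"]).sort_values(["ticker", "date"])
prices["return"] = prices.groupby("ticker")["close"].pct_change()

worst = prices.dropna().sort_values("return").groupby("ticker").head(3)
print(worst[["ticker", "date", "return"]])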

Potential Next Steps

Conduct sentiment analysis on each article

  • Download each article to DB
  • Omit common terms (the, and, of)
  • Observe the frequency distribution of the text to analyze significant variables impacting the price reaction

Crunchbase – Tool

Crunchbase Project

About a month ago, I assembled a little sheet that allows a user to navigate companies within the Crunchbase open-source dataset, which provides information about startup activity across various industries.

While sources like Indeed do a good job of aggregating job postings at a single point in time, they paint an incomplete picture of the number of job vacancies that could be open at that moment. I wanted to address another issue – which companies within different industries reside in which regions? Hiring needs at different companies change on a rather unpredictable basis, so I thought it would be helpful for my friends to have a “watch list” of companies within a specific industry so that they could

  • Discover Companies within specific industries
  • Explore what Companies are doing within different industries (links to websites are included in the sheet)
  • Locate Companies in varying regions

I also thought it would be beneficial for my friends because I’ve learned that a company’s funding history (time since the last funding round + amount raised) can be indicative of its hiring strategy. Tools like LinkedIn are a great means of expanding one’s professional network. Helping friends discover and explore different companies is one step closer to helping them connect with future coworkers/companies.

I started to play with the data and to work on scaling the project to include the entire dataset, but got bogged down with obligations at work. So, for now, the dataset includes only Companies located in California. My sheet features two separate search fields that enable a user to

  1. Search for Companies by Industry
  2. Search for Companies by Region

The sheet auto-populates this data based on the search criteria listed above. The file is available for download if you click here (2MB XLS file). I’ll make periodic updates, but am making it available for now.
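For reference, the same two searches can be reproduced outside the sheet with a few lines of pandas. The file name and the "category_list" and "region" column names are assumptions based on the standard Crunchbase export and may differ from my sheet.

# Filter the exported Crunchbase data by industry and by region.
import pandas as pd

companies = pd.read_csv("crunchbase_companies.csv")

by_industry = companies[companies["category_list"].str.contains("Biotechnology", na=False)]
by_region = companies[companies["region"] == "SF Bay Area"]

print(len(by_industry), "companies in industry;", len(by_region), "companies in region")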

Navigation

The primary focus of the dataset is to highlight companies by industry and region. In cell F6 of the tab titled “menu,” industries are listed in order of the frequency with which they occur in the database. The top 25 industries by representation in the database are as follows:

[Image: top 25 industries by representation in the database]

There are a total of 517 different industries represented in the dataset. To view a full list, click here.

User Interface

[Screenshot: sheet navigation interface]

Two primary cells control the interface of the sheet, each via dropdown menu:

  • Cell F6 Controls Industry Navigation
  • Cell N7 Controls Region Navigation

Left Side Pane: Industry Summary

  • Number of Companies by Region
  • Median Funding
  • Average Funding

Middle Pane: Funding League Table

  • Top 20 Companies within each Industry Ranked by Amount Funded
  • Date of Last Funding Round
  • Link to Company Website

[Screenshot: right pane navigation]

Right Pane: All Companies within Industry in Chosen Region

  • Company Website
  • Total Funding
  • Status (Acquired, Operating)
  • Headquarters
  • Number of Funding Rounds
  • Year Founded
  • First Funding Date
  • Latest Funding Date

EDGAR – Raw Data

Preliminary Analysis

Log Part 1 – Normal Distribution Follow Up


  1. Define “significant” Form 4 filings and determine whether there is a relationship between the number of Form 4 filings (on a daily basis) and price impact
  2. Observe the top 3 gain + loss days and count their proximity to Form 4 filings
  3. Eliminate confounded companies (poor earnings; negative media attention)
  4. Adjust for price changes relative to SP500

Log Part 2 – Qualitative Analysis, “Tails”


  1. Frequency of media coverage relative to filing
  2. Catalysts for filing

Review

EDGAR – Volume

Introducing volume as a parameter of analysis to define significant insider activity

IF
Form 4 is filed

THEN
Calculate the % by which trading volume exceeds average daily trading volume (ADTV)

//

  • Rank the trading-volume excess and observe it against price reaction (see the sketch below)
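A minimal sketch of the calculation, assuming a daily volume series and a list of Form 4 filing dates; the 20-day ADTV window and the input names are assumptions.

# On each Form 4 filing date, measure how much volume exceeded its 20-day average.
import pandas as pd

volume = pd.read_csv("daily_volume.csv", parse_dates=["date"], index_col="date")["volume"]
adtv = volume.rolling(20).mean()

filing_dates = pd.to_datetime(["2014-07-15", "2014-08-20"])   # placeholder Form 4 dates
for d in filing_dates:
    if d in volume.index:
        excess_pct = (volume[d] / adtv[d] - 1) * 100
        print(d.date(), f"volume exceeds ADTV by {excess_pct:.1f}%")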


EDGAR – Assessing Available Data

Accessing the EDGAR FTP, I was able to download all SEC filings for CQ3 2014, which comprise about 206,000 lines of data. It should not be a problem to scale the dataset to include previous years. For now, I focused on CQ3 2014 from a high level.

EDGAR classifies each company by CIK code. By accessing Rank and Filed, I was able to download an index of CIK codes that map to US Exchange Tickers. I integrated this index with my dataset, and am now able to identify companies on a more universal basis (tickers).
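The join itself is a one-liner; a sketch with placeholder file and column names:

# Map EDGAR CIK codes to US exchange tickers and attach them to the filings data.
import pandas as pd

filings = pd.read_csv("edgar_filings_cq3_2014.csv")   # assumed to include a "cik" column
cik_to_ticker = pd.read_csv("cik_ticker_index.csv")   # Rank and Filed index: "cik", "ticker"

filings = filings.merge(cik_to_ticker, on="cik", how="left")
print(filings[["cik", "ticker", "form_type"]].head())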

I logged the frequency of each form filing, and chose to isolate Form 3, Form 4, and Form 13D.

Figure 3.1.2.1: Frequency of Form 3 Filings, CQ3 2014


Figure 3.1.2.2: Frequency of Form 4 Filings, CQ3 2014


Figure 3.1.2.3: Frequency of Form 13D/A Filings, CQ3 2014


I ran a quick regression between frequency of form filings (4, 3, SC 13D/A, 4/A, SC 13G/A, DEFA 14A, SC 13G, SC 13D) and the SP500 to see if there were any general relationships between frequency of filing and broad index performance.
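A sketch of that quick regression, here run as daily filing counts against SP500 daily returns; the file and column names are placeholders, and the exact specification (levels vs. returns) is an assumption.

# Regress daily Form 4 filing counts against SP500 daily returns.
import pandas as pd
from scipy.stats import linregress

counts = pd.read_csv("daily_filing_counts.csv", parse_dates=["date"], index_col="date")["form4"]
spx = pd.read_csv("sp500.csv", parse_dates=["date"], index_col="date")["close"].pct_change()

joined = pd.concat([counts, spx.rename("spx_return")], axis=1, join="inner").dropna()
result = linregress(joined["form4"], joined["spx_return"])
print("slope:", result.slope, "r-squared:", result.rvalue ** 2)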

Figure 3.1.2.4: SP500, 7/1/14 – 9/30/14


Figure 3.1.2.5: Regression Analysis, SP500 v. Form Filing Frequency


On a broad scale, the correlation is null. Delving into more micro analysis, I sought to determine 10 companies to test my analysis on. I incorporated each Company’s market capitalization from FactSet to screen out inactive companies and cleanse my dataset. I reduced my dataset in the following order:

  • Began with 13,063 unique tickers
  • Excluded Companies with 0 market cap; remaining = 7,427 companies
  • Counted number of companies with market capitalization between $500MM and $2,000MM

I then sought to define what constituted an “active” filer. I parsed out the average and median filing frequencies for companies at different intervals.

Figure 3.1.2.6: Frequency of Filings at Different Market Capitalizations


I began to home in on my target Company profile: somewhere in the range of $500MM to $5,000MM. I also took a sample of the aggregate population, measuring the first and third quartiles of filings for Companies with market cap > $500MM, excluding companies that haven’t filed (active Companies are required to file every quarter).

Figure 3.1.2.7: First and Third Quartile: Number of Filings with SEC for Companies with > $500MM Market Cap


At this point, I defined a frequent filer as one that posted more than 23 filings per quarter. I justified this because my sample population of companies has a median of fewer than 13 filings per quarter, only approaching 20 when skewed averages are included.
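A sketch of the screen behind Figures 3.1.2.6–3.1.2.7 and the frequent-filer flag; the file and column names are assumptions.

# Filing-count statistics for companies above $500MM market cap, plus a frequent-filer cut.
import pandas as pd

df = pd.read_csv("filings_per_company.csv")     # assumed columns: ticker, market_cap_mm, filings_q3
active = df[(df["market_cap_mm"] > 500) & (df["filings_q3"] > 0)]

print("median filings:", active["filings_q3"].median())
print("mean filings:", active["filings_q3"].mean())
print("Q1 / Q3:", active["filings_q3"].quantile([0.25, 0.75]).tolist())

frequent = active[active["filings_q3"] > 23]    # threshold chosen above
print(len(frequent), "frequent filers")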

The following are the companies I have chosen to analyze:

Figure 3.1.2.8: EDGAR Project Universe v1


Follow-Up

  • Develop a scalable time series model to observe price reaction relative to Form 3 + 4 filings (see the sketch after this list).
    • Graphical representation for 3Q 2014 (linear) & date of form filing (point)
      • Record the top/bottom 3 dates for price reaction
      • Count whether a form filing occurs within 3-4 days of the resultant reaction, on a simple percentage basis.
      • The end goal is to parse occurrence within +/- 1, 2, 3, 4 days of filing over a universe of 50 companies and record the results.
      • If a relationship is established, structure the data in terms of 1) classifications between gain / loss; 2) magnitude of gain/loss per period observed for the relevant period; 3) normalization for the corresponding SP500 gain/loss to eliminate the market confound
  • Count only unique occurrences of filings and record the titles of sellers (buyers)
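A minimal sketch of the proximity count, assuming per-ticker daily returns and a table of Form 3/4 filing dates (file and column names are placeholders).

# For each ticker's top/bottom 3 reaction days, check for a filing within +/- N days.
import pandas as pd

returns = pd.read_csv("daily_returns.csv", parse_dates=["date"])          # date, ticker, return
filings = pd.read_csv("form_filings.csv", parse_dates=["filing_date"])    # filing_date, ticker

N = 4   # proximity window in days
hits, total = 0, 0
for ticker, grp in returns.groupby("ticker"):
    extremes = pd.concat([grp.nsmallest(3, "return"), grp.nlargest(3, "return")])
    f_dates = filings.loc[filings["ticker"] == ticker, "filing_date"]
    for d in extremes["date"]:
        total += 1
        if ((f_dates - d).abs() <= pd.Timedelta(days=N)).any():
            hits += 1

print(f"{hits}/{total} extreme reaction days had a filing within {N} days")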

EDGAR – Execution Plan

FTP Syntax
ftp://ftp.sec.gov/edgar/daily-index/form.20141015.idx

Plan

  1. Batch download daily index files from SEC EDGAR for the period October 1, 2014 > October 25, 2014
  2. Count occurrences of Form 3, Form 4, and Form 13D (see the sketch after this list)
  3. Record 1D, 2D, 7D, 14D price reactions
  4. Regress filing occurrence vs. price reaction (standard 1, 2, 3, 4 quartiles)
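A sketch of steps 1–2, assuming the FTP path shown above is still reachable and that each .idx file puts the form type in a left-aligned first column; days without an index (weekends, holidays) are skipped.

# Pull each daily form index and tally Form 3 / Form 4 / SC 13D filings.
import urllib.request
from collections import Counter
from datetime import date, timedelta

counts = Counter()
day = date(2014, 10, 1)
while day <= date(2014, 10, 25):
    url = f"ftp://ftp.sec.gov/edgar/daily-index/form.{day:%Y%m%d}.idx"
    try:
        lines = urllib.request.urlopen(url).read().decode("latin-1").splitlines()
    except Exception:
        day += timedelta(days=1)       # no index published on non-trading days
        continue
    for line in lines:
        for form in ("3 ", "4 ", "SC 13D "):
            if line.startswith(form):
                counts[form.strip()] += 1
    day += timedelta(days=1)

print(counts)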

Considerations

  • Derive company tickers from SEC CIK codes
  • Batch download price data from Yahoo finance (defer to FactSet if the process is too time consuming); see the sketch after this list
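For the price-data step, the third-party yfinance package is one possible stand-in for “batch download from Yahoo finance”; the tickers and dates below are placeholders.

# Batch download daily prices for a list of tickers over the study window.
import yfinance as yf

prices = yf.download(["AAPL", "MSFT"], start="2014-10-01", end="2014-10-25")["Close"]
print(prices.tail())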

Crunchbase – UI Update

Crunchbase Project

Following the initial analysis, I assembled three forms. After defining the following parameters:

  1. Industry
  2. Region

My form auto-populates with relevant information related to a company’s geography, total funding, and website. Further detail:

  • Form One: Companies by region > median funding > average funding
  • Form Two: Funding League Table, Top 20 Companies in terms of total funding
  • Form Three: Auto-populate list of all companies in dataset including
    • Company Name
    • Company Website
    • Total Funding
    • Operating Status (Operating, Acquired, Closed)
    • Headquarters
    • Founding Year
    • Date of First Funding Round
    • Date of Last Funding Round

[Screenshots: Form One, Form Two, and Form Three]

Considerations + Follow Up

Following the initial analysis, I discovered a more efficient way to rank + index values (the dataset is 56% smaller with the new methodology I developed). At this point, it is possible to expand the data beyond the scope of California to include the aggregate database (the world). The database will be 60% larger (about 9MB) and incorporate > 1 million points of information.

Inclusions for v4

  • Keyword search function
  • Parameters defined by
    • Country
    • State
    • Region
    • Industry
  • Include search tags
  • Include catalog of all companies in DB, graphically
  • Cleanse data: 0 values = “Not Listed” or “Not Available”

Crunchbase – Introduction

Introduction

CrunchBase is a dataset of startup activity, and it’s accessible to everyone. Founded in 2007 by Mike Arrington, CrunchBase began as a simple crowd-sourced database to track startups covered on TechCrunch. Today you’ll find about 650k profiles of people and companies, maintained by tens of thousands of contributors.

Crunchbase makes all of its data available across a wide variety of platforms, even Excel.

I recently exported the Crunchbase database, and with it, 7MB of company information spanning Company name, website, total funding, number of funding rounds, and region. I initially downloaded the data to integrate it with a company watchlist I’ve been developing to screen for VC funding within the Internet industry. I realized that the dataset could be used for many other purposes – primarily, identifying which regions in the country are home to the companies, across various industries, that have received the most funding within the last few years.

Figure 4.1.1: Reducing the Size of the Crunchbase Dataset


The dataset encompasses 47,758 lines of information across 21 different columns (> 1 million data points). I began to run my analysis on this set, but as the formulas I’d developed to aid my analysis began to build on top of one another, my sheet became unbearably slow. It became obvious that my computer just didn’t have the processing power to crunch the numbers as I’d like – and that Excel wasn’t the right arena for this type of analysis. I figured the best way to cut my losses was to simply reduce the size of my dataset. I cut my dataset down to 20,441 lines (Figure 4.1.1), representing the 10 US states with the greatest number of companies in the Crunchbase database, which reduced the size of the database by more than 50%.
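A sketch of that cut, assuming the export has a "state_code" column (the file and column names are assumptions).

# Keep only companies in the 10 US states with the most entries in the export.
import pandas as pd

companies = pd.read_csv("crunchbase_companies.csv")
top10 = companies["state_code"].value_counts().head(10).index
reduced = companies[companies["state_code"].isin(top10)]
print(len(companies), "->", len(reduced), "rows")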

Figure 4.1.2: Cumulative Distribution of Companies across the Top 10 US States represented in the Crunchbase Database


After cutting my data down, I narrowed the set to include New Jersey, Colorado, Pennsylvania, Illinois, Florida, Washington, Texas, Massachusetts, New York, and California. California itself accounts for almost half of the dataset, and will be the focus of most of my analysis.

At this point, I parsed out 2 categorizations: the first is which regions within each state are home to the greatest number of Companies; the second is which industries are most represented within each state (remember, our database is drawn from Crunchbase, so it generally skews towards tech). For emphasis: our database was extracted from Crunchbase, so the primary companies under observation are those that have raised significant venture capital funding over the last few years.

While continuing to assemble my dataset, I omitted companies that didn’t have identifiers for either region or industry. While my data is imperfect (some companies don’t fall into clear categories, especially for industries), it is still quite representative of the general industry bias within regions. Additionally, because Crunchbase is an open-source database, a Company missing either data point (region, industry) is arguably not very active in the eyes of its industry or of VCs – if it were, someone would likely have entered the information already (or it may be a very early stage company).
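A sketch of the two tallies, dropping companies that are missing a region or industry identifier; the file and column names ("region", "category_list", "state_code") are assumptions about the export.

# Per-state counts of regions and industries, ignoring rows without identifiers.
import pandas as pd

companies = pd.read_csv("crunchbase_top10_states.csv")
clean = companies.dropna(subset=["region", "category_list"])

for state, grp in clean.groupby("state_code"):
    print(state)
    print("  top regions:", grp["region"].value_counts().head(3).to_dict())
    print("  top industries:", grp["category_list"].value_counts().head(3).to_dict())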

Number of Omissions per State (per the industry dataset):

  • CA: 671
  • NY: 210
  • MA: 147
  • TX: 143
  • WA: 72
  • FL: 108
  • IL: 72
  • PA: 84
  • CO: 56
  • NJ: 60

Figure 4.1.3: Example Regional and Industry Statistics: California Companies with recent VC/PE Funding

Regional Statistics: California


Industry Statistics: California


Figure 4.1.4: Example Regional and Industry Statistics: New York Companies with recent VC/PE Funding

Regional Statistics: New York


Industry Statistics: New York


EDGAR – Introduction

The Securities and Exchange Commission (SEC) makes all corporate filings available on EDGAR (the Electronic Data Gathering, Analysis, and Retrieval system).

Figure 3.1.1: SEC Forms, Uses, and Filing Period

With this publicly available data, I will scrape the website to run backtests on corporate insider trading v. stock price reaction. I’ll initially use Form 3 and Form 4 to build my sample. Form 3 is filed when an individual becomes a corporate insider, defined as a person with a beneficial interest of more than 10% in a company. Form 4 is used when these “insiders” buy or sell securities.

Moving forward, I will enlarge my study to incorporate institutional buying and selling. The third part of the process will involve Form 13D, which is filed within 10 days of an institution claiming a > 5% stake in a company.

If necessary, there are two branches to include within the study. The first is the 13G, which is filed when an institution acquires a significant interest in a security, but only for passive purposes (eg, mutual funds). The second involves proxy statement filings, which can cover investor activism (eg, Yahoo/Starboard) or merger agreements. While equally interesting, I will hold off on this analysis for now and concentrate on Forms 3 and 4 to remain focused.

Figure 3.1.2: Filing Period for SEC Forms


Initially, I plan on observing time series data for 2-3 companies. If I develop a suitable infrastructure for analysis (eg, filing > price reaction (increase/decrease)) and am able to identify significant relationships between the two (which I predict I will), I would like to run the same analysis on the S&P 500 for the last 5 years (5 years, to exclude the financial crisis).