Project Idea:
- Parse tickers by industry
- Track trading performance (% of 52-week high)
- Insert in node tree
- Import to (and export from) Gephi
Project Idea:
This morning, I assembled a quick text-parsing tool to help me analyze the contents of a text document.
Process
Output
Use Cases
Back-solving
[1] Also a good reference: Text Classification for Sentiment Analysis – Stopwords and Collocations
Crunchbase Project
About a month ago, I assembled a little sheet that lets a user navigate companies within the Crunchbase open-source dataset, which provides information about startup activity at companies in various industries.
While sources like Indeed do a good job of aggregating job postings at a single point in time, they paint an incomplete picture of the number of job vacancies that could be open at that point in time. I wanted to address another issue: which companies within different industries reside in which regions? Hiring needs at different companies change on a rather random basis, so I thought it would be helpful to help my friends create a “watch list” of companies within a specific industry so that they could
I also thought it would be beneficial for my friends because I’ve learned that a company’s funding history (time since last funding round + amount raised) can be indicative of its hiring strategy. Tools like LinkedIn are great for expanding one’s professional network. Helping friends discover and explore different companies brings them one step closer to connecting with future coworkers and companies.
I started to play with the data and to scale the project to include the entire dataset, but got bogged down with work obligations. So far, the dataset includes only companies located in California. My sheet features two separate search fields that enable a user to
The sheet autopopulates this data based on the search criteria listed above. The file is available for download if you click here (2 MB XLS file). I’ll make periodic updates, but am making it available for now.
Navigation
The primary focus of the dataset is to highlight companies by industry and region. In cell F6 of the tab titled “menu,” companies are listed in order of how frequently they occur in the database. The top 25 industries by representation in the database are as follows:
There are a total of 517 different industries represented in the dataset. To view a full list, click here.
User Interface
Two primary cells control the interface of the sheet, each via dropdown menu:
Left Side Pane: Industry Summary
Middle Pane: Funding League Table
Right Pane: All Companies within Industry in Chosen Region
Preliminary Analysis
Log Part 1 – Normal Distribution Follow Up
Log Part 2 – Qualitative Analysis, “Tails”
Review
Accessing the EDGAR FTP, I was able to download all SEC filings for CQ3 2014, which include about 206,000 lines of data. It should not be a problem to scale the dataset to include previous years. For now, I focused on CQ3 2014 at a high level.
EDGAR classifies each company by CIK code. By accessing Rank and Filed, I was able to download an index of CIK codes that map to US Exchange Tickers. I integrated this index with my dataset, and now am able to identify companies on a more universal basis (tickers).
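The CIK-to-ticker join described above can be sketched roughly as follows. This is only an illustration under assumptions: the field names (`cik`, `form`), the sample rows, and the helpers `normalize_cik` and `attach_tickers` are hypothetical, and the index is assumed to be already loaded into a dictionary.

```python
def normalize_cik(cik):
    """Return the CIK as an int so '0000000001' and 1 compare equal."""
    return int(str(cik).strip())


def attach_tickers(filings, cik_to_ticker):
    """Add a 'ticker' key to each filing; None when the CIK has no known ticker."""
    lookup = {normalize_cik(cik): ticker for cik, ticker in cik_to_ticker.items()}
    enriched = []
    for filing in filings:
        row = dict(filing)  # copy so the input rows are left untouched
        row["ticker"] = lookup.get(normalize_cik(filing["cik"]))
        enriched.append(row)
    return enriched


# Invented sample data, for illustration only.
sample_filings = [
    {"cik": "0000000001", "form": "4"},
    {"cik": "0000000002", "form": "3"},
    {"cik": "0000000003", "form": "SC 13D"},
]
sample_mapping = {"1": "AAA", "2": "BBB"}

tagged = attach_tickers(sample_filings, sample_mapping)
```

Normalizing the CIK to an integer avoids mismatches when one source zero-pads the code and the other does not. Filers without a ticker keep `None`, so they can be screened out later.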
I logged the frequency of each form filing, and chose to isolate Form 3, Form 4, and 13-D.
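Counting filings per form type could look something like the sketch below. It assumes the index rows have already been converted to a pipe-delimited layout (CIK|Company Name|Form Type|Date Filed|Filename). The sample rows are invented and `count_form_types` is a hypothetical helper.

```python
from collections import Counter

# Invented sample rows; the pipe-delimited layout is an assumption.
SAMPLE_INDEX = """\
CIK|Company Name|Form Type|Date Filed|Filename
1|EXAMPLE A|4|2014-07-01|edgar/data/1/a.txt
1|EXAMPLE A|4|2014-07-02|edgar/data/1/b.txt
2|EXAMPLE B|3|2014-07-02|edgar/data/2/c.txt
3|EXAMPLE C|SC 13D|2014-07-03|edgar/data/3/d.txt
4|EXAMPLE D|10-Q|2014-07-03|edgar/data/4/e.txt
"""


def count_form_types(index_text):
    """Count how many rows exist for each form type, skipping the header row."""
    counts = Counter()
    lines = index_text.strip().splitlines()
    for line in lines[1:]:
        fields = line.split("|")
        if len(fields) != 5:
            continue  # ignore malformed rows rather than guessing at their meaning
        counts[fields[2].strip()] += 1
    return counts


form_counts = count_form_types(SAMPLE_INDEX)

# Isolate the form types of interest (EDGAR labels the 13D as "SC 13D").
target_forms = ("3", "4", "SC 13D")
isolated = {form: form_counts.get(form, 0) for form in target_forms}
```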
Figure 3.1.2.1: Frequency of Form 3 Filings, CQ3 2014
Figure 3.1.2.2: Frequency of Form 4 Filings, CQ3 2014
Figure 3.1.2.3: Frequency of Form 13D/A Filings, CQ3 2014
I ran a quick regression between frequency of form filings (4, 3, SC 13D/A, 4/A, SC 13G/A, DEFA 14A, SC 13G, SC 13D) and the SP500 to see if there were any general relationships between frequency of filing and broad index performance.
Figure 3.1.2.4: SP500, 7/1/14 – 9/30/14
Figure 3.1.2.5: Regression Analysis, SP500 v. Form Filing Frequency
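For reference, a simple regression of this kind can be sketched in plain Python. The two series below are invented placeholders, not the actual filing counts or index closes, and `pearson_r` and `ols_fit` are hypothetical helper names.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / math.sqrt(var_x * var_y)


def ols_fit(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov / var_x
    return slope, mean_y - slope * mean_x


# Invented daily values: filings per day and index close on the same days.
daily_filing_counts = [120, 95, 140, 110, 130]
index_closes = [1970.0, 1975.5, 1968.2, 1980.1, 1977.4]

r = pearson_r(daily_filing_counts, index_closes)
r_squared = r ** 2  # share of variance explained in a one-variable regression
slope, intercept = ols_fit(daily_filing_counts, index_closes)
```

An `r_squared` near zero would correspond to no usable linear relationship between filing frequency and the index.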
On a broad scale, there is no meaningful correlation. Delving into more micro-level analysis, I set out to select 10 companies on which to test my analysis. I incorporated each company’s market capitalization from FactSet to screen out inactive companies and clean my dataset. I reduced my dataset in the following order:
I then sought to define what constituted an “active” filer. I parsed out the average and median filing frequencies for companies at different intervals.
Figure 3.1.2.6: Frequency of Filings at Different Market Capitalizations
I began to home in on my target company profile: a market cap somewhere between $500MM and $5,000MM. I also took a subset of the aggregate sample, measuring the first and third quartiles of filings for companies with market cap > $500MM, excluding companies that haven’t filed (active companies are required to file every quarter).
Figure 3.1.2.7: First and Third Quartile: Number of Filings with SEC for Companies with > $500MM Market Cap
At this point, I defined a frequent filer as one that posted more than 23 filings per quarter. I justified this because my sample population of companies has a median of fewer than 13 filings per quarter, and even the skewed average only approaches 20.
The following are the companies I have chosen to analyze:
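The quartile and threshold logic can be sketched as below. The 23-filing cutoff comes from the text. The company names, filing counts, and the helper `summarize_filers` are invented for illustration.

```python
import statistics

FREQUENT_FILER_THRESHOLD = 23  # filings per quarter, as defined in the text


def summarize_filers(filings_per_company, threshold=FREQUENT_FILER_THRESHOLD):
    """Summarize filing counts, ignoring companies with no filings in the quarter."""
    active = {name: n for name, n in filings_per_company.items() if n > 0}
    counts = sorted(active.values())
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(counts, n=4)
    return {
        "median": statistics.median(counts),
        "mean": statistics.mean(counts),
        "q1": q1,
        "q3": q3,
        "frequent": sorted(name for name, n in active.items() if n > threshold),
    }


# Invented filing counts per company for one quarter.
sample_counts = {"AAA": 5, "BBB": 12, "CCC": 24, "DDD": 30, "EEE": 0}
summary = summarize_filers(sample_counts)
```

Comparing `summary["median"]` with `summary["mean"]` gives a quick read on how far a few heavy filers pull the average upward.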
Figure 3.1.2.8: EDGAR Project Universe v1
Follow-Up
FTP Syntax
ftp://ftp.sec.gov/edgar/daily-index/form.20141015.idx
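Following the pattern of the path above, the daily index locations for a date range can be generated as in this sketch. It only builds the strings and does not download anything. The helper names are hypothetical, and it assumes one index per weekday, so market holidays would still need to be skipped.

```python
from datetime import date, timedelta

BASE_URL = "ftp://ftp.sec.gov/edgar/daily-index"  # taken from the example path above


def daily_form_index_url(day):
    """Build the form-index path for one day, e.g. form.20141015.idx."""
    return f"{BASE_URL}/form.{day:%Y%m%d}.idx"


def weekday_index_urls(start, end):
    """List index paths for every Monday-Friday between start and end, inclusive."""
    urls = []
    day = start
    while day <= end:
        if day.weekday() < 5:  # 0-4 are Monday through Friday
            urls.append(daily_form_index_url(day))
        day += timedelta(days=1)
    return urls


cq3_2014_urls = weekday_index_urls(date(2014, 7, 1), date(2014, 9, 30))
```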
Plan
Considerations
Crunchbase Project
Following the initial analysis, I assembled three forms. After defining the following parameters:
My form auto-populates with relevant information related to a company’s geography, total funding, and website. Further detail:
Form One:
Form Two:
Form Three:
Considerations + Follow Up
Following the initial analysis, I discovered a more efficient way to rank and index values (the dataset is 56% smaller with the new methodology I developed). At this point, it is possible to expand the data beyond California to include the aggregate database (the world). The database will be 60% larger (about 9 MB) and incorporate > 1 million points of information.
Inclusions for v4
Introduction
CrunchBase is a dataset of startup activity and it’s accessible to everyone. Founded in 2007 by Mike Arrington, CrunchBase began as a simple crowdsourced database to track startups covered on TechCrunch. Today you’ll find about 650k profiles of people and companies that are maintained by tens of thousands of contributors.
Crunchbase makes all of its data available on a wide variety of platforms, even Excel.
I recently exported the Crunchbase database, and with it, 7 MB of company information including company name, website, total funding, number of funding rounds, and region. I initially downloaded the data to integrate it with a company watchlist I’ve been developing to screen for VC funding within the Internet industry. I realized that the dataset could be used for many other purposes, primarily identifying which regions in the country are home to companies within various industries that have received the most funding within the last few years.
Figure 4.1.1: Reducing the Size of the Crunchbase Dataset
The dataset encompasses 47,758 lines of information across 21 different columns (> 1 million data points). I began to run my analysis on this set, but as the formulas I’d developed began to build on top of one another, my sheet became unbearably slow. It became obvious that my computer just didn’t have the processing power to crunch the numbers as I’d like, and that Excel is not the right arena for this type of analysis. I figured the best way to cut my losses was to simply reduce the size of my dataset. I cut it down to just 20,441 lines (Figure 4.1.1), which represent the 10 US states that have the greatest number of companies in the Crunchbase database, reducing the size of the database by more than 50%.
Figure 4.1.2: Cumulative Distribution of top 10 US Companies represented in the Crunchbase Database
After cutting my data down, the set includes New Jersey, Colorado, Pennsylvania, Illinois, Florida, Washington, Texas, Massachusetts, New York, and California. California itself accounts for almost half of the dataset, and will be the focus of most of my analysis.
At this point, I parsed out two categorizations. The first is which regions within each state are home to the greatest number of companies. The second is which industries are most represented within each state (remember, our database is in reference to Crunchbase companies, so it generally skews towards tech). For emphasis: our database was extracted from Crunchbase, so the primary companies under observation are those that have raised significant venture capital funding over the last few years.
When continuing to assemble my dataset, I omitted companies that didn’t have identifiers for either region or industry. While my data is imperfect (some companies don’t fall into clear categories, especially for industries), it is still quite representative of the general industry bias within regions. Additionally, because Crunchbase is an open-source database, a company missing either data point (region, industry) may not be a very active company in the eyes of its respective industry or of VCs, because if it were, someone would have entered the information already (or it may be a very early-stage company).
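Outside of Excel, the same reduction to the most-represented states could be sketched like this. The column name `state_code`, the sample rows, and the helper `keep_top_states` are assumptions for illustration. The demo keeps the top 2 states only because the sample is tiny.

```python
from collections import Counter


def keep_top_states(rows, n_states=10, state_key="state_code"):
    """Keep only rows whose state is among the n_states most frequent states."""
    counts = Counter(row[state_key] for row in rows if row.get(state_key))
    top = {state for state, _ in counts.most_common(n_states)}
    kept = [row for row in rows if row.get(state_key) in top]
    return kept, counts


# Invented sample rows, for illustration only.
sample_rows = [
    {"name": "Co A", "state_code": "CA"},
    {"name": "Co B", "state_code": "CA"},
    {"name": "Co C", "state_code": "CA"},
    {"name": "Co D", "state_code": "NY"},
    {"name": "Co E", "state_code": "NY"},
    {"name": "Co F", "state_code": "TX"},
    {"name": "Co G", "state_code": ""},
]

reduced, state_counts = keep_top_states(sample_rows, n_states=2)
retained_share = len(reduced) / len(sample_rows)
```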
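The omission step can be sketched as a filter that also tallies what was dropped per state. The field names (`region`, `industry`, `state_code`), the sample rows, and the helper `split_complete_rows` are hypothetical.

```python
from collections import Counter


def split_complete_rows(rows, required=("region", "industry"), state_key="state_code"):
    """Keep rows that have every required field; count the omitted rows by state."""
    kept = []
    omitted_by_state = Counter()
    for row in rows:
        # A field counts as missing when it is absent, None, or blank.
        if all((row.get(field) or "").strip() for field in required):
            kept.append(row)
        else:
            omitted_by_state[row.get(state_key) or "unknown"] += 1
    return kept, omitted_by_state


# Invented sample rows, for illustration only.
sample_companies = [
    {"name": "Co A", "state_code": "CA", "region": "SF Bay Area", "industry": "Software"},
    {"name": "Co B", "state_code": "CA", "region": "", "industry": "Software"},
    {"name": "Co C", "state_code": "NY", "region": "New York City", "industry": None},
    {"name": "Co D", "state_code": "NY", "region": "New York City", "industry": "Advertising"},
]

complete, omissions = split_complete_rows(sample_companies)
```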
Number of Omissions per State (per the industry dataset):
Figure 4.1.3: Example Regional and Industry Statistics: California Companies with recent VC/PE Funding
Regional Statistics: California
Industry Statistics: California
Figure 4.1.4: Example Regional and Industry Statistics: New York Companies with recent VC/PE Funding
Regional Statistics: New York
Industry Statistics: New York
The Securities and Exchange Commission (SEC) makes all corporate filings available on EDGAR (the Electronic Data Gathering, Analysis, and Retrieval system).
Figure 3.1.1: SEC Forms, Uses, and Filing Period
With this publicly available data, I will scrape the website to run backtests on corporate insider trading v. stock price reaction. I’ll initially use Form 3 and Form 4 to run my sample. Form 3 is filed when an individual becomes a corporate insider, a category that includes officers, directors, and beneficial owners of more than 10% of a company’s stock. Form 4 is used when these “insiders” buy or sell securities.
Moving forward, I will enlarge my study to incorporate institutional buying and selling. The third part of the process will involve the Form 13D, which is filed within 10 days of an institution claiming a > 5% stake within a company.
If I feel it necessary, there are two further branches to include within the study. The first is the 13G, which is filed when an institution acquires a significant interest in a security, but only for passive purposes (e.g., mutual funds). The second involves proxy statement filings, which can relate to investor activism (e.g., Yahoo/Starboard) or merger agreements. While these are equally interesting, I will hold off on this analysis for now and focus on Forms 3 and 4.
Figure 3.1.2: Filing Period for SEC Forms
Initially, I plan on observing time series data for 2-3 companies. If I develop a suitable infrastructure for analysis (e.g., filing > price reaction (increase/decrease)) and am able to find significant relationships between the two (which I predict I will), I would like to run the same analysis for the S&P 500 over the last 5 years (5 years, to exclude the financial crisis).
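One possible shape for the "filing > price reaction" step is sketched below. This is an assumption about how such a test might be set up, not a settled design: the prices are invented, the 5-trading-day horizon is arbitrary, and `price_reaction` is a hypothetical helper.

```python
from datetime import date


def price_reaction(prices, filing_date, horizon_days=5):
    """Fractional price change from the first close on or after the filing date
    to the close horizon_days trading rows later; None if data runs out.

    prices is a list of (date, close) tuples sorted by date.
    """
    start = next((i for i, (d, _) in enumerate(prices) if d >= filing_date), None)
    if start is None or start + horizon_days >= len(prices):
        return None
    base = prices[start][1]
    later = prices[start + horizon_days][1]
    return (later - base) / base


# Invented closing prices on consecutive trading days.
sample_prices = [
    (date(2014, 7, 1), 100.0),
    (date(2014, 7, 2), 101.0),
    (date(2014, 7, 3), 102.0),
    (date(2014, 7, 7), 103.0),
    (date(2014, 7, 8), 104.0),
    (date(2014, 7, 9), 110.0),
]

reaction = price_reaction(sample_prices, filing_date=date(2014, 7, 1))
direction = "increase" if reaction > 0 else "decrease" if reaction < 0 else "flat"
```

Returning `None` when the price series is too short keeps filings near the end of the sample from being scored on incomplete data.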