Crunchbase – Introduction

Introduction

Crunchbase is a dataset of startup activity that is accessible to everyone. Founded in 2007 by Mike Arrington, Crunchbase began as a simple crowdsourced database to track startups covered on TechCrunch. Today you’ll find about 650k profiles of people and companies, maintained by tens of thousands of contributors.

Crunchbase makes all of its data available in a wide variety of formats, including Excel.

I recently exported the Crunchbase database, and with it, 7 MB of company information including company name, website, total funding, number of funding rounds, and region. I initially downloaded the data to integrate it with a company watchlist I’ve been developing to screen for VC funding within the Internet industry. I realized that the dataset could be used for many other purposes, primarily identifying which regions in the country are home to the companies, across various industries, that have received the most funding within the last few years.

Figure 4.1.1: Reducing the Size of the Crunchbase Dataset


The dataset encompasses 47,758 rows of information across 21 columns (more than 1 million data points). I began to run my analysis on this set, but as the formulas I’d developed began to build on top of one another, my sheet became unbearably slow to recalculate. It became obvious that my computer just didn’t have the processing power to crunch the numbers as I’d like, and that Excel isn’t the right tool for this type of analysis. I figured the best way to cut my losses was simply to reduce the size of my dataset. I cut it down to just 20,441 rows (Figure 4.1.1), representing the 10 US states with the greatest number of companies in the Crunchbase database, which reduced the size of the dataset by more than 50%.
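The same reduction can be done outside a spreadsheet. Below is a minimal sketch in Python, assuming the export has been read into a list of dictionaries and that the state sits under a key named `state` (a hypothetical column name; the real export header may differ). The toy rows are made up for illustration and are not Crunchbase records.

```python
from collections import Counter


def top_states(rows, n=10, state_key="state"):
    """Return the n state codes that appear most often in rows."""
    counts = Counter(row[state_key] for row in rows if row.get(state_key))
    return [state for state, _ in counts.most_common(n)]


def keep_states(rows, states, state_key="state"):
    """Keep only the rows whose state is in the given collection."""
    wanted = set(states)
    return [row for row in rows if row.get(state_key) in wanted]


# Toy rows for illustration only (not real Crunchbase records).
rows = [
    {"name": "A", "state": "CA"},
    {"name": "B", "state": "CA"},
    {"name": "C", "state": "CA"},
    {"name": "D", "state": "NY"},
    {"name": "E", "state": "NY"},
    {"name": "F", "state": "TX"},
    {"name": "G", "state": ""},  # missing state, ignored when counting
]

top = top_states(rows, n=2)        # the two most common states in the toy data
reduced = keep_states(rows, top)   # rows belonging to those states only

# Sanity check on the figures quoted in the text.
data_points = 47_758 * 21          # rows times columns
reduction = 1 - 20_441 / 47_758    # fraction of rows removed
```

With the real export, `rows` could come from `csv.DictReader` over the downloaded file, and `n=10` would reproduce the ten-state cut described above.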

Figure 4.1.2: Cumulative Distribution of the Top 10 US States Represented in the Crunchbase Database


After cutting my data down, the set includes New Jersey, Colorado, Pennsylvania, Illinois, Florida, Washington, Texas, Massachusetts, New York, and California. California alone accounts for almost half of the dataset, and will be the subject of most of my analysis.
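A cumulative distribution like the one in Figure 4.1.2 can be computed from per-state company counts. The sketch below is one way to do it, assuming the counts are already in a dictionary; the function name `cumulative_shares` and the example counts are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
from itertools import accumulate


def cumulative_shares(counts):
    """Return (label, share, cumulative_share) tuples, largest count first."""
    total = sum(counts.values())
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    running = accumulate(count for _, count in ordered)
    return [
        (label, count / total, cumulative / total)
        for (label, count), cumulative in zip(ordered, running)
    ]


# Made-up counts for illustration; these are not the Crunchbase figures.
example_counts = {"NY": 30, "CA": 50, "MA": 20}
shares = cumulative_shares(example_counts)

for label, share, cumulative in shares:
    print(f"{label}: {share:.0%} of companies, {cumulative:.0%} cumulative")
```

Sorting by descending count first is what makes the running total read as "the top k states cover this much of the dataset".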

At this point, I broke the data out along two dimensions. The first is which regions within each state are home to the greatest number of companies. The second is which industries are most represented within each state (remember, our database is drawn from Crunchbase companies, so it generally skews towards tech). For emphasis: our database was extracted from Crunchbase, so the primary companies under observation are those that have raised significant venture capital funding over the last few years.
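Both breakdowns are the same operation applied to a different column: count companies per value, grouped by state. Here is a minimal sketch, assuming hypothetical column names `state`, `region`, and `industry` and a handful of invented rows; it is an illustration of the idea, not the spreadsheet formulas I actually used.

```python
from collections import Counter, defaultdict


def tally_by_state(rows, field, state_key="state"):
    """Count how many rows fall under each value of field, per state."""
    tallies = defaultdict(Counter)
    for row in rows:
        state = row.get(state_key)
        value = row.get(field)
        if state and value:  # skip rows missing either identifier
            tallies[state][value] += 1
    return tallies


# Invented rows for illustration only.
companies = [
    {"state": "CA", "region": "SF Bay", "industry": "software"},
    {"state": "CA", "region": "SF Bay", "industry": "web"},
    {"state": "CA", "region": "Los Angeles", "industry": "software"},
    {"state": "NY", "region": "New York", "industry": "advertising"},
    {"state": "NY", "region": "New York", "industry": ""},
]

by_region = tally_by_state(companies, "region")
by_industry = tally_by_state(companies, "industry")

# Largest region per state in the toy data.
top_region = {state: c.most_common(1)[0][0] for state, c in by_region.items()}
```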

While assembling my dataset, I omitted companies that didn’t have identifiers for either region or industry. While my data is imperfect (some companies don’t fall into clear categories, especially for industries), it is still quite representative of the general industry bias within regions. Additionally, because Crunchbase is a crowdsourced database, a company missing either data point (region or industry) may not be very active in the eyes of its industry or of VCs, because if it were, someone would likely have entered the information already (or it may be a very early-stage company).

Number of Omissions per State (per the industry dataset):

  • CA: 671
  • NY: 210
  • MA: 147
  • TX: 143
  • WA: 72
  • FL: 108
  • IL: 72
  • PA: 84
  • CO: 56
  • NJ: 60
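Omission counts like these fall out of the filtering step itself. The sketch below shows one way to separate kept rows from omitted ones and tally the omissions per state, assuming the same hypothetical `state` and `industry` column names and treating a blank or whitespace-only value as missing. The rows are invented and do not reproduce the numbers listed above.

```python
from collections import Counter


def split_omissions(rows, field, state_key="state"):
    """Return (kept_rows, omitted_per_state) for rows lacking a value in field."""
    kept = []
    omitted = Counter()
    for row in rows:
        if (row.get(field) or "").strip():
            kept.append(row)
        else:
            omitted[row.get(state_key)] += 1
    return kept, omitted


# Invented rows for illustration only.
sample = [
    {"name": "A", "state": "CA", "industry": "software"},
    {"name": "B", "state": "CA", "industry": ""},
    {"name": "C", "state": "CA", "industry": "  "},
    {"name": "D", "state": "NY", "industry": None},
    {"name": "E", "state": "NY", "industry": "web"},
]

kept_rows, omitted_by_state = split_omissions(sample, "industry")
```

Running the same function with `field="region"` would give the omissions for the regional dataset.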

Figure 4.1.3: Example Regional and Industry Statistics: California Companies with recent VC/PE Funding

Regional Statistics: California


Industry Statistics: California


Figure 4.1.4: Example Regional and Industry Statistics: New York Companies with recent VC/PE Funding

Regional Statistics: New York


Industry Statistics: New York
