Analyzing Plotly’s Python package downloads

In this post, we will collect and analyze download statistics for Plotly’s Python package available on PyPI. We will also compare the downloads with other interactive charting tools like Bokeh, Vincent, and MPLD3.

Data Collection

PyPI used to show download statistics for packages, but that service has been discontinued while the next-generation Python package repository, Warehouse, is under development.

Linehaul will act as a statistics collection daemon for incoming logs from the new PyPI (warehouse). Right now, the current activity log on PyPI is being stored in a BigQuery database. (source: [Distutils] Publicly Queryable Statistics)

We will use the gbq.read_gbq function to read the BigQuery dataset into pandas DataFrame objects.

We will use the linregress function to fit a linear regression line to the scatter plots.

Read the post Using Google BigQuery with Plotly and Pandas to create a new project.

This query will collect the timestamp, package name, and total download count columns from the table (on a daily basis).
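As a rough illustration of what such a query could look like, here is a minimal sketch that builds the query string for one package. It is written in BigQuery legacy SQL. The table name [the-psf:pypi.downloads], the file.project column, the output column names, and the helper name build_daily_downloads_query are all assumptions made for this example and may not match the query used in the original notebook.

```python
# Hypothetical query template (legacy SQL). The table and column names are
# assumptions for illustration, not confirmed by this post.
QUERY_TEMPLATE = """
SELECT
  DATE(timestamp) AS day,
  file.project AS package,
  COUNT(*) AS total_downloads
FROM TABLE_DATE_RANGE(
  [the-psf:pypi.downloads],
  TIMESTAMP("{start}"),
  CURRENT_TIMESTAMP()
)
WHERE file.project = '{package}'
GROUP BY day, package
ORDER BY day
"""


def build_daily_downloads_query(package, start="20160122"):
    """Return a query that counts downloads per day for one package."""
    return QUERY_TEMPLATE.format(package=package, start=start)


plotly_query = build_daily_downloads_query("plotly")
```

The default start date matches the dataset creation date mentioned later in this post (Jan 22, 2016).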

The following function runs the query and, if successful, returns a DataFrame object.
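A minimal sketch of such a function is shown below. It assumes the pandas-gbq package, whose read_gbq(query, project_id=...) call is the standalone successor of pandas.io.gbq.read_gbq. The function name run_query, the reader parameter (which lets the BigQuery call be swapped out for testing), and the choice to return None on failure are assumptions of this sketch.

```python
def run_query(query, project_id, reader=None, **options):
    """Run a BigQuery query and return a DataFrame, or None if it fails.

    `reader` is a hypothetical hook for this sketch; by default the
    function uses pandas_gbq.read_gbq.
    """
    try:
        if reader is None:
            # Imported lazily so the sketch loads without pandas-gbq installed.
            from pandas_gbq import read_gbq as reader
        return reader(query, project_id=project_id, **options)
    except Exception as error:
        # Report the failure and signal it to the caller with None.
        print("Query failed:", error)
        return None
```

A call might look like run_query(query, "my-project-id"), where "my-project-id" is a placeholder for your own Google Cloud project. Extra keyword arguments are passed through to the reader unchanged.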

We will construct different DataFrames for each package.

Inspecting for missing data

Using a simple TimeDelta calculation, we can check whether any rows are missing from the DataFrame.
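One way to do this check is sketched below: sort the dates, take the difference between consecutive rows, and report every jump longer than one day. The helper name find_gaps is hypothetical, and the sketch assumes the dates live in a column called 'day'.

```python
import pandas as pd


def find_gaps(df, column="day"):
    """Return (first_missing_day, last_missing_day) pairs for each gap."""
    days = pd.to_datetime(df[column]).sort_values().reset_index(drop=True)
    one_day = pd.Timedelta(days=1)
    # Difference between each day and the previous one; the first is NaT.
    deltas = days.diff()
    gap_positions = deltas[deltas > one_day].index
    return [
        (days.iloc[i - 1] + one_day, days.iloc[i] - one_day)
        for i in gap_positions
    ]


# Toy data, not the real download table.
sample = pd.DataFrame(
    {"day": ["2016-03-04", "2016-03-05", "2016-05-22", "2016-05-23"]}
)
gaps = find_gaps(sample)
```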

We find that there are no rows from 2016-03-06 to 2016-05-21.

Data Transformation

Here, we will append the missing values to the DataFrames.

We are using the pandas.concat function to append the new DataFrame with missing values to the old DataFrame.

The following function returns the updated DataFrame after sorting it (sort_values) by the values in the column ‘day’.
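The append-and-sort step can be sketched as follows. The helper name add_missing_days and the 'total_downloads' column are assumptions for illustration; the DataFrame holding the recovered rows is taken as given.

```python
import pandas as pd


def add_missing_days(df, missing_df, column="day"):
    """Append the recovered rows and return the result sorted by day."""
    combined = pd.concat([df, missing_df], ignore_index=True)
    return combined.sort_values(column).reset_index(drop=True)


# Toy data, not the real download counts.
old = pd.DataFrame(
    {"day": pd.to_datetime(["2016-03-05", "2016-05-22"]),
     "total_downloads": [10, 30]}
)
recovered = pd.DataFrame(
    {"day": pd.to_datetime(["2016-03-06"]), "total_downloads": [20]}
)
updated = add_missing_days(old, recovered)
```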

The updated DataFrames now include the recovered missing data.

Package Downloads Comparison (Daily)

Package Downloads Comparison (Monthly)

The dataset was created on Jan 22, 2016, so the x-axis will show the months from January 2016 onward.

We are using pandas’ groupby method to gather all the rows by their month value and then summing their counts to find the total downloads in each month.
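A minimal sketch of this monthly aggregation is shown below. It assumes a 'day' column of dates and a 'total_downloads' column of daily counts; both column names and the helper name monthly_totals are assumptions of this example.

```python
import pandas as pd


def monthly_totals(df, day_col="day", count_col="total_downloads"):
    """Sum the daily download counts for each calendar month."""
    months = pd.to_datetime(df[day_col]).dt.to_period("M")
    return df.groupby(months)[count_col].sum()


# Toy data, not the real download counts.
daily = pd.DataFrame(
    {"day": ["2016-01-22", "2016-01-23", "2016-02-01"],
     "total_downloads": [10, 20, 5]}
)
monthly = monthly_totals(daily)
```

Grouping by a monthly period keeps the same month in different years apart, which grouping by month number alone would not.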

Growth of Plotly package downloads

Following the tutorial Linear fit in Python, we will try to find an approximate regression line for the scatter graph of Plotly package’s downloads.
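The fit itself can be sketched with scipy.stats.linregress, using the day index (0, 1, 2, ...) as X and the daily download count as Y. The helper name fit_growth_line is hypothetical, and the sample counts are made up for the example.

```python
import numpy as np
from scipy.stats import linregress


def fit_growth_line(daily_counts):
    """Fit Y = slope * X + intercept, where X is the day index."""
    x = np.arange(len(daily_counts))
    result = linregress(x, daily_counts)
    return result.slope, result.intercept


# Made-up counts that lie exactly on the line Y = 2X + 1.
slope, intercept = fit_growth_line([1, 3, 5, 7])
```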

The following traces are for the package downloads scatter plot (for each package).
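As a sketch, the traces can be written as plain dictionaries in Plotly's figure format, one scatter trace per package. The helper name scatter_trace and the sample values are assumptions for illustration only.

```python
def scatter_trace(days, counts, name):
    """Build one Plotly scatter trace (as a plain dict) for a package."""
    return {
        "type": "scatter",
        "mode": "markers",
        "x": list(days),
        "y": list(counts),
        "name": name,
    }


# Made-up sample values, one entry per package.
sample_data = {
    "Plotly": (["2016-01-22", "2016-01-23"], [100, 120]),
    "Matplotlib": (["2016-01-22", "2016-01-23"], [7000, 7100]),
}
traces = [
    scatter_trace(days, counts, name)
    for name, (days, counts) in sample_data.items()
]
```

The resulting list can be used as the data of a Plotly figure.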

Similarly, we can find the approximate growth line for ‘Matplotlib’.

Daily download counts for ‘Matplotlib’ currently range between about 7000 and 8000.

How much time will it take for Plotly to reach that level?

Using Plotly’s growth line equation Y = 13.29X − 282.55, we can estimate the number of days needed for daily downloads to reach 8000.

Setting Y = 8000 gives X ≈ 623.2, so the line first reaches 8000 at day index 624; the current day index is 220 as of Aug 29, 2016.

That means it will take roughly 400 days (from Aug 29, 2016) for Plotly to reach Matplotlib’s current download range.
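The arithmetic can be checked with a few lines of Python. The slope, intercept, target, and current day index come from the text above; the variable and function names are chosen for this sketch.

```python
import math

SLOPE = 13.29
INTERCEPT = -282.55
CURRENT_DAY_INDEX = 220  # Aug 29, 2016


def day_index_for(target, slope=SLOPE, intercept=INTERCEPT):
    """Solve target = slope * X + intercept for X."""
    return (target - intercept) / slope


exact_index = day_index_for(8000)
# First whole day on which the fitted line is at or above the target.
first_day_index = math.ceil(exact_index)
days_remaining = first_day_index - CURRENT_DAY_INDEX
```

This assumes the linear trend continues unchanged, which is only an approximation.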

The IPython Notebook for this analysis is available here: Analyzing Plotly’s Python package downloads.