Using Google BigQuery with Plotly and Pandas

In this IPython Notebook, we will learn about integrating Google’s BigQuery with Plotly.

We will query a BigQuery dataset and visualize the results using the Plotly library.

What is BigQuery?

It’s a service by Google, which enables analysis of massive datasets. You can use the traditional SQL-like language to query the data.

You can host your own data on BigQuery to use the super fast performance at scale.

Google BigQuery Public Datasets

There are a few datasets stored in BigQuery, available for general public to use.

Some of the publicly available datasets are:

  • Hacker News (stories and comments)
  • USA Baby Names
  • GitHub activity data
  • USA disease surveillance

We will use the Hacker News dataset for our analysis.

Prerequisites

You need to have the following libraries:

Create Project

A project can be created on the Google Developer Console.

Enable BigQuery API

You need to activate the BigQuery API for the project.

Enable BigQuery
Enable BigQuery

Project ID Credentials
Project ID Credentials

Top 10 Most Active Users on Hacker News (by total stories submitted)

We will select the top 10 high scoring authors and their respective score values.

The pandas.gbq module provides a method read_gbq for querying the BigQuery stored dataset and stores the result as a DataFrame.

Using the create_table method from the FigureFactory module, we can generate a table from the resulting DataFrame.

Top 10 Hacker News Submissions (by score)

We will select the title and score columns in the descending order of their score, keeping only top 10 stories among all.

You can see that the lists consist of the stories involving some big names.

  • “Death of Steve Jobs and Aaron Swartz”.
  • “Announcements of the Hyperloop and the game 2048”.
  • “Microsoft open sourcing the .NET”

The story title is visible when you hover over the bubbles.

From which Top-level domain (TLD) most of the stories come?

Here we have used the url-function TLD from BigQuery’s query syntax.

We collect the domain for all URLs with their respective count and group them by it.

We can notice that the .com top-level domain contributes to most of the stories on Hacker News.

Public response to the “Who Is Hiring?” posts

There is an account on Hacker News by the name whoishiring.

This account automatically submits a ‘Who is Hiring?’ post at 11 AM Eastern time on the first weekday of every month.

Submission Traffic Volume in a Week

We can observe that the Hacker News faces fewer submissions during the weekends.

Programming Language Trend on Hacker News

We will compare the trends for the Python and PHP programming languages, using the Hacker News post titles.

As we already know about this trend, Python is dominating PHP throughout the timespan.