Playing with GitHub Data: The Start (Part 1)

Since March/April, I have been sitting on my hands, not quite ready to hit the “Publish” button and push my words off onto the instantaneous online press. But as time passes, I am losing the details of the what, why, and how of my GitHub data explorations, and going back through my GitHub history to reconstruct my work processes requires more and more mental aerobics. Hence, while I am still in a state of partial clarity, I will do my best to squeeze out my months-past thoughts during an afternoon writing spurt.

* * * Here I go! * * *

Wanting to have fun exploring GitHub data, I headed first to the playground: GitHub’s API. It had all that I could want for my preliminary explorations. I could grab data on repository languages, pushes, forks, stars, etc. You name it. The problem was that it was like standing at the base of a bouldering wall, positioning yourself in your first hanging form, and then not knowing how to reach the next hold on the colored route. My technique was not quite even a V0, and my arms were too weak to compensate by just pulling myself up through sheer muscle.

That’s when I found GitHub Archive. GitHub Archive is a “project to record the public GitHub timeline, archive it, and make it easily accessible for further analysis.” With the data hosted on Google BigQuery, the GitHub event history could be queried directly with SQL.

SQL! SELECT statements! I can do that!

[Screenshot: the Google BigQuery web console]

I decided to go ahead and first explore GitHub’s data via GitHub Archive. The API would have to wait.
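
To give a flavor of what that looks like, here is a minimal sketch of a query against the archive. The table name [githubarchive:github.timeline] and the type field are my assumptions about the legacy GitHub Archive schema on BigQuery, so treat them as placeholders rather than verified references.

  -- Minimal sketch (legacy BigQuery SQL): how many events of each type does the archive hold?
  -- Note: this scans the whole timeline table, so it is not a cheap query.
  SELECT type, COUNT(*) AS events
  FROM [githubarchive:github.timeline]
  GROUP BY type
  ORDER BY events DESC;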

I started out with two basic questions: What is the growth trend of total GitHub repositories? And what is the growth trend of total GitHub projects, i.e., repositories excluding forks?

To answer these questions, I parsed the GitHub Archive data in three ways:

  • GitHub Archive: all events
  • GitHub Archive: CreateEvent
  • GitHub Archive: PushEvent

Side Track: CreateEvent vs. PushEvent

The reason I looked at both CreateEvent and PushEvent was to investigate the peculiarities that emerged when I ran my initial query for the number of repositories by programming language and creation date, restricted to CreateEvent only. Looking for whether anyone else had run a similar analysis against which I could compare my results, I stumbled upon Adam Bard’s blog post “Top GitHub Languages for 2013 (so far)”.

Here was his query.

[Screenshot: Adam Bard’s top-languages query]
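
Since the screenshot does not reproduce well here, the rough shape of that style of query (counting CreateEvent rows per repository language) is sketched below. This is a reconstruction of the approach, not his exact SQL, and the table and field names are again assumptions about the legacy schema.

  -- Reconstruction of the approach (not Adam's exact query):
  -- count CreateEvent rows per language and keep the top 25.
  SELECT repository_language, COUNT(repository_language) AS repos
  FROM [githubarchive:github.timeline]
  WHERE type = 'CreateEvent'
    -- plus a created_at filter for the "2013 so far" window
  GROUP BY repository_language
  ORDER BY repos DESC
  LIMIT 25;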

This query was similar to mine, but after exploring the data more closely, I had a couple of issues with the approach. Although it may be difficult to see without looking at the raw data structure, the questions that popped up were:

  • Wouldn’t counting non-distinct languages, i.e., count(repository_language), result in duplicate counts of unique repositories?
  • Shouldn’t we be looking only at CreateEvents with payload_ref_type == “repository” rather than also including branches and tags?
  • Don’t we have a “null” issue for the repository language identifications of CreateEvents? (A quick way to compute these shares is sketched right after this list.)
    • over 99% of CreateEvents with payload_ref_type == “repository” have “null” repository language
    • over 60% of CreateEvents across all payload_ref_type values {repository (>99%), branch (~28%), tag (~57%)} have “null” repository language
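
Those shares can be computed with something along these lines, under the same assumed legacy schema; depending on how the archive encodes missing values, the IS NULL test may need to become a check for an empty string or the literal string "null".

  -- Share of CreateEvents with a missing repository language, split by payload_ref_type.
  SELECT payload_ref_type,
         COUNT(*) AS events,
         SUM(IF(repository_language IS NULL, 1, 0)) AS null_language_events,
         SUM(IF(repository_language IS NULL, 1, 0)) / COUNT(*) AS null_language_share
  FROM [githubarchive:github.timeline]
  WHERE type = 'CreateEvent'
  GROUP BY payload_ref_type;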

Example

Let’s look at Ceasar’s repository “twosheds”.

[Screenshot: CreateEvent records for Ceasar’s “twosheds” repository]

If we count using count(repository_language), we would have the following total counts by language:

  • Python: 14
  • null: 1

Now, this seems wrong. Why would we count these records multiple times as unique repositories? Instead, these should fall under one repository count for “twosheds” and be categorized as written in Python. What if we restrict the query to only CreateEvents with payload_ref_type == “repository” and count with count(distinct(repository_url))? The issue here is that the assigned language is now “null”.
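
In query form, that restricted count looks roughly like the following (same assumed schema). For “twosheds”, only the single repository-creation event survives the filter, and that event carries a “null” repository_language, which is exactly the problem.

  -- Count unique repositories per language, restricted to repository-creation events.
  SELECT repository_language, COUNT(DISTINCT repository_url) AS repositories
  FROM [githubarchive:github.timeline]
  WHERE type = 'CreateEvent'
    AND payload_ref_type = 'repository'
  GROUP BY repository_language;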

Sure, we could think of ways around this — for example, take the maximum non-missing language identification for each unique CreateEvent’s repository URL, regardless of the payload reference type. However, given the prevalence of “null” language identifications for CreateEvent, why not try PushEvent instead? We can check if PushEvent has similar missingness with its repository language identifications. If not, under the assumption that active repositories are pushed to, we can count the distinct repository URLs of PushEvents to identify unique repositories by creation date and programming language.
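
For concreteness, a sketch of that workaround under the same assumed schema: MAX() ignores nulls, so it surfaces a non-missing language whenever any of a repository’s CreateEvents carried one.

  -- CreateEvent workaround: one row per repository, keeping the "maximum"
  -- non-missing language seen across all of that repository's CreateEvents.
  SELECT repository_url, MAX(repository_language) AS language
  FROM [githubarchive:github.timeline]
  WHERE type = 'CreateEvent'
  GROUP BY repository_url;

That said, given how widespread the missing languages are for CreateEvent, checking PushEvent’s missingness first is the simpler route.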

Indeed, only ~12% of PushEvents have “null” repository language identifications. PushEvent seems to be a good alternative!

Back on Track & Back to Basics

Let’s go back to our initial questions: What is the growth trend of total GitHub repositories? What is the growth trend of total GitHub projects, i.e., repositories excluding forks? I computed the yearly totals using count(distinct(repository_url)) to count the number of unique repositories by creation date (year).

[Screenshot: SQL query for yearly totals of unique repositories]
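
The screenshot of the query is missing here, but the gist was along these lines. The repository_created_at and repository_fork fields are assumptions from the legacy timeline schema, and the type filter can be swapped between all events, CreateEvent, and PushEvent to produce the three variants compared below.

  -- Yearly totals of unique repositories, by repository creation year (PushEvent variant shown).
  -- Uncomment the fork filter to count "projects" (repositories excluding forks).
  -- Legacy BigQuery's COUNT(DISTINCT ...) is approximate past a threshold;
  -- EXACT_COUNT_DISTINCT(repository_url) trades cost for exact totals.
  SELECT SUBSTR(repository_created_at, 1, 4) AS created_year,
         COUNT(DISTINCT repository_url) AS unique_repositories
  FROM [githubarchive:github.timeline]
  WHERE type = 'PushEvent'
    -- AND repository_fork = 'false'
  GROUP BY created_year
  ORDER BY created_year;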

As a litmus test of sorts, I compared these yearly counts to those from GitHub Search.

[Charts: yearly totals from GitHub Archive vs. GitHub Search, for repositories and for projects]

What do we notice here?

Assuming GitHub Search as the barometer of expectation:

  • prior to 2012, GitHub Archive (all events, CreateEvent, PushEvent) seems to underestimate the totals of both repositories and projects
  • repositories: the Search counts roughly track GitHub Archive (all events)
  • projects: the Search counts roughly track GitHub Archive: PushEvent

Now, why do we observe these differences?

I mulled over the observed differences, raising my stack of questions higher. What is going on here? How will it affect my analyses?


For now, I will leave it here and continue next time with more on the GitHub analysis I ended up pursuing.

As I say adieu, I ask: What interesting or curious things do you notice here so far?

 

Figuring out GitHub Search

I have been wanting to play around with GitHub data as a side project, so when I first started poking around for what GitHub meta & event data I could grab, not yet being an API ninja, I went to the simplest point-click-Enter-Return shop in town: GitHub Search.

But when I was there, I noticed something strange. Each time I would refresh a search using the same parameters (e.g., “created:<2013-12-31 fork:true”), the results would produce different numbers. Moreover, conflicting counts of total repositories were displayed in the same search results: under “Repositories” and in “We’ve found [] repository results”.

[Screenshots: the same search returning different result counts across refreshes]

What was happening? I asked my go-to guy for all things tech (the technical stuff beyond the default “just try restarting your computer”). He gave me a “Hmmmm, that’s strange. Something to do with caching?”. Not quite convinced, I decided to seek a more authoritative explanation from GitHub.

[Screenshot: my question to GitHub Support]

I wasn’t waiting with bated breath (having already developed low expectations from a lifetime of mediocre customer service, unfair as it is to lump many tech startups/companies in with bureaucratic behemoths like “Comcrap”, as my roommate amusingly nicknames it).

Less than two days later I received a response from GitHub Support.

Thanks for getting in touch and asking about this!

 Some queries are computationally expensive for our Search to execute. For performance reasons, there are limits to how long queries can be executed. In rare situations when a timeout is reached, Search collects and returns matches that were already found, and the execution of the query is stopped.

That particular query you’re running is hitting this timeout, and as a result — you’re getting small variations in the number of results that were collected before the timeout was hit.

 The Search team is discussing this internally and looking into ways to make such situations less confusing, so your feedback is more than welcome!

The reason why the two numbers provided under “We’ve found X repository results” and “Repositories” are different is caching. The number under “Repositories” is cached so it doesn’t change with page refreshes, unlike the “We’ve found X repository results” number. That’s definitely confusing, and I’ll open an internal issue to get that fixed.

Hope this explains things! If you have any other questions — please let us know.

Cheers

What a phenomenal answer!

GitHub Support not only responded within days but addressed each question in a way that did not pass it off with a canned auto-reply or diminish it as “some silly question” wasting a customer support engineer’s precious time. The response even came with a swaddling of exclamation points (“!”), welcoming the input as helpful feedback to be acted on for an improved user experience.

I continued my conversation with the GitHub Support person with questions regarding a second-party data source. Likewise a timely, thorough, and – best of all – human response followed.

Thank you, GitHub!

Who Am I?

From the east coast to the midwest to the west coast, I have hopped around a bit before finally digging my feet into Palo Alto, CA ~ the quaint yet pompous town next to Stanford University and plop center of Silicon Valley. While I have a B.A. in Mathematics & Statistics and an M.S. in Statistics, Measurement, Assessment & Research Technology (yes, it’s as unwieldy as it sounds, and the extra three content areas were an administrator’s attempt at a catchy acronym), I am an outsider here in the land of behemoth tech superpowers and small-yet-$$,$$$,$$$-backed startups. I am not a programming polyglot. I am not fresh out of (or dropping out of) an Ivy League. And I am not hustling my next BIG startup idea. I am a mere mortal among a crowd of intimidatingly clever, indefatigably resourceful polymath kids.

So more of who I am and less of who I am not.

  • I love wrangling messy data. (I do it for fun & as a living: 80% of my every day.)
  • I love writing: the scribbling-down-stream-of-thoughts kind. (I do it for fun.)
  • And, I am dipping my toes in the tech pool by picking up a couple of new programming languages.

This is my first stab at melding my two loves (& my side-pursuit) and the first time sharing my journey with someone (something) other than my email account.

So here it goes …

[Photo]

(And, no, I am not a rosy-cheeked Buddha man … but I do have pinchable cheeks.)