What are the Trends for the Top 10 GitHub Languages?

June 25, 2014 / July 2, 2014 / codemestat

The graphs above display the annual top 10 GitHub languages over time in terms of repos and projects (data source: https://github.com/search/advanced). One important note: the graphs do not show the full history of a language’s presence on GitHub from inception to demise.

For example, a drop-out in the graph does not necessarily mean that no one is pushing repos or projects in that language anymore; it may simply mean that the language is no longer ranked in the top 10 for that year. Of course, the two cases cannot be explicitly teased apart from the graphs alone. Maybe the language has completely vanished! However, we can assume quite confidently that a top 10 language didn’t just suddenly die off. Conversely, a language that pops up is not necessarily a hot “new” language; it simply may not have been ranked in the top 10 before.

Example: Shell is not observed in 2013 because it was replaced in the ranks by CSS, which in turn existed before 2013 but was not in the top 10 (and hence is not plotted in the graph for those years).
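To make the drop-out caveat concrete, here is a minimal Python sketch. The yearly counts are made up for illustration (they are not the GitHub numbers behind the graphs), `top_n_by_year` is a hypothetical helper, and the cutoff is shrunk to a top 2 to keep the example small.

```python
from typing import Dict, List

# Hypothetical repo counts per year and language (illustrative only,
# not real GitHub data).
counts: Dict[int, Dict[str, int]] = {
    2012: {"JavaScript": 900, "Shell": 300, "CSS": 250},
    2013: {"JavaScript": 1500, "Shell": 400, "CSS": 600},
}


def top_n_by_year(data: Dict[int, Dict[str, int]], n: int) -> Dict[int, List[str]]:
    """Return the n highest-count languages for each year, largest first."""
    return {
        year: sorted(langs, key=langs.get, reverse=True)[:n]
        for year, langs in data.items()
    }


top2 = top_n_by_year(counts, n=2)

# Languages ranked in 2012 that are missing from the 2013 ranking.
dropped = [lang for lang in top2[2012] if lang not in top2[2013]]
for lang in dropped:
    # The count can keep growing even though the language left the ranking.
    print(lang, counts[2012][lang], "->", counts[2013][lang])
```

In this toy data Shell leaves the ranking in 2013 even though its count grew, which is the situation the graphs cannot distinguish from a language that really disappeared.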

The fun of taking simple numbers such as these and visually plotting the trends is to see if we can tell a story. So, what can we make of these graphs? How do the patterns jibe with the known trends in programming languages?

A few starters:

GitHub was started by Ruby programmers.
Growth of Java from 2011.
- “A significant trend seen in 2011 was the return to Java by several prominent projects. Twitter, for example, joined the Java Community Process, after earlier moving their search architecture from Ruby on Rails to Java/Lucene. Another recent example has been Yammer moving part of their offering from Scala to Java. Other informative posts that provide evidence of resurgent interest in Java include Edd Dumbill’s O’Reilly Radar posts in advance of OSCON Java 2011. Oracle Technology Network’s Our Most Popular Tech Articles of 2011 is dominated by Java-related articles.”
Growth of JavaScript from 2011.
- “2011 was a huge year for JavaScript. First, my citing of Dart, CoffeeScript, and Node.js as “honorable mention” developments (later in this post) and my citing of the year’s biggest winner as HTML5 are evidence in and of themselves of the influence of JavaScript in 2011. Oracle announced at JavaOne 2011 their intention to provide a new server-side JavaScript implementation (Project Nashorn) to illustrate and test non-Java language JVM support and for a high-quality server-side JavaScript implementation that runs on the JVM. jQuery’s success (and its own 2011 growth) is also another example illustrating the rising prominence of JavaScript.”

Playing with GitHub Data: The Start (Part 2)

May 28, 2014 / July 2, 2014 / codemestat

To follow up on the questions raised in “Playing with GitHub Data: The Start (Part 1)”, I decided to ask the Man. The man being Ilya Grigorik: the man behind GitHub Archive.

Surely he would be able to provide insight into the differences observed between the GitHub Search counts and the GitHub Archive query counts.

I reached out to Ilya.

He responded.

So let it be known. Keep it in mind when playing with GitHub Archive via BigQuery.

Playing with GitHub Data: The Start (Part 1)

May 5, 2014 / July 2, 2014 / codemestat

Since March/April, I have been sitting on my hands not quite ready to hit the “Publish” button to push my words off onto the instantaneous online press. But as time passes by, I am losing the details of the what, why, how of my GitHub data explorations and going back through my GitHub history to identify my work processes requires more and more mental aerobics. Hence, while I am still in a state of partial clarity, I will do my best to squeeze out my months-past thoughts during an afternoon writing spurt.

* * * Here I go! * * *

Wanting to have fun exploring GitHub data, I headed first to the playground: GitHub’s API. It had all that I could want for my preliminary explorations. I could grab data on repository languages, pushes, forks, stars, etc. You name it. Problem was it was like standing at the base of a bouldering wall, positioning yourself in your first hanging form, and then not knowing how to reach the next boulder on the colored route. My technique was not quite even a V0 and my arms were too weak to compensate by just pulling myself up through sheer muscle.

That’s when I found GitHub Archive. GitHub Archive is a “project to record the public GitHub timeline, archive it, and make it easily accessible for further analysis.” With the data hosted on Google BigQuery, the GitHub event data was readily available to be queried using SQL.

SQL! SELECT statements! I can do that!

I decided to go ahead and first explore GitHub’s data via GitHub Archive. The API would have to wait.

I started out with two basic questions: What is the growth trend of total GitHub repositories? What is the growth trend of total GitHub projects, i.e., repositories excluding forks?

To answer these questions, I parsed the GitHub Archive data in three ways:

GitHub Archive (all events)
GitHub Archive: CreateEvent
GitHub Archive: PushEvent

Side Track: CreateEvent vs. PushEvent

The reason I looked at both CreateEvent and PushEvent was to investigate the peculiarities that emerged when I ran my initial query for the number of repositories by programming language and creation date, restricted to CreateEvent only. Looking for whether anyone else had run a similar analysis against which I could compare my results, I stumbled upon Adam Bard’s blog post “Top GitHub Languages for 2013 (so far)”.

Here was his query.

This query was similar to mine, but after exploring the data more closely, I had a couple of issues with this approach. Although they may be difficult to see without looking at the raw data structure, the questions that popped up were:

Wouldn’t counting non-distinct languages, i.e., count(repository_language), result in duplicate counts of unique repositories?
Shouldn’t we be looking at only CreateEvent’s payload_ref_type == “repository” rather than also including branches and tags?
Don’t we have a “null” issue for repository language identifications of CreateEvents?
- over 99% of CreateEvents with payload_ref_type == “repository” have “null” repository language
- over 60% of CreateEvents for all payload_ref_type {repository (>99%), branch (~28%), tag (~57%)} have “null” repository language
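The null percentages above come from grouping events by payload_ref_type and checking how often the language is missing. A minimal Python sketch of that calculation follows; the rows are invented for illustration (they do not reproduce the real percentages), and `null_share_by_ref_type` is a hypothetical helper name.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# (payload_ref_type, repository_language) pairs; made-up rows for illustration.
events: List[Tuple[str, Optional[str]]] = [
    ("repository", None),
    ("repository", None),
    ("branch", "Python"),
    ("branch", None),
    ("branch", "Ruby"),
    ("branch", "Java"),
    ("tag", None),
    ("tag", "Python"),
]


def null_share_by_ref_type(rows: List[Tuple[str, Optional[str]]]) -> Dict[str, float]:
    """Fraction of events with a missing language, per payload_ref_type."""
    totals: Dict[str, int] = defaultdict(int)
    nulls: Dict[str, int] = defaultdict(int)
    for ref_type, language in rows:
        totals[ref_type] += 1
        if language is None:
            nulls[ref_type] += 1
    return {ref_type: nulls[ref_type] / totals[ref_type] for ref_type in totals}


shares = null_share_by_ref_type(events)
for ref_type, share in shares.items():
    print(f"{ref_type}: {share:.0%} null")
```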

Example

Let’s look at Ceasar’s repository “twosheds”.

If we count using count(repository_language), we would have the following total counts by language:

Python: 14
null: 1

Now this seems wrong. Why would we count these records multiple times as unique repositories? Instead, these should fall under one repository count for “twosheds” and be categorized as written in Python. What if we restrict the query to only CreateEvent’s payload_ref_type == “repository” and count as count(distinct(repository_url))? The issue here is that the assigned language is now “null”.
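The difference between the counting strategies can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the original query: the URL is a placeholder, and the 15 rows are constructed only to reproduce the totals above (Python: 14, null: 1), with an invented split between branches and tags.

```python
from collections import Counter

REPO_URL = "https://github.com/example-user/twosheds"  # placeholder URL

# One row per CreateEvent: (repository_url, payload_ref_type, repository_language).
# 15 rows chosen to reproduce the totals above; the branch/tag split is invented.
rows = (
    [(REPO_URL, "repository", None)]
    + [(REPO_URL, "branch", "Python")] * 10
    + [(REPO_URL, "tag", "Python")] * 4
)

# Counting rows per language treats every event as if it were a repository.
per_event = Counter(language for _, _, language in rows)

# Counting distinct URLs per language collapses repeated events for one repo.
urls_by_language = {}
for url, _, language in rows:
    urls_by_language.setdefault(language, set()).add(url)
per_repo = {language: len(urls) for language, urls in urls_by_language.items()}

# Restricting to ref_type "repository" gives one repo, but its language is null.
repo_only = {
    (url, language) for url, ref_type, language in rows if ref_type == "repository"
}

print(per_event)
print(per_repo)
print(repo_only)
```

Note that even the distinct count still lists the repository once under Python and once under null when all ref types are included, so the language assignment has to be resolved separately.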

Sure, we could think of ways around this — for example, take the maximum non-missing language identification for each unique CreateEvent’s repository URL, regardless of the payload reference type. However, given the prevalence of “null” language identifications for CreateEvent, why not try PushEvent instead? We can check if PushEvent has similar missingness with its repository language identifications. If not, under the assumption that active repositories are pushed to, we can count the distinct repository URLs of PushEvents to identify unique repositories by creation date and programming language.
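The first workaround mentioned above (take the maximum non-missing language per repository URL) could look roughly like this in Python. It is a sketch with hypothetical names and made-up rows; "maximum" here simply means the largest string among the non-null labels.

```python
from typing import Dict, Iterable, Optional, Tuple


def language_per_repo(
    rows: Iterable[Tuple[str, Optional[str]]]
) -> Dict[str, Optional[str]]:
    """Map each repository URL to its maximum non-null language label.

    A repository keeps None only if every one of its events has a null language.
    """
    result: Dict[str, Optional[str]] = {}
    for url, language in rows:
        best = result.get(url)
        if language is not None and (best is None or language > best):
            best = language
        result[url] = best
    return result


# (repository_url, repository_language); placeholder URLs, invented rows.
rows = [
    ("repo/a", None),       # e.g. the ref_type "repository" CreateEvent
    ("repo/a", "Python"),   # later branch or tag events carry the language
    ("repo/a", "Python"),
    ("repo/b", None),       # never gets a language label
]

resolved = language_per_repo(rows)
print(resolved)
```

Repositories such as repo/b, whose events are all null, stay unclassified under this approach, which is part of why a less sparse event type is attractive.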

Indeed, only ~12% of PushEvents have “null” repository language identifications. PushEvent seems to be a good alternative!

Back on Track & Back to Basics

Let’s go back to our initial questions: What is the growth trend of total GitHub repositories? What is the growth trend of total GitHub projects, i.e., repositories excluding forks? I computed the yearly totals using count(distinct(repository_url)) to count the total number of unique repositories by creation date (year).
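As a rough Python equivalent of the count(distinct(repository_url)) by creation year described above, consider the sketch below. The rows and the tuple layout (URL, creation year, fork flag) are assumptions made for illustration, not the actual GitHub Archive schema.

```python
from collections import defaultdict

# (repository_url, creation_year, is_fork): made-up event rows. A repository
# appears once per event, so the same URL can repeat.
events = [
    ("repo/a", 2012, False),
    ("repo/a", 2012, False),
    ("repo/b", 2012, True),
    ("repo/c", 2013, False),
    ("repo/d", 2013, True),
    ("repo/d", 2013, True),
]


def yearly_distinct(rows, include_forks=True):
    """Count distinct repository URLs per creation year."""
    urls_by_year = defaultdict(set)
    for url, year, is_fork in rows:
        if include_forks or not is_fork:
            urls_by_year[year].add(url)
    return {year: len(urls) for year, urls in sorted(urls_by_year.items())}


repositories = yearly_distinct(events)                   # forks included
projects = yearly_distinct(events, include_forks=False)  # forks excluded
print(repositories)
print(projects)
```

Using a set per year is what makes repeated events for the same repository count only once.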

As a litmus test of sorts, I compared these yearly counts to those from GitHub Search.

What do we notice here?

Assuming GitHub Search as the barometer of expectation:

prior to 2012, GitHub Archive (all, CreateEvent, PushEvent) seems to produce underestimates of total repositories and projects
repositories: Search ~ GitHub Archive (all)
projects: Search ~ PushEvent

Now, why do we observe these differences?

I mulled over the observed differences, raising my stack of questions higher. What is going on here? How will it affect my analyses?

For now, I will leave it here and continue next time with more on the GitHub analysis I ended up pursuing.

As I say adieu, I ask: What interesting or curious things do you notice here so far?

Figuring out GitHub Search

April 6, 2014 / July 2, 2014 / codemestat

I have been wanting to play around with GitHub data as a side project so when I first started poking around for what GitHub meta & event data I could grab, not yet being an API ninja, I went to the simplest point-click-Enter-Return shop in town: GitHub Search.

But when I was there, I noticed something strange. Each time I would refresh a search using the same parameters (e.g., “created:<2013-12-31 fork:true”), the results would produce different numbers. Moreover, conflicting counts of total repositories were displayed in the same search results: under “Repositories” and in “We’ve found [] repository results”.

What was happening? I asked my go-to guy for all things tech (the technical stuff beyond the default “just try restarting your computer”). He gave me a “Hmmmm that’s strange. Something to do with caching?”. Not quite convinced, I decided to seek a more authoritative explanation from GitHub.

I wasn’t waiting with bated breath (already having generated low expectations from a lifetime of mediocre customer service, unfair as it is to lump many tech startups/companies together with bureaucratic behemoths like “Comcrap”, as my roommate amusingly nicknames it).

Less than two days later I received a response from GitHub Support.

Thanks for getting in touch and asking about this!

Some queries are computationally expensive for our Search to execute. For performance reasons, there are limits to how long queries can be executed. In rare situations when a timeout is reached, Search collects and returns matches that were already found, and the execution of the query is stopped.

That particular query you’re running is hitting this timeout, and as a result, you’re getting small variations in the number of results that were collected before the timeout was hit.

The Search team is discussing this internally and looking into ways to make such situations less confusing, so your feedback is more than welcome!

The reason why the two numbers provided under “We’ve found X repository results” and “Repositories” are different is caching. The number under “Repositories” is cached so it doesn’t change with page refreshes, as does the “We’ve found X repository results” number. That’s definitely confusing, and I’ll open an internal issue to get that fixed.

Hope this explains things! If you have any other questions — please let us know.

Cheers

What a phenomenal answer!

GitHub Support not only responded within days but also addressed each question without passing it off with a canned auto-reply or dismissing it as “some silly question” wasting their customer support engineer’s precious time. The response even came with a swaddling of exclamations (“!”), welcoming the input as helpful feedback to be acted on for improved user experience.

I continued my conversation with the GitHub Support person with questions regarding a second-party data source. Likewise, a timely, thorough, and – best of all – human response followed.

Thank you, GitHub!
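The timeout behavior described in the support reply can also be checked programmatically. The searches in this post were run on the GitHub Search web page, so the following is only a sketch under an assumption: that the same query goes through the GitHub REST Search API, whose JSON responses carry a `total_count` and an `incomplete_results` flag that is set when a query times out. The response bodies below are invented, and `read_search_counts` is a hypothetical helper.

```python
import json
from typing import Tuple


def read_search_counts(body: str) -> Tuple[int, bool]:
    """Return (total_count, incomplete) from a Search API JSON response body."""
    data = json.loads(body)
    return data["total_count"], bool(data.get("incomplete_results", False))


# Hypothetical response bodies; the counts are invented for illustration.
complete_body = '{"total_count": 120, "incomplete_results": false, "items": []}'
partial_body = '{"total_count": 97, "incomplete_results": true, "items": []}'

for body in (complete_body, partial_body):
    total, incomplete = read_search_counts(body)
    note = "partial (query timed out)" if incomplete else "complete"
    print(total, note)
```

A count flagged as partial should not be compared directly against counts from other sources, since it reflects only the matches collected before the timeout.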

: code me stat :

@codemestat
