
DATA: data processing steps

Preston Parry edited this page Mar 1, 2015 · 4 revisions
  1. Aggregate to a count of active repos by user and language using the query below:

```sql
SELECT COUNT(DISTINCT(repository_url)) AS activeReposByLang,
       repository_language, actor_attributes_login, actor_attributes_location
FROM [githubarchive:month.201311]
WHERE ( type = 'PushEvent' OR type = 'ForkEvent' OR type = 'PullRequestEvent'
        OR (type = 'IssuesEvent' AND (payload_action = 'opened' OR payload_action = 'reopened'))
        OR type = 'WatchEvent' OR type = 'MemberEvent' OR type = 'PublicEvent'
        OR type = 'ReleaseEvent' )
  AND repository_language != '' AND repository_url != ''
GROUP EACH BY actor_attributes_login, repository_language, actor_attributes_location;
```

Notes:

  A. `GROUP EACH BY` is a BigQuery convention for when you have a lot of rows you're grouping by.

  B. I expanded the number of events we're tracking. This did not substantially increase the number of rows in the dataset, signaling that while it may have increased the number of active repos for a given user and language, it was unlikely to have added new user/language combinations to the data set.

  C. The result of this is what was GroupedUsers2 in my initial MySQL results. This moves one more of the aggregation steps onto BigQuery, which is clearly a lot faster and better able to store results than MySQL.
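The shape of this aggregation can be sketched locally with SQLite (the table and column names mirror the BigQuery query, the event rows are made up, and the event list is abbreviated):

```python
import sqlite3

# In-memory stand-in for the githubarchive events table (toy data).
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE events (
    type TEXT, payload_action TEXT,
    repository_url TEXT, repository_language TEXT,
    actor_attributes_login TEXT, actor_attributes_location TEXT)""")
rows = [
    ("PushEvent",   None,     "repo/a", "Python", "alice", "US"),
    ("PushEvent",   None,     "repo/b", "Python", "alice", "US"),
    ("IssuesEvent", "opened", "repo/a", "Python", "alice", "US"),
    ("IssuesEvent", "closed", "repo/c", "Ruby",   "bob",   "DE"),  # filtered out
    ("WatchEvent",  None,     "repo/c", "Ruby",   "bob",   "DE"),
]
conn.executemany("INSERT INTO events VALUES (?,?,?,?,?,?)", rows)

# Same event filter + grouping as the BigQuery query.
result = conn.execute("""
    SELECT COUNT(DISTINCT repository_url) AS activeReposByLang,
           repository_language, actor_attributes_login, actor_attributes_location
    FROM events
    WHERE (type IN ('PushEvent','ForkEvent','PullRequestEvent','WatchEvent',
                    'MemberEvent','PublicEvent','ReleaseEvent')
           OR (type = 'IssuesEvent' AND payload_action IN ('opened','reopened')))
      AND repository_language != '' AND repository_url != ''
    GROUP BY actor_attributes_login, repository_language, actor_attributes_location
""").fetchall()
print(result)
```

The closed IssuesEvent is filtered out, so bob's Ruby count comes only from the WatchEvent; alice's two distinct Python repos collapse to a count of 2.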

  1. Save the MySQL placesWithGeo data to a file to upload to BigQuery:

```sql
SELECT * FROM placesWithGeo
INTO OUTFILE '~/ghLocal/thesis-project/db/placesWithGeoExportForBigQuery.csv'
FIELDS TERMINATED BY ',' ENCLOSED BY '"' LINES TERMINATED BY '\n';
```

Notes: this does not export the headers. We need to type those in manually.
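Since `INTO OUTFILE` omits the header row, it has to be prepended before uploading. A small sketch of doing that programmatically (the file names and column list here are stand-ins for the real export):

```python
import csv, os, tempfile

def prepend_header(csv_path, header, out_path):
    """Write a copy of a headerless CSV export with a header row prepended."""
    with open(csv_path, newline="") as src, open(out_path, "w", newline="") as dst:
        writer = csv.writer(dst, quoting=csv.QUOTE_ALL)
        writer.writerow(header)
        for row in csv.reader(src):
            writer.writerow(row)

# Example with a throwaway file standing in for the MySQL export.
tmp = tempfile.mkdtemp()
src = os.path.join(tmp, "placesWithGeoExportForBigQuery.csv")
with open(src, "w", newline="") as f:
    f.write('"Berlin, Germany","DE"\n')
dst = os.path.join(tmp, "withHeader.csv")
prepend_header(src, ["user_location", "countryCode"], dst)
print(open(dst).read().splitlines()[0])  # header row comes first
```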

  1. Upload the placesWithGeo data to BigQuery. (We did not end up using this step.) Notes:

  A. Do not try to export the point column.

  B. Inspect the data before uploading: there's a stray newline character in the first row.

  C. BigQuery will spit out a number of errors where the data isn't formatted according to expectations; just fix each one as it pops up.

  D. Do not include headers.
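A quick, generic way to spot and strip the stray newline characters mentioned in note B before uploading (a sketch, not the exact cleanup used at the time):

```python
def find_suspect_rows(rows):
    """Return indices of rows containing embedded newline or carriage-return characters."""
    return [i for i, row in enumerate(rows)
            if any("\n" in field or "\r" in field for field in row)]

def clean_row(row):
    """Strip embedded newlines and carriage returns from every field."""
    return [field.replace("\r", "").replace("\n", "") for field in row]

# Toy data: the first row has the kind of stray newline BigQuery chokes on.
rows = [["US\n", "San Francisco"], ["DE", "Berlin"]]
print(find_suspect_rows(rows))  # [0]
print(clean_row(rows[0]))       # ['US', 'San Francisco']
```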

  2. Export tables from BigQuery to Google Cloud Storage. Notes: the filename you type into the box must end with .csv; selecting the CSV file type in the dropdown above the field is not enough.

  3. Download from Cloud Storage

  4. Create tables in the MySQL database. SQL statement to create the table:

```sql
CREATE TABLE raw_data_import (
  activeReposByLang INT(10),
  repository_language VARCHAR(100),
  actor_attributes_login VARCHAR(255),
  actor_attributes_location VARCHAR(255),
  ID INT(10) NOT NULL AUTO_INCREMENT,
  PRIMARY KEY (ID)
);
```

Command-line command to create the table from the SQL statement file:

```shell
mysql -u root dbname < path/to/schema.sql
```

  1. Load data from the files into the new tables:

```sql
-- this loads our user-aggregated data from BigQuery into raw_data_import
LOAD DATA INFILE '~/ghLocal/thesis-project/db2/201311_all_data_grouped_to_user_by_lang.csv'
INTO TABLE raw_data_import
CHARACTER SET latin1
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n';
```

Notes: MAKE SURE sql_mode is set to blank:

```sql
SELECT @@GLOBAL.sql_mode;
SELECT @@SESSION.sql_mode;
```

If the mode is not blank:

```sql
SET GLOBAL sql_mode = '';
SET SESSION sql_mode = '';
```

  1. Copy over the existing placesWithGeo table:

```sql
CREATE TABLE ghdb2.placesWithGeo SELECT * FROM ghdb.placesWithGeo;
```

  2. Add countryCode to the users table:

```sql
CREATE TABLE 13UsersWithCountries
SELECT MAX(activeReposByLang) AS activeReposByLang,
       repository_language, actor_attributes_login, actor_attributes_location, countryCode
FROM 13raw_data_import AS users
LEFT OUTER JOIN placesWithGeo AS places
  ON users.actor_attributes_location = places.user_location
GROUP BY actor_attributes_login, repository_language;
```

Notes:

  A. In a LEFT OUTER JOIN, SQL will duplicate a row if it matches multiple rows in the right table.

  B. Our dataset should avoid this (the places table is technically unique), but it is, in fact, duplicating rows.

  C. We might be able to reduce this occurrence (and reduce query time) by going back to placesWithGeo and selecting only distinct rows from it.

  D. What we did here was to group by actor_attributes_login (userID) and then take the MAX of activeReposByLang (which should really be the only value in a user's activeReposByLang).

  E. This query takes a long time to run: anywhere from 45 minutes to two hours. There are a number of ways we could optimize this in the future:

    a. Select distinct rows from places (though it should already be distinct, MySQL for some reason does not think it is).
    b. Run this join in Google BigQuery.
    c. Try running this join against a smaller subset of the 1000 most popular places, then run the full join for the users that fails on.
    d. Remove all users with a NULL location from this computation.

  F. Be sure to rename aggregated columns. If you don't rename the column you're pulling for MAX(activeReposByLang), it becomes challenging to select that column in future queries. Possible, but needlessly challenging when you can just type in the "AS activeReposByLang" part in a second or two.
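The duplication described in notes A-D can be reproduced in miniature with SQLite (toy data; the MAX + GROUP BY is what collapses the duplicated rows):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (activeReposByLang INT, repository_language TEXT,
                    actor_attributes_login TEXT, actor_attributes_location TEXT);
CREATE TABLE places (user_location TEXT, countryCode TEXT);
INSERT INTO users VALUES (3, 'Python', 'alice', 'Berlin');
-- Duplicate location row in the right table -> duplicated join output.
INSERT INTO places VALUES ('Berlin', 'DE'), ('Berlin', 'DE');
""")

# Plain LEFT JOIN: alice appears twice.
raw = conn.execute("""
    SELECT * FROM users
    LEFT OUTER JOIN places ON users.actor_attributes_location = places.user_location
""").fetchall()

# Grouping by user/language and taking MAX collapses the duplicates.
deduped = conn.execute("""
    SELECT MAX(activeReposByLang) AS activeReposByLang, repository_language,
           actor_attributes_login, actor_attributes_location, countryCode
    FROM users
    LEFT OUTER JOIN places ON users.actor_attributes_location = places.user_location
    GROUP BY actor_attributes_login, repository_language
""").fetchall()
print(len(raw), len(deduped))  # 2 1
```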

  1. Aggregate from user to country level:

```sql
CREATE TABLE countries
SELECT replace(replace(countryCode, char(10), ''), char(13), '') AS countryCode,
       SUM(activeReposByLang) AS activeRepos,
       COUNT(actor_attributes_login) AS activeProgrammers,
       repository_language
FROM 13users
GROUP BY countryCode, repository_language;
```

Notes:

  A. This is super quick to run once you've got the countryCode in the users table.

  B. There is some kind of error in the placesWithGeo table that has both 'US' and 'US' with a trailing newline character as options. This clearly prevents us from aggregating correctly, so that's why we're doing the SELECT replace() stuff above: char(10) and char(13) are the newline and carriage-return characters.

  C. This is now one of our complete datasets!
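The replace(char(10))/replace(char(13)) trick from note B can be checked in SQLite with toy rows (one clean 'US' and one with a trailing newline):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users13 (countryCode TEXT, activeReposByLang INT,
                      actor_attributes_login TEXT, repository_language TEXT);
-- One clean 'US' row and one 'US' with a trailing newline, as in note B.
INSERT INTO users13 VALUES ('US', 3, 'alice', 'Python');
INSERT INTO users13 VALUES ('US' || char(10), 2, 'bob', 'Python');
""")

# Stripping char(10)/char(13) makes both rows fall into the same group.
rows = conn.execute("""
    SELECT replace(replace(countryCode, char(10), ''), char(13), '') AS countryCode,
           SUM(activeReposByLang) AS activeRepos,
           COUNT(actor_attributes_login) AS activeProgrammers,
           repository_language
    FROM users13
    GROUP BY replace(replace(countryCode, char(10), ''), char(13), ''),
             repository_language
""").fetchall()
print(rows)  # [('US', 5, 2, 'Python')]
```

(The GROUP BY repeats the replace() expression rather than using the alias, to stay unambiguous in SQLite; MySQL resolves the alias there.)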

  2. To aggregate to yoyGrowth:

```sql
CREATE TABLE yearlyGrowth
SELECT countryCode, COUNT(DISTINCT(actor_attributes_login)) AS activeProgrammers
FROM 14users GROUP BY countryCode;
```

```sql
CREATE TABLE yearlyGrowth2
SELECT countryCode, COUNT(DISTINCT(actor_attributes_login)) AS activeProgrammers
FROM 13users GROUP BY countryCode;
```

```sql
CREATE TABLE yoyGrowth
SELECT yearlyGrowth.countryCode,
       yearlyGrowth.activeProgrammers AS programmers2014,
       yearlyGrowth2.activeProgrammers AS programmers2013
FROM yearlyGrowth
INNER JOIN yearlyGrowth2 ON yearlyGrowth.countryCode = yearlyGrowth2.countryCode;
```

This produced 186 rows.
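The year-over-year join can be sketched with SQLite (toy counts; table names follow the ones above, joining the two yearly tables on countryCode):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE yearlyGrowth  (countryCode TEXT, activeProgrammers INT);  -- 2014
CREATE TABLE yearlyGrowth2 (countryCode TEXT, activeProgrammers INT);  -- 2013
INSERT INTO yearlyGrowth  VALUES ('US', 120), ('DE', 45);
INSERT INTO yearlyGrowth2 VALUES ('US', 100), ('DE', 50);
""")

# One row per country present in both years, with both counts side by side.
rows = conn.execute("""
    SELECT yearlyGrowth.countryCode,
           yearlyGrowth.activeProgrammers AS programmers2014,
           yearlyGrowth2.activeProgrammers AS programmers2013
    FROM yearlyGrowth
    INNER JOIN yearlyGrowth2 ON yearlyGrowth.countryCode = yearlyGrowth2.countryCode
""").fetchall()
print(sorted(rows))  # [('DE', 45, 50), ('US', 120, 100)]
```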

  1. Aggregate to languages:

```sql
CREATE TABLE languages
SELECT repository_language AS language,
       COUNT(DISTINCT(actor_attributes_login)) AS activeProgrammers
FROM 14users GROUP BY language ORDER BY activeProgrammers;
```

  2. Grab the top developers in each language for each country: /dataProcessing/topDevsByCountry

  3. Set up GitHub OAuth.

  4. Grab the avatar URLs from GitHub.
