Git and GitHub in a data analysis class


As I’ve written about in other posts (here, here, and here), this semester I taught a course called Advanced Data Analysis for the Social Sciences, which is the second course in our department’s required sequence for Ph.D. students. Sociology departments around the US all have a pretty similar required sequence. In teaching the course this time, I tried to modernize it so that it would train students for the future, not just the present or the past.

Because so much of actually doing data analysis requires writing code, I wanted to teach my students some modern software engineering practices. This is not because I wanted to make them software engineers. Rather, I wanted to empower them to be creative social scientists, and writing clean, reliable, reusable code really helps with that.

So, this semester, I required all the students in my class to use Git and GitHub. I was a bit hesitant to do it because Git is notoriously confusing and I didn’t even know how to use it myself. But, it all worked out pretty well, and I would recommend it to others. In this post, I’ll describe what we did and how it worked.

Why are you using Git and GitHub in your class? And what are they anyway?

Social scientists doing data analysis write (and debug) lots and lots of code. Therefore, social scientists have lots and lots to learn from people who specialize in writing code: software engineers. And, there is one tool that I see all software engineers use for serious projects: version control. Every software engineer, every time. In fact, when I talk to software engineers about my own work, they are kind of surprised that I haven’t been using version control for my research code (although that is going to change because of this class).

Essentially, version control allows you to keep track of every change that you’ve ever made to your code. And, for a more in-depth explanation of version control and how it can facilitate social science research, I’d recommend this excellent piece by Gentzkow and Shapiro: Code and Data for the Social Sciences: A Practitioner’s Guide.

There are many different version control systems that you can use, and one of the best ones seems to be Git. And, one big advantage of Git is that it works well with GitHub. Unfortunately, Git is notorious for being confusing. This tweet pretty much sums it up (it’s a joke, sort of):

But, despite some very real pain at the beginning of the semester, all of my students were using Git and GitHub successfully for real research by the end of the semester.

How did you use Git and GitHub in your class?

The students used GitHub, and therefore Git, for two things: their homework and their replication project. For homework, a big advantage of GitHub is that it makes it easy for many people (classmates, teaching assistants, and professors) to give comments simultaneously on specific parts of the work students turn in. And, using GitHub for homework enables linting, an automated process we used to give students feedback on their code.
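To give a flavor of what a linter does, here is a minimal sketch in Python. It is not the tool we used in class, and the two checks (line length and trailing whitespace) and the function name `lint_lines` are assumptions chosen purely for illustration. A real linter applies many more rules, but the idea is the same: read the code, flag style problems, and report them by line number.

```python
def lint_lines(source: str, max_length: int = 80):
    """Return a list of (line_number, message) style warnings.

    A toy illustration of linting, not a real linter: it only checks
    for overly long lines and trailing whitespace.
    """
    warnings = []
    for number, line in enumerate(source.splitlines(), start=1):
        # Flag lines that exceed the (assumed) maximum length.
        if len(line) > max_length:
            warnings.append(
                (number, f"line is longer than {max_length} characters")
            )
        # Flag spaces or tabs left at the end of a line.
        if line != line.rstrip():
            warnings.append((number, "trailing whitespace"))
    return warnings


if __name__ == "__main__":
    # Hypothetical student code with a trailing space on the first line.
    student_code = "x <- c(1, 2, 3) \nmean(x)\n"
    for number, message in lint_lines(student_code):
        print(f"line {number}: {message}")
```

Because the feedback is automatic, students can get it on every submission without anyone having to read the code by hand first.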

Based on the helpful GitHub Education classroom guide, I created this set of instructions for my students. I would post homework templates in RMarkdown (e.g., homework 1) on our class GitHub page and then the students would just follow these steps.

[Figure: homework workflow]

It looks a bit intimidating, but everyone got the hang of it pretty quickly. Really, by the end of the semester, everyone was using GitHub for homework with no problems.

The students also used GitHub to collaborate and receive feedback on their replication and extension projects (you can see that stuff here). This also went pretty smoothly, with one exception described below.

What were the hard parts?

The hardest part conceptually for the students was getting used to the idea of version control. Git is just strange when you encounter it the first time so we spent a whole class on it. The hardest part technically was enabling RStudio and GitHub to talk to each other. Once this connection was made, everything went reasonably smoothly because RStudio has good support for version control. I would estimate that making the first push to GitHub took students several hours. It also took on average about one hour of instructor time per student. To be clear, about half the class got it running on their own, but the other half required one-on-one help, and I can’t see how to make the process any smoother because all the students wanted to use their own laptops (and every laptop was special in its own way).

The hardest parts for me were learning Git myself, becoming familiar with GitHub, and setting up GitHub so that we could comment on the students’ replication projects. This last issue is not particularly interesting, but it was a problem we had so I’d like to explain it briefly and describe how we solved it in order to save you and your students some pain. GitHub offers good infrastructure to comment on pull requests (students submitted their homeworks as pull requests), but it does not offer good infrastructure to comment on repositories (students worked on the replication projects in a repository). Therefore, long story short, our solution was to turn a repository into a pull request.
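For readers who want to try something similar, here is a sketch of one way a whole repository can be made to show up as a pull request. It is an assumption about how this can be done, not a record of the exact steps we used. The idea is to create an empty branch, create a second branch on top of it that merges in the project, and then open a pull request from the second branch into the empty one, so that every file appears as an addition that can be commented on. The branch names (`master`, `review-base`, `review`) and the helper names are hypothetical.

```python
import subprocess


def review_branch_commands(main_branch="master",
                           empty_branch="review-base",
                           review_branch="review"):
    """Build the Git commands for the 'empty base branch' approach.

    All branch names are illustrative defaults, not taken from the post.
    """
    return [
        # Start a new branch with no history.
        ["git", "checkout", "--orphan", empty_branch],
        # Remove every tracked file from the index and working tree.
        ["git", "rm", "-rf", "."],
        # Record an empty commit to serve as the base of the pull request.
        ["git", "commit", "--allow-empty", "-m", "Empty base for code review"],
        # Branch off the empty commit and bring in the whole project.
        ["git", "checkout", "-b", review_branch],
        ["git", "merge", "--allow-unrelated-histories",
         "-m", "Add project files for review", main_branch],
        # Publish both branches so a pull request can be opened on GitHub.
        ["git", "push", "origin", empty_branch, review_branch],
    ]


def run_commands(commands, repo_dir):
    """Run each command inside repo_dir, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, cwd=repo_dir, check=True)


if __name__ == "__main__":
    # Print the commands rather than running them, so nothing is changed.
    for command in review_branch_commands():
        print(" ".join(command))
```

After the branches are pushed, the pull request itself would be opened on GitHub with `review-base` as the base and `review` as the branch to compare. It is worth trying this on a throwaway copy of a repository first.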

How would you do this next time?

I plan to use Git and GitHub next time I teach this course. However, there are two things that I would do differently. First, I would try to move this stuff into the first semester rather than the second semester. Learning Git and GitHub requires an investment, and I’d like the students to have as much time as possible to reap the benefits of that investment (of course, since they are in their first year of grad school they will have the rest of their career to reap the rewards). Second, I would have a clearer position on whether work for this class is open to the world or open to just our class. As is, some of what we did is visible to the world (if you know how to look) and some is just visible to our class (we used GitHub’s organization feature to make this possible). I’m fine with my own stuff being open, but I can see that some students might not want their homework or projects visible to everyone in the world. On the other hand, having students work in the open might get them more comfortable with a process of open and reproducible research.

Also, there is one thing that I’d definitely do the same: work with a professional to help with the trickier issues of Git and GitHub. This semester I worked with Luke Baker from the Agathon Group, and he’s great. He helped get things set up at the beginning of the semester; he wrote a script to strip my solutions out of the homework template; he got lintR running so that students got automatic feedback on their code style; and he also solved the tricky problem of turning repositories into pull requests. Neither I nor my teaching assistant had the background to solve these problems quickly. All in all, I think Luke worked about 15 hours spread over the course of the semester. Were we to do it again, I think it would take him just 5 hours. Also, all of this work was remote and we didn’t run into any logistical challenges.
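To make the solution-stripping idea concrete, here is a minimal sketch in Python. It is not Luke’s script, and the post does not say how solutions were marked. The sketch assumes that each solution sits between two hypothetical marker lines, `# BEGIN SOLUTION` and `# END SOLUTION`, and replaces each marked region with a placeholder.

```python
# Hypothetical marker lines; the actual convention used in class may differ.
BEGIN_MARKER = "# BEGIN SOLUTION"
END_MARKER = "# END SOLUTION"


def strip_solutions(text: str, placeholder: str = "# YOUR CODE HERE") -> str:
    """Replace every marked solution region with a single placeholder line."""
    kept_lines = []
    inside_solution = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == BEGIN_MARKER:
            if inside_solution:
                raise ValueError("nested BEGIN marker")
            inside_solution = True
            kept_lines.append(placeholder)
        elif stripped == END_MARKER:
            if not inside_solution:
                raise ValueError("END marker without BEGIN")
            inside_solution = False
        elif not inside_solution:
            # Keep everything outside the solution regions unchanged.
            kept_lines.append(line)
    if inside_solution:
        raise ValueError("BEGIN marker without END")
    return "\n".join(kept_lines) + "\n"


if __name__ == "__main__":
    # Hypothetical instructor version of a homework question.
    instructor_version = (
        "x <- c(1, 2, 3)\n"
        "# BEGIN SOLUTION\n"
        "mean(x)\n"
        "# END SOLUTION\n"
    )
    print(strip_solutions(instructor_version), end="")
```

Keeping the solutions and the template in one file, and generating the student version automatically, avoids the two copies drifting apart.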

I guess the biggest thing that I’ll be watching going forward is whether these students continue to use Git and GitHub in their research.
