Invisibilia, the Fragile Families Challenge, and Bit by Bit


This week’s episode of Invisibilia featured my research on the Fragile Families Challenge.  The Challenge is a scientific mass collaboration that combines predictive modeling, causal inference, and in-depth interviews to yield insights that can improve the lives of disadvantaged children in the United States. Like many research projects, the Fragile Families Challenge emerged from a complex mix of inspirations.  But, for me personally, a big part of the Fragile Families Challenge grew out of writing my new book Bit by Bit: Social Research in the Digital Age.  In this post, I’ll describe how Bit by Bit helped give birth to the Fragile Families Challenge.

Bit by Bit is about social research in the age of big data.  It is for social scientists who want to do more data science, data scientists who want to do more social science, and anyone interested in the combination of these two fields.  Rather than being organized around specific data sources or machine learning methods, Bit by Bit progresses through four broad research designs: observing behavior, asking questions, running experiments, and creating mass collaboration. Each of these approaches requires a different relationship between researchers and participants, and each enables us to learn different things.

As I was working on Bit by Bit, many people seemed genuinely excited about most of the book . . . except the chapter on mass collaboration. When I talked about this chapter with colleagues and friends, I was often greeted with skepticism (or worse).  Many of them felt that mass collaboration simply had no place in social research. In fact, at my book manuscript workshop—which was made up of people that I deeply respected—the general consensus seemed to be that I should drop this chapter from Bit by Bit.  But I felt strongly that it should be included, in part because it enabled researchers to do new and different kinds of things.  The more time I spent defending the idea of mass collaboration for social research, the more I became convinced that it was really interesting, important, and exciting.  So, once I finished up the manuscript for Bit by Bit, I set my sights on designing the mass collaboration that became the Fragile Families Challenge.

The Fragile Families Challenge, described in more detail at the project website and blog, should be seen as part of the larger landscape of mass collaboration research.  Perhaps the best-known example of a mass collaboration solving a big intellectual problem is Wikipedia, where volunteers created a fantastic encyclopedia that is available to everyone.

Collaboration in research is nothing new, of course. What is new, however, is that the digital age enables collaboration with a much larger and more diverse set of people: the billions of people around the world with Internet access. I expect that these new mass collaborations will yield amazing results not just because of the number of people involved but also because of their diverse skills and perspectives. How can we incorporate everyone with an Internet connection into our research process? What could you do with 100 research assistants? What about 100,000 skilled collaborators?

As I write in Bit by Bit, I think it is helpful to roughly distinguish between three types of mass collaboration projects: human computation, open call, and distributed data collection.  Human computation projects are ideally suited for easy-task-big-scale problems, such as labeling a million images. These are projects that in the past might have been performed by undergraduate research assistants. Contributions to human computation projects don’t require specialized skills, and the final output is typically an average of all of the contributions. A classic example of a human computation project is Galaxy Zoo, where a hundred thousand volunteers helped astronomers classify a million galaxies. Open call projects, on the other hand, are more suited for problems where you are looking for novel answers to clearly formulated questions. In the past, these are projects that might have involved asking colleagues. Contributions to open call projects come from people who may have specialized skills, and the final output is usually the best contribution. A classic example of an open call is the Netflix Prize, where thousands of scientists and hackers worked to develop new algorithms to predict customers’ ratings of movies. Finally, distributed data collection projects are ideally suited for large-scale data collection. These are projects that in the past might have been performed by undergraduate research assistants or survey research companies. Contributions to distributed data collection projects typically come from people who have access to locations that researchers do not, and the final product is a simple collection of the contributions. A classic example of a distributed data collection is eBird, in which hundreds of thousands of volunteers contribute reports about birds they see.



Given this way of organizing things, you can think of the Fragile Families Challenge as an open call project. When designing the Challenge, I drew inspiration from the other open call projects that I wrote about, such as the Netflix Prize, Foldit, and Peer-to-Patent.

If you’d like to learn more about how mass collaboration can be used in social research, I’d recommend reading Chapter 5 of Bit by Bit or watching this talk I gave at Stanford in the Human-Computer Interaction Seminar.  If you’d like to learn more about the Fragile Families Challenge, which is still ongoing, I’d recommend our project website and blog.  Finally, if you are interested in social research in the age of big data, I’d recommend reading all of Bit by Bit: Social Research in the Digital Age.

Good things can come from being open


When I discuss open and reproducible research with graduate students, their minds often drift toward all the bad things that could happen from having their code and data available. It is certainly true that bad things could happen, but my sense is that people often overestimate these kinds of risks and underestimate the benefits of being open. So, in this post I wanted to highlight an example of something good that can happen from being open with data and code: it can raise the visibility of your work and help make it more useful to others.

This fall I was teaching an undergraduate course on data analysis, and I was looking for a compelling, modern example of real research that involved dummy variables. Fortunately, Kevin Munger had done an interesting experiment on harassment on Twitter, and all of the data and code were available on GitHub. So, I downloaded his data and code, tweaked them a bit, and then built my lecture around his study. In case they are helpful to someone else, here are the slides (and here are the slides in R Markdown format).

I’ve never met Kevin Munger, but I wanted to thank him for posting his data and code. It helped me, and it helped my students. This is just one small example of a good thing that can come from being more open.

Bit by Bit is now available for pre-order


I’m very happy to announce that Bit by Bit is now available for pre-order. If you order it right now, you should have it around Thanksgiving (November 23, 2017). Bit by Bit is for social scientists who want to do more data science, data scientists who want to do more social science, and anyone interested in the hybrid of these two fields.

Here are links where you can pre-order the book:
• Amazon:
• Barnes & Noble:
• IndieBound:
• Princeton University Press:

Also, at the end of this post is some information from my publisher about the book, including a 25% off coupon and information about how you can request an exam copy.


Meetup about teaching computational social science at ASA

Please join me for an informal meetup about teaching computational social science Monday, August 14 at 3pm.  We will meet at the Princeton University Press booth in the exhibit hall at ASA.  The purpose of the meetup is for people teaching computational social science—or thinking about teaching it—to share experiences and troubleshoot common problems.  The number and variety of courses on computational social science is growing rapidly, and I think that we can all benefit from hearing about the exciting things that people are doing.  I look forward to seeing you in Montreal.

Making sense of the rlnorm() function in R


Post by Malte Möser and Matthew Salganik

There’s an activity in Bit by Bit: Social Research in the Digital Age that requires generating random draws from a log-normal distribution.  Unfortunately, the rlnorm() function in R doesn’t work exactly how many people expect.  So, we wanted to write a little post about it.  That way, if you are working on the activity—which is about power analysis—you can focus on power analysis and not the rlnorm() function.
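One common source of confusion, which may or may not be the only one this post has in mind, is that the two parameters of rlnorm(), meanlog and sdlog, describe the mean and standard deviation of the logarithm of the draws, not of the draws themselves. Here is a minimal sketch of that idea in Python rather than R, using the standard library's random.lognormvariate(), which takes the same log-scale parameters. The helper names lognormal_params and lognormal_moments, the seed, and the target values (mean 10, standard deviation 5) are our own illustrative assumptions and do not come from the book's activity.

```python
import math
import random
import statistics


def lognormal_params(mean, sd):
    """Convert a desired mean and sd on the original scale
    into the log-scale parameters (meanlog, sdlog)."""
    sdlog = math.sqrt(math.log(1.0 + (sd / mean) ** 2))
    meanlog = math.log(mean) - sdlog ** 2 / 2.0
    return meanlog, sdlog


def lognormal_moments(meanlog, sdlog):
    """Return the mean and sd on the original scale that are
    implied by a given pair of log-scale parameters."""
    mean = math.exp(meanlog + sdlog ** 2 / 2.0)
    sd = mean * math.sqrt(math.exp(sdlog ** 2) - 1.0)
    return mean, sd


# Hypothetical target: draws with mean 10 and sd 5 on the original scale.
rng = random.Random(2017)
meanlog, sdlog = lognormal_params(mean=10.0, sd=5.0)

# lognormvariate(mu, sigma) takes log-scale parameters, like rlnorm() in R.
draws = [rng.lognormvariate(meanlog, sdlog) for _ in range(200_000)]
sample_mean = statistics.fmean(draws)
sample_sd = statistics.stdev(draws)

# What you would get by passing 10 and 5 directly as log-scale parameters.
naive_mean, naive_sd = lognormal_moments(10.0, 5.0)

print(f"log-scale parameters: meanlog={meanlog:.4f}, sdlog={sdlog:.4f}")
print(f"sample mean={sample_mean:.3f}, sample sd={sample_sd:.3f}")
print(f"mean implied by the naive call: {naive_mean:.3e}")
```

Under these assumptions, the corresponding call in R would pass the converted values, as in rlnorm(n, meanlog = meanlog, sdlog = sdlog), rather than the desired mean and standard deviation themselves.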


Announcing the Open Review Toolkit

Originally posted on Freedom to Tinker


I’m happy to announce the release of the Open Review Toolkit, open source software that enables you to convert your book manuscript into a website that can be used for Open Review. During the Open Review process everyone can read and annotate your manuscript, and you can collect valuable data to help launch your book. The goals of the Open Review process are better books, higher sales, and increased access to knowledge. In an earlier post, I described some of the helpful feedback that I’ve received during the Open Review of my book Bit by Bit: Social Research in the Digital Age.  Now, in this post I’ll describe more about the Open Review Toolkit—which has been generously supported by a grant from the Alfred P. Sloan Foundation—and how you can use it for your book.
