Sunday, December 15, 2013

Why Big Data Needs Big Humanities

There's a new book out that uses Google's Ngrams and Wikipedia to discover the historical significance of people, places, and things: Who's Bigger? I have only taken a cursory glance at the web page, but it doesn't take a genius to see that the results look deeply biased, and it's no surprise why.

The two data sets they used, Wikipedia and Google's Ngrams, both skew heavily toward recent, Western content. Wikipedia's authors and editors famously skew young, white, Western, and male. It's no surprise, then, that the results on the web page are obviously tilted toward recent, Western people, places, and things (not uniquely so, to be clear, but the bias is obvious imho).

The most glaring example is the complete absence of Genghis Khan from any of the lists. Khan is undeniably one of the most influential humans ever to have lived. In Destiny Disrupted: A History of the World Through Islamic Eyes, author Tamim Ansary went so far as to call Khan the Islamic world's Hitler. But he died in 1227 and mostly influenced what we in the West call the East.

Another example is the appearance of the two most recent US presidents, George W. Bush and Barack Obama, in the top ten of the site's list of the fifty most influential things in history. Surely this is a pure recency effect. How can that be taken seriously as historical analysis?
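For the curious, here's a minimal sketch of how one could probe this recency effect in the Ngram corpus itself. It hits the unofficial JSON endpoint behind the public Ngram Viewer; the URL, parameters, and corpus identifier are assumptions based on how the viewer behaves, not a documented API, so treat it as illustrative only.

```python
import requests

# Unofficial endpoint behind the Google Books Ngram Viewer.
# Not a documented API; it may change or disappear without notice.
NGRAM_URL = "https://books.google.com/ngrams/json"

def ngram_series(phrase, year_start=1800, year_end=2019):
    """Return yearly relative frequencies for a phrase in the English corpus."""
    resp = requests.get(
        NGRAM_URL,
        params={
            "content": phrase,
            "year_start": year_start,
            "year_end": year_end,
            "corpus": "en-2019",  # corpus identifier varies by release (assumption)
            "smoothing": 0,
        },
        timeout=30,
    )
    resp.raise_for_status()
    data = resp.json()
    return data[0]["timeseries"] if data else []

# Compare when each name peaks in English print. A ranking built on this
# corpus rewards whoever its (recent, Anglophone) authors write about most,
# which is attention, not historical significance.
for name in ("Genghis Khan", "Barack Obama"):
    series = ngram_series(name)
    if not series:
        print(name, "not found in corpus")
        continue
    peak_year = 1800 + max(range(len(series)), key=series.__getitem__)
    print(name, "peaks around", peak_year)
```

The point of the exercise: whatever the numbers come back as, they measure when English-language books mention a name, which is exactly the kind of cultural artifact the rest of this post is about.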

Perhaps these biases are addressed in the book's discussion of methodology; I don't know. Again, this is a first impression based on the web page. But it speaks to a point I blogged about earlier in response to a dust-up between CMU computer scientists and UT Austin grad students:

"NLP engineers are good at finding data and working with it, but often bad at interpreting it. I don't mean they're bad at interpreting the results of complex analysis performed on data. I mean they are often bad at understanding the nature of their data to begin with. I think the most important argument the UT Austin team make against the CMU team is this (important point underlined and boldfaced just in case you're stupid):
By focusing on cinematic archetypes, Bamman et al.’s research misses the really exciting potential of their data. Studying Wikipedia entries gives us access into the ways that people talk about film, exploring both general patterns of discourse and points of unexpected divergence.
In other words, the CMU team didn't truly understand what their data was. They didn't get data about personas or stereotypes in film. Rather, they got data about how a particular group of people talks about a topic. This is a well-known issue in humanities scholarship of all kinds, but it's much less understood in science and engineering, as far as I can tell."

One of the CMU team members responded with the fair point that they were developing a methodology first and foremost, and that their conference paper was focused on that. I agree. But that defense does not apply to the Who's Bigger? project, primarily because it is a full-length book that explicitly claims to apply computational methods to "measure historical significance". That is a bold claim.

To their credit, the authors say they use their method to study "the underrepresentation of women in the historical record", but that doesn't seem to be their main point. As the UT Austin grad students suggested above, the cultural nature of the data is the main story, not a charming subplot. Can you acknowledge the cultural inadequacies of a data set while simultaneously using it for cultural analysis? This strikes me as unwise.

I acknowledge again that this is a first impression based on a web site.

UPDATE: Cass Sunstein wrote a thorough debunking of the project's methodology a few weeks ago, concluding that the authors "have produced a pretty wacky book, one that offers an important warning about the misuses of quantification."

