Last year, I dwelled briefly on the implications of the Google Books settlement and Dan Cohen’s critique of all pre-Google history as merely “anecdotal.” In the meantime, the “Culturomics” project has burst onto the scene, offering a new way of doing the kind of total history envisioned by Cohen. Culturomics, which uses the Google data set to trace how different words and phrases change over time, has inspired cautious optimism among historians and other groups. I think the project’s claim to represent all of human culture is potentially dangerous, and I will explain why below. But first, to show that I’m not just another Luddite crank, I’d like to demonstrate how truly valuable Google Books has been for my work.
Historians, in general, do not like to foreground their methodology. Narrative historians like myself, especially, tend to bury our research strategies and theoretical scaffolding in footnotes and appendices and prospectuses. This helps create a more seamless reading experience, but is not always a good thing. So, in the spirit of open source and to make up for missing the HASTAC Conference this weekend, I will share part of the digital methodology from one of my recent articles.
The article (which you can check out here, if you’re lucky enough to have the right institutional subscription) examines the story of a murder committed by a South American man just south of Chiloé Island in the mid-1700s. I argue that the sole witness to the murder is unreliable and use contextual analysis, manuscripts, printed narratives, and oral histories to back my claim. The murder story appealed to Charles Darwin, who used it at two key moments in his career, and unfortunately it has been mindlessly repeated by historians ever since. Thanks in part to Darwin, the story is now falsely associated with the Yahgan people he encountered on the Cape Horn Archipelago – a group hundreds of miles away and very different from that of the original alleged murderer.
Although I don’t really talk about it in the article, I used anti-plagiarism software (designed to catch student cheats) to track the copying of certain quotes and phrases across texts. The fuzzy logic employed by some of these programs, meant to catch students who alter a word here or a phrase there, is especially helpful in identifying “borrowed” passages in historical documents. I used The Complete Work of Charles Darwin Online to mine for certain phrases. Their large collection of foreign-language material is really cool – there is a strong argument to be made for these kinds of subject-specific digital repositories as separate entities from the big universal search engines. I also used Google Ngram and related platforms to chart the use of the murder story by various authors over time.
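For readers curious what this kind of fuzzy matching looks like in practice, here is a minimal sketch in Python. It is not the anti-plagiarism software mentioned above, whose internals are not described here; it swaps in the standard library's `difflib.SequenceMatcher` as a stand-in. The function names, the 0.8 threshold, and the sample sentences are all hypothetical placeholders chosen for illustration, not quotations from any historical source.

```python
from difflib import SequenceMatcher


def word_windows(text, size):
    """Yield every run of `size` consecutive words in `text`, lowercased."""
    words = text.lower().split()
    for i in range(max(1, len(words) - size + 1)):
        yield " ".join(words[i:i + size])


def best_match(phrase, text):
    """Return (score, window) for the window of `text` most similar to `phrase`."""
    size = len(phrase.split())
    best = (0.0, "")
    for window in word_windows(text, size):
        score = SequenceMatcher(None, phrase.lower(), window).ratio()
        if score > best[0]:
            best = (score, window)
    return best


def find_borrowings(phrase, corpus, threshold=0.8):
    """List (title, score, window) for texts that closely echo `phrase`.

    The threshold is an arbitrary assumption; a real project would tune it.
    """
    hits = []
    for title, text in corpus.items():
        score, window = best_match(phrase, text)
        if score >= threshold:
            hits.append((title, round(score, 2), window))
    return hits


# Placeholder texts, invented for illustration only.
sample_corpus = {
    "a": "yesterday the quick brown fox jumped over a lazy dog again",
    "b": "nothing in this sentence resembles the phrase at all really",
}
hits = find_borrowings("the quick brown fox jumps over the lazy dog", sample_corpus)
for title, score, window in hits:
    print(title, score, window)
```

Sliding a window the same length as the phrase across each text means a passage can still be caught when a copyist has changed a word or two, which is the behavior that makes this sort of tool useful for historical documents. Comparing character by character is slow on large collections, so a sketch like this only suits a small, hand-picked set of texts.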
The results (summarized in the chart at left) confirmed my thesis. Use of the original murder story (the blue line) dropped precipitously in the middle of the nineteenth century. Meanwhile, Darwin’s version of the murder (the red line) shot way up. Texts that falsely attribute the murder to the Yahgan people (the green line) correlate more or less directly with the popularization of Darwin’s version of the murder. You can view my original data set here (it’s not the final version, since I stopped using Google Docs at some point, but it gives you the idea).
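The lines on a chart like this come down to a simple tally: for each decade, how many texts carry each version of the story. The sketch below shows that step only, under stated assumptions. The function name, the version labels, and above all the records are invented placeholders; they are not the data set linked above and imply nothing about the real counts.

```python
from collections import Counter, defaultdict


def tally_by_decade(records):
    """Count texts per story version per decade.

    `records` is an iterable of (year, version) pairs.
    """
    counts = defaultdict(Counter)
    for year, version in records:
        decade = (year // 10) * 10
        counts[decade][version] += 1
    return {decade: dict(c) for decade, c in sorted(counts.items())}


# Invented placeholder records for illustration; NOT the real data.
records = [
    (1768, "original"),
    (1771, "original"),
    (1839, "darwin"),
    (1845, "darwin"),
    (1847, "yahgan"),
]
table = tally_by_decade(records)
for decade, versions in table.items():
    print(decade, versions)
```

Each key in the result corresponds to one point on the horizontal axis, and each version label to one colored line.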
The graph cannot, however, establish that the murder story is a lie. It can only replicate the lie as it develops over time. Without the broader context established by more traditional historical research, these results would be meaningless. And this brings me to the danger inherent in Culturomics. First, machine-readable texts do not, and will never, represent the totality of the human experience. What about paintings, illustrations, and photographs, statues and figurative art, architecture, music, material culture, and ecology? What about oral history? What about economic, statistical and demographic evidence? Although there are millions upon millions of books, magazines, newspapers, and other printed material, they represent only the visible, privileged, literate tip of a vast store of human culture.
Even more troubling, texts lie. “There is no document of civilization,” said Walter Benjamin, “which is not at the same time a document of barbarism.” One of the great insights of the “New Social History” was the need to rub documents against the grain. Text mining usually rubs with the grain, merely reproducing the endemic biases and structured incompleteness of the written past. The graph can only replicate the lie.
This is not to say that Culturomics is hopelessly biased and needs to be discarded. On the contrary, it is precisely this kind of utopian enthusiasm – the dream that we can actually develop a more total vision of human culture – that is needed to keep history afloat. Large-scale text mining is simply wonderful. Like all great inventions, though, it can be used for good or for ill. And it makes sense, I think, to guard against the naive assumption that all of human culture or history can be reduced to a computational algorithm.
Cross-posted at HASTAC