Random Language Generation, Part 1

Debate continues over the usefulness of the so-called Digital Humanities. See this skirmish in the Los Angeles Review of Books for a recent example. As a graduate student in the Department of English Language and Literature (and let this serve as my introduction to this blog–hello!), I often encounter skepticism about whether computational methods can reveal anything we don’t already know about texts. In many cases, I tend to agree.

But there’s a more obvious reason that scholars should be engaged with the digital: an increasing number of contemporary cultural objects are born digital. I’m talking about artists such as Jason Salavon, whose practice involves taking the mean of a series of photographic portraits and displaying the results. Other artists in the Met’s “After Photoshop” exhibit (up until Spring 2013) are similarly worth checking out. Salavon’s tech notes on his “amalgamation” work are especially fascinating.

In literature, Nick Montfort’s “Taroko Gorge” is a truly born-digital creation. Written in Python and ported to JavaScript for the web, it has inspired a series of imitations (whose hyperlinks Montfort hilariously strikes out in the right margin of the “Taroko Gorge” webpage). The poem implements a basic vocabulary and set of syntactical rules–then it simply runs them forever in random combinations. Or until you quit your browser window!


I leave it to the reader (or another post!) to take apart Montfort’s actual code.  What I want to suggest is that humanists can speak to a number of issues surrounding pieces like “Taroko Gorge.” Most obviously, there is the question of authorship. How can a poem that is different each time you load it “belong” to a unitary author? Is Montfort the author of the poem that is endlessly scrolling in my next tab, or does he merely possess some rights with respect to his code? And when others grab that code and just switch out the old words for some new ones, are they plagiarizing? Appropriating?

But “problematizing” the conception of “authorship” is, to my mind, the low-hanging fruit. FWIW, Montfort welcomes the remixes of his poem but prefers remixes of the code, rather than the vocabulary. In his words:

As I see it, these remixes say nothing about the poetic quality of my original program. However, they speak endlessly of the poetic potential of a page of code. I would be delighted, of course, to see many more remixes of “Taroko Gorge.” But it would be even better if it becomes impossible to discern that the new poetic programs being developed are related to my original program. It would be even better if ten other programs are written that work in completely different ways to generate poems, and if each of these are themselves remixed ten times. The idea of developing computational art in a short amount of time, in an hour or in a day – an idea that I’ve ripped off and remixed from many others – is what really deserves to get picked up and reworked, again and again.

This kind of random language generation has a history, distinct methods, and an ongoing social impact. Our work as humanists can only improve if we understand these processes and effects. As it so happens, I’m currently writing a random sentence generator for a linguistics seminar homework. Like Montfort, I’m also writing in Python. In today’s post, I’ll outline one step of random sentence generation. Subsequent posts will touch on the other components needed to create a simplified version of something like “Taroko Gorge.”

The first step is to implement a context-free grammar (CFG), ideally one that follows Chomsky Normal Form (CNF), a set of conventions based on Noam Chomsky’s work in the late 1950s. But let’s leave aside CNF for the moment. Here are some sample rules from our CFG:

  1. ROOT –> S .
  2. S –> NP VP
  3. NP –> DET N
  4. VP –> V NP
  5. DET –> the | a
  6. N –>  gorge | mountain | crag | stone
  7. V –>  roams | sweeps

What this tells us is that we start out with a ROOT and can then generate an S (sentence) followed by a period. S expands to a Noun Phrase and a Verb Phrase. NP expands to a Determiner and a Noun; VP can become a Verb followed by another Noun Phrase. All the upper-case symbols are called non-terminals, because they can expand into other constituents (that is, they don’t “terminate” in the English words that will end up making the sentence).

Rules 5, 6, and 7 introduce the terminals. These are the words that will actually appear in the sentence (thus “terminating” the expansion of non-terminals). When we see a DET or determiner we can select either “the” or “a,” and likewise for the nouns and verbs. So let’s start with ROOT and generate a pseudo-random sentence by expanding each non-terminal from left to right.

  1. S .
  2. NP VP .
  3. DET N VP .
  4. The N VP .
  5. The gorge VP .
  6. The gorge V NP .
  7. The gorge sweeps NP .
  8. The gorge sweeps DET N .
  9. The gorge sweeps a N .
  10. The gorge sweeps a crag .
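
For readers who want to try this at the keyboard, here is a minimal Python sketch of the same left-to-right expansion. The dictionary layout and the names `GRAMMAR` and `derive` are my own assumptions for illustration; this is not Montfort’s code, and a fuller program would read the rules from a file rather than hard-coding them.

```python
import random

# The seven sample rules: each left-hand side maps to a list of
# possible expansions, and each expansion is a list of symbols.
GRAMMAR = {
    "ROOT": [["S", "."]],
    "S": [["NP", "VP"]],
    "NP": [["DET", "N"]],
    "VP": [["V", "NP"]],
    "DET": [["the"], ["a"]],
    "N": [["gorge"], ["mountain"], ["crag"], ["stone"]],
    "V": [["roams"], ["sweeps"]],
}


def derive(grammar, start="ROOT", rng=random):
    """Return each intermediate form of one random leftmost derivation."""
    form = [start]
    steps = []
    while True:
        # The leftmost symbol that still has a rule is the leftmost non-terminal.
        index = next((i for i, sym in enumerate(form) if sym in grammar), None)
        if index is None:
            return steps  # only terminals remain, so the sentence is finished
        expansion = rng.choice(grammar[form[index]])
        form = form[:index] + expansion + form[index + 1:]
        steps.append(" ".join(form))


for number, step in enumerate(derive(GRAMMAR), start=1):
    print(f"{number}. {step}")
```

Each run prints a ten-step derivation like the one above, with different words chosen each time. Anything that does not appear as a key in `GRAMMAR` is treated as a terminal, which is why the period needs no rule of its own.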

Now obviously these rules are not complex enough to represent English. There’s no way to generate a prepositional phrase yet, or even an adjective. Much less things like singular-plural agreement between nouns and verbs. But by grabbing a few words from Montfort’s vocabulary and pairing them with very basic rules about English sentences, we can start to see how we might generate random language. Can you think of (or comment on!) other possible sentences our grammar can make at this early stage? 
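
If you would rather let the machine answer that question, the following Python sketch lists every sentence the sample grammar can produce. The function name `expansions` and the dictionary layout are assumptions of mine, and the approach only works because this grammar has no recursive rules; a rule such as NP –> NP PP would make the recursion run forever.

```python
from itertools import product

# The seven sample rules, written as lists of possible expansions.
GRAMMAR = {
    "ROOT": [["S", "."]],
    "S": [["NP", "VP"]],
    "NP": [["DET", "N"]],
    "VP": [["V", "NP"]],
    "DET": [["the"], ["a"]],
    "N": [["gorge"], ["mountain"], ["crag"], ["stone"]],
    "V": [["roams"], ["sweeps"]],
}


def expansions(symbol, grammar):
    """Yield every sequence of terminal words that `symbol` can expand to."""
    if symbol not in grammar:
        yield [symbol]  # a terminal expands only to itself
        return
    for option in grammar[symbol]:
        # Pair every expansion of each right-hand-side symbol with every
        # expansion of the others, then flatten the pieces into one list.
        choices = [list(expansions(sym, grammar)) for sym in option]
        for parts in product(*choices):
            yield [word for part in parts for word in part]


sentences = [" ".join(words) for words in expansions("ROOT", GRAMMAR)]
print(len(sentences))
for sentence in sentences[:5]:
    print(sentence)
```

With two determiners, four nouns, and two verbs, the count comes to 2 × 4 × 2 × 2 × 4 = 128 sentences, which is more than you might expect from seven rules.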

Next time, we’ll talk about storing a set of these rules in a file and then writing a program that stores them in a convenient memory structure so that it can randomly select different expansion options.

Digital History 2.0

The title of this blog is intentionally oxymoronic. Digital History stands for the fresh, the new, the innovative; Yale is a byword for the venerable, the traditional, and the conservative. The two terms exist in an awkward tension. I have always thought that if the digital humanities – as a methodology, as a practice, as a discipline – could thrive at a place like Yale, they could thrive anywhere. As an arbiter of the establishment, Yale offers a challenging test case for the digital revolution. The Past’s Digital Presence, a conference hosted here two years ago, was an important first step. (Most of the conference presentations are now available online, so if you missed it the first time around, you can relive it at home!) Exciting new initiatives like Historian’s Eye or the recently adopted Digital Himalaya project show Yale faculty experimenting with new forms and engaging new technologies to drive their scholarship.

In this forward-looking spirit, I am proud to announce the rebirth of Digital History at Yale as a group blog. So keep a lookout for some new names in the time to come – graduate students like myself who have a thing or two to say about the digital humanities, or whatever else is on their minds.

A Curious Artifact

Christopher Hitchens died last week. He was an arrogant and abrasive man and a souse. He was also a frightful intellect and a dazzling writer, capable of holding forth on any topic from oral sex to the Ten Commandments. One obituary writer describes him as “an excitingly dangerous orator.” Although I did not always agree with him, in a weird way, I felt sorry for him.

Some years ago, a friend and I organized a debate between Hitch and political scientist Michael Parenti. It was a learning experience. Even with support from a motley coalition of faculty and student groups, we managed to run up a debt. A few weeks later, we graduated. My friend ended up in Venezuela, and I ended up in New Zealand.

When I learned that Hitch had died, I dusted off my DVD of the debate. Unfortunately, the quality was not great. There was a gap where the cameraman switched tapes, and the second tape ran out before the end of the event. So I had to do some creative editing. I swapped out the original soundtrack for a more complete audio recording and used HandBrake to encode to mp4. The original DV tapes from which I had authored the DVD were long gone, so I had to transcode from interlaced MPEG-2. In retrospect, I think it would have been better to convert the VOBs to DV using ffmpegX, edit the DV stream in Final Cut or QuickTime, and then export to mp4. If you ever need to extract and remaster a DVD, this is the method I would recommend. By the time I figured this out, however, I had already invested too much in the direct-to-mp4 method.

Surprisingly, uploading to YouTube was the hardest part. YouTube’s transcoding engine did not care for my spliced edits, which introduced several different tracks and bitrates. The mp4 container is wonderfully robust, capable of supporting a range of tracks and even chapter markers, but it took four days of uploads before I found a way to merge everything together in a way that YouTube would accept. The resulting copy is less than spectacular, but it’s better than nothing. The timecode at the very end is still corrupted somehow. Since mp4 is YouTube’s container of choice, I find it frustrating that they insist on running videos through an additional layer of encoding, over which I have no control. Why not provide their specs and allow users to upload directly to the back end with little or no downsampling?

The video is freely available under a Creative Commons license. A curious historical document, like Hitch, it now belongs to the ages (but definitely not to the angels).