Random Language Generation, Part 1

Debate continues over the usefulness of the so-called Digital Humanities. See this skirmish in the Los Angeles Review of Books for a recent example. As a graduate student in the Department of English Language and Literature (and let this serve as my introduction to this blog–hello!), I often encounter skepticism about whether computational methods can reveal anything we don’t already know about texts. In many cases, I tend to agree.

But there’s a more obvious reason that scholars should be engaged with the digital: an increasing number of contemporary cultural objects are born digital. I’m talking about artists such as Jason Salavon, whose practice involves taking the mean average of a series of photographic portraits and displaying the results. Other artists in the MET’s “After Photoshop” exhibit (up until Spring 2013) are similarly worth checking out. Salavon’s tech notes on his “amalgamation” work are especially fascinating.

In literature, Nick Montfort’s “Taroko Gorge” is a truly born-digital creation. Written in Python and ported to JavaScript for the web, it has inspired a series of imitations (which Montfort hilariously strikes out in the hyperlinks to them in the right margin of the “Taroko Gorge” webpage). The poem implements a basic vocabulary and a set of syntactical rules–then it simply runs them forever in random combinations. Or until you close your browser window!


I leave it to the reader (or another post!) to take apart Montfort’s actual code.  What I want to suggest is that humanists can speak to a number of issues surrounding pieces like “Taroko Gorge.” Most obviously, there is the question of authorship. How can a poem that is different each time you load it “belong” to a unitary author? Is Montfort the author of the poem that is endlessly scrolling in my next tab, or does he merely possess some rights with respect to his code? And when others grab that code and just switch out the old words for some new ones, are they plagiarizing? Appropriating?

But “problematizing” the conception of “authorship” is, to my mind, the low-hanging fruit. FWIW, Montfort welcomes the remixes of his poem but prefers remixes of the code rather than of the vocabulary. In his words:

As I see it, these remixes say nothing about the poetic quality of my original program. However, they speak endlessly of the poetic potential of a page of code. I would be delighted, of course, to see many more remixes of “Taroko Gorge.” But it would be even better if it becomes impossible to discern that the new poetic programs being developed are related to my original program. It would be even better if ten other programs are written that work in completely different ways to generate poems, and if each of these are themselves remixed ten times. The idea of developing computational art in a short amount of time, in an hour or in a day – an idea that I’ve ripped off and remixed from many others – is what really deserves to get picked up and reworked, again and again.

This kind of random language generation has a history, distinct methods, and an ongoing social impact. Our work as humanists can only improve if we understand these processes and effects. As it so happens, I’m currently writing a random sentence generator as homework for a linguistics seminar. Like Montfort, I’m writing in Python. In today’s post, I’ll outline one step of random sentence generation. Subsequent posts will touch on the other components needed to create a simplified version of something like “Taroko Gorge.”

The first step is to implement a context-free grammar (CFG). Ideally the grammar follows a set of conventions called Chomsky Normal Form (CNF), which grew out of Noam Chomsky’s work on formal grammars in the late 1950s. But let’s leave aside CNF for the moment. Here are some sample rules from our CFG:

  1. ROOT -> S .
  2. S -> NP VP
  3. NP -> DET N
  4. VP -> V NP
  5. DET -> the | a
  6. N -> gorge | mountain | crag | stone
  7. V -> roams | sweeps

What this tells us is that we start out with a ROOT and can then generate an S (sentence) followed by a period. S expands to a Noun Phrase and a Verb Phrase. NP expands to a Determiner and a Noun; VP can become a Verb followed by another Noun Phrase. All the upper-case symbols are called non-terminals, because they can expand into other constituents (that is, they don’t “terminate” in the English words that will end up making the sentence).
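One way to picture these rules in Python is as a dictionary. This is only a sketch under my own assumptions, not Montfort’s code and not necessarily the structure a later post will settle on: the names `GRAMMAR`, `is_nonterminal`, and `terminals` are hypothetical. Each non-terminal maps to a list of possible expansions, and anything without a rule of its own counts as a terminal.

```python
# Hypothetical representation: each non-terminal maps to a list of
# possible expansions, and each expansion is a list of symbols.
GRAMMAR = {
    "ROOT": [["S", "."]],
    "S": [["NP", "VP"]],
    "NP": [["DET", "N"]],
    "VP": [["V", "NP"]],
    "DET": [["the"], ["a"]],
    "N": [["gorge"], ["mountain"], ["crag"], ["stone"]],
    "V": [["roams"], ["sweeps"]],
}


def is_nonterminal(symbol):
    """A symbol is a non-terminal if the grammar has a rule that expands it."""
    return symbol in GRAMMAR


def terminals(grammar):
    """Collect every right-hand-side symbol that has no rule of its own."""
    found = set()
    for expansions in grammar.values():
        for expansion in expansions:
            for symbol in expansion:
                if symbol not in grammar:
                    found.add(symbol)
    return found


print(sorted(terminals(GRAMMAR)))
```

Note that under this representation the period counts as a terminal too, since nothing expands it further.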

Rules 5, 6, and 7 introduce the terminals. These are the words that will actually appear in the sentence (thus “terminating” the expansion of non-terminals). When we see a DET, or determiner, we can select either “the” or “a,” and likewise for the nouns and verbs. So let’s start with ROOT and generate a pseudo-random sentence by expanding each non-terminal from left to right.

  1. S .
  2. NP VP .
  3. DET N VP .
  4. The N VP .
  5. The gorge VP .
  6. The gorge V NP .
  7. The gorge sweeps NP .
  8. The gorge sweeps DET N .
  9. The gorge sweeps a N .
  10. The gorge sweeps a crag .
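The hand derivation above can be sketched in a few lines of Python. Again, this is a rough illustration under my own assumptions rather than a finished generator: the dictionary layout and the function name `derive` are hypothetical. The function keeps replacing the leftmost non-terminal with a randomly chosen expansion and records each intermediate line.

```python
import random

# Hypothetical dictionary form of the seven rules above.
GRAMMAR = {
    "ROOT": [["S", "."]],
    "S": [["NP", "VP"]],
    "NP": [["DET", "N"]],
    "VP": [["V", "NP"]],
    "DET": [["the"], ["a"]],
    "N": [["gorge"], ["mountain"], ["crag"], ["stone"]],
    "V": [["roams"], ["sweeps"]],
}


def derive(grammar, start="ROOT", rng=random):
    """Expand the leftmost non-terminal until none remain; return every step."""
    symbols = [start]
    steps = []
    while True:
        # Find the position of the leftmost symbol that still has a rule.
        index = next((i for i, s in enumerate(symbols) if s in grammar), None)
        if index is None:
            return steps
        # Pick one of that symbol's expansions at random and splice it in.
        expansion = rng.choice(grammar[symbols[index]])
        symbols = symbols[:index] + expansion + symbols[index + 1:]
        steps.append(" ".join(symbols))


for number, step in enumerate(derive(GRAMMAR), start=1):
    print(f"{number}. {step}")
```

With this grammar every run takes ten steps, just like the derivation by hand; only the words chosen in the DET, N, and V steps change from run to run.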

Now obviously these rules are not complex enough to represent English. There’s no way to generate a prepositional phrase yet, or even an adjective, much less things like singular-plural agreement between nouns and verbs. But by grabbing a few words from Montfort’s vocabulary and pairing them with very basic rules about English sentences, we can start to see how we might generate random language. Can you think of (or comment on!) other possible sentences our grammar can make at this early stage?
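If you’d rather have the computer answer that question, here is a small sketch that lists every sentence the grammar can produce. It assumes the same hypothetical dictionary layout for the rules, and the helper name `expand` is my own. It only works because none of our rules is recursive; a grammar that can loop back on itself would generate infinitely many sentences.

```python
from itertools import product

# Hypothetical dictionary form of the seven rules above.
GRAMMAR = {
    "ROOT": [["S", "."]],
    "S": [["NP", "VP"]],
    "NP": [["DET", "N"]],
    "VP": [["V", "NP"]],
    "DET": [["the"], ["a"]],
    "N": [["gorge"], ["mountain"], ["crag"], ["stone"]],
    "V": [["roams"], ["sweeps"]],
}


def expand(symbol, grammar):
    """Yield every list of words that the given symbol can produce."""
    if symbol not in grammar:
        # Terminals produce only themselves.
        yield [symbol]
        return
    for expansion in grammar[symbol]:
        # Combine every way of expanding each child, keeping left-to-right order.
        child_options = [list(expand(child, grammar)) for child in expansion]
        for parts in product(*child_options):
            yield [word for part in parts for word in part]


sentences = [" ".join(words) for words in expand("ROOT", GRAMMAR)]
print(len(sentences))
for sentence in sentences[:5]:
    print(sentence)
```

For the seven rules above this comes to 128 sentences: 8 possible noun phrases, times 2 verbs, times 8 noun phrases again.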

Next time, we’ll talk about storing a set of these rules in a file and then writing a program that stores them in a convenient memory structure so that it can randomly select different expansion options.
