Category Archives: Archives

Scraping Samuel Richardson

March 12, 2013 | Archives, Digital Scholarship, Research and Teaching Tools, Semantic Web, Yale Projects | Stephen Krewson

It’s hard enough to read Samuel Richardson’s Pamela. It’s even harder to finish his later, longer epistolary novel: Clarissa, or, the History of a Young Lady (1748) [984,870 words]. Having toiled through both books, I was resting easy until confronted with a curious volume that I volunteered to present on in a graduate seminar on the 18C novel. The title? A collection of the moral and instructive sentiments, maxims, cautions, and reflexions, contained in the histories of Pamela, Clarissa, and Sir Charles Grandison (1755). The CMIS, as I’ll abbreviate it, consists of several hundred topics, each with multiple entries that consist of a short summary and a page reference. Here’s an example from the Clarissa section, thanks to ECCO.

Without going into the meaning or importance of these references, I want to focus on a practical problem: how could we extract every single citation of the eight-volume “Octavo Edition” of 1751? Our most basic data structure should be able to capture the volume and page numbers and associate them with the correct topic. While Richardson may very well have used some kind of card index, I can safely say that no subsequent reader or critic has bothered to count anything in the CMIS. But its very structure demands a database!

As a novice user of Python, I find it somewhat embarrassing to share the script I wrote to “scrape” the page numbers from an e-text of the CMIS (subscription required) helpfully prepared by the wonderful people and machines at the Text Creation Partnership (TCP). The TCP’s version was essential since the OCR-produced text (using ABBYY FineReader 8.0) at the Internet Archive is riddled with errors.

I started by cutting and pasting two things into text files in my Python directory: (1) the full contents of the Clarissa section of the CMIS and (2) a list of all 136 topics (from “Adversity. Affliction. Calamity. Misfortune.” to “Youth”) pulled from the TCP table of contents page.

import sys
import re
from collections import defaultdict
from rome import Roman

The first step is to import the modules we’ll need. “sys” and “re” (regular expressions) are standard library modules, as is collections, which provides defaultdict: a super helpful way to give every missing key a default value of 0 (or anything you choose) and avoid key errors. rome is a third-party package that converts Roman numerals to Arabic.
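To make the defaultdict point concrete, here is a minimal sketch contrasting it with a plain dictionary. The page numbers are made up for illustration, and defaultdict(int) is used as an equivalent of the defaultdict(lambda: 0) that appears later in the script.

```python
from collections import defaultdict

# Hypothetical page numbers, used only to illustrate counting.
pages = [406, 658, 406]

plain = {}
try:
    plain[406] += 1          # a plain dict has no entry for 406 yet
except KeyError:
    plain[406] = 1           # so the first hit has to be handled separately

counts = defaultdict(int)    # int() returns 0, so missing keys start at 0
for page in pages:
    counts[page] += 1

print(dict(counts))
```

One thing to keep in mind: merely looking up a missing key in a defaultdict creates it, so membership checks are better done with `in`.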

# Read in two files: (1) digitized 'Sentiments' (2) TOC of topics
f1 = open(sys.argv[1], 'r')
f2 = open(sys.argv[2], 'r')

# Create topics list, filtering out alphabetical headings
topics = [line.strip() for line in f2 if len(line) > 3]

# Dictionary for converting volumes into one series of pages
volume = {1:348, 2:355, 3:352, 4:385, 5:358, 6:431, 7:442, 8:399}
startPage = {1:0, 2:348, 3:703, 4:1055, 5:1440, 6:1798, 7:2229, 8:2671}

This section of code reads in the two files as ‘f1’ and ‘f2.’ I’ll grab the contents of f2 and write them to a list called ‘topics,’ doing a little cleanup on the way. Essentially, the list comprehension filters out the alphabetical headings like “A.” or “Z.” (since, with the trailing newline included, these are no more than three characters long). Now I have an array of all 136 topics which I can loop over to check if a line in my main file is a topic heading. You probably noticed that the references in CMIS were formatted by volume and page. I’d like to get rid of the volume number and convert all citations to a ‘global’ page number. The first dictionary lists the volume and its total number of pages; the second contains the number of pages that precede any given volume, which is the offset added to its page numbers. Thus, the final volume starts after page 2,671.
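The second dictionary does not have to be typed by hand: it is the running total of the first. A minimal sketch, assuming the page counts given in the post (the names start_page and global_page are mine, not part of the original script):

```python
from itertools import accumulate

# Page counts per volume, copied from the post's `volume` dictionary.
volume = {1: 348, 2: 355, 3: 352, 4: 385, 5: 358, 6: 431, 7: 442, 8: 399}

# Each volume's offset is the running total of all earlier volumes.
lengths = [volume[v] for v in sorted(volume)]
offsets = [0] + list(accumulate(lengths))[:-1]
start_page = dict(zip(sorted(volume), offsets))

def global_page(vol, page):
    """Convert a (volume, page) citation to one continuous page number."""
    return start_page[vol] + page

print(start_page)
print(global_page(2, 58))   # volume ii, page 58
```

Deriving the offsets this way keeps the two dictionaries from drifting apart if a page count ever needs correcting.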

counter = 0
match = ''

# Core dictionaries: (1) citations ranked by frequency and (2) sorted by location
frequency = defaultdict(lambda: 0)
location = {}

# Loop over datafile
for line in f1:
    if line.strip() in topics:
        match = line.strip()
        counter += 1
        location[match] = [counter, []]

OK, the hardest thing for me was making sure the extracted references got tossed in the right topic bin. So I initialized a counter that would increment each time the code hits a new topic. The blank string ‘match’ will keep track of the topic name. The loop goes through each line in the main file, f1. The first if statement checks if the line (with white space stripped off) is present in the topics list. If it is, then counter and match update and a key with the topic name (e.g. “Youth.”) is created in the location dictionary. The value for this key will be a list: location[“Youth.”][0] equals 136, since this is the last topic.
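To see what the location dictionary looks like once the topic headings have been found, here is a small sketch run on a made-up, four-line stand-in for the two files (the entry text is invented; only the structure matters):

```python
# Hypothetical miniature inputs standing in for the two text files.
topics = ["Adversity. Affliction. Calamity. Misfortune.", "Youth."]
lines = [
    "Adversity. Affliction. Calamity. Misfortune.\n",
    "An entry under the first topic\n",
    "Youth.\n",
    "An entry under the last topic\n",
]

location = {}
counter = 0
match = ""
for line in lines:
    if line.strip() in topics:
        match = line.strip()             # the topic now in force
        counter += 1                     # its position in the sequence
        location[match] = [counter, []]  # [position, pages found so far]

print(location)
```

With only 136 topics the list lookup is fast enough; for a much longer list of headings, a set would make the membership test cheaper.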

    elif re.search(r'[iv]+\..*(?=\[)', line):
        citation = re.search(r'[iv]+\..*(?=\[)', line)
        process = [x for x in re.split(r'\W', citation.group()) if re.match(r'(\d|[iv]+)', x)]
        current = ''
        for i in range(len(process)):
            if re.match(r'[iv]+', process[i]):
                current = process[i]
            else:
                #frequency[(int(Roman(current)), process[i])] += 1
                page = startPage[int(Roman(current))] + int(process[i])
                frequency[page] += 1
                location[match][1].append(page)

This is the heart of the code. The else-if statement deals with all lines that are NOT topic headings AND contain the regular expression I have specified. Let’s break down the regex:

'[iv]+\..*(?=\[)'

Brackets mean disjunction: so either ‘i’ OR ‘v’ is what we’re looking for. The Kleene plus (‘+’) says we need to have at least one of the immediately previous pattern, i.e. the ‘[iv]’. Then we escape the period using a backslash, because we only need to get the Roman numerals up to eight (‘viii’) followed by a period. The second period is a special wildcard and the Kleene star right after means we can have as many wildcards as we want up until the parentheses, which contain a lookahead assertion. The lookahead checks for a left bracket (remember how the citations always include the duodecimo references in brackets). In English, then, the regex checks for some combination of i’s and v’s followed by a period that is followed, at some point, by a bracket.
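The breakdown above can be written into the pattern itself with re.VERBOSE, which lets each piece carry its own comment. This is only a sketch: the name CITATION is mine, the first test line is the tail of an entry from the Adversity topic, and the second line is invented to show a non-match.

```python
import re

# The same pattern as in the post, spelled out with re.VERBOSE comments.
CITATION = re.compile(r"""
    [iv]+      # one or more of 'i' or 'v': a volume numeral up to 'viii'
    \.         # a literal period (escaped)
    .*         # any characters at all, as many as possible
    (?=\[)     # stop where a left bracket comes next, without consuming it
""", re.VERBOSE)

line = "they may not be losers by their trials, ii. 58. 310. [149. iii. 44]."
found = CITATION.search(line)
print(repr(found.group()))

# A made-up line with no bracketed duodecimo reference does not match at all.
no_bracket = CITATION.search("Youth is the time for improvement, iv. 12.")
print(no_bracket)
```

Because the lookahead does not consume the bracket, the matched text ends just before it, and the duodecimo numbers inside the brackets are never picked up.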

The process variable takes the string returned by the regex, splits it on non-word characters (the periods and spaces), and keeps only the Roman and Arabic numerals in a list. The string “People in Adversity should endeavour to preserve laudable customs, that so, if sun-shine return, they may not be losers by their trials, ii. 58. 310. [149. iii. 44].” would be returned as “ii. 58. 310. ” by the regex (the lookahead stops short of the bracket without consuming it) and then turned into ['ii', '58', '310'] by process. Current is an empty string designed to hold the current Roman numeral so we know, for instance, which volume to match up page 310 with. In the final lines, the current Roman numeral is converted to its startPage number and the page number is added to it. Then the frequency dictionary for that specific page is incremented and the key for the current topic in the location dictionary is updated with the newly extracted page number.
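Here is the token walk in isolation, as a sketch rather than a replacement for the script. To keep it self-contained it swaps the rome package for a small lookup table (an assumption that works only because there are eight volumes), and the function name citation_pages is hypothetical.

```python
import re

# Offsets copied from the post's `startPage` dictionary.
start_page = {1: 0, 2: 348, 3: 703, 4: 1055, 5: 1440, 6: 1798, 7: 2229, 8: 2671}

# Stand-in for the `rome` package: only eight volumes, so a table suffices.
ROMAN = {"i": 1, "ii": 2, "iii": 3, "iv": 4, "v": 5, "vi": 6, "vii": 7, "viii": 8}

def citation_pages(matched):
    """Turn a matched citation string into tokens and global page numbers."""
    tokens = [t for t in re.split(r"\W", matched) if re.match(r"\d|[iv]+", t)]
    pages, current = [], None
    for token in tokens:
        if token in ROMAN:
            current = ROMAN[token]       # remember the volume in force
        elif token.isdigit() and current is not None:
            pages.append(start_page[current] + int(token))
    return tokens, pages

tokens, pages = citation_pages("ii. 58. 310. ")
print(tokens, pages)
```

Tokens that are neither a known volume numeral nor a number are skipped here instead of raising an error, which is one possible way to survive stray words that happen to start with ‘i’ or ‘v’.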

Obviously, this is a rather crude method. It’d be fun to optimize it (and I do need to fix it up so that it can deal with the handful of citations marked by ‘ibid.’), but scraping is supposed to be quick-and-dirty because it really only works with the specific document or webpage that you’re encountering. I doubt this code would do anything useful for other concordance-like texts in the TCP. But I would love to hear suggestions for how it could be better.

In a later post, I’ll talk about the problems I’ve faced in visualizing the data extracted from the CMIS.

Archival Fragment of the Amistad Revolt

September 17, 2012 | Archives, Digital Scholarship | Amistad Revolt, Archive Fever, Kale, Mendi Mission, Public Domain, Sierra Leone, Slavery | Joseph Yannielli

Sometimes the best cure for archive fever is to share it with the world.

“Pa Raymond,” Sierra Leone Mission Album, box 2, p. 122, Records of the United Brethren in Christ Foreign Missionary Society, United Methodist Archives, Drew University.

I was reminded of the mundane joys of the archive again several months ago when, thanks to a tip from a colleague, I located an extremely rare photograph of one of the survivors of the Amistad slave revolt in the United Methodist Archives in New Jersey. It is difficult to tell whether the old man, called “Pa Raymond” on the reverse of the photo, is the real deal, but circumstantial evidence suggests that he might be Kale Walu, or “Little Kale,” who was just a boy when he was abducted and enslaved in West Africa in 1839. Kale (also spelled Kali or Carly) was the author of the famous “crazy dolts” letter, addressed to John Quincy Adams on the eve of their trial in the United States Supreme Court. He assumed the name George Lewis when he returned to Africa in 1842, part of an ongoing project to reinvent former slaves as anglicized Christians. As one of the youngest among the returning group, he was something of a surrogate son for abolitionist missionary William Raymond and may have taken his surname later in life. Pa is Krio for “father,” an honorific title for village elders.

The photo was probably taken sometime in the early 20th century by the United Brethren in Christ, who had inherited an abolitionist outpost, called the “Mendi Mission,” in what is now southwestern Sierra Leone. Almost all of the photos in the collection date from after the rebellion of 1898. When Canadian missionary Alexander Banfield encountered a man claiming to be an Amistad veteran during a tour of Sierra Leone in 1917 (likely the same man in this photo), he estimated the man was about 100 years old. Although my work does not really focus on the Amistad captives (I’m interested in the larger story of American abolitionists in Africa), it is bracing to look into the eyes of this man. Sole survivor. Adopted son of the missionary, traveling barefoot through the bush. White-haired patriarch, holding something mysterious with his right hand. What have those eyes seen? Where are they looking now?

Thanks to the generous (and underpaid and understaffed) archivists in New Jersey and the embattled public domain laws of the United States, I am able to share this treasure with the world (I think) for the first time. It belongs to the world. I am just returning it.

Mal d’Archive

September 17, 2012 | Archives, Digital Scholarship | Archive Fever, Crowdsourcing, First World Problems, Georgia, Jacques Derrida, Philosophy, Sierra Leone, Transcriptions | Joseph Yannielli

You know you’re a pretentious academic blogger when you start titling your posts in French, and if you can quote one of the most notoriously abstruse French philosophers at the same time, well that’s just a bonus. Jacques Derrida is not much in style these days (if he ever was). His ideas, and especially his prose, have been the butt of many jokes over the past half-century, but his 1994 lecture series Mal d’Archive (later published and translated as “Archive Fever”) is a significant artifact of the early days of the digital revolution. Although I don’t quite agree with everything its author says, the book makes an earnest attempt to grapple with the intersection of technology and memory and offers some worthwhile insight.

An archivist works feverishly.

The idiomatic en mal de does not have a direct analogue in English, but for Derrida it means both a sickness and “to burn with a passion.” It is an aching, a compulsive drive (in the Freudian sense) to “return to the origin.” It is the sort of fever rhapsodized by Peggy Lee, the kind of unquenchable desire that can only be remedied by more cowbell. Whatever Derrida means by archive fever (and I think he leaves its precise meaning deliberately ambiguous), it is a concept that has some resonance for historians. As a profession, we tend to privilege primary sources, or archival documents, over secondary sources, or longer works that analyze and interpret an archive. Yet even the most rudimentary archival fragment contains within it a narrative, a story, an argument. Every document is aspirational; every archive is also an interpretation. There is no such thing as a primary source. There are only secondary sources. We build our histories based on other histories. The archive, Derrida reminds us, is forever expanding and revising, preserving some things and excluding others. The archive, as both subject and object of interpretation, is always open-ended, it is “never closed.”

Of course, in a few weeks, in what can only be described as a stunning disregard for French philosophy, the Georgia State Archives will literally shut its doors. Citing budget cuts, the state announced it will close its archives to the public and restrict access to special appointments (and those appointments will be “limited” due to layoffs). For now, researchers can access a number of collections through the state’s Virtual Vault, but it is not clear whether more material will be added in the future. The closure comes at the behest of governor Nathan Deal, whose recent political career has been beset by ethics violations. The cutbacks are the latest in a string of controversial decisions by the Georgia governor, including the rejection of billions of dollars in medicare funds and a $30 million tax break for Delta Airlines, and will have a negative impact on government transparency. Coming on the heels of the ban on ethnic studies in Arizona, the campaign against “critical thinking” in Texas, attacks on teachers in Illinois and Wisconsin, and deep cuts in public support for higher education across the country, the news from Georgia seems a portent of dark times.

Archives are so essential to our understanding of the past, and our memory of the past is so important to our identity, that it can feel as if we have lost a little part of ourselves when one is suddenly closed, restricted, or destroyed. Historian Leslie Harris calls public archives “the hallmarks of civilization.” Although I don’t entirely agree (are groups that privilege oral tradition uncultivated barbarians?), Harris points to a fundamental truth. The archive is an integral component of a society’s self-perception. Without open access to archival collections, who could corroborate accusations that the government was conducting racist medical experiments? Who would discover the lost masterpiece of a brilliant author? Who would provide the census data to revise wartime death tolls? Who would locate the final key to unlock the gates of Hell? All of the boom and bluster about digitization and the democratization of knowledge notwithstanding, it is easy to forget that archival work is a material process. It takes place in actual physical locations and requires real workers. What does it mean for the vaunted Age of Information when states restrict or close access to public repositories?

However troubling the news from Georgia, all hope is not lost. This is not the end of days. Knowledge workers are fighting to preserve access to the archive. At the same time, efforts by historians to crowdsource the past offer a fascinating and potentially momentous expansion of archive fever. Several high profile projects are now underway to enlist “citizen archivists” to help build, organize, and transcribe documentary collections. Programmers at the always-innovative Roy Rosenzweig Center for History and New Media have just released a “community transcription tool” that will (hopefully) streamline the process of collaborative archiving, transcribing, and tagging across platforms. The potential for public engagement and the production of new knowledge is stupendous. Because they rely on the same volunteer ethos as Wikipedia, however, it is likely that part-time hobbyists will be more interested in parsing obscure Civil War missives than the correspondence of Jeremy Bentham. A citizen archivist with a passion for Iroquois genealogy might have little interest in, let’s say, the municipal records of East St. Louis. And this is precisely where major repositories and their well-trained staff can help supervise, guide, and even lead the public. What if every historian could upload all of their primary sources to a central repository when they finished a project? What if there was a universal queue where researchers could submit manuscripts for public transcription, along the lines of the now-ubiquitous reCAPTCHA service? Perhaps administrators could implement some sort of badge or other incentive program in exchange for transcribing important material? As all manner of documents are digitized, uploaded, and transcribed in a lopsided, haphazard, and ad-hoc fashion, in vastly disparate quality, in myriad formats, in myriad locations, physical archives and their staff are needed more than ever – if only to help level the playing field. 
Among the most important functions of the professional archivist is to remind us that there is much that is not yet online.

Note recording the arrival of the Amistad survivors in Freetown, Sierra Leone, Jan. 1842. Liberated African Register, Sierra Leone Public Archives, Freetown.

One of the best experiences I’ve ever had as a researcher was in the national archives of Sierra Leone. Despite a century and a half of colonialism, a decades-long civil war, and other challenges that come with occupying a bottom rung on the global development index, the collections remain open to the public and continue to grow and improve. They have even started to go digital thanks to some help from the British Library and the Harriet Tubman Resource Centre. Sitting in the Sierra Leone archives, with its maggot-bitten manuscripts, holes in the windows, and sweltering heat, suddenly the much-discussed global digital divide seems very real. Peering out of the window one day, as I did, to see a mass of students drumming and chanting, then chased by soldiers in riot gear, the screams from the crowd as you shield yourself from gun fire behind a bookshelf thick with papers, it is difficult to look at knowledge work the same way again. When I enter a private archive in the United States, with its marbled columns and leather chairs, its rows of computers and sophisticated security cameras, I am grateful and angry – grateful that this is offered to some, angry that it is denied to others. The archivists and their support team in Freetown are heroes. Full stop. I worry about them when I read about the conflict in Libya, which continues to spill across borders and has led indirectly to the destruction of priceless archives and religious monuments in Mali.

Compared to the situation in West Africa, the more modest efforts to preserve and teach the past across the United States seem like frivolous first world problems. On the other hand, all information is precious. Whether physical or digital, access to our shared heritage should not be held hostage to political agendas or economic ultimatums. Archives are a right, not a privilege. I like to think that Derrida, who grew up under a North African colonial regime, would appreciate this. If Sierra Leone can keep its archives open to the public, why can’t the state of Georgia?

Cross-posted at HASTAC