Category Archives: Semantic Web

Scraping Samuel Richardson

It’s hard enough to read Samuel Richardson’s Pamela. It’s even harder to finish his later, longer epistolary novel: Clarissa, or, the History of a Young Lady (1748) [984,870 words]. Having toiled through both books, I was resting easy until confronted with a curious volume that I volunteered to present on in a graduate seminar on the 18C novel. The title? A collection of the moral and instructive sentiments, maxims, cautions, and reflexions, contained in the histories of Pamela, Clarissa, and Sir Charles Grandison (1755). The CMIS, as I’ll abbreviate it, consists of several hundred topics, each with multiple entries that consist of a short summary and a page reference. Here’s an example from the Clarissa section, thanks to ECCO.

Without going into the meaning or importance of these references, I want to focus on a practical problem: how could we extract every single citation of the eight-volume “Octavo Edition” of 1751? Our most basic data structure should be able to capture the volume and page numbers and associate them with the correct topic. While Richardson may very well have used some kind of card index, I can safely say that no subsequent reader or critic has bothered to count anything in the CMIS. But its very structure demands a database!

As a novice user of Python, I’m somewhat embarrassed to share the script I wrote to “scrape” the page numbers from an e-text of the CMIS (subscription required), helpfully prepared by the wonderful people and machines at the Text Creation Partnership (TCP). The TCP’s version was essential since the OCR-produced text (using ABBYY FineReader 8.0) at the Internet Archive is riddled with errors.

I started by cutting and pasting two things into text files in my Python directory: (1) the full contents of the Clarissa section of the CMIS and (2) a list of all 136 topics (from “Adversity. Affliction. Calamity. Misfortune.” to “Youth”) pulled from the TCP table of contents page.

import sys
import re
from collections import defaultdict
from rome import Roman

The first step is to import the modules we’ll need. “Sys” and “re” (regular expressions) are standard; defaultdict (from collections) is a super helpful way to set the default value for missing keys to 0 (or anything you choose) and avoid key errors; rome is a third-party package that converts Roman numerals to Arabic.
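To make the defaultdict payoff concrete, here’s a tiny sketch (not part of the script; the page number 406 is just a placeholder). The Roman-numeral note at the end is an assumption based on how the class gets used later on.

from collections import defaultdict

freq = defaultdict(lambda: 0)   # missing keys default to 0
freq[406] += 1                  # no need to initialize freq[406] first
print(freq[406])                # -> 1
print(freq[999])                # -> 0, rather than a KeyError

# Assumption: rome's Roman class can be cast to an integer,
# as the script does later with int(Roman('viii')) -> 8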

# Read in two files: (1) digitized 'Sentiments' (2) TOC of topics
f1 = open(sys.argv[1], 'r')
f2 = open(sys.argv[2], 'r')

# Create topics list, filtering out alphabetical headings
topics = [line.strip() for line in f2 if len(line) > 3]

# Dictionary for converting volumes into one series of pages
volume = {1:348, 2:355, 3:352, 4:385, 5:358, 6:431, 7:442, 8:399}
startPage = {1:0, 2:348, 3:703, 4:1055, 5:1440, 6:1798, 7:2229, 8:2671}

This section of code reads in the two files as ‘f1’ and ‘f2.’ I’ll grab the contents of f2 and write them to a list called ‘topics,’ doing a little cleanup on the way. Essentially, the list comprehension filters out the alphabetical headings like “A.” or “Z.” (since these lines are only a few characters long). Now I have a list of all 136 topics, which I can check against to see whether a line in my main file is a topic heading. You probably noticed that the references in the CMIS are formatted by volume and page. I’d like to get rid of the volume number and convert all citations to a ‘global’ page number. The first dictionary lists each volume and its total number of pages; the second contains the cumulative number of pages that precede each volume, i.e., the offset to add to any local page number. Thus, a page in the final volume gets 2,671 added to it.
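To make the conversion concrete, here’s a quick sketch (not part of the script) of how a citation to volume iii, page 44 becomes a single global page number:

# Sketch: converting a volume/page citation ('iii. 44') to a global page number
vol = 3                              # 'iii' converted to an integer
local_page = 44
global_page = startPage[vol] + local_page
print(global_page)                   # -> 747, since 703 pages precede volume 3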

counter = 0
match = ''

# Core dictionaries: (1) citations ranked by frequency and (2) sorted by location
frequency = defaultdict(lambda: 0)
location = {}

# Loop over datafile
for line in f1:
    if line.strip() in topics:
        match = line.strip()
        counter += 1
        location[match] = [counter, []]

OK, the hardest thing for me was making sure the extracted references got tossed in the right topic bin. So I initialized a counter that would increment each time the code hits a new topic. The blank string ‘match’ will keep track of the topic name. The loop goes through each line in the main file, f1. The first if statement checks whether the line (with whitespace stripped off) is present in the topics list. If it is, then counter and match are updated and a key with the topic name (e.g. “Youth.”) is created in the location dictionary. The values for this key will be a list: location[“Youth.”][0] equals 136, since this is the last topic.
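In other words, once the full loop (continued below) has run, each topic’s entry pairs its rank with its citations. A quick sketch, not output from the script:

# After the loop has run, each topic's entry pairs its rank with its page numbers
print(location["Youth."][0])    # -> 136, since "Youth." is the last topic
print(location["Youth."][1])    # -> the list of global page numbers cited under "Youth."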

    # Otherwise, look for a citation: a Roman volume numeral, then page numbers,
    # up to the opening bracket of the duodecimo reference
    elif re.search(r'[iv]+\..*(?=\[)', line):
        citation = re.search(r'[iv]+\..*(?=\[)', line)
        # Split on non-word characters and keep only Arabic or Roman numerals
        process = [x for x in re.split(r'\W', citation.group()) if re.match(r'(\d|[iv]+)', x)]
        current = ''
        for i in range(len(process)):
            if re.match(r'[iv]+', process[i]):
                current = process[i]
            else:
                #frequency[(int(Roman(current)), process[i])] += 1
                page = startPage[int(Roman(current))] + int(process[i])
                frequency[page] += 1
                location[match][1].append(page)

This is the heart of the code. The else-if statement deals with all lines that are NOT topic headings AND contain a match for the regular expression I have specified. Let’s break down the regex:

'[iv]+\..*(?=\[)'

Brackets mean disjunction: so either ‘i’ OR ‘v’ is what we’re looking for. The Kleene plus (‘+’) says we need at least one of the immediately previous pattern, i.e. the ‘[iv]’. Then we escape the period using a backslash, because we only need to get the Roman numerals up to eight (‘viii’) followed by a period. The second period is a special wildcard, and the Kleene star right after means we can have as many wildcard characters as we want up until the parentheses, which contain a lookahead assertion. The lookahead checks for a left bracket, also escaped since brackets are special characters (remember how the citations always include the duodecimo references in brackets). In English, then, the regex checks for some combination of i’s and v’s followed by a period that is followed, at some point, by a bracket.
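To see it in action, here is a quick sketch (separate from the script) that runs the pattern over a citation fragment like the one quoted in the next paragraph:

import re

fragment = "they may not be losers by their trials, ii. 58. 310. [149. iii. 44]."
m = re.search(r'[iv]+\..*(?=\[)', fragment)
print(m.group())   # -> 'ii. 58. 310. ' (the lookahead stops just before the bracket)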

The process variable runs through the string returned by the regex and splits out the Roman and Arabic numerals (splitting on any non-word character), appending them to a list. The string “People in Adversity should endeavour to preserve laud|able customs, that so, if sun-shine return, they may not be losers by their trials, ii. 58. 310. [149. iii. 44].” would be returned as “ii. 58. 310. ” by the regex (the lookahead stops just short of the bracket) and then turned into ['ii', '58', '310'] by process. Current is an empty string designed to hold the current Roman numeral so we know, for instance, which volume to match up page 310 with. In the final lines, the current Roman numeral is converted to its startPage offset and the page number is added to it. Then the frequency count for that specific page is incremented and the key for the current topic in the location dictionary is updated with the newly extracted page number.
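Here is the same sample walked through the remaining steps, as a sketch rather than a transcript of the script’s output. The little roman dictionary stands in for the rome package so the example is self-contained:

import re

citation_text = "ii. 58. 310. "   # what the regex returned for the sample sentence
process = [x for x in re.split(r'\W', citation_text) if re.match(r'(\d|[iv]+)', x)]
print(process)                    # -> ['ii', '58', '310']

startPage = {1: 0, 2: 348, 3: 703, 4: 1055, 5: 1440, 6: 1798, 7: 2229, 8: 2671}
roman = {'i': 1, 'ii': 2, 'iii': 3, 'iv': 4, 'v': 5, 'vi': 6, 'vii': 7, 'viii': 8}

pages, current = [], ''
for token in process:
    if re.match(r'[iv]+$', token):    # a volume numeral becomes the current volume
        current = token
    else:                             # a page number is shifted into the global sequence
        pages.append(startPage[roman[current]] + int(token))

print(pages)   # -> [406, 658], since volume ii begins after 348 global pages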

Obviously, this is a rather crude method. It’d be fun to optimize it (and I do need to fix it up so that it can deal with the handful of citations marked by ‘ibid.’), but scraping is supposed to be quick-and-dirty because it really only works with the specific document or webpage that you’re encountering. I doubt this code would do anything useful for other concordance-like texts in the TCP. But I would love to hear suggestions for how it could be better.

In a later post, I’ll talk about the problems I’ve faced in visualizing the data extracted from the CMIS.

One man’s trash . . . is another man’s archive

“The most difficult thing about collecting is discarding.”
– Albert Köster

photo by @BeineckeLibrary

photo by @jmhuculak

The photo above was taken outside Sterling Memorial Library at Yale University. Those long rectangular drawers you see are what’s left of that pre-digital archive known as the card catalog. The genealogy of this “universal paper machine” has been detailed by Markus Krajewski in his delightful book Paper Machines: About Cards & Catalogs, 1548–1929. Far from being the first form of reference technology, this system is only one in a long series of attempts to discover, store, and classify knowledge. Yet the transition from the painstakingly compiled paper archive to the extended technological networks which are replacing that archive is more than a simple change in office furniture. The dumpster’s contents signal a change far more dramatic than replacing an index card with a DOI, or swapping the cabinetry for a computer.

With a card catalog, the index card would convey a wide range of information to its reader. This exchange between reader and the material read was relatively unproblematic, unless of course the information on the card was written in an unfamiliar alphabet or language, or the reader lacked the basic literacy required to grasp it. Thanks to this information, our reader might have been able to find the location of the books in the library, some broad subject headings, and other bibliographical information. The reader would have acted on that information by either requesting the volume at the circulation desk or moving on to a different bibliographic record altogether.

In a similar vein, the Resource Description Framework (RDF) represents information about resources on the World Wide Web, but instead of using natural language on index cards to communicate meaning to our curious reader, it communicates that information in a machine-readable form. It represents similar metadata about Web resources, such as the title, author, and modification date of a Web page, in addition to any copyright or licensing information. Yet unlike the card catalog, RDF allows this information to be processed by applications, rather than being displayed to people. It provides a common framework for expressing this information, so it can be exchanged between applications without loss of meaning.
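For the curious, here is a minimal sketch of what that kind of catalog-card metadata might look like in code, using Python’s rdflib package; the package choice, the URL, and the values are illustrative assumptions rather than anything from an actual record:

from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DC

# Describe a hypothetical web page the way a catalog card might,
# but in a form other applications can parse and reuse
g = Graph()
page = URIRef("http://example.org/scraping-samuel-richardson")
g.add((page, DC.title, Literal("Scraping Samuel Richardson")))
g.add((page, DC.creator, Literal("A. Reader")))      # placeholder author
g.add((page, DC.date, Literal("2012-05-01")))        # placeholder date

print(g.serialize(format="turtle"))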

photo by @jmhuculak

The photo on the left provides a good example. The data stored in the card catalogs could be compared to an application running on a single machine. The only people who need to understand the meaning of a given variable such as “author” or “date of publication” are those who consult that card catalog directly. In the case of an application running on a single machine, those people would be the programmers reading the source code. But if we want the data contained in this card catalog to participate in a larger network, such as the World Wide Web, the meanings of the messages the applications exchange (“author,” “date of publication,” etc.) need to be explicit.

In fact, far too much of the data currently fueling web applications is prevented from being shared and integrated into other Internet applications. The compartmentalization of the card catalog has carried over into web applications, and data transmission becomes entangled in stovepipe systems, or “systems procured and developed to solve a specific problem, characterized by a limited focus and functionality, and containing data that cannot be easily shared with other systems” (DOE 1999). These applications, instead of allowing users to combine data in new ways to make powerful and compelling connections, risk becoming the digital equivalent of an abandoned card catalog in a dumpster.

As more and more digital humanists share and distribute their work via the World Wide Web, a working knowledge of programming for the semantic web becomes essential. Simple mechanisms such as RDF play a key role in transmitting semantic data between machines while allowing applications to combine data in new ways. Much like The Fantastic Flying Books of Mr. Morris Lessmore, RDF allows meaningful data transmission to rejoin the many applications hiding behind web interfaces. It transforms what might have been discarded into data-rich applications. It also enables digital humanists to join their work to a larger ocean in “the stream of stories.”

Different parts of the Ocean contained different sorts of stories, and as all the stories that had ever been told and many that were still in the process of being invented could be found here, the Ocean of the Streams of Story was in fact the biggest library in the universe. And because the stories were held here in fluid form, they retained the ability to change, to become new versions of themselves, to join up with other stories and so become yet other stories; so that unlike a library of books the Ocean of the Streams of Story was much more than a storeroom of yarns. It was not dead but alive.

Salman Rushdie, Haroun and the Sea of Stories

cross posted at HASTAC