Category Archives: Research and Teaching Tools

Scraping Samuel Richardson

It’s hard enough to read Samuel Richardson’s Pamela. It’s even harder to finish his later, longer epistolary novel: Clarissa, or, the History of a Young Lady (1748) [984,870 words]. Having toiled through both books, I was resting easy until confronted with a curious volume that I volunteered to present on in a graduate seminar on the 18C novel. The title? A collection of the moral and instructive sentiments, maxims, cautions, and reflexions, contained in the histories of Pamela, Clarissa, and Sir Charles Grandison (1755). The CMIS, as I’ll abbreviate it, consists of several hundred topics, each with multiple entries that consist of a short summary and a page reference. Here’s an example from the Clarissa section, thanks to ECCO.

Without going into the meaning or importance of these references, I want to focus on a practical problem: how could we extract every single citation of the eight-volume “Octavo Edition” of 1751? Our most basic data structure should be able to capture the volume and page numbers and associate them with the correct topic. While Richardson may very well have used some kind of card index, I can safely say that no subsequent reader or critic has bothered to count anything in the CMIS. But its very structure demands a database!

As a novice user of Python, I find it somewhat embarrassing to share the script I wrote to “scrape” the page numbers from an e-text of the CMIS (subscription required) helpfully prepared by the wonderful people and machines at the Text Creation Partnership (TCP). The TCP’s version was essential since the OCR-produced text (using ABBYY FineReader 8.0) at the Internet Archive is riddled with errors.

I started by cutting and pasting two things into text files in my Python directory: (1) the full contents of the Clarissa section of the CMIS and (2) a list of all 136 topics (from “Adversity. Affliction. Calamity. Misfortune.” to “Youth”) pulled from the TCP table of contents page.

import sys
import re
from collections import defaultdict
from rome import Roman

The first step is to import the modules we’ll need. “Sys” and “re” (regular expressions) are standard; defaultdict is a super helpful way to set the default value for any missing key to 0 (or anything you choose) and avoid key errors; rome is a third-party package that converts Roman numerals to Arabic.
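To make the point about key errors concrete, here is a minimal sketch of the difference between a plain dictionary and a default dictionary when counting. The page numbers are made up for illustration; they are not drawn from the CMIS.

```python
from collections import defaultdict

# With a plain dict, incrementing an unseen key raises KeyError,
# so the first occurrence has to be handled separately.
plain = {}
try:
    plain[310] += 1
except KeyError:
    plain[310] = 1

# defaultdict(int) supplies 0 for any missing key, so no guard is needed.
# (defaultdict(lambda: 0) behaves the same way.)
counts = defaultdict(int)
for page in [310, 58, 310]:  # hypothetical page numbers
    counts[page] += 1

print(dict(counts))
```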

# Read in two files: (1) digitized 'Sentiments' (2) TOC of topics
f1 = open(sys.argv[1], 'r')
f2 = open(sys.argv[2], 'r')

# Create topics list, filtering out alphabetical headings
topics = [line.strip() for line in f2 if len(line) > 3]

# Dictionary for converting volumes into one series of pages
volume = {1:348, 2:355, 3:352, 4:385, 5:358, 6:431, 7:442, 8:399}
startPage = {1:0, 2:348, 3:703, 4:1055, 5:1440, 6:1798, 7:2229, 8:2671}

This section of code reads in the two files as ‘f1’ and ‘f2.’ I’ll grab the contents of f2 and write them to a list called ‘topics,’ doing a little cleanup on the way. Essentially, the list comprehension filters out the alphabetical headings like “A.” or “Z.” (since these are only two characters long, or three counting the trailing newline). Now I have an array of all 136 topics which I can loop over to check if a line in my main file is a topic heading. You probably noticed that the references in CMIS were formatted by volume and page. I’d like to get rid of the volume number and convert all citations to a ‘global’ page number. The first dictionary lists each volume and its total number of pages; the second contains the number of pages that precede any given volume. Thus, the final volume begins after page 2,671.
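The second dictionary does not have to be typed by hand, since it is just a running total of the first. As a rough sketch (the names start_page and global_page are mine, not part of the original script), the offsets can be derived and applied like this:

```python
from itertools import accumulate

# Pages per volume, copied from the 'volume' dictionary in the script
volume = {1: 348, 2: 355, 3: 352, 4: 385, 5: 358, 6: 431, 7: 442, 8: 399}

# Running total of the pages that come *before* each volume
vols = sorted(volume)
lengths = [volume[v] for v in vols]
offsets = [0] + list(accumulate(lengths))[:-1]
start_page = dict(zip(vols, offsets))

def global_page(vol, page):
    """Convert a (volume, page) citation into a single 'global' page number."""
    return start_page[vol] + page

print(start_page[8])       # 2671
print(global_page(2, 58))  # 406
```

The derived offsets agree with the hand-entered startPage values, which is a handy check that no volume length was mistyped.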

counter = 0
match = ''

# Core dictionaries: (1) citations ranked by frequency and (2) sorted by location
frequency = defaultdict(lambda: 0)
location = {}

# Loop over datafile
for line in f1:
    if line.strip() in topics:
        match = line.strip()
        counter += 1
        location[match] = [counter, []]

OK, the hardest thing for me was making sure the extracted references got tossed in the right topic bin. So I initialized a counter that would increment each time the code hits a new topic. The blank string ‘match’ will keep track of the topic name. The loop goes through each line in the main file, f1. The first if statement checks if the line (with white space stripped off) is present in the topics list. If it is, then counter and match update and a key with the topic name (e.g. “Youth.”) is created in the location dictionary. The value for this key will be a list: location[“Youth.”][0] equals 136, since this is the last topic, and location[“Youth.”][1] starts out as an empty list that will collect the page numbers.

    elif re.search(r'[iv]+\..*(?=\[)', line):
        # Grab everything from the first Roman numeral up to (not including) the bracket
        citation = re.search(r'[iv]+\..*(?=\[)', line)
        # Split on non-word characters; keep only Roman or Arabic numerals
        process = [x for x in re.split(r'\W', citation.group()) if re.match(r'(\d|[iv]+)', x)]
        current = ''
        for i in range(len(process)):
            if re.match(r'[iv]+', process[i]):
                # A Roman numeral sets the current volume
                current = process[i]
            else:
                #frequency[(int(Roman(current)), process[i])] += 1
                page = startPage[int(Roman(current))] + int(process[i])
                frequency[page] += 1
                location[match][1].append(page)

This is the heart of the code. The else-if statement deals with all lines that are NOT topic headings AND contain the regular expression I have specified. Let’s break down the regex:

'[iv]+\..*(?=\[)'

Brackets mean disjunction: so either ‘i’ OR ‘v’ is what we’re looking for. The Kleene plus (‘+’) says we need to have at least one of the immediately previous pattern, i.e. the ‘[iv]’. Then we escape the period using a backslash, because we only need to get the Roman numerals up to eight (‘viii’) followed by a period. The second period is a special wildcard and the Kleene star right after means we can have as many wildcards as we want up until the parentheses, which contain a lookahead assertion. The lookahead checks for a left bracket (remember how the citations always include the duodecimo references in brackets). In English, then, the regex checks for some combination of i’s and v’s followed by a period that is followed, at some point, by a bracket.

The process variable takes the string returned by the regex and splits it on non-word characters (spaces and punctuation), keeping only the Roman and Arabic numerals in a list. The string “People in Adversity should endeavour to preserve laud|able customs, that so, if sun-shine return, they may not be losers by their trials, ii. 58. 310. [149. iii. 44].” would be returned as “ii. 58. 310. ” by the regex (the lookahead checks for the bracket without including it in the match) and then turned into [ii, 58, 310] by process. Current is an empty string designed to hold the current Roman numeral so we know, for instance, which volume to match up page 310 with. In the final lines, the current Roman numeral is converted to its startPage number and the page number is added to it. Then the frequency dictionary for that specific page is incremented and the key for the current topic in the location dictionary is updated with the newly extracted page number.
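Here is the same example worked through as a standalone snippet. It is only a sketch: to keep it free of third-party packages, a small lookup table (ROMAN, my own stand-in) replaces the rome package, and the variable names differ slightly from the script.

```python
import re

line = ("People in Adversity should endeavour to preserve laud|able customs, "
        "that so, if sun-shine return, they may not be losers by their "
        "trials, ii. 58. 310. [149. iii. 44].")

# Offsets copied from the startPage dictionary in the script
start_page = {1: 0, 2: 348, 3: 703, 4: 1055, 5: 1440, 6: 1798, 7: 2229, 8: 2671}
# Stand-in for the rome package: only volumes i through viii are needed
ROMAN = {'i': 1, 'ii': 2, 'iii': 3, 'iv': 4, 'v': 5, 'vi': 6, 'vii': 7, 'viii': 8}

# Roman numeral + period, then anything up to (not including) the left bracket
citation = re.search(r'[iv]+\..*(?=\[)', line)

# Split on non-word characters and drop the empty strings left behind
tokens = [t for t in re.split(r'\W', citation.group())
          if re.match(r'(\d|[iv]+)', t)]

pages = []
current = None
for tok in tokens:
    if tok in ROMAN:
        current = tok  # remember which volume we are in
    else:
        pages.append(start_page[ROMAN[current]] + int(tok))

print(repr(citation.group()))  # 'ii. 58. 310. '
print(tokens)                  # ['ii', '58', '310']
print(pages)                   # [406, 658]
```

Note that the duodecimo references inside the brackets never make it into the match, which is exactly what the lookahead is for.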

Obviously, this is a rather crude method. It’d be fun to optimize it (and I do need to fix it up so that it can deal with the handful of citations marked by ‘ibid.’), but scraping is supposed to be quick-and-dirty because it really only works with the specific document or webpage that you’re encountering. I doubt this code would do anything useful for other concordance-like texts in the TCP. But I would love to hear suggestions for how it could be better.

In a later post, I’ll talk about the problems I’ve faced in visualizing the data extracted from the CMIS.

Follow the Money

This Wednesday, the Institute on Assets and Social Policy at Brandeis University released a new study showing that the wealth gap between white and black households has nearly tripled over the past 25 years. From 1984 to 2009, the median net worth of white families rose to $265,000, while that of black families remained at just $28,500. This widening disparity is not due to individual choices, the authors discovered, but to the cumulative effect of “historical wealth advantages” as well as past and ongoing discrimination. It does not take a rocket scientist to realize that wealth generates more wealth and that centuries of unpaid labor – from chattel slavery to the chain gang – have given white families a greater reserve of inherited equity.

The very same day, 3,000 miles to the east, a team of researchers at University College London launched a major new database entitled Legacies of British Slave-ownership. At its heart is an encyclopedia “containing information about every slave-owner in the British Caribbean, Mauritius or the Cape at the moment of abolition in 1833.” Not only this, the database includes information about how much individual slaveholders received as compensation for their human property and hints as to what they did with their money. The results illustrate the tremendous significance of slave-generated wealth for the British economic and political elite. The families of former Prime Minister William Gladstone and current Prime Minister David Cameron, for example, were direct beneficiaries. At the same time, the site makes it possible to trace many of the smaller-scale slaveholders scattered throughout the empire and to speculate about the impact of all that capital accumulation. Although still in its early stages, the site promises to be an outstanding resource for digital research and teaching.

In part because it is so new, the level of detail in the database can be uneven. Some individuals have elaborate biographies and reams of supporting material. Others have an outline sketch or a placeholder. To help correct this, the authors welcome new information from the public. All of the biographies must have taken a tremendous amount of time and effort to compile, and all claims are meticulously documented with links to both traditional and online sources. While there are few images and maps at this stage, the site features an excellent short essay that helps to place the project and its raw data in historical context. The focus is almost entirely on metropolitan Britain, and there is good reason for this. Nearly half of the £20 million paid to former slaveholders went directly to absentee planters residing in the metropole. Still, it might be useful to place this information in wider perspective.

A significant number of nineteenth-century emancipations involved some sort of compensation to erstwhile slaveholders or their agents. Throughout the Atlantic World, abolitionists occasionally raised funds to liberate individual slaves. This was how celebrity authors, such as Frederick Douglass, Harriet Jacobs, and Juan Francisco Manzano, acquired their free papers. In some cases, enslaved families were required to pay slaveholders directly. Under Connecticut’s gradual emancipation law, for example, male slaves born after a certain date were mandated to work for free until their 25th birthday (unless, of course, their enslavers attempted to smuggle them to the South beforehand). Even Haiti, which successfully abolished slavery while fighting off multiple European invasions, was extorted into a massive reparations payment to its former colonial masters, helping to generate a cycle of debt and poverty that continues to this day.

The United States Civil War is somewhat unique in this regard. Although slaveholders in Washington D.C. received government compensation when the District eliminated slavery in 1862, thanks to the logic of the war, the actions of abolitionists, and above all the determination of the enslaved, rebel slaveholders received little in exchange for the loss of their human property. According to recent estimates, that property was among the most valuable investments in the nation. By 1860, the aggregate value of all slaves was in the neighborhood of $10 trillion (in 2011 dollars), or 70% of current GDP. The sudden loss of this wealth represents what is very likely the most radical and widespread seizure of private capital until the Russian Revolution of 1917. But even in this case, emancipated slaves were left to fend for themselves, their pleas for land largely unanswered.

Although there have been a number of successful attempts to trace the influence of slavery within American institutions, especially universities and financial firms, the haphazard and piecemeal nature of emancipation left no comprehensive record. And this is what makes the compensation windfall included in the British Abolition Act of 1833 so fascinating. By scouring government records, researchers have been able to construct a fairly accurate picture of slavery beneficiaries and to trace their influence across a range of activities – commercial, cultural, historical, imperial, physical, and political. A cursory glance at the data reveals 222 politicians and 459 commercial firms among the recipients. A targeted search for railway investments yields over 500 individual entries totaling hundreds of thousands of pounds. According to the database, over 150 history books and pamphlets were made possible, at least in part, by slavery profits. That a sizable chunk of nineteenth-century historiography, as well as its modern heirs, owes its existence to the blood, sweat, and tears of millions of slaves is extremely consequential. And this fact alone deserves careful attention by every practicing historian.

Slaveholder compensation, which equals about £16.5 billion or $25 billion in present terms, was seen as a necessary measure for social stability. The British planter class was deemed, in short, too big to fail. The funds, as Nicholas Draper explains, were provided by a government loan. And it is worth noting that this loan was paid in large part by sugar duties – protectionist tariffs that drove up the price of imported goods. Since the poorest Britons relied on the cheap calories provided by sugar, they bore a disproportionate share of the cost. Meanwhile, former slaves were coerced into an “apprenticeship” system for a limited number of years, during which they would provide additional free labor for their erstwhile owners. So the wealth generated by this event, if you’ll pardon the dry economic jargon, was concentrated and regressive, taking from the poor and the enslaved and giving to the rich.

As its authors point out, the encyclopedia of British slaveholders carries interesting implications for the reparations debate. Although it does not dwell on this aspect, the site also carries significance for the ongoing historical debate about the relationship between capitalism and slavery. Recent work by Dale Tomich, Anthony Kaye, and Sven Beckert and Seth Rockman has placed nineteenth-century slavery squarely at the center of modern capitalism. While historians may quibble about the specifics, it is clear that the profits of slavery fueled large swaths of what we now call the Industrial Revolution and helped propel Great Britain and the United States into the forefront of global economic development. The database makes it possible to glimpse the full extent of that impact, really, for the first time.

Legacies of British Slave-ownership is refreshingly honest about the limitations of its data. Unlike most digital history projects of which I am aware, the authors have engaged their critics directly. One critique is that the project team is white and focused largely on the identities of white slaveholders. Yet, as the authors point out, it is difficult to relate the experience of the enslaved in a vacuum, hermetically sealed and separate from the actions and reactions of their oppressors. If I have learned anything from my study of the subject, it is that it is impossible to understand the history of slavery apart from the history of abolition, and it is impossible to understand the history of abolition apart from the history of slavery. The two are fundamentally intertwined.

So what about the other side to this story? What about all the slaves and abolitionists who called for immediate, uncompensated emancipation? What about the alternative visions they called into being through their actions and their imaginations? What about the different models they offered, however flawed or fleeting, for a world without slaveholders?

Writing to his “Old Master” in the summer of 1865, in one of the great masterworks of world literature, Jordan Anderson gave his thoughts on the matter:

I served you faithfully for thirty-two years, and Mandy twenty years. At twenty-five dollars a month for me, and two dollars a week for Mandy, our earnings would amount to eleven thousand six hundred and eighty dollars. Add to this the interest for the time our wages have been kept back, and deduct what you paid for our clothing, and three doctor’s visits to me, and pulling a tooth for Mandy, and the balance will show what we are in justice entitled to. Please send the money by Adams’s Express, in care of V. Winters, Esq., Dayton, Ohio.

Anderson’s descendants, in Ohio and elsewhere, are still waiting.

My Runaway Class

Over a decade ago, the world began to hear about the “digital native” – a new breed of young person reared on computers for whom Google, Wikipedia, Facebook, and Twitter are second nature. Digital natives thrive in an online universe where knowledge is democratized, authority is decentralized, and media is everywhere. And they are most comfortable in an environment that is fast-paced, interactive, and immediate. It reminds me of a line from Hedwig and the Angry Inch:

all our feelings and thoughts
expressed in ones and in oughts
in endless spiraling chains
you can’t decode or explain
cause you are so analog

There is a large and growing body of excellent material on the use of technology to engage digital natives in the classroom. But one thing I have learned over the past few years is that a student who is very comfortable with digital technology is not necessarily digitally literate. A student can spend twelve hours a day online but still not know how to run a sophisticated Google search or post a video, not to mention build a website or script an algorithm. A student who knows how to update her Facebook status does not necessarily know how to navigate the back end of a blog or find an article on JSTOR.

This does not mean that the high-tech classroom is a misguided endeavor – exactly the opposite. It means that educators have to work especially hard to guide students through the digital realm. We have an obligation to teach digital literacy. And since the best way to learn is by doing, I’ve been experimenting with new technologies for a while. I’d like to share the results of some recent tinkering. This is the story of my runaway class.

Last year I taught a course entitled “Slavery and Freedom in Early America.” The course is designed to be both chronological and accumulative. Beginning with Pre-Columbian slavery, it dwells on the wide spectrum of captivity and servitude under colonialism, the transition to African chattel slavery, the rise of antislavery movements, and revolutionary politics. It ends in 1830 with the third edition of David Walker’s Appeal…to the Coloured Citizens of the World. It is not so much a supplement to the traditional early American survey as an attempt to re-narrate the entire period from a substantially different perspective. Each week students are exposed to original documents coupled with the work of a professional historian. And each reading highlights different themes and interpretive strategies. The goal is to be able to marshal these different modes of interpretation to build a multifaceted view of a particular topic, culminating in a final research project.

Drawing on various active learning techniques, I attempted to make the course as dynamic as possible. We had a group blog for weekly reading responses, research prospectuses, and commentary. The blog also served as a centralized space for announcements, follow-ups, and detailed instructions for assignments (at the end of the semester I used the Anthologize plugin for WordPress to compile the entire course proceedings in book form). There were a plethora of digital images and videos, student presentations, peer instruction, and peer editing. We had a really fun, if somewhat chaotic, writing workshop speed date. We used Skype to video conference with the author of one of the required textbooks. We dug through various digital databases and related sites. We even grappled with present-day slavery through Slavery Footprint (an abolitionist social network not unlike the Quaker networks of the eighteenth century). Almost every week I asked the class about their definitions of slavery, and it was fascinating to see how they changed over time. Things really got interesting one day when I surprised them by asking them to define “freedom.” Their answers gave me a lot to think about long after the course had ended. I’ve posted the full syllabus here.

Aware of all of the discussions brewing around digital pedagogy, I gave special attention to the role of technology in the classroom. This culminated in an activity where students used their database skills to find runaway ads in colonial newspapers. Runaway wives, runaway servants, runaway children, runaway slaves – it was all fair game. I was more than a little nervous about giving the students such free rein. But the results were spectacular. The ads they unearthed were wide-ranging and rich, and no two students focused on the same thing. The sheer diversity of the material reminded me of Cathy Davidson’s musings on the brain science of attention. There is much benefit, Davidson argues, in harnessing myriad perspectives on a single topic. It is, in essence, a controlled form of crowdsourcing. Edward Ayers, the doyen of digital history, calls it “generative scholarship.”

One student found an ad for an escaped slave named Romeo, “about twenty-four years old, five feet six inches high, and well proportioned; his complexion a little of the yellowish cast.” Romeo was literate and “exercised his talents in giving passes and certificates of freedom to run-away slaves.” He ran off with a woman from a different county, “a small black girl named Juliet.” Another student found a convict with “a great many Letters and Figures on his Breast and Left Arm, some in red and some in black.” He was imprisoned in England, shipped to Virginia as a bond slave, escaped, traveled back to London, was recaptured, convicted, sent back to Virginia, and escaped again. Some students found notices of hapless travelers who had been captured and deposited in prison on suspicion of being a runaway, such as Thomas Perry, a Welshman, who could provide “no certificate of his freedom.” I also shared one of my personal favorites, a servant who eloped with his master’s wife on a pair of horses.

The students posted their ads to the course blog, and when they arrived for the following class I divided them into small groups. After some preliminary remarks, I asked them to choose an ad among the ones they had found and to write that person’s biography. This was an experiment in generative scholarship, not unlike Visualizing Emancipation or the super-neat History Harvests at the University of Nebraska. But my class was much more narrowly defined in time and scope. The students had to use their wits, their laptops, and all of the contextual information they had accrued from the readings and discussions in previous weeks. They had to build a plausible narrative for their runaway on demand, with no warning, no excuses, and no template. I circulated among the groups to monitor progress and occasionally offered questions or assistance.

The questions we asked were the typical ones employed by historians. What can you find out about Romeo and Juliet’s purported owners? What does the date tell you? What was going on in that location at that time? How many women ran away from their husbands in New York City in 1757? Was it unusual for servants to escape in groups of three or more? Did the time of year matter? How does the price offered for one runaway compare to others? What can you learn from their detailed physical descriptions? What about their profession? What about the lists of items they took with them on their journey? Is this information reliable? What governed decisions to escape or to stay? What, if anything, does this tell you about the relationship between petit marronage and grand marronage? How does this information comport with what we know about slavery in a particular place and time?

It’s shocking how much information you can glean about a person’s life after just a few minutes online, even persons who have been dead and gone for hundreds of years. The various newspaper databases – Readex, Accessible Archives, Proquest – and specialized projects, such as The Geography of Slavery in Virginia, proved invaluable. I directed students to the large collection of external databases featured on the Slavery Portal. Genealogy sites and historical map collections also came in handy. One student discovered that his subject had escaped from the same slaveholder multiple times at different points in his life. Using the Trans-Atlantic Slave Trade Database, we were able to locate the name of the ship that had carried an individual and their likely point of origin in Africa.

Students from different groups helped each other, which created a nice collaborative atmosphere. Sometimes there were dead ends, a common name or a paucity of leads. But even then the student could surmise, could use her imagination based on what she already knew about a particular time and place. And this was one of the goals of the exercise – to expose the central role of the imagination in historical practice. At the end of class, we shared what we had discovered and were able (briefly) to engage some big sociological questions about the lives and labors of colonial runaways. When I polled the students at the end of the semester about the most memorable moments of the course, the runaway class was their favorite by a wide margin. The final evaluations were among the best I have ever received.

There are aspects of this crowdsourcing experiment that I regret. I had hoped at least some students would take inspiration from the material for their final projects, and I’m sure some of the lessons from that day improved their papers. But because I scheduled the runaway class late in the semester, the students were reluctant, I think, to radically revise their project proposals. Of course, if I had run the class too early in the semester, the students would not have had the necessary background to make educated inferences about their subject. There were other snags. Because most students were not familiar or comfortable with the vast range of digital research tools out there, I had to do some hand-holding and gentle nudging. It was clear that my students needed more experience finding, using, and interpreting large online databases, not to mention Google Books, Wikipedia, Zotero, and other tools historians use every day. It might even make sense to run in-class tutorials on what researchers can do with a database like Colonial State Papers, Fold3, or Visualizing Emancipation. A large part of being an historian is just knowing what source materials are out there and how to turn them to your advantage.

I also regret not taking more detailed notes. In part because everything moved so fast, I was left without a finalized version of the students’ many fascinating discoveries. There was a lot of research and sharing going on, but not a lot of synthesis and reflection. I suppose asking the students to follow through and actually write their speculative biographies would help. Maybe that would be a good midterm assignment? If I ran this course for years, I could easily see building a massive online database of runaways and their worlds, on a national or even international scale.

In the end, the runaway class was an object lesson in the raw energy and potential of digital history. It was interactive, immediate, and exciting. I would be interested to know if anyone has run a similar experiment or has suggestions for different ways to liven up the classroom.

Cross-posted at HASTAC

One man’s trash . . . is another man’s archive

“The most difficult thing about collecting is discarding.”
– Albert Köster

photo by @BeineckeLibrary

photo by @jmhuculak

The photo above was taken outside Sterling Memorial Library at Yale University. Those long rectangular drawers you see are what’s left of that pre-digital archive known as the card catalog. The genealogy of this “universal paper machine” has been detailed by Markus Krajewski in his delightful book Paper Machines. About Cards & Catalogs, 1548-1929. Far from being the first form of reference technology, this system is only one in a long series of attempts to discover, store and classify knowledge. Yet the transition from the painstakingly compiled paper archive to the extended technological networks which are replacing that archive is more than a simple change in office furniture. The dumpster’s contents signal a change far more dramatic than replacing an index card with a doi, or swapping the cabinetry for a computer.

With a card catalog, the index card would signal a wide range of information to its reader. This exchange between reader and the material read was relatively unproblematic, unless of course the information on the card was written in an unfamiliar alphabet or language, or the reader lacked the basic literacy required to grasp it. Thanks to this information, our reader might have been able to find the location of the books in the library, some broad subject headings, and other bibliographical information. The reader would have acted on that information by either requesting the volume at the circulation desk, or moving on to a different bibliographic record altogether.

In a similar vein, the Resource Description Framework represents information about resources in the World Wide Web, but instead of using natural language on index cards to communicate sufficient meaning to our curious reader, it communicates that information in a machine-readable form. It represents similar metadata about Web resources, such as the title, author, and modification date of a Web page, in addition to any copyright or licensing information. Yet unlike the card catalog, RDF allows this information to be processed by applications, rather than being displayed to people. It provides a common framework for expressing this information, so it can be exchanged between applications without loss of meaning.
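For readers who have never seen RDF, a rough sketch may help. RDF describes a resource as a set of subject-predicate-object statements, or “triples.” The snippet below builds three such statements in plain Python and prints them in the N-Triples text format. The Dublin Core vocabulary (purl.org/dc/terms) is real and widely used; the example.org address and the modification date are invented for illustration, and real projects would normally use a dedicated RDF library rather than formatting strings by hand.

```python
# Dublin Core terms namespace (a real vocabulary for title, creator, etc.)
DCT = "http://purl.org/dc/terms/"

# Hypothetical address standing in for a catalog record on the Web
record = "http://example.org/catalog/paper-machines"

# Each statement is (subject, predicate, object), like one line on an index card
triples = [
    (record, DCT + "title", "Paper Machines"),
    (record, DCT + "creator", "Markus Krajewski"),
    (record, DCT + "modified", "2013-03-01"),  # invented date
]

def to_ntriples(statements):
    """Serialize (subject, predicate, literal) statements as N-Triples lines."""
    lines = []
    for subject, predicate, literal in statements:
        # Backslashes and double quotes must be escaped inside a literal
        escaped = literal.replace("\\", "\\\\").replace('"', '\\"')
        lines.append('<%s> <%s> "%s" .' % (subject, predicate, escaped))
    return "\n".join(lines)

print(to_ntriples(triples))
```

Because the predicate is a full address rather than a bare word like “author,” any application that knows the Dublin Core vocabulary can interpret the statement without consulting whoever wrote the card.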

photo by @jmhuculak

The photo on the left provides a good example. The data stored in the card catalogs could be compared to an application running on a single machine. The only people needing to understand the meaning of a given variable such as “author” or “date of publication” are those who consulted that card catalog directly. In the case of an application running on a single machine, those people would be the programmers reading the source code. But if we want the data contained in this card catalog to participate in a larger network, such as the world wide web, the meanings of the messages the applications exchange, “author”, “date of publication,” etc. need to be explicit.

In fact, currently far too much of the data fueling web applications is prevented from being shared and integrated into other Internet applications. The compartmentalization of the card catalog has carried over into web applications and data transmission becomes entangled in stovepipe systems, or “systems procured and developed to solve a specific problem, characterized by a limited focus and functionality, and containing data that cannot be easily shared with other systems.” (DOE 1999) These applications, instead of allowing users to combine data in new ways to make powerful and compelling connections, risk becoming the digital equivalent of an abandoned card catalog in a dumpster.

As more and more digital humanists share and distribute their work via the World Wide Web, a working knowledge of the importance of programming for the semantic web becomes essential. Simple mechanisms such as RDF play a key role in transmitting semantic data between machines while allowing applications to combine data in new ways. Much like The Fantastic Flying Books of Mr. Morris Lessmore, RDF allows meaningful data transmission to rejoin the many applications hiding behind web interfaces. It transforms what might have been discarded into data-rich applications. It also enables digital humanists to join their work to a larger ocean in “the stream of stories.”

Different parts of the Ocean contained different sorts of stories, and as all the stories that had ever been told and many that were still in the process of being invented could be found here, the Ocean of the Streams of Story was in fact the biggest library in the universe. And because the stories were held here in fluid form, they retained the ability to change, to become new versions of themselves, to join up with other stories and so become yet other stories; so that unlike a library of books the Ocean of the Streams of Story was much more than a storeroom of yarns. It was not dead but alive.

Salman Rushdie, Haroun and the Sea of Stories

Cross-posted at HASTAC

Ahead in the Clouds

The Chronicle published a lengthy review article last week on the science of brain mapping. The article focuses on Ken Hayworth, a researcher at Harvard who specializes in the study of neural networks (called connectomes). Hayworth believes, among other things, that we will one day be able to upload and replicate an individual human consciousness on a computer. It sounds like a great film plot. Certainly, it speaks to our ever-evolving obsession with our own mortality. Whatever the value of Hayworth’s prediction, many of us are already storing our consciousness on our computers. We take notes, download source material, write drafts, save bookmarks, edit content, post blogs and tweets and status updates. No doubt the amount of our intellectual life that unfolds in front of a screen varies greatly from person to person. But there are probably not too many modern writers like David McCullough, who spends most of his time clacking away on an antique typewriter in his backyard shed.

Although I still wade through stacks of papers and books and handwritten notes, the vast majority of my academic work lives on my computer, and that can be a scary prospect. I have heard horror stories of researchers who lose years of diligent work in the blink of an eye. I use Carbon Copy Cloner to mirror all of my data to an external hard drive next to my desk. Others might prefer Time Machine (for Macs) or Backup and Restore (for Windows). But what if I lose both my computer and my backup? Enter the wide world of cloud storage. Although it may be some time before we can back up our entire neural net on the cloud, it is now fairly easy to mirror the complicated webs of source material, notes, and drafts that live on our computers. Services like Dropbox, Google Drive, SpiderOak, and SugarSync offer between 2 and 5 GB of free space and various options for syncing local files to the cloud and across multiple computers and mobile devices. Most include the ability to share and collaborate on documents, which can be useful in classroom and research environments.

These free services work great for everyday purposes, but longer research projects require more space and organizational sophistication. The collection of over 10,000 manuscript letters at the heart of my dissertation, which I spent three years digitizing, organizing, categorizing, and annotating, consumes about 30 GB. Not to mention the reams of digital photos, pdfs, and tiffs spread across dozens of project folders. It is not uncommon these days to pop into a library or an archive and snap several gigs of photos in a few hours. Whether this kind of speed-research is a boon or a curse is subject to debate. In any event, although they impose certain limits, ADrive, MediaFire, and Box (under a special promotion) offer 50 GB of free space in the cloud. Symform offers up to 200 GB if you contribute to their peer-to-peer network, but their interface is not ideal and when I gave the program a test drive it ate up almost 90% of my bandwidth. If you are willing to pay an ongoing monthly fee, there are countless options, including JustCloud’s unlimited backup. I decided to take advantage of the Box deal to back up my various research projects, and since the process was far from straightforward, I thought I would share my solution with the world (or add it to the universal hive mind).

Below are the steps I used to hack together a free, cloud-synced backup of my research. Although this process is designed to sync academic work, it could be modified to mirror other material or even your entire operating system (more or less). While these instructions are aimed at Mac users, the general principles should remain the same across platforms. I can make no promises regarding the security or longevity of material stored in the cloud. Although most services tout 256-bit SSL encryption, vulnerabilities are inevitable and the ephemeral nature of the online market makes it difficult to predict how long you will have access to your files. The proprietary structure of the cloud and government policing efforts are critical issues that deserve more attention. Finally, I want to reiterate that this process is for those looking to back up a fairly large amount of material. For backups under 5 GB, it is far easier to use one of the free syncing services mentioned above.

Step 1: Sign up for Box (or another service that offers more than a few GB of cloud storage). I took advantage of a limited-time promotion for Android users and scored 50 GB of free space.

Step 2: Make sure you can WebDAV into your account. From the Mac Finder, click Go –> Connect to Server (or hit command-k). Enter “https://www.box.com/dav” as the server address. When prompted, enter the e-mail address and password that you chose when you set up your Box account. Your root directory should mount on the desktop as a network drive. Not all services offer WebDAV access, so your mileage may vary.

Step 3: Install Transmit (or a similar client that allows synced uploads). The full version costs $34, which may be worth it if you decide you want to continue using this method. Create a favorite for your account and make sure it works. The protocol should be WebDAV HTTPS (port 443), the server should be www.box.com, and the remote path should be /dav. Since Box imposes a 100 MB limit for a single file, I also created a rule that excludes all files larger than 100 MB. Click Transmit –> Preferences –> Rules to establish what files to skip. Since only a few of my research documents exceeded 100 MB, I was fine depositing these with another free cloud server. I realize not everyone will be comfortable with this.
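Before relying on the size rule, it can help to know which files it will skip. The following is a rough Python sketch, independent of Transmit, that lists every file in a folder above a size limit. The folder path is hypothetical, and the limit assumes "100 MB" means 100 × 1024 × 1024 bytes, which may not match how the service counts.

```python
import os

# Assumption: 100 MB is counted as 100 * 1024 * 1024 bytes.
LIMIT_BYTES = 100 * 1024 * 1024


def oversized_files(root, limit=LIMIT_BYTES):
    """Yield (path, size_in_bytes) for each regular file under root larger than limit."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # skip symbolic links, which may be broken
            size = os.path.getsize(path)
            if size > limit:
                yield path, size


if __name__ == "__main__":
    research = os.path.expanduser("~/Research")  # hypothetical folder
    for path, size in oversized_files(research):
        print(f"{size / (1024 * 1024):8.1f} MB  {path}")
```

Anything this prints would need a home elsewhere, as described above.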

Step 4: Launch Automator and compile a script to run an upload through Transmit. Select “iCal Alarm” as your template and find the Transmit actions. Select the action named “Synchronize” and drag it to the right. You should now be able to enter your upload parameters. Select the favorite you created in Step 3 and add any rules that are necessary. Select “delete orphaned destination items” to ensure an accurate mirror of your local file structure, but make sure the Local Path and the Remote Path point to the same place. Otherwise, the script will overwrite the remote folder to match the local folder and create a mess. I also recommend disabling the option to “determine server time offset automatically.”

Step 5: Save your alarm. This will generate a new event in iCal, in your Automator calendar (if you don’t have a calendar for automated tasks, the system should create one for you). Double-click the event to modify the timing. Set repeat to “every day” and adjust the alarm time to something innocuous, like 4am. Click “Done” and you should be all set.

Automator will launch Transmit every day at your appointed time and run a synchronization on the folder containing your research. The first time it runs, it should replicate the entire structure and contents of your folder. On subsequent occasions, it should only update those files that have been modified since the last sync. There is a lot that can go wrong with this particular workflow, and I did not include every contingency here, so please feel free to chime in if you think I’ve left out something important.
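The logic of a one-way mirror like this can be sketched in a few lines of Python. This is not Transmit's actual algorithm, only an illustration of the behavior described above: upload what is new or newer, and optionally delete remote "orphans" that no longer exist locally. The file names and timestamps are invented.

```python
def plan_sync(local, remote, delete_orphans=True):
    """Compare two {relative_path: modification_time} maps.

    Returns (to_upload, to_delete), each a sorted list of paths:
    - to_upload: files missing remotely or modified more recently locally
    - to_delete: remote files with no local counterpart (orphans)
    """
    to_upload = sorted(
        path for path, mtime in local.items()
        if path not in remote or mtime > remote[path]
    )
    to_delete = sorted(p for p in remote if p not in local) if delete_orphans else []
    return to_upload, to_delete


# Invented example: timestamps are simple integers for readability.
local = {"notes/ch1.txt": 200, "letters/001.tif": 100, "draft.doc": 300}
remote = {"notes/ch1.txt": 150, "letters/001.tif": 100, "old.txt": 50}

to_upload, to_delete = plan_sync(local, remote)
print("upload:", to_upload)
print("delete:", to_delete)
```

The same sketch shows why mismatched paths are dangerous: if `remote` describes the wrong folder, every file in it counts as an orphan and lands in `to_delete`.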

If, like me, you are a Unix nerd at heart, you can write a shell script to replicate most of this using something like cadaver or mount_webdav, rsync, and cron. I might post some more technical instructions later, but I thought I should start out with basic point-and-click. If you have any comments or suggestions – other cloud servers, different process, different outcomes – please feel free to share them.
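As a rough idea of what the command-line version might look like, the Python sketch below only builds an rsync command and a matching crontab line; it does not run anything. It assumes the WebDAV share is already mounted as a local volume, and both paths are hypothetical. The flags are standard rsync options, but how well they behave on a WebDAV mount is something to test with a dry run first.

```python
import shlex


def rsync_command(source, dest, max_size="100m", dry_run=True):
    """Build an rsync argument list that mirrors source into dest."""
    cmd = [
        "rsync",
        "-rt",                     # recurse and preserve modification times
        "--delete",                # remove destination files missing from source
        f"--max-size={max_size}",  # skip files above the per-file limit
    ]
    if dry_run:
        cmd.append("--dry-run")    # report what would change without changing it
    # A trailing slash on the source copies its contents, not the folder itself.
    cmd += [source.rstrip("/") + "/", dest.rstrip("/") + "/"]
    return cmd


def crontab_line(cmd, hour=4, minute=0):
    """Format a crontab entry that runs cmd every day at the given time."""
    return f"{minute} {hour} * * * {shlex.join(cmd)}"


# Hypothetical paths: a local research folder and a mounted WebDAV volume.
cmd = rsync_command("/Users/me/Research", "/Volumes/dav/Research")
print(shlex.join(cmd))
print(crontab_line(rsync_command("/Users/me/Research", "/Volumes/dav/Research", dry_run=False)))
```

Keeping `dry_run=True` until the output looks right is a cheap guard against the orphan-deletion problem mentioned in Step 4.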

UPDATE: Konrad Lawson over at ProfHacker has posted a succinct guide to scripting rsync on Mac OS X. It’s probably better than anything I could come up with, so if you’re looking for a more robust solution and you’re not afraid of the command line, you should check it out.

Cross-posted at HASTAC