Hacks | Digital Histories @ Yale

Like most digital historians, my personal computer is packed to the gills with thousands upon thousands of documents in myriad formats and containers: JPEG, PDF, PNG, GIF, TIFF, DOC, DOCX, TXT, RTF, EPUB, MOBI, AVI, MP3, MP4, XLSX, CSV, HTML, XML, PHP, DMG, TAR, BIN, ZIP, OGG. Well, you get the idea. The folder for my dissertation alone contains almost 100,000 discrete files. As I mentioned last year, managing and preserving all of this data can be somewhat unwieldy. One solution to this dilemma is to do our work collaboratively on the open web. My esteemed colleague and fellow digital historian Caleb McDaniel is running a neat experiment in which he and his student assistants publish all of their research notes, primary documents, drafts, presentations, and other material online in a wiki.

Although I think there is a great deal of potential in projects like these, most of us remain hopelessly mired in virtual reams of data files spread across multiple directories and devices. A common issue is a folder with 200 JPEGs from some archival box or a folder with 1,000 PDFs from a microfilm scanner. One of my regular scholarly chores is to experiment with different ways to sort, tag, manipulate, and combine these files. This time around, I would like to focus on a potential solution for the latter task. So if, like most people, you have been itching for a way to compile your entire communist Christmas card collection into a single handy document, today is your lucky day. Now you can finally finish that article on why no one ever invited Stalin over to their house during the holidays.

Combining small numbers of image files or PDFs into larger, multipage PDFs is a relatively simply point-and-click operation using Preview (for Macs) or Adobe Acrobat. But larger, more complex operations can become annoying and repetitive pretty quickly. Since I began my IT career on Linux and since my Mac runs on a similar Unix core, I tend to fall back on shell scripting for exceptionally complicated operations. The venerable, if somewhat bloated, PDFtk suite is a popular choice for the programming historian, but there are plenty of other options as well. I’ve found the pdfsplit and pdfcat tools included in the latter package to be especially valuable. At the same time, I’ve been trying to use the Mac OS X Automator more often, and I’ve found that it offers what is arguably an easier, more user friendly interface, especially for folks who may be a bit more hesitant about shell scripting.

What follows is an Automator workflow that takes an input folder of JPEGs (or PDFs) and outputs a single combined PDF with the same name as the containing folder. It can be saved as a service, so you can simply right-click any folder and run the operation within the Mac Finder. I’ve used this workflow to combine thousands of research documents into searchable digests.

Step 1: Open Automator, create a new workflow and select the “Service” template. At the top right, set it to receive selected folders in the Finder.

Step 2: Insert the “Set Value of Variable” action from the library of actions on the left. Call the variable “Input.” Below this, add a “Run Applescript” action and paste in the following commands:

on run {input} tell application "Finder" set FilePath to (container of (first item of input)) as alias end tell return FilePath end run

Add another “Set Value of Variable” action below this and call it “Path.” This will establish the absolute path to the containing folder of your target folder for use later in the script. If this is all getting too confusing, just hang it there. It will probably make more sense by the end.

Step 3: Add a “Get Value of Variable” action and set it to “Input.” Click on “Options” on the bottom of the action and select “Ignore this action’s input.” This part is crucial, as you are starting a new stage of the process.

Step 4: Add the “Run Shell Script” action. Set the shell to Bash and pass input “as arguments.” Then paste the following code:

echo ${1##*/}

I admit that I am cheating a little bit here. This Bash command will retrieve the title of the target folder so that your output file is named properly. There is probably an easier way to do this using Applescript, but to be honest I’m just not that well versed in Applescript. Add another “Set Value of Variable” action below the shell script and call it “FolderName” or whatever else you want to call the variable – it really doesn’t matter.

Step 5: Add another “Get Value of Variable” action and set it to “Input.” Click on “Options” on the bottom of the action and select “Ignore this action’s input.” Once again, this step is crucial, as you are starting a new stage of the process.

Step 6: Add the action to “Get Folder Contents,” followed by the action to “Sort Finder Items.” Set the latter to sort by name in ascending order. This will assure that the pages of your output PDF are in the correct order, the same order in which they appeared in the source folder.

Step 7: Add the “New PDF from Images” action. This is where the actual parsing of the JPEGs will take place. Save the output to the “Path” variable. If you don’t see this option on the list, go to the top menu and click on View –> Variables. You should now see a list of variables at the bottom of the screen. At this point, you can simply drag and drop the “Path” variable into the output box. Set the output file name to something arbitrary like “combined.” If you want to combine individual PDF files instead of images, skip this step and scroll down to the end of this list for alternative instructions.

Step 8: Add the “Rename Finder Items” action and select “Replace Text.” Set it to find “combined” in the basename and replace it with the “FolderName” variable. Once again, you can drag and drop the appropriate variable from the list at the bottom of the screen. Save the workflow as something obvious like “Combine Images into PDF,” and you’re all set. When you right-click on a folder of JPEGs (or other images) in the Finder, you should be able to select your service. Try it out on some test folders with a small number of images to make sure all is working properly. The workflow should deposit your properly-named output PDF in the same directory as the source folder.

To combine PDFs rather than image files, follow steps 1-6 above. After retrieving and sorting the folder contents, add the “Combine PDF Pages” action and set it to combine documents by appending pages. Next add an action to “Rename Finder Items” and select “Name Single Item” from the pull-down menu. Set it to name the “Basename only” and drag and drop the “FolderName” variable into the text box. Lastly, add the “Move Finder Items” action and set the location to the “Path” variable. Save the service with a name like “Combine PDFs” and you’re done.

This procedure can be modified relatively easily to parse individually-selected files rather than entire folders. A folder action worked best for me, though, so that’s what I did. Needless to say, the containing folder has to be labeled appropriately for this to work. I find that I’m much better at properly naming my research folders than I am at naming all of the individual files within them. So, again, this process worked best for me. A lot can go wrong with this workflow. Automator can be fickle, and scripting protocols are always being updated and revised, so I disavow any liability for your personal filesystem. I also welcome any comments or suggestions to improve or modify this process.

The Chronicle published a lengthy review article last week on the science of brain mapping. The article focuses on Ken Hayworth, a researcher at Harvard who specializes in the study of neural networks (called connectomes). Hayworth believes, among other things, that we will one day be able to upload and replicate an individual human consciousness on a computer. It sounds like a great film plot. Certainly, it speaks to our ever-evolving obsession with our own mortality. Whatever the value of Hayworth’s prediction, many of us are already storing our consciousness on our computers. We take notes, download source material, write drafts, save bookmarks, edit content, post blogs and tweets and status updates. No doubt the amount of our intellectual life that unfolds in front of a screen varies greatly from person to person. But there are probably not too many modern writers like David McCullough, who spends most of his time clacking away on an antique typewriter in his backyard shed.

Although I still wade through stacks of papers and books and handwritten notes, the vast majority of my academic work lives on my computer, and that can be a scary prospect. I have heard horror stories of researchers who lose years of diligent work in the blink of an eye. I use Carbon Copy Cloner to mirror all of my data to an external hard drive next to my desk. Others might prefer Time Machine (for Macs) or Backup and Restore (for Windows). But what if I lose both my computer and my backup? Enter the wide world of cloud storage. Although it may be some time before we can backup our entire neural net on the cloud, it is now fairly easy to mirror the complicated webs of source material, notes, and drafts that live on our computers. Services like Dropbox, Google Drive, SpiderOak, and SugarSync offer between 2 and 5 GB of free space and various options for syncing local files to the cloud and across multiple computers and mobile devices. Most include the ability to share and collaborate on documents, which can be useful in classroom and research environments.

These free services work great for everyday purposes, but longer research projects require more space and organizational sophistication. The collection of over 10,000 manuscript letters at the heart of my dissertation, which I spent three years digitizing, organizing, categorizing, and annotating, consume about 30 GB. Not to mention the reams of digital photos, pdfs, and tiffs spread across dozens of project folders. It is not uncommon these days to pop into a library or an archive and snap several gigs of photos in a few hours. Whether this kind of speed-research is a boon or a curse is subject to debate. In any event, although they impose certain limits, ADrive, MediaFire, and Box (under a special promotion) offer 50 GB of free space in the cloud. Symform offers up to 200 GB if you contribute to their peer-to-peer network, but their interface is not ideal and when I gave the program a test drive it ate up almost 90% of my bandwidth. If you are willing to pay an ongoing monthly fee, there are countless options, including JustCloud‘s unlimited backup. I decided to take advantage of the Box deal to backup my various research projects, and since the process was far from straightforward, I thought I would share my solution with the world (or add it to the universal hive mind).

Below are the steps I used to hack together a free, cloud-synced backup of my research. Although this process is designed to sync academic work, it could be modified to mirror other material or even your entire operating system (more or less). While these instructions are aimed at Mac users, the general principles should remain the same across platforms. I can make no promises regarding the security or longevity of material stored in the cloud. Although most services tout 256 bit SSL encryption, vulnerabilities are inevitable and the ephemeral nature of the online market makes it difficult to predict how long you will have access to your files. The proprietary structure of the cloud and government policing efforts are critical issues that deserve more attention. Finally, I want to reiterate that this process is for those looking to backup a fairly large amount of material. For backups under 5 GB, it is far easier to use one of the free synching services mentioned above.

Step 1: Signup for Box (or another service that offers more than a few GB of cloud storage). I took advantage of a limited-time promotion for Android users and scored 50 GB of free space.

Step 2: Make sure you can WebDAV into your account. From the Mac Finder, click Go –> Connect to Sever (or hit command-k). Enter “https://www.box.com/dav” as the server address. When prompted, enter the e-mail address and password that you chose when you setup your Box account. Your root directory should mount on the desktop as a network drive. Not all services offer WebDAV access, so your mileage may vary.

Step 3: Install Transmit (or a similar client that allows synced uploads). The full version costs $34, which may be worth it if you decide you want to continue using this method. Create a favorite for your account and make sure it works. The protocol should be WebDAV HTTPS (port 443), the server should be www.box.com, and the remote path should be /dav. Since Box imposes a 100 MB limit for a single file, I also created a rule that excludes all files larger than 100 MB. Click Transmit –> Preferences –> Rules to establish what files to skip. Since only a few of my research documents exceeded 100 MB, I was fine depositing these with another free cloud server. I realize not everyone will be comfortable with this.

Step 4: Launch Automator and compile a script to run an upload through Transmit. Select “iCal Alarm” as your template and find the Transmit actions. Select the action named “Synchronize” and drag it to the right. You should now be able to enter your upload parameters. Select the favorite you created in Step 3 and add any rules that are necessary. Select “delete orphaned destination items” to ensure an accurate mirror of your local file structure, but make sure the Local Path and the Remote Path point to the same place. Otherwise, the script will overwrite the remote folder to match the local folder and create a mess. I also recommend disabling the option to “determine server time offset automatically.”

Step 5: Save your alarm. This will generate a new event in iCal, in your Automator calendar (if you don’t have a calendar for automated tasks, the system should create one for you). Double-click the event to modify the timing. Set repeat to “every day” and adjust the alarm time to something innocuous, like 4am. Click “Done” and you should be all set.

Automator will launch Transmit every day at your appointed time and run a synchronization on the folder containing your research. The first time it runs, it should replicate the entire structure and contents of your folder. On subsequent occasions, it should only update those files that have been modified since the last sync. There is a lot that can go wrong with this particular workflow, and I did not include every contingency here, so please feel free to chime in if you think I’ve left out something important.

If, like me, you are a Unix nerd at heart, you can write a shell script to replicate most of this using something like cadaver or mount_webdav, rsync, and cron. I might post some more technical instructions later, but I thought I should start out with basic point-and-click. If you have any comments or suggestions – other cloud servers, different process, different outcomes – please feel free to share them.

UPDATE: Konrad Lawson over at ProfHacker has posted a succinct guide to scripting rsync on Mac OS X. It’s probably better than anything I could come up with, so if you’re looking for a more robust solution and you’re not afraid of the command line, you should check it out.

Cross-posted at HASTAC

Digital Histories @ Yale

Tag Archives: Hacks

Combine JPEGs and PDFs with Automator