Tag Archives: Macs

Combine JPEGs and PDFs with Automator

Like most digital historians, my personal computer is packed to the gills with thousands upon thousands of documents in myriad formats and containers: JPEG, PDF, PNG, GIF, TIFF, DOC, DOCX, TXT, RTF, EPUB, MOBI, AVI, MP3, MP4, XLSX, CSV, HTML, XML, PHP, DMG, TAR, BIN, ZIP, OGG. Well, you get the idea. The folder for my dissertation alone contains almost 100,000 discrete files. As I mentioned last year, managing and preserving all of this data can be somewhat unwieldy. One solution to this dilemma is to do our work collaboratively on the open web. My esteemed colleague and fellow digital historian Caleb McDaniel is running a neat experiment in which he and his student assistants publish all of their research notes, primary documents, drafts, presentations, and other material online in a wiki.

Although I think there is a great deal of potential in projects like these, most of us remain hopelessly mired in virtual reams of data files spread across multiple directories and devices. A common issue is a folder with 200 JPEGs from some archival box or a folder with 1,000 PDFs from a microfilm scanner. One of my regular scholarly chores is to experiment with different ways to sort, tag, manipulate, and combine these files. This time around, I would like to focus on a potential solution for the latter task. So if, like most people, you have been itching for a way to compile your entire communist Christmas card collection into a single handy document, today is your lucky day. Now you can finally finish that article on why no one ever invited Stalin over to their house during the holidays.

Combining small numbers of image files or PDFs into larger, multipage PDFs is a relatively simple point-and-click operation using Preview (for Macs) or Adobe Acrobat. But larger, more complex operations can become annoying and repetitive pretty quickly. Since I began my IT career on Linux and since my Mac runs on a similar Unix core, I tend to fall back on shell scripting for exceptionally complicated operations. The venerable, if somewhat bloated, PDFtk suite is a popular choice for the programming historian, but there are plenty of other options as well. I’ve found the pdfsplit and pdfcat tools included in the latter package to be especially valuable. At the same time, I’ve been trying to use the Mac OS X Automator more often, and I’ve found that it offers what is arguably an easier, more user-friendly interface, especially for folks who may be a bit more hesitant about shell scripting.

What follows is an Automator workflow that takes an input folder of JPEGs (or PDFs) and outputs a single combined PDF with the same name as the containing folder. It can be saved as a service, so you can simply right-click any folder and run the operation within the Mac Finder. I’ve used this workflow to combine thousands of research documents into searchable digests.

Step 1: Open Automator, create a new workflow and select the “Service” template. At the top right, set it to receive selected folders in the Finder.

Step 2: Insert the “Set Value of Variable” action from the library of actions on the left. Call the variable “Input.” Below this, add a “Run AppleScript” action and paste in the following commands:

on run {input}
    -- Look up the folder that contains the selected folder
    tell application "Finder"
        set FilePath to (container of (first item of input)) as alias
    end tell
    return FilePath
end run

Add another “Set Value of Variable” action below this and call it “Path.” This will establish the absolute path to the containing folder of your target folder for use later in the script. If this is all getting too confusing, just hang in there. It will probably make more sense by the end.

Step 3: Add a “Get Value of Variable” action and set it to “Input.” Click on “Options” on the bottom of the action and select “Ignore this action’s input.” This part is crucial, as you are starting a new stage of the process.

Step 4: Add the “Run Shell Script” action. Set the shell to Bash and pass input “as arguments.” Then paste the following code:

echo "${1##*/}"

I admit that I am cheating a little bit here. This Bash command will retrieve the title of the target folder so that your output file is named properly. There is probably an easier way to do this using AppleScript, but to be honest I’m just not that well versed in AppleScript. Add another “Set Value of Variable” action below the shell script and call it “FolderName” or whatever else you want to call the variable – it really doesn’t matter.
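For anyone curious what that one-liner is doing, here is a minimal sketch of the same Bash parameter expansion, plus its mirror image for finding the containing folder. The function names and the example path are my own inventions for illustration; only the `##*/` expansion comes from the workflow itself.

```shell
#!/bin/bash
# Hypothetical helpers; only the ${1##*/} expansion is taken from Step 4.

# Strip everything up to and including the last slash,
# leaving the folder's own name.
folder_name() {
  local path="${1%/}"          # drop one trailing slash, if present
  printf '%s\n' "${path##*/}"
}

# Strip the last path component, leaving the containing folder
# (roughly what the AppleScript in Step 2 looks up through the Finder).
parent_path() {
  local path="${1%/}"
  printf '%s\n' "${path%/*}"
}

folder_name "/Users/me/Research/Box 12"   # prints: Box 12
parent_path "/Users/me/Research/Box 12"   # prints: /Users/me/Research
```

Quoting the expansion matters here, since research folders often have spaces in their names.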

Step 5: Add another “Get Value of Variable” action and set it to “Input.” Click on “Options” on the bottom of the action and select “Ignore this action’s input.” Once again, this step is crucial, as you are starting a new stage of the process.

Step 6: Add the action to “Get Folder Contents,” followed by the action to “Sort Finder Items.” Set the latter to sort by name in ascending order. This will ensure that the pages of your output PDF are in the correct order, the same order in which they appeared in the source folder.

Step 7: Add the “New PDF from Images” action. This is where the actual parsing of the JPEGs will take place. Save the output to the “Path” variable. If you don’t see this option on the list, go to the top menu and click on View –> Variables. You should now see a list of variables at the bottom of the screen. At this point, you can simply drag and drop the “Path” variable into the output box. Set the output file name to something arbitrary like “combined.” If you want to combine individual PDF files instead of images, skip this step and scroll down to the end of this list for alternative instructions.

Step 8: Add the “Rename Finder Items” action and select “Replace Text.” Set it to find “combined” in the basename and replace it with the “FolderName” variable. Once again, you can drag and drop the appropriate variable from the list at the bottom of the screen. Save the workflow as something obvious like “Combine Images into PDF,” and you’re all set. When you right-click on a folder of JPEGs (or other images) in the Finder, you should be able to select your service. Try it out on some test folders with a small number of images to make sure all is working properly. The workflow should deposit your properly-named output PDF in the same directory as the source folder.

To combine PDFs rather than image files, follow steps 1-6 above. After retrieving and sorting the folder contents, add the “Combine PDF Pages” action and set it to combine documents by appending pages. Next add an action to “Rename Finder Items” and select “Name Single Item” from the pull-down menu. Set it to name the “Basename only” and drag and drop the “FolderName” variable into the text box. Lastly, add the “Move Finder Items” action and set the location to the “Path” variable. Save the service with a name like “Combine PDFs” and you’re done.
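If you would rather stay in the shell for the PDF variant, the same idea can be sketched roughly as follows. This is not the Automator workflow itself. It assumes PDFtk is installed and on your PATH, the function names are hypothetical, and the shell's glob ordering is not guaranteed to match the Finder's sort by name (files numbered 1, 2, 10 without zero-padding are a likely trouble spot).

```shell
#!/bin/bash
# A sketch, assuming PDFtk is installed; function names are hypothetical.

# Print the path of the combined PDF: same parent directory as the
# source folder, named after the folder itself.
output_path() {
  local dir="${1%/}"
  printf '%s/%s.pdf\n' "$(dirname "$dir")" "$(basename "$dir")"
}

# Combine every PDF in the folder, in glob (name) order, into one file.
combine_pdfs() {
  local dir="${1%/}" out files
  out="$(output_path "$dir")"
  shopt -s nullglob            # an empty folder yields an empty list
  files=("$dir"/*.pdf)
  shopt -u nullglob
  if [ "${#files[@]}" -eq 0 ]; then
    echo "No PDFs found in $dir" >&2
    return 1
  fi
  pdftk "${files[@]}" cat output "$out"
}

# Usage: combine_pdfs "/Users/me/Research/Box 12"
```

As with the Automator version, try it on a small test folder first.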

This procedure can be modified relatively easily to parse individually-selected files rather than entire folders. A folder-based service worked best for me, though, so that’s what I did. Needless to say, the containing folder has to be labeled appropriately for this to work. I find that I’m much better at properly naming my research folders than I am at naming all of the individual files within them. So, again, this process worked best for me. A lot can go wrong with this workflow. Automator can be fickle, and scripting protocols are always being updated and revised, so I disavow any liability for your personal filesystem. I also welcome any comments or suggestions to improve or modify this process.

Ahead in the Clouds

The Chronicle published a lengthy review article last week on the science of brain mapping. The article focuses on Ken Hayworth, a researcher at Harvard who specializes in the study of neural networks (called connectomes). Hayworth believes, among other things, that we will one day be able to upload and replicate an individual human consciousness on a computer. It sounds like a great film plot. Certainly, it speaks to our ever-evolving obsession with our own mortality. Whatever the value of Hayworth’s prediction, many of us are already storing our consciousness on our computers. We take notes, download source material, write drafts, save bookmarks, edit content, post blogs and tweets and status updates. No doubt the amount of our intellectual life that unfolds in front of a screen varies greatly from person to person. But there are probably not too many modern writers like David McCullough, who spends most of his time clacking away on an antique typewriter in his backyard shed.

Although I still wade through stacks of papers and books and handwritten notes, the vast majority of my academic work lives on my computer, and that can be a scary prospect. I have heard horror stories of researchers who lose years of diligent work in the blink of an eye. I use Carbon Copy Cloner to mirror all of my data to an external hard drive next to my desk. Others might prefer Time Machine (for Macs) or Backup and Restore (for Windows). But what if I lose both my computer and my backup? Enter the wide world of cloud storage. Although it may be some time before we can back up our entire neural net on the cloud, it is now fairly easy to mirror the complicated webs of source material, notes, and drafts that live on our computers. Services like Dropbox, Google Drive, SpiderOak, and SugarSync offer between 2 and 5 GB of free space and various options for syncing local files to the cloud and across multiple computers and mobile devices. Most include the ability to share and collaborate on documents, which can be useful in classroom and research environments.

These free services work great for everyday purposes, but longer research projects require more space and organizational sophistication. The collection of over 10,000 manuscript letters at the heart of my dissertation, which I spent three years digitizing, organizing, categorizing, and annotating, consumes about 30 GB. Not to mention the reams of digital photos, PDFs, and TIFFs spread across dozens of project folders. It is not uncommon these days to pop into a library or an archive and snap several gigs of photos in a few hours. Whether this kind of speed-research is a boon or a curse is subject to debate. In any event, although they impose certain limits, ADrive, MediaFire, and Box (under a special promotion) offer 50 GB of free space in the cloud. Symform offers up to 200 GB if you contribute to their peer-to-peer network, but their interface is not ideal and when I gave the program a test drive it ate up almost 90% of my bandwidth. If you are willing to pay an ongoing monthly fee, there are countless options, including JustCloud’s unlimited backup. I decided to take advantage of the Box deal to back up my various research projects, and since the process was far from straightforward, I thought I would share my solution with the world (or add it to the universal hive mind).

Below are the steps I used to hack together a free, cloud-synced backup of my research. Although this process is designed to sync academic work, it could be modified to mirror other material or even your entire operating system (more or less). While these instructions are aimed at Mac users, the general principles should remain the same across platforms. I can make no promises regarding the security or longevity of material stored in the cloud. Although most services tout 256-bit SSL encryption, vulnerabilities are inevitable and the ephemeral nature of the online market makes it difficult to predict how long you will have access to your files. The proprietary structure of the cloud and government policing efforts are critical issues that deserve more attention. Finally, I want to reiterate that this process is for those looking to back up a fairly large amount of material. For backups under 5 GB, it is far easier to use one of the free syncing services mentioned above.

Step 1: Sign up for Box (or another service that offers more than a few GB of cloud storage). I took advantage of a limited-time promotion for Android users and scored 50 GB of free space.

Step 2: Make sure you can WebDAV into your account. From the Mac Finder, click Go –> Connect to Server (or hit command-k). Enter “https://www.box.com/dav” as the server address. When prompted, enter the e-mail address and password that you chose when you set up your Box account. Your root directory should mount on the desktop as a network drive. Not all services offer WebDAV access, so your mileage may vary.

Step 3: Install Transmit (or a similar client that allows synced uploads). The full version costs $34, which may be worth it if you decide you want to continue using this method. Create a favorite for your account and make sure it works. The protocol should be WebDAV HTTPS (port 443), the server should be www.box.com, and the remote path should be /dav. Since Box imposes a 100 MB limit for a single file, I also created a rule that excludes all files larger than 100 MB. Click Transmit –> Preferences –> Rules to establish what files to skip. Since only a few of my research documents exceeded 100 MB, I was fine depositing these with another free cloud server. I realize not everyone will be comfortable with this.

Step 4: Launch Automator and compile a script to run an upload through Transmit. Select “iCal Alarm” as your template and find the Transmit actions. Select the action named “Synchronize” and drag it to the right. You should now be able to enter your upload parameters. Select the favorite you created in Step 3 and add any rules that are necessary. Select “delete orphaned destination items” to ensure an accurate mirror of your local file structure, but first make sure the Local Path and the Remote Path point to corresponding folders. Otherwise, the sync will delete anything in the remote folder that has no match in the local folder and create a mess. I also recommend disabling the option to “determine server time offset automatically.”

Step 5: Save your alarm. This will generate a new event in iCal, in your Automator calendar (if you don’t have a calendar for automated tasks, the system should create one for you). Double-click the event to modify the timing. Set repeat to “every day” and adjust the alarm time to something innocuous, like 4am. Click “Done” and you should be all set.

Automator will launch Transmit every day at your appointed time and run a synchronization on the folder containing your research. The first time it runs, it should replicate the entire structure and contents of your folder. On subsequent occasions, it should only update those files that have been modified since the last sync. There is a lot that can go wrong with this particular workflow, and I did not include every contingency here, so please feel free to chime in if you think I’ve left out something important.
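Before the first sync, it can help to see which files the 100 MB rule from Step 3 is going to skip. A minimal sketch, assuming a hypothetical helper name and assuming that “100 MB” means 100 × 1024 × 1024 bytes (Box may count differently):

```shell
#!/bin/bash
# Hypothetical helper: list files larger than a byte limit.
# The default of 104857600 bytes is an assumption about Box's limit.
list_oversized() {
  local dir="$1" limit="${2:-104857600}"
  # -size +Nc matches regular files strictly larger than N bytes.
  find "$dir" -type f -size +"${limit}c" -print
}

# Usage: list_oversized "$HOME/Research"
```

Anything this prints will need a home somewhere else.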

If, like me, you are a Unix nerd at heart, you can write a shell script to replicate most of this using something like cadaver or mount_webdav, rsync, and cron. I might post some more technical instructions later, but I thought I should start out with basic point-and-click. If you have any comments or suggestions – other cloud servers, different process, different outcomes – please feel free to share them.
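To give a rough idea of what the rsync and cron route might look like, here is a sketch under several assumptions: the Box WebDAV share is already mounted at /Volumes/dav (a hypothetical mount point), the research lives in ~/Research, and rsync behaves acceptably over a WebDAV mount, which I have not verified. The paths, the function name, and the script name in the comments are all placeholders.

```shell
#!/bin/bash
# A sketch, not a tested backup tool. Paths and names are hypothetical.

SRC="$HOME/Research/"            # trailing slash: sync the folder's contents
DEST="/Volumes/dav/Research/"    # assumed mount point of the WebDAV share

sync_research() {
  # -a           recurse and preserve timestamps where the target allows it
  # --delete     remove remote files that no longer exist locally
  # --max-size   skip files over the per-file limit mentioned in Step 3
  # "$@"         lets the caller add flags such as --dry-run
  rsync -a --delete --max-size=100M "$@" "$SRC" "$DEST"
}

# Preview what would change without touching the remote copy:
#   sync_research --dry-run
#
# Example crontab line (add with `crontab -e`) to run a saved copy of this
# script every day at 4am, assuming the saved script ends by calling
# sync_research:
#   0 4 * * * /bin/bash /Users/me/bin/sync-research.sh
```

Because `--delete` removes anything on the remote side that is missing locally, the same warning from Step 4 applies: run it with `--dry-run` first and check that the two paths really correspond.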

UPDATE: Konrad Lawson over at ProfHacker has posted a succinct guide to scripting rsync on Mac OS X. It’s probably better than anything I could come up with, so if you’re looking for a more robust solution and you’re not afraid of the command line, you should check it out.

Cross-posted at HASTAC

Eternal Sunshine of the Spotless Draft

I am an inveterate Mac user. Some might say I’m a fanboy. Although I like to think that my brand loyalty is due to a cleaner, easier, more pleasing operating experience, there are other factors. Part of my attraction stems from the “Think Different” ad campaign of my youth – flattering for any impulsive iconoclast. Or maybe it’s that soothing chime. I don’t agree with everything Apple has ever done, especially now that they’ve thundered into the mainstream, but I still think that, when all is said and done, they can produce a better quality product than the competition (now if only they could do it humanely). Apple devices are marketed as polished, eloquent, intuitive. A common complaint about Microsoft, on the other hand, is that they have trouble releasing a finished product. Windows is notorious for being incomplete, buggy, awkward, in need of an endless cascade of updates and service packs. Of course, Mac OS X, Linux, Android, and every other decent piece of software does exactly the same thing. OS X has endured at least seven major revisions in the past decade, while Windows has suffered maybe three (it all depends on your definition of “major revision”). This endless turnover used to bother me. Does Firefox really need to release a new version every other day? How much useless bloat can software designers cram into MS Word before it finally explodes? Lately, however, I’ve come to accept and even embrace this radical incompleteness.

The age of static print was defined by permanence. Authors and editors had to work for a long time on multiple drafts, revisions, and proofs. The result was a clay tablet, or a scroll, or a codex book. With the onset of the printing press, it was easier to make corrections. Movable type could be reset and rearranged to create appended, expanded, and revised editions. Still, the emphasis was on stability. The paperback book I have on my desk right now looks pretty much exactly the same as it did when it was first published in 1987. And it will always look that way. A lot of effort went into its publication because it would be extremely difficult to revise it. It is a stable artifact. Digital culture, on the other hand, is a permanent palimpsest. What is here today is gone tomorrow, all that is solid melts into air. Digital publications do not have to be fully polished artifacts because they can be endlessly revised. There are benefits and drawbacks to this state of almost limitless transition. But now that the Encyclopedia Britannica has thrown up its hands and shuttered its print division, perhaps it is worth asking: what do we have to gain from adhering to a culture of permanence?

In the world of static print, errors or inaccuracies are irreversible. Filtration systems, such as line editing or peer review, help to mitigate this problem, but even the most perfectionist among us are not immune from good faith mistakes. We have all had those moments when we come across a typo or an inelegant phrase that makes us cringe with regret. How wonderful would it be to correct it in an instant? And why stop at typos? Less than a year after I published an article on abolitionist convict George Thompson, I was wandering around in the vast annex where my school’s library dumps all of its old reference books. Here were hoary relics like the National Union Catalog or the Encyclopedia of the Papacy. I picked up a dusty tome and, by dumb luck, found an allusion to Thompson’s long-lost manuscript autobiography. When I wrote the article I had scoured every database known to man over the course of two years, including WorldCat and ArchiveGrid. But the manuscript, which was filed away in some godforsaken corner of the Chicago History Museum, had no corresponding entry in any online catalog. I had to e-mail the museum staff and wait while a kindly librarian checked an old-school physical card catalog for the entry (so much for the vaunted age of digital research). Although it was too late to include the document in my article, at least I had time to include it in my dissertation. But what if I could include it in the article?

The perfectionist temptation can be disastrous. No doubt this impulse to continually revise is what led George Lucas to update the first three Star Wars films with new scenes and special effects. Many fans thought that the changes ruined the experience of the original artifacts. It may be better in some cases to leave well enough alone. Yet there is something to be said for revision. One of the things I love about the Slavery Portal is that it is constantly evolving. I am always adding new material or tweaking the interface. When I find a mistake, I fix it. When new data makes an older entry obsolete, I update it. Writing History in the Digital Age, a serious work of scholarship that is also technologically sophisticated and experimental, uses Commentpress to enable paragraph-by-paragraph annotation of its content. Thus a peer review process that is usually conducted in private among a small group of people over a long period of time becomes something that is open, immediate, collaborative, and democratic. Projects like this have landmarks, qualitative leaps, or nodal points, just like software that jumps from alpha stage to beta release or version 10.4.11 to 10.5. But they are always in process. For every George Lucas, there is a Leonardo da Vinci. The Florentine Master only completed around fifteen paintings in his lifetime and was a consummate procrastinator. His extensive manuscript collection remained unpublished at the time of his death and largely unavailable for a long time thereafter. What if da Vinci had a blog? (I can just imagine the comment thread on Vitruvian Man: “stevexxx37:  wuz up wit teh hair? get a cut yo hippie lolz!”)

Although I sometimes still agonize about fixes or changes I could make to older work, I have found that dispensing with the whole pretense of permanency can be tremendously therapeutic. Rather than obsess over writing a flawless dissertation, I have come to embrace imperfection. I have come to view my thesis or my scholarly articles not as end products, but as steps in a larger progression. In a sense, they are still drafts. In the sense that we are always revising and refining our understanding of the past, all history is draft. Static books and articles are essential building blocks of our historical consciousness. It is hard to imagine a world where the book I cite today might not be the same book tomorrow. And yet, to a certain extent, we live in that world. When Apple finds a security loophole or a backwards compatibility issue in its software, it releases a patch. If I find a typo or an inaccuracy in this post three days from now, I can fix it immediately. If I come across new information a year later, I can make a revision or post a follow-up. Everything is process. The other day, I updated the firmware on a picture frame.

I will, of course, continue to aim for the most polished, the most perfect work of which I am capable. As much as I would like, I cannot write my dissertation as a blog post. I will edit and revise, edit and revise. Sometimes you do not know what you need to revise until you make it permanent. At the end, maybe, I will have a landmark. And I will welcome its insufficiency. There is something liberating about being incompl…