In Tolkien’s grand narrative, the “one true ring” turned out to be a really bad idea, and it took a three-book sequence to destroy the thing. In the humanities in particular and the academy in general, we continue to be vexed by the search for a file format that allows for productive interchange and that is also open, in both the beer and speech senses. Microsoft’s Word files, DOC and DOCX, are clearly not it, though they are now so ingrained in everyone’s workflows, if only because the application is installed on most Windows computers, that many of us assume they are the basis for any interchange.
But as anyone who has had to trade a complex document, one with more than a few basic style options, back and forth a few times has learned, things get lost in transit.
Until recently, few applications did a decent job of reading and writing Word’s DOC file format. Support was getting better (which may be one of the reasons why Microsoft changed to the DOCX format, who knows?), but it was still not reliable.
What are the alternatives?
- OpenOffice’s ODF has never quite caught on.
- RTF is fairly reliable, but it isn’t capable of much.
- HTML seems so “webby” and hasn’t, at least until CSS3, been at all friendly to printed matter.
Which leaves PDF.
Adobe wisely side-stepped competing directly with Microsoft in producing its “portable document format.” Unfortunately for Adobe, though perhaps fortunately for those of us for whom openness matters, PDF seems to have hit its stride at exactly the moment when the rise of mobile computing devices calls it most into question. After all, who here hasn’t muttered in frustration on opening some simple text content on a phone or tablet and discovering that it is a PDF formatted for an 8.5 x 11 piece of paper? Oof!
And yet just as we in the humanities have leaned too much upon Word (I now traffic in tracked changes in Word documents for journal articles and books. Ugh!), we are starting to lean too much on PDF. A recent exchange on the Digital Humanities On-line mailing list turned up the following comment from Stephen Woodruff:
There are many ways of creating and encoding a PDF file, and not all result in text which can be copied and pasted if the text includes more than standard ASCII characters. Normal word processors hold an internationally accepted numerical representation of each letter plus a note of its font, size, colour and so on. So you can search for an “a” without caring whether it’s in Arial or Times, red or italic, and you can copy that numerical representation to another application, even if it doesn’t understand colour or have the same fonts.
PDF doesn’t always work like that. Some encodings are analogous to what a typical word processor would use, some are not: they store glyphs, effectively pictures of the individual letters, and have a table to convert back between those and the character codes needed by a copy-paste operation. It’s that conversion back that can go wrong: you can read the PDF files and print them because all your eyes and the printer need are the shapes, but if they have been created badly you cannot reliably extract the text. (I’m trying hard not to start complaining about the use of PDF, which is a PAGE description language not a TEXT description language, in the academic world.)
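To make that distinction concrete, here is a toy sketch in Python. It is not real PDF parsing, and every name in it (`encode_as_glyphs`, `extract_text`, the glyph numbering) is made up for illustration. It only models the idea in the comment above: the page stores glyph numbers, and a separate table is needed to get back to character codes. When that table is incomplete, the shapes still "print" but the text cannot be recovered.

```python
# Toy model only: glyph IDs and the lookup table are hypothetical,
# standing in for an embedded font subset and its glyph-to-character map.

def encode_as_glyphs(text, table_complete=True):
    """Turn text into a list of glyph IDs plus a table mapping IDs back to characters.

    If table_complete is False, non-ASCII characters are left out of the
    table, imitating a badly created file.
    """
    glyph_for_char = {}
    glyph_ids = []
    for ch in text:
        if ch not in glyph_for_char:
            # IDs are assigned in order of first use; they carry no meaning.
            glyph_for_char[ch] = len(glyph_for_char) + 1
        glyph_ids.append(glyph_for_char[ch])

    to_unicode = {
        gid: ch
        for ch, gid in glyph_for_char.items()
        if table_complete or ch.isascii()
    }
    return glyph_ids, to_unicode


def extract_text(glyph_ids, to_unicode):
    """Simulate copy-paste: look each glyph up, or emit U+FFFD when the table has a gap."""
    return "".join(to_unicode.get(gid, "\ufffd") for gid in glyph_ids)


original = "na\u00efve caf\u00e9"  # "naive cafe" with i-diaeresis and e-acute

good_ids, good_table = encode_as_glyphs(original, table_complete=True)
bad_ids, bad_table = encode_as_glyphs(original, table_complete=False)

recovered_good = extract_text(good_ids, good_table)
recovered_bad = extract_text(bad_ids, bad_table)
```

With the complete table, `recovered_good` equals the original string. With the incomplete one, both accented letters come back as the replacement character, even though the glyph sequence (the part a printer or a pair of eyes would use) is identical in both cases.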
Stephen also points to a terrific post by Adobe’s James King which clarifies PDF’s purpose. King’s post ends with the following:
The PDF design is very tailored to the creator being able to quite directly and without ambiguity, specify the exact output desired. That is a strong virtue for PDF and the price of more difficult text extraction is a price worth paying for that design.
And so there you have it: PDF is really about presentation. Whether you can get text (data) back out of it is not this particular vessel’s problem nor its concern. That seems problematic to me for those of us who wish our content to be as portable and re-usable as possible. I think PDF is terrific as one possible output, one possible product, but it’s not the interchange format of which everyone dreams. Quite the opposite.