Acrobat Conversion of PDF to Word Leaves a Lot to Be Desired

Happy birthday to me! I spent the first two hours of my birthday this morning converting the PDF of the essay that [Jonathan Goodwin][] and I spent the last three months producing. Our collaboration was supported through a [Git][] repo on [BitBucket][] and our drafting was done in [LaTeX][].

That all worked great in my experience.

Then we sent the working draft to Jim Leary and Tom Dubois, the editors of the _Journal of American Folklore_, and to Tim Tangherlini, editor of the special issue on Computational Folkloristics. Everyone was very warm and welcoming, and we are delighted with the positive response so far from everyone who has received the essay.

But to enter the journal’s workflow, the essay must be converted to a Word document. Now is not the moment to discuss why the humanities in general rely so heavily on proprietary, closed-source software, but I think even that passing remark begins to suggest a broader self-critique in which we might engage in order to change our practices to better suit our ideals.

Nonetheless, we needed to get our LaTeX transformed into a Word document. Two paths lay before us:

1. The first is what I will dub the *code route* in that it requires familiarity and comfort with the command line.
2. The second I will dub the *application route* in that one works through an omnibus GUI tool to get the work done.

My experience of the difference between the two routes was pretty typical. The *code route* is very fiddly on the front end, and what it misses it misses quite obviously. The *application route* gives you a smooth ride, and you think everything is just fine until you realize just how much it isn’t and spend the next few hours fiddling with all the things the application missed.

In particular, the two routes I ran last Friday and this Monday morning were, respectively, `latex2html` and Adobe Acrobat to Microsoft Word.

[`latex2html`][] is a Perl script that, in my experience, pretty quickly produces a folder with an HTML file for the text and a collection of graphic files. To convert a document to Word, you are directed to open the HTML file in Word and remove all the links, via *Edit > Links*, which should embed the graphics in the document itself. Save as a Word document, and you’re done. My experience was that only half the graphics made it, though a bit more careful checking on my part might have improved that result, and none of the footnotes made it. (More on this in a moment.)

Adobe Acrobat is, of course, the 800-pound gorilla of PDF creation and editing. One of its [principal claims][principle claims] is just how easy it is to move from a PDF to a Microsoft Office document. Just click on the video to see how easy it’s supposed to be to copy and paste to, or convert from PDF to, Word, Excel, and PowerPoint. Go ahead. Click on it. It’s easy.

And, at first glance, the results are extraordinary, unless of course you actually want to do anything besides look at them. The first thing I noticed was that Acrobat had hard coded in the additional spaces that were used for justification in the original PDF, so instead of something thoughtful like:

find(space between words) > convert(to single space)

it attempts to count spaces and insert however many it thinks are reasonable. (This isn’t necessarily an incorrect behavior, but it should be a selectable behavior.)

No big deal: a few *Find and Replace* passes through the document and all the spaces are normalized to one.
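For anyone who would rather script that cleanup than run repeated *Find and Replace* passes, here is a minimal sketch in Python. It assumes the converted text has already been exported from Word as plain text; the function name `normalize_spaces` is my own invention for illustration, not part of any tool mentioned here.

```python
import re

def normalize_spaces(text: str) -> str:
    """Collapse every run of two or more spaces or tabs into one space.

    Newlines are left alone so paragraph breaks survive.
    """
    return re.sub(r"[ \t]{2,}", " ", text)

# A made-up line with the kind of padding left over from justification.
sample = "The   quick  brown    fox"
print(normalize_spaces(sample))
```

One pass is enough here because the pattern matches a whole run of spaces at once, rather than replacing two spaces with one over and over.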

But then the nightmare really begins, because Acrobat also hard codes in end-of-line hyphenations, which cannot be removed unsupervised, since our document also contains a number of compound words. Worse, Acrobat also hard codes in a space at the end of each line, and so you can’t simply … *ack, I realize now I could probably have saved myself some time by finding and replacing a hyphen plus a space.* Living and learning, the Adobe/Microsoft way.
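That hyphen-plus-space idea can be sketched in a few lines of Python. This is only a sketch under assumptions: it supposes the text is available as plain text, that a line-break hyphen always shows up as a letter, a hyphen, exactly one space, and then a lowercase letter, and that genuine compound words have no space after their hyphen. The name `rejoin_hyphenated` is hypothetical.

```python
import re

# Assumed pattern: letter, hyphen, one space, lowercase letter.
# The lookarounds keep the surrounding letters and remove only "- ".
BROKEN_WORD = re.compile(r"(?<=[A-Za-z])- (?=[a-z])")

def rejoin_hyphenated(text: str) -> str:
    """Rejoin words split by a hard-coded end-of-line hyphen."""
    return BROKEN_WORD.sub("", text)

# A made-up sentence mixing broken words with a real compound.
sample = "The conver- sion of a closed-source docu- ment"
print(rejoin_hyphenated(sample))
```

It is still not fully unsupervised. A compound word that happened to break at its own hyphen would be fused into one word, and a suspended hyphen such as "pre- and post-" would be mangled, so the result deserves a read-through.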

A small sample of the many gifts Adobe Acrobat Pro leaves for you when you convert a clean PDF to Word.

But wait, there’s more. All the vertical spacing is also hard coded in, and that includes the footnotes, which now appear as numbered paragraphs that interrupt main body paragraphs. Oi.

In the end, I think I would far prefer to re-insert the ten or so footnotes found in the original PDF into the `latex2html` document than do as much fiddling as I ended up doing with the Acrobat version of the document. That’s ten edits as against dozens to hundreds. Even if I add in hand-editing the graphics, we are still talking about a total of twenty edits. That’s a far more reasonable number.

Especially for a birthday morning. Thanks, Adobe!

As a sidenote, I spent a few hours with Word this morning. It would wonkily throw me back and forth across pages, make entire lines disappear when I removed a single character, and make it difficult to select the hard page returns that Adobe had inserted, quite liberally, throughout the document. All of which made me realize just what a pleasure working in LaTeX is. So another tip of the hat to Jonathan for moving us in that direction.

[Jonathan Goodwin]: http://www.jgoodwin.net
[Git]: http://git-scm.com
[BitBucket]: https://bitbucket.org
[LaTeX]: http://www.latex-project.org
[`latex2html`]: http://www.latex2html.org
[Adobe Acrobat]: http://www.adobe.com/products/acrobat/
[principle claims]: http://www.adobe.com/products/acrobat/pdf-to-word-doc-converter.html
