A LaTeX Vita

Hmph. I sat down this morning to do something that has been on my task list for quite some time: create a version of my vita in LaTeX that looks the way I would like it to look. The version I have maintained in a word processor for several years now has gotten simpler and simpler, but it still has a few niceties that do not come with the basic LaTeX document classes, and it is not as involved as some of the templates scattered about the web. (When I am done I will, I hope, host the thing in its own repo for others interested in walking down this path.)

My goal was, and is, to create a document that depends on as few additions to the basic LaTeX installation as possible and, at the same time, is as cleanly documented as I can make it.

This is not as easy as one might like.

The current version uses only the geometry package with the preamble looking like this:

\documentclass[12pt, letterpaper]{article}
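
As a rough illustration of how the geometry package fits into a preamble like this, here is a minimal sketch. The one-inch margin is an assumption made for the example, not the value used in the actual vita.

```latex
\documentclass[12pt, letterpaper]{article}

% geometry sets the page margins; margin=1in is an assumed value
\usepackage[margin=1in]{geometry}

\begin{document}
Vita content goes here.
\end{document}
```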

At the very beginning of the document, I have contact information that I would prefer to look different from everything else, and, historically, I have simply centered it. While I could have set up an environment for this information, and did for a while during this task, I decided that since it was a once-only occurrence, it was simpler to format it in place:

\openup -0.25em
\textbf{JOHN LAUDUN} \\
Department of English \\
github.com/johnlaudun \\
\openup 0.25em
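
For comparison, the environment route mentioned above might look something like the following sketch. The name `contactblock` is hypothetical, and this is not the environment I actually wrote and discarded; it only shows the general shape. Because the `\openup` happens inside the environment's group, the tighter leading ends with the environment and does not need to be undone by hand.

```latex
\documentclass[12pt, letterpaper]{article}

% Hypothetical environment: centers its contents and tightens the leading.
% The \openup adjustment is local to the environment's group.
\newenvironment{contactblock}
  {\begin{center}\openup -0.25em}
  {\end{center}}

\begin{document}

\begin{contactblock}
\textbf{JOHN LAUDUN} \\
Department of English \\
github.com/johnlaudun
\end{contactblock}

\end{document}
```

For a block that appears once, this is more machinery than the in-place version, which is why I dropped it.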

There’s a colophon at the end of the document that tells readers that I used LaTeX and, when I get it working, that I used open source fonts. (While it too is centered, I still didn’t think it was worth creating an environment for a second instance.)

My next step will be to determine how to filter lines in the file so that I can create versions of the vita covering only the last 2, 3, or 5 years, which are the spans most commonly requested for various endeavors. (This would be easier if it were all in XML: then I could use XQuery and XSLT to filter and format, but I’m going to stick with LaTeX for now.)
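
One possible way to do this without leaving LaTeX, sketched here as an assumption rather than as something the vita does yet, is to wrap each dated item in a macro that compares its year against a cutoff. Both `\cutoffyear` and `\entry` are hypothetical names, and the two entries are placeholders.

```latex
\documentclass[12pt, letterpaper]{article}

% Hypothetical names: \cutoffyear and \entry are not part of the vita.
\newcount\cutoffyear
\cutoffyear=2014 % keep only items dated 2014 or later

% \entry{year}{text} typesets the item only when year >= \cutoffyear.
\newcommand{\entry}[2]{%
  \ifnum#1<\cutoffyear
    % older than the cutoff: typeset nothing
  \else
    \noindent\textbf{#1}\quad #2\par
  \fi}

\begin{document}
\entry{2016}{A recent item that stays in the short version.}
\entry{2009}{An older item that is skipped.}
\end{document}
```

Changing the single `\cutoffyear` assignment would then produce the 2-, 3-, or 5-year version from the same source file. This filters at typesetting time rather than filtering lines of the file itself, so it is a different technique from the one described above.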

LaTeX Notes

The last time I used LaTeX was to collaborate with Jonathan Goodwin on our examination of the disciplinary history of folklore studies using topic modeling. Having spent five weeks with computer scientists and mathematicians at the NSF’s Culture Analytics program at UCLA’s Institute for Pure and Applied Mathematics, I now see that LaTeX is simply the assumed way that one communicates research results. To be clear, I have always admired LaTeX both for its power and for its open source nature; it’s just that my previous adventures with it were rather stumbling affairs. This time around, I simply installed LaTeX (as texlive-latex) via MacPorts, and I’ve chosen to use Texpad as my editor. The experience could not have been smoother. I am writing up some notes in it now, and while I confess my itch to tweak some of the defaults is strong (I don’t particularly care for the default typeface in LaTeX, and I never have), it works.

As I work through my own implementation, I find myself admiring things like Neil Lawrence’s LaTeX style guide for his group.

Acrobat Conversion of PDF to Word Leaves a Lot to Be Desired

Happy birthday to me! I spent the first two hours of my birthday this morning converting to Word the PDF of the essay that [Jonathan Goodwin][] and I spent the last three months producing. Our collaboration was supported through a [Git][] repo on [BitBucket][] and our drafting was done in [LaTeX][].

That all worked great in my experience.

Then we sent the working draft to Jim Leary and Tom Dubois, the editors of the _Journal of American Folklore_, and to Tim Tangherlini, editor of the special issue on Computational Folkloristics. Everyone was very warm and welcoming, and we are delighted with the positive response so far from everyone who has received the essay.

But to enter into the journal’s workflow, the essay must get converted to a Word document. Now is not the moment to discuss why the humanities in general rely so heavily on proprietary, closed-source software, but I think even that passing remark begins to suggest a broader self-critique in which we might engage to begin to change our practices better to suit our ideals.

Nonetheless, we needed to get our LaTeX transformed into a Word document. Two paths lay before us:

1. The first is what I will dub the *code route* in that it requires familiarity and comfort with the command line.
2. The second I will dub the *application route* in that one works through an omnibus GUI tool to get the work done.

My experience of the difference between the two routes was pretty typical. The *code route* is very fiddly on the front end, and what it misses, it misses quite obviously. The *application route* gives you a smooth ride, and you think everything is just fine until you realize just how much it isn’t and spend the next few hours fiddling with all the things the application missed.

In particular, the two routes I ran last Friday and this Monday morning were, respectively, `latex2html` and Adobe Acrobat to Microsoft Word.

[`latex2html`][] is a Perl script that, in my experience, pretty quickly produces a folder with an HTML file for the text and a collection of graphic files. To convert a document to Word, you are directed to open the HTML file in Word and remove all the links, via *Edit > Links*, which should embed the graphics in the document itself. Save as a Word document, and you’re done. My experience was that half the graphics made it (a bit more checking on my part might have improved those results), but none of the footnotes did. (More on this in a moment.)

[Adobe Acrobat][] is, of course, the 800-pound gorilla of PDF creation and editing. One of its [principal claims][principle claims] is just how easy it is to move from a PDF to a Microsoft Office document. Just click on the video to see just how easy it’s supposed to be to copy and paste to, or convert from PDF to, Word, Excel, and PowerPoint. Go ahead. Click on it. It’s easy.

And, at first glance, the results are extraordinary, unless of course you actually want to do anything besides look at them. The first thing I noticed was that Acrobat had hard coded in the additional spaces that were used for justification in the original PDF, so instead of something thoughtful like:

find(space between words) > convert(to single space)

it attempts to count spaces and insert however many it thinks are reasonable. (This isn’t necessarily an incorrect behavior, but it should be a selectable behavior.)

No big deal: a few *Find and Replace* passes through the document and all the spaces are normalized to one.

But then the nightmare really begins, because Acrobat also hard codes in end-of-line hyphenations, which cannot be removed unsupervised, since our document also contains a number of compound words. Worse, Acrobat also hard codes in a space at the end of each line, and so you can’t simply … *ack, I realize now I could probably have saved myself some time by finding and replacing a hyphen plus a space.* Living and learning, the Adobe/Microsoft way.

A small sample of the many gifts Adobe Acrobat Pro leaves for you when you convert a clean PDF to Word.

But wait, there’s more. All the vertical spacing is also hard coded in, and that includes the footnotes, which now appear as numbered paragraphs that interrupt main body paragraphs. Oi.

In the end, I think I would far prefer to re-insert the ten or so footnotes found in the original PDF into the `latex2html` document than do as much fiddling as I ended up doing with the Acrobat version of the document. That’s ten edits as against dozens to hundreds. Even if I were to add in hand-editing the graphics, we are still talking about a total of twenty edits. That’s a far more reasonable number.

Especially for a birthday morning. Thanks, Adobe!

As a sidenote, having spent a few hours with Word this morning, which would wonkily throw me back and forth across pages or make entire lines disappear when I removed a single character or made it difficult to select the hard page returns that Adobe had inserted, quite liberally, throughout the document, I realized just what a pleasure working in LaTeX is. So another tip of the hat to Jonathan for moving us in that direction.

[Jonathan Goodwin]: http://www.jgoodwin.net
[Git]: http://git-scm.com
[BitBucket]: https://bitbucket.org
[LaTeX]: http://www.latex-project.org
[`latex2html`]: http://www.latex2html.org
[Adobe Acrobat]: http://www.adobe.com/products/acrobat/
[principle claims]: http://www.adobe.com/products/acrobat/pdf-to-word-doc-converter.html

Reason #35 to Like MacPorts

So the essay that Jonathan Goodwin and I wrote together using LDA topic modeling to explore the intellectual history of folklore studies is about to head into the _Journal of American Folklore_’s workflow and that means it has to get converted from LaTeX to Word. The way that conversion apparently works is:

LaTeX > HTML > Word

Fortunately, [someone has written a command line tool][], `latex2html`, that does the heavy lifting. And, thank you computing gods wherever, and whoever, ye may be, the tool is already in MacPorts. MacPorts makes it easy to find this out:

% port search latex2html
latex2html @2008 (print)
Convert LaTeX into HTML.

All that means is this:

% sudo port install latex2html

A whole lot of scrolled text later ends with:

--->  Installing latex2html @2008_3
--->  Activating latex2html @2008_3
--->  Cleaning latex2html
--->  Updating database of binaries: 100.0%
--->  Scanning binaries for linking errors: 100.0%
--->  No broken files found.

How do we use this tool?

> At this point, you’re ready to convert your files from LaTeX to HTML, and then possibly to Word. To invoke latex2html, switch to the directory with your .tex file, and latex2html filename. In its default state, latex2html will produce HTML that is broken up into multiple pages, usually one per section / subsection, much like the latex2html home page is. If you want to import your document into Word, you may wish to suppress this tendency. To do so, use the following command:

latex2html -split 0 -info 0 -no_navigation filename

> `-split 0` will make the entire LaTeX file into a single HTML page, while `-info 0` will remove the information bar at the bottom of the page and `-no_navigation` will remove the navigational menus on the top on bottom. This should produce a vanilla HTML file that Microsoft Word can read fairly easily.

> One thing to beware at this point: … Word will link to image files instead of including them in the document, which will mean that things like your equations will drop out if you send someone the .doc file without sending the image files as well. To fix this … go to Edit->Links, selecting all of the links in the dialog box, and clicking “Break Link”. Once that is done, save the file and the images will now be embedded into the document itself, ready for sending off to someone else.

`UPDATE: I just ran this, and it worked fine. And it was very fast.`

[someone has written a command line tool]: http://mildopinions.wordpress.com/2008/09/29/latex-to-html-and-word-with-latex2html-a-mini-tutorial-for-os-x-users/