Batch Converting DOCX Files

My students live in a Microsoft universe, for the most part. I don’t blame them: it’s what their parents and teachers know. And I blame those same adults in their lives for not teaching them how to do anything more powerful with that software, turning Word into nothing more than a typewriter with the ability to format things in an ad hoc fashion. Style sheets! Style sheets! Style sheets! As a university professor, I duly collect their Word documents, much as I would collect their printed documents, and I read them, mark on them, and hand them back. Yawn.[1]

Sometimes, just to play with them, I take all their papers and I mine them for patterns: words and phrases and topics that occur across a number of papers. You can’t do that with Word documents, so you need to convert them into something more useful. (And, honestly, much of what my students turn in could be done in plain text and we would all be better off.)

On a Mac, textutil does the trick nicely:

textutil -convert txt ./MyDocxFiles/*.docx

I generally then select all the text files and move them to their own directory, where, for some forms of mining, I simply lump them into one big file:

cat ./texts/*.txt > alltexts.txt
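As one example of that kind of mining, here is a rough sketch that lists the most frequent words in the lumped file. The function name top_words is made up for illustration, and splitting on anything that is not a letter is crude: it breaks contractions apart and says nothing about phrases or topics.

```shell
# Hypothetical helper: list the most frequent words in a text file.
# $1: text file to scan; $2: how many words to list (defaults to 20)
top_words() {
  tr '[:upper:]' '[:lower:]' < "$1" |   # fold everything to lowercase
    tr -cs '[:alpha:]' '\n' |           # one word per line, split on non-letters
    awk 'NF' |                          # drop empty lines
    sort | uniq -c | sort -rn |         # count each word, most frequent first
    head -n "${2:-20}" |
    awk '{print $1, $2}'                # tidy output as "count word"
}
```

Running `top_words alltexts.txt 50` would then print the fifty most common words, one per line, each preceded by its count.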

(I should probably figure out how to do the “convert to text” and “place in another directory” in one command line.)
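One way to fold the two steps together is a small loop that hands each file to textutil with its -output option, so the text file is written straight into the target directory. This is only a sketch: the function name docx_to_txt is invented here, and it assumes the macOS textutil used above.

```shell
# Hypothetical helper: convert every .docx in one folder to .txt in another.
# $1: folder holding the .docx files; $2: folder to receive the .txt files
docx_to_txt() {
  mkdir -p "$2"                         # create the target folder if needed
  for f in "$1"/*.docx; do
    [ -e "$f" ] || continue             # skip the bare pattern when nothing matches
    # -output names the converted file, so it never lands next to the original
    textutil -convert txt -output "$2/$(basename "$f" .docx).txt" "$f"
  done
}
```

Called as `docx_to_txt ./MyDocxFiles ./texts`, this should leave one text file per paper in ./texts, ready for the cat step.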

pandoc can also do this, and I need to figure that syntax out.
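A possible pandoc version, again as a sketch rather than a tested recipe: pandoc writes a single output file per run, so the loop below calls it once per document. The function name docx_to_txt_pandoc is made up, and the script assumes pandoc is installed and on the PATH.

```shell
# Hypothetical helper: the same folder-to-folder conversion, using pandoc.
# $1: folder holding the .docx files; $2: folder to receive the .txt files
docx_to_txt_pandoc() {
  mkdir -p "$2"                         # create the target folder if needed
  for f in "$1"/*.docx; do
    [ -e "$f" ] || continue             # skip the bare pattern when nothing matches
    # -f docx reads Word files, -t plain writes plain text, -o names the output;
    # --wrap=none keeps each paragraph on one line instead of hard-wrapping it
    pandoc -f docx -t plain --wrap=none -o "$2/$(basename "$f" .docx).txt" "$f"
  done
}
```

Usage mirrors the textutil loop: `docx_to_txt_pandoc ./MyDocxFiles ./texts`. Unlike textutil, this should also work outside macOS.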

  1. I also sit through their prettily formatted but fairly substance-less PowerPoints. I’m not just picking on them here: I also work with them on making such presentations more meaningful.