Batch Converting DOCX Files

My students live in a Microsoft universe, for the most part. I don’t blame them: it’s what their parents and teachers know. And I blame those same adults in their lives for not teaching them how to do anything more powerful with that software, turning Word into nothing more than a typewriter with the ability to format things in an ad hoc fashion. Style sheets! Style sheets! Style sheets! As a university professor, I duly collect their Word documents, much as I would collect their printed documents, and I read them, mark on them, and hand them back. Yawn.1

Sometimes, just to play with them, I take all their papers and I mine them for patterns: words and phrases and topics that occur across a number of papers. You can’t do that with Word documents, so you need to convert them into something more useful. (And, honestly, much of what my students turn in could be done in plain text and we would all be better off.)

On a Mac, textutil does the trick nicely:

textutil -convert txt ./MyDocxFiles/*.docx

I generally then select all the text files and move them to their own directory, where, for some forms of mining, I simply lump them into one big file:

cat ./texts/*.txt > alltexts.txt

(I should probably figure out how to do the “convert to text” and “place in another directory” in one command line.)
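One way to fold the "convert" and "place in another directory" steps together is a short script rather than a single command line. What follows is a minimal sketch in Python, not a tested workflow: the function names and directory names are hypothetical, and it assumes textutil writes each .txt file next to its source .docx, as it does in the command above.

```python
import shutil
import subprocess
from pathlib import Path


def collect_texts(src_dir, dest_dir):
    """Move every .txt file in src_dir into dest_dir; return the new paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    moved = []
    for txt in sorted(Path(src_dir).glob("*.txt")):
        target = dest / txt.name
        shutil.move(str(txt), str(target))
        moved.append(target)
    return moved


def combine_texts(text_dir, out_file):
    """Concatenate every .txt file in text_dir into out_file, like cat."""
    out_path = Path(out_file)
    # List the inputs first, and skip the output file if it lives in text_dir.
    sources = [p for p in sorted(Path(text_dir).glob("*.txt"))
               if p.resolve() != out_path.resolve()]
    with open(out_path, "wb") as out:
        for txt in sources:
            out.write(txt.read_bytes())


def convert_and_collect(docx_dir, dest_dir):
    """Run textutil (macOS only) on every .docx, then gather the results."""
    docx_files = [str(p) for p in sorted(Path(docx_dir).glob("*.docx"))]
    if docx_files:
        subprocess.run(["textutil", "-convert", "txt", *docx_files], check=True)
    return collect_texts(docx_dir, dest_dir)
```

Calling convert_and_collect("./MyDocxFiles", "./texts") and then combine_texts("./texts", "alltexts.txt") would reproduce the two commands above, with the move folded in.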

pandoc can also do this, and I need to figure that syntax out.
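For pandoc, here is a minimal sketch, assuming pandoc is installed and on the PATH. Given several input files, pandoc concatenates them into one output, so the sketch calls it once per .docx and writes the plain-text result straight into a separate directory. The function names and directory layout are hypothetical.

```python
import subprocess
from pathlib import Path


def pandoc_command(docx_path, out_dir):
    """Build the pandoc call that turns one .docx into plain text in out_dir."""
    docx_path = Path(docx_path)
    target = Path(out_dir) / (docx_path.stem + ".txt")
    return ["pandoc", "--from", "docx", "--to", "plain",
            str(docx_path), "--output", str(target)]


def convert_all(docx_dir, out_dir):
    """Convert every .docx in docx_dir; requires pandoc to be installed."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for docx in sorted(Path(docx_dir).glob("*.docx")):
        subprocess.run(pandoc_command(docx, out_dir), check=True)
```

With this, convert_all("./MyDocxFiles", "./texts") would handle the conversion and the separate directory in one pass.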


  1. I also sit through their prettily formatted but also fairly substance-less PowerPoints — I’m not just picking on them here: I also work with them on making such presentations more meaningful. 

Cultural Mechanics

Kudos to James O’Sullivan for a title so great I want to steal it: Cultural Mechanics is his podcast focusing on a really diverse range of digital humanities and digital arts topics. (Right now I would say it’s more digital arts in nature, but that may not be his overall focus.) Here it is on SoundCloud.

Compelling Visualization Projects

The Rhythm of Food combines data from FooDB and Google Trends, looking for search patterns across time — and cleverly recognizing that the cycle of the year is a good way to organize time.

Of Types, Motifs, Tropes

For our next class, we are going to go a-hunting, tale-type hunting. I am going to bring an assortment of texts, some folktales and some not, that I will give you to track down. Your means of determining the nature of the texts will be the Tale-Type Index and the Motif Index. You will, I think, fairly quickly figure out how to use those two instruments to your best advantage.

It might also be a good moment to think about the nature of such cataloging efforts. One place to begin, as a kind of quick review of the origins and development of the indices, is the Wikipedia entry on the Aarne–Thompson classification systems. (There is a separate entry on motif worth reading.) Once there, you will see a reference to a consideration that is rather recent in terms of the indices themselves: Alan Dundes’ “The Motif-Index and the Tale Type Index: A Critique”. (There is also Hans-Jörg Uther’s assessment in “Classifying Folktales”.)

The two indices work together to catalogue the tales within their pages by their constituent parts, motifs. As a number of observers have remarked, this is no small matter, and it has led some to regard the entire enterprise as hopeless, given the seemingly endless variability of the human imagination.

And yet, as old-fashioned as the tale-type and motif indices may seem, we have re-created them in TV Tropes. And so, it would seem, some of you have already played a drinking game to tale types. Congratulations.

Test Post with JP Markdown and Syntax Highlighting Activated

Okay, here’s some regular prose, which isn’t explanatory at all, and then here comes a block of code:

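# Two stop-word lists: one from the stop_words package, one from NLTK
# (the NLTK list needs nltk.download("stopwords") to have been run once).
from stop_words import get_stop_words
from nltk.corpus import stopwords

mod_stop = get_stop_words('en')
nltk_stop = stopwords.words("english")

# Report how many words each list contains.
print("mod_stop is {} words, and nltk_stop is {} words".format(len(mod_stop), len(nltk_stop)))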

returns:

mod_stop is 174 words, and nltk_stop is 153 words

Jetpack Markdown Troubleshooting

Here’s a screenshot of what a fenced code block looks like with both the Jetpack markdown turned on and the Syntax Highlighting Evolved plug-in activated:

And here it is with the syntax highlighter turned off. I don’t quite understand why the code block WP short code is showing up: