Assessing the Corpora Available to Me

When I go to the Culture Analytics program at UCLA’s IPAM in a little over a month, I want to arrive with at least one interesting corpus to work with. I have the following options:

  • Louisiana treasure legends:
  • Hook legend:
  • Oil industry interviews: 480 texts

The oil industry interviews come as a collection of mostly DOCX files with an RTF file or two mixed in. They are a mixed bag in terms of content, but some distant reading might turn up something interesting. To do that, I first need to convert them to plain text:

textutil -convert txt ~/Desktop/transcripts/*.docx

I then ran the same command again with *.rtf in place of *.docx. Now I’ve got 480 plain text files. It would be nice, for the sake of using the file names later, to trim them down a bit. At the moment they look like this:

Lastname, Firstname 08-09-2006 final.txt
...
Lastname, Firstname and Firstname 01-23-02 final.txt

I created two Automator workflows. The first makes all the letters in the file names lowercase (a personal preference) and replaces spaces with underscores. The second trims any occurrence of final or transcript from the end of the file names. (This could just as easily have been one workflow, but I suspect I will want to reuse each of them separately in the future.) Now file names look like this:

lastname_firstname_06-01-2006.txt

Still somewhat ungainly, but it will do for now.
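
For anyone who would rather script this step than build Automator workflows, here is a minimal sketch of the same cleanup in Python. It is not what I actually ran. The function names clean_name and rename_all are made up for illustration, and dropping the comma is an assumption based on the example output above rather than something either workflow explicitly does.

```python
import re
from pathlib import Path

# Words to trim from the end of a file name (from the post: "final", "transcript").
TRAILING_WORDS = re.compile(r"(?:[\s_]+(?:final|transcript))+$", re.IGNORECASE)


def clean_name(filename: str) -> str:
    """Lowercase a file name, drop trailing 'final'/'transcript', use underscores."""
    path = Path(filename)
    stem = TRAILING_WORDS.sub("", path.stem)
    # Assumption: commas are removed, to match the example output in the post.
    stem = stem.lower().replace(",", "")
    stem = re.sub(r"\s+", "_", stem.strip())
    return stem + path.suffix.lower()


def rename_all(folder: Path, dry_run: bool = True) -> None:
    """Rename every .txt file in folder; with dry_run=True, only print the plan."""
    for path in sorted(folder.glob("*.txt")):
        target = path.with_name(clean_name(path.name))
        if target.name == path.name:
            continue
        print(f"{path.name} -> {target.name}")
        if dry_run:
            continue
        # Never overwrite a different file that already has the target name.
        if target.exists() and not target.samefile(path):
            print(f"  skipped: {target.name} already exists")
            continue
        path.rename(target)
```

Calling rename_all(Path.home() / "Desktop" / "transcripts") only prints the planned renames; passing dry_run=False actually performs them. Defaulting to a dry run seemed the safer choice with 480 files at stake.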
