Textual Analysis in Mac OS X

For those who use or access Macs, I just wanted to point out that videos of this year’s [WWDC sessions][wwdc2012] are up and they have a session on “Text and Linguistic Analysis.” All the sessions are at the URL below. I watched last year’s session on [latent semantic analysis][wwdc2011], which is also baked into the Mac OS, and it was quite good. You can watch the videos on-line or get them through iTunes to watch off-line, and they are also available as PDFs. (The site is worth checking out just to see how well it is designed.)

[wwdc2012]: https://developer.apple.com/videos/wwdc/2012/
[wwdc2011]: https://developer.apple.com/videos/wwdc/2011/


I run this site on Markdown, and I often use MultiMarkdown internally, but I really need to remember that reStructuredText is out there. Not only is it a Python project, and Python is a language I am trying to learn, but it also offers a lot of functionality. (Some would argue too much, because its markup can strike some as going beyond the simple, plain text offered by something like Markdown.) Nevertheless, if I ever feel the need to increase the amount of metadata, and even some of the functionality, of my own documents, then it might be time to take the plunge.

Using IBM’s Word Cloud Generator

Okay, you have signed up for an IBM DeveloperWorks account and you have successfully clicked the **download** button. Now what? Now the fun begins.

First, you need to find the downloaded file, which should be a `zip` archive. Most modern operating systems should have the necessary applications to unzip the file. If yours doesn’t, then look for a good archive utility that handles `zip`, `gzip`, `tar`, and other forms of compression on a site like MacUpdate or some other reliable source for software.

The unzipped file turns out to be a folder. Inside the folder you are going to see the following:

* a directory (folder) labelled examples
* a directory labelled license
* the actual word cloud generator application: `ibm-word-cloud.jar`
* a read-me file, `readme.html`
* a Windows batch file named `run-example.bat`
* a Unix shell script named `run-example.sh`

I am writing this from a computer running Mac OS X, which is a Unix machine with a pretty face, and so I am going to use the Unix shell script as my foundation, but the corresponding steps should work similarly for those of you running Windows OS and using the batch file. (Properly, I believe I should describe Mac OS X as POSIX-compliant, but I don’t know how many people, including myself, would understand at all what that meant.)

The `readme` file is somewhat helpful, but I find two documents (files) to be even more helpful. One is the shell script itself, which gives me an exact idea of what I should type (or, better, paste) into a terminal to begin to get results. And so the first thing you should do is copy the command out of the shell script, starting with `java` and going all the way to `example.png`, paste it into an open terminal window, and hit return.

If you haven’t used the terminal before, or whatever your OS calls getting access to the command line, then you are in for both something of a shock and a real treat. The shock will come from something that looks, for those of you raised in the era of GUIs, so, well, *textual*, and the treat will come with realizing that even though its *textiness* seems so foreign, it’s actually fairly easy to use, and you will be surprised how quickly you get results.

And so, perhaps, the place to begin is finding this Terminal application: in Mac OS X, it’s in the Utilities directory/folder within the Applications directory, which is at the root directory. The file hierarchy looks something like this:

`/Applications/Utilities/Terminal.app`


You read this as follows:

* / (root)
* Applications/ (Applications directory)
* Utilities/ (Utilities directory)
* Terminal.app (application named Terminal)

Okay, now you have the Terminal application open, which means you have a window on your desktop which contains something like this:

`Last login: Wed Mar 30 16:51:05 on console`

`[~]%`

The `%` is known as the prompt, short for “the command line prompt,” and it means you are now working with the command line interface (CLI). Congratulations, you have just earned your first CLI credit.

Your prompt may very well be longer: I have shortened mine so that it places my current working directory between square brackets and then gives me a percent sign to tell me it’s ready to receive instructions. (There’s a lot more to say about the environment in which you now find yourself, but for the sake of getting on with this tutorial we will leave that for another time.)

If you paste the code that you copied out of the shell script above and try to run it from where you are, chances are you will get nothing useful. That is because the prompt can only run things when it knows where they are. Much the same applies in the GUI, but Windows and Mac and Linux GUIs do a lot of work behind the scenes to find applications for you. You have two choices: add the file path to your command, or navigate to where the WCG application is and run it from within its directory. (If you were going to use the application a lot, there are some other considerations, but we will leave those for another time. Feel free to ask if you like.)

Typically, most Terminal windows will start you in your user home directory — which is indicated by the use of the tilde (`~`). My best advice for the sake of this current activity is to use Windows Explorer or the Mac Finder and move the unzipped folder containing the WCG, which is named “IBM Word Cloud” in my case, to the Desktop or to your Documents folder. Some place easy to get to.

Now, back at the CLI, you can do the following things:

* type `pwd` to print the working directory
* type `ls` to list the contents of the current working directory
* type `cd` to change the directory you are in
* type `cd ..` (that’s cd followed by a space followed by two periods) if you went into the wrong directory and need to back your way out
* type `cd ~` to go back to the home directory no matter where you are
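Here is a minimal sketch of those commands strung together. It assumes nothing about your own machine: it builds a throwaway sandbox with a made-up `Desktop/IBM Word Cloud` folder inside it, so every path below is hypothetical and only there for practice.

```shell
# A throwaway sandbox so nothing real is touched; every name here is hypothetical.
sandbox="$(mktemp -d)"
mkdir -p "$sandbox/Desktop/IBM Word Cloud"

cd "$sandbox"                    # change into the sandbox
pwd                              # print the working directory
ls                               # list its contents (just Desktop)
cd "Desktop/IBM Word Cloud"      # quote any name that contains spaces
here="$(pwd)"                    # remember where we ended up
cd ..                            # back out one level, into Desktop
parent="$(pwd)"
cd ~                             # jump home from wherever you are

echo "was in: $here"
echo "backed out to: $parent"
echo "now in: $(pwd)"
```

The one habit worth picking up early is the quoting: a folder name with spaces in it, like “IBM Word Cloud”, has to be wrapped in quotation marks or the shell will read it as three separate words.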

Navigate to the IBM Word Cloud directory, which we are going to pretend is on your Desktop. Now you can run that command `java … example.png` and it will produce results for you. If you look back in the IBM Word Cloud directory, using the Finder, you will see that a graphic file named `example.png` now exists there.

Congratulations, you have successfully produced your first word cloud, and your first graphic, from the command line. Everything else is a matter of making the kinds of adjustments we discussed in class.
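If you are curious about the two choices mentioned earlier (spelling out the path versus navigating first), here is a rough sketch of the difference. It does not touch the word cloud generator at all: `sh` stands in for `java`, and a made-up one-line script named `hello.sh` stands in for the application file, both of which are my assumptions for the sake of illustration.

```shell
# hello.sh is a hypothetical stand-in for the word cloud application file.
demo="$(mktemp -d)/IBM Word Cloud"
mkdir -p "$demo"
printf '%s\n' 'echo "hello from $(pwd)"' > "$demo/hello.sh"

# Choice 1: stay put and spell out the whole path to the file.
by_path="$(sh "$demo/hello.sh")"

# Choice 2: navigate into the directory, then name the file on its own.
cd "$demo"
in_place="$(sh hello.sh)"

echo "$by_path"
echo "$in_place"
```

Both runs find the file. The difference is where any output lands: a command run from inside the directory leaves its results there, which is why `example.png` turns up next to the application when you navigate first.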

Syntax Highlighting in Word

I am working on my paper for the computational folkloristics panel at AFS this year. My goal is to apply some of the network theory and visualization methods I learned at the NEH Institute on Networks and Networking in the Humanities to the intellectual history of folklore studies. I thought an interesting phenomenon to tackle would be the emergence of performance studies as a paradigm. That is, what does a paradigm shift look like from the point of view of a network? What did it look like in folklore studies?

To do this work I am interacting with JSTOR’s *Data for Research* program, and I am trying to keep notes as I go. Because this will eventually be something I want to share with others, I am keeping my notes in Word, if only because I can control the presentation much more readily. For the XML with which I am working to be more readable, it could use some syntax highlighting, a feature I count on in my text editor, TextMate, but which is not available in Word … unless, of course, you happen upon on-line sites which will do the work for you.

One such site is [ToHTML](http://tohtml.com/). [PlanetB](http://www.planetb.ca/2008/11/syntax-highlight-code-in-word-documents/) will also do some syntax highlighting.

Word Wrapping in vim

It has taken some research, but what I was never able to accomplish in emacs I have been able to pull off successfully, and in short order, in vim. To be sure, it’s not my doing but the work of others that I have cobbled together here, but what it provides is the kind of word wrapping to which most of us have become accustomed in GUIs. You will need to place the following in your `.vimrc` file in your home directory:

    :set wrap
    :set linebreak
    :set nolist  " list disables linebreak
    :set textwidth=0
    :set wrapmargin=0
    :set formatoptions+=l

From Word to HTML

Because Microsoft Word’s own *save-as-webpage* HTML conversion option is pretty awful, I am saving Word documents as RTF and then using Mac OS X’s `textutil` in the terminal. Enter `textutil -convert html foo.rtf` to convert `foo.rtf` to `foo.html`. (I’m working on some AFS website materials, and Word is the lingua franca of humanities scholarship, or so it seems. I can’t wait to show my digital humanities seminar what a Word document actually looks like.)
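If there is a whole folder of RTF files to get through, the same command can be wrapped in a loop. What follows is only a sketch: `convert_all` is a helper name I made up, not something that ships with `textutil`, and on a machine without `textutil` it simply reports what it would have done.

```shell
# convert_all is a hypothetical helper name, not part of textutil.
convert_all() {
  for rtf in "$1"/*.rtf; do
    [ -e "$rtf" ] || continue            # no RTF files: skip the unmatched pattern
    html="${rtf%.rtf}.html"              # foo.rtf becomes foo.html
    if command -v textutil >/dev/null 2>&1; then
      textutil -convert html "$rtf" && echo "wrote $html"
    else
      echo "textutil not found; would have written $html"
    fi
  done
}

convert_all .    # convert everything in the current directory
```

To point it somewhere else, swap the `.` for a folder, as in `convert_all ~/Documents/afs` (that folder name is, again, just an example).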