So nicely done:
During a clean-out of my email application this morning (and we won’t discuss how badly I have been managing incoming mail of late), I came across a [Humanist DG] post, in response to an inquiry about text analysis software, pointing to [Diction]:
> DICTION is a computer-aided text analysis program for determining the tone of a verbal message. DICTION searches a passage for five general features as well as thirty-five sub-features. It can process a variety of English language texts using a 10,000 word corpus and user-created custom dictionaries. DICTION currently runs on Windows® on a PC; a Mac® version is in development.
> DICTION produces reports about the texts it processes and also writes the results to numeric files for later statistical analysis. Output includes raw totals, percentages, and standardized scores and, for small input files, extrapolations to a 500-word norm.
Okay, so they like to capitalize themselves. We get it.
Digging a little further into its features, you get a bit more information on the five general features:
> DICTION … uses a series of dictionaries to search a passage for five semantic features — Activity, Optimism, Certainty, Realism and Commonality — as well as thirty-five sub-features. DICTION uses predefined dictionaries and can use up to thirty custom dictionaries built with words that the user has defined, such as specific negative and positive words, for particular research needs.
And then there’s a bit more on the word lists:
> DICTION uses dictionaries (word-lists) to search a text for these qualities:
> * Certainty – Language indicating resoluteness, inflexibility, and completeness and a tendency to speak ex cathedra.
> * Activity – Language featuring movement, change, the implementation of ideas and the avoidance of inertia.
> * Optimism – Language endorsing some person, group, concept or event, or highlighting their positive entailments.
> * Realism – Language describing tangible, immediate, recognizable matters that affect people’s everyday lives.
> * Commonality – Language highlighting the agreed-upon values of a group and rejecting idiosyncratic modes of engagement.
> DICTION output includes raw totals, percentages, and standardized scores and, for small input files, extrapolations to a 500-word norm. DICTION also reports normative data for each of its forty scores based on a 50,000-item sample of discourse. The user may use these general norms for comparative purposes or select from among thirty-six sub-categories, including speeches, poetry, newspaper editorials, business reports, scientific documents, television scripts, telephone conversations, etc.
> On a computer with a 2.16 GHz Intel chip and 2 GB of RAM, DICTION can process 3,000 passages (1,500,000 words) in four minutes. The program can accept either individual or multiple-passages and, at your discretion, it provides special counts of orthographic characters and high frequency words.
Just to make sure I understand this: the “semantic” features here are really lists of words, which come pre-populated or which you can modify or create, and the additional thirty-six sub-categories are really different comparison corpora? Am I wrong in perceiving this as a more nuanced version of sentiment analysis, one that still operates in much the same way by depending upon certain pre-determined word lists?
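If that reading is right, the mechanics can be illustrated in a few lines of Python. This is a toy sketch of dictionary-based scoring; the word lists below are hypothetical stand-ins, not DICTION’s actual (proprietary) dictionaries, weights, or normalization:

```python
# Toy illustration of dictionary-based "semantic feature" scoring.
# The word lists are hypothetical stand-ins, not DICTION's own.
FEATURES = {
    "certainty": {"always", "never", "must", "all", "every"},
    "activity": {"move", "change", "build", "run", "push"},
}

def feature_scores(text):
    """Score a passage as the percentage of its words
    that appear on each feature's word list."""
    words = text.lower().split()
    return {
        name: 100.0 * sum(w in wordlist for w in words) / len(words)
        for name, wordlist in FEATURES.items()
    }
```

The point of the sketch is that a “feature” here is nothing more than membership on a list; all the analytic weight rests on how the lists were built.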
There’s a lot, to be sure, that I don’t know about the history of CATA (computer-assisted textual analysis, which is my acronym for the day!), and there are certainly approaches whose nuances I do not yet fully understand. I think this must be one of them.
[Humanist DG]: http://dhhumanist.org
I was working on a post that outlines my own version of “Text Analytics 101,” which I have been using in freshman writing classes for the past three years, and I found myself considering, momentarily, the uses of “text mining” versus “text analytics” and “data mining” versus “big data.” I’m sure there are distinctions to be made between the terms in each pair, but it’s also the case that terms map onto various disciplines/domains and/or historical moments. A quick ngram search in Google, which is based on Google Books, produced the following graph:
A similar search for the first pair produced the following:
The only thing the two graphs suggest to me is that, possibly, the latter terms appeared later and thus haven’t yet made it into print. I would like to do a similar search of ngrams on the web, but I haven’t found the same simple interface for doing this kind of quick survey.
[Topic Modeling in Mallet](http://mallet.cs.umass.edu/topics.php)
There was a terrific presentation on latent semantic mapping at this year’s Worldwide Developers Conference. It was not only a great overview of latent semantic mapping (LSM) itself, but it also revealed that LSM is now, or will be in Lion, built into the operating system. It will be available as a command-line tool as well as an API that can be called from native applications. The presentation also has a couple of nice case studies.
All of the presentations from this year’s WWDC are available, for free, on-line. Registration is required, but that is free, too. (I am no longer clear on the different classes of developers at Apple: one can sign up to be a Mac OS developer, an iOS developer, or a Safari extension developer. And you can still sign up to be a generic developer, as I have been for several years now.)
[Here’s the link to the LSM presentation that will work once you are registered.](https://developer.apple.com/videos/wwdc/2011/?id=136)
Well, I have my first successful Python script, but it isn’t quite doing what I want:
```python
#! /usr/bin/env python
import nltk

input = open('/Users/john/Dropbox/python/mm.txt', 'r')
text = input.read()
lexicon = sorted(set(text))
print lexicon
output = open('/Users/john/Dropbox/python/mm-wf2.txt', 'w')
```
Great stuff. Unfortunately my results are less than stellar — and, worse, while it will output to the command line, my code above creates only an empty file. Sigh.
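For what it’s worth, two things are going wrong: `set(text)` operates on the string character by character rather than word by word, and the output file is opened but never written to or closed. A minimal Python 3 sketch of the intended behavior, using a hypothetical sample string in place of the `mm.txt` path, might look like this:

```python
import re
from collections import Counter

def word_frequencies(text):
    """Lowercase the text, pull out words, and count each one."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

# Hypothetical stand-in for the contents of mm.txt
sample = "the cat sat on the mat and the cat slept"
freqs = word_frequencies(sample)

# Actually write the results out, most frequent word first
with open('mm-wf2.txt', 'w') as output:
    for word, count in freqs.most_common():
        output.write("%s\t%d\n" % (word, count))
```

The `with` block closes the file when it is done, which is what gets the counts onto disk rather than leaving an empty file behind.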
A while ago I tweeted a note to Digital Humanities Questions and Answers about putting together a Python or R script for getting a word frequency distribution for a text. The short explanation for why I want to do this is that it is one way to develop a drop (aka stop) list in order to tweak network analysis of texts or visualization of those texts using techniques like a word cloud. I am interested in a Python or R script in particular because I want my solution to be platform independent, so that students in my digital humanities seminar can use the scripts no matter what platform they use. (I had come across some useful `bash` scripts, but that limits their use to *nix platforms like Mac OS X or Linux.)
Handily enough, a word frequency distribution function is available as part of the Python Natural Language Toolkit (NLTK); the same functionality is also baked into R, as John Anderson demonstrated, but for now I am focusing my scripting efforts on Python.
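To sketch how a frequency distribution turns into a drop list: take the most frequent words in a text as candidates for exclusion. The version below uses only the standard library so it runs anywhere; NLTK’s `FreqDist` provides the same ranking for real corpora. The toy text is, of course, a hypothetical stand-in:

```python
from collections import Counter

def drop_list(text, n=5):
    """Return the n most frequent words in the text,
    as candidates for a stop (drop) list."""
    words = text.lower().split()
    return [word for word, _ in Counter(words).most_common(n)]

# Hypothetical toy text in place of a real corpus
toy = "the quick brown fox jumps over the lazy dog and the cat"
```

On any real text, the top of such a list is dominated by function words (“the,” “of,” “and”), which is exactly what you want to drop before building a word cloud or network.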
### Getting up and running with NLTK
To get up and running with NLTK in Python, you first need a fairly recent version of Python: 2.4 or better. (My MacBook is running 2.6.1, which is acceptable, and I’m not yet confident enough to update.)
In addition to a recent version of Python, and in addition to the NLTK (more on that in a moment), you also need PyYAML. All the downloads for PyYAML are available here: http://pyyaml.org/download/pyyaml/. (Please note that from here on out I am describing the installation process for Mac OS X: the Windows routine uses different flavors of these resources — there is a PyYAML executable installer, for example.)
Download the tarballed and gzipped package and unpack it someplace convenient. (You are going to delete it when you are done, so the location doesn’t matter.) I put my copy on the desktop, and so, having unpacked it, I navigated to its location in a terminal window:
`% cd /Users/me/Desktop/PyYAML-3.09`
(Please note that the presence of the `%` sign is simply to indicate that we are using the command line.) Once there, you run the setup module:
`% sudo python setup.py install`
From there, a whole lot of making and installing takes place as your terminal window scrolls quickly; it’s done within seconds. Now you need to download the appropriate NLTK file; mine was here:
This time it’s a GUI-based installer package. Follow the instructions, click on things, and you are done.
To check to make sure everything got done that needed to get done, return to your terminal window and invoke the Python interpreter:

`% python`
At the Python interpreter prompt (`>>>`), type:
`>>> import nltk`
If everything went well, all you will get is a momentary pause, if any, and another interpreter prompt. Congratulations!