A screen shot of what R Studio looks like when you start it up for the first time: one page is missing.
I almost posted this on the new [Bamboo DiRT Wishlist][bdw], but then I thought better of it. The wish list needs to function as a straightforward wish list for a while before someone goes “meta” with it.
But my wish remains: a brief scan of the list of tools already available reveals that a lot of them do much the same thing. That in itself is not a bad thing, but somewhere I think it would be useful if we had a list of the kinds of things that often get done and what they might mean for various kinds of approaches to humanistic objects and topics.
For example, a lot of tools can quickly break texts into word lists (*oh, those bags of words!*) and they can produce various kinds of outputs based on those lists: filter out function words, perhaps filter out proper names or words that appear fewer than a certain number of times, and advanced users can filter out words by part of speech. But what does it mean to get at the rich “middle” of words that appear neither too frequently nor too infrequently in a text? Can we have some discussion on what we are up to?
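To make the filtering described above concrete, here is a minimal sketch of pulling out that “middle” band of words. The stop list, the thresholds, and the function name `middle_band` are all assumptions made for illustration; none of them comes from a particular tool on the list.

```python
import re
from collections import Counter

# Hypothetical stop list for illustration; real tools ship much longer ones.
STOP_WORDS = {"the", "a", "and", "of", "to", "he", "was", "his", "i"}

def middle_band(text, stop_words=STOP_WORDS, min_count=2, max_count=50):
    """Return words that are not stopped and fall between the two thresholds."""
    # Lowercase the text and keep runs of letters and apostrophes as tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t not in stop_words)
    return {w: n for w, n in counts.items() if min_count <= n <= max_count}

sample = "The hunter and the hunted. The hunter waited. The jungle waited too."
band = middle_band(sample, min_count=2, max_count=5)
print(sorted(band.items()))
```

Every choice here (what counts as a token, which words are stopped, where the thresholds sit) changes what ends up in the “middle,” which is exactly the kind of decision worth discussing.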
Let me give an example from my own current work: I have developed a small collection of narratives. They are all legends, all from this part of the world (Louisiana), and they have all been transcribed by folklorists from audio recordings, and so I can assume, at least for the time being, a fair amount of fidelity to what was actually said by the speaker. (Two of the texts are, in fact, from my own field work.) My goal with this small collection of texts is to explore them in various ways to see what computational methods tell me interesting things and which ones seem to be less fruitful in this context. That’s not hard to do when you have roughly two dozen texts which range in size from a hundred some odd words to a little over one thousand words.
In fact, this particular range in size is one of the things that I find interesting. The shortest legend weighed in at a mere 153 words. The longest was nearly seven times its size at 1015. The two stories I had collected were firmly in the middle at 375 and 655.
Now, let’s leave off for the time being that where a story starts and where it stops is not something we can necessarily declare with absolute certainty. Parts of conversation tend to wash over both ends of a story. Indeed, some stories tend to invite conversation at certain points or throughout their performance. But let’s do leave that boundary issue and let’s instead assume that an analyst can reliably depend on fellow analysts working in the same discipline to make judgments of like enough nature to his that texts are comparable and thus so are their word counts.
One of the audience members at this year’s meeting of the International Society for Contemporary Legend Research put it rather succinctly: *Why the hell are you counting words?*
Good question. I didn’t have a terribly good answer at the time, but here’s what I offered up and I stand by it:
1. First, I am fascinated by the idea that a text as small as 153 words can be said to accomplish all the things we now believe happen within a unit of narrative: the holding of an audience’s attention in a discursive situation, the creation of a storyworld, the deployment of a sequence of events that listeners perceive as having continuity.
2. Second, we have to start somewhere. Folklorists have an approximate sense of the size of various kinds of texts, but we do not have any quantified sense of their size. I am not suggesting that we will ever arrive at a moment where our definition of legend includes something like “a narrative of x type that ranges in size from 153 to 1015 words.” Rather, following David Herman here, I think we will find ourselves with better definitions that describe a kind of centrality of the kind “most legends range in size from x to y.” There are always going to be outliers: if we introduce numbers, we typically are doing so for statistical purposes, and so that gives us the chance to look at mean, median, and mode as well as things that stand outside those dimensions.
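As a small illustration of the descriptive statistics mentioned in the list above, the sketch below uses only the four word counts given earlier (the shortest legend, the two collected stories, and the longest). The counts for the rest of the collection are not given, so this is not a description of the whole set.

```python
from statistics import mean, median

# Only the four word counts mentioned in the post; the other texts'
# counts are not listed, so these figures are illustrative only.
word_counts = [153, 375, 655, 1015]

print("range :", min(word_counts), "to", max(word_counts))
print("mean  :", mean(word_counts))
print("median:", median(word_counts))
print("ratio of longest to shortest:", round(max(word_counts) / min(word_counts), 2))
```

With all two dozen counts in the list, the same few lines would give the kind of “most legends range in size from x to y” statement described above.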
But, look, this is a long digression from the topic with which I began. Let me bring it back into the fold by noting that my point is that we do not yet have, within my own field of folklore studies, let alone in the broader humanities, any sense of what kinds of quantification matter. We have lots and lots of tools that quantify things, sometimes in ways that appear almost magical: think of the moment you first glimpsed a word cloud, a projection (visualization) of a word frequency list that is so intuitively “right” that many people use them without really having any idea what lies behind them.
And think, too, of all the people using on-line word cloud generators that have built-in stop word lists and who never stop to think about what it means for a word to be on such a list and that perhaps some of the words that are getting “stopped” are potentially part of a pattern they are seeking to discern? Humanists are, in some fashion, playing willy-nilly in a park created by linguists and computer scientists. It’s fun to jump on their various play sets, to keep the metaphor going for just a moment longer, but sometimes all you end up with is dizziness and a tumble off the carousel. (And with that, the conceit went too far.)
Linguists built the tools with certain questions in mind. Linguistics as a discipline has historically been very focused on the sentence as its chief unit of analysis. Sentences and words. (I find this to be true as I explore various corpora, but this is my own auto-didactic impression and may be entirely out of line.) Folkloristics has historically operated one level of discourse up from linguistics and one level down from literary studies, focusing its efforts on single texts or series of single texts as discrete units of discourse in various forms of human behavior. The kind of precision that linguistics long ago achieved is still emergent in folkloristics, and we lag substantially behind in quantitative descriptions of the materials with which we work. (There was a moment in the 50s and 60s when a variety of metric efforts were attempted, e.g., labometrics, choreometrics, etc., but those efforts were displaced with the turn towards performance. If you are a folklorist, keep an eye out for the essay from Jonathan Goodwin and me which attempts to quantify the “turn” within folklore’s intellectual history.)
I suspect we will need to wander about in quantities before we reach a moment where we can have a fuller discussion about what we are doing and why, but I would love to see that conversation sooner rather than later, if only because I really want to play with more ideas.
**Side note**: For those wondering why I am worried about a small collection of legends — I dare not call it a *corpus* yet — I am interested in being able to automate the process of determining a morphology for narratives. I’m not yet convinced that the CS solutions I’ve seen are very good. And I think they start with texts that are too big, too complex. My end goal here is to begin to map out the relationship between ideas (ideologies as networks of ideas as glimpsed through texts) and the narratives that contain / convey / shape / are shaped by them. Sort of network meets syntax. Parallel meets sequence.
I have, for the past several years now, introduced my undergraduate students to some elements of textual analysis using computational methods. I use text analytics here only tentatively: many readers will perhaps be more familiar with, and indeed prefer, the older term of text mining, but for me that term is close to data mining, which usually refers to working with a great many more texts than I am going to discuss here, and that, I suppose, is why the term data mining has morphed into big data. (More on this anon.)
The number of texts here is one. That’s right: one text, and, in fact, the one text is a short story. Keeping the text small is one way I have found to keep any psychological barriers to entry low when I introduce students to text analytics. The particular short story I have chosen in the past, and which I use here, is Richard Connell’s “The Most Dangerous Game.” The 1924 story has two advantages when working with students: first, it is in the public domain, and, second, the text’s story has been so widely adapted that students are already familiar with it and have probably seen an adaptation of the story in some fashion within their own recent memory. E.g., only a year or so ago, the FX Network’s adult cartoon series, Archer, featured a version of the story entitled, “El Contador” (The Accountant).
If we are to use a computer to make possible certain kinds of analysis of texts, what are the kinds of things we might like to know?
- First, we want to know the overall length of the story, and we want to know some basic information like how many words, sentences, and paragraphs are used to tell the story, or make the argument in the case of essays. (This kind of information is useful for later mapping out the overall shape and structure of a text.)
- Second, we want to know how many of the words in the word count are actually unique, what those words are, and which words get used the most often and which the least. (This establishes the vocabulary used, points to any particular registers, and begins to reveal the interaction between words and meaning.)
- Third, using the word frequency distribution, the fancier term for counting individual words, we want to visualize the text both as a graph and as a word cloud. In doing so, we can begin to “see” for ourselves which words matter and which words don’t matter. (This introduces the idea of function words in the form of a word stop list.)
- Fourth, we want to use our new-found insight into word usage to examine particular instances: we want to see words in context and we want to see what words mean within the context of a particular text. (This highlights the role of context and offers a companion to meaning to be found simply in the words used.)
With that list in mind, I would like to introduce Python for Text Analytics, or PyTA — pronounced more like the genre of painting than the flatbread. The repository contains the text of the story as well as the scripts that will produce the results outlined above. Please note that, at this stage, the scripts are designed to be run with the target text inside the same folder (directory) as they are. If you want to use a different text, simply copy and paste it into the folder and change the filename in the script. The ReadMe file explains how to save the output of the scripts to a file, which will come in handy for Step 3 above.
For those readers already familiar with Python, and by familiar I mean you already have it installed and know how to access it, you can skip the next bit. For everyone else, a bit of review won’t hurt. Some people are going to want to know why I am doing all this in Python and not using off-the-shelf solutions. This is not the moment to engage in a recapitulation of all the usual arguments in favor of open source software, how it not only parallels the academy’s ideals but how it practically makes possible the spread of ideas in a world sometimes hostile to such spread. What matters here is:
- Python is free, widely-available, and has a large community to help anyone interested in using it.
- Python runs on all three major operating systems: Windows, Linux, and Mac.
- The Python scripts used here are similarly free and available for users to do with them what they want.
The scripts are, in fact, in the public domain under the Creative Commons license for doing so. The links for the scripts take users to a GitHub page from which they can be downloaded, or, if you have a GitHub account yourself, you can fork the repo itself. Please feel free to do either.
Finally, please note that a basic working installation of Python will let you perform Steps 1-3 above. If you are interested in looking at words in context and in examining other kinds of relationships between words, Step 4, then you are going to need to have the Natural Language Toolkit installed. It’s not difficult to do so. If you are using a Mac, I posted instructions last year. Instructions for getting Python installed on a Windows PC are available, with further instructions for the NLTK also available. I assume everyone else is running Linux or BSD and, really, you don’t need my help. (Please note that there are a variety of suggested ways of getting the NLTK installed on a Mac, but the MacPorts route is really the way to go. Trust me: I’ve gone some of the other ways.)
### Questions Raised by Counting
Now we can start working with an actual text and looking at some actual numbers. Every time I do this with students I find it helpful to have a conversation about how these numbers won’t in and of themselves tell us much about the text, but the various features they reveal or the questions they lead us to ask are useful. These numbers can’t draw conclusions, that’s the job of the human analyst, but they can provoke inquiry. And, sometimes, they reveal dimensions of a text that maybe we would not have thought about without seeing it quantified.
With that noted, let’s plunge into some rough stats for “The Most Dangerous Game,” which we get, simply enough, by running the first of the scripts.
And it prints out the following, which we can copy and paste anywhere:

    COUNTS
    Paragraphs    : 205
    Section Breaks: 0
    Sentences     : 717
    Words         : 7959

    AVERAGES
    Sentences per paragraph: 3
    Words per paragraph: 38

So an 8000 word story told in two hundred paragraphs and seven hundred sentences. (According to Lee Masterson, this puts MDG in the territory of the novelette, which seems odd to me or an index of how things have changed. If you want to know the other counts, see this Askville Answer.)
The other counts, as I called them in this script — feel free to change that — are for paragraphs and sentences. These numbers in and of themselves aren’t terribly interesting, until you play with them a bit, as I did to get the two averages: neither of which is something we typically discuss when examining texts — and probably the only reason most of us, and especially our students, are familiar with word counts is because we have had to deal with either minimums or maximums. We rarely think of them as having any kind of significant descriptive power. And yet when we combine some of these counts we end up with some interesting averages.
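For readers curious about what such counting involves, here is a minimal sketch. It is not the script from the repository: the function name `rough_counts` and the rules for what counts as a paragraph or a sentence are assumptions, and different rules will give different numbers. The averages use floor division because the figures reported above (717 sentences and 7959 words over 205 paragraphs, giving 3 and 38) look truncated rather than rounded; rounding would give 39 words per paragraph.

```python
import re

def rough_counts(text):
    """Count paragraphs, sentences, and words with deliberately simple rules."""
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    # Assumption: a sentence ends with ., !, or ? (abbreviations will miscount).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Assumption: words are whatever is separated by whitespace.
    words = text.split()
    n_par = max(len(paragraphs), 1)  # avoid dividing by zero on empty input
    return {
        "paragraphs": len(paragraphs),
        "sentences": len(sentences),
        "words": len(words),
        "sentences_per_paragraph": len(sentences) // n_par,
        "words_per_paragraph": len(words) // n_par,
    }

sample = "It was dark. He swam on.\n\nA voice called out. Who was there?\n"
counts = rough_counts(sample)
for label, value in counts.items():
    print(f"{label}: {value}")
```

Dialogue is where rules like these get tested: a quoted question followed by “he said” is one sentence to a reader but two to this splitter.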
The first one, sentences per paragraph, is striking. Three? That seems like a terribly small number, which drives most readers to look at the story more closely. What they discover, as they skim through the pages of the PDF version of the story is that a great deal of the story is told in dialogue. There is so much dialogue that you have to scour the story for the moments of non-talking action. There are, in fact, two passages of extended narration of action: the first occurs when the ship on which we first meet Rainsford sinks and he has to make his way to the island and the second is the famous game itself.
These moments of action help to delimit the principal sections of the text:
- At sea and Rainsford’s arrival on the island
- At the chateau and the dialogue between Rainsford and Zaroff which reveals the nature of the game
- The hunt itself and the story’s conclusion
A fun thing to do is to copy and paste the text of these three sections into an image so students can “see” the story in its entirety:
For those familiar with the Hollywood Formula, the story meets the idealized ratio of 1-2-1 of content pretty closely. Closer to our topic at hand, the quick visualization also lets us see that the first and second sections have a lot of thin lines, representing paragraphs that are made up mostly of dialogue, and that the third section has some fatter lines. If we take our new insight into the text and do a little counting of paragraphs and words in these sections, we get the following results:
- At Sea: 1029 words / 37 paragraphs = 28 words per paragraph
- At the Chateau: 4407 words / 129 paragraphs = 34 words per paragraph
- The Hunt: 2367 words / 38 paragraphs = 62 words per paragraph
Words per paragraph is an odd measure, but one that reveals that the “action” parts of the story actually take place in longer paragraphs.
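The arithmetic behind the three figures above is simple enough to redo by hand, but a short sketch makes it easy to repeat for other texts. The word and paragraph counts are the ones listed above; the dictionary layout is just one convenient way to hold them.

```python
# (words, paragraphs) for each section, as listed above.
sections = {
    "At Sea": (1029, 37),
    "At the Chateau": (4407, 129),
    "The Hunt": (2367, 38),
}

# Round to the nearest whole word, which reproduces the figures in the list.
ratios = {
    name: round(words / paragraphs)
    for name, (words, paragraphs) in sections.items()
}

for name, ratio in ratios.items():
    print(f"{name}: {ratio} words per paragraph")
```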
### The Words of the Story
Let’s find out a bit more about the words themselves by running the next script, `words.py`. This script produces quite a bit of output, and so my best advice is to capture it by entering the following at the command line:

    python words.py > mdgwords.txt

This tells the command shell to send the output of the script to the file `mdgwords.txt`. You can name the file anything you want. Or you don’t need to direct the output to a file: you could just watch the output fly by and then copy and paste it into a file. I am asking you to handle output like this, instead of writing it to a file for you, because I am working on making it possible for you to work with texts of your own choosing without editing the script. (I’m getting there, I’m getting there.)
At the top of the file you’ll see a line of redundant information and a line of new information:

    Words in text: 7959
    Unique words: 1987

This is some new data. For fun, I like to give students the task of figuring out the mean for the words in a story: here it’s obviously something like four occurrences per word. (4.0055 to be exact.) But that’s obviously not what really happens. Below a dashed line in the text, students will see a list of words that begins like this:

    Sorted by highest frequency first:
    the,505
    he,248
    a,248
    of,172
    and,162
    i,154
    to,148
    was,140
    his,137
    rainsford,117

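One quick way to show students why the mean misleads is to compare it with how much of the story the top ten words alone account for. The sketch below uses only the numbers printed above; the variable names are mine.

```python
# The ten most frequent words and their counts, copied from the list above.
top_ten = [
    ("the", 505), ("he", 248), ("a", 248), ("of", 172), ("and", 162),
    ("i", 154), ("to", 148), ("was", 140), ("his", 137), ("rainsford", 117),
]
total_words = 7959
unique_words = 1987

mean_per_word = total_words / unique_words
top_ten_total = sum(count for _, count in top_ten)
share = top_ten_total / total_words

print(f"mean occurrences per unique word: {mean_per_word:.4f}")
print(f"top ten words: {top_ten_total} of {total_words} tokens ({share:.1%})")
```

Ten of the 1987 unique words account for roughly a quarter of everything on the page, which is why “four occurrences per word” describes almost no actual word in the story.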
*The* occurs five hundred times in this short story? (It’s every sixteenth word by my count.) At this point, I like to talk about the list in its raw form above or I will ask students to trim off the first few lines of the file above and save the document as a comma-separated value file (`.csv`). Once the file consists of only the word, number pairing seen above, it can be imported into Excel, where it can be easily turned into a bar chart. The first eight words in the list dominate any visualization; the same thing can be seen in a word cloud when no common words are dropped (or stopped). (A built-in word cloud script is in the works, but is not currently available.)
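For anyone who would rather produce the comma-separated file directly, here is a minimal sketch of building a frequency list and writing it as `word,count` rows. It is not the `words.py` script itself: the tokenizing rule and the function names are assumptions. It sorts by highest frequency first and then alphabetically.

```python
import csv
import io
import re
from collections import Counter

def frequency_rows(text):
    """Return (word, count) pairs, highest count first, ties in alphabetical order."""
    # Assumption: a word is a run of letters or apostrophes, case ignored.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))

def write_csv(rows, handle):
    """Write a header line and one word,count row per word."""
    writer = csv.writer(handle)
    writer.writerow(["word", "count"])
    writer.writerows(rows)

sample = "The sea was dark. The night was dark, and the sea was warm."
rows = frequency_rows(sample)

# An in-memory buffer stands in for a file so the example leaves nothing behind.
buffer = io.StringIO()
write_csv(rows, buffer)
print(buffer.getvalue())
```

To write a real file instead, open one with `open("mdgwords.csv", "w", newline="")` and pass that handle to `write_csv`; the resulting file opens directly in Excel.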
It may sound strange, but I have found it very effective to work my way through the list of words with a class with an Excel spreadsheet projected at the front of the room: we highlight the words we think are interesting. Regularly what we find is that we have to scroll past the first screen of words before the first words that seem interesting to us, based on our reading of the story, begin to turn up:
We have to go even further down before we begin to see a large percentage of words that seem significant. (Please note that the current version of the script sorts words first by their frequency and then alphabetically.)
This interesting middle range of words continues for a while until words begin to drop in usage: I regularly find that three occurrences is about the threshold for short stories for most readers.
Again, all of this simply prompts readers to ask more and better questions. As someone for whom the language of a text is terribly important, I find it terribly ironic that it’s numbers that make that point best with some readers.
### Words in Context
With this list of words in hand, it’s time to re-engage the text. I find it useful to assign students each a word with a high-value frequency, a middle value, and a low value. In the past, I have asked them simply to use the Find feature in their PDF viewer, but, if you have installed the NLTK, you can try out the next script, `concordance.py`, which relies on several functions available through the NLTK that make it possible to work with texts. If you open the script, you can see for yourself, but if you simply run it, you will see:

    % python concordance.py
    Enter the word you would like to see in context:

If you enter the word hunter, the program will print out:

    Building index...
    Displaying 10 of 10 matches:
    ld , " agreed Rainsford . " For the hunter , " amended Whitney. " Not for
        ainsford. " You ’ re a big-game hunter , not a philosopher. Who cares
    been a fairly large animal too. The hunter had his nerve with him to tac
    st three shots I heard was when the hunter flushed his quarry and wounded
    . Sanger Rainsford , the celebrated hunter , to my home. " Automatically
    . " They were no match at all for a hunter with his wits about him , and
        nsford. " " Thank you , I ’ m a hunter , not a murderer. " " Dear me
    ling of security. Even so zealous a hunter as General Zaroff could not
         a small automatic pistol . The hunter shook his head several times ,
    a spring. But the sharp eyes of the hunter stopped before they reached the

Such an output offers a very quick and easy way to see all the uses of hunter piled on top of each other. The first two lines in our results raise another question, though, since they turn up the use of the word hunter in the dialogue between Rainsford and his first interlocutor, Whitney, which contains in it the line that foreshadows much of the rest of the story: The world is made up of two classes--the hunters and the huntees. That’s an important line, but it doesn’t show up in our search for hunter because from a computer’s point of view hunter and hunters are not the same word, or token to use the more precise term from linguistics. (Is there a way to see hunter, hunters, and huntee all in the same list? There is, but it involves a bit more scripting than will reward us at this moment. You can see the draft script in the UPST directory.)
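To be clear, what follows is not the draft script mentioned above, and it does not use the NLTK. It is only a minimal sketch of one way to see hunter, hunters, and huntees in the same list: match tokens with a regular expression instead of an exact word. The function name `kwic` (keyword in context), the `width` parameter, and the two sample sentences are chosen for illustration.

```python
import re

def kwic(text, pattern, width=30):
    """Return one keyword-in-context line for every match of pattern in text."""
    lines = []
    for match in re.finditer(pattern, text, flags=re.IGNORECASE):
        start, end = match.span()
        # Take up to `width` characters on each side of the match.
        left = text[max(0, start - width):start].replace("\n", " ")
        right = text[end:end + width].replace("\n", " ")
        # Right-align the left context so the matched words line up in a column.
        lines.append(f"{left:>{width}}{match.group(0)}{right}")
    return lines

sample = (
    "The world is made up of two classes--the hunters and the huntees. "
    "Thank you, I'm a hunter, not a murderer."
)

# \bhunt\w* matches any word that begins with "hunt".
matches = kwic(sample, r"\bhunt\w*")
for line in matches:
    print(line)
```

The pattern is deliberately greedy: it will also pull in hunt, hunted, and hunting, which may or may not be what an analyst wants. Deciding which forms count as “the same word” is an interpretive choice, not a technical one.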
In the next iteration of Text Analytics, and the Useful Python Scripts for Texts, we will take a look at things like collocations, bigrams, synonyms within a text, and other relations.
Thanks for reading. I hope this helps convince you how easy this kind of analysis is and, at the same time, how rewarding it can be. I have found all of these techniques especially powerful when working with undergraduates. Word clouds have become extremely popular, and they are a powerful visualization tool, but they really should represent the middle or end of an analytical process, after analysts have given some thought to the particular words involved.
In my freshman introduction to academic writing, we do some reading, because, after all, you need something about which to write. I focus on a small group of texts because students can hold the evidence in their hands and because I teach how to think and work with texts when I am not introducing people to academic writing. That is, I assume that a biologist teaching an introduction to academic writing would use biological data as the basis for her course. That English professors are uniquely situated to teach academic writing, broadly construed, is something for another conversation. Or perhaps it is an imperial move on the English department’s part, to claim all of academic writing when what we know how to do, and thus can claim to teach, is writing about texts.
So we read a small number of texts, two of which are short stories and two of which are screenplays. All four of the texts are available to the students as both plain texts and PDFs. I have, in the past, used a collection of Java apps (applets) that allow students to do things like create word frequency lists, create word clouds, or examine word collocations. (For the latter I am entirely indebted to James Danowski for his excellent Wordij.) While running these apps does introduce students to the command line, it does not do much beyond that, and I would like, no matter how silly this seems, at least to introduce them to the idea that they can use a scripting language to do things in various dimensions of their lives. Plus I hope that, like me, they discover that learning to code is also a way to learn another way to think.
And so I have begun a hunt for a collection of Python scripts that do some of the things we already do in class and perhaps some scripts that take us new places.
*Please note that as of December 21 — Happy Mayan Apocalypse Day! — this post is still in process and this material is not yet curated. Plus, I’m really looking for feedback from readers on what kinds of text analysis they would want students to do. Keep in mind that this is not “big data” but single texts or a very small collection of texts.*
Okay, first thing we already do is generate a word frequency list, which we visualize both as bar charts and as word clouds. What good does this do? Well, first, it introduces the idea of *function words*, words which must be present in discourse for it to go but to which we, apparently, attribute very little meaning. Just as important as this idea is the idea that in addition to function words there is a list of other words within a text which do not have a significant impact on its meaning and which can be ignored: stopword lists are great for this because students get to make this happen, quite mechanically, and then see the results in their much more focused, and interesting, word clouds.
One thing that might be useful to add here is a script that lemmatizes the words in a text, or its resulting list of words.
Someone asked a question on Stack Overflow about [how Wordle creates its word clouds], and they got answers from a lot of people, including Wordle’s own Jonathan Feinberg. In particular, Reto Aebersold posted a link to his [PyTagCloud on GitHub]. There is also a link to someone creating a word cloud with Processing, but that’s for another time. (I am thinking, for a technical writing course, how we could take some of these outputs, feed them into Processing, and then have some sort of real world output, using something like Arduino. Oh, yeah, I’m ready to have some fun.)
And then there’s this interesting bit of code, [Story Statistics on DaniWeb]. I’m
### About the Texts
For those who are curious, the texts are:
* Richard Connell’s “The Most Dangerous Game”
* Fredric Brown’s “Arena”
* Star Trek (The Original Series) “Arena”
* Star Trek: The Next Generation “Darmok”
[how Wordle creates its word clouds]: http://stackoverflow.com/questions/342687/algorithm-to-implement-a-word-cloud-like-wordle
[PyTagCloud on GitHub]: http://github.com/atizo/PyTagCloud
[Story Statistics on DaniWeb]: http://www.daniweb.com/software-development/python/code/228125/story-statistics-python
For those who use or access Macs, I just wanted to point out that videos of this year’s [WWDC sessions][wwdc2012] are up and they have a session on “Text and Linguistic Analysis.” All the sessions are at the URL below. I watched last year’s session on [latent semantic analysis][wwdc2011], which is also baked into the Mac OS, and it was quite good. You can watch the videos on-line or get them through iTunes to watch off-line, and they are also available as PDFs. (The site is worth checking out just to see how well it is designed.)