–from Scientific American online: “Great literature is surprisingly arithmetic”
In case you haven’t been keeping up, the [Library of Congress hosts a number of blogs][blogs]. While some of them publish only infrequently, the overall amount of material available is really impressive.
I’m still trying to figure out what all I can do with the [Saffron] browser/visualizer. It claims to analyze the research communities of natural language processing, information retrieval, and the semantic web through “text mining and linked data principles.”
The list of research domains is rather short and under-explained for the uninitiated:
I clicked on [ANLP], which is *applied natural language processing*, and got both a list of hot topics:
As well as a taxonomy network/tree that offers labels when you hover over nodes, which are themselves clickable links:
Clicking on one of the “hot topics,” in this case [natural language text], gives you a bar chart of the frequency of the topic in documents for the past thirty years:
A list of similar topics:
A list of experts:
And a list of publications:
Like a lot of browsers, this kind of static presentation of the results impoverishes the exploration that it encourages. I also haven’t explored what its inputs are: I wonder how full/complete its historical record is.
[natural language text]: http://saffron.deri.ie/acl_anlp/topic/natural_language_text/
“Why count words?” It was a simple question[^cf1]. The person asking the question did not ask it in an overly skeptical, or hostile, fashion. He was honestly taken aback by a series of numbers I had rattled off that corresponded to a collection of texts, of legends, that I had assembled as my first step in my exploration of computational approaches to narrative. The illustration in front of the room had been a bar chart of sixteen legend texts, each collected by an established folklorist (and so the original oral texts were, I felt, reliably represented). The longest text in the collection was a little over one thousand words (1025); the shortest, only 150.
A multiplier of seven is not an order of magnitude in difference, but it is still enough of a spread that it bears further investigation. Mount Everest is, for example, seven times taller than Ben Nevis, the highest mountain in the British Isles. Climbing the former is considerably more prestigious than climbing the latter. The Gross Domestic Product of the U.S. is seven times greater than that of Brazil. The distance from New York to London is seven times greater than the distance from New York to Washington, D.C. The difference in the latter amounts to a change in continent and a trans-oceanic passage.
My initial answer to the question was simple: I counted words because I wanted to know if it is possible to create a story world using 150 words, and, if so, then I want to understand how that can happen. Given the size of a great number of literary forms, one thousand words is already amazingly concise, but 150 words? Each word must pack an incredible amount of power: something made even more amazing when one realizes that only half that number of words are unique in their usage in this little text. That is, one word alone, he, gets used twelve times. The next nine words that get used most often in this little legend are also fairly uninteresting: and, a, was, the, it, his, said, to, they. So a list of the text’s top ten words doesn’t reveal anything about the story itself, except that, perhaps, there is a singular figure, he, who is counterposed against a group of some kind, they. (It is only when we get to the next ten most often used words, all of which appear only two or three times in the text, that we begin to get a sense of what the story might be about: man, dog, with, when, went, there, saw, off, horse, controller.)
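For anyone who wants to try this kind of count themselves, here is a minimal sketch in Python. It is not the exact procedure used for the legend above: the `tokenize` and `profile` names, the tokenizing rule, and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and pull out runs of letters and apostrophes."""
    return re.findall(r"[a-z'’]+", text.lower())

def profile(text, top_n=10):
    """Return (total words, unique words, the top_n most frequent words)."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    return len(tokens), len(counts), counts.most_common(top_n)

# Placeholder sentence, not one of the legends discussed here.
sample = "He said he saw the dog, and the dog saw him."
total, unique, top = profile(sample, top_n=3)
print(total, unique, top)
```

How a tokenizer treats contractions, hyphens, and transcription marks will shift the totals a little, which is one more reason to state the rule whenever counts are reported.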
How is this possible? How can such a small subset of words from an already small text make a story go? That is, I think, the real question. Counting words is but one step along the way, but an important one, and one that we, as folklorists, have failed to undertake. Think for a minute of all the texts that are indexed in the great collection projects of the twentieth century. Add to them all the texts we have collected under the auspices of the ethnography of speaking. It’s an impressive amount of work, and while we have made some synthetic gestures, we have, by and large, mostly focused on differences. All of those differences are, of course, quite compelling, but in focusing on differences, we have also missed an opportunity to make attempts at larger kinds of claims about human nature and culture.
The impulse to count words, for me, is but one step towards a larger understanding of how humans think their way through the world through things of their own making. In the case of texts, they quite literally string one word after another, usually within the flow of a larger program of discourse that itself may or may not be conducive to text-making. Despite all the complexities, people in a variety of speech act contexts somehow decide to initiate a text, place one word upon another in a sequence they both anticipate and, at the same time, manipulate, until they are satisfied, in some fashion, with the result and, like a discursive Atropos, end the life of the string.
Counting words, then, is but one step towards a larger understanding of not only how many words, but which words, and in what order. Why these words and not others? And what is the relationship not only of these words used here to instantiate a story world, but of the actions within the story world to the human world within which they are embedded? In short, what can 150 words tell us about the relationship between words, ideas, and actions?
The great indices of the previous era of folklore scholarship took one step in this direction by attempting to map, mostly in bibliographic terms but indirectly in cartographic, the various texts that had been collected in the initial wave of the philological project. At the same time as Stith Thompson turned his great carousel to compile the Motif Index three-by-five card by three-by-five card, however, a few scholars and scientists were beginning to play with the idea of using computers, as slow and expensive as they were then, to compile statistics about texts[^cf2].
Statistics remains, for most humanists, either an enigma or an enemy. It represents, for many (and with good reason), a regime of mathematics, itself something of a mystery, which has been used too often to summarize a situation or a group of people when a more subtle form of analysis was needed. I will not, in this essay, defend its use in such contexts. Nor am I interested in defending, or capable of discussing, the larger statistical turn that so many forms of knowledge production have undertaken. I have only this, a reworking of a dite from my own childhood and perhaps yours too: just because others are doing it is not a reason for us to do it, too.
I understand very well the humanistic impulse to draw a line in the discursive sand and to cry out “the crunching of us into numbers ends here.” My suggestion here, at this metaphorical line lying before us, is that the crunching will go on and on, and it can do so either without us or with our efforts not only to humanize the crunching but also to stuff it so full of the human that it might very well turn into a new kind of science, a new kind of scholarship that will not only be interesting to others, but also to us as well.
One of the central requirements of statistics is that you must convert information (perhaps a simple little story about a treasure buried somewhere, perhaps a few dozen such stories, or perhaps several thousand) into data. But such a transformation amounts simply to assigning values, most often numbers but they need not be, to the objects that are central to the problem. The analyst defines the problem, and the analyst assigns the values. Folklore studies has already done this in the form of tale type numbers, and motif numbers, and even when we describe the process of contextualization of a particular text.
So why count words? Well, clearly one reason to do so is simply to explore texts and textuality, to satisfy our curiosity about the fundamental dimensions of human expressivity: the number of words in a text, the word clusters (or collocations) that occur within a text as well as the words that always appear in conjunction with others in particular kinds of texts (co-occurrences). A second reason to proceed in this fashion is to make it possible to discover relationships between texts that we have not yet discovered by more traditional means of study. Discovery, indeed the notion of indexing itself, is the chief reason behind so much of the effort in natural language processing, as we will discuss in a moment. The final reason is that seeing folklore texts in a new light, and seeing relationships between texts that we have not gleaned before, leads to new forms of knowledge, forms that need not displace but rather refine and extend current ways of knowing.
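To make the difference between collocations and co-occurrences a little more concrete, here is a rough sketch in Python. It assumes the crudest possible definitions: a collocation is a pair of adjacent words within one text, and a co-occurrence is a pair of words that appear together in the same text, wherever they fall. The function names and the two sample sentences are hypothetical.

```python
import re
from collections import Counter
from itertools import combinations

def tokenize(text):
    """Lowercase the text and pull out runs of letters and apostrophes."""
    return re.findall(r"[a-z'’]+", text.lower())

def bigram_counts(text):
    """Count adjacent word pairs within a single text."""
    tokens = tokenize(text)
    return Counter(zip(tokens, tokens[1:]))

def cooccurrence_counts(texts):
    """Count, for each unordered word pair, how many texts contain both words."""
    pairs = Counter()
    for text in texts:
        vocabulary = sorted(set(tokenize(text)))
        pairs.update(combinations(vocabulary, 2))
    return pairs

# Two made-up sentences standing in for a collection of texts.
texts = ["The man saw the dog", "The dog saw the horse"]
print(bigram_counts(texts[0]).most_common(2))
print(cooccurrence_counts(texts)[("dog", "saw")])
```

Real collocation measures usually weigh how often a pair occurs against how often each word occurs on its own; raw counts like these are only a starting point.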
[^cf1]: The first public presentation of this research project was at the 2013 meeting of the International Society for Contemporary Legend Research. I would like to thank that group for their incredible generosity and hospitality.
[^cf2]: The image of Stith Thompson sitting in a building dedicated to housing a carousel forty feet in diameter is one that I owe entirely to Henry Glassie.
And what it is isn’t exactly clear … but suddenly everyone is talking morphology. Here’s one for all the Rocky movies done visually:
I almost posted this on the new [Bamboo DiRT Wishlist][bdw], but then I thought better of it. The wish list needs to function as a straightforward wish list for a while before someone goes “meta” with it.
But my wish remains: a brief scan of the list of tools already available reveals that a lot of them do much the same thing. That in itself is not a bad thing, but somewhere I think it would be useful if we had a list of the kinds of things that often get done and what they might mean for various kinds of approaches to humanistic objects and topics.
For example, a lot of tools can quickly break texts into word lists (*oh, those bags of words!*) and they can produce various kinds of outputs based on those lists: filter out function words, perhaps filter out proper names, or filter out words that appear fewer than a certain number of times. Advanced users can also filter out words by part of speech. But what does it mean to get at the rich “middle” of words that do not appear too frequently or too infrequently in a text? Can we have some discussion on what we are up to?
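Here is roughly what that kind of filtering looks like when written out, as a small Python sketch. The stop list, the frequency thresholds, and the sample text are all illustrative assumptions, not what any particular tool actually uses.

```python
import re
from collections import Counter

# A tiny illustrative stop list; real tools ship much longer ones.
STOP_WORDS = {"and", "a", "was", "the", "it", "his", "said", "to", "they", "he"}

def middle_band(text, stop_words=STOP_WORDS, min_count=2, max_count=3):
    """Keep words that are not stop words and that occur between
    min_count and max_count times (inclusive)."""
    tokens = re.findall(r"[a-z'’]+", text.lower())
    counts = Counter(t for t in tokens if t not in stop_words)
    return {word: n for word, n in counts.items() if min_count <= n <= max_count}

# Made-up text, loosely in the spirit of a short legend.
sample = ("The man and his dog went off. The dog saw a horse, "
          "and the man went with the horse when the horse ran.")
print(middle_band(sample))
```

Every choice here (what counts as a function word, where the band starts and stops) changes which words survive, which is exactly the kind of decision worth discussing in the open.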
Let me give an example from my own current work: I have developed a small collection of narratives. They are all legends, all from this part of the world (Louisiana), and they have all been transcribed by folklorists from audio recordings, and so I can assume, at least for the time being, a fair amount of fidelity to what was actually said by the speaker. (Two of the texts are, in fact, from my own field work.) My goal with this small collection of texts is to explore them in various ways to see which computational methods tell me interesting things and which ones seem to be less fruitful in this context. That’s not hard to do when you have roughly two dozen texts which range in size from a hundred some odd words to a little over one thousand words.
In fact, this particular range in size is one of the things that I find interesting. The shortest legend weighed in at a mere 153 words. The longest was seven times its size at 1015. The two stories I had collected were firmly in the middle at 375 and 655.
Now, let’s leave off for the time being that where a story starts and where it stops is not something we can necessarily declare with absolute certainty. Parts of conversation tend to wash over both ends of a story. Indeed, some stories tend to invite conversation at certain points or throughout their performance. But let’s do leave that boundary issue and let’s instead assume that an analyst can reliably depend on fellow analysts working in the same discipline to make judgments of like enough nature to his that texts are comparable and thus so are their word counts.
One of the audience members at this year’s meeting of the International Society for Contemporary Legend Research put it rather succinctly: *Why the hell are you counting words?*
Good question. I didn’t have a terribly good answer at the time, but here’s what I offered up and I stand by it:
1. First, I am fascinated by the idea that a text as small as 153 words can be said to accomplish all the things we now believe happen within a unit of narrative: the holding of an audience’s attention in a discursive situation, the creation of a storyworld, the deployment of a sequence of events that listeners perceive as having continuity.
2. Second, we have to start somewhere. Folklorists have an approximate sense of the size of various kinds of texts, but we do not have any quantified sense of their size. I am not suggesting that we will ever arrive at a moment where our definition of legend includes something like “a narrative of x type that ranges in size from 153 to 1025 words.” Rather, following David Herman here, I think we will find ourselves with better definitions that describe a kind of centrality of the kind “most legends range in size from x to y.” There are always going to be outliers: if we introduce numbers, we typically are doing so for statistical purposes, and so that gives us the chance to look at mean, median, and mode as well as things that stand outside those dimensions.
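As a small illustration of what those summary numbers look like, here is a Python sketch that uses only the four word counts quoted earlier in this post (153, 375, 655, and 1015). That is an assumption of convenience: the full collection is not reproduced here, and the `summarize` helper is hypothetical. Mode is left out because four distinct values have no meaningful mode.

```python
from statistics import mean, median

# Only the four word counts quoted in this post, not the whole collection.
word_counts = [153, 375, 655, 1015]

def summarize(counts):
    """Return basic measures of centrality and spread for a list of text lengths."""
    return {
        "mean": mean(counts),
        "median": median(counts),
        "shortest": min(counts),
        "longest": max(counts),
        "ratio": max(counts) / min(counts),
    }

summary = summarize(word_counts)
for name, value in summary.items():
    print(f"{name}: {value:.1f}")
```

With so few values the mean and median are easily pulled around by a single long text, so a statement like “most legends range in size from x to y” needs many more texts behind it.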
But, look, this is a long digression from the topic with which I began. Let me bring it back into the fold by noting that my point is that we do not yet have within my own field of folklore studies, let alone in the broader humanities, any sense of what kinds of quantification matter. We have lots and lots of tools that quantify things, sometimes in ways that appear almost magical: think of the moment you first glimpsed a word cloud, a projection (visualization) of a word frequency list that is so intuitively “right” that many people use them without really having any idea what lies behind them.
And think, too, of all the people using on-line word cloud generators that have built-in stop word lists and who never stop to think about what it means for a word to be on such a list and that perhaps some of the words that are getting “stopped” are potentially part of a pattern they are seeking to discern. Humanists are, in some fashion, playing willy-nilly in a park created by linguists and computer scientists. It’s fun to jump on their various play sets, to keep the metaphor going for just a moment longer, but sometimes all you end up doing is getting dizzy and tumbling off the carousel. (And with that, the conceit went too far.)
Linguists built the tools with certain questions in mind. Linguistics as a discipline has historically been very focused on the sentence as its chief unit of analysis. Sentences and words. (I find this to be true as I explore various corpora, but this is my own auto-didactic impression and may be entirely out of line.) Folkloristics has historically operated one level of discourse up from linguistics and one level down from literary studies, focusing its efforts on single texts or series of single texts as discrete units of discourse in various forms of human behavior. The kinds of precision that linguistics long ago achieved are still emergent in folkloristics, and we lag substantially behind in quantitative descriptions of the materials with which we work. (There was a moment in the 50s and 60s when a variety of metric efforts were attempted, e.g., labometrics, chorometrics, etc., but those efforts were displaced with the turn towards performance.) (If you are a folklorist, keep an eye out for the essay from Jonathan Goodwin and me which attempts to quantify the “turn” within folklore’s intellectual history.)
I suspect we will need to wander about in quantities before we reach a moment where we can have a fuller discussion about what we are doing and why, but I would love to see that conversation sooner rather than later, if only because I really want to play with more ideas.
**Side note**: For those wondering why I am worried about a small collection of legends — I dare not call it a *corpus* yet — I am interested in being able to automate the process of determining a morphology for narratives. I’m not yet convinced that the CS solutions I’ve seen are very good. And I think they start with texts that are too big, too complex. My end goal here is to begin to map out the relationship between ideas (ideologies as networks of ideas as glimpsed through texts) and the narratives that contain / convey / shape / are shaped by them. Sort of network meets syntax. Parallel meets sequence.