I almost posted this on the new [Bamboo DiRT Wishlist][bdw], but then I thought better of it. The wish list needs to function as a straightforward wish list for a while before someone goes “meta” with it.
But my wish remains: a brief scan of the list of tools already available reveals that a lot of them do much the same thing. That in itself is not a bad thing, but somewhere I think it would be useful if we had a list of the kinds of things that often get done and what they might mean for various kinds of approaches to humanistic objects and topics.
For example, a lot of tools can quickly break texts into word lists (*oh, those bags of words!*), and they can produce various kinds of outputs based on those lists: filter out function words, perhaps filter out proper names or words that appear fewer than a certain number of times, and advanced users can filter out words by part of speech. But what does it mean to get at the rich “middle” of words that appear neither too frequently nor too infrequently in a text? Can we have some discussion on what we are up to?
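To make that “middle” concrete, here is a minimal sketch in Python of what such tools are roughly doing. Everything in it is an assumption made for illustration: the tiny function-word list, the invented sample sentences (they are not from my collection), and the hypothetical `min_count` and `max_count` cut-offs, which are exactly the kind of choices I think we should be discussing.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny function-word list; real tools ship much longer ones.
FUNCTION_WORDS = {"the", "a", "an", "and", "of", "to", "in", "it", "was", "he", "she"}

def bag_of_words(text):
    """Lowercase the text and count every word-like token, ignoring order."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

def middle_band(counts, min_count=2, max_count=3, stop=FUNCTION_WORDS):
    """Keep words that are not function words and whose count falls inside the band."""
    return {word: n for word, n in counts.items()
            if word not in stop and min_count <= n <= max_count}

# Invented sample text, used only to exercise the functions.
sample = ("The light came down the road and the light went out. "
          "He saw the light and he ran down the road to the house.")

counts = bag_of_words(sample)
middle = middle_band(counts)
print(middle)
```

Moving either cut-off by one changes which words survive, which is part of why the choice deserves some discussion.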
Let me give an example from my own current work: I have developed a small collection of narratives. They are all legends, all from this part of the world (Louisiana), and they have all been transcribed by folklorists from audio recordings, and so I can assume, at least for the time being, a fair amount of fidelity to what was actually said by the speaker. (Two of the texts are, in fact, from my own field work.) My goal with this small collection of texts is to explore them in various ways to see which computational methods tell me interesting things and which ones seem to be less fruitful in this context. That’s not hard to do when you have roughly two dozen texts which range in size from a hundred-some-odd words to a little over one thousand words.
In fact, this particular range in size is one of the things that I find interesting. The shortest legend weighed in at a mere 153 words. The longest was nearly seven times its size at 1015. The two stories I had collected were firmly in the middle at 375 and 655.
Now, let’s leave off for the time being that where a story starts and where it stops is not something we can necessarily declare with absolute certainty. Parts of conversation tend to wash over both ends of a story. Indeed, some stories tend to invite conversation at certain points or throughout their performance. But let’s do leave that boundary issue and let’s instead assume that an analyst can reliably depend on fellow analysts working in the same discipline to make judgments of like enough nature to his that texts are comparable and thus so are their word counts.
One of the audience members at this year’s meeting of the International Society for Contemporary Legend Research put it rather succinctly: *Why the hell are you counting words?*
Good question. I didn’t have a terribly good answer at the time, but here’s what I offered up and I stand by it:
1. First, I am fascinated by the idea that a text as small as 153 words can be said to accomplish all the things we now believe happen within a unit of narrative: the holding of an audience’s attention in a discursive situation, the creation of a storyworld, the deployment of a sequence of events that listeners perceive as having continuity.
2. Second, we have to start somewhere. Folklorists have an approximate sense of the size of various kinds of texts, but we do not have any quantified sense of their size. I am not suggesting that we will ever arrive at a moment where our definition of legend includes something like “a narrative of x type that ranges in size from 153 to 1015 words.” Rather, following David Herman here, I think we will find ourselves with better definitions that describe a kind of centrality of the kind “most legends range in size from x to y.” There are always going to be outliers: if we introduce numbers, we typically are doing so for statistical purposes, and so that gives us the chance to look at mean, median, and mode as well as things that stand outside those dimensions.
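As a small illustration of what that second point looks like in practice, here is a sketch using Python's standard `statistics` module. It assumes only the four word counts I have mentioned in this post (the shortest, the longest, and my own two texts), so the figures describe those four numbers and not the whole collection.

```python
import statistics

# Only the four word counts reported in this post; the other texts are not listed here.
word_counts = [153, 375, 655, 1015]

mean_length = statistics.mean(word_counts)      # arithmetic average
median_length = statistics.median(word_counts)  # midpoint of the sorted counts
ratio = max(word_counts) / min(word_counts)     # longest text relative to the shortest

print(f"mean: {mean_length}, median: {median_length}, longest/shortest: {ratio:.2f}")
```

With only four distinct values there is no meaningful mode, which is itself a reminder that a statement like “most legends range in size from x to y” needs many more texts behind it.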
But, look, this is a long digression from the topic with which I began. Let me bring it back into the fold by noting that my point is that we do not yet have within my own field of folklore studies, let alone in the broader humanities, any sense of what kinds of quantification matter. We have lots and lots of tools that quantify things, sometimes in ways that appear almost magical: think of the moment you first glimpsed a word cloud, a projection (visualization) of a word frequency list that is so intuitively “right” that many people use them without really having any idea what lies behind them.
And think, too, of all the people using on-line word cloud generators that have built-in stop word lists, people who never stop to think about what it means for a word to be on such a list, or that some of the words getting “stopped” may be part of the very pattern they are seeking to discern. Humanists are, in some fashion, playing willy-nilly in a park created by linguists and computer scientists. It’s fun to jump on their various play sets, to keep the metaphor going for just a moment longer, but sometimes you just end up dizzy and tumbling off the carousel. (And with that, the conceit went too far.)
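One way to see what a stop list is quietly doing is to ask for the words it throws away and not only the ones it keeps. The sketch below is only that, a sketch: the stop list and the sample sentence are hypothetical, invented for illustration, and no particular word cloud generator is being modeled.

```python
from collections import Counter

def stopped_words(tokens, stop_list):
    """Count the tokens that a stop list would silently discard."""
    return Counter(t for t in tokens if t in stop_list)

# Hypothetical stop list and invented sentence, for illustration only.
stop_list = {"the", "and", "he", "they", "i", "we", "you"}
tokens = "they say he saw it and they say we all saw it".split()

removed = stopped_words(tokens, stop_list)
share_removed = sum(removed.values()) / len(tokens)

print(removed.most_common())
print(f"{share_removed:.0%} of tokens never reach the cloud")
```

If, say, the pronouns in a text turned out to matter to the pattern one was after, a report like this would at least show that they had been removed before anything was drawn.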
Linguists built the tools with certain questions in mind. Linguistics as a discipline has historically been very focused on the sentence as its chief unit of analysis. Sentences and words. (I find this to be true as I explore various corpora, but this is my own auto-didactic impression and may be entirely out of line.) Folkloristics has historically operated one level of discourse up from linguistics and one level down from literary studies, focusing its efforts on single texts or series of single texts as discrete units of discourse in various forms of human behavior. The kinds of precision that linguistics long ago achieved are still emergent in folkloristics, and we lag substantially behind in quantitative descriptions of the materials with which we work. (There was a moment in the 50s and 60s when a variety of metric efforts were attempted, e.g., labometrics, choreometrics, etc., but those efforts were displaced with the turn towards performance.) (If you are a folklorist, keep an eye out for the essay from Jonathan Goodwin and me which attempts to quantify the “turn” within folklore’s intellectual history.)
I suspect we will need to wander about in quantities before we reach a moment where we can have a fuller discussion about what we are doing and why, but I would love to see that conversation sooner rather than later, if only because I really want to play with more ideas.
**Side note**: For those wondering why I am worried about a small collection of legends — I dare not call it a *corpus* yet — I am interested in being able to automate the process of determining a morphology for narratives. I’m not yet convinced that the CS solutions I’ve seen are very good. And I think they start with texts that are too big, too complex. My end goal here is to begin to map out the relationship between ideas (ideologies as networks of ideas as glimpsed through texts) and the narratives that contain / convey / shape / are shaped by them. Sort of network meets syntax. Parallel meets sequence.