“Why count words?” It was a simple question[^cf1]. The person asking the question did not ask it in an overly skeptical, or hostile, fashion. He was honestly taken aback by a series of numbers I had rattled off that corresponded to a collection of texts, of legends, that I had assembled as my first step in my exploration of computational approaches to narrative. The illustration in front of the room had been a bar chart of sixteen legend texts, each collected by an established folklorist (and so the original oral texts were, I felt, reliably represented). The longest text in the collection was a little over one thousand words (1025); the shortest, only 150.
A multiplier of seven is not an order of magnitude in difference, but it is still enough of a spread that it bears further investigation. Mount Everest is, for example, seven times taller than Ben Nevis, the highest mountain in the British Isles. Climbing the former is considerably more prestigious than climbing the latter. The Gross Domestic Product of the U.S. is seven times greater than that of Brazil. The distance from New York to London is seven times greater than the distance from New York to Washington, D.C. In that last case, the difference amounts to a change of continent and a trans-oceanic passage.
My initial answer to the question was simple: I counted words because I wanted to know whether it is possible to create a story world using 150 words and, if so, to understand how that can happen. Given the size of a great number of literary forms, one thousand words is already amazingly concise, but 150 words? Each word must pack an incredible amount of power: something made even more amazing when one realizes that only half that number of words are unique in their usage in this little text. That is, one word alone, he, gets used twelve times. The next nine words that get used most often in this little legend are also fairly uninteresting: and, a, was, the, it, his, said, to, they. So a list of the text’s top ten words doesn’t reveal anything about the story itself, except that, perhaps, there is a singular figure, he, who is counterposed against a group of some kind, they. (It is only when we get to the next ten most often used words, all of which appear only two or three times in the text, that we begin to get a sense of what the story might be about: man, dog, with, when, went, there, saw, off, horse, controller.)
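For readers curious about what such counting looks like in practice, here is a minimal sketch in Python. It is not the procedure used to produce the figures above, only one plausible way to arrive at numbers of the same kind: total words (tokens), unique words (types), and the most frequent words. The function names `tokenize` and `describe` are hypothetical, the tokenizing rule (lowercase runs of letters and straight apostrophes) is an assumption, and the sample sentence is an invented stand-in rather than one of the sixteen legends.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its words (runs of letters and apostrophes)."""
    # Assumption: this simple rule is good enough for a short prose text.
    return re.findall(r"[a-z']+", text.lower())


def describe(text, top_n=10):
    """Return the total word count, the unique word count, and the top_n words."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    return {
        "tokens": len(tokens),             # every word, repeats included
        "types": len(counts),              # distinct words only
        "top": counts.most_common(top_n),  # (word, frequency) pairs
    }


# Invented stand-in text, not one of the collected legends.
sample = "He saw the dog and he ran. The dog saw him and it ran too."
summary = describe(sample, top_n=3)
print(summary["tokens"], summary["types"], summary["top"])
```

A different tokenizing rule (one that keeps hyphenated words together, say, or treats contractions differently) would shift the totals slightly, which is one reason to state the rule alongside any count.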
How is this possible? How can such a small subset of words from an already small text make a story go? That is, I think, the real question. Counting words is but one step along the way, but an important one, and one that we, as folklorists, have failed to undertake. Think for a minute of all the texts that are indexed in the great collection projects of the twentieth century. Add to them all the texts we have collected under the auspices of the ethnography of speaking. It’s an impressive amount of work, and while we have made some synthetic gestures, we have, by and large, mostly focused on differences. All of those differences are, of course, quite compelling, but in focusing on differences, we have also missed an opportunity to make attempts at larger kinds of claims about human nature and culture.
The impulse to count words, for me, is but one step towards a larger understanding of how humans think their way through the world through things of their own making. In the case of texts, they quite literally string one word after another, usually within the flow of a larger program of discourse that itself may or may not be conducive to text-making. Despite all the complexities, people in a variety of speech act contexts somehow decide to initiate a text, place one word upon another in a sequence they both anticipate and, at the same time, manipulate, until they are satisfied, in some fashion, with the result and, like a discursive Atropos, end the life of the string.
Counting words, then, is but one step towards a larger understanding not only of how many words, but of which words, and in what order. Why these words and not others? And what is the relationship not only of these words, used here to instantiate a story world, but also of the actions within the story world to the human world within which they are embedded? In short, what can 150 words tell us about the relationship between words, ideas, and actions?
The great indices of the previous era of folklore scholarship took one step in this direction by attempting to map, mostly in bibliographic terms but indirectly in cartographic, the various texts that had been collected in the initial wave of the philological project. At the same time as Stith Thompson turned his great carousel to compile the Motif Index three-by-five card by three-by-five card, however, a few scholars and scientists were beginning to play with the idea of using computers, as slow and expensive as they were then, to compile statistics about texts[^cf2].
Statistics remains, for most humanists, either an enigma or an enemy. It represents, for many (and with good reason), a regime of mathematics, itself something of a mystery, which has been used too often to summarize a situation or a group of people when a more subtle form of analysis was needed. I will not, in this essay, defend its use in such contexts. Nor am I interested in defending, or capable of discussing, the larger statistical turn that so many forms of knowledge production have undertaken. I have only this, a reworking of a dite from my own childhood and perhaps yours too: just because others are doing it is not a reason for us to do it, too.
I understand very well the humanistic impulse to draw a line in the discursive sand and to cry out “the crunching of us into numbers ends here.” My suggestion here, at this metaphorical line lying before us, is that the crunching will go on and on, and it can do so either without us or with our efforts not only to humanize the crunching but also to stuff it so full of the human that it might very well turn into a new kind of science, a new kind of scholarship that will be interesting not only to others but also to us.
One of the central requirements of statistics is that you must convert information (perhaps a simple little story about a treasure buried somewhere, perhaps a few dozen such stories, or perhaps several thousand) into data. But such a transformation amounts simply to assigning values, most often numbers but they need not be, to the objects that are central to the problem. The analyst defines the problem, and the analyst assigns the values. Folklore studies has already done this in the form of tale type numbers and motif numbers, and even when we describe the process of contextualization of a particular text.
So why count words? Well, clearly one reason to do so is simply to explore texts and textuality, to satisfy our curiosity about the fundamental dimensions of human expressivity: the number of words in a text, the word clusters (or collocations) that occur within a text as well as the words that always appear in conjunction with others in particular kinds of texts (co-occurrences). A second reason to proceed in this fashion is to make it possible to discover relationships between texts that we have not yet discovered by more traditional means of study. Discovery, indeed the notion of indexing itself, is the chief reason behind so much of the effort in natural language processing, as we will discuss in a moment. The final reason is that seeing folklore texts in a new light, and seeing relationships between texts that we have not gleaned before, leads to new forms of knowledge, forms that need not displace but rather refine and extend current ways of knowing.
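The two measures named above, collocations and co-occurrences, can be roughed out in a few lines of Python. What follows is a sketch under stated assumptions, not a settled method: adjacent word pairs stand in, crudely, for collocations (fuller treatments usually weigh pairs with a statistical association measure), and co-occurrence is reduced to counting how many texts contain both words of a pair. The function names are hypothetical and the three sentences are invented for illustration.

```python
import re
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Lowercase the text and return its words (runs of letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def bigram_counts(text):
    """Count adjacent word pairs within one text (a crude proxy for collocations)."""
    tokens = tokenize(text)
    return Counter(zip(tokens, tokens[1:]))


def cooccurrence_counts(texts):
    """For each unordered pair of words, count the texts that contain both."""
    pairs = Counter()
    for text in texts:
        # Sorting the vocabulary gives every pair one canonical (alphabetical) order.
        vocabulary = sorted(set(tokenize(text)))
        pairs.update(combinations(vocabulary, 2))
    return pairs


# Invented miniature "collection", for illustration only.
texts = [
    "The man buried the treasure under the oak.",
    "A dog guarded the treasure all night.",
    "The man and his dog went home.",
]
bigrams = bigram_counts(texts[0])
shared = cooccurrence_counts(texts)
print(bigrams[("the", "treasure")])   # adjacent pair within the first text
print(shared[("dog", "treasure")])    # number of texts containing both words
```

Because pairs are stored alphabetically, a lookup such as `shared[("treasure", "dog")]` would return zero; the pair has to be asked for as `("dog", "treasure")`.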
[^cf1]: The first public presentation of this research project was at the 2013 meeting of the International Society for Contemporary Legend Research. I would like to thank that group for their incredible generosity and hospitality.
[^cf2]: The image of Stith Thompson sitting in a building dedicated to housing a carousel forty feet in diameter is one that I owe entirely to Henry Glassie.