What We Talk about When We Talk about Stories

Rejected for a special issue of the Journal of Cultural Analytics, but, still, I think, an interesting project and one I will continue to pursue. If anyone else is interested, this is part of a larger project I have in mind and I am open to there being a working group.

Current efforts to treat narrative computationally tend to focus on either the very small or the very large. Studies of small texts, some only indifferently narrative in nature, have been the focus for those interested in social media, networks, and natural language technologies, which are largely dominated by the fields of information and computer sciences. Studies of large texts, so large that they contain many kinds of modalities with narrative the dominant one, have largely been the purview of the field we now tend to call the digital humanities, dominated by the fields of literary studies, classics, and history.

The current work proposes to examine the texts that fall in the middle: larger than a few dozen words, but smaller than tens, or hundreds, of thousands of words. These are the texts that have historically been the purview of two fields that themselves line either side of the divide between the humanities and the human sciences, folklore studies and anthropology (respectively).

The paper profiles the knot of issues that keep these texts out of our scholarly-scientific systems. The most significant issue is the matter of “visibility”, of accessibility, of these texts as texts and thus also as data: largely oral by nature, most folk or traditional narratives (must) have been the product of a transcription process that cannot guarantee the same kind of textuality of a “born literary” text. (The borrowing of the notion of natality is somewhat purposeful here, since we often distinguish between texts that have been, sometimes laboriously, digitized and those that were “born digital.”) As scholarly fictions, if you will, they are largely embedded within the texts that treat them, only occasionally available in collections. With limited availability, and traditionally outside the realm of the fields that currently dominate the digital humanities, folk/traditional/oral narratives are not yet a part of the larger project to model narrative nor of efforts to consider the “shape of stories.”

This accessibility gap means that both human and textual populations have been overlooked: most of the world’s verbal narratives are in fact oral in nature. Millions upon millions are produced every day by millions and millions of people, and those narratives tend to range in size from somewhere around a hundred words to, perhaps, a few thousand words in length. The result is that any current model or notion of shape simply has allowed the wrong “figures figure figures.” Put another way, there can be no shape of stories without these stories.

Meir Sternberg on Narratology’s History

As part of a larger effort to think about the shape of small stories, I have begun to try to delineate more carefully the modes of oral discourse — e.g., description, narration, exposition, etc. Apart from the early work by Labov and Waletzky, whose work on narrative versus free clauses is foundational, the work I have found most compelling is that of Meir Sternberg. Re-reading his 1981 essay on “Ordering the Unordered: Time, Space, and Descriptive Coherence” is an exercise in wondering how one mind could anticipate so much of what was to come and what still needs to get done.

I’ll have more to say about Sternberg later, but in the meantime, I found this delightful excerpt from an interview in which he explains the difference, or the lack thereof, between classical and post-classical narratology.

NARRATOLOGY: CLASSICAL AND POSTCLASSICAL STUDIES. from Enthymema on Vimeo.

The Saffron Research Browser

I’m still trying to figure out what all I can do with the [Saffron][] browser/visualizer. It claims to analyze the research communities of natural language processing, information retrieval, and the semantic web through “text mining and linked data principles.”

The list of research domains is rather short and under-explained for the uninitiated:

Saffron’s List of Research Domains

I clicked on [ANLP][], which is *applied natural language processing*, and got both a list of hot topics:

Hot Topics in ANLP

As well as a taxonomy network/tree that offers labels when you hover over nodes, which are themselves clickable links:

Taxonomy Network for ANLP

Clicking on one of the “hot topics,” in this case [natural language text][], gives you a bar chart of the frequency of the topic in documents for the past thirty years:

Frequency of Natural Language Text as a Topic over 30 Years

A list of similar topics:

Topics Similar to "Natural Language Text"

A list of experts:

Saffron’s List of Experts Associated with “Natural Language Text”

And a list of publications:

The Top 5 Publications for “Natural Language Text”

Like a lot of browsers, this kind of static presentation of the results impoverishes the exploration that it encourages. I also haven’t explored what its inputs are: I wonder how complete its historical record is.

[Saffron]: http://saffron.deri.ie
[ANLP]: http://saffron.deri.ie/acl_anlp
[natural language text]: http://saffron.deri.ie/acl_anlp/topic/natural_language_text/

Why Count Words?

“Why count words?” It was a simple question[^cf1]. The person asking the question did not ask it in an overly skeptical, or hostile, fashion. He was honestly taken aback by a series of numbers I had rattled off that corresponded to a collection of texts, of legends, that I had assembled as my first step in my exploration of computational approaches to narrative. The illustration in front of the room had been a bar chart of sixteen legend texts, each collected by an established folklorist (and so the original oral texts were, I felt, reliably represented). The longest text in the collection was a little over one thousand words (1025); the shortest, only 150.

A multiplier of seven is not an order of magnitude in difference, but it is still enough of a spread that it bears further investigation. Mount Everest is, for example, roughly seven times taller than Ben Nevis, the highest mountain in the British Isles. Climbing the former is considerably more prestigious than climbing the latter. The Gross Domestic Product of the U.S. is seven times greater than that of Brazil. The distance from New York to London is seven times greater than the distance from New York to Washington, D.C. The difference in the latter amounts to a change in continent and a trans-oceanic passage.

My initial answer to the question was simple: I counted words because I wanted to know if it is possible to create a story world using 150 words, and, if so, then I want to understand how that can happen. Given the size of a great number of literary forms, one thousand words is already amazingly concise, but 150 words? Each word must pack an incredible amount of power: something made even more amazing when one realizes that only half that number of words are unique in their usage in this little text. That is, one word alone, he, gets used twelve times. The next nine words that get used most often in this little legend are also fairly uninteresting: and, a, was, the, it, his, said, to, they. So a list of the text’s top ten words doesn’t reveal anything about the story itself, except that, perhaps, there is a singular figure, he, who is counterposed against a group of some kind, they. (It is only when we get to the next ten most often used words, all of which appear only two or three times in the text, that we begin to get a sense of what the story might be about: man, dog, with, when, went, there, saw, off, horse, controller.)
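For readers who want to try this kind of counting themselves, here is a minimal sketch in Python. It is not the procedure used for the legend collection, only one plausible way to get the same kinds of numbers: total words (tokens), unique words (types), and the most frequent words. The sample sentence is invented as a stand-in, since the legend text itself is not reproduced here, and the names `tokenize` and `word_profile` are illustrative.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())

def word_profile(text, top_n=10):
    """Return token count, type count, type/token ratio, and the top_n words."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    return {
        "tokens": len(tokens),   # every word, counted each time it occurs
        "types": len(counts),    # distinct words only
        "type_token_ratio": len(counts) / len(tokens) if tokens else 0.0,
        "top": counts.most_common(top_n),
    }

# Invented sample text (an assumption, not one of the sixteen legends).
sample = "He saw a dog and he ran. They said he was a fool, and he laughed."
profile = word_profile(sample, top_n=3)
print(profile)
```

A type/token ratio near 0.5 would correspond to the observation above that only about half the words in the 150-word legend are unique.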

How is this possible? How can such a small subset of words from an already small text make a story go? That is, I think, the real question. Counting words is but one step along the way, but an important one, and one that we, as folklorists, have failed to undertake. Think for a minute of all the texts that are indexed in the great collection projects of the twentieth century. Add to them all the texts we have collected under the auspices of the ethnography of speaking. It’s an impressive amount of work, and while we have made some synthetic gestures, we have, by and large, mostly focused on differences. All of those differences are, of course, quite compelling, but in focusing on differences, we have also missed an opportunity to make attempts at larger kinds of claims about human nature and culture.

The impulse to count words, for me, is but one step towards a larger understanding of how humans think their way through the world through things of their own making. In the case of texts, they quite literally string one word after another, usually within the flow of a larger program of discourse that itself may or may not be conducive to text-making. Despite all the complexities, people in a variety of speech act contexts somehow decide to initiate a text, place one word upon another in a sequence they both anticipate and, at the same time, manipulate, until they are satisfied, in some fashion, with the result and, like a discursive Atropos, end the life of the string.

Counting words, then, is but one step towards a larger understanding of not only how many words, but which words, and in what order. Why these words and not others? And what is the relationship not only of these words to the story world they instantiate, but also of the actions within the story world to the human world within which they are embedded? In short, what can 150 words tell us about the relationship between words, ideas, and actions?

The great indices of the previous era of folklore scholarship took one step in this direction by attempting to map, mostly in bibliographic terms but indirectly in cartographic, the various texts that had been collected in the initial wave of the philological project. At the same time as Stith Thompson turned his great carousel to compile the Motif Index three-by-five card by three-by-five card, however, a few scholars and scientists were beginning to play with the idea of using computers, as slow and expensive as they were then, to compile statistics about texts[^cf2].

Statistics remains, for most humanists, either an enigma or an enemy. It represents, for many (and with good reason), a regime of mathematics, itself something of a mystery, which has been used too often to summarize a situation or a group of people when a more subtle form of analysis was needed. I will not, in this essay, defend its use in such contexts. Nor am I interested in defending, or capable of discussing, the larger statistical turn that so many forms of knowledge production have undertaken. I have only this, a reworking of a dite from my own childhood and perhaps yours too: just because others are doing it is not a reason for us to do it, too.

I understand very well the humanistic impulse to draw a line in the discursive sand and to cry out “the crunching of us into numbers ends here.” My suggestion here, at this metaphorical line lying before us, is that the crunching will go on and on, and it can do so either without us or with our efforts not only to humanize the crunching but also to stuff it so full of the human that it might very well turn into a new kind of science, a new kind of scholarship that will be interesting not only to others but also to us.

One of the central requirements of statistics is that you must convert information (perhaps a simple little story about a treasure buried somewhere, perhaps a few dozen such stories, or perhaps several thousand) into data. But such a transformation amounts simply to assigning values, most often numbers but they need not be, to the objects that are central to the problem. The analyst defines the problem, and the analyst assigns the values. Folklore studies has already done this in the form of tale type numbers and motif numbers, and even when we describe the process of contextualization of a particular text.

So why count words? Well, clearly one reason to do so is simply to explore texts and textuality, to satisfy our curiosity about the fundamental dimensions of human expressivity: the number of words in a text, the word clusters (or collocations) that occur within a text as well as the words that always appear in conjunction with others in particular kinds of texts (co-occurrences). A second reason to proceed in this fashion is to make it possible to discover relationships between texts that we have not yet discovered by more traditional means of study. Discovery, indeed the notion of indexing itself, is the chief reason behind so much of the effort in natural language processing, as we will discuss in a moment. The final reason is that seeing folklore texts in a new light, and seeing relationships between texts that we have not gleaned before, leads to new forms of knowledge, forms that need not displace but rather refine and extend current ways of knowing.
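The two measures named above can be approximated in a few lines of Python. This is a rough sketch under simplifying assumptions: "collocation" is read here as adjacent word pairs (bigrams) within one text, and "co-occurrence" as the number of texts in which two words both appear. Fuller treatments use statistical association measures, which are left out. The two sample sentences are invented, and the function names are illustrative.

```python
import re
from collections import Counter
from itertools import combinations

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def bigram_counts(text):
    """Count adjacent word pairs within a single text (a crude collocation measure)."""
    tokens = tokenize(text)
    return Counter(zip(tokens, tokens[1:]))

def cooccurrence_counts(texts):
    """For each pair of words, count the number of texts containing both."""
    pairs = Counter()
    for text in texts:
        vocab = sorted(set(tokenize(text)))  # sorted so each pair has one canonical order
        pairs.update(combinations(vocab, 2))
    return pairs

# Invented sample texts (an assumption, not drawn from the legend collection).
texts = [
    "The man saw the dog.",
    "The dog saw the horse.",
]
bigrams = bigram_counts(texts[0])
shared = cooccurrence_counts(texts)
print(bigrams.most_common(3))
print(shared[("dog", "saw")])
```

Word pairs that turn up together across many texts are one starting point for finding relationships between texts that a tale type or motif index does not record.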

[^cf1]: The first public presentation of this research project was at the 2013 meeting of the International Society for Contemporary Legend Research. I would like to thank that group for their incredible generosity and hospitality.

[^cf2]: The image of Stith Thompson sitting in a building dedicated to housing a carousel forty feet in diameter is one that I owe entirely to Henry Glassie.