The ACL Anthology “currently hosts 63797 papers on the study of computational linguistics and natural language processing.” That is a lot of subjects covered. There is no immediate search interface, but you can download the full anthology as BibTeX (only 5.35MB) and search that; if you throw in abstracts, it’s only 13MB.
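Searching the downloaded BibTeX is easy enough to script. Here is a minimal sketch, assuming a stdlib-only regex scan over title fields; the `search_titles` helper and the sample entries are my own invention for illustration, and real BibTeX with nested braces would want a proper parser:

```python
import re

def search_titles(bibtex, keyword):
    """Return title fields containing the keyword (case-insensitive).

    Note: the simple regex assumes flat {...} titles; nested braces
    would require a real BibTeX parser.
    """
    titles = re.findall(r'title\s*=\s*\{([^{}]+)\}', bibtex)
    return [t for t in titles if keyword.lower() in t.lower()]

# A tiny invented sample standing in for the full anthology file.
sample = """@inproceedings{ex1,
  title = {Topic Modeling for Folklore Studies},
}
@inproceedings{ex2,
  title = {Neural Machine Translation},
}"""

print(search_titles(sample, "folklore"))
# → ['Topic Modeling for Folklore Studies']
```

In practice you would read the downloaded `.bib` file into `bibtex` rather than use an inline sample.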
Katherine Kinnaird is very smart: listen for yourself.
I’ve been enjoying working through Matthew Jockers’ [Text Analysis with R for Students of Literature](http://www.matthewjockers.net/text-analysis-with-r-for-students-of-literature/) and following the various discussions about topic modeling and other approaches to “big data” in the humanities on Twitter (and elsewhere, and I really do wish there were more of the elsewhere; more on this in a moment). At the same time, I am, some would argue desperately, trying to teach myself not only the Python language and the basic terms of computer science but also the basic statistics that lie behind so much of this work.
I do so not only because these realms fascinate me and, I think, have real possibilities for studying the kinds of texts that I like to study, but also because I would like to be part of that larger conversation, which the digital humanities will eventually have to have as the “digital” falls away, about which dimensions of statistics are useful and which are not. We will at some point get past the initial, and very exciting, phase of experimentation and grabbing at all the shiny toys, and begin to synthesize these experiments into the ongoing development of the continuum of work that stretches from the humanities to the human sciences.
Folklore and anthropology have long been the kissing cousins on either side of the perceived divide between those two orders, and, as I watch the adaptation/adoption of corpus linguistics methods (often linked with information science and various forms of artificial intelligence), I am fascinated by the jump from sentences, or huge gatherings of sentences collected into corpora, to novels.
There is, I think, a middle ground. It’s not the “small data” of the old humanities, nor yet the “big data” which is our current fascination, but something more like middling amounts of data. *Medium data*? (That sounds better than *middling*, but it does suggest a statistical process, no?)
*Middling data* for now, I think.
I am using it to describe the 50-some-odd legend texts I have, which range in size from around 100 words to over 1000 words. This size of text is, in itself, a kind of middle ground between short texts like proverbs and longer texts like myths. (Some oral histories I have collected tend to fall on the shorter end of this range, as do a number of personal anecdotes, which only means that we have a lot of counting to do in folklore studies to begin to establish things like this. Easy-peasy work and still terribly interesting: how many words does a given context require either to reinforce the current reality or to conjure up an alternate one?)
50 texts of 500 words each doesn’t seem like too much, does it? (I’m going to go with the middle number of 500 here, just for the sake of argument.) Why, that’s only 25,000 words, a long-ish short story from a literary scholar’s point of view. But 50 distinct texts begins to stretch the boundaries of working memory for most human analysts, and certainly as that number grows, one begins to require alternative means of “holding” the texts in some sort of analytical space.
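The counting itself is trivial to sketch. A minimal example, assuming simple whitespace tokenization; the two legends below are invented placeholders standing in for the actual field texts:

```python
def word_count(text):
    """Count words by splitting on whitespace: crude, but fine for sizing texts."""
    return len(text.split())

# Invented placeholder texts standing in for the real legend collection.
legends = [
    "Late one night a pale light rose slowly over the old treeline.",
    "They say the dog never barked again after that hard winter passed.",
]

counts = [word_count(t) for t in legends]
# Number of texts, total words, and the shortest and longest text.
print(len(legends), sum(counts), min(counts), max(counts))
# → 2 24 12 12
```

Run over the full collection, this kind of tally is what would let us establish empirically where proverbs end and legends begin.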
Of course, as the number grows, one needs to effect some kind of compression somewhere in the process. Deciding where and how is why we need statistical reasoning to better inform how we proceed. (Sorry for the surfeit of adverbs there.) And I do love the kinds of things that topic modeling can do, as well as other forms of statistical analysis. Certainly achieving semi-accurate results with a minimum of failures and making effective use of available computational resources is of interest to computer scientists, but I don’t, at this point, particularly care about such things. Rather, I am interested in those forms of manipulation which let me explore a collection of material(s), perhaps formally organized enough to begin to be something like a corpus, but perhaps not.
This middle ground is the ground I want to work for the foreseeable future. It will let me explore the computational and statistical possibilities from within a territory that I can still attempt to grok using old-fashioned, dare I say “analog,” methods. It’s this kind of middle-ground work that made Moretti’s _Graphs, Maps, and Trees_ so compelling. (And he seems to have a distinct preference for working with middling data, if I read other essays and understand other talks he has given correctly.)
*Middling* data is a terrible name, to be sure, but like the “middling” domains of folklore studies and cultural anthropology (domains often viewed somewhat askance by practitioners in fields more central to either side of the divide between the humanities and the human sciences), I think it names some terribly productive tensions waiting to be more clearly articulated and discussed.
Then again, I would think that, wouldn’t I?
The talk I gave was partly written ahead of time and partly outlined by a series of slides I made expressly for the talk. My thanks to Zhu Gang for the photograph above. While the complete talk gets transformed into part of an essay I am submitting to a special issue of _Literary and Linguistic Computing_, I hope the slides will do: