## Mac OS LSM Reference ##

Apple’s `LatentSemanticMapping.h` documentation is [here](http://developer.apple.com/library/mac/#documentation/LatentSemanticMapping/Reference/LatentSemanticMapping_header_reference/Reference/reference.html#//apple_ref/doc/uid/TP40011480).

And [here](http://pypi.python.org/pypi/pyobjc-framework-LatentSemanticMapping/2.3) is the Python Package Index page for the `pyobjc` framework to access the Mac OS LSM … which makes it very clear that I’m not ready to do anything with it. (It notes that Apple’s documentation is sparse, at best, and that you’ll need to be comfortable with `pyobjc`.)
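
For the record, here is an untested sketch of what driving LSM from Python might look like, based on the C functions that header documents; I have not run it, and the names should be checked against the `pyobjc` bindings before trusting any of it:

```python
# Untested sketch: function names follow the LSM C API that
# LatentSemanticMapping.h documents, as exposed flatly by pyobjc.
from LatentSemanticMapping import (
    LSMMapCreate, LSMMapAddCategory, LSMMapAddText, LSMMapCompile,
    LSMTextCreate, LSMTextAddWord,
    LSMResultCreate, LSMResultGetCategory,
)

lsm_map = LSMMapCreate(None, 0)  # a new, empty map

def add_category(words):
    """Create a category and train it on a toy list of words."""
    category = LSMMapAddCategory(lsm_map)
    text = LSMTextCreate(None, lsm_map)
    for word in words:
        LSMTextAddWord(text, word)
    LSMMapAddText(lsm_map, text, category)
    return category

ballads = add_category(["ballad", "singer", "tune"])
tales = add_category(["tale", "teller", "motif"])

LSMMapCompile(lsm_map)  # build the latent semantic space

query = LSMTextCreate(None, lsm_map)
LSMTextAddWord(query, "motif")
result = LSMResultCreate(None, lsm_map, query, 1, 0)
print(LSMResultGetCategory(result, 0))  # id of the best-matching category
```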

## LDA vs LSA ##

I came across this post from two years ago on the `topic-models` listserv while looking up the difference between LSA and LDA:

> LDA isn’t necessarily Bayesian (at least not the original variational algorithm — it produces point estimates). In my view, the dispute is really about those who want a probabilistic model (LDA) versus those that don’t care (LSA). Unfortunately, the probabilistic model produced by LDA is only of limited use, because it doesn’t model the document length. But it does have clear probabilistic semantics.
>
> My impression is that neither model is particularly better at handling polysemy and synonymy. They both do ok if you have enough documents to train on. But perhaps other members of the list can enlighten me.

The note is by Thomas G. Dietterich of Oregon State University. Thank you, Tom.

## Of Topics, Networks, and a Discipline: An Update ##

I have returned to work on trying to understand the intellectual history of my discipline, folklore studies, through the lens of text mining, or what Franco Moretti once more eloquently called “distant reading.” There are a number of ways to do this, but the two that seem to dominate experiments in the humanities are citation networks and topic modeling.

I want to note upfront that I am especially thankful to my two colleagues, Jonathan Goodwin and Clai Rice, whose own expertise and excitement have re-ignited my own. I am especially indebted to Jonathan for his perseverance as I look at graphs and ask for changes. As the man actually operating the machine, he has borne the brunt of the labor so far.

### Missing Citations ##

When I first conceived this project two years ago, I planned to work with citation networks. The idea came to me at the end of the NEH seminar on “Networks and Networking in the Humanities” that Tim Tangherlini organized. Tangherlini deserves a lot of praise for his efforts to jumpstart an entire generation of computational folklorists, but there were only a handful of us present at the seminar itself.

Part of the reason has to be that so many of us were raised within the era of performance studies, which, like its many cousins within the sociolinguistic diaspora, focuses on human nature at the level of interpersonal interaction. It has not, at least ostensibly, produced the kind of giant data sets that the field’s previous era of tale collectors did. The result is that someone like me came to the two-week computational bootcamp with a rather small data set, with a network that looked more like what Bruno Latour had in mind and less like what Albert-László Barabási had in mind.

There I was surrounded by historians and literary scholars talking about hundreds and sometimes thousands of records or texts. My first thought was to follow Tangherlini’s own example and take advantage of an extant folklore collection. (Some of you will already know that Tangherlini and his collaborators won a Google grant to do some serious computational work with the Danish folklore materials available through the search giant’s digitization efforts.) Unfortunately for me, my expertise largely lies with the rather under-developed corpus of Louisiana folklore. Despite its status as a “folklore land”, there are not that many Louisiana folklore collections, and thus not many to be digitized. (I am, however, cooking up some plans to create such a collection. More on this some other time.)

My third project returned to the juxtaposition I described above, to the moment in which the social construction of reality established itself in folklore studies in what folklorists know as “the turn towards performance.” Here was an instance in which there was a broad consensus within a field that a discursive domain, if not a full-fledged paradigm, had come into being. Most folklorists will point to a particular moment, the late sixties through the mid-seventies, and to a particular group of scholars. Okay. We have a qualitative description. But what does “the turn” actually look like objectively?

By objects I meant citations, and I was curious to explore what for me was a whole new area of inquiry, bibliometrics, to see what a paradigm shift or the emergence of a paradigm looked like within the humanities. It was something I would later discover Kuhn had mentioned in passing in _The Structure of Scientific Revolutions_. In the postscript, discussing how one identifies communities, especially emergent communities, he observes that “one must have recourse to attendance at special conferences, to the distribution of draft manuscripts or galley proofs prior to publication, and above all to formal and informal communication networks including those discovered in correspondence and in the *linkages among citations*” (178). (Special thanks to Scott Weingart for this reference, which is something he recalled from *his* more recent reading of Kuhn and mentioned to me in October 2011 when we met in Bloomington.)

The idea would be to pull citations from the _Journal of American Folklore_ (hereafter JAF) and to construct a series of citation networks, slicing through the decades before and after “the turn” in order to determine if the change was discernible and what kind of change it was. (I confess that, as a volumetric thinker, the very idea of a transformative three-dimensional model was appealing to me.) In Summer 2011, I was lucky enough to have a collaborator in Kyle Felker, a graduate student in our program who also had a background in digital archives, where he had picked up the necessary coding skills to do two things:

1. Develop a series of PHP scripts that could make calls to the JSTOR DFR API (that’s a lot of acronyms!) to retrieve the citational data (a rough sketch of this step follows the list)
2. Work with the resulting XML to begin to make things happen
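
Since the actual DFR calls are not documented here, the following Python sketch only gestures at that first step; the endpoint URL and parameters are hypothetical stand-ins for whatever the PHP scripts actually requested:

```python
import requests  # Python standing in for the original PHP

# Hypothetical endpoint: the real DFR API details are not in this post.
DFR_CITATIONS_URL = "https://dfr.jstor.org/citations"

def fetch_citation_xml(jstor_id):
    """Retrieve the citation XML for one article by its JSTOR id."""
    response = requests.get(DFR_CITATIONS_URL, params={"id": jstor_id})
    response.raise_for_status()
    return response.text

# The fictional id used in the excerpt below.
xml = fetch_citation_xml("10.2307/123456")
```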

The material that we got through this work looked like this, where the reference id (the `refdoi` attribute below) was always some combination of JSTOR’s DOI prefix, `10.2307`, and a multi-digit number, here a fictional `123456` because I am drawing references from multiple articles as examples:


    <reference year="1953">Sixty-fifth Annual Meeting of the American Folklore Society,
    in Tucson, December 1953</reference>

    <reference year="1952">J. W. Walker, The True History of Robin Hood (Wakefield, 1952).</reference>

    <reference year="1882">F. J. Child, English and Scottish Popular Ballads, I (Boston, 1882-98), 24</reference>

    <reference refdoi="10.2307/123456">Aurelio M. Espinosa, “Notes
    on the Origin and History of the Tar-Baby Story,”
    JAF, XLIII (1930), 129-209</reference>

    <reference>Richard Dorson, “The Michigan State University Folklore Archives,”</reference>

Even this excerpt should prove instructive about the problems we faced. First, citations in JSTOR are not “hard coded” as meta-data but are, rather, text mined. The result is that what ends up inside a `reference` element can vary widely, and while JSTOR does a reasonably good job of extracting year data, when it’s available, that is about the only thing. In the example above, you can see that the first three `reference` elements have a `year` attribute. The fourth one has a `refdoi` attribute because JSTOR’s algorithms have successfully identified the reference as being internal to JSTOR’s own data store. The last one makes it clear that the XML file is being generated, in part, by JSTOR’s algorithms plumbing the depths of footnotes. In this particular example, the reference is only enough to begin a search for what it might be, a trace as Derrida might describe it, but not a reference itself.
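
To make the variability concrete, here is a minimal Python sketch of how such elements might be normalized once the markup is in hand; the attribute names come from the excerpt above, and everything else is illustrative:

```python
import xml.etree.ElementTree as ET

# Two cleaned-up references from the excerpt above, wrapped in a root
# element so they parse as one document.
excerpt = """<references>
<reference refdoi="10.2307/123456">Aurelio M. Espinosa, "Notes
on the Origin and History of the Tar-Baby Story,"
JAF, XLIII (1930), 129-209</reference>
<reference>Richard Dorson, "The Michigan State University Folklore Archives,"</reference>
</references>"""

for ref in ET.fromstring(excerpt).iter("reference"):
    citation = " ".join(ref.text.split()) if ref.text else ""
    year = ref.get("year")      # present only when JSTOR extracted a year
    refdoi = ref.get("refdoi")  # present only for JSTOR-internal references
    print(refdoi or year or "unresolved", "|", citation)
```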

Kyle did an amazing job re-working the XML, mastering a variety of techniques that he later discussed in the paper we gave at the 2011 meeting of the American Folklore Society. Having made all that progress, however, we had to put the project on hold. Kyle decided to stop at an M.A. in order to pursue a number of digital library positions he was being offered. I decided to stop because my technical expertise was too limited to move forward on my own; my coding fu was weak.

### Innumerable Topics ##

With Kyle having left to start his own career, the project lay fallow for a while, subject to the occasional conversation because it had a discernible scope and loads of problems to solve.

This past autumn, my colleagues Jonathan Goodwin and Clai Rice decided to reinvigorate the project. Their impulse was to set aside the citational research for a time, both because the data was so hard to parse algorithmically and because JSTOR was no longer making it readily available through their Data for Research portal, which meant we couldn’t extend our original data set to the longer period we felt might be necessary to have a better understanding of the paradigmatic shift. (We have yet to inquire why JSTOR shut down access to the citational data, and so it may very well be simply a matter of asking them for more through direct correspondence.)

The data that are available are bags of words, or, rather, word counts that can be re-constituted as bags of words. Jonathan is a coding wizard, and in no time at all he had used the DFR API to pull word counts for all the articles in various folklore journals and run them through Mallet’s LDA to generate a list of fifty topics. I’m not quite sure why he chose that number: it may be something that was discussed at the topic modeling workshop he and Clai attended in November, but my sense is that we may need to halve and double the count just to see how the topics contract and expand. By doing so, we might get a better sense of where the “fit” might be. More on this in a moment.
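
Halving and doubling the count is cheap to script. Here is a minimal sketch, assuming a corpus already imported into Mallet’s binary format (the paths and filenames are hypothetical):

```python
import subprocess

MALLET = "bin/mallet"        # path to the Mallet launcher (assumed)
CORPUS = "folklore.mallet"   # corpus built with `mallet import-file` (assumed)

# Train at half, the original, and double the topic count to see how
# the topics contract and expand.
for k in (25, 50, 100):
    subprocess.run([
        MALLET, "train-topics",
        "--input", CORPUS,
        "--num-topics", str(k),
        "--num-iterations", "1000",
        "--optimize-interval", "10",
        "--output-topic-keys", f"keys-{k}.txt",
        "--output-doc-topics", f"doc-topics-{k}.txt",
    ], check=True)
```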

So far, the visualizations he has focused on making are those that take each of the fifty topics and graph that topic’s prominence across the 100-year period (also known as a century) for which we have data. The last I checked, the total number of articles in the set numbered slightly more than two thousand, and that’s working with articles from JAF as well as _Western Folklore_ and the _Journal of Folklore Research_. We have contacted folks at Western Kentucky University about getting access to some, or all, of the contents of _Southern Folklore_, but for now that is not part of the data set. We are only dealing with articles, as classified by the JSTOR system, which mostly handles making this distinction. In some instances, especially in the earlier decades of JAF, other kinds of materials make their way in, but in the case of topic modeling, I don’t think this has much of an impact.

There are some interesting trends present in the visualizations, which deserve more commentary than I have time for here and will, we hope, make for an interesting series of observations for the readers of JAF, where we hope to place the first instance of this research. But I am also curious about relationships between topics, because we are seeing some overlapping trends. We do, thanks to some of Mallet’s outputs, have the option to create a bipartite network. Mallet’s output looks something like this:

    Doc 1: Topic 1, .58  Topic 2, .20  Topic 3, .08
    Doc 2: Topic 1, .34  Topic 2, .41  Topic 3, .07
    Doc 3: Topic 1, .06  Topic 2, .18  Topic 3, .33

This is a simplified rendition to be sure, since I am only listing 3 topics out of 50, but it captures the gist of the output. It looks like a good place from which to project a topic network visualization out of the two modes.
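
As a first pass, those doc-topic proportions map directly onto a weighted two-mode graph. A minimal sketch using `networkx`, with the toy numbers above standing in for our actual data:

```python
import networkx as nx

# The simplified doc-topic proportions from above.
doc_topics = {
    "Doc 1": {"Topic 1": 0.58, "Topic 2": 0.20, "Topic 3": 0.08},
    "Doc 2": {"Topic 1": 0.34, "Topic 2": 0.41, "Topic 3": 0.07},
    "Doc 3": {"Topic 1": 0.06, "Topic 2": 0.18, "Topic 3": 0.33},
}

G = nx.Graph()
for doc, topics in doc_topics.items():
    G.add_node(doc, bipartite=0)        # document mode
    for topic, weight in topics.items():
        G.add_node(topic, bipartite=1)  # topic mode
        G.add_edge(doc, topic, weight=weight)
```

From there the graph can be drawn as-is or projected onto a one-mode, topic-to-topic network weighted by the documents that topics share.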

But as [Scott Weingart points out][], most of us are not interested in including all 50 topics in our weighting of the relationships, which means we are throwing away data. It appears that many people graphing these things are simply choosing the top three topics for any given document. My impulse is to try playing with the numbers and find some way to graph some arbitrarily large percentage of the topics involved, but I’m not sure I see a way out of this particular predicament in a way that’s universally applicable. I like [Ted Underwood’s response][], that he’s in search of “meaningful ambiguity,” but I’m not certain that’s not a response only a humanist can love.
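
One way to play with those numbers: instead of a fixed top three, keep each document’s highest-weighted topics until they account for some share of its total topic mass. A sketch, where the 80% cutoff is entirely arbitrary:

```python
def trim_topics(topics, coverage=0.8):
    """Keep a document's highest-weighted topics until they account
    for `coverage` of its total topic mass."""
    ranked = sorted(topics.items(), key=lambda kv: kv[1], reverse=True)
    total_mass = sum(topics.values())
    kept, running = [], 0.0
    for topic, weight in ranked:
        kept.append((topic, weight))
        running += weight
        if running >= coverage * total_mass:
            break
    return kept
```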

[Underwood’s most recent suggested solution][] seems to be folding in Principal Component Analysis, and that’s where I’m headed next.
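
I have not tried this yet, but as I understand it the basic move is to project the fifty-dimensional doc-topic matrix down to a few principal components, for instance with scikit-learn (the data below are random stand-ins for our actual proportions):

```python
import numpy as np
from sklearn.decomposition import PCA

# Rows are documents, columns are topic proportions; Dirichlet noise
# stands in for the roughly two thousand real articles.
doc_topic_matrix = np.random.dirichlet(np.ones(50), size=2000)

# Two principal components, usable as plotting coordinates.
coords = PCA(n_components=2).fit_transform(doc_topic_matrix)
```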

Jonathan has also been posting on his blog some of his notes on the work he’s been doing on this project and on adjacent projects. [Be sure to read there, too.][jg]

[Scott Weingart points out]: http://www.scottbot.net/HIAL/?p=23755
[Ted Underwood’s response]: http://tedunderwood.com/2012/04/07/topic-modeling-made-just-simple-enough/
[Underwood’s most recent suggested solution]: http://tedunderwood.com/2012/11/11/visualizing-topic-models/
[jg]: http://www.jgoodwin.net