Concatenate Text Files with File Names

Last night I needed to compile a folder (directory) of text files into a single file, with each file’s name as a header above its contents. This simple bash loop did the work:

% for f in *.txt; do echo "# $f"; cat "$f"; done > ../legends.txt

The hash sign ahead of the filename reveals that I compiled the document as Markdown. I couldn’t quite figure out how to insert newlines into the script above, so I ended up using some regex to do that: finding instances of the hash sign and inserting a newline before it, and then finding instances of .txt and inserting a newline after it. And then, finally, removing the .txt extension altogether. From there, I converted the document to HTML that I could format more clearly.
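For what it is worth, the newlines and the extension can probably be handled inside the loop itself. What follows is a minimal sketch, assuming bash; the function name concat_with_headers is made up for illustration. It uses printf to put a blank line on either side of each heading and basename to drop the .txt extension.

```shell
#!/usr/bin/env bash
# concat_with_headers (hypothetical name): print every .txt file in a
# directory, each under a Markdown heading built from its file name.
concat_with_headers() {
    local dir="${1:-.}" f name
    for f in "$dir"/*.txt; do
        [ -e "$f" ] || continue          # nothing matched: skip the literal pattern
        name="$(basename "$f" .txt)"     # drop the directory and the .txt extension
        printf '\n# %s\n\n' "$name"      # blank line, heading, blank line
        cat "$f"
    done
}
```

Called as concat_with_headers . > ../legends.txt, it should give the same single file as before, without the regex clean-up afterwards.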

As seen above, I entered the loop directly at the command line, but it could also, I suppose, be saved as a script:

#!/usr/bin/env bash

# Print each .txt file in the current directory under a Markdown heading.
for f in *.txt; do
    echo "# $f"
    cat "$f"
done

I’m not sure how to direct the output into a file within a bash script. I usually just do that at the command line. (I know, I know: I need to learn bash scripting. I’ll get there.)
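One possible answer, offered as a sketch rather than the definitive way: a redirect placed after done applies to the output of the whole loop, inside a script just as at the prompt. The function name compile_legends and its two arguments are hypothetical.

```shell
#!/usr/bin/env bash
# compile_legends SRC_DIR OUT_FILE (hypothetical helper)
# Writes every .txt file in SRC_DIR to OUT_FILE, each under a heading.
compile_legends() {
    local src="$1" out="$2" f
    for f in "$src"/*.txt; do
        [ -e "$f" ] || continue        # nothing matched: skip the literal pattern
        echo "# $(basename "$f")"
        cat "$f"
    done > "$out"                      # one redirect captures the whole loop
}
```

For example, compile_legends . ../legends.txt. Keeping the output file outside the source directory, as in the original one-liner, stops the loop from picking up its own output.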

More on Jupyter

While I was pretty happy to get both Python 3 and R working in Jupyter notebooks, I had no idea that you could use both in the same notebook. Check out this presentation by Myles Gartland where he explains the power of %R in a Jupyter notebook: YouTube.

Trying Out Indico’s “plotlines”

Running parallel to Jockers’ attempts to “plot” texts via sentiment analysis, Indico Data Solutions has released a Python package plotlines as well as a Jupyter notebook of documentation and sample code.

Neither indico nor plotlines turned up in a port search, so my next step was to try pip. My first attempt revealed that I was still using the Python 2.7 version of pip, and I needed both to get the version for Python 3.4 and to make sure it was the active version:

sudo port install py34-pip
sudo port select --set pip pip34

And, then, to the matter at hand:

sudo pip install -U indicoio


More Notes on Jupyter Notebook

I used MacPorts, as always, to install the new code for Jupyter:

port install py34-jupyter

Note: you may need to prepend sudo to install software on your setup.

But the new command jupyter notebook only returned -bash: jupyter: command not found for me. I tried various alternatives, but got nowhere until I returned to ipython notebook. Presto. And even better, now I have this:

Python or R?

Getting R in there is considerably simpler now. From within the R shell:

             repos = c('', 


See IRkernel for more information.

First Thoughts on Folklore’s Contribution to a Computational Narratology

A complete video of this talk, with me talking while slides go by, is available on YouTube. I am also working on a longer, revised version of this essay to address the concerns I have about the particular usage of folklore theory, Propp, while ignoring folkloric materials and making pretty large claims.

For my presentation at this year’s meeting, I proposed to our session’s organizer, Jill Rudy, that I attempt a more synthetic understanding of recent explorations by physicists and information and computer scientists who are using folklore materials as a way to test the limits of their own theories and models, either of things like social networks or of dimensions of textuality that have, until lately, largely not been within the sphere of the traditional study of either literary or folkloric texts (whatever the distinction is). At the risk of having over-promised and now under-delivering, I am not prepared to do that today, for a number of reasons. First, my own attempts to survey and synthesize the strains of scientific inquiry are not yet complete, and, second, there’s no way I can do that in seven minutes. (That noted, if you are interested in such a survey, contact me. I’ll be happy to share when it’s ready.)

I have no idea what I was thinking when I proposed such a task for a diamond session, which may be in keeping with how some in the audience view any talk of computational this and digital that. As not thinking. And, perhaps, they would be right that it is not thinking, at least not in terms of the way we are used to thinking. But I see no reason to have to choose between one or the other. There’s no sense in pressing ahead with either-or when both-and is just as likely and far more useful.

With the five and a half minutes I have left in this presentation, what I would like to do is to focus on a particular moment in the spring of this year, the moment which was, in fact, on my mind when I wrote my overly ambitious abstract, in order to begin to think through the opportunities that lie before us as folklorists, if we are but willing to count.

On February second of this year, Matthew Jockers, an associate professor of English at the University of Nebraska, published “Revealing Sentiment and Plot Arcs with the Syuzhet Package” (Jockers 2015a). The post itself was a follow-up to an earlier exploration of the shapes of stories he had published the previous year (Jockers 2014) which was based in part on a re-publication of a video of a lecture given by Kurt Vonnegut at some point during the era of VHS tapes and/or DVDs.[1]

Vonnegut begins with a rather interesting assertion: “There is no reason why the simple shapes of stories can’t be fed into computers. They are beautiful shapes.” He then turns to a blackboard and draws a vertical line, calling it the “G-I axis” for good fortune and ill fortune and then a horizontal line that he describes as the B-E axis, for beginning and end. Vonnegut doesn’t raise the specter of computers again: the rest of his presentation is on mapping various shapes of stories, shapes that are based on the fortune of the protagonist: is she doing well or is she suffering at the hands of the antagonist?

Working at the chalkboard, Vonnegut offers a number of variations on the possible shapes of such narratives, which graphic designer Maya Eilam later turned into a poster-sized graphic available to readers of Open Culture, the website that had first made Vonnegut’s lecture a cause célèbre.

Jockers’ syuzhet program realizes Vonnegut’s idea by processing the prose of a novel sentence by sentence and scoring each sentence on its positive or negative emotional valence using something known as sentiment analysis. (There are a number of problems with sentiment analysis that a fuller conversation should have, but let’s give Jockers some room to work, or at least play, and see what happens.) As sentences strung together build to create the novels that are Jockers’ focus, the sentiments they contain slowly trace a trajectory up and down along the time-line of the novel’s narration. Since novels are different lengths, Jockers uses some math to normalize for length, allowing the trajectories of two novels, for example James Joyce’s A Portrait of the Artist as a Young Man and Oscar Wilde’s The Picture of Dorian Gray, to be compared.

Based on his computational analysis of some 41,383 novels [some of these are recent; how did he get access?], tested against a close reading of a couple dozen of the novels, Jockers came to the conclusion that there are approximately six archetypal story shapes, at least one of which looked so much like Vonnegut’s “Man in Hole” shape that he named it such in homage.[2]

The two shapes which he has discussed most so far are the “man in hole” and one he has dubbed “man on hill” — not very elaborate terms, but I think it helps to remind everyone that the nature of this work is still very much a sketch and not yet as programmatic as some have taken it to be.

The example text that Jockers uses most often is Joyce’s Portrait, which in its initial visualization in the syuzhet package looks something like a pixelated cloud, but he eventually achieves a smooth, optimum shape through a series of mathematical transformations, all of which he is very upfront and clear about.

So, while I want to point out that this kind of work does allow you to build your theory of textuality into its very operation, I think we need to be very clear about how texts are being treated here: sentences are being weighted for their sentiment and those weights are being added up and averaged over larger and larger stretches of the text in order to achieve a particular kind of two-dimensional shape.
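The adding up and averaging can be pictured with a plain moving average over per-sentence scores. This is a sketch only: the function name moving_average and any scores fed to it are invented for illustration, and a simple trailing window stands in here for whatever transformation the syuzhet package actually applies.

```shell
#!/usr/bin/env bash
# moving_average WINDOW (hypothetical helper)
# Reads one sentence score per line on stdin and prints the mean of each
# run of WINDOW consecutive scores, one mean per line.
moving_average() {
    awk -v w="${1:-3}" '
        { v[NR] = $1 }                       # remember every score in order
        END {
            for (i = w; i <= NR; i++) {      # i marks the last score in a window
                s = 0
                for (j = i - w + 1; j <= i; j++) s += v[j]
                printf "%.2f\n", s / w
            }
        }'
}
```

Widening the window flattens the line: with a window of 2, the scores 1, -1, 1, 1 become three averages, and with a window of 4 they collapse into a single number. That is the sense in which the shape depends on how large a stretch of text is averaged.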

There have been a number of responses to Jockers’ work. Some have expressed concerns about the use of sentiment analysis, which, in the end, is simply a collection of words with a value between 1 and -1 assigned to them, and the application, as at least one observer has mused, can be fairly crude.[3] “Like,” for example, is a positive word, since it is assumed that it is used as a verb. Such an assumption entirely misses its use as a preposition, as in “he smelled like a chicken farm in August.”

Jockers’ response to these concerns, and others, is that it all evens out over the course of a novel of fifty or one hundred thousand words or more, so those ironic or sarcastic or non-standard uses of words do not really matter.

And, perhaps most importantly for those of us gathered in this room—okay, like I’m not—almost all forms of sentiment analysis would misunderstand, and misvalue, the use of like as a quotative: “And she was like, I told you he would say that.”
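A toy version of lexicon scoring makes the problem concrete. This is a sketch under stated assumptions: the three-word lexicon and the function name score_sentence are invented for illustration and are not taken from syuzhet or from any published word list.

```shell
#!/usr/bin/env bash
# score_sentence TEXT (hypothetical helper)
# Lowercases TEXT, splits it into words, and sums the values of any words
# found in a tiny made-up lexicon.
score_sentence() {
    printf '%s\n' "$1" |
        tr '[:upper:]' '[:lower:]' |         # ignore capitalization
        tr -cs '[:alpha:]' '\n' |            # one word per line
        awk 'BEGIN { lex["like"] = 1; lex["happy"] = 1; lex["awful"] = -1 }
             ($0 in lex) { total += lex[$0] }
             END { print total + 0 }'
}
```

Because the lookup sees only the word itself, score_sentence gives “I like chicken” and “He smelled like a chicken farm in August” the same score. A quotative like would be counted as positive in just the same way.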

And that, ultimately, is the shortcoming of Jockers’ claim: he keeps talking about these shapes, drawn from novels, as the shape of stories. But folklorists would be the first to point out that novels are only ever produced by an incredibly small number of human beings, and while they are consumed by a larger number, even that number is not as large a percentage as the number of human beings who tell stories. So universalist claims about the shapes of stories based on a collection of novels, albeit a quite large one, are, I would suggest, fairly premature.

More importantly, where are the folklore collections with which we could begin to build comparisons with Jockers’ 41,383 novels?[4] For fun, I ran some of the legends from a corpus of Louisiana treasure legends through Jockers’ syuzhet package.

A small legend from Barry Ancelet’s Cajun and Creole Folktales, consisting of a little over two dozen sentences and 333 words, produced a graph that showed positive sentiments upfront and negatives in the latter part of the story.

A somewhat longer legend from my own fieldwork that is about twice as long at 653 words and almost 50 sentences had a bit more dynamism in terms of sentiment, but seemed to possess a similar overall trend.

When I tried to smooth the graphs to look at a larger trend using one of the options in the syuzhet package, I confronted the fact that its code base requires texts of at least 200 sentences. The longest text in my collection, coming from work done by Carl Lindahl and Maida Owens for the Swapping Stories project, weighs in at only a little over a thousand words and, or but, fewer than a hundred sentences.

Still, using the Fourier transforms we are able to see some interesting consistencies emerge: first, let’s take a look at the small legend again, then transform it. Now, the second legend, one about a pirate in a tree that threatens African Americans. And finally, our long legend laid over the previous two.

This is small stuff, but it’s another dimension to think about. If I were with you now, I would put in a pitch for folklorists gathering to discuss how to make our data more share-able. I’m working with TEI to make that happen. Tim Tangherlini is in the room, and I know he has a smile on his face and some ideas in his head.

If anyone would like a copy of this paper or of the visuals, they are available at this URL, and I’m happy to discuss accessible, share-able data with anyone interested. Talk to you soon.

PDF of slides.

[1] The Vonnegut video is clearly an excerpt from a longer lecture which had been captured at some, so far unknown, date and time. The sequence of events that led up to its most recent popularity seems to be the following: on 2010 October 30, David Comberg uploaded the 4:36 segment of video to YouTube. There is no other information available. This is the video to which all others link. On 2011 April 4, Open Culture featured the video segment in a post titled “The Shape of a Story: Writing Tips from Kurt Vonnegut.” The site featured the segment again on 2014 February 18, this time with an impressive set of visualizations by graphic designer Maya Eilam. Open Culture also quoted from Vonnegut’s autobiography, Palm Sunday, as a way to provide more context: “‘What has been my prettiest contribution to the culture?’ asked Kurt Vonnegut in his autobiography Palm Sunday. His answer? His master’s thesis in anthropology for the University of Chicago, ‘which was rejected because it was so simple and looked like too much fun.’ The elegant simplicity and playfulness of Vonnegut’s idea is exactly its enduring appeal. The idea is so simple, in fact, that Vonnegut sums the whole thing up in one elegant sentence: ‘The fundamental idea is that stories have shapes which can be drawn on graph paper, and that the shape of a given society’s stories is at least as interesting as the shape of its pots or spearheads.’” A link to the site appeared on Reddit later that day.

[2] Some commenters have compared Vonnegut’s idea to Joseph Campbell’s monomyth. Interestingly, Campbell borrowed the term from James Joyce’s Finnegans Wake, and Jockers’ first explorations are with Joyce’s Portrait of the Artist.

[3] Some collections, like that of Hu and Liu, contain as many as 6800 words.

[4] The information on the novels involved in this set is rather thin. Initial descriptions of the contents suggested that only the word frequencies for the novels were available.