In Tolkien’s grand narrative, the “One Ring” turned out to be a really bad idea, and it took a three-book sequence to destroy the thing. In the humanities in particular and the academy in general, we continue to be vexed by the lack of a file format that allows for productive interchange and is also open, in both the beer and speech senses. Microsoft’s Word files, DOC and DOCX, are clearly not it, though they are now so ingrained in everyone’s workflows, if only thanks to the application being omnipresent on most Windows computers, that many of us assume they are the basis for any interchange.

But as anyone who has had to trade a complex document, one with more than a few basic style options, back and forth a few times has learned, things get lost in transit.

Until recently, however, few applications did a decent job of reading and writing Word’s DOC file format. It was getting better — which may be one of the reasons why Microsoft changed to the DOCX format, who knows? — but it was still not reliable.

What are the alternatives?

  • OpenOffice’s ODF has never quite caught on.
  • RTF is fairly reliable, but it isn’t capable of much.
  • HTML seems so “webby” and hasn’t, at least until CSS3, been at all friendly to printed matter.

Which leaves PDF.

Adobe wisely side-stepped competing directly with Microsoft in producing its “portable document format.” Unfortunately for Adobe, but perhaps fortunately for those of us for whom openness matters, PDF seems to have really hit its stride at exactly the moment when the rise of mobile computing devices calls it most into question. After all, who here hasn’t muttered in frustration when accessing some simple text content on a phone or tablet and discovering it is in a PDF formatted for an 8.5 x 11 piece of paper? Oof!

And yet just as we in the humanities have leaned too much upon Word (I now traffic in tracked changes in Word documents for journal articles and books, ugh!), we are starting to lean too much on PDF. A recent exchange on the Digital Humanities On-line mailing list turned up the following comment from Stephen Woodruff:

There are many ways of creating and encoding a PDF file, and not all result in text which can be copied and pasted if the text includes more than standard Ascii characters. Normal word processors hold a internationally accepted numerical representation of each letter plus a note of its font, size, colour and so on. So you can search for an “a” without caring whether its in Arial or Times, red or italic, and you can copy that numerical representation to another application, even if it doesn’t understand colour or have the same fonts.

PDF doesn’t always work like that. Some encodings are analogous to what a typical word processor would use, some are not: they store glyphs, effectively pictures of the individual letters, and have a table to convert back between those and the character codes needed by a copy-paste operation. Its that conversion back that can go wrong: you can read the PDF files and print them because all your eyes and the printer need are the shapes, but if they have been created badly you can not reliably extract the text. (I’m trying hard not to start complaining about the use of PDF, which is a PAGE description language not a TEXT description language, in the academic world.)
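Woodruff's point about the conversion table can be made concrete with a toy model. What follows is only a sketch in Python, not how any real PDF library works. The glyph codes and both tables are invented for illustration. In real PDFs the glyph-to-character mapping, when present, is carried by a font's ToUnicode CMap.

```python
# Hypothetical illustration: the page stores glyph codes, not characters.
# The codes and tables below are invented; they stand in for a font's
# glyph-to-character mapping (a ToUnicode CMap in a real PDF).

# Glyph codes as they might be drawn on the page for the word "café".
glyph_codes = [17, 4, 22, 91]

# A well-made file ships a complete table.
good_table = {17: "c", 4: "a", 22: "f", 91: "é"}

# A badly made file ships a partial one: the accented glyph has no entry.
bad_table = {17: "c", 4: "a", 22: "f"}


def extract_text(codes, table):
    """Map each glyph code back to a character.

    Glyphs with no table entry become U+FFFD, the replacement character.
    """
    return "".join(table.get(code, "\ufffd") for code in codes)


print(extract_text(glyph_codes, good_table))
print(extract_text(glyph_codes, bad_table))
```

Both versions would look identical on screen and on paper, because drawing needs only the shapes. Only copy-and-paste or search reveals the difference.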

Stephen also points to a terrific post by Adobe’s James King which clarifies PDF’s purpose. King’s post ends with the following:

The PDF design is very tailored to the creator being able to quite directly and without ambiguity, specify the exact output desired. That is a strong virtue for PDF and the price of more difficult text extraction is a price worth paying for that design.

And so there you have it: PDF is really about presentation. Whether you can get text (data) back out of it is not this particular vessel’s problem nor its concern. That seems problematic to me for those of us who wish our content to be as portable and re-usable as possible. I think PDF is terrific as one possible output, one possible product, but it’s not the interchange format of which everyone dreams. Quite the opposite.

I am somewhat used to the chronicling of the demise of the humanities to be found in the pages of the Chronicle of Higher Education and The Times, but I must admit to being somewhat taken aback by similar treatments of the subject within the annals of scholarly societies themselves. At the most recent Digital Humanities meeting, Melissa Terras broached the issue. And then I was gleaning recent issues of Culture and Technology and came across a review by D. R. Koukal of Frank Donoghue’s The Last Professors: The Corporate University and the Fate of the Humanities. This link takes you to Project Muse, which houses the on-line PDFs of the journal. (ULL faculty and staff need to remember that we lose Muse and JSTOR, and, well, everything else, on August 31.) Donoghue’s argument, as I understand it from reading Koukal’s review, will come as no surprise to anyone keeping up with the last few decades of the humanities in the academy: the humanities lost the argument a while ago but are still in deep denial about their demise. That is, in the dominant rhetoric of immediate application and gain, the long-term, “life is complex” approach of the humanities is simply not seen as viable.

This is certainly not going to change in the immediate future as the world’s major economies, themselves in denial over the fact that they are actually in a depression and not a momentary recession, shrink. Those with jobs, anywhere but especially in the academy, are going to stand pat. Those without jobs are going to be pretty adamant about seeing immediate results. (Given the number of people unemployed and for how long, I would certainly not argue with their desire.)

This is a good time for humanists to roll up our collective and individual sleeves and not only produce the work we signed up to produce, but also to think about what more/else we need to be doing.

UPDATE: I missed this story in the Guardian about the cuts to universities in the United Kingdom.

Following My Own Advice

For years now I have been encouraging students, both beginning and advanced, to keep a journal of their activities as one way of breaking down the barrier to getting writing done. I have especially encouraged graduate students working on their dissertations to try it. And I have done this while only being an intermittent practitioner myself. (I confess that this is in part because of one of the great advantages of having a spouse who practices the same profession: one is free to do much of the daily review over the dinner table. The prêt-à-écouter audience is great, but it disengages one important dimension of the process: writing.)

And so, John Anderson, if you are reading this post, here is me doing what I said, an account of trying my hand at textual analysis.

The Onus

At the end of last year I was invited to participate in an NEH seminar on “Networks and Networking in the Humanities” which will be hosted by UCLA’s Institute for Pure and Applied Mathematics later this summer. Earlier this year the participants received a list of homework assignments: two books to read, a technical paper or two, and the production of an edge list.

The books have been interesting. (More on each one in separate posts.) The technical paper was at the border of my ken, but I followed chunks of it. The production of the edge list, a list of links in a network, has been the hardest task. Of course, part of it was nomenclature. “Edge list” threw me for a loop, new as I am to networkese, but I grokked it with the help of the assigned reading and a variety of web reading. (Thank you, intarwebs.)
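For anyone else new to networkese, here is a minimal sketch in Python of what an edge list amounts to. The node names are invented placeholders, and the helper `to_adjacency` is my own illustrative name rather than anything from the seminar's assigned software.

```python
from collections import defaultdict

# Hypothetical edge list: each pair is one link between two nodes.
edge_list = [
    ("Ann", "Ben"),
    ("Ann", "Cal"),
    ("Ben", "Cal"),
    ("Cal", "Dee"),
]


def to_adjacency(edges):
    """Turn an undirected edge list into a node -> set-of-neighbors mapping."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    return dict(neighbors)


adjacency = to_adjacency(edge_list)

# Degree = number of links touching each node.
degree = {node: len(links) for node, links in adjacency.items()}
print(degree)
```

The edge list is just the rawest form of the network: one line per link. Everything else, such as who is connected to whom and how well connected each node is, can be derived from it.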

But there was another dimension to the edge list assignment that was stymying me: the data. Yes, I have the emergent data from the boat book, but I don’t feel entirely comfortable rushing to produce more data for the sake of the seminar if it means rushing certain dimensions of the research. Nor do I have enough of a grip on the data I already have to feel comfortable pouring it into a new paradigm of analysis and modeling. (Like some mental version of Twister.)

And so I needed a data set with which I could work that would allow me to do the kind of analysis that I hoped network theories and models would make possible. In particular I am interested in applying these paradigms to ethnographic contexts where we need to understand how individuals make their way through the world using the ready-made mentifacts that we sometimes call folklore as “equipment for living.”

What I think that means is that I want to understand how individuals within a given group (a social graph, if you will) draw from a repertoire (network) of forms (stories, legends, anecdotes, jokes, etc.) which themselves variously reflect and refract a network of ideas (ideology) dispersed (variably) throughout the group.
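One way to picture the three layers (people, forms, ideas) is as two edge lists that share the forms in the middle. The sketch below, in Python, assumes that structure. Every name in it is an invented placeholder, not data from any real corpus, and `link_people_to_ideas` is a hypothetical helper.

```python
# All names below are invented placeholders for illustration only.

# Person -> form: who tells which story.
tells = [
    ("teller_a", "tale_1"),
    ("teller_a", "tale_2"),
    ("teller_b", "tale_2"),
]

# Form -> idea: which ideas a story carries.
carries = [
    ("tale_1", "idea_x"),
    ("tale_2", "idea_x"),
    ("tale_2", "idea_y"),
]


def link_people_to_ideas(tells, carries):
    """Derive person -> idea edges by passing through the shared forms."""
    ideas_by_form = {}
    for form, idea in carries:
        ideas_by_form.setdefault(form, set()).add(idea)

    edges = set()
    for person, form in tells:
        for idea in ideas_by_form.get(form, ()):
            edges.add((person, idea))
    return sorted(edges)


person_idea_edges = link_people_to_ideas(tells, carries)
for edge in person_idea_edges:
    print(edge)
```

The derived list is lossy on purpose: it says that a person is connected to an idea, not through which story. Keeping the two original edge lists alongside it preserves that path.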

Networks of People, Stories, and Ideas

Or, as folklorist Henry Glassie once put it: “Culture is made up of ideas, society of people.” But ideas don’t just bounce around people’s heads, and they don’t exist out in the world, at least not very often, unencapsulated. Ideas and values are usually embedded in the things we say and do.1 We keep these things around, these stories and explanations, because they resonate with our values and beliefs. At the same time, the forms not only give shape to the ideas but also shape them.

This dynamic interaction has been the focus of folklore studies for the past century. For the last forty years, studies of culture and language have taken an ethnographic turn, sometimes called “performance” and sometimes called “ethnomethodology,” which has focused on the important role that individuals play in the intertextual network of forms (and thus the ideological network embedded within them).

I am one of those performance-oriented scholars. Performance studies has produced a wide range of profound micro-level studies of folklore in action. In the last decade or so, there has begun to be an attempt to build back toward the philological framework from which the performance orientation sprang and against which it initially pushed back. It’s time to fold these things together, and I think network theories offer one possibility for doing so.

The Data

If not my own data, then what other corpus? I wanted to work with materials that I knew fairly well. I began to build a database of Louisiana folklore in print, focusing especially on tales and legends, but the amount of time to get a large enough corpus digitized and into the database, even using OCR software, quickly loomed too large. A great project, but one that could easily take up an entire summer, not the limited time I had to get something up and usable in order to begin to complete the seminar assignment — which I was late fulfilling anyway.

I did, however, initiate some conversations that may yet produce a foundation for such a database, contacting authors of several texts for electronic copies of their manuscripts to facilitate data entry. (The metadata is entirely a separate matter for now.)

The answer to my question didn’t come to me until I was in Providence, Rhode Island for the sixth, and final, Project Bamboo planning workshop. I don’t know if somebody said something or suggested something, but I struck upon the idea of using Zora Neale Hurston’s Mules and Men as the basis for the seminar assignment and for my own initial explorations into the various software tools that are available. I was reasonably hopeful that somewhere, someone would have digitized the text, and I was right: the text is not in Project Gutenberg, nor in the Oxford Text Archive, but at the University of Virginia’s American Studies’ hypertext collection. There I found a hypertext version of Mules and Men put together by Laura Grand-Jean in 2001.

I am not yet at a point where I could deploy a bash script that uses wget or curl or something else to fetch the pages I needed, but since I decided to focus only on the folktales section of the book, the book’s first half, it wasn’t too much of a task to click on each page, copy the text, and paste it into a plain text document in my text editor, TextMate. For reference, I also copied and pasted the HTML in hopes that it might prove useful for getting certain kinds of texts out. That is, I had hopes of figuring out how to tell a piece of software to pull everything out between <blockquote> tags. Unfortunately, Grand-Jean had used some non-standard <table> markup to handle the long blockquotes. I thought about doing some fancy find and replace work with regular expressions, but in the end I decided I would rather work with the plain text, which would also encourage (force) me to re-read the text. The latter proved useful as I came across some long texts embedded in dialogue that were worth including in the extracted corpus.
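For the record, pulling everything out between <blockquote> tags might look something like the following sketch, which uses only Python's standard-library `html.parser`. The sample HTML is invented and is not Grand-Jean's markup. The class name `BlockquoteExtractor` is my own, and the <table>-based long quotes mentioned above would need separate handling.

```python
from html.parser import HTMLParser


class BlockquoteExtractor(HTMLParser):
    """Collect the text found inside <blockquote> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # how many blockquotes we are currently inside
        self.current = []   # text fragments of the quote being read
        self.quotes = []    # finished quotes, whitespace-normalized

    def handle_starttag(self, tag, attrs):
        if tag == "blockquote":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "blockquote" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                text = " ".join("".join(self.current).split())
                self.quotes.append(text)
                self.current = []

    def handle_data(self, data):
        if self.depth > 0:
            self.current.append(data)


# Invented sample page, for illustration only.
sample_html = (
    "<p>Narration.</p>"
    "<blockquote>Once upon a <i>time</i>\nthere was a tale.</blockquote>"
    "<p>More narration.</p>"
)

extractor = BlockquoteExtractor()
extractor.feed(sample_html)
extractor.close()
print(extractor.quotes)
```

A parser is used here rather than regular expressions because inline tags and line breaks inside a quote are handled without extra patterns.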

(The plain text version of Part One of Mules and Men can be found on both Scribd and GitHub; forked critical editions of texts is an interesting idea, no? It weighs in at 55,798 words in 2,127 lines. Somewhere along the way I’ll put up some stats on word counts for block quoted text, quoted text, narrative text, etc.)
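Counts like these are easy to reproduce. The sketch below is a rough Python approximation, not the method used to get the figures above. It assumes words are whitespace-separated and that quoted speech sits between straight double quotes, which a text with curly quotes would not satisfy without adjustment. The sample text is invented.

```python
import re


def text_stats(text):
    """Count lines, words, and words inside straight double quotes."""
    quoted_spans = re.findall(r'"([^"]*)"', text)
    quoted_words = sum(len(span.split()) for span in quoted_spans)
    total_words = len(text.split())
    return {
        "lines": len(text.splitlines()),
        "words": total_words,
        "quoted_words": quoted_words,
        "narrative_words": total_words - quoted_words,
    }


# Invented two-line sample, for illustration only.
sample = (
    "She sat down on the porch.\n"
    '"Tell us the one about the mule," he said.\n'
)

stats = text_stats(sample)
print(stats)
```

Different tools split words differently (hyphens, apostrophes, dashes), so totals from this sketch should not be expected to match another program's count exactly.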

And Now for Some Software

So I’ve got a digitized text. An ethnographic text.2 That will give me people and forms, and I’m familiar enough with the kinds of speech communities involved that I can take a crack at ideas. Now I hope to use software to begin to discern those patterns more clearly. (And to produce that edge list.)

The first thing I try is SEASR’s Meandre. Meandre is really something like a software suite, consisting of server and client software, both of which you install and run locally. The server software syncs with the component and workflow repositories at SEASR HQ which are then made available to you through the workbench.

Meandre Workbench

As a quick glance at the UI reveals, it’s not exactly user friendly. Then again, none of this software really is. The good folks running the seminar have provided us with links to useful software: Network Workbench, Wordij, and Pajek (which is, sigh, Windows-only). I am still working my way through these various packages, but I have to say that so far my best results have been using IBM’s Many Eyes.


  1. The poet William Carlos Williams once advised in “A Sort of Song” to: “Let the snake wait under / his weed / and the writing / be of words, slow and quick, sharp / to strike, quiet to wait, / sleepless. / — through metaphor to reconcile / the people and the stones. / Compose. (No ideas / but in things) / Invent! / Saxifrage is my flower that splits / the rocks.” His famous urging to himself and other poets to find the ideas that already surrounded them in the world echoes the anthropological project of the twentieth century: to find the intelligence and beauty in the always already peopled world of the everyday. (My apologies to Williams for eliminating his line breaks but my software, PHP Markdown Extra, wasn’t handling a poem within a footnote at all well.) 

  2. To be sure, I’m fully aware of the potential problems of Hurston’s text. For a fuller discussion, see my essay in African American Review (JSTOR). 

It’s the end of the first day of Project Bamboo’s Workshop 6, which represents an opportunity for the larger (arguably still emergent) community to shape a response to the new context, which is, as I understand it, a function of the Mellon Foundation’s merging of the Research in Technology program with the Scholarly Communications program.

In the interval between this change in context and the workshop itself, the core PB team has worked with a group of universities who early on had identified themselves as likely partner level contributors to whatever it is we’re building. That has resulted in the Bamboo Technology Project.

The goal of the BTP is to identify “strategic areas of work” within which they can plan and, in the case of Phase I projects, build something — because across the board any number of us agree that it’s time for Bamboo to make something, to have an identifiable product that we can show to colleagues and administrators and others that reveals the potential profit in universities and other organizations collaborating in an open way to build services, software, and standards for knowledge creation and distribution. The list of partners is impressive. (I will list them in an update to this post.) The four major areas of work to be completed in Phase I are: work spaces, scholarly web services, collections interoperability, and corpora space. (Phase I is to last eighteen months, as is Phase II to follow.) The first three areas already have some pieces in place that the BTP hopes to build upon and, at the same time, begin to draw together into the kind of whole that is the promise of Bamboo.

For work spaces, there is HubZero and an ECM (Enterprise Content Management System) which will be the foundations for further work.

For scholarly web services, the partner institutions will be able to draw upon a number of projects, including, but not limited to, PhiloLogic, Perseus, CLARIN, SEASR, and Prosopography. (Links to follow.) Most of these services offer some or all of what are becoming the usual analytical tools for textual scholars: document mapping, concordance, collocation, frequency, etc. Collection interoperability will focus on metadata interchange.
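To give a sense of what two of those usual tools do, here is a minimal sketch in Python of word frequency and a keyword-in-context concordance. It is not how PhiloLogic, SEASR, or any of the other named services implement them. The tokenizer, the function names, and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def concordance(tokens, keyword, width=2):
    """Return (left context, keyword, right context) for every hit."""
    hits = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits


# Invented sample sentence, for illustration only.
sample = "The mule kicked the gate. The old mule would not move."

tokens = tokenize(sample)
frequency = Counter(tokens)

print(frequency.most_common(2))
for left, word, right in concordance(tokens, "mule"):
    print(f"{left:>10} [{word}] {right}")
```

Collocation builds on the same idea: instead of printing the context around each hit, it counts which words turn up in that context most often.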

The one area of work that will not be built but will be subject to planning in Phase I is corpora space, which is going to focus on the production of five or so white papers as well as identifying some high priority/profile corpora that can be targeted for a project. (I would like this to be a folklore corpus, of course.)

There are other projects and plans within the BTP, but much of the morning was focused on determining the kind of consortium that would, during this transitional period, support the BTP projects. This is, of course, the reverse of Bamboo’s ultimate goal, but I think it rightly puts resources and imaginations in motion. A number of organizations have stuck with the planning process now for two years, and we will, I think, continue to stick with it because we believe in the greater good that Bamboo seeks to serve. What we need are tangibles to show to others to concretize our participation and to act as an incentive for others to join.

Once more firmly established, Bamboo can do a lot of good, if it can negotiate the somewhat crowded waters of already existing as well as emerging organizations, coalitions, and other consortia with similar goals and/or visions, e.g., CHCI, CenterNet, and now CHAIN. Part of what I think Chad Kainz was struggling to do in trying to develop an organizational structure for Bamboo was to make as many people and institutions feel included as is humanly possible. (In all honesty, humanists and their organizations can be a fairly territorial lot, as contradictory as that seems to the rhetoric that we so often deploy.)

One of the things it could do, and this was the focus of our table’s conversation not once but twice during the day, is develop a federated researcher/user identification system for the humanities. Think Thomson-Reuters’ ResearcherID but open source and run by the collaboration of member organizations, and even non-member organizations. Throw in DOIs for publications, projects, datasets, tools, and workflows and you have not only a very powerful, and searchable, data stream but one that fits within every organization’s already existing workflows of annual reports and assessments and every individual scholar’s workflows of vita maintenance. And it would be a natural component/connection to institutional repositories. (I will link to the small presentation I pulled together for my colleagues at UL-Lafayette in an update.)

UPDATE: The document is here.

There was a lot more that happened today. Some of it can be gleaned from Chad and David’s slide decks, which I hope they make available later, and some of it can be found in the planning documents, which may be available on the Bamboo website. For now, I will leave off my summary of the day here.

All of us owe a huge debt of gratitude to Maida Owens and the Louisiana Folklife Program. She has single-handedly persevered in getting almost the entire run of the Louisiana Folklore Miscellany online, at least the tables of contents if not the articles themselves. Later issues, like the two issues I edited, on Cultural Catholicism and In the Wake of the Storms, also have the articles available. (The contents are in chronological order with the oldest first, so those issues are toward the bottom of the page.)

My friend Jason Jackson passes on the news that at the annual meeting of the Linguistic Society of America, the following resolution was passed:

Whereas modern computing technology has the potential of advancing linguistic science by enabling linguists to work with datasets at a scale previously unimaginable; and

Whereas this will only be possible if such data are made available and standards ensuring interoperability are followed; and

Whereas data collected, curated, and annotated by linguists forms the empirical base of our field; …

Therefore, be it resolved at the annual business meeting on 8 January 2010 that the Linguistic Society of America encourages members and other working linguists to:

  • make the full data sets behind publications available, subject to all relevant ethical and legal concerns; …
  • work towards assigning academic credit for the creation and maintenance of linguistic databases and computational tools; and
  • when serving as reviewers, expect full data sets to be published (again subject to legal and ethical considerations) and expect claims to be tested against relevant publicly available datasets.

As part of my evolving relationship with Amazon.com, I became aware of Amazon Web Services’ AWS in Education program:

AWS in Education provides a set of programs that enable the worldwide academic community to easily leverage the benefits of Amazon Web Services for teaching and research. With AWS in Education, educators, academic researchers, and students can apply to obtain free usage credits to tap into the on-demand infrastructure of Amazon Web Services to teach advanced courses, tackle research endeavors and explore new projects – tasks that previously would have required expensive up-front and ongoing investments in infrastructure.

With AWS you can requisition compute power, storage, database functionality, content delivery, and other services — gaining access to a suite of elastic IT infrastructure services as you demand them. AWS enables the academic community to inexpensively and rapidly build on global computing infrastructure to pursue course projects and accelerate their productivity and research results, while enjoying the same benefits of reliability, elasticity, and cost-effectiveness used by industry. The AWS in Education program offers: Teaching Grants for educators using AWS in courses (plus access to selected course content resources); Research Grants for academic researchers using AWS in their work; Project Grants for student organizations pursuing entrepreneurial endeavors; Tutorials for students that want to use AWS for self-directed learning; Solutions for university administrators looking to use cloud computing to be more efficient and cost-effective in the university’s IT Infrastructure.

The National Academies Press has just released a 180-page book on Ensuring the Integrity, Accessibility, and Stewardship of Research Data in the Digital Age. The link will take you to the book’s page on the press’s website. It’s available as a paperback for $31.46, as a PDF for $27, or as a combo for $41. You can also follow a link on the page to read it on-line for free.

An article in a recent PNAS (Proceedings of the National Academy of Sciences) describes the use of stylometry, the study of artwork through math and statistics, to analyze paintings in order to determine whether they are the work of the attributed master, the work of a student, or fakes. The paper describes a technique called sparse coding, in which “analysts break down works of art into tiny patches and represent them as a series of mathematical functions. By comparing the functions produced with authentic artwork to those from possible imitators, they can produce an objective measure of whether the piece in question is real or fake.” The cover story on Ars Technica explains:

Sparse coding was originally developed for studying how neurons in the brain responded to visuals. It works by breaking down an image—for simplicity’s sake, usually one in grayscale—into mathematical functions, pixel by pixel. The images that are broken down are just small patches of whole works, not much more than a dozen pixels square.
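The first step described there, cutting a grayscale image into small square patches, is simple enough to sketch in Python. This covers only the patch-cutting, not the sparse coding itself, which would go on to represent each patch as a combination of learned functions. The tiny 4 x 4 "image" and the function name `extract_patches` are invented for illustration. A real analysis would use patches closer to a dozen pixels on a side.

```python
def extract_patches(image, size):
    """Split a grayscale image into non-overlapping size x size patches.

    The image is a list of rows, each row a list of brightness values.
    Leftover pixels at the right and bottom edges are dropped.
    """
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - size + 1, size):
        for left in range(0, width - size + 1, size):
            patch = [row[left:left + size] for row in image[top:top + size]]
            patches.append(patch)
    return patches


# Invented 4 x 4 image with brightness values 0 through 15.
image = [[r * 4 + c for c in range(4)] for r in range(4)]

patches = extract_patches(image, 2)
for patch in patches:
    print(patch)
```

Non-overlapping patches are the simplest choice to read. Overlapping patches, taken by stepping one pixel at a time, would be a small change to the two `range` calls.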

A recent story in the New York Times reveals what all long-time observers of the humanities know already: in the era of careerism, the humanities are a “hard sell.” (The quotation marks are there to emphasize that the irony of using that phrase is quite purposeful.) Kate Zernike’s story profiles a number of universities, one of which is my very own. (The shuttering of the philosophy department is mentioned early in the piece, but there is no further commentary nor mention of UL Lafayette.)