Audio Digitization for the “Oral Histories of the American South” Project

The description of the audio digitization process given below dates from 2007, but it remains a model of thorough description:

Cassettes are played on a Nakamichi MR-1 discrete-head professional cassette deck. The tape heads are cleaned before each side of a cassette is played, and the azimuth (the angle between the tape heads and the tape medium) is adjusted to create maximum contact between the playback head and the tape and so ensure the widest frequency response. Playback equalization is set to 120 µs for IEC-standard Type I cassettes and 70 µs for Type II and Type IV cassettes.

The XLR outputs of the Nakamichi transmit the balanced signal directly to the Apogee Rosetta 200, a 24-bit, 2-channel analog-to-digital and digital-to-analog converter. The signal is digitized at a 96 kHz sample rate and 24-bit depth and travels to the computer via an XLR cable from the converter's digital outputs to the AES/EBU port of the Lynx One sound card.

Recently, we added the Apogee Big Ben Master Digital Clock, a master word clock that virtually eliminates any possible jitter [abrupt and unwanted variation of one or more signal characteristics] that can cause high frequency distortions in the signal. This process creates audio files with excellent clarity and a very large quantity of information. A typical file, representing one side of one cassette, comprises around 1 GB of data.
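
As a rough sanity check on that figure, here is a minimal sketch of the arithmetic, assuming an uncompressed stereo file and a 30-minute cassette side (both assumptions on my part, not figures from the project description):

```python
# Rough size estimate for an uncompressed stereo capture at the stated settings.
# The 30-minute side length and the absence of container overhead are assumptions,
# not part of the original description.
sample_rate = 96_000   # samples per second, per channel
bit_depth = 24         # bits per sample
channels = 2
minutes = 30

total_bytes = sample_rate * (bit_depth // 8) * channels * minutes * 60
print(f"{total_bytes / 1e9:.2f} GB")   # ~1.04 GB, in line with the ~1 GB figure
```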

Files are then stored in designated digital deep storage on the library's archival servers, as they are too large to be stored on CD without converting the sample rate to 44.1 kHz and reducing the quality.

The signal from each source can be monitored separately through Genelec 8030A bi-amplified monitors routed through a Coleman Audio MS6A switcher with monitor controller. The switcher has balanced XLR inputs and outputs to preserve the signal-to-noise ratio and features completely passive switching. Interviews are re-recorded using WaveLab, a non-linear digital audio software platform.

Each cassette side is recorded, assigned a number as a preservation master (PM), entered into a PM database along with pertinent metadata, and saved as a single audio file into deep storage in a dedicated digital archive maintained by UNC. The interview audio file is then converted into a file for burning a CD listening copy for in-house library patron research. First the file is resampled to 44.1 kHz at 24-bit depth for audio processing. The audio file is processed in Sound Forge version 8.0 with Waves X Restoration, VST, DirectX, and Sony audio plug-ins to improve the quality.

A typical file requires two processes: normalization to an average RMS (root mean square) level of -14 dB, with dynamic compression applied to increase the volume, and noise reduction to remove as much background noise, tape hiss, and rumble as possible without affecting the source material. Some files require more specific equalization or a series of noise-reduction passes to achieve audio of suitable quality and volume for researchers. The file is then converted to 16-bit samples and burned to a CD listening copy on a professional-grade Mitsui gold audio CD at 4x speed with a Plextor DVDR PX-716A 1.09 drive using Sony CD Architect software, version 5.2. CDs are tested to confirm that audio is present. Finally, all individual audio files that comprise a complete interview are arranged in order and converted to a single 256 Kbps, 44.1 kHz, 16-bit, stereo MP3 audio file for the Documenting the American South, Oral Histories of the American South collection interface.
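
For readers unfamiliar with RMS normalization, here is a minimal sketch of the arithmetic behind that first step, not the Sound Forge workflow itself: measure the signal's RMS level in dBFS, then compute the gain needed to bring it to -14 dB. The compression and noise-reduction stages are separate and not shown, and the sample signal is a made-up example.

```python
import math

def rms_dbfs(samples):
    """RMS level, in dBFS, of samples normalized to the range [-1.0, 1.0]."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(rms) if rms > 0 else float("-inf")

def gain_to_target(samples, target_db=-14.0):
    """Linear gain factor that would bring the signal's RMS up (or down) to the target."""
    return 10 ** ((target_db - rms_dbfs(samples)) / 20)

# Hypothetical quiet signal, well below the -14 dB target.
quiet = [0.05 * math.sin(i / 10) for i in range(48_000)]
print(round(rms_dbfs(quiet), 1))         # about -29 dBFS
print(round(gain_to_target(quiet), 2))   # gain of roughly 5.6x to reach -14 dB RMS
```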

The above description accurately reflects the current digitization process, a set of practices resulting from regular evaluation of current digitization standards and of our ability to meet and surpass them with the equipment and time we have available. When audio digitization of the interviews began on November 1, 2005, masters were recorded at 44.1 kHz with 16-bit samples. Soon, we hope to replace the Lynx One sound card with a FireWire card, removing another gain stage from the signal chain so as to create digital preservation masters with as little information lost, or added, as possible.

A National Public, and Digital, Library Is Coming

*The Chronicle of Higher Education* reports this morning on a meeting held at Harvard about developing a national digital library. The HathiTrust was mentioned. (It also seems like something Project Bamboo should put on its radar.)

Rethinking the relationship between physical and digital

In most discussions of the relationship between the physical and the virtual, or digital, worlds (books, for example), there is an assumption that the digital version will supersede the physical. (There are a number of interesting conversations about those things that will be better left to the physical, especially in terms of books, but that’s for another time.) [*Editions volumiques*](http://www.volumique.com/en/) is a development shop that seems to be one of the few to grasp that the advent of the digital affords us the opportunity to rethink the relationship between the virtual and physical dimensions. The link above will take you to their website, where you can preview a number of their projects:

* *Pawn* is a dynamic board game, somewhat like the old text adventure games, e.g. Zork, where you move a piece on your iPhone and different options pop up near it.
* *Pirate* takes the opposite tack: you move your iPhone around on a paper map and interact with other ships, i.e. other phones.
* *The Night of the Living Dead* attempts to turn the physical book into a linked narrative, à la the early experiments in hyperfiction. (Really, only _Hopscotch_ did that somewhat well to my mind, but I could be proved wrong rather easily.)
* *Labyrunthe* pursues this in an even more elaborate physical form, resembling at times the cube-folding puzzles from standardized tests.
* *Duckette* plays with e-ink to make an interactive game.
* *Kernel Panic* … I don’t quite get.

But you should definitely go check these things out for yourself. Each project has a Flash-animated preview that is short and fun just to watch.

DH/Networking Explorations 1

## Following My Own Advice

For years now I have been encouraging students, both beginning and advanced, to keep a journal of their activities as one way of breaking down the barrier to getting writing done. I have especially encouraged graduate students working on their dissertations to try it. And I have done this while being only an intermittent practitioner myself. (I confess that this is in part one of the great advantages of having a spouse who practices the same profession: one is free to do much of the daily review over the dinner table. The prêt-à-écouter audience is great, but it disengages one important dimension of the process: writing.)

And so, John Anderson, if you are reading this post, here is me doing what I said, an account of trying my hand at textual analysis.

## The Onus ##

At the end of last year I was invited to participate in an NEH seminar on “Networks and Networking in the Humanities” which will be hosted by UCLA’s Institute for Pure and Applied Mathematics later this summer. Earlier this year the participants received a list of homework assignments: two books to read, a technical paper or two, and the production of an edge list.

The books have been interesting. (More on each one in separate posts.) The technical paper was at the border of my ken, but I followed chunks of it. The production of the edge list, a list of the links in a network, has been the hardest task. Of course, part of it was nomenclature. “Edge list” threw me for a loop, new as I am to networkese, but I grokked it with the help of the assigned reading — and a variety of web reading. (Thank you, intarwebs.)

But there was another dimension to the edge list assignment that was stymying me: the data. Yes, I have the emergent data from the boat book, but I don’t feel entirely comfortable rushing to produce more data for the sake of the seminar if it means rushing certain dimensions of the research, and I don’t yet have a firm enough grip on the data I already have to feel comfortable pouring it into a new paradigm of analysis and modeling. (Like some mental version of Twister.)

And so I needed a data set with which I could work that would allow me to do the kind of analysis that I hoped network theories and models would make possible. In particular I am interested in applying these paradigms to ethnographic contexts where we need to understand how individuals make their way through the world using the ready-made mentifacts that we sometimes call folklore as “equipment for living.”

What I think that means is that I want to understand how individuals within a given group (a social graph, if you will) draw from a repertoire (network) of forms (stories, legends, anecdotes, jokes, etc.) which themselves variously reflect and refract a network of ideas (ideology) dispersed (variably) throughout the group.
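
To make that concrete, and to show what the seminar’s “edge list” amounts to in this context, here is a minimal sketch of a people-to-forms edge list written as a two-column CSV, the lowest-common-denominator format most network tools will read. The teller and tale names are placeholders of my own, not data from any actual corpus:

```python
import csv

# Hypothetical teller-to-tale edges: each row links a person (a node in the
# social graph) to a form (a node in the repertoire). All names are placeholders.
edges = [
    ("Teller A", "Tale 1"),
    ("Teller A", "Tale 2"),
    ("Teller B", "Tale 2"),
    ("Teller C", "Legend 1"),
]

with open("edges.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["source", "target"])   # header row many tools expect
    writer.writerows(edges)
```

Two tellers who share a tale are then linked indirectly through it, which is one way to begin modeling how forms, and the ideas they carry, circulate through a group.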

Networks of People, Stories, and Ideas

Or, as folklorist Henry Glassie once put it: “Culture is made up of ideas, society of people.” But ideas don’t just bounce around in people’s heads, and they rarely exist out in the world unencapsulated. Ideas and values are usually embedded in the things we say and do.[^1] We keep these things around, these stories and explanations, because they resonate with our values and beliefs. At the same time, the forms not only give shape to the ideas but also shape them.

This dynamic interaction has been the focus of folklore studies for the past century. For the last forty years, studies of culture and language have taken an ethnographic turn, sometimes called “performance” and sometimes called “ethnomethodology,” which has focused on the important role that individuals play in the intertextual network of forms (and thus the ideological network embedded within them).

I am one of those performance-oriented scholars. Performance studies has produced a wide range of profound micro-level studies of folklore in action. In the last decade or so, there have been attempts to build back toward the philological framework from which the performance orientation sprang and against which it initially pushed. It’s time to fold these things together, and I think network theories offer one possibility for doing so.

## The Data ##

If not my own data, then what other corpus? I wanted to work with materials that I knew fairly well. I began to build a database of Louisiana folklore in print, focusing especially on tales and legends, but the amount of time to get a large enough corpus digitized and into the database, even using OCR software, quickly loomed too large. A great project, but one that could easily take up an entire summer, not the limited time I had to get something up and usable in order to begin to complete the seminar assignment — which I was late fulfilling anyway.

I did, however, initiate some conversations that may yet produce a foundation for such a database, contacting authors of several texts for electronic copies of their manuscripts to facilitate data entry. (The metadata is entirely a separate matter for now.)

The answer to my question didn’t come to me until I was in Providence, Rhode Island, for the sixth, and final, Project Bamboo planning workshop. I don’t know if somebody said something or suggested something, but I struck upon the idea of using Zora Neale Hurston’s _Mules and Men_ as the basis for the seminar assignment and for my own initial explorations into the various software tools that are available. I was reasonably hopeful that somewhere, someone would have digitized the text, and I was right: the text is not in Project Gutenberg, nor in the Oxford Text Archive, but it is available in the American Studies [hypertext collection][xroads] at the University of Virginia. There I found a [hypertext version of _Mules and Men_ put together by Laura Grand-Jean in 2001][lgj].

I am not yet at a point where I could deploy a `bash` script to `wget` or `curl` the pages I needed, but since I decided to focus on only the folktales section of the book, the book’s first half, it wasn’t too much of a task to click on each page, copy the text, and paste it into a plain text document in my text editor, TextMate. For reference, I also copied and pasted the HTML in hopes that it might prove useful for getting certain kinds of texts out. That is, I had hopes of figuring out how to tell a piece of software to pull out everything between a given pair of tags. Unfortunately, Grand-Jean had used some non-standard markup to handle the long blockquotes. I thought about doing some fancy find-and-replace work with regular expressions, but in the end I decided I would rather work with the plain text, which would also encourage (force) me to re-read the text. The latter proved useful as I came across some long texts embedded in dialogue that were worth including in the extracted corpus.
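
For what it’s worth, here is a minimal sketch of the kind of extraction I had in mind, written with Python’s built-in `html.parser` and assuming the pages had used standard `<blockquote>` tags (which, as noted, these pages did not). The filename is a placeholder:

```python
from html.parser import HTMLParser

class BlockquoteExtractor(HTMLParser):
    """Collect the text content of <blockquote> elements from an HTML page."""

    def __init__(self):
        super().__init__()
        self.depth = 0          # tracks nesting so inner tags stay inside one quote
        self.current = []
        self.blockquotes = []

    def handle_starttag(self, tag, attrs):
        if tag == "blockquote":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "blockquote" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.blockquotes.append("".join(self.current).strip())
                self.current = []

    def handle_data(self, data):
        if self.depth > 0:
            self.current.append(data)

# "chapter1.html" is a placeholder filename, not one of Grand-Jean's actual pages.
parser = BlockquoteExtractor()
with open("chapter1.html", encoding="utf-8") as f:
    parser.feed(f.read())

for quote in parser.blockquotes:
    print(quote, "\n---")
```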

(The plain text version of Part One of _Mules and Men_ can be found both on [Scribd][] and on [GitHub][] — forked critical editions of texts are an interesting idea, no? It weighs in at 55,798 words in 2,127 lines — somewhere along the way I’ll put up some stats on word counts for block-quoted text, quoted text, narrative text, etc.)
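
A minimal sketch of how such counts might be generated from the plain-text file; the filename and the leading-quote-mark heuristic for “quoted” lines are my own assumptions, not features of the posted edition:

```python
# Count lines and words, with a crude first pass at separating quoted speech
# from narration by checking whether a line opens with a quotation mark.
with open("mules-and-men-part-one.txt", encoding="utf-8") as f:
    lines = f.read().splitlines()

total_words = sum(len(line.split()) for line in lines)
quoted_words = sum(
    len(line.split())
    for line in lines
    if line.lstrip().startswith(('"', "\u201c"))
)

print(f"{len(lines)} lines, {total_words} words total, "
      f"{quoted_words} words on lines that open with a quote mark")
```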

## And Now for Some Software ##

So I’ve got a digitized text. An ethnographic text.[^2] That will give me people and forms, and I’m familiar enough with the kinds of speech communities involved that I can take a crack at ideas. Now I hope to use software to begin to discern those patterns more clearly. (And to produce that edge list.)

The first thing I try is SEASR’s [Meandre][]. Meandre is really something like a software suite, consisting of server and client software, both of which you install and run locally. The server software syncs with the component and workflow repositories at SEASR HQ, which are then made available to you through the workbench.

Meandre Workbench

As a quick glance at the UI reveals, it’s not exactly user-friendly. Then again, none of this software really is. The good folks running the seminar have provided us with links to useful software: Network Workbench, Wordij, and Pajek (which is, sigh, Windows-only). I am still working my way through these various packages, but I have to say that so far my best results have come from [IBM’s Many Eyes][ibm].

[xroads]: http://xroads.virginia.edu/~HYPER/hypertex.html
[lgj]: http://xroads.virginia.edu/~MA01/Grand-Jean/Hurston/Chapters/siteintroduction.html
[Scribd]: http://www.scribd.com/doc/33800238/Zora-Neale-Hurston-s-Mules-and-Men-in-plain-text
[GitHub]: http://github.com/johnlaudun/Mules-and-Men
[Meandre]: http://seasr.org/meandre/download/
[ibm]: http://manyeyes.alphaworks.ibm.com/manyeyes/users/johnlaudun
[^1]: The poet William Carlos Williams once advised in “A Sort of Song” to: “Let the snake wait under / his weed / and the writing / be of words, slow and quick, sharp / to strike, quiet to wait, / sleepless. / — through metaphor to reconcile / the people and the stones. / Compose. (No ideas / but in things) / Invent! / Saxifrage is my flower that splits / the rocks.” His famous urging to himself and other poets to find the ideas that already surrounded them in the world echoes the anthropological project of the twentieth century: to find the intelligence and beauty in the always already peopled world of the everyday. (My apologies to Williams for eliminating his line breaks but my software, `PHP Markdown Extra`, wasn’t handling a poem within a footnote at all well.)
[^2]: To be sure, I’m fully aware of the potential problems of Hurston’s text. For a fuller discussion, see my essay in _African American Review_ ([JSTOR](http://www.jstor.org/stable/1512231)).

Alan Burdette Is Coming to Louisiana

Dr. Alan Burdette, Director of the EVIA Digital Archive and Associate Director for Digital Humanities Infrastructure, will be in Louisiana the week of May 17-21. He is traveling to the Association for Recorded Sound Collections annual meeting in New Orleans, but he has agreed to come in early and meet with faculty interested in the projects that he and others have initiated in the digital humanities. He also said he was happy to meet with an executive team and share openly what Indiana University has learned in its efforts to build a cyberinfrastructure that can support both faculty research and communication and the university’s efforts to position itself in the new digital learning landscape. In particular, his work on the EVIA Digital Archive, a cooperative effort between Indiana University and the University of Michigan funded by the Mellon Foundation, has given him a lot of insight into the current state of digital archives and the infrastructures, like an institutional repository, that play a role in such projects.

Tattoo You

The Text Analysis Developers Alliance has released an embeddable Flash widget that provides [TAPoR analytics](http://taporware.mcmaster.ca/) for the page on which it resides.

Here’s an example of the embedded widget:

Oh, yeah, that *tattoo* is short for Text Analysis TOOls. (Actually, it gets even worse, but I’m too embarrassed to repeat their version.)

Analog Blog in Africa

From Motherboard TV comes this story of a man in Monrovia, Liberia, who gleans the news from both traditional print sources and digital ones and compiles them (aggregates them, in the new terminology) into … wait for it … a whiteboard blog.