More TEI Work

I am still working through the phylogenetic material, both on Nouvelle Mythologie Comparée and in the materials that Julien d’Huy sent me. Both d’Huy’s work and Tehrani’s require better and more texts than I currently possess: my corpus of Louisiana legends weighs in at close to 30 oral texts and another two dozen or so literary texts. I need more.

And I need better texts. So far, I have been working mostly with just the texts (that is, no metadata of any kind), a process that has revealed its limitations, more and more, over the past year. If I look at the kind of analyses that I find most compelling, Jamshid Tehrani’s study of Little Red Riding Hood, for example, and if I consider the roadblocks I’ve encountered in my own work, I need to be able to mark up texts with a variety of analytical details that folklorists find useful: motifs (and/or plot points not currently motifs), locations, performers, etc.

TEI is the best way forward, but, if I haven’t said it before, it is not an intuitive markup. Where I would err on the side of brevity, `<source>`, TEI opts for something a bit more cumbersome, `<sourceDesc>`. I spent a good portion of the day working with both the tutorials and examples, as well as scanning the GitHub materials and some of the other forms of documentation.

I’ve begun to build a basic framework for the next set of materials that I am folding into the project: Gerard Hurley’s 1947 survey of American treasure legendry, “Buried Treasure Tales in America”. The last part of the essay enumerates 102 tales, many of which were published in the Journal of American Folklore or Western Folklore — I am working on a complete bibliography of all this material, if anyone is interested, and I’m happy to share it, so long as everyone remembers it’s a work in progress.

What I spent much of today doing was copying and pasting texts out of the JSTOR PDFs of the JAF articles and wrapping TEI around them. Pasting OCR is never as straightforward as it sounds, so there was a fair amount of cleanup to do. I also normalized eye dialect so that, should I want to run these texts through some scripts as is, I won’t have to deal with differences between “Ah” and “I” or “jes'” and “just”.
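For anyone curious what that kind of normalization might look like, here is a minimal sketch in Python. It is not the script used for this project: the mapping table and the function name `normalize_eye_dialect` are hypothetical, and a real table would have to be built up by hand from the texts themselves.

```python
import re

# Hypothetical mapping of eye-dialect spellings to standard forms.
# Only the two pairs mentioned in the post are included.
EYE_DIALECT = {
    "Ah": "I",
    "jes'": "just",
}


def normalize_eye_dialect(text, mapping=EYE_DIALECT):
    """Replace whole-token eye-dialect spellings with their standard forms."""
    # Longest keys first so overlapping entries resolve predictably.
    keys = sorted(mapping, key=len, reverse=True)
    # Lookarounds instead of \b, because forms like "jes'" end in an apostrophe.
    pattern = re.compile(
        r"(?<![\w'])(" + "|".join(re.escape(k) for k in keys) + r")(?![\w'])"
    )
    return pattern.sub(lambda match: mapping[match.group(1)], text)


sample = "Ah jes' knew somebody had buried it there."
normalized = normalize_eye_dialect(sample)
print(normalized)
```

Matching whole tokens matters here: a naive string replace of “Ah” would also mangle a word like “Ahead”.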

In addition to the texts and the bibliographic information, I also need to capture the page(s) on which the text appeared, but I don’t want that page number to be in the text itself, since it’s not at all important for analysis. My best guess is to include it under `<sourceDesc>` in the `teiHeader`. I also want to include other information in the source document, notes about collection and especially about tellers that might be useful to future users of these TEI documents, but sorting out where such things go is a bit of a trick. I did, however, finally determine how to embed location metadata in the `teiHeader`.
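As a rough illustration of the guess above, the sketch below builds a skeletal header in Python with the standard library. It is one possible arrangement, not a validated TEI document and not necessarily the structure used in this project: the page range goes in a `biblScope` inside `sourceDesc`, and the location goes in `profileDesc` under `settingDesc`. The function name `build_header`, the title, the page numbers, and the place are all placeholders.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI_NS)


def tei(tag):
    """Return a tag name qualified with the TEI namespace."""
    return f"{{{TEI_NS}}}{tag}"


def build_header(title, journal, page_from, page_to, place):
    """Build a skeletal teiHeader holding source pages and a location."""
    header = ET.Element(tei("teiHeader"))
    file_desc = ET.SubElement(header, tei("fileDesc"))
    title_stmt = ET.SubElement(file_desc, tei("titleStmt"))
    ET.SubElement(title_stmt, tei("title")).text = title
    publication = ET.SubElement(file_desc, tei("publicationStmt"))
    ET.SubElement(publication, tei("p")).text = "Working transcription."
    # Page numbers live with the source description, outside the text itself.
    source_desc = ET.SubElement(file_desc, tei("sourceDesc"))
    bibl = ET.SubElement(source_desc, tei("bibl"))
    ET.SubElement(bibl, tei("title"), {"level": "j"}).text = journal
    scope = ET.SubElement(
        bibl,
        tei("biblScope"),
        {"unit": "page", "from": str(page_from), "to": str(page_to)},
    )
    scope.text = f"pp. {page_from}-{page_to}"
    # Location metadata: one option is settingDesc inside profileDesc.
    profile = ET.SubElement(header, tei("profileDesc"))
    setting_desc = ET.SubElement(profile, tei("settingDesc"))
    setting = ET.SubElement(setting_desc, tei("setting"))
    ET.SubElement(setting, tei("name"), {"type": "place"}).text = place
    return header


# Placeholder values only; none of these come from an actual source.
header = build_header(
    "Treasure legend (placeholder)", "Journal of American Folklore", 12, 14, "Louisiana"
)
if hasattr(ET, "indent"):  # pretty-printing is available in Python 3.9+
    ET.indent(header)
xml_out = ET.tostring(header, encoding="unicode")
print(xml_out)
```

Keeping the pages in the header means an analysis script can read the body of the text without ever tripping over a page number.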

I know I need to go back and double-check on the adaptations of TEI by linguists and oral historians as I continue to move forward with a TEI for folklore studies. I know, I know.

A Brief Note on Exchanging Texts

*This is Part 2 in the [TEI and folklore studies series][TEIfolk].*

At the center of most humanistic endeavors lies text. It can be a single text, or it can be many texts. And the texts themselves can be any length, from the few words of a particular utterance from a particular individual in a particular moment to the thousands of words that make up all of Shakespeare’s work or the millions of words that make up the novels published in England in the nineteenth century.

But what is the nature of such texts? Folklorists, for example, have long made the nature of spoken texts as represented in written form one of the central considerations of our disciplinary practice. We have, for example, worried over the proper way to represent to readers differences in pronunciation or rhythm within a text or the way a story was actually told, while at the same time trying to work out how best to think about the ontology of texts within a given speech community: where does one text end and another begin within the flow of spoken discourse during such tellings? Folklorists are not alone in their considerations about the nature of a given text or set of texts. Scholars of written communication face similar questions in trying to come to grips with how to think about the relationship between the physical pagination of a text and the flow of words.

Such questions about what makes a book, for example, have become especially interesting as information technologies have made new forms of textuality possible as well as offering alternative vehicles for traditional forms of textuality (e.g., ebooks). Information technologies have also opened up entirely new means not only for the exchange of analyses, but also for texts themselves, and that returns us to the question of textuality with which we began: how do we best share texts with each other?

While there have been a number of experiments over the past two or three decades, two principal formats have emerged as dominant for humanities scholars seriously interested in taking full advantage of the internet (ignoring, for the moment, the de facto standards of Microsoft Word file formats and Adobe PDF).[^1] Most importantly, they are not in conflict. The first is plain text, and it is exactly what it says: a sequential file of characters encoded in one of a few widely known standards, typically ASCII or one of the Unicode encodings, that is readable as textual material pretty much as is. That is, you do not need any particular application to open a plain text file. Plain text documents are distinguished from formatted text documents, which often contain not only information on how to format text (italics or bold, for example) but also sometimes how to structure the text in a fashion that goes beyond simple formatting. One can do both in Microsoft Word, for example: simply format text in situ, using the many styling options available, and/or tie that formatting to particular parts of a document or its structure. That is, a line in Word can be bolded because the author wishes it to be bolded, or it can be bolded and noted, within the file format, as a heading of a certain level. The problem with Microsoft Word is that it is a proprietary format, subject to change without notice, and often easily broken. (Fortunate indeed is she who hasn’t had a Word document go suddenly wonky.)
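To make “a sequential file of characters encoded in one of a few widely known standards” a little more concrete, here is a small Python sketch. It only shows how characters turn into bytes and back; the sample lines are borrowed from the transcript used later in this post.

```python
# A line containing one non-ASCII character: the curly apostrophe (U+2019).
line = "so I said, yeah, I’ll take it."
utf8_bytes = line.encode("utf-8")

# Every ASCII character is one byte in UTF-8; the curly apostrophe takes three.
print(len(line), "characters,", len(utf8_bytes), "bytes")

# Decoding with the same standard gets the text back unchanged.
round_trip = utf8_bytes.decode("utf-8")

# Text that stays within ASCII is byte-for-byte identical in both encodings.
ascii_line = "And we dug house basements."
same_bytes = ascii_line.encode("ascii") == ascii_line.encode("utf-8")
print(same_bytes)
```

Any program that knows the encoding can read the file, which is the whole point: nothing about the bytes ties them to one application.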

*See “What Files Actually Look Like” below.*

There is, of course, a middle ground, one that most of us use on an almost daily basis, even if we know relatively little about it, and that is HTML.[^2] As its name makes clear, HTML offers authors the ability to mark up their texts using a specific set of tags that are indicated by being enclosed within angle brackets. While the text, as the illustration makes clear, may look complex, it remains encoded in plain text and thus readable by machines, and thus humans, in a wide variety of contexts. What HTML does is provide a standard way of describing how documents should be presented. Encoding bits of text, or links or images, is relatively straightforward, and because it is all based in plain text, the chances of things getting garbled, as files bounce around the internet from transmission point through a series of computers and cables until reaching the browser in our own devices, are relatively small. (And usually the garbled bits are fairly local: it is rarely the case that an entire document will be scrambled because a bit got dropped here or there. That is the fundamental strength of HTML and, thus, of the web.)

What you may not know about HTML is that it is actually a specific variant of a more generalized form of encoding known as Standard Generalized Markup Language, or SGML, and was developed by Tim Berners-Lee in order to facilitate communications among physicists.[^3] The original set of HTML tags was very small and focused mostly on presentational matters: that this *word* should be italicized and this **one** should be in bold was indicated in the original HTML markup by the words being enclosed in tags, as in `<i>word</i>` and `<b>one</b>`. There was also a limited set of tags that indicated the structure of a document, like whether a line of text was a heading or a paragraph.

As the number of computers serving HTML documents increased and the number of people exchanging such documents increased, the web was born, and while HTML has increased in its presentational richness, its semantic dimensions have remained, for the most part, rather limited. Perhaps the best example of semantic poverty can be found in the use of italics, which are a typographical convention used to indicate the titles of books and works of art, to indicate a particular emphasis by the author, the introduction of a new term, or the use of a word or phrase from another language. That is, HTML, in replicating the printed page, also preserved its semantic impoverishment, which seems rather a waste of time given how much more computers should be able to do for us.
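The difference is easy to see side by side. The sketch below, in Python, parses the same made-up sentence twice: once with four presentational `<i>` tags, and once with the TEI elements `title`, `emph`, `term`, and `foreign`. Both fragments are invented for illustration, and the TEI one omits the namespace and header a real document would carry.

```python
import xml.etree.ElementTree as ET

# Four italicized spans: visually identical, semantically different.
html_fragment = (
    "<p>She read <i>Moby-Dick</i> and <i>loved</i> it. A <i>motif</i>, "
    "she said at the <i>fais do-do</i>, is a narrative unit.</p>"
)

# The same sentence, with each span marked for what it is.
tei_fragment = (
    "<p>She read <title>Moby-Dick</title> and <emph>loved</emph> it. "
    "A <term>motif</term>, she said at the "
    '<foreign xml:lang="fr">fais do-do</foreign>, is a narrative unit.</p>'
)

html_tags = [child.tag for child in ET.fromstring(html_fragment)]
tei_tags = [child.tag for child in ET.fromstring(tei_fragment)]

# Only the semantic markup lets a script ask for "all the titles".
titles = [el.text for el in ET.fromstring(tei_fragment).iter("title")]

print(html_tags)
print(tei_tags)
print(titles)
```

With the HTML version, a script can find the italics but cannot tell a book title from a stressed word; with the semantic version, it can.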

Fortunately, for humanists especially, at the same time Tim Berners-Lee and Robert Cailliau were developing HTML, humanities scholars and archivists were working in parallel to develop a robust system for capturing not only texts themselves in an electronic form but also the wide array of information to which users of such texts regularly like to have access. Such information varies by domain and by practitioner, and TEI, as the markup system for the Text Encoding Initiative has come to be called, has the ability to adapt to those differences. TEI refers to the consortium behind the project, to the project itself, and, in general, to the flexible set of schemas that the project produces.

The goal of TEI is to “develop and maintain a standard for the representation of texts in digital form.”[^4] The standard itself is spelled out in the guidelines, which attempt to “define and document a markup language for representing the structural, renditional, and conceptual features of texts.” The most recent version of the guidelines is P5, with the P standing, as I understand it, for proposal, since each version of the guidelines undergoes extensive discussion and revision before being approved for publication.

### What Files Actually Look Like

Below is a representative sample of the same document in plain text, HTML, TEI, PDF, RTF, and Word’s DOCX format. All were based on the following text and then accessed using the Unix `head` command. Here’s the original text:

> He was in the excavating business,
> so he called me to come up and showed me the job.
> And we dug house basements.
> And that was when they were remodeling a lot filling stations,
> making them super service and that sort of thing,
> so I said, yeah, I’ll take it.
> So I worked there about two years and a half.
> And then we came back to Bloomington.
> At that time, my brother-in-law–
> he’s passed away–
> but at that time he owned a furniture store, United furniture. 

The text is 11 lines long, and `head` shows the user the first ten lines of a file by default. Compare the size of the documents in the table below and then compare where the “data” (the actual text above) occurs in each of the files below. (Or, where it doesn’t.) It makes you wonder what kind of vessel you are pouring your hard work into, no? (My friend Henry Glassie is nodding his head vigorously at this moment, and for all the right reasons.)
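If you have never used `head`, its default behavior is simple enough to imitate in a few lines of Python. This is only a sketch of the idea (the real command has more options), and the function name `head` and the temporary file are mine, for illustration.

```python
import os
import tempfile
from itertools import islice


def head(path, n=10, encoding="utf-8"):
    """Return the first n lines of a text file, like the Unix head command."""
    with open(path, encoding=encoding, errors="replace") as handle:
        return [line.rstrip("\n") for line in islice(handle, n)]


lines = [
    "He was in the excavating business,",
    "so he called me to come up and showed me the job.",
    "And we dug house basements.",
    "And that was when they were remodeling a lot filling stations,",
    "making them super service and that sort of thing,",
    "so I said, yeah, I’ll take it.",
    "So I worked there about two years and a half.",
    "And then we came back to Bloomington.",
    "At that time, my brother-in-law–",
    "he’s passed away–",
    "but at that time he owned a furniture store, United furniture.",
]

# Write the 11-line transcript to a throwaway file, then read it back.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "goble-text.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n")
    first_ten = head(path)

print(len(first_ten))
print(first_ten[-1])
```

The eleventh line, the one about the furniture store, never appears, which is why it is missing from the plain-text listing further down.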

Here are the files and their sizes:

| File | Size | Multiple of the original text |
| --- | --- | --- |
| goble-text.txt | 468 bytes | |
| goble-html.html | 592 bytes | |
| goble-tei.tei | 660 bytes | |
| goble-pdf.pdf | 16 KB (16,000 bytes) | 34X |
| goble-rtf.rtf | 18 KB (18,000 bytes) | 38X |
| goble-docx.docx | 45 KB (45,000 bytes) | 96X |

In short, Word creates a document far larger (almost one hundred times larger!) than the information that it contains. Moreover, none of that metadata, the stuff going with your data, your text, adds anything of semantic value: anything added here is simply additional text added by the analyst herself.
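The multiples are nothing more than each file’s size divided by the size of the plain-text file. Here is the arithmetic as a short Python sketch, using the sizes reported above and treating 1 KB as 1,000 bytes, as the figures above do.

```python
# File sizes in bytes, as reported above (KB figures rounded to thousands).
sizes = {
    "goble-text.txt": 468,
    "goble-html.html": 592,
    "goble-tei.tei": 660,
    "goble-pdf.pdf": 16_000,
    "goble-rtf.rtf": 18_000,
    "goble-docx.docx": 45_000,
}

baseline = sizes["goble-text.txt"]
ratios = {name: size / baseline for name, size in sizes.items()}

for name, ratio in ratios.items():
    print(f"{name:18} {sizes[name]:>7,} bytes  {ratio:5.1f}x")
```

The HTML and TEI versions stay within about one and a half times the size of the bare text, while the binary formats run from roughly 34 to 96 times.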

Now to what these things look like from the computer’s point of view. As Mike Rowe would say, get ready to get dirty. It’s not pretty:

% head goble-text.txt

He was in the excavating business,
so he called me to come up and showed me the job.
And we dug house basements.
And that was when they were remodeling a lot filling stations,
making them super service and that sort of thing,
so I said, yeah, I’ll take it.
So I worked there about two years and a half.
And then we came back to Bloomington.
At that time, my brother-in-law–
he’s passed away–

% head goble-html.html

He was in the excavating business,
so he called me to come up and showed me the job.
And we dug house basements.
And that was when they were remodeling a lot filling stations,
making them super service and that sort of thing,
so I said, yeah, I’ll take it.

% head goble-tei.tei


He was in the excavating business,
so he called me to come up and showed me the job.

% head goble-pdf.pdf

4 0 obj
<< /Length 5 0 R /Filter /FlateDecode >>
p?k??(??\??DYzz?i?IP??”? ??+?xWv???q?+-q
[pR~”¢Yaml?|?0#’!D?P???????ֽ?p??g?5Qq#P?E????????7???c???͂???$??S?)/))?6:?????|7`??MO?@??&??f??]?`??pP<*???v ?ݏ?,_??i?I?(zi?N??}fڝ?`??h?5)??7?6Sf????c|?" ?l???????:0Tɭ?"Э?p'䧘??tn??&? QS?X????!.???,?_?WF?L8W()??? ??}'????F?????G????? ?Y,Ķ??c??? ?sB` ????Ih??/YfS ?3?Yٜ9??wr??F??JB?/ݜ??;?"?+Z(?e?ȁaU?=?????7??

Why Folklorists Should Care about TEI

*Part 1 in a series of posts about [TEI and folklore studies][TEIfolk].*

We live, we are (probably too often) told, in a connected world. The internet, we are assured, has brought or will bring us all closer together. But such notions as connection and closeness depend upon actual relationships developing, and for that to happen we must actually use those connections to communicate. These are obvious things to folklorists, and yet we have been slow to take advantage of such a robust infrastructure as the internet to communicate in more than the usual ways: the exchange of PDFs or the submission of Word documents to journals. These are fine starts, but as anyone who has nurtured an essay or volume to publication knows, a lot gets left out.

Perhaps the most important thing that gets left out is all the material that we collect and record but do not have room for in the slim space of pages. This material, however, was not only useful in the development of our own thinking, but it also has far wider potential uses: other folklorists could use it to teach or to develop their own research projects, or the people themselves could use it for education or introspection or even simply a sense of acknowledgement that they exist and have something to add to the larger archeological record of humankind.

How to format this record has remained a puzzle for folklorists, who have engaged in robust conversations over the possible categories of human expressivity, over the uses of such expression, and over how to transcode expression from one mode (e.g., spoken performance) to another mode (e.g., written). While the internet makes it possible to upload audio, video, and image files in addition to texts, it is not always the case that others can readily download such materials, and there remains the question of whether, having downloaded the materials, they are able to view and use them.

Matters having to do with audio, video, and image files we must leave to a longer, more comprehensive sorting out, but there exists today a format for capturing verbal materials in a written form that can encompass not only the words themselves, but the rich complexities of spoken discourse. Moreover, the format is also capable of embedding within a text a wide variety of analytical information (including, yes, type and motif numbers as well as the location, date, and nature of an event), such that folklorists can rest assured that users on the other end are receiving the fullest sense of the original that text can make possible.
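To give a sense of what embedded analytical information can look like to a program, here is a small Python sketch. It assumes one possible convention, marking motifs with `seg` elements that carry `type="motif"` and the motif number in `n`; that convention is my assumption for illustration, not an established practice, and the passage and motif numbers are placeholders.

```python
import xml.etree.ElementTree as ET

TEI = {"tei": "http://www.tei-c.org/ns/1.0"}

# An invented passage; the motif numbers are placeholders for illustration.
tei_text = """<text xmlns="http://www.tei-c.org/ns/1.0">
  <body>
    <p>They say <seg type="motif" n="N511">the money was buried under the oak</seg>,
    and <seg type="motif" n="N576">something guards it still</seg>.</p>
  </body>
</text>"""


def motifs(xml_string):
    """Return (motif number, marked passage) pairs from a TEI fragment."""
    root = ET.fromstring(xml_string)
    found = []
    for seg in root.iterfind(".//tei:seg[@type='motif']", TEI):
        # Collapse line breaks and runs of spaces inside the marked passage.
        found.append((seg.get("n"), " ".join(seg.text.split())))
    return found


pairs = motifs(tei_text)
for number, passage in pairs:
    print(number, "->", passage)
```

Once the markup is in place, pulling every motif out of a whole collection is the same few lines run over more files.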

[TEI][tei], as the format for the [Text Encoding Initiative][tei] has come to be called, has emerged as the foundation for any number of humanistic endeavors. It lies, for example, at the heart of the [Perseus Digital Library][], which is now the standard library for students of the Greco-Roman classics, amounting to 69 million words. Its collections of Arabic, Germanic, Renaissance, and nineteenth-century American materials are equally stunning not only in terms of amount, but also in terms of accessibility and usability: users are, in fact, encouraged to download materials and add their own annotations. The [Oxford Text Archive][] was, like the Perseus Digital Library, also a pioneer in the use of TEI, and its use of the format has meant that literary scholars and linguists are often using the same materials but for their own research agendas.

The current problem for humanistic research is that the texts available have largely been contributed by the disciplines of linguistics and literary studies, which means that the texts from which conclusions are being drawn are either sentences and utterances of a few to a few dozen words or texts of thousands upon thousands of words. The meaningful middle is missing. Folklorists of course specialize in this “middle” range of texts. From highly structured short texts like proverbs, to interactionally complex legends, to flexibly organized narratives like myths or tales, folklorists have long recorded, transcribed, annotated, analyzed, and shared such materials, reminding the larger scientific and scholastic community of the importance of such texts and the social worlds which they help to create.

It’s time then for folklorists to join the emergent social world of interactional scholarship, whereby our materials are widely available and accessible not only for fellow folklorists to appreciate and use but also for other scholars and scientists. In doing so, in establishing ourselves as the proverbial “middle men” we will continue to maintain the importance of folklore studies to the understanding of what it means to be human.

In the posts that follow in this series, which I am tagging as [TEIfolk][] so that one click will get you all the posts at once, I hope to air out some of the work I have been doing this summer, as I try to advance thinking about *things digital* in my disciplinary home.

*Please feel free to circulate this post, and those that follow, widely. I will gladly accept any, and all, feedback. I am going to make mistakes; I am going to leave obvious things out, revealing my ignorance.*

[Perseus Digital Library]:
[Oxford Text Archive]: