Python and PDFs

Real Python has a tutorial on How to Work With a PDF in Python. I subscribe to Real Python because I find their tutorials well-written or, in the case of video tutorials, well-presented. The focus of this tutorial is the PyPDF2 module, which can get metadata from a PDF, rotate pages, merge or split a PDF, and/or encrypt it. While the tutorial mentions “extract information,” that does not mean PyPDF2 can get text from a PDF that does not already have a text layer embedded on its pages. (You could argue that the unintuitive nature of PDFs reveals their brokenness, but that’s for another time.) If you want to get text where there is no text layer, but you still want to use Python, it looks like you have to turn to PDFMiner, though a quick skim of its GitHub page doesn’t reveal whether it has OCR capabilities baked in. Sigh.


In an ideal setup, my workflow would have me writing in some version of plain text (a flavor of markdown in all probability) that could be quickly and easily output to a variety of formats and media. In most instances, that output gets printed, or at least paginated, which means it probably has to be instantiated, at least for a moment, as a PDF. (If I remember correctly, this is essentially how the macOS display and printing systems work.) What that would require is a collection of CSS files that transform the generated HTML into the various kinds of documents I regularly produce: essays, reports, letters, lectures, etc.

This is what the Marked app does, and does well; it is also functionality built into the Ulysses app, if I remember correctly. Neither of those apps, I believe, offers pagination, which is often critical to what I output. And so I have continued to search for my own solution in hopes of building it into a workflow. For the record, when I am working on long-form plain text, my editor of choice is FoldingText because it does a brilliant job of hiding the markdown unless you are working on that sentence and, as the name implies, it makes it possible to hide all but the section of the document on which you are working. It’s brilliant. (To be clear, I am a fan of all the apps mentioned here and of their developers.)

Getting from plain text via markdown or MultiMarkdown to HTML and then pairing that HTML with a page-media aware CSS file and then outputting to PDF is not as easy as it should be. The one app of which I have been aware up until recently was PrinceXML, which its creators have made free for non-commercial use, but with the imposition of a small watermark. That’s very generous, but it’s not quite what I want and I don’t have the kind of money to afford a desktop license.

And so it was a delightful surprise to discover that there are free software options to explore:

  • wkhtmltopdf is part of a set of “open source (LGPLv3) command line tools to render HTML into PDF and various image formats using the Qt WebKit rendering engine. These run entirely headless and do not require a display or display service.”
  • WeasyPrint is a “visual rendering engine for HTML and CSS that can export to PDF. … It is based on various libraries but not on a full rendering engine like Blink, Gecko or WebKit. The CSS layout engine is written in Python, designed for pagination, and meant to be easy to hack on.”
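
A minimal sketch of the HTML-plus-paged-CSS-to-PDF step might look like the following. The function names, the Letter page size, the one-inch margin, and the page-number footer are all placeholder assumptions for illustration; the only WeasyPrint call used is `HTML(string=...).write_pdf(...)`, and it is imported inside its function so that the HTML-building part runs without WeasyPrint installed.

```python
from html import escape


def build_paged_html(body_html, title="Untitled", page_size="Letter", margin="1in"):
    """Wrap an HTML fragment in a full document with a CSS @page rule.

    page_size and margin are illustrative defaults, not a recommendation.
    """
    # Paged-media CSS: page size, margins, and a page number in the footer.
    css = (
        f"@page {{ size: {page_size}; margin: {margin}; "
        "@bottom-center { content: counter(page); } }"
    )
    return (
        "<!DOCTYPE html>\n"
        f'<html><head><meta charset="utf-8"><title>{escape(title)}</title>\n'
        f"<style>{css}</style></head>\n"
        f"<body>{body_html}</body></html>\n"
    )


def write_pdf(html_text, target):
    """Render the HTML string to a PDF file at target (requires WeasyPrint)."""
    # Imported here so build_paged_html works even without WeasyPrint.
    from weasyprint import HTML

    HTML(string=html_text).write_pdf(target)
```

With WeasyPrint installed, `write_pdf(build_paged_html("<h1>Report</h1>"), "report.pdf")` would be the whole conversion; a separate stylesheet per document type (essay, letter, lecture) could replace the single inline rule.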

Next up … trying WeasyPrint and an update/report here.

Museum Anthropology Review in Transition(s)


Jason Jackson’s account of the rise and revision of Museum Anthropology Review may very well be as “inside baseball” as anything academic can get, but it is a detailed chronology of the events through which, and the reasons for which, he helped establish an open access journal that continues to thrive today. I recommend it to my students for its very clear articulation of the inner workings of scholarship: there are costs; there is labor.

AFS 2019 Abstracts

Seymour Chatman’s diagram of narrative

The short abstract (97 words):

With an understanding that no text is composed, or received, in a single “mode of discourse” (description, narration, exposition, etc.), this paper explores the nature of non-narrative elements found within folk narrative, pursuing a path first begun by literary critic Meir Sternberg and linguist Carlota Smith. While Sternberg and Smith used literary texts as the basis for their study, this paper draws, like the previous one, upon folk narratives collected by a number of folklorists, including myself, in order to see if there are consistent structures of discourse present and at what level those structures lie.

The long abstract (492 words):

Save a few exceptions, folklorists have largely approached folk narrative as given, with occasional considerations of non-narrative elements. Our close readings of texts tend to focus on the topical and not the formal, on the contextually meaningful and not the structurally significant. This paper is part of a larger project to understand the nature of the components that make up a folk narrative text in order to explore what structures might emerge, and which, if any, are general and which might be cultural. The project is founded on the work of literary critic Meir Sternberg and linguist Carlota Smith, each of whom pursued parallel paths in trying to discern modes within a given text. Starting in the late seventies and working through the nineties, Sternberg attempted to extend narratological considerations to include non-narrative moments and passages in texts. Pursuing similar research but apparently unaware of Sternberg, Smith developed the notion of “modes of discourse,” based on her own work on temporal aspect, in which she explored how languages encode time and how they encode the way events happen over time. Both Sternberg and Smith, however, draw upon literary sources for the exploration and application of their ideas and methods. What would a consideration of folkloric texts bring to the table, and what role would dialogue, long established as a central feature in oral text-making, play in a possible revision of any typology of discourse modes? This paper only briefly outlines Sternberg’s work, as well as referencing the work of Labov and Waletzky, which has had some role in folkloristic considerations of narrative (as outlined in a previous paper), in order to provide a backdrop for an application of Smith’s work to folkloristic considerations of text. In a previous paper I argued that folklore studies is as guilty as other domains in proclaiming anything narrative. In this paper, I explore other modes of discourse and then consider just how little the narrative mode has to be present for it to be received as narrative in its entirety. All examples are drawn either from my own fieldwork or from colleagues who have entrusted me with examples from their own work.

Labov, William, and Joshua Waletzky. 1967. Narrative Analysis: Oral Versions of Personal Experience. Proceedings of the 1966 Annual Spring Meeting of the American Ethnological Society, 12–44.

Smith, Carlota S. 2003. Modes of Discourse: The Local Structure of Texts (Cambridge Studies in Linguistics). Cambridge University Press.

Sternberg, Meir. 1981. Ordering the Unordered: Time, Space, and Descriptive Coherence. Yale French Studies (61, Towards a Theory of Description): 60–88.

———. 1982. Proteus in Quotation-Land: Mimesis and the Forms of Reported Discourse. Poetics Today 3 (2): 107–56.

———. 1990. Telling in Time (I): Chronology and Narrative Theory. Poetics Today 11 (4, Narratology Revisited II): 901–48.

———. 1992. Telling in Time (II): Chronology, Teleology, Narrativity. Poetics Today 13 (3): 463–541.

———. 2001. How Narrativity Makes a Difference. Narrative 9 (2, Contemporary Narratology): 115–22.

Plywood Graph Paper

I have been doing a bit of home renovation and construction of late, and some of it has involved breaking down sheets of plywood, which I now do using a piece of rigid foam lying on my garage floor and a Kreg Circular Saw Guide. As long as I am crawling around with a saw in my hand, I prefer to make as many cuts as I can. For that, I use a cut sheet, which I make using Incompetech’s incomparable Graph Paper Generator. The settings captured in the image below produce a graph of 4 x 8 squares broken into 6-inch and then one-inch cells:

Incompetech Graph Paper Generator Settings

Re-thinking the Business Case Study

Ever since I worked in executive education and was exposed to the role of the business case study in undergraduate, MBA, and executive education, I have been fascinated by its power to generate insight and blindness. I have followed the re-consideration of the case study, as well as of the MBA in general, over the past few years, but only from a distance. So it was great to come across Lila MacLellan’s review of work by Bridgman, Cummings and McLaughlin on “Restating the Case: How Revisiting the Development of the Case Method Can Help Us Think Differently About the Future of the Business School” (DOI: 10.5465/amle.2015.0291). They note:

years after installing the case method, Donham sincerely believed it was too indifferent to larger societal ills, too insensitive to the labor market, and thus to economic prosperity and equality among workers.

As it turns out, some of that re-consideration may have been prompted by Donham’s long-term friendship with Alfred North Whitehead. MacLellan concludes:

Part of the problem with decision-forcing exercises alone is that they ask students to work within the existing system, without examining its failures. Bridgman’s paper suggests that business professors could use cases to look at how managers think, rather than to teach students how to think like a manager.

There’s apparently also a YouTube animation.

Mystery Ranch 3-Way Briefcase

I got one of those automated prompts from a vendor for a product that I had purchased. I decided to write a review. By the time I was done writing the review, I thought it was substantive enough to post here, because if you are searching for a bag and about to spend a bit more than you are comfortable doing, you do a lot of web searching first, in hopes that someone will describe the bag sufficiently that you get a better sense of whether it’s worth your time and hard-earned money to buy the thing.

I bought an Eagle Creek Convert-a-brief bag in the late 90s, and I used the heck out of that thing, only eventually setting it aside for other bags that offered more functionality in various ways and leaning more and more towards backpacks. I bought the MR 3-way briefcase because I recently found myself wanting a bag that looked more like a briefcase and less like a backpack, and the old Eagle Creek just doesn’t have much left in it. There’s a lot to recommend the MR 3W: it’s compact in height and width and offers a lot of organization. The backpack straps stow neatly. And it looks good.

Would I buy it again? Yes. Do I wish I could change a few things? Yes. Starting from the front: the magnetic buckle seems cool, but then you realize that it often, if not almost always, latches when the front flap falls down, leading a casual inspection to suggest that the bag is closed when it isn’t: things fall out. Given that the things in the front pocket are sometimes really expensive things to lose, like wallets and phones, another zippered pocket would be useful here.

In the middle of the bag, I wish the padding/insert between the laptop and middle pocket were removed: the rest of the bag provides padding and structure enough to protect electronics in the back pocket and the padding/insert only adds undesired weight and thickness — and the bag would only be better if lighter.

Finally, while the padded backpack straps are easy to stow and remove, the lower halves of the straps are not, making it difficult, and time-consuming, to switch from backpack mode to briefcase mode. If the bottom of the pack, when the pack is in backpack mode, were a velcro strip, this problem could be easily solved without otherwise ruining the lines of the 3W.

Buying through Sierra was simple and fast. The bag itself mostly delights, but some of the wonkiness described above does detract from that delight.


As part of our hand editing of the TED talk data we had to retrieve missing information for, luckily, a small subset of the speakers. This meant Kinnaird splitting off two CSVs, one for the TED main event speakers and one for the other TED-sponsored event speakers, and then me trudging row by row and cell by cell, working back and forth between the CSV and a web page. Copied and pasted, and two CSVs filled in. Yes.

Then it was time to fold these filled-in rows back into the main CSVs from whence they came. Each smaller CSV had between 15 and 20 rows, so it didn’t seem like a task worthy of firing up a Python session and writing something in pandas to replace the rows with missing information with the filled-in rows.

I started doing the work by hand: copy a row from the missing.csv, paste it below the matching row in the speakers.csv, and then delete the matched row. Oi! Sure it was only 17 rows, but, still, there has to be a somewhat faster way!

So I decided to merge the two files using cat and then simply find the dupes in Easy CSV Editor and delete the row with missing data. Semi-automated?
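
For comparison, a scripted version of the same replacement could be sketched with nothing but the standard library's csv module. The key column name (`speaker`) in the usage note is a hypothetical stand-in, not the actual column in our files, and the sketch assumes both CSVs share the same header row.

```python
import csv


def replace_rows(main_rows, filled_rows, key):
    """Return main_rows with each row swapped for its filled-in match, if any.

    Rows are dicts; key is the column used to match rows between the two files.
    """
    filled_by_key = {row[key]: row for row in filled_rows}
    # Keep the original order; substitute the filled-in row where one exists.
    return [filled_by_key.get(row[key], row) for row in main_rows]


def merge_files(main_path, filled_path, out_path, key):
    """Read two CSVs with identical headers and write the merged result."""
    with open(main_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        main_rows = list(reader)
    with open(filled_path, newline="", encoding="utf-8") as f:
        filled_rows = list(csv.DictReader(f))

    merged = replace_rows(main_rows, filled_rows, key)

    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(merged)
```

Something like `merge_files("speakers.csv", "missing.csv", "speakers_merged.csv", key="speaker")` would then do the job, writing to a new file so that the original stays untouched if the key column turns out not to be unique.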

Found note

Found with the date 26 July 2016:

Books/master notes for a class on “Folklore and Psychology” as well as, looking backwards, perhaps the same thing for Louisiana Folklore. The idea being that the book would also be interactive with questions and guided experiences as well as case studies.

Splitting Wood

As I continue to observe the maelstrom of negativity and falsehoods that is Facebook, I still want to make notes about things that happen. And I want to be able to share those notes. And then I remember that I have this blog, which is what web logs, or blogs, were supposed to be before they turned into self-publishing platforms and the key to modern success.

I am not yet decided on how much I want to reclaim this particular domain, which is my own name (jl.o), or some other space where I don’t feel responsible for hosting certain pages which have become mainstays, seemingly, on the web. On the one hand, this was once my “everything that doesn’t have any other place to go goes here” space. And a chunk of that stuff was about my daughter when she was young, but then the internet got creepy and I shifted away from talking about her in what I now understood was probably an all too public forum. At the same time, as blogs “came of age” and became vehicles for the blossoming of personalities, some of whom became celebrities (e.g., John Gruber or Merlin Mann), I became increasingly concerned about “managing my brand.” I came to think of this blog as a space for me to demonstrate my professional abilities and to discuss professional interests.

And then I started tracking my experiments with computational matters and suddenly this thing got popular. Other people wanting to experiment with Python and/or with thinking about texts as data were searching for things and they found a post or two of mine that were helpful and they must have told people about them because suddenly this thing had something of a readership. It freaked me out so much that I froze like the proverbial deer in headlights and stopped publishing.

And now those pages that people found useful then are still being found useful, but I haven’t tracked my voyage, and discoveries, since then, and now it feels all weird to come back to this, especially since I have Evernote for web capture and Bear for everything else, including capturing all those stray thoughts that shoot through my head like neutrinos making their way across the solar system. But both of those applications somewhat obscure where your data is — in order, I think, to make sure you don’t mess with it outside the app and possibly corrupt the sync process.

There is, I think, something remarkably reassuring about writing all my notes in plain text, structured with some version of markdown, and storing them in plain files or in a widely known store like a SQL database. An ideal setup, to my mind, would be something like FoldingText as the UI and MySQL on the backend, with a blog as an easy offshoot: one simply tags, or otherwise indicates, which posts are public. It would have to be a choice each and every time.

Part of all this is, I admit, not only a response to the way matters are developing on Facebook but also my own preference not to give over my data to someone else so that they can then monetize it. That is, by using Facebook to stay in touch with family and friends, instead of other means, I’m allowing the company to profit from my relationships. That was acceptable, to some degree, when it was a happier place, but now I find that the dark side has emerged, and it has me not only walking away from the platform, but also considering walking away from some relationships.

So, this is not only about taking a break from at least one form of social media, but also about re-focusing my own energies and making my writing my own and finding positive places in which to publish it.

I did this thinking, by the way, while splitting wood, using a maul and wedge given to me by my stepfather and an old hatchet I had lying around the house. There’s no better time to focus than when trying to follow the grain of a log, especially when you find you’ve driven a wedge into an unsplittable natural joint in the wood:

[Photo: IMG 0612]

The wispy shadows of hair in the lower left are my child, still finding her way into this blog, who took this photo for me as I stood nearby, somewhat hunched over and breathing hard …

… I guess I need to split more wood.

Why I Hate Moodle

Or at least my university’s implementation of it.

Let me begin with two assertions about what I see as the strengths of the web, so that people who see things another way need not bother themselves with either reading further or arguing with me.

The first thing I want to note about the web is something I, and thousands of others, have observed in countless other ways and places and that is that the web is the platform without parallel for the delivery of content. Let me emphasize content, which I do over and against the delivery of an experience. The content itself may involve the user (or viewer or reader or listener) in some kind of experience, but the web itself is less about the delivery of experiences.

The second thing I want to observe about the web reveals my age: the web is at its best when it is semantic, when the way content is structured is part and parcel of its meaning. And I mean semantic in a deep sort of way, with UX/UI at the surface but reaching all the way down to <tag>s.

So, let me walk you through the way Moodle is set up at my university and you can begin to understand why I think it’s anathema to the promise of the web. And we can begin with the way I begin, which is to click on a link to a course that I am teaching in order to manage some aspect of it:

[Screenshot: 2019-01-31, 5:11:49 PM]

There are two things that I find difficult to accept with this: first, the content, the actual content of the course, is squashed between a whole lot of navigation and other matters that amount to little more than unnecessary cognitive overhead. Sure, I could customize the interface to get rid of all the extraneous blocks, but I use the default setup because it’s what I see most, if not all, of my students using and their experience of the course is my concern. If I design things based on my tweaked-out setup, and those things do not look the same for them, then I have failed them.

The second thing seems obvious to me: I’m a teacher. I’m coming to Moodle to do things, but in order to do things I have to click a button. And I can’t tell you the number of times I have scrolled down the page to start something only to realize I have to scroll back up, click on the edit button, and, then, scroll back down to the section I want to edit and click on the Add an activity or resource button.

And speaking of too much scrolling and clicking, when you do click on the Add button, you are greeted with the following pop-up:

[Screenshot: 2019-01-31, 5:12:22 PM]

Congratulations if you want to add one of a dozen of Moodle’s “activities” designed, one supposes, to “enhance the educational experience,” because what undergraduate doesn’t want to use Hangman, or a Hidden Picture!, to learn about speciation or topic modeling? So more scrolling in order to get to ways to add actual content: URLs, pages, files, etc.

Perhaps the most fundamental, the most basic, form of content is a web page. Setting aside that this web page is a column squished between a whole lot of other material, if you attempt to paste it into a text box, your formatting options look like this:

[Screenshot: 2019-01-31, 5:28:16 PM]

Forget meaningful things like H1 headings or passages of code, because you aren’t getting them here. For a while, if you dug deep enough into Moodle’s bowels you could enable a Markdown filter, so that you could write and maintain pages as semantic plain text, but they have moved that switch around so much that it’s clear they don’t want you to write structured prose, just roll back to the 1980s and WordPerfect for DOS and stick to one-off formatting of text.

Moodle is ugly, takes too many clicks to do anything meaningful, and it undoes everything that was once semantic about the web. Which is kind of like Facebook, which I guess makes sense.

David Rumsey Map Collection

The David Rumsey Map Collection is a pretty impressive accomplishment. According to the site, the collection “contains more than 150,000 maps. The collection focuses on rare 16th through 21st century maps of North and South America, as well as maps of the World, Asia, Africa, Europe, and Oceania. The collection includes atlases, wall maps, globes, school geographies, pocket maps, books of exploration, maritime charts, and a variety of cartographic materials including pocket, wall, children’s, and manuscript maps. Items range in date from about 1550 to the present.”

Rooth 1980: “Pattern Recognition, Data Reduction, Catchwords and Semantic Problems”

If, like me, you are committed to finding prescient work in the realm of computational approaches to the humanities, it means you are often tracking down somewhat difficult to find volumes and quickly photocopying an article or two while you still have the volume in your hands. Anna Birgitta Rooth’s “Pattern Recognition, Data Reduction, Catchwords and Semantic Problems” is one such article, and the PDF I am making available has been OCRed.