Labels

“One popular misconception [about machine learning] is that people think they have enough data when they don’t. When people say machine learning, a very large segment of predictions are based on existing data. And in order for that to work, you generally have to have a big labeled set of data,” says Hillary Green-Lerman of Codecademy.

Emphasis on labeled.

Later:

“People often don’t realize how much of machine learning is getting data into a format so that you can feed it into an algorithm. The algorithms are actually usually available pre-baked,” Hillary said. “In a lot of ways, you need to know how to pick the best linear regression for your data, but you don’t really need to know the intricacies of how it’s programmed. You do need to work the data into a format where each row is a data point, the kind of thing you’d want to pick.”
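
Hillary’s “each row is a data point” format can be sketched in a few lines of Python — the records here are entirely made up, just to show the reshaping:

```python
# Hypothetical raw records, one dict per observation
records = [
    {"sqft": 1400, "bedrooms": 3, "price": 235000},
    {"sqft": 1900, "bedrooms": 4, "price": 310000},
]

# Reshape into the form most algorithms expect:
# each row of X is one data point, y holds the matching labels/targets.
X = [[r["sqft"], r["bedrooms"]] for r in records]
y = [r["price"] for r in records]
```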

Difficulties with PIP

As I have noted before, the foundation for my work in Python is built on first installing the Xcode Command Line Tools, then installing MacPorts, and then installing (using MacPorts) Python and PIP. Everything I then install within my Python setup, which is pretty much everything else, is done using PIP, so when I kept getting the error below after finally acquiescing to macOS’s demands to upgrade to High Sierra, I was more than a little concerned:

ImportError: No module named 'packaging'

See below for the complete traceback.1

I tried installing setuptools using MacPorts, as well as uninstalling PIP. I eventually even uninstalled both Python and PIP and restarted my machine. No joy.

Joy came with this SO thread, which suggested I try:

wget https://bootstrap.pypa.io/get-pip.py
sudo python get-pip.py

Everything seems to be in working order now.

  1. For those interested, the complete traceback looked like this:

Traceback (most recent call last):
  File "/opt/local/bin/pip", line 6, in <module>
    from pkg_resources import load_entry_point
  File "/Users/john/Library/Python/3.4/lib/python/site-packages/pkg_resources/__init__.py", line 70, in <module>
    import packaging.version
ImportError: No module named 'packaging'
~ % sudo pip search jupyter
Traceback (most recent call last):
  File "/opt/local/bin/pip", line 6, in <module>
    from pkg_resources import load_entry_point
  File "/Users/john/Library/Python/3.4/lib/python/site-packages/pkg_resources/__init__.py", line 70, in <module>
    import packaging.version
ImportError: No module named 'packaging'
~ % sudo pip install setuptools
Traceback (most recent call last):
  File "/opt/local/bin/pip", line 6, in <module>
    from pkg_resources import load_entry_point
  File "/Users/john/Library/Python/3.4/lib/python/site-packages/pkg_resources/__init__.py", line 70, in <module>
    import packaging.version
ImportError: No module named 'packaging'

Classifiers

When considering a classifier, effectiveness can be considered in terms of accuracy as well as precision and recall. (Precision and recall partly mirror “sensitivity/specificity”: recall is the same measure as sensitivity, but precision is not the same as specificity — precision is the share of predicted positives that are truly positive, while specificity is the share of actual negatives correctly identified as negative.)
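
The distinction is easier to see in code. A minimal sketch with made-up labels — libraries like scikit-learn compute these for you, but the arithmetic is the point here:

```python
# Hypothetical true and predicted labels (1 = positive, 0 = negative)
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 1, 0, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

accuracy = (tp + tn) / len(pairs)  # how often the classifier is right overall
precision = tp / (tp + fp)         # of the predicted positives, how many are real
recall = tp / (tp + fn)            # of the real positives, how many were found (= sensitivity)
specificity = tn / (tn + fp)       # of the real negatives, how many were found
```

With these labels, precision (0.6) and recall (0.75) diverge: the classifier finds most of the positives but pads its guesses with false alarms.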

Bookends, at last

In September 2016, frustrated with data that had gone missing in a transition between versions of the reference manager I had been using and liked very much, I listed the following specifications for what I wanted in such an application:

  • drag and drop input with autocompletion of data fields with as few clicks as possible;
  • storage of documents in human-recognizable containers: articles, books, etc., with names that look like in-text citations: e.g., Glassie_1982.pdf;
  • ability to scan PDF for highlights and notes and to print those notes and highlights separately;
  • ability to indicate if physical copy is present — or if physical copy is only copy — and its location — the ability to check out a physical copy would be useful;
  • ability to handle epubs gracefully — being able to read and mark them up within the app would be nice.

I am relieved to note that Bookends has much of this. For most items with a DOI, it can fairly quickly grab all the needed metadata — there really is no reason that at this moment in time anyone needs to spend time filling in those fields themselves. (I should note that occasionally Bookends either confuses the order of authors’ names in BibTeX files or, perhaps, that information is recorded improperly in the BibTeX files themselves.)

While I do wish that Bookends would give me the option of replacing spaces with underscores automagically, when it offers to rename files it does so sensibly and in a human-readable form and in a location of my choosing.

Bookends’ tagging system remains opaque to me, but I’ve compensated by creating groups that do much of the work of tags. I’ll live with it.

What We Talk about When We Talk about Stories

Rejected for a special issue of the Journal of Cultural Analytics, but, still, I think, an interesting project and one I will continue to pursue. If anyone else is interested, this is part of a larger project I have in mind and I am open to there being a working group.

Current efforts to treat narrative computationally tend to focus on either the very small or the very large. Studies of small texts, some only indifferently narrative in nature, have been the focus for those interested in social media, networks, and natural language technologies, which are largely dominated by the fields of information and computer sciences. Studies of large texts, so large that they contain many kinds of modalities with narrative the dominant one, have largely been the purview of the field we now tend to call the digital humanities, dominated by the fields of literary studies, classics, and history.

The current work proposes to examine the texts that fall in the middle: larger than a few dozen words, but smaller than tens, or hundreds, of thousands of words. These are the texts that have historically been the purview of two fields that themselves line either side of the divide between the humanities and the human sciences, folklore studies and anthropology (respectively).

The paper profiles the knot of issues that keep these texts out of our scholarly-scientific systems. The most significant issue is the matter of “visibility”, of accessibility, of these texts as texts and thus also as data: largely oral by nature, most folk or traditional narratives (must) have been the product of a transcription process that cannot guarantee the same kind of textuality of a “born literary” text. (The borrowing of the notion of natality is somewhat purposeful here, since we often distinguish between texts that have been, sometimes laboriously, digitized and those that were “born digital.”) As scholarly fictions, if you will, they are largely embedded within the texts that treat them, only occasionally available in collections. With limited availability, and traditionally outside the realm of the fields that currently dominate the digital humanities, folk/traditional/oral narratives are not yet a part of the larger project to model narrative nor of efforts to consider the “shape of stories.”

This accessibility gap has overlooked both human and textual populations: most of the world’s verbal narratives are in fact oral in nature, millions upon millions of them produced every day by millions and millions of people, and those narratives tend to range in size from somewhere around a hundred words to, perhaps, a few thousand words in length. The result is that any current model or notion of shape has simply allowed the wrong “figures figure figures.” Put another way, there can be no shape of stories without these stories.

Populating the Popular

With the rise of Lore from an obscure podcast about odd moments in “history” to an Amazon production, there has been a concomitant rise in interest in the possibilities for expanding the scope of the engagement between folklore studies and some form of a “popular audience.” At least two folklorists I know have been contacted by production companies looking to be a part of this emergent interest.

Like its cousin, history, folklore studies has had a strange, and often estranged, relationship with popular media. Some of the popular contact has been initiated by folklorists themselves: e.g., Jan Harold Brunvand. Brunvand was a much beloved individual among the folklorists I know, which seems to be unlike how historians felt about, say, Stephen Ambrose — I know, I know, Ambrose had other issues (e.g., plagiarism). There’s also the recent discussion among historians about (yet another) Ken Burns film. (See Jonathan Zimmerman’s “What’s So Bad about Ken Burns?”)

Jeffrey Tolbert has written about this and even engaged in a dialogue with the creator of Lore. (For those interested, Tolbert has a personal essay in New Directions in Folklore: [here].)

[here]: https://scholarworks.iu.edu/journals/index.php/ndif/article/view/20037

Ignoring Unicode Decode Errors

Working this morning with a sample corpus of fraudulent emails — Rachael Tatman’s Fraudulent Email Corpus on Kaggle — I found myself unable to get past reading the file, thanks to decoding errors:

codec can't decode byte 0xc2

Oof. That byte 0xc2 has bitten me before — I think it may be a Windows thing, but I don’t remember right now, and, more importantly, I don’t care. Data loss is not important in this moment, so simply ignoring the error is my best course forward:

import codecs

fh = codecs.open("fraudulent_emails_small.txt", "r", encoding="utf-8", errors="ignore")

And done. Thanks, as usual, to a great StackOverflow thread.
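
The behavior is easy to reproduce without the corpus — the byte string below is made up, with a stray 0xc2 planted in it. (Worth noting, too, that in Python 3 the built-in open() accepts the same encoding and errors parameters, so the codecs module isn’t strictly necessary.)

```python
# 0xc2 is a valid UTF-8 lead byte, but only when followed by a
# continuation byte; a stray one triggers the decode error.
raw = b"urgent \xc2 business"
try:
    raw.decode("utf-8")
except UnicodeDecodeError as err:
    print(err)  # the same kind of error the corpus produced

# errors='ignore' silently drops the offending byte
print(raw.decode("utf-8", errors="ignore"))

# the built-in open() takes the same parameters:
# with open("fraudulent_emails_small.txt", encoding="utf-8", errors="ignore") as fh:
#     text = fh.read()
```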

BTW, thank you Rachael for making the dataset available!