Compare Lists in Python

If you search for how to compare two lists in Python, you will find a lot of helpful pages in a lot of places, many of which assume you are working with numbers or you want exact matches. But what if you want to compare all the items in one list with all the items in another list and you want to be able to set some arbitrary measure of similarity or difference?

The problem arose for me recently when I was trying to compare two lists of different lengths. The two lists represented keyword sets derived from a corpus using NMF, which I had run with two different component values. As part of wanting to discover a probable “best fit” I wanted to compare which strings had remained the same and which had changed to some degree.

My first impulse was to try the Jaccard coefficient, and I used some simple code to make that work:

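def jaccard_similarity(query, document):
    # Compare the two collections as sets of unique items
    query_set, document_set = set(query), set(document)
    intersection = query_set & document_set
    union = query_set | document_set
    # Two empty collections have no union, so report no similarity
    return len(intersection) / len(union) if union else 0.0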

I then embedded that bit of code, but it could be any code you wanted, in the following:

for jk, jv in enumerate(second_list):
    for ik, iv in enumerate(first_list): 
        do_this

The logic is pretty simple, but it is a leap, at least for me, in terms of how I think about things. When I started work on this, I kept trying to pack everything in one for loop: after all, I wanted to compare one list to another. But I wanted to compare all of one list with all of another list, which means I needed to iterate through both lists. A simpler version of this would be:

for j in second_list:
    for i in first_list:
        do_this

The addition of enumerate above was so that I could keep track of which string in each list was matching without necessarily having to see the string itself: I could use the index values that enumerate produces to call those, if I needed. enumerate is one of those functions I regularly forget, and it is very convenient: essentially it takes a list of items and yields a series of tuples where the first value is the item’s index and the second value is the item itself, so list(enumerate(['a'])) gives [(0, 'a')]. You can call the parts of the tuple by any variable name you like, but I tend to stick with k and v, for key and value, because … well, because. (It could easily be anything else, and I’ve even written code that called three-item tuples with the rather bland, and thus also not advisable, t, u, v. Do not do this.)
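A quick way to see what enumerate hands back is to wrap it in list(). This is only a small sketch, and the sample strings are made up for illustration; they are not the keyword sets discussed here.

```python
# Hypothetical sample data, not the keyword sets from the post
keywords = ["alpha beta", "gamma delta"]

# enumerate yields (index, item) pairs lazily; list() makes them visible
pairs = list(enumerate(keywords))
print(pairs)

# Unpack each pair into a key (index) and a value (item) while looping
for k, v in enumerate(keywords):
    print(k, v)
```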

So essentially both the for loops above are transforming each of the lists involved into a list of tuples and then walking through the list, comparing the items themselves but reporting only their indices.

It doesn’t really matter which list is which, so far as I can tell, so long as you keep the variables correctly aligned. My final code block looked like this:


print("Jc = Jaccard coefficient")
print("========================")
for jk, jv in enumerate(topics_45):
    for ik, iv in enumerate(topics_35):
        # Score the two keyword strings as sets of words, once per pair
        jc = jaccard_similarity(iv.split(" "), jv.split(" "))
        if jc > 0.5:
            print(f"35-{ik} and 45-{jk} have a Jc of {jc:.2f}.")
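For anyone who wants to try the all-against-all comparison without an NMF run to hand, here is a self-contained sketch. The function name similar_pairs, the threshold parameter, and the two short lists are all invented for illustration, and itertools.product is simply another way of writing the two nested loops. It also uses split() rather than split(" "), which is a small change of my own so that repeated spaces do not produce empty strings.

```python
from itertools import product

def jaccard_similarity(query, document):
    """Jaccard coefficient of two collections, treated as sets."""
    query_set, document_set = set(query), set(document)
    union = query_set | document_set
    return len(query_set & document_set) / len(union) if union else 0.0

def similar_pairs(first_list, second_list, threshold=0.5):
    """Return (i, j, score) for every cross-list pair scoring above threshold."""
    matches = []
    # product pairs every indexed item of one list with every indexed item of the other
    for (i, first), (j, second) in product(enumerate(first_list), enumerate(second_list)):
        score = jaccard_similarity(first.split(), second.split())
        if score > threshold:
            matches.append((i, j, score))
    return matches

# Made-up keyword strings standing in for the NMF topics
topics_a = ["data model topic", "folk tale legend"]
topics_b = ["data model corpus topic", "river boat song", "folk legend tale"]

for i, j, score in similar_pairs(topics_a, topics_b):
    print(f"a-{i} and b-{j} have a Jc of {score:.2f}.")
```

Returning the matches as a list, rather than printing inside the loop, leaves the threshold easy to change and the scores available for whatever comes next.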

My next step is to determine how to transform this into a network or tree so that I can see which keyword clusters continue (relatively) unchanged, and also where clusters split or, in a few cases, disappear/die. I still need to decide where to set the threshold for “relatively,” and I may end up using something other than the Jaccard coefficient, which doesn’t seem terribly discriminating.

These Books

At least two newsletters arrived in my inbox this week using this stock photo of books. I’ve seen the image used elsewhere, but seeing it twice on the same day made me wonder “Whose books are these?” @ me on Twitter if you know.

The Power of a bash Script

Every time I run it, I am delighted by how much work the bash script for the COVID dashboard does.

~ % sh ./covid.sh
remote: Enumerating objects: 35, done.
remote: Counting objects: 100% (35/35), done.
remote: Compressing objects: 100% (27/27), done.
remote: Total 29 (delta 18), reused 5 (delta 2), pack-reused 0
Unpacking objects: 100% (29/29), done.
From https://github.com/nytimes/covid-19-data
   3ad1afa..f06d614  master     -> origin/master
Updating 67b320c..f06d614
Fast-forward
 README.md            |     2 +-
 live/us-counties.csv |  6395 ++++++++++++------------
 live/us-states.csv   |   110 +-
 live/us.csv          |     2 +-
 us-counties.csv      | 12803 ++++++++++++++++++++++++++++++++++++++++++++++++-
 us-states.csv        |   226 +-
 us.csv               |     6 +-
 7 files changed, 16287 insertions(+), 3257 deletions(-)
INFO    -  Cleaning site directory 
INFO    -  Building documentation to directory: /Users/johnlaudun/Developer/COVID-Acadiana/site 
INFO    -  Documentation built in 0.10 seconds 
~ %

I will admit that the dashboard is still primitive, but the idea of it was what was important at the time, and so many dashboards have popped up since then. I mostly keep running the script for a sense of the historical depth it provides.

Quick Labels with Python’s f-string

Sometimes I need a list of titles or labels for a project on which I am working. E.g., I am working with a toy dataset and I’ve created a 10 x 10 array and I want to give the rows and columns headers so I can try slicing and dicing. I prefer human-readable/thinkable names for headers, loc over iloc in pandas-speak. And this one-liner works a treat, as they say:

labels = [f'label{item}' for item in range(1, 11)]

Done. Place it into your dataframe creation (as below) and you are good to go.

df = pd.DataFrame(data=scores, index=names, columns=labels)
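The same pattern covers both axes. In this small sketch the prefixes row and col are hypothetical, and the zero-padding in the second line is an optional extra that keeps labels in order when they are sorted as text.

```python
# Hypothetical prefixes; any strings would do
row_labels = [f"row{item}" for item in range(1, 11)]
col_labels = [f"col{item:02d}" for item in range(1, 11)]  # zero-padded: col01 ... col10

print(row_labels[:3])
print(col_labels[-2:])
```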

A COVID Dashboard for Acadiana

At some point in May (2020), it became clear that one of the things we were facing both nationally and locally was a lack of clear information about the status of COVID — and there were far too many outlets and venues happy, as always, to pounce upon both genuine confusion as well as incipient paranoia. As a folklorist, I am of course interested in the legendry that has sprung up but as a resident of my community I am equally concerned that people don’t have easy access to information about the local scene.

When I came across Bee Guan Teo’s “Has Europe Past the First Peak of COVID-19 Outbreak?” on Towards Data Science (link), I decided to start work on what I imagined as a dashboard to let people keep abreast of the situation here in south Louisiana: COVID-19 in Acadiana was the result.

While it would seem obvious to host the page as part of this WordPress installation, my desire to have the information update daily and to do so in as automated, and thus less prone to human-induced error, a fashion as possible made it more likely that I would develop a dedicated site for the purpose. (And, let’s be clear, the role played by my own limitations with hacking either WordPress or PHP.)

The current version of COVID-19 in Acadiana is in fact built with MkDocs, a Python library that makes it easy to create a static website using Markdown. As the name suggests, it is built with documentation in mind, and so it really isn’t made to support a blog or something like that. (One day I will explore those possibilities.)

COVID-19 in Acadiana is essentially a bash script with the following components:

(1) Update the data from the NYT repo:

cd /Users/johnlaudun/Developer/covid-19-data
git pull

(2) Update the graph of cases and the table of deaths:

cd /Users/johnlaudun/Developer/COVID-Acadiana
python covid.py

(3) Build the site with the new markdown, html, and image(s):

mkdocs build

(4) Deploy the site/ directory to the web server:

cd ~
rsync -r ./Developer/COVID-Acadiana/site/ \
user@path/to/public_html/covid

It’s nothing fancy, but it works and it’s a start. My goal is to increase the information density of the page whenever I have the chance.

UPDATE (July 22): I have collected a couple of notes about creating COVID dashboards and I am pasting them here for anyone interested in setting up their own (and I may very well re-write mine).

Flattening a List in Python

There has to be a more elegant, and pythonic, way to do this, but none of my experiments with nested list comprehensions or with the chain function from itertools worked.

What I started with is a function that creates a list of sentences, each of which is a list of words from a text (string):

import nltk

def sentience(the_string):
    # Split the text into sentences, then each sentence into lowercased word tokens
    sentences = [
        [word.lower() for word in nltk.word_tokenize(sentence)]
        for sentence in nltk.sent_tokenize(the_string)
    ]
    return sentences

But in the current moment, I didn’t need all of a text, but only two sentences to examine with the NLTK’s part-of-speech tagger. nltk.pos_tag(text), however, only accepts a flat list of words. So I needed to flatten my lists of lists into one list, and I only needed, in this case, the first two sentences:

test = []
for sentence in text2[0:2]:  # the main list, first two sentences only
    for word in sentence:  # the sublists
        test.append(word)

I’d still like to make this a single line of code, a nested list comprehension, but, for now, this works.
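For what it is worth, both of the approaches mentioned at the top can be made to work. A minimal sketch, assuming text2 is a list of token lists like the one the tokenizing function returns (the sample data here is invented):

```python
from itertools import chain

# Invented stand-in for the output of the tokenizing function above
text2 = [
    ["the", "cat", "sat", "."],
    ["it", "purred", "."],
    ["then", "it", "slept", "."],
]

# Nested list comprehension: the outer loop is written first, then the inner loop
flat_comprehension = [word for sentence in text2[:2] for word in sentence]

# itertools.chain.from_iterable joins the sublists end to end
flat_chain = list(chain.from_iterable(text2[:2]))

print(flat_comprehension)
```

The usual stumbling block with the comprehension is the order of the for clauses: they read left to right in the same order as the nested loops, which is the reverse of what many people expect.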

Strengths in the Humanities

Jason Jackson is one of those people I wish I could be around more: he is principled, thoughtful, and acts for the long-term. So when he casually tags something on social media, I’ll almost always have a look. Most recently, he linked to an article by Helene Meyers in Inside Higher Education on How small liberal arts colleges can best weather the pandemic, noting that humanities scholars might take a few tips from Meyers.

The entire article is worth a read, but for the purposes of re-thinking my own courses for the fall, and just generally re-thinking how I teach, I want to focus on the following things that Meyers highlights as strengths of liberal arts colleges:

  • low faculty/student ratios and small classes “allow meaningful mentoring relationships with faculty members as well as peer education. What if a British-style tutorial were part of every first-year student’s experience? Among smaller groups, meetings powered by Zoom can foster intellectual community, while online discussion forums can require students to respond to one another’s writing.”
  • intensive research seminars “where faculty-guided independent work is supplemented with a cohort of peers who can help vet one another’s projects and learn to ask (and answer) critical questions about both the research process and its products should be provided for upper-class students.”
  • study pandemic-related topics “to [help students] process the experiences of this moment” keeping in mind that some students “might need to lose themselves in a passion that seems distant from the horrors of the present.”
  • integrate career coaching throughout the curriculum because “the next few graduating classes will be entering a brutal job market, and we owe our students careful instruction in the development and transferability of marketable skills.”

I see all these things as possible and even within my reach — so long as I am willing to stretch — with career coaching being the weakest point for me. Here, I will have to do more research and, I think, I will also have to consider ways to highlight portable skills/methods/ideas. (I know, I know: it’s the commodification of knowledge and education, but nothing says that making things complex or emphasizing, and perhaps teaching, that all syntheses are dynamic and ever-changing can’t be built into any particular course program or disciplinary curriculum.)

This post is part of a series in which I design a new course, ENGL 334: Digital Folklore and Culture, in the open. I do so for myself, for my colleagues, and for my students. The posts are all collected under the tag open course design.

rsync without a Password

In order to set up rsync to work without a password, you first need to make sure that you can do so with a password:

rsync /local/path username@/remote/path

If successful, then generate a public/private key pair, but be sure to leave the passphrase empty:

$ ssh-keygen
Enter passphrase (empty for no passphrase):
Enter same passphrase again:

Then copy the public key to the remote host — note that ssh-copy-id will copy the file to the correct location for you:

ssh-copy-id -i ~/.ssh/id_rsa.pub username@/remote

Make sure that you can ssh without a password:

ssh jlaudun@/remote

Now try rsync adding the argument -e ssh to specify the remote shell to use:

rsync -avz -e ssh /local/path username@/remote/path

Who is this course for?

This post is one of several in which I am designing a new course, ENGL 334: Digital Folklore and Culture, that I will also be teaching in a new context, remotely, and doing so completely in the open. Other posts are tagged open course design.

The Udemy How to Set Your Course Goals course begins with a consideration of who is the target student, with the understanding that courses that attempt to reach too broad of an audience end up reaching no one. Beginners feel overwhelmed and experienced individuals feel under-served. Target an audience.

After brainstorming on paper for a bit, I came up with, I think, a basic list:

This course assumes that participants:

  • while fully enrobed in cultural, and folkloric, dynamics do not necessarily understand those dynamics,
  • but are interested in, and committed to, that understanding;
  • have a working familiarity with the research process (the development of an hypothesis, the collection of data, the testing of ideas against the hypothesis, and the eventual development of a syn/thesis) and the need for clear communication of results;
  • are willing to apply ideas and methods learned in this course (and elsewhere in the university) to materials that seem ephemeral, trivial, trolling, ass-holish (racist, sexist, classist, etc.).§

§ This course also assumes participants can handle language and/or cultural artifacts that are intentionally or unintentionally provocative/offensive in nature. Indeed, this course assumes participants want to understand why people say/do such things.

Why did I switch to Udemy? May was both busy and not, but the month slipped by and I lost access to the edX 101 course on designing courses for edX. (The edX model is that you can audit, take for free, a course for a limited time, but if you want access to it for more than a month or if you want it to count towards a curriculum, then you have to pay for it. The “if you want credit” model worked for me, but “if you want access for more than a month” appears not to work for me.) The upshot is that I have switched to the Udemy course, which also means I have switched to a platform that is open to hosting courses by individuals: both edX and Coursera offer courses through affiliated institutions and organizations. I don’t know that what I do will end up on Udemy, but I can certainly take advantage of their “market aware” approach to sharpen my thinking about the course.

Sam Castleman, PhD (2020)

Sam Castleman arrived at UL-Lafayette from Western Kentucky University, which meant her foundation in folklore studies was already both deep and wide. There was not much more that we could add, and yet Sam never hesitated to re-read materials or to read intractably theoretical articles. That is who Sam is: she has an incredible drive, and the discipline to go with it, to do anything she wants. My job as her faculty member was to get out of her way. Okay, I will confess to occasionally nudging her in this direction or another, but it never took more than a nudge.

In an effort to maximize her time in graduate school — I don’t think the word minimize appears anywhere in her brain — Sam took on a number of duties, many of which, when she discussed them with me, I shook my head in response. Every time she proved my head shake wrong, somehow able to throw herself into organizational affairs and continue to excel at her studies.

When it came time to write her dissertation, Sam wrote it in 6 months. One day we were discussing possible topics and sequences, the next she was emailing me her first chapter. And the next she was presenting a part of another chapter to great interest at the annual meeting of the American Folklore Society. The chapters continued to pile into my inbox, and as soon as I had marked one and returned it, she had revised it and piled it back into my inbox. All the while Sam kept up all the other facets of her life, both personal and professional.

To be sure, there was the occasional anxious moment, a moment of doubt, and for those, to be honest, I was glad. I finally had something to do besides always be behind on returning the latest draft of a chapter to her! Nothing, however, defeats Sam for long, and I am incredibly excited to see what comes next for her. I’m not sure who learned more through all this, her or me — to be honest, it was probably me. I am humbled by that, and grateful to Sam for her continuing just to be her indomitable self.

Gina Warren (PhD, 2020)

The first time I met Gina was at The Steep House. She had accompanied one of our folklore students, Jessica Doble, and the three of us were there to talk about possible computational approaches to texts. As we were talking, Gina pulled up something on her computer, and then swiveled it to show me a screen filled with an Excel spreadsheet she had constructed that captured various moments in novels she was analyzing. She had used a spreadsheet because it was the only tool she had, and she knew she needed something more than notecards. That approach to doing things is emblematic of Gina herself: there is a kind of “isn’t it obvious that that is what needed to be done” to her, that permeates her being and makes her the scholar and writer she is.

It is with that isn’t-it-obvious approach that she raised chickens, crickets, pigs, worms, and I-don’t-want-to-know-what-else. If you are going to write a book about the backyard chicken revolution, then you should participate in it. You should be able to feel it in your bones. And if you are lucky enough to have read an advance copy of her book, then you have felt it in your bones through her words: the tenderness of wiping chick bottoms that you will, one day, kill for meat. Gina is not simply going to observe: she is going to experience. And in the doing, there will come that kind of knowing that makes her prose ring true. And, in a moment where writers worry about craft and public figures only care about telling people what they want to hear, we need someone like Gina for whom content matters, and experience matters, and science matters, and people matter, and animals matter. There is nothing that does not matter in Gina’s prose and in her world.

The only thing I could hope to do as her dissertation director was to continue to create a space within which she could continue to be the writer and scholar she herself was already committed to becoming. That was my only job, and she made it incredibly easy, and I think I speak for her entire committee when I say it was an honor to have been a part of the process in which she continues to become who she intends to be.

Designing a Course on Digital Folklore & Culture

This is the first post in a series entitled open course design in which all the design work of a new course, ENGL 334: Digital Folklore and Culture, is made public.

In Fall 2020, UL-Lafayette is going to offer for the first time a course on Digital Folklore and Culture. I will be teaching it alongside the American Folklore course, which I have for the past few years taught as “America in Legend Online and Off,” but which I have lately adapted to “collect some data and understand it.” There is, I think, a possible sequence to be had with the two courses: with the first one focusing on participants encountering a variety of vernacular forms and, perhaps, examining them as individual artifacts, and the second course then taking on more features of a course in culture analytics, with participants encouraged to curate a small collection, perhaps even imaginable as a corpus, and then making some forays into analysis “at scale.”

It would be nice to have them as a sequence, since that would mean that the introductions — to folklore and to folklore studies — could be safely housed in the lower-level course, allowing the upper-level course to move more quickly. Given our curriculum and the way our students encounter it, that isn’t going to happen any time soon, and so if I want to try this out, I will need to discover a path that allows people to enter in at the 400-level course and not feel like they are lost.

Some part of this could be satisfied by having a module introducing folklore studies, with a focus on digital folklore forms, available. I have begun the edX 101 course as a way to help me think through how I might structure and script such a module: they are very fond of the lecture-exercise model that delivers content in short bursts that are immediately reinforced. I’m also taking Microsoft’s DAT256x: Essential Math for Machine Learning on edX, and I like that the lectures only start with a talking head but then move to a series of slides. (And I note that the slides don’t have to be great to work.)

I don’t know if I need to think through the Digital Folklore and Culture course before thinking about the introductory module, but edX has the following questions as the first project activity:

  • What are the ultimate aims of this course?
  • What do we want learners to know after taking this course? What should they be able to do?
  • How does this influence (a) what is taught, (b) how it’s taught, and (c) how students are assessed and graded?

What are the ultimate aims of this course? Ultimately, I want participants to have a folkloristic lens as one way to look at the world. All of us will have a variety of responses to various things others say and do, and we can examine both their actions and speech for veracity (myth busting in some places or calling bullshit in others), but I would also like participants in any course I teach to be able to ask “Why does this person think they are saying this or doing this? What is their understanding of this situation?” I don’t need, nor want, participants to excuse inexcusable behavior or beliefs, but the only way I think we have of changing behaviors and beliefs is to understand what underlies them.

What should learners be able to do after taking this course? Participants should be able to identify a vernacular artifact and to begin to sketch out its possible traditional, or perhaps simply cultural, dimensions.

How does this influence the course’s design? This is the hardest question. And it needs to be answered in parts:

One of the things I have consistently done in recent courses is to turn away from textbooks and books and towards articles drawn from scholarly databases, with the hope of establishing in the minds of participants what scholarship at least looks like if not the beginning of an ability to understand how it works and how they might interact with it. What I haven’t done is discover ways to assess how well they are mastering the scholarly/scientific paradigm, bar certain parameters of the final paper. There needs to be more, smaller, assignments: a single annotated bibliographic entry, for example.

But this does not address the central topic of Digital Folklore and Culture as outlined in the previous two answers: identify vernacular artifacts and explore their traditional dimensions. This should also be a series of discrete exercises that can be assessed early, often, and incrementally.

Fictional Text Analytics

There’s a great moment in John Scalzi’s Redshirts where statistical analysis is mentioned, and it comes down to comparing texts:


“So what you’re saying is all this is impossible,” Dahl said.

Jenkins shook his head. “Nothing’s impossible,” he said. “But some things are pretty damned unlikely. This is one of them.”

“How unlikely?” Dahl asked.

“In all my research there’s only one spaceship I’ve found that has even remotely the same sort of statistical patterns for away missions,” Jenkins said. He rummaged through the graphic elements again, and then threw one onto the screen. They all stared at it.

Duvall frowned. “I don’t recognize this ship,” she said. “And I thought I knew every type of ship we had. Is this a Dub U ship?”

“Not exactly,” Jenkins said. “It’s from the United Federation of Planets.”

Duvall blinked and focused her attention back at Jenkins. “Who are they?” she asked.

“They don’t exist,” Jenkins said, and pointed back at the ship. “And neither does this. This is the starship Enterprise. It’s fictional. It was on a science fictional drama series. And so are we.”