Useful Pandas Posts

Please note that this post is, yes, “under construction” as I compile various notes from across my file system and decide what’s worth keeping here and what’s going into the virtual trash bin.

If, like me, you are not very familiar with R and thus you do not readily grasp how pandas brings much of R’s coolness to Python data analysis workflows, then having the occasional overview and/or cheat sheet on hand is useful.

For overviews, I found the following really helpful in understanding how pandas organizes data and the methods available for working with it:

For quick tips that border on being cheat sheets, there is Chris Albon’s “Technical Notes on Using Data Science & Artificial Intelligence to Fight for Something That Matters”, at the bottom of which is a compendium of great tutorials and tips on using pandas. (And as you scroll, you glimpse a lot of other really useful stuff as well.)

Python and PDFs

Real Python has a tutorial on How to Work With a PDF in Python. I subscribe to Real Python because I find their tutorials well-written or, in the case of video tutorials, well-presented. The focus of this tutorial is the PyPDF2 module, which can get metadata from a PDF, rotate pages, merge or split a PDF, and/or encrypt it. While the tutorial mentions “extract information”, that does not mean PyPDF2 can get text from a PDF that does not have a text layer already embedded on its pages (you could argue that the unintuitive nature of PDFs reveals their brokenness, but that’s for another time). If you want to get text where there is no text layer, but you still want to use Python, it looks like you have to turn to PDFMiner, though a quick skim of its GitHub page doesn’t reveal whether it has OCR capabilities baked in. Sigh.

Understanding How Beautiful Soup Works

Two years ago, when I first grabbed the transcripts of the TED talks using wget, I relied upon the wisdom and generosity of Padraic C on StackOverflow to help me use Python’s BeautifulSoup library to get the data I wanted out of the downloaded HTML files. Now that Katherine Kinnaird and I have decided to add talks published since then, and perhaps even re-download the entire corpus so that everything is as consistent as possible, it was time for me to understand for myself how BeautifulSoup (hereafter BS4) works.

from bs4 import BeautifulSoup

# NB: no need to read() the file: BS4 accepts an open file handle
with open("transcript.0.html", encoding="utf-8") as html_file:
    thesoup = BeautifulSoup(html_file, "html5lib")

# Talk metadata is in <meta> tags in the <head>.
# This finds all <meta> tags
metas = thesoup.find_all("meta")

# Let's see what this object is...
print(type(metas))

Output: <class 'bs4.element.ResultSet'>, and we can interact with it as if it were a list. Thus, metas[0] yields <meta charset="utf-8"/>, the first in a long line of <meta> tags. (The complete output is at the bottom of this note, under the heading Appendix A.)

type(metas[0]) outputs: <class 'bs4.element.Tag'>. That means we will need to understand how to select items within a BS4 Tag. The items we are interested in are towards the bottom of the result set:

<meta content="Good news in the fight against pancreatic cancer" itemprop="name"/>
<meta content="Anyone who has lost a loved one to pancreatic cancer knows the devastating speed with which it can affect an otherwise healthy person. TED Fellow and biomedical entrepreneur Laura Indolfi is developing a revolutionary way to treat this complex and lethal disease: a drug delivery device that acts as a cage at the site of a tumor, preventing it from spreading and delivering medicine only where it's needed. &quot;We are hoping that one day we can make pancreatic cancer a curable disease,&quot; she says." itemprop="description"/>
<meta content="PT6M3S" itemprop="duration"/>
<meta content="2016-05-17T14:46:20+00:00" itemprop="uploadDate"/>
<meta content="1246654" itemprop="interactionCount"/>
<meta content="Laura Indolfi" itemprop="name"/>

This gives us the title, the description, the run time, the publication date, the number of hits, and the speaker. So, the question is, how do we navigate the “parse tree” so that we turn up the value of the content attribute when the value of the itemprop attribute is one of the above?

[meta.attrs for meta in metas] returns a list of dictionaries, with each meta its own dictionary. Here is a small sample from the larger list:

{'content': 'PT6M3S', 'itemprop': 'duration'},
{'content': '2016-05-17T14:46:20+00:00', 'itemprop': 'uploadDate'},
{'content': '1246654', 'itemprop': 'interactionCount'},
{'content': 'Laura Indolfi', 'itemprop': 'name'},

What we need to do is identify the dictionary’s position in the list by finding those dictionaries that have the values duration, etc. We then use that position to slice to that dictionary, and get the value associated with content, yes?
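That idea can be sketched in plain Python, using the four sample dictionaries shown above. The helper name content_for is hypothetical, introduced only for illustration, and the sketch filters the list directly rather than first working out a position. It returns a list because, in the full set of tags, an itemprop value such as name occurs more than once.

```python
# Sample of the list produced by [meta.attrs for meta in metas]
meta_dicts = [
    {"content": "PT6M3S", "itemprop": "duration"},
    {"content": "2016-05-17T14:46:20+00:00", "itemprop": "uploadDate"},
    {"content": "1246654", "itemprop": "interactionCount"},
    {"content": "Laura Indolfi", "itemprop": "name"},
]


def content_for(dicts, itemprop):
    """Return the content of every dictionary whose itemprop matches."""
    # dict.get avoids a KeyError for <meta> tags that lack either attribute
    return [d.get("content") for d in dicts if d.get("itemprop") == itemprop]


print(content_for(meta_dicts, "duration"))    # ['PT6M3S']
print(content_for(meta_dicts, "uploadDate"))  # ['2016-05-17T14:46:20+00:00']
```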

It turns out that the best way to do this is built into BS4, though the method was not immediately obvious. One of the answers to the StackOverflow question “Get meta tag content property with BeautifulSoup and Python” suggested the following possibility:

for tag in thesoup.find_all("meta"):
    if tag.get("name", None) == "author":
        speaker = tag.get("content", None)
    if tag.get("itemprop", None) == "duration":
        length = tag.get("content", None)
    if tag.get("itemprop", None) == "uploadDate":
        published = tag.get("content", None)
    if tag.get("itemprop", None) == "interactionCount":
        views = tag.get("content", None)
    if tag.get("itemprop", None) == "description":
        description = tag.get("content", None)

If we ask to see these values with print(speaker, length, published, views, description), we get:

Laura Indolfi PT6M3S 2016-05-17T14:46:20+00:00 1246654 Anyone
who has lost a loved one to pancreatic cancer knows the devastating
speed with which it can affect an otherwise healthy person. TED
Fellow and biomedical entrepreneur Laura Indolfi is developing a
revolutionary way to treat this complex and lethal disease: a drug
delivery device that acts as a cage at the site of a tumor,
preventing it from spreading and delivering medicine only where
it's needed. "We are hoping that one day we can make pancreatic
cancer a curable disease," she says.
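As an alternative to looping over every tag, BS4 can be asked for one tag at a time by attribute value. What follows is a minimal sketch, not the method from the StackOverflow answer: the markup is a trimmed stand-in for a real transcript page, the wanted mapping and the talk dictionary are names I made up for illustration, and I use the built-in html.parser so the sketch runs without html5lib.

```python
from bs4 import BeautifulSoup

# A trimmed stand-in for the <head> of a downloaded transcript page
html = """
<head>
<meta content="Laura Indolfi" name="author"/>
<meta content="PT6M3S" itemprop="duration"/>
<meta content="2016-05-17T14:46:20+00:00" itemprop="uploadDate"/>
<meta content="1246654" itemprop="interactionCount"/>
</head>
"""
soup = BeautifulSoup(html, "html.parser")

# Hypothetical mapping: our label -> (attribute name, attribute value)
wanted = {
    "speaker": ("name", "author"),
    "length": ("itemprop", "duration"),
    "published": ("itemprop", "uploadDate"),
    "views": ("itemprop", "interactionCount"),
}

talk = {}
for label, (attr, value) in wanted.items():
    # find() returns the first matching tag, or None if there is none
    tag = soup.find("meta", attrs={attr: value})
    talk[label] = tag.get("content") if tag is not None else None

print(talk)
```

Passing the attribute through attrs matters for name, since name is also the first parameter of find() itself. Storing the results in a dictionary also means a missing tag yields None rather than an undefined variable.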

Now we need to get the text of the talk out, which is made somewhat difficult by the lack of semantic markup. The start of the text looks like this:

<!-- Transcript text -->
  <div class="Grid Grid--with-gutter d:f@md p-b:4">
    <div class="Grid__cell d:f h:full m-b:.5 m-b:0@md w:12"></div>

    <div class="Grid__cell flx-s:1 p-r:4">

The only reliable thing is the comment tag: there’s also a closing one at the end of the transcript text, so if we can find some way to select all the <p> tags between the two comments, I think we’ll be in good shape.
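Here is one way that selection might look, as a sketch under assumptions rather than a tested solution for the real pages. The markup is a stand-in, and the text of the closing comment ("/Transcript text") is a guess on my part; it would need to be checked against an actual downloaded file.

```python
from bs4 import BeautifulSoup, Comment, Tag

# Stand-in markup; the text of the closing comment is an assumption
html = """
<div>
  <!-- Transcript text -->
  <div class="Grid">
    <p>First paragraph of the talk.</p>
    <p>Second paragraph of the talk.</p>
  </div>
  <!-- /Transcript text -->
  <p>Footer text we do not want.</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

paragraphs = []
inside = False
# .descendants walks every node (tags, strings, comments) in document order
for node in soup.descendants:
    if isinstance(node, Comment):
        label = node.strip()
        if label == "Transcript text":
            inside = True   # opening comment: start collecting
        elif label == "/Transcript text":
            break           # closing comment: stop walking the tree
    elif inside and isinstance(node, Tag) and node.name == "p":
        paragraphs.append(node.get_text(" ", strip=True))

print(paragraphs)
```

Because the walk follows document order, it does not matter how deeply the <p> tags are nested inside the unsemantic <div> wrappers.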

Appendix A

The output of [print(meta) for meta in metas] is:

<meta charset="utf-8"/>
<meta content="TED Talk Subtitles and Transcript: Anyone who has lost a loved one to pancreatic cancer knows the devastating speed with which it can affect an otherwise healthy person. TED Fellow and biomedical entrepreneur Laura Indolfi is developing a revolutionary way to treat this complex and lethal disease: a drug delivery device that acts as a cage at the site of a tumor, preventing it from spreading and delivering medicine only where it's needed. &quot;We are hoping that one day we can make pancreatic cancer a curable disease,&quot; she says." name="description"/>
<meta content="Laura Indolfi" name="author"/>
<meta content='Transcript of "Good news in the fight against pancreatic cancer"' property="og:title"/>
<meta content="https://pi.tedcdn.com/r/talkstar-photos.s3.amazonaws.com/uploads/70d551c2-1e5c-411e-b926-7d72590f66bb/LauraIndolfi_2016U-embed.jpg?c=1050%2C550&amp;w=1050" property="og:image"/>
<meta content="https://pi.tedcdn.com/r/talkstar-photos.s3.amazonaws.com/uploads/70d551c2-1e5c-411e-b926-7d72590f66bb/LauraIndolfi_2016U-embed.jpg?c=1050%2C550&amp;w=1050" property="og:image:secure_url"/>
<meta content="1050" property="og:image:width"/>
<meta content="550" property="og:image:height"/>
<meta content="article" property="og:type"/>
<meta content="TED, Talks, Themes, Speakers, Technology, Entertainment, Design" name="keywords"/>
<meta content="#E62B1E" name="theme-color"/>
<meta content="True" name="HandheldFriendly"/>
<meta content="320" name="MobileOptimized"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<meta content="TED Talks" name="apple-mobile-web-app-title"/>
<meta content="yes" name="apple-mobile-web-app-capable"/>
<meta content="black" name="apple-mobile-web-app-status-bar-style"/>
<meta content="TED Talks" name="application-name"/>
<meta content="https://www.ted.com/browserconfig.xml" name="msapplication-config"/>
<meta content="#000000" name="msapplication-TileColor"/>
<meta content="on" http-equiv="cleartype"/>
<meta content="Laura Indolfi: Good news in the fight against pancreatic cancer" name="title"/>
<meta content="TED Talk Subtitles and Transcript: Anyone who has lost a loved one to pancreatic cancer knows the devastating speed with which it can affect an otherwise healthy person. TED Fellow and biomedical entrepreneur Laura Indolfi is developing a revolutionary way to treat this complex and lethal disease: a drug delivery device that acts as a cage at the site of a tumor, preventing it from spreading and delivering medicine only where it's needed. &quot;We are hoping that one day we can make pancreatic cancer a curable disease,&quot; she says." property="og:description"/>
<meta content="https://www.ted.com/talks/laura_indolfi_good_news_in_the_fight_against_pancreatic_cancer/transcript" property="og:url"/>
<meta content="201021956610141" property="fb:app_id"/>
<meta content="Good news in the fight against pancreatic cancer" itemprop="name"/>
<meta content="Anyone who has lost a loved one to pancreatic cancer knows the devastating speed with which it can affect an otherwise healthy person. TED Fellow and biomedical entrepreneur Laura Indolfi is developing a revolutionary way to treat this complex and lethal disease: a drug delivery device that acts as a cage at the site of a tumor, preventing it from spreading and delivering medicine only where it's needed. &quot;We are hoping that one day we can make pancreatic cancer a curable disease,&quot; she says." itemprop="description"/>
<meta content="PT6M3S" itemprop="duration"/>
<meta content="2016-05-17T14:46:20+00:00" itemprop="uploadDate"/>
<meta content="1246654" itemprop="interactionCount"/>
<meta content="Laura Indolfi" itemprop="name"/>
<meta content="Flash HTML5" itemprop="playerType"/>
<meta content="640" itemprop="width"/>
<meta content="360" itemprop="height"/>

Python’s `google` Module

If, like me, you became interested in executing Google searches from within a Python script, installed the google module (which, some have noted, is developed not by Google itself but by a third party), and got an import error, here is what happened. Yes, you did install it as google:

pip install google

but you do not import it as google, because that will lead to an ImportError. Instead, the name of the module is googlesearch, so what you want to do is this:

from googlesearch import search

Now it works.
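More generally, the name you hand to pip and the name you import need not match. A small sketch for checking which top-level names Python can actually find, using only the standard library (the helper name is_importable is my own, not part of any package):

```python
import importlib.util


def is_importable(module_name):
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(module_name) is not None


# The name given to pip ("google") need not match the name used in import
for name in ("google", "googlesearch"):
    print(name, "importable:", is_importable(name))
```

What this prints depends entirely on what is installed in your environment, so I have not shown any output.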

Hat tip to shylajhaa sathyaram in his comment on GeeksforGeeks.

Difficulties with PIP

As I have noted before, the foundation for my work in Python is built by first installing the Xcode Command Line Tools, then installing MacPorts, then installing (using MacPorts) Python and PIP. Everything else I install within my Python setup, which is pretty much everything, is done using PIP, so when I kept getting the error below after finally acquiescing to macOS’s demands to upgrade to High Sierra, I was more than a little concerned:

[code lang=text]
ImportError: No module named 'packaging'
[/code]

See footnote 1 below for the complete traceback.

I tried installing setuptools using MacPorts, as well as uninstalling PIP. I eventually even uninstalled both Python and PIP and restarted my machine. No joy.

Joy came with this SO thread which suggested I try:

[code lang=text]
wget https://bootstrap.pypa.io/get-pip.py
sudo python get-pip.py
[/code]

Everything seems to be in working order now.

  1. For those interested, the complete traceback looked like this:

[code lang=text]
Traceback (most recent call last):
  File "/opt/local/bin/pip", line 6, in <module>
    from pkg_resources import load_entry_point
  File "/Users/john/Library/Python/3.4/lib/python/site-packages/pkg_resources/__init__.py", line 70, in <module>
    import packaging.version
ImportError: No module named 'packaging'
~ % sudo pip search jupyter
Traceback (most recent call last):
  File "/opt/local/bin/pip", line 6, in <module>
    from pkg_resources import load_entry_point
  File "/Users/john/Library/Python/3.4/lib/python/site-packages/pkg_resources/__init__.py", line 70, in <module>
    import packaging.version
ImportError: No module named 'packaging'
~ % sudo pip install setuptools
Traceback (most recent call last):
  File "/opt/local/bin/pip", line 6, in <module>
    from pkg_resources import load_entry_point
  File "/Users/john/Library/Python/3.4/lib/python/site-packages/pkg_resources/__init__.py", line 70, in <module>
    import packaging.version
ImportError: No module named 'packaging'
[/code]

Python’s Turtle

Here’s a thing most people are surprised by when they first find it: Python has a built-in turtle graphics module that can spawn its own Tk graphics window and draw stuff. Minimal example:

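[code lang=python]
from turtle import Turtle, done

t = Turtle()
for step in range(36):
    t.forward(400)  # draw one edge of the star
    t.right(170)    # turn sharply before drawing the next edge

done()  # keep the Tk window open until it is closed
[/code]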

Python Modules You Didn’t Know You Needed

One of the things that happens as you nurture and grow a software stack is that you begin to take its functionality for granted, and, when you are faced with the prospect of re-creating it elsewhere or over, you realize you need better documentation. My work is currently founded on Python, and I have already documented the great architecture that is numpy + scipy + nltk + pandas + matplotlib + … you get the idea.

  • jupyter is central to how I work my way through code, and when I need to present that code, I am delighted that jupyter gives me the option to present a notebook as a collection of slides. RISE makes those notebooks fly using Reveal.js.
  • missingno “provides a small toolset of flexible and easy-to-use missing data visualizations and utilities that allows you to get a quick visual summary of the completeness (or lack thereof) of your dataset. It’s built using matplotlib, so it’s fast, and takes any pandas DataFrame input that you throw at it, so it’s flexible. Just pip install missingno to get started.”

I’ve got more … I just need to list them out.

Append a Python List Using a List Comprehension

In some code with which I am working at the moment, I need to be able to generate a list of labels based on a variable number that I provide elsewhere in a script. In this case, I am working with scikit-learn’s topic modeling functions, and as I work iteratively through a given corpus, I am regularly adjusting the number of topics I think “fit” the corpus. Elsewhere in the script, I am using pandas to create a dataframe that uses the names of the texts as row labels and the topic numbers as column labels.

df_lda_DTM = pd.DataFrame(data= lda_W, index = docs, columns = topic_labels)

In the script, I simply use n_components to specify the number of topics with which the function, LDA or NMF, is to work.

I needed some way to generate the topic labels on the fly so that I would not be stuck with manually editing this:

topic_labels = ["Topic 0", "Topic 1", "Topic 2"]
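To see why the hard-coded list is fragile, here is a small sketch of the dataframe step. The random matrix stands in for the document-topic matrix a fitted model would return, and the text names in docs are made up for illustration.

```python
import numpy as np
import pandas as pd

n_components = 3                                 # number of topics
docs = ["text_a", "text_b", "text_c", "text_d"]  # hypothetical text names

# Stand-in for the document-topic matrix a fitted model would return
rng = np.random.default_rng(0)
lda_W = rng.random((len(docs), n_components))

topic_labels = ["Topic 0", "Topic 1", "Topic 2"]

# pandas raises a ValueError if len(topic_labels) != lda_W.shape[1]
df_lda_DTM = pd.DataFrame(data=lda_W, index=docs, columns=topic_labels)
print(df_lda_DTM.shape)  # (4, 3)
```

Change n_components to 4 without touching topic_labels and the DataFrame call fails, which is exactly the manual editing I wanted to avoid.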

I was able to do so with a for loop that looked like this:

topic_labels = []
for i in range(0, n_components):
    instance = "Topic {}".format(i)
    topic_labels.append(instance)

Eventually, it dawned on me that range only needs the upper bound, so I could drop the 0 inside the parentheses:

topic_labels = []
for i in range(n_components):
    topic_labels.append("Topic {}".format(i))

That works just fine, but, while not a big block of code, this piece is part of a much longer script, and if I could get it down to a single line, using a list comprehension, I would make the overall script much easier to read, since this is just a passing bit of code that does one very small thing. One line should be enough.

Enter Python’s list comprehension, a bit of syntactic sugar, as pythonistas like to call it, that I have by no means, er, fully comprehended. Still, here’s an opportunity to learn a little bit more.

So, following the guidelines for how you re-block your code within a list comprehension, I tried this:

topic_labels = [topic_labels.append("Topic {}".format(i)) for i in range(n_components)]

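Better coders than I will recognize that this will not work: list.append() changes the list in place and returns None, so the comprehension evaluates to [None, None, None], and assigning that back to topic_labels throws away the labels that were appended. (If topic_labels has not already been defined, the line fails with a NameError instead.)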
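A quick way to see what is going on is to keep the two lists apart. In this sketch the comprehension's value goes into a separate name, result, which I introduce only for illustration:

```python
n_components = 3

topic_labels = []
result = [topic_labels.append("Topic {}".format(i)) for i in range(n_components)]

print(result)        # [None, None, None]: one return value of append() per pass
print(topic_labels)  # ['Topic 0', 'Topic 1', 'Topic 2']: filled as a side effect
```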

But appending to a list is simply one way of building a list, of adding elements to a list, isn’t it? I could use Python’s string addition to pull this off, couldn’t I? Yes, yes I could, and did:

topic_labels = ["Topic " + str(i) for i in range(n_components)]

It couldn’t be simpler, and shorter. And it works:

print(topic_labels)
['Topic 0', 'Topic 1', 'Topic 2']
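For what it is worth, the same comprehension can be written with an f-string (Python 3.6 or later) instead of string addition. Wrapping it in a small function, here given the hypothetical name make_topic_labels, is one option if the labels are needed in more than one place:

```python
def make_topic_labels(n_components):
    """Return ['Topic 0', ..., 'Topic n-1'] for the given number of topics."""
    return [f"Topic {i}" for i in range(n_components)]


print(make_topic_labels(3))  # ['Topic 0', 'Topic 1', 'Topic 2']
```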