Understanding How Beautiful Soup Works

Two years ago, when I first grabbed the transcripts of the TED talks using wget, I relied upon the wisdom and generosity of Padraic C on StackOverflow to help me use Python’s BeautifulSoup library to get the data I wanted out of the downloaded HTML files. Now that Katherine Kinnaird and I have decided to add talks published since then, and perhaps even go so far as to re-download the entire corpus so that everything is as much the same as possible, it was time for me to understand for myself how BeautifulSoup (hereafter BS4) works.

from bs4 import BeautifulSoup

# NB: no need to read() the file: BS4 does that
thesoup = BeautifulSoup(open("transcript.0.html"), "html5lib")

# Talk metadata is in <meta> tags in the <head>.
# This finds all <meta> tags
metas = thesoup.find_all("meta")

# Let's see what this object is...
print(type(metas))

Output: <class 'bs4.element.ResultSet'>, and we can interact with it as if it were a list. Thus, metas[0] yields <meta charset="utf-8"/>, which is the first of a long run of <meta> tags. (The complete output is at the bottom of this note, under the heading Appendix A.)

type(metas[0]) outputs: <class 'bs4.element.Tag'>. That means we will need to understand how to select items within a BS4 Tag. The items we are interested in are towards the bottom of the result set:

<meta content="Good news in the fight against pancreatic cancer" itemprop="name"/>
<meta content="Anyone who has lost a loved one to pancreatic cancer knows the devastating speed with which it can affect an otherwise healthy person. TED Fellow and biomedical entrepreneur Laura Indolfi is developing a revolutionary way to treat this complex and lethal disease: a drug delivery device that acts as a cage at the site of a tumor, preventing it from spreading and delivering medicine only where it's needed. &quot;We are hoping that one day we can make pancreatic cancer a curable disease,&quot; she says." itemprop="description"/>
<meta content="PT6M3S" itemprop="duration"/>
<meta content="2016-05-17T14:46:20+00:00" itemprop="uploadDate"/>
<meta content="1246654" itemprop="interactionCount"/>
<meta content="Laura Indolfi" itemprop="name"/>

This gives us the title, the description, the run time, the publication date, the view count, and the speaker. So the question is: how do we navigate the “parse tree” so that we turn up the value of the content attribute when the value of the itemprop attribute is one of the above?

[meta.attrs for meta in metas] returns a list of dictionaries, with each meta its own dictionary. Here is a small sample from the larger list:

{'content': 'PT6M3S', 'itemprop': 'duration'},
{'content': '2016-05-17T14:46:20+00:00', 'itemprop': 'uploadDate'},
{'content': '1246654', 'itemprop': 'interactionCount'},
{'content': 'Laura Indolfi', 'itemprop': 'name'},

What we need to do is identify each dictionary’s position in the list by finding the dictionaries whose itemprop value is duration, uploadDate, and so on. We can then use that position to index into the list and get the value associated with content, yes?
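
A minimal sketch of that idea, collapsing the list of dictionaries into a single lookup table. (Note that the two itemprop="name" tags collide, so the later one, the speaker, wins.)

itemprops = {meta["itemprop"]: meta["content"]
             for meta in metas
             if meta.has_attr("itemprop") and meta.has_attr("content")}

length = itemprops.get("duration")        # 'PT6M3S'
published = itemprops.get("uploadDate")   # '2016-05-17T14:46:20+00:00'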

It turns out that the best way to do this is built into BS4, though the method was not immediately obvious. One of the answers to the StackOverflow question “Get meta tag content property with BeautifulSoup and Python” suggested the following possibility:

for tag in thesoup.find_all("meta"):
    if tag.get("name", None) == "author":
        speaker = tag.get("content", None)
    if tag.get("itemprop", None) == "duration":
        length = tag.get("content", None)
    if tag.get("itemprop", None) == "uploadDate":
        published = tag.get("content", None)
    if tag.get("itemprop", None) == "interactionCount":
        views = tag.get("content", None)
    if tag.get("itemprop", None) == "description":
        description = tag.get("content", None)

If we ask to see these values with print(speaker, length, published, views, description), we get:

Laura Indolfi PT6M3S 2016-05-17T14:46:20+00:00 1246654 Anyone
who has lost a loved one to pancreatic cancer knows the devastating
speed with which it can affect an otherwise healthy person. TED
Fellow and biomedical entrepreneur Laura Indolfi is developing a
revolutionary way to treat this complex and lethal disease: a drug
delivery device that acts as a cage at the site of a tumor,
preventing it from spreading and delivering medicine only where
it's needed. "We are hoping that one day we can make pancreatic
cancer a curable disease," she says.

Now we need to get the text of the talk out, which is made somewhat difficult by the lack of semantic markup. The start of the text looks like this:

<!-- Transcript text -->
  <div class="Grid Grid--with-gutter d:f@md p-b:4">
    <div class="Grid__cell d:f h:full m-b:.5 m-b:0@md w:12"></div>

    <div class="Grid__cell flx-s:1 p-r:4">

The only reliable thing is the comment tag: there’s also a closing one at the end of the transcript text, so if we can find some way to select all the <p> tags between the two comments, I think we’ll be in good shape.
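
Here is a first, hedged sketch of that idea: find the comments that mention the transcript text and collect every <p> tag that falls between the first and the last of them. (It assumes the closing comment also contains the words “Transcript text”; if it is worded differently, the lambda below needs adjusting.)

from bs4 import BeautifulSoup, Comment

thesoup = BeautifulSoup(open("transcript.0.html"), "html5lib")

# Comments that bracket the transcript
comments = thesoup.find_all(string=lambda s: isinstance(s, Comment)
                            and "Transcript text" in s)

paragraphs = []
if len(comments) >= 2:
    start, end = comments[0], comments[-1]
    for element in start.next_elements:
        if element is end:
            break
        if getattr(element, "name", None) == "p":
            paragraphs.append(element.get_text(" ", strip=True))

talk_text = "\n".join(paragraphs)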

Appendix A

The output of [print(meta) for meta in metas] is:

<meta charset="utf-8"/>
<meta content="TED Talk Subtitles and Transcript: Anyone who has lost a loved one to pancreatic cancer knows the devastating speed with which it can affect an otherwise healthy person. TED Fellow and biomedical entrepreneur Laura Indolfi is developing a revolutionary way to treat this complex and lethal disease: a drug delivery device that acts as a cage at the site of a tumor, preventing it from spreading and delivering medicine only where it's needed. &quot;We are hoping that one day we can make pancreatic cancer a curable disease,&quot; she says." name="description"/>
<meta content="Laura Indolfi" name="author"/>
<meta content='Transcript of "Good news in the fight against pancreatic cancer"' property="og:title"/>
<meta content="https://pi.tedcdn.com/r/talkstar-photos.s3.amazonaws.com/uploads/70d551c2-1e5c-411e-b926-7d72590f66bb/LauraIndolfi_2016U-embed.jpg?c=1050%2C550&amp;w=1050" property="og:image"/>
<meta content="https://pi.tedcdn.com/r/talkstar-photos.s3.amazonaws.com/uploads/70d551c2-1e5c-411e-b926-7d72590f66bb/LauraIndolfi_2016U-embed.jpg?c=1050%2C550&amp;w=1050" property="og:image:secure_url"/>
<meta content="1050" property="og:image:width"/>
<meta content="550" property="og:image:height"/>
<meta content="article" property="og:type"/>
<meta content="TED, Talks, Themes, Speakers, Technology, Entertainment, Design" name="keywords"/>
<meta content="#E62B1E" name="theme-color"/>
<meta content="True" name="HandheldFriendly"/>
<meta content="320" name="MobileOptimized"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<meta content="TED Talks" name="apple-mobile-web-app-title"/>
<meta content="yes" name="apple-mobile-web-app-capable"/>
<meta content="black" name="apple-mobile-web-app-status-bar-style"/>
<meta content="TED Talks" name="application-name"/>
<meta content="https://www.ted.com/browserconfig.xml" name="msapplication-config"/>
<meta content="#000000" name="msapplication-TileColor"/>
<meta content="on" http-equiv="cleartype"/>
<meta content="Laura Indolfi: Good news in the fight against pancreatic cancer" name="title"/>
<meta content="TED Talk Subtitles and Transcript: Anyone who has lost a loved one to pancreatic cancer knows the devastating speed with which it can affect an otherwise healthy person. TED Fellow and biomedical entrepreneur Laura Indolfi is developing a revolutionary way to treat this complex and lethal disease: a drug delivery device that acts as a cage at the site of a tumor, preventing it from spreading and delivering medicine only where it's needed. &quot;We are hoping that one day we can make pancreatic cancer a curable disease,&quot; she says." property="og:description"/>
<meta content="https://www.ted.com/talks/laura_indolfi_good_news_in_the_fight_against_pancreatic_cancer/transcript" property="og:url"/>
<meta content="201021956610141" property="fb:app_id"/>
<meta content="Good news in the fight against pancreatic cancer" itemprop="name"/>
<meta content="Anyone who has lost a loved one to pancreatic cancer knows the devastating speed with which it can affect an otherwise healthy person. TED Fellow and biomedical entrepreneur Laura Indolfi is developing a revolutionary way to treat this complex and lethal disease: a drug delivery device that acts as a cage at the site of a tumor, preventing it from spreading and delivering medicine only where it's needed. &quot;We are hoping that one day we can make pancreatic cancer a curable disease,&quot; she says." itemprop="description"/>
<meta content="PT6M3S" itemprop="duration"/>
<meta content="2016-05-17T14:46:20+00:00" itemprop="uploadDate"/>
<meta content="1246654" itemprop="interactionCount"/>
<meta content="Laura Indolfi" itemprop="name"/>
<meta content="Flash HTML5" itemprop="playerType"/>
<meta content="640" itemprop="width"/>
<meta content="360" itemprop="height"/>

Python’s `google` Module

If, like me, you became interested in the possibility of executing Google searches from within a Python script, and, like me, you installed the google module (which, some have noted, is no longer developed by Google itself but by a third party) and then got an import error, here is what happened: yes, you did install it as google:

pip install google

but you do not import it as google, because that leads to an ImportError. The module you actually import is named googlesearch, so what you want is this:

from googlesearch import search

Now it works.
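
For completeness, a small, hedged usage sketch: search() returns results lazily, so we just break out after the first ten. (Keyword arguments such as the number of results differ across versions of the package, so only the query is passed here; the query string itself is just an example.)

from googlesearch import search

for i, url in enumerate(search("TED talk transcript pancreatic cancer")):
    print(url)
    if i >= 9:
        break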

Hat tip to shylajhaa sathyaram in his comment on GeeksforGeeks.

Difficulties with PIP

As I have noted before, the foundation for my work in Python is built on first installing the Xcode Command Line Tools, then installing MacPorts, then installing (using MacPorts) Python and PIP. Everything I then install within my Python setup, which is pretty much everything else, is done using PIP, so when I kept getting the error below after finally acquiescing to macOS’s demands to upgrade to High Sierra, I was more than a little concerned:

ImportError: No module named 'packaging'

See below for the complete traceback.1

I tried installing setuptools using MacPorts, as well as uninstalling PIP. I eventually even uninstalled both Python and PIP and restarted my machine. No joy.

Joy came with this SO thread which suggested I try:

wget https://bootstrap.pypa.io/get-pip.py
sudo python get-pip.py

Everything seems to be in working order now.

1. For those interested, the complete traceback looked like this:

Traceback (most recent call last):
  File "/opt/local/bin/pip", line 6, in <module>
    from pkg_resources import load_entry_point
  File "/Users/john/Library/Python/3.4/lib/python/site-packages/pkg_resources/__init__.py", line 70, in <module>
    import packaging.version
ImportError: No module named 'packaging'
~ % sudo pip search jupyter
Traceback (most recent call last):
  File "/opt/local/bin/pip", line 6, in <module>
    from pkg_resources import load_entry_point
  File "/Users/john/Library/Python/3.4/lib/python/site-packages/pkg_resources/__init__.py", line 70, in <module>
    import packaging.version
ImportError: No module named 'packaging'
~ % sudo pip install setuptools
Traceback (most recent call last):
  File "/opt/local/bin/pip", line 6, in <module>
    from pkg_resources import load_entry_point
  File "/Users/john/Library/Python/3.4/lib/python/site-packages/pkg_resources/__init__.py", line 70, in <module>
    import packaging.version
ImportError: No module named 'packaging'


Python Modules You Didn’t Know You Needed

One of the things that happens as you nurture and grow a software stack is that you begin to take its functionality for granted, and, when you are faced with the prospect of re-creating it elsewhere or all over again, you realize you need better documentation. My work is currently founded on Python, and I have already documented the great architecture that is numpy + scipy + nltk + pandas + matplotlib + … you get the idea.

  • jupyter is central to how I work my way through code, and when I need to present that code, I am delighted that jupyter gives me the option to present a notebook as a collection of slides. RISE makes those notebooks fly using Reveal.js.
  • missingno “provides a small toolset of flexible and easy-to-use missing data visualizations and utilities that allows you to get a quick visual summary of the completeness (or lack thereof) of your dataset. It’s built using matplotlib, so it’s fast, and takes any pandas DataFrame input that you throw at it, so it’s flexible. Just pip install missingno to get started.”
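
A quick, hedged sketch of the missingno quick-look described above, using a throwaway DataFrame (the DataFrame and its missing values here are made up purely for illustration):

import numpy as np
import pandas as pd
import missingno as msno

df = pd.DataFrame({"a": [1, np.nan, 3], "b": [np.nan, 2, 3], "c": [1, 2, 3]})
msno.matrix(df)   # nullity matrix: filled where values are present, blank where missing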

I’ve got more … I just need to list them out.

Append a Python List Using a List Comprehension

In some code I am working with at the moment, I need to be able to generate a list of labels based on a variable number that I provide elsewhere in the script. In this case, I am working with scikit-learn’s topic modeling functions, and as I work iteratively through a given corpus, I am regularly adjusting the number of topics I think “fit” the corpus. Elsewhere in the script, I use pandas to create a dataframe with the names of the texts as row labels and the topic numbers as column labels.

df_lda_DTM = pd.DataFrame(data= lda_W, index = docs, columns = topic_labels)

In the script, I simply use n_components to specify the number of topics with which the function, LDA or NMF, is to work.

I needed some way to generate the topic labels on the fly so that I would not be stuck with manually editing this:

topic_labels = ["Topic 0", "Topic 1", "Topic 2"]

I was able to do so with a for loop that looked like this:

topic_labels = []
for i in range(0, n_components):
    instance = "Topic {}".format(i)
    topic_labels.append(instance)

Eventually, it dawned on me that range only needs the upper bound, so I could drop the 0 inside the parentheses:

topic_labels = []
for i in range(n_components):
    topic_labels.append("Topic {}".format(i))

That works just fine, but, while not a big block of code, this piece is part of a much longer script, and if I could get it down to a single line, using a list comprehension, I would make the overall script much easier to read, since this is just a passing bit of code that does one very small thing. One line should be enough.

Enter Python’s list comprehension, a bit of syntactic sugar, as pythonistas like to call it, that I have by no means, er, fully comprehended. Still, here’s an opportunity to learn a little bit more.

So, following the guidelines for how you re-block your code within a list comprehension, I tried this:

topic_labels = [topic_labels.append("Topic {}".format(i)) for i in range(n_components)]

Better coders than I will recognize that this will not work: list.append returns None, so the comprehension simply collects those return values and yields [None, None, None].

But appending to a list is simply one way of building a list, of adding elements to it, isn’t it? I could use Python’s string addition to pull this off, couldn’t I? Yes, yes I could, and did:

topic_labels = ["Topic " + str(i) for i in range(n_components)]

It couldn’t be simpler or shorter. And it works:

print(topic_labels)
['Topic 0', 'Topic 1', 'Topic 2']
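
For the record, the format-based version works as a comprehension too, once the append call is dropped; the two lines below (assuming n_components = 3) produce exactly the same list, the second using the newer f-string idiom:

topic_labels = ["Topic {}".format(i) for i in range(n_components)]
topic_labels = [f"Topic {i}" for i in range(n_components)]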

Turning Words into Numbers

As Katherine Kinnaird and I continue our work on the TED Talks, we have found ourselves drawn to examine more closely the notion of topics, whose use in the humanities we both feel has been underexamined.

Most humanists use an implementation of LDA, which we will probably also use simply to stay in parallel, but at some point in our work, frustrated by my inability to get LDA to work within Python, I picked up Alan Riddell’s DARIAH tutorial and drafted an implementation of NMF topic modeling for our corpus. One advantage I noticed right away, in comparing the results to earlier work I had done with Jonathan Goodwin, was what seemed like a much more stable set of word clusters in the algorithmically derived topics.

Okay, good, but Kinnaird noticed that stopwords kept creeping into the topics, and that raised larger issues about how NMF does what it does, which meant, because she’s so thorough, backing up a bit and making sure we understand how NMF works.

What follows is an experiment to understand the shape and nature of the tf matrix, the tfidf matrix, and the output of the sklearn NMF algorithm. Some of this is driven by essays I have been reading on the subject.

To start our adventure, we needed a small set of texts with sufficient overlap that we could later successfully derive topics from them. I set myself the task of creating ten sentences, each of approximately ten words. Careful readers who take the time to read the sentences themselves will, I hope, forgive me for the texts being rather reflexive in nature, but it did seem appropriate given the overall reflexive nature of this task.

# =-=-=-=-=-=-=-=-=-=-=
# The Toy Corpus
# =-=-=-=-=-=-=-=-=-=-= 

sentences = ["Each of these sentences consists of about ten words.",
             "Ten sentence stories and ten word stories were once popular.",
             "Limiting the vocabulary to ten words is difficult.",
             "It is quite difficult to create sentences of just ten words",
             "I need, in fact, variety in the words used.",
             "With all these texts being about texts, there will be few topics.",
             "But I do not want too much variety in the vocabulary.",
             "I want to keep the total vocabulary fairly small.",
             "With a small vocabulary comes a small matrix.",
             "The smaller the matrix the more we will be able to see how things work."]


# =-=-=-=-=-=-=-=-=-=-=
# The Stopwords for this corpus
# =-=-=-=-=-=-=-=-=-=-= 

stopwords = ["a", "about", "all", "and", "be", "being", "but", "do", "each", "few", 
             "how", "i", "in", "is", "it", "more", "much", "not", "of", "once", "the", 
             "there", "these", "to", "too", "want", "we", "were", "will", "with"]

Each text is simply a sentence in a list of strings. Below the texts is the custom stopword list for this corpus. For those curious, there are a total of 102 tokens in the corpus and 30 stopwords. Once the stopwords are applied, 49 tokens remain, representing 31 distinct words.

# =-=-=-=-=-=
# Clean & Tokenize
# =-=-=-=-=-=

import re
from nltk.tokenize import WhitespaceTokenizer

tokenizer = WhitespaceTokenizer()
# stopwords = re.split('\s+', open('../data/tt_stop.txt', 'r').read().lower())

# Loop to tokenize, stop, and stem (if needed) texts.
tokenized = []
for sentence in sentences:   
    raw = re.sub(r"[^\w\d'\s]+",'', sentence).lower()
    tokens = tokenizer.tokenize(raw)
    stopped_tokens = [word for word in tokens if not word in stopwords]
    tokenized.append(stopped_tokens)


# =-=-=-=-=-=-=-=-=-=-=
# Re-Assemble Texts as Strings from Lists of Words
# (because this is what sklearn expects)
# =-=-=-=-=-=-=-=-=-=-= 

texts = []
for item in tokenized:
    the_string = ' '.join(item)
    texts.append(the_string)
for text in texts:
    print(text)

which prints:

sentences consists ten words
ten sentence stories ten word stories popular
limiting vocabulary ten words difficult
quite difficult create sentences just ten words
need fact variety words used
texts texts topics
variety vocabulary
keep total vocabulary fairly small
small vocabulary comes small matrix
smaller matrix able see things work

A quick count of tokens and types:

all_words = ' '.join(texts).split()
print("There are {} tokens representing {} words."
      .format(len(all_words), len(set(all_words))))

There are 49 tokens representing 31 words.

We will explore below the possibility of using the sklearn module’s built-in tokenization and stopword abilities, but while I continue to teach myself that functionality, we can move ahead with understanding the vectorization of a corpus.

There are a lot of ways to turn a series of words into a series of numbers. One of the principal ways ignores any individuated context for a particular word, as we might understand it within a given sentence, and simply considers the word in relationship to the other words in a text. That is, one way to turn words into numbers is simply to count the words in a text, reducing a text to what is known as a “bag of words.” (There’s a lot of linguistics and information science that validates this approach, but it will always chafe most humanists.)

If we run our corpus of ten sentences through the CountVectorizer, we will get a representation of it as a series of numbers, each representing the count of a particular word within a particular text:

# =-=-=-=-=-=-=-=-=-=-=
# TF
# =-=-=-=-=-=-=-=-=-=-= 
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

vec = CountVectorizer()
tf_data = vec.fit_transform(texts).toarray()
print(tf_data.shape)
print(tf_data)
(10, 31)
[[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 2 2 0 0 0 0 0 0 0 1 0 0]
 [0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 0]
 [0 0 0 1 1 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0]
 [0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 1 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0]
 [0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 1 0 0 0]
 [0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 1 0 0 0]
 [1 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1]]

The CountVectorizer in sklearn creates a set of words out of all the tokens, as we did above, then counts the number of times a given word occurs within a given text, returning each text as a vector of counts. Thus, the second sentence above:

"Ten sentence stories and ten word stories were once popular." 

which we had tokenized and stopworded to become:

ten sentence stories ten word stories popular

becomes a list of numbers, or a vector, that looks like this:

0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 2 2 0 0 0 0 0 0 0 1 0 0

I chose the second sentence because it has two words that occur twice, ten and stories, so that it doesn’t look like a line of binary. If you stack all ten texts on top of each other, you get a matrix of 10 rows, each row a text, and 31 columns, each column one of the important, lexical, words.

Based on the location of the two twos, my guess is that the CountVectorizer alphabetizes its list of words, which can also be considered the features of a text. A quick check of our set of words, sorted alphabetically, is our first step in confirmation. (It also reveals one of the great problems of working with words: “sentence” and “sentences,” as well as “word” and “words,” are treated separately, so where a human being would regard those as two lexical entries, the computer treats them as four. This is one argument for stemming, but stemming, so far as I have encountered it, is not only no panacea, it also creates other problems.)

the_words = list(set(all_words))
the_words.sort()
print(the_words)
['able', 'comes', 'consists', 'create', 'difficult', 'fact', 'fairly', 'just', 'keep', 'limiting', 'matrix', 'need', 'popular', 'quite', 'see', 'sentence', 'sentences', 'small', 'smaller', 'stories', 'ten', 'texts', 'things', 'topics', 'total', 'used', 'variety', 'vocabulary', 'word', 'words', 'work']

We can actually get that same list from the vectorizer itself with the get_feature_names method:

features = vec.get_feature_names()
print(features)
['able', 'comes', 'consists', 'create', 'difficult', 'fact', 'fairly', 'just', 'keep', 'limiting', 'matrix', 'need', 'popular', 'quite', 'see', 'sentence', 'sentences', 'small', 'smaller', 'stories', 'ten', 'texts', 'things', 'topics', 'total', 'used', 'variety', 'vocabulary', 'word', 'words', 'work']

We can also inspect the vectorizer’s vocabulary_ attribute, which reveals that sklearn stores the mapping as a dictionary with the term as the key and, as the value, not the term’s count but its column index in the matrix:

vocab = vec.vocabulary_
print(vocab)
{'comes': 1, 'difficult': 4, 'need': 11, 'matrix': 10, 'vocabulary': 27, 'just': 7, 'see': 14, 'quite': 13, 'smaller': 18, 'consists': 2, 'texts': 21, 'variety': 26, 'sentence': 15, 'total': 24, 'popular': 12, 'create': 3, 'work': 30, 'topics': 23, 'word': 28, 'limiting': 9, 'words': 29, 'ten': 20, 'able': 0, 'keep': 8, 'sentences': 16, 'fairly': 6, 'stories': 19, 'things': 22, 'used': 25, 'fact': 5, 'small': 17}
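
If what we actually want are corpus-wide occurrence counts, we can get them by summing the columns of the count matrix and using vocabulary_ to map each term to its column. A quick sketch:

counts = tf_data.sum(axis=0)
term_counts = {term: int(counts[idx]) for term, idx in vec.vocabulary_.items()}
print(term_counts["ten"])   # 5: "ten" occurs five times across the ten texts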

It’s also worth pointing out that we can get a count of particular terms within our corpus by feeding the CountVectorizer a vocabulary argument. Here I’ve prepopulated a list with three of our terms — “sentence”, “stories”, and “vocabulary” — and the function returns an array which counts only the occurrence of those three terms across all ten texts:

# =-=-=-=-=-=-=-=-=-=-=
# Controlled Vocabulary Count
# =-=-=-=-=-=-=-=-=-=-= 

tags = ['sentence', 'stories', 'vocabulary']
cv = CountVectorizer(vocabulary=tags)
data = cv.fit_transform(texts).toarray()
print(data)
[[0 0 0]
 [1 2 0]
 [0 0 1]
 [0 0 0]
 [0 0 0]
 [0 0 0]
 [0 0 1]
 [0 0 1]
 [0 0 1]
 [0 0 0]]

So far we’ve been trafficking in raw counts, or occurrences, of a word — aka term, aka feature — in our corpus. Chances are, longer texts, which simply have more words, will have more occurrences of any given word, which means they may come to be overvalued (overweighted?) if we rely only on occurrences. Fortunately, we can normalize by the length of a text to get a value that lets us compare how often a word is used relative to the size of its text across all the texts in a corpus. That is, we can get a term’s frequency.

As I was working on this bit of code, I learned that sklearn stores this information in a compressed sparse row matrix, wherein each (text, term) coordinate pair is followed by a value. I have captured the first two texts below. (Strictly speaking, with use_idf=False the TfidfTransformer applies a unit-length, L2, normalization to each row of counts rather than dividing by the number of tokens, which is why each of the four terms in the first text comes out as 0.5. Note, too, the commented-out toarray method in the second-to-last line: it’s there so often in sklearn code that I had come to take it for granted.)

from sklearn.feature_extraction.text import TfidfTransformer

tf_transformer = TfidfTransformer(use_idf=False).fit(tf_data)
words_tf = tf_transformer.transform(tf_data)#.toarray()
print(words_tf[0:2])
  (0, 2)    0.5
  (0, 16)   0.5
  (0, 20)   0.5
  (0, 29)   0.5
  (1, 12)   0.301511344578
  (1, 15)   0.301511344578
  (1, 19)   0.603022689156
  (1, 20)   0.603022689156
  (1, 28)   0.301511344578

And here’s that same information represented as an array:

words_tf_array = words_tf.toarray()
print(words_tf_array[0:2])
[[ 0.          0.          0.5         0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.5         0.          0.          0.          0.5
   0.          0.          0.          0.          0.          0.          0.
   0.          0.5         0.        ]
 [ 0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.30151134
   0.          0.          0.30151134  0.          0.          0.
   0.60302269  0.60302269  0.          0.          0.          0.          0.
   0.          0.          0.30151134  0.          0.        ]]

Finally, we can also weight words within a document contra the number of times they occur within the overall corpus, thus lowering the value of common words.

# =-=-=-=-=-=-=-=-=-=-=
# TFIDF
# =-=-=-=-=-=-=-=-=-=-= 

tfidf = TfidfVectorizer()
tfidf_data = tfidf.fit_transform(texts)#.toarray()
print(tfidf_data.shape)
print(tfidf_data[1]) # values for second sentence
(10, 31)
  (0, 12)   0.338083066465
  (0, 28)   0.338083066465
  (0, 19)   0.67616613293
  (0, 15)   0.338083066465
  (0, 20)   0.447100526936

And now, again, in the more common form of an array:

tfidf_array = tfidf_data.toarray()
print(tfidf_array[1]) # values for second sentence
[ 0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.33808307
  0.          0.          0.33808307  0.          0.          0.
  0.67616613  0.44710053  0.          0.          0.          0.          0.
  0.          0.          0.33808307  0.          0.        ]
# get_feature_names() belongs to the vectorizer, not to the transformed matrix,
# which is why the line below fails; tfidf.get_feature_names() works instead.
#tfidf_recall = tfidf_data.get_feature_names() # Not working

Staying within the sklearn ecosystem

What if we do all tokenization and normalization in sklearn?

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# This is the bog-standard version from the documentation
# test_vec = CountVectorizer(input=u'content', 
#                            encoding=u'utf-8', 
#                            decode_error=u'strict', 
#                            strip_accents=None, 
#                            lowercase=True, 
#                            preprocessor=None, 
#                            tokenizer=None, 
#                            stop_words=stopwords, 
#                            token_pattern=u'(?u)\b\w\w+\b', 
#                            ngram_range=(1, 1), 
#                            analyzer=u'word', 
#                            max_df=1.0, 
#                            min_df=1, 
#                            max_features=None, 
#                            vocabulary=None, 
#                            binary=False, 
#                            dtype=<type 'numpy.int64'>)
# NB: copied from the docs as u'(?u)\b\w\w+\b', the \b is read as a backspace
# character rather than a word boundary, the pattern matches nothing, and
# fit_transform raises "ValueError: empty vocabulary". It needs to be a raw string.
test_vec = CountVectorizer(lowercase = True, 
                           stop_words = stopwords, 
                           token_pattern = r'(?u)\b\w\w+\b', 
                           ngram_range = (1, 1), 
                           analyzer = u'word')

test_data = test_vec.fit_transform(texts).toarray()

Counting Control Words in a Text

As I was working on a toy corpus to understand the various facets of sklearn, I came across this very clear example of how to count specific words in a collection of texts:

import sklearn
cv = sklearn.feature_extraction.text.CountVectorizer(vocabulary=['hot', 'cold', 'old'])
data = cv.fit_transform(['pease porridge hot', 'pease porridge cold', 'pease porridge in the pot', 'nine days old']).toarray()
print(data)
[[1 0 0]
 [0 1 0]
 [0 0 0]
 [0 0 1]]

Please note that I’ve changed the original a bit to make it easier to deploy in a longer script.

Test Post with JP Markdown and Syntax Highlighting Activated

Okay, here’s some regular prose, which isn’t explanatory at all, and then here comes a block of code:

from stop_words import get_stop_words
from nltk.corpus import stopwords

mod_stop = get_stop_words('en')
nltk_stop = stopwords.words("english")

print("mod_stop is {} words, and nltk_stop is {} words".format(len(mod_stop), len(nltk_stop)))

returns:

mod_stop is 174 words, and nltk_stop is 153 words

Getting Word Frequencies for 2000+ Texts

What I’ve been working on for the past few days is in preparation for attempting a topic model using the more established LDA instead of NMF, to see how well the two compare — with the understanding that, since there is rarely a one-to-one matchup of topics within either method, there will be no such match across them.

Because LDA does not filter out common words on its own, the way the NMF method does, you have to start with a stoplist. I know we can begin with Blei’s and a few other established lists, but I would also like to be able to compare those against our own results. My first thought was to build a dictionary of words and their frequency within the corpus. For convenience’s sake, I am using the NLTK.

Just as a record of what I’ve done, here’s the usual code for loading the talks from the CSV with everything in it:

import pandas
import re

# Get all talks in a list & then into one string
colnames = ['author', 'title', 'date' , 'length', 'text']
df = pandas.read_csv('../data/talks-v1b.csv', names=colnames)
talks = df.text.tolist()
alltalks = " ".join(str(item) for item in talks) # Solves pbm of floats in talks

# Clean out all punctuation except apostrophes
all_words = re.sub(r"[^\w\d'\s]+",'',alltalks).lower()

We still need to identify which talks have floats for values and determine what impact, if any, it has on the project.

import nltk

tt_tokens = nltk.word_tokenize(all_words)

tt_freq = {}
for word in tt_tokens:
    try:
        tt_freq[word] += 1
    except: 
        tt_freq[word] = 1

Using this method, the dictionary has 63426 entries. Most of those are going to be single-entry items or named entities, but I do think it’s worth looking at them, as well as the high-frequency words that may not be a part of established stopword lists: I think it will be important to note those words which are specifically common to TED Talks.

I converted the dictionary to a list of tuples in order to be able to sort — I see that there is a way to sort a dictionary in Python, but this is a way I know. Looking at the most common words, I see NLTK didn’t get rid of punctuation: I cleared this up by removing punctuation earlier in the process, keeping the contractions (words with apostrophes), which the NLTK does not respect.

N.B. I tried doing this simply with a regex that split on white space, but I am still seeing contractions split into different words.

tt_freq_list = [(val, key) for key, val in tt_freq.items()]
tt_freq_list.sort(reverse=True)
tt_freq_list[0:20]

[(210294, 'the'),
 (151163, 'and'),
 (126887, 'to'),
 (116155, 'of'),
 (106547, 'a'),
 (96375, 'that'),
 (83740, 'i'),
 (78986, 'in'),
 (75643, 'it'),
 (71766, 'you'),
 (68573, 'we'),
 (65295, 'is'),
 (56535, "'s"),
 (49889, 'this'),
 (37525, 'so'),
 (33424, 'they'),
 (32231, 'was'),
 (30067, 'for'),
 (28869, 'are'),
 (28245, 'have')]

Keeping the apostrophes proved to be harder than I thought — and I tried going a “pure Python” route and splitting only on white spaces, trying both of the following:

word_list = re.split('\s+', all_words)
word_list = all_words.split()

I still got (56535, "'s"). (The good news is that the counts match.)

Okay, good news. The NLTK white space tokenizer works:

from nltk.tokenize import WhitespaceTokenizer
white_words = WhitespaceTokenizer().tokenize(all_words)

I tried using scikit-learn’s CountVectorizer, but it requires a list of strings, not one string, and it does not like that some of the texts are floats. So we’ll save dealing with that for when it comes time to look at this corpus as a corpus and not as one giant collection of words.

from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()
word_counts = count_vect.fit_transform(talks)

ValueError: np.nan is an invalid document, expected byte or unicode string.
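
For when we do come back to it, here is a hedged sketch of one way around the problem, on the assumption that the floats are NaNs standing in for missing transcripts: drop them (or stringify them) before handing the list to the vectorizer.

# Hypothetical fix: drop the missing transcripts and make sure everything
# handed to the vectorizer is a string.
clean_talks = df.text.dropna().astype(str).tolist()

count_vect = CountVectorizer()
word_counts = count_vect.fit_transform(clean_talks)
print(word_counts.shape)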

The final, working, script of the day produces the output we want:

# Tokenize on whitespace
from nltk.tokenize import WhitespaceTokenizer
tt_tokens = WhitespaceTokenizer().tokenize(all_words)

# Build a dictionary of words and their frequency in the corpus
tt_freq = {}
for word in tt_tokens:
    try:
        tt_freq[word] += 1
    except: 
        tt_freq[word] = 1

# Build a list of tuples, sort, and see some results 
tt_freq_list = [(val, key) for key, val in tt_freq.items()]
tt_freq_list.sort(reverse=True)
tt_freq_list[0:20]

Top 10 Python libraries of 2016

Tryo Labs is continuing its tradition of retrospectives about the best Python libraries for the past year. This year, it seems, it’s all about serverless architectures and, of course, AI/ML. A lot of cool stuff happening in the latter space. Check out this year’s retrospective and also the discussion on Reddit. (And here’s a link to Tryo’s 2015 retrospective for those curious.)

Flowingdata has a list of their own: Best Data Visualization Projects of 2016. If you haven’t seen the one about the evolution of bacteria that is a “live” visualization conducted on a giant petri dish, check it out.

Building a Corpus-Specific Stopword List

How do you go about finding the words that occur in all the texts of a collection or in some percentage of texts? A Safari Oriole lesson I took in recently did the following, using two texts as the basis for the comparison:

from pybloom import BloomFilter

bf = BloomFilter(capacity = 1000, error_rate = 0.001)
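
# text1_words and text2_words are assumed to be lists of tokens
# drawn from the two texts being compared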

for word in text1_words:
    bf.add(word)

intersect = set([])

for word in text2_words:
    if word in bf:
        intersect.add(word)

print(intersect)
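
The Bloom filter handles a pair of texts; for a whole corpus, a plain set intersection (or a document-frequency count) gets at the same question. A minimal sketch, not the Oriole lesson's method, assuming corpus_tokens is a list of token lists, one per text:

from collections import Counter

token_sets = [set(tokens) for tokens in corpus_tokens]

in_all = set.intersection(*token_sets)      # words that occur in every text

doc_freq = Counter(word for s in token_sets for word in s)
threshold = 0.5 * len(token_sets)           # "some percentage" = half, here
in_half_or_more = [w for w, df in doc_freq.items() if df >= threshold]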

UPDATE: I’m working on getting Markdown and syntax highlighting working. I’m running into difficulties with my beloved Markdown Extra plug-in, indicating I may need to switch to the Jetpack version. (I’ve switched before but not been satisfied with the results.)