Turning Words into Numbers

As Katherine Kinnaird and I continue our work on the Tedtalks, we have found ourselves drawn to examine more closely the notion of topics, which we both feel have been underexamined in their usage in the humanities.

Most humanists use an implementation of LDA, which we will probably also use, simply to stay in parallel, but at some point in our work, frustrated with my inability to get LDA to work within Python, I picked up Alan Riddell’s DARIAH tutorial and drafted an implementation of NMF topic modeling for our corpus. One advantage I noticed right away, in comparing the results to earlier work I had done with Jonathan Goodwin, was what seemed like a much more stable set of word clusters in the algorithmically derived topics.

Okay, good. But Kinnaird noticed that stopwords kept creeping into the topics, which raised larger questions about how NMF does what it does. That meant, because she’s so thorough, backing up a bit and making sure we understand how NMF works.

What follows is an experiment to understand the shape and nature of the tf matrix, the tfidf matrix, and the output of the sklearn NMF algorithm. Some of this is driven by the following essays:

To start our adventure, we needed a small set of texts with sufficient overlap that we could later successfully derive topics from them. I set myself the task of creating ten sentences, each of approximately ten words. Careful readers who take the time to read the sentences themselves will, I hope, forgive me for the texts being rather reflexive in nature, but it did seem appropriate given the overall reflexive nature of this task.

# =-=-=-=-=-=-=-=-=-=-=
# The Toy Corpus
# =-=-=-=-=-=-=-=-=-=-= 

sentences = ["Each of these sentences consists of about ten words.",
             "Ten sentence stories and ten word stories were once popular.",
             "Limiting the vocabulary to ten words is difficult.",
             "It is quite difficult to create sentences of just ten words",
             "I need, in fact, variety in the words used.",
             "With all these texts being about texts, there will be few topics.",
             "But I do not want too much variety in the vocabulary.",
             "I want to keep the total vocabulary fairly small.",
             "With a small vocabulary comes a small matrix.",
             "The smaller the matrix the more we will be able to see how things work."]


# =-=-=-=-=-=-=-=-=-=-=
# The Stopwords for this corpus
# =-=-=-=-=-=-=-=-=-=-= 

stopwords = ["a", "about", "all", "and", "be", "being", "but", "do", "each", "few", 
             "how", "i", "in", "is", "it", "more", "much", "not", "of", "once", "the", 
             "there", "these", "to", "too", "want", "we", "were", "will", "with"]

Each text is simply a sentence in a list of strings. Below the texts is the custom stopword list for this corpus. For those curious, there are 102 tokens in the corpus and 30 words in the stopword list. Once the stopwords are applied, 49 tokens remain, representing 31 unique words.
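
For anyone who wants to check those numbers, here is a quick sketch that reuses the sentences and stopwords lists above:

# Count raw whitespace-separated tokens across the ten sentences,
# and the number of entries in the stopword list.
print(sum(len(sentence.split()) for sentence in sentences))  # 102
print(len(stopwords))                                        # 30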

# =-=-=-=-=-=
# Clean & Tokenize
# =-=-=-=-=-=

import re
from nltk.tokenize import WhitespaceTokenizer

tokenizer = WhitespaceTokenizer()
# stopwords = re.split('\s+', open('../data/tt_stop.txt', 'r').read().lower())

# Loop to tokenize, stop, and stem (if needed) texts.
tokenized = []
for sentence in sentences:   
    raw = re.sub(r"[^\w\d'\s]+",'', sentence).lower()
    tokens = tokenizer.tokenize(raw)
    stopped_tokens = [word for word in tokens if word not in stopwords]
    tokenized.append(stopped_tokens)


# =-=-=-=-=-=-=-=-=-=-=
# Re-Assemble Texts as Strings from Lists of Words
# (because this is what sklearn expects)
# =-=-=-=-=-=-=-=-=-=-= 

texts = []
for item in tokenized:
    the_string = ' '.join(item)
    texts.append(the_string)
for text in texts:
    print(text)
sentences consists ten words
ten sentence stories ten word stories popular
limiting vocabulary ten words difficult
quite difficult create sentences just ten words
need fact variety words used
texts texts topics
variety vocabulary
keep total vocabulary fairly small
small vocabulary comes small matrix
smaller matrix able see things work
all_words = ' '.join(texts).split()
print("There are {} tokens representing {} words."
      .format(len(all_words), len(set(all_words))))
There are 49 tokens representing 31 words.

We will explore below the possibility of using the sklearn module’s built-in tokenization and stopword abilities, but while I continue to teach myself that functionality, we can move ahead with understanding the vectorization of a corpus.

There are a lot of ways to turn a series of words into a series of numbers. One of the principal ways ignores any individuated context for a particular word, as we might understand it within a given sentence, and simply considers a word in relationship to the other words in a text. That is, one way to turn words into numbers is simply to count the words in a text, reducing a text to what is known as a “bag of words.” (There’s a lot of linguistics and information science that validates this approach, but it will always chafe most humanists.)
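
Before turning to sklearn, the idea can be seen with nothing more than Python’s Counter: the “bag” keeps the counts and throws away the order. A small sketch, using the second of the cleaned texts above:

from collections import Counter

# A bag of words for the second cleaned text: counts survive, word order does not.
bag = Counter("ten sentence stories ten word stories popular".split())
print(bag)
# e.g. Counter({'ten': 2, 'stories': 2, 'sentence': 1, 'word': 1, 'popular': 1})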

If we run our corpus of ten sentences through the CountVectorizer, we will get a representation of it as a series of numbers, each representing the count of a particular word within a particular text:

# =-=-=-=-=-=-=-=-=-=-=
# TF
# =-=-=-=-=-=-=-=-=-=-= 
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

vec = CountVectorizer()
tf_data = vec.fit_transform(texts).toarray()
print(tf_data.shape)
print(tf_data)
(10, 31)
[[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 2 2 0 0 0 0 0 0 0 1 0 0]
 [0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 0]
 [0 0 0 1 1 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0]
 [0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 1 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0]
 [0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 1 0 0 0]
 [0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 1 0 0 0]
 [1 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1]]

The term frequency vectorizer in sklearn creates a set of words out of all the tokens, like we did above, then counts the number of times a given word occurs within a given text, returning that text as a vector. Thus, the second sentence above:

"Ten sentence stories and ten word stories were once popular." 

which we had tokenized and stopworded to become:

ten sentence stories ten word stories popular

becomes a list of numbers, or a vector, that looks like this:

0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 2 2 0 0 0 0 0 0 0 1 0 0

I chose the second sentence because it has two words that occur twice, ten and stories, so that it didn’t look like a line of binary. If you stack all ten texts on top of each other, you get a matrix of 10 rows, each row a text, and 31 columns, each column one of the important, lexical words.

Based on the location of the two twos, my guess is that the CountVectorizer alphabetizes its list of words, which can also be considered the features of a text. A quick check of our set of words, sorted alphabetically, is our first step in confirmation. (It also reveals one of the great problems of working with words: “sentence” and “sentences,” as well as “word” and “words,” are treated separately, so where a human being would regard those as two lexical entries, the computer treats them as four. This is one argument for stemming, but stemming, so far as I have encountered it, is not only no panacea, it also creates other problems; see the quick sketch after the sorted list below.)

the_words = list(set(all_words))
the_words.sort()
print(the_words)
['able', 'comes', 'consists', 'create', 'difficult', 'fact', 'fairly', 'just', 'keep', 'limiting', 'matrix', 'need', 'popular', 'quite', 'see', 'sentence', 'sentences', 'small', 'smaller', 'stories', 'ten', 'texts', 'things', 'topics', 'total', 'used', 'variety', 'vocabulary', 'word', 'words', 'work']
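
As an aside on the stemming point above, a quick sketch with NLTK’s PorterStemmer shows the trade-off: “sentence” and “sentences” do collapse into a single entry, but the entries themselves are no longer words.

from nltk.stem import PorterStemmer

# Stemming collapses inflected forms, but the stems are not always words:
# e.g., sentence and sentences should both become "sentenc", stories "stori".
stemmer = PorterStemmer()
for word in ['sentence', 'sentences', 'stories', 'story', 'vocabulary']:
    print(word, '->', stemmer.stem(word))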

We can actually get that same list from the vectorizer itself with the get_feature_names method:

features = vec.get_feature_names()
print(features)
['able', 'comes', 'consists', 'create', 'difficult', 'fact', 'fairly', 'just', 'keep', 'limiting', 'matrix', 'need', 'popular', 'quite', 'see', 'sentence', 'sentences', 'small', 'smaller', 'stories', 'ten', 'texts', 'things', 'topics', 'total', 'used', 'variety', 'vocabulary', 'word', 'words', 'work']

We can also inspect the vocabulary_ attribute, which reveals that sklearn stores this information as a dictionary with the term as the key and that term’s column index in the matrix as the value. (Despite the look of it, these numbers are positions, not counts.)

vocab_indices = vec.vocabulary_
print(vocab_indices)
{'comes': 1, 'difficult': 4, 'need': 11, 'matrix': 10, 'vocabulary': 27, 'just': 7, 'see': 14, 'quite': 13, 'smaller': 18, 'consists': 2, 'texts': 21, 'variety': 26, 'sentence': 15, 'total': 24, 'popular': 12, 'create': 3, 'work': 30, 'topics': 23, 'word': 28, 'limiting': 9, 'words': 29, 'ten': 20, 'able': 0, 'keep': 8, 'sentences': 16, 'fairly': 6, 'stories': 19, 'things': 22, 'used': 25, 'fact': 5, 'small': 17}
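
If what we want are actual corpus-wide counts, one way to get them is to sum the columns of the term-frequency matrix and pair the sums with the feature names from above:

# Column sums of the count matrix give the number of times each term
# occurs across the whole corpus.
term_counts = dict(zip(features, tf_data.sum(axis=0)))
print(term_counts['ten'], term_counts['vocabulary'])  # 5 and 4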

It’s also worth pointing out that we can get a count of particular terms within our corpus by feeding the CountVectorizer a vocabulary argument. Here I’ve prepopulated a list with three of our terms — “sentence”, “stories”, and “vocabulary” — and the function returns an array which counts only the occurrence of those three terms across all ten texts:

# =-=-=-=-=-=-=-=-=-=-=
# Controlled Vocabulary Count
# =-=-=-=-=-=-=-=-=-=-= 

tags = ['sentence', 'stories', 'vocabulary']
cv = CountVectorizer(vocabulary=tags)
data = cv.fit_transform(texts).toarray()
print(data)
[[0 0 0]
 [1 2 0]
 [0 0 1]
 [0 0 0]
 [0 0 0]
 [0 0 0]
 [0 0 1]
 [0 0 1]
 [0 0 1]
 [0 0 0]]

So far we’ve been trafficking in raw counts, or occurrences, of a word (aka term, aka feature) in our corpus. Chances are that longer texts, which simply have more words, will have more occurrences of any given word, which means those texts may come to be overweighted if we rely on occurrences alone. Fortunately, we can normalize by the length of a text to get a value that lets us compare how often a word is used relative to the size of its text across all the texts in a corpus. That is, we can get a term’s frequency.

As I was working on this bit of code, I learned that sklearn stores this information in a compressed sparse row matrix, in which a series of (text, term) coordinates is followed by a value. I have captured the first two texts below. (Note the commented-out toarray method in the second-to-last line. It’s there so often in sklearn code that I had come to take it for granted.)

from sklearn.feature_extraction.text import TfidfTransformer

tf_transformer = TfidfTransformer(use_idf=False).fit(tf_data)
words_tf = tf_transformer.transform(tf_data)#.toarray()
print(words_tf[0:2])
  (0, 2)    0.5
  (0, 16)   0.5
  (0, 20)   0.5
  (0, 29)   0.5
  (1, 12)   0.301511344578
  (1, 15)   0.301511344578
  (1, 19)   0.603022689156
  (1, 20)   0.603022689156
  (1, 28)   0.301511344578

And here’s that same information represented as an array:

words_tf_array = words_tf.toarray()
print(words_tf_array[0:2])
[[ 0.          0.          0.5         0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.5         0.          0.          0.          0.5
   0.          0.          0.          0.          0.          0.          0.
   0.          0.5         0.        ]
 [ 0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.30151134
   0.          0.          0.30151134  0.          0.          0.
   0.60302269  0.60302269  0.          0.          0.          0.          0.
   0.          0.          0.30151134  0.          0.        ]]
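
One thing worth noting about those numbers: with use_idf=False, the TfidfTransformer still applies its default norm='l2', so “normalizing by length” here means dividing each row of counts by that row’s Euclidean norm rather than by the number of words in the text. A quick check, reusing the tf_data array from above:

import numpy as np

# Each row of counts divided by its Euclidean (L2) norm reproduces the
# transformer's output; compare with words_tf_array[1] above.
row = tf_data[1]
print(row / np.linalg.norm(row))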

Finally, we can also weight words within a document against the number of documents in the corpus in which they appear, thus lowering the value of common words.

# =-=-=-=-=-=-=-=-=-=-=
# TFIDF
# =-=-=-=-=-=-=-=-=-=-= 

tfidf = TfidfVectorizer()
tfidf_data = tfidf.fit_transform(texts)#.toarray()
print(tfidf_data.shape)
print(tfidf_data[1]) # values for second sentence
(10, 31)
  (0, 12)   0.338083066465
  (0, 28)   0.338083066465
  (0, 19)   0.67616613293
  (0, 15)   0.338083066465
  (0, 20)   0.447100526936

And now, again, in the more common form of an array:

tfidf_array = tfidf_data.toarray()
print(tfidf_array[1]) # values for second sentence
[ 0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.33808307
  0.          0.          0.33808307  0.          0.          0.
  0.67616613  0.44710053  0.          0.          0.          0.          0.
  0.          0.          0.33808307  0.          0.        ]
#tfidf_recall = tfidf_data.get_feature_names() # Not working: get_feature_names
# is a method of the vectorizer (tfidf), not of the transformed matrix.
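
Since the feature names live on the vectorizer, we can pair them with the weights for the second sentence. Doing so also shows why the common word “ten,” which appears in four of the ten texts, ends up with a lower weight (0.447) than the rarer “stories” (0.676) even though both occur twice in that sentence: the idf part of the weighting discounts terms that show up in many documents.

# The feature names come from the vectorizer (tfidf), not from tfidf_data.
for name, weight in zip(tfidf.get_feature_names(), tfidf_array[1]):
    if weight > 0:
        print(name, round(weight, 3))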

Staying within the sklearn ecosystem

What if we do all tokenization and normalization in sklearn?

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# This is the bog-standard version from the documentation
# test_vec = CountVectorizer(input=u'content', 
#                            encoding=u'utf-8', 
#                            decode_error=u'strict', 
#                            strip_accents=None, 
#                            lowercase=True, 
#                            preprocessor=None, 
#                            tokenizer=None, 
#                            stop_words=stopwords, 
#                            token_pattern=u'(?u)\b\w\w+\b', 
#                            ngram_range=(1, 1), 
#                            analyzer=u'word', 
#                            max_df=1.0, 
#                            min_df=1, 
#                            max_features=None, 
#                            vocabulary=None, 
#                            binary=False, 
#                            dtype=<type 'numpy.int64'>)
test_vec = CountVectorizer(lowercase = True, 
                           stop_words = stopwords, 
                           token_pattern = r'(?u)\b\w\w+\b',  # note the raw string
                           ngram_range = (1, 1), 
                           analyzer = u'word')

# My first pass copied the token pattern from the documentation without the r prefix,
# so Python read each \b as a backspace character, the pattern matched nothing, and
# fit_transform raised ValueError: empty vocabulary. With the raw string, it works:
test_data = test_vec.fit_transform(texts).toarray()
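
With the vectorization sorted out, here is a minimal sketch of the NMF step the opening paragraphs point toward, using the tfidf matrix from earlier; the choice of three topics is arbitrary for this toy corpus, and working out what the two factor matrices mean is the subject of the larger experiment.

from sklearn.decomposition import NMF

n_topics = 3  # an arbitrary choice for this toy corpus
nmf = NMF(n_components=n_topics, random_state=1)
doc_topic = nmf.fit_transform(tfidf_array)   # 10 texts by 3 topics
topic_term = nmf.components_                 # 3 topics by 31 features

# The five most heavily weighted words in each topic.
feature_names = tfidf.get_feature_names()
for topic_idx, topic in enumerate(topic_term):
    top_words = [feature_names[i] for i in topic.argsort()[::-1][:5]]
    print("Topic {}: {}".format(topic_idx, " ".join(top_words)))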

MacPorts Weirdness: NLTK for Python 3 Depends on Python 2

File under: an NLTK dependency depends on Python 2, but there is a workaround to keep everything Python 3.

This is probably one of those things that leads my colleague [Jonathan Goodwin][] to roll his eyes when treading in Pythonic waters: while re-installing MacPorts after upgrading to Mac OS X El Capitan, I was going through [my Python roll call][] — `numpy`, `scipy`, `nltk`, `pandas`, etc. — when I noticed that `py34-nltk` was installing Python 2. Here’s what I saw scroll by:

---> Installing python2_select @0.0_1
---> Activating python2_select @0.0_1
---> Cleaning python2_select
---> Fetching archive for python27

That didn’t seem right, so I looked into the list of dependencies (which is a long list but I’ll repeat it here):

Dependencies to be installed: py34-matplotlib freetype libpng pkgconfig
py34-cairo cairo fontconfig glib2 libpixman xorg-libXext autoconf automake
libtool xorg-libX11 xorg-bigreqsproto xorg-inputproto xorg-kbproto xorg-libXau
xorg-xproto xorg-libXdmcp xorg-libxcb python27 db48 python2_select
xorg-libpthread-stubs xorg-xcb-proto libxml2 xorg-util-macros xorg-xcmiscproto
xorg-xextproto xorg-xf86bigfontproto xorg-xtrans xorg-xcb-util xrender
xorg-renderproto py34-cycler py34-six py34-dateutil py34-tz py34-parsing
py34-pyobjc-cocoa py34-pyobjc py34-py2app py34-macholib py34-modulegraph
py34-altgraph py34-tkinter tk Xft2 tcl xorg-libXScrnSaver xorg-scrnsaverproto
py34-tornado py34-backports_abc py34-certifi qhull cmake curl curl-ca-bundle
perl5 perl5.16 gdbm libarchive lzo2 py34-yaml libyaml

Buried in there are:

xorg-libxcb python27 db48 python2_select

I submitted this as a [bug at MacPorts][], and I got the following really interesting reply:

> py34-tkinter which depends on tk which depends on Xft2 which depends on xrender which depends on xorg-libX11 which depends on xorg-libxcb which depends on xorg-xcb-proto which depends on python27 which depends on python2_select. This is not a bug. If you want xorg-xcb-proto to use python34 instead, install it with its +python34 variant:

sudo port install xorg-xcb-proto +python34

> More generally, if you always want to use a +python34 in any port, if available, put “+python34” into your variants.conf file.
>
> Not all ports that use python offer a +python34 variant. If you find one that doesn’t, you can request one be added by filing a ticket.

Thanks, ryandesign.

[Jonathan Goodwin]: http://jgoodwin.net
[my Python roll call]: http://johnlaudun.org/20121230-macports-for-nltk/
[bug at MacPorts]: https://trac.macports.org/ticket/49970

Really, Too Easy

IPython Notebook and the Python Natural Language Toolkit are, I think, spoiling me. Not only does the IPython notebook make it easy to write code and to make notes about writing it (which helps a noob like me document his many, many mistakes), but when I need to download something for the NLTK, up pops a GUI window that makes it easy to select what to install.

NLTK and Stopwords

I spent some time this morning playing with various features of the Python NLTK, trying to think about how much, if any, of it I wanted to use with my freshmen. (More on this in a moment.) I loaded in a short story text that we have read and was running it through various functions that the NLTK makes possible when I ran into a hiccup:

>>> text.collocations()
Building collocations list
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/text.py", line 341, in collocations
    ignored_words = stopwords.words('english')
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/corpus/util.py", line 68, in __getattr__
    self.__load()
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/corpus/util.py", line 56, in __load
    except LookupError: raise e
LookupError: 
**********************************************************************
  Resource 'corpora/stopwords' not found.  Please use the NLTK
  Downloader to obtain the resource: >>> nltk.download().
  Searched in:
    - '/usr/share/nltk'
    - '/Users/john/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************

Now, the nice thing is that all you have to do is follow the directions, entering nltk.download() in the IDLE prompt, and you get:

showing info http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml

which provides the following window:

[Screenshot: the NLTK Downloader window]

Clicking on the Corpora tab and scrolling down allows you to download the stopword list:

[Screenshot: the Corpora tab in the NLTK Downloader]

What I have not yet figured out is how to specify your own stopword list. Part of what I want to teach any of my students is that choosing which words are important and which are not is a matter of subject matter expertise, and thus something they should not turn over to someone else to do.
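
For what it’s worth, one way to work with your own stopword list is simply to keep it in a plain text file and filter the tokens yourself before handing them to NLTK. A small sketch, with a hypothetical file name and with raw_text standing in for the story loaded above:

import nltk

# my_stopwords.txt is a hypothetical file: one stopword per line.
my_stopwords = set(open('my_stopwords.txt').read().split())

# raw_text stands in for the short story loaded earlier.
tokens = nltk.word_tokenize(raw_text)
filtered = [word for word in tokens if word.lower() not in my_stopwords]
text = nltk.Text(filtered)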

Joy = MacPorts (Python + Numpy + Scipy + Matplotlib + NLTK)

This is the TL;DR version of my previous post.

After installing [MacPorts](http://www.macports.org/install.php) via the package installer, open a terminal session and enter the following:

% sudo port selfupdate
% sudo port install python27
% sudo port install py27-numpy
% sudo port install py27-scipy
% sudo port install py27-matplotlib
% sudo port install py27-nltk
% sudo port install python_select
% sudo port select --set python python27

By the way, once I did all this, I was able to run a Python script that relied on `matplotlib`. *Sweet.*