Getting Word Frequencies for 2000+ Texts

What I’ve been working on for the past few days is in preparation for attempting a topic model using the more established LDA instead of NMF, to see how well the two compare — with the understanding that, since there is rarely a one-to-one match between topics even within either method, there will be no such match across them.

Because LDA does not filter out common words on its own, the way the NMF method does, you have to start with a stoplist. I know we can begin with Blei’s and a few other established lists, but I would also like to be able to compare those against our own results. My first thought was to build a dictionary of words and their frequency within the corpus. For convenience’s sake, I am using the NLTK.

Just as a record of what I’ve done, here’s the usual code for loading the talks from the CSV with everything in it:

[code lang=python]
import pandas
import re

# Get all talks in a list & then into one string
colnames = ['author', 'title', 'date' , 'length', 'text']
df = pandas.read_csv('../data/talks-v1b.csv', names=colnames)
talks = df.text.tolist()
alltalks = " ".join(str(item) for item in talks) # Solves problem of floats in talks

# Clean out all punctuation except apostrophes
all_words = re.sub(r"[^\w\d'\s]+",'',alltalks).lower()
[/code]

We still need to identify which talks have floats for values and determine what impact, if any, they have on the project.
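
A quick way to flag the offending rows, when we get around to it, might be something like the following (a sketch, assuming the floats are really pandas NaN values standing in for missing text):

[code lang=python]
# Flag rows whose text did not come in as a string (NaN reads as a float)
bad_rows = df[df.text.apply(lambda value: not isinstance(value, str))]
print(len(bad_rows))
print(bad_rows[['author', 'title']])
[/code]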

[code lang=python]
import nltk

tt_tokens = nltk.word_tokenize(all_words)

tt_freq = {}
for word in tt_tokens:
    try:
        tt_freq[word] += 1
    except KeyError:
        tt_freq[word] = 1
[/code]

Using this method, the dictionary has 63,426 entries. Most of those are going to be single-occurrence words or named entities, but I do think they are worth looking at, as are the high-frequency words that may not be a part of established stopword lists: it will be important to note those words which are specifically common to TED Talks.
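
To get a sense of how much of the dictionary those one-off words take up, a quick check like the following would do (a sketch that assumes the tt_freq dictionary built above):

[code lang=python]
# Count the words that occur exactly once in the corpus
singletons = [word for word, count in tt_freq.items() if count == 1]
print(len(singletons), "words occur only once")
[/code]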

I converted the dictionary to a list of tuples in order to be able to sort it — I see that there is a way to sort a dictionary in Python, but this is the way I know. Looking at the most common words, I saw that NLTK hadn’t gotten rid of punctuation: I cleared this up by removing punctuation earlier in the process, keeping the contractions (words with apostrophes), which the NLTK tokenizer does not respect.

N.B. I tried doing this simply with a regex that split on white space, but I am still seeing contractions split into separate words.
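
For the record, the built-in route to sorting a dictionary by value that I mention above would look something like this (a sketch, not what I actually ran):

[code lang=python]
# Sort the dictionary's items by count, largest first
top_words = sorted(tt_freq.items(), key=lambda pair: pair[1], reverse=True)
print(top_words[:20])
[/code]

The tuple-list route I actually used, and the twenty most common words it turns up, follow: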

[code lang=python]
tt_freq_list.sort(reverse=True)
tt_freq_list[0:20]

[(210294, 'the'),
(151163, 'and'),
(126887, 'to'),
(116155, 'of'),
(106547, 'a'),
(96375, 'that'),
(83740, 'i'),
(78986, 'in'),
(75643, 'it'),
(71766, 'you'),
(68573, 'we'),
(65295, 'is'),
(56535, "'s"),
(49889, 'this'),
(37525, 'so'),
(33424, 'they'),
(32231, 'was'),
(30067, 'for'),
(28869, 'are'),
(28245, 'have')]
[/code]

Keeping the apostrophes proved to be harder than I thought — and I tried going a “pure Python” route and splitting only on white spaces, trying both of the following:

[code lang=python]
word_list = re.split(r'\s+', all_words)
word_list = all_words.split()
[/code]

I still got (56535, "'s") in the results. (The good news is that the counts match.)

Okay, good news. The NLTK white space tokenizer works:

[code lang=python]
from nltk.tokenize import WhitespaceTokenizer
white_words = WhitespaceTokenizer().tokenize(all_words)
[/code]
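
A quick sanity check that the whitespace tokenizer really does keep contractions whole might look like this (a sketch, assuming the white_words list from above):

[code lang=python]
# There should be no bare "'s" tokens, and contractions should survive intact
print(sum(1 for word in white_words if word == "'s"))
print(white_words.count("it's"))
[/code]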

I tried using scikit-learn’s CountVectorizer, but it requires a list of strings, not one string, and it does not like that some of the texts are floats. So we’ll save dealing with that for when it comes time to look at this corpus as a corpus and not as one giant collection of words.

[code lang=python]
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()
word_counts = count_vect.fit_transform(talks)

ValueError: np.nan is an invalid document, expected byte or unicode string.
[/code]
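
For when that time comes, one possible workaround (a sketch, assuming the offending values are NaNs in the text column and re-using df and count_vect from above) would be to drop the missing texts and coerce the rest to strings before vectorizing:

[code lang=python]
# Drop missing texts (they surface as NaN floats) and make sure the rest are strings
clean_talks = df.text.dropna().astype(str).tolist()
word_counts = count_vect.fit_transform(clean_talks)
print(word_counts.shape)
[/code]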

The final, working script of the day produces the output we want:

[code lang=python]
# Tokenize on whitespace
from nltk.tokenize import WhitespaceTokenizer
tt_tokens = WhitespaceTokenizer().tokenize(all_words)

# Build a dictionary of words and their frequency in the corpus
tt_freq = {}
for word in tt_tokens:
    try:
        tt_freq[word] += 1
    except KeyError:
        tt_freq[word] = 1

# Build a list of tuples, sort, and see some results
tt_freq_list = [(val, key) for key, val in tt_freq.items()]
tt_freq_list.sort(reverse=True)
tt_freq_list[0:20]
[/code]
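
Since the point of all this is to be able to compare our own high-frequency words against established stoplists, a first pass at that comparison might look like the following. This is a sketch that uses NLTK’s English stopword list as a stand-in for Blei’s or any other established list, and it assumes nltk.download('stopwords') has already been run:

[code lang=python]
from nltk.corpus import stopwords

# Words in our top 200 that an established stoplist would not catch
nltk_stops = set(stopwords.words('english'))
top_200 = [word for count, word in tt_freq_list[:200]]
ted_specific = [word for word in top_200 if word not in nltk_stops]
print(ted_specific)
[/code]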

MacBook Options in Early 2017

I realized at some point recently that when I teach and when I present at conferences, I am using my personal laptop, putting it at risk when my university should be providing me the proper equipment to do those things. Fortunately, I have a bit of money left over from my professorship, and so I looked into what my portability options are:

One consideration would be the 11-inch MacBook Air, now discontinued (and never given the love it deserved):

Amazon has one for $700: Apple MacBook Air MD711LL/B 11.6-Inch Laptop (1.4GHz Intel Core i5 Dual-Core up to 2.7GHz, 4GB RAM, 128GB SSD, Wi-Fi, Bluetooth 4.0) (Certified Refurbished).

Apple has one for $849: MacBook Air 11.6/1.6GHz/4GB/128GB Flash. March 2015.

Or one for $929: Refurbished 11.6-inch MacBook Air 1.6GHz Dual-core Intel Core i5. Originally released March 2015. 4GB of 1600MHz LPDDR3 onboard memory. 256GB PCIe-based flash storage. 720p FaceTime HD Camera. Intel HD Graphics 6000.

With that price, I thought I should look into something more readily affordable: the 9.7-inch iPad Pro Wi-Fi 32GB – Space Gray released in March 2016 lists for $579.

That’s not bad, but a colleague of mine recently ordered one, and I took one look at the size of the keyboard and thought: no. So that leaves the more expensive option, especially since my university won’t buy refurbished gear: the MacBook 12.0/1.1GHz Dual-Core Intel Core m3/8GB/256GB Flash. April 2016. $1,249. (There was a refurbished version on the website for not a lot less, $1,189, but it did have a 512GB SSD. Win some, lose some.)

Top 10 Python libraries of 2016

Tryo Labs is continuing its tradition of retrospectives about the best Python libraries for the past year. This year, it seems, it’s all about serverless architectures and, of course, AI/ML. A lot of cool stuff happening in the latter space. Check out this year’s retrospective and also the discussion on Reddit. (And here’s a link to Tryo’s 2015 retrospective for those curious.)

Flowingdata has a list of their own: Best Data Visualization Projects of 2016. If you haven’t seen the one about the evolution of bacteria, a “live” visualization conducted on a giant petri dish, check it out.

Expertise

Expertise matters. As Ezra Pound once noted at the beginning of the ABC of Reading, it’s a matter of having money in the bank. If I write you a check for a million dollars, that check is worthless. If Warren Buffett writes you a check for a million dollars, it’s worth exactly that, quite literally. If I tell you something about texts, it’s worth it. Buffett? Not so much.

  1. We can all stipulate: the expert isn’t always right.
  2. But an expert is far more likely to be right than you are. On a question of factual interpretation or evaluation, it shouldn’t engender insecurity or anxiety to think that an expert’s view is likely to be better-informed than yours. (Because, likely, it is.)
  3. Experts come in many flavors. Education enables it, but practitioners in a field acquire expertise through experience; usually the combination of the two is the mark of a true expert in a field. But if you have neither education nor experience, you might want to consider exactly what it is you’re bringing to the argument.
  4. In any discussion, you have a positive obligation to learn at least enough to make the conversation possible. The University of Google doesn’t count. Remember: having a strong opinion about something isn’t the same as knowing something.

Building a Corpus-Specific Stopword List

How do you go about finding the words that occur in all the texts of a collection or in some percentage of texts? A Safari Oriole lesson I took in recently did the following, using two texts as the basis for the comparison:

[code lang=python]
from pybloom import BloomFilter

bf = BloomFilter(capacity = 1000, error_rate = 0.001)

# text1_words and text2_words are assumed to be lists of tokens for the two texts
for word in text1_words:
    bf.add(word)

intersect = set([])

for word in text2_words:
    if word in bf:
        intersect.add(word)

print(intersect)
[/code]
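
To scale the same idea up from two texts to the whole collection, one route would be to compute document frequencies with plain Python sets and a Counter and keep whatever crosses a threshold. This is a sketch, not the Oriole’s method, and it assumes the talks list of strings loaded above:

[code lang=python]
from collections import Counter

# For each word, count the number of talks in which it appears
doc_freq = Counter()
documents = [str(talk).lower().split() for talk in talks]
for tokens in documents:
    doc_freq.update(set(tokens))

# Words that occur in at least 90% of the talks are stoplist candidates
threshold = 0.9 * len(documents)
candidates = [word for word, count in doc_freq.items() if count >= threshold]
print(sorted(candidates))
[/code]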

UPDATE: I’m working on getting Markdown and syntax highlighting working. I’m running into difficulties with my beloved Markdown Extra plug-in, indicating I may need to switch to the Jetpack version. (I’ve switched before but not been satisfied with the results.)

Towards an Open Notebook Built on Python

As noted earlier, I am very taken with the idea of moving to an open notebook system: it goes well with my interest in keeping my research accessible not only to myself but also to others. Towards that end, I am in the midst of moving my notes and web captures out of Evernote and into DevonThink — a move made easier by a script that automates the process. I am still not a fan of DT’s UI, but its functionality cannot be denied or ignored. It quite literally does everything. This also means moving my reference library out of Papers, which I have had a love/hate relationship with for the past few years. (Much of this move is, in fact, prompted by the fact that I don’t quite trust the program after various moments of failure. I cannot deny that some of the failings might be of my own making, but, then again, the move I am making is an attempt to foolproof my systems against the fail/fool point at the center of it all: me.)

Caleb McDaniel’s system is based on Gitit, which itself relies on Pandoc to do much of the heavy lifting. In his system, BibTeX entries appear at the top of a note document and are, as I understand it, compiled as needed into larger, comprehensive BibTeX lists. To get the BibTeX entry at the top of the page into HTML for the wiki, McDaniel uses an OCaml library.

Why not, I wondered as I read McDaniel, attempt to keep as much of the workflow as possible within a single language? Since Python is my language of choice — mostly because I am too time and mind poor to attempt to master anything else — I decided to make the attempt in Python. As luck would have it, there is a bibtex2html module available for Python: [bibtex2html](https://github.com/goliveira/bibtex2html).

Now, whether the rest of the system is built with Sphinx or with MkDocs is the next matter — as is figuring out how to write a script that chains these things together so that I can approach the fluidity and assuredness of McDaniel’s setup.

I will update this post as I go. (Please note that this post will stay focused on the mechanics of such a system.)

Namespaces, Scopes, Classes

I’m still a babe in the programming woods, so Shrutarshi Basu’s explanation of namespaces, scopes, and classes in Python was pretty useful. I can’t tell if I had simply read around enough beforehand for it finally to sink in, or if Basu wrote about it in a way that made it clear, seemingly for the first time.