Top 10 Python libraries of 2016

Tryolabs is continuing its tradition of retrospectives about the best Python libraries for the past year. This year, it seems, it’s all about serverless architectures and, of course, AI/ML. A lot of cool stuff is happening in the latter space. Check out this year’s retrospective and also the discussion on Reddit. (And here’s a link to Tryolabs’ 2015 retrospective for those curious.)

Flowingdata has a list of their own: Best Data Visualization Projects of 2016. If you haven’t seen the one about the evolution of bacteria, a “live” visualization conducted on a giant petri dish, check it out.

Building a Corpus-Specific Stopword List

How do you go about finding the words that occur in all the texts of a collection or in some percentage of texts? A Safari Oriole lesson I took in recently did the following, using two texts as the basis for the comparison:

from pybloom import BloomFilter

bf = BloomFilter(capacity=1000, error_rate=0.001)

# Add every word from the first text to the filter
for word in text1_words:

# Collect the words from the second text that also appear in the first
intersect = set([])
for word in text2_words:
    if word in bf:
        intersect.add(word)

UPDATE: I’m working on getting Markdown and syntax highlighting working. I’m running into difficulties with my beloved Markdown Extra plug-in, indicating I may need to switch to the Jetpack version. (I’ve switched before but not been satisfied with the results.)

Towards an Open Notebook Built on Python

As noted earlier, I am very taken with the idea of moving to an open notebook system: it goes well with my interest in keeping my research accessible not only to myself but also to others. Towards that end, I am in the midst of moving my notes and web captures out of Evernote and into DevonThink — a move made easier by a script that automates the process. I am still not a fan of DT’s UI, but its functionality cannot be denied or ignored. It quite literally does everything. This also means moving my reference library out of Papers, which I have had a love/hate relationship with for the past few years. (Much of this move is, in fact, prompted by the fact that I don’t quite trust the program after various moments of failure. I cannot deny that some of the failings might be of my own making, but, then again, this move is an attempt to foolproof my systems against the fail/fool point at the center of it all: me.)

Caleb McDaniel’s system is based on Gitit, which itself relies on Pandoc to do much of the heavy lifting. In his system, bibtex entries appear at the top of a note document and are, as I understand it, compiled as needed into larger, comprehensive bibtex lists. To get the bibtex entry at the top of the page into HTML for the wiki, McDaniel uses an OCaml library.

Why not, I wondered as I read McDaniel, attempt to keep as much of the workflow as possible within a single language? Since Python is my language of choice — mostly because I am too time- and mind-poor to attempt to master anything else — I decided to make the attempt in Python. As luck would have it, there is a bibtex2html module available for Python.

Now, whether the rest of the system is built on Sphinx or with MkDocs is the next matter — as is figuring out how to write a script that chains these things together so that I can approach the fluidity and assuredness of McDaniel.

I will update this post as I go. (Please note that this post will stay focused on the mechanics of such a system.)

Listing Python Modules

Sometimes you need to know which Python modules you already have installed. The easiest way to get a list is, from within a Python shell:

help('modules')

This will give you a list of installed modules, typically as a series of columns. All you have are names, not version numbers. If you need to know the version number of a particular package, import it and check its __version__ attribute:

import matplotlib
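If you want names and versions for everything at once, pip’s own machinery can report them; a sketch using pkg_resources (which ships with setuptools, so no extra install is assumed):

```python
import pkg_resources  # part of setuptools

# working_set holds every installed distribution pip knows about
installed = sorted(
    (dist.project_name, dist.version) for dist in pkg_resources.working_set
)
for name, version in installed:
    print(name, version)
```

Running `pip list` from the shell prints essentially the same table.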

Python Site Generators

I have never been particularly impressed with Moodle, the learning management system used by my university and a number of other organizations. Its every impulse, it seems to me, is to increase the number of steps to get simple things done, I suppose to simplify more complex things for users with less tech savvy. Using markdown, for example, is painful and there’s no way to control the presentation of materials unless you resort to one of its myriad of under-explained, and probably under-thought, content packaging options. (I’ve never grokked the Moodle “book”, for example.)

To be honest, there are times when I feel the same way about WordPress, which has gotten GUIer and less sharp on a number of fronts — why oh why are categories and tags now unlimited in application?

I’m also less than clear on my university’s approach to intellectual property: they seem rather keen to claim everything and anything as their own, when they can’t even be bothered to give you basic production tools. (Hello? It’s been three years since I had access to a printer that didn’t involve me copying files to a flash drive and walking down stairs to load things onto a Windows machine that can only ever print PDFs.)

I decided I would give static site generation a try, particularly if I could compose in markdown, ReST, or even a Jupyter notebook (as a few of the generators appear to promise). I’m not interested in using this for blogging, and I will probably maintain it on a subdirectory of my own site, e.g. /teaching, and I hope to be able to sync between local and remote versions using Git. That seems straightforward, doesn’t it? (I’m also now thinking that I will stuff everything into the same directory and just have different pages, and subpages, for each course. Just hang everything out there for all to see.)

As for the site generators themselves, there are a number of options:

  • Pelican is a popular one, but seems very blog oriented.
  • I’ve installed both Pelican and Nikola, and I ran the latter this morning and was somewhat overwhelmed by the number of directories it generated right away.
  • Cactus seems compelling, and has a build available for the Mac.
  • There is also Hyde.
  • I’m going to ignore blogofile for now, but it’s there and its development is active.
  • If all else fails, I have used Poole before. It doesn’t have a templating system or JavaScript or any of that, but maybe it’s better for it.

More on Normalizing Sentiment Distributions

Mehrdad Yazdani pointed out that some of my problems in normalization may have been the result of not having the right pieces in place, and so suggested some changes to the script. The result would seem to suggest that the two distributions are now comparable in scale — as well as on the same x-axis. (My Python-fu is not strong enough, yet, for me to determine how this error crept in.)

Mehrdaded Sentimental Outputs

Raw Sentiment normalized with np.max(np.abs(a_list))

When I run these results through my averaging function, however, I get significant vertical compression:

Averaged Mehrdaded Sentiments

Averaged Sentiment normalized with np.max(np.abs(a_list))

If I substitute np.linalg.norm(a_list) for np.max(np.abs(a_list)) in the script, I get the following results:
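The difference between the two scalings is easy to see on a small array; a sketch in numpy with made-up values (the name a_list follows the post, but nothing else here is from the original script):

```python
import numpy as np

a_list = np.array([2.0, -4.0, 1.0, 3.0])

# Scale by the largest absolute value: results fall in [-1, 1]
by_max = a_list / np.max(np.abs(a_list))
print(by_max)  # → [ 0.5  -1.    0.25  0.75]

# Scale by the L2 (Euclidean) norm: the vector's overall length becomes 1,
# so individual values shrink as the series gets longer
by_norm = a_list / np.linalg.norm(a_list)
print(by_norm)
```

That difference in denominator is one plausible source of the vertical compression noted above: the L2 norm grows with the number of sentences, while the max-abs value does not.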

Raw Sentiment Normalized with numpy.linalg.norm


Averaged Sentiment Normalized with numpy.linalg.norm


A Tale of Two Sentimental Signatures

I’m still working my way through the code that will, I hope, make it possible to compare effectively different sentiment modules in Python. While the code is available on GitHub, I wanted to post some of the early outcomes here, publishing my failure, as it were.

I began with the raw sentiments, which are not very interesting, since the different modules use different ranges: quite wide for Afinn, -1 to 1 for TextBlob, and 0 to 1 for Indico.

Raw Sentiments: Afinn, Textblob, Indico


To make them more comparable, I needed to normalize them, and to make the whole of it more digestible, I needed to average them. I began with normalizing the values, and you can already see there’s a divergence in the baseline for which I cannot yet account in my code:

Normalized Sentiment: Afinn and TextBlob


To be honest, I didn’t really notice this until I plotted the average, where the divergence becomes really apparent:

Average, Normalized Sentiments: Afinn and TextBlob



More Sentiment Comparisons

I added two kinds of moving averages to the script, and as you can see from the results below, whether you go with the numpy version of the running average or the Technical Analysis library (talib) version, you get the same results; the only difference is that numpy starts its running average at the beginning of the window while talib starts at the end. Here, the window was 10% of the total sentence count, which was approximately 700 overall. I entered the following in Python:

my_file = "/Users/john/Code/texts/sentiment/mdg.txt"
smooth_plots(my_file, 70)

And here is the graph:

Moving/Running Averages


The entire script is available on GitHub.
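The numpy version of the running average can be had from np.convolve; a sketch with a toy series in place of real sentiment values (talib.SMA instead pads the first window-1 positions with NaN, which is why its line starts at the end of the window):

```python
import numpy as np

values = np.arange(10, dtype=float)  # stand-in for a sentiment series
window = 4

# 'valid' mode only emits an average once a full window is available,
# so the result is len(values) - window + 1 points long
running = np.convolve(values, np.ones(window) / window, mode='valid')
print(running)  # → [1.5 2.5 3.5 4.5 5.5 6.5 7.5]
```

With a window of 70 on a ~700-sentence text, as in the call above, the smoothed series is correspondingly 69 points shorter than the raw one.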


Comparing Sentiments


Following up on some previous explorations, I was curious about the relationship between the various sentiment libraries available in Python. The code below will let you compare a text for yourself. The current list of three (Afinn, TextBlob, and Indico) is not exhaustive; they are simply the three I used to draft out this bit of code, which is better than a lot of code I’ve written thus far but still probably quite crude to some.

#! /usr/bin/env python
# Imports
import matplotlib.pyplot as plt
import seaborn # for more appealing plots
from nltk import tokenize

# Customizations
plt.rcParams['figure.figsize'] = 12, 8

import math
import re
import sys


def afinn_sentiment(filename):
    from afinn import Afinn
    afinn = Afinn()
    with open(filename, "r") as myfile:
        text ='\n', ' ')
    sentences = tokenize.sent_tokenize(text)
    sentiments = []
    for sentence in sentences:
        # Score each sentence and collect the results
    return sentiments

# TextBlob

def textblob_sentiment(filename):
    from textblob import TextBlob
    with open(filename, "r") as myfile:
        text ='\n', ' ')
    blob = TextBlob(text)
    textsentiments = []
    for sentence in blob.sentences:
        # Polarity ranges from -1 to 1
    return textsentiments

# Indico

def indico_sentiment(filename):
    import indicoio
    indicoio.config.api_key = 'yourkeyhere'
    with open(filename, "r") as myfile:
        text ='\n', ' ')
    sentences = tokenize.sent_tokenize(text)
    # The Indico API scores the whole list of sentences in one call
    indico_sent = indicoio.sentiment(sentences)
    return indico_sent

def plot_sentiments(filename):
    fig = plt.figure()
    plt.title("Comparison of Sentiment Libraries")
    plt.plot(afinn_sentiment(filename), label="Afinn")
    plt.plot(textblob_sentiment(filename), label="TextBlob")
    plt.plot(indico_sentiment(filename), label="Indico")
    plt.ylabel("Emotional Valence")
    plt.xlabel("Sentence #")
    plt.legend(loc='lower right')
    plt.annotate("Oral Legend LAU-14 Used", xy=(30, 2))

Once you’ve loaded this script, all you need to do is give it a file with which to work:

my_file = "/Users/john/Code/texts/sentiment/mdg.txt"
plot_sentiments(my_file)

Re-Installing Python

With any luck, the title of this post should be (will be) “Re-installing Python the Right Way.” The reason for this post is that while trying to install the, albeit experimental, iPython module that allows you to save iPython/Jupyter notebooks in a markdown format and not JSON, I was running into difficulties that seemed to be a function of the way MacPorts installs Jupyter, which was not allowing me to run jupyter from the command line. I.e., the only way I could get a Jupyter notebook was by using the deprecated ipython notebook.

I read around a bit, and it seems the preferred way to handle this is to use something like MacPorts, or Homebrew, for the base installation of Python and Pip and then to do everything from within pip.

Side note: since I plan on installing most of the packages into only my user space, and I am lazy and don’t want to type pip install --user every time, I made an alias and saved it in my `.bash_profile`:

alias pinstall='pip install --user'

I used vi but use whatever editor lets you access that hidden file and do what needs to get done. Once I was done, I executed the file so its settings were current:

. ~/.bash_profile

(Note the space between the dot and the tilde.)

Having done this, I set about re-loading all the usual modules on which I depend: numpy, scipy, etc. (I have a fuller list.) I was surprised by how quickly this process went, and I routinely checked to see how things were going by opening either a Python or iPython shell and importing a recently installed library.

At the end of the process, however, I still could not get jupyter at the command line. I tried a number of suggestions, but the only one that worked was to add the location of the jupyter executable to my PATH:
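The line I added was along these lines; this is a sketch rather than the exact line from my profile, since the precise path depends on your Python version and where pip put user-installed scripts (python -m site --user-base reports the user base directory):

```shell
# Append the user-level scripts directory (where pip install --user
# places executables such as jupyter) to the search path
export PATH="$PATH:$(python -m site --user-base)/bin"
```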


This strikes me as a real kludge, but it does work. After doing that, I could get jupyter notebook to work, and once I added the following to the Jupyter notebook config file I could open Markdown files as notebooks:

c.NotebookApp.contents_manager_class = 'ipymd.IPymdContentsManager'

So, the current state of ipymd appears to be that line numbers do work, but you can’t convert or save an extant notebook as an md-formatted notebook. You have to create a markdown document first, then open it in Jupyter. But once you’ve done that, you have the full functionality of Jupyter.

This is going to require a bit of legwork for the current project on which I am working, but I think it’s going to make my collaborator, who is not a convert (yet!) to Jupyter, a whole lot happier.

Getting ETE3 Running on a Mac

Unfortunately, ETE, a Python framework for the analysis and visualization of (phylogenetic) trees, is not currently available to install through MacPorts. The recommended way to install ETE on a Mac is through Anaconda or a Miniconda setup. I confess I was not familiar with the conda open source package management system, and I had not heard of Miniconda. I like Anaconda quite a lot, and I like it even more now that I know it’s part of a larger open source ecosystem, and it may be that one day I will switch over to it, but, right now, I am fairly happy with my MacPorts setup and I would rather not break what’s working.

To get a better sense of what I need to do, I clicked on the Linux native installation directions, which skip conda:

Install dependencies: python-qt4, python-lxml, python-six and python-numpy

Those don’t look too bad. Python Six, a compatibility library for Python 2 and 3, is available as py34-six. Done. As for the rest: py34-pyqt4, py34-lxml, and, of course, py34-numpy are already installed. Done, again.

It looks like pip will work here, so after making sure I have py34-pip installed and making sure I run sudo port select --set pip pip34 to set it, I can run:

sudo pip install ete3

If I run an IDLE session and import ete3, I get no error prompts. Yay! Time to make some spam.

Turning a Directory of Texts into a List of Strings

For almost everything I do in text analytics, I find myself with a directory of texts which, in most instances, need to be turned into a list of strings, with each text its own item in the list. Here’s my Python boilerplate:

import glob

file_list = glob.glob('../texts' + '/*.txt')

mytexts = []
for filename in file_list:
    with open(filename, 'r', encoding='utf-8') as f:
        mytexts.append('\n', ' '))

You can double-check your work by simply calling up any given text, using mytexts[1] (the “1” can be any index you want), remembering that Python starts counting at 0 and not 1, so your list of 12 texts, for example, will be indexed 0-11.

And if you need to mush all those texts back into a single string:

alltexts = ''.join(mytexts)
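A quick sanity check of the indexing and the join, with hypothetical strings standing in for real texts:

```python
# Three stand-in texts
mytexts = ["First text. ", "Second text. ", "Third text."]

# Indexing starts at 0, so the last of three items is mytexts[2]
print(mytexts[2])  # → Third text.

# Joining collapses the list back into a single string
alltexts = ''.join(mytexts)
print(alltexts)  # → First text. Second text. Third text.
```

Note that ''.join relies on the texts carrying their own trailing whitespace; if they don’t, joining on ' ' instead avoids words running together across text boundaries.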

I saw a post recently over on DataScience+ about the use of a Python library, bokeh, for creating interactive graphs. My first thought: You can do interactive plots in Python?