Towards an Open Notebook Built on Python

As noted earlier, I am very taken with the idea of moving to an open notebook system: it goes well with my interest in keeping my research accessible not only to myself but also to others. Towards that end, I am in the midst of moving my notes and web captures out of Evernote and into DevonThink — a move made easier by a script that automates the process. I am still not a fan of DT’s UI, but its functionality cannot be denied or ignored: it quite literally does everything. This also means moving my reference library out of Papers, with which I have had a love/hate relationship for the past few years. (Much of this move is, in fact, prompted by the fact that I don’t quite trust the program after various moments of failure. I cannot deny that some of the failings might be of my own making, but, then again, this move is an attempt to foolproof my systems against the fail/fool point at the center of it all: me.)

Caleb McDaniel’s system is based on Gitit, which itself relies on Pandoc to do much of the heavy lifting. In his system, bibtex entries appear at the top of a note document and are, as I understand it, compiled as needed into larger, comprehensive bibtex lists. To get the bibtex entry at the top of the page into HTML for the wiki, McDaniel uses an OCaml library.

Why not, I wondered as I read McDaniel, attempt to keep as much of the workflow as possible within a single language? Since Python is my language of choice — mostly because I am too time and mind poor to attempt to master anything else — I decided to make the attempt in Python. As luck would have it, there is a bibtex2html module available for Python: [bibtex2html](https://github.com/goliveira/bibtex2html).

Now, whether the rest of the system is built on Sphinx or with MkDocs is the next matter — as is figuring out how to write a script that chains these things together so that I can approach the fluidity and assuredness of McDaniel.

I will update this post as I go. (Please note that this post will stay focused on the mechanics of such a system.)

Listing Python Modules

Sometimes you need to know which Python modules you have already installed. The easiest way to get a list is:

help('modules')

This will give you a list of installed modules, typically as a series of columns. All you have are names, not version numbers. If you need to know version numbers, then try:

import matplotlib
print(matplotlib.__version__)
    1.5.1
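If you want names and versions for everything at once, here is one route, sketched with the assumption that your Python is recent enough (3.8+) to ship importlib.metadata; on an older setup like the Python 3.4 one described below, `pip list` at the command line gives the same information.

```python
# List every installed distribution with its version, sorted by name.
# Requires Python 3.8+ for importlib.metadata.
from importlib import metadata

for dist in sorted(metadata.distributions(),
                   key=lambda d: (d.metadata["Name"] or "").lower()):
    print(dist.metadata["Name"], dist.version)
```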

Python Site Generators

I have never been particularly impressed with Moodle, the learning management system used by my university and a number of other organizations. Its every impulse, it seems to me, is to increase the number of steps it takes to get simple things done, I suppose in order to simplify more complex things for users with less tech savvy. Using markdown, for example, is painful, and there’s no way to control the presentation of materials unless you resort to one of its myriad under-explained, and probably under-thought, content packaging options. (I’ve never grokked the Moodle “book”, for example.)

To be honest, there are times when I feel the same way about WordPress, which has gotten GUIer and less sharp on a number of fronts — why oh why are categories and tags now unlimited in application?

I’m also less than clear on my university’s approach to intellectual property: they seem rather keen to claim everything and anything as their own, when they can’t even be bothered to give you basic production tools. (Hello? It’s been three years since I had access to a printer that didn’t involve me copying files to a flash drive and walking down stairs to load things onto a Windows machine that can only ever print PDFs.)

I decided I would give static site generation a try, particularly if I could compose in markdown, ReST, or even a Jupyter notebook (as a few of the generators appear to promise). I’m not interested in using this for blogging, and I will probably maintain it on a subdirectory of my own site, e.g. /teaching, and I hope to be able to sync between local and remote versions using Git. That seems straightforward, doesn’t it? (I’m also now thinking that I will stuff everything into the same directory and just have different pages, and subpages?, for each course. Just hang everything out there for all to see.)

As for the site generators themselves, there are a number of options:

  • Pelican is a popular one, but seems very blog oriented.
  • I’ve installed both Pelican and Nikola, and I ran the latter this morning and was somewhat overwhelmed by the number of directories it generated right away.
  • Cactus seems compelling, and has a build available for the Mac.
  • There is also Hyde.
  • I’m going to ignore blogofile for now, but it’s there and its development is active.
  • If all else fails, I have used Poole before. It doesn’t have a templating system or JavaScript or any of that, but maybe it’s better for it.

More on Normalizing Sentiment Distributions

Mehrdad Yazdani pointed out that some of my problems in normalization may have been the result of not having the right pieces in place, and so suggested some changes to the sentiments.py script. The result would seem to suggest that the two distributions are now comparable in scale — as well as on the same x-axis. (My Python-fu is not strong enough, yet, for me to determine how this error crept in.)

Raw Sentiment normalized with np.max(np.abs(a_list))

When I run these results through my averaging function, however, I get significant vertical compression:

Averaged Sentiment normalized with np.max(np.abs(a_list))

If I substitute np.linalg.norm(a_list) for np.max(np.abs(a_list)) in the script, I get the following results:

Raw Sentiment Normalized with numpy.linalg.norm

Averaged Sentiment Normalized with numpy.linalg.norm
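The difference between the two scalings can be seen in a small sketch; the scores below are invented, and only the two normalizing expressions are taken from the script:

```python
import numpy as np

def normalize_max(a_list):
    # divide by the largest absolute value: output always reaches 1 (or -1)
    a = np.asarray(a_list, dtype=float)
    return a / np.max(np.abs(a))

def normalize_norm(a_list):
    # divide by the Euclidean (L2) norm: for a long list the entries become
    # small, which would account for the vertical compression
    a = np.asarray(a_list, dtype=float)
    return a / np.linalg.norm(a)

scores = [3.0, -1.0, 0.0, 2.0, -2.0]
print(normalize_max(scores))
print(normalize_norm(scores))
```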

A Tale of Two Sentimental Signatures

I’m still working my way through the code that will, I hope, make it possible to compare different sentiment modules in Python effectively. While the code is available as a GitHub gist, I wanted to post some of the early outcomes here, publishing my failure, as it were.

I began with the raw sentiments, which are not very interesting, since the different modules use different ranges: quite wide for Afinn, -1 to 1 for TextBlob, and 0 to 1 for Indico.

Raw Sentiments: Afinn, Textblob, Indico

To make them more comparable, I needed to normalize them, and to make the whole of it more digestible, I needed to average them. I began with normalizing the values — see the gist linked at the end of this post — and you can already see there’s a divergence in the baseline for which I cannot yet account in my code:

Normalized Sentiment: Afinn and TextBlob

To be honest, I didn’t really notice this until I plotted the average, where the divergence becomes really apparent:

Average, Normalized Sentiments: Afinn and TextBlob

The code is available as a gist: https://gist.github.com/johnlaudun/5ea8234cc8d6f39b982648704c3824b0

More Sentiment Comparisons

I added two kinds of moving averages to the sentiments.py script, and as you can see from the results below, whether you use the numpy version of the running average or the one from the Technical Analysis library (talib), you get the same results: numpy starts its running average at the beginning of the window; talib at the end. Here, the window was 10% of the total sentence count, which was approximately 700 overall. I entered the following in Python:

my_file = "/Users/john/Code/texts/sentiment/mdg.txt"
smooth_plots(my_file, 70)

And here is the graph:

Moving/Running Averages
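For what it’s worth, the numpy side of that running average can be sketched as a convolution with a uniform window; the sine wave below is only a stand-in for the ~700 sentence scores:

```python
import numpy as np

def running_mean(values, window):
    # simple moving average via convolution; "valid" mode trims the edges,
    # so the smoothed series is window - 1 points shorter than the input
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")

sentiments = np.sin(np.linspace(0, 10, 700))  # stand-in for sentence scores
smoothed = running_mean(sentiments, 70)       # window = 10% of the series
print(len(smoothed))  # 631 points: 700 - 70 + 1
```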

The entire script is available as a GitHub gist.

Next step: NORMALIZATION!

Comparing Sentiments

Following up on some previous explorations, I was curious about the relationship between the various sentiment libraries available in Python. The code below will let you compare a text for yourself. The current list of three — Afinn, TextBlob, and Indico — is not exhaustive, but rather the three I used to draft out this bit of code, which is better than a lot of code I’ve written thus far but still probably quite crude to some.


#! /usr/bin/env python
# Imports
import matplotlib.pyplot as plt
import seaborn # for more appealing plots
from nltk import tokenize

# Customizations
seaborn.set_style("darkgrid")
plt.rcParams['figure.figsize'] = 12, 8


# AFINN

def afinn_sentiment(filename):
    from afinn import Afinn
    afinn = Afinn()
    with open(filename, "r") as myfile:
        text = myfile.read().replace('\n', ' ')
        sentences = tokenize.sent_tokenize(text)
        sentiments = []
        for sentence in sentences:
            sentsent = afinn.score(sentence)
            sentiments.append(sentsent)
        return sentiments


# TextBlob

def textblob_sentiment(filename):
    from textblob import TextBlob
    with open(filename, "r") as myfile:
        text = myfile.read().replace('\n', ' ')
        blob = TextBlob(text)
        textsentiments = []
        for sentence in blob.sentences:
            sentsent = sentence.sentiment.polarity
            textsentiments.append(sentsent)
        return textsentiments

# Indico

def indico_sentiment(filename):
    import indicoio
    indicoio.config.api_key = 'yourkeyhere'
    with open(filename, "r") as myfile:
        text = myfile.read().replace('\n', ' ')
        sentences = tokenize.sent_tokenize(text)
        indico_sent = indicoio.sentiment(sentences)
    return indico_sent

def plot_sentiments(filename):
    plt.figure()
    plt.title("Comparison of Sentiment Libraries")
    plt.plot(afinn_sentiment(filename), label="Afinn")
    plt.plot(textblob_sentiment(filename), label="TextBlob")
    plt.plot(indico_sentiment(filename), label="Indico")
    plt.ylabel("Emotional Valence")
    plt.xlabel("Sentence #")
    plt.legend(loc='lower right')
    plt.annotate("Oral Legend LAU-14 Used", xy=(30, 2))
    plt.show()

Once you’ve loaded this script, all you need to do is give it a file with which to work:


plot_sentiments("/Users/john/Code/texts/legends/lau-014.txt")

Re-Installing Python

With any luck, the title of this post should be (will be) “Re-installing Python the Right Way.” The reason for this post is that, while trying to install the (albeit experimental) iPython module that allows you to save iPython/Jupyter notebooks in a markdown format and not JSON, I was running into difficulties that seemed to be a function of the way MacPorts installs Jupyter, which was not allowing me to run jupyter from the command line. That is, the only way I could get a Jupyter notebook was by using the deprecated ipython notebook.

I read around a bit, and it seems the preferred way to handle this is to use something like MacPorts or Homebrew for the base installation of Python and pip, and then to do everything else from within pip.

Side note: since I plan on installing most of the packages into only my user space, and I am lazy and don’t want to type pip install --user every time, I made an alias and saved it in my `.bash_profile`:

alias pinstall='pip install --user'

I used vi, but use whatever editor lets you access that hidden file and do what needs to get done. Once I was done, I sourced the file so its settings were current:

. ~/.bash_profile

(Note the space between the dot and the tilde.)

Having done this, I set about re-loading all the usual modules on which I depend: numpy, scipy, etc. (I have a fuller list.) I was surprised by how quickly this process went, and I routinely checked to see how things were going by opening either a Python or iPython shell and importing a recently installed library.

At the end of the process, however, I still could not get jupyter at the command line. I tried a number of suggestions, but the only one that worked was to add the location of the jupyter executable to my PATH:

PATH=$PATH:/opt/local/bin:/opt/local/sbin:/opt/local/Library/Frameworks/Python.framework/Versions/3.4/bin
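To keep that addition from vanishing when the shell session ends, the same line can be exported from .bash_profile; this is a sketch that reuses the MacPorts path above:

```shell
# persist the PATH addition across shell sessions
export PATH="$PATH:/opt/local/bin:/opt/local/sbin:/opt/local/Library/Frameworks/Python.framework/Versions/3.4/bin"
```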

This strikes me as a real kludge, but it does work. After doing that, I could get jupyter notebook to work, and once I added the following to the Jupyter notebook config file I could open Markdown files as notebooks:

c.NotebookApp.contents_manager_class = 'ipymd.IPymdContentsManager'

So, the current state of ipymd appears to be that line numbers do work, but you can’t convert or save an extant notebook as an md-formatted notebook. You have to create a markdown document first, then open it in Jupyter. But once you’ve done that, you have the full functionality of Jupyter.

This is going to require a bit of legwork for the current project on which I am working, but I think it’s going to make my collaborator, who is not a convert (yet!) to Jupyter, a whole lot happier.

Getting ETE3 Running on a Mac

Unfortunately, ETE, a Python framework for the analysis and visualization of (phylogenetic) trees, is not currently available to install through MacPorts. The recommended way to install ETE on a Mac is through Anaconda or a Miniconda setup. I confess I was not familiar with the conda open source package management system, and I had not heard of Miniconda. I like Anaconda quite a lot, and I like it even more now that I know it’s part of a larger open source ecosystem, and it may be that one day I will switch over to it, but, right now, I am fairly happy with my MacPorts setup and I would rather not break what’s working.

To get a better sense of what I need to do, I clicked on the Linux native installation directions, which skip conda:

Install dependencies: python-qt4, python-lxml, python-six and python-numpy

Those don’t look too bad. Python Six, a compatibility library for Python 2 and 3, is available as py34-six. Done. As for the rest: py34-pyqt4, py34-lxml, and, of course, py34-numpy are already installed. Done, again.

It looks like pip will work here, so after making sure I have py34-pip installed and making sure I run sudo port select --set pip pip34 to set it, I can run:

sudo pip install ete3

If I run an IDLE session and import ete3, I get no error prompts. Yay! Time to make some spam.

Turning a Directory of Texts into a List of Strings

For almost everything I do in text analytics, I find myself with a directory of texts which, in most instances, need to be turned into a list of strings, with each text its own item in the list. Here’s my Python boilerplate:

import glob

file_list = glob.glob('../texts/*.txt')

mytexts = []
for filename in file_list:
    with open(filename, 'r', encoding='utf-8') as f:
        mytexts.append(f.read().replace('\n', ' '))

You can double-check your work by simply calling up any given text, using mytexts[1], with the “1” being any index you want. Remember that Python starts counting at 0 and not 1, so your list of 12 texts, for example, will be indexed 0-11.

And if you need to mush all those texts back into a single string:

alltexts = ''.join(mytexts)
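The same boilerplate can be written a touch more compactly with pathlib; this is an equivalent sketch, with the ../texts directory carried over from above:

```python
from pathlib import Path

# read every .txt file into a list of strings, newlines replaced by spaces
mytexts = [p.read_text(encoding="utf-8").replace("\n", " ")
           for p in sorted(Path("../texts").glob("*.txt"))]
```

The sorted() call makes the list order deterministic, which glob alone does not guarantee.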

Learn to Python

So you want to learn to program, to write code. Perhaps you’re just curious, or perhaps you have a problem to solve. It doesn’t matter. There’s a lot more to say about it, and perhaps one day I’ll come back and write some more sentences that do that, but what you really want to know is where to start. There are a couple of good places to start, and I think it’s important to remember that you can start anywhere and that you should probably start in several places and find the presentation that matches your way of learning best. (That’s the brilliance of the web, isn’t it? It’s multi-modal and you just need to find the mode, and the style within that mode, that serves you best.)

  • The gold standard for interactive Python tutorials is Learn Python.
  • Automate the Boring Stuff with Python is both a website and a book. Start with the website. Both are well-organized and useful not only as tutorials but also as references. The author also maintains the Invent with Python blog. It’s worth reading.
  • Learn Python the Hard Way is another website/book combination, but it also has videos to go with it. (You have to buy the videos, sorry.)
  • Speaking of videos, sentdex has an amazing collection of Youtube videos that address basics, data analysis, working with the NLTK, robotics, and more.
  • Visualize promises to let you write Python (or JavaScript, Java, TypeScript, Ruby, C, or C++) in your web browser and it will visualize for you what the computer is doing step-by-step as it executes your code. I haven’t tried it, but it sounds very interesting. (As some will know, I use the iPython/Jupyter notebook to do something similar.)
  • Erica Sadun has a nice post at Ars Technica on six different online tutorials.
  • You can learn Python in the Free Code Camp.

Hey, who knew Python was part of filmmaking? It is.

Complete Python for Scientific Computing Cheatsheet

Here’s everything you need to do, in the order you need to do it, using MacPorts as your basis. Please note this assumes that everything you need is available for the most recent version of Python, which as of this writing is Python 3.4.

First, install Xcode. (Workaround in the offing.)

Second, install Xcode command-line tools. First, this:

xcode-select --install

And, then, you’ll need to do this:

sudo xcodebuild -license

Third, download and install the MacPorts base package.

Fourth, once the base package is installed, run:

sudo port selfupdate

Now, we need to install Python and the various libraries. The basic setup is:

sudo port install python34
sudo port install py34-numpy
sudo port install py34-scipy
sudo port install py34-matplotlib
sudo port install py34-pandas
sudo port select --set python python34

If you would like the option of using *Jupyter* notebooks:

sudo port install py34-ipython
sudo port select --set ipython py34-ipython
sudo port install py34-jupyter

If you’re interested in doing text analytics, then you’ll probably find the following libraries useful. (Please note that the first line below is a workaround to keep NLTK from installing Python 2.)

sudo port install xorg-xcb-proto +python34
sudo port install py34-nltk

If you would like to add R to your arsenal of weapons and to have it work within Jupyter notebook:

sudo port install R
sudo port install py34-zmq

Then, in R, using sudo:

install.packages(c('rzmq', 'repr', 'IRkernel', 'IRdisplay'),
                 repos = c('http://irkernel.github.io/',
                           getOption('repos')))

IRkernel::installspec()