Building a Corpus-Specific Stopword List

How do you go about finding the words that occur in all the texts of a collection, or in some percentage of them? A Safari Oriole lesson I recently took in did the following, using two texts as the basis for the comparison:

[code lang=python]
from pybloom import BloomFilter

# text1_words and text2_words are assumed to be lists of tokens
bf = BloomFilter(capacity=1000, error_rate=0.001)

# load the first text's vocabulary into the filter
for word in text1_words:
    bf.add(word)

# collect words from the second text that (probably) also occur in the first
intersect = set()

for word in text2_words:
    if word in bf:
        intersect.add(word)

print(intersect)
[/code]
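A Bloom filter keeps memory use down for large vocabularies at the cost of occasional false positives, so the intersection above is approximate. For a small collection, plain sets are exact, and they extend naturally to the original question of words shared by all texts, or by some percentage of them. A minimal sketch, assuming texts is a list of token lists:

[code lang=python]
from collections import Counter

def shared_words(texts):
    # exact intersection: words that occur in every text
    common = set(texts[0])
    for tokens in texts[1:]:
        common &= set(tokens)
    return common

def frequent_words(texts, threshold=0.5):
    # words that occur in at least `threshold` of the texts
    doc_freq = Counter()
    for tokens in texts:
        doc_freq.update(set(tokens))  # count each word once per text
    cutoff = threshold * len(texts)
    return {word for word, n in doc_freq.items() if n >= cutoff}
[/code]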

UPDATE: I’m still working on getting Markdown and syntax highlighting to behave. I’m running into difficulties with my beloved Markdown Extra plug-in, which suggests I may need to switch to the Jetpack version. (I’ve switched before but was not satisfied with the results.)

Towards an Open Notebook Built on Python

As noted earlier, I am very taken with the idea of moving to an open notebook system: it goes well with my interest in keeping my research accessible not only to myself but also to others. Toward that end, I am in the midst of moving my notes and web captures out of Evernote and into DEVONthink, a move made easier by a script that automates the process. I am still not a fan of DT’s UI, but its functionality cannot be denied or ignored: it quite literally does everything. This also means moving my reference library out of Papers, with which I have had a love/hate relationship for the past few years. (Much of this move is, in fact, prompted by my not quite trusting the program after various moments of failure. Some of those failings might be of my own making, but, then again, the point of this move is to foolproof my systems against the fail/fool point at the center of it all: me.)

Caleb McDaniel’s system is based on Gitit, which itself relies on Pandoc to do much of the heavy lifting. In his system, BibTeX entries appear at the top of a note document and are, as I understand it, compiled as needed into larger, comprehensive BibTeX lists. To get the BibTeX entry at the top of the page into HTML for the wiki, McDaniel uses an OCaml library.

Why not, I wondered as I read McDaniel, attempt to keep as much of the workflow as possible within a single language? Since Python is my language of choice (mostly because I am too time and mind poor to attempt to master anything else), I decided to make the attempt in Python. As luck would have it, there is a bibtex2html module available for Python: [bibtex2html](https://github.com/goliveira/bibtex2html).
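Getting the entry at the top of a note into a usable form is the first step. I have not yet put bibtex2html through its paces, so as a placeholder here is a sketch using the bibtexparser package instead (the sample entry is made up for illustration):

[code lang=python]
import bibtexparser

# a made-up entry of the kind that would sit at the top of a note
note_header = """
@article{doe2016,
    author = {Doe, Jane},
    title = {A Hypothetical Article},
    journal = {Journal of Examples},
    year = {2016}
}
"""

# parse the BibTeX block into a list of entry dictionaries
bib_database = bibtexparser.loads(note_header)
for entry in bib_database.entries:
    print(entry['ID'], entry['year'], entry['title'])
[/code]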

Now, whether to build the rest of the system on Sphinx or on MkDocs is the next matter, as is figuring out how to write a script that chains these pieces together so that I can approach the fluidity and assuredness of McDaniel’s setup.
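To make the chaining concrete, here is the kind of glue script I have in mind, assuming MkDocs wins out (the directory names are hypothetical; mkdocs build is MkDocs’ standard build command):

[code lang=python]
import subprocess
from pathlib import Path

NOTES = Path('notes')  # hypothetical directory of Markdown notes
DOCS = Path('docs')    # MkDocs' default source directory

# copy the notes into the MkDocs tree, then build the static site
DOCS.mkdir(exist_ok=True)
for note in NOTES.glob('*.md'):
    (DOCS / note.name).write_text(note.read_text())

subprocess.run(['mkdocs', 'build'], check=True)
[/code]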

I will update this post as I go. (Please note that this post will stay focused on the mechanics of such a system.)

Listing Python Modules

Sometimes you need to know which Python modules you already have installed. The easiest way to get a list is:

[code lang=python]
help('modules')
[/code]

This will give you a list of installed modules, typically as a series of columns. All you get are names, not version numbers. If you need a version number, then try:

[code lang=python]
import matplotlib

print(matplotlib.__version__)
# 1.5.1
[/code]
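Checking modules one at a time gets tedious; setuptools’ pkg_resources can report every installed distribution and its version in one pass. A small sketch:

[code lang=python]
import pkg_resources

# print every installed distribution with its version, sorted by name
for dist in sorted(pkg_resources.working_set,
                   key=lambda d: d.project_name.lower()):
    print(dist.project_name, dist.version)
[/code]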

Python Site Generators

I have never been particularly impressed with Moodle, the learning management system used by my university and a number of other organizations. Its every impulse, it seems to me, is to increase the number of steps needed to get simple things done, presumably to simplify more complex things for users with less tech savvy. Using Markdown, for example, is painful, and there’s no way to control the presentation of materials unless you resort to one of its myriad under-explained, and probably under-thought, content packaging options. (I’ve never grokked the Moodle “book”, for example.)

To be honest, there are times when I feel the same way about WordPress, which has gotten GUIer and less sharp on a number of fronts — why oh why are categories and tags now unlimited in application?

I’m also less than clear on my university’s approach to intellectual property: they seem rather keen to claim everything and anything as their own, when they can’t even be bothered to give you basic production tools. (Hello? It’s been three years since I had access to a printer that didn’t involve me copying files to a flash drive and walking down stairs to load things onto a Windows machine that can only ever print PDFs.)

I decided I would give static site generation a try, particularly if I could compose in Markdown, ReST, or even a Jupyter notebook (as a few of the generators appear to promise). I’m not interested in using this for blogging; I will probably maintain it in a subdirectory of my own site, e.g. /teaching, and I hope to be able to sync between local and remote versions using Git. That seems straightforward, doesn’t it? (I’m also now thinking that I will stuff everything into the same directory and just have different pages, and subpages?, for each course. Just hang everything out there for all to see.)

As for the site generators themselves, there are a number of options:

  • Pelican is a popular one, but seems very blog oriented.
  • I’ve installed both Pelican and Nikola, and I ran the latter this morning and was somewhat overwhelmed by the number of directories it generated right away.
  • Cactus seems compelling, and has a build available for the Mac.
  • There is also Hyde.
  • I’m going to ignore blogofile for now, but it’s there and its development is active.
  • If all else fails, I have used Poole before. It doesn’t have a templating system or JavaScript or any of that, but maybe it’s better for it.

More on Normalizing Sentiment Distributions

Mehrdad Yazdani pointed out that some of my problems in normalization may have been the result of not having the right pieces in place, and so he suggested some changes to the sentiments.py script. The result suggests that the two distributions are now comparable in scale, as well as on the same x-axis. (My Python-fu is not yet strong enough for me to determine how this error crept in.)
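For reference, a minimal sketch of the two normalizations compared below; the function names and wrappers are mine, not the sentiments.py originals:

[code lang=python]
import numpy as np

def normalize_by_max(a_list):
    # divide by the largest absolute value: the result spans [-1, 1]
    a = np.asarray(a_list, dtype=float)
    return a / np.max(np.abs(a))

def normalize_by_l2(a_list):
    # divide by the Euclidean (L2) norm: the result has unit length,
    # so values shrink as the series gets longer
    a = np.asarray(a_list, dtype=float)
    return a / np.linalg.norm(a)
[/code]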

[Figure: Raw sentiment normalized with np.max(np.abs(a_list))]

When I run these results through my averaging function, however, I get significant vertical compression:

[Figure: Averaged sentiment normalized with np.max(np.abs(a_list))]
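The averaging function itself is not shown in this post; what I have in mind is a simple rolling mean along these lines (the window size is an arbitrary stand-in, not the value the script actually uses):

[code lang=python]
import numpy as np

def averaged(a_list, window=100):
    # simple rolling mean over the sentiment series
    a = np.asarray(a_list, dtype=float)
    kernel = np.ones(window) / window
    return np.convolve(a, kernel, mode='valid')
[/code]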

If I substitute np.linalg.norm(a_list) for np.max(np.abs(a_list)) in the script, I get the following results:

[Figure: Raw sentiment normalized with numpy.linalg.norm]

[Figure: Averaged sentiment normalized with numpy.linalg.norm]