How do you go about finding the words that occur in all the texts of a collection or in some percentage of texts? A Safari Oriole lesson I took in recently did the following, using two texts as the basis for the comparison:
from pybloom import BloomFilter

bf = BloomFilter(capacity=1000, error_rate=0.001)
# load the first text's words into the filter
for word in text1_words:
    bf.add(word)
# collect words from the second text that (probably) also occur in the first
intersect = set()
for word in text2_words:
    if word in bf:
        intersect.add(word)
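The Bloom filter handles two texts; for the "some percentage of texts" case from the opening question, a plain Counter over per-text vocabularies is enough. A sketch (the function name and threshold parameter are my own):

```python
from collections import Counter

def words_in_percentage(texts, threshold=1.0):
    """Return words appearing in at least `threshold` (0-1] of the texts.

    `texts` is a list of word lists; each text's vocabulary counts once.
    """
    counts = Counter()
    for words in texts:
        counts.update(set(words))          # count each word once per text
    needed = threshold * len(texts)
    return {w for w, c in counts.items() if c >= needed}
```

With `threshold=1.0` this is the intersection of all the texts; lowering it to, say, `0.5` keeps any word found in at least half of them.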
UPDATE: I’m working on getting Markdown and syntax highlighting working. I’m running into difficulties with my beloved Markdown Extra plug-in, indicating I may need to switch to the Jetpack version. (I’ve switched before but not been satisfied with the results.)
As noted earlier, I am very taken with the idea of moving to an open notebook system: it goes well with my interest in keeping my research accessible not only to myself but also to others. Towards that end, I am in the midst of moving my notes and web captures out of Evernote and into DevonThink — a move made easier by a script that automates the process. I am still not a fan of DT’s UI, but its functionality cannot be denied or ignored. It quite literally does everything. This also means moving my reference library out of Papers, which I have had a love/hate relationship with for the past few years. (Much of this move is, in fact, prompted by the fact that I don’t quite trust the program after various moments of failure. I cannot deny that some of the failings might be of my own making, but, then again, the point of this move is to foolproof my systems against the fail/fool point at the center of them all: me.)
Caleb McDaniel’s system is based on Gitit, which itself relies on Pandoc to do much of the heavy lifting. In his system, BibTeX entries appear at the top of a note document and are, as I understand it, compiled as needed into larger, comprehensive BibTeX lists. To get the entry at the top of the page into HTML for the wiki, McDaniel uses an OCaml library.
Why not, I wondered as I read McDaniel, attempt to keep as much of the workflow as possible within a single language? Since Python is my language of choice — mostly because I am too time and mind poor to attempt to master anything else — I decided to make the attempt in Python. As luck would have it, there is a bibtex2html module available for Python:
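I haven’t confirmed that module’s exact interface, so here is only a sketch of the transformation itself: a BibTeX entry rendered to an HTML citation with a naive standard-library regex (a real parser should handle nested braces and escapes; both function names are mine):

```python
import re

def bibtex_fields(entry):
    """Pull key = {value} pairs out of a single BibTeX entry (naive regex)."""
    return dict(re.findall(r'(\w+)\s*=\s*[{"](.+?)[}"]\s*,?', entry))

def bibtex_to_html(entry):
    """Render a minimal HTML citation from a BibTeX entry string."""
    f = bibtex_fields(entry)
    return '<p class="citation">{author} ({year}). <em>{title}</em>.</p>'.format(
        author=f.get("author", "?"), year=f.get("year", "?"),
        title=f.get("title", "?"))
```

Whatever library does the parsing, the shape of the pipeline is the same: fields out of the entry at the top of the note, HTML into the page the wiki serves.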
Now, whether the rest of the system is built on Sphinx or with MkDocs is the next matter — as is figuring out how to write a script that chains these things together so that I can approach the fluidity and assuredness of McDaniel.
I will update this post as I go. (Please note that this post will stay focused on the mechanics of such a system.)
I’m still a babe in the programming woods, so Shrutarshi Basu’s explanation of namespaces, scopes, and classes in Python was pretty useful. I can’t tell if I had read around enough in preparation for final understanding or if Basu simply wrote about it in a fashion that I understood clearly, seemingly for the first time.
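For anyone else in those woods, the distinction can be compressed into a few lines: Python resolves a bare name local, then enclosing, then global, then built-in, and a class body is its own namespace.

```python
x = "global"

def outer():
    x = "enclosing"
    def inner():
        # no local x here, so Python finds the enclosing one
        return x
    return inner()

class Holder:
    x = "class attribute"   # lives in the class's own namespace

print(outer())    # enclosing
print(Holder.x)   # class attribute
print(x)          # global
```

The three `x` names never collide because each lives in a different namespace; the lookup order decides which one a given reference sees.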
Your use of pandas just got a little easier: the makers of the module also have a cheatsheet available. You’re welcome.
Sometimes you need to know which Python modules you already have installed; the easiest way to get a list is:
This will give you a list of installed modules, typically as a series of columns, with names only and no version numbers. If you need version numbers, then try:
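The commands themselves seem to have fallen out of this post, so rather than guess at them, here is the same information from inside Python: pkgutil for bare names, importlib.metadata for versions (the latter assumes Python 3.8+).

```python
import pkgutil
from importlib import metadata

# importable top-level module names, like the columnar listing described above
names = sorted(m.name for m in pkgutil.iter_modules())

# installed distributions with versions, like name==version pairs
versions = {d.metadata["Name"]: d.version for d in metadata.distributions()}

print(len(names), "modules;", len(versions), "distributions")
```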
I have never been particularly impressed with Moodle, the learning management system used by my university and a number of other organizations. Its every impulse, it seems to me, is to increase the number of steps to get simple things done, I suppose to simplify more complex things for users with less tech savvy. Using markdown, for example, is painful and there’s no way to control the presentation of materials unless you resort to one of its myriad of under-explained, and probably under-thought, content packaging options. (I’ve never grokked the Moodle “book”, for example.)
To be honest, there are times when I feel the same way about WordPress, which has gotten GUIer and less sharp on a number of fronts — why oh why are categories and tags now unlimited in application?
I’m also less than clear on my university’s approach to intellectual property: they seem rather keen to claim everything and anything as their own, when they can’t even be bothered to give you basic production tools. (Hello? It’s been three years since I had access to a printer that didn’t involve me copying files to a flash drive and walking down stairs to load things onto a Windows machine that can only ever print PDFs.)
I decided I would give static site generation a try, particularly if I could compose in markdown, ReST, or even a Jupyter notebook (as a few of the generators appear to promise). I’m not interested in using this for blogging, and I will probably maintain it on a subdirectory of my own site, e.g.
/teaching, and I hope to be able to sync between local and remote versions using Git. That seems straightforward, doesn’t it? (I’m also now thinking that I will stuff everything into the same directory and just have different pages, and subpages?, for each course. Just hang everything out there for all to see.)
As for the site generators themselves, there are a number of options:
- Pelican is a popular one, but seems very blog oriented.
- I’ve installed both Pelican and Nikola, and I ran the latter this morning and was somewhat overwhelmed by the number of directories it generated right away.
- Cactus seems compelling, and has a build available for the Mac.
- There is also Hyde.
- I’m going to ignore blogofile for now, but it’s there and its development is active.
Mehrdad Yazdani pointed out that some of my problems in normalization may have been the result of not having the right pieces in place, and so suggested some changes to the sentiments.py script. The result would seem to suggest that the two distributions are now comparable in scale — as well as on the same x-axis. (My Python-fu is not strong enough, yet, for me to determine how this error crept in.)
Raw Sentiment normalized with np.max(np.abs(a_list))
When I run these results through my averaging function, however, I get significant vertical compression:
Averaged Sentiment normalized with np.max(np.abs(a_list))
If I substitute np.linalg.norm(a_list) for np.max(np.abs(a_list)) in the script, I get the following results:
Raw Sentiment Normalized with numpy.linalg.norm
Averaged Sentiment Normalized with numpy.linalg.norm
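The sentiments.py script itself isn’t reproduced here, but the two normalizations being compared reduce to something like this (the function names are mine):

```python
import numpy as np

def normalize_maxabs(a_list):
    """Scale so the largest magnitude becomes 1 (range [-1, 1])."""
    a = np.asarray(a_list, dtype=float)
    return a / np.max(np.abs(a))

def normalize_l2(a_list):
    """Scale by the Euclidean (L2) norm, giving a unit-length vector."""
    a = np.asarray(a_list, dtype=float)
    return a / np.linalg.norm(a)
```

Note that the two scales differ by construction: max-abs keeps values in [-1, 1] regardless of length, while the L2 norm divides by a factor that grows with the length of the series, so the same data lands at visibly different vertical scales.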