Working this morning with a sample corpus of fraudulent emails, Rachael Tatman’s Fraudulent Email Corpus on Kaggle, I found myself unable to get past reading the file, thanks to decoding errors:
codec can't decode byte 0xc2
Oof. That byte 0xc2 has bitten me before. I think it may be a Windows thing, but I don’t remember right now, and, more importantly, I don’t care: data loss is not a concern in this moment, so simply ignoring the error is my best course forward:
import codecs

fh = codecs.open("fraudulent_emails_small.txt", "r", encoding="utf-8", errors="ignore")
And done. Thanks, as usual, to a great StackOverflow thread.
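For what it’s worth, the same trick works without the codecs module: Python 3’s built-in open() accepts encoding and errors directly. A quick sketch (the file name and bytes here are just for illustration):

```python
# Write a file containing a stray, undecodable 0xc2 byte.
with open("sample.txt", "wb") as out:
    out.write(b"hello \xc2 world")

# errors="ignore" silently drops bytes that can't be decoded as UTF-8.
with open("sample.txt", "r", encoding="utf-8", errors="ignore") as fh:
    text = fh.read()

print(text)  # "hello  world" -- the bad byte is simply gone
```

If losing characters silently makes you nervous, errors="replace" swaps each bad byte for U+FFFD instead, which at least leaves a visible trace.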
BTW, thank you Rachael for making the dataset available!
After the first round of work is done with the TED talks and I’ve taken the next steps on the legend material, it will be time to figure out what to do on the literary side of things. When that happens, Jonathan Reeve’s database for Project Gutenberg looks fantastic.
Vikash Singh has a terrific write-up, “How our startup switched from Unsupervised LDA to Semi-Supervised GuidedLDA,” which not only offers a very clear discussion of LDA and how his team modified it but also notes that his company’s efforts resulted in a Python library that’s as easy to install as:
pip install guidedlda
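The “guided” part comes from seeding topics with words you expect them to contain before fitting the model. Based on my reading of the library’s README, the seeds are passed to model.fit as a mapping from word ids to topic ids; here’s a minimal sketch of just that preparation step (the vocabulary and seed words are invented for illustration):

```python
# A toy vocabulary, standing in for the corpus vocabulary GuidedLDA expects.
vocab = ["money", "transfer", "bank", "prince", "inheritance", "urgent"]
word2id = {word: i for i, word in enumerate(vocab)}

# Each inner list seeds one topic with words we believe belong to it.
seed_topic_list = [["money", "transfer", "bank"],
                   ["prince", "inheritance"]]

# Build the {word_id: topic_id} dict the library's fit() takes as seed_topics.
seed_topics = {}
for topic_id, words in enumerate(seed_topic_list):
    for word in words:
        seed_topics[word2id[word]] = topic_id

print(seed_topics)
```

In the README’s workflow, this dict then goes to something like model.fit(X, seed_topics=seed_topics, seed_confidence=0.15), where seed_confidence controls how strongly the seeds bias the sampler; treat those parameter names as assumptions until you check the docs.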
If you missed the Louisiana Book Festival but still want to hear what I had to say, here’s your chance:
The Amazing Crawfish Boat, John Laudun presents at the 2017 Louisiana Book Festival from University Press of Mississippi on Vimeo.
It probably also makes for an excellent sleep aid. Your mileage may vary.
I am delighted to announce that The Amazing Crawfish Boat will be one of the featured books at this year’s Louisiana Book Festival. The book talk is scheduled for Saturday afternoon, 3:30 p.m. to 4 p.m. in the First Floor Meeting Room of the Capitol Park Museum. If you’re at the Festival, come say hello or swing by the festival’s store after the talk to find me signing books. See you there!
While science fiction has a long history of human-AI/robot interaction, especially in terms of dialogue, the idea of robots/AIs talking to each other gained a lot more currency in the wake of two Facebook AIs seemingly developing their own language. First, a more reasoned summary of what happened at Facebook from the BBC. And now something a bit more sensational. This Quora post also has a bit more on what happened at Facebook.
All of this concern about AIs talking to each other has a history, at least in science fiction. One moment to consider occurred in 1970’s Colossus: The Forbin Project, in which the USA builds a supercomputer to oversee its strategic defense systems (missiles, bombers, you name it), only to discover that the USSR had a similar computer. It’s not long before the two computers demand to talk directly to each other, then merge to form “World Control.”
One good place to start a larger history of robots and AIs talking to each other is Emily Asher-Perrin’s survey on Tor. (Tor is a long-time publisher of science fiction and fantasy literature; their website contains a mix of original fiction, thoughtful essays, and read or watch-alongs of classic or beloved works in the genres.)
(Perhaps one thing to think about is the difference between robots as corporealized entities and artificial intelligences as noncorporeal entities: our responses to dialogue between such entities seem to differ significantly based on whether the consciousness is individuated in a way that our own seems to be.)
One of the things that happens as you nurture and grow a software stack is that you begin to take its functionality for granted, and, when you are faced with the prospect of re-creating it elsewhere or starting over, you realize you need better documentation. My work is currently founded on Python, and I have already documented the great architecture that is matplotlib + … you get the idea.
jupyter is central to how I work my way through code, and when I need to present that code, I am delighted that jupyter gives me the option to present a notebook as a collection of slides. RISE makes those notebooks fly using Reveal.js.
missingno “provides a small toolset of flexible and easy-to-use missing data visualizations and utilities that allows you to get a quick visual summary of the completeness (or lack thereof) of your dataset. It’s built using matplotlib, so it’s fast, and takes any pandas DataFrame input that you throw at it, so it’s flexible. Just pip install missingno to get started.”
I’ve got more … I just need to list them out.