Ignoring Unicode Decode Errors

Working this morning with a sample corpus of fraudulent emails (Rachael Tatman’s Fraudulent Email Corpus on Kaggle), I found myself unable to get past reading the file, thanks to decoding errors:

codec can't decode byte 0xc2

Oof. That byte 0xc2 has bitten me before — I think it may be a Windows thing, but I don’t remember right now, and, more importantly, I don’t care. Data loss is not important in this moment, so simply ignoring the error is my best course forward:

import codecs

# errors="ignore" silently drops any bytes that are not valid UTF-8
fh = codecs.open("fraudulent_emails_small.txt", "r", encoding="utf-8", errors="ignore")

And done. Thanks, as usual, to a great Stack Overflow thread.
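
For anyone who wants to see what "ignoring" actually costs, here is a minimal sketch, not the code I ran on the corpus. It assumes Python 3, where the built-in open() accepts the same encoding and errors arguments, and it uses a made-up one-line sample and a hypothetical helper named read_lossy rather than the real file:

```python
import os
import tempfile


def read_lossy(path, errors="ignore"):
    """Read a text file as UTF-8, applying the given error handler to bad bytes.

    Hypothetical helper for illustration; not part of any library.
    """
    with open(path, encoding="utf-8", errors=errors) as fh:
        return fh.read()


# Made-up sample line: 0xc2 starts a two-byte UTF-8 sequence, but the next
# byte (a space) is not a valid continuation byte, so strict decoding fails.
raw = b"Dear friend,\xc2 I have a business proposal.\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "wb") as fh:
        fh.write(raw)

    # The default handler ("strict") raises on the stray byte.
    try:
        read_lossy(path, errors="strict")
        strict_failed = False
    except UnicodeDecodeError:
        strict_failed = True

    # "ignore" drops the bad byte; "replace" marks its position with U+FFFD.
    ignored = read_lossy(path, errors="ignore")
    replaced = read_lossy(path, errors="replace")

print(strict_failed)
print(repr(ignored))
print(repr(replaced))
```

If you would rather know where bytes went missing, errors="replace" is probably the gentler choice, since it leaves a visible marker instead of deleting silently.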

BTW, thank you Rachael for making the dataset available!


Louisiana Book Festival 2017

I am delighted to announce that The Amazing Crawfish Boat will be one of the featured books at this year’s Louisiana Book Festival. The book talk is scheduled for Saturday afternoon, 3:30 p.m. to 4 p.m. in the First Floor Meeting Room of the Capitol Park Museum. If you’re at the Festival, come say hello or swing by the festival’s store after the talk to find me signing books. See you there!

AIs Talk among Themselves

While science fiction has a long history of human-AI/robot interaction, especially in terms of dialogue, the idea of robots/AIs talking to each other gained a lot more currency in the wake of two Facebook AIs seemingly developing their own language. First, a more reasoned summary of what happened at Facebook from the BBC. And now something a bit more sensational. This Quora post also has a bit more on what happened at Facebook.

All of this concern about AIs talking to each other has a history, at least in science fiction. One moment to consider occurs in 1970’s The Forbin Project, in which the USA builds a supercomputer to oversee its strategic defense systems (missiles, bombers, you name it), only to discover that the USSR (now Russia) has a similar computer. It’s not too long before the two computers demand to talk directly to each other, then merge to form “World Control.”

One good place to start a larger history of robots and AIs talking to each other is Emily Asher-Perrin’s survey on Tor. (Tor is a long-time publisher of science fiction and fantasy literature; their website contains a mix of original fiction, thoughtful essays, and read or watch-alongs of classic or beloved works in the genres.)

(Perhaps one thing to think about is the difference between robots as corporealized entities and artificial intelligences as noncorporeal entities: our responses to inter-entity dialogue seem to differ significantly based on whether the consciousness is individuated in a way that our own seems to be.)

Python Modules You Didn’t Know You Needed

One of the things that happens as you nurture and grow a software stack is that you begin to take its functionality for granted, and, when you are faced with the prospect of re-creating it elsewhere or starting over, you realize you need better documentation. My work is currently founded on Python, and I have already documented the great architecture that is numpy + scipy + nltk + pandas + matplotlib + … you get the idea. Beyond that core, here are a few more modules I rely on:

  • jupyter is central to how I work my way through code, and when I need to present that code, I am delighted that jupyter gives me the option to present a notebook as a collection of slides. RISE makes those notebooks fly using Reveal.js.
  • missingno “provides a small toolset of flexible and easy-to-use missing data visualizations and utilities that allows you to get a quick visual summary of the completeness (or lack thereof) of your dataset. It’s built using matplotlib, so it’s fast, and takes any pandas DataFrame input that you throw at it, so it’s flexible. Just pip install missingno to get started.”

I’ve got more … I just need to list them out.