Working this morning with a sample of fraudulent emails (Rachael Tatman’s Fraudulent Email Corpus on Kaggle), I couldn’t even get past reading the file, thanks to decoding errors:
codec can't decode byte 0xc2
Oof. That byte 0xc2 has bitten me before. I think it may be a Windows thing, but I don’t remember right now and, more importantly, I don’t care. Losing a few characters doesn’t matter for what I’m doing, so simply ignoring the error is my best way forward:
import codecs
# errors='ignore' silently drops any bytes that are not valid UTF-8
fh = codecs.open("fraudulent_emails_small.txt", "r", encoding="utf-8", errors="ignore")
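To see what ignoring the error actually throws away, here is a minimal sketch. The byte string is a made-up sample, not a line from the corpus, and read_ignoring_errors is a hypothetical helper name of my own. It uses the built-in open, which accepts the same encoding and errors arguments, inside a with block so the file gets closed.

```python
# Made-up sample (not from the corpus): a stray 0xc2 byte followed by a
# comma, which is not a valid UTF-8 sequence.
raw = b"Dear friend\xc2, I am a prince"

# Strict decoding (the default) raises UnicodeDecodeError on the bad byte.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# 'ignore' drops the undecodable byte without a trace.
ignored = raw.decode("utf-8", errors="ignore")

# 'replace' swaps it for U+FFFD, so the damage stays visible in the text.
replaced = raw.decode("utf-8", errors="replace")


def read_ignoring_errors(path):
    """Hypothetical helper: read a whole file as UTF-8, dropping bad bytes."""
    with open(path, "r", encoding="utf-8", errors="ignore") as fh:
        return fh.read()
```

If I ever do care about what was lost, errors="replace" is probably the gentler option, since the replacement characters show exactly where the bad bytes were.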
And done. Thanks, as usual, to a great StackOverflow thread.
BTW, thank you Rachael for making the dataset available!