Ignoring Unicode Decode Errors

Working this morning with a sample corpus of fraudulent emails (Rachael Tatman’s Fraudulent Email Corpus on Kaggle), I found myself unable to get past reading the file, thanks to decoding errors:

codec can't decode byte 0xc2

Oof. That byte 0xc2 has bitten me before. I think it may be a Windows thing, but I don’t remember right now and, more importantly, I don’t care. Data loss does not matter at the moment, so simply ignoring the error is my best way forward:

import codecs

fh = codecs.open("fraudulent_emails_small.txt", "r", encoding="utf-8", errors="ignore")

And done. Thanks, as usual, to a great StackOverflow thread.
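For anyone who wants to see what ignoring actually does to the text, here is a minimal sketch using Python 3's built-in open(), which accepts the same encoding and errors arguments. The sample bytes and the read_text helper are made up for illustration and are not taken from the corpus; the stray 0xc2 is just there to trigger the same kind of failure.

```python
import os
import tempfile

# Hypothetical sample: 0xc2 followed by a space is not a valid UTF-8 sequence.
raw = b"Dear friend\xc2 , I have a business proposal."

# Write the bytes to a throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(raw)
    path = tmp.name


def read_text(path, errors):
    """Read a file as UTF-8 with the given error handler (illustrative helper)."""
    with open(path, "r", encoding="utf-8", errors=errors) as fh:
        return fh.read()


ignored = read_text(path, "ignore")    # silently drops the undecodable byte
replaced = read_text(path, "replace")  # substitutes U+FFFD so the loss stays visible
os.remove(path)

print(repr(ignored))
print(repr(replaced))
```

With errors="ignore" the bad byte simply vanishes, which is fine when data loss does not matter; errors="replace" is worth considering if you want to see afterwards where bytes were dropped.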

BTW, thank you Rachael for making the dataset available!
