Building a Corpus-Specific Stopword List

How do you go about finding the words that occur in all the texts of a collection or in some percentage of texts? A Safari Oriole lesson I took in recently did the following, using two texts as the basis for the comparison:

from pybloom import BloomFilter

bf = BloomFilter(capacity = 1000, error_rate = 0.001)

for word in text1_words:

intersect = set([])

for word in text2_words:
    if word in bf:


UPDATE: I’m working on getting Markdown and syntax highlighting working. I’m running into difficulties with my beloved Markdown Extra plug-in, indicating I may need to switch to the Jetpack version. (I’ve switched before but not been satisfied with the results.)

Leave a Reply