NLTK and Stopwords

I spent some time this morning playing with various features of the Python NLTK, trying to think about how much, if at all, I wanted to use it with my freshmen. (More on this in a moment.) I had loaded a short story that we have read and was running it through various functions that the NLTK makes possible when I ran into a hiccup:

>>> text.collocations()
Building collocations list
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/text.py", line 341, in collocations
    ignored_words = stopwords.words('english')
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/corpus/util.py", line 68, in __getattr__
    self.__load()
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/corpus/util.py", line 56, in __load
    except LookupError: raise e
LookupError: 
**********************************************************************
  Resource 'corpora/stopwords' not found.  Please use the NLTK
  Downloader to obtain the resource: >>> nltk.download().
  Searched in:
    - '/usr/share/nltk'
    - '/Users/john/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************

Now, the nice thing is that all you have to do is follow the directions, entering nltk.download() in the IDLE prompt, and you get:

showing info http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml

which provides the following window:

[Screenshot, 2013-01-26: the NLTK Downloader window]

Clicking on the Corpora tab and scrolling down allows you to download the stopword list:

[Screenshot, 2013-01-26: the Corpora tab of the NLTK Downloader]

What I have not yet figured out is how to specify your own stopword list. Part of what I want to teach my students is that choosing which words are important and which are not is a matter of subject matter expertise, and thus something they should not turn over to someone else to do.

7 thoughts on “NLTK and Stopwords”

  1. my_stops = ["omit", "the", "words", "in", "this", "list"]
    words_less_stops = [w for w in my_whole_catted_text_file if w not in my_stops]

  2. # Read the whole file and split it into whitespace-separated tokens
    sp = open("c:\\googleresults", "r")
    spstring = sp.read()
    sp.close()
    splist = spstring.split()
    #fdlist = nltk.FreqDist(splist)
    #s=set(stopwords.words('english'))
    # Stop words written out by hand, so the NLTK corpus is not needed
    s = ['all', 'just', 'being', 'over', 'both', 'through', 'yourselves', 'its',
    'before', 'herself', 'had', 'should', 'to', 'only', 'under', 'ours', 'has', 'do',
    'them', 'his', 'very', 'they', 'not', 'during', 'now', 'him', 'nor', 'did', 'this',
    'she', 'each', 'further', 'where', 'few', 'because', 'doing', 'some', 'are', 'our',
    'ourselves', 'out', 'what', 'for', 'while', 'does', 'above', 'between', 't', 'be',
    'we', 'who', 'were', 'here', 'hers', 'by', 'on', 'about', 'of', 'against', 's', 'or',
    'own', 'into', 'yourself', 'down', 'your', 'from', 'her', 'their', 'there', 'been',
    'whom', 'too', 'themselves', 'was', 'until', 'more', 'himself', 'that', 'but', 'don',
    'with', 'than', 'those', 'he', 'me', 'myself', 'these', 'up', 'will', 'below', 'can',
    'theirs', 'my', 'and', 'then', 'is', 'am', 'it', 'an', 'as', 'itself', 'at', 'have',
    'in', 'any', 'if', 'again', 'no', 'when', 'same', 'how', 'other', 'which', 'you',
    'after', 'most', 'such', 'why', 'a', 'off', 'i', 'yours', 'so', 'the', 'having',
    'once']
    # Keep only the tokens that are not in the stop word list
    reducedlist = [w.strip() for w in splist if w.strip() not in s]

  3. Hi, I ran into the same problem: corpora/stopwords is not found. I entered nltk.download() in the terminal, but nothing happened. Could you please tell me how I can solve this problem? Thanks a lot!

    • Hi, Wen-Wen … how did you install Python on your machine? The NLTK downloader, as you can see from above, has a GUI, and perhaps you don’t have all the components to make that possible. If that doesn’t work, you can, as both Tanya and Steve point out above, create your own stop word list, or you can use pip to install the stop_words package. I hope this helps! (Let me know if it doesn’t.)

    • Are you in the Python interpreter or at the Bash prompt? nltk.download will only work in Python after you’ve imported nltk.

    • The easiest way is simply to create your own list, either in Python directly or as a separate text file (one word per line) that you read into Python as a list. You can then append one list to the other. This is quite common, since almost everyone uses some form of standard list plus a list customized to the task at hand. Let me know if some sample code would help.
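
The approach in that last reply (a standard list plus a custom list kept in a text file, one word per line) might look something like the sketch below. It uses only the standard library. The file name my_stops.txt, the tiny standard_stops set, the load_stopwords helper, and the sample sentence are all made up for illustration, and a set union stands in for appending one list to the other so that duplicates drop out.

```python
from pathlib import Path
import tempfile


def load_stopwords(path):
    """Read one stop word per line, lowercased, skipping blank lines."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return {line.strip().lower() for line in lines if line.strip()}


# Stand-in for a standard list (hypothetical; a real one is much longer).
standard_stops = {"the", "a", "of", "and", "in"}

# Write a hypothetical custom list to a temporary file so the sketch
# runs on its own; in practice this file would already exist on disk.
with tempfile.TemporaryDirectory() as tmp:
    custom_path = Path(tmp) / "my_stops.txt"
    custom_path.write_text("said\nwent\n\nLooked\n", encoding="utf-8")
    custom_stops = load_stopwords(custom_path)

# Combine the standard and custom lists into one set of stop words.
all_stops = standard_stops | custom_stops

# Filter a tokenized text, comparing in lowercase.
tokens = "The old man said he went in and looked at the sea".split()
kept = [w for w in tokens if w.lower() not in all_stops]
print(kept)  # ['old', 'man', 'he', 'at', 'sea']
```

Keeping the custom list in its own file means students can edit it, and argue about it, without touching any code.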
