NLTK and Stopwords

I spent some time this morning playing with various features of the Python NLTK, trying to think about how much, if at all, I wanted to use it with my freshmen. (More on this in a moment.) I had loaded the text of a short story that we have read and was running it through various functions that the NLTK makes possible when I ran into a hiccup:

[code lang=text]
>>> text.collocations()
Building collocations list
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/text.py", line 341, in collocations
    ignored_words = stopwords.words('english')
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/corpus/util.py", line 68, in __getattr__
    self.__load()
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/corpus/util.py", line 56, in __load
    except LookupError: raise e
LookupError:
**********************************************************************
  Resource 'corpora/stopwords' not found. Please use the NLTK
  Downloader to obtain the resource: >>> nltk.download().
  Searched in:
    - '/usr/share/nltk'
    - '/Users/john/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************
[/code]

Now, the nice thing is that all you have to do is follow the directions, entering nltk.download() in the IDLE prompt, and you get:

[code lang=text]
showing info http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml
[/code]

which provides the following window:

[Screenshot: the NLTK Downloader window, 2013-01-26 at 09.08.15]

Clicking on the Corpora tab and scrolling down allows you to download the stopword list:

[Screenshot: the stopwords entry under the Corpora tab, 2013-01-26 at 09.08.43]
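For anyone who would rather skip the window, the same download can probably be scripted. What follows is only a sketch, assuming the NLTK is installed and the machine is online. The function name `ensure_stopwords` is made up for this example, while `nltk.data.find` and `nltk.download` are the library's own calls.

```python
def ensure_stopwords():
    """Fetch the stopwords corpus if it is missing, then return the English list.

    `ensure_stopwords` is a hypothetical helper name, not part of the NLTK.
    """
    # Imported inside the function so that merely defining it
    # does not require the NLTK to be present.
    import nltk
    from nltk.corpus import stopwords

    try:
        # Raises LookupError when the resource is not in any search path.
        nltk.data.find('corpora/stopwords')
    except LookupError:
        # Downloads only the one package instead of opening the GUI.
        nltk.download('stopwords')

    return stopwords.words('english')
```

Calling `ensure_stopwords()` once at the top of a session should leave `text.collocations()` with the resource it was asking for.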

What I have not yet figured out is how to specify your own stopword list. Part of what I want to teach my students is that choosing which words are important and which are not is a matter of subject-matter expertise, and thus something they should not turn over to someone else to do.
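One route that might work, offered as a sketch rather than as the NLTK's own mechanism, is to do the filtering in plain Python before handing the words to anything else. It assumes the story is already a list of word tokens. The names `my_stopwords` and `remove_stopwords` are invented for illustration, and the word list is just an example.

```python
# A hand-picked stopword list: which words go here is the student's call.
my_stopwords = {"the", "and", "a", "of", "to"}


def remove_stopwords(tokens, stopwords):
    """Return the tokens whose lowercased form is not in the stopword set."""
    return [tok for tok in tokens if tok.lower() not in stopwords]


tokens = ["The", "old", "man", "and", "the", "sea"]
filtered = remove_stopwords(tokens, my_stopwords)
print(filtered)  # ['old', 'man', 'sea']
```

This does not change what `text.collocations()` ignores internally. It only puts the decision about which words to drop in the students' hands before any counting begins.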