The ACL Anthology “currently hosts 63797 papers on the study of computational linguistics and natural language processing.” That is a lot of subjects covered. There is no immediate search interface, but you can download the full anthology as BibTeX (only 5.35MB) and search that; if you throw in abstracts, it’s only 13MB.
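Once downloaded, the BibTeX dump is easy to search with a few lines of Python. A rough sketch (the `search_bibtex` helper and the sample entries are illustrative, not anything the Anthology itself provides):

```python
import re

def search_bibtex(bibtex, term):
    """Return the titles of entries whose text contains `term` (case-insensitive)."""
    # Split on '@' at the start of a line, which opens each BibTeX entry.
    entries = re.split(r'(?m)^@', bibtex)
    hits = []
    for entry in entries:
        if term.lower() in entry.lower():
            m = re.search(r'title\s*=\s*[{"](.+?)[}"]', entry)
            if m:
                hits.append(m.group(1))
    return hits

sample = """
@inproceedings{ex1,
  title = {Neural Machine Translation},
  year = {2016}
}
@inproceedings{ex2,
  title = {A Corpus Study of Discourse},
  year = {2014}
}
"""
print(search_bibtex(sample, "translation"))  # ['Neural Machine Translation']
```

A real BibTeX parser would handle nested braces and string macros, but for keyword search over the anthology file this crude split is usually enough.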
*Text Analytics APIs 2018: A Consumer Guide* is $895 for a single-user license. At 299 pages, that’s about $3 per page. The blurb notes that:
> Robert Dale is an internationally-recognized expert in Natural Language Processing, with three decades of experience in academia and industry. With a PhD from the University of Edinburgh, he’s worked for Microsoft and Nuance, and he’s driven the development of SaaS-based NLP software for a startup. He has taught at the University of Edinburgh in the UK and at Macquarie University in Sydney, and presented tutorials and summer school courses around the world. He has over 150 peer-reviewed publications, including a comprehensive Handbook of Natural Language Processing, and the de facto textbook Building Natural Language Generation Systems.
A recent correspondence featured on the Corpora-List began with a list of lists for doing natural language processing (NLP). I am collecting/compiling the various references and links here with the hope of sorting them at some point in time.
The first reference is to a StackOverflow thread asking which language, Java or Python, is better for NLP. The question is unanswerable, but in the course of deliberating, a number of libraries for each language were discussed (link).
Apache has its own NLP functionality in Stanbol.
The University of Illinois’ Cognitive Computation Group has developed a number of NLP libraries for Java available on GitHub.
DKPro Core is a collection of software components for natural language processing (NLP) based on the Apache UIMA framework. It’s available on GitHub.
Corpus Tools is a portal dedicated to a number of software tools for corpus processing.
LIMA is developed in C++, runs under Windows and Linux (and can be built on macOS), and supports tokenization, morphological analysis, POS tagging, parsing, SRL, NER, etc. The free version supports English and French; the closed one adds Arabic, German, and Spanish, and experiments have been made on Portuguese, Chinese, Japanese, and Russian. The developer promises that more languages will be added to the free version.
Nikola Milosevic noted that he was developing two tools aimed at processing tables in scientific literature: “They are a bit specific, since they take as input XMLs, currently from PMC and DailyMed; soon an HTML reader will be implemented.” The tools are TableAnnotator, which disentangles tabular structure into a structured database by labeling functional areas (headers, stubs, super-rows, data cells) and finding inter-cell relationships and annotations (which can be made with various vocabularies, such as UMLS, WordNet, or any vocabulary in SKOS format), and TabInOut, which uses TableAnnotator and is essentially a wizard for making information extraction rules. He also notes that he has a couple of other open-source tools: a stemmer for Serbian and Marvin, a flexible annotation tool that can use UMLS/MetaMap, WordNet, or a SKOS vocabulary as its annotation source for text.
There is also the SAFAR framework dedicated to ANLP (Arabic Natural Language Processing). It is free, cross-platform and modular. It includes: resources needed for different ANLP tasks such as lexicons, corpora and ontologies; basic levels modules of language, especially those of the Arabic language, namely morphology, syntax and semantics; applications for the ANLP; and utilities such as sentence splitting, tokenization, transliteration, etc.
The Centre for Language and Speech Technology at Radboud University Nijmegen has a suite of tools, all GPLv3, available from their LaMachine distribution for easy installation. Many are also in the upcoming Debian 9 as part of debian-science, in the Arch User Repository, and on the Python Package Index where appropriate.
- FoLiA: Format for Linguistic Annotation is an extensive and practical format for linguistically annotated resources. Programming libraries available for Python (https://github.com/proycon/pynlpl/) and C++ (https://github.com/LanguageMachines/libfolia)
- FLAT: FoLiA Linguistic Annotation Tool is a comprehensive web-based linguistic annotation tool.
- Ucto is a regular-expression based tokeniser with rules for various languages. Written in C++. Python binding available as well. Supports the FoLiA format.
- Frog is a suite of NLP tools for Dutch (POS tagging, lemmatisation, NER, dependency parsing, shallow parsing, morphological analysis). Written in C++; a Python binding is available, and it supports the FoLiA format.
- Timbl is a memory-based machine learning engine (k-NN, IB1, IGTree).
- Colibri Core is an NLP tool as well as a C++ and Python library for working with basic linguistic constructions such as n-grams and skipgrams (i.e. patterns with one or more gaps, either of fixed or dynamic size) in a quick and memory-efficient way.
- Gecco (Generic Environment for Context-Aware Correction of Orthography).
- CLAM can quickly turn command-line applications into RESTful webservices with web-application front-end.
- LuigiNLP is a still new and experimental NLP Pipeline system built on top of SciLuigi, and in turn on Spotify’s Luigi.
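To make the n-gram and skipgram terminology from the Colibri Core entry concrete, here is a toy pure-Python sketch of the two pattern types (Colibri Core itself does this at scale in compressed form; these helpers are illustrative only, and the skipgram version shows just the simplest case of a single fixed-size gap):

```python
def ngrams(tokens, n):
    """All contiguous n-token patterns in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def skipgrams(tokens):
    """Trigrams with the middle token replaced by a gap marker:
    the simplest fixed-size skipgram."""
    return [(a, '{*}', c) for a, _, c in ngrams(tokens, 3)]

tokens = "to be or not to be".split()
print(ngrams(tokens, 2))   # [('to', 'be'), ('be', 'or'), ...]
print(skipgrams(tokens))   # [('to', '{*}', 'or'), ...]
```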
David McClosky wrote to the Corpora List with the following news:
I’m happy to announce two new Python packages for parsing to Stanford Dependencies. The first is PyStanfordDependencies https://github.com/dmcc/PyStanfordDependencies which is a Python interface for converting Penn Treebank trees to Stanford Dependencies. It is designed to be easy to install and run (by default, it will download and use the latest version of Stanford Dependencies for you):
    import StanfordDependencies
    sd = StanfordDependencies.get_instance(version='3.4.1')
    sent = sd.convert_tree('(S1 (NP (DT some) (JJ blue) (NN moose)))')

which yields a token list whose dependency tree can be rendered as:

    moose [root]
     +-- some [det]
     +-- blue [amod]
PyStanfordDependencies also includes a basic library for reading, manipulating, and producing CoNLL-X style dependency trees.
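For reference, CoNLL-X is just ten tab-separated columns per token, one sentence per blank-line-delimited block. A minimal reader for the fields that matter most might look like this (the `parse_conllx` helper is a hypothetical sketch, not part of the package’s API):

```python
def parse_conllx(block):
    """Parse one CoNLL-X sentence block into a list of token dicts."""
    tokens = []
    for line in block.strip().splitlines():
        cols = line.split('\t')  # ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL
        tokens.append({
            'index': int(cols[0]),   # token position, 1-based
            'form': cols[1],         # surface form
            'head': int(cols[6]),    # index of the governing token, 0 = root
            'deprel': cols[7],       # dependency relation label
        })
    return tokens

sent = "1\tsome\tsome\tDT\tDT\t_\t3\tdet\t_\t_\n" \
       "2\tblue\tblue\tJJ\tJJ\t_\t3\tamod\t_\t_\n" \
       "3\tmoose\tmoose\tNN\tNN\t_\t0\troot\t_\t_"
for tok in parse_conllx(sent):
    print(tok['form'], tok['head'], tok['deprel'])
```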
The second package is an updated version of bllipparser https://github.com/BLLIP/bllip-parser (better known as the Charniak-Johnson reranking parser). bllipparser gives you access to the parser and reranker from Python. The most recent update integrates bllipparser with PyStanfordDependencies, allowing you to parse straight from text to Stanford Dependencies. It also adds tools for reading and manipulating Penn Treebank trees.
More information is available in the READMEs. Feedback, bug reports, and feature requests are welcome (please use the GitHub issue trackers for the latter two).
[OpeNER] is a language analysis toolchain helping (academic) researchers and companies make sense out of “natural language analysis”. It consists of easy-to-install, easy-to-improve, and easy-to-configure components to:
* Detect the language of a text
* Tokenize texts
* Determine polarisation of texts (sentiment analysis) and detect what topics are included in the text.
* Detect entities named in the texts and link them together. (e.g. President Obama or The Hilton Hotel)
The supported language set currently consists of English, Spanish, Italian, German, and Dutch.
Besides the individual components, guidelines exist on how to add languages and how to adjust components for specific situations and topics.
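The first step in that pipeline, language detection, can be illustrated with a naive stopword-overlap heuristic (a toy sketch, not how OpeNER actually does it; production detectors typically use character n-gram models):

```python
# Tiny hand-picked stopword profiles; purely illustrative.
PROFILES = {
    'en': {'the', 'and', 'of', 'is', 'in', 'to'},
    'es': {'el', 'la', 'de', 'es', 'en', 'y'},
    'nl': {'de', 'het', 'een', 'en', 'van', 'is'},
}

def guess_language(text):
    """Pick the profile with the largest stopword overlap."""
    words = set(text.lower().split())
    return max(PROFILES, key=lambda lang: len(words & PROFILES[lang]))

print(guess_language("the moose is in the garden"))  # en
print(guess_language("el perro es de la casa"))      # es
```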
I’m still trying to figure out what all I can do with the [Saffron] browser/visualizer. It claims to analyze the research communities of natural language processing, information retrieval, and the semantic web through “text mining and linked data principles.”
The list of research domains is rather short and under-explained for the uninitiated:
I clicked on [ANLP], which is *applied natural language processing*, and got both a list of hot topics:
As well as a taxonomy network/tree that offers labels when you hover over nodes, which are themselves clickable links:
Clicking on one of the “hot topics,” in this case [natural language text], gives you a bar chart of the frequency of the topic in documents for the past thirty years:
A list of similar topics:
A list of experts:
And a list of publications:
Like a lot of browsers of this kind, the static presentation of results impoverishes the very exploration it encourages. I also haven’t explored its inputs: I wonder how full or complete its historical record is.
[natural language text]: http://saffron.deri.ie/acl_anlp/topic/natural_language_text/
[Reproducing NLP Research](http://wordpress.let.vupr.nl/reproducingnlpresearch/) is a new website “meant as a platform to share experiences, ideas and tips related to reproducing research results in NLP.”
[Josh Constine at TechCrunch has an article](http://techcrunch.com/2012/07/02/message-war/) about what he is calling the “message war” that Google, Apple, and Facebook are either already waging or are about to wage. While I rolled my eyes over the somewhat hyperbolic nature of the piece — it is TechCrunch and the world is always about to end or be revolutionized (sometimes at the same time) — I did find the following bit fascinating:
> People love content, but people need direct communication. Who you communicate with on a daily basis and via what medium are vital signals regarding where people sit in your social graph. Whichever company owns the most of this data will have better ways to refine the relevance of their content streams, showing you updates by the people you care about aka communicate with most, and showing ads nearby. Through natural language processing and analysis, whoever controls messages will also get to machine-read all of them and target you with ads based on what you’re talking about.
The social graph has become a cliché, at least among the technorati, but it is still powerful information that companies would like to have in order to market to us better, perhaps on an individual basis. The nature of our relationships, as realized in actual messages, has always, or so most of us have felt, been somewhat sacrosanct, off-limits, for us alone to know.
Well, that isn’t necessarily the case, since Google has always made a point of saying the ads shown through the web interface for its Gmail service are based, in some fashion, on the content of those e-mails. Like a lot of people, I have a Gmail account, but it is strictly used as a channel for people I don’t know or who need pro forma contact information. (Site registrations, software licenses, and the like.) Thus, what Google gleans about me from reading my Gmail account is rather one-dimensional.
But I do text, and when I do text, it is with those closest to me, which is why I assume everybody wants access to that data. More interestingly, the way they are going to access that data is through a technology that I myself am interested in, *natural language processing*.
The world just keeps getting more and more interesting.
I started [a thread on Stackoverflow] as I try to determine how to write a Python script using the Natural Language Toolkit that will write the concordance for a term out to a file. Here’s the script as it stands:
    #! /usr/bin/env python

    import sys
    import nltk

    # First we have to open and read the file:
    thefile = open('all_no_id.txt')
    raw = thefile.read()

    # Second we have to process it with nltk functions to do what we want:
    tokens = nltk.wordpunct_tokenize(raw)
    text = nltk.Text(tokens)

    # Now we can actually do stuff with it:
    concord = text.concordance("cultural", 75, sys.maxint)

    # Now to save this to a file:
    fileconcord = open('ccord-cultural.txt', 'w')
Eventually I hope to have a script that will ask me for the `source text` and the `term` to be put in context and that will then generate a `text` file with the name of the term in it.
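The underlying concordance logic is simple enough to sketch in plain Python, independent of NLTK (the `concordance` helper below is hypothetical, not NLTK’s own implementation, which prints its lines rather than returning them):

```python
def concordance(tokens, term, width=75):
    """Return each occurrence of `term` centered in a window of roughly `width` characters."""
    lines = []
    half = width // 2
    for i, tok in enumerate(tokens):
        if tok.lower() == term.lower():
            left = ' '.join(tokens[:i])[-half:]       # trailing context chars
            right = ' '.join(tokens[i + 1:])[:half]   # leading context chars
            lines.append('%s %s %s' % (left, tok, right))
    return lines

tokens = "the cultural turn and the cultural critique".split()
for line in concordance(tokens, "cultural", width=20):
    print(line)
```

Because the function returns a list instead of printing, writing the result to a file reduces to a single `writelines` call.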
I should note that one of the respondents has already pointed me to a thread on the [NLTK discussion group], which I knew existed but had somehow managed not to find.
If you’re interested in the discussion group, here’s its [home page] in the new Google Groups format. (It’s an ugly URL, to be sure.)
**Update**: [NLTK is now on GitHub]. Some of the [documentation], from what I can tell, is in TeX. The NLTK book, which I own as an O’Reilly codex and epub, is also on GitHub, as is [an NLTK repository], which appears to be empty for now.
If you’re interested in the book: [visit O’Reilly’s site][site], where you can purchase it in a variety of formats, codex or electronic. The great thing about the e-versions is that you can pick and choose from PDF, epub, or mobi, which means I can have the PDF on my iPad and the epub on my phone and the mobi on my Kindle. If you really only want to deal with Amazon, then if you follow [this link][amz], I will get a small commission.
Just a quick list of natural language processing resources for Ruby:
* MIT has a list of [AI-related Ruby extensions](http://web.media.mit.edu/~dustin/papers/ai_ruby_plugins/).
* Jason Adams, who does “opinion mining for a startup in Atlanta”, has a list of [NLP Resources for Ruby](http://mendicantbug.com/2009/09/13/nlp-resources-for-ruby/).
* Nick Sieger has a post on [RubyConf: Natural language generation and processing in Ruby](http://blog.nicksieger.com/articles/2006/10/22/rubyconf-natural-language-generation-and-processing-in-ruby).
* Finally, there are a couple of papers that mention Ruby and NLP: [“Trust Region Newton Method for Large-Scale Logistic Regression”](http://ntucsu.csie.ntu.edu.tw/~cjlin/papers/logistic.pdf) and [“Natural language question answering: the view from here”](http://www.loria.fr/~gardent/applicationsTAL/papers/jnle-qa.pdf).
A while ago I tweeted a note to Digital Humanities Questions and Answers about putting together a Python or R script for getting a word frequency distribution for a text. The short explanation for why I want to do this is that it is one way to develop drop, aka stop, lists in order to tweak network analysis of texts or visualization of those texts using techniques like a word cloud. I am interested in a Python or R script in particular because I want my solution to be platform independent, so that students in my digital humanities seminar can use the scripts no matter what platform they use. (I had come across some useful `bash` scripts, but that limits their use to *nix platforms like Mac OS X or Linux.)
Handily enough, a word frequency distribution function is available as part of the Python Natural Language Toolkit (NLTK) — the same functionality is also baked into R, as John Anderson demonstrated — but I am focusing any scripting acumen development for now on Python.
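At heart, a frequency distribution is just token counting, and the standard library can sketch the idea (this is a conceptual stand-in, not NLTK’s `FreqDist` implementation):

```python
from collections import Counter

text = "the quick brown fox jumps over the lazy dog the fox"
freq = Counter(text.lower().split())

# The most common words are the usual stop-list candidates:
print(freq.most_common(3))  # [('the', 3), ('fox', 2), ...]
```

NLTK’s `FreqDist` adds plotting and corpus conveniences on top, but the counting itself is this simple.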
### Getting up and running with NLTK
To get up and running with NLTK in Python, you first need a fairly recent version of Python: 2.4 or better. (My MacBook is running 2.6.1, which is acceptable, and I’m not good enough, yet, to update.)
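Before installing anything, you can confirm which interpreter you have; a quick check that works in both Python 2 and 3 (the 2.4 floor is the requirement stated above for NLTK of this era):

```python
import sys

# NLTK at the time required Python 2.4 or better.
print(sys.version_info[:2])
if sys.version_info < (2, 4):
    raise SystemExit("This Python is too old for NLTK")
```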
In addition to a recent version of Python, and in addition to the NLTK (more on that in a moment), you also need PyYAML. All the downloads for PyYAML are available here: http://pyyaml.org/download/pyyaml/. (Please note that from here on out I am describing the installation process for Mac OS X: the Windows routine uses different flavors of these resources — there is a PyYAML executable installer, for example.)
Download the tarballed and gzipped package and unpack it some place convenient. (You are going to delete it when you are done, so the place doesn’t matter.) I put my copy on the desktop, and so, having unpacked it, I navigated to its location in a terminal window:
`% cd /Users/me/Desktop/PyYAML-3.09`
(Please note that the presence of the `%` sign is simply to indicate that we are using the command line.) Once there, you run the setup module:
`% sudo python setup.py install`
From there, a whole lot of making and installing takes place as your terminal window scrolls quickly. It’s done within seconds. Now you need to download the appropriate NLTK file; mine was here:
This time it’s a GUI-based installer package. Follow the instructions, click on things, and you are done.
To check that everything got done that needed to get done, return to your terminal window and invoke the Python interpreter:

`% python`
At the Python interpreter prompt (`>>>`), type:
`>>> import nltk`
If everything went well, all you will get is a momentary pause, if any, and another interpreter prompt. Congratulations!