Open Source Tools for NLP

A recent correspondence on the Corpora-List began with a list of lists of tools for natural language processing (NLP). I am compiling the various references and links here in the hope of sorting them at some point.

The first reference is to a StackOverflow thread that asked which language, Java or Python, is better for NLP. The question is unanswerable, but in the course of the discussion a number of libraries for each language were mentioned. Link.

Apache has its own NLP functionality in Stanbol.

The University of Illinois’ Cognitive Computation Group has developed a number of NLP libraries for Java available on GitHub.

DKPro Core is a collection of software components for natural language processing (NLP) based on the Apache UIMA framework. It’s available on GitHub.

Corpus Tools is a portal dedicated to a number of software tools for corpus processing.

LIMA is developed in C++, runs on Windows and Linux (it can also be built on macOS), and supports tokenization, morphological analysis, POS tagging, parsing, SRL, NER, etc. The free version supports English and French. The closed version adds support for Arabic, German and Spanish. Experiments have been made on Portuguese, Chinese, Japanese and Russian. The developer promises that more languages will be added to the free version.

Nikola Milosevic noted that he was developing two tools aimed at processing tables in scientific literature: “They are a bit specific, since they take as input XMLs, currently from PMC and DailyMed, soon HTML reader will be implemented. The tools are TableAnnotator, a tool for disentangling tabular structure into a structured database with labeling functional areas (headers, stubs, super-rows, data cells), finding inter-cell relationships and annotations (can be made with various vocabularies, such as UMLS, WordNet or any vocabulary in SKOS format) and TabInOut, a tool that uses TableAnnotator and is actually a wizard for making information extraction rules.” He also notes that he has a couple of other open source tools: a stemmer for Serbian and Marvin, a flexible annotation tool that can use UMLS/MetaMap, WordNet or a SKOS vocabulary as the annotation source for text.

There is also the SAFAR framework dedicated to ANLP (Arabic Natural Language Processing). It is free, cross-platform and modular. It includes: resources needed for different ANLP tasks, such as lexicons, corpora and ontologies; modules for the basic levels of language, especially of the Arabic language, namely morphology, syntax and semantics; ANLP applications; and utilities such as sentence splitting, tokenization, transliteration, etc.

The Centre of Language and Speech Technology, Radboud University Nijmegen, has a suite of tools, all GPLv3, that are available from their LaMachine distribution for easy installation. Many are also in the upcoming Debian 9 (as part of debian-science), in the Arch User Repository, and in the Python Package Index where appropriate.

  • FoLiA: Format for Linguistic Annotation is an extensive and practical format for linguistically annotated resources. Programming libraries available for Python (https://github.com/proycon/pynlpl/) and C++ (https://github.com/LanguageMachines/libfolia)
  • FLAT: FoLiA Linguistic Annotation Tool. A comprehensive web-based linguistic annotation tool.
  • Ucto is a regular-expression based tokeniser with rules for various languages. Written in C++. Python binding available as well. Supports the FoLiA format.
  • Frog is a suite of NLP tools for Dutch (POS tagging, lemmatisation, NER, dependency parsing, shallow parsing, morphological analysis). Written in C++. Python binding available. Supports the FoLiA format.
  • Timbl is a memory-based machine learning package (k-NN, IB1, IGTree).
  • Colibri Core is an NLP tool as well as a C++ and Python library for working with basic linguistic constructions such as n-grams and skipgrams (i.e. patterns with one or more gaps, either of fixed or dynamic size) in a quick and memory-efficient way.
  • Gecco (Generic Environment for Context-Aware Correction of Orthography).
  • CLAM can quickly turn command-line applications into RESTful webservices with a web-application front-end.
  • LuigiNLP is a new and still experimental NLP pipeline system built on top of SciLuigi, which is in turn built on Spotify’s Luigi.
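
For readers unfamiliar with the n-grams and skipgrams mentioned in the Colibri Core entry above, here is a minimal sketch in plain Python. It does not use Colibri Core or its API; the function names, the `GAP` marker and the parameters are all assumptions made for illustration, and only a single fixed-size gap is covered (not gaps of dynamic size).

```python
from typing import Iterator, Sequence, Tuple

GAP = "{*}"  # hypothetical gap marker chosen for this sketch


def ngrams(tokens: Sequence[str], n: int) -> Iterator[Tuple[str, ...]]:
    """Yield every contiguous run of n tokens."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])


def skipgrams(
    tokens: Sequence[str], n: int, gap_start: int, gap_len: int
) -> Iterator[Tuple[str, ...]]:
    """Yield n-token windows in which gap_len tokens, starting at
    position gap_start inside the window, are replaced by GAP."""
    # The gap must be non-empty and must not touch either edge of the window.
    if gap_len < 1 or gap_start < 1 or gap_start + gap_len >= n:
        raise ValueError("the gap must lie strictly inside the pattern")
    for gram in ngrams(tokens, n):
        yield gram[:gap_start] + (GAP,) * gap_len + gram[gap_start + gap_len:]


tokens = "to be or not to be".split()
bigrams = list(ngrams(tokens, 2))
trigram_skips = list(skipgrams(tokens, 3, gap_start=1, gap_len=1))
```

With the sentence above, `bigrams` holds the five adjacent word pairs, and `trigram_skips` holds patterns such as `("to", "{*}", "or")`, where the middle word is left open. A dedicated library would store and count such patterns far more compactly than these Python tuples.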
