Syndetic

Ever since Dick Bauman used the word polysyndetic in an email to me to describe the kind of chaining together of lines achieved by a quotation like “he said”, I have held in my head a definition of syndetic as a kind of discursive conjunction. Returning to the essay in which I used the term Bauman proffered, and now working with computational approaches to text analysis, I found myself wanting to make sure I understood it a bit better. It turns out that syndetic is used in three different domains, and the way its various meanings overlap fascinates me.

In the sense that Bauman introduced to me, syndetic coordination in linguistics is the grammatical coordination of syntactic elements with the help of a coordinating conjunction, e.g., “Peter, Paul, and Mary” or “Spam and eggs, please!” As a unit, the elements with the conjunction are called a syndeton. Elements coordinated without conjunctions form an asyndeton, or, I assume, are asyndetic, and those with multiple conjunctions are, as my texts may have been, polysyndetons or polysyndetic, e.g.:

And St. Attila raised his hand grenade up on high, saying, ‘O Lord, bless this thy hand grenade that with it thou mayest blow thine enemies to tiny bits, in thy mercy.’ And the Lord did grin, and people did feast upon the lambs and sloths and carp and anchovies and orangutans and breakfast cereals and fruit bats and…

Among archivists, syndetic relationships are conceptual connections between terms, including genus (broader than), species (narrower than), nonpreferred equivalence (use, see), preferred equivalent term (used for), and associated term (related term). The relationships, and the manner in which they are organized, are described as the syndetic structure.

Finally, among mathematicians, syndetic sets are subsets of the natural numbers having the property of “bounded gaps”: the sizes of the gaps in the sequence are bounded. Where to go from there … I haven’t a clue.
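Still, for the record, the formal version, as best I understand it (my own gloss, so caveat lector), goes something like this:

\[
S \subseteq \mathbb{N} \text{ is syndetic} \iff \exists k \in \mathbb{N} \text{ such that } S \cap \{n, n+1, \ldots, n+k\} \neq \varnothing \text{ for all } n \in \mathbb{N}.
\]

So the even numbers are syndetic, since no gap exceeds two, while the primes are not, since prime gaps grow arbitrarily large.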

So, really, all I have is that three things that are part of my current world (textuality, ontology, and mathematical sets) all seem to converge in the term syndetic. If only I could find the mystery that this key was meant to unlock…

More on Narrative Arcs

–from Scientific American online: “Great literature is surprisingly arithmetic”

Structuralica

I am in the midst of teaching a seminar on narrative studies, and no such course could occur without reference to the work of the diverse thinkers who get grouped under the heading of structuralism. So it was quite a delight to find on the DARIAH Winter School page, which is thoughtfully designed as a single page of abstracts with slides linked as PDFs, links to two inter-related projects: Structuralica, a repository of structuralist works, and Acta Structuralica, an open-access journal for structuralist research.

NASA Software

NASA recently reminded the public that it has a rather extensive library of software that is free to download and use:

NASA has released its 2017-2018 software catalog, which offers an extensive portfolio of software products for a wide variety of technical applications, all free of charge to the public, without any royalty or copyright fees.

Available in both hard copy and online, this third edition of the publication has contributions from all the agency’s centers on data processing/storage, business systems, operations, propulsion and aeronautics. It includes many of the tools NASA uses to explore space and broaden our understanding of the universe. A number of software packages are being presented for release for the first time. Each catalog entry is accompanied with a plain language description of what it does.

Open Source Tools for NLP

A recent correspondence on the Corpora-List began with a list of lists for doing natural language processing (NLP). I am collecting/compiling the various references and links here in the hope of sorting them at some point.

The first reference is to a StackOverflow thread that wondered which language, Java or Python, was better for NLP. The question is unanswerable, but in the process of deliberating, a number of libraries for each language were discussed. Link.

Apache has its own NLP functionality in Stanbol.

The University of Illinois’ Cognitive Computation Group has developed a number of NLP libraries for Java available on GitHub.

DKPro Core is a collection of software components for natural language processing (NLP) based on the Apache UIMA framework. It’s available on GitHub.

Corpus Tools is a portal dedicated to a number of software tools for corpus processing.

LIMA is developed in C++, works under Windows and Linux (it is possible to build it on macOS), and supports tokenization, morphological analysis, POS tagging, parsing, SRL, NER, etc. The free version supports English and French; the closed one adds support for Arabic, German, and Spanish. Experiments have been made on Portuguese, Chinese, Japanese, and Russian. The developer promises that more languages will be added to the free version.

Nikola Milosevic noted that he was developing two tools aimed at processing tables in scientific literature: “They are a bit specific, since they take as input XMLs, currently from PMC and DailyMed; soon an HTML reader will be implemented. The tools are TableAnnotator, a tool for disentangling tabular structure into a structured database, labeling functional areas (headers, stubs, super-rows, data cells) and finding inter-cell relationships and annotations (which can be made with various vocabularies, such as UMLS, WordNet, or any vocabulary in SKOS format), and TabInOut, a tool that uses TableAnnotator and is actually a wizard for making information extraction rules.” He also notes that he has a couple of other open source tools: a stemmer for Serbian and Marvin, a flexible annotation tool that can use UMLS/MetaMap, WordNet, or any SKOS vocabulary as an annotation source for text.

There is also the SAFAR framework dedicated to ANLP (Arabic Natural Language Processing). It is free, cross-platform, and modular. It includes: resources needed for different ANLP tasks, such as lexicons, corpora, and ontologies; modules for the basic levels of language, especially of Arabic, namely morphology, syntax, and semantics; applications for ANLP; and utilities such as sentence splitting, tokenization, transliteration, etc.

The Centre for Language and Speech Technology at Radboud University Nijmegen has a suite of tools, all GPLv3, that are available from their LaMachine distribution for easy installation. Many are also in the upcoming Debian 9 (as part of debian-science), in the Arch User Repository, and on the Python Package Index where appropriate (see the note after the list).

  • FoLiA: Format for Linguistic Annotation is an extensive and practical format for linguistically annotated resources. Programming libraries are available for Python (https://github.com/proycon/pynlpl/) and C++ (https://github.com/LanguageMachines/libfolia).
  • FLAT, the FoLiA Linguistic Annotation Tool, is a comprehensive web-based linguistic annotation tool.
  • Ucto is a regular-expression-based tokeniser with rules for various languages. Written in C++; a Python binding is available as well. Supports the FoLiA format.
  • Frog is a suite of NLP tools for Dutch (POS tagging, lemmatisation, NER, dependency parsing, shallow parsing, morphological analysis). Written in C++; a Python binding is available. Supports the FoLiA format.
  • Timbl is a memory-based machine learning engine (k-NN, IB1, IGTree).
  • Colibri Core is an NLP tool as well as a C++ and Python library for working with basic linguistic constructions such as n-grams and skipgrams (i.e., patterns with one or more gaps, either of fixed or dynamic size) in a quick and memory-efficient way.
  • Gecco (Generic Environment for Context-Aware Correction of Orthography).
  • CLAM can quickly turn command-line applications into RESTful web services with a web-application front end.
  • LuigiNLP is a new and still-experimental NLP pipeline system built on top of SciLuigi, which is in turn built on Spotify’s Luigi.
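
As for that note: several of these are indeed on PyPI, so getting started can be as simple as the line below. I am assuming here that the FoLiA library, pynlpl, is the one you want; the others that have been packaged follow the same pattern:

pip install pynlpl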

Batch Converting DOCX Files

My students live in a Microsoft universe, for the most part. I don’t blame them: it’s what their parents and teachers know. And I blame those same adults in their lives for not teaching them how to do anything more powerful with that software, turning Word into nothing more than a typewriter with the ability to format things in an ad hoc fashion. Style sheets! Style sheets! Style sheets! As a university professor, I duly collect their Word documents, much as I would collect their printed documents, and I read them, mark on them, and hand them back. Yawn.1

Sometimes, just to play with them, I take all their papers and I mine them for patterns: words and phrases and topics that occur across a number of papers. You can’t do that with Word documents, so you need to convert them into something more useful. (And, honestly, much of what my students turn in could be done in plain text and we would all be better off.)

On a Mac, textutil does the trick nicely:

textutil -convert txt ./MyDocxFiles/*.docx

I generally then select all the text files and move them to their own directory, where, for some forms of mining, I simply lump them into one big file:

cat ./texts/*.txt > alltexts.txt
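
Once everything is in one file, even a crude shell pipeline will surface the most frequent words. This is only a sketch, mind you; anything serious wants real tokenization and stopword removal:

# lowercase everything, split on non-letters, then count and rank
tr '[:upper:]' '[:lower:]' < alltexts.txt | tr -cs '[:alpha:]' '\n' | sort | uniq -c | sort -rn | head -20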

(I should probably figure out how to do the “convert to text” and “place in another directory” in one command line.)
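Something like the following loop seems as if it should do both at once, using textutil’s -output flag to name each file’s destination. Untested beyond my own machine, and it assumes ./texts already exists:

# convert each .docx and drop the .txt directly into ./texts/
for f in ./MyDocxFiles/*.docx; do
  textutil -convert txt -output "./texts/$(basename "$f" .docx).txt" "$f"
done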

pandoc can also do this, and I need to figure that syntax out.
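
For future me, here is what I believe the pandoc equivalent looks like: pandoc infers the input format from the .docx extension, and -t plain asks for plain text. Same assumption that ./texts already exists:

# same batch conversion, via pandoc
for f in ./MyDocxFiles/*.docx; do
  pandoc "$f" -t plain -o "./texts/$(basename "$f" .docx).txt"
done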


  1. I also sit through their prettily formatted but also fairly substance-less PowerPoints — I’m not just picking on them here: I also work with them on making such presentations more meaningful. 

Cultural Mechanics

Kudos to James O’Sullivan for a title so great I want to steal it: Cultural Mechanics is his podcast focusing on a really diverse range of digital humanities and digital arts topics. (Right now I would say it’s more digital arts in nature, but that may not be his overall focus.) Here it is on SoundCloud.