wgetting TED Talk Transcripts

As I work through Matt Jockers’ material on sentiment analysis of stories — I’m not quite prepared to call it the shape of stories — I decided it would be interesting to try syuzhet out on some non-narrative materials to see what shapes turn up. A variety of possibilities ran through my head, but the one that stood out was, believe it or not, TED talks! Think about it. TED talks are a well-established genre with a stable structure/format. Text-to-text comparison shouldn’t really invite too many possible errors on my part — this is always important for me. Moreover, in 2010 Sebastian Wernicke assessed the corpus as it stood at that time, and so perhaps a revision of that early assessment might be due.

The next step was figuring out how to download all the transcripts. The URLs all looked like this:

https://www.ted.com/talks/name_of_talk/transcript?language=en

While I would love it if this worked:

wget -r -l 1 -w 2 --limit-rate=20k https://www.ted.com/talks/*/transcript?language=en

It doesn’t. wget is flexible, however, and if you feed it a list of URLs, it will work its way through that list. Fortunately, a quick web search turned up a post on Open Culture describing a list of the 1756 TED Talks available in 2014. As luck would have it, the Google Spreadsheet is still being maintained.

I downloaded the spreadsheet as a CSV file and then simply grabbed the column of URLs using Numbers. (This could have been done with pandas but it would have taken more time, and I didn’t need to automate this part of the process.) The URLs were to the main page for each talk, and not the transcript, but all I needed to do was to add the following to the end of each line:

/transcript?language=en

Which I did with some of the laziest regex ever. I then cd’ed into the directory I had created for the files and ran this:

wget -w 2 -i ~/Desktop/talk_list.txt
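
For the record, the suffix-appending step could also have been a sed one-liner instead of lazy regex; this is a sketch, with a stand-in input file in place of my actual talk_list.txt:

```shell
# Stand-in for the real list, which holds one talk URL per line:
printf 'https://www.ted.com/talks/example_talk\n' > talk_list.txt
# Append the transcript suffix to the end of every line:
sed 's|$|/transcript?language=en|' talk_list.txt > talk_list_full.txt
```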

What remains now is to use Beautiful Soup to rename the files using the html title tag and to get rid of everything but the actual transcript. Final report from wget:

FINISHED --2016-05-18 16:16:52--
Total wall clock time: 2h 14m 51s
Downloaded: 2114 files, 153M in 3m 33s (735 KB/s)
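
That Beautiful Soup step might look roughly like this. The function name and the flat get_text() call are my own assumptions, and isolating just the transcript (rather than all page text) would still require inspecting TED’s actual markup:

```python
# Sketch of the planned cleanup: rename each download by its <title>
# and keep only the page text. Paths and names here are hypothetical.
import os
from bs4 import BeautifulSoup

def clean_transcript(path):
    with open(path, encoding="utf-8") as f:
        soup = BeautifulSoup(f.read(), "html.parser")
    # Fall back to the original filename if there is no <title>
    title = soup.title.string.strip() if soup.title else os.path.basename(path)
    # Build a filesystem-safe filename from the title
    safe = "".join(c if c.isalnum() or c in " -_" else "_" for c in title)
    text = soup.get_text(separator="\n")
    with open(safe + ".txt", "w", encoding="utf-8") as out:
        out.write(text)
```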

Voyant Tools 2

Voyant Tools is now in its second major version, which means that it runs more readily on more devices. While you can use Voyant on-line, the cooler part is running it yourself. Download it from GitHub and give it a try. A huge thanks to Geoffrey Rockwell, Stéfan Sinclair, and the rest of their group for all the hard work and the commitment to improving Voyant Tools year after year.

csplit < awk

I regularly need to split larger text files into smaller text files, or chunks, in order to do some kind of text analysis/mining. I know I could write a Python script that would do this, but that often involves a lot more scripting than I want, and I’m lazy, and there’s also this thing called csplit which should do the trick. I’ve just never mastered it. Until now.

Okay, so I want to split a text file I’ll call excession.txt (because I like me some Banks). Let’s start building the csplit line:

csplit -f excession excession.txt 'Culture 5' '{*}'

… Apparently I still haven’t mastered it. But this bit of awk worked right away:

awk '/Culture 5 - Excession/{filename=NR"excession"}; {print >filename}' excession.txt
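
In hindsight, the likely culprit is that csplit wants its regex wrapped in slashes. Something like this might have worked (shown here with a stand-in input file, not retested on the real novel):

```shell
# Stand-in input file in place of the actual converted novel:
printf 'Front matter\nCulture 5 - Excession\nChapter text\n' > excession.txt
# csplit patterns are /regex/; '{*}' repeats the split for every match.
csplit -f excession excession.txt '/Culture 5/' '{*}'
```

Note, too, that the awk version names each chunk by the line number where the match occurred, and it would choke if the very first line failed to match (filename would still be empty).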

For the record, I’m interested in working with the Culture novels of Iain M. Banks. I am converting MOBI files into EPUBs using Calibre, and then into plain text files. No, I cannot make these available to anyone, so please don’t ask.

The Culture series:

  1. Consider Phlebas (1987)
  2. The Player of Games (1988)
  3. Use of Weapons (1990)
  4. The State of the Art (1991)
  5. Excession (1996)
  6. Inversions (1998)
  7. Look to Windward (2000)
  8. Matter (2008)
  9. Surface Detail (2010)
  10. The Hydrogen Sonata (2012)


Open Refine

[Open Refine][] is a “tool for working with messy data: cleaning it; transforming it from one format into another; extending it with web services; and linking it to databases.” The link takes you to a page with lots of video tutorials. There is also Thomas Padilla’s [Getting Started with OpenRefine][].

[Open Refine]: http://openrefine.org
[Getting Started with OpenRefine]: http://thomaspadilla.org/dataprep/

Orange Textable

There’s a new visual programming interface (language?) for text analysis in town and it’s [Orange Textable][]: “Orange Textable is an open-source add-on bringing advanced text-analytical functionalities to the Orange Canvas visual programming environment (itself open-source). It essentially enables users to build data tables on the basis of text data, by means of a flexible and intuitive interface.” Looking through the documentation, it reminds me of something like the [MEANDRE/SEASR][] infrastructure/application setup from the NCSA (National Center for Supercomputing Applications) a few years ago. (The project has disappeared from both the NCSA and the I-CHASS sites.)

[Orange Textable]: http://langtech.ch/textable
[MEANDRE/SEASR]: http://www.slideshare.net/lauvil


There’s a lot more to Python’s `scikit` than I realized:

[~]% port search scikit

py27-scikit-image @0.10.1 (python, science)
Image processing algorithms for SciPy.

py27-scikit-learn @0.15.2 (python, science)
Easy-to-use and general-purpose machine learning in Python

py27-scikits-bootstrap @0.3.2 (python, science)
Bootstrap confidence interval estimation routines for SciPy.

py27-scikits-bvp_solver @1.1 (python, science)
bvp_solver is a Python package for solving two-point boundary value

py27-scikits-module @0.1 (python)
provides the files common to all scikits

py27-scikits-samplerate @0.3.3 (python, science, audio)
A Python module for high quality audio resampling

And these are just the Python 2.7 offerings. There are similar offerings for 2.6, 3.2, and 3.3.

Speaking of 3.3, it looks like most of the libraries with which I work now have 3.3 compatible versions? Time to upgrade myself?

I also installed `scrapy` this morning. I’m not quite ready to scrape the web for my own work, but the library looked like it had some useful functionality that I could at least begin to get familiar with.

**EDITED** to defeat autocorrect, which had changed `scikit` to *sickout* and `scrapy` to *scrap* without my noticing.

**Also**: many thanks again to Michel Fortin and his amazing [Markdown PHP plug-in][]. The code fence blocks are a real time-saver: no need to indent everything with four spaces after a colon. Just block off code with the same number of tildes on either end. *Done*.

[Markdown PHP plug-in]: https://michelf.ca/projects/php-markdown/

The Complete Python for Text Analysis

The following set of commands assumes that you begin with a Mac OS X system that does not have any of the necessities already installed. You can thus skip anything you have already done; e.g., if you have already installed Xcode, skip to Step 2.

Step 1: Install the Xcode development and command line tool environment. You’ll have to get Xcode from the Mac App Store. Supposedly, you can avoid this by simply installing the command line tools (see command below), but I have come across at least one instance where it seemed like I needed to go inside Xcode itself and download and install things from within preferences. (This was the old way of doing it.) Here’s the terminal command to install the Command Line Tools (a bit redundant, isn’t it?):

xcode-select --install

Nota bene: I continue to see warnings when installing Python and its modules when I have not installed the complete Xcode from the App Store. They look like this:

Warning: xcodebuild exists but failed to execute
Warning: Xcode does not appear to be installed; most ports will likely fail to build.

I am installing the complete setup now on another machine; I will update this post if anything is borked.

Step 2: Install MacPorts.

If, like me, you have recently upgraded your operating system and things are borked, then you need to clean out the old installation(s). This means downloading the installer and running it like you did when you were young. It’s still fast and easy. The uninstallation is also fast and easy. Cleaning, however, takes some time. The steps below first document what you have installed before you clean everything out:

port -qv installed > myports.txt  
sudo port -f uninstall installed  
sudo port clean all  

You can use the myports document as your list. (The migration page at MacPorts does have a way to automate the re-installation process using this document. Try it, if you like.)

At any rate, once you have MacPorts installed, pretty much everything else you need is going to be found and then installed via port search and then port install.

Step 3: Now you can start installing the stuff you want to install, like [Python 2.7]:

sudo port selfupdate  
sudo port install python27  
sudo port install python_select  
sudo port select --set python python27  

Step 4: Install the NLTK and everything it needs (numpy, scipy, and matplotlib):

sudo port install py27-numpy  
sudo port install py27-scipy  
sudo port install py27-matplotlib  
sudo port install py27-nltk  

At this point, if you are only interested in NLP (natural language processing), you are done.

Optional: If you are going to pull anything from websites, then you can make your life easier by getting Beautiful Soup, which parses HTML for you:

sudo port install py27-beautifulsoup4

(Check for versions, as it may have incremented up.)

Step 5: If, however, you are also interested in network analysis as well as topic modeling and other forms of “big” data analysis, you can also install three Python modules built to do so — NetworkX, Gensim, and pandas:

sudo port install py27-networkx
sudo port install py27-gensim
sudo port install py27-pandas

Step 6: You have a pretty powerful analytical toolkit now at your disposal. If you would like to make the user interface a bit more “friendly,” let me suggest that you also install iPython, an interactive Python interpreter, and, the best thing since someone sliced something in order to serve it, the iPython notebook:

First, iPython:

sudo port install py27-ipython  
sudo port select --set ipython ipython27  

Then, the iPython notebook components:

sudo port install py27-jinja2  
sudo port install py27-sphinx  
sudo port install py27-zmq  
sudo port install py27-pygments  
sudo port install py27-tornado  
sudo port install py27-nose  
sudo port install py27-readline  

I can’t tell you what a joy iPython notebooks are to use: you can copy complete scripts into a code cell and get results by simply hitting SHIFT + ENTER. And everything is captured for you in a space where you can also make notes on what you are doing, or, in my case, trying to do, in markdown. Everything is saved to a modified JSON file with the extension .ipynb. Even better, you can transform the file, using the nbconvert utility, into HTML or LaTeX or PDF. It is very, very nice.

Optional: if you want that LaTeX option for nbconvert to work, you are going to need a functional TeX installation:

sudo port install texlive-latex

Nota bene: In my experience, any TeX installation is big, so if you are in a hurry, either open up another terminal window (or tab), do something in the GUI, or go fix yourself a cup of coffee. It’s going to take a while, and unless staring at the installation log as it scrolls by is your thing, and, hey, it could be, I suggest you let the code take its course and get some other things done.

And, if you need to convert scanned documents into text, the open source OCR application Tesseract is available:

sudo port install tesseract

You’ll need to install your preferred languages, in my case:

sudo port install tesseract-eng

See this search for tesseract for all the languages available.

Afterword: There is also, sigh!, a machine learning module for Python called SciKit that does all kinds of things that at this moment in time both excite me and make my head hurt.

[Python 2.7]: http://docs.python.org/2/

DICTION Text Analysis Software

During a clean out of my email application this morning (we won’t discuss how badly I have been managing incoming mail of late), I came across a [Humanist DG][] post, in response to an inquiry about text analysis software, pointing to [Diction][]:

> DICTION is a computer-aided text analysis program for determining the tone of a verbal message. DICTION searches a passage for five general features as well as thirty-five sub-features. It can process a variety of English language texts using a 10,000 word corpus and user-created custom dictionaries. DICTION currently runs on Windows® on a PC; a Mac® version is in development.
> DICTION produces reports about the texts it processes and also writes the results to numeric files for later statistical analysis. Output includes raw totals, percentages, and standardized scores and, for small input files, extrapolations to a 500-word norm.

Okay, so they like to capitalize themselves. We get it.

Digging a little further into its features, you get a bit more information on the five general features:

> DICTION … uses a series of dictionaries to search a passage for five semantic features — Activity, Optimism, Certainty, Realism and Commonality — as well as thirty-five sub-features. DICTION uses predefined dictionaries and can use up to thirty custom dictionaries built with words that the user has defined, such as specific negative and positive words, for particular research needs.

And then there’s a bit more on the word lists:

> DICTION uses dictionaries (word-lists) to search a text for these qualities:
> * Certainty – Language indicating resoluteness, inflexibility, and completeness and a tendency to speak ex cathedra.
> * Activity – Language featuring movement, change, the implementation of ideas and the avoidance of inertia.
> * Optimism – Language endorsing some person, group, concept or event, or highlighting their positive entailments.
> * Realism – Language describing tangible, immediate, recognizable matters that affect people’s everyday lives.
> * Commonality – Language highlighting the agreed-upon values of a group and rejecting idiosyncratic modes of engagement.

> DICTION output includes raw totals, percentages, and standardized scores and, for small input files, extrapolations to a 500-word norm. DICTION also reports normative data for each of its forty scores based on a 50,000-item sample of discourse. The user may use these general norms for comparative purposes or select from among thirty-six sub-categories, including speeches, poetry, newspaper editorials, business reports, scientific documents, television scripts, telephone conversations, etc.

> On a computer with a 2.16 GHz Intel chip and 2 GB of RAM, DICTION can process 3,000 passages (1,500,000 words) in four minutes. The program can accept either individual or multiple-passages and, at your discretion, it provides special counts of orthographic characters and high frequency words.

Just to make sure I understand this, the “semantic” features here are really words on lists, which can come pre-populated or that you can modify or create, and then the additional 36 subcategories are really different corpora? Am I wrong in perceiving this as a more nuanced version of sentiment analysis, but still operating in much the same way by depending upon certain pre-determined word lists?
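
If I have that right, the core mechanism is just counting hits against word lists, something like this toy sketch (the word list here is made up by me, not DICTION’s actual “Certainty” dictionary):

```python
# Toy illustration of dictionary-based scoring: count how many tokens
# in a text appear on a hand-made word list. Not DICTION's real lists.
certainty_words = {"always", "never", "must", "all", "every"}  # invented list

def dictionary_score(text, wordlist):
    tokens = text.lower().split()
    # Strip trailing/leading punctuation before checking membership
    hits = sum(1 for t in tokens if t.strip(".,;:!?") in wordlist)
    return hits / len(tokens) if tokens else 0.0
```

The pre-populated dictionaries, custom dictionaries, and normative corpora would then all plug into the same counting machinery, just with different lists and baselines.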

There’s a lot, to be sure, that I don’t know about the history of CATA (computer-assisted textual analysis, which is my acronym for the day!). And there are certainly approaches whose nuances I do not yet fully understand. I think this must be one of them.

[Humanist DG]: http://dhhumanist.org
[Diction]: http://www.dictionsoftware.com