wgetting TED Talk Transcripts

As I work through Matt Jockers’ material on sentiment analysis of stories — I’m not quite prepared to call it the shape of stories — I decided it would be interesting to try syuzhet out on some non-narrative materials to see what shapes turn up. A variety of possibilities ran through my head, but the one that stood out was, believe it or not, TED talks! Think about it. TED talks are a well-established genre with a stable structure/format. Text-to-text comparison shouldn’t really invite too many possible errors on my part — this is always important for me. Moreover, in 2010 Sebastian Wernicke assessed the corpus as it stood at that time, and so perhaps a revision of that early assessment might be due.

The next step was how to download all the transcripts. The URLs all looked like this:


While I would love it if this worked:

wget -r -l 1 -w 2 --limit-rate=20k https://www.ted.com/talks/*/transcript?language=en

It doesn’t: wget does not expand wildcards in HTTP URLs. wget is flexible, however, and if you feed it a file listing URLs, it will work its way through that list. Fortunately, a web search turned up a post on Open Culture describing a list of the 1756 TED Talks available in 2014. As luck would have it, the Google Spreadsheet is still being maintained.

I downloaded the spreadsheet as a CSV file and then simply grabbed the column of URLs using Numbers. (This could have been done with pandas but it would have taken more time, and I didn’t need to automate this part of the process.) The URLs were to the main page for each talk, and not the transcript, but all I needed to do was to add the following to the end of each line:


Which I did with some of the laziest regex ever. I could then cd into the directory I created for the files and run this:

wget -w 2 -i ~/Desktop/talk_list.txt
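
The spreadsheet step above could also be scripted. Here is a minimal Python sketch, not what I actually ran. It assumes the CSV export has a header row with a column named `URL` (a hypothetical name; check the real export) and that the transcript suffix matches the pattern in the wildcard attempt above, `/transcript?language=en`. The function names are made up for illustration.

```python
import csv
import io

# Suffix taken from the URL pattern in the wildcard attempt above.
TRANSCRIPT_SUFFIX = "/transcript?language=en"


def transcript_urls(csv_text, column="URL"):
    """Return one transcript URL per non-empty cell in the given column.

    The column name "URL" is an assumption about the spreadsheet export.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    urls = []
    for row in reader:
        talk_url = (row.get(column) or "").strip()
        if talk_url:
            # Drop any trailing slash so the suffix is not doubled up.
            urls.append(talk_url.rstrip("/") + TRANSCRIPT_SUFFIX)
    return urls


def write_talk_list(csv_path, out_path, column="URL"):
    """Read the exported CSV and write one transcript URL per line."""
    with open(csv_path, newline="", encoding="utf-8") as source:
        urls = transcript_urls(source.read(), column=column)
    with open(out_path, "w", encoding="utf-8") as target:
        target.write("\n".join(urls) + "\n")
    return len(urls)
```

Calling `write_talk_list("ted_talks.csv", "talk_list.txt")` (both file names hypothetical) would produce the kind of list that `wget -i` reads.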

What remains now is to use Beautiful Soup to rename the files using the html title tag and to get rid of everything but the actual transcript. Final report from wget:

FINISHED --2016-05-18 16:16:52--
Total wall clock time: 2h 14m 51s
Downloaded: 2114 files, 153M in 3m 33s (735 KB/s)
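
The renaming step mentioned above could start from something like the following. It is a sketch only: it uses the standard library's `html.parser` rather than Beautiful Soup so that it runs without extra installs, the name `title_to_filename` is made up for illustration, and it does not try to isolate the transcript itself, since that depends on page markup I have not yet examined.

```python
import re
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def title_to_filename(html, extension=".html"):
    """Turn a page's <title> text into a lowercase, hyphenated file name."""
    parser = TitleParser()
    parser.feed(html)
    title = "".join(parser.title_parts).strip()
    # Collapse every run of non-alphanumeric characters into one hyphen.
    slug = re.sub(r"[^A-Za-z0-9]+", "-", title).strip("-").lower()
    return (slug or "untitled") + extension
```

Each downloaded file could then be read, passed through `title_to_filename`, and renamed with `os.rename`.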

Voyant Tools 2

Voyant Tools is now in its second major version, which means that it runs more readily on more devices. While you can use Voyant online, the cooler part is running it yourself. Download it from GitHub and give it a try. A huge thanks to Geoffrey Rockwell and Stéfan Sinclair and the rest of their group for all the hard work and the commitment to improving Voyant Tools year after year.

csplit < awk

I regularly need to split larger text files into smaller text files, or chunks, in order to do some kind of text analysis/mining. I know I could write a Python script that would do this, but that often involves a lot more scripting than I want, and I’m lazy, and there’s also this thing called csplit which should do the trick. I’ve just never mastered it. Until now.

Okay, so I want to split a text file I’ll call excession.txt (because I like me some Banks). Let’s start building the csplit line:

csplit -f excession excession.txt 'Culture 5' '{*}'

… Apparently I still haven’t mastered it. But this bit of awk worked right away:

awk '/Culture 5 - Excession/{filename=NR"excession"}; {print >filename}' excession.txt
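
For comparison, here is roughly what the Python script I was avoiding might look like. It is a sketch that mirrors the awk one-liner, with two differences worth noting: it matches the marker as a plain substring rather than a regular expression, and it silently drops any lines that come before the first marker. The function names are made up for illustration.

```python
def split_on_marker(lines, marker):
    """Group lines into chunks, each starting at a line containing marker.

    Returns a list of (line_number, chunk_lines) pairs, with line numbers
    counted from 1 as awk's NR does. Lines before the first marker are dropped.
    """
    chunks = []
    for number, line in enumerate(lines, start=1):
        if marker in line:
            # A marker line opens a new chunk.
            chunks.append((number, [line]))
        elif chunks:
            # Any other line belongs to the most recent chunk.
            chunks[-1][1].append(line)
    return chunks


def write_chunks(path, marker, suffix="excession"):
    """Write each chunk to a file named <line number><suffix>."""
    with open(path, encoding="utf-8") as source:
        chunks = split_on_marker(source, marker)
    names = []
    for number, chunk in chunks:
        name = f"{number}{suffix}"
        with open(name, "w", encoding="utf-8") as target:
            target.writelines(chunk)
        names.append(name)
    return names
```

Something like `write_chunks("excession.txt", "Culture 5 - Excession")` would then do the same job, which is a lot more typing than the awk line.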

For the record, I’m interested in working with the Culture novels of Iain M. Banks. I am converting MOBI files into EPUBs using Calibre, and then into plain text files. No, I cannot make these available to anyone, so please don’t ask.

The Culture series:

  1. Consider Phlebas (1987)
  2. The Player of Games (1988)
  3. Use of Weapons (1990)
  4. The State of the Art (1991)
  5. Excession (1996)
  6. Inversions (1998)
  7. Look to Windward (2000)
  8. Matter (2008)
  9. Surface Detail (2010)
  10. The Hydrogen Sonata (2012)


[OpenRefine][Open Refine] is a “tool for working with messy data: cleaning it; transforming it from one format into another; extending it with web services; and linking it to databases.” The link takes you to a page with lots of video tutorials. There is also Thomas Padilla’s [Getting Started with OpenRefine][].

[Open Refine]: http://openrefine.org
[Getting Started with OpenRefine]: http://thomaspadilla.org/dataprep/

Orange Textable

There’s a new visual programming interface (language?) for text analysis in town and it’s [Orange Textable][]: “Orange Textable is an open-source add-on bringing advanced text-analytical functionalities to the Orange Canvas visual programming environment (itself open-source). It essentially enables users to build data tables on the basis of text data, by means of a flexible and intuitive interface.” Looking through the documentation, it reminds me of something like the [MEANDRE/SEASR][] infrastructure/application setup from the NCSA (National Center for Supercomputing Applications) a few years ago. (The project has disappeared from both the NCSA and the I-CHASS sites.)

[Orange Textable]: http://langtech.ch/textable
[MEANDRE/SEASR]: http://www.slideshare.net/lauvil


There’s a lot more to Python’s `scikit` than I realized:

[~]% port search scikit

py27-scikit-image @0.10.1 (python, science)
Image processing algorithms for SciPy.

py27-scikit-learn @0.15.2 (python, science)
Easy-to-use and general-purpose machine learning in Python

py27-scikits-bootstrap @0.3.2 (python, science)
Bootstrap confidence interval estimation routines for SciPy.

py27-scikits-bvp_solver @1.1 (python, science)
bvp_solver is a Python package for solving two-point boundary value

py27-scikits-module @0.1 (python)
provides the files common to all scikits

py27-scikits-samplerate @0.3.3 (python, science, audio)
A Python module for high quality audio resampling

And these are just the Python 2.7 offerings. There are similar offerings for 2.6, 3.2, and 3.3.

Speaking of 3.3, it looks like most of the libraries with which I work now have 3.3-compatible versions. Time to upgrade myself?

I also installed `scrapy` this morning. I’m not quite ready to scrape the web for my own work, but the library looked like it had some useful functionality that I could at least begin to get familiar with.

**EDITED** to defeat autocorrect, which had changed `scikit` to *sickout* and `scrapy` to *scrap* without my noticing.

**Also**: many thanks again to Michel Fortin and his amazing [Markdown PHP plug-in][]. The code fence blocks are a real time-saver: no need to indent everything with four spaces after a colon. Just block off code with the same number of tildes on either end. *Done*.

[Markdown PHP plug-in]: https://michelf.ca/projects/php-markdown/