wgetting TED Talk Transcripts

As I work through Matt Jockers’ material on sentiment analysis of stories — I’m not quite prepared to call it the shape of stories — I decided it would be interesting to try syuzhet out on some non-narrative materials to see what shapes turn up. A variety of possibilities ran through my head, but the one that stood out was, believe it or not, TED talks! Think about it. TED talks are a well-established genre with a stable structure/format. Text-to-text comparison shouldn’t really invite too many possible errors on my part — this is always important for me. Moreover, in 2010 Sebastian Wernicke assessed the corpus as it stood at that time, and so perhaps a revision of that early assessment might be due.

The next step was figuring out how to download all the transcripts. The URLs all looked like this:

https://www.ted.com/talks/stuart_firestein_the_pursuit_of_ignorance/transcript?language=en

While I would love it if this worked:

wget -r -l 1 -w 2 --limit-rate=20k https://www.ted.com/talks/*/transcript?language=en

It doesn’t: wget only expands wildcards in FTP URLs, not HTTP ones. wget is flexible, however, and if you feed it a list of URLs, it will work its way through that list. Fortunately, a search of the web turned up a post on Open Culture describing a list of the 1756 TED Talks available in 2014. As luck would have it, the Google Spreadsheet is still being maintained.

I downloaded the spreadsheet as a CSV file and then simply grabbed the column of URLs using Numbers. (This could have been done with pandas but it would have taken more time, and I didn’t need to automate this part of the process.) The URLs were to the main page for each talk, and not the transcript, but all I needed to do was to add the following to the end of each line:

/transcript?language=en

Which I did with some of the laziest regex ever. I could then cd into the directory I created for the files and run this:

wget -w 2 -i ~/Desktop/talk_list.txt
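
For anyone who would rather script the list-building step than do it by hand, here is a minimal sketch in Python. It uses the standard library's csv module rather than pandas. The column name `URL`, the function names, and the file paths are all assumptions for illustration; the real spreadsheet export may label its columns differently.

```python
import csv

SUFFIX = "/transcript?language=en"


def transcript_urls(rows, column="URL"):
    """Turn talk-page URLs into transcript URLs.

    `rows` is any iterable of dicts (for example a csv.DictReader).
    `column` is an assumed header name, not necessarily the real one.
    """
    urls = []
    for row in rows:
        url = (row.get(column) or "").strip()
        if not url:
            continue  # skip blank cells
        # Drop any trailing slash so we never produce "//transcript".
        urls.append(url.rstrip("/") + SUFFIX)
    return urls


def write_url_list(csv_path, out_path, column="URL"):
    """Read the exported CSV and write one transcript URL per line."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        urls = transcript_urls(csv.DictReader(f), column)
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(urls) + "\n")
    return len(urls)
```

Calling `write_url_list("ted_talks.csv", "talk_list.txt")` (both file names hypothetical) would produce the kind of one-URL-per-line file that `wget -i` expects.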

What remains now is to use Beautiful Soup to rename the files using the html title tag and to get rid of everything but the actual transcript. Final report from wget:

FINISHED --2016-05-18 16:16:52--
Total wall clock time: 2h 14m 51s
Downloaded: 2114 files, 153M in 3m 33s (735 KB/s)
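
The renaming step could look something like the sketch below. To keep it free of extra installs it uses the standard library's html.parser as a stand-in for Beautiful Soup, so it is not the planned approach itself. The class and function names and the filename-cleaning rule are assumptions, and it does not attempt to isolate the transcript text, since that depends on the page markup.

```python
import re
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def title_to_filename(html, ext=".html"):
    """Build a filesystem-friendly name from a page's <title> text."""
    parser = TitleParser()
    parser.feed(html)
    # Replace each run of non-word characters with a single underscore.
    slug = re.sub(r"\W+", "_", parser.title.strip()).strip("_").lower()
    return (slug or "untitled") + ext
```

Each downloaded file could then be read, passed through `title_to_filename`, and renamed with `os.rename`. Two talks with identical titles would collide, so a real script should check whether the target name already exists.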
