Text Analytics APIs 2018

Text Analytics APIs 2018: A Consumer Guide is $895 for a single user license. At 299 pages, that’s about $3 per page. The blurb notes that:

Robert Dale is an internationally-recognized expert in Natural Language Processing, with three decades of experience in academia and industry. With a PhD from the University of Edinburgh, he’s worked for Microsoft and Nuance, and he’s driven the development of SaaS-based NLP software for a startup. He has taught at the University of Edinburgh in the UK and at Macquarie University in Sydney, and presented tutorials and summer school courses around the world. He has over 150 peer-reviewed publications, including a comprehensive Handbook of Natural Language Processing, and the de facto textbook Building Natural Language Generation Systems.

LitRPGs

Audible emailed me about “LitRPGs”, a genre about which I have heard little. I imagined something like Choose-Your-Own-Adventure or Cortazar’s “Hopscotch” but it seems to cover a pretty wide range of texts.

Ternary Color Scheme

I was interested in the data for the age of European populations, but I found myself more taken with the color scheme used in the visualization:

Ternary Representation of European Populations by Age (Web Size)

A full-sized version of the image is available on request — it’s really big. But the map is part of an article in The Lancet.

Transcription in 2018

The range of transcription options has opened up considerably since I last considered the possibility of turning over some, but not all, transcription to software. It appears to be largely done in the cloud, with offerings from the following:

  • Transcribe appears to be simply an on-line version of the mechanical transcription machines I used to use: load the audio and then type. The “automagic” version allows you to listen to the audio through a headset and then dictate it to the site, which will then transcribe. That’s interesting.
  • f4transkript is another on-line service where you load your audio and then you do the typing.

If you’re interested in these traditional forms of transcription, wherein you do the typing, then may I also suggest you check out the transcription options in Scrivener. It’s not a service, so you just buy the software and use it. And a license is very inexpensive.

For those interested in letting an AI of some kind transcribe the audio for you — ah, the future, then there appears to be Descript. It appears to be the case that you upload your files either online, or you simply load them into an app installed on your local machine: it’s not quite clear if you pursue the latter course if the transcription takes place entirely on your machine or if the AI that does the heavy lifting lives in the cloud. The demos appear to work in real time, but the site suggests that perhaps you can load an audio of whatever length as a digital file and in less time than it takes to play it, you can have a transcript back.

I’m going to see how much you can do with a free account and report back. This could be very, very useful. (And cool!)

Understanding How Beautiful Soup Works

Two years ago, when I first grabbed the transcripts of the TED talks, using wget, I relied upon the wisdom and generosity of Padraic C on StackOverflow to help me use Python’s BeautifulSoup library to get the data out of the downloaded HTML files that I wanted. Now that Katherine Kinnaird and I have decided to add talks published since then, and perhaps even go so far as to re-download the entire corpus so that everything is as much the same as possible, it was time for me to understand how BeautifulSoup (hereafter BS4) works for myself.

from bs4 import BeautifulSoup

# NB: no need to read() the file: BS4 does that
thesoup = BeautifulSoup(open("transcript.0.html"), "html5lib")

# Talk metadata is in <meta> tags in the <head>.
# This finds all <meta> tags
metas = thesoup.find_all("meta")

# Let's see what this object is...
print(type(metas))

Output: <class 'bs4.element.ResultSet'>, and we can interact with it as if it were a list. Thus, metas[0] yields: <meta charset="utf-8"/>, which is the first of a long line of <meta tags. (The complete output is at the bottom of this note below under the heading Appendix A.)

type(metas[0]) outputs: <class 'bs4.element.Tag'>. That means we will need to understand how to select items within a BS4 Tag. The items we are interested in are towards the bottom of the result set:

<meta content="Good news in the fight against pancreatic cancer" itemprop="name"/>
<meta content="Anyone who has lost a loved one to pancreatic cancer knows the devastating speed with which it can affect an otherwise healthy person. TED Fellow and biomedical entrepreneur Laura Indolfi is developing a revolutionary way to treat this complex and lethal disease: a drug delivery device that acts as a cage at the site of a tumor, preventing it from spreading and delivering medicine only where it's needed. &quot;We are hoping that one day we can make pancreatic cancer a curable disease,&quot; she says." itemprop="description"/>
<meta content="PT6M3S" itemprop="duration"/>
<meta content="2016-05-17T14:46:20+00:00" itemprop="uploadDate"/>
<meta content="1246654" itemprop="interactionCount"/>
<meta content="Laura Indolfi" itemprop="name"/>

This gives us the slug, the description, the run time, the publication date, the number of hits, and the speaker. So, the question is, how do we navigate the “parse tree” so that we turn up the value of the content attributes when the value of the itemprop attribute is one of the above?

[meta.attrs for meta in metas] returns a list of dictionaries, with each meta its own dictionary. Here is a small sample from the larger list:

{'content': 'PT6M3S', 'itemprop': 'duration'},
{'content': '2016-05-17T14:46:20+00:00', 'itemprop': 'uploadDate'},
{'content': '1246654', 'itemprop': 'interactionCount'},
{'content': 'Laura Indolfi', 'itemprop': 'name'},

What we need to do is identify the dictionary’s position in the list by finding those dictionaries that have the values duration, etc. We then use that position to slice to that dictionary, and get the value associated with content, yes?

It turns out that the best way to do this is built into BS4, though the method was not immediately obvious. One of the answers to the StackOverflow question “Get meta tag content property with BeautifulSoup and Python” suggested the following possibility:

for tag in thesoup.find_all("meta"):
    if tag.get("name", None) == "author":
        speaker = tag.get("content", None)
    if tag.get("itemprop", None) == "duration":
        length = tag.get("content", None)
    if tag.get("itemprop", None) == "uploadDate":
        published = tag.get("content", None)
    if tag.get("itemprop", None) == "interactionCount":
        views = tag.get("content", None)
    if tag.get("itemprop", None) == "description":
        description = tag.get("content", None)

If we ask to see these values with print(speaker, length, published, views, description), we get:

Laura Indolfi PT6M3S 2016-05-17T14:46:20+00:00 1246654 Anyone
who has lost a loved one to pancreatic cancer knows the devastating
speed with which it can affect an otherwise healthy person. TED
Fellow and biomedical entrepreneur Laura Indolfi is developing a
revolutionary way to treat this complex and lethal disease: a drug
delivery device that acts as a cage at the site of a tumor,
preventing it from spreading and delivering medicine only where
it's needed. "We are hoping that one day we can make pancreatic
cancer a curable disease," she says.

Now we need to get the text of the talk out, which is made somewhat difficult by the lack of semantic markup. The start of the text looks like this:

<!-- Transcript text -->
  <div class="Grid Grid--with-gutter d:f@md p-b:4">
    <div class="Grid__cell d:f h:full m-b:.5 m-b:0@md w:12"></div>

    <div class="Grid__cell flx-s:1 p-r:4">

The only reliable thing is the comment tag: there’s also a closing one at the end of the transcript text, so if we can find some way to select all the <p> tags between the two comments, I think we’ll be in good shape.

Appendix A

The output of [print(meta) for meta in metas] is:

<meta charset="utf-8"/>
<meta content="TED Talk Subtitles and Transcript: Anyone who has lost a loved one to pancreatic cancer knows the devastating speed with which it can affect an otherwise healthy person. TED Fellow and biomedical entrepreneur Laura Indolfi is developing a revolutionary way to treat this complex and lethal disease: a drug delivery device that acts as a cage at the site of a tumor, preventing it from spreading and delivering medicine only where it's needed. &quot;We are hoping that one day we can make pancreatic cancer a curable disease,&quot; she says." name="description"/>
<meta content="Laura Indolfi" name="author"/>
<meta content='Transcript of "Good news in the fight against pancreatic cancer"' property="og:title"/>
<meta content="https://pi.tedcdn.com/r/talkstar-photos.s3.amazonaws.com/uploads/70d551c2-1e5c-411e-b926-7d72590f66bb/LauraIndolfi_2016U-embed.jpg?c=1050%2C550&amp;w=1050" property="og:image"/>
<meta content="https://pi.tedcdn.com/r/talkstar-photos.s3.amazonaws.com/uploads/70d551c2-1e5c-411e-b926-7d72590f66bb/LauraIndolfi_2016U-embed.jpg?c=1050%2C550&amp;w=1050" property="og:image:secure_url"/>
<meta content="1050" property="og:image:width"/>
<meta content="550" property="og:image:height"/>
<meta content="article" property="og:type"/>
<meta content="TED, Talks, Themes, Speakers, Technology, Entertainment, Design" name="keywords"/>
<meta content="#E62B1E" name="theme-color"/>
<meta content="True" name="HandheldFriendly"/>
<meta content="320" name="MobileOptimized"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<meta content="TED Talks" name="apple-mobile-web-app-title"/>
<meta content="yes" name="apple-mobile-web-app-capable"/>
<meta content="black" name="apple-mobile-web-app-status-bar-style"/>
<meta content="TED Talks" name="application-name"/>
<meta content="https://www.ted.com/browserconfig.xml" name="msapplication-config"/>
<meta content="#000000" name="msapplication-TileColor"/>
<meta content="on" http-equiv="cleartype"/>
<meta content="Laura Indolfi: Good news in the fight against pancreatic cancer" name="title"/>
<meta content="TED Talk Subtitles and Transcript: Anyone who has lost a loved one to pancreatic cancer knows the devastating speed with which it can affect an otherwise healthy person. TED Fellow and biomedical entrepreneur Laura Indolfi is developing a revolutionary way to treat this complex and lethal disease: a drug delivery device that acts as a cage at the site of a tumor, preventing it from spreading and delivering medicine only where it's needed. &quot;We are hoping that one day we can make pancreatic cancer a curable disease,&quot; she says." property="og:description"/>
<meta content="https://www.ted.com/talks/laura_indolfi_good_news_in_the_fight_against_pancreatic_cancer/transcript" property="og:url"/>
<meta content="201021956610141" property="fb:app_id"/>
<meta content="Good news in the fight against pancreatic cancer" itemprop="name"/>
<meta content="Anyone who has lost a loved one to pancreatic cancer knows the devastating speed with which it can affect an otherwise healthy person. TED Fellow and biomedical entrepreneur Laura Indolfi is developing a revolutionary way to treat this complex and lethal disease: a drug delivery device that acts as a cage at the site of a tumor, preventing it from spreading and delivering medicine only where it's needed. &quot;We are hoping that one day we can make pancreatic cancer a curable disease,&quot; she says." itemprop="description"/>
<meta content="PT6M3S" itemprop="duration"/>
<meta content="2016-05-17T14:46:20+00:00" itemprop="uploadDate"/>
<meta content="1246654" itemprop="interactionCount"/>
<meta content="Laura Indolfi" itemprop="name"/>
<meta content="Flash HTML5" itemprop="playerType"/>
<meta content="640" itemprop="width"/>
<meta content="360" itemprop="height"/>

Implying Comparison

This kind of data presentation strikes me as close to disingenuous. I understand that somewhere in someone’s mind, the thought is to maximize the width of the diagram, but when all the measures are the same, these all represent percent changes of opinion among the same population responding to the same survey, then the shift in scale actually misleads readers, who easily see the big red bars, that there has been large changes across the board. Instead, there’s been changes of 20%, 15%, and 7%. Those differences in size are hard to read with the light gray numbers up above, almost lost under the bold black of the diagram’s title.

You can do better, The Guardian.