Understanding How Beautiful Soup Works

Two years ago, when I first grabbed the transcripts of the TED talks using wget, I relied upon the wisdom and generosity of Padraic C on StackOverflow to help me use Python’s BeautifulSoup library to get the data I wanted out of the downloaded HTML files. Now that Katherine Kinnaird and I have decided to add talks published since then, and perhaps even to re-download the entire corpus so that everything is as consistent as possible, it is time for me to understand for myself how BeautifulSoup (hereafter BS4) works.

from bs4 import BeautifulSoup

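# NB: no need to read() the file: BS4 accepts an open file handle.
# The with-block makes sure the file is closed once it has been parsed.
with open("transcript.0.html", encoding="utf-8") as html_file:
    thesoup = BeautifulSoup(html_file, "html5lib")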

# Talk metadata is in <meta> tags in the <head>.
# This finds all <meta> tags
metas = thesoup.find_all("meta")

# Let's see what this object is...
print(type(metas))

Output: <class 'bs4.element.ResultSet'>, and we can interact with it as if it were a list. Thus, metas[0] yields <meta charset="utf-8"/>, which is the first of a long line of <meta> tags. (The complete output is at the bottom of this note, under the heading Appendix A.)

type(metas[0]) outputs: <class 'bs4.element.Tag'>. That means we will need to understand how to select items within a BS4 Tag. The items we are interested in are towards the bottom of the result set:

<meta content="Good news in the fight against pancreatic cancer" itemprop="name"/>
<meta content="Anyone who has lost a loved one to pancreatic cancer knows the devastating speed with which it can affect an otherwise healthy person. TED Fellow and biomedical entrepreneur Laura Indolfi is developing a revolutionary way to treat this complex and lethal disease: a drug delivery device that acts as a cage at the site of a tumor, preventing it from spreading and delivering medicine only where it's needed. &quot;We are hoping that one day we can make pancreatic cancer a curable disease,&quot; she says." itemprop="description"/>
<meta content="PT6M3S" itemprop="duration"/>
<meta content="2016-05-17T14:46:20+00:00" itemprop="uploadDate"/>
<meta content="1246654" itemprop="interactionCount"/>
<meta content="Laura Indolfi" itemprop="name"/>

This gives us the title, the description, the run time, the publication date, the number of views, and the speaker. So the question is: how do we navigate the “parse tree” so that we get the value of the content attribute whenever the value of the itemprop attribute is one of the above?

[meta.attrs for meta in metas] returns a list of dictionaries, with each meta its own dictionary. Here is a small sample from the larger list:

{'content': 'PT6M3S', 'itemprop': 'duration'},
{'content': '2016-05-17T14:46:20+00:00', 'itemprop': 'uploadDate'},
{'content': '1246654', 'itemprop': 'interactionCount'},
{'content': 'Laura Indolfi', 'itemprop': 'name'},

What we need to do is identify each dictionary’s position in the list by finding those dictionaries whose itemprop has one of the values we want (duration, etc.). We then use that position to index into the list and get the value associated with content, yes?
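Here is a minimal sketch of that idea in plain Python, using the four sample dictionaries shown above. The helper names find_position and content_for are my own, invented for illustration; they are not part of BS4.

```python
# The sample of attribute dictionaries shown above.
meta_attrs = [
    {"content": "PT6M3S", "itemprop": "duration"},
    {"content": "2016-05-17T14:46:20+00:00", "itemprop": "uploadDate"},
    {"content": "1246654", "itemprop": "interactionCount"},
    {"content": "Laura Indolfi", "itemprop": "name"},
]


def find_position(attrs_list, itemprop):
    """Return the index of the first dictionary with this itemprop, or None."""
    for position, attrs in enumerate(attrs_list):
        if attrs.get("itemprop") == itemprop:
            return position
    return None


def content_for(attrs_list, itemprop):
    """Use the position to index into the list and fetch the content value."""
    position = find_position(attrs_list, itemprop)
    if position is None:
        return None
    return attrs_list[position].get("content")


print(content_for(meta_attrs, "duration"))
print(content_for(meta_attrs, "uploadDate"))
```

Using .get() rather than square brackets matters here because not every <meta> tag has an itemprop attribute (the first one only has charset).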

It turns out that the best way to do this is built into BS4, though the method was not immediately obvious. One of the answers to the StackOverflow question “Get meta tag content property with BeautifulSoup and Python” suggested the following possibility:

for tag in thesoup.find_all("meta"):
    if tag.get("name", None) == "author":
        speaker = tag.get("content", None)
    if tag.get("itemprop", None) == "duration":
        length = tag.get("content", None)
    if tag.get("itemprop", None) == "uploadDate":
        published = tag.get("content", None)
    if tag.get("itemprop", None) == "interactionCount":
        views = tag.get("content", None)
    if tag.get("itemprop", None) == "description":
        description = tag.get("content", None)

If we ask to see these values with print(speaker, length, published, views, description), we get:

Laura Indolfi PT6M3S 2016-05-17T14:46:20+00:00 1246654 Anyone
who has lost a loved one to pancreatic cancer knows the devastating
speed with which it can affect an otherwise healthy person. TED
Fellow and biomedical entrepreneur Laura Indolfi is developing a
revolutionary way to treat this complex and lethal disease: a drug
delivery device that acts as a cage at the site of a tumor,
preventing it from spreading and delivering medicine only where
it's needed. "We are hoping that one day we can make pancreatic
cancer a curable disease," she says.
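As a possible alternative to the loop, BS4's find() can match on attribute values directly. The sketch below is a variation of my own rather than the StackOverflow answer. It parses a few of the tags from Appendix A from a string, and it uses Python's built-in "html.parser" so that it runs without html5lib. The helper meta_content is hypothetical.

```python
from bs4 import BeautifulSoup

# A few of the <meta> tags from the transcript page, inlined for the example.
SAMPLE_HEAD = """
<head>
<meta content="Laura Indolfi" name="author"/>
<meta content="PT6M3S" itemprop="duration"/>
<meta content="2016-05-17T14:46:20+00:00" itemprop="uploadDate"/>
<meta content="1246654" itemprop="interactionCount"/>
</head>
"""


def meta_content(soup, attribute, value):
    """Return the content of the first <meta> whose attribute equals value."""
    # The attrs dict is needed because "name" is also the name of
    # find()'s first parameter (the tag name).
    tag = soup.find("meta", attrs={attribute: value})
    return tag.get("content") if tag is not None else None


sample_soup = BeautifulSoup(SAMPLE_HEAD, "html.parser")
talk = {
    "speaker": meta_content(sample_soup, "name", "author"),
    "length": meta_content(sample_soup, "itemprop", "duration"),
    "published": meta_content(sample_soup, "itemprop", "uploadDate"),
    "views": meta_content(sample_soup, "itemprop", "interactionCount"),
}
print(talk)
```

One caveat: find() returns only the first match, and on the real page itemprop="name" appears twice (once for the title, once for the speaker), which is presumably why the loop reads the speaker from name="author" instead.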

Now we need to get the text of the talk out, which is made somewhat difficult by the lack of semantic markup. The start of the text looks like this:

<!-- Transcript text -->
  <div class="Grid Grid--with-gutter d:f@md p-b:4">
    <div class="Grid__cell d:f h:full m-b:.5 m-b:0@md w:12"></div>

    <div class="Grid__cell flx-s:1 p-r:4">

The only reliable thing is the comment tag: there’s also a closing one at the end of the transcript text, so if we can find some way to select all the <p> tags between the two comments, I think we’ll be in good shape.
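One way this might work, as a rough sketch: walk the parse tree in document order, start collecting <p> tags once the opening comment has gone by, and stop at the closing one. The sample HTML is invented for the example, and the text of the closing comment ("/Transcript text") is an assumption, since only the opening comment is shown above. The function paragraphs_between_comments is a hypothetical helper.

```python
from bs4 import BeautifulSoup, Comment, Tag

# Invented sample; the closing comment's text is an assumption.
SAMPLE_HTML = """
<div>
<p>Navigation text, not part of the talk.</p>
<!-- Transcript text -->
<div class="Grid"><div class="Grid__cell">
<p>First paragraph of the talk.</p>
<p>Second paragraph of the talk.</p>
</div></div>
<!-- /Transcript text -->
<p>Footer text.</p>
</div>
"""


def paragraphs_between_comments(soup, start_text, end_text):
    """Collect the text of every <p> found between two comment markers."""
    paragraphs = []
    inside = False
    # .descendants yields tags, strings, and comments in document order.
    for element in soup.descendants:
        if isinstance(element, Comment):
            label = element.strip()
            if label == start_text:
                inside = True
            elif label == end_text:
                break
        elif inside and isinstance(element, Tag) and element.name == "p":
            paragraphs.append(element.get_text(strip=True))
    return paragraphs


sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
for paragraph in paragraphs_between_comments(
    sample_soup, "Transcript text", "/Transcript text"
):
    print(paragraph)
```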

Appendix A

The output of [print(meta) for meta in metas] is:

<meta charset="utf-8"/>
<meta content="TED Talk Subtitles and Transcript: Anyone who has lost a loved one to pancreatic cancer knows the devastating speed with which it can affect an otherwise healthy person. TED Fellow and biomedical entrepreneur Laura Indolfi is developing a revolutionary way to treat this complex and lethal disease: a drug delivery device that acts as a cage at the site of a tumor, preventing it from spreading and delivering medicine only where it's needed. &quot;We are hoping that one day we can make pancreatic cancer a curable disease,&quot; she says." name="description"/>
<meta content="Laura Indolfi" name="author"/>
<meta content='Transcript of "Good news in the fight against pancreatic cancer"' property="og:title"/>
<meta content="https://pi.tedcdn.com/r/talkstar-photos.s3.amazonaws.com/uploads/70d551c2-1e5c-411e-b926-7d72590f66bb/LauraIndolfi_2016U-embed.jpg?c=1050%2C550&amp;w=1050" property="og:image"/>
<meta content="https://pi.tedcdn.com/r/talkstar-photos.s3.amazonaws.com/uploads/70d551c2-1e5c-411e-b926-7d72590f66bb/LauraIndolfi_2016U-embed.jpg?c=1050%2C550&amp;w=1050" property="og:image:secure_url"/>
<meta content="1050" property="og:image:width"/>
<meta content="550" property="og:image:height"/>
<meta content="article" property="og:type"/>
<meta content="TED, Talks, Themes, Speakers, Technology, Entertainment, Design" name="keywords"/>
<meta content="#E62B1E" name="theme-color"/>
<meta content="True" name="HandheldFriendly"/>
<meta content="320" name="MobileOptimized"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<meta content="TED Talks" name="apple-mobile-web-app-title"/>
<meta content="yes" name="apple-mobile-web-app-capable"/>
<meta content="black" name="apple-mobile-web-app-status-bar-style"/>
<meta content="TED Talks" name="application-name"/>
<meta content="https://www.ted.com/browserconfig.xml" name="msapplication-config"/>
<meta content="#000000" name="msapplication-TileColor"/>
<meta content="on" http-equiv="cleartype"/>
<meta content="Laura Indolfi: Good news in the fight against pancreatic cancer" name="title"/>
<meta content="TED Talk Subtitles and Transcript: Anyone who has lost a loved one to pancreatic cancer knows the devastating speed with which it can affect an otherwise healthy person. TED Fellow and biomedical entrepreneur Laura Indolfi is developing a revolutionary way to treat this complex and lethal disease: a drug delivery device that acts as a cage at the site of a tumor, preventing it from spreading and delivering medicine only where it's needed. &quot;We are hoping that one day we can make pancreatic cancer a curable disease,&quot; she says." property="og:description"/>
<meta content="https://www.ted.com/talks/laura_indolfi_good_news_in_the_fight_against_pancreatic_cancer/transcript" property="og:url"/>
<meta content="201021956610141" property="fb:app_id"/>
<meta content="Good news in the fight against pancreatic cancer" itemprop="name"/>
<meta content="Anyone who has lost a loved one to pancreatic cancer knows the devastating speed with which it can affect an otherwise healthy person. TED Fellow and biomedical entrepreneur Laura Indolfi is developing a revolutionary way to treat this complex and lethal disease: a drug delivery device that acts as a cage at the site of a tumor, preventing it from spreading and delivering medicine only where it's needed. &quot;We are hoping that one day we can make pancreatic cancer a curable disease,&quot; she says." itemprop="description"/>
<meta content="PT6M3S" itemprop="duration"/>
<meta content="2016-05-17T14:46:20+00:00" itemprop="uploadDate"/>
<meta content="1246654" itemprop="interactionCount"/>
<meta content="Laura Indolfi" itemprop="name"/>
<meta content="Flash HTML5" itemprop="playerType"/>
<meta content="640" itemprop="width"/>
<meta content="360" itemprop="height"/>

Implying Comparison

This kind of data presentation strikes me as close to disingenuous. I understand that somewhere in someone’s mind, the thought is to maximize the width of the diagram, but when all the measures are the same (these all represent percent changes of opinion among the same population responding to the same survey), the shift in scale actually misleads readers, who easily see the big red bars, into thinking that there have been large changes across the board. Instead, there have been changes of 20%, 15%, and 7%. Those differences in size are hard to read with the light gray numbers up above, almost lost under the bold black of the diagram’s title.

You can do better, The Guardian.

Updating the TEDtalk Corpus

Today’s task is to determine which talks have been added since the initial download of talks in May 2016, both to increase the size of the corpus and to make sure that our procedures for acquiring and cleaning data have been fully documented. To get started, I read my own documentation for 2016’s effort, which contained a link to the Google Sheet where the data is housed.

I downloaded the Sheet as a CSV file and compared it to the one from May 2016. A couple of issues emerged immediately. First, neither FileMerge, part of Xcode, nor DiffMerge handles files this size very well. Both are slowww. Second, the columns have moved and changed names:

May 2016: 18,id,/,Speaker,Name,Short Summary,Event,Duration,Publish date,
May 2018: Talk ID,public_url,speaker_name,headline,description,event,duration,language,published,tags

For the purpose of comparison, I only need ID and URL, so I think I’ll make a copy of both files and trim them to those columns. On closer inspection, the URLs have changed. In 2016, the URLs were built from the ID number (http://www.ted.com/talks/view/id/53), but now they are a blend of author and title: https://www.ted.com/talks/majora_carter_s_tale_of_urban_renewal. So I’ll go with ID and speaker for comparison but keep the URL so I can use that for the download list.

Kaleidoscope can handle files this size and its viewing options are pretty nice, and it’s affordably priced, if I need to keep it. (It tracks 76 changes, but I can see that some of those changes are simply some internal shifting that the application doesn’t follow. I’ll take the automagic where I can get it, and work where I don’t.)

Slight change of plan: I duplicated the files and I am going to sort by ID number to see if that helps. Now Kaleidoscope is showing 412 changes. That doesn’t work. Some talks from the earlier CSV do not have IDs.

I considered a number of solutions to the problem, but I returned to the original diff, where Kaleidoscope tracked 76 changes and inspected the diff by hand. Most of the changes were lines shifted up or down, so I resolved those by hand. That left three large blocks of lines at the end that I copied and pasted into a new CSV as well as the following outliers listed by ID and speaker:

- 25, Tan
- 1676, Davis
- 1923, Cameron
+ 2386, n/a, year in ideas
+ 2451, Gerald
+ 2464, Torvalds
- n/a, Evans
+ 925, Shaw

Tomorrow I will see if the wget code still works.

Excession

A Culture GSV and Its Escorts by Ex-Pacifist on DeviantArt.

An imagining of what a Culture GSV looks like by Ex-Pacifist on DeviantArt.

About once or twice a year, something happens that reminds me of something from an Iain M. Banks novel, and I go to that spot on my bookshelves, actual or virtual, and pull one down to re-read. This past month or so, Excession has been my concern, and, as its name suggests, I found myself wondering how I could have missed so much of the novel the first time.

Banks’ Culture novels present a post-scarcity civilization of mixed humans and various ranges of artificial intelligences. At the top of the proverbial heap, and largely running the show, are the Minds, vast intelligences usually tasked with operating a variety of sizes of ships, orbitals, hubs, and other facilities that range in size, in terms of complexity, from small cities (in space) to entire planets. At the top of the hierarchy are the Minds that are also the largest ships in the Culture, the General Service Vehicles (GSVs). The capabilities of the GSVs, as is revealed throughout the series but is also the focus of Excession, are virtually limitless.

Despite the Minds being the true power behind the throne, many of the novels in the Cultureverse are told about and through humans. Speaking of that perspective, it should also be noted that, for the most part, the humans involved are rather close to humanity as we understand and experience it. So, while Culture humans have the ability to change form, change sexes, and lead lives several hundred years, and possibly up to a thousand years, in length, most of our guides through the Culture are happy to stick to the basics, with the occasional need to gland something to make them more calm or more quick to respond, or whatever is called for by the current situation.

Some human stories are a part of the braided narrative of Excession, but the novel is also fascinatingly most concerned with the actions of various Minds, whose awareness of the shifting nature of it all, especially politics, seems to be what plagues them. The eponymous entity ostensibly at the center of the story, the Excession, seems more a MacGuffin than anything else, save a postscript at novel’s end, but its appearance somewhere within the Culture’s zone of influence is what puts a large number of diverse plot points in motion. (Spoilers ahead.)

In Banks’ Cultureverse, ships move about by tapping into the energy grids that lie “above” and “below” reality as we know it: thus above real space lies “ultra space” and below it lies “infra space.” (The obvious analogues here are the two kinds of radiation that lie just outside the human-visible spectrum we call light: ultraviolet, at a higher frequency, and infrared, at a lower frequency.) Either grid can be tapped, but not both at the same time. The object that gets dubbed the Excession appears to be able to tap both at the same time. This appears to be the limit of its expressibility. Initial attempts to communicate with it result in the loss of the ship. Thereafter, the various ships that come to witness the Excession for themselves keep their distance.

Despite the desire of the Culture Minds that the Excession might bring some of the known “Elder” civilizations to the scene, the only other civilization that seems interested is the Affront, a brutish, egocentric species that lives to establish its superiority through a variety of means, the more painful the better. The Affront represent, for some in the Culture, a mistake in need of mending, and so the appearance of the Excession, with its possible promise of unknown technological riches, is allowed to lure the Affront into war with the Culture, a war the reader knows, and the Minds involved know, from the first novel in the series, Consider Phlebas, that the Culture will inevitably win.

One of the concerns of Excession is the Culture’s impulse to intervene in the affairs of others. The goal is always, of course, to make them less brutish, militaristic, and prone to see others as either potential slaves or cannon fodder: in short, to make them more cultured, and the irony of the verb form is not lost on Banks and his Culture denizens. There are glimpses of reconsiderations of, and possible recriminations for, the active peace-making in which the Culture engages, in other Culture novels. The Idiran war echoes in Look to Windward, with one of the Minds being particularly wracked with guilt by events that took place centuries ago. There is also a hint of the Culture having gotten into trouble for its busybody nature in Surface Detail.

But only in Excession is the matter taken up so centrally and by the players who seemingly matter the most but who are largely a backdrop in other novels, the Minds themselves. Like any good speculative fiction opera, there are plots and counterplots and plots within plots and subterfuges. What fascinates is how Banks manages to make matters both mundane enough for the reader to follow and surreal, quite literally, enough that we recognize in some fashion that the experiences of the Minds are necessarily not at all like our own. (He drops his guard here regularly when he resorts to analogies like “The giant ship watched the Excession, still billowing out towards it. For all its prodigious power, the Sleeper now felt as helpless as the driver of an ancient covered wagon, caught on a road beneath a volcano, watching the incandescent cloud of a nuée ardente tearing down the mountainside towards it.” Really? A ship with a Mind that does “metamath” for fun and which has, we are told, just finished constructing in record time and with complete stealth some thirty thousand war ships, and its first impulse is to imagine itself in a covered wagon?) The problem is, of course, how not like us the Minds are, and yet with their politicking and guilt, they are like us.

And guilt plays a large role in Excession, as it does elsewhere in the Culture books: it is perhaps one of the central themes for Banks, the one emotion a post-scarcity world is sure to feel, one supposes. At least two Minds destroy themselves out of guilt, which is not unlike the Mind that does so in Look to Windward. In the case of all three Minds, they have been guilty of participating in war. In all three cases, there were possible mitigating circumstances: the Affront do seem rather horrid, and their more horrible impulses would be better curbed than let loose upon any sector of the galaxy. But it seems to be the case, in the overall trend of the Culture novels, that it’s better to achieve such ends through skullduggery or a bit of carrot-and-sticking. Given how many of the novels are about the exploits of agents, willing or unwilling, of Contact and/or Special Circumstances, Excession is one tale in which the Culture’s inner workings are front and center, and it would appear that they are as willing to dig into their own skulls as those of the civilizations which they seek to improve through their “involvement.”

Anticipating the Turn


Breton’s apartment in Paris, filled with objects he and Lévi-Strauss bought while exiled in New York during the war.

As with any intellectual history, there is more to “the turn toward performance” than meets the eye: there is considerable buildup across a broad intellectual front, including the introduction of existentialism into the American academy and public culture (e.g., William Barrett’s Irrational Man [1958]; see note below). A consideration of these broader trends would reveal that the acceptance of work by Martin Heidegger, Jean-Paul Sartre, and Albert Camus following the Second World War was anticipated by work in American philosophy, such as John Dewey’s Art as Experience (1934) and Kenneth Burke’s Philosophy of Literary Form (1941). Some of the effort to discern a particular American culture was in response to the rise of rich international connections (which manifested in politics as a concern over communism), many of which were brought about by displaced intellectuals who came to the U.S. in the thirties and forties. Some of them stayed, and the result was that American intellectuals interested in work by Roman Jakobson found him referring to work by Vladimir Propp and Mikhail Bakhtin, and so American scholars found themselves confronted by an entire school of literary theory, now known as Russian formalism, which interacted somewhat with their emerging interest in structuralism as it had been developed in France by Lévi-Strauss, Piaget, Lacan, and others. (And all of this ignores the many contributions of the Frankfurt School during this time.)

Burke, Kenneth. 1973/1941. Literature as Equipment for Living. The Philosophy of Literary Form. Berkeley and Los Angeles: University of California Press. Pp. 293-304.

Jakobson, Roman. 1960. Closing Statement: Linguistics and Poetics. In Style in Language, 350-377. Ed. Thomas Sebeok. MIT Press.

Lord, Albert. 1960. The Singer of Tales. Harvard University Press. (The link below will take you to an online version of book hosted by Harvard University.)

Note: If you have never had the chance to read Barrett’s Irrational Man, I highly recommend it. A survey of its chapter titles should prove reason enough: from “The Encounter with Nothingness” and “The Testimony of Modern Art” to “The Place of the Furies,” the book was the gateway to existentialism, and thus also phenomenology, for many.

Hannah Arendt holding court at the New School for Social Research.

Python’s `google` Module

So if, like me, you became interested in the possibility of executing Google searches from within a Python script, and, like me, you installed the google module (which some have noted is no longer developed by Google itself but by a third party) and got an import error, here is what happened. Yes, you did install it as google:

pip install google

but you do not call it google because that will lead to an ImportError. Instead, the name of the module is googlesearch, so what you want to do is this:

from googlesearch import search

Now it works.

Hat tip to shylajhaa sathyaram in his comment on GeeksforGeeks.