All posts tagged data

Interactive Map of U.S. Migration

Forbes has a fantastic dynamic map of migration statistics drawn from IRS data. The migration is internal to the U.S., but clicking on cities or areas reveals patterns that make you ask questions. Here’s a screenshot of the map with Lafayette as the focus:

Migration into and out of Lafayette, Louisiana

What’s the inbound migration from southern California and Nevada? Are those migrant workers?

Over at Wired: “Beautiful Data: The Art of Science Field Notes.”

Schema

Google, Microsoft, and Yahoo have gotten together to adapt a collection of microformats that will make it possible for folks who produce and publish content to the web to make searching that content more meaningful:

Most webmasters are familiar with HTML tags on their pages. Usually, HTML tags tell the browser how to display the information included in the tag. For example, <h1>Avatar</h1> tells the browser to display the text string “Avatar” in a heading 1 format. However, the HTML tag doesn’t give any information about what that text string means: “Avatar” could refer to the hugely successful 3D movie, or it could refer to a type of profile picture, and this can make it more difficult for search engines to intelligently display relevant content to a user.

Schema.org provides a collection of shared vocabularies webmasters can use to mark up their pages in ways that can be understood by the major search engines: Google, Microsoft, and Yahoo!

You use the schema.org vocabulary, along with the microdata format, to add information to your HTML content. While the long term goal is to support a wider range of formats, the initial focus is on Microdata. This guide will help get you up to speed with microdata and schema.org, so that you can start adding markup to your web pages.
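To make the idea concrete, here is a minimal sketch of what microdata adds to the “Avatar” heading and how a program might read it back. The markup is an illustrative Movie item, and the `MicrodataSketch` class and `extract` helper are hypothetical names invented for this example; the parser deliberately ignores nesting and is nowhere near a full microdata implementation.

```python
from html.parser import HTMLParser

# Illustrative markup: the plain <h1> from the quote, now annotated with
# microdata attributes (itemscope, itemtype, itemprop).
MARKUP = """
<div itemscope itemtype="http://schema.org/Movie">
  <h1 itemprop="name">Avatar</h1>
  <span itemprop="genre">Science fiction</span>
</div>
"""


class MicrodataSketch(HTMLParser):
    """Hypothetical helper: collects itemtype values and itemprop text.

    It ignores nested items and non-text properties (links, dates).
    """

    def __init__(self):
        super().__init__()
        self.item_types = []
        self.properties = {}
        self._current_prop = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemtype" in attrs:
            self.item_types.append(attrs["itemtype"])
        if "itemprop" in attrs:
            # Remember which property the next text node belongs to.
            self._current_prop = attrs["itemprop"]

    def handle_data(self, data):
        if self._current_prop and data.strip():
            self.properties[self._current_prop] = data.strip()
            self._current_prop = None

    def handle_endtag(self, tag):
        self._current_prop = None


def extract(markup):
    """Return (list of item types, dict of property name -> text)."""
    parser = MicrodataSketch()
    parser.feed(markup)
    return parser.item_types, parser.properties


if __name__ == "__main__":
    print(extract(MARKUP))
```

With the annotations in place, “Avatar” is no longer just a string in a heading: a consumer can tell that it is the name of a Movie.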

Using Lightroom

Photography is part of my research, and I also enjoy photographing my family and just generally documenting my world (more on that as my next potential project later). Between those various interests and commitments, I have about 15,000 images, all of which are safely cataloged by Adobe’s Lightroom. (I tried Aperture when it premiered at an unbelievable price point on the Mac App Store, but either I have worked with Lightroom too long to figure out how to access Aperture’s features or it lacks the functionality on which I now depend in Lightroom.)

I get a lot of questions about using Lightroom from students and colleagues. From now on, I am telling everyone to start here. That link takes you to George Jardine’s website and the half-hour tutorial he recorded on the basics of image management with Lightroom.

If the tutorial convinces you to try Lightroom, then you should also read Rob Sylvan’s “10 Things I Wish I Could Tell Every Lightroom User.”

iPhone Tracker on GitHub

Apple’s latest update to iOS fixes the problem of making the location services cache easily available on your computer, but before you update, you might still enjoy seeing how much information about you is available. How widely available it is is a matter for a separate discussion.

I tried out the app on myself, just before I updated, to see what the results look like:

[Screenshot of the iPhone Tracker results, 2011-05-09]

It’s pretty much what you expect: it shows that I live most of my life within Lafayette, where I live and work, and the city’s environs, where I do research. What I found interesting, since the app offers this data as an animated timeline, are the brief flowerings that occurred thanks to travel I have done over the past year.

Viewed within a historical perspective, and internally, this information raises no great concerns for me. Viewed as a chance to market to me, it gives me some concerns. Viewed as a particularized and dynamic tracking of my movements … I don’t like it at all.

Electronic Literature Organization

I did not know about the Electronic Literature Organization until I saw their announcement of the publication of their second collection in The Humanist. It looks interesting. It might be worth a rummage in the near future.

April 1 Is Backup Day

April 1 is international backup day, which seems like an odd day to choose. I think it would be better, if equally unfortunate for those of us who live in societies that celebrate April Fools, to mark it as open information, or open access, day. Today is the 200th birthday of Robert Bunsen, famous for his eponymous burner, which he chose not to patent; in fact, he pursued those who tried to patent it for themselves.

In celebration of open information day, I offer up this passage from Benjamin Franklin’s Autobiography which details his refusal to patent the Franklin stove:

In order of time, I should have mentioned before, that having, in 1742, invented an open stove for the better warming of rooms, and at the same time saving fuel, as the fresh air admitted was warmed in entering, I made a present of the model to Mr. Robert Grace, one of my early friends, who, having an iron-furnace, found the casting of the plates for these stoves a profitable thing, as they were growing in demand.

To promote that demand, I wrote and published a pamphlet, entitled “An Account of the new-invented Pennsylvania Fireplaces; wherein their Construction and Manner of Operation is particularly explained; their Advantages above every other Method of warming Rooms demonstrated; and all Objections that have been raised against the Use of them answered and obviated,” etc.

This pamphlet had a good effect. Gov’r. Thomas was so pleas’d with the construction of this stove, as described in it, that he offered to give me a patent for the sole vending of them for a term of years; but I declin’d it from a principle which has ever weighed with me on such occasions, viz., That, as we enjoy great advantages from the inventions of others, we should be glad of an opportunity to serve others by any invention of ours; and this we should do freely and generously.

An ironmonger in London however, assuming a good deal of my pamphlet, and working it up into his own, and making some small changes in the machine, which rather hurt its operation, got a patent for it there, and made, as I was told, a little fortune by it. And this is not the only instance of patents taken out for my inventions by others, tho’ not always with the same success, which I never contested, as having no desire of profiting by patents myself, and hating disputes. The use of these fireplaces in very many houses, both of this and the neighbouring colonies, has been, and is, a great saving of wood to the inhabitants. (From Franklin’s Autobiography.)

And I also note that my colleague Jason Jackson and the team at Open Folklore have exciting news of their own.

Another Graffiti Blog/Database

Latrinalia (the writing on the walls of bathrooms) and graffiti have been studied by folklorists for quite some time. It’s refreshing to see folks not only collecting material but also attempting to publish it in some fashion as they collect it. My friend and colleague Quinn Dombrowski was the first person I know to do so, and now I just ran across Graffiti on Grounds, an “archive of writing scratched and scrawled around the campus of the University of Virginia.” The great thing about GoG is that clicking on an individual item gets you a single page which has Dublin Core metadata and “Graffiti Item Type” metadata. If there were a “This Week in the Humanities” program, I would like to do a show on this.

More Dropbox Goodness

I wish all services, and even a lot of applications, were as good as Dropbox. I turned the participants in my digital humanities seminar onto it, and, if I had done nothing else, I think that alone would have made the class for some of them. None of them hauls around a USB drive anymore. They have made sharing Dropbox files and folders part of how they work: it’s been amazing to watch.

If you haven’t tried it out, do. 2GB of storage is free. I have a slightly larger account, 10GB for $10 a month. I keep my home and office files synced via Dropbox, and I also access PDFs and other files in GoodReader (iPad) via Dropbox.

If you try it and like it, feel free to use my referral code. We both get an extra 250MB for free.

Once you are up and running, head over to AppStorm and read their “Ultimate Dropbox Toolkit and Guide” (link to post).

Reading on an iPad during/as Prime Time

Apps like Read It Later do collect interesting kinds of data from their users. Interesting in the aggregate: it would appear that one of the things that iPad users are doing is spending their evening hours on the couch not watching television but reading. (Or perhaps both.) There are a variety of cool graphs and charts at the link.

Outsourcing Film to Digital Transfer

Because MLA came in January this year, our household is a week or so behind its usual schedule for getting Christmas put away. Typically we do this earlier in January, trying to get our Christmas tree on the curb in time for it to be recycled for coastal restoration. Unfortunately, that recycling program is not happening this year: none of the parishes — Louisiana has parishes instead of counties — involved has any money for it. (If you are keeping track of the casualty count for the economic downturn in Louisiana, it’s: public health, higher education, the coast.) And so our clean up got put off until the MLK weekend.

And so out came the plastic bins to put away the Christmas decorations. But, what’s that? Aren’t you a little tired of that closet threatening your life every time you open it? Well, then, let’s take out all the bins, sort through them, throw some things away, give some things away, repack some things and begin to get a little order in here.

Hey, here’s a whole box of APS film canisters.

I have a lot of negatives lying around. Much of it is probably not worth spending too much money to preserve, but if it can be digitized in bulk for a reasonable price, then I am open to the idea. I don’t have that many APS canisters. Most of my film photography was done with a 35mm camera, but a lot of that is on slides, which are all neatly tucked into binders … and I don’t know when I will work up the energy to get that digitized. (My colleague Barry Jean Ancelet was fortunate enough to have a few semesters of graduate students to do the digitizing for him. Perhaps, one day, when I have a similar status, I can enjoy something similar. Gotta get that book done. — yes, Craig Gill, I am working on it. I promise.)

But let’s focus on the APS to digital for the time being, and see what we can learn:

  • ScanMyPhotos.com will do 2000 dpi scanning for $10 a roll or 4000 dpi scanning for $20 a roll. All scans are output as JPGs. (This makes no sense to me.) They will also scan slides and prints.
  • FotoBridge also does scanning, but it doesn’t have anything on APS scanning. Their price for scanning up to 250 slides at 2000 dpi is $90. 3000 dpi costs $102, and 4000 dpi $115. The prices drop as you increase the amount you have scanned.

The Joy of Stats

Just in time for some weekend vid watching:

Thanks to Kottke.

North American English Dialects, Based on Pronunciation Patterns

In what he describes as a hobby, Rick Aschmann has assembled a lovely map of dialects of English in North America.

Medieval Warfare … Yup, It Was Bad

The Economist has a nice write-up of the work of archeologists to piece together the Battle of Towton. Trolling through Towton’s mass graves (contemporary accounts estimated the death toll at 28,000), they have pieced together the ages of the men in the battle (from 17 to 50) and how they died (nastily).

The soldier now known as Towton 25 had survived battle before. A healed skull fracture points to previous engagements. He was old enough—somewhere between 36 and 45 when he died—to have gained plenty of experience of fighting. But on March 29th 1461, his luck ran out.

Machine Learning for Human Memorization

A machine learning researcher, Danny Tarlow, has come up with a way to describe his problem in competitive Scrabble in programming terms. Here’s a link to the post, and here’s his rough description of the problem:

As some of you know, I used to play Scrabble somewhat seriously. Most Tuesdays in middle school, I would go to the local scrabble club meetings and play 4 games against the best Scrabble players in the area (actually, it was usually 3 games, because the 4th game started past my bedtime). It’s not your family game of Scrabble: to begin to be competitive, you need to know all of the two letter words, most of the threes, and you need to have some familiarity with a few of the other high-priority lists (e.g., vowel dumps; short q, z, j, and x words; at least a few of the bingo stems). See here for a good starting point.

Anyway, I recently went to the Toronto Scrabble Club meeting and had a great time. I think I’ll start going with more regularity. As a busy machine learning researcher, though, I don’t have the time or the mental capacity to memorize long lists of words anymore: for example, there are 972 legal three letter words and 3902 legal four letter words.

So I’m looking for an alternative to memorization. Typically during play, there will be a board position that could yield a high-scoring word, but it requires that XXX or XXXX be a word. It would be very helpful if I could spend a minute or so of pen and paper computation time, then arrive at an answer like, “this is a word with 90% probability”. So what I really need is just a binary classifier that maps a word to probability of label “legal”.

Problem description: In machine learning terms, it’s a somewhat unique problem (from what I can tell). We’re not trying to build a classifier that generalizes well, because the set of 3 (or 4) letter words is fixed: we have all inputs, and they’re all labeled. At first glance, you might think this is an easy problem, because we can just choose a model with high model capacity, overfit the training data, and be done. There’s no need for regularization if we don’t care about overfitting, right? Well, not exactly. By this logic, we should just use a nearest neighbors classifier; but in order for me to run a nearest neighbors algorithm in my head, I’d need to memorize the entire training set!
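One way to picture the trade-off in that last paragraph is a classifier small enough to hold in your head. The sketch below is not Tarlow’s method; it swaps in a deliberately crude technique (bucketing words by consonant/vowel shape and counting) just to show the setup: the whole input set is enumerable and labeled, the model is fit and judged on that same set, and what matters is how few numbers you would have to memorize. The word list, the eight-letter alphabet, and all function names are assumptions made up for illustration, and the toy pretends that only the ten listed words are legal.

```python
from collections import Counter
from itertools import product

VOWELS = set("aeiou")

# Toy stand-in for the real list of 972 legal three-letter words.
# ASSUMPTION: inside this toy, these ten are the only legal words.
LEGAL = {"cat", "cot", "cut", "dog", "dig", "dug", "tag", "tug", "act", "oat"}

# Restrict the universe to eight letters so every input can be enumerated.
ALPHABET = "acdgiotu"
UNIVERSE = ["".join(letters) for letters in product(ALPHABET, repeat=3)]


def pattern(word):
    """Consonant/vowel shape of a word, e.g. 'cat' -> 'CVC'."""
    return "".join("V" if ch in VOWELS else "C" for ch in word)


def fit_pattern_table(universe, legal):
    """Estimate P(legal | pattern) by counting over the full labeled set."""
    totals = Counter(pattern(w) for w in universe)
    hits = Counter(pattern(w) for w in universe if w in legal)
    return {shape: hits[shape] / totals[shape] for shape in totals}


def probability_legal(word, table):
    """Map a word to an estimated probability of the label 'legal'."""
    return table.get(pattern(word), 0.0)


if __name__ == "__main__":
    table = fit_pattern_table(UNIVERSE, LEGAL)
    # The entire "model" is one number per shape: eight numbers in total.
    for shape, prob in sorted(table.items()):
        print(shape, round(prob, 3))
    print("cat:", probability_legal("cat", table))
```

A table of eight numbers is easy to memorize but gives weak probabilities; a lookup of every word is perfect but is exactly the memorization being avoided. The interesting models sit somewhere between those two.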