Posts Tagged: data


26
Dec 10

Machine Learning for Human Memorization

Danny Tarlow, a machine learning researcher, has found a way to frame a problem he faces in competitive Scrabble in programming terms. Here’s a link to the post, and here’s his rough description of the problem:

As some of you know, I used to play Scrabble somewhat seriously. Most Tuesdays in middle school, I would go to the local scrabble club meetings and play 4 games against the best Scrabble players in the area (actually, it was usually 3 games, because the 4th game started past my bedtime). It’s not your family game of Scrabble: to begin to be competitive, you need to know all of the two letter words, most of the threes, and you need to have some familiarity with a few of the other high-priority lists (e.g., vowel dumps; short q, z, j, and x words; at least a few of the bingo stems). See here for a good starting point.

Anyway, I recently went to the Toronto Scrabble Club meeting and had a great time. I think I’ll start going with more regularity. As a busy machine learning researcher, though, I don’t have the time or the mental capacity to memorize long lists of words anymore: for example, there are 972 legal three letter words and 3902 legal four letter words.

So I’m looking for an alternative to memorization. Typically during play, there will be a board position that could yield a high-scoring word, but it requires that XXX or XXXX be a word. It would be very helpful if I could spend a minute or so of pen and paper computation time, then arrive at an answer like, “this is a word with 90% probability”. So what I really need is just a binary classifier that maps a word to probability of label “legal”.

Problem description: In machine learning terms, it’s a somewhat unique problem (from what I can tell). We’re not trying to build a classifier that generalizes well, because the set of 3 (or 4) letter words is fixed: we have all inputs, and they’re all labeled. At first glance, you might think this is an easy problem, because we can just choose a model with high model capacity, overfit the training data, and be done. There’s no need for regularization if we don’t care about overfitting, right? Well, not exactly. By this logic, we should just use a nearest neighbors classifier; but in order for me to run a nearest neighbors algorithm in my head, I’d need to memorize the entire training set!
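To make the constraint concrete, here is a minimal sketch of one kind of model that is small enough to run on paper: a naive Bayes classifier that keeps a single score per letter per position, so classifying a word means adding three numbers to a constant. This is not Tarlow's method, only an illustration of the trade-off he describes. The tiny word list, the function names, and the smoothing choice are all assumptions made for the example.

```python
import math
from string import ascii_lowercase

# Toy stand-in for the real list of legal three-letter words
# (an assumption for illustration only).
LEGAL = {"cat", "cot", "cut", "bat", "bit", "but", "hat", "hit", "hot", "hut"}


def train(legal, length=3):
    """Build one log-odds score per (position, letter), plus a prior."""
    total = 26 ** length              # every possible string of this length
    n_legal = len(legal)
    n_illegal = total - n_legal
    per_letter = 26 ** (length - 1)   # strings with a given letter in a given slot
    table = []
    for pos in range(length):
        scores = {}
        for ch in ascii_lowercase:
            k = sum(1 for w in legal if w[pos] == ch)
            # Laplace smoothing keeps unseen letters from scoring minus infinity.
            p_legal = (k + 1) / (n_legal + 26)
            p_illegal = (per_letter - k + 1) / (n_illegal + 26)
            scores[ch] = math.log(p_legal / p_illegal)
        table.append(scores)
    prior = math.log(n_legal / n_illegal)
    return prior, table


def prob_legal(word, prior, table):
    """Add the prior and the per-position scores, then squash to a probability."""
    logit = prior + sum(table[i][ch] for i, ch in enumerate(word))
    return 1 / (1 + math.exp(-logit))


prior, table = train(LEGAL)
for candidate in ("cat", "xqz"):
    print(candidate, round(prob_legal(candidate, prior, table), 4))
```

The whole model is 78 numbers plus a constant, which is far less to memorize than the word list itself. The price is accuracy: a model this small cannot separate every legal word from every illegal one.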


8
Dec 10

Portable Copy Stand

This morning’s ProfHacker, now part of the Chronicle of Higher Education, has a write-up by Konrad Lawson on his portable copy stand that lets him quickly set up his camera to photograph book pages in archives and libraries.


2
Nov 10

Audio Digitization for the “Oral Histories of the American South” Project

The description of the audio digitization process given below is from 2007, but it is a model of thoroughness of description:

Cassettes are played on a Nakamichi MR-1 discrete head professional cassette deck. Tape heads are cleaned before each side of the cassette is played and the azimuth (the angle between the tape heads and tape medium) is adjusted to create maximum contact between the playback head and the tape to ensure the widest frequency response. Playback equalization is set to 120µ seconds for IEC standard Type I cassettes, and 70µS for Type II and Type IV cassettes.

XLR outputs of the Nakamichi transmit the balanced signal directly to the Apogee Rosetta 200, 24 bit, 2 channel, Analog to Digital and Digital to Analog converter. The signal is digitized at a sample rate of 96 kHz and 24 bit sample depth and travels to the computer via an XLR cable from the digital outputs to the Lynx One sound card AES/EBU audio port.

Recently, we added the Apogee Big Ben Master Digital Clock, a master word clock that virtually eliminates any possible jitter [abrupt and unwanted variation of one or more signal characteristics] that can cause high frequency distortions in the signal. This process creates audio files with excellent clarity and a very large quantity of information. A typical file, representing one side of one cassette, comprises around 1 GB of data.
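The "around 1 GB" figure can be checked with simple arithmetic on the sample rate, bit depth, and channel count given above. The sketch below assumes uncompressed PCM audio and a 30-minute cassette side; the side length is an assumption, since the description does not state it, and the function name is made up for the example.

```python
def pcm_bytes(sample_rate_hz, bit_depth, channels, seconds):
    """Size in bytes of uncompressed PCM audio, ignoring file headers."""
    return sample_rate_hz * (bit_depth // 8) * channels * seconds


side_seconds = 30 * 60  # assumption: one side of a 60-minute cassette

master = pcm_bytes(96_000, 24, 2, side_seconds)   # preservation master settings
cd_copy = pcm_bytes(44_100, 16, 2, side_seconds)  # CD audio settings

print(f"master:  {master / 1e9:.2f} GB")
print(f"CD copy: {cd_copy / 1e9:.2f} GB")
```

At the master settings, every second of stereo audio takes 576,000 bytes, so a 30-minute side comes to roughly 1.04 GB, more than three times the size of the same audio at CD settings.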

Files are then stored in a designated digital deep storage on the libraries archival servers as they are too large to be stored on CD without converting the sample rate to 44.1 kHz and reducing the quality.

The signal can be monitored from each source separately from Genelec 8030A bi-amplified monitors routed through a Coleman Audio MS6A switcher with monitor controller. The switcher has balanced XLR inputs and outputs to preserve signal-to-noise ratio and features completely passive switching. Interviews are re-recorded using Wavelab, a non-linear digital audio software platform.

Each cassette side is recorded, assigned a number as a preservation master (PM), entered into a PM database including pertinent metadata, and saved as a single audio file into deep storage in a dedicated digital archive maintained by UNC. The interview audio file is then converted into a file for burning a CD listening copy for in-house library patron research. First the file is resampled to 44.1 kHz and 24 bit sample rate for audio processing. The audio file is processed in Sound Forge version 8.0 with Waves X Restoration, VST, Direct X, and Sony audio plug-ins to improve the quality.

A typical file requires two processes: normalization to an average RMS (root mean square) level of -14 dB applying dynamic compression in order to increase the volume, and noise reduction to remove as much background noise, tape hiss, and rumble as possible without affecting the source material. Some files require more specific equalization or a series of noise reduction to achieve audio of suitable quality and volume for researchers. The file is then converted to 16 bit samples, burned to a CD listening copy on a professional grade Mitsui gold audio CD at 4x speed with a Plextor DVDR PX-716A 1.09 drive using Sony CD Architect software, version 5.2. CDs are tested to determine audio is present. Finally, all individual audio files that comprise a complete interview are arranged in order and converted to one single 256 Kbps, 44.1 kHz, 16-bit, stereo MP3 audio file for the Documenting the American South, Oral Histories of the American South collection interface.
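For readers unfamiliar with RMS normalization, the sketch below shows the arithmetic behind hitting a target level of -14 dB. It applies only a single fixed gain. The process described above also uses dynamic compression and is done in Sound Forge, neither of which is reproduced here. The sample values and function names are assumptions made for the example, and samples are taken to be floats where 1.0 is full scale.

```python
import math


def rms_db(samples):
    """RMS level in dB relative to full scale (1.0). Fails on pure silence."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(rms)


def normalize_rms(samples, target_db=-14.0):
    """Scale every sample by one gain factor so the RMS lands on target_db."""
    gain_db = target_db - rms_db(samples)
    factor = 10 ** (gain_db / 20)
    return [s * factor for s in samples]


# A quiet square wave standing in for a quiet interview recording.
quiet = [0.05, -0.05] * 100
louder = normalize_rms(quiet)

print(round(rms_db(quiet), 2), round(rms_db(louder), 2))
```

A plain gain like this raises the loudest peaks along with everything else, which is one reason a real workflow pairs normalization with compression.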

The above description accurately describes the current digitization process, a set of practices resulting from regular evaluation of current digitization standards and our abilities to meet and surpass them with the equipment and time we have available. When audio digitization of the interviews began November 1, 2005, masters were recorded at a 44.1 kHz, 16 bit sample rate. Soon, we hope to replace the LynxOne sound card with a FireWire card, removing another gain structure from the signal chain to create digital preservation masters with the least amount of information lost, or added, as possible.


7
Oct 10

A National Public, and Digital, Library Is Coming

The Chronicle of Higher Education reports this morning on a meeting that took place at Harvard on developing a national digital library. The HathiTrust was mentioned. (It also seems like something Project Bamboo should put on its radar.)


10
Sep 10

The Many Dimensions of Data Visualization

The NEH Institute on Networks and Networking in the Humanities opened with a multi-day salvo aimed at getting participants to think about the importance of visualization. In the embedded TED Talk below, journalist David McCandless makes the case that visualization is itself a form of analysis:


9
Sep 10

Old Maps Explain Current Divisions

I don’t entirely know what to make of these patterns, but they are fascinating in and of themselves: older divides, between Catholics and Protestants in late nineteenth-century Germany in the first case and between the Imperial German and Imperial Russian parts of Poland in the second, map onto current political divisions. The maps are below, and the link to each article appears beneath its map.



5
Aug 10

The Problem(s) with PDF

In Tolkien’s grand narrative, the “one true ring” turned out to be a really bad idea, and it took a three-book sequence to destroy the thing. In the humanities in particular and the academy in general, we continue to be vexed by the lack of a file format that allows for productive interchange and is also open, in both the beer and speech senses. Microsoft’s Word files, DOC and DOCX, are clearly not it, though they are now so ingrained in everyone’s workflows, if only because the application is omnipresent on most Windows computers, that many of us assume they are the basis for any interchange.

But anyone who has had to trade a complex document with more than a few basic style options back and forth a few times has learned that things get lost in transit.

Until recently, however, few applications did a decent job of reading and writing Word’s DOC file format. It was getting better — which may be one of the reasons why Microsoft changed to the DOCX format, who knows? — but it was still not reliable.

What are the alternatives?

  • OpenOffice’s ODF has never quite caught on.
  • RTF is fairly reliable, but it isn’t capable of much.
  • HTML seems so “webby” and hasn’t, at least until CSS3, been at all friendly to printed matter.

Which leaves PDF.

Adobe wisely side-stepped competing directly with Microsoft in producing its “portable document format.” Unfortunately for Adobe, but perhaps fortunately for those of us for whom openness matters, PDF seems to have hit its stride at exactly the moment when the rise of mobile computing devices calls it most into question. After all, who here hasn’t muttered in frustration when opening some simple text content on a phone or tablet and discovering it is a PDF formatted for an 8.5 x 11 piece of paper? Oof!

And yet just as we in the humanities have leaned too much upon Word (I now trade tracked changes in Word documents for journal articles and books, ugh!), we are starting to lean too much on PDF. A recent exchange on the Digital Humanities On-line mailing list turned up the following comment from Stephen Woodruff:

There are many ways of creating and encoding a PDF file, and not all result in text which can be copied and pasted if the text includes more than standard Ascii characters. Normal word processors hold a internationally accepted numerical representation of each letter plus a note of its font, size, colour and so on. So you can search for an “a” without caring whether its in Arial or Times, red or italic, and you can copy that numerical representation to another application, even if it doesn’t understand colour or have the same fonts.

PDF doesn’t always work like that. Some encodings are analogous to what a typical word processor would use, some are not: they store glyphs, effectively pictures of the individual letters, and have a table to convert back between those and the character codes needed by a copy-paste operation. Its that conversion back that can go wrong: you can read the PDF files and print them because all your eyes and the printer need are the shapes, but if they have been created badly you can not reliably extract the text. (I’m trying hard not to start complaining about the use of PDF, which is a PAGE description language not a TEXT description language, in the academic world.)
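A toy model may make the failure easier to picture. The sketch below is not how any real PDF library works and uses no real PDF data; the glyph numbers, both tables, and the function names are hypothetical. It only illustrates the idea that drawing a page needs the shapes, while copying text needs a second, reverse table that may be incomplete.

```python
# What the page needs in order to be drawn: a shape for each glyph number.
# (Single characters stand in for the drawn shapes. All values are hypothetical.)
GLYPH_SHAPES = {1: "c", 2: "a", 3: "f", 4: "é"}

# What copy and paste needs: a table from glyph numbers back to character codes.
# In this badly made file, the entry for glyph 4 was never written.
GLYPH_TO_CHAR = {1: "c", 2: "a", 3: "f"}

page = [1, 2, 3, 4]  # the glyph numbers stored on the page


def render(glyphs):
    """Printing or viewing only needs the shapes, so this always works."""
    return "".join(GLYPH_SHAPES[g] for g in glyphs)


def extract_text(glyphs):
    """Extraction needs the reverse table; gaps become replacement characters."""
    return "".join(GLYPH_TO_CHAR.get(g, "\ufffd") for g in glyphs)


print(render(page))
print(extract_text(page))
```

The page looks right on screen and on paper, but the extracted text has lost its accented letter, which matches the behaviour with non-ASCII characters that Woodruff describes.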

Stephen also points to a terrific post by Adobe’s James King which clarifies PDF’s purpose. King’s post ends with the following:

The PDF design is very tailored to the creator being able to quite directly and without ambiguity, specify the exact output desired. That is a strong virtue for PDF and the price of more difficult text extraction is a price worth paying for that design.

And so there you have it: PDF is really about presentation. Whether you can get text (data) back out of it is not this particular vessel’s problem nor its concern. That seems problematic to me for those of us who wish our content to be as portable and re-usable as possible. I think PDF is terrific as one possible output, one possible product, but it’s not the interchange format of which everyone dreams. Quite the opposite.


26
Jul 10

Federated Is the Future for Open Source

In his remarks to this year’s OSCON, Tim O’Reilly makes the interesting assertion that “federated is the future for open source”. His assertion comes out of his interest in the internet as the next operating system. His example makes the point very clearly (paraphrased):

Imagine yourself out with friends and you decide to get a pizza. What do you do? If you have one of the new smart phones [by which he means iPhone or Android], you can quite literally put the thing to your mouth and speak the word pizza into an app and it will search for places to eat pizza that also happen to be nearby.

The technologies involved are quite astonishing: touch sensors (to activate the app), motion sensors (the device has to know you are moving it up to your head to know to turn on the microphone), a GPS radio (to know where you are), and a microwave radio (to transmit your request).

But the technology doesn’t end there: the speech recognition is not being done on your phone in many instances but “in the cloud” as is the cross-indexing of eateries and your location. All of this is assembled into some form of text — HTML or otherwise — and then sent back to your handset, which now offers you a range of options.

Amazing stuff. But even more amazing is how Google, for example, knows how to understand your spoken request: it has a pretty good sense of what goes with what. It is, after all, in the search business as well. It’s all this data that makes it possible to give you not just an answer but a semantically rich and appropriate one.

Obviously, the more you can cross-pollinate these various data sets, the more interesting your results will be and the more kinds of innovation become possible. But Google owns its (your) searches and Facebook owns its (your) social graphs. Given that the current trend is in this direction, O’Reilly asks the pressing question: where does the open source community go when many of these companies are built on open source (Google runs on Linux, after all, and gives away a lot of the software it develops) but the data itself remains beyond our reach?


30
Jun 10

Pro Photographer’s Workflow

Chase Jarvis is the author of the popular Best Camera blog and book. (His argument is/was the best camera is the one you have with you, and so the book is a collection of photographs taken with his iPhone camera. The subtext is that one should focus on such abilities as composition, lighting, and framing rather than worry about the gear/gadgets in your hand.)

Also on his website is a nice video that details his workflow. Jarvis is a professional photographer with not only a serious staff who accompany him everywhere but also a pretty serious collection of gear. Essentially, he runs all his images and video through Aperture and onto hard drives. (Adobe, are you paying attention? Video!) The hard drives escalate from portable drives in the field, to small RAID drives in hotel rooms, to a serious Xserve setup back at his office/studio.

The takeaway here? Backup, backup, backup. And an important corollary is many, many copies in diverse locations. (Offsite, offsite, offsite.)

A tidbit within all this is the file naming convention they use:

year/project/day/camera/shot

Example:

20100630_ProjectHere_1_S900123.Cr2
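
If you wanted to generate names like this yourself, a small helper would do it. The sketch below is only a guess at how the example breaks down: it assumes the first field is a full date, that "S9" is a camera code, and that "00123" is a five-digit shot counter. None of that is confirmed by the video, and the function name is made up.

```python
from datetime import date


def shot_filename(shoot_date, project, day, camera, shot, ext="Cr2"):
    """Build a name in the year/project/day/camera/shot pattern.

    The split between camera code and shot counter is an assumption.
    """
    return f"{shoot_date:%Y%m%d}_{project}_{day}_{camera}{shot:05d}.{ext}"


print(shot_filename(date(2010, 6, 30), "ProjectHere", 1, "S9", 123))
```

Putting the date first means a plain alphabetical sort of the files is also a chronological one.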

20
May 10

You Are Not a Curator

Thank you, New Curator, for trying to take a bit of wind out of the sails of the ship that seeks to take a perfectly useful term, curation, and a perfectly useful set of skills often embodied in trained professionals known as curators, or also as librarians, and make it so overused as to be as useless as “data mining” or, now, “social media.” Here’s the link.


12
May 10

CMS Made Simple

I have to remind myself now and then that CMS Made Simple is still out there and it’s still inviting. I still prefer the Ruby way, but if I ever do decide to build a much more CMS-oriented site, there’s always CMSMS.


31
Jan 10

Playing with Wolfram Alpha

I decided to play a bit with Wolfram Alpha. If I day traded, it would be a terrific resource. So far, that’s the only thing I have tried that has given me results that I knew what to do with. Now, it could very well be that WA is giving me results that are smarter than I am…

Here’s a trial search

Clicking on the link is just like visiting WA and typing in:

caterpillar cummins john deere

(Searching for makers of heavy equipment was the first thing that came to my mind.)


20
Jan 10

Markdown in Brief

# Header 1 #
## Header 2 ##
### Header 3 ###             (Hashes on right are optional)
#### Header 4 ####
##### Header 5 #####

This is a paragraph, which is text surrounded by whitespace.
Paragraphs can be on one line (or many), and can drone on
for hours.  

Here is a Markdown link to [Warped](http://warpedvisions.org), 
and a literal .  Now some SimpleLinks, like 
one to google (autolinks to are-you-feeling-lucky), a test 
link to a Wikipedia page, and a CPU at foldoc. 

Now some inline markup like _italics_,  **bold**, and `code()`.

![picture alt](/images/photo.jpeg "Title is optional")     

> Blockquotes are like quoted text in email replies
>> And, they can be nested

* Bullet lists are easy too
- Another one
+ Another one

1. A numbered list
2. Which is numbered
3. With periods and a space

And now some code:

    // Code is just text indented a bit
    which(is_easy) to_remember();

Text with  
two trailing spaces  
(on the right)  
can be used  
for things like poems  

Some horizontal rules ...

* * * *
****
--------------------------

17
Jan 10

Linguists Agree to Publish Data

My friend Jason Jackson passes on the news that at the annual meeting of the Linguistic Society of America, the following resolution was passed:

Whereas modern computing technology has the potential of advancing linguistic science by enabling linguists to work with datasets at a scale previously unimaginable; and

Whereas this will only be possible if such data are made available and standards ensuring interoperability are followed; and

Whereas data collected, curated, and annotated by linguists forms the empirical base of our field; …

Therefore, be it resolved at the annual business meeting on 8 January 2010 that the Linguistic Society of America encourages members and other working linguists to:

  • make the full data sets behind publications available, subject to all relevant ethical and legal concerns; …
  • work towards assigning academic credit for the creation and maintenance of linguistic databases and computational tools; and
  • when serving as reviewers, expect full data sets to be published (again subject to legal and ethical considerations) and expect claims to be tested against relevant publicly available datasets.

6
Jan 10

NAS Report on Research Data in the Digital Age

The National Academies Press has just released a 180-page book on Ensuring the Integrity, Accessibility, and Stewardship of Research Data in the Digital Age. The link will take you to the book’s page on the press’s website. It’s available as a paperback for $31.46, as a PDF for $27, or as a combo for $41. You can also follow a link on the page to read it on-line for free.