In Tolkien’s grand narrative, the “one true ring” turned out to be a really bad idea, and it took a three-book sequence to destroy the thing. In the humanities in particular and the academy in general, we continue to be vexed by the lack of a file format that allows for productive interchange and is also open, in both the beer and speech senses. Microsoft’s Word files, DOC and DOCX, are clearly not it, though they are now so ingrained in everyone’s workflows, if only thanks to the application being omnipresent on most Windows computers, that many of us assume they are the basis for any interchange.

But as anyone who has had to trade a complex document, one with more than a few basic style options, back and forth a few times has learned, things get lost in transit.

Until recently, however, few applications did a decent job of reading and writing Word’s DOC file format. It was getting better — which may be one of the reasons why Microsoft changed to the DOCX format, who knows? — but it was still not reliable.

What are the alternatives?

  • OpenOffice’s ODF has never quite caught on.
  • RTF is fairly reliable, but it isn’t capable of much.
  • HTML seems so “webby” and hasn’t, at least until CSS3, been at all friendly to printed matter.

Which leaves PDF.

Adobe wisely side-stepped competing directly with Microsoft in producing its “portable document format.” Unfortunately for Adobe, though perhaps fortunately for those of us for whom openness matters, PDF seems to have really hit its stride at exactly the moment when the rise of mobile computing devices calls it most into question. After all, who here hasn’t muttered in frustration when accessing some simple text content on a phone or tablet and discovering it is in a PDF formatted for an 8.5 x 11 piece of paper? Oof!

And yet just as we in the humanities have leaned too much upon Word (I now traffic in tracking changes in Word documents for journal articles and books, ugh!), we are starting to lean too much on PDF. A recent exchange on the Digital Humanities On-line mailing list turned up the following comment from Stephen Woodruff:

There are many ways of creating and encoding a PDF file, and not all result in text which can be copied and pasted if the text includes more than standard Ascii characters. Normal word processors hold a internationally accepted numerical representation of each letter plus a note of its font, size, colour and so on. So you can search for an “a” without caring whether its in Arial or Times, red or italic, and you can copy that numerical representation to another application, even if it doesn’t understand colour or have the same fonts.

PDF doesn’t always work like that. Some encodings are analogous to what a typical word processor would use, some are not: they store glyphs, effectively pictures of the individual letters, and have a table to convert back between those and the character codes needed by a copy-paste operation. Its that conversion back that can go wrong: you can read the PDF files and print them because all your eyes and the printer need are the shapes, but if they have been created badly you can not reliably extract the text. (I’m trying hard not to start complaining about the use of PDF, which is a PAGE description language not a TEXT description language, in the academic world.)
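To make Woodruff’s point concrete, here is a toy sketch in Python. It is not a PDF parser, and the glyph IDs and tables are invented for illustration; it only models the idea that a page stores shapes and that copy-paste depends on a table mapping those shapes back to character codes.

```python
# Toy model, not a real PDF parser: a "page" is a list of glyph IDs,
# and text extraction depends on a glyph-to-character table.
# All IDs and tables here are hypothetical.
REPLACEMENT = "\ufffd"  # Unicode replacement character


def extract_text(glyph_ids, to_unicode):
    """Map each glyph ID back to a character; unknown glyphs become U+FFFD."""
    return "".join(to_unicode.get(g, REPLACEMENT) for g in glyph_ids)


page = [1, 2, 3, 4]  # four shapes that a reader would see as "café"
good_table = {1: "c", 2: "a", 3: "f", 4: "é"}
bad_table = {1: "c", 2: "a", 3: "f"}  # table omits the non-ASCII glyph

print(extract_text(page, good_table))
print(extract_text(page, bad_table))
```

Both versions of the page would look and print the same, since the shapes are identical; only the second loses the accented letter when you try to get the text back out.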

Stephen also points to a terrific post by Adobe’s James King which clarifies PDF’s purpose. King’s post ends with the following:

The PDF design is very tailored to the creator being able to quite directly and without ambiguity, specify the exact output desired. That is a strong virtue for PDF and the price of more difficult text extraction is a price worth paying for that design.

And so there you have it: PDF is really about presentation. Whether you can get text (data) back out of it is not this particular vessel’s problem nor its concern. That seems problematic to me for those of us who wish our content to be as portable and re-usable as possible. I think PDF is terrific as one possible output, one possible product, but it’s not the interchange format of which everyone dreams. Quite the opposite.

In his remarks to this year’s OSCON, Tim O’Reilly makes the interesting assertion that “federated is the future for open source”. His assertion comes out of his interest in the internet as the next operating system. His example makes the point very clearly (paraphrased):

Imagine yourself out with friends and you decide to get a pizza. What do you do? If you have one of the new smart phones [by which he means iPhone or Android], you can quite literally put the thing to your mouth and speak the word pizza into an app and it will search for places to eat pizza that also happen to be nearby.

The technologies involved are quite astonishing: touch sensors (to activate the app), motion sensors (the device has to know you are moving it up to your head to know to turn on the microphone), a GPS radio (to know where you are), and a microwave radio (to transmit your request).

But the technology doesn’t end there: in many instances the speech recognition is being done not on your phone but “in the cloud,” as is the cross-indexing of eateries and your location. All of this is assembled into some form of text, HTML or otherwise, and then sent back to your handset, which now offers you a range of options.

Amazing stuff. But even more amazing is how Google, for example, knows how to understand your spoken request: it has a pretty good sense of what goes with what. It is, after all, in the search business as well. It’s all this data that makes it possible to give you not just an answer but a semantically rich and appropriate one.

Obviously, the more you can cross-pollinate these various data sets, the more interesting your results will be and the more kinds of innovation become possible. But Google owns its (your) searches and Facebook owns its (your) social graphs. Given that the current trend is in this direction, O’Reilly asks the pressing question: where does the open source community go when a lot of these companies are built on open source (Google runs on Linux, after all, and gives away a lot of the software it develops) but the data itself remains beyond our reach?

Chase Jarvis is the author of the popular Best Camera blog and book. (His argument is/was that the best camera is the one you have with you, and so the book is a collection of photographs taken with his iPhone camera. The subtext is that you should focus on such abilities as composition, lighting, and framing rather than worry about the gear/gadgets in your hand.)

Also on his website is a nice video that details his workflow. Jarvis is a professional photographer with not only a serious staff who accompany him everywhere but also a pretty serious collection of gear. Essentially, he runs all his images and video through Aperture and onto hard drives. (Adobe, are you paying attention? Video!) The hard drives escalate from portable drives in the field, to small RAID drives in hotel rooms, to a serious Xserve setup back at his office/studio.

The takeaway here? Backup, backup, backup. And an important corollary is many, many copies in diverse locations. (Offsite, offsite, offsite.)

A tidbit within all this is the file naming convention they use:

year/project/day/camera/shot

Example:

20100630_ProjectHere_1_S900123.Cr2
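If you wanted to generate names like that yourself, a minimal sketch in Python might look like the following. The function name and parameters are my own, and I am assuming that the last chunk of the example, S900123, is a camera code (S90) followed by a four-digit shot number (0123); the video does not spell that out.

```python
from datetime import date


def shot_filename(shoot_date, project, day, camera, shot, ext="Cr2"):
    """Build a name shaped like 20100630_ProjectHere_1_S900123.Cr2.

    Assumption: the final chunk is a camera code followed by a
    zero-padded, four-digit shot number.
    """
    return f"{shoot_date:%Y%m%d}_{project}_{day}_{camera}{shot:04d}.{ext}"


print(shot_filename(date(2010, 6, 30), "ProjectHere", 1, "S90", 123))
```

Putting the date first in year-month-day order means a plain alphabetical sort of the files is also a chronological one.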

Thank you, New Curator, for trying to take a bit of wind out of the sails of the ship that seeks to take a perfectly useful term, curation, and a perfectly useful set of skills often embodied in trained professionals known as curators, or also as librarians, and make it so overused as to be as useless as “data mining” or, now, “social media.” Here’s the link.

I have to remind myself now and then that CMS Made Simple is still out there and it’s still inviting. I still prefer the Ruby way, but if I ever do decide to build a much more CMS-oriented site, there’s always CMSMS.

I decided to play a bit with Wolfram Alpha. If I day traded, it would be a terrific resource. So far, that’s the only thing I have tried that has given me results that I knew what to do with. Now, it could very well be that WA is giving me results that are smarter than I am…

Here’s a trial search.

Clicking on the link is just like visiting WA and typing in:

caterpillar cummins john deere

(Searching for makers of heavy equipment was the first thing that came to my mind.)
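For the curious, a link like that is just the query tucked into a URL. Here is a small Python sketch; note that the address and the "i" parameter name are my assumptions about how WA’s input page takes a query, not something taken from its documentation.

```python
from urllib.parse import urlencode

# Assumed address of the WA input page; treat it as a placeholder.
WA_INPUT_URL = "https://www.wolframalpha.com/input"


def wa_search_url(query):
    """Return a link that carries the query in an assumed 'i' parameter."""
    return f"{WA_INPUT_URL}?{urlencode({'i': query})}"


print(wa_search_url("caterpillar cummins john deere"))
```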

My friend Jason Jackson passes on the news that at the annual meeting of the Linguistic Society of America, the following resolution was passed:

Whereas modern computing technology has the potential of advancing linguistic science by enabling linguists to work with datasets at a scale previously unimaginable; and

Whereas this will only be possible if such data are made available and standards ensuring interoperability are followed; and

Whereas data collected, curated, and annotated by linguists forms the empirical base of our field; …

Therefore, be it resolved at the annual business meeting on 8 January 2010 that the Linguistic Society of America encourages members and other working linguists to:

  • make the full data sets behind publications available, subject to all relevant ethical and legal concerns; …
  • work towards assigning academic credit for the creation and maintenance of linguistic databases and computational tools; and
  • when serving as reviewers, expect full data sets to be published (again subject to legal and ethical considerations) and expect claims to be tested against relevant publicly available datasets.

The National Academies Press has just released a 180-page book on Ensuring the Integrity, Accessibility, and Stewardship of Research Data in the Digital Age. The link will take you to the book’s page on the press’s website. It’s available as a paperback for $31.46, as a PDF for $27, or as a combo for $41. You can also follow a link on the page to read it on-line for free.

Whytootackay Island by Lieutenant G Tobin aboard HMS Providence in 1792

Slashdot brought this BBC story to my attention:

The BBC reports that researchers are digitizing the captains’ logs from the voyages of Charles Darwin on HMS Beagle, Captain Cook from HMS Discovery, Captain Bligh from The Bounty, and 300 other 18th and 19th century ships’ logbooks to provide historical climate records for modern-day climate researchers who will use the meteorological data to build up a picture of weather patterns in the world at the beginning of the industrial era. The researchers are cross-referencing the data with historical records for crop failures, droughts and storms and will compare it with data for the modern era in order to predict similar events in the future.

Andy Kessler, in an op-ed in the 19 August 2009 Wall Street Journal, assumes that AT&T killed the Google Voice app for the iPhone. Apple disagrees, but his essential point, that Google Voice is feature-rich while current telephony is feature-poor, remains. His argument: AT&T is dying and it’s slowing us down as it goes. I’m not one for such grand rhetoric, but what I think is crucial is his argument that we need to do away with regulation of telephony and television, and with the national communications policy altogether, and focus on a National Data Policy with the following assumptions:

  • End phone exclusivity. Any device should work on any network. Data flows freely.
  • Transition away from “owning” airwaves. As we’ve seen with license-free bandwidth via Wi-Fi networking, we can share the airwaves without interfering with each other. Let new carriers emerge based on quality of service rather than spectrum owned. Cellphone coverage from huge cell towers will naturally migrate seamlessly into offices and even homes via Wi-Fi networking. No more dropped calls in the bathroom.
  • End municipal exclusivity deals for cable companies. TV channels are like voice pipes, part of an era that is about to pass. A little competition for cable will help the transition to paying for shows instead of overpaying for little-watched networks. Competition brings de facto network neutrality and open access (if you don’t like one service blocking apps, use another), thus one less set of artificial rules to be gamed.
  • Encourage faster and faster data connections to our homes and phones. It should more than double every two years. To homes, five megabits today should be 10 megabits in 2011, 25 megabits in 2013 and 100 megabits in 2017. These data-connection speeds are technically doable today, with obsolete voice and video policy holding it back.
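The numbers in that last point are easy to check. Below is a quick Python sketch that works out the growth per two years implied by each pair of milestones. I am assuming that “today” means 2009, the year of the op-ed; the variable and function names are mine.

```python
# Year -> megabits to the home. 2009 for "today" is an assumption
# based on the op-ed's publication date.
milestones = {2009: 5, 2011: 10, 2013: 25, 2017: 100}


def two_year_factors(points):
    """Return (start, end, growth factor per two years) for each step."""
    years = sorted(points)
    steps = []
    for start, end in zip(years, years[1:]):
        ratio = points[end] / points[start]
        # Normalize the ratio to a two-year span.
        steps.append((start, end, ratio ** (2 / (end - start))))
    return steps


for start, end, factor in two_year_factors(milestones):
    print(f"{start}-{end}: x{factor:.2f} per two years")
```

By this reckoning the first and last steps exactly double every two years, and only the 2011 to 2013 step more than doubles.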