HTML to PDF

In an ideal setup, my workflow would have me writing in some version of plain text (a flavor of markdown, in all probability) that could be quickly and easily output to a variety of formats and media. In most instances, that output gets printed, or at least paginated, which means it probably has to be instantiated, at least for a moment, as a PDF. (If I remember correctly, this is essentially how the macOS display and printing systems work.) In practice, that would mean a collection of CSS files that transform the generated HTML into the various kinds of documents I regularly produce: essays, reports, letters, lectures, etc.

This function is what the Marked app does and does well; if I remember correctly, it’s also built into the Ulysses app. Neither of those apps, I believe, offers pagination, which is often critical to what I output. And so I have continued to search for my own solution in hopes of building it into a workflow. For the record, when I am working on long-form plain text, my editor of choice is FoldingText, because it does a brilliant job of hiding the markdown unless you are working on that sentence and, as the name implies, it makes it possible to hide all but the section of the document on which you are working. It’s brilliant. (To be clear, I am a fan of all the apps mentioned here and of their developers.)

Getting from plain text via markdown or MultiMarkdown to HTML and then pairing that HTML with a page-media aware CSS file and then outputting to PDF is not as easy as it should be. The one app of which I have been aware up until recently was PrinceXML, which its creators have made free for non-commercial use, but with the imposition of a small watermark. That’s very generous, but it’s not quite what I want and I don’t have the kind of money to afford a desktop license.

And so it was a delightful surprise to discover that there are free software options to explore:

  • wkhtmltopdf is an “open source (LGPLv3) command line tools to render HTML into PDF and various image formats using the Qt WebKit rendering engine. These run entirely headless and do not require a display or display service.”
  • WeasyPrint is a “visual rendering engine for HTML and CSS that can export to PDF. … It is based on various libraries but not on a full rendering engine like Blink, Gecko or WebKit. The CSS layout engine is written in Python, designed for pagination, and meant to be easy to hack on.”

Next up … trying WeasyPrint and an update/report here.
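Here is a minimal sketch of the HTML-plus-paged-CSS-to-PDF step using WeasyPrint's Python API. The document kinds, page sizes, margins, and function names are illustrative assumptions, not settings taken from any of the apps above; the only WeasyPrint calls used are `HTML(string=...)`, `CSS(string=...)`, and `write_pdf(...)`.

```python
# Sketch: pair generated HTML with a paged-media stylesheet, then hand both to WeasyPrint.
# PAGE_STYLES, page_css, and html_to_pdf are hypothetical names chosen for illustration.

PAGE_STYLES = {
    "essay": {"size": "Letter", "margin": "1in"},
    "letter": {"size": "Letter", "margin": "1.25in"},
}


def page_css(kind: str) -> str:
    """Build a small @page stylesheet with a page number in the footer."""
    style = PAGE_STYLES[kind]
    return (
        "@page {\n"
        f"  size: {style['size']};\n"
        f"  margin: {style['margin']};\n"
        "  @bottom-center { content: counter(page); }\n"
        "}\n"
    )


def html_to_pdf(html: str, kind: str, target: str) -> None:
    """Render HTML to a paginated PDF (requires `pip install weasyprint`)."""
    # Imported here so that page_css() can be used even without WeasyPrint installed.
    from weasyprint import CSS, HTML

    HTML(string=html).write_pdf(target, stylesheets=[CSS(string=page_css(kind))])
```

Calling `html_to_pdf("<h1>Draft</h1><p>Text</p>", "essay", "draft.pdf")` should write a paginated PDF; keeping one entry per document type in `PAGE_STYLES` is one way to get the "collection of CSS files" described above without maintaining separate files.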

Plain Text Commenting

One of the reasons to continue to use Word is that it makes it easy to add comments and/or mark up a document, in the editorial sense and not in the HTML/XML/TEI sense. [Critic Markup][] is not likely to solve the issue such that those of us who prefer plain text over Word are going to have a breakthrough moment with our colleagues and collaborators, but if you are lucky enough to work with like-minded folks, then [Critic Markup][] may just do what you need.

[Critic Markup]: http://criticmarkup.com
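For a sense of what the markup looks like, Critic Markup wraps additions in `{++ ++}`, deletions in `{-- --}`, substitutions in `{~~ ~> ~~}`, comments in `{>> <<}`, and highlights in `{== ==}`. The sketch below "accepts" every change in a marked-up string; the function name and the regular expressions are my own illustration, not code from the Critic Markup project, and they assume the marks are not nested.

```python
import re


def accept_changes(text: str) -> str:
    """Apply every suggested edit and strip comments and highlight marks."""
    text = re.sub(r"\{\+\+(.*?)\+\+\}", r"\1", text, flags=re.S)      # keep additions
    text = re.sub(r"\{--(.*?)--\}", "", text, flags=re.S)             # drop deletions
    text = re.sub(r"\{~~(.*?)~>(.*?)~~\}", r"\2", text, flags=re.S)   # keep the replacement
    text = re.sub(r"\{>>(.*?)<<\}", "", text, flags=re.S)             # drop comments
    text = re.sub(r"\{==(.*?)==\}", r"\1", text, flags=re.S)          # unwrap highlights
    return text


draft = "Plain {~~txt~>text~~} is {++very ++}good{-- enough--}.{>>cite?<<}"
print(accept_changes(draft))
```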

Vim Resources

More of the plain text CMS stuff, I’m afraid. Steve Francia has a [Vim crash course][vcc], and he also has something he calls his [Ultimate Vim config][uvc] (on GitHub). Swaroop C H has published a terrific introduction to Vim, entitled [A Byte of Vim][bov]. (He has also published an introduction to Python entitled _A Byte of Python_, but as the lack of a link indicates I have not read it.)

[vcc]: http://spf13.com/post/vim-crash-course
[uvc]: http://github.com/spf13/spf13-vim
[bov]: http://swaroopch.com/notes/Vim

vi Quick Guide

## Modes

Vi has two modes: insertion mode and command mode. The editor begins in command mode, where cursor movement and text deletion and pasting occur. Insertion mode begins upon entering an insertion or change command. [ESC] returns the editor to command mode (where you can quit, for example by typing :q!). Most commands execute as soon as you type them, except for “colon” commands, which execute when you press the return key.

## Quitting

:x Exit, saving changes
:q Exit as long as there have been no changes
ZZ Exit and save changes if any have been made
:q! Exit and ignore any changes

## Inserting Text

i Insert before cursor
I Insert before line
a Append after cursor
A Append after line
o Open a new line after current line
O Open a new line before current line
r Replace one character
R Replace many characters

## Motion

h Move left
j Move down
k Move up
l Move right
w Move to next word
W Move to next blank delimited word
b Move to the beginning of the word
B Move to the beginning of blank delimited word
e Move to the end of the word
E Move to the end of blank delimited word
( Move a sentence back
) Move a sentence forward
{ Move a paragraph back
} Move a paragraph forward
0 Move to the beginning of the line
$ Move to the end of the line
1G Move to the first line of the file
G Move to the last line of the file
nG Move to nth line of the file
:n Move to nth line of the file
fc Move forward to c
Fc Move back to c
H Move to top of screen
M Move to middle of screen
L Move to bottom of screen
% Move to associated ( ), { }, [ ]

## Deleting Text

Almost all deletion commands are performed by typing d followed by a motion. For example, dw deletes a word. A few other deletes are:

x Delete character under the cursor
X Delete character to the left of cursor
D Delete to the end of the line
dd Delete current line
:d Delete current line

## Yanking Text

Like deletion, almost all yank commands are performed by typing y followed by a motion. For example, y$ yanks to the end of the line. Two other yank commands are:

yy Yank the current line
:y Yank the current line

## Changing Text

The change command is a deletion command that leaves the editor in insert mode. It is performed by typing c followed by a motion. For example, cw changes a word. A few other change commands are:

C Change to the end of the line
cc Change the whole line

## Putting Text

p Put after the position or after the line
P Put before the position or before the line

## Buffers

Named buffers may be specified before any deletion, change, yank or put command. The general prefix has the form "c where c is any lowercase character. For example, "adw deletes a word into buffer a. It may thereafter be put back into text with an appropriate "ap.

## Markers

Named markers may be set on any line in a file. Any lower case letter may be a marker name. Markers may also be used as limits for ranges.

mc Set marker c on this line
`c Go to beginning of marker c line.
'c Go to first non-blank character of marker c line.

## Search for strings

/string Search forward for string
?string Search back for string
n Search for next instance of string
N Search for previous instance of string

## Replace

The search and replace function is accomplished with the :s command. It is commonly used in combination with ranges or the :g command (below).

:s/pattern/string/flags Replace pattern with string according to flags.
g Flag – Replace all occurrences of pattern
c Flag – Confirm replaces.
& Repeat last :s command

## Regular Expressions

. (dot) Any single character except newline
* Zero or more occurrences of the preceding character
[…] Any single character specified in the set
[^…] Any single character not specified in the set
^ Anchor – beginning of the line
$ Anchor – end of line
\< Anchor – beginning of word
\> Anchor – end of word
\(…\) Grouping – usually used to group conditions
\n Contents of nth grouping
[…] – Set Examples
[A-Z] The SET from Capital A to Capital Z
[a-z] The SET from lowercase a to lowercase z
[0-9] The SET from 0 to 9 (All numerals)
[./=+] The SET containing . (dot), / (slash), =, and +
[-A-F] The SET from Capital A to Capital F and the dash (dashes must be specified first)
[0-9 A-Z] The SET containing all capital letters and digits and a space
[A-Z][a-zA-Z] In the first position, the SET from Capital A to Capital Z; in the second character position, the SET containing all letters
Regular Expression Examples
/Hello/ Matches if the line contains the value Hello
/^TEST$/ Matches if the line contains TEST by itself
/^[a-zA-Z]/ Matches if the line starts with any letter
/^[a-z].*/ Matches if the first character of the line is a-z, followed by zero or more of any character
/2134$/ Matches if line ends with 2134
/\(21\|35\)/ Matches if the line contains 21 or 35. Note the use of \( \) with the escaped pipe \| to specify the ‘or’ condition (this is the Vim form; an unescaped | is a literal character there)
/[0-9]*/ Matches if there are zero or more numbers in the line
/^[^#]/ Matches if the first character is not a # in the line
Notes:
1. Regular expressions are case sensitive
2. Regular expressions are to be used where pattern is specified
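The examples above can be checked outside the editor. The sketch below uses Python's `re` module rather than vi itself, so two things differ by assumption: grouping and alternation are written `(21|35)` with no backslashes, and the surrounding slashes are dropped. The sample lines are made up for the test.

```python
import re

# Each pattern maps to (a line that should match, a line that should not).
EXAMPLES = {
    r"Hello": ("Say Hello", "say hello"),      # case sensitive
    r"^TEST$": ("TEST", "A TEST"),             # TEST by itself
    r"^[a-zA-Z]": ("abc", "1abc"),             # starts with a letter
    r"2134$": ("x2134", "2134x"),              # ends with 2134
    r"(21|35)": ("a35b", "a53b"),              # contains 21 or 35
    r"^[^#]": ("code", "# comment"),           # first character is not #
}


def line_matches(pattern: str, line: str) -> bool:
    """True if the pattern is found anywhere in the line, as with a / search."""
    return re.search(pattern, line) is not None


for pattern, (hit, miss) in EXAMPLES.items():
    assert line_matches(pattern, hit) and not line_matches(pattern, miss)

# "Zero or more" means the pattern also matches a line with no digits at all.
assert line_matches(r"[0-9]*", "no digits here")
```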

## Counting

Nearly every command may be preceded by a number that specifies how many times it is to be performed. For example, 5dw will delete 5 words and 3fe will move the cursor forward to the 3rd occurrence of the letter e. Even insertions may be repeated conveniently with this method, say to insert the same line 100 times.

## Ranges

Ranges may precede most “colon” commands and cause them to be executed on a line or lines. For example :3,7d would delete lines 3-7. Ranges are commonly combined with the :s command to perform a replacement on several lines, as with :.,$s/pattern/string/g to make a replacement from the current line to the end of the file.

:n,m Range – Lines n-m
:. Range – Current line
:$ Range – Last line
:'c Range – Marker c
:% Range – All lines in file
:g/pattern/ Range – All lines that contain pattern

## Files

:w file Write to file
:r file Read file in after line
:n Go to next file
:p Go to previous file
:e file Edit file
!!program Replace line with output from program

## Other

~ Toggle upper and lower case
J Join lines
. Repeat last text-changing command
u Undo last change
U Undo all changes to line

Copied from [Lagmonster](http://www.lagmonster.org/docs/vi.html) and adapted here.

Some Further Notes on a Plain Text (CM) System

If you are working in plain text, you are probably still going to want some way of structuring your text, that is marking it up just a little so that you can do a variety of things with it. As I have already noted, the way that I know best is a variant of Markdown known as MultiMarkdown. But there are other systems out there: I have always been intrigued by the amazing scope of [reStructuredText][] and I am somewhat impressed by [AsciiDoc][]. (By way of contrast, I have always hated MediaWiki markup: it is almost incomprehensible to me.) The beauty of reStructuredText is that you can convert it to HTML or a lot of other formats with `docutils`. Even better is [Pandoc][], which converts back and forth between Markdown, HTML, MediaWiki, man, and reStructuredText. *Oh my!*
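As a rough sketch of driving Pandoc from a script, the snippet below builds a command line using Pandoc's `--from`, `--to`, and `--output` options and runs it only if `pandoc` is on the path. The function names and file names are hypothetical; the format names (`markdown`, `rst`, `html`, `mediawiki`, `man`) are the ones Pandoc uses for the formats mentioned above.

```python
import shutil
import subprocess


def pandoc_command(source: str, target: str, from_format: str, to_format: str) -> list:
    """Build the argument list for one Pandoc conversion."""
    return ["pandoc", "--from", from_format, "--to", to_format, "--output", target, source]


def convert(source: str, target: str, from_format: str, to_format: str) -> None:
    """Run the conversion, failing early if Pandoc is not installed."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc is not installed or not on the PATH")
    subprocess.run(pandoc_command(source, target, from_format, to_format), check=True)


# Example: the command that would turn a Markdown note into reStructuredText.
print(" ".join(pandoc_command("notes.md", "notes.rst", "markdown", "rst")))
```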

You can get Pandoc through a standalone installer or you can get it through MacPorts. To get MacPorts, however, you need the latest version of Xcode, which brings me to the topic of the moment: a plain text system is really founded on the Unix way of doing things, which means that your data is in the clear but you as an operator must be more sophisticated. Standalone applications like MacJournal and DevonThink, which I keep mentioning not at all because they are inadequate but because they are so good and because I use them when I am more in an “Apple” mode of doing things, are wonderful because you download them and all this functionality is built in. At the command line, not only do you assemble the functionality you want out of a variety of small applications, but in order to install or maintain those applications you need to have a better grasp of *what requires what*, also known as *dependencies*.

The useful Python script [Blogpost][], a command line tool for uploading posts directly to a WordPress site, is available through a Google Code project, which requires that you get a local copy through Mercurial, a distributed version control system, which is easily available … through MacPorts. There are other ways to get it, but allowing MacPorts to keep track of it means that you have an easier time getting it updated. This works much like Mac’s Software Update functionality, or the new badges that come with the Mac App store that tell you that updates are available. No badges at the command line, but if you allow MacPorts, also known as a package manager, to, well, manage your packages, then all you need to remember to do is to run `update` once a week or so and all of that stuff is taken care of for you.

And so to summarize the dependencies:

`Blogpost -> Mercurial -> MacPorts -> Xcode`
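Read right to left, that chain is also the install order. A package manager works this out with a topological sort; here is a small sketch of the idea using Python's standard `graphlib` module (Python 3.9 or later). The dictionary simply restates the chain above and is not how MacPorts stores its data.

```python
from graphlib import TopologicalSorter

# Each tool maps to the set of things it needs installed first.
DEPENDS_ON = {
    "Blogpost": {"Mercurial"},
    "Mercurial": {"MacPorts"},
    "MacPorts": {"Xcode"},
}

# static_order() yields every item after all of its prerequisites.
install_order = list(TopologicalSorter(DEPENDS_ON).static_order())
print(" -> ".join(install_order))  # Xcode -> MacPorts -> Mercurial -> Blogpost
```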

Package managers, like MacPorts, only keep track of things locally, that is on the one machine on which they are installed, and not across several machines. It’s a bit of a pain to replicate all these steps across various machines, and so I now understand the appeal of `debconf` for Ubuntu users. I don’t quite know how to make that happen for myself, but I am open to suggestions.

[reStructuredText]: http://docutils.sourceforge.net/docs/ref/rst/introduction.html
[AsciiDoc]: http://www.methods.co.nz/asciidoc/
[Pandoc]: http://johnmacfarlane.net/pandoc/
[Blogpost]: http://srackham.wordpress.com/blogpost-readme/

Some Notes on a Plain Text (CM) System

The idea of a “trusted system” probably can be attributed to David Allen as much as to anyone else. Certainly the idea is his within the current zeitgeist. Even if you have not heard of him you probably have heard the ubiquitous three letters associated with him, GTD. Allen’s focus is on projects and tasks, but the idea of a trusted system applies just as well to any undertaking. For folks who type for a living, be it words in a sentence or functions in a line of code, ideas are just as important as tasks when it comes to accomplishing projects. Allen’s GTD system has a response to ideas, but it largely comes down to putting things in folders.

But as anyone who works with ideas knows, sometimes you don’t know where to put them. And, just as importantly, why should you have to put them in any particular place? In the era of computation (that is, in the era of `grep` and `#tag`), having to file things, at least right away, would seem an anachronism that forces us to return to a paper era that often forced us to ignore the way the human mind works. That is, when operating in rich mode the mind is capable of grasping diffuse patterns across a range of items in a given corpus, but finding those items when they are filed across a number of separate folders, or their digital equivalent, directories, is tedious work. `grep` solves some of that problem, of course.

I have largely committed, in the last few weeks, to using DevonThink as the basis for my workflow, because I like its UI and its various features and because it makes casual use so easy — and when I am sitting in my campus office, I need things to be casually easy.

But the more I learn about DevonThink’s artificial intelligence, the more I want to be able to tweak it, add my own dimensions to it. For example, DevonThink readily gives you a word frequency list, but what if I want to exclude common words from that list? I know a variety of command line programs that allow me to feed them a “stop list”, a list of words to drop from consideration (and indeed these lists are sometimes known as “drop lists”) when presenting me a table of words and the number of times they appear in a given corpus. I am also guessing that when DT offers to “auto group” or “auto classify” a collection of texts, it is using some form of semantic, or keyword, mapping to do so. What if I would like to tweak those results? Not possible. This is, of course, the problem with closed applications.
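The stop-list idea is small enough to sketch in a few lines of Python. The stop words, the tokenizing rule, and the sample sentence below are all illustrative assumptions; a real list would be longer and probably kept in its own plain text file.

```python
import re
from collections import Counter

# A deliberately tiny, hypothetical stop list.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it"}


def word_frequencies(text: str, stop_words=STOP_WORDS) -> Counter:
    """Count lowercase words in the text, skipping anything on the stop list."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(word for word in words if word not in stop_words)


freq = word_frequencies("The mind and the corpus: the corpus is the data.")
print(freq.most_common(2))
```

Because the list is just a Python set, tweaking the results means editing the list, which is exactly the kind of control a closed application does not offer.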

The other problem with applications like DevonThink and MacJournal, as much as I like both of them, is that you can do a lot within them, but not so much without. While neither application holds your data captive — both offer a variety of export options — a lot of their functionality exists within the application itself. Titles, tags, etc.

Having seen what these applications can do and how I use them, would it be possible to replicate much of the functionality I prefer in a plain text system that would also have the advantage of, well, being plain text? As the Linux Information Project notes:

> Plain text is supported by nearly every application program on every operating system and on every type of CPU and allows information to be manipulated (including, searching, sorting and updating) both manually and programmatically using virtually every text processing tool in existence. … This flexibility and portability make plain text the best format for storing data persistently (i.e., for years, decades, or even millennia). That is, plain text provides insurance against the obsolescence of any application programs that are needed to create, read, modify and extend data. Human-readable forms of data (including data in self-describing formats such as HTML and XML) will most likely survive longer than all other forms of data and the application programs that created them. In other words, as long as the data itself survives, it will be possible to use it even if the original application programs have long since vanished.

Who doesn’t want their data to be around several millennia from now? On a smaller horizon, I once lost some data to a Windows NT crash that could not be recovered even with three IT specialists hovering over the machine. (To be fair to Windows NT, I think I remember the power supply was just about to go bad and that it was going to take the hard drive with it.) Ever since that moment, I have had a tendency to want to keep several copies of my data in several places at the same time. Both Dropbox and our NAS satisfy that lingering anxiety, but both of them are largely opaque in their process and they largely sync my data as it exists in various closed formats.

And as the existence of this logbook itself proves, I have problems with focus, and there is something deeply appealing in working inside an environment as singularly focused as a terminal shell. That is, I really do daydream about having a laptop which has no GUI installed. All command line, all the time. Data would be synced via `rsync` or something like it, and I would do various kinds of data manipulation via a set number of scripts that I also maintained via Git or something like it.

Now, the chief problem plain text systems have, compared to other forms of content management, is a lack of an ability to hold metadata, and so the system I have sketched out defaults to two conventions about which I am ambivalent but which I feel offer reasonable working solutions.

The first of these conventions is the filename. Whether I am writing in MacJournal or making a note in my notebook, I tend to label most private entries with their date and time. In MacJournal this looks like this: `2012-01-04-1357`. In my Moleskine notebook, every page has a day header and each entry has its own title. Diary entries are titled with the time they were begun. So a `date-time` file naming convention will work for those notes.

When I am reading, I write down two kinds of things: quotes and notes. Quotes are obvious, but notes can range from short questions to extended responses and brainstorming. Quotes are easily named using the Turabian author-date system which would produce a file name that looks like this: `Author-date-pagenumber(s)`. Such a scheme requires that a key be kept somewhere that decodes `author-date`s into bibliographic entries. What about notes? I think the easiest way to handle this is using `author-date-page-note`. In my own hand-written notes, I tend to handle page numbers to citations within parentheses and pages to notes with square brackets, but I don’t know that regex on filenames is how I want to handle this.
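Both naming conventions are easy to generate and to read back in code. This is only a sketch under assumptions: the exact separators, a four-digit year, and a literal `-note` suffix are my guesses at how `author-date-page-note` might be spelled out, and "Smith" is a made-up author.

```python
import re
from datetime import datetime


def entry_name(moment: datetime) -> str:
    """Date-time name for a private entry, e.g. 2012-01-04-1357."""
    return moment.strftime("%Y-%m-%d-%H%M")


# Author-date-pages, with an optional "-note" suffix marking a note rather than a quote.
READING_NAME = re.compile(
    r"^(?P<author>[A-Za-z]+)-(?P<year>\d{4})-(?P<pages>\d+(?:-\d+)?)(?P<note>-note)?$"
)


def parse_reading_name(name: str):
    """Split a reading-note filename into its parts, or return None if it does not fit."""
    match = READING_NAME.match(name)
    if match is None:
        return None
    fields = match.groupdict()
    fields["kind"] = "note" if fields.pop("note") else "quote"
    return fields


print(entry_name(datetime(2012, 1, 4, 13, 57)))
print(parse_reading_name("Smith-1999-37-38"))
print(parse_reading_name("Smith-1999-12-note"))
```

Keeping the quote/note distinction in a suffix rather than in parentheses or brackets means a simple pattern is enough, which sidesteps heavier regex work on filenames.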

Filenames handle the basics of metadata, in some fashion, but obviously not a lot, and I am being a bit purposeful here in trying to avoid overly long filenames. For additional metadata, I think the best way to go is with Twitter-style “hashtags”. E.g., `#keyword`.
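Pulling hashtags back out of a note takes one regular expression. The sketch below assumes a tag is a `#` at the start of a word followed by a letter, which keeps Markdown headings (`## Modes`) and URL fragments from being counted; the function names and the `.txt` extension are placeholders.

```python
import re
from pathlib import Path

# "#" must start a word and be followed by a letter; hyphens and digits may follow.
TAG = re.compile(r"(?<!\S)#([A-Za-z][\w-]*)")


def tags_in(text: str) -> set:
    """Return the set of hashtags found in a piece of text, without the #."""
    return set(TAG.findall(text))


def notes_tagged(folder: str, tag: str) -> list:
    """List the .txt notes in a folder that carry the given tag."""
    return sorted(
        path.name
        for path in Path(folder).glob("*.txt")
        if tag in tags_in(path.read_text(encoding="utf-8"))
    )


sample = "Notes on vi #vim #plain-text\n## Heading\nsee http://example.com/#frag"
print(sorted(tags_in(sample)))
```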

Where to put the tags, at the beginning like MultiMarkdown or AsciiDoc, or at the end where they don’t interfere with reading? I haven’t decided yet. I use MultiMarkdown, and PHPMarkdown, almost by default when writing in plain text. The current exception to this is that I am not separating paragraphs by an additional line feed, which is the basis for most Markdown variants. This is just something I am trying, because when I am writing prose with dialogue or prose with short paragraphs, the additional white space looks a bit nonsensical. The fact is, after years of being habituated to books, I am used to seeing paragraphs begin with an indent and no extra line spacing. It’s very tidy looking, and so I am playing with a script through which I pass my indented prose notes and which replaces the tab characters, `\t`, with a newline character, `\n`, before passing the text on to Markdown.
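A minimal sketch of that pre-processing step might look like the following. It assumes tabs appear only as paragraph indents, and the function name is made up; stripping the leading newline left by the first paragraph is my own addition.

```python
def indents_to_blank_lines(text: str) -> str:
    """Turn tab-indented paragraphs into blank-line-separated ones for Markdown."""
    # "\n\tNext paragraph" becomes "\n\nNext paragraph", i.e. a blank line between paragraphs.
    converted = text.replace("\t", "\n")
    # The first paragraph's indent would otherwise leave a stray newline at the top.
    return converted.lstrip("\n")


prose = "\tFirst paragraph.\n\tSecond paragraph.\n"
print(indents_to_blank_lines(prose))
```

If tabs can also occur mid-line (in tables or code, say), replacing only tabs at the start of a line would be the safer variation.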

Now, this system is extremely limited: it doesn’t handle media. It doesn’t handle PDFs. It doesn’t handle a whole host of things, but that is also its essence. It’s a work in progress. I will let you know how it goes. Look for the collection of scripts to appear on GitHub at some point in the near future.