HTML to PDF

In an ideal setup, I would write in some version of plain text, in all probability a flavor of markdown, that could be quickly and easily output to a variety of formats and media. In most instances, that output gets printed, or at least paginated, which means it probably has to be instantiated, at least for a moment, as a PDF. (If I remember correctly, this is essentially how the macOS display and printing system works.) In practice, that would mean a collection of CSS files that transform the generated HTML into the various kinds of documents I regularly produce: essays, reports, letters, lectures, etc.

This is what the Marked app does, and does well; if I remember correctly, the same functionality is built into the Ulysses app. Neither of those apps, I believe, offers pagination, which is often critical to what I output. And so I have continued to search for my own solution in hopes of building it into a workflow. For the record, when I am working on long-form plain text, my editor of choice is FoldingText, because it does a brilliant job of hiding the markdown unless you are working on that sentence and, as the name implies, it makes it possible to hide all but the section of the document on which you are working. It’s brilliant. (To be clear, I am a fan of all the apps mentioned here and of their developers.)

Getting from plain text via markdown or MultiMarkdown to HTML, pairing that HTML with a paged-media-aware CSS file, and then outputting to PDF is not as easy as it should be. Until recently, the one app of which I was aware was PrinceXML, which its creators have made free for non-commercial use, but with a small watermark imposed on the output. That’s very generous, but it’s not quite what I want, and I don’t have the kind of money to afford a desktop license.

And so it was a delightful surprise to discover that there are free software options to explore:

  • wkhtmltopdf and wkhtmltoimage are “open source (LGPLv3) command line tools to render HTML into PDF and various image formats using the Qt WebKit rendering engine. These run entirely headless and do not require a display or display service.”
  • WeasyPrint is a “visual rendering engine for HTML and CSS that can export to PDF. … It is based on various libraries but not on a full rendering engine like Blink, Gecko or WebKit. The CSS layout engine is written in Python, designed for pagination, and meant to be easy to hack on.”

Next up … trying WeasyPrint and an update/report here.
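
For anyone who wants to try the same thing, here is a minimal sketch of the HTML-plus-CSS-to-PDF step using WeasyPrint's Python API. It assumes WeasyPrint is installed; the stylesheet, the function name `html_to_pdf`, and the page size, margin, and font choices are all placeholders of mine rather than anything prescribed by WeasyPrint or by a particular document type.

```python
from pathlib import Path

# Paged-media rules: page size, margins, and a running page number.
# Every value here is an illustrative assumption, not a recommendation.
ESSAY_CSS = """
@page {
    size: Letter;
    margin: 1in;
    @bottom-center { content: counter(page) " / " counter(pages); }
}
body { font-family: serif; font-size: 12pt; line-height: 1.5; }
"""


def html_to_pdf(html_path, pdf_path, css=ESSAY_CSS):
    """Render an HTML file to a paginated PDF with WeasyPrint."""
    # Imported lazily so this module still loads where WeasyPrint is absent.
    from weasyprint import CSS, HTML

    HTML(filename=str(html_path)).write_pdf(
        str(pdf_path), stylesheets=[CSS(string=css)]
    )
    return Path(pdf_path)
```

Calling `html_to_pdf("essay.html", "essay.pdf")` (hypothetical file names) should produce the PDF. Keeping one CSS string or file per document type (essay, report, letter, lecture) is one way to get the "collection of CSS files" described above. WeasyPrint also ships a command-line tool that takes an input file, an output file, and a stylesheet option, which may be the easier route into a shell-based workflow.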

Schema

Google, Microsoft, and Yahoo have gotten together to adapt a collection of microformats that will make it possible for folks who produce and publish content to the web to make searching that content more meaningful:

> Most webmasters are familiar with HTML tags on their pages. Usually, HTML tags tell the browser how to display the information included in the tag. For example, `<h1>Avatar</h1>` tells the browser to display the text string “Avatar” in a heading 1 format. However, the HTML tag doesn’t give any information about what that text string means (“Avatar” could refer to the hugely successful 3D movie, or it could refer to a type of profile picture), and this can make it more difficult for search engines to intelligently display relevant content to a user.

> Schema.org provides a collection of shared vocabularies webmasters can use to mark up their pages in ways that can be understood by the major search engines: Google, Microsoft, and Yahoo!

> You use the schema.org vocabulary, along with the microdata format, to add information to your HTML content. While the long term goal is to support a wider range of formats, the initial focus is on Microdata. This guide will help get you up to speed with microdata and schema.org, so that you can start adding markup to your web pages.
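
To make the idea concrete, here is a small sketch of what microdata adds to a page and how a program could read it back out. The markup follows the general shape of the schema.org Movie examples (`itemscope`, `itemtype`, `itemprop`); the `ItempropCollector` class is a hypothetical, deliberately simplified reader of my own built on Python's standard `html.parser`, not a full microdata parser and not how any search engine actually does it.

```python
from html.parser import HTMLParser

# The same "Avatar" heading, now annotated so a machine can tell it is a movie.
MARKUP = """
<div itemscope itemtype="https://schema.org/Movie">
  <h1 itemprop="name">Avatar</h1>
  <span>Director: <span itemprop="director">James Cameron</span></span>
  <span itemprop="genre">Science fiction</span>
</div>
"""


class ItempropCollector(HTMLParser):
    """Collect the text of every element that carries an itemprop attribute.

    Simplification: assumes flat properties with plain-text values and a
    single item per document.
    """

    def __init__(self):
        super().__init__()
        self.item_type = None
        self.properties = {}
        self._current = None  # name of the itemprop whose text is being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemtype" in attrs:
            self.item_type = attrs["itemtype"]
        if "itemprop" in attrs:
            self._current = attrs["itemprop"]
            self.properties[self._current] = ""

    def handle_data(self, data):
        if self._current is not None:
            self.properties[self._current] += data

    def handle_endtag(self, tag):
        self._current = None


collector = ItempropCollector()
collector.feed(MARKUP)
print(collector.item_type)
print(collector.properties)
```

The plain `<h1>` only says "display this as a heading"; the `itemtype` and `itemprop` attributes are what tell a reader that “Avatar” is the name of a movie.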

Syntax Highlighting in Word

I am working on my paper for the computational folkloristics panel at AFS this year. My goal is to apply some of the network theory and visualization methods I learned at the NEH Institute on Networks and Networking in the Humanities to the intellectual history of folklore studies. I thought an interesting phenomenon to tackle would be the emergence of performance studies as a paradigm. That is, what does a paradigm shift look like from the point of view of a network? What did it look like in folklore studies?

To do this work I am interacting with JSTOR’s *Data for Research* program, and I am trying to keep notes as I go. Because this will eventually be something I want to share with others, I am keeping my notes in Word, if only because I can control the presentation much more readily. The XML with which I am working would be more readable with some syntax highlighting, a feature I count on in my text editor, TextMate, but which is not available in Word … unless, of course, you happen upon online sites which will do the work for you.

One such site is [ToHTML](http://tohtml.com/). [PlanetB](http://www.planetb.ca/2008/11/syntax-highlight-code-in-word-documents/) will also do some syntax highlighting.
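
For a sense of what such sites are doing, here is a rough sketch that colors the tags in a piece of XML by wrapping them in HTML `<span>` elements with inline styles. It uses only the Python standard library. The function name `highlight_xml`, the color, the font, and the sample XML are all assumptions of mine (the sample is made up and is not JSTOR's format), and a real highlighter would distinguish attributes, comments, and so on.

```python
import html
import re

# Anything from "<" to the next ">" is treated as a tag. This is a
# simplification that ignores comments, CDATA, and ">" inside attributes.
TAG = re.compile(r"(<[^>]+>)")


def highlight_xml(source):
    """Return an HTML <pre> block with the XML tags colored."""
    parts = []
    for chunk in TAG.split(source):
        if not chunk:
            continue
        escaped = html.escape(chunk)  # so the XML shows as text, not markup
        if TAG.fullmatch(chunk):
            parts.append(f'<span style="color:#1f4e9c">{escaped}</span>')
        else:
            parts.append(escaped)
    return '<pre style="font-family:monospace">' + "".join(parts) + "</pre>"


sample = '<article id="1"><title>A made-up title</title></article>'
print(highlight_xml(sample))
```

Saving the result as an `.html` file and opening it in a browser gives colored text that can be copied into Word; how faithfully Word keeps the styling is worth checking on your own setup.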

Zen Coding for HTML

[Zen Coding for HTML](http://www.downloadsquad.com/2010/04/30/if-you-code-html-zen-coding-will-change-your-life/) allows you to type this:

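    div#page>div.logo+ul#navigation>li*5>a

and have your text editor convert it to this:

    <div id="page">
        <div class="logo"></div>
        <ul id="navigation">
            <li><a href=""></a></li>
            <li><a href=""></a></li>
            <li><a href=""></a></li>
            <li><a href=""></a></li>
            <li><a href=""></a></li>
        </ul>
    </div>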
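
As a rough illustration of how an abbreviation like this expands, here is a minimal Python sketch that handles only the operators used above: `>` (child), `+` (sibling), `*N` (repeat), `#id`, and `.class`. It is my own toy, not the Zen Coding implementation; the names `expand` and `_render` are hypothetical, grouping and other syntax are not supported, and it puts every element on its own line, so the whitespace differs from what the real plugin emits.

```python
import re

# One element: tag name, optional #id, optional .classes, optional *count.
ELEMENT = re.compile(
    r"(?P<tag>[a-z][a-z0-9]*)"
    r"(?:#(?P<id>[\w-]+))?"
    r"(?P<classes>(?:\.[\w-]+)*)"
    r"(?:\*(?P<count>\d+))?"
)


def expand(abbreviation, indent="    "):
    """Expand a small subset of Zen Coding syntax into nested HTML."""
    # Each ">" opens a deeper level; "+" separates siblings on a level.
    levels = [level.split("+") for level in abbreviation.split(">")]
    return "\n".join(_render(levels, 0, indent))


def _render(levels, depth, indent):
    if depth == len(levels):
        return []
    pad = indent * depth
    siblings = levels[depth]
    lines = []
    for position, item in enumerate(siblings):
        match = ELEMENT.fullmatch(item)
        if match is None:
            raise ValueError(f"unsupported abbreviation part: {item!r}")
        tag = match["tag"]
        attrs = ""
        if match["id"]:
            attrs += f' id="{match["id"]}"'
        if match["classes"]:
            names = match["classes"].split(".")[1:]
            attrs += f' class="{" ".join(names)}"'
        if tag == "a":
            attrs += ' href=""'  # anchors get an empty href, as in the example
        # Only the last sibling on a level receives the next level as children.
        is_last = position == len(siblings) - 1
        children = _render(levels, depth + 1, indent) if is_last else []
        for _ in range(int(match["count"] or 1)):
            if children:
                lines.append(f"{pad}<{tag}{attrs}>")
                lines.extend(children)
                lines.append(f"{pad}</{tag}>")
            else:
                lines.append(f"{pad}<{tag}{attrs}></{tag}>")
    return lines


print(expand("div#page>div.logo+ul#navigation>li*5>a"))
```

The point of the sketch is the reading order: work left to right, go one level deeper at each `>`, stay on the same level at each `+`, and repeat an element (children included) for each `*N`.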

Convert HTML to text

I have forgotten where I copied this script from:

    #!/bin/bash
    # Usage: convert-html-to-md […]
    # Convert the specified HTML files into Markdown text-format equivalents
    # in the current working directory. The file extension will be .md.txt.
    # Requires the html2text.py Python script by Aaron Swartz to convert
    # from HTML to Markdown text [www.aaronsw.com/2002/html2text/].
    # The first argument is the path to html2text.py; the remaining
    # arguments are the HTML files to convert.
    html2text="${1}"
    shift

    while [ -n "${1}" ] ; do
        # Use the contents of the title element for the filename. In case
        # the title element spans multiple lines, the entire file is first
        # converted to a single line before the sed pattern is applied. Any
        # "unsafe" characters are then replaced with hyphens to produce a
        # valid filename.
        title=$(cat "${1}" | \
            tr -d '\n\r' | \
            sed -nre 's/^.*<title>([^<]*)<\/title>.*$/\1\n/ip' | \
            tr "\`~\!@#$%^&*()+={}|[]\\:;\"\'<>?,/ \t" '[-*]')

        # If there's no title, then just use the original filename.
        if [ -z "${title}" ] ; then
            title=$(basename "${1}" .html)
        fi

        # Convert the HTML to Markdown.
        cat "${1}" | python "${html2text}" > "${title}.md.txt"
        shift
    done

Posted on 2008 May 8 by johnlaudun. Tagged: code, html, python.