Python and PDFs

Real Python has a tutorial on How to Work With a PDF in Python. I subscribe to Real Python because I find their tutorials well-written or, in the case of video tutorials, well-presented. The focus of this tutorial is the PyPDF2 module, which can get metadata from a PDF, rotate pages, merge or split a PDF, and/or encrypt it. While the tutorial mentions “extract information,” that does not mean PyPDF2 can get text from a PDF that does not already have a text layer embedded on its pages. (You could argue that the unintuitive nature of PDFs reveals their brokenness, but that’s for another time.) If you want to get text where there is no text layer, but you still want to use Python, it looks like you have to turn to PDFMiner, though a quick skim of its GitHub page doesn’t reveal whether it has OCR capabilities baked in. Sigh.
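To make the text-layer point concrete, here is a minimal sketch of how one might check whether a PDF has any extractable text at all. It assumes the third-party `pypdf` package (the maintained successor to PyPDF2), which is my choice here and not necessarily what the tutorial uses; the function names `page_texts` and `has_text_layer` are made up for illustration.

```python
def page_texts(path):
    """Return the extractable text of each page (empty string if none).

    Assumes the third-party `pypdf` package is installed; it is imported
    here, inside the function, so the helper below works without it.
    """
    from pypdf import PdfReader

    reader = PdfReader(path)
    return [page.extract_text() or "" for page in reader.pages]


def has_text_layer(texts, min_chars=1):
    """Guess whether a PDF has a text layer from its per-page text.

    A scanned, image-only PDF typically yields nothing but whitespace,
    so we count the non-whitespace-trimmed characters across all pages.
    """
    return sum(len(text.strip()) for text in texts) >= min_chars
```

Something like `has_text_layer(page_texts("scan.pdf"))` (with `scan.pdf` standing in for a real file) coming back `False` would suggest the file needs OCR before any Python library can pull text out of it.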

HTML to PDF

In an ideal setup, my workflow would have me writing in some version of plain text (a flavor of markdown in all probability) that could be quickly and easily output to a variety of formats and media. In most instances, that output gets printed, or at least paginated, which means it probably has to be instantiated, at least for a moment, as a PDF. (If I remember correctly, this is essentially how the macOS display and printing system works.) What that would require is a collection of CSS files that transform the generated HTML into the various kinds of documents I regularly produce: essays, reports, letters, lectures, etc.

This is what the Marked app does, and does well. It is also functionality built into the Ulysses app, if I remember correctly. Neither of those apps, I believe, offers pagination, which is often critical to what I output, and so I have continued to search for my own solution in hopes of building it into a workflow. For the record, when I am working on long-form plain text, my editor of choice is FoldingText, because it does a brilliant job of hiding the markdown unless you are working on that sentence and, as the name implies, it makes it possible to hide all but the section of the document on which you are working. It’s brilliant. (To be clear, I am a fan of all the apps mentioned here and of their developers.)

Getting from plain text via markdown or MultiMarkdown to HTML, pairing that HTML with a paged-media-aware CSS file, and then outputting to PDF is not as easy as it should be. Until recently, the only app I was aware of was PrinceXML, which its creators have made free for non-commercial use, but with the imposition of a small watermark. That’s very generous, but it’s not quite what I want, and I don’t have the kind of money to afford a desktop license.

And so it was a delightful surprise to discover that there are free software options to explore:

  • wkhtmltopdf and wkhtmltoimage are “open source (LGPLv3) command line tools to render HTML into PDF and various image formats using the Qt WebKit rendering engine. These run entirely headless and do not require a display or display service.”
  • WeasyPrint is a “visual rendering engine for HTML and CSS that can export to PDF. … It is based on various libraries but not on a full rendering engine like Blink, Gecko or WebKit. The CSS layout engine is written in Python, designed for pagination, and meant to be easy to hack on.”
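
As a rough idea of what the HTML-plus-CSS-to-PDF step could look like with WeasyPrint, here is a minimal sketch. It assumes the HTML has already been generated from markdown by some other tool and that the third-party `weasyprint` package is installed. The helper names (`build_page_css`, `wrap_html`, `render_pdf`) and the default page size and margin are mine, purely for illustration; a real setup would keep one stylesheet per document type (essay, letter, and so on).

```python
import html


def build_page_css(size="A4", margin="2.5cm"):
    """Return a tiny paged-media stylesheet: page size, margins, page numbers."""
    return (
        "@page {\n"
        f"  size: {size};\n"
        f"  margin: {margin};\n"
        "  @bottom-center { content: counter(page); }\n"
        "}\n"
    )


def wrap_html(body_html, title="Untitled"):
    """Wrap an HTML fragment (e.g. markdown output) in a complete document."""
    return (
        "<!DOCTYPE html>\n"
        '<html><head><meta charset="utf-8">'
        f"<title>{html.escape(title)}</title></head>\n"
        f"<body>\n{body_html}\n</body></html>\n"
    )


def render_pdf(body_html, target, size="A4", margin="2.5cm"):
    """Render the fragment to a paginated PDF at `target`.

    Assumes WeasyPrint is installed; it is imported lazily so the two
    helpers above can be used (and checked) without it.
    """
    from weasyprint import CSS, HTML

    stylesheet = CSS(string=build_page_css(size, margin))
    HTML(string=wrap_html(body_html)).write_pdf(target, stylesheets=[stylesheet])
```

Calling `render_pdf("<h1>Report</h1><p>Body text.</p>", "report.pdf", size="Letter", margin="1in")` would, under those assumptions, produce a paginated PDF with a page number centered at the bottom of each page.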

Next up … trying WeasyPrint and an update/report here.

PDF Index Generator

PDF Index Generator is an indexing utility for generating a back-of-the-book index and writing it into your book in four steps. It parses your PDF, collects the index words and their locations in the PDF, then writes the generated index to a PDF or a text file you specify. Its main purpose is to automate the process of generating a book index instead of doing the hard work manually.

I didn’t get to try this out, but I had it bookmarked. (UPM was very kind to do the indexing for me since the moment it needed to be done was also the moment that my father died.)

Shrink Preview Files

[Macworld has a great tip](http://www.macworld.com/article/1168311/shrink_preview_files_without_ruining_image_quality.html) on how to shrink Preview files without ruining image quality. Essentially, it entails navigating to:

/System/Library/Filters

and then copying the file `Reduce File Size.qfilter` to some place where you can edit it. No fear: it’s an XML file, which means you can produce multiple versions for different effects, making sure you give the versions different names so you can move them back into the `Filters` directory when you are done. (You will need to be able to authenticate to do so, just as you had to make a copy of the file elsewhere in order to work with it: Mac OS X does not want you editing its innards live.)

The parameters you are going to adjust are `Compression Quality` and `ImageSizeMax`. The article suggests some values, which are a good place to start as you tweak things for your own needs.
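
If you would rather script the copies than edit the XML by hand, something like the following sketch might work. It assumes the `.qfilter` file is an XML property list that Python’s standard `plistlib` can read, and it makes no assumption about where the two keys sit in the file: it simply searches the whole structure for them. The function names and the example values are placeholders of mine, not the article’s recommendations.

```python
import plistlib


def set_keys(node, updates):
    """Overwrite every key named in `updates`, however deeply it is nested.

    Returns the number of values that were replaced.
    """
    changed = 0
    if isinstance(node, dict):
        for key, value in node.items():
            if key in updates:
                node[key] = updates[key]
                changed += 1
            else:
                changed += set_keys(value, updates)
    elif isinstance(node, list):
        for item in node:
            changed += set_keys(item, updates)
    return changed


def make_variant(source, target, quality, max_pixels):
    """Write a copy of the filter at `source` to `target` with new settings.

    `source` should be your working copy, not the file in the system folder.
    """
    with open(source, "rb") as fh:
        settings = plistlib.load(fh)
    changed = set_keys(
        settings,
        {"Compression Quality": quality, "ImageSizeMax": max_pixels},
    )
    with open(target, "wb") as fh:
        plistlib.dump(settings, fh)  # written back out as an XML plist
    return changed
```

For example, `make_variant("Reduce File Size.qfilter", "Reduce File Size Medium.qfilter", 0.5, 1024)` would write a second, differently named filter next to your working copy. A return value of `0` means neither key was found, in which case the file’s layout is not what this sketch assumes.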