Cloud, Kluit, Clod?

Just a quick demonstration of the power of I dubbed my “personal cloud” experiment kluit: a Dutch word meaning both clod and the ball of earth around the roots of a tree. In other words, kluit is firmly grounded because you’ve got your own ground with you wherever you go. Be like Dracula. With a name in mind, I also wanted a matching logo. Following a quick search for leaves, root (or was it tree) and after a little initial play something like attraction, this is the quick and satisfying result.

A couple of floating leaves still connected with their roots. This arrangement symbolizes how creating your personal cloud keeps it grounded.

And of course the remix is free for all. Enjoy.

Three Mushrooms

I realized a long time ago that purely textual exercises are not enough to learn a language (such as French). Three things matter, in order:

  1. Read, read, read. Quantity, not quality. Reading ten children's comics beats not reading a single more complex text.
  2. Listen. Reading is good for vocabulary, but to understand the language you need the spoken language.
  3. Create. Writing, speaking… that's harder.

Inspired by comics, I present three mushrooms. The first mushroom is in great shape. The second mushroom has been drinking. The third mushroom has taken LSD or something.

Just a Star

Messing about a little in Inkscape with my wife’s Wacom CTH-680S tablet on Linux 4.1, after first trying it in Xournal. It seems to be functioning a fair bit better than a few kernel versions ago.

The tablet is really good. I’d recommend it.

Fixing Up Scanned PDFs with Scan Tailor

Scanned PDFs come my way quite often and I don’t infrequently wish they were nicer to use on digital devices. One semi-solution might include running off to the library and rescanning them personally, but there is a middle road between doing nothing and doing too much: digital manipulation. The knight in shining armor is called Scan Tailor. Note that this post is not about merely cropping away some black edges. When you’re just looking for a tool to cut off some unwanted edges, I’d recommend PDF Scissors instead. If you just want to fix some incorrect rotation once and for all, try the pdftools found in texlive-extra-utils, which gives you simple shorthands like pdf90, pdf180 and pdf270. This post is about splitting up double scanned pages, increasing clarity, and adding an OCR layer on top. With that out of the way, if you’re impatient, you can skip to the script I wrote to automate the process.

Coaxing Scan Tailor

Unfortunately Scan Tailor doesn’t directly load scanned PDFs, which is what seems to be produced by copiers by default and what you’re most likely to receive from other people. Luckily this is easy to work around. If you want to use the program on documents you scan yourself, selecting e.g. TIFF in the output options could be a better choice.

To extract the images from PDF files, I use pdfimages. I believe it tends to come preinstalled, but if not grab poppler-utils with sudo apt install poppler-utils.
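
If you're not sure whether pdfimages and the other tools used further down are already installed, you can ask the shell. This is just a minimal sketch; the function name missing_tools is made up for this example and not part of any package.

```bash
# Print every command from the argument list that is not on the PATH.
# missing_tools is a hypothetical helper name, not an existing utility.
missing_tools() {
	for tool in "$@"; do
		command -v "$tool" >/dev/null 2>&1 || echo "$tool"
	done
}

# Example: check some of the tools used in this post.
missing_tools pdfimages tiffcp tiff2pdf tesseract
```

No output means everything on the list was found.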

pdfimages -png filename.pdf outputname

You might want to take a look at man pdfimages. The -j flag makes sure JPEG files are output as is rather than being converted to something else, for instance, while the -tiff option would convert the output to TIFF. Like PNG, that is lossless. What might also be of interest are -ccitt and -all, but in this case I’d want the images as JPEG, PNG or TIFF because that’s what Scan Tailor takes as input.

At this point you could consider cropping your images to aid processing with Scan Tailor, but I’m not entirely sure how to automate it out of the way. Perhaps unpaper with a few flags could work to remove (some) black edges, but functionally speaking Scan Tailor is pretty much unpaper with a better (G)UI. In any case, this could be investigated.

You’ll want to use pdfimages once more to obtain the DPI of your images for use with Scan Tailor, unless you like to calculate the DPI yourself using the document dimensions and the number of pixels. Both PNG and TIFF support this information, but unfortunately pdfimages doesn’t write it.

$ pdfimages -list filename.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
   1     0 image    1664  2339  gray    1   1  ccitt  no         4  0   200   200  110K  23%
   2     1 image    1664  2339  gray    1   1  ccitt  no         9  0   200   200  131K  28%

Clearly our PDF was scanned at a somewhat disappointing 200 DPI. Now you can start Scan Tailor, create a new project based on the images you just extracted, enter the correct DPI, and just follow the very intuitive steps. For more guidance, read the manual. With any setting you apply, take care to apply it to all pages if you wish, because by default the program quite sensibly applies it only to one page. Alternatively you could run scantailor-cli to automate the process, which could reduce your precious time spent to practically zero. I prefer to take a minute or two to make sure everything’s alright. I’m sure I’ll make up the difference by not having to scroll left and right and whatnot afterwards.
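
If you'd rather not read the DPI off the listing by eye, the x-ppi column can be picked out with awk. A rough sketch, assuming the column layout shown above and that all images in the PDF share one DPI; first_image_dpi is a name I made up.

```bash
# Read `pdfimages -list` output on stdin and print the x-ppi value of the
# first image row. Field 13 is x-ppi because "object ID" spans two fields.
# first_image_dpi is a hypothetical helper name.
first_image_dpi() {
	awk '$3 == "image" { print $13; exit }'
}

# Usage: pdfimages -list filename.pdf | first_image_dpi
```

Matching on the type column rather than a fixed line number means it doesn't matter whether the header is followed by a separator line.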

By default Scan Tailor wants to output to 600 DPI, but with my 200 DPI input file that just seemed odd. Apparently it has something to do with the conversion to pure black and white, which necessitates a higher DPI to preserve some information. That being said, 600 DPI seems almost laughably high for 200 DPI input. Perhaps “merely” twice the input DPI would be sufficient. Either way, be sure to use mixed mode on pages with images.

Scan Tailor’s output is not a PDF yet. It’ll require a little bit of post-processing.

Simple Post-Processing

The way I usually go about trying to find new commands already installed on my computer is simply by typing the relevant phrase, in this case tiff. Press Tab for autocomplete. If that fails, you could try apt search tiff, although I prefer a GUI like Synaptic for that. The next stop is a search engine, where you can usually find results faster by focusing on the Arch, Debian or Ubuntu Wikis. On the other hand, blog and forum posts often contain useful information.

$ tiff
tiff2bw     tiff2ps     tiffcmp     tiffcrop    tiffdump    tiffmedian  tiffsplit   
tiff2pdf    tiff2rgba   tiffcp      tiffdither  tiffinfo    tiffset     tifftopnm

tiff2pdf sounds just like what we need. Unfortunately it only processes one file at a time. Easy to fix with a simple shell script, but rtfm (man tiff2pdf) for useful info: “If you have multiple TIFF files to convert into one PDF file then use tiffcp or other program to concatenate the files into a multiple page TIFF file.”

tiffcp *.tif out.tif

You could easily stop there, but for sharing or use on devices a PDF (or DjVu) file is superior. My phone doesn’t even come with a TIFF viewer by default and the one in Dropbox — why does that app open almost all documents by itself anyway? — just treats it as a collection of images, which is significantly less convenient than your average document viewer. Meanwhile, apps like the appropriately named Document Viewer deal well with both PDF and DjVu.

tiff2pdf -o bla.pdf out.tif

Wikipedia suggests the CCITT compression used for black and white text is lossless, which is nice. Interestingly, a 1.8 MB low-quality 200 DPI PDF more than doubled in size with this treatment, but a 20 MB 400 DPI document was reduced in size to 13 MB. Anyway, for most purposes you could consider compressing it with JBIG2, for instance using jbig2enc. Another option might be to ignore such PDF difficulties and use pdf2djvu or to compile a DjVu document directly from the TIFF files. At this point we’re tentatively done.
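
The tiffcp and tiff2pdf steps above can be rolled into one small function so the intermediate multi-page TIFF doesn't clutter the directory. This is only a sketch; tiffs_to_pdf is a made-up name and the temporary file handling is my own choice.

```bash
# Combine single-page TIFFs into one multi-page TIFF, then convert it to PDF.
# tiffs_to_pdf is a hypothetical wrapper around tiffcp and tiff2pdf.
# Usage: tiffs_to_pdf output.pdf page1.tif page2.tif ...
tiffs_to_pdf() (
	out_pdf=$1
	shift
	# Keep the intermediate TIFF in a temporary directory, removed on exit.
	tmp_dir=$(mktemp -d) || exit 1
	trap 'rm -rf "$tmp_dir"' EXIT
	tiffcp "$@" "$tmp_dir/combined.tif" &&
		tiff2pdf -o "$out_pdf" "$tmp_dir/combined.tif"
)
```

The function body runs in a subshell, so the trap and the variables don't leak into your session.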

Harder but Neater Post-Processing

After I’d already written most of this section, I came across this Spanish page that pretty much explains it all. So it goes. Because of that page I decided to add a little note about checkinstall, a program I’ve been using for years but apparently always failed to mention.

You’re going to need jbig2enc. You can grab the latest source or an official release. But first let’s get some generic stuff required for compilation:

sudo apt install build-essential automake autotools-dev libtool

And the jbig2enc-specific dependencies:

sudo apt install libleptonica-dev libjpeg8-dev libpng-dev libtiff5-dev zlib1g-dev

In the jbig2enc-master directory, compile however you like. I tend to do something along these lines:

./autogen.sh
mkdir build && cd build
../configure
make

Now you can sudo make install to install, but you’ll have to keep the source directory around if you want to run sudo make uninstall later. Instead you can use checkinstall (sudo apt install checkinstall, you know the drill). Be careful with this stuff though.

sudo checkinstall make install

You have to enter a name such as jbig2enc, a proper version number (e.g. 0.28-0 instead of 0.28) and that’s about it. That wasn’t too hard.

At this point you could produce significantly smaller PDFs using jbig2enc itself (some more background information):

jbig2 -b outputbasename -p -s whatever-*.tif && pdf.py outputbasename > output.pdf

However, it doesn’t deal with mixed images as well as tiff2pdf does. And while we’re at it, we might just as well set up our environment for some OCR goodness. Mind you, the idea here is just to add a little extra value with no extra time spent after the initial setup. I have absolutely no intention of doing any kind of proofreading or some such on this stuff. The simple fact is that the Scan Tailor treatment drastically improved the chances of OCR success, so it’d be crazy not to do it. There’s a tool called pdfbeads that can automatically pull it all together, but it needs a little setup first.

You need to install ruby-rmagick, ruby-hpricot if you want to do stuff with OCRed text (which is kind of the point), and ruby-dev.

sudo apt install ruby-rmagick ruby-hpricot ruby-dev

Then you can install pdfbeads:

sudo gem install pdfbeads

Apparently there are some issues with iconv or something? Whatever it is, I have no interest in Ruby at the moment and the problem can be fixed with a simple sudo gem install iconv. If iconv is added to the pdfbeads dependencies or if it switches to whatever method Ruby would prefer, this shouldn’t be an issue in the future.

At this point we’re ready for the OCR. sudo apt install tesseract-ocr and whatever languages you might want, such as tesseract-ocr-nld. The -l switch is irrelevant if you just want English, which is the default.

parallel tesseract -l eng+nld {} {.} hocr ::: *.tif

GNU Parallel speeds this up by automatically running as many different tesseracts as you’ve got CPU cores. Install with sudo apt install parallel if you don’t have it, obviously. I’m pretty patient about however much time this stuff might take as long as it proceeds by itself without requiring any attention, but on my main computer this will make everything proceed almost four times as quickly. Why wait any longer than you have to? The OCR results are actually of extremely high quality: it has some issues with italics and that’s pretty much it. It’s not an issue with the characters, but it doesn’t seem to detect the spaces in between words. But what do I care, other than that minor detail it’s close to perfect and this wasn’t even part of the original plan. It’s a very nice bonus.

Once that’s done, we can produce our final result:

pdfbeads *.tif > out.pdf

My 20 MB input file now is a more usable and legible 3.7 MB PDF with decent OCR to boot. Neat. A completely JPEG-based file I tried went from 46.8 MB to 2.6 MB. Now it’s time to automate the workflow with some shell scripting.
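
To put a number on savings like these for your own files, a few lines of shell will do. A sketch only; size_percent is a hypothetical helper name.

```bash
# Print the size of the second file as a percentage of the first.
# size_percent is a hypothetical helper name; integer arithmetic, so the
# result is rounded down.
size_percent() (
	before=$(wc -c < "$1") || exit 1
	after=$(wc -c < "$2") || exit 1
	[ "$before" -gt 0 ] || exit 1
	echo "$(( after * 100 / before ))%"
)

# Usage: size_percent original.pdf out.pdf
```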

ReadablePDF, the script

Using the following script you can automate the entire workflow described above, although I’d always recommend double-checking Scan Tailor’s automated results. The better the input, the better the machine output, but even so there might just be one misdetected page hiding out. The script could still use a few refinements here and there, so I put it up on GitHub. Feel free to fork and whatnot. I licensed it under the GNU General Public License version 3.

#!/bin/bash
# readablepdf
# ReadablePDF streamlines the effort of turning a not so great PDF into
# a more easily readable PDF (or of course a pretty decent PDF into an
# even better one). This script depends on poppler-utils, imagemagick,
# scantailor, tesseract-ocr, jbig2enc, and pdfbeads.
# Unfortunately only the first four are available in the Debian repositories.
# sudo apt install poppler-utils imagemagick scantailor tesseract-ocr
# For more background information and how to install jbig2enc and pdfbeads,
# see //
# GNU Aspell and GNU Parallel are recommended but not required.
# sudo apt install aspell parallel
# Aspell dictionaries tend to be called things like aspell-en, aspell-nl, etc.

BASENAME=${@%.pdf} # or: basename "$@" .pdf

# It would seem that at some point in its internal processing, pdfbeads has issues with spaces.
# Let's strip them and perhaps some other special characters so as still to provide
# meaningful working directory and file names.
BASENAME_SAFE=$(echo "${BASENAME}" | tr ' ' '_') # Replace all spaces with underscores.
#BASENAME_SAFE=$(echo "${BASENAME_SAFE}" | tr -cd 'A-Za-z0-9_-') # Strip other potentially harmful chars just in case?

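SCRIPTNAME=$(basename "$0" .sh)
# Assumed definitions: the working directory lives next to the input PDF and
# the OCR languages match the English and Dutch used in the spellcheck below.
TMP_DIR="${SCRIPTNAME}-${BASENAME_SAFE}"
TESSERACT_PARAMS="-l eng+nld"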

# If project file exists, change directory and assume everything's in order.
# Else do the preprocessing and initiation of a new project.
if [ -f "${TMP_DIR}/${BASENAME_SAFE}.ScanTailor" ]; then
	echo "File ${TMP_DIR}/${BASENAME_SAFE}.ScanTailor exists."
	cd "${TMP_DIR}"
else
	echo "File ${TMP_DIR}/${BASENAME_SAFE}.ScanTailor does not exist."
	# Let's get started.
	mkdir "${TMP_DIR}"
	cd "${TMP_DIR}"
	# Only output PNG to prevent any potential further quality loss.
	pdfimages -png "../${BASENAME}.pdf" "${BASENAME_SAFE}"
	# This is basically what happens in as well
	# get the x-dpi; no logic for different X and Y DPI or different DPI within PDF file
	# y-dpi would be pdfimages -list out.pdf | sed -n 3p | awk '{print $14}'
	DPI=$(pdfimages -list "../${BASENAME}.pdf" | sed -n 3p | awk '{print $13}')
	# TODO Skip all this based on a rotation command-line flag!
	# Adapted from
	# Scan Tailor says it can't automatically figure out the rotation.
	# I'm not a programmer, but I think I can do well enough by (ab)using OCR. :)
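	# Assumed definitions, inferred from how the names are used below: a scratch
	# directory, the first extracted page as the sample, and one copy per
	# rotation, named after its angle so that basename yields the angle later.
	TMP="/tmp/${SCRIPTNAME}-$$"
	mkdir "${TMP}"
	file="${BASENAME_SAFE}-000.png"
	north_file="${TMP}/0"
	east_file="${TMP}/90"
	south_file="${TMP}/180"
	west_file="${TMP}/270"
	north_text="${north_file}.txt"
	east_text="${east_file}.txt"
	south_text="${south_file}.txt"
	west_text="${west_file}.txt"
	wc_table="${TMP}/wc_table"
	misspelled_words_table="${TMP}/misspelled_words_table"

	# Make copies in all four orientations (the src file is 0; copy it to make
	# things less confusing)
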
	cp "$file" "$north_file"
	convert -rotate 90 "$file" "$east_file"
	convert -rotate 180 "$file" "$south_file"
	convert -rotate 270 "$file" "$west_file"

	# OCR each (just append ".txt" to the path/name of the image)

	# tesseract appends .txt automatically
	tesseract "$north_file" "$north_file"
	tesseract "$east_file" "$east_file"
	tesseract "$south_file" "$south_file"
	tesseract "$west_file" "$west_file"

	# Get the word count for each txt file (least 'words' == least whitespace junk
	# resulting from vertical lines of text that should be horizontal.)
	echo "$(wc -w ${north_text}) ${north_file}" > $wc_table
	echo "$(wc -w ${east_text}) ${east_file}" >> $wc_table
	echo "$(wc -w ${south_text}) ${south_file}" >> $wc_table
	echo "$(wc -w ${west_text}) ${west_file}" >> $wc_table

	# Spellcheck. The lowest number of misspelled words is most likely the 
	# correct orientation.
	while read record; do
		txt=$(echo "$record" | awk '{ print $2 }')
		# This is harder to automate away, pretend we only deal with English and Dutch for now.
		misspelled_word_count=$(< "${txt}" aspell -l en list | aspell -l nl list | wc -w)
		echo "$misspelled_word_count $record" >> $misspelled_words_table
	done < $wc_table

	# Do the sort, overwrite the input file, save out the text
	winner=$(sort -n $misspelled_words_table | head -1)
	rotated_file=$(echo "${winner}" | awk '{ print $4 }')
	rotation=$(basename "${rotated_file}")
	echo "Rotating ${rotation} degrees"

	# Clean up.
	if [ -d ${TMP} ]; then
		rm -r ${TMP}
	# TODO end skip
	if [[ ${rotation} -ne 0 ]]; then
		mogrify -rotate "${rotation}" "${BASENAME_SAFE}-*.png"
	# consider --color-mode=mixed --despeckle=cautious
	scantailor-cli --dpi="${DPI}" --margins=5 --output-project="${BASENAME_SAFE}.ScanTailor" ./*.png ./

while true; do
	read -p "Please ensure automated detection proceeded correctly by opening the project file ${TMP_DIR}/${BASENAME_SAFE}.ScanTailor in Scan Tailor. Enter [Y] to continue now and [N] to abort. If you restart the script, it'll continue from this point unless you delete the directory ${TMP_DIR}. " yn
	case $yn in
		[Yy]* ) break;;
		[Nn]* ) exit;;
		* ) echo "Please answer yes or no.";;
	esac
done

# Use GNU Parallel to speed things up if it exists.
if command -v parallel >/dev/null; then
	parallel tesseract {} {.} ${TESSERACT_PARAMS} hocr ::: *.tif
else
	# Strip the .tif extension so the output base name matches what {.} gives above.
	for i in ./*.tif; do tesseract "$i" "$(basename "$i" .tif)" ${TESSERACT_PARAMS} hocr; done
fi

# pdfbeads doesn't play nice with filenames with spaces. There's nothing we can do
# about that here, but that's why ${BASENAME_SAFE} is generated up at the beginning.
# Also pdfbeads ./*.tif > "${BASENAME_SAFE}.pdf" doesn't work,
# so you're in trouble if your PDF's name starts with "-".
# See
pdfbeads *.tif > "${BASENAME_SAFE}.pdf"

mv "${BASENAME_SAFE}.pdf" ../"${BASENAME}-readable.pdf"
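
The orientation check in the script boils down to one idea: OCR the page at each rotation, count the misspelled words, and keep the rotation with the lowest count. Stripped of the file handling, the selection step looks roughly like this. It's a sketch with made-up counts, and best_rotation is a hypothetical name.

```bash
# Read lines of the form "<misspelled word count> <rotation in degrees>" on
# stdin and print the rotation with the lowest count.
# best_rotation is a hypothetical name; the script does the same thing with
# sort, head and awk on its temporary table.
best_rotation() {
	sort -n | head -n 1 | awk '{ print $2 }'
}

# Made-up counts for illustration; prints 90.
printf '%s\n' '41 0' '7 90' '38 180' '12 270' | best_rotation
```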


If you’re not interested in the space savings of JBIG2 because the goal of ease of use and better legibility has been achieved (and you’d be quite right; digging further is just something I like to do), after tiff2pdf you could still consider tossing in pdfsandwich. You might as well, for the extra effort only consists of installing an extra package. Instead, OCRmyPDF might also work, or perhaps even plain Tesseract 3.03 and up. pdfsandwich just takes writing the wrapper out of your hands. But again, this part is just a nice bonus.

pdfsandwich -lang nld+eng filename.pdf -o filename-ocr.pdf

The resulting file doesn’t strike my fancy after playing with the tools mentioned above, but hey, it takes less time to set up and it works.


DjVu is probably a superior alternative to PDF, so it ought to be worth investigating. This link might help.

A very useful application, found in the Debian repositories to boot, is djvubind. It works very similarly to ReadablePDF, but produces DjVu files instead. For sharing these may be less ideal, but for personal use they seem to be even smaller (something that could probably be affected by the choices for dictionary size) while displaying even faster.

Other Matters of Potential Interest

Note that I’m explicitly not interested in archiving a book digitally or some such. That is, I want to obtain a digital copy of a document or book that avoids combining the worst of both digital and paper into one document, but I’m not interested in anything beyond that unless it can be done automatically. Moreover, attempting to replicate original margins would actually make the digital files less usable. For digital archiving you’ll obviously have to redo that not-so-great 200 DPI scan and do a fair bit more to boot. It looks like Spreads is a great way to automate the kind of workflow desired in that case. This link dump might offer some further inspiration.


My goal has been achieved. Creating significantly improved PDFs shouldn’t take more than a minute or two of my time from now on, depending a bit on the quality of the input document. Enjoy.

Only Literary Discourse?

[I]n matters of race, silence and evasion have historically ruled literary discourse. […] The situation is aggravated by the tremor that breaks into discourse on race. It is further complicated by the fact that ignoring race is understood to be a graceful, even generous, liberal gesture.

Toni Morrison, Playing in the dark: whiteness and the literary imagination. 1992. Harvard University: Cambridge. p. 14.


Preparing a PDF in Sections for Binding

For PostScript, Debian has a nice collection of tools in the psutils package, including psbook and psnup. But since I do most stuff in PDF, I figured I’d skip a step and look for something similar for PDFs: PDFjam is just the thing.

In Debian Squeeze you have to install the pdfjam package separately, but in newer versions of Debian and Ubuntu it comes as part of the texlive-extra-utils package.

By default it turns the whole file into one big booklet. If you want multiple sections for binding, you’ll have to disable that behavior. The --signature option allows you to specify a multiple of four for the size of the sections.

pdfbook --booklet false --signature 16 your-file.pdf
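
To get a feel for what a given signature size means for binding, the arithmetic is simple: every folded sheet carries four pages, and the number of sections is the page count divided by the signature size, rounded up. A small sketch; signature_count is a hypothetical helper name.

```bash
# Number of sections (signatures) needed for a document, rounding up.
# A signature of N pages uses N / 4 sheets folded in half.
# signature_count is a hypothetical helper name.
signature_count() (
	pages=$1
	signature=${2:-16}
	if [ $(( signature % 4 )) -ne 0 ]; then
		echo "signature size must be a multiple of four" >&2
		exit 1
	fi
	echo $(( (pages + signature - 1) / signature ))
)

signature_count 100 16   # prints 7
```

So a 100-page file at 16 pages per signature ends up as seven sections of four sheets each, the last one partly blank.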


Shapes for sounds (cowhouse): not perfect, but very good looking

I picked this book up on a whim at the Boekenfestijn for relatively little. It turned out to be a decent find.

First of all, this book looks rather nice, occasionally even stunning. It presents a lot of information in an easily accessible, visual manner. I like how the right-side lines of the text are jagged rather than the omnipresent justified, and I quickly grew fond of the phoneme head that shows how we articulate sounds. It’s a pity that this feature wasn’t extended to include a few more phonemes of the English language in one of the many appendixes.

Page 17 has some strange things going on regarding phonetics: w and y are initially incorrectly listed as fricatives, but a few lines down also correctly as approximants (also known as glides)—assuming we’re actually talking about /w/ and /j/. This section on phonetics is at the very least lacking in clarity, even if my copy of An Introduction to Language could’ve benefited from some of its typographical prowess.

In the next paragraph, h is listed as a letter that takes its name from placing a short vowel sound, usually e, before it. However, /eɪtʃ/ does not fit that bill. Aitch doesn’t even contain /h/. It was actually mentioned as “aitch” earlier in the text and listed not much later alongside “h, j, k, q, w, y” as late inclusions to the language. Since the author is a typographer by trade and the true focus of the book was the visual charts, I hope similar small mistakes didn’t sneak into those parts of the book, because I don’t have enough prior knowledge to tell. There are also numerous comma splices throughout the text. Once again, this distracts from the overall very polished feel of the book.

Appendix №6 shows the evolution of writing very neatly, but unfortunately the interrobang (‽) seems to have accidentally been turned into a regular question mark (?). I know, I’m picking nits, but it was specifically mentioning and showing the interrobang after all.

Finally, the book has a bibliography that can aid you if you want to know more. Always a good thing.

Don’t let my nitpicking give you the wrong impression: I quite thoroughly enjoyed this gorgeous, fun, informative book.

PS For some color illustrations of the charts and appendixes, see the brain pickings review.

This review was cross-posted on LibraryThing.


On My Header Image

In what is probably the biggest visual change since I first created this theme back in ’05 — yes, it’s that old! — on June 2, 2011 I replaced the header image with a picture I took a month prior in the Keukenhof.

The opportunity presented itself to experiment slightly with decent JPEG compression, rather than simply depending on GIMP’s output, which unfortunately is virtually guaranteed to be suboptimal. Since all I did was crop and resize, I used PNG as my working format. I might’ve been able to use jpegcrop and jpegtran, but since I was going to re-encode in a lossy manner afterward that would have been nothing but needless extra effort.

First I tried cjpeg, which doesn’t support a lot of input filetypes, so I had to save a copy as BMP.

cjpeg -quality 80 -optimize -progressive -dct float -outfile test80.jpg head.bmp

Then I discovered that ImageMagick can do the exact same thing, optimized by default and everything. It also uses libjpeg under the hood, so the resulting image is exactly the same.

convert -quality 80 -interlace plane head.png test80.jpg

That results in JPEGs that are about as small as they can get without enabling options that might not be readily supported by all viewers. I wrote a (very) simple shell script to aid with a quick overview of size versus quality.

# File name of the input image without its extension.
filename=${1%.*}
convert -quality 30 -interlace plane $1 ${filename}30.jpg
convert -quality 40 -interlace plane $1 ${filename}40.jpg
convert -quality 50 -interlace plane $1 ${filename}50.jpg
convert -quality 60 -interlace plane $1 ${filename}60.jpg
convert -quality 70 -interlace plane $1 ${filename}70.jpg
convert -quality 80 -interlace plane $1 ${filename}80.jpg

My rationale is that any quality under 30 is most likely too ugly and anything over 80 will result in a file size that’s too large for my intended purpose of using lower quality — but not low quality — images on the Internet.
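
The same overview can be had with a loop, which also makes it easy to list the resulting file sizes in one go. A sketch under the same assumptions as the script above; quality_series is a made-up name.

```bash
# Encode one input image at a range of JPEG qualities and list the results.
# quality_series is a hypothetical name; the quality steps are 30 through 80.
quality_series() (
	input=$1
	base=${input%.*}   # input file name without its extension
	for quality in 30 40 50 60 70 80; do
		convert -quality "$quality" -interlace plane "$input" "${base}${quality}.jpg" || exit 1
	done
	# Show the sizes side by side.
	ls -l "${base}"[0-9][0-9].jpg
)

# Usage: quality_series head.png
```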

I also decided it was time to get rid of my half-hearted concessions to Internet Explorer. This in no way inhibits readability of the content.

Lossless Rotation with jhead and jpegtran

I like my pictures rotated in such a way that I don’t have to depend on application support for them to be displayed correctly. jpegtran (pre-installed on most distros) is a wonderful application with many features, including lossless rotation, but it’s too laborious for my purposes. That’s where jhead comes in.

You can simply go into a directory, run a command like the following, and everything will be done automatically for you.

jhead -autorot *.JPG

Of course you shouldn’t run it without a backup available. I always keep the pictures around on my camera until I’ve confirmed that all processing was successful, and even then I don’t delete them until the adjusted files have also been copied to my external HDD in my semi-regular backup regime.
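
If the camera has already been wiped, a throwaway copy right before rotating is the next best thing. A minimal sketch; the function name and the backup folder name are made up.

```bash
# Copy the JPEGs in a directory to a backup folder, then rotate the originals.
# autorot_with_backup and the "backup-before-autorot" folder name are
# hypothetical; -autorot is the jhead option used above.
autorot_with_backup() (
	cd "$1" || exit 1
	mkdir -p backup-before-autorot || exit 1
	# -p preserves timestamps; stop before touching anything if the copy fails.
	cp -p ./*.JPG backup-before-autorot/ || exit 1
	jhead -autorot ./*.JPG
)

# Usage: autorot_with_backup /path/to/pictures
```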

Another utility that can perform the same task is exiftran, but despite being more or less dedicated to this very purpose it’s not even easier to use: I’d expect exiftran *.JPG to default to the equivalent of the jhead -autorot *.JPG command I posted above, but instead you have to use exiftran -ai *.JPG. All other things being equal for my purposes, I decided to go with jhead because it has many more features — although last year I decided that exiv2 is superior to jhead in ease of use for most of those features.

If you’re just looking for the occasional lossless rotation, you could also try the Geeqie image viewer and manager. It integrates calls to exiftran, but beware that you explicitly have to choose the lossless option, as there are also lossy rotate options.


Pancake Visions

Some of you may be aware that I often imagine things in random shapes that other people have trouble envisioning, sometimes even after I draw them out. On June 3rd my wife and I baked tiny pancakes, and here’s what I saw in two of them.

An evil cat in a pancake.
The first pancake that managed to attract my attention was an evil cat.
A face in a pancake.
This pancake also happened to be on the plate while taking a picture of the evil cat one, so I figured I’d demonstrate that I do indeed see something in just about anything.

Note, these are animated SVG images. At the time of writing they only render correctly in Opera and Webkit browsers, whereas Gecko displays a static image. Internet Explorer is served with fallback PNGs.

The SVGs now also render correctly in Firefox 4.

Replaced OBJECT elements with PICTURE elements.


