## Image Optimization Guide

On the forum I administer, I have to run a tight attachment policy; disk space doesn’t grow on trees. Occasionally this leads to questions about the small attachment size limit of 50 KiB. This guide is intended to show that this is not nearly as tiny as you might think. Note that although I’ll mention commands without much explanation for the sake of brevity, I recommend further exploring the possibilities offered by those commands with the `--help` flag, as well as by running `man the-command-here`.

First you need to ask yourself what kind of file type is appropriate, if you have the choice. In screenshots, the main purpose of attachments on my forum, you’ll often encounter large areas of uniform background color. PNG is therefore almost invariably the right choice. Crop out everything but what’s relevant. JPEG is appropriate for more dynamic pictures such as photographs. If you want to do a lot with photographs, you might want to consider an external hosting service. My wife likes SmugMug. Still, you can fit a fair bit more into a few hundred KiB of thumbnail than you might think. Finally, SVG vector graphics always look sharp, no matter the size. You’ll typically have drawn these in a program like Inkscape or Adobe Illustrator.

### 1. Optimizing JPEG

Often you’ll want to crop your file. Do not edit your JPEG and resave it, because every lossy resave reduces quality! You can crop losslessly with cropgui. On Windows you can use IrfanView.
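As far as I know, cropgui is essentially a front-end for jpegtran’s own lossless crop, which you can also drive from the command line. A minimal sketch; the geometry values are purely illustrative:

```shell
# Hypothetical example: cut an 800x600 window starting 16px from the left, 32px from the top.
# jpegtran silently snaps offsets down to the nearest JPEG block boundary (8 or 16 px),
# so the result may be shifted slightly from what you ask for.
W=800 H=600 X=16 Y=32
jpegtran -crop "${W}x${H}+${X}+${Y}" -copy all in.jpg > in-cropped.jpg
```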

If you don’t want to crop, and also potentially for some post-cropgui optimization, use `jpegtran -copy none -progressive -optimize file.jpg > file-opt.jpg`. Note that `-copy none` will get rid of all metadata, which may be undesirable. If so, use `jpegtran -copy all -progressive -optimize file.jpg > file-opt.jpg` instead.
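If you have a directory full of JPEGs, the same command is easy to loop; a quick sketch:

```shell
# Write an optimized, metadata-stripped copy of every JPEG as *-opt.jpg,
# leaving the originals untouched.
for f in *.jpg; do
  jpegtran -copy none -progressive -optimize "$f" > "${f%.jpg}-opt.jpg"
done
```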

Of course if you want to scale down your JPEG there’s no point in mucking about with lossless cropping first. After scaling down, check how low your quality can go (also see a little helper script I wrote). In any case, you should avoid introducing any unnecessary compression steps with associated quality loss. Here are some results:

* The original 11.jpg at 2.19 MB.
* Losslessly cropped 11-crop.jpg at 1.11 MB.
* Optimized with `-copy all -progressive -optimize`: 11-crop-opt.jpg at 1.04 MB. `-copy none` would’ve saved an extra whopping 40-some KiB, which on this kind of file size has little benefit, and besides, I quite like the metadata. For thumbnail-sized files the balance is likely to be different. For example, the 52.2 KiB SmugMug auto-generated thumbnail below can be insignificantly reduced to 51.1 KiB with `-copy all`, but to 48.2 KiB with `-copy none`. I think an 8% reduction is not too shabby, plus it brings the file size down under the arbitrary 50 KiB limit on my forum.
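Going back to the quality search mentioned above: I won’t reproduce my helper script here, but the idea can be sketched with ImageMagick’s convert (the quality steps and filenames are arbitrary):

```shell
# Re-encode input.jpg at descending quality levels, then compare sizes
# and pick the lowest quality that still looks acceptable.
for q in 90 80 70 60 50; do
  convert input.jpg -quality "$q" "input-q${q}.jpg"
done
du -h --apparent-size input-q*.jpg
```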

### 2. Optimizing PNG

As I wrote in the introduction, for screenshots PNG is typically the right choice. If you want to keep your PNG lossless, use `optipng -o7`. In my experience the result is ever so slightly smaller than with other solutions like pngcrush, but as long as you use a PNG optimizer it shouldn’t much matter which one you fancy. Also see this comparison.
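A quick sketch for batch use, reporting the savings (optipng overwrites in place by default; GNU du is assumed for the `-b` byte counts):

```shell
# Optimize every PNG in the current directory and print before/after sizes.
for f in *.png; do
  before=$(du -b "$f" | cut -f1)
  optipng -quiet -o7 "$f"
  after=$(du -b "$f" | cut -f1)
  echo "$f: ${before} -> ${after} bytes"
done
```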

If you don’t care about potentially losing some color accuracy, use pngquant instead. To top it off, if you really want to squeeze out your PNG, you can pass quality settings with --quality min-max, meaning you can pass --quality 30-50 or just --quality 10. Here are some quick results for the screenshot in the SVG section below, but be sure to check out the pngquant website for some impressive examples.


```
$ du -h --apparent-size inkscape-plain-svg.png
27K	inkscape-plain-svg.png
$ du -h --apparent-size inkscape-plain-svg-fs8\ default.png
7.6K	inkscape-plain-svg-fs8 default.png
$ du -h --apparent-size inkscape-plain-svg-fs8\ quality\ 10.png
4.3K	inkscape-plain-svg-fs8 quality 10.png
```

In this case there is no visual distinction between the original PNG and the default pngquant settings. The quality 10 result is only almost imperceptibly worse, and even then only if you look closely, so I didn’t bother to include a sample.

### 3. Optimizing SVG

For using SVG on the web, I imagine I don’t have to tell you that in Inkscape, you should save your file as Plain SVG.

Save as Plain SVG in Inkscape.

What you may not know is that just like there are lossy PNGs, you can also create what amounts to lossy SVGs. There are some command-line tools to optimize SVGs, including (partially thanks to this SO answer):

* Scour is probably the best command-line tool for some quick optimization. You can just use the defaults, like `scour < in.svg > out.svg` or `scour -i in.svg -o out.svg`, but I recommend you go further.
* SVGO (SVG Optimizer)
* SVG-optimiser (by Peter Collingridge)
* SVG-editor (by Peter Collingridge)

My personal preference for squeezing out every last byte goes to the web-based version of the SVG-editor by Peter Collingridge. By running it in a browser with inferior SVG support such as Firefox, you’ll be sure that your optimized SVG still works properly afterward. The command-line tools can only safely be used for basic optimizations, whereas the effects of going lossy (such as lowering precision) can only be fully appreciated graphically.

### Addendum A: Scanned Documents

Scanned documents are a different item altogether. The best format for private use is DjVu, but for public sharing PDF is probably preferable. To achieve the best results, you should scan your documents to TIFF or PNG, followed by processing with unpaper or ScanTailor. If you’ve already got a PDF you’d like to improve, you can use pdfsandwich or my own readablepdf.
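To make that concrete, here is a hypothetical sketch of the extract-and-clean step. The filenames are illustrative, and I’m assuming unpaper’s netpbm input requirement, hence the conversion step:

```shell
# Pull the scanned pages out of an existing PDF as PNG...
pdfimages -png scanned.pdf page
# ...then convert each page to PPM and let unpaper clean it up.
for f in page-*.png; do
  convert "$f" "${f%.png}.ppm"
  unpaper "${f%.png}.ppm" "${f%.png}-clean.ppm"
done
```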
### Addendum B: Video

I’m not aware of any lossless optimization for video comparable to what jpegtran offers for JPEG, but you can often losslessly cut video. In the general-purpose Avidemux, simply make sure both video and audio are set to copy. There is also a dedicated cross-platform app for lossless trimming of videos called, unsurprisingly, LosslessCut.

If you do want to introduce loss for a smaller file size, you can use the very same Avidemux with different settings, ffmpeg, mpv, VLC, and so forth. You can get reasonable quality that’ll play in many places with something like:

```
ffmpeg -i input-file.ext -c:v libx264 -crf 19 -preset slow -c:a libfaac -b:a 192k -ac 2 output-file.mp4
```

For the open WebM format, you can use something along these lines:

```
ffmpeg -i input.mp4 -c:v libvpx -b:v 1M -c:a libvorbis output.webm
```

More examples on the ffmpeg wiki. Note that in many cases you should just copy the audio using `-acodec copy`, but of course that’s not always an option. Extra compression artifacts in audio detract significantly more from the experience than low-quality video.

## UNetbootin Custom Drive Selection

UNetbootin has been broken for many, many years, but just today (a few years after the fact) I discovered that the previous GUI option to show all drives was readded as a command-line option. So if the program doesn’t want to detect your drive, just use the `targetdrive` argument:

```
unetbootin targetdrive=/dev/sdf1
```

And voila, it’s working. I have no idea why it should have to be so difficult. The program categorically refuses to detect any of my USB flash drives or hard drives, so since the removal of “show all drives” it’s been utterly useless.

PS This is basically only for Windows ISOs. For everything else you can just use, e.g., dd. Much easier.

## Switching to FreshRSS

QuiteRSS is a terrific piece of software. It only has one flaw, which is that it only runs on my desktop. Unfortunately this has led to me increasingly getting behind on the things I like to read.
Sometimes this is fine, like when I can read a book instead, but other times it’s mildly frustrating. It would seem that none of the online feed readers, whether self-hosted or SaaS, support the paradigm I’m used to. They’re all following the “golden standard” of the nightmarish, thankfully-it’s-gone Google Reader. Basically I use feeds like emails. Most I delete after reading. Those I want to keep for reference I keep around, marked read. But not so with these feed readers. Feeds you want to keep for later reading should preferably be favorited, bookmarked, or maybe saved to a system like Wallabag. This has advantages too, of course. By centralizing your to-read list in one location, like Wallabag or Pocket, you don’t have the problem of remembering what’s where, or of having loads of unread open tabs in various browsers.

Long story short, after sampling a whole bunch of feed readers I opted for FreshRSS. It suffers from the omnipresent “no pages” disease. Got a feed with a thousand items? (Yes, they exist.) You can go to the start or the end by sorting in ascending or descending order, but reading things somewhere down the middle? Forget it.

These minor inconveniences are worth it, however. This way I can easily read my feeds from any computer anywhere in the world. The feeds are always updated, provided you set up a cron job. I don’t have to start up my computer or risk missing anything if I’m on vacation for a few days. I can quickly check them on my cellphone during an otherwise wasted moment. Overall I’m happy. Goodbye, QuiteRSS. You were a good friend after Opera died, but it’s time to move on.

PS Here are some feed-related links that should go along nicely with any feed reader.

* Feed Creator allows you to create feeds for webpages that are missing them.
* So does RSS-Bridge, but since it’s self-hosted it fits perfectly next to FreshRSS in the kluit spirit.
* Tubes is a tool I wrote a few years back that can filter and fix up feeds.
Useful if a website happens to have a feed, but not on a per-category basis or some such. Or of course because you might want to subscribe to an hourly news podcast, but only get the news once a day.

## Cloud, Kluit, Clod?

Just a quick demonstration of the power of openclipart.org. I dubbed my “personal cloud” experiment kluit: a Dutch word meaning both clod and the ball of earth around the roots of a tree. In other words, kluit is firmly grounded because you’ve got your own ground with you wherever you go. Be like Dracula.

With a name in mind, I also wanted a matching logo. Following a quick search for leaves, root (or was it tree), and after a little initial play with something like attraction, this is the quick and satisfying result. And of course the remix is free for all. Enjoy.

## SSH publickey denied?

I was suddenly having trouble connecting to GitHub after pulling in an OpenSSH update to version 7. Chances are that means the problem is security-related, which makes it worthwhile to take the time to investigate the cause.

```
$ git pull
Permission denied (publickey).
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.
```

A little debugging showed the following:

```
$ ssh -vT git@github.com
OpenSSH_7.1p2 Debian-2, OpenSSL 1.0.2f 28 Jan 2016
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 19: Applying options for *
debug1: Connecting to github.com [192.30.252.130] port 22.
debug1: Connection established.
[…]
debug1: Skipping ssh-dss key /home/frans/.ssh/id_dsa for not in PubkeyAcceptedKeyTypes
[…]
debug1: No more authentication methods to try.
Permission denied (publickey).
```

Of course I could quickly fix the problem by adding `PubkeyAcceptedKeyTypes ssh-dss` to ~/.ssh/config, but checking OpenSSH.com tells me that “OpenSSH 7.0 and greater similarly disables the ssh-dss (DSA) public key algorithm. It too is weak and we recommend against its use.” So, although I could obviously re-enable it easily, I guess I’ll have to generate a new key. I hope GitHub’s guide is accurate for generating something sufficiently secure, because I’m kind of ticked off that something I generated in 2013 is already considered “legacy.” I hope I’m to blame and not an earlier version of GitHub’s guide.

Incidentally, to change the passphrase one would use the -p option, e.g.:

```
ssh-keygen -f id_rsa -p
```

## LuaLaTeX Font Hassles

The TeX Gyre Pagella font I was using turned out not to contain Cyrillic characters. Unfortunately, fontspec doesn’t seem to have an easy means of setting a fallback font (I checked the manual, I swear!), so I found a lookalike font named Palladio Uralic and used it instead. Before you can use a newly installed font, you have to run `luaotfload-tool --update`.

```
% So is Palladio. Used as fallback.
% Thanks to http://tex.stackexchange.com/a/37251/32003
\newfontfamily\palladio{Palladio Uralic}
\DeclareTextFontCommand{\textpalladio}{\palladio}
```

## LaTeX: combining added margins with hanging indents

Since I’m using KOMA, the obvious method would seem to be:

```
\begin{addmargin}[1cm]{0cm}
Yada.
\end{addmargin}
```

Unfortunately, that doesn’t seem to combine with the hanging environment.
So I did it a little more manually, which will probably have someone shaking their head while I’m stuck feeling pretty clever about it:

```
\parindent=1cm\hangindent=2cm
Yada.
```

## Fixing Up Scanned PDFs with Scan Tailor

Scanned PDFs come my way quite often, and I don’t infrequently wish they were nicer to use on digital devices. One semi-solution might be running off to the library and rescanning them personally, but there is a middle road between doing nothing and doing too much: digital manipulation. The knight in shining armor is called Scan Tailor.

Note that this post is not about merely cropping away some black edges. When you’re just looking for a tool to cut off some unwanted edges, I’d recommend PDF Scissors instead. If you just want to fix some incorrect rotation once and for all, try the pdftools found in texlive-extra-utils, which give you simple shorthands like pdf90, pdf180 and pdf270. This post is about splitting up double scanned pages, increasing clarity, and adding an OCR layer on top. With that out of the way, if you’re impatient, you can skip to the script I wrote to automate the process.

### Coaxing Scan Tailor

Unfortunately Scan Tailor doesn’t directly load scanned PDFs, which is what copiers seem to produce by default and what you’re most likely to receive from other people. Luckily this is easy to work around. If you want to use the program on documents you scan yourself, selecting e.g. TIFF in the output options could be a better choice.

To extract the images from PDF files, I use pdfimages. I believe it tends to come preinstalled, but if not, grab poppler-utils with `sudo apt install poppler-utils`.

```
pdfimages -png filename.pdf outputname
```

You might want to take a look at man pdfimages. The -j flag makes sure JPEG files are output as is rather than being converted to something else, for instance, while the -tiff option would convert the output to TIFF. Like PNG, that is lossless.
What might also be of interest are -ccitt and -all, but in this case I’d want the images as JPEG, PNG or TIFF because that’s what Scan Tailor takes as input.

At this point you could consider cropping your images to aid processing with Scan Tailor, but I’m not entirely sure how to automate that out of the way. Perhaps unpaper with a few flags could work to remove (some) black edges, but functionally speaking Scan Tailor is pretty much unpaper with a better (G)UI. In any case, this could be investigated.

You’ll want to use pdfimages once more to obtain the DPI of your images for use with Scan Tailor, unless you like to calculate the DPI yourself using the document dimensions and the number of pixels. Both PNG and TIFF support this information, but unfortunately pdfimages doesn’t write it.

```
$ pdfimages -list filename.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    1664  2339  gray    1   1  ccitt  no         4  0   200   200  110K  23%
   2     1 image    1664  2339  gray    1   1  ccitt  no         9  0   200   200  131K  28%
```

Clearly our PDF was scanned at a somewhat disappointing 200 DPI. Now you can start Scan Tailor, create a new project based on the images you just extracted, enter the correct DPI, and just follow the very intuitive steps. For more guidance, read the manual. If you want a setting to apply to all pages, take care to do so explicitly, because by default the program quite sensibly applies it to just the one page. Alternatively you could run scantailor-cli to automate the process, which could reduce your precious time spent to practically zero. I prefer to take a minute or two to make sure everything’s alright; I’m sure I’ll make up the difference by not having to scroll left and right and whatnot afterwards.
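If you’d rather not read the DPI off by eye, the x-ppi column can be scraped straight out of pdfimages -list; this is the same trick my automation script further down uses. A sketch, assuming the two-line header shown above (so the first image row is line 3) and x-ppi in field 13:

```shell
# Grab the x-ppi of the first image in the PDF.
DPI=$(pdfimages -list filename.pdf | sed -n 3p | awk '{print $13}')
echo "Scanned at ${DPI} DPI"
```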

By default Scan Tailor wants to output to 600 DPI, but with my 200 DPI input file that just seemed odd. Apparently it has something to do with the conversion to pure black and white, which necessitates a higher DPI to preserve some information. That being said, 600 DPI seems almost laughably high for 200 DPI input. Perhaps “merely” twice the input DPI would be sufficient. Either way, be sure to use mixed mode on pages with images.

Scan Tailor’s output is not a PDF yet. It’ll require a little bit of post-processing.

### Simple Post-Processing

The way I usually go about trying to find new commands already installed on my computer is simply by typing the relevant phrase, in this case tiff. Press Tab for autocomplete. If that fails, you could try apt search tiff, although I prefer a GUI like Synaptic for that. The next stop is a search engine, where you can usually find results faster by focusing on the Arch, Debian or Ubuntu Wikis. On the other hand, blog and forum posts often contain useful information.
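The Tab-completion trick can also be done non-interactively: in bash, the compgen builtin lists every command (including builtins) matching a prefix. A small sketch:

```shell
# List all available commands starting with "tiff", the same set Tab would offer.
bash -c "compgen -c tiff | sort -u"
```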

```
$ tiff
tiff2bw     tiff2ps     tiffcmp     tiffcrop    tiffdump    tiffmedian  tiffsplit
tiff2pdf    tiff2rgba   tiffcp      tiffdither  tiffinfo    tiffset     tifftopnm
```

tiff2pdf sounds just like what we need. Unfortunately it only processes one file at a time. That’s easy to fix with a simple shell script, but rtfm (man tiff2pdf) for useful info: “If you have multiple TIFF files to convert into one PDF file then use tiffcp or other program to concatenate the files into a multiple page TIFF file.”

```
tiffcp *.tif out.tif
```

You could easily stop there, but for sharing or use on devices a PDF (or DjVu) file is superior. My phone doesn’t even come with a TIFF viewer by default, and the one in Dropbox (why does that app open almost all documents by itself anyway?) just treats it as a collection of images, which is significantly less convenient than your average document viewer. Meanwhile, apps like the appropriately named Document Viewer deal well with both PDF and DjVu.

```
tiff2pdf -o bla.pdf out.tif
```

Wikipedia suggests the CCITT compression used for black and white text is lossless, which is nice. Interestingly, a 1.8 MB low-quality 200 DPI PDF more than doubled in size with this treatment, but a 20 MB 400 DPI document was reduced to 13 MB. Anyway, for most purposes you could consider compressing it with JBIG2, for instance using jbig2enc. Another option might be to ignore such PDF difficulties and use pdf2djvu, or to compile a DjVu document directly from the TIFF files. At this point we’re tentatively done.

### Harder but Neater Post-Processing

After I’d already written most of this section, I came across this Spanish page that pretty much explains it all. So it goes. Because of that page I decided to add a little note about checkinstall, a program I’ve been using for years but apparently always failed to mention.

You’re going to need jbig2enc. You can grab the latest source or an official release.
But first let’s get some generic stuff required for compilation:

```
sudo apt install build-essential automake autotools-dev libtool
```

And the jbig2enc-specific dependencies:

```
sudo apt install libleptonica-dev libjpeg8-dev libpng12-dev libpng-dev libtiff5-dev zlib1g-dev
```

In the jbig2enc-master directory, compile however you like. I tend to do something along these lines:

```
./autogen.sh
mkdir build
cd build
../configure
make
```

Now you can `sudo make install` to install, but you’ll have to keep the source directory around if you want to run `sudo make uninstall` later. Instead you can use checkinstall (`sudo apt install checkinstall`, you know the drill). Be careful with this stuff, though.

```
sudo checkinstall make install
```

You have to enter a name such as jbig2enc, a proper version number (e.g. 0.28-0 instead of 0.28), and that’s about it. That wasn’t too hard.

At this point you could produce significantly smaller PDFs using jbig2enc itself (some more background information):

```
jbig2 -b outputbasename -p -s whatever-*.tif
pdf.py outputbasename > output.pdf
```

However, it doesn’t deal with mixed images as well as tiff2pdf does. And while we’re at it, we might just as well set up our environment for some OCR goodness. Mind you, the idea here is just to add a little extra value with no extra time spent after the initial setup. I have absolutely no intention of doing any kind of proofreading or some such on this stuff. The simple fact is that the Scan Tailor treatment drastically improves the chances of OCR success, so it’d be crazy not to do it.

There’s a tool called pdfbeads that can automatically pull it all together, but it needs a little setup first. You need to install ruby-rmagick, ruby-hpricot if you want to do stuff with OCRed text (which is kind of the point), and ruby-dev.

```
sudo apt install ruby-rmagick ruby-hpricot ruby-dev
```

Then you can install pdfbeads:

```
sudo gem install pdfbeads
```

Apparently there are some issues with iconv or something?
Whatever it is, I have no interest in Ruby at the moment and the problem can be fixed with a simple `sudo gem install iconv`. If iconv is added to the pdfbeads dependencies, or if pdfbeads switches to whatever method Ruby would prefer, this shouldn’t be an issue in the future.

At this point we’re ready for the OCR. `sudo apt install tesseract-ocr` and whatever languages you might want, such as tesseract-ocr-nld. The -l switch is irrelevant if you just want English, which is the default.

```
parallel tesseract -l eng+nld {} {.} hocr ::: *.tif
```

GNU Parallel speeds this up by automatically running as many tesseracts as you’ve got CPU cores. Install it with `sudo apt install parallel` if you don’t have it, obviously. I’m pretty patient about however much time this stuff might take as long as it proceeds by itself without requiring any attention, but on my main computer this makes everything proceed almost four times as quickly. Why wait any longer than you have to?

The OCR results are actually of extremely high quality: it has some issues with italics and that’s pretty much it. It’s not an issue with the characters; it just doesn’t seem to detect the spaces between words. But what do I care: other than that minor detail it’s close to perfect, and this wasn’t even part of the original plan. It’s a very nice bonus.

Once that’s done, we can produce our final result:

```
pdfbeads *.tif > out.pdf
```

My 20 MB input file is now a more usable and legible 3.7 MB PDF with decent OCR to boot. Neat. A completely JPEG-based file I tried went from 46.8 MB to 2.6 MB. Now it’s time to automate the workflow with some shell scripting.

### ReadablePDF, the script

Using the following script you can automate the entire workflow described above, although I’d always recommend double-checking Scan Tailor’s automated results. The better the input, the better the machine output, but even so there might just be one misdetected page hiding out.
The script could still use a few refinements here and there, so I put it up on GitHub. Feel free to fork and whatnot. I licensed it under the GNU General Public License version 3.

#!/bin/bash
# readablepdf
# ReadablePDF streamlines the effort of turning a not so great PDF into
# a more easily readable PDF (or of course a pretty decent PDF into an
# even better one). This script depends on poppler-utils, imagemagick,
# scantailor, tesseract-ocr, jbig2enc, and pdfbeads.
#
# Unfortunately only the first four are available in the Debian repositories.
# sudo apt install poppler-utils imagemagick scantailor tesseract-ocr
#
# For more background information and how to install jbig2enc and pdfbeads,
# see http://fransdejonge.com/2014/10/fixing-up-scanned-pdfs-with-scan-tailor#harder-post-processing
#
# GNU Aspell and GNU Parallel are recommended but not required.
# sudo apt install aspell parallel
#
# Aspell dictionaries tend to be called things like aspell-en, aspell-nl, etc.

BASENAME=${@%.pdf} # or basename "$@" .pdf

# It would seem that at some point in its internal processing, pdfbeads has issues with spaces.
# Let's strip them and perhaps some other special characters so as still to provide
# meaningful working directory and file names.
BASENAME_SAFE=$(echo "${BASENAME}" | tr ' ' '_') # Replace all spaces with underscores.
#BASENAME_SAFE=$(echo "${BASENAME_SAFE}" | tr -cd 'A-Za-z0-9_-') # Strip other potentially harmful chars just in case?

SCRIPTNAME=$(basename "$0" .sh)
TMP_DIR=${SCRIPTNAME}-${BASENAME_SAFE}

TESSERACT_PARAMS="-l eng+nld"

# If project file exists, change directory and assume everything's in order.
# Else do the preprocessing and initiation of a new project.
if [ -f "${TMP_DIR}/${BASENAME_SAFE}.ScanTailor" ]; then
echo "File ${TMP_DIR}/${BASENAME_SAFE}.ScanTailor exists."
cd "${TMP_DIR}"
else
echo "File ${TMP_DIR}/${BASENAME_SAFE}.ScanTailor does not exist."
# Let's get started.
mkdir "${TMP_DIR}"
cd "${TMP_DIR}"

# Only output PNG to prevent any potential further quality loss.
pdfimages -png "../${BASENAME}.pdf" "${BASENAME_SAFE}"

# This is basically what happens in https://github.com/virantha/pypdfocr as well
# get the x-dpi; no logic for different X and Y DPI or different DPI within PDF file
# y-dpi would be pdfimages -list out.pdf | sed -n 3p | awk '{print $14}'
DPI=$(pdfimages -list "../${BASENAME}.pdf" | sed -n 3p | awk '{print $13}')

#<<'end_long_comment'
# TODO Skip all this based on a rotation command-line flag!
# Adapted from http://stackoverflow.com/a/9778277
# Scan Tailor says it can't automatically figure out the rotation.
# I'm not a programmer, but I think I can do well enough by (ab)using OCR. :)
file="${BASENAME_SAFE}-000.png"

TMP="/tmp/rotation-calc"
mkdir ${TMP}

# Make copies in all four orientations (the src file is 0; copy it to make
# things less confusing)
north_file="${TMP}/0"
east_file="${TMP}/90"
south_file="${TMP}/180"
west_file="${TMP}/270"

cp "$file" "$north_file"
convert -rotate 90 "$file" "$east_file"
convert -rotate 180 "$file" "$south_file"
convert -rotate 270 "$file" "$west_file"

# OCR each (just append ".txt" to the path/name of the image)
north_text="$north_file.txt"
east_text="$east_file.txt"
south_text="$south_file.txt"
west_text="$west_file.txt"

# tesseract appends .txt automatically
tesseract "$north_file" "$north_file"
tesseract "$east_file" "$east_file"
tesseract "$south_file" "$south_file"
tesseract "$west_file" "$west_file"

# Get the word count for each txt file (least 'words' == least whitespace junk
# resulting from vertical lines of text that should be horizontal.)
wc_table="$TMP/wc_table"
echo "$(wc -w ${north_text}) ${north_file}" > $wc_table
echo "$(wc -w ${east_text}) ${east_file}" >> $wc_table
echo "$(wc -w ${south_text}) ${south_file}" >> $wc_table
echo "$(wc -w ${west_text}) ${west_file}" >> $wc_table

# Spellcheck. The lowest number of misspelled words is most likely the
# correct orientation.
misspelled_words_table="$TMP/misspelled_words_table"
while read record; do
txt=$(echo "$record" | awk '{ print $2 }')
# This is harder to automate away, pretend we only deal with English and Dutch for now.
misspelled_word_count=$(< "${txt}" aspell -l en list | aspell -l nl list | wc -w)
echo "$misspelled_word_count $record" >> $misspelled_words_table
done < $wc_table

# Do the sort, overwrite the input file, save out the text
winner=$(sort -n $misspelled_words_table | head -1)
rotated_file=$(echo "${winner}" | awk '{ print $4 }')
rotation=$(basename "${rotated_file}")
echo "Rotating ${rotation} degrees"

# Clean up.
if [ -d ${TMP} ]; then
rm -r ${TMP}
fi
# TODO end skip

if [[ ${rotation} -ne 0 ]]; then
mogrify -rotate "${rotation}" "${BASENAME_SAFE}"-*.png
fi
#end_long_comment

# consider --color-mode=mixed --despeckle=cautious
scantailor-cli --dpi="${DPI}" --margins=5 --output-project="${BASENAME_SAFE}.ScanTailor" ./*.png ./
fi

while true; do
read -p "Please ensure automated detection proceeded correctly by opening the project file ${TMP_DIR}/${BASENAME_SAFE}.ScanTailor in Scan Tailor. Enter [Y] to continue now and [N] to abort. If you restart the script, it'll continue from this point unless you delete the directory ${TMP_DIR}. " yn
case $yn in
[Yy]* ) break;;
[Nn]* ) exit;;
* ) echo "Please answer yes or no.";;
esac
done

# Use GNU Parallel to speed things up if it exists.
if command -v parallel >/dev/null; then
parallel tesseract {} {.} ${TESSERACT_PARAMS} hocr ::: *.tif
else
for i in ./*.tif; do tesseract "$i" "$(basename "$i" .tif)" ${TESSERACT_PARAMS} hocr; done
fi

# pdfbeads doesn't play nice with filenames with spaces. There's nothing we can do
# about that here, but that's why ${BASENAME_SAFE} is generated up at the beginning.
#
# Also pdfbeads ./*.tif > "${BASENAME_SAFE}.pdf" doesn't work,
# so you're in trouble if your PDF's name starts with "-".
# See http://www.dwheeler.com/essays/filenames-in-shell.html#prefixglobs
pdfbeads *.tif > "${BASENAME_SAFE}.pdf"

#OUTPUT_BASENAME=${BASENAME}-output@DPI${DPI}
mv "${BASENAME_SAFE}.pdf" ../"${BASENAME}-readable.pdf"

### Alternatives

If you’re not interested in the space savings of JBIG2 because the goal of ease of use and better legibility has been achieved (and you’d be quite right; digging further is just something I like to do), after tiff2pdf you could still consider tossing in pdfsandwich. You might as well, for the extra effort only consists of installing an extra package. Instead, OCRmyPDF might also work, or perhaps even plain Tesseract 3.03 and up; pdfsandwich just takes writing the wrapper out of your hands. But again, this part is just a nice bonus.

```
pdfsandwich -lang nld+eng filename.pdf -o filename-ocr.pdf
```

The resulting file doesn’t strike my fancy after playing with the tools mentioned above, but hey, it takes less time to set up and it works.

#### DjVu

DjVu is probably a superior alternative to PDF, so it ought to be worth investigating. This link might help. A very useful application, found in the Debian repositories to boot, is djvubind. It works very similarly to ReadablePDF, but produces DjVu files instead. For sharing these may be less ideal, but for personal use they seem to be even smaller (something that could probably be affected by the choices for dictionary size) while displaying even faster.

### Other Matters of Potential Interest

Note that I’m explicitly not interested in archiving a book digitally or some such. That is, I want to obtain a digital copy of a document or book that avoids combining the worst of both digital and paper into one document, but I’m not interested in anything beyond that unless it can be done automatically. Moreover, attempting to replicate original margins would actually make the digital files less usable. For digital archiving you’ll obviously have to redo that not-so-great 200 DPI scan and do a fair bit more to boot.
It looks like Spreads is a great way to automate the kind of workflow desired in that case. This link dump might offer some further inspiration.

### Conclusion

My goal has been achieved. Creating significantly improved PDFs shouldn’t take more than a minute or two of my time from now on, depending a bit on the quality of the input document. Enjoy.

## Pandoc Markdown Over Straight LaTeX

I familiarized myself with LaTeX because I like HTML better than word processors. In fact, I disprefer word processors. LibreOffice Writer can do a fairly decent job of WYSIWYM (What You See Is What You Mean), but in many ways I like it less than HTML. So why don’t I just use HTML, you ask? Quite simply, HTML isn’t necessarily the best option for print. Prince does a great job generating printable PDFs, but even though writing straight HTML is easy enough and adds many benefits, I mostly only prefer it over your run-of-the-mill text editing software. Besides, I wanted to profit from BibTeX reference management, which tends to come along with LaTeX.

Clearly then, LaTeX has some nice features. Unfortunately, it shares many of HTML’s flaws and adds some of its own: \emph{} is at best marginally easier to type than <em></em>, but I find it somewhat harder to read. Besides which, converting LaTeX to other formats like HTML can be a pain.

On the good side, LaTeX and HTML also share many features. Both depend on plain-text files, which is great because you can open them on any system, and because you can use versioning software. Binary blobs and compressed zip files are also more prone to data loss in case of damage. The great thing about versioning software isn’t necessarily that you can go back to a former version, but the knowledge that you can go back. Normally I’m always busy commenting out text or putting it at the bottom, but when it’s versioned I feel much freer about just deleting it.
Maybe I’ll put some of it back in later, but it lets the machine take the work off of my hands. I know, Writer, Word, et cetera can do this too, but did I mention I prefer plain text anyway?

Where LaTeX really shines is its reference management, math support without having to use incomprehensible gibberish like MathML or some odd equation editor, and its typographical prowess. On top of the shared features with HTML, those features are why I looked into LaTeX in the first place. So how can I get those features without being bothered by the downsides of HTML and LaTeX? As it turns out, the answer is Pandoc’s variant of Markdown.

In practice, I rarely need more than what Pandoc’s Markdown can give me. It’s HTML-focused, which I like because I know HTML, but you can insert math (La)TeX-style between $ characters. It also comes with its own citation reference system, which it changes to BibLaTeX citations upon conversion to LaTeX. As these things go, I wasn’t the first with this idea.
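For illustration, here is a sketch of what such a paragraph of Pandoc Markdown might look like. The citation key `doe99` and the page number are invented examples; the key would need to exist in the accompanying bibliography file:

```markdown
The identity $e^{i\pi} + 1 = 0$ is written as inline TeX math, and a
citation like [@doe99, p. 15] comes out as something along the lines of
\autocite[p.~15]{doe99} when converting to LaTeX with --biblatex.
```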

Of course it won’t do to repeat myself on the command line constantly, so I wrote a little conversion helper script:

#!/bin/bash
#generate-pdf.sh

BASENAME=your-text-file-without-extension
# I compiled an updated version of Pandoc locally.
PANDOC_LOCAL=~/.cabal/bin/pandoc

if [ -x "$PANDOC_LOCAL" ]; then
    PANDOC=$PANDOC_LOCAL
else
    PANDOC=pandoc
fi

# Output to HTML5.
$PANDOC $BASENAME.md \
--to=html5 \
--mathml \
--self-contained \
--smart \
--csl modern-language-association-with-url.csl \
--bibliography $BASENAME-bibliography.bib \
-o $BASENAME.html

# Output to $BASENAME-body.tex
# $BASENAME.tex has this file as input
$PANDOC $BASENAME.md \
--smart \
--biblatex \
--bibliography $BASENAME-bibliography.bib \
-o $BASENAME-body.tex

# Pandoc likes to output p.~ or pp.~ in its \autocite, but I just want the numbers.
sed -i 's/\\autocite\[p.~/\\autocite\[/g' $BASENAME-body.tex
sed -i 's/\\autocite\[pp.~/\\autocite\[/g' $BASENAME-body.tex
# It would probably suffice to just do this but I don't want any nasty surprises:
#sed -i 's/p.~//g' $BASENAME-body.tex
#sed -i 's/pp.~//g' $BASENAME-body.tex

# If ever bored, consider adding something to change \autocite[1-2] into \autocite[1--2]

# Generate the PDF.
lualatex $BASENAME
biber $BASENAME
lualatex $BASENAME
lualatex $BASENAME

# Remove these files after the work is done.
rm \
$BASENAME.aux \
$BASENAME.bbl \
$BASENAME.blg \
$BASENAME.bcf \
$BASENAME.run.xml \
$BASENAME.toc
#$BASENAME-body.tex

Something that may not be immediately obvious from the script is that I’ve also got a $BASENAME.tex file. This contains all of my relevant settings, but instead of the main content it contains \input{basename-body.tex}. There are some prerequisites for working with Pandoc-generated LaTeX, for instance:

%for pandoc table output (needs ctable for 1.9; longtable for 1.10)
\usepackage{longtable}

I haven’t yet made up my mind on what to do about splitting up chapters in different files, but it hasn’t bothered me yet.
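As a quick sanity check, the sed cleanup from the script above can be tried on a sample line. The citation key `smith12` is just an invented example:

```shell
# Feed a sample \autocite through the same substitution used in generate-pdf.sh.
printf '%s\n' '\autocite[p.~12]{smith12}' | sed 's/\\autocite\[p.~/\\autocite\[/g'
# prints: \autocite[12]{smith12}
```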

There you have it. That’s my way of keeping things simple while still profiting from LaTeX typesetting.

## Mounting Remote Filesystems With sshfs

This is a condensed and edited version of the Ubuntu Blog guide regarding how to mount a remote ssh filesystem using sshfs, based on my personal experience.

Before you can use sshfs, you’ll need an SSH server. This is useful for all kinds of things, but that’s not important here. To set up an SSH server in Ubuntu, all you need to do is sudo apt-get install openssh-server. Setting it up in Cygwin (like I did to access my Windows box, and to tunnel VNC through it) is a bit trickier, but there are decent tutorials out there. Once that’s taken care of, you can set up sshfs.

sudo apt-get install sshfs
sudo mkdir /media/dir-name
sudo chown $(whoami) /media/dir-name
sudo adduser $(whoami) fuse


Log out and log back in again so that you’re a proper part of the group.

Mount using sshfs [user@]host.ext:/remote-dir /media/dir-name; unmount using fusermount -u /media/dir-name.

It all worked perfectly for me, but if not, there’s apparently a solution.

If you get the following error:

You will have to load the fuse module by doing:

sudo modprobe fuse

You can add fuse to the modules that are loaded on startup by editing the file /etc/modules and adding a line with only the word “fuse” in it, at the end.
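The result would be an /etc/modules along these lines. The header comment shown here is the stock Debian/Ubuntu one; your file may contain other module names already, and the only change is the fuse line at the end:

```
# /etc/modules: kernel modules to load at boot time.
#
# This file contains the names of kernel modules that should be loaded
# at boot time, one per line.
fuse
```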

and then issue the sshfs command above again.

If you’re on Windows, don’t panic. Dokan SSHFS will perform the same task.

It should be noted that this is even easier within KDE applications, where you can simply use fish://your-server.com, but sshfs cooperates better with the rest of my system. Trying the same with Dolphin in KDE on Windows results in a KIOslave going crazy using all the CPU it can, however.

Aside from easy editing of files directly on my Windows box, this finally enabled me to stream videos from my Windows box, although right now only lower quality ones since it’s also connected through WLAN. With Samba things just weren’t working out, and the same applied to FTP (though it was better for file transfers than Samba, I have to say). Admittedly, this still actually uses SFTP under the hood, but it just works better. Besides, it will also be more secure to use remotely thanks to SSH.
