The One with the Thoughts of Frans

OCR Text in PDF with Tesseract

I had some scanned PDFs that I wanted to turn into plain text, so I looked into OCR solutions for Linux. As it turns out, there are some pretty good options. I decided to go with Tesseract, which needs one or more language packs installed along with it. Unfortunately it only handles TIF files as input, so I needed a simple shell script to automatically convert PDFs to TIFs. This is what you’ll need to install:

aptitude install tesseract-ocr tesseract-ocr-eng tesseract-ocr-nld imagemagick

You might notice ImageMagick in there, which is just useful to have. Heck, even if you’re not interested in OCR you should install it right now and read the manual. In any case, it’s used in the shell script I wrote to assist my OCR-ing. I picked up a script from the Ubuntu Forums, but for some reason it wasted CPU cycles and disk space on needless conversions to an intermediate format, even though ImageMagick can convert PDF straight to TIF.

# Simplified implementation of

# Might consider doing something with getopts here, see

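SCRIPT_NAME=$(basename "$0" .sh)

# The definitions below are assumptions; adjust them to your own needs.
INPUT="$1"                          # PDF to OCR (a single file is assumed)
PDF_NAME=$(basename "${INPUT}")     # file name without its directory
FILENAME="${PDF_NAME%.*}"           # file name without its extension
OUTPUT_FILENAME="${FILENAME}"       # tesseract appends .txt itself
TMP_DIR="${SCRIPT_NAME}-tmp-$$"     # scratch directory, removed at the end
DPI=300                             # resolution used to render the PDF
TESS_LANG=eng                       # an installed language pack, e.g. eng or nld

# Work on a copy in a scratch directory; stop if it cannot be entered,
# so that the rm below never runs anywhere else.
mkdir "${TMP_DIR}" || exit 1
cp "${INPUT}" "${TMP_DIR}"
cd "${TMP_DIR}" || exit 1

# Render the PDF straight to an 8-bit TIF, then OCR it.
convert -density "${DPI}" -depth 8 "${PDF_NAME}" "${FILENAME}.tif"
tesseract "${FILENAME}.tif" "${OUTPUT_FILENAME}" -l "${TESS_LANG}"

# Keep only the text output and clean up.
mv "${OUTPUT_FILENAME}.txt" ..
rm -- *
cd ..
rmdir "${TMP_DIR}"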

This may not suit your needs, but as a starting point I think it’s a step up from what the Ubuntu Forums gave me.
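
The script's own comment mentions possibly doing something with getopts. Here is a minimal sketch of what that could look like, for anyone who wants to set the resolution and language from the command line. The option letters (-d, -l, -h), the default values and the names parse_args, usage and INPUT are all my assumptions for illustration; none of them come from the script above.

```shell
#!/bin/sh
# Hypothetical option parsing for the OCR script. The option letters and
# the default values are assumptions, not part of the original script.

DPI=300          # assumed default for convert -density
TESS_LANG=eng    # assumed default Tesseract language pack
INPUT=""         # the PDF file to process

usage() {
    echo "Usage: $0 [-d dpi] [-l lang] file.pdf" >&2
}

parse_args() {
    OPTIND=1    # reset so the function can be called more than once
    while getopts "d:l:h" opt; do
        case "$opt" in
            d) DPI="$OPTARG" ;;
            l) TESS_LANG="$OPTARG" ;;
            h) usage; return 2 ;;
            *) usage; return 1 ;;
        esac
    done
    shift $((OPTIND - 1))

    # Exactly one input file is expected after the options.
    if [ $# -ne 1 ]; then
        usage
        return 1
    fi
    INPUT="$1"
}

# Example: parse_args -d 150 -l nld scan.pdf
# leaves DPI=150, TESS_LANG=nld and INPUT=scan.pdf.
```

In a real script you would call parse_args "$@" || exit 1 near the top and then use DPI, TESS_LANG and INPUT in place of hard-coded values. Keeping the parsing in a function makes it easy to try out on its own before wiring it into the conversion steps.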


