OCR Text in PDF with Tesseract

April 2, 2012 at 0:13 · Filed under Linux

I had some scanned PDFs that I wanted to turn into plain text, so I looked into OCR solutions for Linux. As it turns out, there are some pretty good options. I decided to go with Tesseract; you’ll need to install one or more language packs along with it (the command below pulls in English and Dutch). Unfortunately it can’t read PDFs directly: it only handles TIF files as input (newer releases accept more image formats, but still not PDF), so I needed a simple shell script to automatically convert PDFs to TIFs. This is what you’ll need to install:

aptitude install tesseract-ocr tesseract-ocr-eng tesseract-ocr-nld imagemagick

You might notice ImageMagick in there, which is just useful to have. Heck, even if you’re not interested in OCR you should install it right now and read the manual. In any case, it’s used in the shell script I wrote to assist my OCR-ing. I picked up a script from the Ubuntu Forums, but for some reason it was wasting CPU cycles and disk space with useless conversions to an intermediary format: ImageMagick can convert PDF straight to TIF.
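Since the OCR workflow depends on both `tesseract` and ImageMagick's `convert`, it can help to confirm that they are actually on your PATH before going further. The following is only a minimal sketch; `check_tools` is a made-up helper name for illustration, not something shipped by either package.

```shell
#!/bin/bash
# check_tools is a hypothetical helper: it reports every command in its
# argument list that cannot be found on PATH and returns non-zero if any
# of them is missing.
check_tools() {
    local missing=0
    local tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example use at the top of a script:
#   check_tools tesseract convert || exit 1
```

On Tesseract 3.02 and later, running `tesseract --list-langs` should additionally print the installed language packs; older versions may not have that option.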

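#!/bin/bash
# ocrpdftotext
# Simplified implementation of http://ubuntuforums.org/showthread.php?t=880471
# Usage: ocrpdftotext file.pdf

# Might consider doing something with getopts here, see http://wiki.bash-hackers.org/howto/getopts_tutorial
DPI=300
TESS_LANG=nld

# Only a single PDF is handled; everything is quoted so names with spaces work.
INPUT="$1"
INPUT_NAME=$(basename "${INPUT}")
FILENAME="${INPUT_NAME%.pdf}"
SCRIPT_NAME=$(basename "$0" .sh)
TMP_DIR="${SCRIPT_NAME}-tmp"
OUTPUT_FILENAME="${FILENAME}-output@DPI${DPI}"

# Work in a scratch directory. Stop if it cannot be created or entered,
# so the rm further down never runs in the wrong place.
mkdir "${TMP_DIR}" || exit 1
cp "${INPUT}" "${TMP_DIR}" || exit 1
cd "${TMP_DIR}" || exit 1

# ImageMagick converts the PDF straight to TIF; Tesseract appends .txt to the output name.
convert -density "${DPI}" -depth 8 "${INPUT_NAME}" "${FILENAME}.tif"
tesseract "${FILENAME}.tif" "${OUTPUT_FILENAME}" -l "${TESS_LANG}"

# Keep the text file, then clean up the scratch directory.
mv "${OUTPUT_FILENAME}.txt" ..
rm -- *
cd ..
rmdir "${TMP_DIR}"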

This may not suit your needs, but as a starting point I think it’s a step up from what the Ubuntu Forums gave me.
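The script's own comment suggests doing something with getopts, so here is a rough sketch of what that could look like. The option letters (`-d` for DPI, `-l` for the Tesseract language), the `parse_args` function name and the `INPUT` variable are my assumptions for illustration; the defaults are the ones used in the script.

```shell
#!/bin/bash
# Sketch: let -d <dpi> and -l <lang> override the script's defaults.
# parse_args, INPUT and the option letters are hypothetical choices.
DPI=300
TESS_LANG=nld

parse_args() {
    local opt
    OPTIND=1    # reset so the function can be called more than once
    while getopts "d:l:" opt; do
        case "$opt" in
            d) DPI="$OPTARG" ;;
            l) TESS_LANG="$OPTARG" ;;
            *) echo "Usage: ocrpdftotext [-d dpi] [-l lang] file.pdf" >&2
               return 1 ;;
        esac
    done
    shift $((OPTIND - 1))
    INPUT="$1"    # first remaining argument: the PDF to process
}

# In the script this would be called as:
#   parse_args "$@" || exit 1
```

With something like this in place, `ocrpdftotext -d 600 -l eng scan.pdf` would OCR at 600 DPI using the English language pack, while a plain `ocrpdftotext scan.pdf` would keep the 300 DPI and Dutch defaults.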



The One with the Thoughts of Frans
