OCR Text in PDF with Tesseract
I had some scanned PDFs that I wanted to turn into plain text, so I looked into OCR solutions for Linux. As it turns out, there are some pretty good options. I decided to go with Tesseract; you’ll need to install one or more language packs along with it (English and Dutch in my case). Unfortunately it can’t read PDFs directly and, as far as I could tell, only handles TIF files as input, so I needed a simple shell script to automatically convert PDFs to TIFs. This is what you’ll need to install:
aptitude install tesseract-ocr tesseract-ocr-eng tesseract-ocr-nld imagemagick
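To confirm that the install put everything on your PATH, a quick check along these lines can help. This is only a minimal sketch: `missing_tools` is a hypothetical helper name of my own, and it assumes the packages above provide the `tesseract` and `convert` commands.

```shell
#!/bin/bash
# missing_tools: print the name of every given command that is not on the PATH.
# Hypothetical helper for illustration; not part of Tesseract or ImageMagick.
missing_tools() {
    local tool
    for tool in "$@"; do
        # command -v succeeds only if the shell can find the command.
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
        fi
    done
}

# Example: missing_tools tesseract convert
```

If `missing_tools tesseract convert` prints nothing, both commands were found.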
You might notice ImageMagick in that install command, which is just useful to have. Heck, even if you’re not interested in OCR you should install it right now and read the manual. In any case, it’s used in the shell script I wrote to assist my OCR-ing. I picked up a script from the Ubuntu Forums, but for some reason it was wasting CPU cycles and disk space on useless conversions to an intermediate format, even though ImageMagick can convert PDF straight to TIF.
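#!/bin/bash
# ocrpdftotext
# Simplified implementation of http://ubuntuforums.org/showthread.php?t=880471
# Might consider doing something with getopts here, see http://wiki.bash-hackers.org/howto/getopts_tutorial
# Usage: ocrpdftotext file.pdf
DPI=300
TESS_LANG=nld
# Name of the input file without its directory, and again without the .pdf extension.
INPUT=$(basename "$1")
FILENAME=$(basename "$1" .pdf)
SCRIPT_NAME=$(basename "$0" .sh)
TMP_DIR="${SCRIPT_NAME}-tmp"
OUTPUT_FILENAME="${FILENAME}-output@DPI${DPI}"
# Work on a copy in a temporary directory. Bail out if any step fails,
# so the "rm" below can never run in the wrong directory.
mkdir "${TMP_DIR}" || exit 1
cp "$1" "${TMP_DIR}" || exit 1
cd "${TMP_DIR}" || exit 1
# Render the PDF straight to an 8-bit TIF at the chosen resolution.
convert -density "${DPI}" -depth 8 "${INPUT}" "${FILENAME}.tif"
# Tesseract appends .txt to the output base name.
tesseract "${FILENAME}.tif" "${OUTPUT_FILENAME}" -l "${TESS_LANG}"
mv "${OUTPUT_FILENAME}.txt" ..
# Clean up the temporary files and the directory itself.
rm -- *
cd ..
rmdir "${TMP_DIR}"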
This may not suit your needs, but I think as a starting point it’s a step up from what the Ubuntu forums gave me.
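The script’s comments mention getopts as a possible next step. Here is a minimal sketch of what that could look like, assuming `-d` sets the DPI and `-l` sets the Tesseract language. The option letters and the `parse_args` helper are hypothetical choices for illustration, not part of the original script.

```shell
#!/bin/bash
# Defaults, matching the values hard-coded in the script.
DPI=300
TESS_LANG=nld
INPUT=

# parse_args: hypothetical helper that reads -d DPI and -l LANG,
# then stores the first remaining argument as the input file.
parse_args() {
    local opt
    OPTIND=1  # reset so the function can be called more than once
    while getopts "d:l:" opt; do
        case "$opt" in
            d) DPI=$OPTARG ;;
            l) TESS_LANG=$OPTARG ;;
            *) echo "Usage: ocrpdftotext [-d dpi] [-l lang] file.pdf" >&2
               return 1 ;;
        esac
    done
    # Drop the parsed options; what is left should be the PDF.
    shift $((OPTIND - 1))
    INPUT=${1:-}
}
```

Calling `parse_args "$@"` at the top of the script would then let you run something like `ocrpdftotext -d 150 -l eng scan.pdf` without editing the file.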