The One with the Thoughts of Frans

Strip The Junk Out of Word or Writer Files with Writer2LaTeX and Pandoc

While playing around with LaTeX I noticed that Writer2LaTeX's clean article and ultra-clean article options are quite useful. It occurred to me that, aside from converting anything Writer can open to LaTeX, they could also be used to strip the cruft from those very same files: just convert the resulting LaTeX back to ODT afterward with Pandoc.

pandoc -o output.odt input.tex

Since Writer2LaTeX was a command-line utility before it became a LibreOffice extension, the conversion to LaTeX can be automated as well. After installing the writer2latex package:

w2l -latex -ultraclean input.odt

Unfortunately w2l on its own won’t open everything LibreOffice can, such as Word files. If you want to stick to the command line, you can have LibreOffice convert the file to ODT first:

loffice --headless --convert-to odt input.docx
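
Since the whole chain depends on three external commands, a quick check up front can save some confusion. This is only a sketch: missing_tools is a made-up helper name, and the command names are simply the ones used above (on some systems LibreOffice is started as soffice rather than loffice).

```shell
# Print the name of every given command that cannot be found on the PATH.
# missing_tools is a hypothetical helper, not part of any of the tools.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example: missing_tools w2l pandoc loffice
```

If the example prints nothing, all three commands were found.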

I stuck it all in a shell script:

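#!/bin/bash
# clean-writer.sh: strip the cruft from a document via Writer2LaTeX and Pandoc
# usage: clean-writer.sh input.docx (writes input-clean.odt to the current directory)
# http://stackoverflow.com/questions/965053/extract-filename-and-extension-in-bash/965072#965072
set -e # stop at the first failing step instead of carrying on with missing files

INPUT=$1
BASENAME=$(basename "$INPUT") # the input may be given with a path
EXTENSION=${BASENAME##*.}
FILENAME=${BASENAME%.*}

SCRIPT_NAME=$(basename "$0" .sh)
TMP_DIR=${SCRIPT_NAME}-tmp
OUTPUT_FILENAME=${FILENAME}-clean.odt

# work on a copy in a temporary directory so the original stays untouched
mkdir "$TMP_DIR"
cp "$INPUT" "$TMP_DIR"
cd "$TMP_DIR"

# convert to ODT if the file is DOC or something
if [ "$EXTENSION" != "odt" ]
then
	loffice --headless --convert-to odt "$BASENAME"
fi

w2l -latex -ultraclean "${FILENAME}.odt"
pandoc -o "$OUTPUT_FILENAME" "${FILENAME}.tex"

# keep the result, throw away the intermediate files
mv "$OUTPUT_FILENAME" ..
cd ..
rm -r "$TMP_DIR"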
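
The two parameter expansions near the top of the script are easy to misread, so here is a small sketch of what they do. split_name is a hypothetical helper used only for illustration; it also strips any directory part first, on the assumption that the input may be given with a path.

```shell
# Hypothetical helper: print the base name and the extension of a path,
# one per line, using the same kind of expansions as the script.
split_name() {
    name=${1##*/}                 # drop leading directories: docs/report.docx -> report.docx
    printf '%s\n' "${name%.*}"    # shortest match of ".*" removed from the end -> report
    printf '%s\n' "${name##*.}"   # longest match of "*." removed from the start -> docx
}

# Caveat: a name without a dot, such as README, comes back unchanged on both lines.
```

For a file like archive.tar.gz this treats only gz as the extension, which is fine for documents but worth knowing.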

And thus you end up with a much more manageable Writer file. Just be careful: not everything survives the trip through Pandoc intact. Tables, for example, aren’t handled quite properly.
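
One way to find out ahead of time whether a document is likely to be affected is to look for table environments in the intermediate LaTeX file. This is a rough sketch: has_tables is a made-up name, and the list of environments is my assumption about what Writer2LaTeX may emit, not something taken from its documentation.

```shell
# Hypothetical helper: succeed if the given .tex file appears to contain a table.
# The environment names are an assumption; adjust them to what your .tex files contain.
has_tables() {
    grep -Eq '\\begin\{(tabular|tabulary|longtable|supertabular)\}' "$1"
}

# Example: has_tables input.tex && echo "check the tables in the result by hand"
```

A match only means a table is present, not that Pandoc will mangle it, so treat it as a hint to inspect the result.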
