Strip The Junk Out of Word or Writer Files with Writer2LaTeX and Pandoc
While playing around with LaTeX stuff I noticed that Writer2LaTeX's clean article and ultra-clean article options were quite useful. It occurred to me that aside from converting anything Writer can open to LaTeX, they could also be used to strip the cruft from those very same files. Just convert the LaTeX back to ODT afterward with Pandoc:
pandoc -o output.odt input.tex
Since Writer2LaTeX was a command-line utility before it became a LibreOffice extension, we can automate it a bit more like this after installing the writer2latex package:
w2l -latex -ultraclean input.odt
Unfortunately w2l won’t open everything LibreOffice can; it wants an ODT file as input. If you want to stick to the command line, you can convert other formats to ODT first like this:
loffice --headless --convert-to odt input.docx
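If several files need converting, the same command can be wrapped in a small helper. This is only a rough sketch: find_office and to_odt are hypothetical names, and it assumes the LibreOffice launcher is called loffice, libreoffice, or soffice depending on the distribution. The --outdir option keeps the converted files out of the source directory.

```shell
# Print the name of the first LibreOffice launcher found on PATH.
# Assumption: the launcher is called one of these three names.
find_office() {
    local candidate
    for candidate in loffice libreoffice soffice; do
        if command -v "$candidate" >/dev/null 2>&1; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Convert the given documents to ODT inside the directory named first.
to_odt() {
    local outdir="$1"
    shift
    local office
    office=$(find_office) || { echo "LibreOffice not found" >&2; return 1; }
    mkdir -p "$outdir"
    "$office" --headless --convert-to odt --outdir "$outdir" "$@"
}
```

Something like to_odt converted report.docx notes.doc would then leave the ODT files in a converted directory.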
I stuck it all in a shell script:
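#!/bin/bash
# clean-writer.sh: strip the formatting cruft from a Word or Writer file.
# Usage: clean-writer.sh input.docx   (writes input-clean.odt to the current directory)
# Filename splitting from http://stackoverflow.com/questions/965053/extract-filename-and-extension-in-bash/965072#965072
set -e                                  # stop at the first failing step

INPUT=$1
BASENAME=$(basename "$INPUT")           # drop any directory part
EXTENSION=${BASENAME##*.}
FILENAME=${BASENAME%.*}
SCRIPT_NAME=$(basename "$0" .sh)
TMP_DIR=${SCRIPT_NAME}-tmp
OUTPUT_FILENAME=${FILENAME}-clean.odt

# work on a copy inside a temporary directory
mkdir "$TMP_DIR"
cp "$INPUT" "$TMP_DIR"
cd "$TMP_DIR"

# convert to ODT if the file is DOC or something
if [ "$EXTENSION" != "odt" ]
then
    loffice --headless --convert-to odt "$BASENAME"
fi

w2l -latex -ultraclean "${FILENAME}.odt"
pandoc -o "$OUTPUT_FILENAME" "${FILENAME}.tex"
mv "$OUTPUT_FILENAME" ..

# remove the temporary directory and everything in it
cd ..
rm -r "$TMP_DIR"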
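The two parameter expansions the script uses to split the filename can be tried on their own. A minimal sketch follows; ext_of and stem_of are hypothetical helper names that are not part of the script, and they additionally strip any leading directories first.

```shell
# Print the extension: everything after the last dot of the file name.
ext_of() {
    local name="${1##*/}"           # drop leading directories
    printf '%s\n' "${name##*.}"     # remove the longest prefix ending in a dot
}

# Print the stem: the file name without its last extension.
stem_of() {
    local name="${1##*/}"           # drop leading directories
    printf '%s\n' "${name%.*}"      # remove the shortest suffix starting at a dot
}
```

Note that a name without any dot comes back unchanged from both helpers, so a file called README would be treated as having the extension README.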
And thus you end up with a much more manageable Writer file. Just be careful: not everything survives the round trip. Tables, for example, aren’t handled quite properly by Pandoc.
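Since tables are the known weak spot, it may help to check the intermediate LaTeX file for them before trusting the result. The sketch below is an assumption-laden guess, not a complete check: has_tables is a hypothetical helper, and it assumes Writer2LaTeX writes tables as tabular, tabulary, longtable, or supertabular environments, which depends on its configuration.

```shell
# Succeed if the given LaTeX file appears to contain a table.
# Assumption: tables show up as one of these four environments.
has_tables() {
    grep -Eq '\\begin\{(tabular|tabulary|longtable|supertabular)\}' "$1"
}
```

In the script above, such a check would have to run on the .tex file before the temporary directory is cleaned up, printing a warning if has_tables succeeds.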