Wednesday, September 22, 2010

Document conversion tool

So I was looking for a tool to convert between various document formats (i.e. .tex to .doc, .html, .pdf, .odt, or any variation thereof), and I found it surprisingly difficult to find one that could convert to .odt. Converting from pretty much anything to either HTML or PDF was a breeze; converting to other formats was harder.

To convert from .tex to .html, I chose latex2html rather than tex4ht because I found its output to be cleaner.

Conversion from .tex to .pdf was quite a bit easier, and there were a few standard options: run latex, dvips, and ps2pdf; run latex and dvipdfm; or just run pdflatex. For some reason, when I used these methods the hyperlinked table of contents that I created (via hyperref and \tableofcontents) wouldn't show up. I chose to use rubber instead. Rubber is a wrapper for LaTeX and some companion programs. With the '-d' option, I was able to make PDFs from the .tex files that included the hyperlinked table of contents. I don't really understand why this worked when the other methods didn't, but it did.

Now came the hard part. Converting the .tex files into .odt or .doc appeared to be nearly impossible to do cleanly. The best option I had heard of was to convert to HTML first, load the HTML into Office, and then save it in the desired format. I found this to work extremely well. However, I had intended to automate the whole conversion process with a script, so this method was not very good for me, especially since neither Microsoft Office nor OpenOffice.org had a very good command line interface. This was when I discovered a program called JODConverter, a Java program that uses OpenOffice.org to convert from one format to another. While this does mean that I probably could have found a way to use OpenOffice.org directly from the command line, who was I to complain when there was a program out there to do it for me =D.
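
Going back to the PDF routes mentioned above, here is a rough sketch that only prints the commands each route would run, so the chains can be compared side by side. The helper name pdf_commands and the file name Master.tex are assumptions made for this illustration; nothing is actually compiled.

```shell
#!/bin/bash
# Print (do not run) the commands for one way of turning a .tex file into a PDF.
# pdf_commands is a hypothetical helper used only for this sketch.
pdf_commands() {
    local method=$1 tex=$2
    local base=${tex%.tex}          # "Master.tex" -> "Master"
    case $method in
        dvips)    printf '%s\n' "latex $tex" "dvips -o $base.ps $base.dvi" "ps2pdf $base.ps $base.pdf" ;;
        dvipdfm)  printf '%s\n' "latex $tex" "dvipdfm $base.dvi" ;;
        pdflatex) printf '%s\n' "pdflatex $tex" ;;
        rubber)   printf '%s\n' "rubber -d $tex" ;;
        *)        echo "unknown method: $method" >&2; return 1 ;;
    esac
}

# Show every route for an assumed sample file.
for method in dvips dvipdfm pdflatex rubber; do
    echo "== $method =="
    pdf_commands "$method" Master.tex
done
```

A plausible explanation for the missing table of contents, not verified here, is that LaTeX writes the .toc file on one run and only reads it on the next, so a single pass of any of the first three routes leaves the contents empty. Rubber reruns LaTeX as often as needed, which would account for the difference.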
In the end I wrote a small BASH script to help me with the conversions.

NOTES:
-written in BASH because I'm using Arch Linux
-uses zenity to provide a GUI, but the GUI is not really necessary
-while this is tailored to my LaTeX usage it can probably be adapted for anything, although just using JODConverter is probably better if LaTeX is not an issue
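
Since the zenity dialogs are optional, the format selection can come from anywhere as long as it has the same shape as the checklist output, which is the ticked rows joined by "|". A minimal sketch, assuming a hypothetical helper list_formats and a hypothetical environment variable FORMATS that stands in for the dialog:

```shell
#!/bin/bash
# list_formats is a hypothetical helper: it splits a zenity-style "A|B|C"
# string into one word per line. IFS is changed only inside the function,
# so the rest of the script keeps normal word splitting.
list_formats() {
    local IFS='|'
    local word
    for word in $1; do      # unquoted on purpose so the shell splits on "|"
        printf '%s\n' "$word"
    done
}

# FORMATS is an assumed variable; without it, all four formats are selected.
selection="${FORMATS:-PDF|HTML|ODT|DOC}"

list_formats "$selection" | while read -r fmt; do
    echo "would convert to $fmt"
done
```

Running it as FORMATS='PDF|ODT' ./convert.sh (script name assumed) would then skip the dialog entirely.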


#!/bin/bash
# Convert a LaTeX file to PDF, HTML, ODT and/or DOC, then archive the results.
dir=/home/ray/Documents/Novel/Tex
cd "$dir" || exit 1

# Pick the source file; converted files and the archive folders are left out of the list.
response=$(zenity --list --title="Choose File" --column=File \
    $(ls --hide='*.pdf' --hide='*.odt' --hide='*.html' --hide=Revisions --hide=Output))
[ -n "$response" ] || exit 1    # dialog cancelled
base=${response%.*}             # file name without its extension

# Pick the output formats; zenity prints the ticked rows separated by "|".
input=$(zenity --list --title="Choose File Type" --checklist --column=Files --column=Description \
    TRUE PDF \
    TRUE HTML \
    TRUE ODT \
    TRUE DOC)

# Build a single-page HTML file and turn "file.html#anchor" links into "#anchor".
make_html() {
    latex2html -split 0 -no_navigation -dir "$dir" "$response"
    sed "s_${base}\.html#_#_g" "$base.html" > "$base.html.new"
    rm -f index.html
    mv "$base.html.new" "$base.html"
}

# Convert the HTML to the given office format ($1) through a headless OpenOffice.org.
make_office() {
    make_html
    soffice -headless -accept="socket,host=127.0.0.1,port=8100;urp;" -nofirststartwizard &
    sleep 5    # assumed delay so soffice is listening before JODConverter connects
    jodconverter "$dir/$base.html" "$dir/$base.$1"
    pkill soffice
}

IFS='|' read -ra formats <<< "$input"
for word in "${formats[@]}"; do
    case $word in
        PDF)  rubber -f -s -d "$dir/$response" ;;
        HTML) make_html ;;
        ODT)  make_office odt ;;
        DOC)  make_office doc ;;
    esac
done

# Remove LaTeX by-products, keeping sources, scripts, and converted files.
for f in *; do
    case $f in
        *.tex|*.sh|*.html|*.odt|*.pdf|*.doc|Output|Revisions) ;;
        *) rm -f -- "$f" ;;
    esac
done

# Archive everything except the sources in a time-stamped folder.
outdir="Output/$(date +%F-%R)"
mkdir -p "$outdir"
for f in *; do
    case $f in
        *.tex|Revisions|Output) ;;
        *) cp -- "$f" "$outdir/" ;;
    esac
done
#
#echo "-------------------- EXTRA STEPS --------------------"
#echo "1. Open HTML with OpenOffice.org Writer"
#echo "2. Add first-line indent"
#echo "3. Save file as Master.odt"
#echo "4. Export Master.odt to GoogleDocs"
#echo "-----------------------------------------------------"
#cd ~
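
The sed step in the script is the least obvious part, so here is a small standalone sketch of what it appears to do: strip the file name from links of the form "file.html#anchor" so they become plain in-page anchors. The file name Master.tex and the sample HTML line are assumptions made up for this illustration, not real latex2html output.

```shell
#!/bin/bash
# Assumed sample input; the real script gets the file name from the zenity dialog.
response="Master.tex"
base=${response%.*}             # strip the last extension -> "Master"

# A made-up line shaped like the links the sed step targets.
line='<A HREF="Master.html#SEC1">One</A> <A HREF="Master.html#SEC2">Two</A>'

# "_" is used as the sed delimiter, so base must not contain an underscore.
# The trailing "g" rewrites every link on the line, not just the first.
fixed=$(printf '%s\n' "$line" | sed "s_${base}\.html#_#_g")

printf '%s\n' "$fixed"
# prints: <A HREF="#SEC1">One</A> <A HREF="#SEC2">Two</A>
```

With the file name gone, the links keep working after the HTML is converted to another format or renamed.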