
I am trying to convert .doc or .docx (Microsoft Word) files to HTML with Apache Tika.

I ended up with the following code, which works fine, but it does not add any style sheet to the resulting HTML.

    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.io.StringWriter;
    import javax.xml.transform.OutputKeys;
    import javax.xml.transform.sax.SAXTransformerFactory;
    import javax.xml.transform.sax.TransformerHandler;
    import javax.xml.transform.stream.StreamResult;
    import org.apache.tika.detect.DefaultDetector;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.Parser;

    public class DocxConvert {

        public static void main(String[] args) {
            // try-with-resources closes the stream even when parsing fails, and
            // avoids a NullPointerException on close() if the file cannot be opened
            try (InputStream input = new FileInputStream("f:\\file.doc")) {
                StringWriter sw = new StringWriter();

                // Serialize the SAX events emitted by Tika as indented HTML
                SAXTransformerFactory factory =
                        (SAXTransformerFactory) SAXTransformerFactory.newInstance();
                TransformerHandler handler = factory.newTransformerHandler();
                handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
                handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "yes");
                handler.setResult(new StreamResult(sw));

                // Detect the document type and parse it into the handler
                DefaultDetector detector = new DefaultDetector();
                Metadata metadata = new Metadata();
                Parser parser = new AutoDetectParser(detector);
                parser.parse(input, handler, metadata, new ParseContext());

                System.out.print(sw.toString());
            } catch (Exception ex) {
                ex.printStackTrace();
            }
        }
    }

Is there any way to add or generate a style sheet for the output? Kindly help!


2 Answers

You can use unoconv, which requires OpenOffice or LibreOffice. Download it from here; it provides command-line conversion of doc, docx, xls, etc. to PDF on your server. If you want to show the PDF embedded in a page served by Apache or Apache Tomcat, I think pdf.js is a good solution.
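
For illustration, here is a rough sketch of calling unoconv from Java rather than from a shell. It assumes the `unoconv` executable is installed and on the PATH; the class name `UnoconvRunner` and its methods are hypothetical names chosen for this example, and `-f` is unoconv's option for selecting the output format.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

class UnoconvRunner {

    // Builds the command line. Assumption: "unoconv" is on the PATH.
    static List<String> buildCommand(String format, String inputPath) {
        return Arrays.asList("unoconv", "-f", format, inputPath);
    }

    // Runs unoconv and returns its exit code (0 normally means success).
    static int convert(String format, String inputPath)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(format, inputPath))
                .inheritIO() // show unoconv's own messages on the console
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.out.println("usage: java UnoconvRunner <input-file>");
            return;
        }
        int exitCode = convert("pdf", args[0]);
        System.out.println("unoconv exited with code " + exitCode);
    }
}
```

Running it with the path of a Word file as the only argument should leave the converted PDF next to the input file, provided unoconv can start OpenOffice or LibreOffice.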


I used version 1.6 of Tika and that worked fine for me. Here are the POM dependencies I used.

http://tika.apache.org/download.html

    <dependencies>
        <dependency>
            <groupId>org.apache.tika</groupId>
            <artifactId>tika-core</artifactId>
            <version>1.6</version>
        </dependency>
        <dependency>
            <groupId>org.apache.tika</groupId>
            <artifactId>tika-parsers</artifactId>
            <version>1.6</version>
        </dependency>
    </dependencies>
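
On the style sheet itself: the HTML in the question comes out without any style information, so one possible workaround is to post-process the string that Tika produces and link your own CSS file into it. The following is only a sketch of that idea, not something Tika provides; `StylesheetInjector`, `addStylesheetLink`, and `style.css` are hypothetical names, and it assumes the generated HTML contains a `</head>` tag.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StylesheetInjector {

    private static final Pattern HEAD_END =
            Pattern.compile("</head>", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: inserts a <link> element just before </head>.
    // If no </head> tag is found, the HTML is returned unchanged.
    static String addStylesheetLink(String html, String href) {
        Matcher matcher = HEAD_END.matcher(html);
        if (!matcher.find()) {
            return html;
        }
        String link = "<link rel=\"stylesheet\" type=\"text/css\" href=\""
                + href + "\">\n";
        int headEnd = matcher.start();
        return html.substring(0, headEnd) + link + html.substring(headEnd);
    }

    public static void main(String[] args) {
        // Stand-in for the sw.toString() result from the question's code
        String html = "<html><head><title>Demo</title></head>"
                + "<body><p>Hello</p></body></html>";
        System.out.println(addStylesheetLink(html, "style.css"));
    }
}
```

In the question's code this would be applied to `sw.toString()` before printing or saving. The CSS file has to be written by hand, since this approach does not recover the original Word formatting.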