XML Creation using java translates CR in HTML Entity

Question

I have a fairly serious problem creating XML with the standard Java classes. My code is as follows:

//Generate DOM
DOMSource source = this.generateDomDocument(params...);

//WRITE XML FILE
TransformerFactory transformerFactory = TransformerFactory.newInstance();
Transformer transformer = transformerFactory.newTransformer();

//Properties
transformer.setOutputProperty(OutputKeys.DOCTYPE_SYSTEM, STRING_FIELD_DTD);
transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "4");
transformer.setOutputProperty(OutputKeys.INDENT, "yes");

//Convert and write to disk
transformer.transform(source, new StreamResult(
                      new OutputStreamWriter(new FileOutputStream(fileName), "UTF-8")));

The problem is that the transformer converts carriage returns into &#13; entities, which I must not have in the resulting XML. For example, my result file holds translations written in several different languages (that's why I use UTF-8), and they all show the same problem whenever the text contains carriage returns:

<content langID="EN">
                    <desc> Test string&#13;
do not copy.</desc>

To clear things up, this is what I expect in the XML:

<content langID="EN">
                    <desc> Test string
do not copy.</desc>

I looked up the issue on Google and here too, but there seems to be no solution or workaround.

I'm not sure I understand the problem correctly: the input data contains CRs, and the XML output should a) not include them, b) simply have a line break there, or c) ...? — Rainer Schwarze, Commented Jul 3, 2012 at 20:02
The input data contains CRs, and the XML output should simply have a line break, not something like &#13; followed by CRLF. I inspected the XML with Notepad++ and this is what I found. — OverLex, Commented Jul 4, 2012 at 9:53

OverLex · Accepted Answer · 2012-07-10 13:49:57Z

After a lot of work I found two solutions to my own problem. They are more workarounds than real solutions:

Solution 1

Create a class that extends FilterOutputStream and implement the necessary methods to write every character to the final stream (a file in the case above) except the unwanted ones, i.e. &#13;. To write to the stream, just add the filter:

 StreamResult result = new StreamResult(
                           new OutputStreamWriter(
                               new XMLFilterOutputStream(
                                   new FileOutputStream(filename)),"UTF-8"));
 transformer.transform(source, result);
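
The answer does not show XMLFilterOutputStream itself. Below is a minimal sketch of what such a filter might look like; the class body is an assumption, not the original author's code. It drops the ASCII byte sequence &#13; and passes everything else through. Matching on bytes is safe for UTF-8 because ASCII bytes never occur inside a multi-byte sequence. It also assumes the serializer writes the carriage return exactly as &#13;, as in the output shown in the question; other spellings such as &#xD; would need their own pattern.

```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Sketch only: a guess at the filter described in Solution 1.
class XMLFilterOutputStream extends FilterOutputStream {

    // Byte sequence to remove from the output (assumed spelling of the entity).
    private static final byte[] PATTERN = "&#13;".getBytes(StandardCharsets.US_ASCII);

    // Number of PATTERN bytes matched so far and held back from the target stream.
    private int matched = 0;

    XMLFilterOutputStream(OutputStream out) {
        super(out);
    }

    // FilterOutputStream routes write(byte[]) and write(byte[], int, int)
    // through this method one byte at a time, so overriding it is enough.
    @Override
    public void write(int b) throws IOException {
        if ((byte) b == PATTERN[matched]) {
            matched++;
            if (matched == PATTERN.length) {
                matched = 0; // full entity seen: drop it
            }
            return;
        }
        if (matched > 0) {
            // False start: emit the bytes that were held back.
            out.write(PATTERN, 0, matched);
            matched = 0;
            // The current byte may itself begin a new entity.
            if ((byte) b == PATTERN[0]) {
                matched = 1;
                return;
            }
        }
        out.write(b);
    }

    @Override
    public void close() throws IOException {
        // Emit an incomplete match left at the very end of the stream.
        if (matched > 0) {
            out.write(PATTERN, 0, matched);
            matched = 0;
        }
        super.close();
    }
}
```

Held-back bytes are only released on a mismatch or on close(), so the stream must be closed for a trailing partial match to reach the file.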

Solution 2

When creating the DOM tree, just strip the \r character before creating the text node (thus removing every carriage return from the original text):

String util = //original string data

Element desc = doc.createElement("desc");                   
Node text = doc.createTextNode((util!=null ? stringEscape(util).trim() : ""));
desc.appendChild(text);
externalElement.appendChild(desc);

And having the escaping method done this way:

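private String stringEscape(String str){
    StringBuilder st = new StringBuilder(str);
    // Walk backwards so that deleting a character never skips the next one
    // (a forward loop would miss the second of two consecutive "\r").
    for(int i = st.length() - 1; i >= 0; i--){
        if(st.charAt(i) == '\r'){
            st.deleteCharAt(i);
        }
    }
    return st.toString();
}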

I know, it's horrible, but it works.
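
To see the effect of Solution 2 in isolation, here is a small self-contained sketch. The class name StripCrDemo and the helper serializeDesc are hypothetical names made up for this example, and it uses String.replace instead of a hand-written loop. It serializes the same text twice, once with the carriage return left in the DOM and once with it stripped first, so the two outputs can be compared. How the carriage return is escaped in the first output depends on the serializer that ships with the JDK in use.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class StripCrDemo {

    // Hypothetical helper: builds <desc>text</desc> and serializes it to a String.
    static String serializeDesc(String text) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .newDocument();
        Element desc = doc.createElement("desc");
        desc.appendChild(doc.createTextNode(text));
        doc.appendChild(desc);

        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String raw = "Test string\r\ndo not copy.";
        // Carriage return still present in the text node.
        System.out.println(serializeDesc(raw));
        // Carriage return removed before the text node is created.
        System.out.println(serializeDesc(raw.replace("\r", "")));
    }
}
```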

The correct solution, AFAIK, would be to access the HTMLEntities file in Xalan and modify it, forcing the transformer to omit some entities.

For stringEscape, something like str.replaceAll("\r", "") is probably much faster. — lapo, Commented Jul 11, 2012 at 7:44
