Preserving the original newline type (\r vs \r\n) in XML

I have an application in which I would like to use an XML file to store (1) the source text of a document and (2) several objects that "point" into the source text using character offsets. For instance:

<Document>
  <OriginalText>This is a test</OriginalText>
  <Word start_offset="0" end_offset="4" id="w1"/>
  <Word start_offset="5" end_offset="7" id="w2"/>
  <Word start_offset="8" end_offset="9" id="w3"/>
  <Word start_offset="10" end_offset="14" id="w4"/>
</Document>
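To make the offset scheme concrete, here is a minimal sketch of how an application might resolve each Word element back to its substring. It assumes the offsets are half-open (start inclusive, end exclusive), which is what the w1, w3, and w4 entries suggest; under that assumption "is" spans 5 to 7. The helper name `resolve_words` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Sample document, with offsets treated as half-open [start, end) ranges (an assumption).
SAMPLE = """<Document>
  <OriginalText>This is a test</OriginalText>
  <Word start_offset="0" end_offset="4" id="w1"/>
  <Word start_offset="5" end_offset="7" id="w2"/>
  <Word start_offset="8" end_offset="9" id="w3"/>
  <Word start_offset="10" end_offset="14" id="w4"/>
</Document>"""


def resolve_words(xml_text):
    """Map each Word id to the substring of OriginalText it points at."""
    root = ET.fromstring(xml_text)
    text = root.findtext("OriginalText")
    return {
        word.get("id"): text[int(word.get("start_offset")):int(word.get("end_offset"))]
        for word in root.iter("Word")
    }


words = resolve_words(SAMPLE)
print(words)
```

Any change in the length of OriginalText between writing and reading the file would silently shift every slice taken after the changed position.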

However, I am worried about a potential problem: I cannot control the contents of the input documents, so they may contain either "\n" or "\r\n" line endings. But the XML specification [1] says:

The XML processor MUST behave as if it normalized all line breaks in external parsed entities (including the document entity) on input, before parsing, by translating both the two-character sequence #xD #xA and any #xD that is not followed by #xA to a single #xA character.

I.e., newlines are normalized before the application sees the contents of the XML file. Unfortunately, it seems to me that this can throw off the character offsets. For example, the character that was at offset 173 before the line breaks were normalized may be at offset 168 afterwards. My questions:

  • Am I interpreting the XML specification correctly?

  • I assume that simply encoding the newlines (i.e., replacing \r with &#xD;) will not fix the problem, because the encoded characters will be replaced before the XML processor normalizes line breaks. Is that right?

  • - ? , , \r, - ( , "" ); , . (, base64 uuencode), , XML.

( , , , , .)
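Both points can be checked empirically. The sketch below uses Python's standard xml.etree.ElementTree (an assumption; any conforming parser should behave the same way) to parse one element containing a literal CRLF and one where the CR is written as the character reference &#xD;. Since the specification applies line-break normalization to the raw input before parsing, the literal CRLF would be expected to collapse to a single LF, while a CR produced by a character reference would be expected to survive. The helper name `parsed_text` is hypothetical.

```python
import xml.etree.ElementTree as ET


def parsed_text(xml_bytes):
    """Return the root element's text exactly as the parser reports it."""
    return ET.fromstring(xml_bytes).text


# Literal CRLF in the raw input: subject to line-break normalization.
literal = parsed_text(b"<OriginalText>one\r\ntwo</OriginalText>")

# CR written as a character reference, followed by a literal LF.
escaped = parsed_text(b"<OriginalText>one&#xD;\ntwo</OriginalText>")

# Compare lengths and the offset of "two" in each result.
print(repr(literal), len(literal), literal.index("two"))
print(repr(escaped), len(escaped), escaped.index("two"))
```

If the two results differ in length, offsets computed against the original CRLF text only line up with the escaped variant.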

[1] http://www.w3.org/TR/REC-xml/#sec-line-ends


, , , , () CR , . , CR, &#xD;, LF, ( , ), XML. , CR CDATA, , CDATA , .

, , . : , XML . CR, XML , .

, . , , . , , . , . , , .


, , <br />.

