Java regex alternative for bytes in a stream

I have XML files (encoded in UTF-8) that have two problems:

  • Some of them (not all) contain Byte Order Mark EF BB BF

  • Some of them (not all) contain Null 00 characters distributed throughout the file.

Both problems prevent me from parsing the XML with a SAX parser. My current approach is to read the file into a String, strip these characters with a regular expression, and write the String back to the file, which works. However, my files are quite large (hundreds of megabytes), and reading a whole file into a String and then creating a result String of the same size every time I call replaceAll() quickly leads to a Java heap space error.

Increasing the heap size is definitely not a long-term solution. I need to stream the file and strip these characters on the fly.

Any suggestions on what an effective solution should look like?

+3
3 answers

I would subclass FilterInputStream to filter out the unwanted bytes as the stream is read.

The task should be quite simple: a byte order mark can only appear at the beginning of the file (so you only need to check there), and NUL bytes can easily be filtered out with a simple == comparison (no regular expressions needed).

This is also likely to improve performance, since you do not need to write the fully cleaned file to disk before re-reading it.
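The approach described above could be sketched roughly as follows. This is only a minimal illustration, not a tuned implementation: the class name CleaningInputStream is made up for this example, and the bulk read() simply loops over the single-byte read(), so the source should be wrapped in a BufferedInputStream to keep it reasonably fast. Dropping 0x00 at the byte level is safe for UTF-8 because that byte value never occurs inside a multi-byte sequence.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical example class: drops a leading UTF-8 BOM and every 0x00 byte. */
class CleaningInputStream extends FilterInputStream {
    // Up to three bytes read during the BOM check that must be replayed if no BOM was found.
    private final int[] pending = new int[3];
    private int pendingPos = 0;
    private int pendingLen = 0;
    private boolean started = false;

    CleaningInputStream(InputStream in) {
        super(in);
    }

    /** Returns the next byte of the source, skipping a BOM at the very start. */
    private int nextRaw() throws IOException {
        if (!started) {
            started = true;
            while (pendingLen < 3) {
                int b = in.read();
                if (b < 0) break;            // file shorter than three bytes
                pending[pendingLen++] = b;
            }
            if (pendingLen == 3 && pending[0] == 0xEF
                    && pending[1] == 0xBB && pending[2] == 0xBF) {
                pendingLen = 0;              // it was a BOM: discard it
            }
        }
        if (pendingPos < pendingLen) {
            return pending[pendingPos++];
        }
        return in.read();
    }

    @Override
    public int read() throws IOException {
        int b;
        do {
            b = nextRaw();
        } while (b == 0);                    // skip NUL bytes; -1 (EOF) passes through
        return b;
    }

    // Must be overridden: the inherited version delegates straight to the
    // wrapped stream and would bypass the filtering.
    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (len == 0) return 0;
        int count = 0;
        while (count < len) {
            int b = read();
            if (b < 0) break;
            buf[off + count++] = (byte) b;
        }
        return count == 0 ? -1 : count;
    }

    @Override
    public long skip(long n) throws IOException {
        long skipped = 0;
        while (skipped < n && read() >= 0) skipped++;
        return skipped;
    }

    @Override
    public boolean markSupported() {
        return false;                        // mark/reset would desynchronise the BOM state
    }
}
```

Such a stream could then be handed directly to the parser, for example new CleaningInputStream(new BufferedInputStream(new FileInputStream("xmlfile.xml"))) passed to SAXParser.parse(InputStream, DefaultHandler), so the cleaned data never has to be held in memory or written back to disk.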

+7

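The SAX parser can read from an InputStream directly. Override read() in a FilterInputStream subclass so that it skips the unwanted bytes.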


+1



If the InputStream supports mark() and reset(), you can read the first three bytes and reset the stream if they are not a BOM:

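import java.io.*;
import java.util.Arrays;

InputStream in = new BufferedInputStream(
        new FileInputStream(new File("xmlfile.xml")));
in.mark(3); // remember the start; up to 3 bytes may be read before reset()
byte[] maybeBom = new byte[] {
        (byte) in.read(), (byte) in.read(), (byte) in.read() };

// Rewind to the start unless the first three bytes are the UTF-8 BOM.
if (!Arrays.equals(maybeBom, new byte[] { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF })) {
    in.reset();
}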

The BufferedInputStream is needed because FileInputStream does not support mark().

+1
