Removing non-UTF-8 characters from a large txt file

I am working with a 1 GB JSON text file that I am trying to parse using Java. However, the parser fails when it runs into a '-' character and throws this exception:

Exception Invalid Start Byte UTF-8 0x96

I tried removing the character with sed and perl, but they do not seem able to match it, so the file remains unchanged. I would like to either remove this character from the whole file or replace it with some other character or string so that the parser can process the file.

+5
2 answers

Your file is not encoded in UTF-8.

Open it with an InputStreamReader using the file's actual encoding, then write the contents back out as UTF-8 (using an OutputStreamWriter).

As for which encoding the file actually uses: that is something you have to find out. See Charsets.
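The read-then-rewrite approach can be sketched as follows. This is only an illustration under one loud assumption: that the file is windows-1252, where byte 0x96 is the en dash (U+2013). The original post does not confirm the encoding, and the class name, method name, and command-line arguments are hypothetical. The file is streamed through a fixed-size buffer, so a 1 GB input never has to fit in memory.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class Cp1252ToUtf8 {

    // Decodes `in` with `sourceCharset` and re-encodes it to `out` as UTF-8.
    static void transcode(Path in, Path out, Charset sourceCharset) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                 new InputStreamReader(Files.newInputStream(in), sourceCharset));
             BufferedWriter writer = new BufferedWriter(
                 new OutputStreamWriter(Files.newOutputStream(out), StandardCharsets.UTF_8))) {
            char[] buffer = new char[8192];
            int count;
            // Copy in chunks so the whole file is never held in memory.
            while ((count = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, count);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: Cp1252ToUtf8 <input> <output>");
            return;
        }
        // ASSUMPTION: the source encoding is windows-1252; substitute the real one.
        transcode(Paths.get(args[0]), Paths.get(args[1]), Charset.forName("windows-1252"));
    }
}
```

Write to a separate output file rather than overwriting the input, so the original survives if the guessed encoding turns out to be wrong.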

+5

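As already noted, the file is not encoded in UTF-8, so you first need to know which encoding you are dealing with. In Java, the encoding a reader is using can be queried with InputStreamReader#getEncoding().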
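Note that InputStreamReader#getEncoding() only reports the name of the encoding the reader was opened with; it does not inspect the file's bytes. To test whether a file really is valid UTF-8, one option is to decode it with a strict CharsetDecoder that reports malformed input instead of replacing it. The sketch below is not taken from the answers above, and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class Utf8Check {

    // Returns true if every byte sequence in the file decodes as UTF-8.
    static boolean isValidUtf8(Path file) throws IOException {
        CharsetDecoder strict = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try (InputStream in = Files.newInputStream(file);
             Reader reader = new InputStreamReader(in, strict)) {
            char[] buffer = new char[8192];
            // Read to the end; a malformed sequence raises CharacterCodingException.
            while (reader.read(buffer) != -1) {
                // Nothing to do with the decoded characters.
            }
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

A result of false only says the file is not UTF-8; it does not say which encoding it is, so that still has to be established some other way.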

+2
