org.apache.commons.io.IOUtils.toString misinterprets UTF-8

Question

I am trying to fetch the source of a page from a URI. The page is reported to be UTF-8. I also tried ISO-8859-1, Windows-1250 and ISO-8859-2.

Here is my latest attempt (using ISO-8859-2 as an example):

    import java.io.IOException;
    import java.net.URL;

    // Fetches the page at the given address and decodes it with the given charset.
    public static String getPage(String page, String charset) throws IOException {
        URL url = new URL(page);
        return org.apache.commons.io.IOUtils.toString(url.openConnection().getInputStream(), charset);
    }

    public static void main(String[] args) throws Exception {
        String page = getPage("http://buscon.rae.es/drae/srv/search?val=aba", "ISO-8859-2");
        System.out.println(page);
    }

But the result is:

apÄ? ge 'quita, aparta', y este del gr. á¼? I AM? I ± Î³Îμ)

instead of:

(Del lat. Apăge 'quita, aparta', y este del gr. Ἄπαγε).

UTF-8 (which works with other code, as well as in browsers) and the other encoding names fail in a similar way.

java url character-encoding

Lengoman Aug 7 '12 at 15:51

1 answer

McDowell · Accepted Answer · 2012-08-08T11:26:39+0000

U+0103 (ă) is being encoded as the byte sequence C4 83, so the data really is UTF-8.
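The byte sequence mentioned here can be checked with a small sketch. The class name EncodingCheck and the helper hex are made up for this illustration, and ISO-8859-1 is used as the wrong charset only because it is one of the encodings tried in the question.

```java
import java.nio.charset.StandardCharsets;

public class EncodingCheck {
    // Formats bytes as space-separated uppercase hex, e.g. "C4 83".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u0103"; // ă, written as an escape so the source file encoding does not matter

        // Encoding U+0103 as UTF-8 yields two bytes.
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        System.out.println(hex(utf8)); // C4 83

        // Decoding those same bytes with a single-byte charset yields two characters.
        String wrong = new String(utf8, StandardCharsets.ISO_8859_1);
        System.out.println(String.format("U+%04X U+%04X",
                (int) wrong.charAt(0), (int) wrong.charAt(1))); // U+00C4 U+0083
    }
}
```

U+00C4 is Ä and U+0083 is an unprintable control character, which is consistent with the "Ä?" seen in the question's output.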

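The problem is most likely the PrintStream, System.out. It converts the string to bytes using the platform default encoding, which may not be able to represent these characters or may not match what the console expects.

This is a common problem with the Windows console.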
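One way to tell whether the decoding step or the console is at fault is to inspect the code points of the string instead of trusting what the console displays. The following is a minimal sketch, not a definitive fix: the class name ConsoleCheck and the helper codePoints are invented for the example, it assumes Java 8 or later, and wrapping System.out in a UTF-8 PrintStream only helps if the terminal itself interprets the bytes as UTF-8.

```java
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class ConsoleCheck {
    // Renders each code point as U+XXXX so the string can be inspected
    // independently of the console encoding (the output is plain ASCII).
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String text = "ap\u0103ge"; // stand-in for the text returned by getPage

        // If U+0103 shows up here, the page was decoded correctly.
        System.out.println(codePoints(text));

        // Assumption: the terminal expects UTF-8. This stream encodes as UTF-8
        // instead of the platform default encoding.
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        utf8Out.println(text);
    }
}
```

If the code points are correct but the second line still looks wrong, the string is fine and the remaining mismatch is between the bytes written and the encoding the console uses to display them.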
