Look in the <head> of the page for a meta tag that declares the charset:
<meta http-equiv="Content-Type" content="text/html; charset=xxxxx">
(Common values for xxxxx include UTF-8, GBK for Chinese pages, and Shift-JIS for Japanese pages.)
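As a rough sketch of reading that declaration programmatically, the snippet below scans the start of the raw page bytes for a charset value. The helper name `declared_charset` and the 4096-byte scan window are assumptions made for illustration, and a regular expression is only a heuristic here, not a full HTML parser.

```python
import re

# Hypothetical helper, not part of any library: matches both
# <meta http-equiv="Content-Type" content="text/html; charset=xxxxx">
# and the shorter <meta charset="xxxxx"> form.
_META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?\s*([A-Za-z0-9_\-]+)',
    re.IGNORECASE,
)

def declared_charset(raw_html: bytes):
    """Return the declared charset in lowercase, or None if none is found."""
    # Assumption: the declaration sits near the top of the document,
    # so only the first 4096 bytes are scanned.
    match = _META_CHARSET.search(raw_html[:4096])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()
```

For example, `declared_charset(b'<head><meta charset="UTF-8"></head>')` gives back the string `utf-8`, and a page with no meta declaration gives back `None`.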
So, if the page declares an encoding, you know what to expect and can deal with it accordingly. If not, you will need to make some reasonable assumptions: are there non-ASCII bytes (values above 127) in the text version of the page? Are there any HTML entities like &#19968; (一) or &eacute; (é)?
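The two questions above can be checked mechanically. This is a minimal sketch: the function names `has_non_ascii` and `has_entities` are made up for illustration, and the entity pattern is a loose approximation that does not validate entity names.

```python
import html
import re

# Loose pattern for numeric (&#19968; or &#x4E00;) and named (&eacute;) entities.
_ENTITY = re.compile(rb'&(?:#\d+|#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);')

def has_non_ascii(raw: bytes) -> bool:
    """True if any byte in the page is above 127."""
    return any(b > 127 for b in raw)

def has_entities(raw: bytes) -> bool:
    """True if the page appears to contain HTML character entities."""
    return _ENTITY.search(raw) is not None

# Entities are plain ASCII in the source, so they can be resolved
# without knowing the page encoding at all.
decoded = html.unescape("&#19968; &eacute;")
```

A page that contains only ASCII bytes plus entities can be handled without guessing an encoding, whereas bytes above 127 mean the encoding has to be determined first.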
Once you have guessed / determined the encoding of the page, you can convert it to UTF-8 and be on your way.
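A minimal sketch of that last step, assuming the page is available as raw bytes and the encoding name is one Python's codec registry recognises (for example `gbk` or `shift_jis`). The function name `to_utf8` is hypothetical.

```python
def to_utf8(raw: bytes, encoding: str) -> bytes:
    """Decode raw page bytes from the given encoding and re-encode as UTF-8."""
    # errors="replace" substitutes U+FFFD for undecodable bytes, so a wrong
    # guess degrades the text instead of raising an exception.
    text = raw.decode(encoding, errors="replace")
    return text.encode("utf-8")
```

If the guess is wrong, replacement characters in the output are a useful hint to try another encoding.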