How to handle encodings using the Python Requests library

I've struggled with encodings for too long, and today I want to break the mental block completely.

Right now I'm using Requests to scrape a bunch of websites, and from what I can tell, it uses the HTTP headers to figure out the encoding of each page, falling back to chardet when a website sends no such headers. From there, it decodes the bytes it has downloaded and helpfully hands me a unicode object in r.text.
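As a rough illustration of that header-then-fallback logic, here is a minimal sketch using only the standard library (Python 3 syntax). `guess_encoding` is a hypothetical helper, not part of Requests. It only approximates the documented behaviour that a `text/*` Content-Type without an explicit charset is treated as ISO-8859-1, and it returns None where a detector such as chardet would take over.

```python
from email.message import Message


def guess_encoding(content_type):
    """Hypothetical helper: pick an encoding from a Content-Type header value.

    Returns None when the header gives no hint, which is the point where
    a byte-level detector such as chardet would be consulted instead.
    """
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset()
    if charset:
        # An explicit charset parameter wins.
        return charset
    if msg.get_content_maintype() == "text":
        # Assumed default for text/* responses with no explicit charset.
        return "iso-8859-1"
    return None


print(guess_encoding("text/html; charset=UTF-8"))
print(guess_encoding("text/html"))
print(guess_encoding("application/json"))
```

In Requests itself, the encoding it settled on is exposed as `r.encoding`, and the detector's guess from the body is available as `r.apparent_encoding`, so comparing the two is a quick way to spot a page whose headers disagree with its content.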

Things are good.

Where I get confused is that from there I work with the text a bit and then print it to stdout, supplying an encoding as I print:

 print foo.encode('utf-8')

The problem is that when I do this, the output is garbled. In the following, I expect to get an em dash between the words “judgments” and “Standard”:

 Declaratory judgmentsStandard of review.

Instead, I get a box containing four tiny digits. Of course, it doesn't show up here, but I think the digits are 0097, which matches what I get if I do this:

repr(foo)
u'Declaratory judgments\x97Standard of review.'

So that makes sense, but where is my em dash?

The process is as follows:

  • Requests loads the page and intelligently decodes the text into a unicode object
  • I work with it
  • I encode it as utf-8 and print it.
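The three steps above can be sketched without any network access (Python 3 syntax). `process` and the choice of `cp1252` as the source encoding are assumptions made for illustration; the point is only that bytes are decoded once on the way in and encoded once on the way out.

```python
def process(raw_bytes, source_encoding):
    """Hypothetical pipeline: decode at the edge, work on text, encode at the edge."""
    # 1. Decode once, at the input boundary.
    text = raw_bytes.decode(source_encoding)
    # 2. Work only with text in the middle.
    text = text.strip()
    # 3. Encode once, at the output boundary.
    return text.encode("utf-8")


# Sample bytes with 0x97 between the two words; cp1252 is assumed here.
out = process(b" Declaratory judgments\x97Standard of review. ", "cp1252")
print(repr(out))
```

The sandwich only works if the encoding used in step 1 is the one the bytes were really written in; a wrong guess there is carried through unchanged to the output.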

Where is the problem? This looks like the mythical unicode sandwich, but obviously I'm missing something.

1 answer

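\x97 is the em dash in cp1252. Unicode U+0097 is a different character (a C1 control character), not an em dash. So the bytes were most likely cp1252, but they were not decoded as cp1252 into Unicode.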
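A small check of that claim, written as a sketch in Python 3 syntax: the same byte 0x97 is decoded once as Latin-1 and once as cp1252. The sample bytes are an assumption modelled on the repr shown in the question.

```python
raw = b"Declaratory judgments\x97Standard of review."

# Latin-1 maps byte 0x97 straight to code point U+0097 (a control character).
as_latin1 = raw.decode("latin-1")
# cp1252 maps byte 0x97 to U+2014, the em dash.
as_cp1252 = raw.decode("cp1252")

print(ascii(as_latin1))
print(ascii(as_cp1252))

# Text that was already mis-decoded as Latin-1 can be repaired by
# reversing that step and decoding again with the right codec.
repaired = as_latin1.encode("latin-1").decode("cp1252")
print(repaired == as_cp1252)
```

With Requests, if the page really is cp1252, setting `r.encoding = 'cp1252'` before reading `r.text` should make it decode with that codec in the first place, so no repair step is needed.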

PS: Unicode , !:)
