Problem scraping a site with foreign characters

I need help with a scraper I am writing. I am trying to scrape a table of university rankings, and some of the schools are European universities with foreign characters in their names (for example, ä, ü). I am already scraping another table on another site with foreign universities in exactly the same way, and everything works fine. But for some reason, the current scraper does not work with foreign characters (and, as far as handling foreign characters goes, the two scrapers are exactly the same).

Here is what I am doing to try to make everything work:

  • Declare an encoding in the very first line of the file:

    # -*- coding: utf-8 -*-
    
  • Import and use smart_unicode from the Django framework (from django.utils.encoding import smart_unicode):

    school_name = smart_unicode(html_elements[2].text_content(), encoding='utf-8',
                                strings_only=False, errors='strict').encode('utf-8')
    
  • Use the encode function, chained onto the smart_unicode call as shown above. I can't think of what else I could be doing wrong. Before working on these scrapers I really didn't know much about different encodings, so it has been a bit of an eye-opening experience. I have tried reading up on the subject, but I still cannot solve this problem.
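For reference, the steps above amount to "decode to text, then encode as UTF-8". Here is a rough, stdlib-only sketch of that idea in Python 3. Note that to_utf8_bytes is a hypothetical stand-in written for illustration, not Django's smart_unicode, and the sample strings are made up. It suggests that this step can only preserve what it is given, so if the text is already garbled when it arrives, the cause is probably further upstream.

```python
def to_utf8_bytes(value, encoding="utf-8"):
    """Hypothetical stand-in: decode byte strings to text, then encode as UTF-8."""
    if isinstance(value, bytes):
        # Strict decoding: bytes that are not valid in `encoding` raise UnicodeDecodeError.
        value = value.decode(encoding, errors="strict")
    return str(value).encode("utf-8")


# Correct text goes in, correct UTF-8 bytes come out.
good = to_utf8_bytes("Universität Zürich")

# Text that was already mis-decoded upstream is passed through unchanged:
# the mojibake is faithfully encoded, not repaired.
garbled = to_utf8_bytes("UniversitÃ¤t ZÃ¼rich")
```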





You could use requests for the HTTP side. It hands back the response body already decoded as text:

>>> import requests
>>> r = requests.get('https://github.com/timeline.json')
>>> r.text
'[{"repository":{"open_issues":0,"url":"https://github.com/...

Look in the <head> of the page for a charset declaration:

<meta http-equiv="Content-Type" content="text/html; charset=xxxxx">
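A minimal sketch of pulling that declaration out of the raw bytes, assuming Python 3 and only the standard library. The name sniff_meta_charset and the 4096-byte search window are my own assumptions for illustration; a real HTML parser would be more robust than a regular expression.

```python
import re

# Matches both <meta charset="..."> and the older http-equiv form shown above.
META_CHARSET = re.compile(rb"<meta[^>]+charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE)


def sniff_meta_charset(raw_html, default=None):
    """Return the charset named in a <meta> tag, or `default` if none is found."""
    # Only the start of the document is searched, since the tag belongs in <head>.
    match = META_CHARSET.search(raw_html[:4096])
    return match.group(1).decode("ascii").lower() if match else default


page = b'<html><head><meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"></head></html>'
declared = sniff_meta_charset(page)
```

The declared value is only a claim made by the page, so it is worth checking that the bytes actually decode with it.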


So, if the page declares an encoding, you know what to expect and can deal with it accordingly. If not, you will need to make some reasonable assumptions: are there non-ASCII bytes (> 127) in the raw text of the page? Are there any HTML entities like &#19968; (一) or &#233; (é)?
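Those two checks could be sketched roughly like this in Python 3, using only the standard library. The helper names has_non_ascii and find_entities are hypothetical, and the sample bytes are made up.

```python
import html
import re

# Numeric (decimal or hex) and named character references, e.g. &#233; or &eacute;
ENTITY = re.compile(rb"&(#\d+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def has_non_ascii(raw):
    """True if the raw page bytes contain any byte above 127."""
    return any(byte > 127 for byte in raw)


def find_entities(raw):
    """List the character references that appear in the raw page bytes."""
    return [m.group(0).decode("ascii") for m in ENTITY.finditer(raw)]


sample = b"<td>Universit&#233; de Gen&egrave;ve</td>"
only_ascii = not has_non_ascii(sample)   # the accents are hidden inside entities
entities = find_entities(sample)         # the references found, in document order
decoded = html.unescape(sample.decode("ascii"))
```

If a page is pure ASCII plus entities, the byte encoding hardly matters and html.unescape does the rest. If it has bytes above 127, the encoding has to be determined before decoding.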

Once you have guessed / determined the encoding of the page, you can convert it to UTF-8 and be on your way.
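A rough sketch of that last step in Python 3, assuming you have the raw bytes and a short list of candidate encodings. The function name to_utf8 and the default candidate order are my assumptions, not a standard recipe.

```python
def to_utf8(raw, candidates=("utf-8", "iso-8859-1")):
    """Decode `raw` with the first candidate that works; return (utf8_bytes, encoding_used)."""
    for encoding in candidates:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this encoding, try the next one
        return text.encode("utf-8"), encoding
    raise ValueError("none of the candidate encodings could decode the input")


# Hypothetical sample: "Zürich" as an ISO-8859-1 page would send it.
latin1_bytes = "Zürich".encode("iso-8859-1")
converted, used = to_utf8(latin1_bytes)
```

The order matters: iso-8859-1 accepts every possible byte sequence, so it should come last, after stricter encodings such as UTF-8 have had a chance to reject the input.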
