(Apr-23-2024, 08:49 AM)Winfried Wrote: Some of the files could be Windows (latin1, iso8859-1, cp1252), others could be utf-8.
As these are .html files, some advice: if you are the one making or saving these files, then with Requests and BS there is a way to always save them as utf-8. If the files are already made, then as Gribouillis mentioned, there is chardet.
So e.g. I have one .html file (which I made to be latin-1) and one in utf-8.
λ chardetect page_latin.html
page_latin.html: ISO-8859-1 with confidence 0.73

G:\div_code\html_utf
λ chardetect page_utf8.html
page_utf8.html: utf-8 with confidence 0.7525
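If chardet is not at hand, a rougher check can be done with only the standard library. This is not what chardet does, just a simple fallback sketch: try strict utf-8 first, and if that fails assume cp1252. The function name `decode_html` and the choice of cp1252 as the fallback are my own assumptions for illustration.

```python
def decode_html(raw: bytes) -> tuple[str, str]:
    """Return (text, encoding_used) for raw HTML bytes."""
    try:
        # Strict utf-8 decoding raises on most text saved in another encoding
        return raw.decode('utf-8'), 'utf-8'
    except UnicodeDecodeError:
        # Assumption: anything that is not valid utf-8 is Windows cp1252.
        # errors='replace' covers the few bytes cp1252 leaves undefined.
        return raw.decode('cp1252', errors='replace'), 'cp1252'


sample = '<h1>Jalapeñod je pèle</h1>'
for raw in (sample.encode('latin-1'), sample.encode('utf-8')):
    text, used = decode_html(raw)
    print(used, text)
```

This only separates utf-8 from one guessed fallback, so for files in other encodings chardet is still the better tool.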
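from bs4 import BeautifulSoup

# The latin-1 file has to be opened with its real encoding
with open('page_latin.html', encoding='latin-1') as fp:
    soup = BeautifulSoup(fp, 'lxml')
h1_tag = soup.find('h1')
print(h1_tag)

# html_new.html is utf-8 (written by the conversion code further down).
# Encoding is passed explicitly, as the default of open() depends on platform settings.
with open('html_new.html', encoding='utf-8') as fp:
    soup = BeautifulSoup(fp, 'lxml')
h1_tag = soup.find('h1')
print(h1_tag)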
Output:
<h1>Jalapeñod je pèle</h1>
<h1>Jalapeñod je pèle</h1>
So all works as it should. If I take away encoding='latin-1' so the file gets read as utf-8, it breaks with UnicodeDecodeError.
You can also convert to utf-8, as Beautiful Soup converts the document to Unicode when it loads it:
Bs4 Doc Wrote: Any HTML or XML document is written in a specific encoding like ASCII or UTF-8. But when you load that document into Beautiful Soup, you’ll discover it’s been converted to Unicode.
Beautiful Soup uses a sub-library called Unicode, Dammit to detect a document’s encoding and convert it to Unicode.
So from latin-1 to utf-8.
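from bs4 import BeautifulSoup

with open('page_latin.html', 'rb') as fp, open('html_new.html', 'w', encoding='utf-8') as fp_out:
    # Read raw bytes, so BS does the encoding detection itself
    file_out = fp.read()
    # When a file is loaded into BS it will be Unicode
    soup = BeautifulSoup(file_out, 'lxml')
    # The output file is opened with encoding='utf-8', so it is saved as utf-8
    fp_out.write(soup.prettify())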
λ chardetect html_new.html
html_new.html: utf-8 with confidence 0.7525
File used in the test; both files are the same, just with different encoding.
<html lang="en">
  <head>
    <title>Here is site title</title>
  </head>
  <body>
    <h1>Jalapeñod je pèle</h1>
  </body>
</html>
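To try this yourself, here is a small sketch that writes the test page in both encodings and then shows the UnicodeDecodeError mentioned above. It uses only the standard library. The temporary folder and the variable names are just my choices for the example; the file names are the same as in the test.

```python
from pathlib import Path
import tempfile

HTML = """<html lang="en">
  <head>
    <title>Here is site title</title>
  </head>
  <body>
    <h1>Jalapeñod je pèle</h1>
  </body>
</html>
"""

# Scratch folder, so no existing files get overwritten
workdir = Path(tempfile.mkdtemp())
latin_path = workdir / 'page_latin.html'
utf8_path = workdir / 'page_utf8.html'

# Same text, saved with two different encodings
latin_path.write_text(HTML, encoding='latin-1')
utf8_path.write_text(HTML, encoding='utf-8')

# ñ and è take 1 byte each in latin-1 and 2 bytes each in utf-8
latin_size = len(latin_path.read_bytes())
utf8_size = len(utf8_path.read_bytes())
print(latin_size, utf8_size)

# Reading the latin-1 file as utf-8 fails on the first non-ASCII byte
try:
    latin_path.read_text(encoding='utf-8')
    error = None
except UnicodeDecodeError as exc:
    error = exc
print(type(error).__name__)
```

Running chardetect on the two files made this way should give a similar picture as above, though the confidence numbers can vary with the chardet version.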