Right way to open files with different encodings?

***snippsat*** · (This post was last modified: Apr-23-2024, 05:50 PM by snippsat.)

(Apr-23-2024, 08:49 AM)Winfried Wrote: Some of the files could be Windows (latin1, iso8859-1, cp1252), others could be utf-8.

Since these are .html files, one piece of advice: if you are creating or saving the files yourself, Requests and BS can be used to always save them as utf-8.
If the files already exist then, as Gribouillis mentioned, there is chardet.
For example, I have one .html file that I saved as latin-1 and one that is utf-8.

λ chardetect page_latin.html
page_latin.html: ISO-8859-1 with confidence 0.73

G:\div_code\html_utf
λ chardetect page_utf8.html
page_utf8.html: utf-8 with confidence 0.7525
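
The same check can be done from Python instead of the command line. This is only a small sketch: `detect_encoding` is a name I made up, and it assumes chardet is installed (`pip install chardet`). `chardet.detect()` takes raw bytes and returns a dict that includes `encoding` and `confidence`.

```python
import chardet  # third-party: pip install chardet


def detect_encoding(path):
    """Return chardet's (encoding, confidence) guess for a file.

    detect_encoding is a hypothetical helper name, not part of chardet.
    """
    with open(path, 'rb') as fp:
        raw = fp.read()  # chardet needs the raw bytes, not decoded text
    guess = chardet.detect(raw)
    # 'encoding' can be None when chardet cannot make a guess
    return guess['encoding'], guess['confidence']
```

Calling `detect_encoding('page_latin.html')` gives a guess that can be passed on to `open(..., encoding=...)`. It is a guess, so with low confidence it is worth checking the decoded text by eye.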

from bs4 import BeautifulSoup

with open('page_latin.html', encoding='latin-1') as fp:
    soup = BeautifulSoup(fp, 'lxml')
    h1_tag = soup.find('h1')
    print(h1_tag)

# UTF-8: pass the encoding explicitly, the default depends on the platform
with open('html_new.html', encoding='utf-8') as fp:
    soup = BeautifulSoup(fp, 'lxml')
    h1_tag = soup.find('h1')
    print(h1_tag)

Output:
<h1>Jalapeñod je pèle</h1>
<h1>Jalapeñod je pèle</h1>

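So everything works as it should. If the latin-1 file is decoded as utf-8 instead (for example by removing encoding='latin-1' where utf-8 is the default encoding), it breaks with a UnicodeDecodeError.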
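
That UnicodeDecodeError can also be used as a simple detector when there are many files and no detection library is wanted. The sketch below is not something from the thread, just a common stdlib-only approach: try strict utf-8 first and fall back to a Windows encoding. It assumes every file is either utf-8 or a single-byte Windows encoding, and `read_html_text` and its `fallback` parameter are names I made up.

```python
from pathlib import Path


def read_html_text(path, fallback='cp1252'):
    """Decode a file as utf-8, falling back to a Windows encoding.

    Assumption: each file is either utf-8 or a single-byte Windows
    encoding. Text in another encoding rarely passes strict utf-8
    decoding by accident, so a failure is the signal to fall back.
    Returns (text, encoding_used).
    """
    raw = Path(path).read_bytes()
    try:
        # 'utf-8-sig' also drops a leading BOM if the file has one
        return raw.decode('utf-8-sig'), 'utf-8'
    except UnicodeDecodeError:
        # errors='replace' because cp1252 leaves a few byte values undefined
        return raw.decode(fallback, errors='replace'), fallback
```

The returned text can be given straight to `BeautifulSoup(text, 'lxml')`. Note that this cannot tell latin-1 and cp1252 apart; for ordinary accented letters the two decode the same.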

Converting to utf-8 is also possible, since Beautiful Soup decodes a document to Unicode when it loads it:

Bs4 Doc Wrote: Any HTML or XML document is written in a specific encoding like ASCII or UTF-8.
But when you load that document into Beautiful Soup, you’ll discover it’s been converted to Unicode.
Beautiful Soup uses a sub-library called Unicode, Dammit to detect a document’s encoding and convert it to Unicode.

So, to convert from latin-1 to utf-8:

from bs4 import BeautifulSoup

# Read the raw bytes and write the parsed document back out as utf-8
with open('page_latin.html', 'rb') as fp, open('html_new.html', 'w', encoding='utf-8') as fp_out:
    raw = fp.read()
    # BS detects the encoding and decodes the bytes to Unicode when parsing
    soup = BeautifulSoup(raw, 'lxml')
    fp_out.write(soup.prettify())

λ chardetect html_new.html
html_new.html: utf-8 with confidence 0.7525
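
To see which encoding was actually picked, Unicode, Dammit can also be used on its own, without parsing the HTML. A minimal sketch, assuming the files are either utf-8 or latin-1; `to_unicode` and `candidates` are names I made up, while `UnicodeDammit`, `unicode_markup` and `original_encoding` come from bs4.

```python
from bs4 import UnicodeDammit


def to_unicode(raw, candidates=('utf-8', 'latin-1')):
    """Decode raw bytes, trying the candidate encodings in order.

    to_unicode is a hypothetical helper. The candidate order matters:
    latin-1 accepts any byte sequence, so it has to come last.
    Returns (text, encoding_used).
    """
    dammit = UnicodeDammit(raw, list(candidates))
    return dammit.unicode_markup, dammit.original_encoding
```

With `raw` read from a file opened in 'rb' mode, the first returned value is the decoded text and the second is the encoding that worked.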

File used in the test; both copies are the same, just saved with different encodings.

<html lang="en">
  <head>
    <title>Here is site title</title>
  </head>
  <body>
    <h1>Jalapeñod je pèle</h1>
  </body>
</html>
