Python Forum
utf-8 in python3
#1
does anyone understand what is going on here:

https://stackoverflow.com/questions/2736...t-encode-s

i am getting error messages about "surrogates not allowed" on a few files whose names contain valid Unicode characters, correctly encoded in UTF-8. there are no surrogate code points (U+D800..U+DFFF) in these file names. this is happening in the huge "aws" command that implements "aws s3 sync" among other things. what i am curious about is what kinds of common coding errors around the Unicode/UTF-8 facilities in python3 could produce these surrogate codes. how did the Stack Overflow poster end up with those codes in a non-AWS script? what should i do to fix that "aws" command/script?
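For anyone following along, here is a minimal sketch of one way such surrogates can appear without any being present on disk. It assumes the file name bytes get decoded with an ASCII codec and the "surrogateescape" error handler, which is what Python 3 does for file names when it believes the filesystem encoding is ASCII. The file name and the try_utf8 helper are made up for illustration; nothing here is taken from the aws code.

```python
# Hypothetical file name, stored on disk as UTF-8 bytes (n-tilde is C3 B1).
raw_name = "Diez_\u00f1.jpg".encode("utf-8")

# Decoding as ASCII with surrogateescape maps every undecodable byte 0xNN
# to the lone surrogate U+DCNN instead of raising an error.
escaped = raw_name.decode("ascii", errors="surrogateescape")


def try_utf8(text):
    """Return the UTF-8 bytes, or the error reason if encoding fails."""
    try:
        return text.encode("utf-8")
    except UnicodeEncodeError as exc:
        return exc.reason


# ascii() keeps the output printable even on an ASCII-only terminal.
print(ascii(escaped))
print(try_utf8(escaped))

# Encoding with the same codec and handler restores the original bytes.
round_trip = escaped.encode("ascii", errors="surrogateescape")
print(round_trip == raw_name)
```

The str is harmless as long as it goes back to the OS through the same handler. The error only shows up when some code encodes it as strict UTF-8, for example to print it or to send it over the network.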
Tradition is peer pressure from dead people

What do you call someone who speaks three languages? Trilingual. Two languages? Bilingual. One language? American.
#2
You can work around it by passing the path to os.walk() as bytes.
One person described this in your link.

While you are working with these paths, don't change them; they should stay represented as bytes.
pathlib would give you a nicer API, but note that it only accepts str paths, not bytes.
When you want to print them, use bytes.decode('utf-8', 'ignore').
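A rough sketch of that approach, assuming you only need a printable form of each name. The function name names_for_display is made up for this example, and it uses the "replace" error handler rather than "ignore" so that undecodable bytes stay visible as U+FFFD instead of disappearing.

```python
import os


def names_for_display(root: bytes) -> list:
    """Walk root as bytes and return a printable str for every file path."""
    shown = []
    # Because root is bytes, os.walk() yields bytes for dirpath and names,
    # so no file name is ever decoded behind your back.
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            full_path = os.path.join(dirpath, filename)  # still bytes
            # Decode only for display; keep using full_path for file access.
            shown.append(full_path.decode("utf-8", "replace"))
    return shown
```

Calling names_for_display(b".") lists the current directory tree. Always pass the bytes path, never the decoded string, to open() or other os functions.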

Maybe you can post some examples, as raw bytes, of names that cause this problem.
Almost dead, but too lazy to die: https://sourceserver.info
All humans together. We don't need politicians!
#3
here are examples of running "aws s3 sync" to upload files that have Unicode in UTF-8 format in their names. the "ls" command intentionally mangles them for historical reasons. the "echo" command does not do that. the output of "echo" with each name is piped to my "xd16" command to show the bytes in hexadecimal. the script cats itself at the end. see the file with "Cielo_estrellado_by_Eduardo_Diez" in its name. notice how the "aws s3 sync" command ends up with the character \udcc3. it should have gotten \xf1, a.k.a. \u00f1 (U+00F1, ñ), which is the character that the UTF-8 bytes \xc3 \xb1 decode to.
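The \udcc3 looks like the byte \xc3 wrapped by the "surrogateescape" handler, which would suggest the name was decoded with something other than UTF-8. This is a guess; the log does not confirm it. One possible cause is a C/POSIX locale in the environment that runs the command. A small sketch for checking what Python thinks the filesystem encoding is, and for turning such a string back into the intended text. The helper name unescape_name and the sample string are hypothetical.

```python
import sys


def unescape_name(name: str) -> str:
    """Turn surrogate-escaped bytes back into text, assuming UTF-8 on disk."""
    # surrogateescape converts each U+DCNN back into the raw byte 0xNN.
    raw = name.encode("utf-8", errors="surrogateescape")
    # Raises UnicodeDecodeError if the bytes are not valid UTF-8 after all.
    return raw.decode("utf-8")


# Prints something like 'utf-8' or 'ascii' depending on the environment.
print(sys.getfilesystemencoding())

mangled = "ma\udcc3\udcb1ana"  # hypothetical name as the tool might see it
fixed = unescape_name(mangled)
print(ascii(fixed))
```

If the first print shows ascii in the environment where "aws s3 sync" runs, setting a UTF-8 locale there (for example through LANG or LC_ALL) is worth trying before touching the script itself.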

the output is logged in file http://ipal.net/free/20180518-023525-016...-files.log