Aug-28-2021, 01:34 AM
i think i need to use my own UTF-8 decoder code:
Output:
lt2a/phil /home/phil 126> py tokenize_stdin.py <sfc.py|cut -c1-132|lineup >/dev/null
Traceback (most recent call last):
File "/usr/host/bin/lineup", line 8, in <module>
for arg in argv if argv else stdin:
File "/usr/host/bin/../../lib/python3.6/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 1895: invalid continuation byte
Traceback (most recent call last):
File "tokenize_stdin.py", line 6, in <module>
print(repr(x))
BrokenPipeError: [Errno 32] Broken pipe
lt2a/phil /home/phil 127> py tokenize_stdin.py <sfc.py|lineup >/dev/null
lt2a/phil /home/phil 128>
i suspect that cut (in the first command pipeline) sliced through the middle of a multi-byte UTF-8 character, so the leftover lead byte ended up being decoded right against the newline byte at the end of the now-shorter line. the source file (sfc.py) is pure ASCII, so i am wondering what tokenize.tokenize() put in there that is non-ASCII enough to need multi-byte UTF-8. (the BrokenPipeError from tokenize_stdin.py is presumably just fallout from lineup dying and taking the rest of the pipeline down with it.)

what really annoys me is that an exception has to be raised for this in a way that i can't recover from (by ignoring the character). UTF-8 in general makes things like this difficult, but at least i have already written my own UTF-8 decoder. now i just need to add detection of this kind of error and look for the things that could have caused it other than a bad encoder (bad encoders rarely make it into regular use). for example, if the decode comes up with bad Unicode, check for things like a newline byte that would indicate a bad line cut, and just drop the unfinished UTF-8 sequence.
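a rough sketch of that detection idea, assuming each chunk arrives as a bytes line that got its newline appended after the cut (the function name and that assumption are mine, not from the scripts below):

def decode_cut_line(raw):
    # raw is one line of bytes that may have been truncated mid-character
    # before its newline; drop any unfinished trailing UTF-8 sequence
    body, nl = (raw[:-1], b'\n') if raw.endswith(b'\n') else (raw, b'')
    i = len(body)
    while i and (body[i-1] & 0xC0) == 0x80:   # step back over continuation bytes
        i -= 1
    if i and body[i-1] >= 0xC0:               # a multi-byte lead byte
        need = 2 if body[i-1] < 0xE0 else 3 if body[i-1] < 0xF0 else 4
        if len(body) - (i-1) < need:          # sequence was cut short
            body = body[:i-1]
    return (body + nl).decode('utf-8')

print(decode_cut_line(b'x = "\xc3\n'))        # orphaned 0xc3 is dropped instead of raising

it only catches a sequence chopped off right at the end of a line, which is exactly the failure mode suspected above; anything else still raises.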
people here like to see code, so ...
tokenize_stdin.py:
import tokenize

n = 0
with open(0,'rb') as f:                       # fd 0 is stdin, opened in binary for tokenize
    for x in tokenize.tokenize(f.readline):   # tokenize.tokenize() wants a readline that returns bytes
        print(repr(x))
        n += 1
print(f'ALL DONE with {n} tokens')

lineup.py:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""Line up columns of input text."""
from sys import argv,stdin
argv.pop(0)
size = [] # indexed by col number
rows = [] # indexed by row number
for arg in argv if argv else stdin: # 2 loops in case 1 argument has 2 or more lines
    for line in arg.splitlines():
        cols = line.split()
        rows.append(cols)
        x = len(cols)
        y = len(size)
        if y<x:
            size[:] = size+(x-y)*[1]
        for n in range(x):
            size[n] = max(size[n],len(cols[n]))
for row in rows:
    new = []
    n = 0
    for col in row:
        if col.isdecimal():
            new.append(col.rjust(size[n]))
        else:
            new.append(col.ljust(size[n]))
        n += 1
    print(' '.join(new).rstrip())
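for comparison, a minimal sketch of how a script like lineup could be told to tolerate such bytes with the standard codec instead of a hand-written decoder, by rewrapping stdin with a different error handler (this is a general pattern, not a patch to the lineup code above; the loop body is just a placeholder):

import io, sys

# rewrap stdin so undecodable bytes become U+FFFD instead of raising
# UnicodeDecodeError; errors='ignore' would silently drop them instead
stdin = io.TextIOWrapper(sys.stdin.buffer, encoding='utf-8', errors='replace')
for line in stdin:
    print(line.rstrip())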
Tradition is peer pressure from dead people
What do you call someone who speaks three languages? Trilingual. Two languages? Bilingual. One language? American.