Python Forum
Trouble with encoded data (I think)
#1
Hi guys!

My script reads a PDF using pytesseract OCR.
My goal is to extract text from various PDF formats, pull certain numbers out of that text, and reuse that data elsewhere.
I tried a few PDF reader modules, but since my PDFs come in several different layouts, they weren't accurate enough.

The main code. I don't fully understand half of what I've done here, but it works.
import io
import pytesseract

from PIL import Image
from wand.image import Image as wi

# Render the PDF to 200 dpi PNG pages with wand (ImageMagick)
pdf = wi(filename="asker2.pdf", resolution=200)
pdfImage = pdf.convert('png')

# Collect each rendered page as a PNG blob
imageBlobs = []

for img in pdfImage.sequence:
    imgPage = wi(image=img)
    imageBlobs.append(imgPage.make_blob('png'))

# OCR each page; pytesseract returns one multi-line string per page
recognized_text = []

for imgBlob in imageBlobs:
    im = Image.open(io.BytesIO(imgBlob))
    text = pytesseract.image_to_string(im, lang='nor', config='--psm 1')
    recognized_text.append(text)
My problems:

The output from pytesseract ends up as a single string, both in the list and in the text object. When I write it to .txt or .csv it looks like several lines, but apparently it isn't(?). When I load the file (or the text object) into a pandas DataFrame, I get one column with the entire extraction inside a single cell, which makes the data hard to sort.
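If the goal is one DataFrame row per OCR line, the page text can be split on newlines before it ever touches a file. A minimal sketch, using a hypothetical stand-in for one page of OCR output:

```python
import pandas as pd

# Hypothetical stand-in for one page of OCR output
text = "Faktura 2019\nSum: 1234\n\nKonto: 5678"

# One row per non-blank line, instead of one cell holding the whole page
lines = [ln for ln in text.splitlines() if ln.strip()]
df = pd.DataFrame({"line": lines})
# df now has 3 rows in a single 'line' column
```

This skips the write-to-disk-and-re-parse round trip entirely, so the csv/encoding issues never come up.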

The pandas error I get when not using encoding="unicode_escape":
pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 83, saw 2
Here's some of the stuff I've tried to work around it:

# Attempting to reduce the amount of data by stripping letters
removed_char = text.translate({ord(i): None for i in 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ/ø()'})


# Writing the output data to a text file, with unicode_escape because pandas can't read my data otherwise
with open('output1.txt', 'w', encoding="unicode_escape", newline='') as f:
    f.write(removed_char)
The removed_char output is either lots of empty rows between each row that contains data, or one long row with \n\n\n\n scattered everywhere.
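Since the end goal is pulling numbers out of the text, a per-line regex may be simpler than deleting letters with translate. A sketch with a made-up sample string (the pattern and sample are assumptions, not from the original post):

```python
import re

# Hypothetical OCR output for one page
text = "Faktura 2019\nBeløp: 1 234,56\n\nKonto 1503.22.12345\n"

# Grab number-like tokens line by line; blank lines simply yield nothing,
# so the stray \n characters never become a problem
numbers = []
for line in text.splitlines():
    numbers.extend(re.findall(r"\d[\d .,]*\d|\d", line))
# numbers == ['2019', '1 234,56', '1503.22.12345']
```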

I tried to remove the \n doing this:
text.replace('\n', '')
 
with open("output1.txt", 'r', newline=None) as fd:
    for line in fd:
        line = line.replace("\n", "")
The \n is still everywhere, or I get huge stretches of empty lines with no data.
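One likely culprit: str.replace returns a new string and leaves the original untouched, because Python strings are immutable. The result has to be assigned back, otherwise the call is a no-op. A small sketch:

```python
raw = "123\n\n456\n"

# str.replace returns a NEW string; calling raw.replace('\n', ' ') on its
# own changes nothing, so assign the result back to keep it
cleaned = raw.replace("\n", " ")

assert raw == "123\n\n456\n"    # original string is unchanged
assert cleaned == "123  456 "   # newlines are gone only in the new string
```

The same applies inside the file-reading loop: `line = line.replace("\n", "")` rebinds `line`, but each cleaned line is then discarded on the next iteration unless it is appended to a list.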

I tried some variations of this using 'rb' and 'wb' but then I get an error saying
TypeError: a bytes-like object is required, not 'str'
And various versions of this one, regardless of what encoding I pass to the reader/writer (apart from unicode_escape):
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xda in position
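The decode error suggests the bytes on disk and the codec used to read them don't match. Writing and reading with the same explicit encoding usually makes unicode_escape unnecessary; a sketch using a temp file (the sample string is hypothetical):

```python
import os
import tempfile

text = "Sum: 1 234 ø"  # hypothetical OCR line with a Norwegian character

# Use the same explicit codec on both sides; mixing text mode with bytes
# ('wb' + str) raises TypeError, and reading non-UTF-8 bytes as UTF-8
# raises a UnicodeDecodeError like the one quoted above
path = os.path.join(tempfile.gettempdir(), "output1.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(text)
with open(path, "r", encoding="utf-8") as f:
    restored = f.read()
```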
Am I missing some obvious solution to this? I've been stuck for about 25 hours of active research.


Messages In This Thread
Trouble with encoded data (I think) - by fishglue - Oct-10-2019, 08:15 PM

