Python Forum
Trouble with encoded data (I think)
#1
Hi guys!

My script reads a PDF using pytesseract OCR.
My goal is to extract text from various PDF formats, pull certain numbers out of that text, and reuse that data elsewhere.
I tried a few PDF reader modules, but since my PDFs come in several different layouts, they weren't accurate enough.

The main code. I don't fully understand half of what I've done here, but it works.
import io
import pytesseract

from PIL import Image
from wand.image import Image as wi

# Render the PDF to 200 dpi PNG pages with wand (ImageMagick)
pdf = wi(filename="asker2.pdf", resolution=200)
pdfImage = pdf.convert('png')

# Collect each rendered page as a PNG blob
imageBlobs = []

for img in pdfImage.sequence:
    imgPage = wi(image=img)
    imageBlobs.append(imgPage.make_blob('png'))

# OCR each page; pytesseract returns one multi-line string per page
recognized_text = []

for imgBlob in imageBlobs:
    im = Image.open(io.BytesIO(imgBlob))
    text = pytesseract.image_to_string(im, lang='nor', config='--psm 1')
    recognized_text.append(text)
My problems:

The output from pytesseract ends up as a single string, both in the list and in the text object. When I write it to .txt or .csv it looks like several lines, but apparently it isn't(?). When I load the file (or the text object) into a pandas DataFrame, I get one column with the entire extraction inside a single cell, which makes the data hard to sort.
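If the goal is one DataFrame row per OCR line, the page text can be split on newlines before it ever touches a file. A minimal sketch, using a hypothetical stand-in for one page of OCR output:

```python
import pandas as pd

# Hypothetical stand-in for one page of OCR output
text = "Faktura 2019\nSum: 1234\n\nKonto: 5678"

# One row per non-blank line, instead of one cell holding the whole page
lines = [ln for ln in text.splitlines() if ln.strip()]
df = pd.DataFrame({"line": lines})
# df now has 3 rows in a single 'line' column
```

This skips the write-to-disk-and-re-parse round trip entirely, so the csv/encoding issues never come up.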

The pandas error I get when not using encoding="unicode_escape":
pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 83, saw 2
Here's some of the stuff I've tried to work around it:

# Attempting to reduce the amount of data by stripping letters
removed_char = text.translate({ord(i): None for i in 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ/ø()'})


# Writing the output data to a text file, with unicode_escape because pandas can't read my data otherwise
with open('output1.txt', 'w', encoding="unicode_escape", newline='') as f:
    f.write(removed_char)
The removed_char output is either lots of empty rows between each row that contains data, or one long row with \n\n\n\n scattered everywhere.
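Since the end goal is pulling numbers out of the text, a per-line regex may be simpler than deleting letters with translate. A sketch with a made-up sample string (the pattern and sample are assumptions, not from the original post):

```python
import re

# Hypothetical OCR output for one page
text = "Faktura 2019\nBeløp: 1 234,56\n\nKonto 1503.22.12345\n"

# Grab number-like tokens line by line; blank lines simply yield nothing,
# so the stray \n characters never become a problem
numbers = []
for line in text.splitlines():
    numbers.extend(re.findall(r"\d[\d .,]*\d|\d", line))
# numbers == ['2019', '1 234,56', '1503.22.12345']
```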

I tried to remove the \n doing this:
text.replace('\n', '')
 
with open("output1.txt", 'r', newline=None) as fd:
    for line in fd:
        line = line.replace("\n", "")
The \n is still everywhere, or I get huge stretches of empty lines with no data.
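One likely culprit: str.replace returns a new string and leaves the original untouched, because Python strings are immutable. The result has to be assigned back, otherwise the call is a no-op. A small sketch:

```python
raw = "123\n\n456\n"

# str.replace returns a NEW string; calling raw.replace('\n', ' ') on its
# own changes nothing, so assign the result back to keep it
cleaned = raw.replace("\n", " ")

assert raw == "123\n\n456\n"    # original string is unchanged
assert cleaned == "123  456 "   # newlines are gone only in the new string
```

The same applies inside the file-reading loop: `line = line.replace("\n", "")` rebinds `line`, but each cleaned line is then discarded on the next iteration unless it is appended to a list.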

I tried some variations of this using 'rb' and 'wb' but then I get an error saying
TypeError: a bytes-like object is required, not 'str'
And various versions of this one, regardless of what encoding I pass to the reader/writer (apart from unicode_escape):
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xda in position
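The decode error suggests the bytes on disk and the codec used to read them don't match. Writing and reading with the same explicit encoding usually makes unicode_escape unnecessary; a sketch using a temp file (the sample string is hypothetical):

```python
import os
import tempfile

text = "Sum: 1 234 ø"  # hypothetical OCR line with a Norwegian character

# Use the same explicit codec on both sides; mixing text mode with bytes
# ('wb' + str) raises TypeError, and reading non-UTF-8 bytes as UTF-8
# raises a UnicodeDecodeError like the one quoted above
path = os.path.join(tempfile.gettempdir(), "output1.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(text)
with open(path, "r", encoding="utf-8") as f:
    restored = f.read()
```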
Am I missing some obvious solution to this? I've been stuck for about 25 hours of active research.


Messages In This Thread
Trouble with encoded data (I think) - by fishglue - Oct-10-2019, 08:15 PM

