Extract Line from PDF

**Gribouillis** · Nov-27-2018, 01:08 PM

equaliser Wrote: I have this now

What does the program print? Do you see lines with the _akt: pattern?

equaliser · Nov-27-2018, 01:32 PM

lol, I can't see where I can give y'all likes for your posts.

Awesome! Splitting by '\n' does the rest. It works.
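
For later readers, a minimal sketch of the approach that worked: split the extracted page text on '\n' and pick the line that starts with the marker. The sample text and the function name `find_number` are made up for illustration; in a real script the text would come from the PDF library.

```python
import re


def find_number(pages_text, marker=r"_akt: x "):
    """Return the text after the marker on the first matching line, or None."""
    for line in pages_text.split('\n'):
        if re.match(marker, line):
            # Split once on 'x ' and keep what follows it
            prefix, number = line.split('x ', 1)
            return number.strip()
    return None


# Hypothetical sample standing in for the extracted page text
sample = "Some text here with a specific line\n_akt: x 123419\nSome more text.\nThats it."
print(find_number(sample))
```

This only works when the number arrives on the same line as the marker.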

Thank you @Gribouillis, @snippsat! Awesome support from you. Thank you very much!

***snippsat*** · Nov-27-2018, 03:42 PM

(Nov-27-2018, 01:32 PM)equaliser Wrote: lol, I can't see where I can give y'all likes for your posts.

equaliser · (This post was last modified: Dec-04-2018, 11:35 AM by equaliser.)

Thank you very much so far. This really helped me out. I have been trying this over the last few days and have found a very mysterious issue.

I have several PDFs that I created myself. The text content is always the same; the only difference is a number. For example:

Text 1

Some text here with a specific line
_akt: x 123419
Some more text a s dummy.
Thats it.

Text 2

Some text here with a specific line
_akt: x 234519
Some more text a s dummy.
Thats it.

Now I have a script which uses PyPDF4 to extract the text and print the wanted number after x. I need that number to move the PDF files into a specific directory. So far so good. It seems to be very simple. But my problem is:

Sometimes I get the number 123419 printed on one line, which is good. But sometimes I get the number split into two lines, like this:

2345
19

Because of this randomness, the script can't run reliably. I have to say that the split seems to happen BEFORE I run my split function, so the PDF files seem to be the problem. Does anyone have an idea what could be done?

Here is my code:

import re
import shutil

import PyPDF4

pdf_path = r'H:\Scans\TestScan\AktzTest.pdf'
#pdf_path = r'H:\Scans\TestScan\AktzTest2.pdf'

# Read the text of the first page, then close the file before moving it
with open(pdf_path, 'rb') as pdfFileObj:
    pdfReader = PyPDF4.PdfFileReader(pdfFileObj)
    pageObj = pdfReader.getPage(0)
    pages_text = pageObj.extractText()
print(pages_text)

for line in pages_text.split('\n'):
    if re.match(r"_aktz: AZ x ", line):
#        print(line)
        prefix, number = line.split('x ', 1)
        number = number.strip()  # drop stray whitespace around the number
        print(number)
        shutil.move(pdf_path, f'H:\\Akten\\TestAkten\\{number}')
        break  # the file has been moved; stop scanning

It feels like the PDF file has invisible breaks in the middle of the number. Very confusing. I guess it's because of how the PDF is made. But for this one document it works like a charm.
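
If the stray breaks cannot be fixed in the PDF itself, one possible workaround is to search the whole extracted text instead of single lines and to let the number contain whitespace. This is only a sketch: it assumes the number consists of digits only and that the text following it does not begin with a digit. The marker follows the sample text above and would need adjusting to the real one; `extract_number` and the sample strings are hypothetical.

```python
import re

# Marker followed by digits that may be interrupted by whitespace or line breaks
NUMBER_RE = re.compile(r"_akt: x\s*(\d(?:\s*\d)*)")


def extract_number(pages_text):
    """Return the digits after the marker with any embedded whitespace removed."""
    match = NUMBER_RE.search(pages_text)
    if match is None:
        return None
    return re.sub(r"\s+", "", match.group(1))


# Hypothetical extracted texts: one intact, one with a break inside the number
whole = "Some text here\n_akt: x 123419\nSome more text"
broken = "Some text here\n_akt: x 2345\n19\nSome more text"
print(extract_number(whole))
print(extract_number(broken))
```

Both calls should yield the full six-digit number, so the destination name no longer depends on where the extractor happened to break the line.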

Well yeah. It's because of the PDF format, so it is not Python related anymore. We can close the thread, I guess :D
