Python Forum
Extract Line from PDF
#11
equaliser Wrote: I have this now
What does the program print? Do you see lines with the _akt: pattern?
#12
Lol, I can't see where I can give y'all likes for your posts.

Awesome! Splitting by '\n' does the rest. It works.

Thank you, @Gribouillis and @snippsat! Awesome support from you both. Thank you very much!
#13
(Nov-27-2018, 01:32 PM)equaliser Wrote: Lol, I can't see where I can give y'all likes for your posts.
[Image: arblSH.jpg]
#14
Thank you very much so far. This really helped me out. I have been trying this over the last few days and have found a very mysterious issue.

I have several PDFs that I created myself. The text content is always the same; the only difference is a number. For example:

Text 1
Some text here with a specific line
_akt: x 123419
Some more text as dummy.
That's it.
Text 2
Some text here with a specific line
_akt: x 234519
Some more text as dummy.
That's it.
Now I have a script that uses PyPDF4 to extract the text and print the wanted number after x. I need that number to move the PDF files into a specific directory. So far so good, and it seems very simple. But my problem is:

Sometimes I get the number 123419 printed on one line, which is good. But sometimes I get the number split into two lines like this:

2345
18
Because of this randomness, the script can't run reliably. I should add that the break seems to be there BEFORE I run my split function, so the PDF files themselves seem to be the problem. Does anyone have an idea what could be done?

Here is my code:

import re
import shutil

import PyPDF4

src = r'H:\Scans\TestScan\AktzTest.pdf'
#src = r'H:\Scans\TestScan\AktzTest2.pdf'

# Read the text of the first page; the with block closes the file before it is moved.
with open(src, 'rb') as pdfFileObj:
    pdfReader = PyPDF4.PdfFileReader(pdfFileObj)
    pageObj = pdfReader.getPage(0)
    pages_text = pageObj.extractText()
print(pages_text)

for line in pages_text.split('\n'):
    if re.match(r"_aktz: AZ x ", line):
#        print(line)
        prefix, number = line.split('x ', 1)
        number = number.strip()  # drop stray whitespace around the number
        print(number)
        shutil.move(src, f'H:\\Akten\\TestAkten\\{number}')
        break  # the file has been moved, so stop after the first match
It feels like the PDF file has invisible breaks in the middle of the word. Very confusing. I guess it's because of how the PDF is made. But for this one document it works like a charm.
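
One possible workaround, as a minimal sketch only: since the break is already in the text that extractText() returns, the digits could be rejoined after extraction instead of splitting by '\n' first. The helper name extract_number and its default marker pattern are assumptions for illustration (the marker follows the sample text above, while the script matches "_aktz: AZ x " instead). Note that this would wrongly merge a following line that legitimately starts with digits.

```python
import re


def extract_number(text, marker=r"_akt:\s*x\s*"):
    """Return the digits after the marker, rejoined if a line break splits them.

    Hypothetical helper: the marker pattern is an assumption based on the
    sample text and must be adjusted to the real documents.
    """
    # After the marker, capture runs of digits that are separated only by
    # a line break (optionally surrounded by other whitespace).
    match = re.search(marker + r"(\d+(?:\s*\n\s*\d+)*)", text)
    if match is None:
        return None
    # Remove the whitespace that the extraction put between the digit runs.
    return re.sub(r"\s+", "", match.group(1))


if __name__ == "__main__":
    unbroken = "Some text here with a specific line\n_akt: x 123419\nSome more text\n"
    broken = "Some text here with a specific line\n_akt: x 2345\n19\nSome more text\n"
    print(extract_number(unbroken))
    print(extract_number(broken))
```

The result could then be used in place of the number taken from line.split('x ', 1) in the script above.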

Well, yeah. It's because of the PDF format, so it is not Python related anymore. We can close the thread, I guess :D