Nov-27-2018, 01:08 PM
equaliser Wrote: I have this now
What does the program print? Do you see lines with the _akt: pattern?
Extract Line from PDF
Nov-27-2018, 01:32 PM
lol, I can't see where I can give y'all likes for your posts.
Awesome! Splitting by '\n' does the rest. It works. Thank you @Gribouillis, @snippsat! Awesome support from you. Thank you very much!
Nov-27-2018, 03:42 PM
Thank you very much so far. This really helped me out. I have been trying this over the last few days, and I have found a very mysterious issue.
I have several PDFs that I made myself. The text content is always the same; the only difference is a number. For example:

Text 1
    Some text here with a specific line
    _akt: x 123419
    Some more text as dummy. That's it.

Text 2
    Some text here with a specific line
    _akt: x 234519
    Some more text as dummy. That's it.

Now I have a script that uses PyPDF4 to extract the text and print the wanted number after x. I need that number to move the PDF files into a specific directory. So far so good. It seems to be very simple. But my problem is: sometimes I get the number 123419 printed on one line, which is good. But sometimes I get the number split into two lines, like this:

    2345
    18

Because of this randomness, the script can't run reliably. I have to say that the split seems to happen BEFORE I run my split function, so the PDF files seem to be the problem. Does anyone have an idea what could be done? Here is my code:

    import re
    import shutil
    import PyPDF4

    pdf_path = r'H:\Scans\TestScan\AktzTest.pdf'
    #pdf_path = r'H:\Scans\TestScan\AktzTest2.pdf'

    # Read the text of the first page; the file is closed again before it is moved
    with open(pdf_path, 'rb') as pdfFileObj:
        pdfReader = PyPDF4.PdfFileReader(pdfFileObj)
        pageObj = pdfReader.getPage(0)
        pages_text = pageObj.extractText()
    print(pages_text)

    for line in pages_text.split('\n'):
        if re.match(r"_aktz: AZ x ", line):
            # print(line)
            prefix, number = line.split('x ', 1)
            print(number)
            shutil.move(pdf_path, f'H:\\Akten\\TestAkten\\{number}')

It feels like the PDF file has invisible breaks in the middle of the word. Very confusing. I guess it's because of how the PDF is made. But for this one document it works like a charm. Well, yeah, it's because of the PDF format, so it is not Python-related anymore. We can close the thread, I guess :D
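
For anyone who lands here with the same split-number problem: one possible workaround is to search the whole extracted text instead of going line by line, and to allow whitespace (including line breaks) between the digits. This is only a sketch under assumptions: the marker `_akt: x` and the six-digit length are taken from the sample text in the post, and `extract_number` is a hypothetical helper name, not part of PyPDF4.

```python
import re

# Assumption: the marker looks like "_akt: x" and the number always has 6 digits,
# as in the sample text. Whitespace (including '\n') may appear between digits.
MARKER = re.compile(r"_akt:\s*x\s*((?:\d\s*){6})")


def extract_number(text):
    """Return the 6-digit number after the marker, or None if it is not found."""
    match = MARKER.search(text)
    if match is None:
        return None
    # Drop any spaces or line breaks that the PDF extraction put inside the number
    return re.sub(r"\s+", "", match.group(1))
```

The string returned by `extractText()` would be passed in directly, without splitting it by '\n' first. If the real numbers vary in length, the fixed `{6}` would need to be replaced by a rule that fits those files.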