Apr-08-2024, 04:45 PM
Is the PDF just a scan, i.e. an image-only PDF with no actual text layer?
If you are only dealing with images, convert to jpg, crop using PIL Image, then OCR the cropped image.
I am assuming that the Provider part is always roughly where it is in your example PDF. There is enough white background around the text to give some leeway for addresses of different lengths.
Once you have a reliable set of coordinates, just crop every PDF page with those coords. Have to say, the scan could be better!
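Here is a rough sketch of the crop-then-OCR step, assuming each page has already been saved as a JPG. The function names, the file name and the box values are made up for illustration, and pytesseract is just my choice of OCR wrapper (it needs the Tesseract program installed). I store the box as fractions of the page size rather than pixels so the same box still fits if a scan comes in at a different resolution; plain pixel coordinates work too if all your scans are the same size.

```python
def fractional_box(size, frac_box):
    """Turn (left, top, right, bottom) fractions of the page into a pixel box."""
    width, height = size
    left, top, right, bottom = frac_box
    return (
        round(left * width),
        round(top * height),
        round(right * width),
        round(bottom * height),
    )


def ocr_region(image_path, frac_box):
    """Crop one region out of a scanned page image and OCR only that region."""
    # Imported here so the coordinate helper above has no third-party dependencies.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as page:
        # Image.crop takes a (left, upper, right, lower) box in pixels.
        region = page.crop(fractional_box(page.size, frac_box))
        return pytesseract.image_to_string(region)


# Hypothetical box: tune these numbers on your own example page.
PROVIDER_BOX = (0.10, 0.05, 0.60, 0.25)
```

You would then call something like ocr_region("page_1.jpg", PROVIDER_BOX) for every page image.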
If you actually have text PDFs, try fitz (PyMuPDF), which can read the text directly without OCR.
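For the text-PDF case, something along these lines should work: ask each page for the text inside one rectangle only. The function names and the file name are mine, and the rectangle is in PDF points (x0, y0, x1, y1), which you would have to find for your own layout.

```python
def text_in_region(pages, clip):
    """Return the text found inside the clip rectangle on each page."""
    # Each page only needs a PyMuPDF-style get_text("text", clip=...) method.
    return [page.get_text("text", clip=clip).strip() for page in pages]


def provider_text_from_pdf(pdf_path, clip):
    """Open a PDF with PyMuPDF and read the same rectangle on every page."""
    import fitz  # PyMuPDF; imported here so text_in_region works without it

    with fitz.open(pdf_path) as doc:
        return text_in_region(doc, clip)
```

A call would look like provider_text_from_pdf("example.pdf", (50, 60, 300, 160)). If every string comes back empty, the PDF most likely has no text layer and you are back to the crop-and-OCR route.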