Python Forum
Address Extraction
#5
Is the PDF just a scan, i.e. an image-only PDF with no actual text layer?

If you are only dealing with images, convert each page to JPG, crop it with PIL's Image.crop(), then OCR the cropped image.

I am assuming that the Provider part is always roughly where it is in your example PDF. There is enough white background around the text to give leeway for addresses of various lengths.

Once you have a reliable set of coordinates, just crop every page with those same coords. Have to say, the scan could be better!
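Roughly what I mean by crop-then-OCR, as a minimal sketch. Assumptions on my part: the pages are already saved as image files, pytesseract (plus the Tesseract binary) is installed for the OCR step, and the box is given as fractions of the page so it survives scans at different resolutions. The names fractional_box, ocr_region and the PROVIDER_BOX values are made up for illustration; you would have to measure the real box on your own scans.

```python
def fractional_box(width, height, frac_box):
    """Convert a (left, top, right, bottom) box given as fractions of the
    page size into pixel coordinates for an image of width x height."""
    left, top, right, bottom = frac_box
    return (
        round(left * width),
        round(top * height),
        round(right * width),
        round(bottom * height),
    )


def ocr_region(image_path, frac_box):
    """Crop one region out of a page image and return the OCR text."""
    # Imported here so fractional_box() works without these packages.
    from PIL import Image
    import pytesseract  # assumption: Tesseract itself is installed too

    with Image.open(image_path) as page:
        box = fractional_box(page.width, page.height, frac_box)
        region = page.crop(box)  # PIL expects (left, upper, right, lower)
        return pytesseract.image_to_string(region)


# Hypothetical location of the Provider block; measure it on your own scan.
PROVIDER_BOX = (0.05, 0.10, 0.55, 0.30)

if __name__ == "__main__":
    import sys

    # Usage: python crop_ocr.py page1.jpg page2.jpg ...
    for path in sys.argv[1:]:
        print(ocr_region(path, PROVIDER_BOX))
```

If every scan has the same pixel size you can skip the fractions and pass fixed pixel coords straight to crop().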

If you actually have text PDFs, try fitz (PyMuPDF), which can read the text directly without OCR.
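To find out which kind of PDF you have, something like this sketch should do. It assumes PyMuPDF is installed (imported as fitz); has_text_layer and extract_page_texts are just names I picked. A PDF whose pages all come back empty is most likely a pure scan and needs the OCR route.

```python
def has_text_layer(page_texts):
    """Return True if any page yielded non-whitespace text."""
    return any(text.strip() for text in page_texts)


def extract_page_texts(pdf_path):
    """Return the plain text of every page in the PDF, one string per page."""
    import fitz  # PyMuPDF; imported here so has_text_layer() works without it

    with fitz.open(pdf_path) as doc:
        return [page.get_text() for page in doc]


if __name__ == "__main__":
    import sys

    # Usage: python check_pdf.py some_file.pdf ...
    for path in sys.argv[1:]:
        texts = extract_page_texts(path)
        if has_text_layer(texts):
            print(path, "has real text, no OCR needed")
        else:
            print(path, "looks like a scan, use the crop + OCR route")
```

For text PDFs, get_text() also takes a clip rectangle (a fitz.Rect), so the same fixed-coordinates idea works there too.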


Messages In This Thread
Address Extraction - by standenman - Apr-06-2024, 03:47 PM
RE: Address Extraction - by DPaul - Apr-07-2024, 09:36 AM
RE: Address Extraction - by standenman - Apr-07-2024, 12:43 PM
RE: Address Extraction - by DPaul - Apr-07-2024, 05:20 PM
RE: Address Extraction - by Pedroski55 - Apr-08-2024, 04:45 PM
RE: Address Extraction - by DPaul - Apr-08-2024, 05:32 PM
RE: Address Extraction - by standenman - Apr-10-2024, 04:00 PM
RE: Address Extraction - by DPaul - Apr-10-2024, 05:22 PM
