Python Forum
Address Extraction
#1
I am trying to map out a strategy for extracting text data from PDF files. The files are semi-structured and would be created by the Social Security Administration using iText.

I have attached an example file. I want to extract the name and address of the “provider”. Seems not too hard at first, but we are talking pdf files, right?
The problem has two major parts: identification and extraction of the desired text, then the cleaning of the text.

Here are my efforts so far. I was able to identify the bbox coordinates of the “provider and address” and this worked ok for several test pdfs until I encountered one that had the PDF417 bar code in a different location, throwing off the coordinates for the table with “provider and address”.

Next I tried to just find the table with “PROVIDER:” and found that in one test doc “PROVIDER” had its own table! That makes me wonder whether I might in the future find a pdf in which the name and address are in fact split up in different tables as well.

Then I tried converting to a text doc. But the text string I get mixes the address with other text, like: “4716 ALLIANCE BLVD STE 500 AKA: CASE ID: 43 PLANO, TX 75093-5386 CLINIC #: DS/UNIT: 0F08/U70 DS PHONE NUMBER: (800) 252-7009, EXT 666 JOHN DOE, MD”. Now we have non-address elements mixed inappropriately into the text string.

I guess that could be a data cleaning issue, but getting the name and address at the table level gives me a line break between the name, each address line, and the city/state/ZIP line, which I would think would help with the data cleaning.

I am concerned about data cleaning because I have tried several Python modules and found them wanting. For example, in one test the text from the table came out like this:

“PROVIDER:
GRAND PRAIRIE HEALTH CENT
7920 ELMBROOK DR STE 120
TX 75247 DALLAS,”

This does not make sense: the state and ZIP come before the city. When I put this through the Python library usaddress, it labels Dallas as a country. It clearly relies on token position, which is not much help when Acrobat has scrambled the order!
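To make the cleaning problem concrete, here is a minimal sketch of one way to repair that scrambled last line before handing it to an address parser. It assumes the only damage is a "STATE ZIP CITY," ordering on the final line; the function names and the regular expression are my own illustration, not part of usaddress or any other library.

```python
import re

# Matches a scrambled last line such as "TX 75247 DALLAS," (state, ZIP, then city).
# Assumption: two-letter state, 5-digit or ZIP+4 code, upper-case city name.
SCRAMBLED = re.compile(
    r"^(?P<state>[A-Z]{2})\s+(?P<zip>\d{5}(?:-\d{4})?)\s+(?P<city>[A-Z .'-]+?),?$"
)


def fix_city_state_zip(line: str) -> str:
    """Return 'CITY, ST ZIP' if the line looks scrambled, else the stripped line."""
    stripped = line.strip()
    m = SCRAMBLED.match(stripped)
    if not m:
        return stripped
    return f"{m['city'].strip()}, {m['state']} {m['zip']}"


def clean_provider_block(text: str) -> list[str]:
    """Drop the 'PROVIDER:' label and blank lines, then repair each remaining line."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if lines and lines[0].upper().startswith("PROVIDER"):
        lines = lines[1:]
    return [fix_city_state_zip(ln) for ln in lines]


raw = """PROVIDER:
GRAND PRAIRIE HEALTH CENT
7920 ELMBROOK DR STE 120
TX 75247 DALLAS,"""
cleaned = clean_provider_block(raw)
print(cleaned)
```

A rule like this only covers the one scrambling pattern shown above; other orderings would need their own patterns, or the lines could be re-sorted by position before the text is joined.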

Thankful for any tips or advice offered!

Attached Files

.pdf   RedactedMRR.pdf (Size: 93.46 KB / Downloads: 6)
#2
If "PROVIDER" is not always in the exact same place, a fixed bbox won't help.
What module are you using to OCR the PDF?
What does the code look like?
Paul
It is more important to do the right thing, than to do the thing right.(P.Drucker)
Better is the enemy of good. (Montesquieu) = French version for 'kiss'.
#3
I am just using Adobe Acrobat Pro to OCR the file.

#4
I think you need to consider other OCR options.
It is possible to get the x-y coordinates of the word PROVIDER.
But then I would transform the PDF into an image and use tesseract for the OCR.
I remember seeing a post here a few weeks ago, where coordinates of any "word"
can also be found with another module.
I'll see if I can find it.
Paul

Edit: I think it is pdfplumber.
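A rough sketch of the pdfplumber idea, assuming the PDF has a text layer (for example after OCR in Acrobat). `extract_words()` returns one dict per word with keys such as `text`, `x0` and `top`; the helper names, the column width and the tolerances below are assumptions for illustration and would need tuning on real documents.

```python
def lines_below_label(words, label="PROVIDER:", max_lines=4,
                      line_tol=3.0, col_width=250.0):
    """Group the words below `label` into lines, using page position only.

    `words` is a list of dicts with at least 'text', 'x0' and 'top',
    which is the shape pdfplumber's Page.extract_words() produces.
    """
    anchor = next((w for w in words if w["text"].upper() == label), None)
    if anchor is None:
        return []
    left, right = anchor["x0"] - 2, anchor["x0"] + col_width
    below = [w for w in words
             if w["top"] > anchor["top"] + line_tol and left <= w["x0"] <= right]
    below.sort(key=lambda w: (w["top"], w["x0"]))

    lines = []  # each entry: (top of first word in the line, [words])
    for w in below:
        if lines and abs(w["top"] - lines[-1][0]) <= line_tol:
            lines[-1][1].append(w)
        else:
            lines.append((w["top"], [w]))
    return [" ".join(x["text"] for x in sorted(ws, key=lambda x: x["x0"]))
            for _, ws in lines[:max_lines]]


def provider_lines_from_pdf(path, page_number=0):
    """Load one page with pdfplumber (third-party) and apply the helper."""
    import pdfplumber  # imported lazily so the helper above has no dependency
    with pdfplumber.open(path) as pdf:
        words = pdf.pages[page_number].extract_words()
    return lines_below_label(words)


# Synthetic words standing in for extract_words() output.
sample = [
    {"text": "PROVIDER:", "x0": 50, "top": 100},
    {"text": "PRAIRIE", "x0": 90, "top": 115.4},
    {"text": "GRAND", "x0": 50, "top": 115},
    {"text": "CASE", "x0": 400, "top": 115},
    {"text": "TX", "x0": 100, "top": 130},
    {"text": "DALLAS,", "x0": 50, "top": 130},
]
print(lines_below_label(sample))
```

Because the words are ordered by their coordinates rather than by the order in which they come out of the file, this does not depend on which table the label or the address ended up in.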
#5
Is the PDF just a scan, an image PDF? Not actually a PDF with text?

If you are only dealing with images, convert to jpg, crop using PIL Image, then OCR the cropped image.

I presuppose that the Provider part is always roughly where it is in your example PDF. There is sufficient white background around the text to give leeway for various addresses.

Once you have a reliable set of coordinates, just crop every PDF page with those coords. Have to say, the scan could be better!
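The crop-then-OCR route could look roughly like this. It is a sketch that assumes pdf2image, Pillow and pytesseract are installed (plus the poppler and tesseract binaries they rely on), and that the bounding box is given in PDF points measured from the top-left corner of the page. The function names and the example box are made up for illustration.

```python
def points_to_pixels(bbox_pt, dpi=300):
    """Convert an (x0, top, x1, bottom) box from PDF points (72 per inch) to pixels."""
    scale = dpi / 72.0
    return tuple(round(v * scale) for v in bbox_pt)


def ocr_region(pdf_path, bbox_pt, page_number=1, dpi=300):
    """Render one page to an image, crop the box, and OCR only that crop."""
    from pdf2image import convert_from_path  # third-party, needs poppler
    import pytesseract                        # third-party, needs tesseract

    # convert_from_path uses 1-based page numbers and returns PIL images.
    image = convert_from_path(pdf_path, dpi=dpi,
                              first_page=page_number, last_page=page_number)[0]
    cropped = image.crop(points_to_pixels(bbox_pt, dpi))
    # --psm 6 asks tesseract to treat the crop as a single uniform block of text.
    return pytesseract.image_to_string(cropped, config="--psm 6")


# Hypothetical box: 1 inch from the left, 2 inches from the top, 2 x 2 inches in size.
print(points_to_pixels((72, 144, 216, 288), dpi=300))
```

This only holds up while the provider block stays inside the chosen box, so it will fail on documents where the bar code pushes the table elsewhere.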

If you actually have text PDFs, try with fitz.
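For a text PDF, a minimal fitz (PyMuPDF) sketch might search for the label and read the text in a rectangle just below it. `search_for` and `get_text(..., clip=...)` are PyMuPDF calls; the helper names and the 300 x 80 point region are assumptions that would have to be adjusted to the real layout.

```python
def region_below(label_rect, height=80.0, width=300.0):
    """Return an (x0, y0, x1, y1) box starting at the bottom edge of the label."""
    x0, y0, x1, y1 = label_rect
    return (x0, y1, x0 + width, y1 + height)


def provider_text(pdf_path, label="PROVIDER:", page_number=0):
    """Find the label on one page and return the text located under it."""
    import fitz  # PyMuPDF, third-party; imported lazily

    with fitz.open(pdf_path) as doc:
        page = doc[page_number]
        hits = page.search_for(label)  # list of rectangles, one per match
        if not hits:
            return ""
        clip = fitz.Rect(*region_below(tuple(hits[0])))
        return page.get_text("text", clip=clip)


# Label rectangle in points: left=50, top=100, right=120, bottom=112.
print(region_below((50, 100, 120, 112)))
```

Anchoring the rectangle to the label instead of to fixed page coordinates means a shifted table does not matter, as long as the address still sits directly under the label.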
#6
(Apr-08-2024, 04:45 PM)Pedroski55 Wrote: I presuppose that the Provider part is always roughly where it is in your example PDF
Hi,
If you use an OCR module, you can OCR the text, look for PROVIDER in capitals, find its
coordinates, and get the rest of the text.
Should always work.
Paul
#7
Sorry, I am a bit confused. Yes, I can always rely on the text "PROVIDER:" and can always get its coordinates, but the text I need, the provider name and address, will be at different coordinates and even in a different table. How can I rely on that across different documents?


#8
From what you have shown, the word PROVIDER
is followed by 3-4 (5 ?) lines. The lines you need.
If you can rely on that layout, wherever it is,
and if you have the coordinates of "PROVIDER",
you can figure out the other lines.
The difference is that you have been using Adobe Acrobat (= a piece of software),
while pytesseract (the Python wrapper for Tesseract) or pdfplumber are modules that need to be imported into a Python program.
If you use tesseract through pytesseract, the coordinates can be found with either image_to_data or image_to_boxes.
I would go for image_to_data first. It gives you the x/y coordinates of whole words.
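As a sketch of that approach: pytesseract's `image_to_data` (the "to_data" option), called with `output_type=Output.DICT`, returns parallel lists such as `text`, `left`, `top`, `height`, `block_num`, `par_num` and `line_num`. The helper below picks out the lines under the label from such a dict. The function names, the 10 pixel slack and the column width are assumptions, not anything prescribed by pytesseract.

```python
def lines_after_label(data, label="PROVIDER", max_lines=4, col_width=900):
    """Return up to `max_lines` OCR lines found below the label word."""
    texts = data["text"]
    idx = next((i for i, t in enumerate(texts)
                if t.strip().upper().startswith(label)), None)
    if idx is None:
        return []
    a_left = data["left"][idx]
    a_bottom = data["top"][idx] + data["height"][idx]

    lines = {}  # (block, paragraph, line) -> indices of the words on that line
    for i, t in enumerate(texts):
        if not t.strip():
            continue  # tesseract emits empty entries for blocks and lines
        below = data["top"][i] >= a_bottom
        in_column = a_left - 10 <= data["left"][i] <= a_left + col_width
        if below and in_column:
            key = (data["block_num"][i], data["par_num"][i], data["line_num"][i])
            lines.setdefault(key, []).append(i)

    ordered = sorted(lines.values(), key=lambda ids: min(data["top"][i] for i in ids))
    return [" ".join(texts[i] for i in sorted(ids, key=lambda i: data["left"][i]))
            for ids in ordered[:max_lines]]


def ocr_words(image_path):
    """Run tesseract on an image file and return the word-level dict."""
    from PIL import Image   # third-party (Pillow)
    import pytesseract      # third-party, needs the tesseract binary
    return pytesseract.image_to_data(Image.open(image_path),
                                     output_type=pytesseract.Output.DICT)


# Synthetic data in the same shape as the DICT output.
sample = {
    "text":      ["", "PROVIDER:", "GRAND", "PRAIRIE", "CASE", "DALLAS,", "TX"],
    "left":      [0, 100, 100, 220, 1500, 100, 260],
    "top":       [0, 200, 240, 241, 240, 280, 280],
    "height":    [0, 30, 30, 30, 30, 30, 30],
    "block_num": [0, 1, 2, 2, 3, 2, 2],
    "par_num":   [0, 1, 1, 1, 1, 1, 1],
    "line_num":  [0, 1, 1, 1, 1, 2, 2],
}
print(lines_after_label(sample))
```

The column filter is what keeps neighbouring fields such as "CASE ID" out of the result, so `col_width` needs to match the image resolution.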
Paul