Python Forum
Address Extraction
I am trying to map out a strategy for extracting text data from PDF files. The files are semi-structured and are created by the Social Security Administration using iText.

I have attached an example file. I want to extract the name and address of the “provider”. It seems not too hard at first, but we are talking about PDF files, right?
The problem has two major parts: identifying and extracting the desired text, then cleaning that text.

Here are my efforts so far. I was able to identify the bbox coordinates of the “provider and address” table, and this worked OK for several test PDFs until I encountered one that had the PDF417 barcode in a different location, which threw off the coordinates of that table.
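
One way to make the coordinate approach less brittle might be to measure from the “PROVIDER:” label itself instead of from the page, so a block that moves (for instance when the barcode sits elsewhere) is still found. Below is a minimal sketch, assuming the words have already been extracted with their positions as dictionaries with "text", "x0" and "top" keys (pdfplumber's extract_words() returns dictionaries that include these keys). The function name lines_below_label, the tolerances, and the sample coordinates are hypothetical and only for illustration.

```python
def lines_below_label(words, label="PROVIDER:", max_lines=3,
                      max_width=200.0, y_tol=3.0):
    """Return up to `max_lines` text lines found directly under `label`.

    `words` is a list of dicts with "text", "x0" and "top" keys (assumed shape).
    """
    anchor = next((w for w in words if w["text"].upper() == label), None)
    if anchor is None:
        return []
    # Only keep words in a column that starts at the label's left edge.
    left, right = anchor["x0"] - 2.0, anchor["x0"] + max_width
    below = sorted(
        (w for w in words
         if w["top"] > anchor["top"] + y_tol and left <= w["x0"] < right),
        key=lambda w: w["top"],
    )
    # Group words whose vertical positions are within y_tol into one line.
    lines = []
    for w in below:
        if lines and abs(w["top"] - lines[-1][0]["top"]) <= y_tol:
            lines[-1].append(w)
        else:
            lines.append([w])
    return [" ".join(w["text"] for w in sorted(line, key=lambda w: w["x0"]))
            for line in lines[:max_lines]]


# Made-up coordinates; the text is the example from this post.
sample_words = [
    {"text": "PROVIDER:", "x0": 40.0, "top": 100.0},
    {"text": "CASE", "x0": 300.0, "top": 100.0},
    {"text": "ID:", "x0": 330.0, "top": 100.0},
    {"text": "JOHN", "x0": 40.0, "top": 114.0},
    {"text": "DOE,", "x0": 70.0, "top": 114.5},
    {"text": "MD", "x0": 100.0, "top": 114.0},
    {"text": "4716", "x0": 40.0, "top": 128.0},
    {"text": "ALLIANCE", "x0": 70.0, "top": 128.0},
    {"text": "BLVD", "x0": 125.0, "top": 128.0},
    {"text": "STE", "x0": 155.0, "top": 128.0},
    {"text": "500", "x0": 180.0, "top": 128.0},
    {"text": "CLINIC", "x0": 300.0, "top": 128.0},
    {"text": "PLANO,", "x0": 40.0, "top": 142.0},
    {"text": "TX", "x0": 85.0, "top": 142.0},
    {"text": "75093-5386", "x0": 105.0, "top": 142.0},
]

# Simulate the whole block being moved to a different place on the page.
shifted_words = [dict(w, x0=w["x0"] + 60.0, top=w["top"] + 150.0)
                 for w in sample_words]

print(lines_below_label(sample_words))
print(lines_below_label(shifted_words))
```

Because everything is measured relative to the label, both calls should give the same three lines. The max_width cut-off is what keeps the neighbouring “CASE ID:” and “CLINIC #:” columns out, and it would need tuning against real files.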

Next I tried to just find the table containing “PROVIDER:” and found that in one test document “PROVIDER” had its own table! That makes me wonder whether I might someday find a PDF in which the name and address are themselves split across different tables.
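
To cope with the label sometimes sitting in its own table, one option might be to flatten all tables into a single ordered list of cells and take whatever follows the label, regardless of table boundaries. This is only a sketch, assuming the tables arrive as nested lists (tables, then rows, then cell strings or None, which is the shape pdfplumber's extract_tables() returns). The names cell_lines and provider_block are hypothetical.

```python
def cell_lines(cell):
    """Split one table cell into stripped, non-empty text lines."""
    return [part.strip() for part in (cell or "").splitlines() if part.strip()]


def provider_block(tables, label="PROVIDER:"):
    """Return the lines that follow `label`, even across a table boundary."""
    # Flatten every non-empty cell, keeping the original reading order.
    cells = [cell_lines(cell)
             for table in tables for row in table for cell in row]
    cells = [lines for lines in cells if lines]
    for pos, lines in enumerate(cells):
        if lines[0].upper().startswith(label):
            if len(lines) > 1:
                # Label and address share one cell.
                return lines[1:]
            if pos + 1 < len(cells):
                # Label stands alone: use the next non-empty cell,
                # which may belong to the next table.
                return cells[pos + 1]
    return []


address = "GRAND PRAIRIE HEALTH CENT\n7920 ELMBROOK DR STE 120\nDALLAS, TX 75247"
one_table = [[["PROVIDER:\n" + address]]]
split_tables = [[["PROVIDER:"]], [[address]]]

print(provider_block(one_table))
print(provider_block(split_tables))
```

This does not handle a name and address that are themselves spread over several cells; that case would need a rule for when to stop collecting lines.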

Then I tried converting to a text document. But the text string I get mixes the address with other text, like: “4716 ALLIANCE BLVD STE 500 AKA: CASE ID: 43 PLANO, TX 75093-5386 CLINIC #: DS/UNIT: 0F08/U70 DS PHONE NUMBER: (800) 252-7009, EXT 666 JOHN DOE, MD”. Now we have non-address elements mixed inappropriately into the text string.

I guess that could be treated as a data cleaning issue, but getting the name and address at the table level gives me line breaks between the name, each address line, and the city/state/ZIP line, which I would think would help with the data cleaning.

I am concerned about data cleaning because I have tried several Python modules and found them wanting. For example, in one test the text from the table came out like this:

“PROVIDER:
GRAND PRAIRIE HEALTH CENT
7920 ELMBROOK DR STE 120
TX 75247 DALLAS,”

This does not make sense: the state and ZIP come before the city. When I put this through the Python library usaddress, it labeled Dallas as a country. It apparently relies on token position, which is not much help when Acrobat has scrambled the order!
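
For this particular scramble, a small pre-cleaning step before the address parser might be enough. The sketch below assumes the only damage is a last line in “STATE ZIP CITY,” order and rewrites it as “CITY, STATE ZIP”; the pattern and the name fix_city_line are hypothetical and would need checking against more samples.

```python
import re

# Matches lines such as "TX 75247 DALLAS," (state, ZIP, then city).
SCRAMBLED = re.compile(
    r"^(?P<state>[A-Z]{2})\s+"
    r"(?P<zip>\d{5}(?:-\d{4})?)\s+"
    r"(?P<city>[A-Z][A-Z .'-]*?),?$"
)


def fix_city_line(line):
    """Reorder 'ST ZIP CITY,' into 'CITY, ST ZIP'; leave other lines alone."""
    stripped = line.strip()
    match = SCRAMBLED.match(stripped)
    if match is None:
        return stripped
    return f"{match['city'].strip()}, {match['state']} {match['zip']}"


raw_lines = [
    "GRAND PRAIRIE HEALTH CENT",
    "7920 ELMBROOK DR STE 120",
    "TX 75247 DALLAS,",
]
cleaned = [fix_city_line(line) for line in raw_lines]
print(cleaned)
```

Lines that already read “CITY, STATE ZIP” do not match the pattern and pass through unchanged, so the repaired lines could then be handed to a parser such as usaddress in the order it expects.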

Thankful for any tips or advice offered!

Attached Files

.pdf   RedactedMRR.pdf (Size: 93.46 KB / Downloads: 6)

Messages In This Thread
Address Extraction - by standenman - Apr-06-2024, 03:47 PM
RE: Address Extraction - by DPaul - Apr-07-2024, 09:36 AM
RE: Address Extraction - by standenman - Apr-07-2024, 12:43 PM
RE: Address Extraction - by DPaul - Apr-07-2024, 05:20 PM
RE: Address Extraction - by Pedroski55 - Apr-08-2024, 04:45 PM
RE: Address Extraction - by DPaul - Apr-08-2024, 05:32 PM
RE: Address Extraction - by standenman - Apr-10-2024, 04:00 PM
RE: Address Extraction - by DPaul - Apr-10-2024, 05:22 PM
