Address Extraction

standenman · Apr-06-2024, 03:47 PM

I am trying to map out a strategy for extracting text data from PDF files. The files are semi-structured and are generated by the Social Security Administration using iText.

I have attached an example file. I want to extract the name and address of the “provider”. Seems not too hard at first, but we are talking pdf files, right?
The problem has two major parts: identification and extraction of the desired text, then the cleaning of the text.

Here are my efforts so far. I was able to identify the bbox coordinates of the “provider and address” and this worked ok for several test pdfs until I encountered one that had the PDF417 bar code in a different location, throwing off the coordinates for the table with “provider and address”.

Next I tried to just find the table with “PROVIDER:” and found that in one test doc “PROVIDER” had its own table! That makes me wonder whether I might in the future find a pdf in which the name and address are in fact split up in different tables as well.
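One way to avoid depending on either fixed coordinates or table boundaries is to treat the "PROVIDER:" label itself as an anchor and collect whatever words sit directly below it. Below is a minimal sketch of that idea. It assumes the words have already been extracted as dicts with `text`, `x0` and `top` keys (the shape that pdfplumber's `extract_words()` returns, as far as I know). The function name `lines_below_anchor` and the `col_width`, `y_tol` and `x_tol` values are made up for illustration and would need tuning against real documents.

```python
def lines_below_anchor(words, anchor="PROVIDER:", col_width=250.0,
                       max_lines=4, y_tol=3.0, x_tol=5.0):
    """Return up to max_lines text lines located directly below the anchor word.

    words: list of dicts with "text", "x0" and "top" keys (assumed input shape).
    col_width: assumed width of the provider column, in PDF points.
    """
    anchor_word = next(
        (w for w in words if w["text"].upper().startswith(anchor)), None
    )
    if anchor_word is None:
        return []

    # Keep only words below the anchor that start inside the assumed column.
    left = anchor_word["x0"] - x_tol
    right = anchor_word["x0"] + col_width
    below = [
        w for w in words
        if w["top"] > anchor_word["top"] + y_tol and left <= w["x0"] <= right
    ]

    # Group words into lines: words whose "top" is within y_tol share a line.
    below.sort(key=lambda w: w["top"])
    lines = []
    for w in below:
        if lines and abs(w["top"] - lines[-1][0]["top"]) <= y_tol:
            lines[-1].append(w)
        else:
            lines.append([w])

    # Within each line, order words left to right before joining them.
    return [
        " ".join(w["text"] for w in sorted(line, key=lambda w: w["x0"]))
        for line in lines[:max_lines]
    ]
```

Because this works from word positions relative to the label, it should not matter whether "PROVIDER:" lives in its own table or whether the bar code has pushed the whole block to a different place on the page.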

Then I tried converting to a text doc. But the text string I get mixes the address with other text, like: “4716 ALLIANCE BLVD STE 500 AKA: CASE ID: 43 PLANO, TX 75093-5386 CLINIC #: DS/UNIT: 0F08/U70 DS PHONE NUMBER: (800) 252-7009, EXT 666 JOHN DOE, MD”. Now we have non-address elements inappropriately mixed into the text string.

I guess that could be a data cleaning issue, but getting the name and address at the table level gives me a line break between the name, each address line, and the city/state/ZIP line, which I would think would help with the data cleaning.

I am concerned about data cleaning because I have tried several Python modules and found them wanting. For example, in one test the text from the table came out like this:

“PROVIDER:
GRAND PRAIRIE HEALTH CENT
7920 ELMBROOK DR STE 120
TX 75247 DALLAS,”

The last line does not make sense: the city comes after the state and ZIP. Then I put this through the Python library usaddress and it labels Dallas as a country. It clearly operates on position, which is not much help when Acrobat has scrambled the order!
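One possible workaround for this particular scramble is to normalise the last line before handing it to a position-sensitive parser. Here is a rough sketch, assuming the scrambled line always has the form "STATE ZIP CITY," in upper case; the function name `fix_city_state_zip` is hypothetical, and lines that do not match the pattern are returned unchanged.

```python
import re

# Matches a scrambled line such as "TX 75247 DALLAS," (state, ZIP, then city).
# Assumption: two-letter state, 5-digit or ZIP+4 code, upper-case city name.
SCRAMBLED = re.compile(
    r"(?P<state>[A-Z]{2})\s+"
    r"(?P<zip>\d{5}(?:-\d{4})?)\s+"
    r"(?P<city>[A-Z][A-Z .'-]*?),?"
)

def fix_city_state_zip(line):
    """Reorder 'ST ZIP CITY,' into 'CITY, ST ZIP'; leave other lines alone."""
    stripped = line.strip()
    match = SCRAMBLED.fullmatch(stripped)
    if match is None:
        return stripped
    return f"{match['city']}, {match['state']} {match['zip']}"

table_lines = [
    "GRAND PRAIRIE HEALTH CENT",
    "7920 ELMBROOK DR STE 120",
    "TX 75247 DALLAS,",
]
cleaned = [fix_city_state_zip(line) for line in table_lines]
print(cleaned[-1])  # DALLAS, TX 75247
```

This only handles the one ordering shown above. A line that is already in "CITY, ST ZIP" order does not match the pattern and passes through untouched, so the function should be safe to run on every line before parsing.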

Thankful for any tips or advice offered!
