Apr-06-2024, 03:47 PM
I am trying to map out a strategy for extracting text data from PDF files. The files are semi-structured and are created by the Social Security Administration using iText.
I have attached an example file. I want to extract the name and address of the “provider”. Seems not too hard at first, but we are talking pdf files, right?
The problem has two major parts: identification and extraction of the desired text, then the cleaning of the text.
Here are my efforts so far. I was able to identify the bbox coordinates of the “provider and address” and this worked ok for several test pdfs until I encountered one that had the PDF417 bar code in a different location, throwing off the coordinates for the table with “provider and address”.
Next I tried to just find the table with “PROVIDER:” and found that in one test doc “PROVIDER” had its own table! That makes me wonder whether I might in the future find a pdf in which the name and address are in fact split up in different tables as well.
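One way to avoid both the fixed-coordinate problem and the table-boundary problem is to anchor on the “PROVIDER:” label itself and collect whatever sits directly below it. Below is a minimal sketch of that idea, not a tested solution for these files. It assumes the words have already been extracted as dicts with `text`, `x0` and `top` keys (the shape that word-level extractors such as pdfplumber's `extract_words()` produce). The function name `lines_below_label` and the `max_width`, `max_lines` and `line_tol` values are hypothetical and would need tuning against real documents.

```python
def lines_below_label(words, label="PROVIDER:", max_width=250.0,
                      max_lines=4, line_tol=3.0):
    """Return the text lines directly below `label`, limited to a column
    that starts at the label's left edge and is `max_width` units wide."""
    anchor = next((w for w in words if w["text"] == label), None)
    if anchor is None:
        return []

    left = anchor["x0"] - line_tol
    right = anchor["x0"] + max_width

    # Keep only words below the label and inside the column.
    below = [
        w for w in words
        if w["top"] > anchor["top"] + line_tol and left <= w["x0"] <= right
    ]
    below.sort(key=lambda w: (w["top"], w["x0"]))

    # Group words into lines: a new line starts when the vertical
    # position moves more than `line_tol` away from the line's first word.
    lines, current, current_top = [], [], None
    for w in below:
        if current_top is None or abs(w["top"] - current_top) <= line_tol:
            if current_top is None:
                current_top = w["top"]
            current.append(w)
        else:
            lines.append(current)
            current, current_top = [w], w["top"]
    if current:
        lines.append(current)

    # Within each line, order the words left to right.
    return [
        " ".join(w["text"] for w in sorted(line, key=lambda w: w["x0"]))
        for line in lines[:max_lines]
    ]
```

Because the search is relative to the label, a bar code that shifts the whole block up or down should not matter, and it does not matter whether “PROVIDER:” lives in its own table. The column width is what keeps fields such as “CASE ID:” on the right from leaking in.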
Then I tried converting to a text doc. But the text string I get mixes the address with other text like: “4716 ALLIANCE BLVD STE 500 AKA: CASE ID: 43 PLANO, TX 75093-5386 CLINIC #: DS/UNIT: 0F08/U70 DS PHONE NUMBER: (800) 252-7009, EXT 666 JOHN DOE, MD”. Now we have non-address elements inappropriately mixed into the text string.
I guess that could be a data cleaning issue but getting the name and address at the table level gives me a line separation between the name, each address line, and the City State Zip, which I would think would be an aid in the data cleaning.
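To illustrate why the line separation helps, here is a rough sketch of splitting a line-separated block into name, street lines, and a city/state/ZIP line. It assumes the first non-label line is the provider name and that the city line can be recognized by a two-letter state code followed by a ZIP code. The function name `split_provider_block` and the regular expression are illustrative assumptions, not something taken from the files.

```python
import re

# Two uppercase letters followed by a 5-digit or ZIP+4 code, e.g. "TX 75093-5386".
CITY_LINE = re.compile(r"\b[A-Z]{2}\s+\d{5}(?:-\d{4})?\b")

def split_provider_block(lines, label="PROVIDER:"):
    """Split a line-separated block into name, street lines and city line."""
    cleaned = [ln.strip() for ln in lines if ln.strip() and ln.strip() != label]
    if not cleaned:
        return {"name": None, "street": [], "city_state_zip": None}

    name = cleaned[0]
    # Find the city line by its content, not by its position in the block.
    city_idx = next(
        (i for i, ln in enumerate(cleaned) if i > 0 and CITY_LINE.search(ln)),
        None,
    )
    if city_idx is None:
        return {"name": name, "street": cleaned[1:], "city_state_zip": None}
    return {
        "name": name,
        "street": cleaned[1:city_idx],
        "city_state_zip": cleaned[city_idx],
    }
```

With the flat text string there is nothing comparable to split on, which is why extraction at the table or line level looks like the better starting point.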
I am concerned about data cleaning because I have tried several python modules and found them wanting. For example, in one test the text from the table came out like this:
“PROVIDER:
GRAND PRAIRIE HEALTH CENT
7920 ELMBROOK DR STE 120
TX 75247 DALLAS,”
This does not make sense: the city has ended up after the state and ZIP code. I then put this through the Python library usaddress, and it labels Dallas as a country. It clearly relies on position, which is not much help when Acrobat has scrambled the order!
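A possible workaround is to repair the order of the last line before handing it to a position-sensitive parser. The sketch below is only that, a sketch: it assumes the line contains exactly one "state code plus ZIP" pair and treats everything else on the line as the city. The name `normalize_city_state_zip` is made up for illustration, and the pattern does not check that the two letters are a real state abbreviation.

```python
import re

# Two-letter state code followed by a 5-digit or ZIP+4 code.
STATE_ZIP = re.compile(r"\b([A-Z]{2})\s+(\d{5}(?:-\d{4})?)\b")

def normalize_city_state_zip(line):
    """Rewrite a scrambled city/state/ZIP line as 'CITY, ST ZIP'.

    Returns None when no state + ZIP pair (or no city text) is found.
    """
    match = STATE_ZIP.search(line)
    if match is None:
        return None
    state, zip_code = match.groups()

    # Whatever remains once the state + ZIP pair is cut out is taken as the city.
    remainder = line[:match.start()] + " " + line[match.end():]
    city = re.sub(r"\s+", " ", remainder).strip(" ,")
    if not city:
        return None
    return f"{city}, {state} {zip_code}"

print(normalize_city_state_zip("TX 75247 DALLAS,"))
print(normalize_city_state_zip("PLANO, TX 75093-5386"))
```

Lines that are already in the usual order pass through unchanged apart from whitespace, so the function could be applied to every candidate city line before parsing.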
Thankful for any tips or advice offered!
Attached Files
RedactedMRR.pdf (Size: 93.46 KB / Downloads: 6)