Strategy for data extraction - Printable Version

Strategy for data extraction - standenman - Feb-22-2024

I am trying to come up with a strategy for extracting key data from generic letters for different clients. This is the format of the letter I want to parse. It should look the same for every client, although there may be minor layout differences. First, I want to extract the addressee of the letter, which is redacted here. Second, I want to extract the name of the client, which appears on the right-hand margin after "Re:". Then there are two items of data I want from the main body of the letter: the time period of the records requested (in the first sentence after the heading "What We Need From You"), and the date in the first sentence of the third paragraph under that heading, e.g. "Please respond by May 26, 2023".

I have wondered about a regex approach, but then wondered whether an NLP tool like spaCy would be better. Thanks for any advice - I really appreciate it!

RE: Strategy for data extraction - carecavoador - Mar-11-2024

I'd use pypdf to read the PDF files and extract the text. If your PDF files are scanned images, like the one you attached in your previous post, you may want to run OCR on them to extract the text, using something like pytesseract.
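
As a rough sketch of the pypdf route: the snippet below reads the embedded text layer of each page and uses a simple length check to decide whether a file is probably a scanned image that needs OCR instead. The helper names, the file name in the usage comment, and the 20-character threshold are assumptions made for illustration, not part of pypdf.

```python
def needs_ocr(text: str, min_chars: int = 20) -> bool:
    """Heuristic (an assumption): almost no extractable text means a scanned PDF."""
    return len(text.strip()) < min_chars


def extract_text_layer(pdf_path: str) -> str:
    """Return the embedded text of every page, joined by newlines."""
    # Imported here so that needs_ocr() can be used without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    # extract_text() yields an empty string for pages that contain only images.
    return "\n".join((page.extract_text() or "") for page in reader.pages)


# Usage (hypothetical file name):
# text = extract_text_layer("letter.pdf")
# if needs_ocr(text):
#     ...  # fall back to OCR, e.g. with pytesseract
```

If `needs_ocr` returns True for your letters, the text layer is missing and OCR is the way to go.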

Once you get the text, obtaining the information you need might be trivial. Have you tried something? Do you have any code to show?

import pytesseract
from pdf2image import convert_from_path


PDF_FILE = r"C:\Users\user\Desktop\MedRequestTemplate_Redacted-min.pdf"

# This is the location of the folder containing the poppler executables
# needed for pdf2image to work.
POPPLER_LOCATION = r"C:\poppler\Library\bin"

# This is the location of the Tesseract-OCR executable.
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


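def generate_texts_from_image_pdf(pdf_path: str, lang: str = "eng") -> str:
    """Performs OCR on the first page of a PDF file and returns its text content."""
    # convert_from_path returns one PIL image per page of the PDF.
    pages = convert_from_path(pdf_path, poppler_path=POPPLER_LOCATION)
    # Only the first page is processed; loop over pages to OCR a multi-page PDF.
    text: str = pytesseract.image_to_string(pages[0], lang=lang)
    return text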

text = generate_texts_from_image_pdf(PDF_FILE)
print(text)

Output:
P O BOX 149198
AUSTIN TX 78714-9198

Date: May 12, 2023
Case [D: gaily

RE:
DOB:

Vendor Number: |

We are the office that makes disability decisions for the Social Security Administration. ie — 7,1 is applying for or is
receiving disability benefits due to the following conditions: Lumbar Disfunction. This is not an authorization to perform an     
examination.

What We Need From You
To help us evaluate this claim, please send records covering the period of: 08/03/2021 to Present.

Include the following information: medical history, psychiatric history, clinical findings, laboratory findings, imaging reports, 
treatment prescribed and the response, diagnosis, and prognosis.

Please respond by May 26, 2023. We are enclosing a signed HIPAA compliant authorization for the release of medical
records and information.

Please provide a statement based on your findings. Your statement should express your opinion about your patient’s ability to     
do work-related physical and/or mental activities despite the limitations or restrictions imposed by his medical condition(s).    
For physical impairments, these activities include sitting, standing, walking, lifting, carrying, pushing, pulling, or other      
physical activities (including manipulative or postural activities, such as reaching, handling, stooping, or crouching); other    
activities, such as seeing, hearing, or using other senses; and ability to adapt to environmenta! conditions, such as temperature 
extremes or fumes, For mental impairments, these activities include understanding; remembering; maintaining concentration,        
persistence, or pace; carrying out instructions; and responding appropriately to supervision, coworkers, and work pressures.      

If it is determined that we need additional information regarding your patient's impairments, would you be willing to perform     
an examination to provide additional findings? Please contact us if you would be willing to perform this examination. We will     
assume that you do not wish to perform the examination if you do not respond.

Tf You Have Any Questions

If you have any questions or wish to provide more information, please call us at the number(s) shown below Monday - Friday        
between 7:00 am and 5:00 pm. When you call or leave a message, please provide the Case [Dy our —_ |

a. and a call back number.

Thank you for your help.

Texas Disability Determination Services/Texas Disability Determination Services
(800) 252-7009
(866) 892-9281 (FAX)

67884 18/ Assigned 0643 U15/ DCPS / DCM61025842 / OMB No, 0960-0555 / 98022133
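
With text like this, plain regular expressions may already be enough for the fields asked about. The following is only a sketch under assumptions: the patterns are guesses based on this one letter, the sample string is a shortened copy of the output above with a made-up client name ("JOHN DOE") in place of the redacted one, and `extract_fields` is a hypothetical helper. The addressee is not handled, since it is redacted in the sample and its position would have to be worked out from real letters.

```python
import re
from datetime import datetime

# Shortened copy of the OCR text; "JOHN DOE" is a made-up placeholder.
SAMPLE = """RE: JOHN DOE
DOB:

What We Need From You
To help us evaluate this claim, please send records covering the period of: 08/03/2021 to Present.

Please respond by May 26, 2023. We are enclosing a signed HIPAA compliant authorization
"""

# [ \t]* (not \s*) keeps the match on the "RE:" line when the name is missing.
CLIENT_RE = re.compile(r"^RE:[ \t]*(.*)$", re.IGNORECASE | re.MULTILINE)
PERIOD_RE = re.compile(
    r"period of:\s*(\d{1,2}/\d{1,2}/\d{4})\s+to\s+(Present|\d{1,2}/\d{1,2}/\d{4})",
    re.IGNORECASE,
)
RESPOND_RE = re.compile(r"Please respond by\s+([A-Za-z]+\s+\d{1,2},\s+\d{4})")


def extract_fields(text: str) -> dict:
    """Pull the client name, record period and response date out of the letter text."""
    fields = {"client": None, "period_start": None, "period_end": None, "respond_by": None}

    match = CLIENT_RE.search(text)
    if match and match.group(1).strip():
        fields["client"] = match.group(1).strip()

    match = PERIOD_RE.search(text)
    if match:
        fields["period_start"], fields["period_end"] = match.group(1), match.group(2)

    match = RESPOND_RE.search(text)
    if match:
        raw = " ".join(match.group(1).split())  # collapse OCR line breaks and extra spaces
        try:
            # %B expects English month names (the default C locale).
            fields["respond_by"] = datetime.strptime(raw, "%B %d, %Y").date()
        except ValueError:
            pass  # OCR misread the month; leave the field empty

    return fields


print(extract_fields(SAMPLE))
```

OCR noise (for example "Tf You Have Any Questions" above) means the patterns should stay tolerant, and any field that comes back as None is worth checking by hand. An NLP tool such as spaCy could still help with the free-form parts, such as the addressee block, if the layout varies more than expected.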