Strategy for data extraction - Python Forum (https://python-forum.io), Forum: Data Science, Thread: /thread-41653.html
Strategy for data extraction - standenman - Feb-22-2024

I am trying to come up with a strategy for extracting key data from generic letters for different clients. This is the format of the letter I want to parse. It should look the same for every client, although there may be minor layout differences.

First, I want to extract the addressee of the letter, which is redacted. Second, I want to extract the name of the client, which is on the right-hand margin after "Re:". Then there are two items of data I want from the main body of the letter: the time period of the records requested (the first sentence after the heading "What We Need From You"), and the date in the first sentence of the third paragraph under that heading, "Please respond by May 26, 2023".

I have wondered about a regex approach, but then wondered whether an NLP tool like spaCy would be better.

Thanks for any advice - I really appreciate it!

RE: Strategy for data extraction - carecavoador - Mar-11-2024

I'd use pypdf to read the PDF files and extract the text. If your PDF files are images, like the one you attached in your previous post, you may want to OCR them to extract the text using something like pytesseract. Once you have the text, obtaining the information you need might be trivial.
Have you tried something? Do you have any code to show?

    import pytesseract
    from pdf2image import convert_from_path

    PDF_FILE = r"C:\Users\user\Desktop\MedRequestTemplate_Redacted-min.pdf"

    # Location of the folder containing the poppler executables,
    # needed for pdf2image to work.
    POPPLER_LOCATION = r"C:\poppler\Library\bin"

    # Location of the Tesseract-OCR executable.
    pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


    def generate_texts_from_image_pdf(pdf_path: str, lang: str = "eng") -> str:
        """Run OCR on the first page of an image-based PDF and return its text."""
        # convert_from_path returns one PIL image per page.
        images = convert_from_path(pdf_path, poppler_path=POPPLER_LOCATION)
        # Only the first page is processed here.
        text: str = pytesseract.image_to_string(images[0], lang=lang)
        return text


    text = generate_texts_from_image_pdf(PDF_FILE)
    print(text)
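For the regex route raised in the question, a minimal sketch of pulling the fields out of the extracted text might look like the following. Everything here is an assumption, since the real letter text is not shown in the thread: SAMPLE_TEXT is made up, the function names are hypothetical, and the patterns assume that the client name follows "Re:" on the same line and that the deadline is written like "May 26, 2023". The addressee is left out because locating it depends on the actual layout.

```python
import re
from datetime import date, datetime
from typing import Optional

# Hypothetical stand-in for the OCR output; the real letter will differ.
SAMPLE_TEXT = """\
Records Department                              Re: Jane Doe

What We Need From You

We need records from January 1, 2021 through the present.
Please send copies, not originals.

You can mail or fax the records.

Please respond by May 26, 2023.
"""


def extract_client(text: str) -> Optional[str]:
    """Return whatever follows 'Re:' on the same line, or None."""
    match = re.search(r"\bRe:[ \t]*(.+)", text)
    return match.group(1).strip() if match else None


def extract_request_sentence(text: str) -> Optional[str]:
    """Return the first sentence after the 'What We Need From You' heading."""
    match = re.search(
        r"What We Need From You\s*(.+?\.)(?=\s|$)", text, re.DOTALL
    )
    # Collapse line breaks and repeated spaces left over from OCR.
    return " ".join(match.group(1).split()) if match else None


def extract_deadline(text: str) -> Optional[date]:
    """Return the date after 'Please respond by', or None if it cannot be parsed."""
    match = re.search(
        r"Please respond by\s+([A-Za-z]+)\s+(\d{1,2}),\s*(\d{4})", text
    )
    if not match:
        return None
    month, day, year = match.groups()
    try:
        return datetime.strptime(f"{month} {day} {year}", "%B %d %Y").date()
    except ValueError:
        # For example, a month name mangled by OCR.
        return None


if __name__ == "__main__":
    print(extract_client(SAMPLE_TEXT))
    print(extract_request_sentence(SAMPLE_TEXT))
    print(extract_deadline(SAMPLE_TEXT))
```

The sentence pattern stops at the first period followed by whitespace, so it would cut a sentence short at an abbreviation such as "Jan. 1". If the letters vary more than expected, that is probably the point where a sentence splitter from an NLP library becomes worth trying.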