Python Forum
Split PDF at regex value
#1
I am trying to write code that splits a PDF into multiple files based on a regex value in the PDF text. Specifically, I want to split this PDF into discrete PDFs that each represent a patient visit. In my test PDF, the office visit date is styled "Visit: ##/##/####". As the code iterates through the pages, I only want a split where that office visit date changes, and I want each newly created PDF file to be named with the date of the visit. Here is my code and my errors:

import re
from PyPDF2 import PdfReader, PdfWriter

def split_pdf_by_date(pdf_path, regex_pattern):
    # Open the PDF file
    pdf = PdfReader(pdf_path)

    # Initialize variables
    current_date = None
    output = None

    # Iterate through each page in the PDF
    for page_num in range(len(pdf.pages)):
        # Extract the text from the current page
        page = pdf.pages[page_num]
        text = page.extract_text()

        # Find the date in the text using regex
        date_match = re.search(regex_pattern, text)

        if date_match:
            # Get the date value
            date = date_match.group()

            if current_date is None or date != current_date:
                # Start a new output PDF if the date has changed
                if output:
                    output_path = f"output_{current_date}.pdf"
                    with open(output_path, "wb") as output_file:
                        output.write(output_file)

                # Update the current date and create a new PDF writer
                current_date = date
                output = PdfWriter()

        if output:
            # Add the current page to the output PDF
            output.add_page(page)

    # Save the last output PDF
    if output:
        output_path = f"output_{current_date}.pdf"
        with open(output_path, "wb") as output_file:
            output.write(output_file)

        print("PDF split completed successfully.")
        print(output_path)  # Print the output path

# Example usage
pdf_path = "Test.pdf"
date_regex = r"Visit: \d{2}/\d{2}/\d{4}"

split_pdf_by_date(pdf_path, date_regex)
Error:
unknown widths : [0, IndirectObject(3121, 0, 2157813271952)]
unknown widths : [0, IndirectObject(3115, 0, 2157813271952)]
unknown widths : [0, IndirectObject(3110, 0, 2157813271952)]
unknown widths : [0, IndirectObject(3104, 0, 2157813271952)]
unknown widths : [0, IndirectObject(3099, 0, 2157813271952)]
unknown widths : [0, IndirectObject(3051, 0, 2157813271952)]
unknown widths : [0, IndirectObject(3034, 0, 2157813271952)]
Traceback (most recent call last):
  File "c:\Users\stand\venv\import PyPDF2.py", line 54, in <module>
    split_pdf_by_date(pdf_path, date_regex)
  File "c:\Users\stand\venv\import PyPDF2.py", line 30, in split_pdf_by_date
    with open(output_path, "wb") as output_file:
         ^^^^^^^^^^^^^^^^^^^^^^^
OSError: [Errno 22] Invalid argument: 'output_Visit: 03/23/2023.pdf'
I can see that the first office visit in the target PDF, 03/23/2023, gets found, but that is about it!
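
A note on the traceback: the OSError appears to come from the file name rather than from the split logic. `date_match.group()` returns the whole match, `Visit: 03/23/2023`, and Windows does not allow `:` or `/` in file names. Below is a minimal sketch of one possible fix. It assumes a capture group is added around the date and uses two hypothetical helpers, `visit_date_to_filename` and `group_pages_by_visit`, which are not part of PyPDF2. It works on plain page text so it can be tried without a PDF.

```python
import re

# Assumption: same pattern as in the post, with a capture group added
# around the date so the "Visit: " prefix stays out of the file name.
DATE_REGEX = re.compile(r"Visit: (\d{2}/\d{2}/\d{4})")


def visit_date_to_filename(text):
    """Return a name like 'output_03-23-2023.pdf', or None if no date is found.

    Hypothetical helper for illustration; not part of PyPDF2.
    """
    match = DATE_REGEX.search(text or "")
    if match is None:
        return None
    # '/' is a path separator and ':' is reserved on Windows,
    # so keep only the date and swap the slashes for dashes.
    safe_date = match.group(1).replace("/", "-")
    return f"output_{safe_date}.pdf"


def group_pages_by_visit(page_texts):
    """Group page indices under the file name of the visit they belong to.

    Like the original loop, a new group starts whenever the date changes,
    pages without a date stay with the current visit, and pages before
    the first date are skipped.
    """
    groups = []
    current_name = None
    for index, text in enumerate(page_texts):
        name = visit_date_to_filename(text)
        if name is not None and name != current_name:
            current_name = name
            groups.append((name, []))
        if groups:
            groups[-1][1].append(index)
    return groups


# Made-up page texts standing in for page.extract_text() results
pages = ["Visit: 03/23/2023 intake", "lab results", "Visit: 04/02/2023 follow-up"]
for name, indices in group_pages_by_visit(pages):
    print(name, indices)
```

In the original function, the same idea would mean building `output_path` from the sanitized date instead of from `date_match.group()`; the page indices in each group could then be passed to `output.add_page(pdf.pages[i])` as before.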


Messages In This Thread
Split PDF at regex value - by standenman - Jun-13-2023, 12:39 PM
RE: Split PDF at regex value - by deanhystad - Jun-13-2023, 01:42 PM
RE: Split PDF at regex value - by standenman - Jun-13-2023, 02:41 PM
RE: Split PDF at regex value - by deanhystad - Jun-13-2023, 04:58 PM
RE: Split PDF at regex value - by standenman - Jun-13-2023, 06:00 PM
RE: Split PDF at regex value - by deanhystad - Jun-13-2023, 06:14 PM
RE: Split PDF at regex value - by standenman - Jun-13-2023, 07:03 PM
RE: Split PDF at regex value - by deanhystad - Jun-13-2023, 07:14 PM
RE: Split PDF at regex value - by standenman - Jun-13-2023, 07:16 PM
RE: Split PDF at regex value - by standenman - Jun-13-2023, 09:37 PM
RE: Split PDF at regex value - by deanhystad - Jun-14-2023, 12:18 PM
RE: Split PDF at regex value - by Pedroski55 - Jul-11-2023, 01:25 AM
