PDF Extract using CSV values - Printable Version
+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: General Coding Help (https://python-forum.io/forum-8.html)
+--- Thread: PDF Extract using CSV values (/thread-36033.html)

PDF Extract using CSV values - atomxkai - Jan-11-2022

Hello, I need help reading multiple values from a CSV file instead of entering them manually. Thank you.

    from PyPDF2 import PdfFileReader, PdfFileWriter

    pdf_file_path = 'document.pdf'
    file_base_name = pdf_file_path.replace('.pdf', '')

    pdf = PdfFileReader(pdf_file_path)
    pdfWriter = PdfFileWriter()

    # these values are entered manually
    # how to read a csv file with multiple values instead of manual input?
    setpage = 21
    startpage = 524
    endpage = 570

    for page_num in range(startpage, endpage):
        pdfWriter.addPage(pdf.getPage(page_num))

    # the with block closes the file, so no explicit f.close() is needed
    with open("%(n)s_subset_%(b)s.pdf" % {'n': format(file_base_name), 'b': setpage}, 'wb') as f:
        pdfWriter.write(f)

RE: PDF Extract using CSV values - BashBedlam - Jan-11-2022

Assuming that your csv file looks something like this:

    Set Page, Start Page, End Page
    2, 4, 8
    2, 10, 14
    2, 16, 20

Then this will do what you're asking:

    from PyPDF2 import PdfFileReader, PdfFileWriter

    pdf_file_path = 'document.pdf'
    file_base_name = pdf_file_path.replace('.pdf', '')

    pdf = PdfFileReader(pdf_file_path)
    # one writer is shared by every row, so pages accumulate from row to row
    pdfWriter = PdfFileWriter()

    with open('page values.csv', 'r') as page_values_file:
        page_values_file.readline()  # dump the header
        for line in page_values_file:
            page_values = line.strip().split(',')
            setpage = int(page_values[0])
            startpage = int(page_values[1])
            endpage = int(page_values[2])

            for page_num in range(startpage, endpage):
                pdfWriter.addPage(pdf.getPage(page_num))

            with open("%(n)s_subset_%(b)s.pdf" % {'n': format(file_base_name), 'b': setpage}, 'wb') as f:
                pdfWriter.write(f)

RE: PDF Extract using CSV values - atomxkai - Jan-12-2022

Thank you so much, BashBedlam! It works! It helped a lot and saved me a day!

I have some comments on the results I got.
Results:
document_subset_1 -> it is not exported
document_subset_2 -> it is exported as per the values from the CSV
document_subset_3 -> it is exported as per the values from the CSV, but its first pages include the pages from subset_2
document_subset_4 -> same as subset_3, including subset_2
...

So the good thing is that I could just use the last subset PDF file, which contains the complete extraction, and extract subset_1 manually. Almost perfect!

Here is an example of my CSV values, from a PDF of 300+ pages:

    setpage  startpage  endpage
    1        0          5
    2        17         22
    3        54         59
    4        67         72
    5        82         87
    5        87         92
    5        92         97
    6        109        114
    7        122        127
    8        183        188
    9        208        213
    9        213        218
    10       222        227

setpage: I grouped the rows into sets because some of the ranges are contiguous.

RE: PDF Extract using CSV values - BashBedlam - Jan-12-2022

First off, there's no page zero so your first entry should start with a one. Secondly, I may have misunderstood your intended outcome. Try this and see if it's more what you had in mind.

    from PyPDF2 import PdfFileReader, PdfFileWriter

    pdf_file_path = 'document.pdf'
    file_base_name = pdf_file_path.replace('.pdf', '')

    pdf = PdfFileReader(pdf_file_path)

    with open('page values.csv', 'r') as page_values_file:
        page_values_file.readline()  # dump the header
        for line in page_values_file:
            page_values = line.strip().split(',')
            setpage = int(page_values[0])
            startpage = int(page_values[1])
            endpage = int(page_values[2])
            # a new writer for every row, so each output holds only that row's pages
            pdfWriter = PdfFileWriter()

            for page_num in range(startpage, endpage):
                pdfWriter.addPage(pdf.getPage(page_num))

            with open("%(n)s_subset_%(b)s.pdf" % {'n': format(file_base_name), 'b': setpage}, 'wb') as f:
                pdfWriter.write(f)

RE: PDF Extract using CSV values - atomxkai - Jan-12-2022

(Jan-12-2022, 06:41 PM)BashBedlam Wrote: First off, there's no page zero so your first entry should start with a one. Secondly, I may have misunderstood your intended outcome. Try this and see if it's more what you had in mind.

This second update works well, following the CSV values in each record row.
Although it might not read the three records with the same setpage the way the previous script did, when I change the setpage values to be sequential instead, it does the job of extracting the PDF pages as per the values in each record row. I'm still learning how to call these functions and how to group them under the for loop and the with statement.

I think I have two choices now. With the first script, the pages are already merged in the last PDF file it outputs, so I only need to insert the first set. With the second script, I can export each PDF file exactly as per the CSV record values. My remaining challenge is how to extract the first CSV record: I tried setting startpage = 1, but it looks like it didn't export. Thanks again, BashBedlam! I really appreciate it.

RE: PDF Extract using CSV values - Pedroski55 - Jan-13-2022

I often need to cut bits out of textbook pdfs. Normally I just note down the start page and finish page and enter them by hand. I never thought about making a csv, because I only want Unit 3 or Lesson 5. If you just want one part of your csv data, make a loop over the data and give yourself the choice of which section you want. If you try this in your shell, just change the paths to your own paths.

    #! /usr/bin/python3
    # this program will take a pdf and extract ranges of connected pages
    import csv
    import os

    from PyPDF2 import PdfFileWriter, PdfFileReader

    def myApp():
        print('enter the path to the pdf you want to get pages from ... ')
        path2PDF = input('something like /home/pedro/Latin/ (don\'t forget the last /) ... ')
        path2Extracts = '/home/pedro/pdfExtractedPages/'
        path2CSV = '/home/pedro/pdfs/'

        # list the PDFs in the chosen folder
        pdfs = [f for f in os.listdir(path2PDF) if f.endswith('.pdf')]
        print('Which PDF do you want to extract pages from?')
        for f in pdfs:
            print(f)
        myPDF = input('Copy and paste 1 of the PDF names here ... ')

        # read the pdf
        pdf = PdfFileReader(path2PDF + myPDF)
        pages = pdf.getNumPages()
        print('This pdf has ' + str(pages) + ' pages')

        # get the csv with the page details for extraction
        print('What pages do you want to get? They are in a CSV file.')
        csvs = [f for f in os.listdir(path2CSV) if f.endswith('.csv')]
        print('Which CSV file do you need?')
        for f in csvs:
            print(f)
        myCSV = input('Copy and paste 1 of the CSV names here ... ')

        # get the data from csv
        with open(path2CSV + myCSV) as infile:
            # csv.reader is annoying, it's gone if you have to repeat,
            # so read it to a data list first
            data = list(csv.reader(infile))

        # get the base name for saving the PDFs
        bookTitle = os.path.splitext(myPDF)[0]

        # a function to make the excerpts
        def makePDF(alist):
            label = alist[0]
            start = int(alist[1])
            end = int(alist[2])
            pdf_writer = PdfFileWriter()
            for page in range(start, end):
                pdf_writer.addPage(pdf.getPage(page))
            output_filename = f'{bookTitle}_{label}.pdf'
            with open(path2Extracts + output_filename, 'wb') as out:
                pdf_writer.write(out)
            print(f'Created: {output_filename} and saved in', path2Extracts)

        # skip the header row, then make one PDF per data row
        for i in range(1, len(data)):
            makePDF(data[i])
        print('Pages extracted, pdfs made and saved in ', path2Extracts)
        print('All done!')
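
One point the thread leaves open is how to keep rows that share a setpage (such as the three rows for set 5) together in one output file without carrying pages over from earlier sets. The following is only a minimal sketch of one way to do that, not a method taken from any of the posts. It assumes a comma-separated CSV with a header row, and it treats startpage and endpage as zero-based indexes with an exclusive end, which is how the scripts above pass them to range() and getPage(). The names group_page_ranges and write_subsets are hypothetical helpers introduced for illustration. The PDF step reuses the PdfFileReader / PdfFileWriter calls from the thread; newer PyPDF2 releases renamed these, so it may need adjusting there.

```python
import csv
import io


def group_page_ranges(csv_file):
    """Map each setpage to the page indexes collected from all of its rows."""
    groups = {}
    reader = csv.reader(csv_file, skipinitialspace=True)
    next(reader, None)  # skip the header row
    for row in reader:
        if not row:  # ignore blank lines
            continue
        setpage, startpage, endpage = (int(value) for value in row[:3])
        # rows with the same setpage extend the same list
        groups.setdefault(setpage, []).extend(range(startpage, endpage))
    return groups


def write_subsets(pdf_file_path, groups):
    """Write one PDF per set (hypothetical helper, legacy PyPDF2 API assumed)."""
    # Imported here so the grouping step can be tried without PyPDF2 installed.
    from PyPDF2 import PdfFileReader, PdfFileWriter

    pdf = PdfFileReader(pdf_file_path)
    file_base_name = pdf_file_path.replace('.pdf', '')
    for setpage, page_indexes in groups.items():
        pdf_writer = PdfFileWriter()  # fresh writer per set, nothing carries over
        for page_index in page_indexes:
            pdf_writer.addPage(pdf.getPage(page_index))
        with open(f'{file_base_name}_subset_{setpage}.pdf', 'wb') as f:
            pdf_writer.write(f)


# Sample rows in the layout BashBedlam assumed, with values from atomxkai's list.
sample = io.StringIO(
    "Set Page, Start Page, End Page\n"
    "1, 0, 5\n"
    "5, 82, 87\n"
    "5, 87, 92\n"
    "5, 92, 97\n"
)
groups = group_page_ranges(sample)
print({setpage: len(pages) for setpage, pages in groups.items()})
# → {1: 5, 5: 15}
```

With a real file, calling write_subsets('document.pdf', groups) would then write document_subset_1.pdf, document_subset_5.pdf and so on. As far as the PyPDF2 documentation describes it, getPage() counts pages from zero, so under that assumption a row such as 1, 0, 5 covers the first five pages of the document.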