I have a directory with many files that are named like:
1234_part1.pdf
1234.pdf
5432_part1.pdf
5432.pdf
2323_part1.pdf
2323.pdf
etc.
I am trying to merge the PDFs whose leading number is the same. I have code that can do this one pair at a time, but with over 500 files in the directory I am not sure how to loop through them. Here is what I have so far:
from PyPDF2 import PdfFileMerger, PdfFileReader
merger = PdfFileMerger()
merger.append(PdfFileReader(open('c:/example/1234_part1.pdf', 'rb')))
merger.append(PdfFileReader(open('c:/example/1234.pdf', 'rb')))
merger.write("c:/example/ouput/1234_combined.pdf")
Ideally the output file would be named 'xxxx_combined_<today's date>.pdf',
e.g. 1234_combined_051719.pdf
Also, if a number has only one of the two files, nothing should be combined,
e.g. if there is a 9999_part1.pdf but no 9999.pdf, there should be no '9999_combined_<today's date>.pdf' output.
Try using os.listdir() to get all of the files in your directory. Then use .split() on each filename to isolate the PDF file number, and look for that number in the list of files you made.
import os
from PyPDF2 import PdfFileMerger

pdf_dir = 'my/dir/of/pdfs/'
out_dir = 'out/dir/'
file_list = os.listdir(pdf_dir)

num_list = []
for fname in file_list:
    if '_' in fname:  # if the filename has an underscore in it
        file_num = fname.split('_')[0]  # gets the first element in the list of splits
    else:
        file_num = fname.split('.')[0]
    if file_num not in num_list:
        num_list.append(file_num)

# now you have a list of all of your file numbers, so you can grab all files
# in file_list that belong to each number
for num in num_list:
    # match on the prefix so that '123' does not also pick up '1234.pdf'
    pdf_parts = [x for x in file_list
                 if x.startswith(num + '_') or x.startswith(num + '.')]
    if len(pdf_parts) < 2:  # if there is only one pdf with that num ...
        continue  # skip it!
    # your pdf append operation here for each item in the pdf_parts list.
    # something like this maybe ...
    merger = PdfFileMerger()
    # sorts list by filename length in descending order so that
    # '_part' files come first (sorted() returns a new list; .sort() returns None)
    sorted_pdf_parts = sorted(pdf_parts, key=len, reverse=True)
    for part in sorted_pdf_parts:
        merger.append(os.path.join(pdf_dir, part))  # append() also accepts a path
    merger.write(os.path.join(out_dir, num + '_combined.pdf'))
    merger.close()
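The question also asks for today's date in the output name and for incomplete pairs to be skipped. A minimal sketch of just that pairing and naming step could look like the following. It uses only the standard library and does not touch any PDFs. The helper name plan_merges, its today parameter, and the PART_SUFFIX constant are made up for illustration, and it assumes every pair is named exactly '<number>_part1.pdf' and '<number>.pdf' as in the question.

```python
from datetime import date

PART_SUFFIX = '_part1.pdf'  # naming pattern assumed from the question


def plan_merges(filenames, today=None):
    """Return a list of ([part1, main], output_name), one per complete pair."""
    # %m%d%y gives e.g. 051719 for 17 May 2019, matching the question's example
    stamp = (today or date.today()).strftime('%m%d%y')
    names = set(filenames)
    plan = []
    for name in sorted(names):
        if not name.endswith(PART_SUFFIX):
            continue
        number = name[:-len(PART_SUFFIX)]
        main = number + '.pdf'
        if main in names:  # skip numbers that are missing either file
            output_name = '{}_combined_{}.pdf'.format(number, stamp)
            plan.append(([name, main], output_name))
    return plan


# hypothetical listing: 9999 and 5432 each lack one of the two files
files = ['1234_part1.pdf', '1234.pdf', '9999_part1.pdf', '5432.pdf']
for inputs, output_name in plan_merges(files, today=date(2019, 5, 17)):
    print(inputs, '->', output_name)
```

Each entry can then be fed to the merger loop above: append the inputs in order and write to output_name inside your output directory. In real use you would pass os.listdir() of your folder and leave today unset.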