
Script to loop through and match files based on file name and append

Tags: python, pypdf

I have a directory with many files that are named like:

1234_part1.pdf
1234.pdf
5432_part1.pdf
5432.pdf
2323_part1.pdf
2323.pdf
etc.

I am trying to merge the PDFs whose leading number is the same. I have code that can do this for one pair at a time, but with over 500 files in the directory I am not sure how to loop through them. Here is what I have so far:

from PyPDF2 import PdfFileMerger, PdfFileReader
merger = PdfFileMerger()
merger.append(PdfFileReader(open('c:/example/1234_part1.pdf', 'rb')))
merger.append(PdfFileReader(open('c:/example/1234.pdf', 'rb')))
merger.write("c:/example/ouput/1234_combined.pdf")

Ideally the output file would be named 'xxxx_combined_<today's date>.pdf', e.g. 1234_combined_051719.pdf.

Also, if a number has only one of the two files, it should not be combined. For example, if there were a 9999_part1.pdf but no 9999.pdf, there would be no '9999_combined_<today's date>.pdf' output.

Lulumocha asked Dec 08 '25 10:12


1 Answer

Try using os.listdir() to get all of the files in your directory. Then use .split() on each filename to isolate the PDF file number, and look for that number in the list of files you made.

import os
# PdfFileMerger and PdfFileReader are the pre-3.0 PyPDF2 names
from PyPDF2 import PdfFileMerger, PdfFileReader

pdf_dir = 'my/dir/of/pdfs/'
file_list = os.listdir(pdf_dir)
num_list = []

for fname in file_list:
    if '_' in fname:  # if the filename has an underscore in it
        file_num = fname.split('_')[0]  # gets the first element in the list of splits
    else:
        file_num = fname.split('.')[0]
    if file_num not in num_list:
        num_list.append(file_num)

# now you have a list of all of your file numbers you can grab all files
# in the file_list that start with that number
for num in num_list:
    # grabs all files named '<num>_...' or '<num>....'; a plain substring
    # test would also match e.g. '123' inside '1234.pdf'
    pdf_parts = [x for x in file_list if x.startswith((num + '_', num + '.'))]
    if len(pdf_parts) < 2:  # if there is only one pdf with that num ...
        continue  # skip it!
    # your pdf append operation here for each item in the pdf_parts list.
    # something like this maybe ...

    merger = PdfFileMerger()
    # sorts list by filename length in descending order so that
    # '_part' files come first; sorted() returns a new list,
    # whereas list.sort() sorts in place and returns None
    sorted_pdf_parts = sorted(pdf_parts, key=len, reverse=True)
    for part in sorted_pdf_parts:
        merger.append(PdfFileReader(open(os.path.join(pdf_dir, part), 'rb')))
    merger.write('out/dir/' + num + '_combined.pdf')
    merger.close()
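The question also asks for a date-stamped output name and for numbers with only one file to be skipped. A minimal sketch of those two pieces, using only the standard library, might look like the following. The helper names `find_pairs` and `combined_name` are hypothetical, the date format MMDDYY is inferred from the example 1234_combined_051719.pdf, and the sketch assumes the partner file is always named `<number>_part1.pdf`.

```python
import datetime
import os


def find_pairs(filenames):
    """Map each file number to (part1_name, main_name), keeping complete pairs only.

    Assumption: the partner of 'NNNN.pdf' is always 'NNNN_part1.pdf'.
    """
    names = set(filenames)
    pairs = {}
    for name in sorted(names):
        stem, ext = os.path.splitext(name)
        if ext.lower() != '.pdf' or '_' in stem:
            continue  # only plain 'NNNN.pdf' files can start a pair
        part1 = stem + '_part1' + ext
        if part1 in names:  # numbers missing either file produce no entry
            pairs[stem] = (part1, name)
    return pairs


def combined_name(num, today=None):
    """Build 'NNNN_combined_MMDDYY.pdf' (MMDDYY inferred from '051719')."""
    today = today or datetime.date.today()
    return '{}_combined_{}.pdf'.format(num, today.strftime('%m%d%y'))


# Example with plain filenames; 9999 has no '9999.pdf', so it is skipped.
files = ['1234_part1.pdf', '1234.pdf', '5432_part1.pdf', '5432.pdf', '9999_part1.pdf']
for num, (part1, main) in find_pairs(files).items():
    print(num, part1, main, combined_name(num, datetime.date(2019, 5, 17)))
```

In the merge loop, each pair could be appended in the order returned (part1 first, then the main file) and written to the output directory under `combined_name(num)`. Pairing by exact filename rather than by substring also avoids matching a short number inside a longer one.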

Jello answered Dec 09 '25 23:12