
How to analyze PDF using ChatGPT / Vision python API? [closed]

I have a list of PDF files and I want to analyze the first page of each document to extract information. I've tried a lot of free and paid OCR tools, but in my case the results aren't good enough.

So I want to try using the ChatGPT API in Python. How do I go about it?

Also, I saw in the OpenAI Vision documentation that there is a detail parameter, but no example is provided. How do I use this parameter?


1 Answer

First, you need to extract the first page of each document as an image (here, a PNG).

import fitz  # PyMuPDF

def save_first_page_as_png(pdf_path: str, image_path: str):
    # Render the first page of the PDF and save it as a PNG
    pdf_document = fitz.open(pdf_path)
    first_page = pdf_document.load_page(0)
    pixmap = first_page.get_pixmap()
    pixmap.save(image_path)
    pdf_document.close()
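
By default, get_pixmap() renders at 72 dpi, which can be too coarse for dense or small print. If the rendered page is hard to read, here is a sketch of rendering at a higher resolution with a zoom matrix (the function name and the 2x factor are mine to adapt):

def save_first_page_as_png_hires(pdf_path: str, image_path: str, zoom: float = 2.0):
    # Render the first page at `zoom` times the default 72 dpi
    pdf_document = fitz.open(pdf_path)
    first_page = pdf_document.load_page(0)
    pixmap = first_page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
    pixmap.save(image_path)
    pdf_document.close()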

Then, to call the ChatGPT API, you need to convert this image to base64.

import base64

def encode_image(image_path: str):
    # Read the image file and return its contents as a base64-encoded string
    with open(image_path, 'rb') as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')
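
If you'd rather not write the PNG to disk at all, recent PyMuPDF versions can give you the PNG bytes directly in memory; a sketch combining both steps (the helper name is mine):

def first_page_to_base64(pdf_path: str) -> str:
    pdf_document = fitz.open(pdf_path)
    png_bytes = pdf_document.load_page(0).get_pixmap().tobytes("png")  # PNG-encoded bytes, no temp file
    pdf_document.close()
    return base64.b64encode(png_bytes).decode('utf-8')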

Finally, you can call the ChatGPT API (with the detail parameter).

import requests

api_key = 'your_api_key'

def call_gpt4_with_image(base64_image):
    # Send the base64-encoded image to the Chat Completions endpoint
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}"
    }

    payload = {
        "model": "gpt-4-vision-preview",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Describe this image."
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/png;base64,{base64_image}",
                            "detail": "low"
                        }
                    }
                ]
            }
        ],
        "max_tokens": 300
    }

    response = requests.post("https://api.openai.com/v1/chat/completions", headers=headers, json=payload)
    response.raise_for_status()  # surface HTTP errors (invalid key, rate limit, ...)
    result = response.json()
    print(result)  # the model's reply is in result["choices"][0]["message"]["content"]

Note that the detail parameter can be low or high (see the documentation).
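
If you prefer not to build the HTTP request by hand, the same call can be made with the official openai Python package (version >= 1.0). A minimal sketch, assuming the package is installed; the function name is mine:

from openai import OpenAI

client = OpenAI(api_key=api_key)

def call_gpt4_with_image_sdk(base64_image):
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe this image."},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/png;base64,{base64_image}",
                            "detail": "low"
                        }
                    }
                ]
            }
        ],
        max_tokens=300
    )
    print(response.choices[0].message.content)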

Here is an example of a full workflow.

pdf_paths = ['pdf1.pdf', 'pdf2.pdf', 'pdf3.pdf']
for pdf_path in pdf_paths:
    first_page_path = pdf_path.replace('.pdf', '_1st_page.png')
    save_first_page_as_png(pdf_path, first_page_path)
    base64_image = encode_image(first_page_path)
    call_gpt4_with_image(base64_image)
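
Since your goal is to extract information rather than get a general description, you will likely get better results by swapping the "Describe this image." prompt for a targeted instruction, for example asking for JSON only. A sketch of such a prompt (the field names are placeholders to adapt to your documents):

EXTRACTION_PROMPT = (
    "Extract the following fields from this document page and reply with JSON only: "
    "title, date, sender, reference_number. Use null for any field that is missing."
)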

EDIT: In terms of price, I tried it on the first page of a PDF; the PNG was 596x842 pixels, and the request (question + image) cost me 98 input tokens and 87 output tokens. With the current pricing of $0.01 / 1K input tokens and $0.03 / 1K output tokens, that's a total of $0.00359 (about $3.59 per 1,000 images).
