I am currently trying to get started working with Langchain. I am working in Anaconda/Spyder IDE:
# Imports
import os
from langchain.llms import OpenAI
from langchain.document_loaders import TextLoader
from langchain.document_loaders import PyPDFLoader
from langchain.indexes import VectorstoreIndexCreator
import streamlit as st
from streamlit_chat import message
# Set API keys and the models to use
API_KEY = "MY API KEY HERE"
model_id = "gpt-3.5-turbo"
os.environ["OPENAI_API_KEY"] = API_KEY
pdf_path = '.\Paris.pdf'
loaders = PyPDFLoader(".\Paris.pdf")
I then run it with:
streamlit run c:\users\myname\.spyder-py3\untitled0.py [ARGUMENTS]
The Streamlit app starts and opens in the browser, but I get this error:
ValueError: File path .\Paris.pdf is not a valid file or url
I have checked carefully and the PDF is in fact located in the correct directory (i.e. the directory where the python script is located).
As a test I also tried:
# Imports
from PyPDF2 import PdfReader
pdf_path = './Paris.pdf'
with open(pdf_path, 'rb') as file:
    pdf = PdfReader(file)
    num_pages = len(pdf.pages)
    for page_number in range(num_pages):
        page = pdf.pages[page_number]
        page_text = page.extract_text()
        print(f"Page {page_number + 1}:\n{page_text}")
This worked perfectly. Note that I used the same path as with the langchain/streamlit version. I have installed langchain (multiple times), pyPDF and streamlit.
I then tried:
import os
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader(".\Paris.pdf")
pages = loader.load_and_split()
print(pages)
That works. What is wrong in the first code snippet that causes the file path to raise an exception?
I investigated further, and it turns out that adding the Streamlit components to the code is what causes the file path issue.
Since the error comes from the Streamlit components, I would suggest using Streamlit's file_uploader method as follows:
import streamlit as st
uploaded_file = st.file_uploader("Upload your PDF")
In this case, however, you will have to read the PDF file a different way, using PyPDF2.PdfReader as follows:
import streamlit as st
from PyPDF2 import PdfReader
uploaded_file = st.file_uploader("Upload your PDF")
if uploaded_file is not None:
    reader = PdfReader(uploaded_file)
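If the rest of your code expects a file path rather than a file object (PyPDFLoader takes a path), one possible workaround is to write the uploaded bytes to a temporary file first. This is only a minimal sketch: save_upload_to_temp is a hypothetical helper name, and it assumes the upload exposes getvalue() the way io.BytesIO does, which is what Streamlit's uploaded file objects are based on.

```python
import io
import os
import tempfile


def save_upload_to_temp(uploaded_file, suffix=".pdf"):
    """Write a file-like upload to a temporary file and return its path.

    Hypothetical helper, not part of Streamlit or LangChain.
    """
    # delete=False keeps the file on disk after the handle is closed, so it
    # can be reopened by path later. The caller is responsible for removing it.
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
        tmp.write(uploaded_file.getvalue())
        return tmp.name


# Stand-in for the object returned by st.file_uploader (assumed BytesIO-like).
fake_upload = io.BytesIO(b"%PDF-1.4 placeholder bytes")
temp_path = save_upload_to_temp(fake_upload)
print(temp_path)

# Clean up once the file is no longer needed.
os.remove(temp_path)
```

In the app, the returned path could then be passed to PyPDFLoader in place of the hard-coded one, and the temporary file removed once the pages have been loaded.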
If you need the uploaded PDF as Document objects (the format you get when the file is loaded through langchain.document_loaders.PyPDFLoader), you can do the following:
import streamlit as st
from PyPDF2 import PdfReader
from langchain.docstore.document import Document
uploaded_file = st.file_uploader("Upload your PDF")
if uploaded_file is not None:
    docs = []
    reader = PdfReader(uploaded_file)
    i = 1
    for page in reader.pages:
        docs.append(Document(page_content=page.extract_text(), metadata={'page':i}))
        i += 1
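If you would rather keep loading the PDF from disk, it may be worth ruling out the working directory first. A relative path such as .\Paris.pdf is resolved against the directory the process was started from, and under streamlit run that is not necessarily the folder containing the script. The sketch below assumes this is the cause, which nothing above confirms; resolve_next_to is a hypothetical helper, and the script path passed to it is made up so the example runs anywhere.

```python
import os
from pathlib import Path


def resolve_next_to(anchor_file, name):
    """Return an absolute path to `name` in the same folder as `anchor_file`.

    Hypothetical helper for illustration only.
    """
    return Path(os.path.abspath(anchor_file)).parent / name


# In the real script this would be resolve_next_to(__file__, "Paris.pdf").
pdf_path = resolve_next_to("some_folder/untitled0.py", "Paris.pdf")

print(Path.cwd())   # the directory relative paths are resolved against
print(pdf_path)     # a path that no longer depends on that directory
```

Passing str(pdf_path) to PyPDFLoader instead of the relative string would then make the lookup independent of where the streamlit command was run from.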