When using Pinecone in LangChain, whenever I do a similarity search -- which is supposed to return the documents most relevant to my query -- it returns the same document again and again. (When I use other vector stores such as Chroma and FAISS, I get different documents as expected.)
>>> pinecone_vectordb
<langchain.vectorstores.pinecone.Pinecone object at 0x0000017F1FEE29D0>
>>> query = "what are some bonus features offered by credit cards?"
>>> found_docs = pinecone_vectordb.similarity_search(query, k=3)
>>> found_docs
[Document(page_conten...tadata={}), Document(page_conten...tadata={}), Document(page_conten...tadata={})]
0: Document(page_content='What Are Some of the Bonus Categories for Business Credit Cards?', metadata={})
1: Document(page_content='What Are Some of the Bonus Categories for Business Credit Cards?', metadata={})
2: Document(page_content='What Are Some of the Bonus Categories for Business Credit Cards?', metadata={})
I expected the similarity search to return a set of distinct documents, ranked by how similar they are to my query.
There is probably an issue with how you created your documents. I faced the same issue, and I've explained the solution here: https://github.com/hwchase17/langchain/pull/7332
Basically, avoid Document(page_content=chunk, metadata=source.metadata) or anything similar, if you have used it. The source.metadata here is a dict, and assignment does not create a copy of it, so every Document ends up referencing the same object. Instead, use something like Document(page_content=chunk, metadata=source.metadata.copy()).
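The sharing pitfall can be reproduced in plain Python. This is a minimal sketch using a stand-in Document class (the real one is langchain.schema.Document); the field names mirror the ones above:

```python
# Stand-in for langchain.schema.Document, just to show the dict-sharing bug.
from dataclasses import dataclass, field

@dataclass
class Document:
    page_content: str
    metadata: dict = field(default_factory=dict)

source_metadata = {"source": "faq.txt"}
chunks = ["chunk one", "chunk two"]

# Buggy: every Document references the SAME dict object.
shared = [Document(page_content=c, metadata=source_metadata) for c in chunks]
shared[0].metadata["chunk_id"] = 0  # silently mutates every document's metadata
print(shared[1].metadata)  # {'source': 'faq.txt', 'chunk_id': 0}

# Fixed: each Document gets its own copy of the metadata.
source_metadata = {"source": "faq.txt"}  # fresh dict (the buggy step mutated the old one)
copied = [Document(page_content=c, metadata=source_metadata.copy()) for c in chunks]
copied[0].metadata["chunk_id"] = 0
print(copied[1].metadata)  # {'source': 'faq.txt'}
```

This is the same mechanism that produces duplicate search results: if every chunk shares one metadata dict, a write through any one document overwrites what all of them carry.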
I had the exact same situation. The problem for me was the way I was creating the embeddings.
I was using the .from_documents method on my list of documents, which should work according to the API documentation, but for some reason it kept copying the last document over and over into the index.
So I looked through the API docs for the Pinecone integrations, found the .add_documents method, and used it to add each document individually by iterating over my documents list. This populated the vectors in my index correctly, and my RAG pipeline worked as expected instead of returning duplicates of the same document.
Old code:
documents = [TextSplitterDocument(chunk) for chunk in chunks]
self.embeddings = PineconeEmbeddings(model=self.model_name,
                                     pinecone_api_key=self.pinecone_api_key)
docsearch = PineconeVectorStore.from_documents(
    documents=documents,
    index_name=index_name,
    embedding=self.embeddings,
    namespace="documentation"
)
New code:
documents = [TextSplitterDocument(chunk, metadata) for chunk in chunks]
self.embeddings = PineconeEmbeddings(model=self.model_name,
                                     pinecone_api_key=self.pinecone_api_key)
print(f"individually adding document_id: {documents[0].id} with the content: {documents[0].page_content}")
docsearch = PineconeVectorStore.from_documents(
    documents=[documents[0]],
    index_name=index_name,
    embedding=self.embeddings,
    namespace="documentation"
)
documents.remove(documents[0])
for document in documents:
    print(f"individually adding document_id: {document.id} with the content: {document.page_content}")
    docsearch.add_documents([document])
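The control flow above -- seed the store with the first document, then add the rest one at a time -- can be sketched without Pinecone at all. FakeStore below is hypothetical and only mimics the two calls used (from_documents / add_documents); it is not the Pinecone API. It also uses a documents[1:] slice instead of mutating the list with remove():

```python
# Hypothetical stand-in that mimics the two LangChain calls used above,
# purely to illustrate the seed-then-loop control flow.
class FakeStore:
    def __init__(self):
        self.docs = []

    @classmethod
    def from_documents(cls, documents):
        store = cls()
        store.add_documents(documents)
        return store

    def add_documents(self, documents):
        self.docs.extend(documents)

documents = ["doc-a", "doc-b", "doc-c"]

# Seed the index with the first document, then add the rest individually.
store = FakeStore.from_documents([documents[0]])
for document in documents[1:]:
    store.add_documents([document])

print(store.docs)  # ['doc-a', 'doc-b', 'doc-c']
```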
On another note, I used a tool called Lune AI to build a conversational AI over the Pinecone integrations API documentation, which helped me find a solution to this problem quickly. It's a useful tool for developers because you can train an LLM on up-to-date sources of information like API docs and ask it questions without the usual hallucinated replies you get from something like ChatGPT.