When using Pinecone in LangChain, whenever I do a similarity search -- which is supposed to return the documents most relevant to my query -- it returns the same document again and again. (When I use other vector stores such as Chroma and FAISS, I get different documents as expected.)
>>> pinecone_vectordb
<langchain.vectorstores.pinecone.Pinecone object at 0x0000017F1FEE29D0>
>>> query = "what are some bonus features offered by credit cards?"
>>> found_docs = pinecone_vectordb.similarity_search(query, k=3)
>>> found_docs
[Document(page_conten...tadata={}), Document(page_conten...tadata={}), Document(page_conten...tadata={})]
0: Document(page_content='What Are Some of the Bonus Categories for Business Credit Cards?', metadata={})
1: Document(page_content='What Are Some of the Bonus Categories for Business Credit Cards?', metadata={})
2: Document(page_content='What Are Some of the Bonus Categories for Business Credit Cards?', metadata={})
I expected the similarity search to return a set of distinct documents, ranked by how similar they are to my query.
There is probably an issue with how you created your documents. I faced the same issue, and I've explained the solution here: https://github.com/hwchase17/langchain/pull/7332
Basically, avoid Document(page_content=chunk, metadata=source.metadata) or anything similar, if you have used it. The source.metadata here is a dict, and assignment does not create a copy of it, so every Document ends up referencing the same object. Instead, use something like Document(page_content=chunk, metadata=source.metadata.copy()).
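The sharing pitfall can be reproduced in plain Python. This is a minimal sketch using a stand-in Document class (the real one is langchain.schema.Document); the field names mirror the ones above:

```python
# Stand-in for langchain.schema.Document, just to show the dict-sharing bug.
from dataclasses import dataclass, field

@dataclass
class Document:
    page_content: str
    metadata: dict = field(default_factory=dict)

source_metadata = {"source": "faq.txt"}
chunks = ["chunk one", "chunk two"]

# Buggy: every Document references the SAME dict object.
shared = [Document(page_content=c, metadata=source_metadata) for c in chunks]
shared[0].metadata["chunk_id"] = 0  # silently mutates every document's metadata
print(shared[1].metadata)  # {'source': 'faq.txt', 'chunk_id': 0}

# Fixed: each Document gets its own copy of the metadata.
source_metadata = {"source": "faq.txt"}  # fresh dict (the buggy step mutated the old one)
copied = [Document(page_content=c, metadata=source_metadata.copy()) for c in chunks]
copied[0].metadata["chunk_id"] = 0
print(copied[1].metadata)  # {'source': 'faq.txt'}
```

This is the same mechanism that produces duplicate search results: if every chunk shares one metadata dict, a write through any one document overwrites what all of them carry.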
I had the exact same situation. The problem for me was the way I was creating the embeddings.
I was using the .from_documents method on my list of documents, which should work according to the API documentation, but for some reason it kept copying the last document over and over into the index.
So I looked through the API docs for the Pinecone integrations, found the .add_documents method, and used it to add each document individually by iterating over my documents list. This populated the vectors in my index correctly, and my RAG pipeline worked as expected instead of returning duplicates of the same document.
Old code:
documents = [TextSplitterDocument(chunk) for chunk in chunks]
self.embeddings = PineconeEmbeddings(model=self.model_name,
                                     pinecone_api_key=self.pinecone_api_key)
docsearch = PineconeVectorStore.from_documents(
    documents=documents,
    index_name=index_name,
    embedding=self.embeddings,
    namespace="documentation"
)
New code:
documents = [TextSplitterDocument(chunk, metadata) for chunk in chunks]
self.embeddings = PineconeEmbeddings(model=self.model_name,
                                     pinecone_api_key=self.pinecone_api_key)
print(f"individually adding document_id: {documents[0].id} with the content: {documents[0].page_content}")
docsearch = PineconeVectorStore.from_documents(
    documents=[documents[0]],
    index_name=index_name,
    embedding=self.embeddings,
    namespace="documentation"
)
documents.remove(documents[0])
for document in documents:
    print(f"individually adding document_id: {document.id} with the content: {document.page_content}")
    docsearch.add_documents([document])
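The control flow above -- seed the store with the first document, then add the rest one at a time -- can be sketched without Pinecone at all. FakeStore below is hypothetical and only mimics the two calls used (from_documents / add_documents); it is not the Pinecone API. It also uses a documents[1:] slice instead of mutating the list with remove():

```python
# Hypothetical stand-in that mimics the two LangChain calls used above,
# purely to illustrate the seed-then-loop control flow.
class FakeStore:
    def __init__(self):
        self.docs = []

    @classmethod
    def from_documents(cls, documents):
        store = cls()
        store.add_documents(documents)
        return store

    def add_documents(self, documents):
        self.docs.extend(documents)

documents = ["doc-a", "doc-b", "doc-c"]

# Seed the index with the first document, then add the rest individually.
store = FakeStore.from_documents([documents[0]])
for document in documents[1:]:
    store.add_documents([document])

print(store.docs)  # ['doc-a', 'doc-b', 'doc-c']
```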
On another note, I used a tool called Lune AI to build a conversational AI over the Pinecone integrations API documentation, which helped me find a solution to this problem quickly. It's a useful tool for developers because you can train an LLM on up-to-date sources of information like API docs and ask it questions without the usual hallucinated replies you get from something like ChatGPT.