I have a function that goes to a URL and crawls its content (and its subpages). I then want to load the text content into LangChain's VectorstoreIndexCreator(). How can I do this via a loader? I could not find a suitable loader in langchain.document_loaders. Should I subclass BaseLoader for this? How?
My code:
import requests
from bs4 import BeautifulSoup
import openai
from langchain.docstore.document import Document
from langchain.indexes import VectorstoreIndexCreator

def get_company_info_from_web(company_url: str, max_crawl_pages: int = 10, questions=None):
    # Goes to the URL and collects the links on the page.
    links = get_links_from_page(company_url)
    # get_text_content_from_page visits each URL and yields a (text, url) tuple.
    documents = []
    for text, url in get_text_content_from_page(links[:max_crawl_pages]):
        # Add the text content (string) to the index.
        # loader????
        documents.append(Document(page_content=text, metadata={"source": url}))
    # Build the index once, from all crawled pages.
    index = VectorstoreIndexCreator().from_documents(documents)
    # Finally, query the vector database:
    DEFAULT_QUERY = "What does the company do? Who are the key people in this company? Can you tell me contact information?"
    query = questions or DEFAULT_QUERY
    logger.info(f"Query: {query}")
    result = index.query_with_sources(query)
    logger.info(f"Result:\n {result['answer']}")
    logger.info(f"Sources:\n {result['sources']}")
    return result['answer'], result['sources']
Yes, you can use the WebBaseLoader, which uses BeautifulSoup
behind the scenes to parse the data.
See the sample below:
from langchain.document_loaders import WebBaseLoader
loader = WebBaseLoader(your_url)
scrape_data = loader.load()
You can load multiple web pages by passing an array of URLs, like below:
from langchain.document_loaders import WebBaseLoader
loader = WebBaseLoader([your_url_1, your_url_2])
scrape_data = loader.load()
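Note that WebBaseLoader does not discover subpages for you; it only fetches the URLs you give it, so you still need your own link-collection step. As a minimal, stdlib-only sketch of such a helper (the name get_links_from_page matches your code, but this implementation is only an illustration; it parses an already-fetched HTML string rather than making the request itself):

from html.parser import HTMLParser
from urllib.parse import urljoin

class _LinkCollector(HTMLParser):
    """Collects href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def get_links_from_page(base_url: str, html: str) -> list:
    """Return the absolute URLs of all links found in the page's HTML."""
    parser = _LinkCollector()
    parser.feed(html)
    # Resolve relative hrefs against the page's own URL.
    return [urljoin(base_url, href) for href in parser.links]

html = '<a href="/about">About</a> <a href="https://example.com/team">Team</a>'
links = get_links_from_page("https://example.com", html)
# links -> ["https://example.com/about", "https://example.com/team"]

The resulting list can then be passed straight to WebBaseLoader(links).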
And to load multiple web pages concurrently, you can use the aload()
method:
from langchain.document_loaders import WebBaseLoader
loader = WebBaseLoader([your_url_1, your_url_2])
scrape_data = loader.aload() # <-------- here
You may encounter an issue when loading concurrently if you already have a running asyncio event loop: it throws a nested-event-loop error such as "RuntimeError: This event loop is already running". You can resolve this with the nest_asyncio library, a patch that allows nested event loops. See the sample below:
import nest_asyncio
nest_asyncio.apply()

from langchain.document_loaders import WebBaseLoader
loader = WebBaseLoader([your_url_1, your_url_2])
scrape_data = loader.aload()
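For reference, the underlying error can be reproduced with the stdlib alone, no LangChain involved; this is exactly the re-entrancy that nest_asyncio patches away (a minimal sketch):

import asyncio

async def main():
    loop = asyncio.get_running_loop()
    try:
        # Re-entering an already-running loop is what aload() effectively
        # attempts in a notebook or other environment with a live loop.
        loop.run_until_complete(asyncio.sleep(0))
    except RuntimeError as exc:
        return str(exc)
    return "no error"

message = asyncio.run(main())
# message -> "This event loop is already running"

After nest_asyncio.apply(), the same nested call succeeds instead of raising.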