Greetings everyone,
I'm writing a Python program that needs to make 1000+ HTTP requests. It builds a request, fetches some JSON, processes it, and does this for over 1000 URLs. Done conventionally it takes some 20 minutes, but it needs to finish within 3 minutes.
So after a few hours of research, the most efficient solution I found was multithreading combined with keep-alive TCP connections.
Basically, I'm trying to retrieve some information about a few products from a website through web scraping.
The program below illustrates this:
import json
import threading
import time

import requests

s = requests.Session()
# placeholder headers; the real code sends proper request headers
headers = {"User-Agent": "product-scraper"}

def getInfo(productName):
    # this try block tries to get information, parse it, and then display a few
    # parameters about the particular product...
    try:
        # this is just an example URL...
        URL = "https://www.example.com/products/" + productName
        r = s.get(URL, headers=headers)
        result = json.loads(r.text)
        print(result['information'])
    except json.JSONDecodeError:
        print("Unable to process data for " + productName)

products = ["product1", "product2", "product3", ..., "productN"]  # placeholder names
counter = 1
mainThread = threading.current_thread()

for product in products:
    # this if block checks whether this is the fifth iteration of the for loop...
    # if yes, then change the TCP connection...
    if counter % 5 == 0:
        # wait until all threads except the main thread have completed, because we
        # don't want to drop the connection in the middle of a request...
        threads = threading.enumerate()
        for thread in threads:
            if thread is mainThread:
                continue
            thread.join()
        print("Connection Switched")
        # establish a new connection...
        s = requests.Session()
        # hold on for a sec
        time.sleep(1)
    # start a new thread for getting info about the current product
    thread = threading.Thread(target=getInfo, args=(product,))
    thread.start()
    counter += 1

print("Done")
Note that this is a simplified version of the real code...
Anyways...
I don't know why, but my program creates a new TCP connection for each product. There isn't much in the logs, just the basic stuff:
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): www.example.com
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (2): www.example.com
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (3): www.example.com
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (4): www.example.com
...
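(For context, those lines come from urllib3's connection-pool logger; DEBUG logging was turned on with something like the snippet below, so every new connection gets reported.)

import logging

# surface urllib3's "Starting new HTTPS connection" messages
logging.basicConfig(level=logging.DEBUG)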
Even after hours of searching, I can't seem to find a proper solution.
Everything I've turned up so far has just been related questions, so basically I haven't tried any sensible solution yet.
It would really be appreciated if you could help me with this weird snag...
Thanks in advance
I'm new to this library and just encountered this issue.
I solved it like this:
import requests

with requests.Session() as session:
    response = session.request("POST", my_url, data=my_data)
This made everything much faster, since the connection opened at the start is reused for every request instead of a new one being created each time.
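For a threaded workload like the one in the question, the same idea could look roughly like the sketch below. This is my own rough sketch, not tested against a real site; getInfo, the URL, and the product names are placeholders based on the question, and note that requests does not formally guarantee that a Session is thread-safe, although sharing one for simple GETs is a common pattern.

import json
from concurrent.futures import ThreadPoolExecutor

import requests

def getInfo(session, productName):
    # fetch the product's JSON over the shared keep-alive session
    try:
        r = session.get("https://www.example.com/products/" + productName)
        print(json.loads(r.text)['information'])
    except json.JSONDecodeError:
        print("Unable to process data for " + productName)

products = ["product1", "product2", "product3"]  # hypothetical names

# one session shared by all workers: urllib3's connection pool keeps the
# underlying TCP connections alive and reuses them between requests
with requests.Session() as session:
    with ThreadPoolExecutor(max_workers=5) as pool:
        for product in products:
            pool.submit(getInfo, session, product)

The pool bounds how many requests are in flight at once. Each in-flight request still needs its own pooled connection, so with five workers you should expect about five "Starting new HTTPS connection" lines at startup and reuse after that.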