
Python requests library creates new connection each time

Greetings everyone. I'm writing a Python program that needs to make 1000+ HTTP requests. It builds a request, fetches some JSON, processes it, and does this for over 1000 URLs. Done conventionally, it takes about 20 minutes, but it needs to finish within 3 minutes.

So after a few hours of research, the most efficient solution I found was multithreading combined with keep-alive TCP connections.

So basically I'm trying to retrieve some information about a few products from a website through web scraping.

The program below illustrates this:

import requests
import json
import threading
import time

s = requests.Session()

def getInfo(productName):
    # this try block tries to get information, parse it, and display a few
    # parameters about the particular product...
    try:
        # this is just an example URL...
        URL = "https://www.example.com/products/" + productName
        r = s.get(URL, headers=headers)  # headers is defined elsewhere in the real code
        result = json.loads(r.text)
        print(result['information'])

    except json.JSONDecodeError:
        print("Unable to process data for " + productName)

products = [product1, product2, product3... productN]  # placeholder list
counter = 1
mainThread = threading.current_thread()

for product in products:

    # this if block checks if this is the fifth iteration of the for loop...
    # if yes then change tcp connection...
    if counter % 5 == 0:

        # wait until all threads except the main thread have completed, because we don't
        # want to drop the connection in the middle of a request...
        for thread in threading.enumerate():
            if thread is mainThread:
                continue
            thread.join()

        print("Connection Switched")
        # establish a new connection...
        s = requests.Session()
        # hold on for a sec
        time.sleep(1)

    # start a new thread for getting info about the current product
    thread = threading.Thread(target=getInfo, args=(product,))
    thread.start()
    counter += 1

print("Done")

Note that the code above is a simplified version of the real code...

Anyways...

For some reason, my program creates a new TCP connection for each product. There isn't much in the logs, just the basic stuff:

DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): www.example.com
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (2): www.example.com
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (3): www.example.com
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (4): www.example.com
...

Even after hours of searching, I can't seem to find a suitable solution.

Here are some of the things that I have TRIED:

  • Stack Overflow question
  • Another Stack Overflow question

The above two links are just related questions, so basically I haven't tried any sensible solution yet.

It would really be appreciated if you could help me with this weird snag...

Thanks in advance

Asked by HufF867 on Dec 01 '25

1 Answer

I'm new to this library and just encountered this issue.

I solved it like this:

import requests

with requests.Session() as session:
    response = session.request("POST", my_url, data=my_data)

This makes everything much faster as it reuses the connection you make at the beginning.
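To combine this with the asker's multithreading, a common pattern is one shared `Session` plus a thread pool, so every worker draws connections from the same keep-alive pool. The following is only a sketch, not the asker's real code: the product names, the `get_info` helper, and the local stand-in server (used here instead of `www.example.com` so the example is runnable) are all illustrative assumptions.

```python
import json
import threading
from concurrent.futures import ThreadPoolExecutor
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

import requests
from requests.adapters import HTTPAdapter

# Hypothetical product names for illustration.
products = ["p1", "p2", "p3"]

# One shared Session -> one urllib3 connection pool, reused across threads.
# pool_maxsize bounds how many connections the pool keeps per host; size it
# to roughly match the number of worker threads.
session = requests.Session()
session.mount("http://", HTTPAdapter(pool_maxsize=10))

def get_info(base_url, product):
    # Fetch one product's JSON over the shared session and return one field.
    r = session.get(f"{base_url}/products/{product}", timeout=5)
    return r.json()["information"]

# -- tiny local HTTP/1.1 server standing in for the real site --
class Handler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # HTTP/1.1 keeps connections alive by default

    def do_GET(self):
        # Echo the product name back as the "information" field.
        body = json.dumps({"information": self.path.rsplit("/", 1)[-1]}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence per-request logging

server = ThreadingHTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_address[1]}"

# Thread pool replaces the hand-rolled thread bookkeeping; map preserves order.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(lambda p: get_info(base, p), products))

server.shutdown()
print(results)  # -> ['p1', 'p2', 'p3']
```

Note that `requests` does not officially guarantee `Session` thread safety, though sharing one session across reader threads is widespread practice; if that is a concern, one session per worker thread is the conservative alternative.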

Answered by GarethD on Dec 04 '25