How does Cloudflare detect that I am a bot even though I have provided the cf_clearance cookie?

Question

How does Cloudflare even know that this request came from a script even if I provided all the data, cookies and parameters when making a normal request? What does it check for? Am I doing something wrong? For example (I have redacted some of the values):

import requests

cookies = {
    '__Host-next-auth.csrf-token': '...',
    'cf_clearance': '...',
    'oai-asdf-ugss': '...',
    'oai-asdf-gsspc': '...',
    'intercom-id-dgkjq2bp': '...',
    'intercom-session-dgkjq2bp': '',
    'intercom-device-id-dgkjq2bp': '...',
    '_cfuvid': '...',
    '__Secure-next-auth.callback-url': '...',
    'cf_clearance': '...',
    '__cf_bm': '...',
    '__Secure-next-auth.session-token': '...',
}

headers = {
    'authority': 'chat.openai.com',
    'accept': 'text/event-stream',
    'accept-language': 'en-IN,en-US;q=0.9,en;q=0.8',
    'authorization': 'Bearer ...',
    'content-type': 'application/json',
    'cookie': '__Host-next-auth.csrf-token=...',
    'origin': 'https://chat.openai.com',
    'referer': 'https://chat.openai.com/chat',
    'sec-ch-ua': '"Brave";v="111", "Not(A:Brand";v="8", "Chromium";v="111"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Linux"',
    'sec-fetch-dest': 'empty',
    'sec-fetch-mode': 'cors',
    'sec-fetch-site': 'same-origin',
    'sec-gpc': '1',
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36',
}

json_data = {
 ...
}

response = requests.post('https://chat.openai.com/backend-api/conversation', cookies=cookies, headers=headers, json=json_data)

I have tried different useragents to no avail, but I can't seem to figure out whats causing the problem in the first place.

The response comes back with error code 403 and HTML something like:

<html>
...
...
<h1>Access denied</h1>
  <p>You do not have access to chat.openai.com.</p><p>The site owner may have set restrictions that prevent you from accessing the site.</p>
  <ul class="cferror_details">
    <li>Ray ID: ...</li>
    <li>Timestamp: ...</li>
    <li>Your IP address: ...</li>
    <li class="XXX_no_wrap_overflow_hidden">Requested URL: chat.openai.com/backend-api/conversation </li>
    <li>Error reference number: ...</li>
    <li>Server ID: ...</li>
    <li>User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36</li>
  </ul>
...
...
</html>

StingyJack · Accepted Answer

I used to run a web data scraping/mining team that had to scrape about 20K sites every day for publicly available data. The only way we could reliably get past some of the harder bot checks (reCAPTCHA, Cloudflare, and some of the dozen or more AI/ML powered others) was to either use a proxy that made our traffic look like human user traffic, or to programmatically remote control a browser, and sometimes both.

Proxy providers seem to come and go every few years, and the one I used last is no longer around, but it looks like there are a few that guarantee a similar experience of "priming" your requests to make them look like legit traffic. This was necessary for some of the bot detection (reCAPTCHA specifically, but probably also Cloudflare) that use current traffic analysis and historical data to determine if you are a bot or not. None of these proxies would be free, but as long as you don’t need to make 300K requests per day they should be relatively cheap.

The remote control option was a container image that had the browser running on it and a Python-based remote control package* that would interact with the browser like a keyboard and mouse. This was important to defeat/avoid bot detection that would fingerprint the browser and/or observe behavior. There are a rather startling number of properties your browser gives up about itself via JavaScript and you are bound to forget one of them if you aren’t just using a regular browser. Those properties get inspected for any "Bot" flags in conjunction with how fast and what you are clicking on when visiting the page to determine if you are human or not.

* It was PyAutoGui, PyScreeze to take screenshots and pyTesseract to OCR the screenshot. Selenium/WebDriver is detectable by the more advanced bot detection software, hence the screenshot + OCR is used for collecting data and for locating clickables.

Life is complex · Answer

UPDATED 04-10-2023

Back in December 2022, OpenAI deployed Cloudflare to protect ChatGPT from being abused from non-official means. OpenAI also started redesigning their service, which included changing endpoints that were being queried externally from Python scripts.

For example this endpoint, which you are trying to reach use to accept POST requests prior to these changes.

https://chat.openai.com/backend-api/conversations

This endpoint currently only accepts GET requests using the cf_clearance cookie and other header information extracted from an authorized Browser session.

When I tried to use these cookies with a POST request, I get the error message Access was denied Error code 1020 in the response text. This error message is a clear indication that OpenAI has enabled a Cloudflare firewall rule for POST requests to the endpoint in question, which is https://chat.openai.com/backend-api/conversations

Here is a Cloudflare reference for this error message.

The new endpoint is https://api.openai.com/v1/completions, which will except POST requests.

Currently you have at least 3 options to use ChatGPT with Python.

The first is to use selenium, which would allow you to interact with ChatGPT much like you using a browser session yourself.

The second option would be to use the new endpoint as show in the code below:

import requests

url = 'https://api.openai.com/v1/completions'
headers = {'Content-Type': 'application/json',
           'Authorization': 'Bearer YOUR_API_KEY'}
data = {'prompt': 'tell me about wine',
        'model': 'text-davinci-003',
        'temperature': 0.5,
        'max_tokens': 4000}

response = requests.post(url, headers=headers, json=data)

if response.status_code == 200:
    # ChatGPT will provide a different response with each request.
    print(response.json())
else:
    print(f'Request failed with status code {response.status_code}')

The third option is to use the official ChatGPT API

Here is where you obtain an API key.

Here is the API Documentation

Here is the basic code needed to use the API:

import os
import requests

api_endpoint = "https://api.openai.com/v1/completions"
api_key = os.getenv("OPENAI_API_KEY")

request_headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {api_key}"
}

request_data = {
    "model": "text-davinci-003",
    "prompt": "What is the most popular programming language?",
    "max_tokens": 100,
    "temperature": 0.5
}

response = requests.post(api_endpoint, headers=request_headers, json=request_data)

# ChatGPT will provide a different response with each request.
if response.status_code == 200:
    response_text = response.json()["choices"][0]["text"]
    print(response_text)
    # The most popular programming language is currently JavaScript, followed by Python, Java, C/C++, and C#
else:
    print(f"Request failed with status code: {str(response.status_code)}")

Hopefully this information is useful to you. Happy coding.

How does Cloudflare detect that I am a bot even though I have provided the cf_clearance cookie?

Tags:

python

cookies

python-requests

session-cookies

cloudflare

Anm

2 Answers

StingyJack

UPDATED 04-10-2023

Life is complex

Recent Activity

Donate For Us

How does Cloudflare detect that I am a bot even though I have provided the cf_clearance cookie?

Tags:

python

cookies

python-requests

session-cookies

cloudflare

Anm

2 Answers

StingyJack

UPDATED 04-10-2023

Life is complex

Related questions

Recent Activity

Donate For Us