I'm writing a personal Chrome extension (Note: these can make cross-origin requests. See https://developer.chrome.com/extensions/xhr).
I'm trying to use XMLHttpRequest to access a certain website and then extract data from it using JavaScript. My problem is that this website often returns its "robots" page to me instead of the HTML. Of course, when I visit this website in my browser, it works fine. Also, if I visit the website with my browser and THEN make the XHR request, it also works fine.
I thought the problem might be that my request headers were not correct. I then modified my request headers so that they were identical to my browser ones (using chrome.webRequest). Unfortunately, this did not work either. One thing I noticed is that my browser has some cookies in its request headers, which I do not know how to replicate (see below).
Therefore, my question is: how can I solve or debug this problem? Is there a way to find out WHY the site is delivering its "robots" page to me? As far as I can tell from its robots.txt file, I am not breaking any obvious rules. I am pretty new to JavaScript and web programming, so sorry if this is a basic question.
Here is an example of my browser request headers:
GET /XXX/XXX HTTP/1.1
Host: www.example.com
Connection: keep-alive
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
Referer: https://www.example.com/XXX
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
Cookie: D_IID=XXX-XXX-XXX-XXX-XXX; D_UID=XXX-XXX-XXX-XXX-XXX; D_ZID=XXX-XXX-XXX-XXX-XXX; D_ZUID=XXX-XXX-XXX-XXX-XXX; D_HID=XXX-XXX-XXX-XXX-XXX; D_SID=XXX/XXX/XXX
I am also including my "General" headers defined in Chrome:
Request URL: https://www.example.com/XXX
Request Method: GET
Status Code: 200 OK
Remote Address: XXX
Referrer Policy: no-referrer-when-downgrade
And my response headers:
Cache-Control: private, no-cache, no-store, must-revalidate
Connection: keep-alive
Content-Encoding: gzip
Content-Type: text/html
Date: Wed, 06 Feb 2019 XXX GMT
Edge-Control: no-store, bypass-cache
Expires: Thu, 01 Jan 1970 00:00:01 GMT
Server: XXX
Surrogate-Control: no-store, bypass-cache
Transfer-Encoding: chunked
Vary: Accept-Encoding
X-DB: 0
X-DW: 0
X-DZ: XXX
After looking at the response HTML, I am not sure what it is. I originally thought it was some kind of ROBOTS response because it says META NAME="ROBOTS", but now I am less sure. Here is the general structure of the HTML.
<!DOCTYPE html>
<html>
<head>
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
<meta http-equiv="cache-control" content="max-age=0" />
<meta http-equiv="cache-control" content="no-cache" />
<meta http-equiv="expires" content="0" />
<meta http-equiv="expires" content="Tue, 01 Jan 1980 XXX GMT" />
<meta http-equiv="pragma" content="no-cache" />
<meta http-equiv="refresh" content="10; url=example.com" />
<script type="text/javascript">
// SOME JAVASCRIPT
</script>
<script type="text/javascript" src="/example.js" defer></script></head>
<body>
<div id="XXX"> </div>
</body>
</html>
Looking at your user agent, you are working on a Mac.
Some background info: on iOS, Apple's policy requires every browser to use Safari's rendering engine (WebKit), so Chrome there is essentially a different interface over the same engine. (Chrome on macOS ships its own engine, Blink, so this background applies to iOS devices rather than to desktop Chrome.) When the problem lies in that engine, you cannot fix it yourself, and installing another browser does not help. I suspect the problem you have is one of these, sad but true. Let me explain.
I had a similar problem in the past with a (Luondo) webshop embedded (via object/iframe) in a website. The cross-origin policy and so on were enabled, yet it failed on Apple devices only. Apple users had to visit the webshop's domain once before they could place an order (a cookie problem, exactly as you describe). It seems to be a security-related policy that exists only in Safari (or in browsers required to use its engine).
What I did (which probably won't help in your case) was to add a message to the page whenever an Apple device was detected. The message includes a link to the webshop's domain that opens in a new tab. After visiting it, Apple users can place an order. Here is the message, translated from Dutch:

Apple users, please note:
There is a security issue with Safari: you first have to visit our webshop provider once before you can place an order.
Click the following link to open our webshop provider's website; afterwards you can close it: [link]Activate order facility on Apple device[/link]
We apologise for the inconvenience.
Not the best translation, but you get the point. That was two years ago (as of 2019), and the problem still exists, as your question shows.
Solution:
Is there a solution at all? Probably not for Apple users (Apple would need to fix this). However, have you tried it on a Linux or Windows machine with Chrome installed? I bet it will work, unless there are server-side security restrictions that block Ajax calls, though I don't think that is the issue.
Another approach:
1. I do not know your skill level, but you could consider setting up a proxy server to avoid these problems and embed (or better, include) the site's content in your output, cookies included. One warning, though: this can be illegal, because you would be presenting another site's content as your own;
2. Ask the owner of the site whether they provide an API to their services.
Personal thoughts about your Ajax method:
If you want to merge HTML or extract content (as you call it) from a site that is not your own by using JavaScript, I doubt that what you are trying to do is legal. I think that is also why you don't want to mention the name or domain of the service (I suppose "example" is not the real service). Find out first whether what you want to do is actually legal. If it is not, all of this is a waste of time unless there is an API for it, as I explained above.
Maybe this does not sound like an answer to you, but hopefully it gives you some insight into the actual problem.
Have a nice Saturday. I hope it helps.
my question is: how can I solve or debug this problem?
To debug, try Fiddler, mitmproxy, Wireshark, or any other HTTP(S) debugging proxy and see which headers your extension actually sends with the XMLHttpRequest.
Also try to emulate the browser request using Postman, or run the XMLHttpRequest in Chrome DevTools.
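As a rough sketch of the DevTools approach, the snippet below sends the request with XMLHttpRequest and reports whether the body looks like the placeholder page from the question rather than the real HTML. The helper names (`looksLikeInterstitial`, `inspectPage`) are hypothetical, and the detection heuristic (a ROBOTS NOINDEX meta tag together with a meta refresh) is an assumption based only on the HTML shown in the question, not on anything the site documents.

```javascript
// Heuristic (an assumption): the placeholder page in the question carries both
// a ROBOTS NOINDEX meta tag and a meta refresh, which a normal page rarely does.
function looksLikeInterstitial(html) {
  const hasNoIndex = /<meta[^>]+name=["']?robots["']?[^>]+noindex/i.test(html);
  const hasRefresh = /<meta[^>]+http-equiv=["']?refresh["']?/i.test(html);
  return hasNoIndex && hasRefresh;
}

// Browser-only helper (hypothetical name): run it from the DevTools console and
// from the extension, then compare what each context logs.
function inspectPage(url) {
  const xhr = new XMLHttpRequest();
  xhr.open("GET", url);
  // Ask the browser to include its cookies for the target origin.
  xhr.withCredentials = true;
  xhr.onload = () => {
    console.log("status:", xhr.status);
    console.log("response headers:\n" + xhr.getAllResponseHeaders());
    console.log("looks like interstitial:", looksLikeInterstitial(xhr.responseText));
  };
  xhr.onerror = () => console.error("request failed");
  xhr.send();
}
```

Calling `inspectPage("https://www.example.com/XXX")` in both contexts should show whether the two get different bodies or different response headers for the same URL.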
My guess is that the cause is the X-Requested-With: XMLHttpRequest header.
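One way to check a guess like this is to capture the request headers for the browser navigation and for the XMLHttpRequest (for example, with a debugging proxy) and list what differs. The following is only a sketch: `diffHeaders` is a hypothetical helper, and the sample header values are placeholders for illustration, not real captures.

```javascript
// Compare two header maps case-insensitively and report headers that are
// missing on one side or whose values differ.
function diffHeaders(browserHeaders, xhrHeaders) {
  const normalize = (headers) =>
    Object.fromEntries(
      Object.entries(headers).map(([name, value]) => [name.toLowerCase(), value])
    );
  const a = normalize(browserHeaders);
  const b = normalize(xhrHeaders);
  const names = new Set([...Object.keys(a), ...Object.keys(b)]);
  const differences = [];
  for (const name of [...names].sort()) {
    if (a[name] !== b[name]) {
      differences.push({ name, browser: a[name], xhr: b[name] });
    }
  }
  return differences;
}

// Placeholder values for illustration only.
const browser = { "User-Agent": "Mozilla/5.0", Cookie: "D_SID=XXX" };
const xhr = { "user-agent": "Mozilla/5.0", "X-Requested-With": "XMLHttpRequest" };
console.log(diffHeaders(browser, xhr));
```

Any header that shows up in the result (a missing Cookie, an extra X-Requested-With, and so on) is a candidate explanation worth testing one at a time.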