I am scraping Google using Python Scrapy.
The page makes an AJAX request to a URL, and the response has content-type application/vnd.google.octet-stream-compressible. Can I convert it into a readable form?
Here is the cURL command; you can run it to see the response.
curl 'https://www.google.com/maps/vt/stream/pb=!1m7!8m6!1m3!1i17!2i38176!3i49635!2i7!3x16383!2m3!1e0!2sm!3i371050979!3m7!2sen!5e1105!12m4!1e68!2m2!1sset!2sRoadmap!4e1!6m6!1e12!2i2!28e3!39b1!44e2!50e0' -H 'Referer: https://www.google.com/maps/_/js/k=maps.m.en.MQjize_OSyY.O/m=npm,wte,vw/rt=j/d=1/ed=1/exm=/rs=ACT90oGsAb_R5Wfu0Yk-GEzceAGaUAdIbg' -H 'User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36' --compressed
It's always useful to look at the raw data, even if binary:
$ hexdump -C $file
00000000 58 48 52 31 00 00 69 8e 06 00 91 91 93 8a 8b 3d |XHR1..i........=|
00000010 31 99 83 78 18 98 89 9f 93 98 8b 9a b9 6e b2 91 |1..x.........n..|
00000020 1d be 12 cb d5 dc 96 91 81 91 9b 9b 9b 96 d2 d3 |................|
00000030 df c9 9b 9b 98 65 9b 9b 9b b8 93 98 9b 9b 9b 81 |.....e..........|
[...]
The XHR1 is a good start, and might be a file signature. Searching for XHR1 + maps led me to this blog post from 2017 on reverse engineering Google Maps' formats. This section seems relevant:
The new one, integrated to the web application just uses binary Protobuf — again, that’s a second form of Protobuf in the web app. A single-byte xor is used instead of RC4, what a ramp up in efficiency. That’s the binary data sent through AJAX that we mentioned earlier, encapsulated in a mere length-value container format (signature “XHR1”). Transmitted data has mostly the same meaning as in the old (I made a small script to mostly render it to SVG just to spend time, it kind of works but would hardly be useful to a lot of people and could attract C&D’s, so this is left as an exercise for the reader).
The next four bytes after the signature are 00 00 69 8e, which does look like a 32-bit integer representing a length (27022 if read as big-endian). Try converting that to an integer N and reading the next N bytes of the message. If it still doesn't look like anything, maybe the XOR "encryption" is still in use. Try XORing that byte string with every value from 0 to 255 and see whether anything useful pops up, either a known signature or readable text.
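Here is a rough Python sketch of those two steps. The layout it assumes (the 4-byte "XHR1" signature, then a 4-byte big-endian length, then the payload) is only my reading of the hexdump and the blog post, not a documented format, and the function names are made up for illustration. The "printable ASCII ratio" score is just one crude heuristic for spotting text; binary Protobuf will not be mostly printable, so treat the ranking as a hint and inspect the top candidates by eye.

```python
import struct


def parse_xhr1(blob: bytes) -> bytes:
    """Return the payload of a blob assumed to be
    b'XHR1' + 4-byte big-endian length + payload (an unverified guess)."""
    if blob[:4] != b"XHR1":
        raise ValueError("missing XHR1 signature")
    (length,) = struct.unpack(">I", blob[4:8])
    # Slicing silently truncates if the blob is shorter than the stated length.
    return blob[8:8 + length]


def xor_bytes(data: bytes, key: int) -> bytes:
    """XOR every byte of data with a single-byte key."""
    return bytes(b ^ key for b in data)


def printable_ratio(data: bytes) -> float:
    """Fraction of bytes that are printable ASCII (0x20 to 0x7e)."""
    if not data:
        return 0.0
    return sum(32 <= b < 127 for b in data) / len(data)


def rank_xor_keys(payload: bytes, top: int = 5):
    """Try all 256 single-byte keys; return the best (ratio, key) pairs."""
    scored = [(printable_ratio(xor_bytes(payload, k)), k) for k in range(256)]
    scored.sort(reverse=True)
    return scored[:top]
```

To use it, read the saved response in binary mode (for example open(path, "rb").read()), pass it to parse_xhr1, then print the first few dozen bytes of xor_bytes(payload, key) for each key returned by rank_xor_keys.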
Do note, though, that the blog post above contains examples of Cease and Desist letters from Google to people who tried scraping Maps data this way.