
Getting the correct encoding from PHP cURL

(see update at bottom of post)

Using the Chrome network logger, I notice a given XHR request:

Request Headers

GET ... HTTP/1.1
Host: ...
Connection: keep-alive
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36
Origin: ...
Authorization: Jra45648WwbbQ
Accept: */*
Referer: ...
Accept-Encoding: gzip, deflate, sdch, br
Accept-Language: en-US,en;q=0.8

Response Headers

HTTP/1.1 200 OK
Access-Control-Allow-Credentials: true
Access-Control-Allow-Headers: Authorization, Origin, Content-Type, Accept, Referer, User-Agent, deportes
Access-Control-Allow-Methods: GET, POST, PUT, DELETE, OPTIONS
Access-Control-Allow-Origin: ...
Access-Control-Expose-Headers: Authorization, x-request-id, x-mlbam-reply-after
Content-Type: application/octet-stream
Date: Sun, 16 Apr 2017 ... GMT
Server: nginx/1.11.3
Vary: Accept
X-Request-ID: ...
Content-Length: 16
Connection: keep-alive

The response content is @ EqV¡^MSÁ9

Perfect. This is the correct output.

Now, I need to recreate this exact exchange within PHP using cURL. So I duplicate the request using the same headers.

    $ch = curl_init();
    curl_setopt_array($ch, array(
        CURLOPT_URL => $url,
        CURLOPT_HTTPHEADER => $headers,
        CURLOPT_ENCODING => 'gzip',
        CURLOPT_RETURNTRANSFER => true
    ));

However, the output here is @ EqV–¡^MSƒÁ’9, which is clearly different.

I need to get it in the original format (@ EqV¡^MSÁ9), because the output from the PHP script will eventually be served to a JavaScript script, and charCodeAt returns different values for these two outputs. I'm not sure how to approach this problem.
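
Because the response is served as application/octet-stream, one way to compare the two results without depending on how an editor or browser renders them is to look at the raw bytes. The following is a minimal sketch; the helper name hexBytes and the sample bytes are hypothetical and are not the actual response from the request above.

```php
<?php
// Render a binary string as space-separated hex bytes so that two
// responses can be compared without any charset interpretation.
function hexBytes(string $raw): string
{
    return implode(' ', str_split(bin2hex($raw), 2));
}

// Hypothetical sample bytes for illustration, NOT the real response.
$sample = "\x40\x96\xA1";

// strlen() counts bytes, so it can be checked against Content-Length.
echo strlen($sample), " bytes: ", hexBytes($sample), "\n";
```

Running the same helper on the string returned by curl_exec and checking that strlen matches the Content-Length header (16 here) would show whether cURL already has the right bytes and only the display differs.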

[Image: Example of the two different outputs in Notepad++]

As you can see, after the XHR request, the response preview in Chrome is correct:

[Image: Chrome Network Logger Preview]

If I change the encoding type of my PHP page's output to Western (ISO-8859-15), I get @ EqV¡^MSÁ9.

And if I paste that output into Notepad++, I get something very, very similar to what I want, but still slightly different (in this case, different by one single character). So maybe this is very close to the encoding I need?

[Image: Encoding]

How can I find the encoding I need? What is the default encoding of Chrome, since it seems to handle the response just fine?

UPDATE: I tested with a new value, òÝD¶0v¢ÔL·ßÎO Ó, and using mb_convert_encoding($r, 'utf-8', 'ISO-8859-15') gave me the correct result. So why does converting that particular response (@ EqV¡^MSÁ9) give me a value that is one character short?
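
The extra characters in the cURL output (–, ƒ, ’) are consistent with bytes in the 0x80 to 0x9F range being displayed as Windows-1252, where that range holds printable punctuation, while ISO-8859-15 treats the same range as invisible control characters. That is an inference from the rendered text, not something the post confirms. A small sketch of the difference, using hypothetical bytes and assuming the mbstring extension is available:

```php
<?php
// Hypothetical bytes for illustration; 0x96 lies in the 0x80-0x9F range.
$raw = "\x56\x96\xA1";

// Interpret the same bytes under two single-byte encodings.
$asLatin9 = mb_convert_encoding($raw, 'UTF-8', 'ISO-8859-15');
$asCp1252 = mb_convert_encoding($raw, 'UTF-8', 'Windows-1252');

// ISO-8859-15 maps 0x96 to the control character U+0096 (not visible).
echo bin2hex($asLatin9), "\n";
// Windows-1252 maps 0x96 to the en dash U+2013 (visible).
echo bin2hex($asCp1252), "\n";
```

Under that assumption both conversions keep every byte; they differ only in whether the result is visible on screen, which could make one output look shorter than the other.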

X33 asked Sep 01 '25 17:09


2 Answers

Chrome's default encoding is UTF-8, and if you set it to UTF-8 with
curl_setopt($ch, CURLOPT_ENCODING, 'UTF-8'); your text will be as expected. You can try that here.
Also, detecting the encoding is painful, since mb_detect_encoding can run into many issues, but in this case it can be helpful if you specify the expected order of detection, like so:

mb_detect_encoding($val, 'UTF-8,ISO-8859-15');

In my personal experience, it is worthless unless you specify the target encodings in the right order. For example, you need to list UTF-8 before ISO-8859-1 in your encoding_list, or it will return ISO-8859-1 in most cases.
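
One reason the order matters is that single-byte encodings such as ISO-8859-1 and ISO-8859-15 accept any byte sequence, whereas UTF-8 rejects many. The sketch below illustrates this with mb_check_encoding, a validity check rather than the detection call used above, so it is a swapped-in technique. The helper name firstValidEncoding is hypothetical, and the mbstring extension is assumed.

```php
<?php
// Return the first encoding in $candidates for which $value is a valid
// byte sequence, or null if none match. Stricter encodings should be
// listed first, because single-byte encodings accept anything.
function firstValidEncoding(string $value, array $candidates): ?string
{
    foreach ($candidates as $encoding) {
        if (mb_check_encoding($value, $encoding)) {
            return $encoding;
        }
    }
    return null;
}

// "\xC3\xA9" is valid UTF-8; a lone "\xE9" is not.
var_dump(firstValidEncoding("\xC3\xA9", ['UTF-8', 'ISO-8859-15']));
var_dump(firstValidEncoding("\xE9", ['UTF-8', 'ISO-8859-15']));
```

With the list reversed, ISO-8859-15 would match both inputs, which mirrors the behavior described above.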

UPDATE:
The docs say that CURLOPT_ENCODING => '' handles all encodings, so you can try that. But as I said, since you are dealing with a known encoding, which is UTF-8, please try:

$ch = curl_init();
curl_setopt_array($ch, array(
    CURLOPT_URL => $url,
    CURLOPT_HTTPHEADER => $headers,
    CURLOPT_ENCODING => 'UTF-8',
    CURLOPT_RETURNTRANSFER => true
));
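
For context, CURLOPT_ENCODING sets the Accept-Encoding request header and controls decompression of the response body (gzip, for example); it does not select a character set, so it may be worth checking whether the bytes themselves are already correct. The sketch below fetches the body as raw bytes and encodes each byte as an integer, so that a JavaScript consumer receives exact values regardless of the page's charset. The function names fetchRawBytes and bytesToJson are hypothetical, and $url and $headers are placeholders supplied by the caller.

```php
<?php
// Fetch a URL and return the body as raw bytes, or null on failure.
// Requires the cURL extension when called.
function fetchRawBytes(string $url, array $headers = []): ?string
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL            => $url,
        CURLOPT_HTTPHEADER     => $headers,
        CURLOPT_ENCODING       => '',   // accept any compression cURL supports
        CURLOPT_RETURNTRANSFER => true,
    ]);
    $body = curl_exec($ch);
    return $body === false ? null : $body;
}

// Encode each byte as an unsigned integer in a JSON array, so no
// character-set conversion can alter the values on the way to the browser.
function bytesToJson(string $raw): string
{
    return json_encode(array_values(unpack('C*', $raw)));
}
```

On the JavaScript side, such an array could be read directly as numbers instead of calling charCodeAt on a decoded string.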

Ryad Boubaker answered Sep 04 '25 06:09


You can attempt to detect the encoding of the octet stream and then convert it to a known charset.

$result = curl_exec($ch);
curl_close($ch);
echo mb_detect_encoding($result);
// mb_convert_encoding takes (string, to_encoding, from_encoding)
$resultUTF8 = mb_convert_encoding($result, 'UTF-8', 'ISO-8859-15');

JasonBoss answered Sep 04 '25 08:09