Sometimes when trying to scrape Instagram media by appending ?__a=1 to the end of the URL, for example:
https://www.instagram.com/p/CP-Kws6FoRS/?__a=1
the following response is returned:
{
    "__ar": 1,
    "error": 1357004,
    "errorSummary": "Sorry, something went wrong",
    "errorDescription": "Please try closing and re-opening your browser window.",
    "payload": null,
    "hsrp": {
        "hblp": {
            "consistency": {
                "rev": 1005622141
            }
        }
    },
    "lid": "7104767527440109183"
}
Why is this response returned, and what should I do to fix it? Also, is there another way to get the video and photo URLs?
I solved this problem by adding &__d=dis to the query string at the end of the URL, like so: https://www.instagram.com/p/CFr6G-whXxp/?__a=1&__d=dis
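For reference, here is a minimal sketch of that request (my assumption: the endpoint still wants a browser-like user-agent and, in many cases, cookies from a logged-in session, otherwise you may get the error JSON or a login redirect):

import requests

# the shortcode from the URL above; replace it with the post you want.
url = "https://www.instagram.com/p/CFr6G-whXxp/?__a=1&__d=dis"
# a browser-like user-agent; pass logged-in cookies as well if you still see the error JSON.
headers = {"User-Agent": "Mozilla/5.0"}

response = requests.get(url, headers=headers)
data = response.json()  # assuming JSON comes back rather than a login page
print(list(data.keys()))  # inspect the structure to locate the image/video URLs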
I believe I may have found a workaround using:

- https://i.instagram.com/api/v1/users/web_profile_info/?username={username} to get the user's info and recent posts. data.user from the response is the same as graphql.user from https://i.instagram.com/{username}/?__a=1.
- <meta property="al:ios:url" content="instagram://media?id={media_id}"> in the HTML response of https://instagram.com/p/{post_shortcode}.
- https://i.instagram.com/api/v1/media/{media_id}/info using the extracted media id to get the same response as https://instagram.com/p/{post_shortcode}/?__a=1.

A couple of important points:

- The user-agent used in the script is important. I found that the one Firefox generated when re-sending requests in the dev tools returned the "Sorry, something went wrong" error, so the script uses an Instagram app user-agent instead.
- cookiejar = browser_cookie3.chrome(domain_name='instagram.com') can be used in place of the firefox call in the script if your Instagram session cookies are in Chrome.
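Before running the full script, a quick sanity check can help (a sketch, assuming your Instagram session cookies are in Chrome; swap in browser_cookie3.firefox to match the script below):

import browser_cookie3
import requests

# same Instagram app user-agent as in the script below.
headers = {
    "User-Agent": "Mozilla/5.0 (Linux; Android 9; GM1903 Build/PKQ1.190110.001; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/75.0.3770.143 Mobile Safari/537.36 Instagram 103.1.0.15.119 Android (28/9; 420dpi; 1080x2260; OnePlus; GM1903; OnePlus7; qcom; sv_SE; 164094539)"
}
cookiejar = browser_cookie3.chrome(domain_name='instagram.com')

# hit the profile endpoint for any public account; a 200 with JSON suggests the
# cookies and user-agent are accepted, while a 4xx usually means login/UA trouble.
resp = requests.get(
    "https://i.instagram.com/api/v1/users/web_profile_info/?username=instagram",
    cookies=cookiejar,
    headers=headers,
)
print(resp.status_code)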
Here's the full script. Let me know if this is helpful!
import os
from datetime import datetime, timedelta

import bs4 as bs
import browser_cookie3
import requests

# setup.
username = "<username>"
output_path = "C:\\some\\path"
headers = {
    "User-Agent": "Mozilla/5.0 (Linux; Android 9; GM1903 Build/PKQ1.190110.001; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/75.0.3770.143 Mobile Safari/537.36 Instagram 103.1.0.15.119 Android (28/9; 420dpi; 1080x2260; OnePlus; GM1903; OnePlus7; qcom; sv_SE; 164094539)"
}


def download_post_media(post: dict, media_list: list, number: int):
    # download one media item of a post into <output_path>/<username>/.
    output_filename = f"{output_path}/{username}"
    if not os.path.isdir(output_filename):
        os.mkdir(output_filename)
    post_time = datetime.fromtimestamp(int(post["taken_at_timestamp"])) + timedelta(hours=5)
    output_filename += f"/{username}_{post_time.strftime('%Y%m%d%H%M%S')}_{post['shortcode']}_{number}"
    current_media_json = media_list[number - 1]
    if current_media_json['media_type'] == 1:
        media_type = "image"
        media_ext = ".jpg"
        media_url = current_media_json["image_versions2"]['candidates'][0]['url']
    elif current_media_json['media_type'] == 2:
        media_type = "video"
        media_ext = ".mp4"
        media_url = current_media_json["video_versions"][0]['url']
    else:
        # unknown media type, skip it.
        return
    output_filename += media_ext
    response = send_request_get_response(media_url)
    with open(output_filename, 'wb') as f:
        f.write(response.content)


def send_request_get_response(url):
    # send a GET request with the Instagram app user-agent and the logged-in browser's cookies.
    cookiejar = browser_cookie3.firefox(domain_name='instagram.com')
    return requests.get(url, cookies=cookiejar, headers=headers)


# use the /api/v1/users/web_profile_info/ api to get the user's information and their most recent posts.
profile_api_url = f"https://i.instagram.com/api/v1/users/web_profile_info/?username={username}"
profile_api_response = send_request_get_response(profile_api_url)
# data.user is the same as graphql.user from ?__a=1.
timeline_json = profile_api_response.json()["data"]["user"]["edge_owner_to_timeline_media"]

for post in timeline_json["edges"]:
    # get the HTML page of the post.
    post_response = send_request_get_response(f"https://instagram.com/p/{post['node']['shortcode']}")
    html = bs.BeautifulSoup(post_response.text, 'html.parser')
    # find the meta tag containing the link to the post's media.
    meta = html.find(attrs={"property": "al:ios:url"})
    media_id = meta.attrs['content'].replace("instagram://media?id=", "")
    # use the media id to get the same response as ?__a=1 for the post.
    media_api_url = f"https://i.instagram.com/api/v1/media/{media_id}/info"
    media_api_response = send_request_get_response(media_api_url)
    media_json = media_api_response.json()["items"][0]
    media = list()
    if 'carousel_media_count' in media_json:
        # multiple media post (carousel).
        for m in media_json['carousel_media']:
            media.append(m)
    else:
        # single media post.
        media.append(media_json)
    media_number = 0
    for m in media:
        media_number += 1
        download_post_media(post['node'], media, media_number)