Downloading a webpage and associated resources to a WARC in python

Question

I'd like to download a bunch of webpages for later analysis. There are two things that I would like to do:

Download the page and associated resources (images, multiple pages associated with an article, etc) to a WARC file.
Change all links to point to the now-local files.

I would like to do this in Python.

Are there any good libraries for doing this? Scrapy seems designed to scrape whole websites rather than single pages, and I'm not sure how to generate WARC files with it. Calling out to wget is a workable solution if there isn't something more Python-native. Heritrix is complete overkill, and not much of a Python solution. wpull would be ideal if it had a well-documented Python library, but it seems to be mostly an application instead.

Any other ideas?

raffaele messuti · Accepted Answer

Just use wget; it is the simplest and most stable tool for crawling the web and saving the result to a WARC.

See man wget, or start with these two options:

--warc-file=FILENAME        save request/response data to a .warc.gz file
-p,  --page-requisites           get all images, etc. needed to display HTML page
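Since the question mentions calling out to wget from Python, here is a minimal sketch of how the two options above might be combined through subprocess. The function names build_wget_command and download_to_warc are made up for illustration, and the sketch assumes a wget build with WARC support is available on PATH.

```python
import shutil
import subprocess


def build_wget_command(url, warc_basename):
    """Return the wget argument list for archiving one page and its requisites."""
    return [
        "wget",
        "--page-requisites",             # long form of -p: images, CSS, etc.
        f"--warc-file={warc_basename}",  # wget adds the .warc.gz extension itself
        url,
    ]


def download_to_warc(url, warc_basename):
    """Run wget; raises CalledProcessError if wget exits with a non-zero status."""
    if shutil.which("wget") is None:
        raise RuntimeError("wget was not found on PATH")
    subprocess.run(build_wget_command(url, warc_basename), check=True)
```

Calling download_to_warc("https://example.com/", "example") should leave an example.warc.gz in the current directory, alongside the files wget normally downloads. Passing the arguments as a list rather than a shell string avoids quoting problems with URLs that contain characters such as & or ?.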

Note that you don't have to change any links: the WARC preserves the original web pages. Making the WARC content browsable again is the job of replay software (OpenWayback, pywb).

If you need to stay in Python, internetarchive/warc is the default library.

If you want to craft a WARC file manually, take a look at ampoffcom/htmlwarc.
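To give a rough idea of what crafting a WARC file by hand involves, here is a standard-library sketch that builds a single WARC/1.0 record. It is not how htmlwarc or wget do it: the helper names make_warc_record and append_record are hypothetical, and the sketch uses the simpler "resource" record type, which stores the payload directly, whereas wget writes "request" and "response" records that contain the full HTTP exchange.

```python
import gzip
import uuid
from datetime import datetime, timezone


def make_warc_record(url, payload, content_type="text/html"):
    """Build one WARC/1.0 'resource' record (headers + payload) as bytes."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.0",
        "WARC-Type: resource",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {now}",
        f"WARC-Target-URI: {url}",
        f"Content-Type: {content_type}",
        f"Content-Length: {len(payload)}",  # length of the payload in bytes
    ]
    # Header lines end with CRLF; a blank line separates them from the payload,
    # and two CRLFs terminate the record.
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    return head + payload + b"\r\n\r\n"


def append_record(path, record):
    """Append the record to a .warc.gz file as its own gzip member."""
    with open(path, "ab") as fh:
        fh.write(gzip.compress(record))
```

For example, append_record("page.warc.gz", make_warc_record("https://example.com/", html_bytes)) adds one record per call, where html_bytes is the page body as bytes. Compressing each record separately keeps individual records addressable inside the .warc.gz file.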

Tags:

python

html

web-scraping

warc

Andrew Spott