I have a website that I want to monitor for changes, specifically in one DIV in the HTML. I was using http://www.followthatpage.com/ to monitor a webpage for changes, but I ran into two issues:
Ideally, I would like to write a bash or Python script that diffs two downloaded copies of the page every 15 minutes and emails any changes. I was thinking I could download the two files, run the diff command on them, and set it up as a cron job that emails me if there are changes, but I still don't know how to filter down to a specific DIV.
Is there an easier way than figuring out how to do this myself (an existing script)? If not, what would be the best method to do this?
Another way to do it, if you have access to a Linux terminal, is to add a cron job:
$ crontab -e
and add the following line (it runs every day at 16:00):
0 16 * * * diff_web_page.sh
where the contents of diff_web_page.sh are:
#!/bin/bash
URL="http://linux.die.net/man/1/bash"
TMP_FILE="/tmp/diff_page.txt"
if [[ ! -f "$TMP_FILE" ]]; then
    # First run: create the baseline file and exit.
    lynx -dump "$URL" &> "$TMP_FILE"
    # To keep only a div, filter the raw HTML instead (lynx -dump renders text and drops the tags):
    # lynx -source "$URL" | pcregrep -M '(?s)<div>.*?</div>' > "$TMP_FILE"
else
    # The file exists: grab the new version and compare it with the old one.
    lynx -dump "$URL" &> "$TMP_FILE.new"  # use the pcregrep pipeline here too if required
    diff -Npaur "$TMP_FILE" "$TMP_FILE.new"
    mv "$TMP_FILE.new" "$TMP_FILE"
fi
Cron mails any output the job produces to the owner of the crontab (user@host on the Linux box running the job), so this will email the diff of the web page whenever a run finds a change.
If you want a specific div, you can filter the output with pcregrep -M when fetching the web page with lynx.
Since the div you want is specific to the site, you will probably have to set up a simple check.
This consists of fetching the page with urllib.urlopen(URL) (urllib.request.urlopen(URL) in Python 3) or requests.get(URL) and extracting the data you need. Figuring out what to extract, and how, is going to take you the longest time. I recommend using Developer Tools in Chrome/Firefox.
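Apart from the site-specific extraction, the check itself is small. Here is a minimal sketch of one way to do it, assuming the extracted value is kept as text in a state file between runs. The function name has_changed and the file name last_value.txt are made up for illustration, not part of any library.

```python
import tempfile
from pathlib import Path


def has_changed(new_value: str, state_file: Path) -> bool:
    """Return True if new_value differs from the value saved on the last run.

    The first run (no state file yet) only records the value and returns False.
    """
    previous = state_file.read_text(encoding="utf-8") if state_file.exists() else None
    # Always store the latest value so the next run compares against it.
    state_file.write_text(new_value, encoding="utf-8")
    return previous is not None and previous != new_value


if __name__ == "__main__":
    # Demo in a throwaway directory; a real job would use a fixed path.
    with tempfile.TemporaryDirectory() as tmp:
        state = Path(tmp) / "last_value.txt"  # hypothetical file name
        print(has_changed("582,417", state))  # False: first run, baseline only
        print(has_changed("582,417", state))  # False: same value as before
        print(has_changed("582,418", state))  # True: the value changed
```

Treating the first run as "no change" mirrors the bash script above, which only creates its baseline file the first time it runs.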
Let's say we want to know when the counter updates on digitalocean.com. The div for the counter looks like this:
<div class='inner'>
<span class='count'>5</span>
<span class='count'>8</span>
<span class='count'>2</span>
<span class='count_delimiter'>,</span>
<span class='count'>4</span>
<span class='count'>1</span>
<span class='count'>7</span>
</div>
Sadly, there's no id, which would be really easy to pull out using BeautifulSoup4 (e.g. soup.find(id="counter")).
Instead, I would elect to pull out all the inner elements that have class "count".
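import requests
from bs4 import BeautifulSoup

resp = requests.get('https://www.digitalocean.com')
resp.raise_for_status()  # stop on HTTP errors instead of parsing an error page
# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(resp.text, 'html.parser')
# Collect the text of every element with class "count", in document order.
digits = [tag.get_text() for tag in soup.find_all(class_="count")]
count = int(''.join(digits))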
BeautifulSoup has excellent documentation for parsing out HTML documents without having to bang your head (depending on how well laid out the site you're scraping is).
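To report what changed, not just that something changed, the text pulled out of the div can be compared with the previous snapshot using the standard-library difflib module. This is only a sketch: the helper name text_diff and the sample strings are made up for illustration, and it assumes the old and new text are already available as strings.

```python
import difflib


def text_diff(old: str, new: str) -> str:
    """Return a unified diff of two text snapshots, or '' if they are identical."""
    lines = difflib.unified_diff(
        old.splitlines(),
        new.splitlines(),
        fromfile="previous",
        tofile="current",
        lineterm="",
    )
    return "\n".join(lines)


if __name__ == "__main__":
    report = text_diff("count: 582,417", "count: 582,418")
    if report:
        # Print only when something changed, so a cron job stays silent otherwise.
        print(report)
```

If the script runs from cron, printing the report only when it is non-empty means mail is sent only for real changes, assuming cron mail delivery is set up on the machine.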