I am going to build a site using Google App Engine. My public site contains thousands of pictures. I want to store these pictures in the Cloud: Google Storage or Amazon S3 or Google App Engine BlobStore. The problem is image hotlinking.
As for Google Storage, I googled and I can't find a way to prevent image hotlinking. (I like its command-line tool gsutil very much, though.)
Amazon S3 has "Query String Authentication", which generates expiring image URLs. But this is very bad for SEO, isn't it? Constantly changing the URL would have quite negative effects, as it takes upwards of a year to get an image, and its related URL, into Google Images. I am rather sure changing this URL would have an immediate negative effect when GoogleBot comes around to say hi. (UPDATE: A better way to prevent image hotlinking in Amazon S3 by referrer is to use a Bucket Policy. Details here: http://www.naveen.info/2011/03/25/amazon-s3-hotlink-prevention-with-bucket-policies/)
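For reference, a referrer-based bucket policy of the kind mentioned in the update could be generated roughly like this. It is only a sketch: the bucket name and host name are hypothetical placeholders, and the helper `hotlink_policy` is made up for illustration. The policy relies on the `aws:Referer` condition key, and since the Referer header is easy to forge, it only deters casual hotlinking.

```python
import json


def hotlink_policy(bucket, site_hosts):
    """Build an S3 bucket policy that allows GETs only for listed referrers.

    `bucket` and `site_hosts` are placeholders supplied by the caller.
    """
    referers = []
    for host in site_hosts:
        # Match any page on the site, over both http and https.
        referers.append("http://%s/*" % host)
        referers.append("https://%s/*" % host)
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowGetFromOwnPages",
                "Effect": "Allow",
                "Principal": "*",
                "Action": "s3:GetObject",
                "Resource": "arn:aws:s3:::%s/*" % bucket,
                "Condition": {"StringLike": {"aws:Referer": referers}},
            }
        ],
    }


if __name__ == "__main__":
    # Hypothetical bucket and site names.
    print(json.dumps(hotlink_policy("example-bucket", ["www.example.com"]), indent=2))
```

The printed JSON is what would be pasted into the bucket's policy editor. The objects themselves must not be publicly readable through their ACLs, otherwise the policy does not restrict anything.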
Google App Engine BlobStore? I have to upload the images manually via web interfaces, and it generates changing URLs too. (UPDATE: Due to my ignorance about Blobstore, I made a mistake here. With Google App Engine BlobStore, you can serve the image under whatever URL you want.)
What I need is simple referrer protection: Only show the image when the referrer is my site.
Are there any better ways to prevent image hotlinking? I don't want to file for bankruptcy due to the extremely high cost of cloud bandwidth.
UPDATE:
It is still difficult to choose among the three; each of them has its pros and cons. BlobStore seems to be the ultimate choice.
The easiest option would be to use the blobstore. You can provide whatever upload interface you want - it's up to you to write it - and the blobstore doesn't constrain your download URLs, only your upload ones. You can serve blobstore images under any URL simply by setting the appropriate headers, or you can use get_serving_url to take advantage of the built-in fast image serving support, which generates cryptic but consistent URLs (but doesn't let you do referer checks).
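If you serve the images yourself, the referer check could look something like the following. This is a minimal sketch written as a plain WSGI callable with only the standard library, not the Blobstore API; `ALLOWED_HOSTS` and `load_image` are hypothetical stand-ins for your own site's host names and for however you actually fetch the image bytes.

```python
from urllib.parse import urlparse

# Hypothetical host names of the site that is allowed to embed the images.
ALLOWED_HOSTS = {"www.example.com", "example.com"}


def referer_allowed(referer, allow_empty=True):
    """Return True if the Referer header points at one of our own hosts.

    An empty referer is allowed by default, because some browsers, proxies
    and crawlers do not send the header at all.
    """
    if not referer:
        return allow_empty
    return urlparse(referer).hostname in ALLOWED_HOSTS


def load_image(path):
    # Placeholder: a real handler would read the image from storage here.
    return b"fake image bytes"


def image_app(environ, start_response):
    """WSGI app that refuses image requests coming from foreign pages."""
    if not referer_allowed(environ.get("HTTP_REFERER", "")):
        start_response("403 Forbidden", [("Content-Type", "text/plain")])
        return [b"hotlinking not allowed"]
    body = load_image(environ.get("PATH_INFO", ""))
    headers = [("Content-Type", "image/png"), ("Content-Length", str(len(body)))]
    start_response("200 OK", headers)
    return [body]
```

Whether to allow an empty referer is a judgment call: rejecting it blocks more hotlinkers, but it also blocks legitimate visitors whose software strips the header.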
I would suggest giving some consideration to whether this is a real, practical problem you're facing, though. The bandwidth consumed by a few hotlinked images is pretty minimal by today's standards, and it's not a particularly common practice in the first place. As @sharth points out in the comments, it's likely to impact SEO too, since image search tends to show images in their own windows in addition to linking to the page that hosted them.
When I was coding statistical web services, I had to generate images and charts dynamically. The images generated depended on the request parameters, the state of the data repository, and some header info.
Therefore if I were you, I would write a REST web service to serve the images. Not too difficult. It's pretty ticklish too because if you don't like a particular IP address, you could show a cartoon of tongue-out-of-cheek (or an animated gif of OBL samba dancing while getting bombed) rather than the image for the data request.
For your case you would check the referer (or referrer) in the HTTP header, right? I am doubtful because people can and will hide, blank out or even fake the referer field in the HTTP header.
So, do not only check the referer field, but also add a data field (a token) whose value changes over time. Validating it could be as simple as matching the received value against the expected one.
During the Second World War, Roosevelt and Churchill communicated through encryption. They each had an identical stack of disks, which somehow contained the encryption mechanism. After each conversation, both would discard the used disk (never reusing it), so that the next time they spoke they would reach for the next disk on the stack.
Instead of a stack of disks, your image consumers and your image provider would carry the same stack of 32 bit tokens. 32 bits would give you ~4 billion ten minute periods. The stack is randomly sequenced. Since it is well known that "random generators" are not truly random and actually algorithmic in a way which can be predicted if supplied a sufficiently long sequence, you should either use a "true random generator" or resequence the stack every week.
Due to latency issues, your provider would accept tokens from the current period, the previous period and the next period. (A period here corresponds to a sector in the scheme described further down.)
Your ajax client (presumably GWT) in the browser would get an updated token from the server every ten minutes. The ajax client would use that token to request images. Your image provider service would reject a stale token, and your ajax client would then have to request a fresh token from the server.
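The rotating-token idea could be sketched like this. It is an illustration under assumptions, not a finished design: the function names, the `epoch` parameter and the wrap-around when the stack runs out are all made up for the example, and Python's `secrets` module (which draws on the operating system's random source) stands in for the "true random generator".

```python
import secrets

PERIOD_SECONDS = 600  # ten-minute periods, as described above


def make_stack(size):
    """Create a stack of random 32-bit tokens, one per period."""
    return [secrets.randbits(32) for _ in range(size)]


def period_index(now, epoch):
    """Number of whole periods elapsed between epoch and now (in seconds)."""
    return int((now - epoch) // PERIOD_SECONDS)


def current_token(stack, now, epoch):
    """Token the server hands to the ajax client for the current period."""
    # Assumption: wrap around when the stack is exhausted; in practice the
    # stack would be regenerated before that happens.
    return stack[period_index(now, epoch) % len(stack)]


def token_is_fresh(stack, token, now, epoch):
    """Accept tokens from the previous, current and next period only."""
    p = period_index(now, epoch)
    accepted = {stack[(p + delta) % len(stack)] for delta in (-1, 0, 1)}
    return token in accepted
```

The image service would call `token_is_fresh` on every request and answer with an error when it returns False, which is the client's cue to fetch a new token.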
It is not a fireproof method but it is shatterproof, so that it could reduce/discourage the number of spam requests (nearly to zero, I presume).
The way I generate "truly random" sequences is again quick and dirty. I further work on an algorithmically generated "random" sequence by spending a few minutes manually throwing in a few monkey wrenches by manually resequencing or deleting values of the sequences. That would mess up any algorithmic predictability. Perhaps, you could write a monkey wrench thrower. But an algorithmic monkey wrench thrower would simply be adding a predictable algorithm above another predictable algorithm which does not reduce the overall predictability at all.
You could further obsessively constrict the situation by employing cyclic redundancy matching as a quick and dirty "encrypted" token matching mechanism.
Let us say you have a circle divided into 8 equal sectors. You would need a 3-digit binary number to be able to address any one of the 8 sectors. Imagine each sector is further subdivided into 8 subsectors, so that you can now address each subsector with an additional 3 bits, making a total of six bits.
You plan to change the matching value every 10 minutes. Your image provider and all your approved consumers will have the same stack of sector addresses. Every ten minutes they throw away the current sector address and use the next one. When a consumer sends your provider a matching value, it does not send the sector address but a subsector address. As long as your provider receives a subsector address belonging to the currently accepted sector, the provider service responds with the correct image.
But the subsector address is remapped through an obfuscation sequencing algorithm, so that the subsector addresses within the same sector do not look similar at all. In that way, not all browsers would receive the same token value or a highly similar token value.
Let us say that you have 16-bit sector addresses and each sector has 16-bit subsector addresses, making up a 32-bit token. This means you can afford to have 65536 concurrent browser clients carrying the same token sector, with no two tokens sharing the same low-predictability value. You could then assign a token subsector value to every session id. Unless you have more than 65536 concurrent sessions to your image provider service, no two session ids would need to share the same subsector token address. In that way, unless a spammer had access to session-id faking equipment/facilities, there would be no way your image provider could be spammed except through a denial-of-service attack.
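A rough sketch of the 16-bit sector plus 16-bit subsector token follows. The obfuscation step is not specified above, so this example swaps in a simple reversible multiply-and-XOR mapping on 32-bit integers purely as an assumption; the constants and function names are made up, and the mapping is obfuscation only, not real cryptography.

```python
MASK32 = 0xFFFFFFFF
MULTIPLIER = 0x9E3779B1  # any odd number is invertible modulo 2**32
MULTIPLIER_INV = pow(MULTIPLIER, -1, 1 << 32)  # needs Python 3.8+
XOR_KEY = 0x5BD1E995  # hypothetical shared secret constant


def obfuscate(value):
    """Scramble a 32-bit value so neighbouring inputs look unrelated."""
    return ((value ^ XOR_KEY) * MULTIPLIER) & MASK32


def deobfuscate(token):
    """Exact inverse of obfuscate()."""
    return ((token * MULTIPLIER_INV) & MASK32) ^ XOR_KEY


def issue_token(sector, subsector):
    """Pack sector (high 16 bits) and subsector (low 16 bits) into a token.

    The subsector would be assigned per session, so each of up to 65536
    concurrent sessions gets a different token for the same sector.
    """
    if not (0 <= sector < 1 << 16 and 0 <= subsector < 1 << 16):
        raise ValueError("sector and subsector must fit in 16 bits")
    return obfuscate((sector << 16) | subsector)


def token_accepted(token, current_sector):
    """Accept any subsector token that belongs to the current sector."""
    return deobfuscate(token) >> 16 == current_sector
```

For example, `issue_token(0x1234, 1)` and `issue_token(0x1234, 2)` give two different-looking tokens that are both accepted while `0x1234` is the current sector and rejected once the provider moves on to the next one.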
Low predictability means that there is low probability for a snooper or peeper to concoct an acceptable token to spam your image provider service.
Certainly, normal bots would not be able to get through, unless you had really offended the Anonymous group and they decided to spam your server out of sheer fun. And even then, if you had thrown monkey wrenches into the sector address stack and subsector maps, it would be really difficult to predict the next token.
BTW, cyclic redundancy matching is actually an error-detection technique and not so much an encryption technique.