What I am trying to do is create a 12 character id for articles on my website similar to how youtube handles their video id (http://www.youtube.com/watch?v=53iddd5IcSU). Right now I am generating an MD5 hash and then grabbing 12 characters of it like this:
$ArticleId = substr(MD5("Article".$currentID),10,12)
where $currentID is the numeric ID from the database (eg 144)
I am slightly paranoid that I will run into a duplicate $ArticleId, but realistically what are the chances that this will happen? And also, being that the column in my database is unique, how can I handle this rare scenario without having an ugly error thrown?
P.S. I made a small script to check for duplicates within the first 5000 $ArticleId's and there were none.
EDIT: I don't like the way the base64_encode hashes look so I did this:
function retryAID($currentID)
{
    $AID = substr(MD5("Article".$currentID*2),10,12);
    $setAID = "UPDATE `table` SET  `artID` =  '$AID' WHERE `id` = $currentID ";
    mysql_query($setLID) or retryAID($currentID);
}
$AID = substr(MD5("Article".$currentID),10,12);
$setAID = "UPDATE `table` SET  `artID` =  '$AID' WHERE `id` = $currentID ";
mysql_query($setAID) or retryAID($currentID);
Since the AID column is unique the mysql_query will throw an error and the retryAID function will find a unique id...
If MD5 hashes any arbitrary string into a 32-digit hex value, then according to the Pigeonhole Principle surely this can not be unique, as there are more unique arbitrary strings than there are unique 32-digit hex values.
For example, the MD5 hash is always 128 bits long (commonly represented as 16 hexadecimal bytes). Thus, there are 2^128 possible MD5 hashes.
Generally, two files can have the same md5 hash only if their contents are exactly the same. Even a single bit of variation will generate a completely different hash value. There is one caveat, though: An md5 sum is 128 bits (16 bytes).
This is because each hash value is unique, even when users reuse their passwords. Salting adds another layer of security to thwart rainbow table attacks. Hashing can also be used when analyzing or preventing file tampering. This is because each original file generates a hash and stores it within the file data.
What's wrong with using a sequential id? The database will handle this for you.
That aside, 12 characters is still 96 bits. 296 = 79228162514264337593543950336 possible hashes. Even though MD5 is known to have collision vulnerabilities, there's a world of difference between the possibility of a collision and the probability of actually seeing one.
Based on the return value of the PHP md5 function you're using, my numbers above aren't quite right.
Returns the hash as a 32-character hexadecimal number.
Since you're taking 12 characters from a 32-character hexadecimal number (and not 12 bytes of the 128-bit hash), the actual number of possible hashes you could end up with is 1612 = 281474976710656. Still quite a few.
<?php
  function get_id()
  {
    $max = 1679615; // pow(36, 4) - 1;
    $id = '';
    for ($i = 0; $i < 3; ++$i)
    {
      $r = mt_rand(0, $max);
      $id .= str_pad(base_convert($r, 10, 36), 4, "0", STR_PAD_LEFT);
    }
    return $id;
  }
?>
Returns a 12 character number in base-36, which gives 4,738,381,338,321,616,896 possibilities. (The probability of collision depends on the distribution of the random number generator.)
To ensure no collisions, you'll need to loop:
<?php
do {
  $id = get_id();
} while ( !update_id($id) );
?>
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With