I'm trying to figure out a good hash function for a std::pair of two primitive types. This is the way I have it implemented right now:
template<typename T, typename U>
std::size_t operator()(const std::pair<T,U> &rhs) const
{
    return stdext::hash_value<T>(rhs.first) ^ stdext::hash_value<U>(rhs.second);
}
It appears to work even if I have two pairs such as (1, 2) and (2, 1) (numbers flipped). They generate the same hash value but the values are still inserted successfully into the hash map. Any thoughts?
Generally speaking, hashing containers always have to handle this case (hash collisions). There are a couple methods they can use like chaining and probing, any of which will potentially hurt performance.
Instead, I would suggest using boost::hash_combine to combine the hashes in such a way they swapping first and second doesn't generate the same hash.
Here's an implementation of hash_combine based on the docs from the current version of boost:
(http://www.boost.org/doc/libs/1_53_0/doc/html/hash/reference.html#boost.hash_combine)
template<typename T> void hash_combine(size_t & seed, T const& v) {
  seed ^= stdext::hash_value(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}
You'd use it like this:
template<typename T, typename U>
std::size_t operator()(const std::pair<T,U> &rhs) const   {
  size_t retval = stdext::hash_value<T>(rhs.first);
  hash_combine(retval, rhs.second);
  return retval;
}
I can't actually vouch for the logic behind this function, but it looks more solid than most of the options mentioned here.
Assuming that stdext::hash_value provides a good distribution of hash values for each of first and second, what you've done is fine if you don't expect a disproportionately high incidence of mirror image pairs (e.g. (1,2) and (2,1)). If your data set is such that you do expect a lot of those, you might consider tweaking the hash function to force them to be different. For example, inverting the first hash value:
return ~stdext::hash_value<T>(rhs.first) ^ stdext::hash_value<T>(rhs.second);
I mention this only because you expressed a concern about mirror image pairs. If your input is close to random in that regard, then ^ should be fine. Remember, the goal is to minimize collisions, not avoid them entirely.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With