Friday, April 15, 2011

Is there a CPAN module that digests a short string into a short number?

I need to create unique numerical ids for some short strings.

some.domain.com    -> 32423421
another.domain.com -> 23332423
yet.another.com    -> 12131232

Is there a Perl CPAN module that will do something like this?

I've tried using Digest::MD5 but the resulting numbers are too long:

some.domain.com    -> 296800572457176150356613937260800159845
From Stack Overflow
  • Just take the first 8 digits of the MD5 hash. This works because MD5 is uniformly distributed over its hash address space. This means that any consecutive sequence of MD5 hash digits will itself be a uniformly distributed hash.

    Alternatively, use some other uniformly distributed hash function that returns 8 digits. Whatever's easiest for you.

    git-noob : but then the probability of a collision goes up?
    John Feminella : That's right, but your probability of a collision always goes up when you reduce the address space. You'd have precisely the same problem using a shorter hash no matter how it's created.
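
    A minimal sketch of the truncation idea, assuming the core Digest::MD5 module. The helper names short_id and collision_probability are hypothetical, and the 8-digit limit is taken from the question. Rather than slicing decimal digits, this reads the first 4 bytes of the digest as an integer and reduces it modulo 10^8, which introduces a small bias. The second helper uses the standard birthday approximation to estimate how quickly collisions become likely in such a small id space.

    ```perl
    use strict;
    use warnings;
    use Digest::MD5 qw(md5);

    # Hypothetical helper: map a string to an integer of at most 8 decimal digits.
    sub short_id {
        my ($str) = @_;
        # First 4 bytes of the raw 16-byte digest, read as an unsigned
        # 32-bit big-endian integer.
        my $n = unpack("N", md5($str));
        # Reduce to the range 0 .. 99_999_999 (slightly biased by the modulo).
        return $n % 100_000_000;
    }

    # Birthday approximation: chance that at least two of $n distinct strings
    # share an id when ids are spread uniformly over $space values.
    sub collision_probability {
        my ($n, $space) = @_;
        return 1 - exp(-$n * ($n - 1) / (2 * $space));
    }

    printf "%-20s -> %08d\n", $_, short_id($_)
        for qw(some.domain.com another.domain.com yet.another.com);
    printf "P(collision) for 10,000 strings: %.3f\n",
        collision_probability(10_000, 100_000_000);
    ```

    With 10^8 possible ids, the approximation gives roughly a 39% chance of at least one collision among 10,000 strings, so ids this short are only safe for small sets unless collisions are detected and handled.
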
  • Either Digest::CRC or String::CRC32. The first gives you the option to calculate 8-, 16- and 32-bit checksums, while the second only supports 32-bit.
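
    As a rough illustration of the CRC suggestion, the sketch below assumes Digest::CRC is installed from CPAN (it is not a core module) and uses its functional interface. The sample host names come from the question. The resulting numbers are not guaranteed unique, and the 8- and 16-bit variants have only 256 and 65,536 possible values.

    ```perl
    use strict;
    use warnings;
    use Digest::CRC qw(crc8 crc16 crc32);

    for my $host (qw(some.domain.com another.domain.com yet.another.com)) {
        # crc32 returns an unsigned integer below 2**32 (at most 10 decimal digits).
        printf "%-20s -> crc32=%u crc16=%u crc8=%u\n",
            $host, crc32($host), crc16($host), crc8($host);
    }
    ```

    String::CRC32 also exports a crc32 function for strings, so it could be swapped in if only the 32-bit checksum is needed.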

  • Given that the strings look like host names, perhaps you could just resolve them to IP addresses and present each address as an integer?

    Kind of like:

    perl -le 'my $ip = gethostbyname("depesz.com"); my $num = unpack("N", $ip); print $num'
    1311657670
    
    innaM : What if they all point to the same IP? There are IPs out there that serve some 10 million host names.
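
    A slightly more defensive version of the one-liner above, as a sketch only. It assumes the core Socket module, and host_to_int is a hypothetical helper name. It returns undef when a name does not resolve instead of unpacking an undefined value. The sample call uses a dotted-quad address from the documentation range, so no DNS lookup is needed for it. As the comment above points out, distinct host names that share an address will still map to the same number.

    ```perl
    use strict;
    use warnings;
    use Socket qw(inet_aton);

    # Hypothetical helper: IPv4 address of $host as an unsigned 32-bit integer,
    # or undef if the name cannot be resolved.
    sub host_to_int {
        my ($host) = @_;
        # inet_aton resolves host names and also accepts dotted-quad strings.
        my $packed = inet_aton($host);
        return defined $packed ? unpack("N", $packed) : undef;
    }

    my $num = host_to_int("192.0.2.1");
    print defined $num ? $num : "unresolved", "\n";   # prints 3221225985
    ```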
