10

The domain of interest is string matching. Assume I have a structure like this.

typedef struct
{
    char *name,
    int (*function)();

} StringArray

StringArray s[] = 
{
    {"George", func1},
    {"Paul",   func2},
    {"Ringo",  func3},
    {"John",   func4},
    {"",       NULL}   /* End of list */ 
}

There are a fixed number of strings in the array. They are hard coded as in the example. If the table changes, there would be a need to re-evaluate the quality of the hash function.

I want to apply a hash function to a string, and if the string matches one in the array, then call the function. A perfect hash function is needed for this. No collisions are allowed.The purpose of requiring hashing is to get O(1) performance on the lookup.

What ideas do you have on designing a function to do this?

BenMorel
  • 34,448
  • 50
  • 182
  • 322
EvilTeach
  • 28,120
  • 21
  • 85
  • 141
  • 19
    The following page has several implementations of general purpose hash functions that are efficient and exhibit minimal collisions: http://partow.net/programming/hashfunctions/index.html –  Oct 31 '10 at 23:10
  • See the [gperf](http://www.gnu.org/software/gperf/) home page. –  Apr 09 '09 at 15:35

6 Answers6

2

The summary lists both C and C++. Which of them are you looking for? C and C++ are two distinct languages, and differ greatly in their string handling and data structures (and the fact that the C ones work in C++ doesn't change that).

Why, specifically, do you want a perfect hash function? Is it that you want to associate a string with a function, and thought that would be a good way to do it? Is this some sort of homework assignment? Do you have a reason not to use map<> in C++? (Or unordered_map<> if available?)

If you do need a perfect hash, what are the constraints on the strings? Will there be a certain fixed set you want to dispatch on? What about strings that don't match one of the set? Are you willing to accept hits from random strings, or is the number of incoming strings limited?

If you could edit your question to include information like that, we could be a lot more helpful.

EDIT (in response to the first two comments):

OK, we should look at C solutions, since you presumably want this for both your C and C++ work. You presumably want the performance, but have you tested? If we're dealing with strings coming in on the I/O system, the time there is likely to dwarf the dispatch time.

You are expecting arbitrary strings. It's a bit much to expect a perfect hash function that will avoid all collisions from random data, so you do need to consider that.

Have you considered a trie? It may be more efficient than a perfect hash function (or may not be), it should be fairly easy to implement in C, and it will avoid problems with redoing your list of dispatching strings or possible collisions.

David Thornley
  • 56,304
  • 9
  • 91
  • 158
  • I code in both c and c++, and god help me Pro*C. O(1) hashing for performance. Lol, no homework assignment. I am looking to build a tool to speed up some performance critical code. The example is made simple for discussion purposes. The real world use is not. – EvilTeach Apr 09 '09 at 16:21
  • The strings will very in length. None of them will be of length zero. As a practical limit, no string in the array will be longer than 32 characters. What the caller passes in can be of any length, but if it is longer than the strings in the table it is the case of a no-match – EvilTeach Apr 09 '09 at 16:23
0

The final result of this exercise was to

  • Steal a number of string oriented hash functions off the net.
  • Build a sort of factory class which tested each of the functions against the data set with a range of mod operator values, looking for the smallest perfect hash that works with that function.
  • That factory class default constructor returns a string, which represents a set of arguments that when use picks the correct hash function, and mod size to give the perfect hash that requires the least amount of memory.
  • under normal usage, you simply instantiate the class with the returned arguments, and the class puts itself into a working state with the desired functions.
  • That constructor validates that there are no collisions and aborts if there are.
  • In the case where no perfect hash is found, it degrades into a binary search across a sorted version of the input table.

For the set of arrays I have in my domain, this seems to work very very well. A possible future optimization would be to do the same sort of testing, on substrings of the input. In the sample case, the first letter of each musicians name is sufficient to tell them apart. Then one would need to balance the cost of the actual hash function against the memory used.

My thanks to everyone who contributed ideas.

Evil

EvilTeach
  • 28,120
  • 21
  • 85
  • 141
0

You can use map

std::string foo() { return "Foo"; }
std::string bar() { return "Bar"; }

int main()
{
   std::map<std::string, std::string (*)()> m;
   m["foo"] = &foo;
   m["bar"] = &bar; 
}
Vinay
  • 4,743
  • 7
  • 33
  • 43
  • std::map doesn't use a hash - it is tree based –  Apr 09 '09 at 15:37
  • why to invent the wheel, you can use existing libraries like map. – Vinay Apr 09 '09 at 15:41
  • 1
    perhaps the questioner wanted the performance characteristics of hashing rather than those of tree searching? –  Apr 09 '09 at 15:43
  • @Mitch: what's the problem? We should often ask ourselves not how to do this or that but why I want to use this or that. – Mykola Golubyev Apr 09 '09 at 15:43
  • Precisely. Why does EvilTeach want a hash function? There may be good reasons for exactly that, but I haven't seen them. For the general case of taking a string and executing a function, this works fine. – David Thornley Apr 09 '09 at 16:02
0

If collisions are absolutely not allowed, your only option is to keep track of every string in the database, which is probably not a best way to go.

What I would do is apply one of existing common strong hashing algorithms, such as: MD5 or SHA. There's miriads of samples all around, here's one for example: http://www.codeproject.com/KB/security/cryptest.aspx

galets
  • 17,802
  • 19
  • 72
  • 101
0

Use a balanced binary tree. Then you KNOW behaviour is ALWAYS O(logn).

I strongly dislike hashes. People don't realise how much risk they take with their algorithm. They run some test data and then deploy in the field. I have NEVER seen a deployed hash algorithm get checked for behaviour in the field.

O(log n) is almost always acceptable in lieu of O(1).

  • “O(log n) is almost always acceptable in lieu of O(1).” In many applications, this statement is flat out wrong. Just increase the number of data points to above a few millions to see this. – Konrad Rudolph Apr 09 '09 at 16:08
  • Once you've done that, test. Hashes do not give guaranteed results, unless you know in advance what all possible inputs can be. A hash function that tends to clump the input will likely not give you O(1). – David Thornley Apr 09 '09 at 16:13
  • In this case, all the inputs are known. They are sitting in the array. and input string is either an exact match, or a no-match. – EvilTeach Apr 09 '09 at 16:18
  • No, all the dispatchable inputs are known. Once you've hashed the input string, and gotten a hit, you do need to compare. – David Thornley Apr 09 '09 at 17:11
  • Konrad - yes. But the number of applications with so many datum are few and far between. They form a specific, rather rare class. This is why I said *almost always acceptable*. –  Apr 09 '09 at 17:26
-1

Well, there is no perfect hash function.

You have several that minimize collisions, but none eliminates them.

Can't advise one though :P

EDIT: The solution can't be finding a perfect hash function. The solution is to be aware of collisions. Generically a hash function has collisions. That obviously depends on the dataset and the size of the resulting hash code.

Megacan
  • 2,510
  • 3
  • 20
  • 31
  • @Adam: There's a pretty big caveat there given that it only applies when there is a distinct data set. Since the OP didn't make any mention of limiting the strings being used I'd agree with Megacan that there is no perfect hash in this case. +1. – sipsorcery Apr 09 '09 at 15:47
  • The questioner does make that mention, at least implicitly - there were only four Beatles )or siz if you include the drummer they sacked and Stu whatsisname) - still, a fixed data set. –  Apr 09 '09 at 15:52
  • I was only saying, the solution can't be finding a perfect hash function. The solution is to be aware of collisions. Generically a hash function has collisions. That obviously depends on the dataset and the size of the resulting hash code. – Megacan Apr 09 '09 at 16:00
  • Yes it will be a fixed data set, there are only so many strings he could type in a lifetime :). The point here though is that "there is no perfect hash function" is a truism unless there are explicit constraints specified. Megacan was logoically correct in my book. – sipsorcery Apr 09 '09 at 16:02
  • @megacan - follow the link in the answer I posted - perfect hash functions are possible and quite widely used –  Apr 09 '09 at 16:06
  • @spi You are correct. I should make it clear that that the table is static. – EvilTeach Apr 09 '09 at 16:14
  • @Neil - I've read the link. It says that a perfect hash function maps a key to a single hashcode, and no 2 keys have the same hash? am I right? I'm not arguing with that. That can only be possible if the number of combinations of keys are less or equal than the number of hashcodes. – Megacan Apr 09 '09 at 16:35