vBulletin Search Engine Optimization
| |||||||
| Register | FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| Dear PostgreSQL Developers, This patch is a "diff -c" against the hashfunc.c from postgresql-8.3beta1. It implements the 2006 version of the hash function by Bob Jenkins. Its features include a better and faster hash function. I have included the versions supporting big-endian and little-endian machines that will be selected based on the machine configuration. Currently, I have hash_any() just a stub calling hashlittle and hashbig. In order to allow the hash index to support large indexes (>10^9 entries), the hash function needs to be able to provide 64-bit hashes. The functions hashbig2/hashlittle2 produce 2 32-bit hashes that can be used as a 64-bit hash value. I would like some feedback as to how best to include 64-bit hashes within our current 32-bit hash infrastructure. The hash-merge can simple use one of the 2 32-bit pieces to provide the current 32-bit hash values needed. Then they could be pulled directly from the hash index and not need to be recalculated at run time. What would be the best way to implement this in a way that will work on machines without support for 64-bit integers? The current patch passes all the regression tests, but has a few warnings for the different variations of the new hash function. Until the design has crystalized, I am not going to worry about them and I want testers to have access to the different functions. I am doing the initial patches to the hash index code based on a 32-bit hash, but I would like to add the 64-bit hash support pretty early in the development cycle in order to allow for better testing. Any thoughts would be welcome. Regards, Ken ---------------------------(end of broadcast)--------------------------- TIP 4: Have you searched our list archives? http://archives.postgresql.org |
| |||
| On Sat, 2007-10-27 at 15:15 -0500, Kenneth Marshall wrote: > Its features include a better and faster hash function. Looks very promising. Do you have any performance test results to show it really is faster, when compiled into Postgres? Better probably needs some definition also; in what way are the hash functions better? -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com ---------------------------(end of broadcast)--------------------------- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.org so that your message can get through to the mailing list cleanly |
| |||
| On Sun, Oct 28, 2007 at 05:27:38PM +0000, Simon Riggs wrote: > On Sat, 2007-10-27 at 15:15 -0500, Kenneth Marshall wrote: > > Its features include a better and faster hash function. > > Looks very promising. Do you have any performance test results to show > it really is faster, when compiled into Postgres? Better probably needs > some definition also; in what way are the hash functions better? > > -- > Simon Riggs > 2ndQuadrant http://www.2ndQuadrant.com > The new hash function is roughly twice as fast as the old function in terms of straight CPU time. It uses the same design as the current hash but provides code paths for aligned and unaligned access as well as separate mixing functions for different blocks in the hash run instead of having one general purpose block. I think the speed will not be an obvious win with smaller items, but will be very important when hashing larger items (up to 32kb). Better in this case means that the new hash mixes more thoroughly which results in less collisions and more even bucket distribution. There is also a 64-bit varient which is still faster since it can take advantage of the 64-bit processor instruction set. Ken ---------------------------(end of broadcast)--------------------------- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate |
| |||
| On Sun, 2007-10-28 at 13:05 -0500, Kenneth Marshall wrote: > On Sun, Oct 28, 2007 at 05:27:38PM +0000, Simon Riggs wrote: > > On Sat, 2007-10-27 at 15:15 -0500, Kenneth Marshall wrote: > > > Its features include a better and faster hash function. > > > > Looks very promising. Do you have any performance test results to show > > it really is faster, when compiled into Postgres? Better probably needs > > some definition also; in what way are the hash functions better? > > > > -- > > Simon Riggs > > 2ndQuadrant http://www.2ndQuadrant.com > > > The new hash function is roughly twice as fast as the old function in > terms of straight CPU time. It uses the same design as the current > hash but provides code paths for aligned and unaligned access as well > as separate mixing functions for different blocks in the hash run > instead of having one general purpose block. I think the speed will > not be an obvious win with smaller items, but will be very important > when hashing larger items (up to 32kb). > > Better in this case means that the new hash mixes more thoroughly > which results in less collisions and more even bucket distribution. > There is also a 64-bit varient which is still faster since it can > take advantage of the 64-bit processor instruction set. Ken, I was really looking for some tests that show both of the above were true. We've had some trouble proving the claims of other algorithms before, so I'm less inclined to take those things at face value. I'd suggest tests with Integers, BigInts, UUID, CHAR(20) and CHAR(100). Others may have different concerns. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com ---------------------------(end of broadcast)--------------------------- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate |
| |||
| On Sun, Oct 28, 2007 at 08:06:58PM +0000, Simon Riggs wrote: > On Sun, 2007-10-28 at 13:05 -0500, Kenneth Marshall wrote: > > On Sun, Oct 28, 2007 at 05:27:38PM +0000, Simon Riggs wrote: > > > On Sat, 2007-10-27 at 15:15 -0500, Kenneth Marshall wrote: > > > > Its features include a better and faster hash function. > > > > > > Looks very promising. Do you have any performance test results to show > > > it really is faster, when compiled into Postgres? Better probably needs > > > some definition also; in what way are the hash functions better? > > > > > > -- > > > Simon Riggs > > > 2ndQuadrant http://www.2ndQuadrant.com > > > > > The new hash function is roughly twice as fast as the old function in > > terms of straight CPU time. It uses the same design as the current > > hash but provides code paths for aligned and unaligned access as well > > as separate mixing functions for different blocks in the hash run > > instead of having one general purpose block. I think the speed will > > not be an obvious win with smaller items, but will be very important > > when hashing larger items (up to 32kb). > > > > Better in this case means that the new hash mixes more thoroughly > > which results in less collisions and more even bucket distribution. > > There is also a 64-bit varient which is still faster since it can > > take advantage of the 64-bit processor instruction set. > > Ken, I was really looking for some tests that show both of the above > were true. We've had some trouble proving the claims of other algorithms > before, so I'm less inclined to take those things at face value. > > I'd suggest tests with Integers, BigInts, UUID, CHAR(20) and CHAR(100). > Others may have different concerns. > Simon, I agree, that we should not take claims withoug testing them ourselves. My main motivation for posting the patch was to get feedback on how to add support for 64-bit hashes that will work with all of our supported platforms. I am trying to avoid the "work on a feature in isolation... and submit a giant patch with many problems" problem. I intend to do more extensive testing, but I am trying to reach a basic implementation level before I start the testing. I am pretty good with theory, but my coding skills are out of practice. It will take me longer to generate the tests now and without any clear benefit to the hash index implementation. I am willing to test further, but I would like to have my testing benefit the hash index implementation and not just the effectiveness and efficiency of the hashing algorithm. Regards, Ken > -- > Simon Riggs > 2ndQuadrant http://www.2ndQuadrant.com > > ---------------------------(end of broadcast)--------------------------- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq |
| |||
| On Oct 28, 11:05 am, k...@rice.edu (Kenneth Marshall) wrote: > On Sun, Oct 28, 2007 at 05:27:38PM +0000, Simon Riggs wrote: > > On Sat, 2007-10-27 at 15:15 -0500, Kenneth Marshall wrote: > > > Its features include a better and faster hash function. > > > Looks very promising. Do you have any performance test results to show > > it really is faster, when compiled into Postgres? Better probably needs > > some definition also; in what way are the hash functions better? > > > -- > > Simon Riggs > > 2ndQuadrant http://www.2ndQuadrant.com > > The new hash function is roughly twice as fast as the old function in > terms of straight CPU time. It uses the same design as the current > hash but provides code paths for aligned and unaligned access as well > as separate mixing functions for different blocks in the hash run > instead of having one general purpose block. I think the speed will > not be an obvious win with smaller items, but will be very important > when hashing larger items (up to 32kb). > > Better in this case means that the new hash mixes more thoroughly > which results in less collisions and more even bucket distribution. > There is also a 64-bit varient which is still faster since it can > take advantage of the 64-bit processor instruction set. > > Ken > > ---------------------------(end of broadcast)--------------------------- > TIP 7: You can help support the PostgreSQL project by donating at > > http://www.postgresql.org/about/donate I don't make use of 64-bit arithmetic when producing the 64-bit result in hashlittle2(). Wish I did. The routine internally produces 3 32- bit results a b c, the returned 64-bit result is roughly c | (b<<32). |
| |||
| On Oct 27, 10:15 pm, k...@rice.edu (Kenneth Marshall) wrote: > Dear PostgreSQL Developers, > > This patch is a "diff -c" against the hashfunc.c from postgresql-8.3beta1. > It implements the 2006 version of the hash function by Bob Jenkins. Its > features include a better and faster hash function. I have included the > versions supporting big-endian and little-endian machines that will be > selected based on the machine configuration. [snip] I have some question concerning Bob Jenkins' functions hashword(uint32_t*, size_t), hashlittle(uint8_t*, size_t) and hashbig(uint8_t*, size_t) in lookup3.c. Let k1 by a key: uint8_t* k1; strlen(k1)%sizeof(uint32_t) == 0. 1. hashlittle(k1) produces the same value on Little-Endian and Big- Endian machines. Let hashlittle(k1) be == L1. 2. hashbig(k1) produces the same value on Little-Endian and Big-Endian machines. Let hashbig(k1) be == B1. L1 != B1 3. hashword((uint32_t*)k1) produces * L1 on LittleEndian machine and * B1 on BigEndian machine. --------------------- The question is: is it possible to change hashword() to get * L1 on Little-Endian machine and * B1 on Big-Endian machine ? Thanks. Alex Vinokur email: alex DOT vinokur AT gmail DOT com http://mathforum.org/library/view/10978.html http://sourceforge.net/users/alexvn |
| |||
| On Nov 15, 10:40 am, Alex Vinokur <ale...@users.sourceforge.net> wrote: [snip] > I have some question concerning Bob Jenkins' functions > hashword(uint32_t*, size_t), hashlittle(uint8_t*, size_t) and > hashbig(uint8_t*, size_t) in lookup3.c. > > Let k1 by a key: uint8_t* k1; strlen(k1)%sizeof(uint32_t) == 0. > > 1. hashlittle(k1) produces the same value on Little-Endian and Big- > Endian machines. > Let hashlittle(k1) be == L1. > > 2. hashbig(k1) produces the same value on Little-Endian and Big-Endian > machines. > Let hashbig(k1) be == B1. > > L1 != B1 > > 3. hashword((uint32_t*)k1) produces > * L1 on LittleEndian machine and > * B1 on BigEndian machine. > =================================== > --------------------- > The question is: is it possible to change hashword() to get > * L1 on Little-Endian machine and > * B1 on Big-Endian machine > ? Sorry, it should be as follows: Is it possible to create two new hash functions on basis of hashword(): i) hashword_little () that produces L1 on Little-Endian and Big- Endian machines; ii) hashword_big () that produces B1 on Little-Endian and Big- Endian machines ? ==================================== Thanks. Alex Vinokur email: alex DOT vinokur AT gmail DOT com http://mathforum.org/library/view/10978.html http://sourceforge.net/users/alexvn |
| |||
| Alex Vinokur wrote: > On Nov 15, 10:40 am, Alex Vinokur <ale...@users.sourceforge.net> > wrote: > [snip] >> I have some question concerning Bob Jenkins' functions >> hashword(uint32_t*, size_t), hashlittle(uint8_t*, size_t) and >> hashbig(uint8_t*, size_t) in lookup3.c. >> >> Let k1 by a key: uint8_t* k1; strlen(k1)%sizeof(uint32_t) == 0. >> >> 1. hashlittle(k1) produces the same value on Little-Endian and Big- >> Endian machines. >> Let hashlittle(k1) be == L1. >> >> 2. hashbig(k1) produces the same value on Little-Endian and Big-Endian >> machines. >> Let hashbig(k1) be == B1. >> >> L1 != B1 >> >> 3. hashword((uint32_t*)k1) produces >> * L1 on LittleEndian machine and >> * B1 on BigEndian machine. >> > =================================== >> --------------------- >> The question is: is it possible to change hashword() to get >> * L1 on Little-Endian machine and >> * B1 on Big-Endian machine >> ? > > Sorry, it should be as follows: > > Is it possible to create two new hash functions on basis of > hashword(): > i) hashword_little () that produces L1 on Little-Endian and Big- > Endian machines; > ii) hashword_big () that produces B1 on Little-Endian and Big- > Endian machines > ? Why? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---------------------------(end of broadcast)--------------------------- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq |
| ||||
| On Nov 15, 1:23 pm, hei...@enterprisedb.com (Heikki Linnakangas) wrote: > Alex Vinokurwrote: > > On Nov 15, 10:40 am,Alex Vinokur<ale...@users.sourceforge.net> > > wrote: > > [snip] > >> I have some question concerning Bob Jenkins' functions > >> hashword(uint32_t*, size_t), hashlittle(uint8_t*, size_t) and > >> hashbig(uint8_t*, size_t) in lookup3.c. > > >> Let k1 by a key: uint8_t* k1; strlen(k1)%sizeof(uint32_t) == 0. > > >> 1. hashlittle(k1) produces the same value on Little-Endian and Big- > >> Endian machines. > >> Let hashlittle(k1) be == L1. > > >> 2. hashbig(k1) produces the same value on Little-Endian and Big-Endian > >> machines. > >> Let hashbig(k1) be == B1. > > >> L1 != B1 > > >> 3. hashword((uint32_t*)k1) produces > >> * L1 on LittleEndian machine and > >> * B1 on BigEndian machine. > > > =================================== > >> --------------------- > >> The question is: is it possible to change hashword() to get > >> * L1 on Little-Endian machine and > >> * B1 on Big-Endian machine > >> ? > > > Sorry, it should be as follows: > > > Is it possible to create two new hash functions on basis of > > hashword(): > > i) hashword_little () that produces L1 on Little-Endian and Big- > > Endian machines; > > ii) hashword_big () that produces B1 on Little-Endian and Big- > > Endian machines > > ? > > Why? > [snip] Suppose: uint8_t chBuf[SIZE32 * 4]; // ((size_t)&chBuf[0] & 3) == 0 Function hashlittle(chBuf, SIZE32 * 4, 0) produces the same hashValue (let this value be L1) on little-endian and big-endian machines. So, hashlittle() is endianness-indepent. On other hand, function hashword ((uint32_t)chBuf, SIZE32, 0) produces hashValue == L1 on little-endian machine and hashValue != L1 on big-endian machine. So, hashword() is endianness-dependent. I would like to use both hashlittle() and hashword() (or hashword_little) on little-endian and big-endian machine and to get identical hashValues. Alex Vinokur email: alex DOT vinokur AT gmail DOT com http://mathforum.org/library/view/10978.html http://sourceforge.net/users/alexvn |