This is a discussion on hash index work within the Pgsql Patches forums, part of the PostgreSQL category; --> This patch implements two changes to hash indexes: - rather than storing the values of the index keys, we ...
| |||||||
| FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| This patch implements two changes to hash indexes: - rather than storing the values of the index keys, we just store their hash value instead. The hash opclasses have been marked "lossy" so that the executor will recheck the scan qualifier to catch hash collisions. - within a page of a hash bucket, the entries are stored sorted by hash value. When doing a hash index scan, we can then binary search within a page rather than using a linear search. Unfortunately, I haven't been able to measure any performance improvement from either of these changes :-\ I've run tests on two types of tables: int4 keys and relatively short textual keys (I don't think there's much point testing longer keys: the patches should obviously be a win there, so I'm concerned about speeding up the common case). In both cases, the difference in index scan performance isn't significant: it's about the same with and without the patches. The indexes have been on tables with 1 million rows (of int4) and 400,000 rows (of text). I would test with larger tables, but creating hash indexes is so slow that even these sizes are painful enough. Since indexes of this size should be cached in memory the reduction in I/O for the text case isn't being measured (the index is about 30% smaller than it is when we store the full text value), but even so I would have expected the binary search to produce noticeable gains... Perhaps I've made a coding error that has pessimized the performance somehow (although nothing obvious shows up in profiles), or perhaps the linear scan of the pages of the hash bucket isn't a performance problem in the first place, at least for types with a cheap equality operator. Some oprofile data seems to support this -- for example, in the int4 case, we spend less than 0.5% of the time in _hash_next and children, and only 1.8% of the runtime in hashgetmulti and children (with the vanilla CVS HEAD code). -Neil ---------------------------(end of broadcast)--------------------------- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.org so that your message can get through to the mailing list cleanly |
| |||
| Neil, I have added these item to the TODO list. Do you plan on applying this? --------------------------------------------------------------------------- Neil Conway wrote: > This patch implements two changes to hash indexes: > > - rather than storing the values of the index keys, we just store their > hash value instead. The hash opclasses have been marked "lossy" so that > the executor will recheck the scan qualifier to catch hash collisions. > > - within a page of a hash bucket, the entries are stored sorted by hash > value. When doing a hash index scan, we can then binary search within a > page rather than using a linear search. > > Unfortunately, I haven't been able to measure any performance > improvement from either of these changes :-\ > > I've run tests on two types of tables: int4 keys and relatively short > textual keys (I don't think there's much point testing longer keys: the > patches should obviously be a win there, so I'm concerned about speeding > up the common case). In both cases, the difference in index scan > performance isn't significant: it's about the same with and without the > patches. The indexes have been on tables with 1 million rows (of int4) > and 400,000 rows (of text). I would test with larger tables, but > creating hash indexes is so slow that even these sizes are painful > enough. Since indexes of this size should be cached in memory the > reduction in I/O for the text case isn't being measured (the index is > about 30% smaller than it is when we store the full text value), but > even so I would have expected the binary search to produce noticeable > gains... > > Perhaps I've made a coding error that has pessimized the performance > somehow (although nothing obvious shows up in profiles), or perhaps the > linear scan of the pages of the hash bucket isn't a performance problem > in the first place, at least for types with a cheap equality operator. > Some oprofile data seems to support this -- for example, in the int4 > case, we spend less than 0.5% of the time in _hash_next and children, > and only 1.8% of the runtime in hashgetmulti and children (with the > vanilla CVS HEAD code). > > -Neil > > ---------------------------(end of broadcast)--------------------------- > TIP 3: if posting/reading through Usenet, please send an appropriate > subscribe-nomail command to majordomo@postgresql.org so that your > message can get through to the mailing list cleanly -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073 ---------------------------(end of broadcast)--------------------------- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq |
| |||
| Bruce Momjian wrote: > Neil, I have added these item to the TODO list. Do you plan on applying > this? No, I don't have any immediate plans to apply it, as unfortunately I didn't see a performance win :-( It's also possible I'm just not measuring the right workload, although I don't have time to rerun the benchmarks at the moment. The patch does two things: (1) change hash indexes to only store the key's hash value, not the entire key (2) store index elements within a hash bucket in order of hash key and search for matches via binary search. #1 is definitely a win in some in some circumstances (e.g. indexing large fields or types with expensive equality operators), but those aren't the common case. I'm surprised that #2 is not a more noticeable improvement... One possibility would be to provide an optional implementation of #1, perhaps via an alternate index operator class. That way people could select it in those situations in which it is worth using. -Neil ---------------------------(end of broadcast)--------------------------- TIP 2: you can get off all lists at once with the unregister command (send "unregister YourEmailAddressHere" to majordomo@postgresql.org) |
| |||
| Neil Conway <neilc@samurai.com> writes: > The patch does two things: (1) change hash indexes to only store the > key's hash value, not the entire key (2) store index elements within a > hash bucket in order of hash key and search for matches via binary > search. #1 is definitely a win in some in some circumstances (e.g. > indexing large fields or types with expensive equality operators), but > those aren't the common case. I'm surprised that #2 is not a more > noticeable improvement... It occurs to me that change #1 makes it cheaper to skip over index entries that are in the same bucket but don't have the exact same hash code; but it makes it significantly more expensive to skip over entries that have the same hash code but aren't really equal to the key being sought (since you have to visit the heap to find out they aren't equal). Maybe your test workload had enough occurrences of the latter case to balance out the win from the former. I think it would be worth investigating a variant in which the index stores both the hash code and the key value. This allows cheap elimination of non-matching hash codes, and binary sorting of index entries, without adding any extra trips to the heap. The downside is that it makes the index larger so you incur more I/O there --- so this might not be a win either. > One possibility would be to provide an optional implementation of #1, > perhaps via an alternate index operator class. That way people could > select it in those situations in which it is worth using. I think it would definitely be a good idea to make the lossy behavior optional. regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 2: you can get off all lists at once with the unregister command (send "unregister YourEmailAddressHere" to majordomo@postgresql.org) |
| ||||
| I have added these TODO items based on Neil's ideas: * Consider sorting hash buckets so entries can be found using a binary search, rather than a linear scan * In hash indexes, consider storing the hash value with or instead of the key itself However, Neil's tests don't show much of a win. Should I keep these items on the TODO list? --------------------------------------------------------------------------- Tom Lane wrote: > Neil Conway <neilc@samurai.com> writes: > > The patch does two things: (1) change hash indexes to only store the > > key's hash value, not the entire key (2) store index elements within a > > hash bucket in order of hash key and search for matches via binary > > search. #1 is definitely a win in some in some circumstances (e.g. > > indexing large fields or types with expensive equality operators), but > > those aren't the common case. I'm surprised that #2 is not a more > > noticeable improvement... > > It occurs to me that change #1 makes it cheaper to skip over index > entries that are in the same bucket but don't have the exact same hash > code; but it makes it significantly more expensive to skip over entries > that have the same hash code but aren't really equal to the key being > sought (since you have to visit the heap to find out they aren't equal). > Maybe your test workload had enough occurrences of the latter case to > balance out the win from the former. > > I think it would be worth investigating a variant in which the index > stores both the hash code and the key value. This allows cheap > elimination of non-matching hash codes, and binary sorting of index > entries, without adding any extra trips to the heap. The downside is > that it makes the index larger so you incur more I/O there --- so this > might not be a win either. > > > One possibility would be to provide an optional implementation of #1, > > perhaps via an alternate index operator class. That way people could > > select it in those situations in which it is worth using. > > I think it would definitely be a good idea to make the lossy behavior > optional. > > regards, tom lane > -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073 ---------------------------(end of broadcast)--------------------------- TIP 4: Don't 'kill -9' the postmaster |
| Thread Tools | |
| Display Modes | |
|
|