Unix Technical Forum

hash index work

This is a discussion on hash index work within the Pgsql Patches forums, part of the PostgreSQL category; --> This patch implements two changes to hash indexes: - rather than storing the values of the index keys, we ...


Go Back   Unix Technical Forum > Database Server Software > PostgreSQL > Pgsql Patches

FAQ Members List Calendar Search Today's Posts Mark Forums Read
  #1 (permalink)  
Old 04-18-2008, 12:20 AM
Neil Conway
 
Posts: n/a
Default hash index work

This patch implements two changes to hash indexes:

- rather than storing the values of the index keys, we just store their
hash value instead. The hash opclasses have been marked "lossy" so that
the executor will recheck the scan qualifier to catch hash collisions.

- within a page of a hash bucket, the entries are stored sorted by hash
value. When doing a hash index scan, we can then binary search within a
page rather than using a linear search.

Unfortunately, I haven't been able to measure any performance
improvement from either of these changes :-\

I've run tests on two types of tables: int4 keys and relatively short
textual keys (I don't think there's much point testing longer keys: the
patches should obviously be a win there, so I'm concerned about speeding
up the common case). In both cases, the difference in index scan
performance isn't significant: it's about the same with and without the
patches. The indexes have been on tables with 1 million rows (of int4)
and 400,000 rows (of text). I would test with larger tables, but
creating hash indexes is so slow that even these sizes are painful
enough. Since indexes of this size should be cached in memory the
reduction in I/O for the text case isn't being measured (the index is
about 30% smaller than it is when we store the full text value), but
even so I would have expected the binary search to produce noticeable
gains...

Perhaps I've made a coding error that has pessimized the performance
somehow (although nothing obvious shows up in profiles), or perhaps the
linear scan of the pages of the hash bucket isn't a performance problem
in the first place, at least for types with a cheap equality operator.
Some oprofile data seems to support this -- for example, in the int4
case, we spend less than 0.5% of the time in _hash_next and children,
and only 1.8% of the runtime in hashgetmulti and children (with the
vanilla CVS HEAD code).

-Neil


---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #2 (permalink)  
Old 04-18-2008, 12:23 AM
Bruce Momjian
 
Posts: n/a
Default Re: hash index work


Neil, I have added these item to the TODO list. Do you plan on applying
this?

---------------------------------------------------------------------------

Neil Conway wrote:
> This patch implements two changes to hash indexes:
>
> - rather than storing the values of the index keys, we just store their
> hash value instead. The hash opclasses have been marked "lossy" so that
> the executor will recheck the scan qualifier to catch hash collisions.
>
> - within a page of a hash bucket, the entries are stored sorted by hash
> value. When doing a hash index scan, we can then binary search within a
> page rather than using a linear search.
>
> Unfortunately, I haven't been able to measure any performance
> improvement from either of these changes :-\
>
> I've run tests on two types of tables: int4 keys and relatively short
> textual keys (I don't think there's much point testing longer keys: the
> patches should obviously be a win there, so I'm concerned about speeding
> up the common case). In both cases, the difference in index scan
> performance isn't significant: it's about the same with and without the
> patches. The indexes have been on tables with 1 million rows (of int4)
> and 400,000 rows (of text). I would test with larger tables, but
> creating hash indexes is so slow that even these sizes are painful
> enough. Since indexes of this size should be cached in memory the
> reduction in I/O for the text case isn't being measured (the index is
> about 30% smaller than it is when we store the full text value), but
> even so I would have expected the binary search to produce noticeable
> gains...
>
> Perhaps I've made a coding error that has pessimized the performance
> somehow (although nothing obvious shows up in profiles), or perhaps the
> linear scan of the pages of the hash bucket isn't a performance problem
> in the first place, at least for types with a cheap equality operator.
> Some oprofile data seems to support this -- for example, in the int4
> case, we spend less than 0.5% of the time in _hash_next and children,
> and only 1.8% of the runtime in hashgetmulti and children (with the
> vanilla CVS HEAD code).
>
> -Neil



>
> ---------------------------(end of broadcast)---------------------------
> TIP 3: if posting/reading through Usenet, please send an appropriate
> subscribe-nomail command to majordomo@postgresql.org so that your
> message can get through to the mailing list cleanly


--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073

---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #3 (permalink)  
Old 04-18-2008, 12:23 AM
Neil Conway
 
Posts: n/a
Default Re: hash index work

Bruce Momjian wrote:
> Neil, I have added these item to the TODO list. Do you plan on applying
> this?


No, I don't have any immediate plans to apply it, as unfortunately I
didn't see a performance win :-( It's also possible I'm just not
measuring the right workload, although I don't have time to rerun the
benchmarks at the moment.

The patch does two things: (1) change hash indexes to only store the
key's hash value, not the entire key (2) store index elements within a
hash bucket in order of hash key and search for matches via binary
search. #1 is definitely a win in some in some circumstances (e.g.
indexing large fields or types with expensive equality operators), but
those aren't the common case. I'm surprised that #2 is not a more
noticeable improvement...

One possibility would be to provide an optional implementation of #1,
perhaps via an alternate index operator class. That way people could
select it in those situations in which it is worth using.

-Neil

---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to majordomo@postgresql.org)

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #4 (permalink)  
Old 04-18-2008, 12:23 AM
Tom Lane
 
Posts: n/a
Default Re: hash index work

Neil Conway <neilc@samurai.com> writes:
> The patch does two things: (1) change hash indexes to only store the
> key's hash value, not the entire key (2) store index elements within a
> hash bucket in order of hash key and search for matches via binary
> search. #1 is definitely a win in some in some circumstances (e.g.
> indexing large fields or types with expensive equality operators), but
> those aren't the common case. I'm surprised that #2 is not a more
> noticeable improvement...


It occurs to me that change #1 makes it cheaper to skip over index
entries that are in the same bucket but don't have the exact same hash
code; but it makes it significantly more expensive to skip over entries
that have the same hash code but aren't really equal to the key being
sought (since you have to visit the heap to find out they aren't equal).
Maybe your test workload had enough occurrences of the latter case to
balance out the win from the former.

I think it would be worth investigating a variant in which the index
stores both the hash code and the key value. This allows cheap
elimination of non-matching hash codes, and binary sorting of index
entries, without adding any extra trips to the heap. The downside is
that it makes the index larger so you incur more I/O there --- so this
might not be a win either.

> One possibility would be to provide an optional implementation of #1,
> perhaps via an alternate index operator class. That way people could
> select it in those situations in which it is worth using.


I think it would definitely be a good idea to make the lossy behavior
optional.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to majordomo@postgresql.org)

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #5 (permalink)  
Old 04-18-2008, 12:32 AM
Bruce Momjian
 
Posts: n/a
Default Re: hash index work


I have added these TODO items based on Neil's ideas:

* Consider sorting hash buckets so entries can be found using a binary
search, rather than a linear scan
* In hash indexes, consider storing the hash value with or instead
of the key itself

However, Neil's tests don't show much of a win. Should I keep these
items on the TODO list?

---------------------------------------------------------------------------

Tom Lane wrote:
> Neil Conway <neilc@samurai.com> writes:
> > The patch does two things: (1) change hash indexes to only store the
> > key's hash value, not the entire key (2) store index elements within a
> > hash bucket in order of hash key and search for matches via binary
> > search. #1 is definitely a win in some in some circumstances (e.g.
> > indexing large fields or types with expensive equality operators), but
> > those aren't the common case. I'm surprised that #2 is not a more
> > noticeable improvement...

>
> It occurs to me that change #1 makes it cheaper to skip over index
> entries that are in the same bucket but don't have the exact same hash
> code; but it makes it significantly more expensive to skip over entries
> that have the same hash code but aren't really equal to the key being
> sought (since you have to visit the heap to find out they aren't equal).
> Maybe your test workload had enough occurrences of the latter case to
> balance out the win from the former.
>
> I think it would be worth investigating a variant in which the index
> stores both the hash code and the key value. This allows cheap
> elimination of non-matching hash codes, and binary sorting of index
> entries, without adding any extra trips to the heap. The downside is
> that it makes the index larger so you incur more I/O there --- so this
> might not be a win either.
>
> > One possibility would be to provide an optional implementation of #1,
> > perhaps via an alternate index operator class. That way people could
> > select it in those situations in which it is worth using.

>
> I think it would definitely be a good idea to make the lossy behavior
> optional.
>
> regards, tom lane
>


--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073

---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply


Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On
Forum Jump


All times are GMT. The time now is 04:34 PM.


Powered by vBulletin® Version 3.6.5
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
SEO by vBSEO 3.2.0
www.UnixAdminTalk.com