Unix Technical Forum

Re: [PERFORM] Bad n_distinct estimation; hacks suggested?

This is a discussion on Re: [PERFORM] Bad n_distinct estimation; hacks suggested? within the pgsql Hackers forums, part of the PostgreSQL category; --> This one looks *really* good. http://www.aladdin.cs.cmu.edu/papers...dist_sampl.pdf It does require a single full table scan but it works in O(n) ...


Go Back   Unix Technical Forum > Database Server Software > PostgreSQL > pgsql Hackers

FAQ Members List Calendar Search Today's Posts Mark Forums Read
  #1 (permalink)  
Old 04-11-2008, 04:35 AM
Greg Stark
 
Posts: n/a
Default Re: [PERFORM] Bad n_distinct estimation; hacks suggested?


This one looks *really* good.

http://www.aladdin.cs.cmu.edu/papers...dist_sampl.pdf

It does require a single full table scan but it works in O(n) time and
constant space and it guarantees the confidence intervals for the estimates it
provides like the histograms do for regular range scans.

It can even keep enough data to provide estimates for n_distinct when
unrelated predicates are applied. I'm not sure Postgres would want to do this
though; this seems like it's part of the cross-column correlation story more
than the n_distinct story. It seems to require keeping an entire copy of the
sampled record in the stats tables which would be prohibitive quickly in wide
tables (it would be O(n^2) storage in the number of columns) .

It also seems like a lot of work to implement. Nothing particular that would
be impossible, but it does require storing a moderately complex data
structure. Perhaps Postgres's new support for data structures will make this
easier.

--
greg


---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #2 (permalink)  
Old 04-11-2008, 04:35 AM
Greg Stark
 
Posts: n/a
Default Re: [PERFORM] Bad n_distinct estimation; hacks suggested?

Rod Taylor <rbt@sitesell.com> writes:

> On Tue, 2005-04-26 at 19:03 -0400, Greg Stark wrote:
> > This one looks *really* good.
> >
> > http://www.aladdin.cs.cmu.edu/papers...dist_sampl.pdf
> >
> > It does require a single full table scan

>
> Ack.. Not by default please.
>
> I have a few large append-only tables (vacuum isn't necessary) which do
> need stats rebuilt periodically.


The algorithm can also naturally be implemented incrementally. Which would be
nice for your append-only tables. But that's not Postgres's current philosophy
with statistics. Perhaps some trigger function that you could install yourself
to update statistics for a newly inserted record would be useful.


The paper is pretty straightforward and easy to read, but here's an executive
summary:

The goal is to gather a uniform sample of *distinct values* in the table as
opposed to a sample of records.

Instead of using a fixed percentage sampling rate for each record, use a hash
of the value to determine whether to include it. At first include everything,
but if the sample space overflows throw out half the values based on their
hash value. Repeat until finished.

In the end you'll have a sample of 1/2^n of your distinct values from your
entire data set where n is large enough for you sample to fit in your
predetermined constant sample space.

--
greg


---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #3 (permalink)  
Old 04-11-2008, 04:35 AM
Rod Taylor
 
Posts: n/a
Default Re: [PERFORM] Bad n_distinct estimation; hacks suggested?

On Tue, 2005-04-26 at 19:28 -0400, Greg Stark wrote:
> Rod Taylor <rbt@sitesell.com> writes:
>
> > On Tue, 2005-04-26 at 19:03 -0400, Greg Stark wrote:
> > > This one looks *really* good.
> > >
> > > http://www.aladdin.cs.cmu.edu/papers...dist_sampl.pdf
> > >
> > > It does require a single full table scan

> >
> > Ack.. Not by default please.
> >
> > I have a few large append-only tables (vacuum isn't necessary) which do
> > need stats rebuilt periodically.

>
> The algorithm can also naturally be implemented incrementally. Which would be
> nice for your append-only tables. But that's not Postgres's current philosophy
> with statistics. Perhaps some trigger function that you could install yourself
> to update statistics for a newly inserted record would be useful.


If when we have partitions, that'll be good enough. If partitions aren't
available this would be quite painful to anyone with large tables --
much as the days of old used to be painful for ANALYZE.

--


---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #4 (permalink)  
Old 04-11-2008, 04:35 AM
Tom Lane
 
Posts: n/a
Default Re: [PERFORM] Bad n_distinct estimation; hacks suggested?

Rod Taylor <pg@rbt.ca> writes:
> If when we have partitions, that'll be good enough. If partitions aren't
> available this would be quite painful to anyone with large tables --
> much as the days of old used to be painful for ANALYZE.


Yeah ... I am very un-enthused about these suggestions to make ANALYZE
go back to doing a full scan ...

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #5 (permalink)  
Old 04-11-2008, 04:36 AM
Greg Stark
 
Posts: n/a
Default Re: [PERFORM] Bad n_distinct estimation; hacks suggested?


Tom Lane <tgl@sss.pgh.pa.us> writes:

> Rod Taylor <pg@rbt.ca> writes:
> > If when we have partitions, that'll be good enough. If partitions aren't
> > available this would be quite painful to anyone with large tables --
> > much as the days of old used to be painful for ANALYZE.

>
> Yeah ... I am very un-enthused about these suggestions to make ANALYZE
> go back to doing a full scan ...


Well one option would be to sample only a small number of records, but add the
data found from those records to the existing statistics. This would make
sense for a steady-state situation, but make it hard to recover from a drastic
change in data distribution. I think in the case of n_distinct it would also
bias the results towards underestimating n_distinct but perhaps that could be
corrected for.

But I'm unclear for what situation this is a concern.

For most use cases users have to run vacuum occasionally. In those cases
"vacuum analyze" would be no worse than a straight normal vacuum. Note that
this algorithm doesn't require storing more data because of the large scan or
performing large sorts per column. It's purely O(n) time and O(1) space.

On the other hand, if you have tables you aren't vacuuming that means you
perform zero updates or deletes. In which case some sort of incremental
statistics updating would be a good solution. A better solution even than
sampling.

--
greg


---------------------------(end of broadcast)---------------------------
TIP 9: the planner will ignore your desire to choose an index scan if your
joining column's datatypes do not match

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #6 (permalink)  
Old 04-11-2008, 04:36 AM
Rod Taylor
 
Posts: n/a
Default Re: [PERFORM] Bad n_distinct estimation; hacks suggested?

On Tue, 2005-04-26 at 19:03 -0400, Greg Stark wrote:
> This one looks *really* good.
>
> http://www.aladdin.cs.cmu.edu/papers...dist_sampl.pdf
>
> It does require a single full table scan


Ack.. Not by default please.

I have a few large append-only tables (vacuum isn't necessary) which do
need stats rebuilt periodically.

Lets just say that we've been working hard to upgrade to 8.0 primarily
because pg_dump was taking over 18 hours to make a backup.

--
Rod Taylor <rbt@sitesell.com>


---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply


Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On
Forum Jump


All times are GMT. The time now is 12:21 AM.


Powered by vBulletin® Version 3.6.5
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
SEO by vBSEO 3.2.0
www.UnixAdminTalk.com