Unix Technical Forum

Bad n_distinct estimation; hacks suggested?

This is a discussion on Bad n_distinct estimation; hacks suggested? within the Pgsql Performance forums, part of the PostgreSQL category; --> Hello Everybody, We recently upgraded to Postgres 7.4 from 7.3.9 and noticed that the foreign key constraints compile noticeably ...


Go Back   Unix Technical Forum > Database Server Software > PostgreSQL > Pgsql Performance

FAQ Members List Calendar Search Today's Posts Mark Forums Read
  #21 (permalink)  
Old 04-18-2008, 11:39 AM
Ashish Arte
 
Posts: n/a
Default Foreign key constraints compile faster in 7.4

Hello Everybody,

We recently upgraded to Postgres 7.4 from 7.3.9 and noticed that the
foreign key constraints compile noticeably faster. In 7.3 the
constraints would typically take more than an hour to run on our
production data. Now they take a minute or two.

Can anybody explain such a major performance improvement ?

Thanks

--
Ashish Arte
Open Sky Software


---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

http://archives.postgresql.org

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #22 (permalink)  
Old 04-18-2008, 11:39 AM
Tom Lane
 
Posts: n/a
Default Re: Foreign key constraints compile faster in 7.4

Ashish Arte <ashish@openskysoftware.com> writes:
> We recently upgraded to Postgres 7.4 from 7.3.9 and noticed that the
> foreign key constraints compile noticeably faster. In 7.3 the
> constraints would typically take more than an hour to run on our
> production data. Now they take a minute or two.


> Can anybody explain such a major performance improvement ?


Hey, we do do some work on this thing from time to time ;-)

Probably you are talking about this:

2003-10-06 12:38 tgl

* src/: backend/commands/tablecmds.c,
backend/utils/adt/ri_triggers.c, include/commands/trigger.h: During
ALTER TABLE ADD FOREIGN KEY, try to check the existing rows using a
single LEFT JOIN query instead of firing the check trigger for each
row individually. Stephan Szabo, with some kibitzing from Tom Lane
and Jan Wieck.


regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #23 (permalink)  
Old 04-18-2008, 11:39 AM
John A Meinel
 
Posts: n/a
Default Re: [HACKERS] Bad n_distinct estimation; hacks suggested?

Josh Berkus wrote:
> Mischa,
>
>
>>Okay, although given the track record of page-based sampling for
>>n-distinct, it's a bit like looking for your keys under the streetlight,
>>rather than in the alley where you dropped them :-)

>
>
> Bad analogy, but funny.
>
> The issue with page-based vs. pure random sampling is that to do, for example,
> 10% of rows purely randomly would actually mean loading 50% of pages. With
> 20% of rows, you might as well scan the whole table.
>
> Unless, of course, we use indexes for sampling, which seems like a *really
> good* idea to me ....
>


But doesn't an index only sample one column at a time, whereas with
page-based sampling, you can sample all of the columns at once. And not
all columns would have indexes, though it could be assumed that if a
column doesn't have an index, then it doesn't matter as much for
calculations such as n_distinct.

But if you had 5 indexed rows in your table, then doing it index wise
means you would have to make 5 passes instead of just one.

Though I agree that page-based sampling is important for performance
reasons.

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)
Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org

iD8DBQFCeBsdJdeBCYSNAAMRAu9UAKDDdD+Gc/SgZXjTo15SPIUtklrYZgCfd3VR
4o/cdrYTBR9vdo3fekGjZns=
=dsMV
-----END PGP SIGNATURE-----

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #24 (permalink)  
Old 04-18-2008, 11:39 AM
Josh Berkus
 
Posts: n/a
Default Re: [HACKERS] Bad n_distinct estimation; hacks suggested?

John,

> But doesn't an index only sample one column at a time, whereas with
> page-based sampling, you can sample all of the columns at once.


Hmmm. Yeah, we're not currently doing that though. Another good idea ...

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco

---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

http://archives.postgresql.org

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #25 (permalink)  
Old 04-18-2008, 11:39 AM
Mischa Sandberg
 
Posts: n/a
Default Re: [HACKERS] Bad n_distinct estimation; hacks suggested?

Quoting Josh Berkus <josh@agliodbs.com>:

> Mischa,
>
> > Okay, although given the track record of page-based sampling for
> > n-distinct, it's a bit like looking for your keys under the

> streetlight,
> > rather than in the alley where you dropped them :-)

>
> Bad analogy, but funny.


Bad analogy? Page-sampling effort versus row-sampling effort, c'est
moot. It's not good enough for stats to produce good behaviour on the
average. Straight random sampling, page or row, is going to cause
enough untrustworthy engine behaviour,for any %ages small enough to
allow sampling from scratch at any time.

I'm curious what the problem is with relying on a start-up plus
incremental method, when the method in the distinct-sampling paper
doesn't degenerate: you can start when the table is still empty.
Constructing an index requires an initial full scan plus incremental
update; what's the diff?

> Unless, of course, we use indexes for sampling, which seems like a
> *really
> good* idea to me ....


"distinct-sampling" applies for indexes, too. I started tracking the
discussion of this a bit late. Smart method for this is in VLDB'92:
Gennady Antoshenkov, "Random Sampling from Pseudo-ranked B+-trees". I
don't think this is online anywhere, except if you have a DBLP
membership. Does nybod else know better?
Antoshenkov was the brains behind some of the really cool stuff in DEC
Rdb (what eventually became Oracle). Compressed bitmap indices,
parallel competing query plans, and smart handling of keys with
hyperbolic distributions.
--
Engineers think equations approximate reality.
Physicists think reality approximates the equations.
Mathematicians never make the connection.


---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply


Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On
Forum Jump


All times are GMT. The time now is 01:17 PM.


Powered by vBulletin® Version 3.6.5
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
SEO by vBSEO 3.2.0
www.UnixAdminTalk.com