This is a discussion on Bad n_distinct estimation; hacks suggested? within the Pgsql Performance forums, part of the PostgreSQL category; --> Hello Everybody, We recently upgraded to Postgres 7.4 from 7.3.9 and noticed that the foreign key constraints compile noticeably ...
| |||||||
| FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| Hello Everybody, We recently upgraded to Postgres 7.4 from 7.3.9 and noticed that the foreign key constraints compile noticeably faster. In 7.3 the constraints would typically take more than an hour to run on our production data. Now they take a minute or two. Can anybody explain such a major performance improvement ? Thanks -- Ashish Arte Open Sky Software ---------------------------(end of broadcast)--------------------------- TIP 6: Have you searched our list archives? http://archives.postgresql.org |
| |||
| Ashish Arte <ashish@openskysoftware.com> writes: > We recently upgraded to Postgres 7.4 from 7.3.9 and noticed that the > foreign key constraints compile noticeably faster. In 7.3 the > constraints would typically take more than an hour to run on our > production data. Now they take a minute or two. > Can anybody explain such a major performance improvement ? Hey, we do do some work on this thing from time to time ;-) Probably you are talking about this: 2003-10-06 12:38 tgl * src/: backend/commands/tablecmds.c, backend/utils/adt/ri_triggers.c, include/commands/trigger.h: During ALTER TABLE ADD FOREIGN KEY, try to check the existing rows using a single LEFT JOIN query instead of firing the check trigger for each row individually. Stephan Szabo, with some kibitzing from Tom Lane and Jan Wieck. regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org |
| |||
| Josh Berkus wrote: > Mischa, > > >>Okay, although given the track record of page-based sampling for >>n-distinct, it's a bit like looking for your keys under the streetlight, >>rather than in the alley where you dropped them :-) > > > Bad analogy, but funny. > > The issue with page-based vs. pure random sampling is that to do, for example, > 10% of rows purely randomly would actually mean loading 50% of pages. With > 20% of rows, you might as well scan the whole table. > > Unless, of course, we use indexes for sampling, which seems like a *really > good* idea to me .... > But doesn't an index only sample one column at a time, whereas with page-based sampling, you can sample all of the columns at once. And not all columns would have indexes, though it could be assumed that if a column doesn't have an index, then it doesn't matter as much for calculations such as n_distinct. But if you had 5 indexed rows in your table, then doing it index wise means you would have to make 5 passes instead of just one. Though I agree that page-based sampling is important for performance reasons. John =:-> -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.1 (GNU/Linux) Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org iD8DBQFCeBsdJdeBCYSNAAMRAu9UAKDDdD+Gc/SgZXjTo15SPIUtklrYZgCfd3VR 4o/cdrYTBR9vdo3fekGjZns= =dsMV -----END PGP SIGNATURE----- |
| |||
| John, > But doesn't an index only sample one column at a time, whereas with > page-based sampling, you can sample all of the columns at once. Hmmm. Yeah, we're not currently doing that though. Another good idea ... -- --Josh Josh Berkus Aglio Database Solutions San Francisco ---------------------------(end of broadcast)--------------------------- TIP 6: Have you searched our list archives? http://archives.postgresql.org |
| ||||
| Quoting Josh Berkus <josh@agliodbs.com>: > Mischa, > > > Okay, although given the track record of page-based sampling for > > n-distinct, it's a bit like looking for your keys under the > streetlight, > > rather than in the alley where you dropped them :-) > > Bad analogy, but funny. Bad analogy? Page-sampling effort versus row-sampling effort, c'est moot. It's not good enough for stats to produce good behaviour on the average. Straight random sampling, page or row, is going to cause enough untrustworthy engine behaviour,for any %ages small enough to allow sampling from scratch at any time. I'm curious what the problem is with relying on a start-up plus incremental method, when the method in the distinct-sampling paper doesn't degenerate: you can start when the table is still empty. Constructing an index requires an initial full scan plus incremental update; what's the diff? > Unless, of course, we use indexes for sampling, which seems like a > *really > good* idea to me .... "distinct-sampling" applies for indexes, too. I started tracking the discussion of this a bit late. Smart method for this is in VLDB'92: Gennady Antoshenkov, "Random Sampling from Pseudo-ranked B+-trees". I don't think this is online anywhere, except if you have a DBLP membership. Does nybod else know better? Antoshenkov was the brains behind some of the really cool stuff in DEC Rdb (what eventually became Oracle). Compressed bitmap indices, parallel competing query plans, and smart handling of keys with hyperbolic distributions. -- Engineers think equations approximate reality. Physicists think reality approximates the equations. Mathematicians never make the connection. ---------------------------(end of broadcast)--------------------------- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.org so that your message can get through to the mailing list cleanly |