This is a discussion on Re: Re: Which qsort is used within the pgsql Hackers forums, part of the PostgreSQL category; --> > -----Original Message----- > From: Tom Lane [mailto:tgl@sss.pgh.pa.us] > Sent: Friday, December 16, 2005 9:03 PM > To: Dann ...
| |||||||
| FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| > -----Original Message----- > From: Tom Lane [mailto:tgl@sss.pgh.pa.us] > Sent: Friday, December 16, 2005 9:03 PM > To: Dann Corbit > Cc: Qingqing Zhou; Bruce Momjian; Luke Lonergan; Neil Conway; pgsql- > hackers@postgresql.org > Subject: Re: [HACKERS] Re: Which qsort is used > > "Dann Corbit" <DCorbit@connx.com> writes: > > Both of them are modified versions of Bentley's sort algorithm. > > Jon Bentley of Bell Labs? Small world ... he was my thesis adviser > for awhile when he was at CMU. He's a good algorithms man, for sure. > > > I just added the in-order check to both of them, and the reversed > > partition check for the second method (in the case of the subfiles > > because insertion sort is allergic to reversed partitions). > > I've still got a problem with these checks; I think they are a net > waste of cycles on average. They would only be a win if you expected a > nontrivial percentage of perfectly-in-order or perfectly-reverse-order > inputs. What I think is much more probable in the Postgres environment > is almost-but-not-quite-ordered inputs --- eg, a table that was > perfectly ordered by key when filled, but some of the tuples have since > been moved by UPDATEs. This is the worst possible case for the in-order > checks, because they can then grovel over large percentages of the file > before failing ... and when they fail, those cycles are entirely wasted; > you have not advanced the state of the sort at all. > > For the "usual" case of randomly ordered input, of course it doesn't > matter much at all because the in-order checks will fail after examining > not too many items. But to argue that the checks are worth making, you > have to assume that perfectly-ordered inputs are more common than > almost-ordered inputs, and I think that is exactly the wrong assumption > for the Postgres environment. I sure haven't seen any evidence that > it's a good assumption. The benchmarks say that they (order checks) are a good idea on average for ordered data, random data, and partly ordered data. It does not require perfectly ordered data for the checks to be useful. On mostly ordered data, it is likely that some partitions are perfectly ordered. If you trace the algorithms in a debugger you will be surprised at how often the partitions are ordered, even with random sets as input. ---------------------------(end of broadcast)--------------------------- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.org so that your message can get through to the mailing list cleanly |
| |||
| On Sat, 17 Dec 2005, Dann Corbit wrote: > > The benchmarks say that they (order checks) are a good idea on average > for ordered data, random data, and partly ordered data. > I interpret that in linux, 5000000 seems a divide for qsortpdq. Before that number, it wins, after that, bsd wins more. On SunOS, qsortpdq takes the lead till the last second -- I suspect this is due to the rand() function: Linux - #define RAND_MAX 2147483647 SunOS - #define RAND_MAX 32767 So in SunOS, the data actually not that scattered - so more favourate for sorted() or reversed() check? Regards, Qingqing ---------------------------(end of broadcast)--------------------------- TIP 4: Have you searched our list archives? http://archives.postgresql.org |
| ||||
| "Dann Corbit" <DCorbit@connx.com> writes: >> I've still got a problem with these checks; I think they are a net >> waste of cycles on average. > The benchmarks say that they (order checks) are a good idea on average > for ordered data, random data, and partly ordered data. There are lies, damn lies, and benchmarks ;-) The problem with citing a benchmark for this discussion is that a benchmark can't tell you anything about real-world probabilities; it only tells you about the probabilities occuring in the benchmark case. You need to make the case that the benchmark reflects the real world, which you didn't. > If you trace the algorithms in a debugger you will be surprised at how > often the partitions are ordered, even with random sets as input. Well, I do agree that checking for orderedness on small partitions would succeed more often than on larger partitions or the whole file --- but the code-as-given checks all the way down. Moreover, the argument given for spending these cycles is that insertion sort sucks on reverse-order input ... where "sucks" means that it spends O(N^2) time. But it spends O(N^2) in the average case, too. regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 4: Have you searched our list archives? http://archives.postgresql.org |