vBulletin Search Engine Optimization
| |||||||
| Register | FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| > -----Original Message----- > From: pgsql-hackers-owner@postgresql.org [mailto > owner@postgresql.org] On Behalf Of Qingqing Zhou > Sent: Friday, December 16, 2005 5:14 PM > To: Bruce Momjian > Cc: Luke Lonergan; Tom Lane; Neil Conway; pgsql-hackers@postgresql.org > Subject: Re: [HACKERS] Re: Which qsort is used > > On Fri, 16 Dec 2005, Bruce Momjian wrote: > > > > > At this point, I think we have done enough testing on enough platforms > > to just use port/qsort on all platforms in 8.2. It seems whenever > > someone tries to improve the BSD qsort, they make it worse. > > > > Not necessariliy true. Dann Corbit sent me an implementation a while ago > (you can see it on the same site). BSD qsort is improved, though not that > much, by two methods. Since Dann write the program from scratch, so I am > not sure if we are afford to take the efforts for the improvement. Both of them are modified versions of Bentley's sort algorithm. I just added the in-order check to both of them, and the reversed partition check for the second method (in the case of the subfiles because insertion sort is allergic to reversed partitions). At any rate, neither is much of an improvement on Bentley's version. For the rare cases of completely ordered or completely reversed, it will be a monumental improvement. But that is a pretty rare case. If I could use C++, I could do much better. It is very difficult for me to write an ADT in C instead of in C++. ---------------------------(end of broadcast)--------------------------- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq |
| |||
| "Dann Corbit" <DCorbit@connx.com> writes: > Both of them are modified versions of Bentley's sort algorithm. Jon Bentley of Bell Labs? Small world ... he was my thesis adviser for awhile when he was at CMU. He's a good algorithms man, for sure. > I just added the in-order check to both of them, and the reversed > partition check for the second method (in the case of the subfiles > because insertion sort is allergic to reversed partitions). I've still got a problem with these checks; I think they are a net waste of cycles on average. They would only be a win if you expected a nontrivial percentage of perfectly-in-order or perfectly-reverse-order inputs. What I think is much more probable in the Postgres environment is almost-but-not-quite-ordered inputs --- eg, a table that was perfectly ordered by key when filled, but some of the tuples have since been moved by UPDATEs. This is the worst possible case for the in-order checks, because they can then grovel over large percentages of the file before failing ... and when they fail, those cycles are entirely wasted; you have not advanced the state of the sort at all. For the "usual" case of randomly ordered input, of course it doesn't matter much at all because the in-order checks will fail after examining not too many items. But to argue that the checks are worth making, you have to assume that perfectly-ordered inputs are more common than almost-ordered inputs, and I think that is exactly the wrong assumption for the Postgres environment. I sure haven't seen any evidence that it's a good assumption. regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 6: explain analyze is your friend |
| |||
| Tom Lane wrote: > I've still got a problem with these checks; I think they are a net > waste of cycles on average. They would only be a win if you expected a > nontrivial percentage of perfectly-in-order or perfectly-reverse-order > inputs. What I think is much more probable in the Postgres environment > is almost-but-not-quite-ordered inputs --- eg, a table that was > perfectly ordered by key when filled, but some of the tuples have since > been moved by UPDATEs. This is the worst possible case for the in-order > checks, because they can then grovel over large percentages of the file > before failing ... and when they fail, those cycles are entirely wasted; > you have not advanced the state of the sort at all. It seems to me like people who reimport their data ordered should simply also CLUSTER their tables (maybe CLUSTER should first check if things are already ordered). PG should then just check of the table is CLUSTER'ed or not. Obviously the "i am clustered flag" needs to be removed the second you get UPDATE's (maybe with a check if the clustered key was updated) or INSERT's (maybe with a check if the row did not end up in the proper order). This would then cover cases where people intentionally make sure their input is sorted, and so this would reduce the number if situations where this check would improve performance. regards, Lukas PS: Hope you guys do not object for me jumping into the discussion like that, not being a PG hacker and all. |
| |||
| On Sat, 17 Dec 2005 00:03:25 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote: >I've still got a problem with these checks; I think they are a net >waste of cycles on average. [...] > and when they fail, those cycles are entirely wasted; >you have not advanced the state of the sort at all. How can we make the initial check "adavance the state of the sort"? One answer might be to exclude the sorted sequence at the start of the array from the qsort, and merge the two sorted lists as the final stage of the sort. Qsorting N elements costs O(N*lnN), so excluding H elements from the sort reduces the cost by at least O(H*lnN). The merge step costs O(N) plus some (<=50%) more memory, unless someone knows a fast in-place merge. So depending on the constant factors involved there might be a usable solution. I've been playing with some numbers and assuming the constant factors to be equal for all the O()'s this method starts to pay off at H for N 20 100 130 1000 8000 100000 Servus Manfred ---------------------------(end of broadcast)--------------------------- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match |
| |||
| On Thu, Dec 22, 2005 at 01:43:34AM +0100, Manfred Koizar wrote: > Qsorting N elements costs O(N*lnN), so excluding H elements from the > sort reduces the cost by at least O(H*lnN). The merge step costs O(N) > plus some (<=50%) more memory, unless someone knows a fast in-place > merge. So depending on the constant factors involved there might be a > usable solution. But where are you including the cost to check how many cells are already sorted? That would be O(H), right? This is where we come back to the issue that comparisons in PostgreSQL are expensive. The cpu_cost in the tests I saw so far is unrealistically low. > I've been playing with some numbers and assuming the constant factors > to be equal for all the O()'s this method starts to pay off at > H for N > 20 100 20% > 130 1000 13% > 8000 100000 8% Hmm, what are the chances you have 100000 unordered items to sort and that the first 8% will already be in order. ISTM that that probability will be close enough to zero to not matter... Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.0.6 (GNU/Linux) Comment: For info see http://www.gnupg.org iD8DBQFDqk8oIB7bNG8LQkwRAjJhAJ47eXRi1DJ02cfKcnN2iP kaBB0eaQCeIiF+ HOAYIPQrU2gpUUiGT3aGUUw= =R0hU -----END PGP SIGNATURE----- |
| ||||
| On Thu, 22 Dec 2005 08:01:00 +0100, Martijn van Oosterhout <kleptog@svana.org> wrote: >But where are you including the cost to check how many cells are >already sorted? That would be O(H), right? Yes. I didn't mention it, because H < N. > This is where we come back >to the issue that comparisons in PostgreSQL are expensive. So we agree that we should try to reduce the number of comparisons. How many comparisons does it take to sort 100000 items? 1.5 million? >Hmm, what are the chances you have 100000 unordered items to sort and >that the first 8% will already be in order. ISTM that that probability >will be close enough to zero to not matter... If the items are totally unordered, the check is so cheap you won't even notice. OTOH in Tom's example ... |What I think is much more probable in the Postgres environment |is almost-but-not-quite-ordered inputs --- eg, a table that was |perfectly ordered by key when filled, but some of the tuples have since |been moved by UPDATEs. .... I'd not be surprised if H is 90% of N. Servus Manfred ---------------------------(end of broadcast)--------------------------- TIP 2: Don't 'kill -9' the postmaster |