This is a discussion on Merge algorithms for large numbers of "tapes" within the pgsql Hackers forums, part of the PostgreSQL category; --> BTW, I was just looking over Knuth's discussion of sorting again, and realized that there is still something more ...
| |||||||
| FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| BTW, I was just looking over Knuth's discussion of sorting again, and realized that there is still something more that could be done within the existing sort framework. We currently use standard polyphase merge (his Algorithm 5.4.2D), which IIRC I chose because it was simple and for relatively small numbers of tapes T it was about as good as anything else. Knuth spends a great deal of energy on minimizing tape rewind time which of course is of no interest to us, and I had supposed that all of his more-complex algorithms were really only of interest if you needed to consider rewind time. However, now that we've changed the code to prefer large numbers of tapes, it's not at all clear that Algorithm D is still the right one to use. In particular I'm looking at cascade merge, Algorithm 5.4.3C, which appears to use significantly fewer passes when T is large. Do you want to try that? regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 2: Don't 'kill -9' the postmaster |
| |||
| On 3/7/06, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > BTW, I was just looking over Knuth's discussion of sorting again, and > realized that there is still something more that could be done within > the existing sort framework. We currently use standard polyphase merge > (his Algorithm 5.4.2D), which IIRC I chose because it was simple and > for relatively small numbers of tapes T it was about as good as anything > else. Knuth spends a great deal of energy on minimizing tape rewind > time which of course is of no interest to us, and I had supposed that > all of his more-complex algorithms were really only of interest if you > needed to consider rewind time. However, now that we've changed the > code to prefer large numbers of tapes, it's not at all clear that > Algorithm D is still the right one to use. In particular I'm looking at > cascade merge, Algorithm 5.4.3C, which appears to use significantly > fewer passes when T is large. Do you want to try that? I haven't personally played with this algorithm but having spent the last 15 minutes reading it over, it does sound like an interesting idea for trial. At first glance it didn't seem much better than polyphase for our case, but after reading the entire algorithm, discussion, and thinking it over for a couple minutes, I could see it as potentially better. Guess we won't really know 'til it can be tested -- Jonah H. Harris, Database Internals Architect EnterpriseDB Corporation 732.331.1324 |
| |||
| Tom, > fewer passes when T is large. Do you want to try that? Two passes is the state-of-the-practice on external disk sorts. If we¹re looking to replace the tape sort approach, I would hope for a two pass approach, with the merge pass avoided in the case of unidirectional access. - Luke ---------------------------(end of broadcast)--------------------------- TIP 6: explain analyze is your friend |
| |||
| "Luke Lonergan" <llonergan@greenplum.com> writes: > Two passes is the state-of-the-practice on external disk sorts. There is no such thing as a fixed number of passes regardless of available memory and size of the data. regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match |
| |||
| Yes * all of the current best practice external sorts use two passes. A first to produce the runs, which results in ³S² number of ³files², then a single merge pass across the runs. At most 1 pass across the S runs is required to implement the merge. - Luke On 3/7/06 8:03 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote: > "Luke Lonergan" <llonergan@greenplum.com> writes: >> > Two passes is the state-of-the-practice on external disk sorts. > > There is no such thing as a fixed number of passes regardless of > available memory and size of the data. > > regards, tom lane > > |
| |||
| "Jonah H. Harris" <jonah.harris@gmail.com> writes: > On 3/7/06, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > > However, now that we've changed the code to prefer large numbers of tapes, > > it's not at all clear that Algorithm D is still the right one to use. In > > particular I'm looking at cascade merge, Algorithm 5.4.3C, which appears > > to use significantly fewer passes when T is large. Do you want to try > > that? > > Guess we won't really know 'til it can be tested It would also be interesting to allow multiple temporary areas to be declared and try to spread tape files across the temporary areas. Ideally keeping input and output tapes on separate drives. -- greg ---------------------------(end of broadcast)--------------------------- TIP 4: Have you searched our list archives? http://archives.postgresql.org |
| |||
| Tom, On 3/7/06 8:03 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote: > "Luke Lonergan" <llonergan@greenplum.com> writes: >> Two passes is the state-of-the-practice on external disk sorts. > > There is no such thing as a fixed number of passes regardless of > available memory and size of the data. While technically correct, the practice shows that two-passes is the norm for the vast majority of cases since 1986: http://research.microsoft.com/~Gray/...3_FastSort.doc Square root of the sort set is the memory requirement for a two pass sort. 10M for 10GB of sort set, for instance. Given the current resource management limitations we live with, you are correct about multi-pass being necessary, but we should be aware that modern commercial databases allocate enough memory to provide two-pass external sorts. - Luke ---------------------------(end of broadcast)--------------------------- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match |
| |||
| On Tue, 2006-03-07 at 18:14 -0500, Tom Lane wrote: > BTW, I was just looking over Knuth's discussion of sorting again, and > realized that there is still something more that could be done within > the existing sort framework. We currently use standard polyphase merge > (his Algorithm 5.4.2D), which IIRC I chose because it was simple and > for relatively small numbers of tapes T it was about as good as anything > else. Knuth spends a great deal of energy on minimizing tape rewind > time which of course is of no interest to us, and I had supposed that > all of his more-complex algorithms were really only of interest if you > needed to consider rewind time. However, now that we've changed the > code to prefer large numbers of tapes, it's not at all clear that > Algorithm D is still the right one to use. In particular I'm looking at > cascade merge, Algorithm 5.4.3C, which appears to use significantly > fewer passes when T is large. Ah! Well spotted. Yeh, looks like it will improve performance a good deal. So, yes, definitely a TODO item. > Do you want to try that? The Cascade Merge re-writes the way logical tapes are selected and how the runs are merged. It doesn't seem to do anything at all about the run-forming, which would still use heapsort. So the only effect is when we have more runs than "tapes", so for the limits of where we would begin noticing any benefit would be: work_mem= 1 GB benefit at 8 TB work_mem= 256MB benefit at 0.5 TB work_mem= 8MB benefit at 256 MB work_mem= 1MB benefit at 12 MB (min 7 tapes). (based upon runs on average twice size of memory, and each logical tape requiring 256KB memory, i.e. min(work_mem/4, 6) * work_mem * 2, which for work_mem > 2 MB gives 0.5 * work_mem^2) Which means the benefit we get is when we have for some reason been unable to give the sort enough space, or not set parameters correctly. So, still a concern...but makes me think about 2 other issues first: 1. Earlier we had some results that showed that the heapsorts got slower when work_mem was higher and that concerns me most of all right now. It's possible you'll have reduced that considerably with the pull-out-the-first-attr patch. I'll look into some test results to show that has gone away. We also have Nyberg et al telling us that as of 1994 they established that heapsort would always be slower than qsort, as a result of CPU cache locality improvements. An improvement here would effect all sorts > work_mem. 2. Improvement in the way we do overall memory allocation, so we would not have the problem of undersetting work_mem that we currently experience. If we solved this problem we would have faster sorts in *all* cases, not just extremely large ones. Dynamically setting work_mem higher when possible would be very useful. I've looked at this a few times and have some suggestions, but perhaps its worth asking for ideas in this area? Best Regards, Simon Riggs ---------------------------(end of broadcast)--------------------------- TIP 2: Don't 'kill -9' the postmaster |
| |||
| Simon Riggs <simon@2ndquadrant.com> writes: > 1. Earlier we had some results that showed that the heapsorts got slower > when work_mem was higher and that concerns me most of all right now. Fair enough, but that's completely independent of the merge algorithm. (I don't think the Nyberg results necessarily apply to our situation anyway, as we are not sorting arrays of integers, and hence the cache effects are far weaker for us. I don't mind trying alternate sort algorithms, but I'm not going to believe an improvement in advance of direct evidence in our own environment.) > 2. Improvement in the way we do overall memory allocation, so we would > not have the problem of undersetting work_mem that we currently > experience. If we solved this problem we would have faster sorts in > *all* cases, not just extremely large ones. Dynamically setting work_mem > higher when possible would be very useful. I think this would be extremely dangerous, as it would encourage processes to take more than their fair share of available resources. Also, to the extent that you believe the problem is insufficient L2 cache, it seems increasing work_mem to many times the size of L2 will always be counterproductive. (Certainly there is no value in increasing work_mem until we are in a regime where it consistently improves performance significantly, which it seems we aren't yet.) regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 5: don't forget to increase your free space map settings |
| ||||
| Tom, On 3/8/06 7:21 AM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote: > Simon Riggs <simon@2ndquadrant.com> writes: >> 1. Earlier we had some results that showed that the heapsorts got slower >> when work_mem was higher and that concerns me most of all right now. > > Fair enough, but that's completely independent of the merge algorithm. > (I don't think the Nyberg results necessarily apply to our situation > anyway, as we are not sorting arrays of integers, and hence the cache > effects are far weaker for us. I don't mind trying alternate sort Even with the indirection, we should investigate alternative approaches that others have demonstrated to be superior WRT L2 cache use. A major commercial database currently performs external sorts of various fields 4 times faster, and commonly uses more than 256MB of sort memory in one example case to do it. > I think this would be extremely dangerous, as it would encourage > processes to take more than their fair share of available resources. I agree - in fact, we currently have no structured concept of "fair share of available resources", nor a way to share them. I think the answer to this should involve the use of statement queuing and resource queues. > Also, to the extent that you believe the problem is insufficient L2 > cache, it seems increasing work_mem to many times the size of L2 will > always be counterproductive. (Certainly there is no value in increasing > work_mem until we are in a regime where it consistently improves > performance significantly, which it seems we aren't yet.) Not if you cache block, the optimization that operates on a block of memory one L2 block in size at a time. - Luke ---------------------------(end of broadcast)--------------------------- TIP 6: explain analyze is your friend |
| Thread Tools | |
| Display Modes | |
|
|