Unix Technical Forum

Merge algorithms for large numbers of "tapes"

This is a discussion on Merge algorithms for large numbers of "tapes" within the pgsql Hackers forums, part of the PostgreSQL category; --> BTW, I was just looking over Knuth's discussion of sorting again, and realized that there is still something more ...


Go Back   Unix Technical Forum > Database Server Software > PostgreSQL > pgsql Hackers

FAQ Members List Calendar Search Today's Posts Mark Forums Read
  #1 (permalink)  
Old 04-12-2008, 01:30 AM
Tom Lane
 
Posts: n/a
Default Merge algorithms for large numbers of "tapes"

BTW, I was just looking over Knuth's discussion of sorting again, and
realized that there is still something more that could be done within
the existing sort framework. We currently use standard polyphase merge
(his Algorithm 5.4.2D), which IIRC I chose because it was simple and
for relatively small numbers of tapes T it was about as good as anything
else. Knuth spends a great deal of energy on minimizing tape rewind
time which of course is of no interest to us, and I had supposed that
all of his more-complex algorithms were really only of interest if you
needed to consider rewind time. However, now that we've changed the
code to prefer large numbers of tapes, it's not at all clear that
Algorithm D is still the right one to use. In particular I'm looking at
cascade merge, Algorithm 5.4.3C, which appears to use significantly
fewer passes when T is large. Do you want to try that?

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #2 (permalink)  
Old 04-12-2008, 01:30 AM
Jonah H. Harris
 
Posts: n/a
Default Re: Merge algorithms for large numbers of "tapes"

On 3/7/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> BTW, I was just looking over Knuth's discussion of sorting again, and
> realized that there is still something more that could be done within
> the existing sort framework. We currently use standard polyphase merge
> (his Algorithm 5.4.2D), which IIRC I chose because it was simple and
> for relatively small numbers of tapes T it was about as good as anything
> else. Knuth spends a great deal of energy on minimizing tape rewind
> time which of course is of no interest to us, and I had supposed that
> all of his more-complex algorithms were really only of interest if you
> needed to consider rewind time. However, now that we've changed the
> code to prefer large numbers of tapes, it's not at all clear that
> Algorithm D is still the right one to use. In particular I'm looking at
> cascade merge, Algorithm 5.4.3C, which appears to use significantly
> fewer passes when T is large. Do you want to try that?



I haven't personally played with this algorithm but having spent the last 15
minutes reading it over, it does sound like an interesting idea for trial.
At first glance it didn't seem much better than polyphase for our case, but
after reading the entire algorithm, discussion, and thinking it over for a
couple minutes, I could see it as potentially better.

Guess we won't really know 'til it can be tested


--
Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation
732.331.1324

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #3 (permalink)  
Old 04-12-2008, 01:30 AM
Luke Lonergan
 
Posts: n/a
Default Re: Merge algorithms for large numbers of "tapes"

Tom,

> fewer passes when T is large. Do you want to try that?


Two passes is the state-of-the-practice on external disk sorts.

If we¹re looking to replace the tape sort approach, I would hope for a two
pass approach, with the merge pass avoided in the case of unidirectional
access.

- Luke



---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #4 (permalink)  
Old 04-12-2008, 01:30 AM
Tom Lane
 
Posts: n/a
Default Re: Merge algorithms for large numbers of "tapes"

"Luke Lonergan" <llonergan@greenplum.com> writes:
> Two passes is the state-of-the-practice on external disk sorts.


There is no such thing as a fixed number of passes regardless of
available memory and size of the data.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #5 (permalink)  
Old 04-12-2008, 01:30 AM
Luke Lonergan
 
Posts: n/a
Default Re: Merge algorithms for large numbers of "tapes"

Yes * all of the current best practice external sorts use two passes. A
first to produce the runs, which results in ³S² number of ³files², then a
single merge pass across the runs. At most 1 pass across the S runs is
required to implement the merge.

- Luke


On 3/7/06 8:03 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:

> "Luke Lonergan" <llonergan@greenplum.com> writes:
>> > Two passes is the state-of-the-practice on external disk sorts.

>
> There is no such thing as a fixed number of passes regardless of
> available memory and size of the data.
>
> regards, tom lane
>
>




Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #6 (permalink)  
Old 04-12-2008, 01:30 AM
Greg Stark
 
Posts: n/a
Default Re: Merge algorithms for large numbers of "tapes"

"Jonah H. Harris" <jonah.harris@gmail.com> writes:

> On 3/7/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >
> > However, now that we've changed the code to prefer large numbers of tapes,
> > it's not at all clear that Algorithm D is still the right one to use. In
> > particular I'm looking at cascade merge, Algorithm 5.4.3C, which appears
> > to use significantly fewer passes when T is large. Do you want to try
> > that?

>
> Guess we won't really know 'til it can be tested


It would also be interesting to allow multiple temporary areas to be declared
and try to spread tape files across the temporary areas. Ideally keeping input
and output tapes on separate drives.

--
greg


---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #7 (permalink)  
Old 04-12-2008, 01:30 AM
Luke Lonergan
 
Posts: n/a
Default Re: Merge algorithms for large numbers of "tapes"

Tom,

On 3/7/06 8:03 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:

> "Luke Lonergan" <llonergan@greenplum.com> writes:
>> Two passes is the state-of-the-practice on external disk sorts.

>
> There is no such thing as a fixed number of passes regardless of
> available memory and size of the data.


While technically correct, the practice shows that two-passes is the norm
for the vast majority of cases since 1986:
http://research.microsoft.com/~Gray/...3_FastSort.doc

Square root of the sort set is the memory requirement for a two pass sort.
10M for 10GB of sort set, for instance.

Given the current resource management limitations we live with, you are
correct about multi-pass being necessary, but we should be aware that modern
commercial databases allocate enough memory to provide two-pass external
sorts.

- Luke



---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #8 (permalink)  
Old 04-12-2008, 01:30 AM
Simon Riggs
 
Posts: n/a
Default Re: Merge algorithms for large numbers of "tapes"

On Tue, 2006-03-07 at 18:14 -0500, Tom Lane wrote:
> BTW, I was just looking over Knuth's discussion of sorting again, and
> realized that there is still something more that could be done within
> the existing sort framework. We currently use standard polyphase merge
> (his Algorithm 5.4.2D), which IIRC I chose because it was simple and
> for relatively small numbers of tapes T it was about as good as anything
> else. Knuth spends a great deal of energy on minimizing tape rewind
> time which of course is of no interest to us, and I had supposed that
> all of his more-complex algorithms were really only of interest if you
> needed to consider rewind time. However, now that we've changed the
> code to prefer large numbers of tapes, it's not at all clear that
> Algorithm D is still the right one to use. In particular I'm looking at
> cascade merge, Algorithm 5.4.3C, which appears to use significantly
> fewer passes when T is large.


Ah! Well spotted. Yeh, looks like it will improve performance a good
deal. So, yes, definitely a TODO item.

> Do you want to try that?


The Cascade Merge re-writes the way logical tapes are selected and how
the runs are merged. It doesn't seem to do anything at all about the
run-forming, which would still use heapsort. So the only effect is when
we have more runs than "tapes", so for the limits of where we would
begin noticing any benefit would be:
work_mem= 1 GB benefit at 8 TB
work_mem= 256MB benefit at 0.5 TB
work_mem= 8MB benefit at 256 MB
work_mem= 1MB benefit at 12 MB (min 7 tapes).
(based upon runs on average twice size of memory, and each logical tape
requiring 256KB memory, i.e. min(work_mem/4, 6) * work_mem * 2, which
for work_mem > 2 MB gives 0.5 * work_mem^2)

Which means the benefit we get is when we have for some reason been
unable to give the sort enough space, or not set parameters correctly.
So, still a concern...but makes me think about 2 other issues first:

1. Earlier we had some results that showed that the heapsorts got slower
when work_mem was higher and that concerns me most of all right now.
It's possible you'll have reduced that considerably with the
pull-out-the-first-attr patch. I'll look into some test results to show
that has gone away. We also have Nyberg et al telling us that as of 1994
they established that heapsort would always be slower than qsort, as a
result of CPU cache locality improvements. An improvement here would
effect all sorts > work_mem.

2. Improvement in the way we do overall memory allocation, so we would
not have the problem of undersetting work_mem that we currently
experience. If we solved this problem we would have faster sorts in
*all* cases, not just extremely large ones. Dynamically setting work_mem
higher when possible would be very useful. I've looked at this a few
times and have some suggestions, but perhaps its worth asking for ideas
in this area?

Best Regards, Simon Riggs



---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #9 (permalink)  
Old 04-12-2008, 01:30 AM
Tom Lane
 
Posts: n/a
Default Re: Merge algorithms for large numbers of "tapes"

Simon Riggs <simon@2ndquadrant.com> writes:
> 1. Earlier we had some results that showed that the heapsorts got slower
> when work_mem was higher and that concerns me most of all right now.


Fair enough, but that's completely independent of the merge algorithm.
(I don't think the Nyberg results necessarily apply to our situation
anyway, as we are not sorting arrays of integers, and hence the cache
effects are far weaker for us. I don't mind trying alternate sort
algorithms, but I'm not going to believe an improvement in advance of
direct evidence in our own environment.)

> 2. Improvement in the way we do overall memory allocation, so we would
> not have the problem of undersetting work_mem that we currently
> experience. If we solved this problem we would have faster sorts in
> *all* cases, not just extremely large ones. Dynamically setting work_mem
> higher when possible would be very useful.


I think this would be extremely dangerous, as it would encourage
processes to take more than their fair share of available resources.
Also, to the extent that you believe the problem is insufficient L2
cache, it seems increasing work_mem to many times the size of L2 will
always be counterproductive. (Certainly there is no value in increasing
work_mem until we are in a regime where it consistently improves
performance significantly, which it seems we aren't yet.)

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #10 (permalink)  
Old 04-12-2008, 01:30 AM
Luke Lonergan
 
Posts: n/a
Default Re: Merge algorithms for large numbers of "tapes"

Tom,

On 3/8/06 7:21 AM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:

> Simon Riggs <simon@2ndquadrant.com> writes:
>> 1. Earlier we had some results that showed that the heapsorts got slower
>> when work_mem was higher and that concerns me most of all right now.

>
> Fair enough, but that's completely independent of the merge algorithm.
> (I don't think the Nyberg results necessarily apply to our situation
> anyway, as we are not sorting arrays of integers, and hence the cache
> effects are far weaker for us. I don't mind trying alternate sort


Even with the indirection, we should investigate alternative approaches that
others have demonstrated to be superior WRT L2 cache use.

A major commercial database currently performs external sorts of various
fields 4 times faster, and commonly uses more than 256MB of sort memory in
one example case to do it.

> I think this would be extremely dangerous, as it would encourage
> processes to take more than their fair share of available resources.


I agree - in fact, we currently have no structured concept of "fair share of
available resources", nor a way to share them.

I think the answer to this should involve the use of statement queuing and
resource queues.

> Also, to the extent that you believe the problem is insufficient L2
> cache, it seems increasing work_mem to many times the size of L2 will
> always be counterproductive. (Certainly there is no value in increasing
> work_mem until we are in a regime where it consistently improves
> performance significantly, which it seems we aren't yet.)


Not if you cache block, the optimization that operates on a block of memory
one L2 block in size at a time.

- Luke



---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply


Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On
Forum Jump


All times are GMT. The time now is 05:39 AM.


Powered by vBulletin® Version 3.6.5
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
SEO by vBSEO 3.2.0
www.UnixAdminTalk.com