Unix Technical Forum

Re: Merge algorithms for large numbers of "tapes"

This is a discussion on Re: Merge algorithms for large numbers of "tapes" within the pgsql Hackers forums, part of the PostgreSQL category; --> I do not clearly understand the sorting code in PostgreSQL. If I did have a good grasp of it, ...


Go Back   Unix Technical Forum > Database Server Software > PostgreSQL > pgsql Hackers

Register FAQ Members List Calendar Search Today's Posts Mark Forums Read
  #1 (permalink)  
Old 04-12-2008, 01:31 AM
Dann Corbit
 
Posts: n/a
Default Re: Merge algorithms for large numbers of "tapes"

I do not clearly understand the sorting code in PostgreSQL. If I did
have a good grasp of it, I would take a go at improving it.

Here are some suggestions of things that I know work really, really
well:

#1. Two pass merge (none of that silly poly-tape merge goo)

#2. Load ONLY the keys that are to be sorted into memory. Use a
pointer exchange sort, and do not move the physical rows of data at all.

I am pretty sure from this thread that PostgreSQL is not doing #1, and I
have no idea if it is doing #2.

A useful trick:
Since merge is mentioned, I should say something else about merge joins.
If you do not have room to load the sorted keys for bsearch, load every
kth key (where k is computed by sizeof merge_ram / sizeof key_data).
Then, when you have found the block the thing you are looking for by the
"kth key bsearch", bsearch that block.

Now, maybe PostrgeSQL already uses tricks better than these. I don't
know. But if they prove helpful suggestions I will be glad of it.

> -----Original Message-----
> From: pgsql-hackers-owner@postgresql.org [mailtogsql-hackers-
> owner@postgresql.org] On Behalf Of Tom Lane
> Sent: Wednesday, March 08, 2006 12:32 PM
> To: Jim C. Nasby
> Cc: Luke Lonergan; Simon Riggs; pgsql-hackers@postgreSQL.org
> Subject: Re: [HACKERS] Merge algorithms for large numbers of "tapes"
>
> "Jim C. Nasby" <jnasby@pervasive.com> writes:
> > But do fewer/longer sorted runs translate into not merging back to

disk?
> > I thought that was controlled by if we had to be able to rewind the
> > result set.

>
> A plain SELECT ... ORDER BY doesn't assume that anymore. It is still
> required for some cases such as the input to a merge join, but the
> on-the-fly-final-merge code is going to be used a lot more in 8.2 than
> it was before.
>
> regards, tom lane
>
> ---------------------------(end of

broadcast)---------------------------
> TIP 1: if posting/reading through Usenet, please send an appropriate
> subscribe-nomail command to majordomo@postgresql.org so that

your
> message can get through to the mailing list cleanly


---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #2 (permalink)  
Old 04-12-2008, 01:31 AM
Luke Lonergan
 
Posts: n/a
Default Re: Merge algorithms for large numbers of "tapes"

Dann,

On 3/8/06 12:39 PM, "Dann Corbit" <DCorbit@connx.com> wrote:

> Here are some suggestions of things that I know work really, really
> well:


Can you point to an example? That might help move the discussion along.

The reason to interject about the tape goo in this discussion is that we
seem to be spending a lot of time optimizing around the tape goo without
tackling the overall structure of the external sort. I think we'll just end
up having to replace all of this goo when we really get around to fixing the
problem.

Add to this that other commercial databases external sort in 1/4 the time or
better on the same hardware with the same CPU/memory resources using a
2-pass external sort.

> #1. Two pass merge (none of that silly poly-tape merge goo)


Voice of reason here. It's what the other database systems do.

> #2. Load ONLY the keys that are to be sorted into memory. Use a
> pointer exchange sort, and do not move the physical rows of data at all.


Sounds right. Example of this in practice?

> I am pretty sure from this thread that PostgreSQL is not doing #1, and I
> have no idea if it is doing #2.


Yep. Even Knuth says that the tape goo is only interesting from a
historical perspective and may not be relevant in an era of disk drives.

- Luke



---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #3 (permalink)  
Old 04-12-2008, 01:31 AM
Tom Lane
 
Posts: n/a
Default Re: Merge algorithms for large numbers of "tapes"

"Dann Corbit" <DCorbit@connx.com> writes:
> Here are some suggestions of things that I know work really, really
> well:
> #1. Two pass merge (none of that silly poly-tape merge goo)


This amounts to an assumption that you have infinite work_mem, in which
case you hardly need an external sort at all. If your work_mem is in
fact finite, then at some point you need more than two passes. I'm not
really interested in ripping out support for sort operations that are
much larger than work_mem.

> #2. Load ONLY the keys that are to be sorted into memory. Use a
> pointer exchange sort, and do not move the physical rows of data at all.


This suggestion isn't a whole lot better; in general the rows to be
sorted don't exist until we compute them, and so proposing that we
"don't load them until later" is pretty much irrelevant. Also, in
a lot of common cases the keys to be sorted are the bulk of the data
anyway.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #4 (permalink)  
Old 04-12-2008, 01:31 AM
Greg Stark
 
Posts: n/a
Default Re: Merge algorithms for large numbers of "tapes"


"Luke Lonergan" <llonergan@greenplum.com> writes:

> > I am pretty sure from this thread that PostgreSQL is not doing #1, and I
> > have no idea if it is doing #2.

>
> Yep. Even Knuth says that the tape goo is only interesting from a
> historical perspective and may not be relevant in an era of disk drives.


As the size of the data grows larger the behaviour of hard drives looks more
and more like tapes. The biggest factor controlling the speed of i/o
operations is how many seeks are required to complete them. Effectively
"rewinds" are still the problem it's just that the cost of rewinds becomes
constant regardless of how long the "tape" is.

That's one thing that gives me pause about the current approach of using more
tapes. It seems like ideally the user would create a temporary work space on
each spindle and the database would arrange to use no more than that number of
tapes. Then each merge operation would involve only sequential access for both
reads and writes.

--
greg


---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #5 (permalink)  
Old 04-12-2008, 01:31 AM
Jim C. Nasby
 
Posts: n/a
Default Re: Merge algorithms for large numbers of "tapes"

On Wed, Mar 08, 2006 at 06:55:59PM -0500, Greg Stark wrote:
>
> "Luke Lonergan" <llonergan@greenplum.com> writes:
>
> > > I am pretty sure from this thread that PostgreSQL is not doing #1, and I
> > > have no idea if it is doing #2.

> >
> > Yep. Even Knuth says that the tape goo is only interesting from a
> > historical perspective and may not be relevant in an era of disk drives.

>
> As the size of the data grows larger the behaviour of hard drives looks more
> and more like tapes. The biggest factor controlling the speed of i/o
> operations is how many seeks are required to complete them. Effectively
> "rewinds" are still the problem it's just that the cost of rewinds becomes
> constant regardless of how long the "tape" is.


But it will take a whole lot of those rewinds to equal the amount of
time required by an additional pass through the data. I'll venture a
guess that as long as you've got enough memory to still read chunks back
in 8k blocks that it won't be possible for a multi-pass sort to
out-perform a one-pass sort. Especially if you also had the ability to
do pre-fetching (not something to fuss with now, but certainly a
possibility in the future).

In any case, what we really need is at least good models backed by good
drive performance data. And we really should have that anyway so that we
can improve upon our cost estimator functions. I'm betting that what
that will show us is that no single sort method is going to work best
for all cases. IE: I'd bet that if your data set is sufficiently larger
than available memory that you'll actually be better off with a
multi-pass approach over a single/two pass approach.

> That's one thing that gives me pause about the current approach of using more
> tapes. It seems like ideally the user would create a temporary work space on
> each spindle and the database would arrange to use no more than that number of
> tapes. Then each merge operation would involve only sequential access for both
> reads and writes.


For that to be of any use, wouldn't you need to use only as many tapes
as spindles/2? Otherwise you're still trying to read and write from the
same set of drives, which means you're probably doing a lot of seeking.
Or do the tape algorithms re-write data as they read it?
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461

---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #6 (permalink)  
Old 04-12-2008, 01:31 AM
Andrew Dunstan
 
Posts: n/a
Default Re: Merge algorithms for large numbers of "tapes"

Dann Corbit wrote:

>I do not clearly understand the sorting code in PostgreSQL. If I did
>have a good grasp of it, I would take a go at improving it.
>
>
>


"Show me the code" (and the benchmarks).

Seriously. We see regular discussions on this and similar topics, but I
haven't seen a patch that anyone has proven is an unequivocal
improvement. that I can recall.

cheers

andrew



---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #7 (permalink)  
Old 04-12-2008, 01:31 AM
Greg Stark
 
Posts: n/a
Default Re: Merge algorithms for large numbers of "tapes"


"Jim C. Nasby" <jnasby@pervasive.com> writes:

> On Wed, Mar 08, 2006 at 06:55:59PM -0500, Greg Stark wrote:
> >
> > "Luke Lonergan" <llonergan@greenplum.com> writes:
> >
> > > > I am pretty sure from this thread that PostgreSQL is not doing #1, and I
> > > > have no idea if it is doing #2.
> > >
> > > Yep. Even Knuth says that the tape goo is only interesting from a
> > > historical perspective and may not be relevant in an era of disk drives.

> >
> > As the size of the data grows larger the behaviour of hard drives looks more
> > and more like tapes. The biggest factor controlling the speed of i/o
> > operations is how many seeks are required to complete them. Effectively
> > "rewinds" are still the problem it's just that the cost of rewinds becomes
> > constant regardless of how long the "tape" is.

>
> But it will take a whole lot of those rewinds to equal the amount of
> time required by an additional pass through the data. I'll venture a
> guess that as long as you've got enough memory to still read chunks back
> in 8k blocks that it won't be possible for a multi-pass sort to
> out-perform a one-pass sort.


Well that's clearly a bit overoptimistic. If we believe the random page cost
of 4 then having more tapes than you have spindles would impose a penalty
equal to having four times as many passes.

(And that's *with* the 8k block size. And with the kernel performing pre-fetch
already too.)

> For that to be of any use, wouldn't you need to use only as many tapes
> as spindles/2? Otherwise you're still trying to read and write from the
> same set of drives, which means you're probably doing a lot of seeking.
> Or do the tape algorithms re-write data as they read it?


Well, spindles-1. I was thinking as many tapes as you have spindles *in total*,
ie, including the output tape. You only have one output tape for each n-way
merge though.

--
greg


---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #8 (permalink)  
Old 04-12-2008, 01:31 AM
Hannu Krosing
 
Posts: n/a
Default Re: Merge algorithms for large numbers of "tapes"

Ühel kenal päeval, K, 2006-03-08 kell 20:08, kirjutas Jim C. Nasby:

> But it will take a whole lot of those rewinds to equal the amount of
> time required by an additional pass through the data.


I guess that missing a sector read also implies a "rewind", i.e. if you
don't process the data read from a "tape" fast enough, you will have to
wait a whole disc revolution (~== "seek time" on modern disks) before
you get the next chunk of data.

> I'll venture a
> guess that as long as you've got enough memory to still read chunks back
> in 8k blocks that it won't be possible for a multi-pass sort to
> out-perform a one-pass sort. Especially if you also had the ability to
> do pre-fetching (not something to fuss with now, but certainly a
> possibility in the future).
>
> In any case, what we really need is at least good models backed by good
> drive performance data.


And filesystem performance data, as postgres uses OS-s native
filesystems.

--------------
Hannu


---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #9 (permalink)  
Old 04-12-2008, 01:32 AM
Florian Weimer
 
Posts: n/a
Default Re: Merge algorithms for large numbers of "tapes"

* Greg Stark:

> That's one thing that gives me pause about the current approach of
> using more tapes. It seems like ideally the user would create a
> temporary work space on each spindle and the database would arrange
> to use no more than that number of tapes. Then each merge operation
> would involve only sequential access for both reads and writes.


And you'd need to preallocate the files in some way or other, to avoid
file system fragmentation.

---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #10 (permalink)  
Old 04-12-2008, 01:32 AM
Jim C. Nasby
 
Posts: n/a
Default Re: Merge algorithms for large numbers of "tapes"

On Wed, Mar 08, 2006 at 10:20:08PM -0500, Greg Stark wrote:
> > For that to be of any use, wouldn't you need to use only as many tapes
> > as spindles/2? Otherwise you're still trying to read and write from the
> > same set of drives, which means you're probably doing a lot of seeking.
> > Or do the tape algorithms re-write data as they read it?

>
> Well, spindles-1. I was thinking as many tapes as you have spindles *in total*,
> ie, including the output tape. You only have one output tape for each n-way
> merge though.


Well, the reality remains though; most folks are unlikely to setup
enough dedicated temp areas so that we can do one tape per disk, so it
would be really good to have a sort method that didn't rely on that.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461

---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply


Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On
Forum Jump


All times are GMT. The time now is 10:26 AM.


Powered by vBulletin® Version 3.6.5
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
SEO by vBSEO 3.2.0
www.UnixAdminTalk.com