This is a discussion on Re: Merge algorithms for large numbers of "tapes" within the pgsql Hackers forums, part of the PostgreSQL category; --> I do not clearly understand the sorting code in PostgreSQL. If I did have a good grasp of it, ...
| |||||||
| Register | FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| I do not clearly understand the sorting code in PostgreSQL. If I did have a good grasp of it, I would take a go at improving it. Here are some suggestions of things that I know work really, really well: #1. Two pass merge (none of that silly poly-tape merge goo) #2. Load ONLY the keys that are to be sorted into memory. Use a pointer exchange sort, and do not move the physical rows of data at all. I am pretty sure from this thread that PostgreSQL is not doing #1, and I have no idea if it is doing #2. A useful trick: Since merge is mentioned, I should say something else about merge joins. If you do not have room to load the sorted keys for bsearch, load every kth key (where k is computed by sizeof merge_ram / sizeof key_data). Then, when you have found the block the thing you are looking for by the "kth key bsearch", bsearch that block. Now, maybe PostrgeSQL already uses tricks better than these. I don't know. But if they prove helpful suggestions I will be glad of it. > -----Original Message----- > From: pgsql-hackers-owner@postgresql.org [mailto > owner@postgresql.org] On Behalf Of Tom Lane > Sent: Wednesday, March 08, 2006 12:32 PM > To: Jim C. Nasby > Cc: Luke Lonergan; Simon Riggs; pgsql-hackers@postgreSQL.org > Subject: Re: [HACKERS] Merge algorithms for large numbers of "tapes" > > "Jim C. Nasby" <jnasby@pervasive.com> writes: > > But do fewer/longer sorted runs translate into not merging back to disk? > > I thought that was controlled by if we had to be able to rewind the > > result set. > > A plain SELECT ... ORDER BY doesn't assume that anymore. It is still > required for some cases such as the input to a merge join, but the > on-the-fly-final-merge code is going to be used a lot more in 8.2 than > it was before. > > regards, tom lane > > ---------------------------(end of broadcast)--------------------------- > TIP 1: if posting/reading through Usenet, please send an appropriate > subscribe-nomail command to majordomo@postgresql.org so that your > message can get through to the mailing list cleanly ---------------------------(end of broadcast)--------------------------- TIP 6: explain analyze is your friend |
| |||
| Dann, On 3/8/06 12:39 PM, "Dann Corbit" <DCorbit@connx.com> wrote: > Here are some suggestions of things that I know work really, really > well: Can you point to an example? That might help move the discussion along. The reason to interject about the tape goo in this discussion is that we seem to be spending a lot of time optimizing around the tape goo without tackling the overall structure of the external sort. I think we'll just end up having to replace all of this goo when we really get around to fixing the problem. Add to this that other commercial databases external sort in 1/4 the time or better on the same hardware with the same CPU/memory resources using a 2-pass external sort. > #1. Two pass merge (none of that silly poly-tape merge goo) Voice of reason here. It's what the other database systems do. > #2. Load ONLY the keys that are to be sorted into memory. Use a > pointer exchange sort, and do not move the physical rows of data at all. Sounds right. Example of this in practice? > I am pretty sure from this thread that PostgreSQL is not doing #1, and I > have no idea if it is doing #2. Yep. Even Knuth says that the tape goo is only interesting from a historical perspective and may not be relevant in an era of disk drives. - Luke ---------------------------(end of broadcast)--------------------------- TIP 6: explain analyze is your friend |
| |||
| "Dann Corbit" <DCorbit@connx.com> writes: > Here are some suggestions of things that I know work really, really > well: > #1. Two pass merge (none of that silly poly-tape merge goo) This amounts to an assumption that you have infinite work_mem, in which case you hardly need an external sort at all. If your work_mem is in fact finite, then at some point you need more than two passes. I'm not really interested in ripping out support for sort operations that are much larger than work_mem. > #2. Load ONLY the keys that are to be sorted into memory. Use a > pointer exchange sort, and do not move the physical rows of data at all. This suggestion isn't a whole lot better; in general the rows to be sorted don't exist until we compute them, and so proposing that we "don't load them until later" is pretty much irrelevant. Also, in a lot of common cases the keys to be sorted are the bulk of the data anyway. regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 4: Have you searched our list archives? http://archives.postgresql.org |
| |||
| "Luke Lonergan" <llonergan@greenplum.com> writes: > > I am pretty sure from this thread that PostgreSQL is not doing #1, and I > > have no idea if it is doing #2. > > Yep. Even Knuth says that the tape goo is only interesting from a > historical perspective and may not be relevant in an era of disk drives. As the size of the data grows larger the behaviour of hard drives looks more and more like tapes. The biggest factor controlling the speed of i/o operations is how many seeks are required to complete them. Effectively "rewinds" are still the problem it's just that the cost of rewinds becomes constant regardless of how long the "tape" is. That's one thing that gives me pause about the current approach of using more tapes. It seems like ideally the user would create a temporary work space on each spindle and the database would arrange to use no more than that number of tapes. Then each merge operation would involve only sequential access for both reads and writes. -- greg ---------------------------(end of broadcast)--------------------------- TIP 4: Have you searched our list archives? http://archives.postgresql.org |
| |||
| On Wed, Mar 08, 2006 at 06:55:59PM -0500, Greg Stark wrote: > > "Luke Lonergan" <llonergan@greenplum.com> writes: > > > > I am pretty sure from this thread that PostgreSQL is not doing #1, and I > > > have no idea if it is doing #2. > > > > Yep. Even Knuth says that the tape goo is only interesting from a > > historical perspective and may not be relevant in an era of disk drives. > > As the size of the data grows larger the behaviour of hard drives looks more > and more like tapes. The biggest factor controlling the speed of i/o > operations is how many seeks are required to complete them. Effectively > "rewinds" are still the problem it's just that the cost of rewinds becomes > constant regardless of how long the "tape" is. But it will take a whole lot of those rewinds to equal the amount of time required by an additional pass through the data. I'll venture a guess that as long as you've got enough memory to still read chunks back in 8k blocks that it won't be possible for a multi-pass sort to out-perform a one-pass sort. Especially if you also had the ability to do pre-fetching (not something to fuss with now, but certainly a possibility in the future). In any case, what we really need is at least good models backed by good drive performance data. And we really should have that anyway so that we can improve upon our cost estimator functions. I'm betting that what that will show us is that no single sort method is going to work best for all cases. IE: I'd bet that if your data set is sufficiently larger than available memory that you'll actually be better off with a multi-pass approach over a single/two pass approach. > That's one thing that gives me pause about the current approach of using more > tapes. It seems like ideally the user would create a temporary work space on > each spindle and the database would arrange to use no more than that number of > tapes. Then each merge operation would involve only sequential access for both > reads and writes. For that to be of any use, wouldn't you need to use only as many tapes as spindles/2? Otherwise you're still trying to read and write from the same set of drives, which means you're probably doing a lot of seeking. Or do the tape algorithms re-write data as they read it? -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461 ---------------------------(end of broadcast)--------------------------- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.org so that your message can get through to the mailing list cleanly |
| |||
| Dann Corbit wrote: >I do not clearly understand the sorting code in PostgreSQL. If I did >have a good grasp of it, I would take a go at improving it. > > > "Show me the code" (and the benchmarks). Seriously. We see regular discussions on this and similar topics, but I haven't seen a patch that anyone has proven is an unequivocal improvement. that I can recall. cheers andrew ---------------------------(end of broadcast)--------------------------- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match |
| |||
| "Jim C. Nasby" <jnasby@pervasive.com> writes: > On Wed, Mar 08, 2006 at 06:55:59PM -0500, Greg Stark wrote: > > > > "Luke Lonergan" <llonergan@greenplum.com> writes: > > > > > > I am pretty sure from this thread that PostgreSQL is not doing #1, and I > > > > have no idea if it is doing #2. > > > > > > Yep. Even Knuth says that the tape goo is only interesting from a > > > historical perspective and may not be relevant in an era of disk drives. > > > > As the size of the data grows larger the behaviour of hard drives looks more > > and more like tapes. The biggest factor controlling the speed of i/o > > operations is how many seeks are required to complete them. Effectively > > "rewinds" are still the problem it's just that the cost of rewinds becomes > > constant regardless of how long the "tape" is. > > But it will take a whole lot of those rewinds to equal the amount of > time required by an additional pass through the data. I'll venture a > guess that as long as you've got enough memory to still read chunks back > in 8k blocks that it won't be possible for a multi-pass sort to > out-perform a one-pass sort. Well that's clearly a bit overoptimistic. If we believe the random page cost of 4 then having more tapes than you have spindles would impose a penalty equal to having four times as many passes. (And that's *with* the 8k block size. And with the kernel performing pre-fetch already too.) > For that to be of any use, wouldn't you need to use only as many tapes > as spindles/2? Otherwise you're still trying to read and write from the > same set of drives, which means you're probably doing a lot of seeking. > Or do the tape algorithms re-write data as they read it? Well, spindles-1. I was thinking as many tapes as you have spindles *in total*, ie, including the output tape. You only have one output tape for each n-way merge though. -- greg ---------------------------(end of broadcast)--------------------------- TIP 6: explain analyze is your friend |
| |||
| Ühel kenal päeval, K, 2006-03-08 kell 20:08, kirjutas Jim C. Nasby: > But it will take a whole lot of those rewinds to equal the amount of > time required by an additional pass through the data. I guess that missing a sector read also implies a "rewind", i.e. if you don't process the data read from a "tape" fast enough, you will have to wait a whole disc revolution (~== "seek time" on modern disks) before you get the next chunk of data. > I'll venture a > guess that as long as you've got enough memory to still read chunks back > in 8k blocks that it won't be possible for a multi-pass sort to > out-perform a one-pass sort. Especially if you also had the ability to > do pre-fetching (not something to fuss with now, but certainly a > possibility in the future). > > In any case, what we really need is at least good models backed by good > drive performance data. And filesystem performance data, as postgres uses OS-s native filesystems. -------------- Hannu ---------------------------(end of broadcast)--------------------------- TIP 4: Have you searched our list archives? http://archives.postgresql.org |
| |||
| * Greg Stark: > That's one thing that gives me pause about the current approach of > using more tapes. It seems like ideally the user would create a > temporary work space on each spindle and the database would arrange > to use no more than that number of tapes. Then each merge operation > would involve only sequential access for both reads and writes. And you'd need to preallocate the files in some way or other, to avoid file system fragmentation. ---------------------------(end of broadcast)--------------------------- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.org so that your message can get through to the mailing list cleanly |
| ||||
| On Wed, Mar 08, 2006 at 10:20:08PM -0500, Greg Stark wrote: > > For that to be of any use, wouldn't you need to use only as many tapes > > as spindles/2? Otherwise you're still trying to read and write from the > > same set of drives, which means you're probably doing a lot of seeking. > > Or do the tape algorithms re-write data as they read it? > > Well, spindles-1. I was thinking as many tapes as you have spindles *in total*, > ie, including the output tape. You only have one output tape for each n-way > merge though. Well, the reality remains though; most folks are unlikely to setup enough dedicated temp areas so that we can do one tape per disk, so it would be really good to have a sort method that didn't rely on that. -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461 ---------------------------(end of broadcast)--------------------------- TIP 6: explain analyze is your friend |
| Thread Tools | |
| Display Modes | |
|
|