Unix Technical Forum

Re: [PERFORM] A Better External Sort?

This is a discussion on Re: [PERFORM] A Better External Sort? within the pgsql Hackers forums, part of the PostgreSQL category; --> > -----Original Message----- > From: pgsql-hackers-owner@postgresql.org [mailto gsql-hackers- > owner@postgresql.org ] On Behalf Of PFC > Sent: Thursday, September ...


Go Back   Unix Technical Forum > Database Server Software > PostgreSQL > pgsql Hackers

FAQ Members List Calendar Search Today's Posts Mark Forums Read
  #1 (permalink)  
Old 04-11-2008, 07:01 AM
Dann Corbit
 
Posts: n/a
Default Re: [PERFORM] A Better External Sort?

> -----Original Message-----
> From: pgsql-hackers-owner@postgresql.org [mailtogsql-hackers-
> owner@postgresql.org] On Behalf Of PFC
> Sent: Thursday, September 29, 2005 9:10 AM
> To: rjpeace@earthlink.net
> Cc: Pg Hackers; pgsql-performance@postgresql.org
> Subject: Re: [HACKERS] [PERFORM] A Better External Sort?
>
>
> Just to add a little anarchy in your nice debate...
>
> Who really needs all the results of a sort on your terabyte

table ?

Reports with ORDER BY/GROUP BY, and many other possibilities. 40% of
mainframe CPU cycles are spent sorting. That is because huge volumes of
data require lots of energy to be meaningfully categorized. Let's
suppose that instead of a terabyte of data (or a petabyte or whatever)
we have 10% of it. That's still a lot of data.

> I guess not many people do a SELECT from such a table and want

all
> the
> results.


What happens when they do? The cases where it is already fast are not
very important. The cases where things go into the crapper are the ones
that need attention.

>So, this leaves :
> - Really wanting all the results, to fetch using a cursor,
> - CLUSTER type things, where you really want everything in

order,
> - Aggregates (Sort->GroupAggregate), which might really need to

sort
> the
> whole table.
> - Complex queries where the whole dataset needs to be examined,

in
> order
> to return a few values
> - Joins (again, the whole table is probably not going to be
> selected)
> - And the ones I forgot.
>
> However,
>
> Most likely you only want to SELECT N rows, in some ordering :
> - the first N (ORDER BY x LIMIT N)
> - last N (ORDER BY x DESC LIMIT N)


For these, the QuickSelect algorithm is what is wanted. For example:
#include <stdlib.h>
typedef double Etype;

extern Etype RandomSelect(Etype * A, size_t p, size_t r, size_t i);
extern size_t RandRange(size_t a, size_t b);
extern size_t RandomPartition(Etype * A, size_t p, size_t r);
extern size_t Partition(Etype * A, size_t p, size_t r);

/*
**
** In the following code, every reference to CLR means:
**
** "Introduction to Algorithms"
** By Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest
** ISBN 0-07-013143-0
*/


/*
** CLR, page 187
*/
Etype RandomSelect(Etype A[], size_t p, size_t r, size_t i)
{
size_t q,
k;
if (p == r)
return A[p];
q = RandomPartition(A, p, r);
k = q - p + 1;

if (i <= k)
return RandomSelect(A, p, q, i);
else
return RandomSelect(A, q + 1, r, i - k);
}

size_t RandRange(size_t a, size_t b)
{
size_t c = (size_t) ((double) rand() / ((double) RAND_MAX +
1) * (b - a));
return c + a;
}

/*
** CLR, page 162
*/
size_t RandomPartition(Etype A[], size_t p, size_t r)
{
size_t i = RandRange(p, r);
Etype Temp;
Temp = A[p];
A[p] = A[i];
A[i] = Temp;
return Partition(A, p, r);
}

/*
** CLR, page 154
*/
size_t Partition(Etype A[], size_t p, size_t r)
{
Etype x,
temp;
size_t i,
j;

x = A[p];
i = p - 1;
j = r + 1;

for (; {
do {
j--;
} while (!(A[j] <= x));
do {
i++;
} while (!(A[i] >= x));
if (i < j) {
temp = A[i];
A[i] = A[j];
A[j] = temp;
} else
return j;
}
}

> - WHERE x>value ORDER BY x LIMIT N
> - WHERE x<value ORDER BY x DESC LIMIT N
> - and other variants
>
> Or, you are doing a Merge JOIN against some other table ; in

that
> case,
> yes, you might need the whole sorted terabyte table, but most likely

there
> are WHERE clauses in the query that restrict the set, and thus, maybe

we
> can get some conditions or limit values on the column to sort.


Where clause filters are to be applied AFTER the join operations,
according to the SQL standard.

> Also the new, optimized hash join, which is more memory

efficient,
> might
> cover this case.


For == joins. Not every order by is applied to joins. And not every
join is an equal join.

> Point is, sometimes, you only need part of the results of your

sort.
> And
> the bigger the sort, the most likely it becomes that you only want

part of
> the results.


That is an assumption that will sometimes be true, and sometimes not.
It is not possible to predict usage patterns for a general purpose
database system.

> So, while we're in the fun hand-waving, new algorithm trying
> mode, why not consider this right from the start ? (I know I'm totally

in
> hand-waving mode right now, so slap me if needed).
>
> I'd say your new, fancy sort algorithm needs a few more input

values
> :
>
> - Range of values that must appear in the final result of the

sort :
> none, minimum, maximum, both, or even a set of values

from the
> other
> side of the join, hashed, or sorted.


That will already happen (or it certainly ought to)

> - LIMIT information (first N, last N, none)


That will already happen (or it certainly ought to -- I would be pretty
surprised if it does not happen)

> - Enhanced Limit information (first/last N values of the second
> column to
> sort, for each value of the first column) (the infamous "top10 by
> category" query)
> - etc.


All the filters will (at some point) be applied to the data unless they
cannot be applied to the data by formal rule.

> With this, the amount of data that needs to be kept in memory is
> dramatically reduced, from the whole table (even using your compressed
> keys, that's big) to something more manageable which will be closer to

the
> size of the final result set which will be returned to the client, and
> avoid a lot of effort.


Sorting the minimal set is a good idea. Sometimes there is a big
savings there. I would be pretty surprised if a large fraction of data
that does not have to be included is actually processed during the
sorts.

> So, this would not be useful in all cases, but when it applies,

it
> would
> be really useful.


No argument there. And if an algorithm is being reworked, it is a good
idea to look at things like filtering to see if all filtering that is
allowed by the language standard before the sort takes place is applied.

---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply


Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On
Forum Jump


All times are GMT. The time now is 12:27 AM.


Powered by vBulletin® Version 3.6.5
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
SEO by vBSEO 3.2.0
www.UnixAdminTalk.com