Unix Technical Forum

Ordered Append Node

This is a discussion on Ordered Append Node within the pgsql Hackers forums, part of the PostgreSQL category; --> * Markus Schiltknecht: >> uses a heap to efficiently find the next value from the source tapes. > > ...


Go Back   Unix Technical Forum > Database Server Software > PostgreSQL > pgsql Hackers

FAQ Members List Calendar Search Today's Posts Mark Forums Read
  #11 (permalink)  
Old 04-15-2008, 10:34 PM
Florian Weimer
 
Posts: n/a
Default Re: Ordered Append Node

* Markus Schiltknecht:

>> uses a heap to efficiently find the next value from the source tapes.

>
> Well, maybe my point here is: why do you need the heap to sort?


I think you need it because there are potentially many input types.

--
Florian Weimer <fweimer@bfk.de>
BFK edv-consulting GmbH http://www.bfk.de/
Kriegsstraße 100 tel: +49-721-96201-1
D-76133 Karlsruhe fax: +49-721-96201-99

---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #12 (permalink)  
Old 04-15-2008, 10:34 PM
Gregory Stark
 
Posts: n/a
Default Re: Ordered Append Node

"Heikki Linnakangas" <heikki@enterprisedb.com> writes:

> Markus Schiltknecht wrote:
>> Florian Weimer wrote:
>>>> Given the partitioning case, I'd expect all rows to have an equal
>>>> tuple descriptor. Maybe this is a matter of what to optimize, then?
>>>>
>>>> Could you elaborate on what use case you have in mind?
>>>
>>> You need a priority queue to figure out from which tape (partition)
>>> you need to remove the next tuple.

>>
>> And why do you need lots of heap memory to do that? Anything wrong with the
>> zipper approach I've outlined upthread?

>
> We're talking about a binary heap, with just one node per partition. AFAICT
> it's roughly the same data structure as the zipper tree you envisioned, but not
> implemented with separate executor nodes for each level.


Not quite the same since the Executor-based implementation would have a static
tree structure based on the partitions. Even if the partitions are all empty
except for one or two you would still have to push the result records through
all the nodes for the empty partitions.

A heap only has the next record from each input. If an input has reached eof
then the heap doesn't have an entry for that input. So if one of the inputs is
empty (often the parent of the inheritance tree) it doesn't require a test
anywhere to propagate the record up past it.

I also did an optimization similar to the bounded-sort case where we check if
the next tuple from the same input which last contributed the result record
comes before the top element of the heap. That avoids having to do an insert
and siftup only to pull out the same record you just inserted. In theory this
is not an optimization but in practice I think partitioned tables will often
contain contiguous blocks of key values and queries will often be joining
against that key and therefore often want to order by it.

Ideally we would also be able to do this in the planner. If the planner could
determine from the constraints that all the key values from each partition are
non-overlapping and order them properly then it could generate a regular
append node with a path order without the overhead of the run-time
comparisons.

But that requires a) dealing with the problem of the parent table which has no
constraints and b) an efficient way to prove that constraints don't overlap
and order them properly. The latter looks like an O(n^2) problem to me, though
it's a planning problem which might be worth making slow in exchange for even
a small speedup at run-time.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's Slony Replication support!

---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #13 (permalink)  
Old 04-15-2008, 10:34 PM
Csaba Nagy
 
Posts: n/a
Default Re: Ordered Append Node

On Fri, 2007-11-23 at 12:36 +0000, Gregory Stark wrote:
> I also did an optimization similar to the bounded-sort case where we check if
> the next tuple from the same input which last contributed the result record
> comes before the top element of the heap. That avoids having to do an insert
> and siftup only to pull out the same record you just inserted. In theory this
> is not an optimization but in practice I think partitioned tables will often
> contain contiguous blocks of key values and queries will often be joining
> against that key and therefore often want to order by it.


If it is an option, you could also do this by a new method on the heap
which adds a new entry and removes the resulting new head in one atomic
operation. That would work with one single comparison for the less than
current head situation, and it would not need to repeat that comparison
if that fails. Also it could directly remove the head and balance the
tree in one go.

Cheers,
Csaba.



---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #14 (permalink)  
Old 04-15-2008, 10:34 PM
Markus Schiltknecht
 
Posts: n/a
Default Re: Ordered Append Node

Gregory Stark wrote:
> Not quite the same since the Executor-based implementation would have a static
> tree structure based on the partitions. Even if the partitions are all empty
> except for one or two you would still have to push the result records through
> all the nodes for the empty partitions.
>
> A heap only has the next record from each input. If an input has reached eof
> then the heap doesn't have an entry for that input. So if one of the inputs is
> empty (often the parent of the inheritance tree) it doesn't require a test
> anywhere to propagate the record up past it.


Ah, so the binary tree (binary heap?) gets adjusted dynamically. Very
clever! (OTOH probably a micro optimization, but as code is already
there, use it, yeah!)

> I also did an optimization similar to the bounded-sort case where we check if
> the next tuple from the same input which last contributed the result record
> comes before the top element of the heap. That avoids having to do an insert
> and siftup only to pull out the same record you just inserted. In theory this
> is not an optimization but in practice I think partitioned tables will often
> contain contiguous blocks of key values and queries will often be joining
> against that key and therefore often want to order by it.


Hm... that assumes range partitioning, no? If you partition among three
partitions by "id modulo 3", tuples are most probably coming from one
partition after the other, i.e.:
1 2 3 1 2 3 1 2 3 ...

And with hash partitioning, you're completely unable to predict the
ordering.

> Ideally we would also be able to do this in the planner. If the planner could
> determine from the constraints that all the key values from each partition are
> non-overlapping and order them properly then it could generate a regular
> append node with a path order without the overhead of the run-time
> comparisons.


Agreed.

> But that requires a) dealing with the problem of the parent table which has no
> constraints and b) an efficient way to prove that constraints don't overlap
> and order them properly. The latter looks like an O(n^2) problem to me, though
> it's a planning problem which might be worth making slow in exchange for even
> a small speedup at run-time.


Well, I think someday, Postgres needs better support for vertical data
partitioning in general. Dealing with constraints and inheritance is way
too flexible and prone to error. I'll shortly start a new thread about
that, to outline my current thoughts about that topic.

Regards

Markus

---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #15 (permalink)  
Old 04-15-2008, 10:34 PM
Heikki Linnakangas
 
Posts: n/a
Default Re: Ordered Append Node

Gregory Stark wrote:
> Ideally we would also be able to do this in the planner. If the planner could
> determine from the constraints that all the key values from each partition are
> non-overlapping and order them properly then it could generate a regular
> append node with a path order without the overhead of the run-time
> comparisons.
>
> But that requires a) dealing with the problem of the parent table which has no
> constraints and ...


It would help even if you could only prove it for some partitions. You
could use a regular append node for the partitions you know not to
overlap, and merge the output of that with the new kind of ordered
append node. We'd see that the parent table is empty on the first
invocation, and after that the ordered append node would just pass
through the tuples.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

---------------------------(end of broadcast)---------------------------
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #16 (permalink)  
Old 04-15-2008, 10:34 PM
Zeugswetter Andreas ADI SD
 
Posts: n/a
Default Re: Ordered Append Node


> > But that requires a) dealing with the problem of the parent table

which has no
> > constraints and ...


Imho we should provide a mechanism that forces the parent to be empty
and let the planner know.
e.g. a false constraint on parent ONLY.

Andreas


---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #17 (permalink)  
Old 04-15-2008, 10:34 PM
Jens-Wolfhard Schicke
 
Posts: n/a
Default Re: Ordered Append Node

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Gregory Stark wrote:
> But that requires a) dealing with the problem of the parent table which has no
> constraints and b) an efficient way to prove that constraints don't overlap
> and order them properly. The latter looks like an O(n^2) problem to me, though
> it's a planning problem which might be worth making slow in exchange for even
> a small speedup at run-time.


Is it really worthwile to optimize away the heap access by thinking about what the
child tables hold? If the tables are read using IO, I think the complete plan would
turn out to be IO-bound, and the heap is of no interest. If the tables reside in
memory, the heap still only slows the process down by O(log(<number of tables>)) which
usually won't be that much imho.

Nonetheless, in the case of range partitioning, a sort on the lower ends of the ranges
and a linear test of neighbouring ranges for "overlap", skipping emtpy ranges,
should work in O(n log(n)).

Jens-W. Schicke
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFHRu2xzhchXT4RR5ARAu//AKCZWZj680RhnbivbU/UqLBvsigBggCgq0Tw
GB+OYl0VOidmzVcK6ckhFBw=
=gbt7
-----END PGP SIGNATURE-----

---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #18 (permalink)  
Old 04-15-2008, 10:34 PM
Tom Lane
 
Posts: n/a
Default Re: Ordered Append Node

Gregory Stark <stark@enterprisedb.com> writes:
> "Heikki Linnakangas" <heikki@enterprisedb.com> writes:
>> Markus Schiltknecht wrote:
>>> And why do you need lots of heap memory to do that? Anything wrong with the
>>> zipper approach I've outlined upthread?

>>
>> We're talking about a binary heap, with just one node per partition. AFAICT
>> it's roughly the same data structure as the zipper tree you envisioned, but not
>> implemented with separate executor nodes for each level.


> Not quite the same since the Executor-based implementation would have a static
> tree structure based on the partitions. Even if the partitions are all empty
> except for one or two you would still have to push the result records through
> all the nodes for the empty partitions.


Also, the overhead per executor node visit is not exactly trivial.
I think that "zipper" scheme would be quite slow compared to a standard
heap merge within a single node.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #19 (permalink)  
Old 04-15-2008, 10:50 PM
Tom Lane
 
Posts: n/a
Default Re: Ordered Append Node

Gregory Stark <stark@enterprisedb.com> writes:
> I've been hacking on the idea of an Append node which maintains the ordering
> of its subtables merging their records in order.


I finally got round to looking at this ...

> 1) I still haven't completely figured out what to do with equivalence classes.
> My hack of just stuffing all the append subrel vars into there seems to
> work fine. I need to understand what's going on to see if there's really a
> problem with it or not.


This still makes me itchy, but it might be all right, because we'd never
really consider a plan for a single subnode as representing a useful
plan for the whole query. What it would need is some additions to the
README files (ahem) describing what's going on.

> 2) I'm not sure this code will work when the append rel is a target (ie UPDATE
> and DELETE stmts).


It won't, at least not for the case where the appendrel is an
inheritance tree in which the children are unlike the parent.
You have missed updating the estate->es_result_relation_info and
estate->es_junkFilter to match the tuple actually returned. In a plain
Append it is only necessary to update those when switching from one
subnode to the next, so we handle it in exec_append_initialize_next.
In an ordered Append you'd have to track which subnode each entry in the
heap came from (easy, but you aren't doing it now) and update those
fields for every tuple returned.

If you need an example of what I'm talking about:

create table p1(f1 int, f2 int);
create table p2(f3 int, f4 int);
create table c() inherits(p1, p2);
... insert some data ...
update p2 set ...

Of course a plain UPDATE p2 wouldn't have much interest in
ordered-Append plans, but you could force the failure by doing
a joining update with a mergejoin plan.

> 4) I haven't handled mark/restore or random access. I think they could be
> handled and they'll probably be worth the complexity but I'm not sure.


I don't think it'd be nearly as easy as you think, eg just how are you
going to "back up" the merge? A heap merge doesn't normally remember
the past few tuples it emitted. I'd definitely recommend not trying
that in v1.

BTW, I concur with Heikki's suggestion that this might be better done
as a separate executor node type. There'd be a certain amount of
duplicated boilerplate, but that would force you to look at each and
every reference to Append nodes and decide what's needed. There are a
whole bunch of places you've missed touching that know something about
what Append nodes can do, and you'll need to look at them all in any
case.

> 5) Is it considering too many paths?


Hard to say for sure, but I concur that that needs thinking about.
One point is that as soon as you have even one subnode that hasn't
got a cheap-startup-cost path, it probably stops being interesting
to try to build a cheap-startup-cost ordered-Append path. But
exactly what's the threshold of "cheap enough", I'm not sure.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #20 (permalink)  
Old 04-15-2008, 10:50 PM
Gregory Stark
 
Posts: n/a
Default Re: Ordered Append Node


"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> Gregory Stark <stark@enterprisedb.com> writes:
>> I've been hacking on the idea of an Append node which maintains the ordering
>> of its subtables merging their records in order.

>
> I finally got round to looking at this ...


A lot of things to chew on. Thanks very much.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's On-Demand Production Tuning

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply


Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On
Forum Jump


All times are GMT. The time now is 09:38 AM.


Powered by vBulletin® Version 3.6.5
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
SEO by vBSEO 3.2.0
www.UnixAdminTalk.com