vBulletin Search Engine Optimization
| |||||||
| Register | FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| I've been hacking on the idea of an Append node which maintains the ordering of its subtables merging their records in order. This is important for partitioned tables since otherwise a lot of plans are not available such as merge joins. The logic I've followed is to do as follows: 1) Go through all subrels asking for any interesting pathkey lists. Gather up the union of all of these. 2) Go back through all the subrels and for each accumulated pathkey list ask for the best path for that subrel for that pathkey list. 3) Generate an two append paths for each of these pathkey lists, one using the best startup_cost subpath and one with the best total_cost subpath available for the pathkey list for each child rel. If there is none available take the best unordered path and put a sort node above it. 4) Additionally advertise the traditional unordered append node which our parent could choose to put a sort node above same as ever. 5) Append plan nodes look a lot like Sort plan nodes glued onto an Append plan node, with sort function oids and so on. 6) Append exec state nodes look a lot like a (a few fields from) tuplesortstate glued onto an Append node. They have the ScanKey array and the fmgr sort functions. They also have an array of TupleTableSlot and a heap of indexes into that array. 8) ExecAppend does something similar to tuplesort's bounded sort (I fear I'm going to get typecasted) or more to the point, similar to the final merge of a merge sort. It directly returns the slot the child rel last returned to it. Open questions: 1) I still haven't completely figured out what to do with equivalence classes. My hack of just stuffing all the append subrel vars into there seems to work fine. I need to understand what's going on to see if there's really a problem with it or not. 2) I'm not sure this code will work when the append rel is a target (ie UPDATE and DELETE stmts). 3) It does seem to work when the columns in the subrels don't line up but I didn't do anything special to handle this case. 4) I haven't handled mark/restore or random access. I think they could be handled and they'll probably be worth the complexity but I'm not sure. 5) Is it considering too many paths? Are there some which maybe aren't worth considering? For example, maybe it only makes sense to take best start_cost paths since if that plan doesn't dominate then the best total_cost plan is likely to be the sequential scans + unordered append + sort. 6) I haven't looked at setops yet but this is particularly attractive for the UNION (as opposed to UNION ALL) case. 7) I copied/adapted a bunch of bits from tuplesort to maintain the heap and do the scankey comparisons. I could refactor that code back into tuplesort but it would mean twisting around tuplesort quite a bit. Essentially it would mean introducing a new type of tuplesort which would start off in FINAL_MERGE state only it would have to behave differently since we don't want it prefetching lots of records like FINAL_MERGE does, I don't think. Some example plans (though note that the first was with a lot of enable_* parameters set to off). QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------- Merge Join (cost=76.03..107.63 rows=12 width=12) (actual time=0.435..0.706 rows=11 loops=1) Merge Cond: (public.z.i = x.i) -> Append (cost=22.36..53.66 rows=12 width=8) (actual time=0.365..0.440 rows=12 loops=1) -> Index Scan using zi on z (cost=0.00..8.27 rows=1 width=8) (actual time=0.089..0.091 rows=1 loops=1) -> Index Scan using z1i on z1 z (cost=0.00..12.28 rows=2 width=8) (actual time=0.063..0.068 rows=2 loops=1) -> Index Scan using z2i on z2 z (cost=0.00..12.30 rows=3 width=8) (actual time=0.060..0.066 rows=3 loops=1) -> Index Scan using z3i on z3 z (cost=0.00..12.33 rows=5 width=8) (actual time=0.059..0.070 rows=5 loops=1) -> Index Scan using zzi on zz z (cost=0.00..8.27 rows=1 width=8) (actual time=0.076..0.079 rows=1 loops=1) -> Materialize (cost=53.67..53.79 rows=12 width=8) (actual time=0.051..0.170 rows=12 loops=1) -> Append (cost=22.36..53.66 rows=12 width=8) (actual time=0.036..0.104 rows=12 loops=1) -> Index Scan using zi on z x (cost=0.00..8.27 rows=1 width=8) (actual time=0.006..0.006 rows=1 loops=1) -> Index Scan using z1i on z1 x (cost=0.00..12.28 rows=2 width=8) (actual time=0.004..0.009 rows=2 loops=1) -> Index Scan using z2i on z2 x (cost=0.00..12.30 rows=3 width=8) (actual time=0.005..0.014 rows=3 loops=1) -> Index Scan using z3i on z3 x (cost=0.00..12.33 rows=5 width=8) (actual time=0.004..0.016 rows=5 loops=1) -> Index Scan using zzi on zz x (cost=0.00..8.27 rows=1 width=8) (actual time=0.005..0.008 rows=1 loops=1) Total runtime: 0.951 ms postgres=# explain analyze select * from t order by i limit 1; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------ Limit (cost=805.41..805.47 rows=1 width=4) (actual time=44.206..44.208 rows=1 loops=1) -> Result (cost=805.41..18309.89 rows=290003 width=4) (actual time=44.201..44.201 rows=1 loops=1) -> Append (cost=805.41..18309.89 rows=290003 width=4) (actual time=44.196..44.196 rows=1 loops=1) -> Sort (cost=1.02..1.02 rows=1 width=4) (actual time=0.034..0.034 rows=1 loops=1) Sort Key: public.t.i Sort Method: quicksort Memory: 17kB -> Seq Scan on t (cost=0.00..1.01 rows=1 width=4) (actual time=0.010..0.013 rows=1 loops=1) -> Index Scan using ti1 on t1 t (cost=0.00..289.25 rows=10000 width=4) (actual time=0.033..0.033 rows=1 loops=1) -> Index Scan using ti2 on t2 t (cost=0.00..289.25 rows=10000 width=4) (actual time=0.028..0.028 rows=1 loops=1) -> Index Scan using ti3 on t3 t (cost=0.00..289.25 rows=10000 width=4) (actual time=0.027..0.027 rows=1 loops=1) -> Index Scan using ti4 on t4 t (cost=0.00..289.25 rows=10000 width=4) (actual time=0.028..0.028 rows=1 loops=1) -> Index Scan using ti5 on t5 t (cost=0.00..289.25 rows=10000 width=4) (actual time=0.028..0.028 rows=1 loops=1) -> Index Scan using ti6 on t6 t (cost=0.00..289.25 rows=10000 width=4) (actual time=0.031..0.031 rows=1 loops=1) -> Index Scan using ti7 on t7 t (cost=0.00..289.25 rows=10000 width=4) (actual time=0.027..0.027 rows=1 loops=1) -> Index Scan using ti8 on t8 t (cost=0.00..289.25 rows=10000 width=4) (actual time=0.027..0.027 rows=1 loops=1) -> Index Scan using ti9 on t9 t (cost=0.00..289.25 rows=10000 width=4) (actual time=0.028..0.028 rows=1 loops=1) -> Index Scan using ti0 on t0 t (cost=0.00..289.27 rows=10001 width=4) (actual time=0.027..0.027 rows=1 loops=1) -> Index Scan using ti11 on t11 t (cost=0.00..289.25 rows=10000 width=4) (actual time=0.027..0.027 rows=1 loops=1) -> Index Scan using ti12 on t12 t (cost=0.00..289.25 rows=10000 width=4) (actual time=0.027..0.027 rows=1 loops=1) -> Index Scan using ti13 on t13 t (cost=0.00..289.25 rows=10000 width=4) (actual time=0.028..0.028 rows=1 loops=1) -> Index Scan using ti14 on t14 t (cost=0.00..289.25 rows=10000 width=4) (actual time=0.027..0.027 rows=1 loops=1) -> Index Scan using ti15 on t15 t (cost=0.00..289.25 rows=10000 width=4) (actual time=0.027..0.027 rows=1 loops=1) -> Index Scan using ti16 on t16 t (cost=0.00..289.25 rows=10000 width=4) (actual time=0.027..0.027 rows=1 loops=1) -> Index Scan using ti17 on t17 t (cost=0.00..289.25 rows=10000 width=4) (actual time=0.028..0.028 rows=1 loops=1) -> Index Scan using ti18 on t18 t (cost=0.00..289.25 rows=10000 width=4) (actual time=0.027..0.027 rows=1 loops=1) -> Index Scan using ti19 on t19 t (cost=0.00..289.25 rows=10000 width=4) (actual time=0.027..0.027 rows=1 loops=1) -> Index Scan using ti20 on t20 t (cost=0.00..289.27 rows=10001 width=4) (actual time=0.027..0.027 rows=1 loops=1) -> Index Scan using ti21 on t21 t (cost=0.00..739.76 rows=20000 width=4) (actual time=0.028..0.028 rows=1 loops=1) -> Index Scan using ti22 on t22 t (cost=0.00..289.25 rows=10000 width=4) (actual time=0.027..0.027 rows=1 loops=1) -> Index Scan using ti23 on t23 t (cost=0.00..289.25 rows=10000 width=4) (actual time=0.036..0.036 rows=1 loops=1) -> Index Scan using ti24 on t24 t (cost=0.00..289.25 rows=10000 width=4) (actual time=0.027..0.027 rows=1 loops=1) -> Index Scan using ti26 on t26 t (cost=0.00..289.25 rows=10000 width=4) (actual time=0.027..0.027 rows=1 loops=1) -> Index Scan using ti27 on t27 t (cost=0.00..289.25 rows=10000 width=4) (actual time=0.027..0.027 rows=1 loops=1) -> Index Scan using ti28 on t28 t (cost=0.00..289.25 rows=10000 width=4) (actual time=0.027..0.027 rows=1 loops=1) -> Sort (cost=804.39..829.39 rows=10000 width=4) (actual time=43.336..43.336 rows=1 loops=1) Sort Key: public.t.i Sort Method: quicksort Memory: 647kB -> Seq Scan on t29 t (cost=0.00..140.00 rows=10000 width=4) (actual time=0.017..21.192 rows=10000 loops=1) Total runtime: 44.737 ms -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's PostGIS support! ---------------------------(end of broadcast)--------------------------- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.org so that your message can get through to the mailing list cleanly |
| |||
| Hello Gregory, Gregory Stark wrote: > I've been hacking on the idea of an Append node which maintains the ordering > of its subtables merging their records in order. This is important for > partitioned tables since otherwise a lot of plans are not available such as > merge joins. Cool! Some time ago, I've been trying something very similar, but didn't get very far. I'd like to share my thoughts anyway. First of, I envisioned that append node to be applicable also to unordered reading from multiple partitions, simply interleaving between the partitions. > The logic I've followed is to do as follows: > > 1) Go through all subrels asking for any interesting pathkey lists. Gather up > the union of all of these. I also tried to modify the Append node first, then figured that it might be better to base on the merge join node instead. While this seems farther away, I had the hope that a binary tree of such 'plain merge' nodes would require less comparisons in total. Plus, I found it a lot simpler to cope with exactly two input relations instead of n, as with the append node. :-) > 2) Go back through all the subrels and for each accumulated pathkey list ask > for the best path for that subrel for that pathkey list. > > 3) Generate an two append paths for each of these pathkey lists, one using the > best startup_cost subpath and one with the best total_cost subpath > available for the pathkey list for each child rel. If there is none > available take the best unordered path and put a sort node above it. > > 4) Additionally advertise the traditional unordered append node which our > parent could choose to put a sort node above same as ever. > > 5) Append plan nodes look a lot like Sort plan nodes glued onto an Append plan > node, with sort function oids and so on. > > 6) Append exec state nodes look a lot like a (a few fields from) > tuplesortstate glued onto an Append node. They have the ScanKey array and > the fmgr sort functions. They also have an array of TupleTableSlot and a > heap of indexes into that array. > > 8) ExecAppend does something similar to tuplesort's bounded sort (I fear I'm > going to get typecasted) or more to the point, similar to the final merge > of a merge sort. It directly returns the slot the child rel last returned > to it. Uh.. well, yeah. I guess you have a better understanding of the planner and executor that I do. > Open questions: > > 1) I still haven't completely figured out what to do with equivalence classes. > My hack of just stuffing all the append subrel vars into there seems to > work fine. I need to understand what's going on to see if there's really a > problem with it or not. > > 2) I'm not sure this code will work when the append rel is a target (ie UPDATE > and DELETE stmts). > > 3) It does seem to work when the columns in the subrels don't line up but I > didn't do anything special to handle this case. Uh.. is there any use case for that? WRT partitioning certainly not, is there? I'll have a look at your patch. Regards Markus ---------------------------(end of broadcast)--------------------------- TIP 2: Don't 'kill -9' the postmaster |
| |||
| "Markus Schiltknecht" <markus@bluegap.ch> writes: >> 1) Go through all subrels asking for any interesting pathkey lists. Gather up >> the union of all of these. > > I also tried to modify the Append node first, then figured that it might be > better to base on the merge join node instead. While this seems farther away, I > had the hope that a binary tree of such 'plain merge' nodes would require less > comparisons in total. It is kind of like a merge join but not quite. It's interleaving rows rather than matching them up. It's more like the final merge of a sort which also uses a heap to efficiently find the next value from the source tapes. >> 3) It does seem to work when the columns in the subrels don't line up but I >> didn't do anything special to handle this case. > > Uh.. is there any use case for that? WRT partitioning certainly not, is there? Not necessarily but it is something Postgres supports and I don't think we want to break it. Actually it's useful for partitioned tables if you build the new partition in a separate table and then add it to the partitioned table. In that case you may have gone through several steps of adding columns and dropping them to get the structure to line up. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's PostGIS support! ---------------------------(end of broadcast)--------------------------- TIP 4: Have you searched our list archives? http://archives.postgresql.org |
| |||
| Hello Gregory, Gregory Stark wrote: > It is kind of like a merge join but not quite. It's interleaving rows rather > than matching them up. It's more like the final merge of a sort which also > uses a heap to efficiently find the next value from the source tapes. Well, maybe my point here is: why do you need the heap to sort? The inputs you've got are already sorted, you only need to merge them. To me this sounds very much like the final stage of a merge sort, which would not require much memory. IMO, a merge sort could easier be implemented by a binary tree zipper node, as I had in mind. Leading to a plan like that (well, hey, this is all made up): Zipper (cost..., sort by public.t.i) -> Zipper (cost .., sort by public.t.i) -> Zipper (cost .., sort by public.t.i) -> Index Scan using ti1 on t1 -> Index Scan using t12 on t2 -> Index Scan using ti2 on t3 -> Zipper (cost .., sort by public.t.i) -> Index Scan using ti4 on t4 -> Index Scan using ti5 on t5 Every zipper node would simply have to keep the two top tuples from its inputs in memory, compare them and return the best. But maybe that's exactly how Knuth's polyphase merge sort internally also merge their inputs (or runs). And perhaps it makes sense to show the user just one simple append node instead of throwing a tree of Zipper nodes at her. ;-) > Not necessarily but it is something Postgres supports and I don't think we > want to break it. Actually it's useful for partitioned tables if you build the > new partition in a separate table and then add it to the partitioned table. In > that case you may have gone through several steps of adding columns and > dropping them to get the structure to line up. Agreed, especially because lining up the columns isn't that hard after all. OTOH I think Postgres is way too flexible in how it allows partitioning to be done and thus it often can't optimize it properly. I'd very much like to teach it a stricter and simpler to use partitioning scheme than what we have with constraint exclusion. But that's pipe dreaming, and your improvement to the append node is certainly a good step towards the right direction, keep up the good work! Regards Markus ---------------------------(end of broadcast)--------------------------- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq |
| |||
| Hi, Florian Weimer wrote: > I think you need it because there are potentially many input types. Given the partitioning case, I'd expect all rows to have an equal tuple descriptor. Maybe this is a matter of what to optimize, then? Could you elaborate on what use case you have in mind? Regards Markus ---------------------------(end of broadcast)--------------------------- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate |
| |||
| * Markus Schiltknecht: > Florian Weimer wrote: >> I think you need it because there are potentially many input types. Eh, "tapes". > Given the partitioning case, I'd expect all rows to have an equal > tuple descriptor. Maybe this is a matter of what to optimize, then? > > Could you elaborate on what use case you have in mind? You need a priority queue to figure out from which tape (partition) you need to remove the next tuple. -- Florian Weimer <fweimer@bfk.de> BFK edv-consulting GmbH http://www.bfk.de/ Kriegsstraße 100 tel: +49-721-96201-1 D-76133 Karlsruhe fax: +49-721-96201-99 ---------------------------(end of broadcast)--------------------------- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.org so that your message can get through to the mailing list cleanly |
| |||
| Hi, Florian Weimer wrote: >> Florian Weimer wrote: >>> I think you need it because there are potentially many input types. > > Eh, "tapes". Aha.. >> Given the partitioning case, I'd expect all rows to have an equal >> tuple descriptor. Maybe this is a matter of what to optimize, then? >> >> Could you elaborate on what use case you have in mind? > > You need a priority queue to figure out from which tape (partition) > you need to remove the next tuple. And why do you need lots of heap memory to do that? Anything wrong with the zipper approach I've outlined upthread? Am I missing something? Regards Markus ---------------------------(end of broadcast)--------------------------- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match |
| |||
| Markus Schiltknecht wrote: > Florian Weimer wrote: >>> Given the partitioning case, I'd expect all rows to have an equal >>> tuple descriptor. Maybe this is a matter of what to optimize, then? >>> >>> Could you elaborate on what use case you have in mind? >> >> You need a priority queue to figure out from which tape (partition) >> you need to remove the next tuple. > > And why do you need lots of heap memory to do that? Anything wrong with > the zipper approach I've outlined upthread? We're talking about a binary heap, with just one node per partition. AFAICT it's roughly the same data structure as the zipper tree you envisioned, but not implemented with separate executor nodes for each level. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---------------------------(end of broadcast)--------------------------- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match |
| |||
| * Markus Schiltknecht: >> You need a priority queue to figure out from which tape (partition) >> you need to remove the next tuple. > > And why do you need lots of heap memory to do that? Anything wrong > with the zipper approach I've outlined upthread? "heap" == "priority queue" here, I guess. Looking at your zipper again, it's actually an implementation of a heap. -- Florian Weimer <fweimer@bfk.de> BFK edv-consulting GmbH http://www.bfk.de/ Kriegsstraße 100 tel: +49-721-96201-1 D-76133 Karlsruhe fax: +49-721-96201-99 ---------------------------(end of broadcast)--------------------------- TIP 5: don't forget to increase your free space map settings |
| ||||
| Hi, Heikki Linnakangas wrote: > AFAICT it's roughly the same data structure as the zipper tree you > envisioned, but not implemented with separate executor nodes for each > level. Aha, that sounds good to me. ;-) As I've already mentioned, I think it's even better to simply show the user a single append node, instead of lots of small nodes from a binary tree merger. Regards Markus ---------------------------(end of broadcast)--------------------------- TIP 4: Have you searched our list archives? http://archives.postgresql.org |