vBulletin Search Engine Optimization
| |||||||
| Register | FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| Kevin Grittner <Kevin.Grittner@wicourts.gov> schrieb: > Attached is a simplified example of a performance problem we have seen, Odd. Can you tell us your PG-Version? Andreas -- Really, I'm not out to destroy Microsoft. That will just be a completely unintentional side effect. (Linus Torvalds) "If I was god, I would recompile penguin with --enable-fly." (unknow) Kaufbach, Saxony, Germany, Europe. N 51.05082°, E 13.56889° ---------------------------(end of broadcast)--------------------------- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.org so that your message can get through to the mailing list cleanly |
| |||
| >>> On Wed, Mar 15, 2006 at 12:17 pm, in message <20060315181735.GA22240@KanotixBox>, Andreas Kretschmer <akretschmer@spamfence.net> wrote: > Kevin Grittner <Kevin.Grittner@wicourts.gov> schrieb: > >> Attached is a simplified example of a performance problem we have seen, > > Odd. Can you tell us your PG- Version? I know we really should move to 8.1.3, but I haven't gotten to it yet. We're on a build from the 8.1 stable branch as of February 10th, with a patch to allow ANSI standard interpretation of string literals. (So this is 8.1.2 with some 8.1.3 changes plus the string literal patch.) If there are any changes in that time frame which might affect this issue, I could deploy a standard release and make sure that I see the same behavior. Let me know. -Kevin ---------------------------(end of broadcast)--------------------------- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq |
| |||
| "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: >> Odd. Can you tell us your PG- Version? > this is 8.1.2 with some 8.1.3 changes plus the string literal patch.) 8.1 is certainly capable of devising the plan you want, for example in the regression database: regression=# explain select * from tenk1 where thousand = 10 and tenthous between 42 and 144; QUERY PLAN ------------------------------------------------------------------------------------ Index Scan using tenk1_thous_tenthous on tenk1 (cost=0.00..6.01 rows=1 width=244) Index Cond: ((thousand = 10) AND (tenthous >= 42) AND (tenthous <= 144)) (2 rows) It looks to me like this is a matter of bad cost estimation, ie, it's thinking the other index is cheaper to use. Why that is is not clear. Can we see the pg_stats rows for ctofcNo and calDate? Also, try to force it to generate the plan you want, so we can see what it thinks the cost is for that. If you temporarily drop the wrong index you should be able to get there: begin; drop index "Cal_CalDate"; explain analyze select ... ; -- repeat as needed if it chooses some other wrong index rollback; I hope you have a play copy of the database to do this in --- although it would be safe to do the above in a live DB, the DROP would exclusive-lock the table until you finish the experiment and rollback, which probably is not good for response time ... regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 2: Don't 'kill -9' the postmaster |
| |||
| >>> On Wed, Mar 15, 2006 at 1:17 pm, in message <28798.1142450270@sss.pgh.pa.us>, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > 8.1 is certainly capable of devising the plan you want, for example > in the regression database: > > regression=# explain select * from tenk1 where thousand = 10 and tenthous > between 42 and 144; > QUERY PLAN > ------------------------------------------------------------------------------------ > Index Scan using tenk1_thous_tenthous on tenk1 (cost=0.00..6.01 rows=1 > width=244) > Index Cond: ((thousand = 10) AND (tenthous >= 42) AND (tenthous <= 144)) > (2 rows) That matches one of the examples where it optimized well. I only saw the bad plan when low and high ends of the BETWEEN range were equal. > It looks to me like this is a matter of bad cost estimation, ie, it's > thinking the other index is cheaper to use. Why that is is not clear. > Can we see the pg_stats rows for ctofcNo and calDate? schemaname | tablename | attname | null_frac | avg_width | n_distinct | most_common_vals | most_common_freqs | histogram_bounds | correlation ------------+-----------+---------+-----------+-----------+------------+-----------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------+------------- public | Cal | calDate | 0 | 4 | 2114 | {2003-06-02,2000-06-20,2001-04-16,2003-06-17,2003-12-01,2004-10-12,2001-04-23,2001-10-15,2002-03-06,2002-05-03} | {0.00333333,0.00233333,0.00233333,0.00233333,0.002 33333,0.00233333,0.002,0.002,0.002,0.002} | {1986-03-14,1999-06-11,2000-07-14,2001-05-18,2002-03-21,2002-12-04,2003-08-12,2004-05-13,2005-02-01,2005-09-28,2080-12-31} | 0.0545768 public | Cal | ctofcNo | 0 | 8 | 669 | {0793,1252,1571,0964,0894,1310,"DA ",0944,1668,0400} | {0.024,0.019,0.015,0.0123333,0.012,0.011,0.0106667 ,0.01,0.00966667,0.00866667} | {0000,0507,0733,0878,1203,1336,14AG,1633,1971,3705 ,YVJO} | -0.0179665 (2 rows) > Also, try to force it to generate the plan you want, so we can see what > it thinks the cost is for that. If you temporarily drop the wrong index > you should be able to get there: > > begin; > drop index "Cal_CalDate"; > explain analyze select ... ; > -- repeat as needed if it chooses some other wrong index > rollback; Sort (cost=4.03..4.03 rows=1 width=12) (actual time=48.484..48.486 rows=4 loops=1) Sort Key: "calDate", "startTime" -> Index Scan using "Cal_CtofcNo" on "Cal" "CA" (cost=0.00..4.02 rows=1 width=12) (actual time=36.750..48.228 rows=4 loops=1) Index Cond: ((("ctofcNo")::bpchar = '2192'::bpchar) AND (("calDate")::date >= '2006-03-15'::date) AND (("calDate")::date <= '2006-03-15'::date)) Total runtime: 56.616 ms ---------------------------(end of broadcast)--------------------------- TIP 5: don't forget to increase your free space map settings |
| |||
| On Wed, 2006-03-15 at 14:17 -0500, Tom Lane wrote: > It looks to me like this is a matter of bad cost estimation, ie, it's > thinking the other index is cheaper to use. Why that is is not clear. > Can we see the pg_stats rows for ctofcNo and calDate? ISTM that when the BETWEEN constants match we end up in this part of clauselist_selectivity()... ---------------------------(end of broadcast)--------------------------- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match |
| |||
| On Thu, 2006-03-16 at 00:07 +0000, Simon Riggs wrote: > On Wed, 2006-03-15 at 14:17 -0500, Tom Lane wrote: > > > It looks to me like this is a matter of bad cost estimation, ie, it's > > thinking the other index is cheaper to use. Why that is is not clear. > > Can we see the pg_stats rows for ctofcNo and calDate? > > ISTM that when the BETWEEN constants match we end up in this part of > clauselist_selectivity()... (and now for the whole email...) /* * It's just roundoff error; use a small positive * value */ s2 = 1.0e-10; so that the planner underestimates the cost of using "Cal_CalDate" so that it ends up the same as "Cal_CtofcNo", and then we pick "Cal_CalDate" because it was created first. Using 1.0e-10 isn't very useful... the selectivity for a range should never be less than the selectivity for an equality, so we should simply put in a test against one of the pseudo constants and use that as the minimal value. That should lead to raising the apparent cost of Cal_CalDate so that Cal_CtofcNo can take precedence. Best Regards, Simon Riggs ---------------------------(end of broadcast)--------------------------- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.org so that your message can get through to the mailing list cleanly |
| |||
| Simon Riggs <simon@2ndquadrant.com> writes: >> ISTM that when the BETWEEN constants match we end up in this part of >> clauselist_selectivity()... Yeah, I think you are right. > so that the planner underestimates the cost of using "Cal_CalDate" so > that it ends up the same as "Cal_CtofcNo", and then we pick > "Cal_CalDate" because it was created first. No, it doesn't end up the same --- but the difference is small enough to be in the roundoff-error regime. The real issue here is that we're effectively assuming that one row will be fetched from the index in both cases, and this is clearly not the case for the Cal_CalDate index. So we need a more accurate estimate for the boundary case. > Using 1.0e-10 isn't very useful... the selectivity for a range should > never be less than the selectivity for an equality, so we should simply > put in a test against one of the pseudo constants and use that as the > minimal value. That's easier said than done, because you'd first have to find the appropriate equality operator to use (ie, one having semantics that agree with the inequality operators). Another point is that the above statement is simply wrong, consider calDate BETWEEN '2006-03-15' AND '2006-03-14' for which an estimate of zero really is correct. Possibly we could drop this code's reliance on seeing SCALARLTSEL/SCALARGTSEL as the estimators, and instead try to locate a common btree opclass for the operators --- which would then let us identify the right equality operator to use, and also let us distinguish > from >= etc. If we're trying to get the boundary cases right I suspect we have to account for that. I could see such an approach being tremendously slow though :-(, because we'd go looking for btree opclasses even for operators that have nothing to do with < or >. regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq |
| |||
| On Wed, 2006-03-15 at 21:05 -0500, Tom Lane wrote: > So we need a more accurate estimate for the boundary case. Agreed. > > Using 1.0e-10 isn't very useful... the selectivity for a range should > > never be less than the selectivity for an equality, so we should simply > > put in a test against one of the pseudo constants and use that as the > > minimal value. > > That's easier said than done, because you'd first have to find the > appropriate equality operator to use (ie, one having semantics that > agree with the inequality operators). .... Kevin: this is also the reason we can't simply transform the WHERE clause into a more appropriate form... > Possibly we could drop this code's reliance on seeing > SCALARLTSEL/SCALARGTSEL as the estimators, and instead try to locate a > common btree opclass for the operators --- which would then let us > identify the right equality operator to use, and also let us distinguish > > from >= etc. If we're trying to get the boundary cases right I > suspect we have to account for that. I could see such an approach being > tremendously slow though :-(, because we'd go looking for btree > opclasses even for operators that have nothing to do with < or >. Trying to get the information in the wrong place would be very expensive, I agree. But preparing that information when we have access to it and passing it through the plan would be much cheaper. Relating op->opclass will be very useful in other places in planning, even if any one case seems not to justify the work to record it. (This case feels like deja vu, all over again.) The operator and the opclass are only connected via an index access method, but for a particular index each column has only one opclass. So the opclass will have a 1-1 correspondence with the operator for *that* plan only, realising that other plans might have different correspondences. find_usable_indexes() or thereabouts could annotate a restriction OpExpr with the opclass it will use. Once we have the link, clauselist_selectivity() can trivially compare opclasses for both OpExprs, then retrieve other information for that opclass for various purposes. Seems lots of work for such a corner case, but would be worth it if this solves other problems as well. Best Regards, Simon Riggs ---------------------------(end of broadcast)--------------------------- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match |
| |||
| Simon Riggs <simon@2ndquadrant.com> writes: > Trying to get the information in the wrong place would be very > expensive, I agree. But preparing that information when we have access > to it and passing it through the plan would be much cheaper. Where would that be? > The operator and the opclass are only connected via an index access > method, but for a particular index each column has only one opclass. If you're proposing making clauselist_selectivity depend on what indexes exist, I think that's very much the wrong approach. In the first place, it still has to give usable answers for unindexed columns, and in the second place there might be multiple indexes with different opclasses for the same column, so the ambiguity problem still exists. I have been wondering if we shouldn't add some more indexes on pg_amop or something to make it easier to do this sort of lookup --- we definitely seem to be finding multiple reasons to want to look up which opclasses contain a given operator. regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 4: Have you searched our list archives? http://archives.postgresql.org |
| ||||
| On Thu, 2006-03-16 at 10:57 -0500, Tom Lane wrote: > Simon Riggs <simon@2ndquadrant.com> writes: > > Trying to get the information in the wrong place would be very > > expensive, I agree. But preparing that information when we have access > > to it and passing it through the plan would be much cheaper. > > Where would that be? > > > The operator and the opclass are only connected via an index access > > method, but for a particular index each column has only one opclass. > > If you're proposing making clauselist_selectivity depend on what indexes > exist, I think that's very much the wrong approach. Using available information sounds OK to me. Guess you're thinking of the lack of plan invalidation? > In the first place, > it still has to give usable answers for unindexed columns, and in the > second place there might be multiple indexes with different opclasses > for the same column, so the ambiguity problem still exists. I was thinking that we would fill out the OpExpr with different opclasses for each plan, so each one sees a different story. (I was thinking there was a clauselist for each plan; if not, there could be.) So the multiple index problem shouldn't exist. Non-indexed cases still cause the problem, true. > I have been wondering if we shouldn't add some more indexes on pg_amop > or something to make it easier to do this sort of lookup --- we > definitely seem to be finding multiple reasons to want to look up > which opclasses contain a given operator. Agreed, but still looking for better way than that. [BTW how do you add new indexes to system tables? I want to add one to pg_inherits but not sure where to look.] Best Regards, Simon Riggs ---------------------------(end of broadcast)--------------------------- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq |