vBulletin Search Engine Optimization
| |||||||
| Register | FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| Folks, Params: PostgreSQL 8.0.1 on Solaris 10 Statistics = 500 (tablenames have been changed to protect NDA) e1=# select tablename, null_frac, correlation, n_distinct from pg_stats where tablename = 'clickstream1' andattname = 'session_id'; tablename | null_frac | correlation | n_distinct ----------------------+-----------+-------------+------------ clickstream1 | 0 | 0.412034 | 378174 (2 rows) e1=# select count(distinct session_id) from clickstream1; count --------- 3174813 As you can see, n_distinct estimation is off by a factor of 10x and it's causing query planning problems. Any suggested hacks to improve the histogram on this? (BTW, increasing the stats to 1000 only doubles n_distinct, and doesn't solve the problem) -- --Josh Josh Berkus Aglio Database Solutions San Francisco ---------------------------(end of broadcast)--------------------------- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq |
| |||
| Josh Berkus <josh@agliodbs.com> writes: > As you can see, n_distinct estimation is off by a factor of 10x and it's > causing query planning problems. Any suggested hacks to improve the > histogram on this? What's the histogram itself look like? (I'd like to see the whole pg_stats row not just part of it ...) There's probably no point in showing the target=1000 version, but maybe target=100 would be informative. regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 8: explain analyze is your friend |
| |||
| Tom, > What's the histogram itself look like? (I'd like to see the whole > pg_stats row not just part of it ...) There's probably no point in > showing the target=1000 version, but maybe target=100 would be > informative. Here is the stats = 100 version. Notice that n_distinct has gone down. schemaname | tablename | attname | null_frac | avg_width | n_distinct | most_common_vals | most_common_freqs | histogram_bounds | correlation ------------+----------------------+------------+-----------+-----------+------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------- public | web_site_activity_fa | session_id | 0 | 8 | 96107 | {4393922,6049228,6026260,4394034,60341,4393810,256 2999,2573850,3006299,4705488,2561499,4705258,30073 78,4705490,60327,60352,2560950,2567640,2569852,300 6604,4394329,2570739,2406633,2407292,3006356,43936 03,4394121,6449083,2565815,4387881,2406770,2407081 ,2564340,3007328,2406578,2407295,2562813,2567603,4 387835,71014,2566253,2566900,6103079,2289424,24075 97,2567627,2568333,3457448,23450,23670,60743,70739 ,2406818,2406852,2407511,2562816,3007446,6306095,6 0506,71902,591543,1169136,1447077,2285047,2406830, 2573964,6222758,61393,70955,70986,71207,71530,2623 68,2289213,2406899,2567361,2775952,3006824,4387864 ,6239825,6244853,6422152,1739,58600,179293,278473, 488407,1896390,2286976,2407020,2546720,2677019,298 4333,3006133,3007497,3310286,3631413,3801909,43661 16,4388025} | {0.00166667,0.00146667,0.0013,0.0011,0.000933333,0 .0009,0.0008,0.0008,0.000733333,0.000733333,0.0007 ,0.000633333,0.0006,0.0006,0.000566667,0.000566667 ,0.000566667,0.000566667,0.000566667,0.000566667,0 .000566667,0.000533333,0.0005,0.0005,0.0005,0.0005 ,0.0005,0.0005,0.000466667,0.000466667,0.000433333 ,0.000433333,0.000433333,0.000433333,0.0004,0.0004 ,0.0004,0.0004,0.0004,0.000366667,0.000366667,0.00 0366667,0.000366667,0.000333333,0.000333333,0.0003 33333,0.000333333,0.000333333,0.0003,0.0003,0.0003 ,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003, 0.000266667,0.000266667,0.000266667,0.000266667,0. 000266667,0.000266667,0.000266667,0.000266667,0.00 0266667,0.000233333,0.000233333,0.000233333,0.0002 33333,0.000233333,0.000233333,0.000233333,0.000233 333,0.000233333,0.000233333,0.000233333,0.00023333 3,0.000233333,0.000233333,0.000233333,0.0002,0.000 2,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002 ,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002, 0.0002,0.0002} | {230,58907,88648,156764,216759,240405,264601,28904 7,312630,339947,364452,386486,409427,434075,455140 ,475759,500086,521530,544703,680376,981066,1313419 ,1712592,1860151,1882452,1905328,1927504,1948159,1 970054,1990408,2014501,2038573,2062786,2087163,211 0129,2132196,2155657,2181058,2204976,2228575,22562 29,2283897,2352453,2407153,2457716,2542081,2572119 ,2624133,2699592,2771254,2832224,2908151,2951500,3 005088,3032889,3137244,3158685,3179395,3203681,326 1587,3304359,3325577,3566688,3621357,3645094,37186 67,3740821,3762386,3783169,3804593,3826503,3904589 ,3931012,3957675,4141934,4265118,4288568,4316898,4 365625,4473965,4535752,4559700,4691802,4749478,597 7208,6000272,6021416,6045939,6078912,6111900,61451 55,6176422,6206627,6238291,6271270,6303067,6334117 ,6365200,6395250,6424719,6888329} | 0.41744 -- --Josh Josh Berkus Aglio Database Solutions San Francisco ---------------------------(end of broadcast)--------------------------- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq |
| |||
| Tom, Any thoughts? This is really messing up query execution all across the database ... --Josh > Here is the stats = 100 version. Notice that n_distinct has gone down. > > schemaname | tablename | attname | null_frac | avg_width | > n_distinct | most_common_vals > > | most_common_freqs > | histogram_bounds | > > correlation >-------------------+------------- public | web_site_activity_fa | > session_id | 0 | 8 | 96107 | > {4393922,6049228,6026260,4394034,60341,4393810,256 2999,2573850,3006299,4705 >488,2561499,4705258,3007378,4705490,60327,60352,2 560950,2567640,2569852,3006 >604,4394329,2570739,2406633,2407292,3006356,43936 03,4394121,6449083,2565815, >4387881,2406770,2407081,2564340,3007328,2406578,2 407295,2562813,2567603,4387 >835,71014,2566253,2566900,6103079,2289424,2407597 ,2567627,2568333,3457448,23 >450,23670,60743,70739,2406818,2406852,2407511,256 2816,3007446,6306095,60506, >71902,591543,1169136,1447077,2285047,2406830,2573 964,6222758,61393,70955,709 >86,71207,71530,262368,2289213,2406899,2567361,277 5952,3006824,4387864,623982 >5,6244853,6422152,1739,58600,179293,278473,488407 ,1896390,2286976,2407020,25 >46720,2677019,2984333,3006133,3007497,3310286,363 1413,3801909,4366116,438802 >5} > > {0.00166667,0.00146667,0.0013,0.0011,0.000933333,0 .0009,0.0008,0.0008,0.000 >733333,0.000733333,0.0007,0.000633333,0.0006,0.00 06,0.000566667,0.000566667, >0.000566667,0.000566667,0.000566667,0.000566667,0 .000566667,0.000533333,0.00 >05,0.0005,0.0005,0.0005,0.0005,0.0005,0.000466667 ,0.000466667,0.000433333,0. >000433333,0.000433333,0.000433333,0.0004,0.0004,0 .0004,0.0004,0.0004,0.00036 >6667,0.000366667,0.000366667,0.000366667,0.000333 333,0.000333333,0.000333333 >,0.000333333,0.000333333,0.0003,0.0003,0.0003,0.0 003,0.0003,0.0003,0.0003,0. >0003,0.0003,0.0003,0.000266667,0.000266667,0.0002 66667,0.000266667,0.0002666 >67,0.000266667,0.000266667,0.000266667,0.00026666 7,0.000233333,0.000233333,0 >.000233333,0.000233333,0.000233333,0.000233333,0. 000233333,0.000233333,0.000 >233333,0.000233333,0.000233333,0.000233333,0.0002 33333,0.000233333,0.0002333 >33,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.00 02,0.0002,0.0002,0.0002,0.0 >002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0 002} > > {230,58907,88648,156764,216759,240405,264601,28904 7,312630,339947,364452,38 >6486,409427,434075,455140,475759,500086,521530,54 4703,680376,981066,1313419, >1712592,1860151,1882452,1905328,1927504,1948159,1 970054,1990408,2014501,2038 >573,2062786,2087163,2110129,2132196,2155657,21810 58,2204976,2228575,2256229, >2283897,2352453,2407153,2457716,2542081,2572119,2 624133,2699592,2771254,2832 >224,2908151,2951500,3005088,3032889,3137244,31586 85,3179395,3203681,3261587, >3304359,3325577,3566688,3621357,3645094,3718667,3 740821,3762386,3783169,3804 >593,3826503,3904589,3931012,3957675,4141934,42651 18,4288568,4316898,4365625, >4473965,4535752,4559700,4691802,4749478,5977208,6 000272,6021416,6045939,6078 >912,6111900,6145155,6176422,6206627,6238291,62712 70,6303067,6334117,6365200, >6395250,6424719,6888329} > > | 0.41744 -- --Josh Josh Berkus Aglio Database Solutions San Francisco ---------------------------(end of broadcast)--------------------------- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.org so that your message can get through to the mailing list cleanly |
| |||
| Hi. Sometimes, if the random number generator, that PostgreSQL uses, isn't good enough, the randomly selected pages for the statistics might not be random enough. Solaris is unknown to me. Maybe the used random number generator there isn't good enough? Good statistics depend on good random numbers. So, for example, if you have one million pages, but the upper bound for the random numbers is one hundred thousand pages, the statistics might get tuned. Or some random number generator has for example only 32000 different values. Regards, Marko Ristola Josh Berkus wrote: >Tom, > >Any thoughts? This is really messing up query execution all across the >database ... > >--Josh > > > >>Here is the stats = 100 version. Notice that n_distinct has gone down. >> >> schemaname | tablename | attname | null_frac | avg_width | >>n_distinct | most_common_vals >> >>| most_common_freqs >>| histogram_bounds | >> >>correlation >> >> > > > >>-------------------+------------- public | web_site_activity_fa | >>session_id | 0 | 8 | 96107 | >>{4393922,6049228,6026260,4394034,60341,4393810,2 562999,2573850,3006299,4705 >>488,2561499,4705258,3007378,4705490,60327,60352, 2560950,2567640,2569852,3006 >>604,4394329,2570739,2406633,2407292,3006356,4393 603,4394121,6449083,2565815, >>4387881,2406770,2407081,2564340,3007328,2406578, 2407295,2562813,2567603,4387 >>835,71014,2566253,2566900,6103079,2289424,240759 7,2567627,2568333,3457448,23 >>450,23670,60743,70739,2406818,2406852,2407511,25 62816,3007446,6306095,60506, >>71902,591543,1169136,1447077,2285047,2406830,257 3964,6222758,61393,70955,709 >>86,71207,71530,262368,2289213,2406899,2567361,27 75952,3006824,4387864,623982 >>5,6244853,6422152,1739,58600,179293,278473,48840 7,1896390,2286976,2407020,25 >>46720,2677019,2984333,3006133,3007497,3310286,36 31413,3801909,4366116,438802 >>5} >> >>{0.00166667,0.00146667,0.0013,0.0011,0.000933333 ,0.0009,0.0008,0.0008,0.000 >>733333,0.000733333,0.0007,0.000633333,0.0006,0.0 006,0.000566667,0.000566667, >>0.000566667,0.000566667,0.000566667,0.000566667, 0.000566667,0.000533333,0.00 >>05,0.0005,0.0005,0.0005,0.0005,0.0005,0.00046666 7,0.000466667,0.000433333,0. >>000433333,0.000433333,0.000433333,0.0004,0.0004, 0.0004,0.0004,0.0004,0.00036 >>6667,0.000366667,0.000366667,0.000366667,0.00033 3333,0.000333333,0.000333333 >>,0.000333333,0.000333333,0.0003,0.0003,0.0003,0. 0003,0.0003,0.0003,0.0003,0. >>0003,0.0003,0.0003,0.000266667,0.000266667,0.000 266667,0.000266667,0.0002666 >>67,0.000266667,0.000266667,0.000266667,0.0002666 67,0.000233333,0.000233333,0 >>.000233333,0.000233333,0.000233333,0.000233333,0 .000233333,0.000233333,0.000 >>233333,0.000233333,0.000233333,0.000233333,0.000 233333,0.000233333,0.0002333 >>33,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0 002,0.0002,0.0002,0.0002,0.0 >>002,0.0002,0.0002,0.0002,0.0002,0.0002,0.0002,0. 0002} >> >>{230,58907,88648,156764,216759,240405,264601,289 047,312630,339947,364452,38 >>6486,409427,434075,455140,475759,500086,521530,5 44703,680376,981066,1313419, >>1712592,1860151,1882452,1905328,1927504,1948159, 1970054,1990408,2014501,2038 >>573,2062786,2087163,2110129,2132196,2155657,2181 058,2204976,2228575,2256229, >>2283897,2352453,2407153,2457716,2542081,2572119, 2624133,2699592,2771254,2832 >>224,2908151,2951500,3005088,3032889,3137244,3158 685,3179395,3203681,3261587, >>3304359,3325577,3566688,3621357,3645094,3718667, 3740821,3762386,3783169,3804 >>593,3826503,3904589,3931012,3957675,4141934,4265 118,4288568,4316898,4365625, >>4473965,4535752,4559700,4691802,4749478,5977208, 6000272,6021416,6045939,6078 >>912,6111900,6145155,6176422,6206627,6238291,6271 270,6303067,6334117,6365200, >>6395250,6424719,6888329} >> >>| 0.41744 >> >> > > > ---------------------------(end of broadcast)--------------------------- TIP 4: Don't 'kill -9' the postmaster |
| |||
| Marko, > Sometimes, if the random number generator, that PostgreSQL uses, > isn't good enough, the randomly selected pages for the statistics > might not be random enough. > > Solaris is unknown to me. Maybe the used random number generator there > isn't good enough? Hmmm. Good point. Will have to test on Linux. -- --Josh Josh Berkus Aglio Database Solutions San Francisco ---------------------------(end of broadcast)--------------------------- TIP 7: don't forget to increase your free space map settings |
| |||
| > > Solaris is unknown to me. Maybe the used random number generator there > > isn't good enough? > > Hmmm. Good point. Will have to test on Linux. Nope: Linux 2.4.20: test=# select tablename, attname, n_distinct from pg_stats where tablename = 'web_site_activity_fa'; tablename | attname | n_distinct ----------------------+---------------------+------------ web_site_activity_fa | session_id | 626127 test=# select count(distinct session_id) from web_site_activity_fa; count --------- 3174813 (1 row) .... I think the problem is in our heuristic sampling code. I'm not the first person to have this kind of a problem. Will be following up with tests ... -- --Josh Josh Berkus Aglio Database Solutions San Francisco ---------------------------(end of broadcast)--------------------------- TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org |
| |||
| Josh Berkus <josh@agliodbs.com> writes: > ... I think the problem is in our heuristic sampling code. I'm not the first > person to have this kind of a problem. Will be following up with tests ... I looked into this a while back when we were talking about changing the sampling method. The conclusions were discouraging. Fundamentally, using constant sized samples of data for n_distinct is bogus. Constant sized samples only work for things like the histograms that can be analyzed through standard statistics population sampling which depends on the law of large numbers. n_distinct requires you to estimate how frequently very rare things occur. You can't apply the law of large numbers because even a single instance of a value out of a large pool alters the results disproportionately. To get a valid estimate for n_distinct you would need to sample a fixed percentage of the table. Not a fixed size sample. That just isn't practical. Moreover, I think the percentage would have to be quite large. Even if you sampled half the table I think the confidence interval would be quite wide. The situation is worsened because it's unclear how to interpolate values for subsets of the table. If the histogram says you have a million records and you're adding a clause that has a selectivity of 50% then half a million is a good guess. But if what you care about is n_distinct and you start with a million records with 1,000 distinct values and then apply a clause that filters 50% of them, how do you estimate the resulting n_distinct? 500 is clearly wrong, but 1,000 is wrong too. You could end up with anywhere from 0 to 1,000 and you have no good way to figure out where the truth lies. So I fear this is fundamentally a hopeless situation. It's going to be difficult to consistently get good plans for any queries that depend on good estimates for n_distinct. -- greg ---------------------------(end of broadcast)--------------------------- TIP 4: Don't 'kill -9' the postmaster |
| |||
| Josh Berkus <josh@agliodbs.com> writes: > Overall, our formula is inherently conservative of n_distinct. That is, I > believe that it is actually computing the *smallest* number of distinct > values which would reasonably produce the given sample, rather than the > *median* one. This is contrary to the notes in analyze.c, which seem to > think that we're *overestimating* n_distinct. Well, the notes are there because the early tests I ran on that formula did show it overestimating n_distinct more often than not. Greg is correct that this is inherently a hard problem :-( I have nothing against adopting a different formula, if you can find something with a comparable amount of math behind it ... but I fear it'd only shift the failure cases around. regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match |
| ||||
| Here is my opinion. I hope this helps. Maybe there is no one good formula: On boolean type, there are at most 3 distinct values. There is an upper bound for fornames in one country. There is an upper bound for last names in one country. There is a fixed number of states and postal codes in one country. On the other hand, with timestamp, every value could be distinct. A primary key with only one column has only distinct values. If the integer column refers with a foreign key into another table's only primary key, we could take advantage of that knolege. A column with a unique index has only distinct values. First ones are for classifying and the second ones measure continuous or discrete time or something like the time. The upper bound for classifying might be 3 (boolean), or it might be one million. The properties of the distribution might be hard to guess. Here is one way: 1. Find out the number of distinct values for 500 rows. 2. Try to guess, how many distinct values are for 1000 rows. Find out the real number of distinct values for 1000 rows. 3. If the guess and the reality are 50% wrong, do the iteration for 2x1000 rows. Iterate using a power of two to increase the samples, until you trust the estimate enough. So, in the phase two, you could try to guess with two distinct formulas: One for the classifying target (boolean columns hit there). Another one for the timestamp and numerical values. If there are one million classifications on one column, how you can find it out, by other means than checking at least two million rows? This means, that the user should have a possibility to tell the lower bound for the number of rows for sampling. Regards, Marko Ristola Tom Lane wrote: >Josh Berkus <josh@agliodbs.com> writes: > > >>Overall, our formula is inherently conservative of n_distinct. That is, I >>believe that it is actually computing the *smallest* number of distinct >>values which would reasonably produce the given sample, rather than the >>*median* one. This is contrary to the notes in analyze.c, which seem to >>think that we're *overestimating* n_distinct. >> >> > >Well, the notes are there because the early tests I ran on that formula >did show it overestimating n_distinct more often than not. Greg is >correct that this is inherently a hard problem :-( > >I have nothing against adopting a different formula, if you can find >something with a comparable amount of math behind it ... but I fear >it'd only shift the failure cases around. > > regards, tom lane > >---------------------------(end of broadcast)--------------------------- >TIP 9: the planner will ignore your desire to choose an index scan if your > joining column's datatypes do not match > > ---------------------------(end of broadcast)--------------------------- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.org so that your message can get through to the mailing list cleanly |