This is a discussion on Expression index ignores column statistics target within the pgsql Hackers forums, part of the PostgreSQL category; --> I've noticed that row count estimates for expression indexes appear to rely on default_statistics_target rather than on a column's ...
| |||||||
| FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| I've noticed that row count estimates for expression indexes appear to rely on default_statistics_target rather than on a column's actual statistics target. That is, if I use ALTER TABLE SET STATISTICS to increase a column's statistics target and then run ANALYZE, then estimates for non-expression-index queries improve as expected. However, queries that use an expression index remain accurate for only around the N most common values, where N is the default_statistics_target that was in effect when ANALYZE ran. I'm still rummaging through the archives looking for past discussion; is this behavior a known limitation or just an oversight? CREATE TABLE foo (x integer); CREATE INDEX foo_x_idx ON foo (x); CREATE INDEX foo_abs_x_idx ON foo (abs(x)); INSERT INTO foo (x) SELECT r1 % r2 FROM generate_series(1, 100) AS g1(r1), generate_series(1, 100) AS g2(r2); SET default_statistics_target TO 15; ALTER TABLE foo ALTER COLUMN x SET STATISTICS 20; ANALYZE foo; SELECT most_common_vals FROM pg_stats WHERE attname = 'x'; most_common_vals ----------------------------------------------------- {0,1,2,3,4,5,6,7,8,10,9,11,12,14,13,15,16,19,17,18 } (1 row) EXPLAIN ANALYZE SELECT * FROM foo WHERE x = 13; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on foo (cost=2.72..50.28 rows=205 width=4) (actual time=0.370..1.766 rows=220 loops=1) Recheck Cond: (x = 13) -> Bitmap Index Scan on foo_x_idx (cost=0.00..2.72 rows=205 width=0) (actual time=0.314..0.314 rows=220 loops=1) Index Cond: (x = 13) Total runtime: 2.905 ms (5 rows) EXPLAIN ANALYZE SELECT * FROM foo WHERE abs(x) = 13; QUERY PLAN -------------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on foo (cost=2.72..50.80 rows=205 width=4) (actual time=0.358..1.720 rows=220 loops=1) Recheck Cond: (abs(x) = 13) -> Bitmap Index Scan on foo_abs_x_idx (cost=0.00..2.72 rows=205 width=0) (actual time=0.305..0.305 rows=220 loops=1) Index Cond: (abs(x) = 13) Total runtime: 2.875 ms (5 rows) EXPLAIN ANALYZE SELECT * FROM foo WHERE x = 18; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on foo (cost=2.60..49.75 rows=172 width=4) (actual time=0.312..1.442 rows=180 loops=1) Recheck Cond: (x = 18) -> Bitmap Index Scan on foo_x_idx (cost=0.00..2.60 rows=172 width=0) (actual time=0.262..0.262 rows=180 loops=1) Index Cond: (x = 18) Total runtime: 2.393 ms (5 rows) EXPLAIN ANALYZE SELECT * FROM foo WHERE abs(x) = 18; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on foo (cost=2.22..43.65 rows=63 width=4) (actual time=0.313..1.436 rows=180 loops=1) Recheck Cond: (abs(x) = 18) -> Bitmap Index Scan on foo_abs_x_idx (cost=0.00..2.22 rows=63 width=0) (actual time=0.263..0.263 rows=180 loops=1) Index Cond: (abs(x) = 18) Total runtime: 2.418 ms (5 rows) -- Michael Fuhr ---------------------------(end of broadcast)--------------------------- TIP 2: Don't 'kill -9' the postmaster |
| |||
| This is expected. The main TODO items is: * Allow accurate statistics to be collected on indexes with more than one column or expression indexes, perhaps using per-index statistics Basically, we don't have multi-column or expression statistics. ANALYZE just analyzes columns, even if an expression index exists. --------------------------------------------------------------------------- Michael Fuhr wrote: > I've noticed that row count estimates for expression indexes appear > to rely on default_statistics_target rather than on a column's > actual statistics target. That is, if I use ALTER TABLE SET > STATISTICS to increase a column's statistics target and then run > ANALYZE, then estimates for non-expression-index queries improve > as expected. However, queries that use an expression index remain > accurate for only around the N most common values, where N is the > default_statistics_target that was in effect when ANALYZE ran. I'm > still rummaging through the archives looking for past discussion; > is this behavior a known limitation or just an oversight? > > CREATE TABLE foo (x integer); > > CREATE INDEX foo_x_idx ON foo (x); > CREATE INDEX foo_abs_x_idx ON foo (abs(x)); > > INSERT INTO foo (x) > SELECT r1 % r2 > FROM generate_series(1, 100) AS g1(r1), > generate_series(1, 100) AS g2(r2); > > SET default_statistics_target TO 15; > ALTER TABLE foo ALTER COLUMN x SET STATISTICS 20; > ANALYZE foo; > > SELECT most_common_vals FROM pg_stats WHERE attname = 'x'; > most_common_vals > ----------------------------------------------------- > {0,1,2,3,4,5,6,7,8,10,9,11,12,14,13,15,16,19,17,18 } > (1 row) > > EXPLAIN ANALYZE SELECT * FROM foo WHERE x = 13; > QUERY PLAN > ---------------------------------------------------------------------------------------------------------------------- > Bitmap Heap Scan on foo (cost=2.72..50.28 rows=205 width=4) (actual time=0.370..1.766 rows=220 loops=1) > Recheck Cond: (x = 13) > -> Bitmap Index Scan on foo_x_idx (cost=0.00..2.72 rows=205 width=0) (actual time=0.314..0.314 rows=220 loops=1) > Index Cond: (x = 13) > Total runtime: 2.905 ms > (5 rows) > > EXPLAIN ANALYZE SELECT * FROM foo WHERE abs(x) = 13; > QUERY PLAN > -------------------------------------------------------------------------------------------------------------------------- > Bitmap Heap Scan on foo (cost=2.72..50.80 rows=205 width=4) (actual time=0.358..1.720 rows=220 loops=1) > Recheck Cond: (abs(x) = 13) > -> Bitmap Index Scan on foo_abs_x_idx (cost=0.00..2.72 rows=205 width=0) (actual time=0.305..0.305 rows=220 loops=1) > Index Cond: (abs(x) = 13) > Total runtime: 2.875 ms > (5 rows) > > EXPLAIN ANALYZE SELECT * FROM foo WHERE x = 18; > QUERY PLAN > ---------------------------------------------------------------------------------------------------------------------- > Bitmap Heap Scan on foo (cost=2.60..49.75 rows=172 width=4) (actual time=0.312..1.442 rows=180 loops=1) > Recheck Cond: (x = 18) > -> Bitmap Index Scan on foo_x_idx (cost=0.00..2.60 rows=172 width=0) (actual time=0.262..0.262 rows=180 loops=1) > Index Cond: (x = 18) > Total runtime: 2.393 ms > (5 rows) > > EXPLAIN ANALYZE SELECT * FROM foo WHERE abs(x) = 18; > QUERY PLAN > ------------------------------------------------------------------------------------------------------------------------- > Bitmap Heap Scan on foo (cost=2.22..43.65 rows=63 width=4) (actual time=0.313..1.436 rows=180 loops=1) > Recheck Cond: (abs(x) = 18) > -> Bitmap Index Scan on foo_abs_x_idx (cost=0.00..2.22 rows=63 width=0) (actual time=0.263..0.263 rows=180 loops=1) > Index Cond: (abs(x) = 18) > Total runtime: 2.418 ms > (5 rows) > > -- > Michael Fuhr > > ---------------------------(end of broadcast)--------------------------- > TIP 2: Don't 'kill -9' the postmaster > -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073 ---------------------------(end of broadcast)--------------------------- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.org so that your message can get through to the mailing list cleanly |
| |||
| Michael Fuhr <mike@fuhr.org> writes: > I've noticed that row count estimates for expression indexes appear > to rely on default_statistics_target rather than on a column's > actual statistics target. That is, if I use ALTER TABLE SET > STATISTICS to increase a column's statistics target and then run > ANALYZE, then estimates for non-expression-index queries improve > as expected. However, queries that use an expression index remain > accurate for only around the N most common values, where N is the > default_statistics_target that was in effect when ANALYZE ran. The code does in fact honor per-column statistics targets attached to expression indexes, viz alter table myfuncindex alter column pg_expression_1 set statistics 100; This isn't documented, mainly because pg_dump doesn't promise to dump such things, which it doesn't do because I didn't want to see the "pg_expression_N" naming for expression index columns become graven on stone tablets. I seem to recall bringing up the question of whether we could find a less implementation-specific way of commanding this behavior, but I can't find it in the archives right now. regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 5: don't forget to increase your free space map settings |
| |||
| On Fri, Sep 30, 2005 at 11:59:26PM -0400, Bruce Momjian wrote: > This is expected. The main TODO items is: > > * Allow accurate statistics to be collected on indexes with more than > one column or expression indexes, perhaps using per-index statistics > > Basically, we don't have multi-column or expression statistics. ANALYZE > just analyzes columns, even if an expression index exists. But the row count estimates imply that expression index queries do use column statistics, presumably as a proxy in the absence of expression statistics. This looks like a relevant commit: http://archives.postgresql.org/pgsql...2/msg00124.php The behavior I observed is that the planner does appear to use column statistics when planning an expression index query, but it doesn't appear to honor a column's non-default statistics target. In other words: * Row count estimates for expression index queries (at least simple ones) are reasonably accurate for the N most common column values, where N is the value of default_statistics_target when ANALYZE was run. * Specifically setting the column's statistics target with ALTER TABLE SET STATISTICS doesn't result in better statistics for expression index queries. That difference in behavior seems odd: if default_statistics_target has an effect, why doesn't ALTER TABLE SET STATISTICS? -- Michael Fuhr ---------------------------(end of broadcast)--------------------------- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.org so that your message can get through to the mailing list cleanly |
| |||
| On Sat, Oct 01, 2005 at 12:53:03AM -0400, Tom Lane wrote: > The code does in fact honor per-column statistics targets attached to > expression indexes, viz > > alter table myfuncindex alter column pg_expression_1 set statistics 100; Aha -- that's the piece I didn't know about. I was wondering where those statistics were being stored, since they were affected by default_statistics_target but not by per-column statistics targets. And now I see them when I don't restrict queries against pg_stats by just the table or column name. Thanks. -- Michael Fuhr ---------------------------(end of broadcast)--------------------------- TIP 5: don't forget to increase your free space map settings |
| |||
| I wrote: > I seem to recall bringing up the question of whether > we could find a less implementation-specific way of commanding this > behavior, but I can't find it in the archives right now. Ah, here it is: http://archives.postgresql.org/pgsql...3/msg00502.php No responses :-( regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 5: don't forget to increase your free space map settings |
| |||
| On Sat, Oct 01, 2005 at 02:19:06AM -0400, Tom Lane wrote: > I wrote: > > I seem to recall bringing up the question of whether > > we could find a less implementation-specific way of commanding this > > behavior, but I can't find it in the archives right now. > > Ah, here it is: > http://archives.postgresql.org/pgsql...3/msg00502.php > > No responses :-( Would an ALTER INDEX SET STATISTICS form be possible? -- Michael Fuhr ---------------------------(end of broadcast)--------------------------- TIP 4: Have you searched our list archives? http://archives.postgresql.org |
| |||
| Michael Fuhr <mike@fuhr.org> writes: > On Sat, Oct 01, 2005 at 02:19:06AM -0400, Tom Lane wrote: >> Ah, here it is: >> http://archives.postgresql.org/pgsql...3/msg00502.php > Would an ALTER INDEX SET STATISTICS form be possible? It's not so much the table/index misnomer that's bothering me, it's the lack of a clean way to identify which column of the index you are talking about. regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 4: Have you searched our list archives? http://archives.postgresql.org |
| |||
| On Sat, Oct 01, 2005 at 02:42:32AM -0400, Tom Lane wrote: > Michael Fuhr <mike@fuhr.org> writes: > > Would an ALTER INDEX SET STATISTICS form be possible? > > It's not so much the table/index misnomer that's bothering me, it's > the lack of a clean way to identify which column of the index you > are talking about. Ah, I see -- I wasn't thinking about expressions in multicolumn indexes. What about identifying the column with the expression itself, ala quote_ident(pg_get_indexdef())? That might be tedious for the user to type but it would be attractive from a self-documentation standpoint. ALTER INDEX indexname ALTER COLUMN "the expression" SET STATISTICS 100; I do see that indexes allow multiple instances of the same expression, so this approach could be ambiguous. Or should such repetition be prohibited as it is with column names? test=> CREATE TABLE foo (x integer); CREATE TABLE test=> CREATE INDEX foo1_idx ON foo (x, x); ERROR: duplicate key violates unique constraint "pg_attribute_relid_attnam_index" test=> CREATE INDEX foo2_idx ON foo (abs(x), abs(x)); CREATE INDEX -- Michael Fuhr ---------------------------(end of broadcast)--------------------------- TIP 5: don't forget to increase your free space map settings |
| ||||
| Tom Lane wrote: > Michael Fuhr <mike@fuhr.org> writes: > > I've noticed that row count estimates for expression indexes appear > > to rely on default_statistics_target rather than on a column's > > actual statistics target. That is, if I use ALTER TABLE SET > > STATISTICS to increase a column's statistics target and then run > > ANALYZE, then estimates for non-expression-index queries improve > > as expected. However, queries that use an expression index remain > > accurate for only around the N most common values, where N is the > > default_statistics_target that was in effect when ANALYZE ran. > > The code does in fact honor per-column statistics targets attached to > expression indexes, viz > > alter table myfuncindex alter column pg_expression_1 set statistics 100; > > This isn't documented, mainly because pg_dump doesn't promise to dump > such things, which it doesn't do because I didn't want to see the > "pg_expression_N" naming for expression index columns become graven on > stone tablets. I seem to recall bringing up the question of whether > we could find a less implementation-specific way of commanding this > behavior, but I can't find it in the archives right now. Is this a TODO? -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073 ---------------------------(end of broadcast)--------------------------- TIP 4: Have you searched our list archives? http://archives.postgresql.org |
| Thread Tools | |
| Display Modes | |
|
|