This is a discussion on IN() statement values order makes 2x performance hit within the Pgsql Performance forums, part of the PostgreSQL category; --> Hello everybody! I have found a performance issue with 2 equivalent queries stably taking different (~x2) time to finish. ...
| |||||||
| Register | FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| Hello everybody! I have found a performance issue with 2 equivalent queries stably taking different (~x2) time to finish. In just a few words it can be described like this: if you have a lot of values in an IN() statement, you should put most heavy (specifying most number of rows) ids first. This is mostly just a bug submit, than looking for help. So this is what I have: RHEL PostgreSQL 8.3.1 A table ext_feeder_item with ~4.6M records. kia=# \d+ ext_feeder_item; Table "public.ext_feeder_item" Column | Type | Modifiers | Description ----------+--------------------------+--------------------------------------------------------------+------------- id | bigint | not null default nextval('ext_feeder_item_id_seq'::regclass) | feed_id | bigint | not null | pub_date | timestamp with time zone | | Indexes: "ext_feeder_item_pkey" PRIMARY KEY, btree (id) "ext_feeder_item_feed_id_pub_date_idx" btree (feed_id, pub_date) "ext_feeder_item_idx" btree (feed_id) Triggers: ..... Has OIDs: no Statistics for the fields feed_id and pub_date are set to 1000; The table have just been vacuumed and analyzed. A simple query to the table: SELECT id FROM ext_feeder_item AS i WHERE i.feed_id IN (...) ORDER BY pub_date DESC, id DESC LIMIT 11 OFFSET 0; with many (~1200) ids in the IN() statement. The count of rows distribution for these ids (may be thought of as foreign keys in this table) is the following: id = 54461: ~180000 - actually the most heavy id in the whole table. other ids: a single id at most specifies 2032 rows; 6036 rows total. If I perform a query with IN(54461, ...) it stably (5 attempts) takes 4.5..4.7 secs. to perform. QUERY PLAN LimitÂ* (cost=1463104.22..1463104.25 rows=11 width=16) (actual time=4585.420..4585.452 rows=11 loops=1) Â* ->Â* SortÂ* (cost=1463104.22..1464647.29 rows=617228 width=16) (actual time=4585.415..4585.425 rows=11 loops=1) Â*Â*Â*Â*Â*Â*Â* Sort Key: pub_date, id Â*Â*Â*Â*Â*Â*Â* Sort Method:Â* top-N heapsortÂ* Memory: 17kB Â*Â*Â*Â*Â*Â*Â* ->Â* Bitmap Heap Scan on ext_feeder_item i (cost=13832.40..1449341.79 rows=617228 width=16) (actual time=894.622..4260.441 rows=185625 loops=1) Â*Â*Â*Â*Â*Â*Â*Â*Â*Â*Â*Â*Â* Recheck Cond: (feed_id = ANY ('{54461, ...}'::bigint[])) Â*Â*Â*Â*Â*Â*Â*Â*Â*Â*Â*Â*Â* ->Â* Bitmap Index Scan on ext_feeder_item_idx (cost=0.00..13678.10 rows=617228 width=0) (actual time=884.686..884.686 rows=185625 loops=1) Â*Â*Â*Â*Â*Â*Â*Â*Â*Â*Â*Â*Â*Â*Â*Â*Â*Â*Â* Index Cond: (feed_id = ANY ('{54461, ....}'::bigint[])) Total runtime: 4585.852 ms If I perform a query with IN(..., 54461) it stably (5 attempts) takes 9.3..9.5 secs. to perform. QUERY PLAN LimitÂ* (cost=1463104.22..1463104.25 rows=11 width=16) (actual time=9330.267..9330.298 rows=11 loops=1) Â* ->Â* SortÂ* (cost=1463104.22..1464647.29 rows=617228 width=16) (actual time=9330.263..9330.273 rows=11 loops=1) Â*Â*Â*Â*Â*Â*Â* Sort Key: pub_date, id Â*Â*Â*Â*Â*Â*Â* Sort Method:Â* top-N heapsortÂ* Memory: 17kB Â*Â*Â*Â*Â*Â*Â* ->Â* Bitmap Heap Scan on ext_feeder_item i (cost=13832.40..1449341.79 rows=617228 width=16) (actual time=1018.401..8971.029 rows=185625 loops=1) Â*Â*Â*Â*Â*Â*Â*Â*Â*Â*Â*Â*Â* Recheck Cond: (feed_id = ANY ('{... ,54461}'::bigint[])) Â*Â*Â*Â*Â*Â*Â*Â*Â*Â*Â*Â*Â* ->Â* Bitmap Index Scan on ext_feeder_item_idx (cost=0.00..13678.10 rows=617228 width=0) (actual time=1008.791..1008.791 rows=185625 loops=1) Â*Â*Â*Â*Â*Â*Â*Â*Â*Â*Â*Â*Â*Â*Â*Â*Â*Â*Â* Index Cond: (feed_id = ANY ('{... ,54461}'::bigint[])) Total runtime: 9330.729 ms I don't know what are the roots of the problem, but I think that some symptomatic healing could be applied: the PostgreSQL could sort the IDs due to the statistics. So currently I tend to select the IDs from another table ordering them due to their weights: it's easy for me thanks to denormalization. Also I would expect from PostgreSQL that it sorted the values to make index scan more sequential, but this expectation already conflicts with the bug described above |