Unix Technical Forum

IN() statement values order makes 2x performance hit

This is a discussion on IN() statement values order makes 2x performance hit within the Pgsql Performance forums, part of the PostgreSQL category; --> Hello everybody! I have found a performance issue with 2 equivalent queries stably taking different (~x2) time to finish. ...


Go Back   Unix Technical Forum > Database Server Software > PostgreSQL > Pgsql Performance

Register FAQ Members List Calendar Search Today's Posts Mark Forums Read
  #1 (permalink)  
Old 06-02-2008, 12:37 PM
Alexey Kupershtokh
 
Posts: n/a
Default IN() statement values order makes 2x performance hit

Hello everybody!

I have found a performance issue with 2 equivalent queries stably taking different (~x2) time to finish. In just a few words it can be described like this: if you have a lot of values in an IN() statement, you should put most heavy (specifying most number of rows) ids first.
This is mostly just a bug submit, than looking for help.

So this is what I have: RHEL PostgreSQL 8.3.1 A table ext_feeder_item with ~4.6M records.
kia=# \d+ ext_feeder_item;
Table "public.ext_feeder_item"
Column | Type | Modifiers | Description
----------+--------------------------+--------------------------------------------------------------+-------------
id | bigint | not null default nextval('ext_feeder_item_id_seq'::regclass) |
feed_id | bigint | not null |
pub_date | timestamp with time zone | |
Indexes:
"ext_feeder_item_pkey" PRIMARY KEY, btree (id)
"ext_feeder_item_feed_id_pub_date_idx" btree (feed_id, pub_date)
"ext_feeder_item_idx" btree (feed_id)
Triggers:
.....
Has OIDs: no
Statistics for the fields feed_id and pub_date are set to 1000; The table have just been vacuumed and analyzed. A simple query to the table:
SELECT
id
FROM
ext_feeder_item AS i
WHERE
i.feed_id IN (...)
ORDER BY pub_date DESC, id DESC
LIMIT 11 OFFSET 0;

with many (~1200) ids in the IN() statement. The count of rows distribution for these ids (may be thought of as foreign keys in this table) is the following:
id = 54461: ~180000 - actually the most heavy id in the whole table.
other ids: a single id at most specifies 2032 rows; 6036 rows total. If I perform a query with
IN(54461, ...)
it stably (5 attempts) takes 4.5..4.7 secs. to perform.
QUERY PLAN
LimitÂ* (cost=1463104.22..1463104.25 rows=11 width=16) (actual time=4585.420..4585.452 rows=11 loops=1)
Â* ->Â* SortÂ* (cost=1463104.22..1464647.29 rows=617228 width=16) (actual time=4585.415..4585.425 rows=11 loops=1)
Â*Â*Â*Â*Â*Â*Â* Sort Key: pub_date, id
Â*Â*Â*Â*Â*Â*Â* Sort Method:Â* top-N heapsortÂ* Memory: 17kB
Â*Â*Â*Â*Â*Â*Â* ->Â* Bitmap Heap Scan on ext_feeder_item i (cost=13832.40..1449341.79 rows=617228 width=16) (actual time=894.622..4260.441 rows=185625 loops=1)
Â*Â*Â*Â*Â*Â*Â*Â*Â*Â*Â*Â*Â* Recheck Cond: (feed_id = ANY ('{54461, ...}'::bigint[]))
Â*Â*Â*Â*Â*Â*Â*Â*Â*Â*Â*Â*Â* ->Â* Bitmap Index Scan on ext_feeder_item_idx (cost=0.00..13678.10 rows=617228 width=0) (actual time=884.686..884.686 rows=185625 loops=1)
Â*Â*Â*Â*Â*Â*Â*Â*Â*Â*Â*Â*Â*Â*Â*Â*Â*Â*Â* Index Cond: (feed_id = ANY ('{54461, ....}'::bigint[]))
Total runtime: 4585.852 ms
If I perform a query with
IN(..., 54461)
it stably (5 attempts) takes 9.3..9.5 secs. to perform.
QUERY PLAN
LimitÂ* (cost=1463104.22..1463104.25 rows=11 width=16) (actual time=9330.267..9330.298 rows=11 loops=1)
Â* ->Â* SortÂ* (cost=1463104.22..1464647.29 rows=617228 width=16) (actual time=9330.263..9330.273 rows=11 loops=1)
Â*Â*Â*Â*Â*Â*Â* Sort Key: pub_date, id
Â*Â*Â*Â*Â*Â*Â* Sort Method:Â* top-N heapsortÂ* Memory: 17kB
Â*Â*Â*Â*Â*Â*Â* ->Â* Bitmap Heap Scan on ext_feeder_item i (cost=13832.40..1449341.79 rows=617228 width=16) (actual time=1018.401..8971.029 rows=185625 loops=1)
Â*Â*Â*Â*Â*Â*Â*Â*Â*Â*Â*Â*Â* Recheck Cond: (feed_id = ANY ('{... ,54461}'::bigint[]))
Â*Â*Â*Â*Â*Â*Â*Â*Â*Â*Â*Â*Â* ->Â* Bitmap Index Scan on ext_feeder_item_idx (cost=0.00..13678.10 rows=617228 width=0) (actual time=1008.791..1008.791 rows=185625 loops=1)
Â*Â*Â*Â*Â*Â*Â*Â*Â*Â*Â*Â*Â*Â*Â*Â*Â*Â*Â* Index Cond: (feed_id = ANY ('{... ,54461}'::bigint[]))
Total runtime: 9330.729 ms
I don't know what are the roots of the problem, but I think that some symptomatic healing could be applied: the PostgreSQL could sort the IDs due to the statistics.
So currently I tend to select the IDs from another table ordering them due to their weights: it's easy for me thanks to denormalization.

Also I would expect from PostgreSQL that it sorted the values to make index scan more sequential, but this expectation already conflicts with the bug described above
Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply


Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On
Forum Jump


All times are GMT. The time now is 07:11 AM.


Powered by vBulletin® Version 3.6.5
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
SEO by vBSEO 3.2.0
www.UnixAdminTalk.com