Unix Technical Forum

Inefficient queryplan for query with intersectable subselects/joins

This is a discussion on Inefficient queryplan for query with intersectable subselects/joins within the Pgsql Performance forums, part of the PostgreSQL category; --> Hi list, I'm writing an application that will aggregate records with a few million records into averages/sums/minimums etc grouped ...


Go Back   Unix Technical Forum > Database Server Software > PostgreSQL > Pgsql Performance

FAQ Members List Calendar Search Today's Posts Mark Forums Read
  #1 (permalink)  
Old 04-18-2008, 01:18 PM
Arjen van der Meijden
 
Posts: n/a
Default Inefficient queryplan for query with intersectable subselects/joins

Hi list,

I'm writing an application that will aggregate records with a few
million records into averages/sums/minimums etc grouped per day.

Clients can add filters and do lots of customization on what they want
to see. And I've to translate that to one or more queries. Basically, I
append each filter as either an extra and-in-the-where or joined with
the clauses as ON-clause. The application now uses 8.1devel but I think
the basic plans are similar to 8.0. At least for this first query.

I noticed a query taking over 25 seconds to execute:

SELECT "make a timestamp" grouper, chart_2.Prijs as field_2_0
FROM
pwprijs as chart_2
JOIN pwprodukten t_0 ON chart_2.ProduktID = t_0.ID AND t_0.Cat2 IN
(SELECT 545 UNION SELECT ID FROM cat WHERE ParentID = 545)
JOIN pwprijs t_1 ON chart_2.ProduktID = t_1.ProduktID
AND t_1.LeverancierID = 938 AND t_1.recordtimestamp >= "last
timestamp"
WHERE
chart_2.Prijs > 0

It yields quite a long plan, so I've send that as an attachment along.
Basically it combines two tables against an original to fetch "all
prices (of all suppliers) for products of a certain category that are
sold by a certain supplier".

I was wondering how rewriting it to subselects would improve
performance, but that wasn't a very clear winner. It shaved of about 5
seconds. So I took the subselects and used INTERSECT to unite them and
have only one IN-clause in the query. That made it go down to around 13
seconds.

I noticed it was doing a seq scan on the largest table to get the "Prijs
> 0"-condition. But since there are only 947 of the 7692207 with prijs

= 0 and none with < 0, it shouldn't be the main thing to look for.
Dropping the clause made a minor improvement in performance for the queries.

But disabling sequential scans allowed an improvement to only 660 ms
compared to the 13 seconds earlier! Row-estimates seem to be quite a bit
off, so I already set the statistics target to 1000 and re-analyzed.
Btw, adding the prijs-clause again would make it choose another index
and thus resulted in much longer operation.

The final query, only taking 650ms, would be:

SELECT
"make a timestamp" as grouper,
chart_2.Prijs as field_2_0
FROM
pwprijs as chart_2
WHERE
chart_2.ProduktID IN (SELECT ID FROM pwprodukten WHERE Cat2 IN
(SELECT 545 UNION SELECT ID FROM cat WHERE ParentID = 545)
INTERSECT
SELECT produktid FROM pwprijs WHERE LeverancierID = 938
AND recordtimestamp >= "last timestamp")

So I'm wondering: how can I make postgres decide to use the (correct)
index without having to disable seq scans and how can I still add the
prijs-clause without dropping the index for it (since it should be used
for other filters). And for ease of use in my application I'd prefer to
use the first query or the version with two seperate IN-clauses.

Is that possible?

I left all the configuration-stuff to the defaults since changing values
didn't seem to impact much. Apart from the buffers and effective cache,
increasing those made the performance worse.

Best regards,

Arjen

Table "public.pwprijs"
Column | Type | Modifiers
-----------------+---------+-----------
produktid | integer | not null
leverancierid | integer | not null
prijs | real | not null
Time | bigint | not null
recordtimestamp | bigint | not null
Indexes:
"pwprijs_levid_idx" btree (leverancierid), tablespace "raptor"
"pwprijs_levid_pid_idx" btree (leverancierid, produktid), tablespace "raptor"
"pwprijs_levid_rects_idx" btree (leverancierid, recordtimestamp), tablespace "raptor"
"pwprijs_produktid_timestamp_idx" btree (produktid, recordtimestamp), tablespace "raptor"
"pwprijs_rec_levid_pid" btree (recordtimestamp, leverancierid, produktid), tablespace "raptor"
"pwprijs_recordtimestamp_idx" btree (recordtimestamp), tablespace "raptor"
"pwprijs_recordtimestamp_produktid_prijs_idx" btree (recordtimestamp, produktid, prijs), tablespace "raptor"
Tablespace: "raptor"

Table "public.pwprodukten"
Column | Type | Modifiers
-----------------+-----------------------+------------------------------
id | integer | not null
naam | character varying(80) | not null
cat2 | smallint | not null
recordtimestamp | bigint | not null
Indexes:
"pwprodukten_pkey" PRIMARY KEY, btree (id)
"pwprodukten_cat_idx" btree (cat2)



SELECT
CAST('epoch'::timestamp + (chart_2.RecordTimestamp - (chart_2.RecordTimestamp % 86400)) * interval '1 second' as date) as grouper,
chart_2.Prijs as field_2_0
FROM
pwprijs as chart_2
JOIN pwprodukten t_0 ON chart_2.ProduktID = t_0.ID
AND t_0.Cat2 IN (SELECT 545 UNION SELECT ID FROM cat WHERE ParentID = 545)
JOIN pwprijs t_1 ON chart_2.ProduktID = t_1.ProduktID AND
t_1.LeverancierID = 938 AND t_1.recordtimestamp >=
(SELECT max_date - 60 FROM last_dates WHERE table_name = 'pricetracker')
WHERE
chart_2.Prijs > 0

Hash Join (cost=72062.34..940100.37 rows=22901407 width=12) (actual time=746.520..29655.748 rows=58065 loops=1)
Hash Cond: ("outer".produktid = "inner".id)
InitPlan
-> Seq Scan on last_dates (cost=0.00..1.06 rows=1 width=8) (actual time=0.041..0.045 rows=1 loops=1)
Filter: ((table_name)::text = 'pricetracker'::text)
-> Seq Scan on pwprijs chart_2 (cost=0.00..156721.59 rows=7691800 width=16) (actual time=8.457..14713.939 rows=7691260 loops=1)
Filter: (prijs > 0::double precision)
-> Hash (cost=71769.47..71769.47 rows=42319 width=8) (actual time=606.996..606.996 rows=103 loops=1)
-> Hash Join (cost=4329.24..71769.47 rows=42319 width=8) (actual time=599.771..606.782 rows=103 loops=1)
Hash Cond: ("outer".produktid = "inner".id)
-> Bitmap Heap Scan on pwprijs t_1 (cost=787.84..65786.28 rows=84640 width=4) (actual time=36.049..40.618 rows=4405 loops=1)
Recheck Cond: ((leverancierid = 938) AND (recordtimestamp >= $1))
-> Bitmap Index Scan on pwprijs_levid_rects_idx (cost=0.00..787.84 rows=84640 width=0) (actual time=36.004..36.004 rows=4405 loops=1)
Index Cond: ((leverancierid = 938) AND (recordtimestamp >= $1))
-> Hash (cost=3399.01..3399.01 rows=22156 width=4) (actual time=561.313..561.313 rows=1458 loops=1)
-> Seq Scan on pwprodukten t_0 (cost=10.45..3399.01 rows=22156 width=4) (actual time=60.957..559.595 rows=1458 loops=1)
Filter: (subplan)
SubPlan
-> Materialize (cost=10.45..10.53 rows=8 width=2) (actual time=0.002..0.006 rows=8 loops=44313)
-> Unique (cost=10.40..10.44 rows=8 width=2) (actual time=44.831..44.846 rows=8 loops=1)
-> Sort (cost=10.40..10.42 rows=8 width=2) (actual time=44.829..44.834 rows=8 loops=1)
Sort Key: "?column?"
-> Append (cost=0.00..10.28 rows=8 width=2) (actual time=0.006..44.754 rows=8 loops=1)
-> Subquery Scan "*SELECT* 1" (cost=0.00..0.02 rows=1 width=0) (actual time=0.004..0.006 rows=1 loops=1)
-> Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.002..0.003 rows=1 loops=1)
-> Subquery Scan "*SELECT* 2" (cost=2.02..10.26 rows=7 width=2) (actual time=44.723..44.738 rows=7 loops=1)
-> Bitmap Heap Scan on cat (cost=2.02..10.19 rows=7 width=2) (actual time=44.719..44.726 rows=7 loops=1)
Recheck Cond: (parentid = 545)
-> Bitmap Index Scan on cat_parentid_idx (cost=0.00..2.02 rows=7 width=0) (actual time=44.704..44.704 rows=7 loops=1)
Index Cond: (parentid = 545)
Total runtime: 29688.736 ms




SELECT
CAST('epoch'::timestamp + (chart_2.RecordTimestamp - (chart_2.RecordTimestamp % 86400)) * interval '1 second' as date) as grouper,
chart_2.Prijs as field_2_0
FROM
pwprijs as chart_2
WHERE
chart_2.ProduktID IN (SELECT ID FROM pwprodukten WHERE Cat2 IN (SELECT 545 UNION SELECT ID FROM cat WHERE ParentID = 545))
AND
chart_2.ProduktID IN (SELECT produktid FROM pwprijs WHERE LeverancierID = 938 AND recordtimestamp >=
(SELECT max_date - 60 FROM last_dates WHERE table_name = 'pricetracker'))
AND
chart_2.Prijs > 0

Nested Loop IN Join (cost=2850.25..230556.55 rows=1699 width=12) (actual time=482.862..24419.848 rows=58065 loops=1)
InitPlan
-> Seq Scan on last_dates (cost=0.00..1.06 rows=1 width=8) (actual time=0.022..0.025 rows=1 loops=1)
Filter: ((table_name)::text = 'pricetracker'::text)
-> Hash IN Join (cost=2849.18..200109.24 rows=207947 width=20) (actual time=360.511..15450.592 rows=190487 loops=1)
Hash Cond: ("outer".produktid = "inner".id)
-> Seq Scan on pwprijs chart_2 (cost=0.00..156721.59 rows=7691800 width=16) (actual time=11.587..10137.642 rows=7691260 loops=1)
Filter: (prijs > 0::double precision)
-> Hash (cost=2846.19..2846.19 rows=1198 width=4) (actual time=244.576..244.576 rows=1458 loops=1)
-> Nested Loop (cost=12.93..2846.19 rows=1198 width=4) (actual time=48.646..243.239 rows=1458 loops=1)
-> Unique (cost=10.40..10.44 rows=8 width=2) (actual time=19.552..19.580 rows=8 loops=1)
-> Sort (cost=10.40..10.42 rows=8 width=2) (actual time=19.550..19.559 rows=8 loops=1)
Sort Key: "?column?"
-> Append (cost=0.00..10.28 rows=8 width=2) (actual time=0.005..19.532 rows=8 loops=1)
-> Subquery Scan "*SELECT* 1" (cost=0.00..0.02 rows=1 width=0) (actual time=0.005..0.007 rows=1 loops=1)
-> Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.003..0.003 rows=1 loops=1)
-> Subquery Scan "*SELECT* 2" (cost=2.02..10.26 rows=7 width=2) (actual time=19.501..19.518 rows=7 loops=1)
-> Bitmap Heap Scan on cat (cost=2.02..10.19 rows=7 width=2) (actual time=19.499..19.506 rows=7 loops=1)
Recheck Cond: (parentid = 545)
-> Bitmap Index Scan on cat_parentid_idx (cost=0.00..2.02 rows=7 width=0) (actual time=19.493..19.493 rows=7 loops=1)
Index Cond: (parentid = 545)
-> Bitmap Heap Scan on pwprodukten (cost=2.52..352.58 rows=150 width=6) (actual time=3.717..27.732 rows=182 loops=8)
Recheck Cond: (pwprodukten.cat2 = "outer"."?column?")
-> Bitmap Index Scan on pwprodukten_cat_idx (cost=0.00..2.52 rows=150 width=0) (actual time=1.510..1.510 rows=182 loops=8)
Index Cond: (pwprodukten.cat2 = "outer"."?column?")
-> Index Scan using pwprijs_levid_pid_idx on pwprijs (cost=0.00..34.16 rows=3 width=4) (actual time=0.044..0.044 rows=0 loops=190487)
Index Cond: ((pwprijs.leverancierid = 938) AND ("outer".produktid = pwprijs.produktid))
Filter: (recordtimestamp >= $0)
Total runtime: 24466.560 ms




SELECT
CAST('epoch'::timestamp + (chart_2.RecordTimestamp - (chart_2.RecordTimestamp % 86400)) * interval '1 second' as date) as grouper,
chart_2.Prijs as field_2_0
FROM
pwprijs as chart_2
WHERE
chart_2.ProduktID IN (SELECT ID FROM pwprodukten WHERE Cat2 IN (SELECT 545 UNION SELECT ID FROM cat WHERE ParentID = 545)
INTERSECT
SELECT produktid FROM pwprijs WHERE LeverancierID = 938 AND recordtimestamp >=
(SELECT max_date - 60 FROM last_dates WHERE table_name = 'pricetracker'))
AND
chart_2.Prijs > 0


Hash Join (cost=77791.44..457846.83 rows=2010732 width=12) (actual time=409.838..13202.566 rows=58065 loops=1)
Hash Cond: ("outer".produktid = "inner".id)
-> Seq Scan on pwprijs chart_2 (cost=0.00..156721.59 rows=7691800 width=16) (actual time=7.896..8253.585 rows=7691260 loops=1)
Filter: (prijs > 0::double precision)
-> Hash (cost=77769.98..77769.98 rows=8584 width=4) (actual time=325.008..325.008 rows=103 loops=1)
-> SetOp Intersect (cost=77254.95..77684.14 rows=8584 width=4) (actual time=317.384..324.932 rows=103 loops=1)
-> Sort (cost=77254.95..77469.55 rows=85838 width=4) (actual time=317.361..320.519 rows=5863 loops=1)
Sort Key: id
-> Append (cost=12.93..69491.92 rows=85838 width=4) (actual time=38.765..308.837 rows=5863 loops=1)
-> Subquery Scan "*SELECT* 1" (cost=12.93..2858.17 rows=1198 width=4) (actual time=38.763..256.795 rows=1458 loops=1)
-> Nested Loop (cost=12.93..2846.19 rows=1198 width=4) (actual time=38.761..254.989 rows=1458 loops=1)
-> Unique (cost=10.40..10.44 rows=8 width=2) (actual time=8.994..9.027 rows=8 loops=1)
-> Sort (cost=10.40..10.42 rows=8 width=2) (actual time=8.993..9.000 rows=8 loops=1)
Sort Key: "?column?"
-> Append (cost=0.00..10.28 rows=8 width=2) (actual time=0.003..8.978 rows=8 loops=1)
-> Subquery Scan "*SELECT* 1" (cost=0.00..0.02 rows=1 width=0) (actual time=0.003..0.005 rows=1 loops=1)
-> Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.001..0.001 rows=1 loops=1)
-> Subquery Scan "*SELECT* 2" (cost=2.02..10.26 rows=7 width=2) (actual time=8.949..8.963 rows=7 loops=1)
-> Bitmap Heap Scan on cat (cost=2.02..10.19 rows=7 width=2) (actual time=8.945..8.952 rows=7 loops=1)
Recheck Cond: (parentid = 545)
-> Bitmap Index Scan on cat_parentid_idx (cost=0.00..2.02 rows=7 width=0) (actual time=8.940..8.940 rows=7 loops=1)
Index Cond: (parentid = 545)
-> Bitmap Heap Scan on pwprodukten (cost=2.52..352.58 rows=150 width=6) (actual time=3.847..30.522 rows=182 loops=8)
Recheck Cond: (pwprodukten.cat2 = "outer"."?column?")
-> Bitmap Index Scan on pwprodukten_cat_idx (cost=0.00..2.52 rows=150 width=0) (actual time=1.583..1.583 rows=182 loops=8)
Index Cond: (pwprodukten.cat2 = "outer"."?column?")
-> Subquery Scan "*SELECT* 2" (cost=788.90..66633.75 rows=84640 width=4) (actual time=36.653..46.266 rows=4405 loops=1)
-> Bitmap Heap Scan on pwprijs (cost=788.90..65787.35 rows=84640 width=4) (actual time=36.651..41.326 rows=4405 loops=1)
Recheck Cond: ((leverancierid = 938) AND (recordtimestamp >= $0))
InitPlan
-> Seq Scan on last_dates (cost=0.00..1.06 rows=1 width=8) (actual time=0.025..0.029 rows=1 loops=1)
Filter: ((table_name)::text = 'pricetracker'::text)
-> Bitmap Index Scan on pwprijs_levid_rects_idx (cost=0.00..787.84 rows=84640 width=0) (actual time=36.614..36.614 rows=4405 loops=1)
Index Cond: ((leverancierid = 938) AND (recordtimestamp >= $0))
Total runtime: 13234.570 ms


set enable_seqscan = false;


without prijs-clause:

Nested Loop (cost=77299.32..7858118.21 rows=2013780 width=12) (actual time=33.964..634.359 rows=58065 loops=1)
-> SetOp Intersect (cost=77296.51..77727.66 rows=8623 width=4) (actual time=33.632..42.356 rows=103 loops=1)
-> Sort (cost=77296.51..77512.08 rows=86231 width=4) (actual time=33.616..36.979 rows=5863 loops=1)
Sort Key: id
-> Append (cost=12.93..69495.92 rows=86231 width=4) (actual time=0.214..25.380 rows=5863 loops=1)
-> Subquery Scan "*SELECT* 1" (cost=12.93..2858.17 rows=1198 width=4) (actual time=0.213..8.779 rows=1458 loops=1)
-> Nested Loop (cost=12.93..2846.19 rows=1198 width=4) (actual time=0.211..7.065 rows=1458 loops=1)
-> Unique (cost=10.40..10.44 rows=8 width=2) (actual time=0.095..0.122 rows=8 loops=1)
-> Sort (cost=10.40..10.42 rows=8 width=2) (actual time=0.093..0.099 rows=8 loops=1)
Sort Key: "?column?"
-> Append (cost=0.00..10.28 rows=8 width=2) (actual time=0.003..0.079 rows=8 loops=1)
-> Subquery Scan "*SELECT* 1" (cost=0.00..0.02 rows=1 width=0) (actual time=0.003..0.005 rows=1 loops=1)
-> Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.001..0.001 rows=1 loops=1)
-> Subquery Scan "*SELECT* 2" (cost=2.02..10.26 rows=7 width=2) (actual time=0.050..0.065 rows=7 loops=1)
-> Bitmap Heap Scan on cat (cost=2.02..10.19 rows=7 width=2) (actual time=0.048..0.054 rows=7 loops=1)
Recheck Cond: (parentid = 545)
-> Bitmap Index Scan on cat_parentid_idx (cost=0.00..2.02 rows=7 width=0) (actual time=0.040..0.040 rows=7 loops=1)
Index Cond: (parentid = 545)
-> Bitmap Heap Scan on pwprodukten (cost=2.52..352.58 rows=150 width=6) (actual time=0.076..0.658 rows=182 loops=8)
Recheck Cond: (pwprodukten.cat2 = "outer"."?column?")
-> Bitmap Index Scan on pwprodukten_cat_idx (cost=0.00..2.52 rows=150 width=0) (actual time=0.056..0.056 rows=182 loops=8)
Index Cond: (pwprodukten.cat2 = "outer"."?column?")
-> Subquery Scan "*SELECT* 2" (cost=794.53..66637.75 rows=85033 width=4) (actual time=1.064..10.785 rows=4405 loops=1)
-> Bitmap Heap Scan on pwprijs (cost=794.53..65787.42 rows=85033 width=4) (actual time=1.062..5.792 rows=4405 loops=1)
Recheck Cond: ((leverancierid = 938) AND (recordtimestamp >= $0))
InitPlan
-> Index Scan using last_dates_pkey on last_dates (cost=0.00..3.33 rows=1 width=8) (actual time=0.042..0.044 rows=1 loops=1)
Index Cond: ((table_name)::text = 'pricetracker'::text)
-> Bitmap Index Scan on pwprijs_levid_rects_idx (cost=0.00..791.20 rows=85033 width=0) (actual time=1.027..1.027 rows=4405 loops=1)
Index Cond: ((leverancierid = 938) AND (recordtimestamp >= $0))
-> Bitmap Heap Scan on pwprijs chart_2 (cost=2.82..895.85 rows=234 width=16) (actual time=0.363..3.980 rows=564 loops=103)
Recheck Cond: (chart_2.produktid = "outer".id)
-> Bitmap Index Scan on pwprijs_produktid_timestamp_idx (cost=0.00..2.82 rows=234 width=0) (actual time=0.206..0.206 rows=564 loops=103)
Index Cond: (chart_2.produktid = "outer".id)
Total runtime: 665.335 ms


with prijs-clause:


Hash Join (cost=134379.17..514503.02 rows=2013674 width=12) (actual time=3089.037..15063.409 rows=58065 loops=1)
Hash Cond: ("outer".produktid = "inner".id)
-> Bitmap Heap Scan on pwprijs chart_2 (cost=56543.72..213260.22 rows=7691800 width=16) (actual time=2972.869..10299.608 rows=7691260 loops=1)
Recheck Cond: (prijs > 0::double precision)
-> Bitmap Index Scan on pwprijs_recordtimestamp_produktid_prijs_idx (cost=0.00..56543.72 rows=7691800 width=0) (actual time=2962.018..2962.018 rows=7691260 loops=1)
Index Cond: (prijs > 0::double precision)
-> Hash (cost=77813.89..77813.89 rows=8623 width=4) (actual time=41.373..41.373 rows=103 loops=1)
-> SetOp Intersect (cost=77296.51..77727.66 rows=8623 width=4) (actual time=33.770..41.297 rows=103 loops=1)
-> Sort (cost=77296.51..77512.08 rows=86231 width=4) (actual time=33.756..36.971 rows=5863 loops=1)
Sort Key: id
-> Append (cost=12.93..69495.92 rows=86231 width=4) (actual time=0.247..25.379 rows=5863 loops=1)
-> Subquery Scan "*SELECT* 1" (cost=12.93..2858.17 rows=1198 width=4) (actual time=0.246..8.816 rows=1458 loops=1)
-> Nested Loop (cost=12.93..2846.19 rows=1198 width=4) (actual time=0.243..7.092 rows=1458 loops=1)
-> Unique (cost=10.40..10.44 rows=8 width=2) (actual time=0.110..0.135 rows=8 loops=1)
-> Sort (cost=10.40..10.42 rows=8 width=2) (actual time=0.108..0.115 rows=8 loops=1)
Sort Key: "?column?"
-> Append (cost=0.00..10.28 rows=8 width=2) (actual time=0.004..0.092 rows=8 loops=1)
-> Subquery Scan "*SELECT* 1" (cost=0.00..0.02 rows=1 width=0) (actual time=0.004..0.006 rows=1 loops=1)
-> Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.002..0.003 rows=1 loops=1)
-> Subquery Scan "*SELECT* 2" (cost=2.02..10.26 rows=7 width=2) (actual time=0.062..0.077 rows=7 loops=1)
-> Bitmap Heap Scan on cat (cost=2.02..10.19 rows=7 width=2) (actual time=0.059..0.066 rows=7 loops=1)
Recheck Cond: (parentid = 545)
-> Bitmap Index Scan on cat_parentid_idx (cost=0.00..2.02 rows=7 width=0) (actual time=0.042..0.042 rows=7 loops=1)
Index Cond: (parentid = 545)
-> Bitmap Heap Scan on pwprodukten (cost=2.52..352.58 rows=150 width=6) (actual time=0.078..0.658 rows=182 loops=8)
Recheck Cond: (pwprodukten.cat2 = "outer"."?column?")
-> Bitmap Index Scan on pwprodukten_cat_idx (cost=0.00..2.52 rows=150 width=0) (actual time=0.058..0.058 rows=182 loops=8)
Index Cond: (pwprodukten.cat2 = "outer"."?column?")
-> Subquery Scan "*SELECT* 2" (cost=794.53..66637.75 rows=85033 width=4) (actual time=1.075..10.724 rows=4405 loops=1)
-> Bitmap Heap Scan on pwprijs (cost=794.53..65787.42 rows=85033 width=4) (actual time=1.072..5.846 rows=4405 loops=1)
Recheck Cond: ((leverancierid = 938) AND (recordtimestamp >= $0))
InitPlan
-> Index Scan using last_dates_pkey on last_dates (cost=0.00..3.33 rows=1 width=8) (actual time=0.045..0.047 rows=1 loops=1)
Index Cond: ((table_name)::text = 'pricetracker'::text)
-> Bitmap Index Scan on pwprijs_levid_rects_idx (cost=0.00..791.20 rows=85033 width=0) (actual time=1.037..1.037 rows=4405 loops=1)
Index Cond: ((leverancierid = 938) AND (recordtimestamp >= $0))
Total runtime: 15095.402 ms


---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #2 (permalink)  
Old 04-18-2008, 01:18 PM
Richard Huxton
 
Posts: n/a
Default Re: Inefficient queryplan for query with intersectable

Arjen van der Meijden wrote:
>
> I left all the configuration-stuff to the defaults since changing values
> didn't seem to impact much. Apart from the buffers and effective cache,
> increasing those made the performance worse.


I've not looked at the rest of your problem in detail, but using the
default configuration values is certainly not going to help things. In
particular effective_cache is supposed to tell PG how much memory your
OS is using to cache data.

Read this through and make sure your configuration settings are sane,
then it might be worthwhile looking in detail at this particular query.
http://www.powerpostgresql.com/PerfList

--
Richard Huxton
Archonet Ltd

---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #3 (permalink)  
Old 04-18-2008, 01:18 PM
Arjen van der Meijden
 
Posts: n/a
Default Re: Inefficient queryplan for query with intersectable

On 26-8-2005 15:05, Richard Huxton wrote:
> Arjen van der Meijden wrote:
>
>>
>> I left all the configuration-stuff to the defaults since changing
>> values didn't seem to impact much. Apart from the buffers and
>> effective cache, increasing those made the performance worse.

>
>
> I've not looked at the rest of your problem in detail, but using the
> default configuration values is certainly not going to help things. In
> particular effective_cache is supposed to tell PG how much memory your
> OS is using to cache data.
>
> Read this through and make sure your configuration settings are sane,
> then it might be worthwhile looking in detail at this particular query.
> http://www.powerpostgresql.com/PerfList


Thanks for the advice. But as said, I tried such things. Adjusting
shared buffers to 5000, 10000 or 15000 made minor improvements.
But adjusting the effective_cache was indeed very dramatic, to make
matters worse!
Changing the random_page_cost to 2.0 also made it choose another plan,
but still not the fast plan.

The machine has 1GB of memory, so I tested for effective cache size
10000, 20000, 40000, 60000 and 80000. The "600ms"-plan I'm talking about
will not come up with an effective cache set to 60000 or above and for
the lower values there was no improvement in performance over that
already very fast plan.
As said, it chooses sequential scans or "the wrong index plans" over a
perfectly good plan that is just not selected when the parameters are
"too well tuned" or sequential scanning of the table is allowed.

So I'm still looking for a way to get it to use the fast plan, instead
of the slower ones that appear when I tend to tune the database...


---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #4 (permalink)  
Old 04-18-2008, 01:18 PM
Tom Lane
 
Posts: n/a
Default Re: Inefficient queryplan for query with intersectable

Arjen van der Meijden <acmmailing@tweakers.net> writes:
> As said, it chooses sequential scans or "the wrong index plans" over a
> perfectly good plan that is just not selected when the parameters are
> "too well tuned" or sequential scanning of the table is allowed.


I think some part of the problem comes from using inconsistent
datatypes. For instance, it seems very odd that the thing is not
using a hash or something to handle

t_0.Cat2 IN (SELECT 545 UNION SELECT ID FROM cat WHERE ParentID = 545)

seeing that it correctly guesses there are only going to be about 8 rows
in the union. Part of the reason is that cat2 is smallint, whereas the
output of the union must be at least int, maybe wider depending on the
datatype of cat.id (which you did not show us); so the comparison isn't
hashable. Even a smallint vs int comparison would be mergejoinable,
though, so I'm really wondering what cat.id is.

Another big part of the problem comes from poor result size estimation.
I'm not sure you can eliminate that entirely given the multiple
conditions on different columns (which'd require cross-column statistics
to really do well, which we do not have). But you could avoid
constructs like

WHERE ... t_1.recordtimestamp >=
(SELECT max_date - 60 FROM last_dates WHERE table_name = 'pricetracker')

The planner is basically going to throw up its hands and make a default
guess on the selectivity of this; it's not smart enough to decide that
the sub-select probably represents a constant. What I'd do with this
is to define a function marked STABLE for the sub-select result, perhaps
something like

create function get_last_date(tabname text, offsetdays int)
returns timestamp as $$
SELECT max_date - $2 FROM last_dates WHERE table_name = $1
$$ language sql strict stable;

(I'm guessing as to datatypes and the amount of parameterization you
need.) Then write the query like

WHERE ... t_1.recordtimestamp >= get_last_date('pricetracker', 60)

In this formulation the planner will be able to make a reasonable guess
about how many rows will match ... at least if your statistics are up
to date ...

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #5 (permalink)  
Old 04-18-2008, 01:19 PM
Arjen van der Meijden
 
Posts: n/a
Default Re: Inefficient queryplan for query with intersectable



On 27-8-2005 0:56, Tom Lane wrote:
> Arjen van der Meijden <acmmailing@tweakers.net> writes:
>
>>As said, it chooses sequential scans or "the wrong index plans" over a
>>perfectly good plan that is just not selected when the parameters are
>>"too well tuned" or sequential scanning of the table is allowed.

>
>
> I think some part of the problem comes from using inconsistent
> datatypes. For instance, it seems very odd that the thing is not
> using a hash or something to handle
>
> t_0.Cat2 IN (SELECT 545 UNION SELECT ID FROM cat WHERE ParentID = 545)
>
> seeing that it correctly guesses there are only going to be about 8 rows
> in the union. Part of the reason is that cat2 is smallint, whereas the
> output of the union must be at least int, maybe wider depending on the
> datatype of cat.id (which you did not show us); so the comparison isn't
> hashable. Even a smallint vs int comparison would be mergejoinable,
> though, so I'm really wondering what cat.id is.


cat.id is a smallint. I replaced that subquery with these two:
t_0.Cat2 IN (SELECT '545'::smallint UNION SELECT ID FROM cat WHERE
ParentID = '545'::smallint)

t_0.Cat2 IN (SELECT '545' UNION SELECT ID FROM cat WHERE ParentID = '545')

But appareantly there is a bug in the explain mechanism of the 8.1devel
I'm using (I downloaded a nightly 25 august somewhere in the morning
(CEST)), since it returned:
ERROR: bogus varno: 9

So I can't see whether the plan changed, execution times didn't change
much. I also replaced the subselect with the result of that query (like
('545', '546', ...) ) but that didn't seem to make much difference in
the execution time as well. The plan did change of course, it used a
BitmapOr of 8 Bitmap Index Scans over the pwprodukten.

By the way, as far as I know, this is the only datatype mismatch in the
query.

> Another big part of the problem comes from poor result size estimation.
> I'm not sure you can eliminate that entirely given the multiple
> conditions on different columns (which'd require cross-column statistics
> to really do well, which we do not have). But you could avoid
> constructs like
>
> WHERE ... t_1.recordtimestamp >=
> (SELECT max_date - 60 FROM last_dates WHERE table_name = 'pricetracker')
>
> The planner is basically going to throw up its hands and make a default
> guess on the selectivity of this; it's not smart enough to decide that
> the sub-select probably represents a constant. What I'd do with this
> is to define a function marked STABLE for the sub-select result, perhaps
> something like

[...]
> need.) Then write the query like
>
> WHERE ... t_1.recordtimestamp >= get_last_date('pricetracker', 60)
>
> In this formulation the planner will be able to make a reasonable guess
> about how many rows will match ... at least if your statistics are up
> to date ...


I tried such a function and also tried replacing it with the fixed
outcome of that suquery itself. Although it has a considerable more
accurate estimate of the rows returned, it doesn't seem to impact the
basic plan much. It does make the sub-query itself use another index
(the one on the recordtimestamp alone, rather than the combined index on
leverancierid and recordtimestamp).
With that changed subquery it estimates about 4173 rows over 4405 real rows.

Actually with the adjusted or original query, it seems to favor the hash
join over a nested loop, but the rest of the plan (for the subqueries)
seems to be exactly the same.

Here is the first part of the explain analyze when it can do any trick
it wants:
Hash Join (cost=7367.43..186630.19 rows=132426 width=12) (actual
time=191.726..11072.025 rows=58065 loops=1)
Hash Cond: ("outer".produktid = "inner".id)
-> Seq Scan on pwprijs chart_2 (cost=0.00..137491.07 rows=7692207
width=16) (actual time=0.018..6267.744 rows=7692207 loops=1)
-> Hash (cost=7366.02..7366.02 rows=565 width=4) (actual
time=123.265..123.265 rows=103 loops=1)
-> SetOp Intersect (cost=7332.10..7360.37 rows=565 width=4)
(actual time=115.760..123.192 rows=103 loops=1)
[snip]

And here is the first (and last) part when I disable hash joins or seq
scans:
Nested Loop (cost=7334.92..517159.39 rows=132426 width=12) (actual
time=111.905..512.575 rows=58065 loops=1)
-> SetOp Intersect (cost=7332.10..7360.37 rows=565 width=4)
(actual time=111.588..120.035 rows=103 loops=1)
[snip]
-> Bitmap Heap Scan on pwprijs chart_2 (cost=2.82..895.85 rows=234
width=16) (actual time=0.344..2.149 rows=564 loops=103)
Recheck Cond: (chart_2.produktid = "outer".id)
-> Bitmap Index Scan on pwprijs_produktid_idx
(cost=0.00..2.82 rows=234 width=0) (actual time=0.189..0.189 rows=564
loops=103)
Index Cond: (chart_2.produktid = "outer".id)

Is a nested loop normally so much (3x) more costly than a hash join? Or
is it just this query that gets estimated wronly?

Best regards,

Arjen

---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #6 (permalink)  
Old 04-18-2008, 01:19 PM
Tom Lane
 
Posts: n/a
Default Re: Inefficient queryplan for query with intersectable

Arjen van der Meijden <acmmailing@tweakers.net> writes:
> But appareantly there is a bug in the explain mechanism of the 8.1devel
> I'm using (I downloaded a nightly 25 august somewhere in the morning
> (CEST)), since it returned:
> ERROR: bogus varno: 9


Yeah, someone else sent in a test case for this failure (or at least one
with a similar symptom) yesterday. I'll try to fix it today.

> Is a nested loop normally so much (3x) more costly than a hash join? Or
> is it just this query that gets estimated wronly?


There's been some discussion that we are overestimating the cost of
nestloops in general, because we don't take into account that successive
scans of the inner relation are likely to find many pages already in
cache from the earlier scans. So far no one's come up with a good cost
model to use for this, though.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #7 (permalink)  
Old 04-18-2008, 01:19 PM
Ron
 
Posts: n/a
Default Re: Inefficient queryplan for query with

At 10:27 AM 8/27/2005, Tom Lane wrote:
>Arjen van der Meijden <acmmailing@tweakers.net> writes:
> > But appareantly there is a bug in the explain mechanism of the 8.1devel
> > I'm using (I downloaded a nightly 25 august somewhere in the morning
> > (CEST)), since it returned:
> > ERROR: bogus varno: 9

>
>Yeah, someone else sent in a test case for this failure (or at least one
>with a similar symptom) yesterday. I'll try to fix it today.
>
> > Is a nested loop normally so much (3x) more costly than a hash join? Or
> > is it just this query that gets estimated wronly?

>
>There's been some discussion that we are overestimating the cost of
>nestloops in general, because we don't take into account that successive
>scans of the inner relation are likely to find many pages already in
>cache from the earlier scans. So far no one's come up with a good cost
>model to use for this, though.
>
> regards, tom lane

It certainly seems common in the EXPLAIN ANALYZE output I see that
the (estimated) cost of Nested Loop is far higher than the actual
time measured.

What happened when someone tried the naive approach of telling the
planner to estimate the cost of a nested loop based on fitting
whatever entities are involved in the nested loop in RAM as much as
possible? When there are multiple such mappings, use whichever one
results in the lowest cost for the NL in question.

Clearly, this should lead to an underestimate of the cost of the
constant of operation involved, but since nested loops have the only
polynomial growth function of the planner's choices, NL's should
still have a decent chance of being more expensive than other choices
under most circumstances.

In addition, if those costs are based on actual measurements of how
long it takes to do such scans then the estimated cost has a decent
chance of being fairly accurate under such circumstances.

It might not work well, but it seems like a reasonable first attempt
at a solution?
Ron Peacetree



---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #8 (permalink)  
Old 04-18-2008, 01:19 PM
Arjen van der Meijden
 
Posts: n/a
Default Re: Inefficient queryplan for query with intersectable

On 27-8-2005 16:27, Tom Lane wrote:
> Arjen van der Meijden <acmmailing@tweakers.net> writes:
>
>>Is a nested loop normally so much (3x) more costly than a hash join? Or
>>is it just this query that gets estimated wronly?

>
> There's been some discussion that we are overestimating the cost of
> nestloops in general, because we don't take into account that successive
> scans of the inner relation are likely to find many pages already in
> cache from the earlier scans. So far no one's come up with a good cost
> model to use for this, though.


Ah, that explains. I take it, there already is an estimation for the
cost of "the amount of pages that will be loaded for this operation".
For indexed lookups this will probably be something like "the amount of
expected pages to fetch * the random page cost"?

And appareantly for the nested loop its something like "iterations *
amount of pages per iteration * random page cost" ?
The naive approach seems to me, is to just calculate the probable amount
of pages to fetch from disk rather than from cache.

In this case there are 7692207 rows in 60569 pages and on average 234
rows per product (per nested loop) in the estimation. It estimates that
it'll have to do 565 iterations.
In worst case for the first 234 rows, no pages are already cached and
the rows are all in a seperate page. So thats 234 pages to fetch.
In the second iteration, you know already 234 pages are fetched and
that's about 0.386% of the total pages. So the expected amount of pages
for the next 234 pages expected to be in cache is 234 * 0.00386 = 1.
After that you'll have 234 + 233 pages in cache, etc, etc.
Following that approach, the 565th iteration only has to pull in about
27 new pages in the worst case of all records being perfectly scattered
over the pages, not 234.

Of course this has to be adjusted for the amount of available buffers
and cache and the expected amount of pages to fetch for the iterations,
which may be less than 234.

When a fetch of a random page costs 4 and one from cache 0.01, there is
quite a large difference: 565 * (234 * 4) = 530535 vs 215864,93

Actually the likeliness of a page being in cache is a bit higher, since
the expectation increases for each newly fetched page, not for batches
of 234. I didn't use that in my calculation here.

Anyway, this is probably been thought over already and there may be many
flaws in it. If not, please think it over.

Best regards,

Arjen

---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply


Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On
Forum Jump


All times are GMT. The time now is 08:32 AM.


Powered by vBulletin® Version 3.6.5
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
SEO by vBSEO 3.2.0
www.UnixAdminTalk.com