This is a discussion on Sequential scan on FK join within the Pgsql Performance forums, part of the PostgreSQL category; --> All, I can see why the query below is slow. The lead table is 34 million rows, and a ...
| |||||||
| Register | FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| All, I can see why the query below is slow. The lead table is 34 million rows, and a sequential scan always takes 3+ minutes. Mailing_id is the PK for mailing and is constrained as a foreign key (NULLS allowed) in lead. There is an index on lead.mailing_id. I've just run VACUUM ANALYZE on lead. I don't understand why it isn't being used. Thanks for your help, Martin Nickel SELECT m.mailcode, l.lead_id FROM mailing m INNER JOIN lead l ON m.mailing_id = l.mailing_id WHERE (m.maildate >= '2005-7-01'::date AND m.maildate < '2005-8-01'::date) -- takes 510,145 ms EXPLAIN SELECT m.mailcode, l.lead_id FROM mailing m INNER JOIN lead l ON m.mailing_id = l.mailing_id WHERE (m.maildate >= '2005-7-01'::date AND m.maildate < '2005-8-01'::date) Hash Join (cost=62.13..2001702.55 rows=2711552 width=20) Hash Cond: ("outer".mailing_id = "inner".mailing_id) -> Seq Scan on lead l (cost=0.00..1804198.60 rows=34065260 width=8) -> Hash (cost=61.22..61.22 rows=362 width=20) -> Index Scan using mailing_maildate_idx on mailing m (cost=0.00..61.22 rows=362 width=20) Index Cond: ((maildate >= '2005-07-01'::date) AND (maildate < '2005-08-01'::date)) |
| |||
| Here's an EXPLAIN ANALYZE so you can see the actuals: "Hash Join (cost=62.13..2001702.55 rows=2711552 width=20) (actual time=40.659..244709.315 rows=2125270 loops=1)" " Hash Cond: ("outer".mailing_id = "inner".mailing_id)" " -> Seq Scan on lead l (cost=0.00..1804198.60 rows=34065260 width=8) (actual time=8.621..180281.094 rows=34060373 loops=1)" " -> Hash (cost=61.22..61.22 rows=362 width=20) (actual time=28.718..28.718 rows=0 loops=1)" " -> Index Scan using mailing_maildate_idx on mailing m (cost=0.00..61.22 rows=362 width=20) (actual time=16.571..27.793 rows=430 loops=1)" " Index Cond: ((maildate >= '2005-07-01'::date) AND (maildate < '2005-08-01'::date))" "Total runtime: 248104.339 ms" On Wed, 12 Oct 2005 15:40:24 -0500, Martin Nickel wrote: > All, > > I can see why the query below is slow. The lead table is 34 million rows, > and a sequential scan always takes 3+ minutes. Mailing_id is the PK for > mailing and is constrained as a foreign key (NULLS allowed) in lead. > There is an index on lead.mailing_id. I've just run VACUUM ANALYZE on > lead. I don't understand why it isn't being used. > > Thanks for your help, > Martin Nickel > > SELECT m.mailcode, l.lead_id > FROM mailing m > INNER JOIN lead l ON m.mailing_id = l.mailing_id > WHERE (m.maildate >= '2005-7-01'::date > AND m.maildate < '2005-8-01'::date) > -- takes 510,145 ms > > EXPLAIN SELECT m.mailcode, l.lead_id > FROM mailing m > INNER JOIN lead l ON m.mailing_id = l.mailing_id > WHERE (m.maildate >= '2005-7-01'::date > AND m.maildate < '2005-8-01'::date) > > Hash Join (cost=62.13..2001702.55 rows=2711552 width=20) > Hash Cond: ("outer".mailing_id = "inner".mailing_id) > -> Seq Scan on lead l (cost=0.00..1804198.60 rows=34065260 width=8) > -> Hash (cost=61.22..61.22 rows=362 width=20) > -> Index Scan using mailing_maildate_idx on mailing m (cost=0.00..61.22 rows=362 width=20) > Index Cond: ((maildate >= '2005-07-01'::date) AND (maildate < '2005-08-01'::date)) |
| |||
| .... and if I turn off enable_seqscan attempting to force an indexed join, the execution time goes up. This makes no sense at all to me. SELECT m.mailcode, l.lead_id FROM mailing m INNER JOIN lead l ON m.mailing_id = l.mailing_id WHERE (m.maildate >= '2005-7-01'::date AND m.maildate < '2005-8-01'::date) Nested Loop (cost=0.00..15995420.56 rows=2711552 width=20) -> Index Scan using mailing_maildate_idx on mailing m (cost=0.00..61.22 rows=362 width=20) Index Cond: ((maildate >= '2005-07-01'::date) AND (maildate < '2005-08-01'::date)) -> Index Scan using lead_mailing_fk_idx2 on lead l (cost=0.00..43870.96 rows=25209 width=8) Index Cond: ("outer".mailing_id = l.mailing_id) |
| |||
| Does anyone have any ideas on this? Right now I just have to say "Postgres doesn't work". The query is running on a dedicated dual-Opteron server with 4G memory and 3Ware SATA raid and runs in 2 minutes (best case, no activity on the server). The same query on a wimpy single-P4 MS SQL Server box runs in 1/3 the time. There is some kind of an issue here. On Wed, 12 Oct 2005 16:16:17 -0500, Martin Nickel wrote: > ... and if I turn off enable_seqscan attempting to force an indexed join, > the execution time goes up. This makes no sense at all to me. > > SELECT m.mailcode, l.lead_id > FROM mailing m > INNER JOIN lead l ON m.mailing_id = l.mailing_id WHERE (m.maildate >= > '2005-7-01'::date > AND m.maildate < '2005-8-01'::date) > > Nested Loop (cost=0.00..15995420.56 rows=2711552 width=20) > -> Index Scan using mailing_maildate_idx on mailing m > (cost=0.00..61.22 rows=362 width=20) > Index Cond: ((maildate >= '2005-07-01'::date) > AND (maildate < '2005-08-01'::date)) > -> Index Scan using lead_mailing_fk_idx2 on lead l > (cost=0.00..43870.96 rows=25209 width=8) > Index Cond: ("outer".mailing_id = l.mailing_id) |
| |||
| Martin Nickel wrote: > EXPLAIN SELECT m.mailcode, l.lead_id > FROM mailing m > INNER JOIN lead l ON m.mailing_id = l.mailing_id > WHERE (m.maildate >= '2005-7-01'::date > AND m.maildate < '2005-8-01'::date) > > Hash Join (cost=62.13..2001702.55 rows=2711552 width=20) > Hash Cond: ("outer".mailing_id = "inner".mailing_id) > -> Seq Scan on lead l (cost=0.00..1804198.60 rows=34065260 width=8) > -> Hash (cost=61.22..61.22 rows=362 width=20) > -> Index Scan using mailing_maildate_idx on mailing m (cost=0.00..61.22 rows=362 width=20) > Index Cond: ((maildate >= '2005-07-01'::date) AND (maildate < '2005-08-01'::date)) Well the reason *why* is that the planner expects 2.71 million rows to be matched. If that was the case, then a seq-scan of 34 million rows might well make sense. The output from EXPLAIN ANALYSE would show us whether that estimate is correct - is it? -- Richard Huxton Archonet Ltd ---------------------------(end of broadcast)--------------------------- TIP 4: Have you searched our list archives? http://archives.postgresql.org |
| |||
| Subject: Re: Sequential scan on FK join From: Martin Nickel <martin@portant.com> Newsgroups: pgsql.performance Date: Wed, 12 Oct 2005 15:53:35 -0500 Richard, here's the EXPLAIN ANALYZE. I see your point re: the 2.7M expected vs the 2 actual, but I've run ANALYZE on the lead table and it hasn't changed the plan. Suggestions? "Hash Join (cost=62.13..2001702.55 rows=2711552 width=20) (actual time=40.659..244709.315 rows=2 125270 loops=1)" " Hash Cond: ("outer".mailing_id = "inner".mailing_id)" " -> Seq Scan on lead l (cost=0.00..1804198.60 rows=34065260 width=8) (actual time=8.621..180281.094 rows=34060373 loops=1)" " -> Hash (cost=61.22..61.22 rows=362 width=20) (actual time=28.718..28.718 rows=0 loops=1)" " -> Index Scan using mailing_maildate_idx on mailing m (cost=0.00..61.22 rows=362 width=20) (actual time=16.571..27.793 rows=430 loops=1)" " Index Cond: ((maildate >= '2005-07-01'::date) AND (maildate < '2005-08-01'::date))" "Total runtime: 248104.339 ms" |
| |||
| Martin Nickel wrote: > Subject: Re: Sequential scan on FK join > From: Martin Nickel <martin@portant.com> > Newsgroups: pgsql.performance > Date: Wed, 12 Oct 2005 15:53:35 -0500 > > Richard, here's the EXPLAIN ANALYZE. I see your point re: the 2.7M > expected vs the 2 actual, but I've run ANALYZE on the lead table and it > hasn't changed the plan. Suggestions? > > Hash Join (cost=62.13..2001702.55 rows=2711552 width=20) > (actual time=40.659..244709.315 rows=2 125270 loops=1) ^^^ Hmm - is that not just a formatting gap there? Is it not 2,125,270 rows matching which would suggest PG is getting it more right than wrong. Try issuing "SET enable_seqscan=false" before running the explain analyse - that will force the planner to use any indexes it can find and should show us whether the index would help. -- Richard Huxton Archonet Ltd ---------------------------(end of broadcast)--------------------------- TIP 4: Have you searched our list archives? http://archives.postgresql.org |
| |||
| When I turn of seqscan it does use the index - and it runs 20 to 30% longer. Based on that, the planner is correctly choosing a sequential scan - but that's just hard for me to comprehend. I'm joining on an int4 key, 2048 per index page - I guess that's a lot of reads - then the data -page reads. Still, the 8-minute query time seems excessive. On Mon, 17 Oct 2005 18:45:38 +0100, Richard Huxton wrote: > Martin Nickel wrote: >> Subject: Re: Sequential scan on FK join From: Martin Nickel >> <martin@portant.com> Newsgroups: pgsql.performance >> Date: Wed, 12 Oct 2005 15:53:35 -0500 >> >> Richard, here's the EXPLAIN ANALYZE. I see your point re: the 2.7M >> expected vs the 2 actual, but I've run ANALYZE on the lead table and it >> hasn't changed the plan. Suggestions? >> >> Hash Join (cost=62.13..2001702.55 rows=2711552 width=20) (actual >> time=40.659..244709.315 rows=2 125270 loops=1) > ^^^ > Hmm - is that not just a formatting gap there? Is it not 2,125,270 rows > matching which would suggest PG is getting it more right than wrong. > > Try issuing "SET enable_seqscan=false" before running the explain analyse > - that will force the planner to use any indexes it can find and should > show us whether the index would help. -- > Richard Huxton > Archonet Ltd > > ---------------------------(end of broadcast)--------------------------- > TIP 4: Have you searched our list archives? > > http://archives.postgresql.org |
| |||
| Martin Nickel wrote: > When I turn of seqscan it does use the index - and it runs 20 to 30% > longer. Based on that, the planner is correctly choosing a sequential > scan - but that's just hard for me to comprehend. I'm joining on an int4 > key, 2048 per index page - I guess that's a lot of reads - then the data > -page reads. Still, the 8-minute query time seems excessive. You'll be getting (many) fewer than 2048 index entries per page. There's a page header and various pointers involved too, and index pages aren't going to be full. So - it needs to search the table on dates, fetch the id's and then assemble them for the hash join. Of course, if you have too many to join then all this will spill to disk slowing you further. Now, you'd rather get down below 8 minutes. There are a number of options: 1. Make sure your disk i/o is being pushed to its limit 2. Look into increasing the sort memory for this one query "set work_mem..." (see the runtime configuration section of the manual) 3. Actually - are you happy that your general configuration is OK? 4. Perhaps use a cursor - I'm guessing you want to process these mailings in some way and only want them one at a time in any case. 5. Try the query one day at a time and see if the balance tips the other way - you'll be dealing with substantially less data per query which might match your system better. Of course, this may not be practical for your applicaton. 6. If your lead table is updated only rarely, you could try a CLUSTER on the table by mailing_id - that should speed the scan. Read the manual for the cluster command first though. -- Richard Huxton Archonet Ltd ---------------------------(end of broadcast)--------------------------- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq |
| ||||
| On Tue, 18 Oct 2005 08:52:15 +0100, Richard Huxton wrote: > Martin Nickel wrote: >> When I turn of seqscan it does use the index - and it runs 20 to 30% >> longer. Based on that, the planner is correctly choosing a sequential >> scan - but that's just hard for me to comprehend. I'm joining on an >> int4 key, 2048 per index page - I guess that's a lot of reads - then the >> data -page reads. Still, the 8-minute query time seems excessive. > > You'll be getting (many) fewer than 2048 index entries per page. There's a > page header and various pointers involved too, and index pages aren't > going to be full. So - it needs to search the table on dates, fetch the > id's and then assemble them for the hash join. Of course, if you have too > many to join then all this will spill to disk slowing you further. > > Now, you'd rather get down below 8 minutes. There are a number of options: > 1. Make sure your disk i/o is being pushed to its limit We are completely peaked out on disk io. iostat frequently shows 60% iowait time. This is quite an issue for us and I don't have any great ideas. Data is on a 3ware sata raid at raid 10 across 4 disks. I can barely even run vacuums on our largest table (lead) since it runs for a day and a half and kills our online performance while running. > 2. Look into increasing the sort memory for this one query "set > work_mem..." (see the runtime configuration section of the manual) I haven't tried this, and I will. Thanks for the idea. > 3. Actually - are you happy that your general configuration is OK? I'm not at all. Most of the configuration changes I've tried have made almost no discernable difference. I'll post the relevant numbers in a different post - possibly you'll have some suggestions. > 4. Perhaps use a cursor - I'm guessing you want to process these > mailings in some way and only want them one at a time in any case. Where this first came up was in trying to get aggregate totals per mailing. I gave up on that and created a nightly job to create a summary table since Postgres wasn't up to the job in real time. Still, I frequently need to do the join and limit it by other criteria - and it is incredibly slow - even when the result set is smallish. > 5. Try the query one day at a time and see if the balance tips the > other way - you'll be dealing with substantially less data per query > which might match your system better. Of course, this may not be > practical for your applicaton. It is not useful. > 6. If your lead table is updated only rarely, you could try a CLUSTER > on the table by mailing_id - that should speed the scan. Read the manual > for the cluster command first though. The lead table is one of the most volatle in our system. Each day we insert tens or hundreds of thousands of rows, update almost that many, and delete a few. It is growing, and could reach 100 million rows in 8 or 9 months. We're redesigning the data structure a little so lead is not updated (updates are just too slow), but it will continue to have inserts and deletes, and we'll have to join it with the associated table being updated, which already promises to be a slow operation. We're looking at 15K rpm scsi drives for a replacement raid array. We are getting the place where it may be cheaper to convert to Oracle or DB2 than to try and make Posgres work. |