Unix Technical Forum

Query Optimization

This is a discussion on Query Optimization within the Pgsql Performance forums, part of the PostgreSQL category; --> All, I hope that this is the right place to post. I am relatively new to PostgreSQL (i.e., < ...


Go Back   Unix Technical Forum > Database Server Software > PostgreSQL > Pgsql Performance

Register FAQ Members List Calendar Search Today's Posts Mark Forums Read
  #1 (permalink)  
Old 04-18-2008, 11:14 AM
James G Wilkinson
 
Posts: n/a
Default Query Optimization

All,

I hope that this is the right place to post. I am relatively new to
PostgreSQL (i.e., < 1 year in coding) and am just starting to
delve into the issues of query optimization. I have hunted around
the web for the basics of query optimization, but I have not had
much success in interpreting the documents. I have also been
trying to learn the basics of the EXPLAIN command....also without
much success, but I will keep trying.

Anyway, here is what the system reports on the following command:

EXPLAIN SELECT a.country_code, a.state_county_fips,
icell, jcell, a.beld3_species_id, pollutant_code,
SUM(b.ratio * d.emissions_factor * a.percent_ag *
e.ag_fraction * 10000) as normalized_emissions
FROM "globals"."biogenic_beld3_data" a,
"spatial"."tmpgrid" b,
"globals"."biogenic_emissions_factors" d,
"globals"."biogenic_beld3_ag_data" e
WHERE a.beld3_icell=b.b_icell AND
a.beld3_jcell=b.b_jcell AND
a.country_code=e.country_code AND
a.state_county_fips=e.state_county_fips AND
a.beld3_species_id=d.beld3_species_id AND
a.ag_forest_records > 0 AND
a.percent_ag > 0 AND d.emissions_factor > 0
GROUP BY a.country_code, a.state_county_fips, icell, jcell,
a.beld3_species_id, pollutant_code
ORDER BY a.country_code, a.state_county_fips, icell, jcell,
a.beld3_species_id, pollutant_code;

QUERY
PLAN
-----------------------------------------------------------------------------------------------
GroupAggregate (cost=65034.94..71110.50 rows=151889 width=73)
->Sort (cost=65034.94..65414.66 rows=151889 width=73)
Sort Key: a.country_code, a.state_county_fips, b.icell, b.jcell,
a.beld3_species_id, d.pollutant_code
->Hash Join (cost=33749.64..37412.88 rows=151889 width=73)
Hash Cond: ("outer".beld3_species_id = "inner".beld3_species_id)
->Merge Join (cost=33728.84..35303.61 rows=37972 width=56)
Merge Cond: ((("outer".country_code)::text =
"inner"."?column8?") AND
(("outer".state_county_fips)::text =
"inner"."?column9?"))
->Index Scan using biogenic_beld3_ag_data_pk on
biogenic_beld3_ag_data e
(cost=0.00..806.68 rows=20701 width=26)
->Sort (cost=33728.84..33741.67 rows=5131 width=45)
Sort Key: (a.country_code)::text,
(a.state_county_fips)::text
->Nested Loop (cost=0.00..33412.65 rows=5131 width=45)
->Seq Scan on biogenic_beld3_data a
(cost=0.00..3593.02 rows=5637 width=37)
Filter: ((ag_forest_records > 0) AND (percent_ag
> 0::numeric))

->Index Scan using tmpgrid_pk on tmpgrid b
(cost=0.00..5.27 rows=1 width=24)
Index Cond: ((b.b_icell = "outer".beld3_icell) AND
(b.b_jcell = "outer".beld3_jcell))
->Hash (cost=18.50..18.50 rows=920 width=21)
->Seq Scan on biogenic_emissions_factors d
(cost=0.00..18.50 rows=920 width=21)
Filter: (emissions_factor > 0::numeric)
(18 rows)


Firstly, I am frankly mystified on how to interpret all this. If anyone
could point me to a document or two that will help me decipher this,
I will greatly appreciate it.

Secondly, I have figured out that SEQ SCANs are typically bad. I am
concerned that a SEQ SCAN is being performed on 'biogenic_beld3_data'
which is the largest table in the query. I would rather have a SEQ SCAN
be performed on 'tmpgrid' which contains the keys that subset the data
from 'biogenic_beld3_data.' Is this naive on my part?

Thirdly, I have run EXPLAIN on other queries that report back a
GroupAggregate Cost=<low 300,000s> that runs in about 30 minutes
on my relatively highend linux machine. But when I run this particular
query, it takes on the order of 90 minutes to complete. Any thoughts
on why this happens will be appreciated.

Finally, if anyone can be so kind as to provide insight on how to better
optimize this query, I will, again, be deeply grateful.

Thanks in advance.

terrakit

---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #2 (permalink)  
Old 04-18-2008, 11:14 AM
John A Meinel
 
Posts: n/a
Default Re: Query Optimization

James G Wilkinson wrote:

> All,
>

....

> Firstly, I am frankly mystified on how to interpret all this. If anyone
> could point me to a document or two that will help me decipher this,
> I will greatly appreciate it.
>

I assume you have looked at:
http://www.postgresql.org/docs/8.0/s...ance-tips.html
And didn't find it helpful enough. I'm not really sure what help you are
asking. Are you saying that this query is performing slowly and you want
to speed it up? Or you just want to understand how to interpret the
output of explain?

> Secondly, I have figured out that SEQ SCANs are typically bad. I am
> concerned that a SEQ SCAN is being performed on 'biogenic_beld3_data'
> which is the largest table in the query. I would rather have a SEQ SCAN
> be performed on 'tmpgrid' which contains the keys that subset the data
> from 'biogenic_beld3_data.' Is this naive on my part?


It depends how much data is being extracted. If you have 1,000,000 rows,
and only need 10, then an index scan is wonderful. If you need 999,999,
then a sequential scan is much better (the break even point is <10%)
From the explain, it thinks it is going to be needing 5,637 rows from
biogenic_beld3_data, what is that portion relative to the total?

The values at least look like you've run vacuum analyze. Have you tried
running "explain analyze" instead of just explain? Then you can see if
the planners estimates are accurate.

If you want some help to force it, you could try a subselect query.
Something like:

select * from biogenic_beld3_data b where b.beld3_icell = (select
b_icell from tmpgrid_pk) and b.beld3_jcell = (select b_jcell from
tmpgrid_pk);

>
> Thirdly, I have run EXPLAIN on other queries that report back a
> GroupAggregate Cost=<low 300,000s> that runs in about 30 minutes
> on my relatively highend linux machine. But when I run this particular
> query, it takes on the order of 90 minutes to complete. Any thoughts
> on why this happens will be appreciated.
>

Remember cost is in terms of page fetches, not in seconds.
Probably it is just an issue of postgres mis-estimating the selectivity
of one of your queries.
Also, you have a fairly complex SUM occurring involving 4
multiplications on an estimated 150,000 rows. While doesn't seem like it
should take 90 minutes, it also isn't a trivial operation.

> Finally, if anyone can be so kind as to provide insight on how to better
> optimize this query, I will, again, be deeply grateful.
>
> Thanks in advance.
>
> terrakit
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Don't 'kill -9' the postmaster


John
=:->


-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.6 (Cygwin)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFCLlV9JdeBCYSNAAMRAoSyAKClXsPxacgTYhjhAzUiik 5uIE4AJgCgy6FO
+UzLbTuP7opLvnAKsKSO/ww=
=8cBI
-----END PGP SIGNATURE-----

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply


Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On
Forum Jump


All times are GMT. The time now is 10:14 PM.


Powered by vBulletin® Version 3.6.5
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
SEO by vBSEO 3.2.0
www.UnixAdminTalk.com