This is a discussion on Help with tuning this query within the Pgsql Performance forums, part of the PostgreSQL category; --> I've tried to use Dan Tow's tuning method and created all the right indexes from his diagraming method, but ...
| |||||||
| FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| I've tried to use Dan Tow's tuning method and created all the right indexes from his diagraming method, but the query still performs quite slow both inside the application and just inside pgadmin III. Can anyone be kind enough to help me tune it so that it performs better in postgres? I don't think it's using the right indexes, or maybe postgres needs special treatment. I've converted the below query to SQL from a Hibernate query, so the syntax is probably not perfect but it's semantics are exactly the same. I've done so by looking at the source code, but I can't run it to get the exact SQL since I don't have the database on my home machine. select s.* from shipment s inner join carrier_code cc on s.carrier_code_id = cc.id inner join carrier c on cc.carrier_id = c.id inner join carrier_to_person ctp on ctp.carrier_id = c.id inner join person p on p.id = ctp.person_id inner join shipment_status cs on s.current_status_id = cs.id inner join release_code rc on cs.release_code_id = rc.id left join shipment_status ss on ss.shipment_id = s.id where p.id = s.is_purged = false and rc.number = '9' and cs is not null and cs.date >= current_date - 31 order by cs.date desc Just assume I have no indexes for the moment because while some of the indexes I made make it work faster, it's still around 250 milliseconds and under heavy load, the query performs very badly (6-7 seconds). For your information: shipment contains 40,000 rows shipment_status contains 80,000 rows release_code contains 8 rows person contains 300 rows carrier contains 60 rows carrier_code contains 70 rows The filter ratios are: rc.number = '9' (0.125) cs.date >= current_date - 31 (.10) p.id = ? (0.003) s.is_purged = false (.98) I really hope someone can help since I'm pretty much stuck. Best regards and many thanks, Ken |
| |||
| Ken Egervari wrote: > I've tried to use Dan Tow's tuning method Who? What? > and created all the right > indexes from his diagraming method, but the query still performs > quite slow both inside the application and just inside pgadmin III. > Can anyone be kind enough to help me tune it so that it performs > better in postgres? I don't think it's using the right indexes, or > maybe postgres needs special treatment. > > I've converted the below query to SQL from a Hibernate query, so the > syntax is probably not perfect but it's semantics are exactly the > same. I've done so by looking at the source code, but I can't run it > to get the exact SQL since I don't have the database on my home > machine. Hibernate is a java thing, no? It'd be helpful to have the actual SQL the hibernate class (or whatever) generates. One of the problems with SQL is that you can have multiple ways to get the same results and it's not always possible for the planner to convert from one to the other. Anyway, people will want to see EXPLAIN ANALYSE for the query in question. Obviously, make sure you've vacuumed and analysed the tables in question recently. Oh, and make sure yousay what version of PG you're running. > select s.* from shipment s inner join carrier_code cc on > s.carrier_code_id = cc.id inner join carrier c on cc.carrier_id = > c.id inner join carrier_to_person ctp on ctp.carrier_id = c.id inner > join person p on p.id = ctp.person_id inner join shipment_status cs > on s.current_status_id = cs.id inner join release_code rc on > cs.release_code_id = rc.id left join shipment_status ss on > ss.shipment_id = s.id where p.id = > and rc.number = '9' and cs is not null and cs.date >= current_date - > 31 order by cs.date desc 1. Why are you quoting the 9 when checking against rc.number? 2. The "cs is not null" doesn't appear to be qualified - which table? > Just assume I have no indexes for the moment because while some of > the indexes I made make it work faster, it's still around 250 > milliseconds and under heavy load, the query performs very badly (6-7 > seconds). 3. If you rewrite the "current_date - 31" as a suitable ago(31) function then you can use an index on cs.date 4. Are you familiar with the configuration setting "join_collapse_limit"? -- Richard Huxton Archonet Ltd ---------------------------(end of broadcast)--------------------------- TIP 8: explain analyze is your friend |
| |||
| Richard Huxton wrote: > Ken Egervari wrote: > >> I've tried to use Dan Tow's tuning method > Who? What? http://www.singingsql.com/ Dan has written some remarkable papers on sql tuning. Some of it is pretty complex, but his book "SQL Tuning" is an excellent resource. -- _______________________________ This e-mail may be privileged and/or confidential, and the sender does not waive any related rights and obligations. Any distribution, use or copying of this e-mail or the information it contains by other than an intended recipient is unauthorized. If you received this e-mail in error, please advise me (by return e-mail or otherwise) immediately. _______________________________ ---------------------------(end of broadcast)--------------------------- TIP 2: you can get off all lists at once with the unregister command (send "unregister YourEmailAddressHere" to majordomo@postgresql.org) |
| |||
| Bricklen Anderson wrote: > Richard Huxton wrote: > > Ken Egervari wrote: > > > >> I've tried to use Dan Tow's tuning method > > Who? What? > > http://www.singingsql.com/ That URL is invalid for me. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073 ---------------------------(end of broadcast)--------------------------- TIP 4: Don't 'kill -9' the postmaster |
| |||
| Ken Egervari wrote: > I've tried to use Dan Tow's tuning method and created all the right > indexes from his diagraming method, but the query still performs quite > slow both inside the application and just inside pgadmin III. Can > anyone be kind enough to help me tune it so that it performs better in > postgres? I don't think it's using the right indexes, or maybe > postgres needs special treatment. > First, what version of postgres, and have you run VACUUM ANALYZE recently? Also, please attach the result of running EXPLAIN ANALYZE. (eg, explain analyze select s.* from shipment ...) It's very possible that you don't have up-to-date statistics, which causes postgres to make a bad estimate of what the fastest plan is. Also, if you are using an older version of postgres (like 7.1) you really should upgrade. There are quite a few performance and real bug fixes. > I've converted the below query to SQL from a Hibernate query, so the > syntax is probably not perfect but it's semantics are exactly the > same. I've done so by looking at the source code, but I can't run it > to get the exact SQL since I don't have the database on my home machine. I don't know how to make Hibernate do what you want, but if you change the query to using subselects (not all databases support this, so hibernate might not let you), you can see a performance improvement. Also sometimes using explicit joins can be worse than just letting the query manager figure it out. So something like select s.* from shipment s, carrier_code cc, carrier c, ... where s.carrier_code_id = cc.id and c.id = cc.carrier_id and .... But again, since this is generated from another program (Hibernate), I really don't know how you tell it how to tune the SQL. Probably the biggest "non-bug" performance improvements are from tuning the SQL. But if postgres isn't using the right indexes, etc, you can probably fix that. John =:-> > > select s.* > from shipment s > inner join carrier_code cc on s.carrier_code_id = cc.id > inner join carrier c on cc.carrier_id = c.id > inner join carrier_to_person ctp on ctp.carrier_id = c.id > inner join person p on p.id = ctp.person_id > inner join shipment_status cs on s.current_status_id = cs.id > inner join release_code rc on cs.release_code_id = rc.id > left join shipment_status ss on ss.shipment_id = s.id > where > p.id = > s.is_purged = false and > rc.number = '9' and > cs is not null and > cs.date >= current_date - 31 > order by cs.date desc > Just assume I have no indexes for the moment because while some of the > indexes I made make it work faster, it's still around 250 milliseconds > and under heavy load, the query performs very badly (6-7 seconds). > > For your information: > > shipment contains 40,000 rows > shipment_status contains 80,000 rows > release_code contains 8 rows > person contains 300 rows > carrier contains 60 rows > carrier_code contains 70 rows > > The filter ratios are: > > rc.number = '9' (0.125) > cs.date >= current_date - 31 (.10) > p.id = ? (0.003) > s.is_purged = false (.98) > > I really hope someone can help since I'm pretty much stuck. > > Best regards and many thanks, > Ken -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.0 (Darwin) Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org iD8DBQFCJfBRJdeBCYSNAAMRAhd8AJ0Za9S7yjBOyJBbrsnC20 QCb10DNwCeKXD7 oZWgu4d7mmEYg2+9iUYzzY0= =5kTN -----END PGP SIGNATURE----- |
| |||
| >First, what version of postgres, and have you run VACUUM ANALYZE recently? >Also, please attach the result of running EXPLAIN ANALYZE. >(eg, explain analyze select s.* from shipment ...) I'm using postgres 8.0. I wish I could paste explain analyze, but I won't be at work for a few days. I was hoping some Postgres/SQL experts here would be able to simply look at the query and make recommendations because it's not a very difficult or unique query. >It's very possible that you don't have up-to-date statistics, which >causes postgres to make a bad estimate of what the fastest plan is. I run VACUUM ANALYZE religiously. I even dumped the production database and used it as my test database after a full vacuum analyze. It's really as fresh as it can be. >I don't know how to make Hibernate do what you want, but if you change >the query to using subselects (not all databases support this, so >hibernate might not let you), you can see a performance improvement. Yes, Hibernate supports sub-selects. In fact, I can even drop down to JDBC explicitly, so whatever SQL tricks out there I can use will work on Hibernate. In what way will sub-selects improve this query? >Also sometimes using explicit joins can be worse than just letting the >query manager figure it out. So something like >select s.* from shipment s, carrier_code cc, carrier c, ... > where s.carrier_code_id = cc.id and c.id = cc.carrier_id and .... I think I can avoid using joins in Hibernate, but it makes the query harder to maintain. How much of a performance benefit are we talking with this change? Since hibernate is an object language, you don't actually have to specify many joins. You can use the "dot" notation. Query query = session.createQuery( "select shipment " + "from Shipment shipment " + " inner join shipment.cargoControlNumber.carrierCode.carrier.pe rsons person " + " inner join shipment.currentStatus currentStatus " + " inner join currentStatus.releaseCode releaseCode " + " left join fetch shipment.currentStatus " + "where " + " person.id = " shipment.isPurged = false and " + " releaseCode.number = '9' and " + " currentStatus is not null and " + " currentStatus.date >= current_date - 31 " + "order by currentStatus.date desc" ); query.setParameter( "personId", personId ); query.setFirstResult( firstResult ); query.setMaxResults( maxResults ); return query.list(); As you can see, it's fairly elegant language and maps to SQL quite well. >But again, since this is generated from another program (Hibernate), I >really don't know how you tell it how to tune the SQL. Probably the >biggest "non-bug" performance improvements are from tuning the SQL. I agree, but the ones I've tried aren't good enough. I have made these indexes that apply to this query as well as others in my from looking at my SQL scripts. Many of my queries have really sped up to 14 milliseconds from these indexes. But I can't make this query run any faster. CREATE INDEX carrier_to_person_person_id_idx ON carrier_to_person USING btree (person_id); CREATE INDEX carrier_to_person_carrier_id_idx ON carrier_to_person USING btree (carrier_id); CREATE INDEX carrier_code_carrier_id_idx ON carrier_code USING btree (carrier_id); CREATE INDEX shipment_carrier_code_id_idx ON shipment USING btree (carrier_code_id); CREATE INDEX current_status_date_idx ON shipment_status USING btree (date); CREATE INDEX shipment_current_status_id_idx ON shipment USING btree (current_status_id); CREATE INDEX shipment_status_shipment_id_idx ON shipment_status USING btree (shipment_id); Thanks for your responses everyone. I'll try and get you that explain analyze. I'm just not at work at the moment but this is a problem that I'm simply puzzled and worried about. I'm getting all of this from CVS on my work server. Ken ---------------------------(end of broadcast)--------------------------- TIP 4: Don't 'kill -9' the postmaster |
| |||
| Ken Egervari wrote: >> First, what version of postgres, and have you run VACUUM ANALYZE >> recently? >> Also, please attach the result of running EXPLAIN ANALYZE. >> (eg, explain analyze select s.* from shipment ...) > > > I'm using postgres 8.0. I wish I could paste explain analyze, but I > won't be at work for a few days. I was hoping some Postgres/SQL > experts here would be able to simply look at the query and make > recommendations because it's not a very difficult or unique query. > That's the problem. Without explain analyze, it's hard to say why it is performing weird, because it *does* look like a straightforward query. >> It's very possible that you don't have up-to-date statistics, which >> causes postgres to make a bad estimate of what the fastest plan is. > > > I run VACUUM ANALYZE religiously. I even dumped the production > database and used it as my test database after a full vacuum analyze. > It's really as fresh as it can be. > Good. Again, this is just the first precaution, as not everyone is as careful as you. And without the explain analyze, you can't tell what the planner estimates are. >> I don't know how to make Hibernate do what you want, but if you change >> the query to using subselects (not all databases support this, so >> hibernate might not let you), you can see a performance improvement. > > > Yes, Hibernate supports sub-selects. In fact, I can even drop down to > JDBC explicitly, so whatever SQL tricks out there I can use will work > on Hibernate. In what way will sub-selects improve this query? > When doing massive joins across multiple tables (as you are doing) it is frequently faster to do a couple of small joins where you only need a couple of rows as input to the rest. Something like: select * from shipment s where s.carrier_code_id in (select cc.id from carrier_code cc join carrier c on cc.carrier_id = c.id) and s.current_status_id in (select cs.id from shipment_status cs where ...) Again it's something that you can try. I have found quite a few of my queries performed much better with subselects. I'm guessing it's because with big queries it has a harder time figuring out how to refactor (the decision tree becomes big). But I'm not really sure. I just know it can work. >> Also sometimes using explicit joins can be worse than just letting the >> query manager figure it out. So something like >> select s.* from shipment s, carrier_code cc, carrier c, ... >> where s.carrier_code_id = cc.id and c.id = cc.carrier_id and .... > > > I think I can avoid using joins in Hibernate, but it makes the query > harder to maintain. How much of a performance benefit are we talking > with this change? Since hibernate is an object language, you don't > actually have to specify many joins. You can use the "dot" notation. > I'm not saying this *will* improve performance. It is just something to try. It very easily could not be worth the overhead. > Query query = session.createQuery( > "select shipment " + > "from Shipment shipment " + > " inner join > shipment.cargoControlNumber.carrierCode.carrier.pe rsons person " + > " inner join shipment.currentStatus currentStatus " + > " inner join currentStatus.releaseCode releaseCode " + > " left join fetch shipment.currentStatus " + > "where " + > " person.id = > " shipment.isPurged = false and " + > " releaseCode.number = '9' and " + > " currentStatus is not null and " + > " currentStatus.date >= current_date - 31 " + > "order by currentStatus.date desc" > ); > > query.setParameter( "personId", personId ); > > query.setFirstResult( firstResult ); > query.setMaxResults( maxResults ); > > return query.list(); > > As you can see, it's fairly elegant language and maps to SQL quite well. > >> But again, since this is generated from another program (Hibernate), I >> really don't know how you tell it how to tune the SQL. Probably the >> biggest "non-bug" performance improvements are from tuning the SQL. > > > I agree, but the ones I've tried aren't good enough. I have made > these indexes that apply to this query as well as others in my from > looking at my SQL scripts. Many of my queries have really sped up to > 14 milliseconds from these indexes. But I can't make this query run > any faster. > > CREATE INDEX carrier_to_person_person_id_idx ON carrier_to_person > USING btree (person_id); > CREATE INDEX carrier_to_person_carrier_id_idx ON carrier_to_person > USING btree (carrier_id); > CREATE INDEX carrier_code_carrier_id_idx ON carrier_code USING btree > (carrier_id); > CREATE INDEX shipment_carrier_code_id_idx ON shipment USING btree > (carrier_code_id); > CREATE INDEX current_status_date_idx ON shipment_status USING btree > (date); > CREATE INDEX shipment_current_status_id_idx ON shipment USING btree > (current_status_id); > CREATE INDEX shipment_status_shipment_id_idx ON shipment_status USING > btree (shipment_id); > > Thanks for your responses everyone. I'll try and get you that explain > analyze. I'm just not at work at the moment but this is a problem > that I'm simply puzzled and worried about. I'm getting all of this > from CVS on my work server. > > Ken There is also the possibility that you are having problems with cross-column correlation, or poor distribution of a column. Postgres doesn't keep cross-column statistics, so if 2 columns are correlated, then it mis-estimates selectivity, and might pick the wrong plan. In general your query looks decent, we just need to figure out what is going on. John =:-> -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.0 (Darwin) Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org iD8DBQFCJfoQJdeBCYSNAAMRAhp/AJoDpwm9b2Gf0Kc2OZgSECt/vujO2QCffekm sqdTxPtZYlpxLzN2sfwTLSQ= =l69T -----END PGP SIGNATURE----- |
| |||
| On Wed, 2005-03-02 at 01:51 -0500, Ken Egervari wrote: > > select s.* > from shipment s > inner join carrier_code cc on s.carrier_code_id = cc.id > inner join carrier c on cc.carrier_id = c.id > inner join carrier_to_person ctp on ctp.carrier_id = c.id > inner join person p on p.id = ctp.person_id > inner join shipment_status cs on s.current_status_id = cs.id > inner join release_code rc on cs.release_code_id = rc.id > left join shipment_status ss on ss.shipment_id = s.id > where > p.id = > s.is_purged = false and > rc.number = '9' and > cs is not null and > cs.date >= current_date - 31 > order by cs.date desc > ... > shipment contains 40,000 rows > shipment_status contains 80,000 rows I may be missing something, but it looks like the second join on shipment_status (the left join) is not adding anything to your results, except more work. ss is not used for output, nor in the where clause, so what is its purpose ? if cs.date has an upper limit, it might be helpful to change the condition to a BETWEEN in any case, i would think you might need an index on shipment(carrier_code_id) shipment(current_status_id) shipment_status(id) gnari ---------------------------(end of broadcast)--------------------------- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq |
| |||
| >> select s.* >> from shipment s >> inner join carrier_code cc on s.carrier_code_id = cc.id >> inner join carrier c on cc.carrier_id = c.id >> inner join carrier_to_person ctp on ctp.carrier_id = c.id >> inner join person p on p.id = ctp.person_id >> inner join shipment_status cs on s.current_status_id = cs.id >> inner join release_code rc on cs.release_code_id = rc.id >> left join shipment_status ss on ss.shipment_id = s.id >> where >> p.id = >> s.is_purged = false and >> rc.number = '9' and >> cs is not null and >> cs.date >= current_date - 31 >> order by cs.date desc >> ... >> shipment contains 40,000 rows >> shipment_status contains 80,000 rows > > I may be missing something, but it looks like the second join > on shipment_status (the left join) is not adding anything to your > results, except more work. ss is not used for output, nor in the where > clause, so what is its purpose ? It does look strange doesn't it? I would think the same thing if it were the first time I looked at it. But rest assured, it's done by design. A shipment relates to many shipment_status rows, but only 1 is the current shipment_status for the shipment. The first does queries on the current status only and doesn't analyze the rest of the related items. The second left join is for eager loading so that I don't have to run a seperate query to fetch the children for each shipment. This really does improve performance because otherwise you'll have to make N+1 queries to the database, and that's just too much overhead. Since I need all the shipment_status children along with the shipment for the domain logic to work on them, I have to load them all. On average, a shipment will have 2 shipment_status rows. So if the query selects 100 shipments, the query returns 200 rows. Hibernate is intelligent enough to map the shipment_status children to the appropriate shipment automatically. > if cs.date has an upper limit, it might be helpful to change the > condition to a BETWEEN Well, I could create an upper limit. It would be the current date. Would adding in this redundant condition improve performance? I've clustered the shipment table so that the dates are together, which has improved performance. I'm not sure adding in this implicit condition will speed up anything, but I will definately try it. > in any case, i would think you might need an index on > shipment(carrier_code_id) > shipment(current_status_id) > shipment_status(id) Unfortunately, I have indexes on all three (Postgres implicitly creates indexes for unique keys). Here are the other 2 that are already created: CREATE INDEX shipment_carrier_code_id_idx ON shipment USING btree (carrier_code_id); CREATE INDEX shipment_current_status_id_idx ON shipment USING btree (current_status_id); So I guess we've been thinking the same thing. Don't get me wrong. These indexes speed up the query from 1.6 seconds to 250 milliseconds. I just need to be around 30 milliseconds. Another idea that had occured to me was trying to force postgres to driver on the person table because that filter ratio is so great compared to everything else, but I do remember looking at the explain days ago and it was one of the last tables being filtered/joined. Is there anyway to force postgres to pick person? The reason I ask is because this would really reduce the number of rows it pulls out from the shipment table. Thanks for comments. I'll try making that date explicit and change the query to use between to see if that does anything. Regards and many thanks, Ken ---------------------------(end of broadcast)--------------------------- TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org |
| ||||
| On Wed, 2005-03-02 at 13:28 -0500, Ken Egervari wrote: > >> select s.* > >> from shipment s > >> inner join carrier_code cc on s.carrier_code_id = cc.id > >> inner join carrier c on cc.carrier_id = c.id > >> inner join carrier_to_person ctp on ctp.carrier_id = c.id > >> inner join person p on p.id = ctp.person_id > >> inner join shipment_status cs on s.current_status_id = cs.id > >> inner join release_code rc on cs.release_code_id = rc.id > >> left join shipment_status ss on ss.shipment_id = s.id > >> where > >> p.id = > >> s.is_purged = false and > >> rc.number = '9' and > >> cs is not null and > >> cs.date >= current_date - 31 > >> order by cs.date desc > > > > I may be missing something, but it looks like the second join > > on shipment_status (the left join) is not adding anything to your > > results, except more work. ss is not used for output, nor in the where > > clause, so what is its purpose ? > ... The second > left join is for eager loading so that I don't have to run a seperate query > to fetch the children for each shipment. This really does improve > performance because otherwise you'll have to make N+1 queries to the > database, and that's just too much overhead. are you saying that you are actually doing a select s.*,ss.* ... ? > > if cs.date has an upper limit, it might be helpful to change the > > condition to a BETWEEN > > Well, I could create an upper limit. It would be the current date. Would > adding in this redundant condition improve performance? it might help the planner estimate better the number of cs rows affected. whether this improves performance depends on whether the best plans are sensitive to this. an EXPLAIN ANALYSE might reduce the guessing. gnari ---------------------------(end of broadcast)--------------------------- TIP 7: don't forget to increase your free space map settings |