vBulletin Search Engine Optimization
| |||||||
| Register | FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| On Mar 5, 2007, at 8:54 PM, Tom Lane wrote: > Hm, the cost for the upper nestloop is way less than you would expect > given that the HASH IN join is going to have to be repeated 100+ > times. > I think this must be due to a very low "join_in_selectivity" estimate > but I'm not sure why you are getting that, especially seeing that the > rowcount estimates aren't far off. Can you show us the pg_stats > rows for symptoms.id and symptom_reports.symptom_id? > Hi Tom, thanks for the response. Here are the pg_stats. I think I understand what the stats say, but I don't know what to conclude from them. plm_stage=# select * from pg_stats where tablename = 'symptoms' and attname = 'id'; schemaname | tablename | attname | null_frac | avg_width | n_distinct | most_common_vals | most_common_freqs | histogram_bounds | correlation ------------+-----------+---------+-----------+----------- +------------+------------------+------------------- +-------------------------------------+------------- public | symptoms | id | 0 | 4 | -1 | | | {1,11,24,34,46,57,71,85,95,106,117} | 0.451606 plm_stage=# select * from pg_stats where tablename = 'symptom_reports' and attname = 'symptom_id'; schemaname | tablename | attname | null_frac | avg_width | n_distinct | most_common_vals | most_common_freqs | histogram_bounds | correlation ------------+-----------------+------------+-----------+----------- +------------+------------------------ +----------------------------------------------------------------------- ---------------+-------------------------------------+------------- public | symptom_reports | symptom_id | 0 | 4 | 80 | {3,2,4,1,5,8,9,7,10,6} | {0.094,0.0933333,0.0933333,0.092,0.0913333,0.09033 33,0.0866667,0.0843333 ,0.084,0.08} | {12,18,24,30,38,44,51,57,91,91,114} | 0.0955925 And Ismo, I followed your suggestion to re-write the SQL more cleanly, and you are right it was faster, so that is certainly a solution. Although I am still curious why my original query slowed down after the vacuum analyze. In any case, here is the explain analyze from the new query. Compare that to the 3441.452 ms of the old query after the analyze (and 134.045 ms before the analyze): plm_stage=# explain analyze SELECT count(distinct s.id) AS count_all FROM symptoms s ,symptom_reports sr,users u WHERE s.id=sr.symptom_id and sr.user_id=u.id and u.disease_id in (1); QUERY PLAN ------------------------------------------------------------------------ ----------------------------------------------------------------- Aggregate (cost=770.05..770.06 rows=1 width=4) (actual time=176.749..176.751 rows=1 loops=1) -> Hash Join (cost=89.43..737.50 rows=13020 width=4) (actual time=7.762..142.063 rows=13038 loops=1) Hash Cond: ("outer".symptom_id = "inner".id) -> Hash Join (cost=86.09..538.86 rows=13020 width=4) (actual time=7.277..89.293 rows=13038 loops=1) Hash Cond: ("outer".user_id = "inner".id) -> Seq Scan on symptom_reports sr (cost=0.00..257.38 rows=13038 width=8) (actual time=0.003..30.499 rows=13038 loops=1) -> Hash (cost=82.41..82.41 rows=1471 width=4) (actual time=7.261..7.261 rows=1471 loops=1) -> Seq Scan on users u (cost=0.00..82.41 rows=1471 width=4) (actual time=0.006..4.133 rows=1471 loops=1) Filter: (disease_id = 1) -> Hash (cost=3.07..3.07 rows=107 width=4) (actual time=0.469..0.469 rows=107 loops=1) -> Seq Scan on symptoms s (cost=0.00..3.07 rows=107 width=4) (actual time=0.007..0.247 rows=107 loops=1) Total runtime: 176.842 ms (12 rows) ---------------------------(end of broadcast)--------------------------- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq |
| |||
| Jeff Cole <cole.jeff@gmail.com> writes: > Hi Tom, thanks for the response. Here are the pg_stats. I think I > understand what the stats say, but I don't know what to conclude from > them. OK, the symptom_id row claims there are only 80 distinct values of symptom_id in symptom_reports. This is a bit low (looks like the true state of affairs is that all but 2 of the 108 entries of symptoms are represented in symptom_reports), but it's not horridly off considering that you're using the rather low default statistics_target. What happens is that the planner expects that on average only 80 rows of the inner join will need to be scanned to find a match for a given symptoms.id, and this makes the nestloop look cheap. However, per your previous EXPLAIN ANALYZE: > -> Nested Loop IN Join (cost=149.05..586.26 rows=85 width=0) (actual time=54.517..3441.115 rows=106 loops=1) > Join Filter: ("outer".id = "inner".symptom_id) > -> Seq Scan on symptoms (cost=0.00..3.08 rows=108 width=4) (actual time=0.007..0.273 rows=108 loops=1) > -> Hash IN Join (cost=149.05..603.90 rows=13074 width=4) (actual time=0.078..24.503 rows=3773 loops=108) the *actual* average number of rows scanned is 3773. I'm not sure why this should be --- is it possible that the distribution of keys in symptom_reports is wildly uneven? This could happen if all of the physically earlier rows in symptom_reports contain the same small set of symptom_ids, but the stats don't seem to indicate such a skew. regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match |
| |||
| On Mar 6, 2007, at 11:40 AM, Tom Lane wrote: > the *actual* average number of rows scanned is 3773. I'm not sure why > this should be --- is it possible that the distribution of keys in > symptom_reports is wildly uneven? This could happen if all of the > physically earlier rows in symptom_reports contain the same small set > of symptom_ids, but the stats don't seem to indicate such a skew. Hi Tom, you are correct, the distribution is uneven... In the 13k symptom_reports rows, there are 105 distinct symptom_ids. But the first 8k symptom_reports rows only have 10 distinct symptom_ids. Could this cause the problem and would there be anything I could do to address it? Thanks for all your help, I appreciate it. -Jeff ---------------------------(end of broadcast)--------------------------- TIP 4: Have you searched our list archives? http://archives.postgresql.org |
| |||
| Jeff Cole <cole.jeff@gmail.com> writes: > Hi Tom, you are correct, the distribution is uneven... In the 13k > symptom_reports rows, there are 105 distinct symptom_ids. But the > first 8k symptom_reports rows only have 10 distinct symptom_ids. > Could this cause the problem and would there be anything I could do > to address it? Ah-hah, yeah, that explains it. Is it worth your time to deliberately randomize the order of the rows in symptom_reports? It wasn't clear whether this query is actually something you need to optimize. You might have other queries that benefit from the rows being in nonrandom order, so I'm not entirely sure that this is a good thing to do ... regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 5: don't forget to increase your free space map settings |
| ||||
| Hi Tom, thanks for the great job getting to the core of this problem... I would say I'm not sure I want randomize the rows (not really even sure how to do it without truncating the table and re- adding the records in a random order). I think for the moment I will either a) re-write the query per Ismo's suggestion, or b) wait until more data comes into that table, potentially kicking the query planner into not using the Nested Loop anymore. Anyway, thanks again, I appreciate it... -Jeff On Mar 7, 2007, at 11:37 AM, Tom Lane wrote: > Jeff Cole <cole.jeff@gmail.com> writes: >> Hi Tom, you are correct, the distribution is uneven... In the 13k >> symptom_reports rows, there are 105 distinct symptom_ids. But the >> first 8k symptom_reports rows only have 10 distinct symptom_ids. >> Could this cause the problem and would there be anything I could do >> to address it? > > Ah-hah, yeah, that explains it. Is it worth your time to deliberately > randomize the order of the rows in symptom_reports? It wasn't clear > whether this query is actually something you need to optimize. You > might have other queries that benefit from the rows being in nonrandom > order, so I'm not entirely sure that this is a good thing to do ... > > regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 5: don't forget to increase your free space map settings |
| Thread Tools | |
| Display Modes | |
|
|