This is a discussion on Re: Updates on large tables are extremely slow within the Pgsql Performance forums, part of the PostgreSQL category; --> Yves Vindevogel wrote: > > I'm trying to update a table that has about 600.000 records. > The update ...
| |||||||
| Register | FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| Yves Vindevogel wrote: > > I'm trying to update a table that has about 600.000 records. > The update query is very simple : update mytable set pagesdesc = - pages ; > > The query takes about half an hour to an hour to execute. I have tried a > lot of things. > Half an hour seem a bit long - I would expect less than 5 minutes on reasonable hardware. You may have dead tuple bloat - can you post the output of 'ANALYZE VERBOSE mytable' ? Cheers Mark ---------------------------(end of broadcast)--------------------------- TIP 6: Have you searched our list archives? http://archives.postgresql.org |
| |||
| Apologies - I should have said output of 'VACUUM VERBOSE mytable'. (been using 8.1, which displays dead tuple info in ANALYZE...). Mark Yves Vindevogel wrote: > rvponp=# analyze verbose tblPrintjobs ; > INFO: analyzing "public.tblprintjobs" > INFO: "tblprintjobs": 19076 pages, 3000 rows sampled, 588209 estimated > total rows > ANALYZE > > > On 13 Jun 2005, at 04:43, Mark Kirkwood wrote: > > Yves Vindevogel wrote: > > I'm trying to update a table that has about 600.000 records. > The update query is very simple : update mytable set pagesdesc = > - pages ; > The query takes about half an hour to an hour to execute. I have > tried a lot of things. > > > Half an hour seem a bit long - I would expect less than 5 minutes on > reasonable hardware. > > You may have dead tuple bloat - can you post the output of 'ANALYZE > VERBOSE mytable' ? ---------------------------(end of broadcast)--------------------------- TIP 8: explain analyze is your friend |
| |||
| Hi there I have a query (please refer to http://213.173.234.215:8080/get_content_plan.htm for the query as well as query plan) that is slow when it's run the first time and fast(ish) on all successive runs within a reasonable time period. That is, if the query is not run for like 30 min, execution time returns to the initial time. This leads me to suspect that when the query is first run, all used data have to be fetched from the disk where as once it has been run all data is available in the OS's disk cache. Comparing the execution times we're talking roughly a factor 35 in time difference, thus optimization would be handy. Is there anway to either enhance the chance that the data can be found in the disk cache or allowing the database to fetch the data faster? Is this what the CLUSTER command is for, if so, which tables would I need to cluster? Or is my only option to de-normalize the table structure around this query to speed it up? Furthermore, it seems the database spends the majority of its time in the loop marked with italic in the initial plan, any idea what it spends its time on there? Database is PG 7.3.9 on RH ES 3.0, with Dual XEON 1.9GHz processors and 2GB of RAM. effective_cache_size = 100k shared_buffers = 14k random_page_cost = 3 default_statistics_target = 50 VACUUM ANALYZE runs every few hours, so statistics should be up to date. Appreciate any input here. Cheers Jona ---------------------------(end of broadcast)--------------------------- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match |
| |||
| Yves Vindevogel <yves.vindevogel@implements.be> writes: > rvponp=3D# vacuum verbose tblPrintjobs ; > INFO: vacuuming "public.tblprintjobs" > [ twenty-one different indexes on one table ] Well, there's your problem. You think updating all those indexes is free? It's *expensive*. Heed the manual's advice: avoid creating indexes you are not certain you need for identifiable commonly-used queries. (The reason delete is fast is it doesn't have to touch the indexes ... the necessary work is left to be done by VACUUM.) regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 7: don't forget to increase your free space map settings |
| |||
| Jona <jonanews@oismail.com> writes: > I have a query (please refer to > http://213.173.234.215:8080/get_content_plan.htm for the query as well > as query plan) that is slow when it's run the first time and fast(ish) > on all successive runs within a reasonable time period. > This leads me to suspect that when the query is first run, all used data > have to be fetched from the disk where as once it has been run all data > is available in the OS's disk cache. Sounds like that to me too. > Is there anway to either enhance the chance that the data can be found > in the disk cache or allowing the database to fetch the data faster? Run the query more often? Also, that pile of INNER JOINs is forcing a probably-bad join order; you need to think carefully about the order you want things joined in, or else convert the query to non-JOIN syntax. See the "Performance Tips" chapter of the manual. regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 4: Don't 'kill -9' the postmaster |
| ||||
| Thank you for the response Tom, I bet you get a lot of mails with "trivial" solutions (mine likely being one of them) I for one however truly appreciate you taking the time to answer them. >Run the query more often? > > The query is dynamically constructed from user input, although the total number of different queries that can be run is limited (around 10k different combinations I suspect) it seems rather pointless to run all of them (or even the most common) more often just to keep the data in the disk cache. Is there a way to make the data more accessible on the disk? >Also, that pile of INNER JOINs is forcing a probably-bad join order; >you need to think carefully about the order you want things joined in, >or else convert the query to non-JOIN syntax. See the "Performance >Tips" chapter of the manual. > > You're probably right here, the join order must be bad though it just flattening the join and letting the planner decide on what would be best makes the plan change for every execution. Have query cost variering from from 1350 to 4500. I wager it ends up using GEQO due to the number of possiblities for a join order that the query has and thus just decides on a "good" plan out of those it examined. In any case, the "right" way to do this is definning a good explicit join order, no? On top of my head I'm not sure how to re-write it proberly, suppose trial and errors is the only way.... From the plan it appears that the following part is where the cost dramatically increases (although the time does not??): -> Nested Loop (cost=0.00..1207.19 rows=75 width=32) (actual time=0.28..18.47 rows=164 loops=1) -> Nested Loop (cost=0.00..868.23 rows=58 width=20) (actual time=0.16..13.91 rows=164 loops=1) -> Index Scan using subcat_uq on sct2subcattype_tbl (cost=0.00..479.90 rows=82 width=8) (actual time=0.11..9.47 rows=164 loops=1) Index Cond: (subcattpid = 50) Filter: (NOT (subplan)) SubPlan -> Seq Scan on aff2sct2subcattype_tbl (cost=0.00..1.92 rows=1 width=4) (actual time=0.05..0.05 rows=0 loops=164) Filter: ((affid = 8) AND ($0 = sctid)) -> Index Scan using aff_price_uq on price_tbl (cost=0.00..4.72 rows=1 width=12) (actual time=0.02..0.02 rows=1 loops=164) Index Cond: ((price_tbl.affid = 8) AND (price_tbl.sctid = outer".sctid))" -> Index Scan using ctp_statcon on statcon_tbl (cost=0.00..5.86 rows=1 width=12) (actual time=0.02..0.02 rows=1 loops=164) Index Cond: ((statcon_tbl.sctid = outer".sctid) AND (statcon_tbl.ctpid = 1))" Especially the index scan on subcat_uq seems rather expensive, but is pretty fast. Can there be drawn a relation between estimated cost and execution time? Any other pointers in the right direction would be very much appreciated. For the full query and query plan, please refer to: http://213.173.234.215:8080/get_content_plan.htm Cheers Jona Tom Lane wrote: >Jona <jonanews@oismail.com> writes: > > >>I have a query (please refer to >>http://213.173.234.215:8080/get_content_plan.htm for the query as well >>as query plan) that is slow when it's run the first time and fast(ish) >>on all successive runs within a reasonable time period. >> >> > > > >>This leads me to suspect that when the query is first run, all used data >>have to be fetched from the disk where as once it has been run all data >>is available in the OS's disk cache. >> >> > >Sounds like that to me too. > > > >>Is there anway to either enhance the chance that the data can be found >>in the disk cache or allowing the database to fetch the data faster? >> >> > > > >Run the query more often? > > The query is dynamically constructed from user input, although the total number of different queries that can be run is limited (around 10k different combinations I suspect) it seems rather pointless to run all of them (or even the most common) more often just to keep the data in the disk cache. Is there a way to make the data more accessible on the disk? >Also, that pile of INNER JOINs is forcing a probably-bad join order; >you need to think carefully about the order you want things joined in, >or else convert the query to non-JOIN syntax. See the "Performance >Tips" chapter of the manual. > > You're probably right herem though I'm not sure I can > regards, tom lane > >---------------------------(end of broadcast)--------------------------- >TIP 4: Don't 'kill -9' the postmaster > > |