vBulletin Search Engine Optimization
| |||||||
| Register | FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| Hi guys, I saw a strange behaviour on one of the production boxes. The pg_stat_activity shows a process as <IDLE> and yet 'waiting' !!! On top of it (understandably, since its IDLE), there are no entries for this pid in pg_locks! Following are the snapshots of the two system views. procpid | current_query | waiting | duration | backend_start ---------+-----------------------+---------+------------------+------------------------------- 20762 | <IDLE> | f | | 2008-01-31 13:38:30.848898-08 19776 | <IDLE> | t | 00:38:34.76833 | 2008-01-31 12:51:29.005744-08 20356 | <IDLE> | f | 00:38:29.971425 | 2008-01-31 13:17:37.617497-08 19775 | <IDLE> | f | 00:38:27.187201 | 2008-01-31 12:51:28.999242-08 19774 | <IDLE> | f | 00:38:27.187068 | 2008-01-31 12:51:28.90554-08 20728 | <IDLE> | f | 00:14:03.913027 | 2008-01-31 13:36:11.345822-08 9727 | <IDLE> | f | 00:03:07.444273 | 2008-01-24 22:25:00.289931-08 9684 | <IDLE> | f | 00:00:07.704656 | 2008-01-24 22:22:00.007377-08 19390 | <IDLE> in transaction | f | 00:00:00.027585 | 2008-01-31 12:30:07.999246-08 19389 | <IDLE> in transaction | t | -00:00:00.000255 | 2008-01-31 12:30:07.973868-08 select * from pg_locks where pid in ( 19776, 19389 ); locktype | database | relation | page | tuple | transactionid | classid | objid | objsubid | transaction | pid | mode | granted ---------------+----------+----------+------+-------+---------------+---------+-------+----------+-------------+-------+------------------+--------- relation | 16584 | 17070 | | | | | | | 3700350056 | 19389 | RowExclusiveLock | t relation | 16584 | 17106 | | | | | | | 3700350056 | 19389 | RowExclusiveLock | t relation | 16584 | 17068 | | | | | | | 3700350056 | 19389 | RowExclusiveLock | t transactionid | | | | | 3700350056 | | | | 3700350056 | 19389 | ExclusiveLock | t relation | 16584 | 17108 | | | | | | | 3700350056 | 19389 | RowExclusiveLock | t (5 rows) The 'duration' column above is just now()-query_start. These are not just two instant snapshots, but we could see this output consistently for quite long. I tracked the 'waiting' column a little bit in the source code, and saw that it is actually generated from PgBackendStatus.st_waiting . Is it possible that, for some reason, postgres forgot to update this for a backend? select version(); version -------------------------------------------------------------------------------------------- PostgreSQL 8.2.4 on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.1.0 (SUSE Linux) This issue has been seen twice now. -- gurjeet[.singh]@EnterpriseDB.com singh.gurjeet@{ gmail | hotmail | indiatimes | yahoo }.com EnterpriseDB http://www.enterprisedb.com 17° 29' 34.37"N, 78° 30' 59.76"E - Hyderabad 18° 32' 57.25"N, 73° 56' 25.42"E - Pune 37° 47' 19.72"N, 122° 24' 1.69" W - San Francisco * http://gurjeet.frihost.net Mail sent from my BlackLaptop device |
| |||
| "Gurjeet Singh" <singh.gurjeet@gmail.com> writes: > I saw a strange behaviour on one of the production boxes. The > pg_stat_activity shows a process as <IDLE> and yet 'waiting' !!! On top of > it (understandably, since its IDLE), there are no entries for this pid in > pg_locks! Hmm, I can reproduce something like this by aborting a wait for lock. It seems the problem is that WaitOnLock() is ignoring its own good advice, assuming that it can do cleanup work after waiting. regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match |
| |||
| The situation seems pretty bad!! Here are the steps to reproduce in 'PostgreSQL 8.3beta2 on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 3.3.3 (SuSE Linux)': session 1: begin; session 1: update test set a = 112 where a = 112; session 2: update test set a = 113 where a = 112; --waits session 1: select * from pg_stat_activity; -- executed this a few times before executing 'select version()' and then following: session 1: <stat query1> -- see end of mail for this query procpid | current_query | waiting | duration | backend_start ---------+----------------------------------------+---------+------------------+------------------------------- 12577 | update test set a = 113 where a = 112; | t | -00:01: 35.782881 | 2008-02-01 13:36:15.31027-08 11975 | select * from pg_stat_activity ; | f | -00:01: 52.554697 | 2008-02-01 13:30:40.396392-08 (2 rows) session 1: select * from pg_locks locktype | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction | pid | mode | granted ---------------+----------+----------+------+-------+------------+---------------+---------+-------+----------+--------------- -----+-------+------------------+--------- transactionid | | | | | | 390 | | | | 2/14 | 12577 | ShareLock | f transactionid | | | | | | 390 | | | | 1/9 | 11975 | ExclusiveLock | t <snip irrelevant> Then, session 2: ^C Cancel request sent ERROR: canceling statement due to user request session1: <stat query1> procpid | current_query | waiting | duration | backend_start ---------+----------------------------------------+---------+------------------+------------------------------- 12577 | update test set a = 113 where a = 112; | t | -00:01: 35.782881 | 2008-02-01 13:36:15.31027-08 11975 | select * from pg_stat_activity ; | f | -00:01: 52.554697 | 2008-02-01 13:30:40.396392-08 (2 rows) session 1: select * from pg_locks ; <no traces of pid 12577> session 1: select pg_backend_pid(); pg_backend_pid ---------------- 11975 The last mentioned output of <stat query> shows session 1 executing 'select * from p_s_a', whereas the <stat query> _is_ being executed in session 1!!! This result is consistently returned for a while, and later... session 2: select pg_backend_pid(); pg_backend_pid ---------------- 12577 session 1: <stat query1> procpid | current_query | waiting | duration | backend_start ---------+-----------------------+---------+-----------------+------------------------------- 11975 | <IDLE> in transaction | f | 00:06:08.671029 | 2008-02-01 13:30:40.396392-08 (1 row) After a while again: session 1: <stat query2> -- notice 2 not 1; 'select *' comes back to haunt!!! procpid | current_query | waiting | duration | backend_start ---------+----------------------------------------+---------+------------------+------------------------------- 12577 | update test set a = 113 where a = 112; | t | -00:01: 35.782881 | 2008-02-01 13:36:15.31027-08 11975 | select * from pg_stat_activity ; | f | -00:01: 52.554697 | 2008-02-01 13:30:40.396392-08 (2 rows) session 1: <stat query1> -- 1 back in action procpid | current_query | waiting | duration | backend_start ---------+----------------------------------------+---------+------------------+------------------------------- 12577 | update test set a = 113 where a = 112; | t | -00:01: 35.782881 | 2008-02-01 13:36:15.31027-08 11975 | select * from pg_stat_activity ; | f | -00:01: 52.554697 | 2008-02-01 13:30:40.396392-08 (2 rows) The <stat query1> is: select procpid, current_query::varchar(50), waiting, now() - query_start as duration, backend_start from pg_stat_activity where current_query <> '<IDLE>' and current_query not like '%DONT COUNT ME1 %' order by duration desc limit 10; The <stat query2> is: select procpid, current_query::varchar(50), waiting, now() - query_start as duration, backend_start from pg_stat_activity where current_query not like '%DONT COUNT ME1 %' order by duration desc limit 10; Found more bugs than I was looking for, to reproduce!!! The reporter also made an observation (on 8.2.4) that there were deadlocks detected at around the same time. Looked at WaitOnLock(), and clearly there's a problem, but is it at the same/only place we are suspecting it to be? Best regards, PS: Ran the <stat query>ies 1 and 2 again, just before hitting 'send', and the result is the same: procpid | current_query | waiting | duration | backend_start ---------+----------------------------------------+---------+------------------+------------------------------- 12577 | update test set a = 113 where a = 112; | t | -00:01: 35.782881 | 2008-02-01 13:36:15.31027-08 11975 | select * from pg_stat_activity ; | f | -00:01: 52.554697 | 2008-02-01 13:30:40.396392-08 (2 rows) Clearly, there's something wrong On Feb 1, 2008 8:16 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > "Gurjeet Singh" <singh.gurjeet@gmail.com> writes: > > I saw a strange behaviour on one of the production boxes. The > > pg_stat_activity shows a process as <IDLE> and yet 'waiting' !!! On top > of > > it (understandably, since its IDLE), there are no entries for this pid > in > > pg_locks! > > Hmm, I can reproduce something like this by aborting a wait for lock. > It seems the problem is that WaitOnLock() is ignoring its own good > advice, assuming that it can do cleanup work after waiting. > > regards, tom lane > -- gurjeet[.singh]@EnterpriseDB.com singh.gurjeet@{ gmail | hotmail | indiatimes | yahoo }.com EnterpriseDB http://www.enterprisedb.com 17° 29' 34.37"N, 78° 30' 59.76"E - Hyderabad 18° 32' 57.25"N, 73° 56' 25.42"E - Pune 37° 47' 19.72"N, 122° 24' 1.69" W - San Francisco * http://gurjeet.frihost.net Mail sent from my BlackLaptop device |
| |||
| "Gurjeet Singh" <singh.gurjeet@gmail.com> writes: > The situation seems pretty bad!! I think at least part of your problem is not understanding that a single transaction sees a frozen snapshot of pg_stat_activity. regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 6: explain analyze is your friend |
| |||
| On Feb 1, 2008 3:56 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > "Gurjeet Singh" <singh.gurjeet@gmail.com> writes: > > The situation seems pretty bad!! > > I think at least part of your problem is not understanding that a single > transaction sees a frozen snapshot of pg_stat_activity. > > It does! I assumed that pg_stat_activity produced the transaction-independent snapshot of internal memory structures! Is that the case with pg_locks too!? I hope not. BTW, we cannot say that the pg_stat_activity behaves in a consistent manner (transactions-wise). From what I could infer, this view's results are frozen when you first query the view, not when the transaction started (which is how other (normal) relations behave). It's a bit confusing, and should be documented if this is the way it is intended to work; Something along the lines of : "In a transaction, this view will repeatedly show the same results that were returned by it's first invocation in the transaction." in a less confusing way So we are back to the original problem... Canceling a 'waiting' transaction does not revert the session's 'waiting' state back to 'false' (consistently reproducible). -- gurjeet[.singh]@EnterpriseDB.com singh.gurjeet@{ gmail | hotmail | indiatimes | yahoo }.com EnterpriseDB http://www.enterprisedb.com 17° 29' 34.37"N, 78° 30' 59.76"E - Hyderabad 18° 32' 57.25"N, 73° 56' 25.42"E - Pune 37° 47' 19.72"N, 122° 24' 1.69" W - San Francisco * http://gurjeet.frihost.net Mail sent from my BlackLaptop device |
| |||
| I wrote: > "Gurjeet Singh" <singh.gurjeet@gmail.com> writes: >> I saw a strange behaviour on one of the production boxes. The >> pg_stat_activity shows a process as <IDLE> and yet 'waiting' !!! On top of >> it (understandably, since its IDLE), there are no entries for this pid in >> pg_locks! > Hmm, I can reproduce something like this by aborting a wait for lock. > It seems the problem is that WaitOnLock() is ignoring its own good > advice, assuming that it can do cleanup work after waiting. I've committed a fix for this. (Too late for 8.3.0, unfortunately.) regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 6: explain analyze is your friend |
| |||
| On Feb 2, 2008 2:28 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I wrote: > > "Gurjeet Singh" <singh.gurjeet@gmail.com> writes: > >> I saw a strange behaviour on one of the production boxes. The > >> pg_stat_activity shows a process as <IDLE> and yet 'waiting' !!! On top > of > >> it (understandably, since its IDLE), there are no entries for this pid > in > >> pg_locks! > > > Hmm, I can reproduce something like this by aborting a wait for lock. > > It seems the problem is that WaitOnLock() is ignoring its own good > > advice, assuming that it can do cleanup work after waiting. > > I've committed a fix for this. (Too late for 8.3.0, unfortunately.) > Thanks. Like 8.2, can it not be back-patched on 8.3 too? I just looked at the patch... Isn't PG_TRY() an expensive call to make in the lock.c code? I was thinking of registering a Xact callback using RegisterXactCallback() and performing 'waiting' reset in that callback if the Xact event is XACT_EVENT_ABORT. That would have been compliant with the previous comments ('if we fail, any cleanup must happen in xact abort processing, not here'). Comments. Best regards, -- gurjeet[.singh]@EnterpriseDB.com singh.gurjeet@{ gmail | hotmail | indiatimes | yahoo }.com EnterpriseDB http://www.enterprisedb.com 17° 29' 34.37"N, 78° 30' 59.76"E - Hyderabad 18° 32' 57.25"N, 73° 56' 25.42"E - Pune 37° 47' 19.72"N, 122° 24' 1.69" W - San Francisco * http://gurjeet.frihost.net Mail sent from my BlackLaptop device |
| |||
| Gurjeet Singh escribió: > I just looked at the patch... Isn't PG_TRY() an expensive call to make in > the lock.c code? I was thinking of registering a Xact callback using > RegisterXactCallback() and performing 'waiting' reset in that callback if > the Xact event is XACT_EVENT_ABORT. PG_TRY is not expensive as all that -- it's just a sigsetjmp() call and another stack frame. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc. ---------------------------(end of broadcast)--------------------------- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match |
| |||
| On Feb 2, 2008 3:27 PM, Alvaro Herrera <alvherre@commandprompt.com> wrote: > Gurjeet Singh escribió: > > > I just looked at the patch... Isn't PG_TRY() an expensive call to make > in > > the lock.c code? I was thinking of registering a Xact callback using > > RegisterXactCallback() and performing 'waiting' reset in that callback > if > > the Xact event is XACT_EVENT_ABORT. > > PG_TRY is not expensive as all that -- it's just a sigsetjmp() call and > another stack frame. > > Thats why I asked. I assumed that creating stacks was expensive. Isn't this the reason the compilers came up with the function inline capability; to avoid stacks, because they can be expensive. Or am I confusing two different types of stacks! Moreover, calling a callback, once in a while (only upon XACT abort), may prove to be much cheaper than setting up an additional stack on every lock-acquire call. Really, my 2 cents. -- gurjeet[.singh]@EnterpriseDB.com singh.gurjeet@{ gmail | hotmail | indiatimes | yahoo }.com EnterpriseDB http://www.enterprisedb.com 17° 29' 34.37"N, 78° 30' 59.76"E - Hyderabad 18° 32' 57.25"N, 73° 56' 25.42"E - Pune 37° 47' 19.72"N, 122° 24' 1.69" W - San Francisco * http://gurjeet.frihost.net Mail sent from my BlackLaptop device |
| ||||
| Alvaro Herrera <alvherre@commandprompt.com> writes: > Gurjeet Singh escribió: >> I just looked at the patch... Isn't PG_TRY() an expensive call to make in >> the lock.c code? I was thinking of registering a Xact callback using >> RegisterXactCallback() and performing 'waiting' reset in that callback if >> the Xact event is XACT_EVENT_ABORT. > PG_TRY is not expensive as all that -- it's just a sigsetjmp() call and > another stack frame. Also, since we're about to block here, shaving microseconds is not all that important. The reason I did it that way was to avoid having to export the saved ps-display string out to someplace LockWaitCancel could find it. regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 6: explain analyze is your friend |