This is a discussion on Improving deadlock error messages within the pgsql Hackers forums, part of the PostgreSQL category; --> The error message we currently produce when a deadlock occurs is pretty unfriendly: ERROR: deadlock detected DETAIL: Process 32068 ...
| |||||||
| FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| The error message we currently produce when a deadlock occurs is pretty unfriendly: ERROR: deadlock detected DETAIL: Process 32068 waits for AccessExclusiveLock on relation 16413 of database 16384; blocked by process 32064. Process 32064 waits for AccessExclusiveLock on relation 16415 of database 16384; blocked by process 32068. Users encounter this message relatively frequently -- and they often depend on the information in the errdetail to track down the source of the deadlock. Presenting numeric OIDs effectively forces the DBA to resolve the OIDs to the appropriate relation and database names manually. I whipped up a quick patch to use names as well as OIDs for the identifiers in the message, but on reflection the simple approach to doing this is problematic: when we do syscache lookups to lookup the identifier names, we might need to acquire an AccessShareLock on various system catalogs (pg_class, pg_namespace, pg_database). This could fail (e.g. because of a deadlock involving a system catalog), causing the deadlock detector to infinitely recurse (albeit slowly). We could fix this by first conditionally acquiring AccessShareLocks on the necessary system catalogs. If any of those lock acquisitions fails, we can just print the numeric OID instead. Since failing to acquire an AccessShareLock on the system catalogs should not occur frequently in practise, most users should see a more friendly error message. Comments? Anyone see a better approach? -Neil ---------------------------(end of broadcast)--------------------------- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match |
| |||
| Neil Conway <neilc@samurai.com> writes: > I whipped up a quick patch to use names as well as OIDs for the > identifiers in the message, but on reflection the simple approach to > doing this is problematic: when we do syscache lookups to lookup the > identifier names, we might need to acquire an AccessShareLock on various > system catalogs (pg_class, pg_namespace, pg_database). This could fail > (e.g. because of a deadlock involving a system catalog), causing the > deadlock detector to infinitely recurse (albeit slowly). Yup, that's exactly why it doesn't do that already. > We could fix this by first conditionally acquiring AccessShareLocks on > the necessary system catalogs. I don't think you've thought of quite all of the failure cases. One that's a bit pressing is that a deadlock isn't necessarily confined to objects in your own database. My take on this is that I'd much rather have unfriendly information than none at all. regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 2: Don't 'kill -9' the postmaster |
| |||
| "Tom Lane" <tgl@sss.pgh.pa.us> writes: > I don't think you've thought of quite all of the failure cases. One > that's a bit pressing is that a deadlock isn't necessarily confined to > objects in your own database. We could do the syscache lookups for only the object we're waiting on and store that information in the lock table for others to refer to. We would have to do the syscache lookup either always or at the point where we first decide we have to block. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com ---------------------------(end of broadcast)--------------------------- TIP 2: Don't 'kill -9' the postmaster |
| |||
| On Fri, 2007-04-20 at 02:55 -0400, Tom Lane wrote: > I don't think you've thought of quite all of the failure cases. One > that's a bit pressing is that a deadlock isn't necessarily confined to > objects in your own database. I'm not sure I follow. If we conditionally acquire the locks we need and always fallback to just printing the numeric OIDs, ISTM we should be able to avoid the infinite recursion problem. (If necessary, we can always special-case objects outside our own database, although I'm not sure that's necessary.) -Neil ---------------------------(end of broadcast)--------------------------- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq |
| |||
| Neil Conway <neilc@samurai.com> writes: > On Fri, 2007-04-20 at 02:55 -0400, Tom Lane wrote: >> I don't think you've thought of quite all of the failure cases. One >> that's a bit pressing is that a deadlock isn't necessarily confined to >> objects in your own database. > I'm not sure I follow. If we conditionally acquire the locks we need and > always fallback to just printing the numeric OIDs, ISTM we should be > able to avoid the infinite recursion problem. (If necessary, we can > always special-case objects outside our own database, although I'm not > sure that's necessary.) Maybe so, but you're going to be writing quite a lot of duplicative code, because the existing routines you might have been thinking of using (lsyscache.c etc) don't behave that way. The basic objection I've got to this is that it'll introduce a lot of complexity and fragility into a seldom-taken error-recovery path, which is almost by definition not well enough tested already. regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 4: Have you searched our list archives? http://archives.postgresql.org |
| |||
| On Sat, 2007-04-21 at 02:38 -0400, Tom Lane wrote: > Maybe so, but you're going to be writing quite a lot of duplicative > code, because the existing routines you might have been thinking of > using (lsyscache.c etc) don't behave that way. Right, I'm envisioning doing a conditional LockAcquire and then heap_open() / heap_getnext() by hand. That will be relatively slow, but code that emits a deadlock error message is almost by definition not performance critical. BTW, another alternative would be to set a global variable instructing LockAcquire() to not block waiting for a lock; instead, it would longjmp(), a la elog(ERROR). You could even construct something similar to PG_TRY(): PG_COND_LOCK(); { /* do various things that might acquire lmgr locks */ } PG_ACQUIRE_FAILED() { /* failed to acquire an lmgr lock */ } PG_END_COND_LOCK(); The risk would be leaving the LockAcquire() call site in an inconsistent state when we longjmp(), but since DeadLockReport() is going to ereport(ERROR) anyway, it might be sufficiently safe. This scheme does seem a bit fragile, though... -Neil ---------------------------(end of broadcast)--------------------------- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.org so that your message can get through to the mailing list cleanly |
| |||
| On Sat, 2007-04-21 at 17:56 -0400, Neil Conway wrote: > Right, I'm envisioning doing a conditional LockAcquire and then > heap_open() / heap_getnext() by hand. That will be relatively slow, but > code that emits a deadlock error message is almost by definition not > performance critical. .... although it turns out you'd need to conditionally lock a *lot* of system catalogs to guarantee that you're not going to block on a lock at some point. Needless to say, that approach would be pretty ugly and fragile. > BTW, another alternative would be to set a global variable instructing > LockAcquire() to not block waiting for a lock; instead, it would > longjmp(), a la elog(ERROR). You could even construct something similar > to PG_TRY() Attached is a very quick hack of a patch to do this. -Neil ---------------------------(end of broadcast)--------------------------- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq |
| |||
| On Sat, 2007-04-21 at 19:43 -0400, Neil Conway wrote: > Attached is a very quick hack of a patch to do this. Does anyone have any feedback on this approach? If people are satisfied with this solution, I can get a cleaned up patch ready to apply shortly. -Neil ---------------------------(end of broadcast)--------------------------- TIP 6: explain analyze is your friend |
| |||
| Neil Conway <neilc@samurai.com> writes: > On Sat, 2007-04-21 at 19:43 -0400, Neil Conway wrote: >> Attached is a very quick hack of a patch to do this. > Does anyone have any feedback on this approach? If people are satisfied > with this solution, I can get a cleaned up patch ready to apply shortly. I'm really still opposed to the entire concept. You're proposing to put a lot of fragile-looking code into a seldom-exercised error path. I fear bugs will survive a long time in there, and the net effect will be that we get no information when we need it most. The numeric printouts may be ugly, but they are reliable. regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match |
| ||||
| On Mon, 2007-04-23 at 17:38 -0400, Tom Lane wrote: > I'm really still opposed to the entire concept. You're proposing to put > a lot of fragile-looking code into a seldom-exercised error path. There's certainly not a "lot" of code: the patch just adds a few syscache lookups, wrapped in a PG_LOCK_NOWAIT() block. As for fragility, I think the important point is whether it's safe to siglongjmp() out of LockAcquire(); the rest is just window dressing. -Neil ---------------------------(end of broadcast)--------------------------- TIP 2: Don't 'kill -9' the postmaster |
| Thread Tools | |
| Display Modes | |
|
|