This is a discussion on Load spikes on 8.1.11 within the pgsql Hackers forums, part of the PostgreSQL category; --> Hi All, I have been perplexed by random load spikes on an 8.1.11 instance. many a times they are ...
| |||||||
| FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| Hi All, I have been perplexed by random load spikes on an 8.1.11 instance. many a times they are random, in the sense we cannot tie a particular scenario as the cause for it! But a few times we can see that when we are executing huge scripts, which include DDL as well as DML, the load on the box spikes to above 200. We see similar load spikes other times too when we are not running any such task on the DB. During these spikes, in the 'top' sessions we see the 'idle' PG processes consuming between 2 and 5 % CPU, and since the box has 8 CPUS (2 sockets and each CPU is a quad core Intel Xeon processors) and somewhere around 200 Postgres processes, the load spikes to above 200; and it does this very sharply. We are running the scripts using psql -f, but we can see the load even while running the commands on by one! When there's no load, an strace session on an 'idle' PG process looks like: [postgres@db1 data]$ strace -p 9375 Process 9375 attached - interrupt to quit recvfrom(9, <unfinished ...> Process 9375 detached But under these heavy load onditions, an 'idle' PG process' strace looks like: [postgres@db1 data]$ strace -p 22994 Process 22994 attached - interrupt to quit select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout) select(0, NULL, NULL, NULL, {0, 10000}) = 0 (Timeout) select(0, NULL, NULL, NULL, {0, 11000}) = 0 (Timeout) select(0, NULL, NULL, NULL, {0, 14000}) = 0 (Timeout) select(0, NULL, NULL, NULL, {0, 17000}) = 0 (Timeout) select(0, NULL, NULL, NULL, {0, 31000}) = 0 (Timeout) select(0, NULL, NULL, NULL, {0, 51000}) = 0 (Timeout) select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout) select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout) select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout) select(0, NULL, NULL, NULL, {0, 2000}) = 0 (Timeout) select(0, NULL, NULL, NULL, {0, 4000}) = 0 (Timeout) select(0, NULL, NULL, NULL, {0, 5000}) = 0 (Timeout) select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout) select(0, NULL, NULL, NULL, {0, 2000}) = 0 (Timeout) select(0, NULL, NULL, NULL, {0, 2000}) = 0 (Timeout) select(0, NULL, NULL, NULL, {0, 3000}) = 0 (Timeout) select(0, NULL, NULL, NULL, {0, 6000}) = 0 (Timeout) select(0, NULL, NULL, NULL, {0, 12000}) = 0 (Timeout) select(0, NULL, NULL, NULL, {0, 12000}) = 0 (Timeout) select(0, NULL, NULL, NULL, {0, 23000}) = 0 (Timeout) select(0, NULL, NULL, NULL, {0, 27000}) = 0 (Timeout) select(0, NULL, NULL, NULL, {0, 47000}) = 0 (Timeout) select(0, NULL, NULL, NULL, {0, 70000}) = 0 (Timeout) select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout) select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout) select(0, NULL, NULL, NULL, {0, 2000}) = 0 (Timeout) select(0, NULL, NULL, NULL, {0, 4000}) = 0 (Timeout) select(0, NULL, NULL, NULL, {0, 7000}) = 0 (Timeout) select(0, NULL, NULL, NULL, {0, 11000}) = 0 (Timeout) select(0, NULL, NULL, NULL, {0, 16000}) = 0 (Timeout) select(0, NULL, NULL, NULL, {0, 19000}) = 0 (Timeout) select(0, NULL, NULL, NULL, {0, 35000}) = 0 (Timeout) select(0, NULL, NULL, NULL, {0, 53000}) = 0 (Timeout) select(0, NULL, NULL, NULL, {0, 75000}) = 0 (Timeout) select(0, NULL, NULL, NULL, {0, 76000}) = 0 (Timeout) select(0, NULL, NULL, NULL, {0, 102000}) = 0 (Timeout) Process 22994 detached So I guess there's something very wrong with the above 'select' calls. Can somebody please shed some light on this? Let me know what OS/hardware specs you need. Any help is greatly appreciated. Thanks in advance, -- gurjeet[.singh]@EnterpriseDB.com singh.gurjeet@{ gmail | hotmail | indiatimes | yahoo }.com EnterpriseDB http://www.enterprisedb.com Mail sent from my BlackLaptop device |
| |||
| Just an addition... the strace o/p with selects timing out just runs almost continuously, it doesn't seem to pause anywhere! On Fri, Jul 18, 2008 at 9:16 AM, Gurjeet Singh <singh.gurjeet@gmail.com> wrote: > Hi All, > > I have been perplexed by random load spikes on an 8.1.11 instance. many > a times they are random, in the sense we cannot tie a particular scenario as > the cause for it! But a few times we can see that when we are executing huge > scripts, which include DDL as well as DML, the load on the box spikes to > above 200. We see similar load spikes other times too when we are not > running any such task on the DB. > > During these spikes, in the 'top' sessions we see the 'idle' PG > processes consuming between 2 and 5 % CPU, and since the box has 8 CPUS (2 > sockets and each CPU is a quad core Intel Xeon processors) and somewhere > around 200 Postgres processes, the load spikes to above 200; and it does > this very sharply. > > We are running the scripts using psql -f, but we can see the load even > while running the commands on by one! > > When there's no load, an strace session on an 'idle' PG process looks > like: > > [postgres@db1 data]$ strace -p 9375 > Process 9375 attached - interrupt to quit > recvfrom(9, <unfinished ...> > Process 9375 detached > > > But under these heavy load onditions, an 'idle' PG process' strace > looks like: > > [postgres@db1 data]$ strace -p 22994 > Process 22994 attached - interrupt to quit > select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout) > select(0, NULL, NULL, NULL, {0, 10000}) = 0 (Timeout) > select(0, NULL, NULL, NULL, {0, 11000}) = 0 (Timeout) > select(0, NULL, NULL, NULL, {0, 14000}) = 0 (Timeout) > select(0, NULL, NULL, NULL, {0, 17000}) = 0 (Timeout) > select(0, NULL, NULL, NULL, {0, 31000}) = 0 (Timeout) > select(0, NULL, NULL, NULL, {0, 51000}) = 0 (Timeout) > select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout) > select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout) > select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout) > select(0, NULL, NULL, NULL, {0, 2000}) = 0 (Timeout) > select(0, NULL, NULL, NULL, {0, 4000}) = 0 (Timeout) > select(0, NULL, NULL, NULL, {0, 5000}) = 0 (Timeout) > select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout) > select(0, NULL, NULL, NULL, {0, 2000}) = 0 (Timeout) > select(0, NULL, NULL, NULL, {0, 2000}) = 0 (Timeout) > select(0, NULL, NULL, NULL, {0, 3000}) = 0 (Timeout) > select(0, NULL, NULL, NULL, {0, 6000}) = 0 (Timeout) > select(0, NULL, NULL, NULL, {0, 12000}) = 0 (Timeout) > select(0, NULL, NULL, NULL, {0, 12000}) = 0 (Timeout) > select(0, NULL, NULL, NULL, {0, 23000}) = 0 (Timeout) > select(0, NULL, NULL, NULL, {0, 27000}) = 0 (Timeout) > select(0, NULL, NULL, NULL, {0, 47000}) = 0 (Timeout) > select(0, NULL, NULL, NULL, {0, 70000}) = 0 (Timeout) > select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout) > select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout) > select(0, NULL, NULL, NULL, {0, 2000}) = 0 (Timeout) > select(0, NULL, NULL, NULL, {0, 4000}) = 0 (Timeout) > select(0, NULL, NULL, NULL, {0, 7000}) = 0 (Timeout) > select(0, NULL, NULL, NULL, {0, 11000}) = 0 (Timeout) > select(0, NULL, NULL, NULL, {0, 16000}) = 0 (Timeout) > select(0, NULL, NULL, NULL, {0, 19000}) = 0 (Timeout) > select(0, NULL, NULL, NULL, {0, 35000}) = 0 (Timeout) > select(0, NULL, NULL, NULL, {0, 53000}) = 0 (Timeout) > select(0, NULL, NULL, NULL, {0, 75000}) = 0 (Timeout) > select(0, NULL, NULL, NULL, {0, 76000}) = 0 (Timeout) > select(0, NULL, NULL, NULL, {0, 102000}) = 0 (Timeout) > Process 22994 detached > > > So I guess there's something very wrong with the above 'select' calls. > > Can somebody please shed some light on this? Let me know what > OS/hardware specs you need. > > Any help is greatly appreciated. > > Thanks in advance, > > -- > gurjeet[.singh]@EnterpriseDB.com > singh.gurjeet@{ gmail | hotmail | indiatimes | yahoo }.com > > EnterpriseDB http://www.enterprisedb.com > > Mail sent from my BlackLaptop device > -- gurjeet[.singh]@EnterpriseDB.com singh.gurjeet@{ gmail | hotmail | indiatimes | yahoo }.com EnterpriseDB http://www.enterprisedb.com Mail sent from my BlackLaptop device |
| |||
| "Gurjeet Singh" <singh.gurjeet@gmail.com> writes: > During these spikes, in the 'top' sessions we see the 'idle' PG > processes consuming between 2 and 5 % CPU, and since the box has 8 CPUS (2 > sockets and each CPU is a quad core Intel Xeon processors) and somewhere > around 200 Postgres processes, the load spikes to above 200; and it does > this very sharply. This looks like heavy contention for a spinlock. You need to get a higher-level analysis of what's happening before anyone can say much more than that. Note that 8.1 is pretty much ancient history as far as scalability to 8-core hardware goes. You should probably consider updating to 8.3 before investing too much time in tracking down what's happening. If you can still show the problem on 8.3 then there would be some interest in fixing it ... regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers |
| |||
| On Fri, Jul 18, 2008 at 9:58 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > "Gurjeet Singh" <singh.gurjeet@gmail.com> writes: > > During these spikes, in the 'top' sessions we see the 'idle' PG > > processes consuming between 2 and 5 % CPU, and since the box has 8 CPUS > (2 > > sockets and each CPU is a quad core Intel Xeon processors) and somewhere > > around 200 Postgres processes, the load spikes to above 200; and it does > > this very sharply. > > This looks like heavy contention for a spinlock. You need to get a > higher-level analysis of what's happening before anyone can say much > more than that. > > Note that 8.1 is pretty much ancient history as far as scalability to > 8-core hardware goes. You should probably consider updating to 8.3 > before investing too much time in tracking down what's happening. > If you can still show the problem on 8.3 then there would be some > interest in fixing it ... Upgrading is on the cards, but not as high priority as I would like it to be! This is a production box, and we desperatly need some respite from these spikes. Can you please elaborate on what high level diagnosis would you need? I just ran DROP SCHEMA _<slony schema> CASCADE; and it spiked again, on a very low loaded box!! Thanks for all you help. -- gurjeet[.singh]@EnterpriseDB.com singh.gurjeet@{ gmail | hotmail | indiatimes | yahoo }.com EnterpriseDB http://www.enterprisedb.com Mail sent from my BlackLaptop device |
| |||
| On Fri, Jul 18, 2008 at 10:05 AM, Gurjeet Singh <singh.gurjeet@gmail.com> wrote: > On Fri, Jul 18, 2008 at 9:58 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> "Gurjeet Singh" <singh.gurjeet@gmail.com> writes: >> > During these spikes, in the 'top' sessions we see the 'idle' PG >> > processes consuming between 2 and 5 % CPU, and since the box has 8 CPUS >> (2 >> > sockets and each CPU is a quad core Intel Xeon processors) and somewhere >> > around 200 Postgres processes, the load spikes to above 200; and it does >> > this very sharply. >> >> This looks like heavy contention for a spinlock. You need to get a >> higher-level analysis of what's happening before anyone can say much >> more than that. >> >> Note that 8.1 is pretty much ancient history as far as scalability to >> 8-core hardware goes. You should probably consider updating to 8.3 >> before investing too much time in tracking down what's happening. >> If you can still show the problem on 8.3 then there would be some >> interest in fixing it ... > > > Upgrading is on the cards, but not as high priority as I would like it to > be! This is a production box, and we desperatly need some respite from these > spikes. > > Can you please elaborate on what high level diagnosis would you need? > > I just ran DROP SCHEMA _<slony schema> CASCADE; and it spiked again, on a > very low loaded box!! > > Thanks for all you help. > Would reducing the number of connections on the DB help in reducing the spike? -- gurjeet[.singh]@EnterpriseDB.com singh.gurjeet@{ gmail | hotmail | indiatimes | yahoo }.com EnterpriseDB http://www.enterprisedb.com Mail sent from my BlackLaptop device |
| |||
| "Gurjeet Singh" <singh.gurjeet@gmail.com> writes: > Can you please elaborate on what high level diagnosis would you need? Well, we'd need some idea of which spinlock is being contended for... > I just ran DROP SCHEMA _<slony schema> CASCADE; and it spiked again, on a > very low loaded box!! That *might* mean that the problem is contention for SInvalLock, since dropping a large schema would result in a lot of sinval traffic. Or maybe it's something else. Do your spikes correspond to large DDL changes? If your platform has oprofile or DTrace or some such then getting an execution profile with that type of tool would tell something. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers |
| |||
| On Fri, Jul 18, 2008 at 10:05:33AM +0530, Gurjeet Singh wrote: > I just ran DROP SCHEMA _<slony schema> CASCADE; and it spiked again, on a > very low loaded box!! Ah, well, if slony is involved, then you have possible locking problems in the database _also_ to contend with, along with the spinlock problems. This will for sure cause spikes. You need to tell us more about what you're doing. And I bet some of it belongs on the slony lists. A -- Andrew Sullivan ajs@commandprompt.com +1 503 667 4564 x104 http://www.commandprompt.com/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers |
| |||
| On Fri, Jul 18, 2008 at 10:21 AM, Andrew Sullivan <ajs@commandprompt.com> wrote: > On Fri, Jul 18, 2008 at 10:05:33AM +0530, Gurjeet Singh wrote: > > > I just ran DROP SCHEMA _<slony schema> CASCADE; and it spiked again, on a > > very low loaded box!! > > Ah, well, if slony is involved, then you have possible locking > problems in the database _also_ to contend with, along with the > spinlock problems. This will for sure cause spikes. > > You need to tell us more about what you're doing. And I bet some of > it belongs on the slony lists. > > I am in the eye of the storm right now. Just started INIT cluster Slonik command and that spiked too.. for more than 10 minutes now!! -- gurjeet[.singh]@EnterpriseDB.com singh.gurjeet@{ gmail | hotmail | indiatimes | yahoo }.com EnterpriseDB http://www.enterprisedb.com Mail sent from my BlackLaptop device |
| |||
| On Fri, Jul 18, 2008 at 10:41:36AM +0530, Gurjeet Singh wrote: > > Just started INIT cluster Slonik command and that spiked too.. for more than > 10 minutes now!! Are you attempting to do Slony changes (such as install Slony) on an active database? I strongly encourage you to read the Slony manual. Slony, frankly, sucks for this use case. The manual says as much, although in more orotund phrases than that. A -- Andrew Sullivan ajs@commandprompt.com +1 503 667 4564 x104 http://www.commandprompt.com/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers |
| ||||
| On Fri, Jul 18, 2008 at 10:15:42AM +0530, Gurjeet Singh wrote: > On Fri, Jul 18, 2008 at 10:05 AM, Gurjeet Singh <singh.gurjeet@gmail.com> > wrote: > > > On Fri, Jul 18, 2008 at 9:58 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > >> "Gurjeet Singh" <singh.gurjeet@gmail.com> writes: > >> > During these spikes, in the 'top' sessions we see the 'idle' PG > >> > processes consuming between 2 and 5 % CPU, and since the box has 8 CPUS > >> (2 > >> > sockets and each CPU is a quad core Intel Xeon processors) and somewhere > >> > around 200 Postgres processes, the load spikes to above 200; and it does > >> > this very sharply. > >> > >> This looks like heavy contention for a spinlock. You need to get a > >> higher-level analysis of what's happening before anyone can say much > >> more than that. > >> > >> Note that 8.1 is pretty much ancient history as far as scalability to > >> 8-core hardware goes. You should probably consider updating to 8.3 > >> before investing too much time in tracking down what's happening. > >> If you can still show the problem on 8.3 then there would be some > >> interest in fixing it ... > > > > > > Upgrading is on the cards, but not as high priority as I would like it to > > be! This is a production box, and we desperatly need some respite from these > > spikes. > > > > Can you please elaborate on what high level diagnosis would you need? > > > > I just ran DROP SCHEMA _<slony schema> CASCADE; and it spiked again, on a > > very low loaded box!! > > > > Thanks for all you help. > > > > Would reducing the number of connections on the DB help in reducing the > spike? Just generally, reducing the number of connections to the DB will help in reducing resource consumption. When you first get a chance, use or set up a test environment where you can test the upgrade to 8.3.latest. Cheers, David. -- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers |