| For the past several months we have been having strange shutdowns of two of our AIX servers. Both servers are in the same cabinet and share a UPS, as well as a console connected through a switchbox. In both cases the servers appear to be shut down but are still pingable on the network. Pushing the white power button causes the system to power off immediately. Pushing the power button again reboots the server, and everything comes back as normal. A review of the errpt -a output reveals nothing about a power failure or overheating problem. Both servers are running the same base version of AIX 4.3.3. One server is running Oracle and the other is running Sybase. I'm not sure where to go from here, since nothing is recorded in the error log. I have been told to disable the powerfail feature to see if that eliminates the problem, but again I find it odd that there aren't any error log entries about power failures. Has anyone experienced a problem like this? I would appreciate any advice to resolve this issue. Thanks. |
| |||
| If you can still ping the server, it has not had a power failure. It sounds like you most likely ran out of paging space. |
| |||
| I sometimes observe the same symptoms on one of our RS/6000 44P model 170 machines and suspect a faulty HDLC card, but that's only a guess. Software-wise it's just like our other 44Ps, so it shouldn't be running out of paging space (as the other poster suggested), although that's certainly worth watching. When it hangs there is no way to log onto the box, either via the network or a tty port, but as you say, ping still works. I sometimes see "undetermined error" in the error log, which isn't much better than no error messages at all. I suggest that you run diagnostics in maintenance mode so that they can have a good look at the hardware. Jeffrey. |
| |||
| First of all, post the errpt output for the time frame of the shutdown so we can see what happened during that period (no need for errpt -a). Also take a look at the console log and post the section of that log covering the same time frame. Hajo |
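A rough sketch of pulling just that time window out of the error log. It relies on errpt's -s (start) and -e (end) flags, which take timestamps in mmddhhmmyy form. The function names `is_errpt_stamp` and `errpt_window` are made up for this example, and the timestamps in the usage comment are placeholders, not the actual time of the hang.

```shell
#!/bin/sh
# Succeed if $1 looks like an errpt timestamp: exactly ten digits (mmddhhmmyy).
is_errpt_stamp() {
    case "$1" in
        [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]) return 0 ;;
        *) return 1 ;;
    esac
}

# Print the summary error report between two timestamps.
# Returns 2 without calling errpt if either argument is malformed.
errpt_window() {
    if is_errpt_stamp "$1" && is_errpt_stamp "$2"; then
        errpt -s "$1" -e "$2"
    else
        echo "usage: errpt_window mmddhhmmyy mmddhhmmyy" >&2
        return 2
    fi
}

# Hypothetical usage: everything logged between 12:00 and 16:00 on April 3, 2006
#   errpt_window 0403120006 0403160006
```

The summary listing is short enough to post in full; the identifiers it shows can be expanded later if one of them looks relevant.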
| |||
| Firstly, as someone has quite rightly pointed out, if you can ping the system then the kernel is still servicing interrupts, so the system is not shut down. The most likely cause is a performance issue. You need to start collecting performance metrics (nmon in batch mode may be a good start; Google for it), or use some other performance tool such as Performance Toolbox or a Tivoli performance monitor, and then look at the metrics from the time frame of the issue. If these show no performance problems, then it could be hardware. Also, you are running base 4.3.3! At the very least, install the latest maintenance level (ML) for 4.3.3 as a matter of course. HTH Mark Taylor |
| |||
| I gotta agree with the first reply. If you can ping it but can't get a login, you've most likely run out of paging space and the system is locked up. The white button is your only option when this happens. Keep a close eye on your paging space (lsps -a). If you're eating into it, then you either need more memory or you need to fix the memory leak (which is likely if you're running Oracle 8). |
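For watching paging space over time, something along these lines could run from cron. It is only a sketch: it uses `lsps -s` (the one-line summary) rather than `lsps -a`, assumes the usual two-line layout of a header followed by "<size>MB <pct>%", and the function names and the 70% default threshold are arbitrary choices for illustration.

```shell
#!/bin/sh
# Read `lsps -s` output on stdin and print the percent-used figure as a bare number.
# Assumes line 2 looks like "      512MB               83%".
paging_pct_used() {
    awk 'NR == 2 { sub(/%/, "", $2); print $2 }'
}

# Print a timestamped warning when usage reaches the limit (default 70).
check_paging() {
    limit=${1:-70}
    pct=$(lsps -s | paging_pct_used)
    if [ -n "$pct" ] && [ "$pct" -ge "$limit" ]; then
        echo "$(date): paging space ${pct}% used (limit ${limit}%)"
    fi
}

# Hypothetical usage, appending to a log of your choosing:
#   check_paging 70 >> /tmp/paging.log
```

If the log shows usage climbing steadily in the hours before a hang, that points at a leak rather than a sudden spike.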
| |||
| I agree with the other posters; this feels like a paging space allocation problem. This may be a prime time to convince the powers that be that it is time to upgrade the OS. AIX 5.1 and above is much more graceful about what it kills when paging space runs short. I wouldn't say that it is perfect, but I have seen a box that lost most of its functionality through overallocation of paging space recover by itself in an hour or so. I have never seen 4.3.3 recover that way. By that time you will most likely need to restart your applications or reboot the server, but at least there is no mystery about what happened. If you are not sure that paging space allocation is causing the problem, then you should try to capture a dump instead of just rebooting. Though then you will need to find someone who can read the dump.... Casey |
| ||||
| First, I've been told that ping could be answered by the adapter itself; I'm not 100% sure whether that is true. It sounds like you have a system that does not have a key switch, so you need to run "sysdumpdev -K". Then when you hit the white button you will get a dump. There are a few gotchas here. If the debugger is enabled, you will get dropped into the debugger instead. I'm not sure about 4.3.3, but the debugger would be lldb, and as I recall "q dump" from lldb will continue and take the dump. The other approach is to disable the debugger, which will involve a reboot. Make sure that /var has enough free space to hold a good clean dump; "sysdumpdev -e" will give you an estimated size. I believe 4.3.3 had a compression option for the dump. I prefer dumping to paging space, which makes the primary dump destination /dev/hd6; the system then copies the dump to /var when it reboots. When you have your dump, get into crash and get the stack trace of the CPU(s). If you need help, post the stack trace here. |
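The space check described above could be scripted roughly like this. It is a sketch under assumptions: that `sysdumpdev -e` prints a line ending in the byte count, and that `df -k` on AIX puts free KB in the third column. The helper names are invented for the example, and the output formats should be checked on a real 4.3.3 box before trusting the parsing.

```shell
#!/bin/sh
# Extract the byte count from `sysdumpdev -e` output (last field of the last line).
dump_estimate_bytes() {
    awk '{ n = $NF } END { print n }'
}

# Extract free KB from AIX-style `df -k` output (third column of the second line).
free_kb() {
    awk 'NR == 2 { print $3 }'
}

# Succeed if a dump of $1 bytes fits in $2 KB of free space.
# awk does the arithmetic because byte counts can overflow shell integers.
dump_fits() {
    awk -v b="$1" -v k="$2" 'BEGIN { exit !(b / 1024 <= k) }'
}

# Defined but not called here: enable the button-forced dump, then compare sizes.
prepare_dump() {
    sysdumpdev -K
    est=$(sysdumpdev -e | dump_estimate_bytes)
    free=$(df -k /var | free_kb)
    if dump_fits "$est" "$free"; then
        echo "/var has room: estimate ${est} bytes, ${free} KB free"
    else
        echo "/var too small: estimate ${est} bytes, ${free} KB free" >&2
        return 1
    fi
}
```

Running `sysdumpdev -l` afterwards shows the current dump settings, which is a quick way to confirm the change took effect.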