03-10-2008, 05:54 PM
Steve M. Fabac, Jr.
 
Re: Trying to identify why a system rebooted by itself

Pat Welch wrote:
> Steve M. Fabac, Jr. wrote:
>> My client's system has started throwing the following message
>> on the system console (retyped from description read over
>> the phone):
>>
>> WARNING: allocb failed - NSTRPAGES exceeded
>>
>> I read TA116684 on how to debug failures and the
>> items on kernel tuning and drivers don't seem to be
>> applicable as this just started happening.
>>
>> 4. Failing hardware
>>
>> 5. External network hardware misbehaving
>>
>> 6. Extremely high network traffic
>>
>> Items 4, 5, and 6 seem to be possible candidates.
>>
>> The client called today when he simply rebooted the system
>> (as that was what I was advising him to do on previous
>> occasions) and then the system spontaneously rebooted about
>> an hour later.
>>
>> I connected via SSH to the running system and as I was
>> monitoring it, it rebooted again.
>>
>> I found the following in /usr/adm/syslog:
>> Mar 7 14:55:05 vetreal bootpd[1359]: IP address not found:
>> 192.168.160.143
>> Mar 7 15:00:53 vetreal TLW param1=-1
>> Fri Mar 7 15:00:53 CST 2008 reboot initated
>> Mar 7 15:03:44 vetreal syslogd: restart
>>
>> The two odd things that jump out above are TLW param1=-1
>> and "reboot initiated", both at 15:00:53.
>>
>> Does anyone recognize these two entries that seem to be related?
>>
>> I've never seen a system log with the message "YYYY reboot initiated"
>>
>> Checking /usr/adm/syslog:
>>
>> # grep "2008 reboot initated" /usr/adm/syslog
>> Mon Jan 28 13:19:20 CST 2008 reboot initated
>> Sat Feb 16 13:35:05 CST 2008 reboot initated
>> Mon Feb 25 17:30:05 CST 2008 reboot initated
>> Wed Mar 5 16:37:54 CST 2008 reboot initated
>> Fri Mar 7 12:38:13 CST 2008 reboot initated
>> Fri Mar 7 14:09:16 CST 2008 reboot initated
>> Fri Mar 7 14:16:23 CST 2008 reboot initated
>> Fri Mar 7 15:00:53 CST 2008 reboot initated
>> # grep "2007 reboot initated" /usr/adm/syslog
>> Tue May 1 00:04:44 CDT 2007 reboot initated
>> # grep "2006 reboot initated" /usr/adm/syslog
>> # grep "2005 reboot initated" /usr/adm/syslog
>> #
>>
>> Syslog starts Jan 31 2005.
>>
>> I see that it has been occurring but not to the level that it
>> has today.
>>
>>
>>

>
> Is this on an HP or Compaq system, with the full EFS installed?
>
> I've seen similar things from cpqmon, the EFS health monitor.
>
> But nothing matching that particular entry. Do you have some other HW
> monitor installed?
>
> Is 'initiated' really misspelled 'initated' that way? Might be
> worthwhile to run strings on binaries looking for 'reboot' or that
> particular mis-spelling of initiated.
>


That was a good suggestion. I found the offending script in:

#
# Purpose:
# To install and umpgrade the Fault-Freedom II Driver
# Description:
# arg 1: Name of step to perform.
# arg 2: Keyword list, e.g. UPGRADE.
# arg 3: space-separated list of packages.
#*****************************************************************

LOGFILE=/tmp/ff2install.log
INSTALL_DIR=/usr/local/ff2
HALTSYS=`l -Wv /etc/haltsys | awk '{ print $11 }'`
SHUTDOWN=`l -Wv /etc/shutdown | awk '{ print $11 }'`
INITFILE=`l -Wv /etc/inittab | awk '{ print $11 }'`
INITBASE=/etc/conf/cf.d/init.base
....

###############
# /etc/reboot #
###############
grep Ff2 ${HALTSYS} > /dev/null 2>&1
if [ "$?" = "1" ]
then
ex_cmd cp ${HALTSYS} /etc/haltsys.preff2
sed -e '/^haltsys/a\
[ -x /usr/local/ff2/bin/Ff2 ] && \
{\
/usr/local/ff2/bin/Ff2 shutdown\
DATE=\`date\`\
echo "${DATE} reboot initated" >> /usr/adm/syslog\
}' < /tmp/haltsys$$ > ${HALTSYS}
ex_cmd rm /tmp/haltsys$$
ex_cmd chmod 700 ${HALTSYS}
ex_cmd chown root ${HALTSYS}
ex_cmd chgrp sys ${HALTSYS}
fi
"ccs" line 250 of 649 --38%--
:!pwd
/opt/K/1776/FF2/1.0.2W/cntl/packages/FF2
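
A search of the kind Pat suggested might be sketched as follows. The
function name find_string_in_files is hypothetical, and the sketch
assumes a POSIX-style sh with find, grep, and (where available)
strings; an older Bourne shell may need small changes.

```sh
#!/bin/sh
# Hypothetical helper (not from the original thread):
#   find_string_in_files PATTERN DIR...
# Prints every regular file under the given directories whose
# printable strings contain PATTERN.
find_string_in_files() {
    pattern=$1
    shift
    # Assumption: fall back to cat where strings(1) is not installed,
    # so that grep simply reads the raw file contents.
    if command -v strings >/dev/null 2>&1
    then
        extract=strings
    else
        extract=cat
    fi
    find "$@" -type f -print 2>/dev/null |
    while IFS= read -r file
    do
        # -F treats the pattern as a fixed string, not a regex.
        if "$extract" "$file" 2>/dev/null | grep -F -e "$pattern" >/dev/null
        then
            echo "$file"
        fi
    done
}
```

Called as, say, find_string_in_files 'reboot initated' /etc /opt
(the directory list is only an example), it lists candidate files
to inspect by hand.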

Fault-Freedom II from 1776 Software is installed but
not running (no heartbeat connection).

It's a long story. When I was called in during July 2001, the
client was trying to get FF2 working, but whenever they
tried to manually fail over to the backup server, the system
would lock up. So it was left installed, with all of its
daemons disabled from starting.

I wrote scripts to manually switch identities between the
two machines and mirror the data directories overnight.

Now I've got to look to see what is still running and why
it has suddenly shut the server down four times in one day.
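
To see how the shutdowns cluster, the "reboot initated" lines in
the syslog can be tallied per day. This is only a sketch: the
function name count_reboots_by_day is hypothetical, and it assumes
every matching line has the date(1)-style layout shown in the grep
output quoted above (weekday, month, day, time, zone, year) and an
awk that supports the "in" operator.

```sh
#!/bin/sh
# Hypothetical helper (not from the original thread):
#   count_reboots_by_day [FILE]
# Prints "COUNT YEAR MONTH DAY" for each day that has at least one
# "reboot initated" line, in order of first appearance.
count_reboots_by_day() {
    grep 'reboot initated' "${1:-/usr/adm/syslog}" |
    awk '{
        # Fields: $1 weekday, $2 month, $3 day, $4 time, $5 zone, $6 year
        day = $6 " " $2 " " $3
        if (!(day in n)) order[++k] = day
        n[day]++
    }
    END {
        for (i = 1; i <= k; i++) print n[order[i]], order[i]
    }'
}
```

Run against the entries listed earlier, this should give one line
per day, with Mar 7 2008 showing a count of 4.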

What we did find was the Cisco switches were showing a
lot of chatter. After hours, the client powered the
switches down and then back on and the chatter subsided.

netstat -m, run today after the Cisco switches were power-cycled on
Friday, shows all zeros in the fail column, where before they were
getting 300+ in multiple buffers.
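
A quick way to spot nonzero failure counts is to filter the
netstat -m output. The sketch below assumes the fail count is the
last field on each data row; that column position is an assumption
and should be checked against the real output on the system. The
function name streams_failures is hypothetical.

```sh
#!/bin/sh
# Hypothetical helper (not from the original thread):
#   netstat -m | streams_failures
# Prints only the rows whose last field is a number greater than 0.
# Assumption: the last field on each data row is the "fail" column.
streams_failures() {
    awk '$NF ~ /^[0-9]+$/ && $NF + 0 > 0'
}
```

No output means no row reported a failure; header lines are
skipped because their last field is not numeric.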

So it looks like
>> 5. External network hardware misbehaving

is the correct assessment.



--
Steve Fabac
S.M. Fabac & Associates
816/765-1670