Re: Weird, suspicious server failure

On Fri, 02 Dec 2005 11:44:40 +0100, Steven Mocking <s.mocking@gmail.com> wrote:
> Sorry for the multipost from alt.linux, but this group is better
> suited.
>
> Last week I set up a NIS master on our fileserver. Clients worked fine.
> Then, out of the blue, the machine started to fail rather horribly
> after running cron.daily. Server processes continued to operate, but
> ssh and local logins resulted in a hanging terminal. Samba was also
> non-functional.
>
> Over ssh, Ctrl-C and Ctrl-4 couldn't recover the prompt. We managed to
> gain the ability to execute commands on it remotely as root by a not
> entirely intentional backdoor, so we could investigate and fix the
> machine. After a while we managed to get a process list. There were a
> few dozen defunct sshd children and smb also had children. The most
> striking was a whole bunch of crond children which seemed to have hung
> up.
Ugh, posting through Google is problematic, because they wrap the lines,
making the post so hard to read that you seriously risk losing the best
comments.
> root 1177 0.0 0.0 1568 148 ? S Nov07 0:02 crond
> root 1082 0.0 0.0 1568 108 ? S 04:42 0:00 \_ CROND
> smmsp 1084 0.0 0.0 5760 684 ? S 04:42 0:00 | \_ [sendmail]
> root 1091 0.0 0.0 1564 64 ? D 05:01 0:00 \_ CROND
> root 1093 0.0 0.0 1564 64 ? D 05:01 0:00 \_ CROND
> root 1095 0.0 0.0 1564 64 ? D 05:01 0:00 \_ CROND
> root 1109 0.0 0.0 1564 64 ? D 06:01 0:00 \_ CROND
> root 1111 0.0 0.0 1564 64 ? D 06:01 0:00 \_ CROND
> root 1113 0.0 0.0 1564 64 ? D 06:01 0:00 \_ CROND
> The list continued like this until the present - three hanging children
> every hour. Syslogd was also still running, even though
> /var/log/messages was empty:
> root 1037 0.0 0.0 1464 564 ? S Nov07 0:11 syslogd -m 0
> Strangely, the /var/log/messages file seemed empty. We then attempted
What about /var/log/messages.* ?
Cron.daily runs logrotate, no? That would leave /var/log/messages empty.
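If you want a quick look at whether the rotated copies are there and
how big they are, something like the following would do. It is only a
sketch: I am assuming the usual logrotate naming (messages.1,
messages.2, possibly with .gz), and rotated_logs is a name I made up
for the example.

```python
import glob
import os


def rotated_logs(base="/var/log/messages"):
    """Return (path, size_in_bytes) for the log and its rotated copies."""
    # Assumption: rotated copies are named base.1, base.2, base.1.gz, ...
    # Your logrotate configuration may use a different scheme.
    pattern = glob.escape(base)
    paths = sorted(set(glob.glob(pattern) + glob.glob(pattern + ".*")))
    return [(p, os.path.getsize(p)) for p in paths]


if __name__ == "__main__":
    for path, size in rotated_logs():
        print(f"{size:>10}  {path}")
```

An empty messages next to a messages.1 of reasonable size would point
to plain log rotation rather than to someone wiping the log.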
> to restart sshd to see if that would allow logins. This got stuck and
> blocked the only way we could execute commands on the machine. Minutes
> later the NIS server failed as well and clients were beginning to have
> problems. We had to reboot the machine (after informing a few dozen
> customers) and expected it to break down during boot. Interestingly
> though, it booted and still runs normally.
> But now, /var/log/messages was no longer empty - there was a rather
> conflicting and suspicious line at the top (note the timestamp).
>
> Dec 1 14:42:45 tk7 syslogd 1.4.1: restart.
> Dec 1 04:08:03 tk7 su(pam_unix)[1026]: session opened for user news by
> (uid=0)
What is your geographical location? Your timezone?
> Dec 1 04:08:03 tk7 su(pam_unix)[1026]: session closed for user news
> Dec 1 04:42:00 tk7 CROND[1078]: (root) CMD (run-parts
> /etc/cron.monthly)
> Dec 1 04:42:00 tk7 CROND[1083]: (root) CMD (root run-parts
> /etc/cron.monthly)
>
> /var/log/boot.log also seems to have moved back in time:
>
> Dec 1 15:45:47 tk7 sshd:
> Dec 1 15:45:47 tk7 sshd: succeeded
> Dec 1 15:45:48 tk7 tomcat: Starting tomcat:
> Dec 1 15:45:48 tk7 tomcat: tomcat.
> Dec 1 15:45:49 tk7 tomcat: Usage: /etc/init.d/tomcat
> {start|stop|restart|force-reload}
> Dec 1 15:45:49 tk7 rc: Starting tomcat: failed
mmm. Init runs "/etc/rc.d/rc 5" (or some other runlevel), and
/etc/rc.d/rc runs the scripts with "start" or "stop". I would
investigate why this script is complaining. But that may not
be the most urgent; this could just be some old cruft. (I don't
run apache, so I am not a good source here, just my two cents.)
> Dec 1 14:48:24 tk7 ntpd: succeeded
Ntpd warps your time. One hour and three minutes, it seems. Use
/sbin/hwclock to check what the hardware clock is doing. What time
was it on your "wall clock" when you rebooted, do you know?
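Which pair of lines you compare changes the figure, so it may be
easier to compute the gaps than to eyeball them. A small sketch,
assuming every line starts with a syslog-style timestamp (the wrapped
continuation lines in your post would have to be joined first) and
assuming the year 2005, since syslog does not record one. The function
names are made up for the example.

```python
from datetime import datetime

# syslog timestamps carry no year; 2005 is an assumption taken from
# the date of the post.
YEAR = 2005


def parse_stamp(line, year=YEAR):
    """Parse the leading 'Dec 1 15:45:49' of a syslog-style line."""
    month, day, clock = line.split()[:3]
    return datetime.strptime(f"{year} {month} {day} {clock}",
                             "%Y %b %d %H:%M:%S")


def gap_seconds(first_line, second_line):
    """Seconds from first_line to second_line (negative = backwards)."""
    delta = parse_stamp(second_line) - parse_stamp(first_line)
    return int(delta.total_seconds())


def backward_jumps(lines):
    """Yield (previous_line, line, seconds_back) where time runs backwards."""
    for prev, cur in zip(lines, lines[1:]):
        gap = gap_seconds(prev, cur)
        if gap < 0:
            yield prev, cur, -gap


if __name__ == "__main__":
    boot_log = [
        "Dec 1 15:45:47 tk7 sshd: succeeded",
        "Dec 1 15:45:49 tk7 rc: Starting tomcat: failed",
        "Dec 1 14:48:24 tk7 ntpd: succeeded",
    ]
    for before, after, secs in backward_jumps(boot_log):
        print(f"{secs // 60} min {secs % 60} s back at: {after}")
```

Note that a gap between two adjacent lines also contains whatever real
time passed between them, so it only bounds the size of the clock step;
it does not give it exactly.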
> Dec 1 14:48:24 tk7 ntpd: ntpd startup succeeded
> Dec 1 14:48:26 tk7 nfs: Starting NFS services: succeeded
> Dec 1 14:48:26 tk7 nfs: rpc.rquotad startup succeeded
> Dec 1 14:48:27 tk7 nfs: rpc.nfsd startup succeeded
> Dec 1 14:48:27 tk7 nfs: rpc.mountd startup succeeded
>
> I'm at a loss here and don't like it one bit. The machine is rather
> mission-critical and the possibility of a rootkit comes to mind, even
> though chkrootkit couldn't find a trace of it.
I don't think it smells terribly strongly of a rootkit quite yet. At
least, not if you find that messages.1, messages.2 etc. are there and have
reasonable contents. A hardware problem, perhaps. A disk failure, memory
failure... Can you give some data about the hw config?
What does cron.daily do on your system? That is, what is in your
/etc/cron.daily?
> Does this sound familiar to someone?
Not to me. I am short of explanations for what was bugging
the computer after the cron.daily. Perhaps it's a coincidence;
cron.daily may be a false lead. But I would like to see suggestions
from the newsgroup as to what could have made e.g. sshd behave like
that.
If a rootkit is a serious threat, that is, if the computer holds data
that could be easily exchanged for serious money by a resourceful
attacker: rebuild the system from rpms, on a separate computer. That is,
buy a (set of) new disk(s) and rebuild the disks containing the system
software, the boot sectors, and the kernel. Supposing you already have
all customer data on separate partitions, and no customer-controlled
software is running with root privileges, you should be able to swap in
the newly built system disk with minimal disruption.
Once you have the idea of a possible rootkit bugging you, checks like
rpm -V (on rpm-based distributions) are not completely satisfactory, because
the rootkit may include manipulation of the rpm database. Even the kernel
could be manipulated, and cover the traces no matter what you do. You would
have to build an equivalent rpm database on an uncompromised computer, and
have rpm check a mirror of the server against it. But this is more work,
probably, than just building a new system disk.
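To illustrate the idea of checking a mirror against a reference built
elsewhere, here is a minimal sketch. It is not rpm and does not read
the rpm database; it simply compares SHA-256 digests of two directory
trees, and all the names in it are made up for the example. rpm -V
also looks at permissions, ownership and so on, which this ignores.

```python
import hashlib
import os


def file_digest(path):
    """SHA-256 of a file, read in chunks to keep memory use small."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root):
    """Map relative path -> digest for every regular file under root."""
    manifest = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            # Symlinks are skipped so the check never leaves the tree.
            if os.path.isfile(full) and not os.path.islink(full):
                manifest[os.path.relpath(full, root)] = file_digest(full)
    return manifest


def compare(trusted, suspect):
    """Return (changed, missing, extra) relative paths, each sorted."""
    changed = sorted(p for p in trusted
                     if p in suspect and trusted[p] != suspect[p])
    missing = sorted(p for p in trusted if p not in suspect)
    extra = sorted(p for p in suspect if p not in trusted)
    return changed, missing, extra
```

You would run build_manifest on the freshly installed reference tree
and on the mirror of the server, both from the uncompromised machine,
and then look at what compare reports. Expect configuration files to
show up as changed for perfectly innocent reasons.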
-Enrique