I have two 170 MHz, 256M Sparc 5's running 3.1 on a small network connected via T-1 to the Internet. They have identical configurations. (They were built on the same day from the same checklist.) They run a handful of services: SSH, DNS, SMTP and POP. They've been running smoothly without a reboot for about 4 months. The load average seldom exceeds .5, as they really aren't asked to do very much.

Yesterday afternoon around 13:00, both of them simultaneously became unresponsive to all network traffic -- TCP, UDP, ARP, ICMP, everything. Some minutes later (I estimate 10, but could easily be off) the problem went away; but it soon came back. Each time, there was at least one entry of the form "le0: dropping chained buffer" and a number of syslog entries indicating that the message was repeated.

I rebooted one machine at around 14:50; while that temporarily cleared the problem, it did not appear to have any lasting effect, as that system showed the same symptom again within 20 minutes. However, neither system has shown any sign of this problem since about 16:00 yesterday. I have no explanation for that.

Sniffing network traffic with another system didn't show much beyond normal traffic and various systems around the 'net infected with the M$ RPC DCOM worm and looking for more.

This error is apparently rare: a search of the last several years' worth of archives of the OpenBSD, NetBSD, and FreeBSD mailing lists turned up only a few mentions of it, and all of them were of the form "What the hell is this?". A Google search was similarly fruitless.

So I went to the source. I have traced this to the following bit of code in am7990.c (in sys/dev/ic):

    } else if ((rmd.rmd1_bits & (LE_R1_STP | LE_R1_ENP)) !=
        (LE_R1_STP | LE_R1_ENP)) {
            printf("%s: dropping chained buffer\n",
                sc->sc_dev.dv_xname);
            ifp->if_ierrors++;

This is inside am7990_rint(), which handles data receive interrupts.
It's after code which looks for framing errors and CRC errors, so I think those can be ruled out.

I managed to find the manufacturer's spec sheet for the LANCE chip (which is what this is about) on AMD's web site, thanks to a URL given in am7990reg.h. (It's 17881.pdf, if you want to find it on AMD's site.)

Between reading the spec sheet and the driver source (wow, it's been a LONG time since I've done this) my impression is that the reason the interface shut down was that ifp->if_ierrors became large enough to merit taking it offline for a bit, then resetting it and trying again. In other words, I think that was a symptom, not the cause.

Trying to find the cause brings me back to that bit of code above and what conditions can trigger it. In am7990reg.h, we find:

    #define LE_R1_STP 0x02 /* start of packet */
    #define LE_R1_ENP 0x01 /* end of packet */

so I believe the test above is checking to see if the corresponding bits (0x03) in rmd.rmd1_bits are both set. This seems to match up with page 28 of the 7990 ("LANCE") data sheet, which says:

    STP  START OF PACKET indicates that this is the first buffer used
         by the C-LANCE for this packet. It is used for data chaining
         buffers.

    ENP  END OF PACKET indicates that this is the last buffer used by
         the C-LANCE for this packet. It is used for data chaining
         buffers. If both STP and ENP are set, the packet fits in one
         buffer and there is no data chaining.

It's that last sentence that has me confused. I think the piece of code above is testing for exactly that condition, so I would expect that condition to be true if data was not being chained (across multiple buffers). (Aside: the LANCE data sheet goes on about this for a while, and diagrams can be found on page 32.)

So the best I can come up with at the moment is that something has gone wrong while receiving a packet, and it's gone wrong at a pretty low level, i.e. this doesn't seem to have anything to do with higher network layers.
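
Going back to the test in am7990_rint() for a moment: one way to check a reading of it is to evaluate it for all four STP/ENP combinations. What follows is only a minimal standalone sketch, not driver code. The two bit values are the ones quoted from am7990reg.h; the helper names would_drop(), describe() and check_reading() are made up for illustration, the labels in describe() are one reading of the data sheet text quoted above, and the sketch assumes the earlier branches of the if/else chain (the error checks) did not match.

```c
#include <assert.h>
#include <stdint.h>

/* Bit values as quoted from am7990reg.h. */
#define LE_R1_STP 0x02 /* start of packet */
#define LE_R1_ENP 0x01 /* end of packet */

/*
 * Hypothetical helper (not in the driver): returns 1 when the quoted
 * "dropping chained buffer" test would be true for these descriptor
 * bits, 0 otherwise. Assumes the earlier error branches did not match.
 */
int
would_drop(uint8_t rmd1_bits)
{
	return (rmd1_bits & (LE_R1_STP | LE_R1_ENP)) !=
	    (LE_R1_STP | LE_R1_ENP);
}

/*
 * Hypothetical labels for each STP/ENP combination, based on the data
 * sheet wording: STP marks the first buffer, ENP marks the last.
 */
const char *
describe(uint8_t rmd1_bits)
{
	switch (rmd1_bits & (LE_R1_STP | LE_R1_ENP)) {
	case LE_R1_STP | LE_R1_ENP:
		return "whole packet in one buffer";
	case LE_R1_STP:
		return "first buffer of a chained packet";
	case LE_R1_ENP:
		return "last buffer of a chained packet";
	default:
		return "middle buffer of a chained packet";
	}
}

/* Evaluate the test for all four combinations. */
void
check_reading(void)
{
	assert(would_drop(LE_R1_STP | LE_R1_ENP) == 0); /* both set */
	assert(would_drop(LE_R1_STP) == 1);             /* STP only */
	assert(would_drop(LE_R1_ENP) == 1);             /* ENP only */
	assert(would_drop(0) == 1);                     /* neither  */
}
```

With these definitions, would_drop() returns 0 only when STP and ENP are both set. Because the comparison is "!=", the printf branch would be taken for any descriptor that is only part of a packet, and skipped for a packet that fits in one buffer.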
The problem also seems to have something to do with the driver's method of storing the packet -- i.e., it doesn't look like a malformed packet on the wire.

I think I'll stop here, because I think one explanation for my confusion is that I've misread something or made another kind of mistake. If I have, I'd appreciate it if someone could point it out. But whether I have or haven't, any guidance on what might be causing this (and of course, how I can fix it) would be most welcome.

Thanks,
---Rsk