| Hello,

at the moment we are running the latest ServiceGuard release with Support Plus patches from June 2003 on a two-node cluster consisting of two L3000 systems running HP-UX 11.11. The systems are connected via a 100Base-T LAN and a 1000Base-T LAN.

Since May we have had two network faults reported by cmcld that were not reproducible. SG stopped the packages depending on the LAN interface, so it was not very good.

I've checked /var/adm/nettl*, and all I could see was "...bad cable connection..." and "...going Offline @ [0/0/0/0] [Cable disconnected]".

What worries me is that nettl never reports when the LAN interface is up again. According to cmcld the outage lasted exactly 20 seconds. Last time it lasted exactly 30 seconds.

I guess that the driver is doing something wrong, maybe when the load is high. Can somebody explain exactly what condition triggers the "cable disconnected" message in the driver?

Any insights? |
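For anyone wanting to tally these events, here is a minimal sketch in Python. It assumes the nettl log has already been formatted to plain text (for example with netfmt) and that the offline message looks like the fragment quoted in the post. The sample lines and the function names `offline_events` and `count_by_path` are hypothetical, made up for illustration; the real log layout may differ.

```python
import re
from collections import Counter

# Pattern built only from the fragment quoted in the post:
#   "...going Offline @ [0/0/0/0] [Cable disconnected]"
# The real formatted log layout is an assumption.
OFFLINE_RE = re.compile(r"going Offline @ \[([^\]]+)\] \[([^\]]+)\]")


def offline_events(lines):
    """Return a (hardware_path, reason) tuple for every matching line."""
    events = []
    for line in lines:
        match = OFFLINE_RE.search(line)
        if match:
            events.append((match.group(1), match.group(2)))
    return events


def count_by_path(events):
    """Count offline events per hardware path."""
    return Counter(path for path, _reason in events)


# Hypothetical sample text, not real log output.
sample = [
    "... bad cable connection ...",
    "... going Offline @ [0/0/0/0] [Cable disconnected]",
    "some unrelated message",
    "... going Offline @ [0/0/0/0] [Cable disconnected]",
]

events = offline_events(sample)
for path, count in count_by_path(events).items():
    print(path, count)
```

Counting per hardware path makes it easier to see whether the same interface is always involved, which is useful input for the driver question above.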
| |||
| Beyond making sure that you are up to date on driver patches, you should check the switch, hub, cable, and anything else involved in the topology. MC/SG is only as good as the underlying architecture. You should ensure that you have done things like configuring standby NICs, running MC/SG heartbeat packets over multiple LANs, providing multiple sources of power, etc. Also, if a LANic outage does not warrant the package going down, you can configure the package to stay up.

Good luck,
HP_BRS

(In reply to Ulrich Windl <Ulrich.Windl@RZ.Uni-Regensburg.DE>, message news:<m31xuno8dn.fsf@pc5234.klinik.uni-regensburg.de>) |
| |||
| born2veg@yahoo.com (HP_BRS) writes:

> Beyond making sure that you are up-to-date on driver patches, you
> should check the switch/hub/cable/other that is involved in the
> topology. MC/SG is only as good as the underlying architecture. You
> should ensure that you have done things like configuring standby NICs,

Standby NICs will not avoid the LAN switching. I think the "cable disconnect" is triggered when it should not be. Using a standby NIC will reduce the probability of two such events being triggered simultaneously, but it will not solve the problem.

> running MC/SG heartbeat packets over multiple LANs, multiple sources
> of power, etc. Also, if a LANic outage does not warrant the package
> going down, you can configure the package to stay up.

Both nodes have a three-way power path, dual paths to disks, and two LANs for heartbeat. However, if the LAN a package uses fails in the way described, the package is taken down and restarted for nothing. If the package is a huge database application that serves hundreds of users, you don't like package "switching for fun" very much.

Anyway, patches are quite up to date, and I'll investigate the problem further. Thanks for the comment.

Ulrich |
| |||
| Ulrich,

one advantage of having multiple NICs within the same server is that a single NIC failure becomes a local event, and the package won't cycle down and up because of the failure. The pointers in the TCP stack are just changed to make use of the other NIC.

Another advantage is that if you see the "failure" message on both NICs at the same time, you can most likely eliminate the cables or the NICs as the problem. Then, if you have both NICs connected to separate switches, you can eliminate the switches, because both probably didn't fail at the same time. That leaves the networking configuration or software on the server, or a network storm that made the server think the NICs had failed. You can find that by using tracing and logging on the network in question.

Good luck on this troubleshooting... |
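The elimination reasoning in this post can be written down as a small decision function. This is only a sketch of the heuristic as described, not a diagnostic tool: the function name `likely_causes`, its two parameters, and the suspect labels are assumptions introduced here, and the post itself only says the eliminated items are "most likely" not the cause.

```python
def likely_causes(both_nics_failed_together, nics_on_separate_switches):
    """Narrow down suspects using the elimination steps from the post.

    Both arguments are booleans supplied by the person troubleshooting.
    The result is a heuristic shortlist, not a proof.
    """
    suspects = {
        "cable",
        "NIC",
        "switch",
        "server network configuration or software",
        "network storm",
    }
    if both_nics_failed_together:
        # Two cables or two NICs failing at the same moment is unlikely.
        suspects -= {"cable", "NIC"}
        if nics_on_separate_switches:
            # Two independent switches probably did not fail together.
            suspects.discard("switch")
    return suspects


if __name__ == "__main__":
    for cause in sorted(likely_causes(True, True)):
        print(cause)
```

If only one NIC reported the failure, nothing can be ruled out this way, so the function returns the full list of suspects.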
| ||||
| Exactly.

(In reply to "Jim Carter" <root@arroglx01.myprivateshopper.com>, message news:<pan.2003.09.20.04.55.46.61498@arroglx01.myprivateshopper.com>) |