Follow Up: RANT: Why Sun is losing to Linux
Sun Solaris Administration forums, Solaris Operating System category
As many of you are aware, back in December 2006 I took issue with a feature of Sun's SunFire mid-range server line: the automatic blacklisting of components through the Component Health Status (CHS) feature. For those unaware, this feature is intended to blacklist components the system has, for whatever reason, determined to be faulty. This keeps the component from being brought back online and causing system instability.

Conceptually this appears to be a good idea. In reality the system can, and does, blacklist non-faulty components. This is exactly what happened to me. For reasons unknown, during POST (after issuing the "setkey on" command) the system blacklisted several non-faulty components (memory modules and CPUs).

So what's the problem? Well, for some reason Sun decided the end user should not be given the ability to re-enable the blacklisted components (please note that CHS blacklisting is different from the blacklisting capability provided to the end user; they are two separate blacklists). From that point forward, without assistance from Sun, perfectly good parts were forever off limits.

It was argued by some in this forum that I was unqualified to determine whether these parts were faulty, though I knew them to be fully problem free. Falling back to a previous version of the firmware (5.13.0), which does not honor the CHS status, allowed the system to use the components marked as failed. Some argued that the older release may not test the components as thoroughly as later releases...a valid argument. In the end the system ran flawlessly with the older firmware and the "faulty" components for a little over four months. Until...I was able to obtain the Service Password (SP) for the system in question. The firmware was updated to the latest version available at the time (5.20.5).
The Service Password was obtained (the SP depends on the SC firmware version, among other things), and the "faulty" components were removed from the CHS blacklist. Since the components were re-enabled the system has been in use and problem free.

So what happened? Why did the system remove the components as faulty? Good question. Here is the CHS record (yes, the system stores a history for each component) for one of the CPUs in question:

sunfire:SC[service]> showchs -v -c /N0/SB0/P2
Total # of records: 3

Component : /N0/SB0/P2
Time Stamp : Fri Apr 27 12:58:44 MDT 2007
New Status : OK
Old Status : Faulty
Event Code : Field Engineer Supplied Status
Initiator : Field Engineer
Message : Set Erroneously

Component : /N0/SB0/P2
Time Stamp : Sat Dec 23 14:14:04 MST 2006
New Status : Faulty
Old Status : Faulty
Event Code : Diag
Initiator : POST
Message : 1.SF3800.FAULT.POST.LPOST.61--.16-0.1

Component : /N0/SB0/P2
Time Stamp : Sat Dec 23 14:13:51 MST 2006
New Status : Faulty
Old Status : OK
Event Code : Diag
Initiator : POST
Message : 1.SF3800.FAULT.POST.LPOST.61--.16-0.1

And for one of the memory modules:

sunfire:SC[service]> showchs -v -c /N0/SB0/P2/B0/D0/L0
Total # of records: 2

Component : /N0/SB0/P2/B0/D0/L0
Time Stamp : Fri Apr 27 13:00:30 MDT 2007
New Status : OK
Old Status : Faulty
Event Code : Field Engineer Supplied Status
Initiator : Field Engineer
Message : Set Erroneously

Component : /N0/SB0/P2/B0/D0/L0
Time Stamp : Thu Dec 21 17:26:38 MST 2006
New Status : Faulty
Old Status : OK
Event Code : Diag
Initiator : POST
Message : 1.SF3800.FAULT.POST.LPOST.61--.16-0.1

Perhaps the message section of the record contains more specific details, but there doesn't appear to be much there.

The more troubling aspect of the CHS feature has less to do with the feature itself than with Sun's position on allowing the end user to obtain the SP. My research turned up other people experiencing similar problems with CHS, along with suggestions on how to obtain the SP.
Obviously the most popular "solution" was to have a service contract. The next was to pay Sun to send a service tech to resolve the problem. And finally, to open a service case at a cost of $150/hr, minimum 2 hours. The last seemed reasonable until I tried it. Sun wanted the configuration, the serial number, and to have someone come by and check out the system before they would help (come by and check it out? For what...the reason I'm opening a service case with Sun is to resolve a problem with "faulty" components). It sounded like a costly and time-consuming way just to get a few components re-enabled.

I spoke to a Sun salesperson I know, hoping they could do me a favor and get the SP for me. While this person had no clue about CHS or the SP, they did speak with a Sun service tech who stated, unequivocally, that the SP was not to be provided to the customer under any circumstances. That put the brakes on that idea.

So why have I spent all this time writing paragraph upon paragraph about my experience? Well, to help others who may find themselves in a similar position, and to raise two points that hopefully Sun may someday address:

1. CHS is not foolproof. As my experience has shown, the components functioned flawlessly when in use with the older firmware, and they continued to function flawlessly when re-enabled with the newer firmware.

2. What is Sun's policy on supplying the end user with the Service Password? In researching this I came across a variety of answers, from obtaining a service contract, to paying hourly for a tech to come on site, to opening a support case with Sun, to "under no circumstances should the SP be provided to the end user".

As this equipment ages it's going to find its way into the hands of hobbyists. Perhaps Sun should think about making the password available to hobbyists for free, or providing a firmware update which doesn't require a service password at all.

Josh
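For anyone who wants to go through their own CHS history, the showchs -v records quoted above look regular enough to pick apart with a short script. What follows is only a sketch in Python, assuming the SC prints one "Field : value" pair per line and starts each record with a Component line. The function names (parse_chs_records, cleared_faults) and the SAMPLE text are made up for illustration and are not part of any Sun tool; the sample data is copied from the CPU record above.

```python
from typing import Dict, List

# Field names as they appear in the showchs -v records quoted in the post.
FIELDS = (
    "Component", "Time Stamp", "New Status", "Old Status",
    "Event Code", "Initiator", "Message",
)

# Sample text copied from the post (assumed layout: one field per line).
SAMPLE = """\
sunfire:SC[service]> showchs -v -c /N0/SB0/P2
Total # of records: 3
Component : /N0/SB0/P2
Time Stamp : Fri Apr 27 12:58:44 MDT 2007
New Status : OK
Old Status : Faulty
Event Code : Field Engineer Supplied Status
Initiator : Field Engineer
Message : Set Erroneously
Component : /N0/SB0/P2
Time Stamp : Sat Dec 23 14:14:04 MST 2006
New Status : Faulty
Old Status : Faulty
Event Code : Diag
Initiator : POST
Message : 1.SF3800.FAULT.POST.LPOST.61--.16-0.1
Component : /N0/SB0/P2
Time Stamp : Sat Dec 23 14:13:51 MST 2006
New Status : Faulty
Old Status : OK
Event Code : Diag
Initiator : POST
Message : 1.SF3800.FAULT.POST.LPOST.61--.16-0.1
"""


def parse_chs_records(text: str) -> List[Dict[str, str]]:
    """Split showchs -v style text into one dict per record."""
    records: List[Dict[str, str]] = []
    current: Dict[str, str] = {}
    for line in text.splitlines():
        # Split at the first colon only, so time stamps keep their colons.
        key, sep, value = line.partition(":")
        key = key.strip()
        if not sep or key not in FIELDS:
            continue  # prompt line, record count, blank lines
        if key == "Component" and current:
            records.append(current)  # a new Component line starts a new record
            current = {}
        current[key] = value.strip()
    if current:
        records.append(current)
    return records


def cleared_faults(records: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return the records where a Faulty status was set back to OK."""
    return [
        r for r in records
        if r.get("Old Status") == "Faulty" and r.get("New Status") == "OK"
    ]


if __name__ == "__main__":
    for rec in cleared_faults(parse_chs_records(SAMPLE)):
        print(rec["Component"], rec["Time Stamp"], rec["Initiator"])
```

Run against a saved console log, this would at least show which components were marked Faulty by POST and which of those were later reset, without scrolling through the raw records by hand.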