Unix Technical Forum

Follow Up: RANT: Why Sun is losing to Linux

Posted in the Sun Solaris Administration forum, part of the Solaris Operating System category.


01-16-2008, 10:34 AM
Josh McKee
 

As many of you are aware, back in December 2006 I took issue with a
feature of Sun's SunFire mid-range server line: specifically, the
automatic blacklisting of components through the Component Health Status
(CHS) feature. For those unaware, this feature is intended to blacklist
components the system has, for whatever reason, determined to be faulty.
This is done in order to keep the component from being brought back
online and causing system instability. Conceptually this appears to be a
good idea. In reality the system can, and does, blacklist non-faulty
components. This is exactly what happened to me.

For reasons unknown, during POST (after issuing the "setkey on"
command) the system blacklisted several non-faulty components (memory
modules and CPUs). So what's the problem? Well, for some reason Sun
decided the end user should not be given the ability to re-enable the
blacklisted components (please note that CHS blacklisting is different
from the blacklisting capability provided to the end user; they are two
separate lists). From this point forward, without assistance from Sun,
perfectly good parts were forever off limits. It was argued by some in
this forum that I was unqualified to determine whether these parts were
faulty, though I knew them to be fully problem free. Falling back to a
previous version of firmware (5.13.0), which does not honor the CHS
status, allowed the system to utilize those components marked as failed.
Some argued that the older release of code may not test the components
as thoroughly as later releases...a valid argument. In the end the
system ran flawlessly with the older release of firmware and the
"faulty" components for a little over four months.

That lasted until I was able to obtain the Service Password (SP) for the
system in question. The firmware was updated to the latest version
available at the time (5.20.5), the Service Password was obtained (the
SP is dependent on the SC firmware version, among other things), and the
"faulty" components were removed from the CHS blacklist. Since the
components were re-enabled, the system has been in use and problem free.

So what happened? Why did the system mark the components as faulty?
Good question. Here is the CHS record (yes, the system stores a history
for each component) for one of the CPUs in question:

sunfire:SC[service]> showchs -v -c /N0/SB0/P2
Total # of records: 3
Component : /N0/SB0/P2
Time Stamp : Fri Apr 27 12:58:44 MDT 2007
New Status : OK
Old Status : Faulty
Event Code : Field Engineer Supplied Status
Initiator : Field Engineer
Message : Set Erroneously

Component : /N0/SB0/P2
Time Stamp : Sat Dec 23 14:14:04 MST 2006
New Status : Faulty
Old Status : Faulty
Event Code : Diag
Initiator : POST
Message : 1.SF3800.FAULT.POST.LPOST.61--.16-0.1

Component : /N0/SB0/P2
Time Stamp : Sat Dec 23 14:13:51 MST 2006
New Status : Faulty
Old Status : OK
Event Code : Diag
Initiator : POST
Message : 1.SF3800.FAULT.POST.LPOST.61--.16-0.1

And for one of the memory modules:

sunfire:SC[service]> showchs -v -c /N0/SB0/P2/B0/D0/L0
Total # of records: 2
Component : /N0/SB0/P2/B0/D0/L0
Time Stamp : Fri Apr 27 13:00:30 MDT 2007
New Status : OK
Old Status : Faulty
Event Code : Field Engineer Supplied Status
Initiator : Field Engineer
Message : Set Erroneously

Component : /N0/SB0/P2/B0/D0/L0
Time Stamp : Thu Dec 21 17:26:38 MST 2006
New Status : Faulty
Old Status : OK
Event Code : Diag
Initiator : POST
Message : 1.SF3800.FAULT.POST.LPOST.61--.16-0.1

Perhaps the Message field of each record encodes more specific details,
but there doesn't appear to be much there.
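
For anyone who wants to compare the history of several components, a
minimal Python sketch along these lines could turn pasted "showchs -v"
output into records. It assumes only the "Field : Value" layout shown in
the output above; parse_chs_records is a hypothetical helper name, not
part of any Sun tool, and the sample text is the memory module record
copied from this post.

```python
def parse_chs_records(text):
    """Split pasted 'showchs -v' output into one dict per CHS record.

    Assumes each field is printed as 'Name : Value' and that every
    record begins with a 'Component' line, as in the output above.
    """
    records = []
    current = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, the prompt line, and the record count line,
        # none of which contain the ' : ' field separator.
        if " : " not in line:
            continue
        key, value = line.split(" : ", 1)
        key = key.strip()
        # A new 'Component' line marks the start of the next record.
        if key == "Component" and current:
            records.append(current)
            current = {}
        current[key] = value.strip()
    if current:
        records.append(current)
    return records


SAMPLE = """\
sunfire:SC[service]> showchs -v -c /N0/SB0/P2/B0/D0/L0
Total # of records: 2
Component : /N0/SB0/P2/B0/D0/L0
Time Stamp : Fri Apr 27 13:00:30 MDT 2007
New Status : OK
Old Status : Faulty
Event Code : Field Engineer Supplied Status
Initiator : Field Engineer
Message : Set Erroneously

Component : /N0/SB0/P2/B0/D0/L0
Time Stamp : Thu Dec 21 17:26:38 MST 2006
New Status : Faulty
Old Status : OK
Event Code : Diag
Initiator : POST
Message : 1.SF3800.FAULT.POST.LPOST.61--.16-0.1
"""

records = parse_chs_records(SAMPLE)
for rec in records:
    # One line per status change: when, what changed, and who initiated it.
    print(rec["Time Stamp"], "|", rec["Old Status"], "->",
          rec["New Status"], "|", rec["Initiator"])
```

The sketch does not try to interpret the Message string; it only makes
it easier to line up which initiator changed which status and when.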

The more troubling aspect of the CHS feature has less to do with the
feature itself than with Sun's position on allowing the end user to
obtain the SP. My research revealed instances of people experiencing
similar problems with CHS and asking how to obtain the SP. Obviously the
most popular "solution" was to have a service contract. The next was to
pay Sun to send a service tech to resolve the problem. And finally, to
open a service case at a cost of $150/hr, minimum 2 hours. The last
option seemed reasonable until I tried it. Sun wanted the configuration,
serial number, and to have someone come by and check out the system
before they would help (come by and check it out? For what...the reason
I'm opening a service case with Sun is to resolve a problem with
"faulty" components). It sounded like a costly and time-consuming way
just to get a few components re-enabled. I spoke to a Sun salesperson I
know, hoping they could do me a favor and get the SP for me. While this
person had no clue about CHS and the SP, they did speak with a Sun
service tech who stated, unequivocally, that the SP was not to be
provided to the customer under any circumstances. That put the brakes on
that idea.

So why have I spent all this time writing paragraphs upon paragraphs of
my experience? Well, just to help others who may find themselves in a
similar position and raise two points that hopefully Sun may someday
address:

1. CHS is not foolproof. As my experience has shown, the components
functioned flawlessly when in use with the older firmware, and when
re-enabled with the newer firmware they continued to function
flawlessly.

2. What is Sun's policy on supplying the end user with the Service
Password? In researching this I came across a variety of answers, from
obtaining a service contract, to paying hourly for a tech to come on
site, to opening a support case with Sun, to "under no circumstances
should the SP be provided to the end user".

As this equipment ages it's going to find its way into the hands of
hobbyists. Perhaps Sun should think about making the password available
to hobbyists for free, or providing a firmware update which doesn't
require a service password at all.

Josh
www.UnixAdminTalk.com