We have an X4100 which last week seems to have turned itself off under somewhat mysterious circumstances. There is no noise at all in any Solaris logs (it just rebooted cleanly when it was turned back on, and the machine was idle so there's not even a clear trace of when it died), but the ILOM event log reports:

305 03/21/2007 12:49:54 ps1.pwrok Power Supply State Deasserted - Asserted
304 03/21/2007 12:49:52 ps0.pwrok Power Supply State Deasserted - Asserted

which (perhaps?) means it thinks it lost power then. We're fairly sure that there was no real power outage of any kind (other machines running off the same power strips were all fine, there's a UPS etc).

Has anyone seen anything like this?

--tim

PS there is confusion about ILOM / other firmware revisions, but the ILOM reports:

Device Hardware Revision 1
Device Firmware Revision 1.0.117440512
IPMI Version 2.0
Filesystem Version 0.1.13
Build Number 12513

At the time we built the machine (Feb) these were one back from the most recent, as the remote console didn't work in the most recent version (as reported here: http://groups.google.com/group/comp....4a86acef73064a)
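For anyone comparing several of these incidents, the event log lines quoted above can be pulled apart with a few lines of Python. This is only a rough sketch: the field layout (ID, MM/DD/YYYY date, time, sensor name, description) is inferred from the two lines in the post, and the function names `parse_events` and `power_events` are made up for illustration, not part of any ILOM tool.

```python
import re
from datetime import datetime

# Layout assumed from the two log lines quoted in the post:
#   <id> <MM/DD/YYYY> <HH:MM:SS> <sensor> <description>
EVENT_RE = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<date>\d{2}/\d{2}/\d{4})\s+"
    r"(?P<time>\d{2}:\d{2}:\d{2})\s+(?P<sensor>\S+)\s+(?P<desc>.+?)\s*$"
)

SAMPLE = """\
305 03/21/2007 12:49:54 ps1.pwrok Power Supply State Deasserted - Asserted
304 03/21/2007 12:49:52 ps0.pwrok Power Supply State Deasserted - Asserted
"""


def parse_events(text):
    """Return the log entries in chronological order, skipping unparseable lines."""
    events = []
    for line in text.splitlines():
        m = EVENT_RE.match(line)
        if not m:
            continue
        when = datetime.strptime(f"{m['date']} {m['time']}", "%m/%d/%Y %H:%M:%S")
        events.append(
            {"id": int(m["id"]), "when": when, "sensor": m["sensor"], "desc": m["desc"]}
        )
    return sorted(events, key=lambda e: e["when"])


def power_events(events):
    """Keep only the power-supply 'pwrok' sensor events."""
    return [e for e in events if e["sensor"].endswith(".pwrok")]


for event in power_events(parse_events(SAMPLE)):
    print(event["when"].isoformat(), event["sensor"], event["desc"])
```

Sorting by timestamp makes it easy to see that both supplies reported the change within two seconds of each other, which is the detail the rest of the thread keeps coming back to.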
| |||
On Mar 26, 2:56 am, "Tim Bradshaw" <tfb+goo...@tfeb.org> wrote:
> We have an X4100 which last week seems to have turned itself off under
> somewhat mysterious circumstances. There is no noise at all in any
> Solaris logs (it just rebooted cleanly when it was turned back on, and
> the machine was idle so there's not even a clear trace of when it
> died), but the ILOM event log reports:
>
> 305 03/21/2007 12:49:54 ps1.pwrok Power Supply State Deasserted -
> Asserted
> 304 03/21/2007 12:49:52 ps0.pwrok Power Supply State Deasserted -
> Asserted
>
> which (perhaps?) means it thinks it lost power then. We're fairly
> sure that there was no real power outage of any kind (other machines
> running of the same power strips were all fine, there's a UPS etc).
>
> Has anyone seen anything like this?
>
> --tim
>
> PS there is confusion about ILOM / other firmware revisions, but the
> ILOM reports:
>
> Device Hardware Revision 1
> Device Firmware Revision 1.0.117440512
> IPMI Version 2.0
> Filesystem Version 0.1.13
> Build Number 12513
>
> At the time we built the machine (Feb) these were one back from the
> most recent, as the remote console didn't work in the most recent
> version (as reported here:http://groups.google.com/group/comp....owse_thread/th...)

Hi Tim,

Did you ever resolve this? I'm having exactly the same problem. The x4100 has powered itself off about five times already.

Regards,
Gavin
| |||
On 2007-04-13 17:48:41 +0100, kgmathias@gmail.com said:

> Did you ever resolve this? I'm having exactly the same problem. The
> x4100 has powered itself off about five times already.

No. We now have a case open with Sun (it's done it once more since then). If you mail me I can send you the ID on Monday and you could cross reference it. The engineer thinks it may be a thermal trip issue, which is plausible except that the machine is completely idle when it dies, so anything that might make it hot won't be (and it's in a decently cooled room). We're going to try VTS on Monday.

If you can (if the machine is new) I'd argue that this is a warranty case and you just want a new one. We might try and do that (we should) but there's a vast bureaucracy in the way, this being a large company (I think the machine was bought in France even though we're in the UK, so you can see the issues).

It's spectacularly annoying because we can't really deploy it now it's done this to us, as we'd be relying on it a bit.

--tim
| |||
On 13 Apr 2007 09:48:41 -0700, kgmathias@gmail.com wrote:

>On Mar 26, 2:56 am, "Tim Bradshaw" <tfb+goo...@tfeb.org> wrote:
>> We have an X4100 which last week seems to have turned itself off under
>> somewhat mysterious circumstances. There is no noise at all in any
>> Solaris logs (it just rebooted cleanly when it was turned back on, and
>> the machine was idle so there's not even a clear trace of when it
>> died), but the ILOM event log reports:
>>
>> 305 03/21/2007 12:49:54 ps1.pwrok Power Supply State Deasserted -
>> Asserted
>> 304 03/21/2007 12:49:52 ps0.pwrok Power Supply State Deasserted -
>> Asserted
>>
>> which (perhaps?) means it thinks it lost power then. We're fairly
>> sure that there was no real power outage of any kind (other machines
>> running of the same power strips were all fine, there's a UPS etc).
>>
>> Has anyone seen anything like this?
>>
>> --tim
>>
>> PS there is confusion about ILOM / other firmware revisions, but the
>> ILOM reports:
>>
>> Device Hardware Revision 1
>> Device Firmware Revision 1.0.117440512
>> IPMI Version 2.0
>> Filesystem Version 0.1.13
>> Build Number 12513
>>
>> At the time we built the machine (Feb) these were one back from the
>> most recent, as the remote console didn't work in the most recent
>> version (as reported here:http://groups.google.com/group/comp....owse_thread/th...)
>
>Hi Tim,
>
>Did you ever resolve this? I'm having exactly the same problem. The
>x4100 has powered itself off about five times already.
>
>Regards,
>Gavin

I had a Compaq box do something similar to me: a DL590/64 with 2 power supplies, running Debian. It would turn itself off at random times. Turn it back on and it was okay. For a while. The machine was rather heavily loaded (CPU usage >90%) running 4 instances of the BOINC SETI@Home application and little else. It went from turning itself off once every few days to turning itself off every few hours as the condition continued to deteriorate.

The only other symptom was that the DC powered fans seemed to slow down _slightly_ just a couple of seconds before the power off 'click'. Since the machine was an eBay purchase (I spent more on freight than I did on the computer) I spent some time troubleshooting it myself.

It turned out that one of the power supplies had a bad output filter capacitor. It would go to a near-shorted condition after a time, then recover functionality after a power cycle, then go shorted again after a while. When the 'bad' supply started acting up it would drag down the 48V buss supply until voltage dropped to about 30V and the system would quietly click itself off.

I changed out the 'bad' capacitor and a couple more questionable looking electrolytic capacitors in the 'bad' power supply and the machine is now fine. For insurance, I changed the same electrolytics in the 'good' power supply. I spent $15 in parts from Digi-Key and 4 or 5 hours of time.

I know we're comparing oranges and pickles here with Sun and Compaq, but power supplies are power supplies.... Maybe Sun had a bad batch from their vendor.

Have you got another X4100 you could swap a power supply from temporarily? Might save fighting red-tape with Sun to replace the whole machine.

a/k/a Brian
| |||
On 2007-04-13 21:51:03 +0100, lost@the.net said:

> It turned out that one of the power supplies had a bad output filter
> capacitor. It would go to a near-shorted condition after a time, then
> recover functionality after a power cycle, then go shorted again after
> a while. When the 'bad' supply started acting up it would drag down
> the 48V buss supply until voltage dropped to about 30V and the system
> would quietly click itself off.

That might make sense as a failure mode - if one PSU fails in such a way as to short (or near short) some rail to earth I guess it could cause the other supply to have a tantrum and die (reasonably so, since it can't keep the machine alive).

> Have you got another X4100 you could swap a power supply from
> temporarily? Might save fighting red-tape with Sun to replace the
> whole machine.

That might be a good approach, yes (we do have another one).

--t
| |||
On Apr 13, 7:52 pm, Tim Bradshaw <t...@tfeb.org> wrote:
> On 2007-04-13 17:48:41 +0100, kgmath...@gmail.com said:
>
> > Did you ever resolve this? I'm having exactly the same problem. The
> > x4100 has powered itself off about five times already.
>
> No. We now have a case open with Sun (it's done it once more since
> then). If you mail me I can send you the ID on Monday and you could
> cross reference it. The engineer thinks it may be a thermal trip issue
> which is plausible except that the machine is completely idle when it
> dies so anything that might make it hot won't be (and it's in a
> decently cooled room). We're going to try VTS on Monday.
>

For what it's worth: we did try VTS and I can reliably kill the machine in an hour or so (with identical symptoms to the original mysterious death ones) by running the CPU tests. Indications are that it's some thermal issue.

So I'd try VTS on it, run the CPU stress tests with the number of instances set much higher than the default (we're using 10, up from 2), and you might find something interesting.

--tim
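If the thermal theory is right, it may help to log temperatures while the stress test runs, so there is a record of the last readings before the box dies. The sketch below is an assumption-laden example, not something from Sun: it assumes `ipmitool` is installed and can reach the service processor, it assumes the usual five-column, pipe-separated output of `ipmitool sdr type Temperature`, and the sensor names in `SAMPLE_SDR` are invented placeholders rather than real X4100 sensor names.

```python
import subprocess

# Hypothetical sample in the five-column layout that
# `ipmitool sdr type Temperature` normally prints; the names are made up.
SAMPLE_SDR = """\
example.ambient  | 01h | ok  |  7.0 | 24 degrees C
example.cpu0     | 02h | ns  |  3.0 | No Reading
"""


def parse_temperatures(sdr_output):
    """Map sensor name to degrees for every line that carries a numeric reading."""
    readings = {}
    for line in sdr_output.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) != 5:
            continue  # not a sensor line in the assumed layout
        name, reading = fields[0], fields[4]
        parts = reading.split()
        # Expect something like "24 degrees C"; skip "No Reading" and the like.
        if len(parts) >= 3 and parts[1] == "degrees":
            try:
                readings[name] = float(parts[0])
            except ValueError:
                continue
    return readings


def read_temperatures():
    """Run ipmitool on the local host and parse its temperature sensors."""
    result = subprocess.run(
        ["ipmitool", "sdr", "type", "Temperature"],
        capture_output=True, text=True, check=True,
    )
    return parse_temperatures(result.stdout)


print(parse_temperatures(SAMPLE_SDR))
```

Calling `read_temperatures()` from a loop that appends to a file on another machine (or syncs to disk after every write) would leave the final readings behind even if the server powers off without warning.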
| |||
Hi Tim,

Just wondering if you have found any solution to your X4100 mysterious failure. We seem to be plagued with the same problem recently. Did VTS give you any helpful information? Did you end up getting new power supplies or a new server?

Our machine has only failed between 1AM and 6AM, which means there is no stress on it. Could it be a temp sensor issue?

Thanks for your help in advance.

cheers,
Amin
| |||
On Jul 26, 12:07 pm, "AMV" <avarg...@nospam.coaxinc.com> wrote:
> Hi Tim,
> Just wondering if you have found any solution to your X4100 mysterious
> failure. We seem to be plagued with the same problem recently. Did the
> VTS give you with any helpful information? Did you end up getting new
> power supplies or a new server?
>
> Our machine has only failed between 1AM and 6AM, which means there is no
> stress on it. Could it be a temp sensor issue?
> Thanks for your help in advance.
> cheers,
> Amin

I am seeing exactly the same problem on an X4200. No load on the machine, decently cooled area (housing other servers and AC, etc), mysterious power-offs occurring, generally every 2-3 weeks. The machine shuts down cleanly but this is obviously not a situation we want to continue.

Did anyone definitively establish whether this was a temperature trip or just faulty PSUs (though would they both go at the same time)?

Thanks
| |||
On Aug 2, 10:15 am, RobT <robtr...@gmail.com> wrote:
> On Jul 26, 12:07 pm, "AMV" <avarg...@nospam.coaxinc.com> wrote:
>
> > Hi Tim,
> > Just wondering if you have found any solution to your X4100 mysterious
> > failure. We seem to be plagued with the same problem recently. Did the
> > VTS give you with any helpful information? Did you end up getting new
> > power supplies or a new server?
>
> > Our machine has only failed between 1AM and 6AM, which means there is no
> > stress on it. Could it be a temp sensor issue?
> > Thanks for your help in advance.
> > cheers,
> > Amin
>
> I am seeing exactly the same problem on an X4200. No load on the
> machine, decently cooled area (housing other servers and AC, etc),
> mysterious power-offs occurring, generally every 2-3 weeks. The
> machine shuts down cleanly but this is obviously not a situation we
> want to continue.
>
> Did anyone definitively establish if this was a temperature trip or
> just faulty PSU's (which would they both go at the same time)?
>
> Thanks

The Sun engineer had me update the ILOM firmware to the latest revision (1.1.8 at time of writing), as earlier versions were buggy. I guess we have to wait a while to see if that does the trick.