Unix Technical Forum

Help with Sol 7 Fatal PCI UE Error - intermittent

  #1 (permalink)  
Old 01-12-2008, 06:16 AM
David McCall
 
Help with Sol 7 Fatal PCI UE Error - intermittent

Is this a memory module failure or a cpu4 failure??

thx ahead of time.

dmc


Feb 1 22:54:40 e4500a unix: WARNING: uncorrectable error from pci2 (upa mid 3) during dvma read transaction
Feb 1 22:54:40 e4500a unix: Transaction was a block operation.
Feb 1 22:54:40 e4500a unix: AFSR=40000000.63800000 AFAR=00000000.ce9ad898,
Feb 1 22:54:40 e4500a double word offset=3, Memory Module Board 0 J3101 J3201 J3301 J3401 J3501 J3601 J3701 J3801 id 3.
Feb 1 22:54:40 e4500a unix: panic[cpu4]/thread=2a10025fd60:
Feb 1 22:54:40 e4500a unix: Fatal PCI UE Error
Feb 1 22:54:40 e4500a unix:
Feb 1 22:54:40 e4500a unix: syncing file systems...
Feb 1 22:54:40 e4500a unix: WARNING: md: d11: write error on /dev/dsk/c0t11d0s0
Feb 1 22:54:43 e4500a last message repeated 1 time
Feb 1 22:54:43 e4500a unix: 23
Feb 1 22:54:45 e4500a unix: 5
Feb 1 22:54:46 e4500a unix: 3
Feb 1 22:54:56 e4500a last message repeated 8 times
Feb 1 22:54:57 e4500a unix: cannot sync -- giving up
Feb 1 22:54:58 e4500a unix: dumping to /dev/md/dsk/d2, offset 105316352
Feb 1 22:56:04 e4500a unix: ^M100% done: 58260 pages dumped, compression ratio 3.10,
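
The AFSR and AFAR values in the log above are printed as two 32-bit hex halves separated by a dot. As a minimal sketch, the address can be decoded and checked against the logged "double word offset=3". The function names are made up for illustration, and the 64-byte block with 8-byte doublewords is an assumption that happens to agree with the logged value:

```python
def parse_dotted_hex(value: str) -> int:
    """Combine a 'high32.low32' hex string, as printed in the log, into one integer."""
    text = value.strip().rstrip(",")
    if text.lower().startswith("0x"):
        text = text[2:]
    high, low = text.split(".")
    return (int(high, 16) << 32) | int(low, 16)


def doubleword_offset(address: int, block_size: int = 64, word_size: int = 8) -> int:
    """Index of the 8-byte doubleword inside its block (64-byte block is assumed)."""
    return (address % block_size) // word_size


afar = parse_dotted_hex("00000000.ce9ad898")
print(hex(afar), doubleword_offset(afar))  # prints: 0xce9ad898 3
```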


==============================================================================

System Configuration: Sun Microsystems sun4u 8-slot Sun Enterprise E4500/E5500
System clock frequency: 84 MHz
Memory size: 4096Mb

========================= CPUs =========================

                   Run    Ecache  CPU     CPU
Brd  CPU  Module   MHz    MB      Impl.   Mask
---  ---  -------  -----  ------  ------  ----
 0    0      0     336    4.0     US-II   2.0
 0    1      1     336    4.0     US-II   2.0
 2    4      0     336    4.0     US-II   2.0
 2    5      1     336    4.0     US-II   2.0
 4    8      0     336    4.0     US-II   2.0
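
To locate the cpu4 named in the panic line: every row of the CPU table above satisfies CPU = 2 x board + module. A small sketch of that mapping follows. It assumes the same numbering holds for every slot in this chassis, and the helper name is hypothetical:

```python
from typing import Tuple


def locate_cpu(cpu_id: int) -> Tuple[int, int]:
    """Return (board, module) for a CPU id, assuming cpu_id = 2 * board + module."""
    board, module = divmod(cpu_id, 2)
    return board, module


# CPU ids taken from the table above
for cpu in (0, 1, 4, 5, 8):
    board, module = locate_cpu(cpu)
    print(f"cpu{cpu}: board {board}, module {module}")
```

Under that assumption cpu4 sits on board 2, module 0, and cpu8 on board 4, module 0.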


========================= Memory =========================

                                              Intrlv.  Intrlv.
Brd  Bank   MB    Status   Condition   Speed  Factor   With
---  -----  ----  -------  ----------  -----  -------  -------
 0     0    1024  Active   OK          60ns   4-way    A
 0     1    1024  Active   OK          60ns   4-way    A
 2     0    1024  Active   OK          60ns   4-way    A
 2     1    1024  Active   OK          60ns   4-way    A

========================= IO Cards =========================

     Bus   Freq
Brd  Type  MHz   Slot  Name                              Model
---  ----  ----  ----  --------------------------------  ----------------------
 1   PCI   33    0     SUNW,hme-pci108e,1001             SUNW,cheerio
 1   PCI   33    1     SUNW,hme-pci108e,1001             SUNW,cheerio
 1   PCI   66    2     network-pci108e,2bad              SUNW,pci-gem
 1   PCI   33    3     SUNW,isptwo/sd (block)            QLGC,ISP1040B
 1   PCI   33    4     SUNW,isptwo-pci1077,1020/sd (blo+ QLGC,ISP1040B

Detached Boards
===============
Slot  State      Type    Info
----  ---------  ------  -----------------------------------------
 3    disabled   disk    Disk 0: Target: 10  Disk 1: Target: 11
 5    disabled   disk    Disk 0: Target: 12  Disk 1: Target: 13

No failures found in System
===========================

No System Faults found
======================


David McCall

UNIX Administrator

AdvancedTelcomGroup

david@atgi.net

david.mccall@ge.com



  #2 (permalink)  
Old 01-12-2008, 06:16 AM
drub sledkey
 
Re: Help with Sol 7 Fatal PCI UE Error - intermittent

"David McCall" <david@atgi.net> wrote in message news:<bvl5co$sp6$1@nnrp.atgi.net>...
> Is this a memory module failure or a cpu4 failure??
>
> thx ahead of time.
>
> dmc
>
>
> Feb 1 22:54:40 e4500a unix: WARNING: uncorrectable error from pci2 (upa mid 3) during dvma read transaction
> Feb 1 22:54:40 e4500a unix: Transaction was a block operation.
> Feb 1 22:54:40 e4500a unix: AFSR=40000000.63800000 AFAR=00000000.ce9ad898,
> Feb 1 22:54:40 e4500a double word offset=3, Memory Module Board 0 J3101 J3201 J3301 J3401 J3501 J3601 J3701 J3801 id 3.
> Feb 1 22:54:40 e4500a unix: panic[cpu4]/thread=2a10025fd60:
> Feb 1 22:54:40 e4500a unix: Fatal PCI UE Error


I had this exact same symptom. It turned out to be a memory error.
It was a production E3500, and I didn't have the luxury of swapping
out each memory module one at a time to find the bad one, so I just
replaced the whole bank, and that fixed it. I then turned the 8
memory modules over to Sun, and they were able to figure out which one
of the memory modules was bad.
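
The eight modules that make up the suspect bank are the J-numbered sockets named in the log line. A rough sketch for pulling them out of a message follows. It assumes the "Memory Module Board N Jxxxx ..." wording seen in this thread; the regex and function name are illustrative and not part of any Sun tool:

```python
import re

# Board number followed by one or more J-numbered socket labels
BANK_RE = re.compile(r"Memory Module Board (\d+)((?:\s+J\d+)+)")


def bank_sockets(message: str):
    """Return (board, [socket labels]) from a log message, or None if not present."""
    flat = " ".join(message.split())  # undo line wrapping
    match = BANK_RE.search(flat)
    if match is None:
        return None
    return int(match.group(1)), match.group(2).split()


line = ("double word offset=3, Memory Module Board 0 J3101\n"
        "J3201 J3301 J3401 J3501 J3601 J3701 J3801 id 3.")
print(bank_sockets(line))
```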
  #3 (permalink)  
Old 01-12-2008, 06:17 AM
David McCall
 
Re: Help with Sol 7 Fatal PCI UE Error - intermittent

Now it's kinda changed???

Feb 3 17:30:49 e4500a unix: WARNING: [AFT1] Uncorrectable Memory Error on CPU8 Data access at TL=0, errID 0x00008b68.adea170d
Feb 3 17:30:49 e4500a AFSR 0x00000000.80200000<PRIV,UE> AFAR 0x00000000.a4e4a018
Feb 3 17:30:49 e4500a AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x1008aef8
Feb 3 17:30:49 e4500a UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0203<UE> UDBL.ESYND 0x03
Feb 3 17:30:49 e4500a UDBL Syndrome 0x3 Memory Module Board 0 J3100 J3200 J3300 J3400 J3500 J3600 J3700 J3800
Feb 3 17:30:49 e4500a unix: WARNING: [AFT1] errID 0x00008b68.adea170d Syndrome 0x3 indicates that this may not be a memory module problem
Feb 3 17:30:49 e4500a unix: [AFT2] errID 0x00008b68.adea170d PA=0x00000000.a4e4a018
Feb 3 17:30:49 e4500a E$tag 0x00000000.0fc0149c E$State: Modified E$parity 0x07
Feb 3 17:30:49 e4500a unix: [AFT2] E$Data (0x00): 0x00000300.082203f8
Feb 3 17:30:49 e4500a unix: [AFT2] E$Data (0x08): 0x00000000.00000000
Feb 3 17:30:49 e4500a unix: [AFT2] E$Data (0x10): 0x00000000.00000000
Feb 3 17:30:49 e4500a unix: [AFT2] E$Data (0x18): 0x00000300.08309430 *Bad* PSYND=0x00ff
Feb 3 17:30:49 e4500a unix: [AFT2] E$Data (0x20): 0x00000000.00000000
Feb 3 17:30:49 e4500a unix: [AFT2] E$Data (0x28): 0x00000000.00000000
Feb 3 17:30:49 e4500a unix: [AFT2] E$Data (0x30): 0x00000300.07c1fb20
Feb 3 17:30:49 e4500a unix: [AFT2] E$Data (0x38): 0x00000000.00000000
Feb 3 17:30:49 e4500a unix: panic[cpu8]/thread=300083d5960:
Feb 3 17:30:49 e4500a unix: [AFT1] errID 0x00008b68.adea170d UE Error(s)
Feb 3 17:30:49 e4500a See previous message(s) for details
Feb 3 17:30:49 e4500a unix:
Feb 3 17:30:49 e4500a unix: syncing file systems...
Feb 3 17:30:51 e4500a unix: 27
Feb 3 17:31:15 e4500a unix: 4
Feb 3 17:31:38 e4500a unix: 1
Feb 3 17:31:48 e4500a unix: panic[cpu8]/thread=2a1000abd60:
Feb 3 17:31:48 e4500a unix: panic sync timeout



  #4 (permalink)  
Old 01-12-2008, 06:19 AM
David McCall
 
Re: Help with Sol 7 Fatal PCI UE Error - intermittent

And yet another permutation:


Feb 10 03:28:33 e4500a unix: WARNING: [AFT1] Uncorrectable Memory Error on CPU8 Data access at TL=0, errID 0x000009ec.bb54d3a5
Feb 10 03:28:33 e4500a AFSR 0x00000000.80200000<PRIV,UE> AFAR 0x00000000.60f7d1d8
Feb 10 03:28:33 e4500a AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x1001ee94
Feb 10 03:28:33 e4500a UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0203<UE> UDBL.ESYND 0x03
Feb 10 03:28:33 e4500a UDBL Syndrome 0x3 Memory Module Board 2 J3101 J3201 J3301 J3401 J3501 J3601 J3701 J3801
Feb 10 03:28:33 e4500a unix: WARNING: [AFT1] errID 0x000009ec.bb54d3a5 Syndrome 0x3 indicates that this may not be a memory module problem
Feb 10 03:28:33 e4500a unix: [AFT2] errID 0x000009ec.bb54d3a5 PA=0x00000000.60f7d1d8
Feb 10 03:28:33 e4500a E$tag 0x00000000.1ec00c1e E$State: Exclusive E$parity 0x0f
Feb 10 03:28:33 e4500a unix: [AFT2] E$Data (0x00): 0x00000000.00000048
Feb 10 03:28:33 e4500a unix: [AFT2] E$Data (0x08): 0xe17a14ae.47f53f01
Feb 10 03:28:33 e4500a unix: [AFT2] E$Data (0x10): 0x00000000.00000060
Feb 10 03:28:33 e4500a unix: [AFT2] E$Data (0x18): 0x213824b0.65e04011 *Bad* PSYND=0x00ff
Feb 10 03:28:33 e4500a unix: [AFT2] E$Data (0x20): 0x1cd232b0.65e24028
Feb 10 03:28:33 e4500a unix: [AFT2] E$Data (0x28): 0x00000010.325436d0
Feb 10 03:28:33 e4500a unix: [AFT2] E$Data (0x30): 0x65e24053.a29c41d0
Feb 10 03:28:33 e4500a unix: [AFT2] E$Data (0x38): 0x65e24009.00000003
Feb 10 03:28:33 e4500a unix: panic[cpu8]/thread=2a1004b3d60:
Feb 10 03:28:33 e4500a unix: [AFT1] errID 0x000009ec.bb54d3a5 UE Error(s)




  #5 (permalink)  
Old 01-12-2008, 06:19 AM
David McCall
 
Re: Help with Sol 7 Fatal PCI UE Error - intermittent

And now even more interesting, with Module 0 and PCI2 swapped out of the picture:

Feb 10 03:28:33 e4500a unix: WARNING: [AFT1] Uncorrectable Memory Error on CPU8 Data access at TL=0, errID 0x000009ec.bb54d3a5
Feb 10 03:28:33 e4500a AFSR 0x00000000.80200000<PRIV,UE> AFAR 0x00000000.60f7d1d8
Feb 10 03:28:33 e4500a AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x1001ee94
Feb 10 03:28:33 e4500a UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0203<UE> UDBL.ESYND 0x03
Feb 10 03:28:33 e4500a UDBL Syndrome 0x3 Memory Module Board 2 J3101 J3201 J3301 J3401 J3501 J3601 J3701 J3801
Feb 10 03:28:33 e4500a unix: WARNING: [AFT1] errID 0x000009ec.bb54d3a5 Syndrome 0x3 indicates that this may not be a memory module problem
Feb 10 03:28:33 e4500a unix: [AFT2] errID 0x000009ec.bb54d3a5 PA=0x00000000.60f7d1d8
Feb 10 03:28:33 e4500a E$tag 0x00000000.1ec00c1e E$State: Exclusive E$parity 0x0f
Feb 10 03:28:33 e4500a unix: [AFT2] E$Data (0x00): 0x00000000.00000048
Feb 10 03:28:33 e4500a unix: [AFT2] E$Data (0x08): 0xe17a14ae.47f53f01
Feb 10 03:28:33 e4500a unix: [AFT2] E$Data (0x10): 0x00000000.00000060
Feb 10 03:28:33 e4500a unix: [AFT2] E$Data (0x18): 0x213824b0.65e04011 *Bad* PSYND=0x00ff
Feb 10 03:28:33 e4500a unix: [AFT2] E$Data (0x20): 0x1cd232b0.65e24028
Feb 10 03:28:33 e4500a unix: [AFT2] E$Data (0x28): 0x00000010.325436d0
Feb 10 03:28:33 e4500a unix: [AFT2] E$Data (0x30): 0x65e24053.a29c41d0
Feb 10 03:28:33 e4500a unix: [AFT2] E$Data (0x38): 0x65e24009.00000003
Feb 10 03:28:33 e4500a unix: panic[cpu8]/thread=2a1004b3d60:
Feb 10 03:28:33 e4500a unix: [AFT1] errID 0x000009ec.bb54d3a5 UE Error(s)
Feb 10 03:28:33 e4500a See previous message(s) for details
Feb 10 03:28:33 e4500a unix:
Feb 10 03:28:34 e4500a unix: syncing file systems...
Feb 10 03:28:35 e4500a unix: 13
Feb 10 03:28:58 e4500a unix: 1
Feb 10 03:29:13 e4500a unix: done
Feb 10 03:29:13 e4500a unix: panic[cpu8]/thread=2a1000abd60:
Feb 10 03:29:13 e4500a unix: panic sync timeout
Feb 10 03:29:13 e4500a unix:
Feb 10 03:29:14 e4500a unix: dumping to /dev/md/dsk/d2, offset 105316352
Feb 10 03:31:18 e4500a unix: ^M100% done: 43367 pages dumped, compression ratio 4.07,
Feb 10 03:31:18 e4500a unix: dump succeeded
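
To compare the panics side by side, the CPU, board and syndrome can be pulled out of each [AFT1] report. This is only a sketch, assuming the messages keep the wording shown in this thread. The helper and the dictionary keys are hypothetical, and the two sample strings are excerpts of the Feb 3 and Feb 10 reports posted above:

```python
import re
from collections import Counter


def summarize_ue(report: str) -> dict:
    """Extract CPU, memory board and UDBL syndrome from one [AFT1] report."""
    flat = " ".join(report.split())  # undo line wrapping
    cpu = re.search(r"Uncorrectable Memory Error on CPU(\d+)", flat)
    board = re.search(r"Memory Module Board (\d+)", flat)
    syndrome = re.search(r"UDBL Syndrome (0x[0-9a-fA-F]+)", flat)
    return {
        "cpu": int(cpu.group(1)) if cpu else None,
        "board": int(board.group(1)) if board else None,
        "syndrome": syndrome.group(1) if syndrome else None,
    }


feb3 = """WARNING: [AFT1] Uncorrectable Memory Error on
CPU8 Data access at TL=0, errID 0x00008b68.adea170d
UDBL Syndrome 0x3 Memory Module Board 0 J3100
J3200 J3300 J3400 J3500 J3600 J3700 J3800"""

feb10 = """WARNING: [AFT1] Uncorrectable Memory Error on
CPU8 Data access at TL=0, errID 0x000009ec.bb54d3a5
UDBL Syndrome 0x3 Memory Module Board 2 J3101
J3201 J3301 J3401 J3501 J3601 J3701 J3801"""

summaries = [summarize_ue(r) for r in (feb3, feb10)]
cpu_counts = Counter(s["cpu"] for s in summaries)
for s in summaries:
    print(s)
print(cpu_counts)
```

Both reports name CPU8 and syndrome 0x3, while the memory board changes from 0 to 2.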






  #6 (permalink)  
Old 01-12-2008, 06:19 AM
David McCall
 
Re: Help with Sol 7 Fatal PCI UE Error - intermittent

Forgot to mention: all the soft links in / were gone after the last failure.

Dunno if that means anything significant either.

Not much left to replace at this point.

