Help with Sol 7 Fatal PCI UE Error - intermittent
Is this a memory module failure or a cpu4 failure??

thx ahead of time.

dmc

Feb 1 22:54:40 e4500a unix: WARNING: uncorrectable error from pci2 (upa mid 3) during dvma read transaction
Feb 1 22:54:40 e4500a unix: Transaction was a block operation.
Feb 1 22:54:40 e4500a unix: AFSR=40000000.63800000 AFAR=00000000.ce9ad898,
Feb 1 22:54:40 e4500a double word offset=3, Memory Module Board 0 J3101 J3201 J3301 J3401 J3501 J3601 J3701 J3801 id 3.
Feb 1 22:54:40 e4500a unix: panic[cpu4]/thread=2a10025fd60:
Feb 1 22:54:40 e4500a unix: Fatal PCI UE Error
Feb 1 22:54:40 e4500a unix:
Feb 1 22:54:40 e4500a unix: syncing file systems...
Feb 1 22:54:40 e4500a unix: WARNING: md: d11: write error on /dev/dsk/c0t11d0s0
Feb 1 22:54:43 e4500a last message repeated 1 time
Feb 1 22:54:43 e4500a unix: 23
Feb 1 22:54:45 e4500a unix: 5
Feb 1 22:54:46 e4500a unix: 3
Feb 1 22:54:56 e4500a last message repeated 8 times
Feb 1 22:54:57 e4500a unix: cannot sync -- giving up
Feb 1 22:54:58 e4500a unix: dumping to /dev/md/dsk/d2, offset 105316352
Feb 1 22:56:04 e4500a unix: ^M100% done: 58260 pages dumped, compression ratio 3.10,

==============================================================================

System Configuration: Sun Microsystems sun4u 8-slot Sun Enterprise E4500/E5500
System clock frequency: 84 MHz
Memory size: 4096Mb

========================= CPUs =========================

                    Run   Ecache   CPU    CPU
Brd  CPU   Module   MHz     MB    Impl.   Mask
---  ---  -------  -----  ------  ------  ----
 0     0     0      336     4.0   US-II    2.0
 0     1     1      336     4.0   US-II    2.0
 2     4     0      336     4.0   US-II    2.0
 2     5     1      336     4.0   US-II    2.0
 4     8     0      336     4.0   US-II    2.0

========================= Memory =========================

                                              Intrlv.  Intrlv.
Brd   Bank   MB    Status   Condition  Speed   Factor   With
---  -----  ----  -------  ----------  -----  -------  -------
 0     0    1024   Active      OK       60ns   4-way      A
 0     1    1024   Active      OK       60ns   4-way      A
 2     0    1024   Active      OK       60ns   4-way      A
 2     1    1024   Active      OK       60ns   4-way      A

========================= IO Cards =========================

     Bus   Freq
Brd  Type  MHz   Slot  Name                              Model
---  ----  ----  ----  --------------------------------  ----------------------
 1   PCI    33     0   SUNW,hme-pci108e,1001             SUNW,cheerio
 1   PCI    33     1   SUNW,hme-pci108e,1001             SUNW,cheerio
 1   PCI    66     2   network-pci108e,2bad              SUNW,pci-gem
 1   PCI    33     3   SUNW,isptwo/sd (block)            QLGC,ISP1040B
 1   PCI    33     4   SUNW,isptwo-pci1077,1020/sd (blo+ QLGC,ISP1040B

Detached Boards
===============
  Slot  State      Type    Info
  ----  ---------  ------  -----------------------------------------
    3   disabled   disk    Disk 0: Target: 10   Disk 1: Target: 11
    5   disabled   disk    Disk 0: Target: 12   Disk 1: Target: 13

No failures found in System
===========================

No System Faults found
======================

David McCall
UNIX Administrator
AdvancedTelcomGroup
david@atgi.net
david.mccall@ge.com
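For anyone who wants to pull the same fields out of /var/adm/messages instead of reading them by eye, here is a minimal sketch in Python. The function name `parse_pci_ue` and the regular expression are my own assumptions, written against the exact wording of the warning in the post above; other Solaris releases may word the message differently. It only extracts what the kernel already printed (bus, UPA mid, AFSR, AFAR, board and socket list) and does not try to decide which part is bad.

```python
import re

# Hypothetical helper, not part of Solaris. The pattern is written against
# the PCI UE warning text quoted in this thread and nothing else.
PCI_UE = re.compile(
    r"uncorrectable error from (?P<bus>pci\d+) \(upa mid\s+(?P<mid>\d+)\)"
    r".*?AFSR=(?P<afsr>[0-9a-f.]+)\s+AFAR=(?P<afar>[0-9a-f.]+)"
    r".*?Memory Module Board (?P<board>\d+) (?P<sockets>(?:J\d+ ?)+)",
    re.DOTALL,
)

# The warning from the first panic, copied from the post.
SAMPLE = """\
Feb 1 22:54:40 e4500a unix: WARNING: uncorrectable error from pci2 (upa mid 3) during dvma read transaction
Feb 1 22:54:40 e4500a unix: Transaction was a block operation.
Feb 1 22:54:40 e4500a unix: AFSR=40000000.63800000 AFAR=00000000.ce9ad898,
Feb 1 22:54:40 e4500a double word offset=3, Memory Module Board 0 J3101 J3201 J3301 J3401 J3501 J3601 J3701 J3801 id 3.
"""


def parse_pci_ue(text):
    """Return the fields of the first PCI UE warning in text, or None."""
    m = PCI_UE.search(text)
    if m is None:
        return None
    return {
        "bus": m.group("bus"),
        "upa_mid": int(m.group("mid")),
        "afsr": m.group("afsr"),
        # The log prints the 64-bit address as two 32-bit halves joined by a dot.
        "afar": int(m.group("afar").replace(".", ""), 16),
        "board": int(m.group("board")),
        "sockets": m.group("sockets").split(),
    }


if __name__ == "__main__":
    print(parse_pci_ue(SAMPLE))
```

The socket list is the useful part here: it names the eight modules the kernel associates with the failing address, which is the same group a whole-bank swap would replace.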
| |||
"David McCall" <david@atgi.net> wrote in message news:<bvl5co$sp6$1@nnrp.atgi.net>...
> Is this a memory module failure or a cpu4 failure??
>
> thx ahead of time.
>
> dmc
>
>
> Feb 1 22:54:40 e4500a unix: WARNING: uncorrectable error from pci2 (upa mid
> 3) during dvma read transaction
> Feb 1 22:54:40 e4500a unix: Transaction was a block operation.
> Feb 1 22:54:40 e4500a unix: AFSR=40000000.63800000
> AFAR=00000000.ce9ad898,
> Feb 1 22:54:40 e4500a double word offset=3, Memory Module Board 0 J3101
> J3201 J3301 J3401 J3501 J3601 J3701 J3801 id 3.
> Feb 1 22:54:40 e4500a unix: panic[cpu4]/thread=2a10025fd60:
> Feb 1 22:54:40 e4500a unix: Fatal PCI UE Error

I had this exact same symptom. It turned out to be a memory error.
It was a production E3500, and I didn't have the luxury of swapping
out each memory module one at a time to find the bad one, so I just
replaced the whole bank, and that fixed it. I then turned the 8
memory modules over to Sun, and they were able to figure out which one
of the memory modules was bad.
| |||
Now it's kinda changed???

Feb 3 17:30:49 e4500a unix: WARNING: [AFT1] Uncorrectable Memory Error on CPU8 Data access at TL=0, errID 0x00008b68.adea170d
Feb 3 17:30:49 e4500a AFSR 0x00000000.80200000<PRIV,UE> AFAR 0x00000000.a4e4a018
Feb 3 17:30:49 e4500a AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x1008aef8
Feb 3 17:30:49 e4500a UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0203<UE> UDBL.ESYND 0x03
Feb 3 17:30:49 e4500a UDBL Syndrome 0x3 Memory Module Board 0 J3100 J3200 J3300 J3400 J3500 J3600 J3700 J3800
Feb 3 17:30:49 e4500a unix: WARNING: [AFT1] errID 0x00008b68.adea170d Syndrome 0x3 indicates that this may not be a memory module problem
Feb 3 17:30:49 e4500a unix: [AFT2] errID 0x00008b68.adea170d PA=0x00000000.a4e4a018
Feb 3 17:30:49 e4500a E$tag 0x00000000.0fc0149c E$State: Modified E$parity 0x07
Feb 3 17:30:49 e4500a unix: [AFT2] E$Data (0x00): 0x00000300.082203f8
Feb 3 17:30:49 e4500a unix: [AFT2] E$Data (0x08): 0x00000000.00000000
Feb 3 17:30:49 e4500a unix: [AFT2] E$Data (0x10): 0x00000000.00000000
Feb 3 17:30:49 e4500a unix: [AFT2] E$Data (0x18): 0x00000300.08309430 *Bad* PSYND=0x00ff
Feb 3 17:30:49 e4500a unix: [AFT2] E$Data (0x20): 0x00000000.00000000
Feb 3 17:30:49 e4500a unix: [AFT2] E$Data (0x28): 0x00000000.00000000
Feb 3 17:30:49 e4500a unix: [AFT2] E$Data (0x30): 0x00000300.07c1fb20
Feb 3 17:30:49 e4500a unix: [AFT2] E$Data (0x38): 0x00000000.00000000
Feb 3 17:30:49 e4500a unix: panic[cpu8]/thread=300083d5960:
Feb 3 17:30:49 e4500a unix: [AFT1] errID 0x00008b68.adea170d UE Error(s)
Feb 3 17:30:49 e4500a See previous message(s) for details
Feb 3 17:30:49 e4500a unix:
Feb 3 17:30:49 e4500a unix: syncing file systems...
Feb 3 17:30:51 e4500a unix: 27
Feb 3 17:31:15 e4500a unix: 4
Feb 3 17:31:38 e4500a unix: 1
Feb 3 17:31:48 e4500a unix: panic[cpu8]/thread=2a1000abd60:
Feb 3 17:31:48 e4500a unix: panic sync timeout

"drub sledkey" <drubsledkey@yahoo.com> wrote in message news:42b26a3f.0402020936.69425908@posting.google.com...
> "David McCall" <david@atgi.net> wrote in message news:<bvl5co$sp6$1@nnrp.atgi.net>...
> > Is this a memory module failure or a cpu4 failure??
> >
> > thx ahead of time.
> >
> > dmc
> >
> >
> > Feb 1 22:54:40 e4500a unix: WARNING: uncorrectable error from pci2 (upa mid
> > 3) during dvma read transaction
> > Feb 1 22:54:40 e4500a unix: Transaction was a block operation.
> > Feb 1 22:54:40 e4500a unix: AFSR=40000000.63800000
> > AFAR=00000000.ce9ad898,
> > Feb 1 22:54:40 e4500a double word offset=3, Memory Module Board 0 J3101
> > J3201 J3301 J3401 J3501 J3601 J3701 J3801 id 3.
> > Feb 1 22:54:40 e4500a unix: panic[cpu4]/thread=2a10025fd60:
> > Feb 1 22:54:40 e4500a unix: Fatal PCI UE Error
>
> I had this exact same symptom. It turned out to be a memory error.
> It was a production E3500, and I didn't have the luxury of swapping
> out each memory module one at a time to find the bad one, so I just
> replaced the whole bank, and that fixed it. I then turned the 8
> memory modules over to Sun, and they were able to figure out which one
> of the memory modules was bad.
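The [AFT1]/[AFT2] messages carry more detail than the earlier PCI warning, so a separate sketch may help when comparing several of them. This is only an illustration in Python: `parse_aft1` and its regular expressions are hypothetical names of mine, built from the message layout shown in this post, and the flag names are simply whatever the kernel printed between the angle brackets. It collects the CPU, errID, AFSR flags, AFAR, syndrome, board and sockets, plus the E$Data offsets that the kernel marked *Bad*.

```python
import re

# Hypothetical patterns, derived only from the AFT1/AFT2 text in this thread.
AFT1_HEAD = re.compile(
    r"\[AFT1\] Uncorrectable Memory Error on CPU(?P<cpu>\d+)"
    r".*?errID (?P<errid>0x[0-9a-f.]+)",
    re.DOTALL,
)
AFSR = re.compile(
    r"AFSR (?P<afsr>0x[0-9a-f.]+)<(?P<flags>[A-Z,]+)> AFAR (?P<afar>0x[0-9a-f.]+)"
)
MODULE = re.compile(
    r"UDBL Syndrome (?P<synd>0x[0-9a-f]+) "
    r"Memory Module Board (?P<board>\d+) (?P<sockets>(?:J\d+ ?)+)"
)
BAD = re.compile(r"E\$Data \((?P<off>0x[0-9a-f]+)\): 0x[0-9a-f.]+ \*Bad\*")

# A shortened copy of the Feb 3 panic from the post.
SAMPLE = """\
Feb 3 17:30:49 e4500a unix: WARNING: [AFT1] Uncorrectable Memory Error on CPU8 Data access at TL=0, errID 0x00008b68.adea170d
Feb 3 17:30:49 e4500a AFSR 0x00000000.80200000<PRIV,UE> AFAR 0x00000000.a4e4a018
Feb 3 17:30:49 e4500a UDBL Syndrome 0x3 Memory Module Board 0 J3100 J3200 J3300 J3400 J3500 J3600 J3700 J3800
Feb 3 17:30:49 e4500a unix: [AFT2] E$Data (0x10): 0x00000000.00000000
Feb 3 17:30:49 e4500a unix: [AFT2] E$Data (0x18): 0x00000300.08309430 *Bad* PSYND=0x00ff
Feb 3 17:30:49 e4500a unix: [AFT2] E$Data (0x20): 0x00000000.00000000
"""


def parse_aft1(text):
    """Return the fields of the first AFT1 UE report in text, or None."""
    head, afsr, module = AFT1_HEAD.search(text), AFSR.search(text), MODULE.search(text)
    if not (head and afsr and module):
        return None
    return {
        "cpu": int(head.group("cpu")),
        "errid": head.group("errid"),
        "flags": afsr.group("flags").split(","),
        # Drop the 0x prefix and the dot between the two 32-bit halves.
        "afar": int(afsr.group("afar")[2:].replace(".", ""), 16),
        "syndrome": int(module.group("synd"), 16),
        "board": int(module.group("board")),
        "sockets": module.group("sockets").split(),
        # Offsets of the E$Data words the kernel itself flagged as bad.
        "bad_offsets": [int(m.group("off"), 16) for m in BAD.finditer(text)],
    }


if __name__ == "__main__":
    print(parse_aft1(SAMPLE))
```

Keeping the errID in the result makes it easy to tell whether two pasted logs are the same event or two separate panics.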
| |||
And yet another permutation:

Feb 10 03:28:33 e4500a unix: WARNING: [AFT1] Uncorrectable Memory Error on CPU8 Data access at TL=0, errID 0x000009ec.bb54d3a5
Feb 10 03:28:33 e4500a AFSR 0x00000000.80200000<PRIV,UE> AFAR 0x00000000.60f7d1d8
Feb 10 03:28:33 e4500a AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x1001ee94
Feb 10 03:28:33 e4500a UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0203<UE> UDBL.ESYND 0x03
Feb 10 03:28:33 e4500a UDBL Syndrome 0x3 Memory Module Board 2 J3101 J3201 J3301 J3401 J3501 J3601 J3701 J3801
Feb 10 03:28:33 e4500a unix: WARNING: [AFT1] errID 0x000009ec.bb54d3a5 Syndrome 0x3 indicates that this may not be a memory module problem
Feb 10 03:28:33 e4500a unix: [AFT2] errID 0x000009ec.bb54d3a5 PA=0x00000000.60f7d1d8
Feb 10 03:28:33 e4500a E$tag 0x00000000.1ec00c1e E$State: Exclusive E$parity 0x0f
Feb 10 03:28:33 e4500a unix: [AFT2] E$Data (0x00): 0x00000000.00000048
Feb 10 03:28:33 e4500a unix: [AFT2] E$Data (0x08): 0xe17a14ae.47f53f01
Feb 10 03:28:33 e4500a unix: [AFT2] E$Data (0x10): 0x00000000.00000060
Feb 10 03:28:33 e4500a unix: [AFT2] E$Data (0x18): 0x213824b0.65e04011 *Bad* PSYND=0x00ff
Feb 10 03:28:33 e4500a unix: [AFT2] E$Data (0x20): 0x1cd232b0.65e24028
Feb 10 03:28:33 e4500a unix: [AFT2] E$Data (0x28): 0x00000010.325436d0
Feb 10 03:28:33 e4500a unix: [AFT2] E$Data (0x30): 0x65e24053.a29c41d0
Feb 10 03:28:33 e4500a unix: [AFT2] E$Data (0x38): 0x65e24009.00000003
Feb 10 03:28:33 e4500a unix: panic[cpu8]/thread=2a1004b3d60:
Feb 10 03:28:33 e4500a unix: [AFT1] errID 0x000009ec.bb54d3a5 UE Error(s)

> > I had this exact same symptom. It turned out to be a memory error.
> > It was a production E3500, and I didn't have the luxury of swapping
> > out each memory module one at a time to find the bad one, so I just
> > replaced the whole bank, and that fixed it. I then turned the 8
> > memory modules over to Sun, and they were able to figure out which one
> > of the memory modules was bad.
> >
| |||
And now even more interesting, with Module 0 and PCI2 swapped out of the picture:

Feb 10 03:28:33 e4500a unix: WARNING: [AFT1] Uncorrectable Memory Error on CPU8 Data access at TL=0, errID 0x000009ec.bb54d3a5
Feb 10 03:28:33 e4500a AFSR 0x00000000.80200000<PRIV,UE> AFAR 0x00000000.60f7d1d8
Feb 10 03:28:33 e4500a AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x1001ee94
Feb 10 03:28:33 e4500a UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0203<UE> UDBL.ESYND 0x03
Feb 10 03:28:33 e4500a UDBL Syndrome 0x3 Memory Module Board 2 J3101 J3201 J3301 J3401 J3501 J3601 J3701 J3801
Feb 10 03:28:33 e4500a unix: WARNING: [AFT1] errID 0x000009ec.bb54d3a5 Syndrome 0x3 indicates that this may not be a memory module problem
Feb 10 03:28:33 e4500a unix: [AFT2] errID 0x000009ec.bb54d3a5 PA=0x00000000.60f7d1d8
Feb 10 03:28:33 e4500a E$tag 0x00000000.1ec00c1e E$State: Exclusive E$parity 0x0f
Feb 10 03:28:33 e4500a unix: [AFT2] E$Data (0x00): 0x00000000.00000048
Feb 10 03:28:33 e4500a unix: [AFT2] E$Data (0x08): 0xe17a14ae.47f53f01
Feb 10 03:28:33 e4500a unix: [AFT2] E$Data (0x10): 0x00000000.00000060
Feb 10 03:28:33 e4500a unix: [AFT2] E$Data (0x18): 0x213824b0.65e04011 *Bad* PSYND=0x00ff
Feb 10 03:28:33 e4500a unix: [AFT2] E$Data (0x20): 0x1cd232b0.65e24028
Feb 10 03:28:33 e4500a unix: [AFT2] E$Data (0x28): 0x00000010.325436d0
Feb 10 03:28:33 e4500a unix: [AFT2] E$Data (0x30): 0x65e24053.a29c41d0
Feb 10 03:28:33 e4500a unix: [AFT2] E$Data (0x38): 0x65e24009.00000003
Feb 10 03:28:33 e4500a unix: panic[cpu8]/thread=2a1004b3d60:
Feb 10 03:28:33 e4500a unix: [AFT1] errID 0x000009ec.bb54d3a5 UE Error(s)
Feb 10 03:28:33 e4500a See previous message(s) for details
Feb 10 03:28:33 e4500a unix:
Feb 10 03:28:34 e4500a unix: syncing file systems...
Feb 10 03:28:35 e4500a unix: 13
Feb 10 03:28:58 e4500a unix: 1
Feb 10 03:29:13 e4500a unix: done
Feb 10 03:29:13 e4500a unix: panic[cpu8]/thread=2a1000abd60:
Feb 10 03:29:13 e4500a unix: panic sync timeout
Feb 10 03:29:13 e4500a unix:
Feb 10 03:29:14 e4500a unix: dumping to /dev/md/dsk/d2, offset 105316352
Feb 10 03:31:18 e4500a unix: ^M100% done: 43367 pages dumped, compression ratio 4.07,
Feb 10 03:31:18 e4500a unix: dump succeeded

> > I had this exact same symptom. It turned out to be a memory error.
> > It was a production E3500, and I didn't have the luxury of swapping
> > out each memory module one at a time to find the bad one, so I just
> > replaced the whole bank, and that fixed it. I then turned the 8
> > memory modules over to Sun, and they were able to figure out which one
> > of the memory modules was bad.
> >
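With three panics posted so far, it can help to line up which board and socket group each one named. The following is a rough Python sketch under my own assumptions: `banks_reported` is a hypothetical helper, and `EVENTS` holds one "Memory Module Board" line from each distinct panic in this thread (the Feb 10 log was posted twice with the same errID, so it is counted once). It only tallies what the kernel reported and draws no conclusion about the failing part.

```python
import re
from collections import Counter

# Hypothetical pattern matching the "Memory Module Board N Jxxxx ..." text
# that appears in both the PCI UE and the AFT1 messages in this thread.
MODULE = re.compile(r"Memory Module Board (\d+) ((?:J\d+ ?)+)")

# One line per distinct panic, copied from the posts above.
EVENTS = [
    "Feb 1 22:54:40 e4500a double word offset=3, Memory Module Board 0 "
    "J3101 J3201 J3301 J3401 J3501 J3601 J3701 J3801 id 3.",
    "Feb 3 17:30:49 e4500a UDBL Syndrome 0x3 Memory Module Board 0 "
    "J3100 J3200 J3300 J3400 J3500 J3600 J3700 J3800",
    "Feb 10 03:28:33 e4500a UDBL Syndrome 0x3 Memory Module Board 2 "
    "J3101 J3201 J3301 J3401 J3501 J3601 J3701 J3801",
]


def banks_reported(lines):
    """Count how often each (board, socket group) pair is named."""
    counts = Counter()
    for line in lines:
        m = MODULE.search(line)
        if m:
            counts[(int(m.group(1)), tuple(m.group(2).split()))] += 1
    return counts


if __name__ == "__main__":
    for (board, sockets), n in sorted(banks_reported(EVENTS).items()):
        print(f"board {board}  {sockets[0]}..{sockets[-1]}  reported {n}x")
```

Run against these three lines, each panic lands on a different board and socket group, which fits the kernel's own hint that syndrome 0x3 "may not be a memory module problem".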