Hardware Diagnostics for Oracle Sun systems, A Toolkit for System Administrators

The easiest way to diagnose a hardware-related problem on an Oracle Sun server is to use the OBP ok prompt commands, the Power On Self Test (POST), and the status LEDs on the system boards.

You can diagnose hardware-related problems on Oracle Sun server and desktop products. With these low-level diagnostics, you can establish the state of the system and attached devices. For example, you can determine whether a device is recognized by the system and working properly, or obtain useful system configuration information.

OBP DIAGNOSTIC COMMANDS AND TOOLS
OBP is a powerful, low-level interface to the system and devices attached to the system (OBP is also known as the ok prompt). By entering simple OBP commands, you can learn system configuration details such as the ethernet address, the CPU and bus speeds, installed memory, and so on. Using OBP, you can also query and set system parameter values such as the default boot device, run tests on devices such as the network interface, and display the SCSI and SBUS devices attached to the system.

The following commands are available at the OBP ok prompt:
—————————-
banner
Displays the power on banner. The banner includes information such as CPU speed, OBP revision, total system memory, ethernet address and hostid.

devalias alias path
Defines a new device alias, where alias is the new alias name and path is the physical path of the device. If devalias is used without arguments, it displays all system device aliases.

.enet-addr
Displays the Ethernet address.

led-off/led-on
Turns the system led off or on.

nvalias name path
Creates a new alias for a device, where name is the name of the alias and path is the physical path of the device. Note – Run the reset-all or the nvstore command to save the new alias in non-volatile memory (NVRAM).

nvunalias name
Deletes a user-created alias (see nvalias), where name is the name of the alias. Note – Run the reset-all or nvstore command to save changes in NVRAM.

nvstore
Copies the contents of the temporary buffer to NVRAM and discards the contents of the temporary buffer.

power-off/power-on
Powers the system off or on.

printenv
Displays all parameters, settings, and values.

probe-fcal-all
Identifies Fibre Channel Arbitrated Loop (FCAL) devices on a system.

probe-sbus
Identifies devices attached to all SBUS slots. Note – This command works only on systems with SBUS slots.

probe-scsi
Identifies devices attached to the onboard SCSI bus.

probe-scsi-all
Identifies devices attached to all SCSI buses.

set-default parameter
Resets the value of parameter to the default setting.

set-defaults
Resets the value of all parameters to the default settings. Tip – You can also press the Stop and N keys simultaneously during system power-up to reset the values to their defaults.

setenv parameter value
Sets parameter to specified value. Note – Run the reset-all command to save changes in NVRAM.

show-devs
Displays all the devices recognized by the system.

show-disks
Displays the physical device path for disk controllers.

show-displays
Displays the physical device path for frame buffers.

show-nets
Displays the physical device path for network interfaces.

show-post-results
If run after Power On Self Test (POST) is completed, this command displays the findings of POST in a readable format.

show-sbus
Displays devices attached to all SBUS slots. Similar to probe-sbus.

show-tapes
Displays the physical device path for tape controllers.

sifting string
Searches for OBP commands or methods that contain string. For example, the sifting probe command displays probe-scsi, probe-scsi-all, probe-sbus, and so on.

speed
Displays CPU and bus speeds.

test device-specifier
Executes the selftest method for device-specifier. For example, the test net command tests the network connection.

test-all
Tests all devices that have a built-in test method.

version
Displays OBP and POST version information.

watch-clock
Tests a clock function.

watch-net
Monitors the network connection for the primary interface.

watch-net-all
Monitors all the network connections.

words
Displays all OBP commands and methods.

—————————-

OBDIAG
OBDiag (OpenBoot Diagnostics) displays diagnostic and error messages on the system console.

How To Run OBDiag
To run OBDiag, type obdiag at the OpenBoot ok prompt.
You can also set up OBDiag to run automatically when the system is powered on, using any of the following methods:

Set the OBP diagnostics variable:   ok setenv diag-switch? true
Press the Stop and D keys simultaneously while you power on the system.
On Ultra Enterprise servers, turn the key switch to the diagnostics position and power on the system.

POWER ON SELF TEST (POST)
POST is a program that resides in the firmware of each board in a system, and it is used to initialize, configure, and test the system boards. POST output is sent to serial port A (on an Ultra Enterprise server, POST output is sent only to serial port A on the system and clock board). The status LEDs of each system board on Ultra Enterprise servers indicate the POST completion status. For example, if a system board fails the POST test, the amber LED stays lit.
You can watch POST output in real time by attaching a terminal device to serial port A. If none is available, you can use the OBP command show-post-results to view the results after POST completes.

How To Run POST
– Connect a terminal to serial port A
– Set diag-switch? to ‘true’
ok setenv diag-switch? true
– Set the desired testing level (min or max), for example:
ok setenv diag-level max
– Set the auto-boot? variable to ‘false’
ok setenv auto-boot? false
– Run ‘reset-all’
ok reset-all
– Reboot or power cycle the system

SOLARIS OPERATING ENVIRONMENT DIAGNOSTIC COMMANDS
The following OS commands display system configuration details such as failed Field Replaceable Units (FRUs), hardware revision information, installed patches, and so on.

/usr/platform/sun4u/sbin/prtdiag -v
Displays system configuration and diagnostic information, and lists any failed Field Replaceable Units (FRU).

/usr/bin/showrev [-p]
Displays revision information for the current hardware and software. When used with the -p option, displays installed patches.

/usr/sbin/prtconf
Displays system configuration information.

/usr/sbin/psrinfo -v
Displays CPU information, including clock speed.

###########
ref# Doc ID 1005946.1

How to Upload Files to Oracle Support

Because the old Sun/Oracle file upload method has been discontinued, here are the methods for uploading files to Oracle Support, each with its file size limit.

  • FTPS & HTTPS to MOS File Upload service – 200 GB max

  1.     Set “ftps://transport.oracle.com” as the Host
  2.     Supply the appropriate credentials (MOS Support Portal username and password)
  3.     Leave the Port setting blank
  4.     After connecting, double-click on the Issue directory in the right (Remote) pane
  5.     Double-click the SR number’s directory in the right (Remote) pane
  6.     Locate the file to be transferred in the left (Local) pane
  7.     Drag-and-drop the file into the relevant SR directory
  • Diagnostic Assistant (DA), using MOS file utilities – 200 GB max

Diagnostic Assistant (DA)

DA 2.2 (included with RDA/Explorer/STB 8.02) now supports uploads via HTTPS to the MOS File Upload Service. Use DA via its menus, via Explorer, or from the command line.

Menu

  1. Run the Diagnostic Assistant menu: /<linux/solaris rda home>/da/da.sh menu or <win rda home>\da\da.cmd menu
  2. Start with option 3: RDA, OCM, ADR, SR Creation / Packaging, and MOS Tools
  3. Next select option 4: Package, Upload Diagnostic Files
  4. Complete it with option 7: Upload File Package to SR
  5. You will be prompted for your SR, credentials and the file.

 

To use DA for a command-line upload:

da.sh upload -p sr=<SR Number>file=path=<path to file>

To use DA to upload with Explorer:

explorer -w default -T DA -SR <Service Request number>

NOTE: If SR Number is not specified, the file will be uploaded to transport.oracle.com/upload/proactive/

  • Secure File Transport (SFT), part of ASR Manager – 200 GB max

# /opt/SUNWsasm/bin/sasm transport -r
Enter “1” to select:
1) transport.oracle.com
Or, enter:
https://transport.oracle.com

  • FTP, including SFTP, is not supported

*Reference: Doc ID 1547088.2 and Doc ID 1596914.1

How to Analyze Solaris Crash Dump

Did your Solaris OS suddenly crash, hang, or reboot by itself for no apparent reason? If an initial check finds no amber light and no faulty hardware on the server, it is time to check the crash dump file.

If your system is properly installed and configured, then at the moment it crashes it saves the contents of memory to a file called the crash dump. Check the crash dump location with the “dumpadm” command. It is usually /var/crash/<server_hostname>/, with the file name vmdump.0.

What we need to do is read and analyze this file. Unfortunately, the dump cannot be read directly; it requires a special package: SUNWscat, the Oracle Solaris Crash Analysis Tool.

Download and install SUNWscat – Solaris Crash Analyzer Tool

The latest version of the Oracle Solaris Crash Analysis Tool is 5.5. The packages are available on MOS and can be found by searching for patch IDs 21099218 (combined package supporting SPARC and x86/x64) and 21099215 (platform-specific packages).

Go to MOS at https://support.oracle.com and log in, click the Patches and Updates tab at the top, and enter 21099218 or 21099215 as the patch number.

————

[Install the package:

root@solaris10 # ls
p21099218_55000_Generic.zip
root@solaris10 # unzip p21099218_55000_Generic.zip
Archive: p21099218_55000_Generic.zip
inflating: Readme.txt
inflating: SUNWscat5.5-GA-combined.pkg.gz
root@solaris10 #
root@solaris10 # gunzip SUNWscat5.5-GA-combined.pkg.gz
root@solaris10 #
root@solaris10 # ls
Readme.txt SUNWscat5.5-GA-combined.pkg p21099218_55000_Generic.zip
root@solaris10 # pkgadd -d SUNWscat5.5-GA-combined.pkg

The following packages are available:
1 SUNWscat Oracle Solaris Crash Analysis Tool (5.5 GA)
(any) 5.5

Select package(s) you wish to process (or ‘all’ to process
all packages). (default: all) [?,??,q]: all

Processing package instance <SUNWscat> from </export/home/SCAT/SUNWscat5.5-GA-combined.pkg>
….
….
Installation of <SUNWscat> was successful.

[Decompress the crash dump file:
root@solaris10 # savecore -vf vmdump.0
savecore: System dump time: Mon Aug 10 05:43:52 2015

savecore: saving system crash dump in /opt/crash/solaris10/{unix,vmcore}.0
Constructing namelist /opt/crash/solaris10/unix.0
Constructing corefile /opt/crash/solaris10/vmcore.0
1:15 100% done: 341483 of 341483 pages saved
48444 (14%) zero pages were not written
1:16 dump decompress is done
root@solaris10 #

[Start analyzing the dump:
root@solaris10 # cd /opt/crash/solaris10/
root@solaris10 # /opt/SUNWscat/bin/scat vmcore.0

Oracle Solaris Crash Analysis Tool
Version 5.5 for Oracle Solaris 10 64-bit UltraSPARC

Copyright (c) 1989, 2015, Oracle and/or its affiliates. All rights reserved.

Please note: Do not submit any health, payment card or other sensitive
production data that requires protections greater than those specified in
the Oracle GCS Security Practices. Information on how to remove data from
your submission is available at:
https://support.oracle.com/oip/faces/secure/km/DocumentDisplay.jspx?id=1227943.1

For support, please use the Oracle Solaris kernel community at
https://community.oracle.com/community/support/oracle_sun_technologies/
Select “Subspaces” and then “Oracle Solaris Performance, Panics,
Hangs, and Dtrace”.
Further information may be found at https://blogs.oracle.com/SolarisCAT/

opening unix.0 vmcore.0 …dumphdr…symtab…maps…done
loading crashdump data: modules…CTF…globals…done

crash file: /opt/crash/solaris10/vmcore.0
user: Super-User (root:0)
release: 5.10 (64-bit)
version: Generic_144488-06
machine: sun4v
node name: XXXX
domain: default.solaris10.com
hw_provider: Sun_Microsystems
system type: SUNW,Netra-T5440 (UltraSPARC-T2+)
hostid: XXXXXXXX


disks…done

[run ‘analyze’:

CAT(vmcore.0/10V)> analyze

crash file: /opt/crash/solaris10/vmcore.0
user: Super-User (root:0)
release: 5.10 (64-bit)
version: Generic_144488-06
machine: sun4v
node name: XXX
domain: default.server.com
hw_provider: Sun_Microsystems
system type: SUNW,Netra-T5440 (UltraSPARC-T2+)
hostid: xxxxxxxx
dump_conflags: 0x10000 (DUMP_KERNEL) on /dev/dsk/c1t0d0s1(8G)
time of crash: Mon Aug 10 05:43:01 WIT 2015
age of system: 37 days 19 hours 54 minutes 4 seconds
panic CPU: 96 (96 CPUs, 15.7G memory, 2 nodes)
panic string: xt_sync: timeout
==== panic thread: 0x300c2653200 ==== CPU: 96 ====
==== panic user (LWP_SYS) thread: 0x300c2653200 PID: 22900 on CPU: 96 ====
cmd: /bin/sh /opt/scripts/xxxxxxx Called from script ‘/opt/scrips’    >>>>THE ROOT CAUSE OF CRASH/HUNG
t_procp: 0x60024f870b0
p_as: 0x6003a631c18 size: 1769472 RSS: 1482752

….
— switch to user thread’s user stack —

——————-

‘analyze’ is just one of the initial investigation commands; type help to see the others:

CAT(vmcore.0/10V)> help
CAT(vmcore.0/10V)>
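
If you save the ‘analyze’ output to a text file, the header fields can be pulled out with a short script. The following is only a sketch in Python: the function name and the list of wanted fields are my own choices, and it assumes the header keeps the “key: value” layout shown in the transcript above.

```python
def parse_analyze_header(text):
    """Collect selected 'key: value' fields from saved scat 'analyze' output."""
    # Field names are taken from the transcript above; the selection is arbitrary.
    wanted = {"panic string", "time of crash", "age of system",
              "system type", "release"}
    fields = {}
    for line in text.splitlines():
        # Split on the first ': ' only, so values such as
        # 'xt_sync: timeout' are kept whole.
        key, sep, value = line.partition(": ")
        key = key.strip()
        if sep and key in wanted and key not in fields:
            fields[key] = value.strip()
    return fields


# Sample lines copied from the 'analyze' output shown above.
sample = """\
release: 5.10 (64-bit)
system type: SUNW,Netra-T5440 (UltraSPARC-T2+)
time of crash: Mon Aug 10 05:43:01 WIT 2015
age of system: 37 days 19 hours 54 minutes 4 seconds
panic string: xt_sync: timeout
"""

if __name__ == "__main__":
    for name, value in parse_analyze_header(sample).items():
        print(f"{name:>15}: {value}")
```

This is handy when you have to compare the panic string and uptime across several dumps from the same machine.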

MPT Firmware Fault, code 0800

I got the error messages below while booting my SPARC M5000 machine:

{20} ok boot
Boot device: root  File and args:
MPT Firmware Fault, code 0800

read failed
ERROR: FCode Aborted.

The file just loaded does not appear to be executable.
{20} ok Sep  9 09:56:27 dm6-sc0 fmd: SOURCE: sde, REV: 1.16, CSN: BEF0850709  EVENT-ID: 49817271-d01d-4ece-8098-a362c9e52f71 Refer to http://www.sun.com/msg/SCF-8001-KC for detailed information.

I could not find any clue on MOS about what the “MPT Firmware Fault, code 0800” error means. Technical support asked me to power cycle the machine twice and even replaced the IOU, but the problem persisted.

After some troubleshooting, the steps below solved my problem:

[Boot into single-user mode from DVD or the network, run fsck, mount the root file system, run installboot, then update the boot archive:

{29} ok boot net -s
# fsck -F ufs -y /dev/rdsk/c0t0d0s0

** /dev/rdsk/c0t0d0s0
BAD SUPERBLOCK AT BLOCK 16: MAGIC NUMBER WRONG
LOOK FOR ALTERNATE SUPERBLOCKS WITH MKFS?  yes
FOUND ALTERNATE SUPERBLOCK 80032 WITH MKFS
USE ALTERNATE SUPERBLOCK?  yes

FOUND ALTERNATE SUPERBLOCK AT 80032 USING MKFS
If filesystem was created with manually-specified geometry, using
auto-discovered superblock may result in irrecoverable damage to
filesystem and user data.

CANCEL FILESYSTEM CHECK?  yes

Please verify that the indicated block contains a proper
superblock for the filesystem (see fsdb(1M)).

FSCK was running in YES mode.  If you wish to run in that mode using
the alternate superblock, run `fsck -y -o b=80032 /dev/rdsk/c0t0d0s0′.

——————-

So there was a bad superblock. I needed to find a backup superblock and then re-run fsck with the -o b=<backup_superblock> option:

# newfs -Nv /dev/rdsk/c0t0d0s0
mkfs -F ufs -o N /dev/rdsk/c0t0d0s0 81937152 -1 -1 8192 1024 160 1 167 8192 t 0 -1 8 128 n
Warning: 5376 sector(s) in last cylinder unallocated
/dev/rdsk/c0t0d0s0:     81937152 sectors in 13337 cylinders of 48 tracks, 128 sectors
40008.4MB in 834 cyl groups (16 c/g, 48.00MB/g, 5824 i/g)
super-block backups (for fsck -F ufs -o b=#) at:
32, 98464, 196896, 295328, 393760, 492192, 590624, 689056, 787488, 885920,
Initializing cylinder groups:
…………….
super-block backups for last 10 cylinder groups at:
81009696, 81108128, 81206560, 81304992, 81403424, 81501856, 81600288,
81698720, 81788960, 81887392
#

# fsck -F ufs -y -o b=81887392 /dev/rdsk/c0t0d0s0

*Continue running fsck on the other slices

# mount /dev/dsk/c0t0d0s0 /mnt
# installboot /mnt/usr/platform/`uname -i`/lib/fs/ufs/bootblk /dev/rdsk/c0t0d0s0
# bootadm update-archive -fv -R /mnt
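
The list of backup superblocks printed by newfs -N is long, so if you capture it to a file it can be handy to extract the block numbers with a script rather than by eye. Below is a rough Python sketch. The function name is made up, and it assumes the output looks like the transcript above, with each “super-block backups” heading followed by lines of comma-separated numbers.

```python
import re


def superblock_backups(newfs_output):
    """Return the backup superblock numbers from `newfs -N` output, in order."""
    blocks = []
    collecting = False
    for line in newfs_output.splitlines():
        if line.startswith("super-block backups"):
            # The number lists start on the line after this heading.
            collecting = True
            continue
        if collecting:
            if line.strip() and re.fullmatch(r"[\d,\s]+", line):
                blocks.extend(int(n) for n in re.findall(r"\d+", line))
            else:
                # Any other text ends the current list.
                collecting = False
    return blocks


# Sample copied from the newfs -Nv transcript above.
sample = """\
super-block backups (for fsck -F ufs -o b=#) at:
32, 98464, 196896, 295328, 393760, 492192, 590624, 689056, 787488, 885920,
Initializing cylinder groups:
...............
super-block backups for last 10 cylinder groups at:
81009696, 81108128, 81206560, 81304992, 81403424, 81501856, 81600288,
81698720, 81788960, 81887392
#
"""

if __name__ == "__main__":
    backups = superblock_backups(sample)
    # Print the fsck command for the last backup, as used above.
    print(f"fsck -F ufs -y -o b={backups[-1]} /dev/rdsk/c0t0d0s0")
```

If fsck rejects one backup superblock, you can simply move on to the next number in the list.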

How To replace M4000/M5000 XSCF board

The XSCF (eXtended System Control Facility) unit is the service processor for M-Series servers.
The XSCF unit is a cold-replacement component, which means the entire server must be powered off and the power cords disconnected to replace it. Run the “showhardconf” or “showstatus” command first to confirm that the XSCF is faulted.

XSCF> showhardconf


*   XSCFU Status:Degraded,Active; Ver:0101h; Serial:BFxxxxxxx  ;
+ FRU-Part-Number:CF00541-0481 04   /541-0481-04       ;

I have been asked by several people how to back up the XSCF configuration before replacing the XSCF board. They presume the configuration needs to be backed up first because there is only one XSCF board in an M4000/M5000 server. In fact there is no need, because a backup copy of the XSCF configuration is kept on the Operator Panel. The XSCF and the Operator Panel synchronize their data each time the XSCF boots or the XSCF configuration changes. That is why the XSCF and the Operator Panel must not be replaced at the same time.

When you are ready to replace the XSCF board, follow these instructions:

[Shut down the OS, power off the server, and unplug the power cords and the XSCF Ethernet cables.

[Using proper ESD grounding technique and an antistatic mat, replace the XSCF board:

*M4000 XSCF board location:

*M5000 XSCF board location:

# Plug in all cables, then power on the server and wait for the new XSCF board to start up. It will reboot two or three times. During startup you will see messages showing the XSCF and OPNL synchronizing their data:

…..
initialize XSCF common database (OWN)  —  complete
synchronize setup data (XSCF -> OPNL)  —  complete
initialize XSCF common database (ACTIVE)  —  complete
wait for database synchronization  —  complete
execute S00clis_all  —  complete
…..

[When the boot process has finished, try to log in. If you see the error messages below:

XCP version of Panel EEPROM and XSCF FMEM mismatched,
Panel EEPROM=1090, XSCF FMEM=1100

Then you need to upgrade the XSCF firmware. Download the latest firmware from MOS, then perform the upgrade.

[XSCF FIRMWARE UPGRADE:

*VIA FTP:

XSCF> getflashimage -l        >> list the firmware image files already on the XSCF
XSCF> getflashimage -u AZIZ ftp://10.32.17.61/FFXCP1112.tar.gz    >> AZIZ is the username, 10.32.17.61 is the FTP server on my laptop
Password: *******
0MB received
1MB received
2MB received
3MB received
4MB received
5MB received
6MB received
7MB received
8MB received

Download successful: 42660 Kbytes in 50 secs (987.298 Kbytes/sec)
Checking file…
MD5: 73ca6370dc6c636f2e3845b66caa203a
XSCF> getflashimage -l

XSCF> flashupdate -c check -m xcp -s 1112
XCP update is possible with domains up

XSCF> flashupdate -c update -m xcp -s 1112
The XSCF will be reset. Continue? [y|n] :y
Checking the XCP image file, please wait a minute
XCP update is started (XCP version=1112:last version=1081)
OpenBoot PROM update is started (OpenBoot PROM version=02180000)

*VIA USB:
– Check the version, then update the firmware
XSCF> version -c xcp -v

XSCF> getflashimage file:///media/usb_msd/FFXCP1112.tar.gz

Note the different firmware files for each M-Series model:
getflashimage file:///media/usb_msd/IKXCP1112.tar.gz    >>for M3000
getflashimage file:///media/usb_msd/FFXCP1112.tar.gz    >>for M4000/5000
getflashimage file:///media/usb_msd/DCXCP1112.tar.gz    >>for M8000/M9000
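
If you look after several M-Series models, the prefix-to-model mapping above can be kept in a small lookup so you do not load the wrong image. This Python sketch is illustrative only. The function names are hypothetical, and the prefixes and the /media/usb_msd path are simply the ones from the list above.

```python
# Firmware file prefixes per model, as listed above.
XCP_PREFIX = {
    "M3000": "IKXCP",
    "M4000": "FFXCP",
    "M5000": "FFXCP",
    "M8000": "DCXCP",
    "M9000": "DCXCP",
}


def xcp_image_name(model, xcp_version):
    """Build the firmware file name, e.g. FFXCP1112.tar.gz for an M5000."""
    prefix = XCP_PREFIX[model.upper()]  # raises KeyError for an unknown model
    return f"{prefix}{xcp_version}.tar.gz"


def usb_getflashimage_command(model, xcp_version):
    """Build the getflashimage command line for an image on the USB stick."""
    return f"getflashimage file:///media/usb_msd/{xcp_image_name(model, xcp_version)}"


if __name__ == "__main__":
    print(usb_getflashimage_command("M5000", "1112"))
```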

XSCF> flashupdate -c check -m xcp -s 1112
XCP update is possible with domains up

XSCF> flashupdate -c update -m xcp -s 1112

XSCF> version -c xcp -v
XCP0 (Reserve): 1110 <<XCP0 will take few minutes to finish update
OpenBoot PROM : 02.29.0000
XSCF          : 01.11.0000
XCP1 (Current): 1112 <<updated already
OpenBoot PROM : 02.29.0000
XSCF          : 01.11.0002
OpenBoot PROM BACKUP

XSCF> version -c cmu -v

[Once the firmware upgrade is finished, or if there was no firmware issue, check the device status again with the “showhardconf” and “showstatus” commands.

[Continue powering on the domain:

XSCF> poweron -d0
DomainIDs to power on:00
Continue? [y|n] :y
Poweron canceled due to invalid system date and time.
XSCF>

Wait, did you see the error message above? The domain cannot power on because the system date and time are invalid.

#Set the new date and time. Example for 24 Oct 2012 at 10:23:

XSCF> setdate -u -s 102410232012.00
Wed Oct 24 10:23:00 UTC 2012
The XSCF will be reset. Continue? [y|n] :y

#If you want to change the time zone, run the settimezone command. Example:

XSCF> settimezone -c settz -s Asia/Jakarta

#Done. Now power on the domain again.
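
The date string in the setdate example above is in MMDDhhmmYYYY.ss order, which is easy to get wrong when typing by hand. As a small sketch (Python, with a function name of my own), it can be generated from a datetime like this:

```python
from datetime import datetime


def xscf_setdate_arg(dt):
    """Format a datetime as MMDDhhmmYYYY.ss, the layout used in the example above."""
    return dt.strftime("%m%d%H%M%Y.%S")


if __name__ == "__main__":
    # 24 Oct 2012 at 10:23:00, the same example as above.
    when = datetime(2012, 10, 24, 10, 23, 0)
    print(f"setdate -u -s {xscf_setdate_arg(when)}")
```

With the -u option, as in the example above, pass the time in UTC and not your local time.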

How to Configure SL24 / SL48 with Netbackup

The SL24 and SL48 are Oracle’s entry-level autoloader and tape library.

See the Oracle StorageTek SL24/SL48 documentation for complete details.

The SL24/SL48 library uses a single SCSI ID and two logical unit numbers (LUNs). LUN 0 controls the tape drive and LUN 1 controls the robot. It therefore requires an HBA that supports multiple LUNs. If multiple-LUN support is not enabled, the host server cannot scan beyond LUN 0 to discover the library; it sees only the tape drive.

To check the device and connectivity status from Solaris, use the show_FCP_dev option, “cfgadm -o show_FCP_dev -al”, instead of the plain “cfgadm -al” command. The robot (changer) is not shown by the standard “cfgadm -al” command.
If the changer is already detected by “cfgadm -o show_FCP_dev -al” but still not detected by the NBU sgscan command, check your NBU device configuration. You need to modify the st.conf file so that devices on both LUNs are detected.

[Find the following line in the st.conf file:

name="st" target=0 lun=0;

Replace that line and the following lines through target 5 with the following. Doing so modifies the st.conf file to include searches on non-zero LUNs:

name="st" target=0 lun=0;
name="st" target=0 lun=1;
name="st" target=1 lun=0;
name="st" target=1 lun=1;
name="st" target=2 lun=0;
name="st" target=2 lun=1;
name="st" target=3 lun=0;
name="st" target=3 lun=1;
name="st" target=4 lun=0;
name="st" target=4 lun=1;
name="st" target=5 lun=0;
name="st" target=5 lun=1;
name="st" parent="fp" target=0;
name="st" parent="fp" target=1;
name="st" parent="fp" target=2;
name="st" parent="fp" target=3;
name="st" parent="fp" target=4;
name="st" parent="fp" target=5;
name="st" parent="fp" target=6;
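
Typing these entries by hand invites typos, so here is a minimal Python sketch that generates the same block. The function name and its parameters are hypothetical. The defaults (targets 0 to 5, LUNs 0 and 1, fp targets 0 to 6) reproduce exactly the lines above.

```python
def st_conf_entries(max_target=5, luns=(0, 1), fp_targets=range(7)):
    """Generate st.conf lines that probe each target on every listed LUN."""
    lines = [
        f'name="st" target={target} lun={lun};'
        for target in range(max_target + 1)
        for lun in luns
    ]
    # Entries for tape drives attached through the fp (Fibre Channel) parent.
    lines += [f'name="st" parent="fp" target={target};' for target in fp_targets]
    return lines


if __name__ == "__main__":
    print("\n".join(st_conf_entries()))
```

Redirect the output to a scratch file and paste it into st.conf yourself; the script does not touch the real file.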

See the NetBackup device configuration documentation for complete information on configuring tape drives and robotic devices.

If the SL24/SL48 has SAS tape drives and you are using an LSI SAS HBA, check the SAS HBA firmware level and upgrade it if needed.

There is a known issue with the LSI SAS1 (3 Gb/s) HBA at firmware level 1.26.00 and below, where the HBA does not see any SAS devices connected to it. See the document below (MOS access required) for more detail.

HBA – LSI SAS HBA Firmware Issue, SAS Devices Not Being Seen by Server (Doc ID 1350564.1)

ERROR: Last Trap: Instruction Access Exception

{0} ok boot
Boot device: /pci@7c0/pci@0/pci@1/pci@0,2/LSILogic,sas@2/disk@0,0:a File and args:
Loading ufs-file-system package 1.4 04 Aug 1995 13:02:54.
FCode UFS Reader 1.12 00/07/17 15:48:16.
Loading: /platform/SUNW,Sun-Fire-T200/ufsboot
Loading: /platform/sun4v/ufsboot
ERROR: Last Trap: Instruction Access Exception

If you get the error messages above when powering on a Sun server (T-series, T1000/T2000) and the boot process hangs there, try the simple troubleshooting steps below before calling Oracle support or opening an SR via MOS:

Unplug all USB devices (USB keyboard and mouse, KVM, and so on), connect your laptop/PC to the server through the serial port, then reboot the server. If the error messages disappear, the server should boot as usual.

This issue is mostly related to a USB keyboard, mouse, or other USB device. The fault could be in the USB device or in the server’s USB port itself. Try plugging the USB device into another port, then reboot the server again. The T2000 has 4 USB ports on the back and 2 USB ports on the front.