Hardware Diagnostics for Oracle Sun systems, A Toolkit for System Administrators

The easiest way to diagnose a hardware-related problem on an Oracle Sun server is to use the OBP ok prompt commands, the Power On Self Test (POST), and the status LEDs on the system boards.

You can diagnose hardware-related problems on Oracle Sun server and desktop products. With these low-level diagnostics, you can establish the state of the system and attached devices. For example, you can determine whether a device is recognized by the system and working properly, or obtain useful system configuration information.

OBP DIAGNOSTIC COMMANDS AND TOOLS
OBP is a powerful, low-level interface to the system and devices attached to the system (OBP is also known as the ok prompt). By entering simple OBP commands, you can learn system configuration details such as the ethernet address, the CPU and bus speeds, installed memory, and so on. Using OBP, you can also query and set system parameter values such as the default boot device, run tests on devices such as the network interface, and display the SCSI and SBUS devices attached to the system.

Below are the commands available at the OBP ok prompt:
—————————-
banner
Displays the power on banner. The banner includes information such as CPU speed, OBP revision, total system memory, ethernet address and hostid.

devalias alias path
Defines a new device alias, where alias is the new alias name and path is the physical path of the device. If devalias is used without arguments, it displays all system device aliases.

.enet-addr
Displays the ethernet address

led-off/led-on
Turns the system led off or on.

nvalias name path
Creates a new alias for a device, where name is the name of the alias and path is the physical path of the device. Note – Run the reset-all or the nvstore command to save the new alias in non-volatile memory (NVRAM).

nvunalias name
Deletes a user-created alias (see nvalias), where name is the name of the alias. Note – Run the reset-all or nvstore command to save changes in NVRAM.

nvstore
Copies the contents of the temporary buffer to NVRAM and discards the contents of the temporary buffer.

power-off/power-on
Powers the system off or on.

printenv
Displays all parameters, settings, and values

probe-fcal-all
Identifies Fibre Channel Arbitrated Loop (FCAL) devices on a system.

probe-sbus
Identifies devices attached to all SBUS slots. Note – This command works only on systems with SBUS slots.

probe-scsi
Identifies devices attached to the onboard SCSI bus.

probe-scsi-all
Identifies devices attached to all SCSI busses.

set-default parameter
Resets the value of parameter to the default setting.

set-defaults
Resets the value of all parameters to the default settings. Tip – You can also press the Stop and N keys simultaneously during system power-up to reset the values to their defaults.

setenv parameter value
Sets parameter to specified value. Note – Run the reset-all command to save changes in NVRAM.

show-devs
Displays all the devices recognized by the system.

show-disks
Displays the physical device path for disk controllers.

show-displays
Displays the physical device path for frame buffers.

show-nets
Displays the physical device path for network interfaces

show-post-results
If run after Power On Self Test (POST) is completed, this command displays the findings of POST in a readable format.

show-sbus
Displays devices attached to all SBUS slots. Similar to probe-sbus.

show-tapes
Displays the physical device path for tape controllers.

sifting string
Searches for OBP commands or methods that contain string. For example, the sifting probe command displays probe-scsi, probe-scsi-all, probe-sbus, and so on.

speed
Displays CPU and bus speeds

test device-specifier
Executes the selftest method for device-specifier. For example, the test net command tests the network connection.

test-all
Tests all devices that have a built-in test method.

version
Displays OBP and POST version information.

watch-clock
Tests a clock function.

watch-net
Monitors the network connection for the primary interface.

watch-net-all
Monitors all the network connections.

words
Displays all OBP commands and methods

—————————-

OBDIAG
OBDiag (OpenBoot Diagnostics) displays diagnostic and error messages on the system console.

How To Run OBDiag
To run OBDiag, simply type obdiag at the OpenBoot ok prompt.
You can also set up OBDiag to run automatically when the system is powered on, using any of the following methods:

– Set the OBP diagnostics variable: ok setenv diag-switch? true
– Press the Stop and D keys simultaneously while you power on the system.
– On Ultra Enterprise servers, turn the key switch to the diagnostics position and power on the system.

POWER ON SELF TEST (POST)
POST is a program that resides in the firmware of each board in a system, and it is used to initialize, configure, and test the system boards. POST output is sent to serial port A (on an Ultra Enterprise server, POST output is sent only to serial port A on the system and clock board). The status LEDs of each system board on Ultra Enterprise servers indicate the POST completion status. For example, if a system board fails the POST test, the amber LED stays lit.
You can watch POST output in real time by attaching a terminal device to serial port A. If none is available, you can use the OBP command show-post-results to view the results after POST completes.

How To Run POST
– Connect a terminal to serial port A.
– Set diag-switch? to ‘true’:
ok setenv diag-switch? true
– Set the desired testing level (min or max), for example:
ok setenv diag-level max
– Set the auto-boot? variable to ‘false’:
ok setenv auto-boot? false
– Run ‘reset-all’:
ok reset-all
– Reboot or power cycle the system.

SOLARIS OPERATING ENVIRONMENT DIAGNOSTIC COMMANDS
The following list describes OS commands you can use to display the system configuration, such as failed Field Replaceable Units (FRU), hardware revision information, installed patches, and so on.

/usr/platform/sun4u/sbin/prtdiag -v
Displays system configuration and diagnostic information, and lists any failed Field Replaceable Units (FRU).

/usr/bin/showrev [-p]
Displays revision information for the current hardware and software. When used with the -p option, displays installed patches.

/usr/sbin/prtconf
Displays system configuration information.

/usr/sbin/psrinfo -v
Displays CPU information, including clock speed.

###########
ref# Doc ID 1005946.1


How to Upload Files to Oracle Support

Because the old Sun/Oracle file upload method has been discontinued, below are several methods for uploading files to Oracle Support, based on file size.


  • FTPS & HTTPS to MOS File Upload service – 200 GB max


  1.     Set “ftps://transport.oracle.com” as the Host
  2.     Supply the appropriate credentials (MOS Support Portal username and password)
  3.     Leave the Port setting blank
  4.     After connecting, double-click on the Issue directory in the right (Remote) pane
  5.     Double-click the SR number’s directory in the right (Remote) pane
  6.     Locate the file to be transferred in the left (Local) pane
  7.     Drag-and-drop the file into the relevant SR directory
  • Diagnostic Assistant (DA), using MOS file utilities – 200 GB max

Diagnostic Assistant (DA)

DA 2.2 (included with RDA/Explorer/STB 8.02) now supports uploads via HTTPS to the MOS File Upload Service. Use DA via the menus, Explorer, or the command line.

Menu

  1. Run the Diagnostic Assistant menu: /<linux/solaris rda home>/da/da.sh menu or <win rda home>\da\da.cmd menu
  2. Start with option 3: RDA, OCM, ADR, SR Creation / Packaging, and MOS Tools
  3. Next, select option 4: Package, Upload Diagnostic Files
  4. Complete it with option 7: Upload File Package to SR
  5. You will be prompted for your SR, credentials and the file.

 

To use DA to do a command-line upload:

da.sh upload -p sr=<SR Number> file=path=<path to file>

To use DA to upload with Explorer:

explorer -w default -T DA -SR <Service Request number>

NOTE: If SR Number is not specified, the file will be uploaded to transport.oracle.com/upload/proactive/

  • Secure File Transport (SFT), part of ASR Manager – 200 GB max

# /opt/SUNWsasm/bin/sasm transport -r
Enter “1” to select:
1) transport.oracle.com
Or, enter:
https://transport.oracle.com

  • FTP, including SFTP, is not supported

*Reference: Doc ID 1547088.2 and Doc ID 1596914.1

MPT Firmware Fault, code 0800

I got the error messages below while booting my SPARC M5000 machine:

{20} ok boot
Boot device: root  File and args:
MPT Firmware Fault, code 0800

read failed
ERROR: FCode Aborted.

The file just loaded does not appear to be executable.
{20} ok Sep  9 09:56:27 dm6-sc0 fmd: SOURCE: sde, REV: 1.16, CSN: BEF0850709  EVENT-ID: 49817271-d01d-4ece-8098-a362c9e52f71 Refer to http://www.sun.com/msg/SCF-8001-KC for detailed information.

I could not find any clue on MOS about what the “MPT Firmware Fault, code 0800” error code means. Technical support asked me to power cycle the machine twice and even replaced the IOU, but the problem persisted.

After some troubleshooting, the steps below solved my problem:

– Boot into single-user mode via DVD or network, run fsck, mount the root file system, run installboot, then update the boot archive:

{29} ok boot net -s
# fsck -F ufs -y /dev/rdsk/c0t0d0s0

** /dev/rdsk/c0t0d0s0
BAD SUPERBLOCK AT BLOCK 16: MAGIC NUMBER WRONG
LOOK FOR ALTERNATE SUPERBLOCKS WITH MKFS?  yes
FOUND ALTERNATE SUPERBLOCK 80032 WITH MKFS
USE ALTERNATE SUPERBLOCK?  yes

FOUND ALTERNATE SUPERBLOCK AT 80032 USING MKFS
If filesystem was created with manually-specified geometry, using
auto-discovered superblock may result in irrecoverable damage to
filesystem and user data.

CANCEL FILESYSTEM CHECK?  yes

Please verify that the indicated block contains a proper
superblock for the filesystem (see fsdb(1M)).

FSCK was running in YES mode.  If you wish to run in that mode using
the alternate superblock, run `fsck -y -o b=80032 /dev/rdsk/c0t0d0s0'.

——————-

So there was a bad superblock. I needed to find a backup superblock and then re-run fsck with the -o b=<backup_superblock> option:

# newfs -Nv /dev/rdsk/c0t0d0s0
mkfs -F ufs -o N /dev/rdsk/c0t0d0s0 81937152 -1 -1 8192 1024 160 1 167 8192 t 0 -1 8 128 n
Warning: 5376 sector(s) in last cylinder unallocated
/dev/rdsk/c0t0d0s0:     81937152 sectors in 13337 cylinders of 48 tracks, 128 sectors
40008.4MB in 834 cyl groups (16 c/g, 48.00MB/g, 5824 i/g)
super-block backups (for fsck -F ufs -o b=#) at:
32, 98464, 196896, 295328, 393760, 492192, 590624, 689056, 787488, 885920,
Initializing cylinder groups:
…………….
super-block backups for last 10 cylinder groups at:
81009696, 81108128, 81206560, 81304992, 81403424, 81501856, 81600288,
81698720, 81788960, 81887392
#

# fsck -F ufs -y -o b=81887392 /dev/rdsk/c0t0d0s0

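*Continue running fsck on the other slices, then mount the root file system, reinstall the boot block, and update the boot archive: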

# mount /dev/dsk/c0t0d0s0 /mnt
# installboot /mnt/usr/platform/`uname -i`/lib/fs/ufs/bootblk /dev/rdsk/c0t0d0s0
# bootadm update-archive -fv -R /mnt
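
The backup superblock hunt above can be scripted. What follows is only a minimal sketch, not part of the original fix: the function name list_backup_superblocks is made up for illustration, and it assumes the `newfs -N` output keeps the layout shown above (a "super-block backups" header line followed by lines of comma-separated block numbers).

```shell
# list_backup_superblocks (hypothetical helper name):
# reads `newfs -Nv <raw device>` output on stdin and prints one backup
# superblock number per line.
# Assumption: the number lists follow a line containing "super-block backups".
list_backup_superblocks() {
    awk '
        /super-block backups/   { grab = 1; next }   # header: start collecting
        grab && /^[0-9, \t]+$/  { gsub(/[ ,\t]+/, "\n"); print; next }
                                { grab = 0 }         # any other line ends the list
    ' | sed '/^$/d'                                  # drop blank lines left by the split
}
```

For example, `newfs -Nv /dev/rdsk/c0t0d0s0 | list_backup_superblocks | tail -1` should print the last backup superblock, which can then be passed to fsck with -o b=. The -N option only prints what newfs would do, so nothing is written to the slice.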

How To replace M4000/M5000 XSCF board

The XSCF (eXtended System Control Facility) unit is the service processor for M-Series servers.
The XSCF unit is a cold replacement component. This means the entire server must be powered off and the power cords disconnected to replace the XSCF unit. Run the “showhardconf” or “showstatus” command to confirm that the XSCF is faulted.

XSCF> showhardconf


*   XSCFU Status:Degraded,Active; Ver:0101h; Serial:BFxxxxxxx  ;
+ FRU-Part-Number:CF00541-0481 04   /541-0481-04       ;

Some people have asked me how to back up the XSCF configuration before replacing the XSCF board. They presume the configuration must be backed up first because there is only one XSCF board on an M4000/M5000 server. In fact, no backup is needed, because a backup copy of the XSCF configuration is kept on the Operator Panel. The XSCF and the Operator Panel synchronize their data each time the XSCF boots or the XSCF configuration changes. That is why the XSCF and the Operator Panel must not be replaced at the same time.

When you are ready to replace the XSCF board, follow the instructions below:

– Shut down the OS, power off the server, and unplug the power cords and the XSCF Ethernet cables.

– Using proper ESD grounding technique and an antistatic mat, replace the XSCF board:

*M4000 XSCF board location:

*M5000 XSCF board location:

– Plug in all cables, then power on the server and wait until the new XSCF board starts up. It will reboot around 2-3 times. During startup you will see messages showing the XSCF and OPNL synchronizing their data:

…..
initialize XSCF common database (OWN)  —  complete
synchronize setup data (XSCF -> OPNL)  —  complete
initialize XSCF common database (ACTIVE)  —  complete
wait for database synchronization  —  complete
execute S00clis_all  —  complete
…..

– When the boot process has finished, try to log in. If you see the error messages below:

XCP version of Panel EEPROM and XSCF FMEM mismatched,
Panel EEPROM=1090, XSCF FMEM=1100

Then you need to upgrade the XSCF firmware. Download the latest firmware from MOS, then perform firmware upgrade.

XSCF FIRMWARE UPGRADE:

*VIA FTP:

XSCF> getflashimage -l        >> list the firmware images currently on the XSCF
XSCF> getflashimage -u AZIZ ftp://10.32.17.61/FFXCP1112.tar.gz    >> AZIZ is the username, 10.32.17.61 is the FTP server on my laptop
Password: *******
0MB received
1MB received
2MB received
3MB received
4MB received
5MB received
6MB received
7MB received
8MB received

Download successful: 42660 Kbytes in 50 secs (987.298 Kbytes/sec)
Checking file…
MD5: 73ca6370dc6c636f2e3845b66caa203a
XSCF> getflashimage -l

XSCF> flashupdate -c check -m xcp -s 1112
XCP update is possible with domains up

XSCF> flashupdate -c update -m xcp -s 1112
The XSCF will be reset. Continue? [y|n] :y
Checking the XCP image file, please wait a minute
XCP update is started (XCP version=1112:last version=1081)
OpenBoot PROM update is started (OpenBoot PROM version=02180000)

*VIA USB:
– Check the version, then update the firmware:
XSCF> version -c xcp -v

XSCF> getflashimage file:///media/usb_msd/FFXCP1112.tar.gz

Note that the firmware file differs per M-Series model:
getflashimage file:///media/usb_msd/IKXCP1112.tar.gz    >>for M3000
getflashimage file:///media/usb_msd/FFXCP1112.tar.gz    >>for M4000/5000
getflashimage file:///media/usb_msd/DCXCP1112.tar.gz    >>for M8000/M9000

XSCF> flashupdate -c check -m xcp -s 1112
XCP update is possible with domains up

XSCF> flashupdate -c update -m xcp -s 1112

XSCF> version -c xcp -v
XCP0 (Reserve): 1110 <<XCP0 will take few minutes to finish update
OpenBoot PROM : 02.29.0000
XSCF          : 01.11.0000
XCP1 (Current): 1112 <<updated already
OpenBoot PROM : 02.29.0000
XSCF          : 01.11.0002
OpenBoot PROM BACKUP

XSCF> version -c cmu -v
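
The file naming noted above (IK, FF, or DC in front of XCP plus the version) is easy to mix up, so here is a small sketch that builds the expected file name for a given model. The function name xcp_image_name is my own invention, and the prefix table is taken only from the three examples in this post, so treat it as an assumption for any other model or release.

```shell
# xcp_image_name (hypothetical helper): print the XCP image file name
# for an M-Series model and XCP version, following the examples above.
# usage: xcp_image_name MODEL VERSION      e.g. xcp_image_name M5000 1112
xcp_image_name() {
    case "$1" in
        M3000)        prefix=IK ;;
        M4000|M5000)  prefix=FF ;;
        M8000|M9000)  prefix=DC ;;
        *) echo "unknown model: $1" >&2; return 1 ;;
    esac
    printf '%sXCP%s.tar.gz\n' "$prefix" "$2"
}
```

Running `xcp_image_name M5000 1112` prints FFXCP1112.tar.gz, the file used in the getflashimage examples above.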

– When you have finished upgrading the firmware, or if there is no firmware issue, check the device status again with the “showhardconf” and “showstatus” commands.

– Continue by powering on the domain:

XSCF> poweron -d0
DomainIDs to power on:00
Continue? [y|n] :y
Poweron canceled due to invalid system date and time.
XSCF>

Wait, did you see the error message above? Yes, the domain cannot be powered on because the system date and time are invalid.

– Set the new date and time. Example for 24 Oct 2012 at 10:23:

XSCF> setdate -u -s 102410232012.00
Wed Oct 24 10:23:00 UTC 2012
The XSCF will be reset. Continue? [y|n] :y
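
Judging from the example above, the string given to setdate -s is in MMDDhhmmYYYY.ss order, which is easy to get wrong by hand. Below is a small sketch that builds the string from separate fields. The function name xscf_setdate_arg is hypothetical, and it does no range checking, so it is only a convenience and not a validator.

```shell
# xscf_setdate_arg (hypothetical helper): build the MMDDhhmmYYYY.ss string
# used with "setdate -s" in the example above.
# usage: xscf_setdate_arg YEAR MONTH DAY HOUR MINUTE [SECOND]
xscf_setdate_arg() {
    # awk is used so that zero-padded input such as 08 is read as decimal 8
    awk -v y="$1" -v mo="$2" -v d="$3" -v h="$4" -v mi="$5" -v s="${6:-0}" \
        'BEGIN { printf "%02d%02d%02d%02d%04d.%02d\n", mo, d, h, mi, y, s }'
}
```

Running `xscf_setdate_arg 2012 10 24 10 23` prints 102410232012.00, the same value used in the setdate example above.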

– If you want to change the time zone, run the settimezone command. Example:

XSCF> settimezone -c settz -s Asia/Jakarta

– Done. Now power on the domain again.

ERROR: Last Trap: Instruction Access Exception

{0} ok boot
Boot device: /pci@7c0/pci@0/pci@1/pci@0,2/LSILogic,sas@2/disk@0,0:a File and args:
Loading ufs-file-system package 1.4 04 Aug 1995 13:02:54.
FCode UFS Reader 1.12 00/07/17 15:48:16.
Loading: /platform/SUNW,Sun-Fire-T200/ufsboot
Loading: /platform/sun4v/ufsboot
ERROR: Last Trap: Instruction Access Exception

If you get the error messages above when powering on a Sun T-series server (T1000/T2000) and the boot process hangs there, try the simple troubleshooting steps below before calling Oracle Support or opening an SR via MOS:

Unplug all USB devices (USB keyboard and mouse, KVM, etc.), connect your laptop/PC to the server via the serial port, and reboot the server. If the error messages disappear, the server should be able to boot as usual.

This issue is mostly related to a USB keyboard/mouse or other USB devices. The cause could be the USB device or the server's USB port itself. Try plugging the USB device into another port, then reboot the server again. The T2000 has 4 USB ports on the back and 2 USB ports on the front.

How to reset RSC password

If you forgot the RSC password for a V480, V880, V490, V890, or other legacy Sun machine, here is the procedure to reset it. Requirement: the SUNWrsc package.

If you don't have the package, download the latest package from My Oracle Support:

  1. login to support.oracle.com
  2. click on “Patches & Updates” in the top menu
  3. in the search window (located on the right) click “Product or Family (Advanced)”
  4. in the updated search window type “Sun Remote” in the “Product” box, then select “Sun Remote System Control”
  5. Click the “Release” box (which says “Select up to 10”); in that box, click “Sun Remote System Control” and then select the version “Sun Remote System Control 2.2.3”.
  6. In the new window you can now download RSC 2.2.3 (called p10264451_223.zip) by marking it and clicking “download”.

– Reset the RSC password:

Log in with root privileges, install the package, then run the rscadm command.

Syntax: # /usr/platform/<platform>/rsc/rscadm userpassword <username>

– Example for a V890:
# /usr/platform/SUNW,Sun-Fire-V890/rsc/rscadm userpassword admin
Password:
Re-enter Password:

You can also reset the whole configuration by running the “rsc-config” command.

Update:

If rscadm is not available, download the RSC software from MOS:

RSC Software Download (steps to download the latest RSC software):
1. Login to MOS and select “Patches and Updates Tab”
2. In “Patch Search” on the Top right panel, Click on “Product or Family (Advanced Search)”
3. In the “Product Is” pull-down select “Sun Remote System Control”
4. In the next pull down “Release is” select the RSC version (2.2.2 or 2.2.3).
5. Select OS and click “Search” (will get a list with RSC releases & patches)
6. Select the desired RSC Release (packages) or patch
7. Click Download on the Right

The packages for Solaris 8 and 9 (and later) are both in the zip file. The zip file comes in two variants, 32-bit and 64-bit, but both have the same checksum, so there is no difference: p10264452_223_SOLARIS64.zip (p10264451_223_SOLARIS.zip)

Install the software as you would any package, with pkgadd.

The command syntax is the same:
# /usr/platform/`uname -i`/rsc/rscadm userpassword admin
– To reconfigure the card, run:
# /usr/platform/`uname -i`/rsc/rsc-config

– If you installed the software before and believe the card is configured, check the setup:
# /usr/platform/`uname -i`/rsc/rscadm show
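
Since the rscadm location depends on the platform name, a tiny sketch like the one below can save some typing. The function name rscadm_path is made up for illustration. It only builds the path following the pattern shown above and does not check that the file exists.

```shell
# rscadm_path (hypothetical helper): print the expected rscadm location.
# With no argument it falls back to the platform name from `uname -i`.
# usage: rscadm_path [PLATFORM]      e.g. rscadm_path SUNW,Sun-Fire-V890
rscadm_path() {
    platform="${1:-$(uname -i)}"
    printf '/usr/platform/%s/rsc/rscadm\n' "$platform"
}
```

On the server itself you could then run `"$(rscadm_path)" show` to check the card setup, which is equivalent to the command above.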

How to clear fmadm log or FMA faults log

Here is the step-by-step procedure for clearing FMA faults on most Oracle/Sun servers. It works on Solaris 10:

Clear the fmadm log. Example:
———————————-
For each fault listed by ‘fmadm faulty’, run:
# fmadm repair <uuid>   (or the component’s FMRI, if components are listed instead), e.g.:
# fmadm repair 568a9180-7308-4535-92e6-a7c17ef1bfef

– Clear ereports and resource cache:
# cd /var/fm/fmd
# rm e* f* c*/eft/* r*/*

– Clear out the FMA files with no reboot needed:
# svcadm disable -s svc:/system/fmd:default
# cd /var/fm/fmd
# find /var/fm/fmd -type f -exec ls {} \;
# find /var/fm/fmd -type f -exec rm {} \;
# svcadm enable svc:/system/fmd:default

– Reset the fmd serd modules:
# fmadm reset cpumem-diagnosis
# fmadm reset cpumem-retire
# fmadm reset eft
# fmadm reset io-retire
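
When there are many faults, typing one fmadm repair per UUID gets tedious. The sketch below pulls UUIDs out of text on stdin and prints the matching repair commands so you can review them before running anything. The function names are hypothetical, and it assumes each UUID appears as a whitespace-separated token in the ‘fmadm faulty’ output, which may not hold on every release.

```shell
# extract_fault_uuids (hypothetical helper): print each distinct UUID
# found as a whitespace-separated token on stdin.
extract_fault_uuids() {
    tr -s ' \t' '\n\n' |
        grep '^[0-9a-f]\{8\}-[0-9a-f]\{4\}-[0-9a-f]\{4\}-[0-9a-f]\{4\}-[0-9a-f]\{12\}$' |
        sort -u
}

# print_repair_commands (hypothetical helper): turn those UUIDs into
# "fmadm repair <uuid>" lines. It only prints them and runs nothing.
print_repair_commands() {
    extract_fault_uuids | while read -r uuid; do
        printf 'fmadm repair %s\n' "$uuid"
    done
}
```

A possible use is `fmadm faulty | print_repair_commands`, then pasting the lines you agree with. Printing instead of executing is a deliberate choice here, since a repair marks the fault as fixed.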