Solaris Fault Manager

Today, I got an error like this:

SUNW-MSG-ID: FMD-8000-2K, TYPE: Defect, VER: 1, SEVERITY: Minor
EVENT-TIME: Tue Aug  4 14:52:43 WIT 2009
PLATFORM: SUNW,SPARC-Enterprise, CSN: BEF09142B8, HOSTNAME: server
SOURCE: fmd-self-diagnosis, REV: 1.0
EVENT-ID: d232928b-dd11-e149-b515-82ff5df189f8
DESC: A Solaris Fault Manager component has experienced an error that required the module to be disabled.  Refer to http://sun.com/msg/FMD-8000-2K for more information.
AUTO-RESPONSE: The module has been disabled.  Events destined for the module will be saved for manual diagnosis.
IMPACT: Automated diagnosis and response for subsequent events associated with this module will not occur.
REC-ACTION: Use fmdump -v -u <EVENT-ID> to locate the module.  Use fmadm reset <module> to reset the module.
root@server #

and this:

Aug  4 14:37:13 server ufs: NOTICE: alloc: /var: file system full

It seems the “/var” file system is full. OK, let's check the disk space.

root@server # df -h
Filesystem             size   used  avail capacity  Mounted on
/dev/md/dsk/d10         20G   7.0G    12G    37%    /
/devices                 0K     0K     0K     0%    /devices
ctfs                     0K     0K     0K     0%    /system/contract
proc                     0K     0K     0K     0%    /proc
mnttab                   0K     0K     0K     0%    /etc/mnttab
swap                    88G   1.6M    88G     1%    /etc/svc/volatile
objfs                    0K     0K     0K     0%    /system/object
sharefs                  0K     0K     0K     0%    /etc/dfs/sharetab
fd                       0K     0K     0K     0%    /dev/fd
/dev/md/dsk/d30         20G    20G     0K   100%    /var
swap                    88G    32K    88G     1%    /tmp
swap                    88G    72K    88G     1%    /var/run
/dev/md/dsk/d40         12G    12M    12G     1%    /oracle
/dev/md/dsk/d50         36G    37M    36G     1%    /internaldisk1

That's correct: “/var” is at 100%.
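To avoid reading the whole listing by eye, the same check can be sketched as a small filter. This is only a sketch: the function name full_filesystems and the 90% default are my own choices, and it assumes a POSIX shell and a POSIX awk (on Solaris 10 that means nawk or /usr/xpg4/bin/awk rather than the old /usr/bin/awk).

```shell
# Print mount points whose usage is at or above a threshold (default 90%).
# Reads df-style output on stdin: the capacity column is field 5 and the
# mount point is field 6, as in the listing above.
# Assumption: a POSIX awk that understands -v.
full_filesystems() {
    threshold=${1:-90}
    awk -v limit="$threshold" '
        NR > 1 {                        # skip the header line
            pct = $5
            sub(/%/, "", pct)           # "100%" -> "100"
            if (pct + 0 >= limit + 0)   # force a numeric comparison
                print $6, $5
        }'
}
```

Called as “df -k | full_filesystems 90”, it should print only the file systems that need attention, which in the listing above would be /var.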

Now, let's check what's going on in the “/var” directory.

root@server # cd /var/adm/
root@server # ls -ltr
total 4242
-rw-rw-rw-   1 root     bin            0 Aug 25  2008 spellhist
-rw-------   1 uucp     bin            0 Aug 25  2008 aculog
drwxr-xr-x   2 adm      adm          512 Jun 25 18:15 exacct
drwxr-xr-x   2 adm      adm          512 Jun 25 18:15 log
drwxr-xr-x   2 root     sys          512 Jun 25 18:15 streams
drwxr-xr-x   2 root     sys          512 Jun 25 18:19 pool
drwxrwxr-x   5 adm      adm          512 Jun 25 18:26 acct
drwxrwxr-x   2 adm      sys          512 Jun 25 18:26 sa
drwxr-xr-x   2 root     sys          512 Jun 25 18:57 sm.bin
-rw-r--r--   1 root     root           0 Jun 25 19:25 vold.log
-rw-r--r--   1 root     root       34472 Jun 25 19:27 messages.3
-rw-r--r--   1 root     root       73131 Jul  2 13:40 messages.2
-rw-r--r--   1 root     root         249 Jul 24 18:35 messages.1
-rw-r--r--   1 root     root     1906434 Jul 31 20:15 messages.0
-r--r--r--   1 root     root          28 Aug  4 15:18 lastlog
-rw-r--r--   1 root     bin         2232 Aug  4 15:18 utmpx
-rw-r--r--   1 adm      adm        51708 Aug  4 15:18 wtmpx
-rw-r--r--   1 root     root       81457 Aug  4 15:19 messages

The wtmpx, utmpx, and messages files are all a normal size.

Check crash dump and core files:

root@server # dumpadm
      Dump content: kernel pages
       Dump device: /dev/md/dsk/d30 (dedicated)
Savecore directory: /var/crash/server
  Savecore enabled: yes
root@server #
root@server # cd /var/crash/
root@server # ls
server
root@server # ls -ltr
total 2
drwx------   2 root     root         512 Jun 25 19:25 server
root@server # cd server/
root@server # ls
root@server # ls -ltr
total 0

OK, there are no crash dumps or core files in “/var/crash”.

Finally, after checking each directory under “/var” one by one, I found that “/var/fm” was using 18 GB.

root@server # du -sh /var/fm/fmd
18G   fmd
root@server # pwd
/var/fm/fmd
root@server # ls
ckpt           core.fmd.1506  core.fmd.1726  core.fmd.1948  core.fmd.2168
core.fmd.1284  core.fmd.1508  core.fmd.1730  core.fmd.1950  core.fmd.2170
core.fmd.1287  core.fmd.1510  core.fmd.1732  core.fmd.1952  core.fmd.2172
core.fmd.1290  core.fmd.1512  core.fmd.1734  core.fmd.1954  core.fmd.2174
core.fmd.1293  core.fmd.1514  core.fmd.1736  core.fmd.1956  core.fmd.2176
core.fmd.1295  core.fmd.1516  core.fmd.1738  core.fmd.1958  core.fmd.2178
core.fmd.1297  core.fmd.1518  core.fmd.1740  core.fmd.1960  core.fmd.2180
core.fmd.1299  core.fmd.1520  core.fmd.1742  core.fmd.1962  core.fmd.2182
core.fmd.1302  core.fmd.1522  core.fmd.1744  core.fmd.1964  core.fmd.2184
core.fmd.1304  core.fmd.1526  core.fmd.1746  core.fmd.1966  core.fmd.2186
core.fmd.1307  core.fmd.1528  core.fmd.1748  core.fmd.1968  core.fmd.2188
core.fmd.1310  core.fmd.1530  core.fmd.1750  core.fmd.1970  core.fmd.2190
core.fmd.1312  core.fmd.1532  core.fmd.1752  core.fmd.1972  core.fmd.2192
core.fmd.1314  core.fmd.1534  core.fmd.1754  core.fmd.1974  core.fmd.2194
core.fmd.1316  core.fmd.1536  core.fmd.1756  core.fmd.1976  core.fmd.2196

etc…

I found a lot of “core.fmd.xxx” files in “/var/fm/fmd”.
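Checking the directories one by one works, but the search can be shortened. Here is a minimal sketch, assuming POSIX du, sort, and tail; the function name largest_entries and its defaults are made up for illustration.

```shell
# List the N largest entries directly under a directory, biggest last.
# du -sk prints one summarised size in kilobytes per entry, and sort -n
# orders them numerically, so the worst offender ends up on the final line.
largest_entries() {
    dir=${1:-/var}
    count=${2:-5}
    du -sk "$dir"/* 2>/dev/null | sort -n | tail -n "$count"
}
```

Running “largest_entries /var 5” and then repeating it on whichever entry comes out last should lead to the same place as the manual search.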

As it turns out, this error is related to Solaris Fault Management and is generated by the fmd service on Solaris.

What is the fmd service?

    fmd is a daemon that runs in the background on each Solaris system. fmd receives telemetry information relating to problems detected by the system software, diagnoses these problems, and initiates proactive self-healing activities such as disabling faulty components. When appropriate, the fault manager also sends a message to the syslogd(1M) service to notify an administrator that a problem has been detected. The message directs administrators to a knowledge article on Sun’s web site, http://www.sun.com/msg/, which explains more about the problem impact and appropriate responses.

    Each problem diagnosed by the fault manager is assigned a Universal Unique Identifier (UUID). The UUID uniquely identifies this particular problem across any set of systems. The fmdump(1M) utility can be used to view the list of problems diagnosed by the fault manager, along with their UUIDs and knowledge article message identifiers. The fmadm(1M) utility can be used to view the resources on the system believed to be faulty. The fmstat(1M) utility can be used to report statistics kept by the fault manager. The fault manager is started automatically when Solaris boots, so it is not necessary to use the fmd command directly. Sun’s web site explains more about what capabilities are currently available for the fault manager on Solaris.

The Solaris Fault Management Facility is designed to be integrated into the Service Management Facility to provide a self-healing capability to Solaris 10 systems. The fmd daemon is responsible for monitoring several aspects of system health.

My “/var/fm/fmd” filled up because today I was working on IPMP configuration and testing, and there was a problem with my network connection to the Cisco switch. That is why Solaris created so many “core.fmd” files.

You can run “fmdump -v -u <EVENT-ID>” to locate the module, then use “fmadm reset <module>” to reset it.

Example:

# fmdump -v -u 815bf413-9de6-4667-e118-93dc3bc33e71
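Copying the UUID by hand is easy to get wrong, so here is a rough sketch of pulling it out of the console message instead. The helper names event_id and locate_module are hypothetical, and the only fmdump options used are the -v and -u quoted in the REC-ACTION text above. The guard around fmdump is just there so the snippet does nothing harmful on a machine without the fault manager tools.

```shell
# Pull the UUID out of the "EVENT-ID:" line of an fmd console message
# read from stdin. With sed -n, only lines where the substitution
# succeeds are printed.
event_id() {
    sed -n 's/^EVENT-ID: *\([0-9a-f-]*\).*$/\1/p'
}

# Look up the module for that event, but only where fmdump exists.
locate_module() {
    uuid=$1
    if command -v fmdump >/dev/null 2>&1; then
        fmdump -v -u "$uuid"
    else
        echo "fmdump not found; would run: fmdump -v -u $uuid"
    fi
}
```

Assuming the message has been saved to a file (say fmd-msg.txt, a name I made up), “locate_module "$(event_id < fmd-msg.txt)"” runs the lookup, and the module it reports is what goes into “fmadm reset <module>”.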

As a temporary solution, I disabled the “fmd” service and removed all the core.fmd files from the “/var/fm/fmd” directory.

root@server # svcs fmd
STATE          STIME    FMRI
online         Jul_29   svc:/system/fmd:default
root@server # svcadm disable fmd
root@server # svcs fmd
STATE          STIME    FMRI
disabled       16:48:03 svc:/system/fmd:default
root@server # rm core.fmd.*

I'll re-enable the fmd service once I've finished my IPMP configuration.
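A plain “rm core.fmd.*” depends on being in the right directory, so a slightly more careful version of the cleanup might look like the sketch below. The function name fmd_cores and its “delete” argument are my own invention, and it assumes a POSIX shell such as ksh or bash rather than the old Bourne /bin/sh.

```shell
# Count, and optionally delete, fmd core files in a directory.
# Usage: fmd_cores DIR [delete]
# Without the second argument nothing is removed (dry run).
fmd_cores() {
    dir=${1:-/var/fm/fmd}
    action=${2:-list}
    count=0
    for f in "$dir"/core.fmd.*; do
        [ -f "$f" ] || continue     # skip an unmatched pattern and non-files
        count=$((count + 1))
        if [ "$action" = delete ]; then
            rm -f "$f"
        fi
    done
    echo "$count"                   # number of core files found (or removed)
}
```

Running “fmd_cores /var/fm/fmd” first shows how many files would go, and “fmd_cores /var/fm/fmd delete” removes them while leaving the ckpt directory alone. Once the IPMP work is done, “svcadm enable fmd” followed by “svcs fmd” should bring the service back online.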

Source:
Fault Manager Daemon man pages
http://www.princeton.edu/~unix/Solaris/troubleshoot/fm.html
