Solaris 10 OS Feature Spotlight: Predictive Self-Healing

Posted in Solaris Administration on June 17, 2012.

Traditionally, when a hardware or software fault occurred on a Solaris system, a message was logged to the destination specified in /etc/syslog.conf, and the rest of the diagnosis and repair was left to the administrator. The Solaris 10 OS, which is available for preview through the Software Express for Solaris program, introduces Predictive Self-Healing technology. Predictive Self-Healing is a newly designed, cohesive architecture and methodology for automatically diagnosing, reporting, and handling software and hardware fault conditions. This new technology reduces the time required to debug a hardware or software problem and provides the administrator and Sun Technical Support with detailed data about each fault. The architecture consists of an event management protocol, the fault manager, and the software fault-handling component, the Solaris Service Manager.

The Solaris Fault Manager

When a hardware fault occurs, predictive self-healing augments traditional syslog messages by issuing binary telemetry events that are then correlated by underlying software. The underlying software then automatically diagnoses the fault, notifies the administrator, and takes corrective action when possible. Sun’s fault manager also provides a fault code and directs the administrator to the corresponding knowledge base article at www.sun.com when appropriate. The first implementation of Sun’s fault manager covers various SPARC CPU, memory, and I/O bus nexus components. A later release is scheduled to include modules for the Solaris OS on x86 platforms.

The system administrator’s primary interaction with the Sun fault manager happens through the fault manager daemon, fmd(1M). The daemon starts at boot time, forks into the background (see the fmd(1M) man page for complete details), and continues to monitor the running system. When a component produces an error, the fault management system handles the error and then correlates the error report data with previous error reports and other related information in order to diagnose and react to the underlying fault. Once the problem is diagnosed, the fault manager assigns it a Universally Unique Identifier (UUID), which distinguishes the problem across any set of systems. When possible, fmd(1M) then initiates steps to self-heal the failed component. The fmd(1M) program also logs the fault to syslogd and/or notifies the administrator when appropriate.

The Fault Managed Resource Identifier

The Fault Managed Resource Identifier (FMRI) identifies a resource within the fault manager for the purpose of fault and error event propagation. The fault manager naming scheme, of which the FMRI is a URI subclass, is based on the URI syntax defined in RFC 2396. The FMRI syntax has an arbitrary number of different schemes, each naming a tree of related resources. FMRIs can be represented as URI strings or component name-value pairs. For example, the FMRI for DIMM 0 in memory bank 1 of memory module 2 on system board 3 of domain A of a Sun Fire 15K server could be represented as the following URI string. (Note: Line should not be broken in actual use.)

hc://chassis-id=138A2036,product-id=SunFire15000,domain-id=A/system-board=3/cpu-module=2/memory-bank=1/dimm=0
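To make the two representations concrete, here is a minimal sketch that splits an hc-scheme URI string like the one above into its name-value pairs. The function name parse_hc_fmri and the returned dictionary layout are illustrative assumptions, not part of the fault manager itself.

```python
def parse_hc_fmri(fmri):
    """Split an hc-scheme FMRI string into scheme, authority, and path parts."""
    scheme, sep, rest = fmri.partition("://")
    if not sep:
        raise ValueError("not a URI-style FMRI: %r" % fmri)
    authority_text, _, path_text = rest.partition("/")
    # The authority is a comma-separated list of name=value pairs.
    authority = dict(
        pair.split("=", 1) for pair in authority_text.split(",") if pair
    )
    # The path is an ordered list of name=value pairs, outermost resource first.
    path = [tuple(part.split("=", 1)) for part in path_text.split("/") if part]
    return {"scheme": scheme, "authority": authority, "path": path}


example = (
    "hc://chassis-id=138A2036,product-id=SunFire15000,domain-id=A"
    "/system-board=3/cpu-module=2/memory-bank=1/dimm=0"
)
parsed = parse_hc_fmri(example)
print(parsed["scheme"])                   # the naming scheme
print(parsed["authority"]["domain-id"])   # one authority pair
print(parsed["path"][-1])                 # the innermost resource
```

The path is kept as an ordered list rather than a dictionary because the position of each pair encodes where the resource sits in the tree.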

The fault manager associates one of the following states with every FMRI:

  • ok: The resource is present and in use and has no known problems.
  • unknown: The resource is not present or not usable but has no known problems. This might indicate the resource has been disabled or unconfigured by an administrator.
  • degraded: The resource is present and usable, but one or more problems have been diagnosed.
  • faulted: The resource is present but is not usable because one or more unrecoverable problems have been diagnosed. The resource has been disabled to prevent further damage to the system and requires human intervention.

Fault Manager Command-Line Tools

The Solaris implementation of the fault manager includes several command-line tools to observe and modify the behavior of fmd(1M) and its modules. The most common tools that the administrator will use are the fmadm(1M), fmdump(1M), and fmstat(1M) tools.

The fmadm(1M) utility can view, load, and unload modules and view and update the resource cache. It provides system administrators with a way to display every resource that fmd(1M) believes to be faulty. The most common fmadm(1M) subcommands (see the fmadm(1M) man page for complete details) are:

  • config: Display the configuration, including the module name, version, and description of each component module.
  • faulty [-ai]: Display the list of resources currently believed to be faulted. The FMRI, resource state, and UUID of the diagnosis are listed for each resource. By default, the fmadm faulty command only lists output for resources that are currently present and faulty. If the -a option is specified, all resource information cached by the fault manager is listed, including information for components no longer present in the system. If the -i option is specified, the persistent cache identifier for each resource in the fault manager is shown instead of the most recent state and UUID.
  • load path: Load the specified module. The specified path must be an absolute path and refer to a module present in one of the defined directories for modules.
  • unload module: Unload the specified module. The module name is the one shown in the fmadm config output. The fault manager usually loads and unloads modules automatically based on the system configuration, so this command should seldom be needed.
  • rotate errlog | fltlog: Schedule a rotation of the specified fault manager log file. The log files are automatically rotated by an entry in the logadm(1M) configuration file that uses this subcommand.

The fmdump(1M) program enables the system administrator to view any log files associated with fmd(1M) and retrieve specific details of any diagnosis. By default the fmdump(1M) command shows the fault log, but will show the error log if given the -e command-line switch. The fmdump(1M) command can also take command line options to select only certain events (see the fmdump(1M) man page for complete details):

  • -c class: Select events that match the specified class.
  • -t time: Select events that occurred on or after the specified time.
  • -T time: Select events that occurred on or before the specified time.
  • -u uuid: Select events that match the specified UUID.

Increasingly verbose output can be obtained for any command by specifying -v or -V.
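As a rough illustration of how these selection options combine, the sketch below assembles an fmdump argument list from optional filters. The helper name build_fmdump_args and its parameters are hypothetical. It uses only the switches described in this article, with -u for the UUID as in the fmdump -v -u example shown later.

```python
def build_fmdump_args(error_log=False, event_class=None, since=None,
                      until=None, uuid=None, verbosity=0):
    """Return an argv list for fmdump built from the options described above."""
    args = ["fmdump"]
    if error_log:
        args.append("-e")      # error log instead of the default fault log
    if verbosity == 1:
        args.append("-v")
    elif verbosity >= 2:
        args.append("-V")
    if event_class is not None:
        args += ["-c", event_class]
    if since is not None:
        args += ["-t", since]  # events on or after this time
    if until is not None:
        args += ["-T", until]  # events on or before this time
    if uuid is not None:
        args += ["-u", uuid]
    return args


# Verbose details for a single diagnosis, selected by its UUID.
print(build_fmdump_args(uuid="64fe6c23-12b7-ccd1-f0a7-b531941738f8", verbosity=1))
```

The resulting list can be handed to subprocess.run() on a Solaris host. The sketch itself runs nothing, so it is safe to try anywhere.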

The fmstat(1M) program is designed to report the statistics of the fault management system. If the -m module argument is given, fmstat(1M) reports statistics kept by the specified module. If -m is not specified, fmstat(1M) reports the following statistics for each module (see the fmstat(1M) man page for complete details):

  • module: The name of the module as reported by fmadm config
  • ev_recv: The number of events received by the module
  • ev_acpt: The number of events accepted by the module as relevant to a diagnosis
  • wait: The average number of events waiting to be examined by the module
  • svc_t: The average service time, in milliseconds, for events received by the module
  • %w: The percentage of time that there were events waiting to be processed
  • %b: The percentage of time that the module was busy processing events
  • open: The number of active cases owned by the module
  • solve: The number of cases solved by the module since it was loaded
  • memsz: The amount of dynamic memory currently allocated by the module
  • bufsz: The amount of persistent buffer space currently allocated by the module

An Example of the Predictive Self-Healing Fault Manager

Once a CPU fault has occurred, the administrator might see this message on the console and logged to syslog:

SUNW-MSG-ID: SUN4U-8000-6H, TYPE: Fault, VER: 1, SEVERITY: Major
EVENT-TIME: Sun Oct 17 14:15:50 PDT 2004
PLATFORM: SUNW,Sun-Blade-1000, CSN: -, HOSTNAME: myhost
EVENT-ID: 64fe6c23-12b7-ccd1-f0a7-b531941738f8
DESC: The number of errors associated with this CPU has exceeded acceptable levels. Refer to http://sun.com/msg/SUN4U-8000-6H for more information.
AUTO-RESPONSE: An attempt will be made to remove the affected CPU from service.
IMPACT: Performance of this system may be affected.
REC-ACTION: Schedule a repair procedure to replace the affected CPU. Use fmdump -v -u <EVENT-ID> to identify the CPU.
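Because the message uses fixed field labels, it is easy to pull apart in a script, for example to extract the EVENT-ID for a follow-up fmdump call. The sketch below is one possible approach. The label list is taken from the message above, while parse_fault_message is a hypothetical helper name and the sample is an abbreviated copy of that message.

```python
import re

# Field labels as they appear in the console message above.
FIELD_NAMES = [
    "SUNW-MSG-ID", "TYPE", "VER", "SEVERITY", "EVENT-TIME", "PLATFORM",
    "CSN", "HOSTNAME", "EVENT-ID", "DESC", "AUTO-RESPONSE", "IMPACT",
    "REC-ACTION",
]
FIELD_RE = re.compile(
    r"\b(" + "|".join(re.escape(name) for name in FIELD_NAMES) + r"):\s*"
)


def parse_fault_message(text):
    """Return a dict mapping each field label to its value."""
    pieces = FIELD_RE.split(text)
    # pieces is [text before first label, label, value, label, value, ...]
    fields = {}
    for label, value in zip(pieces[1::2], pieces[2::2]):
        # Collapse line breaks and drop the comma that separates header fields.
        fields[label] = " ".join(value.split()).rstrip(",")
    return fields


message = """\
SUNW-MSG-ID: SUN4U-8000-6H, TYPE: Fault, VER: 1, SEVERITY: Major
EVENT-TIME: Sun Oct 17 14:15:50 PDT 2004
PLATFORM: SUNW,Sun-Blade-1000, CSN: -, HOSTNAME: myhost
EVENT-ID: 64fe6c23-12b7-ccd1-f0a7-b531941738f8
IMPACT: Performance of this system may be affected.
"""
fields = parse_fault_message(message)
print(fields["SUNW-MSG-ID"], fields["SEVERITY"])
print(["fmdump", "-v", "-u", fields["EVENT-ID"]])
```

Splitting on the known labels rather than on line breaks means the same function works whether the message arrives as several lines or as one long syslog line.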

The CPU state changes from ok to faulted, the processes using that CPU are terminated, and the CPU is taken offline. The state of the CPU can be viewed by using the psrinfo(1M) command:

psrinfo
0       on-line   since 09/27/2004 16:57:30
1       faulted   since 10/17/2004 14:15:50
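A monitoring script might want to spot faulted processors in this output automatically. The following sketch groups processor IDs by the state word that follows them. The helper name cpus_by_state is hypothetical, and the sample text simply repeats the two processor lines shown above.

```python
def cpus_by_state(psrinfo_output):
    """Group processor IDs by the state word that follows them."""
    states = {}
    for line in psrinfo_output.splitlines():
        parts = line.split()
        # A status line starts with a numeric processor ID, then its state.
        if len(parts) >= 2 and parts[0].isdigit():
            states.setdefault(parts[1], []).append(int(parts[0]))
    return states


sample = """\
0       on-line   since 09/27/2004 16:57:30
1       faulted   since 10/17/2004 14:15:50
"""
print(cpus_by_state(sample).get("faulted", []))
```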

Run the fmdump(1M) command listed in the fault message, supplying the EVENT-ID, for more information on the fault. The output shows that CPU 1 has a problem and the component in Slot 1 needs replacing. The text Slot 1, indicating the location of the defective part, can be found silk-screened on the motherboard.

fmdump -v -u 64fe6c23-12b7-ccd1-f0a7-b531941738f8
TIME                 UUID                                 SUNW-MSG-ID
Oct 17 14:15:50.1630 64fe6c23-12b7-ccd1-f0a7-b531941738f8 SUN4U-8000-6H
  100%  fault.cpu.ultraSPARC-III.l2cachedata
        FRU: hc:///component=Slot 1
       rsrc: cpu:///cpuid=1/serial=1107C270C8A
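The fault class, the field-replaceable unit (FRU), and the affected resource can be picked out of this output with a few regular expressions. This is only a sketch based on the single sample above. The helper name parse_fmdump_suspect is hypothetical, and other fault types may format these lines differently.

```python
import re


def parse_fmdump_suspect(text):
    """Extract certainty, fault class, FRU, and resource from fmdump -v output."""
    result = {}
    match = re.search(r"(\d+)%\s+(fault\.\S+)", text)
    if match:
        result["certainty"] = int(match.group(1))
        result["class"] = match.group(2)
    # The FRU value may contain a space ("Slot 1"), so read up to the
    # end of the line or up to the following rsrc: label.
    match = re.search(r"FRU:\s*(.*?)\s*(?:rsrc:|$)", text, re.MULTILINE)
    if match:
        result["fru"] = match.group(1)
    match = re.search(r"rsrc:\s*(\S+)", text)
    if match:
        result["rsrc"] = match.group(1)
    return result


sample = """\
Oct 17 14:15:50.1630 64fe6c23-12b7-ccd1-f0a7-b531941738f8 SUN4U-8000-6H
  100%  fault.cpu.ultraSPARC-III.l2cachedata
        FRU: hc:///component=Slot 1
       rsrc: cpu:///cpuid=1/serial=1107C270C8A
"""
suspect = parse_fmdump_suspect(sample)
print(suspect["fru"], suspect["rsrc"])
```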

Once a replacement CPU is delivered, the defective CPU in Slot 1 can be replaced and re-enabled.

The Solaris Service Manager

To better handle software faults, Sun has redesigned the way it starts and monitors services. Instead of the traditional /etc/init.d startup scripts, many programs in the Solaris 10 OS have been converted to use the Service Management Facility (SMF) of the Solaris Service Manager to start, stop, modify, and monitor programs. The service manager is also used to identify software interdependencies and ensure that services are started in the correct order. Should a service, such as sendmail, suddenly die, the service manager automatically verifies that all of the requirements for the sendmail service are running and respawns the necessary programs. When a hardware fault occurs and hardware is taken offline, the service manager can restart any programs under service manager control that needed to be stopped to remove the hardware from service.

Each service under the control of the service manager is controlled by an XML configuration file, called a manifest, that defines the name of the service, the type, any dependencies, and other important information. These manifests are stored in a repository and can be viewed and modified by the repository daemon, svc.configd(1M). The repository is read by the master restarter daemon, svc.startd(1M), which evaluates the dependencies and initiates the services as needed. Traditional inetd services are now part of the service manager as well. Any of the inetd services can be enabled, disabled, or restarted via the same mechanism as any other service manager-enabled program.

Service Manager Command-Line Tools

The service manager is made up of a number of programs, some of which are meant to be used by the administrator to view and manage services and service properties. These commands include: svcadm(1M), svcprop(1), svcs(1), and svccfg(1M). Additionally, the commands inetconv(1M) and inetadm(1M) exist to help transition traditional inetd services and manage them in the service manager framework.

The svcadm(1M) command allows the activation, deactivation, and state manipulation of service instances in the service configuration repository. Modification of these properties causes the responsible delegated restarter to take action to move the service instance into the appropriate state. If the service is not delegated, the master restarter performs these functions. The -v switch prints verbose information to standard output. Valid svcadm(1M) subcommands are:

  • disable [-t] [FMRI | pattern]: Disables the service instance specified by the operands. If the -t option is specified, the change is temporary: the instance reverts to its previous enabled setting upon reboot.
  • enable [-rt] [FMRI | pattern]: Enables the service instance specified by the operands. If the -r option is specified, svcadm enables the instance and recursively enables its dependencies. If the -t option is specified, the change is temporary: the instance reverts to its previous enabled setting upon reboot.
  • refresh [FMRI | pattern]: Refreshes the service instance specified by the operands. The instance should re-read its configuration.
  • restart [FMRI | pattern]: Restarts the service instance specified by the operands.
  • delegate restarter_FMRI [FMRI | pattern]: Changes the restarter assignment for the given instances to the restarter specified by restarter_FMRI property. If you re-delegate the instance to svc.startd(1M), this is equivalent. Re-delegation requires a restart operation to take effect. Not all restarters support the same underlying application model. Therefore, not all potential delegations result in a functioning service instance.
  • mark [-It] instance_state [FMRI | pattern]: Moves the service instance specified by the operands to the specified instance_state, either degraded or maintenance. A service must be in the online state to be placed in the degraded state. If the -I option is specified, the service instance is moved into the specified state immediately. If the -t option is specified, the state change is temporary and persists only for the lifetime of the current system instance. The temporary option is not available for the degraded state.
  • milestone [-d] milestone_FMRI: Moves the system to the specified milestone. All services that the given milestone does not depend on (directly or indirectly) are temporarily disabled. If the -d option is specified, the given milestone becomes the default final milestone, and persists across reboots.
  • clear [FMRI | pattern]: For a service in the maintenance state, bring the service instance specified by each operand to the uninitialized state, such that it can be brought back online. For a service placed in the degraded state by the mark subcommand, bring the service back to the online state.
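The state rules given for the mark and clear subcommands can be summarized as a small transition table. The sketch below is only a toy model of those two descriptions, not how the restarter is implemented. The function names are hypothetical, and leaving other states unchanged in clear is an assumption made for simplicity.

```python
def mark(current_state, target_state):
    """Return the new state after 'svcadm mark', following the rules above."""
    if target_state not in ("degraded", "maintenance"):
        raise ValueError("mark accepts only degraded or maintenance")
    if target_state == "degraded" and current_state != "online":
        raise ValueError("only an online instance can be marked degraded")
    return target_state


def clear(current_state):
    """Return the new state after 'svcadm clear', following the rules above."""
    if current_state == "maintenance":
        return "uninitialized"   # from here the instance can come back online
    if current_state == "degraded":
        return "online"
    return current_state         # assumption: other states are left unchanged


state = mark("online", "degraded")
print(state)
print(clear(state))
```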

The svcprop(1) program prints values of properties in the service configuration repository. Properties are selected by -p options and FMRI operands. By default, when a single property is selected, its values are printed separated by spaces on a single line. The following options are supported:

  • -c: Retrieves the current property values, without composition.
  • -f: Designates properties by their FMRIs. Implies option -t.
  • -p [name/]name: Prints values of the named property or property group for each of the property groups, instances, or services specified by the operands.
  • -q: Quiet. Produces no output.
  • -s snapshot: Uses the named snapshot to retrieve the specified property or property group. If the given property group is not present in the snapshot, the current property values are examined.
  • -t: Uses the multi-property output format.
  • -v: Verbose. Prints error messages for nonexistent properties, even if option -q is also used.
  • -w: Waits for the selected property group or property to change before printing anything.

The svcs(1) command displays information about service instances as recorded in the service configuration repository. The svcs(1) command has three different forms:

 
svcs [-aHpv?] [-o col[,col]...] [-R instance_FMRI]... 
              [-sS col]... [FMRI | pattern] ...

svcs {-d | -D}  [-Hpv?] [-o col[,col]...]  [-sS col]... 
                [FMRI | pattern] ...

svcs -l [FMRI | pattern] ...

The first form prints one-line status listings for service instances specified by the arguments. Each instance is listed only once. With no arguments, all enabled service instances, even if temporarily disabled, are listed. The second form of the command prints one-line status listings for the dependencies or dependents of the service instances specified by the arguments. The third form prints detailed information about specific services and instances. The options shown in the three forms above are:

  • -?: Displays an extended usage message, including column specifiers.
  • -a: Also selects disabled service instances.
  • -d: Lists the services or service instances upon which the given service instances depend.
  • -D: Lists the service instances that depend on the given services or service instances.
  • -H: Omits the column headers.
  • -l: Displays all available information about the selected services and service instances, with one service attribute displayed for each line. Information for different instances is separated by blank lines.
  • -o col[,col]...: Prints the specified columns. Each col should be a column name.
  • -p: Lists processes associated with each service instance. A service instance may have no associated processes. The process ID, start time, and command name (PID, STIME, and CMD fields from ps(1)) are displayed for each process.
  • -R instance_FMRI: Selects service instances that have the specified service instance as their restarter.
  • -s col: Sorts output by column. col should be a column name. Multiple options behave additively.
  • -S col: Sorts by col in the opposite order from option -s.
  • -v: Displays verbose columns: STATE, NSTATE, STIME, CTID, and FMRI.

The column names used with the svcs(1) command are case sensitive and are as follows:

  • CTID: The primary contract ID for the service instance, if one exists.
  • DESC: A brief description of the service from its template element. A service may not have a description available, in which case a hyphen is used to denote an empty value.
  • FMRI: The FMRI of the service instance.
  • INST: The instance name of the service instance.
  • NSTA: The abbreviated next state of the service instance, as given in the STA column description. A hyphen denotes that the instance is not transitioning, otherwise it’s the same as the STA.
  • NSTATE: The next state of the service. A hyphen is used to denote that the instance is not transitioning, otherwise it’s the same as the STATE.
  • SCOPE: The scope name of the service instance.
  • SVC: The service name of the service instance.
  • STA: The abbreviated state of the service instance.
  • STATE: The state of the service instance. An asterisk is appended for instances in transition, unless the NSTA or NSTATE column is also being displayed.
  • STIME: If the service instance entered the current state within the last 24 hours, this column indicates the time that it did so. Otherwise, this column indicates the date on which it did so, printed with underscores in place of blanks.
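The STIME rule is simple enough to sketch in a few lines: a time of day for recent state changes, otherwise a date with underscores in place of blanks. The exact layouts used here (HH:MM:SS and an abbreviated month plus day, matching the Oct_31 style seen in the svcs listings in this article) are assumptions, and format_stime is a hypothetical helper, not part of svcs(1).

```python
from datetime import datetime, timedelta


def format_stime(entered, now):
    """Format a state-change timestamp following the STIME rule above."""
    if now - entered < timedelta(hours=24):
        # Entered the current state within the last 24 hours: show the time.
        return entered.strftime("%H:%M:%S")
    # Otherwise show the date, with underscores in place of blanks.
    return entered.strftime("%b %d").replace(" ", "_")


now = datetime(2004, 10, 31, 12, 0, 0)
print(format_stime(datetime(2004, 10, 31, 9, 30, 0), now))
print(format_stime(datetime(2004, 10, 12, 8, 0, 0), now))
```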

The svccfg(1M) command is used to import, export, and modify the configurations of services in the repository. It can be invoked interactively, by specifying subcommands, or by specifying a command file containing a series of subcommands. The interactive and subcommand forms are shown here, and the command-file form is listed after the examples that follow:

/usr/sbin/svccfg [-v]

/usr/sbin/svccfg [-v] subcommand [args...]

Examples of the Predictive Self-Healing Service Manager

Using svcs(1), view the services on the system:

svcs
….
online         Oct_31   svc:/system/filesystem/local:default
online         Oct_31   svc:/network/rpc/bind:default
online         Oct_31   svc:/system/cron:default
online         Oct_31   svc:/system/sac:default
online         Oct_31   svc:/system/system-log:default
online         Oct_31   svc:/network/inetd:default
online         Oct_31   svc:/network/nis/client:default
online         Oct_31   svc:/network/rpc/keyserv:default
online         Oct_31   svc:/network/rpc/gss:ticotsord
online         Oct_31   svc:/network/security/ktkt_warn:ticotsord
online         Oct_31   svc:/milestone/multi-user:default
….

Use svcs -p to find out the relationship between services and processes. This example shows the NFS server service:

svcs -p nfs/server
STATE          STIME    FMRI
online         Oct_12   svc:/network/nfs/server:default
               Oct_31     103729 mountd
               Oct_31     103731 nfsd
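To use this service-to-process mapping in a script, the listing can be parsed into a dictionary keyed by FMRI. The sketch below assumes the layout shown above, in which each service line is followed by its process lines carrying STIME, PID, and CMD. The name parse_svcs_p is hypothetical.

```python
def parse_svcs_p(output):
    """Map each service FMRI to its state, start time, and processes."""
    services = {}
    current = None
    for line in output.splitlines():
        parts = line.split()
        if len(parts) < 3 or parts[0] == "STATE":
            continue                      # blank line or column header
        if parts[2].startswith("svc:/"):
            current = parts[2]            # service line: STATE STIME FMRI
            services[current] = {
                "state": parts[0], "stime": parts[1], "processes": [],
            }
        elif current is not None and parts[1].isdigit():
            # Process line: STIME PID CMD
            services[current]["processes"].append(
                {"stime": parts[0], "pid": int(parts[1]), "cmd": parts[2]}
            )
    return services


sample = """\
STATE          STIME    FMRI
online         Oct_12   svc:/network/nfs/server:default
               Oct_31     103729 mountd
               Oct_31     103731 nfsd
"""
nfs = parse_svcs_p(sample)["svc:/network/nfs/server:default"]
print(nfs["state"], [p["cmd"] for p in nfs["processes"]])
```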

If a service has a problem, use the service manager tools to help diagnose it and review the suggested course of action to correct the issue. For example, the svcs -x option lists information about every service that isn’t running, and why:

svcs -x
svc:/application/print/server:default (LP Print Service)
 State: disabled since Tue Oct 05 22:27:55 2004
Reason: Disabled by an administrator.
   See: http://sun.com/msg/SMF-8000-05
   See: lpsched(1M)
Impact: 1 service is not running.

The referenced article at www.sun.com provides additional information on the type of issue, and suggests steps to acquire additional data and correct the problem.

Use svccfg(1M) to view the properties of the SMTP server and determine its dependencies:

svccfg
svc:> select network/smtp
svc:/network/smtp> listprop
system-log                  dependency
system-log/entities         fmri     svc:/system/system-log
system-log/grouping         astring  optional_all
system-log/restart_on       astring  none
system-log/type             astring  service
identity                    dependency
identity/entities           fmri     svc:/system/identity:domain
identity/grouping           astring  require_all
identity/restart_on         astring  refresh
identity/type               astring  service
name-services               dependency
name-services/entities      fmri     svc:/milestone/name-services
name-services/grouping      astring  require_all
name-services/restart_on    astring  refresh
name-services/type          astring  service
network-service             dependency
network-service/entities    fmri     svc:/network/service
network-service/grouping    astring  require_all
network-service/restart_on  astring  none
network-service/type        astring  service
fs-local                    dependency
fs-local/entities           fmri     svc:/system/filesystem/local
fs-local/grouping           astring  require_all
fs-local/restart_on         astring  none
fs-local/type               astring  service
general                     framework
general/entity_stability    astring  Unstable
general/single_instance     boolean  true
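In the listprop output, each dependency is a property group of type dependency, followed by its entities, grouping, restart_on, and type properties. A short script can fold those lines into one record per dependency. The sketch below assumes one property per line in name, type, value order, as svccfg prints them. The name parse_dependencies is hypothetical, and the sample is an excerpt of the listing above.

```python
def parse_dependencies(listprop_output):
    """Collect the properties of every property group of type 'dependency'."""
    group_types = {}   # property group name -> group type
    properties = {}    # property group name -> {property name: value}
    for line in listprop_output.splitlines():
        parts = line.split(None, 2)
        if len(parts) < 2:
            continue
        name, kind = parts[0], parts[1]
        value = parts[2].strip() if len(parts) == 3 else ""
        if "/" in name:
            group, prop = name.split("/", 1)
            properties.setdefault(group, {})[prop] = value
        else:
            group_types[name] = kind
    return {
        group: properties.get(group, {})
        for group, kind in group_types.items()
        if kind == "dependency"
    }


sample = """\
system-log                  dependency
system-log/entities         fmri     svc:/system/system-log
system-log/grouping         astring  optional_all
fs-local                    dependency
fs-local/entities           fmri     svc:/system/filesystem/local
fs-local/grouping           astring  require_all
general                     framework
general/single_instance     boolean  true
"""
deps = parse_dependencies(sample)
for group, props in deps.items():
    print(group, props["grouping"], props["entities"])
```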


The third form of svccfg(1M) invocation, which reads subcommands from a command file, is:

/usr/sbin/svccfg [-v] -f command-file

For a complete list of all of the available subcommands, please read the svccfg(1M) man page.

The inetconv(1M) program converts inetd.conf entries into smf(5) manifests, and imports them into the repository. There is a one-to-one mapping between a service line in the specified input file and the resulting configuration file generated. By default, the configuration files are named using the following template:

<svcname>-<proto>.xml

The <svcname> token is replaced by the service’s name and the <proto> token by the service’s protocol. Any forward slash characters that exist in the source line for the service name or protocol are replaced with underscores. Each resulting manifest includes the service line as a comment. If a service line is found to be malformed or to be for an internal inetd service during the conversion process, no manifest is generated and that service line in the input file is skipped. The inetconv(1M) program accepts the following command line options:

  • -?: Display a usage message.
  • -e: Enable smf(5) services which are enabled in the input file.
  • -f: If a service manifest of the same name as the one to be generated is found in the destination directory, inetconv(1M) will overwrite that manifest if this option is specified. Otherwise, an error message is generated and the conversion of that service is not performed.
  • -i srcfile: Permits the specification of an alternate input file srcfile. If this option is not specified, then the inetd.conf(4) file is used as input.
  • -n: Turns off the auto-import of the manifests generated during the conversion process. Later, if you want to import a generated manifest into the smf(5) repository, you can do so through the use of the svccfg(1M) utility.
  • -o destdir: Permits the specification of an alternate destination directory destdir for the generated configuration files. If this option is not specified, then the manifests are placed in /var/svc/manifest/network/rpc if they are RPC services, otherwise in /var/svc/manifest/network.
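The naming and placement rules described for inetconv(1M) can be sketched as a small function. It assumes the file name is the service name, a hyphen, the protocol, and the .xml suffix, with forward slashes replaced by underscores, and it uses the default directories given for the -o option. The name manifest_path and its parameters are hypothetical, so treat the result as an approximation of what inetconv(1M) would generate.

```python
def manifest_path(service_name, protocol, is_rpc=False, destdir=None):
    """Return the manifest path implied by the naming rules described above."""
    # Forward slashes in either token are replaced with underscores.
    filename = "%s-%s.xml" % (
        service_name.replace("/", "_"),
        protocol.replace("/", "_"),
    )
    if destdir is None:
        # Default destinations when -o is not given.
        if is_rpc:
            destdir = "/var/svc/manifest/network/rpc"
        else:
            destdir = "/var/svc/manifest/network"
    return destdir + "/" + filename


print(manifest_path("ftp", "tcp6"))
print(manifest_path("100235/1", "rpc/ticotsord", is_rpc=True))
```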

The inetadm(1M) program views and configures inetd-controlled services. The following options are supported:

  • -?: Display a usage message.
  • -l FMRI: List all properties for the specified service instance in name=value pairs. In addition, if the property value is inherited from the default value provided by inetd, the name=value pairs are identified by the token (default). Property inheritance occurs when properties do not have a specified service instance default.
  • -e FMRI: Enable the specified service instance.
  • -d FMRI: Disable the specified service.
  • -p: Lists all default inet service property values provided by inetd in the form of name=value pairs. If the value is of boolean type, it is listed as TRUE or FALSE.
  • -m FMRI property_name=value [property_name=value...]: Change the values of the specified properties of the identified service instance. Properties are specified as whitespace-separated name=value pairs. To remove an instance-specific value and accept the default value for a property, simply specify the property without a value.
  • -M property_name=value [property_name=value...]: Change the values of the specified inetd default properties. Properties are specified as whitespace-separated name=value pairs.

Copyright 2017 ©Aceadmins. All rights reserved.