.. _thebasics/statetypes:

=============
 State Types 
=============


Introduction 
=============

The current state of monitored services and hosts is determined by two components:

  * The status of the service or host (i.e. OK, WARNING, UP, DOWN, etc.)
  * The type of state the service or host is in.

There are two state types in Shinken: SOFT states and HARD states. These state types are a crucial part of the monitoring logic, as they are used to determine when :ref:`event handlers <advanced/eventhandlers>` are executed and when :ref:`notifications <thebasics/notifications>` are initially sent out.

This document describes the difference between SOFT and HARD states, how they occur, and what happens when they occur.


Service and Host Check Retries 
===============================

In order to prevent false alarms from transient problems, Shinken allows you to define how many times a service or host should be (re)checked before it is considered to have a "real" problem. This is controlled by the max_check_attempts option in the host and service definitions. Understanding how hosts and services are (re)checked in order to determine if a real problem exists is important in understanding how state types work.


Soft States 
============

Soft states occur in the following situations:

  * When a service or host check results in a non-OK or non-UP state and the service or host has not yet been (re)checked the number of times specified by the max_check_attempts directive in the service or host definition. This is called a soft error.
  * When a service or host recovers from a soft error. This is considered a soft recovery.

The following things occur when hosts or services experience SOFT state changes:

  * The SOFT state is logged.
  * Event handlers are executed to handle the SOFT state.

SOFT states are only logged if you have enabled the :ref:`log_service_retries <configuration/configmain-advanced#log_service_retries>` or :ref:`log_host_retries <configuration/configmain-advanced#log_host_retries>` options in your main configuration file.

The only significant thing that happens during a soft state is the execution of event handlers. Event handlers are particularly useful if you want to try to fix a problem proactively before it turns into a HARD state. The :ref:`$HOSTSTATETYPE$ <$HOSTSTATETYPE$>` or :ref:`$SERVICESTATETYPE$ <$SERVICESTATETYPE$>` macros will have a value of "SOFT" when event handlers are executed, which allows your event handler scripts to know when they should take corrective action. More information on event handlers can be found :ref:`here <advanced/eventhandlers>`.
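
As a rough illustration, an event handler written in Python might branch on the state type as sketched below. The argument order, the ``decide_action`` and ``restart_service`` names, and the choice to act on the second soft attempt are assumptions made for this sketch. The handler command definition is assumed to pass ``$SERVICESTATE$``, ``$SERVICESTATETYPE$`` and ``$SERVICEATTEMPT$`` on the command line.

.. code-block:: python

   import sys


   def decide_action(state, state_type, attempt):
       """Return the action a handler might take for one service check result."""
       if state != "CRITICAL":
           return "none"        # this sketch only reacts to CRITICAL results
       if state_type == "SOFT" and attempt >= 2:
           return "restart"     # try a fix before the state turns HARD
       if state_type == "HARD":
           return "restart"     # last resort: the problem is confirmed
       return "none"            # first soft attempt: wait for a recheck


   def restart_service():
       # Hypothetical placeholder: a real handler would restart the daemon here.
       print("restarting service")


   def main(argv):
       # Assumed invocation:
       #   handler.py $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$
       state, state_type, attempt = argv[1], argv[2], int(argv[3])
       if decide_action(state, state_type, attempt) == "restart":
           restart_service()


   if __name__ == "__main__" and len(sys.argv) == 4:
       main(sys.argv)

Keeping the decision in a separate function makes it possible to test the handler logic without triggering the corrective action itself.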


Hard States 
============

Hard states occur for hosts and services in the following situations:

  * When a host or service check results in a non-UP or non-OK state and it has been (re)checked the number of times specified by the max_check_attempts option in the host or service definition. This is a hard error state.
  * When a host or service transitions from one hard error state to another error state (e.g. WARNING to CRITICAL).
  * When a service check results in a non-OK state and its corresponding host is either DOWN or UNREACHABLE.
  * When a host or service recovers from a hard error state. This is considered to be a hard recovery.
  * When a :ref:`passive host check <thebasics/passivechecks>` is received. Passive host checks are treated as HARD unless the :ref:`passive_host_checks_are_soft <configuration/configmain-advanced#passive_host_checks_are_soft>` option is enabled.
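
The rules for soft and hard states can be condensed into one decision function. The following is a minimal sketch of those rules as described in this document, not Shinken's scheduler code; the function and all parameter names are hypothetical.

.. code-block:: python

   def classify_state_type(is_ok, attempt, max_check_attempts,
                           previous_type="HARD", previous_ok=True,
                           host_is_up=True, passive_host_check=False,
                           passive_host_checks_are_soft=False):
       """Return "SOFT" or "HARD" for a single check result."""
       # Passive host checks are HARD unless configured otherwise.
       if passive_host_check and not passive_host_checks_are_soft:
           return "HARD"
       if is_ok:
           # Recovering from a soft error is a soft recovery;
           # anything else (hard recovery, steady OK) is HARD.
           if not previous_ok and previous_type == "SOFT":
               return "SOFT"
           return "HARD"
       # From here on the result is a problem state.
       if not previous_ok and previous_type == "HARD":
           return "HARD"        # one hard error state to another
       if not host_is_up:
           return "HARD"        # service problem on a DOWN/UNREACHABLE host
       if attempt >= max_check_attempts:
           return "HARD"        # retries exhausted
       return "SOFT"            # still being rechecked


   # First failure out of three allowed attempts is only a soft error.
   print(classify_state_type(False, 1, 3))
   # The third consecutive failure confirms the problem.
   print(classify_state_type(False, 3, 3))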

The following things occur when hosts or services experience HARD state changes:

  * The HARD state is logged.
  * Event handlers are executed to handle the HARD state.
  * Contacts are notified of the host or service problem or recovery.

The :ref:`$HOSTSTATETYPE$ <$HOSTSTATETYPE$>` or :ref:`$SERVICESTATETYPE$ <$SERVICESTATETYPE$>` macros will have a value of "HARD" when event handlers are executed, which allows your event handler scripts to know when they should take corrective action. More information on event handlers can be found :ref:`here <advanced/eventhandlers>`.
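
The consequences of SOFT and HARD state changes listed in this document can be summarized in a small lookup. This is a sketch only: ``actions_for`` and the action labels are hypothetical names, ``log_retries`` stands in for the log_service_retries or log_host_retries option, and notification filtering is ignored.

.. code-block:: python

   def actions_for(state_type, log_retries=False):
       """List what happens on a state change of the given type."""
       actions = []
       # HARD changes are always logged; SOFT ones only if retries are logged.
       if state_type == "HARD" or log_retries:
           actions.append("log")
       # Event handlers run for both state types.
       actions.append("event_handlers")
       # Only HARD changes lead to problem or recovery notifications.
       if state_type == "HARD":
           actions.append("notify_contacts")
       return actions


   print(actions_for("SOFT"))
   print(actions_for("HARD"))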


Example 
========

Here's an example of how state types are determined, when state changes occur, and when event handlers are executed and notifications are sent out. The table below shows consecutive checks of a service over time. The service has a max_check_attempts value of 3.


==== ======= ======== ========== ============ =============================================================================================================================================================================================================
Time Check # State    State Type State Change Notes                                                                                                                                                                                                        
0    1       OK       HARD       No           Initial state of the service                                                                                                                                                                                 
1    1       CRITICAL SOFT       Yes          First detection of a non-OK state. Event handlers execute.                                                                                                                                                   
2    2       WARNING  SOFT       Yes          Service continues to be in a non-OK state. Event handlers execute.                                                                                                                                           
3    3       CRITICAL HARD       Yes          Max check attempts has been reached, so service goes into a HARD state. Event handlers execute and a problem notification is sent out. Check # is reset to 1 immediately after this happens.                 
4    1       WARNING  HARD       Yes          Service changes to a HARD WARNING state. Event handlers execute and a problem notification is sent out.                                                                                                      
5    1       WARNING  HARD       No           Service stabilizes in a HARD problem state. Depending on what the notification interval for the service is, another notification might be sent out.                                                          
6    1       OK       HARD       Yes          Service experiences a HARD recovery. Event handlers execute and a recovery notification is sent out.                                                                                                         
7    1       OK       HARD       No           Service is still OK.                                                                                                                                                                                         
8    1       UNKNOWN  SOFT       Yes          Service is detected as changing to a SOFT non-OK state. Event handlers execute.                                                                                                                              
9    2       OK       SOFT       Yes          Service experiences a SOFT recovery. Event handlers execute, but no notification is sent, as this wasn't a "real" problem. State type is set to HARD and check # resets to 1 immediately after this happens.
10   1       OK       HARD       No           Service stabilizes in an OK state.                                                                                                                                                                           
==== ======= ======== ========== ============ =============================================================================================================================================================================================================
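
The sequence in the table can be replayed with a small simulation. This is a simplified sketch under stated assumptions: it models a single actively checked service whose host stays UP, it starts from a HARD OK state, and the ``simulate`` function and its variable names are hypothetical rather than taken from Shinken.

.. code-block:: python

   def simulate(results, max_check_attempts):
       """Return (time, attempt, state, state_type, changed) per check result."""
       state, state_type, attempt = "OK", "HARD", 1   # assumed starting point
       rows = []
       for time, result in enumerate(results):
           previous = (state, state_type)
           in_soft_problem = state != "OK" and state_type == "SOFT"
           in_hard_problem = state != "OK" and state_type == "HARD"
           # The attempt counter only advances while a soft problem is open.
           attempt = attempt + 1 if in_soft_problem else 1
           if result == "OK":
               shown_type = "SOFT" if in_soft_problem else "HARD"
           elif in_hard_problem or attempt >= max_check_attempts:
               shown_type = "HARD"
           else:
               shown_type = "SOFT"
           changed = (result, shown_type) != previous
           rows.append((time, attempt, result, shown_type, changed))
           # A soft recovery is promoted to a HARD OK state right away.
           state = result
           state_type = "HARD" if result == "OK" else shown_type
           if state_type == "HARD":
               attempt = 1      # counter resets once the state is HARD
       return rows


   RESULTS = ["OK", "CRITICAL", "WARNING", "CRITICAL", "WARNING", "WARNING",
              "OK", "OK", "UNKNOWN", "OK", "OK"]

   for time, attempt, result, shown_type, changed in simulate(RESULTS, 3):
       print(time, attempt, result, shown_type, "Yes" if changed else "No")

With the eleven results above and a max_check_attempts value of 3, the printed columns correspond to the Time, Check #, State, State Type and State Change columns of the table.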