Auto healing

You can define policies in your templates to implement auto healing. Dell Cloud Manager (DCM) performs an auto healing action when it detects an impaired server. DCM uses the following criteria to detect an impaired server:

  • The server is not active.
  • The server is active but DCM has not received the initial DCM agent handshake.
    The time DCM waits for the initial handshake is specified in the Dell Cloud Manager server configuration. The parameter is server.agent.wait-time and the default is currently 20 minutes. Refer to Working with DCM Configurations in the Dell Cloud Manager Administration Guide for details.
  • The server is active and DCM received the initial handshake from the DCM agent, but DCM has not received a handshake recently.
    DCM “pings” the DCM agent every 10 minutes. If the agent does not respond, DCM considers the server impaired.
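The three criteria above can be read as a simple decision procedure. The following is a minimal sketch of that reading, not DCM's actual implementation. The ServerState class, its field names, and the is_impaired function are hypothetical names introduced only for illustration; the 20 minute value is the server.agent.wait-time default quoted above, and treating a single unanswered ping as "impaired" is an assumption based on the wording of the last criterion.

```python
from dataclasses import dataclass
from typing import Optional

# Default of server.agent.wait-time, in minutes, as quoted in the text above.
AGENT_WAIT_TIME_MIN = 20


@dataclass
class ServerState:
    """Hypothetical snapshot of what DCM knows about one server."""
    active: bool                               # cloud reports the server as active
    minutes_since_launch: float                # time elapsed since the server was launched
    handshake_received: bool                   # initial agent handshake has arrived
    last_ping_answered: Optional[bool] = None  # None means no ping has been sent yet


def is_impaired(state: ServerState, wait_time_min: float = AGENT_WAIT_TIME_MIN) -> bool:
    """Apply the three documented criteria in order."""
    # Criterion 1: the server is not active.
    if not state.active:
        return True
    # Criterion 2: active, but no initial handshake within the wait time.
    if not state.handshake_received:
        return state.minutes_since_launch > wait_time_min
    # Criterion 3: handshake was received earlier, but the agent stopped responding.
    return state.last_ping_answered is False


if __name__ == "__main__":
    booting = ServerState(active=True, minutes_since_launch=5, handshake_received=False)
    silent = ServerState(active=True, minutes_since_launch=60,
                         handshake_received=True, last_ping_answered=False)
    print(is_impaired(booting), is_impaired(silent))
```

Note that a server still inside its wait time is not treated as impaired in this sketch, which is why the first sample server is reported as healthy.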

Attention

Chef or Puppet failures on the launched servers in the stack do not indicate that a server is impaired and will not trigger auto healing.

To implement auto healing you first need to create groups. Each group is treated as a logical unit for the purpose of auto healing.

You need to assign a unique section name for each group in your template. Inside the group you need to define members, properties, policies, actions, measurements and criteria.

Attention

Sharing measurements, criteria, and actions between the policies defined in a template is not recommended. Each policy you define in a template should have its own set of measurements, criteria, and actions.

Members

  • The members: section is where you define the name(s) of the node_templates to be included in the group.

    Example: members: [vm1, vm2]
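Because members: refers to node_templates by name, a misspelled name is an easy mistake to make. The sketch below shows one way to check a template for this before launching it. It assumes the template has already been parsed into a Python dictionary and that node_templates and groups sit at the same level of that dictionary; both the unknown_members helper and that layout are assumptions made for illustration, not part of DCM.

```python
from typing import Dict, List


def unknown_members(template: dict) -> Dict[str, List[str]]:
    """Return, per group, the member names that match no node_template."""
    node_names = set(template.get("node_templates", {}))
    problems: Dict[str, List[str]] = {}
    for group_name, group in template.get("groups", {}).items():
        missing = [m for m in group.get("members", []) if m not in node_names]
        if missing:
            problems[group_name] = missing
    return problems


if __name__ == "__main__":
    # Hypothetical parsed template: vm3 is listed as a member but never defined.
    template = {
        "node_templates": {"vm1": {}, "vm2": {}},
        "groups": {"healing_group_1": {"members": ["vm1", "vm3"]}},
    }
    print(unknown_members(template))
```

An empty result means every member name resolves to a node template.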

Properties

Define the following properties in the group.

  • instances: - The initial number of resources to start when the stack is started.

    Example: instances: 2

  • minInstances: - The minimum number of resources.

    Example: minInstances: 1

  • maxInstances: - The maximum number of resources.

    Example: maxInstances: 10

  • coolDown: - The number of seconds to wait before performing an auto healing operation following a previous auto healing operation.

    Example: coolDown: 300
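To make the relationship between these properties concrete, here is a minimal sketch that sanity-checks a set of values and shows how coolDown gates a follow-up operation. The GroupProperties class and both functions are hypothetical. The rule that instances should lie between minInstances and maxInstances is an assumption that follows from the property descriptions; the documentation above does not state how DCM validates these values.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GroupProperties:
    """Mirrors the four group properties described above."""
    instances: int      # initial number of resources
    minInstances: int   # minimum number of resources
    maxInstances: int   # maximum number of resources
    coolDown: int       # seconds to wait between auto healing operations


def check_properties(p: GroupProperties) -> List[str]:
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    if p.minInstances > p.maxInstances:
        problems.append("minInstances is greater than maxInstances")
    if not (p.minInstances <= p.instances <= p.maxInstances):
        problems.append("instances is outside the minInstances..maxInstances range")
    if p.coolDown < 0:
        problems.append("coolDown is negative")
    return problems


def cooldown_elapsed(p: GroupProperties, seconds_since_last_operation: float) -> bool:
    """True once at least coolDown seconds have passed since the previous operation."""
    return seconds_since_last_operation >= p.coolDown


if __name__ == "__main__":
    props = GroupProperties(instances=2, minInstances=1, maxInstances=10, coolDown=300)
    print(check_properties(props))
    print(cooldown_elapsed(props, 120), cooldown_elapsed(props, 300))
```

With the example values used on this page (coolDown: 300), an operation attempted 120 seconds after the previous one would wait, while one attempted at 300 seconds would proceed.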

Policies

  • The policies: section inside the group is where you define the auto healing policies. Each policy references, by label, entries in the actions:, measurements:, and criteria: sections.

Actions

  • The actions: section inside the group is where you define the auto healing actions.

Measurements

  • The measurements: section inside the group is where you define the measurements that are used to determine when to perform the auto healing actions.

Criteria

  • The criteria: section inside the group is where you define the criteria that are used, along with the measurements, to determine when to perform the auto healing actions.

Example

Here is an example outline for a group defining an auto healing policy. The details of the policies, actions, measurements, and criteria sections are covered in the next few pages.

 # healing_group_1 defines the auto healing properties that will apply to nodes vm1 and vm2
     groups:
        healing_group_1:
          members: [vm1, vm2]  # The node_templates with the statement names "vm1" and "vm2" are in this group
          properties:
            instances: 2       # The initial number of servers to start when the stack is started
            minInstances: 1    # The minimum number of servers
            maxInstances: 10   # The maximum number of servers
            coolDown: 300      # The number of seconds to wait before performing an auto healing operation following a previous operation.

          # Policies define the auto healing policies. They reference by name (label) the actions, measurements, and criteria defined in the template.          
          policies:
            repair_on_status:          # This is a label
              type: dcm.policy.types.BasicPolicy
                  ...
                  ...

            cloud_reported_status:     # This is a label
              type: dcm.policy.types.BasicPolicy
                  ...
                  ...

          # Actions define the details of what actions to perform
          actions:
            repair:                    # This is a label
              type: dcm.policy.action.ReplaceResource
              properties:
                  ...
                  ...

          # Measurements are used to determine when to perform the auto healing actions you define
          measurements:  
            cloud_reported_status:     # This is a label
              type: dcm.policy.measurement.ResourceActive
                  ...
                  ...

          # Criteria are used along with the measurements to determine when to perform the auto healing actions you define
          criteria:  
            status_failure:            # This is a label
              type: dcm.policy.criteria.False
                  ...
                  ...