Monitors

The concept of threshold monitoring is well known in the IT administration community. Every Threshold Monitor consists of two major parts:

  • An executable or a script that is able to submit a metric value from the Monitored component.
  • A Monitor Policy with configured threshold levels that defines different conditions for the values.

Every declared threshold defines a numeric (double) value. Generally, a metric value becomes important to an operator when it crosses a defined level. For example, the free disk space value is not important in production environments while it stays above some minimum. To prevent an "out of disk space" situation, an administrator has to specify a threshold for the free disk space Monitor that leaves enough time to investigate and take corrective action before the system crashes or degrades.

Because IT environments differ, threshold levels vary from system to system and must therefore be configurable. Another aspect is that a single threshold value cannot adequately reflect the dynamics of change. Multiple thresholds with different severities give a clearer picture and more time to prevent problems.
Another aspect of Monitors is that several objects of the same type may need to be monitored. For example, each partition of a disk has a different size. Ideally, one Monitor should be able to deliver values for multiple objects, and these objects should be able to have different thresholds.

In the boom Infrastructure every Monitor requires a Monitor Policy. The Monitor Policy contains all definitions necessary to start the Monitor Binary (if required) and to react to submitted values.
Once configured and uploaded to the boom Agent, the Monitor Policy is an instruction that declares what to trigger and how to process the data. All Policies in the boom Environment must have unique names. These names must be used when submitting Monitor values. In other words, if a Monitor Policy named "Monitor_A" exists, this Monitor Policy will be used for checking values that are submitted as Monitor_A="double_value". The Monitor Binary can submit values for multiple instances by setting the object attribute for each instance.
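
The relationship between submitted values, policy names and objects can be pictured with a small Python sketch. It is only an illustration: the Submission record and the group_by_policy helper are hypothetical names, not part of the boom product, and the actual submission mechanism of the boom Agent is not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Submission:
    # Hypothetical record; field names are assumptions for illustration.
    policy_name: str  # must equal the unique Monitor Policy name
    value: float      # the metric value (a double)
    obj: str = ""     # optional object (instance) the value belongs to


def group_by_policy(submissions):
    """Group submitted (object, value) pairs by the policy name used."""
    grouped = {}
    for s in submissions:
        grouped.setdefault(s.policy_name, []).append((s.obj, s.value))
    return grouped


# One Monitor submitting values for two objects under the same policy name.
subs = [
    Submission("Monitor_A", 512.0, "/"),
    Submission("Monitor_A", 80.0, "/var"),
    Submission("Monitor_B", 42.0),
]
grouped = group_by_policy(subs)
```

Here both values submitted as Monitor_A would be checked by the Policy named "Monitor_A", each for its own object.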

Monitor Types and Variations

Threshold Monitors come in different types and variations. The two main categories are MAXTHRESHOLD and MINTHRESHOLD.
For example, a 'free disk space' Monitor has the MINTHRESHOLD type, whereas a CPU utilization Monitor has the MAXTHRESHOLD type.
In terms of call types, EXEC, JAVA and EXTERNAL are supported.
A further variation is whether a Policy is WITH RESET or WITHOUT RESET.

Maximum and Minimum

Thresholds are one of the major aspects of Monitors. They reduce the number of Indications that reach an operator's screen. The boom product operates with two types of thresholds: MAXTHRESHOLD and MINTHRESHOLD.


Since thresholds are part of a Policy, it makes sense to apply the threshold type to the Policy as a whole. Moreover, the threshold is not the only attribute used during processing: a Policy has a list of conditions, and every condition has an attribute named 'threshold'. Every condition also contains additional attributes that are used when an Indication is created.

A Monitor Policy of the MAXTHRESHOLD type declares a number of conditions with maximum threshold levels in descending order. Conversely, a MINTHRESHOLD Monitor Policy expects the thresholds in ascending order.

This requirement is important because the calculation engine checks all thresholds one after another. The first matching condition causes the engine to stop processing all following conditions.
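
The first-match rule can be sketched in Python as follows. This is a minimal sketch, not the boom calculation engine: the Condition class and first_match function are hypothetical, the severities "major" and "warning" are assumed names, and treating a value equal to the threshold as a match is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Condition:
    severity: str
    threshold: float


def first_match(policy_type, conditions, value):
    """Return the first condition whose threshold is reached, else None."""
    for cond in conditions:  # checked one after another, in policy order
        if policy_type == "MAXTHRESHOLD":
            hit = value >= cond.threshold  # assumption: ">=" counts as crossed
        elif policy_type == "MINTHRESHOLD":
            hit = value <= cond.threshold  # assumption: "<=" counts as crossed
        else:
            raise ValueError(f"unsupported policy type: {policy_type}")
        if hit:
            return cond  # first match: all following conditions are skipped
    return None


# MAXTHRESHOLD: descending order, so the most severe level is tested first.
cpu = [Condition("critical", 95.0), Condition("major", 85.0),
       Condition("warning", 70.0)]
# MINTHRESHOLD: ascending order, for the same reason.
disk = [Condition("critical", 100.0), Condition("warning", 1024.0)]
```

If the CPU conditions were listed in ascending order instead, a value of 97 would already match the 70 threshold and the critical condition would never be reached, which is why the ordering matters.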

MAXTHRESHOLD and MINTHRESHOLD

When a threshold value is crossed for the first time, the boom Agent creates an Indication and sends it to the boom Server. After that, the Agent stays silent until a subsequent value crosses a different threshold level. This suppression algorithm reduces the number of messages received by the operator as well as the network traffic.
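
The suppression behaviour can be illustrated with a short Python sketch for a MAXTHRESHOLD policy. It is a simplification under stated assumptions: the indications function is hypothetical, reset values and finished alerts are ignored, and a value equal to a threshold is treated as having crossed it.

```python
def indications(thresholds, values):
    """Return (value, threshold) pairs for which an Indication would be sent.

    thresholds: MAXTHRESHOLD levels in descending order.
    An Indication is produced only when the matched level changes.
    """
    sent = []
    current = None  # threshold level of the last Indication, if any
    for v in values:
        matched = next((t for t in thresholds if v >= t), None)
        if matched != current:
            current = matched
            if matched is not None:
                sent.append((v, matched))
        # same level as before: the Agent stays silent
    return sent


result = indications([95.0, 85.0, 70.0], [50, 72, 75, 90, 91, 96, 97, 60])
```

In this run eight submitted values lead to only three Indications, one for each threshold level that is newly reached.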

It is possible to have multiple conditions with the same severity, and it is also possible to skip unnecessary severities. Object filters make it possible to combine multiple condition sets for different objects in one Policy.

During processing, the calculation engine takes into account only those conditions that match the object value submitted together with the monitor value. If an Object mask is not specified, the condition applies to all objects.
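
Object filtering can be sketched in Python as shown below. The syntax of the Object mask is not described here, so the sketch assumes shell-style wildcards purely for illustration; the applicable function and the dictionary keys are hypothetical as well.

```python
from fnmatch import fnmatchcase


def applicable(conditions, obj):
    """Keep only the conditions whose object mask matches the given object.

    A condition without a mask applies to every object.
    Assumption: masks are shell-style patterns such as "/var*".
    """
    return [
        c for c in conditions
        if c.get("object_mask") is None or fnmatchcase(obj, c["object_mask"])
    ]


conditions = [
    {"severity": "critical", "threshold": 100.0, "object_mask": "/var*"},
    {"severity": "critical", "threshold": 500.0, "object_mask": "/data"},
    {"severity": "warning", "threshold": 1024.0},  # no mask: all objects
]

var_thresholds = [c["threshold"] for c in applicable(conditions, "/var/log")]
home_thresholds = [c["threshold"] for c in applicable(conditions, "/home")]
```

With this layout one Policy carries separate critical thresholds for "/var*" and "/data", while the unmasked warning condition covers every object.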

Reset Values

All Monitor Policy thresholds come with Reset values. The "Reset" concept plays an important role in the calculation.

First of all, the Monitor Policy has a global flag called "Policy with reset". If this flag is set to 'NO', the Policy ignores all reset values as well as the silence periods and delivers a value every time one is submitted to the Agent. Monitors of this type are also known as 'Continuous Monitors'.
If 'YES' is selected, it is a threshold Monitor with reset.

When a monitor value crosses a threshold in the defined direction (increasing for MAXTHRESHOLD and decreasing for MINTHRESHOLD), this is called an "elevation". A crossing in the opposite direction is a "reset".
Monitors with a reset value different from the threshold value receive special handling in the "reset" direction. The reset value makes it possible to ignore small fluctuations of the value and keep the reached threshold unchanged.


The reset feature can be explained with the following example:
A process CPU utilization monitor has a critical threshold of 95%, indicating high CPU load for the process. A normal threshold, which indicates the normal state, applies above 0%. When the process reaches 95%, an Indication is sent to the server. Let's assume the critical condition specifies a reset value of 70%. This keeps the critical level unchanged until the process drops below 70% CPU, so fluctuations between 70 and 95 percent do not reset the severity to normal.

Another example is a MINTHRESHOLD Monitor for free disk space:
A critical condition has a threshold of 100 MB. The reset value is 1024 MB. As a result, a critical Indication such as "100 MB free space left on a disk..." stays active until an administrator frees disk space up to at least 1 GB.
A Minimum type threshold requires the reset value to be greater than the threshold; a Maximum type threshold requires the reset value to be less than the threshold.
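
Both examples can be replayed with a small Python sketch of a single critical condition with a reset value. It is a minimal sketch, not the boom implementation: the active_states function is hypothetical, and the exact behaviour at the boundaries (a value equal to the threshold or to the reset value) is an assumption read from the wording of the two examples.

```python
def active_states(policy_type, threshold, reset, values):
    """Return, for each value, whether the condition is active afterwards."""
    active = False
    states = []
    for v in values:
        if policy_type == "MAXTHRESHOLD":
            elevated = v >= threshold  # e.g. CPU reaches 95%
            cleared = v < reset        # e.g. CPU drops below 70%
        elif policy_type == "MINTHRESHOLD":
            elevated = v <= threshold  # e.g. 100 MB or less is free
            cleared = v >= reset       # e.g. at least 1024 MB is free again
        else:
            raise ValueError(f"unsupported policy type: {policy_type}")
        if not active and elevated:
            active = True
        elif active and cleared:
            active = False
        states.append(active)
    return states


cpu = active_states("MAXTHRESHOLD", 95.0, 70.0, [50, 96, 80, 72, 69, 90])
disk = active_states("MINTHRESHOLD", 100.0, 1024.0,
                     [2000, 90, 500, 1023, 1024, 500])
```

In the CPU run the values 80 and 72 keep the critical state active because they stay at or above the reset value, and the final value 90 does not reactivate it because it is below the threshold.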

Another option in the condition section of a Monitor Policy is the "Ignore Reset" flag. This flag is set to "NO" by default. If the flag is switched to "YES", the threshold condition becomes 'continuous' in nature: at this level, every submitted Monitor value generates an Indication. This can be used for more precise monitoring of critical conditions.

The "Silence Count" parameter of a condition can be used when it is necessary to suppress the first few matching values. The boom Agent ignores the specified number of submitted values that match the condition before an Indication is sent to the boom Server.
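
One possible reading of the "Silence Count" can be sketched in Python. The first_indication_index function is hypothetical, and the sketch assumes that the counter restarts whenever a submitted value no longer matches the condition, which the text above does not state.

```python
def first_indication_index(matches, silence_count):
    """Return the index of the first value that produces an Indication.

    matches: one boolean per submitted value (True = condition matched).
    Returns None if no Indication is produced.
    """
    skipped = 0
    for i, matched in enumerate(matches):
        if not matched:
            skipped = 0  # assumption: a non-matching value restarts the count
            continue
        if skipped < silence_count:
            skipped += 1  # still within the silence period
            continue
        return i
    return None


idx = first_indication_index([True, True, False, True, True, True], 2)
```

With a Silence Count of 2, short spikes of one or two matching values never reach the boom Server in this sketch.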

If an Indication is delivered with a Close Mask, the server is able to automatically close related previous Indications.

The default working directory for the monitor's executable is "$BOOM_ROOT/spi/". All binaries and script calls must be specified relative to this directory. If it is necessary to use a binary that is located elsewhere but is available through the PATH variable, use the '#' character as a prefix, for example #df or #top.
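
The path convention can be illustrated with a Python sketch. It shows how such a specification might be interpreted and is not the boom Agent's own code: resolve_monitor_command is a hypothetical helper, and "/opt/boom" and "scripts/check.sh" are made-up example values.

```python
import os
import shutil


def resolve_monitor_command(spec, boom_root):
    """Resolve a monitor call specification to a command path."""
    if spec.startswith("#"):
        name = spec[1:]
        # '#' prefix: look the binary up through the PATH variable;
        # fall back to the bare name if it is not found.
        return shutil.which(name) or name
    # Otherwise the call is relative to $BOOM_ROOT/spi/.
    return os.path.join(boom_root, "spi", spec)


relative = resolve_monitor_command("scripts/check.sh", "/opt/boom")
from_path = resolve_monitor_command("#df", "/opt/boom")
```

On a typical Unix system from_path would be the full path of the df binary, while relative points below the spi directory.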

Finished Alerts

The "Alert Finished" state is reached when the monitor submits a value outside the defined threshold borders. For MAXTHRESHOLD this means below the lowest threshold value, and for MINTHRESHOLD above the highest one. This state indicates the end of the previous state and enables operators to identify whether a problem is still ongoing. Besides this benefit for the exception-based operations concept, it also avoids sending Indications with normal severity.
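
The border check can be written down as a short Python sketch. The is_outside_borders function is hypothetical, reset values are left out for simplicity, and whether a value exactly on the border counts as finished is an assumption.

```python
def is_outside_borders(policy_type, thresholds, value):
    """Return True if the value lies outside all defined thresholds."""
    if policy_type == "MAXTHRESHOLD":
        return value < min(thresholds)  # below the lowest threshold
    if policy_type == "MINTHRESHOLD":
        return value > max(thresholds)  # above the highest threshold
    raise ValueError(f"unsupported policy type: {policy_type}")


cpu_finished = is_outside_borders("MAXTHRESHOLD", [95.0, 85.0, 70.0], 60.0)
disk_finished = is_outside_borders("MINTHRESHOLD", [100.0, 1024.0], 2048.0)
```

An earlier alert would be marked as finished only when such a value arrives after a threshold had been crossed.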

The Indication Browser displays such Indications with the severity struck through:

Browser view with finished alerts


After an alert is finished, the Indication details show the time stamp and the last value that triggered the finished state.

Detail view of finished alert

MAXONLYCHANGES and MINONLYCHANGES

Starting with v2.55, Monitor Policies support two new types: MAXONLYCHANGES and MINONLYCHANGES. These types can be used for monitoring values that change infrequently. For these types, a new Indication is created for every detected change. The policy conditions are used only to determine the severity and to define the Indication's attributes; thresholding is not used.
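
The change detection can be sketched in Python as follows. The change_indications function is hypothetical; the sketch assumes that the very first submitted value also counts as a change, and it leaves out the severity lookup through the policy conditions.

```python
def change_indications(values):
    """Return the values for which a new Indication would be created."""
    sent = []
    previous = None  # no value seen yet
    for v in values:
        if previous is None or v != previous:
            sent.append(v)  # value changed: create a new Indication
        previous = v
    return sent


changes = change_indications([3.0, 3.0, 4.0, 4.0, 4.0, 2.0, 2.0])
```

Repeated submissions of an unchanged value produce nothing, so a rarely changing metric generates one Indication per change.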