PREDICTIVE MONITORING
---------------------

1) Introduction

Some people call the following concept "predictive monitoring":

Assume that you have problems assigning levels for the CPU load on a
specific server, because at certain times in the week an important,
CPU-intensive job is running. These jobs produce a high load - which
is completely OK. At other times, however, a high load could indicate
a problem. If you now set the warn/crit levels such that the check
does not trigger during the runs of the job, you make the monitoring
blind to CPU load problems in all other periods.

What we need are levels that change dynamically with time, so that
e.g. a load of 10 on Monday at 10:00 is OK, while the same load at
11:00 should raise an alert. If those levels are then *automatically
computed* from the values that have been measured in the past, we get
an intelligent monitoring that "predicts" what is OK and what is not.
Other people call this "anomaly detection".

2) An idea for a solution

Our idea is to use the data that is kept in RRDs in order to compute
sensible dynamic levels for certain check parameters. Let's stay with
the CPU load as an example. In any default Check_MK or OMD
installation, PNP4Nagios and thus RRDs are used for storing historic
performance data such as the CPU load for up to four years. This will
be our basis. We will analyse this data from time to time and compute
a forecast for the future.

Before we can do this we need to understand that the whole prediction
idea is based on *periodic intervals* that repeat again and again.
For example, if we had a high CPU load on each of the last 50 Mondays
from 10:00 to 10:05, then we assume that next Monday we will see a
similar development. But the day of the week might not always be the
way to go. Here are some possible periods:

* The day of the week
* The day of the month (1st, 2nd, ...)
* The day of the month in reverse (last, 2nd last, ...)
* The hour
* Just whether it is a work day or a holiday

In general we need to make two decisions:

* The slicing (for example "one day")
* The grouping (for example "group by the day of the week")

Example: if we slice into days and then group by the day of the week,
we get seven different groups. For each of these groups we separately
compute a prediction by fetching the relevant data from the past -
limited to a certain time horizon, for example to the data of the
last 20 Mondays. Then we overlay these 20 graphs and compute for each
time of day the

- Maximum value
- Minimum value
- Average value
- Standard deviation

When doing this we could impose a larger weight on more recent
Mondays and a smaller weight on the older ones. The result is
condensed information about the past that we use as a prediction for
the future.

Based on that prediction we can now construct dynamic levels by
creating a "corridor". An example could be: "Raise a warning if the
CPU load is more than 10% higher than the predicted value". A
percentage is not the only way to go. Useful options seem to be at
least:

- +/- a percentage of the predicted value (average, min or max)
- an absolute difference to the predicted value (e.g. +/- 5)
- a difference in relation to the standard deviation

Working with the standard deviation takes into account the difference
between situations where the historic values show a greater or
smaller variance. In other words: the smaller the standard deviation,
the more precise the prediction. It is not only possible to set upper
levels - lower levels are possible as well. In other words: "Warn me
if the CPU load is too *low*!" This could be a hint that an important
job is *not* running.
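As an illustration of the two computations described above - the
weighted statistics over the overlaid slices, and levels derived from
them - here is a minimal sketch in Python. The function names are
made up for this example; nothing here is a fixed interface:

  import math

  def weighted_stats(curves, weight):
      # curves: one list of measured values per past slice, all of
      # the same length, ordered from oldest to youngest. weight is
      # the factor by which each slice counts less than its successor.
      prediction = []
      for i in range(len(curves[0])):
          w, wsum, vsum, sqsum = 1.0, 0.0, 0.0, 0.0
          values = [curve[i] for curve in curves]
          for v in reversed(values):  # youngest slice first, weight 1.0
              wsum += w
              vsum += w * v
              sqsum += w * v * v
              w *= weight
          avg = vsum / wsum
          stddev = math.sqrt(max(sqsum / wsum - avg * avg, 0.0))
          prediction.append((min(values), max(values), avg, stddev))
      return prediction

  def levels_from_prediction(reference, stddev, how, levels):
      warn, crit = levels
      if how == "relative":    # e.g. (0.10, 0.20) -> +10% / +20%
          return reference * (1 + warn), reference * (1 + crit)
      elif how == "absolute":  # e.g. (2.0, 4.0) -> +2 / +4
          return reference + warn, reference + crit
      else:                    # "stddev": levels in standard deviations
          return reference + warn * stddev, reference + crit * stddev

  # Example: warn/crit at +10% / +20% above the weighted average
  curves = [[1.0, 2.0], [1.2, 2.4], [1.1, 2.2]]  # three past slices
  minimum, maximum, avg, stddev = weighted_stats(curves, 0.9)[0]
  warn, crit = levels_from_prediction(avg, stddev, "relative",
                                      (0.10, 0.20))

Lower levels would work the same way, just with the signs inverted.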
In other words: "Warn me, if the CPU load is too *low*!" This could be a hint for an important job that is *not* running. 3) Implementation within Check_MK When trying to find a good architecture for an implementation several aspects have to taken into mind: - performance (used CPU/IO/disk ressources on the monitoring host) - code complexity - and thus cost of implementation and code maintainance - flexibility for the user - transparency to the user - openness to later improvements The implementation that we suggest tries to maximise all those aspects - while we are sure that even better ideas might exist... The implementation touches various areas of Check_MK and consists of the following tasks: a) A script/program that analyses RRD data and creates predicted dynamic levels b) A helper function for checks that makes use of that data when they need to determine levels c) Implementing dynamic levels in several checks by using that function d) Enhancing the WATO rules of those checks such that the user can configure the dynamic levels e) Adapting the PNP templates of those checks such that the dynamic levels are being displayed in the graphs. Implementation details: a) Analyser script This script needs the following input parameters: * Hostname and RRD and variable therein to analyse (e.g. srvabc012 / CPU load / load1) * Slicing (e.g. 24 hours, aligned to UTC) * Slice to compute [1] (e.g. "monday") * Grouping [2] (e.g. group by day of week) * Time horizon (e.g. 20 weeks into the past) * Weight function (e.g. the weight of each week is just 90% of the weight of the succeessing week, or: weight all weeks identically) Notes: [1] If we just compute one slice at a time (e.g. only monday) then we can cut down the running time of the script and can do this right within the check on a on-demand base. [2] The grouping can be implemented as a Python function that maps a time stamp (the beginning of a slice) to a string that represents the group. E.g. 1763747600 -> "monday". The result is a binary encoded file that is stored below var, e.g.: var/check_mk/prediction/srvabc012/CPU load/load1/monday A second file (Python repr() syntax) contains the input parameters including the time of youngest contained slice: var/check_mk/prediction/srvabc012/CPU load/load1/monday.info This info file allows to re-run the prediction only when the check parameters have changed or if a new slice is needed. The implementation of this program is in Python, if this is fast enough (which I assume) or in C++ otherwise. If it is in Python then we do not need an external program but can put this into a module (just like currently snmp.py or automation.py) und call it directly from the check. b) Helper function for checks When a check (e.g. cpu.loads) wants to apply dynamic levels then it should call a helper function the encapsulates all of the intelligent stuff of the prediction. 
b) Helper function for checks

When a check (e.g. cpu.loads) wants to apply dynamic levels, it
should call a helper function that encapsulates all of the
intelligent stuff of the prediction. An example call could look like
the following (the current host name and check type are implicitly
known, the service description is computed from the check type and
the item):

  # These values come from the check's parameters and are configured
  # on a per-check basis:
  analyser_params = {
      "slicing"  : (24 * 3600, 0),  # duration in sec, offset from UTC
      "slice"    : "monday",
      "grouping" : "weekday",
      "horizon"  : 24 * 3600 * 150, # 150 days back
      "weight"   : 0.95,
  }

  warn, crit = predict_levels(
      "load1",       # ds_name: the variable in the RRD
      "avg",         # levels_rel: relate to avg, min or max
      "relative",    # levels_op: relative, absolute or stddev
      (0.05, 0.10),  # levels: warn at +5%, crit at +10%
      analyser_params)

The function prototype looks like this:

  def predict_levels(ds_name, levels_rel, levels_op, levels,
                     prediction_parameters, item=None):

- Determine the current slice
- Check whether the prediction file is up to date; if not,
  (re-)create the prediction file
- Get min, max, avg and stddev from the prediction file for the
  current time
- Compute the levels from that by applying levels_rel, levels_op and
  levels
- Return the levels

c) Implementing dynamic levels in several checks

From the existing variety of different checks it should be clear that
there can be no generic way of enabling dynamic levels for *all*
checks. So we need to concentrate on those checks where dynamic
levels make sense, for example:

- CPU Load
- CPU Utilization
- Memory usage (?)
- Used bandwidth on network ports
- Disk IO
- Kernel counters like context switches and process creations

Those checks should get *additional* parameters for dynamic levels.
That way the current configuration of those checks stays compatible
and you can still impose an ultimate upper limit - regardless of the
dynamic computations. Some of those checks need to be converted from
a simple tuple to a dictionary based configuration in order to do
this. Here we must make sure that old tuple-based configurations are
still supported.

d) WATO rules for dynamic levels

When the user enables dynamic levels for a check, he needs to (or
better: can) specify many parameters, as we have seen. Those
parameters are mostly the same for all the different checks that
support dynamic levels, so we can create a helper function that makes
it easier to declare them in WATO rules. Checks that support upper
*and* lower levels will get two sets of parameters, because the logic
for upper and lower levels might differ. But they need to share the
same parameters for the prediction generation (slicing, etc.) so that
one prediction file per check is sufficient.

e) PNP Templates

Making the actually predicted levels transparent to the user is a
crucial point of the implementation. An easy way to do this is to
simply add the predicted levels as additional performance values. The
PNP templates of the affected checks need to detect the availability
of those values and add nice lines to the graph. This is easy - while
having one drawback: the user cannot see the predicted levels before
they are actually applied. The advantage, on the other hand, is that
if the parameters for the prediction are changed, the graph still
correctly shows the levels that were valid at each point in time in
the past.
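To make (e) a bit more concrete, here is a sketch of how a check
could emit the predicted levels as additional performance values.
This is an assumption about how it might look, not a finished
interface: the parameter keys "levels_upper" and "prediction" as well
as the performance value names are made up, and predict_levels() is
the helper prototyped in (b):

  def check_cpu_load_predictive(item, params, load):
      warn, crit = predict_levels("load1", "avg", "relative",
                                  params["levels_upper"],
                                  params["prediction"])
      perfdata = [
          ("load1", load, warn, crit),
          # the currently predicted levels as additional performance
          # values - to be detected and drawn by the PNP template:
          ("predict_load1_warn", warn),
          ("predict_load1_crit", crit),
      ]
      infotext = "load is %.2f (dynamic levels at %.2f/%.2f)" % \
                 (load, warn, crit)
      if load >= crit:
          return (2, "CRIT - " + infotext, perfdata)
      elif load >= warn:
          return (1, "WARN - " + infotext, perfdata)
      return (0, "OK - " + infotext, perfdata)

The PNP template would then simply check for the presence of
predict_load1_warn / predict_load1_crit and draw them as lines.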