Friday 11 July 2008

Tuning Nagios Load Checks

[See also: check_load initial values cheat sheet].

The standard Nagios plugins include a "check_load" command which will raise a warning or error if the load averages for the target machine exceed some threshold. A little while ago the ops manager and I were discussing what those thresholds should be.

The usage for the check_load command is as follows:
Usage: check_load -w WLOAD1,WLOAD5,WLOAD15 -c CLOAD1,CLOAD5,CLOAD15
Without looking at the source I'm pretty sure that the program is either just opening a pipe to uptime or using the /proc file system to read the load averages for the past 1, 5 and 15 minutes. It should be safe to assume, then, that Nagios' concept of load is exactly the same as uptime's and that the figures are ultimately coming from the kernel scheduler. (Note: Yup. :-) Just checked the source.)
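To see the raw figures for yourself on a Linux box, the three averages are the first three fields of /proc/loadavg. Here is a minimal Python sketch of reading them. It is only an illustration and not how check_load itself is written; the function names parse_loadavg and read_loadavg are made up for this example.

```python
def parse_loadavg(text):
    """Return the (1, 5, 15) minute load averages from /proc/loadavg text.

    A typical line looks like: "0.20 0.18 0.12 1/80 11206".
    Only the first three fields are load averages.
    """
    fields = text.split()
    if len(fields) < 3:
        raise ValueError("expected at least three fields, got: %r" % text)
    return tuple(float(field) for field in fields[:3])


def read_loadavg(path="/proc/loadavg"):
    """Read and parse the load averages (Linux only; path is an assumption)."""
    with open(path) as handle:
        return parse_loadavg(handle.read())
```

Calling read_loadavg() on a Linux machine should give the same three numbers that uptime prints.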

So the first question is: what does "load" actually measure?

Unix and Linux Load

Briefly, when Unix machines report their "load" (usually through uptime, top or w) they are reporting a weighted average of the number of processes either running or waiting for the CPU (Linux will also count processes that are blocked waiting on I/O). This average is calculated over 1, 5 and 15 minutes (hence the three values) based on values that are sampled every 5 seconds (on Linux at least). Dr Neil Gunther has written more than you might ever want to know about how those load averages are calculated and what they mean. It's an excellent series of articles (see also the inevitable Wikipedia article).
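The "weighted average" is an exponentially damped one: every 5 seconds the old figure is decayed a little and the current count of runnable processes is blended in. The sketch below is a simplified floating-point version in Python, assuming a 5 second sample interval and 1, 5 and 15 minute windows. The Linux kernel does the equivalent with fixed-point arithmetic, and the names update_load and simulate are invented for this example.

```python
import math

SAMPLE_INTERVAL = 5.0             # seconds between samples (Linux)
WINDOWS = (60.0, 300.0, 900.0)    # 1, 5 and 15 minutes, in seconds


def update_load(loads, active):
    """Blend one sample of `active` runnable processes into the averages."""
    updated = []
    for load, window in zip(loads, WINDOWS):
        decay = math.exp(-SAMPLE_INTERVAL / window)
        updated.append(load * decay + active * (1.0 - decay))
    return tuple(updated)


def simulate(active_counts, start=(0.0, 0.0, 0.0)):
    """Feed a sequence of samples through update_load and return the result."""
    loads = start
    for active in active_counts:
        loads = update_load(loads, active)
    return loads
```

For example, simulate([2] * 12) models one minute of two busy processes on an idle machine: the 1 minute figure climbs quickly while the 15 minute figure barely moves, which is why a short spike shows up mostly in the first number.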

So assuming we have a single-core CPU, a load value of "1.0" would suggest that the CPU has been 100% utilised over whatever reporting period that figure was calculated for. A load of "2.0" would mean that whenever one process had the CPU there was another that was forced to wait. However, if we have 2 cores, the same "2.0" load value would suggest that both processes got the CPU time they needed, while a load of "1.0" would suggest the CPU had only been at 50% capacity.

On a simple web server running a single 2-core CPU, a load average of "2.0, 1.0, 0.5" suggests that, over the last minute, the CPU has been 100% utilised; over the last 5 minutes it's been 50% utilised; and over the last 15 minutes it's been 25% utilised. Halve those values if 4 cores are available and double them if only one is in the system.
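That arithmetic is just load divided by cores. A small Python sketch, assuming every point of load is a process wanting CPU (which ignores the I/O-blocked processes Linux also counts); the function name utilisation_estimate is made up for this example.

```python
def utilisation_estimate(loads, cores):
    """Divide each load average by the core count.

    Returns fractions where 1.0 means "all cores busy"; values above 1.0
    mean processes were queueing for CPU.
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return tuple(load / cores for load in loads)


# The web server example: 2 cores, load average "2.0, 1.0, 0.5"
print(utilisation_estimate((2.0, 1.0, 0.5), 2))
```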

You can see, then, that sensible threshold values for warning and critical states require you to consider how many CPUs and CPU cores your system has. You're therefore probably going to want to set your thresholds per machine or at least set them differently for each different type of configuration.

For example, one of our Solaris boxes has 12 cores so a load of "6.0" is nothing to be concerned about. However that same load figure on another, single-core box might be worthy of a warning or even critical alert, depending on how sensitive we were to process queue lengths on that box. Except if that box is a Linux box with a lot of I/O and slow devices (like a tape drive) and is counting processes that are sitting idle and waiting for an I/O operation to finish. And what is the application running on it? Is it threaded? How is your kernel counting threads in that total -- or is it just counting processes?

Setting the Check_Load Thresholds

So determining an appropriate warning and critical set of threshold values for check_load will depend on what you think a reasonable process queue length will be; how your specific system treats threads; how your applications on that system behave (and their expected responsiveness levels); and how many CPUs / cores your system has. Oh -- and your performance targets or SLAs.

This is why experienced admins use a time honoured, complicated heuristic process to set an initial value and then continually adjust that value based on the correlation of alerts raised and actual performance and hence user impact.

In other words: we rub our bellies and take a guess and then change the values if we get too many or too few alerts. We're experienced sysadmins -- how much time do you think we have? :-)

In our case, for web servers, we decided that over 5 and 15 minute periods we expect spare capacity on the box -- but we only want to be alerted if the box is basically maxing out on CPU over a significant period. Over 1 minute we expect the occasional spike and don't really want an alert unless it's way beyond expectations. We're using Apache with no threading so 1 load point = 1 process using or waiting for CPU.

We've set the warning level for the 15 minute load average at the number of CPU cores times 2 (plus one!). For 5 minutes, increase the threshold by 5. For 1 minute, increase it by 5 again. The critical threshold for 15 minutes starts at 4 times the 15 minute warning level and then follows the same pattern for the 5 and 1 minute values.

Here's a sample nrpe.cfg config file for a web server with 2 cores:
command[check_load]=/path/check_load -w 15,10,5 -c 30,25,20
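If you have more than a couple of configurations to cover, the sums are easy to script. This Python sketch reproduces the sample line above for 2 cores. It assumes the 15 minute critical value is 4 times the 15 minute warning value, because that is what gives the 20 in the sample. The function names and the step and critical_factor parameters are made up for this example.

```python
def load_thresholds(cores, step=5, critical_factor=4):
    """Return (warning, critical) tuples ordered 1, 5, 15 minutes."""
    w15 = cores * 2 + 1                    # cores times 2, plus one
    warning = (w15 + 2 * step, w15 + step, w15)
    c15 = w15 * critical_factor            # assumption: 4 x the 15 min warning
    critical = (c15 + 2 * step, c15 + step, c15)
    return warning, critical


def nrpe_command(cores, plugin_path="/path/check_load"):
    """Format an nrpe.cfg command line for the given core count."""
    warning, critical = load_thresholds(cores)
    return "command[check_load]=%s -w %s -c %s" % (
        plugin_path,
        ",".join(str(value) for value in warning),
        ",".join(str(value) for value in critical),
    )


print(nrpe_command(2))
```

Treat the output as a starting guess only; the thresholds still need adjusting against real alerts.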
It's important to actually test this setup. Use ApacheBench or JMeter or a similar tool to get your load average up and test performance under those thresholds to see if it's acceptable. If your application is unacceptably slow from a user perspective at lower load values then lower your thresholds.

More Information

I've put together a little check_load cheat sheet that has some initial values for some common configurations. It might be a useful starting point if you're just starting to configure check_load in your Nagios environment.

[Note: This post has been edited since initial publication.]


Glen Barber said...

Very nice! I was playing with the check_load values for 2 days before actually utilizing my Google skills. **rubs belly**

hissohathair said...

Thanks Glen! Glad you found it useful.

Unknown said...

Thanks a lot! It was exactly what i needed to know.

Unknown said...

Great post I'm glad I found this link.

Unknown said...

Great post I'm glad I found this!

Elif said...

Indeed very useful. Thank you for writing this.

RyanAsh said...

Script to use instead of check_load. It will dynamically build the thresholds based on the formula in this blog post.

By Ryan Ash

use strict;
use Getopt::Std;

use vars qw(%info %opts);

getopts('hds:', \%opts) or USAGE();
USAGE() if ($opts{h});

if (-e "/proc/cpuinfo") {
    # Count the "processor : N" entries, one per core
    $info{numprocs} = `/bin/grep -c -e "^processor.*:" /proc/cpuinfo`;
    chomp($info{numprocs});
    print "DEBUG: found $info{numprocs} processors (cores)\n" if $opts{d};
    if ($info{numprocs} !~ /^\d+$/ || $info{numprocs} < 1) {
        print "ERROR: numprocs = $info{numprocs}\n";
        exit 3;    # Nagios UNKNOWN
    }
} else {
    print "ERROR: /proc/cpuinfo doesn't exist\n";
    exit 3;    # Nagios UNKNOWN
}

if ($opts{s}) {
    #ahh nothing here right now
}

#Default description:
# Warning
# ($numprocs x 2) + 1 = 15 min threshold (w1)
# 15 min threshold + 5 = 5 min threshold (w2)
# 5 min threshold + 5 = 1 min threshold (w3)
# Critical
# w1 x 4 = 15 min threshold (c1)
# c1 + 5 = 5 min threshold (c2)
# c2 + 5 = 1 min threshold (c3)

$info{w1} = ($info{numprocs} * 2) + 1;
$info{w2} = $info{w1} + 5;
$info{w3} = $info{w2} + 5;

$info{c1} = $info{w1} * 4;
$info{c2} = $info{c1} + 5;
$info{c3} = $info{c2} + 5;

print "DEBUG: \n\tw1 = $info{w1}\n\tw2 = $info{w2}\n\tw3 = $info{w3}\n\tc1 = $info{c1}\n\tc2 = $info{c2}\n\tc3 = $info{c3}\n" if $opts{d};

# check_load takes thresholds in 1,5,15 minute order.
# Note: this prints check_load's output but does not pass on its exit code.
print `/usr/local/nagios/libexec/check_load -w $info{w3},$info{w2},$info{w1} -c $info{c3},$info{c2},$info{c1}`;

sub USAGE {
    print "$0 usage
    -d debug
    -h help
    -s : This will be used to override our typical logic for determining the thresholds
";
    exit 3;    # Nagios UNKNOWN
}

Nikola Valentinov Petrov said...

Not sure if they added this after you wrote this post (it seems pretty old), but for future reference there is a -r flag in check_load which will divide the load by the number of CPUs. From the help page:

-r, --percpu
Divide the load averages by the number of CPUs (when possible)

Wonder Dog said...

Change the following lines to return the correct exit code

print `/usr/local/nagios/libexec/check_load -w $info{w3},$info{w2},$info{w1} -c $info{c3},$info{c2},$info{c1}`;

system ("/usr/local/nagios/libexec/check_load -w $info{w3},$info{w2},$info{w1} -c $info{c3},$info{c2},$info{c1}");
exit $? >>8;

Wonder Dog said...

output correct exit code

Change This:
print `/usr/local/nagios/libexec/check_load -w $info{w3},$info{w2},$info{w1} -c $info{c3},$info{c2},$info{c1}`;

To this:
system ("/usr/local/nagios/libexec/check_load -w $info{w3},$info{w2},$info{w1} -c $info{c3},$info{c2},$info{c1}");
exit $? >>8;

martin said...

1) As we understand it, with 2 CPUs a workload average of 2 means 100%.

2) We had set the setting as below
-w 15,10,5 -c 30,25,20

3) I could not understand: why would we set a value of 5, which is 250%, as only a warning?

I had thought a fair value is like 1.6?

hissohathair said...

Hi Martin

It depends on the performance of your applications or services. Tracking CPU load is helpful in understanding what's going on, but it's not directly what you care about. If your SLAs are being met and your users kept happy even at a CPU load of 250%, then why alert on that?

If, on the other hand, you know that once the CPU queue gets beyond a certain length then your response time will start to suffer then you probably DO want an alert. You should have other monitors checking response times and they should also be going off. Having the combination of "CPU load is high" and "Response times are slow" helps you respond quickly.

I guess in a nutshell I'm saying that CPU load is just one thing you're supposed to monitor -- but it's the system as a whole that you care about. Therefore, use your own judgement as to what works best for you & your systems.


hissohathair said...

Martin -- one more thing. Don't forget that on a Linux system the load figure includes processes waiting on I/O. So a load of "5" on a 2-CPU system doesn't necessarily mean your CPU is maxed out. You can cross-check that with a tool like top.

martin said...

Thanks hissohathair!

1) If we already monitor workload average, do you think we still need to monitor CPU utilization?

The reason I ask is that to put a threshold on CPU utilization, I can easily set warning at 70% and critical at 80%.

But for workload, if I need to tune it for my few hundred servers, it will be very difficult for me.

2) Is there a correlation between the value of workload average and CPU utilization?

i.e., if we hit a warning on workload average, does that mean CPU utilization will be at warning as well?

hissohathair said...

Hi Martin

I don't know your exact situation, but I'd hazard a guess that you don't need to monitor and alert on CPU utilisation. Think of it this way: if you received an alert at 3am that the CPU was 100% utilised, what would you do?

It might be more useful for you to monitor & log CPU utilisation -- but for capacity planning purposes more than alerting. So, for your questions:

1. Monitor & log CPU utilisation but don't alert on it. You may find it useful to alert on load but don't worry too much about tuning for each server.

2. Yes, CPU utilisation & load average are related. If load hits an alert threshold then CPU use is very likely high. The reverse is not always true, because load is a weighted average over 3 periods of time. So a CPU may 'spike' to 100% for a few seconds without necessarily affecting load averages too much. You almost certainly do not want to alert on CPU spikes!