Obsessive monitoring

By taskme

A few years ago, a colleague introduced me to mrtg. Mrtg was originally designed to query SNMP routers to establish their bandwidth usage. With a few tweaks it can be configured to monitor anything that can be converted to a numeric value: disk space, number of processes running, temperatures, voltages and so on.

Mrtg has two major drawbacks:

  1. Vanilla mrtg can only monitor integer values.
  2. It is designed to work with two values only: in and out.

Cacti can do all that mrtg can, and much more. It is an absolute pig to configure: lots of non-intuitive settings, little logic and poor defaults. But it can make some nice graphs, with floating-point values and colours, and with a very high level of customisation.

With lm-sensors and apcupsd, not only can you monitor your network, but you can also monitor voltages, temperatures and much more.

Cacti can be extended by using your own scripts. If you can write a script for it, you can monitor it.

For example:

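#!/usr/bin/perl
# Display UPS data as space-separated NAME:value pairs
use strict;
use warnings;

# apcaccess fields to report
my @collect = ("LINEV","LOADPCT","BCHARGE","TIMELEFT","MAXLINEV","MINLINEV",
               "OUTPUTV","ITEMP","BATTV","LINEFREQ","LOTRANS","HITRANS");

foreach my $entry (`/sbin/apcaccess status`) {
  # Each status line looks like "LINEV    : 240.5 Volts"
  my ($line, $value) = split(/:/, $entry);
  next unless defined $value;   # skip lines without a colon
  chomp($value);
  foreach my $val (@collect) {
    if (index($line, $val) == 0) {
      # $value starts with a space, so field 0 is empty and field 1 is the number
      my @number = split / +/, $value;
      print "$val:$number[1] ";
    }
  }
}
print "\n";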

Typical output is:

LINEV:240.5 LOADPCT:18.7 BCHARGE:100.0 TIMELEFT:73.0 MAXLINEV:241.8 MINLINEV:239.2 OUTPUTV:240.5
   LOTRANS:196.0 HITRANS:253.0 ITEMP:37.8 BATTV:55.6 LINEFREQ:50.0

This is basically a list of name:value pairs. Cacti calls the script every five minutes, extracts the values from the string, and stores them in a round-robin database.
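To make the name:value format concrete, here is a minimal Perl sketch of how such a line can be split back into named fields. It is not Cacti's own parser; the sub name parse_pairs and the shortened sample line are assumptions made purely for illustration.

```perl
#!/usr/bin/perl
# Sketch only: split a "NAME:value NAME:value ..." line into a hash.
# parse_pairs is a hypothetical helper, not part of Cacti.
use strict;
use warnings;

sub parse_pairs {
  my ($line) = @_;
  my %fields;
  # split ' ' breaks on any run of whitespace and ignores leading blanks
  foreach my $pair (split ' ', $line) {
    my ($name, $value) = split /:/, $pair, 2;
    next unless defined $value && length $value;   # ignore malformed pairs
    $fields{$name} = $value;
  }
  return %fields;
}

# Sample line in the same format as the UPS script's output
my %ups = parse_pairs("LINEV:240.5 LOADPCT:18.7 BCHARGE:100.0 TIMELEFT:73.0");
foreach my $name (sort keys %ups) {
  print "$name = $ups{$name}\n";
}
```

Because each value carries its own name, the order of the pairs on the line does not matter.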

The different graphs are generally interesting over different time periods. Temperatures show daily and yearly fluctuations. Network usage seems pretty much random, although you can identify big downloads months afterwards.

Some interesting graphs are included below…

For example, free space on my /home drive:

home space usage

As you can see, I keep my home space small. This allows me to back it up relatively easily, although it is quite difficult to keep it so low. About a month ago, I gave up and added the rest of the available space on my MD RAID partition. It seems to bob along at about 5 gigabytes free, just enough space to download a Knoppix DVD image.
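The script behind a graph like this can be very small. The following is a minimal sketch rather than the script actually used here: it assumes the POSIX `df -Pk` output layout (one header line, then one data line with the available space in the fourth column), and the sub name free_kb_from_df and the field name free_kb are made up for illustration.

```perl
#!/usr/bin/perl
# Sketch only: report free space on /home as a NAME:value pair.
# free_kb_from_df and the field name free_kb are hypothetical.
use strict;
use warnings;

# Extract the "Available" column (in 1024-byte blocks) from `df -Pk` output.
sub free_kb_from_df {
  my ($df_output) = @_;
  return unless defined $df_output;
  my @lines = split /\n/, $df_output;
  return if @lines < 2;                # expect a header plus one data line
  my @cols = split ' ', $lines[1];     # Filesystem, blocks, Used, Available, ...
  return unless defined $cols[3] && $cols[3] =~ /^\d+$/;
  return $cols[3];
}

# Print nothing if df fails or its output is not in the expected layout
my $free = free_kb_from_df(scalar `df -Pk /home 2>/dev/null`);
print "free_kb:$free\n" if defined $free;
```

Keeping the parsing in its own sub means it can be checked against a saved `df` listing without touching the real filesystem.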

Another interesting one is the mains voltage.

Mains voltage over two years

Daily fluctuations in mains

The first graph is moderately interesting because of the step at the end of January. The second is less interesting, but you can see how much the voltage varies in a day.

For a few years, my UPS kept tripping out in the winter with “over voltage”. After some research I discovered that the electricity board has to provide electricity at 230v +10% or -6%, that is, between 216.2v and 253v. The over-voltage switch-over point for my UPS was at the same point that the incoming electricity became illegally high: 253 volts (the top red line on the graph).

It may have been that my UPS was a bit over-sensitive. The graphs only show the average voltage for each time interval, not the peak voltage, which is why they don’t appear to cross the upper red limit line.
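The gap between average and peak is easy to show with a few numbers. This is a minimal sketch with made-up readings, not data from my UPS; the sub name summarise is hypothetical, and the only figure taken from the text is the 230v +10% upper limit.

```perl
#!/usr/bin/perl
# Sketch only: an averaged data point can sit below a limit that an
# individual reading exceeded. All sample values here are made up.
use strict;
use warnings;
use List::Util qw(sum max);

# Return (average, peak) for a list of readings.
sub summarise {
  my @readings = @_;
  return (sum(@readings) / @readings, max(@readings));
}

my $nominal = 230;              # nominal UK supply voltage
my $upper   = $nominal * 1.10;  # +10% limit, i.e. 253 volts

# Hypothetical readings falling within one averaged interval
my @samples = (249.5, 250.1, 254.2, 250.4, 249.8);
my ($average, $peak) = summarise(@samples);

printf "upper limit %.1f, average %.1f, peak %.1f\n", $upper, $average, $peak;
print "the average stays under the limit\n" if $average < $upper;
print "the peak exceeds the limit\n"        if $peak > $upper;
```

A single reading above the limit is enough to trip the UPS, yet it barely moves the averaged point that ends up on the graph.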

Although the UPS was protecting my IT hardware and some other bits and bobs, I felt that the constant tripping would be reducing its life, as it was going onto battery several times a day. Any equipment not protected by the UPS would also be vulnerable to over-voltages, so I contacted the electricity company. They installed a line monitor for a week and confirmed that the voltage had gone over 252v twice in that period. So my UPS, despite complaining 4-6 times a day, did have a valid reason for complaint.

The downward step was caused when the electricity company dropped the local voltage by moving the supply tap one loop on the sub-station transformer. Although I thought electricity in the UK had to be supplied at 230v, most places, it seems, are still configured to run at the traditional 240v, despite the official change being made over 15 years ago.

As you can see, the voltage dropped by nearly 8 volts, and my UPS stopped complaining, so a positive result, and proof that you can get things changed for the better. Complaining works!

From my CPU fan speed monitor, you can see when I clean out the case.

Fan speed over two years

In December, a year and a half ago, the fan was so choked up that it was starting to fail. I was expecting to have to replace it, but the clean-out revitalised its fortunes. The next time, the following October, you can see a step as the fan turned more easily with less dust in it. I try to clean out the machine at least once a year.

The final graph is the 12v graph. It is quite boringly flat. I find it impressive that a power supply that must be over 8 years old, and in use 24/7, has managed to supply such a consistent voltage for at least 2 years.

Constantly boring
