
icinga2 check scheduler

Posted Fri 19 May 2017 12:09:37 PM CEST Florian Lohoff

For me icinga2 has been a pain in the ass since its release. My primary problem with icinga1 was the inflexibility in defining dependencies. A host could not depend on a service, which imho is broken. A host connected to a switch depends on the switch, and on the vlan/bridge on the switch. So i would like to define a dependency on the vlan check. Icinga1 could not do this, and neither can icinga2.

Now - one needs to make the transition one day or another, so i did. Then the next issue came up: Icinga2 does not include the embedded Perl interpreter anymore. This means you'll be fighting against a 10-fold increase in check runtime. The reasoning given is that it could not be made stable. I wonder why Apache is able to do this but not the Icinga2 devs. So i rewrote my most performance-sensitive checks from Perl in C++.

Lately i have been having trouble because on my larger network devices i am seeing flapping interfaces. That means - from time to time (multiple times a day) a bulk of interfaces goes into the "Unknown" state, which means my check could not get an SNMP response from the device.

This came up when i migrated the Icinga2 instance from a virtual machine to bare metal. I could not explain the symptom and started debugging on a firewall which had been installed in between the host and the network.

No luck so far.

Today i had another look because my collectd/influx/grafana host monitoring showed network traffic that looked like a heartbeat. Every hour the configuration export triggered a beat, which then swung out into a "null line", i.e. an average.

After investigating a bit more i found that reloading the icinga2 config wipes the check scheduling queue. This causes icinga2 to recalculate the schedule, which makes the check execution very spiky, with the spikes repeating every check_interval.
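To illustrate why a wiped queue leads to spikes, here is a minimal sketch in Python. It is not Icinga2's actual scheduling code; the function names and the numbers (600 checks, a 60 second interval) are made up for illustration. It compares a schedule where every check becomes due at the same moment with one where each check gets its own offset inside the interval.

```python
def burst_schedule(num_checks, check_interval):
    """All checks are due at offset 0, as one might see after a queue reset.

    Hypothetical model, not taken from the Icinga2 source.
    """
    return [0] * num_checks


def spread_schedule(num_checks, check_interval):
    """Give check i the offset i * check_interval // num_checks.

    This distributes the start times evenly over the interval.
    """
    return [i * check_interval // num_checks for i in range(num_checks)]


def peak_per_second(offsets, check_interval):
    """Return the largest number of checks starting within the same second."""
    buckets = [0] * check_interval
    for offset in offsets:
        buckets[offset % check_interval] += 1
    return max(buckets)


if __name__ == "__main__":
    n, interval = 600, 60  # assumed example values
    print("burst peak: ", peak_per_second(burst_schedule(n, interval), interval))
    print("spread peak:", peak_per_second(spread_schedule(n, interval), interval))
```

Because every check simply recurs one check_interval after its last run, a burst at one point in time shows up again at the same phase in every following interval unless something actively spreads it out.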

It seems the Icinga2 (in my case 2.6.3) check scheduler does a very bad job of spreading the check executions evenly over their check_interval.

This is a graph i created using some shell and gnuplot voodoo. The event at 9:36:44 is the config generation issuing a service icinga2 reload, which causes icinga2 to not issue/schedule checks for nearly a minute.
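The graph itself came from shell and gnuplot. The following Python sketch is only a rough equivalent of the counting step, assuming the check start times are already available as a list of epoch timestamps (where they come from is not covered here, and the function names are hypothetical). It counts checks per time bucket and keeps empty buckets, so a pause like the one after the reload shows up as a run of zeros.

```python
from collections import Counter


def checks_per_bucket(timestamps, bucket_seconds=10):
    """Count check executions per bucket, including empty buckets in between."""
    counts = Counter(int(ts) // bucket_seconds * bucket_seconds for ts in timestamps)
    if not counts:
        return []
    start, end = min(counts), max(counts)
    return [(t, counts.get(t, 0))
            for t in range(start, end + bucket_seconds, bucket_seconds)]


def to_gnuplot_data(rows):
    """Format rows as 'timestamp count' lines, one per bucket."""
    return "\n".join(f"{t} {n}" for t, n in rows)


if __name__ == "__main__":
    # Made-up sample: three checks, then a gap, then one more check.
    sample = [0, 1, 2, 35]
    print(to_gnuplot_data(checks_per_bucket(sample)))
```

The output is a plain two-column file of the kind gnuplot can plot with `using 1:2`.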