Blogposts tagged icinga2

What's wrong with Icinga2

2018-09-25T09:10:02Z

I have always had my problems with Icinga2. I kept colliding with the rather strict coupling between services and hosts (for me, monitoring, and dependencies in particular, form a graph), and I kept hitting the limits of dependency modelling. Yes, I know, dependencies are not commonly used in monitoring. 95% of users might get away without a single explicit dependency.

Some migrations from Icinga1 simply failed because of the missing mod_perl/embperl support, which made a simple migration on the same hardware impossible. I built some hacks that abuse the Apache on localhost and its mod_perl support to run Icinga2 checks, but those are hacks.

Today I think I found the issue with Icinga2. It began with an EOL/EOS announcement for Icinga1, to which I replied on Twitter that Icinga2 is not a drop-in replacement and that there are technical issues which may keep some Icinga1 installations running forever. I don't buy the argument that it is impossible to embed Perl, Lua, Python or PHP into Icinga2. Postgres does it, and so do Apache and Nginx. All of them allow, one way or another, running code in embedded languages. Yes, these interpreters may leak memory, and they may not be as easy to embed as Lua, but it is doable and other projects have been doing it for years. Pointing this out earned me a polite "Fuck off". I guess I now know what's wrong here.

Michael Friedrich @dnsmichi, 14 hours ago

JFYI - #Icinga 1.x is EOL and support ends in 3 months. This includes Classic UI.

Plan your migration to Icinga 2 & Icinga Web 2.

Florian Lohoff @fl0h0ff, replying to @dnsmichi

I guess some instances are there to stay for another 10 years. I know some of them. And i can feel their pain. Icinga2 is not a drop in replacement and i know at least 3 instances who cant simple change because of performance issues caused by modperl missing.

Michael Friedrich @dnsmichi, 2 hours ago, replying to @fl0h0ff

Embedded Perl is a technical No-Go, similar to embedded Python. The implementation in Icinga 1.x for Perl has plenty of bugs and memory leaks - it may have worked, but there's room for changes.

Icinga 2 is here for four years now, with a helping community on migration tasks.

Florian Lohoff @fl0h0ff, replying to @dnsmichi

Interestingly the webservers have solved these issues for years ... Either by separation and limiting the number of executions or factoring out into fpm helpers. And no ... The community does not help porting thousands of lines of code over from perl.

Michael Friedrich @dnsmichi, 19 minutes ago, replying to @fl0h0ff

Sarcasm won‘t help much, goodbye.

icinga2 check scheduler

2017-05-19T10:30:28Z

For me, Icinga2 has been a pain in the ass since its release. My primary problem with Icinga1 was the inflexibility in defining dependencies. A host could not depend on a service, which imho is broken. A host connected to a switch depends on the switch, and on the VLAN/bridge on the switch. So I would like to define a dependency on the VLAN check. Icinga1 could not do this, and neither can Icinga2.

Now, one has to make the transition one day or another, so I did. Then the next issue came up: Icinga2 no longer includes the embedded Perl interpreter. This means you will be fighting a tenfold increase in check runtime. The reasoning given is that it could not be made stable. I wonder why Apache is able to do this but the Icinga2 devs are not. So I rewrote my most performance-sensitive checks from Perl in C++.

Lately I have been having trouble because I am seeing flapping interfaces on my larger network devices. That is, from time to time (multiple times a day) a batch of interfaces goes into the "Unknown" state, which means my check could not get an SNMP response from the device.

This came up when I migrated the Icinga2 instance from a virtual machine to bare metal. I could not explain the symptom and started debugging a firewall which had been installed in between the host and the network.

No luck so far.

Today I had another look because my collectd/influx/grafana host monitoring showed network traffic that looked like a heartbeat. Every hour, exporting the configuration produced a beat which then settled into a flat line, i.e. an average.

After investigating a bit more, I found that reloading the Icinga2 config wipes the check scheduling queue. Icinga2 then recalculates the schedule, which leaves check execution very spiky, with the spikes repeating every check_interval.

It seems the Icinga2 check scheduler (2.6.3 in my case) does a very poor job of spreading check executions evenly over their check_interval.

This is a graph I created using some shell and gnuplot voodoo. The event at 9:36:44 is the config generation issuing a service icinga2 reload, which causes Icinga2 to not schedule any checks for nearly a minute.
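To illustrate what "averaging out" would mean here, this is a minimal sketch of spreading the first run of each check evenly across one interval. It is not how Icinga2 schedules checks; the function name spread_offsets and its parameters are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not Icinga2 code: returns the start offset in seconds
// for each of `count` checks so that they are spaced evenly across one check
// interval instead of all being started at the same moment after a reload.
std::vector<double> spread_offsets(std::size_t count, double interval_seconds) {
    assert(interval_seconds > 0.0);
    std::vector<double> offsets;
    offsets.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        // Check i starts at the fraction i/count of the interval.
        offsets.push_back(interval_seconds * static_cast<double>(i) /
                          static_cast<double>(count));
    }
    return offsets;
}
```

With 4 checks and a 300 second interval the offsets are 0, 75, 150 and 225 seconds, so each check still runs once per interval but the load is flat rather than bunched into one spike.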

WWW::Mechanize

2017-03-21T20:08:39Z

After half an hour of fumbling around with LWP::UserAgent, HTTP::Request and HTML::TreeBuilder, trying to pick apart the Roundcube login page so that the monitoring gets past the cross-site scripting protection, I stumbled over WWW::Mechanize ... Done after 11 lines. It can be that simple. Add a bit of Nagios::Plugin around it and the check is finished ...

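use WWW::Mechanize;

# $np is the Nagios::Plugin object created earlier (not shown here)
my $mech = WWW::Mechanize->new();
$mech->get($np->opts->uri);

# Fill in and submit the first form on the page (the Roundcube login form)
my $r = $mech->submit_form(
                form_number => 1,
                fields => {
                        _user => $np->opts->username,
                        _pass => $np->opts->password
                }
        );

# HTTP status codes are numbers, so compare numerically
if ($mech->status() != 200) {
[ ... ]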

Random number generator

2017-03-07T14:28:56Z

It is remarkable how you can trip yourself up and then not find the cause for months. For the monitoring with Icinga2 I wrote an interface check in C++ that polls all sorts of things, i.e. not just the standard ifAdminStatus or ifOperStatus but also ifIn/OutPackets, Bytes, Errors, Discards etc. Thresholds on these are then monitored. In addition I poll the optical levels of the SFPs, i.e. rxPower, txPower, Current and Temperature.

All of this then flows via an InfluxDB into a Grafana so that it can be looked at in an appealing way.

Unfortunately, ever since the move from a VM to real hardware I had the problem that measurements were sporadically missing or sometimes broken, i.e. clearly wrong. The error message I saw, however, was "SNMP Timeout", as if the device were not answering.

So I had a look with tcpdump, and no: all requests are answered. After a detour through the SNMP library snmp++, in which the timeout handling turned out to be broken as well, I was even more at a loss than before.

Until today, when I added debug logging and found that at the moment the SNMP timeouts occur, all requests for the device carry an identical RequestID.

The RequestID is typically used so that the target can recognise retransmits. That is, when it receives an SNMP request, it generates the response and puts that response into a cache together with the RequestID. If a retransmit then arrives, the request is answered with the response from the cache to save a few CPU cycles.

If different interfaces and EntitySensors are now polled, all with an identical RequestID, then of course arbitrary answers come back, depending on which request was processed first.

If the number of variable bindings does not match, the response is of course discarded. But if the response does fit, i.e. it is also an interface or also an EntitySensor, the data gets jumbled. TengigabitEthernet0/0/1/0 data then shows up on TengigabitEthernet0/2/3/2 ... All of it completely unpredictable and arbitrary.
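The effect can be reproduced with a toy model. The sketch below is not a real SNMP agent and not snmp++; ToyAgent and handle are invented names, and the "response" is just a string derived from the requested OID. It only shows how a retransmit cache keyed by RequestID hands back the wrong answer once two different requests share an ID.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

// Toy model (an assumption for illustration): an agent that caches each
// response under its request ID and answers any later request with the same
// ID from that cache, because it takes it for a retransmit.
class ToyAgent {
public:
    std::string handle(std::uint32_t request_id, const std::string& oid) {
        assert(!oid.empty());
        auto cached = cache_.find(request_id);
        if (cached != cache_.end()) {
            // Same ID seen before: replay the cached response, whatever
            // OID the new request actually asked for.
            return cached->second;
        }
        std::string response = "value-of-" + oid;
        cache_.emplace(request_id, response);
        return response;
    }

private:
    std::unordered_map<std::uint32_t, std::string> cache_;
};
```

Asking this agent for "ifInOctets.1" and then "ifInOctets.2" with the same request ID returns the value of "ifInOctets.1" both times, which is the mix-up described above.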

The solution to the riddle was that the random number generator has to be initialised. In reality it is only a "pseudo random number generator" (PRNG), which produces the same sequence of random numbers for a given seed.

If you initialise the random number generator with "time(0)", all processes started within the same second naturally end up with identical random numbers. There is a lack of entropy.

The quick fix was

std::srand(std::time(0)^getpid());

This uses not just the time but an XOR of the time and the process ID. Still not optimal, but at least it fixes the problem of processes started at the same time using identical random numbers.
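As a possible next step beyond the quick fix, here is a sketch that mixes a high-resolution clock, the process ID and std::random_device into the seed and uses std::mt19937 instead of std::rand. This is not the code from my check; make_seed, first_request_id and check_same_seed_collides are names made up for the example, getpid() assumes a POSIX system, and std::random_device is not guaranteed to be non-deterministic on every platform.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <random>
#include <unistd.h>  // getpid(), POSIX only

// Hypothetical helper: combines three sources so that processes started in
// the same second are very unlikely to end up with the same seed.
std::uint32_t make_seed() {
    const auto ticks =
        std::chrono::steady_clock::now().time_since_epoch().count();
    std::random_device rd;
    return static_cast<std::uint32_t>(ticks) ^
           (static_cast<std::uint32_t>(getpid()) << 16) ^
           rd();
}

// Returns the first number a generator seeded with `seed` produces,
// standing in for the first request ID a freshly started check would use.
std::uint32_t first_request_id(std::uint32_t seed) {
    std::mt19937 gen(seed);
    return static_cast<std::uint32_t>(gen());
}

// The failure mode from the post: two processes that share a seed
// draw the same first request ID.
void check_same_seed_collides() {
    assert(first_request_id(1488896936u) == first_request_id(1488896936u));
}
```

A check would call make_seed() once at startup and keep a single std::mt19937 for all its request IDs.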

icinga2, embedded perl and perlcc

2016-01-27T15:35:52Z

Put a bit more steam into Icinga2 today. Added around 700 interfaces and the host promptly blew up.

The solution to the riddle: my home-grown Perl checks kill the machine, with a load of > 100. Icinga2 apparently no longer supports an embedded Perl interpreter, i.e. every check starts a new Perl interpreter. 700 interfaces every 5 minutes makes 2.3 checks/second. That does not actually sound too wild, but it does not work.
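The arithmetic behind the 2.3 checks/second, as a small sketch. Both function names are made up for this example, and the per-check runtime in the second function is a free parameter, not a number I measured.

```cpp
#include <cassert>

// Average number of check executions per second when `services` checks
// each run once per `check_interval_seconds`.
double checks_per_second(unsigned services, double check_interval_seconds) {
    assert(check_interval_seconds > 0.0);
    return static_cast<double>(services) / check_interval_seconds;
}

// Average number of checks running at the same time, given the start rate
// and how long a single check takes (rate multiplied by runtime).
double average_concurrent_checks(double rate_per_second, double runtime_seconds) {
    assert(runtime_seconds >= 0.0);
    return rate_per_second * runtime_seconds;
}
```

checks_per_second(700, 300.0) gives roughly 2.33. The second function shows why the rate alone says little: the longer each freshly started Perl interpreter takes, the more of them run in parallel.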

A quick idea of using perlcc to at least relieve the Perl check of the parsing load came to nothing. perlcc was removed with Perl 5.10 because it was unmaintained.

This will probably come down to writing the checks in C++.

icinga2 and service dependencies

2016-01-11T17:56:48Z

So, after having been busy here introducing new configuration and inventory tools, and with the structured configuration backups going live last week, the monitoring with service dependencies for subinterfaces etc. is now running too ... It pays off to go straight to Icinga2 because a lot about the checks is solved much more nicely. With that, something like this now drops out automatically:

object Service "IF Bundle-Ether5" {
        import "generic-service"
        host_name = "cr-141"
        check_command = "customif"
        vars.ifname = "Bundle-Ether5"
        vars.sla = "24x7"
}

object Service "IF Bundle-Ether5.4090" {
        import "generic-service"
        host_name = "cr-141"
        check_command = "customif"
        vars.ifname = "Bundle-Ether5.4090"
        vars.sla = "24x7"
}

object Dependency "cr-141-cb7aee14377f35ccb58d46d9c30528e3-cr-141-cb7aee14377f35ccb58d46d9c305451c" {
        parent_host_name = "cr-141"
        child_host_name = "cr-141"

        parent_service_name = "IF Bundle-Ether5"
        child_service_name = "IF Bundle-Ether5.4090"
}
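How such a Dependency object could be generated is sketched below. This is not my actual generator: parent_interface and make_dependency are invented names, the rule "everything before the last dot is the parent interface" is an assumption about the naming scheme, the object name is a plain concatenation rather than the hashes shown above, and quotes inside names are not escaped.

```cpp
#include <cassert>
#include <string>

// Assumed naming rule: a subinterface like "Bundle-Ether5.4090" belongs to
// the interface named by everything before the last dot. Returns an empty
// string if the name contains no dot.
std::string parent_interface(const std::string& ifname) {
    const std::string::size_type dot = ifname.rfind('.');
    return dot == std::string::npos ? std::string() : ifname.substr(0, dot);
}

// Emits an Icinga2 Dependency object between two services on the same host.
std::string make_dependency(const std::string& host,
                            const std::string& parent_service,
                            const std::string& child_service) {
    assert(parent_service != child_service);
    // Hypothetical object name; any string unique per dependency would do.
    const std::string name = host + "-" + parent_service + "-" + child_service;
    std::string out;
    out += "object Dependency \"" + name + "\" {\n";
    out += "        parent_host_name = \"" + host + "\"\n";
    out += "        child_host_name = \"" + host + "\"\n";
    out += "\n";
    out += "        parent_service_name = \"" + parent_service + "\"\n";
    out += "        child_service_name = \"" + child_service + "\"\n";
    out += "}\n";
    return out;
}
```

For the example above, make_dependency("cr-141", "IF " + parent_interface("Bundle-Ether5.4090"), "IF Bundle-Ether5.4090") produces a block with the same four attributes as the generated config.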

Network Management in a Box

2016-01-05T19:16:07Z

The new network management is slowly taking shape here. After 4 weeks:

Central git with gitolite + Gitweb
Single credential via LDAP
Inventory and configuration based on CouchDB + JS frontend
Config backup of all devices into git
Icinga2 for monitoring

You can check the whole thing out of git, start it, and after 10-15 minutes you have the whole lot up and running again.

Tomorrow I will do the rest of Icinga and then move on to the performance management backend so that Cacti can go away. Rewriting all this stuff from scratch has its advantages too. I almost feel like building it as a LiveCD.