Hi all!
It’s been a while since we had a post here about monitoring; the last one was back in ’09. As with that post, this one will be in English because of the large international monitoring community that can make use of it, but it can be successfully translated with Google Translate :)
Basically, I’m getting lazy. I want to monitor Windows services, and I don’t want to log onto every server to check which services may need monitoring (sometimes I don’t even get that info from the Windows Server guys, so I might miss something every once in a while). Also, if a service stops, it’d be smooth if I didn’t have to start that service myself, wouldn’t it? Now here comes a problem: what if there’s a service that is set to automatic but doesn’t start? And never starts, because it isn’t supposed to? It sure does sound stupid, but remember, this is a Microsoft platform; it isn’t made to be logical. Remember “Performance Logs”? Or, currently in 2008/2008 R2, the Windows Software Protection service?
There are a few scripts already out there that check automatic services, and really, you don’t need a script for that if you have the NSClient++ agent on your Windows server, since it has that function built in, including an exclude option (-c CheckServiceState -a CheckAll exclude= ). It doesn’t, however, attempt to start the service again. There are also scripts that attempt to start services, but they don’t include an exclude function.
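To make the NSClient++ call above a bit more concrete, here is a minimal Python sketch that only builds the check_nrpe command line. The host name, the excluded service name and the helper name build_check_command are made up for illustration, and passing one exclude= argument per service is an assumption about how your NSClient++ version wants it, so check that against your own agent.

```python
from typing import List


def build_check_command(host: str, excluded: List[str],
                        check_nrpe: str = "check_nrpe") -> List[str]:
    """Build the argument list for a CheckServiceState/CheckAll query.

    Assumption: each excluded service gets its own exclude=<name> argument.
    """
    cmd = [check_nrpe, "-H", host, "-c", "CheckServiceState", "-a", "CheckAll"]
    cmd += [f"exclude={name}" for name in excluded]
    return cmd


if __name__ == "__main__":
    # Hypothetical host and service name, purely for illustration.
    print(" ".join(build_check_command("winsrv01", ["sppsvc"])))
```

The resulting list can be handed to subprocess.run as-is, which avoids any shell quoting trouble with service names that contain spaces.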
The solution!
In Nagios we have event handlers that can deal with issues caught by a check, which is nice. However, I decided not to use an event handler for my script, basically because I found it so easy to do the restart in the check script itself and base it on the first output of check_nrpe … -c CheckServiceState. That is nice because it means we only have to ask the Windows server once. Did I mention that I love to reduce load as well? The only arguments we pass to the check are the services to be excluded.
I’m not a fancy programmer, so the script looks really ugly and messy, but it works like a charm. I can sit back and watch my OP5 Ninja interface detect a failed service, only to see the check come back from the soft alert with all services running before a notification has been sent out. Really sweet! The check always enters a soft Critical state if a service has been detected as not running, even if it was successfully restarted. This is because I want to be able to track down, in the alert history log, services that frequently stop and are automatically repaired.
As soon as I have time I will post the plugin to Nagios Exchange, but until then just give me your e-mail and I can send it to you.
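This isn’t the plugin itself, just a minimal Python sketch of the idea described above: find the stopped services in the check output, try to start each one, and still report Critical so the event shows up in the history. The "name: stopped" output format is an assumption about what CheckServiceState prints, and the restart callback is a hypothetical placeholder for whatever mechanism actually starts a service on the Windows side.

```python
import re
from typing import Callable, List, Tuple

OK, CRITICAL = 0, 2  # standard Nagios plugin exit codes

# Assumption: a stopped service shows up as "<name>: stopped" in the output.
STOPPED = re.compile(r"([^,:]+):\s*stopped", re.IGNORECASE)


def stopped_services(output: str) -> List[str]:
    """Return the names of all services reported as stopped."""
    return [name.strip() for name in STOPPED.findall(output)]


def handle(output: str, restart: Callable[[str], bool]) -> Tuple[int, str]:
    """Try to restart stopped services; Critical whenever any were found stopped."""
    stopped = stopped_services(output)
    if not stopped:
        return OK, "OK: all automatic services are running"
    # restart() is a hypothetical callback returning True on success.
    restarted = [name for name in stopped if restart(name)]
    failed = [name for name in stopped if name not in restarted]
    parts = []
    if restarted:
        parts.append("restarted: " + ", ".join(restarted))
    if failed:
        parts.append("still stopped: " + ", ".join(failed))
    # Critical even after a successful restart, so it lands in the alert history.
    return CRITICAL, "CRITICAL: " + "; ".join(parts)


if __name__ == "__main__":
    # Pretend only Spooler can be restarted; service names are examples.
    code, message = handle("Spooler: stopped, W32Time: stopped",
                           lambda name: name == "Spooler")
    print(code, message)
```

Because the restart succeeded, the next scheduled check should find everything running and go back to OK before the soft state turns hard, which is what keeps the notification from going out.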
This is how it can look in Ninja whilst in OK mode:

I love it and I hope you will find it useful as well :)