Sometimes, a problem can seem really hard, but turn out to be rather simple if approached from a different perspective. The application team I support has a poorly written application running on a server. This should sound familiar to most system admins out there.
The application crashed the other day, but none of its Windows services actually stopped, and there were no event log errors to go off of either. So how do we monitor it? One option might have been to watch its TCP port, but nobody seemed to know which port it used. Digging into the application, we found a small scripting engine that let us run some basic scripts.
The application team's first idea was to write an event log entry saying everything was OK, and to raise an alert whenever that entry stopped appearing. SCOM can monitor for missing events, but that approach seemed destined for error.
What we settled on instead was having the program simply drop a file into the temp directory every 30 minutes, always with the same name. So now what? I wrote a small, simple batch file that checks for the file and deletes it if it is there. If the file is missing, the batch file reports that the application has stopped.
IF EXIST C:\TEMP\running.log GOTO Good
REM No heartbeat file: log an error for SCOM to alert on
EVENTCREATE /T ERROR /ID 333 /L APPLICATION /D "Custom Application Failed"
GOTO :EOF
:Good
DEL /Q C:\TEMP\running.log
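To try the check by hand instead of waiting for a real failure, a sequence like the following should exercise both branches. It is a sketch that assumes the checker is saved as C:\Scripts\checkapp.bat (a hypothetical path) and is run from an elevated command prompt, since EVENTCREATE may need administrator rights to write to the Application log.

```bat
REM Manual test of the heartbeat check; C:\Scripts\checkapp.bat is a hypothetical path
REM 1. Simulate a healthy heartbeat: the check should quietly delete the file
ECHO %DATE% %TIME% > C:\TEMP\running.log
CALL C:\Scripts\checkapp.bat
REM 2. Run again with no heartbeat file: the check should log event ID 333
CALL C:\Scripts\checkapp.bat
REM 3. Show the newest event with ID 333 from the Application log
wevtutil qe Application /q:"*[System[(EventID=333)]]" /c:1 /rd:true /f:text
```

If step 3 prints the "Custom Application Failed" entry, the SCOM event monitor has something to match on.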
I then set up a scheduled task to run this batch file every 30 minutes. When the file went missing, the script wrote the error to the event log. From there, all that was left was to set up an event monitor in System Center to catch the event and raise an alert.
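The scheduled task can also be created from the command line with schtasks. This is a sketch: the task name CustomAppHeartbeat and the script path C:\Scripts\checkapp.bat are hypothetical placeholders, and running the task as SYSTEM is one choice among several. Creating a SYSTEM task requires an elevated prompt.

```bat
REM Run the heartbeat check every 30 minutes as the SYSTEM account
REM Task name and script path below are hypothetical placeholders
schtasks /Create /TN "CustomAppHeartbeat" /TR "C:\Scripts\checkapp.bat" /SC MINUTE /MO 30 /RU SYSTEM
```

Because the application and the task both run on 30-minute cycles, a check could land just before the file is rewritten if the two schedules drift together, reporting a healthy application as down. Giving the task a slightly longer interval, for example /MO 35, should leave at least one heartbeat write between consecutive checks.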