Happy Alarms

Configuring the alarms from a health monitoring system can be challenging. The idea is to create alarms that will get the operator’s attention, won’t get ignored, are sent to the appropriate parties, and are clear and unambiguous.

To make this even more complex, numerous systems use email to send out alarms in a distributed manner. There may be a portable backup storage box with configured alarms in a non-standard format that still needs to be handled by someone monitoring the network. To make matters worse, often such devices have email alarms configured to an individual’s email address. This can cause problems when there is turnover.

While working on this problem, I explored all manner of ways of indicating the status of a system.

Borrowing from the Six Sigma tool set, one scheme involved using a 1-3-9 scale for ranking the severity of an item. A 1-3-9 scale forces the ranking of severity into meaningful categories. A 1-10 scale or similar leaves room for ambiguity.

Many systems use the existing syslog “standards” for ranking the severity of messages. This had to be incorporated.

For example:

AUTH,EMERGENCY
GENERAL,CRITICAL
AUTH,INFO

It made sense to develop a scheme that would incorporate the syslog “standards” and the 1-3-9 scale, and that would provide unambiguous information on the severity of an alarm to someone who had never seen one before.
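As a rough illustration, here is a minimal Python sketch of how syslog-style severity keywords might be folded onto a 1-3-9 scale. The specific buckets (which keywords count as 9, 3, or 1) and the names `SYSLOG_TO_139` and `score_tag` are assumptions made for this example, not the mapping that was actually deployed.

```python
# Hypothetical mapping from syslog severity keywords to a 1-3-9 score.
# The bucket boundaries are an assumption for illustration only.
SYSLOG_TO_139 = {
    "EMERGENCY": 9, "EMERG": 9, "ALERT": 9, "CRITICAL": 9,
    "ERROR": 3, "WARNING": 3,
    "NOTICE": 1, "INFO": 1, "DEBUG": 1,
}


def score_tag(tag: str) -> int:
    """Return the 1-3-9 score for a 'FACILITY,SEVERITY' tag such as 'AUTH,INFO'."""
    facility, severity = tag.split(",")
    key = severity.strip().upper()
    if key not in SYSLOG_TO_139:
        raise ValueError(f"unknown severity {severity!r} for facility {facility!r}")
    return SYSLOG_TO_139[key]


if __name__ == "__main__":
    for tag in ("AUTH,EMERGENCY", "GENERAL,CRITICAL", "AUTH,INFO"):
        print(tag, score_tag(tag))
```

With only three possible scores, every alarm has to land in one of three clearly separated buckets, which is the point of the 1-3-9 scale.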

A number of distribution lists were created based on target groups.

The following are some examples:

ALL_ALL_CRITICAL
ALL_MGMT_CRITICAL
DEVELOPERS_MGMT_CRITICAL
DEVELOPERS_MGMT_INFO
OPERATIONS_STAFF_INFO
SECURITY_STAFF_EMERG
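The list names above appear to follow a three-part pattern, which can be sketched in Python as below. Reading the parts as group, audience, and severity is my interpretation of the examples, and the helper names `parse_list_name` and `lists_for` are hypothetical.

```python
from typing import List, NamedTuple


class ListName(NamedTuple):
    group: str      # e.g. DEVELOPERS, OPERATIONS, ALL (assumed meaning)
    audience: str   # e.g. MGMT, STAFF, ALL (assumed meaning)
    severity: str   # e.g. CRITICAL, INFO, EMERG


# The example lists from the text.
LISTS = [
    "ALL_ALL_CRITICAL",
    "ALL_MGMT_CRITICAL",
    "DEVELOPERS_MGMT_CRITICAL",
    "DEVELOPERS_MGMT_INFO",
    "OPERATIONS_STAFF_INFO",
    "SECURITY_STAFF_EMERG",
]


def parse_list_name(name: str) -> ListName:
    """Split a GROUP_AUDIENCE_SEVERITY list name into its three fields."""
    group, audience, severity = name.split("_")
    return ListName(group, audience, severity)


def lists_for(group: str, severity: str) -> List[str]:
    """Return the lists that would receive an alarm for this group and severity."""
    selected = []
    for name in LISTS:
        parsed = parse_list_name(name)
        if parsed.group in (group, "ALL") and parsed.severity == severity:
            selected.append(name)
    return selected


if __name__ == "__main__":
    print(lists_for("DEVELOPERS", "CRITICAL"))
```

Routing by list name rather than by an individual's address also sidesteps the turnover problem mentioned earlier.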

The last thing was designing the actual messages. It was decided that it would be important to specify fields in emails in the event that automated processing / parsing systems would have some role in reviewing messages from distributed systems in the future.

Here is a sample message:

“PROBLEM: sw3.local.X.com
Interface(10125) inside is Down at least 2 min on Switch: sw3.local.X.com (192.168.10.X).
Details:
Monitors that are down include: Interface(10125) inside Monitors that are up include: Ping,SNMP,HTTP,Telnet,Interface(1) Vlan1,Interface(100) Vlan100 (192.168.10.253),Interface(5010) Port-channel10,Interface(5011) Port-channel11,Interface(5015) Port-channel15,Interface(5016) Port-channel16,Interface(10101) dmz,Interface(10118) Inside – Alltel,Interface(10127) inside,Interface(10131) inside,Interface(10133) prd-003-vmi4, Channel-Group 10,Interface(10134) prd-003-vmi4, Channel-Group 10,Interface(10135) prd-004-vmi4, Channel-Group 11,Interface(10136) prd-004-vmi4, Channel-Group 11,Interface(10145) GigabitEthernet0/45,Interface(10146) sw-1 dmz trunking port,Interface(10147) sw2 inside trunking port,Interface(10148) storage trunking port,Interface(10501) Null0,”
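To show why fixed fields matter for automated processing, here is a small Python sketch that pulls the status, host, and failing monitors out of a message shaped like the sample. The regular expressions and the function name `parse_alarm` are assumptions based only on that one sample, so a real parser would need to be checked against more messages.

```python
import re
from typing import Dict, Optional

# First line of the message: "<STATUS>: <host>"
HEADER = re.compile(r"^(?P<status>[A-Z]+):\s*(?P<host>\S+)$")

# Text between the "down" and "up" markers in the Details section.
DOWN = re.compile(
    r"Monitors that are down include:\s*(?P<down>.*?)\s*"
    r"Monitors that are up include:"
)


def parse_alarm(text: str) -> Dict[str, Optional[str]]:
    """Extract status, host, and the list of down monitors from an alarm email body."""
    lines = text.strip().splitlines()
    header = HEADER.match(lines[0].strip())
    if header is None:
        raise ValueError("first line is not in 'STATUS: host' form")
    down = DOWN.search(text)
    return {
        "status": header["status"],
        "host": header["host"],
        "down": down["down"] if down else None,
    }


if __name__ == "__main__":
    sample = (
        "PROBLEM: sw3.local.X.com\n"
        "Interface(10125) inside is Down at least 2 min on Switch: "
        "sw3.local.X.com (192.168.10.X).\n"
        "Details:\n"
        "Monitors that are down include: Interface(10125) inside "
        "Monitors that are up include: Ping,SNMP,HTTP,Telnet\n"
    )
    print(parse_alarm(sample))
```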

This system has been in place for some time and seems to work well.

I kept thinking about this and wondered whether there was still a way to make the alarms register, so that people would react to them a bit better.

As I was thinking about this, I was shocked to discover that one of the Exchange Servers had become self-aware. Not one to waste an opportunity, I asked it about additional ways to improve this process. It reminded me that humans have emotions, and suggested that another way to improve the Health Monitoring system was to associate emotion with the status of various alarms.

So instead of saying that the DISK on Server A is RESTORE, we might say “Server A is relieved that its disk was replaced before a total system crash!”.

This self-aware Exchange Server, which we have now dubbed Fred, has a weird sense of humor.
