self-diagnosing computers send alphanumeric beeper messages to staff

Author: Stuart Cracraft (cracraft_at_rice-chex.ai.mit.edu)
Date: Thu 13 Apr 1995 - 01:17:38 MET DST


I have designed a computer-alert system (called MISalert) running at a
large Unix site. It self-diagnoses a number of Unix computer systems and
sends alphanumeric pages to MIS staff beepers whenever red flags occur:
dead daemons, important services down, networks unavailable, hosts
down, databases unavailable, processes hogging the system, filesystems
filling up, etc.

Recently my boss gave me permission to let the net know of it. MIS
staff like it because they can now spend their time on high-level
design, project planning, and "real work". They no longer have to wade
monk-like through reams of reports or watcher-style output, stay
constantly "on edge" waiting for the next embarrassing e-mail from an
upset user, or learn ridiculously complex "observer" systems just to
keep a complex site in order. Efficiency per staff member has
increased since the introduction of this tool in 1994.

MISalert can be especially effective for organizations with only a
small staff and a large number of computers or services provided to
the user base. It would also very likely be effective for sites with a
large staff that must rotate responsibility for a daily "hot seat" or
"system help desk", since the transport layer it uses permits
call-schedule times for on-call people.

An "agent" is a small piece of code that checks for a condition which
an MIS staff person would otherwise have to be paid to check for.
Instead, staff can now be paid to fix more of these problems and to do
other, higher-level things than scanning for errors. New agents can be
written rapidly and require only slight familiarity with Perl; your
on-site Perl person should be able to handle this. Right now, the site
I mentioned has about 15 agents (see below) and a granularity of 15
minutes for wakeups.
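
As a rough illustration of how small an agent can be, here is a sketch of
a tmp-directory permission check. It is written in Python rather than the
Perl that MISalert uses, it is not MISalert's actual code, and the function
name, the expected mode, and the "return a message or None" convention are
all assumptions made for the example.

```python
import os
import stat

# Assumed "healthy" mode for a shared tmp directory:
# world-writable with the sticky bit set.
EXPECTED_TMP_MODE = 0o1777

def check_tmp_permissions(path="/tmp"):
    """Hypothetical agent: return an alert string, or None if all is well."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode != EXPECTED_TMP_MODE:
        return "%s has mode %o, expected %o" % (path, mode, EXPECTED_TMP_MODE)
    return None
```

An agent in this style has nothing to say when things are fine, and a short
pager-sized message when they are not.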

Writing a typical agent takes 10 minutes, including testing. It is
far, far easier than coding an agent for Sun's SunNet Manager or
other similar systems. Basically, anything you can track with standard
existing Unix tools can be tracked using MISalert agents, the difference
being that MISalert runs everything and reports it elegantly to your
pager with minimal effort on your part.

Agents can be configured, based on type or class, for transmission to
beeper or via e-mail. To keep beeper activity (and charges) low,
high-volume alerts such as CPU or disk conditions are typically
configured as e-mails, with everything else configured as beeps.
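
The class-based routing described above could be sketched like this (in
Python, purely for illustration). The table layout and the names ROUTES,
DEFAULT_CHANNEL and channel_for are hypothetical, not MISalert's real
configuration format.

```python
# Hypothetical routing table: agent class -> delivery channel.
# High-volume classes go to e-mail to keep pager traffic and charges down.
ROUTES = {
    "cpu": "email",
    "disk": "email",
}

# Anything not listed above is sent as a beep.
DEFAULT_CHANNEL = "beeper"

def channel_for(agent_class):
    """Return the delivery channel configured for an agent class."""
    return ROUTES.get(agent_class, DEFAULT_CHANNEL)
```

With a table like this, moving a noisy class from beeper to e-mail is a
one-line change.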

A master log with a timestamp for each alert is maintained by MISalert.
The system can turn itself off and on at specific times depending
upon MIS availability and commitments to your overall organization.
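
A minimal sketch of those two ideas, a timestamped log line and an on/off
time window, might look as follows in Python. The window times, the log
line layout, and the function names are assumptions for the example; the
real log format is not shown here.

```python
from datetime import datetime, time

# Assumed alerting window; a real site would pick its own hours.
ACTIVE_START = time(7, 0)
ACTIVE_END = time(19, 0)

def is_active(now):
    """True if alerts should be sent at the given datetime."""
    return ACTIVE_START <= now.time() < ACTIVE_END

def log_line(now, host, agent, message):
    """Format one master-log entry with a leading timestamp."""
    stamp = now.strftime("%Y-%m-%d %H:%M:%S")
    return "%s %s %s: %s" % (stamp, host, agent, message)
```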

Expansion to hundreds of agents and a shorter granularity is possible.
The system is efficient in that expensive statistics gathering is done
once per pass, and, of course, because of Perl.
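
The "gather once per pass" idea can be sketched as below: one expensive
collection step, whose result is handed to every agent. This is Python for
illustration only; the stats dictionary layout, the watermark values, and
all function names are assumptions, not MISalert internals.

```python
HIGH_LOAD = 8.0      # assumed load-average watermark
HIGH_DISK_PCT = 90   # assumed filesystem watermark, percent used

def load_agent(stats):
    """Alert if the load average exceeds the watermark."""
    if stats["load"] > HIGH_LOAD:
        return "load average %.1f" % stats["load"]
    return None

def disk_agent(stats):
    """Alert on any filesystem above the watermark."""
    full = [fs for fs, pct in sorted(stats["disk_pct"].items())
            if pct > HIGH_DISK_PCT]
    if full:
        return "filesystems over %d%%: %s" % (HIGH_DISK_PCT, ", ".join(full))
    return None

def run_pass(gather, agents):
    """Call the expensive gather() once, then let every agent inspect it."""
    stats = gather()
    alerts = []
    for agent in agents:
        message = agent(stats)
        if message is not None:
            alerts.append(message)
    return alerts
```

Because agents only read the shared stats, adding more of them adds little
cost to a pass.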

Current agents are:

# Tape devices: no tape
        Complains if no tapes are loaded in tape units for day's backup
# Tmp directory
        Checks permissions of tmp directory
# Systems down
        Reports if other systems are down. Cross-checked by all hosts
# Link to Internet down
        Checks if Internet link is down
# High load averages
        When load average goes above a high-watermark, this raises a flag.
# Dead WordPerfect or Lotus 1-2-3 daemons
        Or other standard daemons, so that users can always depend on
        third-party software that relies on lmgrd-based license daemons
        being available
# Financials production
        Oracle database financials production confirmed running
# Production company database
        Oracle database regular company production database confirmed running
# Specialized line printer daemons not running
        Various lp daemons
# Line printer daemon not running
        Standard Berkeley daemons
# Fax server
        Check that fax services are up
# Check disk
        Ensure that disk filesystems don't go above a certain high watermark
# XDM server
        Ensures X processes are up
# Runaway processes
        Flags any user or system processes above a high watermark in
        terms of system utilization
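
Several of the agents above boil down to "is this daemon still running?".
A hedged Python sketch of that comparison follows. How the list of running
process names is obtained (for example from ps output) is left out, and
the function name is hypothetical.

```python
def missing_daemons(process_names, required):
    """Return the required daemon names that are not among the running ones.

    process_names: names of currently running processes (assumed to have
                   been collected elsewhere, e.g. from ps output).
    required:      daemon names that should always be present.
    """
    running = set(process_names)
    return [name for name in required if name not in running]
```

An empty result means nothing to report; anything else becomes the body of
an alert.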

The software is run from crontab every 15 minutes on all hosts on a site's net.

The system consists of about 425 lines of Perl code and makes full use
of IxoBeep/Tpage for the transport layer. Since it is written in Perl,
adding agents is extremely easy. It is all currently running on
Sun systems, but other systems should be able to run it. It does not
have any hard-coded dependency on the alphanumeric transport layer
it uses.
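
One way to keep alerting code independent of the transport is to pass the
transport in as an object with a single send method. The Python sketch
below uses a stand-in transport that only records messages; it does not
call IxoBeep/Tpage, and the class and function names are hypothetical.

```python
class RecordingTransport:
    """Stand-in transport that just remembers what it was asked to send."""

    def __init__(self):
        self.sent = []

    def send(self, recipient, message):
        self.sent.append((recipient, message))

def dispatch(alerts, recipient, transport):
    """Send each alert through whatever transport was supplied."""
    for message in alerts:
        transport.send(recipient, message)
    return len(alerts)
```

Swapping pager software would then mean writing another class with the
same send method, leaving the agents untouched.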

There is also an optional "cookie" feature to send out a motivational fortune
cookie if no problems are found, to keep MIS staff motivated and interested
(just kidding, we're all that way already, aren't we?)

If people think they would be interested in this capability, send me a
message. While I won't give it away free, I am willing to underprice
anyone else offering a similar service.

--Stuart



This archive was generated by hypermail 2.1.7 : Wed 19 May 2004 - 15:50:47 MET DST