Realtime Coverage in the TPF World
by Alan Sadowsky
The American College Dictionary defines coverage as follows:
cov.er.age noun 1. Insurance. The total extent of risk, or the total number of risks, as fire, accident, etc., covered in a policy of insurance.
That coverage is considered, or even actually defined, as "insurance" cannot be emphasized enough. Most of us who are familiar with coverage organizations in TPF shops will certainly agree with this statement. For those of you not acquainted with coverage and the role it plays within TPF technical support areas, you're in for what I believe will be a very informative article.
Realtime coverage at TPF installations is quite different from coverage on any other type of operating system. The ultimate goal of coverage is to maintain and ensure the ongoing availability of the operating system. In very simple terms, if the system comes down it is the responsibility of coverage to get the system back up, determine what caused the problem in the first place, and make sure the same problem does not happen again.
While it may appear simple enough on the surface, there are several critical issues that must be considered before discounting the role of the TPF coverage programmer. First of all, TPF is unique in the respect that it is the operating system of choice for companies that face extremely high volumes of transactions. The American Airlines Sabre system, as an example, handles in excess of 3000 messages a second. When a system like Sabre is hit with unscheduled downtime, the cost to American can run into the millions of dollars. The coverage programming staff must be able to react quickly to get the system back up and available to the user community.
The next challenge coverage has to face is the determination of what caused the problem in the first place. Here again, TPF presents a challenge in that the luxury of calling a vendor "hotline" does not exist. For those folks who have worked in other areas (like MVS), you know that in many cases when you have a problem with a particular product, you can pick up the phone and get help from the vendor directly. That's not the case with TPF! Assuming the problem occurs during the normal business day, the only person you might be able to turn to for help is your local IBM SE (Systems Engineer). If a problem has to be escalated to the TPF Development lab in Danbury, only the SE can initiate that process, and the time it could take for a response is frightening if you're in a "system down" situation. (See "Reporting TPF Software Bugs", ACP*TPF Today, September 1990) So who deals with the situation, and who analyzes the problem, and who fixes the code, and who repairs the database corruption, and who magically does it all before the Directors and Vice Presidents begin flocking to the data center? Who said "coverage"? Give that person a cookie!
In some cases, it's difficult to not put the cart before the horse. Sometimes conditions dictate an immediate IPL, and other times, it's best to determine what happened before bringing the system back up. As an example, indications of progressive database corruption would be reason enough to delay bringing the system up until the complete problem is understood and pinpointed.
So far, so good. We've knocked off two of the three biggies. The last piece of the pie is to make sure you don't get burned again by the same problem. "But wait," you say. The problem was obviously fixed in the previous paragraph so you could get the system back up!?! Well, maybe. Then again, maybe not. The short-term fix to get the system back up may have been a quick patch to a program, or even a fallback of the offending segment. That really hasn't fixed the problem, and wouldn't give me any warm fuzzies about possibly dropping the patch, or inadvertently reloading the program. Now quite often the problem lies within an application program, and the analysis, console log, and dump can be forwarded to the individual responsible for the code so he or she can fix the problem. On the other hand, it's not uncommon for these little moments of excitement to conveniently coincide with that 45-day vacation that Joe Programmer just started. Once again, coverage to the rescue.
So where does one find the elusive coverage programmer? Quite honestly, you can look almost anywhere within your TPF technical staff. The general career path that I've observed in many of the shops I've worked in originates in Operations and usually takes several years. TPF operators move into analyst positions, followed by application programming experience, possibly a bit of time in database management or the utilities area, and eventually have an opportunity to move into Test Systems Coverage. Add several large helpings of dump reading, tool development, and inter-departmental interfacing. Let simmer for one to two years, turning up the heat every few months. Serves one.
Kidding aside, it does take a very special breed of individual to work effectively in a realtime coverage environment. The work is very often fast-paced, demanding, and stressful. There is no feeling in the world like the one you get when the system goes catastrophic, won't IPL, and you're the only coverage person on site. The skills and experience necessary to step up to the awesome responsibilities of coverage are not out there walking the streets.
Certainly there is talent in the marketplace, but even an experienced coverage programmer needs to learn the intricacies of a new system. Let's face it folks, there are no two shops out there that even begin to look alike. Most, if not all, of us have made "enhancements" to the control program and/or the associated operating system c-type code. Additionally, while many of us are running some flavor of reservations application, the differences between these applications outweigh the similarities. The learning curve for a coverage programmer is significant, and the realities of a constantly changing business climate don't allow for the training to ever end.
In an ideal world, the need for realtime coverage would not exist. Not only would developers design the cleanest, most efficient programs, but the control program would never have a bug in it, and the database, should it somehow get corrupted, would repair itself without even notifying the operator. Since we haven't quite come that far yet, the fact of the matter is that applications code can always be rewritten for the better, IBM religiously generates PTF tapes, and databases are not exactly self-sustaining.
I guess the point I'm trying to make here is that a realtime coverage organization is not just a "nice to have" feature in your organizational structure. It is a vital and critical part of every successful TPF installation. For existing shops, the direction should be to enhance and hone the expertise and skills of your organization, and for the new kids on the block, do not make the mistake of deferring your decision to staff, and train, and build a knowledgeable and productive coverage team.
Why? Because I can guarantee that the first words out of management's mouth when the system falls to its knees will be ... "GET COVERAGE"