Business Continuance Planning

Making The Case For Business Continuance Planning - Is Cost Really The Issue?
by Jim Ward

Although natural disasters account for a small percentage of unplanned computer systems outages, recent global events -- earthquakes in California, Mexico and Japan, floods in America's heartland, hurricanes in Florida and the World Trade Center bombing -- have all made front-page headlines in the national news. From an IT perspective, these unfortunate events have also forced a renewed emphasis on disaster recovery planning. A plan that allows normal business resumption within 24 to 48 hours raises the question: What would two days of down time cost a financial institution, a securities exchange, a major airline or hotel reservation center? In today's fiercely competitive business environment, two days of lost revenue and the erosion of long-term customer confidence for a major corporation can mean the difference between reporting record profits and filing for Chapter 11 bankruptcy protection. Is this really the plan you want?

As we move from a CPU-centric to an information-centric world, most large corporations -- particularly airlines and other organizations involved in the transportation industries -- must honestly assess their existing IT infrastructure. “Is your MIS department thinking proactively when it comes to preparing for disaster?” “Can you ever be fully prepared for such an event?” But the real question might be “How can you afford not to have a business continuance plan?”

Before you answer these very tough questions, take a hard look at some basic, albeit antiquated, crisis planning methods currently being deployed.

Prime/Dupe Pairing
To minimize the impact of direct access storage device (DASD) outages (and to enhance performance), most Transaction Processing Facility (TPF) and Airline Control System (ALCS, a.k.a. TPF/MVS) shops utilize prime/dupe pairing. While prime/dupe pairing is a highly effective data protection methodology in the event of selective hardware failure, it does little to ensure the continuance of an IS operation in the event of a disaster. Because all data is located in close physical proximity, damage to the system from an outside force will render a company unable to access mission-critical data for an undetermined period. In today's information-centric business world, where IS is a company's lifeblood, a loss of this magnitude cannot be tolerated.

Off-Site Backup Tape Storage
Most TPF shops run “capture” on a regular basis -- nightly, every other night, or for those living on the edge, twice per week. Capture tapes are typically transported to an off-site vault where, hopefully, they remain untouched until replaced by a more recent set. The tapes are then sent back to the center where the cycle starts all over again. This planning anticipates that, in the event of a disaster at the center, the set of capture tapes would be available to “restore” to a system with the same DASD configuration as the original site, if not at the original site itself.
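To put the capture schedules above in perspective, the following minimal sketch estimates how many days of updates could be at risk if the center were lost just before the next capture. It is only an illustration: the function name, the weekday numbering and the Monday/Thursday schedule are assumptions, and it ignores any logging between captures as well as the time tapes spend in transit.

```python
def worst_case_exposure_days(capture_weekdays):
    """Longest gap, in days, between consecutive captures in a weekly schedule.

    capture_weekdays: weekday numbers (0 = Monday ... 6 = Sunday) on which
    capture runs. The gap wraps around the end of the week.
    """
    days = sorted(set(capture_weekdays))
    if not days or any(d < 0 or d > 6 for d in days):
        raise ValueError("expected one or more weekday numbers in 0..6")
    gaps = []
    for i, day in enumerate(days):
        next_day = days[(i + 1) % len(days)]
        # A gap of 0 only occurs with a single weekly capture: a full week.
        gaps.append((next_day - day) % 7 or 7)
    return max(gaps)


nightly = worst_case_exposure_days(range(7))
# Hypothetical "twice per week" schedule: Monday and Thursday.
twice_weekly = worst_case_exposure_days([0, 3])
print(f"nightly: up to {nightly} day(s) of updates at risk")
print(f"twice per week: up to {twice_weekly} day(s) of updates at risk")
```

Under these assumptions, a nightly capture leaves at most one day of updates exposed, while the twice-weekly schedule leaves up to four.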

For more complex data centers with larger capacities, finding hot sites with similar configurations is easier said than done. Even if an adequate site is found, the “restore” exercise is not completely reliable. Due to the possibility of tape media failure, human error, etc., even a regularly tested “restore” process doesn't always guarantee that DASD configurations will be completely restored. In this scenario, business-critical data may be lost forever. Furthermore, because off-site storage is typically located within a few miles of the main center and the backup tapes must be physically transported to the alternate site, a regional disaster places both the data center and backup tapes at risk. In this case, the backup tapes may not be in any condition to support the “restore” process.

Physical Separation of IS Resources
To minimize reliance on the “restore” process, some IS managers split data resources into subsystems located in various rooms within an operation center or separate buildings entirely. Fireproof walls and independent power supplies serve to further protect resources from total disaster. This strategy ensures normal operations only if a disaster selectively affects certain areas of the partitioned data center. In this case, a reduced set of resources would be available for continued, limited business operations.

This is a short-term solution, at best, especially for larger shops. The plan demands that a truncated system handle the load of the original configuration. Even with highly reliable devices, the Mean Time Between Failure (MTBF) figure for a single device, when divided across a large number of devices, yields an expected interval between failures measured in days, not years. Therefore, unless the disaster which originally affected a portion of the resources is temporary or transient in nature (i.e., the affected resources can be reused, repopulated or replaced in less than seven days), a DASD hardware failure is likely before the lost resources return. The surviving operation is likely to suffer severe disruption before normal operation can be restored.
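The MTBF argument can be made concrete with a rough calculation. The sketch below assumes identical, independent devices with a constant failure rate, which is a simplification, and the MTBF and device count are hypothetical figures chosen only for illustration, not vendor data.

```python
import math


def expected_days_between_failures(device_mtbf_hours, device_count):
    """Expected interval between failures anywhere in the device population."""
    if device_mtbf_hours <= 0 or device_count <= 0:
        raise ValueError("MTBF and device count must be positive")
    return device_mtbf_hours / device_count / 24.0


def failure_probability(device_mtbf_hours, device_count, window_days):
    """Chance of at least one failure within the window (constant failure rate)."""
    interval = expected_days_between_failures(device_mtbf_hours, device_count)
    return 1.0 - math.exp(-window_days / interval)


# Hypothetical shop: 2,000 devices, each rated at 500,000 hours MTBF.
interval = expected_days_between_failures(500_000, 2_000)
risk = failure_probability(500_000, 2_000, window_days=7)
print(f"expected interval between failures: {interval:.1f} days")
print(f"chance of a failure within 7 days: {risk:.0%}")
```

With these assumed figures, a device rated at roughly 57 years between failures still produces a failure somewhere in the shop about every ten days, and close to an even chance of one within a week.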

Business Continuance: Proactive Data Protection
For years, disaster recovery has been part of the IS vernacular. More recently, however, business continuance has become the term used for the proactive planning process necessary for disaster avoidance. Why the change? Disaster recovery implies a reactionary move or operation -- the process of recovering from a disaster. What is also implied is “INTERRUPTION”! Given today's dependence on mission-critical data, “interruption” must be avoided at all costs. Therefore, “business continuance” has become the preferred solution that goes one step further and enables companies to maintain operations. How do you keep the company running despite a disaster? You've probably already concluded that your organization simply cannot afford even a single day of “down time.” So it's appropriate to consider the following proactive techniques that enable innovative companies to take a quantum leap in data protection -- by raising the stakes and implementing a business continuance plan.

Remote On-line Data Facilities
The most reliable business continuance plan usually embraces some degree of remote operations capability. The key component of such an operation is the implementation of a remote data facility. By mirroring critical data in a separate, managed site located miles from the host data center, remote data facilities enable normal IS operations immediately following a disaster. Several vendors offer remote data services.

IBM Corporation offers two implementations of Remote Copy functionality as extensions to the 3990 Model 6 DASD Controller. The first, Peer-to-Peer Remote Copy (PPRC), is a synchronous, controller-to-controller DASD update implementation. The second, Extended Remote Copy (XRC), is an asynchronous DASD update function, implemented through a combination of 3990 Model 6 extensions, and DFSMS/MVS system software updates. Both implementations will support ESCON distances up to 20 kilometers (12.5 miles). Additional distance requirements may be accommodated through an RPQ request. However, neither feature is supported by TPF at this time.

EMC Corporation also offers two extended remote data solutions, Campus and Extended Distance, under its Symmetrix Remote Data Facility (SRDF). EMC is the only vendor offering its customers access to remotely located data virtually anywhere in the world. SRDF essentially builds on the design of local mirroring, offering the capability to mirror selected logical volumes to a remote location.

Campus Environments
EMC's SRDF Campus Solution is optimal for remote operations within 60 kilometers of the host data center. Data is transferred via private fiber or common carrier ESCON. The channel connections to the host may be either parallel or ESCON.

Extended Distance Environments
EMC's SRDF Extended Distance Solution carries data across T3, E3, ATM, or SMDS. Extended Distance is limited only by the type of carrier selected.

Both SRDF Campus and Extended Distance use no host resources and are operating system-independent. No changes to applications are necessary and no exits to the operating system are required. Although the link between the two controllers is ESCON (for the campus solution), the channel connections to the host may be either parallel or ESCON. Both 3380 and 3390 device geometries are supported. SRDF provides fully automated recovery procedures. If a source volume should fail, the target volume is automatically accessed until the failed device is replaced and resynchronized.

Two operational modes are available for each solution: Synchronous (“real-time”) and Asynchronous (“journaling”). The mode of operation, as well as the type of implementation (Campus or Extended), should be heavily influenced by the host system requirements. Host systems such as TPF, which typically demand low response times, usually will not tolerate the delays associated with either synchronous mode or extended-distance (T3) data transfers. For these systems, the campus configuration using ESCON, operating in asynchronous mode, is the ideal solution.
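One reason synchronous mode is sensitive to distance is that each write must wait for the remote copy to be acknowledged. The sketch below estimates only the unavoidable propagation delay, assuming roughly 5 microseconds per kilometer for light in optical fiber and one round trip per write; both the function name and the round-trip count are assumptions, and real links add protocol and equipment overhead on top of this floor.

```python
# Approximate one-way propagation delay of light in optical fiber.
FIBER_DELAY_US_PER_KM = 5.0


def sync_write_penalty_ms(distance_km, round_trips=1):
    """Minimum latency added to each write by waiting on the remote site."""
    if distance_km < 0 or round_trips < 1:
        raise ValueError("distance must be >= 0 and round_trips >= 1")
    one_way_us = distance_km * FIBER_DELAY_US_PER_KM
    return 2 * one_way_us * round_trips / 1000.0


# 20 km and 60 km are the distances quoted in this article;
# 1,000 km is a hypothetical extended-distance link.
for km in (20, 60, 1000):
    penalty = sync_write_penalty_ms(km)
    print(f"{km:>5} km: at least {penalty:.1f} ms added per write")
```

In asynchronous mode the host does not wait for the remote acknowledgment, so this penalty does not fall on each transaction's response time.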

When operating under SRDF, each subsystem can have a mix of local, source, and target volumes. Local volumes may be dynamically spared and locally mirrored, if desired. All changes made to a source volume are automatically copied to the target volume. Since, under normal conditions, only write operations occur to the target (remotely mirrored) volume, the remote unit has sufficient throughput for local volume activity. In addition, even though the target volume can only be written to from the source location, it may be accessed as “read only” from a locally attached processor. So, for example, this scenario might give companies that use half test systems for development sufficient throughput to configure a “read only” VPARS against the remote copy.
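The source/target relationship described above can be pictured with a toy model. This is purely illustrative and is not how SRDF is implemented: the class and method names are hypothetical, tracks are reduced to dictionary entries, and the remote link is a simple assignment.

```python
class MirroredVolume:
    """Toy model of one source volume and its remote target copy."""

    def __init__(self):
        self.source = {}  # track number -> data at the host data center
        self.target = {}  # track number -> data at the remote site

    def host_write(self, track, data):
        """A write from the source location updates both copies."""
        self.source[track] = data
        self.target[track] = data  # stands in for the copy over the remote link

    def remote_read(self, track):
        """A processor attached at the remote site may read the target."""
        return self.target[track]

    def remote_write(self, track, data):
        """The target accepts writes only from the source location."""
        raise PermissionError("target volume is read-only at the remote site")


pair = MirroredVolume()
pair.host_write(7, b"record")
print(pair.remote_read(7))
try:
    pair.remote_write(7, b"changed")
except PermissionError as err:
    print("rejected:", err)
```

The point of the model is the asymmetry: the remote site sees every change made at the source, but a processor attached there can only read the copy.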

For the purpose of disaster recovery, SRDF eliminates the need for nightly backup. By locating the remote backup site farther from the data center, you create a safeguard against natural disaster, tape media failure and operator mishap alike. However, backups might still be desirable when performing periodic database maintenance and could, in fact, be performed more efficiently from the remote site. Nevertheless, by eliminating the nightly backup process, medium to large data centers can potentially save significant manpower resources and machine time.

Business Continuance Preparation
Cost tends to be a major consideration when developing a business continuance plan. The costs most commonly associated with business continuance planning are 1) hardware and hot-site facilities; 2) the backup process: manpower, machine time and displaced processing of other applications; 3) the expense of disaster drills, including manpower and interrupted schedules; and 4) the cost of interrupted operations during the actual disaster.

Other typical considerations for business continuance of large TPF shops include ensuring sufficient backup resources not only to operate the business, but to operate it at speeds fast enough to remain competitive. If a “hot site” is established, who will be responsible for operating it? If the hot site operation is outsourced, what is the knowledge level of the operations and support personnel with regard to TPF? If the hot site is not outsourced, how will it be staffed if the permanent staff is affected by the disaster? It is also not enough to back up the real-time operation; consideration has to be given to all of the support systems (e.g., MVS) normally required on a daily basis. Finally, due to the wide variety of platforms utilized in IS today, organizations must also consider disaster recovery services that are currently offered for open systems and client-server environments.

Remote data facilities, which offer operational flexibility, may be the answer when negotiating many of the obstacles associated with dedicated disaster recovery objectives. Unfortunately, no one has all the answers when trying to predict how and when disaster will strike. The best you can do is to try to cover as many bases as possible. Resources offering the most flexibility will help you achieve that objective.

Jim Ward has over 25 years' experience in TPF Systems. He currently has responsibility for TPF Direction at EMC Corporation.