Configuration Planning for TPF DASD - (Art, Science, or Magic?)
by Norman L. Laefer

Perhaps the most painful task for performance analysts and capacity planners in all of "TPFdom" is determining how much to spend on adding capacity to the I/O subsystem. If the proposed solution produces a shortfall in capacity, it means turning away business, and the lead time to resolve such a shortfall is typically measured in months. The alternative has been to buy unnecessarily large chunks of DASD capacity, which reduces the frequency of shortfalls but not the underlying risk. While outsiders may believe that obtaining the right answer is magic, most of us who practice the discipline of capacity planning threw away our magic wands long ago. Instead, we rely on the principal components of configuration planning: defining and measuring capacity, characterizing the workload in the same terms, and imposing the constraints that determine which solutions are viable.

We can scientifically define capacity in TPF terms as the number of I/Os per second that can be sustained without exceeding a specified average time that applications must wait for each I/O. Analytical models have been developed to predict the capacity of proposed DASD configurations in these terms. What frequently remains an art form, however, is determining how long an application can wait for an I/O and still meet the user's response time requirements at the terminal. One approach is to wait until performance problems start occurring and then add a percentage to the existing usage level, based on 2 to 3 years of forecasted growth. This assumes that you can tolerate the pain while the system is being expanded, and that the usage profile of the system 2 to 3 years from now is simply more of the same. If the profile does change, then any predictions regarding when to upgrade next may be wrong, and the pain is sure to return at the most inconvenient time.

Background
The early ACP/TPF systems were designed to meet a user response time requirement of 90 percent of responses within 3 seconds, which could be restated as an average time by making some reasonable assumptions about distribution functions. These systems had limited channels and used 2314s with up to eight drives on a channel. Average seek time was 75 milliseconds, latency/search was 12.5 milliseconds, and the data transfer rate was less than 0.5 megabytes per second. Under these conditions, device service times were around 45 milliseconds. Queuing theory tells us that a device running at 50 percent utilization will, on average, spend as much time in queue waiting for service as in service itself. With 90 milliseconds required for each data or program fetch, it might be possible to meet the user's response time requirements. If, however, device utilization reached 67 percent, the total I/O response time jumped to 135 milliseconds, impairing the ability to meet the requirements.
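A minimal sketch of the queuing relationship described above, using the M/M/1 approximation that the 50 percent and 67 percent figures imply; the parameter values are the ones from the text:

```python
def io_response_time(service_ms: float, utilization: float) -> float:
    """Average I/O response time (queue wait + service), M/M/1."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return service_ms / (1.0 - utilization)

# 2314-era figure from the text: ~45 ms device service time.
print(io_response_time(45, 0.50))     # 90 ms: 45 queued + 45 served
print(io_response_time(45, 2 / 3))    # 135 ms, as in the example

# With today's ~10 ms devices, the same utilization swing adds only 10 ms.
print(io_response_time(10, 2 / 3) - io_response_time(10, 0.50))   # 10 ms
```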

Today, device characteristics have changed significantly. Single-image TPF systems are able to use multiple paths to access DASD devices. In addition, the availability and cost of channels create new price/performance tradeoffs. Although it is still true that increasing device utilization from 50 percent to 67 percent will increase response time by 50 percent, with today's devices that increase is more likely to be 10 milliseconds rather than the 45 milliseconds of the earlier example. By using devices with lower service times and increased pathing, planners can push device activity levels past 50 percent and still meet the user's requirements.

Allocating Time
Since the topic relates to the internal response time of a TPF system, the user's terminal response time requirements need to be adjusted. It is in this area that a transformation from art to science begins to occur. We must first subtract the average round-trip network delay/transit time. If the user requires a 1.5-second response and the network introduces a 1.3-second average delay, then the internal response time (message life) must be less than 200 milliseconds in the TPF host to meet the requirement.

Whether the preceding calculation's result is 500 or 50 milliseconds, the following methodology will guarantee meeting the specified requirements.

  1. Based on operating at a maximum of 90 percent CPU utilization during the busy period, subtract 10 times the average CPU time per input message (9 parts for queuing plus 1 part for service, since at 90 percent utilization queuing theory predicts a wait of nine service times for each one spent in service). If a transaction requires 5 milliseconds of CPU time and the response requirement is 200 milliseconds, then the time allocated for DASD I/O is 150 milliseconds (200 - 10 x 5).
  2. Calculate from the data collection reports the number of I/Os for which an average message will wait, excluding virtual file access (VFA) hits. This is roughly the number of data reads plus program reads, unless there is heavy usage of the FILNC macro.
  3. Divide the result of step 1 by the result of step 2. This is the design point, in milliseconds per I/O, for DASD service time including time spent in queue (a worked sketch follows this list).
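The three steps reduce to straightforward arithmetic. A minimal sketch using the illustrative numbers from this section; the 10 non-VFA I/Os per message is an assumed figure, and the function name is ours:

```python
def dasd_design_point(response_req_ms: float, network_delay_ms: float,
                      cpu_ms_per_msg: float, ios_per_msg: float) -> float:
    """Target milliseconds per I/O (service plus queue), steps 1-3."""
    internal_budget = response_req_ms - network_delay_ms   # message life
    cpu_allowance = 10 * cpu_ms_per_msg        # step 1: 9 queued + 1 served
    io_budget = internal_budget - cpu_allowance
    return io_budget / ios_per_msg             # steps 2 and 3

# 1.5 s terminal requirement, 1.3 s network delay, 5 ms CPU per message,
# and an assumed 10 non-VFA I/Os per message.
print(dasd_design_point(1500, 1300, 5, 10))   # 15.0 ms per I/O
```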

Selecting a Solution
Almost all vendors today have modeling tools that permit developing multiple solutions by varying the number of devices, control units, and channels to meet both the access and response time requirements. This allows customers to select solutions based on cost, availability, and physical planning considerations.

There is nothing wrong with choosing 50 percent as a device utilization target, but it may eliminate the optimal solution. In most cases, the optimal solution lies between 40 percent and 65 percent device utilization for TPF systems that are fully duplicated. It is at this point that one would like to wave a wand and select the solution. There are, however, at least two issues that must be considered before moving a solution from the world of magic to a viable option.

First is a limitation imposed by TPF. TPF requires most programs and data to reside in the lowest 16 megabytes of main storage. This area is also used to dynamically allocate storage for each incoming message as required during its processing life. For a given application, the amount of storage required per message is, on average, relatively constant. What varies is how long that storage remains in use, a product often referred to as "K-Core-Seconds." The largest factor affecting message life, and therefore K-Core-Seconds, is I/O response time. I/O delays limit the number of messages that can be processed concurrently without depleting the available storage below the 16-megabyte line. It is therefore critical that adequate storage be available to utilize the full CPU capabilities. To calculate the appropriate ratio of MIPS to megabytes, a message life must be assumed. This in turn implies an upper limit on average message life, which may be shorter than the user's response time requirements alone would permit.
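Little's Law makes the storage constraint concrete: the number of in-flight messages equals the message rate times the average message life, and each in-flight message holds a roughly constant amount of below-the-line storage. A minimal sketch; the pool size and per-message storage figures are illustrative assumptions, not measurements:

```python
def max_message_rate(pool_kb: float, kb_per_msg: float,
                     msg_life_ms: float) -> float:
    """Maximum messages/sec before the below-16MB pool is depleted.

    Little's Law: in-flight messages N = rate * life, and storage
    held = N * kb_per_msg, so rate <= pool / (kb_per_msg * life).
    """
    return pool_kb / (kb_per_msg * (msg_life_ms / 1000.0))

# Assumed figures: 4096 KB of usable pool, 4 KB held per message.
# Note how doubling message life halves the sustainable rate.
print(max_message_rate(4096, 4, 200))   # 5120 messages/sec
print(max_message_rate(4096, 4, 400))   # 2560 messages/sec
```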

The second limiting factor is surviving a device failure. In many TPF applications, reads and writes to a device occur in roughly equal numbers. Since virtually all writes are duplicated, a device failure causes the reads from the failing device to be redirected to the surviving duplicate. The writes from the failing device, however, add no new load, since they are already part of the duplicate's normal activity. Under these circumstances, the additional reads result in a 50 percent increase in activity on the duplicate, and a single device failure could end up throttling the entire TPF system.

In this situation, if device utilization was normally 67 percent, the surviving duplicate would be driven to 100 percent utilization, an effectively infinite queue would develop, and the system would no longer be functional. If a design point is to be useful, device utilization during a single device failure should probably not exceed 75 to 80 percent. Clearly, each TPF user may have a different design point based on fallback considerations, record duplication practices, and read/write ratios.
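The failover arithmetic can be checked directly. A minimal sketch under the assumptions stated above: all writes are duplicated, the duplicate absorbs the failed device's reads, and service time is unchanged:

```python
def failover_utilization(normal_util: float, read_fraction: float) -> float:
    """Utilization of the surviving duplicate after its pair fails.

    Normal per-device load is reads + writes; on failure the duplicate
    also absorbs the failed device's reads, so its load scales by
    (2r + w) / (r + w), where r is the read fraction of accesses.
    """
    w = 1.0 - read_fraction
    return normal_util * (2.0 * read_fraction + w) / (read_fraction + w)

print(failover_utilization(0.50, 0.5))   # 0.75 -- the 50 percent jump
print(failover_utilization(0.67, 0.5))   # ~1.0 -- saturated, as noted
```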

Special Considerations
No discussion of TPF DASD would be complete without addressing the special performance functions that can be ordered as requests for price quotation (RPQs). The two aspects of these RPQs relevant to this paper are: (1) determining which environments derive the maximum benefit from cache, and (2) determining the potential impact on the TPF user.

In single-image TPF environments, the most effective and economical way to increase throughput is to maximize the use of processor storage for VFA. It is faster than accessing DASD cache and reduces the demand on the DASD subsystem. When a database is shared by multiple TPF images, the effectiveness of VFA is significantly reduced compared with single-image mode, which results in more DASD accesses per message. This requires recalculating the maximum time allowed per I/O and will produce a lower target time. It usually turns out to be a double whammy: first, you must handle more I/Os because fewer are satisfied by VFA; second, each I/O must complete in less time to meet the user's requirements. This is a good time to consider using cache for TPF. The increases in message life that usually accompany loosely coupled multiple TPF images may also preclude or limit the use of tightly coupled multiprocessors (TCMP) in that environment.
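Revisiting the earlier design-point arithmetic shows the double whammy numerically, assuming for illustration that sharing the database raises the per-message non-VFA I/O count from 10 to 15:

```python
# Same 150 ms I/O budget as before (200 ms message life - 10 x 5 ms CPU),
# spread over more non-VFA I/Os per message.
io_budget_ms = 200 - 10 * 5
print(io_budget_ms / 10)   # 15.0 ms per I/O, single image
print(io_budget_ms / 15)   # 10.0 ms per I/O, shared database
```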

The current implementation of caching for TPF requires identifying specific record types as cache candidates. For messages that rely heavily on those record types, response times will be exceptionally good. However, since the I/O subsystem is designed around an average I/O service time, messages whose I/O does not benefit from caching will see service times worse than that average. The effectiveness of cache for TPF depends not only on cache size; in the case of the DASD fast write (DFW) capability, it is extremely dependent on the amount of non-volatile storage (NVS) present. The 4-megabyte NVS limit can cause response times to climb very rapidly under heavy loads introduced by long strings or unusual activity. While this condition can also occur in non-cache environments, those environments are generally less sensitive to similar swings in loading.
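The averaging effect can be seen with a simple weighted mean; the hit ratio and the hit and miss times below are assumed values for illustration:

```python
def effective_service_ms(hit_ratio: float, hit_ms: float,
                         miss_ms: float) -> float:
    """Average I/O service time across cache hits and misses."""
    return hit_ratio * hit_ms + (1.0 - hit_ratio) * miss_ms

# Assumed: 70 percent hits at 3 ms, misses at 25 ms.  The subsystem can
# meet a ~10 ms average design point even though messages dominated by
# misses see service times far worse than that average.
print(effective_service_ms(0.7, 3, 25))   # 9.6 ms
```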

Testing For A Scientific Solution
When you can answer the following questions, you will have taken a major step forward into the science of configuration planning and away from the dark world of magic:

1. What will be the effect on user response time for an X percent increase in messages?

2. What will be the effects on user response time if messages require X additional reads or writes?

3. What will be the user response time during a period of device failure?

4. What is the maximum message rate that will not deplete the critical processor memory resource?


Opinions expressed are those of the author and do not necessarily reflect the views of Amdahl Corp.

About the author: Norman Laefer has worked in the data processing industry for 28 years, of which 14 years have been in the airline industry. He is currently the Manager of TPF and Airline Industry Marketing at Amdahl Corporation.