TPF IO Performance Tuning
by Mark A. Grasso

General Introduction
The IBM TPF operating system is a real-time, high-volume, transaction processing system whose operating characteristics differ greatly from those of other mainframe systems such as MVS and VM. Hundreds or even thousands of end-user messages, originating from a wide geographical distribution, arrive at a TPF central processing complex (CPC) every second. Traditionally, these end-users are business agents and their messages generate queries or updates to a centralised database holding an inventory of user business data. Each of these hundreds or thousands of end-users expects a real-time response. Often, their business depends upon it.

This is the demanding environment in which a TPF on-line system is expected to perform and, in fact, 'real-time performance' is an integral operating concept in on-line TPF systems. TPF systems contain all the normal components found in any mainframe computer system: CPUs, memory, IO subsystem, Control Program and Applications software. However, TPF was designed and is configured to do one thing extremely well: maximise throughput at extremely high-volume workloads.

The Real World
Behind every end-user message arriving at the CPC is a real person waiting for a response. Hundreds or thousands of such responses are being awaited every second. As a result, more important than any single system component is the integration, that is, the interaction of all of these TPF system components. This is what ultimately determines the throughput and thereby, the actual real-time responsiveness of the system.

If any one component is inefficient or 'co-operates' poorly with any other, throughput and end-users suffer the effect.

Short Path CPU
With this in mind, TPF was designed and is configured to meet several goals. On the CPU side, TPF aims for the shortest possible message processing CPU path and uses a first-in/first-out, minimal prioritisation philosophy to minimise the system overhead of task management. Multi-processing, with minimum interaction and locking between symmetric CPUs, provides CPU cycles while keeping 'tightly-coupled' (symmetric processing) overhead relatively 'flat' across a wide range of workloads. Multi-programming techniques seek to ensure that all available CPU cycles are kept busy handling end-user message processing rather than idly waiting for system resources to become available.

As in all systems (digital and otherwise), queues result from contention for shared resources. In TPF's high throughput environment though, excessive queuing in any system component results in real-time bottlenecks. In TPF, to wait is to die.

IO Challenges
The most significant contribution to the overall existence time of an end-user message is DASD IO. Usually 96% or more of a message's 'system lifetime' is consumed in DASD IO handling.

VFA (Virtual File Access) is TPF's main storage file record caching facility and is the primary tool for reducing end-user message existence time. VFA seeks to eliminate as many DASD IO accesses as possible. However, even a relatively small TPF on-line system (about 180 disk modules) running peak volumes of about 300 messages per second generates thousands of DASD IOs per second, and that is without the complications of a 'loosely-coupled' configuration (several TPF images sharing a common DASD database, which limits VFA utilisation). In larger TPF on-line systems handling thousands of end-user messages a second, a typical DASD IO subsystem must be able to handle 50,000 or more IOs per second, all without exceeding real-time end-user message processing service requirements.

In typical Internet Web Server processing configurations, IO accesses are almost exclusively read-only and data can be kept in main storage. In TPF, by contrast, half or more of all IOs are typically writes which must ultimately go to DASD. Compared to main storage, DASD IO accesses are slower by orders of magnitude.

Managing such a high DASD IO load, in real-time, for hundreds or thousands of end-user messages each second, is a feat unmatched by any other operating system in the world today. Quite simply, if a business demands this type of real-time, high-volume, IO intensive processing, TPF is the only choice.

TPF IO Handling
TPF primarily uses a parallel database architecture with file records spread laterally across DASD modules (disk drives) in order to reduce contention for IO hardware resources: channels, DASD control units, and disk modules. Once again, contention for shared IO resources creates queues, which lead to elongated IO response times and dramatically reduce message processing throughput at high volumes.
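
The way queues elongate response time as a shared resource gets busier can be illustrated with a small calculation. The sketch below is not taken from TPF; it assumes a textbook M/M/1 queueing model for a single disk module, an evenly spread load, and hypothetical figures (3,000 IOs per second and a 15 ms service time) chosen only to show the shape of the effect.

```python
def module_utilisation(total_io_per_sec: float, modules: int, service_ms: float) -> float:
    """Fraction of time one module is busy, assuming IOs are spread evenly."""
    io_per_module = total_io_per_sec / modules
    return io_per_module * service_ms / 1000.0


def response_time_ms(service_ms: float, utilisation: float) -> float:
    """Mean response time (queue wait plus service) under the assumed M/M/1 model."""
    if not 0.0 <= utilisation < 1.0:
        raise ValueError("utilisation must be at least 0 and below 1")
    return service_ms / (1.0 - utilisation)


if __name__ == "__main__":
    # Hypothetical load and service time; only the trend matters here.
    for modules in (60, 120, 180):
        busy = module_utilisation(3000, modules, 15.0)
        print(modules, round(busy, 3), round(response_time_ms(15.0, busy), 1))
```

Under these assumptions, spreading the same load across more modules lowers the utilisation of each one, and the response time falls much faster than the utilisation does. Real IO workloads are rarely this evenly distributed, which is why measurement matters.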

Because an inefficient IO subsystem can bring TPF throughput "to its knees", both the IO hardware configuration and Applications database design are critical in keeping IO resource contention to a minimum. Measuring performance (if you don't measure, you can't analyse), analysing performance data (if you don't analyse, you can't tune) and tuning the IO subsystem to eliminate bottlenecks (if you don't tune, you can't control) together form one of the most critical and demanding tasks faced by a TPF system performance analyst.

TPF IO Performance Tuning
Traditionally, TPF IO performance tuning has been based upon analysis of the effects of DASD IO resource contention, i.e. tracking module queues and IO response time as reported by TPF Data Collection, and upon trying to get as many good VFA candidate records into VFA as possible.

I would suggest that a far more efficient and pre-emptive approach is to collect and analyse actual IO service timings. Facilities exist in the IBM/XA channel subsystem architecture for reporting IO service time by its component timings; Connect time, Pend time, and Disconnect time.

IO Performance Tuning using Service Time
Each of the individual IO service time components relates to and describes a portion of the physical IO path through the channel/IO subsystem.

Connect time is the time spent transferring IO commands and data across the channel. The channels may be parallel or fibre-optic, and the path may include ESCON Directors. During this time, the physical channel is busy and cannot be used to service other IO requests.

Pend time, which is all-important (especially in loosely-coupled TPF complexes), is the time that an IO spends waiting in the channel/IO subsystem for shared resources to become available. This is not time spent on the TPF module queue.

Disconnect time primarily results when a file record has to be retrieved from or possibly written to the disk media surface. Because of the time required for mechanical disk and head movements, the channel is disconnected and available to service other IO requests.
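
The three components described above can be pictured as a simple record whose parts add up to the IO service time. The following Python sketch is illustrative only: the class name, field names and sample figures are assumptions made for this example and are not part of TPF or of any particular monitoring tool.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ServiceTime:
    """Average IO service time for one device, split into its components (ms)."""
    connect_ms: float      # commands and data moving across the channel
    pend_ms: float         # waiting in the channel/IO subsystem for shared resources
    disconnect_ms: float   # channel released, e.g. during disk and head movement

    @property
    def total_ms(self) -> float:
        return self.connect_ms + self.pend_ms + self.disconnect_ms

    def shares(self) -> Dict[str, float]:
        """Each component as a fraction of the total service time."""
        total = self.total_ms
        parts = {"connect": self.connect_ms,
                 "pend": self.pend_ms,
                 "disconnect": self.disconnect_ms}
        return {name: (value / total if total else 0.0)
                for name, value in parts.items()}

    def dominant(self) -> str:
        """Name of the component that contributes most to the service time."""
        shares = self.shares()
        return max(shares, key=shares.get)


if __name__ == "__main__":
    sample = ServiceTime(connect_ms=2.0, pend_ms=0.5, disconnect_ms=7.5)  # made-up figures
    print(sample.total_ms, sample.dominant(), sample.shares())
```

Keeping the components separate, rather than reporting only their sum, is what allows a slow IO to be attributed to a particular part of the physical path.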

Analysis of IO service time components in an active TPF system can quickly reveal bad hardware configuration, poor logical database design (resulting in 'hot' records), and often even 'failing' hardware. In other words, IO bottlenecks in the TPF IO subsystem which have a direct impact on system processing throughput and end-user response time, become visible. Armed with this information, intelligent decisions (not guesses) can be made regarding IO subsystem configuration, VFA tuning (RIAT record-ID allocation), and the effects of Applications database design. Furthermore, it is the most accurate way to do TPF IO subsystem capacity planning and to evaluate new IO hardware options.
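
As a rough illustration of how such analysis might surface candidates for investigation, the sketch below compares each module with the median of all modules and reports any figure that stands well above it. The module names, the sample figures, the dictionary keys and the factor of 2 are all hypothetical; a real analysis would choose its own thresholds and would look at trends over time as well.

```python
from statistics import median
from typing import Dict, List, Tuple

METRICS = ("connect_ms", "pend_ms", "disconnect_ms", "io_per_sec")


def find_outliers(samples: Dict[str, Dict[str, float]],
                  factor: float = 2.0) -> List[Tuple[str, str, float, float]]:
    """Return (module, metric, value, median) wherever value > factor * median."""
    findings = []
    for metric in METRICS:
        typical = median(stats[metric] for stats in samples.values())
        for module, stats in sorted(samples.items()):
            if typical > 0 and stats[metric] > factor * typical:
                findings.append((module, metric, stats[metric], typical))
    return findings


if __name__ == "__main__":
    # Hypothetical per-module averages over one collection interval.
    samples = {
        "A01": {"connect_ms": 2.0, "pend_ms": 0.4, "disconnect_ms": 6.0, "io_per_sec": 20},
        "A02": {"connect_ms": 2.1, "pend_ms": 0.5, "disconnect_ms": 6.5, "io_per_sec": 22},
        "A03": {"connect_ms": 2.0, "pend_ms": 3.0, "disconnect_ms": 6.2, "io_per_sec": 21},
        "A04": {"connect_ms": 1.9, "pend_ms": 0.4, "disconnect_ms": 6.1, "io_per_sec": 70},
    }
    for module, metric, value, typical in find_outliers(samples):
        print(f"{module}: {metric} = {value} (median {typical})")
```

In this made-up data, one module stands out for Pend time, which would point towards contention for a shared resource, and another for IO rate, which would point towards a 'hot' record or an uneven database layout.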

Datalex currently possesses a proven IO Monitoring tool which collects and organises IO performance information from on-line and native-test TPF systems. Datalex has developed a stand-alone, ECB-based collector and a suite of off-line report generators. The on-line collector requires no Control Program changes and produces no visible overhead in the on-line system. The TPF IO Monitor, together with a selection of PC-based or mainframe-based reports, is extremely useful in evaluating IO subsystem performance in an active TPF system and in evaluating new IO hardware.

In any case, the mechanics of IO performance monitoring utilising IO service time components can be found in IBM's Enterprise Systems Architecture/370-390 Principles of Operation. See the Channel-Subsystem Monitoring section in the IO Support Functions chapter.

For further information regarding Datalex's TPF IO Monitor tool and reporting facilities, feel free to contact...

Mark A. Grasso
Datalex – Amsterdam
