Some Rise, Some Climb, Some Fall to Get to Networkin'
by Mark Gambino, IBM TPF Development

Feedback from my previous newsletter article regarding interesting communications problems indicated that you, our customers, consider information like that to be valuable. Because not all customers run their TPF system at the latest program update tape (PUT) level (shame on you!), requests have been made to continue to provide information about problems being corrected by upcoming PUTs. For this go-around, I will discuss a few items of interest that have come up in recent months. The first topic is likely to affect all large customers sooner or later, while the other two topics might be affecting your TPF system right now without you even realizing it.

Size Is Important
One popular use of the move character long (MVCL) instruction is to initialize a large table in core memory. To accomplish this, the length register of the first operand is set up with the size of the table, and the length register of the second operand is set to 0. Since the dawn of computing time, the Systems Network Architecture (SNA) restart code has used an MVCL instruction to initialize network tables like the resource vector table (RVT) and terminal control table (WGTA). As the size of the network grows, so do the SNA core memory tables. If you study the fine print on the workings of the MVCL instruction, you will note that only 3 bytes of the first operand length register contain the actual length of data, which means that the maximum amount of data that can be copied or cleared by a single MVCL instruction is X'FFFFFF' (16 MB minus 1).

If you define more than 104 857 resource vector table (RVT) entries to the TPF system (using the MAXRVT parameter on the SNAKEY macro in CTK2), the size of RVT part 1 (RVT1) becomes greater than 16 MB and problems occur. For example, let us say that MAXRVT=118000. The size of RVT1 will be about 18 MB (X'1201600' to be exact). Because the length is truncated to 3 bytes by MVCL, the actual amount of data cleared is only X'201600', or about 2 MB. To make matters even worse, the SNA restart code counts on the MVCL instruction to position the first operand base register to the end of RVT1 but, in this example, MVCL bumps the base register by only 2 MB. The result is that SNA restart aborts when the first 2 MB of RVT1 are filled in (13526 RVT entries processed).
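The truncation in this example can be checked with a few lines of arithmetic. The Python sketch below is illustrative only and is not TPF code. The 160-byte RVT1 entry size is an assumption inferred from the figures above (X'1201600' bytes for 118000 entries), and all names are hypothetical.

```python
# Illustrative arithmetic only; not TPF code.
MVCL_LENGTH_MASK = 0xFFFFFF  # MVCL uses only the low-order 3 bytes of the length register

# Assumption: inferred from the article's example (18,880,000 bytes / 118,000 entries).
RVT1_ENTRY_SIZE = 160


def mvcl_effective_length(requested_length: int) -> int:
    """Return the number of bytes a single MVCL would actually process."""
    return requested_length & MVCL_LENGTH_MASK


max_rvt = 118_000
table_size = max_rvt * RVT1_ENTRY_SIZE
cleared = mvcl_effective_length(table_size)

# Largest entry count whose table still fits in one MVCL under the assumed entry size.
max_safe_entries = MVCL_LENGTH_MASK // RVT1_ENTRY_SIZE

print(f"table size: X'{table_size:X}' ({table_size} bytes)")
print(f"cleared:    X'{cleared:X}' ({cleared} bytes)")
print(f"largest MAXRVT that fits in one MVCL: {max_safe_entries}")
```

Under that assumed entry size, the largest entry count that fits in one MVCL works out to the 104 857 limit quoted above.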

APAR PJ23240 (on PUT 5) corrects the code that processes SNA tables whose size can be greater than 16 MB. APAR PJ22898 (also on PUT 5) fixes the code that initializes the WGTA. If there are any large user tables in core memory on your TPF system that are approaching or over 16 MB in size, verify that a single MVCL instruction is not used to copy or initialize those tables. (So much for not having to worry about that 16-MB magic number in TPF 4.1 anymore!)
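One way to stay under the limit is to process a large table in pieces, each small enough for a single MVCL. The Python sketch below shows only that general technique. It does not describe how the APARs changed the SNA restart code, and the function names are hypothetical.

```python
MVCL_MAX_LENGTH = 0xFFFFFF  # largest length one MVCL can process (16 MB minus 1)


def chunk_plan(total_length: int, max_chunk: int = MVCL_MAX_LENGTH):
    """Yield (offset, length) pairs that together cover total_length bytes,
    with every length small enough for a single MVCL."""
    offset = 0
    while offset < total_length:
        length = min(max_chunk, total_length - offset)
        yield offset, length
        offset += length


def clear_table(table: bytearray, max_chunk: int = MVCL_MAX_LENGTH) -> None:
    """Zero the whole table one chunk at a time (each chunk stands in for one MVCL)."""
    for offset, length in chunk_plan(len(table), max_chunk):
        table[offset:offset + length] = bytes(length)


# The 18 MB RVT1 from the example needs two chunks instead of one.
for offset, length in chunk_plan(0x1201600):
    print(f"offset X'{offset:X}', length X'{length:X}'")
```

Because each step advances by the length it actually processed, the final offset always equals the true table size, so code that relies on the end-of-table position remains correct.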

Two Wrongs Don't Make a Right, But Two Rights Make a Wrong (Route)
There is a pair of VTAM APARs that Advanced Peer-to-Peer Networking (APPN) customers need to know about. The first APAR, PTF OW21687, is essential for availability purposes because without it, all active APPN links on your TPF system will break if the TPF system IPLs. The links can be reactivated right away but all LU-LU sessions need to be restarted. The symptoms of the problem are that during SNA restart after the IPL, you will first see message XID300611 indicating that the non-activation exchange identification (XID) was successful. Soon after that, however, you will see message CCIM0089E, indicating that the link failed. On the VTAM side you will see message IST605I, stating that the data in the CONTACTED request is not valid. PTF OW21687, available on VTAM V4R2 and V4R3, corrects the processing of the CONTACTED request and allows the APPN links to remain active across an IPL of the TPF system.

The second item, PTF OW21462, is recommended for all TPF customers who are using APPN. It enhances the route selection algorithm used to calculate the path for an LU-LU session. A composite network node (CNN) is a VTAM system and all of the network control programs (NCPs) (3745s) that it owns. When the path for an LU-LU session traverses a CNN, parallel transmission groups (TGs) or links exist between the TPF system and the CNN, and the VTAM system in that CNN is the network node server (NNS) calculating the session route, the best route is not always chosen. For example, the LU-LU session path might end up being TPF to NCP1 to NCP2 to the remote LU, even when there is a better route (TPF to NCP2 to the remote LU). PTF OW21462, available on VTAM V4R2 and V4R3, fixes the majority of the CNN routing problems. How it does this is a detailed article in and of itself; therefore, we will not go into the details here.

Session Key 5 for PU 5 on PUT 5... Plead the Fifth
Customers running large PU 5 networks can experience spikes in CPU utilization on their TPF system when the failure of a remote node or link forces many LU-LU sessions to be deactivated. No dumps are taken and, eventually, all the sessions are cleaned up; therefore, this condition can go undetected. To understand the problem, we need to examine the session deactivation process and flows on the CDRM-CDRM session between VTAM and the TPF system.

When VTAM wants to end an LU-LU session and the primary LU (PLU) resides in the TPF system, a CDTERM request is sent on the CDRM-CDRM session to the TPF system. The CDTERM request contains one of three possible session keys to identify which LU-LU session to deactivate:

Key X'05', which contains the procedure correlation identifier (PCID) of the LU-LU session
Key X'06', which contains the names of the two LUs
Key X'15', which contains the network addresses (NAs) of the two LUs.

If the CDTERM contains key X'15', the TPF system can quickly find the RVT of the LU in question by searching the network address table (NAT).

A CDTERM request can be sent to stop an LU-LU session that is starting. Depending on how far along we are in the session initiation process, the node sending the CDTERM might not know the network addresses of both LUs, which means that key X'15' cannot be sent in this case. TPF 3.1 did not support key X'05', which meant that key X'06' had to be sent in this case. Again, the LU RVT can be easily found by using the INQRC macro to search the RVT.

TPF 4.1 supports LU 6.2 parallel sessions. Using the previous example where CDTERM needs to be sent but key X'15' cannot be sent, key X'06' cannot be sent either. Why? If there are many (parallel) sessions between the two LUs, just sending the LU names does not identify which session is meant. This meant that TPF 4.1 had to support key X'05'.

The hope was that it would be rare for the TPF system to receive a CDTERM request that contained key X'05', because the TPF system has no way to find a matching PCID other than to sequentially search the RVT one entry at a time. Because TPF 4.1 supports key X'05', it can now also receive CDSESSEND commands that contain key X'05'. As it turns out, VTAM sends key X'05' the majority of the time on CDRM-CDRM session commands if the remote end supports key X'05'.

Now let us look at an example to see what all this means. Suppose that your TPF system has 100 000 LU-LU sessions and you lose a remote NCP through which 8000 of those sessions existed. The TPF system will receive 8000 UNBIND requests (one on each LU-LU session) and 8000 CDTERM requests (each with key X'05'). For each CDTERM request, the TPF system will search the RVT looking for an entry whose PCID matches the one in key X'05'. On average, the TPF system would have to search 50 000 RVT entries when processing each CDTERM, but the actual results are even worse: if the UNBIND is processed before the CDTERM for a given session, the RVT entry will already have been cleaned up and the TPF system will end up searching all 100 000 RVT entries. Depending on the size of your network, the number of sessions that need to be cleaned up, and the processor capacity, the TPF system can run at 100% CPU utilization for a period of time, spending the vast majority of that time spinning through the RVT over and over. During this time, user messages for the remaining 92 000 LU-LU sessions can be locked out of the TPF system.
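To put rough numbers on this example, the Python sketch below estimates how many RVT entries are examined in total. It is a back-of-the-envelope model, not a measurement: it assumes one RVT entry per session (as the example does), an average successful search of half the table, and a full scan whenever the UNBIND was processed first. The function and parameter names are hypothetical.

```python
def rvt_entries_scanned(total_sessions: int,
                        failed_sessions: int,
                        unbind_first_fraction: float) -> float:
    """Estimate RVT entries examined while handling the CDTERM requests.

    unbind_first_fraction is the share of failed sessions whose UNBIND was
    processed before the CDTERM, so the PCID search finds nothing.
    """
    hit_cost = total_sessions / 2   # average cost of a successful linear search
    miss_cost = total_sessions      # full scan when the entry is already cleaned up
    misses = failed_sessions * unbind_first_fraction
    hits = failed_sessions - misses
    return hits * hit_cost + misses * miss_cost


best = rvt_entries_scanned(100_000, 8_000, 0.0)   # every CDTERM finds its entry
worst = rvt_entries_scanned(100_000, 8_000, 1.0)  # every UNBIND arrives first

print(f"best case:  {best:,.0f} entries examined")
print(f"worst case: {worst:,.0f} entries examined")
```

Even in the best case, the model has the TPF system examining 400 million RVT entries to clean up 8000 sessions, and twice that if every UNBIND is processed first.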

APAR PJ19066 (on PUT 4) corrected the performance problem for LU-LU sessions across NCP connections. Because LU 6.2 parallel sessions are not supported across PU 5 NCP connections, TPF 4.1 now indicates that it does not support key X'05' on CDRM-CDRM sessions through NCP links.

APAR PJ23098 (on PUT 5) corrects the problem for LU-LU sessions across SNA channel-to-channel (CTC) connections. This change was more involved because parallel sessions are supported across CTC links. Now, TPF 4.1 indicates that key X'05' is not supported across all CDRM-CDRM sessions, even those through CTC connections. In the case where a CDTERM request needs to be sent and key X'15' cannot be sent, key X'06' containing the LU names is sent along with control vector (CV) X'60', which contains the fully qualified PCID. The TPF system uses the LU names in key X'06' to quickly find the LU RVT. For LU 6.2 parallel sessions, the session control block (SCB) entries that are chained to the LU RVT are then searched to find one whose PCID matches the PCID in CV X'60'. The end result is that the CDTERM and CDSESSEND code in the TPF system now always uses an efficient search and never spins through the entire RVT.
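The two-step lookup described above can be sketched as follows. This Python model is a simplification under stated assumptions, not the TPF implementation: a dictionary stands in for the name lookup of the LU RVT, a list stands in for the chained SCBs, only one LU name from key X'06' is used, and every class, function, and value is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Scb:
    """Hypothetical session control block: one per parallel session."""
    pcid: str


@dataclass
class LuRvt:
    """Hypothetical RVT entry for an LU, with its chained SCBs."""
    lu_name: str
    scbs: List[Scb] = field(default_factory=list)


def find_session(rvt_by_name: Dict[str, LuRvt],
                 lu_name: str,
                 pcid: str) -> Optional[Scb]:
    """Locate a parallel session from key X'06' (LU name) plus CV X'60' (PCID)."""
    lu = rvt_by_name.get(lu_name)   # step 1: direct lookup by LU name
    if lu is None:
        return None
    for scb in lu.scbs:             # step 2: search only this LU's SCBs
        if scb.pcid == pcid:
            return scb
    return None


# Made-up sample data for illustration.
rvt_by_name = {
    "LUA": LuRvt("LUA", [Scb("PCID0001"), Scb("PCID0002")]),
    "LUB": LuRvt("LUB", [Scb("PCID0003")]),
}
match = find_session(rvt_by_name, "LUA", "PCID0002")
print(match)
```

The search is bounded by the number of sessions on one LU rather than by the size of the whole RVT, which is the point of the change.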