The 10 SNA COMMANDments
by Mark Gambino, IBM TPF Development
Every now and then you find yourself debugging a problem that is so bizarre you are left screaming, "This makes absolutely no sense!" Typically, this type of problem is sporadic, cannot be easily re-created, never shows up on a test system, but does show up in a production environment. What really makes this type of problem confusing is that the symptoms are misleading and the events that caused the problem occurred long before the problem surfaced. The trace data usually shows that everything is fine and that the dump or problem should never have happened in the first place, which makes you question logic in general and think that 2 plus 2 might actually be equal to 5. We will now explore 10 of the more "interesting" problems that have shown up in the communications area over the years.
Thou Shalt Not Steal (Control via an I/O Interrupt)
Symptoms: The TPF system receives a message (PIU) and takes a dump because the origin network address (NA) cannot be converted. It turns out that the PIU was for an LU-LU session that has been active for hours, having hundreds of messages processed without incident. Analyzing the dump, we expected to find the resource vector table (RVT) entry or network address table (NAT) corrupted, but both tables were fine (contained the correct network address). What made this problem even more intriguing was that the PIU trace showed that all subsequent PIU's for this session were processed correctly by the TPF system.
Questions: Why did the dump occur if the SNA tables are correct, and why did only one PIU encounter the problem?
Explanation: Realtime segment CSD1 manages the NAT. When the NAT becomes full, CSD1 removes entries that were used by sessions that are no longer active and compresses the NAT. While in the middle of this compression, and interrupt from the NCP was serviced by the SNA I/O code in the control program and the search of the NAT by the PIU trace routine was not successful. The network address conversion error dump is not taken until SNA OPZERO processes the message; and by the time we reached here, CSD1 has finished its processing. Therefore, the NAT looked normal in the dump. The solution (APAR WP13104) added code to CSD1 to mask the TPF system for I/O interrupts while the NAT is being compressed.
Don't Call Us, We Won't Call You
Symptoms: During peak hours when hundreds of messages per second are normally flowing, no messages (PIU's) are being sent or received by the TPF system. Both the TPF system and NCP appear to be in a hung state. A TPF manual dump shows a large queue of messages on the SOUTC queue that is ready to be sent to the NCP. An NCP static dump also shows a large queue of messages that is ready to be sent to TPF.
Question: If the TPF system has messages to send to the NCP and the NCP has messages to send to the TPF system, why is there a trade embargo preventing goods from being exchanged?
Explanation: Data flow between the TPF system and the NCP is controlled by a pacing mechanism where one side can send a fixed amount of PIUs, sends a virtual route (VR) pacing request, and then must wait for a VR pacing response to be received before it can send more PIU's. The TPF system sent a VR pacing request and then built a SOUTC queue large enough to drive the TPF system into input list shutdown. While in this condition, the TPF system does not poll the NCP; therefore the VR pacing response cannot be read in by the TPF system. The only way to get out of input list shutdown, in this case, is to send out the PIUs on the SOUTC queue and return those core blocks to the system, but the TPF system is not allowed to send until the VR pacing response comes in. This is the classic deadly embrace problem that makes you wonder which really did come first, the chicken or the egg? APAR WP12749 provided mechanisms for preventing and recovering from deadly embrace.
It is Rude to Answer a Question with a Question
Symptoms: Your operations center starts getting calls from users who cannot log on. The monitoring tool does not show any congestion or failures in the network. The CDRM-CDRM session (used when starting new LU-LU sessions) is active, as are many LU-LU sessions.
Questions: Why, all of a sudden, did the TPF system stop issuing work visas, preventing any new sessions in the TPF province? Why do we not see any error messages or command rejections?
Explanation: When TPF3.1 sends a request on the CDRM-CDRM session, the RID (RVT index) of the LU involved is placed in the PIU sequence number field so that when a response comes in, the TPF system knows which LU RVT to use. In TPF4.1, the RID was expanded from 2 to 3 bytes; therefore, the sequence number field (SNF) table was created. An SNF table entry is assigned when TPF sends a CDRM-CDRM request and maps the 2-byte PIU sequence number to the 3-byte RID of the LU RVT. When a response is received, the SNF table entry is deleted and can be reused. During LU-LU session activation, the remote CDRM can reject a CDINIT request by sending a negative CDINIT response or a CDTERM request. In the case of the CDTERM, the SNF table entry is never deleted. Gradually, over time, enough CDTERMs came in to fill up the SNF table, which blocked the TPF system from sending any requests on the CDRM-CDRM session and preventing any new users from logging on. APAR WP14027 fixed the SNF table management routines.
Timing Is Everything... Bad Timing That Is!
Symptoms: A few times a week a CTL-001 dump is taken out of the SNA I/O code because the number of messages on queue (field SNADEVQN) value was -1, which is not valid. Because any I-stream can add an item to the CTC queue, a locking problem was suspected. However, the CTC queue and dequeue routines both serialize processing by locking the CTC control block throughout their respective processing.
Questions: If every piece of code that updates the CTC queue locks the CTC control block first, how is SNADEVQN becoming corrupted? Hundreds of messages are flowing across the CTC link every second so, if the CTC queuing logic is broken, why is the CTL-001 only seen about once a day?
Explanation: The SNA polling code does not update the queue (therefore, does not lock) but does check to see if the queue is empty by doing a logical AND instruction, "NC SNADEVQN,SNADEVQN". To process this instruction, fetch the first operand value (SNADEVQN), fetch the second operand (also SNADEVQN), do the math, and then store the result back in the first operand. In between the fetches and the store on the main I-stream, another I-stream updated the SNADEVQN field. Next, the main I-stream did the store, which wiped out the update which was just made by the other I-stream. APAR WP13579 corrects the check for an empty CTC queue to not inadvertently change the value of SNADEVQN. While it was never conclusively proven that the NC instruction was what caused the problem, the CTL-001 has not been taken in the 3 years since the fix was applied.
If You Build It (Incorrectly), They (Problems) Will Come
Symptoms: CTC links could not be activated after a hardware IPL of the TPF system. The subarea address table (SAT) showed that CTC links were active, but the CTC RVT entries showed that the links were not active. Code in SNA restart specifically checks for and corrects RVT/SAT mismatches for CTC. The code is straightforward and invoked after every IPL.
Questions: How did the RVT and the SAT become out of synch in the first place? What renegade routine is corrupting the SAT?
Explanation: The answer to the first question is simple. CTC links were activated, the SAT was filed to DASD, but the TPF system took a hardware IPL before the RVT could be filed out. When the TPF system came back up, the SNA restart code reloaded the RVT from file, which indicated that no CTC links were active, and then reloaded the SAT from file, which indicated that the CTC links were active. The RVT/SAT mismatch code scans the SAT for entries marked as CTC CDRMs and verifies their status against the RVT status. A slight problem - when the SAT is built during an SNA fresh load (which in this case was months ago), the code mistakenly never marks any SAT entries as CTC CDRMs; therefore, the RVT/SAT mismatch code in SNA restart did nothing. APAR PJ18052 corrects the build SAT routine.
Always Come to a Complete Stop before Proceeding through the Intersection
Symptoms: A user tries to log on, but the logon requests are never completed successfully. A display of the RVT on the TPF system shows that the LU is neither in session nor starting a session. A display of the LU on VTAM shows that the network is waiting for the TPF system to send a CDCINIT request to continue the session activation sequence. A PIU trace confirms that the TPF system received the CDINIT request, sent a +RSP(CDINIT), but never sent the CDCINIT. PIU trace also showed that all other users were able to successfully log on. By examining the code, the TPF system schedules CDCINIT immediately after sending the CDINIT response.
Questions: Is there a black hole or Bermuda Triangle hidden deep in the TPF system that made the CDCINIT disappear? Why is the LU RVT not marked as a session is starting?
Explanation: Further analysis of the PIU trace revealed that this user logged off and then immediately tried to log back on. During the deactivation process, the ECB doing the session cleanup gave up control, allowing the session reactivation (CDINIT request) to be processed and the CDCINIT scheduled. Next, the cleanup ECB resumed its processing, clearing the LU RVT entry, and wiped out the command (CDCINIT) that was scheduled. APARs WP12731 and WP12902 fixed several timing problems in the session activation and deactivation code.
A Rose by Any Other Name Would Smell Like an ACTCDRM
Symptoms: An LU-LU session suddenly encounters problems, and dumps reveal that the network address (NA) field in the RVT has been corrupted. This happens a few times a day to different sessions, but the NA was always the corrupted field in the RVT.
Questions: What program was the serial killer of network addresses in LU RVT entries and why?
Explanation: Code in PIU trace updates the NA in a CDRM RVT entry when an ACTCDRM request is received. This code only checks the first byte of data for a X'14', which is an ACTCDRM if the PIU type is session control. Every so often, an application in the TPF system sent a user data PIU containing a X'14' as the first byte of data. Seeing the X'14', PIU trace assumed that the PIU was and ACTCDRM and changed the NA in the LU RVT entry, which resulted in a hung session. APAR PJ19824 fixed the ACTCDRM check in PIU trace.
Ask For A Raise If You Are Over Your Credit Limit
Symptoms: Data on a PU2.1 session started backing up in the TPF system. The PIU trace showed that for the past 3 hours, every time the TPF system sent a pacing request on this LU-LU session, the NCP responded with a window size of 1. After the backlog of data subsided and traffic became moderate, the NCP kept the pacing window size at 1, even though the NCP was running at less than 10% capacity.
Question: PU2.1 uses adaptive session level pacing so, when the flood of messages was over, why didn't the NCP increase the pacing size window?
Explanation: In NCP V6R1, if more than 8KB of data queues up for an LU-LU session, the session is switched from interactive mode to batch mode. In interactive mode, the NCP automatically adjusts the pacing window size up or down based on current conditions; however in batch mode, what amounts to fixed pacing is used unless the LU requests a larger window size. NCP V6R2 changed its pacing algorithm to switch back to interactive mode when the queue drains, but code was also added to the TPF system (with APAR PJ13651) to take advantage of the request larger window size feature in PU2.1.
Dial O-M-T for Murder
Symptoms: For no apparent reason, the TPF system would occasionally UNBIND (break) an LU6.2 session. The operator did not enter any message to deactivate the session, nor were there any network failures. PIU trace did not show any errors in the data flow that would have led the TPF system to send the UNBIND.
Questions: What program is committing these random acts of violence, killing LU6.2 sessions, and why?
Explanation: When the TPF system sends data faster than the remote end can process it, the output message transmission (OMT) package has to queue the data. If the OMT queue is ever purged, the 'messages on queue' bit is left on by accident. A few minutes after that, if no message go on the queue, the OMT timeout routine will think that the session is hung because the 'messages on queue' bit is on, but no messages have been dequeued for a long period of time. The OMT timeout routine was deactivating the session. APAR PJ18167 fixes LU6.2 OMT processing.
Do Not Feed the Animals (or the Ready List)
Symptoms: All is well, messages are flowing in and out of the TPF system, and then, inexplicably, traffic comes to a halt. No PIUs are sent or received on any link for over 30 seconds, and then traffic resumes right where it left off. No messages were lost and no sessions failed. During that 30-second interval, very few ECBs and core blocks were in use; therefore, the TPF system was not in any shutdown condition. The same sequence of events with no traffic for a 30-second period of time repeated itself throughout the day.
Question: Why is the TPF system acting as if it was a video game in which someone hit the pause button?
Explanation: If the ALLOCATE verb is issued to start an LU6.2 parallel session and the CDRM-CDRM session is not active, segment CHRR is called to return the session control block (SCB) entry. If ALLOCATEs are queued and an active session fails, CHRR issues a CREMC macro, which adds a task to the ready list for segment CHRX to start a new session. However, that same code in CHRR was getting invoked when a session activation attempt failed. The CHRR ECB would exit, allowing the new ECB in CHRX to begin processing. CHRX would not be able to start a session, go to CHRR, and we would start all over again. Until the queued ALLOCATEs time out (which in this case took 30 seconds), there was always one item on the ready list (a CREMC to CHRX) when the CPU loop task dispatcher obtained control. Because the ready list never became empty, the input list was not serviced; therefore, no messages from the network could be processed. APAR PJ22298 prevents the loop in CHRR and CHRX code by checking for failed session activations.