Function Call Overhead in Prisym/C/TPF

Function Call Overhead in Prisym/C/TPF
by Dan Evans

C is a function-oriented language. A C function call is an expression and it can appear wherever an expression is valid. Function use flows directly from the design of the language, and the function call operator is heavily used. A C implementation should strive to minimize the overhead of a function call. In Prisym/C, we have tried to minimize the path length for a function call so that building parameter lists, transferring control, and acquiring local storage in the called function do not significantly add to execution overhead.

Before calling a function, the compiler must generate code to build the parameter list. C passes parameters to functions "by value": the value of each parameter expression is passed to the called function. Prisym/C generates one instruction per parameter to store each value into the parameter list after the parameter expression has been evaluated. Since the System/370 has a variety of store instructions, character, short, integer, long, pointer, float, and double values which are in registers can each be stored into the parameter list with a single four byte store instruction. If the parameter value is in storage, a six byte move instruction will place the value into the parameter list. After all values have been stored in the parameter list, a Load Address instruction is generated to place the address of the parameter list into register 1.

The next sequence of instructions generated by the compiler transfers control to the function being called. The address of the function is acquired in one of several ways, depending upon the location of the function definition. If the function is defined in the same segment as the caller, the function's address is an offset, smaller than 4K, from the current segment's base register. A Branch and Save instruction is sufficient to transfer control to the function and save the return address, in register 14. If the function is defined in a different segment, an additional Load instruction is needed to load the address of the new segment into the segment base register before the Branch and Save. The target of an inter-segment call is a branch instruction in the segment's function vector table, which adds one more instruction to the path length. If the function is a nucleus library function, two Load instructions are needed before the Branch and Save. The first loads the address of the nucleus library vector table from low storage, and the next loads the address of the function from the vector table. This sequence is generated by the CLIBRTN macro, so that the user can substitute an alternative transfer sequence for nucleus library functions. If the function is defined in an application library, the three instruction call sequence generated is equivalent to the inter-segment call.

When a function gets controls, it executes a prolog which saves registers and allocates the temporary storage needed by the function. Since this storage is needed only during the execution of the function, a stack is the ideal data structure. The S/370 stands practically alone among current computer architectures in having no stack instructions, so stacks must be simulated. TPF compounds the problem with a 4K limit on contiguous storage.

In Prisym/C, the function prolog is implemented by the macro PROLOG. The compiler generates this macro passing it the dynamic storage requirement of the function in bytes. The macro expands to a three instruction sequence which saves the caller's registers, loads the storage requirement as a constant into register 0, and branches to the GETSTAK routine, which is generated at the end of each segment by the macro GETSTAK$. GETSTAK executes a sequence of eight instructions to allocate the requested stack storage and return to the caller, assuming that a new 4K block is needed. Within a 4K block, stack storage is allocated downward from high address to low address. GETSTAK checks to see if the stack contains enough storage for the function's requirement, and decrements the stack pointer register by the requested amount. A new save area is part of the function's dynamic requirement, and is located at the lowest address in the newly acquired stack storage. GETSTAK stores the previous save area's address in the new save area and returns. The final instruction in the prolog copies the parameter list address from register 1 to register 5. If the stack cannot satisfy the request, a call to a nucleus library routine is made to get and chain another 4K block. Functions typically have local storage requirements of less than 250 bytes, so this should not happen too often. Using this average, it would take 16 nested function calls to overflow a 4K block.

The out-of-line location for GETSTAK saves storage in return for increasing the function prolog path length by two instructions: the branch to it, and the return. The total length of GETSTAK, including the infrequent call to the nucleus routine, is 32 bytes. An in-line GETSTAK would be 26 bytes without the additional two instructions. If GETSTAK were compiled in-line, a segment with two function definitions would have a 52 byte GETSTAK overhead. With the out-of-line GETSTAK, a segment with two functions in it has a total GETSTAK overhead or 4+4+32=40 bytes, and each additional function adds only 4 bytes to the GETSTAK overhead. Two optimizations may reduce the prolog overhead. Register 1 is not copied to register 5 if the function is not passed a parameter list. If the function has no dynamic storage requirements, GETSTAK is not called. A function which has no dynamic storage requirement does not call any other function, has no in-line Assembler code, and does not allocate local variables, with the possible exception of a few register class variables.

Function return values are most often passed in register 0. Since registers are restored by a called function, the return value is stored into register 0's word in the save area to be restored. This takes a four byte store instruction if the value to be returned is in a register, or a six byte move otherwise. Registers are restored with a Load Multiple and the return is made with a register Branch, for an additional six bytes. Restoring registers automatically pops the stack, making the storage used by the returning function available for reuse, and incurring no extra overhead.

The following summary gives the path length and the storage overhead for each phase of a function call Prisym/C.

Parameter List Construction

Path: one instruction for each parameter passed plus one to load the address of the parameter list.
Storage: four bytes for each parameter whose value is in a register, plus six bytes for each parameter in storage, plus four bytes.

Intra-segment Transfer of Control

Path: one Branch and Save instruction.
Storage: four bytes

Inter-segment Transfer of Control

Path: three instructions to get the address of new segment, Branch and Save, and branch from the function vector table to the code.
Storage: twelve bytes

Nucleus Transfer of Control

Path: three instructions to get the address of the vector table, get the address of the function, and Branch and Save Register.
Storage: ten bytes

Function Prolog

Path: eight instructions for a normal GETSTAK plus four additional for a total of 12 instructions.
Storage: 14 bytes per function plus 32 bytes per segment

Function Return

Path: two instructions to restore registers and return, plus one instruction if a value is returned.
Storage: 6 bytes plus 4 or 6 bytes to return a value

From the above information, an intra-segment call with two parameters to a function returning a value will have a path length of 19 instructions. An inter-segment call to the same function takes 21 instructions. Many of the Nucleus Library routines do not require stack storage, so the function call path length is significantly reduced. For example, the code for the Nucleus Library routine strcpy follows. Note that it does not need stack storage.

ENTRY STRCPY
USING *,15

STRCPY STM 1,5,28(SVR) overhead

LR 2,1 overhead

L 3,4(,5)

L 4,=V(#NULLTAB)

STRCP1 TRT 0(256,3),0(4)

BNZ STRCP2

LA 3,256(,3)

B STRCP1

STRCP2 L 2,4(,5)

LA 1,0(,1)

SLR 1,2

AH 1,=H'1'

LR 3,1

L 0,0(,5)

MVCL 0,2

L 0,0(,5) overhead

LM 1,5,28(SVR) overhead

BR 14 overhead

LTORG

END

The function call path length for this routine is 3 instructions to build the parameter list, 3 instructions to call the routine, 2 instructions to save the registers and the parameter list address at the beginning of the routine, and 3 instructions to restore registers, and return with a value at the end of the routine, for a total of 11. As an additional optimization, the compiler recognizes certain functions as candidates for in-line expansion. For example, memcpy will expand inline to a few Load's and a Move Character Long instruction. Code generation in Prisym/C has been carefully designed to preserve the performance required by a TPF user while bringing the full benefit of the C language to the TPF world.