C Pointer and Array Equivalence

by Dan Evans

In the previous issue, I presented an example of C obscurity, the program

#define C 1
main() { printf(&C["\001%c%.2s\0003ACP"],C+C["is"],"pfun"); }

I claimed that the output of this program is the three characters "tpf". This result hinges on the equivalence of pointer arithmetic and array references in C. Every Assembler programmer knows that an array reference involves address arithmetic. C provides the convenience of array references and address arithmetic within a higher-level language framework, which allows the compiler to handle the details and verify the correct use of an address. In C, an address is called a pointer. With the C compiler working as an assistant to the programmer, many of the common errors associated with address arithmetic in Assembler programs can be eliminated.

In PL/1 or PL/TPF, a pointer is an independent data type, declared without reference to other data objects. In C, a pointer does not have an independent existence. It must point to some other data type, and that type must be given in the pointer declaration. Thus, we have type declarations such as char *, a pointer to a character, or int **, a pointer to a pointer to an integer. A C pointer may only be set to the address of a value of its declared type, and a C compiler will enforce this.

Values of the same type may be sequentially organized in memory as an array. An array of integers in an S/370 is a sequential set of fullwords. The array concept provides the basis for C pointer arithmetic. If a pointer is set to the address of an element in an array, adding 1 to the pointer should logically produce the address of the next element in the array. If ip is an integer pointer, declared as int *ip;, then ip + 1 is the address of the next integer in an array of integers. Similarly, ip - 1 is the address of the previous integer. The addition or subtraction of an integer and a pointer always produces a pointer of the same type. This definition does not mention the number of memory locations occupied by the type located by the pointer. The pointer may point to a character occupying one byte, or to a structure of 57 bytes. Pointer/integer arithmetic is independent of the underlying element length. If a pointer is correctly set, adding or subtracting an integer leaves the pointer still pointing to an occurrence of its base type. When analyzing the data type produced by a C expression, we say that "pointer to type T plus integer yields pointer to type T". In Assembler language, to get the address of the next fullword, we must always add 4 on an S/370, but in C, the next occurrence of a type is always pointer plus one. The compiler is given the task of generating the necessary code to scale the operation by the length of the base type.
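The scaling described above can be observed by measuring, in bytes, how far apart p and p + 1 are. The following is only a sketch: the names int_step, record_step, and struct record are invented for illustration, with struct record standing in for the 57-byte structure mentioned in the text.

```c
#include <stddef.h>

/* Hypothetical record type, echoing the 57-byte structure in the text. */
struct record { char bytes[57]; };

/* Byte distance between ip and ip + 1 for an int pointer. */
size_t int_step(void)
{
    int a[2];
    int *ip = a;

    /* Casting to char * lets us count bytes instead of elements. */
    return (size_t)((char *)(ip + 1) - (char *)ip);
}

/* Byte distance between rp and rp + 1 for a struct record pointer. */
size_t record_step(void)
{
    struct record r[2];
    struct record *rp = r;

    return (size_t)((char *)(rp + 1) - (char *)rp);
}
```

In both functions the source says "plus one"; int_step() returns sizeof(int) and record_step() returns sizeof(struct record), because the compiler supplies the scaling.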

Now that we have defined C pointer arithmetic, we can use it to express an array reference. In C, the reference arr[i] is formally equivalent to *(arr + i). This may look mystifying at first glance, but it only states what Assembler programmers already know: to get the value of the i'th array element, scale the element's subscript by the length of an array element, add the product to the base address of the array, and use this result to access the value. In C, the name of an array is a convenient shorthand for the address of the first array element, &arr[0]. So, if arr is an array of type T, the type of the name arr, when used in an expression, is "pointer to T". As an aside, this is different from &arr, whose type is "pointer to array of T", but this distinction is not clear in pre-ANSI C compilers. In the expression *(arr + i), the pointer addition of arr, the address of the first array element, and i produces a pointer to the i'th element of the array. The dereference operator * accesses the location referenced by a pointer and returns the value of the type stored there. Formally, "the dereference of pointer to T produces T". To use a concrete example, declare an array of integers as int arr[5];. Both the reference arr[3] and *(arr + 3) yield the value of the fourth integer in the array.
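A small sketch may make the equivalence concrete. The function names and the array contents below are invented for illustration; the last function shows the aside about &arr by measuring how far &arr + 1 moves.

```c
#include <stddef.h>

/* Returns 1 if the subscript and pointer forms agree for every element. */
int forms_agree(void)
{
    int arr[5] = {10, 20, 30, 40, 50};
    int i;

    for (i = 0; i < 5; i++) {
        if (arr[i] != *(arr + i))   /* same value */
            return 0;
        if (&arr[i] != arr + i)     /* same address */
            return 0;
    }
    return 1;
}

/* The fourth integer, fetched with explicit pointer arithmetic. */
int fourth_element(void)
{
    int arr[5] = {10, 20, 30, 40, 50};

    return *(arr + 3);              /* same as arr[3] */
}

/* arr + 1 steps over one int, but &arr + 1 steps over the whole array,
   because &arr has type "pointer to array of 5 int". */
size_t whole_array_step(void)
{
    int arr[5];

    return (size_t)((char *)(&arr + 1) - (char *)&arr);
}
```

Here forms_agree() returns 1, fourth_element() returns 40, and whole_array_step() returns sizeof arr, that is, 5 * sizeof(int).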

By now, you may be wondering what this has to do with the obscure program. Just this: the pointer expression *(arr + 3) is the same as *(3 + arr), because the order of the operands in an addition is immaterial. For those mathematicians in the audience, addition is commutative. This implies that the reference i[arr] should be equivalent to arr[i], and indeed, it is. We are used to looking at array references as an array name followed by a subscript, but in C, since an array reference is equivalent to a pointer expression, the notation of a subscript enclosing an array name is equally valid. Since our prior training makes it look strange, let's rewrite our obscure program reversing the array references and doing the preprocessor substitution of 1 for C:
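To see the commutativity at work, the same element can be fetched four ways. This is a sketch only; the function names are invented here, and the last function applies the reversed form to the literal "is" from the obscure program.

```c
/* Each function returns the same element, written a different way. */
int via_subscript(const int *a, int i) { return a[i]; }
int via_reversed(const int *a, int i)  { return i[a]; }
int via_pointer(const int *a, int i)   { return *(a + i); }
int via_commuted(const int *a, int i)  { return *(i + a); }

/* The reversed subscript applied to a string literal: 1["is"] is "is"[1]. */
char second_of_is(void) { return 1["is"]; }
```

Given int arr[5] = {10, 20, 30, 40, 50}, all four functions return 30 for i equal to 2, and second_of_is() returns 's'.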

main() { printf(&"\001%c%.2s\0003ACP"[1],1+"is"[1],"pfun"); }

The program is clearer now, but in order to fully understand it, we have to look at strings in C. A string is an array of type char with a '\0' character in the last array element. A string literal is a sequence of characters enclosed in double quotes, with an implied '\0' character at the end. The string literal "pfun" is an array of 5 characters, {'p' , 'f' , 'u' , 'n' , '\0'}. A literal has no identifier associated with it, so the literal is its own name. The C compiler treats a string literal like an array name. When used in an expression, the type of a string literal is "pointer to char", just like the name of a character array. We can simplify the program further by substituting character arrays for the string literals:
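These properties of a string literal can be checked directly. The sketch below uses invented function names; it relies only on sizeof, strlen(), and ordinary subscripting.

```c
#include <stddef.h>
#include <string.h>

/* Storage occupied by the literal, including the implied '\0'. */
size_t literal_size(void) { return sizeof "pfun"; }

/* Length as the string functions see it, excluding the '\0'. */
size_t literal_length(void) { return strlen("pfun"); }

/* A literal can be subscripted directly, like an array name. */
char literal_second_char(void) { return "pfun"[1]; }

/* A literal can be assigned to a character pointer, like an array name. */
const char *literal_as_pointer(void)
{
    const char *p = "pfun";

    return p;
}
```

Here literal_size() returns 5 and literal_length() returns 4; the difference is the implied '\0'. literal_second_char() returns 'f', and the pointer returned by literal_as_pointer() locates the 'p'.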

char s1[13] = "\001%c%.2s\0003ACP", s2[3] = "is" , s3[5] = "pfun";
main() { printf(&s1[1], 1+s2[1], s3); }

Now the program is readable. The first argument to printf() is always a format string; in other words, a pointer to the first element of an array of characters ending with a '\0' character, describing the format. The expression &s1[1] is a pointer to the second character in the string, so the first character, '\001', is effectively ignored. A string is assumed to end with a '\0' character, so the format string is really only "%c%.2s". If this is not clear, look up octal escape sequences in a C reference manual: an octal escape takes at most three digits, so \0003 is the character '\000' followed by the digit '3'. The format string has two format specifications, so there should be two arguments following it in the printf() argument list. The second argument is an integer, to be printed using the %c format as a single character, and the third is a string, a character pointer, from which a maximum of two characters are printed, according to the %.2s format. The second expression, 1 + s2[1], adds 1 to the second element of s2, the character 's', giving the EBCDIC code for 't', which is the first character printed. Since 't' also follows 's' in the ASCII code, the program does the same thing on an ASCII machine. This would not be the case if s2[1] were 'i', because 'j' does not immediately follow 'i' in EBCDIC. The third expression is the name of an array, and is therefore a pointer to the first element of the array. The format selects only the first two characters, 'p' and 'f'. The final result is "tpf".
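The walk-through above can be verified by formatting into a buffer instead of printing. This is a sketch under a few assumptions: the function names are invented, the three arrays are copied from the rewritten program, and snprintf() (a C99 function) stands in for printf() so the result can be inspected.

```c
#include <stdio.h>
#include <string.h>

/* The same three arrays as in the rewritten program. */
static char s1[13] = "\001%c%.2s\0003ACP";
static char s2[3]  = "is";
static char s3[5]  = "pfun";

/* Length of the format as printf() sees it: it starts after '\001'
   and stops at the embedded '\000'. */
size_t effective_format_length(void)
{
    return strlen(&s1[1]);
}

/* Same arguments as the printf() call, written into buf instead. */
void format_into(char *buf, size_t size)
{
    snprintf(buf, size, &s1[1], 1 + s2[1], s3);
}
```

effective_format_length() returns 6, the length of "%c%.2s". With a buffer of at least four bytes, format_into() leaves the string "tpf" in it.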

Our obscure program has provided some interesting information about C pointer arithmetic, array references, and strings. However, it is still an obscure program and should never find its way into production. Programmers should strive for clarity in programs, as if they expect the programs to be read time and again by people fascinated with the technique of the original author.

In the next issue, I'll talk more about array references and the special problems presented by TPF, where large arrays cannot be organized in contiguous storage. Finally, as I alluded to differences in program behavior under different code sets, the following function gives an indication of how code set independence can be achieved.

/* map 'a' through 'z' to 0 through 25 for both EBCDIC and ASCII */
int ord (int c)
{ return c - (( c <= 'i') ? 'a': (c <= 'r') ? 'j' - 9 : 's' - 18); }
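For comparison, one alternative is to avoid character-code arithmetic entirely and search a table of the letters. This is a different technique from the function above, not a restatement of it; the name ord_lookup is invented here, and the original ord() is repeated so the two can be compared.

```c
#include <string.h>

/* The function from the article, repeated for comparison. */
int ord(int c)
{ return c - ((c <= 'i') ? 'a' : (c <= 'r') ? 'j' - 9 : 's' - 18); }

/* Alternative sketch: find c in a table of the letters, so no
   assumption is made about where the letters sit in the code set.
   c is expected to be a character value; returns -1 when c is not
   a lowercase letter. */
int ord_lookup(int c)
{
    static const char alphabet[] = "abcdefghijklmnopqrstuvwxyz";
    const char *p;

    if (c == '\0')                  /* strchr() would match the terminator */
        return -1;
    p = strchr(alphabet, c);
    return p ? (int)(p - alphabet) : -1;
}
```

For every lowercase letter the two functions return the same value. The lookup costs a short search, but it also rejects characters outside 'a' through 'z', which ord() does not attempt to do.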