C Corner - SEP 1992

With C in TPF - Two Dimensional Discontiguous Arrays
by Dan Evans

Two dimensional arrays should have a special place in the hearts of TPF programmers working with C, because they do not have to be contiguous blocks of storage. The expression object[i][j] in a C program may refer to a conventional block of storage organized into a two-dimensional array of rows and columns, or it may refer to a less conventional array of pointers to one-dimensional arrays which are in discontiguous storage. In practical terms, this means that a TPF C programmer may work with a 1000 by 1000 two-dimensional array of integers directly without concern that the rows of the two-dimensional array are stored in arbitrary discontiguous memory locations. A one million integer object is large enough for most computations, but make it a two-dimensional array of pointers and the capabilities are unlimited. TPF's 4K limit on memory is still felt in the maximum size of each row, which is why I limited the size of a dimension to 1000 integers or 4000 bytes.

In previous columns, I have discussed the equivalence of array references and pointer arithmetic in C. This is the basis for understanding two-dimensional array references, so let's quickly review. There are three rules:

pointer addition and subtraction are scaled so that the type of the underlying element is always preserved; a pointer to any type plus or minus an integer produces another pointer to the same type;
in an expression, the type of an array name is converted to a pointer to the first element of the array, and its value is the address of that element; the declaration int abc[6]; declares abc as an array of 6 integers, but in the array reference abc[2], the type of abc is converted to an integer pointer, and its value is the address of the first array element; this is a symbolic literal; if we actually knew that the compiler placed the array at some address, say 0x40056, then the array name would be equivalent to the expression (int *)0x40056;
an array reference is converted to an equivalent pointer arithmetic expression; the reference abc[2] is converted to the expression *(abc + 2);

Let's apply these rules to the analysis of a two-dimensional array reference, using the declaration int xyz[50][95];. This declares xyz as an array of 50 arrays of 95 integers. In other words, a two-dimensional array is an array of one-dimensional arrays. In this case, the one-dimensional arrays are each an array of 95 integers. The two-dimensional array puts together 50 of these one-dimensional arrays. In the array reference xyz[i][j], the first subscript, i, selects one of the 50 one-dimensional arrays, and the second subscript, j, selects an integer from that array.

A C compiler decomposes xyz[i][j] in several steps:

xyz[i][j] becomes (xyz[i])[j] because of the right to left associativity of the [] operator;
(xyz[i])[j] becomes (*(xyz + i))[j] because of rule number 3;
(*(xyz + i))[j] becomes *((*(xyz + i)) + j) again because of rule 3.

The two-dimensional array reference is reduced by the compiler to a series of pointer additions and dereferences. There is still a little more to this than meets the eye, so let's analyze the resulting expression keeping in mind rules 1 and 2. In order to emphasize the generality of this process, I'll use the word AnyType to refer to arbitrary types. Working from inside out, the expression is:

xyz + i - the type of xyz is array of AnyType, so it is converted to a pointer to AnyType whose value is the address of its first element, according to rule 2. The subscript i is scaled by the size of AnyType according to rule 1, and then added to the pointer producing an intermediate result we will call p1, a pointer to an occurrence of an AnyType. In our case, AnyType is an array of 95 integers, so i is scaled by 95 times 4, or 380. The resulting pointer p1 selects the beginning of the i'th occurrence of an AnyType in xyz.
*(p1) - p1 is a pointer to AnyType, and dereferencing it produces AnyType, our array of 95 integers. In order to continue the analysis, we'll call this array ia. Remember that the type of ia is an array of 95 integers, just as if it were declared as int ia[95];.
*(ia + j) - the type of ia is array of 95 integers, so it is converted to a pointer to an integer according to rule 2, and is therefore a pointer to its first element. The subscript j is scaled by the size of an integer according to rule 1, and then added to the array pointer to produce an intermediate result p2, a pointer to an occurrence of an integer. Since ia becomes a pointer to an integer, j is scaled by 4. The resulting pointer p2 selects the j'th integer in ia. The final expression *p2 produces the value of the selected integer, which is the final result of the reference xyz[i][j].

If you think that a 50 by 95 two-dimensional array of integers would take 19000 contiguous bytes of storage and would therefore not be possible in TPF, you would be wrong. Now that we thoroughly understand two-dimensional references, let's look at the same expression xyz[i][j] with the declaration int **xyz;. The compiler again decomposes the expression to *((*(xyz + i)) + j), but the change in the declaration produces differences in the evaluation.

xyz + i - the type of xyz is pointer to an integer pointer. The subscript i is scaled by the size of an integer pointer according to rule 1, and then added to the value of xyz producing an intermediate result p1, a pointer to an integer pointer. p1 selects the i'th occurrence of an integer pointer off the base address contained in the variable xyz.
*(p1) - p1 is a pointer to an integer pointer, so dereferencing it produces an integer pointer, the value of the location addressed by p1. In order to continue the analysis, we'll call this value ip, just as if it was declared as int *ip;.
*(ip + j) - the type of ip is integer pointer, so the subscript j is scaled by the size of an integer according to rule 1, and then added to ip produce an intermediate result p2, a pointer to an occurrence of an integer. The resulting pointer p2 selects the j'th integer in the block of storage pointed to by ip. The final expression *p2 produces the value of the selected integer, which is the final result of the reference xyz[i][j].

Here is the critical difference. When xyz is declared as a two-dimensional array, the name xyz refers to the location where the compiler has placed the array, whether it is allocated at compile time or it is allocated on the stack as a local variable at run time. When xyz is declared as an indirect pointer, it must be set to the address of an array of pointers, and each element of this array must be set to the address of an array of the indirect type, in our case, an array of integers. Here's how you create a two-dimensional array of the second form:

#define ROWS 50
#define COLUMNS 95
int i, j;
int **xyz;

xyz = (int **)malloc((sizeof *xyz) * ROWS);
for (i = 0; i < ROWS; ++i)
{
xyz[i] = (int *)malloc((sizeof **xyz) * COLUMNS);
for (j = 0; j < COLUMNS; ++j)
xyz[i][j] = 0;
}

In the above code, I have left out the obvious efficiencies in order to emphasize the one and two dimensional array references. If you count the number of separate request for storage, calls to malloc(), you will find there are 51 of them, each no larger than 380 bytes. In other words, this two-dimensional array object is built from 51 small discontiguous pieces suitable for allocation in TPF. The creation code makes it clear that references to the integer elements of the two-dimensional array hide the discontiguous nature of the storage and simplify the programmer's view. This is the perfect marriage of C and TPF.