Migrating from Assembler to C - the Secrets of Alchemy Revealed
by Doron Friedman

The Challenge

The TPF world is now facing the daunting challenge of conversion from Assembler to C. One factor that makes this task so difficult is the scale involved. While converting one program can be a relatively straightforward task for a programmer fluent with both programming languages, the conversion of large, mission-critical software systems with the retention of complete backward compatibility is a task that can "entertain" even a large IT department for a long time.

In an industry that supports and maintains legacy systems -- representing 1000’s of man-years of coding effort -- a total reengineering solution is a luxury that cannot be afforded. In addition, reengineering involves a much greater risk of introducing design or programming errors into a project. Reengineering can also produce enough drift in the system specification that backward compatibility can be lost. The new bugs it introduces are subtle and much more difficult to detect than those introduced in an automated process. So clearly, some degree of automation must be applied if legacy systems are to be maintained at a cost-effective overhead or modified/enhanced at a cost-effective investment.

Currently, automatic conversion employs such brute-force methods as translating each code line in the source language to an equivalent statement in the target language. In some cases, such as converting from Pascal to C, this approach may give satisfying results. However, when you begin examining TPF programs in Assembler and their equivalent in C, you can quickly realize why "simulated Assembler" C code will not do.

Assembler is a low-level language, intimately tied to the operating system, such as in the intensive use of registers. Converting such code on a line-to-line basis to C is possible, but the resulting code is nothing but a run-time simulation of the source Assembler program. This style of coding completely bypasses the advantage of a strong C compiler, producing an unacceptable performance penalty. Even more importantly, C code that reproduces Assembler on a line-by-line basis is unreadable.

Based on these conclusions, we started looking for a non-traditional solution. We found it buried during an archaeological (in software industry terms) expedition to academia.

Artificial Intelligence to the Rescue

In the late eighties, The Programmer’s Apprentice project was conducted in the MIT Artificial Intelligence lab, and the principal investigators, Rich and Waters, invented the "Plan Calculus" representation of program structure as a part of their effort.

Plan Calculus is a canonical, mathematical abstraction used to describe imperative programs. It is based on the insight that programs can be seen as the combination of their control-flow and data-flow. Plan Calculus allows an automated analysis in which the control-flow is abstracted from a program, and then the data-flow is abstracted. A "plan" in Plan Calculus is a superimposition of these two abstractions, which can be made "human-readable" with the use of a neat notation convention.

Once an abstract representation of the program is constructed, it is then possible to transcend the details of the source language that are irrelevant to the algorithm. For example, one of the fundamental aspects of assembly-language programming, the use of registers, is abstracted away because registers are only used for affecting the flow of data. This is illustrated in the following example:

Assembler code:

L R3,VAL1

S R3,VAL2

ST R3,VAL1

C code:

VAL1 -= VAL2;

Plan Calculus can be applied to program translation. The first stage of translation is parsing the source program and generating the abstract representation. The second stage consists of further analysis and transformations of the abstract representation. Cliche recognition and other advanced techniques can be implemented in this stage. The last stage is the code generation of the algorithm in the target language.

The Programmer’s Apprentice was not transferred from the MIT AI lab into the real world until 1992, when Yishai Feldman, a research scientist, brought this knowledge back with him to Israel. Since then, Sapiens International has adapted this elegant technique for real-world applications and commercial products such as automatic translation from IBM Assembler to portable C; automatic remediation of the year 2000 problem in Assembler programs; automatic remediation of the Euro conversion problem in Assembler programs; and data-flow analysis for interface mapping in large systems. This new technique was incorporated into Sapiens’ FalconTPF.

The best proof that this is a truly useful, general paradigm came from the Year 2000 chaos. When Sapiens developed a tool for automatic translation of Assembler to C (based on the techniques outlined above), the Year 2000 issue was not a part of the project. However, a simple analysis component for tracing date occurrences in Assembler code was derived in just four weeks. The analysis component was incorporated into the translation system, replacing the C code generation component. The result was the rapid release of a beta version of Falcon2000 - Sapiens’ Year 2000 Assembler analysis tool. The tool’s analysis power by far supercedes any of the "first-generation" analysis products.

Translation Details

Now let’s get into some details on how these techniques help us achieve a better translation. First, control-flow analysis enables FalconTPF to replace assembler branches ("goto") with structural C code using if, else, do, and while.

Example:

ASM code:

LOOP DS 0H

CLC ENTRADA,0(R4)

BE INCR

CLC ENTR,0(R4)

BE INCR

MVC OUT,0(R4)

B ELOOP

INCR DS 0H

LA R4,1(R4)

CLI R4,EFES

BE LOOP

C code:

do

{

if ((memcmp(ENTRADA,r4ucp,6) != 0) &&

(memcmp(ENTR,r4ucp,6) != 0))

{

memcpy(OUT,r4ucp,6);

break;

}

r4ucp++;

}

while (*r4ucp == EFES);

Control-flow analysis enables the identification of unreachable-code, while data-flow analysis enables the identification of unused-code. In most legacy system projects, both are omitted from the translated C.

Example:

ASM code:

L R6,FLTCOST

S R6,FLTDIFR

LR R4,R6

SRL R4,2 REDUNDANT CODE

ST R6,FLTCOST

B NET

CR R4,R3 DEAD CODE

BE NOINC

NET DS 0H

C code:

FLTCOST -= FLTDIFR;

A more complicated task is the detection of subroutine interfaces. While in assembly language all variables are global, in C each routine can have its own local variables and parameters (which can be sent either by value or reference.) For each Assembler routine, FalconTPF develops a C function interface and declares it accordingly.

Another challenge is the detection of data types. Assembly-language programs contain little information about data types, and the information is often ambiguous. The computation of data types is based on the propagation of type information from actual usage of values. For example, a "Load Halfword" or "Store Halfword" operation indicates that its operand is a halfword (16 bits), even though 32-bit registers are commonly passed to such an operation as an operand (meaning that the operation involves only one part of the register.) The type of operation can also give partial information about the type of the value: for example, a value that is dereferenced must be a pointer, and a value that is used in an integer addition could be an integer or a pointer. Types can also be propagated across operations. For example, the addition of two integers yields an integer, while the addition of an integer and a pointer yields a pointer. Once the type of a value that is dereferenced is known, the type of the pointer can also be deduced.

Type information collected from such sources is propagated through the data-flow network. Each value is associated with a set of possible types, which shrinks as the result of information propagated from neighboring nodes. In this way, the type information is refined until the best possible type for a given variable in the output C code is established.

FalconTPF also applies some program algorithm transformations. Utilizing a domain knowledge information base, FalconTPF achieves optimizations that cannot be reached with general compilers. For example, the case of saving global registers at the entrance of an Assembler subroutine and restoring the values before the exit is a cliche that, once recognized by FalconTPF, can be completely eliminated from the output C program.

As another example, consider the following pattern recognized:

ASM code:

MVI T_LINE,C' ' INITIALIZE THE TEXT LINE

MVC T_LINE+1(L'T_LINE-1),T_LINE

C code:

memset(T_LINE, ' ', 63);

TPF assembler programs introduce additional problems. One problem is the extensive use of system macros. FalconTPF addresses this by building a knowledge base. The knowledge base maintains an abstract representation of each macro call with its parameters. During code generation the translation proceeds as usual: the translation of the macro-call is a matching C function call.

Note that the macro-calls have to be inserted to the abstract representation, to keep data and control flow correct.

Example:

ASM code:

LEVTA LEVEL=2,NTUSED=TXC3D203

RELCC D2

C code:

If (levtest(D2) != 0)

{

relcc(D2);

}

The TPF 4K-size limitation forces the break down of one logical program to many small modules. In C this limitation is gone, one may want to group some assembler modules to be translated into a single C source file. FalconTPF implements this not only by putting the output sources together, but also by calculating the inter-module interfaces. This is an extension of calculation of inner routine interfaces.

The Apprentice Approach

FalconTPF was implemented using the advanced concepts described above. But we fully understood that it is impossible to achieve a black-box that works on all input Assembler files, and outputs correct high-quality C code.

Mainly, some pathological programming styles cannot be supported (e.g., self-modifying code.) Secondly, some low-quality Assembler programming styles result in low-quality output C code (e.g., "spaghetti" Assembler code.) Another limitation was discovered as work progressed; FalconTPF abstracted the code in various ways. As a result, the original assembly-language programmers found it hard to debug FalconTPF-generated C code.

To overcome this problem, certain allowances were made in FalconTPF for ease of debugging. For example, temporary variables created by FalconTPF are named after the registers containing these values in the original code. But it was clear that one couldn’t expect to avoid some amount of programmer intervention to achieve high-quality output. The question was not whether manual intervention is needed in the conversion process, but rather how to maximize its efficiency in terms of cost-benefit.

Based on the positive experience with Sapiens’ Year 2000 projects, we have decided to provide programmers with a complete, productivity-enhancing Integrated Development Environment (IDE). This workbench assists the programmer in all conversion processes.

As an example, consider the task of mapping a large application. Since TPF-based applications are typically organized into many small (less than 4K) modules, it may be desirable to group some modules into individual C programs. FalconTPF assists in identifying packages: groups of interrelated Assembler modules that call each other. The FalconTPF graphical hierarchy browser allows you to examine these relationships at the beginning of the conversion effort so that you can plan the project appropriately.

Conclusion

We believe our work on FalconTPF provides a unique example of taking the best of academic and industrial approach. Academic work is typically elegant but detached from real-world issues, and thus does not scale up to large projects. The industrial approach is typically brute-force; this produces short-term solutions, quickly becomes obsolete, and seldom allows for generalization. Furthermore, we believe that this hybrid approach is a must for dealing with the software engineering challenge of large-scale conversion from Assembler to C.

Acknowledgements

I am grateful to Sapiens International for their support of this project. They have been of great assistance in the academic aspects of this work.

About the Author

Doron A. Friedman is a research scientist at Tel Aviv University, specializing in applying Artificial Intelligence techniques and principles to the domains of software engineering and animated virtual environments. For the last 2 years Doron has been a consultant for Sapiens International in the creation of Falcon – A Fast Assembly Language Converter of IBM Assembler programs. For more information on Falcon, visit: www.sapiens.com, or Email: falcon.NOSPAM@sapiens.com
MV 2007: remove the .NOSPAM to use the email address