Lab 1: Write a 3-address-code to C translator

    Objective

    The goal of this project is to write a translator from 3-address format to C. The purpose of this assignment is to

    1. Make you familiar with compiler intermediate formats, in this particular case, our 3-address format.
    2. Write a translator and test its correctness. You will also use this translator to test code transformations in future assignments. (This project is often the same one compiler writers will undertake when starting a new compiler, because it eases testing, assuming a correct C compiler.)
    3. Write example C code to test your translator.

    Project Description

    We provide 'csc', which compiles a subset of C (CS for C-Subset) to 3-address format. Your will write a translator to convert from this 3-address format back to C. Note that the generated C code will not resemble the original C code.

    The rest of this document describes the c-subset compiler, the 3-address intermediate format, the application binary interface (ABI), the implementation requirements, and the grading criteria.

    Getting Started On Lab1

    1. Download the c-subset compiler and example C files at  http://www.cs.utexas.edu/users/mckinley/380C/labs/cs380c_lab1.tgz
    2. Decompress the file on the cs unix machines using
      	"tar xvfz cs380c_lab1.tgz".
      
      This command will produce the following directory structure
      	cs380c_lab1
      		src
      		examples
      		lab1
      
      The src directory contains the C-subset compiler source file and a make script. The examples directory contains 10 example c files, and two bash scripts (check.sh, check-one.sh) for checking your lab1 implementation (see 5 below).
    3. To compile the compiler
      	cd src; ./make.sh  // your shell should be bash
      
      The command will produce a binary file for the compiler "csc" in the src directory.
    4. Compile the example c files by using "csc"
      	cd ../examples; ../src/csc loop.c;
      
      This command will print out the 3-Address Intermediate Format code to your screen.
    5. Implement your own 3-Address to C converter in the lab1 directory. Modify the script compile.sh to compile your source code if needed. Modify the script run.sh to invoke your translator.
    6. Verify your implementation. Modify the script examples/check-one.sh to set the variable THREE_ADDR_TO_C_TRANSLATOR (line5) to your own converter. Then invoke the command "./check-one.sh loop.c". It produces several files:
      	loop.3addr     //the 3-address format file generated by csc
      	loop.3addr.c   //the c file generated by your converter
      	loop.3addr.bin //the binary file generated by gcc compiling loop.3addr.c
      	loop.3addr.txt //the file after executing loop.3addr.bin
      	loop.gcc.bin   //the binary file after gcc compiles loop.c
      	loop.gcc.txt   //the file after executing loop.gcc.bin
      
      If the two .txt files are same, it means your converter works correctly for this c file. The output message to your screen should be:
      	loop.c
      	compiling loop.c
      	loop.c: In function 'main':
      	loop.c:10: warning: return type of 'main' is not 'int'
      	4a39f3634793c530783afa3ff0b2bd7b  loop.gcc.txt
      	4a39f3634793c530783afa3ff0b2bd7b  loop.3addr.txt
      
      The last two lines show the md5sum results of the two .txt file. They should be same.
    7. Run your converter on all example files, including those you create (see Turning in your assignment below). The command "check.sh" executes check-one.sh on every input example c file in the directory.
    8. Turn your implementation in. (Check "Turning In Your Assignment" section for more details.)

    Our High-Level Language (a subset of C)

    This section describes the high-level language C-subset language and the csc compiler. We first give a rough description, some examples, and then a formal description. The language includes:

    • Arithmetic operations include +, -, *, /, %, <, <=, ==, !=, >=, >.
    • One integer data type 'long' whose size is 8 bytes.
    • Void datatype.
    • Structs and arrays.
    • Functions and recursion.
    • Two namespaces/scopes -- global and local variables.
    • Functions return void (i.e., nothing). The way to communicate the result of a function is to use global variables. (Not recommended otherwise :-)!)
    • Parameters of functions can only be of type 'long'. You cannot pass structs, arrays, etc. as parameters.
    • No pointers.
    • No floating point.
    • No gotos.
    • The structured programming constructs: 'if' and 'while'.
    • The compiler supports only a single source file.

    Note: gcc must correctly compile the output C file of your translator. However, the C output itself is not restricted to C-subset. For example, you may use "goto" in your C output to simplify your implementation.

    The 3-Address Intermediate Format

    The csc compiler outputs 3-address format. Before taking a look at the specification of the 3-address format, let us look at a simple program and its corresponding representation.

    	long global_array[24];
    	
    	struct Point {
    	    long x;
    	    long y;
    	} p;
    	
    	void initialize_globals() {
    	    long i;
    	
    	    i = 0;
    	    while(i < 24) {
    	        global_array[i] = i;
    	    }
    	
    	    p.x = 13;
    	    p.y = 7;
    	}
    	
    	void simple_function(long a, long b) {
    	    long local_array[3];
    	
    	    local_array[0] = a;
    	    local_array[1] = b;
    	    local_array[2] = a + b;
    	
    	    if (global_array[2] > a) {
    	        global_array[3] = b + p.y;
    	    }
    	    p.x = local_array[a % 3];
    	}
    	
    	instr 1: nop
    	instr 2: enter 8
    	instr 3: move 0 i#-8
    	instr 4: cmplt i#-8 24
    	instr 5: blbc (4) [11]
    	instr 6: mul i#-8 8
    	instr 7: add global_array_base#32576 GP
    	instr 8: add (7) (6)
    	instr 9: store i#-8 (8)
    	instr 10: br [4]
    	instr 11: add p_base#32560 GP
    	instr 12: add (11) x_offset#0
    	instr 13: store 13 (12)
    	instr 14: add p_base#32560 GP
    	instr 15: add (14) y_offset#8
    	instr 16: store 7 (15)
    	instr 17: ret 0
    	
    	instr 18: enter 24
    	instr 19: mul 0 8
    	instr 20: add local_array_base#-24 FP
    	instr 21: add (20) (19)
    	instr 22: store a#24 (21)
    	instr 23: mul 1 8
    	instr 24: add local_array_base#-24 FP
    	instr 25: add (24) (23)
    	instr 26: store b#16 (25)
    	instr 27: mul 2 8
    	instr 28: add local_array_base#-24 FP
    	instr 29: add (28) (27)
    	instr 30: add a#24 b#16
    	instr 31: store (30) (29)
    	instr 32: mul 2 8
    	instr 33: add global_array_base#32576 GP
    	instr 34: add (33) (32)
    	instr 35: load (34)
    	instr 36: cmple (35) a#24
    	instr 37: blbs (36) [46]
    	instr 38: mul 3 8
    	instr 39: add global_array_base#32576 GP
    	instr 40: add (39) (38)
    	instr 41: add p_base#32560 GP
    	instr 42: add (41) y_offset#8
    	instr 43: load (42)
    	instr 44: add b#16 (43)
    	instr 45: store (44) (40)
    	instr 46: add p_base#32560 GP
    	instr 47: add (46) x_offset#0
    	instr 48: mod a#24 3
    	instr 49: mul (48) 8
    	instr 50: add local_array_base#-24 FP
    	instr 51: add (50) (49)
    	instr 52: load (51)
    	instr 53: store (52) (47)
    	instr 54: ret 16
    	instr 55: nop
    

    The 3-address format uses a simple RISC instruction set, and assumes an infinite number of virtual registers (we discuss the stack frame layout below in the Function ABI section). The operands to these instructions may be one of the following:

    • GP (global pointer): A pointer to the beginning of the global address space.
    • FP (frame pointer): A pointer to the beginning of the frame of the current function.
    • Constants: For example, 24
    • Address offsets: In the name, global_array_base#32576, the number following the # is the starting address offset of variable global_array relative to the GP or FP. The variable names before the # here and below are to help humans read the representation.
    • Field offsets: In y_offset#8, the number following the # is the starting address offset of field y relative to the starting address of the corresponding struct (p_base#32560 in this example).
    • Local variables (scalars): For example, a#24 represents a local variable and its offset within the stack frame. You should ignore the offset for now (you will use it later for register allocation). You may assume that the variable a is allocated to the virtual register a.
    • Register names: A (13) means virtual register r13. Arithmetic and some other instructions write their results to a register. (See below for the complete list.) For example, instruction 13 may write its result to r13.
    • Instruction labels: The [46] label represents a instruction to jump to (in this example, jump to instruction 46).

    The instruction set contains the following instructions:

    • Arithmetic instructions. The opcodes add, sub, mul, div, mod, neg, cmpeq, cmple, and cmplt perform arithmetic operations on their operand(s) and write the result to a virtual register. Instruction k always writes its result to rk. The meaning of the first six operands should be obvious. The cmpeq instruction tests equality, the cmple instruction tests less than or equal, and the cmplt instruction tests less than. For example,
      	instr 42: add (41) y_offset#8
      
      stands for:
      	inter_42: r42 = r42 + 8
      
    • The opcode nop does nothing.
      	instr 1: nop
      
    • Branch instructions. The opcodes are br, blbc, blbs, and call. The opcode br means unconditional branch. The opcodes blbc and blbs denote branch if the register is cleared (0) or set (1), respectively. The call is an unconditional "jump and link" instruction used for function calls. For example,
      	instr 37: blbs (36) [46]
      
      means:
      	instr_37: if (r36 != 0) goto instr_46;
      
    • Data movement instructions. The load and store instructions read from and write to memory. The move instruction copies virtual registers. For example,
      	instr 14: load (13)
      	instr 15: move i#-8 j#-16
      	instr 16: store (15) (11)
      
      means:
      	instr_14: r14 = *(r13)
      	instr_15: r15 = j = i
      	instr_16: *(r11) = r15
      
    • I/O instructions. The read, write, and wrl are pseudo opcodes that are used for input and output. The read opcode reads a long integer from stdin, the write opcode prints a long integer to stdout, and wrl prints a newline to stdout.
      	instr 45: write x#-64
      	instr 46: wrl
      
      means:
      	instr_45: WriteLong(x); //defined as printf(" %lld", x)
      	instr_46: WriteLine();  //defined as printf("\n");
      
    • The param (used in conjunction with the call instruction) instruction pushes its operand onto the stack (it is used by a function that will be called in the future).
      	instr 60: param (59)
      	instr 62: param from#40
      	instr 63: call [23]
      
      means:
      	instr_63: call function_23(r59, from);
      
    • The enter instruction denotes the beginning of a function. Its operand specifies the amount of memory in bytes to be allocated on the stack frame for local variables of that function.
    • The ret instruction denotes a function return. Its operand specifies the amount of memory for formal parameters that needs to be removed (popped) from the stack. The following function
      	void function(long p1, long p2, long p3) {
      		long x;
      		long y;
      	}
      
      is compiled into
      	instr 1: nop
      	instr 2: enter 16
      	instr 3: ret 24
      	instr 4: nop
      
      which denotes 16 bytes of storage for local variables x and y and then 24 bytes for the formal parameters p1, p2 and p3 that must be popped on a return.
    • The entrypc instruction denotes the beginning of the main function.
      	void main() {
      		long i;
      	}
      
      is compiled into
      	instr 1: nop
      	instr 2: entrypc
      	instr 3: enter 8
      	instr 4: ret 0
      	instr 5: nop
      

    Function Call ABI

    For our 3-address representation, we have used the following Application Binary Interface (ABI), which specifies the addresses/offsets for the global variables, local variables, and formal parameters.

    We layout structures and arrays out from lower to higher addresses. Since our only basic datatype occupies 8 bytes, all integers, structs and arrays are 8-byte aligned.

    We layout global variables starting at address 32768 and downwards towards zero. This organization implies that the maximum space for global variables is 32768. In your implementation, it would suffice to declare 32768 bytes of storage to hold all globals. In the above example, the global variables are global_array[24] and p. Their starting addresses, represented as offsets from GP are

           global_array_base = 32576 = (32768 - (24*8)), and
           p_base            = 32560 = (32768 - (24*8) - 16),
    
    respectively.

    The stack grows from higher to lower addresses. The stack frame for a function looks like:

      Low        +---------------------+
      addr.      |                     |
                 +---------------------+
       -32       |       ...           |
                 +---------------------+ <---+
       -24       |     3rd local       |     |
                 +---------------------+     |
       -16       |     2nd local       |     |
                 +---------------------+     |
       - 8       |     1st local       |     |
                 +---------------------+     |   Stack
       FP ---->  |        FP           |     |   frame
                 +---------------------+     |
         8       |        LNK          |     |
                 +---------------------+     |
        16       |     2nd param       |     |
                 +---------------------+     |
        24       |     1st param       |     |
                 +---------------------+ <---+
                 |                     |
                 +---------------------+   previous frame.
      High       |                     |
      addr.      +---------------------+
    

    Accordingly, in the first example, local_array, and the formal parameters a, and b, have the following offsets from FP

         local_array_base: -24
         a:                 24
         b:                 16
    

    In your generated C code, you may use the same ABI, or you can define your own. For instance, you may choose to explicitly manage a stack to handle function calls, or use C's call-return mechanism to directly handle function calls in the 3-address code.

    EBNF Grammar Specification

    We provide a partial EBNF grammar specification to help you write C-subset programs and understand the 3-address format output generated by csc.

    The EBNF grammar for this language is given below.

        Factor = Designator | Number | "(" Expression ")".
        Term = Factor {("*" | "/" | "%") Factor}.
        SimpleExpr = ["+" | "-"] Term {("+" | "-") Term}.
        EqualityExpr = SimpleExpr [("<" | "<=" | ">" | ">=") SimpleExpr].
        Expression = EqualityExpr [("==" | "!=") EqualityExpr].
        ConstExpression = Expression.
    
        FieldList = VariableDeclaration {VariableDeclaration}.
        StructType = "struct" Ident ["{" FieldList "}"].
        Type = Ident | StructType.
        IdentArray = Ident {"[" ConstExpression "]"}.
        IdentList = IdentArray {"," IdentArray}.
        VariableDeclaration = Type IdentList ";".
        ConstantDeclaration = "const" Type Ident "=" ConstExpression ";".
    
        Designator = Ident {("." Ident) | ("[" Expression "]")}.
        Assignment = Designator "=" Expression ";".
        ExpList = Expression {"," Expression}.
        ProcedureCall = Ident "(" [ExpList] ")" ";".
        IfStatement = "if" "(" Expression ")" "{" StatementSequence "}" 
                       ["else" "{" StatementSequence "}"].
        WhileStatement = "while" "(" Expression ")" "{" StatementSequence "}".
        Statement = [Assignment | ProcedureCall | IfStatement | WhileStatement].
        StatementSequence = {Statement}.
    
        FPSection = Type IdentArray.
        FormalParameters = FPSection {"," FPSection}.
        ProcedureHeading = Ident "(" [FormalParameters] ")".
        ProcedureBody = {ConstDeclaration | VariableDeclaration} StatementSequence.
        ProcedureDeclaration = "void" ProcedureHeading "{" ProcedureBody "}".
    
        Program = {ConstantDeclaration | VariableDeclaration} ProcedureDeclaration {ProcedureDeclaration}.
    
    Number and Ident are terminal symbols. Program is the start symbol.

    Implementation and Turning in Your Assignment

    You are free to implement your translator in the programming language of your choice. The translator "must" compile (if necessary) and run on regular linux machines.

    You should test your translator with the example programs we provide. In addition, please provide at least three non-trivial source programs for testing.

    Your assignment should contain the following.

    1. A single tgz file named lab1-groupX.tgz (X should be replaced by your group number), which, when extracted, creates directory lab1.
    2. The lab1 directory can contain sub-directories.
    3. The lab1 directory should contain the following files:
      1. example1.c, example2.c, example3.c - Your non-trivial test programs. At the top of each file, explain what features this program tests.
      2. compile.sh - A script that compiles your source code, if needed. If you are using an interpreted language, this script can be empty.
      3. run.sh - A script that invokes your translator. Your translator should read 3-address code as input from stdin and print the generated C code to stdout. It should also generate a header file containing the declaration of all the functions.
      4. README - Please write your name(s) and student ID(s) here. The lab1 directory already exists with these files in the tarball you downloaded.

    Turn in your assignment by emailing the final lab1-groupX.tgz to the instructor.

     

    Grading Rubric

    1. Correct translation of arithmetic operations (10 pts)
    2. Correct translation of branch instructions (10 pts)
    3. Correct translation of data transfer instructions
      1. Correct calculation of the addresses (7.5 pts)
      2. Correct translation of the move instruction (7.5 pts)
      3. Correct translation of the load instruction (7.5 pts)
      4. Correct translation of the store instruction (7.5 pts)
    4. Correct translation of I/O instructions (5 pts)
    5. Correct translation of call and return semantics
      1. Correct parameter passing (10 pts)
      2. Correct stack clean up (10 pts)
      3. Correct stack frame management (15 pts)
    6. Correct space allocation and access to the global and local variable (10 pts)

  • Acknowledgements

    These assignments are derived from Prof. Kathryn McKinley's Advanced Compiler Techniques class.