Compiled

Stanford CS143 Intro to Compilers course notes

Discussion on Reddit


Stanford's CS143 is the first compiler course I've finished. I followed the course on edX, but video lectures are also available (unofficially) on YouTube.

Coding assignments

The course offers four assignments corresponding to common compilation stages: lexing, parsing, semantic analysis, and code generation. Either C++ or Java can be used; I chose C++.

The edX course offers a VirtualBox image with the required tools, but I wasn't able to run it on my M1 Mac (macOS 13.2.1). Instead, I found a quite useful Docker image. I've also compared my solutions with several other people's homework available on GitHub.

Content

Intro

There are so many programming languages because there are many different domains. New languages are created to fill a void, and old languages are slow to change. There is no universally accepted metric for language design, and the dominant cost of a language is programmer training.

Lexical analysis

The lexer reads the input source code and produces a stream of tokens. Each token is a pair of a token class and the matched substring itself (called a lexeme). The correspondence between a set of strings and a token class can be defined through regular expressions and implemented using finite automata. An NFA (Non-deterministic Finite Automaton) can be converted to a DFA (Deterministic Finite Automaton), which in turn can be used to generate an efficient table-driven implementation. Tools like Flex can generate lexer code.
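
As a minimal sketch of the table-driven idea (the token classes, the transition table, and the Token struct are invented for illustration; a real lexer handles far more token classes), a DFA-based lexer for integers and '+' might look like this in C++:

#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

enum class TokenClass { Int, Plus, Error };

struct Token {
    TokenClass cls;
    std::string lexeme; // the matched substring
};

// Character classes: 0 = digit, 1 = '+', 2 = anything else.
int char_class(char c) {
    if (std::isdigit(static_cast<unsigned char>(c))) return 0;
    return c == '+' ? 1 : 2;
}

// DFA transition table: state x character class -> next state (-1 = reject).
// State 0 = start; state 1 = inside an integer literal; state 2 = read a '+'.
const int kTable[3][3] = {
    {1, 2, -1},   // from start
    {1, -1, -1},  // from integer literal: more digits extend it
    {-1, -1, -1}, // '+' is always a single character
};

std::vector<Token> lex(const std::string& input) {
    std::vector<Token> tokens;
    std::size_t pos = 0;
    while (pos < input.size()) {
        if (std::isspace(static_cast<unsigned char>(input[pos]))) { ++pos; continue; }
        // Run the DFA greedily: take the longest match ("maximal munch").
        int state = 0;
        std::size_t start = pos;
        while (pos < input.size() && kTable[state][char_class(input[pos])] != -1)
            state = kTable[state][char_class(input[pos++])];
        if (state == 1 || state == 2)
            tokens.push_back({state == 1 ? TokenClass::Int : TokenClass::Plus,
                              input.substr(start, pos - start)});
        else
            tokens.push_back({TokenClass::Error, std::string(1, input[pos++])});
    }
    return tokens;
}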

Parsing

It is useful to have a grammar for your language. A context-free grammar (CFG) is one common notation; it consists of terminals, non-terminals, and production rules. Rules are applied until no non-terminals remain. Context-free means that the left side of a rule is always a single non-terminal.

Given a string and a context-free grammar, a parse tree can be constructed, either top-down or bottom-up. A CFG can be formally written down in Backus–Naur form. Given the grammar, tools like Bison or Yacc can produce parsing code, but the easiest path is a hand-written recursive descent parser (see the sketch below). Parsing produces an Abstract Syntax Tree (AST). It has fewer details than a parse tree: it abstracts away the concrete syntax.
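
As a hedged illustration, here is a recursive descent parser in C++ for a tiny invented grammar (expr -> term ('+' term)*, term -> NUMBER | '(' expr ')'), with one function per non-terminal:

#include <cctype>
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <string>

// AST node for the sketch: either a number or an addition.
struct Node {
    char op;                        // '+' for addition, 'n' for a number
    long value = 0;                 // used when op == 'n'
    std::unique_ptr<Node> lhs, rhs; // used when op == '+'
};

class Parser {
public:
    explicit Parser(std::string src) : src_(std::move(src)) {}

    // expr -> term ('+' term)*
    std::unique_ptr<Node> expr() {
        auto left = term();
        while (peek() == '+') {
            ++pos_;
            auto sum = std::make_unique<Node>();
            sum->op = '+';
            sum->lhs = std::move(left);
            sum->rhs = term();
            left = std::move(sum);
        }
        return left;
    }

private:
    // term -> NUMBER | '(' expr ')'
    std::unique_ptr<Node> term() {
        if (peek() == '(') {
            ++pos_;
            auto inner = expr();
            expect(')');
            return inner;
        }
        if (!std::isdigit(static_cast<unsigned char>(peek())))
            throw std::runtime_error("number expected");
        auto num = std::make_unique<Node>();
        num->op = 'n';
        while (pos_ < src_.size() && std::isdigit(static_cast<unsigned char>(src_[pos_])))
            num->value = num->value * 10 + (src_[pos_++] - '0');
        return num;
    }

    // Skips whitespace and returns the next character without consuming it.
    char peek() {
        while (pos_ < src_.size() && std::isspace(static_cast<unsigned char>(src_[pos_]))) ++pos_;
        return pos_ < src_.size() ? src_[pos_] : '\0';
    }

    void expect(char c) {
        if (peek() != c) throw std::runtime_error(std::string("expected ") + c);
        ++pos_;
    }

    std::string src_;
    std::size_t pos_ = 0;
};

Calling Parser("1 + (2 + 3)").expr() returns the root of the AST.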

The course does not go into PEG and parser combinators.

Semantic analysis

Usually, there can be multiple variable scopes: global, function, and local. A symbol table helps to track the binding for every identifier. It can be implemented as a stack of hash tables, where every stack item represents a scope. AST nodes can be extended with a pointer to the symbol.
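
A minimal sketch of that structure in C++ (the Symbol type and method names are invented, not the course's support code):

#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

struct Symbol { std::string type; /* plus whatever the binding needs */ };

// A stack of hash tables: one table per scope.
class SymbolTable {
public:
    void enter_scope() { scopes_.emplace_back(); }
    void exit_scope()  { scopes_.pop_back(); }

    // Bind an identifier in the innermost scope (assumes enter_scope was called).
    void add(const std::string& name, Symbol sym) { scopes_.back()[name] = std::move(sym); }

    // Look up from the innermost scope outwards, so inner bindings shadow outer ones.
    std::optional<Symbol> lookup(const std::string& name) const {
        for (auto it = scopes_.rbegin(); it != scopes_.rend(); ++it) {
            auto found = it->find(name);
            if (found != it->end()) return found->second;
        }
        return std::nullopt;
    }

private:
    std::vector<std::unordered_map<std::string, Symbol>> scopes_;
};

enter_scope is called when the analyzer walks into a new block, and exit_scope discards all of that block's bindings at once.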

Type checking

A type is a set of values and a set of operations on those values. Assembly is often untyped, so the same instruction can represent adding two integers and adding an integer to a pointer. Type checking helps to ensure the intended interpretation of values. Type checker rules can be formally defined using logical rules of inference:

⊢ e1: Int  ⊢ e2: Int
--------------------
   ⊢ e1 + e2: Int

It reads as follows: if it is provable that e1 has type Int and e2 has type Int (the hypotheses), then the expression e1 + e2 has type Int (the conclusion).
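
In code, a type checker implements such rules as a recursive case analysis over the AST. A sketch over a toy AST (the node and type representations are invented here):

#include <memory>
#include <stdexcept>
#include <string>

// A toy AST: integer literals, boolean literals, and '+' nodes.
struct Expr {
    enum Kind { IntLit, BoolLit, Plus } kind;
    std::unique_ptr<Expr> lhs, rhs; // used by Plus
};

// Returns the static type of an expression or throws a type error.
std::string type_check(const Expr& e) {
    switch (e.kind) {
        case Expr::IntLit:  return "Int";
        case Expr::BoolLit: return "Bool";
        case Expr::Plus:
            // The rule above: both hypotheses must be provable...
            if (type_check(*e.lhs) != "Int" || type_check(*e.rhs) != "Int")
                throw std::runtime_error("operands of + must be Int");
            return "Int"; // ...then the conclusion assigns Int to e1 + e2.
    }
    throw std::runtime_error("unreachable");
}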

Inheritance relations between classes introduce subtyping: X ≤ Y means X is a subtype of Y.

Soundness theorem: dynamic_type(E) ≤ static_type(E) for all expressions E.

Memory model

The operating system allocates memory for a program, loads the code into it, and then jumps to the entry label. There are four common areas in program memory: code, static data, stack, and heap. Each function call begins with the construction of an activation record (also known as a stack frame). It can include the function result, arguments, the old frame pointer (also known as a base pointer), and the return address. The frame pointer is needed to calculate the locations of parameters and local variables.

Low addr.

+----------+
|          |
|   Code   |
|          |
+----------+
|          |
|  Static  |
|   data   |
|          |
+----------+
|          |
|  Stack   |
|    +     |
|    |     |
+----v-----+ <--+ Stack
|          |      pointer
|          |
|          |
|          |
+----^-----+ <--+ Heap
|    |     |      pointer
|    +     |
|   Heap   |
|          |
+----------+

High addr.

Code generation

The course uses the MIPS architecture as a target for code generation. On macOS, MIPS assembly can be simulated with SPIM (also available through Homebrew).

The simplest model is a stack machine. An N-register stack machine stores the top N items of the stack in registers; with N = 1 the single register is called the accumulator. Stack machine code takes less space than register machine code (WebAssembly and the JVM use stack machine code).
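
A sketch of code generation for a 1-register stack machine over a toy expression AST, emitting MIPS-style instructions in the accumulator-plus-stack scheme (the AST type is invented for this example; a real code generator handles many more node kinds):

#include <iostream>
#include <memory>

// A toy expression: an integer literal or an addition.
struct Expr {
    enum Kind { IntLit, Plus } kind;
    long value = 0;                 // used by IntLit
    std::unique_ptr<Expr> lhs, rhs; // used by Plus
};

// Convention: every expression leaves its result in the accumulator $a0;
// intermediate values live on the stack.
void cgen(const Expr& e) {
    if (e.kind == Expr::IntLit) {
        std::cout << "  li $a0, " << e.value << "\n";
        return;
    }
    cgen(*e.lhs);                         // result of e1 in $a0
    std::cout << "  sw $a0, 0($sp)\n"     // push it
              << "  addiu $sp, $sp, -4\n";
    cgen(*e.rhs);                         // result of e2 in $a0
    std::cout << "  lw $t1, 4($sp)\n"     // reload e1's result
              << "  add $a0, $t1, $a0\n"  // $a0 = e1 + e2
              << "  addiu $sp, $sp, 4\n"; // pop
}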

Optimization

An AST is enough to generate machine code but is not convenient for applying optimizations. An intermediate representation (IR) has more details than the source language but fewer details than the target language (assembly). An IR can assume an unlimited number of registers and provide high-level instructions.

A popular candidate for a lower-level intermediate representation is linear IR; the most common form is three-address code. Linear IR usually has explicit jumps, which makes it easy to find basic blocks: linear sequences of code with a single entry at the beginning and a single exit at the end. A control flow graph (CFG, not to be confused with a context-free grammar) can be built from basic blocks. Some optimizations are simpler if every register in the IR occurs only once on the left side of an assignment (static single assignment form, SSA).
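
For instance, assuming a generic three-address syntax, the statement a := (b + c) * d might be lowered to:

t1 := b + c
t2 := t1 * d
a  := t2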

Three types of optimizations:

  1. Local – apply to a single basic block. These include algebraic simplifications, common sub-expression elimination, and copy propagation (see the example after this list).
  2. Global – apply to the whole control flow graph of a function.
  3. Interprocedural – span multiple functions, up to the whole program.
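
As an example of a local optimization, common sub-expression elimination rewrites a repeated computation within a basic block (again in illustrative three-address code):

t1 := b * c            t1 := b * c
t2 := b * c     →      t2 := t1
a  := t1 + t2          a  := t1 + t2

Copy propagation can then replace the use of t2 with t1, leaving the middle assignment as dead code.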

It can be dangerous to do constant folding for floating-point values in the case of cross-compilation: the host's floating-point arithmetic may not match the target's.

Peephole optimizations are applied directly to machine code, rewriting short instruction sequences; in the example below, the second move is redundant:

mov $a, $b   →   mov $a, $b
mov $b, $a

The CFG enables dataflow analysis. One instance is liveness analysis: a variable is live if it might be used in the future. Liveness analysis is needed for register allocation.
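
In the textbook formulation (standard dataflow notation, not course-specific), liveness is computed by iterating two equations over the CFG until a fixed point is reached, where use[n] is the set of variables read by node n and def[n] the set it writes:

in[n]  = use[n] ∪ (out[n] − def[n])
out[n] = ∪ { in[s] : s is a successor of n }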

A cache is storage between registers and main memory. A cache miss leads to a read from memory, which is expensive in terms of cycles. Loop interchange can reduce cache misses by making array accesses sequential in memory.
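
A classic illustration in C++: with row-major arrays, interchanging the loops below turns strided accesses into sequential ones.

const int N = 1024;
double a[N][N]; // row-major: a[i][0..N-1] is contiguous in memory

// Inner loop strides down a column: each access touches a different cache line.
void column_major_walk() {
    for (int j = 0; j < N; ++j)
        for (int i = 0; i < N; ++i)
            a[i][j] += 1.0;
}

// After loop interchange the inner loop walks memory sequentially.
void row_major_walk() {
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            a[i][j] += 1.0;
}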

Register allocation

Register allocation is the problem of replacing the unlimited virtual registers of the IR with concrete machine registers. That requires mapping many virtual registers to one machine register, so we need to know which values are live at the same time. A Register Interference Graph (RIG) is built with an edge between two nodes whenever they are live at the same time; assigning registers then amounts to solving the graph coloring problem on the RIG. If at some point there are more live values than there are registers, some will be "spilled" into memory, and the instructions using those values will need to be padded with load/store instructions.
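
A minimal sketch of the coloring step in C++ (a naive greedy pass rather than the classic simplify-and-spill heuristic; the graph representation is invented):

#include <cstddef>
#include <vector>

// rig[v] lists the neighbours of virtual register v in the interference graph.
// Returns a colour (machine register index) per node; -1 means "spill v".
std::vector<int> colour_rig(const std::vector<std::vector<int>>& rig, int k) {
    std::vector<int> colour(rig.size(), -1);
    for (std::size_t v = 0; v < rig.size(); ++v) {
        std::vector<bool> taken(k, false);
        for (int w : rig[v])               // colours already used by neighbours
            if (colour[w] >= 0) taken[colour[w]] = true;
        for (int c = 0; c < k; ++c)
            if (!taken[c]) { colour[v] = c; break; }
        // If no colour was free, colour[v] stays -1 and v is spilled to memory.
    }
    return colour;
}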

Automatic memory management

Manual memory management is hard; typical problems are forgetting to deallocate and dereferencing a dangling pointer. The two main approaches to automatic memory management are garbage collection and reference counting.

The simplest technique for garbage collection (GC) is mark and sweep. In the mark stage, we follow all pointers, starting from the registers and the stack, marking the objects we find. Then, when we are out of memory, we scan the whole heap and deallocate all unmarked objects. Because mark and sweep does not move objects, it can be beneficial when pointers are exposed in the language and objects therefore cannot be relocated.
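
A compact sketch in C++ over a toy heap of objects with explicit child pointers (a real collector discovers roots and object layouts very differently):

#include <vector>

struct Object {
    bool marked = false;
    std::vector<Object*> children; // outgoing pointers
};

// Mark: follow all pointers reachable from a root.
void mark(Object* obj) {
    if (obj == nullptr || obj->marked) return;
    obj->marked = true;
    for (Object* child : obj->children) mark(child);
}

// Sweep: free every unmarked object, unmark survivors for the next cycle.
void sweep(std::vector<Object*>& heap) {
    std::vector<Object*> live;
    for (Object* obj : heap) {
        if (obj->marked) {
            obj->marked = false;
            live.push_back(obj);
        } else {
            delete obj;
        }
    }
    heap = std::move(live);
}

void collect(std::vector<Object*>& heap, const std::vector<Object*>& roots) {
    for (Object* root : roots) mark(root);
    sweep(heap);
}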

A more efficient approach is called Stop and Copy: all heap memory is divided into two halves, "old" and "new". Allocation happens inside "old"; when the "old" space is full, we copy all reachable objects from "old" to "new" (fixing up pointers along the way) and then swap the roles of the two halves.


With reference counting, the deallocation of dead objects is distributed across program execution, so there are fewer pauses. Every assignment increments the counter of the newly referenced object and decrements the counter of the previously referenced one:

// x points to object A
// y points to object B

x := y   →  RC(B) += 1
            RC(A) -= 1
            if RC(A) == 0 then free A
            x := y

Reference counting cannot handle circular structures, and there is overhead for each assignment instruction.


Automatic memory management does not solve memory leaks: an object that is still reachable is never collected, so it is still the programmer's job to drop references, for example by setting pointers to null.

More advanced algorithms can do garbage collection concurrently with program execution.

Conclusion

Overall, I enjoyed the course. I found the in-depth parsing material somewhat boring, while the later chapters on code generation were much more interesting to me. The lectures and the coding assignments felt a bit disconnected a few times. I found Introduction to Compilers and Language Design by Douglas Thain to be a good companion book for this course.