diff --git a/src/cmd/compile/README.md b/src/cmd/compile/README.md
new file mode 100644
index 0000000000..c8369c7c8c
--- /dev/null
+++ b/src/cmd/compile/README.md
@@ -0,0 +1,116 @@
+// Copyright 2018 The Go Authors. All rights reserved.
+// Use of this source code is governed by a BSD-style
+// license that can be found in the LICENSE file.
+
+## Introduction to the Go compiler
+
+`cmd/compile` contains the main packages that form the Go compiler. The compiler
+may be logically split in four phases, which we will briefly describe alongside
+the list of packages that contain their code.
+
+You may sometimes hear the terms "front-end" and "back-end" when referring to
+the compiler. Roughly speaking, these translate to the first two and last two
+phases we are going to list here. A third term, "middle-end", often refers to
+much of the work that happens in the second phase.
+
+Note that the `go/*` family of packages, such as `go/parser` and `go/types`,
+have no relation to the compiler. Since the compiler was initially written in C,
+the `go/*` packages were developed to enable writing tools working with Go code,
+such as `gofmt` and `vet`.
+
+It should be clarified that the name "gc" stands for "Go compiler", and has
+little to do with uppercase GC, which stands for garbage collection.
+
+### 1. Parsing
+
+* `cmd/compile/internal/syntax` (lexer, parser, syntax tree)
+
+In the first phase of compilation, source code is tokenized (lexical analysis),
+parsed (syntactic analysis), and a syntax tree is constructed for each source
+file.
+
+Each syntax tree is an exact representation of the respective source file, with
+nodes corresponding to the various elements of the source such as expressions,
+declarations, and statements. The syntax tree also includes position information
+which is used for error reporting and the creation of debugging information.
+
+### 2. Type-checking and AST transformations
+
+* `cmd/compile/internal/gc` (create compiler AST, type checking, AST transformations)
+
+The gc package includes an AST definition carried over from when it was written
+in C. All of its code is written in terms of it, so the first thing that the gc
+package must do is convert the syntax package's syntax tree to the compiler's
+AST representation. This extra step may be refactored away in the future.
+
+The AST is then type-checked. The first steps are name resolution and type
+inference, which determine which object belongs to which identifier, and what
+type each expression has. Type-checking includes certain extra checks, such as
+"declared and not used" as well as determining whether or not a function
+terminates.
+
+Certain transformations are also done on the AST. Some nodes are refined based
+on type information, such as string additions being split from the arithmetic
+addition node type. Some other examples are dead code elimination, function call
+inlining, and escape analysis.
+
+### 3. Generic SSA
+
+* `cmd/compile/internal/gc` (converting to SSA)
+* `cmd/compile/internal/ssa` (SSA passes and rules)
+
+
+In this phase, the AST is converted into Static Single Assignment (SSA) form, a
+lower-level intermediate representation with specific properties that make it
+easier to implement optimizations and to eventually generate machine code from
+it.
+
+During this conversion, function intrinsics are applied. These are special
+functions that the compiler has been taught to replace with heavily optimized
+code on a case-by-case basis.
+
+Certain nodes are also lowered into simpler components during the AST to SSA
+conversion, so that the rest of the compiler can work with them. For instance,
+the copy builtin is replaced by memory moves, and range loops are rewritten into
+for loops. Some of these currently happen before the conversion to SSA due to
+historical reasons, but the long-term plan is to move all of them here.
+
+Then, a series of machine-independent passes and rules are applied. These do not
+concern any single computer architecture, and thus run on all `GOARCH` variants.
+
+Some examples of these generic passes include dead code elimination, removal of
+unneeded nil checks, and removal of unused branches. The generic rewrite rules
+mainly concern expressions, such as replacing some expressions with constant
+values, and optimizing multiplications and float operations.
+
+### 4. Generating machine code
+
+* `cmd/compile/internal/ssa` (SSA lowering and arch-specific passes)
+* `cmd/internal/obj` (machine code generation)
+
+The machine-dependent phase of the compiler begins with the "lower" pass, which
+rewrites generic values into their machine-specific variants. For example, on
+amd64 memory operands are possible, so many load-store operations may be combined.
+
+Note that the lower pass runs all machine-specific rewrite rules, and thus it
+currently applies lots of optimizations too.
+
+Once the SSA has been "lowered" and is more specific to the target architecture,
+the final code optimization passes are run. This includes yet another dead code
+elimination pass, moving values closer to their uses, the removal of local
+variables that are never read from, and register allocation.
+
+Other important pieces of work done as part of this step include stack frame
+layout, which assigns stack offsets to local variables, and pointer liveness
+analysis, which computes which on-stack pointers are live at each GC safe point.
+
+At the end of the SSA generation phase, Go functions have been transformed into
+a series of obj.Prog instructions. These are passed to the assembler
+(`cmd/internal/obj`), which turns them into machine code and writes out the
+final object file. The object file will also contain reflect data, export data,
+and debugging information.
+
+### Further reading
+
+To dig deeper into how the SSA package works, including its passes and rules,
+head to `cmd/compile/internal/ssa/README.md`.