cmd/cgo: add implementation comment

R=golang-dev, r, bradfitz, iant
CC=golang-dev
https://golang.org/cl/7407050
2024-11-22 09:14:40 -07:00 · 2013-02-27 20:55:01 -08:00 · 2013-02-27 20:55:01 -08:00 · 062a239046
commit 062a239046
parent 3b69efb010
1 changed files with 263 additions and 0 deletions
--- a/src/cmd/cgo/doc.go
+++ b/src/cmd/cgo/doc.go
@@ -134,3 +134,266 @@ See "C? Go? Cgo!" for an introduction to using cgo:
 http://golang.org/doc/articles/c_go_cgo.html
 */
 package main
+
+/*
+Implementation details.
+
+Cgo provides a way for Go programs to call C code linked into the same
+address space. This comment explains the operation of cgo.
+
+Cgo reads a set of Go source files and looks for statements saying
+import "C". If the import has a doc comment, that comment is
+taken as literal C code to be used as a preamble to any C code
+generated by cgo. A typical preamble #includes necessary definitions:
+
+	// #include <stdio.h>
+	import "C"
+
+For more details about the usage of cgo, see the documentation
+comment at the top of this file.
+
+Understanding C
+
+Cgo scans the Go source files that import "C" for uses of that
+package, such as C.puts. It collects all such identifiers. The next
+step is to determine the kind of each name. In C.xxx the xxx might refer
+to a type, a function, a constant, or a global variable. Cgo must
+decide which.
+
+The obvious thing for cgo to do is to process the preamble, expanding
+#includes and processing the corresponding C code. That would require
+a full C parser and type checker that was also aware of any extensions
+known to the system compiler (for example, all the GNU C extensions) as
+well as the system-specific header locations and system-specific
+pre-#defined macros. This is certainly possible to do, but it is an
+enormous amount of work.
+
+Cgo takes a different approach. It determines the meaning of C
+identifiers not by parsing C code but by feeding carefully constructed
+programs into the system C compiler and interpreting the generated
+error messages, debug information, and object files. In practice,
+parsing these is significantly less work and more robust than parsing
+C source.
+
+Cgo first invokes gcc -E -dM on the preamble, in order to find out
+about simple #defines for constants and the like. These are recorded
+for later use.
+
+Next, cgo needs to identify the kind of each identifier. For the
+identifiers C.foo and C.bar, cgo generates this C program:
+
+	<preamble>
+	void __cgo__f__(void) {
+	#line 1 "cgo-test"
+		foo;
+		enum { _cgo_enum_0 = foo };
+		bar;
+		enum { _cgo_enum_1 = bar };
+	}
+
+This program will not compile, but cgo can look at the error messages
+to infer the kind of each identifier. The line number given in the
+error tells cgo which identifier is involved.
+
+An error like "unexpected type name" or "useless type name in empty
+declaration" or "declaration does not declare anything" tells cgo that
+the identifier is a type.
+
+An error like "statement with no effect" or "expression result unused"
+tells cgo that the identifier is not a type, but not whether it is a
+constant, function, or global variable.
+
+An error like "not an integer constant" tells cgo that the identifier
+is not a constant. If it is also not a type, it must be a function or
+global variable. For now, those can be treated the same.
+
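The three rules above can be sketched as a small classifier over compiler messages. The type and function names are hypothetical, the substrings are the ones quoted in the text, and the final "const" conclusion (not a type, and no "not an integer constant" error) is an inference from the rules rather than something the text states outright.

```go
package main

import (
	"fmt"
	"strings"
)

// evidence accumulates what the error messages for one identifier imply.
type evidence struct {
	isType   bool // a "type name" style error was reported
	notType  bool // a "no effect" / "result unused" error was reported
	notConst bool // a "not an integer constant" error was reported
}

// note records the implication of a single compiler message.
func note(e *evidence, msg string) {
	switch {
	case strings.Contains(msg, "unexpected type name"),
		strings.Contains(msg, "useless type name in empty declaration"),
		strings.Contains(msg, "declaration does not declare anything"):
		e.isType = true
	case strings.Contains(msg, "statement with no effect"),
		strings.Contains(msg, "expression result unused"):
		e.notType = true
	case strings.Contains(msg, "not an integer constant"):
		e.notConst = true
	}
}

// kind turns accumulated evidence into a coarse classification.
// Functions and global variables are deliberately not distinguished here.
func kind(e evidence) string {
	switch {
	case e.isType:
		return "type"
	case e.notConst:
		return "func or var"
	case e.notType:
		return "const" // assumption: the only remaining possibility
	}
	return "unknown"
}

func main() {
	var e evidence
	note(&e, "cgo-test:1: error: statement with no effect")
	note(&e, "cgo-test:2: error: enumerator value is not an integer constant")
	fmt.Println(kind(e))
}
```

Matching on substrings rather than whole messages keeps the sketch tolerant of the differing prefixes and wording that different compilers put around the same diagnostic.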
+Next, cgo must learn the details of each type, variable, function, or
+constant. It can do this by reading object files. If cgo has decided
+that t1 is a type, v2 and v3 are variables or functions, and c4, c5,
+and c6 are constants, it generates:
+
+	<preamble>
+	typeof(t1) *__cgo__1;
+	typeof(v2) *__cgo__2;
+	typeof(v3) *__cgo__3;
+	typeof(c4) *__cgo__4;
+	enum { __cgo_enum__4 = c4 };
+	typeof(c5) *__cgo__5;
+	enum { __cgo_enum__5 = c5 };
+	typeof(c6) *__cgo__6;
+	enum { __cgo_enum__6 = c6 };
+
+	long long __cgo_debug_data[] = {
+		0, // t1
+		0, // v2
+		0, // v3
+		c4,
+		c5,
+		c6,
+		1
+	};
+
+and again invokes the system C compiler, to produce an object file
+containing debug information. Cgo parses the DWARF debug information
+for __cgo__N to learn the type of each identifier. (The types also
+distinguish functions from global variables.) If using a standard gcc,
+cgo can parse the DWARF debug information for the __cgo_enum__N to
+learn the identifier's value. The LLVM-based gcc on OS X emits
+incomplete DWARF information for enums; in that case cgo reads the
+constant values from the __cgo_debug_data array in the object file's
+data segment.
+
+At this point cgo knows the meaning of each C.xxx well enough to start
+the translation process.
+
+Translating Go
+
+[The rest of this comment refers to 6g and 6c, the Go and C compilers
+that are part of the amd64 port of the gc Go toolchain. Everything here
+applies equally to the compilers for other architectures.]
+
+Given the input Go files x.go and y.go, cgo generates these source
+files:
+
+	x.cgo1.go       # for 6g
+	y.cgo1.go       # for 6g
+	_cgo_gotypes.go # for 6g
+	_cgo_defun.c    # for 6c
+	x.cgo2.c        # for gcc
+	y.cgo2.c        # for gcc
+	_cgo_export.c   # for gcc
+	_cgo_main.c     # for gcc
+
+The file x.cgo1.go is a copy of x.go with the import "C" removed and
+references to C.xxx replaced with names like _Cfunc_xxx or _Ctype_xxx.
+The definitions of those identifiers, written as Go functions, types,
+or variables, are provided in _cgo_gotypes.go.
+
+Here is a _cgo_gotypes.go containing definitions for C.flush (provided
+in the preamble) and C.puts (from stdio):
+
+	type _Ctype_char int8
+	type _Ctype_int int32
+	type _Ctype_void [0]byte
+
+	func _Cfunc_CString(string) *_Ctype_char
+	func _Cfunc_flush() _Ctype_void
+	func _Cfunc_puts(*_Ctype_char) _Ctype_int
+
+For functions, cgo only writes an external declaration in the Go
+output. The implementation is in a combination of C for 6c (meaning
+any gc-toolchain compiler) and C for gcc.
+
+The 6c file contains the definitions of the functions. They all have
+similar bodies that invoke runtime·cgocall to make a switch from the
+Go runtime world to the system C (GCC-based) world.
+
+For example, here is the definition of _Cfunc_puts:
+
+	void _cgo_be59f0f25121_Cfunc_puts(void*);
+
+	void
+	·_Cfunc_puts(struct{uint8 x[1];}p)
+	{
+		runtime·cgocall(_cgo_be59f0f25121_Cfunc_puts, &p);
+	}
+
+The hexadecimal number is a hash of cgo's input, chosen to be
+deterministic yet unlikely to collide with other uses. The actual
+function _cgo_be59f0f25121_Cfunc_puts is implemented in a C source
+file compiled by gcc, the file x.cgo2.c:
+
+	void
+	_cgo_be59f0f25121_Cfunc_puts(void *v)
+	{
+		struct {
+			char* p0;
+			int r;
+			char __pad12[4];
+		} __attribute__((__packed__)) *a = v;
+		a->r = puts((void*)a->p0);
+	}
+
+It extracts the arguments from the pointer to _Cfunc_puts's argument
+frame, invokes the system C function (in this case, puts), stores the
+result in the frame, and returns.
+
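To make the hexadecimal prefix in names such as _cgo_be59f0f25121_Cfunc_puts more concrete, here is one way a deterministic prefix could be derived from the input. The text says only that the number is a hash of cgo's input; the choice of MD5 and of the first six bytes (twelve hex digits, matching the length seen above) is an assumption made for illustration, and both function names are hypothetical.

```go
package main

import (
	"crypto/md5"
	"fmt"
)

// symbolPrefix derives a prefix from the input bytes. The same input
// always yields the same prefix; different inputs are unlikely to collide.
// Assumption: MD5 truncated to 6 bytes, chosen only to match the length
// of the example prefix in the text.
func symbolPrefix(input []byte) string {
	sum := md5.Sum(input)
	return fmt.Sprintf("_cgo_%x", sum[:6])
}

// wrapperName builds the name of the gcc-compiled wrapper for a C function.
func wrapperName(prefix, cName string) string {
	return prefix + "_Cfunc_" + cName
}

func main() {
	prefix := symbolPrefix([]byte("package main\n\nimport \"C\"\n"))
	fmt.Println(wrapperName(prefix, "puts"))
}
```

Deriving the prefix from the input rather than from a counter or the clock keeps repeated builds of the same package byte-for-byte reproducible.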
+Linking
+
+Once the _cgo_export.c and *.cgo2.c files have been compiled with gcc,
+they need to be linked into the final binary, along with the libraries
+they might depend on (in the case of puts, stdio). 6l has been
+extended to understand basic ELF files, but it does not understand ELF
+in the full complexity that modern C libraries embrace, so it cannot
+in general generate direct references to the system libraries.
+
+Instead, the build process generates an object file using dynamic
+linkage to the desired libraries. The main function is provided by
+_cgo_main.c:
+
+	int main() { return 0; }
+	void crosscall2(void(*fn)(void*, int), void *a, int c) { }
+	void _cgo_allocate(void *a, int c) { }
+	void _cgo_panic(void *a, int c) { }
+
+The extra functions here are stubs to satisfy the references in the C
+code generated for gcc. The build process links this stub, along with
+_cgo_export.c and *.cgo2.c, into a dynamic executable and then lets
+cgo examine the executable. Cgo records the list of shared library
+references and resolved names and writes them into a new file
+_cgo_import.c, which looks like:
+
+	#pragma dynlinker "/lib64/ld-linux-x86-64.so.2"
+	#pragma dynimport puts puts#GLIBC_2.2.5 "libc.so.6"
+	#pragma dynimport __libc_start_main __libc_start_main#GLIBC_2.2.5 "libc.so.6"
+	#pragma dynimport stdout stdout#GLIBC_2.2.5 "libc.so.6"
+	#pragma dynimport fflush fflush#GLIBC_2.2.5 "libc.so.6"
+	#pragma dynimport _ _ "libpthread.so.0"
+	#pragma dynimport _ _ "libc.so.6"
+
+In the end, the compiled Go package, which will eventually be
+presented to 6l as part of a larger program, contains:
+
+	_go_.6        # 6g-compiled object for _cgo_gotypes.go *.cgo1.go
+	_cgo_defun.6  # 6c-compiled object for _cgo_defun.c
+	_all.o        # gcc-compiled object for _cgo_export.c, *.cgo2.c
+	_cgo_import.6 # 6c-compiled object for _cgo_import.c
+
+The final program will be a dynamic executable, so that 6l can avoid
+needing to process arbitrary .o files. It only needs to process the .o
+files generated from C files that cgo writes, and those are much more
+limited in the ELF or other features that they use.
+
+In essence, the _cgo_import.6 file includes the extra linking
+directives that 6l is not sophisticated enough to derive from _all.o
+on its own. Similarly, the _all.o uses dynamic references to real
+system object code because 6l is not sophisticated enough to process
+the real code.
+
+The main benefits of this system are that 6l remains relatively simple
+(it does not need to implement a complete ELF and Mach-O linker) and
+that gcc is not needed after the package is compiled. For example,
+package net uses cgo for access to name resolution functions provided
+by libc. Although gcc is needed to compile package net, gcc is not
+needed to link programs that import package net.
+
+Runtime
+
+When using cgo, Go must not assume that it owns all details of the
+process. In particular it needs to coordinate with C in the use of
+threads and thread-local storage. The runtime package, in its own
+(6c-compiled) C code, declares a few uninitialized (default bss)
+variables:
+
+	bool	runtime·iscgo;
+	void	(*libcgo_thread_start)(void*);
+	void	(*initcgo)(G*);
+
+Any package using cgo imports "runtime/cgo", which provides
+initializations for these variables. It sets iscgo to 1, initcgo to a
+gcc-compiled function that can be called early during program startup,
+and libcgo_thread_start to a gcc-compiled function that can be used to
+create a new thread, in place of the runtime's usual direct system
+calls.
+
+*/