In large functions with many variables, the register optimizer
may give up and choose not to track certain variables at all.
In this case, the "nextinnode" information linking together
all the words from a given variable will be incomplete, and
the result may be that only some of a multiword value is
preserved across a call. That confuses the garbage collector,
so don't do that. Instead, mark those variables as having
their address taken, so that they will be preserved at all
calls. It's overkill, but correct.
Tested by hand using the 6g -S output to see that it does fix
the buggy generated code leading to the issue 7726 failure.
There is no automated test because I managed to break the
compiler while writing a test (see issue 7727). I will check
in a test along with the fix to issue 7727.
Fixes#7726.
LGTM=khr
R=khr, bradfitz, dave
CC=golang-codereviews
https://golang.org/cl/85200043
On amd64p32 pointers are 32-bit-aligned and cannot be assumed to
have an offset multiple of widthreg. Instead check that they are
withptr-aligned.
Also change the threshold for region merging to 2*widthreg
instead of 2*widthptr because performance on amd64 and amd64p32
is expected to be the same.
Fixes#7712.
LGTM=khr
R=rsc, dave, khr, brad, bradfitz
CC=golang-codereviews
https://golang.org/cl/84690044
Native Client forbids jumps/calls to arbitrary locations and
enforces a particular alignement, which makes the Duff's device
ineffective.
LGTM=khr
R=rsc, dave, khr
CC=golang-codereviews
https://golang.org/cl/84400043
1. In functions with heap-allocated result variables or with
defer statements, the return sequence requires more than
just a single RET instruction. There is an optimization that
arranges for all returns to jump to a single copy of the return
epilogue in this case. Unfortunately, that optimization is
fundamentally incompatible with PC-based liveness information:
it takes PCs at many different points in the function and makes
them all land at one PC, making the combined liveness information
at that target PC a mess. Disable this optimization, so that each
return site gets its own copy of the 'call deferreturn' and the
copying of result variables back from the heap.
This removes quite a few spurious 'ambiguously live' variables.
2. Let orderexpr allocate temporaries that are passed by address
to a function call and then die on return, so that we can arrange
an appropriate VARKILL.
2a. Do this for ... slices.
2b. Do this for closure structs.
2c. Do this for runtime.concatstring, which is the implementation
of large string additions. Change representation of OADDSTR to
an explicit list in typecheck to avoid reconstructing list in both
walk and order.
3. Let orderexpr allocate the temporary variable copies used for
range loops, so that they can be killed when the loop is over.
Similarly, let it allocate the temporary holding the map iterator.
CL 81940043 reduced the number of ambiguously live temps
in the godoc binary from 860 to 711.
This CL reduces the number to 121. Still more to do, but another
good checkpoint.
Update #7345
LGTM=khr
R=khr
CC=golang-codereviews
https://golang.org/cl/83090046
REP MOVSQ and REP STOSQ have a really high startup overhead.
Use a Duff's device to do the repetition instead.
benchmark old ns/op new ns/op delta
BenchmarkClearFat32 7.20 1.60 -77.78%
BenchmarkCopyFat32 6.88 2.38 -65.41%
BenchmarkClearFat64 7.15 3.20 -55.24%
BenchmarkCopyFat64 6.88 3.44 -50.00%
BenchmarkClearFat128 9.53 5.34 -43.97%
BenchmarkCopyFat128 9.27 5.56 -40.02%
BenchmarkClearFat256 13.8 9.53 -30.94%
BenchmarkCopyFat256 13.5 10.3 -23.70%
BenchmarkClearFat512 22.3 18.0 -19.28%
BenchmarkCopyFat512 22.0 19.7 -10.45%
BenchmarkCopyFat1024 36.5 38.4 +5.21%
BenchmarkClearFat1024 35.1 35.0 -0.28%
TODO: use for stack frame zeroing
TODO: REP prefixes are still used for "reverse" copying when src/dst
regions overlap. Might be worth fixing.
LGTM=rsc
R=golang-codereviews, rsc
CC=golang-codereviews, r
https://golang.org/cl/81370046
The new channel and map runtime routines take pointers
to values, typically temporaries. Without help, the compiler
cannot tell when those temporaries stop being needed,
because it isn't sure what happened to the pointer.
Arrange to insert explicit VARKILL instructions for these
temporaries so that the liveness analysis can avoid seeing
them as "ambiguously live".
The change is made in order.c, which was already in charge of
introducing temporaries to preserve the order-of-evaluation
guarantees. Now its job has expanded to include introducing
temporaries as needed by runtime routines, and then also
inserting the VARKILL annotations for all these temporaries,
so that their lifetimes can be shortened.
In order to do its job for the map runtime routines, order.c arranges
that all map lookups or map assignments have the form:
x = m[k]
x, y = m[k]
m[k] = x
where x, y, and k are simple variables (often temporaries).
Likewise, receiving from a channel is now always:
x = <-c
In order to provide the map guarantee, order.c is responsible for
rewriting x op= y into x = x op y, so that m[k] += z becomes
t = m[k]
t2 = t + z
m[k] = t2
While here, fix a few bugs in order.c's traversal: it was failing to
walk into select and switch case bodies, so order of evaluation
guarantees were not preserved in those situations.
Added tests to test/reorder2.go.
Fixes#7671.
In gc/popt's temporary-merging optimization, allow merging
of temporaries with their address taken as long as the liveness
ranges do not intersect. (There is a good chance of that now
that we have VARKILL annotations to limit the liveness range.)
Explicitly killing temporaries cuts the number of ambiguously
live temporaries that must be zeroed in the godoc binary from
860 to 711, or -17%. There is more work to be done, but this
is a good checkpoint.
Update #7345
LGTM=khr
R=khr
CC=golang-codereviews
https://golang.org/cl/81940043
1. On entry to a function, only zero the ambiguously live stack variables.
Before, we were zeroing all stack variables containing pointers.
The zeroing is pretty inefficient right now (issue 7624), but there are also
too many stack variables detected as ambiguously live (issue 7345),
and that must be addressed before deciding how to improve the zeroing code.
(Changes in 5g/ggen.c, 6g/ggen.c, 8g/ggen.c, gc/pgen.c)
Fixes#7647.
2. Make the regopt word-based liveness analysis preserve the
whole-variable liveness property expected by the garbage collection
bitmap liveness analysis. That is, if the regopt liveness decides that
one word in a struct needs to be preserved, make sure it preserves
the entire struct. This is particularly important for multiword values
such as strings, slices, and interfaces, in which all the words need
to be present in order to understand the meaning.
(Changes in 5g/reg.c, 6g/reg.c, 8g/reg.c.)
Fixes#7591.
3. Make the regopt word-based liveness analysis treat a variable
as having its address taken - which makes it preserved across
all future calls - whenever n->addrtaken is set, for consistency
with the gc bitmap liveness analysis, even if there is no machine
instruction actually taking the address. In this case n->addrtaken
is incorrect (a nicer way to put it is overconservative), and ideally
there would be no such cases, but they can happen and the two
analyses need to agree.
(Changes in 5g/reg.c, 6g/reg.c, 8g/reg.c; test in bug484.go.)
Fixes crashes found by turning off "zero everything" in step 1.
4. Remove spurious VARDEF annotations. As the comment in
gc/pgen.c explains, the VARDEF must immediately precede
the initialization. It cannot be too early, and it cannot be too late.
In particular, if a function call sits between the VARDEF and the
actual machine instructions doing the initialization, the variable
will be treated as live during that function call even though it is
uninitialized, leading to problems.
(Changes in gc/gen.c; test in live.go.)
Fixes crashes found by turning off "zero everything" in step 1.
5. Do not treat loading the address of a wide value as a signal
that the value must be initialized. Instead depend on the existence
of a VARDEF or the first actual read/write of a word in the value.
If the load is in order to pass the address to a function that does
the actual initialization, treating the load as an implicit VARDEF
causes the same problems as described in step 4.
The alternative is to arrange to zero every such value before
passing it to the real initialization function, but this is a much
easier and more efficient change.
(Changes in gc/plive.c.)
Fixes crashes found by turning off "zero everything" in step 1.
6. Treat wide input parameters with their address taken as
initialized on entry to the function. Otherwise they look
"ambiguously live" and we will try to emit code to zero them.
(Changes in gc/plive.c.)
Fixes crashes found by turning off "zero everything" in step 1.
7. An array of length 0 has no pointers, even if the element type does.
Without this change, the zeroing code complains when asked to
clear a 0-length array.
(Changes in gc/reflect.c.)
LGTM=khr
R=khr
CC=golang-codereviews
https://golang.org/cl/80160044
Revision 3ae4607a43ff introduced CONVNOP layers
to fix type checking issues arising from comparisons.
The added complexity made 8g run out of registers
when compiling an equality function in go.net/ipv6.
A similar issue occurred in test/sizeof.go on
amd64p32 with 6g.
Fixes#7405.
LGTM=khr
R=rsc, dave, iant, khr
CC=golang-codereviews
https://golang.org/cl/78100044
Removes most uses of the REP prefix, which has a high startup cost.
LGTM=iant
R=golang-codereviews, iant, khr
CC=golang-codereviews
https://golang.org/cl/77920043
It was using a REP STOSQ but putting in CX the number of 32-bit
words to clear.
LGTM=dave
R=rsc, dave
CC=golang-codereviews
https://golang.org/cl/75240043
Replaces CL 70000043.
Introduce linkarchinit() from cmd/ld.
For cmd/6g, switch to the amd64p32 linker model if we are building under nacl/amd64p32.
LGTM=rsc
R=rsc
CC=golang-codereviews
https://golang.org/cl/71330045
For non-closure functions, the context register is uninitialized
on entry and will not be used, but morestack saves it and then the
garbage collector treats it as live. This can be a source of memory
leaks if the context register points at otherwise dead memory.
Avoid this by introducing a parallel set of morestack functions
that clear the context register, and use those for the non-closure functions.
I hope this will help with some of the finalizer flakiness, but it probably won't.
Fixes#7244.
LGTM=dvyukov
R=khr, dvyukov
CC=golang-codereviews
https://golang.org/cl/71030044
The gvardef function does nothing if n->class == PEXTERN, so
we don't need to test for that before calling it. This makes
the 6g/8g code more like the 5g code and clarifies that the
cases that do not test for n->class != PEXTERN are not buggy.
LGTM=rsc
R=rsc
CC=golang-codereviews
https://golang.org/cl/68900044
See golang.org/s/go13nacl for design overview.
This CL is the mostly mechanical changes from rsc's Go 1.2 based NaCl branch, specifically 39cb35750369 to 500771b477cf from https://code.google.com/r/rsc-go13nacl. This CL does not include working NaCl support, there are probably two or three more large merges to come.
CL 15750044 is not included as it involves more invasive changes to the linker which will need to be merged separately.
The exact change lists included are
15050047: syscall: support for Native Client
15360044: syscall: unzip implementation for Native Client
15370044: syscall: Native Client SRPC implementation
15400047: cmd/dist, cmd/go, go/build, test: support for Native Client
15410048: runtime: support for Native Client
15410049: syscall: file descriptor table for Native Client
15410050: syscall: in-memory file system for Native Client
15440048: all: update +build lines for Native Client port
15540045: cmd/6g, cmd/8g, cmd/gc: support for Native Client
15570045: os: support for Native Client
15680044: crypto/..., hash/crc32, reflect, sync/atomic: support for amd64p32
15690044: net: support for Native Client
15690048: runtime: support for fake time like on Go Playground
15690051: build: disable various tests on Native Client
LGTM=rsc
R=rsc
CC=golang-codereviews
https://golang.org/cl/68150047
The VARDEF placement must be before the initialization
but after any final use. If you have something like s = ... using s ...
the rhs must be evaluated, then the VARDEF, then the lhs
assigned.
There is a large comment in pgen.c on gvardef explaining
this in more detail.
This CL also includes Ian's suggestions from earlier CLs,
namely commenting the use of mode in link.h and fixing
the precedence of the ~r check in dcl.c.
This CL enables the check that if liveness analysis decides
a variable is live on entry to the function, that variable must
be a function parameter (not a result, and not a local variable).
If this check fails, it indicates a bug in the liveness analysis or
in the generated code being analyzed.
The race detector generates invalid code for append(x, y...).
The code declares a temporary t and then uses cap(t) before
initializing t. The new liveness check catches this bug and
stops the compiler from writing out the buggy code.
Consequently, this CL disables the race detector tests in
run.bash until the race detector bug can be fixed
(golang.org/issue/7334).
Except for the race detector bug, the liveness analysis check
does not detect any problems (this CL and the previous CLs
fixed all the detected problems).
The net test still fails with GOGC=0 but the rest of the tests
now pass or time out (because GOGC=0 is so slow).
TBR=iant
CC=golang-codereviews
https://golang.org/cl/64170043
Any initialization of a variable by a block copy or block zeroing
or by multiple assignments (componentwise copying or zeroing
of a multiword variable) needs to emit a VARDEF. These cases were not.
Fixes#7205.
TBR=iant
CC=golang-codereviews
https://golang.org/cl/63650044
The "fat" referred to being used for multiword values only.
We're going to use it for non-fat values sometimes too.
No change other than the renaming.
TBR=iant
CC=golang-codereviews
https://golang.org/cl/63650043
copyau1 was assuming that it could deduce the type of the
middle register p->reg from the type of the left or right
argument: in CMPF F1, F2, the p->reg==2 must be a D_FREG
because p->from is F1, and in CMP R1, R2, the p->reg==2 must
be a D_REG because p->from is R1.
This heuristic fails for CMP $0, R2, which was causing copyau1
not to recognize p->reg==2 as a reference to R2, which was
keeping it from properly renaming the register use when
substituting registers.
cmd/5c has the right approach: look at the opcode p->as to
decide the kind of register. It is unclear where 5g's copyau1
came from; perhaps it was an attempt to avoid expanding 5c's
a2type to include new instructions used only by 5g.
Copy a2type from cmd/5c, expand to include additional instructions,
and make it crash the compiler if asked about an instruction
it does not understand (avoid silent bugs in the future if new
instructions are added).
Should fix current arm build breakage.
While we're here, fix the print statements dumping the pred and
succ info in the asm listing to pass an int arg to %.4ud
(Prog.pc is a vlong now, due to the liblink merge).
TBR=ken2
CC=golang-codereviews
https://golang.org/cl/62730043
We now use the %A, %D, %P, and %R routines from liblink
across the board.
Fixes#7178.
Fixes#7055.
LGTM=iant
R=golang-codereviews, gobot, rsc, dave, iant, remyoudompheng
CC=golang-codereviews
https://golang.org/cl/49170043
The UNDEF instruction was listed in the instruction data as having the next instruction in the stream as its successor. This confused the optimizer into adding a load where it wasn't needed, in turn confusing the liveness analysis pass for GC bitmaps into thinking that the variable was live.
Fixes#7229.
LGTM=iant, rsc
R=golang-codereviews, bradfitz, iant, dave, rsc
CC=golang-codereviews
https://golang.org/cl/56910045
This changes makes sgen and clearfat use unaligned instructions for
the trailing bytes, like the runtime memmove does, resulting in faster
code when manipulating types whose size is not a multiple of 8.
LGTM=khr
R=khr, iant, rsc
CC=golang-codereviews
https://golang.org/cl/51740044
There is more zeroing than I would like right now -
temporaries used for the new map and channel runtime
calls need to be eliminated - but it will do for now.
This CL only has an effect if you are building with
GOEXPERIMENT=precisestack ./all.bash
(or make.bash). It costs about 5% in the overall time
spent in all.bash. That number will come down before
we make it on by default, but this should be enough for
Keith to try using the precise maps for copying stacks.
amd64 only (and it's not really great generated code).
TBR=khr, iant
CC=golang-codereviews
https://golang.org/cl/56430043
This change fixes a serious performance regression
with reflect.Value growing to 4 words instead of 3.
The json benchmark was ~50% slower, with this change
it is ~5% slower (and the binary is 0.5% larger).
Longer term, we probably need to rethink our copy
generation. Using REP is really expensive time-wise.
But inlining the copy grows the binary.
R=golang-codereviews, r
CC=golang-codereviews
https://golang.org/cl/44990043
Eventually we will want to bypass DATA for everything,
but the relocations are not standardized well enough across
architectures to make that possible.
This did not help as much as I expected, but it is definitely better.
It shaves maybe 1-2% off all.bash depending on how much you
trust the timings of a single run:
Before: 241.139r 362.702u 112.967s
After: 234.339r 359.623u 111.045s
R=golang-codereviews, gobot, r, iant
CC=golang-codereviews
https://golang.org/cl/44650043
warning: src/cmd/6g/reg.c:671 format mismatch d VLONG, arg 4
warning: src/cmd/gc/pgen.c:230 set and not used: oldstksize
warning: src/cmd/gc/plive.c:877 format mismatch lx UVLONG, arg 2
warning: src/cmd/gc/walk.c:2878 set and not used: cbv
warning: src/cmd/gc/walk.c:2885 set and not used: hbv
warning: src/cmd/ld/data.c:198 format mismatch s IND FUNC(IND CHAR) INT, arg 2
warning: src/cmd/ld/data.c:230 format mismatch s IND FUNC(IND CHAR) INT, arg 2
warning: src/cmd/ld/dwarf.c:1517 set and not used: pc
warning: src/cmd/ld/elf.c:1507 format mismatch d VLONG, arg 2
warning: src/cmd/ld/ldmacho.c:509 set and not used: dsymtab
R=golang-dev, gobot, rsc
CC=golang-dev
https://golang.org/cl/36740045
- add buffered stdout to all tools and provide to link ctxt.
- avoid extra \n before ! in .6 files written by assemblers
(makes them match the C compilers).
- use linkwriteobj instead of linkouthist+linkwritefuncs.
- in assemblers and C compilers, record pc explicitly in Prog,
for use by liblink.
- in C compilers, preserve jump target links.
- in Go compilers (gsubr.c) attach gotype directly to
corresponding LSym* instead of rederiving from instruction stream.
- in Go compilers, emit just one definition for runtime.zerovalue
from each compilation.
This CL consists entirely of small adjustments.
The heavy lifting is in CL 39680043.
Each depends on the other.
R=golang-dev, dave, iant
CC=golang-dev
https://golang.org/cl/37030045
Preparation for golang.org/s/go13linker work.
This CL does not build by itself. It depends on 35740044
and 35790044 and will be submitted at the same time.
R=iant
CC=golang-dev
https://golang.org/cl/34590045
This change allows the garbage collector to examine stack
slots that are determined as live and containing a pointer
value by the garbage collector. This results in a mean
reduction of 65% in the number of stack slots scanned during
an invocation of "GOGC=1 all.bash".
Unfortunately, this does not yet allow garbage collection to
be precise for the stack slots computed as live. Pointers
confound the determination of what definitions reach a given
instruction. In general, this problem is not solvable without
runtime cost but some advanced cooperation from the compiler
might mitigate common cases.
R=golang-dev, rsc, cshapiro
CC=golang-dev
https://golang.org/cl/14430048
This eliminates ~75% of the nil checks being emitted,
on all architectures. We can do better, but we need
a bit more general support from the compiler, and
I don't want to do that so close to Go 1.2.
What's here is simple but effective and safe.
A few small code generation cleanups were required
to make the analysis consistent on all systems about
which nil checks are omitted, at least in the test.
Fixes#6019.
R=ken2
CC=golang-dev
https://golang.org/cl/13334052
There is a cleaner, simpler way.
««« original CL description
cmd/5g, cmd/6g, cmd/8g: faster compilation
Replace linked list walk with memset.
This reduces CPU time taken by 'go install -a std' by ~10%.
Before:
real user sys
0m23.561s 0m16.625s 0m5.848s
0m23.766s 0m16.624s 0m5.846s
0m23.742s 0m16.621s 0m5.868s
after:
0m22.714s 0m14.858s 0m6.138s
0m22.644s 0m14.875s 0m6.120s
0m22.604s 0m14.854s 0m6.081s
R=golang-dev, r
CC=golang-dev
https://golang.org/cl/13084043
»»»
TBR=dvyukov
CC=golang-dev
https://golang.org/cl/13352049
Also introduce BGET2/4, BPUT2/4 as they are widely used.
Slightly improve BGETC/BPUTC implementation.
This gives ~5% CPU time improvement on go install -a -p1 std.
Before:
real user sys
0m23.561s 0m16.625s 0m5.848s
0m23.766s 0m16.624s 0m5.846s
0m23.742s 0m16.621s 0m5.868s
after:
0m22.999s 0m15.841s 0m5.889s
0m22.845s 0m15.808s 0m5.850s
0m22.889s 0m15.832s 0m5.848s
R=golang-dev, r
CC=golang-dev
https://golang.org/cl/12745047
Replace linked list walk with memset.
This reduces CPU time taken by 'go install -a std' by ~10%.
Before:
real user sys
0m23.561s 0m16.625s 0m5.848s
0m23.766s 0m16.624s 0m5.846s
0m23.742s 0m16.621s 0m5.868s
after:
0m22.714s 0m14.858s 0m6.138s
0m22.644s 0m14.875s 0m6.120s
0m22.604s 0m14.854s 0m6.081s
R=golang-dev, r
CC=golang-dev
https://golang.org/cl/13084043
When the new call site-specific frame bitmaps are available,
we can cut the zeroing to just those values that need it due
to scope escaping.
R=cshapiro, cshapiro
CC=golang-dev
https://golang.org/cl/13045043
See golang.org/s/go12nil.
This CL is about getting all the right checks inserted.
A followup CL will add an optimization pass to
remove redundant checks.
R=ken2
CC=golang-dev
https://golang.org/cl/12970043
The compilers assume they can generate temporary variables
as needed to preserve the right semantics or simplify code
generation and the back end will still generate good code.
This turns out not to be true. The back ends will only
track the first 128 variables per function and give up
on the remainder. That needs to be fixed too, in a later CL.
This CL merges temporary variables with equal types and
non-overlapping lifetimes using the greedy algorithm in
Poletto and Sarkar, "Linear Scan Register Allocation",
ACM TOPLAS 1999.
The result can be striking in the right functions.
Top 20 frame size changes in a 6g godoc binary by bytes saved:
5464 1984 (-3480, -63.7%) go/build.(*Context).Import
4456 1824 (-2632, -59.1%) go/printer.(*printer).expr1
2560 80 (-2480, -96.9%) time.nextStdChunk
3496 1608 (-1888, -54.0%) go/printer.(*printer).stmt
1896 272 (-1624, -85.7%) net/http.init
2688 1400 (-1288, -47.9%) fmt.(*pp).printReflectValue
2800 1512 (-1288, -46.0%) main.main
3296 2016 (-1280, -38.8%) crypto/tls.(*Conn).clientHandshake
1664 488 (-1176, -70.7%) time.loadZoneZip
1760 608 (-1152, -65.5%) time.parse
4104 3072 (-1032, -25.1%) runtime/pprof.writeHeap
1680 712 ( -968, -57.6%) go/ast.Walk
2488 1560 ( -928, -37.3%) crypto/x509.parseCertificate
1128 392 ( -736, -65.2%) math/big.nat.divLarge
1528 864 ( -664, -43.5%) go/printer.(*printer).fieldList
1360 712 ( -648, -47.6%) regexp/syntax.(*parser).factor
2104 1528 ( -576, -27.4%) encoding/asn1.parseField
1064 504 ( -560, -52.6%) encoding/xml.(*Decoder).text
584 48 ( -536, -91.8%) html.init
1400 864 ( -536, -38.3%) go/doc.playExample
In the same godoc build, cuts the number of functions with
too many vars from 83 to 32.
R=ken2
CC=golang-dev
https://golang.org/cl/12829043
Now there's only one copy of the flow graph construction
and dominator computation, and different optimizations
can attach different annotations to the instructions.
R=ken2
CC=golang-dev
https://golang.org/cl/12797045