Consider
switch x := x.(type) {
case int:
	// int stmts
case error:
	// error stmts
}
Prior to this change, we lowered this roughly as:
if x, ok := x.(int); ok {
	// int stmts
} else if x, ok := x.(error); ok {
	// error stmts
}
x, ok := x.(error) is implemented with a call to runtime.assertE2I2 or runtime.assertI2I2.
x, ok := x.(int) generates inline code that checks whether x has type int,
and populates x and ok as appropriate. We then immediately branch again on ok.
The shortcircuit pass in the SSA backend is designed to recognize situations
like this, in which we are immediately branching on a bool value
that we just calculated with a branch.
However, the shortcircuit pass has limitations when the intermediate state has phis.
In this case, the phi value is x (the int).
CL 222923 improved the situation, but many cases are still unhandled.
I have further improvements in progress, which is how I found this particular problem,
but they are expensive, and may or may not see the light of day.
In the common case of a lone concrete type in a type switch case,
it is easier and cheaper to simply lower a different way, roughly:
if _, ok := x.(int); ok {
	x := x.(int)
	// int stmts
}
Instead of using a type assertion, though, we extract the value of x
from the interface directly.
This removes the need to track x (the int) across the branch on ok,
which removes the phi, which lets the shortcircuit pass do its job.
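Concretely, once the inline check on the interface's type word has succeeded, the value is loaded straight from the data word. A minimal runnable sketch of the idea, assuming the usual two-word empty-interface layout (the compiler does this extraction in SSA, not via unsafe):

package main

import (
	"fmt"
	"unsafe"
)

// eface mirrors the assumed two-word layout of an empty interface:
// a type pointer and a data pointer.
type eface struct {
	typ  unsafe.Pointer
	data unsafe.Pointer
}

func main() {
	var x interface{} = 42
	e := (*eface)(unsafe.Pointer(&x))
	// After the check on the type word succeeds, load the int directly
	// from the data word; no second type assertion is needed.
	fmt.Println(*(*int)(e.data)) // 42
}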
Benchmarks for encoding/binary show improvements, as well as some
wild swings on the super fast benchmarks (alignment effects?):
name old time/op new time/op delta
ReadSlice1000Int32s-8 5.25µs ± 2% 4.87µs ± 3% -7.11% (p=0.000 n=44+49)
ReadStruct-8 451ns ± 2% 417ns ± 2% -7.39% (p=0.000 n=45+46)
WriteStruct-8 412ns ± 2% 405ns ± 3% -1.58% (p=0.000 n=46+48)
ReadInts-8 296ns ± 8% 275ns ± 3% -7.23% (p=0.000 n=48+50)
WriteInts-8 324ns ± 1% 318ns ± 2% -1.67% (p=0.000 n=44+49)
WriteSlice1000Int32s-8 5.21µs ± 2% 4.92µs ± 1% -5.67% (p=0.000 n=46+44)
PutUint16-8 0.58ns ± 2% 0.59ns ± 2% +0.63% (p=0.000 n=49+49)
PutUint32-8 0.87ns ± 1% 0.58ns ± 1% -33.10% (p=0.000 n=46+44)
PutUint64-8 0.66ns ± 2% 0.87ns ± 2% +33.07% (p=0.000 n=47+48)
LittleEndianPutUint16-8 0.86ns ± 2% 0.87ns ± 2% +0.55% (p=0.003 n=47+50)
LittleEndianPutUint32-8 0.87ns ± 1% 0.87ns ± 1% ~ (p=0.547 n=45+47)
LittleEndianPutUint64-8 0.87ns ± 2% 0.87ns ± 1% ~ (p=0.451 n=46+47)
ReadFloats-8 79.8ns ± 5% 75.9ns ± 2% -4.83% (p=0.000 n=50+47)
WriteFloats-8 89.3ns ± 1% 88.9ns ± 1% -0.48% (p=0.000 n=46+44)
ReadSlice1000Float32s-8 5.51µs ± 1% 4.87µs ± 2% -11.74% (p=0.000 n=47+46)
WriteSlice1000Float32s-8 5.51µs ± 1% 4.93µs ± 1% -10.60% (p=0.000 n=48+47)
PutUvarint32-8 25.9ns ± 2% 24.0ns ± 2% -7.02% (p=0.000 n=48+50)
PutUvarint64-8 75.1ns ± 1% 61.5ns ± 2% -18.12% (p=0.000 n=45+47)
[Geo mean] 57.3ns 54.3ns -5.33%
Despite the rarity of type switches, this generates noticeably smaller binaries.
file before after Δ %
addr2line 4413296 4409200 -4096 -0.093%
api 5982648 5962168 -20480 -0.342%
cgo 4854168 4833688 -20480 -0.422%
compile 19694784 19682560 -12224 -0.062%
cover 5278008 5265720 -12288 -0.233%
doc 4694824 4682536 -12288 -0.262%
fix 3411336 3394952 -16384 -0.480%
link 6721496 6717400 -4096 -0.061%
nm 4371152 4358864 -12288 -0.281%
objdump 4760960 4752768 -8192 -0.172%
pprof 14810820 14790340 -20480 -0.138%
trace 11681076 11668788 -12288 -0.105%
vet 8285464 8244504 -40960 -0.494%
total 115824120 115627576 -196544 -0.170%
Compiler performance is marginally improved (note that go/types has many type switches):
name old alloc/op new alloc/op delta
Template 35.0MB ± 0% 35.0MB ± 0% +0.09% (p=0.008 n=5+5)
Unicode 28.5MB ± 0% 28.5MB ± 0% ~ (p=0.548 n=5+5)
GoTypes 114MB ± 0% 114MB ± 0% -0.76% (p=0.008 n=5+5)
Compiler 541MB ± 0% 541MB ± 0% -0.03% (p=0.008 n=5+5)
SSA 1.17GB ± 0% 1.17GB ± 0% ~ (p=0.841 n=5+5)
Flate 21.9MB ± 0% 21.9MB ± 0% ~ (p=0.421 n=5+5)
GoParser 26.9MB ± 0% 26.9MB ± 0% ~ (p=0.222 n=5+5)
Reflect 74.6MB ± 0% 74.6MB ± 0% ~ (p=1.000 n=5+5)
Tar 32.9MB ± 0% 32.8MB ± 0% ~ (p=0.056 n=5+5)
XML 42.4MB ± 0% 42.1MB ± 0% -0.77% (p=0.008 n=5+5)
[Geo mean] 73.2MB 73.1MB -0.15%
name old allocs/op new allocs/op delta
Template 377k ± 0% 377k ± 0% +0.06% (p=0.008 n=5+5)
Unicode 354k ± 0% 354k ± 0% ~ (p=0.095 n=5+5)
GoTypes 1.31M ± 0% 1.30M ± 0% -0.73% (p=0.008 n=5+5)
Compiler 5.44M ± 0% 5.44M ± 0% -0.04% (p=0.008 n=5+5)
SSA 11.7M ± 0% 11.7M ± 0% ~ (p=1.000 n=5+5)
Flate 239k ± 0% 239k ± 0% ~ (p=1.000 n=5+5)
GoParser 302k ± 0% 302k ± 0% -0.04% (p=0.008 n=5+5)
Reflect 977k ± 0% 977k ± 0% ~ (p=0.690 n=5+5)
Tar 346k ± 0% 346k ± 0% ~ (p=0.889 n=5+5)
XML 431k ± 0% 430k ± 0% -0.25% (p=0.008 n=5+5)
[Geo mean] 806k 806k -0.10%
For packages with many type switches, this considerably shrinks function text size.
Some examples:
file before after Δ %
encoding/binary.s 30726 29504 -1222 -3.977%
go/printer.s 77597 76005 -1592 -2.052%
cmd/vendor/golang.org/x/tools/go/ast/astutil.s 65704 63318 -2386 -3.631%
cmd/vendor/golang.org/x/tools/go/analysis/passes/unreachable.s 8047 7714 -333 -4.138%
Text size regressions are rare.
Change-Id: Ic10982bbb04876250eaa5bfee97990141ae5fc28
Reviewed-on: https://go-review.googlesource.com/c/go/+/228106
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cuong Manh Le <cuong.manhle.vn@gmail.com>
Reviewed-by: Keith Randall <khr@golang.org>
When initializing a new object, we're often writing
1) to a location that doesn't have a pointer to a heap object
2) a pointer that doesn't point to a heap object
When both those conditions are true, we can avoid the write barrier.
This CL detects case 1 by looking for writes to known-zeroed
locations. The results of runtime.newobject are zeroed, and we
perform a simple tracking of which parts of that object are written so
we can determine what part remains zero at each write.
This CL detects case 2 by looking for addresses of globals (including
the types and itabs which are used in interfaces) and for nil pointers.
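A sketch of a store where both conditions line up (the types and names here are made up for illustration):

type T struct {
	p *int
}

var g int

func f() *T {
	t := new(T) // runtime.newobject: t.p is known to be zeroed (case 1)
	t.p = &g    // &g is the address of a global, not a heap pointer (case 2)
	return t    // both conditions hold, so the write barrier is avoided
}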
Makes cmd/go 0.3% smaller. Some particular cases, like the slice
literal in #29573, can get much smaller.
TODO: we can also remove actual zero writes with this mechanism.
Update #29573
Change-Id: Ie74a3533775ea88da0495ba02458391e5db26cb9
Reviewed-on: https://go-review.googlesource.com/c/go/+/156363
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
We don't need a write barrier if:
1) The location we're writing to doesn't hold a heap pointer, and
2) The value we're writing isn't a heap pointer.
The freshly returned value from runtime.newobject satisfies (1).
Pointers to globals, and the contents of the read-only data section satisfy (2).
This is particularly helpful for code like:
p := []string{"abc", "def", "ghi"}
Where the compiler generates:
a := new([3]string)
move(a, statictmp_) // eliminates write barriers here
p := a[:]
For big slice literals, this makes the code smaller and faster to
compile.
Update #13554. Reduces the compile time by ~10% and RSS by ~30%.
Change-Id: Icab81db7591c8777f68e5d528abd48c7e44c87eb
Reviewed-on: https://go-review.googlesource.com/c/151498
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
The writebarrier test has to change.
Now that T23 composite literals are passed to the backend,
they get SSA'd, so writes to their fields are treated separately,
and the relevant part of the first write to t23 becomes a dead store.
Preserve the intent of the test by splitting it into two functions.
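A sketch of the shape of the split; the exact types and fields in the test differ, so these names are illustrative:

type T23 struct {
	p *int
	a int
}

var (
	t23 T23
	i23 int
)

func f23a() {
	t23 = T23{a: 42} // no pointer stored: no write barrier expected
}

func f23b() {
	t23 = T23{p: &i23} // pointer stored: write barrier expected
}

With one write per function, neither store can be eliminated as dead by the other.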
Reduces code size a bit:
name old object-bytes new object-bytes delta
Template 386k ± 0% 386k ± 0% ~ (all equal)
Unicode 202k ± 0% 202k ± 0% ~ (all equal)
GoTypes 1.16M ± 0% 1.16M ± 0% ~ (all equal)
Compiler 3.92M ± 0% 3.91M ± 0% -0.19% (p=0.008 n=5+5)
SSA 7.91M ± 0% 7.91M ± 0% ~ (all equal)
Flate 228k ± 0% 228k ± 0% -0.05% (p=0.008 n=5+5)
GoParser 283k ± 0% 283k ± 0% ~ (all equal)
Reflect 952k ± 0% 952k ± 0% -0.06% (p=0.008 n=5+5)
Tar 188k ± 0% 188k ± 0% -0.09% (p=0.008 n=5+5)
XML 406k ± 0% 406k ± 0% -0.02% (p=0.008 n=5+5)
[Geo mean] 649k 648k -0.04%
Fixes #18872
Change-Id: Ifeed0f71f13849732999aa731cc2bf40c0f0e32a
Reviewed-on: https://go-review.googlesource.com/43154
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
The compiler did not emit a write barrier when assigning a struct
literal to a global, as in global = T{} where T contains a pointer.
The relevant code path is:
walkexpr OAS var_ OSTRUCTLIT
	oaslit
		anylit OSTRUCTLIT
			walkexpr OAS var_ nil
			return without adding write barrier
		return true
	break (without adding write barrier)
This CL makes oaslit not apply to globals. See also CL
https://go-review.googlesource.com/c/36355/ for an alternative
fix.
The downside is that it now generates static data for zeroing the
struct. Also, this only covers globals; if there is any lurking bug
with implicit zeroing other than for globals, this does not fix it.
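For example (illustrative):

type T struct {
	p *int
}

var global T

func reset() {
	global = T{} // clearing the pointer field still needs a write barrier
}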
Fixes #18956.
Change-Id: Ibcd27e4fae3aa38390ffa94a32a9dd7a802e4b37
Reviewed-on: https://go-review.googlesource.com/36410
Reviewed-by: Russ Cox <rsc@golang.org>
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
When the compiler inserts write barriers, the frontend makes
conservative decisions at an early stage. This may produce false
positives, which result in write barriers for stack writes.
A new phase, writebarrier, is added to the SSA backend, to delay
the decision and eliminate false positives. The frontend still
makes conservative decisions. When building SSA, instead of
emitting runtime calls directly, it emits WB ops (StoreWB,
MoveWB, etc.), which will be expanded to branches and runtime
calls in writebarrier phase. Writes to static locations on stack
are detected and write barriers are removed.
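As a contrived sketch of the kind of store the new phase can prove barrier-free (assuming s does not escape):

func f(p *int) int {
	var s struct{ p *int }
	s.p = p // a write to a static location on the stack: barrier removed
	return *s.p
}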
All write barriers for stack writes found by the script from
issue #17330 are eliminated (except two false positives).
Fixes #17330.
Change-Id: I9bd66333da9d0ceb64dcaa3c6f33502798d1a0f8
Reviewed-on: https://go-review.googlesource.com/31131
Reviewed-by: Austin Clements <austin@google.com>
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
func f(x, y, z *int) {
	a := []*int{x, y, z}
	...
}
We used to use:
var tmp [3]*int
a := tmp[:]
a[0] = x
a[1] = y
a[2] = z
Now we do:
var tmp [3]*int
tmp[0] = x
tmp[1] = y
tmp[2] = z
a := tmp[:]
Doesn't sound like a big deal, but the compiler has trouble
eliminating write barriers when using the former method because it
doesn't know that the slice points to the stack. In the latter
method, the compiler knows the array is on the stack and as a result
doesn't emit any write barriers.
This turns out to be extremely common when building ... args, as in
calls to fmt.Printf.
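For example, fmt.Println(x, y) is now lowered roughly as below: the backing array is filled first and sliced afterward (a runnable sketch of the shape, not the compiler's literal output):

package main

import "fmt"

func main() {
	x, y := 1, 2
	// Rough shape of the lowering for fmt.Println(x, y):
	var tmp [2]interface{}
	tmp[0] = x // stores into an array known to be on the stack,
	tmp[1] = y // so no write barriers are needed
	fmt.Println(tmp[:]...)
}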
Makes go binaries ~1% smaller.
Doesn't have a measurable effect on the go1 fmt benchmarks,
unfortunately.
Fixes #14263
Update #6853
Change-Id: I9074a2788ec9e561a75f3b71c119b69f304d6ba2
Reviewed-on: https://go-review.googlesource.com/22395
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Josh Bleecher Snyder <josharian@gmail.com>
Don't write back parts of a slicing operation if they
are unchanged from the source of the slice. For example:
x.s = x.s[0:5] // don't write back pointer or cap
x.s = x.s[:5] // don't write back pointer or cap
x.s = x.s[:5:7] // don't write back pointer
There is more to be done here, for example:
x.s = x.s[:len(x.s):7] // don't write back ptr or len
This CL can't handle that one yet.
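Roughly, after this CL the first example compiles as if only the length word were assigned (pseudocode, bounds check omitted):

x.s = x.s[0:5]
// becomes:
x.s.len = 5 // the ptr and cap words are unchanged, so they are not stored back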
Fixes #14855
Change-Id: Id1e1a4fa7f3076dc1a76924a7f1cd791b81909bb
Reviewed-on: https://go-review.googlesource.com/20954
Reviewed-by: Austin Clements <austin@google.com>
Run-TryBot: Keith Randall <khr@golang.org>
Make sure we don't generate write barriers in runtime
code that is marked to forbid write barriers.
Implement the optimization that if we're writing a sliced
slice back to the location it came from, we don't need a
write barrier.
Fixes #14784
Change-Id: I04b6a3b2ac303c19817e932a36a3b006de103aaa
Reviewed-on: https://go-review.googlesource.com/20791
Reviewed-by: Austin Clements <austin@google.com>
Currently we generate write barriers when the right side of an
assignment is a global function. This doesn't fall into the existing
case of storing an address of a global because we haven't lowered the
function to a pointer yet.
This write barrier is unnecessary, so eliminate it.
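For example (names are illustrative):

var handler func()

func loop() {}

func install() {
	handler = loop // RHS is a global function, not a heap pointer: no barrier needed
}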
Fixes #13901.
Change-Id: Ibc10e00a8803db0fd75224b66ab94c3737842a79
Reviewed-on: https://go-review.googlesource.com/20772
Run-TryBot: Austin Clements <austin@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
Type switches need write barriers if the written-to
variable is heap allocated.
For the added needwritebarrier call, the right arg doesn't
really matter; I just pass something that will never disqualify
the write barrier. The left arg is the one that matters.
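An illustrative case where the switch variable is heap allocated:

func f(x interface{}) func() error {
	switch v := x.(type) {
	case error:
		// v is captured by the closure, so it is heap allocated;
		// the implicit store to v in this case needs a write barrier.
		return func() error { return v }
	}
	return nil
}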
Fixes #14306
Change-Id: Ic2754167cce062064ea2eeac2944ea4f77cc9c3b
Reviewed-on: https://go-review.googlesource.com/19481
Reviewed-by: Russ Cox <rsc@golang.org>
Run-TryBot: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
The code generated for x = append(x, v) is roughly:
t := x
if len(t)+1 > cap(t) {
	t = grow(t)
}
t[len(t)] = v
len(t)++
x = t
We used to generate this code as Go pseudocode during walk.
Generate it instead as actual instructions during gen.
Doing so lets us apply a few optimizations. The most important
is that when, as in the above example, the source slice and the
destination slice are the same, the code can instead do:
t := x
if len(t)+1 > cap(t) {
	t = grow(t)
	x = {base(t), len(t)+1, cap(t)}
} else {
	len(x)++
}
t[len(t)] = v
That is, in the fast path that does not reallocate the array,
only the updated length needs to be written back to x,
not the array pointer and not the capacity. This is more like
what you'd write by hand in C. It's faster in general, since
the fast path elides two of the three stores, but it's especially
faster when the form of x is such that the base pointer write
would turn into a write barrier. No write, no barrier.
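A typical beneficiary is an in-place append loop (a usage sketch):

func fill(n int) []int {
	x := make([]int, 0, n)
	for i := 0; i < n; i++ {
		// Self-append: on the non-growing path only the length is
		// written back to x; the base pointer and cap stores are
		// elided, and with them any write barrier when x is in the heap.
		x = append(x, i)
	}
	return x
}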
name old mean new mean delta
BinaryTree17 5.68s × (0.97,1.04) 5.81s × (0.98,1.03) +2.35% (p=0.023)
Fannkuch11 4.41s × (0.98,1.03) 4.35s × (1.00,1.00) ~ (p=0.090)
FmtFprintfEmpty 92.7ns × (0.91,1.16) 86.0ns × (0.94,1.11) -7.31% (p=0.038)
FmtFprintfString 281ns × (0.96,1.08) 276ns × (0.98,1.04) ~ (p=0.219)
FmtFprintfInt 288ns × (0.97,1.06) 274ns × (0.98,1.06) -4.94% (p=0.002)
FmtFprintfIntInt 493ns × (0.97,1.04) 506ns × (0.99,1.01) +2.65% (p=0.009)
FmtFprintfPrefixedInt 423ns × (0.97,1.04) 391ns × (0.99,1.01) -7.52% (p=0.000)
FmtFprintfFloat 598ns × (0.99,1.01) 566ns × (0.99,1.01) -5.27% (p=0.000)
FmtManyArgs 1.89µs × (0.98,1.05) 1.91µs × (0.99,1.01) ~ (p=0.231)
GobDecode 14.8ms × (0.98,1.03) 15.3ms × (0.99,1.02) +3.01% (p=0.000)
GobEncode 12.3ms × (0.98,1.01) 11.5ms × (0.97,1.03) -5.93% (p=0.000)
Gzip 656ms × (0.99,1.05) 645ms × (0.99,1.01) ~ (p=0.055)
Gunzip 142ms × (1.00,1.00) 142ms × (1.00,1.00) -0.32% (p=0.034)
HTTPClientServer 91.2µs × (0.97,1.04) 90.5µs × (0.97,1.04) ~ (p=0.468)
JSONEncode 32.6ms × (0.97,1.08) 32.0ms × (0.98,1.03) ~ (p=0.190)
JSONDecode 114ms × (0.97,1.05) 114ms × (0.99,1.01) ~ (p=0.887)
Mandelbrot200 6.11ms × (0.98,1.04) 6.04ms × (1.00,1.01) ~ (p=0.167)
GoParse 6.66ms × (0.97,1.04) 6.47ms × (0.97,1.05) -2.81% (p=0.014)
RegexpMatchEasy0_32 159ns × (0.99,1.00) 171ns × (0.93,1.07) +7.19% (p=0.002)
RegexpMatchEasy0_1K 538ns × (1.00,1.01) 550ns × (0.98,1.01) +2.30% (p=0.000)
RegexpMatchEasy1_32 138ns × (1.00,1.00) 135ns × (0.99,1.02) -1.60% (p=0.000)
RegexpMatchEasy1_1K 869ns × (0.99,1.01) 879ns × (1.00,1.01) +1.08% (p=0.000)
RegexpMatchMedium_32 252ns × (0.99,1.01) 243ns × (1.00,1.00) -3.71% (p=0.000)
RegexpMatchMedium_1K 72.7µs × (1.00,1.00) 70.3µs × (1.00,1.00) -3.34% (p=0.000)
RegexpMatchHard_32 3.85µs × (1.00,1.00) 3.82µs × (1.00,1.01) -0.81% (p=0.000)
RegexpMatchHard_1K 118µs × (1.00,1.00) 117µs × (1.00,1.00) -0.56% (p=0.000)
Revcomp 920ms × (0.97,1.07) 917ms × (0.97,1.04) ~ (p=0.808)
Template 129ms × (0.98,1.03) 114ms × (0.99,1.01) -12.06% (p=0.000)
TimeParse 619ns × (0.99,1.01) 622ns × (0.99,1.01) ~ (p=0.062)
TimeFormat 661ns × (0.98,1.04) 665ns × (0.99,1.01) ~ (p=0.524)
See next CL for combination with a similar optimization for slice.
The benchmarks that are slower in this CL are still faster overall
with the combination of the two.
Change-Id: I2a7421658091b2488c64741b4db15ab6c3b4cb7e
Reviewed-on: https://go-review.googlesource.com/9812
Reviewed-by: David Chase <drchase@google.com>
We can expand the test cases as we discover problems.
These are some basic tests, plus all the things I got wrong
in some recent work.
Change-Id: Id875fcfaf74eb087ae42b441fe47a34c5b8ccb39
Reviewed-on: https://go-review.googlesource.com/9158
Reviewed-by: Rick Hudson <rlh@golang.org>
Reviewed-by: Austin Clements <austin@google.com>