1
0
mirror of https://github.com/golang/go synced 2024-11-27 02:21:17 -07:00
Commit Graph

187 Commits

Author SHA1 Message Date
LE Manh Cuong
ec4e8517cd cmd/compile: support more length types for slice extension optimization
golang.org/cl/109517 optimized the compiler to avoid the allocation for make in
append(x, make([]T, y)...). This was only implemented for the case that y has type int.

This change extends the optimization to trigger for all integer types where the value
is known at compile time to fit into an int.

name             old time/op    new time/op    delta
ExtendInt-12        106ns ± 4%     106ns ± 0%      ~     (p=0.351 n=10+6)
ExtendUint64-12    1.03µs ± 5%    0.10µs ± 4%   -90.01%  (p=0.000 n=9+10)

name             old alloc/op   new alloc/op   delta
ExtendInt-12        0.00B          0.00B           ~     (all equal)
ExtendUint64-12    13.6kB ± 0%     0.0kB       -100.00%  (p=0.000 n=10+10)

name             old allocs/op  new allocs/op  delta
ExtendInt-12         0.00           0.00           ~     (all equal)
ExtendUint64-12      1.00 ± 0%      0.00       -100.00%  (p=0.000 n=10+10)

Updates #29785

Change-Id: Ief7760097c285abd591712da98c5b02bc3961fcd
Reviewed-on: https://go-review.googlesource.com/c/go/+/182559
Run-TryBot: Cuong Manh Le <cuong.manhle.vn@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
2019-09-17 17:18:17 +00:00
Alberto Donizetti
c2facbe937 Revert "test/codegen: document -all_codegen option in README"
This reverts CL 192101.

Reason for revert: The same paragraph was added 2 weeks ago
(look a few lines above)

Change-Id: I05efb2631d7b4966f66493f178f2a649c715a3cc
Reviewed-on: https://go-review.googlesource.com/c/go/+/195637
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2019-09-16 17:31:37 +00:00
Cherry Zhang
d9b8ffa51c test/codegen: document -all_codegen option in README
It is useful to know about the -all_codegen option for running
codegen tests for all platforms. I was puzzling that some codegen
test was not failing on my local machine or on trybot, until I
found this option.

Change-Id: I062cf4d73f6a6c9ebc2258195779d2dab21bc36d
Reviewed-on: https://go-review.googlesource.com/c/go/+/192101
Reviewed-by: Daniel Martí <mvdan@mvdan.cc>
Run-TryBot: Daniel Martí <mvdan@mvdan.cc>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2019-09-16 11:53:57 +00:00
Ruixin Bao
98aa97806b cmd/compile: add math/bits.Mul64 intrinsic on s390x
This change adds an intrinsic for Mul64 on s390x. To achieve that,
a new assembly instruction, MLGR, is introduced in s390x/asmz.go. This assembly
instruction directly uses an existing instruction on Z and supports multiplication
of two 64 bit unsigned integer and stores the result in two separate registers.

In this case, we require the multiplcand to be stored in register R3 and
the output result (the high and low 64 bit of the product) to be stored in
R2 and R3 respectively.

A test case is also added.

Benchmark:
name      old time/op  new time/op  delta
Mul-18    11.1ns ± 0%   1.4ns ± 0%  -87.39%  (p=0.002 n=8+10)
Mul32-18  2.07ns ± 0%  2.07ns ± 0%     ~     (all equal)
Mul64-18  11.1ns ± 1%   1.4ns ± 0%  -87.42%  (p=0.000 n=10+10)

Change-Id: Ieca6ad1f61fff9a48a31d50bbd3f3c6d9e6675c1
Reviewed-on: https://go-review.googlesource.com/c/go/+/194572
Reviewed-by: Michael Munday <mike.munday@ibm.com>
Run-TryBot: Michael Munday <mike.munday@ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2019-09-13 09:04:48 +00:00
Michael Munday
5c5f217b63 cmd/compile: improve s390x sign/zero extension removal
This CL gets rid of the MOVDreg and MOVDnop SSA operations on
s390x. They were originally inserted to help avoid situations
where a sign/zero extension was elided but a spill invalidated
the optimization. It's not really clear we need to do this though
(amd64 doesn't have these ops for example) so long as we are
careful when removing sign/zero extensions. Also, the MOVDreg
technique doesn't work if the register is spilled before the
MOVDreg op (I haven't seen that in practice).

Removing these ops reduces the complexity of the rules and also
allows us to unblock optimizations. For example, the compiler can
now merge the loads in binary.{Big,Little}Endian.PutUint16 which
it wasn't able to do before. This CL reduces the size of the .text
section in the go tool by about 4.7KB (0.09%).

Change-Id: Icaddae7f2e4f9b2debb6fabae845adb3f73b41db
Reviewed-on: https://go-review.googlesource.com/c/go/+/173897
Run-TryBot: Michael Munday <mike.munday@ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2019-09-10 13:17:24 +00:00
Martin Möhrmann
5bb59b6d16 Revert "compile: prefer an AND instead of SHR+SHL instructions"
This reverts commit 9ec7074a94.

Reason for revert: broke s390x (copysign, abs) and arm64 (bitfield) tests.

Change-Id: I16c1b389c062e8c4aa5de079f1d46c9b25b0db52
Reviewed-on: https://go-review.googlesource.com/c/go/+/193850
Run-TryBot: Martin Möhrmann <moehrmann@google.com>
Reviewed-by: Agniva De Sarker <agniva.quicksilver@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2019-09-09 07:33:25 +00:00
Martin Möhrmann
9ec7074a94 compile: prefer an AND instead of SHR+SHL instructions
On modern 64bit CPUs a SHR, SHL or AND instruction take 1 cycle to execute.
A pair of shifts that operate on the same register will take 2 cycles
and needs to wait for the input register value to be available.

Large constants used to mask the high bits of a register with an AND
instruction can not be encoded as an immediate in the AND instruction
on amd64 and therefore need to be loaded into a register with a MOV
instruction.

However that MOV instruction is not dependent on the output register and
on many CPUs does not compete with the AND or shift instructions for
execution ports.

Using a pair of shifts to mask high bits instead of an AND to mask high
bits of a register has a shorter encoding and uses one less general
purpose register but is slower due to taking one clock cycle longer
if there is no register pressure that would make the AND variant need to
generate a spill.

For example the instructions emitted for (x & 1 << 63) before this CL are:
48c1ea3f                SHRQ $0x3f, DX
48c1e23f                SHLQ $0x3f, DX

after this CL the instructions are the same as GCC and LLVM use:
48b80000000000000080    MOVQ $0x8000000000000000, AX
4821d0                  ANDQ DX, AX

Some platforms such as arm64 already have SSA optimization rules to fuse
two shift instructions back into an AND.

Removing the general rule to rewrite AND to SHR+SHL speeds up this benchmark:

var GlobalU uint

func BenchmarkAndHighBits(b *testing.B) {
	x := uint(0)
	for i := 0; i < b.N; i++ {
		x &= 1 << 63
	}
	GlobalU = x
}

amd64/darwin on Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz:
name           old time/op  new time/op  delta
AndHighBits-4  0.61ns ± 6%  0.42ns ± 6%  -31.42%  (p=0.000 n=25+25):

Updates #33826
Updates #32781

Change-Id: I862d3587446410c447b9a7265196b57f85358633
Reviewed-on: https://go-review.googlesource.com/c/go/+/191780
Run-TryBot: Martin Möhrmann <moehrmann@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
2019-09-09 06:49:17 +00:00
Alberto Donizetti
e6d2544d20 test/codegen: mention -all_codegen in the README
For performance reasons (avoiding costly cross-compilations) CL 177577
changed the codegen test harness to only run the tests for the
machine's GOARCH by default.

This change updates the codegen README accordingly, explaining what
all.bash does run by default and how to perform the tests for all
architectures.

Fixes #33924

Change-Id: I43328d878c3e449ebfda46f7e69963a44a511d40
Reviewed-on: https://go-review.googlesource.com/c/go/+/192619
Reviewed-by: Daniel Martí <mvdan@mvdan.cc>
2019-09-01 15:37:13 +00:00
Brian Kessler
b003afe4fe cmd/compile: intrinsify RotateLeft32 on wasm
wasm has 32-bit versions of all integer operations. This change
lowers RotateLeft32 to i32.rotl on wasm and intrinsifies the math/bits
call.  Benchmarking on amd64 under node.js this is ~25% faster.

node v10.15.3/amd64
name          old time/op  new time/op  delta
RotateLeft    8.37ns ± 1%  8.28ns ± 0%   -1.05%  (p=0.029 n=4+4)
RotateLeft8   11.9ns ± 1%  11.8ns ± 0%     ~     (p=0.167 n=5+5)
RotateLeft16  11.8ns ± 0%  11.8ns ± 0%     ~     (all equal)
RotateLeft32  11.9ns ± 1%   8.7ns ± 0%  -26.32%  (p=0.008 n=5+5)
RotateLeft64  8.31ns ± 1%  8.43ns ± 2%     ~     (p=0.063 n=5+5)

Updates #31265

Change-Id: I5b8e155978faeea536c4f6427ac9564d2f096a46
Reviewed-on: https://go-review.googlesource.com/c/go/+/182359
Run-TryBot: Brian Kessler <brian.m.kessler@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Richard Musiol <neelance@gmail.com>
2019-08-31 17:03:04 +00:00
Ben Shi
1786ecd502 cmd/compile: eliminate WASM's redundant extension & wrapping
This CL eliminates unnecessary pairs of I32WrapI64 and
I64ExtendI32U generated by the WASM backend for IF
statements. And it makes the total size of pkg/js_wasm/
decreases about 490KB.

Change-Id: I16b0abb686c4e30d5624323166ec2d0ec57dbe2d
Reviewed-on: https://go-review.googlesource.com/c/go/+/191758
Run-TryBot: Ben Shi <powerman1st@163.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Richard Musiol <neelance@gmail.com>
2019-08-30 21:20:03 +00:00
Ben Shi
8d5197d818 cmd/compile: optimize 386's math.bits.TrailingZeros16
This CL reverts CL 192097 and fixes the issue in CL 189277.

Change-Id: Icd271262e1f5019a8e01c91f91c12c1261eeb02b
Reviewed-on: https://go-review.googlesource.com/c/go/+/192519
Run-TryBot: Ben Shi <powerman1st@163.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
2019-08-30 17:37:00 +00:00
Cherry Zhang
9859f6bedb test/codegen: fix ARM32 RotateLeft32 test
The syntax of a shifted operation does not have a "$" sign for
the shift amount. Remove it.

Change-Id: I50782fe942b640076f48c2fafea4d3175be8ff99
Reviewed-on: https://go-review.googlesource.com/c/go/+/192100
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
2019-08-28 20:42:48 +00:00
Ben Shi
3cfd003a8a cmd/compile: optimize ARM's math.bits.RotateLeft32
This CL optimizes math.bits.RotateLeft32 to inline
"MOVW Rx@>Ry, Rd" on ARM.

The benchmark results of math/bits show some improvements.
name               old time/op  new time/op  delta
RotateLeft-4       9.42ns ± 0%  6.91ns ± 0%  -26.66%  (p=0.000 n=40+33)
RotateLeft8-4      8.79ns ± 0%  8.79ns ± 0%   -0.04%  (p=0.000 n=40+31)
RotateLeft16-4     8.79ns ± 0%  8.79ns ± 0%   -0.04%  (p=0.000 n=40+32)
RotateLeft32-4     8.16ns ± 0%  7.54ns ± 0%   -7.68%  (p=0.000 n=40+40)
RotateLeft64-4     15.7ns ± 0%  15.7ns ± 0%     ~     (all equal)

updates #31265

Change-Id: I77bc1c2c702d5323fc7cad5264a8e2d5666bf712
Reviewed-on: https://go-review.googlesource.com/c/go/+/188697
Run-TryBot: Ben Shi <powerman1st@163.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2019-08-28 15:41:58 +00:00
Ben Shi
c683ab8128 cmd/compile: optimize ARM's math.Abs
This CL optimizes math.Abs to an inline ABSD instruction on ARM.

The benchmark results of src/math/ show big improvements.
name                   old time/op  new time/op  delta
Acos-4                  181ns ± 0%   182ns ± 0%   +0.30%  (p=0.000 n=40+40)
Acosh-4                 202ns ± 0%   202ns ± 0%     ~     (all equal)
Asin-4                  163ns ± 0%   163ns ± 0%     ~     (all equal)
Asinh-4                 242ns ± 0%   242ns ± 0%     ~     (all equal)
Atan-4                  120ns ± 0%   121ns ± 0%   +0.83%  (p=0.000 n=40+40)
Atanh-4                 202ns ± 0%   202ns ± 0%     ~     (all equal)
Atan2-4                 173ns ± 0%   173ns ± 0%     ~     (all equal)
Cbrt-4                 1.06µs ± 0%  1.06µs ± 0%   +0.09%  (p=0.000 n=39+37)
Ceil-4                 72.9ns ± 0%  72.8ns ± 0%     ~     (p=0.237 n=40+40)
Copysign-4             13.2ns ± 0%  13.2ns ± 0%     ~     (all equal)
Cos-4                   193ns ± 0%   183ns ± 0%   -5.18%  (p=0.000 n=40+40)
Cosh-4                  254ns ± 0%   239ns ± 0%   -5.91%  (p=0.000 n=40+40)
Erf-4                   112ns ± 0%   112ns ± 0%     ~     (all equal)
Erfc-4                  117ns ± 0%   117ns ± 0%     ~     (all equal)
Erfinv-4                127ns ± 0%   127ns ± 1%     ~     (p=0.492 n=40+40)
Erfcinv-4               128ns ± 0%   128ns ± 0%     ~     (all equal)
Exp-4                   212ns ± 0%   206ns ± 0%   -3.05%  (p=0.000 n=40+40)
ExpGo-4                 216ns ± 0%   209ns ± 0%   -3.24%  (p=0.000 n=40+40)
Expm1-4                 142ns ± 0%   142ns ± 0%     ~     (all equal)
Exp2-4                  191ns ± 0%   184ns ± 0%   -3.45%  (p=0.000 n=40+40)
Exp2Go-4                194ns ± 0%   187ns ± 0%   -3.61%  (p=0.000 n=40+40)
Abs-4                  14.4ns ± 0%   6.3ns ± 0%  -56.39%  (p=0.000 n=38+39)
Dim-4                  12.6ns ± 0%  12.6ns ± 0%     ~     (all equal)
Floor-4                49.6ns ± 0%  49.6ns ± 0%     ~     (all equal)
Max-4                  27.6ns ± 0%  27.6ns ± 0%     ~     (all equal)
Min-4                  27.0ns ± 0%  27.0ns ± 0%     ~     (all equal)
Mod-4                   349ns ± 0%   305ns ± 1%  -12.55%  (p=0.000 n=33+40)
Frexp-4                54.0ns ± 0%  47.1ns ± 0%  -12.78%  (p=0.000 n=38+38)
Gamma-4                 242ns ± 0%   234ns ± 0%   -3.16%  (p=0.000 n=36+40)
Hypot-4                84.8ns ± 0%  67.8ns ± 0%  -20.05%  (p=0.000 n=31+35)
HypotGo-4              88.5ns ± 0%  71.6ns ± 0%  -19.12%  (p=0.000 n=40+38)
Ilogb-4                45.8ns ± 0%  38.9ns ± 0%  -15.12%  (p=0.000 n=40+32)
J0-4                    821ns ± 0%   802ns ± 0%   -2.33%  (p=0.000 n=33+40)
J1-4                    816ns ± 0%   807ns ± 0%   -1.05%  (p=0.000 n=40+29)
Jn-4                   1.67µs ± 0%  1.65µs ± 0%   -1.45%  (p=0.000 n=40+39)
Ldexp-4                61.5ns ± 0%  54.6ns ± 0%  -11.27%  (p=0.000 n=40+32)
Lgamma-4                188ns ± 0%   188ns ± 0%     ~     (all equal)
Log-4                   154ns ± 0%   147ns ± 0%   -4.78%  (p=0.000 n=40+40)
Logb-4                 50.9ns ± 0%  42.7ns ± 0%  -16.11%  (p=0.000 n=34+39)
Log1p-4                 160ns ± 0%   159ns ± 0%     ~     (p=0.828 n=40+40)
Log10-4                 173ns ± 0%   166ns ± 0%   -4.05%  (p=0.000 n=40+40)
Log2-4                 65.3ns ± 0%  58.4ns ± 0%  -10.57%  (p=0.000 n=37+37)
Modf-4                 36.4ns ± 0%  36.4ns ± 0%     ~     (all equal)
Nextafter32-4          36.4ns ± 0%  36.4ns ± 0%     ~     (all equal)
Nextafter64-4          32.7ns ± 0%  32.6ns ± 0%     ~     (p=0.375 n=40+40)
PowInt-4                300ns ± 0%   277ns ± 0%   -7.78%  (p=0.000 n=40+40)
PowFrac-4               676ns ± 0%   635ns ± 0%   -6.00%  (p=0.000 n=40+35)
Pow10Pos-4             17.6ns ± 0%  17.6ns ± 0%     ~     (all equal)
Pow10Neg-4             22.0ns ± 0%  22.0ns ± 0%     ~     (all equal)
Round-4                30.1ns ± 0%  30.1ns ± 0%     ~     (all equal)
RoundToEven-4          38.9ns ± 0%  38.9ns ± 0%     ~     (all equal)
Remainder-4             291ns ± 0%   263ns ± 0%   -9.62%  (p=0.000 n=40+40)
Signbit-4              11.3ns ± 0%  11.3ns ± 0%     ~     (all equal)
Sin-4                   185ns ± 0%   185ns ± 0%     ~     (all equal)
Sincos-4                230ns ± 0%   230ns ± 0%     ~     (all equal)
Sinh-4                  253ns ± 0%   246ns ± 0%   -2.77%  (p=0.000 n=39+39)
SqrtIndirect-4         41.4ns ± 0%  41.4ns ± 0%     ~     (all equal)
SqrtLatency-4          13.8ns ± 0%  13.8ns ± 0%     ~     (all equal)
SqrtIndirectLatency-4  37.0ns ± 0%  37.0ns ± 0%     ~     (p=0.632 n=40+40)
SqrtGoLatency-4         911ns ± 0%   911ns ± 0%   +0.08%  (p=0.000 n=40+40)
SqrtPrime-4            13.2µs ± 0%  13.2µs ± 0%   +0.01%  (p=0.038 n=38+40)
Tan-4                   205ns ± 0%   205ns ± 0%     ~     (all equal)
Tanh-4                  264ns ± 0%   247ns ± 0%   -6.44%  (p=0.000 n=39+32)
Trunc-4                45.2ns ± 0%  45.2ns ± 0%     ~     (all equal)
Y0-4                    796ns ± 0%   792ns ± 0%   -0.55%  (p=0.000 n=35+40)
Y1-4                    804ns ± 0%   797ns ± 0%   -0.82%  (p=0.000 n=24+40)
Yn-4                   1.64µs ± 0%  1.62µs ± 0%   -1.27%  (p=0.000 n=40+39)
Float64bits-4          8.16ns ± 0%  8.16ns ± 0%   +0.04%  (p=0.000 n=35+40)
Float64frombits-4      10.7ns ± 0%  10.7ns ± 0%     ~     (all equal)
Float32bits-4          7.53ns ± 0%  7.53ns ± 0%     ~     (p=0.760 n=40+40)
Float32frombits-4      6.91ns ± 0%  6.91ns ± 0%   -0.04%  (p=0.002 n=32+38)
[Geo mean]              111ns        106ns        -3.98%

Change-Id: I54f4fd7f5160db020b430b556bde59cc0fdb996d
Reviewed-on: https://go-review.googlesource.com/c/go/+/188678
Run-TryBot: Ben Shi <powerman1st@163.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2019-08-28 15:41:28 +00:00
Bryan C. Mills
372b0eed17 Revert "cmd/compile: optimize 386's math.bits.TrailingZeros16"
This reverts CL 189277.

Reason for revert: broke 32-bit builders.

Updates #33902

Change-Id: Ie5f180d0371a90e5057ed578c334372e5fc3a286
Reviewed-on: https://go-review.googlesource.com/c/go/+/192097
Run-TryBot: Bryan C. Mills <bcmills@google.com>
Reviewed-by: Daniel Martí <mvdan@mvdan.cc>
2019-08-28 12:57:59 +00:00
Agniva De Sarker
7be97af2ff cmd/compile: apply optimization for readonly globals on wasm
Extend the optimization introduced in CL 141118 to the wasm architecture.

And for reference, the rules trigger 212 times while building std and cmd

$GOOS=js GOARCH=wasm gotip build std cmd
$grep -E "Wasm.rules:44(1|2|3|4)" rulelog | wc -l
212

Updates #26498

Change-Id: I153684a2b98589ae812b42268da08b65679e09d1
Reviewed-on: https://go-review.googlesource.com/c/go/+/185477
Run-TryBot: Agniva De Sarker <agniva.quicksilver@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Reviewed-by: Richard Musiol <neelance@gmail.com>
2019-08-28 05:55:52 +00:00
Agniva De Sarker
8fedb2d338 cmd/compile: optimize bounded shifts on wasm
Use the shiftIsBounded function to generate more efficient
Shift instructions.

Updates #25167

Change-Id: Id350f8462dc3a7ed3bfed0bcbea2860b8f40048a
Reviewed-on: https://go-review.googlesource.com/c/go/+/182558
Run-TryBot: Agniva De Sarker <agniva.quicksilver@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Reviewed-by: Richard Musiol <neelance@gmail.com>
2019-08-28 04:44:21 +00:00
Ben Shi
22355d6cd2 cmd/compile: optimize 386's math.bits.TrailingZeros16
This CL optimizes math.bits.TrailingZeros16 on 386 with
a pair of BSFL and ORL instrcutions.

The case TrailingZeros16-4 of the benchmark test in
math/bits shows big improvement.
name               old time/op  new time/op  delta
TrailingZeros16-4  1.55ns ± 1%  0.87ns ± 1%  -43.87%  (p=0.000 n=50+49)

Change-Id: Ia899975b0e46f45dcd20223b713ed632bc32740b
Reviewed-on: https://go-review.googlesource.com/c/go/+/189277
Run-TryBot: Ben Shi <powerman1st@163.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
2019-08-28 02:29:54 +00:00
Ben Shi
731e6fc34e cmd/compile: generate Select on WASM
This CL performs the branchelim optimization on WASM with its
select instruction. And the total size of pkg/js_wasm decreased
about 80KB by this optimization.

Change-Id: I868eb146120a1cac5c4609c8e9ddb07e4da8a1d9
Reviewed-on: https://go-review.googlesource.com/c/go/+/190957
Run-TryBot: Ben Shi <powerman1st@163.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Richard Musiol <neelance@gmail.com>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2019-08-28 02:29:25 +00:00
LE Manh Cuong
c5f142fa9f cmd/compile: optimize bitset tests
The assembly output for x & c == c, where c is power of 2:

	MOVQ	"".set+8(SP), AX
	ANDQ	$8, AX
	CMPQ	AX, $8
	SETEQ	"".~r2+24(SP)

With optimization using bitset:

	MOVQ	"".set+8(SP), AX
	BTL	$3, AX
	SETCS	"".~r2+24(SP)

output less than 1 instruction.

However, there is no speed improvement:

name         old time/op  new time/op  delta
AllBitSet-8  0.35ns ± 0%  0.35ns ± 0%   ~     (all equal)

Fixes #31904

Change-Id: I5dca4e410bf45716ed2145e3473979ec997e35d4
Reviewed-on: https://go-review.googlesource.com/c/go/+/175957
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
2019-08-27 18:01:16 +00:00
Keith Randall
8f296f59de Revert "Revert "cmd/compile,runtime: allocate defer records on the stack""
This reverts CL 180761

Reason for revert: Reinstate the stack-allocated defer CL.

There was nothing wrong with the CL proper, but stack allocation of defers exposed two other issues.

Issue #32477: Fix has been submitted as CL 181258.
Issue #32498: Possible fix is CL 181377 (not submitted yet).

Change-Id: I32b3365d5026600069291b068bbba6cb15295eb3
Reviewed-on: https://go-review.googlesource.com/c/go/+/181378
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
2019-06-10 16:19:39 +00:00
Keith Randall
49200e3f3e Revert "cmd/compile,runtime: allocate defer records on the stack"
This reverts commit fff4f599fe.

Reason for revert: Seems to still have issues around GC.

Fixes #32452

Change-Id: Ibe7af629f9ad6a3d5312acd7b066123f484da7f0
Reviewed-on: https://go-review.googlesource.com/c/go/+/180761
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Josh Bleecher Snyder <josharian@gmail.com>
2019-06-05 19:50:09 +00:00
Keith Randall
fff4f599fe cmd/compile,runtime: allocate defer records on the stack
When a defer is executed at most once in a function body,
we can allocate the defer record for it on the stack instead
of on the heap.

This should make defers like this (which are very common) faster.

This optimization applies to 363 out of the 370 static defer sites
in the cmd/go binary.

name     old time/op  new time/op  delta
Defer-4  52.2ns ± 5%  36.2ns ± 3%  -30.70%  (p=0.000 n=10+10)

Fixes #6980
Update #14939

Change-Id: I697109dd7aeef9e97a9eeba2ef65ff53d3ee1004
Reviewed-on: https://go-review.googlesource.com/c/go/+/171758
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
2019-06-04 17:35:20 +00:00
fanzha02
822a9f537f cmd/compile: fix the error of absorbing boolean tests into block(FGE, FGT)
The CL 164718 mistyped the comparison flags. The rules for floating
point comparison should be GreaterThanF and GreaterEqualF. Fortunately,
the wrong optimizations were overwritten by other integer rules, so the
issue won't cause failure but just some performance impact.

The fixed CL optimizes the floating point test as follows.

source code: func foo(f float64) bool { return f > 4 || f < -4}
previous version: "FCMPD", "CSET\tGT", "CBZ"
fixed version: "FCMPD", BLE"

Add the test case.

Change-Id: Iea954fdbb8272b2d642dae0f816dc77286e6e1fa
Reviewed-on: https://go-review.googlesource.com/c/go/+/177121
Reviewed-by: Ben Shi <powerman1st@163.com>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Run-TryBot: Ben Shi <powerman1st@163.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2019-05-16 13:46:25 +00:00
Michael Munday
2c1b5130aa cmd/compile: add math/bits.{Add,Sub}64 intrinsics on s390x
This CL adds intrinsics for the 64-bit addition and subtraction
functions in math/bits. These intrinsics use the condition code
to propagate the carry or borrow bit.

To make the carry chains more efficient I've removed the
'clobberFlags' property from most of the load and store
operations. Originally these ops did clobber flags when using
offsets that didn't fit in a signed 20-bit integer, however
that is no longer true.

As with other platforms the intrinsics are faster when executed
in a chain rather than a loop because currently we need to spill
and restore the carry bit between each loop iteration. We may
be able to reduce the need to do this on s390x (e.g. by using
compare-and-branch instructions that do not clobber flags) in the
future.

name           old time/op  new time/op  delta
Add64          1.21ns ± 2%  2.03ns ± 2%  +67.18%  (p=0.000 n=7+10)
Add64multiple  2.98ns ± 3%  1.03ns ± 0%  -65.39%  (p=0.000 n=10+9)
Sub64          1.23ns ± 4%  2.03ns ± 1%  +64.85%  (p=0.000 n=10+10)
Sub64multiple  3.73ns ± 4%  1.04ns ± 1%  -72.28%  (p=0.000 n=10+8)

Change-Id: I913bbd5e19e6b95bef52f5bc4f14d6fe40119083
Reviewed-on: https://go-review.googlesource.com/c/go/+/174303
Run-TryBot: Michael Munday <mike.munday@ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2019-05-03 10:41:15 +00:00
Lynn Boger
e30aa166ea test: enable more memcombine tests for ppc64le
This enables more of the testcases in memcombine for ppc64le,
and adds more detail to some existing.

Change-Id: Ic522a1175bed682b546909c96f9ea758f8db247c
Reviewed-on: https://go-review.googlesource.com/c/go/+/174737
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2019-05-01 14:27:27 +00:00
Brian Kessler
4d9dd35806 cmd/compile: add signed divisibility rules
"Division by invariant integers using multiplication" paper
by Granlund and Montgomery contains a method for directly computing
divisibility (x%c == 0 for c constant) by means of the modular inverse.
The method is further elaborated in "Hacker's Delight" by Warren Section 10-17

This general rule can compute divisibilty by one multiplication, and add
and a compare for odd divisors and an additional rotate for even divisors.

To apply the divisibility rule, we must take into account
the rules to rewrite x%c = x-((x/c)*c) and (x/c) for c constant on the first
optimization pass "opt".  This complicates the matching as we want to match
only in the cases where the result of (x/c) is not also needed.
So, we must match on the expanded form of (x/c) in the expression x == c*(x/c)
in the "late opt" pass after common subexpresion elimination.

Note, that if there is an intermediate opt pass introduced in the future we
could simplify these rules by delaying the magic division rewrite to "late opt"
and matching directly on (x/c) in the intermediate opt pass.

On amd64, the divisibility check is 30-45% faster.

name                     old time/op  new time/op  delta`
DivisiblePow2constI64-4  0.83ns ± 1%  0.82ns ± 0%     ~     (p=0.079 n=5+4)
DivisibleconstI64-4      2.68ns ± 1%  1.87ns ± 0%  -30.33%  (p=0.000 n=5+4)
DivisibleWDivconstI64-4  2.69ns ± 1%  2.71ns ± 3%     ~     (p=1.000 n=5+5)
DivisiblePow2constI32-4  1.15ns ± 1%  1.15ns ± 0%     ~     (p=0.238 n=5+4)
DivisibleconstI32-4      2.24ns ± 1%  1.20ns ± 0%  -46.48%  (p=0.016 n=5+4)
DivisibleWDivconstI32-4  2.27ns ± 1%  2.27ns ± 1%     ~     (p=0.683 n=5+5)
DivisiblePow2constI16-4  0.81ns ± 1%  0.82ns ± 1%     ~     (p=0.135 n=5+5)
DivisibleconstI16-4      2.11ns ± 2%  1.20ns ± 1%  -42.99%  (p=0.008 n=5+5)
DivisibleWDivconstI16-4  2.23ns ± 0%  2.27ns ± 2%   +1.79%  (p=0.029 n=4+4)
DivisiblePow2constI8-4   0.81ns ± 1%  0.81ns ± 1%     ~     (p=0.286 n=5+5)
DivisibleconstI8-4       2.13ns ± 3%  1.19ns ± 1%  -43.84%  (p=0.008 n=5+5)
DivisibleWDivconstI8-4   2.23ns ± 1%  2.25ns ± 1%     ~     (p=0.183 n=5+5)

Fixes #30282
Fixes #15806

Change-Id: Id20d78263a4fdfe0509229ae4dfa2fede83fc1d0
Reviewed-on: https://go-review.googlesource.com/c/go/+/173998
Run-TryBot: Brian Kessler <brian.m.kessler@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
2019-04-30 22:02:07 +00:00
Carlos Eduardo Seo
50ad09418e cmd/compile: intrinsify math/bits.Add64 for ppc64x
This change creates an intrinsic for Add64 for ppc64x and adds a
testcase for it.

name               old time/op  new time/op  delta
Add64-160          1.90ns ±40%  2.29ns ± 0%     ~     (p=0.119 n=5+5)
Add64multiple-160  6.69ns ± 2%  2.45ns ± 4%  -63.47%  (p=0.016 n=4+5)

Change-Id: I9abe6fb023fdf62eea3c9b46a1820f60bb0a7f97
Reviewed-on: https://go-review.googlesource.com/c/go/+/173758
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
Run-TryBot: Carlos Eduardo Seo <cseo@linux.vnet.ibm.com>
2019-04-28 23:51:04 +00:00
Brian Kessler
a28a942768 cmd/compile: add unsigned divisibility rules
"Division by invariant integers using multiplication" paper
by Granlund and Montgomery contains a method for directly computing
divisibility (x%c == 0 for c constant) by means of the modular inverse.
The method is further elaborated in "Hacker's Delight" by Warren Section 10-17

This general rule can compute divisibilty by one multiplication and a compare
for odd divisors and an additional rotate for even divisors.

To apply the divisibility rule, we must take into account
the rules to rewrite x%c = x-((x/c)*c) and (x/c) for c constant on the first
optimization pass "opt".  This complicates the matching as we want to match
only in the cases where the result of (x/c) is not also available.
So, we must match on the expanded form of (x/c) in the expression x == c*(x/c)
in the "late opt" pass after common subexpresion elimination.

Note, that if there is an intermediate opt pass introduced in the future we
could simplify these rules by delaying the magic division rewrite to "late opt"
and matching directly on (x/c) in the intermediate opt pass.

Additional rules to lower the generic RotateLeft* ops were also applied.

On amd64, the divisibility check is 25-50% faster.

name                     old time/op  new time/op  delta
DivconstI64-4            2.08ns ± 0%  2.08ns ± 1%     ~     (p=0.881 n=5+5)
DivisibleconstI64-4      2.67ns ± 0%  2.67ns ± 1%     ~     (p=1.000 n=5+5)
DivisibleWDivconstI64-4  2.67ns ± 0%  2.67ns ± 0%     ~     (p=0.683 n=5+5)
DivconstU64-4            2.08ns ± 1%  2.08ns ± 1%     ~     (p=1.000 n=5+5)
DivisibleconstU64-4      2.77ns ± 1%  1.55ns ± 2%  -43.90%  (p=0.008 n=5+5)
DivisibleWDivconstU64-4  2.99ns ± 1%  2.99ns ± 1%     ~     (p=1.000 n=5+5)
DivconstI32-4            1.53ns ± 2%  1.53ns ± 0%     ~     (p=1.000 n=5+5)
DivisibleconstI32-4      2.23ns ± 0%  2.25ns ± 3%     ~     (p=0.167 n=5+5)
DivisibleWDivconstI32-4  2.27ns ± 1%  2.27ns ± 1%     ~     (p=0.429 n=5+5)
DivconstU32-4            1.78ns ± 0%  1.78ns ± 1%     ~     (p=1.000 n=4+5)
DivisibleconstU32-4      2.52ns ± 2%  1.26ns ± 0%  -49.96%  (p=0.000 n=5+4)
DivisibleWDivconstU32-4  2.63ns ± 0%  2.85ns ±10%   +8.29%  (p=0.016 n=4+5)
DivconstI16-4            1.54ns ± 0%  1.54ns ± 0%     ~     (p=0.333 n=4+5)
DivisibleconstI16-4      2.10ns ± 0%  2.10ns ± 1%     ~     (p=0.571 n=4+5)
DivisibleWDivconstI16-4  2.22ns ± 0%  2.23ns ± 1%     ~     (p=0.556 n=4+5)
DivconstU16-4            1.09ns ± 0%  1.01ns ± 1%   -7.74%  (p=0.000 n=4+5)
DivisibleconstU16-4      1.83ns ± 0%  1.26ns ± 0%  -31.52%  (p=0.008 n=5+5)
DivisibleWDivconstU16-4  1.88ns ± 0%  1.89ns ± 1%     ~     (p=0.365 n=5+5)
DivconstI8-4             1.54ns ± 1%  1.54ns ± 1%     ~     (p=1.000 n=5+5)
DivisibleconstI8-4       2.10ns ± 0%  2.11ns ± 0%     ~     (p=0.238 n=5+4)
DivisibleWDivconstI8-4   2.22ns ± 0%  2.23ns ± 2%     ~     (p=0.762 n=5+5)
DivconstU8-4             0.92ns ± 1%  0.94ns ± 1%   +2.65%  (p=0.008 n=5+5)
DivisibleconstU8-4       1.66ns ± 0%  1.26ns ± 1%  -24.28%  (p=0.008 n=5+5)
DivisibleWDivconstU8-4   1.79ns ± 0%  1.80ns ± 1%     ~     (p=0.079 n=4+5)

A follow-up change will address the signed division case.

Updates #30282

Change-Id: I7e995f167179aa5c76bb10fbcbeb49c520943403
Reviewed-on: https://go-review.googlesource.com/c/go/+/168037
Run-TryBot: Brian Kessler <brian.m.kessler@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
2019-04-27 20:46:46 +00:00
Brian Kessler
44343c777c cmd/compile: add signed divisibility by power of 2 rules
For powers of two (c=1<<k), the divisibility check x%c == 0 can be made
just by checking the trailing zeroes via a mask x&(c-1) == 0 even for signed
integers. This avoids division fix-ups when just divisibility check is needed.

To apply this rule, we match on the fixed-up version of the division. This is
neccessary because the mod and division rewrite rules are already applied
during the initial opt pass.

The speed up on amd64 due to elimination of unneccessary fix-up code is ~55%:

name                     old time/op  new time/op  delta
DivconstI64-4            2.08ns ± 0%  2.09ns ± 1%     ~     (p=0.730 n=5+5)
DivisiblePow2constI64-4  1.78ns ± 1%  0.81ns ± 1%  -54.66%  (p=0.008 n=5+5)
DivconstU64-4            2.08ns ± 0%  2.08ns ± 0%     ~     (p=0.683 n=5+5)
DivconstI32-4            1.53ns ± 0%  1.53ns ± 1%     ~     (p=0.968 n=4+5)
DivisiblePow2constI32-4  1.79ns ± 1%  0.81ns ± 1%  -54.97%  (p=0.008 n=5+5)
DivconstU32-4            1.78ns ± 1%  1.80ns ± 2%     ~     (p=0.206 n=5+5)
DivconstI16-4            1.54ns ± 2%  1.54ns ± 0%     ~     (p=0.238 n=5+4)
DivisiblePow2constI16-4  1.78ns ± 0%  0.81ns ± 1%  -54.72%  (p=0.000 n=4+5)
DivconstU16-4            1.00ns ± 5%  1.01ns ± 1%     ~     (p=0.119 n=5+5)
DivconstI8-4             1.54ns ± 0%  1.54ns ± 2%     ~     (p=0.571 n=4+5)
DivisiblePow2constI8-4   1.78ns ± 0%  0.82ns ± 8%  -53.71%  (p=0.008 n=5+5)
DivconstU8-4             0.93ns ± 1%  0.93ns ± 1%     ~     (p=0.643 n=5+5)

A follow-up CL will address the general case of x%c == 0 for signed integers.

Updates #15806

Change-Id: Iabadbbe369b6e0998c8ce85d038ebc236142e42a
Reviewed-on: https://go-review.googlesource.com/c/go/+/173557
Run-TryBot: Brian Kessler <brian.m.kessler@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
2019-04-25 03:00:32 +00:00
Keith Randall
17615969b6 Revert "cmd/compile: add signed divisibility by power of 2 rules"
This reverts CL 168038 (git 68819fb6d2)

Reason for revert: Doesn't work on 32 bit archs.

Change-Id: Idec9098060dc65bc2f774c5383f0477f8eb63a3d
Reviewed-on: https://go-review.googlesource.com/c/go/+/173442
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
2019-04-23 21:23:18 +00:00
Brian Kessler
68819fb6d2 cmd/compile: add signed divisibility by power of 2 rules
For powers of two (c=1<<k), the divisibility check x%c == 0 can be made
just by checking the trailing zeroes via a mask x&(c-1)==0 even for signed
integers.  This avoids division fixups when just divisibility check is needed.

To apply this rule the generic divisibility rule for  A%B = A-(A/B*B) is disabled
on the "opt" pass, but this does not affect generated code as this rule is applied
later.

The speed up on amd64 due to elimination of unneccessary fixup code is ~55%:

name                     old time/op  new time/op  delta
DivconstI64-4            2.08ns ± 0%  2.07ns ± 0%     ~     (p=0.079 n=5+5)
DivisiblePow2constI64-4  1.78ns ± 1%  0.81ns ± 1%  -54.55%  (p=0.008 n=5+5)
DivconstU64-4            2.08ns ± 0%  2.08ns ± 0%     ~     (p=1.000 n=5+5)
DivconstI32-4            1.53ns ± 0%  1.53ns ± 0%     ~     (all equal)
DivisiblePow2constI32-4  1.79ns ± 1%  0.81ns ± 4%  -54.75%  (p=0.008 n=5+5)
DivconstU32-4            1.78ns ± 1%  1.78ns ± 1%     ~     (p=1.000 n=5+5)
DivconstI16-4            1.54ns ± 2%  1.53ns ± 0%     ~     (p=0.333 n=5+4)
DivisiblePow2constI16-4  1.78ns ± 0%  0.79ns ± 1%  -55.39%  (p=0.000 n=4+5)
DivconstU16-4            1.00ns ± 5%  0.99ns ± 1%     ~     (p=0.730 n=5+5)
DivconstI8-4             1.54ns ± 0%  1.53ns ± 0%     ~     (p=0.714 n=4+5)
DivisiblePow2constI8-4   1.78ns ± 0%  0.80ns ± 0%  -55.06%  (p=0.000 n=5+4)
DivconstU8-4             0.93ns ± 1%  0.95ns ± 1%   +1.72%  (p=0.024 n=5+5)

A follow-up CL will address the general case of x%c == 0 for signed integers.

Updates #15806

Change-Id: I0d284863774b1bc8c4ce87443bbaec6103e14ef4
Reviewed-on: https://go-review.googlesource.com/c/go/+/168038
Reviewed-by: Keith Randall <khr@golang.org>
2019-04-23 20:35:54 +00:00
Keith Randall
fd788a86b6 cmd/compile: always mark atColumn1 results as statements
In 31618, we end up comparing the is-stmt-ness of positions
to repurpose real instructions as inline marks. If the is-stmt-ness
doesn't match, we end up not being able to remove the inline mark.

Always use statement-full positions to do the matching, so we
always find a match if there is one.

Also always use positions that are statements for inline marks.

Fixes #31618

Change-Id: Idaf39bdb32fa45238d5cd52973cadf4504f947d5
Reviewed-on: https://go-review.googlesource.com/c/go/+/173324
Run-TryBot: Keith Randall <khr@golang.org>
Reviewed-by: David Chase <drchase@google.com>
2019-04-23 17:39:11 +00:00
erifan01
f8f265b9cf cmd/compile: intrinsify math/bits.Sub64 for arm64
This CL instrinsifies Sub64 with arm64 instruction sequence NEGS, SBCS,
NGC and NEG, and optimzes the case of borrowing chains.

Benchmarks:
name              old time/op       new time/op       delta
Sub-64            2.500000ns +- 0%  2.048000ns +- 1%  -18.08%  (p=0.000 n=10+10)
Sub32-64          2.500000ns +- 0%  2.500000ns +- 0%     ~     (all equal)
Sub64-64          2.500000ns +- 0%  2.080000ns +- 0%  -16.80%  (p=0.000 n=10+7)
Sub64multiple-64  7.090000ns +- 0%  2.090000ns +- 0%  -70.52%  (p=0.000 n=10+10)

Change-Id: I3d2664e009a9635e13b55d2c4567c7b34c2c0655
Reviewed-on: https://go-review.googlesource.com/c/go/+/159018
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2019-04-22 14:40:20 +00:00
Josh Bleecher Snyder
68d4b1265e cmd/compile: reduce bits.Div64(0, lo, y) to 64 bit division
With this change, these two functions generate identical code:

func f(x uint64) (uint64, uint64) {
	return bits.Div64(0, x, 5)
}

func g(x uint64) (uint64, uint64) {
	return x / 5, x % 5
}

Updates #31582

Change-Id: Ia96c2e67f8af5dd985823afee5f155608c04a4b6
Reviewed-on: https://go-review.googlesource.com/c/go/+/173197
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2019-04-20 19:34:03 +00:00
Keith Randall
48ef01051a cmd/compile: handle new panicindex/slice names in optimizations
These new calls should not prevent NOSPLIT promotion, like the old ones.
These new calls should not prevent racefuncenter/exit removal.

(The latter was already true, as the new calls are not yet lowered
to StaticCalls at the point where racefuncenter/exit removal is done.)

Add tests to make sure we don't regress (again).

Fixes #31219

Change-Id: I3fb6b17cdd32c425829f1e2498defa813a5a9ace
Reviewed-on: https://go-review.googlesource.com/c/go/+/170639
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Ilya Tocar <ilya.tocar@intel.com>
2019-04-03 21:24:17 +00:00
erifan01
d0cbf9bf53 cmd/compile: follow up intrinsifying math/bits.Add64 for arm64
This CL deals with the additional comments of CL 159017.

Change-Id: I4ad3c60c834646d58dc0c544c741b92bfe83fb8b
Reviewed-on: https://go-review.googlesource.com/c/go/+/168857
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2019-03-22 15:09:47 +00:00
Carlos Eduardo Seo
3023d7da49 cmd/compile/internal, cmd/internal/obj/ppc64: generate new count trailing zeros instructions on POWER9
This change adds new POWER9 instructions for counting trailing zeros (CNTTZW/CNTTZD)
to the assembler and generates them in SSA when GOPPC64=power9.

name                 old time/op  new time/op  delta
TrailingZeros-160    1.59ns ±20%  1.45ns ±10%  -8.81%  (p=0.000 n=14+13)
TrailingZeros8-160   1.55ns ±23%  1.62ns ±44%    ~     (p=0.593 n=13+15)
TrailingZeros16-160  1.78ns ±23%  1.62ns ±38%  -9.31%  (p=0.003 n=14+14)
TrailingZeros32-160  1.64ns ±10%  1.49ns ± 9%  -9.15%  (p=0.000 n=13+14)
TrailingZeros64-160  1.53ns ± 6%  1.45ns ± 5%  -5.38%  (p=0.000 n=15+13)

Change-Id: I365e6ff79f3ce4d8ebe089a6a86b1771853eb596
Reviewed-on: https://go-review.googlesource.com/c/go/+/167517
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
2019-03-20 20:27:00 +00:00
erifan01
5714c91b53 cmd/compile: intrinsify math/bits.Add64 for arm64
This CL instrinsifies Add64 with arm64 instruction sequence ADDS, ADCS
and ADC, and optimzes the case of carry chains.The CL also changes the
test code so that the intrinsic implementation can be tested.

Benchmarks:
name               old time/op       new time/op       delta
Add-224            2.500000ns +- 0%  2.090000ns +- 4%  -16.40%  (p=0.000 n=9+10)
Add32-224          2.500000ns +- 0%  2.500000ns +- 0%     ~     (all equal)
Add64-224          2.500000ns +- 0%  1.577778ns +- 2%  -36.89%  (p=0.000 n=10+9)
Add64multiple-224  6.000000ns +- 0%  2.000000ns +- 0%  -66.67%  (p=0.000 n=10+10)

Change-Id: I6ee91c9a85c16cc72ade5fd94868c579f16c7615
Reviewed-on: https://go-review.googlesource.com/c/go/+/159017
Run-TryBot: Ben Shi <powerman1st@163.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2019-03-20 05:39:49 +00:00
Josh Bleecher Snyder
250b96a7bf cmd/compile: slightly optimize adding 128
'SUBQ $-0x80, r' is shorter to encode than 'ADDQ $0x80, r',
and functionally equivalent. Use it instead.

Shaves off a few bytes here and there:

file    before    after     Δ       %       
compile 25935856  25927664  -8192   -0.032% 
nm      4251840   4247744   -4096   -0.096% 

Change-Id: Ia9e02ea38cbded6a52a613b92e3a914f878d931e
Reviewed-on: https://go-review.googlesource.com/c/go/+/168344
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
2019-03-19 19:44:52 +00:00
Keith Randall
2c423f063b cmd/compile,runtime: provide index information on bounds check failure
A few examples (for accessing a slice of length 3):

   s[-1]    runtime error: index out of range [-1]
   s[3]     runtime error: index out of range [3] with length 3
   s[-1:0]  runtime error: slice bounds out of range [-1:]
   s[3:0]   runtime error: slice bounds out of range [3:0]
   s[3:-1]  runtime error: slice bounds out of range [:-1]
   s[3:4]   runtime error: slice bounds out of range [:4] with capacity 3
   s[0:3:4] runtime error: slice bounds out of range [::4] with capacity 3

Note that in cases where there are multiple things wrong with the
indexes (e.g. s[3:-1]), we report one of those errors kind of
arbitrarily, currently the rightmost one.

An exhaustive set of examples is in issue30116[u].out in the CL.

The message text has the same prefix as the old message text. That
leads to slightly awkward phrasing but hopefully minimizes the chance
that code depending on the error text will break.

Increases the size of the go binary by 0.5% (amd64). The panic functions
take arguments in registers in order to keep the size of the compiled code
as small as possible.

Fixes #30116

Change-Id: Idb99a827b7888822ca34c240eca87b7e44a04fdd
Reviewed-on: https://go-review.googlesource.com/c/go/+/161477
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>
2019-03-18 17:33:38 +00:00
Tobias Klauser
156c830bea cmd/compile: eliminate unnecessary type conversions in TrailingZeros(16|8) for arm
This follows CL 156999 which did the same for arm64.

name               old time/op  new time/op  delta
TrailingZeros-4    7.30ns ± 1%  7.30ns ± 0%     ~     (p=0.413 n=9+9)
TrailingZeros8-4   8.32ns ± 0%  7.17ns ± 0%  -13.77%  (p=0.000 n=10+9)
TrailingZeros16-4  8.30ns ± 0%  7.18ns ± 0%  -13.50%  (p=0.000 n=9+10)
TrailingZeros32-4  6.46ns ± 1%  6.47ns ± 1%     ~     (p=0.325 n=10+10)
TrailingZeros64-4  16.3ns ± 0%  16.2ns ± 0%   -0.61%  (p=0.000 n=7+10)

Change-Id: I7e9e1abf7e30d811aa474d272b2824ec7cbbaa98
Reviewed-on: https://go-review.googlesource.com/c/go/+/167797
Run-TryBot: Tobias Klauser <tobias.klauser@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2019-03-15 18:37:22 +00:00
Richard Musiol
5ee1b84959 math, math/bits: add intrinsics for wasm
This commit adds compiler intrinsics for the packages math and
math/bits on the wasm architecture for better performance.

benchmark                        old ns/op     new ns/op     delta
BenchmarkCeil                    8.31          3.21          -61.37%
BenchmarkCopysign                5.24          3.88          -25.95%
BenchmarkAbs                     5.42          3.34          -38.38%
BenchmarkFloor                   8.29          3.18          -61.64%
BenchmarkRoundToEven             9.76          3.26          -66.60%
BenchmarkSqrtLatency             8.13          4.88          -39.98%
BenchmarkSqrtPrime               5246          3535          -32.62%
BenchmarkTrunc                   8.29          3.15          -62.00%
BenchmarkLeadingZeros            13.0          4.23          -67.46%
BenchmarkLeadingZeros8           4.65          4.42          -4.95%
BenchmarkLeadingZeros16          7.60          4.38          -42.37%
BenchmarkLeadingZeros32          10.7          4.48          -58.13%
BenchmarkLeadingZeros64          12.9          4.31          -66.59%
BenchmarkTrailingZeros           6.52          4.04          -38.04%
BenchmarkTrailingZeros8          4.57          4.14          -9.41%
BenchmarkTrailingZeros16         6.69          4.16          -37.82%
BenchmarkTrailingZeros32         6.97          4.23          -39.31%
BenchmarkTrailingZeros64         6.59          4.00          -39.30%
BenchmarkOnesCount               7.93          3.30          -58.39%
BenchmarkOnesCount8              3.56          3.19          -10.39%
BenchmarkOnesCount16             4.85          3.19          -34.23%
BenchmarkOnesCount32             7.27          3.19          -56.12%
BenchmarkOnesCount64             8.08          3.28          -59.41%
BenchmarkRotateLeft              4.88          3.80          -22.13%
BenchmarkRotateLeft64            5.03          3.63          -27.83%

Change-Id: Ic1e0c2984878be8defb6eb7eb6ee63765c793222
Reviewed-on: https://go-review.googlesource.com/c/go/+/165177
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2019-03-14 19:46:19 +00:00
Josh Bleecher Snyder
61945fc502 cmd/compile: don't generate panicshift for masked int shifts
We know that a & 31 is non-negative for all a, signed or not.
We can avoid checking that and needing to write out an
unreachable call to panicshift.

Change-Id: I32f32fb2c950d2b2b35ac5c0e99b7b2dbd47f917
Reviewed-on: https://go-review.googlesource.com/c/go/+/167499
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
Reviewed-by: Keith Randall <khr@golang.org>
2019-03-14 00:03:29 +00:00
Josh Bleecher Snyder
870cfe6484 test/codegen: gofmt
Change-Id: I33f5b5051e5f75aa264ec656926223c5a3c09c1b
Reviewed-on: https://go-review.googlesource.com/c/go/+/167498
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
Reviewed-by: Matt Layher <mdlayher@gmail.com>
2019-03-13 21:44:45 +00:00
Keith Randall
486ca37b14 test: fix memcombine tests
Two tests (load_le_byte8_uint64_inv and load_be_byte8_uint64)
pass but the generated code isn't actually correct.

The test regexp provides a false negative, as it matches the
MOVQ (SP), BP instruction in the epilogue.

Combined loads never worked for these cases - the test was added in error
as part of a batch and not noticed because of the above false match.

Normalize the amd64/386 tests to always negative match on narrower
loads and OR.

Change-Id: I256861924774d39db0e65723866c81df5ab5076f
Reviewed-on: https://go-review.googlesource.com/c/go/+/166837
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
2019-03-11 19:18:03 +00:00
fanzha02
6efd51c6b7 cmd/compile: change the condition flags of floating-point comparisons in arm64 backend
Current compiler reverses operands to work around NaN in
"less than" and "less equal than" comparisons. But if we
want to use "FCMPD/FCMPS $(0.0), Fn" to do some optimization,
the workaround way does not work. Because assembler does
not support instruction "FCMPD/FCMPS Fn, $(0.0)".

This CL sets condition flags for floating-point comparisons
to resolve this problem.

Change-Id: Ia48076a1da95da64596d6e68304018cb301ebe33
Reviewed-on: https://go-review.googlesource.com/c/go/+/164718
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2019-03-07 21:23:52 +00:00
erifan01
4e2b0dda8c cmd/compile: eliminate unnecessary type conversions in TrailingZeros(16|8) for arm64
This CL eliminates unnecessary type conversion operations: OpZeroExt16to64 and OpZeroExt8to64.
If the input argrument is a nonzero value, then ORconst operation can also be eliminated.

Benchmarks:

name               old time/op  new time/op  delta
TrailingZeros-8    2.75ns ± 0%  2.75ns ± 0%     ~     (all equal)
TrailingZeros8-8   3.49ns ± 1%  2.93ns ± 0%  -16.00%  (p=0.000 n=10+10)
TrailingZeros16-8  3.49ns ± 1%  2.93ns ± 0%  -16.05%  (p=0.000 n=9+10)
TrailingZeros32-8  2.67ns ± 1%  2.68ns ± 1%     ~     (p=0.468 n=10+10)
TrailingZeros64-8  2.67ns ± 1%  2.65ns ± 0%   -0.62%  (p=0.022 n=10+9)

code:

func f16(x uint) { z = bits.TrailingZeros16(uint16(x)) }

Before:

"".f16 STEXT size=48 args=0x8 locals=0x0 leaf
        0x0000 00000 (test.go:7)        TEXT    "".f16(SB), LEAF|NOFRAME|ABIInternal, $0-8
        0x0000 00000 (test.go:7)        FUNCDATA        ZR, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
        0x0000 00000 (test.go:7)        FUNCDATA        $1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
        0x0000 00000 (test.go:7)        FUNCDATA        $3, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
        0x0000 00000 (test.go:7)        PCDATA  $2, ZR
        0x0000 00000 (test.go:7)        PCDATA  ZR, ZR
        0x0000 00000 (test.go:7)        MOVD    "".x(FP), R0
        0x0004 00004 (test.go:7)        MOVHU   R0, R0
        0x0008 00008 (test.go:7)        ORR     $65536, R0, R0
        0x000c 00012 (test.go:7)        RBIT    R0, R0
        0x0010 00016 (test.go:7)        CLZ     R0, R0
        0x0014 00020 (test.go:7)        MOVD    R0, "".z(SB)
        0x0020 00032 (test.go:7)        RET     (R30)

This line of code is unnecessary:
        0x0004 00004 (test.go:7)        MOVHU   R0, R0

After:

"".f16 STEXT size=32 args=0x8 locals=0x0 leaf
        0x0000 00000 (test.go:7)        TEXT    "".f16(SB), LEAF|NOFRAME|ABIInternal, $0-8
        0x0000 00000 (test.go:7)        FUNCDATA        ZR, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
        0x0000 00000 (test.go:7)        FUNCDATA        $1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
        0x0000 00000 (test.go:7)        FUNCDATA        $3, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
        0x0000 00000 (test.go:7)        PCDATA  $2, ZR
        0x0000 00000 (test.go:7)        PCDATA  ZR, ZR
        0x0000 00000 (test.go:7)        MOVD    "".x(FP), R0
        0x0004 00004 (test.go:7)        ORR     $65536, R0, R0
        0x0008 00008 (test.go:7)        RBITW   R0, R0
        0x000c 00012 (test.go:7)        CLZW    R0, R0
        0x0010 00016 (test.go:7)        MOVD    R0, "".z(SB)
        0x001c 00028 (test.go:7)        RET     (R30)

The situation of TrailingZeros8 is similar to TrailingZeros16.

Change-Id: I473bdca06be8460a0be87abbae6fe640017e4c9d
Reviewed-on: https://go-review.googlesource.com/c/go/+/156999
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2019-03-07 14:24:56 +00:00
erifan01
fee84cc905 cmd/compile: add an optimization rule for math/bits.ReverseBytes16 on arm
This CL adds two rules to turn patterns like ((x<<8) | (x>>8)) (the type of
x is uint16, "|" can also be "+" or "^") to a REV16 instruction on arm v6+.
This optimization rule can be used for math/bits.ReverseBytes16.

Benchmarks on arm v6:
name               old time/op  new time/op  delta
ReverseBytes-32    2.86ns ± 0%  2.86ns ± 0%   ~     (all equal)
ReverseBytes16-32  2.86ns ± 0%  2.86ns ± 0%   ~     (all equal)
ReverseBytes32-32  1.29ns ± 0%  1.29ns ± 0%   ~     (all equal)
ReverseBytes64-32  1.43ns ± 0%  1.43ns ± 0%   ~     (all equal)

Change-Id: I819e633c9a9d308f8e476fb0c82d73fb73dd019f
Reviewed-on: https://go-review.googlesource.com/c/go/+/159019
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2019-03-07 13:37:54 +00:00
erifan01
159b2de442 cmd/compile: optimize math/bits.Div32 for arm64
Benchmark:
name     old time/op  new time/op  delta
Div-8    22.0ns ± 0%  22.0ns ± 0%     ~     (all equal)
Div32-8  6.51ns ± 0%  3.00ns ± 0%  -53.90%  (p=0.000 n=10+8)
Div64-8  22.5ns ± 0%  22.5ns ± 0%     ~     (all equal)

Code:
func div32(hi, lo, y uint32) (q, r uint32) {return bits.Div32(hi, lo, y)}

Before:
        0x0020 00032 (test.go:24)       MOVWU   "".y+8(FP), R0
        0x0024 00036 ($GOROOT/src/math/bits/bits.go:472)        CBZW    R0, 132
        0x0028 00040 ($GOROOT/src/math/bits/bits.go:472)        MOVWU   "".hi(FP), R1
        0x002c 00044 ($GOROOT/src/math/bits/bits.go:472)        CMPW    R1, R0
        0x0030 00048 ($GOROOT/src/math/bits/bits.go:472)        BLS     96
        0x0034 00052 ($GOROOT/src/math/bits/bits.go:475)        MOVWU   "".lo+4(FP), R2
        0x0038 00056 ($GOROOT/src/math/bits/bits.go:475)        ORR     R1<<32, R2, R1
        0x003c 00060 ($GOROOT/src/math/bits/bits.go:476)        CBZ     R0, 140
        0x0040 00064 ($GOROOT/src/math/bits/bits.go:476)        UDIV    R0, R1, R2
        0x0044 00068 (test.go:24)       MOVW    R2, "".q+16(FP)
        0x0048 00072 ($GOROOT/src/math/bits/bits.go:476)        UREM    R0, R1, R0
        0x0050 00080 (test.go:24)       MOVW    R0, "".r+20(FP)
        0x0054 00084 (test.go:24)       MOVD    -8(RSP), R29
        0x0058 00088 (test.go:24)       MOVD.P  32(RSP), R30
        0x005c 00092 (test.go:24)       RET     (R30)

After:
        0x001c 00028 (test.go:24)       MOVWU   "".y+8(FP), R0
        0x0020 00032 (test.go:24)       CBZW    R0, 92
        0x0024 00036 (test.go:24)       MOVWU   "".hi(FP), R1
        0x0028 00040 (test.go:24)       CMPW    R0, R1
        0x002c 00044 (test.go:24)       BHS     84
        0x0030 00048 (test.go:24)       MOVWU   "".lo+4(FP), R2
        0x0034 00052 (test.go:24)       ORR     R1<<32, R2, R4
        0x0038 00056 (test.go:24)       UDIV    R0, R4, R3
        0x003c 00060 (test.go:24)       MSUB    R3, R4, R0, R4
        0x0040 00064 (test.go:24)       MOVW    R3, "".q+16(FP)
        0x0044 00068 (test.go:24)       MOVW    R4, "".r+20(FP)
        0x0048 00072 (test.go:24)       MOVD    -8(RSP), R29
        0x004c 00076 (test.go:24)       MOVD.P  16(RSP), R30
        0x0050 00080 (test.go:24)       RET     (R30)

UREM instruction in the previous assembly code will be converted to UDIV and MSUB instructions
on arm64. However the UDIV instruction in UREM is unnecessary, because it's a duplicate of the
previous UDIV. This CL adds a rule to have this extra UDIV instruction removed by CSE.

Change-Id: Ie2508784320020b2de022806d09f75a7871bb3d7
Reviewed-on: https://go-review.googlesource.com/c/159577
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Run-TryBot: Bryan C. Mills <bcmills@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2019-03-03 20:20:10 +00:00