1
0
mirror of https://github.com/golang/go synced 2024-11-17 16:24:42 -07:00
Commit Graph

373 Commits

Author SHA1 Message Date
Keith Randall
9ce27feaeb cmd/compile: add rule for post-decomposed growslice optimization
The recently added rule only works before decomposing slices.
Add a rule that works after decomposing slices.

The reason we need the latter is because although the length may
be a constant, it can be hidden inside a slice that is not constant
(its pointer or capacity might be changing). By applying this
optimization after decomposing slices, we can find more cases
where it applies.

Fixes #56440

Change-Id: I0094e59eee3065ab4d210defdda8227a6e897420
Reviewed-on: https://go-review.googlesource.com/c/go/+/446277
Run-TryBot: Keith Randall <khr@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
2022-10-31 21:40:49 +00:00
Keith Randall
0156b797e6 cmd/compile: recognize when the result of append has a constant length
Fixes a performance regression due to CL 418554.

Fixes #56440

Change-Id: I6ff152e9b83084756363f49ee6b0844a7a284880
Reviewed-on: https://go-review.googlesource.com/c/go/+/445875
Run-TryBot: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
2022-10-27 17:09:50 +00:00
Wayne Zuo
90a3527427 cmd/compile: intrinsify Sub64 on loong64
This is a follow up of CL 420095  on loong64.

file                                    before    after     Δ       %
compile/internal/ssa.a                  35649482  35653274  +3792   +0.011%
compile/internal/ssagen.a               4099858   4098728   -1130   -0.028%
ecdh.a                                  227896    226896    -1000   -0.439%
internal/nistec/fiat.a                  1212254   1128184   -84070  -6.935%
tls.a                                   3256800   3256802   +2      +0.000%
big.a                                   1708518   1702496   -6022   -0.352%
bits.a                                  106762    105734    -1028   -0.963%
math.a                                  578762    577288    -1474   -0.255%
netip.a                                 555922    555610    -312    -0.056%
net.a                                   3286528   3286530   +2      +0.000%
golang.org/x/crypto/internal/poly1305.a 109546    107686    -1860   -1.698%
total                                   260392768 260299668 -93100  -0.036%

Change-Id: Ieffca705aae5666501f284502d986ca179dde494
Reviewed-on: https://go-review.googlesource.com/c/go/+/428557
Reviewed-by: Carlos Amedee <carlos@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>
Run-TryBot: Wayne Zuo <wdvxdr@golangcn.org>
2022-10-07 18:16:26 +00:00
Wayne Zuo
97760ed651 cmd/compile: intrinsify Add64 on loong64
This is a follow up of CL 420094  on loong64.

Reduce go toolchain size slightly on linux/loong64.

compilecmp HEAD~1 -> HEAD
HEAD~1 (8a32354219): internal/trace: use strings.Builder
HEAD (1767784ac3): cmd/compile: intrinsify Add64 on loong64
platform: linux/loong64

file      before    after     Δ       %
addr2line 3882616   3882536   -80     -0.002%
api       5528866   5528450   -416    -0.008%
asm       5133780   5133796   +16     +0.000%
cgo       4668787   4668491   -296    -0.006%
compile   25163409  25164729  +1320   +0.005%
cover     4658055   4658007   -48     -0.001%
dist      3437783   3437727   -56     -0.002%
doc       3883069   3883205   +136    +0.004%
fix       3383254   3383070   -184    -0.005%
link      6747559   6747023   -536    -0.008%
nm        3793923   3793939   +16     +0.000%
objdump   4256628   4256812   +184    +0.004%
pack      2356328   2356144   -184    -0.008%
pprof     14233370  14131910  -101460 -0.713%
test2json 2638668   2638476   -192    -0.007%
trace     13392065  13360781  -31284  -0.234%
vet       7456388   7455588   -800    -0.011%
total     132498256 132364392 -133864 -0.101%

file                                    before    after     Δ       %
compile/internal/ssa.a                  35644590  35649482  +4892   +0.014%
compile/internal/ssagen.a               4101250   4099858   -1392   -0.034%
internal/edwards25519/field.a           226064    201718    -24346  -10.770%
internal/nistec/fiat.a                  1689922   1212254   -477668 -28.266%
tls.a                                   3256798   3256800   +2      +0.000%
big.a                                   1718552   1708518   -10034  -0.584%
bits.a                                  107786    106762    -1024   -0.950%
cmplx.a                                 169434    168214    -1220   -0.720%
math.a                                  581302    578762    -2540   -0.437%
netip.a                                 556096    555922    -174    -0.031%
net.a                                   3286526   3286528   +2      +0.000%
runtime.a                               8644786   8644510   -276    -0.003%
strconv.a                               519098    518374    -724    -0.139%
golang.org/x/crypto/internal/poly1305.a 115398    109546    -5852   -5.071%
total                                   260913122 260392768 -520354 -0.199%

Change-Id: I75b2bb7761fa5a0d0d032d4ebe3582d092ea77be
Reviewed-on: https://go-review.googlesource.com/c/go/+/428556
Reviewed-by: Carlos Amedee <carlos@golang.org>
Run-TryBot: Wayne Zuo <wdvxdr@golangcn.org>
Reviewed-by: David Chase <drchase@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
2022-10-07 18:16:10 +00:00
Wayne Zuo
af668c689c cmd/compile: fold constant shift with extension on riscv64
For example:

  movb a0, a0
  srai $1, a0, a0

the assembler will expand to:

  slli $56, a0, a0
  srai $56, a0, a0
  srai $1, a0, a0

this CL optimize to:

  slli $56, a0, a0
  srai $57, a0, a0

Remove 270+ instructions from Go binary on linux/riscv64.

Change-Id: I375e19f9d3bd54f2781791d8cbe5970191297dc8
Reviewed-on: https://go-review.googlesource.com/c/go/+/428496
Reviewed-by: Keith Randall <khr@google.com>
Run-TryBot: Wayne Zuo <wdvxdr@golangcn.org>
Reviewed-by: Joel Sing <joel@sing.id.au>
Reviewed-by: Cherry Mui <cherryyz@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
2022-10-06 05:21:04 +00:00
eric fang
ddc7d2a80c cmd/compile: add late lower pass for last rules to run
Usually optimization rules have corresponding priorities, some need to
be run first, some run next, and some run last, which produces the best
code. But currently our optimization rules have no priority, this CL
adds a late lower pass that runs those rules that need to be run at last,
such as split unreasonable constant folding. This pass can be seen as
the second round of the lower pass.

For example:
func foo(a, b uint64) uint64 {
        d := a+0x1234568
        d1 := b+0x1234568
        return d&d1
}
The code generated by the master branch:
	0x0004 00004        ADD     $19088744, R0, R2 // movz+movk+add
	0x0010 00016        ADD     $19088744, R1, R1 // movz+movk+add
	0x001c 00028        AND     R1, R2, R0

This is because the current constant folding optimization rules do not
take into account the range of constants, causing the constant to be
loaded repeatedly. This CL splits these unreasonable constants folding
in the late lower pass. With this CL the generated code:
	0x0004 00004        MOVD    $19088744, R2 // movz+movk
	0x000c 00012        ADD     R0, R2, R3
	0x0010 00016        ADD     R1, R2, R1
	0x0014 00020        AND     R1, R3, R0

This CL also adds constant folding optimization for ADDS instruction.

In addition, in order not to introduce the codegen regression, an
optimization rule is added to change the addition of a negative number
into a subtraction of a positive number.

go1 benchmarks:
name                     old time/op    new time/op    delta
BinaryTree17-8              1.22s ± 1%     1.24s ± 0%  +1.56%  (p=0.008 n=5+5)
Fannkuch11-8                1.54s ± 0%     1.53s ± 0%  -0.69%  (p=0.016 n=4+5)
FmtFprintfEmpty-8          14.1ns ± 0%    14.1ns ± 0%    ~     (p=0.079 n=4+5)
FmtFprintfString-8         26.0ns ± 0%    26.1ns ± 0%  +0.23%  (p=0.008 n=5+5)
FmtFprintfInt-8            32.3ns ± 0%    32.9ns ± 1%  +1.72%  (p=0.008 n=5+5)
FmtFprintfIntInt-8         54.5ns ± 0%    55.5ns ± 0%  +1.83%  (p=0.008 n=5+5)
FmtFprintfPrefixedInt-8    61.5ns ± 0%    62.0ns ± 0%  +0.93%  (p=0.008 n=5+5)
FmtFprintfFloat-8          72.0ns ± 0%    73.6ns ± 0%  +2.24%  (p=0.008 n=5+5)
FmtManyArgs-8               221ns ± 0%     224ns ± 0%  +1.22%  (p=0.008 n=5+5)
GobDecode-8                1.91ms ± 0%    1.93ms ± 0%  +0.98%  (p=0.008 n=5+5)
GobEncode-8                1.40ms ± 1%    1.39ms ± 0%  -0.79%  (p=0.032 n=5+5)
Gzip-8                      115ms ± 0%     117ms ± 1%  +1.17%  (p=0.008 n=5+5)
Gunzip-8                   19.4ms ± 1%    19.3ms ± 0%  -0.71%  (p=0.016 n=5+4)
HTTPClientServer-8         27.0µs ± 0%    27.3µs ± 0%  +0.80%  (p=0.008 n=5+5)
JSONEncode-8               3.36ms ± 1%    3.33ms ± 0%    ~     (p=0.056 n=5+5)
JSONDecode-8               17.5ms ± 2%    17.8ms ± 0%  +1.71%  (p=0.016 n=5+4)
Mandelbrot200-8            2.29ms ± 0%    2.29ms ± 0%    ~     (p=0.151 n=5+5)
GoParse-8                  1.35ms ± 1%    1.36ms ± 1%    ~     (p=0.056 n=5+5)
RegexpMatchEasy0_32-8      24.5ns ± 0%    24.5ns ± 0%    ~     (p=0.444 n=4+5)
RegexpMatchEasy0_1K-8       131ns ±11%     118ns ± 6%    ~     (p=0.056 n=5+5)
RegexpMatchEasy1_32-8      22.9ns ± 0%    22.9ns ± 0%    ~     (p=0.905 n=4+5)
RegexpMatchEasy1_1K-8       126ns ± 0%     127ns ± 0%    ~     (p=0.063 n=4+5)
RegexpMatchMedium_32-8      486ns ± 5%     483ns ± 0%    ~     (p=0.381 n=5+4)
RegexpMatchMedium_1K-8     15.4µs ± 1%    15.5µs ± 0%    ~     (p=0.151 n=5+5)
RegexpMatchHard_32-8        687ns ± 0%     686ns ± 0%    ~     (p=0.103 n=5+5)
RegexpMatchHard_1K-8       20.7µs ± 0%    20.7µs ± 1%    ~     (p=0.151 n=5+5)
Revcomp-8                   175ms ± 2%     176ms ± 3%    ~     (p=1.000 n=5+5)
Template-8                 20.4ms ± 6%    20.1ms ± 2%    ~     (p=0.151 n=5+5)
TimeParse-8                 112ns ± 0%     113ns ± 0%  +0.97%  (p=0.016 n=5+4)
TimeFormat-8                156ns ± 0%     145ns ± 0%  -7.14%  (p=0.029 n=4+4)

Change-Id: I3ced26e89041f873ac989586514ccc5ee09f13da
Reviewed-on: https://go-review.googlesource.com/c/go/+/425134
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
Run-TryBot: Eric Fang <eric.fang@arm.com>
2022-10-05 02:40:56 +00:00
Keith Randall
6485e8f503 cmd/compile: use stricter rule for possible partial overlap
Partial overlaps can only happen for strict sub-pieces of larger arrays.
That's a much stronger condition than the current optimization rules.

Update #54467

Change-Id: I11e539b71099e50175f37ee78fddf69283f83ee5
Reviewed-on: https://go-review.googlesource.com/c/go/+/433056
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>
Run-TryBot: Keith Randall <khr@golang.org>
Reviewed-by: Matthew Dempsky <mdempsky@google.com>
2022-09-27 20:09:33 +00:00
eric fang
cf53990b18 cmd/compile: Add some CMP and CMN optimization rules on arm64
This CL adds some optimizaion rules:
1, Converts CMP to CMN, or vice versa, when comparing with a negative
number.
2, For equal and not equal comparisons, CMP can be converted to CMN in
some cases. In theory we could do the same optimization for LT, LE, GT
and GE, but need to account for overflow, this CL doesn't handle them.

There are no noticeable performance changes.

Change-Id: Ia49266c019ab7908ebc9510c2f02e121b1607869
Reviewed-on: https://go-review.googlesource.com/c/go/+/429795
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Eric Fang <eric.fang@arm.com>
2022-09-20 01:14:40 +00:00
Joel Sing
a7bcc94719 cmd/compile: resolve known outcomes for SLTI/SLTIU on riscv64
When SLTI/SLTIU is used with ANDI/ORI, it may be possible to determine the
outcome based on the values of the immediates. Resolve these cases.

Improves code generation for various shift operations.

While here, sort tests by architecture to improve readability and ease
future maintenance.

Change-Id: I87e71e016a0e396a928e7d6389a2df61583dfd8d
Reviewed-on: https://go-review.googlesource.com/c/go/+/428217
Reviewed-by: Wayne Zuo <wdvxdr@golangcn.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Jenny Rakoczy <jenny@golang.org>
Reviewed-by: Jenny Rakoczy <jenny@golang.org>
Run-TryBot: Joel Sing <joel@sing.id.au>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Auto-Submit: Jenny Rakoczy <jenny@golang.org>
2022-09-17 17:17:52 +00:00
ruinan
454a058ffc cmd/compile: Add shiftIsBounded check for logic shifts of arm64
This CL adds shiftIsBounded checks for the Lsh* and Rsh* rules in arm64.
There is no need to check the shift value again with CMP + CSEL when the
shift value is valid.

Change-Id: I54620de64f02a1b5a11089add237248ae2de01b4
Reviewed-on: https://go-review.googlesource.com/c/go/+/417714
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Heschi Kreinick <heschi@google.com>
2022-09-07 20:10:13 +00:00
Keith Randall
5b1fbfba1c cmd/compile: rewrite >>c<<c to &^(1<<c-1)
Fixes #54496

Change-Id: I3c2ed8cd55836d5b07c8cdec00d3b584885aca79
Reviewed-on: https://go-review.googlesource.com/c/go/+/424856
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Heschi Kreinick <heschi@google.com>
Run-TryBot: Martin Möhrmann <martin@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Keith Randall <khr@golang.org>
Reviewed-by: Martin Möhrmann <martin@golang.org>
2022-09-02 18:51:37 +00:00
Matthew Dempsky
34f0029a85 cmd/compile/internal/noder: allow OCONVNOP for identical iface conversions
In go.dev/cl/421821, I included a hack to force OCONVNOP back to
OCONVIFACE for conversions involving shape types and non-empty
interfaces. The comment correctly noted that this was only needed for
conversions between non-identical types, but the code was conservative
and applied to even conversions between identical types.

This CL adds an extra bool to record whether the conversion is between
identical types, so we can keep OCONVNOP instead of forcing back to
OCONVIFACE. This has a small improvement to generated code, because we
no longer need a convI2I call (as demonstrated by codegen/ifaces.go).

But more usefully, this is relevant to pruning unnecessary itab slots
in runtime dictionaries (next CL).

Change-Id: I94f89e961cd26629b925037fea58d283140766ff
Reviewed-on: https://go-review.googlesource.com/c/go/+/427678
Reviewed-by: Keith Randall <khr@golang.org>
Run-TryBot: Matthew Dempsky <mdempsky@google.com>
Reviewed-by: Cuong Manh Le <cuong.manhle.vn@gmail.com>
Reviewed-by: Keith Randall <khr@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
2022-09-02 18:26:02 +00:00
Derek Parker
6605686e3b cmd/compile: new inline heuristic for struct compares
This CL changes the heuristic used to determine whether we can inline a
struct equality check or if we must generate a function and call that
function for equality.

The old method was to count struct fields, but this can lead to poor
in lining decisions. We should really be determining the cost of the
equality check and use that to determine if we should inline or generate
a function.

The new benchmark provided in this CL returns the following when compared
against tip:

```
name         old time/op  new time/op  delta
EqStruct-32  2.46ns ± 4%  0.25ns ±10%  -89.72%  (p=0.000 n=39+39)
```

Fixes #38494

Change-Id: Ie06b80a2b2a03a3fd0978bcaf7715f9afb66e0ab
GitHub-Last-Rev: e9a18d9389
GitHub-Pull-Request: golang/go#53326
Reviewed-on: https://go-review.googlesource.com/c/go/+/411674
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
Run-TryBot: Keith Randall <khr@golang.org>
Auto-Submit: Keith Randall <khr@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Heschi Kreinick <heschi@google.com>
2022-09-02 17:48:22 +00:00
ruinan
121344ac33 cmd/compile: optimize RotateLeft8/16 on arm64
This CL optimizes RotateLeft8/16 on arm64.

For 16 bits, we form a 32 bits register by duplicating two 16 bits
registers, then use RORW instruction to do the rotate shift.

For 8 bits, we just use LSR and LSL instead of RORW because the code is
simpler.

Benchmark          Old          ThisCL       delta
RotateLeft8-46     2.16 ns/op   1.73 ns/op   -19.70%
RotateLeft16-46    2.16 ns/op   1.54 ns/op   -28.53%

Change-Id: I09cde4383d12e31876a57f8cdfd3bb4f324fadb0
Reviewed-on: https://go-review.googlesource.com/c/go/+/420976
Reviewed-by: Keith Randall <khr@google.com>
Auto-Submit: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Heschi Kreinick <heschi@google.com>
Run-TryBot: Keith Randall <khr@golang.org>
2022-09-02 17:46:31 +00:00
ruinan
54c7bc9cff cmd/compile: optimize shift ops on arm64 when the shift value is v&63
For the following code case:

  var x uint64
  x >> (shift & 63)

We can directly genereta `x >> shift` on arm64, since the hardware will
only use the bottom 6 bits of the shift amount.

Benchmark               old time/op  new time/op    delta
ShiftArithmeticRight-8  0.40ns       0.31ns        -21.7%

Change-Id: Id58c8a5b2f6dd5c30c3876f4a36e11b4d81e2dc9
Reviewed-on: https://go-review.googlesource.com/c/go/+/425294
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Auto-Submit: Keith Randall <khr@golang.org>
Run-TryBot: Keith Randall <khr@golang.org>
Reviewed-by: Heschi Kreinick <heschi@google.com>
2022-09-02 17:44:29 +00:00
Keith Randall
33a7e5a4b4 cmd/compile: combine multiple rotate instructions
Rotating by c, then by d, is the same as rotating by c+d.

Change-Id: I36df82261460ff80f7c6d39bcdf0e840cef1c91a
Reviewed-on: https://go-review.googlesource.com/c/go/+/424894
Reviewed-by: Wayne Zuo <wdvxdr@golangcn.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Ruinan Sun <Ruinan.Sun@arm.com>
Run-TryBot: Keith Randall <khr@golang.org>
Reviewed-by: Heschi Kreinick <heschi@google.com>
2022-08-31 22:10:52 +00:00
Keith Randall
69aed4712d cmd/compile: use better splitting condition for string binary search
Currently we use a full cmpstring to do the comparison for each
split in the binary search for a string switch.

Instead, split by comparing a single byte of the input string with a
constant. That will give us a much faster split (although it might be
not quite as good a split).

Fixes #53333

R=go1.20

Change-Id: I28c7209342314f367071e4aa1f2beb6ec9ff7123
Reviewed-on: https://go-review.googlesource.com/c/go/+/414894
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Keith Randall <khr@golang.org>
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Heschi Kreinick <heschi@google.com>
2022-08-31 22:08:26 +00:00
Wayne Zuo
da6556968f cmd/compile: simplify bounded shift on riscv64
The prove pass will mark some shifts bounded, and then we can use that
information to generate better code on riscv64.

Change-Id: Ia22f43d0598453c9417adac7017db28d7240948b
Reviewed-on: https://go-review.googlesource.com/c/go/+/422616
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Joel Sing <joel@sing.id.au>
Run-TryBot: Wayne Zuo <wdvxdr@golangcn.org>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
Auto-Submit: Keith Randall <khr@golang.org>
Run-TryBot: Keith Randall <khr@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
2022-08-31 20:21:00 +00:00
Wayne Zuo
e8f0340fa4 cmd/compile: intrinsify RotateLeft{32,64} on loong64
Benchmark on crypto/sha256 (provided by Xiaodong Liu):
name               old time/op    new time/op    delta
Hash8Bytes/New       1.19µs ± 0%    0.97µs ± 0%  -18.75%  (p=0.000 n=9+9)
Hash8Bytes/Sum224    1.21µs ± 0%    0.97µs ± 0%  -20.04%  (p=0.000 n=9+10)
Hash8Bytes/Sum256    1.21µs ± 0%    0.98µs ± 0%  -19.16%  (p=0.000 n=10+7)
Hash1K/New           15.9µs ± 0%    12.4µs ± 0%  -22.10%  (p=0.000 n=10+10)
Hash1K/Sum224        15.9µs ± 0%    12.4µs ± 0%  -22.18%  (p=0.000 n=8+10)
Hash1K/Sum256        15.9µs ± 0%    12.4µs ± 0%  -22.15%  (p=0.000 n=10+9)
Hash8K/New            119µs ± 0%      92µs ± 0%  -22.40%  (p=0.000 n=10+9)
Hash8K/Sum224         119µs ± 0%      92µs ± 0%  -22.41%  (p=0.000 n=9+10)
Hash8K/Sum256         119µs ± 0%      92µs ± 0%  -22.40%  (p=0.000 n=9+9)

name               old speed      new speed      delta
Hash8Bytes/New     6.70MB/s ± 0%  8.25MB/s ± 0%  +23.13%  (p=0.000 n=10+10)
Hash8Bytes/Sum224  6.60MB/s ± 0%  8.26MB/s ± 0%  +25.06%  (p=0.000 n=10+10)
Hash8Bytes/Sum256  6.59MB/s ± 0%  8.15MB/s ± 0%  +23.67%  (p=0.000 n=10+7)
Hash1K/New         64.3MB/s ± 0%  82.5MB/s ± 0%  +28.36%  (p=0.000 n=10+10)
Hash1K/Sum224      64.3MB/s ± 0%  82.6MB/s ± 0%  +28.51%  (p=0.000 n=10+10)
Hash1K/Sum256      64.3MB/s ± 0%  82.6MB/s ± 0%  +28.46%  (p=0.000 n=9+9)
Hash8K/New         69.0MB/s ± 0%  89.0MB/s ± 0%  +28.87%  (p=0.000 n=10+8)
Hash8K/Sum224      69.0MB/s ± 0%  89.0MB/s ± 0%  +28.88%  (p=0.000 n=9+10)
Hash8K/Sum256      69.0MB/s ± 0%  88.9MB/s ± 0%  +28.87%  (p=0.000 n=8+9)

Benchmark on crypto/sha512 (provided by Xiaodong Liu):
name               old time/op    new time/op     delta
Hash8Bytes/New       1.55µs ± 0%     1.31µs ± 0%  -15.67%  (p=0.000 n=10+10)
Hash8Bytes/Sum384    1.59µs ± 0%     1.35µs ± 0%  -14.97%  (p=0.000 n=10+10)
Hash8Bytes/Sum512    1.62µs ± 0%     1.39µs ± 0%  -14.02%  (p=0.000 n=10+10)
Hash1K/New           10.7µs ± 0%      8.6µs ± 0%  -19.60%  (p=0.000 n=8+8)
Hash1K/Sum384        10.8µs ± 0%      8.7µs ± 0%  -19.40%  (p=0.000 n=9+9)
Hash1K/Sum512        10.8µs ± 0%      8.7µs ± 0%  -19.35%  (p=0.000 n=9+10)
Hash8K/New           74.6µs ± 0%     59.6µs ± 0%  -20.08%  (p=0.000 n=10+9)
Hash8K/Sum384        74.7µs ± 0%     59.7µs ± 0%  -20.04%  (p=0.000 n=9+8)
Hash8K/Sum512        74.7µs ± 0%     59.7µs ± 0%  -20.01%  (p=0.000 n=10+10)

name               old speed      new speed       delta
Hash8Bytes/New     5.16MB/s ± 0%   6.12MB/s ± 0%  +18.60%  (p=0.000 n=10+8)
Hash8Bytes/Sum384  5.02MB/s ± 0%   5.90MB/s ± 0%  +17.56%  (p=0.000 n=10+10)
Hash8Bytes/Sum512  4.94MB/s ± 0%   5.74MB/s ± 0%  +16.29%  (p=0.000 n=10+9)
Hash1K/New         95.4MB/s ± 0%  118.6MB/s ± 0%  +24.38%  (p=0.000 n=10+10)
Hash1K/Sum384      95.0MB/s ± 0%  117.9MB/s ± 0%  +24.06%  (p=0.000 n=8+9)
Hash1K/Sum512      94.8MB/s ± 0%  117.5MB/s ± 0%  +23.99%  (p=0.000 n=8+9)
Hash8K/New          110MB/s ± 0%    137MB/s ± 0%  +25.11%  (p=0.000 n=9+6)
Hash8K/Sum384       110MB/s ± 0%    137MB/s ± 0%  +25.07%  (p=0.000 n=9+8)
Hash8K/Sum512       110MB/s ± 0%    137MB/s ± 0%  +25.01%  (p=0.000 n=10+10)

Change-Id: I28ccfce634659305a336c8e0a3f8589f7361d661
Reviewed-on: https://go-review.googlesource.com/c/go/+/422317
Reviewed-by: Keith Randall <khr@golang.org>
Run-TryBot: Wayne Zuo <wdvxdr@golangcn.org>
Reviewed-by: Ian Lance Taylor <iant@google.com>
Reviewed-by: David Chase <drchase@google.com>
2022-08-30 03:21:44 +00:00
Wayne Zuo
a6219737e3 cmd/compile: intrinsify Sub64 on riscv64
After this CL, the performance difference in crypto/elliptic
benchmarks on linux/riscv64 are:

name                 old time/op    new time/op    delta
ScalarBaseMult/P256    1.64ms ± 1%    1.60ms ± 1%   -2.36%  (p=0.008 n=5+5)
ScalarBaseMult/P224    1.53ms ± 1%    1.47ms ± 2%   -4.24%  (p=0.008 n=5+5)
ScalarBaseMult/P384    5.12ms ± 2%    5.03ms ± 2%     ~     (p=0.095 n=5+5)
ScalarBaseMult/P521    22.3ms ± 2%    13.8ms ± 1%  -37.89%  (p=0.008 n=5+5)
ScalarMult/P256        4.49ms ± 2%    4.26ms ± 2%   -5.13%  (p=0.008 n=5+5)
ScalarMult/P224        4.33ms ± 1%    4.09ms ± 1%   -5.59%  (p=0.008 n=5+5)
ScalarMult/P384        16.3ms ± 1%    15.5ms ± 2%   -4.78%  (p=0.008 n=5+5)
ScalarMult/P521         101ms ± 0%      47ms ± 2%  -53.36%  (p=0.008 n=5+5)

Change-Id: I31cf0506e27f9d85f576af1813630a19c20dda8a
Reviewed-on: https://go-review.googlesource.com/c/go/+/420095
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Joel Sing <joel@sing.id.au>
Reviewed-by: David Chase <drchase@google.com>
Run-TryBot: Wayne Zuo <wdvxdr@golangcn.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
2022-08-27 05:43:59 +00:00
Wayne Zuo
969f48a3a2 cmd/compile: intrinsify Add64 on riscv64
According to RISCV instruction set manual v2.2 Sec 2.4, we can
implement overflowing check for unsigned addition cheaply using
SLTU instructions.

After this CL, the performance difference in crypto/elliptic
benchmarks on linux/riscv64 are:

name                 old time/op    new time/op    delta
ScalarBaseMult/P256    1.93ms ± 1%    1.64ms ± 1%  -14.96%  (p=0.008 n=5+5)
ScalarBaseMult/P224    1.80ms ± 2%    1.53ms ± 1%  -14.89%  (p=0.008 n=5+5)
ScalarBaseMult/P384    6.15ms ± 2%    5.12ms ± 2%  -16.73%  (p=0.008 n=5+5)
ScalarBaseMult/P521    25.9ms ± 1%    22.3ms ± 2%  -13.78%  (p=0.008 n=5+5)
ScalarMult/P256        5.59ms ± 1%    4.49ms ± 2%  -19.79%  (p=0.008 n=5+5)
ScalarMult/P224        5.42ms ± 1%    4.33ms ± 1%  -20.01%  (p=0.008 n=5+5)
ScalarMult/P384        19.9ms ± 2%    16.3ms ± 1%  -18.15%  (p=0.008 n=5+5)
ScalarMult/P521        97.3ms ± 1%   100.7ms ± 0%   +3.48%  (p=0.008 n=5+5)

Change-Id: Ic4c82ced4b072a4a6575343fa9f29dd09b0cabc4
Reviewed-on: https://go-review.googlesource.com/c/go/+/420094
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Run-TryBot: Wayne Zuo <wdvxdr@golangcn.org>
Reviewed-by: Joel Sing <joel@sing.id.au>
TryBot-Result: Gopher Robot <gobot@golang.org>
2022-08-27 05:43:32 +00:00
Wayne Zuo
b60432df14 cmd/compile: deadcode for LoweredMuluhilo on riscv64
This is a follow up of CL 425101 on RISCV64.

According to RISCV Volume 1, Unprivileged Spec v. 20191213 Chapter 7.1:
If both the high and low bits of the same product are required, then the
recommended code sequence is: MULH[[S]U] rdh, rs1, rs2; MUL rdl, rs1, rs2
(source register specifiers must be in same order and rdh cannot be the
same as rs1 or rs2). Microarchitectures can then fuse these into a single
multiply operation instead of performing two separate multiplies.

So we should not split Muluhilo to separate instructions.

Updates #54607

Change-Id: If47461f3aaaf00e27cd583a9990e144fb8bcdb17
Reviewed-on: https://go-review.googlesource.com/c/go/+/425203
Auto-Submit: Keith Randall <khr@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Wayne Zuo <wdvxdr@golangcn.org>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
2022-08-24 18:08:33 +00:00
Keith Randall
60ad3c48f5 cmd/compile: move SSA rotate instruction detection to arch-independent rules
Detect rotate instructions while still in architecture-independent form.
It's easier to do here, and we don't need to repeat it in each
architecture file.

Change-Id: I9396954b3f3b3bfb96c160d064a02002309935bb
Reviewed-on: https://go-review.googlesource.com/c/go/+/421195
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Eric Fang <eric.fang@arm.com>
Reviewed-by: Cuong Manh Le <cuong.manhle.vn@gmail.com>
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Joedian Reid <joedian@golang.org>
Reviewed-by: Ruinan Sun <Ruinan.Sun@arm.com>
Run-TryBot: Keith Randall <khr@golang.org>
2022-08-23 21:24:14 +00:00
eric fang
9f0f87c806 cmd/internal/obj/arm64: remove the transition from $0 to ZR
Previously we convert $0 to the ZR register for some reasons, which causes
two problems:
1. Confusion, the special case of the ZR register needs to be considered
when dealing with constants. For encoding, some places we encode ZR, and
some places we encode $0, although we have converted $0 to ZR.
2. Unexpected instruction format. All instructions that support ZR register
operands can be replaced by $0.

This patch removes this conversion. Note that this patch may cause previously
unintendedly supported instruction formats to no longer be supported.

Change-Id: I3d8d2c06711b7614a38191397da7776417f1861c
Reviewed-on: https://go-review.googlesource.com/c/go/+/404316
Reviewed-by: David Chase <drchase@google.com>
Run-TryBot: Eric Fang <eric.fang@arm.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
2022-08-23 06:11:32 +00:00
eric fang
0a52d80666 cmd/compile/internal/ssa: optimize memory moving on arm64
This CL optimizes memory moving with LDP and STP on arm64.

Benchmarks:
name              old time/op  new time/op  delta
ClearFat7-160     1.08ns ± 0%  0.95ns ± 0%  -11.41%  (p=0.008 n=5+5)
ClearFat8-160     0.84ns ± 0%  0.84ns ± 0%   -0.95%  (p=0.008 n=5+5)
ClearFat11-160    1.08ns ± 0%  0.95ns ± 0%  -11.46%  (p=0.008 n=5+5)
ClearFat12-160    0.95ns ± 0%  0.95ns ± 0%     ~     (p=0.063 n=4+5)
ClearFat13-160    1.08ns ± 0%  0.95ns ± 0%  -11.45%  (p=0.008 n=5+5)
ClearFat14-160    1.08ns ± 0%  0.95ns ± 0%  -11.47%  (p=0.008 n=5+5)
ClearFat15-160    1.24ns ± 0%  0.95ns ± 0%  -22.98%  (p=0.029 n=4+4)
ClearFat16-160    0.84ns ± 0%  0.83ns ± 0%   -0.11%  (p=0.008 n=5+5)
ClearFat24-160    2.15ns ± 0%  2.15ns ± 0%     ~     (all equal)
ClearFat32-160    2.86ns ± 0%  2.86ns ± 0%     ~     (p=0.333 n=5+4)
ClearFat40-160    2.15ns ± 0%  2.15ns ± 0%     ~     (all equal)
ClearFat48-160    3.32ns ± 1%  3.31ns ± 1%     ~     (p=0.690 n=5+5)
ClearFat56-160    2.15ns ± 0%  2.15ns ± 0%     ~     (all equal)
ClearFat64-160    3.25ns ± 1%  3.26ns ± 1%     ~     (p=0.841 n=5+5)
ClearFat72-160    2.22ns ± 0%  2.22ns ± 0%     ~     (p=0.444 n=5+5)
ClearFat128-160   4.03ns ± 0%  4.04ns ± 0%   +0.32%  (p=0.008 n=5+5)
ClearFat256-160   6.44ns ± 0%  6.44ns ± 0%   +0.08%  (p=0.016 n=4+5)
ClearFat512-160   12.2ns ± 0%  12.2ns ± 0%   +0.13%  (p=0.008 n=5+5)
ClearFat1024-160  24.3ns ± 0%  24.3ns ± 0%     ~     (p=0.167 n=5+5)
ClearFat1032-160  24.5ns ± 0%  24.5ns ± 0%     ~     (p=0.238 n=4+5)
ClearFat1040-160  29.2ns ± 0%  29.3ns ± 0%   +0.34%  (p=0.008 n=5+5)
CopyFat7-160      1.43ns ± 0%  1.07ns ± 0%  -24.97%  (p=0.008 n=5+5)
CopyFat8-160      0.89ns ± 0%  0.89ns ± 0%     ~     (p=0.238 n=5+5)
CopyFat11-160     1.43ns ± 0%  1.07ns ± 0%  -24.97%  (p=0.008 n=5+5)
CopyFat12-160     1.07ns ± 0%  1.07ns ± 0%     ~     (p=0.238 n=5+4)
CopyFat13-160     1.43ns ± 0%  1.07ns ± 0%     ~     (p=0.079 n=4+5)
CopyFat14-160     1.43ns ± 0%  1.07ns ± 0%  -24.95%  (p=0.008 n=5+5)
CopyFat15-160     1.79ns ± 0%  1.07ns ± 0%     ~     (p=0.079 n=4+5)
CopyFat16-160     1.07ns ± 0%  1.07ns ± 0%     ~     (p=0.444 n=5+5)
CopyFat24-160     1.84ns ± 2%  1.67ns ± 0%   -9.28%  (p=0.008 n=5+5)
CopyFat32-160     3.22ns ± 0%  2.92ns ± 0%   -9.40%  (p=0.008 n=5+5)
CopyFat64-160     3.64ns ± 0%  3.57ns ± 0%   -1.96%  (p=0.008 n=5+5)
CopyFat72-160     3.56ns ± 0%  3.11ns ± 0%  -12.89%  (p=0.008 n=5+5)
CopyFat128-160    5.06ns ± 0%  5.06ns ± 0%   +0.04%  (p=0.048 n=5+5)
CopyFat256-160    9.13ns ± 0%  9.13ns ± 0%     ~     (p=0.659 n=5+5)
CopyFat512-160    17.4ns ± 0%  17.4ns ± 0%     ~     (p=0.167 n=5+5)
CopyFat520-160    17.2ns ± 0%  17.3ns ± 0%   +0.37%  (p=0.008 n=5+5)
CopyFat1024-160   34.1ns ± 0%  34.0ns ± 0%     ~     (p=0.127 n=5+5)
CopyFat1032-160   80.9ns ± 0%  34.2ns ± 0%  -57.74%  (p=0.008 n=5+5)
CopyFat1040-160   94.4ns ± 0%  41.7ns ± 0%  -55.78%  (p=0.016 n=5+4)

Change-Id: I14186f9f82b0ecf8b6c02191dc5da566b9a21e6c
Reviewed-on: https://go-review.googlesource.com/c/go/+/421654
Reviewed-by: Cherry Mui <cherryyz@google.com>
Run-TryBot: Eric Fang <eric.fang@arm.com>
Reviewed-by: Keith Randall <khr@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
2022-08-23 03:09:07 +00:00
Cherry Mui
8bf9e01473 cmd/compile: split Muluhilo op on ARM64
On ARM64 we use two separate instructions to compute the hi and lo
results of a 64x64->128 multiplication. Lower to two separate ops
so if only one result is needed we can deadcode the other.

Fixes #54607.

Change-Id: Ib023e77eb2b2b0bcf467b45471cb8a294bce6f90
Reviewed-on: https://go-review.googlesource.com/c/go/+/425101
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
2022-08-22 21:29:31 +00:00
Keith Randall
908499adec cmd/compile: stop using VARKILL
With the introduction of stack objects, VARKILL information is
no longer needed.

With stack objects, an object is dead when there are no more static
references to it, and the stack scanner can't find any live pointers
to it. VARKILL information isn't used to establish live ranges for
address-taken variables any more. In effect, the last static reference
*is* the VARKILL, and there's an additional dynamic liveness check
during stack scanning.

Next CL will actually rip out the VARKILL opcodes.

Change-Id: I030a2ab867445cf4e0e69397911f8a2e2f0ed07b
Reviewed-on: https://go-review.googlesource.com/c/go/+/419234
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: David Chase <drchase@google.com>
Run-TryBot: Keith Randall <khr@golang.org>
2022-08-18 17:36:38 +00:00
Archana R
d09c6ac417 test/codegen: updated multiple tests to verify on ppc64,ppc64le
Updated multiple tests in test/codegen: math.go, mathbits.go, shift.go
and slices.go to verify on ppc64/ppc64le as well

Change-Id: Id88dd41569b7097819fb4d451b615f69cf7f7a94
Reviewed-on: https://go-review.googlesource.com/c/go/+/412115
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Archana Ravindar <aravind5@in.ibm.com>
Reviewed-by: Than McIntosh <thanm@google.com>
Reviewed-by: Paul Murphy <murp@ibm.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
2022-08-17 13:56:55 +00:00
Wayne Zuo
09932f95f5 cmd/compile: combine more constant stores on amd64
Fixes #53324

Change-Id: I06149d860f858b082235e9d80bf0ea494679b386
Reviewed-on: https://go-review.googlesource.com/c/go/+/411614
Reviewed-by: Keith Randall <khr@google.com>
Run-TryBot: Wayne Zuo <wdvxdr@golangcn.org>
Reviewed-by: Keith Randall <khr@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@google.com>
2022-08-15 16:19:21 +00:00
eric fang
efe5929dbd cmd/compile/internal/ssa: optimize ARM64 code with TST
For signed comparisons, the following four optimization rules hold:

(CMPconst [0] z:(AND x y)) && z.Uses == 1 => (TST x y)
(CMPWconst [0] z:(AND x y)) && z.Uses == 1 => (TSTW x y)
(CMPconst [0] x:(ANDconst [c] y)) && x.Uses == 1 => (TSTconst [c] y)
(CMPWconst [0] x:(ANDconst [c] y)) && x.Uses == 1 => (TSTWconst [int32(c)] y)

But currently they only apply to jump instructions, not to conditional
instructions within a block, such as cset, csel, etc. This CL extends
the above rules into blocks so that conditional instructions can also be
optimized.

name                       old time/op  new time/op  delta
DivisiblePow2constI64-160  1.04ns ± 0%  0.86ns ± 0%  -17.30%  (p=0.008 n=5+5)
DivisiblePow2constI32-160  1.04ns ± 0%  0.87ns ± 0%  -16.16%  (p=0.016 n=4+5)
DivisiblePow2constI16-160  1.04ns ± 0%  0.87ns ± 0%  -16.03%  (p=0.008 n=5+5)
DivisiblePow2constI8-160   1.04ns ± 0%  0.86ns ± 0%  -17.15%  (p=0.008 n=5+5)

Change-Id: I6bc34bff30862210e8dd001e0340b8fe502fe3de
Reviewed-on: https://go-review.googlesource.com/c/go/+/420434
Reviewed-by: Cherry Mui <cherryyz@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Run-TryBot: Eric Fang <eric.fang@arm.com>
2022-08-10 02:13:53 +00:00
Cuong Manh Le
0f8dffd0aa all: use ":" for compiler generated symbols
As it can't appear in user package paths.

There is a hack for handling "go:buildid" and "type:*" on windows/386.

Previously, windows/386 requires underscore prefix on external symbols,
but that's only applied for SHOSTOBJ/SUNDEFEXT or cgo export symbols.
"go.buildid" is STEXT, "type.*" is STYPE, thus they are not prefixed
with underscore.

In external linking mode, the external linker can't resolve them as
external symbols. But we are lucky that they have "." in their name,
so the external linker see them as Forwarder RVA exports. See:

 - https://docs.microsoft.com/en-us/windows/win32/debug/pe-format#export-address-table
 - https://sourceware.org/git/?p=binutils-gdb.git;a=blob;f=ld/pe-dll.c;h=e7b82ba6ffadf74dc1b9ee71dc13d48336941e51;hb=HEAD#l972)

This CL changes "." to ":" in symbols name, so theses symbols can not be
found by external linker anymore. So a hacky way is adding the
underscore prefix for these 2 symbols. I don't have enough knowledge to
verify whether adding the underscore for all STEXT/STYPE symbols are
fine, even if it could be, that would be done in future CL.

Fixes #37762

Change-Id: I92eaaf24c0820926a36e0530fdb07b07af1fcc35
Reviewed-on: https://go-review.googlesource.com/c/go/+/317917
Reviewed-by: Than McIntosh <thanm@google.com>
Run-TryBot: Cuong Manh Le <cuong.manhle.vn@gmail.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
2022-08-09 11:28:56 +00:00
Lynn Boger
c1bfefe9d1 cmd/compile: fix confusion with ANDCCconst in PPC64 rules
Currently there is a an ANDconst and an ANDCCconst op in PPC64,
which is confusing since they map onto the same instruction.
One of these ops sets the result of the AND operation, and the
other sets the flag (condition register).

This converts ANDCCconst into an op with the 2 expected results:
the integer result of the AND and the flag setting. The ANDconst
op has been removed.

Note that in the PPC64 ISA the only variation of the 'and immediate'
is the one that sets the condition bit, which probably led to the
original (confusing) implementation.

This also adds a few rules to improve the use of ANDCCconst with
ISELB and some testcases to verify those improvements.

Change-Id: I523703fa4da2098eb995dc3ba744d36fa28e41d4
Reviewed-on: https://go-review.googlesource.com/c/go/+/422015
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Paul Murphy <murp@ibm.com>
2022-08-08 20:15:55 +00:00
Keith Randall
c2a9c55823 cmd/compile: optimize unsafe.Slice generated code
We don't need a multiply when the element type is size 0 or 1.

The panic functions don't return, so we don't need any post-call
code (register restores, etc.).

Change-Id: I0dcea5df56d29d7be26554ddca966b3903c672e5
Reviewed-on: https://go-review.googlesource.com/c/go/+/419754
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Cuong Manh Le <cuong.manhle.vn@gmail.com>
Run-TryBot: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Than McIntosh <thanm@google.com>
Reviewed-by: Matthew Dempsky <mdempsky@google.com>
2022-08-08 17:36:47 +00:00
cuiweixie
e7307034cc cmd/compile: store combine on amd64
Fixes #54120

Change-Id: I6915b6e8d459d9becfdef4fdcba95ee4dea6af05
GitHub-Last-Rev: 03f19942c7
GitHub-Pull-Request: golang/go#54126
Reviewed-on: https://go-review.googlesource.com/c/go/+/420115
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Than McIntosh <thanm@google.com>
Run-TryBot: Wayne Zuo <wdvxdr@golangcn.org>
2022-08-08 16:40:58 +00:00
Matthew Dempsky
ab8d7dd75e cmd/compile: set LocalPkg.Path to -p flag
Since CL 391014, cmd/compile now requires the -p flag to be set the
build system. This CL changes it to initialize LocalPkg.Path to the
provided path, rather than relying on writing out `"".` into object
files and expecting cmd/link to substitute them.

However, this actually involved a rather long tail of fixes. Many have
already been submitted, but a few notable ones that have to land
simultaneously with changing LocalPkg:

1. When compiling package runtime, there are really two "runtime"
packages: types.LocalPkg (the source package itself) and
ir.Pkgs.Runtime (the compiler's internal representation, for synthetic
references). Previously, these ended up creating separate link
symbols (`"".xxx` and `runtime.xxx`, respectively), but now they both
end up as `runtime.xxx`, which causes lsym collisions (notably
inittask and funcsyms).

2. test/codegen tests need to be updated to expect symbols to be named
`command-line-arguments.xxx` rather than `"".foo`.

3. The issue20014 test case is sensitive to the sort order of field
tracking symbols. In particular, the local package now sorts to its
natural place in the list, rather than to the front.

Thanks to David Chase for helping track down all of the fixes needed
for this CL.

Updates #51734.

Change-Id: Iba3041cf7ad967d18c6e17922fa06ba11798b565
Reviewed-on: https://go-review.googlesource.com/c/go/+/393715
Reviewed-by: David Chase <drchase@google.com>
Run-TryBot: Matthew Dempsky <mdempsky@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Cuong Manh Le <cuong.manhle.vn@gmail.com>
2022-05-16 18:19:47 +00:00
Cherry Mui
540f8c2b50 cmd/compile: use jump table on ARM64
Following CL 357330, use jump tables on ARM64.

name                         old time/op  new time/op  delta
Switch8Predictable-4         3.41ns ± 0%  3.21ns ± 0%     ~     (p=0.079 n=4+5)
Switch8Unpredictable-4       12.0ns ± 0%   9.5ns ± 0%  -21.17%  (p=0.000 n=5+4)
Switch32Predictable-4        3.06ns ± 0%  2.82ns ± 0%   -7.78%  (p=0.008 n=5+5)
Switch32Unpredictable-4      13.3ns ± 0%   9.5ns ± 0%  -28.87%  (p=0.016 n=4+5)
SwitchStringPredictable-4    3.71ns ± 0%  3.21ns ± 0%  -13.43%  (p=0.000 n=5+4)
SwitchStringUnpredictable-4  14.8ns ± 0%  15.1ns ± 0%   +2.37%  (p=0.008 n=5+5)

Change-Id: Ia0b85df7ca9273cf70c05eb957225c6e61822fa6
Reviewed-on: https://go-review.googlesource.com/c/go/+/403979
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
Run-TryBot: Cherry Mui <cherryyz@google.com>
Reviewed-by: David Chase <drchase@google.com>
2022-05-13 19:51:03 +00:00
Paul E. Murphy
0c43878baa cmd/compile: lower Add64/Sub64 into ssa on PPC64
math/bits.Add64 and math/bits.Sub64 now lower and optimize
directly in SSA form.

The optimization of carry chains focuses around eliding
XER<->GPR transfers of the CA bit when used exclusively as an
input to a single carry operations, or when the CA value is
known.

This also adds support for handling XER spills in the assembler
which could happen if carry chains contain inter-dependencies
on each other (which seems very unlikely with practical usage),
or a clobber happens (SRAW/SRAD/SUBFC operations clobber CA).

With PPC64 Add64/Sub64 lowering into SSA and this patch, the net
performance difference in crypto/elliptic benchmarks on P9/ppc64le
are:

name                                old time/op    new time/op    delta
ScalarBaseMult/P256                   46.3µs ± 0%    46.9µs ± 0%   +1.34%
ScalarBaseMult/P224                    356µs ± 0%     209µs ± 0%  -41.14%
ScalarBaseMult/P384                   1.20ms ± 0%    0.57ms ± 0%  -52.14%
ScalarBaseMult/P521                   3.38ms ± 0%    1.44ms ± 0%  -57.27%
ScalarMult/P256                        199µs ± 0%     199µs ± 0%   -0.17%
ScalarMult/P224                        357µs ± 0%     212µs ± 0%  -40.56%
ScalarMult/P384                       1.20ms ± 0%    0.58ms ± 0%  -51.86%
ScalarMult/P521                       3.37ms ± 0%    1.44ms ± 0%  -57.32%
MarshalUnmarshal/P256/Uncompressed    2.59µs ± 0%    2.52µs ± 0%   -2.63%
MarshalUnmarshal/P256/Compressed      2.58µs ± 0%    2.52µs ± 0%   -2.06%
MarshalUnmarshal/P224/Uncompressed    1.54µs ± 0%    1.40µs ± 0%   -9.42%
MarshalUnmarshal/P224/Compressed      1.54µs ± 0%    1.39µs ± 0%   -9.87%
MarshalUnmarshal/P384/Uncompressed    2.40µs ± 0%    1.80µs ± 0%  -24.93%
MarshalUnmarshal/P384/Compressed      2.35µs ± 0%    1.81µs ± 0%  -23.03%
MarshalUnmarshal/P521/Uncompressed    3.79µs ± 0%    2.58µs ± 0%  -31.81%
MarshalUnmarshal/P521/Compressed      3.80µs ± 0%    2.60µs ± 0%  -31.67%

Note, P256 uses an asm implementation, thus, little variation is expected.

Change-Id: I88a24f6bf0f4f285c649e40243b1ab69cc452b71
Reviewed-on: https://go-review.googlesource.com/c/go/+/346870
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Run-TryBot: Paul Murphy <murp@ibm.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@google.com>
2022-05-10 20:03:53 +00:00
Jorropo
e1e056fa6a cmd/compile: fold constants found by prove
It is hit ~70k times building go.
This make the go binary, 0.04% smaller.
I didn't included benchmarks because this is just constant foldings
and is hard to mesure objectively.

For example, this enable rewriting things like:
  if x == 20 {
    return x + 30 + z
  }

Into:
  if x == 20 {
    return 50 + z
  }

It's not just fixing programer's code,
the ssa generator generate code like this sometimes.

Change-Id: I0861f342b27f7227b5f1c34d8267fa0057b1bbbc
GitHub-Last-Rev: 4c2f9b5216
GitHub-Pull-Request: golang/go#52669
Reviewed-on: https://go-review.googlesource.com/c/go/+/403735
Reviewed-by: Keith Randall <khr@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: David Chase <drchase@google.com>
2022-05-04 20:30:17 +00:00
Paul E. Murphy
c570f0eda2 cmd/compile: combine OR + NOT into ORN on PPC64
This shows up in a few crypto functions, and other
assorted places.

Change-Id: I5a7f4c25ddd4a6499dc295ef693b9fe43d2448ab
Reviewed-on: https://go-review.googlesource.com/c/go/+/404057
Run-TryBot: Paul Murphy <murp@ibm.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Russ Cox <rsc@golang.org>
2022-05-04 18:49:50 +00:00
Cuong Manh Le
0668e3cb1a cmd/compile: support pointers to arrays in arrayClear
Fixes #52635

Change-Id: I85f182931e30292983ef86c55a0ab6e01282395c
Reviewed-on: https://go-review.googlesource.com/c/go/+/403337
Reviewed-by: Matthew Dempsky <mdempsky@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Cuong Manh Le <cuong.manhle.vn@gmail.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
2022-05-03 05:42:48 +00:00
Keith Randall
c4b2288755 cmd/compile: add jump table codegen test
Change-Id: Ic67f676f5ebe146166a0d3c1d78a802881320e49
Reviewed-on: https://go-review.googlesource.com/c/go/+/400375
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Keith Randall <khr@google.com>
2022-04-14 21:16:29 +00:00
Wayne Zuo
66f03f79da cmd/compile: add SHLX&SHRX without load
Change-Id: I79eb5e7d6bcb23f26d3a100e915efff6dae70391
Reviewed-on: https://go-review.googlesource.com/c/go/+/399061
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
2022-04-13 17:48:36 +00:00
Wayne Zuo
517781b391 cmd/compile: add SARXQload and SARXLload
Change-Id: I4e8dc5009a5b8af37d71b62f1322f94806d3e9d9
Reviewed-on: https://go-review.googlesource.com/c/go/+/399056
Run-TryBot: Wayne Zuo <wdvxdr@golangcn.org>
Reviewed-by: Keith Randall <khr@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
2022-04-13 17:48:12 +00:00
Wayne Zuo
d6320f1a58 cmd/compile: add SARX instruction for GOAMD64>=3
name                    old time/op  new time/op  delta
ShiftArithmeticRight-8  0.68ns ± 5%  0.30ns ± 6%  -56.14%  (p=0.000 n=10+10)

Change-Id: I052a0d7b9e6526d526276444e588b0cc288beff4
Reviewed-on: https://go-review.googlesource.com/c/go/+/399055
Run-TryBot: Wayne Zuo <wdvxdr@golangcn.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
Auto-Submit: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
2022-04-12 12:59:27 +00:00
Wayne Zuo
32de2b0d1c cmd/compile: add MOVBE index load/store
Fixes #51724

Change-Id: I94e650a7482dc4c479d597f0162a6a89d779708d
Reviewed-on: https://go-review.googlesource.com/c/go/+/395474
Reviewed-by: Keith Randall <khr@golang.org>
Trust: Cherry Mui <cherryyz@google.com>
2022-04-11 15:41:56 +00:00
Wayne Zuo
615d3c3040 test: adjust load and store test
In the load tests, we only want to test the assembly produced by
the load operations. If we use the global variable sink, it will produce
one load operation and one store operation(assign to sink).

For example:

func load_be64(b []byte) uint64 {
	sink64 = binary.BigEndian.Uint64(b)
}

If we compile this function with GOAMD64=v3, it may produce MOVBEQload
and MOVQstore or MOVQload and MOVBEQstore, but we only want MOVBEQload.
Discovered when developing CL 395474.

Same for the store tests.

Change-Id: I65c3c742f1eff657c3a0d2dd103f51140ae8079e
Reviewed-on: https://go-review.googlesource.com/c/go/+/397875
Reviewed-by: Keith Randall <khr@golang.org>
Trust: Cherry Mui <cherryyz@google.com>
2022-04-11 15:41:04 +00:00
Wayne Zuo
7fbabe8d57 cmd/compile: use shlx&shrx instruction for GOAMD64>=v3
The SHRX/SHLX instruction can take any general register as the shift count operand, and can read source from memory. This CL introduces some operators to combine load and shift to one instruction.

For #47120

Change-Id: I13b48f53c7d30067a72eb2c8382242045dead36a
Reviewed-on: https://go-review.googlesource.com/c/go/+/385174
Reviewed-by: Keith Randall <khr@golang.org>
Trust: Cherry Mui <cherryyz@google.com>
2022-04-04 20:07:23 +00:00
Wayne Zuo
a92ca51507 cmd/compile: use LZCNT instruction for GOAMD64>=3
LZCNT is similar to BSR, but BSR(x) is undefined when x == 0, so using
LZCNT can avoid a special case for zero input. Except that case,
LZCNTQ(x) == 63-BSRQ(x) and LZCNTL(x) == 31-BSRL(x).

And according to https://www.agner.org/optimize/instruction_tables.pdf,
LZCNT instructions are much faster than BSR on AMD CPU.

name              old time/op  new time/op  delta
LeadingZeros-8    0.91ns ± 1%  0.80ns ± 7%  -11.68%  (p=0.000 n=9+9)
LeadingZeros8-8   0.98ns ±15%  0.91ns ± 1%   -7.34%  (p=0.000 n=9+9)
LeadingZeros16-8  0.94ns ± 3%  0.92ns ± 2%   -2.36%  (p=0.001 n=10+10)
LeadingZeros32-8  0.89ns ± 1%  0.78ns ± 2%  -12.49%  (p=0.000 n=10+10)
LeadingZeros64-8  0.92ns ± 1%  0.78ns ± 1%  -14.48%  (p=0.000 n=10+10)

Change-Id: I125147fe3d6994a4cfe558432780408e9a27557a
Reviewed-on: https://go-review.googlesource.com/c/go/+/396794
Reviewed-by: Keith Randall <khr@golang.org>
Trust: Emmanuel Odeke <emmanuel@orijtech.com>
Run-TryBot: Emmanuel Odeke <emmanuel@orijtech.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
2022-04-04 04:01:17 +00:00
Wayne Zuo
ba6df85c7c cmd/compile: add MOVBEWstore support for GOAMD64>=3
This CL add MOVBE support for 16-bit version, but MOVBEWload is
excluded because it does not satisfy zero extented.

For #51724

Change-Id: I3fadf20bcbb9b423f6355e6a1e340107e8e621ac
Reviewed-on: https://go-review.googlesource.com/c/go/+/396617
Reviewed-by: Keith Randall <khr@golang.org>
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Trust: Emmanuel Odeke <emmanuel@orijtech.com>
2022-04-03 23:48:16 +00:00
fanzha02
8ab42a945a cmd/compile: merge ANDconst and UBFX into UBFX on arm64
Add a new rewrite rule to merge ANDconst and UBFX into
UBFX.

Add test cases.

Change-Id: I24d6442d0c956d7ce092c3a3858d4a3a41771670
Reviewed-on: https://go-review.googlesource.com/c/go/+/377054
Trust: Fannie Zhang <Fannie.Zhang@arm.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Run-TryBot: Cherry Mui <cherryyz@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
2022-03-25 01:24:44 +00:00