1
0
mirror of https://github.com/golang/go synced 2024-09-28 21:14:28 -06:00
Commit Graph

69 Commits

Author SHA1 Message Date
Wayne Zuo
cedfcba3e8 cmd/compile: instrinsify TrailingZeros{8,32,64} for 386
This CL add support for instrinsifying the TrialingZeros{8,32,64}
functions for 386 architecture. We need handle the case when the input
is 0, which could lead to undefined output from the BSFL instruction.

Next CL will remove the assembly code in runtime/internal/sys package.

Change-Id: Ic168edf68e81bf69a536102100fdd3f56f0f4a1b
Reviewed-on: https://go-review.googlesource.com/c/go/+/475735
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Run-TryBot: Wayne Zuo <wdvxdr@golangcn.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
2023-03-14 08:10:32 +00:00
Wayne Zuo
d7ac5d1480 cmd/compile: intrinsify math/bits/ReverseBytes{32|64} for 386
The BSWAPL instruction is supported in i486 and newer.
https://github.com/golang/go/wiki/MinimumRequirements#386 says we
support "All Pentium MMX or later". The Pentium is also referred to as
i586, so that we are safe with these instructions.

Change-Id: I6dea1f9d864a45bb07c8f8f35a81cfe16cca216c
Reviewed-on: https://go-review.googlesource.com/c/go/+/465515
Run-TryBot: Wayne Zuo <wdvxdr@golangcn.org>
Reviewed-by: Keith Randall <khr@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Matthew Dempsky <mdempsky@google.com>
Reviewed-by: Keith Randall <khr@google.com>
2023-02-08 03:43:23 +00:00
Archana R
cd1fc87156 cmd/compile: intrinsify math/bits/ReverseBytes{16|32|64} for ppc64/power10
This change intrinsifies ReverseBytes{16|32|64} by generating the
corresponding new instructions in Power10: brh, brd and brw and
adds a verification test for the same.
On Power 9 and 8, the .go code performs optimally as it is.

Performance improvement seen on Power10:
ReverseBytes32  1.38ns ± 0%  1.18ns ± 0%  -14.2
ReverseBytes64  1.52ns ± 0%  1.11ns ± 0%  -26.87
ReverseBytes16  1.41ns ± 1%  1.18ns ± 0%  -16.47

Change-Id: I88f127f3ab9ba24a772becc21ad90acfba324b37
Reviewed-on: https://go-review.googlesource.com/c/go/+/446675
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Run-TryBot: Lynn Boger <laboger@linux.vnet.ibm.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
2023-02-03 19:01:06 +00:00
Paul E. Murphy
1540531746 test/codegen: merge identical ppc64 and ppc64le tests
Manually consolidate the remaining ppc64/ppc64le test which
are not so trivial to automatically merge.

The remaining ppc64le tests are limited to cases where load/stores are
merged (this only happens on ppc64le) and the race detector (only
supported on ppc64le).

Change-Id: I1f9c0f3d3ddbb7fbbd8c81fbbd6537394fba63ce
Reviewed-on: https://go-review.googlesource.com/c/go/+/463217
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Run-TryBot: Paul Murphy <murp@ibm.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
2023-01-27 19:03:02 +00:00
Paul E. Murphy
0301c6c351 test/codegen: combine trivial PPC64 tests into ppc64x
Use a small python script to consolidate duplicate
ppc64/ppc64le tests into a single ppc64x codegen test.

This makes small assumption that anytime two tests with
for different arch/variant combos exists, those tests
can be combined into a single ppc64x test.

E.x:

  // ppc64le: foo
  // ppc64le/power9: foo
into
  // ppc64x: foo

or

  // ppc64: foo
  // ppc64le: foo
into
  // ppc64x: foo

import glob
import re
files = glob.glob("codegen/*.go")
for file in files:
    with open(file) as f:
        text = [l for l in f]
    i = 0
    while i < len(text):
        first = re.match("\s*// ?ppc64(le)?(/power[89])?:(.*)", text[i])
        if first:
            j = i+1
            while j < len(text):
                second = re.match("\s*// ?ppc64(le)?(/power[89])?:(.*)", text[j])
                if not second:
                    break
                if (not first.group(2) or first.group(2) == second.group(2)) and first.group(3) == second.group(3):
                    text[i] = re.sub(" ?ppc64(le|x)?"," ppc64x",text[i])
                    text=text[:j] + (text[j+1:])
                else:
                    j += 1
        i+=1
    with open(file, 'w') as f:
        f.write("".join(text))

Change-Id: Ic6b009b54eacaadc5a23db9c5a3bf7331b595821
Reviewed-on: https://go-review.googlesource.com/c/go/+/463220
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
Reviewed-by: Bryan Mills <bcmills@google.com>
Run-TryBot: Paul Murphy <murp@ibm.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
2023-01-27 18:24:12 +00:00
Paul E. Murphy
d031e9e07a cmd/compile/internal/ssa: re-adjust CarryChainTail scheduling priority
This needs to be as low as possible while not breaking priority
assumptions of other scores to correctly schedule carry chains.

Prior to the arm64 changes, it was set below ReadTuple. At the time,
this prevented the MulHiLo implementation on PPC64 from occluding
the scheduling of a full carry chain.

Memory scores can also prevent better scheduling, as can be observed
with crypto/internal/edwards25519/field.feMulGeneric.

Fixes #56497

Change-Id: Ia4b54e6dffcce584faf46b1b8d7cea18a3913887
Reviewed-on: https://go-review.googlesource.com/c/go/+/447435
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Keith Randall <khr@google.com>
Run-TryBot: Paul Murphy <murp@ibm.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Bryan Mills <bcmills@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
2022-11-03 19:59:19 +00:00
Wayne Zuo
90a3527427 cmd/compile: intrinsify Sub64 on loong64
This is a follow up of CL 420095  on loong64.

file                                    before    after     Δ       %
compile/internal/ssa.a                  35649482  35653274  +3792   +0.011%
compile/internal/ssagen.a               4099858   4098728   -1130   -0.028%
ecdh.a                                  227896    226896    -1000   -0.439%
internal/nistec/fiat.a                  1212254   1128184   -84070  -6.935%
tls.a                                   3256800   3256802   +2      +0.000%
big.a                                   1708518   1702496   -6022   -0.352%
bits.a                                  106762    105734    -1028   -0.963%
math.a                                  578762    577288    -1474   -0.255%
netip.a                                 555922    555610    -312    -0.056%
net.a                                   3286528   3286530   +2      +0.000%
golang.org/x/crypto/internal/poly1305.a 109546    107686    -1860   -1.698%
total                                   260392768 260299668 -93100  -0.036%

Change-Id: Ieffca705aae5666501f284502d986ca179dde494
Reviewed-on: https://go-review.googlesource.com/c/go/+/428557
Reviewed-by: Carlos Amedee <carlos@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>
Run-TryBot: Wayne Zuo <wdvxdr@golangcn.org>
2022-10-07 18:16:26 +00:00
Wayne Zuo
97760ed651 cmd/compile: intrinsify Add64 on loong64
This is a follow up of CL 420094  on loong64.

Reduce go toolchain size slightly on linux/loong64.

compilecmp HEAD~1 -> HEAD
HEAD~1 (8a32354219): internal/trace: use strings.Builder
HEAD (1767784ac3): cmd/compile: intrinsify Add64 on loong64
platform: linux/loong64

file      before    after     Δ       %
addr2line 3882616   3882536   -80     -0.002%
api       5528866   5528450   -416    -0.008%
asm       5133780   5133796   +16     +0.000%
cgo       4668787   4668491   -296    -0.006%
compile   25163409  25164729  +1320   +0.005%
cover     4658055   4658007   -48     -0.001%
dist      3437783   3437727   -56     -0.002%
doc       3883069   3883205   +136    +0.004%
fix       3383254   3383070   -184    -0.005%
link      6747559   6747023   -536    -0.008%
nm        3793923   3793939   +16     +0.000%
objdump   4256628   4256812   +184    +0.004%
pack      2356328   2356144   -184    -0.008%
pprof     14233370  14131910  -101460 -0.713%
test2json 2638668   2638476   -192    -0.007%
trace     13392065  13360781  -31284  -0.234%
vet       7456388   7455588   -800    -0.011%
total     132498256 132364392 -133864 -0.101%

file                                    before    after     Δ       %
compile/internal/ssa.a                  35644590  35649482  +4892   +0.014%
compile/internal/ssagen.a               4101250   4099858   -1392   -0.034%
internal/edwards25519/field.a           226064    201718    -24346  -10.770%
internal/nistec/fiat.a                  1689922   1212254   -477668 -28.266%
tls.a                                   3256798   3256800   +2      +0.000%
big.a                                   1718552   1708518   -10034  -0.584%
bits.a                                  107786    106762    -1024   -0.950%
cmplx.a                                 169434    168214    -1220   -0.720%
math.a                                  581302    578762    -2540   -0.437%
netip.a                                 556096    555922    -174    -0.031%
net.a                                   3286526   3286528   +2      +0.000%
runtime.a                               8644786   8644510   -276    -0.003%
strconv.a                               519098    518374    -724    -0.139%
golang.org/x/crypto/internal/poly1305.a 115398    109546    -5852   -5.071%
total                                   260913122 260392768 -520354 -0.199%

Change-Id: I75b2bb7761fa5a0d0d032d4ebe3582d092ea77be
Reviewed-on: https://go-review.googlesource.com/c/go/+/428556
Reviewed-by: Carlos Amedee <carlos@golang.org>
Run-TryBot: Wayne Zuo <wdvxdr@golangcn.org>
Reviewed-by: David Chase <drchase@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
2022-10-07 18:16:10 +00:00
ruinan
121344ac33 cmd/compile: optimize RotateLeft8/16 on arm64
This CL optimizes RotateLeft8/16 on arm64.

For 16 bits, we form a 32 bits register by duplicating two 16 bits
registers, then use RORW instruction to do the rotate shift.

For 8 bits, we just use LSR and LSL instead of RORW because the code is
simpler.

Benchmark          Old          ThisCL       delta
RotateLeft8-46     2.16 ns/op   1.73 ns/op   -19.70%
RotateLeft16-46    2.16 ns/op   1.54 ns/op   -28.53%

Change-Id: I09cde4383d12e31876a57f8cdfd3bb4f324fadb0
Reviewed-on: https://go-review.googlesource.com/c/go/+/420976
Reviewed-by: Keith Randall <khr@google.com>
Auto-Submit: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Heschi Kreinick <heschi@google.com>
Run-TryBot: Keith Randall <khr@golang.org>
2022-09-02 17:46:31 +00:00
Wayne Zuo
a6219737e3 cmd/compile: intrinsify Sub64 on riscv64
After this CL, the performance difference in crypto/elliptic
benchmarks on linux/riscv64 are:

name                 old time/op    new time/op    delta
ScalarBaseMult/P256    1.64ms ± 1%    1.60ms ± 1%   -2.36%  (p=0.008 n=5+5)
ScalarBaseMult/P224    1.53ms ± 1%    1.47ms ± 2%   -4.24%  (p=0.008 n=5+5)
ScalarBaseMult/P384    5.12ms ± 2%    5.03ms ± 2%     ~     (p=0.095 n=5+5)
ScalarBaseMult/P521    22.3ms ± 2%    13.8ms ± 1%  -37.89%  (p=0.008 n=5+5)
ScalarMult/P256        4.49ms ± 2%    4.26ms ± 2%   -5.13%  (p=0.008 n=5+5)
ScalarMult/P224        4.33ms ± 1%    4.09ms ± 1%   -5.59%  (p=0.008 n=5+5)
ScalarMult/P384        16.3ms ± 1%    15.5ms ± 2%   -4.78%  (p=0.008 n=5+5)
ScalarMult/P521         101ms ± 0%      47ms ± 2%  -53.36%  (p=0.008 n=5+5)

Change-Id: I31cf0506e27f9d85f576af1813630a19c20dda8a
Reviewed-on: https://go-review.googlesource.com/c/go/+/420095
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Joel Sing <joel@sing.id.au>
Reviewed-by: David Chase <drchase@google.com>
Run-TryBot: Wayne Zuo <wdvxdr@golangcn.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
2022-08-27 05:43:59 +00:00
Wayne Zuo
969f48a3a2 cmd/compile: intrinsify Add64 on riscv64
According to RISCV instruction set manual v2.2 Sec 2.4, we can
implement overflowing check for unsigned addition cheaply using
SLTU instructions.

After this CL, the performance difference in crypto/elliptic
benchmarks on linux/riscv64 are:

name                 old time/op    new time/op    delta
ScalarBaseMult/P256    1.93ms ± 1%    1.64ms ± 1%  -14.96%  (p=0.008 n=5+5)
ScalarBaseMult/P224    1.80ms ± 2%    1.53ms ± 1%  -14.89%  (p=0.008 n=5+5)
ScalarBaseMult/P384    6.15ms ± 2%    5.12ms ± 2%  -16.73%  (p=0.008 n=5+5)
ScalarBaseMult/P521    25.9ms ± 1%    22.3ms ± 2%  -13.78%  (p=0.008 n=5+5)
ScalarMult/P256        5.59ms ± 1%    4.49ms ± 2%  -19.79%  (p=0.008 n=5+5)
ScalarMult/P224        5.42ms ± 1%    4.33ms ± 1%  -20.01%  (p=0.008 n=5+5)
ScalarMult/P384        19.9ms ± 2%    16.3ms ± 1%  -18.15%  (p=0.008 n=5+5)
ScalarMult/P521        97.3ms ± 1%   100.7ms ± 0%   +3.48%  (p=0.008 n=5+5)

Change-Id: Ic4c82ced4b072a4a6575343fa9f29dd09b0cabc4
Reviewed-on: https://go-review.googlesource.com/c/go/+/420094
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Run-TryBot: Wayne Zuo <wdvxdr@golangcn.org>
Reviewed-by: Joel Sing <joel@sing.id.au>
TryBot-Result: Gopher Robot <gobot@golang.org>
2022-08-27 05:43:32 +00:00
Wayne Zuo
b60432df14 cmd/compile: deadcode for LoweredMuluhilo on riscv64
This is a follow up of CL 425101 on RISCV64.

According to RISCV Volume 1, Unprivileged Spec v. 20191213 Chapter 7.1:
If both the high and low bits of the same product are required, then the
recommended code sequence is: MULH[[S]U] rdh, rs1, rs2; MUL rdl, rs1, rs2
(source register specifiers must be in same order and rdh cannot be the
same as rs1 or rs2). Microarchitectures can then fuse these into a single
multiply operation instead of performing two separate multiplies.

So we should not split Muluhilo to separate instructions.

Updates #54607

Change-Id: If47461f3aaaf00e27cd583a9990e144fb8bcdb17
Reviewed-on: https://go-review.googlesource.com/c/go/+/425203
Auto-Submit: Keith Randall <khr@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Wayne Zuo <wdvxdr@golangcn.org>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
2022-08-24 18:08:33 +00:00
Cherry Mui
8bf9e01473 cmd/compile: split Muluhilo op on ARM64
On ARM64 we use two separate instructions to compute the hi and lo
results of a 64x64->128 multiplication. Lower to two separate ops
so if only one result is needed we can deadcode the other.

Fixes #54607.

Change-Id: Ib023e77eb2b2b0bcf467b45471cb8a294bce6f90
Reviewed-on: https://go-review.googlesource.com/c/go/+/425101
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
2022-08-22 21:29:31 +00:00
Archana R
d09c6ac417 test/codegen: updated multiple tests to verify on ppc64,ppc64le
Updated multiple tests in test/codegen: math.go, mathbits.go, shift.go
and slices.go to verify on ppc64/ppc64le as well

Change-Id: Id88dd41569b7097819fb4d451b615f69cf7f7a94
Reviewed-on: https://go-review.googlesource.com/c/go/+/412115
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Archana Ravindar <aravind5@in.ibm.com>
Reviewed-by: Than McIntosh <thanm@google.com>
Reviewed-by: Paul Murphy <murp@ibm.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
2022-08-17 13:56:55 +00:00
Paul E. Murphy
0c43878baa cmd/compile: lower Add64/Sub64 into ssa on PPC64
math/bits.Add64 and math/bits.Sub64 now lower and optimize
directly in SSA form.

The optimization of carry chains focuses around eliding
XER<->GPR transfers of the CA bit when used exclusively as an
input to a single carry operations, or when the CA value is
known.

This also adds support for handling XER spills in the assembler
which could happen if carry chains contain inter-dependencies
on each other (which seems very unlikely with practical usage),
or a clobber happens (SRAW/SRAD/SUBFC operations clobber CA).

With PPC64 Add64/Sub64 lowering into SSA and this patch, the net
performance difference in crypto/elliptic benchmarks on P9/ppc64le
are:

name                                old time/op    new time/op    delta
ScalarBaseMult/P256                   46.3µs ± 0%    46.9µs ± 0%   +1.34%
ScalarBaseMult/P224                    356µs ± 0%     209µs ± 0%  -41.14%
ScalarBaseMult/P384                   1.20ms ± 0%    0.57ms ± 0%  -52.14%
ScalarBaseMult/P521                   3.38ms ± 0%    1.44ms ± 0%  -57.27%
ScalarMult/P256                        199µs ± 0%     199µs ± 0%   -0.17%
ScalarMult/P224                        357µs ± 0%     212µs ± 0%  -40.56%
ScalarMult/P384                       1.20ms ± 0%    0.58ms ± 0%  -51.86%
ScalarMult/P521                       3.37ms ± 0%    1.44ms ± 0%  -57.32%
MarshalUnmarshal/P256/Uncompressed    2.59µs ± 0%    2.52µs ± 0%   -2.63%
MarshalUnmarshal/P256/Compressed      2.58µs ± 0%    2.52µs ± 0%   -2.06%
MarshalUnmarshal/P224/Uncompressed    1.54µs ± 0%    1.40µs ± 0%   -9.42%
MarshalUnmarshal/P224/Compressed      1.54µs ± 0%    1.39µs ± 0%   -9.87%
MarshalUnmarshal/P384/Uncompressed    2.40µs ± 0%    1.80µs ± 0%  -24.93%
MarshalUnmarshal/P384/Compressed      2.35µs ± 0%    1.81µs ± 0%  -23.03%
MarshalUnmarshal/P521/Uncompressed    3.79µs ± 0%    2.58µs ± 0%  -31.81%
MarshalUnmarshal/P521/Compressed      3.80µs ± 0%    2.60µs ± 0%  -31.67%

Note, P256 uses an asm implementation, thus, little variation is expected.

Change-Id: I88a24f6bf0f4f285c649e40243b1ab69cc452b71
Reviewed-on: https://go-review.googlesource.com/c/go/+/346870
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Run-TryBot: Paul Murphy <murp@ibm.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@google.com>
2022-05-10 20:03:53 +00:00
Wayne Zuo
a92ca51507 cmd/compile: use LZCNT instruction for GOAMD64>=3
LZCNT is similar to BSR, but BSR(x) is undefined when x == 0, so using
LZCNT can avoid a special case for zero input. Except that case,
LZCNTQ(x) == 63-BSRQ(x) and LZCNTL(x) == 31-BSRL(x).

And according to https://www.agner.org/optimize/instruction_tables.pdf,
LZCNT instructions are much faster than BSR on AMD CPU.

name              old time/op  new time/op  delta
LeadingZeros-8    0.91ns ± 1%  0.80ns ± 7%  -11.68%  (p=0.000 n=9+9)
LeadingZeros8-8   0.98ns ±15%  0.91ns ± 1%   -7.34%  (p=0.000 n=9+9)
LeadingZeros16-8  0.94ns ± 3%  0.92ns ± 2%   -2.36%  (p=0.001 n=10+10)
LeadingZeros32-8  0.89ns ± 1%  0.78ns ± 2%  -12.49%  (p=0.000 n=10+10)
LeadingZeros64-8  0.92ns ± 1%  0.78ns ± 1%  -14.48%  (p=0.000 n=10+10)

Change-Id: I125147fe3d6994a4cfe558432780408e9a27557a
Reviewed-on: https://go-review.googlesource.com/c/go/+/396794
Reviewed-by: Keith Randall <khr@golang.org>
Trust: Emmanuel Odeke <emmanuel@orijtech.com>
Run-TryBot: Emmanuel Odeke <emmanuel@orijtech.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
2022-04-04 04:01:17 +00:00
Lynn Boger
33b3260c1e cmd/compile/internal/ssagen: set BitLen32 as intrinsic on PPC64
It was noticed through some other investigation that BitLen32
was not generating the best code and found that it wasn't recognized
as an intrinsic. This corrects that and enables the test for PPC64.

Change-Id: Iab496a8830c8552f507b7292649b1b660f3848b5
Reviewed-on: https://go-review.googlesource.com/c/go/+/355872
Run-TryBot: Lynn Boger <laboger@linux.vnet.ibm.com>
TryBot-Result: Go Bot <gobot@golang.org>
Trust: Lynn Boger <laboger@linux.vnet.ibm.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
2021-10-18 21:33:08 +00:00
nimelehin
097a82f54d cmd/compile: don't emit unnecessary amd64 extension checks
In case of amd64 the compiler issues checks if extensions are
available on a platform. With GOAMD64 microarchitecture levels
provided, some of the checks could be eliminated.

Change-Id: If15c178bcae273b2ce7d3673415cb8849292e087
Reviewed-on: https://go-review.googlesource.com/c/go/+/352010
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Matthew Dempsky <mdempsky@google.com>
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Go Bot <gobot@golang.org>
2021-10-05 18:32:12 +00:00
wdvxdr
060cd73ab9 cmd/compile: use TZCNT instruction for GOAMD64>=v3
on my Intel CoffeeLake CPU:
name               old time/op  new time/op  delta
TrailingZeros-8    0.68ns ± 1%  0.64ns ± 1%  -6.26%  (p=0.000 n=10+10)
TrailingZeros8-8   0.70ns ± 1%  0.70ns ± 1%    ~     (p=0.697 n=10+10)
TrailingZeros16-8  0.70ns ± 1%  0.70ns ± 1%  +0.57%  (p=0.043 n=10+10)
TrailingZeros32-8  0.66ns ± 1%  0.64ns ± 1%  -3.35%  (p=0.000 n=10+10)
TrailingZeros64-8  0.68ns ± 1%  0.64ns ± 1%  -5.84%  (p=0.000 n=9+10)

Updates #45453

Change-Id: I228ff2d51df24b1306136f061432f8a12bb1d6fd
Reviewed-on: https://go-review.googlesource.com/c/go/+/353249
Trust: Michael Knyszek <mknyszek@google.com>
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
2021-10-05 16:06:49 +00:00
Meng Zhuo
efd206eb40 cmd/compile: intrinsify Mul64 on riscv64
According to RISCV instruction set manual v2.2 Sec 6.1
MULHU followed by MUL will be fused into one multiply by microarchitecture

Benchstat on Hifive unmatched:
name          old time/op    new time/op    delta
Hash8Bytes       245ns ± 3%     186ns ± 4%  -23.99%  (p=0.000 n=10+10)
Hash320Bytes    1.94µs ± 1%    1.31µs ± 1%  -32.38%  (p=0.000 n=9+10)
Hash1K          5.84µs ± 0%    3.84µs ± 0%  -34.20%  (p=0.000 n=10+9)
Hash8K          45.3µs ± 0%    29.4µs ± 0%  -35.04%  (p=0.000 n=10+10)

name          old speed      new speed      delta
Hash8Bytes    32.7MB/s ± 3%  43.0MB/s ± 4%  +31.61%  (p=0.000 n=10+10)
Hash320Bytes   165MB/s ± 1%   244MB/s ± 1%  +47.88%  (p=0.000 n=9+10)
Hash1K         175MB/s ± 0%   266MB/s ± 0%  +51.98%  (p=0.000 n=10+9)
Hash8K         181MB/s ± 0%   279MB/s ± 0%  +53.94%  (p=0.000 n=10+10)

Change-Id: I3561495d02a4a0ad8578e9b9819bf0a4eaca5d12
Reviewed-on: https://go-review.googlesource.com/c/go/+/329970
Reviewed-by: Joel Sing <joel@sing.id.au>
Run-TryBot: Joel Sing <joel@sing.id.au>
TryBot-Result: Go Bot <gobot@golang.org>
Trust: Meng Zhuo <mzh@golangcn.org>
2021-08-16 13:50:11 +00:00
David Chase
4df3d0e4df cmd/compile: rescue stmt boundaries from OpArgXXXReg and OpSelectN.
Fixes this failure:
go test cmd/compile/internal/ssa -run TestStmtLines -v
=== RUN   TestStmtLines
    stmtlines_test.go:115: Saw too many (amd64, > 1%) lines without
    statement marks, total=88263, nostmt=1930
    ('-run TestStmtLines -v' lists failing lines)

The failure has two causes.

One is that the first-line adjuster in code generation was relocating
"first lines" to instructions that would either not have any code generated,
or would have the statment marker removed by a different believed-good heuristic.

The other was that statement boundaries were getting attached to register
values (that with the old ABI were loads from the stack, hence real instructions).
The register values disappear at code generation.

The fixes are to (1) note that certain instructions are not good choices for
"first value" and skip them, and (2) in an expandCalls post-pass, look for
register valued instructions and under appropriate conditions move their
statement marker to a compatible use.

Also updates TestStmtLines to always log the score, for easier comparison of
minor compiler changes.

Updates #40724.

Change-Id: I485573ce900e292d7c44574adb7629cdb4695c3f
Reviewed-on: https://go-review.googlesource.com/c/go/+/309649
Trust: David Chase <drchase@google.com>
Run-TryBot: David Chase <drchase@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2021-04-14 18:46:08 +00:00
Michael Munday
854e892ce1 cmd/compile: optimize shift pairs and masks on s390x
Optimize combinations of left and right shifts by a constant value
into a 'rotate then insert selected bits [into zero]' instruction.
Use the same instruction for contiguous masks since it has some
benefits over 'and immediate' (not restricted to 32-bits, does not
overwrite source register).

To keep the complexity of this change under control I've only
implemented 64 bit operations for now.

There are a lot more optimizations that can be done with this
instruction family. However, since their function overlaps with other
instructions we need to be somewhat careful not to break existing
optimization rules by creating optimization dead ends. This is
particularly true of the load/store merging rules which contain lots
of zero extensions and shifts.

This CL does interfere with the store merging rules when an operand
is shifted left before it is stored:

  binary.BigEndian.PutUint64(b, x << 1)

This is unfortunate but it's not critical and somewhat complex so
I plan to fix that in a follow up CL.

file      before    after     Δ       %
addr2line 4117446   4117282   -164    -0.004%
api       4945184   4942752   -2432   -0.049%
asm       4998079   4991891   -6188   -0.124%
buildid   2685158   2684074   -1084   -0.040%
cgo       4553732   4553394   -338    -0.007%
compile   19294446  19245070  -49376  -0.256%
cover     4897105   4891319   -5786   -0.118%
dist      3544389   3542785   -1604   -0.045%
doc       3926795   3927617   +822    +0.021%
fix       3302958   3293868   -9090   -0.275%
link      6546274   6543456   -2818   -0.043%
nm        4102021   4100825   -1196   -0.029%
objdump   4542431   4548483   +6052   +0.133%
pack      2482465   2416389   -66076  -2.662%
pprof     13366541  13363915  -2626   -0.020%
test2json 2829007   2761515   -67492  -2.386%
trace     10216164  10219684  +3520   +0.034%
vet       6773956   6773572   -384    -0.006%
total     107124151 106917891 -206260 -0.193%

Change-Id: I7591cce41e06867ba10a745daae9333513062746
Reviewed-on: https://go-review.googlesource.com/c/go/+/233317
Run-TryBot: Michael Munday <mike.munday@ibm.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
Trust: Michael Munday <mike.munday@ibm.com>
2020-11-06 10:45:31 +00:00
Paul E. Murphy
7615b20d06 cmd/compile: generate subfic on ppc64
This merges an lis + subf into subfic, and for 32b constants
lwa + subf into oris + ori + subf.

The carry bit is no longer used in code generation, therefore
I think we can clobber it as needed.  Note, lowered borrow/carry
arithmetic is self-contained and thus is not affected.

A few extra rules are added to ensure early transformations to
SUBFCconst don't trip up earlier rules, fold constant operations,
or otherwise simplify lowering.  Likewise, tests are added to
ensure all rules are hit.  Generic constant folding catches
trivial cases, however some lowering rules insert arithmetic
which can introduce new opportunities (e.g BitLen or Slicemask).

I couldn't find a specific benchmark to demonstrate noteworthy
improvements, but this is generating subfic in many of the default
bent test binaries, so we are at least saving a little code space.

Change-Id: Iad7c6e5767eaa9dc24dc1c989bd1c8cfe1982012
Reviewed-on: https://go-review.googlesource.com/c/go/+/249461
Run-TryBot: Lynn Boger <laboger@linux.vnet.ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Carlos Eduardo Seo <cseo@linux.vnet.ibm.com>
2020-08-27 20:10:15 +00:00
Michael Munday
ab7a65f283 cmd/compile: clean up codegen for branch-on-carry on s390x
This CL optimizes code that uses a carry from a function such as
bits.Add64 as the condition in an if statement. For example:

    x, c := bits.Add64(a, b, 0)
    if c != 0 {
        panic("overflow")
    }

Rather than converting the carry into a 0 or a 1 value and using
that as an input to a comparison instruction the carry flag is now
used as the input to a conditional branch directly. This typically
removes an ADD LOGICAL WITH CARRY instruction when user code is
doing overflow detection and is closer to the code that a user
would expect to generate.

Change-Id: I950431270955ab72f1b5c6db873b6abe769be0da
Reviewed-on: https://go-review.googlesource.com/c/go/+/219757
Run-TryBot: Michael Munday <mike.munday@ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
2020-04-22 20:11:06 +00:00
Josh Bleecher Snyder
fff7509d47 cmd/compile: add intrinsic HasCPUFeature for checking cpu features
Before using some CPU instructions, we must check for their presence.
We use global variables in the runtime package to record features.

Prior to this CL, we issued a regular memory load for these features.
The downside to this is that, because it is a regular memory load,
it cannot be hoisted out of loops or otherwise reordered with other loads.

This CL introduces a new intrinsic just for checking cpu features.
It still ends up resulting in a memory load, but that memory load can
now be floated to the entry block and rematerialized as needed.

One downside is that the regular load could be combined with the comparison
into a CMPBconstload+NE. This new intrinsic cannot; it generates MOVB+TESTB+NE.
(It is possible that MOVBQZX+TESTQ+NE would be better.)

This CL does only amd64. It is easy to extend to other architectures.

For the benchmark in #36196, on my machine, this offers a mild speedup.

name      old time/op  new time/op  delta
FMA-8     1.39ns ± 6%  1.29ns ± 9%  -7.19%  (p=0.000 n=97+96)
NonFMA-8  2.03ns ±11%  2.04ns ±12%    ~     (p=0.618 n=99+98)

Updates #15808
Updates #36196

Change-Id: I75e2fcfcf5a6df1bdb80657a7143bed69fca6deb
Reviewed-on: https://go-review.googlesource.com/c/go/+/212360
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Giovanni Bajo <rasky@develer.com>
2020-04-04 01:01:04 +00:00
Meng Zhuo
50f1157760 cmd/compile: add math/bits.Mul64 intrinsic on mips64x
Benchmark:
name   old time/op  new time/op  delta
Mul    36.0ns ± 1%   2.8ns ± 0%  -92.31%  (p=0.000 n=10+10)
Mul32  4.37ns ± 0%  4.37ns ± 0%     ~     (p=0.429 n=6+10)
Mul64  36.4ns ± 0%   2.8ns ± 0%  -92.37%  (p=0.000 n=10+9)

Change-Id: Ic4f4e5958adbf24999abcee721d0180b5413fca7
Reviewed-on: https://go-review.googlesource.com/c/go/+/200582
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2019-10-14 21:23:34 +00:00
Ruixin Bao
98aa97806b cmd/compile: add math/bits.Mul64 intrinsic on s390x
This change adds an intrinsic for Mul64 on s390x. To achieve that,
a new assembly instruction, MLGR, is introduced in s390x/asmz.go. This assembly
instruction directly uses an existing instruction on Z and supports multiplication
of two 64 bit unsigned integer and stores the result in two separate registers.

In this case, we require the multiplcand to be stored in register R3 and
the output result (the high and low 64 bit of the product) to be stored in
R2 and R3 respectively.

A test case is also added.

Benchmark:
name      old time/op  new time/op  delta
Mul-18    11.1ns ± 0%   1.4ns ± 0%  -87.39%  (p=0.002 n=8+10)
Mul32-18  2.07ns ± 0%  2.07ns ± 0%     ~     (all equal)
Mul64-18  11.1ns ± 1%   1.4ns ± 0%  -87.42%  (p=0.000 n=10+10)

Change-Id: Ieca6ad1f61fff9a48a31d50bbd3f3c6d9e6675c1
Reviewed-on: https://go-review.googlesource.com/c/go/+/194572
Reviewed-by: Michael Munday <mike.munday@ibm.com>
Run-TryBot: Michael Munday <mike.munday@ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2019-09-13 09:04:48 +00:00
Brian Kessler
b003afe4fe cmd/compile: intrinsify RotateLeft32 on wasm
wasm has 32-bit versions of all integer operations. This change
lowers RotateLeft32 to i32.rotl on wasm and intrinsifies the math/bits
call.  Benchmarking on amd64 under node.js this is ~25% faster.

node v10.15.3/amd64
name          old time/op  new time/op  delta
RotateLeft    8.37ns ± 1%  8.28ns ± 0%   -1.05%  (p=0.029 n=4+4)
RotateLeft8   11.9ns ± 1%  11.8ns ± 0%     ~     (p=0.167 n=5+5)
RotateLeft16  11.8ns ± 0%  11.8ns ± 0%     ~     (all equal)
RotateLeft32  11.9ns ± 1%   8.7ns ± 0%  -26.32%  (p=0.008 n=5+5)
RotateLeft64  8.31ns ± 1%  8.43ns ± 2%     ~     (p=0.063 n=5+5)

Updates #31265

Change-Id: I5b8e155978faeea536c4f6427ac9564d2f096a46
Reviewed-on: https://go-review.googlesource.com/c/go/+/182359
Run-TryBot: Brian Kessler <brian.m.kessler@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Richard Musiol <neelance@gmail.com>
2019-08-31 17:03:04 +00:00
Ben Shi
8d5197d818 cmd/compile: optimize 386's math.bits.TrailingZeros16
This CL reverts CL 192097 and fixes the issue in CL 189277.

Change-Id: Icd271262e1f5019a8e01c91f91c12c1261eeb02b
Reviewed-on: https://go-review.googlesource.com/c/go/+/192519
Run-TryBot: Ben Shi <powerman1st@163.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
2019-08-30 17:37:00 +00:00
Cherry Zhang
9859f6bedb test/codegen: fix ARM32 RotateLeft32 test
The syntax of a shifted operation does not have a "$" sign for
the shift amount. Remove it.

Change-Id: I50782fe942b640076f48c2fafea4d3175be8ff99
Reviewed-on: https://go-review.googlesource.com/c/go/+/192100
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
2019-08-28 20:42:48 +00:00
Ben Shi
3cfd003a8a cmd/compile: optimize ARM's math.bits.RotateLeft32
This CL optimizes math.bits.RotateLeft32 to inline
"MOVW Rx@>Ry, Rd" on ARM.

The benchmark results of math/bits show some improvements.
name               old time/op  new time/op  delta
RotateLeft-4       9.42ns ± 0%  6.91ns ± 0%  -26.66%  (p=0.000 n=40+33)
RotateLeft8-4      8.79ns ± 0%  8.79ns ± 0%   -0.04%  (p=0.000 n=40+31)
RotateLeft16-4     8.79ns ± 0%  8.79ns ± 0%   -0.04%  (p=0.000 n=40+32)
RotateLeft32-4     8.16ns ± 0%  7.54ns ± 0%   -7.68%  (p=0.000 n=40+40)
RotateLeft64-4     15.7ns ± 0%  15.7ns ± 0%     ~     (all equal)

updates #31265

Change-Id: I77bc1c2c702d5323fc7cad5264a8e2d5666bf712
Reviewed-on: https://go-review.googlesource.com/c/go/+/188697
Run-TryBot: Ben Shi <powerman1st@163.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2019-08-28 15:41:58 +00:00
Bryan C. Mills
372b0eed17 Revert "cmd/compile: optimize 386's math.bits.TrailingZeros16"
This reverts CL 189277.

Reason for revert: broke 32-bit builders.

Updates #33902

Change-Id: Ie5f180d0371a90e5057ed578c334372e5fc3a286
Reviewed-on: https://go-review.googlesource.com/c/go/+/192097
Run-TryBot: Bryan C. Mills <bcmills@google.com>
Reviewed-by: Daniel Martí <mvdan@mvdan.cc>
2019-08-28 12:57:59 +00:00
Ben Shi
22355d6cd2 cmd/compile: optimize 386's math.bits.TrailingZeros16
This CL optimizes math.bits.TrailingZeros16 on 386 with
a pair of BSFL and ORL instrcutions.

The case TrailingZeros16-4 of the benchmark test in
math/bits shows big improvement.
name               old time/op  new time/op  delta
TrailingZeros16-4  1.55ns ± 1%  0.87ns ± 1%  -43.87%  (p=0.000 n=50+49)

Change-Id: Ia899975b0e46f45dcd20223b713ed632bc32740b
Reviewed-on: https://go-review.googlesource.com/c/go/+/189277
Run-TryBot: Ben Shi <powerman1st@163.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
2019-08-28 02:29:54 +00:00
Michael Munday
2c1b5130aa cmd/compile: add math/bits.{Add,Sub}64 intrinsics on s390x
This CL adds intrinsics for the 64-bit addition and subtraction
functions in math/bits. These intrinsics use the condition code
to propagate the carry or borrow bit.

To make the carry chains more efficient I've removed the
'clobberFlags' property from most of the load and store
operations. Originally these ops did clobber flags when using
offsets that didn't fit in a signed 20-bit integer, however
that is no longer true.

As with other platforms the intrinsics are faster when executed
in a chain rather than a loop because currently we need to spill
and restore the carry bit between each loop iteration. We may
be able to reduce the need to do this on s390x (e.g. by using
compare-and-branch instructions that do not clobber flags) in the
future.

name           old time/op  new time/op  delta
Add64          1.21ns ± 2%  2.03ns ± 2%  +67.18%  (p=0.000 n=7+10)
Add64multiple  2.98ns ± 3%  1.03ns ± 0%  -65.39%  (p=0.000 n=10+9)
Sub64          1.23ns ± 4%  2.03ns ± 1%  +64.85%  (p=0.000 n=10+10)
Sub64multiple  3.73ns ± 4%  1.04ns ± 1%  -72.28%  (p=0.000 n=10+8)

Change-Id: I913bbd5e19e6b95bef52f5bc4f14d6fe40119083
Reviewed-on: https://go-review.googlesource.com/c/go/+/174303
Run-TryBot: Michael Munday <mike.munday@ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2019-05-03 10:41:15 +00:00
Carlos Eduardo Seo
50ad09418e cmd/compile: intrinsify math/bits.Add64 for ppc64x
This change creates an intrinsic for Add64 for ppc64x and adds a
testcase for it.

name               old time/op  new time/op  delta
Add64-160          1.90ns ±40%  2.29ns ± 0%     ~     (p=0.119 n=5+5)
Add64multiple-160  6.69ns ± 2%  2.45ns ± 4%  -63.47%  (p=0.016 n=4+5)

Change-Id: I9abe6fb023fdf62eea3c9b46a1820f60bb0a7f97
Reviewed-on: https://go-review.googlesource.com/c/go/+/173758
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
Run-TryBot: Carlos Eduardo Seo <cseo@linux.vnet.ibm.com>
2019-04-28 23:51:04 +00:00
erifan01
f8f265b9cf cmd/compile: intrinsify math/bits.Sub64 for arm64
This CL instrinsifies Sub64 with arm64 instruction sequence NEGS, SBCS,
NGC and NEG, and optimzes the case of borrowing chains.

Benchmarks:
name              old time/op       new time/op       delta
Sub-64            2.500000ns +- 0%  2.048000ns +- 1%  -18.08%  (p=0.000 n=10+10)
Sub32-64          2.500000ns +- 0%  2.500000ns +- 0%     ~     (all equal)
Sub64-64          2.500000ns +- 0%  2.080000ns +- 0%  -16.80%  (p=0.000 n=10+7)
Sub64multiple-64  7.090000ns +- 0%  2.090000ns +- 0%  -70.52%  (p=0.000 n=10+10)

Change-Id: I3d2664e009a9635e13b55d2c4567c7b34c2c0655
Reviewed-on: https://go-review.googlesource.com/c/go/+/159018
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2019-04-22 14:40:20 +00:00
Josh Bleecher Snyder
68d4b1265e cmd/compile: reduce bits.Div64(0, lo, y) to 64 bit division
With this change, these two functions generate identical code:

func f(x uint64) (uint64, uint64) {
	return bits.Div64(0, x, 5)
}

func g(x uint64) (uint64, uint64) {
	return x / 5, x % 5
}

Updates #31582

Change-Id: Ia96c2e67f8af5dd985823afee5f155608c04a4b6
Reviewed-on: https://go-review.googlesource.com/c/go/+/173197
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2019-04-20 19:34:03 +00:00
erifan01
d0cbf9bf53 cmd/compile: follow up intrinsifying math/bits.Add64 for arm64
This CL deals with the additional comments of CL 159017.

Change-Id: I4ad3c60c834646d58dc0c544c741b92bfe83fb8b
Reviewed-on: https://go-review.googlesource.com/c/go/+/168857
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2019-03-22 15:09:47 +00:00
Carlos Eduardo Seo
3023d7da49 cmd/compile/internal, cmd/internal/obj/ppc64: generate new count trailing zeros instructions on POWER9
This change adds new POWER9 instructions for counting trailing zeros (CNTTZW/CNTTZD)
to the assembler and generates them in SSA when GOPPC64=power9.

name                 old time/op  new time/op  delta
TrailingZeros-160    1.59ns ±20%  1.45ns ±10%  -8.81%  (p=0.000 n=14+13)
TrailingZeros8-160   1.55ns ±23%  1.62ns ±44%    ~     (p=0.593 n=13+15)
TrailingZeros16-160  1.78ns ±23%  1.62ns ±38%  -9.31%  (p=0.003 n=14+14)
TrailingZeros32-160  1.64ns ±10%  1.49ns ± 9%  -9.15%  (p=0.000 n=13+14)
TrailingZeros64-160  1.53ns ± 6%  1.45ns ± 5%  -5.38%  (p=0.000 n=15+13)

Change-Id: I365e6ff79f3ce4d8ebe089a6a86b1771853eb596
Reviewed-on: https://go-review.googlesource.com/c/go/+/167517
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
2019-03-20 20:27:00 +00:00
erifan01
5714c91b53 cmd/compile: intrinsify math/bits.Add64 for arm64
This CL instrinsifies Add64 with arm64 instruction sequence ADDS, ADCS
and ADC, and optimzes the case of carry chains.The CL also changes the
test code so that the intrinsic implementation can be tested.

Benchmarks:
name               old time/op       new time/op       delta
Add-224            2.500000ns +- 0%  2.090000ns +- 4%  -16.40%  (p=0.000 n=9+10)
Add32-224          2.500000ns +- 0%  2.500000ns +- 0%     ~     (all equal)
Add64-224          2.500000ns +- 0%  1.577778ns +- 2%  -36.89%  (p=0.000 n=10+9)
Add64multiple-224  6.000000ns +- 0%  2.000000ns +- 0%  -66.67%  (p=0.000 n=10+10)

Change-Id: I6ee91c9a85c16cc72ade5fd94868c579f16c7615
Reviewed-on: https://go-review.googlesource.com/c/go/+/159017
Run-TryBot: Ben Shi <powerman1st@163.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2019-03-20 05:39:49 +00:00
Tobias Klauser
156c830bea cmd/compile: eliminate unnecessary type conversions in TrailingZeros(16|8) for arm
This follows CL 156999 which did the same for arm64.

name               old time/op  new time/op  delta
TrailingZeros-4    7.30ns ± 1%  7.30ns ± 0%     ~     (p=0.413 n=9+9)
TrailingZeros8-4   8.32ns ± 0%  7.17ns ± 0%  -13.77%  (p=0.000 n=10+9)
TrailingZeros16-4  8.30ns ± 0%  7.18ns ± 0%  -13.50%  (p=0.000 n=9+10)
TrailingZeros32-4  6.46ns ± 1%  6.47ns ± 1%     ~     (p=0.325 n=10+10)
TrailingZeros64-4  16.3ns ± 0%  16.2ns ± 0%   -0.61%  (p=0.000 n=7+10)

Change-Id: I7e9e1abf7e30d811aa474d272b2824ec7cbbaa98
Reviewed-on: https://go-review.googlesource.com/c/go/+/167797
Run-TryBot: Tobias Klauser <tobias.klauser@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2019-03-15 18:37:22 +00:00
Richard Musiol
5ee1b84959 math, math/bits: add intrinsics for wasm
This commit adds compiler intrinsics for the packages math and
math/bits on the wasm architecture for better performance.

benchmark                        old ns/op     new ns/op     delta
BenchmarkCeil                    8.31          3.21          -61.37%
BenchmarkCopysign                5.24          3.88          -25.95%
BenchmarkAbs                     5.42          3.34          -38.38%
BenchmarkFloor                   8.29          3.18          -61.64%
BenchmarkRoundToEven             9.76          3.26          -66.60%
BenchmarkSqrtLatency             8.13          4.88          -39.98%
BenchmarkSqrtPrime               5246          3535          -32.62%
BenchmarkTrunc                   8.29          3.15          -62.00%
BenchmarkLeadingZeros            13.0          4.23          -67.46%
BenchmarkLeadingZeros8           4.65          4.42          -4.95%
BenchmarkLeadingZeros16          7.60          4.38          -42.37%
BenchmarkLeadingZeros32          10.7          4.48          -58.13%
BenchmarkLeadingZeros64          12.9          4.31          -66.59%
BenchmarkTrailingZeros           6.52          4.04          -38.04%
BenchmarkTrailingZeros8          4.57          4.14          -9.41%
BenchmarkTrailingZeros16         6.69          4.16          -37.82%
BenchmarkTrailingZeros32         6.97          4.23          -39.31%
BenchmarkTrailingZeros64         6.59          4.00          -39.30%
BenchmarkOnesCount               7.93          3.30          -58.39%
BenchmarkOnesCount8              3.56          3.19          -10.39%
BenchmarkOnesCount16             4.85          3.19          -34.23%
BenchmarkOnesCount32             7.27          3.19          -56.12%
BenchmarkOnesCount64             8.08          3.28          -59.41%
BenchmarkRotateLeft              4.88          3.80          -22.13%
BenchmarkRotateLeft64            5.03          3.63          -27.83%

Change-Id: Ic1e0c2984878be8defb6eb7eb6ee63765c793222
Reviewed-on: https://go-review.googlesource.com/c/go/+/165177
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2019-03-14 19:46:19 +00:00
erifan01
4e2b0dda8c cmd/compile: eliminate unnecessary type conversions in TrailingZeros(16|8) for arm64
This CL eliminates unnecessary type conversion operations: OpZeroExt16to64 and OpZeroExt8to64.
If the input argrument is a nonzero value, then ORconst operation can also be eliminated.

Benchmarks:

name               old time/op  new time/op  delta
TrailingZeros-8    2.75ns ± 0%  2.75ns ± 0%     ~     (all equal)
TrailingZeros8-8   3.49ns ± 1%  2.93ns ± 0%  -16.00%  (p=0.000 n=10+10)
TrailingZeros16-8  3.49ns ± 1%  2.93ns ± 0%  -16.05%  (p=0.000 n=9+10)
TrailingZeros32-8  2.67ns ± 1%  2.68ns ± 1%     ~     (p=0.468 n=10+10)
TrailingZeros64-8  2.67ns ± 1%  2.65ns ± 0%   -0.62%  (p=0.022 n=10+9)

code:

func f16(x uint) { z = bits.TrailingZeros16(uint16(x)) }

Before:

"".f16 STEXT size=48 args=0x8 locals=0x0 leaf
        0x0000 00000 (test.go:7)        TEXT    "".f16(SB), LEAF|NOFRAME|ABIInternal, $0-8
        0x0000 00000 (test.go:7)        FUNCDATA        ZR, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
        0x0000 00000 (test.go:7)        FUNCDATA        $1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
        0x0000 00000 (test.go:7)        FUNCDATA        $3, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
        0x0000 00000 (test.go:7)        PCDATA  $2, ZR
        0x0000 00000 (test.go:7)        PCDATA  ZR, ZR
        0x0000 00000 (test.go:7)        MOVD    "".x(FP), R0
        0x0004 00004 (test.go:7)        MOVHU   R0, R0
        0x0008 00008 (test.go:7)        ORR     $65536, R0, R0
        0x000c 00012 (test.go:7)        RBIT    R0, R0
        0x0010 00016 (test.go:7)        CLZ     R0, R0
        0x0014 00020 (test.go:7)        MOVD    R0, "".z(SB)
        0x0020 00032 (test.go:7)        RET     (R30)

This line of code is unnecessary:
        0x0004 00004 (test.go:7)        MOVHU   R0, R0

After:

"".f16 STEXT size=32 args=0x8 locals=0x0 leaf
        0x0000 00000 (test.go:7)        TEXT    "".f16(SB), LEAF|NOFRAME|ABIInternal, $0-8
        0x0000 00000 (test.go:7)        FUNCDATA        ZR, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
        0x0000 00000 (test.go:7)        FUNCDATA        $1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
        0x0000 00000 (test.go:7)        FUNCDATA        $3, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
        0x0000 00000 (test.go:7)        PCDATA  $2, ZR
        0x0000 00000 (test.go:7)        PCDATA  ZR, ZR
        0x0000 00000 (test.go:7)        MOVD    "".x(FP), R0
        0x0004 00004 (test.go:7)        ORR     $65536, R0, R0
        0x0008 00008 (test.go:7)        RBITW   R0, R0
        0x000c 00012 (test.go:7)        CLZW    R0, R0
        0x0010 00016 (test.go:7)        MOVD    R0, "".z(SB)
        0x001c 00028 (test.go:7)        RET     (R30)

The situation of TrailingZeros8 is similar to TrailingZeros16.

Change-Id: I473bdca06be8460a0be87abbae6fe640017e4c9d
Reviewed-on: https://go-review.googlesource.com/c/go/+/156999
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2019-03-07 14:24:56 +00:00
erifan01
fee84cc905 cmd/compile: add an optimization rule for math/bits.ReverseBytes16 on arm
This CL adds two rules to turn patterns like ((x<<8) | (x>>8)) (the type of
x is uint16, "|" can also be "+" or "^") to a REV16 instruction on arm v6+.
This optimization rule can be used for math/bits.ReverseBytes16.

Benchmarks on arm v6:
name               old time/op  new time/op  delta
ReverseBytes-32    2.86ns ± 0%  2.86ns ± 0%   ~     (all equal)
ReverseBytes16-32  2.86ns ± 0%  2.86ns ± 0%   ~     (all equal)
ReverseBytes32-32  1.29ns ± 0%  1.29ns ± 0%   ~     (all equal)
ReverseBytes64-32  1.43ns ± 0%  1.43ns ± 0%   ~     (all equal)

Change-Id: I819e633c9a9d308f8e476fb0c82d73fb73dd019f
Reviewed-on: https://go-review.googlesource.com/c/go/+/159019
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2019-03-07 13:37:54 +00:00
erifan01
159b2de442 cmd/compile: optimize math/bits.Div32 for arm64
Benchmark:
name     old time/op  new time/op  delta
Div-8    22.0ns ± 0%  22.0ns ± 0%     ~     (all equal)
Div32-8  6.51ns ± 0%  3.00ns ± 0%  -53.90%  (p=0.000 n=10+8)
Div64-8  22.5ns ± 0%  22.5ns ± 0%     ~     (all equal)

Code:
func div32(hi, lo, y uint32) (q, r uint32) {return bits.Div32(hi, lo, y)}

Before:
        0x0020 00032 (test.go:24)       MOVWU   "".y+8(FP), R0
        0x0024 00036 ($GOROOT/src/math/bits/bits.go:472)        CBZW    R0, 132
        0x0028 00040 ($GOROOT/src/math/bits/bits.go:472)        MOVWU   "".hi(FP), R1
        0x002c 00044 ($GOROOT/src/math/bits/bits.go:472)        CMPW    R1, R0
        0x0030 00048 ($GOROOT/src/math/bits/bits.go:472)        BLS     96
        0x0034 00052 ($GOROOT/src/math/bits/bits.go:475)        MOVWU   "".lo+4(FP), R2
        0x0038 00056 ($GOROOT/src/math/bits/bits.go:475)        ORR     R1<<32, R2, R1
        0x003c 00060 ($GOROOT/src/math/bits/bits.go:476)        CBZ     R0, 140
        0x0040 00064 ($GOROOT/src/math/bits/bits.go:476)        UDIV    R0, R1, R2
        0x0044 00068 (test.go:24)       MOVW    R2, "".q+16(FP)
        0x0048 00072 ($GOROOT/src/math/bits/bits.go:476)        UREM    R0, R1, R0
        0x0050 00080 (test.go:24)       MOVW    R0, "".r+20(FP)
        0x0054 00084 (test.go:24)       MOVD    -8(RSP), R29
        0x0058 00088 (test.go:24)       MOVD.P  32(RSP), R30
        0x005c 00092 (test.go:24)       RET     (R30)

After:
        0x001c 00028 (test.go:24)       MOVWU   "".y+8(FP), R0
        0x0020 00032 (test.go:24)       CBZW    R0, 92
        0x0024 00036 (test.go:24)       MOVWU   "".hi(FP), R1
        0x0028 00040 (test.go:24)       CMPW    R0, R1
        0x002c 00044 (test.go:24)       BHS     84
        0x0030 00048 (test.go:24)       MOVWU   "".lo+4(FP), R2
        0x0034 00052 (test.go:24)       ORR     R1<<32, R2, R4
        0x0038 00056 (test.go:24)       UDIV    R0, R4, R3
        0x003c 00060 (test.go:24)       MSUB    R3, R4, R0, R4
        0x0040 00064 (test.go:24)       MOVW    R3, "".q+16(FP)
        0x0044 00068 (test.go:24)       MOVW    R4, "".r+20(FP)
        0x0048 00072 (test.go:24)       MOVD    -8(RSP), R29
        0x004c 00076 (test.go:24)       MOVD.P  16(RSP), R30
        0x0050 00080 (test.go:24)       RET     (R30)

UREM instruction in the previous assembly code will be converted to UDIV and MSUB instructions
on arm64. However the UDIV instruction in UREM is unnecessary, because it's a duplicate of the
previous UDIV. This CL adds a rule to have this extra UDIV instruction removed by CSE.

Change-Id: Ie2508784320020b2de022806d09f75a7871bb3d7
Reviewed-on: https://go-review.googlesource.com/c/159577
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Run-TryBot: Bryan C. Mills <bcmills@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2019-03-03 20:20:10 +00:00
erifan01
192b675f17 cmd/compile: add an optimaztion rule for math/bits.ReverseBytes16 on arm64
On amd64 ReverseBytes16 is lowered to a rotate instruction. However arm64 doesn't
have 16-bit rotate instruction, but has a REV16W instruction which can be used
for ReverseBytes16. This CL adds a rule to turn the patterns like (x<<8) | (x>>8)
(the type of x is uint16, and "|" can also be "^" or "+") to a REV16W instruction.

Code:
func reverseBytes16(i uint16) uint16 { return bits.ReverseBytes16(i) }

Before:
        0x0004 00004 (test.go:6)        MOVHU   "".i(FP), R0
        0x0008 00008 ($GOROOT/src/math/bits/bits.go:262)        UBFX    $8, R0, $8, R1
        0x000c 00012 ($GOROOT/src/math/bits/bits.go:262)        ORR     R0<<8, R1, R0
        0x0010 00016 (test.go:6)        MOVH    R0, "".~r1+8(FP)
        0x0014 00020 (test.go:6)        RET     (R30)

After:
        0x0000 00000 (test.go:6)        MOVHU   "".i(FP), R0
        0x0004 00004 (test.go:6)        REV16W  R0, R0
        0x0008 00008 (test.go:6)        MOVH    R0, "".~r1+8(FP)
        0x000c 00012 (test.go:6)        RET     (R30)

Benchmarks:
name                old time/op       new time/op       delta
ReverseBytes-224    1.000000ns +- 0%  1.000000ns +- 0%     ~     (all equal)
ReverseBytes16-224  1.500000ns +- 0%  1.000000ns +- 0%  -33.33%  (p=0.000 n=9+10)
ReverseBytes32-224  1.000000ns +- 0%  1.000000ns +- 0%     ~     (all equal)
ReverseBytes64-224  1.000000ns +- 0%  1.000000ns +- 0%     ~     (all equal)

Change-Id: I87cd41b2d8e549bf39c601f185d5775bd42d739c
Reviewed-on: https://go-review.googlesource.com/c/157757
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2019-03-01 15:42:19 +00:00
erifan01
dd91269b7c cmd/compile: optimize math/bits Len32 intrinsic on arm64
Arm64 has a 32-bit CLZ instruction CLZW, which can be used for intrinsic Len32.
Function LeadingZeros32 calls Len32, with this change, the assembly code of
LeadingZeros32 becomes more concise.

Go code:

func f32(x uint32) { z = bits.LeadingZeros32(x) }

Before:

"".f32 STEXT size=32 args=0x8 locals=0x0 leaf
        0x0000 00000 (test.go:7)        TEXT    "".f32(SB), LEAF|NOFRAME|ABIInternal, $0-8
        0x0004 00004 (test.go:7)        MOVWU   "".x(FP), R0
        0x0008 00008 ($GOROOT/src/math/bits/bits.go:30) CLZ     R0, R0
        0x000c 00012 ($GOROOT/src/math/bits/bits.go:30) SUB     $32, R0, R0
        0x0010 00016 (test.go:7)        MOVD    R0, "".z(SB)
        0x001c 00028 (test.go:7)        RET     (R30)

After:

"".f32 STEXT size=32 args=0x8 locals=0x0 leaf
        0x0000 00000 (test.go:7)        TEXT    "".f32(SB), LEAF|NOFRAME|ABIInternal, $0-8
        0x0004 00004 (test.go:7)        MOVWU   "".x(FP), R0
        0x0008 00008 ($GOROOT/src/math/bits/bits.go:30) CLZW    R0, R0
        0x000c 00012 (test.go:7)        MOVD    R0, "".z(SB)
        0x0018 00024 (test.go:7)        RET     (R30)

Benchmarks:
name              old time/op  new time/op  delta
LeadingZeros-8    2.53ns ± 0%  2.55ns ± 0%   +0.67%  (p=0.000 n=10+10)
LeadingZeros8-8   3.56ns ± 0%  3.56ns ± 0%     ~     (all equal)
LeadingZeros16-8  3.55ns ± 0%  3.56ns ± 0%     ~     (p=0.465 n=10+10)
LeadingZeros32-8  3.55ns ± 0%  2.96ns ± 0%  -16.71%  (p=0.000 n=10+7)
LeadingZeros64-8  2.53ns ± 0%  2.54ns ± 0%     ~     (p=0.059 n=8+10)

Change-Id: Ie5666bb82909e341060e02ffd4e86c0e5d67e90a
Reviewed-on: https://go-review.googlesource.com/c/157000
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2019-02-27 16:09:33 +00:00
Brian Kessler
319787a528 cmd/compile: intrinsify math/bits.Div on amd64
Note that the intrinsic implementation panics separately for overflow and
divide by zero, which matches the behavior of the pure go implementation.
There is a modest performance improvement after intrinsic implementation.

name     old time/op  new time/op  delta
Div-4    53.0ns ± 1%  47.0ns ± 0%  -11.28%  (p=0.008 n=5+5)
Div32-4  18.4ns ± 0%  18.5ns ± 1%     ~     (p=0.444 n=5+5)
Div64-4  53.3ns ± 0%  47.5ns ± 4%  -10.77%  (p=0.008 n=5+5)

Updates #28273

Change-Id: Ic1688ecc0964acace2e91bf44ef16f5fb6b6bc82
Reviewed-on: https://go-review.googlesource.com/c/144378
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
2018-11-27 05:04:25 +00:00
Martin Möhrmann
75798e8ada runtime: make processor capability variable naming platform specific
The current support_XXX variables are specific for the
amd64 and 386 platforms.

Prefix processor capability variables by architecture to have a
consistent naming scheme and avoid reuse of the existing
variables for new platforms.

This also aligns naming of runtime variables closer with internal/cpu
processor capability variable names.

Change-Id: I3eabb29a03874678851376185d3a62e73c1aff1d
Reviewed-on: https://go-review.googlesource.com/c/91435
Run-TryBot: Martin Möhrmann <martisch@uos.de>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
2018-11-14 20:30:31 +00:00
Keith Randall
dd789550a7 cmd/compile: intrinsify math/bits.Sub on amd64
name             old time/op  new time/op  delta
Sub-8            1.12ns ± 1%  1.17ns ± 1%   +5.20%          (p=0.008 n=5+5)
Sub32-8          1.11ns ± 0%  1.11ns ± 0%     ~     (all samples are equal)
Sub64-8          1.12ns ± 0%  1.18ns ± 1%   +5.00%          (p=0.016 n=4+5)
Sub64multiple-8  4.10ns ± 1%  0.86ns ± 1%  -78.93%          (p=0.008 n=5+5)

Fixes #28273

Change-Id: Ibcb6f2fd32d987c3bcbae4f4cd9d335a3de98548
Reviewed-on: https://go-review.googlesource.com/c/144258
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2018-10-25 19:47:27 +00:00