Before using some CPU instructions, we must check for their presence.
We use global variables in the runtime package to record features.
Prior to this CL, we issued a regular memory load for these features.
The downside to this is that, because it is a regular memory load,
it cannot be hoisted out of loops or otherwise reordered with other loads.
This CL introduces a new intrinsic just for checking cpu features.
It still ends up resulting in a memory load, but that memory load can
now be floated to the entry block and rematerialized as needed.
One downside is that the regular load could be combined with the comparison
into a CMPBconstload+NE. This new intrinsic cannot; it generates MOVB+TESTB+NE.
(It is possible that MOVBQZX+TESTQ+NE would be better.)
This CL does only amd64. It is easy to extend to other architectures.
For the benchmark in #36196, on my machine, this offers a mild speedup.
name old time/op new time/op delta
FMA-8 1.39ns ± 6% 1.29ns ± 9% -7.19% (p=0.000 n=97+96)
NonFMA-8 2.03ns ±11% 2.04ns ±12% ~ (p=0.618 n=99+98)
Updates #15808
Updates #36196
Change-Id: I75e2fcfcf5a6df1bdb80657a7143bed69fca6deb
Reviewed-on: https://go-review.googlesource.com/c/go/+/212360
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Giovanni Bajo <rasky@develer.com>
This change adds an intrinsic for Mul64 on s390x. To achieve that,
a new assembly instruction, MLGR, is introduced in s390x/asmz.go. This assembly
instruction directly uses an existing instruction on Z and supports multiplication
of two 64 bit unsigned integer and stores the result in two separate registers.
In this case, we require the multiplcand to be stored in register R3 and
the output result (the high and low 64 bit of the product) to be stored in
R2 and R3 respectively.
A test case is also added.
Benchmark:
name old time/op new time/op delta
Mul-18 11.1ns ± 0% 1.4ns ± 0% -87.39% (p=0.002 n=8+10)
Mul32-18 2.07ns ± 0% 2.07ns ± 0% ~ (all equal)
Mul64-18 11.1ns ± 1% 1.4ns ± 0% -87.42% (p=0.000 n=10+10)
Change-Id: Ieca6ad1f61fff9a48a31d50bbd3f3c6d9e6675c1
Reviewed-on: https://go-review.googlesource.com/c/go/+/194572
Reviewed-by: Michael Munday <mike.munday@ibm.com>
Run-TryBot: Michael Munday <mike.munday@ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
This CL reverts CL 192097 and fixes the issue in CL 189277.
Change-Id: Icd271262e1f5019a8e01c91f91c12c1261eeb02b
Reviewed-on: https://go-review.googlesource.com/c/go/+/192519
Run-TryBot: Ben Shi <powerman1st@163.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
The syntax of a shifted operation does not have a "$" sign for
the shift amount. Remove it.
Change-Id: I50782fe942b640076f48c2fafea4d3175be8ff99
Reviewed-on: https://go-review.googlesource.com/c/go/+/192100
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
This CL optimizes math.bits.TrailingZeros16 on 386 with
a pair of BSFL and ORL instrcutions.
The case TrailingZeros16-4 of the benchmark test in
math/bits shows big improvement.
name old time/op new time/op delta
TrailingZeros16-4 1.55ns ± 1% 0.87ns ± 1% -43.87% (p=0.000 n=50+49)
Change-Id: Ia899975b0e46f45dcd20223b713ed632bc32740b
Reviewed-on: https://go-review.googlesource.com/c/go/+/189277
Run-TryBot: Ben Shi <powerman1st@163.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
This CL adds intrinsics for the 64-bit addition and subtraction
functions in math/bits. These intrinsics use the condition code
to propagate the carry or borrow bit.
To make the carry chains more efficient I've removed the
'clobberFlags' property from most of the load and store
operations. Originally these ops did clobber flags when using
offsets that didn't fit in a signed 20-bit integer, however
that is no longer true.
As with other platforms the intrinsics are faster when executed
in a chain rather than a loop because currently we need to spill
and restore the carry bit between each loop iteration. We may
be able to reduce the need to do this on s390x (e.g. by using
compare-and-branch instructions that do not clobber flags) in the
future.
name old time/op new time/op delta
Add64 1.21ns ± 2% 2.03ns ± 2% +67.18% (p=0.000 n=7+10)
Add64multiple 2.98ns ± 3% 1.03ns ± 0% -65.39% (p=0.000 n=10+9)
Sub64 1.23ns ± 4% 2.03ns ± 1% +64.85% (p=0.000 n=10+10)
Sub64multiple 3.73ns ± 4% 1.04ns ± 1% -72.28% (p=0.000 n=10+8)
Change-Id: I913bbd5e19e6b95bef52f5bc4f14d6fe40119083
Reviewed-on: https://go-review.googlesource.com/c/go/+/174303
Run-TryBot: Michael Munday <mike.munday@ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
This change creates an intrinsic for Add64 for ppc64x and adds a
testcase for it.
name old time/op new time/op delta
Add64-160 1.90ns ±40% 2.29ns ± 0% ~ (p=0.119 n=5+5)
Add64multiple-160 6.69ns ± 2% 2.45ns ± 4% -63.47% (p=0.016 n=4+5)
Change-Id: I9abe6fb023fdf62eea3c9b46a1820f60bb0a7f97
Reviewed-on: https://go-review.googlesource.com/c/go/+/173758
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
Run-TryBot: Carlos Eduardo Seo <cseo@linux.vnet.ibm.com>
This CL instrinsifies Add64 with arm64 instruction sequence ADDS, ADCS
and ADC, and optimzes the case of carry chains.The CL also changes the
test code so that the intrinsic implementation can be tested.
Benchmarks:
name old time/op new time/op delta
Add-224 2.500000ns +- 0% 2.090000ns +- 4% -16.40% (p=0.000 n=9+10)
Add32-224 2.500000ns +- 0% 2.500000ns +- 0% ~ (all equal)
Add64-224 2.500000ns +- 0% 1.577778ns +- 2% -36.89% (p=0.000 n=10+9)
Add64multiple-224 6.000000ns +- 0% 2.000000ns +- 0% -66.67% (p=0.000 n=10+10)
Change-Id: I6ee91c9a85c16cc72ade5fd94868c579f16c7615
Reviewed-on: https://go-review.googlesource.com/c/go/+/159017
Run-TryBot: Ben Shi <powerman1st@163.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
This CL adds two rules to turn patterns like ((x<<8) | (x>>8)) (the type of
x is uint16, "|" can also be "+" or "^") to a REV16 instruction on arm v6+.
This optimization rule can be used for math/bits.ReverseBytes16.
Benchmarks on arm v6:
name old time/op new time/op delta
ReverseBytes-32 2.86ns ± 0% 2.86ns ± 0% ~ (all equal)
ReverseBytes16-32 2.86ns ± 0% 2.86ns ± 0% ~ (all equal)
ReverseBytes32-32 1.29ns ± 0% 1.29ns ± 0% ~ (all equal)
ReverseBytes64-32 1.43ns ± 0% 1.43ns ± 0% ~ (all equal)
Change-Id: I819e633c9a9d308f8e476fb0c82d73fb73dd019f
Reviewed-on: https://go-review.googlesource.com/c/go/+/159019
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Note that the intrinsic implementation panics separately for overflow and
divide by zero, which matches the behavior of the pure go implementation.
There is a modest performance improvement after intrinsic implementation.
name old time/op new time/op delta
Div-4 53.0ns ± 1% 47.0ns ± 0% -11.28% (p=0.008 n=5+5)
Div32-4 18.4ns ± 0% 18.5ns ± 1% ~ (p=0.444 n=5+5)
Div64-4 53.3ns ± 0% 47.5ns ± 4% -10.77% (p=0.008 n=5+5)
Updates #28273
Change-Id: Ic1688ecc0964acace2e91bf44ef16f5fb6b6bc82
Reviewed-on: https://go-review.googlesource.com/c/144378
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
The current support_XXX variables are specific for the
amd64 and 386 platforms.
Prefix processor capability variables by architecture to have a
consistent naming scheme and avoid reuse of the existing
variables for new platforms.
This also aligns naming of runtime variables closer with internal/cpu
processor capability variable names.
Change-Id: I3eabb29a03874678851376185d3a62e73c1aff1d
Reviewed-on: https://go-review.googlesource.com/c/91435
Run-TryBot: Martin Möhrmann <martisch@uos.de>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
This change adds codegen tests for the intrinsification on ppc64 of
the OnesCount{64,32,16,8}, and TrailingZeros{64,32,16,8} math/bits
functions.
Change-Id: Id3364921fbd18316850e15c8c71330c906187fdb
Reviewed-on: https://go-review.googlesource.com/c/141897
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
Add some rules to match the Go code like:
y &= 63
x << y | x >> (64-y)
or
y &= 63
x >> y | x << (64-y)
as a ROR instruction. Make math/bits.RotateLeft faster on arm64.
Extends CL 132435 to arm64.
Benchmarks of math/bits.RotateLeftxxN:
name old time/op new time/op delta
RotateLeft-8 3.548750ns +- 1% 2.003750ns +- 0% -43.54% (p=0.000 n=8+8)
RotateLeft8-8 3.925000ns +- 0% 3.925000ns +- 0% ~ (p=1.000 n=8+8)
RotateLeft16-8 3.925000ns +- 0% 3.927500ns +- 0% ~ (p=0.608 n=8+8)
RotateLeft32-8 3.925000ns +- 0% 2.002500ns +- 0% -48.98% (p=0.000 n=8+8)
RotateLeft64-8 3.536250ns +- 0% 2.003750ns +- 0% -43.34% (p=0.000 n=8+8)
Change-Id: I77622cd7f39b917427e060647321f5513973232c
Reviewed-on: https://go-review.googlesource.com/122542
Run-TryBot: Ben Shi <powerman1st@163.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
This CL implements the math/bits.OnesCount{8,16,32,64} functions
as intrinsics on s390x using the 'population count' (popcnt)
instruction. This instruction was released as the 'population-count'
facility which uses the same facility bit (45) as the
'distinct-operands' facility which is a pre-requisite for Go on
s390x. We can therefore use it without a feature check.
The s390x popcnt instruction treats a 64 bit register as a vector
of 8 bytes, summing the number of ones in each byte individually.
It then writes the results to the corresponding bytes in the
output register. Therefore to implement OnesCount{16,32,64} we
need to sum the individual byte counts using some extra
instructions. To do this efficiently I've added some additional
pseudo operations to the s390x SSA backend.
Unlike other architectures the new instruction sequence is faster
for OnesCount8, so that is implemented using the intrinsic.
name old time/op new time/op delta
OnesCount 3.21ns ± 1% 1.35ns ± 0% -58.00% (p=0.000 n=20+20)
OnesCount8 0.91ns ± 1% 0.81ns ± 0% -11.43% (p=0.000 n=20+20)
OnesCount16 1.51ns ± 3% 1.21ns ± 0% -19.71% (p=0.000 n=20+17)
OnesCount32 1.91ns ± 0% 1.12ns ± 1% -41.60% (p=0.000 n=19+20)
OnesCount64 3.18ns ± 4% 1.35ns ± 0% -57.52% (p=0.000 n=20+20)
Change-Id: Id54f0bd28b6db9a887ad12c0d72fcc168ef9c4e0
Reviewed-on: https://go-review.googlesource.com/114675
Run-TryBot: Michael Munday <mike.munday@ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
On amd64, Ctz must include special handling of zeros.
But the prove pass has enough information to detect whether the input
is non-zero, allowing a more efficient lowering.
Introduce new CtzNonZero ops to capture and use this information.
Benchmark code:
func BenchmarkVisitBits(b *testing.B) {
b.Run("8", func(b *testing.B) {
for i := 0; i < b.N; i++ {
x := uint8(0xff)
for x != 0 {
sink = bits.TrailingZeros8(x)
x &= x - 1
}
}
})
// and similarly so for 16, 32, 64
}
name old time/op new time/op delta
VisitBits/8-8 7.27ns ± 4% 5.58ns ± 4% -23.35% (p=0.000 n=28+26)
VisitBits/16-8 14.7ns ± 7% 10.5ns ± 4% -28.43% (p=0.000 n=30+28)
VisitBits/32-8 27.6ns ± 8% 19.3ns ± 3% -30.14% (p=0.000 n=30+26)
VisitBits/64-8 44.0ns ±11% 38.0ns ± 5% -13.48% (p=0.000 n=30+30)
Fixes#25077
Change-Id: Ie6e5bd86baf39ee8a4ca7cadcf56d934e047f957
Reviewed-on: https://go-review.googlesource.com/109358
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
The previous change sped up the pure computation form of LeadingZeros8.
This places it somewhat close to the table lookup form.
Depending on something that varies from toolchain to toolchain
(alignment, perhaps?), the slowdown from ditching the table lookup
is either 20% or 5%.
This benchmark is the best case scenario for the table lookup:
It is in the L1 cache already.
I think we're close enough that we can switch to the computational version,
and trust that the memory effects and binary size savings will be worth it.
Code:
func f8(x uint8) { z = bits.LeadingZeros8(x) }
Before:
"".f8 STEXT nosplit size=34 args=0x8 locals=0x0
0x0000 00000 (x.go:7) TEXT "".f8(SB), NOSPLIT, $0-8
0x0000 00000 (x.go:7) FUNCDATA $0, gclocals·2a5305abe05176240e61b8620e19a815(SB)
0x0000 00000 (x.go:7) FUNCDATA $1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
0x0000 00000 (x.go:7) MOVBLZX "".x+8(SP), AX
0x0005 00005 (x.go:7) MOVBLZX AL, AX
0x0008 00008 (x.go:7) LEAQ math/bits.len8tab(SB), CX
0x000f 00015 (x.go:7) MOVBLZX (CX)(AX*1), AX
0x0013 00019 (x.go:7) ADDQ $-8, AX
0x0017 00023 (x.go:7) NEGQ AX
0x001a 00026 (x.go:7) MOVQ AX, "".z(SB)
0x0021 00033 (x.go:7) RET
After:
"".f8 STEXT nosplit size=30 args=0x8 locals=0x0
0x0000 00000 (x.go:7) TEXT "".f8(SB), NOSPLIT, $0-8
0x0000 00000 (x.go:7) FUNCDATA $0, gclocals·2a5305abe05176240e61b8620e19a815(SB)
0x0000 00000 (x.go:7) FUNCDATA $1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
0x0000 00000 (x.go:7) MOVBLZX "".x+8(SP), AX
0x0005 00005 (x.go:7) MOVBLZX AL, AX
0x0008 00008 (x.go:7) LEAL 1(AX)(AX*1), AX
0x000c 00012 (x.go:7) BSRL AX, AX
0x000f 00015 (x.go:7) ADDQ $-8, AX
0x0013 00019 (x.go:7) NEGQ AX
0x0016 00022 (x.go:7) MOVQ AX, "".z(SB)
0x001d 00029 (x.go:7) RET
Change-Id: Icc7db50a7820fb9a3da8a816d6b6940d7f8e193e
Reviewed-on: https://go-review.googlesource.com/108942
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
Only RotateLeft{64,32} were tested, and just for ppc64. This CL adds
tests for RotateLeft{64,32,16,8} on arm64 and amd64/386, for the cases
where the calls are actually instrinsified.
RotateLeft tests (the last ones for math/bits functions) are deleted
from asm_test.
This CL also adds a space between the "//" and the arch name in the
comments, to uniform this file to the style used in all the other
files.
Change-Id: Ifc2a27261d70bcc294b4ec64490d8367f62d2b89
Reviewed-on: https://go-review.googlesource.com/99596
Reviewed-by: Giovanni Bajo <rasky@develer.com>
And remove them from ssa_test.
Change-Id: If767af662801219774d1bdb787c77edfa6067770
Reviewed-on: https://go-review.googlesource.com/98976
Run-TryBot: Alberto Donizetti <alb.donizetti@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Giovanni Bajo <rasky@develer.com>
And remove them from ssa_test.
Change-Id: I3efac5fea529bb0efa2dae32124530482ba5058e
Reviewed-on: https://go-review.googlesource.com/98815
Reviewed-by: Keith Randall <khr@golang.org>
And remove them from ssa_test.
Change-Id: Ib5de5c0d908f23915e0847eca338cacf2fa5325b
Reviewed-on: https://go-review.googlesource.com/98795
Reviewed-by: Giovanni Bajo <rasky@develer.com>
This change move bits.Len* intrinsification tests to the new codegen
test harness, removing them from the old ssa_test file. Five different
test functions (one for each bit.Len function tested) was used, to
avoid possible unwanted interactions between multiple calls inside one
function.
Change-Id: Iffd5be55b58e88597fa30a562a28dacb01236d8b
Reviewed-on: https://go-review.googlesource.com/98156
Run-TryBot: Alberto Donizetti <alb.donizetti@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Giovanni Bajo <rasky@develer.com>