qbit/go - go - Tape:neT

qbit/go

mirror of https://github.com/golang/go synced 2024-11-26 08:17:59 -07:00

Author	SHA1	Message	Date
Joel Sing	e126129d76	cmd/compile/internal/ssa: combine shift and addition for riscv64 rva22u64 When GORISCV64 enables rva22u64, combined shift and addition using the SH1ADD, SH2ADD and SH3ADD instructions that are available via the Zba extension. This results in more than 2000 instructions being removed from the Go binary on riscv64. Change-Id: Ia62ae7dda3d8083cff315113421bee73f518eea8 Reviewed-on: https://go-review.googlesource.com/c/go/+/606636 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Mark Ryan <markdryan@rivosinc.com> Reviewed-by: Michael Pratt <mpratt@google.com> Reviewed-by: Cherry Mui <cherryyz@google.com> Reviewed-by: Meng Zhuo <mengzhuo1203@gmail.com>	2024-08-28 13:46:24 +00:00
Paul E. Murphy	c6d142c4a7	cmd/compile/internal/ssa: fix ppc64 merging of (CLRLSLDI (SRD ...)) The rotate value was not correctly converted from a 64 bit to 32 bit rotate. This caused a miscompile of golang.org/x/text/unicode/runenames.Names. Fixes #67526 Change-Id: Ief56fbab27ccc71cd4c01117909bfee7f60a2ea1 Reviewed-on: https://go-review.googlesource.com/c/go/+/586915 Reviewed-by: Dmitri Shuralyov <dmitshur@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com> Reviewed-by: Carlos Amedee <carlos@golang.org>	2024-05-21 18:53:43 +00:00
Paul E. Murphy	0222a028f1	cmd/compile/internal/ssa: combine more shift and masking on PPC64 Investigating binaries, these patterns seem to show up frequently. Change-Id: I987251e4070e35c25e98da321e444ccaa1526912 Reviewed-on: https://go-review.googlesource.com/c/go/+/583302 Reviewed-by: Cherry Mui <cherryyz@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com> Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>	2024-05-15 13:27:41 +00:00
Paul E. Murphy	7994da4cc1	cmd/compile/internal/ssa: on PPC64, try combining CLRLSLDI and SRDconst into RLWINM This provides a small performance bump to crc64 as measured on ppc64le/power10: name old time/op new time/op delta Crc64/ISO64KB 49.6µs ± 0% 46.6µs ± 0% -6.18% Crc64/ISO4KB 3.16µs ± 0% 2.97µs ± 0% -5.83% Crc64/ISO1KB 840ns ± 0% 794ns ± 0% -5.46% Crc64/ECMA64KB 49.6µs ± 0% 46.5µs ± 0% -6.20% Crc64/Random64KB 53.1µs ± 0% 49.9µs ± 0% -6.04% Crc64/Random16KB 15.9µs ± 1% 15.0µs ± 0% -5.73% Change-Id: I302b5431c7dc46dfd2d211545c483bdcdfe011f1 Cq-Include-Trybots: luci.golang.try:gotip-linux-ppc64_power10,gotip-linux-ppc64_power8,gotip-linux-ppc64le_power8,gotip-linux-ppc64le_power9,gotip-linux-ppc64le_power10 Reviewed-on: https://go-review.googlesource.com/c/go/+/581937 Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com> Reviewed-by: Eli Bendersky <eliben@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: David Chase <drchase@google.com>	2024-05-03 21:12:29 +00:00
Joel Sing	70c7fb75e9	cmd/compile: correct code generation for right shifts on riscv64 The code generation on riscv64 will currently result in incorrect assembly when a 32 bit integer is right shifted by an amount that exceeds the size of the type. In particular, this occurs when an int32 or uint32 is cast to a 64 bit type and right shifted by a value larger than 31. Fix this by moving the SRAW/SRLW conversion into the right shift rules and removing the SignExt32to64/ZeroExt32to64. Add additional rules that rewrite to SRAIW/SRLIW when the shift is less than the size of the type, or replace/eliminate the shift when it exceeds the size of the type. Add SSA tests that would have caught this issue. Also add additional codegen tests to ensure that the resulting assembly is what we expect in these overflow cases. Fixes #64285 Change-Id: Ie97b05668597cfcb91413afefaab18ee1aa145ec Reviewed-on: https://go-review.googlesource.com/c/go/+/545035 Reviewed-by: Russ Cox <rsc@golang.org> Reviewed-by: Cherry Mui <cherryyz@google.com> Reviewed-by: M Zhuo <mzh@golangcn.org> Reviewed-by: Mark Ryan <markdryan@rivosinc.com> Run-TryBot: Joel Sing <joel@sing.id.au> TryBot-Result: Gopher Robot <gobot@golang.org>	2023-12-01 19:30:59 +00:00
Ubuntu	8fc043ccfa	cmd/compile: optimize right shifts of int32 on riscv64 The compiler is currently sign extending 32 bit signed integers to 64 bits before right shifting them using a 64 bit shift instruction. There's no need to do this as RISC-V has instructions for right shifting 32 bit signed values (sraw and sraiw) which sign extend the result of the shift to 64 bits. Change the compiler so that it uses sraw and sraiw for shifts of signed 32 bit integers reducing in most cases the number of instructions needed to perform the shift. Here are some examples of code sequences that are changed by this patch: int32(a) >> 2 before: sll x5,x10,0x20 sra x10,x5,0x22 after: sraw x10,x10,0x2 int32(v) >> int(s) before: sext.w x5,x10 sltiu x6,x11,64 add x6,x6,-1 or x6,x11,x6 sra x10,x5,x6 after: sltiu x5,x11,32 add x5,x5,-1 or x5,x11,x5 sraw x10,x10,x5 int32(v) >> (int(s) & 31) before: sext.w x5,x10 and x6,x11,63 sra x10,x5,x6 after: and x5,x11,31 sraw x10,x10,x5 int32(100) >> int(a) before: bltz x10,<target address calls runtime.panicshift> sltiu x5,x10,64 add x5,x5,-1 or x5,x10,x5 li x6,100 sra x10,x6,x5 after: bltz x10,<target address calls runtime.panicshift> sltiu x5,x10,32 add x5,x5,-1 or x5,x10,x5 li x6,100 sraw x10,x6,x5 int32(v) >> (int(s) & 63) before: sext.w x5,x10 and x6,x11,63 sra x10,x5,x6 after: and x5,x11,63 sltiu x6,x5,32 add x6,x6,-1 or x5,x5,x6 sraw x10,x10,x5 In most cases we eliminate one instruction. In the case where we shift a int32 constant by a variable the number of instructions generated is identical. A sra is simply replaced by a sraw. In the unusual case where we shift right by a variable anded with a constant > 31 but < 64, we generate two additional instructions. As this is an unusual case we do not try to optimize for it. Some improvements can be seen in some of the existing benchmarks, notably in the utf8 package which performs right shifts of runes which are signed 32 bit integers. \| utf8-old \| utf8-new \| \| sec/op \| sec/op vs base \| EncodeASCIIRune-4 17.68n ± 0% 17.67n ± 0% ~ (p=0.312 n=10) EncodeJapaneseRune-4 35.34n ± 0% 34.53n ± 1% -2.31% (p=0.000 n=10) AppendASCIIRune-4 3.213n ± 0% 3.213n ± 0% ~ (p=0.318 n=10) AppendJapaneseRune-4 36.14n ± 0% 35.35n ± 0% -2.19% (p=0.000 n=10) DecodeASCIIRune-4 28.11n ± 0% 27.36n ± 0% -2.69% (p=0.000 n=10) DecodeJapaneseRune-4 38.55n ± 0% 38.58n ± 0% ~ (p=0.612 n=10) Change-Id: I60a91cbede9ce65597571c7b7dd9943eeb8d3cc2 Reviewed-on: https://go-review.googlesource.com/c/go/+/535115 Run-TryBot: Joel Sing <joel@sing.id.au> TryBot-Result: Gopher Robot <gobot@golang.org> Reviewed-by: Joel Sing <joel@sing.id.au> Reviewed-by: Cherry Mui <cherryyz@google.com> Reviewed-by: M Zhuo <mzh@golangcn.org> Reviewed-by: David Chase <drchase@google.com>	2023-10-30 14:47:06 +00:00
Lynn Boger	80834af206	cmd/compile: avoid ANDCCconst on PPC64 if condition not needed In the PPC64 ISA, the instruction to do an 'and' operation using an immediate constant is only available in the form that also sets CR0 (i.e. clobbers the condition register.) This means CR0 is being clobbered unnecessarily in many cases. That affects some decisions made during some compiler passes that check for it. In those cases when the constant used by the ANDCC is a right justified consecutive set of bits, a shift instruction can be used which has the same effect if CR0 does not need to be set. The rule to do that has been added to the late rules file after other rules using ANDCCconst have been processed in the main rules file. Some codegen tests had to be updated since ANDCC is no longer generated for some cases. A new test case was added to verify the ANDCC is present if the results for both the AND and CR0 are used. Change-Id: I304f607c039a458e2d67d25351dd00aea72ba542 Reviewed-on: https://go-review.googlesource.com/c/go/+/531435 Run-TryBot: Lynn Boger <laboger@linux.vnet.ibm.com> Reviewed-by: Paul Murphy <murp@ibm.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Carlos Amedee <carlos@golang.org> Reviewed-by: Jayanth Krishnamurthy <jayanth.krishnamurthy@ibm.com> TryBot-Result: Gopher Robot <gobot@golang.org> Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>	2023-10-18 15:56:53 +00:00
Mark Ryan	561bf0457f	cmd/compile: optimize right shifts of uint32 on riscv The compiler is currently zero extending 32 bit unsigned integers to 64 bits before right shifting them using a 64 bit shift instruction. There's no need to do this as RISC-V has instructions for right shifting 32 bit unsigned values (srlw and srliw) which zero extend the result of the shift to 64 bits. Change the compiler so that it uses srlw and srliw for 32 bit unsigned shifts reducing in most cases the number of instructions needed to perform the shift. Here are some examples of code sequences that are changed by this patch: uint32(a) >> 2 before: sll x5,x10,0x20 srl x10,x5,0x22 after: srlw x10,x10,0x2 uint32(a) >> int(b) before: sll x5,x10,0x20 srl x5,x5,0x20 srl x5,x5,x11 sltiu x6,x11,64 neg x6,x6 and x10,x5,x6 after: srlw x5,x10,x11 sltiu x6,x11,32 neg x6,x6 and x10,x5,x6 bits.RotateLeft32(uint32(a), 1) before: sll x5,x10,0x1 sll x6,x10,0x20 srl x7,x6,0x3f or x5,x5,x7 after: sll x5,x10,0x1 srlw x6,x10,0x1f or x10,x5,x6 bits.RotateLeft32(uint32(a), int(b)) before: and x6,x11,31 sll x7,x10,x6 sll x8,x10,0x20 srl x8,x8,0x20 add x6,x6,-32 neg x6,x6 srl x9,x8,x6 sltiu x6,x6,64 neg x6,x6 and x6,x9,x6 or x6,x6,x7 after: and x5,x11,31 sll x6,x10,x5 add x5,x5,-32 neg x5,x5 srlw x7,x10,x5 sltiu x5,x5,32 neg x5,x5 and x5,x7,x5 or x10,x6,x5 The one regression observed is the following case, an unbounded right shift of a uint32 where the value we're shifting by is known to be < 64 but > 31. As this is an unusual case this commit does not optimize for it, although the existing code does. uint32(a) >> (b & 63) before: sll x5,x10,0x20 srl x5,x5,0x20 and x6,x11,63 srl x10,x5,x6 after and x5,x11,63 srlw x6,x10,x5 sltiu x5,x5,32 neg x5,x5 and x10,x6,x5 Here we have one extra instruction. Some benchmark highlights, generated on a VisionFive2 8GB running Ubuntu 23.04. pkg: math/bits LeadingZeros32-4 18.64n ± 0% 17.32n ± 0% -7.11% (p=0.000 n=10) LeadingZeros64-4 15.47n ± 0% 15.51n ± 0% +0.26% (p=0.027 n=10) TrailingZeros16-4 18.48n ± 0% 17.68n ± 0% -4.33% (p=0.000 n=10) TrailingZeros32-4 16.87n ± 0% 16.07n ± 0% -4.74% (p=0.000 n=10) TrailingZeros64-4 15.26n ± 0% 15.27n ± 0% +0.07% (p=0.043 n=10) OnesCount32-4 20.08n ± 0% 19.29n ± 0% -3.96% (p=0.000 n=10) RotateLeft-4 8.864n ± 0% 8.838n ± 0% -0.30% (p=0.006 n=10) RotateLeft32-4 8.837n ± 0% 8.032n ± 0% -9.11% (p=0.000 n=10) Reverse32-4 29.77n ± 0% 26.52n ± 0% -10.93% (p=0.000 n=10) ReverseBytes32-4 9.640n ± 0% 8.838n ± 0% -8.32% (p=0.000 n=10) Sub32-4 8.835n ± 0% 8.035n ± 0% -9.06% (p=0.000 n=10) geomean 11.50n 11.33n -1.45% pkg: crypto/md5 Hash8Bytes-4 1.486µ ± 0% 1.426µ ± 0% -4.04% (p=0.000 n=10) Hash64-4 2.079µ ± 0% 1.968µ ± 0% -5.36% (p=0.000 n=10) Hash128-4 2.720µ ± 0% 2.557µ ± 0% -5.99% (p=0.000 n=10) Hash256-4 3.996µ ± 0% 3.733µ ± 0% -6.58% (p=0.000 n=10) Hash512-4 6.541µ ± 0% 6.072µ ± 0% -7.18% (p=0.000 n=10) Hash1K-4 11.64µ ± 0% 10.75µ ± 0% -7.58% (p=0.000 n=10) Hash8K-4 82.95µ ± 0% 76.32µ ± 0% -7.99% (p=0.000 n=10) Hash1M-4 10.436m ± 0% 9.591m ± 0% -8.10% (p=0.000 n=10) Hash8M-4 83.50m ± 0% 76.73m ± 0% -8.10% (p=0.000 n=10) Hash8BytesUnaligned-4 1.494µ ± 0% 1.434µ ± 0% -4.02% (p=0.000 n=10) Hash1KUnaligned-4 11.64µ ± 0% 10.76µ ± 0% -7.52% (p=0.000 n=10) Hash8KUnaligned-4 83.01µ ± 0% 76.32µ ± 0% -8.07% (p=0.000 n=10) geomean 28.32µ 26.42µ -6.72% Change-Id: I20483a6668cca1b53fe83944bee3706aadcf8693 Reviewed-on: https://go-review.googlesource.com/c/go/+/528975 Reviewed-by: Michael Pratt <mpratt@google.com> Reviewed-by: Cherry Mui <cherryyz@google.com> Reviewed-by: Joel Sing <joel@sing.id.au> Run-TryBot: Joel Sing <joel@sing.id.au> TryBot-Result: Gopher Robot <gobot@golang.org>	2023-10-07 12:31:38 +00:00
Paul E. Murphy	1540531746	test/codegen: merge identical ppc64 and ppc64le tests Manually consolidate the remaining ppc64/ppc64le test which are not so trivial to automatically merge. The remaining ppc64le tests are limited to cases where load/stores are merged (this only happens on ppc64le) and the race detector (only supported on ppc64le). Change-Id: I1f9c0f3d3ddbb7fbbd8c81fbbd6537394fba63ce Reviewed-on: https://go-review.googlesource.com/c/go/+/463217 Reviewed-by: Dmitri Shuralyov <dmitshur@google.com> Reviewed-by: Cherry Mui <cherryyz@google.com> Run-TryBot: Paul Murphy <murp@ibm.com> TryBot-Result: Gopher Robot <gobot@golang.org> Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>	2023-01-27 19:03:02 +00:00
Paul E. Murphy	0301c6c351	test/codegen: combine trivial PPC64 tests into ppc64x Use a small python script to consolidate duplicate ppc64/ppc64le tests into a single ppc64x codegen test. This makes small assumption that anytime two tests with for different arch/variant combos exists, those tests can be combined into a single ppc64x test. E.x: // ppc64le: foo // ppc64le/power9: foo into // ppc64x: foo or // ppc64: foo // ppc64le: foo into // ppc64x: foo import glob import re files = glob.glob("codegen/.go") for file in files: with open(file) as f: text = [l for l in f] i = 0 while i < len(text): first = re.match("\s// ?ppc64(le)?(/power[89])?:(.)", text[i]) if first: j = i+1 while j < len(text): second = re.match("\s// ?ppc64(le)?(/power[89])?:(.*)", text[j]) if not second: break if (not first.group(2) or first.group(2) == second.group(2)) and first.group(3) == second.group(3): text[i] = re.sub(" ?ppc64(le\|x)?"," ppc64x",text[i]) text=text[:j] + (text[j+1:]) else: j += 1 i+=1 with open(file, 'w') as f: f.write("".join(text)) Change-Id: Ic6b009b54eacaadc5a23db9c5a3bf7331b595821 Reviewed-on: https://go-review.googlesource.com/c/go/+/463220 Reviewed-by: Cherry Mui <cherryyz@google.com> Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com> Reviewed-by: Bryan Mills <bcmills@google.com> Run-TryBot: Paul Murphy <murp@ibm.com> TryBot-Result: Gopher Robot <gobot@golang.org>	2023-01-27 18:24:12 +00:00
Wayne Zuo	af668c689c	cmd/compile: fold constant shift with extension on riscv64 For example: movb a0, a0 srai $1, a0, a0 the assembler will expand to: slli $56, a0, a0 srai $56, a0, a0 srai $1, a0, a0 this CL optimize to: slli $56, a0, a0 srai $57, a0, a0 Remove 270+ instructions from Go binary on linux/riscv64. Change-Id: I375e19f9d3bd54f2781791d8cbe5970191297dc8 Reviewed-on: https://go-review.googlesource.com/c/go/+/428496 Reviewed-by: Keith Randall <khr@google.com> Run-TryBot: Wayne Zuo <wdvxdr@golangcn.org> Reviewed-by: Joel Sing <joel@sing.id.au> Reviewed-by: Cherry Mui <cherryyz@google.com> TryBot-Result: Gopher Robot <gobot@golang.org>	2022-10-06 05:21:04 +00:00
Joel Sing	a7bcc94719	cmd/compile: resolve known outcomes for SLTI/SLTIU on riscv64 When SLTI/SLTIU is used with ANDI/ORI, it may be possible to determine the outcome based on the values of the immediates. Resolve these cases. Improves code generation for various shift operations. While here, sort tests by architecture to improve readability and ease future maintenance. Change-Id: I87e71e016a0e396a928e7d6389a2df61583dfd8d Reviewed-on: https://go-review.googlesource.com/c/go/+/428217 Reviewed-by: Wayne Zuo <wdvxdr@golangcn.org> TryBot-Result: Gopher Robot <gobot@golang.org> Run-TryBot: Jenny Rakoczy <jenny@golang.org> Reviewed-by: Jenny Rakoczy <jenny@golang.org> Run-TryBot: Joel Sing <joel@sing.id.au> Reviewed-by: Cherry Mui <cherryyz@google.com> Auto-Submit: Jenny Rakoczy <jenny@golang.org>	2022-09-17 17:17:52 +00:00
ruinan	454a058ffc	cmd/compile: Add shiftIsBounded check for logic shifts of arm64 This CL adds shiftIsBounded checks for the Lsh* and Rsh* rules in arm64. There is no need to check the shift value again with CMP + CSEL when the shift value is valid. Change-Id: I54620de64f02a1b5a11089add237248ae2de01b4 Reviewed-on: https://go-review.googlesource.com/c/go/+/417714 Reviewed-by: Keith Randall <khr@golang.org> Reviewed-by: Keith Randall <khr@google.com> Reviewed-by: Heschi Kreinick <heschi@google.com>	2022-09-07 20:10:13 +00:00
Keith Randall	5b1fbfba1c	cmd/compile: rewrite >>c<<c to &^(1<<c-1) Fixes #54496 Change-Id: I3c2ed8cd55836d5b07c8cdec00d3b584885aca79 Reviewed-on: https://go-review.googlesource.com/c/go/+/424856 Reviewed-by: Keith Randall <khr@google.com> Reviewed-by: Heschi Kreinick <heschi@google.com> Run-TryBot: Martin Möhrmann <martin@golang.org> TryBot-Result: Gopher Robot <gobot@golang.org> Run-TryBot: Keith Randall <khr@golang.org> Reviewed-by: Martin Möhrmann <martin@golang.org>	2022-09-02 18:51:37 +00:00
ruinan	54c7bc9cff	cmd/compile: optimize shift ops on arm64 when the shift value is v&63 For the following code case: var x uint64 x >> (shift & 63) We can directly genereta `x >> shift` on arm64, since the hardware will only use the bottom 6 bits of the shift amount. Benchmark old time/op new time/op delta ShiftArithmeticRight-8 0.40ns 0.31ns -21.7% Change-Id: Id58c8a5b2f6dd5c30c3876f4a36e11b4d81e2dc9 Reviewed-on: https://go-review.googlesource.com/c/go/+/425294 Reviewed-by: Keith Randall <khr@golang.org> Reviewed-by: Keith Randall <khr@google.com> TryBot-Result: Gopher Robot <gobot@golang.org> Auto-Submit: Keith Randall <khr@golang.org> Run-TryBot: Keith Randall <khr@golang.org> Reviewed-by: Heschi Kreinick <heschi@google.com>	2022-09-02 17:44:29 +00:00
Wayne Zuo	da6556968f	cmd/compile: simplify bounded shift on riscv64 The prove pass will mark some shifts bounded, and then we can use that information to generate better code on riscv64. Change-Id: Ia22f43d0598453c9417adac7017db28d7240948b Reviewed-on: https://go-review.googlesource.com/c/go/+/422616 TryBot-Result: Gopher Robot <gobot@golang.org> Reviewed-by: Joel Sing <joel@sing.id.au> Run-TryBot: Wayne Zuo <wdvxdr@golangcn.org> Reviewed-by: Keith Randall <khr@golang.org> Reviewed-by: Keith Randall <khr@google.com> Auto-Submit: Keith Randall <khr@golang.org> Run-TryBot: Keith Randall <khr@golang.org> Reviewed-by: Cherry Mui <cherryyz@google.com>	2022-08-31 20:21:00 +00:00
Archana R	d09c6ac417	test/codegen: updated multiple tests to verify on ppc64,ppc64le Updated multiple tests in test/codegen: math.go, mathbits.go, shift.go and slices.go to verify on ppc64/ppc64le as well Change-Id: Id88dd41569b7097819fb4d451b615f69cf7f7a94 Reviewed-on: https://go-review.googlesource.com/c/go/+/412115 TryBot-Result: Gopher Robot <gobot@golang.org> Run-TryBot: Archana Ravindar <aravind5@in.ibm.com> Reviewed-by: Than McIntosh <thanm@google.com> Reviewed-by: Paul Murphy <murp@ibm.com> Reviewed-by: Ian Lance Taylor <iant@google.com>	2022-08-17 13:56:55 +00:00
Joel Sing	fe8347b61a	cmd/compile: optimise immediate operands with constants on riscv64 Instructions with immediates can be precomputed when operating on a constant - do so for SLTI/SLTIU, SLLI/SRLI/SRAI, NEG/NEGW, ANDI, ORI and ADDI. Additionally, optimise ANDI and ORI when the immediate is all ones or all zeroes. In particular, the RISCV64 logical left and right shift rules (Lshx/RshUx) produce sequences that check if the shift amount exceeds 64 and if so returns zero. When the shift amount is a constant we can precompute and eliminate the filter entirely. Likewise the arithmetic right shift rules produce sequences that check if the shift amount exceeds 64 and if so, ensures that the lower six bits of the shift are all ones. When the shift amount is a constant we can precompute the shift value. Arithmetic right shift sequences like: 117fc: 00100513 li a0,1 11800: 04053593 sltiu a1,a0,64 11804: fff58593 addi a1,a1,-1 11808: 0015e593 ori a1,a1,1 1180c: 40b45433 sra s0,s0,a1 Are now a single srai instruction: 117fc: 40145413 srai s0,s0,0x1 Likewise for logical left shift (and logical right shift): 1d560: 01100413 li s0,17 1d564: 04043413 sltiu s0,s0,64 1d568: 40800433 neg s0,s0 1d56c: 01131493 slli s1,t1,0x11 1d570: 0084f433 and s0,s1,s0 Which are now a single slli (or srli) instruction: 1d120: 01131413 slli s0,t1,0x11 This removes more than 30,000 instructions from the Go binary and should improve performance in a variety of areas - of note runtime.makemap_small drops from 48 to 36 instructions. Similar gains exist in at least other parts of runtime and math/bits. Change-Id: I33f6f3d1fd36d9ff1bda706997162bfe4bb859b6 Reviewed-on: https://go-review.googlesource.com/c/go/+/350689 Trust: Joel Sing <joel@sing.id.au> Reviewed-by: Michael Munday <mike.munday@lowrisc.org> Reviewed-by: Cherry Mui <cherryyz@google.com>	2021-09-24 10:51:48 +00:00
Joel Sing	d413908320	test/codegen: add shift tests for RISCV64 Add tests for shift by constant, masked shifts and bounded shifts. While here, sort tests by architecture and keep order of tests consistent (lsh, rshU, rsh). Change-Id: I512d64196f34df9cb2884e8c0f6adcf9dd88b0fc Reviewed-on: https://go-review.googlesource.com/c/go/+/351289 Trust: Joel Sing <joel@sing.id.au> Reviewed-by: Michael Munday <mike.munday@lowrisc.org> Reviewed-by: Cherry Mui <cherryyz@google.com> Run-TryBot: Cherry Mui <cherryyz@google.com>	2021-09-24 07:22:13 +00:00
Josh Bleecher Snyder	43d5f213e2	cmd/compile: optimize multi-register shifts on amd64 amd64 can shift in bits from another register instead of filling with 0/1. This pattern is helpful when implementing 128 bit shifts or arbitrary length shifts. In the standard library, it shows up in pure Go math/big. Benchmarks results on amd64 with -tags=math_big_pure_go. name old time/op new time/op delta NonZeroShifts/1/shrVU-8 4.45ns ± 3% 4.39ns ± 1% -1.28% (p=0.000 n=30+27) NonZeroShifts/1/shlVU-8 4.13ns ± 4% 4.10ns ± 2% ~ (p=0.254 n=29+28) NonZeroShifts/2/shrVU-8 5.55ns ± 1% 5.63ns ± 2% +1.42% (p=0.000 n=28+29) NonZeroShifts/2/shlVU-8 5.70ns ± 2% 5.14ns ± 1% -9.82% (p=0.000 n=29+28) NonZeroShifts/3/shrVU-8 6.79ns ± 2% 6.35ns ± 2% -6.46% (p=0.000 n=28+29) NonZeroShifts/3/shlVU-8 6.69ns ± 1% 6.25ns ± 1% -6.60% (p=0.000 n=28+27) NonZeroShifts/4/shrVU-8 7.79ns ± 2% 7.06ns ± 2% -9.48% (p=0.000 n=30+30) NonZeroShifts/4/shlVU-8 7.82ns ± 1% 7.24ns ± 1% -7.37% (p=0.000 n=28+29) NonZeroShifts/5/shrVU-8 8.90ns ± 3% 7.93ns ± 1% -10.84% (p=0.000 n=29+26) NonZeroShifts/5/shlVU-8 8.68ns ± 1% 7.92ns ± 1% -8.76% (p=0.000 n=29+29) NonZeroShifts/10/shrVU-8 14.4ns ± 1% 12.3ns ± 2% -14.79% (p=0.000 n=28+29) NonZeroShifts/10/shlVU-8 14.1ns ± 1% 11.9ns ± 2% -15.55% (p=0.000 n=28+27) NonZeroShifts/100/shrVU-8 118ns ± 1% 96ns ± 3% -18.82% (p=0.000 n=30+29) NonZeroShifts/100/shlVU-8 120ns ± 2% 98ns ± 2% -18.46% (p=0.000 n=29+28) NonZeroShifts/1000/shrVU-8 1.10µs ± 1% 0.88µs ± 2% -19.63% (p=0.000 n=29+30) NonZeroShifts/1000/shlVU-8 1.10µs ± 2% 0.88µs ± 2% -20.28% (p=0.000 n=29+28) NonZeroShifts/10000/shrVU-8 10.9µs ± 1% 8.7µs ± 1% -19.78% (p=0.000 n=28+27) NonZeroShifts/10000/shlVU-8 10.9µs ± 2% 8.7µs ± 1% -19.64% (p=0.000 n=29+27) NonZeroShifts/100000/shrVU-8 111µs ± 2% 90µs ± 2% -19.39% (p=0.000 n=28+29) NonZeroShifts/100000/shlVU-8 113µs ± 2% 90µs ± 2% -20.43% (p=0.000 n=30+27) The assembly version is still faster, unfortunately, but the gap is narrowing. Speedup from pure Go to assembly: name old time/op new time/op delta NonZeroShifts/1/shrVU-8 4.39ns ± 1% 3.45ns ± 2% -21.36% (p=0.000 n=27+29) NonZeroShifts/1/shlVU-8 4.10ns ± 2% 3.47ns ± 3% -15.42% (p=0.000 n=28+30) NonZeroShifts/2/shrVU-8 5.63ns ± 2% 3.97ns ± 0% -29.40% (p=0.000 n=29+25) NonZeroShifts/2/shlVU-8 5.14ns ± 1% 3.77ns ± 2% -26.65% (p=0.000 n=28+26) NonZeroShifts/3/shrVU-8 6.35ns ± 2% 4.79ns ± 2% -24.52% (p=0.000 n=29+29) NonZeroShifts/3/shlVU-8 6.25ns ± 1% 4.42ns ± 1% -29.29% (p=0.000 n=27+26) NonZeroShifts/4/shrVU-8 7.06ns ± 2% 5.64ns ± 1% -20.05% (p=0.000 n=30+29) NonZeroShifts/4/shlVU-8 7.24ns ± 1% 5.34ns ± 2% -26.23% (p=0.000 n=29+29) NonZeroShifts/5/shrVU-8 7.93ns ± 1% 6.56ns ± 2% -17.26% (p=0.000 n=26+30) NonZeroShifts/5/shlVU-8 7.92ns ± 1% 6.27ns ± 1% -20.79% (p=0.000 n=29+25) NonZeroShifts/10/shrVU-8 12.3ns ± 2% 10.2ns ± 2% -17.21% (p=0.000 n=29+29) NonZeroShifts/10/shlVU-8 11.9ns ± 2% 10.5ns ± 2% -12.45% (p=0.000 n=27+29) NonZeroShifts/100/shrVU-8 95.9ns ± 3% 77.7ns ± 1% -19.00% (p=0.000 n=29+30) NonZeroShifts/100/shlVU-8 97.5ns ± 2% 66.8ns ± 2% -31.47% (p=0.000 n=28+30) NonZeroShifts/1000/shrVU-8 884ns ± 2% 705ns ± 1% -20.17% (p=0.000 n=30+28) NonZeroShifts/1000/shlVU-8 880ns ± 2% 590ns ± 1% -32.96% (p=0.000 n=28+25) NonZeroShifts/10000/shrVU-8 8.74µs ± 1% 7.34µs ± 3% -15.94% (p=0.000 n=27+30) NonZeroShifts/10000/shlVU-8 8.73µs ± 1% 6.00µs ± 1% -31.25% (p=0.000 n=27+28) NonZeroShifts/100000/shrVU-8 89.6µs ± 2% 75.5µs ± 2% -15.80% (p=0.000 n=29+29) NonZeroShifts/100000/shlVU-8 89.6µs ± 2% 68.0µs ± 3% -24.09% (p=0.000 n=27+30) Change-Id: I18f58d8f5513d737d9cdf09b8f9d14011ffe3958 Reviewed-on: https://go-review.googlesource.com/c/go/+/297050 Trust: Josh Bleecher Snyder <josharian@gmail.com> Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com> TryBot-Result: Go Bot <gobot@golang.org> Reviewed-by: Keith Randall <khr@golang.org>	2021-03-11 19:11:46 +00:00
Paul E. Murphy	48ddf70128	cmd/asm,cmd/compile: support 5 operand RLWNM/RLWMI on ppc64 These instructions are actually 5 argument opcodes as specified by the ISA. Prior to this patch, the MB and ME arguments were merged into a single bitmask operand to workaround the limitations of the ppc64 assembler backend. This limitation no longer exists. Thus, we can pass operands for these opcodes without having to merge the MB and ME arguments in the assembler frontend or compiler backend. Likewise, support for 4 operand variants is unchanged. Change-Id: Ib086774f3581edeaadfd2190d652aaaa8a90daeb Reviewed-on: https://go-review.googlesource.com/c/go/+/298750 Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com> Reviewed-by: Carlos Eduardo Seo <carlos.seo@linaro.org> Trust: Carlos Eduardo Seo <carlos.seo@linaro.org>	2021-03-09 20:35:41 +00:00
Michael Munday	854e892ce1	cmd/compile: optimize shift pairs and masks on s390x Optimize combinations of left and right shifts by a constant value into a 'rotate then insert selected bits [into zero]' instruction. Use the same instruction for contiguous masks since it has some benefits over 'and immediate' (not restricted to 32-bits, does not overwrite source register). To keep the complexity of this change under control I've only implemented 64 bit operations for now. There are a lot more optimizations that can be done with this instruction family. However, since their function overlaps with other instructions we need to be somewhat careful not to break existing optimization rules by creating optimization dead ends. This is particularly true of the load/store merging rules which contain lots of zero extensions and shifts. This CL does interfere with the store merging rules when an operand is shifted left before it is stored: binary.BigEndian.PutUint64(b, x << 1) This is unfortunate but it's not critical and somewhat complex so I plan to fix that in a follow up CL. file before after Δ % addr2line 4117446 4117282 -164 -0.004% api 4945184 4942752 -2432 -0.049% asm 4998079 4991891 -6188 -0.124% buildid 2685158 2684074 -1084 -0.040% cgo 4553732 4553394 -338 -0.007% compile 19294446 19245070 -49376 -0.256% cover 4897105 4891319 -5786 -0.118% dist 3544389 3542785 -1604 -0.045% doc 3926795 3927617 +822 +0.021% fix 3302958 3293868 -9090 -0.275% link 6546274 6543456 -2818 -0.043% nm 4102021 4100825 -1196 -0.029% objdump 4542431 4548483 +6052 +0.133% pack 2482465 2416389 -66076 -2.662% pprof 13366541 13363915 -2626 -0.020% test2json 2829007 2761515 -67492 -2.386% trace 10216164 10219684 +3520 +0.034% vet 6773956 6773572 -384 -0.006% total 107124151 106917891 -206260 -0.193% Change-Id: I7591cce41e06867ba10a745daae9333513062746 Reviewed-on: https://go-review.googlesource.com/c/go/+/233317 Run-TryBot: Michael Munday <mike.munday@ibm.com> TryBot-Result: Go Bot <gobot@golang.org> Reviewed-by: Keith Randall <khr@golang.org> Trust: Michael Munday <mike.munday@ibm.com>	2020-11-06 10:45:31 +00:00
Paul E. Murphy	c3c6fbf314	cmd/compile: combine more 32 bit shift and mask operations on ppc64 Combine (AND m (SRWconst x)) or (SRWconst (AND m x)) when mask m is and the shift value produce constant which can be encoded into an RLWINM instruction. Combine (CLRLSLDI (SRWconst x)) if the combining of the underling rotate masks produces a constant which can be encoded into RLWINM. Likewise for (SLDconst (SRWconst x)) and (CLRLSDI (RLWINM x)). Combine rotate word + and operations which can be encoded as a single RLWINM/RLWNM instruction. The most notable performance improvements arise from the crypto benchmarks below (GOARCH=power8 on a ppc64le/linux): pkg:golang.org/x/crypto/blowfish goos:linux goarch:ppc64le ExpandKeyWithSalt 52.2µs ± 0% 47.5µs ± 0% -8.88% ExpandKey 44.4µs ± 0% 40.3µs ± 0% -9.15% pkg:golang.org/x/crypto/ssh/internal/bcrypt_pbkdf goos:linux goarch:ppc64le Key 57.6ms ± 0% 52.3ms ± 0% -9.13% pkg:golang.org/x/crypto/bcrypt goos:linux goarch:ppc64le Equal 90.9ms ± 0% 82.6ms ± 0% -9.13% DefaultCost 91.0ms ± 0% 82.7ms ± 0% -9.12% Change-Id: I59a0ca29face38f4ab46e37124c32906f216c4ce Reviewed-on: https://go-review.googlesource.com/c/go/+/260798 Run-TryBot: Carlos Eduardo Seo <carlos.seo@linaro.com> TryBot-Result: Go Bot <gobot@golang.org> Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com> Reviewed-by: Carlos Eduardo Seo <carlos.seo@linaro.com> Trust: Lynn Boger <laboger@linux.vnet.ibm.com>	2020-10-27 18:33:20 +00:00
Lynn Boger	cc2a5cf4b8	cmd/compile,cmd/internal/obj/ppc64: fix some shift rules due to a regression A recent change to improve shifts was generating some invalid cases when the rule was based on an AND. The extended mnemonics CLRLSLDI and CLRLSLWI only allow certain values for the operands and in the mask case those values were not being checked properly. This adds a check to those rules to verify that the 'b' and 'n' values used when an AND was part of the rule have correct values. There was a bug in some diag messages in asm9. The message expected 3 values but only provided 2. Those are corrected here also. The test/codegen/shift.go was updated to add a few more cases to check for the case mentioned here. Some of the comments that mention the order of operands in these extended mnemonics were wrong and those have been corrected. Fixes #41683. Change-Id: If5bb860acaa5051b9e0cd80784b2868b85898c31 Reviewed-on: https://go-review.googlesource.com/c/go/+/258138 Run-TryBot: Lynn Boger <laboger@linux.vnet.ibm.com> Reviewed-by: Paul Murphy <murp@ibm.com> Reviewed-by: Carlos Eduardo Seo <carlos.seo@gmail.com> TryBot-Result: Go Bot <gobot@golang.org> Trust: Lynn Boger <laboger@linux.vnet.ibm.com>	2020-10-01 18:51:18 +00:00
Lynn Boger	a424f6e45e	cmd/asm,cmd/compile,cmd/internal/obj/ppc64: add extswsli support on power9 This adds support for the extswsli instruction which combines extsw followed by a shift. New benchmark demonstrates the improvement: name old time/op new time/op delta ExtShift 1.34µs ± 0% 1.30µs ± 0% -3.15% (p=0.057 n=4+3) Change-Id: I21b410676fdf15d20e0cbbaa75d7c6dcd3bbb7b0 Reviewed-on: https://go-review.googlesource.com/c/go/+/257017 Run-TryBot: Lynn Boger <laboger@linux.vnet.ibm.com> TryBot-Result: Go Bot <gobot@golang.org> Reviewed-by: Carlos Eduardo Seo <carlos.seo@gmail.com> Trust: Lynn Boger <laboger@linux.vnet.ibm.com>	2020-09-28 18:13:48 +00:00
Lynn Boger	967465da29	cmd/compile: use combined shifts to improve array addressing on ppc64x This change adds rules to find pairs of instructions that can be combined into a single shifts. These instruction sequences are common in array addressing within loops. Improvements can be seen in many crypto packages and the hash packages. These are based on the extended mnemonics found in the ISA sections C.8.1 and C.8.2. Some rules in PPC64.rules were moved because the ordering prevented some matching. The following results were generated on power9. hash/crc32: CRC32/poly=Koopman/size=40/align=0 195ns ± 0% 163ns ± 0% -16.41% CRC32/poly=Koopman/size=40/align=1 200ns ± 0% 163ns ± 0% -18.50% CRC32/poly=Koopman/size=512/align=0 1.98µs ± 0% 1.67µs ± 0% -15.46% CRC32/poly=Koopman/size=512/align=1 1.98µs ± 0% 1.69µs ± 0% -14.80% CRC32/poly=Koopman/size=1kB/align=0 3.90µs ± 0% 3.31µs ± 0% -15.27% CRC32/poly=Koopman/size=1kB/align=1 3.85µs ± 0% 3.31µs ± 0% -14.15% CRC32/poly=Koopman/size=4kB/align=0 15.3µs ± 0% 13.1µs ± 0% -14.22% CRC32/poly=Koopman/size=4kB/align=1 15.4µs ± 0% 13.1µs ± 0% -14.79% CRC32/poly=Koopman/size=32kB/align=0 137µs ± 0% 105µs ± 0% -23.56% CRC32/poly=Koopman/size=32kB/align=1 137µs ± 0% 105µs ± 0% -23.53% crypto/rc4: RC4_128 733ns ± 0% 650ns ± 0% -11.32% (p=1.000 n=1+1) RC4_1K 5.80µs ± 0% 5.17µs ± 0% -10.89% (p=1.000 n=1+1) RC4_8K 45.7µs ± 0% 40.8µs ± 0% -10.73% (p=1.000 n=1+1) crypto/sha1: Hash8Bytes 635ns ± 0% 613ns ± 0% -3.46% (p=1.000 n=1+1) Hash320Bytes 2.30µs ± 0% 2.18µs ± 0% -5.38% (p=1.000 n=1+1) Hash1K 5.88µs ± 0% 5.38µs ± 0% -8.62% (p=1.000 n=1+1) Hash8K 42.0µs ± 0% 37.9µs ± 0% -9.75% (p=1.000 n=1+1) There are other improvements found in golang.org/x/crypto which are all in the range of 5-15%. Change-Id: I193471fbcf674151ffe2edab212799d9b08dfb8c Reviewed-on: https://go-review.googlesource.com/c/go/+/252097 Trust: Lynn Boger <laboger@linux.vnet.ibm.com> Run-TryBot: Lynn Boger <laboger@linux.vnet.ibm.com> TryBot-Result: Go Bot <gobot@golang.org> Reviewed-by: Carlos Eduardo Seo <cseo@linux.vnet.ibm.com>	2020-09-17 12:37:40 +00:00
Lynn Boger	a1550d3ca3	cmd/compile: use isel with variable shifts on ppc64x This changes the code generated for variable length shift counts to use isel instead of instructions that set and read the carry flag. This reduces the generated code for shifts like this by 1 instruction and avoids the use of instructions to set and read the carry flag. This sequence can be found in strconv with these results on power9: Atof64Decimal 71.6ns ± 0% 68.3ns ± 0% -4.61% Atof64Float 95.3ns ± 0% 90.9ns ± 0% -4.62% Atof64FloatExp 153ns ± 0% 149ns ± 0% -2.61% Atof64Big 234ns ± 0% 232ns ± 0% -0.85% Atof64RandomBits 348ns ± 0% 369ns ± 0% +6.03% Atof64RandomFloats 262ns ± 0% 262ns ± 0% ~ Atof32Decimal 72.0ns ± 0% 68.2ns ± 0% -5.28% Atof32Float 92.1ns ± 0% 87.1ns ± 0% -5.43% Atof32FloatExp 159ns ± 0% 158ns ± 0% -0.63% Atof32Random 194ns ± 0% 191ns ± 0% -1.55% Some tests in codegen/shift.go are enabled to verify the expected instructions are generated. Change-Id: I968715d10ada405a8c46132bf19b8ed9b85796d1 Reviewed-on: https://go-review.googlesource.com/c/go/+/227337 Run-TryBot: Lynn Boger <laboger@linux.vnet.ibm.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Cherry Zhang <cherryyz@google.com>	2020-04-09 19:18:56 +00:00
Lynn Boger	e4a1cf8a56	cmd/compile: add rules to eliminate unnecessary signed shifts This change to the rules removes some unnecessary signed shifts that appear in the math/rand functions. Existing rules did not cover some of the signed cases. A little improvement seen in math/rand due to removing 1 of 2 instructions generated for Int31n, which is inlined quite a bit. Intn1000 46.9ns ± 0% 45.5ns ± 0% -2.99% (p=1.000 n=1+1) Int63n1000 33.5ns ± 0% 32.8ns ± 0% -2.09% (p=1.000 n=1+1) Int31n1000 32.7ns ± 0% 32.6ns ± 0% -0.31% (p=1.000 n=1+1) Float32 32.7ns ± 0% 30.3ns ± 0% -7.34% (p=1.000 n=1+1) Float64 21.7ns ± 0% 20.9ns ± 0% -3.69% (p=1.000 n=1+1) Perm3 205ns ± 0% 202ns ± 0% -1.46% (p=1.000 n=1+1) Perm30 1.71µs ± 0% 1.68µs ± 0% -1.35% (p=1.000 n=1+1) Perm30ViaShuffle 1.65µs ± 0% 1.65µs ± 0% -0.30% (p=1.000 n=1+1) ShuffleOverhead 2.83µs ± 0% 2.83µs ± 0% -0.07% (p=1.000 n=1+1) Read3 18.7ns ± 0% 16.1ns ± 0% -13.90% (p=1.000 n=1+1) Read64 126ns ± 0% 124ns ± 0% -1.59% (p=1.000 n=1+1) Read1000 1.75µs ± 0% 1.63µs ± 0% -7.08% (p=1.000 n=1+1) Change-Id: I11502dfca7d65aafc76749a8d713e9e50c24a858 Reviewed-on: https://go-review.googlesource.com/c/go/+/225917 Run-TryBot: Lynn Boger <laboger@linux.vnet.ibm.com> Reviewed-by: Cherry Zhang <cherryyz@google.com>	2020-03-27 16:05:42 +00:00
Agniva De Sarker	8fedb2d338	cmd/compile: optimize bounded shifts on wasm Use the shiftIsBounded function to generate more efficient Shift instructions. Updates #25167 Change-Id: Id350f8462dc3a7ed3bfed0bcbea2860b8f40048a Reviewed-on: https://go-review.googlesource.com/c/go/+/182558 Run-TryBot: Agniva De Sarker <agniva.quicksilver@gmail.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Cherry Zhang <cherryyz@google.com> Reviewed-by: Richard Musiol <neelance@gmail.com>	2019-08-28 04:44:21 +00:00
Josh Bleecher Snyder	61945fc502	cmd/compile: don't generate panicshift for masked int shifts We know that a & 31 is non-negative for all a, signed or not. We can avoid checking that and needing to write out an unreachable call to panicshift. Change-Id: I32f32fb2c950d2b2b35ac5c0e99b7b2dbd47f917 Reviewed-on: https://go-review.googlesource.com/c/go/+/167499 Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com> Reviewed-by: Keith Randall <khr@golang.org>	2019-03-14 00:03:29 +00:00
Josh Bleecher Snyder	870cfe6484	test/codegen: gofmt Change-Id: I33f5b5051e5f75aa264ec656926223c5a3c09c1b Reviewed-on: https://go-review.googlesource.com/c/go/+/167498 Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com> Reviewed-by: Matt Layher <mdlayher@gmail.com>	2019-03-13 21:44:45 +00:00
Michael Munday	8af0c77df3	cmd/compile: simplify shift lowering on s390x Use conditional moves instead of subtractions with borrow to handle saturation cases. This allows us to delete the SUBE/SUBEW ops and associated rules from the SSA backend. Using conditional moves also means we can detect when shift values are masked so I've added some new rules to constant fold the relevant comparisons and masking ops. Also use the new shiftIsBounded() function to avoid generating code to handle saturation cases where possible. Updates #25167 for s390x. Change-Id: Ief9991c91267c9151ce4c5ec07642abb4dcc1c0d Reviewed-on: https://go-review.googlesource.com/110070 Run-TryBot: Michael Munday <mike.munday@ibm.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>	2018-05-08 16:19:56 +00:00

32 Commits