qbit/go - go - Tape:neT

qbit/go

mirror of https://github.com/golang/go synced 2024-11-06 12:26:11 -07:00

Author	SHA1	Message	Date
Brian Kessler	6b1d5471b9	cmd/compile: add signed indivisibility by power of 2 rules Commit `44343c777c` (CL 173557) added rules for handling divisibility checks for powers of 2 for signed integers, x%c ==0. This change adds the complementary indivisibility rules, x%c != 0. Fixes #34166 Change-Id: I87379e30af7aff633371acca82db2397da9b2c07 Reviewed-on: https://go-review.googlesource.com/c/go/+/194219 Run-TryBot: Brian Kessler <brian.m.kessler@gmail.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Keith Randall <khr@golang.org>	2019-11-07 16:30:46 +00:00
Russ Cox	543c6d2e0d	math, cmd/compile: rename Fma to FMA This API was added for #25819, where it was discussed as math.FMA. The commit adding it used math.Fma, presumably for consistency with the rest of the unusual names in package math (Sincos, Acosh, Erfcinv, Float32bits, etc). I believe that using an idiomatic Go name is more important here than consistency with these other names, most of which are historical baggage from C's standard library. Early additions like Float32frombits happened before "uppercase for export" (so they were originally like "float32frombits") and they were not properly reconsidered when we uppercased the symbols to export them. That's a mistake we live with. The names of functions we have added since then, and even a few that were legacy, are more properly Go-cased, such as IsNaN, IsInf, and RoundToEven, rather than Isnan, Isinf, and Roundtoeven. And also constants like MaxFloat32. For new API, we should keep using proper Go-cased symbols instead of minimally-upper-cased-C symbols. So math.FMA, not math.Fma. This API has not yet been released, so this change does not break the compatibility promise. This CL also modifies cmd/compile, since the compiler knows the name of the function. I could have stopped at changing the string constants, but it seemed to make more sense to use a consistent casing everywhere. Change-Id: I0f6f3407f41e99bfa8239467345c33945088896e Reviewed-on: https://go-review.googlesource.com/c/go/+/205317 Run-TryBot: Russ Cox <rsc@golang.org> Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>	2019-11-07 14:51:06 +00:00
smasher164	58b031949b	cmd/compile: add fma intrinsic for arm This change introduces an arm intrinsic that generates the FMULAD instruction for the fused-multiply-add operation on systems that support it. System support is detected via cpu.ARM.HasVFPv4. A rewrite rule translates the generic intrinsic to FMULAD. Updates #25819. Change-Id: I8459e5dd1cdbdca35f88a78dbeb7d387f1e20efa Reviewed-on: https://go-review.googlesource.com/c/go/+/142117 Run-TryBot: Keith Randall <khr@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Keith Randall <khr@golang.org>	2019-10-21 17:42:47 +00:00
smasher164	7a6da218b1	cmd/compile: add fma intrinsic for amd64 To permit ssa-level optimization, this change introduces an amd64 intrinsic that generates the VFMADD231SD instruction for the fused-multiply-add operation on systems that support it. System support is detected via cpu.X86.HasFMA. A rewrite rule can then translate the generic ssa intrinsic ("Fma") to VFMADD231SD. The benchmark compares the software implementation (old) with the intrinsic (new). name old time/op new time/op delta Fma-4 27.2ns ± 1% 1.0ns ± 9% -96.48% (p=0.008 n=5+5) Updates #25819. Change-Id: I966655e5f96817a5d06dff5942418a3915b09584 Reviewed-on: https://go-review.googlesource.com/c/go/+/137156 Run-TryBot: Keith Randall <khr@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Keith Randall <khr@golang.org>	2019-10-21 16:42:10 +00:00
smasher164	33425ab8db	cmd/compile: introduce generic ssa intrinsic for fused-multiply-add In order to make math.FMA a compiler intrinsic for ISAs like ARM64, PPC64[le], and S390X, a generic 3-argument opcode "Fma" is provided and rewritten as ARM64: (Fma x y z) -> (FMADDD z x y) PPC64: (Fma x y z) -> (FMADD x y z) S390X: (Fma x y z) -> (FMADD z x y) Updates #25819. Change-Id: Ie5bc628311e6feeb28ddf9adaa6e702c8c291efa Reviewed-on: https://go-review.googlesource.com/c/go/+/131959 Run-TryBot: Akhil Indurti <aindurti@gmail.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Keith Randall <khr@golang.org>	2019-10-21 16:24:15 +00:00
David Chase	6adaf17eaa	cmd/compile: preserve statements in late nilcheckelim optimization When a subsequent load/store of a ptr makes the nil check of that pointer unnecessary, if their lines differ, change the line of the load/store to that of the nilcheck, and attempt to rehome the load/store position instead. This fix makes profiling less accurate in order to make panics more informative. Fixes #33724 Change-Id: Ib9afaac12fe0d0320aea1bf493617facc34034b3 Reviewed-on: https://go-review.googlesource.com/c/go/+/200197 Run-TryBot: David Chase <drchase@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Keith Randall <khr@golang.org>	2019-10-15 16:43:44 +00:00
Meng Zhuo	50f1157760	cmd/compile: add math/bits.Mul64 intrinsic on mips64x Benchmark: name old time/op new time/op delta Mul 36.0ns ± 1% 2.8ns ± 0% -92.31% (p=0.000 n=10+10) Mul32 4.37ns ± 0% 4.37ns ± 0% ~ (p=0.429 n=6+10) Mul64 36.4ns ± 0% 2.8ns ± 0% -92.37% (p=0.000 n=10+9) Change-Id: Ic4f4e5958adbf24999abcee721d0180b5413fca7 Reviewed-on: https://go-review.googlesource.com/c/go/+/200582 Reviewed-by: Keith Randall <khr@golang.org> Reviewed-by: Cherry Zhang <cherryyz@google.com> Run-TryBot: Cherry Zhang <cherryyz@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org>	2019-10-14 21:23:34 +00:00
Michael Munday	6ec4c71eef	cmd/compile: add SSA rules for s390x compare-and-branch instructions This commit adds SSA rules for the s390x combined compare-and-branch instructions. These have a shorter encoding than separate compare and branch instructions and they also don't clobber the condition code (a.k.a. flag register) reducing pressure on the flag allocator. I have deleted the 'loop_test.go' file and replaced it with a new codegen test which performs a wider range of checks. Object sizes from compilebench: name old object-bytes new object-bytes delta Template 562kB ± 0% 561kB ± 0% -0.28% (p=0.000 n=10+10) Unicode 217kB ± 0% 217kB ± 0% -0.17% (p=0.000 n=10+10) GoTypes 2.03MB ± 0% 2.02MB ± 0% -0.59% (p=0.000 n=10+10) Compiler 8.16MB ± 0% 8.11MB ± 0% -0.62% (p=0.000 n=10+10) SSA 27.4MB ± 0% 27.0MB ± 0% -1.45% (p=0.000 n=10+10) Flate 356kB ± 0% 356kB ± 0% -0.12% (p=0.000 n=10+10) GoParser 438kB ± 0% 436kB ± 0% -0.51% (p=0.000 n=10+10) Reflect 1.37MB ± 0% 1.37MB ± 0% -0.42% (p=0.000 n=10+10) Tar 485kB ± 0% 483kB ± 0% -0.39% (p=0.000 n=10+10) XML 630kB ± 0% 621kB ± 0% -1.45% (p=0.000 n=10+10) [Geo mean] 1.14MB 1.13MB -0.60% name old text-bytes new text-bytes delta HelloSize 763kB ± 0% 754kB ± 0% -1.30% (p=0.000 n=10+10) CmdGoSize 10.7MB ± 0% 10.6MB ± 0% -0.91% (p=0.000 n=10+10) [Geo mean] 2.86MB 2.82MB -1.10% Change-Id: Ibca55d9c0aa1254aee69433731ab5d26a43a7c18 Reviewed-on: https://go-review.googlesource.com/c/go/+/198037 Run-TryBot: Michael Munday <mike.munday@ibm.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Keith Randall <khr@golang.org>	2019-10-08 10:03:04 +00:00
Cuong Manh Le	77f5adba55	cmd/compile: don't use statictmps for small object in slice literal Fixes #21561 Change-Id: I89c59752060dd9570d17d73acbbaceaefce5d8ce Reviewed-on: https://go-review.googlesource.com/c/go/+/197560 Run-TryBot: Cuong Manh Le <cuong.manhle.vn@gmail.com> Run-TryBot: Keith Randall <khr@golang.org> Reviewed-by: Keith Randall <khr@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org>	2019-10-08 06:09:26 +00:00
Keith Randall	72dc9ab191	cmd/compile: reuse dead register before reusing register holding constant For commuting ops, check whether the second argument is dead before checking if the first argument is rematerializeable. Reusing the register holding a dead value is always best. Fixes #33580 Change-Id: I7372cfc03d514e6774d2d9cc727a3e6bf6ce2657 Reviewed-on: https://go-review.googlesource.com/c/go/+/199559 Run-TryBot: Keith Randall <khr@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: David Chase <drchase@google.com>	2019-10-07 15:16:26 +00:00
Dan Scales	225f484c88	misc, runtime, test: extra tests and benchmarks for defer Add a bunch of extra tests and benchmarks for defer, in preparation for new low-cost (open-coded) implementation of defers (see #34481), - New file defer_test.go that tests a bunch more unusual defer scenarios, including things that might have problems for open-coded defers. - Additions to callers_test.go actually verifying what the stack trace looks like for various panic or panic-recover scenarios. - Additions to crash_test.go testing several more crash scenarios involving recursive panics. - New benchmark in runtime_test.go measuring speed of panic-recover - New CGo benchmark in cgo_test.go calling from Go to C back to Go that shows defer overhead Updates #34481 Change-Id: I423523f3e05fc0229d4277dd00073289a5526188 Reviewed-on: https://go-review.googlesource.com/c/go/+/197017 Run-TryBot: Dan Scales <danscales@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Keith Randall <khr@golang.org> Reviewed-by: Austin Clements <austin@google.com>	2019-09-25 23:27:16 +00:00
Martin Möhrmann	f41451e7eb	compile: prefer an AND instead of SHR+SHL instructions On modern 64bit CPUs a SHR, SHL or AND instruction take 1 cycle to execute. A pair of shifts that operate on the same register will take 2 cycles and needs to wait for the input register value to be available. Large constants used to mask the high bits of a register with an AND instruction can not be encoded as an immediate in the AND instruction on amd64 and therefore need to be loaded into a register with a MOV instruction. However that MOV instruction is not dependent on the output register and on many CPUs does not compete with the AND or shift instructions for execution ports. Using a pair of shifts to mask high bits instead of an AND to mask high bits of a register has a shorter encoding and uses one less general purpose register but is slower due to taking one clock cycle longer if there is no register pressure that would make the AND variant need to generate a spill. For example the instructions emitted for (x & 1 << 63) before this CL are: 48c1ea3f SHRQ $0x3f, DX 48c1e23f SHLQ $0x3f, DX after this CL the instructions are the same as GCC and LLVM use: 48b80000000000000080 MOVQ $0x8000000000000000, AX 4821d0 ANDQ DX, AX Some platforms such as arm64 already have SSA optimization rules to fuse two shift instructions back into an AND. Removing the general rule to rewrite AND to SHR+SHL speeds up this benchmark: var GlobalU uint func BenchmarkAndHighBits(b *testing.B) { x := uint(0) for i := 0; i < b.N; i++ { x &= 1 << 63 } GlobalU = x } amd64/darwin on Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz: name old time/op new time/op delta AndHighBits-4 0.61ns ± 6% 0.42ns ± 6% -31.42% (p=0.000 n=25+25): 'go run run.go -all_codegen -v codegen' passes with following adjustments: ARM64: The BFXIL pattern ((x << lc) >> rc \| y & ac) needed adjustment since ORshiftRL generation fusing '>> rc' and '\|' interferes with matching ((x << lc) >> rc) to generate UBFX. Previously ORshiftLL was created first using the shifts generated for (y & ac). S390X: Add rules for abs and copysign to match use of AND instead of SHIFTs. Updates #33826 Updates #32781 Change-Id: I5a59f6239660d53c029cd22dfb44ddf39f93a56c Reviewed-on: https://go-review.googlesource.com/c/go/+/196810 Run-TryBot: Martin Möhrmann <moehrmann@google.com> Reviewed-by: Cherry Zhang <cherryyz@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org>	2019-09-24 20:30:59 +00:00
Bryan C. Mills	34fe8295c5	Revert "compile: prefer an AND instead of SHR+SHL instructions" This reverts CL 194297. Reason for revert: introduced register allocation failures on PPC64LE builders. Updates #33826 Updates #32781 Updates #34468 Change-Id: I7d0b55df8cdf8e7d2277f1814299b083c2692e48 Reviewed-on: https://go-review.googlesource.com/c/go/+/196957 Run-TryBot: Bryan C. Mills <bcmills@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org> Reviewed-by: Cherry Zhang <cherryyz@google.com> Reviewed-by: Martin Möhrmann <moehrmann@google.com>	2019-09-23 15:20:12 +00:00
Martin Möhrmann	4e2b84ffc5	compile: prefer an AND instead of SHR+SHL instructions On modern 64bit CPUs a SHR, SHL or AND instruction take 1 cycle to execute. A pair of shifts that operate on the same register will take 2 cycles and needs to wait for the input register value to be available. Large constants used to mask the high bits of a register with an AND instruction can not be encoded as an immediate in the AND instruction on amd64 and therefore need to be loaded into a register with a MOV instruction. However that MOV instruction is not dependent on the output register and on many CPUs does not compete with the AND or shift instructions for execution ports. Using a pair of shifts to mask high bits instead of an AND to mask high bits of a register has a shorter encoding and uses one less general purpose register but is slower due to taking one clock cycle longer if there is no register pressure that would make the AND variant need to generate a spill. For example the instructions emitted for (x & 1 << 63) before this CL are: 48c1ea3f SHRQ $0x3f, DX 48c1e23f SHLQ $0x3f, DX after this CL the instructions are the same as GCC and LLVM use: 48b80000000000000080 MOVQ $0x8000000000000000, AX 4821d0 ANDQ DX, AX Some platforms such as arm64 already have SSA optimization rules to fuse two shift instructions back into an AND. Removing the general rule to rewrite AND to SHR+SHL speeds up this benchmark: var GlobalU uint func BenchmarkAndHighBits(b *testing.B) { x := uint(0) for i := 0; i < b.N; i++ { x &= 1 << 63 } GlobalU = x } amd64/darwin on Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz: name old time/op new time/op delta AndHighBits-4 0.61ns ± 6% 0.42ns ± 6% -31.42% (p=0.000 n=25+25): 'go run run.go -all_codegen -v codegen' passes with following adjustments: ARM64: The BFXIL pattern ((x << lc) >> rc \| y & ac) needed adjustment since ORshiftRL generation fusing '>> rc' and '\|' interferes with matching ((x << lc) >> rc) to generate UBFX. Previously ORshiftLL was created first using the shifts generated for (y & ac). S390X: Add rules for abs and copysign to match use of AND instead of SHIFTs. Updates #33826 Updates #32781 Change-Id: I43227da76b625de03fbc51117162b23b9c678cdb Reviewed-on: https://go-review.googlesource.com/c/go/+/194297 Run-TryBot: Martin Möhrmann <martisch@uos.de> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Cherry Zhang <cherryyz@google.com>	2019-09-21 18:00:13 +00:00
Agniva De Sarker	ecc7dd5469	test/codegen: fix wasm codegen breakage i32.eqz instructions don't appear unless needed in if conditions anymore after CL 195204. I forgot to run the codegen tests while submitting the CL. Thanks to @martisch for catching it. Fixes #34442 Change-Id: I177b064b389be48e39d564849714d7a8839be13e Reviewed-on: https://go-review.googlesource.com/c/go/+/196580 Run-TryBot: Agniva De Sarker <agniva.quicksilver@gmail.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Martin Möhrmann <moehrmann@google.com>	2019-09-21 16:31:44 +00:00
Matthew Dempsky	85fc765341	cmd/compile: optimize switch on strings When compiling expression switches, we try to optimize runs of constants into binary searches. The ordering used isn't visible to the application, so it's unimportant as long as we're consistent between sorting and searching. For strings, it's much cheaper to compare string lengths than strings themselves, so instead of ordering strings by "si <= sj", we currently order them by "len(si) < len(sj) \|\| len(si) == len(sj) && si <= sj" (i.e., the lexicographical ordering on the 2-tuple (len(s), s)). However, it's also somewhat cheaper to compare strings for equality (i.e., ==) than for ordering (i.e., <=). And if there were two or three string constants of the same length in a switch statement, we might unnecessarily emit ordering comparisons. For example, given: switch s { case "", "1", "2", "3": // ordered by length then content goto L } we currently compile this as: if len(s) < 1 \|\| len(s) == 1 && s <= "1" { if s == "" { goto L } else if s == "1" { goto L } } else { if s == "2" { goto L } else if s == "3" { goto L } } This CL switches to using a 2-level binary search---first on len(s), then on s itself---so that string ordering comparisons are only needed when there are 4 or more strings of the same length. (4 being the cut-off for when using binary search is actually worthwhile.) So the above switch instead now compiles to: if len(s) == 0 { if s == "" { goto L } } else if len(s) == 1 { if s == "1" { goto L } else if s == "2" { goto L } else if s == "3" { goto L } } which is better optimized by walk and SSA. (Notably, because there are only two distinct lengths and no more than three strings of any particular length, this example ends up falling back to simply using linear search.) Test case by khr@ from CL 195138. Fixes #33934. Change-Id: I8eeebcaf7e26343223be5f443d6a97a0daf84f07 Reviewed-on: https://go-review.googlesource.com/c/go/+/195340 Run-TryBot: Matthew Dempsky <mdempsky@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org> Reviewed-by: Keith Randall <khr@golang.org>	2019-09-18 05:33:05 +00:00
LE Manh Cuong	ec4e8517cd	cmd/compile: support more length types for slice extension optimization golang.org/cl/109517 optimized the compiler to avoid the allocation for make in append(x, make([]T, y)...). This was only implemented for the case that y has type int. This change extends the optimization to trigger for all integer types where the value is known at compile time to fit into an int. name old time/op new time/op delta ExtendInt-12 106ns ± 4% 106ns ± 0% ~ (p=0.351 n=10+6) ExtendUint64-12 1.03µs ± 5% 0.10µs ± 4% -90.01% (p=0.000 n=9+10) name old alloc/op new alloc/op delta ExtendInt-12 0.00B 0.00B ~ (all equal) ExtendUint64-12 13.6kB ± 0% 0.0kB -100.00% (p=0.000 n=10+10) name old allocs/op new allocs/op delta ExtendInt-12 0.00 0.00 ~ (all equal) ExtendUint64-12 1.00 ± 0% 0.00 -100.00% (p=0.000 n=10+10) Updates #29785 Change-Id: Ief7760097c285abd591712da98c5b02bc3961fcd Reviewed-on: https://go-review.googlesource.com/c/go/+/182559 Run-TryBot: Cuong Manh Le <cuong.manhle.vn@gmail.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Keith Randall <khr@golang.org>	2019-09-17 17:18:17 +00:00
Alberto Donizetti	c2facbe937	Revert "test/codegen: document -all_codegen option in README" This reverts CL 192101. Reason for revert: The same paragraph was added 2 weeks ago (look a few lines above) Change-Id: I05efb2631d7b4966f66493f178f2a649c715a3cc Reviewed-on: https://go-review.googlesource.com/c/go/+/195637 Reviewed-by: Cherry Zhang <cherryyz@google.com>	2019-09-16 17:31:37 +00:00
Cherry Zhang	d9b8ffa51c	test/codegen: document -all_codegen option in README It is useful to know about the -all_codegen option for running codegen tests for all platforms. I was puzzling that some codegen test was not failing on my local machine or on trybot, until I found this option. Change-Id: I062cf4d73f6a6c9ebc2258195779d2dab21bc36d Reviewed-on: https://go-review.googlesource.com/c/go/+/192101 Reviewed-by: Daniel Martí <mvdan@mvdan.cc> Run-TryBot: Daniel Martí <mvdan@mvdan.cc> TryBot-Result: Gobot Gobot <gobot@golang.org>	2019-09-16 11:53:57 +00:00
Ruixin Bao	98aa97806b	cmd/compile: add math/bits.Mul64 intrinsic on s390x This change adds an intrinsic for Mul64 on s390x. To achieve that, a new assembly instruction, MLGR, is introduced in s390x/asmz.go. This assembly instruction directly uses an existing instruction on Z and supports multiplication of two 64 bit unsigned integer and stores the result in two separate registers. In this case, we require the multiplcand to be stored in register R3 and the output result (the high and low 64 bit of the product) to be stored in R2 and R3 respectively. A test case is also added. Benchmark: name old time/op new time/op delta Mul-18 11.1ns ± 0% 1.4ns ± 0% -87.39% (p=0.002 n=8+10) Mul32-18 2.07ns ± 0% 2.07ns ± 0% ~ (all equal) Mul64-18 11.1ns ± 1% 1.4ns ± 0% -87.42% (p=0.000 n=10+10) Change-Id: Ieca6ad1f61fff9a48a31d50bbd3f3c6d9e6675c1 Reviewed-on: https://go-review.googlesource.com/c/go/+/194572 Reviewed-by: Michael Munday <mike.munday@ibm.com> Run-TryBot: Michael Munday <mike.munday@ibm.com> TryBot-Result: Gobot Gobot <gobot@golang.org>	2019-09-13 09:04:48 +00:00
Michael Munday	5c5f217b63	cmd/compile: improve s390x sign/zero extension removal This CL gets rid of the MOVDreg and MOVDnop SSA operations on s390x. They were originally inserted to help avoid situations where a sign/zero extension was elided but a spill invalidated the optimization. It's not really clear we need to do this though (amd64 doesn't have these ops for example) so long as we are careful when removing sign/zero extensions. Also, the MOVDreg technique doesn't work if the register is spilled before the MOVDreg op (I haven't seen that in practice). Removing these ops reduces the complexity of the rules and also allows us to unblock optimizations. For example, the compiler can now merge the loads in binary.{Big,Little}Endian.PutUint16 which it wasn't able to do before. This CL reduces the size of the .text section in the go tool by about 4.7KB (0.09%). Change-Id: Icaddae7f2e4f9b2debb6fabae845adb3f73b41db Reviewed-on: https://go-review.googlesource.com/c/go/+/173897 Run-TryBot: Michael Munday <mike.munday@ibm.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Cherry Zhang <cherryyz@google.com>	2019-09-10 13:17:24 +00:00
Martin Möhrmann	5bb59b6d16	Revert "compile: prefer an AND instead of SHR+SHL instructions" This reverts commit `9ec7074a94`. Reason for revert: broke s390x (copysign, abs) and arm64 (bitfield) tests. Change-Id: I16c1b389c062e8c4aa5de079f1d46c9b25b0db52 Reviewed-on: https://go-review.googlesource.com/c/go/+/193850 Run-TryBot: Martin Möhrmann <moehrmann@google.com> Reviewed-by: Agniva De Sarker <agniva.quicksilver@gmail.com> TryBot-Result: Gobot Gobot <gobot@golang.org>	2019-09-09 07:33:25 +00:00
Martin Möhrmann	9ec7074a94	compile: prefer an AND instead of SHR+SHL instructions On modern 64bit CPUs a SHR, SHL or AND instruction take 1 cycle to execute. A pair of shifts that operate on the same register will take 2 cycles and needs to wait for the input register value to be available. Large constants used to mask the high bits of a register with an AND instruction can not be encoded as an immediate in the AND instruction on amd64 and therefore need to be loaded into a register with a MOV instruction. However that MOV instruction is not dependent on the output register and on many CPUs does not compete with the AND or shift instructions for execution ports. Using a pair of shifts to mask high bits instead of an AND to mask high bits of a register has a shorter encoding and uses one less general purpose register but is slower due to taking one clock cycle longer if there is no register pressure that would make the AND variant need to generate a spill. For example the instructions emitted for (x & 1 << 63) before this CL are: 48c1ea3f SHRQ $0x3f, DX 48c1e23f SHLQ $0x3f, DX after this CL the instructions are the same as GCC and LLVM use: 48b80000000000000080 MOVQ $0x8000000000000000, AX 4821d0 ANDQ DX, AX Some platforms such as arm64 already have SSA optimization rules to fuse two shift instructions back into an AND. Removing the general rule to rewrite AND to SHR+SHL speeds up this benchmark: var GlobalU uint func BenchmarkAndHighBits(b *testing.B) { x := uint(0) for i := 0; i < b.N; i++ { x &= 1 << 63 } GlobalU = x } amd64/darwin on Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz: name old time/op new time/op delta AndHighBits-4 0.61ns ± 6% 0.42ns ± 6% -31.42% (p=0.000 n=25+25): Updates #33826 Updates #32781 Change-Id: I862d3587446410c447b9a7265196b57f85358633 Reviewed-on: https://go-review.googlesource.com/c/go/+/191780 Run-TryBot: Martin Möhrmann <moehrmann@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Keith Randall <khr@golang.org>	2019-09-09 06:49:17 +00:00
Alberto Donizetti	e6d2544d20	test/codegen: mention -all_codegen in the README For performance reasons (avoiding costly cross-compilations) CL 177577 changed the codegen test harness to only run the tests for the machine's GOARCH by default. This change updates the codegen README accordingly, explaining what all.bash does run by default and how to perform the tests for all architectures. Fixes #33924 Change-Id: I43328d878c3e449ebfda46f7e69963a44a511d40 Reviewed-on: https://go-review.googlesource.com/c/go/+/192619 Reviewed-by: Daniel Martí <mvdan@mvdan.cc>	2019-09-01 15:37:13 +00:00
Brian Kessler	b003afe4fe	cmd/compile: intrinsify RotateLeft32 on wasm wasm has 32-bit versions of all integer operations. This change lowers RotateLeft32 to i32.rotl on wasm and intrinsifies the math/bits call. Benchmarking on amd64 under node.js this is ~25% faster. node v10.15.3/amd64 name old time/op new time/op delta RotateLeft 8.37ns ± 1% 8.28ns ± 0% -1.05% (p=0.029 n=4+4) RotateLeft8 11.9ns ± 1% 11.8ns ± 0% ~ (p=0.167 n=5+5) RotateLeft16 11.8ns ± 0% 11.8ns ± 0% ~ (all equal) RotateLeft32 11.9ns ± 1% 8.7ns ± 0% -26.32% (p=0.008 n=5+5) RotateLeft64 8.31ns ± 1% 8.43ns ± 2% ~ (p=0.063 n=5+5) Updates #31265 Change-Id: I5b8e155978faeea536c4f6427ac9564d2f096a46 Reviewed-on: https://go-review.googlesource.com/c/go/+/182359 Run-TryBot: Brian Kessler <brian.m.kessler@gmail.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Richard Musiol <neelance@gmail.com>	2019-08-31 17:03:04 +00:00
Ben Shi	1786ecd502	cmd/compile: eliminate WASM's redundant extension & wrapping This CL eliminates unnecessary pairs of I32WrapI64 and I64ExtendI32U generated by the WASM backend for IF statements. And it makes the total size of pkg/js_wasm/ decreases about 490KB. Change-Id: I16b0abb686c4e30d5624323166ec2d0ec57dbe2d Reviewed-on: https://go-review.googlesource.com/c/go/+/191758 Run-TryBot: Ben Shi <powerman1st@163.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Richard Musiol <neelance@gmail.com>	2019-08-30 21:20:03 +00:00
Ben Shi	8d5197d818	cmd/compile: optimize 386's math.bits.TrailingZeros16 This CL reverts CL 192097 and fixes the issue in CL 189277. Change-Id: Icd271262e1f5019a8e01c91f91c12c1261eeb02b Reviewed-on: https://go-review.googlesource.com/c/go/+/192519 Run-TryBot: Ben Shi <powerman1st@163.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Keith Randall <khr@golang.org>	2019-08-30 17:37:00 +00:00
Cherry Zhang	9859f6bedb	test/codegen: fix ARM32 RotateLeft32 test The syntax of a shifted operation does not have a "$" sign for the shift amount. Remove it. Change-Id: I50782fe942b640076f48c2fafea4d3175be8ff99 Reviewed-on: https://go-review.googlesource.com/c/go/+/192100 Run-TryBot: Cherry Zhang <cherryyz@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>	2019-08-28 20:42:48 +00:00
Ben Shi	3cfd003a8a	cmd/compile: optimize ARM's math.bits.RotateLeft32 This CL optimizes math.bits.RotateLeft32 to inline "MOVW Rx@>Ry, Rd" on ARM. The benchmark results of math/bits show some improvements. name old time/op new time/op delta RotateLeft-4 9.42ns ± 0% 6.91ns ± 0% -26.66% (p=0.000 n=40+33) RotateLeft8-4 8.79ns ± 0% 8.79ns ± 0% -0.04% (p=0.000 n=40+31) RotateLeft16-4 8.79ns ± 0% 8.79ns ± 0% -0.04% (p=0.000 n=40+32) RotateLeft32-4 8.16ns ± 0% 7.54ns ± 0% -7.68% (p=0.000 n=40+40) RotateLeft64-4 15.7ns ± 0% 15.7ns ± 0% ~ (all equal) updates #31265 Change-Id: I77bc1c2c702d5323fc7cad5264a8e2d5666bf712 Reviewed-on: https://go-review.googlesource.com/c/go/+/188697 Run-TryBot: Ben Shi <powerman1st@163.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Cherry Zhang <cherryyz@google.com>	2019-08-28 15:41:58 +00:00
Ben Shi	c683ab8128	cmd/compile: optimize ARM's math.Abs This CL optimizes math.Abs to an inline ABSD instruction on ARM. The benchmark results of src/math/ show big improvements. name old time/op new time/op delta Acos-4 181ns ± 0% 182ns ± 0% +0.30% (p=0.000 n=40+40) Acosh-4 202ns ± 0% 202ns ± 0% ~ (all equal) Asin-4 163ns ± 0% 163ns ± 0% ~ (all equal) Asinh-4 242ns ± 0% 242ns ± 0% ~ (all equal) Atan-4 120ns ± 0% 121ns ± 0% +0.83% (p=0.000 n=40+40) Atanh-4 202ns ± 0% 202ns ± 0% ~ (all equal) Atan2-4 173ns ± 0% 173ns ± 0% ~ (all equal) Cbrt-4 1.06µs ± 0% 1.06µs ± 0% +0.09% (p=0.000 n=39+37) Ceil-4 72.9ns ± 0% 72.8ns ± 0% ~ (p=0.237 n=40+40) Copysign-4 13.2ns ± 0% 13.2ns ± 0% ~ (all equal) Cos-4 193ns ± 0% 183ns ± 0% -5.18% (p=0.000 n=40+40) Cosh-4 254ns ± 0% 239ns ± 0% -5.91% (p=0.000 n=40+40) Erf-4 112ns ± 0% 112ns ± 0% ~ (all equal) Erfc-4 117ns ± 0% 117ns ± 0% ~ (all equal) Erfinv-4 127ns ± 0% 127ns ± 1% ~ (p=0.492 n=40+40) Erfcinv-4 128ns ± 0% 128ns ± 0% ~ (all equal) Exp-4 212ns ± 0% 206ns ± 0% -3.05% (p=0.000 n=40+40) ExpGo-4 216ns ± 0% 209ns ± 0% -3.24% (p=0.000 n=40+40) Expm1-4 142ns ± 0% 142ns ± 0% ~ (all equal) Exp2-4 191ns ± 0% 184ns ± 0% -3.45% (p=0.000 n=40+40) Exp2Go-4 194ns ± 0% 187ns ± 0% -3.61% (p=0.000 n=40+40) Abs-4 14.4ns ± 0% 6.3ns ± 0% -56.39% (p=0.000 n=38+39) Dim-4 12.6ns ± 0% 12.6ns ± 0% ~ (all equal) Floor-4 49.6ns ± 0% 49.6ns ± 0% ~ (all equal) Max-4 27.6ns ± 0% 27.6ns ± 0% ~ (all equal) Min-4 27.0ns ± 0% 27.0ns ± 0% ~ (all equal) Mod-4 349ns ± 0% 305ns ± 1% -12.55% (p=0.000 n=33+40) Frexp-4 54.0ns ± 0% 47.1ns ± 0% -12.78% (p=0.000 n=38+38) Gamma-4 242ns ± 0% 234ns ± 0% -3.16% (p=0.000 n=36+40) Hypot-4 84.8ns ± 0% 67.8ns ± 0% -20.05% (p=0.000 n=31+35) HypotGo-4 88.5ns ± 0% 71.6ns ± 0% -19.12% (p=0.000 n=40+38) Ilogb-4 45.8ns ± 0% 38.9ns ± 0% -15.12% (p=0.000 n=40+32) J0-4 821ns ± 0% 802ns ± 0% -2.33% (p=0.000 n=33+40) J1-4 816ns ± 0% 807ns ± 0% -1.05% (p=0.000 n=40+29) Jn-4 1.67µs ± 0% 1.65µs ± 0% -1.45% (p=0.000 n=40+39) Ldexp-4 61.5ns ± 0% 54.6ns ± 0% -11.27% (p=0.000 n=40+32) Lgamma-4 188ns ± 0% 188ns ± 0% ~ (all equal) Log-4 154ns ± 0% 147ns ± 0% -4.78% (p=0.000 n=40+40) Logb-4 50.9ns ± 0% 42.7ns ± 0% -16.11% (p=0.000 n=34+39) Log1p-4 160ns ± 0% 159ns ± 0% ~ (p=0.828 n=40+40) Log10-4 173ns ± 0% 166ns ± 0% -4.05% (p=0.000 n=40+40) Log2-4 65.3ns ± 0% 58.4ns ± 0% -10.57% (p=0.000 n=37+37) Modf-4 36.4ns ± 0% 36.4ns ± 0% ~ (all equal) Nextafter32-4 36.4ns ± 0% 36.4ns ± 0% ~ (all equal) Nextafter64-4 32.7ns ± 0% 32.6ns ± 0% ~ (p=0.375 n=40+40) PowInt-4 300ns ± 0% 277ns ± 0% -7.78% (p=0.000 n=40+40) PowFrac-4 676ns ± 0% 635ns ± 0% -6.00% (p=0.000 n=40+35) Pow10Pos-4 17.6ns ± 0% 17.6ns ± 0% ~ (all equal) Pow10Neg-4 22.0ns ± 0% 22.0ns ± 0% ~ (all equal) Round-4 30.1ns ± 0% 30.1ns ± 0% ~ (all equal) RoundToEven-4 38.9ns ± 0% 38.9ns ± 0% ~ (all equal) Remainder-4 291ns ± 0% 263ns ± 0% -9.62% (p=0.000 n=40+40) Signbit-4 11.3ns ± 0% 11.3ns ± 0% ~ (all equal) Sin-4 185ns ± 0% 185ns ± 0% ~ (all equal) Sincos-4 230ns ± 0% 230ns ± 0% ~ (all equal) Sinh-4 253ns ± 0% 246ns ± 0% -2.77% (p=0.000 n=39+39) SqrtIndirect-4 41.4ns ± 0% 41.4ns ± 0% ~ (all equal) SqrtLatency-4 13.8ns ± 0% 13.8ns ± 0% ~ (all equal) SqrtIndirectLatency-4 37.0ns ± 0% 37.0ns ± 0% ~ (p=0.632 n=40+40) SqrtGoLatency-4 911ns ± 0% 911ns ± 0% +0.08% (p=0.000 n=40+40) SqrtPrime-4 13.2µs ± 0% 13.2µs ± 0% +0.01% (p=0.038 n=38+40) Tan-4 205ns ± 0% 205ns ± 0% ~ (all equal) Tanh-4 264ns ± 0% 247ns ± 0% -6.44% (p=0.000 n=39+32) Trunc-4 45.2ns ± 0% 45.2ns ± 0% ~ (all equal) Y0-4 796ns ± 0% 792ns ± 0% -0.55% (p=0.000 n=35+40) Y1-4 804ns ± 0% 797ns ± 0% -0.82% (p=0.000 n=24+40) Yn-4 1.64µs ± 0% 1.62µs ± 0% -1.27% (p=0.000 n=40+39) Float64bits-4 8.16ns ± 0% 8.16ns ± 0% +0.04% (p=0.000 n=35+40) Float64frombits-4 10.7ns ± 0% 10.7ns ± 0% ~ (all equal) Float32bits-4 7.53ns ± 0% 7.53ns ± 0% ~ (p=0.760 n=40+40) Float32frombits-4 6.91ns ± 0% 6.91ns ± 0% -0.04% (p=0.002 n=32+38) [Geo mean] 111ns 106ns -3.98% Change-Id: I54f4fd7f5160db020b430b556bde59cc0fdb996d Reviewed-on: https://go-review.googlesource.com/c/go/+/188678 Run-TryBot: Ben Shi <powerman1st@163.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Cherry Zhang <cherryyz@google.com>	2019-08-28 15:41:28 +00:00
Bryan C. Mills	372b0eed17	Revert "cmd/compile: optimize 386's math.bits.TrailingZeros16" This reverts CL 189277. Reason for revert: broke 32-bit builders. Updates #33902 Change-Id: Ie5f180d0371a90e5057ed578c334372e5fc3a286 Reviewed-on: https://go-review.googlesource.com/c/go/+/192097 Run-TryBot: Bryan C. Mills <bcmills@google.com> Reviewed-by: Daniel Martí <mvdan@mvdan.cc>	2019-08-28 12:57:59 +00:00
Agniva De Sarker	7be97af2ff	cmd/compile: apply optimization for readonly globals on wasm Extend the optimization introduced in CL 141118 to the wasm architecture. And for reference, the rules trigger 212 times while building std and cmd $GOOS=js GOARCH=wasm gotip build std cmd $grep -E "Wasm.rules:44(1\|2\|3\|4)" rulelog \| wc -l 212 Updates #26498 Change-Id: I153684a2b98589ae812b42268da08b65679e09d1 Reviewed-on: https://go-review.googlesource.com/c/go/+/185477 Run-TryBot: Agniva De Sarker <agniva.quicksilver@gmail.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Cherry Zhang <cherryyz@google.com> Reviewed-by: Richard Musiol <neelance@gmail.com>	2019-08-28 05:55:52 +00:00
Agniva De Sarker	8fedb2d338	cmd/compile: optimize bounded shifts on wasm Use the shiftIsBounded function to generate more efficient Shift instructions. Updates #25167 Change-Id: Id350f8462dc3a7ed3bfed0bcbea2860b8f40048a Reviewed-on: https://go-review.googlesource.com/c/go/+/182558 Run-TryBot: Agniva De Sarker <agniva.quicksilver@gmail.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Cherry Zhang <cherryyz@google.com> Reviewed-by: Richard Musiol <neelance@gmail.com>	2019-08-28 04:44:21 +00:00
Ben Shi	22355d6cd2	cmd/compile: optimize 386's math.bits.TrailingZeros16 This CL optimizes math.bits.TrailingZeros16 on 386 with a pair of BSFL and ORL instrcutions. The case TrailingZeros16-4 of the benchmark test in math/bits shows big improvement. name old time/op new time/op delta TrailingZeros16-4 1.55ns ± 1% 0.87ns ± 1% -43.87% (p=0.000 n=50+49) Change-Id: Ia899975b0e46f45dcd20223b713ed632bc32740b Reviewed-on: https://go-review.googlesource.com/c/go/+/189277 Run-TryBot: Ben Shi <powerman1st@163.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Keith Randall <khr@golang.org>	2019-08-28 02:29:54 +00:00
Ben Shi	731e6fc34e	cmd/compile: generate Select on WASM This CL performs the branchelim optimization on WASM with its select instruction. And the total size of pkg/js_wasm decreased about 80KB by this optimization. Change-Id: I868eb146120a1cac5c4609c8e9ddb07e4da8a1d9 Reviewed-on: https://go-review.googlesource.com/c/go/+/190957 Run-TryBot: Ben Shi <powerman1st@163.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Richard Musiol <neelance@gmail.com> Reviewed-by: Cherry Zhang <cherryyz@google.com>	2019-08-28 02:29:25 +00:00
LE Manh Cuong	c5f142fa9f	cmd/compile: optimize bitset tests The assembly output for x & c == c, where c is power of 2: MOVQ "".set+8(SP), AX ANDQ $8, AX CMPQ AX, $8 SETEQ "".~r2+24(SP) With optimization using bitset: MOVQ "".set+8(SP), AX BTL $3, AX SETCS "".~r2+24(SP) output less than 1 instruction. However, there is no speed improvement: name old time/op new time/op delta AllBitSet-8 0.35ns ± 0% 0.35ns ± 0% ~ (all equal) Fixes #31904 Change-Id: I5dca4e410bf45716ed2145e3473979ec997e35d4 Reviewed-on: https://go-review.googlesource.com/c/go/+/175957 Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Keith Randall <khr@golang.org>	2019-08-27 18:01:16 +00:00
Keith Randall	8f296f59de	Revert "Revert "cmd/compile,runtime: allocate defer records on the stack"" This reverts CL 180761 Reason for revert: Reinstate the stack-allocated defer CL. There was nothing wrong with the CL proper, but stack allocation of defers exposed two other issues. Issue #32477: Fix has been submitted as CL 181258. Issue #32498: Possible fix is CL 181377 (not submitted yet). Change-Id: I32b3365d5026600069291b068bbba6cb15295eb3 Reviewed-on: https://go-review.googlesource.com/c/go/+/181378 Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>	2019-06-10 16:19:39 +00:00
Keith Randall	49200e3f3e	Revert "cmd/compile,runtime: allocate defer records on the stack" This reverts commit `fff4f599fe`. Reason for revert: Seems to still have issues around GC. Fixes #32452 Change-Id: Ibe7af629f9ad6a3d5312acd7b066123f484da7f0 Reviewed-on: https://go-review.googlesource.com/c/go/+/180761 Run-TryBot: Keith Randall <khr@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Josh Bleecher Snyder <josharian@gmail.com>	2019-06-05 19:50:09 +00:00
Keith Randall	fff4f599fe	cmd/compile,runtime: allocate defer records on the stack When a defer is executed at most once in a function body, we can allocate the defer record for it on the stack instead of on the heap. This should make defers like this (which are very common) faster. This optimization applies to 363 out of the 370 static defer sites in the cmd/go binary. name old time/op new time/op delta Defer-4 52.2ns ± 5% 36.2ns ± 3% -30.70% (p=0.000 n=10+10) Fixes #6980 Update #14939 Change-Id: I697109dd7aeef9e97a9eeba2ef65ff53d3ee1004 Reviewed-on: https://go-review.googlesource.com/c/go/+/171758 Run-TryBot: Keith Randall <khr@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Austin Clements <austin@google.com>	2019-06-04 17:35:20 +00:00
fanzha02	822a9f537f	cmd/compile: fix the error of absorbing boolean tests into block(FGE, FGT) The CL 164718 mistyped the comparison flags. The rules for floating point comparison should be GreaterThanF and GreaterEqualF. Fortunately, the wrong optimizations were overwritten by other integer rules, so the issue won't cause failure but just some performance impact. The fixed CL optimizes the floating point test as follows. source code: func foo(f float64) bool { return f > 4 \|\| f < -4} previous version: "FCMPD", "CSET\tGT", "CBZ" fixed version: "FCMPD", BLE" Add the test case. Change-Id: Iea954fdbb8272b2d642dae0f816dc77286e6e1fa Reviewed-on: https://go-review.googlesource.com/c/go/+/177121 Reviewed-by: Ben Shi <powerman1st@163.com> Reviewed-by: Cherry Zhang <cherryyz@google.com> Run-TryBot: Ben Shi <powerman1st@163.com> TryBot-Result: Gobot Gobot <gobot@golang.org>	2019-05-16 13:46:25 +00:00
Michael Munday	2c1b5130aa	cmd/compile: add math/bits.{Add,Sub}64 intrinsics on s390x This CL adds intrinsics for the 64-bit addition and subtraction functions in math/bits. These intrinsics use the condition code to propagate the carry or borrow bit. To make the carry chains more efficient I've removed the 'clobberFlags' property from most of the load and store operations. Originally these ops did clobber flags when using offsets that didn't fit in a signed 20-bit integer, however that is no longer true. As with other platforms the intrinsics are faster when executed in a chain rather than a loop because currently we need to spill and restore the carry bit between each loop iteration. We may be able to reduce the need to do this on s390x (e.g. by using compare-and-branch instructions that do not clobber flags) in the future. name old time/op new time/op delta Add64 1.21ns ± 2% 2.03ns ± 2% +67.18% (p=0.000 n=7+10) Add64multiple 2.98ns ± 3% 1.03ns ± 0% -65.39% (p=0.000 n=10+9) Sub64 1.23ns ± 4% 2.03ns ± 1% +64.85% (p=0.000 n=10+10) Sub64multiple 3.73ns ± 4% 1.04ns ± 1% -72.28% (p=0.000 n=10+8) Change-Id: I913bbd5e19e6b95bef52f5bc4f14d6fe40119083 Reviewed-on: https://go-review.googlesource.com/c/go/+/174303 Run-TryBot: Michael Munday <mike.munday@ibm.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Cherry Zhang <cherryyz@google.com>	2019-05-03 10:41:15 +00:00
Lynn Boger	e30aa166ea	test: enable more memcombine tests for ppc64le This enables more of the testcases in memcombine for ppc64le, and adds more detail to some existing. Change-Id: Ic522a1175bed682b546909c96f9ea758f8db247c Reviewed-on: https://go-review.googlesource.com/c/go/+/174737 Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org> Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org>	2019-05-01 14:27:27 +00:00
Brian Kessler	4d9dd35806	cmd/compile: add signed divisibility rules "Division by invariant integers using multiplication" paper by Granlund and Montgomery contains a method for directly computing divisibility (x%c == 0 for c constant) by means of the modular inverse. The method is further elaborated in "Hacker's Delight" by Warren Section 10-17 This general rule can compute divisibilty by one multiplication, and add and a compare for odd divisors and an additional rotate for even divisors. To apply the divisibility rule, we must take into account the rules to rewrite x%c = x-((x/c)c) and (x/c) for c constant on the first optimization pass "opt". This complicates the matching as we want to match only in the cases where the result of (x/c) is not also needed. So, we must match on the expanded form of (x/c) in the expression x == c(x/c) in the "late opt" pass after common subexpresion elimination. Note, that if there is an intermediate opt pass introduced in the future we could simplify these rules by delaying the magic division rewrite to "late opt" and matching directly on (x/c) in the intermediate opt pass. On amd64, the divisibility check is 30-45% faster. name old time/op new time/op delta` DivisiblePow2constI64-4 0.83ns ± 1% 0.82ns ± 0% ~ (p=0.079 n=5+4) DivisibleconstI64-4 2.68ns ± 1% 1.87ns ± 0% -30.33% (p=0.000 n=5+4) DivisibleWDivconstI64-4 2.69ns ± 1% 2.71ns ± 3% ~ (p=1.000 n=5+5) DivisiblePow2constI32-4 1.15ns ± 1% 1.15ns ± 0% ~ (p=0.238 n=5+4) DivisibleconstI32-4 2.24ns ± 1% 1.20ns ± 0% -46.48% (p=0.016 n=5+4) DivisibleWDivconstI32-4 2.27ns ± 1% 2.27ns ± 1% ~ (p=0.683 n=5+5) DivisiblePow2constI16-4 0.81ns ± 1% 0.82ns ± 1% ~ (p=0.135 n=5+5) DivisibleconstI16-4 2.11ns ± 2% 1.20ns ± 1% -42.99% (p=0.008 n=5+5) DivisibleWDivconstI16-4 2.23ns ± 0% 2.27ns ± 2% +1.79% (p=0.029 n=4+4) DivisiblePow2constI8-4 0.81ns ± 1% 0.81ns ± 1% ~ (p=0.286 n=5+5) DivisibleconstI8-4 2.13ns ± 3% 1.19ns ± 1% -43.84% (p=0.008 n=5+5) DivisibleWDivconstI8-4 2.23ns ± 1% 2.25ns ± 1% ~ (p=0.183 n=5+5) Fixes #30282 Fixes #15806 Change-Id: Id20d78263a4fdfe0509229ae4dfa2fede83fc1d0 Reviewed-on: https://go-review.googlesource.com/c/go/+/173998 Run-TryBot: Brian Kessler <brian.m.kessler@gmail.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Keith Randall <khr@golang.org>	2019-04-30 22:02:07 +00:00
Carlos Eduardo Seo	50ad09418e	cmd/compile: intrinsify math/bits.Add64 for ppc64x This change creates an intrinsic for Add64 for ppc64x and adds a testcase for it. name old time/op new time/op delta Add64-160 1.90ns ±40% 2.29ns ± 0% ~ (p=0.119 n=5+5) Add64multiple-160 6.69ns ± 2% 2.45ns ± 4% -63.47% (p=0.016 n=4+5) Change-Id: I9abe6fb023fdf62eea3c9b46a1820f60bb0a7f97 Reviewed-on: https://go-review.googlesource.com/c/go/+/173758 Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com> Run-TryBot: Carlos Eduardo Seo <cseo@linux.vnet.ibm.com>	2019-04-28 23:51:04 +00:00
Brian Kessler	a28a942768	cmd/compile: add unsigned divisibility rules "Division by invariant integers using multiplication" paper by Granlund and Montgomery contains a method for directly computing divisibility (x%c == 0 for c constant) by means of the modular inverse. The method is further elaborated in "Hacker's Delight" by Warren Section 10-17 This general rule can compute divisibilty by one multiplication and a compare for odd divisors and an additional rotate for even divisors. To apply the divisibility rule, we must take into account the rules to rewrite x%c = x-((x/c)c) and (x/c) for c constant on the first optimization pass "opt". This complicates the matching as we want to match only in the cases where the result of (x/c) is not also available. So, we must match on the expanded form of (x/c) in the expression x == c(x/c) in the "late opt" pass after common subexpresion elimination. Note, that if there is an intermediate opt pass introduced in the future we could simplify these rules by delaying the magic division rewrite to "late opt" and matching directly on (x/c) in the intermediate opt pass. Additional rules to lower the generic RotateLeft* ops were also applied. On amd64, the divisibility check is 25-50% faster. name old time/op new time/op delta DivconstI64-4 2.08ns ± 0% 2.08ns ± 1% ~ (p=0.881 n=5+5) DivisibleconstI64-4 2.67ns ± 0% 2.67ns ± 1% ~ (p=1.000 n=5+5) DivisibleWDivconstI64-4 2.67ns ± 0% 2.67ns ± 0% ~ (p=0.683 n=5+5) DivconstU64-4 2.08ns ± 1% 2.08ns ± 1% ~ (p=1.000 n=5+5) DivisibleconstU64-4 2.77ns ± 1% 1.55ns ± 2% -43.90% (p=0.008 n=5+5) DivisibleWDivconstU64-4 2.99ns ± 1% 2.99ns ± 1% ~ (p=1.000 n=5+5) DivconstI32-4 1.53ns ± 2% 1.53ns ± 0% ~ (p=1.000 n=5+5) DivisibleconstI32-4 2.23ns ± 0% 2.25ns ± 3% ~ (p=0.167 n=5+5) DivisibleWDivconstI32-4 2.27ns ± 1% 2.27ns ± 1% ~ (p=0.429 n=5+5) DivconstU32-4 1.78ns ± 0% 1.78ns ± 1% ~ (p=1.000 n=4+5) DivisibleconstU32-4 2.52ns ± 2% 1.26ns ± 0% -49.96% (p=0.000 n=5+4) DivisibleWDivconstU32-4 2.63ns ± 0% 2.85ns ±10% +8.29% (p=0.016 n=4+5) DivconstI16-4 1.54ns ± 0% 1.54ns ± 0% ~ (p=0.333 n=4+5) DivisibleconstI16-4 2.10ns ± 0% 2.10ns ± 1% ~ (p=0.571 n=4+5) DivisibleWDivconstI16-4 2.22ns ± 0% 2.23ns ± 1% ~ (p=0.556 n=4+5) DivconstU16-4 1.09ns ± 0% 1.01ns ± 1% -7.74% (p=0.000 n=4+5) DivisibleconstU16-4 1.83ns ± 0% 1.26ns ± 0% -31.52% (p=0.008 n=5+5) DivisibleWDivconstU16-4 1.88ns ± 0% 1.89ns ± 1% ~ (p=0.365 n=5+5) DivconstI8-4 1.54ns ± 1% 1.54ns ± 1% ~ (p=1.000 n=5+5) DivisibleconstI8-4 2.10ns ± 0% 2.11ns ± 0% ~ (p=0.238 n=5+4) DivisibleWDivconstI8-4 2.22ns ± 0% 2.23ns ± 2% ~ (p=0.762 n=5+5) DivconstU8-4 0.92ns ± 1% 0.94ns ± 1% +2.65% (p=0.008 n=5+5) DivisibleconstU8-4 1.66ns ± 0% 1.26ns ± 1% -24.28% (p=0.008 n=5+5) DivisibleWDivconstU8-4 1.79ns ± 0% 1.80ns ± 1% ~ (p=0.079 n=4+5) A follow-up change will address the signed division case. Updates #30282 Change-Id: I7e995f167179aa5c76bb10fbcbeb49c520943403 Reviewed-on: https://go-review.googlesource.com/c/go/+/168037 Run-TryBot: Brian Kessler <brian.m.kessler@gmail.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Keith Randall <khr@golang.org>	2019-04-27 20:46:46 +00:00
Brian Kessler	44343c777c	cmd/compile: add signed divisibility by power of 2 rules For powers of two (c=1<<k), the divisibility check x%c == 0 can be made just by checking the trailing zeroes via a mask x&(c-1) == 0 even for signed integers. This avoids division fix-ups when just divisibility check is needed. To apply this rule, we match on the fixed-up version of the division. This is neccessary because the mod and division rewrite rules are already applied during the initial opt pass. The speed up on amd64 due to elimination of unneccessary fix-up code is ~55%: name old time/op new time/op delta DivconstI64-4 2.08ns ± 0% 2.09ns ± 1% ~ (p=0.730 n=5+5) DivisiblePow2constI64-4 1.78ns ± 1% 0.81ns ± 1% -54.66% (p=0.008 n=5+5) DivconstU64-4 2.08ns ± 0% 2.08ns ± 0% ~ (p=0.683 n=5+5) DivconstI32-4 1.53ns ± 0% 1.53ns ± 1% ~ (p=0.968 n=4+5) DivisiblePow2constI32-4 1.79ns ± 1% 0.81ns ± 1% -54.97% (p=0.008 n=5+5) DivconstU32-4 1.78ns ± 1% 1.80ns ± 2% ~ (p=0.206 n=5+5) DivconstI16-4 1.54ns ± 2% 1.54ns ± 0% ~ (p=0.238 n=5+4) DivisiblePow2constI16-4 1.78ns ± 0% 0.81ns ± 1% -54.72% (p=0.000 n=4+5) DivconstU16-4 1.00ns ± 5% 1.01ns ± 1% ~ (p=0.119 n=5+5) DivconstI8-4 1.54ns ± 0% 1.54ns ± 2% ~ (p=0.571 n=4+5) DivisiblePow2constI8-4 1.78ns ± 0% 0.82ns ± 8% -53.71% (p=0.008 n=5+5) DivconstU8-4 0.93ns ± 1% 0.93ns ± 1% ~ (p=0.643 n=5+5) A follow-up CL will address the general case of x%c == 0 for signed integers. Updates #15806 Change-Id: Iabadbbe369b6e0998c8ce85d038ebc236142e42a Reviewed-on: https://go-review.googlesource.com/c/go/+/173557 Run-TryBot: Brian Kessler <brian.m.kessler@gmail.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Keith Randall <khr@golang.org>	2019-04-25 03:00:32 +00:00
Keith Randall	17615969b6	Revert "cmd/compile: add signed divisibility by power of 2 rules" This reverts CL 168038 (git `68819fb6d2`) Reason for revert: Doesn't work on 32 bit archs. Change-Id: Idec9098060dc65bc2f774c5383f0477f8eb63a3d Reviewed-on: https://go-review.googlesource.com/c/go/+/173442 Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>	2019-04-23 21:23:18 +00:00
Brian Kessler	68819fb6d2	cmd/compile: add signed divisibility by power of 2 rules For powers of two (c=1<<k), the divisibility check x%c == 0 can be made just by checking the trailing zeroes via a mask x&(c-1)==0 even for signed integers. This avoids division fixups when just divisibility check is needed. To apply this rule the generic divisibility rule for A%B = A-(A/B*B) is disabled on the "opt" pass, but this does not affect generated code as this rule is applied later. The speed up on amd64 due to elimination of unneccessary fixup code is ~55%: name old time/op new time/op delta DivconstI64-4 2.08ns ± 0% 2.07ns ± 0% ~ (p=0.079 n=5+5) DivisiblePow2constI64-4 1.78ns ± 1% 0.81ns ± 1% -54.55% (p=0.008 n=5+5) DivconstU64-4 2.08ns ± 0% 2.08ns ± 0% ~ (p=1.000 n=5+5) DivconstI32-4 1.53ns ± 0% 1.53ns ± 0% ~ (all equal) DivisiblePow2constI32-4 1.79ns ± 1% 0.81ns ± 4% -54.75% (p=0.008 n=5+5) DivconstU32-4 1.78ns ± 1% 1.78ns ± 1% ~ (p=1.000 n=5+5) DivconstI16-4 1.54ns ± 2% 1.53ns ± 0% ~ (p=0.333 n=5+4) DivisiblePow2constI16-4 1.78ns ± 0% 0.79ns ± 1% -55.39% (p=0.000 n=4+5) DivconstU16-4 1.00ns ± 5% 0.99ns ± 1% ~ (p=0.730 n=5+5) DivconstI8-4 1.54ns ± 0% 1.53ns ± 0% ~ (p=0.714 n=4+5) DivisiblePow2constI8-4 1.78ns ± 0% 0.80ns ± 0% -55.06% (p=0.000 n=5+4) DivconstU8-4 0.93ns ± 1% 0.95ns ± 1% +1.72% (p=0.024 n=5+5) A follow-up CL will address the general case of x%c == 0 for signed integers. Updates #15806 Change-Id: I0d284863774b1bc8c4ce87443bbaec6103e14ef4 Reviewed-on: https://go-review.googlesource.com/c/go/+/168038 Reviewed-by: Keith Randall <khr@golang.org>	2019-04-23 20:35:54 +00:00
Keith Randall	fd788a86b6	cmd/compile: always mark atColumn1 results as statements In 31618, we end up comparing the is-stmt-ness of positions to repurpose real instructions as inline marks. If the is-stmt-ness doesn't match, we end up not being able to remove the inline mark. Always use statement-full positions to do the matching, so we always find a match if there is one. Also always use positions that are statements for inline marks. Fixes #31618 Change-Id: Idaf39bdb32fa45238d5cd52973cadf4504f947d5 Reviewed-on: https://go-review.googlesource.com/c/go/+/173324 Run-TryBot: Keith Randall <khr@golang.org> Reviewed-by: David Chase <drchase@google.com>	2019-04-23 17:39:11 +00:00
erifan01	f8f265b9cf	cmd/compile: intrinsify math/bits.Sub64 for arm64 This CL instrinsifies Sub64 with arm64 instruction sequence NEGS, SBCS, NGC and NEG, and optimzes the case of borrowing chains. Benchmarks: name old time/op new time/op delta Sub-64 2.500000ns +- 0% 2.048000ns +- 1% -18.08% (p=0.000 n=10+10) Sub32-64 2.500000ns +- 0% 2.500000ns +- 0% ~ (all equal) Sub64-64 2.500000ns +- 0% 2.080000ns +- 0% -16.80% (p=0.000 n=10+7) Sub64multiple-64 7.090000ns +- 0% 2.090000ns +- 0% -70.52% (p=0.000 n=10+10) Change-Id: I3d2664e009a9635e13b55d2c4567c7b34c2c0655 Reviewed-on: https://go-review.googlesource.com/c/go/+/159018 Reviewed-by: Cherry Zhang <cherryyz@google.com> Run-TryBot: Cherry Zhang <cherryyz@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org>	2019-04-22 14:40:20 +00:00

1 2 3 4 5

203 Commits