qbit/go - go - Tape:neT

qbit/go

mirror of https://github.com/golang/go synced 2024-11-19 13:44:52 -07:00

Author	SHA1	Message	Date
Carlos Eduardo Seo	a44c72823c	math/big: improve performance of addVW/subVW for ppc64x This change adds a better implementation in asm for addVW/subVW for ppc64x, with speedups up to 3.11x. benchmark old ns/op new ns/op delta BenchmarkAddVW/1-16 6.87 5.71 -16.89% BenchmarkAddVW/2-16 7.72 5.94 -23.06% BenchmarkAddVW/3-16 8.74 6.56 -24.94% BenchmarkAddVW/4-16 9.66 7.26 -24.84% BenchmarkAddVW/5-16 10.8 7.26 -32.78% BenchmarkAddVW/10-16 17.4 9.97 -42.70% BenchmarkAddVW/100-16 164 56.0 -65.85% BenchmarkAddVW/1000-16 1638 524 -68.01% BenchmarkAddVW/10000-16 16421 5201 -68.33% BenchmarkAddVW/100000-16 165762 53324 -67.83% BenchmarkSubVW/1-16 6.76 5.62 -16.86% BenchmarkSubVW/2-16 7.69 6.02 -21.72% BenchmarkSubVW/3-16 8.85 6.61 -25.31% BenchmarkSubVW/4-16 10.0 7.34 -26.60% BenchmarkSubVW/5-16 11.3 7.33 -35.13% BenchmarkSubVW/10-16 19.5 18.7 -4.10% BenchmarkSubVW/100-16 153 55.9 -63.46% BenchmarkSubVW/1000-16 1502 519 -65.45% BenchmarkSubVW/10000-16 15005 5165 -65.58% BenchmarkSubVW/100000-16 150620 53124 -64.73% benchmark old MB/s new MB/s speedup BenchmarkAddVW/1-16 1165.12 1400.76 1.20x BenchmarkAddVW/2-16 2071.39 2693.25 1.30x BenchmarkAddVW/3-16 2744.72 3656.92 1.33x BenchmarkAddVW/4-16 3311.63 4407.34 1.33x BenchmarkAddVW/5-16 3700.52 5512.48 1.49x BenchmarkAddVW/10-16 4605.63 8026.37 1.74x BenchmarkAddVW/100-16 4856.15 14296.76 2.94x BenchmarkAddVW/1000-16 4883.96 15264.21 3.13x BenchmarkAddVW/10000-16 4871.52 15380.78 3.16x BenchmarkAddVW/100000-16 4826.17 15002.48 3.11x BenchmarkSubVW/1-16 1183.20 1423.03 1.20x BenchmarkSubVW/2-16 2081.92 2657.44 1.28x BenchmarkSubVW/3-16 2711.52 3632.30 1.34x BenchmarkSubVW/4-16 3198.30 4360.30 1.36x BenchmarkSubVW/5-16 3534.43 5460.40 1.54x BenchmarkSubVW/10-16 4106.34 4273.51 1.04x BenchmarkSubVW/100-16 5213.48 14306.32 2.74x BenchmarkSubVW/1000-16 5324.27 15391.21 2.89x BenchmarkSubVW/10000-16 5331.33 15486.57 2.90x BenchmarkSubVW/100000-16 5311.35 15059.01 2.84x Change-Id: Ibaa5b9b38d63fba8e01a9c327eb8bef1e6e908c1 Reviewed-on: https://go-review.googlesource.com/101975 Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>	2018-03-27 15:06:53 +00:00
Vlad Krasnov	26d74e8b65	math/big: reduce amount of copying in Montgomery multiplication Instead shifting the accumulator every iteration of the loop, shift once in the end. This significantly improves performance on arm64. On arm64: name old time/op new time/op delta RSA2048Decrypt 3.33ms ± 0% 2.63ms ± 0% -20.94% (p=0.000 n=11+11) RSA2048Sign 4.22ms ± 0% 3.55ms ± 0% -15.89% (p=0.000 n=11+11) 3PrimeRSA2048Decrypt 1.95ms ± 0% 1.59ms ± 0% -18.59% (p=0.000 n=11+11) On Skylake: name old time/op new time/op delta RSA2048Decrypt-8 1.73ms ± 2% 1.55ms ± 2% -10.19% (p=0.000 n=10+10) RSA2048Sign-8 2.17ms ± 2% 2.00ms ± 2% -7.93% (p=0.000 n=10+10) 3PrimeRSA2048Decrypt-8 1.10ms ± 2% 0.96ms ± 2% -13.03% (p=0.000 n=10+9) Change-Id: I5786191a1a09e4217fdb1acfd90880d35c5855f7 Reviewed-on: https://go-review.googlesource.com/99838 Run-TryBot: Vlad Krasnov <vlad@cloudflare.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Adam Langley <agl@golang.org> Reviewed-by: Robert Griesemer <gri@golang.org>	2018-03-19 21:40:56 +00:00
Alberto Donizetti	168cc7ff9c	math/big: add 0 shift fastpath to shl and shr One could expect calls like z.mant.shl(z.mant, shiftAmount) (or higher-level-functions calls that use lhs/rhs) to be almost free when shiftAmount = 0; and expect calls like z.mant.shl(x.mant, 0) to have the same cost of a x.mant -> z.mant copy. Neither of this things are currently true. For an 800 words nat, the first kind of calls cost ~800ns for rigth shifts and ~3.5µs for left shift; while the second kind of calls are doing more work than necessary by calling shlVU/shrVU. This change makes the first kind of calls ({Shl,Shr}Same) almost free, and the second kind of calls ({Shl,Shr}) about 30% faster. name old time/op new time/op delta ZeroShifts/Shl-4 3.64µs ± 3% 2.49µs ± 1% -31.55% (p=0.000 n=10+10) ZeroShifts/ShlSame-4 3.65µs ± 1% 0.01µs ± 1% -99.85% (p=0.000 n=9+9) ZeroShifts/Shr-4 3.65µs ± 1% 2.49µs ± 1% -31.91% (p=0.000 n=10+10) ZeroShifts/ShrSame-4 825ns ± 0% 6ns ± 1% -99.33% (p=0.000 n=9+10) During go test math/big, the shl zeroshift fastpath is triggered 1380 times; while the shr fastpath is triggered 153334 times(!). Change-Id: I5f92b304a40638bd8453a86c87c58e54b337bcdf Reviewed-on: https://go-review.googlesource.com/87660 Run-TryBot: Alberto Donizetti <alb.donizetti@gmail.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Robert Griesemer <gri@golang.org>	2018-03-19 18:01:37 +00:00
Robert Griesemer	7d4d2cb686	math/big: add comment about internal assumptions on nat values Change-Id: I7ed40507a019c0bf521ba748fc22c03d74bb17b7 Reviewed-on: https://go-review.googlesource.com/100719 Reviewed-by: Ian Lance Taylor <iant@golang.org>	2018-03-14 18:31:05 +00:00
erifan01	140bfe9cfe	math/big: optimize shlVU and shrVU on arm64 This CL implements shlVU and shrVU with arm64 HW instructions "LDP" and "STP" to reduce load cost, it also removes unnecessary checks on the number of shifts for better performance. Benchmarks: name old time/op new time/op delta AddVV/1-8 21.6ns ± 1% 21.6ns ± 1% ~ (p=0.683 n=5+5) AddVV/2-8 13.5ns ± 0% 13.5ns ± 0% ~ (all equal) AddVV/3-8 15.5ns ± 0% 15.5ns ± 0% ~ (all equal) AddVV/4-8 17.5ns ± 0% 17.5ns ± 0% ~ (all equal) AddVV/5-8 19.5ns ± 0% 19.5ns ± 0% ~ (all equal) AddVV/10-8 29.5ns ± 0% 29.5ns ± 0% ~ (all equal) AddVV/100-8 217ns ± 0% 217ns ± 0% ~ (all equal) AddVV/1000-8 2.02µs ± 0% 2.03µs ± 0% +0.73% (p=0.008 n=5+5) AddVV/10000-8 20.5µs ± 0% 20.5µs ± 0% -0.01% (p=0.008 n=5+5) AddVV/100000-8 246µs ± 5% 250µs ± 4% ~ (p=0.548 n=5+5) AddVW/1-8 9.26ns ± 0% 9.32ns ± 0% +0.65% (p=0.016 n=4+5) AddVW/2-8 19.8ns ± 3% 19.8ns ± 0% ~ (p=0.143 n=5+5) AddVW/3-8 11.5ns ± 0% 11.5ns ± 0% ~ (all equal) AddVW/4-8 13.0ns ± 0% 13.0ns ± 0% ~ (all equal) AddVW/5-8 14.5ns ± 0% 14.5ns ± 0% ~ (all equal) AddVW/10-8 22.0ns ± 0% 22.0ns ± 0% ~ (all equal) AddVW/100-8 167ns ± 0% 166ns ± 0% -0.60% (p=0.000 n=5+4) AddVW/1000-8 1.52µs ± 0% 1.52µs ± 0% ~ (all equal) AddVW/10000-8 15.1µs ± 0% 15.1µs ± 0% +0.01% (p=0.008 n=5+5) AddVW/100000-8 163µs ± 4% 153µs ± 3% -5.97% (p=0.016 n=5+5) AddMulVVW/1-8 32.4ns ± 1% 33.0ns ± 1% +1.73% (p=0.040 n=5+5) AddMulVVW/2-8 56.4ns ± 2% 55.9ns ± 1% ~ (p=0.135 n=5+5) AddMulVVW/3-8 85.4ns ± 1% 85.1ns ± 0% ~ (p=0.079 n=5+5) AddMulVVW/4-8 129ns ± 1% 129ns ± 0% ~ (p=0.397 n=5+5) AddMulVVW/5-8 148ns ± 0% 148ns ± 0% ~ (all equal) AddMulVVW/10-8 270ns ± 0% 268ns ± 0% -0.74% (p=0.029 n=4+4) AddMulVVW/100-8 2.75µs ± 0% 2.75µs ± 0% -0.09% (p=0.008 n=5+5) AddMulVVW/1000-8 26.0µs ± 0% 26.0µs ± 0% -0.06% (p=0.024 n=5+5) AddMulVVW/10000-8 312µs ± 0% 312µs ± 0% -0.09% (p=0.008 n=5+5) AddMulVVW/100000-8 2.89ms ± 0% 2.89ms ± 0% +0.14% (p=0.016 n=5+5) DecimalConversion-8 315µs ± 1% 312µs ± 0% ~ (p=0.095 n=5+5) FloatString/100-8 2.56µs ± 1% 2.52µs ± 1% -1.31% (p=0.016 n=5+5) FloatString/1000-8 58.6µs ± 0% 58.2µs ± 0% -0.75% (p=0.008 n=5+5) FloatString/10000-8 4.59ms ± 0% 4.59ms ± 0% ~ (p=0.056 n=5+5) FloatString/100000-8 446ms ± 0% 446ms ± 0% -0.04% (p=0.008 n=5+5) FloatAdd/10-8 184ns ± 0% 178ns ± 0% -3.48% (p=0.008 n=5+5) FloatAdd/100-8 189ns ± 3% 178ns ± 2% -6.02% (p=0.008 n=5+5) FloatAdd/1000-8 371ns ± 0% 267ns ± 0% -27.99% (p=0.000 n=5+4) FloatAdd/10000-8 1.87µs ± 0% 1.03µs ± 0% -44.74% (p=0.008 n=5+5) FloatAdd/100000-8 17.1µs ± 0% 8.8µs ± 0% -48.71% (p=0.016 n=5+4) FloatSub/10-8 148ns ± 0% 146ns ± 0% -1.35% (p=0.000 n=5+4) FloatSub/100-8 148ns ± 0% 140ns ± 0% -5.41% (p=0.008 n=5+5) FloatSub/1000-8 242ns ± 0% 191ns ± 0% -21.24% (p=0.008 n=5+5) FloatSub/10000-8 1.07µs ± 0% 0.64µs ± 1% -39.89% (p=0.016 n=4+5) FloatSub/100000-8 9.48µs ± 0% 5.32µs ± 0% -43.87% (p=0.008 n=5+5) ParseFloatSmallExp-8 29.3µs ± 3% 28.6µs ± 1% ~ (p=0.310 n=5+5) ParseFloatLargeExp-8 125µs ± 1% 123µs ± 0% -1.99% (p=0.008 n=5+5) GCD10x10/WithoutXY-8 278ns ± 4% 289ns ± 5% +3.96% (p=0.040 n=5+5) GCD10x10/WithXY-8 2.12µs ± 2% 2.15µs ± 2% ~ (p=0.095 n=5+5) GCD10x100/WithoutXY-8 615ns ± 1% 629ns ± 5% ~ (p=0.135 n=5+5) GCD10x100/WithXY-8 3.42µs ± 1% 3.53µs ± 2% +3.38% (p=0.008 n=5+5) GCD10x1000/WithoutXY-8 1.39µs ± 1% 1.38µs ± 1% ~ (p=0.460 n=5+5) GCD10x1000/WithXY-8 7.47µs ± 2% 7.49µs ± 3% ~ (p=1.000 n=5+5) GCD10x10000/WithoutXY-8 8.71µs ± 1% 8.71µs ± 0% ~ (p=0.841 n=5+5) GCD10x10000/WithXY-8 28.4µs ± 2% 27.2µs ± 2% -4.24% (p=0.008 n=5+5) GCD10x100000/WithoutXY-8 78.9µs ± 1% 79.1µs ± 0% ~ (p=0.222 n=5+5) GCD10x100000/WithXY-8 240µs ± 1% 228µs ± 1% -4.98% (p=0.008 n=5+5) GCD100x100/WithoutXY-8 1.87µs ± 2% 1.89µs ± 1% ~ (p=0.095 n=5+5) GCD100x100/WithXY-8 26.6µs ± 1% 26.3µs ± 0% -1.14% (p=0.032 n=5+5) GCD100x1000/WithoutXY-8 4.44µs ± 2% 4.47µs ± 2% ~ (p=0.444 n=5+5) GCD100x1000/WithXY-8 36.7µs ± 1% 36.0µs ± 1% -1.96% (p=0.008 n=5+5) GCD100x10000/WithoutXY-8 22.8µs ± 1% 22.3µs ± 1% -2.52% (p=0.008 n=5+5) GCD100x10000/WithXY-8 145µs ± 1% 142µs ± 0% -1.88% (p=0.008 n=5+5) GCD100x100000/WithoutXY-8 198µs ± 1% 190µs ± 0% -4.06% (p=0.008 n=5+5) GCD100x100000/WithXY-8 1.11ms ± 0% 1.09ms ± 0% -1.87% (p=0.008 n=5+5) GCD1000x1000/WithoutXY-8 25.4µs ± 1% 25.0µs ± 1% -1.34% (p=0.008 n=5+5) GCD1000x1000/WithXY-8 515µs ± 1% 485µs ± 0% -5.85% (p=0.008 n=5+5) GCD1000x10000/WithoutXY-8 57.3µs ± 1% 56.2µs ± 1% -1.95% (p=0.008 n=5+5) GCD1000x10000/WithXY-8 1.21ms ± 0% 1.18ms ± 0% -2.65% (p=0.008 n=5+5) GCD1000x100000/WithoutXY-8 358µs ± 0% 352µs ± 1% -1.71% (p=0.008 n=5+5) GCD1000x100000/WithXY-8 8.72ms ± 0% 8.66ms ± 0% -0.71% (p=0.008 n=5+5) GCD10000x10000/WithoutXY-8 690µs ± 0% 687µs ± 1% ~ (p=0.095 n=5+5) GCD10000x10000/WithXY-8 16.0ms ± 0% 12.5ms ± 0% -22.01% (p=0.008 n=5+5) GCD10000x100000/WithoutXY-8 2.09ms ± 0% 2.07ms ± 0% -0.58% (p=0.008 n=5+5) GCD10000x100000/WithXY-8 86.8ms ± 0% 83.4ms ± 0% -3.95% (p=0.008 n=5+5) GCD100000x100000/WithoutXY-8 51.2ms ± 0% 51.2ms ± 0% ~ (p=0.548 n=5+5) GCD100000x100000/WithXY-8 1.25s ± 0% 0.89s ± 0% -28.98% (p=0.008 n=5+5) Hilbert-8 2.46ms ± 2% 2.53ms ± 1% +2.89% (p=0.032 n=5+5) Binomial-8 5.15µs ± 4% 4.92µs ± 1% -4.43% (p=0.032 n=5+5) QuoRem-8 7.10µs ± 0% 7.05µs ± 0% -0.59% (p=0.008 n=5+5) Exp-8 161ms ± 0% 161ms ± 0% -0.24% (p=0.008 n=5+5) Exp2-8 161ms ± 0% 161ms ± 0% -0.30% (p=0.016 n=4+5) Bitset-8 40.4ns ± 0% 40.3ns ± 0% ~ (p=0.159 n=5+5) BitsetNeg-8 158ns ± 4% 155ns ± 2% ~ (p=0.183 n=5+5) BitsetOrig-8 374ns ± 0% 383ns ± 1% +2.35% (p=0.008 n=5+5) BitsetNegOrig-8 620ns ± 1% 663ns ± 2% +7.00% (p=0.008 n=5+5) ModSqrt225_Tonelli-8 7.26ms ± 0% 7.27ms ± 0% ~ (p=0.841 n=5+5) ModSqrt224_3Mod4-8 2.24ms ± 0% 2.24ms ± 0% ~ (p=0.690 n=5+5) ModSqrt5430_Tonelli-8 62.3s ± 0% 62.4s ± 0% +0.15% (p=0.008 n=5+5) ModSqrt5430_3Mod4-8 20.8s ± 0% 20.8s ± 0% ~ (p=0.151 n=5+5) Sqrt-8 101µs ± 0% 97µs ± 0% -3.99% (p=0.008 n=5+5) IntSqr/1-8 32.7ns ± 1% 32.5ns ± 1% ~ (p=0.325 n=5+5) IntSqr/2-8 161ns ± 4% 160ns ± 4% ~ (p=0.659 n=5+5) IntSqr/3-8 296ns ± 7% 297ns ± 6% ~ (p=0.841 n=5+5) IntSqr/5-8 752ns ± 7% 755ns ± 6% ~ (p=0.889 n=5+5) IntSqr/8-8 1.91µs ± 3% 1.90µs ± 3% ~ (p=0.746 n=5+5) IntSqr/10-8 2.99µs ± 4% 3.00µs ± 4% ~ (p=0.516 n=5+5) IntSqr/20-8 6.29µs ± 2% 6.19µs ± 2% ~ (p=0.151 n=5+5) IntSqr/30-8 14.0µs ± 1% 13.8µs ± 2% ~ (p=0.056 n=5+5) IntSqr/50-8 38.1µs ± 3% 37.9µs ± 3% ~ (p=0.548 n=5+5) IntSqr/80-8 95.1µs ± 1% 94.7µs ± 1% ~ (p=0.310 n=5+5) IntSqr/100-8 148µs ± 1% 148µs ± 1% ~ (p=0.548 n=5+5) IntSqr/200-8 587µs ± 1% 587µs ± 1% ~ (p=1.000 n=5+5) IntSqr/300-8 1.31ms ± 1% 1.32ms ± 1% ~ (p=0.151 n=5+5) IntSqr/500-8 2.48ms ± 0% 2.49ms ± 0% ~ (p=0.310 n=5+5) IntSqr/800-8 4.68ms ± 0% 4.67ms ± 0% ~ (p=0.548 n=5+5) IntSqr/1000-8 7.57ms ± 0% 7.56ms ± 0% ~ (p=0.421 n=5+5) Mul-8 311ms ± 0% 311ms ± 0% ~ (p=0.151 n=5+5) Exp3Power/0x10-8 584ns ± 2% 573ns ± 1% ~ (p=0.190 n=5+5) Exp3Power/0x40-8 646ns ± 2% 649ns ± 1% ~ (p=0.690 n=5+5) Exp3Power/0x100-8 1.42µs ± 2% 1.45µs ± 1% +2.03% (p=0.032 n=5+5) Exp3Power/0x400-8 8.28µs ± 1% 8.39µs ± 0% +1.33% (p=0.008 n=5+5) Exp3Power/0x1000-8 60.1µs ± 0% 59.8µs ± 0% -0.44% (p=0.008 n=5+5) Exp3Power/0x4000-8 818µs ± 0% 816µs ± 0% -0.23% (p=0.008 n=5+5) Exp3Power/0x10000-8 7.79ms ± 0% 7.78ms ± 0% ~ (p=0.690 n=5+5) Exp3Power/0x40000-8 73.4ms ± 0% 73.3ms ± 0% ~ (p=0.151 n=5+5) Exp3Power/0x100000-8 665ms ± 0% 664ms ± 0% -0.16% (p=0.016 n=4+5) Exp3Power/0x400000-8 5.99s ± 0% 5.97s ± 0% -0.24% (p=0.008 n=5+5) Fibo-8 116ms ± 0% 117ms ± 0% +0.42% (p=0.008 n=5+5) NatSqr/1-8 113ns ± 2% 112ns ± 1% ~ (p=0.190 n=5+5) NatSqr/2-8 249ns ± 2% 250ns ± 2% ~ (p=0.365 n=5+5) NatSqr/3-8 379ns ± 1% 381ns ± 2% ~ (p=0.127 n=5+5) NatSqr/5-8 838ns ± 3% 841ns ± 5% ~ (p=0.754 n=5+5) NatSqr/8-8 1.97µs ± 3% 1.97µs ± 4% ~ (p=1.000 n=5+5) NatSqr/10-8 3.04µs ± 4% 3.04µs ± 4% ~ (p=1.000 n=5+5) NatSqr/20-8 6.49µs ± 3% 6.50µs ± 2% ~ (p=0.841 n=5+5) NatSqr/30-8 14.3µs ± 2% 14.2µs ± 2% ~ (p=0.548 n=5+5) NatSqr/50-8 38.5µs ± 3% 38.3µs ± 3% ~ (p=0.421 n=5+5) NatSqr/80-8 96.3µs ± 1% 96.1µs ± 1% ~ (p=0.421 n=5+5) NatSqr/100-8 149µs ± 1% 148µs ± 1% ~ (p=0.310 n=5+5) NatSqr/200-8 591µs ± 1% 592µs ± 1% ~ (p=0.690 n=5+5) NatSqr/300-8 1.31ms ± 1% 1.32ms ± 0% ~ (p=0.190 n=5+4) NatSqr/500-8 2.49ms ± 0% 2.49ms ± 0% ~ (p=0.095 n=5+5) NatSqr/800-8 4.70ms ± 0% 4.69ms ± 0% ~ (p=0.222 n=5+5) NatSqr/1000-8 7.60ms ± 0% 7.58ms ± 0% ~ (p=0.222 n=5+5) ScanPi-8 326µs ± 0% 327µs ± 1% ~ (p=0.222 n=5+5) StringPiParallel-8 71.4µs ± 5% 67.7µs ± 4% ~ (p=0.095 n=5+5) Scan/10/Base2-8 1.09µs ± 0% 1.10µs ± 1% ~ (p=0.810 n=5+5) Scan/100/Base2-8 7.79µs ± 0% 7.83µs ± 0% +0.53% (p=0.008 n=5+5) Scan/1000/Base2-8 78.9µs ± 0% 79.0µs ± 0% ~ (p=0.151 n=5+5) Scan/10000/Base2-8 1.22ms ± 0% 1.23ms ± 1% ~ (p=0.690 n=5+5) Scan/100000/Base2-8 55.1ms ± 0% 55.1ms ± 0% +0.10% (p=0.008 n=5+5) Scan/10/Base8-8 512ns ± 1% 534ns ± 1% +4.34% (p=0.008 n=5+5) Scan/100/Base8-8 2.90µs ± 1% 2.92µs ± 0% +0.67% (p=0.024 n=5+5) Scan/1000/Base8-8 31.0µs ± 0% 31.1µs ± 0% +0.27% (p=0.008 n=5+5) Scan/10000/Base8-8 741µs ± 0% 744µs ± 1% ~ (p=0.310 n=5+5) Scan/100000/Base8-8 50.5ms ± 0% 50.7ms ± 0% +0.23% (p=0.016 n=5+4) Scan/10/Base10-8 485ns ± 0% 510ns ± 1% +5.15% (p=0.008 n=5+5) Scan/100/Base10-8 2.68µs ± 0% 2.70µs ± 0% +0.84% (p=0.008 n=5+5) Scan/1000/Base10-8 28.7µs ± 0% 28.8µs ± 0% +0.34% (p=0.008 n=5+5) Scan/10000/Base10-8 717µs ± 0% 720µs ± 1% ~ (p=0.238 n=5+5) Scan/100000/Base10-8 50.3ms ± 0% 50.3ms ± 0% +0.02% (p=0.016 n=4+5) Scan/10/Base16-8 439ns ± 0% 461ns ± 1% +5.06% (p=0.008 n=5+5) Scan/100/Base16-8 2.48µs ± 0% 2.49µs ± 0% +0.59% (p=0.024 n=5+5) Scan/1000/Base16-8 27.2µs ± 0% 27.3µs ± 0% ~ (p=0.063 n=5+5) Scan/10000/Base16-8 722µs ± 0% 725µs ± 1% ~ (p=0.421 n=5+5) Scan/100000/Base16-8 52.7ms ± 0% 52.7ms ± 0% ~ (p=0.686 n=4+4) String/10/Base2-8 248ns ± 1% 248ns ± 1% ~ (p=0.802 n=5+5) String/100/Base2-8 1.51µs ± 0% 1.51µs ± 0% -0.54% (p=0.024 n=5+5) String/1000/Base2-8 13.6µs ± 0% 13.6µs ± 0% ~ (p=0.548 n=5+5) String/10000/Base2-8 135µs ± 1% 135µs ± 2% ~ (p=0.421 n=5+5) String/100000/Base2-8 1.32ms ± 1% 1.33ms ± 1% ~ (p=0.310 n=5+5) String/10/Base8-8 169ns ± 0% 170ns ± 0% ~ (p=0.079 n=5+5) String/100/Base8-8 635ns ± 1% 633ns ± 1% ~ (p=0.595 n=5+5) String/1000/Base8-8 5.33µs ± 0% 5.30µs ± 0% ~ (p=0.063 n=5+5) String/10000/Base8-8 50.7µs ± 1% 50.7µs ± 1% ~ (p=1.000 n=5+5) String/100000/Base8-8 499µs ± 1% 500µs ± 1% ~ (p=1.000 n=5+5) String/10/Base10-8 517ns ± 1% 512ns ± 1% -1.01% (p=0.032 n=5+5) String/100/Base10-8 1.97µs ± 0% 2.01µs ± 1% +2.13% (p=0.008 n=5+5) String/1000/Base10-8 12.6µs ± 1% 12.1µs ± 1% -4.16% (p=0.008 n=5+5) String/10000/Base10-8 57.9µs ± 1% 54.8µs ± 1% -5.46% (p=0.008 n=5+5) String/100000/Base10-8 25.6ms ± 0% 25.6ms ± 0% -0.12% (p=0.008 n=5+5) String/10/Base16-8 149ns ± 0% 149ns ± 1% ~ (p=1.000 n=5+5) String/100/Base16-8 514ns ± 0% 514ns ± 1% ~ (p=0.825 n=5+5) String/1000/Base16-8 4.01µs ± 0% 4.01µs ± 0% ~ (p=0.595 n=5+5) String/10000/Base16-8 37.7µs ± 0% 37.8µs ± 1% ~ (p=0.222 n=5+5) String/100000/Base16-8 373µs ± 1% 372µs ± 0% ~ (p=1.000 n=5+5) LeafSize/0-8 6.64ms ± 0% 6.66ms ± 0% +0.32% (p=0.008 n=5+5) LeafSize/1-8 74.0µs ± 1% 71.2µs ± 1% -3.75% (p=0.008 n=5+5) LeafSize/2-8 74.1µs ± 0% 70.7µs ± 1% -4.53% (p=0.008 n=5+5) LeafSize/3-8 379µs ± 0% 374µs ± 0% -1.25% (p=0.008 n=5+5) LeafSize/4-8 72.7µs ± 0% 69.2µs ± 0% -4.79% (p=0.008 n=5+5) LeafSize/5-8 471µs ± 0% 466µs ± 0% -1.05% (p=0.008 n=5+5) LeafSize/6-8 377µs ± 0% 373µs ± 0% -1.16% (p=0.008 n=5+5) LeafSize/7-8 245µs ± 0% 241µs ± 0% -1.65% (p=0.008 n=5+5) LeafSize/8-8 73.1µs ± 0% 69.4µs ± 0% -5.10% (p=0.008 n=5+5) LeafSize/9-8 538µs ± 0% 532µs ± 0% -1.01% (p=0.008 n=5+5) LeafSize/10-8 472µs ± 0% 467µs ± 0% -1.07% (p=0.008 n=5+5) LeafSize/11-8 460µs ± 0% 454µs ± 0% -1.22% (p=0.008 n=5+5) LeafSize/12-8 378µs ± 0% 373µs ± 0% -1.34% (p=0.008 n=5+5) LeafSize/13-8 344µs ± 0% 338µs ± 0% -1.61% (p=0.008 n=5+5) LeafSize/14-8 247µs ± 0% 243µs ± 0% -1.62% (p=0.008 n=5+5) LeafSize/15-8 169µs ± 0% 165µs ± 0% -2.71% (p=0.008 n=5+5) LeafSize/16-8 73.3µs ± 1% 69.5µs ± 0% -5.11% (p=0.008 n=5+5) LeafSize/32-8 82.7µs ± 0% 79.2µs ± 0% -4.24% (p=0.008 n=5+5) LeafSize/64-8 135µs ± 0% 132µs ± 0% -2.20% (p=0.008 n=5+5) ProbablyPrime/n=0-8 44.2ms ± 0% 43.9ms ± 0% -0.69% (p=0.008 n=5+5) ProbablyPrime/n=1-8 64.8ms ± 0% 64.4ms ± 0% -0.60% (p=0.008 n=5+5) ProbablyPrime/n=5-8 147ms ± 0% 147ms ± 0% -0.34% (p=0.008 n=5+5) ProbablyPrime/n=10-8 250ms ± 0% 249ms ± 0% -0.29% (p=0.008 n=5+5) ProbablyPrime/n=20-8 456ms ± 0% 455ms ± 0% -0.29% (p=0.008 n=5+5) ProbablyPrime/Lucas-8 23.6ms ± 0% 23.2ms ± 0% -1.44% (p=0.008 n=5+5) ProbablyPrime/MillerRabinBase2-8 20.6ms ± 0% 20.6ms ± 0% -0.31% (p=0.008 n=5+5) FloatSqrt/64-8 2.27µs ± 1% 2.11µs ± 1% -7.02% (p=0.008 n=5+5) FloatSqrt/128-8 4.93µs ± 1% 4.40µs ± 1% -10.73% (p=0.008 n=5+5) FloatSqrt/256-8 13.6µs ± 0% 6.6µs ± 1% -51.40% (p=0.008 n=5+5) FloatSqrt/1000-8 69.8µs ± 0% 31.2µs ± 0% -55.27% (p=0.008 n=5+5) FloatSqrt/10000-8 1.91ms ± 0% 0.59ms ± 0% -69.17% (p=0.008 n=5+5) FloatSqrt/100000-8 55.4ms ± 0% 17.8ms ± 0% -67.79% (p=0.008 n=5+5) FloatSqrt/1000000-8 4.56s ± 0% 1.52s ± 0% -66.59% (p=0.008 n=5+5) Change-Id: Icce52c69668f564490c69b908338b21a2288e116 Reviewed-on: https://go-review.googlesource.com/79355 Reviewed-by: Cherry Zhang <cherryyz@google.com> Run-TryBot: Cherry Zhang <cherryyz@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org>	2018-03-12 14:42:11 +00:00
Alberto Donizetti	010579c237	math/big: allocate less in Float.Sqrt The Newton sqrtInverse procedure we use to compute Float.Sqrt should not allocate a number of times proportional to the number of Newton iterations we need to reach the desired precision. At the beginning the function the target precision is known, so even if we do want to perform the early steps at low precisions (to save time), it's still possible to pre-allocate larger backing arrays, both for the temp variables in the loop and the variable that'll hold the final result. There's one complication. At the following line: u.Sub(three, u) the Sub method will allocate, because the receiver aliases one of the arguments, and the large backing array we initially allocated for u will be replaced by a smaller one allocated by Sub. We can work around this by introducing a second temp variable u2 that we use to hold the Sub call result. Overall, the sqrtInverse procedure still allocates a number of times proportional to the number of Newton steps, because unfortunately a few of the Mul calls in the Newton function allocate; but at least we allocate less in the function itself. FloatSqrt/256-4 1.97µs ± 1% 1.84µs ± 1% -6.61% (p=0.000 n=8+8) FloatSqrt/1000-4 4.80µs ± 3% 4.28µs ± 1% -10.78% (p=0.000 n=8+8) FloatSqrt/10000-4 40.0µs ± 1% 38.3µs ± 1% -4.15% (p=0.000 n=8+8) FloatSqrt/100000-4 955µs ± 1% 932µs ± 0% -2.49% (p=0.000 n=8+7) FloatSqrt/1000000-4 79.8ms ± 1% 79.4ms ± 1% ~ (p=0.105 n=8+8) name old alloc/op new alloc/op delta FloatSqrt/256-4 816B ± 0% 512B ± 0% -37.25% (p=0.000 n=8+8) FloatSqrt/1000-4 2.50kB ± 0% 1.47kB ± 0% -41.03% (p=0.000 n=8+8) FloatSqrt/10000-4 23.5kB ± 0% 18.2kB ± 0% -22.62% (p=0.000 n=8+8) FloatSqrt/100000-4 251kB ± 0% 173kB ± 0% -31.26% (p=0.000 n=8+8) FloatSqrt/1000000-4 4.61MB ± 0% 2.86MB ± 0% -37.90% (p=0.000 n=8+8) name old allocs/op new allocs/op delta FloatSqrt/256-4 12.0 ± 0% 8.0 ± 0% -33.33% (p=0.000 n=8+8) FloatSqrt/1000-4 19.0 ± 0% 9.0 ± 0% -52.63% (p=0.000 n=8+8) FloatSqrt/10000-4 35.0 ± 0% 14.0 ± 0% -60.00% (p=0.000 n=8+8) FloatSqrt/100000-4 55.0 ± 0% 23.0 ± 0% -58.18% (p=0.000 n=8+8) FloatSqrt/1000000-4 122 ± 0% 75 ± 0% -38.52% (p=0.000 n=8+8) Change-Id: I950dbf61a40267a6cca82ae72524c3024bcb149c Reviewed-on: https://go-review.googlesource.com/87659 Reviewed-by: Robert Griesemer <gri@golang.org>	2018-03-08 19:12:35 +00:00
isharipo	d2a5263a9c	math/big: speedup nat.setBytes for bigger slices Set up to _S (number of bytes in Uint) bytes at time by using BigEndian.Uint32 and BigEndian.Uint64. The performance improves for slices bigger than _S bytes. This is the case for 128/256bit arith that initializes it's objects from bytes. name old time/op new time/op delta NatSetBytes/8-4 29.8ns ± 1% 11.4ns ± 0% -61.63% (p=0.000 n=9+8) NatSetBytes/24-4 109ns ± 1% 56ns ± 0% -48.75% (p=0.000 n=9+8) NatSetBytes/128-4 420ns ± 2% 110ns ± 1% -73.83% (p=0.000 n=10+10) NatSetBytes/7-4 26.2ns ± 1% 21.3ns ± 2% -18.63% (p=0.000 n=8+9) NatSetBytes/23-4 106ns ± 1% 67ns ± 1% -36.93% (p=0.000 n=9+10) NatSetBytes/127-4 410ns ± 2% 121ns ± 0% -70.46% (p=0.000 n=9+8) Found this optimization opportunity by looking at ethereum_corevm community benchmark cpuprofile. name old time/op new time/op delta OpDiv256-4 715ns ± 1% 596ns ± 1% -16.57% (p=0.008 n=5+5) OpDiv128-4 373ns ± 1% 314ns ± 1% -15.83% (p=0.008 n=5+5) OpDiv64-4 301ns ± 0% 285ns ± 1% -5.12% (p=0.008 n=5+5) Change-Id: I8e5a680ae6284c8233d8d7431d51253a8a740b57 Reviewed-on: https://go-review.googlesource.com/98775 Run-TryBot: Iskander Sharipov <iskander.sharipov@intel.com> Reviewed-by: Robert Griesemer <gri@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org>	2018-03-08 18:50:10 +00:00
erifan01	0585d41c87	math/big: optimize addVW and subVW on arm64 The biggest hot spot of the existing implementation is "load" operations, which lead to poor performance. By unrolling the cycle 4 times and 2 times, and using "LDP", "STP" instructions, this CL can reduce the "load" cost and improve performance. Benchmarks: name old time/op new time/op delta AddVV/1-8 21.5ns ± 0% 21.5ns ± 0% ~ (all equal) AddVV/2-8 13.5ns ± 0% 13.5ns ± 0% ~ (all equal) AddVV/3-8 15.5ns ± 0% 15.5ns ± 0% ~ (all equal) AddVV/4-8 17.5ns ± 0% 17.5ns ± 0% ~ (all equal) AddVV/5-8 19.5ns ± 0% 19.5ns ± 0% ~ (all equal) AddVV/10-8 29.5ns ± 0% 29.5ns ± 0% ~ (all equal) AddVV/100-8 217ns ± 0% 217ns ± 0% ~ (all equal) AddVV/1000-8 2.02µs ± 0% 2.02µs ± 0% ~ (all equal) AddVV/10000-8 20.3µs ± 0% 20.3µs ± 0% ~ (p=0.603 n=5+5) AddVV/100000-8 223µs ± 7% 228µs ± 8% ~ (p=0.548 n=5+5) AddVW/1-8 9.32ns ± 0% 9.26ns ± 0% -0.64% (p=0.008 n=5+5) AddVW/2-8 19.8ns ± 3% 10.5ns ± 0% -46.92% (p=0.008 n=5+5) AddVW/3-8 11.5ns ± 0% 11.0ns ± 0% -4.35% (p=0.008 n=5+5) AddVW/4-8 13.0ns ± 0% 12.0ns ± 0% -7.69% (p=0.008 n=5+5) AddVW/5-8 14.5ns ± 0% 12.5ns ± 0% -13.79% (p=0.008 n=5+5) AddVW/10-8 22.0ns ± 0% 15.5ns ± 0% -29.55% (p=0.008 n=5+5) AddVW/100-8 167ns ± 0% 81ns ± 0% -51.44% (p=0.008 n=5+5) AddVW/1000-8 1.52µs ± 0% 0.64µs ± 0% -57.58% (p=0.008 n=5+5) AddVW/10000-8 15.1µs ± 0% 7.2µs ± 0% -52.55% (p=0.008 n=5+5) AddVW/100000-8 150µs ± 0% 71µs ± 0% -52.95% (p=0.008 n=5+5) SubVW/1-8 9.32ns ± 0% 9.26ns ± 0% -0.64% (p=0.008 n=5+5) SubVW/2-8 19.7ns ± 2% 10.5ns ± 0% -46.70% (p=0.008 n=5+5) SubVW/3-8 11.5ns ± 0% 11.0ns ± 0% -4.35% (p=0.008 n=5+5) SubVW/4-8 13.0ns ± 0% 12.0ns ± 0% -7.69% (p=0.008 n=5+5) SubVW/5-8 14.5ns ± 0% 12.5ns ± 0% -13.79% (p=0.008 n=5+5) SubVW/10-8 22.0ns ± 0% 15.5ns ± 0% -29.55% (p=0.008 n=5+5) SubVW/100-8 167ns ± 0% 81ns ± 0% -51.44% (p=0.008 n=5+5) SubVW/1000-8 1.52µs ± 0% 0.64µs ± 0% -57.58% (p=0.008 n=5+5) SubVW/10000-8 15.1µs ± 0% 7.2µs ± 0% -52.49% (p=0.008 n=5+5) SubVW/100000-8 150µs ± 0% 71µs ± 0% -52.91% (p=0.008 n=5+5) AddMulVVW/1-8 32.4ns ± 1% 32.6ns ± 1% ~ (p=0.119 n=5+5) AddMulVVW/2-8 57.0ns ± 0% 57.0ns ± 0% ~ (p=0.643 n=5+5) AddMulVVW/3-8 90.8ns ± 0% 90.7ns ± 0% ~ (p=0.524 n=5+5) AddMulVVW/4-8 118ns ± 0% 118ns ± 1% ~ (p=1.000 n=4+5) AddMulVVW/5-8 144ns ± 1% 144ns ± 0% ~ (p=0.794 n=5+4) AddMulVVW/10-8 294ns ± 1% 296ns ± 0% +0.48% (p=0.040 n=5+5) AddMulVVW/100-8 2.73µs ± 0% 2.73µs ± 0% ~ (p=0.278 n=5+5) AddMulVVW/1000-8 26.0µs ± 0% 26.5µs ± 0% +2.14% (p=0.008 n=5+5) AddMulVVW/10000-8 297µs ± 0% 297µs ± 0% +0.24% (p=0.008 n=5+5) AddMulVVW/100000-8 3.15ms ± 1% 3.13ms ± 0% ~ (p=0.690 n=5+5) DecimalConversion-8 311µs ± 2% 309µs ± 2% ~ (p=0.310 n=5+5) FloatString/100-8 2.55µs ± 2% 2.54µs ± 2% ~ (p=1.000 n=5+5) FloatString/1000-8 58.1µs ± 0% 58.1µs ± 0% ~ (p=0.151 n=5+5) FloatString/10000-8 4.59ms ± 0% 4.59ms ± 0% ~ (p=0.151 n=5+5) FloatString/100000-8 446ms ± 0% 446ms ± 0% +0.01% (p=0.016 n=5+5) FloatAdd/10-8 183ns ± 0% 183ns ± 0% ~ (p=0.333 n=4+5) FloatAdd/100-8 187ns ± 1% 192ns ± 2% ~ (p=0.056 n=5+5) FloatAdd/1000-8 369ns ± 0% 371ns ± 0% +0.54% (p=0.016 n=4+5) FloatAdd/10000-8 1.88µs ± 0% 1.88µs ± 0% -0.14% (p=0.000 n=4+5) FloatAdd/100000-8 17.2µs ± 0% 17.1µs ± 0% -0.37% (p=0.008 n=5+5) FloatSub/10-8 147ns ± 0% 147ns ± 0% ~ (all equal) FloatSub/100-8 145ns ± 0% 146ns ± 0% ~ (p=0.238 n=5+4) FloatSub/1000-8 241ns ± 0% 241ns ± 0% ~ (p=0.333 n=5+4) FloatSub/10000-8 1.06µs ± 0% 1.06µs ± 0% ~ (p=0.444 n=5+5) FloatSub/100000-8 9.50µs ± 0% 9.48µs ± 0% -0.14% (p=0.008 n=5+5) ParseFloatSmallExp-8 28.4µs ± 2% 28.5µs ± 1% ~ (p=0.690 n=5+5) ParseFloatLargeExp-8 125µs ± 1% 124µs ± 1% ~ (p=0.095 n=5+5) GCD10x10/WithoutXY-8 277ns ± 2% 278ns ± 3% ~ (p=0.937 n=5+5) GCD10x10/WithXY-8 2.08µs ± 3% 2.15µs ± 3% ~ (p=0.056 n=5+5) GCD10x100/WithoutXY-8 592ns ± 3% 613ns ± 4% ~ (p=0.056 n=5+5) GCD10x100/WithXY-8 3.40µs ± 2% 3.42µs ± 4% ~ (p=0.841 n=5+5) GCD10x1000/WithoutXY-8 1.37µs ± 2% 1.35µs ± 3% ~ (p=0.460 n=5+5) GCD10x1000/WithXY-8 7.34µs ± 2% 7.33µs ± 4% ~ (p=0.841 n=5+5) GCD10x10000/WithoutXY-8 8.52µs ± 0% 8.51µs ± 1% ~ (p=0.421 n=5+5) GCD10x10000/WithXY-8 27.5µs ± 2% 27.2µs ± 1% ~ (p=0.151 n=5+5) GCD10x100000/WithoutXY-8 78.3µs ± 1% 78.5µs ± 1% ~ (p=0.690 n=5+5) GCD10x100000/WithXY-8 231µs ± 0% 229µs ± 1% -1.11% (p=0.016 n=5+5) GCD100x100/WithoutXY-8 1.86µs ± 2% 1.86µs ± 2% ~ (p=0.881 n=5+5) GCD100x100/WithXY-8 27.1µs ± 2% 27.2µs ± 1% ~ (p=0.421 n=5+5) GCD100x1000/WithoutXY-8 4.44µs ± 2% 4.41µs ± 1% ~ (p=0.310 n=5+5) GCD100x1000/WithXY-8 36.3µs ± 1% 36.2µs ± 1% ~ (p=0.310 n=5+5) GCD100x10000/WithoutXY-8 22.6µs ± 2% 22.5µs ± 1% ~ (p=0.690 n=5+5) GCD100x10000/WithXY-8 145µs ± 1% 145µs ± 1% ~ (p=1.000 n=5+5) GCD100x100000/WithoutXY-8 195µs ± 0% 196µs ± 1% ~ (p=0.548 n=5+5) GCD100x100000/WithXY-8 1.10ms ± 0% 1.10ms ± 0% -0.30% (p=0.016 n=5+5) GCD1000x1000/WithoutXY-8 25.0µs ± 1% 25.2µs ± 2% ~ (p=0.222 n=5+5) GCD1000x1000/WithXY-8 520µs ± 0% 520µs ± 1% ~ (p=0.151 n=5+5) GCD1000x10000/WithoutXY-8 57.0µs ± 1% 56.9µs ± 1% ~ (p=0.690 n=5+5) GCD1000x10000/WithXY-8 1.21ms ± 0% 1.21ms ± 1% ~ (p=0.881 n=5+5) GCD1000x100000/WithoutXY-8 358µs ± 0% 359µs ± 1% ~ (p=0.548 n=5+5) GCD1000x100000/WithXY-8 8.73ms ± 0% 8.73ms ± 0% ~ (p=0.548 n=5+5) GCD10000x10000/WithoutXY-8 686µs ± 0% 687µs ± 0% ~ (p=0.548 n=5+5) GCD10000x10000/WithXY-8 15.9ms ± 0% 15.9ms ± 0% ~ (p=0.841 n=5+5) GCD10000x100000/WithoutXY-8 2.08ms ± 0% 2.08ms ± 0% ~ (p=1.000 n=5+5) GCD10000x100000/WithXY-8 86.7ms ± 0% 86.7ms ± 0% ~ (p=1.000 n=5+5) GCD100000x100000/WithoutXY-8 51.1ms ± 0% 51.0ms ± 0% ~ (p=0.151 n=5+5) GCD100000x100000/WithXY-8 1.23s ± 0% 1.23s ± 0% ~ (p=0.841 n=5+5) Hilbert-8 2.41ms ± 1% 2.42ms ± 2% ~ (p=0.690 n=5+5) Binomial-8 4.86µs ± 1% 4.86µs ± 1% ~ (p=0.889 n=5+5) QuoRem-8 7.09µs ± 0% 7.08µs ± 0% -0.09% (p=0.024 n=5+5) Exp-8 161ms ± 0% 161ms ± 0% -0.08% (p=0.032 n=5+5) Exp2-8 161ms ± 0% 161ms ± 0% ~ (p=1.000 n=5+5) Bitset-8 40.7ns ± 0% 40.6ns ± 0% ~ (p=0.095 n=4+5) BitsetNeg-8 159ns ± 4% 148ns ± 0% -6.92% (p=0.016 n=5+4) BitsetOrig-8 378ns ± 1% 378ns ± 1% ~ (p=0.937 n=5+5) BitsetNegOrig-8 647ns ± 5% 647ns ± 4% ~ (p=1.000 n=5+5) ModSqrt225_Tonelli-8 7.26ms ± 0% 7.27ms ± 0% ~ (p=1.000 n=5+5) ModSqrt224_3Mod4-8 2.24ms ± 0% 2.24ms ± 0% ~ (p=0.690 n=5+5) ModSqrt5430_Tonelli-8 62.8s ± 1% 62.5s ± 0% ~ (p=0.063 n=5+4) ModSqrt5430_3Mod4-8 20.8s ± 0% 20.8s ± 0% ~ (p=0.310 n=5+5) Sqrt-8 101µs ± 1% 101µs ± 0% -0.35% (p=0.032 n=5+5) IntSqr/1-8 32.3ns ± 1% 32.5ns ± 1% ~ (p=0.421 n=5+5) IntSqr/2-8 157ns ± 5% 156ns ± 5% ~ (p=0.651 n=5+5) IntSqr/3-8 292ns ± 2% 291ns ± 3% ~ (p=0.881 n=5+5) IntSqr/5-8 738ns ± 6% 740ns ± 5% ~ (p=0.841 n=5+5) IntSqr/8-8 1.82µs ± 4% 1.83µs ± 4% ~ (p=0.730 n=5+5) IntSqr/10-8 2.92µs ± 1% 2.93µs ± 1% ~ (p=0.643 n=5+5) IntSqr/20-8 6.28µs ± 2% 6.28µs ± 2% ~ (p=1.000 n=5+5) IntSqr/30-8 13.8µs ± 2% 13.9µs ± 3% ~ (p=1.000 n=5+5) IntSqr/50-8 37.8µs ± 4% 37.9µs ± 4% ~ (p=0.690 n=5+5) IntSqr/80-8 95.9µs ± 1% 95.8µs ± 1% ~ (p=0.841 n=5+5) IntSqr/100-8 148µs ± 1% 148µs ± 1% ~ (p=0.310 n=5+5) IntSqr/200-8 586µs ± 1% 586µs ± 1% ~ (p=0.841 n=5+5) IntSqr/300-8 1.32ms ± 0% 1.31ms ± 0% ~ (p=0.222 n=5+5) IntSqr/500-8 2.48ms ± 0% 2.48ms ± 0% ~ (p=0.556 n=5+4) IntSqr/800-8 4.68ms ± 0% 4.68ms ± 0% ~ (p=0.548 n=5+5) IntSqr/1000-8 7.57ms ± 0% 7.56ms ± 0% ~ (p=0.421 n=5+5) Mul-8 311ms ± 0% 311ms ± 0% ~ (p=0.548 n=5+5) Exp3Power/0x10-8 559ns ± 1% 560ns ± 1% ~ (p=0.984 n=5+5) Exp3Power/0x40-8 641ns ± 1% 634ns ± 1% ~ (p=0.063 n=5+5) Exp3Power/0x100-8 1.39µs ± 2% 1.40µs ± 2% ~ (p=0.381 n=5+5) Exp3Power/0x400-8 8.27µs ± 1% 8.26µs ± 0% ~ (p=0.571 n=5+5) Exp3Power/0x1000-8 59.9µs ± 0% 59.7µs ± 0% -0.23% (p=0.008 n=5+5) Exp3Power/0x4000-8 816µs ± 0% 816µs ± 0% ~ (p=1.000 n=5+5) Exp3Power/0x10000-8 7.77ms ± 0% 7.77ms ± 0% ~ (p=0.841 n=5+5) Exp3Power/0x40000-8 73.4ms ± 0% 73.4ms ± 0% ~ (p=0.690 n=5+5) Exp3Power/0x100000-8 665ms ± 0% 664ms ± 0% -0.14% (p=0.008 n=5+5) Exp3Power/0x400000-8 5.98s ± 0% 5.98s ± 0% -0.09% (p=0.008 n=5+5) Fibo-8 116ms ± 0% 116ms ± 0% -0.25% (p=0.008 n=5+5) NatSqr/1-8 115ns ± 3% 116ns ± 2% ~ (p=0.238 n=5+5) NatSqr/2-8 237ns ± 1% 237ns ± 1% ~ (p=0.683 n=5+5) NatSqr/3-8 367ns ± 3% 368ns ± 3% ~ (p=0.817 n=5+5) NatSqr/5-8 807ns ± 3% 812ns ± 3% ~ (p=0.913 n=5+5) NatSqr/8-8 1.93µs ± 2% 1.93µs ± 3% ~ (p=0.651 n=5+5) NatSqr/10-8 2.98µs ± 2% 2.99µs ± 2% ~ (p=0.690 n=5+5) NatSqr/20-8 6.49µs ± 2% 6.46µs ± 2% ~ (p=0.548 n=5+5) NatSqr/30-8 14.4µs ± 2% 14.3µs ± 2% ~ (p=0.690 n=5+5) NatSqr/50-8 38.6µs ± 2% 38.7µs ± 2% ~ (p=0.841 n=5+5) NatSqr/80-8 96.1µs ± 2% 95.8µs ± 2% ~ (p=0.548 n=5+5) NatSqr/100-8 149µs ± 1% 149µs ± 1% ~ (p=0.841 n=5+5) NatSqr/200-8 593µs ± 1% 590µs ± 1% ~ (p=0.421 n=5+5) NatSqr/300-8 1.32ms ± 0% 1.32ms ± 1% ~ (p=0.222 n=5+5) NatSqr/500-8 2.49ms ± 0% 2.49ms ± 0% ~ (p=0.690 n=5+5) NatSqr/800-8 4.69ms ± 0% 4.69ms ± 0% ~ (p=1.000 n=5+5) NatSqr/1000-8 7.59ms ± 0% 7.58ms ± 0% ~ (p=0.841 n=5+5) ScanPi-8 322µs ± 0% 321µs ± 0% ~ (p=0.095 n=5+5) StringPiParallel-8 71.4µs ± 5% 68.8µs ± 4% ~ (p=0.151 n=5+5) Scan/10/Base2-8 1.10µs ± 0% 1.09µs ± 0% -0.36% (p=0.032 n=5+5) Scan/100/Base2-8 7.78µs ± 0% 7.79µs ± 0% +0.14% (p=0.008 n=5+5) Scan/1000/Base2-8 78.8µs ± 0% 79.0µs ± 0% +0.24% (p=0.008 n=5+5) Scan/10000/Base2-8 1.22ms ± 0% 1.22ms ± 0% ~ (p=0.056 n=5+5) Scan/100000/Base2-8 55.1ms ± 0% 55.0ms ± 0% -0.15% (p=0.008 n=5+5) Scan/10/Base8-8 514ns ± 0% 515ns ± 0% ~ (p=0.079 n=5+5) Scan/100/Base8-8 2.89µs ± 0% 2.89µs ± 0% +0.15% (p=0.008 n=5+5) Scan/1000/Base8-8 31.0µs ± 0% 31.1µs ± 0% +0.12% (p=0.008 n=5+5) Scan/10000/Base8-8 740µs ± 0% 740µs ± 0% ~ (p=0.222 n=5+5) Scan/100000/Base8-8 50.6ms ± 0% 50.5ms ± 0% -0.06% (p=0.016 n=4+5) Scan/10/Base10-8 492ns ± 1% 490ns ± 1% ~ (p=0.310 n=5+5) Scan/100/Base10-8 2.67µs ± 0% 2.67µs ± 0% ~ (p=0.056 n=5+5) Scan/1000/Base10-8 28.7µs ± 0% 28.7µs ± 0% ~ (p=1.000 n=5+5) Scan/10000/Base10-8 717µs ± 0% 716µs ± 0% ~ (p=0.222 n=5+5) Scan/100000/Base10-8 50.2ms ± 0% 50.3ms ± 0% +0.05% (p=0.008 n=5+5) Scan/10/Base16-8 442ns ± 1% 442ns ± 0% ~ (p=0.468 n=5+5) Scan/100/Base16-8 2.46µs ± 0% 2.45µs ± 0% ~ (p=0.159 n=5+5) Scan/1000/Base16-8 27.2µs ± 0% 27.2µs ± 0% ~ (p=0.841 n=5+5) Scan/10000/Base16-8 721µs ± 0% 722µs ± 0% ~ (p=0.548 n=5+5) Scan/100000/Base16-8 52.6ms ± 0% 52.6ms ± 0% +0.07% (p=0.008 n=5+5) String/10/Base2-8 244ns ± 1% 242ns ± 1% ~ (p=0.103 n=5+5) String/100/Base2-8 1.48µs ± 0% 1.48µs ± 1% ~ (p=0.786 n=5+5) String/1000/Base2-8 13.3µs ± 1% 13.3µs ± 0% ~ (p=0.222 n=5+5) String/10000/Base2-8 132µs ± 1% 132µs ± 1% ~ (p=1.000 n=5+5) String/100000/Base2-8 1.30ms ± 1% 1.30ms ± 1% ~ (p=1.000 n=5+5) String/10/Base8-8 167ns ± 1% 168ns ± 1% ~ (p=0.135 n=5+5) String/100/Base8-8 623ns ± 1% 626ns ± 1% ~ (p=0.151 n=5+5) String/1000/Base8-8 5.24µs ± 1% 5.24µs ± 0% ~ (p=1.000 n=5+5) String/10000/Base8-8 50.0µs ± 1% 50.0µs ± 1% ~ (p=1.000 n=5+5) String/100000/Base8-8 492µs ± 1% 489µs ± 1% ~ (p=0.056 n=5+5) String/10/Base10-8 503ns ± 1% 501ns ± 0% ~ (p=0.183 n=5+5) String/100/Base10-8 1.96µs ± 0% 1.97µs ± 0% ~ (p=0.389 n=5+5) String/1000/Base10-8 12.4µs ± 1% 12.4µs ± 1% ~ (p=0.841 n=5+5) String/10000/Base10-8 56.7µs ± 1% 56.6µs ± 0% ~ (p=1.000 n=5+5) String/100000/Base10-8 25.6ms ± 0% 25.6ms ± 0% ~ (p=0.222 n=5+5) String/10/Base16-8 147ns ± 0% 148ns ± 2% ~ (p=1.000 n=4+5) String/100/Base16-8 505ns ± 0% 505ns ± 1% ~ (p=0.778 n=5+5) String/1000/Base16-8 3.94µs ± 0% 3.94µs ± 0% ~ (p=0.841 n=5+5) String/10000/Base16-8 37.4µs ± 1% 37.2µs ± 1% ~ (p=0.095 n=5+5) String/100000/Base16-8 367µs ± 1% 367µs ± 0% ~ (p=1.000 n=5+5) LeafSize/0-8 6.64ms ± 0% 6.65ms ± 0% ~ (p=0.690 n=5+5) LeafSize/1-8 72.5µs ± 1% 72.4µs ± 1% ~ (p=0.841 n=5+5) LeafSize/2-8 72.6µs ± 1% 72.6µs ± 1% ~ (p=1.000 n=5+5) LeafSize/3-8 377µs ± 0% 377µs ± 0% ~ (p=0.421 n=5+5) LeafSize/4-8 71.2µs ± 1% 71.3µs ± 0% ~ (p=0.278 n=5+5) LeafSize/5-8 469µs ± 0% 469µs ± 0% ~ (p=0.310 n=5+5) LeafSize/6-8 376µs ± 0% 376µs ± 0% ~ (p=0.841 n=5+5) LeafSize/7-8 244µs ± 0% 244µs ± 0% ~ (p=0.841 n=5+5) LeafSize/8-8 71.9µs ± 1% 72.1µs ± 1% ~ (p=0.548 n=5+5) LeafSize/9-8 536µs ± 0% 536µs ± 0% ~ (p=0.151 n=5+5) LeafSize/10-8 470µs ± 0% 471µs ± 0% +0.10% (p=0.032 n=5+5) LeafSize/11-8 458µs ± 0% 458µs ± 0% ~ (p=0.881 n=5+5) LeafSize/12-8 376µs ± 0% 376µs ± 0% ~ (p=0.548 n=5+5) LeafSize/13-8 341µs ± 0% 342µs ± 0% ~ (p=0.222 n=5+5) LeafSize/14-8 246µs ± 0% 245µs ± 0% ~ (p=0.167 n=5+5) LeafSize/15-8 168µs ± 0% 168µs ± 0% ~ (p=0.548 n=5+5) LeafSize/16-8 72.1µs ± 1% 72.2µs ± 1% ~ (p=0.690 n=5+5) LeafSize/32-8 81.5µs ± 1% 81.4µs ± 1% ~ (p=1.000 n=5+5) LeafSize/64-8 133µs ± 1% 134µs ± 1% ~ (p=0.690 n=5+5) ProbablyPrime/n=0-8 44.3ms ± 0% 44.2ms ± 0% -0.28% (p=0.008 n=5+5) ProbablyPrime/n=1-8 64.8ms ± 0% 64.7ms ± 0% -0.15% (p=0.008 n=5+5) ProbablyPrime/n=5-8 147ms ± 0% 147ms ± 0% -0.11% (p=0.008 n=5+5) ProbablyPrime/n=10-8 250ms ± 0% 250ms ± 0% ~ (p=0.056 n=5+5) ProbablyPrime/n=20-8 456ms ± 0% 455ms ± 0% -0.05% (p=0.008 n=5+5) ProbablyPrime/Lucas-8 23.6ms ± 0% 23.5ms ± 0% -0.29% (p=0.008 n=5+5) ProbablyPrime/MillerRabinBase2-8 20.6ms ± 0% 20.6ms ± 0% ~ (p=0.690 n=5+5) FloatSqrt/64-8 2.01µs ± 1% 2.02µs ± 1% ~ (p=0.421 n=5+5) FloatSqrt/128-8 4.43µs ± 2% 4.38µs ± 2% ~ (p=0.222 n=5+5) FloatSqrt/256-8 6.64µs ± 1% 6.68µs ± 2% ~ (p=0.516 n=5+5) FloatSqrt/1000-8 31.9µs ± 0% 31.8µs ± 0% ~ (p=0.095 n=5+5) FloatSqrt/10000-8 595µs ± 0% 594µs ± 0% ~ (p=0.056 n=5+5) FloatSqrt/100000-8 17.9ms ± 0% 17.9ms ± 0% ~ (p=0.151 n=5+5) FloatSqrt/1000000-8 1.52s ± 0% 1.52s ± 0% ~ (p=0.841 n=5+5) name old speed new speed delta AddVV/1-8 2.97GB/s ± 0% 2.97GB/s ± 0% ~ (p=0.971 n=4+4) AddVV/2-8 9.47GB/s ± 0% 9.47GB/s ± 0% +0.01% (p=0.016 n=5+5) AddVV/3-8 12.4GB/s ± 0% 12.4GB/s ± 0% ~ (p=0.548 n=5+5) AddVV/4-8 14.6GB/s ± 0% 14.6GB/s ± 0% ~ (p=1.000 n=5+5) AddVV/5-8 16.4GB/s ± 0% 16.4GB/s ± 0% ~ (p=1.000 n=5+5) AddVV/10-8 21.7GB/s ± 0% 21.7GB/s ± 0% ~ (p=0.548 n=5+5) AddVV/100-8 29.4GB/s ± 0% 29.4GB/s ± 0% ~ (p=1.000 n=5+5) AddVV/1000-8 31.7GB/s ± 0% 31.7GB/s ± 0% ~ (p=0.524 n=5+4) AddVV/10000-8 31.5GB/s ± 0% 31.5GB/s ± 0% ~ (p=0.690 n=5+5) AddVV/100000-8 28.8GB/s ± 7% 28.1GB/s ± 8% ~ (p=0.548 n=5+5) AddVW/1-8 859MB/s ± 0% 864MB/s ± 0% +0.61% (p=0.008 n=5+5) AddVW/2-8 809MB/s ± 2% 1520MB/s ± 0% +87.78% (p=0.008 n=5+5) AddVW/3-8 2.08GB/s ± 0% 2.18GB/s ± 0% +4.54% (p=0.008 n=5+5) AddVW/4-8 2.46GB/s ± 0% 2.66GB/s ± 0% +8.33% (p=0.016 n=4+5) AddVW/5-8 2.76GB/s ± 0% 3.20GB/s ± 0% +16.03% (p=0.008 n=5+5) AddVW/10-8 3.63GB/s ± 0% 5.15GB/s ± 0% +41.83% (p=0.008 n=5+5) AddVW/100-8 4.79GB/s ± 0% 9.87GB/s ± 0% +106.12% (p=0.008 n=5+5) AddVW/1000-8 5.27GB/s ± 0% 12.42GB/s ± 0% +135.74% (p=0.008 n=5+5) AddVW/10000-8 5.31GB/s ± 0% 11.19GB/s ± 0% +110.71% (p=0.008 n=5+5) AddVW/100000-8 5.32GB/s ± 0% 11.32GB/s ± 0% +112.56% (p=0.008 n=5+5) SubVW/1-8 859MB/s ± 0% 864MB/s ± 0% +0.61% (p=0.008 n=5+5) SubVW/2-8 812MB/s ± 2% 1520MB/s ± 0% +87.09% (p=0.008 n=5+5) SubVW/3-8 2.08GB/s ± 0% 2.18GB/s ± 0% +4.55% (p=0.008 n=5+5) SubVW/4-8 2.46GB/s ± 0% 2.66GB/s ± 0% +8.33% (p=0.008 n=5+5) SubVW/5-8 2.75GB/s ± 0% 3.20GB/s ± 0% +16.03% (p=0.008 n=5+5) SubVW/10-8 3.63GB/s ± 0% 5.15GB/s ± 0% +41.82% (p=0.008 n=5+5) SubVW/100-8 4.79GB/s ± 0% 9.87GB/s ± 0% +106.13% (p=0.008 n=5+5) SubVW/1000-8 5.27GB/s ± 0% 12.42GB/s ± 0% +135.74% (p=0.008 n=5+5) SubVW/10000-8 5.31GB/s ± 0% 11.17GB/s ± 0% +110.44% (p=0.008 n=5+5) SubVW/100000-8 5.32GB/s ± 0% 11.31GB/s ± 0% +112.35% (p=0.008 n=5+5) AddMulVVW/1-8 1.97GB/s ± 1% 1.96GB/s ± 1% ~ (p=0.151 n=5+5) AddMulVVW/2-8 2.24GB/s ± 0% 2.25GB/s ± 0% ~ (p=0.095 n=5+5) AddMulVVW/3-8 2.11GB/s ± 0% 2.12GB/s ± 0% ~ (p=0.548 n=5+5) AddMulVVW/4-8 2.17GB/s ± 1% 2.17GB/s ± 1% ~ (p=0.548 n=5+5) AddMulVVW/5-8 2.22GB/s ± 1% 2.21GB/s ± 1% ~ (p=0.421 n=5+5) AddMulVVW/10-8 2.17GB/s ± 1% 2.16GB/s ± 0% ~ (p=0.095 n=5+5) AddMulVVW/100-8 2.35GB/s ± 0% 2.35GB/s ± 0% ~ (p=0.421 n=5+5) AddMulVVW/1000-8 2.47GB/s ± 0% 2.41GB/s ± 0% -2.09% (p=0.008 n=5+5) AddMulVVW/10000-8 2.16GB/s ± 0% 2.15GB/s ± 0% -0.23% (p=0.008 n=5+5) AddMulVVW/100000-8 2.03GB/s ± 1% 2.04GB/s ± 0% ~ (p=0.690 n=5+5) name old alloc/op new alloc/op delta FloatString/100-8 400B ± 0% 400B ± 0% ~ (all equal) FloatString/1000-8 3.22kB ± 0% 3.22kB ± 0% ~ (all equal) FloatString/10000-8 55.6kB ± 0% 55.5kB ± 0% ~ (p=0.206 n=5+5) FloatString/100000-8 627kB ± 0% 627kB ± 0% ~ (all equal) FloatAdd/10-8 0.00B 0.00B ~ (all equal) FloatAdd/100-8 0.00B 0.00B ~ (all equal) FloatAdd/1000-8 0.00B 0.00B ~ (all equal) FloatAdd/10000-8 0.00B 0.00B ~ (all equal) FloatAdd/100000-8 0.00B 0.00B ~ (all equal) FloatSub/10-8 0.00B 0.00B ~ (all equal) FloatSub/100-8 0.00B 0.00B ~ (all equal) FloatSub/1000-8 0.00B 0.00B ~ (all equal) FloatSub/10000-8 0.00B 0.00B ~ (all equal) FloatSub/100000-8 0.00B 0.00B ~ (all equal) FloatSqrt/64-8 416B ± 0% 416B ± 0% ~ (all equal) FloatSqrt/128-8 720B ± 0% 720B ± 0% ~ (all equal) FloatSqrt/256-8 816B ± 0% 816B ± 0% ~ (all equal) FloatSqrt/1000-8 2.50kB ± 0% 2.50kB ± 0% ~ (all equal) FloatSqrt/10000-8 23.5kB ± 0% 23.5kB ± 0% ~ (all equal) FloatSqrt/100000-8 251kB ± 0% 251kB ± 0% ~ (all equal) FloatSqrt/1000000-8 4.61MB ± 0% 4.61MB ± 0% ~ (all equal) name old allocs/op new allocs/op delta FloatString/100-8 8.00 ± 0% 8.00 ± 0% ~ (all equal) FloatString/1000-8 10.0 ± 0% 10.0 ± 0% ~ (all equal) FloatString/10000-8 42.0 ± 0% 42.0 ± 0% ~ (all equal) FloatString/100000-8 346 ± 0% 346 ± 0% ~ (all equal) FloatAdd/10-8 0.00 0.00 ~ (all equal) FloatAdd/100-8 0.00 0.00 ~ (all equal) FloatAdd/1000-8 0.00 0.00 ~ (all equal) FloatAdd/10000-8 0.00 0.00 ~ (all equal) FloatAdd/100000-8 0.00 0.00 ~ (all equal) FloatSub/10-8 0.00 0.00 ~ (all equal) FloatSub/100-8 0.00 0.00 ~ (all equal) FloatSub/1000-8 0.00 0.00 ~ (all equal) FloatSub/10000-8 0.00 0.00 ~ (all equal) FloatSub/100000-8 0.00 0.00 ~ (all equal) FloatSqrt/64-8 9.00 ± 0% 9.00 ± 0% ~ (all equal) FloatSqrt/128-8 13.0 ± 0% 13.0 ± 0% ~ (all equal) FloatSqrt/256-8 12.0 ± 0% 12.0 ± 0% ~ (all equal) FloatSqrt/1000-8 19.0 ± 0% 19.0 ± 0% ~ (all equal) FloatSqrt/10000-8 35.0 ± 0% 35.0 ± 0% ~ (all equal) FloatSqrt/100000-8 55.0 ± 0% 55.0 ± 0% ~ (all equal) FloatSqrt/1000000-8 122 ± 0% 122 ± 0% ~ (all equal) Change-Id: I6888d84c037d91f9e2199f3492ea3f6a0ed77b24 Reviewed-on: https://go-review.googlesource.com/77832 Reviewed-by: Vlad Krasnov <vlad@cloudflare.com> Reviewed-by: Cherry Zhang <cherryyz@google.com> Run-TryBot: Cherry Zhang <cherryyz@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org>	2018-03-08 15:31:37 +00:00
Vlad Krasnov	fd3d27938a	math/big: implement addMulVVW on arm64 The lack of proper addMulVVW implementation for arm64 hurts RSA performance. This assembly implementation is optimized for arm64 based servers. name old time/op new time/op delta pkg:math/big goos:linux goarch:arm64 AddMulVVW/1 55.2ns ± 0% 11.9ns ± 1% -78.37% (p=0.000 n=8+10) AddMulVVW/2 67.0ns ± 0% 11.2ns ± 0% -83.28% (p=0.000 n=7+10) AddMulVVW/3 93.2ns ± 0% 13.2ns ± 0% -85.84% (p=0.000 n=10+10) AddMulVVW/4 126ns ± 0% 13ns ± 1% -89.82% (p=0.000 n=10+10) AddMulVVW/5 151ns ± 0% 17ns ± 0% -88.87% (p=0.000 n=10+9) AddMulVVW/10 323ns ± 0% 25ns ± 0% -92.20% (p=0.000 n=10+10) AddMulVVW/100 3.28µs ± 0% 0.14µs ± 0% -95.82% (p=0.000 n=10+10) AddMulVVW/1000 31.7µs ± 0% 1.3µs ± 0% -96.00% (p=0.000 n=10+8) AddMulVVW/10000 313µs ± 0% 13µs ± 0% -95.98% (p=0.000 n=10+10) AddMulVVW/100000 3.24ms ± 0% 0.13ms ± 1% -96.13% (p=0.000 n=9+9) pkg:crypto/rsa goos:linux goarch:arm64 RSA2048Decrypt 44.7ms ± 0% 4.0ms ± 6% -91.08% (p=0.000 n=8+10) RSA2048Sign 46.3ms ± 0% 5.0ms ± 0% -89.29% (p=0.000 n=9+10) 3PrimeRSA2048Decrypt 22.3ms ± 0% 2.4ms ± 0% -89.26% (p=0.000 n=10+10) Change-Id: I295f0bd5c51a4442d02c44ece1f6026d30dff0bc Reviewed-on: https://go-review.googlesource.com/76270 Reviewed-by: Vlad Krasnov <vlad@cloudflare.com> Reviewed-by: Cherry Zhang <cherryyz@google.com> Run-TryBot: Vlad Krasnov <vlad@cloudflare.com> TryBot-Result: Gobot Gobot <gobot@golang.org>	2018-03-07 23:04:38 +00:00
Cherry Zhang	084143d844	math/big: don't use R18 in ARM64 assembly R18 seems reserved on Apple platforms. May fix darwin/arm64 build. Change-Id: Ia2c1de550a64827c85a64affa53b94c62aacce8e Reviewed-on: https://go-review.googlesource.com/98896 Run-TryBot: Cherry Zhang <cherryyz@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Elias Naur <elias.naur@gmail.com>	2018-03-06 15:34:00 +00:00
erifan01	c4f3fe95c6	math/big: optimize addVV and subVV on arm64 The biggest hot spot of the existing implementation is "load" operations, which lead to poor performance. By unrolling the cycle 4x and 2x, and using "LDP", "STP" instructions, this CL can reduce the "load" cost and improve performance. Benchmarks: name old time/op new time/op delta AddVV/1-8 21.5ns ± 0% 11.5ns ± 0% -46.51% (p=0.008 n=5+5) AddVV/2-8 13.5ns ± 0% 12.0ns ± 0% -11.11% (p=0.008 n=5+5) AddVV/3-8 15.5ns ± 0% 13.0ns ± 0% -16.13% (p=0.008 n=5+5) AddVV/4-8 17.5ns ± 0% 13.5ns ± 0% -22.86% (p=0.008 n=5+5) AddVV/5-8 19.5ns ± 0% 14.5ns ± 0% -25.64% (p=0.008 n=5+5) AddVV/10-8 29.5ns ± 0% 18.0ns ± 0% -38.98% (p=0.008 n=5+5) AddVV/100-8 217ns ± 0% 94ns ± 0% -56.64% (p=0.008 n=5+5) AddVV/1000-8 2.02µs ± 0% 1.03µs ± 0% -48.85% (p=0.008 n=5+5) AddVV/10000-8 20.5µs ± 0% 11.3µs ± 0% -44.70% (p=0.008 n=5+5) AddVV/100000-8 247µs ± 3% 154µs ± 0% -37.52% (p=0.008 n=5+5) SubVV/1-8 21.5ns ± 0% 11.5ns ± 0% ~ (p=0.079 n=4+5) SubVV/2-8 13.5ns ± 0% 12.0ns ± 0% -11.11% (p=0.008 n=5+5) SubVV/3-8 15.5ns ± 0% 13.0ns ± 0% -16.13% (p=0.008 n=5+5) SubVV/4-8 17.5ns ± 0% 13.5ns ± 0% -22.86% (p=0.008 n=5+5) SubVV/5-8 19.5ns ± 0% 14.5ns ± 0% -25.64% (p=0.008 n=5+5) SubVV/10-8 29.5ns ± 0% 18.0ns ± 0% -38.98% (p=0.008 n=5+5) SubVV/100-8 217ns ± 0% 94ns ± 0% -56.64% (p=0.008 n=5+5) SubVV/1000-8 2.02µs ± 0% 0.80µs ± 0% -60.50% (p=0.008 n=5+5) SubVV/10000-8 20.5µs ± 0% 11.3µs ± 0% -44.99% (p=0.008 n=5+5) SubVV/100000-8 221µs ±11% 223µs ±16% ~ (p=0.690 n=5+5) AddVW/1-8 9.32ns ± 0% 9.32ns ± 0% ~ (all equal) AddVW/2-8 19.7ns ± 1% 19.7ns ± 0% ~ (p=0.381 n=5+4) AddVW/3-8 11.5ns ± 0% 11.5ns ± 0% ~ (all equal) AddVW/4-8 13.0ns ± 0% 13.0ns ± 0% ~ (all equal) AddVW/5-8 14.5ns ± 0% 14.5ns ± 0% ~ (all equal) AddVW/10-8 22.0ns ± 0% 22.0ns ± 0% ~ (all equal) AddVW/100-8 167ns ± 0% 167ns ± 0% ~ (all equal) AddVW/1000-8 1.52µs ± 0% 1.52µs ± 0% +0.40% (p=0.008 n=5+5) AddVW/10000-8 15.1µs ± 0% 15.1µs ± 0% ~ (p=0.556 n=5+4) AddVW/100000-8 152µs ± 1% 152µs ± 1% ~ (p=0.690 n=5+5) AddMulVVW/1-8 33.3ns ± 0% 32.7ns ± 1% -1.86% (p=0.008 n=5+5) AddMulVVW/2-8 59.3ns ± 1% 56.9ns ± 1% -4.15% (p=0.008 n=5+5) AddMulVVW/3-8 80.5ns ± 1% 85.4ns ± 3% +6.19% (p=0.008 n=5+5) AddMulVVW/4-8 127ns ± 0% 111ns ± 1% -13.19% (p=0.008 n=5+5) AddMulVVW/5-8 144ns ± 0% 149ns ± 0% +3.47% (p=0.016 n=4+5) AddMulVVW/10-8 298ns ± 1% 283ns ± 0% -4.77% (p=0.008 n=5+5) AddMulVVW/100-8 3.06µs ± 0% 2.99µs ± 0% -2.21% (p=0.008 n=5+5) AddMulVVW/1000-8 31.3µs ± 0% 26.9µs ± 0% -14.17% (p=0.008 n=5+5) AddMulVVW/10000-8 316µs ± 0% 305µs ± 0% -3.51% (p=0.008 n=5+5) AddMulVVW/100000-8 3.17ms ± 0% 3.17ms ± 1% ~ (p=0.690 n=5+5) DecimalConversion-8 316µs ± 1% 313µs ± 2% ~ (p=0.095 n=5+5) FloatString/100-8 2.53µs ± 1% 2.56µs ± 2% ~ (p=0.222 n=5+5) FloatString/1000-8 58.4µs ± 0% 58.5µs ± 0% ~ (p=0.206 n=5+5) FloatString/10000-8 4.59ms ± 0% 4.58ms ± 0% -0.31% (p=0.008 n=5+5) FloatString/100000-8 446ms ± 0% 444ms ± 0% -0.31% (p=0.008 n=5+5) FloatAdd/10-8 184ns ± 0% 172ns ± 0% -6.30% (p=0.008 n=5+5) FloatAdd/100-8 189ns ± 2% 191ns ± 4% ~ (p=0.381 n=5+5) FloatAdd/1000-8 371ns ± 0% 347ns ± 1% -6.42% (p=0.008 n=5+5) FloatAdd/10000-8 1.87µs ± 0% 1.68µs ± 0% -10.16% (p=0.008 n=5+5) FloatAdd/100000-8 17.1µs ± 0% 15.6µs ± 0% -8.74% (p=0.016 n=5+4) FloatSub/10-8 152ns ± 0% 138ns ± 0% -9.47% (p=0.000 n=4+5) FloatSub/100-8 148ns ± 0% 142ns ± 0% -4.05% (p=0.000 n=5+4) FloatSub/1000-8 245ns ± 1% 217ns ± 0% -11.28% (p=0.000 n=5+4) FloatSub/10000-8 1.07µs ± 0% 0.88µs ± 1% -18.14% (p=0.008 n=5+5) FloatSub/100000-8 9.58µs ± 0% 7.96µs ± 0% -16.84% (p=0.008 n=5+5) ParseFloatSmallExp-8 28.8µs ± 1% 29.0µs ± 1% ~ (p=0.095 n=5+5) ParseFloatLargeExp-8 126µs ± 1% 126µs ± 1% ~ (p=0.841 n=5+5) GCD10x10/WithoutXY-8 277ns ± 2% 281ns ± 4% ~ (p=0.746 n=5+5) GCD10x10/WithXY-8 2.10µs ± 1% 2.12µs ± 3% ~ (p=0.548 n=5+5) GCD10x100/WithoutXY-8 615ns ± 3% 607ns ± 2% ~ (p=0.135 n=5+5) GCD10x100/WithXY-8 3.50µs ± 2% 3.62µs ± 5% ~ (p=0.151 n=5+5) GCD10x1000/WithoutXY-8 1.39µs ± 2% 1.39µs ± 3% ~ (p=0.690 n=5+5) GCD10x1000/WithXY-8 7.39µs ± 1% 7.34µs ± 2% ~ (p=0.135 n=5+5) GCD10x10000/WithoutXY-8 8.66µs ± 1% 8.68µs ± 1% ~ (p=0.421 n=5+5) GCD10x10000/WithXY-8 28.1µs ± 2% 27.0µs ± 2% -3.81% (p=0.008 n=5+5) GCD10x100000/WithoutXY-8 79.3µs ± 1% 79.3µs ± 1% ~ (p=0.841 n=5+5) GCD10x100000/WithXY-8 238µs ± 0% 227µs ± 1% -4.74% (p=0.008 n=5+5) GCD100x100/WithoutXY-8 1.89µs ± 1% 1.88µs ± 2% ~ (p=0.968 n=5+5) GCD100x100/WithXY-8 26.7µs ± 1% 27.0µs ± 1% +1.44% (p=0.032 n=5+5) GCD100x1000/WithoutXY-8 4.48µs ± 1% 4.45µs ± 2% ~ (p=0.341 n=5+5) GCD100x1000/WithXY-8 36.3µs ± 1% 35.1µs ± 1% -3.27% (p=0.008 n=5+5) GCD100x10000/WithoutXY-8 22.8µs ± 0% 22.7µs ± 1% ~ (p=0.056 n=5+5) GCD100x10000/WithXY-8 145µs ± 1% 133µs ± 1% -8.33% (p=0.008 n=5+5) GCD100x100000/WithoutXY-8 198µs ± 0% 195µs ± 0% -1.56% (p=0.008 n=5+5) GCD100x100000/WithXY-8 1.11ms ± 0% 1.00ms ± 0% -10.04% (p=0.008 n=5+5) GCD1000x1000/WithoutXY-8 25.2µs ± 1% 24.8µs ± 1% -1.63% (p=0.016 n=5+5) GCD1000x1000/WithXY-8 513µs ± 0% 517µs ± 2% ~ (p=0.421 n=5+5) GCD1000x10000/WithoutXY-8 57.0µs ± 0% 52.7µs ± 1% -7.56% (p=0.008 n=5+5) GCD1000x10000/WithXY-8 1.20ms ± 0% 1.10ms ± 0% -8.70% (p=0.008 n=5+5) GCD1000x100000/WithoutXY-8 358µs ± 0% 318µs ± 1% -11.03% (p=0.008 n=5+5) GCD1000x100000/WithXY-8 8.71ms ± 0% 7.65ms ± 0% -12.19% (p=0.008 n=5+5) GCD10000x10000/WithoutXY-8 690µs ± 0% 630µs ± 0% -8.71% (p=0.008 n=5+5) GCD10000x10000/WithXY-8 16.0ms ± 1% 14.9ms ± 0% -6.85% (p=0.008 n=5+5) GCD10000x100000/WithoutXY-8 2.09ms ± 0% 1.75ms ± 0% -16.09% (p=0.016 n=5+4) GCD10000x100000/WithXY-8 86.8ms ± 0% 76.3ms ± 0% -12.09% (p=0.008 n=5+5) GCD100000x100000/WithoutXY-8 51.1ms ± 0% 46.0ms ± 0% -9.97% (p=0.008 n=5+5) GCD100000x100000/WithXY-8 1.25s ± 0% 1.15s ± 0% -7.92% (p=0.008 n=5+5) Hilbert-8 2.45ms ± 1% 2.49ms ± 1% +1.99% (p=0.008 n=5+5) Binomial-8 4.98µs ± 3% 4.90µs ± 2% ~ (p=0.421 n=5+5) QuoRem-8 7.10µs ± 0% 6.21µs ± 0% -12.55% (p=0.016 n=5+4) Exp-8 161ms ± 0% 161ms ± 0% ~ (p=0.421 n=5+5) Exp2-8 161ms ± 0% 161ms ± 0% ~ (p=0.151 n=5+5) Bitset-8 40.4ns ± 0% 40.3ns ± 0% ~ (p=0.190 n=5+5) BitsetNeg-8 163ns ± 3% 137ns ± 2% -15.91% (p=0.008 n=5+5) BitsetOrig-8 377ns ± 1% 372ns ± 1% -1.22% (p=0.024 n=5+5) BitsetNegOrig-8 631ns ± 1% 605ns ± 1% -4.09% (p=0.008 n=5+5) ModSqrt225_Tonelli-8 7.26ms ± 0% 7.26ms ± 0% ~ (p=0.548 n=5+5) ModSqrt224_3Mod4-8 2.24ms ± 0% 2.24ms ± 0% ~ (p=1.000 n=5+5) ModSqrt5430_Tonelli-8 62.4s ± 0% 62.4s ± 0% ~ (p=0.841 n=5+5) ModSqrt5430_3Mod4-8 20.8s ± 0% 20.7s ± 0% ~ (p=0.056 n=5+5) Sqrt-8 101µs ± 0% 89µs ± 0% -12.17% (p=0.008 n=5+5) IntSqr/1-8 32.5ns ± 1% 32.7ns ± 1% ~ (p=0.056 n=5+5) IntSqr/2-8 160ns ± 5% 158ns ± 0% ~ (p=0.397 n=5+4) IntSqr/3-8 298ns ± 4% 296ns ± 4% ~ (p=0.667 n=5+5) IntSqr/5-8 737ns ± 5% 761ns ± 3% +3.34% (p=0.016 n=5+5) IntSqr/8-8 1.87µs ± 4% 1.90µs ± 3% ~ (p=0.222 n=5+5) IntSqr/10-8 2.96µs ± 4% 2.92µs ± 6% ~ (p=0.310 n=5+5) IntSqr/20-8 6.28µs ± 3% 6.21µs ± 2% ~ (p=0.310 n=5+5) IntSqr/30-8 14.0µs ± 2% 13.9µs ± 2% ~ (p=0.548 n=5+5) IntSqr/50-8 37.7µs ± 3% 38.3µs ± 2% ~ (p=0.095 n=5+5) IntSqr/80-8 95.9µs ± 2% 95.1µs ± 1% ~ (p=0.310 n=5+5) IntSqr/100-8 148µs ± 1% 148µs ± 1% ~ (p=0.841 n=5+5) IntSqr/200-8 586µs ± 1% 587µs ± 1% ~ (p=1.000 n=5+5) IntSqr/300-8 1.32ms ± 0% 1.31ms ± 1% -0.73% (p=0.032 n=5+5) IntSqr/500-8 2.48ms ± 0% 2.45ms ± 0% -1.15% (p=0.008 n=5+5) IntSqr/800-8 4.68ms ± 0% 4.62ms ± 0% -1.23% (p=0.008 n=5+5) IntSqr/1000-8 7.57ms ± 0% 7.50ms ± 0% -0.84% (p=0.008 n=5+5) Mul-8 311ms ± 0% 308ms ± 0% -0.81% (p=0.008 n=5+5) Exp3Power/0x10-8 574ns ± 1% 578ns ± 2% ~ (p=0.500 n=5+5) Exp3Power/0x40-8 640ns ± 1% 646ns ± 0% ~ (p=0.056 n=5+5) Exp3Power/0x100-8 1.42µs ± 1% 1.42µs ± 1% ~ (p=0.246 n=5+5) Exp3Power/0x400-8 8.30µs ± 1% 8.29µs ± 1% ~ (p=0.802 n=5+5) Exp3Power/0x1000-8 60.0µs ± 0% 59.9µs ± 0% -0.24% (p=0.016 n=5+5) Exp3Power/0x4000-8 817µs ± 0% 816µs ± 0% -0.17% (p=0.008 n=5+5) Exp3Power/0x10000-8 7.80ms ± 1% 7.70ms ± 0% -1.23% (p=0.008 n=5+5) Exp3Power/0x40000-8 73.4ms ± 0% 72.5ms ± 0% -1.28% (p=0.008 n=5+5) Exp3Power/0x100000-8 665ms ± 0% 656ms ± 0% -1.34% (p=0.008 n=5+5) Exp3Power/0x400000-8 5.99s ± 0% 5.90s ± 0% -1.40% (p=0.008 n=5+5) Fibo-8 116ms ± 0% 50ms ± 0% -57.09% (p=0.008 n=5+5) NatSqr/1-8 112ns ± 4% 112ns ± 2% ~ (p=0.968 n=5+5) NatSqr/2-8 251ns ± 2% 250ns ± 1% ~ (p=0.571 n=5+5) NatSqr/3-8 378ns ± 2% 379ns ± 2% ~ (p=0.794 n=5+5) NatSqr/5-8 829ns ± 3% 827ns ± 2% ~ (p=1.000 n=5+5) NatSqr/8-8 1.97µs ± 2% 1.95µs ± 2% ~ (p=0.310 n=5+5) NatSqr/10-8 3.02µs ± 2% 2.99µs ± 2% ~ (p=0.421 n=5+5) NatSqr/20-8 6.51µs ± 2% 6.49µs ± 1% ~ (p=0.841 n=5+5) NatSqr/30-8 14.1µs ± 2% 14.0µs ± 2% ~ (p=0.841 n=5+5) NatSqr/50-8 38.1µs ± 2% 38.3µs ± 3% ~ (p=0.690 n=5+5) NatSqr/80-8 95.5µs ± 2% 96.0µs ± 1% ~ (p=0.421 n=5+5) NatSqr/100-8 150µs ± 1% 148µs ± 2% ~ (p=0.095 n=5+5) NatSqr/200-8 588µs ± 1% 590µs ± 1% ~ (p=0.421 n=5+5) NatSqr/300-8 1.32ms ± 1% 1.31ms ± 1% ~ (p=0.841 n=5+5) NatSqr/500-8 2.50ms ± 0% 2.47ms ± 0% -1.03% (p=0.008 n=5+5) NatSqr/800-8 4.70ms ± 0% 4.64ms ± 0% -1.31% (p=0.008 n=5+5) NatSqr/1000-8 7.60ms ± 0% 7.52ms ± 0% -1.01% (p=0.008 n=5+5) ScanPi-8 326µs ± 0% 326µs ± 0% ~ (p=0.841 n=5+5) StringPiParallel-8 70.3µs ± 5% 63.8µs ±10% ~ (p=0.056 n=5+5) Scan/10/Base2-8 1.09µs ± 0% 1.09µs ± 0% ~ (p=0.317 n=5+5) Scan/100/Base2-8 7.79µs ± 0% 7.78µs ± 0% ~ (p=0.063 n=5+5) Scan/1000/Base2-8 79.0µs ± 0% 78.9µs ± 0% -0.18% (p=0.008 n=5+5) Scan/10000/Base2-8 1.22ms ± 0% 1.22ms ± 0% -0.15% (p=0.008 n=5+5) Scan/100000/Base2-8 55.1ms ± 0% 55.2ms ± 0% +0.20% (p=0.008 n=5+5) Scan/10/Base8-8 512ns ± 0% 512ns ± 1% ~ (p=0.810 n=5+5) Scan/100/Base8-8 2.89µs ± 0% 2.89µs ± 0% ~ (p=0.810 n=5+5) Scan/1000/Base8-8 31.0µs ± 0% 31.0µs ± 0% ~ (p=0.151 n=5+5) Scan/10000/Base8-8 740µs ± 0% 741µs ± 0% +0.10% (p=0.008 n=5+5) Scan/100000/Base8-8 50.6ms ± 0% 50.6ms ± 0% +0.08% (p=0.008 n=5+5) Scan/10/Base10-8 487ns ± 0% 487ns ± 0% ~ (p=0.571 n=5+5) Scan/100/Base10-8 2.67µs ± 0% 2.67µs ± 0% ~ (p=0.810 n=5+5) Scan/1000/Base10-8 28.7µs ± 0% 28.7µs ± 0% +0.06% (p=0.008 n=5+5) Scan/10000/Base10-8 716µs ± 0% 717µs ± 0% ~ (p=0.222 n=5+5) Scan/100000/Base10-8 50.3ms ± 0% 50.3ms ± 0% +0.10% (p=0.008 n=5+5) Scan/10/Base16-8 438ns ± 0% 437ns ± 1% ~ (p=0.786 n=5+5) Scan/100/Base16-8 2.47µs ± 0% 2.47µs ± 0% -0.19% (p=0.048 n=5+5) Scan/1000/Base16-8 27.2µs ± 0% 27.3µs ± 0% ~ (p=0.087 n=5+5) Scan/10000/Base16-8 722µs ± 0% 722µs ± 0% +0.11% (p=0.008 n=5+5) Scan/100000/Base16-8 52.6ms ± 0% 52.7ms ± 0% +0.15% (p=0.008 n=5+5) String/10/Base2-8 247ns ± 2% 248ns ± 1% ~ (p=0.437 n=5+5) String/100/Base2-8 1.51µs ± 0% 1.51µs ± 0% -0.37% (p=0.024 n=5+5) String/1000/Base2-8 13.6µs ± 1% 13.5µs ± 0% ~ (p=0.095 n=5+5) String/10000/Base2-8 135µs ± 0% 135µs ± 1% ~ (p=0.841 n=5+5) String/100000/Base2-8 1.32ms ± 1% 1.32ms ± 1% ~ (p=0.690 n=5+5) String/10/Base8-8 169ns ± 1% 169ns ± 1% ~ (p=1.000 n=5+5) String/100/Base8-8 636ns ± 0% 634ns ± 1% ~ (p=0.413 n=5+5) String/1000/Base8-8 5.33µs ± 1% 5.32µs ± 0% ~ (p=0.222 n=5+5) String/10000/Base8-8 50.9µs ± 1% 50.7µs ± 0% ~ (p=0.151 n=5+5) String/100000/Base8-8 500µs ± 1% 497µs ± 0% ~ (p=0.421 n=5+5) String/10/Base10-8 516ns ± 1% 513ns ± 0% -0.62% (p=0.016 n=5+4) String/100/Base10-8 1.97µs ± 0% 1.96µs ± 0% ~ (p=0.667 n=4+5) String/1000/Base10-8 12.5µs ± 0% 11.5µs ± 0% -7.92% (p=0.008 n=5+5) String/10000/Base10-8 57.7µs ± 0% 52.5µs ± 0% -8.93% (p=0.008 n=5+5) String/100000/Base10-8 25.6ms ± 0% 21.6ms ± 0% -15.94% (p=0.008 n=5+5) String/10/Base16-8 150ns ± 1% 149ns ± 0% ~ (p=0.413 n=5+4) String/100/Base16-8 514ns ± 1% 514ns ± 1% ~ (p=0.849 n=5+5) String/1000/Base16-8 4.01µs ± 0% 4.01µs ± 0% ~ (p=0.421 n=5+5) String/10000/Base16-8 37.8µs ± 1% 37.8µs ± 1% ~ (p=0.841 n=5+5) String/100000/Base16-8 373µs ± 2% 373µs ± 0% ~ (p=0.421 n=5+5) LeafSize/0-8 6.63ms ± 0% 6.63ms ± 0% ~ (p=0.730 n=4+5) LeafSize/1-8 74.0µs ± 0% 67.7µs ± 1% -8.53% (p=0.008 n=5+5) LeafSize/2-8 74.2µs ± 0% 68.3µs ± 1% -7.99% (p=0.008 n=5+5) LeafSize/3-8 379µs ± 0% 309µs ± 0% -18.52% (p=0.008 n=5+5) LeafSize/4-8 72.7µs ± 1% 66.7µs ± 0% -8.37% (p=0.008 n=5+5) LeafSize/5-8 471µs ± 0% 384µs ± 0% -18.55% (p=0.008 n=5+5) LeafSize/6-8 378µs ± 0% 308µs ± 0% -18.59% (p=0.008 n=5+5) LeafSize/7-8 245µs ± 0% 204µs ± 1% -16.75% (p=0.008 n=5+5) LeafSize/8-8 73.4µs ± 0% 66.9µs ± 1% -8.79% (p=0.008 n=5+5) LeafSize/9-8 538µs ± 0% 437µs ± 0% -18.75% (p=0.008 n=5+5) LeafSize/10-8 472µs ± 0% 396µs ± 1% -16.01% (p=0.008 n=5+5) LeafSize/11-8 460µs ± 0% 374µs ± 0% -18.58% (p=0.008 n=5+5) LeafSize/12-8 378µs ± 0% 308µs ± 0% -18.38% (p=0.008 n=5+5) LeafSize/13-8 343µs ± 0% 284µs ± 0% -17.30% (p=0.008 n=5+5) LeafSize/14-8 248µs ± 0% 206µs ± 0% -16.94% (p=0.008 n=5+5) LeafSize/15-8 169µs ± 0% 144µs ± 0% -14.69% (p=0.008 n=5+5) LeafSize/16-8 72.9µs ± 0% 66.8µs ± 1% -8.27% (p=0.008 n=5+5) LeafSize/32-8 82.5µs ± 0% 76.7µs ± 0% -7.04% (p=0.008 n=5+5) LeafSize/64-8 134µs ± 0% 129µs ± 0% -3.80% (p=0.008 n=5+5) ProbablyPrime/n=0-8 44.2ms ± 0% 43.4ms ± 0% -1.95% (p=0.008 n=5+5) ProbablyPrime/n=1-8 64.9ms ± 0% 64.0ms ± 0% -1.27% (p=0.008 n=5+5) ProbablyPrime/n=5-8 147ms ± 0% 146ms ± 0% -0.58% (p=0.008 n=5+5) ProbablyPrime/n=10-8 250ms ± 0% 249ms ± 0% -0.35% (p=0.008 n=5+5) ProbablyPrime/n=20-8 456ms ± 0% 455ms ± 0% -0.18% (p=0.008 n=5+5) ProbablyPrime/Lucas-8 23.6ms ± 0% 22.7ms ± 0% -3.74% (p=0.008 n=5+5) ProbablyPrime/MillerRabinBase2-8 20.7ms ± 0% 20.6ms ± 0% ~ (p=0.421 n=5+5) FloatSqrt/64-8 2.25µs ± 1% 2.29µs ± 0% +1.48% (p=0.008 n=5+5) FloatSqrt/128-8 4.86µs ± 1% 4.92µs ± 1% +1.21% (p=0.032 n=5+5) FloatSqrt/256-8 13.6µs ± 0% 13.7µs ± 1% +1.31% (p=0.032 n=5+5) FloatSqrt/1000-8 70.0µs ± 1% 70.1µs ± 0% ~ (p=0.690 n=5+5) FloatSqrt/10000-8 1.92ms ± 0% 1.90ms ± 0% -0.59% (p=0.008 n=5+5) FloatSqrt/100000-8 55.3ms ± 0% 54.8ms ± 0% -1.01% (p=0.008 n=5+5) FloatSqrt/1000000-8 4.56s ± 0% 4.50s ± 0% -1.28% (p=0.008 n=5+5) name old speed new speed delta AddVV/1-8 2.97GB/s ± 0% 5.56GB/s ± 0% +86.85% (p=0.008 n=5+5) AddVV/2-8 9.47GB/s ± 0% 10.66GB/s ± 0% +12.50% (p=0.008 n=5+5) AddVV/3-8 12.4GB/s ± 0% 14.7GB/s ± 0% +19.10% (p=0.008 n=5+5) AddVV/4-8 14.6GB/s ± 0% 18.9GB/s ± 0% +29.63% (p=0.016 n=4+5) AddVV/5-8 16.4GB/s ± 0% 22.0GB/s ± 0% +34.47% (p=0.016 n=5+4) AddVV/10-8 21.7GB/s ± 0% 35.5GB/s ± 0% +63.89% (p=0.008 n=5+5) AddVV/100-8 29.4GB/s ± 0% 68.0GB/s ± 0% +131.38% (p=0.008 n=5+5) AddVV/1000-8 31.7GB/s ± 0% 61.9GB/s ± 0% +95.43% (p=0.008 n=5+5) AddVV/10000-8 31.2GB/s ± 0% 56.4GB/s ± 0% +80.83% (p=0.008 n=5+5) AddVV/100000-8 25.9GB/s ± 3% 41.4GB/s ± 0% +59.98% (p=0.008 n=5+5) SubVV/1-8 2.97GB/s ± 0% 5.56GB/s ± 0% +86.97% (p=0.016 n=4+5) SubVV/2-8 9.47GB/s ± 0% 10.66GB/s ± 0% +12.51% (p=0.008 n=5+5) SubVV/3-8 12.4GB/s ± 0% 14.8GB/s ± 0% +19.23% (p=0.016 n=4+5) SubVV/4-8 14.6GB/s ± 0% 18.9GB/s ± 0% +29.56% (p=0.008 n=5+5) SubVV/5-8 16.4GB/s ± 0% 22.0GB/s ± 0% +34.47% (p=0.016 n=4+5) SubVV/10-8 21.7GB/s ± 0% 35.5GB/s ± 0% +63.89% (p=0.008 n=5+5) SubVV/100-8 29.4GB/s ± 0% 68.0GB/s ± 0% +131.38% (p=0.008 n=5+5) SubVV/1000-8 31.6GB/s ± 0% 80.1GB/s ± 0% +153.08% (p=0.008 n=5+5) SubVV/10000-8 31.2GB/s ± 0% 56.7GB/s ± 0% +81.79% (p=0.008 n=5+5) SubVV/100000-8 29.1GB/s ±10% 29.0GB/s ±18% ~ (p=0.690 n=5+5) AddVW/1-8 859MB/s ± 0% 859MB/s ± 0% -0.01% (p=0.008 n=5+5) AddVW/2-8 811MB/s ± 1% 814MB/s ± 0% ~ (p=0.413 n=5+4) AddVW/3-8 2.08GB/s ± 0% 2.08GB/s ± 0% ~ (p=0.206 n=5+5) AddVW/4-8 2.46GB/s ± 0% 2.46GB/s ± 0% ~ (p=0.056 n=5+5) AddVW/5-8 2.75GB/s ± 0% 2.75GB/s ± 0% ~ (p=0.508 n=5+5) AddVW/10-8 3.63GB/s ± 0% 3.63GB/s ± 0% ~ (p=0.214 n=5+5) AddVW/100-8 4.79GB/s ± 0% 4.79GB/s ± 0% ~ (p=0.500 n=5+5) AddVW/1000-8 5.27GB/s ± 0% 5.25GB/s ± 0% -0.43% (p=0.008 n=5+5) AddVW/10000-8 5.30GB/s ± 0% 5.30GB/s ± 0% ~ (p=0.397 n=5+5) AddVW/100000-8 5.27GB/s ± 1% 5.25GB/s ± 1% ~ (p=0.690 n=5+5) AddMulVVW/1-8 1.92GB/s ± 0% 1.96GB/s ± 1% +1.95% (p=0.008 n=5+5) AddMulVVW/2-8 2.16GB/s ± 1% 2.25GB/s ± 1% +4.32% (p=0.008 n=5+5) AddMulVVW/3-8 2.39GB/s ± 1% 2.25GB/s ± 3% -5.79% (p=0.008 n=5+5) AddMulVVW/4-8 2.00GB/s ± 0% 2.31GB/s ± 1% +15.31% (p=0.008 n=5+5) AddMulVVW/5-8 2.22GB/s ± 0% 2.14GB/s ± 0% -3.86% (p=0.008 n=5+5) AddMulVVW/10-8 2.15GB/s ± 1% 2.25GB/s ± 0% +5.03% (p=0.008 n=5+5) AddMulVVW/100-8 2.09GB/s ± 0% 2.14GB/s ± 0% +2.25% (p=0.008 n=5+5) AddMulVVW/1000-8 2.04GB/s ± 0% 2.38GB/s ± 0% +16.52% (p=0.008 n=5+5) AddMulVVW/10000-8 2.03GB/s ± 0% 2.10GB/s ± 0% +3.64% (p=0.008 n=5+5) AddMulVVW/100000-8 2.02GB/s ± 0% 2.02GB/s ± 1% ~ (p=0.690 n=5+5) Change-Id: Ie482d67a7dbb5af6f5d81af2b3d9d14bd66336db Reviewed-on: https://go-review.googlesource.com/77831 Reviewed-by: Cherry Zhang <cherryyz@google.com> Run-TryBot: Cherry Zhang <cherryyz@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org>	2018-03-06 00:22:08 +00:00
Ilya Tocar	c15984c6c6	math: remove unused variable useSSE41 was used inside asm implementation of floor to select between base and ss4 code path. We intrinsified floor and left asm functions as a backup for non-sse4 systems. This made variable unused, so remove it. Change-Id: Ia2633de7c7cb1ef1d5b15a2366b523e481b722d9 Reviewed-on: https://go-review.googlesource.com/97935 Run-TryBot: Ilya Tocar <ilya.tocar@intel.com> Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org>	2018-03-01 18:51:44 +00:00
erifan01	ed6c6c9c11	math: optimize sinh and cosh Improve performance by reducing unnecessary function calls Benchmarks: Tme old time/op new time/op delta Cosh-8 229ns ± 0% 138ns ± 0% -39.74% (p=0.008 n=5+5) Sinh-8 231ns ± 0% 139ns ± 0% -39.83% (p=0.008 n=5+5) Change-Id: Icab5485849bbfaafca8429d06b67c558101f4f3c Reviewed-on: https://go-review.googlesource.com/85477 Reviewed-by: Robert Griesemer <gri@golang.org>	2018-02-27 04:34:37 +00:00
Ilya Tocar	c3935c08d2	math/big: speed-up addMulVVW on amd64 Use MULX/ADOX/ADCX instructions to speed-up addMulVVW, when they are available. addMulVVW is a hotspot in rsa. This is faster than ADD/ADC/IMUL version, because ADOX/ADCX only modify carry/overflow flag, so they can be interleaved with each other and with MULX, which doesn't modify flags at all. Increasing unroll factor to e. g. 16 makes rsa 1% faster, but 3PrimeRSA2048Decrypt performance falls back to baseline. Updates #20058 AddMulVVW/1-8 3.28ns ± 2% 3.26ns ± 3% ~ (p=0.107 n=10+10) AddMulVVW/2-8 4.26ns ± 2% 4.24ns ± 3% ~ (p=0.327 n=9+9) AddMulVVW/3-8 5.07ns ± 2% 5.26ns ± 2% +3.73% (p=0.000 n=10+10) AddMulVVW/4-8 6.40ns ± 2% 6.50ns ± 2% +1.61% (p=0.000 n=10+10) AddMulVVW/5-8 6.77ns ± 2% 6.86ns ± 1% +1.38% (p=0.001 n=9+9) AddMulVVW/10-8 12.2ns ± 2% 10.6ns ± 3% -13.65% (p=0.000 n=10+10) AddMulVVW/100-8 79.7ns ± 2% 52.4ns ± 1% -34.17% (p=0.000 n=10+10) AddMulVVW/1000-8 695ns ± 1% 491ns ± 2% -29.39% (p=0.000 n=9+10) AddMulVVW/10000-8 7.26µs ± 2% 5.92µs ± 6% -18.42% (p=0.000 n=10+10) AddMulVVW/100000-8 72.6µs ± 2% 62.2µs ± 2% -14.31% (p=0.000 n=10+10) crypto/rsa speed-up is smaller, but stil noticeable: RSA2048Decrypt-8 1.61ms ± 1% 1.38ms ± 1% -14.13% (p=0.000 n=10+10) RSA2048Sign-8 1.93ms ± 1% 1.70ms ± 1% -11.86% (p=0.000 n=10+10) 3PrimeRSA2048Decrypt-8 932µs ± 0% 828µs ± 0% -11.15% (p=0.000 n=10+10) Results on crypto/tls: HandshakeServer/RSA-8 901µs ± 1% 777µs ± 0% -13.70% (p=0.000 n=10+8) HandshakeServer/ECDHE-P256-RSA-8 1.01ms ± 1% 0.90ms ± 0% -11.53% (p=0.000 n=10+9) Full math/big benchmarks: name old time/op new time/op delta AddVV/1-8 3.74ns ± 6% 3.55ns ± 2% ~ (p=0.082 n=10+8) AddVV/2-8 3.96ns ± 2% 3.98ns ± 5% ~ (p=0.794 n=10+9) AddVV/3-8 4.97ns ± 2% 4.94ns ± 1% ~ (p=0.081 n=10+9) AddVV/4-8 5.59ns ± 2% 5.59ns ± 2% ~ (p=0.809 n=10+10) AddVV/5-8 6.63ns ± 1% 6.62ns ± 1% ~ (p=0.560 n=9+10) AddVV/10-8 8.11ns ± 1% 8.11ns ± 2% ~ (p=0.402 n=10+10) AddVV/100-8 46.9ns ± 2% 46.8ns ± 1% ~ (p=0.809 n=10+10) AddVV/1000-8 389ns ± 1% 391ns ± 4% ~ (p=0.809 n=10+10) AddVV/10000-8 5.05µs ± 5% 4.98µs ± 2% ~ (p=0.113 n=9+10) AddVV/100000-8 55.3µs ± 3% 55.2µs ± 3% ~ (p=0.796 n=10+10) AddVW/1-8 3.04ns ± 3% 3.02ns ± 3% ~ (p=0.538 n=10+10) AddVW/2-8 3.57ns ± 2% 3.61ns ± 2% +1.12% (p=0.032 n=9+9) AddVW/3-8 3.77ns ± 1% 3.79ns ± 2% ~ (p=0.719 n=10+10) AddVW/4-8 4.69ns ± 1% 4.69ns ± 2% ~ (p=0.920 n=10+9) AddVW/5-8 4.58ns ± 1% 4.58ns ± 1% ~ (p=0.812 n=10+10) AddVW/10-8 7.62ns ± 2% 7.63ns ± 1% ~ (p=0.926 n=10+10) AddVW/100-8 41.1ns ± 2% 42.4ns ± 3% +3.34% (p=0.000 n=10+10) AddVW/1000-8 386ns ± 2% 389ns ± 4% ~ (p=0.514 n=10+10) AddVW/10000-8 3.88µs ± 3% 3.87µs ± 3% ~ (p=0.448 n=10+10) AddVW/100000-8 41.2µs ± 3% 41.7µs ± 3% ~ (p=0.148 n=10+10) AddMulVVW/1-8 3.28ns ± 2% 3.26ns ± 3% ~ (p=0.107 n=10+10) AddMulVVW/2-8 4.26ns ± 2% 4.24ns ± 3% ~ (p=0.327 n=9+9) AddMulVVW/3-8 5.07ns ± 2% 5.26ns ± 2% +3.73% (p=0.000 n=10+10) AddMulVVW/4-8 6.40ns ± 2% 6.50ns ± 2% +1.61% (p=0.000 n=10+10) AddMulVVW/5-8 6.77ns ± 2% 6.86ns ± 1% +1.38% (p=0.001 n=9+9) AddMulVVW/10-8 12.2ns ± 2% 10.6ns ± 3% -13.65% (p=0.000 n=10+10) AddMulVVW/100-8 79.7ns ± 2% 52.4ns ± 1% -34.17% (p=0.000 n=10+10) AddMulVVW/1000-8 695ns ± 1% 491ns ± 2% -29.39% (p=0.000 n=9+10) AddMulVVW/10000-8 7.26µs ± 2% 5.92µs ± 6% -18.42% (p=0.000 n=10+10) AddMulVVW/100000-8 72.6µs ± 2% 62.2µs ± 2% -14.31% (p=0.000 n=10+10) DecimalConversion-8 108µs ±19% 104µs ± 4% ~ (p=0.460 n=10+8) FloatString/100-8 926ns ±14% 908ns ± 5% ~ (p=0.398 n=9+9) FloatString/1000-8 25.7µs ± 1% 25.7µs ± 1% ~ (p=0.739 n=10+10) FloatString/10000-8 2.13ms ± 1% 2.12ms ± 1% ~ (p=0.353 n=10+10) FloatString/100000-8 207ms ± 1% 206ms ± 2% ~ (p=0.912 n=10+10) FloatAdd/10-8 61.3ns ± 3% 61.9ns ± 3% ~ (p=0.183 n=10+10) FloatAdd/100-8 62.0ns ± 2% 62.9ns ± 4% ~ (p=0.118 n=10+10) FloatAdd/1000-8 84.7ns ± 2% 84.4ns ± 1% ~ (p=0.591 n=10+10) FloatAdd/10000-8 305ns ± 2% 306ns ± 1% ~ (p=0.443 n=10+10) FloatAdd/100000-8 2.45µs ± 1% 2.46µs ± 1% ~ (p=0.782 n=10+10) FloatSub/10-8 56.8ns ± 4% 56.5ns ± 5% ~ (p=0.423 n=10+10) FloatSub/100-8 57.3ns ± 4% 57.1ns ± 5% ~ (p=0.540 n=10+10) FloatSub/1000-8 66.8ns ± 4% 66.6ns ± 1% ~ (p=0.868 n=10+10) FloatSub/10000-8 199ns ± 1% 198ns ± 1% ~ (p=0.287 n=10+9) FloatSub/100000-8 1.47µs ± 2% 1.47µs ± 2% ~ (p=0.920 n=10+9) ParseFloatSmallExp-8 8.74µs ±10% 9.48µs ±10% +8.51% (p=0.010 n=9+10) ParseFloatLargeExp-8 39.2µs ±25% 39.6µs ±12% ~ (p=0.529 n=10+10) GCD10x10/WithoutXY-8 173ns ±23% 177ns ±20% ~ (p=0.698 n=10+10) GCD10x10/WithXY-8 736ns ±12% 728ns ±16% ~ (p=0.838 n=10+10) GCD10x100/WithoutXY-8 325ns ±16% 326ns ±14% ~ (p=0.912 n=10+10) GCD10x100/WithXY-8 1.14µs ±13% 1.16µs ± 6% ~ (p=0.287 n=10+9) GCD10x1000/WithoutXY-8 851ns ±25% 820ns ±12% ~ (p=0.592 n=10+10) GCD10x1000/WithXY-8 2.89µs ±17% 2.85µs ± 5% ~ (p=1.000 n=10+9) GCD10x10000/WithoutXY-8 6.66µs ±12% 6.82µs ±19% ~ (p=0.529 n=10+10) GCD10x10000/WithXY-8 18.0µs ± 5% 17.2µs ±19% ~ (p=0.315 n=7+10) GCD10x100000/WithoutXY-8 77.8µs ±18% 73.3µs ±11% ~ (p=0.315 n=10+9) GCD10x100000/WithXY-8 186µs ±14% 204µs ±29% ~ (p=0.218 n=10+10) GCD100x100/WithoutXY-8 1.09µs ± 1% 1.09µs ± 2% ~ (p=0.117 n=9+10) GCD100x100/WithXY-8 7.93µs ± 1% 7.97µs ± 1% +0.52% (p=0.006 n=10+10) GCD100x1000/WithoutXY-8 2.00µs ± 3% 2.04µs ± 6% ~ (p=0.053 n=9+10) GCD100x1000/WithXY-8 9.23µs ± 1% 9.29µs ± 1% +0.63% (p=0.009 n=10+10) GCD100x10000/WithoutXY-8 10.2µs ±11% 9.7µs ± 6% ~ (p=0.278 n=10+9) GCD100x10000/WithXY-8 33.3µs ± 4% 33.6µs ± 4% ~ (p=0.481 n=10+10) GCD100x100000/WithoutXY-8 106µs ±17% 105µs ±13% ~ (p=0.853 n=10+10) GCD100x100000/WithXY-8 289µs ±17% 276µs ± 8% ~ (p=0.353 n=10+10) GCD1000x1000/WithoutXY-8 12.2µs ± 1% 12.1µs ± 1% -0.45% (p=0.007 n=10+10) GCD1000x1000/WithXY-8 131µs ± 1% 132µs ± 0% +0.93% (p=0.000 n=9+7) GCD1000x10000/WithoutXY-8 20.6µs ± 2% 20.6µs ± 1% ~ (p=0.326 n=10+9) GCD1000x10000/WithXY-8 238µs ± 1% 237µs ± 1% ~ (p=0.356 n=9+10) GCD1000x100000/WithoutXY-8 117µs ± 8% 114µs ±11% ~ (p=0.190 n=10+10) GCD1000x100000/WithXY-8 1.51ms ± 1% 1.50ms ± 1% ~ (p=0.053 n=9+10) GCD10000x10000/WithoutXY-8 220µs ± 1% 218µs ± 1% -0.86% (p=0.000 n=10+10) GCD10000x10000/WithXY-8 3.04ms ± 0% 3.05ms ± 0% +0.33% (p=0.001 n=9+10) GCD10000x100000/WithoutXY-8 513µs ± 0% 511µs ± 0% -0.38% (p=0.000 n=10+10) GCD10000x100000/WithXY-8 15.1ms ± 0% 15.0ms ± 0% ~ (p=0.053 n=10+9) GCD100000x100000/WithoutXY-8 10.4ms ± 1% 10.4ms ± 2% ~ (p=0.258 n=9+9) GCD100000x100000/WithXY-8 205ms ± 1% 205ms ± 1% ~ (p=0.481 n=10+10) Hilbert-8 1.25ms ±15% 1.24ms ±17% ~ (p=0.853 n=10+10) Binomial-8 3.03µs ±24% 2.90µs ±16% ~ (p=0.481 n=10+10) QuoRem-8 1.95µs ± 1% 1.95µs ± 2% ~ (p=0.117 n=9+10) Exp-8 5.12ms ± 2% 3.99ms ± 1% -22.02% (p=0.000 n=10+9) Exp2-8 5.14ms ± 2% 3.98ms ± 0% -22.55% (p=0.000 n=10+9) Bitset-8 16.4ns ± 2% 16.5ns ± 2% ~ (p=0.311 n=9+10) BitsetNeg-8 46.3ns ± 4% 45.8ns ± 4% ~ (p=0.272 n=10+10) BitsetOrig-8 250ns ±19% 247ns ±14% ~ (p=0.671 n=10+10) BitsetNegOrig-8 416ns ±14% 429ns ±14% ~ (p=0.353 n=10+10) ModSqrt225_Tonelli-8 400µs ± 0% 320µs ± 0% -19.88% (p=0.000 n=9+7) ModSqrt224_3Mod4-8 123µs ± 1% 97µs ± 0% -21.21% (p=0.000 n=9+10) ModSqrt5430_Tonelli-8 1.87s ± 0% 1.39s ± 1% -25.70% (p=0.000 n=9+10) ModSqrt5430_3Mod4-8 630ms ± 2% 465ms ± 1% -26.12% (p=0.000 n=10+10) Sqrt-8 25.8µs ± 1% 25.9µs ± 0% +0.66% (p=0.002 n=10+8) IntSqr/1-8 11.3ns ± 1% 11.3ns ± 2% ~ (p=0.360 n=9+10) IntSqr/2-8 26.6ns ± 1% 27.4ns ± 2% +2.87% (p=0.000 n=8+9) IntSqr/3-8 36.5ns ± 6% 36.6ns ± 5% ~ (p=0.589 n=10+10) IntSqr/5-8 57.2ns ± 2% 57.8ns ± 1% +0.92% (p=0.045 n=10+9) IntSqr/8-8 112ns ± 1% 93ns ± 1% -16.60% (p=0.000 n=10+10) IntSqr/10-8 148ns ± 1% 129ns ± 5% -12.85% (p=0.000 n=10+10) IntSqr/20-8 642ns ±28% 692ns ±21% ~ (p=0.105 n=10+10) IntSqr/30-8 1.03µs ±18% 1.06µs ±15% ~ (p=0.422 n=10+8) IntSqr/50-8 2.33µs ±14% 2.14µs ±20% ~ (p=0.063 n=10+10) IntSqr/80-8 4.06µs ±13% 3.72µs ±14% -8.31% (p=0.029 n=10+10) IntSqr/100-8 5.79µs ±10% 5.20µs ±18% -10.15% (p=0.004 n=10+10) IntSqr/200-8 17.1µs ± 1% 12.9µs ± 3% -24.44% (p=0.000 n=10+10) IntSqr/300-8 35.9µs ± 0% 26.6µs ± 1% -25.75% (p=0.000 n=10+10) IntSqr/500-8 84.9µs ± 0% 71.7µs ± 1% -15.49% (p=0.000 n=10+10) IntSqr/800-8 170µs ± 1% 142µs ± 2% -16.73% (p=0.000 n=10+10) IntSqr/1000-8 258µs ± 1% 218µs ± 1% -15.65% (p=0.000 n=10+10) Mul-8 10.4ms ± 1% 8.3ms ± 0% -20.05% (p=0.000 n=10+9) Exp3Power/0x10-8 311ns ±15% 321ns ±24% ~ (p=0.447 n=10+10) Exp3Power/0x40-8 358ns ±21% 346ns ±37% ~ (p=0.591 n=10+10) Exp3Power/0x100-8 611ns ±19% 570ns ±27% ~ (p=0.393 n=10+10) Exp3Power/0x400-8 1.31µs ±26% 1.34µs ±19% ~ (p=0.853 n=10+10) Exp3Power/0x1000-8 6.76µs ±23% 6.22µs ±16% ~ (p=0.095 n=10+9) Exp3Power/0x4000-8 37.6µs ±14% 36.4µs ±21% ~ (p=0.247 n=10+10) Exp3Power/0x10000-8 345µs ±14% 310µs ±11% -9.99% (p=0.005 n=10+10) Exp3Power/0x40000-8 2.77ms ± 1% 2.34ms ± 1% -15.47% (p=0.000 n=10+10) Exp3Power/0x100000-8 25.1ms ± 1% 21.3ms ± 1% -15.26% (p=0.000 n=10+10) Exp3Power/0x400000-8 225ms ± 1% 190ms ± 1% -15.61% (p=0.000 n=10+10) Fibo-8 23.4ms ± 1% 23.3ms ± 0% ~ (p=0.052 n=10+10) NatSqr/1-8 58.4ns ±24% 59.8ns ±38% ~ (p=0.739 n=10+10) NatSqr/2-8 122ns ±21% 122ns ±16% ~ (p=0.896 n=10+10) NatSqr/3-8 140ns ±28% 148ns ±30% ~ (p=0.288 n=10+10) NatSqr/5-8 193ns ±29% 210ns ±34% ~ (p=0.469 n=10+10) NatSqr/8-8 317ns ±21% 296ns ±25% ~ (p=0.393 n=10+10) NatSqr/10-8 362ns ± 8% 373ns ±30% ~ (p=0.617 n=9+10) NatSqr/20-8 1.24µs ±16% 1.06µs ±29% -14.57% (p=0.019 n=10+10) NatSqr/30-8 1.90µs ±32% 1.71µs ±10% ~ (p=0.176 n=10+9) NatSqr/50-8 4.22µs ±19% 3.67µs ± 7% -13.03% (p=0.017 n=10+9) NatSqr/80-8 7.33µs ±20% 6.50µs ±15% -11.26% (p=0.009 n=10+10) NatSqr/100-8 9.84µs ±18% 9.33µs ± 8% ~ (p=0.280 n=10+10) NatSqr/200-8 21.4µs ± 7% 20.0µs ±14% ~ (p=0.075 n=10+10) NatSqr/300-8 38.0µs ± 2% 31.3µs ±10% -17.63% (p=0.000 n=10+10) NatSqr/500-8 102µs ± 5% 101µs ± 4% ~ (p=0.780 n=9+10) NatSqr/800-8 190µs ± 3% 166µs ± 6% -12.29% (p=0.000 n=10+10) NatSqr/1000-8 277µs ± 2% 245µs ± 6% -11.64% (p=0.000 n=10+10) ScanPi-8 144µs ±23% 149µs ±24% ~ (p=0.579 n=10+10) StringPiParallel-8 25.6µs ± 0% 25.8µs ± 0% +0.69% (p=0.000 n=9+10) Scan/10/Base2-8 305ns ± 1% 309ns ± 1% +1.32% (p=0.000 n=10+9) Scan/100/Base2-8 1.95µs ± 1% 1.98µs ± 1% +1.10% (p=0.000 n=10+10) Scan/1000/Base2-8 19.5µs ± 1% 19.7µs ± 1% +1.39% (p=0.000 n=10+10) Scan/10000/Base2-8 270µs ± 1% 272µs ± 1% +0.58% (p=0.024 n=9+9) Scan/100000/Base2-8 10.3ms ± 0% 10.3ms ± 0% +0.16% (p=0.022 n=9+10) Scan/10/Base8-8 146ns ± 4% 154ns ± 4% +5.57% (p=0.000 n=9+9) Scan/100/Base8-8 748ns ± 1% 759ns ± 1% +1.51% (p=0.000 n=9+10) Scan/1000/Base8-8 7.88µs ± 1% 8.00µs ± 1% +1.64% (p=0.000 n=10+10) Scan/10000/Base8-8 155µs ± 1% 155µs ± 1% ~ (p=0.968 n=10+9) Scan/100000/Base8-8 9.11ms ± 0% 9.11ms ± 0% ~ (p=0.604 n=9+10) Scan/10/Base10-8 140ns ± 5% 149ns ± 5% +6.39% (p=0.000 n=9+10) Scan/100/Base10-8 680ns ± 0% 688ns ± 1% +1.08% (p=0.000 n=9+10) Scan/1000/Base10-8 7.09µs ± 1% 7.16µs ± 1% +0.98% (p=0.019 n=10+10) Scan/10000/Base10-8 149µs ± 3% 150µs ± 3% ~ (p=0.143 n=10+10) Scan/100000/Base10-8 9.16ms ± 0% 9.16ms ± 0% ~ (p=0.661 n=10+9) Scan/10/Base16-8 134ns ± 5% 135ns ± 3% ~ (p=0.505 n=9+9) Scan/100/Base16-8 560ns ± 1% 563ns ± 0% +0.67% (p=0.000 n=10+8) Scan/1000/Base16-8 6.28µs ± 1% 6.26µs ± 1% ~ (p=0.448 n=10+10) Scan/10000/Base16-8 161µs ± 1% 162µs ± 1% +0.74% (p=0.008 n=9+9) Scan/100000/Base16-8 9.64ms ± 0% 9.64ms ± 0% ~ (p=0.436 n=10+10) String/10/Base2-8 116ns ±12% 118ns ±13% ~ (p=0.645 n=10+10) String/100/Base2-8 871ns ±23% 860ns ±22% ~ (p=0.699 n=10+10) String/1000/Base2-8 10.0µs ±20% 10.0µs ±23% ~ (p=0.853 n=10+10) String/10000/Base2-8 110µs ±21% 120µs ±25% ~ (p=0.436 n=10+10) String/100000/Base2-8 768µs ±11% 733µs ±16% ~ (p=0.393 n=10+10) String/10/Base8-8 51.3ns ± 1% 51.0ns ± 3% ~ (p=0.286 n=9+9) String/100/Base8-8 284ns ± 9% 272ns ±12% ~ (p=0.267 n=9+10) String/1000/Base8-8 3.06µs ± 9% 3.04µs ±10% ~ (p=0.739 n=10+10) String/10000/Base8-8 36.1µs ±14% 35.1µs ± 9% ~ (p=0.447 n=10+9) String/100000/Base8-8 371µs ±12% 373µs ±16% ~ (p=0.739 n=10+10) String/10/Base10-8 167ns ±11% 165ns ± 9% ~ (p=0.781 n=10+10) String/100/Base10-8 727ns ± 1% 740ns ± 2% +1.70% (p=0.001 n=10+10) String/1000/Base10-8 5.30µs ±18% 5.37µs ±14% ~ (p=0.631 n=10+10) String/10000/Base10-8 45.0µs ±14% 44.6µs ±10% ~ (p=0.720 n=9+10) String/100000/Base10-8 5.10ms ± 1% 5.05ms ± 3% ~ (p=0.211 n=9+10) String/10/Base16-8 47.7ns ± 6% 47.7ns ± 6% ~ (p=0.985 n=10+10) String/100/Base16-8 221ns ±10% 234ns ±27% ~ (p=0.541 n=10+10) String/1000/Base16-8 2.23µs ±11% 2.12µs ± 8% -4.81% (p=0.029 n=9+8) String/10000/Base16-8 28.3µs ±21% 28.5µs ±14% ~ (p=0.796 n=10+10) String/100000/Base16-8 291µs ±16% 293µs ±15% ~ (p=0.931 n=9+9) LeafSize/0-8 2.43ms ± 1% 2.49ms ± 1% +2.56% (p=0.000 n=10+10) LeafSize/1-8 49.7µs ± 9% 46.3µs ±16% -6.78% (p=0.017 n=10+9) LeafSize/2-8 48.4µs ±18% 46.3µs ±19% ~ (p=0.436 n=10+10) LeafSize/3-8 81.7µs ± 3% 80.9µs ± 3% ~ (p=0.278 n=10+9) LeafSize/4-8 47.0µs ± 7% 47.9µs ±13% ~ (p=0.905 n=9+10) LeafSize/5-8 96.8µs ± 1% 97.3µs ± 2% ~ (p=0.515 n=8+10) LeafSize/6-8 82.5µs ± 4% 80.9µs ± 2% -1.92% (p=0.019 n=10+10) LeafSize/7-8 67.2µs ±13% 66.6µs ± 9% ~ (p=0.842 n=10+9) LeafSize/8-8 46.0µs ±28% 45.1µs ±12% ~ (p=0.739 n=10+10) LeafSize/9-8 111µs ± 1% 111µs ± 1% ~ (p=0.739 n=10+10) LeafSize/10-8 98.8µs ± 4% 97.9µs ± 3% ~ (p=0.278 n=10+9) LeafSize/11-8 96.8µs ± 1% 96.4µs ± 1% ~ (p=0.211 n=9+10) LeafSize/12-8 81.0µs ± 4% 81.3µs ± 3% ~ (p=0.579 n=10+10) LeafSize/13-8 79.7µs ± 5% 79.2µs ± 3% ~ (p=0.661 n=10+9) LeafSize/14-8 67.6µs ±12% 65.8µs ± 7% ~ (p=0.447 n=10+9) LeafSize/15-8 63.9µs ±17% 66.3µs ±14% ~ (p=0.481 n=10+10) LeafSize/16-8 44.0µs ±28% 46.0µs ±27% ~ (p=0.481 n=10+10) LeafSize/32-8 46.2µs ±13% 43.5µs ±18% ~ (p=0.156 n=9+10) LeafSize/64-8 53.3µs ±10% 53.0µs ±19% ~ (p=0.730 n=9+9) ProbablyPrime/n=0-8 3.60ms ± 1% 3.39ms ± 1% -5.87% (p=0.000 n=10+9) ProbablyPrime/n=1-8 4.42ms ± 1% 4.08ms ± 1% -7.69% (p=0.000 n=10+10) ProbablyPrime/n=5-8 7.57ms ± 2% 6.79ms ± 1% -10.24% (p=0.000 n=10+10) ProbablyPrime/n=10-8 11.6ms ± 2% 10.2ms ± 1% -11.69% (p=0.000 n=10+10) ProbablyPrime/n=20-8 19.4ms ± 2% 16.9ms ± 2% -12.89% (p=0.000 n=10+10) ProbablyPrime/Lucas-8 2.81ms ± 2% 2.72ms ± 1% -3.22% (p=0.000 n=10+9) ProbablyPrime/MillerRabinBase2-8 797µs ± 1% 680µs ± 1% -14.64% (p=0.000 n=10+10) name old speed new speed delta AddVV/1-8 17.1GB/s ± 6% 18.0GB/s ± 2% ~ (p=0.122 n=10+8) AddVV/2-8 32.4GB/s ± 2% 32.2GB/s ± 4% ~ (p=0.661 n=10+9) AddVV/3-8 38.6GB/s ± 2% 38.9GB/s ± 1% ~ (p=0.113 n=10+9) AddVV/4-8 45.8GB/s ± 2% 45.8GB/s ± 2% ~ (p=0.796 n=10+10) AddVV/5-8 48.1GB/s ± 2% 48.3GB/s ± 1% ~ (p=0.315 n=10+10) AddVV/10-8 78.9GB/s ± 1% 78.9GB/s ± 2% ~ (p=0.353 n=10+10) AddVV/100-8 136GB/s ± 2% 137GB/s ± 1% ~ (p=0.971 n=10+10) AddVV/1000-8 164GB/s ± 1% 164GB/s ± 4% ~ (p=0.853 n=10+10) AddVV/10000-8 126GB/s ± 6% 129GB/s ± 2% ~ (p=0.063 n=10+10) AddVV/100000-8 116GB/s ± 3% 116GB/s ± 3% ~ (p=0.796 n=10+10) AddVW/1-8 2.64GB/s ± 3% 2.64GB/s ± 3% ~ (p=0.579 n=10+10) AddVW/2-8 4.49GB/s ± 2% 4.44GB/s ± 2% -1.09% (p=0.040 n=9+9) AddVW/3-8 6.36GB/s ± 1% 6.34GB/s ± 2% ~ (p=0.684 n=10+10) AddVW/4-8 6.83GB/s ± 1% 6.82GB/s ± 2% ~ (p=0.905 n=10+9) AddVW/5-8 8.75GB/s ± 1% 8.73GB/s ± 1% ~ (p=0.796 n=10+10) AddVW/10-8 10.5GB/s ± 2% 10.5GB/s ± 1% ~ (p=0.971 n=10+10) AddVW/100-8 19.5GB/s ± 2% 18.9GB/s ± 2% -3.22% (p=0.000 n=10+10) AddVW/1000-8 20.7GB/s ± 2% 20.6GB/s ± 4% ~ (p=0.631 n=10+10) AddVW/10000-8 20.6GB/s ± 3% 20.7GB/s ± 3% ~ (p=0.481 n=10+10) AddVW/100000-8 19.4GB/s ± 2% 19.2GB/s ± 3% ~ (p=0.165 n=10+10) AddMulVVW/1-8 19.5GB/s ± 2% 19.7GB/s ± 3% ~ (p=0.123 n=10+10) AddMulVVW/2-8 30.1GB/s ± 2% 30.2GB/s ± 3% ~ (p=0.297 n=9+9) AddMulVVW/3-8 37.9GB/s ± 2% 36.5GB/s ± 2% -3.63% (p=0.000 n=10+10) AddMulVVW/4-8 40.0GB/s ± 2% 39.4GB/s ± 2% -1.58% (p=0.001 n=10+10) AddMulVVW/5-8 47.3GB/s ± 2% 46.6GB/s ± 1% -1.35% (p=0.001 n=9+9) AddMulVVW/10-8 52.3GB/s ± 2% 60.6GB/s ± 3% +15.76% (p=0.000 n=10+10) AddMulVVW/100-8 80.3GB/s ± 2% 122.1GB/s ± 1% +51.92% (p=0.000 n=10+10) AddMulVVW/1000-8 92.0GB/s ± 1% 130.3GB/s ± 2% +41.61% (p=0.000 n=9+10) AddMulVVW/10000-8 88.2GB/s ± 2% 108.2GB/s ± 5% +22.66% (p=0.000 n=10+10) AddMulVVW/100000-8 88.2GB/s ± 2% 102.9GB/s ± 2% +16.69% (p=0.000 n=10+10) Change-Id: Ic98e30c91d437d845fed03e07e976c3fdbf02b36 Reviewed-on: https://go-review.googlesource.com/74851 Run-TryBot: Ilya Tocar <ilya.tocar@intel.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Adam Langley <agl@golang.org>	2018-02-24 00:13:03 +00:00
Shawn Smith	d3beea8c52	all: fix misspellings GitHub-Last-Rev: `468df242d0` GitHub-Pull-Request: golang/go#23935 Change-Id: If751ce3ffa3a4d5e00a3138211383d12cb6b23fc Reviewed-on: https://go-review.googlesource.com/95577 Run-TryBot: Andrew Bonventre <andybons@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Andrew Bonventre <andybons@golang.org>	2018-02-20 21:02:58 +00:00
Alberto Donizetti	331092c58f	math/big: fix %s verbs in Float tests error messages Fatalf calls in two Float tests use the %s verb with Floats values, which is not allowed and results in failure messages that look like this: float_test.go:1385: i = 0, prec = 1, ToZero: %!s(big.Float=1) [0] / %!s(big.Float=1) [0] = %!s(big.Float=0.0625) want %!s(big.Float=1) Switch to %v. Change-Id: Ifdc80bf19c91ca1b190f6551a6d0a51b42ed5919 Reviewed-on: https://go-review.googlesource.com/87199 Run-TryBot: Alberto Donizetti <alb.donizetti@gmail.com> Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>	2018-02-14 09:50:19 +00:00
Thanabodee Charoenpiriyakij	1124fa300b	math: use Abs rather than if x < 0 { x = -x } This is the benchmark result base on darwin with amd64 architecture: name old time/op new time/op delta Cos 10.2ns ± 2% 10.3ns ± 3% +1.18% (p=0.032 n=10+10) Cosh 25.3ns ± 3% 24.6ns ± 2% -3.00% (p=0.000 n=10+10) Hypot 6.40ns ± 2% 6.19ns ± 3% -3.36% (p=0.000 n=10+10) HypotGo 7.16ns ± 3% 6.54ns ± 2% -8.66% (p=0.000 n=10+10) J0 66.0ns ± 2% 63.7ns ± 1% -3.42% (p=0.000 n=9+10) Fixes #21812 Change-Id: I2b88fbdfc250cd548f8f08b44ce2eb172dcacf43 Reviewed-on: https://go-review.googlesource.com/84437 Reviewed-by: Giovanni Bajo <rasky@develer.com> Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org> Reviewed-by: Keith Randall <khr@golang.org> Run-TryBot: Giovanni Bajo <rasky@develer.com> TryBot-Result: Gobot Gobot <gobot@golang.org>	2018-02-13 20:12:23 +00:00
Paul PISCUC	3526c40979	math/rand: typo fixed in documentation of seedPos In the comment of seedPost, the word: condiiton was changed to: condition Change-Id: I8967cc0e9f5d37776bada96cc1443c8bf46e1117 Reviewed-on: https://go-review.googlesource.com/86156 Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>	2018-01-04 20:27:29 +00:00
Brian Kessler	5305bdd86b	math: correct result for Pow(x, ±.5) Fixes #23224 The previous Pow code had an optimization for powers equal to ±0.5 that used Sqrt for increased accuracy/speed. This caused special cases involving powers of ±0.5 to disagree with the Pow spec. This change places the Sqrt optimization after all of the special case handling. Change-Id: I6bf757f6248256b29cc21725a84e27705d855369 Reviewed-on: https://go-review.googlesource.com/85660 Reviewed-by: Robert Griesemer <gri@golang.org> Run-TryBot: Robert Griesemer <gri@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org>	2018-01-02 18:10:43 +00:00
Ilya Tocar	1992893307	math: remove asm version of Dim Dim performance has regressed by 14% vs 1.9 on amd64. Current pure go version of Dim is faster and, what is even more important for performance, is inlinable, so instead of tweaking asm implementation, just remove it. I had to update BenchmarkDim, because it was simply reloading constant(answer) in a loop. Perf data below: name old time/op new time/op delta Dim-6 6.79ns ± 0% 1.60ns ± 1% -76.39% (p=0.000 n=7+10) If I modify benchmark to be the same as in this CL results are even better: name old time/op new time/op delta Dim-6 10.2ns ± 0% 1.6ns ± 1% -84.27% (p=0.000 n=8+10) Updates #21913 Change-Id: I00e23c8affc293531e1d9f0e0e49f3a525634f53 Reviewed-on: https://go-review.googlesource.com/80695 Run-TryBot: Ilya Tocar <ilya.tocar@intel.com> Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org> Reviewed-by: Ian Lance Taylor <iant@golang.org> Reviewed-by: Cherry Zhang <cherryyz@google.com>	2017-11-30 21:00:33 +00:00
Alberto Donizetti	ff534e2130	math/big: protect against aliasing in nat.divLarge In nat.divLarge (having signature (z nat).divLarge(u, uIn, v nat)), we check whether z aliases uIn or v, but aliasing is currently not checked for the u parameter. Unfortunately, z and u aliasing each other can in some cases cause errors in the computation. The q return parameter (which will hold the result's quotient), is unconditionally initialized as q = z.make(m + 1) When cap(z) ≥ m+1, z.make() will reuse z's backing array, causing q and z to share the same backing array. If then z aliases u, setting q during the quotient computation will then corrupt u, which at that point already holds computation state. To fix this, we add an alias(z, u) check at the beginning of the function, taking care of aliasing the same way we already do for uIn and v. Fixes #22830 Change-Id: I3ab81120d5af6db7772a062bb1dfc011de91f7ad Reviewed-on: https://go-review.googlesource.com/78995 Run-TryBot: Alberto Donizetti <alb.donizetti@gmail.com> Run-TryBot: Robert Griesemer <gri@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Robert Griesemer <gri@golang.org>	2017-11-30 20:36:54 +00:00
Vladimir Stefanovic	2708da0dc1	runtime/cgo, math: don't use FP instructions for soft-float mips{,le} Updates #18162 Change-Id: I591fcf71a02678a99a56a6487da9689d3c9b1bb6 Reviewed-on: https://go-review.googlesource.com/37955 Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Cherry Zhang <cherryyz@google.com>	2017-11-30 17:12:32 +00:00
Brian Kessler	802a8f88a3	math/cmplx: use signed zero to correct branch cuts Branch cuts for the elementary complex functions along real or imaginary axes should be resolved in floating point calculations by one-sided continuity with signed zero as described in: "Branch Cuts for Complex Elementary Functions or Much Ado About Nothing's Sign Bit" W. Kahan Available at: https://people.freebsd.org/~das/kahan86branch.pdf And as described in the C99 standard which is claimed as the original cephes source. Sqrt did not return the correct branch when imag(x) == 0. The branch is now determined by sign(imag(x)). This incorrect branch choice was affecting the behavior of the Trigonometric/Hyperbolic functions that use Sqrt in intermediate calculations. Asin, Asinh and Atan had spurious domain checks, whereas the functions should be valid over the whole complex plane with appropriate branch cuts. Fixes #6888 Change-Id: I9b1278af54f54bfb4208276ae345bbd3ddf3ec83 Reviewed-on: https://go-review.googlesource.com/46492 Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Russ Cox <rsc@golang.org>	2017-11-27 07:44:00 +00:00
Russ Cox	d9a198c7d0	Revert "math/rand: make Perm match Shuffle" This reverts CL 55972. Reason for revert: this changes Perm's behavior unnecessarily. I asked for this change originally but I now regret it. Reverting so that I don't have to justify it in Go 1.10 release notes. Edited to keep the change to rand_test.go, which seems to have been mostly unrelated. Fixes #22744. Change-Id: If8bb1bcde3ced0db2fdcd0aa65ab128613686c66 Reviewed-on: https://go-review.googlesource.com/78195 Run-TryBot: Russ Cox <rsc@golang.org> Reviewed-by: Emmanuel Odeke <emm.odeke@gmail.com> Reviewed-by: Rob Pike <r@golang.org> Reviewed-by: Joe Tsai <thebrokentoaster@gmail.com> Reviewed-by: Ian Lance Taylor <iant@golang.org>	2017-11-16 16:24:30 +00:00
Daniel Martí	a265f2e90e	go/printer: indent lone comments in composite lits If a composite literal contains any comments on their own lines without any elements, the printer would unindent the comments. The comments in this edge case are written when the closing '}' is written. Indent and outdent first so that the indentation is interspersed before the comment is written. Also note that the go/printer golden tests don't show the exact same behaviour that gofmt does. Added a TODO to figure this out in a separate CL. While at it, ensure that the tree conforms to gofmt. The changes are unrelated to this indentation fix, however. Fixes #22355. Change-Id: I5ac25ac6de95a236f1e123479127cc4dd71e93fe Reviewed-on: https://go-review.googlesource.com/74232 Run-TryBot: Daniel Martí <mvdan@mvdan.cc> Reviewed-by: Robert Griesemer <gri@golang.org>	2017-11-15 18:48:48 +00:00
Brian Kessler	2955a8a6cc	math/big: clarify comment on lehmerGCD overflow A clarifying comment was added to indicate that overflow of a single Word is not possible in the single digit calculation. Lehmer's paper includes a proof of the bounds on the size of the cosequences (u0, u1, u2, v0, v1, v2). Change-Id: I98127a07aa8f8fe44814b74b2bc6ff720805194b Reviewed-on: https://go-review.googlesource.com/77451 Reviewed-by: Robert Griesemer <gri@golang.org>	2017-11-14 17:32:39 +00:00
Filippo Valsorda	ef0e2af7b0	math/big: add security warning to (*Int).Rand Change-Id: I22a67733aa2d07298e124077654c9b1473802100 Reviewed-on: https://go-review.googlesource.com/76012 Reviewed-by: Aliaksandr Valialkin <valyala@gmail.com> Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>	2017-11-06 15:55:31 +00:00
Tobias Klauser	89bcbf40b8	math/bits: add examples for right rotation Right rotation is achieved using negative k in RotateLeft*(x, k). Add examples demonstrating that functionality. Change-Id: I15dab159accd2937cb18d3fa8ca32da8501567d3 Reviewed-on: https://go-review.googlesource.com/75371 Run-TryBot: Tobias Klauser <tobias.klauser@gmail.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Robert Griesemer <gri@golang.org>	2017-11-03 20:12:07 +00:00
Lynn Boger	3860478b42	math: implement asm modf for ppc64x This change adds an asm implementations modf for ppc64x. Improvements: BenchmarkModf-16 7.48 6.26 -16.31% Updates: #21390 Change-Id: I9c4f3213688e3e8842d050840dc04fc9c0bf6ce4 Reviewed-on: https://go-review.googlesource.com/74411 Run-TryBot: Lynn Boger <laboger@linux.vnet.ibm.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Michael Munday <mike.munday@ibm.com>	2017-11-02 13:24:32 +00:00
griesemer	85c32c3744	math/big: implement CmpAbs Fixes #22473. Change-Id: Ie886dfc8b5510970d6d63ca6472c73325f6f2276 Reviewed-on: https://go-review.googlesource.com/74971 Run-TryBot: Robert Griesemer <gri@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Martin Möhrmann <moehrmann@google.com>	2017-11-01 18:17:49 +00:00
Alberto Donizetti	856dccb175	math/big: avoid unnecessary Newton iteration in Float.Sqrt An initial draft of the Newton code for Float.Sqrt was structured like this: for condition // do Newton iteration.. prec = 2 since prec, at the end of the loop, was double the precision used in the last Newton iteration, the termination condition was set to 2limit. The code was later rewritten in the form for condition prec = 2 // do Newton iteration.. but condition was not updated, and it's still 2limit, which is about double what we actually need, and is triggering the execution of an additional, and unnecessary, Newton iteration. This change adjusts the Newton termination condition to the (correct) value of z.prec, plus 32 guard bits as a safety margin. name old time/op new time/op delta FloatSqrt/64-4 798ns ± 3% 802ns ± 3% ~ (p=0.458 n=8+8) FloatSqrt/128-4 1.65µs ± 1% 1.65µs ± 1% ~ (p=0.290 n=8+8) FloatSqrt/256-4 3.10µs ± 1% 2.10µs ± 0% -32.32% (p=0.000 n=8+7) FloatSqrt/1000-4 8.83µs ± 1% 4.91µs ± 2% -44.39% (p=0.000 n=8+8) FloatSqrt/10000-4 107µs ± 1% 40µs ± 1% -62.68% (p=0.000 n=8+8) FloatSqrt/100000-4 2.91ms ± 1% 0.96ms ± 1% -67.13% (p=0.000 n=8+8) FloatSqrt/1000000-4 240ms ± 1% 80ms ± 1% -66.66% (p=0.000 n=8+8) name old alloc/op new alloc/op delta FloatSqrt/64-4 416B ± 0% 416B ± 0% ~ (all equal) FloatSqrt/128-4 720B ± 0% 720B ± 0% ~ (all equal) FloatSqrt/256-4 1.34kB ± 0% 0.82kB ± 0% -39.29% (p=0.000 n=8+8) FloatSqrt/1000-4 5.09kB ± 0% 2.50kB ± 0% -50.94% (p=0.000 n=8+8) FloatSqrt/10000-4 45.9kB ± 0% 23.5kB ± 0% -48.81% (p=0.000 n=8+8) FloatSqrt/100000-4 533kB ± 0% 251kB ± 0% -52.90% (p=0.000 n=8+8) FloatSqrt/1000000-4 9.21MB ± 0% 4.61MB ± 0% -49.98% (p=0.000 n=8+8) name old allocs/op new allocs/op delta FloatSqrt/64-4 9.00 ± 0% 9.00 ± 0% ~ (all equal) FloatSqrt/128-4 13.0 ± 0% 13.0 ± 0% ~ (all equal) FloatSqrt/256-4 15.0 ± 0% 12.0 ± 0% -20.00% (p=0.000 n=8+8) FloatSqrt/1000-4 24.0 ± 0% 19.0 ± 0% -20.83% (p=0.000 n=8+8) FloatSqrt/10000-4 40.0 ± 0% 35.0 ± 0% -12.50% (p=0.000 n=8+8) FloatSqrt/100000-4 66.0 ± 0% 55.0 ± 0% -16.67% (p=0.000 n=8+8) FloatSqrt/1000000-4 143 ± 0% 122 ± 0% -14.69% (p=0.000 n=8+8) Change-Id: I4868adb7f8960f2ca20e7792734c2e6211669fc0 Reviewed-on: https://go-review.googlesource.com/75010 Reviewed-by: Robert Griesemer <gri@golang.org>	2017-11-01 16:21:06 +00:00
Alberto Donizetti	2c783dc038	math/big: save one subtraction per iteration in Float.Sqrt The Sqrt Newton method computes g(t) = f(t)/f'(t) and then iterates t2 = t1 - g(t1) We can save one operation by including the final subtraction in g(t) and evaluating the resulting expression symbolically. For example, for the direct method, g(t) = ½(t² - x)/t and we use 2 multiplications, 1 division and 1 subtraction in g(), plus 1 final subtraction; but if we compute t - g(t) = t - ½(t² - x)/t = ½(t² + x)/t we only use 2 multiplications, 1 division and 1 addition. A similar simplification can be done for the inverse method. name old time/op new time/op delta FloatSqrt/64-4 889ns ± 4% 790ns ± 1% -11.19% (p=0.000 n=8+7) FloatSqrt/128-4 1.82µs ± 0% 1.64µs ± 1% -10.07% (p=0.001 n=6+8) FloatSqrt/256-4 3.56µs ± 4% 3.10µs ± 3% -12.96% (p=0.000 n=7+8) FloatSqrt/1000-4 9.06µs ± 3% 8.86µs ± 1% -2.20% (p=0.001 n=7+7) FloatSqrt/10000-4 109µs ± 1% 107µs ± 1% -1.56% (p=0.000 n=8+8) FloatSqrt/100000-4 2.91ms ± 0% 2.89ms ± 2% -0.68% (p=0.026 n=7+7) FloatSqrt/1000000-4 237ms ± 1% 239ms ± 1% +0.72% (p=0.021 n=8+8) name old alloc/op new alloc/op delta FloatSqrt/64-4 448B ± 0% 416B ± 0% -7.14% (p=0.000 n=8+8) FloatSqrt/128-4 752B ± 0% 720B ± 0% -4.26% (p=0.000 n=8+8) FloatSqrt/256-4 2.05kB ± 0% 1.34kB ± 0% -34.38% (p=0.000 n=8+8) FloatSqrt/1000-4 6.91kB ± 0% 5.09kB ± 0% -26.39% (p=0.000 n=8+8) FloatSqrt/10000-4 60.5kB ± 0% 45.9kB ± 0% -24.17% (p=0.000 n=8+8) FloatSqrt/100000-4 617kB ± 0% 533kB ± 0% -13.57% (p=0.000 n=8+8) FloatSqrt/1000000-4 10.3MB ± 0% 9.2MB ± 0% -10.85% (p=0.000 n=8+8) name old allocs/op new allocs/op delta FloatSqrt/64-4 9.00 ± 0% 9.00 ± 0% ~ (all equal) FloatSqrt/128-4 13.0 ± 0% 13.0 ± 0% ~ (all equal) FloatSqrt/256-4 20.0 ± 0% 15.0 ± 0% -25.00% (p=0.000 n=8+8) FloatSqrt/1000-4 31.0 ± 0% 24.0 ± 0% -22.58% (p=0.000 n=8+8) FloatSqrt/10000-4 50.0 ± 0% 40.0 ± 0% -20.00% (p=0.000 n=8+8) FloatSqrt/100000-4 76.0 ± 0% 66.0 ± 0% -13.16% (p=0.000 n=8+8) FloatSqrt/1000000-4 146 ± 0% 143 ± 0% -2.05% (p=0.000 n=8+8) Change-Id: I271c00de1ca9740e585bf2af7bcd87b18c1fa68e Reviewed-on: https://go-review.googlesource.com/73879 Run-TryBot: Alberto Donizetti <alb.donizetti@gmail.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Robert Griesemer <gri@golang.org>	2017-11-01 11:24:02 +00:00
Ilya Tocar	94484d8ed5	cmd/compile: intrinsify math.{Trunc/Ceil/Floor} on amd64 This significantly speed-ups Trunc. Ceil/Floor are using the same instruction, so do them too. name old time/op new time/op delta Floor-6 3.33ns ± 1% 3.22ns ± 0% -3.39% (p=0.000 n=10+10) Ceil-6 3.33ns ± 1% 3.22ns ± 0% -3.16% (p=0.000 n=10+7) Trunc-6 4.83ns ± 0% 3.22ns ± 0% -33.36% (p=0.000 n=6+8) Change-Id: If848790e458eedfe38a6a0407bb4f589c68ac254 Reviewed-on: https://go-review.googlesource.com/68630 Run-TryBot: Ilya Tocar <ilya.tocar@intel.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Keith Randall <khr@golang.org>	2017-10-31 19:30:54 +00:00
Michael Munday	c280126557	cmd/asm, cmd/internal/obj/s390x, math: add "test under mask" instructions Adds the following s390x test under mask (immediate) instructions: TMHH TMHL TMLH TMLL These are useful for testing bits and are already used in the math package. Change-Id: Idffb3f83b238dba76ac1e42ac6b0bf7f1d11bea2 Reviewed-on: https://go-review.googlesource.com/41092 Run-TryBot: Michael Munday <mike.munday@ibm.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Cherry Zhang <cherryyz@google.com>	2017-10-30 23:55:14 +00:00
Michael Munday	b97688d112	math: optimize dim and remove s390x assembly implementation By calculating dim directly, rather than calling max, we can simplify the generated code significantly. The compiler now reports that dim is easily inlineable, but it can't be inlined because there is still an assembly stub for Dim. Since dim is now very simple I no longer think it is worth having assembly implementations of it. I have therefore removed the s390x assembly. Removing the other assembly for Dim is #21913. name old time/op new time/op delta Dim 4.29ns ± 0% 3.53ns ± 0% -17.62% (p=0.000 n=9+8) Change-Id: Ic38a6b51603cbc661dcdb868ecf2b1947e9f399e Reviewed-on: https://go-review.googlesource.com/64194 Run-TryBot: Michael Munday <mike.munday@ibm.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Robert Griesemer <gri@golang.org>	2017-10-30 19:05:51 +00:00
Alberto Donizetti	bd48d37e30	math/big: add (Float).Sqrt This change adds a Square root method to the big.Float type, with signature (z Float) Sqrt(x Float) Float Fixes #20460 Change-Id: I050aaed0615fe0894e11c800744600648343c223 Reviewed-on: https://go-review.googlesource.com/67830 Run-TryBot: Alberto Donizetti <alb.donizetti@gmail.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Robert Griesemer <gri@golang.org>	2017-10-26 17:29:27 +00:00
Brian Kessler	1643d4f33a	math/big: implement Lehmer's GCD algorithm Updates #15833 Lehmer's GCD algorithm uses single precision calculations to simulate several steps of multiple precision calculations in Euclid's GCD algorithm which leads to a considerable speed up. This implementation uses Collins' simplified testing condition on the single digit cosequences which requires only one quotient and avoids any possibility of overflow. name old time/op new time/op delta GCD10x10/WithoutXY-4 1.82µs ±24% 0.28µs ± 6% -84.40% (p=0.008 n=5+5) GCD10x10/WithXY-4 1.69µs ± 6% 1.71µs ± 6% ~ (p=0.595 n=5+5) GCD10x100/WithoutXY-4 1.87µs ± 2% 0.56µs ± 4% -70.13% (p=0.008 n=5+5) GCD10x100/WithXY-4 2.61µs ± 2% 2.65µs ± 4% ~ (p=0.635 n=5+5) GCD10x1000/WithoutXY-4 2.75µs ± 2% 1.48µs ± 1% -46.06% (p=0.008 n=5+5) GCD10x1000/WithXY-4 5.29µs ± 2% 5.25µs ± 2% ~ (p=0.548 n=5+5) GCD10x10000/WithoutXY-4 10.7µs ± 2% 10.3µs ± 0% -4.38% (p=0.008 n=5+5) GCD10x10000/WithXY-4 22.3µs ± 6% 22.1µs ± 1% ~ (p=1.000 n=5+5) GCD10x100000/WithoutXY-4 93.7µs ± 2% 99.4µs ± 2% +6.09% (p=0.008 n=5+5) GCD10x100000/WithXY-4 196µs ± 2% 199µs ± 2% ~ (p=0.222 n=5+5) GCD100x100/WithoutXY-4 10.1µs ± 2% 2.5µs ± 2% -74.84% (p=0.008 n=5+5) GCD100x100/WithXY-4 21.4µs ± 2% 21.3µs ± 7% ~ (p=0.548 n=5+5) GCD100x1000/WithoutXY-4 11.3µs ± 2% 4.4µs ± 4% -60.87% (p=0.008 n=5+5) GCD100x1000/WithXY-4 24.7µs ± 3% 23.9µs ± 1% ~ (p=0.056 n=5+5) GCD100x10000/WithoutXY-4 26.6µs ± 1% 20.0µs ± 2% -24.82% (p=0.008 n=5+5) GCD100x10000/WithXY-4 78.7µs ± 2% 78.2µs ± 2% ~ (p=0.690 n=5+5) GCD100x100000/WithoutXY-4 174µs ± 2% 171µs ± 1% ~ (p=0.056 n=5+5) GCD100x100000/WithXY-4 563µs ± 4% 561µs ± 2% ~ (p=1.000 n=5+5) GCD1000x1000/WithoutXY-4 120µs ± 5% 29µs ± 3% -75.71% (p=0.008 n=5+5) GCD1000x1000/WithXY-4 355µs ± 4% 358µs ± 2% ~ (p=0.841 n=5+5) GCD1000x10000/WithoutXY-4 140µs ± 2% 49µs ± 2% -65.07% (p=0.008 n=5+5) GCD1000x10000/WithXY-4 626µs ± 3% 628µs ± 9% ~ (p=0.690 n=5+5) GCD1000x100000/WithoutXY-4 340µs ± 4% 259µs ± 6% -23.79% (p=0.008 n=5+5) GCD1000x100000/WithXY-4 3.76ms ± 4% 3.82ms ± 5% ~ (p=0.310 n=5+5) GCD10000x10000/WithoutXY-4 3.11ms ± 3% 0.54ms ± 2% -82.74% (p=0.008 n=5+5) GCD10000x10000/WithXY-4 7.96ms ± 3% 7.69ms ± 3% ~ (p=0.151 n=5+5) GCD10000x100000/WithoutXY-4 3.88ms ± 1% 1.27ms ± 2% -67.21% (p=0.008 n=5+5) GCD10000x100000/WithXY-4 38.1ms ± 2% 38.8ms ± 1% ~ (p=0.095 n=5+5) GCD100000x100000/WithoutXY-4 208ms ± 1% 25ms ± 4% -88.07% (p=0.008 n=5+5) GCD100000x100000/WithXY-4 533ms ± 5% 525ms ± 4% ~ (p=0.548 n=5+5) Change-Id: Ic1e007eb807b93e75f4752e968e98c1f0cb90e43 Reviewed-on: https://go-review.googlesource.com/59450 Run-TryBot: Robert Griesemer <gri@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Robert Griesemer <gri@golang.org>	2017-10-24 22:42:43 +00:00
Mark Pulford	a5c44f3e3f	math: add RoundToEven function Rounding ties to even is statistically useful for some applications. This implementation completes IEEE float64 rounding mode support (in addition to Round, Ceil, Floor, Trunc). This function avoids subtle faults found in ad-hoc implementations, and is simple enough to be inlined by the compiler. Fixes #21748 Change-Id: I09415df2e42435f9e7dabe3bdc0148e9b9ebd609 Reviewed-on: https://go-review.googlesource.com/61211 Reviewed-by: Robert Griesemer <gri@golang.org> Run-TryBot: Robert Griesemer <gri@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org>	2017-10-24 22:33:09 +00:00
Matthew Dempsky	a5868a47c6	cmd/internal/obj/x86: move MOV->XOR rewriting into compiler Fixes #20986. Change-Id: Ic3cf5c0ab260f259ecff7b92cfdf5f4ae432aef3 Reviewed-on: https://go-review.googlesource.com/73072 Run-TryBot: Matthew Dempsky <mdempsky@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Austin Clements <austin@google.com>	2017-10-24 21:32:17 +00:00
Filippo Valsorda	f75158c365	math/big: fix ModSqrt optimized path for x = z name old time/op new time/op delta ModSqrt224_3Mod4-4 153µs ± 2% 154µs ± 1% ~ (p=0.548 n=5+5) ModSqrt5430_3Mod4-4 776ms ± 2% 791ms ± 2% ~ (p=0.222 n=5+5) Fixes #22265 Change-Id: If233542716e04341990a45a1c2b7118da6d233f7 Reviewed-on: https://go-review.googlesource.com/70832 Run-TryBot: Filippo Valsorda <hi@filippo.io> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Robert Griesemer <gri@golang.org>	2017-10-16 21:41:44 +00:00
griesemer	51cfe6849a	math/big: provide support for conversion bases up to 62 Increase MaxBase from 36 to 62 and extend the conversion alphabet with the upper-case letters 'A' to 'Z'. For int conversions with bases <= 36, the letters 'A' to 'Z' have the same values (10 to 35) as the corresponding lower-case letters. For conversion bases > 36 up to 62, the upper-case letters have the values 36 to 61. Added MaxBase to api/except.txt: Clients should not make assumptions about the value of MaxBase being constant. The core of the change is in natconv.go. The remaining changes are adjusted tests and documentation. Fixes #21558. Change-Id: I5f74da633caafca03993e13f32ac9546c572cc84 Reviewed-on: https://go-review.googlesource.com/65970 Reviewed-by: Martin Möhrmann <moehrmann@google.com>	2017-10-06 17:46:15 +00:00
griesemer	2ddd07138d	math/bits: complete examples Change-Id: Icbe6885ffd3aa4e77441ab03a2b9a04a9276d5eb Reviewed-on: https://go-review.googlesource.com/68311 Reviewed-by: Martin Möhrmann <moehrmann@google.com>	2017-10-06 16:58:03 +00:00
Marvin Stenger	90d71fe99e	all: revert "all: prefer strings.IndexByte over strings.Index" This reverts https://golang.org/cl/65930. Fixes #22148 Change-Id: Ie0712621ed89c43bef94417fc32de9af77607760 Reviewed-on: https://go-review.googlesource.com/68430 Reviewed-by: Ian Lance Taylor <iant@golang.org>	2017-10-05 23:19:10 +00:00
Marvin Stenger	aa00c607e1	math/big: remove []byte/string conversions This removes some of the []byte/string conversions currently existing in the (un)marshaling methods of Int and Rat. For Int we introduce a new function (Int).setFromScanner() essentially implementing the SetString method being given an io.ByteScanner instead of a string. So we can handle the string case in (Int).SetString with a strings.Reader and the []byte case in (Int).UnmarshalText() with a bytes.Reader now avoiding the []byte/string conversion here. For Rat we introduce a new function (Rat).marshal() essentially implementing the String method outputting []byte instead of string. Using this new function and the same formatting rules as in (Rat).RatString we can implement (Rat).MarshalText() without the []byte/string conversion it used to have. Change-Id: Ic5ef246c1582c428a40f214b95a16671ef0a06d9 Reviewed-on: https://go-review.googlesource.com/65950 Reviewed-by: Robert Griesemer <gri@golang.org> Run-TryBot: Robert Griesemer <gri@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org>	2017-10-04 17:16:52 +00:00
Marvin Stenger	f22ba1f247	all: prefer strings.IndexByte over strings.Index strings.IndexByte was introduced in go1.2 and it can be used effectively wherever the second argument to strings.Index is exactly one byte long. This avoids generating unnecessary string symbols and saves a few calls to strings.Index. Change-Id: I1ab5edb7c4ee9058084cfa57cbcc267c2597e793 Reviewed-on: https://go-review.googlesource.com/65930 Run-TryBot: Ian Lance Taylor <iant@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Ian Lance Taylor <iant@golang.org>	2017-09-25 17:35:41 +00:00
Marvin Stenger	7a5d76fa62	math/big: delete solved TODO The TODO is no longer needed as it was solved by a previous CL. See https://go-review.googlesource.com/14995. Change-Id: If62d1b296f35758ad3d18d28c8fbb95e797f4464 Reviewed-on: https://go-review.googlesource.com/65232 Reviewed-by: Robert Griesemer <gri@golang.org>	2017-09-22 10:21:46 +00:00
Agniva De Sarker	d2f317218b	math: implement fast path for Exp - using FMA and AVX instructions if available to speed-up Exp calculation on amd64 - using a data table instead of #define'ed constants because these instructions do not support loading floating point immediates. One has to use a memory operand / register. - Benchmark results on Intel(R) Core(TM) i5-5200U CPU @ 2.20GHz: Original vs New (non-FMA path) name old time/op new time/op delta Exp 16.0ns ± 1% 16.1ns ± 3% ~ (p=0.308 n=9+10) Original vs New (FMA path) name old time/op new time/op delta Exp 16.0ns ± 1% 13.7ns ± 2% -14.80% (p=0.000 n=9+10) Change-Id: I3d8986925d82b39b95ee979ae06f59d7e591d02e Reviewed-on: https://go-review.googlesource.com/62590 Reviewed-by: Ilya Tocar <ilya.tocar@intel.com> Reviewed-by: Keith Randall <khr@golang.org> Run-TryBot: Keith Randall <khr@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org>	2017-09-20 21:43:00 +00:00
Burak Guven	9cc170f9a5	math/rand: fix comment for Shuffle Shuffle panics if n < 0, not n <= 0. The comment for the (*Rand).Shuffle function is already accurate. Change-Id: I073049310bca9632e50e9ca3ff79eec402122793 Reviewed-on: https://go-review.googlesource.com/63750 Reviewed-by: Ian Lance Taylor <iant@golang.org>	2017-09-14 03:41:35 +00:00
Michael Munday	ffb4708d1b	math: fix Abs, Copysign and Signbit benchmarks CL 62250 makes constant folding a bit more aggressive and these benchmarks were optimized away. This CL adds some indirection to the function arguments to stop them being folded. The Copysign benchmark is a bit faster because I've left one argument as a constant and it can be partially folded. old CL 62250 this CL Copysign 1.24ns ± 0% 0.34ns ± 2% 1.02ns ± 2% Abs 0.67ns ± 0% 0.35ns ± 3% 0.67ns ± 0% Signbit 0.87ns ± 0% 0.35ns ± 2% 0.87ns ± 1% Change-Id: I9604465a87d7aa29f4bd6009839c8ee354be3cd7 Reviewed-on: https://go-review.googlesource.com/62450 Run-TryBot: Michael Munday <mike.munday@ibm.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Ian Lance Taylor <iant@golang.org>	2017-09-09 16:52:16 +00:00
Ian Lance Taylor	44f7fd030f	math/rand: change http to https in comment Change-Id: I19c1b0e1b238dda82e69bd47459528ed06b55840 Reviewed-on: https://go-review.googlesource.com/62310 Reviewed-by: David Crawshaw <crawshaw@golang.org>	2017-09-08 22:11:55 +00:00

1 2 3 4 5 ...

394 Commits