1
0
mirror of https://github.com/golang/go synced 2024-11-19 13:44:52 -07:00
Commit Graph

394 Commits

Author SHA1 Message Date
Carlos Eduardo Seo
a44c72823c math/big: improve performance of addVW/subVW for ppc64x
This change adds a better implementation in asm for addVW/subVW for
ppc64x, with speedups up to 3.11x.

benchmark                    old ns/op     new ns/op     delta
BenchmarkAddVW/1-16          6.87          5.71          -16.89%
BenchmarkAddVW/2-16          7.72          5.94          -23.06%
BenchmarkAddVW/3-16          8.74          6.56          -24.94%
BenchmarkAddVW/4-16          9.66          7.26          -24.84%
BenchmarkAddVW/5-16          10.8          7.26          -32.78%
BenchmarkAddVW/10-16         17.4          9.97          -42.70%
BenchmarkAddVW/100-16        164           56.0          -65.85%
BenchmarkAddVW/1000-16       1638          524           -68.01%
BenchmarkAddVW/10000-16      16421         5201          -68.33%
BenchmarkAddVW/100000-16     165762        53324         -67.83%
BenchmarkSubVW/1-16          6.76          5.62          -16.86%
BenchmarkSubVW/2-16          7.69          6.02          -21.72%
BenchmarkSubVW/3-16          8.85          6.61          -25.31%
BenchmarkSubVW/4-16          10.0          7.34          -26.60%
BenchmarkSubVW/5-16          11.3          7.33          -35.13%
BenchmarkSubVW/10-16         19.5          18.7          -4.10%
BenchmarkSubVW/100-16        153           55.9          -63.46%
BenchmarkSubVW/1000-16       1502          519           -65.45%
BenchmarkSubVW/10000-16      15005         5165          -65.58%
BenchmarkSubVW/100000-16     150620        53124         -64.73%

benchmark                    old MB/s     new MB/s     speedup
BenchmarkAddVW/1-16          1165.12      1400.76      1.20x
BenchmarkAddVW/2-16          2071.39      2693.25      1.30x
BenchmarkAddVW/3-16          2744.72      3656.92      1.33x
BenchmarkAddVW/4-16          3311.63      4407.34      1.33x
BenchmarkAddVW/5-16          3700.52      5512.48      1.49x
BenchmarkAddVW/10-16         4605.63      8026.37      1.74x
BenchmarkAddVW/100-16        4856.15      14296.76     2.94x
BenchmarkAddVW/1000-16       4883.96      15264.21     3.13x
BenchmarkAddVW/10000-16      4871.52      15380.78     3.16x
BenchmarkAddVW/100000-16     4826.17      15002.48     3.11x
BenchmarkSubVW/1-16          1183.20      1423.03      1.20x
BenchmarkSubVW/2-16          2081.92      2657.44      1.28x
BenchmarkSubVW/3-16          2711.52      3632.30      1.34x
BenchmarkSubVW/4-16          3198.30      4360.30      1.36x
BenchmarkSubVW/5-16          3534.43      5460.40      1.54x
BenchmarkSubVW/10-16         4106.34      4273.51      1.04x
BenchmarkSubVW/100-16        5213.48      14306.32     2.74x
BenchmarkSubVW/1000-16       5324.27      15391.21     2.89x
BenchmarkSubVW/10000-16      5331.33      15486.57     2.90x
BenchmarkSubVW/100000-16     5311.35      15059.01     2.84x

Change-Id: Ibaa5b9b38d63fba8e01a9c327eb8bef1e6e908c1
Reviewed-on: https://go-review.googlesource.com/101975
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
2018-03-27 15:06:53 +00:00
Vlad Krasnov
26d74e8b65 math/big: reduce amount of copying in Montgomery multiplication
Instead shifting the accumulator every iteration of the loop, shift
once in the end. This significantly improves performance on arm64.

On arm64:

name                  old time/op    new time/op    delta
RSA2048Decrypt          3.33ms ± 0%    2.63ms ± 0%  -20.94%  (p=0.000 n=11+11)
RSA2048Sign             4.22ms ± 0%    3.55ms ± 0%  -15.89%  (p=0.000 n=11+11)
3PrimeRSA2048Decrypt    1.95ms ± 0%    1.59ms ± 0%  -18.59%  (p=0.000 n=11+11)

On Skylake:

name                    old time/op  new time/op  delta
RSA2048Decrypt-8        1.73ms ± 2%  1.55ms ± 2%  -10.19%  (p=0.000 n=10+10)
RSA2048Sign-8           2.17ms ± 2%  2.00ms ± 2%   -7.93%  (p=0.000 n=10+10)
3PrimeRSA2048Decrypt-8  1.10ms ± 2%  0.96ms ± 2%  -13.03%  (p=0.000 n=10+9)

Change-Id: I5786191a1a09e4217fdb1acfd90880d35c5855f7
Reviewed-on: https://go-review.googlesource.com/99838
Run-TryBot: Vlad Krasnov <vlad@cloudflare.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Adam Langley <agl@golang.org>
Reviewed-by: Robert Griesemer <gri@golang.org>
2018-03-19 21:40:56 +00:00
Alberto Donizetti
168cc7ff9c math/big: add 0 shift fastpath to shl and shr
One could expect calls like

  z.mant.shl(z.mant, shiftAmount)

(or higher-level-functions calls that use lhs/rhs) to be almost free
when shiftAmount = 0; and expect calls like

  z.mant.shl(x.mant, 0)

to have the same cost of a x.mant -> z.mant copy. Neither of this
things are currently true.

For an 800 words nat, the first kind of calls cost ~800ns for rigth
shifts and ~3.5µs for left shift; while the second kind of calls are
doing more work than necessary by calling shlVU/shrVU.

This change makes the first kind of calls ({Shl,Shr}Same) almost free,
and the second kind of calls ({Shl,Shr}) about 30% faster.

name                  old time/op  new time/op  delta
ZeroShifts/Shl-4      3.64µs ± 3%  2.49µs ± 1%  -31.55%  (p=0.000 n=10+10)
ZeroShifts/ShlSame-4  3.65µs ± 1%  0.01µs ± 1%  -99.85%  (p=0.000 n=9+9)
ZeroShifts/Shr-4      3.65µs ± 1%  2.49µs ± 1%  -31.91%  (p=0.000 n=10+10)
ZeroShifts/ShrSame-4   825ns ± 0%     6ns ± 1%  -99.33%  (p=0.000 n=9+10)

During go test math/big, the shl zeroshift fastpath is triggered 1380
times; while the shr fastpath is triggered 153334 times(!).

Change-Id: I5f92b304a40638bd8453a86c87c58e54b337bcdf
Reviewed-on: https://go-review.googlesource.com/87660
Run-TryBot: Alberto Donizetti <alb.donizetti@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Robert Griesemer <gri@golang.org>
2018-03-19 18:01:37 +00:00
Robert Griesemer
7d4d2cb686 math/big: add comment about internal assumptions on nat values
Change-Id: I7ed40507a019c0bf521ba748fc22c03d74bb17b7
Reviewed-on: https://go-review.googlesource.com/100719
Reviewed-by: Ian Lance Taylor <iant@golang.org>
2018-03-14 18:31:05 +00:00
erifan01
140bfe9cfe math/big: optimize shlVU and shrVU on arm64
This CL implements shlVU and shrVU with arm64 HW instructions "LDP" and "STP" to reduce load cost,
it also removes unnecessary checks on the number of shifts for better performance.

Benchmarks:

name                              old time/op    new time/op    delta
AddVV/1-8                           21.6ns ± 1%    21.6ns ± 1%      ~     (p=0.683 n=5+5)
AddVV/2-8                           13.5ns ± 0%    13.5ns ± 0%      ~     (all equal)
AddVV/3-8                           15.5ns ± 0%    15.5ns ± 0%      ~     (all equal)
AddVV/4-8                           17.5ns ± 0%    17.5ns ± 0%      ~     (all equal)
AddVV/5-8                           19.5ns ± 0%    19.5ns ± 0%      ~     (all equal)
AddVV/10-8                          29.5ns ± 0%    29.5ns ± 0%      ~     (all equal)
AddVV/100-8                          217ns ± 0%     217ns ± 0%      ~     (all equal)
AddVV/1000-8                        2.02µs ± 0%    2.03µs ± 0%    +0.73%  (p=0.008 n=5+5)
AddVV/10000-8                       20.5µs ± 0%    20.5µs ± 0%    -0.01%  (p=0.008 n=5+5)
AddVV/100000-8                       246µs ± 5%     250µs ± 4%      ~     (p=0.548 n=5+5)
AddVW/1-8                           9.26ns ± 0%    9.32ns ± 0%    +0.65%  (p=0.016 n=4+5)
AddVW/2-8                           19.8ns ± 3%    19.8ns ± 0%      ~     (p=0.143 n=5+5)
AddVW/3-8                           11.5ns ± 0%    11.5ns ± 0%      ~     (all equal)
AddVW/4-8                           13.0ns ± 0%    13.0ns ± 0%      ~     (all equal)
AddVW/5-8                           14.5ns ± 0%    14.5ns ± 0%      ~     (all equal)
AddVW/10-8                          22.0ns ± 0%    22.0ns ± 0%      ~     (all equal)
AddVW/100-8                          167ns ± 0%     166ns ± 0%    -0.60%  (p=0.000 n=5+4)
AddVW/1000-8                        1.52µs ± 0%    1.52µs ± 0%      ~     (all equal)
AddVW/10000-8                       15.1µs ± 0%    15.1µs ± 0%    +0.01%  (p=0.008 n=5+5)
AddVW/100000-8                       163µs ± 4%     153µs ± 3%    -5.97%  (p=0.016 n=5+5)
AddMulVVW/1-8                       32.4ns ± 1%    33.0ns ± 1%    +1.73%  (p=0.040 n=5+5)
AddMulVVW/2-8                       56.4ns ± 2%    55.9ns ± 1%      ~     (p=0.135 n=5+5)
AddMulVVW/3-8                       85.4ns ± 1%    85.1ns ± 0%      ~     (p=0.079 n=5+5)
AddMulVVW/4-8                        129ns ± 1%     129ns ± 0%      ~     (p=0.397 n=5+5)
AddMulVVW/5-8                        148ns ± 0%     148ns ± 0%      ~     (all equal)
AddMulVVW/10-8                       270ns ± 0%     268ns ± 0%    -0.74%  (p=0.029 n=4+4)
AddMulVVW/100-8                     2.75µs ± 0%    2.75µs ± 0%    -0.09%  (p=0.008 n=5+5)
AddMulVVW/1000-8                    26.0µs ± 0%    26.0µs ± 0%    -0.06%  (p=0.024 n=5+5)
AddMulVVW/10000-8                    312µs ± 0%     312µs ± 0%    -0.09%  (p=0.008 n=5+5)
AddMulVVW/100000-8                  2.89ms ± 0%    2.89ms ± 0%    +0.14%  (p=0.016 n=5+5)
DecimalConversion-8                  315µs ± 1%     312µs ± 0%      ~     (p=0.095 n=5+5)
FloatString/100-8                   2.56µs ± 1%    2.52µs ± 1%    -1.31%  (p=0.016 n=5+5)
FloatString/1000-8                  58.6µs ± 0%    58.2µs ± 0%    -0.75%  (p=0.008 n=5+5)
FloatString/10000-8                 4.59ms ± 0%    4.59ms ± 0%      ~     (p=0.056 n=5+5)
FloatString/100000-8                 446ms ± 0%     446ms ± 0%    -0.04%  (p=0.008 n=5+5)
FloatAdd/10-8                        184ns ± 0%     178ns ± 0%    -3.48%  (p=0.008 n=5+5)
FloatAdd/100-8                       189ns ± 3%     178ns ± 2%    -6.02%  (p=0.008 n=5+5)
FloatAdd/1000-8                      371ns ± 0%     267ns ± 0%   -27.99%  (p=0.000 n=5+4)
FloatAdd/10000-8                    1.87µs ± 0%    1.03µs ± 0%   -44.74%  (p=0.008 n=5+5)
FloatAdd/100000-8                   17.1µs ± 0%     8.8µs ± 0%   -48.71%  (p=0.016 n=5+4)
FloatSub/10-8                        148ns ± 0%     146ns ± 0%    -1.35%  (p=0.000 n=5+4)
FloatSub/100-8                       148ns ± 0%     140ns ± 0%    -5.41%  (p=0.008 n=5+5)
FloatSub/1000-8                      242ns ± 0%     191ns ± 0%   -21.24%  (p=0.008 n=5+5)
FloatSub/10000-8                    1.07µs ± 0%    0.64µs ± 1%   -39.89%  (p=0.016 n=4+5)
FloatSub/100000-8                   9.48µs ± 0%    5.32µs ± 0%   -43.87%  (p=0.008 n=5+5)
ParseFloatSmallExp-8                29.3µs ± 3%    28.6µs ± 1%      ~     (p=0.310 n=5+5)
ParseFloatLargeExp-8                 125µs ± 1%     123µs ± 0%    -1.99%  (p=0.008 n=5+5)
GCD10x10/WithoutXY-8                 278ns ± 4%     289ns ± 5%    +3.96%  (p=0.040 n=5+5)
GCD10x10/WithXY-8                   2.12µs ± 2%    2.15µs ± 2%      ~     (p=0.095 n=5+5)
GCD10x100/WithoutXY-8                615ns ± 1%     629ns ± 5%      ~     (p=0.135 n=5+5)
GCD10x100/WithXY-8                  3.42µs ± 1%    3.53µs ± 2%    +3.38%  (p=0.008 n=5+5)
GCD10x1000/WithoutXY-8              1.39µs ± 1%    1.38µs ± 1%      ~     (p=0.460 n=5+5)
GCD10x1000/WithXY-8                 7.47µs ± 2%    7.49µs ± 3%      ~     (p=1.000 n=5+5)
GCD10x10000/WithoutXY-8             8.71µs ± 1%    8.71µs ± 0%      ~     (p=0.841 n=5+5)
GCD10x10000/WithXY-8                28.4µs ± 2%    27.2µs ± 2%    -4.24%  (p=0.008 n=5+5)
GCD10x100000/WithoutXY-8            78.9µs ± 1%    79.1µs ± 0%      ~     (p=0.222 n=5+5)
GCD10x100000/WithXY-8                240µs ± 1%     228µs ± 1%    -4.98%  (p=0.008 n=5+5)
GCD100x100/WithoutXY-8              1.87µs ± 2%    1.89µs ± 1%      ~     (p=0.095 n=5+5)
GCD100x100/WithXY-8                 26.6µs ± 1%    26.3µs ± 0%    -1.14%  (p=0.032 n=5+5)
GCD100x1000/WithoutXY-8             4.44µs ± 2%    4.47µs ± 2%      ~     (p=0.444 n=5+5)
GCD100x1000/WithXY-8                36.7µs ± 1%    36.0µs ± 1%    -1.96%  (p=0.008 n=5+5)
GCD100x10000/WithoutXY-8            22.8µs ± 1%    22.3µs ± 1%    -2.52%  (p=0.008 n=5+5)
GCD100x10000/WithXY-8                145µs ± 1%     142µs ± 0%    -1.88%  (p=0.008 n=5+5)
GCD100x100000/WithoutXY-8            198µs ± 1%     190µs ± 0%    -4.06%  (p=0.008 n=5+5)
GCD100x100000/WithXY-8              1.11ms ± 0%    1.09ms ± 0%    -1.87%  (p=0.008 n=5+5)
GCD1000x1000/WithoutXY-8            25.4µs ± 1%    25.0µs ± 1%    -1.34%  (p=0.008 n=5+5)
GCD1000x1000/WithXY-8                515µs ± 1%     485µs ± 0%    -5.85%  (p=0.008 n=5+5)
GCD1000x10000/WithoutXY-8           57.3µs ± 1%    56.2µs ± 1%    -1.95%  (p=0.008 n=5+5)
GCD1000x10000/WithXY-8              1.21ms ± 0%    1.18ms ± 0%    -2.65%  (p=0.008 n=5+5)
GCD1000x100000/WithoutXY-8           358µs ± 0%     352µs ± 1%    -1.71%  (p=0.008 n=5+5)
GCD1000x100000/WithXY-8             8.72ms ± 0%    8.66ms ± 0%    -0.71%  (p=0.008 n=5+5)
GCD10000x10000/WithoutXY-8           690µs ± 0%     687µs ± 1%      ~     (p=0.095 n=5+5)
GCD10000x10000/WithXY-8             16.0ms ± 0%    12.5ms ± 0%   -22.01%  (p=0.008 n=5+5)
GCD10000x100000/WithoutXY-8         2.09ms ± 0%    2.07ms ± 0%    -0.58%  (p=0.008 n=5+5)
GCD10000x100000/WithXY-8            86.8ms ± 0%    83.4ms ± 0%    -3.95%  (p=0.008 n=5+5)
GCD100000x100000/WithoutXY-8        51.2ms ± 0%    51.2ms ± 0%      ~     (p=0.548 n=5+5)
GCD100000x100000/WithXY-8            1.25s ± 0%     0.89s ± 0%   -28.98%  (p=0.008 n=5+5)
Hilbert-8                           2.46ms ± 2%    2.53ms ± 1%    +2.89%  (p=0.032 n=5+5)
Binomial-8                          5.15µs ± 4%    4.92µs ± 1%    -4.43%  (p=0.032 n=5+5)
QuoRem-8                            7.10µs ± 0%    7.05µs ± 0%    -0.59%  (p=0.008 n=5+5)
Exp-8                                161ms ± 0%     161ms ± 0%    -0.24%  (p=0.008 n=5+5)
Exp2-8                               161ms ± 0%     161ms ± 0%    -0.30%  (p=0.016 n=4+5)
Bitset-8                            40.4ns ± 0%    40.3ns ± 0%      ~     (p=0.159 n=5+5)
BitsetNeg-8                          158ns ± 4%     155ns ± 2%      ~     (p=0.183 n=5+5)
BitsetOrig-8                         374ns ± 0%     383ns ± 1%    +2.35%  (p=0.008 n=5+5)
BitsetNegOrig-8                      620ns ± 1%     663ns ± 2%    +7.00%  (p=0.008 n=5+5)
ModSqrt225_Tonelli-8                7.26ms ± 0%    7.27ms ± 0%      ~     (p=0.841 n=5+5)
ModSqrt224_3Mod4-8                  2.24ms ± 0%    2.24ms ± 0%      ~     (p=0.690 n=5+5)
ModSqrt5430_Tonelli-8                62.3s ± 0%     62.4s ± 0%    +0.15%  (p=0.008 n=5+5)
ModSqrt5430_3Mod4-8                  20.8s ± 0%     20.8s ± 0%      ~     (p=0.151 n=5+5)
Sqrt-8                               101µs ± 0%      97µs ± 0%    -3.99%  (p=0.008 n=5+5)
IntSqr/1-8                          32.7ns ± 1%    32.5ns ± 1%      ~     (p=0.325 n=5+5)
IntSqr/2-8                           161ns ± 4%     160ns ± 4%      ~     (p=0.659 n=5+5)
IntSqr/3-8                           296ns ± 7%     297ns ± 6%      ~     (p=0.841 n=5+5)
IntSqr/5-8                           752ns ± 7%     755ns ± 6%      ~     (p=0.889 n=5+5)
IntSqr/8-8                          1.91µs ± 3%    1.90µs ± 3%      ~     (p=0.746 n=5+5)
IntSqr/10-8                         2.99µs ± 4%    3.00µs ± 4%      ~     (p=0.516 n=5+5)
IntSqr/20-8                         6.29µs ± 2%    6.19µs ± 2%      ~     (p=0.151 n=5+5)
IntSqr/30-8                         14.0µs ± 1%    13.8µs ± 2%      ~     (p=0.056 n=5+5)
IntSqr/50-8                         38.1µs ± 3%    37.9µs ± 3%      ~     (p=0.548 n=5+5)
IntSqr/80-8                         95.1µs ± 1%    94.7µs ± 1%      ~     (p=0.310 n=5+5)
IntSqr/100-8                         148µs ± 1%     148µs ± 1%      ~     (p=0.548 n=5+5)
IntSqr/200-8                         587µs ± 1%     587µs ± 1%      ~     (p=1.000 n=5+5)
IntSqr/300-8                        1.31ms ± 1%    1.32ms ± 1%      ~     (p=0.151 n=5+5)
IntSqr/500-8                        2.48ms ± 0%    2.49ms ± 0%      ~     (p=0.310 n=5+5)
IntSqr/800-8                        4.68ms ± 0%    4.67ms ± 0%      ~     (p=0.548 n=5+5)
IntSqr/1000-8                       7.57ms ± 0%    7.56ms ± 0%      ~     (p=0.421 n=5+5)
Mul-8                                311ms ± 0%     311ms ± 0%      ~     (p=0.151 n=5+5)
Exp3Power/0x10-8                     584ns ± 2%     573ns ± 1%      ~     (p=0.190 n=5+5)
Exp3Power/0x40-8                     646ns ± 2%     649ns ± 1%      ~     (p=0.690 n=5+5)
Exp3Power/0x100-8                   1.42µs ± 2%    1.45µs ± 1%    +2.03%  (p=0.032 n=5+5)
Exp3Power/0x400-8                   8.28µs ± 1%    8.39µs ± 0%    +1.33%  (p=0.008 n=5+5)
Exp3Power/0x1000-8                  60.1µs ± 0%    59.8µs ± 0%    -0.44%  (p=0.008 n=5+5)
Exp3Power/0x4000-8                   818µs ± 0%     816µs ± 0%    -0.23%  (p=0.008 n=5+5)
Exp3Power/0x10000-8                 7.79ms ± 0%    7.78ms ± 0%      ~     (p=0.690 n=5+5)
Exp3Power/0x40000-8                 73.4ms ± 0%    73.3ms ± 0%      ~     (p=0.151 n=5+5)
Exp3Power/0x100000-8                 665ms ± 0%     664ms ± 0%    -0.16%  (p=0.016 n=4+5)
Exp3Power/0x400000-8                 5.99s ± 0%     5.97s ± 0%    -0.24%  (p=0.008 n=5+5)
Fibo-8                               116ms ± 0%     117ms ± 0%    +0.42%  (p=0.008 n=5+5)
NatSqr/1-8                           113ns ± 2%     112ns ± 1%      ~     (p=0.190 n=5+5)
NatSqr/2-8                           249ns ± 2%     250ns ± 2%      ~     (p=0.365 n=5+5)
NatSqr/3-8                           379ns ± 1%     381ns ± 2%      ~     (p=0.127 n=5+5)
NatSqr/5-8                           838ns ± 3%     841ns ± 5%      ~     (p=0.754 n=5+5)
NatSqr/8-8                          1.97µs ± 3%    1.97µs ± 4%      ~     (p=1.000 n=5+5)
NatSqr/10-8                         3.04µs ± 4%    3.04µs ± 4%      ~     (p=1.000 n=5+5)
NatSqr/20-8                         6.49µs ± 3%    6.50µs ± 2%      ~     (p=0.841 n=5+5)
NatSqr/30-8                         14.3µs ± 2%    14.2µs ± 2%      ~     (p=0.548 n=5+5)
NatSqr/50-8                         38.5µs ± 3%    38.3µs ± 3%      ~     (p=0.421 n=5+5)
NatSqr/80-8                         96.3µs ± 1%    96.1µs ± 1%      ~     (p=0.421 n=5+5)
NatSqr/100-8                         149µs ± 1%     148µs ± 1%      ~     (p=0.310 n=5+5)
NatSqr/200-8                         591µs ± 1%     592µs ± 1%      ~     (p=0.690 n=5+5)
NatSqr/300-8                        1.31ms ± 1%    1.32ms ± 0%      ~     (p=0.190 n=5+4)
NatSqr/500-8                        2.49ms ± 0%    2.49ms ± 0%      ~     (p=0.095 n=5+5)
NatSqr/800-8                        4.70ms ± 0%    4.69ms ± 0%      ~     (p=0.222 n=5+5)
NatSqr/1000-8                       7.60ms ± 0%    7.58ms ± 0%      ~     (p=0.222 n=5+5)
ScanPi-8                             326µs ± 0%     327µs ± 1%      ~     (p=0.222 n=5+5)
StringPiParallel-8                  71.4µs ± 5%    67.7µs ± 4%      ~     (p=0.095 n=5+5)
Scan/10/Base2-8                     1.09µs ± 0%    1.10µs ± 1%      ~     (p=0.810 n=5+5)
Scan/100/Base2-8                    7.79µs ± 0%    7.83µs ± 0%    +0.53%  (p=0.008 n=5+5)
Scan/1000/Base2-8                   78.9µs ± 0%    79.0µs ± 0%      ~     (p=0.151 n=5+5)
Scan/10000/Base2-8                  1.22ms ± 0%    1.23ms ± 1%      ~     (p=0.690 n=5+5)
Scan/100000/Base2-8                 55.1ms ± 0%    55.1ms ± 0%    +0.10%  (p=0.008 n=5+5)
Scan/10/Base8-8                      512ns ± 1%     534ns ± 1%    +4.34%  (p=0.008 n=5+5)
Scan/100/Base8-8                    2.90µs ± 1%    2.92µs ± 0%    +0.67%  (p=0.024 n=5+5)
Scan/1000/Base8-8                   31.0µs ± 0%    31.1µs ± 0%    +0.27%  (p=0.008 n=5+5)
Scan/10000/Base8-8                   741µs ± 0%     744µs ± 1%      ~     (p=0.310 n=5+5)
Scan/100000/Base8-8                 50.5ms ± 0%    50.7ms ± 0%    +0.23%  (p=0.016 n=5+4)
Scan/10/Base10-8                     485ns ± 0%     510ns ± 1%    +5.15%  (p=0.008 n=5+5)
Scan/100/Base10-8                   2.68µs ± 0%    2.70µs ± 0%    +0.84%  (p=0.008 n=5+5)
Scan/1000/Base10-8                  28.7µs ± 0%    28.8µs ± 0%    +0.34%  (p=0.008 n=5+5)
Scan/10000/Base10-8                  717µs ± 0%     720µs ± 1%      ~     (p=0.238 n=5+5)
Scan/100000/Base10-8                50.3ms ± 0%    50.3ms ± 0%    +0.02%  (p=0.016 n=4+5)
Scan/10/Base16-8                     439ns ± 0%     461ns ± 1%    +5.06%  (p=0.008 n=5+5)
Scan/100/Base16-8                   2.48µs ± 0%    2.49µs ± 0%    +0.59%  (p=0.024 n=5+5)
Scan/1000/Base16-8                  27.2µs ± 0%    27.3µs ± 0%      ~     (p=0.063 n=5+5)
Scan/10000/Base16-8                  722µs ± 0%     725µs ± 1%      ~     (p=0.421 n=5+5)
Scan/100000/Base16-8                52.7ms ± 0%    52.7ms ± 0%      ~     (p=0.686 n=4+4)
String/10/Base2-8                    248ns ± 1%     248ns ± 1%      ~     (p=0.802 n=5+5)
String/100/Base2-8                  1.51µs ± 0%    1.51µs ± 0%    -0.54%  (p=0.024 n=5+5)
String/1000/Base2-8                 13.6µs ± 0%    13.6µs ± 0%      ~     (p=0.548 n=5+5)
String/10000/Base2-8                 135µs ± 1%     135µs ± 2%      ~     (p=0.421 n=5+5)
String/100000/Base2-8               1.32ms ± 1%    1.33ms ± 1%      ~     (p=0.310 n=5+5)
String/10/Base8-8                    169ns ± 0%     170ns ± 0%      ~     (p=0.079 n=5+5)
String/100/Base8-8                   635ns ± 1%     633ns ± 1%      ~     (p=0.595 n=5+5)
String/1000/Base8-8                 5.33µs ± 0%    5.30µs ± 0%      ~     (p=0.063 n=5+5)
String/10000/Base8-8                50.7µs ± 1%    50.7µs ± 1%      ~     (p=1.000 n=5+5)
String/100000/Base8-8                499µs ± 1%     500µs ± 1%      ~     (p=1.000 n=5+5)
String/10/Base10-8                   517ns ± 1%     512ns ± 1%    -1.01%  (p=0.032 n=5+5)
String/100/Base10-8                 1.97µs ± 0%    2.01µs ± 1%    +2.13%  (p=0.008 n=5+5)
String/1000/Base10-8                12.6µs ± 1%    12.1µs ± 1%    -4.16%  (p=0.008 n=5+5)
String/10000/Base10-8               57.9µs ± 1%    54.8µs ± 1%    -5.46%  (p=0.008 n=5+5)
String/100000/Base10-8              25.6ms ± 0%    25.6ms ± 0%    -0.12%  (p=0.008 n=5+5)
String/10/Base16-8                   149ns ± 0%     149ns ± 1%      ~     (p=1.000 n=5+5)
String/100/Base16-8                  514ns ± 0%     514ns ± 1%      ~     (p=0.825 n=5+5)
String/1000/Base16-8                4.01µs ± 0%    4.01µs ± 0%      ~     (p=0.595 n=5+5)
String/10000/Base16-8               37.7µs ± 0%    37.8µs ± 1%      ~     (p=0.222 n=5+5)
String/100000/Base16-8               373µs ± 1%     372µs ± 0%      ~     (p=1.000 n=5+5)
LeafSize/0-8                        6.64ms ± 0%    6.66ms ± 0%    +0.32%  (p=0.008 n=5+5)
LeafSize/1-8                        74.0µs ± 1%    71.2µs ± 1%    -3.75%  (p=0.008 n=5+5)
LeafSize/2-8                        74.1µs ± 0%    70.7µs ± 1%    -4.53%  (p=0.008 n=5+5)
LeafSize/3-8                         379µs ± 0%     374µs ± 0%    -1.25%  (p=0.008 n=5+5)
LeafSize/4-8                        72.7µs ± 0%    69.2µs ± 0%    -4.79%  (p=0.008 n=5+5)
LeafSize/5-8                         471µs ± 0%     466µs ± 0%    -1.05%  (p=0.008 n=5+5)
LeafSize/6-8                         377µs ± 0%     373µs ± 0%    -1.16%  (p=0.008 n=5+5)
LeafSize/7-8                         245µs ± 0%     241µs ± 0%    -1.65%  (p=0.008 n=5+5)
LeafSize/8-8                        73.1µs ± 0%    69.4µs ± 0%    -5.10%  (p=0.008 n=5+5)
LeafSize/9-8                         538µs ± 0%     532µs ± 0%    -1.01%  (p=0.008 n=5+5)
LeafSize/10-8                        472µs ± 0%     467µs ± 0%    -1.07%  (p=0.008 n=5+5)
LeafSize/11-8                        460µs ± 0%     454µs ± 0%    -1.22%  (p=0.008 n=5+5)
LeafSize/12-8                        378µs ± 0%     373µs ± 0%    -1.34%  (p=0.008 n=5+5)
LeafSize/13-8                        344µs ± 0%     338µs ± 0%    -1.61%  (p=0.008 n=5+5)
LeafSize/14-8                        247µs ± 0%     243µs ± 0%    -1.62%  (p=0.008 n=5+5)
LeafSize/15-8                        169µs ± 0%     165µs ± 0%    -2.71%  (p=0.008 n=5+5)
LeafSize/16-8                       73.3µs ± 1%    69.5µs ± 0%    -5.11%  (p=0.008 n=5+5)
LeafSize/32-8                       82.7µs ± 0%    79.2µs ± 0%    -4.24%  (p=0.008 n=5+5)
LeafSize/64-8                        135µs ± 0%     132µs ± 0%    -2.20%  (p=0.008 n=5+5)
ProbablyPrime/n=0-8                 44.2ms ± 0%    43.9ms ± 0%    -0.69%  (p=0.008 n=5+5)
ProbablyPrime/n=1-8                 64.8ms ± 0%    64.4ms ± 0%    -0.60%  (p=0.008 n=5+5)
ProbablyPrime/n=5-8                  147ms ± 0%     147ms ± 0%    -0.34%  (p=0.008 n=5+5)
ProbablyPrime/n=10-8                 250ms ± 0%     249ms ± 0%    -0.29%  (p=0.008 n=5+5)
ProbablyPrime/n=20-8                 456ms ± 0%     455ms ± 0%    -0.29%  (p=0.008 n=5+5)
ProbablyPrime/Lucas-8               23.6ms ± 0%    23.2ms ± 0%    -1.44%  (p=0.008 n=5+5)
ProbablyPrime/MillerRabinBase2-8    20.6ms ± 0%    20.6ms ± 0%    -0.31%  (p=0.008 n=5+5)
FloatSqrt/64-8                      2.27µs ± 1%    2.11µs ± 1%    -7.02%  (p=0.008 n=5+5)
FloatSqrt/128-8                     4.93µs ± 1%    4.40µs ± 1%   -10.73%  (p=0.008 n=5+5)
FloatSqrt/256-8                     13.6µs ± 0%     6.6µs ± 1%   -51.40%  (p=0.008 n=5+5)
FloatSqrt/1000-8                    69.8µs ± 0%    31.2µs ± 0%   -55.27%  (p=0.008 n=5+5)
FloatSqrt/10000-8                   1.91ms ± 0%    0.59ms ± 0%   -69.17%  (p=0.008 n=5+5)
FloatSqrt/100000-8                  55.4ms ± 0%    17.8ms ± 0%   -67.79%  (p=0.008 n=5+5)
FloatSqrt/1000000-8                  4.56s ± 0%     1.52s ± 0%   -66.59%  (p=0.008 n=5+5)

Change-Id: Icce52c69668f564490c69b908338b21a2288e116
Reviewed-on: https://go-review.googlesource.com/79355
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2018-03-12 14:42:11 +00:00
Alberto Donizetti
010579c237 math/big: allocate less in Float.Sqrt
The Newton sqrtInverse procedure we use to compute Float.Sqrt should
not allocate a number of times proportional to the number of Newton
iterations we need to reach the desired precision.

At the beginning the function the target precision is known, so even
if we do want to perform the early steps at low precisions (to save
time), it's still possible to pre-allocate larger backing arrays, both
for the temp variables in the loop and the variable that'll hold the
final result.

There's one complication. At the following line:

  u.Sub(three, u)

the Sub method will allocate, because the receiver aliases one of the
arguments, and the large backing array we initially allocated for u
will be replaced by a smaller one allocated by Sub. We can work around
this by introducing a second temp variable u2 that we use to hold the
Sub call result.

Overall, the sqrtInverse procedure still allocates a number of times
proportional to the number of Newton steps, because unfortunately a
few of the Mul calls in the Newton function allocate; but at least we
allocate less in the function itself.

FloatSqrt/256-4        1.97µs ± 1%    1.84µs ± 1%   -6.61%  (p=0.000 n=8+8)
FloatSqrt/1000-4       4.80µs ± 3%    4.28µs ± 1%  -10.78%  (p=0.000 n=8+8)
FloatSqrt/10000-4      40.0µs ± 1%    38.3µs ± 1%   -4.15%  (p=0.000 n=8+8)
FloatSqrt/100000-4      955µs ± 1%     932µs ± 0%   -2.49%  (p=0.000 n=8+7)
FloatSqrt/1000000-4    79.8ms ± 1%    79.4ms ± 1%     ~     (p=0.105 n=8+8)

name                 old alloc/op   new alloc/op   delta
FloatSqrt/256-4          816B ± 0%      512B ± 0%  -37.25%  (p=0.000 n=8+8)
FloatSqrt/1000-4       2.50kB ± 0%    1.47kB ± 0%  -41.03%  (p=0.000 n=8+8)
FloatSqrt/10000-4      23.5kB ± 0%    18.2kB ± 0%  -22.62%  (p=0.000 n=8+8)
FloatSqrt/100000-4      251kB ± 0%     173kB ± 0%  -31.26%  (p=0.000 n=8+8)
FloatSqrt/1000000-4    4.61MB ± 0%    2.86MB ± 0%  -37.90%  (p=0.000 n=8+8)

name                 old allocs/op  new allocs/op  delta
FloatSqrt/256-4          12.0 ± 0%       8.0 ± 0%  -33.33%  (p=0.000 n=8+8)
FloatSqrt/1000-4         19.0 ± 0%       9.0 ± 0%  -52.63%  (p=0.000 n=8+8)
FloatSqrt/10000-4        35.0 ± 0%      14.0 ± 0%  -60.00%  (p=0.000 n=8+8)
FloatSqrt/100000-4       55.0 ± 0%      23.0 ± 0%  -58.18%  (p=0.000 n=8+8)
FloatSqrt/1000000-4       122 ± 0%        75 ± 0%  -38.52%  (p=0.000 n=8+8)

Change-Id: I950dbf61a40267a6cca82ae72524c3024bcb149c
Reviewed-on: https://go-review.googlesource.com/87659
Reviewed-by: Robert Griesemer <gri@golang.org>
2018-03-08 19:12:35 +00:00
isharipo
d2a5263a9c math/big: speedup nat.setBytes for bigger slices
Set up to _S (number of bytes in Uint) bytes at time
by using BigEndian.Uint32 and BigEndian.Uint64.

The performance improves for slices bigger than _S bytes.
This is the case for 128/256bit arith that initializes
it's objects from bytes.

name               old time/op  new time/op  delta
NatSetBytes/8-4    29.8ns ± 1%  11.4ns ± 0%  -61.63%  (p=0.000 n=9+8)
NatSetBytes/24-4    109ns ± 1%    56ns ± 0%  -48.75%  (p=0.000 n=9+8)
NatSetBytes/128-4   420ns ± 2%   110ns ± 1%  -73.83%  (p=0.000 n=10+10)
NatSetBytes/7-4    26.2ns ± 1%  21.3ns ± 2%  -18.63%  (p=0.000 n=8+9)
NatSetBytes/23-4    106ns ± 1%    67ns ± 1%  -36.93%  (p=0.000 n=9+10)
NatSetBytes/127-4   410ns ± 2%   121ns ± 0%  -70.46%  (p=0.000 n=9+8)

Found this optimization opportunity by looking at ethereum_corevm
community benchmark cpuprofile.

name        old time/op  new time/op  delta
OpDiv256-4   715ns ± 1%   596ns ± 1%  -16.57%  (p=0.008 n=5+5)
OpDiv128-4   373ns ± 1%   314ns ± 1%  -15.83%  (p=0.008 n=5+5)
OpDiv64-4    301ns ± 0%   285ns ± 1%   -5.12%  (p=0.008 n=5+5)

Change-Id: I8e5a680ae6284c8233d8d7431d51253a8a740b57
Reviewed-on: https://go-review.googlesource.com/98775
Run-TryBot: Iskander Sharipov <iskander.sharipov@intel.com>
Reviewed-by: Robert Griesemer <gri@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2018-03-08 18:50:10 +00:00
erifan01
0585d41c87 math/big: optimize addVW and subVW on arm64
The biggest hot spot of the existing implementation is "load" operations, which lead to poor performance.
By unrolling the cycle 4 times and 2 times, and using "LDP", "STP" instructions,
this CL can reduce the "load" cost and improve performance.

Benchmarks:

name                              old time/op    new time/op     delta
AddVV/1-8                           21.5ns ± 0%     21.5ns ± 0%      ~     (all equal)
AddVV/2-8                           13.5ns ± 0%     13.5ns ± 0%      ~     (all equal)
AddVV/3-8                           15.5ns ± 0%     15.5ns ± 0%      ~     (all equal)
AddVV/4-8                           17.5ns ± 0%     17.5ns ± 0%      ~     (all equal)
AddVV/5-8                           19.5ns ± 0%     19.5ns ± 0%      ~     (all equal)
AddVV/10-8                          29.5ns ± 0%     29.5ns ± 0%      ~     (all equal)
AddVV/100-8                          217ns ± 0%      217ns ± 0%      ~     (all equal)
AddVV/1000-8                        2.02µs ± 0%     2.02µs ± 0%      ~     (all equal)
AddVV/10000-8                       20.3µs ± 0%     20.3µs ± 0%      ~     (p=0.603 n=5+5)
AddVV/100000-8                       223µs ± 7%      228µs ± 8%      ~     (p=0.548 n=5+5)
AddVW/1-8                           9.32ns ± 0%     9.26ns ± 0%    -0.64%  (p=0.008 n=5+5)
AddVW/2-8                           19.8ns ± 3%     10.5ns ± 0%   -46.92%  (p=0.008 n=5+5)
AddVW/3-8                           11.5ns ± 0%     11.0ns ± 0%    -4.35%  (p=0.008 n=5+5)
AddVW/4-8                           13.0ns ± 0%     12.0ns ± 0%    -7.69%  (p=0.008 n=5+5)
AddVW/5-8                           14.5ns ± 0%     12.5ns ± 0%   -13.79%  (p=0.008 n=5+5)
AddVW/10-8                          22.0ns ± 0%     15.5ns ± 0%   -29.55%  (p=0.008 n=5+5)
AddVW/100-8                          167ns ± 0%       81ns ± 0%   -51.44%  (p=0.008 n=5+5)
AddVW/1000-8                        1.52µs ± 0%     0.64µs ± 0%   -57.58%  (p=0.008 n=5+5)
AddVW/10000-8                       15.1µs ± 0%      7.2µs ± 0%   -52.55%  (p=0.008 n=5+5)
AddVW/100000-8                       150µs ± 0%       71µs ± 0%   -52.95%  (p=0.008 n=5+5)
SubVW/1-8                           9.32ns ± 0%     9.26ns ± 0%    -0.64%  (p=0.008 n=5+5)
SubVW/2-8                           19.7ns ± 2%     10.5ns ± 0%   -46.70%  (p=0.008 n=5+5)
SubVW/3-8                           11.5ns ± 0%     11.0ns ± 0%    -4.35%  (p=0.008 n=5+5)
SubVW/4-8                           13.0ns ± 0%     12.0ns ± 0%    -7.69%  (p=0.008 n=5+5)
SubVW/5-8                           14.5ns ± 0%     12.5ns ± 0%   -13.79%  (p=0.008 n=5+5)
SubVW/10-8                          22.0ns ± 0%     15.5ns ± 0%   -29.55%  (p=0.008 n=5+5)
SubVW/100-8                          167ns ± 0%       81ns ± 0%   -51.44%  (p=0.008 n=5+5)
SubVW/1000-8                        1.52µs ± 0%     0.64µs ± 0%   -57.58%  (p=0.008 n=5+5)
SubVW/10000-8                       15.1µs ± 0%      7.2µs ± 0%   -52.49%  (p=0.008 n=5+5)
SubVW/100000-8                       150µs ± 0%       71µs ± 0%   -52.91%  (p=0.008 n=5+5)
AddMulVVW/1-8                       32.4ns ± 1%     32.6ns ± 1%      ~     (p=0.119 n=5+5)
AddMulVVW/2-8                       57.0ns ± 0%     57.0ns ± 0%      ~     (p=0.643 n=5+5)
AddMulVVW/3-8                       90.8ns ± 0%     90.7ns ± 0%      ~     (p=0.524 n=5+5)
AddMulVVW/4-8                        118ns ± 0%      118ns ± 1%      ~     (p=1.000 n=4+5)
AddMulVVW/5-8                        144ns ± 1%      144ns ± 0%      ~     (p=0.794 n=5+4)
AddMulVVW/10-8                       294ns ± 1%      296ns ± 0%    +0.48%  (p=0.040 n=5+5)
AddMulVVW/100-8                     2.73µs ± 0%     2.73µs ± 0%      ~     (p=0.278 n=5+5)
AddMulVVW/1000-8                    26.0µs ± 0%     26.5µs ± 0%    +2.14%  (p=0.008 n=5+5)
AddMulVVW/10000-8                    297µs ± 0%      297µs ± 0%    +0.24%  (p=0.008 n=5+5)
AddMulVVW/100000-8                  3.15ms ± 1%     3.13ms ± 0%      ~     (p=0.690 n=5+5)
DecimalConversion-8                  311µs ± 2%      309µs ± 2%      ~     (p=0.310 n=5+5)
FloatString/100-8                   2.55µs ± 2%     2.54µs ± 2%      ~     (p=1.000 n=5+5)
FloatString/1000-8                  58.1µs ± 0%     58.1µs ± 0%      ~     (p=0.151 n=5+5)
FloatString/10000-8                 4.59ms ± 0%     4.59ms ± 0%      ~     (p=0.151 n=5+5)
FloatString/100000-8                 446ms ± 0%      446ms ± 0%    +0.01%  (p=0.016 n=5+5)
FloatAdd/10-8                        183ns ± 0%      183ns ± 0%      ~     (p=0.333 n=4+5)
FloatAdd/100-8                       187ns ± 1%      192ns ± 2%      ~     (p=0.056 n=5+5)
FloatAdd/1000-8                      369ns ± 0%      371ns ± 0%    +0.54%  (p=0.016 n=4+5)
FloatAdd/10000-8                    1.88µs ± 0%     1.88µs ± 0%    -0.14%  (p=0.000 n=4+5)
FloatAdd/100000-8                   17.2µs ± 0%     17.1µs ± 0%    -0.37%  (p=0.008 n=5+5)
FloatSub/10-8                        147ns ± 0%      147ns ± 0%      ~     (all equal)
FloatSub/100-8                       145ns ± 0%      146ns ± 0%      ~     (p=0.238 n=5+4)
FloatSub/1000-8                      241ns ± 0%      241ns ± 0%      ~     (p=0.333 n=5+4)
FloatSub/10000-8                    1.06µs ± 0%     1.06µs ± 0%      ~     (p=0.444 n=5+5)
FloatSub/100000-8                   9.50µs ± 0%     9.48µs ± 0%    -0.14%  (p=0.008 n=5+5)
ParseFloatSmallExp-8                28.4µs ± 2%     28.5µs ± 1%      ~     (p=0.690 n=5+5)
ParseFloatLargeExp-8                 125µs ± 1%      124µs ± 1%      ~     (p=0.095 n=5+5)
GCD10x10/WithoutXY-8                 277ns ± 2%      278ns ± 3%      ~     (p=0.937 n=5+5)
GCD10x10/WithXY-8                   2.08µs ± 3%     2.15µs ± 3%      ~     (p=0.056 n=5+5)
GCD10x100/WithoutXY-8                592ns ± 3%      613ns ± 4%      ~     (p=0.056 n=5+5)
GCD10x100/WithXY-8                  3.40µs ± 2%     3.42µs ± 4%      ~     (p=0.841 n=5+5)
GCD10x1000/WithoutXY-8              1.37µs ± 2%     1.35µs ± 3%      ~     (p=0.460 n=5+5)
GCD10x1000/WithXY-8                 7.34µs ± 2%     7.33µs ± 4%      ~     (p=0.841 n=5+5)
GCD10x10000/WithoutXY-8             8.52µs ± 0%     8.51µs ± 1%      ~     (p=0.421 n=5+5)
GCD10x10000/WithXY-8                27.5µs ± 2%     27.2µs ± 1%      ~     (p=0.151 n=5+5)
GCD10x100000/WithoutXY-8            78.3µs ± 1%     78.5µs ± 1%      ~     (p=0.690 n=5+5)
GCD10x100000/WithXY-8                231µs ± 0%      229µs ± 1%    -1.11%  (p=0.016 n=5+5)
GCD100x100/WithoutXY-8              1.86µs ± 2%     1.86µs ± 2%      ~     (p=0.881 n=5+5)
GCD100x100/WithXY-8                 27.1µs ± 2%     27.2µs ± 1%      ~     (p=0.421 n=5+5)
GCD100x1000/WithoutXY-8             4.44µs ± 2%     4.41µs ± 1%      ~     (p=0.310 n=5+5)
GCD100x1000/WithXY-8                36.3µs ± 1%     36.2µs ± 1%      ~     (p=0.310 n=5+5)
GCD100x10000/WithoutXY-8            22.6µs ± 2%     22.5µs ± 1%      ~     (p=0.690 n=5+5)
GCD100x10000/WithXY-8                145µs ± 1%      145µs ± 1%      ~     (p=1.000 n=5+5)
GCD100x100000/WithoutXY-8            195µs ± 0%      196µs ± 1%      ~     (p=0.548 n=5+5)
GCD100x100000/WithXY-8              1.10ms ± 0%     1.10ms ± 0%    -0.30%  (p=0.016 n=5+5)
GCD1000x1000/WithoutXY-8            25.0µs ± 1%     25.2µs ± 2%      ~     (p=0.222 n=5+5)
GCD1000x1000/WithXY-8                520µs ± 0%      520µs ± 1%      ~     (p=0.151 n=5+5)
GCD1000x10000/WithoutXY-8           57.0µs ± 1%     56.9µs ± 1%      ~     (p=0.690 n=5+5)
GCD1000x10000/WithXY-8              1.21ms ± 0%     1.21ms ± 1%      ~     (p=0.881 n=5+5)
GCD1000x100000/WithoutXY-8           358µs ± 0%      359µs ± 1%      ~     (p=0.548 n=5+5)
GCD1000x100000/WithXY-8             8.73ms ± 0%     8.73ms ± 0%      ~     (p=0.548 n=5+5)
GCD10000x10000/WithoutXY-8           686µs ± 0%      687µs ± 0%      ~     (p=0.548 n=5+5)
GCD10000x10000/WithXY-8             15.9ms ± 0%     15.9ms ± 0%      ~     (p=0.841 n=5+5)
GCD10000x100000/WithoutXY-8         2.08ms ± 0%     2.08ms ± 0%      ~     (p=1.000 n=5+5)
GCD10000x100000/WithXY-8            86.7ms ± 0%     86.7ms ± 0%      ~     (p=1.000 n=5+5)
GCD100000x100000/WithoutXY-8        51.1ms ± 0%     51.0ms ± 0%      ~     (p=0.151 n=5+5)
GCD100000x100000/WithXY-8            1.23s ± 0%      1.23s ± 0%      ~     (p=0.841 n=5+5)
Hilbert-8                           2.41ms ± 1%     2.42ms ± 2%      ~     (p=0.690 n=5+5)
Binomial-8                          4.86µs ± 1%     4.86µs ± 1%      ~     (p=0.889 n=5+5)
QuoRem-8                            7.09µs ± 0%     7.08µs ± 0%    -0.09%  (p=0.024 n=5+5)
Exp-8                                161ms ± 0%      161ms ± 0%    -0.08%  (p=0.032 n=5+5)
Exp2-8                               161ms ± 0%      161ms ± 0%      ~     (p=1.000 n=5+5)
Bitset-8                            40.7ns ± 0%     40.6ns ± 0%      ~     (p=0.095 n=4+5)
BitsetNeg-8                          159ns ± 4%      148ns ± 0%    -6.92%  (p=0.016 n=5+4)
BitsetOrig-8                         378ns ± 1%      378ns ± 1%      ~     (p=0.937 n=5+5)
BitsetNegOrig-8                      647ns ± 5%      647ns ± 4%      ~     (p=1.000 n=5+5)
ModSqrt225_Tonelli-8                7.26ms ± 0%     7.27ms ± 0%      ~     (p=1.000 n=5+5)
ModSqrt224_3Mod4-8                  2.24ms ± 0%     2.24ms ± 0%      ~     (p=0.690 n=5+5)
ModSqrt5430_Tonelli-8                62.8s ± 1%      62.5s ± 0%      ~     (p=0.063 n=5+4)
ModSqrt5430_3Mod4-8                  20.8s ± 0%      20.8s ± 0%      ~     (p=0.310 n=5+5)
Sqrt-8                               101µs ± 1%      101µs ± 0%    -0.35%  (p=0.032 n=5+5)
IntSqr/1-8                          32.3ns ± 1%     32.5ns ± 1%      ~     (p=0.421 n=5+5)
IntSqr/2-8                           157ns ± 5%      156ns ± 5%      ~     (p=0.651 n=5+5)
IntSqr/3-8                           292ns ± 2%      291ns ± 3%      ~     (p=0.881 n=5+5)
IntSqr/5-8                           738ns ± 6%      740ns ± 5%      ~     (p=0.841 n=5+5)
IntSqr/8-8                          1.82µs ± 4%     1.83µs ± 4%      ~     (p=0.730 n=5+5)
IntSqr/10-8                         2.92µs ± 1%     2.93µs ± 1%      ~     (p=0.643 n=5+5)
IntSqr/20-8                         6.28µs ± 2%     6.28µs ± 2%      ~     (p=1.000 n=5+5)
IntSqr/30-8                         13.8µs ± 2%     13.9µs ± 3%      ~     (p=1.000 n=5+5)
IntSqr/50-8                         37.8µs ± 4%     37.9µs ± 4%      ~     (p=0.690 n=5+5)
IntSqr/80-8                         95.9µs ± 1%     95.8µs ± 1%      ~     (p=0.841 n=5+5)
IntSqr/100-8                         148µs ± 1%      148µs ± 1%      ~     (p=0.310 n=5+5)
IntSqr/200-8                         586µs ± 1%      586µs ± 1%      ~     (p=0.841 n=5+5)
IntSqr/300-8                        1.32ms ± 0%     1.31ms ± 0%      ~     (p=0.222 n=5+5)
IntSqr/500-8                        2.48ms ± 0%     2.48ms ± 0%      ~     (p=0.556 n=5+4)
IntSqr/800-8                        4.68ms ± 0%     4.68ms ± 0%      ~     (p=0.548 n=5+5)
IntSqr/1000-8                       7.57ms ± 0%     7.56ms ± 0%      ~     (p=0.421 n=5+5)
Mul-8                                311ms ± 0%      311ms ± 0%      ~     (p=0.548 n=5+5)
Exp3Power/0x10-8                     559ns ± 1%      560ns ± 1%      ~     (p=0.984 n=5+5)
Exp3Power/0x40-8                     641ns ± 1%      634ns ± 1%      ~     (p=0.063 n=5+5)
Exp3Power/0x100-8                   1.39µs ± 2%     1.40µs ± 2%      ~     (p=0.381 n=5+5)
Exp3Power/0x400-8                   8.27µs ± 1%     8.26µs ± 0%      ~     (p=0.571 n=5+5)
Exp3Power/0x1000-8                  59.9µs ± 0%     59.7µs ± 0%    -0.23%  (p=0.008 n=5+5)
Exp3Power/0x4000-8                   816µs ± 0%      816µs ± 0%      ~     (p=1.000 n=5+5)
Exp3Power/0x10000-8                 7.77ms ± 0%     7.77ms ± 0%      ~     (p=0.841 n=5+5)
Exp3Power/0x40000-8                 73.4ms ± 0%     73.4ms ± 0%      ~     (p=0.690 n=5+5)
Exp3Power/0x100000-8                 665ms ± 0%      664ms ± 0%    -0.14%  (p=0.008 n=5+5)
Exp3Power/0x400000-8                 5.98s ± 0%      5.98s ± 0%    -0.09%  (p=0.008 n=5+5)
Fibo-8                               116ms ± 0%      116ms ± 0%    -0.25%  (p=0.008 n=5+5)
NatSqr/1-8                           115ns ± 3%      116ns ± 2%      ~     (p=0.238 n=5+5)
NatSqr/2-8                           237ns ± 1%      237ns ± 1%      ~     (p=0.683 n=5+5)
NatSqr/3-8                           367ns ± 3%      368ns ± 3%      ~     (p=0.817 n=5+5)
NatSqr/5-8                           807ns ± 3%      812ns ± 3%      ~     (p=0.913 n=5+5)
NatSqr/8-8                          1.93µs ± 2%     1.93µs ± 3%      ~     (p=0.651 n=5+5)
NatSqr/10-8                         2.98µs ± 2%     2.99µs ± 2%      ~     (p=0.690 n=5+5)
NatSqr/20-8                         6.49µs ± 2%     6.46µs ± 2%      ~     (p=0.548 n=5+5)
NatSqr/30-8                         14.4µs ± 2%     14.3µs ± 2%      ~     (p=0.690 n=5+5)
NatSqr/50-8                         38.6µs ± 2%     38.7µs ± 2%      ~     (p=0.841 n=5+5)
NatSqr/80-8                         96.1µs ± 2%     95.8µs ± 2%      ~     (p=0.548 n=5+5)
NatSqr/100-8                         149µs ± 1%      149µs ± 1%      ~     (p=0.841 n=5+5)
NatSqr/200-8                         593µs ± 1%      590µs ± 1%      ~     (p=0.421 n=5+5)
NatSqr/300-8                        1.32ms ± 0%     1.32ms ± 1%      ~     (p=0.222 n=5+5)
NatSqr/500-8                        2.49ms ± 0%     2.49ms ± 0%      ~     (p=0.690 n=5+5)
NatSqr/800-8                        4.69ms ± 0%     4.69ms ± 0%      ~     (p=1.000 n=5+5)
NatSqr/1000-8                       7.59ms ± 0%     7.58ms ± 0%      ~     (p=0.841 n=5+5)
ScanPi-8                             322µs ± 0%      321µs ± 0%      ~     (p=0.095 n=5+5)
StringPiParallel-8                  71.4µs ± 5%     68.8µs ± 4%      ~     (p=0.151 n=5+5)
Scan/10/Base2-8                     1.10µs ± 0%     1.09µs ± 0%    -0.36%  (p=0.032 n=5+5)
Scan/100/Base2-8                    7.78µs ± 0%     7.79µs ± 0%    +0.14%  (p=0.008 n=5+5)
Scan/1000/Base2-8                   78.8µs ± 0%     79.0µs ± 0%    +0.24%  (p=0.008 n=5+5)
Scan/10000/Base2-8                  1.22ms ± 0%     1.22ms ± 0%      ~     (p=0.056 n=5+5)
Scan/100000/Base2-8                 55.1ms ± 0%     55.0ms ± 0%    -0.15%  (p=0.008 n=5+5)
Scan/10/Base8-8                      514ns ± 0%      515ns ± 0%      ~     (p=0.079 n=5+5)
Scan/100/Base8-8                    2.89µs ± 0%     2.89µs ± 0%    +0.15%  (p=0.008 n=5+5)
Scan/1000/Base8-8                   31.0µs ± 0%     31.1µs ± 0%    +0.12%  (p=0.008 n=5+5)
Scan/10000/Base8-8                   740µs ± 0%      740µs ± 0%      ~     (p=0.222 n=5+5)
Scan/100000/Base8-8                 50.6ms ± 0%     50.5ms ± 0%    -0.06%  (p=0.016 n=4+5)
Scan/10/Base10-8                     492ns ± 1%      490ns ± 1%      ~     (p=0.310 n=5+5)
Scan/100/Base10-8                   2.67µs ± 0%     2.67µs ± 0%      ~     (p=0.056 n=5+5)
Scan/1000/Base10-8                  28.7µs ± 0%     28.7µs ± 0%      ~     (p=1.000 n=5+5)
Scan/10000/Base10-8                  717µs ± 0%      716µs ± 0%      ~     (p=0.222 n=5+5)
Scan/100000/Base10-8                50.2ms ± 0%     50.3ms ± 0%    +0.05%  (p=0.008 n=5+5)
Scan/10/Base16-8                     442ns ± 1%      442ns ± 0%      ~     (p=0.468 n=5+5)
Scan/100/Base16-8                   2.46µs ± 0%     2.45µs ± 0%      ~     (p=0.159 n=5+5)
Scan/1000/Base16-8                  27.2µs ± 0%     27.2µs ± 0%      ~     (p=0.841 n=5+5)
Scan/10000/Base16-8                  721µs ± 0%      722µs ± 0%      ~     (p=0.548 n=5+5)
Scan/100000/Base16-8                52.6ms ± 0%     52.6ms ± 0%    +0.07%  (p=0.008 n=5+5)
String/10/Base2-8                    244ns ± 1%      242ns ± 1%      ~     (p=0.103 n=5+5)
String/100/Base2-8                  1.48µs ± 0%     1.48µs ± 1%      ~     (p=0.786 n=5+5)
String/1000/Base2-8                 13.3µs ± 1%     13.3µs ± 0%      ~     (p=0.222 n=5+5)
String/10000/Base2-8                 132µs ± 1%      132µs ± 1%      ~     (p=1.000 n=5+5)
String/100000/Base2-8               1.30ms ± 1%     1.30ms ± 1%      ~     (p=1.000 n=5+5)
String/10/Base8-8                    167ns ± 1%      168ns ± 1%      ~     (p=0.135 n=5+5)
String/100/Base8-8                   623ns ± 1%      626ns ± 1%      ~     (p=0.151 n=5+5)
String/1000/Base8-8                 5.24µs ± 1%     5.24µs ± 0%      ~     (p=1.000 n=5+5)
String/10000/Base8-8                50.0µs ± 1%     50.0µs ± 1%      ~     (p=1.000 n=5+5)
String/100000/Base8-8                492µs ± 1%      489µs ± 1%      ~     (p=0.056 n=5+5)
String/10/Base10-8                   503ns ± 1%      501ns ± 0%      ~     (p=0.183 n=5+5)
String/100/Base10-8                 1.96µs ± 0%     1.97µs ± 0%      ~     (p=0.389 n=5+5)
String/1000/Base10-8                12.4µs ± 1%     12.4µs ± 1%      ~     (p=0.841 n=5+5)
String/10000/Base10-8               56.7µs ± 1%     56.6µs ± 0%      ~     (p=1.000 n=5+5)
String/100000/Base10-8              25.6ms ± 0%     25.6ms ± 0%      ~     (p=0.222 n=5+5)
String/10/Base16-8                   147ns ± 0%      148ns ± 2%      ~     (p=1.000 n=4+5)
String/100/Base16-8                  505ns ± 0%      505ns ± 1%      ~     (p=0.778 n=5+5)
String/1000/Base16-8                3.94µs ± 0%     3.94µs ± 0%      ~     (p=0.841 n=5+5)
String/10000/Base16-8               37.4µs ± 1%     37.2µs ± 1%      ~     (p=0.095 n=5+5)
String/100000/Base16-8               367µs ± 1%      367µs ± 0%      ~     (p=1.000 n=5+5)
LeafSize/0-8                        6.64ms ± 0%     6.65ms ± 0%      ~     (p=0.690 n=5+5)
LeafSize/1-8                        72.5µs ± 1%     72.4µs ± 1%      ~     (p=0.841 n=5+5)
LeafSize/2-8                        72.6µs ± 1%     72.6µs ± 1%      ~     (p=1.000 n=5+5)
LeafSize/3-8                         377µs ± 0%      377µs ± 0%      ~     (p=0.421 n=5+5)
LeafSize/4-8                        71.2µs ± 1%     71.3µs ± 0%      ~     (p=0.278 n=5+5)
LeafSize/5-8                         469µs ± 0%      469µs ± 0%      ~     (p=0.310 n=5+5)
LeafSize/6-8                         376µs ± 0%      376µs ± 0%      ~     (p=0.841 n=5+5)
LeafSize/7-8                         244µs ± 0%      244µs ± 0%      ~     (p=0.841 n=5+5)
LeafSize/8-8                        71.9µs ± 1%     72.1µs ± 1%      ~     (p=0.548 n=5+5)
LeafSize/9-8                         536µs ± 0%      536µs ± 0%      ~     (p=0.151 n=5+5)
LeafSize/10-8                        470µs ± 0%      471µs ± 0%    +0.10%  (p=0.032 n=5+5)
LeafSize/11-8                        458µs ± 0%      458µs ± 0%      ~     (p=0.881 n=5+5)
LeafSize/12-8                        376µs ± 0%      376µs ± 0%      ~     (p=0.548 n=5+5)
LeafSize/13-8                        341µs ± 0%      342µs ± 0%      ~     (p=0.222 n=5+5)
LeafSize/14-8                        246µs ± 0%      245µs ± 0%      ~     (p=0.167 n=5+5)
LeafSize/15-8                        168µs ± 0%      168µs ± 0%      ~     (p=0.548 n=5+5)
LeafSize/16-8                       72.1µs ± 1%     72.2µs ± 1%      ~     (p=0.690 n=5+5)
LeafSize/32-8                       81.5µs ± 1%     81.4µs ± 1%      ~     (p=1.000 n=5+5)
LeafSize/64-8                        133µs ± 1%      134µs ± 1%      ~     (p=0.690 n=5+5)
ProbablyPrime/n=0-8                 44.3ms ± 0%     44.2ms ± 0%    -0.28%  (p=0.008 n=5+5)
ProbablyPrime/n=1-8                 64.8ms ± 0%     64.7ms ± 0%    -0.15%  (p=0.008 n=5+5)
ProbablyPrime/n=5-8                  147ms ± 0%      147ms ± 0%    -0.11%  (p=0.008 n=5+5)
ProbablyPrime/n=10-8                 250ms ± 0%      250ms ± 0%      ~     (p=0.056 n=5+5)
ProbablyPrime/n=20-8                 456ms ± 0%      455ms ± 0%    -0.05%  (p=0.008 n=5+5)
ProbablyPrime/Lucas-8               23.6ms ± 0%     23.5ms ± 0%    -0.29%  (p=0.008 n=5+5)
ProbablyPrime/MillerRabinBase2-8    20.6ms ± 0%     20.6ms ± 0%      ~     (p=0.690 n=5+5)
FloatSqrt/64-8                      2.01µs ± 1%     2.02µs ± 1%      ~     (p=0.421 n=5+5)
FloatSqrt/128-8                     4.43µs ± 2%     4.38µs ± 2%      ~     (p=0.222 n=5+5)
FloatSqrt/256-8                     6.64µs ± 1%     6.68µs ± 2%      ~     (p=0.516 n=5+5)
FloatSqrt/1000-8                    31.9µs ± 0%     31.8µs ± 0%      ~     (p=0.095 n=5+5)
FloatSqrt/10000-8                    595µs ± 0%      594µs ± 0%      ~     (p=0.056 n=5+5)
FloatSqrt/100000-8                  17.9ms ± 0%     17.9ms ± 0%      ~     (p=0.151 n=5+5)
FloatSqrt/1000000-8                  1.52s ± 0%      1.52s ± 0%      ~     (p=0.841 n=5+5)

name                              old speed      new speed       delta
AddVV/1-8                         2.97GB/s ± 0%   2.97GB/s ± 0%      ~     (p=0.971 n=4+4)
AddVV/2-8                         9.47GB/s ± 0%   9.47GB/s ± 0%    +0.01%  (p=0.016 n=5+5)
AddVV/3-8                         12.4GB/s ± 0%   12.4GB/s ± 0%      ~     (p=0.548 n=5+5)
AddVV/4-8                         14.6GB/s ± 0%   14.6GB/s ± 0%      ~     (p=1.000 n=5+5)
AddVV/5-8                         16.4GB/s ± 0%   16.4GB/s ± 0%      ~     (p=1.000 n=5+5)
AddVV/10-8                        21.7GB/s ± 0%   21.7GB/s ± 0%      ~     (p=0.548 n=5+5)
AddVV/100-8                       29.4GB/s ± 0%   29.4GB/s ± 0%      ~     (p=1.000 n=5+5)
AddVV/1000-8                      31.7GB/s ± 0%   31.7GB/s ± 0%      ~     (p=0.524 n=5+4)
AddVV/10000-8                     31.5GB/s ± 0%   31.5GB/s ± 0%      ~     (p=0.690 n=5+5)
AddVV/100000-8                    28.8GB/s ± 7%   28.1GB/s ± 8%      ~     (p=0.548 n=5+5)
AddVW/1-8                          859MB/s ± 0%    864MB/s ± 0%    +0.61%  (p=0.008 n=5+5)
AddVW/2-8                          809MB/s ± 2%   1520MB/s ± 0%   +87.78%  (p=0.008 n=5+5)
AddVW/3-8                         2.08GB/s ± 0%   2.18GB/s ± 0%    +4.54%  (p=0.008 n=5+5)
AddVW/4-8                         2.46GB/s ± 0%   2.66GB/s ± 0%    +8.33%  (p=0.016 n=4+5)
AddVW/5-8                         2.76GB/s ± 0%   3.20GB/s ± 0%   +16.03%  (p=0.008 n=5+5)
AddVW/10-8                        3.63GB/s ± 0%   5.15GB/s ± 0%   +41.83%  (p=0.008 n=5+5)
AddVW/100-8                       4.79GB/s ± 0%   9.87GB/s ± 0%  +106.12%  (p=0.008 n=5+5)
AddVW/1000-8                      5.27GB/s ± 0%  12.42GB/s ± 0%  +135.74%  (p=0.008 n=5+5)
AddVW/10000-8                     5.31GB/s ± 0%  11.19GB/s ± 0%  +110.71%  (p=0.008 n=5+5)
AddVW/100000-8                    5.32GB/s ± 0%  11.32GB/s ± 0%  +112.56%  (p=0.008 n=5+5)
SubVW/1-8                          859MB/s ± 0%    864MB/s ± 0%    +0.61%  (p=0.008 n=5+5)
SubVW/2-8                          812MB/s ± 2%   1520MB/s ± 0%   +87.09%  (p=0.008 n=5+5)
SubVW/3-8                         2.08GB/s ± 0%   2.18GB/s ± 0%    +4.55%  (p=0.008 n=5+5)
SubVW/4-8                         2.46GB/s ± 0%   2.66GB/s ± 0%    +8.33%  (p=0.008 n=5+5)
SubVW/5-8                         2.75GB/s ± 0%   3.20GB/s ± 0%   +16.03%  (p=0.008 n=5+5)
SubVW/10-8                        3.63GB/s ± 0%   5.15GB/s ± 0%   +41.82%  (p=0.008 n=5+5)
SubVW/100-8                       4.79GB/s ± 0%   9.87GB/s ± 0%  +106.13%  (p=0.008 n=5+5)
SubVW/1000-8                      5.27GB/s ± 0%  12.42GB/s ± 0%  +135.74%  (p=0.008 n=5+5)
SubVW/10000-8                     5.31GB/s ± 0%  11.17GB/s ± 0%  +110.44%  (p=0.008 n=5+5)
SubVW/100000-8                    5.32GB/s ± 0%  11.31GB/s ± 0%  +112.35%  (p=0.008 n=5+5)
AddMulVVW/1-8                     1.97GB/s ± 1%   1.96GB/s ± 1%      ~     (p=0.151 n=5+5)
AddMulVVW/2-8                     2.24GB/s ± 0%   2.25GB/s ± 0%      ~     (p=0.095 n=5+5)
AddMulVVW/3-8                     2.11GB/s ± 0%   2.12GB/s ± 0%      ~     (p=0.548 n=5+5)
AddMulVVW/4-8                     2.17GB/s ± 1%   2.17GB/s ± 1%      ~     (p=0.548 n=5+5)
AddMulVVW/5-8                     2.22GB/s ± 1%   2.21GB/s ± 1%      ~     (p=0.421 n=5+5)
AddMulVVW/10-8                    2.17GB/s ± 1%   2.16GB/s ± 0%      ~     (p=0.095 n=5+5)
AddMulVVW/100-8                   2.35GB/s ± 0%   2.35GB/s ± 0%      ~     (p=0.421 n=5+5)
AddMulVVW/1000-8                  2.47GB/s ± 0%   2.41GB/s ± 0%    -2.09%  (p=0.008 n=5+5)
AddMulVVW/10000-8                 2.16GB/s ± 0%   2.15GB/s ± 0%    -0.23%  (p=0.008 n=5+5)
AddMulVVW/100000-8                2.03GB/s ± 1%   2.04GB/s ± 0%      ~     (p=0.690 n=5+5)

name                              old alloc/op   new alloc/op    delta
FloatString/100-8                     400B ± 0%       400B ± 0%      ~     (all equal)
FloatString/1000-8                  3.22kB ± 0%     3.22kB ± 0%      ~     (all equal)
FloatString/10000-8                 55.6kB ± 0%     55.5kB ± 0%      ~     (p=0.206 n=5+5)
FloatString/100000-8                 627kB ± 0%      627kB ± 0%      ~     (all equal)
FloatAdd/10-8                        0.00B           0.00B           ~     (all equal)
FloatAdd/100-8                       0.00B           0.00B           ~     (all equal)
FloatAdd/1000-8                      0.00B           0.00B           ~     (all equal)
FloatAdd/10000-8                     0.00B           0.00B           ~     (all equal)
FloatAdd/100000-8                    0.00B           0.00B           ~     (all equal)
FloatSub/10-8                        0.00B           0.00B           ~     (all equal)
FloatSub/100-8                       0.00B           0.00B           ~     (all equal)
FloatSub/1000-8                      0.00B           0.00B           ~     (all equal)
FloatSub/10000-8                     0.00B           0.00B           ~     (all equal)
FloatSub/100000-8                    0.00B           0.00B           ~     (all equal)
FloatSqrt/64-8                        416B ± 0%       416B ± 0%      ~     (all equal)
FloatSqrt/128-8                       720B ± 0%       720B ± 0%      ~     (all equal)
FloatSqrt/256-8                       816B ± 0%       816B ± 0%      ~     (all equal)
FloatSqrt/1000-8                    2.50kB ± 0%     2.50kB ± 0%      ~     (all equal)
FloatSqrt/10000-8                   23.5kB ± 0%     23.5kB ± 0%      ~     (all equal)
FloatSqrt/100000-8                   251kB ± 0%      251kB ± 0%      ~     (all equal)
FloatSqrt/1000000-8                 4.61MB ± 0%     4.61MB ± 0%      ~     (all equal)

name                              old allocs/op  new allocs/op   delta
FloatString/100-8                     8.00 ± 0%       8.00 ± 0%      ~     (all equal)
FloatString/1000-8                    10.0 ± 0%       10.0 ± 0%      ~     (all equal)
FloatString/10000-8                   42.0 ± 0%       42.0 ± 0%      ~     (all equal)
FloatString/100000-8                   346 ± 0%        346 ± 0%      ~     (all equal)
FloatAdd/10-8                         0.00            0.00           ~     (all equal)
FloatAdd/100-8                        0.00            0.00           ~     (all equal)
FloatAdd/1000-8                       0.00            0.00           ~     (all equal)
FloatAdd/10000-8                      0.00            0.00           ~     (all equal)
FloatAdd/100000-8                     0.00            0.00           ~     (all equal)
FloatSub/10-8                         0.00            0.00           ~     (all equal)
FloatSub/100-8                        0.00            0.00           ~     (all equal)
FloatSub/1000-8                       0.00            0.00           ~     (all equal)
FloatSub/10000-8                      0.00            0.00           ~     (all equal)
FloatSub/100000-8                     0.00            0.00           ~     (all equal)
FloatSqrt/64-8                        9.00 ± 0%       9.00 ± 0%      ~     (all equal)
FloatSqrt/128-8                       13.0 ± 0%       13.0 ± 0%      ~     (all equal)
FloatSqrt/256-8                       12.0 ± 0%       12.0 ± 0%      ~     (all equal)
FloatSqrt/1000-8                      19.0 ± 0%       19.0 ± 0%      ~     (all equal)
FloatSqrt/10000-8                     35.0 ± 0%       35.0 ± 0%      ~     (all equal)
FloatSqrt/100000-8                    55.0 ± 0%       55.0 ± 0%      ~     (all equal)
FloatSqrt/1000000-8                    122 ± 0%        122 ± 0%      ~     (all equal)

Change-Id: I6888d84c037d91f9e2199f3492ea3f6a0ed77b24
Reviewed-on: https://go-review.googlesource.com/77832
Reviewed-by: Vlad Krasnov <vlad@cloudflare.com>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2018-03-08 15:31:37 +00:00
Vlad Krasnov
fd3d27938a math/big: implement addMulVVW on arm64
The lack of proper addMulVVW implementation for arm64 hurts RSA performance.

This assembly implementation is optimized for arm64 based servers.

name                  old time/op    new time/op     delta
pkg:math/big goos:linux goarch:arm64
AddMulVVW/1             55.2ns ± 0%     11.9ns ± 1%    -78.37%  (p=0.000 n=8+10)
AddMulVVW/2             67.0ns ± 0%     11.2ns ± 0%    -83.28%  (p=0.000 n=7+10)
AddMulVVW/3             93.2ns ± 0%     13.2ns ± 0%    -85.84%  (p=0.000 n=10+10)
AddMulVVW/4              126ns ± 0%       13ns ± 1%    -89.82%  (p=0.000 n=10+10)
AddMulVVW/5              151ns ± 0%       17ns ± 0%    -88.87%  (p=0.000 n=10+9)
AddMulVVW/10             323ns ± 0%       25ns ± 0%    -92.20%  (p=0.000 n=10+10)
AddMulVVW/100           3.28µs ± 0%     0.14µs ± 0%    -95.82%  (p=0.000 n=10+10)
AddMulVVW/1000          31.7µs ± 0%      1.3µs ± 0%    -96.00%  (p=0.000 n=10+8)
AddMulVVW/10000          313µs ± 0%       13µs ± 0%    -95.98%  (p=0.000 n=10+10)
AddMulVVW/100000        3.24ms ± 0%     0.13ms ± 1%    -96.13%  (p=0.000 n=9+9)
pkg:crypto/rsa goos:linux goarch:arm64
RSA2048Decrypt          44.7ms ± 0%      4.0ms ± 6%    -91.08%  (p=0.000 n=8+10)
RSA2048Sign             46.3ms ± 0%      5.0ms ± 0%    -89.29%  (p=0.000 n=9+10)
3PrimeRSA2048Decrypt    22.3ms ± 0%      2.4ms ± 0%    -89.26%  (p=0.000 n=10+10)

Change-Id: I295f0bd5c51a4442d02c44ece1f6026d30dff0bc
Reviewed-on: https://go-review.googlesource.com/76270
Reviewed-by: Vlad Krasnov <vlad@cloudflare.com>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Run-TryBot: Vlad Krasnov <vlad@cloudflare.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2018-03-07 23:04:38 +00:00
Cherry Zhang
084143d844 math/big: don't use R18 in ARM64 assembly
R18 seems reserved on Apple platforms.

May fix darwin/arm64 build.

Change-Id: Ia2c1de550a64827c85a64affa53b94c62aacce8e
Reviewed-on: https://go-review.googlesource.com/98896
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Elias Naur <elias.naur@gmail.com>
2018-03-06 15:34:00 +00:00
erifan01
c4f3fe95c6 math/big: optimize addVV and subVV on arm64
The biggest hot spot of the existing implementation is "load" operations, which lead to poor performance.
By unrolling the cycle 4x and 2x, and using "LDP", "STP" instructions, this CL can reduce the "load" cost and improve performance.

Benchmarks:

name                              old time/op    new time/op     delta
AddVV/1-8                           21.5ns ± 0%     11.5ns ± 0%   -46.51%  (p=0.008 n=5+5)
AddVV/2-8                           13.5ns ± 0%     12.0ns ± 0%   -11.11%  (p=0.008 n=5+5)
AddVV/3-8                           15.5ns ± 0%     13.0ns ± 0%   -16.13%  (p=0.008 n=5+5)
AddVV/4-8                           17.5ns ± 0%     13.5ns ± 0%   -22.86%  (p=0.008 n=5+5)
AddVV/5-8                           19.5ns ± 0%     14.5ns ± 0%   -25.64%  (p=0.008 n=5+5)
AddVV/10-8                          29.5ns ± 0%     18.0ns ± 0%   -38.98%  (p=0.008 n=5+5)
AddVV/100-8                          217ns ± 0%       94ns ± 0%   -56.64%  (p=0.008 n=5+5)
AddVV/1000-8                        2.02µs ± 0%     1.03µs ± 0%   -48.85%  (p=0.008 n=5+5)
AddVV/10000-8                       20.5µs ± 0%     11.3µs ± 0%   -44.70%  (p=0.008 n=5+5)
AddVV/100000-8                       247µs ± 3%      154µs ± 0%   -37.52%  (p=0.008 n=5+5)
SubVV/1-8                           21.5ns ± 0%     11.5ns ± 0%      ~     (p=0.079 n=4+5)
SubVV/2-8                           13.5ns ± 0%     12.0ns ± 0%   -11.11%  (p=0.008 n=5+5)
SubVV/3-8                           15.5ns ± 0%     13.0ns ± 0%   -16.13%  (p=0.008 n=5+5)
SubVV/4-8                           17.5ns ± 0%     13.5ns ± 0%   -22.86%  (p=0.008 n=5+5)
SubVV/5-8                           19.5ns ± 0%     14.5ns ± 0%   -25.64%  (p=0.008 n=5+5)
SubVV/10-8                          29.5ns ± 0%     18.0ns ± 0%   -38.98%  (p=0.008 n=5+5)
SubVV/100-8                          217ns ± 0%       94ns ± 0%   -56.64%  (p=0.008 n=5+5)
SubVV/1000-8                        2.02µs ± 0%     0.80µs ± 0%   -60.50%  (p=0.008 n=5+5)
SubVV/10000-8                       20.5µs ± 0%     11.3µs ± 0%   -44.99%  (p=0.008 n=5+5)
SubVV/100000-8                       221µs ±11%      223µs ±16%      ~     (p=0.690 n=5+5)
AddVW/1-8                           9.32ns ± 0%     9.32ns ± 0%      ~     (all equal)
AddVW/2-8                           19.7ns ± 1%     19.7ns ± 0%      ~     (p=0.381 n=5+4)
AddVW/3-8                           11.5ns ± 0%     11.5ns ± 0%      ~     (all equal)
AddVW/4-8                           13.0ns ± 0%     13.0ns ± 0%      ~     (all equal)
AddVW/5-8                           14.5ns ± 0%     14.5ns ± 0%      ~     (all equal)
AddVW/10-8                          22.0ns ± 0%     22.0ns ± 0%      ~     (all equal)
AddVW/100-8                          167ns ± 0%      167ns ± 0%      ~     (all equal)
AddVW/1000-8                        1.52µs ± 0%     1.52µs ± 0%    +0.40%  (p=0.008 n=5+5)
AddVW/10000-8                       15.1µs ± 0%     15.1µs ± 0%      ~     (p=0.556 n=5+4)
AddVW/100000-8                       152µs ± 1%      152µs ± 1%      ~     (p=0.690 n=5+5)
AddMulVVW/1-8                       33.3ns ± 0%     32.7ns ± 1%    -1.86%  (p=0.008 n=5+5)
AddMulVVW/2-8                       59.3ns ± 1%     56.9ns ± 1%    -4.15%  (p=0.008 n=5+5)
AddMulVVW/3-8                       80.5ns ± 1%     85.4ns ± 3%    +6.19%  (p=0.008 n=5+5)
AddMulVVW/4-8                        127ns ± 0%      111ns ± 1%   -13.19%  (p=0.008 n=5+5)
AddMulVVW/5-8                        144ns ± 0%      149ns ± 0%    +3.47%  (p=0.016 n=4+5)
AddMulVVW/10-8                       298ns ± 1%      283ns ± 0%    -4.77%  (p=0.008 n=5+5)
AddMulVVW/100-8                     3.06µs ± 0%     2.99µs ± 0%    -2.21%  (p=0.008 n=5+5)
AddMulVVW/1000-8                    31.3µs ± 0%     26.9µs ± 0%   -14.17%  (p=0.008 n=5+5)
AddMulVVW/10000-8                    316µs ± 0%      305µs ± 0%    -3.51%  (p=0.008 n=5+5)
AddMulVVW/100000-8                  3.17ms ± 0%     3.17ms ± 1%      ~     (p=0.690 n=5+5)
DecimalConversion-8                  316µs ± 1%      313µs ± 2%      ~     (p=0.095 n=5+5)
FloatString/100-8                   2.53µs ± 1%     2.56µs ± 2%      ~     (p=0.222 n=5+5)
FloatString/1000-8                  58.4µs ± 0%     58.5µs ± 0%      ~     (p=0.206 n=5+5)
FloatString/10000-8                 4.59ms ± 0%     4.58ms ± 0%    -0.31%  (p=0.008 n=5+5)
FloatString/100000-8                 446ms ± 0%      444ms ± 0%    -0.31%  (p=0.008 n=5+5)
FloatAdd/10-8                        184ns ± 0%      172ns ± 0%    -6.30%  (p=0.008 n=5+5)
FloatAdd/100-8                       189ns ± 2%      191ns ± 4%      ~     (p=0.381 n=5+5)
FloatAdd/1000-8                      371ns ± 0%      347ns ± 1%    -6.42%  (p=0.008 n=5+5)
FloatAdd/10000-8                    1.87µs ± 0%     1.68µs ± 0%   -10.16%  (p=0.008 n=5+5)
FloatAdd/100000-8                   17.1µs ± 0%     15.6µs ± 0%    -8.74%  (p=0.016 n=5+4)
FloatSub/10-8                        152ns ± 0%      138ns ± 0%    -9.47%  (p=0.000 n=4+5)
FloatSub/100-8                       148ns ± 0%      142ns ± 0%    -4.05%  (p=0.000 n=5+4)
FloatSub/1000-8                      245ns ± 1%      217ns ± 0%   -11.28%  (p=0.000 n=5+4)
FloatSub/10000-8                    1.07µs ± 0%     0.88µs ± 1%   -18.14%  (p=0.008 n=5+5)
FloatSub/100000-8                   9.58µs ± 0%     7.96µs ± 0%   -16.84%  (p=0.008 n=5+5)
ParseFloatSmallExp-8                28.8µs ± 1%     29.0µs ± 1%      ~     (p=0.095 n=5+5)
ParseFloatLargeExp-8                 126µs ± 1%      126µs ± 1%      ~     (p=0.841 n=5+5)
GCD10x10/WithoutXY-8                 277ns ± 2%      281ns ± 4%      ~     (p=0.746 n=5+5)
GCD10x10/WithXY-8                   2.10µs ± 1%     2.12µs ± 3%      ~     (p=0.548 n=5+5)
GCD10x100/WithoutXY-8                615ns ± 3%      607ns ± 2%      ~     (p=0.135 n=5+5)
GCD10x100/WithXY-8                  3.50µs ± 2%     3.62µs ± 5%      ~     (p=0.151 n=5+5)
GCD10x1000/WithoutXY-8              1.39µs ± 2%     1.39µs ± 3%      ~     (p=0.690 n=5+5)
GCD10x1000/WithXY-8                 7.39µs ± 1%     7.34µs ± 2%      ~     (p=0.135 n=5+5)
GCD10x10000/WithoutXY-8             8.66µs ± 1%     8.68µs ± 1%      ~     (p=0.421 n=5+5)
GCD10x10000/WithXY-8                28.1µs ± 2%     27.0µs ± 2%    -3.81%  (p=0.008 n=5+5)
GCD10x100000/WithoutXY-8            79.3µs ± 1%     79.3µs ± 1%      ~     (p=0.841 n=5+5)
GCD10x100000/WithXY-8                238µs ± 0%      227µs ± 1%    -4.74%  (p=0.008 n=5+5)
GCD100x100/WithoutXY-8              1.89µs ± 1%     1.88µs ± 2%      ~     (p=0.968 n=5+5)
GCD100x100/WithXY-8                 26.7µs ± 1%     27.0µs ± 1%    +1.44%  (p=0.032 n=5+5)
GCD100x1000/WithoutXY-8             4.48µs ± 1%     4.45µs ± 2%      ~     (p=0.341 n=5+5)
GCD100x1000/WithXY-8                36.3µs ± 1%     35.1µs ± 1%    -3.27%  (p=0.008 n=5+5)
GCD100x10000/WithoutXY-8            22.8µs ± 0%     22.7µs ± 1%      ~     (p=0.056 n=5+5)
GCD100x10000/WithXY-8                145µs ± 1%      133µs ± 1%    -8.33%  (p=0.008 n=5+5)
GCD100x100000/WithoutXY-8            198µs ± 0%      195µs ± 0%    -1.56%  (p=0.008 n=5+5)
GCD100x100000/WithXY-8              1.11ms ± 0%     1.00ms ± 0%   -10.04%  (p=0.008 n=5+5)
GCD1000x1000/WithoutXY-8            25.2µs ± 1%     24.8µs ± 1%    -1.63%  (p=0.016 n=5+5)
GCD1000x1000/WithXY-8                513µs ± 0%      517µs ± 2%      ~     (p=0.421 n=5+5)
GCD1000x10000/WithoutXY-8           57.0µs ± 0%     52.7µs ± 1%    -7.56%  (p=0.008 n=5+5)
GCD1000x10000/WithXY-8              1.20ms ± 0%     1.10ms ± 0%    -8.70%  (p=0.008 n=5+5)
GCD1000x100000/WithoutXY-8           358µs ± 0%      318µs ± 1%   -11.03%  (p=0.008 n=5+5)
GCD1000x100000/WithXY-8             8.71ms ± 0%     7.65ms ± 0%   -12.19%  (p=0.008 n=5+5)
GCD10000x10000/WithoutXY-8           690µs ± 0%      630µs ± 0%    -8.71%  (p=0.008 n=5+5)
GCD10000x10000/WithXY-8             16.0ms ± 1%     14.9ms ± 0%    -6.85%  (p=0.008 n=5+5)
GCD10000x100000/WithoutXY-8         2.09ms ± 0%     1.75ms ± 0%   -16.09%  (p=0.016 n=5+4)
GCD10000x100000/WithXY-8            86.8ms ± 0%     76.3ms ± 0%   -12.09%  (p=0.008 n=5+5)
GCD100000x100000/WithoutXY-8        51.1ms ± 0%     46.0ms ± 0%    -9.97%  (p=0.008 n=5+5)
GCD100000x100000/WithXY-8            1.25s ± 0%      1.15s ± 0%    -7.92%  (p=0.008 n=5+5)
Hilbert-8                           2.45ms ± 1%     2.49ms ± 1%    +1.99%  (p=0.008 n=5+5)
Binomial-8                          4.98µs ± 3%     4.90µs ± 2%      ~     (p=0.421 n=5+5)
QuoRem-8                            7.10µs ± 0%     6.21µs ± 0%   -12.55%  (p=0.016 n=5+4)
Exp-8                                161ms ± 0%      161ms ± 0%      ~     (p=0.421 n=5+5)
Exp2-8                               161ms ± 0%      161ms ± 0%      ~     (p=0.151 n=5+5)
Bitset-8                            40.4ns ± 0%     40.3ns ± 0%      ~     (p=0.190 n=5+5)
BitsetNeg-8                          163ns ± 3%      137ns ± 2%   -15.91%  (p=0.008 n=5+5)
BitsetOrig-8                         377ns ± 1%      372ns ± 1%    -1.22%  (p=0.024 n=5+5)
BitsetNegOrig-8                      631ns ± 1%      605ns ± 1%    -4.09%  (p=0.008 n=5+5)
ModSqrt225_Tonelli-8                7.26ms ± 0%     7.26ms ± 0%      ~     (p=0.548 n=5+5)
ModSqrt224_3Mod4-8                  2.24ms ± 0%     2.24ms ± 0%      ~     (p=1.000 n=5+5)
ModSqrt5430_Tonelli-8                62.4s ± 0%      62.4s ± 0%      ~     (p=0.841 n=5+5)
ModSqrt5430_3Mod4-8                  20.8s ± 0%      20.7s ± 0%      ~     (p=0.056 n=5+5)
Sqrt-8                               101µs ± 0%       89µs ± 0%   -12.17%  (p=0.008 n=5+5)
IntSqr/1-8                          32.5ns ± 1%     32.7ns ± 1%      ~     (p=0.056 n=5+5)
IntSqr/2-8                           160ns ± 5%      158ns ± 0%      ~     (p=0.397 n=5+4)
IntSqr/3-8                           298ns ± 4%      296ns ± 4%      ~     (p=0.667 n=5+5)
IntSqr/5-8                           737ns ± 5%      761ns ± 3%    +3.34%  (p=0.016 n=5+5)
IntSqr/8-8                          1.87µs ± 4%     1.90µs ± 3%      ~     (p=0.222 n=5+5)
IntSqr/10-8                         2.96µs ± 4%     2.92µs ± 6%      ~     (p=0.310 n=5+5)
IntSqr/20-8                         6.28µs ± 3%     6.21µs ± 2%      ~     (p=0.310 n=5+5)
IntSqr/30-8                         14.0µs ± 2%     13.9µs ± 2%      ~     (p=0.548 n=5+5)
IntSqr/50-8                         37.7µs ± 3%     38.3µs ± 2%      ~     (p=0.095 n=5+5)
IntSqr/80-8                         95.9µs ± 2%     95.1µs ± 1%      ~     (p=0.310 n=5+5)
IntSqr/100-8                         148µs ± 1%      148µs ± 1%      ~     (p=0.841 n=5+5)
IntSqr/200-8                         586µs ± 1%      587µs ± 1%      ~     (p=1.000 n=5+5)
IntSqr/300-8                        1.32ms ± 0%     1.31ms ± 1%    -0.73%  (p=0.032 n=5+5)
IntSqr/500-8                        2.48ms ± 0%     2.45ms ± 0%    -1.15%  (p=0.008 n=5+5)
IntSqr/800-8                        4.68ms ± 0%     4.62ms ± 0%    -1.23%  (p=0.008 n=5+5)
IntSqr/1000-8                       7.57ms ± 0%     7.50ms ± 0%    -0.84%  (p=0.008 n=5+5)
Mul-8                                311ms ± 0%      308ms ± 0%    -0.81%  (p=0.008 n=5+5)
Exp3Power/0x10-8                     574ns ± 1%      578ns ± 2%      ~     (p=0.500 n=5+5)
Exp3Power/0x40-8                     640ns ± 1%      646ns ± 0%      ~     (p=0.056 n=5+5)
Exp3Power/0x100-8                   1.42µs ± 1%     1.42µs ± 1%      ~     (p=0.246 n=5+5)
Exp3Power/0x400-8                   8.30µs ± 1%     8.29µs ± 1%      ~     (p=0.802 n=5+5)
Exp3Power/0x1000-8                  60.0µs ± 0%     59.9µs ± 0%    -0.24%  (p=0.016 n=5+5)
Exp3Power/0x4000-8                   817µs ± 0%      816µs ± 0%    -0.17%  (p=0.008 n=5+5)
Exp3Power/0x10000-8                 7.80ms ± 1%     7.70ms ± 0%    -1.23%  (p=0.008 n=5+5)
Exp3Power/0x40000-8                 73.4ms ± 0%     72.5ms ± 0%    -1.28%  (p=0.008 n=5+5)
Exp3Power/0x100000-8                 665ms ± 0%      656ms ± 0%    -1.34%  (p=0.008 n=5+5)
Exp3Power/0x400000-8                 5.99s ± 0%      5.90s ± 0%    -1.40%  (p=0.008 n=5+5)
Fibo-8                               116ms ± 0%       50ms ± 0%   -57.09%  (p=0.008 n=5+5)
NatSqr/1-8                           112ns ± 4%      112ns ± 2%      ~     (p=0.968 n=5+5)
NatSqr/2-8                           251ns ± 2%      250ns ± 1%      ~     (p=0.571 n=5+5)
NatSqr/3-8                           378ns ± 2%      379ns ± 2%      ~     (p=0.794 n=5+5)
NatSqr/5-8                           829ns ± 3%      827ns ± 2%      ~     (p=1.000 n=5+5)
NatSqr/8-8                          1.97µs ± 2%     1.95µs ± 2%      ~     (p=0.310 n=5+5)
NatSqr/10-8                         3.02µs ± 2%     2.99µs ± 2%      ~     (p=0.421 n=5+5)
NatSqr/20-8                         6.51µs ± 2%     6.49µs ± 1%      ~     (p=0.841 n=5+5)
NatSqr/30-8                         14.1µs ± 2%     14.0µs ± 2%      ~     (p=0.841 n=5+5)
NatSqr/50-8                         38.1µs ± 2%     38.3µs ± 3%      ~     (p=0.690 n=5+5)
NatSqr/80-8                         95.5µs ± 2%     96.0µs ± 1%      ~     (p=0.421 n=5+5)
NatSqr/100-8                         150µs ± 1%      148µs ± 2%      ~     (p=0.095 n=5+5)
NatSqr/200-8                         588µs ± 1%      590µs ± 1%      ~     (p=0.421 n=5+5)
NatSqr/300-8                        1.32ms ± 1%     1.31ms ± 1%      ~     (p=0.841 n=5+5)
NatSqr/500-8                        2.50ms ± 0%     2.47ms ± 0%    -1.03%  (p=0.008 n=5+5)
NatSqr/800-8                        4.70ms ± 0%     4.64ms ± 0%    -1.31%  (p=0.008 n=5+5)
NatSqr/1000-8                       7.60ms ± 0%     7.52ms ± 0%    -1.01%  (p=0.008 n=5+5)
ScanPi-8                             326µs ± 0%      326µs ± 0%      ~     (p=0.841 n=5+5)
StringPiParallel-8                  70.3µs ± 5%     63.8µs ±10%      ~     (p=0.056 n=5+5)
Scan/10/Base2-8                     1.09µs ± 0%     1.09µs ± 0%      ~     (p=0.317 n=5+5)
Scan/100/Base2-8                    7.79µs ± 0%     7.78µs ± 0%      ~     (p=0.063 n=5+5)
Scan/1000/Base2-8                   79.0µs ± 0%     78.9µs ± 0%    -0.18%  (p=0.008 n=5+5)
Scan/10000/Base2-8                  1.22ms ± 0%     1.22ms ± 0%    -0.15%  (p=0.008 n=5+5)
Scan/100000/Base2-8                 55.1ms ± 0%     55.2ms ± 0%    +0.20%  (p=0.008 n=5+5)
Scan/10/Base8-8                      512ns ± 0%      512ns ± 1%      ~     (p=0.810 n=5+5)
Scan/100/Base8-8                    2.89µs ± 0%     2.89µs ± 0%      ~     (p=0.810 n=5+5)
Scan/1000/Base8-8                   31.0µs ± 0%     31.0µs ± 0%      ~     (p=0.151 n=5+5)
Scan/10000/Base8-8                   740µs ± 0%      741µs ± 0%    +0.10%  (p=0.008 n=5+5)
Scan/100000/Base8-8                 50.6ms ± 0%     50.6ms ± 0%    +0.08%  (p=0.008 n=5+5)
Scan/10/Base10-8                     487ns ± 0%      487ns ± 0%      ~     (p=0.571 n=5+5)
Scan/100/Base10-8                   2.67µs ± 0%     2.67µs ± 0%      ~     (p=0.810 n=5+5)
Scan/1000/Base10-8                  28.7µs ± 0%     28.7µs ± 0%    +0.06%  (p=0.008 n=5+5)
Scan/10000/Base10-8                  716µs ± 0%      717µs ± 0%      ~     (p=0.222 n=5+5)
Scan/100000/Base10-8                50.3ms ± 0%     50.3ms ± 0%    +0.10%  (p=0.008 n=5+5)
Scan/10/Base16-8                     438ns ± 0%      437ns ± 1%      ~     (p=0.786 n=5+5)
Scan/100/Base16-8                   2.47µs ± 0%     2.47µs ± 0%    -0.19%  (p=0.048 n=5+5)
Scan/1000/Base16-8                  27.2µs ± 0%     27.3µs ± 0%      ~     (p=0.087 n=5+5)
Scan/10000/Base16-8                  722µs ± 0%      722µs ± 0%    +0.11%  (p=0.008 n=5+5)
Scan/100000/Base16-8                52.6ms ± 0%     52.7ms ± 0%    +0.15%  (p=0.008 n=5+5)
String/10/Base2-8                    247ns ± 2%      248ns ± 1%      ~     (p=0.437 n=5+5)
String/100/Base2-8                  1.51µs ± 0%     1.51µs ± 0%    -0.37%  (p=0.024 n=5+5)
String/1000/Base2-8                 13.6µs ± 1%     13.5µs ± 0%      ~     (p=0.095 n=5+5)
String/10000/Base2-8                 135µs ± 0%      135µs ± 1%      ~     (p=0.841 n=5+5)
String/100000/Base2-8               1.32ms ± 1%     1.32ms ± 1%      ~     (p=0.690 n=5+5)
String/10/Base8-8                    169ns ± 1%      169ns ± 1%      ~     (p=1.000 n=5+5)
String/100/Base8-8                   636ns ± 0%      634ns ± 1%      ~     (p=0.413 n=5+5)
String/1000/Base8-8                 5.33µs ± 1%     5.32µs ± 0%      ~     (p=0.222 n=5+5)
String/10000/Base8-8                50.9µs ± 1%     50.7µs ± 0%      ~     (p=0.151 n=5+5)
String/100000/Base8-8                500µs ± 1%      497µs ± 0%      ~     (p=0.421 n=5+5)
String/10/Base10-8                   516ns ± 1%      513ns ± 0%    -0.62%  (p=0.016 n=5+4)
String/100/Base10-8                 1.97µs ± 0%     1.96µs ± 0%      ~     (p=0.667 n=4+5)
String/1000/Base10-8                12.5µs ± 0%     11.5µs ± 0%    -7.92%  (p=0.008 n=5+5)
String/10000/Base10-8               57.7µs ± 0%     52.5µs ± 0%    -8.93%  (p=0.008 n=5+5)
String/100000/Base10-8              25.6ms ± 0%     21.6ms ± 0%   -15.94%  (p=0.008 n=5+5)
String/10/Base16-8                   150ns ± 1%      149ns ± 0%      ~     (p=0.413 n=5+4)
String/100/Base16-8                  514ns ± 1%      514ns ± 1%      ~     (p=0.849 n=5+5)
String/1000/Base16-8                4.01µs ± 0%     4.01µs ± 0%      ~     (p=0.421 n=5+5)
String/10000/Base16-8               37.8µs ± 1%     37.8µs ± 1%      ~     (p=0.841 n=5+5)
String/100000/Base16-8               373µs ± 2%      373µs ± 0%      ~     (p=0.421 n=5+5)
LeafSize/0-8                        6.63ms ± 0%     6.63ms ± 0%      ~     (p=0.730 n=4+5)
LeafSize/1-8                        74.0µs ± 0%     67.7µs ± 1%    -8.53%  (p=0.008 n=5+5)
LeafSize/2-8                        74.2µs ± 0%     68.3µs ± 1%    -7.99%  (p=0.008 n=5+5)
LeafSize/3-8                         379µs ± 0%      309µs ± 0%   -18.52%  (p=0.008 n=5+5)
LeafSize/4-8                        72.7µs ± 1%     66.7µs ± 0%    -8.37%  (p=0.008 n=5+5)
LeafSize/5-8                         471µs ± 0%      384µs ± 0%   -18.55%  (p=0.008 n=5+5)
LeafSize/6-8                         378µs ± 0%      308µs ± 0%   -18.59%  (p=0.008 n=5+5)
LeafSize/7-8                         245µs ± 0%      204µs ± 1%   -16.75%  (p=0.008 n=5+5)
LeafSize/8-8                        73.4µs ± 0%     66.9µs ± 1%    -8.79%  (p=0.008 n=5+5)
LeafSize/9-8                         538µs ± 0%      437µs ± 0%   -18.75%  (p=0.008 n=5+5)
LeafSize/10-8                        472µs ± 0%      396µs ± 1%   -16.01%  (p=0.008 n=5+5)
LeafSize/11-8                        460µs ± 0%      374µs ± 0%   -18.58%  (p=0.008 n=5+5)
LeafSize/12-8                        378µs ± 0%      308µs ± 0%   -18.38%  (p=0.008 n=5+5)
LeafSize/13-8                        343µs ± 0%      284µs ± 0%   -17.30%  (p=0.008 n=5+5)
LeafSize/14-8                        248µs ± 0%      206µs ± 0%   -16.94%  (p=0.008 n=5+5)
LeafSize/15-8                        169µs ± 0%      144µs ± 0%   -14.69%  (p=0.008 n=5+5)
LeafSize/16-8                       72.9µs ± 0%     66.8µs ± 1%    -8.27%  (p=0.008 n=5+5)
LeafSize/32-8                       82.5µs ± 0%     76.7µs ± 0%    -7.04%  (p=0.008 n=5+5)
LeafSize/64-8                        134µs ± 0%      129µs ± 0%    -3.80%  (p=0.008 n=5+5)
ProbablyPrime/n=0-8                 44.2ms ± 0%     43.4ms ± 0%    -1.95%  (p=0.008 n=5+5)
ProbablyPrime/n=1-8                 64.9ms ± 0%     64.0ms ± 0%    -1.27%  (p=0.008 n=5+5)
ProbablyPrime/n=5-8                  147ms ± 0%      146ms ± 0%    -0.58%  (p=0.008 n=5+5)
ProbablyPrime/n=10-8                 250ms ± 0%      249ms ± 0%    -0.35%  (p=0.008 n=5+5)
ProbablyPrime/n=20-8                 456ms ± 0%      455ms ± 0%    -0.18%  (p=0.008 n=5+5)
ProbablyPrime/Lucas-8               23.6ms ± 0%     22.7ms ± 0%    -3.74%  (p=0.008 n=5+5)
ProbablyPrime/MillerRabinBase2-8    20.7ms ± 0%     20.6ms ± 0%      ~     (p=0.421 n=5+5)
FloatSqrt/64-8                      2.25µs ± 1%     2.29µs ± 0%    +1.48%  (p=0.008 n=5+5)
FloatSqrt/128-8                     4.86µs ± 1%     4.92µs ± 1%    +1.21%  (p=0.032 n=5+5)
FloatSqrt/256-8                     13.6µs ± 0%     13.7µs ± 1%    +1.31%  (p=0.032 n=5+5)
FloatSqrt/1000-8                    70.0µs ± 1%     70.1µs ± 0%      ~     (p=0.690 n=5+5)
FloatSqrt/10000-8                   1.92ms ± 0%     1.90ms ± 0%    -0.59%  (p=0.008 n=5+5)
FloatSqrt/100000-8                  55.3ms ± 0%     54.8ms ± 0%    -1.01%  (p=0.008 n=5+5)
FloatSqrt/1000000-8                  4.56s ± 0%      4.50s ± 0%    -1.28%  (p=0.008 n=5+5)

name                              old speed      new speed       delta
AddVV/1-8                         2.97GB/s ± 0%   5.56GB/s ± 0%   +86.85%  (p=0.008 n=5+5)
AddVV/2-8                         9.47GB/s ± 0%  10.66GB/s ± 0%   +12.50%  (p=0.008 n=5+5)
AddVV/3-8                         12.4GB/s ± 0%   14.7GB/s ± 0%   +19.10%  (p=0.008 n=5+5)
AddVV/4-8                         14.6GB/s ± 0%   18.9GB/s ± 0%   +29.63%  (p=0.016 n=4+5)
AddVV/5-8                         16.4GB/s ± 0%   22.0GB/s ± 0%   +34.47%  (p=0.016 n=5+4)
AddVV/10-8                        21.7GB/s ± 0%   35.5GB/s ± 0%   +63.89%  (p=0.008 n=5+5)
AddVV/100-8                       29.4GB/s ± 0%   68.0GB/s ± 0%  +131.38%  (p=0.008 n=5+5)
AddVV/1000-8                      31.7GB/s ± 0%   61.9GB/s ± 0%   +95.43%  (p=0.008 n=5+5)
AddVV/10000-8                     31.2GB/s ± 0%   56.4GB/s ± 0%   +80.83%  (p=0.008 n=5+5)
AddVV/100000-8                    25.9GB/s ± 3%   41.4GB/s ± 0%   +59.98%  (p=0.008 n=5+5)
SubVV/1-8                         2.97GB/s ± 0%   5.56GB/s ± 0%   +86.97%  (p=0.016 n=4+5)
SubVV/2-8                         9.47GB/s ± 0%  10.66GB/s ± 0%   +12.51%  (p=0.008 n=5+5)
SubVV/3-8                         12.4GB/s ± 0%   14.8GB/s ± 0%   +19.23%  (p=0.016 n=4+5)
SubVV/4-8                         14.6GB/s ± 0%   18.9GB/s ± 0%   +29.56%  (p=0.008 n=5+5)
SubVV/5-8                         16.4GB/s ± 0%   22.0GB/s ± 0%   +34.47%  (p=0.016 n=4+5)
SubVV/10-8                        21.7GB/s ± 0%   35.5GB/s ± 0%   +63.89%  (p=0.008 n=5+5)
SubVV/100-8                       29.4GB/s ± 0%   68.0GB/s ± 0%  +131.38%  (p=0.008 n=5+5)
SubVV/1000-8                      31.6GB/s ± 0%   80.1GB/s ± 0%  +153.08%  (p=0.008 n=5+5)
SubVV/10000-8                     31.2GB/s ± 0%   56.7GB/s ± 0%   +81.79%  (p=0.008 n=5+5)
SubVV/100000-8                    29.1GB/s ±10%   29.0GB/s ±18%      ~     (p=0.690 n=5+5)
AddVW/1-8                          859MB/s ± 0%    859MB/s ± 0%    -0.01%  (p=0.008 n=5+5)
AddVW/2-8                          811MB/s ± 1%    814MB/s ± 0%      ~     (p=0.413 n=5+4)
AddVW/3-8                         2.08GB/s ± 0%   2.08GB/s ± 0%      ~     (p=0.206 n=5+5)
AddVW/4-8                         2.46GB/s ± 0%   2.46GB/s ± 0%      ~     (p=0.056 n=5+5)
AddVW/5-8                         2.75GB/s ± 0%   2.75GB/s ± 0%      ~     (p=0.508 n=5+5)
AddVW/10-8                        3.63GB/s ± 0%   3.63GB/s ± 0%      ~     (p=0.214 n=5+5)
AddVW/100-8                       4.79GB/s ± 0%   4.79GB/s ± 0%      ~     (p=0.500 n=5+5)
AddVW/1000-8                      5.27GB/s ± 0%   5.25GB/s ± 0%    -0.43%  (p=0.008 n=5+5)
AddVW/10000-8                     5.30GB/s ± 0%   5.30GB/s ± 0%      ~     (p=0.397 n=5+5)
AddVW/100000-8                    5.27GB/s ± 1%   5.25GB/s ± 1%      ~     (p=0.690 n=5+5)
AddMulVVW/1-8                     1.92GB/s ± 0%   1.96GB/s ± 1%    +1.95%  (p=0.008 n=5+5)
AddMulVVW/2-8                     2.16GB/s ± 1%   2.25GB/s ± 1%    +4.32%  (p=0.008 n=5+5)
AddMulVVW/3-8                     2.39GB/s ± 1%   2.25GB/s ± 3%    -5.79%  (p=0.008 n=5+5)
AddMulVVW/4-8                     2.00GB/s ± 0%   2.31GB/s ± 1%   +15.31%  (p=0.008 n=5+5)
AddMulVVW/5-8                     2.22GB/s ± 0%   2.14GB/s ± 0%    -3.86%  (p=0.008 n=5+5)
AddMulVVW/10-8                    2.15GB/s ± 1%   2.25GB/s ± 0%    +5.03%  (p=0.008 n=5+5)
AddMulVVW/100-8                   2.09GB/s ± 0%   2.14GB/s ± 0%    +2.25%  (p=0.008 n=5+5)
AddMulVVW/1000-8                  2.04GB/s ± 0%   2.38GB/s ± 0%   +16.52%  (p=0.008 n=5+5)
AddMulVVW/10000-8                 2.03GB/s ± 0%   2.10GB/s ± 0%    +3.64%  (p=0.008 n=5+5)
AddMulVVW/100000-8                2.02GB/s ± 0%   2.02GB/s ± 1%      ~     (p=0.690 n=5+5)

Change-Id: Ie482d67a7dbb5af6f5d81af2b3d9d14bd66336db
Reviewed-on: https://go-review.googlesource.com/77831
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2018-03-06 00:22:08 +00:00
Ilya Tocar
c15984c6c6 math: remove unused variable
useSSE41 was used inside asm implementation of floor to select between base and ss4 code path.
We intrinsified floor and left asm functions as a backup for non-sse4 systems.
This made variable unused, so remove it.

Change-Id: Ia2633de7c7cb1ef1d5b15a2366b523e481b722d9
Reviewed-on: https://go-review.googlesource.com/97935
Run-TryBot: Ilya Tocar <ilya.tocar@intel.com>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2018-03-01 18:51:44 +00:00
erifan01
ed6c6c9c11 math: optimize sinh and cosh
Improve performance by reducing unnecessary function calls

Benchmarks:

Tme    old time/op  new time/op  delta
Cosh-8   229ns ± 0%   138ns ± 0%  -39.74%  (p=0.008 n=5+5)
Sinh-8   231ns ± 0%   139ns ± 0%  -39.83%  (p=0.008 n=5+5)

Change-Id: Icab5485849bbfaafca8429d06b67c558101f4f3c
Reviewed-on: https://go-review.googlesource.com/85477
Reviewed-by: Robert Griesemer <gri@golang.org>
2018-02-27 04:34:37 +00:00
Ilya Tocar
c3935c08d2 math/big: speed-up addMulVVW on amd64
Use MULX/ADOX/ADCX instructions to speed-up addMulVVW,
when they are available. addMulVVW is a hotspot in rsa.
This is faster than ADD/ADC/IMUL version, because ADOX/ADCX only
modify carry/overflow flag, so they can be interleaved with each other
and with MULX, which doesn't modify flags at all.
Increasing unroll factor to e. g. 16 makes rsa 1% faster, but 3PrimeRSA2048Decrypt
performance falls back to baseline.

Updates #20058

AddMulVVW/1-8                       3.28ns ± 2%     3.26ns ± 3%     ~     (p=0.107 n=10+10)
AddMulVVW/2-8                       4.26ns ± 2%     4.24ns ± 3%     ~     (p=0.327 n=9+9)
AddMulVVW/3-8                       5.07ns ± 2%     5.26ns ± 2%   +3.73%  (p=0.000 n=10+10)
AddMulVVW/4-8                       6.40ns ± 2%     6.50ns ± 2%   +1.61%  (p=0.000 n=10+10)
AddMulVVW/5-8                       6.77ns ± 2%     6.86ns ± 1%   +1.38%  (p=0.001 n=9+9)
AddMulVVW/10-8                      12.2ns ± 2%     10.6ns ± 3%  -13.65%  (p=0.000 n=10+10)
AddMulVVW/100-8                     79.7ns ± 2%     52.4ns ± 1%  -34.17%  (p=0.000 n=10+10)
AddMulVVW/1000-8                     695ns ± 1%      491ns ± 2%  -29.39%  (p=0.000 n=9+10)
AddMulVVW/10000-8                   7.26µs ± 2%     5.92µs ± 6%  -18.42%  (p=0.000 n=10+10)
AddMulVVW/100000-8                  72.6µs ± 2%     62.2µs ± 2%  -14.31%  (p=0.000 n=10+10)

crypto/rsa speed-up is smaller, but stil noticeable:

RSA2048Decrypt-8        1.61ms ± 1%  1.38ms ± 1%  -14.13%  (p=0.000 n=10+10)
RSA2048Sign-8           1.93ms ± 1%  1.70ms ± 1%  -11.86%  (p=0.000 n=10+10)
3PrimeRSA2048Decrypt-8   932µs ± 0%   828µs ± 0%  -11.15%  (p=0.000 n=10+10)

Results on crypto/tls:

HandshakeServer/RSA-8                        901µs ± 1%    777µs ± 0%  -13.70%  (p=0.000 n=10+8)
HandshakeServer/ECDHE-P256-RSA-8            1.01ms ± 1%   0.90ms ± 0%  -11.53%  (p=0.000 n=10+9)

Full math/big benchmarks:

name                              old time/op    new time/op     delta
AddVV/1-8                           3.74ns ± 6%     3.55ns ± 2%     ~     (p=0.082 n=10+8)
AddVV/2-8                           3.96ns ± 2%     3.98ns ± 5%     ~     (p=0.794 n=10+9)
AddVV/3-8                           4.97ns ± 2%     4.94ns ± 1%     ~     (p=0.081 n=10+9)
AddVV/4-8                           5.59ns ± 2%     5.59ns ± 2%     ~     (p=0.809 n=10+10)
AddVV/5-8                           6.63ns ± 1%     6.62ns ± 1%     ~     (p=0.560 n=9+10)
AddVV/10-8                          8.11ns ± 1%     8.11ns ± 2%     ~     (p=0.402 n=10+10)
AddVV/100-8                         46.9ns ± 2%     46.8ns ± 1%     ~     (p=0.809 n=10+10)
AddVV/1000-8                         389ns ± 1%      391ns ± 4%     ~     (p=0.809 n=10+10)
AddVV/10000-8                       5.05µs ± 5%     4.98µs ± 2%     ~     (p=0.113 n=9+10)
AddVV/100000-8                      55.3µs ± 3%     55.2µs ± 3%     ~     (p=0.796 n=10+10)
AddVW/1-8                           3.04ns ± 3%     3.02ns ± 3%     ~     (p=0.538 n=10+10)
AddVW/2-8                           3.57ns ± 2%     3.61ns ± 2%   +1.12%  (p=0.032 n=9+9)
AddVW/3-8                           3.77ns ± 1%     3.79ns ± 2%     ~     (p=0.719 n=10+10)
AddVW/4-8                           4.69ns ± 1%     4.69ns ± 2%     ~     (p=0.920 n=10+9)
AddVW/5-8                           4.58ns ± 1%     4.58ns ± 1%     ~     (p=0.812 n=10+10)
AddVW/10-8                          7.62ns ± 2%     7.63ns ± 1%     ~     (p=0.926 n=10+10)
AddVW/100-8                         41.1ns ± 2%     42.4ns ± 3%   +3.34%  (p=0.000 n=10+10)
AddVW/1000-8                         386ns ± 2%      389ns ± 4%     ~     (p=0.514 n=10+10)
AddVW/10000-8                       3.88µs ± 3%     3.87µs ± 3%     ~     (p=0.448 n=10+10)
AddVW/100000-8                      41.2µs ± 3%     41.7µs ± 3%     ~     (p=0.148 n=10+10)
AddMulVVW/1-8                       3.28ns ± 2%     3.26ns ± 3%     ~     (p=0.107 n=10+10)
AddMulVVW/2-8                       4.26ns ± 2%     4.24ns ± 3%     ~     (p=0.327 n=9+9)
AddMulVVW/3-8                       5.07ns ± 2%     5.26ns ± 2%   +3.73%  (p=0.000 n=10+10)
AddMulVVW/4-8                       6.40ns ± 2%     6.50ns ± 2%   +1.61%  (p=0.000 n=10+10)
AddMulVVW/5-8                       6.77ns ± 2%     6.86ns ± 1%   +1.38%  (p=0.001 n=9+9)
AddMulVVW/10-8                      12.2ns ± 2%     10.6ns ± 3%  -13.65%  (p=0.000 n=10+10)
AddMulVVW/100-8                     79.7ns ± 2%     52.4ns ± 1%  -34.17%  (p=0.000 n=10+10)
AddMulVVW/1000-8                     695ns ± 1%      491ns ± 2%  -29.39%  (p=0.000 n=9+10)
AddMulVVW/10000-8                   7.26µs ± 2%     5.92µs ± 6%  -18.42%  (p=0.000 n=10+10)
AddMulVVW/100000-8                  72.6µs ± 2%     62.2µs ± 2%  -14.31%  (p=0.000 n=10+10)
DecimalConversion-8                  108µs ±19%      104µs ± 4%     ~     (p=0.460 n=10+8)
FloatString/100-8                    926ns ±14%      908ns ± 5%     ~     (p=0.398 n=9+9)
FloatString/1000-8                  25.7µs ± 1%     25.7µs ± 1%     ~     (p=0.739 n=10+10)
FloatString/10000-8                 2.13ms ± 1%     2.12ms ± 1%     ~     (p=0.353 n=10+10)
FloatString/100000-8                 207ms ± 1%      206ms ± 2%     ~     (p=0.912 n=10+10)
FloatAdd/10-8                       61.3ns ± 3%     61.9ns ± 3%     ~     (p=0.183 n=10+10)
FloatAdd/100-8                      62.0ns ± 2%     62.9ns ± 4%     ~     (p=0.118 n=10+10)
FloatAdd/1000-8                     84.7ns ± 2%     84.4ns ± 1%     ~     (p=0.591 n=10+10)
FloatAdd/10000-8                     305ns ± 2%      306ns ± 1%     ~     (p=0.443 n=10+10)
FloatAdd/100000-8                   2.45µs ± 1%     2.46µs ± 1%     ~     (p=0.782 n=10+10)
FloatSub/10-8                       56.8ns ± 4%     56.5ns ± 5%     ~     (p=0.423 n=10+10)
FloatSub/100-8                      57.3ns ± 4%     57.1ns ± 5%     ~     (p=0.540 n=10+10)
FloatSub/1000-8                     66.8ns ± 4%     66.6ns ± 1%     ~     (p=0.868 n=10+10)
FloatSub/10000-8                     199ns ± 1%      198ns ± 1%     ~     (p=0.287 n=10+9)
FloatSub/100000-8                   1.47µs ± 2%     1.47µs ± 2%     ~     (p=0.920 n=10+9)
ParseFloatSmallExp-8                8.74µs ±10%     9.48µs ±10%   +8.51%  (p=0.010 n=9+10)
ParseFloatLargeExp-8                39.2µs ±25%     39.6µs ±12%     ~     (p=0.529 n=10+10)
GCD10x10/WithoutXY-8                 173ns ±23%      177ns ±20%     ~     (p=0.698 n=10+10)
GCD10x10/WithXY-8                    736ns ±12%      728ns ±16%     ~     (p=0.838 n=10+10)
GCD10x100/WithoutXY-8                325ns ±16%      326ns ±14%     ~     (p=0.912 n=10+10)
GCD10x100/WithXY-8                  1.14µs ±13%     1.16µs ± 6%     ~     (p=0.287 n=10+9)
GCD10x1000/WithoutXY-8               851ns ±25%      820ns ±12%     ~     (p=0.592 n=10+10)
GCD10x1000/WithXY-8                 2.89µs ±17%     2.85µs ± 5%     ~     (p=1.000 n=10+9)
GCD10x10000/WithoutXY-8             6.66µs ±12%     6.82µs ±19%     ~     (p=0.529 n=10+10)
GCD10x10000/WithXY-8                18.0µs ± 5%     17.2µs ±19%     ~     (p=0.315 n=7+10)
GCD10x100000/WithoutXY-8            77.8µs ±18%     73.3µs ±11%     ~     (p=0.315 n=10+9)
GCD10x100000/WithXY-8                186µs ±14%      204µs ±29%     ~     (p=0.218 n=10+10)
GCD100x100/WithoutXY-8              1.09µs ± 1%     1.09µs ± 2%     ~     (p=0.117 n=9+10)
GCD100x100/WithXY-8                 7.93µs ± 1%     7.97µs ± 1%   +0.52%  (p=0.006 n=10+10)
GCD100x1000/WithoutXY-8             2.00µs ± 3%     2.04µs ± 6%     ~     (p=0.053 n=9+10)
GCD100x1000/WithXY-8                9.23µs ± 1%     9.29µs ± 1%   +0.63%  (p=0.009 n=10+10)
GCD100x10000/WithoutXY-8            10.2µs ±11%      9.7µs ± 6%     ~     (p=0.278 n=10+9)
GCD100x10000/WithXY-8               33.3µs ± 4%     33.6µs ± 4%     ~     (p=0.481 n=10+10)
GCD100x100000/WithoutXY-8            106µs ±17%      105µs ±13%     ~     (p=0.853 n=10+10)
GCD100x100000/WithXY-8               289µs ±17%      276µs ± 8%     ~     (p=0.353 n=10+10)
GCD1000x1000/WithoutXY-8            12.2µs ± 1%     12.1µs ± 1%   -0.45%  (p=0.007 n=10+10)
GCD1000x1000/WithXY-8                131µs ± 1%      132µs ± 0%   +0.93%  (p=0.000 n=9+7)
GCD1000x10000/WithoutXY-8           20.6µs ± 2%     20.6µs ± 1%     ~     (p=0.326 n=10+9)
GCD1000x10000/WithXY-8               238µs ± 1%      237µs ± 1%     ~     (p=0.356 n=9+10)
GCD1000x100000/WithoutXY-8           117µs ± 8%      114µs ±11%     ~     (p=0.190 n=10+10)
GCD1000x100000/WithXY-8             1.51ms ± 1%     1.50ms ± 1%     ~     (p=0.053 n=9+10)
GCD10000x10000/WithoutXY-8           220µs ± 1%      218µs ± 1%   -0.86%  (p=0.000 n=10+10)
GCD10000x10000/WithXY-8             3.04ms ± 0%     3.05ms ± 0%   +0.33%  (p=0.001 n=9+10)
GCD10000x100000/WithoutXY-8          513µs ± 0%      511µs ± 0%   -0.38%  (p=0.000 n=10+10)
GCD10000x100000/WithXY-8            15.1ms ± 0%     15.0ms ± 0%     ~     (p=0.053 n=10+9)
GCD100000x100000/WithoutXY-8        10.4ms ± 1%     10.4ms ± 2%     ~     (p=0.258 n=9+9)
GCD100000x100000/WithXY-8            205ms ± 1%      205ms ± 1%     ~     (p=0.481 n=10+10)
Hilbert-8                           1.25ms ±15%     1.24ms ±17%     ~     (p=0.853 n=10+10)
Binomial-8                          3.03µs ±24%     2.90µs ±16%     ~     (p=0.481 n=10+10)
QuoRem-8                            1.95µs ± 1%     1.95µs ± 2%     ~     (p=0.117 n=9+10)
Exp-8                               5.12ms ± 2%     3.99ms ± 1%  -22.02%  (p=0.000 n=10+9)
Exp2-8                              5.14ms ± 2%     3.98ms ± 0%  -22.55%  (p=0.000 n=10+9)
Bitset-8                            16.4ns ± 2%     16.5ns ± 2%     ~     (p=0.311 n=9+10)
BitsetNeg-8                         46.3ns ± 4%     45.8ns ± 4%     ~     (p=0.272 n=10+10)
BitsetOrig-8                         250ns ±19%      247ns ±14%     ~     (p=0.671 n=10+10)
BitsetNegOrig-8                      416ns ±14%      429ns ±14%     ~     (p=0.353 n=10+10)
ModSqrt225_Tonelli-8                 400µs ± 0%      320µs ± 0%  -19.88%  (p=0.000 n=9+7)
ModSqrt224_3Mod4-8                   123µs ± 1%       97µs ± 0%  -21.21%  (p=0.000 n=9+10)
ModSqrt5430_Tonelli-8                1.87s ± 0%      1.39s ± 1%  -25.70%  (p=0.000 n=9+10)
ModSqrt5430_3Mod4-8                  630ms ± 2%      465ms ± 1%  -26.12%  (p=0.000 n=10+10)
Sqrt-8                              25.8µs ± 1%     25.9µs ± 0%   +0.66%  (p=0.002 n=10+8)
IntSqr/1-8                          11.3ns ± 1%     11.3ns ± 2%     ~     (p=0.360 n=9+10)
IntSqr/2-8                          26.6ns ± 1%     27.4ns ± 2%   +2.87%  (p=0.000 n=8+9)
IntSqr/3-8                          36.5ns ± 6%     36.6ns ± 5%     ~     (p=0.589 n=10+10)
IntSqr/5-8                          57.2ns ± 2%     57.8ns ± 1%   +0.92%  (p=0.045 n=10+9)
IntSqr/8-8                           112ns ± 1%       93ns ± 1%  -16.60%  (p=0.000 n=10+10)
IntSqr/10-8                          148ns ± 1%      129ns ± 5%  -12.85%  (p=0.000 n=10+10)
IntSqr/20-8                          642ns ±28%      692ns ±21%     ~     (p=0.105 n=10+10)
IntSqr/30-8                         1.03µs ±18%     1.06µs ±15%     ~     (p=0.422 n=10+8)
IntSqr/50-8                         2.33µs ±14%     2.14µs ±20%     ~     (p=0.063 n=10+10)
IntSqr/80-8                         4.06µs ±13%     3.72µs ±14%   -8.31%  (p=0.029 n=10+10)
IntSqr/100-8                        5.79µs ±10%     5.20µs ±18%  -10.15%  (p=0.004 n=10+10)
IntSqr/200-8                        17.1µs ± 1%     12.9µs ± 3%  -24.44%  (p=0.000 n=10+10)
IntSqr/300-8                        35.9µs ± 0%     26.6µs ± 1%  -25.75%  (p=0.000 n=10+10)
IntSqr/500-8                        84.9µs ± 0%     71.7µs ± 1%  -15.49%  (p=0.000 n=10+10)
IntSqr/800-8                         170µs ± 1%      142µs ± 2%  -16.73%  (p=0.000 n=10+10)
IntSqr/1000-8                        258µs ± 1%      218µs ± 1%  -15.65%  (p=0.000 n=10+10)
Mul-8                               10.4ms ± 1%      8.3ms ± 0%  -20.05%  (p=0.000 n=10+9)
Exp3Power/0x10-8                     311ns ±15%      321ns ±24%     ~     (p=0.447 n=10+10)
Exp3Power/0x40-8                     358ns ±21%      346ns ±37%     ~     (p=0.591 n=10+10)
Exp3Power/0x100-8                    611ns ±19%      570ns ±27%     ~     (p=0.393 n=10+10)
Exp3Power/0x400-8                   1.31µs ±26%     1.34µs ±19%     ~     (p=0.853 n=10+10)
Exp3Power/0x1000-8                  6.76µs ±23%     6.22µs ±16%     ~     (p=0.095 n=10+9)
Exp3Power/0x4000-8                  37.6µs ±14%     36.4µs ±21%     ~     (p=0.247 n=10+10)
Exp3Power/0x10000-8                  345µs ±14%      310µs ±11%   -9.99%  (p=0.005 n=10+10)
Exp3Power/0x40000-8                 2.77ms ± 1%     2.34ms ± 1%  -15.47%  (p=0.000 n=10+10)
Exp3Power/0x100000-8                25.1ms ± 1%     21.3ms ± 1%  -15.26%  (p=0.000 n=10+10)
Exp3Power/0x400000-8                 225ms ± 1%      190ms ± 1%  -15.61%  (p=0.000 n=10+10)
Fibo-8                              23.4ms ± 1%     23.3ms ± 0%     ~     (p=0.052 n=10+10)
NatSqr/1-8                          58.4ns ±24%     59.8ns ±38%     ~     (p=0.739 n=10+10)
NatSqr/2-8                           122ns ±21%      122ns ±16%     ~     (p=0.896 n=10+10)
NatSqr/3-8                           140ns ±28%      148ns ±30%     ~     (p=0.288 n=10+10)
NatSqr/5-8                           193ns ±29%      210ns ±34%     ~     (p=0.469 n=10+10)
NatSqr/8-8                           317ns ±21%      296ns ±25%     ~     (p=0.393 n=10+10)
NatSqr/10-8                          362ns ± 8%      373ns ±30%     ~     (p=0.617 n=9+10)
NatSqr/20-8                         1.24µs ±16%     1.06µs ±29%  -14.57%  (p=0.019 n=10+10)
NatSqr/30-8                         1.90µs ±32%     1.71µs ±10%     ~     (p=0.176 n=10+9)
NatSqr/50-8                         4.22µs ±19%     3.67µs ± 7%  -13.03%  (p=0.017 n=10+9)
NatSqr/80-8                         7.33µs ±20%     6.50µs ±15%  -11.26%  (p=0.009 n=10+10)
NatSqr/100-8                        9.84µs ±18%     9.33µs ± 8%     ~     (p=0.280 n=10+10)
NatSqr/200-8                        21.4µs ± 7%     20.0µs ±14%     ~     (p=0.075 n=10+10)
NatSqr/300-8                        38.0µs ± 2%     31.3µs ±10%  -17.63%  (p=0.000 n=10+10)
NatSqr/500-8                         102µs ± 5%      101µs ± 4%     ~     (p=0.780 n=9+10)
NatSqr/800-8                         190µs ± 3%      166µs ± 6%  -12.29%  (p=0.000 n=10+10)
NatSqr/1000-8                        277µs ± 2%      245µs ± 6%  -11.64%  (p=0.000 n=10+10)
ScanPi-8                             144µs ±23%      149µs ±24%     ~     (p=0.579 n=10+10)
StringPiParallel-8                  25.6µs ± 0%     25.8µs ± 0%   +0.69%  (p=0.000 n=9+10)
Scan/10/Base2-8                      305ns ± 1%      309ns ± 1%   +1.32%  (p=0.000 n=10+9)
Scan/100/Base2-8                    1.95µs ± 1%     1.98µs ± 1%   +1.10%  (p=0.000 n=10+10)
Scan/1000/Base2-8                   19.5µs ± 1%     19.7µs ± 1%   +1.39%  (p=0.000 n=10+10)
Scan/10000/Base2-8                   270µs ± 1%      272µs ± 1%   +0.58%  (p=0.024 n=9+9)
Scan/100000/Base2-8                 10.3ms ± 0%     10.3ms ± 0%   +0.16%  (p=0.022 n=9+10)
Scan/10/Base8-8                      146ns ± 4%      154ns ± 4%   +5.57%  (p=0.000 n=9+9)
Scan/100/Base8-8                     748ns ± 1%      759ns ± 1%   +1.51%  (p=0.000 n=9+10)
Scan/1000/Base8-8                   7.88µs ± 1%     8.00µs ± 1%   +1.64%  (p=0.000 n=10+10)
Scan/10000/Base8-8                   155µs ± 1%      155µs ± 1%     ~     (p=0.968 n=10+9)
Scan/100000/Base8-8                 9.11ms ± 0%     9.11ms ± 0%     ~     (p=0.604 n=9+10)
Scan/10/Base10-8                     140ns ± 5%      149ns ± 5%   +6.39%  (p=0.000 n=9+10)
Scan/100/Base10-8                    680ns ± 0%      688ns ± 1%   +1.08%  (p=0.000 n=9+10)
Scan/1000/Base10-8                  7.09µs ± 1%     7.16µs ± 1%   +0.98%  (p=0.019 n=10+10)
Scan/10000/Base10-8                  149µs ± 3%      150µs ± 3%     ~     (p=0.143 n=10+10)
Scan/100000/Base10-8                9.16ms ± 0%     9.16ms ± 0%     ~     (p=0.661 n=10+9)
Scan/10/Base16-8                     134ns ± 5%      135ns ± 3%     ~     (p=0.505 n=9+9)
Scan/100/Base16-8                    560ns ± 1%      563ns ± 0%   +0.67%  (p=0.000 n=10+8)
Scan/1000/Base16-8                  6.28µs ± 1%     6.26µs ± 1%     ~     (p=0.448 n=10+10)
Scan/10000/Base16-8                  161µs ± 1%      162µs ± 1%   +0.74%  (p=0.008 n=9+9)
Scan/100000/Base16-8                9.64ms ± 0%     9.64ms ± 0%     ~     (p=0.436 n=10+10)
String/10/Base2-8                    116ns ±12%      118ns ±13%     ~     (p=0.645 n=10+10)
String/100/Base2-8                   871ns ±23%      860ns ±22%     ~     (p=0.699 n=10+10)
String/1000/Base2-8                 10.0µs ±20%     10.0µs ±23%     ~     (p=0.853 n=10+10)
String/10000/Base2-8                 110µs ±21%      120µs ±25%     ~     (p=0.436 n=10+10)
String/100000/Base2-8                768µs ±11%      733µs ±16%     ~     (p=0.393 n=10+10)
String/10/Base8-8                   51.3ns ± 1%     51.0ns ± 3%     ~     (p=0.286 n=9+9)
String/100/Base8-8                   284ns ± 9%      272ns ±12%     ~     (p=0.267 n=9+10)
String/1000/Base8-8                 3.06µs ± 9%     3.04µs ±10%     ~     (p=0.739 n=10+10)
String/10000/Base8-8                36.1µs ±14%     35.1µs ± 9%     ~     (p=0.447 n=10+9)
String/100000/Base8-8                371µs ±12%      373µs ±16%     ~     (p=0.739 n=10+10)
String/10/Base10-8                   167ns ±11%      165ns ± 9%     ~     (p=0.781 n=10+10)
String/100/Base10-8                  727ns ± 1%      740ns ± 2%   +1.70%  (p=0.001 n=10+10)
String/1000/Base10-8                5.30µs ±18%     5.37µs ±14%     ~     (p=0.631 n=10+10)
String/10000/Base10-8               45.0µs ±14%     44.6µs ±10%     ~     (p=0.720 n=9+10)
String/100000/Base10-8              5.10ms ± 1%     5.05ms ± 3%     ~     (p=0.211 n=9+10)
String/10/Base16-8                  47.7ns ± 6%     47.7ns ± 6%     ~     (p=0.985 n=10+10)
String/100/Base16-8                  221ns ±10%      234ns ±27%     ~     (p=0.541 n=10+10)
String/1000/Base16-8                2.23µs ±11%     2.12µs ± 8%   -4.81%  (p=0.029 n=9+8)
String/10000/Base16-8               28.3µs ±21%     28.5µs ±14%     ~     (p=0.796 n=10+10)
String/100000/Base16-8               291µs ±16%      293µs ±15%     ~     (p=0.931 n=9+9)
LeafSize/0-8                        2.43ms ± 1%     2.49ms ± 1%   +2.56%  (p=0.000 n=10+10)
LeafSize/1-8                        49.7µs ± 9%     46.3µs ±16%   -6.78%  (p=0.017 n=10+9)
LeafSize/2-8                        48.4µs ±18%     46.3µs ±19%     ~     (p=0.436 n=10+10)
LeafSize/3-8                        81.7µs ± 3%     80.9µs ± 3%     ~     (p=0.278 n=10+9)
LeafSize/4-8                        47.0µs ± 7%     47.9µs ±13%     ~     (p=0.905 n=9+10)
LeafSize/5-8                        96.8µs ± 1%     97.3µs ± 2%     ~     (p=0.515 n=8+10)
LeafSize/6-8                        82.5µs ± 4%     80.9µs ± 2%   -1.92%  (p=0.019 n=10+10)
LeafSize/7-8                        67.2µs ±13%     66.6µs ± 9%     ~     (p=0.842 n=10+9)
LeafSize/8-8                        46.0µs ±28%     45.1µs ±12%     ~     (p=0.739 n=10+10)
LeafSize/9-8                         111µs ± 1%      111µs ± 1%     ~     (p=0.739 n=10+10)
LeafSize/10-8                       98.8µs ± 4%     97.9µs ± 3%     ~     (p=0.278 n=10+9)
LeafSize/11-8                       96.8µs ± 1%     96.4µs ± 1%     ~     (p=0.211 n=9+10)
LeafSize/12-8                       81.0µs ± 4%     81.3µs ± 3%     ~     (p=0.579 n=10+10)
LeafSize/13-8                       79.7µs ± 5%     79.2µs ± 3%     ~     (p=0.661 n=10+9)
LeafSize/14-8                       67.6µs ±12%     65.8µs ± 7%     ~     (p=0.447 n=10+9)
LeafSize/15-8                       63.9µs ±17%     66.3µs ±14%     ~     (p=0.481 n=10+10)
LeafSize/16-8                       44.0µs ±28%     46.0µs ±27%     ~     (p=0.481 n=10+10)
LeafSize/32-8                       46.2µs ±13%     43.5µs ±18%     ~     (p=0.156 n=9+10)
LeafSize/64-8                       53.3µs ±10%     53.0µs ±19%     ~     (p=0.730 n=9+9)
ProbablyPrime/n=0-8                 3.60ms ± 1%     3.39ms ± 1%   -5.87%  (p=0.000 n=10+9)
ProbablyPrime/n=1-8                 4.42ms ± 1%     4.08ms ± 1%   -7.69%  (p=0.000 n=10+10)
ProbablyPrime/n=5-8                 7.57ms ± 2%     6.79ms ± 1%  -10.24%  (p=0.000 n=10+10)
ProbablyPrime/n=10-8                11.6ms ± 2%     10.2ms ± 1%  -11.69%  (p=0.000 n=10+10)
ProbablyPrime/n=20-8                19.4ms ± 2%     16.9ms ± 2%  -12.89%  (p=0.000 n=10+10)
ProbablyPrime/Lucas-8               2.81ms ± 2%     2.72ms ± 1%   -3.22%  (p=0.000 n=10+9)
ProbablyPrime/MillerRabinBase2-8     797µs ± 1%      680µs ± 1%  -14.64%  (p=0.000 n=10+10)

name                              old speed      new speed       delta
AddVV/1-8                         17.1GB/s ± 6%   18.0GB/s ± 2%     ~     (p=0.122 n=10+8)
AddVV/2-8                         32.4GB/s ± 2%   32.2GB/s ± 4%     ~     (p=0.661 n=10+9)
AddVV/3-8                         38.6GB/s ± 2%   38.9GB/s ± 1%     ~     (p=0.113 n=10+9)
AddVV/4-8                         45.8GB/s ± 2%   45.8GB/s ± 2%     ~     (p=0.796 n=10+10)
AddVV/5-8                         48.1GB/s ± 2%   48.3GB/s ± 1%     ~     (p=0.315 n=10+10)
AddVV/10-8                        78.9GB/s ± 1%   78.9GB/s ± 2%     ~     (p=0.353 n=10+10)
AddVV/100-8                        136GB/s ± 2%    137GB/s ± 1%     ~     (p=0.971 n=10+10)
AddVV/1000-8                       164GB/s ± 1%    164GB/s ± 4%     ~     (p=0.853 n=10+10)
AddVV/10000-8                      126GB/s ± 6%    129GB/s ± 2%     ~     (p=0.063 n=10+10)
AddVV/100000-8                     116GB/s ± 3%    116GB/s ± 3%     ~     (p=0.796 n=10+10)
AddVW/1-8                         2.64GB/s ± 3%   2.64GB/s ± 3%     ~     (p=0.579 n=10+10)
AddVW/2-8                         4.49GB/s ± 2%   4.44GB/s ± 2%   -1.09%  (p=0.040 n=9+9)
AddVW/3-8                         6.36GB/s ± 1%   6.34GB/s ± 2%     ~     (p=0.684 n=10+10)
AddVW/4-8                         6.83GB/s ± 1%   6.82GB/s ± 2%     ~     (p=0.905 n=10+9)
AddVW/5-8                         8.75GB/s ± 1%   8.73GB/s ± 1%     ~     (p=0.796 n=10+10)
AddVW/10-8                        10.5GB/s ± 2%   10.5GB/s ± 1%     ~     (p=0.971 n=10+10)
AddVW/100-8                       19.5GB/s ± 2%   18.9GB/s ± 2%   -3.22%  (p=0.000 n=10+10)
AddVW/1000-8                      20.7GB/s ± 2%   20.6GB/s ± 4%     ~     (p=0.631 n=10+10)
AddVW/10000-8                     20.6GB/s ± 3%   20.7GB/s ± 3%     ~     (p=0.481 n=10+10)
AddVW/100000-8                    19.4GB/s ± 2%   19.2GB/s ± 3%     ~     (p=0.165 n=10+10)
AddMulVVW/1-8                     19.5GB/s ± 2%   19.7GB/s ± 3%     ~     (p=0.123 n=10+10)
AddMulVVW/2-8                     30.1GB/s ± 2%   30.2GB/s ± 3%     ~     (p=0.297 n=9+9)
AddMulVVW/3-8                     37.9GB/s ± 2%   36.5GB/s ± 2%   -3.63%  (p=0.000 n=10+10)
AddMulVVW/4-8                     40.0GB/s ± 2%   39.4GB/s ± 2%   -1.58%  (p=0.001 n=10+10)
AddMulVVW/5-8                     47.3GB/s ± 2%   46.6GB/s ± 1%   -1.35%  (p=0.001 n=9+9)
AddMulVVW/10-8                    52.3GB/s ± 2%   60.6GB/s ± 3%  +15.76%  (p=0.000 n=10+10)
AddMulVVW/100-8                   80.3GB/s ± 2%  122.1GB/s ± 1%  +51.92%  (p=0.000 n=10+10)
AddMulVVW/1000-8                  92.0GB/s ± 1%  130.3GB/s ± 2%  +41.61%  (p=0.000 n=9+10)
AddMulVVW/10000-8                 88.2GB/s ± 2%  108.2GB/s ± 5%  +22.66%  (p=0.000 n=10+10)
AddMulVVW/100000-8                88.2GB/s ± 2%  102.9GB/s ± 2%  +16.69%  (p=0.000 n=10+10)

Change-Id: Ic98e30c91d437d845fed03e07e976c3fdbf02b36
Reviewed-on: https://go-review.googlesource.com/74851
Run-TryBot: Ilya Tocar <ilya.tocar@intel.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Adam Langley <agl@golang.org>
2018-02-24 00:13:03 +00:00
Shawn Smith
d3beea8c52 all: fix misspellings
GitHub-Last-Rev: 468df242d0
GitHub-Pull-Request: golang/go#23935
Change-Id: If751ce3ffa3a4d5e00a3138211383d12cb6b23fc
Reviewed-on: https://go-review.googlesource.com/95577
Run-TryBot: Andrew Bonventre <andybons@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Andrew Bonventre <andybons@golang.org>
2018-02-20 21:02:58 +00:00
Alberto Donizetti
331092c58f math/big: fix %s verbs in Float tests error messages
Fatalf calls in two Float tests use the %s verb with Floats values,
which is not allowed and results in failure messages that look like
this:

    float_test.go:1385: i = 0, prec = 1, ToZero:
                     %!s(*big.Float=1) [0]
                /    %!s(*big.Float=1) [0]
                =    %!s(*big.Float=0.0625)
                want %!s(*big.Float=1)

Switch to %v.

Change-Id: Ifdc80bf19c91ca1b190f6551a6d0a51b42ed5919
Reviewed-on: https://go-review.googlesource.com/87199
Run-TryBot: Alberto Donizetti <alb.donizetti@gmail.com>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
2018-02-14 09:50:19 +00:00
Thanabodee Charoenpiriyakij
1124fa300b math: use Abs rather than if x < 0 { x = -x }
This is the benchmark result base on darwin with amd64 architecture:

name     old time/op  new time/op  delta
Cos      10.2ns ± 2%  10.3ns ± 3%  +1.18%  (p=0.032 n=10+10)
Cosh     25.3ns ± 3%  24.6ns ± 2%  -3.00%  (p=0.000 n=10+10)
Hypot    6.40ns ± 2%  6.19ns ± 3%  -3.36%  (p=0.000 n=10+10)
HypotGo  7.16ns ± 3%  6.54ns ± 2%  -8.66%  (p=0.000 n=10+10)
J0       66.0ns ± 2%  63.7ns ± 1%  -3.42%  (p=0.000 n=9+10)

Fixes #21812

Change-Id: I2b88fbdfc250cd548f8f08b44ce2eb172dcacf43
Reviewed-on: https://go-review.googlesource.com/84437
Reviewed-by: Giovanni Bajo <rasky@develer.com>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
Run-TryBot: Giovanni Bajo <rasky@develer.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2018-02-13 20:12:23 +00:00
Paul PISCUC
3526c40979 math/rand: typo fixed in documentation of seedPos
In the comment of seedPost, the word: condiiton was changed to: condition

Change-Id: I8967cc0e9f5d37776bada96cc1443c8bf46e1117
Reviewed-on: https://go-review.googlesource.com/86156
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
2018-01-04 20:27:29 +00:00
Brian Kessler
5305bdd86b math: correct result for Pow(x, ±.5)
Fixes #23224

The previous Pow code had an optimization for
powers equal to ±0.5 that used Sqrt for
increased accuracy/speed.  This caused special
cases involving powers of ±0.5 to disagree with
the Pow spec.  This change places the Sqrt optimization
after all of the special case handling.

Change-Id: I6bf757f6248256b29cc21725a84e27705d855369
Reviewed-on: https://go-review.googlesource.com/85660
Reviewed-by: Robert Griesemer <gri@golang.org>
Run-TryBot: Robert Griesemer <gri@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2018-01-02 18:10:43 +00:00
Ilya Tocar
1992893307 math: remove asm version of Dim
Dim performance has regressed by 14% vs 1.9 on amd64.
Current pure go version of Dim is faster and,
what is even more important for performance, is inlinable, so
instead of tweaking asm implementation, just remove it.
I had to update BenchmarkDim, because it was simply reloading
constant(answer) in a loop.
Perf data below:

name   old time/op  new time/op  delta
Dim-6  6.79ns ± 0%  1.60ns ± 1%  -76.39%  (p=0.000 n=7+10)

If I modify benchmark to be the same as in this CL results are even better:

name   old time/op  new time/op  delta
Dim-6  10.2ns ± 0%   1.6ns ± 1%  -84.27%  (p=0.000 n=8+10)

Updates #21913

Change-Id: I00e23c8affc293531e1d9f0e0e49f3a525634f53
Reviewed-on: https://go-review.googlesource.com/80695
Run-TryBot: Ilya Tocar <ilya.tocar@intel.com>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2017-11-30 21:00:33 +00:00
Alberto Donizetti
ff534e2130 math/big: protect against aliasing in nat.divLarge
In nat.divLarge (having signature (z nat).divLarge(u, uIn, v nat)),
we check whether z aliases uIn or v, but aliasing is currently not
checked for the u parameter.

Unfortunately, z and u aliasing each other can in some cases cause
errors in the computation.

The q return parameter (which will hold the result's quotient), is
unconditionally initialized as

    q = z.make(m + 1)

When cap(z) ≥ m+1, z.make() will reuse z's backing array, causing q
and z to share the same backing array. If then z aliases u, setting q
during the quotient computation will then corrupt u, which at that
point already holds computation state.

To fix this, we add an alias(z, u) check at the beginning of the
function, taking care of aliasing the same way we already do for uIn
and v.

Fixes #22830

Change-Id: I3ab81120d5af6db7772a062bb1dfc011de91f7ad
Reviewed-on: https://go-review.googlesource.com/78995
Run-TryBot: Alberto Donizetti <alb.donizetti@gmail.com>
Run-TryBot: Robert Griesemer <gri@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Robert Griesemer <gri@golang.org>
2017-11-30 20:36:54 +00:00
Vladimir Stefanovic
2708da0dc1 runtime/cgo, math: don't use FP instructions for soft-float mips{,le}
Updates #18162

Change-Id: I591fcf71a02678a99a56a6487da9689d3c9b1bb6
Reviewed-on: https://go-review.googlesource.com/37955
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2017-11-30 17:12:32 +00:00
Brian Kessler
802a8f88a3 math/cmplx: use signed zero to correct branch cuts
Branch cuts for the elementary complex functions along real or imaginary axes
should be resolved in floating point calculations by one-sided continuity with
signed zero as described in:

"Branch Cuts for Complex Elementary Functions or Much Ado About Nothing's Sign Bit"
W. Kahan

Available at: https://people.freebsd.org/~das/kahan86branch.pdf

And as described in the C99 standard which is claimed as the original cephes source.

Sqrt did not return the correct branch when imag(x) == 0. The branch is now
determined by sign(imag(x)).  This incorrect branch choice was affecting the behavior
of the Trigonometric/Hyperbolic functions that use Sqrt in intermediate calculations.

Asin, Asinh and Atan had spurious domain checks, whereas the functions should be valid
over the whole complex plane with appropriate branch cuts.

Fixes #6888

Change-Id: I9b1278af54f54bfb4208276ae345bbd3ddf3ec83
Reviewed-on: https://go-review.googlesource.com/46492
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Russ Cox <rsc@golang.org>
2017-11-27 07:44:00 +00:00
Russ Cox
d9a198c7d0 Revert "math/rand: make Perm match Shuffle"
This reverts CL 55972.

Reason for revert: this changes Perm's behavior unnecessarily.
I asked for this change originally but I now regret it.
Reverting so that I don't have to justify it in Go 1.10 release notes.

Edited to keep the change to rand_test.go, which seems to have
been mostly unrelated.

Fixes #22744.

Change-Id: If8bb1bcde3ced0db2fdcd0aa65ab128613686c66
Reviewed-on: https://go-review.googlesource.com/78195
Run-TryBot: Russ Cox <rsc@golang.org>
Reviewed-by: Emmanuel Odeke <emm.odeke@gmail.com>
Reviewed-by: Rob Pike <r@golang.org>
Reviewed-by: Joe Tsai <thebrokentoaster@gmail.com>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
2017-11-16 16:24:30 +00:00
Daniel Martí
a265f2e90e go/printer: indent lone comments in composite lits
If a composite literal contains any comments on their own lines without
any elements, the printer would unindent the comments.

The comments in this edge case are written when the closing '}' is
written. Indent and outdent first so that the indentation is
interspersed before the comment is written.

Also note that the go/printer golden tests don't show the exact same
behaviour that gofmt does. Added a TODO to figure this out in a separate
CL.

While at it, ensure that the tree conforms to gofmt. The changes are
unrelated to this indentation fix, however.

Fixes #22355.

Change-Id: I5ac25ac6de95a236f1e123479127cc4dd71e93fe
Reviewed-on: https://go-review.googlesource.com/74232
Run-TryBot: Daniel Martí <mvdan@mvdan.cc>
Reviewed-by: Robert Griesemer <gri@golang.org>
2017-11-15 18:48:48 +00:00
Brian Kessler
2955a8a6cc math/big: clarify comment on lehmerGCD overflow
A clarifying comment was added to indicate that overflow of a
single Word is not possible in the single digit calculation.
Lehmer's paper includes a proof of the bounds on the size of the
cosequences (u0, u1, u2, v0, v1, v2).

Change-Id: I98127a07aa8f8fe44814b74b2bc6ff720805194b
Reviewed-on: https://go-review.googlesource.com/77451
Reviewed-by: Robert Griesemer <gri@golang.org>
2017-11-14 17:32:39 +00:00
Filippo Valsorda
ef0e2af7b0 math/big: add security warning to (*Int).Rand
Change-Id: I22a67733aa2d07298e124077654c9b1473802100
Reviewed-on: https://go-review.googlesource.com/76012
Reviewed-by: Aliaksandr Valialkin <valyala@gmail.com>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
2017-11-06 15:55:31 +00:00
Tobias Klauser
89bcbf40b8 math/bits: add examples for right rotation
Right rotation is achieved using negative k in RotateLeft*(x, k). Add
examples demonstrating that functionality.

Change-Id: I15dab159accd2937cb18d3fa8ca32da8501567d3
Reviewed-on: https://go-review.googlesource.com/75371
Run-TryBot: Tobias Klauser <tobias.klauser@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Robert Griesemer <gri@golang.org>
2017-11-03 20:12:07 +00:00
Lynn Boger
3860478b42 math: implement asm modf for ppc64x
This change adds an asm implementations modf for ppc64x.

Improvements:

BenchmarkModf-16               7.48          6.26          -16.31%

Updates: #21390

Change-Id: I9c4f3213688e3e8842d050840dc04fc9c0bf6ce4
Reviewed-on: https://go-review.googlesource.com/74411
Run-TryBot: Lynn Boger <laboger@linux.vnet.ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Michael Munday <mike.munday@ibm.com>
2017-11-02 13:24:32 +00:00
griesemer
85c32c3744 math/big: implement CmpAbs
Fixes #22473.

Change-Id: Ie886dfc8b5510970d6d63ca6472c73325f6f2276
Reviewed-on: https://go-review.googlesource.com/74971
Run-TryBot: Robert Griesemer <gri@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Martin Möhrmann <moehrmann@google.com>
2017-11-01 18:17:49 +00:00
Alberto Donizetti
856dccb175 math/big: avoid unnecessary Newton iteration in Float.Sqrt
An initial draft of the Newton code for Float.Sqrt was structured like
this:

  for condition
    // do Newton iteration..
    prec *= 2

since prec, at the end of the loop, was double the precision used in
the last Newton iteration, the termination condition was set to
2*limit. The code was later rewritten in the form

  for condition
    prec *= 2
    // do Newton iteration..

but condition was not updated, and it's still 2*limit, which is about
double what we actually need, and is triggering the execution of an
additional, and unnecessary, Newton iteration.

This change adjusts the Newton termination condition to the (correct)
value of z.prec, plus 32 guard bits as a safety margin.

name                 old time/op    new time/op    delta
FloatSqrt/64-4          798ns ± 3%     802ns ± 3%     ~     (p=0.458 n=8+8)
FloatSqrt/128-4        1.65µs ± 1%    1.65µs ± 1%     ~     (p=0.290 n=8+8)
FloatSqrt/256-4        3.10µs ± 1%    2.10µs ± 0%  -32.32%  (p=0.000 n=8+7)
FloatSqrt/1000-4       8.83µs ± 1%    4.91µs ± 2%  -44.39%  (p=0.000 n=8+8)
FloatSqrt/10000-4       107µs ± 1%      40µs ± 1%  -62.68%  (p=0.000 n=8+8)
FloatSqrt/100000-4     2.91ms ± 1%    0.96ms ± 1%  -67.13%  (p=0.000 n=8+8)
FloatSqrt/1000000-4     240ms ± 1%      80ms ± 1%  -66.66%  (p=0.000 n=8+8)

name                 old alloc/op   new alloc/op   delta
FloatSqrt/64-4           416B ± 0%      416B ± 0%     ~     (all equal)
FloatSqrt/128-4          720B ± 0%      720B ± 0%     ~     (all equal)
FloatSqrt/256-4        1.34kB ± 0%    0.82kB ± 0%  -39.29%  (p=0.000 n=8+8)
FloatSqrt/1000-4       5.09kB ± 0%    2.50kB ± 0%  -50.94%  (p=0.000 n=8+8)
FloatSqrt/10000-4      45.9kB ± 0%    23.5kB ± 0%  -48.81%  (p=0.000 n=8+8)
FloatSqrt/100000-4      533kB ± 0%     251kB ± 0%  -52.90%  (p=0.000 n=8+8)
FloatSqrt/1000000-4    9.21MB ± 0%    4.61MB ± 0%  -49.98%  (p=0.000 n=8+8)

name                 old allocs/op  new allocs/op  delta
FloatSqrt/64-4           9.00 ± 0%      9.00 ± 0%     ~     (all equal)
FloatSqrt/128-4          13.0 ± 0%      13.0 ± 0%     ~     (all equal)
FloatSqrt/256-4          15.0 ± 0%      12.0 ± 0%  -20.00%  (p=0.000 n=8+8)
FloatSqrt/1000-4         24.0 ± 0%      19.0 ± 0%  -20.83%  (p=0.000 n=8+8)
FloatSqrt/10000-4        40.0 ± 0%      35.0 ± 0%  -12.50%  (p=0.000 n=8+8)
FloatSqrt/100000-4       66.0 ± 0%      55.0 ± 0%  -16.67%  (p=0.000 n=8+8)
FloatSqrt/1000000-4       143 ± 0%       122 ± 0%  -14.69%  (p=0.000 n=8+8)

Change-Id: I4868adb7f8960f2ca20e7792734c2e6211669fc0
Reviewed-on: https://go-review.googlesource.com/75010
Reviewed-by: Robert Griesemer <gri@golang.org>
2017-11-01 16:21:06 +00:00
Alberto Donizetti
2c783dc038 math/big: save one subtraction per iteration in Float.Sqrt
The Sqrt Newton method computes g(t) = f(t)/f'(t) and then iterates

  t2 = t1 - g(t1)

We can save one operation by including the final subtraction in g(t)
and evaluating the resulting expression symbolically.

For example, for the direct method,

  g(t) = ½(t² - x)/t

and we use 2 multiplications, 1 division and 1 subtraction in g(),
plus 1 final subtraction; but if we compute

  t - g(t) = t - ½(t² - x)/t = ½(t² + x)/t

we only use 2 multiplications, 1 division and 1 addition.

A similar simplification can be done for the inverse method.

name                 old time/op    new time/op    delta
FloatSqrt/64-4          889ns ± 4%     790ns ± 1%  -11.19%  (p=0.000 n=8+7)
FloatSqrt/128-4        1.82µs ± 0%    1.64µs ± 1%  -10.07%  (p=0.001 n=6+8)
FloatSqrt/256-4        3.56µs ± 4%    3.10µs ± 3%  -12.96%  (p=0.000 n=7+8)
FloatSqrt/1000-4       9.06µs ± 3%    8.86µs ± 1%   -2.20%  (p=0.001 n=7+7)
FloatSqrt/10000-4       109µs ± 1%     107µs ± 1%   -1.56%  (p=0.000 n=8+8)
FloatSqrt/100000-4     2.91ms ± 0%    2.89ms ± 2%   -0.68%  (p=0.026 n=7+7)
FloatSqrt/1000000-4     237ms ± 1%     239ms ± 1%   +0.72%  (p=0.021 n=8+8)

name                 old alloc/op   new alloc/op   delta
FloatSqrt/64-4           448B ± 0%      416B ± 0%   -7.14%  (p=0.000 n=8+8)
FloatSqrt/128-4          752B ± 0%      720B ± 0%   -4.26%  (p=0.000 n=8+8)
FloatSqrt/256-4        2.05kB ± 0%    1.34kB ± 0%  -34.38%  (p=0.000 n=8+8)
FloatSqrt/1000-4       6.91kB ± 0%    5.09kB ± 0%  -26.39%  (p=0.000 n=8+8)
FloatSqrt/10000-4      60.5kB ± 0%    45.9kB ± 0%  -24.17%  (p=0.000 n=8+8)
FloatSqrt/100000-4      617kB ± 0%     533kB ± 0%  -13.57%  (p=0.000 n=8+8)
FloatSqrt/1000000-4    10.3MB ± 0%     9.2MB ± 0%  -10.85%  (p=0.000 n=8+8)

name                 old allocs/op  new allocs/op  delta
FloatSqrt/64-4           9.00 ± 0%      9.00 ± 0%     ~     (all equal)
FloatSqrt/128-4          13.0 ± 0%      13.0 ± 0%     ~     (all equal)
FloatSqrt/256-4          20.0 ± 0%      15.0 ± 0%  -25.00%  (p=0.000 n=8+8)
FloatSqrt/1000-4         31.0 ± 0%      24.0 ± 0%  -22.58%  (p=0.000 n=8+8)
FloatSqrt/10000-4        50.0 ± 0%      40.0 ± 0%  -20.00%  (p=0.000 n=8+8)
FloatSqrt/100000-4       76.0 ± 0%      66.0 ± 0%  -13.16%  (p=0.000 n=8+8)
FloatSqrt/1000000-4       146 ± 0%       143 ± 0%   -2.05%  (p=0.000 n=8+8)

Change-Id: I271c00de1ca9740e585bf2af7bcd87b18c1fa68e
Reviewed-on: https://go-review.googlesource.com/73879
Run-TryBot: Alberto Donizetti <alb.donizetti@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Robert Griesemer <gri@golang.org>
2017-11-01 11:24:02 +00:00
Ilya Tocar
94484d8ed5 cmd/compile: intrinsify math.{Trunc/Ceil/Floor} on amd64
This significantly speed-ups Trunc.
Ceil/Floor are using the same instruction, so do them too.

name     old time/op  new time/op  delta
Floor-6  3.33ns ± 1%  3.22ns ± 0%   -3.39%  (p=0.000 n=10+10)
Ceil-6   3.33ns ± 1%  3.22ns ± 0%   -3.16%  (p=0.000 n=10+7)
Trunc-6  4.83ns ± 0%  3.22ns ± 0%  -33.36%  (p=0.000 n=6+8)

Change-Id: If848790e458eedfe38a6a0407bb4f589c68ac254
Reviewed-on: https://go-review.googlesource.com/68630
Run-TryBot: Ilya Tocar <ilya.tocar@intel.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
2017-10-31 19:30:54 +00:00
Michael Munday
c280126557 cmd/asm, cmd/internal/obj/s390x, math: add "test under mask" instructions
Adds the following s390x test under mask (immediate) instructions:

TMHH
TMHL
TMLH
TMLL

These are useful for testing bits and are already used in the math package.

Change-Id: Idffb3f83b238dba76ac1e42ac6b0bf7f1d11bea2
Reviewed-on: https://go-review.googlesource.com/41092
Run-TryBot: Michael Munday <mike.munday@ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2017-10-30 23:55:14 +00:00
Michael Munday
b97688d112 math: optimize dim and remove s390x assembly implementation
By calculating dim directly, rather than calling max, we can simplify
the generated code significantly. The compiler now reports that dim
is easily inlineable, but it can't be inlined because there is still
an assembly stub for Dim.

Since dim is now very simple I no longer think it is worth having
assembly implementations of it. I have therefore removed the s390x
assembly. Removing the other assembly for Dim is #21913.

name  old time/op  new time/op  delta
Dim   4.29ns ± 0%  3.53ns ± 0%  -17.62%  (p=0.000 n=9+8)

Change-Id: Ic38a6b51603cbc661dcdb868ecf2b1947e9f399e
Reviewed-on: https://go-review.googlesource.com/64194
Run-TryBot: Michael Munday <mike.munday@ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Robert Griesemer <gri@golang.org>
2017-10-30 19:05:51 +00:00
Alberto Donizetti
bd48d37e30 math/big: add (*Float).Sqrt
This change adds a Square root method to the big.Float type, with
signature

  (z *Float) Sqrt(x *Float) *Float

Fixes #20460

Change-Id: I050aaed0615fe0894e11c800744600648343c223
Reviewed-on: https://go-review.googlesource.com/67830
Run-TryBot: Alberto Donizetti <alb.donizetti@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Robert Griesemer <gri@golang.org>
2017-10-26 17:29:27 +00:00
Brian Kessler
1643d4f33a math/big: implement Lehmer's GCD algorithm
Updates #15833

Lehmer's GCD algorithm uses single precision calculations
to simulate several steps of multiple precision calculations
in Euclid's GCD algorithm which leads to a considerable
speed up.  This implementation uses Collins' simplified
testing condition on the single digit cosequences which
requires only one quotient and avoids any possibility of
overflow.

name                          old time/op  new time/op  delta
GCD10x10/WithoutXY-4          1.82µs ±24%  0.28µs ± 6%  -84.40%  (p=0.008 n=5+5)
GCD10x10/WithXY-4             1.69µs ± 6%  1.71µs ± 6%     ~     (p=0.595 n=5+5)
GCD10x100/WithoutXY-4         1.87µs ± 2%  0.56µs ± 4%  -70.13%  (p=0.008 n=5+5)
GCD10x100/WithXY-4            2.61µs ± 2%  2.65µs ± 4%     ~     (p=0.635 n=5+5)
GCD10x1000/WithoutXY-4        2.75µs ± 2%  1.48µs ± 1%  -46.06%  (p=0.008 n=5+5)
GCD10x1000/WithXY-4           5.29µs ± 2%  5.25µs ± 2%     ~     (p=0.548 n=5+5)
GCD10x10000/WithoutXY-4       10.7µs ± 2%  10.3µs ± 0%   -4.38%  (p=0.008 n=5+5)
GCD10x10000/WithXY-4          22.3µs ± 6%  22.1µs ± 1%     ~     (p=1.000 n=5+5)
GCD10x100000/WithoutXY-4      93.7µs ± 2%  99.4µs ± 2%   +6.09%  (p=0.008 n=5+5)
GCD10x100000/WithXY-4          196µs ± 2%   199µs ± 2%     ~     (p=0.222 n=5+5)
GCD100x100/WithoutXY-4        10.1µs ± 2%   2.5µs ± 2%  -74.84%  (p=0.008 n=5+5)
GCD100x100/WithXY-4           21.4µs ± 2%  21.3µs ± 7%     ~     (p=0.548 n=5+5)
GCD100x1000/WithoutXY-4       11.3µs ± 2%   4.4µs ± 4%  -60.87%  (p=0.008 n=5+5)
GCD100x1000/WithXY-4          24.7µs ± 3%  23.9µs ± 1%     ~     (p=0.056 n=5+5)
GCD100x10000/WithoutXY-4      26.6µs ± 1%  20.0µs ± 2%  -24.82%  (p=0.008 n=5+5)
GCD100x10000/WithXY-4         78.7µs ± 2%  78.2µs ± 2%     ~     (p=0.690 n=5+5)
GCD100x100000/WithoutXY-4      174µs ± 2%   171µs ± 1%     ~     (p=0.056 n=5+5)
GCD100x100000/WithXY-4         563µs ± 4%   561µs ± 2%     ~     (p=1.000 n=5+5)
GCD1000x1000/WithoutXY-4       120µs ± 5%    29µs ± 3%  -75.71%  (p=0.008 n=5+5)
GCD1000x1000/WithXY-4          355µs ± 4%   358µs ± 2%     ~     (p=0.841 n=5+5)
GCD1000x10000/WithoutXY-4      140µs ± 2%    49µs ± 2%  -65.07%  (p=0.008 n=5+5)
GCD1000x10000/WithXY-4         626µs ± 3%   628µs ± 9%     ~     (p=0.690 n=5+5)
GCD1000x100000/WithoutXY-4     340µs ± 4%   259µs ± 6%  -23.79%  (p=0.008 n=5+5)
GCD1000x100000/WithXY-4       3.76ms ± 4%  3.82ms ± 5%     ~     (p=0.310 n=5+5)
GCD10000x10000/WithoutXY-4    3.11ms ± 3%  0.54ms ± 2%  -82.74%  (p=0.008 n=5+5)
GCD10000x10000/WithXY-4       7.96ms ± 3%  7.69ms ± 3%     ~     (p=0.151 n=5+5)
GCD10000x100000/WithoutXY-4   3.88ms ± 1%  1.27ms ± 2%  -67.21%  (p=0.008 n=5+5)
GCD10000x100000/WithXY-4      38.1ms ± 2%  38.8ms ± 1%     ~     (p=0.095 n=5+5)
GCD100000x100000/WithoutXY-4   208ms ± 1%    25ms ± 4%  -88.07%  (p=0.008 n=5+5)
GCD100000x100000/WithXY-4      533ms ± 5%   525ms ± 4%     ~     (p=0.548 n=5+5)

Change-Id: Ic1e007eb807b93e75f4752e968e98c1f0cb90e43
Reviewed-on: https://go-review.googlesource.com/59450
Run-TryBot: Robert Griesemer <gri@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Robert Griesemer <gri@golang.org>
2017-10-24 22:42:43 +00:00
Mark Pulford
a5c44f3e3f math: add RoundToEven function
Rounding ties to even is statistically useful for some applications.
This implementation completes IEEE float64 rounding mode support (in
addition to Round, Ceil, Floor, Trunc).

This function avoids subtle faults found in ad-hoc implementations, and
is simple enough to be inlined by the compiler.

Fixes #21748

Change-Id: I09415df2e42435f9e7dabe3bdc0148e9b9ebd609
Reviewed-on: https://go-review.googlesource.com/61211
Reviewed-by: Robert Griesemer <gri@golang.org>
Run-TryBot: Robert Griesemer <gri@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2017-10-24 22:33:09 +00:00
Matthew Dempsky
a5868a47c6 cmd/internal/obj/x86: move MOV->XOR rewriting into compiler
Fixes #20986.

Change-Id: Ic3cf5c0ab260f259ecff7b92cfdf5f4ae432aef3
Reviewed-on: https://go-review.googlesource.com/73072
Run-TryBot: Matthew Dempsky <mdempsky@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
2017-10-24 21:32:17 +00:00
Filippo Valsorda
f75158c365 math/big: fix ModSqrt optimized path for x = z
name                   old time/op  new time/op  delta
ModSqrt224_3Mod4-4      153µs ± 2%   154µs ± 1%   ~     (p=0.548 n=5+5)
ModSqrt5430_3Mod4-4     776ms ± 2%   791ms ± 2%   ~     (p=0.222 n=5+5)

Fixes #22265

Change-Id: If233542716e04341990a45a1c2b7118da6d233f7
Reviewed-on: https://go-review.googlesource.com/70832
Run-TryBot: Filippo Valsorda <hi@filippo.io>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Robert Griesemer <gri@golang.org>
2017-10-16 21:41:44 +00:00
griesemer
51cfe6849a math/big: provide support for conversion bases up to 62
Increase MaxBase from 36 to 62 and extend the conversion
alphabet with the upper-case letters 'A' to 'Z'. For int
conversions with bases <= 36, the letters 'A' to 'Z' have
the same values (10 to 35) as the corresponding lower-case
letters. For conversion bases > 36 up to 62, the upper-case
letters have the values 36 to 61.

Added MaxBase to api/except.txt: Clients should not make
assumptions about the value of MaxBase being constant.

The core of the change is in natconv.go. The remaining
changes are adjusted tests and documentation.

Fixes #21558.

Change-Id: I5f74da633caafca03993e13f32ac9546c572cc84
Reviewed-on: https://go-review.googlesource.com/65970
Reviewed-by: Martin Möhrmann <moehrmann@google.com>
2017-10-06 17:46:15 +00:00
griesemer
2ddd07138d math/bits: complete examples
Change-Id: Icbe6885ffd3aa4e77441ab03a2b9a04a9276d5eb
Reviewed-on: https://go-review.googlesource.com/68311
Reviewed-by: Martin Möhrmann <moehrmann@google.com>
2017-10-06 16:58:03 +00:00
Marvin Stenger
90d71fe99e all: revert "all: prefer strings.IndexByte over strings.Index"
This reverts https://golang.org/cl/65930.

Fixes #22148

Change-Id: Ie0712621ed89c43bef94417fc32de9af77607760
Reviewed-on: https://go-review.googlesource.com/68430
Reviewed-by: Ian Lance Taylor <iant@golang.org>
2017-10-05 23:19:10 +00:00
Marvin Stenger
aa00c607e1 math/big: remove []byte/string conversions
This removes some of the []byte/string conversions currently
existing in the (un)marshaling methods of Int and Rat.

For Int we introduce a new function (*Int).setFromScanner() essentially
implementing the SetString method being given an io.ByteScanner instead
of a string. So we can handle the string case in (*Int).SetString with
a *strings.Reader and the []byte case in (*Int).UnmarshalText() with a
*bytes.Reader now avoiding the []byte/string conversion here.

For Rat we introduce a new function (*Rat).marshal() essentially
implementing the String method outputting []byte instead of string.
Using this new function and the same formatting rules as in
(*Rat).RatString we can implement (*Rat).MarshalText() without
the []byte/string conversion it used to have.

Change-Id: Ic5ef246c1582c428a40f214b95a16671ef0a06d9
Reviewed-on: https://go-review.googlesource.com/65950
Reviewed-by: Robert Griesemer <gri@golang.org>
Run-TryBot: Robert Griesemer <gri@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2017-10-04 17:16:52 +00:00
Marvin Stenger
f22ba1f247 all: prefer strings.IndexByte over strings.Index
strings.IndexByte was introduced in go1.2 and it can be used
effectively wherever the second argument to strings.Index is
exactly one byte long.

This avoids generating unnecessary string symbols and saves
a few calls to strings.Index.

Change-Id: I1ab5edb7c4ee9058084cfa57cbcc267c2597e793
Reviewed-on: https://go-review.googlesource.com/65930
Run-TryBot: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
2017-09-25 17:35:41 +00:00
Marvin Stenger
7a5d76fa62 math/big: delete solved TODO
The TODO is no longer needed as it was solved by a previous CL.
See https://go-review.googlesource.com/14995.

Change-Id: If62d1b296f35758ad3d18d28c8fbb95e797f4464
Reviewed-on: https://go-review.googlesource.com/65232
Reviewed-by: Robert Griesemer <gri@golang.org>
2017-09-22 10:21:46 +00:00
Agniva De Sarker
d2f317218b math: implement fast path for Exp
- using FMA and AVX instructions if available to speed-up
Exp calculation on amd64

- using a data table instead of #define'ed constants because
these instructions do not support loading floating point immediates.
One has to use a memory operand / register.

- Benchmark results on Intel(R) Core(TM) i5-5200U CPU @ 2.20GHz:

Original vs New (non-FMA path)
name  old time/op    new time/op    delta
Exp     16.0ns ± 1%    16.1ns ± 3%   ~     (p=0.308 n=9+10)

Original vs New (FMA path)
name  old time/op    new time/op    delta
Exp     16.0ns ± 1%    13.7ns ± 2%  -14.80%  (p=0.000 n=9+10)

Change-Id: I3d8986925d82b39b95ee979ae06f59d7e591d02e
Reviewed-on: https://go-review.googlesource.com/62590
Reviewed-by: Ilya Tocar <ilya.tocar@intel.com>
Reviewed-by: Keith Randall <khr@golang.org>
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2017-09-20 21:43:00 +00:00
Burak Guven
9cc170f9a5 math/rand: fix comment for Shuffle
Shuffle panics if n < 0, not n <= 0. The comment for the (*Rand).Shuffle
function is already accurate.

Change-Id: I073049310bca9632e50e9ca3ff79eec402122793
Reviewed-on: https://go-review.googlesource.com/63750
Reviewed-by: Ian Lance Taylor <iant@golang.org>
2017-09-14 03:41:35 +00:00
Michael Munday
ffb4708d1b math: fix Abs, Copysign and Signbit benchmarks
CL 62250 makes constant folding a bit more aggressive and these
benchmarks were optimized away. This CL adds some indirection to
the function arguments to stop them being folded.

The Copysign benchmark is a bit faster because I've left one
argument as a constant and it can be partially folded.

                     old           CL 62250     this CL
Copysign             1.24ns ± 0%   0.34ns ± 2%  1.02ns ± 2%
Abs                  0.67ns ± 0%   0.35ns ± 3%  0.67ns ± 0%
Signbit              0.87ns ± 0%   0.35ns ± 2%  0.87ns ± 1%

Change-Id: I9604465a87d7aa29f4bd6009839c8ee354be3cd7
Reviewed-on: https://go-review.googlesource.com/62450
Run-TryBot: Michael Munday <mike.munday@ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
2017-09-09 16:52:16 +00:00
Ian Lance Taylor
44f7fd030f math/rand: change http to https in comment
Change-Id: I19c1b0e1b238dda82e69bd47459528ed06b55840
Reviewed-on: https://go-review.googlesource.com/62310
Reviewed-by: David Crawshaw <crawshaw@golang.org>
2017-09-08 22:11:55 +00:00