1
0
mirror of https://github.com/golang/go synced 2024-11-19 16:34:49 -07:00
Commit Graph

13 Commits

Author SHA1 Message Date
Keith Randall
a96e117a58 runtime: amd64, use 4-byte ops for memmove of 4 bytes
memmove used to use 2 2-byte load/store pairs to move 4 bytes.
When the result is loaded with a single 4-byte load, it caused
a store to load fowarding stall.  To avoid the stall,
special case memmove to use 4 byte ops for the 4 byte copy case.

We already have a special case for 8-byte copies.
386 already specializes 4-byte copies.
I'll do 2-byte copies also, but not for 1.8.

benchmark                 old ns/op     new ns/op     delta
BenchmarkIssue18740-8     7567          4799          -36.58%

3-byte copies get a bit slower.  Other copies are unchanged.
name         old time/op   new time/op   delta
Memmove/3-8   4.76ns ± 5%   5.26ns ± 3%  +10.50%  (p=0.000 n=10+10)

Fixes #18740

Change-Id: Iec82cbac0ecfee80fa3c8fc83828f9a1819c3c74
Reviewed-on: https://go-review.googlesource.com/35567
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>
2017-01-23 19:39:22 +00:00
Brad Fitzpatrick
2341631506 all: sprinkle t.Parallel on some slow tests
I used the slowtests.go tool as described in
https://golang.org/cl/32684 on packages that stood out.

go test -short std drops from ~56 to ~52 seconds.

This isn't a huge win, but it was mostly an exercise.

Updates #17751

Change-Id: I9f3402e36a038d71e662d06ce2c1d52f6c4b674d
Reviewed-on: https://go-review.googlesource.com/32751
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
2016-11-04 16:56:57 +00:00
Denis Nagorny
d7507e9d11 runtime: improve memmove for amd64
Use AVX if available on 4th generation of Intel(TM) Core(TM) processors.

    (collected on E5 2609v3 @1.9GHz)
    name                        old speed      new speed       delta
    Memmove/1-6                  158MB/s ± 0%    172MB/s ± 0%    +9.09% (p=0.000 n=16+16)
    Memmove/2-6                  316MB/s ± 0%    345MB/s ± 0%    +9.09% (p=0.000 n=18+16)
    Memmove/3-6                  517MB/s ± 0%    517MB/s ± 0%      ~ (p=0.445 n=16+16)
    Memmove/4-6                  687MB/s ± 1%    690MB/s ± 0%    +0.35% (p=0.000 n=20+17)
    Memmove/5-6                  729MB/s ± 0%    729MB/s ± 0%    +0.01% (p=0.000 n=16+18)
    Memmove/6-6                  875MB/s ± 0%    875MB/s ± 0%    +0.01% (p=0.000 n=18+18)
    Memmove/7-6                 1.02GB/s ± 0%   1.02GB/s ± 1%      ~ (p=0.139 n=19+20)
    Memmove/8-6                 1.26GB/s ± 0%   1.26GB/s ± 0%    +0.00% (p=0.000 n=18+18)
    Memmove/9-6                 1.42GB/s ± 0%   1.42GB/s ± 0%    +0.00% (p=0.000 n=17+18)
    Memmove/10-6                1.58GB/s ± 0%   1.58GB/s ± 0%    +0.00% (p=0.000 n=19+19)
    Memmove/11-6                1.74GB/s ± 0%   1.74GB/s ± 0%    +0.00% (p=0.001 n=18+17)
    Memmove/12-6                1.90GB/s ± 0%   1.90GB/s ± 0%    +0.00% (p=0.000 n=19+19)
    Memmove/13-6                2.05GB/s ± 0%   2.05GB/s ± 0%    +0.00% (p=0.000 n=18+19)
    Memmove/14-6                2.21GB/s ± 0%   2.21GB/s ± 0%    +0.00% (p=0.000 n=16+20)
    Memmove/15-6                2.37GB/s ± 0%   2.37GB/s ± 0%    +0.00% (p=0.004 n=19+20)
    Memmove/16-6                2.53GB/s ± 0%   2.53GB/s ± 0%    +0.00% (p=0.000 n=16+16)
    Memmove/32-6                4.67GB/s ± 0%   4.67GB/s ± 0%    +0.00% (p=0.000 n=17+17)
    Memmove/64-6                8.67GB/s ± 0%   8.64GB/s ± 0%    -0.33% (p=0.000 n=18+17)
    Memmove/128-6               12.6GB/s ± 0%   11.6GB/s ± 0%    -8.05% (p=0.000 n=16+19)
    Memmove/256-6               16.3GB/s ± 0%   16.6GB/s ± 0%    +1.66% (p=0.000 n=20+18)
    Memmove/512-6               21.5GB/s ± 0%   24.4GB/s ± 0%   +13.35% (p=0.000 n=18+17)
    Memmove/1024-6              24.7GB/s ± 0%   33.7GB/s ± 0%   +36.12% (p=0.000 n=18+18)
    Memmove/2048-6              27.3GB/s ± 0%   43.3GB/s ± 0%   +58.77% (p=0.000 n=19+17)
    Memmove/4096-6              37.5GB/s ± 0%   50.5GB/s ± 0%   +34.56% (p=0.000 n=19+19)
    MemmoveUnalignedDst/1-6      135MB/s ± 0%    146MB/s ± 0%    +7.69% (p=0.000 n=16+14)
    MemmoveUnalignedDst/2-6      271MB/s ± 0%    292MB/s ± 0%    +7.69% (p=0.000 n=18+18)
    MemmoveUnalignedDst/3-6      438MB/s ± 0%    438MB/s ± 0%      ~ (p=0.352 n=16+19)
    MemmoveUnalignedDst/4-6      584MB/s ± 0%    584MB/s ± 0%      ~ (p=0.876 n=17+17)
    MemmoveUnalignedDst/5-6      631MB/s ± 1%    632MB/s ± 0%    +0.25% (p=0.000 n=20+17)
    MemmoveUnalignedDst/6-6      759MB/s ± 0%    759MB/s ± 0%    +0.00% (p=0.000 n=19+16)
    MemmoveUnalignedDst/7-6      885MB/s ± 0%    883MB/s ± 1%      ~ (p=0.647 n=18+20)
    MemmoveUnalignedDst/8-6     1.08GB/s ± 0%   1.08GB/s ± 0%    +0.00% (p=0.035 n=19+18)
    MemmoveUnalignedDst/9-6     1.22GB/s ± 0%   1.22GB/s ± 0%      ~ (p=0.251 n=18+17)
    MemmoveUnalignedDst/10-6    1.35GB/s ± 0%   1.35GB/s ± 0%      ~ (p=0.327 n=17+18)
    MemmoveUnalignedDst/11-6    1.49GB/s ± 0%   1.49GB/s ± 0%      ~ (p=0.531 n=18+19)
    MemmoveUnalignedDst/12-6    1.63GB/s ± 0%   1.63GB/s ± 0%      ~ (p=0.886 n=19+18)
    MemmoveUnalignedDst/13-6    1.76GB/s ± 0%   1.76GB/s ± 1%    -0.24% (p=0.006 n=18+20)
    MemmoveUnalignedDst/14-6    1.90GB/s ± 0%   1.90GB/s ± 0%      ~ (p=0.818 n=20+19)
    MemmoveUnalignedDst/15-6    2.03GB/s ± 0%   2.03GB/s ± 0%      ~ (p=0.294 n=17+16)
    MemmoveUnalignedDst/16-6    2.17GB/s ± 0%   2.17GB/s ± 0%      ~ (p=0.602 n=16+18)
    MemmoveUnalignedDst/32-6    4.05GB/s ± 0%   4.05GB/s ± 0%    +0.00% (p=0.010 n=18+17)
    MemmoveUnalignedDst/64-6    7.59GB/s ± 0%   7.59GB/s ± 0%    +0.00% (p=0.022 n=18+16)
    MemmoveUnalignedDst/128-6   11.1GB/s ± 0%   11.4GB/s ± 0%    +2.79% (p=0.000 n=18+17)
    MemmoveUnalignedDst/256-6   16.4GB/s ± 0%   16.7GB/s ± 0%    +1.59% (p=0.000 n=20+17)
    MemmoveUnalignedDst/512-6   15.7GB/s ± 0%   21.3GB/s ± 0%   +35.87% (p=0.000 n=18+20)
    MemmoveUnalignedDst/1024-6  16.0GB/s ±20%   31.5GB/s ± 0%   +96.93% (p=0.000 n=20+14)
    MemmoveUnalignedDst/2048-6  19.6GB/s ± 0%   42.1GB/s ± 0%  +115.16% (p=0.000 n=17+18)
    MemmoveUnalignedDst/4096-6  6.41GB/s ± 0%  33.18GB/s ± 0%  +417.56% (p=0.000 n=17+18)
    MemmoveUnalignedSrc/1-6      171MB/s ± 0%    166MB/s ± 0%    -3.33% (p=0.000 n=19+16)
    MemmoveUnalignedSrc/2-6      343MB/s ± 0%    342MB/s ± 1%    -0.41% (p=0.000 n=17+20)
    MemmoveUnalignedSrc/3-6      508MB/s ± 0%    493MB/s ± 1%    -2.90% (p=0.000 n=17+17)
    MemmoveUnalignedSrc/4-6      677MB/s ± 0%    660MB/s ± 2%    -2.55% (p=0.000 n=17+20)
    MemmoveUnalignedSrc/5-6      790MB/s ± 0%    790MB/s ± 0%      ~ (p=0.139 n=17+17)
    MemmoveUnalignedSrc/6-6      948MB/s ± 0%    946MB/s ± 1%      ~ (p=0.330 n=17+19)
    MemmoveUnalignedSrc/7-6     1.11GB/s ± 0%   1.11GB/s ± 0%    -0.05% (p=0.026 n=17+17)
    MemmoveUnalignedSrc/8-6     1.38GB/s ± 0%   1.38GB/s ± 0%      ~ (p=0.091 n=18+16)
    MemmoveUnalignedSrc/9-6     1.42GB/s ± 0%   1.40GB/s ± 1%    -1.04% (p=0.000 n=19+20)
    MemmoveUnalignedSrc/10-6    1.58GB/s ± 0%   1.56GB/s ± 1%    -1.15% (p=0.000 n=18+19)
    MemmoveUnalignedSrc/11-6    1.73GB/s ± 0%   1.71GB/s ± 1%    -1.30% (p=0.000 n=20+20)
    MemmoveUnalignedSrc/12-6    1.89GB/s ± 0%   1.87GB/s ± 1%    -1.18% (p=0.000 n=17+20)
    MemmoveUnalignedSrc/13-6    2.05GB/s ± 0%   2.02GB/s ± 1%    -1.18% (p=0.000 n=17+20)
    MemmoveUnalignedSrc/14-6    2.21GB/s ± 0%   2.18GB/s ± 1%    -1.14% (p=0.000 n=17+20)
    MemmoveUnalignedSrc/15-6    2.36GB/s ± 0%   2.34GB/s ± 1%    -1.04% (p=0.000 n=17+20)
    MemmoveUnalignedSrc/16-6    2.52GB/s ± 0%   2.49GB/s ± 1%    -1.26% (p=0.000 n=19+20)
    MemmoveUnalignedSrc/32-6    4.82GB/s ± 0%   4.61GB/s ± 0%    -4.40% (p=0.000 n=19+20)
    MemmoveUnalignedSrc/64-6    5.03GB/s ± 4%   7.97GB/s ± 0%   +58.55% (p=0.000 n=20+16)
    MemmoveUnalignedSrc/128-6   11.1GB/s ± 0%   11.2GB/s ± 0%    +0.52% (p=0.000 n=17+18)
    MemmoveUnalignedSrc/256-6   16.5GB/s ± 0%   16.4GB/s ± 0%    -0.10% (p=0.000 n=20+18)
    MemmoveUnalignedSrc/512-6   21.0GB/s ± 0%   22.1GB/s ± 0%    +5.48% (p=0.000 n=14+17)
    MemmoveUnalignedSrc/1024-6  24.9GB/s ± 0%   31.9GB/s ± 0%   +28.20% (p=0.000 n=19+20)
    MemmoveUnalignedSrc/2048-6  23.3GB/s ± 0%   33.8GB/s ± 0%   +45.22% (p=0.000 n=17+19)
    MemmoveUnalignedSrc/4096-6  37.3GB/s ± 0%   42.7GB/s ± 0%   +14.30% (p=0.000 n=17+17)

Change-Id: Id66aa3e499ccfb117cb99d623ef326b50d057b64
Reviewed-on: https://go-review.googlesource.com/29590
Run-TryBot: Denis Nagorny <denis.nagorny@intel.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
2016-10-06 10:21:58 +00:00
Joe Tsai
6fb4b15f98 Revert "runtime: improve memmove for amd64"
This reverts commit 3607c5f4f1.

This was causing failures on amd64 machines without AVX.

Fixes #16939

Change-Id: I70080fbb4e7ae791857334f2bffd847d08cb25fa
Reviewed-on: https://go-review.googlesource.com/28274
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
2016-08-31 21:07:35 +00:00
Denis Nagorny
3607c5f4f1 runtime: improve memmove for amd64
Use AVX if available on 4th generation of Intel(TM) Core(TM) processors.

(collected on E5 2609v3 @1.9GHz)
name                        old speed      new speed       delta
Memmove/1-6                  158MB/s ± 0%    172MB/s ± 0%    +9.09% (p=0.000 n=16+16)
Memmove/2-6                  316MB/s ± 0%    345MB/s ± 0%    +9.09% (p=0.000 n=18+16)
Memmove/3-6                  517MB/s ± 0%    517MB/s ± 0%      ~ (p=0.445 n=16+16)
Memmove/4-6                  687MB/s ± 1%    690MB/s ± 0%    +0.35% (p=0.000 n=20+17)
Memmove/5-6                  729MB/s ± 0%    729MB/s ± 0%    +0.01% (p=0.000 n=16+18)
Memmove/6-6                  875MB/s ± 0%    875MB/s ± 0%    +0.01% (p=0.000 n=18+18)
Memmove/7-6                 1.02GB/s ± 0%   1.02GB/s ± 1%      ~ (p=0.139 n=19+20)
Memmove/8-6                 1.26GB/s ± 0%   1.26GB/s ± 0%    +0.00% (p=0.000 n=18+18)
Memmove/9-6                 1.42GB/s ± 0%   1.42GB/s ± 0%    +0.00% (p=0.000 n=17+18)
Memmove/10-6                1.58GB/s ± 0%   1.58GB/s ± 0%    +0.00% (p=0.000 n=19+19)
Memmove/11-6                1.74GB/s ± 0%   1.74GB/s ± 0%    +0.00% (p=0.001 n=18+17)
Memmove/12-6                1.90GB/s ± 0%   1.90GB/s ± 0%    +0.00% (p=0.000 n=19+19)
Memmove/13-6                2.05GB/s ± 0%   2.05GB/s ± 0%    +0.00% (p=0.000 n=18+19)
Memmove/14-6                2.21GB/s ± 0%   2.21GB/s ± 0%    +0.00% (p=0.000 n=16+20)
Memmove/15-6                2.37GB/s ± 0%   2.37GB/s ± 0%    +0.00% (p=0.004 n=19+20)
Memmove/16-6                2.53GB/s ± 0%   2.53GB/s ± 0%    +0.00% (p=0.000 n=16+16)
Memmove/32-6                4.67GB/s ± 0%   4.67GB/s ± 0%    +0.00% (p=0.000 n=17+17)
Memmove/64-6                8.67GB/s ± 0%   8.64GB/s ± 0%    -0.33% (p=0.000 n=18+17)
Memmove/128-6               12.6GB/s ± 0%   11.6GB/s ± 0%    -8.05% (p=0.000 n=16+19)
Memmove/256-6               16.3GB/s ± 0%   16.6GB/s ± 0%    +1.66% (p=0.000 n=20+18)
Memmove/512-6               21.5GB/s ± 0%   24.4GB/s ± 0%   +13.35% (p=0.000 n=18+17)
Memmove/1024-6              24.7GB/s ± 0%   33.7GB/s ± 0%   +36.12% (p=0.000 n=18+18)
Memmove/2048-6              27.3GB/s ± 0%   43.3GB/s ± 0%   +58.77% (p=0.000 n=19+17)
Memmove/4096-6              37.5GB/s ± 0%   50.5GB/s ± 0%   +34.56% (p=0.000 n=19+19)
MemmoveUnalignedDst/1-6      135MB/s ± 0%    146MB/s ± 0%    +7.69% (p=0.000 n=16+14)
MemmoveUnalignedDst/2-6      271MB/s ± 0%    292MB/s ± 0%    +7.69% (p=0.000 n=18+18)
MemmoveUnalignedDst/3-6      438MB/s ± 0%    438MB/s ± 0%      ~ (p=0.352 n=16+19)
MemmoveUnalignedDst/4-6      584MB/s ± 0%    584MB/s ± 0%      ~ (p=0.876 n=17+17)
MemmoveUnalignedDst/5-6      631MB/s ± 1%    632MB/s ± 0%    +0.25% (p=0.000 n=20+17)
MemmoveUnalignedDst/6-6      759MB/s ± 0%    759MB/s ± 0%    +0.00% (p=0.000 n=19+16)
MemmoveUnalignedDst/7-6      885MB/s ± 0%    883MB/s ± 1%      ~ (p=0.647 n=18+20)
MemmoveUnalignedDst/8-6     1.08GB/s ± 0%   1.08GB/s ± 0%    +0.00% (p=0.035 n=19+18)
MemmoveUnalignedDst/9-6     1.22GB/s ± 0%   1.22GB/s ± 0%      ~ (p=0.251 n=18+17)
MemmoveUnalignedDst/10-6    1.35GB/s ± 0%   1.35GB/s ± 0%      ~ (p=0.327 n=17+18)
MemmoveUnalignedDst/11-6    1.49GB/s ± 0%   1.49GB/s ± 0%      ~ (p=0.531 n=18+19)
MemmoveUnalignedDst/12-6    1.63GB/s ± 0%   1.63GB/s ± 0%      ~ (p=0.886 n=19+18)
MemmoveUnalignedDst/13-6    1.76GB/s ± 0%   1.76GB/s ± 1%    -0.24% (p=0.006 n=18+20)
MemmoveUnalignedDst/14-6    1.90GB/s ± 0%   1.90GB/s ± 0%      ~ (p=0.818 n=20+19)
MemmoveUnalignedDst/15-6    2.03GB/s ± 0%   2.03GB/s ± 0%      ~ (p=0.294 n=17+16)
MemmoveUnalignedDst/16-6    2.17GB/s ± 0%   2.17GB/s ± 0%      ~ (p=0.602 n=16+18)
MemmoveUnalignedDst/32-6    4.05GB/s ± 0%   4.05GB/s ± 0%    +0.00% (p=0.010 n=18+17)
MemmoveUnalignedDst/64-6    7.59GB/s ± 0%   7.59GB/s ± 0%    +0.00% (p=0.022 n=18+16)
MemmoveUnalignedDst/128-6   11.1GB/s ± 0%   11.4GB/s ± 0%    +2.79% (p=0.000 n=18+17)
MemmoveUnalignedDst/256-6   16.4GB/s ± 0%   16.7GB/s ± 0%    +1.59% (p=0.000 n=20+17)
MemmoveUnalignedDst/512-6   15.7GB/s ± 0%   21.3GB/s ± 0%   +35.87% (p=0.000 n=18+20)
MemmoveUnalignedDst/1024-6  16.0GB/s ±20%   31.5GB/s ± 0%   +96.93% (p=0.000 n=20+14)
MemmoveUnalignedDst/2048-6  19.6GB/s ± 0%   42.1GB/s ± 0%  +115.16% (p=0.000 n=17+18)
MemmoveUnalignedDst/4096-6  6.41GB/s ± 0%  33.18GB/s ± 0%  +417.56% (p=0.000 n=17+18)
MemmoveUnalignedSrc/1-6      171MB/s ± 0%    166MB/s ± 0%    -3.33% (p=0.000 n=19+16)
MemmoveUnalignedSrc/2-6      343MB/s ± 0%    342MB/s ± 1%    -0.41% (p=0.000 n=17+20)
MemmoveUnalignedSrc/3-6      508MB/s ± 0%    493MB/s ± 1%    -2.90% (p=0.000 n=17+17)
MemmoveUnalignedSrc/4-6      677MB/s ± 0%    660MB/s ± 2%    -2.55% (p=0.000 n=17+20)
MemmoveUnalignedSrc/5-6      790MB/s ± 0%    790MB/s ± 0%      ~ (p=0.139 n=17+17)
MemmoveUnalignedSrc/6-6      948MB/s ± 0%    946MB/s ± 1%      ~ (p=0.330 n=17+19)
MemmoveUnalignedSrc/7-6     1.11GB/s ± 0%   1.11GB/s ± 0%    -0.05% (p=0.026 n=17+17)
MemmoveUnalignedSrc/8-6     1.38GB/s ± 0%   1.38GB/s ± 0%      ~ (p=0.091 n=18+16)
MemmoveUnalignedSrc/9-6     1.42GB/s ± 0%   1.40GB/s ± 1%    -1.04% (p=0.000 n=19+20)
MemmoveUnalignedSrc/10-6    1.58GB/s ± 0%   1.56GB/s ± 1%    -1.15% (p=0.000 n=18+19)
MemmoveUnalignedSrc/11-6    1.73GB/s ± 0%   1.71GB/s ± 1%    -1.30% (p=0.000 n=20+20)
MemmoveUnalignedSrc/12-6    1.89GB/s ± 0%   1.87GB/s ± 1%    -1.18% (p=0.000 n=17+20)
MemmoveUnalignedSrc/13-6    2.05GB/s ± 0%   2.02GB/s ± 1%    -1.18% (p=0.000 n=17+20)
MemmoveUnalignedSrc/14-6    2.21GB/s ± 0%   2.18GB/s ± 1%    -1.14% (p=0.000 n=17+20)
MemmoveUnalignedSrc/15-6    2.36GB/s ± 0%   2.34GB/s ± 1%    -1.04% (p=0.000 n=17+20)
MemmoveUnalignedSrc/16-6    2.52GB/s ± 0%   2.49GB/s ± 1%    -1.26% (p=0.000 n=19+20)
MemmoveUnalignedSrc/32-6    4.82GB/s ± 0%   4.61GB/s ± 0%    -4.40% (p=0.000 n=19+20)
MemmoveUnalignedSrc/64-6    5.03GB/s ± 4%   7.97GB/s ± 0%   +58.55% (p=0.000 n=20+16)
MemmoveUnalignedSrc/128-6   11.1GB/s ± 0%   11.2GB/s ± 0%    +0.52% (p=0.000 n=17+18)
MemmoveUnalignedSrc/256-6   16.5GB/s ± 0%   16.4GB/s ± 0%    -0.10% (p=0.000 n=20+18)
MemmoveUnalignedSrc/512-6   21.0GB/s ± 0%   22.1GB/s ± 0%    +5.48% (p=0.000 n=14+17)
MemmoveUnalignedSrc/1024-6  24.9GB/s ± 0%   31.9GB/s ± 0%   +28.20% (p=0.000 n=19+20)
MemmoveUnalignedSrc/2048-6  23.3GB/s ± 0%   33.8GB/s ± 0%   +45.22% (p=0.000 n=17+19)
MemmoveUnalignedSrc/4096-6  37.3GB/s ± 0%   42.7GB/s ± 0%   +14.30% (p=0.000 n=17+17)

Change-Id: Iab488d93a293cdf573ab5cd89b95a818bbb5d531
Reviewed-on: https://go-review.googlesource.com/22515
Run-TryBot: Denis Nagorny <denis.nagorny@intel.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
2016-08-31 16:03:30 +00:00
Marcel van Lohuizen
095fbdcc91 runtime: use of Run for some benchmarks
Names of sub-benchmarks are preserved, short of the additional slash.

Change-Id: I9b3f82964f9a44b0d28724413320afd091ed3106
Reviewed-on: https://go-review.googlesource.com/23425
Reviewed-by: Russ Cox <rsc@golang.org>
Run-TryBot: Marcel van Lohuizen <mpvl@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2016-05-25 16:49:02 +00:00
Keith Randall
6a33f7765f runtime: use MOVSB instead of MOVSQ for unaligned moves
MOVSB is quite a bit faster for unaligned moves.
Possibly we should use MOVSB all of the time, but Intel folks
say it might be a bit faster to use MOVSQ on some processors
(but not any I have access to at the moment).

benchmark                              old ns/op     new ns/op     delta
BenchmarkMemmove4096-8                 93.9          93.2          -0.75%
BenchmarkMemmoveUnalignedDst4096-8     256           151           -41.02%
BenchmarkMemmoveUnalignedSrc4096-8     175           90.5          -48.29%

Fixes #14630

Change-Id: I568e6d6590eb3615e6a699fb474020596be665ff
Reviewed-on: https://go-review.googlesource.com/20293
Reviewed-by: Ian Lance Taylor <iant@golang.org>
2016-03-21 19:10:24 +00:00
Brad Fitzpatrick
519474451a all: make copyright headers consistent with one space after period
This is a subset of https://golang.org/cl/20022 with only the copyright
header lines, so the next CL will be smaller and more reviewable.

Go policy has been single space after periods in comments for some time.

The copyright header template at:

    https://golang.org/doc/contribute.html#copyright

also uses a single space.

Make them all consistent.

Change-Id: Icc26c6b8495c3820da6b171ca96a74701b4a01b0
Reviewed-on: https://go-review.googlesource.com/20111
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
Reviewed-by: Matthew Dempsky <mdempsky@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2016-03-01 23:34:33 +00:00
Ilya Tocar
b597e1ed54 runtime: speed up memclr with avx2 on amd64
Results are a bit noisy, but show good improvement (haswell)

name            old time/op    new time/op     delta
Memclr5-48        6.06ns ± 8%     5.65ns ± 8%    -6.81%  (p=0.000 n=20+20)
Memclr16-48       5.75ns ± 6%     5.71ns ± 6%      ~     (p=0.545 n=20+19)
Memclr64-48       6.54ns ± 5%     6.14ns ± 9%    -6.12%  (p=0.000 n=18+20)
Memclr256-48      10.1ns ±12%      9.9ns ±14%      ~     (p=0.285 n=20+19)
Memclr4096-48      104ns ± 8%       57ns ±15%   -44.98%  (p=0.000 n=20+20)
Memclr65536-48    2.45µs ± 5%     2.43µs ± 8%      ~     (p=0.665 n=16+20)
Memclr1M-48       58.7µs ±13%     56.4µs ±11%    -3.92%  (p=0.033 n=20+19)
Memclr4M-48        233µs ± 9%      234µs ± 9%      ~     (p=0.728 n=20+19)
Memclr8M-48        469µs ±11%      472µs ±16%      ~     (p=0.947 n=20+20)
Memclr16M-48       947µs ±10%      916µs ±10%      ~     (p=0.050 n=20+19)
Memclr64M-48      10.9ms ±10%      4.5ms ± 9%   -58.43%  (p=0.000 n=20+20)
GoMemclr5-48      3.80ns ±13%     3.38ns ± 6%   -11.02%  (p=0.000 n=20+20)
GoMemclr16-48     3.34ns ±15%     3.40ns ± 9%      ~     (p=0.351 n=20+20)
GoMemclr64-48     4.10ns ±15%     4.04ns ±10%      ~     (p=1.000 n=20+19)
GoMemclr256-48    7.75ns ±20%     7.88ns ± 9%      ~     (p=0.227 n=20+19)

name            old speed      new speed       delta
Memclr5-48       826MB/s ± 7%    886MB/s ± 8%    +7.32%  (p=0.000 n=20+20)
Memclr16-48     2.78GB/s ± 5%   2.81GB/s ± 6%      ~     (p=0.550 n=20+19)
Memclr64-48     9.79GB/s ± 5%  10.44GB/s ±10%    +6.64%  (p=0.000 n=18+20)
Memclr256-48    25.4GB/s ±14%   25.6GB/s ±12%      ~     (p=0.647 n=20+19)
Memclr4096-48   39.4GB/s ± 8%   72.0GB/s ±13%   +82.81%  (p=0.000 n=20+20)
Memclr65536-48  26.6GB/s ± 6%   27.0GB/s ± 9%      ~     (p=0.517 n=17+20)
Memclr1M-48     17.9GB/s ±12%   18.5GB/s ±11%      ~     (p=0.068 n=20+20)
Memclr4M-48     18.0GB/s ± 9%   17.8GB/s ±14%      ~     (p=0.547 n=20+20)
Memclr8M-48     17.9GB/s ±10%   17.8GB/s ±14%      ~     (p=0.947 n=20+20)
Memclr16M-48    17.8GB/s ± 9%   18.4GB/s ± 9%      ~     (p=0.050 n=20+19)
Memclr64M-48    6.19GB/s ±10%  14.87GB/s ± 9%  +140.11%  (p=0.000 n=20+20)
GoMemclr5-48    1.31GB/s ±10%   1.48GB/s ± 6%   +13.06%  (p=0.000 n=19+20)
GoMemclr16-48   4.81GB/s ±14%   4.71GB/s ± 8%      ~     (p=0.341 n=20+20)
GoMemclr64-48   15.7GB/s ±13%   15.8GB/s ±11%      ~     (p=0.967 n=20+19)

Change-Id: I393f3f20e2f31538d1b1dd70d6e5c201c106a095
Reviewed-on: https://go-review.googlesource.com/16773
Run-TryBot: Ilya Tocar <ilya.tocar@intel.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Klaus Post <klauspost@gmail.com>
Reviewed-by: Keith Randall <khr@golang.org>
2015-11-24 16:49:30 +00:00
Michael Hudson-Doyle
168a51b3a1 runtime: adjust the arm64 memmove and memclr to operate by word as much as they can
Not only is this an obvious optimization:

benchmark                           old MB/s     new MB/s     speedup
BenchmarkMemmove1-4                 35.35        29.65        0.84x
BenchmarkMemmove2-4                 63.78        52.53        0.82x
BenchmarkMemmove3-4                 89.72        73.96        0.82x
BenchmarkMemmove4-4                 109.94       95.73        0.87x
BenchmarkMemmove5-4                 127.60       112.80       0.88x
BenchmarkMemmove6-4                 143.59       126.67       0.88x
BenchmarkMemmove7-4                 157.90       138.92       0.88x
BenchmarkMemmove8-4                 167.18       231.81       1.39x
BenchmarkMemmove9-4                 175.23       252.07       1.44x
BenchmarkMemmove10-4                165.68       261.10       1.58x
BenchmarkMemmove11-4                174.43       263.31       1.51x
BenchmarkMemmove12-4                180.76       267.56       1.48x
BenchmarkMemmove13-4                189.06       284.93       1.51x
BenchmarkMemmove14-4                186.31       284.72       1.53x
BenchmarkMemmove15-4                195.75       281.62       1.44x
BenchmarkMemmove16-4                202.96       439.23       2.16x
BenchmarkMemmove32-4                264.77       775.77       2.93x
BenchmarkMemmove64-4                306.81       1209.64      3.94x
BenchmarkMemmove128-4               357.03       1515.41      4.24x
BenchmarkMemmove256-4               380.77       2066.01      5.43x
BenchmarkMemmove512-4               385.05       2556.45      6.64x
BenchmarkMemmove1024-4              381.23       2804.10      7.36x
BenchmarkMemmove2048-4              379.06       2814.83      7.43x
BenchmarkMemmove4096-4              387.43       3064.96      7.91x
BenchmarkMemmoveUnaligned1-4        28.91        25.40        0.88x
BenchmarkMemmoveUnaligned2-4        56.13        47.56        0.85x
BenchmarkMemmoveUnaligned3-4        74.32        69.31        0.93x
BenchmarkMemmoveUnaligned4-4        97.02        83.58        0.86x
BenchmarkMemmoveUnaligned5-4        110.17       103.62       0.94x
BenchmarkMemmoveUnaligned6-4        124.95       113.26       0.91x
BenchmarkMemmoveUnaligned7-4        142.37       130.82       0.92x
BenchmarkMemmoveUnaligned8-4        151.20       205.64       1.36x
BenchmarkMemmoveUnaligned9-4        166.97       215.42       1.29x
BenchmarkMemmoveUnaligned10-4       148.49       221.22       1.49x
BenchmarkMemmoveUnaligned11-4       159.47       239.57       1.50x
BenchmarkMemmoveUnaligned12-4       163.52       247.32       1.51x
BenchmarkMemmoveUnaligned13-4       167.55       256.54       1.53x
BenchmarkMemmoveUnaligned14-4       175.12       251.03       1.43x
BenchmarkMemmoveUnaligned15-4       192.10       267.13       1.39x
BenchmarkMemmoveUnaligned16-4       190.76       378.87       1.99x
BenchmarkMemmoveUnaligned32-4       259.02       562.98       2.17x
BenchmarkMemmoveUnaligned64-4       317.72       842.44       2.65x
BenchmarkMemmoveUnaligned128-4      355.43       1274.49      3.59x
BenchmarkMemmoveUnaligned256-4      378.17       1815.74      4.80x
BenchmarkMemmoveUnaligned512-4      362.15       2180.81      6.02x
BenchmarkMemmoveUnaligned1024-4     376.07       2453.58      6.52x
BenchmarkMemmoveUnaligned2048-4     381.66       2568.32      6.73x
BenchmarkMemmoveUnaligned4096-4     398.51       2669.36      6.70x
BenchmarkMemclr5-4                  113.83       107.93       0.95x
BenchmarkMemclr16-4                 223.84       389.63       1.74x
BenchmarkMemclr64-4                 421.99       1209.58      2.87x
BenchmarkMemclr256-4                525.94       2411.58      4.59x
BenchmarkMemclr4096-4               581.66       4372.20      7.52x
BenchmarkMemclr65536-4              565.84       4747.48      8.39x
BenchmarkGoMemclr5-4                194.63       160.31       0.82x
BenchmarkGoMemclr16-4               295.30       630.07       2.13x
BenchmarkGoMemclr64-4               480.24       1884.03      3.92x
BenchmarkGoMemclr256-4              540.23       2926.49      5.42x

but it turns out that it's necessary to avoid the GC seeing partially written
pointers.

It's of course possible to be more sophisticated (using ldp/stp to move 16
bytes at a time in the core loop and unrolling the tail copying loops being
the obvious ideas) but I wanted something simple and (reasonably) obviously
correct.

Fixes #12552

Change-Id: Iaeaf8a812cd06f4747ba2f792de1ded738890735
Reviewed-on: https://go-review.googlesource.com/14813
Reviewed-by: Austin Clements <austin@google.com>
2015-10-08 07:49:35 +00:00
Josh Bleecher Snyder
7e0c11c32f cmd/6g, runtime: improve duffzero throughput
It is faster to execute

	MOVQ AX,(DI)
	MOVQ AX,8(DI)
	MOVQ AX,16(DI)
	MOVQ AX,24(DI)
	ADDQ $32,DI

than

	STOSQ
	STOSQ
	STOSQ
	STOSQ

However, in order to be able to jump into
the middle of a block of MOVQs, the call
site needs to pre-adjust DI.

If we're clearing a small area, the cost
of that DI pre-adjustment isn't repaid.

This CL switches the DUFFZERO implementation
to use a hybrid strategy, in which small
clears use STOSQ as before, but large clears
use mostly MOVQ/ADDQ blocks.

benchmark                 old ns/op     new ns/op     delta
BenchmarkClearFat8        0.55          0.55          +0.00%
BenchmarkClearFat12       0.82          0.83          +1.22%
BenchmarkClearFat16       0.55          0.55          +0.00%
BenchmarkClearFat24       0.82          0.82          +0.00%
BenchmarkClearFat32       2.20          1.94          -11.82%
BenchmarkClearFat40       1.92          1.66          -13.54%
BenchmarkClearFat48       2.21          1.93          -12.67%
BenchmarkClearFat56       3.03          2.20          -27.39%
BenchmarkClearFat64       3.26          2.48          -23.93%
BenchmarkClearFat72       3.57          2.76          -22.69%
BenchmarkClearFat80       3.83          3.05          -20.37%
BenchmarkClearFat88       4.14          3.30          -20.29%
BenchmarkClearFat128      5.54          4.69          -15.34%
BenchmarkClearFat256      9.95          9.09          -8.64%
BenchmarkClearFat512      18.7          17.9          -4.28%
BenchmarkClearFat1024     36.2          35.4          -2.21%

Change-Id: Ic786406d9b3cab68d5a231688f9e66fcd1bd7103
Reviewed-on: https://go-review.googlesource.com/2585
Reviewed-by: Keith Randall <khr@golang.org>
2015-04-15 19:17:07 +00:00
Josh Bleecher Snyder
f03c9202c4 cmd/gc: optimize memclr of slices and arrays
Recognize loops of the form

for i := range a {
	a[i] = zero
}

in which the evaluation of a is free from side effects.
Replace these loops with calls to memclr.
This occurs in the stdlib in 18 places.

The motivating example is clearing a byte slice:

benchmark                old ns/op     new ns/op     delta
BenchmarkGoMemclr5       3.31          3.26          -1.51%
BenchmarkGoMemclr16      13.7          3.28          -76.06%
BenchmarkGoMemclr64      50.8          4.14          -91.85%
BenchmarkGoMemclr256     157           6.02          -96.17%

Update #5373.

Change-Id: I99d3e6f5f268e8c6499b7e661df46403e5eb83e4
Reviewed-on: https://go-review.googlesource.com/2520
Reviewed-by: Keith Randall <khr@golang.org>
2015-01-09 22:35:25 +00:00
Russ Cox
c007ce824d build: move package sources from src/pkg to src
Preparation was in CL 134570043.
This CL contains only the effect of 'hg mv src/pkg/* src'.
For more about the move, see golang.org/s/go14nopkg.
2014-09-08 00:08:51 -04:00