Major reorganization of the crc32 code:
- The arch-specific files now implement a well-defined interface
(documented in crc32.go). They no longer have the responsibility of
initializing and falling back to a non-accelerated implementation;
instead, that happens in the higher level code.
- The non-accelerated algorithms are moved to a separate file with no
dependencies on other code.
- The "cutoff" optimization for slicing-by-8 is moved inside the
algorithm itself (as opposed to every callsite).
Tests are significantly improved:
- direct tests for the non-accelerated algorithms.
- "cross-check" tests for arch-specific implementations (all archs).
- tests for misaligned buffers for both IEEE and Castagnoli.
Fixes#16909.
Change-Id: I9b6dd83b7a57cd615eae901c0a6d61c6b8091c74
Reviewed-on: https://go-review.googlesource.com/27935
Reviewed-by: Keith Randall <khr@golang.org>
When SSE is available, we don't need the Table. However, it is
returned as a handle by MakeTable. Fix this to always generate
the table.
Further cleanup is discussed in #16909.
Change-Id: Ic05400d68c6b5d25073ebd962000451746137afc
Reviewed-on: https://go-review.googlesource.com/27934
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
This reverts commit 54d7de7dd6.
It was breaking non-amd64 builds.
Change-Id: I22650e922498eeeba3d4fa08bb4ea40a210c8f97
Reviewed-on: https://go-review.googlesource.com/27925
Reviewed-by: Keith Randall <khr@golang.org>
The code wasn't checking to see if the data was still >= 64 bytes
long after aligning it.
Aligning the data is an optimization and we don't actually need
to do it. In fact for smaller sizes it slows things down due to
the overhead of calling the generic function. Therefore for now
I have simply removed the alignment stage. I have also added a
check into the assembly to deliberately trigger a segmentation
fault if the data is too short.
Fixes#16779.
Change-Id: Ic01636d775efc5ec97689f050991cee04ce8fe73
Reviewed-on: https://go-review.googlesource.com/27409
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
This commit improves the processing of the final few bytes in
castagnoliSSE42: instead of processing one byte at a time, we use all
versions of the CRC32 instruction to process 4 bytes, then 2, then 1.
The difference is only noticeable for small "odd" sized buffers.
We do the similar improvement for processing the first few bytes in
the case of unaligned buffer.
Fixing the test which was not actually verifying the results for
misaligned buffers (WriteString was creating an internal copy which
was aligned).
Adding benchmarks for length 15 (aligned and misaligned), results
below.
name old time/op new time/op delta
CastagnoliCrc15B-4 25.1ns ± 0% 22.1ns ± 1% -12.14%
CastagnoliCrc15BMisaligned-4 25.2ns ± 0% 22.9ns ± 1% -9.03%
CastagnoliCrc40B-4 23.1ns ± 0% 23.4ns ± 0% +1.08%
CastagnoliCrc1KB-4 127ns ± 0% 128ns ± 0% +1.18%
CastagnoliCrc4KB-4 462ns ± 0% 464ns ± 0% ~
CastagnoliCrc32KB-4 3.58µs ± 0% 3.60µs ± 0% +0.58%
name old speed new speed delta
CastagnoliCrc15B-4 597MB/s ± 0% 679MB/s ± 1% +13.77%
CastagnoliCrc15BMisaligned-4 596MB/s ± 0% 655MB/s ± 1% +9.94%
CastagnoliCrc40B-4 1.73GB/s ± 0% 1.71GB/s ± 0% -1.14%
CastagnoliCrc1KB-4 8.01GB/s ± 0% 7.93GB/s ± 1% -1.06%
CastagnoliCrc4KB-4 8.86GB/s ± 0% 8.83GB/s ± 0% ~
CastagnoliCrc32KB-4 9.14GB/s ± 0% 9.09GB/s ± 0% -0.58%
Change-Id: I499e37af2241d28e3e5d522bbab836c1a718430a
Reviewed-on: https://go-review.googlesource.com/24470
Reviewed-by: Keith Randall <khr@golang.org>
The input buffer is aligned to a doubleword boundary to
improve performance of the vector instructions. The pure
Go implementation is used to align the input data, and is
also used when the vector instructions are not available
or the data length is less than 64 bytes.
Change-Id: Ie259a5f2f1562bcc17961c99e5776c99091d6bed
Reviewed-on: https://go-review.googlesource.com/22201
Reviewed-by: Michael Munday <munday@ca.ibm.com>
Run-TryBot: Michael Munday <munday@ca.ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Bill O'Farrell <billotosyr@gmail.com>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
It seems cleaner and more consistent with other files to list the
architectures that have assembly implementations rather than to
list those that do not.
This means we don't have to add s390x and future platforms to this
list.
Change-Id: I2ad3f66b76eb1711333c910236ca7f5151b698e5
Reviewed-on: https://go-review.googlesource.com/21770
Reviewed-by: Bill O'Farrell <billotosyr@gmail.com>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Currently we test crc64 only with ISO polynomial.
Change-Id: Ibc5e202db3b960369cbbb18e31eb0fea07b54dba
Reviewed-on: https://go-review.googlesource.com/21309
Run-TryBot: Ilya Tocar <ilya.tocar@intel.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
This adds "slicing by 8" optimization to Castagnoli tables which will
speed up CRC32 calculation on systems without asssembler,
which are all but AMD64.
In my tests, it is faster to use "slicing by 8" for sizes all down to
16 bytes, so the switchover point has been adjusted.
There are no benchmarks for small sizes, so I have added one for 40 bytes,
as well as one for bigger sizes (32KB).
Castagnoli, No assembler, 40 Byte payload: (before, after)
BenchmarkCastagnoli40B-4 10000000 161 ns/op 246.94 MB/s
BenchmarkCastagnoli40B-4 20000000 100 ns/op 398.01 MB/s
Castagnoli, No assembler, 32KB payload: (before, after)
BenchmarkCastagnoli32KB-4 10000 115426 ns/op 283.89 MB/s
BenchmarkCastagnoli32KB-4 30000 45171 ns/op 725.41 MB/s
IEEE, No assembler, 1KB payload: (before, after)
BenchmarkCrc1KB-4 500000 3604 ns/op 284.10 MB/s
BenchmarkCrc1KB-4 1000000 1463 ns/op 699.79 MB/s
Compared:
benchmark old ns/op new ns/op delta
BenchmarkCastagnoli40B-4 161 100 -37.89%
BenchmarkCastagnoli32KB-4 115426 45171 -60.87%
BenchmarkCrc1KB-4 3604 1463 -59.41%
benchmark old MB/s new MB/s speedup
BenchmarkCastagnoli40B-4 246.94 398.01 1.61x
BenchmarkCastagnoli32KB-4 283.89 725.41 2.56x
BenchmarkCrc1KB-4 284.10 699.79 2.46x
Change-Id: I303e4ec84e8d4dafd057d64c0e43deb2b498e968
Reviewed-on: https://go-review.googlesource.com/19335
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Add several instructions that were used via BYTE and use them.
Instructions added: PEXTRB, PEXTRD, PEXTRQ, PINSRB, XGETBV, POPCNT.
Change-Id: I5a80cd390dc01f3555dbbe856a475f74b5e6df65
Reviewed-on: https://go-review.googlesource.com/18593
Run-TryBot: Ilya Tocar <ilya.tocar@intel.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Russ Cox <rsc@golang.org>
CRC-32 computation is stateless and the p slice does not get stored
anywhere. Thus, we mark the assembly functions as noescape so that
it doesn't believe that p leaks in:
func Update(crc uint32, tab *Table, p []byte) uint32
Before:
./crc32.go:153: leaking param: p
After:
./crc32.go:153: Update p does not escape
Change-Id: I52ba35b6cc544fff724327140e0c27898431d1dc
Reviewed-on: https://go-review.googlesource.com/17069
Reviewed-by: Russ Cox <rsc@golang.org>
iEEETable violates the Go naming conventions and is inconsistent
with the rest of the package. Use ieeeTable instead.
Change-Id: I04b201aa39759d159de2b0295f43da80488c2263
Reviewed-on: https://go-review.googlesource.com/17068
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
Change-Id: I77c6768fff6f0163b36800307c4d573bb6521fe5
Reviewed-on: https://go-review.googlesource.com/14454
Reviewed-by: Minux Ma <minux@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
IEEE is the most commonly used CRC-32 polynomial, used by zip, gzip and others.
Based on http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/fast-crc-computation-generic-polynomials-pclmulqdq-paper.pdf
benchmark old ns/op new ns/op delta
BenchmarkIEEECrc1KB-8 3193 352 -88.98%
BenchmarkIEEECrc4KB-8 5025 1307 -73.99%
BenchmarkCastagnoliCrc1KB-8 126 126 +0.00%
benchmark old MB/s new MB/s speedup
BenchmarkIEEECrc1KB-8 320.68 2901.92 9.05x
BenchmarkIEEECrc4KB-8 815.08 3131.80 3.84x
BenchmarkCastagnoliCrc1KB-8 8100.80 8109.78 1.00x
Change-Id: I99c9a48365f631827f516e44f97e86155f03cb90
Reviewed-on: https://go-review.googlesource.com/14080
Reviewed-by: Keith Randall <khr@golang.org>
Explicitly say that *Table returned by MakeTable may not be
modified. Otherwise, this leads to very subtle bugs that may
or may not manifest themselves.
Same comment was made on package crc64, to keep the future
open to the caching tables that crc32 effectively does.
Fixes: #12487.
Change-Id: I2881bebb8b16f6f8564412172774c79c2593c6c1
Reviewed-on: https://go-review.googlesource.com/14258
Reviewed-by: Ian Lance Taylor <iant@golang.org>
The URL is shown on go docs and is an eye-sore.
For go1.6.
Change-Id: I8b8ea3751200d06ed36acfe22f47ebb38107f8db
Reviewed-on: https://go-review.googlesource.com/13282
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
The Slicing-By-8 [1] algorithm has much performance improvements than
current approach. This patch only uses it for IEEE, which is the most
common case in practice.
There is the benchmark on Mac OS X 10.9:
benchmark old MB/s new MB/s speedup
BenchmarkIEEECrc1KB 349.40 353.03 1.01x
BenchmarkIEEECrc4KB 351.55 934.35 2.66x
BenchmarkCastagnoliCrc1KB 7037.58 7392.63 1.05x
This algorithm need 8K lookup table, so it's enabled only for block
larger than 4K.
We can see about 2.6x improvement for IEEE.
Change-Id: I7f786d20f0949245e4aa101d7921669f496ed0f7
Reviewed-on: https://go-review.googlesource.com/1863
Reviewed-by: Russ Cox <rsc@golang.org>
Explicitly specify that we represent polynomial in reversed notation
Fixes#8229
Change-Id: Idf094c01fd82f133cd0c1b50fa967d12c577bdb5
Reviewed-on: https://go-review.googlesource.com/9237
Reviewed-by: David Chase <drchase@google.com>
This also removes pkg/runtime/traceback_lr.c, which was ported
to Go in an earlier commit and then moved to
runtime/traceback.go.
Reviewer: rsc@golang.org
rsc: LGTM