1
0
mirror of https://github.com/golang/go synced 2024-11-20 00:24:43 -07:00
Commit Graph

66 Commits

Author SHA1 Message Date
Austin Clements
608c1b0d56 runtime: scan objects with finalizers concurrently
This reduces pause time by ~25% relative to tip and by ~50% relative
to Go 1.5.1.

Currently one of the steps of STW mark termination is to loop (in
parallel) over all spans to find objects with finalizers in order to
mark all objects reachable from these objects and to treat the
finalizer special as a root. Unfortunately, even if there are no
finalizers at all, this loop takes roughly 1 ms/heap GB/core, so
multi-gigabyte heaps can quickly push our STW time past 10ms.

Fix this by moving this scan from mark termination to concurrent scan,
where it can run in parallel with mutators. The loop itself could also
be optimized, but this cost is small compared to concurrent marking.

Making this scan concurrent introduces two complications:

1) The scan currently walks the specials list of each span without
locking it, which is safe only with the world stopped. We fix this by
speculatively checking if a span has any specials (the vast majority
won't) and then locking the specials list only if there are specials
to check.

2) An object can have a finalizer set after concurrent scan, in which
case it won't have been marked appropriately by concurrent scan. If
the finalizer is a closure and is only reachable from the special, it
could be swept before it is run. Likewise, if the object is not marked
yet when the finalizer is set and then becomes unreachable before it
is marked, other objects reachable only from it may be swept before
the finalizer function is run. We fix this issue by making
addfinalizer ensure the same marking invariants as markroot does.

For multi-gigabyte heaps, this reduces max pause time by 20%–30%
relative to tip (depending on GOMAXPROCS) and by ~50% relative to Go
1.5.1 (where this loop was neither concurrent nor parallel). Here are
the results for the garbage benchmark:

               ---------------- max pause ----------------
Heap   Procs   Concurrent scan   STW parallel scan   1.5.1
24GB     12         18ms              23ms            37ms
24GB      4         18ms              25ms            37ms
 4GB      4         3.8ms            4.9ms           6.9ms

In all cases, 95%ile pause time is similar to the max pause time. This
also improves mean STW time by 10%–30%.

Fixes #11485.

Change-Id: I9359d8c3d120a51d23d924b52bf853a1299b1dfd
Reviewed-on: https://go-review.googlesource.com/14982
Reviewed-by: Rick Hudson <rlh@golang.org>
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2015-10-02 19:55:48 +00:00
Austin Clements
97b64d88eb runtime: avoid debug prints of huge objects
Currently when the GC prints an object for debugging (e.g., for a
failed invalidptr or checkmark check), it dumps the entire object. To
avoid inundating the user with output for really large objects, limit
this to printing just the first 128 words (which are most likely to be
useful in identifying the type of an object) and the 32 words around
the problematic field.

Change-Id: Id94a5c9d8162f8bd9b2a63bf0b1bfb0adde83c68
Reviewed-on: https://go-review.googlesource.com/14764
Reviewed-by: Rick Hudson <rlh@golang.org>
2015-09-18 22:23:18 +00:00
Austin Clements
b7c55ba496 runtime: improve invalid pointer error message
By default, the runtime panics if it detects a pointer to an
unallocated span. At this point, this usually catches bad uses of
unsafe or cgo in user code (though it could also catch runtime bugs).
Unfortunately, the rather cryptic error misleads users, offers users
little help with debugging their own problem, and offers the Go
developers little help with root-causing.

Improve the error message in various ways. First, the wording is
improved to make it clearer what condition was detected and to suggest
that this may be the result of incorrect use of unsafe or cgo. Second,
we add a dump of the object containing the bad pointer so that there's
at least some hope of figuring out why a bad pointer was stored in the
Go heap.

Change-Id: I57b91b12bc3cb04476399d7706679e096ce594b9
Reviewed-on: https://go-review.googlesource.com/14763
Reviewed-by: Rick Hudson <rlh@golang.org>
2015-09-18 22:23:11 +00:00
Austin Clements
4ac4085f8e runtime: minor clarifications of markroot
This puts the _Root* indexes in a more friendly order and tweaks
markrootSpans to use a for-range loop instead of its own indexing.

Change-Id: I2c18d55c9a673ea396b6424d51ef4997a1a74825
Reviewed-on: https://go-review.googlesource.com/14548
Reviewed-by: Rick Hudson <rlh@golang.org>
2015-09-14 19:37:44 +00:00
Austin Clements
572f08a064 runtime: split marking of span roots into 128 subtasks
Marking of span roots can represent a significant fraction of the time
spent in mark termination. Simply traversing the span list takes about
1ms per GB of heap and if there are a large number of finalizers (for
example, for network connections), it may take much longer.

Improve the situation by splitting the span scan into 128 subtasks
that can be executed in parallel and load balanced by the markroots
parallel for. This lets the GC balance this job across the Ps.

A better solution is to do this during concurrent mark, or to improve
it algorithmically, but this is a simple change with a lot of bang for
the buck.

This was suggested by Rhys Hiltner.

Updates #11485.

Change-Id: I8b281adf0ba827064e154a1b6cc32d4d8031c03c
Reviewed-on: https://go-review.googlesource.com/13112
Reviewed-by: Keith Randall <khr@golang.org>
2015-09-14 18:15:40 +00:00
Austin Clements
92e68c89f6 runtime: move stack barrier code to its own file
Currently the stack barrier code is mixed in with the mark and scan
code. Move all of the stack barrier related functions and variables to
a new dedicated source file. There are no code modifications.

Change-Id: I604603045465ef8573b9f88915d28ab6b5910903
Reviewed-on: https://go-review.googlesource.com/14050
Reviewed-by: Rick Hudson <rlh@golang.org>
2015-09-09 01:18:50 +00:00
Austin Clements
3bfc9df21a runtime: add GODEBUG for stack barriers at every frame
Currently enabling the debugging mode where stack barriers are
installed at every frame requires recompiling the runtime. However,
this is potentially useful for field debugging and for runtime tests,
so make this mode a GODEBUG.

Updates #12238.

Change-Id: I6fb128f598b19568ae723a612e099c0ed96917f5
Reviewed-on: https://go-review.googlesource.com/13947
Reviewed-by: Russ Cox <rsc@golang.org>
2015-08-30 16:06:55 +00:00
Austin Clements
e2bb03f175 runtime: don't install a stack barrier in cgocallback_gofunc's frame
Currently the runtime can install stack barriers in any frame.
However, the frame of cgocallback_gofunc is special: it's the one
function that switches from a regular G stack to the system stack on
return. Hence, the return PC slot in its frame on the G stack is
actually used to save getg().sched.pc (so tracebacks appear to unwind
to the last Go function running on that G), and not as an actual
return PC for cgocallback_gofunc.

Because of this, if we install a stack barrier in cgocallback_gofunc's
return PC slot, when cgocallback_gofunc does return, it will move the
stack barrier stub PC in to getg().sched.pc and switch back to the
system stack. The rest of the runtime doesn't know how to deal with a
stack barrier stub in sched.pc: nothing knows how to match it up with
the G's stack barrier array and, when the runtime removes stack
barriers, it doesn't know to undo the one in sched.pc. Hence, if the C
code later returns back in to Go code, it will attempt to return
through the stack barrier saved in sched.pc, which may no longer have
correct unwinding information.

Fix this by blacklisting cgocallback_gofunc's frame so the runtime
won't install a stack barrier in it's return PC slot.

Fixes #12238.

Change-Id: I46aa2155df2fd050dd50de3434b62987dc4947b8
Reviewed-on: https://go-review.googlesource.com/13944
Reviewed-by: Russ Cox <rsc@golang.org>
2015-08-30 16:06:47 +00:00
Austin Clements
510fd1350d runtime: enable GC assists ASAP
Currently the GC coordinator enables GC assists at the same time it
enables background mark workers, after the concurrent scan phase is
done. However, this means a rapidly allocating mutator has the entire
scan phase during which to allocate beyond the heap trigger and
potentially beyond the heap goal with no back-pressure from assists.
This prevents the feedback system that's supposed to keep the heap
size under the heap goal from doing its job.

Fix this by enabling mutator assists during the scan phase. This is
safe because the write barrier is already enabled and globally
acknowledged at this point.

There's still a very small window between when the heap size reaches
the heap trigger and when the GC coordinator is able to stop the world
during which the mutator can allocate unabated. This allows *very*
rapidly allocator mutators like TestTraceStress to still occasionally
exceed the heap goal by a small amount (~20 MB at most for
TestTraceStress). However, this seems like a corner case.

Fixes #11677.

Change-Id: I0f80d949ec82341cd31ca1604a626efb7295a819
Reviewed-on: https://go-review.googlesource.com/12674
Reviewed-by: Russ Cox <rsc@golang.org>
2015-07-27 19:59:05 +00:00
Austin Clements
f5e67e53e7 runtime: allow GC drain whenever write barrier is enabled
Currently we hand-code a set of phases when draining is allowed.
However, this set of phases is conservative. The critical invariant is
simply that the write barrier must be enabled if we're draining.

Shortly we're going to enable mutator assists during the scan phase,
which means we may drain during the scan phase. In preparation, this
commit generalizes these assertions to check the fundamental condition
that the write barrier is enabled, rather than checking that we're in
any particular phase.

Change-Id: I0e1bec1ca823d4a697a0831ec4c50f5dd3f2a893
Reviewed-on: https://go-review.googlesource.com/12673
Reviewed-by: Russ Cox <rsc@golang.org>
2015-07-27 19:59:04 +00:00
Austin Clements
8f34b25318 runtime: retry GC assist until debt is paid off
Currently, there are three ways to satisfy a GC assist: 1) the mutator
steals credit from background GC, 2) the mutator actually does GC
work, and 3) there is no more work available. 3 was never really
intended as a way to satisfy an assist, and it causes problems: there
are periods when it's expected that the GC won't have any work, such
as when transitioning from mark 1 to mark 2 and from mark 2 to mark
termination. During these periods, there's no back-pressure on rapidly
allocating mutators, which lets them race ahead of the heap goal.

For example, test/init1.go and the runtime/trace test both have small
reachable heaps and contain loops that rapidly allocate large garbage
byte slices. This bug lets these tests exceed the heap goal by several
orders of magnitude.

Fix this by forcing the assist (and hence the allocation) to block
until it can satisfy its debt via either 1 or 2, or the GC cycle
terminates.

This fixes one the causes of #11677. It's still possible to overshoot
the GC heap goal, but with this change the overshoot is almost exactly
by the amount of allocation that happens during the concurrent scan
phase, between when the heap passes the GC trigger and when the GC
enables assists.

Change-Id: I5ef4edcb0d2e13a1e432e66e8245f2bd9f8995be
Reviewed-on: https://go-review.googlesource.com/12671
Reviewed-by: Russ Cox <rsc@golang.org>
2015-07-27 19:59:01 +00:00
Austin Clements
500c88d40d runtime: yield to GC coordinator after assist completion
Currently it's possible for the GC assist to signal completion of the
mark phase, which puts the GC coordinator goroutine on the current P's
run queue, and then return to mutator code that delays until the next
forced preemption before actually yielding control to the GC
coordinator, dragging out completion of the mark phase. This delay can
be further exacerbated if the mutator makes other goroutines runnable
before yielding control, since this will push the GC coordinator on
the back of the P's run queue.

To fix this, this adds a Gosched to the assist if it completed the
mark phase. This immediately and directly yields control to the GC
coordinator. This already happens implicitly in the background mark
workers because they park immediately after completing the mark.

This is one of the reasons completion of the mark phase is being
dragged out and allowing the mutator to allocate without assisting,
leading to the large heap goal overshoot in issue #11677. This is also
a prerequisite to making the assist block when it can't pay off its
debt.

Change-Id: I586adfbecb3ca042a37966752c1dc757f5c7fc78
Reviewed-on: https://go-review.googlesource.com/12670
Reviewed-by: Russ Cox <rsc@golang.org>
2015-07-27 19:59:00 +00:00
Austin Clements
4f188c2d1c runtime: disallow GC assists in non-preemptible contexts
Currently it's possible to perform GC work on a system stack or when
locks are held if there's an allocation that triggers an assist. This
is generally a bad idea because of the fragility of these contexts,
and it's incompatible with two changes we're about to make: one is to
yield after signaling mark completion (which we can't do from a
non-preemptible context) and the other is to make assists block if
there's no other way for them to pay off the assist debt.

This commit simply skips the assist if it's called from a
non-preemptible context. The allocation will still count toward the
assist debt, so it will be paid off by a later assist. There should be
little allocation from non-preemptible contexts, so this shouldn't
harm the overall assist mechanism.

Change-Id: I7bf0e6c73e659fe6b52f27437abf39d76b245c79
Reviewed-on: https://go-review.googlesource.com/12649
Reviewed-by: Russ Cox <rsc@golang.org>
2015-07-27 19:58:59 +00:00
Austin Clements
b8526a8380 runtime: steal the correct amount of GC assist credit
GC assists are supposed to steal at most the amount of background GC
credit available so that background GC credit doesn't go negative.
However, they are instead stealing the *total* amount of their debt
but only claiming up to the amount of credit that was available. This
results in draining the background GC credit pool too quickly, which
results in unnecessary assist work.

The fix is trivial: steal the amount of work we meant to steal (which
is already computed).

Change-Id: I837fe60ed515ba91c6baf363248069734a7895ef
Reviewed-on: https://go-review.googlesource.com/12643
Reviewed-by: Rick Hudson <rlh@golang.org>
Reviewed-by: Russ Cox <rsc@golang.org>
2015-07-27 19:58:54 +00:00
Austin Clements
2a331ca8bb runtime: document relaxed access to arena_used
The unsynchronized accesses to mheap_.arena_used in the concurrent
part of the garbage collector look like a problem waiting to happen.
In fact, they are safe, but the reason is somewhat subtle and
undocumented. This commit documents this reasoning.

Related to issue #9984.

Change-Id: Icdbf2329c1aa11dbe2396a71eb5fc2a85bd4afd5
Reviewed-on: https://go-review.googlesource.com/11254
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
2015-06-22 18:37:20 +00:00
Rick Hudson
90a19961f2 runtime: reduce latency by aggressively ending mark phase
Some latency regressions have crept into our system over the past few
weeks. This CL fixes those by having the mark phase more aggressively
blacken objects so that the mark termination phase, a STW phase, has less
work to do. Three approaches were taken when the mark phase believes
it has no more work to do, ie all the work buffers are empty.
If things have gone well the mark phase is correct and there is
in fact little or no work. In that case the following items will
take very little time. If the mark phase is wrong this CL will
ferret that work out and give the mark phase a chance to deal with
it concurrently before mark termination begins.

When the mark phase first appears to be out of work, it does three things:
1) It switches from allocating white to allocating black to reduce the
number of unmarked objects reachable only from stacks.
2) It flushes and disables per-P GC work caches so all work must be in
globally visible work buffers.
3) It rescans the global roots---the BSS and data segments---so there
are fewer objects to blacken during mark termination. We do not rescan
stacks at this point, though that could be done in a later CL.
After these steps, it again drains the global work buffers.

On a lightly loaded machine the garbage benchmark has reduced the
number of GC cycles with latency > 10 ms from 83 out of 4083 cycles
down to 2 out of 3995 cycles. Maximum latency was reduced from
60+ msecs down to 20 ms.

Change-Id: I152285b48a7e56c5083a02e8e4485dd39c990492
Reviewed-on: https://go-review.googlesource.com/10590
Reviewed-by: Austin Clements <austin@google.com>
2015-06-18 21:38:46 +00:00
Russ Cox
3c60e6e8cf runtime: fix races in stack scan
This fixes a hang during runtime.TestTraceStress.
It also fixes double-scan of stacks, which leads to
stack barrier installation failures.

Both of these have shown up as flaky failures on the dashboard.

Fixes #10941.

Change-Id: Ia2a5991ce2c9f43ba06ae1c7032f7c898dc990e0
Reviewed-on: https://go-review.googlesource.com/11089
Reviewed-by: Austin Clements <austin@google.com>
2015-06-17 17:56:26 +00:00
Russ Cox
d5b40b6ac2 runtime: add GODEBUG gcshrinkstackoff, gcstackbarrieroff, and gcstoptheworld variables
While we're here, update the documentation and delete variables with no effect.

Change-Id: I4df0d266dff880df61b488ed547c2870205862f0
Reviewed-on: https://go-review.googlesource.com/10790
Reviewed-by: Austin Clements <austin@google.com>
2015-06-15 17:31:04 +00:00
Ainar Garipov
7f9f70e5b6 all: fix misprints in comments
These were found by grepping the comments from the go code and feeding
the output to aspell.

Change-Id: Id734d6c8d1938ec3c36bd94a4dbbad577e3ad395
Reviewed-on: https://go-review.googlesource.com/10941
Reviewed-by: Aamir Khan <syst3m.w0rm@gmail.com>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
2015-06-11 14:18:57 +00:00
Austin Clements
306f8f11ad runtime: unwind stack barriers when writing above the current frame
Stack barriers assume that writes through pointers to frames above the
current frame will get write barriers, and hence these frames do not
need to be re-scanned to pick up these changes. For normal writes,
this is true. However, there are places in the runtime that use
typedmemmove to potentially write through pointers to higher frames
(such as mapassign1). Currently, typedmemmove does not execute write
barriers if the destination is on the stack. If there's a stack
barrier between the current frame and the frame being modified with
typedmemmove, and the stack barrier is not otherwise hit, it's
possible that the garbage collector will never see the updated pointer
and incorrectly reclaim the object.

Fix this by making heapBitsBulkBarrier (which lies behind typedmemmove
and its variants) detect when the destination is in the stack and
unwind stack barriers up to the point, forcing mark termination to
later rescan the effected frame and collect these pointers.

Fixes #11084. Might be related to #10240, #10541, #10941, #11023,
 #11027 and possibly others.

Change-Id: I323d6cd0f1d29fa01f8fc946f4b90e04ef210efd
Reviewed-on: https://go-review.googlesource.com/10791
Reviewed-by: Russ Cox <rsc@golang.org>
2015-06-07 17:57:47 +00:00
Austin Clements
7529314ed3 runtime: use correct SP when installing stack barriers
Currently the stack barriers are installed at the next frame boundary
after gp.sched.sp + 1024*2^n for n=0,1,2,... However, when a G is in a
system call, we set gp.sched.sp to 0, which causes stack barriers to
be installed at *every* frame. This easily overflows the slice we've
reserved for storing the stack barrier information, and causes a
"slice bounds out of range" panic in gcInstallStackBarrier.

Fix this by using gp.syscallsp instead of gp.sched.sp if it's
non-zero. This is the same logic that gentraceback uses to determine
the current SP.

Fixes #11049.

Change-Id: Ie40eeee5bec59b7c1aa715a7c17aa63b1f1cf4e8
Reviewed-on: https://go-review.googlesource.com/10755
Reviewed-by: Russ Cox <rsc@golang.org>
2015-06-05 15:53:07 +00:00
Austin Clements
faa7a7e8ae runtime: implement GC stack barriers
This commit implements stack barriers to minimize the amount of
stack re-scanning that must be done during mark termination.

Currently the GC scans stacks of active goroutines twice during every
GC cycle: once at the beginning during root discovery and once at the
end during mark termination. The second scan happens while the world
is stopped and guarantees that we've seen all of the roots (since
there are no write barriers on writes to local stack
variables). However, this means pause time is proportional to stack
size. In particularly recursive programs, this can drive pause time up
past our 10ms goal (e.g., it takes about 150ms to scan a 50MB heap).

Re-scanning the entire stack is rarely necessary, especially for large
stacks, because usually most of the frames on the stack were not
active between the first and second scans and hence any changes to
these frames (via non-escaping pointers passed down the stack) were
tracked by write barriers.

To efficiently track how far a stack has been unwound since the first
scan (and, hence, how much needs to be re-scanned), this commit
introduces stack barriers. During the first scan, at exponentially
spaced points in each stack, the scan overwrites return PCs with the
PC of the stack barrier function. When "returned" to, the stack
barrier function records how far the stack has unwound and jumps to
the original return PC for that point in the stack. Then the second
scan only needs to proceed as far as the lowest barrier that hasn't
been hit.

For deeply recursive programs, this substantially reduces mark
termination time (and hence pause time). For the goscheme example
linked in issue #10898, prior to this change, mark termination times
were typically between 100 and 500ms; with this change, mark
termination times are typically between 10 and 20ms. As a result of
the reduced stack scanning work, this reduces overall execution time
of the goscheme example by 20%.

Fixes #10898.

The effect of this on programs that are not deeply recursive is
minimal:

name                   old time/op    new time/op    delta
BinaryTree17              3.16s ± 2%     3.26s ± 1%  +3.31%  (p=0.000 n=19+19)
Fannkuch11                2.42s ± 1%     2.48s ± 1%  +2.24%  (p=0.000 n=17+19)
FmtFprintfEmpty          50.0ns ± 3%    49.8ns ± 1%    ~     (p=0.534 n=20+19)
FmtFprintfString          173ns ± 0%     175ns ± 0%  +1.49%  (p=0.000 n=16+19)
FmtFprintfInt             170ns ± 1%     175ns ± 1%  +2.97%  (p=0.000 n=20+19)
FmtFprintfIntInt          288ns ± 0%     295ns ± 0%  +2.73%  (p=0.000 n=16+19)
FmtFprintfPrefixedInt     242ns ± 1%     252ns ± 1%  +4.13%  (p=0.000 n=18+18)
FmtFprintfFloat           324ns ± 0%     323ns ± 0%  -0.36%  (p=0.000 n=20+19)
FmtManyArgs              1.14µs ± 0%    1.12µs ± 1%  -1.01%  (p=0.000 n=18+19)
GobDecode                8.88ms ± 1%    8.87ms ± 0%    ~     (p=0.480 n=19+18)
GobEncode                6.80ms ± 1%    6.85ms ± 0%  +0.82%  (p=0.000 n=20+18)
Gzip                      363ms ± 1%     363ms ± 1%    ~     (p=0.077 n=18+20)
Gunzip                   90.6ms ± 0%    90.0ms ± 1%  -0.71%  (p=0.000 n=17+18)
HTTPClientServer         51.5µs ± 1%    50.8µs ± 1%  -1.32%  (p=0.000 n=18+18)
JSONEncode               17.0ms ± 0%    17.1ms ± 0%  +0.40%  (p=0.000 n=18+17)
JSONDecode               61.8ms ± 0%    63.8ms ± 1%  +3.11%  (p=0.000 n=18+17)
Mandelbrot200            3.84ms ± 0%    3.84ms ± 1%    ~     (p=0.583 n=19+19)
GoParse                  3.71ms ± 1%    3.72ms ± 1%    ~     (p=0.159 n=18+19)
RegexpMatchEasy0_32       100ns ± 0%     100ns ± 1%  -0.19%  (p=0.033 n=17+19)
RegexpMatchEasy0_1K       342ns ± 1%     331ns ± 0%  -3.41%  (p=0.000 n=19+19)
RegexpMatchEasy1_32      82.5ns ± 0%    81.7ns ± 0%  -0.98%  (p=0.000 n=18+18)
RegexpMatchEasy1_1K       505ns ± 0%     494ns ± 1%  -2.16%  (p=0.000 n=18+18)
RegexpMatchMedium_32      137ns ± 1%     137ns ± 1%  -0.24%  (p=0.048 n=20+18)
RegexpMatchMedium_1K     41.6µs ± 0%    41.3µs ± 1%  -0.57%  (p=0.004 n=18+20)
RegexpMatchHard_32       2.11µs ± 0%    2.11µs ± 1%  +0.20%  (p=0.037 n=17+19)
RegexpMatchHard_1K       63.9µs ± 2%    63.3µs ± 0%  -0.99%  (p=0.000 n=20+17)
Revcomp                   560ms ± 1%     522ms ± 0%  -6.87%  (p=0.000 n=18+16)
Template                 75.0ms ± 0%    75.1ms ± 1%  +0.18%  (p=0.013 n=18+19)
TimeParse                 358ns ± 1%     364ns ± 0%  +1.74%  (p=0.000 n=20+15)
TimeFormat                360ns ± 0%     372ns ± 0%  +3.55%  (p=0.000 n=20+18)

Change-Id: If8a9bfae6c128d15a4f405e02bcfa50129df82a2
Reviewed-on: https://go-review.googlesource.com/10314
Reviewed-by: Russ Cox <rsc@golang.org>
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2015-06-02 20:00:57 +00:00
Austin Clements
724f8298a8 runtime: avoid double-scanning of stacks
Currently there's a race between stopg scanning another G's stack and
the G reaching a preemption point and scanning its own stack. When
this race occurs, the G's stack is scanned twice. Currently this is
okay, so this race is benign.

However, we will shortly be adding stack barriers during the first
stack scan, so scanning will no longer be idempotent. To prepare for
this, this change ensures that each stack is scanned only once during
each GC phase by checking the flag that indicates that the stack has
been scanned in this phase before scanning the stack.

Change-Id: Id9f4d5e2e5b839bc3f200ec1723a4a12dd677ab4
Reviewed-on: https://go-review.googlesource.com/10458
Reviewed-by: Rick Hudson <rlh@golang.org>
2015-06-02 19:59:05 +00:00
Austin Clements
3f6e69aca5 runtime: steal space for stack barrier tracking from stack
The stack barrier code will need a bookkeeping structure to keep track
of the overwritten return PCs. This commit introduces and allocates
this structure, but does not yet use the structure.

We don't want to allocate space for this structure during garbage
collection, so this commit allocates it along with the allocation of
the corresponding stack. However, we can't do a regular allocation in
newstack because mallocgc may itself grow the stack (which would lead
to a recursive allocation). Hence, this commit makes the bookkeeping
structure part of the stack allocation itself by stealing the
necessary space from the top of the stack allocation. Since the size
of this bookkeeping structure is logarithmic in the size of the stack,
this has minimal impact on stack behavior.

Change-Id: Ia14408be06aafa9ca4867f4e70bddb3fe0e96665
Reviewed-on: https://go-review.googlesource.com/10313
Reviewed-by: Russ Cox <rsc@golang.org>
2015-06-02 19:57:57 +00:00
Rick Hudson
197aa9e64d runtime: remove unused quiesce code
This is dead code. If you want to quiesce the system the
preferred way is to use forEachP(func(*p){}).

Change-Id: Ic7677a5dd55e3639b99e78ddeb2c71dd1dd091fa
Reviewed-on: https://go-review.googlesource.com/10267
Reviewed-by: Austin Clements <austin@google.com>
2015-05-20 17:56:44 +00:00
Russ Cox
8903b3db0e runtime: add fast check for self-loop pointer in scanobject
Addresses a problem reported on the mailing list.

This will come up mainly in programs custom allocators that batch allocations,
but it still helps in our programs, which mainly do not have such allocations.

name                   old mean              new mean              delta
BinaryTree17            5.95s × (0.97,1.03)   5.93s × (0.97,1.04)    ~    (p=0.613)
Fannkuch11              4.46s × (0.98,1.04)   4.33s × (0.99,1.01)  -2.93% (p=0.000)
FmtFprintfEmpty        86.6ns × (0.98,1.03)  86.8ns × (0.98,1.02)    ~    (p=0.523)
FmtFprintfString        290ns × (0.98,1.05)   287ns × (0.98,1.03)    ~    (p=0.061)
FmtFprintfInt           271ns × (0.98,1.04)   286ns × (0.99,1.01)  +5.54% (p=0.000)
FmtFprintfIntInt        495ns × (0.98,1.04)   489ns × (0.99,1.01)  -1.24% (p=0.015)
FmtFprintfPrefixedInt   391ns × (0.99,1.02)   407ns × (0.99,1.01)  +4.00% (p=0.000)
FmtFprintfFloat         578ns × (0.99,1.01)   559ns × (0.99,1.01)  -3.35% (p=0.000)
FmtManyArgs            1.96µs × (0.98,1.05)  1.94µs × (0.99,1.01)  -1.33% (p=0.030)
GobDecode              15.9ms × (0.97,1.05)  15.7ms × (0.99,1.01)  -1.35% (p=0.044)
GobEncode              11.4ms × (0.97,1.05)  11.3ms × (0.98,1.03)    ~    (p=0.141)
Gzip                    658ms × (0.98,1.05)   648ms × (0.99,1.01)  -1.59% (p=0.009)
Gunzip                  144ms × (0.99,1.03)   144ms × (0.99,1.01)    ~    (p=0.867)
HTTPClientServer       92.1µs × (0.97,1.05)  90.3µs × (0.99,1.01)  -1.89% (p=0.005)
JSONEncode             31.0ms × (0.96,1.07)  30.2ms × (0.98,1.03)  -2.66% (p=0.001)
JSONDecode              110ms × (0.97,1.04)   107ms × (0.99,1.01)  -2.59% (p=0.000)
Mandelbrot200          6.15ms × (0.98,1.04)  6.07ms × (0.99,1.02)  -1.32% (p=0.045)
GoParse                6.79ms × (0.97,1.04)  6.74ms × (0.97,1.04)    ~    (p=0.242)
RegexpMatchEasy0_32     158ns × (0.98,1.05)   155ns × (0.99,1.01)  -1.64% (p=0.010)
RegexpMatchEasy0_1K     548ns × (0.97,1.04)   540ns × (0.99,1.01)  -1.34% (p=0.042)
RegexpMatchEasy1_32     133ns × (0.97,1.04)   132ns × (0.97,1.05)    ~    (p=0.466)
RegexpMatchEasy1_1K     899ns × (0.96,1.05)   878ns × (0.99,1.01)  -2.32% (p=0.002)
RegexpMatchMedium_32    250ns × (0.96,1.03)   243ns × (0.99,1.01)  -2.90% (p=0.000)
RegexpMatchMedium_1K   73.4µs × (0.98,1.04)  73.0µs × (0.98,1.04)    ~    (p=0.411)
RegexpMatchHard_32     3.87µs × (0.97,1.07)  3.84µs × (0.98,1.04)    ~    (p=0.273)
RegexpMatchHard_1K      120µs × (0.97,1.08)   117µs × (0.99,1.01)  -2.06% (p=0.010)
Revcomp                 940ms × (0.96,1.07)   924ms × (0.97,1.07)    ~    (p=0.071)
Template                128ms × (0.96,1.05)   128ms × (0.99,1.01)    ~    (p=0.502)
TimeParse               632ns × (0.96,1.07)   616ns × (0.99,1.01)  -2.58% (p=0.001)
TimeFormat              671ns × (0.97,1.06)   657ns × (0.99,1.02)  -2.10% (p=0.002)

In contrast to the one in test/bench/go1 (above), the binarytree program on the
shootout site uses more goroutines, batches allocations, and sets GOMAXPROCS
to runtime.NumCPU()*2.

Using that version, before vs after:

name          old mean             new mean             delta
BinaryTree20  18.6s × (0.96,1.05)  11.3s × (0.98,1.02)  -39.46% (p=0.000)

And Go 1.4 vs after:

name          old mean             new mean             delta
BinaryTree20  13.0s × (0.97,1.02)  11.3s × (0.98,1.02)  -13.21% (p=0.000)

There is still a scheduling problem - the raw run times are hiding the fact that
this chews up 2x the CPU - but we'll take care of that separately.

Change-Id: I3f5da879b24ae73a0d06745381ffb88c3744948b
Reviewed-on: https://go-review.googlesource.com/10220
Reviewed-by: Austin Clements <austin@google.com>
2015-05-19 15:29:40 +00:00
Russ Cox
1635ab7dfe runtime: remove wbshadow mode
The write barrier shadow heap was very useful for
developing the write barriers initially, but it's no longer used,
clunky, and dragging the rest of the implementation down.

The gccheckmark mode will find bugs due to missed barriers
when they result in missed marks; wbshadow mode found the
missed barriers more aggressively, but it required an entire
separate copy of the heap. The gccheckmark mode requires
no extra memory, making it more useful in practice.

Compared to previous CL:
name                   old mean              new mean              delta
BinaryTree17            5.91s × (0.96,1.06)   5.72s × (0.97,1.03)  -3.12% (p=0.000)
Fannkuch11              4.32s × (1.00,1.00)   4.36s × (1.00,1.00)  +0.91% (p=0.000)
FmtFprintfEmpty        89.0ns × (0.93,1.10)  86.6ns × (0.96,1.11)    ~    (p=0.077)
FmtFprintfString        298ns × (0.98,1.06)   283ns × (0.99,1.04)  -4.90% (p=0.000)
FmtFprintfInt           286ns × (0.98,1.03)   283ns × (0.98,1.04)  -1.09% (p=0.032)
FmtFprintfIntInt        498ns × (0.97,1.06)   480ns × (0.99,1.02)  -3.65% (p=0.000)
FmtFprintfPrefixedInt   408ns × (0.98,1.02)   396ns × (0.99,1.01)  -3.00% (p=0.000)
FmtFprintfFloat         587ns × (0.98,1.01)   562ns × (0.99,1.01)  -4.34% (p=0.000)
FmtManyArgs            1.94µs × (0.99,1.02)  1.89µs × (0.99,1.01)  -2.85% (p=0.000)
GobDecode              15.8ms × (0.98,1.03)  15.7ms × (0.99,1.02)    ~    (p=0.251)
GobEncode              12.0ms × (0.96,1.09)  11.8ms × (0.98,1.03)  -1.87% (p=0.024)
Gzip                    648ms × (0.99,1.01)   647ms × (0.99,1.01)    ~    (p=0.688)
Gunzip                  143ms × (1.00,1.01)   143ms × (1.00,1.01)    ~    (p=0.203)
HTTPClientServer       90.3µs × (0.98,1.01)  89.1µs × (0.99,1.02)  -1.30% (p=0.000)
JSONEncode             31.6ms × (0.99,1.01)  31.7ms × (0.98,1.02)    ~    (p=0.219)
JSONDecode              107ms × (1.00,1.01)   111ms × (0.99,1.01)  +3.58% (p=0.000)
Mandelbrot200          6.03ms × (1.00,1.01)  6.01ms × (1.00,1.00)    ~    (p=0.077)
GoParse                6.53ms × (0.99,1.03)  6.54ms × (0.99,1.02)    ~    (p=0.585)
RegexpMatchEasy0_32     161ns × (1.00,1.01)   161ns × (0.98,1.05)    ~    (p=0.948)
RegexpMatchEasy0_1K     541ns × (0.99,1.01)   559ns × (0.98,1.01)  +3.32% (p=0.000)
RegexpMatchEasy1_32     138ns × (1.00,1.00)   137ns × (0.99,1.01)  -0.55% (p=0.001)
RegexpMatchEasy1_1K     887ns × (0.99,1.01)   878ns × (0.99,1.01)  -0.98% (p=0.000)
RegexpMatchMedium_32    253ns × (0.99,1.01)   252ns × (0.99,1.01)  -0.39% (p=0.001)
RegexpMatchMedium_1K   72.8µs × (1.00,1.00)  72.7µs × (1.00,1.00)    ~    (p=0.485)
RegexpMatchHard_32     3.85µs × (1.00,1.01)  3.85µs × (1.00,1.01)    ~    (p=0.283)
RegexpMatchHard_1K      117µs × (1.00,1.01)   117µs × (1.00,1.00)    ~    (p=0.175)
Revcomp                 922ms × (0.97,1.08)   903ms × (0.98,1.05)  -2.15% (p=0.021)
Template                126ms × (0.99,1.01)   126ms × (0.99,1.01)    ~    (p=0.943)
TimeParse               628ns × (0.99,1.01)   634ns × (0.99,1.01)  +0.92% (p=0.000)
TimeFormat              668ns × (0.99,1.01)   698ns × (0.98,1.03)  +4.53% (p=0.000)

It's nice that the microbenchmarks are the ones helped the most,
because those were the ones hurt the most by the conversion from
4-bit to 2-bit heap bitmaps. This CL brings the overall effect of that
process to (compared to CL 9706 patch set 1):

name                   old mean              new mean              delta
BinaryTree17            5.87s × (0.94,1.09)   5.72s × (0.97,1.03)  -2.57% (p=0.011)
Fannkuch11              4.32s × (1.00,1.00)   4.36s × (1.00,1.00)  +0.87% (p=0.000)
FmtFprintfEmpty        89.1ns × (0.95,1.16)  86.6ns × (0.96,1.11)    ~    (p=0.090)
FmtFprintfString        283ns × (0.98,1.02)   283ns × (0.99,1.04)    ~    (p=0.681)
FmtFprintfInt           284ns × (0.98,1.04)   283ns × (0.98,1.04)    ~    (p=0.620)
FmtFprintfIntInt        486ns × (0.98,1.03)   480ns × (0.99,1.02)  -1.27% (p=0.002)
FmtFprintfPrefixedInt   400ns × (0.99,1.02)   396ns × (0.99,1.01)  -0.84% (p=0.001)
FmtFprintfFloat         566ns × (0.99,1.01)   562ns × (0.99,1.01)  -0.80% (p=0.000)
FmtManyArgs            1.91µs × (0.99,1.02)  1.89µs × (0.99,1.01)  -1.10% (p=0.000)
GobDecode              15.5ms × (0.98,1.05)  15.7ms × (0.99,1.02)  +1.55% (p=0.005)
GobEncode              11.9ms × (0.97,1.03)  11.8ms × (0.98,1.03)  -0.97% (p=0.048)
Gzip                    648ms × (0.99,1.01)   647ms × (0.99,1.01)    ~    (p=0.627)
Gunzip                  143ms × (1.00,1.00)   143ms × (1.00,1.01)    ~    (p=0.482)
HTTPClientServer       89.2µs × (0.99,1.02)  89.1µs × (0.99,1.02)    ~    (p=0.740)
JSONEncode             32.3ms × (0.97,1.06)  31.7ms × (0.98,1.02)  -1.95% (p=0.002)
JSONDecode              106ms × (0.99,1.01)   111ms × (0.99,1.01)  +4.22% (p=0.000)
Mandelbrot200          6.02ms × (1.00,1.00)  6.01ms × (1.00,1.00)    ~    (p=0.417)
GoParse                6.57ms × (0.97,1.06)  6.54ms × (0.99,1.02)    ~    (p=0.404)
RegexpMatchEasy0_32     162ns × (1.00,1.00)   161ns × (0.98,1.05)    ~    (p=0.088)
RegexpMatchEasy0_1K     561ns × (0.99,1.02)   559ns × (0.98,1.01)  -0.47% (p=0.034)
RegexpMatchEasy1_32     145ns × (0.95,1.04)   137ns × (0.99,1.01)  -5.56% (p=0.000)
RegexpMatchEasy1_1K     864ns × (0.99,1.04)   878ns × (0.99,1.01)  +1.57% (p=0.000)
RegexpMatchMedium_32    255ns × (0.99,1.04)   252ns × (0.99,1.01)  -1.43% (p=0.001)
RegexpMatchMedium_1K   73.9µs × (0.98,1.04)  72.7µs × (1.00,1.00)  -1.55% (p=0.004)
RegexpMatchHard_32     3.92µs × (0.98,1.04)  3.85µs × (1.00,1.01)  -1.80% (p=0.003)
RegexpMatchHard_1K      120µs × (0.98,1.04)   117µs × (1.00,1.00)  -2.13% (p=0.001)
Revcomp                 936ms × (0.95,1.08)   903ms × (0.98,1.05)  -3.58% (p=0.002)
Template                130ms × (0.98,1.04)   126ms × (0.99,1.01)  -2.98% (p=0.000)
TimeParse               638ns × (0.98,1.05)   634ns × (0.99,1.01)    ~    (p=0.198)
TimeFormat              674ns × (0.99,1.01)   698ns × (0.98,1.03)  +3.69% (p=0.000)

Change-Id: Ia0e9b50b1d75a3c0c7556184cd966305574fe07c
Reviewed-on: https://go-review.googlesource.com/9706
Reviewed-by: Rick Hudson <rlh@golang.org>
2015-05-11 14:55:11 +00:00
Russ Cox
feb8a3b616 runtime: optimize heapBitsSetType
For the conversion of the heap bitmap from 4-bit to 2-bit fields,
I replaced heapBitsSetType with the dumbest thing that could possibly work:
two atomic operations (atomicand8+atomicor8) per 2-bit field.

This CL replaces that code with a proper implementation that
avoids the atomics whenever possible. Benchmarks vs base CL
(before the conversion to 2-bit heap bitmap) and vs Go 1.4 below.

Compared to Go 1.4, SetTypePtr (a 1-pointer allocation)
is 10ns slower because a race against the concurrent GC requires the
use of an atomicor8 that used to be an ordinary write. This slowdown
was present even in the base CL.

Compared to both Go 1.4 and base, SetTypeNode8 (a 10-word allocation)
is 10ns slower because it too needs a new atomic, because with the
denser representation, the byte on the end of the allocation is now shared
with the object next to it; this was not true with the 4-bit representation.

Excluding these two (fundamental) slowdowns due to the use of atomics,
the new code is noticeably faster than both Go 1.4 and the base CL.

The next CL will reintroduce the ``typeDead'' optimization.

Stats are from 5 runs on a MacBookPro10,2 (late 2012 Core i5).

Compared to base CL (** = new atomic)
name                  old mean              new mean              delta
SetTypePtr            14.1ns × (0.99,1.02)  14.7ns × (0.93,1.10)     ~    (p=0.175)
SetTypePtr8           18.4ns × (1.00,1.01)  18.6ns × (0.81,1.21)     ~    (p=0.866)
SetTypePtr16          28.7ns × (1.00,1.00)  22.4ns × (0.90,1.27)  -21.88% (p=0.015)
SetTypePtr32          52.3ns × (1.00,1.00)  33.8ns × (0.93,1.24)  -35.37% (p=0.001)
SetTypePtr64          79.2ns × (1.00,1.00)  55.1ns × (1.00,1.01)  -30.43% (p=0.000)
SetTypePtr126          118ns × (1.00,1.00)   100ns × (1.00,1.00)  -15.97% (p=0.000)
SetTypePtr128          130ns × (0.92,1.19)    98ns × (1.00,1.00)  -24.36% (p=0.008)
SetTypePtrSlice        726ns × (0.96,1.08)   760ns × (1.00,1.00)     ~    (p=0.152)
SetTypeNode1          14.1ns × (0.94,1.15)  12.0ns × (1.00,1.01)  -14.60% (p=0.020)
SetTypeNode1Slice      135ns × (0.96,1.07)    88ns × (1.00,1.00)  -34.53% (p=0.000)
SetTypeNode8          20.9ns × (1.00,1.01)  32.6ns × (1.00,1.00)  +55.37% (p=0.000) **
SetTypeNode8Slice      414ns × (0.99,1.02)   244ns × (1.00,1.00)  -41.09% (p=0.000)
SetTypeNode64         80.0ns × (1.00,1.00)  57.4ns × (1.00,1.00)  -28.23% (p=0.000)
SetTypeNode64Slice    2.15µs × (1.00,1.01)  1.56µs × (1.00,1.00)  -27.43% (p=0.000)
SetTypeNode124         119ns × (0.99,1.00)   100ns × (1.00,1.00)  -16.11% (p=0.000)
SetTypeNode124Slice   3.40µs × (1.00,1.00)  2.93µs × (1.00,1.00)  -13.80% (p=0.000)
SetTypeNode126         120ns × (1.00,1.01)    98ns × (1.00,1.00)  -18.19% (p=0.000)
SetTypeNode126Slice   3.53µs × (0.98,1.08)  3.02µs × (1.00,1.00)  -14.49% (p=0.002)
SetTypeNode1024        726ns × (0.97,1.09)   740ns × (1.00,1.00)     ~    (p=0.451)
SetTypeNode1024Slice  24.9µs × (0.89,1.37)  23.1µs × (1.00,1.00)     ~    (p=0.476)

Compared to Go 1.4 (** = new atomic)
name                  old mean               new mean              delta
SetTypePtr            5.71ns × (0.89,1.19)  14.68ns × (0.93,1.10)  +157.24% (p=0.000) **
SetTypePtr8           19.3ns × (0.96,1.10)   18.6ns × (0.81,1.21)      ~    (p=0.638)
SetTypePtr16          30.7ns × (0.99,1.03)   22.4ns × (0.90,1.27)   -26.88% (p=0.005)
SetTypePtr32          51.5ns × (1.00,1.00)   33.8ns × (0.93,1.24)   -34.40% (p=0.001)
SetTypePtr64          83.6ns × (0.94,1.12)   55.1ns × (1.00,1.01)   -34.12% (p=0.001)
SetTypePtr126          137ns × (0.87,1.26)    100ns × (1.00,1.00)   -27.10% (p=0.028)
SetTypePtrSlice        865ns × (0.80,1.23)    760ns × (1.00,1.00)      ~    (p=0.243)
SetTypeNode1          15.2ns × (0.88,1.12)   12.0ns × (1.00,1.01)   -20.89% (p=0.014)
SetTypeNode1Slice      156ns × (0.93,1.16)     88ns × (1.00,1.00)   -43.57% (p=0.001)
SetTypeNode8          23.8ns × (0.90,1.18)   32.6ns × (1.00,1.00)   +36.76% (p=0.003) **
SetTypeNode8Slice      502ns × (0.92,1.10)    244ns × (1.00,1.00)   -51.46% (p=0.000)
SetTypeNode64         85.6ns × (0.94,1.11)   57.4ns × (1.00,1.00)   -32.89% (p=0.001)
SetTypeNode64Slice    2.36µs × (0.91,1.14)   1.56µs × (1.00,1.00)   -33.96% (p=0.002)
SetTypeNode124         130ns × (0.91,1.12)    100ns × (1.00,1.00)   -23.49% (p=0.004)
SetTypeNode124Slice   3.81µs × (0.90,1.22)   2.93µs × (1.00,1.00)   -23.09% (p=0.025)

There are fewer benchmarks vs Go 1.4 because unrolling directly
into the heap bitmap is not yet implemented, so those would not
be meaningful comparisons.

These benchmarks were not present in Go 1.4 as distributed.
The backport to Go 1.4 is in github.com/rsc/go's go14bench branch,
commit 71d5ee5.

Change-Id: I95ed05a22bf484b0fc9efad549279e766c98d2b6
Reviewed-on: https://go-review.googlesource.com/9704
Reviewed-by: Rick Hudson <rlh@golang.org>
2015-05-11 14:51:20 +00:00
Russ Cox
0234dfd493 runtime: use 2-bit heap bitmap (in place of 4-bit)
Previous CLs changed the representation of the non-heap type bitmaps
to be 1-bit bitmaps (pointer or not). Before this CL, the heap bitmap
stored a 2-bit type for each word and a mark bit and checkmark bit
for the first word of the object. (There used to be additional per-word bits.)

Reduce heap bitmap to 2-bit, with 1 dedicated to pointer or not,
and the other used for mark, checkmark, and "keep scanning forward
to find pointers in this object." See comments for details.

This CL replaces heapBitsSetType with very slow but obviously correct code.
A followup CL will optimize it. (Spoiler: the new code is faster than Go 1.4 was.)

Change-Id: I999577a133f3cfecacebdec9cdc3573c235c7fb9
Reviewed-on: https://go-review.googlesource.com/9703
Reviewed-by: Rick Hudson <rlh@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
2015-05-11 14:43:45 +00:00
Russ Cox
9626561030 runtime: fix gccheckmark mode and enable by default
It was testing the mark bits on what roots pointed at,
but not the remainder of the live heap, because in
CL 2991 I accidentally inverted this check during
refactoring.

The next CL will turn it back off by default again,
but I want one run on the builders with the full
checkmark checks.

Change-Id: Ic166458cea25c0a56e5387fc527cb166ff2e5ada
Reviewed-on: https://go-review.googlesource.com/9824
Run-TryBot: Russ Cox <rsc@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
2015-05-07 21:08:29 +00:00
Austin Clements
3be3cbd548 runtime: track "scannable" bytes of heap
This tracks the number of scannable bytes in the allocated heap. That
is, bytes that the garbage collector must scan before reaching the
last pointer field in each object.

This will be used to compute a more robust estimate of the GC scan
work.

Change-Id: I1eecd45ef9cdd65b69d2afb5db5da885c80086bb
Reviewed-on: https://go-review.googlesource.com/9695
Reviewed-by: Russ Cox <rsc@golang.org>
2015-05-06 19:40:33 +00:00
Austin Clements
53c53984e7 runtime: include scalar slots in GC scan work metric
The garbage collector predicts how much "scan work" must be done in a
cycle to determine how much work should be done by mutators when they
allocate. Most code doesn't care what units the scan work is in: it
simply knows that a certain amount of scan work has to be done in the
cycle. Currently, the GC uses the number of pointer slots scanned as
the scan work on the theory that this is the bulk of the time spent in
the garbage collector and hence reflects real CPU resource usage.
However, this metric is difficult to estimate at the beginning of a
cycle.

Switch to counting the total number of bytes scanned, including both
pointer and scalar slots. This is still less than the total marked
heap since it omits no-scan objects and no-scan tails of objects. This
metric may not reflect absolute performance as well as the count of
scanned pointer slots (though it still takes time to scan scalar
fields), but it will be much easier to estimate robustly, which is
more important.

Change-Id: Ie3a5eeeb0384a1ca566f61b2f11e9ff3a75ca121
Reviewed-on: https://go-review.googlesource.com/9694
Reviewed-by: Russ Cox <rsc@golang.org>
2015-05-06 19:40:27 +00:00
Russ Cox
4fffc50c26 runtime: correct accounting of scan work and bytes marked
(1) Count pointer-free objects found during scanning roots
as marked bytes, by not zeroing the mark total after scanning roots.

(2) Don't count the bytes for the roots themselves, by not adding
them to the mark total in scanblock (the zeroing removed by (1)
was aimed at that add but hitting more).

Combined, (1) and (2) fix the calculation of the marked heap size.
This makes the GC trigger much less often in the Go 1 benchmarks,
which have a global []byte pointing at 256 MB of data.
That 256 MB allocation was not being included in the heap size
in the current code, but was included in Go 1.4.
This is the source of much of the relative slowdown in that directory.

(3) Count the bytes for the roots as scanned work, by not zeroing
the scan total after scanning roots. There is no strict justification
for this, and it probably doesn't matter much either way,
but it was always combined with another buggy zeroing
(removed in (1)), so guilty by association.

Austin noticed this.

name                                    old mean                new mean        delta
BenchmarkBinaryTree17              13.1s × (0.97,1.03)      5.9s × (0.97,1.05)  -55.19% (p=0.000)
BenchmarkFannkuch11                4.35s × (0.99,1.01)     4.37s × (1.00,1.01)  +0.47% (p=0.032)
BenchmarkFmtFprintfEmpty          84.6ns × (0.95,1.14)    85.7ns × (0.94,1.05)  ~ (p=0.521)
BenchmarkFmtFprintfString          320ns × (0.95,1.06)     283ns × (0.99,1.02)  -11.48% (p=0.000)
BenchmarkFmtFprintfInt             311ns × (0.98,1.03)     288ns × (0.99,1.02)  -7.26% (p=0.000)
BenchmarkFmtFprintfIntInt          554ns × (0.96,1.05)     478ns × (0.99,1.02)  -13.70% (p=0.000)
BenchmarkFmtFprintfPrefixedInt     434ns × (0.96,1.06)     393ns × (0.98,1.04)  -9.60% (p=0.000)
BenchmarkFmtFprintfFloat           620ns × (0.99,1.03)     584ns × (0.99,1.01)  -5.73% (p=0.000)
BenchmarkFmtManyArgs              2.19µs × (0.98,1.03)    1.94µs × (0.99,1.01)  -11.62% (p=0.000)
BenchmarkGobDecode                21.2ms × (0.97,1.06)    15.2ms × (0.99,1.01)  -28.17% (p=0.000)
BenchmarkGobEncode                18.1ms × (0.94,1.06)    11.8ms × (0.99,1.01)  -35.00% (p=0.000)
BenchmarkGzip                      650ms × (0.98,1.01)     649ms × (0.99,1.02)  ~ (p=0.802)
BenchmarkGunzip                    143ms × (1.00,1.01)     143ms × (1.00,1.01)  ~ (p=0.438)
BenchmarkHTTPClientServer          110µs × (0.98,1.04)     101µs × (0.98,1.02)  -8.79% (p=0.000)
BenchmarkJSONEncode               40.3ms × (0.97,1.03)    31.8ms × (0.98,1.03)  -20.92% (p=0.000)
BenchmarkJSONDecode                119ms × (0.97,1.02)     108ms × (0.99,1.02)  -9.15% (p=0.000)
BenchmarkMandelbrot200            6.03ms × (1.00,1.01)    6.03ms × (0.99,1.01)  ~ (p=0.750)
BenchmarkGoParse                  8.58ms × (0.89,1.10)    6.80ms × (1.00,1.00)  -20.71% (p=0.000)
BenchmarkRegexpMatchEasy0_32       162ns × (1.00,1.01)     162ns × (0.99,1.02)  ~ (p=0.131)
BenchmarkRegexpMatchEasy0_1K       540ns × (0.99,1.02)     559ns × (0.99,1.02)  +3.58% (p=0.000)
BenchmarkRegexpMatchEasy1_32       139ns × (0.98,1.04)     139ns × (1.00,1.00)  ~ (p=0.466)
BenchmarkRegexpMatchEasy1_1K       889ns × (0.99,1.01)     885ns × (0.99,1.01)  -0.50% (p=0.022)
BenchmarkRegexpMatchMedium_32      252ns × (0.99,1.02)     252ns × (0.99,1.01)  ~ (p=0.469)
BenchmarkRegexpMatchMedium_1K     72.9µs × (0.99,1.01)    73.6µs × (0.99,1.03)  ~ (p=0.168)
BenchmarkRegexpMatchHard_32       3.87µs × (1.00,1.01)    3.86µs × (1.00,1.00)  ~ (p=0.055)
BenchmarkRegexpMatchHard_1K        118µs × (0.99,1.01)     117µs × (0.99,1.00)  ~ (p=0.133)
BenchmarkRevcomp                   995ms × (0.94,1.10)     949ms × (0.99,1.01)  -4.64% (p=0.000)
BenchmarkTemplate                  141ms × (0.97,1.02)     127ms × (0.99,1.01)  -10.00% (p=0.000)
BenchmarkTimeParse                 641ns × (0.99,1.01)     623ns × (0.99,1.01)  -2.79% (p=0.000)
BenchmarkTimeFormat                729ns × (0.98,1.03)     679ns × (0.99,1.00)  -6.93% (p=0.000)

Change-Id: I839bd7356630d18377989a0748763414e15ed057
Reviewed-on: https://go-review.googlesource.com/9602
Reviewed-by: Austin Clements <austin@google.com>
2015-05-01 19:31:00 +00:00
Russ Cox
4d0f3a1c95 cmd/internal/gc, runtime: use 1-bit bitmap for stack frames, data, bss
The bitmaps were 2 bits per pointer because we needed to distinguish
scalar, pointer, multiword, and we used the leftover value to distinguish
uninitialized from scalar, even though the garbage collector (GC) didn't care.

Now that there are no multiword structures from the GC's point of view,
cut the bitmaps down to 1 bit per pointer, recording just live pointer vs not.

The GC assumes the same layout for stack frames and for the maps
describing the global data and bss sections, so change them all in one CL.

The code still refers to 4-bit heap bitmaps and 2-bit "type bitmaps", since
the 2-bit representation lives (at least for now) in some of the reflect data.

Because these stack frame bitmaps are stored directly in the rodata in
the binary, this CL reduces the size of the 6g binary by about 1.1%.

Performance change is basically a wash, but using less memory,
and smaller binaries, and enables other bitmap reductions.

name                                      old mean                new mean        delta
BenchmarkBinaryTree17                13.2s × (0.97,1.03)     13.0s × (0.99,1.01)  -0.93% (p=0.005)
BenchmarkBinaryTree17-2              9.69s × (0.96,1.05)     9.51s × (0.96,1.03)  -1.86% (p=0.001)
BenchmarkBinaryTree17-4              10.1s × (0.97,1.05)     10.0s × (0.96,1.05)  ~ (p=0.141)
BenchmarkFannkuch11                  4.35s × (0.99,1.01)     4.43s × (0.98,1.04)  +1.75% (p=0.001)
BenchmarkFannkuch11-2                4.31s × (0.99,1.03)     4.32s × (1.00,1.00)  ~ (p=0.095)
BenchmarkFannkuch11-4                4.32s × (0.99,1.02)     4.38s × (0.98,1.04)  +1.38% (p=0.008)
BenchmarkFmtFprintfEmpty            83.5ns × (0.97,1.10)    87.3ns × (0.92,1.11)  +4.55% (p=0.014)
BenchmarkFmtFprintfEmpty-2          81.8ns × (0.98,1.04)    82.5ns × (0.97,1.08)  ~ (p=0.364)
BenchmarkFmtFprintfEmpty-4          80.9ns × (0.99,1.01)    82.6ns × (0.97,1.08)  +2.12% (p=0.010)
BenchmarkFmtFprintfString            320ns × (0.95,1.04)     322ns × (0.97,1.05)  ~ (p=0.368)
BenchmarkFmtFprintfString-2          303ns × (0.97,1.04)     304ns × (0.97,1.04)  ~ (p=0.484)
BenchmarkFmtFprintfString-4          305ns × (0.97,1.05)     306ns × (0.98,1.05)  ~ (p=0.543)
BenchmarkFmtFprintfInt               311ns × (0.98,1.03)     319ns × (0.97,1.03)  +2.63% (p=0.000)
BenchmarkFmtFprintfInt-2             297ns × (0.98,1.04)     301ns × (0.97,1.04)  +1.19% (p=0.023)
BenchmarkFmtFprintfInt-4             302ns × (0.98,1.02)     304ns × (0.97,1.03)  ~ (p=0.126)
BenchmarkFmtFprintfIntInt            554ns × (0.96,1.05)     554ns × (0.97,1.03)  ~ (p=0.975)
BenchmarkFmtFprintfIntInt-2          520ns × (0.98,1.03)     517ns × (0.98,1.02)  ~ (p=0.153)
BenchmarkFmtFprintfIntInt-4          524ns × (0.98,1.02)     525ns × (0.98,1.03)  ~ (p=0.597)
BenchmarkFmtFprintfPrefixedInt       433ns × (0.97,1.06)     434ns × (0.97,1.06)  ~ (p=0.804)
BenchmarkFmtFprintfPrefixedInt-2     413ns × (0.98,1.04)     413ns × (0.98,1.03)  ~ (p=0.881)
BenchmarkFmtFprintfPrefixedInt-4     420ns × (0.97,1.03)     421ns × (0.97,1.03)  ~ (p=0.561)
BenchmarkFmtFprintfFloat             620ns × (0.99,1.03)     636ns × (0.97,1.03)  +2.57% (p=0.000)
BenchmarkFmtFprintfFloat-2           601ns × (0.98,1.02)     617ns × (0.98,1.03)  +2.58% (p=0.000)
BenchmarkFmtFprintfFloat-4           613ns × (0.98,1.03)     626ns × (0.98,1.02)  +2.15% (p=0.000)
BenchmarkFmtManyArgs                2.19µs × (0.96,1.04)    2.23µs × (0.97,1.02)  +1.65% (p=0.000)
BenchmarkFmtManyArgs-2              2.08µs × (0.98,1.03)    2.10µs × (0.99,1.02)  +0.79% (p=0.019)
BenchmarkFmtManyArgs-4              2.10µs × (0.98,1.02)    2.13µs × (0.98,1.02)  +1.72% (p=0.000)
BenchmarkGobDecode                  21.3ms × (0.97,1.05)    21.1ms × (0.97,1.04)  -1.36% (p=0.025)
BenchmarkGobDecode-2                20.0ms × (0.97,1.03)    19.2ms × (0.97,1.03)  -4.00% (p=0.000)
BenchmarkGobDecode-4                19.5ms × (0.99,1.02)    19.0ms × (0.99,1.01)  -2.39% (p=0.000)
BenchmarkGobEncode                  18.3ms × (0.95,1.07)    18.1ms × (0.96,1.08)  ~ (p=0.305)
BenchmarkGobEncode-2                16.8ms × (0.97,1.02)    16.4ms × (0.98,1.02)  -2.79% (p=0.000)
BenchmarkGobEncode-4                15.4ms × (0.98,1.02)    15.4ms × (0.98,1.02)  ~ (p=0.465)
BenchmarkGzip                        650ms × (0.98,1.03)     655ms × (0.97,1.04)  ~ (p=0.075)
BenchmarkGzip-2                      652ms × (0.98,1.03)     655ms × (0.98,1.02)  ~ (p=0.337)
BenchmarkGzip-4                      656ms × (0.98,1.04)     653ms × (0.98,1.03)  ~ (p=0.291)
BenchmarkGunzip                      143ms × (1.00,1.01)     143ms × (1.00,1.01)  ~ (p=0.507)
BenchmarkGunzip-2                    143ms × (1.00,1.01)     143ms × (1.00,1.01)  ~ (p=0.313)
BenchmarkGunzip-4                    143ms × (1.00,1.01)     143ms × (1.00,1.01)  ~ (p=0.312)
BenchmarkHTTPClientServer            110µs × (0.98,1.03)     109µs × (0.99,1.02)  -1.40% (p=0.000)
BenchmarkHTTPClientServer-2          154µs × (0.90,1.08)     149µs × (0.90,1.08)  -3.43% (p=0.007)
BenchmarkHTTPClientServer-4          138µs × (0.97,1.04)     138µs × (0.96,1.04)  ~ (p=0.670)
BenchmarkJSONEncode                 40.2ms × (0.98,1.02)    40.2ms × (0.98,1.05)  ~ (p=0.828)
BenchmarkJSONEncode-2               35.1ms × (0.99,1.02)    35.2ms × (0.98,1.03)  ~ (p=0.392)
BenchmarkJSONEncode-4               35.3ms × (0.98,1.03)    35.3ms × (0.98,1.02)  ~ (p=0.813)
BenchmarkJSONDecode                  119ms × (0.97,1.02)     117ms × (0.98,1.02)  -1.80% (p=0.000)
BenchmarkJSONDecode-2                115ms × (0.99,1.02)     114ms × (0.98,1.02)  -1.18% (p=0.000)
BenchmarkJSONDecode-4                116ms × (0.98,1.02)     114ms × (0.98,1.02)  -1.43% (p=0.000)
BenchmarkMandelbrot200              6.03ms × (1.00,1.01)    6.03ms × (1.00,1.01)  ~ (p=0.985)
BenchmarkMandelbrot200-2            6.03ms × (1.00,1.01)    6.02ms × (1.00,1.01)  ~ (p=0.320)
BenchmarkMandelbrot200-4            6.03ms × (1.00,1.01)    6.03ms × (1.00,1.01)  ~ (p=0.799)
BenchmarkGoParse                    8.63ms × (0.89,1.10)    8.58ms × (0.93,1.09)  ~ (p=0.667)
BenchmarkGoParse-2                  8.20ms × (0.97,1.04)    8.37ms × (0.97,1.04)  +1.96% (p=0.001)
BenchmarkGoParse-4                  8.00ms × (0.98,1.02)    8.14ms × (0.99,1.02)  +1.75% (p=0.000)
BenchmarkRegexpMatchEasy0_32         162ns × (1.00,1.01)     164ns × (0.98,1.04)  +1.35% (p=0.011)
BenchmarkRegexpMatchEasy0_32-2       161ns × (1.00,1.01)     161ns × (1.00,1.00)  ~ (p=0.185)
BenchmarkRegexpMatchEasy0_32-4       161ns × (1.00,1.00)     161ns × (1.00,1.00)  -0.19% (p=0.001)
BenchmarkRegexpMatchEasy0_1K         540ns × (0.99,1.02)     566ns × (0.98,1.04)  +4.98% (p=0.000)
BenchmarkRegexpMatchEasy0_1K-2       540ns × (0.99,1.01)     557ns × (0.99,1.01)  +3.21% (p=0.000)
BenchmarkRegexpMatchEasy0_1K-4       541ns × (0.99,1.01)     559ns × (0.99,1.01)  +3.26% (p=0.000)
BenchmarkRegexpMatchEasy1_32         139ns × (0.98,1.04)     139ns × (0.99,1.03)  ~ (p=0.979)
BenchmarkRegexpMatchEasy1_32-2       139ns × (0.99,1.04)     139ns × (0.99,1.02)  ~ (p=0.777)
BenchmarkRegexpMatchEasy1_32-4       139ns × (0.98,1.04)     139ns × (0.99,1.04)  ~ (p=0.771)
BenchmarkRegexpMatchEasy1_1K         890ns × (0.99,1.03)     885ns × (1.00,1.01)  -0.50% (p=0.004)
BenchmarkRegexpMatchEasy1_1K-2       888ns × (0.99,1.01)     885ns × (0.99,1.01)  -0.37% (p=0.004)
BenchmarkRegexpMatchEasy1_1K-4       890ns × (0.99,1.02)     884ns × (1.00,1.00)  -0.70% (p=0.000)
BenchmarkRegexpMatchMedium_32        252ns × (0.99,1.01)     251ns × (0.99,1.01)  ~ (p=0.081)
BenchmarkRegexpMatchMedium_32-2      254ns × (0.99,1.04)     252ns × (0.99,1.01)  -0.78% (p=0.027)
BenchmarkRegexpMatchMedium_32-4      253ns × (0.99,1.04)     252ns × (0.99,1.01)  -0.70% (p=0.022)
BenchmarkRegexpMatchMedium_1K       72.9µs × (0.99,1.01)    72.7µs × (1.00,1.00)  ~ (p=0.064)
BenchmarkRegexpMatchMedium_1K-2     74.1µs × (0.98,1.05)    72.9µs × (1.00,1.01)  -1.61% (p=0.001)
BenchmarkRegexpMatchMedium_1K-4     73.6µs × (0.99,1.05)    72.8µs × (1.00,1.00)  -1.13% (p=0.007)
BenchmarkRegexpMatchHard_32         3.88µs × (0.99,1.03)    3.92µs × (0.98,1.05)  ~ (p=0.143)
BenchmarkRegexpMatchHard_32-2       3.89µs × (0.99,1.03)    3.93µs × (0.98,1.09)  ~ (p=0.278)
BenchmarkRegexpMatchHard_32-4       3.90µs × (0.99,1.05)    3.93µs × (0.98,1.05)  ~ (p=0.252)
BenchmarkRegexpMatchHard_1K          118µs × (0.99,1.01)     117µs × (0.99,1.02)  -0.54% (p=0.003)
BenchmarkRegexpMatchHard_1K-2        118µs × (0.99,1.01)     118µs × (0.99,1.03)  ~ (p=0.581)
BenchmarkRegexpMatchHard_1K-4        118µs × (0.99,1.02)     117µs × (0.99,1.01)  -0.54% (p=0.002)
BenchmarkRevcomp                     991ms × (0.95,1.10)     989ms × (0.94,1.08)  ~ (p=0.879)
BenchmarkRevcomp-2                   978ms × (0.95,1.11)     962ms × (0.96,1.08)  ~ (p=0.257)
BenchmarkRevcomp-4                   979ms × (0.96,1.07)     974ms × (0.96,1.11)  ~ (p=0.678)
BenchmarkTemplate                    141ms × (0.99,1.02)     145ms × (0.99,1.02)  +2.75% (p=0.000)
BenchmarkTemplate-2                  135ms × (0.98,1.02)     138ms × (0.99,1.02)  +2.34% (p=0.000)
BenchmarkTemplate-4                  136ms × (0.98,1.02)     140ms × (0.99,1.02)  +2.71% (p=0.000)
BenchmarkTimeParse                   640ns × (0.99,1.01)     622ns × (0.99,1.01)  -2.88% (p=0.000)
BenchmarkTimeParse-2                 640ns × (0.99,1.01)     622ns × (1.00,1.00)  -2.81% (p=0.000)
BenchmarkTimeParse-4                 640ns × (1.00,1.01)     622ns × (0.99,1.01)  -2.82% (p=0.000)
BenchmarkTimeFormat                  730ns × (0.98,1.02)     731ns × (0.98,1.03)  ~ (p=0.767)
BenchmarkTimeFormat-2                709ns × (0.99,1.02)     707ns × (0.99,1.02)  ~ (p=0.347)
BenchmarkTimeFormat-4                717ns × (0.98,1.01)     718ns × (0.98,1.02)  ~ (p=0.793)

Change-Id: Ie779c47e912bf80eb918bafa13638bd8dfd6c2d9
Reviewed-on: https://go-review.googlesource.com/9406
Reviewed-by: Rick Hudson <rlh@golang.org>
2015-05-01 18:44:36 +00:00
Austin Clements
3ca20218c1 runtime: fix gcDumpObject on non-heap pointers
gcDumpObject is used to print the source and destination objects when
checkmark find a missing mark. However, gcDumpObject currently assumes
the given pointer will point to a heap object. This is not true of the
source object during root marking and may not even be true of the
destination object in the limited situations where the heap points
back in to the stack.

If the pointer isn't a heap object, gcDumpObject will attempt an
out-of-bounds access to h_spans. This will cause a panicslice, which
will attempt to construct a useful panic message. This will cause a
string allocation, which will lead mallocgc to panic because the GC is
in mark termination (checkmark only happens during mark termination).

Fix this by checking that the pointer points into the heap arena
before attempting to use it as an arena pointer.

Change-Id: I09da600c380d4773f1f8f38e45b82cb229ea6382
Reviewed-on: https://go-review.googlesource.com/9498
Reviewed-by: Rick Hudson <rlh@golang.org>
2015-04-30 14:53:51 +00:00
Austin Clements
63caec5dee runtime: eliminate one heapBitsForObject from scanobject
scanobject with ptrmask!=nil is only ever called with the base
pointer of a heap object. Currently, scanobject calls
heapBitsForObject, which goes to a great deal of trouble to check
that the pointer points into the heap and to find the base of the
object it points to, both of which are completely unnecessary in
this case.

Replace this call to heapBitsForObject with much simpler logic to
fetch the span and compute the heap bits.

Benchmark results with five runs:

name                                    old mean                new mean        delta
BenchmarkBinaryTree17              9.21s × (0.95,1.02)     8.55s × (0.91,1.03)  -7.16% (p=0.022)
BenchmarkFannkuch11                2.65s × (1.00,1.00)     2.62s × (1.00,1.00)  -1.10% (p=0.000)
BenchmarkFmtFprintfEmpty          73.2ns × (0.99,1.01)    71.7ns × (1.00,1.01)  -1.99% (p=0.004)
BenchmarkFmtFprintfString          302ns × (0.99,1.00)     292ns × (0.98,1.02)  -3.31% (p=0.020)
BenchmarkFmtFprintfInt             281ns × (0.98,1.01)     279ns × (0.96,1.02)  ~ (p=0.596)
BenchmarkFmtFprintfIntInt          482ns × (0.98,1.01)     488ns × (0.95,1.02)  ~ (p=0.419)
BenchmarkFmtFprintfPrefixedInt     382ns × (0.99,1.01)     365ns × (0.96,1.02)  -4.35% (p=0.015)
BenchmarkFmtFprintfFloat           475ns × (0.99,1.01)     472ns × (1.00,1.00)  ~ (p=0.108)
BenchmarkFmtManyArgs              1.89µs × (1.00,1.01)    1.90µs × (0.94,1.02)  ~ (p=0.883)
BenchmarkGobDecode                22.4ms × (0.99,1.01)    21.9ms × (0.92,1.04)  ~ (p=0.332)
BenchmarkGobEncode                24.7ms × (0.98,1.02)    23.9ms × (0.87,1.07)  ~ (p=0.407)
BenchmarkGzip                      397ms × (0.99,1.01)     398ms × (0.99,1.01)  ~ (p=0.718)
BenchmarkGunzip                   96.7ms × (1.00,1.00)    96.9ms × (1.00,1.00)  ~ (p=0.230)
BenchmarkHTTPClientServer         71.5µs × (0.98,1.01)    68.5µs × (0.92,1.06)  ~ (p=0.243)
BenchmarkJSONEncode               46.1ms × (0.98,1.01)    44.9ms × (0.98,1.03)  -2.51% (p=0.040)
BenchmarkJSONDecode               86.1ms × (0.99,1.01)    86.5ms × (0.99,1.01)  ~ (p=0.343)
BenchmarkMandelbrot200            4.12ms × (1.00,1.00)    4.13ms × (1.00,1.00)  +0.23% (p=0.000)
BenchmarkGoParse                  5.89ms × (0.96,1.03)    5.82ms × (0.96,1.04)  ~ (p=0.522)
BenchmarkRegexpMatchEasy0_32       141ns × (0.99,1.01)     142ns × (1.00,1.00)  ~ (p=0.178)
BenchmarkRegexpMatchEasy0_1K       408ns × (1.00,1.00)     392ns × (0.99,1.00)  -3.83% (p=0.000)
BenchmarkRegexpMatchEasy1_32       122ns × (1.00,1.00)     122ns × (1.00,1.00)  ~ (p=0.178)
BenchmarkRegexpMatchEasy1_1K       626ns × (1.00,1.01)     624ns × (0.99,1.00)  ~ (p=0.122)
BenchmarkRegexpMatchMedium_32      202ns × (0.99,1.00)     205ns × (0.99,1.01)  +1.58% (p=0.001)
BenchmarkRegexpMatchMedium_1K     54.4µs × (1.00,1.00)    55.5µs × (1.00,1.00)  +1.86% (p=0.000)
BenchmarkRegexpMatchHard_32       2.68µs × (1.00,1.00)    2.71µs × (1.00,1.00)  +0.97% (p=0.002)
BenchmarkRegexpMatchHard_1K       79.8µs × (1.00,1.01)    80.5µs × (1.00,1.01)  +0.94% (p=0.003)
BenchmarkRevcomp                   590ms × (0.99,1.01)     585ms × (1.00,1.00)  ~ (p=0.066)
BenchmarkTemplate                  111ms × (0.97,1.02)     112ms × (0.99,1.01)  ~ (p=0.201)
BenchmarkTimeParse                 392ns × (1.00,1.00)     385ns × (1.00,1.00)  -1.69% (p=0.000)
BenchmarkTimeFormat                449ns × (0.98,1.01)     448ns × (0.99,1.01)  ~ (p=0.550)

Change-Id: Ie7c3830c481d96c9043e7bf26853c6c1d05dc9f4
Reviewed-on: https://go-review.googlesource.com/9364
Reviewed-by: Rick Hudson <rlh@golang.org>
2015-04-28 15:22:20 +00:00
Austin Clements
33e0f3d853 runtime: fix some out of date comments and typos
Change-Id: I061057414c722c5a0f03c709528afc8554114db6
Reviewed-on: https://go-review.googlesource.com/9367
Reviewed-by: Rick Hudson <rlh@golang.org>
2015-04-27 20:08:38 +00:00
Austin Clements
bb6320535d runtime: replace STW for enabling write barriers with ragged barrier
Currently, we use a full stop-the-world around enabling write
barriers. This is to ensure that all Gs have enabled write barriers
before any blackening occurs (either in gcBgMarkWorker() or in
gcAssistAlloc()).

However, there's no need to bring the whole world to a synchronous
stop to ensure this. This change replaces the STW with a ragged
barrier that ensures each P has individually observed that write
barriers should be enabled before GC performs any blackening.

Change-Id: If2f129a6a55bd8bdd4308067af2b739f3fb41955
Reviewed-on: https://go-review.googlesource.com/8207
Reviewed-by: Russ Cox <rsc@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
2015-04-27 19:26:37 +00:00
Austin Clements
1b4025f4bd runtime: replace per-M workbuf cache with per-P gcWork cache
Currently, each M has a cache of the most recently used *workbuf. This
is used primarily by the write barrier so it doesn't have to access
the global workbuf lists on every write barrier. It's also used by
stack scanning because it's convenient.

This cache is important for write barrier performance, but this
particular approach has several downsides. It's faster than no cache,
but far from optimal (as the benchmarks below show). It's complex:
access to the cache is sprinkled through most of the workbuf list
operations and it requires special care to transform into and back out
of the gcWork cache that's actually used for scanning and marking. It
requires atomic exchanges to take ownership of the cached workbuf and
to return it to the M's cache even though it's almost always used by
only the current M. Since it's per-M, flushing these caches is O(# of
Ms), which may be high. And it has some significant subtleties: for
example, in general the cache shouldn't be used after the
harvestwbufs() in mark termination because it could hide work from
mark termination, but stack scanning can happen after this and *will*
use the cache (but it turns out this is okay because it will always be
followed by a getfull(), which drains the cache).

This change replaces this cache with a per-P gcWork object. This
gcWork cache can be used directly by scanning and marking (as long as
preemption is disabled, which is a general requirement of gcWork).
Since it's per-P, it doesn't require synchronization, which simplifies
things and means the only atomic operations in the write barrier are
occasionally fetching new work buffers and setting a mark bit if the
object isn't already marked. This cache can be flushed in O(# of Ps),
which is generally small. It follows a simple flushing rule: the cache
can be used during any phase, but during mark termination it must be
flushed before allowing preemption. This also makes the dispose during
mutator assist no longer necessary, which eliminates the vast majority
of gcWork dispose calls and reduces contention on the global workbuf
lists. And it's a lot faster on some benchmarks:

benchmark                          old ns/op       new ns/op       delta
BenchmarkBinaryTree17              11963668673     11206112763     -6.33%
BenchmarkFannkuch11                2643217136      2649182499      +0.23%
BenchmarkFmtFprintfEmpty           70.4            70.2            -0.28%
BenchmarkFmtFprintfString          364             307             -15.66%
BenchmarkFmtFprintfInt             317             282             -11.04%
BenchmarkFmtFprintfIntInt          512             483             -5.66%
BenchmarkFmtFprintfPrefixedInt     404             380             -5.94%
BenchmarkFmtFprintfFloat           521             479             -8.06%
BenchmarkFmtManyArgs               2164            1894            -12.48%
BenchmarkGobDecode                 30366146        22429593        -26.14%
BenchmarkGobEncode                 29867472        26663152        -10.73%
BenchmarkGzip                      391236616       396779490       +1.42%
BenchmarkGunzip                    96639491        96297024        -0.35%
BenchmarkHTTPClientServer          100110          70763           -29.31%
BenchmarkJSONEncode                51866051        52511382        +1.24%
BenchmarkJSONDecode                103813138       86094963        -17.07%
BenchmarkMandelbrot200             4121834         4120886         -0.02%
BenchmarkGoParse                   16472789        5879949         -64.31%
BenchmarkRegexpMatchEasy0_32       140             140             +0.00%
BenchmarkRegexpMatchEasy0_1K       394             394             +0.00%
BenchmarkRegexpMatchEasy1_32       120             120             +0.00%
BenchmarkRegexpMatchEasy1_1K       621             614             -1.13%
BenchmarkRegexpMatchMedium_32      209             202             -3.35%
BenchmarkRegexpMatchMedium_1K      54889           55175           +0.52%
BenchmarkRegexpMatchHard_32        2682            2675            -0.26%
BenchmarkRegexpMatchHard_1K        79383           79524           +0.18%
BenchmarkRevcomp                   584116718       584595320       +0.08%
BenchmarkTemplate                  125400565       109620196       -12.58%
BenchmarkTimeParse                 386             387             +0.26%
BenchmarkTimeFormat                580             447             -22.93%

(Best out of 10 runs. The delta of averages is similar.)

This also puts us in a good position to flush these caches when
nearing the end of concurrent marking, which will let us increase the
size of the work buffers while still controlling mark termination
pause time.

Change-Id: I2dd94c8517a19297a98ec280203cccaa58792522
Reviewed-on: https://go-review.googlesource.com/9178
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Russ Cox <rsc@golang.org>
2015-04-24 20:10:14 +00:00
Austin Clements
ce502b063c runtime: use park/ready to wake up GC at end of concurrent mark
Currently, the main GC goroutine sleeps on a note during concurrent
mark and the first background mark worker or assist to finish marking
use wakes up that note to let the main goroutine proceed into mark
termination. Unfortunately, the latency of this wakeup can be quite
high, since the GC goroutine will typically have lost its P while in
the futex sleep, meaning it will be placed on the global run queue and
will wait there until some P is kind enough to pick it up. This delay
gives the mutator more time to allocate and create floating garbage,
growing the heap unnecessarily. Worse, it's likely that background
marking has stopped at this point (unless GOMAXPROCS>4), so anything
that's allocated and published to the heap during this window will
have to be scanned during mark termination while the world is stopped.

This change replaces the note sleep/wakeup with a gopark/ready
scheme. This keeps the wakeup inside the Go scheduler and lets the
garbage collector take advantage of the new scheduler semantics that
run the ready()d goroutine immediately when the ready()ing goroutine
sleeps.

For the json benchmark from x/benchmarks with GOMAXPROCS=4, this
reduces the delay in waking up the GC goroutine and entering mark
termination once concurrent marking is done from ~100ms to typically
<100µs.

Change-Id: Ib11f8b581b8914f2d68e0094f121e49bac3bb384
Reviewed-on: https://go-review.googlesource.com/9291
Reviewed-by: Rick Hudson <rlh@golang.org>
Reviewed-by: Russ Cox <rsc@golang.org>
2015-04-24 15:13:01 +00:00
Rick Hudson
77f56af0bc runtime: Improve scanning performance
To achieve a 2% improvement in the garbage benchmark this CL removes
an unneeded assert and avoids one hbits.next() call per object
being scanned.

Change-Id: Ibd542d01e9c23eace42228886f9edc488354df0d
Reviewed-on: https://go-review.googlesource.com/9244
Reviewed-by: Austin Clements <austin@google.com>
2015-04-23 20:27:46 +00:00
Austin Clements
8d03acce54 runtime: multi-threaded, utilization-scheduled background mark
Currently, the concurrent mark phase is performed by the main GC
goroutine. Prior to the previous commit enabling preemption, this
caused marking to always consume 1/GOMAXPROCS of the available CPU
time. If GOMAXPROCS=1, this meant background GC would consume 100% of
the CPU (effectively a STW). If GOMAXPROCS>4, background GC would use
less than the goal of 25%. If GOMAXPROCS=4, background GC would use
the goal 25%, but if the mutator wasn't using the remaining 75%,
background marking wouldn't take advantage of the idle time. Enabling
preemption in the previous commit made GC miss CPU targets in
completely different ways, but set us up to bring everything back in
line.

This change replaces the fixed GC goroutine with per-P background mark
goroutines. Once started, these goroutines don't go in the standard
run queues; instead, they are scheduled specially such that the time
spent in mutator assists and the background mark goroutines totals 25%
of the CPU time available to the program. Furthermore, this lets
background marking take advantage of idle Ps, which significantly
boosts GC performance for applications that under-utilize the CPU.

This requires also changing how time is reported for gctrace, so this
change splits the concurrent mark CPU time into assist/background/idle
scanning.

This also requires increasing the size of the StackRecord slice used
in a GoroutineProfile test.

Change-Id: I0936ff907d2cee6cb687a208f2df47e8988e3157
Reviewed-on: https://go-review.googlesource.com/8850
Reviewed-by: Rick Hudson <rlh@golang.org>
2015-04-21 15:35:32 +00:00
Austin Clements
100da60979 runtime: track time spent in mutator assists
This time is tracked per P and periodically flushed to the global
controller state. This will be used to compute mutator assist
utilization in order to schedule background GC work.

Change-Id: Ib94f90903d426a02cf488bf0e2ef67a068eb3eec
Reviewed-on: https://go-review.googlesource.com/8837
Reviewed-by: Rick Hudson <rlh@golang.org>
2015-04-21 15:35:22 +00:00
Austin Clements
4b2fde945a runtime: proportional mutator assist
Currently, mutator allocation periodically assists the garbage
collector by performing a small, fixed amount of scanning work.
However, to control heap growth, mutators need to perform scanning
work *proportional* to their allocation rate.

This change implements proportional mutator assists. This uses the
scan work estimate computed by the garbage collector at the beginning
of each cycle to compute how much scan work must be performed per
allocation byte to complete the estimated scan work by the time the
heap reaches the goal size. When allocation triggers an assist, it
uses this ratio and the amount allocated since the last assist to
compute the assist work, then attempts to steal as much of this work
as possible from the background collector's credit, and then performs
any remaining scan work itself.

Change-Id: I98b2078147a60d01d6228b99afd414ef857e4fba
Reviewed-on: https://go-review.googlesource.com/8836
Reviewed-by: Rick Hudson <rlh@golang.org>
2015-04-21 15:35:18 +00:00
Austin Clements
028f972847 runtime: make gcDrainN in terms of scan work
Currently, the "n" in gcDrainN is in terms of objects to scan. This is
used by gchelpwork to perform a limited amount of work on allocation,
but is a pretty arbitrary way to bound this amount of work since the
number of objects has little relation to how long they take to scan.

Modify gcDrainN to perform a fixed amount of scan work instead. For
now, gchelpwork still performs a fairly arbitrary amount of scan work,
but at least this is much more closely related to how long the work
will take. Shortly, we'll use this to precisely control the scan work
performed by mutator assists during allocation to achieve the heap
size goal.

Change-Id: I3cd07fe0516304298a0af188d0ccdf621d4651cc
Reviewed-on: https://go-review.googlesource.com/8835
Reviewed-by: Rick Hudson <rlh@golang.org>
2015-04-21 15:35:14 +00:00
Austin Clements
8e24283a28 runtime: track background scan work credit
This tracks scan work done by background GC in a global pool. Mutator
assists will draw on this credit to avoid doing work when background
GC is staying ahead.

Unlike the other GC controller tracking variables, this will be both
written and read throughout the cycle. Hence, we can't arbitrarily
delay updates like we can for scan work and bytes marked. However, we
still want to minimize contention, so this global credit pool is
allowed some error from the "true" amount of credit. Background GC
accumulates credit locally up to a limit and only then flushes to the
global pool. Similarly, mutator assists will draw from the credit pool
in batches.

Change-Id: I1aa4fc604b63bf53d1ee2a967694dffdfc3e255e
Reviewed-on: https://go-review.googlesource.com/8834
Reviewed-by: Rick Hudson <rlh@golang.org>
2015-04-21 15:35:09 +00:00
Austin Clements
571ebae6ef runtime: track scan work performed during concurrent mark
This tracks the amount of scan work in terms of scanned pointers
during the concurrent mark phase. We'll use this information to
estimate scan work for the next cycle.

Currently this aggregates the work counter in gcWork and dispose
atomically aggregates this into a global work counter. dispose happens
relatively infrequently, so the contention on the global counter
should be low. If this turns out to be an issue, we can reduce the
number of disposes, and if it's still a problem, we can switch to
per-P counters.

Change-Id: Iac0364c466ee35fab781dbbbe7970a5f3c4e1fc1
Reviewed-on: https://go-review.googlesource.com/8832
Reviewed-by: Rick Hudson <rlh@golang.org>
2015-04-21 15:35:00 +00:00
Michael Hudson-Doyle
a1f57598cc runtime, cmd/internal/ld: rename themoduledata to firstmoduledata
'themoduledata' doesn't really make sense now we support multiple moduledata
objects.

Change-Id: I8263045d8f62a42cb523502b37289b0fba054f62
Reviewed-on: https://go-review.googlesource.com/8521
Reviewed-by: Ian Lance Taylor <iant@golang.org>
Run-TryBot: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2015-04-10 05:11:49 +00:00
Michael Hudson-Doyle
fae4a128cb runtime, reflect: support multiple moduledata objects
This changes all the places that consult themoduledata to consult a
linked list of moduledata objects, as will be necessary for
-linkshared to work.

Obviously, as there is as yet no way of adding moduledata objects to
this list, all this change achieves right now is wasting a few
instructions here and there.

Change-Id: I397af7f60d0849b76aaccedf72238fe664867051
Reviewed-on: https://go-review.googlesource.com/8231
Reviewed-by: Ian Lance Taylor <iant@golang.org>
Run-TryBot: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2015-04-10 04:51:42 +00:00
Austin Clements
50a66562a0 runtime: track heap bytes marked by GC
This tracks the number of heap bytes marked by a GC cycle. We'll use
this information to precisely trigger the next GC cycle.

Currently this aggregates the work counter in gcWork and dispose
atomically aggregates this into a global work counter. dispose happens
relatively infrequently, so the contention on the global counter
should be low. If this turns out to be an issue, we can reduce the
number of disposes, and if it's still a problem, we can switch to
per-P counters.

Change-Id: I1bc377cb2e802ef61c2968602b63146d52e7f5db
Reviewed-on: https://go-review.googlesource.com/8388
Reviewed-by: Russ Cox <rsc@golang.org>
2015-04-06 21:28:07 +00:00