This adds a GCCPUFraction field to MemStats that reports the
cumulative fraction of the program's execution time spent in the
garbage collector. This is equivalent to the utilization percent shown
in the gctrace output and makes this available programmatically.
This does make one small effect on the gctrace output: we now report
the duration of mark termination up to just before the final
start-the-world, rather than up to just after. However, unlike
stop-the-world, I don't believe there's any way that start-the-world
can block, so it should take negligible time.
While there are many statistics one might want to expose via MemStats,
this is one of the few that will undoubtedly remain meaningful
regardless of future changes to the memory system.
The diff for this change is larger than the actual change. Mostly it
lifts the code for computing the GC CPU utilization out of the
debug.gctrace path.
Updates #10323.
Change-Id: I0f7dc3fdcafe95e8d1233ceb79de606b48acd989
Reviewed-on: https://go-review.googlesource.com/12844
Reviewed-by: Russ Cox <rsc@golang.org>
Currently we only capture GC phase transition times if
debug.gctrace>0, but we're about to compute GC CPU utilization
regardless of whether debug.gctrace is set, so we need these
regardless of debug.gctrace.
Change-Id: If3acf16505a43d416e9a99753206f03287180660
Reviewed-on: https://go-review.googlesource.com/12843
Reviewed-by: Russ Cox <rsc@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
The scheduler, work buffer's dispose, and write barriers
can conspire to hide the a pointer from the GC's concurent
mark phase. If this pointer is the only path to a large
amount of marking the STW mark termination phase may take
a lot of time.
Consider the following:
1) dispose places a work buffer on the partial queue
2) the GC is busy so it does not immediately remove and
process the work buffer
3) the scheduler runs a mutator whose write barrier dequeues the
work buffer from the partial queue so the GC won't see it
This repeats until the GC reaches the mark termination
phase where the GC finally discovers the pointer along
with a lot of work to do.
This CL fixes the problem by having the mutator
dispose of the buffer to the full queue instead of
the partial queue. Since the write buffer never asks for full
buffers the conspiracy described above is not possible.
Updates #11694.
Change-Id: I2ce832f9657a7570f800e8ce4459cd9e304ef43b
Reviewed-on: https://go-review.googlesource.com/12840
Reviewed-by: Austin Clements <austin@google.com>
Currently we enter mark 2 by first flushing all existing gcWork caches
and then setting gcBlackenPromptly, which disables further gcWork
caching. However, if a worker or assist pulls a work buffer in to its
gcWork cache after that cache has been flushed but before caching is
disabled, that work may remain in that cache until mark termination.
If that work represents a heap bottleneck (e.g., a single pointer that
is the only way to reach a large amount of the heap), this can force
mark termination to do a large amount of work, resulting in a long
STW.
Fix this by reversing the order of these steps: first disable caching,
then flush all existing caches.
Rick Hudson <rlh> did the hard work of tracking this down. This CL
combined with CL 12672 and CL 12646 distills the critical parts of his
fix from CL 12539.
Fixes#11694.
Change-Id: Ib10d0a21e3f6170a80727d0286f9990df049fed2
Reviewed-on: https://go-review.googlesource.com/12688
Reviewed-by: Rick Hudson <rlh@golang.org>
Currently the GC coordinator enables GC assists at the same time it
enables background mark workers, after the concurrent scan phase is
done. However, this means a rapidly allocating mutator has the entire
scan phase during which to allocate beyond the heap trigger and
potentially beyond the heap goal with no back-pressure from assists.
This prevents the feedback system that's supposed to keep the heap
size under the heap goal from doing its job.
Fix this by enabling mutator assists during the scan phase. This is
safe because the write barrier is already enabled and globally
acknowledged at this point.
There's still a very small window between when the heap size reaches
the heap trigger and when the GC coordinator is able to stop the world
during which the mutator can allocate unabated. This allows *very*
rapidly allocator mutators like TestTraceStress to still occasionally
exceed the heap goal by a small amount (~20 MB at most for
TestTraceStress). However, this seems like a corner case.
Fixes#11677.
Change-Id: I0f80d949ec82341cd31ca1604a626efb7295a819
Reviewed-on: https://go-review.googlesource.com/12674
Reviewed-by: Russ Cox <rsc@golang.org>
Currently we clear both the mark 1 and mark 2 signals at the beginning
of concurrent mark. If either if these is clear, it acts as a signal
to the scheduler that it should start background workers. However,
this means that in the interim *between* mark 1 and mark 2, the
scheduler basically loops starting up new workers only to have them
return with nothing to do. In addition to harming performance and
delaying mutator work, this approach has a race where workers started
for mark 1 can mistakenly signal mark 2, causing it to complete
prematurely. This approach also interferes with starting assists
earlier to fix#11677.
Fix this by initially setting both mark 1 and mark 2 to "signaled".
The scheduler will not start background mark workers, though assists
can still run. When we're ready to enter mark 1, we clear the mark 1
signal and wait for it. Then, when we're ready to enter mark 2, we
clear the mark 2 signal and wait for it.
This structure also lets us deal cleanly with the situation where all
work is drained *prior* to the mark 2 wait, meaning that there may be
no workers to signal completion. Currently we deal with this using a
racy (and possibly incorrect) check for work in the coordinator itself
to skip the mark 2 wait if there's no work. This change makes the
coordinator unconditionally wait for mark completion and makes the
scheduler itself signal completion by slightly extending the logic it
already has to determine that there's no work and hence no use in
starting a new worker.
This is a prerequisite to fixing the remaining component of #11677,
which will require enabling assists during the scan phase. However, we
don't want to enable background workers until the mark phase because
they will compete with the scan. This change lets us use bgMark1 and
bgMark2 to indicate when it's okay to start background workers
independent of assists.
This is also a prerequisite to fixing #11694. It significantly reduces
the occurrence of long mark termination pauses in #11694 (from 64 out
of 1000 to 2 out of 1000 in one experiment).
Coincidentally, this also reduces the final heap size (and hence run
time) of TestTraceStress from ~100 MB and ~1.9 seconds to ~14 MB and
~0.4 seconds because it significantly shortens concurrent mark
duration.
Rick Hudson <rlh> did the hard work of tracking this down.
Change-Id: I12ea9ee2db9a0ae9d3a90dde4944a75fcf408f4c
Reviewed-on: https://go-review.googlesource.com/12672
Reviewed-by: Russ Cox <rsc@golang.org>
Currently it's possible for the GC assist to signal completion of the
mark phase, which puts the GC coordinator goroutine on the current P's
run queue, and then return to mutator code that delays until the next
forced preemption before actually yielding control to the GC
coordinator, dragging out completion of the mark phase. This delay can
be further exacerbated if the mutator makes other goroutines runnable
before yielding control, since this will push the GC coordinator on
the back of the P's run queue.
To fix this, this adds a Gosched to the assist if it completed the
mark phase. This immediately and directly yields control to the GC
coordinator. This already happens implicitly in the background mark
workers because they park immediately after completing the mark.
This is one of the reasons completion of the mark phase is being
dragged out and allowing the mutator to allocate without assisting,
leading to the large heap goal overshoot in issue #11677. This is also
a prerequisite to making the assist block when it can't pay off its
debt.
Change-Id: I586adfbecb3ca042a37966752c1dc757f5c7fc78
Reviewed-on: https://go-review.googlesource.com/12670
Reviewed-by: Russ Cox <rsc@golang.org>
Currently fractional and idle mark workers dispose of their gcWork
cache during mark 2 after incrementing work.nwait and after checking
whether there are any workers or any work available. This creates a
window for two races:
1) If the only remaining work is in this worker's gcWork cache, it
will see that there are no more workers and no more work on the
global lists (since it has not yet flushed its own cache) and
prematurely signal mark 2 completion.
2) After this worker has incremented work.nwait but before it has
flushed its cache, another worker may observe that there are no
more workers and no more work and prematurely signal mark 2
completion.
We can fix both of these by simply moving the cache flush above the
increment of nwait and the test of the completion condition.
This is probably contributing to #11694, though this alone is not
enough to fix it.
Change-Id: Idcf9656e5c460c5ea0d23c19c6c51e951f7716c3
Reviewed-on: https://go-review.googlesource.com/12646
Reviewed-by: Russ Cox <rsc@golang.org>
Currently the gctrace output reports the trigger heap size, rather
than the actual heap size at the beginning of GC. Often these are the
same, or at least very close. However, it's possible for the heap to
already have exceeded this trigger when we first check the trigger and
start GC; in this case, this output is very misleading. We've
encountered this confusion a few times when debugging and this
behavior is difficult to document succinctly.
Change the gctrace output to report the actual heap size when GC
starts, rather than the trigger.
Change-Id: I246b3ccae4c4c7ea44c012e70d24a46878d7601f
Reviewed-on: https://go-review.googlesource.com/12452
Reviewed-by: Russ Cox <rsc@golang.org>
Whenever someone pastes gctrace output into GitHub, it helpfully turns
the GC cycle number into a link to some unrelated issue. Prevent this
by removing the pound before the cycle number. The fact that this is a
cycle number is probably more obvious at a glance than most of the
other numbers.
Change-Id: Ifa5fc7fe6c715eac50e639f25bc36c81a132ffea
Reviewed-on: https://go-review.googlesource.com/12413
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Reviewed-by: Russ Cox <rsc@golang.org>
The runtime.GC documentation was rewritten in df2809f to make it clear
that it blocks until GC is complete, but the re-rewrite in ed9a4c9 and
e28a679 lost this property when clarifying that it may also block the
entire program and not just the caller.
Try to arrive at wording that conveys both of these properties.
Change-Id: I1e255322aa28a21a548556ecf2a44d8d8ac524ef
Reviewed-on: https://go-review.googlesource.com/12392
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Reviewed-by: Rob Pike <r@golang.org>
We shouldn't guarantee this behavior, but suggest it's possible.
Change-Id: I4c2afb48b99be4d91537306d3337171a13c9990a
Reviewed-on: https://go-review.googlesource.com/12346
Reviewed-by: David Crawshaw <crawshaw@golang.org>
No code changes. Just make it clear that runtime.GC is not concurrent.
Change-Id: I00a99ebd26402817c665c9a128978cef19f037be
Reviewed-on: https://go-review.googlesource.com/12345
Reviewed-by: Dave Cheney <dave@cheney.net>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
These memstats are currently being computed by gcMark, which was
appropriate in Go 1.4, but gcMark is now just one part of a bigger
picture. In particular, it can't account for the sweep termination
pause time, it can't account for all of the mark termination pause
time, and the reported "pause end" and "last GC" times will be
slightly earlier than they really are.
Lift computing of these statistics into func gc, which has the
appropriate visibility into the process to compute them correctly.
Fixes one of the issues in #10323. This does not add new statistics
appropriate to the concurrent collector; it simply fixes existing
statistics that are being misreported.
Change-Id: I670cb16594a8641f6b27acf4472db15b6e8e086e
Reviewed-on: https://go-review.googlesource.com/11794
Reviewed-by: Russ Cox <rsc@golang.org>
Currently we report MemStats.PauseEnd in nanoseconds, but with no
particular 0 time. On Linux, the 0 time is when the host started. On
Darwin, it's the UNIX epoch. This is also inconsistent with the other
absolute time in MemStats, LastGC, which is always reported in
nanoseconds since 1970.
Fix PauseEnd so it's always reported in nanoseconds since 1970, like
LastGC.
Fixes one of the issues raised in #10323.
Change-Id: Ie2fe3169d45113992363a03b764f4e6c47e5c6a8
Reviewed-on: https://go-review.googlesource.com/11801
Run-TryBot: Austin Clements <austin@google.com>
Reviewed-by: Russ Cox <rsc@golang.org>
The one in misc/makerelease/makerelease.go is particularly bad and
probably warrants rotating our keys.
I didn't update old weekly notes, and reverted some changes involving
test code for now, since we're late in the Go 1.5 freeze. Otherwise,
the rest are all auto-generated changes, and all manually reviewed.
Change-Id: Ia2753576ab5d64826a167d259f48a2f50508792d
Reviewed-on: https://go-review.googlesource.com/12048
Reviewed-by: Rob Pike <r@golang.org>
Currently we fail to reset the live heap accounting state before the
checkmark mark and before the gctrace=2 extra mark. As a result, if
either are enabled, at the end of GC it thinks there are 0 bytes of
live heap, which causes the GC controller to initiate a new GC
immediately, regardless of the true heap size.
Fix this by factoring this state reset into a function and calling it
before all three possible marks.
This function should be merged with gcResetGState, but doing so
requires some additional cleanup, so it will wait for after the
freeze. Filed #11427 for this cleanup.
Fixes#10492.
Change-Id: Ibe46348916fc8368fac6f086e142815c970a6f4d
Reviewed-on: https://go-review.googlesource.com/11561
Reviewed-by: Russ Cox <rsc@golang.org>
Memory for stacks is manually managed by the runtime and, currently
(with one exception) we free stack spans immediately when the last
stack on a span is freed. However, the garbage collector assumes that
spans can never transition from non-free to free during scan or mark.
This disagreement makes it possible for the garbage collector to mark
uninitialized objects and is blocking us from re-enabling the bad
pointer test in the garbage collector (issue #9880).
For example, the following sequence will result in marking an
uninitialized object:
1. scanobject loads a pointer slot out of the object it's scanning.
This happens to be one of the special pointers from the heap into a
stack. Call the pointer p and suppose it points into X's stack.
2. X, running on another thread, grows its stack and frees its old
stack.
3. The old stack happens to be large or was the last stack in its
span, so X frees this span, setting it to state _MSpanFree.
4. The span gets reused as a heap span.
5. scanobject calls heapBitsForObject, which loads the span containing
p, which is now in state _MSpanInUse, but doesn't necessarily have
an object at p. The not-object at p gets marked, and at this point
all sorts of things can go wrong.
We already have a partial solution to this. When shrinking a stack, we
put the old stack on a queue to be freed at the end of garbage
collection. This was done to address exactly this problem, but wasn't
a complete solution.
This commit generalizes this solution to both shrinking and growing
stacks. For stacks that fit in the stack pool, we simply don't free
the span, even if its reference count reaches zero. It's fine to reuse
the span for other stacks, and this enables that. At the end of GC, we
sweep for cached stack spans with a zero reference count and free
them. For larger stacks, we simply queue the stack span to be freed at
the end of GC. Ideally, we would reuse these large stack spans the way
we can small stack spans, but that's a more invasive change that will
have to wait until after the freeze.
Fixes#11267.
Change-Id: Ib7f2c5da4845cc0268e8dc098b08465116972a71
Reviewed-on: https://go-review.googlesource.com/11502
Reviewed-by: Russ Cox <rsc@golang.org>
We don't use this state. _GCoff means we're sweeping in the
background. This makes it clear in the next commit that _GCoff and
only _GCoff means sweeping.
Change-Id: I416324a829ba0be3794a6cf3cf1655114cb6e47c
Reviewed-on: https://go-review.googlesource.com/11501
Reviewed-by: Rick Hudson <rlh@golang.org>
Reviewed-by: Russ Cox <rsc@golang.org>
Some latency regressions have crept into our system over the past few
weeks. This CL fixes those by having the mark phase more aggressively
blacken objects so that the mark termination phase, a STW phase, has less
work to do. Three approaches were taken when the mark phase believes
it has no more work to do, ie all the work buffers are empty.
If things have gone well the mark phase is correct and there is
in fact little or no work. In that case the following items will
take very little time. If the mark phase is wrong this CL will
ferret that work out and give the mark phase a chance to deal with
it concurrently before mark termination begins.
When the mark phase first appears to be out of work, it does three things:
1) It switches from allocating white to allocating black to reduce the
number of unmarked objects reachable only from stacks.
2) It flushes and disables per-P GC work caches so all work must be in
globally visible work buffers.
3) It rescans the global roots---the BSS and data segments---so there
are fewer objects to blacken during mark termination. We do not rescan
stacks at this point, though that could be done in a later CL.
After these steps, it again drains the global work buffers.
On a lightly loaded machine the garbage benchmark has reduced the
number of GC cycles with latency > 10 ms from 83 out of 4083 cycles
down to 2 out of 3995 cycles. Maximum latency was reduced from
60+ msecs down to 20 ms.
Change-Id: I152285b48a7e56c5083a02e8e4485dd39c990492
Reviewed-on: https://go-review.googlesource.com/10590
Reviewed-by: Austin Clements <austin@google.com>
This fixes a hang during runtime.TestTraceStress.
It also fixes double-scan of stacks, which leads to
stack barrier installation failures.
Both of these have shown up as flaky failures on the dashboard.
Fixes#10941.
Change-Id: Ia2a5991ce2c9f43ba06ae1c7032f7c898dc990e0
Reviewed-on: https://go-review.googlesource.com/11089
Reviewed-by: Austin Clements <austin@google.com>
While we're here, update the documentation and delete variables with no effect.
Change-Id: I4df0d266dff880df61b488ed547c2870205862f0
Reviewed-on: https://go-review.googlesource.com/10790
Reviewed-by: Austin Clements <austin@google.com>
These were found by grepping the comments from the go code and feeding
the output to aspell.
Change-Id: Id734d6c8d1938ec3c36bd94a4dbbad577e3ad395
Reviewed-on: https://go-review.googlesource.com/10941
Reviewed-by: Aamir Khan <syst3m.w0rm@gmail.com>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Currently, write barriers are only enabled after completion of the
concurrent scan phase, as we enter the concurrent mark phase. However,
stack barriers are installed during the scan phase and assume that
write barriers will track changes to frames above the stack
barriers. Since write barriers aren't enabled until after stack
barriers are installed, we may miss modifications to the stack that
happen after installing the stack barriers and before enabling write
barriers.
Fix this by enabling write barriers during the scan phase.
This commit intentionally makes the minimal change to do this (there's
only one line of code change; the rest are comment changes). At the
very least, we should consider eliminating the ragged barrier that's
intended to synchronize the enabling of write barriers, but now just
wastes time. I've included a large comment about extensions and
alternative designs.
Change-Id: Ib20fede794e4fcb91ddf36f99bd97344d7f96421
Reviewed-on: https://go-review.googlesource.com/10795
Reviewed-by: Russ Cox <rsc@golang.org>
Currently checkmarks mode fails to rescan stacks because it sees the
leftover state bits indicating that the stacks haven't changed since
the last scan. As a result, it won't detect lost marks caused by
failing to scan stacks correctly during regular garbage collection.
Fix this by marking all stacks dirty before performing the checkmark
phase.
Change-Id: I1f06882bb8b20257120a4b8e7f95bb3ffc263895
Reviewed-on: https://go-review.googlesource.com/10794
Reviewed-by: Russ Cox <rsc@golang.org>
Currently the GODEBUG=gctrace=1 trace line includes "@n.nnns" to
indicate the time that the GC cycle ended relative to the time the
program started. This was meant to be consistent with the utilization
as of the end of the cycle, which is printed next on the trace line,
but it winds up just being confusing and unexpected.
Change the trace line to include the time that the GC cycle started
relative to the time the program started.
Change-Id: I7d64580cd696eb17540716d3e8a74a9d6ae50650
Reviewed-on: https://go-review.googlesource.com/10634
Reviewed-by: Rick Hudson <rlh@golang.org>
Reviewed-by: Russ Cox <rsc@golang.org>
The stack barrier code will need a bookkeeping structure to keep track
of the overwritten return PCs. This commit introduces and allocates
this structure, but does not yet use the structure.
We don't want to allocate space for this structure during garbage
collection, so this commit allocates it along with the allocation of
the corresponding stack. However, we can't do a regular allocation in
newstack because mallocgc may itself grow the stack (which would lead
to a recursive allocation). Hence, this commit makes the bookkeeping
structure part of the stack allocation itself by stealing the
necessary space from the top of the stack allocation. Since the size
of this bookkeeping structure is logarithmic in the size of the stack,
this has minimal impact on stack behavior.
Change-Id: Ia14408be06aafa9ca4867f4e70bddb3fe0e96665
Reviewed-on: https://go-review.googlesource.com/10313
Reviewed-by: Russ Cox <rsc@golang.org>
Currently we truncate gctrace clock and CPU times to millisecond
precision. As a result, many phases are typically printed as 0, which
is fine for user consumption, but makes gathering statistics and
reports over GC traces difficult.
In 1.4, the gctrace line printed times in microseconds. This was
better for statistics, but not as easy for users to read or interpret,
and it generally made the trace lines longer.
This change strikes a balance between these extremes by printing
milliseconds, but including the decimal part to two significant
figures down to microsecond precision. This remains easy to read and
interpret, but includes more precision when it's useful.
For example, where the code currently prints,
gc #29 @1.629s 0%: 0+2+0+12+0 ms clock, 0+2+0+0/12/0+0 ms cpu, 4->4->2 MB, 4 MB goal, 1 P
this prints,
gc #29 @1.629s 0%: 0.005+2.1+0+12+0.29 ms clock, 0.005+2.1+0+0/12/0+0.29 ms cpu, 4->4->2 MB, 4 MB goal, 1 P
Fixes#10970.
Change-Id: I249624779433927cd8b0947b986df9060c289075
Reviewed-on: https://go-review.googlesource.com/10554
Reviewed-by: Russ Cox <rsc@golang.org>
runtime.GC() is intentionally very weakly specified. However, it is so
weakly specified that it's difficult to know that it's being used
correctly for its one intended use case: to ensure garbage collection
has run in a test that is garbage-sensitive. In particular, it is
unclear whether it is synchronous or asynchronous. In the old STW
collector this was essentially self-evident; short of queuing up a
garbage collection to run later, it had to be synchronous. However,
with the concurrent collector, there's evidence that people are
inferring that it may be asynchronous (e.g., issue #10986), as this is
both unclear in the documentation and possible in the implementation.
In fact, runtime.GC() runs a fully synchronous STW collection. We
probably don't want to commit to this exact behavior. But we can
commit to the essential property that tests rely on: that runtime.GC()
does not return until the GC has finished.
Change-Id: Ifc3045a505e1898ecdbe32c1f7e80e2e9ffacb5b
Reviewed-on: https://go-review.googlesource.com/10488
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
This is dead code. If you want to quiesce the system the
preferred way is to use forEachP(func(*p){}).
Change-Id: Ic7677a5dd55e3639b99e78ddeb2c71dd1dd091fa
Reviewed-on: https://go-review.googlesource.com/10267
Reviewed-by: Austin Clements <austin@google.com>
Prior to this CL whenever the GC marking was enabled and
a P was looking for work we supplied a G to help
the GC do its marking tasks. Once this G finished all
the marking available it would release the P to find another
available G. In the case where there was no work the P would drop
into findrunnable which would execute the mark helper G which would
immediately return and the P would drop into findrunnable again repeating
the process. Since the P was always given a G to run it never blocks.
This CL first checks if the GC mark helper G has available work and if
not the P immediately falls through to its blocking logic.
Fixes#10901
Change-Id: I94ac9646866ba64b7892af358888bc9950de23b5
Reviewed-on: https://go-review.googlesource.com/10189
Reviewed-by: Austin Clements <austin@google.com>
Currently setGCPercent sets heapminimum to heapminimum*GOGC/100. The
real intent is to set heapminimum to a scaled multiple of a fixed
default heap minimum, not to scale heapminimum based on its current
value. This turns out to be okay because setGCPercent is only called
once and heapminimum is initially set to this default heap minimum.
However, the code as written is confusing, especially since
setGCPercent is otherwise written so it could be called again to
change GOGC. Fix this by introducing a defaultHeapMinimum constant and
using this instead of the current value of heapminimum to compute the
scaled heap minimum.
As part of this, this commit improves the documentation on
heapminimum.
Change-Id: I4eb82c73dc2eb44a6e5a17c780a747a2e73d7493
Reviewed-on: https://go-review.googlesource.com/10181
Reviewed-by: Russ Cox <rsc@golang.org>
Currently, startTheWorld releases worldsema before starting the
world. Since startTheWorld can change gomaxprocs after allowing Ps to
run, this means that gomaxprocs can change while another P holds
worldsema.
Unfortunately, the garbage collector and forEachP assume that holding
worldsema protects against changes in gomaxprocs (which it *almost*
does). In particular, this is causing somewhat frequent "P did not run
fn" crashes in forEachP in the runtime tests because gomaxprocs is
changing between the several loops that forEachP does over all the Ps.
Fix this by only releasing worldsema after the world is started.
This relates to issue #10618. forEachP still fails under stress
testing, but much less frequently.
Change-Id: I085d627b70cca9ebe9af28fe73b9872f1bb224ff
Reviewed-on: https://go-review.googlesource.com/10156
Reviewed-by: Russ Cox <rsc@golang.org>
There are several steps to stopping and starting the world and
currently they're open-coded in several places. The garbage collector
is the only thing that needs to stop and start the world in a
non-trivial pattern. Replace all other uses with calls to higher-level
functions that implement the entire pattern necessary to stop and
start the world.
This is a pure refectoring and should not change any code semantics.
In the following commits, we'll make changes that are easier to do
with this abstraction in place.
This commit renames the old starttheworld to startTheWorldWithSema.
This is a slight misnomer right now because the callers release
worldsema just before calling this. However, a later commit will swap
these and I don't want to think of another name in the mean time.
Change-Id: I5dc97f87b44fb98963c49c777d7053653974c911
Reviewed-on: https://go-review.googlesource.com/10154
Reviewed-by: Russ Cox <rsc@golang.org>
In order to avoid deadlocks, startGC avoids kicking off GC if locks
are held by the calling M. However, it currently fails to check
preemptoff, which is the other way to disable preemption.
Fix this by adding a check for preemptoff.
Change-Id: Ie1083166e5ba4af5c9d6c5a42efdfaaef41ca997
Reviewed-on: https://go-review.googlesource.com/10153
Reviewed-by: Russ Cox <rsc@golang.org>
Small types record the location of pointers in their memory layout
by using a simple bitmap. In Go 1.4 the bitmap held 4-bit entries,
and in Go 1.5 the bitmap holds 1-bit entries, but in both cases using
a bitmap for a large type containing arrays does not make sense:
if someone refers to the type [1<<28]*byte in a program in such
a way that the type information makes it into the binary, it would be
a waste of space to write a 128 MB (for 4-bit entries) or even 32 MB
(for 1-bit entries) bitmap full of 1s into the binary or even to keep
one in memory during the execution of the program.
For large types containing arrays, it is much more compact to describe
the locations of pointers using a notation that can express repetition
than to lay out a bitmap of pointers. Go 1.4 included such a notation,
called ``GC programs'' but it was complex, required recursion during
decoding, and was generally slow. Dmitriy measured the execution of
these programs writing directly to the heap bitmap as being 7x slower
than copying from a preunrolled 4-bit mask (and frankly that code was
not terribly fast either). For some tests, unrollgcprog1 was seen costing
as much as 3x more than the rest of malloc combined.
This CL introduces a different form for the GC programs. They use a
simple Lempel-Ziv-style encoding of the 1-bit pointer information,
in which the only operations are (1) emit the following n bits
and (2) repeat the last n bits c more times. This encoding can be
generated directly from the Go type information (using repetition
only for arrays or large runs of non-pointer data) and it can be decoded
very efficiently. In particular the decoding requires little state and
no recursion, so that the entire decoding can run without any memory
accesses other than the reads of the encoding and the writes of the
decoded form to the heap bitmap. For recursive types like arrays of
arrays of arrays, the inner instructions are only executed once, not
n times, so that large repetitions run at full speed. (In contrast, large
repetitions in the old programs repeated the individual bit-level layout
of the inner data over and over.) The result is as much as 25x faster
decoding compared to the old form.
Because the old decoder was so slow, Go 1.4 had three (or so) cases
for how to set the heap bitmap bits for an allocation of a given type:
(1) If the type had an even number of words up to 32 words, then
the 4-bit pointer mask for the type fit in no more than 16 bytes;
store the 4-bit pointer mask directly in the binary and copy from it.
(1b) If the type had an odd number of words up to 15 words, then
the 4-bit pointer mask for the type, doubled to end on a byte boundary,
fit in no more than 16 bytes; store that doubled mask directly in the
binary and copy from it.
(2) If the type had an even number of words up to 128 words,
or an odd number of words up to 63 words (again due to doubling),
then the 4-bit pointer mask would fit in a 64-byte unrolled mask.
Store a GC program in the binary, but leave space in the BSS for
the unrolled mask. Execute the GC program to construct the mask the
first time it is needed, and thereafter copy from the mask.
(3) Otherwise, store a GC program and execute it to write directly to
the heap bitmap each time an object of that type is allocated.
(This is the case that was 7x slower than the other two.)
Because the new pointer masks store 1-bit entries instead of 4-bit
entries and because using the decoder no longer carries a significant
overhead, after this CL (that is, for Go 1.5) there are only two cases:
(1) If the type is 128 words or less (no condition about odd or even),
store the 1-bit pointer mask directly in the binary and use it to
initialize the heap bitmap during malloc. (Implemented in CL 9702.)
(2) There is no case 2 anymore.
(3) Otherwise, store a GC program and execute it to write directly to
the heap bitmap each time an object of that type is allocated.
Executing the GC program directly into the heap bitmap (case (3) above)
was disabled for the Go 1.5 dev cycle, both to avoid needing to use
GC programs for typedmemmove and to avoid updating that code as
the heap bitmap format changed. Typedmemmove no longer uses this
type information; as of CL 9886 it uses the heap bitmap directly.
Now that the heap bitmap format is stable, we reintroduce GC programs
and their space savings.
Benchmarks for heapBitsSetType, before this CL vs this CL:
name old mean new mean delta
SetTypePtr 7.59ns × (0.99,1.02) 5.16ns × (1.00,1.00) -32.05% (p=0.000)
SetTypePtr8 21.0ns × (0.98,1.05) 21.4ns × (1.00,1.00) ~ (p=0.179)
SetTypePtr16 24.1ns × (0.99,1.01) 24.6ns × (1.00,1.00) +2.41% (p=0.001)
SetTypePtr32 31.2ns × (0.99,1.01) 32.4ns × (0.99,1.02) +3.72% (p=0.001)
SetTypePtr64 45.2ns × (1.00,1.00) 47.2ns × (1.00,1.00) +4.42% (p=0.000)
SetTypePtr126 75.8ns × (0.99,1.01) 79.1ns × (1.00,1.00) +4.25% (p=0.000)
SetTypePtr128 74.3ns × (0.99,1.01) 77.6ns × (1.00,1.01) +4.55% (p=0.000)
SetTypePtrSlice 726ns × (1.00,1.01) 712ns × (1.00,1.00) -1.95% (p=0.001)
SetTypeNode1 20.0ns × (0.99,1.01) 20.7ns × (1.00,1.00) +3.71% (p=0.000)
SetTypeNode1Slice 112ns × (1.00,1.00) 113ns × (0.99,1.00) ~ (p=0.070)
SetTypeNode8 23.9ns × (1.00,1.00) 24.7ns × (1.00,1.01) +3.18% (p=0.000)
SetTypeNode8Slice 294ns × (0.99,1.02) 287ns × (0.99,1.01) -2.38% (p=0.015)
SetTypeNode64 52.8ns × (0.99,1.03) 51.8ns × (0.99,1.01) ~ (p=0.069)
SetTypeNode64Slice 1.13µs × (0.99,1.05) 1.14µs × (0.99,1.00) ~ (p=0.767)
SetTypeNode64Dead 36.0ns × (1.00,1.01) 32.5ns × (0.99,1.00) -9.67% (p=0.000)
SetTypeNode64DeadSlice 1.43µs × (0.99,1.01) 1.40µs × (1.00,1.00) -2.39% (p=0.001)
SetTypeNode124 75.7ns × (1.00,1.01) 79.0ns × (1.00,1.00) +4.44% (p=0.000)
SetTypeNode124Slice 1.94µs × (1.00,1.01) 2.04µs × (0.99,1.01) +4.98% (p=0.000)
SetTypeNode126 75.4ns × (1.00,1.01) 77.7ns × (0.99,1.01) +3.11% (p=0.000)
SetTypeNode126Slice 1.95µs × (0.99,1.01) 2.03µs × (1.00,1.00) +3.74% (p=0.000)
SetTypeNode128 85.4ns × (0.99,1.01) 122.0ns × (1.00,1.00) +42.89% (p=0.000)
SetTypeNode128Slice 2.20µs × (1.00,1.01) 2.36µs × (0.98,1.02) +7.48% (p=0.001)
SetTypeNode130 83.3ns × (1.00,1.00) 123.0ns × (1.00,1.00) +47.61% (p=0.000)
SetTypeNode130Slice 2.30µs × (0.99,1.01) 2.40µs × (0.98,1.01) +4.37% (p=0.000)
SetTypeNode1024 498ns × (1.00,1.00) 537ns × (1.00,1.00) +7.96% (p=0.000)
SetTypeNode1024Slice 15.5µs × (0.99,1.01) 17.8µs × (1.00,1.00) +15.27% (p=0.000)
The above compares always using a cached pointer mask (and the
corresponding waste of memory) against using the programs directly.
Some slowdown is expected, in exchange for having a better general algorithm.
The GC programs kick in for SetTypeNode128, SetTypeNode130, SetTypeNode1024,
along with the slice variants of those.
It is possible that the cutoff of 128 words (bits) should be raised
in a followup CL, but even with this low cutoff the GC programs are
faster than Go 1.4's "fast path" non-GC program case.
Benchmarks for heapBitsSetType, Go 1.4 vs this CL:
name old mean new mean delta
SetTypePtr 6.89ns × (1.00,1.00) 5.17ns × (1.00,1.00) -25.02% (p=0.000)
SetTypePtr8 25.8ns × (0.97,1.05) 21.5ns × (1.00,1.00) -16.70% (p=0.000)
SetTypePtr16 39.8ns × (0.97,1.02) 24.7ns × (0.99,1.01) -37.81% (p=0.000)
SetTypePtr32 68.8ns × (0.98,1.01) 32.2ns × (1.00,1.01) -53.18% (p=0.000)
SetTypePtr64 130ns × (1.00,1.00) 47ns × (1.00,1.00) -63.67% (p=0.000)
SetTypePtr126 241ns × (0.99,1.01) 79ns × (1.00,1.01) -67.25% (p=0.000)
SetTypePtr128 2.07µs × (1.00,1.00) 0.08µs × (1.00,1.00) -96.27% (p=0.000)
SetTypePtrSlice 1.05µs × (0.99,1.01) 0.72µs × (0.99,1.02) -31.70% (p=0.000)
SetTypeNode1 16.0ns × (0.99,1.01) 20.8ns × (0.99,1.03) +29.91% (p=0.000)
SetTypeNode1Slice 184ns × (0.99,1.01) 112ns × (0.99,1.01) -39.26% (p=0.000)
SetTypeNode8 29.5ns × (0.97,1.02) 24.6ns × (1.00,1.00) -16.50% (p=0.000)
SetTypeNode8Slice 624ns × (0.98,1.02) 285ns × (1.00,1.00) -54.31% (p=0.000)
SetTypeNode64 135ns × (0.96,1.08) 52ns × (0.99,1.02) -61.32% (p=0.000)
SetTypeNode64Slice 3.83µs × (1.00,1.00) 1.14µs × (0.99,1.01) -70.16% (p=0.000)
SetTypeNode64Dead 134ns × (0.99,1.01) 32ns × (1.00,1.01) -75.74% (p=0.000)
SetTypeNode64DeadSlice 3.83µs × (0.99,1.00) 1.40µs × (1.00,1.01) -63.42% (p=0.000)
SetTypeNode124 240ns × (0.99,1.01) 79ns × (1.00,1.01) -67.05% (p=0.000)
SetTypeNode124Slice 7.27µs × (1.00,1.00) 2.04µs × (1.00,1.00) -71.95% (p=0.000)
SetTypeNode126 2.06µs × (0.99,1.01) 0.08µs × (0.99,1.01) -96.23% (p=0.000)
SetTypeNode126Slice 64.4µs × (1.00,1.00) 2.0µs × (1.00,1.00) -96.85% (p=0.000)
SetTypeNode128 2.09µs × (1.00,1.01) 0.12µs × (1.00,1.00) -94.15% (p=0.000)
SetTypeNode128Slice 65.4µs × (1.00,1.00) 2.4µs × (0.99,1.03) -96.39% (p=0.000)
SetTypeNode130 2.11µs × (1.00,1.00) 0.12µs × (1.00,1.00) -94.18% (p=0.000)
SetTypeNode130Slice 66.3µs × (1.00,1.00) 2.4µs × (0.97,1.08) -96.34% (p=0.000)
SetTypeNode1024 16.0µs × (1.00,1.01) 0.5µs × (1.00,1.00) -96.65% (p=0.000)
SetTypeNode1024Slice 512µs × (1.00,1.00) 18µs × (0.98,1.04) -96.45% (p=0.000)
SetTypeNode124 uses a 124 data + 2 ptr = 126-word allocation.
Both Go 1.4 and this CL are using pointer bitmaps for this case,
so that's an overall 3x speedup for using pointer bitmaps.
SetTypeNode128 uses a 128 data + 2 ptr = 130-word allocation.
Both Go 1.4 and this CL are running the GC program for this case,
so that's an overall 17x speedup when using GC programs (and
I've seen >20x on other systems).
Comparing Go 1.4's SetTypeNode124 (pointer bitmap) against
this CL's SetTypeNode128 (GC program), the slow path in the
code in this CL is 2x faster than the fast path in Go 1.4.
The Go 1 benchmarks are basically unaffected compared to just before this CL.
Go 1 benchmarks, before this CL vs this CL:
name old mean new mean delta
BinaryTree17 5.87s × (0.97,1.04) 5.91s × (0.96,1.04) ~ (p=0.306)
Fannkuch11 4.38s × (1.00,1.00) 4.37s × (1.00,1.01) -0.22% (p=0.006)
FmtFprintfEmpty 90.7ns × (0.97,1.10) 89.3ns × (0.96,1.09) ~ (p=0.280)
FmtFprintfString 282ns × (0.98,1.04) 287ns × (0.98,1.07) +1.72% (p=0.039)
FmtFprintfInt 269ns × (0.99,1.03) 282ns × (0.97,1.04) +4.87% (p=0.000)
FmtFprintfIntInt 478ns × (0.99,1.02) 481ns × (0.99,1.02) +0.61% (p=0.048)
FmtFprintfPrefixedInt 399ns × (0.98,1.03) 400ns × (0.98,1.05) ~ (p=0.533)
FmtFprintfFloat 563ns × (0.99,1.01) 570ns × (1.00,1.01) +1.37% (p=0.000)
FmtManyArgs 1.89µs × (0.99,1.01) 1.92µs × (0.99,1.02) +1.88% (p=0.000)
GobDecode 15.2ms × (0.99,1.01) 15.2ms × (0.98,1.05) ~ (p=0.609)
GobEncode 11.6ms × (0.98,1.03) 11.9ms × (0.98,1.04) +2.17% (p=0.000)
Gzip 648ms × (0.99,1.01) 648ms × (1.00,1.01) ~ (p=0.835)
Gunzip 142ms × (1.00,1.00) 143ms × (1.00,1.01) ~ (p=0.169)
HTTPClientServer 90.5µs × (0.98,1.03) 91.5µs × (0.98,1.04) +1.04% (p=0.045)
JSONEncode 31.5ms × (0.98,1.03) 31.4ms × (0.98,1.03) ~ (p=0.549)
JSONDecode 111ms × (0.99,1.01) 107ms × (0.99,1.01) -3.21% (p=0.000)
Mandelbrot200 6.01ms × (1.00,1.00) 6.01ms × (1.00,1.00) ~ (p=0.878)
GoParse 6.54ms × (0.99,1.02) 6.61ms × (0.99,1.03) +1.08% (p=0.004)
RegexpMatchEasy0_32 160ns × (1.00,1.01) 161ns × (1.00,1.00) +0.40% (p=0.000)
RegexpMatchEasy0_1K 560ns × (0.99,1.01) 559ns × (0.99,1.01) ~ (p=0.088)
RegexpMatchEasy1_32 138ns × (0.99,1.01) 138ns × (1.00,1.00) ~ (p=0.380)
RegexpMatchEasy1_1K 877ns × (1.00,1.00) 878ns × (1.00,1.00) ~ (p=0.157)
RegexpMatchMedium_32 251ns × (0.99,1.00) 251ns × (1.00,1.01) +0.28% (p=0.021)
RegexpMatchMedium_1K 72.6µs × (1.00,1.00) 72.6µs × (1.00,1.00) ~ (p=0.539)
RegexpMatchHard_32 3.84µs × (1.00,1.00) 3.84µs × (1.00,1.00) ~ (p=0.378)
RegexpMatchHard_1K 117µs × (1.00,1.00) 117µs × (1.00,1.00) ~ (p=0.067)
Revcomp 904ms × (0.99,1.02) 904ms × (0.99,1.01) ~ (p=0.943)
Template 125ms × (0.99,1.02) 127ms × (0.99,1.01) +1.79% (p=0.000)
TimeParse 627ns × (0.99,1.01) 622ns × (0.99,1.01) -0.88% (p=0.000)
TimeFormat 655ns × (0.99,1.02) 655ns × (0.99,1.02) ~ (p=0.976)
For the record, Go 1 benchmarks, Go 1.4 vs this CL:
name old mean new mean delta
BinaryTree17 4.61s × (0.97,1.05) 5.91s × (0.98,1.03) +28.35% (p=0.000)
Fannkuch11 4.40s × (0.99,1.03) 4.41s × (0.99,1.01) ~ (p=0.212)
FmtFprintfEmpty 102ns × (0.99,1.01) 84ns × (0.99,1.02) -18.38% (p=0.000)
FmtFprintfString 302ns × (0.98,1.01) 303ns × (0.99,1.02) ~ (p=0.203)
FmtFprintfInt 313ns × (0.97,1.05) 270ns × (0.99,1.01) -13.69% (p=0.000)
FmtFprintfIntInt 524ns × (0.98,1.02) 477ns × (0.99,1.00) -8.87% (p=0.000)
FmtFprintfPrefixedInt 424ns × (0.98,1.02) 386ns × (0.99,1.01) -8.96% (p=0.000)
FmtFprintfFloat 652ns × (0.98,1.02) 594ns × (0.97,1.05) -8.97% (p=0.000)
FmtManyArgs 2.13µs × (0.99,1.02) 1.94µs × (0.99,1.01) -8.92% (p=0.000)
GobDecode 17.1ms × (0.99,1.02) 14.9ms × (0.98,1.03) -13.07% (p=0.000)
GobEncode 13.5ms × (0.98,1.03) 11.5ms × (0.98,1.03) -15.25% (p=0.000)
Gzip 656ms × (0.99,1.02) 647ms × (0.99,1.01) -1.29% (p=0.000)
Gunzip 143ms × (0.99,1.02) 144ms × (0.99,1.01) ~ (p=0.204)
HTTPClientServer 88.2µs × (0.98,1.02) 90.8µs × (0.98,1.01) +2.93% (p=0.000)
JSONEncode 32.2ms × (0.98,1.02) 30.9ms × (0.97,1.04) -4.06% (p=0.001)
JSONDecode 121ms × (0.98,1.02) 110ms × (0.98,1.05) -8.95% (p=0.000)
Mandelbrot200 6.06ms × (0.99,1.01) 6.11ms × (0.98,1.04) ~ (p=0.184)
GoParse 6.76ms × (0.97,1.04) 6.58ms × (0.98,1.05) -2.63% (p=0.003)
RegexpMatchEasy0_32 195ns × (1.00,1.01) 155ns × (0.99,1.01) -20.43% (p=0.000)
RegexpMatchEasy0_1K 479ns × (0.98,1.03) 535ns × (0.99,1.02) +11.59% (p=0.000)
RegexpMatchEasy1_32 169ns × (0.99,1.02) 131ns × (0.99,1.03) -22.44% (p=0.000)
RegexpMatchEasy1_1K 1.53µs × (0.99,1.01) 0.87µs × (0.99,1.02) -43.07% (p=0.000)
RegexpMatchMedium_32 334ns × (0.99,1.01) 242ns × (0.99,1.01) -27.53% (p=0.000)
RegexpMatchMedium_1K 125µs × (1.00,1.01) 72µs × (0.99,1.03) -42.53% (p=0.000)
RegexpMatchHard_32 6.03µs × (0.99,1.01) 3.79µs × (0.99,1.01) -37.12% (p=0.000)
RegexpMatchHard_1K 189µs × (0.99,1.02) 115µs × (0.99,1.01) -39.20% (p=0.000)
Revcomp 935ms × (0.96,1.03) 926ms × (0.98,1.02) ~ (p=0.083)
Template 146ms × (0.97,1.05) 119ms × (0.99,1.01) -18.37% (p=0.000)
TimeParse 660ns × (0.99,1.01) 624ns × (0.99,1.02) -5.43% (p=0.000)
TimeFormat 670ns × (0.98,1.02) 710ns × (1.00,1.01) +5.97% (p=0.000)
This CL is a bit larger than I would like, but the compiler, linker, runtime,
and package reflect all need to be in sync about the format of these programs,
so there is no easy way to split this into independent changes (at least
while keeping the build working at each change).
Fixes#9625.
Fixes#10524.
Change-Id: I9e3e20d6097099d0f8532d1cb5b1af528804989a
Reviewed-on: https://go-review.googlesource.com/9888
Reviewed-by: Austin Clements <austin@google.com>
Run-TryBot: Russ Cox <rsc@golang.org>
Currently the heap minimum is set to 4MB which prevents our ability to
collect at every allocation by setting GOGC=0. This adjust the
heap minimum to 4MB*GOGC/100 thus reenabling collecting at every allocation.
Fixes#10681
Change-Id: I912d027dac4b14ae535597e8beefa9ac3fb8ad94
Reviewed-on: https://go-review.googlesource.com/9814
Reviewed-by: Austin Clements <austin@google.com>
Reviewed-by: Russ Cox <rsc@golang.org>
Currently, the GC uses a moving average of recent scan work ratios to
estimate the total scan work required by this cycle. This is in turn
used to compute how much scan work should be done by mutators when
they allocate in order to perform all expected scan work by the time
the allocated heap reaches the heap goal.
However, our current scan work estimate can be arbitrarily wrong if
the heap topography changes significantly from one cycle to the
next. For example, in the go1 benchmarks, at the beginning of each
benchmark, the heap is dominated by a 256MB no-scan object, so the GC
learns that the scan density of the heap is very low. In benchmarks
that then rapidly allocate pointer-dense objects, by the time of the
next GC cycle, our estimate of the scan work can be too low by a large
factor. This in turn lets the mutator allocate faster than the GC can
collect, allowing it to get arbitrarily far ahead of the scan work
estimate, which leads to very long GC cycles with very little mutator
assist that can overshoot the heap goal by large margins. This is
particularly easy to demonstrate with BinaryTree17:
$ GODEBUG=gctrace=1 ./go1.test -test.bench BinaryTree17
gc #1 @0.017s 2%: 0+0+0+0+0 ms clock, 0+0+0+0/0/0+0 ms cpu, 4->262->262 MB, 4 MB goal, 1 P
gc #2 @0.026s 3%: 0+0+0+0+0 ms clock, 0+0+0+0/0/0+0 ms cpu, 262->262->262 MB, 524 MB goal, 1 P
testing: warning: no tests to run
PASS
BenchmarkBinaryTree17 gc #3 @1.906s 0%: 0+0+0+0+7 ms clock, 0+0+0+0/0/0+7 ms cpu, 325->325->287 MB, 325 MB goal, 1 P (forced)
gc #4 @12.203s 20%: 0+0+0+10067+10 ms clock, 0+0+0+0/2523/852+10 ms cpu, 430->2092->1950 MB, 574 MB goal, 1 P
1 9150447353 ns/op
Change this estimate to instead use the *current* scannable heap
size. This has the advantage of being based solely on the current
state of the heap, not on past densities or reachable heap sizes, so
it isn't susceptible to falling behind during these sorts of phase
changes. This is strictly an over-estimate, but it's better to
over-estimate and get more assist than necessary than it is to
under-estimate and potentially spiral out of control. Experiments with
scaling this estimate back showed no obvious benefit for mutator
utilization, heap size, or assist time.
This new estimate has little effect for most benchmarks, including
most go1 benchmarks, x/benchmarks, and the 6g benchmark. It has a huge
effect for benchmarks that triggered the bad pacer behavior:
name old mean new mean delta
BinaryTree17 10.0s × (1.00,1.00) 3.5s × (0.98,1.01) -64.93% (p=0.000)
Fannkuch11 2.74s × (1.00,1.01) 2.65s × (1.00,1.00) -3.52% (p=0.000)
FmtFprintfEmpty 56.4ns × (0.99,1.00) 57.8ns × (1.00,1.01) +2.43% (p=0.000)
FmtFprintfString 187ns × (0.99,1.00) 185ns × (0.99,1.01) -1.19% (p=0.010)
FmtFprintfInt 184ns × (1.00,1.00) 183ns × (1.00,1.00) (no variance)
FmtFprintfIntInt 321ns × (1.00,1.00) 315ns × (1.00,1.00) -1.80% (p=0.000)
FmtFprintfPrefixedInt 266ns × (1.00,1.00) 263ns × (1.00,1.00) -1.22% (p=0.000)
FmtFprintfFloat 353ns × (1.00,1.00) 353ns × (1.00,1.00) -0.13% (p=0.035)
FmtManyArgs 1.21µs × (1.00,1.00) 1.19µs × (1.00,1.00) -1.33% (p=0.000)
GobDecode 9.69ms × (1.00,1.00) 9.59ms × (1.00,1.00) -1.07% (p=0.000)
GobEncode 7.89ms × (0.99,1.01) 7.74ms × (1.00,1.00) -1.92% (p=0.000)
Gzip 391ms × (1.00,1.00) 392ms × (1.00,1.00) ~ (p=0.522)
Gunzip 97.1ms × (1.00,1.00) 97.0ms × (1.00,1.00) -0.10% (p=0.000)
HTTPClientServer 55.7µs × (0.99,1.01) 56.7µs × (0.99,1.01) +1.81% (p=0.001)
JSONEncode 19.1ms × (1.00,1.00) 19.0ms × (1.00,1.00) -0.85% (p=0.000)
JSONDecode 66.8ms × (1.00,1.00) 66.9ms × (1.00,1.00) ~ (p=0.288)
Mandelbrot200 4.13ms × (1.00,1.00) 4.12ms × (1.00,1.00) -0.08% (p=0.000)
GoParse 3.97ms × (1.00,1.01) 4.01ms × (1.00,1.00) +0.99% (p=0.000)
RegexpMatchEasy0_32 114ns × (1.00,1.00) 115ns × (0.99,1.00) ~ (p=0.070)
RegexpMatchEasy0_1K 376ns × (1.00,1.00) 376ns × (1.00,1.00) ~ (p=0.900)
RegexpMatchEasy1_32 94.9ns × (1.00,1.00) 96.3ns × (1.00,1.01) +1.53% (p=0.001)
RegexpMatchEasy1_1K 568ns × (1.00,1.00) 567ns × (1.00,1.00) -0.22% (p=0.001)
RegexpMatchMedium_32 159ns × (1.00,1.00) 159ns × (1.00,1.00) ~ (p=0.178)
RegexpMatchMedium_1K 46.4µs × (1.00,1.00) 46.6µs × (1.00,1.00) +0.29% (p=0.000)
RegexpMatchHard_32 2.37µs × (1.00,1.00) 2.37µs × (1.00,1.00) ~ (p=0.722)
RegexpMatchHard_1K 71.1µs × (1.00,1.00) 71.2µs × (1.00,1.00) ~ (p=0.229)
Revcomp 565ms × (1.00,1.00) 562ms × (1.00,1.00) -0.52% (p=0.000)
Template 81.0ms × (1.00,1.00) 80.2ms × (1.00,1.00) -0.97% (p=0.000)
TimeParse 380ns × (1.00,1.00) 380ns × (1.00,1.00) ~ (p=0.148)
TimeFormat 405ns × (0.99,1.00) 385ns × (0.99,1.00) -5.00% (p=0.000)
Change-Id: I11274158bf3affaf62662e02de7af12d5fb789e4
Reviewed-on: https://go-review.googlesource.com/9696
Reviewed-by: Russ Cox <rsc@golang.org>
Run-TryBot: Austin Clements <austin@google.com>
This tracks the number of scannable bytes in the allocated heap. That
is, bytes that the garbage collector must scan before reaching the
last pointer field in each object.
This will be used to compute a more robust estimate of the GC scan
work.
Change-Id: I1eecd45ef9cdd65b69d2afb5db5da885c80086bb
Reviewed-on: https://go-review.googlesource.com/9695
Reviewed-by: Russ Cox <rsc@golang.org>
Currently, we only flush the per-P gcWork caches in gcMark, at the
beginning of mark termination. This is necessary to ensure that no
work is held up in these caches.
However, this flush happens after we update the GC controller state,
which depends on statistics about marked heap size and scan work that
are only updated by this flush. Hence, the controller is missing the
bulk of heap marking and scan work. This bug was introduced in commit
1b4025f, which introduced the per-P gcWork caches.
Fix this by flushing these caches before we update the GC controller
state. We continue to flush them at the beginning of mark termination
as well to be robust in case any write barriers happened between the
previous flush and entering mark termination, but this should be a
no-op.
Change-Id: I8f0f91024df967ebf0c616d1c4f0c339c304ebaa
Reviewed-on: https://go-review.googlesource.com/9646
Reviewed-by: Russ Cox <rsc@golang.org>
During development some tracing routines were added that are not
needed in the release. These included GCstarttimes, GCendtimes, and
GCprinttimes.
Fixes#10462
Change-Id: I0788e6409d61038571a5ae0cbbab793102df0a65
Reviewed-on: https://go-review.googlesource.com/9689
Reviewed-by: Austin Clements <austin@google.com>
This adds a detailed debug dump of the state of the GC controller and
a GODEBUG flag to enable it.
Change-Id: I562fed7981691a84ddf0f9e6fcd9f089f497ac13
Reviewed-on: https://go-review.googlesource.com/9640
Reviewed-by: Russ Cox <rsc@golang.org>
This avoids confusion with the main findrunnable in the scheduler.
Change-Id: I8cf40657557a8610a2fe5a2f74598518256ca7f0
Reviewed-on: https://go-review.googlesource.com/9305
Reviewed-by: Rick Hudson <rlh@golang.org>
Currently, we use a full stop-the-world around enabling write
barriers. This is to ensure that all Gs have enabled write barriers
before any blackening occurs (either in gcBgMarkWorker() or in
gcAssistAlloc()).
However, there's no need to bring the whole world to a synchronous
stop to ensure this. This change replaces the STW with a ragged
barrier that ensures each P has individually observed that write
barriers should be enabled before GC performs any blackening.
Change-Id: If2f129a6a55bd8bdd4308067af2b739f3fb41955
Reviewed-on: https://go-review.googlesource.com/8207
Reviewed-by: Russ Cox <rsc@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
Currently, each M has a cache of the most recently used *workbuf. This
is used primarily by the write barrier so it doesn't have to access
the global workbuf lists on every write barrier. It's also used by
stack scanning because it's convenient.
This cache is important for write barrier performance, but this
particular approach has several downsides. It's faster than no cache,
but far from optimal (as the benchmarks below show). It's complex:
access to the cache is sprinkled through most of the workbuf list
operations and it requires special care to transform into and back out
of the gcWork cache that's actually used for scanning and marking. It
requires atomic exchanges to take ownership of the cached workbuf and
to return it to the M's cache even though it's almost always used by
only the current M. Since it's per-M, flushing these caches is O(# of
Ms), which may be high. And it has some significant subtleties: for
example, in general the cache shouldn't be used after the
harvestwbufs() in mark termination because it could hide work from
mark termination, but stack scanning can happen after this and *will*
use the cache (but it turns out this is okay because it will always be
followed by a getfull(), which drains the cache).
This change replaces this cache with a per-P gcWork object. This
gcWork cache can be used directly by scanning and marking (as long as
preemption is disabled, which is a general requirement of gcWork).
Since it's per-P, it doesn't require synchronization, which simplifies
things and means the only atomic operations in the write barrier are
occasionally fetching new work buffers and setting a mark bit if the
object isn't already marked. This cache can be flushed in O(# of Ps),
which is generally small. It follows a simple flushing rule: the cache
can be used during any phase, but during mark termination it must be
flushed before allowing preemption. This also makes the dispose during
mutator assist no longer necessary, which eliminates the vast majority
of gcWork dispose calls and reduces contention on the global workbuf
lists. And it's a lot faster on some benchmarks:
benchmark old ns/op new ns/op delta
BenchmarkBinaryTree17 11963668673 11206112763 -6.33%
BenchmarkFannkuch11 2643217136 2649182499 +0.23%
BenchmarkFmtFprintfEmpty 70.4 70.2 -0.28%
BenchmarkFmtFprintfString 364 307 -15.66%
BenchmarkFmtFprintfInt 317 282 -11.04%
BenchmarkFmtFprintfIntInt 512 483 -5.66%
BenchmarkFmtFprintfPrefixedInt 404 380 -5.94%
BenchmarkFmtFprintfFloat 521 479 -8.06%
BenchmarkFmtManyArgs 2164 1894 -12.48%
BenchmarkGobDecode 30366146 22429593 -26.14%
BenchmarkGobEncode 29867472 26663152 -10.73%
BenchmarkGzip 391236616 396779490 +1.42%
BenchmarkGunzip 96639491 96297024 -0.35%
BenchmarkHTTPClientServer 100110 70763 -29.31%
BenchmarkJSONEncode 51866051 52511382 +1.24%
BenchmarkJSONDecode 103813138 86094963 -17.07%
BenchmarkMandelbrot200 4121834 4120886 -0.02%
BenchmarkGoParse 16472789 5879949 -64.31%
BenchmarkRegexpMatchEasy0_32 140 140 +0.00%
BenchmarkRegexpMatchEasy0_1K 394 394 +0.00%
BenchmarkRegexpMatchEasy1_32 120 120 +0.00%
BenchmarkRegexpMatchEasy1_1K 621 614 -1.13%
BenchmarkRegexpMatchMedium_32 209 202 -3.35%
BenchmarkRegexpMatchMedium_1K 54889 55175 +0.52%
BenchmarkRegexpMatchHard_32 2682 2675 -0.26%
BenchmarkRegexpMatchHard_1K 79383 79524 +0.18%
BenchmarkRevcomp 584116718 584595320 +0.08%
BenchmarkTemplate 125400565 109620196 -12.58%
BenchmarkTimeParse 386 387 +0.26%
BenchmarkTimeFormat 580 447 -22.93%
(Best out of 10 runs. The delta of averages is similar.)
This also puts us in a good position to flush these caches when
nearing the end of concurrent marking, which will let us increase the
size of the work buffers while still controlling mark termination
pause time.
Change-Id: I2dd94c8517a19297a98ec280203cccaa58792522
Reviewed-on: https://go-review.googlesource.com/9178
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Russ Cox <rsc@golang.org>
When findRunnable considers running a fractional mark worker, it first
checks if there's any work to be done; if there isn't there's no point
in running the worker because it will just reschedule immediately.
However, currently findRunnable just checks work.full and
work.partial, whereas getfull can *also* draw work from m.currentwbuf.
As a result, findRunnable may not start a worker even though there
actually is work.
This problem manifests itself in occasional failures of the
test/init1.go test. This test is unusual because it performs a large
amount of allocation without executing any write barriers, which means
there's nothing to force the pointers in currentwbuf out to the
work.partial/full lists where findRunnable can see them.
This change fixes this problem by making findRunnable also check for a
currentwbuf. This aligns findRunnable with trygetfull's notion of
whether or not there's work.
Change-Id: Ic76d22b7b5d040bc4f58a6b5975e9217650e66c4
Reviewed-on: https://go-review.googlesource.com/9299
Reviewed-by: Russ Cox <rsc@golang.org>
Currently, findRunnable only considers running a mark worker if
there's work in the work queue. In principle, this can delay the start
of the desired number of dedicated mark workers if there's no work
pending. This is unlikely to occur in practice, since there should be
work queued from the scan phase, but if it were to come up, a CPU hog
mutator could slow down or delay garbage collection.
This check makes sense for fractional mark workers, since they'll just
return to the scheduler immediately if there's no work, but we want
the scheduler to start all of the dedicated mark workers promptly,
even if there's currently no queued work. Hence, this change moves the
pending work check after the check for starting a dedicated worker.
Change-Id: I52b851cc9e41f508a0955b3f905ca80f109ea101
Reviewed-on: https://go-review.googlesource.com/9298
Reviewed-by: Rick Hudson <rlh@golang.org>
Currently, when allocation reaches the GC trigger, the runtime uses
readyExecute to start the GC goroutine immediately rather than wait
for the scheduler to get around to the GC goroutine while the mutator
continues to grow the heap.
Now that the scheduler runs the most recently readied goroutine when a
goroutine yields its time slice, this rigmarole is no longer
necessary. The runtime can simply ready the GC goroutine and yield
from the readying goroutine.
Change-Id: I3b4ebadd2a72a923b1389f7598f82973dd5c8710
Reviewed-on: https://go-review.googlesource.com/9292
Reviewed-by: Rick Hudson <rlh@golang.org>
Reviewed-by: Russ Cox <rsc@golang.org>
Run-TryBot: Austin Clements <austin@google.com>
Currently, the main GC goroutine sleeps on a note during concurrent
mark and the first background mark worker or assist to finish marking
use wakes up that note to let the main goroutine proceed into mark
termination. Unfortunately, the latency of this wakeup can be quite
high, since the GC goroutine will typically have lost its P while in
the futex sleep, meaning it will be placed on the global run queue and
will wait there until some P is kind enough to pick it up. This delay
gives the mutator more time to allocate and create floating garbage,
growing the heap unnecessarily. Worse, it's likely that background
marking has stopped at this point (unless GOMAXPROCS>4), so anything
that's allocated and published to the heap during this window will
have to be scanned during mark termination while the world is stopped.
This change replaces the note sleep/wakeup with a gopark/ready
scheme. This keeps the wakeup inside the Go scheduler and lets the
garbage collector take advantage of the new scheduler semantics that
run the ready()d goroutine immediately when the ready()ing goroutine
sleeps.
For the json benchmark from x/benchmarks with GOMAXPROCS=4, this
reduces the delay in waking up the GC goroutine and entering mark
termination once concurrent marking is done from ~100ms to typically
<100µs.
Change-Id: Ib11f8b581b8914f2d68e0094f121e49bac3bb384
Reviewed-on: https://go-review.googlesource.com/9291
Reviewed-by: Rick Hudson <rlh@golang.org>
Reviewed-by: Russ Cox <rsc@golang.org>
Currently, we use a note sleep with a timeout in a loop in func gc to
periodically revise the GC control variables. Replace this with a
fully blocking note sleep and use a periodic timer to trigger the
revise instead. This is a step toward replacing the note sleep in func
gc.
Change-Id: I2d562f6b9b2e5f0c28e9a54227e2c0f8a2603f63
Reviewed-on: https://go-review.googlesource.com/9290
Reviewed-by: Rick Hudson <rlh@golang.org>
Reviewed-by: Russ Cox <rsc@golang.org>
Currently, it's possible for the next_gc calculation to underflow.
Since next_gc is unsigned, this wraps around and effectively disables
GC for the rest of the program's execution. Besides being obviously
wrong, this is causing test failures on 32-bit because some tests are
running out of heap.
This underflow happens for two reasons, both having to do with how we
estimate the reachable heap size at the end of the GC cycle.
One reason is that this calculation depends on the value of heap_live
at the beginning of the GC cycle, but we currently only record that
value during a concurrent GC and not during a forced STW GC. Fix this
by moving the recorded value from gcController to work and recording
it on a common code path.
The other reason is that we use the amount of allocation during the GC
cycle as an approximation of the amount of floating garbage and
subtract it from the marked heap to estimate the reachable heap.
However, since this is only an approximation, it's possible for the
amount of allocation during the cycle to be *larger* than the marked
heap size (since the runtime allocates white and it's possible for
these allocations to never be made reachable from the heap). Currently
this causes wrap-around in our estimate of the reachable heap size,
which in turn causes wrap-around in next_gc. Fix this by bottoming out
the reachable heap estimate at 0, in which case we just fall back to
triggering GC at heapminimum (which is okay since this only happens on
small heaps).
Fixes#10555, fixes#10556, and fixes#10559.
Change-Id: Iad07b529c03772356fede2ae557732f13ebfdb63
Reviewed-on: https://go-review.googlesource.com/9286
Run-TryBot: Austin Clements <austin@google.com>
Reviewed-by: Rick Hudson <rlh@golang.org>
Currently, we set the heap goal for the next GC cycle using the size
of the marked heap at the end of the current cycle. This can lead to a
bad feedback loop if the mutator is rapidly allocating and releasing
pointers that can significantly bloat heap size.
If the GC were STW, the marked heap size would be exactly the
reachable heap size (call it stwLive). However, in concurrent GC,
marked=stwLive+floatLive, where floatLive is the amount of "floating
garbage": objects that were reachable at some point during the cycle
and were marked, but which are no longer reachable by the end of the
cycle. If the GC cycle is short, then the mutator doesn't have much
time to create floating garbage, so marked≈stwLive. However, if the GC
cycle is long and the mutator is allocating and creating floating
garbage very rapidly, then it's possible that marked≫stwLive. Since
the runtime currently sets the heap goal based on marked, this will
cause it to set a high heap goal. This means that 1) the next GC cycle
will take longer because of the larger heap and 2) the assist ratio
will be low because of the large distance between the trigger and the
goal. The combination of these lets the mutator produce even more
floating garbage in the next cycle, which further exacerbates the
problem.
For example, on the garbage benchmark with GOMAXPROCS=1, this causes
the heap to grow to ~500MB and the garbage collector to retain upwards
of ~300MB of heap, while the true reachable heap size is ~32MB. This,
in turn, causes the GC cycle to take upwards of ~3 seconds.
Fix this bad feedback loop by estimating the true reachable heap size
(stwLive) and using this rather than the marked heap size
(stwLive+floatLive) as the basis for the GC trigger and heap goal.
This breaks the bad feedback loop and causes the mutator to assist
more, which decreases the rate at which it can create floating
garbage. On the same garbage benchmark, this reduces the maximum heap
size to ~73MB, the retained heap to ~40MB, and the duration of the GC
cycle to ~200ms.
Change-Id: I7712244c94240743b266f9eb720c03802799cdd1
Reviewed-on: https://go-review.googlesource.com/9177
Reviewed-by: Rick Hudson <rlh@golang.org>
This may or may not be useful to the end user, but it's incredibly
useful for us to understand the behavior of the pacer. Currently this
is fairly easy (though not trivial) to derive from the other heap
stats we print, but we're about to change how we compute the goal,
which will make it much harder to derive.
Change-Id: I796ef233d470c01f606bd9929820c01ece1f585a
Reviewed-on: https://go-review.googlesource.com/9176
Reviewed-by: Rick Hudson <rlh@golang.org>
The trigger controller computes GC CPU utilization by dividing by the
wall-clock time that's passed since concurrent mark began. Since this
delta is nanoseconds it's borderline impossible for it to be zero, but
if it is zero we'll currently divide by zero. Be robust to this
possibility by ignoring the utilization in the error term if no time
has elapsed.
Change-Id: I93dfc9e84735682af3e637f6538d1e7602634f09
Reviewed-on: https://go-review.googlesource.com/9175
Reviewed-by: Rick Hudson <rlh@golang.org>
Currently, the GC controller computes the mutator assist ratio at the
beginning of the cycle by estimating that the marked heap size this
cycle will be the same as it was the previous cycle. It then uses that
assist ratio for the rest of the cycle. However, this means that if
the mutator is quickly growing its reachable heap, the heap size is
likely to exceed the heap goal and currently there's no additional
pressure on mutator assists when this happens. For example, 6g (with
GOMAXPROCS=1) frequently exceeds the goal heap size by ~25% because of
this.
This change makes GC revise its work estimate and the resulting assist
ratio every 10ms during the concurrent mark. Instead of
unconditionally using the marked heap size from the last cycle as an
estimate for this cycle, it takes the minimum of the previously marked
heap and the currently marked heap. As a result, as the cycle
approaches or exceeds its heap goal, this will increase the assist
ratio to put more pressure on the mutator assist to bring the cycle to
an end. For 6g, this causes the GC to always finish within 5% and
often within 1% of its heap goal.
Change-Id: I4333b92ad0878c704964be42c655c38a862b4224
Reviewed-on: https://go-review.googlesource.com/9070
Reviewed-by: Rick Hudson <rlh@golang.org>
Run-TryBot: Austin Clements <austin@google.com>
Currently, in accordance with the GC pacing proposal, we schedule
background marking with a goal of achieving 25% utilization *total*
between mutator assists and background marking. This is stricter than
was set out in the Go 1.5 proposal, which suggests that the garbage
collector can use 25% just for itself and anything the mutator does to
help out is on top of that. It also has several technical
drawbacks. Because mutator assist time is constantly changing and we
can't have instantaneous information on background marking time, it
effectively requires hitting a moving target based on out-of-date
information. This works out in the long run, but works poorly for
short GC cycles and on short time scales. Also, this requires
time-multiplexing all Ps between the mutator and background GC since
the goal utilization of background GC constantly fluctuates. This
results in a complicated scheduling algorithm, poor affinity, and
extra overheads from context switching.
This change modifies the way we schedule and run background marking so
that background marking always consumes 25% of GOMAXPROCS and mutator
assist is in addition to this. This enables a much more robust
scheduling algorithm where we pre-determine the number of Ps we should
dedicate to background marking as well as the utilization goal for a
single floating "remainder" mark worker.
Change-Id: I187fa4c03ab6fe78012a84d95975167299eb9168
Reviewed-on: https://go-review.googlesource.com/9013
Reviewed-by: Rick Hudson <rlh@golang.org>
Currently, the concurrent sweep follows a 1:1 rule: when allocation
needs a span, it sweeps a span (likewise, when a large allocation
needs N pages, it sweeps until it frees N pages). This rule worked
well for the STW collector (especially when GOGC==100) because it did
no more sweeping than necessary to keep the heap from growing, would
generally finish sweeping just before GC, and ensured good temporal
locality between sweeping a page and allocating from it.
It doesn't work well with concurrent GC. Since concurrent GC requires
starting GC earlier (sometimes much earlier), the sweep often won't be
done when GC starts. Unfortunately, the first thing GC has to do is
finish the sweep. In the mean time, the mutator can continue
allocating, pushing the heap size even closer to the goal size. This
worked okay with the 7/8ths trigger, but it gets into a vicious cycle
with the GC trigger controller: if the mutator is allocating quickly
and driving the trigger lower, more and more sweep work will be left
to GC; this both causes GC to take longer (allowing the mutator to
allocate more during GC) and delays the start of the concurrent mark
phase, which throws off the GC controller's statistics and generally
causes it to push the trigger even lower.
As an example of a particularly bad case, the garbage benchmark with
GOMAXPROCS=4 and -benchmem 512 (MB) spends the first 0.4-0.8 seconds
of each GC cycle sweeping, during which the heap grows by between
109MB and 252MB.
To fix this, this change replaces the 1:1 sweep rule with a
proportional sweep rule. At the end of GC, GC knows exactly how much
heap allocation will occur before the next concurrent GC as well as
how many span pages must be swept. This change computes this "sweep
ratio" and when the mallocgc asks for a span, the mcentral sweeps
enough spans to bring the swept span count into ratio with the
allocated byte count.
On the benchmark from above, this entirely eliminates sweeping at the
beginning of GC, which reduces the time between startGC readying the
GC goroutine and GC stopping the world for sweep termination to ~100µs
during which the heap grows at most 134KB.
Change-Id: I35422d6bba0c2310d48bb1f8f30a72d29e98c1af
Reviewed-on: https://go-review.googlesource.com/8921
Reviewed-by: Rick Hudson <rlh@golang.org>
Currently, concurrent GC triggers at a fixed 7/8*GOGC heap growth. For
mutators that allocate slowly, this means GC will trigger too early
and run too often, wasting CPU time on GC. For mutators that allocate
quickly, this means GC will trigger too late, causing the program to
exceed the GOGC heap growth goal and/or to exceed CPU goals because of
a high mutator assist ratio.
This change adds a feedback control loop to dynamically adjust the GC
trigger from cycle to cycle. By monitoring the heap growth and GC CPU
utilization from cycle to cycle, this adjusts the Go garbage collector
to target the GOGC heap growth goal and the 25% CPU utilization goal.
Change-Id: Ic82eef288c1fa122f73b69fe604d32cbb219e293
Reviewed-on: https://go-review.googlesource.com/8851
Reviewed-by: Rick Hudson <rlh@golang.org>
Currently, the concurrent mark phase is performed by the main GC
goroutine. Prior to the previous commit enabling preemption, this
caused marking to always consume 1/GOMAXPROCS of the available CPU
time. If GOMAXPROCS=1, this meant background GC would consume 100% of
the CPU (effectively a STW). If GOMAXPROCS>4, background GC would use
less than the goal of 25%. If GOMAXPROCS=4, background GC would use
the goal 25%, but if the mutator wasn't using the remaining 75%,
background marking wouldn't take advantage of the idle time. Enabling
preemption in the previous commit made GC miss CPU targets in
completely different ways, but set us up to bring everything back in
line.
This change replaces the fixed GC goroutine with per-P background mark
goroutines. Once started, these goroutines don't go in the standard
run queues; instead, they are scheduled specially such that the time
spent in mutator assists and the background mark goroutines totals 25%
of the CPU time available to the program. Furthermore, this lets
background marking take advantage of idle Ps, which significantly
boosts GC performance for applications that under-utilize the CPU.
This requires also changing how time is reported for gctrace, so this
change splits the concurrent mark CPU time into assist/background/idle
scanning.
This also requires increasing the size of the StackRecord slice used
in a GoroutineProfile test.
Change-Id: I0936ff907d2cee6cb687a208f2df47e8988e3157
Reviewed-on: https://go-review.googlesource.com/8850
Reviewed-by: Rick Hudson <rlh@golang.org>
Currently, the entire GC process runs with g.m.preemptoff set. In the
concurrent phases, the parts that actually need preemption disabled
are run on a system stack and there's no overall need to stay on the
same M or P during the concurrent phases. Hence, move the setting of
g.m.preemptoff to when we start mark termination, at which point we
really do need preemption disabled.
This dramatically changes the scheduling behavior of the concurrent
mark phase. Currently, since this is non-preemptible, concurrent mark
gets one dedicated P (so 1/GOMAXPROCS utilization). With this change,
the GC goroutine is scheduled like any other goroutine during
concurrent mark, so it gets 1/<runnable goroutines> utilization.
You might think it's not even necessary to set g.m.preemptoff at that
point since the world is stopped, but stackalloc/stackfree use this as
a signal that the per-P pools are not safe to access without
synchronization.
Change-Id: I08aebe8179a7d304650fb8449ff36262b3771099
Reviewed-on: https://go-review.googlesource.com/8839
Reviewed-by: Rick Hudson <rlh@golang.org>
This time is tracked per P and periodically flushed to the global
controller state. This will be used to compute mutator assist
utilization in order to schedule background GC work.
Change-Id: Ib94f90903d426a02cf488bf0e2ef67a068eb3eec
Reviewed-on: https://go-review.googlesource.com/8837
Reviewed-by: Rick Hudson <rlh@golang.org>
Currently, mutator allocation periodically assists the garbage
collector by performing a small, fixed amount of scanning work.
However, to control heap growth, mutators need to perform scanning
work *proportional* to their allocation rate.
This change implements proportional mutator assists. This uses the
scan work estimate computed by the garbage collector at the beginning
of each cycle to compute how much scan work must be performed per
allocation byte to complete the estimated scan work by the time the
heap reaches the goal size. When allocation triggers an assist, it
uses this ratio and the amount allocated since the last assist to
compute the assist work, then attempts to steal as much of this work
as possible from the background collector's credit, and then performs
any remaining scan work itself.
Change-Id: I98b2078147a60d01d6228b99afd414ef857e4fba
Reviewed-on: https://go-review.googlesource.com/8836
Reviewed-by: Rick Hudson <rlh@golang.org>
This tracks scan work done by background GC in a global pool. Mutator
assists will draw on this credit to avoid doing work when background
GC is staying ahead.
Unlike the other GC controller tracking variables, this will be both
written and read throughout the cycle. Hence, we can't arbitrarily
delay updates like we can for scan work and bytes marked. However, we
still want to minimize contention, so this global credit pool is
allowed some error from the "true" amount of credit. Background GC
accumulates credit locally up to a limit and only then flushes to the
global pool. Similarly, mutator assists will draw from the credit pool
in batches.
Change-Id: I1aa4fc604b63bf53d1ee2a967694dffdfc3e255e
Reviewed-on: https://go-review.googlesource.com/8834
Reviewed-by: Rick Hudson <rlh@golang.org>
This implements tracking the scan work ratio of a GC cycle and using
this to estimate the scan work that will be required by the next GC
cycle. Currently this estimate is unused; it will be used to drive
mutator assists.
Change-Id: I8685b59d89cf1d83eddfc9b30d84da4e3a7f4b72
Reviewed-on: https://go-review.googlesource.com/8833
Reviewed-by: Rick Hudson <rlh@golang.org>
This tracks the amount of scan work in terms of scanned pointers
during the concurrent mark phase. We'll use this information to
estimate scan work for the next cycle.
Currently this aggregates the work counter in gcWork and dispose
atomically aggregates this into a global work counter. dispose happens
relatively infrequently, so the contention on the global counter
should be low. If this turns out to be an issue, we can reduce the
number of disposes, and if it's still a problem, we can switch to
per-P counters.
Change-Id: Iac0364c466ee35fab781dbbbe7970a5f3c4e1fc1
Reviewed-on: https://go-review.googlesource.com/8832
Reviewed-by: Rick Hudson <rlh@golang.org>
This memory is untyped and can't be used anymore.
The next version of SWIG won't need it.
Change-Id: I592b287c5f5186975ee09a9b28d8efe3b57134e7
Reviewed-on: https://go-review.googlesource.com/8956
Reviewed-by: Ian Lance Taylor <iant@golang.org>
Currently, when allocation reaches the concurrent GC trigger size, we
start the concurrent collector by ready'ing its G. This simply puts it
on the end of the P's run queue, which means we may not actually start
GC for some time as the current G continues to run and then the P
drains other Gs already on its run queue. Since the mutator can
continue to allocate, the heap can potentially be much larger than we
intended by the time GC actually starts. Furthermore, how much larger
is difficult to predict since it depends on the scheduler.
Fix this by preempting the current G and switching directly to the
concurrent GC G as soon as we reach the trigger heap size.
On the garbage benchmark from the benchmarks subrepo with
GOMAXPROCS=4, this reduces the time from triggering the GC to the
beginning of sweep termination by 10 to 30 milliseconds, which reduces
allocation after the trigger by up to 10MB (a large fraction of the
64MB live heap the benchmark tries to maintain).
One other known source of delay before we "really" start GC is the
sweep finalization performed before sweep termination. This has
similar negative effects on heap size and predictability, but is an
orthogonal problem. This change adds a TODO for this.
Change-Id: I8bae98cb43685c1bf353ff55868e4647e3743c47
Reviewed-on: https://go-review.googlesource.com/8513
Reviewed-by: Rick Hudson <rlh@golang.org>
These were appropriate for STW GC, since it interrupted the allocating
Goroutine, but don't apply to concurrent GC, which runs on its own
Goroutine. Forced GC is still STW, but it makes sense to attribute the
GC to the goroutine that called runtime.GC().
Change-Id: If12418ca66dc7e53b8b16025af4e03adb5d9577e
Reviewed-on: https://go-review.googlesource.com/8715
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Rick Hudson <rlh@golang.org>
'themoduledata' doesn't really make sense now we support multiple moduledata
objects.
Change-Id: I8263045d8f62a42cb523502b37289b0fba054f62
Reviewed-on: https://go-review.googlesource.com/8521
Reviewed-by: Ian Lance Taylor <iant@golang.org>
Run-TryBot: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
This changes all the places that consult themoduledata to consult a
linked list of moduledata objects, as will be necessary for
-linkshared to work.
Obviously, as there is as yet no way of adding moduledata objects to
this list, all this change achieves right now is wasting a few
instructions here and there.
Change-Id: I397af7f60d0849b76aaccedf72238fe664867051
Reviewed-on: https://go-review.googlesource.com/8231
Reviewed-by: Ian Lance Taylor <iant@golang.org>
Run-TryBot: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Currently, the initial heap size reported in the gctrace line is the
heap_live right before sweep termination. However, we triggered GC
when heap_live reached next_gc, and there may have been significant
allocation between that point and the beginning of sweep
termination. Ideally these would be essentially the same, but
currently there's scheduler delay when readying the GC goroutine as
well as delay from background sweep finalization.
We should fix this delay, but in the mean time, to give the user a
better idea of how much the heap grew during the whole of garbage
collection, report the trigger rather than what the heap size happened
to be after the garbage collector finished rolling out of bed. This
will also be more useful for heap growth plots.
Change-Id: I08476b9fbcfb2de90592405e9c9f434dfb9eb1f8
Reviewed-on: https://go-review.googlesource.com/8512
Reviewed-by: Rick Hudson <rlh@golang.org>
When the gctrace GODEBUG option is enabled, it will now report three
heap sizes: the heap size at the beginning of the GC cycle, the heap
size at the end of the GC cycle before sweeping, and marked heap size,
which is the amount of heap that will be retained until the next GC
cycle.
Change-Id: Ie13f8a6d5c609bc9cc47c7555960ab55b37b5f1c
Reviewed-on: https://go-review.googlesource.com/8430
Reviewed-by: Rick Hudson <rlh@golang.org>
In the STW collector, next_gc was both the heap size to trigger GC at
as well as the goal heap size.
Early in the concurrent collector's development, next_gc was the goal
heap size, but was also used as the heap size to trigger GC at. This
meant we always overshot the goal because of allocation during
concurrent GC.
Currently, next_gc is still the goal heap size, but we trigger
concurrent GC at 7/8*GOGC heap growth. This complicates
shouldtriggergc, but was necessary because of the incremental
maintenance of next_gc.
Now we simply compute next_gc for the next cycle during mark
termination. Hence, it's now easy to take the simpler route and
redefine next_gc as the heap size at which the next GC triggers. We
can directly compute this with the 7/8 backoff during mark termination
and shouldtriggergc can simply test if the live heap size has grown
over the next_gc trigger.
This will also simplify later changes once we start setting next_gc in
more sophisticated ways.
Change-Id: I872be4ae06b4f7a0d7f7967360a054bd36b90eea
Reviewed-on: https://go-review.googlesource.com/8420
Reviewed-by: Russ Cox <rsc@golang.org>
Currently there are two main consumers of memstats.heap_alloc:
updatememstats (aka ReadMemStats) and shouldtriggergc.
updatememstats recomputes heap_alloc from the ground up, so we don't
need to keep heap_alloc up to date for it. shouldtriggergc wants to
know how many bytes were marked by the previous GC plus how many bytes
have been allocated since then, but this *isn't* what heap_alloc
tracks. heap_alloc also includes objects that are not marked and
haven't yet been swept.
Introduce a new memstat called heap_live that actually tracks what
shouldtriggergc wants to know and stop keeping heap_alloc up to date.
Unlike heap_alloc, heap_live follows a simple sawtooth that drops
during each mark termination and increases monotonically between GCs.
heap_alloc, on the other hand, has much more complicated behavior: it
may drop during sweep termination, slowly decreases from background
sweeping between GCs, is roughly unaffected by allocation as long as
there are unswept spans (because we sweep and allocate at the same
rate), and may go up after background sweeping is done depending on
the GC trigger.
heap_live simplifies computing next_gc and using it to figure out when
to trigger garbage collection. Currently, we guess next_gc at the end
of a cycle and update it as we sweep and get a better idea of how much
heap was marked. Now, since we're directly tracking how much heap is
marked, we can directly compute next_gc.
This also corrects bugs that could cause us to trigger GC early.
Currently, in any case where sweep termination actually finds spans to
sweep, heap_alloc is an overestimation of live heap, so we'll trigger
GC too early. heap_live, on the other hand, is unaffected by sweeping.
Change-Id: I1f96807b6ed60d4156e8173a8e68745ffc742388
Reviewed-on: https://go-review.googlesource.com/8389
Reviewed-by: Russ Cox <rsc@golang.org>
This tracks the number of heap bytes marked by a GC cycle. We'll use
this information to precisely trigger the next GC cycle.
Currently this aggregates the work counter in gcWork and dispose
atomically aggregates this into a global work counter. dispose happens
relatively infrequently, so the contention on the global counter
should be low. If this turns out to be an issue, we can reduce the
number of disposes, and if it's still a problem, we can switch to
per-P counters.
Change-Id: I1bc377cb2e802ef61c2968602b63146d52e7f5db
Reviewed-on: https://go-review.googlesource.com/8388
Reviewed-by: Russ Cox <rsc@golang.org>
This tracks both total CPU time used by GC and the total time
available to all Ps since the beginning of the program and uses this
to derive a cumulative CPU usage percent for the gctrace line.
Change-Id: Ica85372b8dd45f7621909b325d5ac713a9b0d015
Reviewed-on: https://go-review.googlesource.com/8350
Reviewed-by: Russ Cox <rsc@golang.org>
GODEBUG=gctrace=1 turns on a per-GC cycle trace line. The current line
is left over from the STW garbage collector and includes a lot of
information that is no longer meaningful for the concurrent GC and
doesn't include a lot of information that is important.
Replace this line with a new line designed for the new garbage
collector.
This new line is focused more on helping the user understand the
impact of the garbage collector on their program and less on telling
us, the runtime developers, everything that's happening inside
GC. It's designed to fit in 80 columns and intentionally omit some
potentially useful things that were in the old line. We might want a
"verbose" mode that adds information for us.
We'll be able to further simplify the line once we eliminate the STW
around enabling the write barrier. Then we'll have just one STW phase,
one concurrent phase, and one more STW phase, so we'll be able to
reduce the number of times from five to three.
Change-Id: Icc30939fe4576fb4491b4eac811649395727aa2a
Reviewed-on: https://go-review.googlesource.com/8208
Reviewed-by: Russ Cox <rsc@golang.org>
In preparation for being able to run a go program that has code
in several objects, this changes from having several linker
symbols used by the runtime into having one linker symbol that
points at a structure containing the needed data. Multiple
object support will construct a linked list of such structures.
A follow up will initialize the slices in the themoduledata
structure directly from the linker but I was aiming for a minimal
diff for now.
Change-Id: I613cce35309801cf265a1d5ae5aaca8d689c5cbf
Reviewed-on: https://go-review.googlesource.com/7441
Reviewed-by: Ian Lance Taylor <iant@golang.org>
The barrier in gcDrain does not account for concurrent gcDrainNs
happening in gchelpwork, so it can actually return while there is
still work being done. It turns out this is okay, but for subtle
reasons involving gcDrainN always being run on the system
stack. Document these reasons.
Change-Id: Ib07b3753cc4e2b54533ab3081a359cbd1c3c08fb
Reviewed-on: https://go-review.googlesource.com/7736
Reviewed-by: Rick Hudson <rlh@golang.org>
We're skating on thin ice, and things are finally starting to melt around here.
(I want to avoid the debugging session that will happen when someone
uses atomicand8 expecting it to be atomic with respect to other operations.)
Change-Id: I254f1582be4eb1f2d7fbba05335a91c6bf0c7f02
Reviewed-on: https://go-review.googlesource.com/7861
Reviewed-by: Minux Ma <minux@golang.org>
Currently, the GC's concurrent mark phase runs on the system
stack. There's no need to do this, and running it this way ties up the
entire M and P running the GC by preventing the scheduler from
preempting the GC even during concurrent mark.
Fix this by running concurrent mark on the regular G stack. It's still
non-preemptible because we also set preemptoff around the whole GC
process, but this moves us closer to making it preemptible.
Change-Id: Ia9f1245e299b8c5c513a4b1e3ef13eaa35ac5e73
Reviewed-on: https://go-review.googlesource.com/7730
Reviewed-by: Rick Hudson <rlh@golang.org>
"Sync" is not very informative. What's being synchronized and with
whom? Update this comment to explain what we're really doing: enabling
write barriers.
Change-Id: I4f0cbb8771988c7ba4606d566b77c26c64165f0f
Reviewed-on: https://go-review.googlesource.com/7700
Reviewed-by: Rick Hudson <rlh@golang.org>
Currently we harvestwbufs the moment we enter the mark phase, even
before starting the world again. Since cached wbufs are only filled
when we're in mark or mark termination, they should all be empty at
this point, making the harvest pointless. Remove the harvest.
We should, but do not currently harvest at the end of the mark phase
when we're running out of work to do.
Change-Id: I5f4ba874f14dd915b8dfbc4ee5bb526eecc2c0b4
Reviewed-on: https://go-review.googlesource.com/7669
Reviewed-by: Rick Hudson <rlh@golang.org>
Even though the world is stopped the GC may do pointer
writes that need to be protected by write barriers.
This means that the write barrier must be on
continuously from the time the mark phase starts and
the mark termination phase ends. Checks were added to
ensure that no allocation happens during a GC.
Hoist the logic that clears pools the start of the GC
so that the memory can be reclaimed during this GC cycle.
Change-Id: I9d1551ac5db9bac7bac0cb5370d5b2b19a9e6a52
Reviewed-on: https://go-review.googlesource.com/6990
Reviewed-by: Austin Clements <austin@google.com>
Stip uninteresting bottom and top frames from trace stacks.
This makes both binary and json trace files smaller,
and also makes stacks shorter and more readable in the viewer.
Change-Id: Ib9c80ccc280504f0e235f867f53f1d2652c41583
Reviewed-on: https://go-review.googlesource.com/5523
Reviewed-by: Keith Randall <khr@golang.org>
Run-TryBot: Dmitry Vyukov <dvyukov@google.com>
Starting it lazily causes a memory allocation (for the goroutine) during GC.
First use of channels for runtime implementation.
Change-Id: I9cd24dcadbbf0ee5070ee6d0ed7ea415504f316c
Reviewed-on: https://go-review.googlesource.com/6960
Run-TryBot: Russ Cox <rsc@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
The unbounded list-based defer pool can grow infinitely.
This can happen if a goroutine routinely allocates a defer;
then blocks on one P; and then unblocked, scheduled and
frees the defer on another P.
The scenario was reported on golang-nuts list.
We've been here several times. Any unbounded local caches
are bad and grow to infinite size. This change introduces
central defer pool; local pools become fixed-size
with the only purpose of amortizing accesses to the
central pool.
Freedefer now executes on system stack to not consume
nosplit stack space.
Change-Id: I1a27695838409259d1586a0adfa9f92bccf7ceba
Reviewed-on: https://go-review.googlesource.com/3967
Reviewed-by: Keith Randall <khr@golang.org>
Run-TryBot: Dmitry Vyukov <dvyukov@google.com>
The unbounded list-based sudog cache can grow infinitely.
This can happen if a goroutine is routinely blocked on one P
and then unblocked and scheduled on another P.
The scenario was reported on golang-nuts list.
We've been here several times. Any unbounded local caches
are bad and grow to infinite size. This change introduces
central sudog cache; local caches become fixed-size
with the only purpose of amortizing accesses to the
central cache.
The change required to move sudog cache from mcache to P,
because mcache is not scanned by GC.
Change-Id: I3bb7b14710354c026dcba28b3d3c8936a8db4e90
Reviewed-on: https://go-review.googlesource.com/3742
Reviewed-by: Keith Randall <khr@golang.org>
Run-TryBot: Dmitry Vyukov <dvyukov@google.com>
Since allglock is held in this function, there's no point to
tip-toeing around allgs. Just use a for-range loop.
Change-Id: I1ee61c7e8cac8b8ebc8107c0c22f739db5db9840
Reviewed-on: https://go-review.googlesource.com/5882
Reviewed-by: Russ Cox <rsc@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
Previously, we had three loops in the garbage collector that all
cleared the per-G GC flags. Consolidate these into one function.
This one function is designed to work in a concurrent setting. As a
result, it's slightly more expensive than the loops it replaces during
STW phases, but these happen at most twice per GC.
Change-Id: Id1ec0074fd58865eb0112b8a0547b267802d0df1
Reviewed-on: https://go-review.googlesource.com/5881
Reviewed-by: Russ Cox <rsc@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
The loop in gcMark is redundant with the gcworkdone resetting
performed by markroot, which called a few lines later in gcMark.
Change-Id: Ie0a826a614ecfa79e6e6b866e8d1de40ba515856
Reviewed-on: https://go-review.googlesource.com/5880
Reviewed-by: Russ Cox <rsc@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
When GODEBUG=gctrace=2 two gcs are preformed. During the first gc
the stack scan sets the g's gcscanvalid and gcworkdone flags to true
indicating that the stacks have to be scanned and do not need to
be rescanned. These need to be reset to false for the second GC so the
stacks are rescanned, otherwise if the only pointer to an object is
on the stack it will not be discovered and the object will be freed.
Typically this will include the object that was just allocated in
the mallocgc call that initiated the GC.
Change-Id: Ic25163f4689905fd810c90abfca777324005c02f
Reviewed-on: https://go-review.googlesource.com/5861
Reviewed-by: Russ Cox <rsc@golang.org>
This is a nice split but more importantly it provides a better
way to fit the checkmark phase into the sequencing.
Also factor out common span copying into gcSpanCopy.
Change-Id: Ia058644974e4ed4ac3cf4b017a3446eb2284d053
Reviewed-on: https://go-review.googlesource.com/5333
Reviewed-by: Austin Clements <austin@google.com>
The loop made more sense when gc_m was not its own function.
Change-Id: I71a7f21d777e69c1924e3b534c507476daa4dfdd
Reviewed-on: https://go-review.googlesource.com/5332
Reviewed-by: Austin Clements <austin@google.com>