mirror of https://github.com/golang/go synced 2024-11-19 16:44:43 -07:00
Commit Graph

283 Commits

Author SHA1 Message Date
Austin Clements
3b5637ff2b runtime: doubly fix "double wakeup" panic
runtime.gchelper depends on the non-atomic load of work.ndone
happening strictly before the atomic add of work.nwait. Until very
recently (commit 978af9c2db, fixing #20334), the compiler reordered
these operations. This created a race since work.ndone can change as
soon as work.nwait is equal to work.ndone. If that happened, more than
one gchelper could attempt to wake up the work.alldone note, causing a
"double wakeup" panic.

This was fixed in the compiler, but to make this code less subtle,
make the load of work.ndone atomic. This clearly forces the order of
these operations, ensuring the race doesn't happen.
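
For illustration, a minimal standalone sketch of the same ordering
requirement using sync/atomic (the names ndone, nwait, and done below
are illustrative, not the runtime's actual fields):

	package main

	import (
		"fmt"
		"sync"
		"sync/atomic"
	)

	func main() {
		const helpers = 4
		var ndone uint32 = helpers // fixed before the helpers run
		var nwait uint32
		var wg sync.WaitGroup
		done := make(chan struct{}) // closing this twice would be a "double wakeup"

		for i := 0; i < helpers; i++ {
			wg.Add(1)
			go func() {
				defer wg.Done()
				// The atomic load cannot be reordered after the
				// atomic add, so each helper compares against a
				// value read before it announced itself as done.
				n := atomic.LoadUint32(&ndone)
				if atomic.AddUint32(&nwait, 1) == n {
					close(done) // exactly one helper takes this branch
				}
			}()
		}
		<-done
		wg.Wait()
		fmt.Println("woke exactly once")
	}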

Fixes #19305 (though really 978af9c2db fixed it).

Change-Id: Ieb1a84e1e5044c33ac612c8a5ab6297e7db4c57d
Reviewed-on: https://go-review.googlesource.com/43311
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
2017-05-12 15:33:09 +00:00
Austin Clements
227fff2ea4 runtime/debug: don't trigger a GC on SetGCPercent
Currently SetGCPercent forces a GC in order to recompute GC pacing.
Since we can now recompute pacing on the fly using gcSetTriggerRatio,
change SetGCPercent (really runtime.setGCPercent) to go through
gcSetTriggerRatio and not trigger a GC.
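
For reference, a hedged sketch of the public API this backs;
SetGCPercent still returns the previous setting, and with this change
adjusting it should no longer force a collection:

	package main

	import (
		"fmt"
		"runtime/debug"
	)

	func main() {
		// Lower the GC target to 50% heap growth; the old value is returned.
		old := debug.SetGCPercent(50)
		fmt.Println("previous GOGC:", old)

		// Restore the previous setting. Pacing is recomputed in place
		// rather than by triggering a collection.
		debug.SetGCPercent(old)
	}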

Fixes #19076.

Change-Id: Ib30d7ab1bb3b55219535b9f238108f3d45a1b522
Reviewed-on: https://go-review.googlesource.com/39835
Run-TryBot: Austin Clements <austin@google.com>
Reviewed-by: Rick Hudson <rlh@golang.org>
2017-04-21 17:42:02 +00:00
Austin Clements
1c4f3c5ea0 runtime: make gcSetTriggerRatio work at any time
This changes gcSetTriggerRatio so it can be called even during
concurrent mark or sweep. In this case, it will adjust the pacing of
the current phase, accounting for progress that has already been made.

To make this work for concurrent sweep, this introduces a "basis" for
the pagesSwept count, much like the basis we just introduced for
heap_live. This lets gcSetTriggerRatio shift the basis to the current
heap_live and pagesSwept and compute a slope from there to completion.
This avoids creating a discontinuity where, if the ratio has
increased, there has to be a flurry of sweep activity to catch up.
Instead, this creates a continuous, piecewise-linear function as
adjustments are made.

For #19076.

Change-Id: Ibcd76aeeb81ff4814b00be7cbd3530b73bbdbba9
Reviewed-on: https://go-review.googlesource.com/39833
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
2017-04-21 17:41:59 +00:00
Austin Clements
a5eb3dceaf runtime: drive proportional sweep directly off heap_live
Currently, proportional sweep maintains its own count of how many
bytes have been allocated since the beginning of the sweep cycle so it
can compute how many pages need to be swept for a given allocation.

However, this requires a somewhat complex reimbursement scheme since
proportional sweep must be done before a span is allocated, but we
don't know how many bytes to charge until we've allocated a span. This
means that the allocated byte count used by proportional sweep can go
up and down, which has led to underflow bugs in the past (#18043) and
is going to interfere with adjusting sweep pacing on the fly (for #19076).

This approach also means we're maintaining a statistic that is very
closely related to heap_live, but has a different 0 value. This is
particularly confusing because the sweep ratio is computed based on
heap_live, so you have to understand that these two statistics are
very closely related.

Replace all of this and compute the sweep debt directly from the
current value of heap_live. To make this work, we simply save the
value of heap_live when the sweep ratio is computed to use as a
"basis" for later computing the sweep debt.

This eliminates the need for reimbursement as well as the code for
maintaining the sweeper's version of the live heap size.
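
Roughly, the basis idea looks like the following sketch (the names are
illustrative, not the runtime's actual fields):

	package main

	import "fmt"

	// sweepPacing records heap_live at the point the sweep ratio was
	// computed; the sweep debt is then a linear function of how far the
	// live heap has grown past that basis.
	type sweepPacing struct {
		heapLiveBasis uint64  // heap_live when the ratio was computed
		pagesPerByte  float64 // pages that must be swept per byte allocated
	}

	// pagesOwed reports how many pages should already have been swept
	// once the live heap reaches heapLive, given pagesSwept so far.
	func (p *sweepPacing) pagesOwed(heapLive, pagesSwept uint64) int64 {
		target := int64(float64(heapLive-p.heapLiveBasis) * p.pagesPerByte)
		return target - int64(pagesSwept)
	}

	func main() {
		p := &sweepPacing{heapLiveBasis: 64 << 20, pagesPerByte: 0.001}
		// 8 MB allocated past the basis at 0.001 pages/byte, 100 pages
		// already swept: 8288 more pages are owed.
		fmt.Println(p.pagesOwed(72<<20, 100))
	}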

For #19076.

Coincidentally fixes #18043, since this eliminates sweep reimbursement
entirely.

Change-Id: I1f931ddd6e90c901a3972c7506874c899251dc2a
Reviewed-on: https://go-review.googlesource.com/39832
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
2017-04-21 17:41:57 +00:00
Austin Clements
ee175afac2 runtime: consolidate all trigger-derived computations
Currently, the computations that derive controls from the GC trigger
are spread across several parts of the mark termination code.
Consolidate computing the absolute trigger, the heap goal, and sweep
pacing into a single function called at the end of mark termination.

Unlike the code being consolidated, this has to be more careful about
negative gcpercent. Many of the consolidated code paths simply didn't
execute if GC was off.

This is a step toward being able to change the GC trigger ratio in the
middle of concurrent sweeping and marking. For this commit, we try to
stick close to the original structure of the code that's being
consolidated, so it doesn't yet support mid-cycle adjustments.

For #19076.

Change-Id: Ic5335be04b96ad20e70d53d67913a86bd6b31456
Reviewed-on: https://go-review.googlesource.com/39831
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
2017-04-21 17:41:55 +00:00
Austin Clements
49a412a5b7 runtime: rationalize triggerRatio
gcController.triggerRatio is the only field in gcController that
persists across cycles. As global mutable state, the places where it is
written and read are spread out, making it difficult to see that
updates and downstream calculations are done correctly.

Improve this situation by doing two things:

1) Move triggerRatio to memstats so it lives with the other
trigger-related fields and makes gcController entirely transient
state.

2) Commit the new trigger ratio during mark termination when we
compute other next-cycle controls, including the absolute trigger.
This forces us to explicitly thread the new trigger ratio from
gcController.endCycle to mark termination, so we're not just pulling
it out of global state.

Change-Id: I6669932f8039a8c0ef46a3f2a8c537db72e578aa
Reviewed-on: https://go-review.googlesource.com/39830
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
2017-04-21 17:41:53 +00:00
Austin Clements
9d36163c0b runtime: consistently use atomic loads for heap_live
heap_live is updated atomically without locking, so we should also use
atomic loads to read it. Fix the reads of heap_live that happen
outside of STW to be atomic.

Change-Id: Idca9451c348168c2a792a9499af349833a3c333f
Reviewed-on: https://go-review.googlesource.com/41371
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
2017-04-21 17:41:51 +00:00
Austin Clements
051809e352 runtime: free workbufs during sweeping
This extends the sweeper to free workbufs back to the heap between GC
cycles, allowing this memory to be reused for GC'd allocations or
eventually returned to the OS.

This helps for applications that have high peak heap usage relative to
their regular heap usage (for example, a high-memory initialization
phase). Workbuf memory is roughly proportional to heap size and since
we currently never free workbufs, it's proportional to *peak* heap
size. By freeing workbufs, we can release and reuse this memory for
other purposes when the heap shrinks.

This is somewhat complicated because this costs ~1–2 µs per workbuf
span, so for large heaps it's too expensive to just do synchronously
after mark termination between starting the world and dropping the
worldsema. Hence, we do it asynchronously in the sweeper. This adds a
list of "free" workbuf spans that can be returned to the heap. GC
moves all workbuf spans to this list after mark termination and the
background sweeper drains this list back to the heap. If the sweeper
doesn't finish, that's fine, since getempty can directly reuse any
remaining spans to allocate more workbufs.

Performance impact is negligible. On the x/benchmarks, this reduces
GC-bytes-from-system by 6–11%.

Fixes #19325.

Change-Id: Icb92da2196f0c39ee984faf92d52f29fd9ded7a8
Reviewed-on: https://go-review.googlesource.com/38582
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
2017-04-13 18:20:47 +00:00
Austin Clements
9cc883a466 runtime: allocate GC workbufs from manually-managed spans
Currently the runtime allocates workbufs from persistent memory, which
means they can never be freed.

Switch to allocating them from manually-managed heap spans. This
doesn't free them yet, but it puts us in a position to do so.

For #19325.

Change-Id: I94b2512a2f2bbbb456cd9347761b9412e80d2da9
Reviewed-on: https://go-review.googlesource.com/38581
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
2017-04-13 18:20:44 +00:00
Austin Clements
7e1832d06c runtime: say where the compiler knows about var writeBarrier
The runtime.writeBarrier variable tries to be helpful by telling you
that the compiler also knows about this variable, which you could
probably guess, but doesn't say how the compiler knows about it. In
fact, the compiler has a complete copy in builtin/runtime.go that
needs to be kept in sync. Say so.

Change-Id: Ia7fb0c591cb6f9b8230decce01008b417dfcec89
Reviewed-on: https://go-review.googlesource.com/40150
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
2017-04-09 23:05:24 +00:00
Austin Clements
9ffbdabdb0 runtime: make runtime.GC() trigger a concurrent GC
Currently runtime.GC() triggers a STW GC. For common uses in tests and
benchmarks, it doesn't matter whether it's STW or concurrent, but for
uses in servers for things like collecting heap profiles and
controlling memory footprint, this pause can be a problem for
latency.

This changes runtime.GC() to trigger a concurrent GC. In order to
remain as close as possible to its current meaning, we define it to
always perform a full mark/sweep GC cycle before returning (even if
that means it has to finish up a cycle we're in the middle of first)
and to publish the heap profile as of the triggered mark termination.
While it must perform a full cycle, simultaneous runtime.GC() calls
can be consolidated into a single full cycle.
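
A hedged sketch of the pattern this enables in long-running programs:
force a full cycle, then write the heap profile that the cycle just
published (error handling kept minimal):

	package main

	import (
		"os"
		"runtime"
		"runtime/pprof"
	)

	func main() {
		// Force a full (now concurrent) GC cycle. runtime.GC blocks
		// until the cycle's mark termination has published the heap
		// profile, but it no longer stops the world for the whole
		// collection.
		runtime.GC()

		// Write the heap profile that the forced cycle published.
		if err := pprof.WriteHeapProfile(os.Stdout); err != nil {
			panic(err)
		}
	}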

Fixes #18216.

Change-Id: I9088cc5deef4ab6bcf0245ed1982a852a01c44b5
Reviewed-on: https://go-review.googlesource.com/37520
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
2017-03-31 01:15:21 +00:00
Austin Clements
2919132e1b runtime: don't adjust GC trigger on forced GC
Forced GCs don't provide good information about how to adjust the GC
trigger. Currently we avoid adjusting the trigger on forced GC because
forced GC is also STW and we don't adjust the trigger on STW GC.
However, this will become a problem when forced GC is concurrent.

Fix this by skipping trigger adjustment if the GC was user-forced.

For #18216.

Change-Id: I03dfdad12ecd3cfeca4573140a0768abb29aac5e
Reviewed-on: https://go-review.googlesource.com/38951
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
2017-03-31 01:15:16 +00:00
Austin Clements
29fdbcfea3 runtime: track forced GCs independent of gcMode
Currently gcMode != gcBackgroundMode implies this was a user-forced GC
cycle. This is no longer going to be true when we make runtime.GC()
trigger a concurrent GC, so replace this with an explicit
work.userForced bit.

For #18216.

Change-Id: If7d71bbca78b5f0b35641b070f9d457f5c9a52bd
Reviewed-on: https://go-review.googlesource.com/37519
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
2017-03-31 01:15:13 +00:00
Austin Clements
3d58498fdb runtime: simplify forced GC triggering
Now that the gcMode is no longer involved in the GC trigger condition,
we can simplify the triggering of forced GCs. By making the trigger
condition for forced GCs true even if gcphase is not _GCoff, we don't
need any special case path in gcStart to ensure that forced GCs don't
get consolidated.

Change-Id: I6067a13d76e40ff2eef8fade6fc14adb0cb58ee5
Reviewed-on: https://go-review.googlesource.com/37517
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
2017-03-31 01:15:08 +00:00
Austin Clements
29be3f1999 runtime: generalize GC trigger
Currently the GC triggering condition is an awkward combination of the
gcMode (whether or not it's gcBackgroundMode) and a boolean
"forceTrigger" flag.

Replace this with a new gcTrigger type that represents the range of
transition predicates we need. This has several advantages:

1. We can remove the awkward logic that affects the trigger behavior
   based on the gcMode. Now gcMode purely controls whether to run a
   STW GC or not and the gcTrigger controls whether this is a forced
   GC that cannot be consolidated with other GC cycles.

2. We can lift the time-based triggering logic in sysmon to just
   another type of GC trigger and move the logic to the trigger test.

3. This sets us up to have a cycle count-based trigger, which we'll
   use to make runtime.GC trigger concurrent GC with the desired
   consolidation properties.
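
A loose sketch of the shape such a trigger type could take (names,
fields, and thresholds below are illustrative, not the actual runtime
definitions):

	package main

	import "fmt"

	// A trigger-style predicate: a small value describing *why* a GC
	// should start, with a test method evaluated wherever a GC might
	// begin.
	type triggerKind int

	const (
		triggerHeap  triggerKind = iota // heap has reached the trigger size
		triggerTime                     // too long since the last GC (sysmon)
		triggerCycle                    // runtime.GC wants cycle n to have run
	)

	type trigger struct {
		kind triggerKind
		now  int64  // for triggerTime
		n    uint32 // for triggerCycle
	}

	func (t trigger) test(heapLive, triggerBytes uint64, lastGC int64, cyclesDone uint32) bool {
		switch t.kind {
		case triggerHeap:
			return heapLive >= triggerBytes
		case triggerTime:
			const forcePeriod = 2 * 60 * 1e9 // nanoseconds
			return lastGC != 0 && t.now-lastGC > forcePeriod
		case triggerCycle:
			return int32(t.n-cyclesDone) > 0 // requested cycle not yet done
		}
		return false
	}

	func main() {
		t := trigger{kind: triggerHeap}
		fmt.Println(t.test(5<<20, 4<<20, 0, 0)) // true: past the trigger
	}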

For #18216.

Change-Id: If9cd49349579a548800f5022ae47b8128004bbfc
Reviewed-on: https://go-review.googlesource.com/37516
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
2017-03-31 01:15:06 +00:00
Austin Clements
1be3e76e76 runtime: simplify heap profile flushing
Currently the heap profile is flushed by *either* gcSweep in STW mode
or by gcMarkTermination in concurrent mode. Simplify this by making
gcMarkTermination always flush the heap profile and by making gcSweep
do one extra flush (instead of two) in STW mode.

Change-Id: I62147afb2a128e1f3d92ef4bb8144c8a345f53c4
Reviewed-on: https://go-review.googlesource.com/37715
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
2017-03-31 01:15:01 +00:00
Austin Clements
eee85fc5a1 runtime: snapshot heap profile during mark termination
Currently we snapshot the heap profile just *after* mark termination
starts the world because it's a relatively expensive operation.
However, this means any alloc or free events that happen between
starting the world and snapshotting the heap profile can be accounted
to the wrong cycle. In the worst case, a free can be accounted to the
cycle before the alloc; if the heap is small, this can result
temporarily in a negative "in use" count in the profile.

Fix this without making STW more expensive by using a global heap
profile cycle counter. This lets us split the operation into two
parts: 1) a super-cheap snapshot operation that simply increments the
global cycle counter during STW, and 2) a more expensive cleanup
operation we can do after starting the world that frees up a slot in
all buckets for use by the next heap profile cycle.

Fixes #19311.

Change-Id: I6bdafabf111c48b3d26fe2d91267f7bef0bd4270
Reviewed-on: https://go-review.googlesource.com/37714
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
2017-03-31 01:14:56 +00:00
Austin Clements
13ae271d5d runtime: introduce a type for lfstacks
The lfstack API is still a C-style API: lfstacks all have unhelpful
type uint64 and the APIs are package-level functions. Make the code
more readable and Go-style by creating an lfstack type with methods
for push, pop, and empty.

Change-Id: I64685fa3be0e82ae2d1a782a452a50974440a827
Reviewed-on: https://go-review.googlesource.com/38290
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
2017-03-19 22:42:24 +00:00
Dmitry Vyukov
0556e26273 sync: make Mutex more fair
Add new starvation mode for Mutex.
In starvation mode ownership is directly handed off from
unlocking goroutine to the next waiter. New arriving goroutines
don't compete for ownership.
Unfair wait time is now limited to 1ms.
Also fix a long-standing bug that goroutines were requeued
at the tail of the wait queue. That led to even more unfair
acquisition times with multiple waiters.

Performance of normal mode is not considerably affected.

Fixes #13086

On the lockskew program provided in the issue:

done in 1.207853ms
done in 1.177451ms
done in 1.184168ms
done in 1.198633ms
done in 1.185797ms
done in 1.182502ms
done in 1.316485ms
done in 1.211611ms
done in 1.182418ms

name                    old time/op  new time/op   delta
MutexUncontended-48     0.65ns ± 0%   0.65ns ± 1%     ~           (p=0.087 n=10+10)
Mutex-48                 112ns ± 1%    114ns ± 1%   +1.69%        (p=0.000 n=10+10)
MutexSlack-48            113ns ± 0%     87ns ± 1%  -22.65%         (p=0.000 n=8+10)
MutexWork-48             149ns ± 0%    145ns ± 0%   -2.48%         (p=0.000 n=9+10)
MutexWorkSlack-48        149ns ± 0%    122ns ± 3%  -18.26%         (p=0.000 n=6+10)
MutexNoSpin-48           103ns ± 4%    105ns ± 3%     ~           (p=0.089 n=10+10)
MutexSpin-48             490ns ± 4%    515ns ± 6%   +5.08%        (p=0.006 n=10+10)
Cond32-48               13.4µs ± 6%   13.1µs ± 5%   -2.75%        (p=0.023 n=10+10)
RWMutexWrite100-48      53.2ns ± 3%   41.2ns ± 3%  -22.57%        (p=0.000 n=10+10)
RWMutexWrite10-48       45.9ns ± 2%   43.9ns ± 2%   -4.38%        (p=0.000 n=10+10)
RWMutexWorkWrite100-48   122ns ± 2%    134ns ± 1%   +9.92%        (p=0.000 n=10+10)
RWMutexWorkWrite10-48    206ns ± 1%    188ns ± 1%   -8.52%         (p=0.000 n=8+10)
Cond32-24               12.1µs ± 3%   12.4µs ± 3%   +1.98%         (p=0.043 n=10+9)
MutexUncontended-24     0.74ns ± 1%   0.75ns ± 1%     ~           (p=0.650 n=10+10)
Mutex-24                 122ns ± 2%    124ns ± 1%   +1.31%        (p=0.007 n=10+10)
MutexSlack-24           96.9ns ± 2%  102.8ns ± 2%   +6.11%        (p=0.000 n=10+10)
MutexWork-24             146ns ± 1%    135ns ± 2%   -7.70%         (p=0.000 n=10+9)
MutexWorkSlack-24        135ns ± 1%    128ns ± 2%   -5.01%         (p=0.000 n=10+9)
MutexNoSpin-24           114ns ± 3%    110ns ± 4%   -3.84%        (p=0.000 n=10+10)
MutexSpin-24             482ns ± 4%    475ns ± 8%     ~           (p=0.286 n=10+10)
RWMutexWrite100-24      43.0ns ± 3%   43.1ns ± 2%     ~           (p=0.956 n=10+10)
RWMutexWrite10-24       43.4ns ± 1%   43.2ns ± 1%     ~            (p=0.085 n=10+9)
RWMutexWorkWrite100-24   130ns ± 3%    131ns ± 3%     ~           (p=0.747 n=10+10)
RWMutexWorkWrite10-24    191ns ± 1%    192ns ± 1%     ~           (p=0.210 n=10+10)
Cond32-12               11.5µs ± 2%   11.7µs ± 2%   +1.98%        (p=0.002 n=10+10)
MutexUncontended-12     1.48ns ± 0%   1.50ns ± 1%   +1.08%        (p=0.004 n=10+10)
Mutex-12                 141ns ± 1%    143ns ± 1%   +1.63%        (p=0.000 n=10+10)
MutexSlack-12            121ns ± 0%    119ns ± 0%   -1.65%          (p=0.001 n=8+9)
MutexWork-12             141ns ± 2%    150ns ± 3%   +6.36%         (p=0.000 n=9+10)
MutexWorkSlack-12        131ns ± 0%    138ns ± 0%   +5.73%         (p=0.000 n=9+10)
MutexNoSpin-12          87.0ns ± 1%   83.7ns ± 1%   -3.80%        (p=0.000 n=10+10)
MutexSpin-12             364ns ± 1%    377ns ± 1%   +3.77%        (p=0.000 n=10+10)
RWMutexWrite100-12      42.8ns ± 1%   43.9ns ± 1%   +2.41%         (p=0.000 n=8+10)
RWMutexWrite10-12       39.8ns ± 4%   39.3ns ± 1%     ~            (p=0.433 n=10+9)
RWMutexWorkWrite100-12   131ns ± 1%    131ns ± 0%     ~            (p=0.591 n=10+9)
RWMutexWorkWrite10-12    173ns ± 1%    174ns ± 0%     ~            (p=0.059 n=10+8)
Cond32-6                10.9µs ± 2%   10.9µs ± 2%     ~           (p=0.739 n=10+10)
MutexUncontended-6      2.97ns ± 0%   2.97ns ± 0%     ~     (all samples are equal)
Mutex-6                  122ns ± 6%    122ns ± 2%     ~           (p=0.668 n=10+10)
MutexSlack-6             149ns ± 3%    142ns ± 3%   -4.63%        (p=0.000 n=10+10)
MutexWork-6              136ns ± 3%    140ns ± 5%     ~           (p=0.077 n=10+10)
MutexWorkSlack-6         152ns ± 0%    138ns ± 2%   -9.21%         (p=0.000 n=6+10)
MutexNoSpin-6            150ns ± 1%    152ns ± 0%   +1.50%         (p=0.000 n=8+10)
MutexSpin-6              726ns ± 0%    730ns ± 1%     ~           (p=0.069 n=10+10)
RWMutexWrite100-6       40.6ns ± 1%   40.9ns ± 1%   +0.91%         (p=0.001 n=8+10)
RWMutexWrite10-6        37.1ns ± 0%   37.0ns ± 1%     ~            (p=0.386 n=9+10)
RWMutexWorkWrite100-6    133ns ± 1%    134ns ± 1%   +1.01%         (p=0.005 n=9+10)
RWMutexWorkWrite10-6     152ns ± 0%    152ns ± 0%     ~     (all samples are equal)
Cond32-2                7.86µs ± 2%   7.95µs ± 2%   +1.10%        (p=0.023 n=10+10)
MutexUncontended-2      8.10ns ± 0%   9.11ns ± 4%  +12.44%         (p=0.000 n=9+10)
Mutex-2                 32.9ns ± 9%   38.4ns ± 6%  +16.58%        (p=0.000 n=10+10)
MutexSlack-2            93.4ns ± 1%   98.5ns ± 2%   +5.39%         (p=0.000 n=10+9)
MutexWork-2             40.8ns ± 3%   43.8ns ± 7%   +7.38%         (p=0.000 n=10+9)
MutexWorkSlack-2        98.6ns ± 5%  108.2ns ± 2%   +9.80%         (p=0.000 n=10+8)
MutexNoSpin-2            399ns ± 1%    398ns ± 2%     ~             (p=0.463 n=8+9)
MutexSpin-2             1.99µs ± 3%   1.97µs ± 1%   -0.81%          (p=0.003 n=9+8)
RWMutexWrite100-2       37.6ns ± 5%   46.0ns ± 4%  +22.17%         (p=0.000 n=10+8)
RWMutexWrite10-2        50.1ns ± 6%   36.8ns ±12%  -26.46%         (p=0.000 n=9+10)
RWMutexWorkWrite100-2    136ns ± 0%    134ns ± 2%   -1.80%          (p=0.001 n=7+9)
RWMutexWorkWrite10-2     140ns ± 1%    138ns ± 1%   -1.50%        (p=0.000 n=10+10)
Cond32                  5.93µs ± 1%   5.91µs ± 0%     ~            (p=0.411 n=9+10)
MutexUncontended        15.9ns ± 0%   15.8ns ± 0%   -0.63%          (p=0.000 n=8+8)
Mutex                   15.9ns ± 0%   15.8ns ± 0%   -0.44%        (p=0.003 n=10+10)
MutexSlack              26.9ns ± 3%   26.7ns ± 2%     ~           (p=0.084 n=10+10)
MutexWork               47.8ns ± 0%   47.9ns ± 0%   +0.21%          (p=0.014 n=9+8)
MutexWorkSlack          54.9ns ± 3%   54.5ns ± 3%     ~           (p=0.254 n=10+10)
MutexNoSpin              786ns ± 2%    765ns ± 1%   -2.66%        (p=0.000 n=10+10)
MutexSpin               3.87µs ± 1%   3.83µs ± 0%   -0.85%          (p=0.005 n=9+8)
RWMutexWrite100         21.2ns ± 2%   21.0ns ± 1%   -0.88%         (p=0.018 n=10+9)
RWMutexWrite10          22.6ns ± 1%   22.6ns ± 0%     ~             (p=0.471 n=9+9)
RWMutexWorkWrite100      132ns ± 0%    132ns ± 0%     ~     (all samples are equal)
RWMutexWorkWrite10       124ns ± 0%    123ns ± 0%     ~           (p=0.656 n=10+10)

Change-Id: I66412a3a0980df1233ad7a5a0cd9723b4274528b
Reviewed-on: https://go-review.googlesource.com/34310
Run-TryBot: Russ Cox <rsc@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Russ Cox <rsc@golang.org>
2017-02-17 17:24:59 +00:00
Russ Cox
1f77db94f8 runtime: do not call wakep from enlistWorker, to avoid possible deadlock
We have seen one instance of a production job suddenly spinning to
100% CPU and becoming unresponsive. In that one instance, a SIGQUIT
was sent after 328 minutes of spinning, and the stacks showed a single
goroutine in "IO wait (scan)" state.

Looking for things that might block if a goroutine got stuck scanning
a stack, we found that injectglist does:

	lock(&sched.lock)
	var n int
	for n = 0; glist != nil; n++ {
		gp := glist
		glist = gp.schedlink.ptr()
		casgstatus(gp, _Gwaiting, _Grunnable)
		globrunqput(gp)
	}
	unlock(&sched.lock)

and that casgstatus spins on gp.atomicstatus until the _Gscan bit goes
away. Essentially, this code locks sched.lock and then while holding
sched.lock, waits to lock gp.atomicstatus.

The code that is doing the scan is:

	if castogscanstatus(gp, s, s|_Gscan) {
		if !gp.gcscandone {
			scanstack(gp, gcw)
			gp.gcscandone = true
		}
		restartg(gp)
		break loop
	}

More analysis showed that scanstack can, in a rare case, end up
calling back into code that acquires sched.lock. For example:

	runtime.scanstack at proc.go:866
	calls runtime.gentraceback at mgcmark.go:842
	calls runtime.scanstack$1 at traceback.go:378
	calls runtime.scanframeworker at mgcmark.go:819
	calls runtime.scanblock at mgcmark.go:904
	calls runtime.greyobject at mgcmark.go:1221
	calls (*runtime.gcWork).put at mgcmark.go:1412
	calls (*runtime.gcControllerState).enlistWorker at mgcwork.go:127
	calls runtime.wakep at mgc.go:632
	calls runtime.startm at proc.go:1779
	acquires runtime.sched.lock at proc.go:1675

This path was found with an automated deadlock-detecting tool.
There are many such paths but they all go through enlistWorker -> wakep.

The evidence strongly suggests that one of these paths is what caused
the deadlock we observed. We're running those jobs with
GOTRACEBACK=crash now to try to get more information if it happens
again.

Further refinement and analysis shows that if we drop the wakep call
from enlistWorker, the remaining few deadlock cycles found by the tool
are all false positives caused by not understanding the effect of calls
to func variables.

The enlistWorker -> wakep call was intended only as a performance
optimization; it rarely executes, and if it does execute at just the
wrong time it can (and plausibly did) cause the deadlock we saw.

Comment it out, to avoid the potential deadlock.

Fixes #19112.
Unfixes #14179.

Change-Id: I6f7e10b890b991c11e79fab7aeefaf70b5d5a07b
Reviewed-on: https://go-review.googlesource.com/37093
Run-TryBot: Russ Cox <rsc@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
2017-02-15 21:22:36 +00:00
Austin Clements
d089a6c718 runtime: remove stack barriers
Now that we don't rescan stacks, stack barriers are unnecessary. This
removes all of the code and structures supporting them as well as
tests that were specifically for stack barriers.

Updates #17503.

Change-Id: Ia29221730e0f2bbe7beab4fa757f31a032d9690c
Reviewed-on: https://go-review.googlesource.com/36620
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
2017-02-14 15:52:54 +00:00
Austin Clements
c5ebcd2c8a runtime: remove rescan list
With the hybrid barrier, stacks no longer need to be rescanned, so the
rescan list is no longer necessary. Remove it.

This leaves the gcrescanstacks GODEBUG variable, since it's useful for
debugging, but changes it to simply walk all of the Gs to rescan
stacks rather than using the rescan list.

We could also remove g.gcscanvalid, which is effectively a distributed
rescan list. However, it's still useful for gcrescanstacks mode and it
adds little complexity, so we'll leave it in.

Fixes #17099.
Updates #17503.

Change-Id: I776d43f0729567335ef1bfd145b75c74de2cc7a9
Reviewed-on: https://go-review.googlesource.com/36619
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
2017-02-14 15:52:51 +00:00
Josh Bleecher Snyder
46a75870ad runtime: speed up fastrand() % n
This occurs a fair amount in the runtime for non-power-of-two n.
Use an alternative, faster formulation.
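
A standalone sketch of the kind of multiply-and-shift reduction commonly
used for this (the runtime's actual helper may differ in details): it
maps a uniform 32-bit value into [0, n) without a division.

	package main

	import (
		"fmt"
		"math/rand"
	)

	// reduce maps a uniform 32-bit value x into [0, n) using a 64-bit
	// multiply and a shift instead of a modulo. The result is very
	// slightly biased for n that don't divide 2^32, which is acceptable
	// for scheduler-style uses.
	func reduce(x, n uint32) uint32 {
		return uint32(uint64(x) * uint64(n) >> 32)
	}

	func main() {
		for i := 0; i < 5; i++ {
			x := rand.Uint32()
			fmt.Println(reduce(x, 3), x%3) // both land in [0, 3)
		}
	}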

name           old time/op  new time/op  delta
Fastrandn/2-8  4.45ns ± 2%  2.09ns ± 3%  -53.12%  (p=0.000 n=14+14)
Fastrandn/3-8  4.78ns ±11%  2.06ns ± 2%  -56.94%  (p=0.000 n=15+15)
Fastrandn/4-8  4.76ns ± 9%  1.99ns ± 3%  -58.28%  (p=0.000 n=15+13)
Fastrandn/5-8  4.96ns ±13%  2.03ns ± 6%  -59.14%  (p=0.000 n=15+15)

name                    old time/op  new time/op  delta
SelectUncontended-8     33.7ns ± 2%  33.9ns ± 2%  +0.70%  (p=0.000 n=49+50)
SelectSyncContended-8   1.68µs ± 4%  1.65µs ± 4%  -1.54%  (p=0.000 n=50+45)
SelectAsyncContended-8   282ns ± 1%   277ns ± 1%  -1.50%  (p=0.000 n=48+43)
SelectNonblock-8        5.31ns ± 1%  5.32ns ± 1%    ~     (p=0.275 n=45+44)
SelectProdCons-8         585ns ± 3%   577ns ± 2%  -1.35%  (p=0.000 n=50+50)
GoroutineSelect-8       1.59ms ± 2%  1.59ms ± 1%    ~     (p=0.084 n=49+48)

Updates #16213

Change-Id: Ib555a4d7da2042a25c3976f76a436b536487d5b7
Reviewed-on: https://go-review.googlesource.com/36932
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2017-02-14 00:01:22 +00:00
Russ Cox
e4371fb179 time: optimize Now on darwin, windows
Fetch both monotonic and wall time together when possible.
Avoids skew and is cheaper.

Also shave a few ns off in conversion in package time.

Compared to current implementation (after monotonic changes):

name   old time/op  new time/op  delta
Now    19.6ns ± 1%   9.7ns ± 1%  -50.63%  (p=0.000 n=41+49) darwin/amd64
Now    23.5ns ± 4%  10.6ns ± 5%  -54.61%  (p=0.000 n=30+28) windows/amd64
Now    54.5ns ± 5%  29.8ns ± 9%  -45.40%  (p=0.000 n=27+29) windows/386

More importantly, compared to Go 1.8:

name   old time/op  new time/op  delta
Now     9.5ns ± 1%   9.7ns ± 1%   +1.94%  (p=0.000 n=41+49) darwin/amd64
Now    12.9ns ± 5%  10.6ns ± 5%  -17.73%  (p=0.000 n=30+28) windows/amd64
Now    15.3ns ± 5%  29.8ns ± 9%  +94.36%  (p=0.000 n=30+29) windows/386

This brings time.Now back in line with Go 1.8 on darwin/amd64 and windows/amd64.

It's not obvious why windows/386 is still noticeably worse than Go 1.8,
but it's better than before this CL. The windows/386 speed is not too
important; the changes just keep the two architectures similar.

Change-Id: If69b94970c8a1a57910a371ee91e0d4e82e46c5d
Reviewed-on: https://go-review.googlesource.com/36428
Run-TryBot: Russ Cox <rsc@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
2017-02-09 14:45:16 +00:00
Austin Clements
c1730ae424 runtime: force workers out before checking mark roots
Currently we check that all roots are marked as soon as gcMarkDone
decides to transition from mark 1 to mark 2. However, issue #16083
indicates that there may be a race where we try to complete mark 1
while a worker is still scanning a stack, causing the root mark check
to fail.

We don't yet understand this race, but as a simple mitigation, move
the root check to after gcMarkDone performs a ragged barrier, which
will force any remaining workers to finish their current job.

Updates #16083. This may "fix" it, but it would be better to
understand and fix the underlying race.

Change-Id: I1af9ce67bd87ade7bc2a067295d79c28cd11abd2
Reviewed-on: https://go-review.googlesource.com/35353
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Russ Cox <rsc@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
2017-01-18 15:40:33 +00:00
Austin Clements
618c291544 runtime: update big mgc.go comment
The comment describing the overall GC algorithm at the top of mgc.go
has gotten woefully out-of-date (and was possibly never
correct/complete). Update it to reflect the current workings of the
GC and the set of phases that we now divide it into.

Change-Id: I02143c0ebefe9d4cd7753349dab8045f0973bf95
Reviewed-on: https://go-review.googlesource.com/34711
Reviewed-by: Rick Hudson <rlh@golang.org>
2017-01-06 18:22:35 +00:00
Austin Clements
01c6a19e04 runtime: add number of forced GCs to MemStats
This adds a counter for the number of times the application forced a
GC by, e.g., calling runtime.GC(). This is useful for detecting
applications that are overusing/abusing runtime.GC() or
debug.FreeOSMemory().
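
A usage sketch of the counter as surfaced in MemStats:

	package main

	import (
		"fmt"
		"runtime"
	)

	func main() {
		runtime.GC() // a forced collection
		runtime.GC() // and another

		var ms runtime.MemStats
		runtime.ReadMemStats(&ms)
		// NumForcedGC counts application-forced cycles; NumGC counts all.
		fmt.Printf("forced GCs: %d of %d total\n", ms.NumForcedGC, ms.NumGC)
	}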

Fixes #18217.

Change-Id: I990ab7a313c1b3b7a50a3d44535c460d7c54f47d
Reviewed-on: https://go-review.googlesource.com/34067
Reviewed-by: Russ Cox <rsc@golang.org>
2016-12-07 20:59:16 +00:00
Cherry Zhang
bbe96f5673 runtime: make work.bytesMarked 8-byte aligned
Make atomic access on 32-bit architectures happy.
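
For context, the rule this runs into (per sync/atomic's documentation)
is that on 32-bit platforms 64-bit words accessed atomically must be
8-byte aligned, and only the first word of an allocated struct is
guaranteed that alignment; the usual fix is to put such fields first.
An illustrative sketch:

	package main

	import (
		"fmt"
		"sync/atomic"
		"unsafe"
	)

	// counters keeps its atomically-updated uint64 first so that it is
	// 8-byte aligned even on 32-bit platforms.
	type counters struct {
		bytesMarked uint64 // updated with atomic.AddUint64
		flags       uint32
	}

	func main() {
		c := new(counters)
		atomic.AddUint64(&c.bytesMarked, 4096)
		fmt.Println(atomic.LoadUint64(&c.bytesMarked),
			unsafe.Offsetof(c.bytesMarked)) // offset 0 => aligned
	}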

Updates #17786.

Change-Id: I42de63ff1381af42124dc51befc887160f71797d
Reviewed-on: https://go-review.googlesource.com/33235
Run-TryBot: Cherry Zhang <cherryyz@google.com>
Reviewed-by: Austin Clements <austin@google.com>
2016-11-21 20:25:17 +00:00
Austin Clements
0bae74e8c9 runtime: wake idle Ps when enqueuing GC work
If the scheduler has no user work and there's no GC work visible, it
puts the P to sleep (or blocks on the network). However, if we later
enqueue more GC work, there's currently nothing that specifically
wakes up the scheduler to let it start an idle GC worker. As a result,
we can underutilize the CPU during GC if Ps have been put to sleep.

Fix this by making GC wake idle Ps when work buffers are put on the
full list. We already have a hook to do this, since we use this to
preempt a random P if we need more dedicated workers. We expand this
hook to instead wake an idle P if there is one. The logic we use for
this is identical to the logic used to wake an idle P when we ready a
goroutine.

To make this really sound, we also fix the scheduler to re-check the
idle GC worker condition after releasing its P. This closes a race
where 1) the scheduler checks for idle work and finds none, 2) new
work is enqueued but there are no idle Ps so none are woken, and 3)
the scheduler releases its P.

There is one subtlety here. Currently we call enlistWorker directly
from putfull, but the gcWork is in an inconsistent state in the places
that call putfull. This isn't a problem right now because nothing that
enlistWorker does touches the gcWork, but with the added call to
wakep, it's possible to get a recursive call into the gcWork
(specifically, while write barriers are disallowed, this can do an
allocation, which can dispose a gcWork, which can put a workbuf). To
handle this, we lift the enlistWorker calls up a layer and delay them
until the gcWork is in a consistent state.

Fixes #14179.

Change-Id: Ia2467a52e54c9688c3c1752e1fc00f5b37bbfeeb
Reviewed-on: https://go-review.googlesource.com/32434
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
2016-11-20 22:44:22 +00:00
Austin Clements
49ea9207b6 runtime: exit idle worker if there's higher-priority work
Idle GC workers trigger whenever there's a GC running and the
scheduler doesn't find any other work. However, they currently run for
a full scheduler quantum (~10ms) once started.

This is really bad for event-driven applications, where work may come
in on the network hundreds of times during that window. In the
go-gcbench rpc benchmark, this is bad enough to often cause effective
STWs where all Ps are in the idle worker. When this happens, we don't
even poll the network any more (except for the background 10ms poll in
sysmon), so we don't even know there's more work to do.

Fix this by making idle workers check with the scheduler roughly every
100 µs to see if there's any higher-priority work the P should be
doing. This check includes polling the network for incoming work.

Fixes #16528.

Change-Id: I6f62ebf6d36a92368da9891bafbbfd609b9bd003
Reviewed-on: https://go-review.googlesource.com/32433
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
2016-11-20 22:44:17 +00:00
David Crawshaw
54ec7b072e runtime: access modules via a slice
The introduction of -buildmode=plugin means modules can be added to a
Go program while it is running. This means there is a window while the
program is running when a module is on the moduledata linked list but
has not yet been initialized to the satisfaction of other parts of the
runtime, notably the GC.

This CL adds a new way of accessing modules: an activeModules function.
It returns a slice of modules that is built in the background and
atomically swapped in. The parts of the runtime that need to wait on
module initialization can use this slice instead of the linked list.
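
The underlying publish pattern, sketched with atomic.Value rather than
the runtime's internals (names are illustrative): build the new slice
off to the side, then swap it in atomically so readers only ever see
fully initialized modules.

	package main

	import (
		"fmt"
		"sync/atomic"
	)

	type module struct{ name string } // stand-in for moduledata

	// active holds a []*module snapshot; readers never see a partially
	// built slice. (Writers would additionally serialize with a lock;
	// only the readers are lock-free here.)
	var active atomic.Value

	func activeModules() []*module {
		m, _ := active.Load().([]*module)
		return m
	}

	// addModule fully initializes mod, then rebuilds and atomically
	// swaps in a new slice.
	func addModule(mod *module) {
		next := append(append([]*module(nil), activeModules()...), mod)
		active.Store(next)
	}

	func main() {
		addModule(&module{name: "main"})
		addModule(&module{name: "plugin1"})
		for _, m := range activeModules() {
			fmt.Println(m.name)
		}
	}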

Fixes #17455

Change-Id: I04790fd07e40c7295beb47cea202eb439206d33d
Reviewed-on: https://go-review.googlesource.com/32357
Reviewed-by: Ian Lance Taylor <iant@golang.org>
2016-11-01 16:04:12 +00:00
Martin Möhrmann
d7b34d5f29 runtime: improve atoi implementation
- Adds overflow checks
- Adds parsing of negative integers
- Adds boolean return value to signal parsing errors
- Adds atoi32 for parsing of integers that fit in an int32
- Adds tests

Handling of errors to provide error messages
at the call sites is left to future CLs.
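
A rough sketch of the described shape (simplified; not the runtime's
exact code):

	package main

	import "fmt"

	const maxInt = int(^uint(0) >> 1)

	// atoi parses a decimal integer, returning ok=false on empty input,
	// non-digit characters, or overflow. Negative numbers are accepted
	// (the most negative int is treated as overflow in this sketch).
	func atoi(s string) (int, bool) {
		neg := false
		if len(s) > 0 && s[0] == '-' {
			neg = true
			s = s[1:]
		}
		if len(s) == 0 {
			return 0, false
		}
		n := 0
		for i := 0; i < len(s); i++ {
			c := s[i]
			if c < '0' || c > '9' {
				return 0, false
			}
			d := int(c - '0')
			if n > (maxInt-d)/10 { // n*10+d would overflow
				return 0, false
			}
			n = n*10 + d
		}
		if neg {
			n = -n
		}
		return n, true
	}

	func main() {
		fmt.Println(atoi("-42"))                  // -42 true
		fmt.Println(atoi("99999999999999999999")) // 0 false (overflow)
	}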

Updates #17718

Change-Id: I3cacd0ab1230b9efc5404c68edae7304d39bcbc0
Reviewed-on: https://go-review.googlesource.com/32390
Reviewed-by: Ian Lance Taylor <iant@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
2016-11-01 14:04:39 +00:00
Austin Clements
bd640c882a runtime: disable stack rescanning by default
With the hybrid barrier in place, we can now disable stack rescanning
by default. This commit adds a "gcrescanstacks" GODEBUG variable that
is off by default but can be set to re-enable STW stack rescanning.
The plan is to leave this off but available in Go 1.8 for debugging
and as a fallback.

With this change, worst-case mark termination time at GOMAXPROCS=12
*not* including time spent stopping the world (which is still
unbounded) is reliably under 100 µs, with a 95%ile around 50 µs in
every benchmark I tried (the go1 benchmarks, the x/benchmarks garbage
benchmark, and the gcbench activegs and rpc benchmarks). Including
time spent stopping the world usually adds about 20 µs to total STW
time at GOMAXPROCS=12, but I've seen it add around 150 µs in these
benchmarks when a goroutine takes time to reach a safe point (see
issue #10958) or when stopping the world races with goroutine
switches. At GOMAXPROCS=1, where this isn't an issue, worst case STW
is typically 30 µs.

The go-gcbench activegs benchmark is designed to stress large numbers
of dirty stacks. This commit reduces 95%ile STW time for 500k dirty
stacks by nearly three orders of magnitude, from 150ms to 195µs.

This has little effect on the throughput of the go1 benchmarks or the
x/benchmarks benchmarks.

name         old time/op  new time/op  delta
XGarbage-12  2.31ms ± 0%  2.32ms ± 1%  +0.28%  (p=0.001 n=17+16)
XJSON-12     12.4ms ± 0%  12.4ms ± 0%  +0.41%  (p=0.000 n=18+18)
XHTTP-12     11.8µs ± 0%  11.8µs ± 1%    ~     (p=0.492 n=20+18)

It reduces the tail latency of the x/benchmarks HTTP benchmark:

name      old p50-time  new p50-time  delta
XHTTP-12    489µs ± 0%    491µs ± 1%  +0.54%  (p=0.000 n=20+18)

name      old p95-time  new p95-time  delta
XHTTP-12    957µs ± 1%    960µs ± 1%  +0.28%  (p=0.002 n=20+17)

name      old p99-time  new p99-time  delta
XHTTP-12   1.76ms ± 1%   1.64ms ± 1%  -7.20%  (p=0.000 n=20+18)

Comparing to the beginning of the hybrid barrier implementation
("runtime: parallelize STW mcache flushing") shows that the hybrid
barrier trades a small performance impact for much better STW latency,
as expected. The magnitude of the performance impact is generally
small:

name                      old time/op    new time/op    delta
BinaryTree17-12              2.37s ± 1%     2.42s ± 1%  +2.04%  (p=0.000 n=19+18)
Fannkuch11-12                2.84s ± 0%     2.72s ± 0%  -4.00%  (p=0.000 n=19+19)
FmtFprintfEmpty-12          44.2ns ± 1%    45.2ns ± 1%  +2.20%  (p=0.000 n=17+19)
FmtFprintfString-12          130ns ± 1%     134ns ± 0%  +2.94%  (p=0.000 n=18+16)
FmtFprintfInt-12             114ns ± 1%     117ns ± 0%  +3.01%  (p=0.000 n=19+15)
FmtFprintfIntInt-12          176ns ± 1%     182ns ± 0%  +3.17%  (p=0.000 n=20+15)
FmtFprintfPrefixedInt-12     186ns ± 1%     187ns ± 1%  +1.04%  (p=0.000 n=20+19)
FmtFprintfFloat-12           251ns ± 1%     250ns ± 1%  -0.74%  (p=0.000 n=17+18)
FmtManyArgs-12               746ns ± 1%     761ns ± 0%  +2.08%  (p=0.000 n=19+20)
GobDecode-12                6.57ms ± 1%    6.65ms ± 1%  +1.11%  (p=0.000 n=19+20)
GobEncode-12                5.59ms ± 1%    5.65ms ± 0%  +1.08%  (p=0.000 n=17+17)
Gzip-12                      223ms ± 1%     223ms ± 1%  -0.31%  (p=0.006 n=20+20)
Gunzip-12                   38.0ms ± 0%    37.9ms ± 1%  -0.25%  (p=0.009 n=19+20)
HTTPClientServer-12         77.5µs ± 1%    78.9µs ± 2%  +1.89%  (p=0.000 n=20+20)
JSONEncode-12               14.7ms ± 1%    14.9ms ± 0%  +0.75%  (p=0.000 n=20+20)
JSONDecode-12               53.0ms ± 1%    55.9ms ± 1%  +5.54%  (p=0.000 n=19+19)
Mandelbrot200-12            3.81ms ± 0%    3.81ms ± 1%  +0.20%  (p=0.023 n=17+19)
GoParse-12                  3.17ms ± 1%    3.18ms ± 1%    ~     (p=0.057 n=20+19)
RegexpMatchEasy0_32-12      71.7ns ± 1%    70.4ns ± 1%  -1.77%  (p=0.000 n=19+20)
RegexpMatchEasy0_1K-12       946ns ± 0%     946ns ± 0%    ~     (p=0.405 n=18+18)
RegexpMatchEasy1_32-12      67.2ns ± 2%    67.3ns ± 2%    ~     (p=0.732 n=20+20)
RegexpMatchEasy1_1K-12       374ns ± 1%     378ns ± 1%  +1.14%  (p=0.000 n=18+19)
RegexpMatchMedium_32-12      107ns ± 1%     107ns ± 1%    ~     (p=0.259 n=18+20)
RegexpMatchMedium_1K-12     34.2µs ± 1%    34.5µs ± 1%  +1.03%  (p=0.000 n=18+18)
RegexpMatchHard_32-12       1.77µs ± 1%    1.79µs ± 1%  +0.73%  (p=0.000 n=19+18)
RegexpMatchHard_1K-12       53.6µs ± 1%    54.2µs ± 1%  +1.10%  (p=0.000 n=19+19)
Template-12                 61.5ms ± 1%    63.9ms ± 0%  +3.96%  (p=0.000 n=18+18)
TimeParse-12                 303ns ± 1%     300ns ± 1%  -1.08%  (p=0.000 n=19+20)
TimeFormat-12                318ns ± 1%     320ns ± 0%  +0.79%  (p=0.000 n=19+19)
Revcomp-12 (*)               509ms ± 3%     504ms ± 0%    ~     (p=0.967 n=7+12)
[Geo mean]                  54.3µs         54.8µs       +0.88%

(*) Revcomp is highly non-linear, so I only took samples with 2
iterations.

name         old time/op  new time/op  delta
XGarbage-12  2.25ms ± 0%  2.32ms ± 1%  +2.74%  (p=0.000 n=16+16)
XJSON-12     11.6ms ± 0%  12.4ms ± 0%  +6.81%  (p=0.000 n=18+18)
XHTTP-12     11.6µs ± 1%  11.8µs ± 1%  +1.62%  (p=0.000 n=17+18)

Updates #17503.

Updates #17099, since you can't have a rescan list bug if there's no
rescan list. I'm not marking it as fixed, since gcrescanstacks can
still be set to re-enable the rescan lists.

Change-Id: I6e926b4c2dbd4cd56721869d4f817bdbb330b851
Reviewed-on: https://go-review.googlesource.com/31766
Reviewed-by: Rick Hudson <rlh@golang.org>
2016-10-28 21:24:13 +00:00
Austin Clements
ee3d20129a runtime: avoid getfull() barrier most of the time
With the hybrid barrier, unless we're doing a STW GC or hit a very
rare race (~once per all.bash) that can start mark termination before
all of the work is drained, we don't need to drain the work queue at
all. Even draining an empty work queue is rather expensive since we
have to enter the getfull() barrier, so it's worth avoiding this.

Conveniently, it's quite easy to detect whether or not we actually
need the getfull() barrier: since the world is stopped when we enter
mark termination, everything must have flushed its work to the work
queue, so we can just check the queue. If the queue is empty and we
haven't queued up any jobs that may create more work (which should
always be the case with the hybrid barrier), we can simply have all GC
workers perform non-blocking drains.

Also conveniently, this solution is quite safe. If we do somehow screw
something up and there's work on the work queue, some worker will
still process it, it just may not happen in parallel.

This is not the "right" solution, but it's simple, expedient,
low-risk, and maintains compatibility with debug.gcrescanstacks. When
we remove the gcrescanstacks fallback in Go 1.9, we should also fix
the race that starts mark termination early, and then we can eliminate
work draining from mark termination.

Updates #17503.

Change-Id: I7b3cd5de6a248ab29d78c2b42aed8b7443641361
Reviewed-on: https://go-review.googlesource.com/32186
Reviewed-by: Rick Hudson <rlh@golang.org>
2016-10-28 21:23:52 +00:00
Austin Clements
85c22bc3a5 runtime: mark tiny blocks at GC start
The hybrid barrier requires allocate-black, but there's one case where
we don't currently allocate black: the tiny allocator. If we allocate
a *new* tiny alloc block during GC, it will be allocated black, but if
we allocated the current block before GC, it won't be black, and the
further allocations from it won't mark it, which means we may free a
reachable tiny block during sweeping.

Fix this by passing over all mcaches at the beginning of mark, while
the world is still stopped, and greying their tiny blocks.

Updates #17503.

Change-Id: I04d4df7cc2f553f8f7b1e4cb0b52e2946588111a
Reviewed-on: https://go-review.googlesource.com/31456
Reviewed-by: Rick Hudson <rlh@golang.org>
2016-10-28 20:05:28 +00:00
Austin Clements
a475a38a3d runtime: parallelize STW mcache flushing
Currently all mcaches are flushed in a single STW root job. This takes
about 5 µs per P, but since it's done sequentially it adds about
5*GOMAXPROCS µs to the STW.

Fix this by parallelizing the job. Since there are exactly GOMAXPROCS
mcaches to flush, this parallelizes quite nicely and brings the STW
latency cost down to a constant 5 µs (assuming GOMAXPROCS actually
reflects the number of CPUs).

Updates #17503.

Change-Id: Ibefeb1c2229975d5137c6e67fac3b6c92103742d
Reviewed-on: https://go-review.googlesource.com/32033
Reviewed-by: Rick Hudson <rlh@golang.org>
2016-10-28 18:19:53 +00:00
Austin Clements
6834839427 runtime, cmd/trace: annotate different mark worker types
Currently mark workers are shown in the trace as regular goroutines
labeled "runtime.gcBgMarkWorker". That's somewhat unhelpful to an end
user because of the opaque label and particularly unhelpful to runtime
developers because it doesn't distinguish the different types of mark
workers.

Fix this by introducing a variant of the GoStart event called
GoStartLabel that lets the runtime indicate a label for a goroutine
execution span and using this to label mark worker executions as "GC
(<mode>)" in the trace viewer.

Since this bumps the trace version to 1.8, we also add test data for
1.7 traces.

Change-Id: Id7b9c0536508430c661ffb9e40e436f3901ca121
Reviewed-on: https://go-review.googlesource.com/30702
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
2016-10-28 14:29:40 +00:00
Peter Weinberger
ca922b6d36 runtime: Profile goroutines holding contended mutexes.
runtime.SetMutexProfileFraction(n int) will capture 1/n-th of stack
traces of goroutines holding contended mutexes if n > 0. From runtime/pprof,
pprof.Lookup("mutex").WriteTo writes the accumulated
stack traces to w (in essentially the same format that blocking
profiling uses).
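
A small usage sketch (the output below is assumed to be the usual pprof
text form selected by debug=1):

	package main

	import (
		"os"
		"runtime"
		"runtime/pprof"
		"sync"
	)

	func main() {
		// Sample roughly 1 in 5 contention events (0 disables, 1 samples all).
		runtime.SetMutexProfileFraction(5)
		defer runtime.SetMutexProfileFraction(0)

		// Generate some contention for the profiler to record.
		var mu sync.Mutex
		var wg sync.WaitGroup
		for i := 0; i < 8; i++ {
			wg.Add(1)
			go func() {
				defer wg.Done()
				for j := 0; j < 1000; j++ {
					mu.Lock()
					mu.Unlock()
				}
			}()
		}
		wg.Wait()

		// Dump the accumulated contention stacks.
		pprof.Lookup("mutex").WriteTo(os.Stdout, 1)
	}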

Change-Id: Ie0b54fa4226853d99aa42c14cb529ae586a8335a
Reviewed-on: https://go-review.googlesource.com/29650
Reviewed-by: Austin Clements <austin@google.com>
2016-10-28 11:47:16 +00:00
Austin Clements
d6625caf53 runtime: scan mark worker stacks like normal
Currently, markroot delays scanning mark worker stacks until mark
termination by putting the mark worker G directly on the rescan list
when it encounters one during the mark phase. Without this, since mark
workers are non-preemptible, two mark workers that attempt to scan
each other's stacks can deadlock.

However, this is annoyingly asymmetric and causes some real problems.
First, markroot does not own the G at that point, so it's not
technically safe to add it to the rescan list. I haven't been able to
find a specific problem this could cause, but I suspect it's the root
cause of issue #17099. Second, this will interfere with the hybrid
barrier, since there is no stack rescanning during mark termination
with the hybrid barrier.

This commit switches to a different approach. We move the mark
worker's call to gcDrain to the system stack and set the mark worker's
status to _Gwaiting for the duration of the drain to indicate that
it's preemptible. This lets another mark worker scan its G stack while
the drain is running on the system stack. We don't return to the G
stack until we can switch back to _Grunning, which ensures we don't
race with a stack scan. This lets us eliminate the special case for
mark worker stack scans and scan them just like any other goroutine.
The only subtlety to this approach is that we have to disable stack
shrinking for mark workers; they could be referring to captured
variables from the G stack, so it's not safe to move their stacks.

Updates #17099 and #17503.

Change-Id: Ia5213949ec470af63e24dfce01df357c12adbbea
Reviewed-on: https://go-review.googlesource.com/31820
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
2016-10-26 18:13:16 +00:00
Austin Clements
575b1dda4e runtime: eliminate allspans snapshot
Now that sweeping and span marking use the sweep list, there's no need
for the work.spans snapshot of the allspans list. This change
eliminates the few remaining uses of it, which are either dead code or
can use allspans directly, and removes work.spans and its support
functions.

Change-Id: Id5388b42b1e68e8baee853d8eafb8bb4ff95bb43
Reviewed-on: https://go-review.googlesource.com/30537
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
2016-10-25 22:33:02 +00:00
Austin Clements
f9497a6747 runtime: make sweep time proportional to in-use spans
Currently sweeping walks the list of all spans, which means the work
in sweeping is proportional to the maximum number of spans ever used.
If the heap was once large but is now small, this causes an
amortization failure: on a small heap, GCs happen frequently, but a
full sweep still has to happen in each GC cycle, which means we spent
a lot of time in sweeping.

Fix this by creating a separate list consisting of just the in-use
spans to be swept, so sweeping is proportional to the number of in-use
spans (which is proportional to the live heap). Specifically, we
create two lists: a list of unswept in-use spans and a list of swept
in-use spans. At the start of the sweep cycle, the swept list becomes
the unswept list and the new swept list is empty. Allocating a new
in-use span adds it to the swept list. Sweeping moves spans from the
unswept list to the swept list.

This fixes the amortization problem because a shrinking heap moves
spans off the unswept list without adding them to the swept list,
reducing the time required by the next sweep cycle.
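
The list mechanics, as a rough sketch with illustrative types (the
runtime tracks mspans, not ints):

	package main

	import "fmt"

	// sweeper keeps in-use spans on one of two lists: not yet swept this
	// cycle, or already swept (including spans allocated this cycle).
	type sweeper struct {
		unswept []int
		swept   []int
	}

	// startCycle begins a new sweep cycle: the spans swept last cycle
	// become this cycle's unswept work, and the swept list starts empty.
	func (s *sweeper) startCycle() {
		s.unswept, s.swept = s.swept, s.unswept[:0]
	}

	// allocSpan records a newly allocated in-use span; it is born swept.
	func (s *sweeper) allocSpan(span int) { s.swept = append(s.swept, span) }

	// sweepOne sweeps one span, moving it from unswept to swept.
	func (s *sweeper) sweepOne() bool {
		if len(s.unswept) == 0 {
			return false
		}
		span := s.unswept[len(s.unswept)-1]
		s.unswept = s.unswept[:len(s.unswept)-1]
		s.swept = append(s.swept, span)
		return true
	}

	func main() {
		var s sweeper
		s.allocSpan(1)
		s.allocSpan(2)
		s.startCycle()
		for s.sweepOne() {
		}
		fmt.Println(len(s.unswept), len(s.swept)) // 0 2
	}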

Updates #9265. This fix eliminates almost all of the time spent in
sweepone; however, markrootSpans has essentially the same bug, so now
the test program from this issue spends all of its time in
markrootSpans.

No significant effect on other benchmarks.

Change-Id: Ib382e82790aad907da1c127e62b3ab45d7a4ac1e
Reviewed-on: https://go-review.googlesource.com/30535
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
2016-10-25 22:32:57 +00:00
Austin Clements
45baff61e3 runtime: expand comment on work.spans
Change-Id: I4b8a6f5d9bc5aba16026d17f99f3512dacde8d2d
Reviewed-on: https://go-review.googlesource.com/30534
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
2016-10-25 22:32:54 +00:00
Austin Clements
4d6207790b runtime: consolidate h_allspans and mheap_.allspans
These are two ways to refer to the allspans array that hark back to
when the runtime was split between C and Go. Clean this up by making
mheap_.allspans a slice and eliminating h_allspans.

Change-Id: Ic9360d040cf3eb590b5dfbab0b82e8ace8525610
Reviewed-on: https://go-review.googlesource.com/30530
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
2016-10-25 22:32:42 +00:00
Austin Clements
5b7497f327 runtime: update heap profile stats after world is started
Updating the heap profile stats is one of the most expensive parts of
mark termination other than stack rescanning, but there's really no
need to do this with the world stopped. Move it to right after we've
started the world back up. This creates a *very* small window where
allocations from the next cycle can slip into the profile, but the
exact point where mark termination happens is so non-deterministic
already that a slight reordering here is unimportant.

Change-Id: I2f76f22c70329923ad6a594a2c26869f0736d34e
Reviewed-on: https://go-review.googlesource.com/31363
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
2016-10-19 21:36:24 +00:00
Austin Clements
9429aab999 runtime: remove gcWork flushes in mark termination
The only reason these flushes are still necessary at all is that
gcmarknewobject doesn't flush its gcWork stats like it's supposed to.
By changing gcmarknewobject to follow the standard protocol, the
flushes become completely unnecessary because mark 2 ensures caches
are flushed (and stay flushed) before we ever enter mark termination.

In the garbage benchmark, this takes roughly 50 µs, which is
surprisingly long for doing nothing. We still double-check after
draining that they are in fact empty.

Change-Id: Ia1c7cf98a53f72baa513792eb33eca6a0b4a7128
Reviewed-on: https://go-review.googlesource.com/31134
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
2016-10-19 21:36:09 +00:00
Gleb Stepanov
460d112f6c runtime: fix typo in comments
Fix typo in word synchronization in comments.

Change-Id: I453b4e799301e758799c93df1e32f5244ca2fb84
Reviewed-on: https://go-review.googlesource.com/25174
Reviewed-by: Russ Cox <rsc@golang.org>
2016-10-12 02:48:31 +00:00
Austin Clements
fa9b57bb1d runtime: make next_gc ^0 when GC is disabled
When GC is disabled, we set gcpercent to -1. However, we still use
gcpercent to compute several values, such as next_gc and gc_trigger.
These calculations are meaningless when gcpercent is -1 and result in
meaningless values. This is okay in a sense because we also never use
these values if gcpercent is -1, but they're confusing when exposed to
the user, for example via MemStats or the execution trace. It's
particularly unfortunate in the execution trace because it attempts to
plot the underflowed value of next_gc, which scales all useful
information in the heap row into oblivion.

Fix this by making next_gc ^0 when gcpercent < 0. This has the
advantage of being true in a way: next_gc is effectively infinite when
gcpercent < 0. We can also detect this special value when updating the
execution trace and report next_gc as 0 so it doesn't blow up the
display of the heap line.

Change-Id: I4f366e4451f8892a4908da7b2b6086bdc67ca9a9
Reviewed-on: https://go-review.googlesource.com/30016
Reviewed-by: Rick Hudson <rlh@golang.org>
2016-10-07 18:32:51 +00:00
Austin Clements
2098e5d39a runtime: eliminate memstats.heap_reachable
We used to compute an estimate of the reachable heap size that was
different from the marked heap size. This ultimately caused more
problems than it solved, so we pulled it out, but memstats still has
both heap_reachable and heap_marked, and there are some leftover TODOs
about the problems with this estimate.

Clean this up by eliminating heap_reachable in favor of heap_marked
and deleting the stale TODOs.

Change-Id: I713bc20a7c90683d2b43ff63c0b21a440269cc4d
Reviewed-on: https://go-review.googlesource.com/29271
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
2016-09-26 22:00:47 +00:00
Austin Clements
ec9c84c884 runtime: disentangle next_gc from GC trigger
Back in Go 1.4, memstats.next_gc was both the heap size at which GC
would trigger, and the size GC kept the heap under. When we switched
to concurrent GC in Go 1.5, we got somewhat confused and made this
variable the trigger heap size, while gcController.heapGoal became the
goal heap size.

memstats.next_gc is exposed to the user via MemStats.NextGC, while
gcController.heapGoal is not. This is unfortunate because 1) the heap
goal is far more useful for diagnostics, and 2) the trigger heap size
is just part of the GC trigger heuristic, which means it wouldn't be
useful to an application even if it tried to use it.

We never noticed this mess because MemStats.NextGC is practically
undocumented. Now that we're trying to document MemStats, it became
clear that this field had diverged from its original usefulness.

Clean up this mess by shuffling things back around so that next_gc is
the goal heap size and the new (unexposed) memstats.gc_trigger field
is the trigger heap size. This eliminates gcController.heapGoal.

Updates #15849.

Change-Id: I2cbbd43b1d78bdf613cb43f53488bd63913189b7
Reviewed-on: https://go-review.googlesource.com/29270
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Hyang-Ah Hana Kim <hyangah@gmail.com>
Reviewed-by: Rick Hudson <rlh@golang.org>
2016-09-26 22:00:44 +00:00
Austin Clements
cf4f1d07a1 runtime: bound scanobject to ~100 µs
Currently the time spent in scanobject is proportional to the size of
the object being scanned. Since scanobject is non-preemptible, large
objects can cause significant goroutine (and even whole application)
delays through several means:

1. If a GC assist picks up a large object, the allocating goroutine is
   blocked for the whole scan, even if that scan well exceeds that
   goroutine's debt.

2. Since the scheduler does not run on the P performing a large object
   scan, goroutines in that P's run queue do not run unless they are
   stolen by another P (which can take some time). If there are a few
   large objects, all of the Ps may get tied up so the scheduler
   doesn't run anywhere.

3. Even if a large object is scanned by a background worker and other
   Ps are still running the scheduler, the large object scan doesn't
   flush background credit until the whole scan is done. This can
   easily cause all allocations to block in assists, waiting for
   credit, causing an effective STW.

Fix this by splitting large objects into 128 KB "oblets" and scanning
at most one oblet at a time. Since we can scan 1–2 MB/ms, this equates
to bounding scanobject at roughly 100 µs. This improves assist
behavior both because assists can no longer get "unlucky" and be stuck
scanning a large object, and because it causes the background worker
to flush credit and unblock assists more frequently when scanning
large objects. This also improves GC parallelism if the heap consists
primarily of a small number of very large objects by letting multiple
workers scan a large object in parallel.

Fixes #10345. Fixes #16293.

This substantially improves goroutine latency in the benchmark from
issue #16293, which exercises several forms of very large objects:

name                 old max-latency    new max-latency    delta
SliceNoPointer-12           154µs ± 1%        155µs ±  2%     ~     (p=0.087 n=13+12)
SlicePointer-12             314ms ± 1%       5.94ms ±138%  -98.11%  (p=0.000 n=19+20)
SliceLivePointer-12        1148ms ± 0%       4.72ms ±167%  -99.59%  (p=0.000 n=19+20)
MapNoPointer-12           72509µs ± 1%        408µs ±325%  -99.44%  (p=0.000 n=19+18)
ChanPointer-12              313ms ± 0%       4.74ms ±140%  -98.49%  (p=0.000 n=18+20)
ChanLivePointer-12         1147ms ± 0%       3.30ms ±149%  -99.71%  (p=0.000 n=19+20)

name                 old P99.9-latency  new P99.9-latency  delta
SliceNoPointer-12           113µs ±25%         107µs ±12%     ~     (p=0.153 n=20+18)
SlicePointer-12          309450µs ± 0%         133µs ±23%  -99.96%  (p=0.000 n=20+20)
SliceLivePointer-12         961ms ± 0%        1.35ms ±27%  -99.86%  (p=0.000 n=20+20)
MapNoPointer-12            448µs ±288%         119µs ±18%  -73.34%  (p=0.000 n=18+20)
ChanPointer-12           309450µs ± 0%         134µs ±23%  -99.96%  (p=0.000 n=20+19)
ChanLivePointer-12          961ms ± 0%        1.35ms ±27%  -99.86%  (p=0.000 n=20+20)

This has negligible effect on all metrics from the garbage, JSON, and
HTTP x/benchmarks.

It shows slight improvement on some of the go1 benchmarks,
particularly Revcomp, which uses some multi-megabyte buffers:

name                      old time/op    new time/op    delta
BinaryTree17-12              2.46s ± 1%     2.47s ± 1%  +0.32%  (p=0.012 n=20+20)
Fannkuch11-12                2.82s ± 0%     2.81s ± 0%  -0.61%  (p=0.000 n=17+20)
FmtFprintfEmpty-12          50.8ns ± 5%    50.5ns ± 2%    ~     (p=0.197 n=17+19)
FmtFprintfString-12          131ns ± 1%     132ns ± 0%  +0.57%  (p=0.000 n=20+16)
FmtFprintfInt-12             117ns ± 0%     116ns ± 0%  -0.47%  (p=0.000 n=15+20)
FmtFprintfIntInt-12          180ns ± 0%     179ns ± 1%  -0.78%  (p=0.000 n=16+20)
FmtFprintfPrefixedInt-12     186ns ± 1%     185ns ± 1%  -0.55%  (p=0.000 n=19+20)
FmtFprintfFloat-12           263ns ± 1%     271ns ± 0%  +2.84%  (p=0.000 n=18+20)
FmtManyArgs-12               741ns ± 1%     742ns ± 1%    ~     (p=0.190 n=19+19)
GobDecode-12                7.44ms ± 0%    7.35ms ± 1%  -1.21%  (p=0.000 n=20+20)
GobEncode-12                6.22ms ± 1%    6.21ms ± 1%    ~     (p=0.336 n=20+19)
Gzip-12                      220ms ± 1%     219ms ± 1%    ~     (p=0.130 n=19+19)
Gunzip-12                   37.9ms ± 0%    37.9ms ± 1%    ~     (p=1.000 n=20+19)
HTTPClientServer-12         82.5µs ± 3%    82.6µs ± 3%    ~     (p=0.776 n=20+19)
JSONEncode-12               16.4ms ± 1%    16.5ms ± 2%  +0.49%  (p=0.003 n=18+19)
JSONDecode-12               53.7ms ± 1%    54.1ms ± 1%  +0.71%  (p=0.000 n=19+18)
Mandelbrot200-12            4.19ms ± 1%    4.20ms ± 1%    ~     (p=0.452 n=19+19)
GoParse-12                  3.38ms ± 1%    3.37ms ± 1%    ~     (p=0.123 n=19+19)
RegexpMatchEasy0_32-12      72.1ns ± 1%    71.8ns ± 1%    ~     (p=0.397 n=19+17)
RegexpMatchEasy0_1K-12       242ns ± 0%     242ns ± 0%    ~     (p=0.168 n=17+20)
RegexpMatchEasy1_32-12      72.1ns ± 1%    72.1ns ± 1%    ~     (p=0.538 n=18+19)
RegexpMatchEasy1_1K-12       385ns ± 1%     384ns ± 1%    ~     (p=0.388 n=20+20)
RegexpMatchMedium_32-12      112ns ± 1%     112ns ± 3%    ~     (p=0.539 n=20+20)
RegexpMatchMedium_1K-12     34.4µs ± 2%    34.4µs ± 2%    ~     (p=0.628 n=18+18)
RegexpMatchHard_32-12       1.80µs ± 1%    1.80µs ± 1%    ~     (p=0.522 n=18+19)
RegexpMatchHard_1K-12       54.0µs ± 1%    54.1µs ± 1%    ~     (p=0.647 n=20+19)
Revcomp-12                   387ms ± 1%     369ms ± 5%  -4.89%  (p=0.000 n=17+19)
Template-12                 62.3ms ± 1%    62.0ms ± 0%  -0.48%  (p=0.002 n=20+17)
TimeParse-12                 314ns ± 1%     314ns ± 0%    ~     (p=1.011 n=20+13)
TimeFormat-12                358ns ± 0%     354ns ± 0%  -1.12%  (p=0.000 n=17+20)
[Geo mean]                  53.5µs         53.3µs       -0.23%

Change-Id: I2a0a179d1d6bf7875dd054b7693dd12d2a340132
Reviewed-on: https://go-review.googlesource.com/23540
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
2016-09-06 19:27:33 +00:00