From 03eb9483e37a5c0c9208cff17138a94f44e4d6a1 Mon Sep 17 00:00:00 2001
From: Austin Clements <austin@google.com>
Date: Wed, 19 Jul 2017 17:03:52 -0400
Subject: [PATCH] runtime: separate soft and hard heap limits
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Currently, GC pacing is based on a single hard heap limit computed
based on GOGC. In order to achieve this hard limit, assist pacing
makes the conservative assumption that the entire heap is live.
However, in the steady state (with GOGC=100), only half of the heap is
live. As a result, the garbage collector works twice as hard as
necessary and finishes half way between the trigger and the goal.
Since this is a stable state for the trigger controller, this repeats
from cycle to cycle. Matters are even worse if GOGC is higher. For
example, if GOGC=200, only a third of the heap is live in steady
state, so the GC will work three times harder than necessary and
finish only a third of the way between the trigger and the goal.

Since this causes the garbage collector to consume ~50% of the
available CPU during marking instead of the intended 25%, about 25% of
the CPU goes to mutator assists. This high mutator assist cost causes
high mutator latency variability.

This commit improves the situation by separating the heap goal into
two goals: a soft goal and a hard goal. The soft goal is set based on
GOGC, just like the current goal is, and the hard goal is set at a 10%
larger heap than the soft goal. Prior to the soft goal, assist pacing
assumes the heap is in steady state (e.g., only half of it is live).
Between the soft goal and the hard goal, assist pacing switches to the
current conservative assumption that the entire heap is live.

In benchmarks, this nearly eliminates mutator assists. However, since
background marking is fixed at 25% CPU, this causes the trigger
controller to saturate, which leads to somewhat higher variability in
heap size. The next commit will address this.

The lower CPU usage of course leads to longer mark cycles, though
really it means the mark cycles are as long as they should have been
in the first place. This does, however, lead to two potential
down-sides compared to the current pacing policy: 1. the total
overhead of the write barrier is higher because it's enabled more of
the time and 2. the heap size may be larger because there's more
floating garbage. We addressed 1 by significantly improving the
performance of the write barrier in the preceding commits. 2 can be
demonstrated in intense GC benchmarks, but doesn't seem to be a
problem in any real applications.

Updates #14951.
Updates #14812 (fixes?).
Fixes #18534.

This has no significant effect on the throughput of the
github.com/dr2chase/bent benchmarks-50.

This has little overall throughput effect on the go1 benchmarks:

name                      old time/op    new time/op    delta
BinaryTree17-12              2.41s ± 0%     2.40s ± 0%  -0.22%  (p=0.007 n=20+18)
Fannkuch11-12                2.95s ± 0%     2.95s ± 0%  +0.07%  (p=0.003 n=17+18)
FmtFprintfEmpty-12          41.7ns ± 3%    42.2ns ± 0%  +1.17%  (p=0.002 n=20+15)
FmtFprintfString-12         66.5ns ± 0%    67.9ns ± 2%  +2.16%  (p=0.000 n=16+20)
FmtFprintfInt-12            77.6ns ± 2%    75.6ns ± 3%  -2.55%  (p=0.000 n=19+19)
FmtFprintfIntInt-12          124ns ± 1%     123ns ± 1%  -0.98%  (p=0.000 n=18+17)
FmtFprintfPrefixedInt-12     151ns ± 1%     148ns ± 1%  -1.75%  (p=0.000 n=19+20)
FmtFprintfFloat-12           210ns ± 1%     212ns ± 0%  +0.75%  (p=0.000 n=19+16)
FmtManyArgs-12               501ns ± 1%     499ns ± 1%  -0.30%  (p=0.041 n=17+19)
GobDecode-12                6.50ms ± 1%    6.49ms ± 1%    ~     (p=0.234 n=19+19)
GobEncode-12                5.43ms ± 0%    5.47ms ± 0%  +0.75%  (p=0.000 n=20+19)
Gzip-12                      216ms ± 1%     220ms ± 1%  +1.71%  (p=0.000 n=19+20)
Gunzip-12                   38.6ms ± 0%    38.8ms ± 0%  +0.66%  (p=0.000 n=18+19)
HTTPClientServer-12         78.1µs ± 1%    78.5µs ± 1%  +0.49%  (p=0.035 n=20+20)
JSONEncode-12               12.1ms ± 0%    12.2ms ± 0%  +1.05%  (p=0.000 n=18+17)
JSONDecode-12               53.0ms ± 0%    52.3ms ± 0%  -1.27%  (p=0.000 n=19+19)
Mandelbrot200-12            3.74ms ± 0%    3.69ms ± 0%  -1.17%  (p=0.000 n=18+19)
GoParse-12                  3.17ms ± 1%    3.17ms ± 1%    ~     (p=0.569 n=19+20)
RegexpMatchEasy0_32-12      73.2ns ± 1%    73.7ns ± 0%  +0.76%  (p=0.000 n=18+17)
RegexpMatchEasy0_1K-12       239ns ± 0%     238ns ± 0%  -0.27%  (p=0.000 n=13+17)
RegexpMatchEasy1_32-12      69.0ns ± 2%    69.1ns ± 1%    ~     (p=0.404 n=19+19)
RegexpMatchEasy1_1K-12       367ns ± 1%     365ns ± 1%  -0.60%  (p=0.000 n=19+19)
RegexpMatchMedium_32-12      105ns ± 1%     104ns ± 1%  -1.24%  (p=0.000 n=19+16)
RegexpMatchMedium_1K-12     34.1µs ± 2%    33.6µs ± 3%  -1.60%  (p=0.000 n=20+20)
RegexpMatchHard_32-12       1.62µs ± 1%    1.67µs ± 1%  +2.75%  (p=0.000 n=18+18)
RegexpMatchHard_1K-12       48.8µs ± 1%    50.3µs ± 2%  +3.07%  (p=0.000 n=20+19)
Revcomp-12                   386ms ± 0%     384ms ± 0%  -0.57%  (p=0.000 n=20+19)
Template-12                 59.9ms ± 1%    61.1ms ± 1%  +2.01%  (p=0.000 n=20+19)
TimeParse-12                 301ns ± 2%     307ns ± 0%  +2.11%  (p=0.000 n=20+19)
TimeFormat-12                323ns ± 0%     323ns ± 0%    ~     (all samples are equal)
[Geo mean]                  47.0µs         47.1µs       +0.23%

https://perf.golang.org/search?q=upload:20171030.1

Likewise, the throughput effect on the x/benchmarks is minimal (and
reasonably positive on the garbage benchmark with a large heap):

name                         old time/op  new time/op  delta
Garbage/benchmem-MB=1024-12  2.40ms ± 4%  2.29ms ± 3%  -4.57%  (p=0.000 n=19+18)
Garbage/benchmem-MB=64-12    2.23ms ± 1%  2.24ms ± 2%  +0.59%  (p=0.016 n=19+18)
HTTP-12                      12.5µs ± 1%  12.6µs ± 1%    ~     (p=0.326 n=20+19)
JSON-12                      11.1ms ± 1%  11.3ms ± 2%  +2.15%  (p=0.000 n=16+17)

It does increase the heap size of the garbage benchmarks, but seems to
have relatively little impact on more realistic programs. Also, we'll
gain some of this back with the next commit.

name                         old peak-RSS-bytes  new peak-RSS-bytes  delta
Garbage/benchmem-MB=1024-12          1.21G ± 1%          1.88G ± 2%  +55.59%  (p=0.000 n=19+20)
Garbage/benchmem-MB=64-12             168M ± 3%           248M ± 8%  +48.08%  (p=0.000 n=18+20)
HTTP-12                              45.6M ± 9%          47.0M ±27%     ~     (p=0.925 n=20+20)
JSON-12                               193M ±11%           206M ±11%   +7.06%  (p=0.001 n=20+20)

https://perf.golang.org/search?q=upload:20171030.2

Change-Id: Ic78904135f832b4d64056cbe734ab979f5ad9736
Reviewed-on: https://go-review.googlesource.com/59970
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
---
 src/runtime/mgc.go | 70 +++++++++++++++++++++++++++++++---------------
 1 file changed, 48 insertions(+), 22 deletions(-)

diff --git a/src/runtime/mgc.go b/src/runtime/mgc.go
index bc5e4fb40a..688f36afb0 100644
--- a/src/runtime/mgc.go
+++ b/src/runtime/mgc.go
@@ -500,47 +500,73 @@ func (c *gcControllerState) startCycle() {
 // is when assists are enabled and the necessary statistics are
 // available).
 func (c *gcControllerState) revise() {
-	// Compute the expected scan work remaining.
+	gcpercent := gcpercent
+	if gcpercent < 0 {
+		// If GC is disabled but we're running a forced GC,
+		// act like GOGC is huge for the below calculations.
+		gcpercent = 100000
+	}
+	live := atomic.Load64(&memstats.heap_live)
+
+	var heapGoal, scanWorkExpected int64
+	if live <= memstats.next_gc {
+		// We're under the soft goal. Pace GC to complete at
+		// next_gc assuming the heap is in steady-state.
+		heapGoal = int64(memstats.next_gc)
+
+		// Compute the expected scan work remaining.
+		//
+		// This is estimated based on the expected
+		// steady-state scannable heap. For example, with
+		// GOGC=100, only half of the scannable heap is
+		// expected to be live, so that's what we target.
+		//
+		// (This is a float calculation to avoid overflowing on
+		// 100*heap_scan.)
+		scanWorkExpected = int64(float64(memstats.heap_scan) * 100 / float64(100+gcpercent))
+	} else {
+		// We're past the soft goal. Pace GC so that in the
+		// worst case it will complete by the hard goal.
+		const maxOvershoot = 1.1
+		heapGoal = int64(float64(memstats.next_gc) * maxOvershoot)
+
+		// Compute the upper bound on the scan work remaining.
+		scanWorkExpected = int64(memstats.heap_scan)
+	}
+
+	// Compute the remaining scan work estimate.
 	//
 	// Note that we currently count allocations during GC as both
 	// scannable heap (heap_scan) and scan work completed
-	// (scanWork), so this difference won't be changed by
-	// allocations during GC.
-	//
-	// This particular estimate is a strict upper bound on the
-	// possible remaining scan work for the current heap.
-	// You might consider dividing this by 2 (or by
-	// (100+GOGC)/100) to counter this over-estimation, but
-	// benchmarks show that this has almost no effect on mean
-	// mutator utilization, heap size, or assist time and it
-	// introduces the danger of under-estimating and letting the
-	// mutator outpace the garbage collector.
-	scanWorkExpected := int64(memstats.heap_scan) - c.scanWork
-	if scanWorkExpected < 1000 {
+	// (scanWork), so allocation will change this difference will
+	// slowly in the soft regime and not at all in the hard
+	// regime.
+	scanWorkRemaining := scanWorkExpected - c.scanWork
+	if scanWorkRemaining < 1000 {
 		// We set a somewhat arbitrary lower bound on
 		// remaining scan work since if we aim a little high,
 		// we can miss by a little.
 		//
 		// We *do* need to enforce that this is at least 1,
 		// since marking is racy and double-scanning objects
-		// may legitimately make the expected scan work
-		// negative.
-		scanWorkExpected = 1000
+		// may legitimately make the remaining scan work
+		// negative, even in the hard goal regime.
+		scanWorkRemaining = 1000
 	}
 
 	// Compute the heap distance remaining.
-	heapDistance := int64(memstats.next_gc) - int64(atomic.Load64(&memstats.heap_live))
-	if heapDistance <= 0 {
+	heapRemaining := heapGoal - int64(live)
+	if heapRemaining <= 0 {
 		// This shouldn't happen, but if it does, avoid
 		// dividing by zero or setting the assist negative.
-		heapDistance = 1
+		heapRemaining = 1
 	}
 
 	// Compute the mutator assist ratio so by the time the mutator
 	// allocates the remaining heap bytes up to next_gc, it will
 	// have done (or stolen) the remaining amount of scan work.
-	c.assistWorkPerByte = float64(scanWorkExpected) / float64(heapDistance)
-	c.assistBytesPerWork = float64(heapDistance) / float64(scanWorkExpected)
+	c.assistWorkPerByte = float64(scanWorkRemaining) / float64(heapRemaining)
+	c.assistBytesPerWork = float64(heapRemaining) / float64(scanWorkRemaining)
 }
 
 // endCycle computes the trigger ratio for the next cycle.