Mirror of https://github.com/golang/go, synced 2024-11-17 12:14:47 -07:00
runtime: tweak bgsweep "low-priority" heuristic
Currently bgsweep attempts to be a low-priority background goroutine that runs mainly when the application is mostly idle. To avoid complicating the scheduler further, it achieves this with a simple heuristic: call Gosched after each span swept. While this is somewhat inefficient, since there's scheduling overhead on each iteration, it's mostly fine because the time tends to just come out of idle time anyway. In a busy system, the call to Gosched quickly puts bgsweep at the back of scheduler queues.

However, what's problematic about this heuristic is the number of tracing events it produces. Average span sweeping latencies have been measured as low as 30 ns, so during the sweep phase, with available idle time, a few trace events would be emitted roughly every 30 ns. This can result in an overwhelming number of events, making traces much larger than they need to be. It also pollutes other observability tools, like the scheduling latencies runtime metric, because bgsweep stays runnable the whole time.

This change fixes these problems with two modifications to the heuristic:

1. Check if there are any idle Ps before yielding. If there are, don't yield.
2. Sweep at least 10 spans before trying to yield.

(1) is doing most of the work here. This change assumes that the presence of idle Ps means that there is available CPU time, so bgsweep is already making use of idle time and there's no reason it should stop. This will have the biggest impact on the aforementioned issues.

(2) is a mitigation for the case where GOMAXPROCS=1, because with only one P we will never observe a non-zero idle P count. It does mean that bgsweep is a little higher priority than before, because it yields its time less often, so it could interfere more with goroutine scheduling latencies. However, by sweeping 10 spans before volunteering time, we directly reduce trace event production by 90% in all cases. The impact on scheduling latencies should be fairly minimal: sweeping a span is already so fast that sweeping 10 is unlikely to make a dent in any meaningful end-to-end latency. In fact, it may even improve application latencies overall by freeing up spans and taking sweep work off of goroutines allocating memory. It may be worth considering pushing this number higher in the future.

Another reason to do (2) is to reduce contention on npidle, which is checked as part of (1), but this is a fairly minor concern. The main reason is to capture the GOMAXPROCS=1 case.

Fixes #54767.

Change-Id: I4361400f17197b8ab84c01f56203f20575b29fc6
Reviewed-on: https://go-review.googlesource.com/c/go/+/429615
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Michael Knyszek <mknyszek@google.com>
Auto-Submit: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
parent 63ceff95fa
commit 8451529e9a
@@ -279,12 +279,34 @@ func bgsweep(c chan int) {
 	goparkunlock(&sweep.lock, waitReasonGCSweepWait, traceEvGoBlock, 1)
 
 	for {
+		// bgsweep attempts to be a "low priority" goroutine by intentionally
+		// yielding time. It's OK if it doesn't run, because goroutines allocating
+		// memory will sweep and ensure that all spans are swept before the next
+		// GC cycle. We really only want to run when we're idle.
+		//
+		// However, calling Gosched after each span swept produces a tremendous
+		// amount of tracing events, sometimes up to 50% of events in a trace. It's
+		// also inefficient to call into the scheduler so much because sweeping a
+		// single span is in general a very fast operation, taking as little as 30 ns
+		// on modern hardware. (See #54767.)
+		//
+		// As a result, bgsweep sweeps in batches, and only calls into the scheduler
+		// at the end of every batch. Furthermore, it only yields its time if there
+		// isn't spare idle time available on other cores. If there's available idle
+		// time, helping to sweep can reduce allocation latencies by getting ahead of
+		// the proportional sweeper and having spans ready to go for allocation.
+		const sweepBatchSize = 10
+		nSwept := 0
 		for sweepone() != ^uintptr(0) {
 			sweep.nbgsweep++
-			Gosched()
+			nSwept++
+			if nSwept%sweepBatchSize == 0 {
+				goschedIfBusy()
+			}
 		}
 		for freeSomeWbufs(true) {
-			Gosched()
+			// N.B. freeSomeWbufs is already batched internally.
+			goschedIfBusy()
 		}
 		lock(&sweep.lock)
 		if !isSweepDone() {
@@ -326,6 +326,18 @@ func goschedguarded() {
 	mcall(goschedguarded_m)
 }
 
+// goschedIfBusy yields the processor like gosched, but only does so if
+// there are no idle Ps or if we're on the only P and there's nothing in
+// the run queue. In both cases, there is freely available idle time.
+//
+//go:nosplit
+func goschedIfBusy() {
+	if sched.npidle.Load() > 0 {
+		return
+	}
+	mcall(gosched_m)
+}
+
 // Puts the current goroutine into a waiting state and calls unlockf on the
 // system stack.
 //
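The batch-then-yield-if-busy pattern above can be illustrated outside the runtime. The following is a minimal, hypothetical user-level sketch of the same idea, not the runtime's implementation: yieldIfBusy and idleWorkers are made-up stand-ins, since user code cannot read sched.npidle or call gosched_m, and runtime.Gosched approximates the scheduler call.

package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
)

// idleWorkers is a hypothetical stand-in for the runtime's idle P count
// (sched.npidle), which user code cannot observe directly.
var idleWorkers atomic.Int32

// yieldIfBusy mirrors the shape of goschedIfBusy: skip the yield when spare
// idle capacity appears to exist, otherwise hand the processor back to the
// scheduler with runtime.Gosched.
func yieldIfBusy() {
	if idleWorkers.Load() > 0 {
		return
	}
	runtime.Gosched()
}

func main() {
	// Batch the work and only consider yielding once per batch, as in the
	// commit: this cuts scheduler round trips (and trace events) roughly by
	// the batch size compared to yielding after every unit of work.
	const batchSize = 10
	work := 1000 // arbitrary amount of background work for the example
	done := 0
	for work > 0 {
		work--
		done++
		if done%batchSize == 0 {
			yieldIfBusy()
		}
	}
	fmt.Println("processed", done, "units of background work")
}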