Allows removing fields that aren't relevant to a particular OS or
changing their types to match the underlying OS system calls they'll
be used for.
Change-Id: I5cea89ee77b4e7b985bff41337e561887c3272ff
Reviewed-on: https://go-review.googlesource.com/16176
Reviewed-by: Ian Lance Taylor <iant@golang.org>
Run-TryBot: Matthew Dempsky <mdempsky@google.com>
runtime/internal/sys will hold system-, architecture- and config-
specific constants.
Updates #11647
Change-Id: I6db29c312556087a42e8d2bdd9af40d157c56b54
Reviewed-on: https://go-review.googlesource.com/16817
Reviewed-by: Russ Cox <rsc@golang.org>
The tracing code is currently called from contexts such as sysmon and
the scheduler where write barriers are not allowed. Unfortunately,
while the common paths through the tracing code do not have write
barriers, many of the less common paths dealing with buffer overflow
and recycling do.
This change replaces all *traceBufs with traceBufPtrs. In the style of
guintptr, etc., the GC does not trace traceBufPtrs and write barriers
do not apply when these pointers are written. Since traceBufs are
allocated from non-GC'd memory and manually managed, this is always
safe.
Updates #10600.
Change-Id: I52b992d36d1b634ebd855c8cde27947ec14f59ba
Reviewed-on: https://go-review.googlesource.com/16812
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
This change breaks out most of the atomics functions in the runtime
into package runtime/internal/atomic. It adds some basic support
in the toolchain for runtime packages, and also modifies linux/arm
atomics to remove the dependency on the runtime's mutex. The mutexes
have been replaced with spinlocks.
all trybots are happy!
In addition to the trybots, I've tested on the darwin/arm64 builder,
on the darwin/arm builder, and on a ppc64le machine.
Change-Id: I6698c8e3cf3834f55ce5824059f44d00dc8e3c2f
Reviewed-on: https://go-review.googlesource.com/14204
Run-TryBot: Michael Matloob <matloob@golang.org>
Reviewed-by: Russ Cox <rsc@golang.org>
This eliminates many write barriers in the scheduler code that are
unnecessary and will interfere with upcoming changes where the garbage
collector will have to invoke run queue functions in contexts that
must not have write barriers.
Change-Id: I702d0ac99cfd00ffff406e7362917db6a43e7e55
Reviewed-on: https://go-review.googlesource.com/16556
Reviewed-by: Russ Cox <rsc@golang.org>
Run-TryBot: Austin Clements <austin@google.com>
Reduces the size of m by ~8% on linux/amd64 (1040 bytes -> 960 bytes).
There are also windows-specific fields, but they're currently
referenced in OS-independent source files (but only when
GOOS=="windows").
Change-Id: I13e1471ff585ccced1271f74209f8ed6df14c202
Reviewed-on: https://go-review.googlesource.com/16173
Run-TryBot: Matthew Dempsky <mdempsky@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
Change compiler-invoked interface functions to directly take
iface/eface parameters instead of fInterface/interface{} to avoid
needing to always convert.
For the handful of functions that legitimately need to take an
interface{} parameter, add efaceOf to type-safely convert *interface{}
to *eface.
Change-Id: I8928761a12fd3c771394f36adf93d3006a9fcf39
Reviewed-on: https://go-review.googlesource.com/16166
Run-TryBot: Matthew Dempsky <mdempsky@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
Instead of open-coding conversions from *string to unsafe.Pointer then
to *stringStruct, add a helper function to add some type safety.
Bonus: This caught two **string values being converted to
*stringStruct in heapdump.go.
While here, get rid of the redundant _string type, but add in a
stringStructDWARF type used for generating DWARF debug info.
Change-Id: I8882f8cca66ac45190270f82019a5d85db023bd2
Reviewed-on: https://go-review.googlesource.com/16131
Run-TryBot: Matthew Dempsky <mdempsky@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
When I saw that it was labelled "legacy", I went looking for users of it
to see how it was still used. But there aren't any. Save the next person
the trouble.
Change-Id: I921dd6c57b60331c9816542272555153ac133c02
Reviewed-on: https://go-review.googlesource.com/16035
Reviewed-by: Dave Cheney <dave@cheney.net>
Run-TryBot: Dave Cheney <dave@cheney.net>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Currently we track the per-G GC assist balance as two monotonically
increasing values: the bytes allocated by the G this cycle (gcalloc)
and the scan work performed by the G this cycle (gcscanwork). The
assist balance is hence assistRatio*gcalloc - gcscanwork.
This works, but has two important downsides:
1) It requires floating-point math to figure out if a G is in debt or
not. This makes it inappropriate to check for assist debt in the
hot path of mallocgc, so we only do this when a G allocates a new
span. As a result, Gs can operate "in the red", leading to
under-assist and extended GC cycle length.
2) Revising the assist ratio during a GC cycle can lead to an "assist
burst". If you think of plotting the scan work performed versus
heaps size, the assist ratio controls the slope of this line.
However, in the current system, the target line always passes
through 0 at the heap size that triggered GC, so if the runtime
increases the assist ratio, there has to be a potentially large
assist to jump from the current amount of scan work up to the new
target scan work for the current heap size.
This commit replaces this approach with directly tracking the GC
assist balance in terms of allocation credit bytes. Allocating N bytes
simply decreases this by N and assisting raises it by the amount of
scan work performed divided by the assist ratio (to get back to
bytes).
This will make it cheap to figure out if a G is in debt, which will
let us efficiently check if an assist is necessary *before* performing
an allocation and hence keep Gs "in the black".
This also fixes assist bursts because the assist ratio is now in terms
of *remaining* work, rather than work from the beginning of the GC
cycle. Hence, the plot of scan work versus heap size becomes
continuous: we can revise the slope, but this slope always starts from
where we are right now, rather than where we were at the beginning of
the cycle.
Change-Id: Ia821c5f07f8a433e8da7f195b52adfedd58bdf2c
Reviewed-on: https://go-review.googlesource.com/15408
Reviewed-by: Rick Hudson <rlh@golang.org>
Commit 0e6a6c5 removed readyExecute a long time ago, but left behind
the g.readyg field that was used by readyExecute. Remove this now
unused field.
Change-Id: I41b87ad2b427974d256ec7a7f6d4bdc2ce8a13bb
Reviewed-on: https://go-review.googlesource.com/13111
Reviewed-by: Rick Hudson <rlh@golang.org>
Right now we find out implicitly if stack barriers are in place,
or defers. This change makes sure we find out about short
unwinds always.
Change-Id: Ibdde1ba9c79eb792660dcb7aa6f186e4e4d559b3
Reviewed-on: https://go-review.googlesource.com/13966
Reviewed-by: Austin Clements <austin@google.com>
nmspinning has a value range of [0, 2^31-1]. Update the comment to
indicate this and fix the comparison so it's not always false.
Fixes#11280
Change-Id: Iedaf0654dcba5e2c800645f26b26a1a781ea1991
Reviewed-on: https://go-review.googlesource.com/13877
Reviewed-by: Minux Ma <minux@golang.org>
Also, crash early on non-Linux SMP ARM systems when GOARM < 7;
without the proper synchronization, SMP cannot work.
Linux is okay because we call kernel-provided routines for
synchronization and barriers, and the kernel takes care of
providing the right routines for the current system.
On non-Linux systems we are left to fend for ourselves.
It is possible to use different synchronization on GOARM=6,
but it's too late to do that in the Go 1.5 cycle.
We don't believe there are any non-Linux SMP GOARM=6 systems anyway.
Fixes#12067.
Change-Id: I771a556e47893ed540ec2cd33d23c06720157ea3
Reviewed-on: https://go-review.googlesource.com/13363
Reviewed-by: Austin Clements <austin@google.com>
Instead of pushing the denominator argument on the stack,
the denominator is now passed in m.
This fixes a variety of bugs related to trying to take stack traces
backwards from the middle of the software div/mod routines.
Some of those bugs have been kludged around in the past,
but others have not. Instead of trying to patch up after breaking
the stack, this CL stops breaking the stack.
This is an update of https://golang.org/cl/19810043,
which was rolled back in https://golang.org/cl/20350043.
The problem in the original CL was that there were divisions
at bad times, when m was not available. These were divisions
by constant denominators, either in C code or in assembly.
The Go compiler knows how to generate division by multiplication
for constant denominators, but the C compiler did not.
There is no longer any C code, so that's taken care of.
There was one problematic DIV in runtime.usleep (assembly)
but https://golang.org/cl/12898 took care of that one.
So now this approach is safe.
Reject DIV/MOD in NOSPLIT functions to keep them from
coming back.
Fixes#6681.
Fixes#6699.
Fixes#10486.
Change-Id: I09a13c76ad08ba75b3bd5d46a3eb78e66a84ab38
Reviewed-on: https://go-review.googlesource.com/12899
Reviewed-by: Ian Lance Taylor <iant@golang.org>
Nearly all the flaky failures we've seen in trace tests have been
due to the use of time stamps to determine relative event ordering.
This is tricky for many reasons, including:
- different cores might not have exactly synchronized clocks
- VMs are worse than real hardware
- non-x86 chips have different timer resolution than x86 chips
- on fast systems two events can end up with the same time stamp
Stop trying to make time reliable. It's clearly not going to be for Go 1.5.
Instead, record an explicit event sequence number for ordering.
Using our own counter solves all of the above problems.
The trace still contains time stamps, of course. The sequence number
is just used for ordering.
Should alleviate #10554 somewhat. Then tickDiv can be chosen to
be a useful time unit instead of having to be exact for ordering.
Separating ordering and time stamps lets the trace parser diagnose
systems where the time stamp order and actual order do not match
for one reason or another. This CL adds that check to the end of
trace.Parse, after all other sequence order-based checking.
If that error is found, we skip the test instead of failing it.
Putting the check in trace.Parse means that cmd/trace will pick
up the same check, refusing to display a trace where the time stamps
do not match actual ordering.
Using net/http's BenchmarkClientServerParallel4 on various CPU counts,
not tracing vs tracing:
name old time/op new time/op delta
ClientServerParallel4 50.4µs ± 4% 80.2µs ± 4% +59.06% (p=0.000 n=10+10)
ClientServerParallel4-2 33.1µs ± 7% 57.8µs ± 5% +74.53% (p=0.000 n=10+10)
ClientServerParallel4-4 18.5µs ± 4% 32.6µs ± 3% +75.77% (p=0.000 n=10+10)
ClientServerParallel4-6 12.9µs ± 5% 24.4µs ± 2% +89.33% (p=0.000 n=10+10)
ClientServerParallel4-8 11.4µs ± 6% 21.0µs ± 3% +83.40% (p=0.000 n=10+10)
ClientServerParallel4-12 14.4µs ± 4% 23.8µs ± 4% +65.67% (p=0.000 n=10+10)
Fixes#10512.
Change-Id: I173eecf8191e86feefd728a5aad25bf1bc094b12
Reviewed-on: https://go-review.googlesource.com/12579
Reviewed-by: Austin Clements <austin@google.com>
The following sequence of events can lead to the runtime attempting an
out-of-bounds access on a stack barrier slice:
1. A SIGPROF comes in on a thread while the G on that thread is in
_Gsyscall. The sigprof handler calls gentraceback, which saves a
local copy of the G's stkbar slice. Currently the G has no stack
barriers, so this slice is empty.
2. On another thread, the GC concurrently scans the stack of the
goroutine being profiled (it considers it stopped because it's in
_Gsyscall) and installs stack barriers.
3. Back on the sigprof thread, gentraceback comes across a stack
barrier in the stack and attempts to look it up in its (zero
length) copy of G's old stkbar slice, which causes an out-of-bounds
access.
This commit fixes this by adding a simple cas spin to synchronize the
SIGPROF handler with stack barrier insertion.
In general I would prefer that this synchronization be done through
the G status, since that's how stack scans are otherwise synchronized,
but adding a new lock is a much smaller change and G statuses are full
of subtlety.
Fixes#11863.
Change-Id: Ie89614a6238bb9c6a5b1190499b0b48ec759eaf7
Reviewed-on: https://go-review.googlesource.com/12748
Reviewed-by: Russ Cox <rsc@golang.org>
The one in misc/makerelease/makerelease.go is particularly bad and
probably warrants rotating our keys.
I didn't update old weekly notes, and reverted some changes involving
test code for now, since we're late in the Go 1.5 freeze. Otherwise,
the rest are all auto-generated changes, and all manually reviewed.
Change-Id: Ia2753576ab5d64826a167d259f48a2f50508792d
Reviewed-on: https://go-review.googlesource.com/12048
Reviewed-by: Rob Pike <r@golang.org>
Stack can move during callback, so libcall struct cannot be stored on stack.
asmstdcall updates return values and errno in libcall struct parameter, but
these could be at different location when callback returns.
Store these in m, so they are not affected by GC.
Fixes#10406
Change-Id: Id01c9d2b4b44530494e6d9e9e1c875261ce477cd
Reviewed-on: https://go-review.googlesource.com/10370
Reviewed-by: Russ Cox <rsc@golang.org>
There were two issues.
1. Delayed EvGoSysExit could have been emitted during TraceStart,
while it had not yet emitted EvGoInSyscall.
2. Delayed EvGoSysExit could have been emitted during next tracing session.
Fixes#10476Fixes#11262
Change-Id: Iab68eb31cf38eb6eb6eee427f49c5ca0865a8c64
Reviewed-on: https://go-review.googlesource.com/9132
Reviewed-by: Russ Cox <rsc@golang.org>
This fixes a hang during runtime.TestTraceStress.
It also fixes double-scan of stacks, which leads to
stack barrier installation failures.
Both of these have shown up as flaky failures on the dashboard.
Fixes#10941.
Change-Id: Ia2a5991ce2c9f43ba06ae1c7032f7c898dc990e0
Reviewed-on: https://go-review.googlesource.com/11089
Reviewed-by: Austin Clements <austin@google.com>
These were found by grepping the comments from the go code and feeding
the output to aspell.
Change-Id: Id734d6c8d1938ec3c36bd94a4dbbad577e3ad395
Reviewed-on: https://go-review.googlesource.com/10941
Reviewed-by: Aamir Khan <syst3m.w0rm@gmail.com>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
The stack barrier code will need a bookkeeping structure to keep track
of the overwritten return PCs. This commit introduces and allocates
this structure, but does not yet use the structure.
We don't want to allocate space for this structure during garbage
collection, so this commit allocates it along with the allocation of
the corresponding stack. However, we can't do a regular allocation in
newstack because mallocgc may itself grow the stack (which would lead
to a recursive allocation). Hence, this commit makes the bookkeeping
structure part of the stack allocation itself by stealing the
necessary space from the top of the stack allocation. Since the size
of this bookkeeping structure is logarithmic in the size of the stack,
this has minimal impact on stack behavior.
Change-Id: Ia14408be06aafa9ca4867f4e70bddb3fe0e96665
Reviewed-on: https://go-review.googlesource.com/10313
Reviewed-by: Russ Cox <rsc@golang.org>
Currently the runtime assumes that the allocation for the stack is
exactly [stack.lo, stack.hi). We're about to steal a small part of
this allocation for per-stack GC metadata. To prepare for this, this
commit adds a field to the G for the allocated size of the stack.
With this change, stack.lo and stack.hi continue to act as the true
bounds on the stack, but are no longer also used as the bounds on the
stack allocation.
(I also tried this the other way around, where stack.lo and stack.hi
remained the allocation bounds and I introduced a new top of stack.
However, there are far more places that assume stack.hi is the true
top of the stack than there are places that assume it's the top of the
allocation.)
Change-Id: Ifa9d956753be53d286d09cbc73d47fb34a18c0c6
Reviewed-on: https://go-review.googlesource.com/10312
Reviewed-by: Russ Cox <rsc@golang.org>
Ian proposed an improved way of handling signals masks in Go, motivated
by a problem where the Android java runtime expects certain signals to
be blocked for all JVM threads. Discussion here
https://groups.google.com/forum/#!topic/golang-dev/_TSCkQHJt6g
Ian's text is used in the following:
A Go program always needs to have the synchronous signals enabled.
These are the signals for which _SigPanic is set in sigtable, namely
SIGSEGV, SIGBUS, SIGFPE.
A Go program that uses the os/signal package, and calls signal.Notify,
needs to have at least one thread which is not blocking that signal,
but it doesn't matter much which one.
Unix programs do not change signal mask across execve. They inherit
signal masks across fork. The shell uses this fact to some extent;
for example, the job control signals (SIGTTIN, SIGTTOU, SIGTSTP) are
blocked for commands run due to backquote quoting or $().
Our current position on signal masks was not thought out. We wandered
into step by step, e.g., http://golang.org/cl/7323067 .
This CL does the following:
Introduce a new platform hook, msigsave, that saves the signal mask of
the current thread to m.sigsave.
Call msigsave from needm and newm.
In minit grab set up the signal mask from m.sigsave and unblock the
essential synchronous signals, and SIGILL, SIGTRAP, SIGPROF, SIGSTKFLT
(for systems that have it).
In unminit, restore the signal mask from m.sigsave.
The first time that os/signal.Notify is called, start a new thread whose
only purpose is to update its signal mask to make sure signals for
signal.Notify are unblocked on at least one thread.
The effect on Go programs will be that if they are invoked with some
non-synchronous signals blocked, those signals will normally be
ignored. Previously, those signals would mostly be ignored. A change
in behaviour will occur for programs started with any of these signals
blocked, if they receive the signal: SIGHUP, SIGINT, SIGQUIT, SIGABRT,
SIGTERM. Previously those signals would always cause a crash (unless
using the os/signal package); with this change, they will be ignored
if the program is started with the signal blocked (and does not use
the os/signal package).
./all.bash completes successfully on linux/amd64.
OpenBSD is missing the implementation.
Change-Id: I188098ba7eb85eae4c14861269cc466f2aa40e8c
Reviewed-on: https://go-review.googlesource.com/10173
Reviewed-by: Ian Lance Taylor <iant@golang.org>
Currently, forEachP reuses the stopwait and stopnote fields from
stopTheWorld to track how many Ps have not responded to the safe-point
request and to sleep until all Ps have responded.
It was assumed this was safe because both stopTheWorld and forEachP
must occur under the worlsema and hence stopwait and stopnote cannot
be used for both purposes simultaneously and callers could always
determine the appropriate use based on sched.gcwaiting (which is only
set by stopTheWorld). However, this is not the case, since it's
possible for there to be a window between when an M observes that
gcwaiting is set and when it checks stopwait during which stopwait
could have changed meanings. When this happens, the M decrements
stopwait and may wakeup stopnote, but does not otherwise participate
in the forEachP protocol. As a result, stopwait is decremented too
many times, so it may reach zero before all Ps have run the safe-point
function, causing forEachP to wake up early. It will then either
observe that some P has not run the safe-point function and panic with
"P did not run fn", or the remaining P (or Ps) will run the safe-point
function before it wakes up and it will observe that stopwait is
negative and panic with "not stopped".
Fix this problem by giving forEachP its own safePointWait and
safePointNote fields.
One known sequence of events that can cause this race is as
follows. It involves three actors:
G1 is running on M1 on P1. P1 has an empty run queue.
G2/M2 is in a blocked syscall and has lost its P. (The details of this
don't matter, it just needs to be in a position where it needs to grab
an idle P.)
GC just started on G3/M3/P3. (These aren't very involved, they just
have to be separate from the other G's, M's, and P's.)
1. GC calls stopTheWorld(), which sets sched.gcwaiting to 1.
Now G1/M1 begins to enter a syscall:
2. G1/M1 invokes reentersyscall, which sets the P1's status to
_Psyscall.
3. G1/M1's reentersyscall observes gcwaiting != 0 and calls
entersyscall_gcwait.
4. G1/M1's entersyscall_gcwait blocks acquiring sched.lock.
Back on GC:
5. stopTheWorld cas's P1's status to _Pgcstop, does other stuff, and
returns.
6. GC does stuff and then calls startTheWorld().
7. startTheWorld() calls procresize(), which sets P1's status to
_Pidle and puts P1 on the idle list.
Now G2/M2 returns from its syscall and takes over P1:
8. G2/M2 returns from its blocked syscall and gets P1 from the idle
list.
9. G2/M2 acquires P1, which sets P1's status to _Prunning.
10. G2/M2 starts a new syscall and invokes reentersyscall, which sets
P1's status to _Psyscall.
Back on G1/M1:
11. G1/M1 finally acquires sched.lock in entersyscall_gcwait.
At this point, G1/M1 still thinks it's running on P1. P1's status is
_Psyscall, which is consistent with what G1/M1 is doing, but it's
_Psyscall because *G2/M2* put it in to _Psyscall, not G1/M1. This is
basically an ABA race on P1's status.
Because forEachP currently shares stopwait with stopTheWorld. G1/M1's
entersyscall_gcwait observes the non-zero stopwait set by forEachP,
but mistakes it for a stopTheWorld. It cas's P1's status from
_Psyscall (set by G2/M2) to _Pgcstop and proceeds to decrement
stopwait one more time than forEachP was expecting.
Fixes#10618. (See the issue for details on why the above race is safe
when forEachP is not involved.)
Prior to this commit, the command
stress ./runtime.test -test.run TestFutexsleep\|TestGoroutineProfile
would reliably fail after a few hundred runs. With this commit, it
ran for over 2 million runs and never crashed.
Change-Id: I9a91ea20035b34b6e5f07ef135b144115f281f30
Reviewed-on: https://go-review.googlesource.com/10157
Reviewed-by: Russ Cox <rsc@golang.org>
Currently, runqsteal steals Gs from another P into an intermediate
buffer and then copies those Gs into the current P's run queue. This
intermediate buffer itself was moved from the stack to the P in commit
c4fe503 to eliminate the cost of zeroing it on every steal.
This commit follows up c4fe503 by stealing directly into the current
P's run queue, which eliminates the copy and the need for the
intermediate buffer. The update to the tail pointer is only committed
once the entire steal operation has succeeded, so the semantics of
stealing do not change.
Change-Id: Icdd7a0eb82668980bf42c0154b51eef6419fdd51
Reviewed-on: https://go-review.googlesource.com/9998
Reviewed-by: Russ Cox <rsc@golang.org>
Run-TryBot: Austin Clements <austin@google.com>
One important use case is a pipeline computation that pass values
from one Goroutine to the next and then exits or is placed in a
wait state. If GOMAXPROCS > 1 a Goroutine running on P1 will enable
another Goroutine and then immediately make P1 available to execute
it. We need to prevent other Ps from stealing the G that P1 is about
to execute. Otherwise the Gs can thrash between Ps causing unneeded
synchronization and slowing down throughput.
Fix this by changing the stealing logic so that when a P attempts to
steal the only G on some other P's run queue, it will pause
momentarily to allow the victim P to schedule the G.
As part of optimizing stealing we also use a per P victim queue
move stolen gs. This eliminates the zeroing of a stack local victim
queue which turned out to be expensive.
This CL is a necessary but not sufficient prerequisite to changing
the default value of GOMAXPROCS to something > 1 which is another
CL/discussion.
For highly serialized programs, such as GoroutineRing below this can
make a large difference. For larger and more parallel programs such
as the x/benchmarks there is no noticeable detriment.
~/work/code/src/rsc.io/benchstat/benchstat old.txt new.txt
name old mean new mean delta
GoroutineRing 30.2µs × (0.98,1.01) 30.1µs × (0.97,1.04) ~ (p=0.941)
GoroutineRing-2 113µs × (0.91,1.07) 30µs × (0.98,1.03) -73.17% (p=0.004)
GoroutineRing-4 144µs × (0.98,1.02) 32µs × (0.98,1.01) -77.69% (p=0.000)
GoroutineRingBuf 32.7µs × (0.97,1.03) 32.5µs × (0.97,1.02) ~ (p=0.795)
GoroutineRingBuf-2 120µs × (0.92,1.08) 33µs × (1.00,1.00) -72.48% (p=0.004)
GoroutineRingBuf-4 138µs × (0.92,1.06) 33µs × (1.00,1.00) -76.21% (p=0.003)
The bench benchmarks show little impact.
old new
garbage 7032879 7011696
httpold 25509 25301
splayold 1022073 1019499
jsonold 28230624 28081433
Change-Id: I228c48fed8d85c9bbef16a7edc53ab7898506f50
Reviewed-on: https://go-review.googlesource.com/9872
Reviewed-by: Austin Clements <austin@google.com>
Since we now have stack information for code running on the
systemstack, we can traceback over it. To make cpu profiles useful,
add a case in gentraceback to jump over systemstack switches.
Fixes#10609.
Change-Id: I21f47fcc802c07c5d4a1ada56374314e388a6dc7
Reviewed-on: https://go-review.googlesource.com/9506
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Makes searching in source code easier.
Change-Id: Ie2e85934d23920ac0bc01d28168bcfbbdc465580
Reviewed-on: https://go-review.googlesource.com/9774
Reviewed-by: Daniel Morsing <daniel.morsing@gmail.com>
Reviewed-by: Minux Ma <minux@golang.org>
Currently, we use a full stop-the-world around enabling write
barriers. This is to ensure that all Gs have enabled write barriers
before any blackening occurs (either in gcBgMarkWorker() or in
gcAssistAlloc()).
However, there's no need to bring the whole world to a synchronous
stop to ensure this. This change replaces the STW with a ragged
barrier that ensures each P has individually observed that write
barriers should be enabled before GC performs any blackening.
Change-Id: If2f129a6a55bd8bdd4308067af2b739f3fb41955
Reviewed-on: https://go-review.googlesource.com/8207
Reviewed-by: Russ Cox <rsc@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
This adds forEachP, which performs a general-purpose ragged global
barrier. forEachP takes a callback and invokes it for every P at a GC
safe point.
Ps that are idle or in a syscall are considered to be at a continuous
safe point. forEachP ensures that these Ps do not change state by
forcing all syscall Ps into idle and holding the sched.lock.
To ensure that Ps do not enter syscall or idle without running the
safe-point function, this adds checks for a pending callback every
place there is currently a gcwaiting check.
We'll use forEachP to replace the STW around enabling the write
barrier and to replace the current asynchronous per-M wbuf cache with
a cooperatively managed per-P gcWork cache.
Change-Id: Ie944f8ce1fead7c79bf271d2f42fcd61a41bb3cc
Reviewed-on: https://go-review.googlesource.com/8206
Reviewed-by: Russ Cox <rsc@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
Currently, each M has a cache of the most recently used *workbuf. This
is used primarily by the write barrier so it doesn't have to access
the global workbuf lists on every write barrier. It's also used by
stack scanning because it's convenient.
This cache is important for write barrier performance, but this
particular approach has several downsides. It's faster than no cache,
but far from optimal (as the benchmarks below show). It's complex:
access to the cache is sprinkled through most of the workbuf list
operations and it requires special care to transform into and back out
of the gcWork cache that's actually used for scanning and marking. It
requires atomic exchanges to take ownership of the cached workbuf and
to return it to the M's cache even though it's almost always used by
only the current M. Since it's per-M, flushing these caches is O(# of
Ms), which may be high. And it has some significant subtleties: for
example, in general the cache shouldn't be used after the
harvestwbufs() in mark termination because it could hide work from
mark termination, but stack scanning can happen after this and *will*
use the cache (but it turns out this is okay because it will always be
followed by a getfull(), which drains the cache).
This change replaces this cache with a per-P gcWork object. This
gcWork cache can be used directly by scanning and marking (as long as
preemption is disabled, which is a general requirement of gcWork).
Since it's per-P, it doesn't require synchronization, which simplifies
things and means the only atomic operations in the write barrier are
occasionally fetching new work buffers and setting a mark bit if the
object isn't already marked. This cache can be flushed in O(# of Ps),
which is generally small. It follows a simple flushing rule: the cache
can be used during any phase, but during mark termination it must be
flushed before allowing preemption. This also makes the dispose during
mutator assist no longer necessary, which eliminates the vast majority
of gcWork dispose calls and reduces contention on the global workbuf
lists. And it's a lot faster on some benchmarks:
benchmark old ns/op new ns/op delta
BenchmarkBinaryTree17 11963668673 11206112763 -6.33%
BenchmarkFannkuch11 2643217136 2649182499 +0.23%
BenchmarkFmtFprintfEmpty 70.4 70.2 -0.28%
BenchmarkFmtFprintfString 364 307 -15.66%
BenchmarkFmtFprintfInt 317 282 -11.04%
BenchmarkFmtFprintfIntInt 512 483 -5.66%
BenchmarkFmtFprintfPrefixedInt 404 380 -5.94%
BenchmarkFmtFprintfFloat 521 479 -8.06%
BenchmarkFmtManyArgs 2164 1894 -12.48%
BenchmarkGobDecode 30366146 22429593 -26.14%
BenchmarkGobEncode 29867472 26663152 -10.73%
BenchmarkGzip 391236616 396779490 +1.42%
BenchmarkGunzip 96639491 96297024 -0.35%
BenchmarkHTTPClientServer 100110 70763 -29.31%
BenchmarkJSONEncode 51866051 52511382 +1.24%
BenchmarkJSONDecode 103813138 86094963 -17.07%
BenchmarkMandelbrot200 4121834 4120886 -0.02%
BenchmarkGoParse 16472789 5879949 -64.31%
BenchmarkRegexpMatchEasy0_32 140 140 +0.00%
BenchmarkRegexpMatchEasy0_1K 394 394 +0.00%
BenchmarkRegexpMatchEasy1_32 120 120 +0.00%
BenchmarkRegexpMatchEasy1_1K 621 614 -1.13%
BenchmarkRegexpMatchMedium_32 209 202 -3.35%
BenchmarkRegexpMatchMedium_1K 54889 55175 +0.52%
BenchmarkRegexpMatchHard_32 2682 2675 -0.26%
BenchmarkRegexpMatchHard_1K 79383 79524 +0.18%
BenchmarkRevcomp 584116718 584595320 +0.08%
BenchmarkTemplate 125400565 109620196 -12.58%
BenchmarkTimeParse 386 387 +0.26%
BenchmarkTimeFormat 580 447 -22.93%
(Best out of 10 runs. The delta of averages is similar.)
This also puts us in a good position to flush these caches when
nearing the end of concurrent marking, which will let us increase the
size of the work buffers while still controlling mark termination
pause time.
Change-Id: I2dd94c8517a19297a98ec280203cccaa58792522
Reviewed-on: https://go-review.googlesource.com/9178
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Russ Cox <rsc@golang.org>
Currently, when the runtime ready()s a G, it adds it to the end of the
current P's run queue and continues running. If there are many other
things in the run queue, this can result in a significant delay before
the ready()d G actually runs and can hurt fairness when other Gs in
the run queue are CPU hogs. For example, if there are three Gs sharing
a P, one of which is a CPU hog that never voluntarily gives up the P
and the other two of which are doing small amounts of work and
communicating back and forth on an unbuffered channel, the two
communicating Gs will get very little CPU time.
Change this so that when G1 ready()s G2 and then blocks, the scheduler
immediately hands off the remainder of G1's time slice to G2. In the
above example, the two communicating Gs will now act as a unit and
together get half of the CPU time, while the CPU hog gets the other
half of the CPU time.
This fixes the problem demonstrated by the ping-pong benchmark added
in the previous commit:
benchmark old ns/op new ns/op delta
BenchmarkPingPongHog 684287 825 -99.88%
On the x/benchmarks suite, this change improves the performance of
garbage by ~6% (for GOMAXPROCS=1 and 4), and json by 28% and 36% for
GOMAXPROCS=1 and 4. It has negligible effect on heap size.
This has no effect on the go1 benchmark suite since those benchmarks
are mostly single-threaded.
Change-Id: I858a08eaa78f702ea98a5fac99d28a4ac91d339f
Reviewed-on: https://go-review.googlesource.com/9289
Reviewed-by: Rick Hudson <rlh@golang.org>
Reviewed-by: Russ Cox <rsc@golang.org>
Currently, in accordance with the GC pacing proposal, we schedule
background marking with a goal of achieving 25% utilization *total*
between mutator assists and background marking. This is stricter than
was set out in the Go 1.5 proposal, which suggests that the garbage
collector can use 25% just for itself and anything the mutator does to
help out is on top of that. It also has several technical
drawbacks. Because mutator assist time is constantly changing and we
can't have instantaneous information on background marking time, it
effectively requires hitting a moving target based on out-of-date
information. This works out in the long run, but works poorly for
short GC cycles and on short time scales. Also, this requires
time-multiplexing all Ps between the mutator and background GC since
the goal utilization of background GC constantly fluctuates. This
results in a complicated scheduling algorithm, poor affinity, and
extra overheads from context switching.
This change modifies the way we schedule and run background marking so
that background marking always consumes 25% of GOMAXPROCS and mutator
assist is in addition to this. This enables a much more robust
scheduling algorithm where we pre-determine the number of Ps we should
dedicate to background marking as well as the utilization goal for a
single floating "remainder" mark worker.
Change-Id: I187fa4c03ab6fe78012a84d95975167299eb9168
Reviewed-on: https://go-review.googlesource.com/9013
Reviewed-by: Rick Hudson <rlh@golang.org>
Currently, the concurrent mark phase is performed by the main GC
goroutine. Prior to the previous commit enabling preemption, this
caused marking to always consume 1/GOMAXPROCS of the available CPU
time. If GOMAXPROCS=1, this meant background GC would consume 100% of
the CPU (effectively a STW). If GOMAXPROCS>4, background GC would use
less than the goal of 25%. If GOMAXPROCS=4, background GC would use
the goal 25%, but if the mutator wasn't using the remaining 75%,
background marking wouldn't take advantage of the idle time. Enabling
preemption in the previous commit made GC miss CPU targets in
completely different ways, but set us up to bring everything back in
line.
This change replaces the fixed GC goroutine with per-P background mark
goroutines. Once started, these goroutines don't go in the standard
run queues; instead, they are scheduled specially such that the time
spent in mutator assists and the background mark goroutines totals 25%
of the CPU time available to the program. Furthermore, this lets
background marking take advantage of idle Ps, which significantly
boosts GC performance for applications that under-utilize the CPU.
This requires also changing how time is reported for gctrace, so this
change splits the concurrent mark CPU time into assist/background/idle
scanning.
This also requires increasing the size of the StackRecord slice used
in a GoroutineProfile test.
Change-Id: I0936ff907d2cee6cb687a208f2df47e8988e3157
Reviewed-on: https://go-review.googlesource.com/8850
Reviewed-by: Rick Hudson <rlh@golang.org>
This time is tracked per P and periodically flushed to the global
controller state. This will be used to compute mutator assist
utilization in order to schedule background GC work.
Change-Id: Ib94f90903d426a02cf488bf0e2ef67a068eb3eec
Reviewed-on: https://go-review.googlesource.com/8837
Reviewed-by: Rick Hudson <rlh@golang.org>
Currently, mutator allocation periodically assists the garbage
collector by performing a small, fixed amount of scanning work.
However, to control heap growth, mutators need to perform scanning
work *proportional* to their allocation rate.
This change implements proportional mutator assists. This uses the
scan work estimate computed by the garbage collector at the beginning
of each cycle to compute how much scan work must be performed per
allocation byte to complete the estimated scan work by the time the
heap reaches the goal size. When allocation triggers an assist, it
uses this ratio and the amount allocated since the last assist to
compute the assist work, then attempts to steal as much of this work
as possible from the background collector's credit, and then performs
any remaining scan work itself.
Change-Id: I98b2078147a60d01d6228b99afd414ef857e4fba
Reviewed-on: https://go-review.googlesource.com/8836
Reviewed-by: Rick Hudson <rlh@golang.org>
This CL revises CL 7504 to use explicitly uintptr types for the
struct fields that are going to be updated sometimes without
write barriers. The result is that the fields are now updated *always*
without write barriers.
This approach has two important properties:
1) Now the GC never looks at the field, so if the missing reference
could cause a problem, it will do so all the time, not just when the
write barrier is missed at just the right moment.
2) Now a write barrier never happens for the field, avoiding the
(correct) detection of inconsistent write barriers when GODEBUG=wbshadow=1.
Change-Id: Iebd3962c727c0046495cc08914a8dc0808460e0e
Reviewed-on: https://go-review.googlesource.com/9019
Reviewed-by: Austin Clements <austin@google.com>
Run-TryBot: Russ Cox <rsc@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
readyExecute passes a closure to mcall that captures an argument to
readyExecute. Since mcall is marked noescape, this closure lives on
the stack of the calling goroutine. However, the closure puts the
calling goroutine on the run queue (and switches to a new
goroutine). If the calling goroutine gets scheduled before the mcall
returns, this stack-allocated closure will become invalid while it's
still executing. One consequence of this we've observed is that the
captured gp variable can get overwritten before the call to
execute(gp), causing execute(gp) to segfault.
Fix this by passing the currently captured gp variable through a field
in the calling goroutine's g struct so that the func is no longer a
closure.
To prevent problems like this in the future, this change also removes
the go:noescape annotation from mcall. Due to a compiler bug, this
will currently cause a func closure passed to mcall to be implicitly
allocated rather than refusing the implicit allocation. However, this
is okay because there are no other closures passed to mcall right now
and the compiler bug will be fixed shortly.
Fixes#10428.
Change-Id: I49b48b85de5643323b89e9eaa4df63854e968c32
Reviewed-on: https://go-review.googlesource.com/8866
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Russ Cox <rsc@golang.org>
This memory is untyped and can't be used anymore.
The next version of SWIG won't need it.
Change-Id: I592b287c5f5186975ee09a9b28d8efe3b57134e7
Reviewed-on: https://go-review.googlesource.com/8956
Reviewed-by: Ian Lance Taylor <iant@golang.org>
By removing type slice, renaming type sliceStruct to type slice and
whacking until it compiles.
Has a pleasing net reduction of conversions.
Fixes#10188
Change-Id: I77202b8df637185b632fd7875a1fdd8d52c7a83c
Reviewed-on: https://go-review.googlesource.com/8770
Reviewed-by: Ian Lance Taylor <iant@golang.org>
Run-TryBot: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
According to Go execution modes, a Go program compiled with
-buildmode=c-archive has a main function, but it is ignored on run.
This gives the runtime the information it needs not to run the main.
I have this working with pending linker changes on darwin/amd64.
Change-Id: I49bd7d65aa619ec847c464a872afa5deea7d4d30
Reviewed-on: https://go-review.googlesource.com/8701
Reviewed-by: Ian Lance Taylor <iant@golang.org>
Run-TryBot: David Crawshaw <crawshaw@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
This is Part 2 of the change, see Part 1 here: in https://go-review.googlesource.com/#/c/7692/
Suggested by iant@, we use the library initialization entry point to:
- create a new OS thread and run the "regular" runtime init stack on
that thread
- return immediately from the main (i.e., loader) thread
- at the first CGO invocation, we wait for the runtime initialization
to complete.
The above mechanism is implemented only on linux_amd64. Next step is to
support it on linux_arm. Other platforms don't yet support shared library
compiling/linking, but we intend to use the same strategy there as well.
Change-Id: Ib2c81b1b83bee837134084b75a3beecfb8de6bf4
Reviewed-on: https://go-review.googlesource.com/8094
Run-TryBot: Srdjan Petrovic <spetrovic@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
This tracks both total CPU time used by GC and the total time
available to all Ps since the beginning of the program and uses this
to derive a cumulative CPU usage percent for the gctrace line.
Change-Id: Ica85372b8dd45f7621909b325d5ac713a9b0d015
Reviewed-on: https://go-review.googlesource.com/8350
Reviewed-by: Russ Cox <rsc@golang.org>
Also invert it, which means it no longer needs to cross the cgo
package boundary.
Change-Id: I393cd073bda02b591a55d6bc6b8bb94970ea71cd
Reviewed-on: https://go-review.googlesource.com/8082
Reviewed-by: Ian Lance Taylor <iant@golang.org>
Run-TryBot: David Crawshaw <crawshaw@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>