File size: 46,966 Bytes

e36aeda

// Copyright 2023 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.

// # Go execution tracer
//
// The tracer captures a wide range of execution events like goroutine
// creation/blocking/unblocking, syscall enter/exit/block, GC-related events,
// changes of heap size, processor start/stop, etc and writes them to a buffer
// in a compact form. A precise nanosecond-precision timestamp and a stack
// trace is captured for most events.
//
// ## Design
//
// The basic idea behind the execution tracer is to have per-M buffers that
// trace data may be written into. Each M maintains a write flag indicating whether
// its trace buffer is currently in use.
//
// Tracing is initiated by StartTrace, and proceeds in "generations," with each
// generation being marked by a call to traceAdvance, to advance to the next
// generation. Generations are a global synchronization point for trace data,
// and we proceed to a new generation by moving forward trace.gen. Each M reads
// trace.gen under its own write flag to determine which generation it is writing
// trace data for. To this end, each M has 2 slots for buffers: one slot for the
// previous generation, one slot for the current one. It uses tl.gen to select
// which buffer slot to write to. Simultaneously, traceAdvance uses the write flag
// to determine whether every thread is guaranteed to observe an updated
// trace.gen. Once it is sure, it may then flush any buffers that are left over
// from the previous generation safely, since it knows the Ms will not mutate
// it.
//
// Flushed buffers are processed by the ReadTrace function, which is called by
// the trace reader goroutine. The first goroutine to call ReadTrace is designated
// as the trace reader goroutine until tracing completes. (There may only be one at
// a time.)
//
// Once all buffers are flushed, any extra post-processing complete, and flushed
// buffers are processed by the trace reader goroutine, the trace emits an
// EndOfGeneration event to mark the global synchronization point in the trace.
//
// All other trace features, including CPU profile samples, stack information,
// string tables, etc. all revolve around this generation system, and typically
// appear in pairs: one for the previous generation, and one for the current one.
// Like the per-M buffers, which of the two is written to is selected using trace.gen,
// and anything managed this way must similarly be mutated only in traceAdvance or
// under the M's write flag.
//
// Trace events themselves are simple. They consist of a single byte for the event type,
// followed by zero or more LEB128-encoded unsigned varints. They are decoded using
// a pre-determined table for each trace version: internal/trace/tracev2.specs.
//
// To avoid relying on timestamps for correctness and validation, each G and P have
// sequence counters that are written into trace events to encode a partial order.
// The sequence counters reset on each generation. Ms do not need sequence counters
// because they are the source of truth for execution: trace events, and even whole
// buffers, are guaranteed to appear in order in the trace data stream, simply because
// that's the order the thread emitted them in.
//
// See traceruntime.go for the API the tracer exposes to the runtime for emitting events.
//
// In each generation, we ensure that we enumerate all goroutines, such that each
// generation's data is fully self-contained. This makes features like the flight
// recorder easy to implement. To this end, we guarantee that every live goroutine is
// listed at least once by emitting a status event for the goroutine, indicating its
// starting state. These status events are emitted based on context, generally based
// on the event that's about to be emitted.
//
// The traceEventWriter type encapsulates these details, and is the backbone of
// the API exposed in traceruntime.go, though there are deviations where necessary.
//
// This is the overall design, but as always, there are many details. Beyond this,
// look to the invariants and select corner cases below and the code itself for the
// source of truth.
//
// See https://go.dev/issue/60773 for a link to a more complete design with rationale,
// though parts of it are out-of-date.
//
// ## Invariants
//
// 1. An m that has a trace buffer MUST be on either the allm or sched.freem lists.
//
// Otherwise, traceAdvance might miss an M with a buffer that needs to be flushed.
//
// 2. Trace buffers MUST only be mutated in traceAdvance or under a traceAcquire/traceRelease.
//
// Otherwise, traceAdvance may race with Ms writing trace data when trying to flush buffers.
//
// 3. traceAdvance MUST NOT return until all of the current generation's buffers are flushed.
//
// Otherwise, callers cannot rely on all the data they need being available (for example, for
// the flight recorder).
//
// 4. P and goroutine state transition events MUST be emitted by an M that owns its ability
//    to transition.
//
// What this means is that the M must either be the owner of the P, the owner of the goroutine,
// or owner of a non-running goroutine's _Gscan bit. There are a lot of bad things that can
// happen if this invariant isn't maintained, mostly around generating inconsistencies in the
// trace due to racy emission of events.
//
// 5. Acquisition of a P (pidleget or takeP/gcstopP) MUST NOT be performed under a traceAcquire/traceRelease pair.
//
// Notably, it's important that traceAcquire/traceRelease not cover a state in which the
// goroutine or P is not yet owned. For example, if traceAcquire is held across both wirep and
// pidleget, then we could end up emitting an event in the wrong generation. Suppose T1
// traceAcquires in generation 1, a generation transition happens, T2 emits a ProcStop and
// executes pidleput in generation 2, and finally T1 calls pidleget and emits ProcStart.
// The ProcStart must follow the ProcStop in the trace to make any sense, but ProcStop was
// emitted in a latter generation.
//
// 6. Goroutine state transitions, with the exception of transitions into _Grunning, MUST be
//    performed under the traceAcquire/traceRelease pair where the event is emitted.
//
// Otherwise, traceAdvance may observe a goroutine state that is inconsistent with the
// events being emitted. traceAdvance inspects all goroutines' states in order to emit
// a status event for any goroutine that did not have an event emitted for it already.
// If the generation then advances in between that observation and the event being emitted,
// then the trace will contain a status that doesn't line up with the event. For example,
// if the event is emitted after the state transition _Gwaiting -> _Grunnable, then
// traceAdvance may observe the goroutine in _Grunnable, emit a status event, advance the
// generation, and the following generation contains a GoUnblock event. The trace parser
// will get confused because it sees that goroutine in _Grunnable in the previous generation
// trying to be transitioned from _Gwaiting into _Grunnable in the following one. Something
// similar happens if the trace event is emitted before the state transition, so that does
// not help either.
//
// Transitions to _Grunning do not have the same problem because traceAdvance is unable to
// observe running goroutines directly. It must stop them, or wait for them to emit an event.
// Note that it cannot even stop them with asynchronous preemption in any "bad" window between
// the state transition to _Grunning and the event emission because async preemption cannot
// stop goroutines in the runtime.
//
// 7. Goroutine state transitions into _Grunning MUST emit an event for the transition after
//    the state transition.
//
// This follows from invariants (4), (5), and the explanation of (6).
// The relevant part of the previous invariant is that in order for the tracer to be unable to
// stop a goroutine, it must be in _Grunning and in the runtime. So to close any windows between
// event emission and the state transition, the event emission must happen *after* the transition
// to _Grunning.
//
// ## Select corner cases
//
// ### CGO calls / system calls
//
// CGO calls and system calls are mostly straightforward, except for P stealing. For historical
// reasons, this introduces a new trace-level P state called ProcSyscall which used to model
// _Psyscall (now _Psyscall_unused). This state is used to indicate in the trace that a P
// is eligible for stealing as part of the parser's ordering logic.
//
// Another quirk of this corner case is the ProcSyscallAbandoned trace-level P state, which
// is used only in status events to indicate a relaxation of verification requirements. It
// means that if the execution trace parser can't find the corresponding thread that the P
// was stolen from in the state it expects it to be, to accept the trace anyway. This is also
// historical. When _Psyscall still existed, one would steal and then ProcSteal, and there
// was no ordering between the ProcSteal and the subsequent GoSyscallEndBlocked. One clearly
// happened before the other, but since P stealing was a single atomic, there was no way
// to enforce the order. The GoSyscallEndBlocked thread could move on and end up in any
// state, and the GoSyscallEndBlocked could be in a completely different generation to the
// ProcSteal. Today this is no longer possible as the ProcSteal is always ordered before
// the GoSyscallEndBlocked event in the runtime.
//
// Both ProcSyscall and ProcSyscallAbandoned are likely no longer be necessary.
//
// ### CGO callbacks
//
// When a C thread calls into Go, the execution tracer models that as the creation of a new
// goroutine. When the thread exits back into C, that is modeled as the destruction of that
// goroutine. These are the GoCreateSyscall and GoDestroySyscall events, which represent the
// creation and destruction of a goroutine with its starting and ending states being _Gsyscall.
//
// This model is simple to reason about but contradicts the runtime implementation, which
// doesn't do this directly for performance reasons. The runtime implementation instead caches
// a G on the M created for the C thread. On Linux this M is then cached in the thread's TLS,
// and on other systems, the M is put on a global list on exit from Go. We need to do some
// extra work to make sure that this is modeled correctly in the tracer. For example,
// a C thread exiting Go may leave a P hanging off of its M (whether that M is kept in TLS
// or placed back on a list). In order to correctly model goroutine creation and destruction,
// we must behave as if the P was at some point stolen by the runtime, if the C thread
// reenters Go with the same M (and thus, same P) once more.

package runtime

import (
	"internal/runtime/atomic"
	"internal/trace/tracev2"
	"unsafe"
)

// Trace state.

// trace is global tracing context.
var trace struct {
	// trace.lock must only be acquired on the system stack where
	// stack splits cannot happen while it is held.
	lock mutex

	// Trace buffer management.
	//
	// First we check the empty list for any free buffers. If not, buffers
	// are allocated directly from the OS. Once they're filled up and/or
	// flushed, they end up on the full queue for trace.gen%2.
	//
	// The trace reader takes buffers off the full list one-by-one and
	// places them into reading until they're finished being read from.
	// Then they're placed onto the empty list.
	//
	// Protected by trace.lock.
	reading       *traceBuf // buffer currently handed off to user
	empty         *traceBuf // stack of empty buffers
	full          [2]traceBufQueue
	workAvailable atomic.Bool

	// State for the trace reader goroutine.
	//
	// Protected by trace.lock.
	readerGen              atomic.Uintptr // the generation the reader is currently reading for
	flushedGen             atomic.Uintptr // the last completed generation
	headerWritten          bool           // whether ReadTrace has emitted trace header
	endOfGenerationWritten bool           // whether readTrace has emitted the end of the generation signal

	// doneSema is used to synchronize the reader and traceAdvance. Specifically,
	// it notifies traceAdvance that the reader is done with a generation.
	// Both semaphores are 0 by default (so, acquires block). traceAdvance
	// attempts to acquire for gen%2 after flushing the last buffers for gen.
	// Meanwhile the reader releases the sema for gen%2 when it has finished
	// processing gen.
	doneSema [2]uint32

	// Trace data tables for deduplicating data going into the trace.
	// There are 2 of each: one for gen%2, one for 1-gen%2.
	stackTab  [2]traceStackTable  // maps stack traces to unique ids
	stringTab [2]traceStringTable // maps strings to unique ids
	typeTab   [2]traceTypeTable   // maps type pointers to unique ids

	// cpuLogRead accepts CPU profile samples from the signal handler where
	// they're generated. There are two profBufs here: one for gen%2, one for
	// 1-gen%2. These profBufs use a three-word header to hold the IDs of the P, G,
	// and M (respectively) that were active at the time of the sample. Because
	// profBuf uses a record with all zeros in its header to indicate overflow,
	// we make sure to make the P field always non-zero: The ID of a real P will
	// start at bit 1, and bit 0 will be set. Samples that arrive while no P is
	// running (such as near syscalls) will set the first header field to 0b10.
	// This careful handling of the first header field allows us to store ID of
	// the active G directly in the second field, even though that will be 0
	// when sampling g0.
	//
	// Initialization and teardown of these fields is protected by traceAdvanceSema.
	cpuLogRead  [2]*profBuf
	signalLock  atomic.Uint32              // protects use of the following member, only usable in signal handlers
	cpuLogWrite [2]atomic.Pointer[profBuf] // copy of cpuLogRead for use in signal handlers, set without signalLock
	cpuSleep    *wakeableSleep
	cpuLogDone  <-chan struct{}
	cpuBuf      [2]*traceBuf

	reader atomic.Pointer[g] // goroutine that called ReadTrace, or nil

	// Fast mappings from enumerations to string IDs that are prepopulated
	// in the trace.
	markWorkerLabels [2][len(gcMarkWorkerModeStrings)]traceArg
	goStopReasons    [2][len(traceGoStopReasonStrings)]traceArg
	goBlockReasons   [2][len(traceBlockReasonStrings)]traceArg

	// enabled indicates whether tracing is enabled, but it is only an optimization,
	// NOT the source of truth on whether tracing is enabled. Tracing is only truly
	// enabled if gen != 0. This is used as an optimistic fast path check.
	//
	// Transitioning this value from true -> false is easy (once gen is 0)
	// because it's OK for enabled to have a stale "true" value. traceAcquire will
	// always double-check gen.
	//
	// Transitioning this value from false -> true is harder. We need to make sure
	// this is observable as true strictly before gen != 0. To maintain this invariant
	// we only make this transition with the world stopped and use the store to gen
	// as a publication barrier.
	enabled bool

	// enabledWithAllocFree is set if debug.traceallocfree is != 0 when tracing begins.
	// It follows the same synchronization protocol as enabled.
	enabledWithAllocFree bool

	// Trace generation counter.
	gen            atomic.Uintptr
	lastNonZeroGen uintptr // last non-zero value of gen

	// shutdown is set when we are waiting for trace reader to finish after setting gen to 0
	//
	// Writes protected by trace.lock.
	shutdown atomic.Bool

	// Number of goroutines in syscall exiting slow path.
	exitingSyscall atomic.Int32

	// seqGC is the sequence counter for GC begin/end.
	//
	// Mutated only during stop-the-world.
	seqGC uint64

	// minPageHeapAddr is the minimum address of the page heap when tracing started.
	minPageHeapAddr uint64

	// debugMalloc is the value of debug.malloc before tracing began.
	debugMalloc bool
}

// Trace public API.

var (
	traceAdvanceSema  uint32 = 1
	traceShutdownSema uint32 = 1
)

// StartTrace enables tracing for the current process.
// While tracing, the data will be buffered and available via [ReadTrace].
// StartTrace returns an error if tracing is already enabled.
// Most clients should use the [runtime/trace] package or the [testing] package's
// -test.trace flag instead of calling StartTrace directly.
func StartTrace() error {
	if traceEnabled() || traceShuttingDown() {
		return errorString("tracing is already enabled")
	}
	// Block until cleanup of the last trace is done.
	semacquire(&traceShutdownSema)
	semrelease(&traceShutdownSema)

	// Hold traceAdvanceSema across trace start, since we'll want it on
	// the other side of tracing being enabled globally.
	semacquire(&traceAdvanceSema)

	// Initialize CPU profile -> trace ingestion.
	traceInitReadCPU()

	// Compute the first generation for this StartTrace.
	//
	// Note: we start from the last non-zero generation rather than 1 so we
	// can avoid resetting all the arrays indexed by gen%2 or gen%3. There's
	// more than one of each per m, p, and goroutine.
	firstGen := traceNextGen(trace.lastNonZeroGen)

	// Reset GC sequencer.
	trace.seqGC = 1

	// Reset trace reader state.
	trace.headerWritten = false
	trace.readerGen.Store(firstGen)
	trace.flushedGen.Store(0)

	// Register some basic strings in the string tables.
	traceRegisterLabelsAndReasons(firstGen)

	// N.B. This may block for quite a while to get a frequency estimate. Do it
	// here to minimize the time that the world is stopped.
	frequency := traceClockUnitsPerSecond()

	// Stop the world.
	//
	// What we need to successfully begin tracing is to make sure that the next time
	// *any goroutine* hits a traceAcquire, it sees that the trace is enabled.
	//
	// Stopping the world gets us most of the way there, since it makes sure that goroutines
	// stop executing. There is however one exception: goroutines without Ps concurrently
	// exiting a syscall. We handle this by making sure that, after we update trace.gen,
	// there isn't a single goroutine calling traceAcquire on the syscall slow path by checking
	// trace.exitingSyscall. See the comment on the check below for more details.
	//
	// Note also that stopping the world is necessary to make sure sweep-related events are
	// coherent. Since the world is stopped and sweeps are non-preemptible, we can never start
	// the world and see an unpaired sweep 'end' event. Other parts of the tracer rely on this.
	stw := stopTheWorld(stwStartTrace)

	// Prevent sysmon from running any code that could generate events.
	lock(&sched.sysmonlock)

	// Grab the minimum page heap address. All Ps are stopped, so it's safe to read this since
	// nothing can allocate heap memory.
	trace.minPageHeapAddr = uint64(mheap_.pages.inUse.ranges[0].base.addr())

	// Reset mSyscallID on all Ps while we have them stationary and the trace is disabled.
	for _, pp := range allp {
		pp.trace.mSyscallID = -1
	}

	// Start tracing.
	//
	// Set trace.enabled. This is *very* subtle. We need to maintain the invariant that if
	// trace.gen != 0, then trace.enabled is always observed as true. Simultaneously, for
	// performance, we need trace.enabled to be read without any synchronization.
	//
	// We ensure this is safe by stopping the world, which acts a global barrier on almost
	// every M, and explicitly synchronize with any other Ms that could be running concurrently
	// with us. Today, there are only two such cases:
	// - sysmon, which we synchronized with by acquiring sysmonlock.
	// - goroutines exiting syscalls, which we synchronize with via trace.exitingSyscall.
	//
	// After trace.gen is updated, other Ms may start creating trace buffers and emitting
	// data into them.
	trace.enabled = true
	if debug.traceallocfree.Load() != 0 {
		// Enable memory events since the GODEBUG is set.
		trace.debugMalloc = debug.malloc
		trace.enabledWithAllocFree = true
		debug.malloc = true
	}
	trace.gen.Store(firstGen)

	// Wait for exitingSyscall to drain.
	//
	// It may not monotonically decrease to zero, but in the limit it will always become
	// zero because the world is stopped and there are no available Ps for syscall-exited
	// goroutines to run on.
	//
	// Because we set gen before checking this, and because exitingSyscall is always incremented
	// *before* traceAcquire (which checks gen), we can be certain that when exitingSyscall is zero
	// that any goroutine that goes to exit a syscall from then on *must* observe the new gen as
	// well as trace.enabled being set to true.
	//
	// The critical section on each goroutine here is going to be quite short, so the likelihood
	// that we observe a zero value is high.
	for trace.exitingSyscall.Load() != 0 {
		osyield()
	}

	// Record some initial pieces of information.
	//
	// N.B. This will also emit a status event for this goroutine.
	tl := traceAcquire()
	traceSyncBatch(firstGen, frequency) // Get this as early in the trace as possible. See comment in traceAdvance.
	tl.Gomaxprocs(gomaxprocs)           // Get this as early in the trace as possible. See comment in traceAdvance.
	tl.STWStart(stwStartTrace)          // We didn't trace this above, so trace it now.

	// Record the fact that a GC is active, if applicable.
	if gcphase == _GCmark || gcphase == _GCmarktermination {
		tl.GCActive()
	}

	// Dump a snapshot of memory, if enabled.
	if trace.enabledWithAllocFree {
		traceSnapshotMemory(firstGen)
	}

	// Record the heap goal so we have it at the very beginning of the trace.
	tl.HeapGoal()
	traceRelease(tl)

	unlock(&sched.sysmonlock)
	startTheWorld(stw)

	traceStartReadCPU()
	traceAdvancer.start()

	semrelease(&traceAdvanceSema)
	return nil
}

// StopTrace stops tracing, if it was previously enabled.
// StopTrace only returns after all the reads for the trace have completed.
func StopTrace() {
	traceAdvance(true)
}

// traceAdvance moves tracing to the next generation, and cleans up the current generation,
// ensuring that it's flushed out before returning. If stopTrace is true, it disables tracing
// altogether instead of advancing to the next generation.
//
// traceAdvanceSema must not be held.
//
// traceAdvance is called by runtime/trace and golang.org/x/exp/trace using linkname.
//
//go:linkname traceAdvance
func traceAdvance(stopTrace bool) {
	semacquire(&traceAdvanceSema)

	// Get the gen that we're advancing from. In this function we don't really care much
	// about the generation we're advancing _into_ since we'll do all the cleanup in this
	// generation for the next advancement.
	gen := trace.gen.Load()
	if gen == 0 {
		// We may end up here traceAdvance is called concurrently with StopTrace.
		semrelease(&traceAdvanceSema)
		return
	}

	// Collect all the untraced Gs.
	type untracedG struct {
		gp           *g
		goid         uint64
		mid          int64
		stackID      uint64
		status       uint32
		waitreason   waitReason
		inMarkAssist bool
	}
	var untracedGs []untracedG
	forEachGRace(func(gp *g) {
		// Make absolutely sure all Gs are ready for the next
		// generation. We need to do this even for dead Gs because
		// they may come alive with a new identity, and its status
		// traced bookkeeping might end up being stale.
		// We may miss totally new goroutines, but they'll always
		// have clean bookkeeping.
		gp.trace.readyNextGen(gen)
		// If the status was traced, nothing else to do.
		if gp.trace.statusWasTraced(gen) {
			return
		}
		// Scribble down information about this goroutine.
		ug := untracedG{gp: gp, mid: -1}
		systemstack(func() {
			me := getg().m.curg
			// We don't have to handle this G status transition because we
			// already eliminated ourselves from consideration above.
			casGToWaitingForSuspendG(me, _Grunning, waitReasonTraceGoroutineStatus)
			// We need to suspend and take ownership of the G to safely read its
			// goid. Note that we can't actually emit the event at this point
			// because we might stop the G in a window where it's unsafe to write
			// events based on the G's status. We need the global trace buffer flush
			// coming up to make sure we're not racing with the G.
			//
			// It should be very unlikely that we try to preempt a running G here.
			// The only situation that we might is that we're racing with a G
			// that's running for the first time in this generation. Therefore,
			// this should be relatively fast.
			s := suspendG(gp)
			if !s.dead {
				ug.goid = s.g.goid
				if s.g.m != nil {
					ug.mid = int64(s.g.m.procid)
				}
				ug.status = readgstatus(s.g) &^ _Gscan
				ug.waitreason = s.g.waitreason
				ug.inMarkAssist = s.g.inMarkAssist
				ug.stackID = traceStack(0, gp, &trace.stackTab[gen%2])
			}
			resumeG(s)
			casgstatus(me, _Gwaiting, _Grunning)
		})
		if ug.goid != 0 {
			untracedGs = append(untracedGs, ug)
		}
	})

	if !stopTrace {
		// Re-register runtime goroutine labels and stop/block reasons.
		traceRegisterLabelsAndReasons(traceNextGen(gen))
	}

	// N.B. This may block for quite a while to get a frequency estimate. Do it
	// here to minimize the time that we prevent the world from stopping.
	frequency := traceClockUnitsPerSecond()

	// Prevent the world from stopping.
	//
	// This is necessary to ensure the consistency of the STW events. If we're feeling
	// adventurous we could lift this restriction and add a STWActive event, but the
	// cost of maintaining this consistency is low.
	//
	// This is also a good time to preempt all the Ps and ensure they had a status traced.
	semacquire(&worldsema)

	// Go over each P and emit a status event for it if necessary.
	//
	// TODO(mknyszek): forEachP is very heavyweight. We could do better by integrating
	// the statusWasTraced check into it, to avoid preempting unnecessarily.
	forEachP(waitReasonTraceProcStatus, func(pp *p) {
		tl := traceAcquire()
		if !pp.trace.statusWasTraced(tl.gen) {
			tl.writer().writeProcStatusForP(pp, false).end()
		}
		traceRelease(tl)
	})

	// While we're still holding worldsema (preventing a STW and thus a
	// change in the number of Ps), reset the status on dead Ps.
	// They just appear as idle.
	//
	// TODO(mknyszek): Consider explicitly emitting ProcCreate and ProcDestroy
	// events to indicate whether a P exists, rather than just making its
	// existence implicit.
	for _, pp := range allp[len(allp):cap(allp)] {
		pp.trace.readyNextGen(gen)
	}

	// Prevent preemption to make sure we're not interrupted.
	//
	// We want to get through the rest as soon as possible.
	mp := acquirem()

	// Advance the generation or stop the trace.
	trace.lastNonZeroGen = gen
	if stopTrace {
		systemstack(func() {
			// Ordering is important here. Set shutdown first, then disable tracing,
			// so that conditions like (traceEnabled() || traceShuttingDown()) have
			// no opportunity to be false. Hold the trace lock so this update appears
			// atomic to the trace reader.
			lock(&trace.lock)
			trace.shutdown.Store(true)
			trace.gen.Store(0)
			unlock(&trace.lock)

			// Clear trace.enabled. It is totally OK for this value to be stale,
			// because traceAcquire will always double-check gen.
			trace.enabled = false
		})
	} else {
		trace.gen.Store(traceNextGen(gen))
	}

	// Emit a sync batch which contains a ClockSnapshot. Also emit a ProcsChange
	// event so we have one on record for each generation. Let's emit it as soon
	// as possible so that downstream tools can rely on the value being there
	// fairly soon in a generation.
	//
	// It's important that we do this before allowing stop-the-worlds again,
	// because the procs count could change.
	if !stopTrace {
		tl := traceAcquire()
		traceSyncBatch(tl.gen, frequency)
		tl.Gomaxprocs(gomaxprocs)
		traceRelease(tl)
	}

	// Emit a GCActive event in the new generation if necessary.
	//
	// It's important that we do this before allowing stop-the-worlds again,
	// because that could emit global GC-related events.
	if !stopTrace && (gcphase == _GCmark || gcphase == _GCmarktermination) {
		tl := traceAcquire()
		tl.GCActive()
		traceRelease(tl)
	}

	// Preemption is OK again after this. If the world stops or whatever it's fine.
	// We're just cleaning up the last generation after this point.
	//
	// We also don't care if the GC starts again after this for the same reasons.
	releasem(mp)
	semrelease(&worldsema)

	// Snapshot allm and freem.
	//
	// Snapshotting after the generation counter update is sufficient.
	// Because an m must be on either allm or sched.freem if it has an active trace
	// buffer, new threads added to allm after this point must necessarily observe
	// the new generation number (sched.lock acts as a barrier).
	//
	// Threads that exit before this point and are on neither list explicitly
	// flush their own buffers in traceThreadDestroy.
	//
	// Snapshotting freem is necessary because Ms can continue to emit events
	// while they're still on that list. Removal from sched.freem is serialized with
	// this snapshot, so either we'll capture an m on sched.freem and race with
	// the removal to flush its buffers (resolved by traceThreadDestroy acquiring
	// the thread's write flag, which one of us must win, so at least its old gen buffer
	// will be flushed in time for the new generation) or it will have flushed its
	// buffers before we snapshotted it to begin with.
	lock(&sched.lock)
	mToFlush := allm
	for mp := mToFlush; mp != nil; mp = mp.alllink {
		mp.trace.link = mp.alllink
	}
	for mp := sched.freem; mp != nil; mp = mp.freelink {
		mp.trace.link = mToFlush
		mToFlush = mp
	}
	unlock(&sched.lock)

	// Iterate over our snapshot, flushing every buffer until we're done.
	//
	// Because trace writers read the generation while the write flag is
	// held, we can be certain that when there are no writers there are
	// also no stale generation values left. Therefore, it's safe to flush
	// any buffers that remain in that generation's slot.
	const debugDeadlock = false
	systemstack(func() {
		// Track iterations for some rudimentary deadlock detection.
		i := 0
		detectedDeadlock := false

		for mToFlush != nil {
			prev := &mToFlush
			for mp := *prev; mp != nil; {
				if mp.trace.writing.Load() {
					// The M is writing. Come back to it later.
					prev = &mp.trace.link
					mp = mp.trace.link
					continue
				}
				// Flush the trace buffer.
				//
				// trace.lock needed for traceBufFlush, but also to synchronize
				// with traceThreadDestroy, which flushes both buffers unconditionally.
				lock(&trace.lock)
				for exp, buf := range mp.trace.buf[gen%2] {
					if buf != nil {
						traceBufFlush(buf, gen)
						mp.trace.buf[gen%2][exp] = nil
					}
				}
				unlock(&trace.lock)

				// Remove the m from the flush list.
				*prev = mp.trace.link
				mp.trace.link = nil
				mp = *prev
			}
			// Yield only if we're going to be going around the loop again.
			if mToFlush != nil {
				osyield()
			}

			if debugDeadlock {
				// Try to detect a deadlock. We probably shouldn't loop here
				// this many times.
				if i > 100000 && !detectedDeadlock {
					detectedDeadlock = true
					println("runtime: failing to flush")
					for mp := mToFlush; mp != nil; mp = mp.trace.link {
						print("runtime: m=", mp.id, "\n")
					}
				}
				i++
			}
		}
	})

	// At this point, the old generation is fully flushed minus stack and string
	// tables, CPU samples, and goroutines that haven't run at all during the last
	// generation.

	// Check to see if any Gs still haven't had events written out for them.
	statusWriter := unsafeTraceWriter(gen, nil)
	for _, ug := range untracedGs {
		if ug.gp.trace.statusWasTraced(gen) {
			// It was traced, we don't need to do anything.
			continue
		}
		// It still wasn't traced. Because we ensured all Ms stopped writing trace
		// events to the last generation, that must mean the G never had its status
		// traced in gen between when we recorded it and now. If that's true, the goid
		// and status we recorded then is exactly what we want right now.
		status := goStatusToTraceGoStatus(ug.status, ug.waitreason)
		statusWriter = statusWriter.writeGoStatus(ug.goid, ug.mid, status, ug.inMarkAssist, ug.stackID)
	}
	statusWriter.flush().end()

	// Read everything out of the last gen's CPU profile buffer.
	traceReadCPU(gen)

	// Flush CPU samples, stacks, and strings for the last generation. This is safe,
	// because we're now certain no M is writing to the last generation.
	//
	// Ordering is important here. traceCPUFlush may generate new stacks and dumping
	// stacks may generate new strings.
	traceCPUFlush(gen)
	trace.stackTab[gen%2].dump(gen)
	trace.typeTab[gen%2].dump(gen)
	trace.stringTab[gen%2].reset(gen)

	// That's it. This generation is done producing buffers.
	systemstack(func() {
		lock(&trace.lock)
		trace.flushedGen.Store(gen)
		unlock(&trace.lock)
	})

	if stopTrace {
		// Acquire the shutdown sema to begin the shutdown process.
		semacquire(&traceShutdownSema)

		// Finish off CPU profile reading.
		traceStopReadCPU()

		// Reset debug.malloc if necessary. Note that this is set in a racy
		// way; that's OK. Some mallocs may still enter into the debug.malloc
		// block, but they won't generate events because tracing is disabled.
		// That is, it's OK if mallocs read a stale debug.malloc or
		// trace.enabledWithAllocFree value.
		if trace.enabledWithAllocFree {
			trace.enabledWithAllocFree = false
			debug.malloc = trace.debugMalloc
		}
	}

	// Block until the trace reader has finished processing the last generation.
	semacquire(&trace.doneSema[gen%2])
	if raceenabled {
		raceacquire(unsafe.Pointer(&trace.doneSema[gen%2]))
	}

	// Double-check that things look as we expect after advancing and perform some
	// final cleanup if the trace has fully stopped.
	systemstack(func() {
		lock(&trace.lock)
		if !trace.full[gen%2].empty() {
			throw("trace: non-empty full trace buffer for done generation")
		}
		if stopTrace {
			if !trace.full[1-(gen%2)].empty() {
				throw("trace: non-empty full trace buffer for next generation")
			}
			if trace.reading != nil || trace.reader.Load() != nil {
				throw("trace: reading after shutdown")
			}
			// Free all the empty buffers.
			for trace.empty != nil {
				buf := trace.empty
				trace.empty = buf.link
				sysFree(unsafe.Pointer(buf), unsafe.Sizeof(*buf), &memstats.other_sys)
			}
			// Clear trace.shutdown and other flags.
			trace.headerWritten = false
			trace.shutdown.Store(false)
		}
		unlock(&trace.lock)
	})

	if stopTrace {
		// Clear the sweep state on every P for the next time tracing is enabled.
		//
		// It may be stale in the next trace because we may have ended tracing in
		// the middle of a sweep on a P.
		//
		// It's fine not to call forEachP here because tracing is disabled and we
		// know at this point that nothing is calling into the tracer, but we do
		// need to look at dead Ps too just because GOMAXPROCS could have been called
		// at any point since we stopped tracing, and we have to ensure there's no
		// bad state on dead Ps too. Prevent a STW and a concurrent GOMAXPROCS that
		// might mutate allp by making ourselves briefly non-preemptible.
		mp := acquirem()
		for _, pp := range allp[:cap(allp)] {
			pp.trace.inSweep = false
			pp.trace.maySweep = false
			pp.trace.swept = 0
			pp.trace.reclaimed = 0
		}
		releasem(mp)
	}

	// Release the advance semaphore. If stopTrace is true we're still holding onto
	// traceShutdownSema.
	//
	// Do a direct handoff. Don't let one caller of traceAdvance starve
	// other calls to traceAdvance.
	semrelease1(&traceAdvanceSema, true, 0)

	if stopTrace {
		// Stop the traceAdvancer. We can't be holding traceAdvanceSema here because
		// we'll deadlock (we're blocked on the advancer goroutine exiting, but it
		// may be currently trying to acquire traceAdvanceSema).
		traceAdvancer.stop()
		semrelease(&traceShutdownSema)
	}
}

func traceNextGen(gen uintptr) uintptr {
	if gen == ^uintptr(0) {
		// gen is used both %2 and %3 and we want both patterns to continue when we loop around.
		// ^uint32(0) and ^uint64(0) are both odd and multiples of 3. Therefore the next generation
		// we want is even and one more than a multiple of 3. The smallest such number is 4.
		return 4
	}
	return gen + 1
}

// traceRegisterLabelsAndReasons re-registers mark worker labels and
// goroutine stop/block reasons in the string table for the provided
// generation. Note: the provided generation must not have started yet.
func traceRegisterLabelsAndReasons(gen uintptr) {
	for i, label := range gcMarkWorkerModeStrings[:] {
		trace.markWorkerLabels[gen%2][i] = traceArg(trace.stringTab[gen%2].put(gen, label))
	}
	for i, str := range traceBlockReasonStrings[:] {
		trace.goBlockReasons[gen%2][i] = traceArg(trace.stringTab[gen%2].put(gen, str))
	}
	for i, str := range traceGoStopReasonStrings[:] {
		trace.goStopReasons[gen%2][i] = traceArg(trace.stringTab[gen%2].put(gen, str))
	}
}

// ReadTrace returns the next chunk of binary tracing data, blocking until data
// is available. If tracing is turned off and all the data accumulated while it
// was on has been returned, ReadTrace returns nil. The caller must copy the
// returned data before calling ReadTrace again.
// ReadTrace must be called from one goroutine at a time.
func ReadTrace() (buf []byte) {
top:
	var park bool
	systemstack(func() {
		buf, park = readTrace0()
	})
	if park {
		gopark(func(gp *g, _ unsafe.Pointer) bool {
			if !trace.reader.CompareAndSwapNoWB(nil, gp) {
				// We're racing with another reader.
				// Wake up and handle this case.
				return false
			}

			if g2 := traceReader(); gp == g2 {
				// New data arrived between unlocking
				// and the CAS and we won the wake-up
				// race, so wake up directly.
				return false
			} else if g2 != nil {
				printlock()
				println("runtime: got trace reader", g2, g2.goid)
				throw("unexpected trace reader")
			}

			return true
		}, nil, waitReasonTraceReaderBlocked, traceBlockSystemGoroutine, 2)
		goto top
	}
	return buf
}

// readTrace0 is ReadTrace's continuation on g0. This must run on the
// system stack because it acquires trace.lock.
//
//go:systemstack
func readTrace0() (buf []byte, park bool) {
	if raceenabled {
		// g0 doesn't have a race context. Borrow the user G's.
		if getg().racectx != 0 {
			throw("expected racectx == 0")
		}
		getg().racectx = getg().m.curg.racectx
		// (This defer should get open-coded, which is safe on
		// the system stack.)
		defer func() { getg().racectx = 0 }()
	}

	// This function must not allocate while holding trace.lock:
	// allocation can call heap allocate, which will try to emit a trace
	// event while holding heap lock.
	lock(&trace.lock)

	if trace.reader.Load() != nil {
		// More than one goroutine reads trace. This is bad.
		// But we rather do not crash the program because of tracing,
		// because tracing can be enabled at runtime on prod servers.
		unlock(&trace.lock)
		println("runtime: ReadTrace called from multiple goroutines simultaneously")
		return nil, false
	}
	// Recycle the old buffer.
	if buf := trace.reading; buf != nil {
		buf.link = trace.empty
		trace.empty = buf
		trace.reading = nil
	}
	// Write trace header.
	if !trace.headerWritten {
		trace.headerWritten = true
		unlock(&trace.lock)
		return []byte("go 1.26 trace\x00\x00\x00"), false
	}

	// Read the next buffer.

	if trace.readerGen.Load() == 0 {
		trace.readerGen.Store(1)
	}
	var gen uintptr
	for {
		assertLockHeld(&trace.lock)
		gen = trace.readerGen.Load()

		// Check to see if we need to block for more data in this generation
		// or if we need to move our generation forward.
		if !trace.full[gen%2].empty() {
			break
		}
		// Most of the time readerGen is one generation ahead of flushedGen, as the
		// current generation is being read from. Then, once the last buffer is flushed
		// into readerGen, flushedGen will rise to meet it. At this point, the tracer
		// is waiting on the reader to finish flushing the last generation so that it
		// can continue to advance.
		if trace.flushedGen.Load() == gen {
			// Write out the internal in-band end-of-generation signal.
			if !trace.endOfGenerationWritten {
				trace.endOfGenerationWritten = true
				unlock(&trace.lock)
				return []byte{byte(tracev2.EvEndOfGeneration)}, false
			}

			// Reset the flag.
			trace.endOfGenerationWritten = false

			// Handle shutdown.
			if trace.shutdown.Load() {
				unlock(&trace.lock)

				// Wake up anyone waiting for us to be done with this generation.
				//
				// Do this after reading trace.shutdown, because the thread we're
				// waking up is going to clear trace.shutdown.
				if raceenabled {
					// Model synchronization on trace.doneSema, which te race
					// detector does not see. This is required to avoid false
					// race reports on writer passed to trace.Start.
					racerelease(unsafe.Pointer(&trace.doneSema[gen%2]))
				}
				semrelease(&trace.doneSema[gen%2])

				// We're shutting down, and the last generation is fully
				// read. We're done.
				return nil, false
			}
			// Handle advancing to the next generation.

			// The previous gen has had all of its buffers flushed, and
			// there's nothing else for us to read. Advance the generation
			// we're reading from and try again.
			trace.readerGen.Store(trace.gen.Load())
			unlock(&trace.lock)

			// Wake up anyone waiting for us to be done with this generation.
			//
			// Do this after reading gen to make sure we can't have the trace
			// advance until we've read it.
			if raceenabled {
				// See comment above in the shutdown case.
				racerelease(unsafe.Pointer(&trace.doneSema[gen%2]))
			}
			semrelease(&trace.doneSema[gen%2])

			// Reacquire the lock and go back to the top of the loop.
			lock(&trace.lock)
			continue
		}
		// Wait for new data.
		//
		// We don't simply use a note because the scheduler
		// executes this goroutine directly when it wakes up
		// (also a note would consume an M).
		//
		// Before we drop the lock, clear the workAvailable flag. Work can
		// only be queued with trace.lock held, so this is at least true until
		// we drop the lock.
		trace.workAvailable.Store(false)
		unlock(&trace.lock)
		return nil, true
	}
	// Pull a buffer.
	tbuf := trace.full[gen%2].pop()
	trace.reading = tbuf
	unlock(&trace.lock)
	return tbuf.arr[:tbuf.pos], false
}

// traceReader returns the trace reader that should be woken up, if any.
// Callers should first check (traceEnabled() || traceShuttingDown()).
//
// This must run on the system stack because it acquires trace.lock.
//
//go:systemstack
func traceReader() *g {
	gp := traceReaderAvailable()
	if gp == nil || !trace.reader.CompareAndSwapNoWB(gp, nil) {
		return nil
	}
	return gp
}

// traceReaderAvailable returns the trace reader if it is not currently
// scheduled and should be. Callers should first check that
// (traceEnabled() || traceShuttingDown()) is true.
func traceReaderAvailable() *g {
	// There are two conditions under which we definitely want to schedule
	// the reader:
	// - The reader is lagging behind in finishing off the last generation.
	//   In this case, trace buffers could even be empty, but the trace
	//   advancer will be waiting on the reader, so we have to make sure
	//   to schedule the reader ASAP.
	// - The reader has pending work to process for it's reader generation
	//   (assuming readerGen is not lagging behind). Note that we also want
	//   to be careful *not* to schedule the reader if there's no work to do.
	//
	// We also want to be careful not to schedule the reader if there's no
	// reason to.
	if trace.flushedGen.Load() == trace.readerGen.Load() || trace.workAvailable.Load() {
		return trace.reader.Load()
	}
	return nil
}

// Trace advancer goroutine.
var traceAdvancer traceAdvancerState

type traceAdvancerState struct {
	timer *wakeableSleep
	done  chan struct{}
}

// start starts a new traceAdvancer.
func (s *traceAdvancerState) start() {
	// Start a goroutine to periodically advance the trace generation.
	s.done = make(chan struct{})
	s.timer = newWakeableSleep()
	go func() {
		for traceEnabled() {
			// Set a timer to wake us up
			s.timer.sleep(int64(debug.traceadvanceperiod))

			// Try to advance the trace.
			traceAdvance(false)
		}
		s.done <- struct{}{}
	}()
}

// stop stops a traceAdvancer and blocks until it exits.
func (s *traceAdvancerState) stop() {
	s.timer.wake()
	<-s.done
	close(s.done)
	s.timer.close()
}

// traceAdvancePeriod is the approximate period between
// new generations.
const defaultTraceAdvancePeriod = 1e9 // 1 second.

// wakeableSleep manages a wakeable goroutine sleep.
//
// Users of this type must call init before first use and
// close to free up resources. Once close is called, init
// must be called before another use.
type wakeableSleep struct {
	timer *timer

	// lock protects access to wakeup, but not send/recv on it.
	lock   mutex
	wakeup chan struct{}
}

// newWakeableSleep initializes a new wakeableSleep and returns it.
func newWakeableSleep() *wakeableSleep {
	s := new(wakeableSleep)
	lockInit(&s.lock, lockRankWakeableSleep)
	s.wakeup = make(chan struct{}, 1)
	s.timer = new(timer)
	f := func(s any, _ uintptr, _ int64) {
		s.(*wakeableSleep).wake()
	}
	s.timer.init(f, s)
	return s
}

// sleep sleeps for the provided duration in nanoseconds or until
// another goroutine calls wake.
//
// Must not be called by more than one goroutine at a time and
// must not be called concurrently with close.
func (s *wakeableSleep) sleep(ns int64) {
	s.timer.reset(nanotime()+ns, 0)
	lock(&s.lock)
	if raceenabled {
		raceacquire(unsafe.Pointer(&s.lock))
	}
	wakeup := s.wakeup
	if raceenabled {
		racerelease(unsafe.Pointer(&s.lock))
	}
	unlock(&s.lock)
	<-wakeup
	s.timer.stop()
}

// wake awakens any goroutine sleeping on the timer.
//
// Safe for concurrent use with all other methods.
func (s *wakeableSleep) wake() {
	// Grab the wakeup channel, which may be nil if we're
	// racing with close.
	lock(&s.lock)
	if raceenabled {
		raceacquire(unsafe.Pointer(&s.lock))
	}
	if s.wakeup != nil {
		// Non-blocking send.
		//
		// Others may also write to this channel and we don't
		// want to block on the receiver waking up. This also
		// effectively batches together wakeup notifications.
		select {
		case s.wakeup <- struct{}{}:
		default:
		}
	}
	if raceenabled {
		racerelease(unsafe.Pointer(&s.lock))
	}
	unlock(&s.lock)
}

// close wakes any goroutine sleeping on the timer and prevents
// further sleeping on it.
//
// Once close is called, the wakeableSleep must no longer be used.
//
// It must only be called once no goroutine is sleeping on the
// timer *and* nothing else will call wake concurrently.
func (s *wakeableSleep) close() {
	// Set wakeup to nil so that a late timer ends up being a no-op.
	lock(&s.lock)
	if raceenabled {
		raceacquire(unsafe.Pointer(&s.lock))
	}
	wakeup := s.wakeup
	s.wakeup = nil

	// Close the channel.
	close(wakeup)

	if raceenabled {
		racerelease(unsafe.Pointer(&s.lock))
	}
	unlock(&s.lock)
	return
}