Fix interpreter stackwalk crash when seeding from a debugger-supplied context#129907
Fix interpreter stackwalk crash when seeding from a debugger-supplied context#129907steveisok wants to merge 3 commits into
Conversation
A Frame's m_Next is NULL until the frame is fully pushed (initialized to nullptr in the Frame constructor, linked into the chain only by Push()). A stackwalk can observe a not-yet-fully-linked frame - for example the DAC walking a thread suspended for debugger stepping over interpreter frames. The walker relies on the explicit-frame chain being terminated by FRAME_TOP and only ever checks against FRAME_TOP, so advancing onto a frame whose m_Next is NULL left m_crawl.pFrame == NULL and caused a null dereference in StackFrameIterator::CheckForSkippedFrames (and other deref sites). Normalize a NULL next-frame to FRAME_TOP in GotoNextFrame so the chain is always terminated and the rest of the walk never dereferences NULL. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
StackFrameIterator::Init and ResetRegDisp seed an interpreter stackwalk by reading the owning InterpreterFrame from the first-argument register of pCurrentContext. That is only a valid convention for a CONTEXT produced by a normal runtime stackwalk, where the interpreter code manager (eetwain.cpp) synthesizes it. It does NOT hold for a CONTEXT supplied by the debugger via ICorDebugStackWalk::SetContext: DacDbiInterfaceImpl::SetStackWalkCurrentContext copies a raw thread CONTEXT into pCurrentContext, so the first-arg register holds the interpreted method's working value rather than the InterpreterFrame pointer (e.g. when VS GetActiveFunctions walks a thread stopped at a breakpoint deep inside an interpreted method). Reading the register then yields NULL/garbage, tripping the _ASSERTE in a checked DAC (or null-dereferencing in release) inside ResetRegDisp. Only VS reproduces this because its GetActiveFunctions/SetContext path seeds the walk with an arbitrary mid-method context; VSCode does not. Add GetOwningInterpreterFrame, which keeps the first-arg-register fast path but falls back to the topmost InterpreterFrame on the thread's explicit-frame chain when the register does not point at a valid InterpreterFrame. Also normalize a NULL PtrNextFrame() (m_Next of a not-yet-fully-linked frame) to FRAME_TOP at both seed sites so the chain stays terminated. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Tagging subscribers to this area: @JulieLeeMSFT, @BrzVlad, @janvorli, @kg |
There was a problem hiding this comment.
Pull request overview
This PR updates CoreCLR stack walking to be resilient when a stackwalk in interpreted code is seeded from a debugger-supplied CONTEXT (rather than one produced by the runtime’s own stackwalk), avoiding crashes/asserts when the “first-arg register contains owning InterpreterFrame*” convention doesn’t hold.
Changes:
- Normalize a
NULLexplicit-framem_NexttoFRAME_TOPduringCrawlFrame::GotoNextFrame()to keep frame-chain iteration terminating safely. - Add
GetOwningInterpreterFrame(...)and use it fromStackFrameIterator::InitandResetRegDispwhen starting from interpreted-code contexts. - Treat
InterpreterFrame::PtrNextFrame()returningNULLas end-of-chain (FRAME_TOP) when skipping past the owning interpreter frame.
| PTR_InterpreterFrame pOwning = | ||
| dac_cast<PTR_InterpreterFrame>((TADDR)GetFirstArgReg(pContext)); | ||
|
|
||
| if ((pOwning == NULL) || (pOwning->GetFrameIdentifier() != FrameIdentifier::InterpreterFrame)) | ||
| { | ||
| pOwning = NULL; | ||
| for (PTR_Frame pFrame = pThread->GetFrame(); | ||
| pFrame != FRAME_TOP; | ||
| pFrame = pFrame->PtrNextFrame()) | ||
| { | ||
| if (pFrame->GetFrameIdentifier() == FrameIdentifier::InterpreterFrame) | ||
| { | ||
| pOwning = dac_cast<PTR_InterpreterFrame>(pFrame); | ||
| break; | ||
| } | ||
| } | ||
| } |
| // establishes a convention where the first-argument register of pCurrentContext holds the | ||
| // owning InterpreterFrame (see the InterpreterCodeManager in eetwain.cpp). That convention | ||
| // does NOT hold for a CONTEXT supplied by the debugger via ICorDebugStackWalk::SetContext: | ||
| // DacDbiInterfaceImpl::SetStackWalkCurrentContext copies a raw thread CONTEXT into |
There was a problem hiding this comment.
This explanation doesn't make sense to me.
My mental model is that there are two kinds of CONTEXT, those where IP points at interpretted bytecode and those where it doesn't. If the IP points at interpreted bytecode then we'd also expect GetFirstArgReg(pContext) to point at the InterpreterFrame. Creating this method implies that invariant isn't holding, but if so I think we should either:
- Find where the invariant is getting broken and fix it. If the invariant doesn't hold here then the bug occured earlier, wherever the CONTEXT was generated.
- Define a new looser invariant with a clear understanding of why the previous simpler invariant can't be used.
The comment mentions the debugger provides the CONTEXT and treats it like a black-box. If the debugger provides a bad seed CONTEXT then the bug is likely in the debugger or in something the debugger read from. We'd need to peel those layers and find the root cause.
Deterministic, IDE-independent reproduction of the interpreter stackwalk seed crash fixed in StackFrameIterator::Init / ResetRegDisp. A managed target runs a method under the interpreter and calls Debugger.Break() inside it. A minimal native ICorDebug client (dbgshim launch + managed callback) responds to the Break event by calling ICorDebugThread2::GetActiveFunctions on the stopped thread, which drives the V2 shim stackwalk (ShimStackWalk::Populate -> CordbStackWalk::SetContext -> DacDbiInterfaceImpl::SetStackWalkCurrentContext -> StackFrameIterator::ResetRegDisp) with a debugger-supplied CONTEXT in interpreted code. For local debugging the DAC is loaded into the harness, so on a buggy Checked build the harness asserts in ResetRegDisp (Release: null-deref), and on a fixed build GetActiveFunctions returns cleanly and the harness exits 0. This exercises the same code path that VS hits on the iOS simulator via the remote ICorDebug target. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Summary
Debugging an interpreter-based app (e.g. .NET MAUI on the iOS simulator) through
Visual Studio crashes the target process when stepping. The crash is in the DAC's
StackFrameIterator::ResetRegDisp(null-deref in release,_ASSERTEin checked),driven by
CordbThread::GetActiveFunctions→ShimStackWalk→CordbStackWalk::SetContext→DacDbiInterfaceImpl::SetStackWalkCurrentContext.Root cause
When seeding a stackwalk in interpreted code,
InitandResetRegDispderive theowning
InterpreterFramefrom the first-argument register ofpCurrentContext:PTR_InterpreterFrame pOwning = dac_cast<PTR_InterpreterFrame>((TADDR)GetFirstArgReg(m_crawl.pRD->pCurrentContext));That "first-arg register holds the owning InterpreterFrame " convention is only
synthesized by the interpreter code manager for a CONTEXT produced by a normal
runtime stackwalk. It does not hold for a CONTEXT supplied by the debugger via
ICorDebugStackWalk::SetContext : SetStackWalkCurrentContext copies a raw thread
CONTEXT into pCurrentContext , so the first-arg register holds the interpreted
method's working value — not the frame pointer — when the thread is stopped at a
breakpoint deep inside an interpreted method.
Reading the register then yields NULL/garbage, which null-dereferences (release) or
trips _ASSERTE (checked) in ResetRegDisp . Visual Studio reproduces it easily,
because its GetActiveFunctions / SetContext path seeds the walk with an arbitrary
mid-method context; VS Code does not issue that walk and thus is harder to get the bug to pop.
Fix
Add GetOwningInterpreterFrame , which keeps the first-arg-register fast path but
falls back to the topmost InterpreterFrame on the thread's explicit-frame chain
when the register does not point at a valid InterpreterFrame . The frame chain is
reliable regardless of how the CONTEXT was produced. Both seed sites also normalize a
NULL PtrNextFrame() (the m_Next of a not-yet-fully-linked frame) to FRAME_TOP
so the chain stays terminated.
Testing
• Repro: step in VS while debugging a .NET MAUI app on the iOS simulator (interpreter).
• Before: target crashes in ResetRegDisp (checked _ASSERTE / release null-deref).
• After: stepping works; VS Call Stack / Parallel Stacks populate correctly.