fix: guard ProfiledThread teardown against signal races #509
- Use currentSignalSafe() in onThreadEnd to avoid malloc reentrancy
- Move SignalBlocker into release() and freeKey so all call sites (libraryPatcher_linux.cpp, perfEvents_linux.cpp) are automatically protected; no per-caller wrapping needed
- Remove dead buffer-allocation code (confirmed no callers remain)
- Fix CriticalSection destructor to use pointer captured at construction instead of re-fetching TLS (which may be null after release())
- Add fork-based regression test

Fixes PROF-14546
…n invariant test

The previous test installed SIG_IGN for both profiling signals, so no handler ever fired and the race was never triggered. The second CriticalSection check always used the fallback bitmap path and passed regardless of the fix.

Add ProfiledThread::clearCurrentThreadTLS() to simulate the exact race window (TLS cleared mid-CS scope). The test calls it inside a live CriticalSection; without the _thread_ptr capture fix the destructor re-fetches nullptr and skips exitCriticalSection(), leaving _in_critical_section stuck, so tryEnterCriticalSection() then returns false (exit 5). With the fix the test exits 0.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
CI Test Results
Run: #25443330309 | Commit:
Status Overview
Summary: Total: 32 | Passed: 32 | Failed: 0
Updated: 2026-05-06 15:47:55 UTC
The ProfiledThread destructor is private; the forked child can simply _exit() and let the OS reclaim memory instead of calling delete.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
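The fork-based test pattern these commit messages describe can be sketched roughly as follows. This is a minimal illustration, not the code in `ddprof_ut.cpp`: `runInChild`, `passingScenario`, and `stuckFlagScenario` are hypothetical names, and the exit codes 0 and 5 are borrowed from the commit message only to mirror its convention.

```cpp
#include <cerrno>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper: runs `scenario` in a forked child and returns the
// child's exit code, or -1 if the child could not be forked, could not be
// waited for, or did not exit normally (e.g. it was killed by a signal).
int runInChild(int (*scenario)()) {
  pid_t pid = fork();
  if (pid < 0) return -1;
  if (pid == 0) {
    // Child: report the result through the exit code and leave via _exit(),
    // so no destructors or atexit handlers run and the OS reclaims memory.
    _exit(scenario());
  }
  int status = 0;
  pid_t waited;
  do {
    waited = waitpid(pid, &status, 0);
  } while (waited < 0 && errno == EINTR);  // retry if interrupted by a signal
  if (waited != pid) return -1;
  return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

// Stand-in scenarios: a real test would exercise the teardown path here.
int passingScenario() { return 0; }    // invariant held
int stuckFlagScenario() { return 5; }  // invariant violated
```

Running the scenario in a child keeps any corrupted thread state or crash out of the test runner's own process, which is presumably why the regression test takes this shape.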
@codex review
Codex Review: Didn't find any major issues.
Pull request overview
This PR hardens ProfiledThread teardown against profiling-signal races that could previously lead to crashes (SIGSEGV) or inconsistent CriticalSection state during thread shutdown.
Changes:
- Switch `Profiler::onThreadEnd` to use `ProfiledThread::currentSignalSafe()` to avoid allocations in teardown paths.
- Block profiling signals during `ProfiledThread` teardown (`release()` and TLS key destructor) to prevent signal-handler reentrancy while freeing thread state.
- Update `CriticalSection` to capture the `ProfiledThread*` once at construction and reuse it in the destructor; add a fork-based regression test for the TLS-clear window.
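As a rough illustration of the signal-blocking change, an RAII guard built on `pthread_sigmask` might look like the sketch below. The names `SignalBlockerSketch`, `ThreadStateSketch`, `sigprofBlocked`, and `releaseSketch` are hypothetical, and blocking exactly SIGPROF and SIGVTALRM is an assumption based on the signals named in the PR description; the real `SignalBlocker` may differ.

```cpp
#include <pthread.h>
#include <signal.h>

// Hypothetical RAII guard: blocks SIGPROF and SIGVTALRM for the calling
// thread on construction and restores the previous mask on destruction.
class SignalBlockerSketch {
public:
  SignalBlockerSketch() {
    sigset_t to_block;
    sigemptyset(&to_block);
    sigaddset(&to_block, SIGPROF);
    sigaddset(&to_block, SIGVTALRM);
    _active = pthread_sigmask(SIG_BLOCK, &to_block, &_old_mask) == 0;
  }
  ~SignalBlockerSketch() {
    // Restore the saved mask only if blocking succeeded in the constructor.
    if (_active) pthread_sigmask(SIG_SETMASK, &_old_mask, nullptr);
  }
  SignalBlockerSketch(const SignalBlockerSketch&) = delete;
  SignalBlockerSketch& operator=(const SignalBlockerSketch&) = delete;

private:
  sigset_t _old_mask;
  bool _active;
};

// Returns true if SIGPROF is blocked for the calling thread right now.
inline bool sigprofBlocked() {
  sigset_t current;
  sigemptyset(&current);
  pthread_sigmask(SIG_BLOCK, nullptr, &current);  // null set: query only
  return sigismember(&current, SIGPROF) == 1;
}

struct ThreadStateSketch { int tid; };  // stand-in for per-thread state

// Frees the state with profiling signals blocked, so a handler cannot run
// on this thread while the object is being destroyed. Returns whether
// SIGPROF was blocked at the moment of the delete.
inline bool releaseSketch(ThreadStateSketch* state) {
  SignalBlockerSketch blocker;
  bool blocked = sigprofBlocked();
  delete state;
  return blocked;
}
```

Placing the guard inside the release function rather than at each call site means every caller gets the protection without having to remember it.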
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| ddprof-lib/src/test/cpp/ddprof_ut.cpp | Adds a regression test exercising CriticalSection behavior when TLS is cleared mid-scope. |
| ddprof-lib/src/main/cpp/thread.h | Simplifies ProfiledThread state and adds a test-only TLS-clear helper. |
| ddprof-lib/src/main/cpp/thread.cpp | Blocks profiling signals around TLS teardown and removes unused buffer-recycling code paths. |
| ddprof-lib/src/main/cpp/profiler.cpp | Avoids allocating ProfiledThread during onThreadEnd by using a signal-safe accessor. |
| ddprof-lib/src/main/cpp/otel_context.h | Updates documentation around OTEL TLS pointer lifecycle at thread exit. |
| ddprof-lib/src/main/cpp/guards.h | Extends CriticalSection to store a captured ProfiledThread*. |
| ddprof-lib/src/main/cpp/guards.cpp | Uses captured ProfiledThread* in CriticalSection destructor to avoid TLS re-fetch races. |
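The difference between an allocating accessor and a non-allocating one, as in the `profiler.cpp` change, can be sketched as follows. This assumes a `pthread_key_t`-based TLS slot; `ThreadRecordSketch`, `currentSketch`, and `currentSignalSafeSketch` are hypothetical stand-ins, not the actual `ProfiledThread` API.

```cpp
#include <atomic>
#include <pthread.h>

struct ThreadRecordSketch { int samples = 0; };  // stand-in per-thread record

static pthread_key_t g_key;
static pthread_once_t g_key_once = PTHREAD_ONCE_INIT;
static std::atomic<bool> g_key_ready{false};

// TLS key destructor: frees the record when the owning thread exits.
static void freeRecord(void* p) { delete static_cast<ThreadRecordSketch*>(p); }

static void makeKey() {
  if (pthread_key_create(&g_key, freeRecord) == 0) {
    g_key_ready.store(true, std::memory_order_release);
  }
}

// Allocating accessor: lazily creates the record with `new`. Calling this
// on a teardown path risks re-entering the allocator if a profiling signal
// interrupts the allocation.
ThreadRecordSketch* currentSketch() {
  pthread_once(&g_key_once, makeKey);
  if (!g_key_ready.load(std::memory_order_acquire)) return nullptr;
  auto* rec = static_cast<ThreadRecordSketch*>(pthread_getspecific(g_key));
  if (rec == nullptr) {
    rec = new ThreadRecordSketch();
    pthread_setspecific(g_key, rec);
  }
  return rec;
}

// Non-allocating accessor: only reads the slot and returns nullptr if the
// key or the record does not exist yet. It never calls `new`.
ThreadRecordSketch* currentSignalSafeSketch() {
  if (!g_key_ready.load(std::memory_order_acquire)) return nullptr;
  return static_cast<ThreadRecordSketch*>(pthread_getspecific(g_key));
}
```

A caller on the thread-end path would use the non-allocating form and simply skip its work when the result is `nullptr`.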
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
What does this PR do?:
Fix a SIGSEGV in `Profiler::onThreadEnd` caused by three interlocking signal-race defects during thread teardown:
- `onThreadEnd` called `ProfiledThread::current()`, which allocates via `new`; if SIGPROF/SIGVTALRM fired during `malloc`, re-entrancy caused a crash.
- `ProfiledThread::release()` called `delete` without blocking profiling signals, so a signal handler could construct a `CriticalSection` with a dangling `_thread_ptr`.
- The `CriticalSection` destructor re-fetched TLS (could be NULL after `release()`), so `exitCriticalSection()` was silently skipped, leaving `_in_critical_section` stuck `true`.

Motivation:
Production SIGSEGV reported in PROF-14546. The crash was intermittent and hard to reproduce because it required a profiling signal to fire in a narrow window during thread teardown.
Additional Notes:
- `SignalBlocker` (RAII `pthread_sigmask(SIG_BLOCK)`) is now placed inside `ProfiledThread::release()` and `freeKey()` so all callers, including five previously unprotected sites in `libraryPatcher_linux.cpp` and `perfEvents_linux.cpp`, are automatically protected.
- `CriticalSection` now captures `_thread_ptr` at construction time and reuses it in the destructor, eliminating the TLS re-fetch race.
- `onThreadEnd` switched from `current()` (allocating) to `currentSignalSafe()` (non-allocating).
- Dead buffer-allocation code (`releaseFromBuffer`, `_buffer`, `_free_stack_top`, etc.) was removed as it was never exercised.

How to test the change?:
A fork-based regression test, `ProfiledThreadTeardown.CriticalSectionExitsEvenAfterTLSCleared`, was added to `ddprof_ut.cpp`. It:
- Constructs a `CriticalSection` on the current thread (which captures `_thread_ptr` at construction)
- Calls `clearCurrentThreadTLS()` inside the live `CriticalSection` scope to simulate TLS being torn down while a critical section is active (the race window the fix targets)
- Verifies that `exitCriticalSection()` still ran on scope exit by asserting that a subsequent `tryEnterCriticalSection()` succeeds (exit code 5 without the fix, 0 with it)
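The invariant this test checks can be sketched in isolation. The code below is a simplified model, not the real `guards.cpp`: `ThreadStateSketch`, `CriticalSectionSketch`, and `t_current` are hypothetical, a `thread_local` pointer stands in for the actual TLS slot, and the fallback bitmap path mentioned in the commit message is omitted.

```cpp
#include <atomic>

// Stand-in for the per-thread state that owns the critical-section flag.
struct ThreadStateSketch {
  std::atomic<bool> in_critical_section{false};
  bool tryEnter() {
    bool expected = false;
    return in_critical_section.compare_exchange_strong(expected, true);
  }
  void exit() { in_critical_section.store(false); }
};

// Stand-in for the TLS slot that release() may clear during teardown.
static thread_local ThreadStateSketch* t_current = nullptr;

// Guard that captures the thread pointer once, at construction, and uses
// that same pointer in the destructor instead of re-reading TLS.
class CriticalSectionSketch {
public:
  CriticalSectionSketch()
      : _thread_ptr(t_current),
        _entered(_thread_ptr != nullptr && _thread_ptr->tryEnter()) {}
  ~CriticalSectionSketch() {
    if (_entered) _thread_ptr->exit();  // runs even if TLS is now null
  }
  bool entered() const { return _entered; }

private:
  ThreadStateSketch* _thread_ptr;  // declared first: initialized first
  bool _entered;
};

// Models the test: clear the TLS slot while a guard is live, then check
// that the flag was released when the guard went out of scope.
bool flagReleasedAfterTlsCleared() {
  ThreadStateSketch state;
  t_current = &state;
  {
    CriticalSectionSketch cs;
    t_current = nullptr;  // simulate TLS torn down mid-scope
  }
  return state.tryEnter();  // succeeds only if the destructor called exit()
}
```

A destructor that re-read `t_current` would see `nullptr`, skip `exit()`, and leave the flag set, which is the stuck state the PR describes.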