Skip to content

fix: guard ProfiledThread teardown against signal races#509

Open
jbachorik wants to merge 4 commits intomainfrom
muse/fix-sigsegv-onthreadend
Open

fix: guard ProfiledThread teardown against signal races#509
jbachorik wants to merge 4 commits intomainfrom
muse/fix-sigsegv-onthreadend

Conversation

@jbachorik
Copy link
Copy Markdown
Collaborator

@jbachorik jbachorik commented May 6, 2026

What does this PR do?:
Fix a SIGSEGV in Profiler::onThreadEnd caused by three interlocking signal-race defects during thread teardown:

  1. onThreadEnd called ProfiledThread::current() which allocates via new — if SIGPROF/SIGVTALRM fired during malloc, re-entrancy caused a crash.
  2. ProfiledThread::release() called delete without blocking profiling signals — a signal handler could construct a CriticalSection with a dangling _thread_ptr.
  3. CriticalSection destructor re-fetched TLS (could be NULL after release()) — exitCriticalSection() was silently skipped, leaving _in_critical_section stuck true.

Motivation:
Production SIGSEGV reported in PROF-14546. The crash was intermittent and hard to reproduce because it required a profiling signal to fire in a narrow window during thread teardown.

Additional Notes:

  • SignalBlocker (RAII pthread_sigmask(SIG_BLOCK)) is now placed inside ProfiledThread::release() and freeKey() so all callers — including five previously unprotected sites in libraryPatcher_linux.cpp and perfEvents_linux.cpp — are automatically protected.
  • CriticalSection now captures _thread_ptr at construction time and reuses it in the destructor, eliminating the TLS re-fetch race.
  • onThreadEnd switched from current() (allocating) to currentSignalSafe() (non-allocating).
  • Dead buffer-allocation code (releaseFromBuffer, _buffer, _free_stack_top, etc.) was removed as it was never exercised.

How to test the change?:
A fork-based regression test ProfiledThreadTeardown.CriticalSectionExitsEvenAfterTLSCleared was added to ddprof_ut.cpp. It:

  • Forks a child process for TLS isolation
  • Constructs a CriticalSection on the current thread (which captures _thread_ptr at construction)
  • Calls clearCurrentThreadTLS() inside the live CriticalSection scope to simulate TLS being torn down while a critical section is active (the race window the fix targets)
  • Verifies that exitCriticalSection() still ran on scope exit by asserting that a subsequent tryEnterCriticalSection() succeeds (exit code 5 without the fix, 0 with it)
  • Parent asserts the child exits with status 0

For Datadog employees:

  • This PR doesn't touch any of that.
  • JIRA: PROF-14546

Unsure? Have a question? Request a review!

- Use currentSignalSafe() in onThreadEnd to avoid malloc reentrancy
- Move SignalBlocker into release() and freeKey so all call sites
  (libraryPatcher_linux.cpp, perfEvents_linux.cpp) are automatically
  protected — no per-caller wrapping needed
- Remove dead buffer-allocation code (confirmed no callers remain)
- Fix CriticalSection destructor to use pointer captured at construction
  instead of re-fetching TLS (which may be null after release())
- Add fork-based regression test

Fixes PROF-14546
@jbachorik jbachorik added the AI label May 6, 2026
…n invariant test

The previous test installed SIG_IGN for both profiling signals, so no handler
ever fired and the race was never triggered. The second CriticalSection check
always used the fallback bitmap path and passed regardless of the fix.

Add ProfiledThread::clearCurrentThreadTLS() to simulate the exact race window
(TLS cleared mid-CS scope). The test calls it inside a live CriticalSection;
without the _thread_ptr capture fix the destructor re-fetches nullptr and skips
exitCriticalSection(), leaving _in_critical_section stuck — tryEnterCriticalSection()
then returns false (exit 5). With the fix the test exits 0.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@dd-octo-sts
Copy link
Copy Markdown
Contributor

dd-octo-sts Bot commented May 6, 2026

CI Test Results

Run: #25443330309 | Commit: dc3c566 | Duration: 32m 33s (longest job)

All 32 test jobs passed

Status Overview

JDK glibc-aarch64/debug glibc-amd64/debug musl-aarch64/debug musl-amd64/debug
8 - - -
8-ibm - - -
8-j9 - -
8-librca - -
8-orcl - - -
11 - - -
11-j9 - -
11-librca - -
17 - -
17-graal - -
17-j9 - -
17-librca - -
21 - -
21-graal - -
21-librca - -
25 - -
25-graal - -
25-librca - -

Legend: ✅ passed | ❌ failed | ⚪ skipped | 🚫 cancelled

Summary: Total: 32 | Passed: 32 | Failed: 0


Updated: 2026-05-06 15:47:55 UTC

The ProfiledThread destructor is private; the forked child can simply
_exit() and let the OS reclaim memory instead of calling delete.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@jbachorik
Copy link
Copy Markdown
Collaborator Author

@codex review

@chatgpt-codex-connector
Copy link
Copy Markdown

Codex Review: Didn't find any major issues. What shall we delve into next?

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens ProfiledThread teardown against profiling-signal races that could previously lead to crashes (SIGSEGV) or inconsistent CriticalSection state during thread shutdown.

Changes:

  • Switch Profiler::onThreadEnd to use ProfiledThread::currentSignalSafe() to avoid allocations in teardown paths.
  • Block profiling signals during ProfiledThread teardown (release() and TLS key destructor) to prevent signal-handler reentrancy while freeing thread state.
  • Update CriticalSection to capture the ProfiledThread* once at construction and reuse it in the destructor; add a fork-based regression test for the TLS-clear window.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
ddprof-lib/src/test/cpp/ddprof_ut.cpp Adds a regression test exercising CriticalSection behavior when TLS is cleared mid-scope.
ddprof-lib/src/main/cpp/thread.h Simplifies ProfiledThread state and adds a test-only TLS-clear helper.
ddprof-lib/src/main/cpp/thread.cpp Blocks profiling signals around TLS teardown and removes unused buffer-recycling code paths.
ddprof-lib/src/main/cpp/profiler.cpp Avoids allocating ProfiledThread during onThreadEnd by using a signal-safe accessor.
ddprof-lib/src/main/cpp/otel_context.h Updates documentation around OTEL TLS pointer lifecycle at thread exit.
ddprof-lib/src/main/cpp/guards.h Extends CriticalSection to store a captured ProfiledThread*.
ddprof-lib/src/main/cpp/guards.cpp Uses captured ProfiledThread* in CriticalSection destructor to avoid TLS re-fetch races.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread ddprof-lib/src/main/cpp/otel_context.h Outdated
Comment thread ddprof-lib/src/test/cpp/ddprof_ut.cpp
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
@jbachorik jbachorik marked this pull request as ready for review May 6, 2026 15:00
@jbachorik jbachorik requested a review from a team as a code owner May 6, 2026 15:00
@jbachorik jbachorik changed the title fix: guard ProfiledThread teardown against signal races (PROF-14546) fix: guard ProfiledThread teardown against signal races May 6, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants