gh-149342: _remote_debugging: Fix binary profile corruption when sampling a (temporarily) empty stack#149343
Conversation
| &shared_count, &pop_count, &push_count); | ||
|
|
||
| if (encoding == STACK_REPEAT && !is_new_thread) { | ||
| if (encoding == STACK_REPEAT) { |
There was a problem hiding this comment.
I think it's safe to drop because compare_stacks would return STACK_REPEAT only if the prev stack was empty, too; otherwise it'd give STACK_FULL...
There was a problem hiding this comment.
I think we should as well not emit this at all in the profiler frontend code
There was a problem hiding this comment.
Do you mean something like?
diff --git c/Modules/_remote_debugging/module.c i/Modules/_remote_debugging/module.c
index df3ade371de..7cd28540efc 100644
--- c/Modules/_remote_debugging/module.c
+++ i/Modules/_remote_debugging/module.c
@@ -646,6 +646,15 @@ _remote_debugging_RemoteUnwinder_get_stack_trace_impl(RemoteUnwinderObject *self
goto exit;
}
+ PyObject *thread_frames = PyStructSequence_GET_ITEM(frame_info, 2);
+ if (PyList_GET_SIZE(thread_frames) == 0) {
+ Py_DECREF(frame_info);
+ if (self->tstate_addr || self->only_active_thread) {
+ break;
+ }
+ continue;
+ }
+
if (PyList_Append(interpreter_threads, frame_info) == -1) {
Py_DECREF(frame_info);
Py_DECREF(interpreter_threads);This looks like a band aid, but it's clean?
Perhaps the fix should be deeper... I can take a stab but couldn’t figure out anything as clean, only (a result of me poking around without deeper understanding, and I trust no LLM to do proper reviews in deeper circles of hell):
diff --git c/Modules/_remote_debugging/module.c i/Modules/_remote_debugging/module.c
index df3ade371de..c53b6eafa9c 100644
--- c/Modules/_remote_debugging/module.c
+++ i/Modules/_remote_debugging/module.c
@@ -622,10 +622,8 @@ _remote_debugging_RemoteUnwinder_get_stack_trace_impl(RemoteUnwinderObject *self
gc_frame,
main_thread_tstate);
if (!frame_info) {
- // Check if this was an intentional skip due to mode-based filtering
- if ((self->mode == PROFILING_MODE_CPU || self->mode == PROFILING_MODE_GIL ||
- self->mode == PROFILING_MODE_EXCEPTION) && !PyErr_Occurred()) {
- // Detect cycle: if current_tstate didn't advance, we have corrupted data
+ if (!PyErr_Occurred()) {
+ // Intentional skip: mode filtering or no Python frames.
if (current_tstate == prev_tstate) {
Py_DECREF(interpreter_threads);
PyErr_Format(PyExc_RuntimeError,
@@ -636,7 +634,9 @@ _remote_debugging_RemoteUnwinder_get_stack_trace_impl(RemoteUnwinderObject *self
Py_CLEAR(result);
goto exit;
}
- // Thread was skipped due to mode filtering, continue to next thread
+ if (self->tstate_addr || self->only_active_thread) {
+ break;
+ }
continue;
}
// This was an actual error
diff --git c/Modules/_remote_debugging/threads.c i/Modules/_remote_debugging/threads.c
index e303c667ea0..cb8b35c7caa 100644
--- c/Modules/_remote_debugging/threads.c
+++ i/Modules/_remote_debugging/threads.c
@@ -467,6 +467,12 @@ unwind_stack_for_thread(
*current_tstate = GET_MEMBER(uintptr_t, ts, unwinder->debug_offsets.thread_state.next);
+ if (PyList_GET_SIZE(frame_info) == 0) {
+ Py_DECREF(frame_info);
+ cleanup_stack_chunks(&chunks);
+ return NULL;
+ }
+
thread_id = PyLong_FromLongLong(tid);
if (thread_id == NULL) {
set_exception_cause(unwinder, PyExc_RuntimeError, "Failed to create thread ID");This whole issue makes the case from PyCon even stronger.
There was a problem hiding this comment.
No, I was referring in the Python side so we never call the writer if we have an empty frame.
There was a problem hiding this comment.
It skips the whole sample but it's a hot path so I want to avoid elaborate filtering.
Documentation build overview
19 files changed ·
|
``` Objects/descrobject.c:194:16: runtime error: call to function BinaryWriter_get_total_samples through pointer to incorrect function type 'struct _object *(*)(struct _object *, void *)' /home/maurycy/cpython/./Modules/_remote_debugging/module.c:1660: note: BinaryWriter_get_total_samples defined here SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior Objects/descrobject.c:194:16 ``` https://github.com/python/cpython/actions/runs/25297636845/job/74159079502
19586a0 to
913b5c7
Compare
My understanding is that an empty stack is a bit racy but perfectly valid?
Closes #149342
_remote_debugging:Invalid RLE count XXX exceeds maximum possible XXX for remaining data#149342