fix(libev): guard handle_read/handle_write against close() race condition by vponomaryov · Pull Request #889 · scylladb/python-driver

vponomaryov · 2026-05-19T13:15:06Z

When close() is called from one thread, it sets is_closed=True and closes the socket immediately. However, libev watchers are stopped asynchronously in _loop_will_run(), so handle_read()/handle_write() can still fire on the now-closed fd, causing EBADF errors that surface as ConnectionShutdown('Bad file descriptor') and prevent reconnection.

So, fix it applying following changes:

Early-return guards at the top of handle_read() and handle_write() that check is_closed/is_defunct before touching the socket
Secondary is_closed/is_defunct checks in error handlers to catch the race when close() happens between watcher dispatch and syscall
Peer disconnect detection (EBADF, ECONNRESET, ENOTCONN, etc.) that calls close() cleanly instead of defunct()
last_error preservation in close() when connected_event is unset, preventing factory() from returning dead connections

Fixes: #614

Pre-review checklist

I have split my patch into logically separate commits.
All commit messages clearly explain what they change and why.
I added relevant tests for new features and bug fixes.
All commits compile, pass static checks and pass test.
PR description sums up the changes and reasons why they should be introduced.
I have provided docstrings for the public items that I want to introduce.
I have adjusted the documentation in ./docs/source/.
I added appropriate Fixes: annotations to PR description.

vponomaryov · 2026-05-19T13:17:53Z

Ref: https://scylladb.atlassian.net/browse/SCT-83

vponomaryov · 2026-05-22T09:41:50Z

Verified the test fix here:

scylla-staging/valerii/vp-longevity-100gb-4h-oci-test#25

Test run without this fix (for comparison) is here:

scylla-master/longevity/longevity-100gb-4h-oci-test#34

Comment from the "zeus" comparing the fixed test run and issue affected:

@zeus posted:

Conclusion: The target run did NOT hit the SCT-83 bug. The fix works.

Comparison

Aspect Reference run (SCT-83 bug) 8a5b983a Target run (fix branch) 0d335d55

Status FAILED PASSED

Branch origin/master origin/fix-pydrv-race-issue-614

Errors 6 ERROR events (3 SoftTimeoutEvent + 3 FailedResultEvent) 0 errors

Test longevity_test.LongevityTest.test_custom_time Same

Scylla version 2026.3.0~dev Same

Backend/Region OCI / us-ashburn-1 Same

Nodes 6 6

Healthcheck duration comparison

Reference run (with bug) — progressive degradation pattern:

Healthchecks №1-7: 9-10s (normal)

Healthcheck №8: 46s (degradation begins)

Healthcheck №9: 1m38s

Healthcheck №10: 3m20s (198s, exceeds 180s soft timeout → ERROR)

Healthcheck №11: 3m50s (228s, exceeds timeout → ERROR)

Healthcheck №12: 2m8s

Healthcheck №13: 1m38s

Healthcheck №14: 1m7s

Healthcheck №15: 2m33s

Healthcheck №16: 46s

Healthcheck №17: 22m33s (1351s, massively exceeds timeout → ERROR)

Target run (with fix) — consistently healthy:

All 20 healthchecks: 9-11s (every single one normal, no degradation)

Analysis

The SCT-83 / python-driver#614 bug manifests as a progressive increase in healthcheck duration over the course of the test, caused by a race condition in the python driver. In the reference run, healthcheck times escalated from ~10s to minutes, eventually hitting 22m33s — well beyond the 180s soft timeout.

The target run, running on branch fix-pydrv-race-issue-614, shows zero degradation — all 20 healthchecks completed in 9-11 seconds throughout the entire 4+ hour test. No SoftTimeoutEvents, no FailedResultEvents, no errors at all.

The fix completely eliminates the healthcheck timeout escalation pattern characteristic of SCT-83.

@fruch , @soyacz ^
This has proven to fix the https://scylladb.atlassian.net/browse/SCT-83 bug which is identical to the GH issue - #614

vponomaryov · 2026-05-22T09:50:31Z

Note that the libenv tests have passed successfully - related to the PR.
The failed integration tests are the asyncio-based - not related to this PR.

@scylladb/python-driver-maint please merge and release it.

Lorak-mmk · 2026-05-22T10:04:43Z

+                elif self.is_closed or self.is_defunct:
+                    # Socket was closed by another thread between the
+                    # watcher firing and us calling send().  This is the
+                    # race described in scylladb/python-driver#614.
+                    return


I don't see why this is necessary. Without this, we will go into else branch, call self.defunct(err) which will do nothing, and then return.

Yes, defunct() does have its own if self.is_defunct or self.is_closed: return guard, so no crash or incorrect state transition would occur.

But, the guard is still valuable because of the following reasons:

Without the guard we will get missleading log message if the error is EBADF (which is in _PEER_DISCONNECT_ERRNOS), the code falls into the _PEER_DISCONNECT_ERRNOS branch first (it comes before else), logging "Connection %s closed by peer during write".
But the socket wasn't closed by the peer - we closed it via close().

The _PEER_DISCONNECT_ERRNOS branch calls self.close(), which is a no-op when already closed but still acquires self.lock - a minor unnecessary overhead.

The guard makes the race condition handling self-documenting.
Without it, a reader must trace through multiple branches to confirm the behavior is safe.

So, I find it rather useful than redundant.

Ah, thanks for the explanation, I did not pay much attention to this path. Now I see a different potential issue, mentioned by Avi ( #614 (comment) )

If we are receiving EBADF, then we performed some action (in this case READ / WRITE) on a closed file descriptor. As far as I know, file descriptors can be reused. So, if another connection was opened between this one being closed and watchers exiting, it makes it possible for us to accidentaly read / write to a totally unrelated connection (or even some file opened by a user app). Am I wrong?

If this is true, then this patch hides the immediate symptom of the problem, but also hides a much more dangerous problem that could cause data corruption.

I don't know the libev connection impl that well, but maybe the proper approach is to, in close, first stop the watchers somehow and make sure that libev won't touch the fd anymore, and only then close the fd?

Situation where we accidentally read/write to a different connection is, probably, architecturally valid for C/C++ programs using raw fds, but doesn't apply in case of Python.
The Python socket abstraction prevents it.

Python's socket.socket object tracks state internally - after .close(), any operation on that object raises EBADF regardless of whether the OS reused the fd number for something else.
The self._socket reference always points to the old (closed) Python object, never to the new socket that happened to get the same fd.

libev get a raw fd, not PYthon socket, but the watchers themselves use the Python socket so I guess this issue really can't happen.
The watchers also seem to be correctly removed from the loop in close (connection_destroyed), so they won't keep triggering forever. I think your change should be fine.

Yes, defunct() does have its own if self.is_defunct or self.is_closed: return guard, so no crash or incorrect state transition would occur.

But, the guard is still valuable because of the following reasons:

Without the guard we will get missleading log message if the error is EBADF (which is in _PEER_DISCONNECT_ERRNOS), the code falls into the _PEER_DISCONNECT_ERRNOS branch first (it comes before else), logging "Connection %s closed by peer during write".
But the socket wasn't closed by the peer - we closed it via close().

The _PEER_DISCONNECT_ERRNOS branch calls self.close(), which is a no-op when already closed but still acquires self.lock - a minor unnecessary overhead.

The guard makes the race condition handling self-documenting.
Without it, a reader must trace through multiple branches to confirm the behavior is safe.

So, I find it rather useful than redundant.

This is incorrect:

def defunct(self, exc): with self.lock: if self.is_defunct or self.is_closed: return self.is_defunct = True

defunct is the right approach here. If there’s an issue with it, we should fix defunct itself instead of handling all these failures separately.

There should only be two ways to close a connection:

Connection.close — for expected/normal connection shutdown

Connection.defunct — for shutting down a connection due to a critical error

There is a case when socket operation can unexpectedly get one of these:

errno.EBADF, errno.ENOTCONN, errno.ESHUTDOWN, errno.ECONNRESET, errno.ECONNABORTED,

If socket is closed/invalidated/aborted via Connection.close then Connection.defunct should ignore it and should not result in misleading errors and should be a noop.
if it closed by Connection.defunct it should preserve prior error and be effectively noop.
But if this error caused by socket being closed/invalidated/aborted by other reason we should channel it same way as any other error, which is what Connection.defunct for.

Only thing we need to be sure is that we never close socket before setting is_closed or is_defunct , which I believe already the case.

An improvement for the "defunct" function sounds as non-blocker for this PR - may be done separately.

Agreed, but you still need to route error handling to defunct

Lorak-mmk · 2026-05-22T10:05:17Z

+            elif self.is_closed or self.is_defunct:
+                # Socket was closed by another thread between the watcher
+                # firing and us calling recv().  This is the race described
+                # in scylladb/python-driver#614.
+                return


Without this, if the error is in _PEER_DISCONNECT_ERRNOS (EBADF), the code would execute following:

self.last_error = ConnectionShutdown( "Connection to %s was closed by peer" % self.endpoint) self.close()

It will overwrite the last_error with an incorrect closed by peer message when in reality close() was already called by us.
Since last_error is used by factory() to detect why a connection died, this overwrite could mask the real error.

Lorak-mmk · 2026-05-22T10:07:36Z

+class LibevPeerDisconnectErrnos(unittest.TestCase):
+    """Verify _PEER_DISCONNECT_ERRNOS contains expected values."""
+
+    def setUp(self):
+        _skip_if_unavailable(self)
+
+    def test_contains_ebadf(self):
+        self.assertIn(errno.EBADF, _PEER_DISCONNECT_ERRNOS)
+
+    def test_contains_econnreset(self):
+        self.assertIn(errno.ECONNRESET, _PEER_DISCONNECT_ERRNOS)
+
+    def test_contains_econnaborted(self):
+        self.assertIn(errno.ECONNABORTED, _PEER_DISCONNECT_ERRNOS)
+
+    def test_contains_enotconn(self):
+        self.assertIn(errno.ENOTCONN, _PEER_DISCONNECT_ERRNOS)
+
+    def test_contains_eshutdown(self):
+        self.assertIn(errno.ESHUTDOWN, _PEER_DISCONNECT_ERRNOS)


I see no point in having such tests. It just cements the current code, not testing the behavior.

I don't mind removing it

vponomaryov · 2026-05-22T11:47:31Z

The effect of the python driver issue was slow health checks in SCT.
Example from a test run with the issue:

And here how much time it takes with the fix:

So, numbers tell for itself.

Lorak-mmk · 2026-05-22T11:48:18Z

@sylwiaszunejko could you take a look at asyncio failures? Why did they start to happen, do we have an issue open?

sylwiaszunejko · 2026-05-22T12:17:45Z

@sylwiaszunejko could you take a look at asyncio failures? Why did they start to happen, do we have an issue open?

I thought it was due to switch to scylla version 2026.1 and #330, but I see it is supposed to be fixed by #884

vponomaryov · 2026-05-22T12:24:08Z

@Lorak-mmk from the PR completeness point of view, should I remove that unit test class?

@sylwiaszunejko could you take a look at asyncio failures? Why did they start to happen, do we have an issue open?

I thought it was due to switch to scylla version 2026.1 and #330, but I see it is supposed to be fixed by #884

So, I need to rebase the PR?
Even if we know that this PR doesn't affect the failed tests?

Lorak-mmk · 2026-05-22T12:45:22Z

@Lorak-mmk from the PR completeness point of view, should I remove that unit test class?

Tbh I don't care that much.

So, I need to rebase the PR?

If you could then it would be great. I know you didn't touch asyncio, but let's just be sure. Change in one backend affecting another backend would not be the weirdest thing we saw in this driver.

…tion When close() is called from one thread, it sets is_closed=True and closes the socket immediately. However, libev watchers are stopped asynchronously in _loop_will_run(), so handle_read()/handle_write() can still fire on the now-closed fd, causing EBADF errors that surface as ConnectionShutdown('Bad file descriptor') and prevent reconnection. So, fix it applying following changes: - Early-return guards at the top of handle_read() and handle_write() that check is_closed/is_defunct before touching the socket - Secondary is_closed/is_defunct checks in error handlers to catch the race when close() happens between watcher dispatch and syscall - Peer disconnect detection (EBADF, ECONNRESET, ENOTCONN, etc.) that calls close() cleanly instead of defunct() - last_error preservation in close() when connected_event is unset, preventing factory() from returning dead connections Fixes: scylladb#614

vponomaryov · 2026-05-22T13:37:05Z

Rebased.

Lorak-mmk · 2026-05-22T14:27:42Z

@dkropachev Please take a look at this.

dkropachev · 2026-05-22T21:52:19Z

+                elif self.is_closed or self.is_defunct:
+                    # Socket was closed by another thread between the
+                    # watcher firing and us calling send().  This is the
+                    # race described in scylladb/python-driver#614.
+                    return


Agreed, but you still need to route error handling to defunct

dkropachev · 2026-05-22T21:55:02Z

@sylwiaszunejko could you take a look at asyncio failures? Why did they start to happen, do we have an issue open?

One of the tests was enabled that uses TLS with asyncio, the ultimate cure - #884

vponomaryov requested review from a team and dkropachev May 22, 2026 09:42

This was referenced May 22, 2026

Driver reported "[Errno 9] Bad file descriptor" #614

Open

fix(py-drv): use fixed python driver with proper reconnections scylladb/scylla-cluster-tests#14708

Draft

Lorak-mmk requested changes May 22, 2026

View reviewed changes

vponomaryov requested a review from Lorak-mmk May 22, 2026 11:00

vponomaryov force-pushed the fix-race-issue-614 branch from 5dbc94b to ad60335 Compare May 22, 2026 13:36

Lorak-mmk approved these changes May 22, 2026

View reviewed changes

dkropachev requested changes May 22, 2026

View reviewed changes

Conversation

vponomaryov commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Pre-review checklist

Uh oh!

vponomaryov commented May 19, 2026

Uh oh!

vponomaryov commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Conclusion: The target run did NOT hit the SCT-83 bug. The fix works.

Comparison

Healthcheck duration comparison

Analysis

Uh oh!

vponomaryov commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dkropachev May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

vponomaryov commented May 22, 2026

Uh oh!

Lorak-mmk commented May 22, 2026

Uh oh!

sylwiaszunejko commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

vponomaryov commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Lorak-mmk commented May 22, 2026

Uh oh!

vponomaryov commented May 22, 2026

Uh oh!

Lorak-mmk commented May 22, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dkropachev commented May 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

vponomaryov commented May 19, 2026 •

edited

Loading

vponomaryov commented May 22, 2026 •

edited

Loading

vponomaryov commented May 22, 2026 •

edited

Loading

dkropachev May 22, 2026 •

edited

Loading

sylwiaszunejko commented May 22, 2026 •

edited

Loading

vponomaryov commented May 22, 2026 •

edited

Loading