Skip to content

fix(libev): guard handle_read/handle_write against close() race condition#889

Open
vponomaryov wants to merge 1 commit into
scylladb:masterfrom
vponomaryov:fix-race-issue-614
Open

fix(libev): guard handle_read/handle_write against close() race condition#889
vponomaryov wants to merge 1 commit into
scylladb:masterfrom
vponomaryov:fix-race-issue-614

Conversation

@vponomaryov
Copy link
Copy Markdown

@vponomaryov vponomaryov commented May 19, 2026

When close() is called from one thread, it sets is_closed=True and closes the socket immediately. However, libev watchers are stopped asynchronously in _loop_will_run(), so handle_read()/handle_write() can still fire on the now-closed fd, causing EBADF errors that surface as ConnectionShutdown('Bad file descriptor') and prevent reconnection.

So, fix it applying following changes:

  • Early-return guards at the top of handle_read() and handle_write() that check is_closed/is_defunct before touching the socket
  • Secondary is_closed/is_defunct checks in error handlers to catch the race when close() happens between watcher dispatch and syscall
  • Peer disconnect detection (EBADF, ECONNRESET, ENOTCONN, etc.) that calls close() cleanly instead of defunct()
  • last_error preservation in close() when connected_event is unset, preventing factory() from returning dead connections

Fixes: #614

Pre-review checklist

  • I have split my patch into logically separate commits.
  • All commit messages clearly explain what they change and why.
  • I added relevant tests for new features and bug fixes.
  • All commits compile, pass static checks and pass test.
  • PR description sums up the changes and reasons why they should be introduced.
  • I have provided docstrings for the public items that I want to introduce.
  • I have adjusted the documentation in ./docs/source/.
  • I added appropriate Fixes: annotations to PR description.

@vponomaryov
Copy link
Copy Markdown
Author

@vponomaryov
Copy link
Copy Markdown
Author

vponomaryov commented May 22, 2026

Verified the test fix here:

Test run without this fix (for comparison) is here:

Comment from the "zeus" comparing the fixed test run and issue affected:

@zeus posted:

Conclusion: The target run did NOT hit the SCT-83 bug. The fix works.

Comparison

Aspect Reference run (SCT-83 bug) 8a5b983a Target run (fix branch) 0d335d55
Status FAILED PASSED
Branch origin/master origin/fix-pydrv-race-issue-614
Errors 6 ERROR events (3 SoftTimeoutEvent + 3 FailedResultEvent) 0 errors
Test longevity_test.LongevityTest.test_custom_time Same
Scylla version 2026.3.0~dev Same
Backend/Region OCI / us-ashburn-1 Same
Nodes 6 6

Healthcheck duration comparison

Reference run (with bug) — progressive degradation pattern:

  • Healthchecks №1-7: 9-10s (normal)
  • Healthcheck №8: 46s (degradation begins)
  • Healthcheck №9: 1m38s
  • Healthcheck №10: 3m20s (198s, exceeds 180s soft timeout → ERROR)
  • Healthcheck №11: 3m50s (228s, exceeds timeout → ERROR)
  • Healthcheck №12: 2m8s
  • Healthcheck №13: 1m38s
  • Healthcheck №14: 1m7s
  • Healthcheck №15: 2m33s
  • Healthcheck №16: 46s
  • Healthcheck №17: 22m33s (1351s, massively exceeds timeout → ERROR)

Target run (with fix) — consistently healthy:

  • All 20 healthchecks: 9-11s (every single one normal, no degradation)

Analysis

The SCT-83 / python-driver#614 bug manifests as a progressive increase in healthcheck duration over the course of the test, caused by a race condition in the python driver. In the reference run, healthcheck times escalated from ~10s to minutes, eventually hitting 22m33s — well beyond the 180s soft timeout.

The target run, running on branch fix-pydrv-race-issue-614, shows zero degradation — all 20 healthchecks completed in 9-11 seconds throughout the entire 4+ hour test. No SoftTimeoutEvents, no FailedResultEvents, no errors at all.

The fix completely eliminates the healthcheck timeout escalation pattern characteristic of SCT-83.

@fruch , @soyacz ^
This has proven to fix the https://scylladb.atlassian.net/browse/SCT-83 bug which is identical to the GH issue - #614

@vponomaryov
Copy link
Copy Markdown
Author

vponomaryov commented May 22, 2026

Note that the libenv tests have passed successfully - related to the PR.
The failed integration tests are the asyncio-based - not related to this PR.

@scylladb/python-driver-maint please merge and release it.

Comment on lines +367 to +371
elif self.is_closed or self.is_defunct:
# Socket was closed by another thread between the
# watcher firing and us calling send(). This is the
# race described in scylladb/python-driver#614.
return
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see why this is necessary. Without this, we will go into else branch, call self.defunct(err) which will do nothing, and then return.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, defunct() does have its own if self.is_defunct or self.is_closed: return guard, so no crash or incorrect state transition would occur.

But, the guard is still valuable because of the following reasons:

  • Without the guard we will get missleading log message if the error is EBADF (which is in _PEER_DISCONNECT_ERRNOS), the code falls into the _PEER_DISCONNECT_ERRNOS branch first (it comes before else), logging "Connection %s closed by peer during write".
    But the socket wasn't closed by the peer - we closed it via close().

  • The _PEER_DISCONNECT_ERRNOS branch calls self.close(), which is a no-op when already closed but still acquires self.lock - a minor unnecessary overhead.

  • The guard makes the race condition handling self-documenting.
    Without it, a reader must trace through multiple branches to confirm the behavior is safe.

So, I find it rather useful than redundant.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, thanks for the explanation, I did not pay much attention to this path. Now I see a different potential issue, mentioned by Avi ( #614 (comment) )

If we are receiving EBADF, then we performed some action (in this case READ / WRITE) on a closed file descriptor. As far as I know, file descriptors can be reused. So, if another connection was opened between this one being closed and watchers exiting, it makes it possible for us to accidentaly read / write to a totally unrelated connection (or even some file opened by a user app). Am I wrong?

If this is true, then this patch hides the immediate symptom of the problem, but also hides a much more dangerous problem that could cause data corruption.

I don't know the libev connection impl that well, but maybe the proper approach is to, in close, first stop the watchers somehow and make sure that libev won't touch the fd anymore, and only then close the fd?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Situation where we accidentally read/write to a different connection is, probably, architecturally valid for C/C++ programs using raw fds, but doesn't apply in case of Python.
The Python socket abstraction prevents it.

Python's socket.socket object tracks state internally - after .close(), any operation on that object raises EBADF regardless of whether the OS reused the fd number for something else.
The self._socket reference always points to the old (closed) Python object, never to the new socket that happened to get the same fd.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

libev get a raw fd, not PYthon socket, but the watchers themselves use the Python socket so I guess this issue really can't happen.
The watchers also seem to be correctly removed from the loop in close (connection_destroyed), so they won't keep triggering forever. I think your change should be fine.

Copy link
Copy Markdown
Collaborator

@dkropachev dkropachev May 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, defunct() does have its own if self.is_defunct or self.is_closed: return guard, so no crash or incorrect state transition would occur.

But, the guard is still valuable because of the following reasons:

  • Without the guard we will get missleading log message if the error is EBADF (which is in _PEER_DISCONNECT_ERRNOS), the code falls into the _PEER_DISCONNECT_ERRNOS branch first (it comes before else), logging "Connection %s closed by peer during write".
    But the socket wasn't closed by the peer - we closed it via close().
  • The _PEER_DISCONNECT_ERRNOS branch calls self.close(), which is a no-op when already closed but still acquires self.lock - a minor unnecessary overhead.
  • The guard makes the race condition handling self-documenting.
    Without it, a reader must trace through multiple branches to confirm the behavior is safe.

So, I find it rather useful than redundant.

This is incorrect:

    def defunct(self, exc):
        with self.lock:
            if self.is_defunct or self.is_closed:
                return
            self.is_defunct = True

defunct is the right approach here. If there’s an issue with it, we should fix defunct itself instead of handling all these failures separately.

There should only be two ways to close a connection:

  • Connection.close — for expected/normal connection shutdown
  • Connection.defunct — for shutting down a connection due to a critical error

There is a case when socket operation can unexpectedly get one of these:

    errno.EBADF, errno.ENOTCONN, errno.ESHUTDOWN,
    errno.ECONNRESET, errno.ECONNABORTED,

If socket is closed/invalidated/aborted via Connection.close then Connection.defunct should ignore it and should not result in misleading errors and should be a noop.
if it closed by Connection.defunct it should preserve prior error and be effectively noop.
But if this error caused by socket being closed/invalidated/aborted by other reason we should channel it same way as any other error, which is what Connection.defunct for.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only thing we need to be sure is that we never close socket before setting is_closed or is_defunct , which I believe already the case.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An improvement for the "defunct" function sounds as non-blocker for this PR - may be done separately.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, but you still need to route error handling to defunct

Comment on lines +421 to +425
elif self.is_closed or self.is_defunct:
# Socket was closed by another thread between the watcher
# firing and us calling recv(). This is the race described
# in scylladb/python-driver#614.
return
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without this, if the error is in _PEER_DISCONNECT_ERRNOS (EBADF), the code would execute following:

self.last_error = ConnectionShutdown(
    "Connection to %s was closed by peer" % self.endpoint)
self.close()

It will overwrite the last_error with an incorrect closed by peer message when in reality close() was already called by us.
Since last_error is used by factory() to detect why a connection died, this overwrite could mask the real error.

Comment on lines +234 to +253
class LibevPeerDisconnectErrnos(unittest.TestCase):
"""Verify _PEER_DISCONNECT_ERRNOS contains expected values."""

def setUp(self):
_skip_if_unavailable(self)

def test_contains_ebadf(self):
self.assertIn(errno.EBADF, _PEER_DISCONNECT_ERRNOS)

def test_contains_econnreset(self):
self.assertIn(errno.ECONNRESET, _PEER_DISCONNECT_ERRNOS)

def test_contains_econnaborted(self):
self.assertIn(errno.ECONNABORTED, _PEER_DISCONNECT_ERRNOS)

def test_contains_enotconn(self):
self.assertIn(errno.ENOTCONN, _PEER_DISCONNECT_ERRNOS)

def test_contains_eshutdown(self):
self.assertIn(errno.ESHUTDOWN, _PEER_DISCONNECT_ERRNOS)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see no point in having such tests. It just cements the current code, not testing the behavior.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't mind removing it

@vponomaryov vponomaryov requested a review from Lorak-mmk May 22, 2026 11:00
@vponomaryov
Copy link
Copy Markdown
Author

The effect of the python driver issue was slow health checks in SCT.
Example from a test run with the issue:
sct-run-without-fix

And here how much time it takes with the fix:
sct-run-with-fix

So, numbers tell for itself.

@Lorak-mmk
Copy link
Copy Markdown

@sylwiaszunejko could you take a look at asyncio failures? Why did they start to happen, do we have an issue open?

@sylwiaszunejko
Copy link
Copy Markdown
Collaborator

sylwiaszunejko commented May 22, 2026

@sylwiaszunejko could you take a look at asyncio failures? Why did they start to happen, do we have an issue open?

I thought it was due to switch to scylla version 2026.1 and #330, but I see it is supposed to be fixed by #884

@vponomaryov
Copy link
Copy Markdown
Author

vponomaryov commented May 22, 2026

@Lorak-mmk from the PR completeness point of view, should I remove that unit test class?

@sylwiaszunejko could you take a look at asyncio failures? Why did they start to happen, do we have an issue open?

I thought it was due to switch to scylla version 2026.1 and #330, but I see it is supposed to be fixed by #884

So, I need to rebase the PR?
Even if we know that this PR doesn't affect the failed tests?

@Lorak-mmk
Copy link
Copy Markdown

@Lorak-mmk from the PR completeness point of view, should I remove that unit test class?

Tbh I don't care that much.

So, I need to rebase the PR?

If you could then it would be great. I know you didn't touch asyncio, but let's just be sure. Change in one backend affecting another backend would not be the weirdest thing we saw in this driver.

…tion

When close() is called from one thread, it sets is_closed=True and
closes the socket immediately. However, libev watchers are stopped
asynchronously in _loop_will_run(), so handle_read()/handle_write()
can still fire on the now-closed fd, causing EBADF errors that
surface as ConnectionShutdown('Bad file descriptor') and prevent
reconnection.

So, fix it applying following changes:
- Early-return guards at the top of handle_read() and handle_write()
  that check is_closed/is_defunct before touching the socket
- Secondary is_closed/is_defunct checks in error handlers to catch
  the race when close() happens between watcher dispatch and syscall
- Peer disconnect detection (EBADF, ECONNRESET, ENOTCONN, etc.)
  that calls close() cleanly instead of defunct()
- last_error preservation in close() when connected_event is unset,
  preventing factory() from returning dead connections

Fixes: scylladb#614
@vponomaryov vponomaryov force-pushed the fix-race-issue-614 branch from 5dbc94b to ad60335 Compare May 22, 2026 13:36
@vponomaryov
Copy link
Copy Markdown
Author

Rebased.

@Lorak-mmk
Copy link
Copy Markdown

@dkropachev Please take a look at this.

Comment on lines +367 to +371
elif self.is_closed or self.is_defunct:
# Socket was closed by another thread between the
# watcher firing and us calling send(). This is the
# race described in scylladb/python-driver#614.
return
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, but you still need to route error handling to defunct

@dkropachev
Copy link
Copy Markdown
Collaborator

@sylwiaszunejko could you take a look at asyncio failures? Why did they start to happen, do we have an issue open?

One of the tests was enabled that uses TLS with asyncio, the ultimate cure - #884

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Driver reported "[Errno 9] Bad file descriptor"

4 participants