Fix callback not being called after activesock shutdown (#4864)#4878
Fix callback not being called after activesock shutdown (#4864)#4878azertyfun wants to merge 2 commits intopjsip:masterfrom
Conversation
c60348f to
e608796
Compare
|
This fixes the issue as described & reproduced in #4864. I am moderately confident that this is a necessary and sufficient fix, however the whole architecture around TLS transport is quite difficult to get my head around. There are many points where a |
There was a problem hiding this comment.
Pull request overview
Fixes a shutdown edge-case in PJLIB’s active socket write-completion path so higher layers still receive the send-completion callback, preventing reference leaks (e.g., callers that decrement pjsip_tx_data refs in on_data_sent).
Changes:
- Invoke
on_data_sentwhenSHUT_TXis set inioqueue_on_write_complete()(instead of returning silently).
226b170 to
af16a17
Compare
|
Actually with further testing I encountered another case where the callback still isn't called. This requires more thorough troubleshooting. One major pain point is that the issue is only reproducible in high-load environments or with the ioqueue "fast track" commented out; else Marking this as draft until I figure out a more thorough solution. |
The leak happens when 1. send_buf_pending is already in use 2. ssock_on_data_sent calls flush_circ_buf_output 3. send_buf_pending gets overwritten -> previous data in send_buf_pending is lost forever, its callback is never called, and therefore the reference to the tdata it contains is never decremented, preventing transport destruction
|
Fixed a second leak. At this point I've lost all confidence that the original implementation has any understanding of the importance of not losing track of ioqueue keys. I'll make a thorough review next week. |
The fast-track remains enabled by default, but this is useful for troubleshooting. The ioqueue has two widly different behaviors: 1. In the fast-track (almost always used in low-load scenarios), sending a write_op always immediately returns with PJ_SUCCESS. The semantics in this case are that the _caller_ is responsible for calling any relevant callbacks. 2. Outside the fast track (relevant in high-load scenarios, which unfortunately do not seem to be extensively covered by regression tests), sending a write_op returns PJ_EPENDING in the happy path which changes the semantics; now the _ioqueue_ becomes responsible for the data owned by write_op and for calling the relevant callbacks. See pjsip#4864, pjsip#4878 for why this is a tricky behavior that can easily be missed and cause bugs. Disabling the fast track allows for more easily testing the second case.
The leak happens when 1. send_buf_pending is already in use 2. ssock_on_data_sent calls flush_circ_buf_output 3. send_buf_pending gets overwritten -> previous data in send_buf_pending is lost forever, its callback is never called, and therefore the reference to the tdata it contains is never decremented, preventing transport destruction
4a54cd5 to
b3440a7
Compare
Before bugfixes in pjsip#4878, the test fails: 16:58:29.999 Test "concurrent_sends[i].sent" != 0 fails in activesock.c:469 (sent callback not called)
Before bugfixes in pjsip#4878, the test fails: 16:58:29.999 Test "concurrent_sends[i].sent" != 0 fails in activesock.c:469 (sent callback not called)
b3440a7 to
47cead4
Compare
The fast-track remains enabled by default, but this is useful for troubleshooting. The ioqueue has two widly different behaviors: 1. In the fast-track (almost always used in low-load scenarios), sending a write_op always immediately returns with PJ_SUCCESS. The semantics in this case are that the _caller_ is responsible for calling any relevant callbacks. 2. Outside the fast track (relevant in high-load scenarios, which unfortunately do not seem to be extensively covered by regression tests), sending a write_op returns PJ_EPENDING in the happy path which changes the semantics; now the _ioqueue_ becomes responsible for the data owned by write_op and for calling the relevant callbacks. See pjsip#4864, pjsip#4878 for why this is a tricky behavior that can easily be missed and cause bugs. Disabling the fast track allows for more easily testing the second case.
The leak happens when 1. send_buf_pending is already in use 2. ssock_on_data_sent calls flush_circ_buf_output 3. send_buf_pending gets overwritten -> previous data in send_buf_pending is lost forever, its callback is never called, and therefore the reference to the tdata it contains is never decremented, preventing transport destruction
Before bugfixes in pjsip#4878, the test fails: 16:58:29.999 Test "concurrent_sends[i].sent" != 0 fails in activesock.c:469 (sent callback not called)
47cead4 to
31f22f4
Compare
31f22f4 to
bc22897
Compare
Before bugfixes in pjsip#4878, the test fails: 16:58:29.999 Test "concurrent_sends[i].sent" != 0 fails in activesock.c:469 (sent callback not called)
10a568e to
f1d7044
Compare
Before bugfixes in pjsip#4878, the test fails: 16:58:29.999 Test "concurrent_sends[i].sent" != 0 fails in activesock.c:469 (sent callback not called)
f1d7044 to
63b1f04
Compare
Before bugfixes in pjsip#4878, the test fails: 16:58:29.999 Test "concurrent_sends[i].sent" != 0 fails in activesock.c:469 (sent callback not called)
63b1f04 to
8eb1190
Compare
Was comparing bytes (pj_ssize_t data length) against PJ_EPENDING instead of status. Pre-existing bug unrelated to pjsip#4878. Co-Authored-By: Claude Code
8eb1190 to
60e1f69
Compare
…ip#4878) When pj_ioqueue_unregister() is called with pending async writes, on_write_complete callbacks were silently skipped. Upper layers (e.g., SIP transport) rely on these callbacks to release resources such as tdata references, causing reference leaks under high load. The fix drains pending write callbacks during key unregistration: - write_cb_list (completed ops, deferred callback): invoked with op->written (success status), respecting write_callback_thread serialization. - write_list (pending ops, never sent): invoked with -PJ_ECANCELLED. Changes across all maintained ioqueue backends (epoll, kqueue, select, IOCP). Also includes: - PJ_IOQUEUE_FAST_TRACK config to disable fast-track for testing - activesock: forward on_data_sent callback when SHUT_TX is set - ssl_sock: prevent send_buf_pending overwrite (tdata leak) - Regression test for send callbacks on activesock close Co-Authored-By: Nathan Monfils <nathanmonfils@gmail.com> Co-Authored-By: Claude Code
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.
Comments suppressed due to low confidence (1)
pjlib/src/pj/ssl_sock_imp_common.c:604
- This new branch returns PJ_ENOMEM when alloc_send_data() fails while send_buf_pending already contains data. That status can now propagate to ssl_send()/flush_delayed_send() and be treated as a hard error (e.g., renegotiation path treats anything other than SUCCESS/EPENDING as failure). Consider returning a backpressure status that callers already treat as non-fatal (e.g., PJ_EBUSY/PJ_EPENDING), or updating the relevant callers to handle this specific condition without tearing down the connection.
wdata = alloc_send_data(ssock, needed_len);
if (wdata == NULL && ssock->send_buf_pending.data_len) {
/* Buffer full and there's already pending data. Cannot overwrite
* send_buf_pending as that would lose the previous data and its
* callback reference (causing a tdata leak). See #4878.
*/
pj_lock_release(ssock->write_mutex);
return PJ_ENOMEM;
} else if (wdata == NULL) {
/* Oops, the send buffer is full, let's just
* queue it for sending and return PJ_EPENDING.
*/
ssock->send_buf_pending.data_len = needed_len;
ssock->send_buf_pending.app_key = send_key;
ssock->send_buf_pending.flags = flags;
ssock->send_buf_pending.plain_data_len = orig_len;
pj_lock_release(ssock->write_mutex);
return PJ_EPENDING;
}
…ip#4878) When pj_ioqueue_unregister() is called with pending async writes, on_write_complete callbacks were silently skipped. Upper layers (e.g., SIP transport) rely on these callbacks to release resources such as tdata references, causing reference leaks under high load. The fix drains pending write callbacks during key unregistration: - write_cb_list (completed ops, deferred callback): invoked with op->written (success status), respecting write_callback_thread serialization. - write_list (pending ops, never sent): invoked with -PJ_ECANCELLED. Changes across all maintained ioqueue backends (epoll, kqueue, select, IOCP). Also includes: - PJ_IOQUEUE_FAST_TRACK config to disable fast-track for testing - activesock: forward on_data_sent callback when SHUT_TX is set - ssl_sock: prevent send_buf_pending overwrite (tdata leak) - Regression test for send callbacks on activesock close Co-Authored-By: Nathan Monfils <nathanmonfils@gmail.com> Co-Authored-By: Claude Code
60e1f69 to
45d60d4
Compare
…ip#4878) When pj_ioqueue_unregister() is called with pending async writes, on_write_complete callbacks were silently skipped. Upper layers (e.g., SIP transport) rely on these callbacks to release resources such as tdata references, causing reference leaks under high load. The fix drains pending write callbacks during key unregistration: - write_cb_list (completed ops, deferred callback): invoked with op->written (success status), respecting write_callback_thread serialization. - write_list (pending ops, never sent): invoked with -PJ_ECANCELLED. Changes across all maintained ioqueue backends (epoll, kqueue, select, IOCP). Also includes: - PJ_IOQUEUE_FAST_TRACK config to disable fast-track for testing - activesock: forward on_data_sent callback when SHUT_TX is set - ssl_sock: prevent send_buf_pending overwrite (tdata leak) - Regression test for send callbacks on activesock close Co-Authored-By: Nathan Monfils <nathanmonfils@gmail.com> Co-Authored-By: Claude Code
45d60d4 to
d395b2b
Compare
…ip#4878) When pj_ioqueue_unregister() is called with pending async writes, on_write_complete callbacks were silently skipped. Upper layers (e.g., SIP transport) rely on these callbacks to release resources such as tdata references, causing reference leaks under high load. The fix drains pending write callbacks during key unregistration: - write_cb_list (completed ops, deferred callback): invoked with op->written (success status), respecting write_callback_thread serialization. - write_list (pending ops, never sent): invoked with -PJ_ECANCELLED. Changes across all maintained ioqueue backends (epoll, kqueue, select, IOCP). Also includes: - PJ_IOQUEUE_FAST_TRACK config to disable fast-track for testing - activesock: forward on_data_sent callback when SHUT_TX is set - ssl_sock: prevent send_buf_pending overwrite (tdata leak) - Regression test for send callbacks on activesock close Co-Authored-By: Nathan Monfils <nathanmonfils@gmail.com> Co-Authored-By: Claude Code
d395b2b to
bc60815
Compare
|
@azertyfun We've tried to address the issues mentioned earlier. Please let us know if you notice any further issues or have additional comments. |
|
We deployed the earlier fix (which sends garbage over the wire – but that doesn't really matter in our use-case). However we had to roll it back because we unexpectedly encountered crashes. I am currently investigating the cause. I still think the fix is sound in principle, but it may be triggering other existing race conditions. In particular this bit seems faulty: if the ioqueue is busy, |
|
You're right. Asked the AI to check that area, and it found several issues!!! 1. Wrong app_key passed to ssl_send 2. PJ_EPENDING not handled — double-send 3. Missing callback on PJ_SUCCESS — tdata leak 4. send_buf_pending not cancelled on error in ssock_on_data_sent Here is the draft patch (compile tested only, will test further next week). |
|
After further investigation, the crash I described above was one of the outcomes, but the more common one was that we saw Now it is clearly related to the Thanks to the error message we know t hat From there I can only speculate why there was a deadlock, but The whole logic for SSL handshaking is spread out, has multiple redundancies, and probably just as buggy as the rest of the code touched by this PR. Another thorough review is needed, but that's going to be quite a few additional man-hours which I'm starting to run short on as other priorities are catching up after several weeks working on this already. And anyway, how can we know when we've squashed the last critical bug? How can we verify our fix when the interfaces are so complex and behave in radically different ways under load vs not, including in edge-cases like SSL renegotiation under load ? Unfortunately given that I caused a major incident by pushing a partial fix, we won't get approval for another attempt unless we have a complete explanation for all issues encountered and a way to reliably reproduce our race conditions and verify the correctness of any fixes. Perhaps the solution for us will be to wait until |

This is critical to avoid a reference leak. The callback is responsible for calling pjsip_tx_data_dec_ref.
Background: When
pj_ioqueue_unregister()is called with pending async writes,on_write_completecallbacks are silently skipped. This causes tdata reference leaks since upper layers rely on these callbacks to callpjsip_tx_data_dec_ref(). The bug only manifests when the ioqueue fast-track fails (high load, full send buffer), making it nearly invisible in normal testing. Exists with or withoutPJ_IOQUEUE_CALLBACK_NO_LOCK.Approach: Drain pending write callbacks during key unregistration, handling two lists with different semantics:
write_cb_list(completed ops, deferred callback) invoked with success status respectingwrite_callback_threadserialization, thenwrite_list(pending unsent ops) invoked with-PJ_ECANCELLED.Maintainer rework addressing review feedback:
grp_lockref counting.-PJ_ECANCELLED;write_callback_threadproperly cleared in closing path.on_data_sentwhenSHUT_TXset (noton_data_read).send_buf_pendingoverwrite (tdata leak).PJ_IOQUEUE_FAST_TRACKconfig to disable fast-track for testing async path.SO_SNDBUFshrink to force async path.