Skip to content

Fix callback not being called after activesock shutdown (#4864)#4878

Open
azertyfun wants to merge 2 commits intopjsip:masterfrom
azertyfun:activesock-txdata-leak
Open

Fix callback not being called after activesock shutdown (#4864)#4878
azertyfun wants to merge 2 commits intopjsip:masterfrom
azertyfun:activesock-txdata-leak

Conversation

@azertyfun
Copy link
Copy Markdown

@azertyfun azertyfun commented Mar 19, 2026

This is critical to avoid a reference leak. The callback is responsible for calling pjsip_tx_data_dec_ref.


Background: When pj_ioqueue_unregister() is called with pending async writes, on_write_complete callbacks are silently skipped. This causes tdata reference leaks since upper layers rely on these callbacks to call pjsip_tx_data_dec_ref(). The bug only manifests when the ioqueue fast-track fails (high load, full send buffer), making it nearly invisible in normal testing. Exists with or without PJ_IOQUEUE_CALLBACK_NO_LOCK.

Approach: Drain pending write callbacks during key unregistration, handling two lists with different semantics: write_cb_list (completed ops, deferred callback) invoked with success status respecting write_callback_thread serialization, then write_list (pending unsent ops) invoked with -PJ_ECANCELLED.

Maintainer rework addressing review feedback:

  • No deadlock: callbacks invoked without holding key/ioqueue locks. Key lifetime protected with grp_lock ref counting.
  • All maintained backends: epoll, kqueue, select, IOCP. Skipped deprecated symbian/uwp.
  • IOCP: cancelled overlapped ops invoke write callback with -PJ_ECANCELLED; write_callback_thread properly cleared in closing path.
  • activesock: forward on_data_sent when SHUT_TX set (not on_data_read).
  • SSL: prevent send_buf_pending overwrite (tdata leak).
  • PJ_IOQUEUE_FAST_TRACK config to disable fast-track for testing async path.
  • Regression test: concurrent sends + close before poll, with SO_SNDBUF shrink to force async path.

@CLAassistant
Copy link
Copy Markdown

CLAassistant commented Mar 19, 2026

CLA assistant check
All committers have signed the CLA.

@azertyfun azertyfun force-pushed the activesock-txdata-leak branch from c60348f to e608796 Compare March 19, 2026 17:27
@azertyfun
Copy link
Copy Markdown
Author

This fixes the issue as described & reproduced in #4864.

I am moderately confident that this is a necessary and sufficient fix, however the whole architecture around TLS transport is quite difficult to get my head around. There are many points where a tdata can be delayed or cancelled. Hopefully someone a lot more familiar than me with this codebase can weigh in on this.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a shutdown edge-case in PJLIB’s active socket write-completion path so higher layers still receive the send-completion callback, preventing reference leaks (e.g., callers that decrement pjsip_tx_data refs in on_data_sent).

Changes:

  • Invoke on_data_sent when SHUT_TX is set in ioqueue_on_write_complete() (instead of returning silently).

azertyfun pushed a commit to azertyfun/pjproject that referenced this pull request Mar 20, 2026
azertyfun added a commit to azertyfun/pjproject that referenced this pull request Mar 20, 2026
@azertyfun azertyfun force-pushed the activesock-txdata-leak branch from 226b170 to af16a17 Compare March 20, 2026 12:14
@azertyfun
Copy link
Copy Markdown
Author

Actually with further testing I encountered another case where the callback still isn't called. This requires more thorough troubleshooting.

One major pain point is that the issue is only reproducible in high-load environments or with the ioqueue "fast track" commented out; else pj_ioqueue_sendto always returns PJ_SUCCESS in which case the caller is responsible for invoking the callback. When the fast-track fails, pj_ioqueue_sendto always returns PJ_EPENDING in which case the callee is responsible for invoking the callback. However, since this case is not normally covered by the test suite, it is evidently not guaranteed to be correct.

Marking this as draft until I figure out a more thorough solution.

@azertyfun azertyfun marked this pull request as draft March 20, 2026 13:28
azertyfun pushed a commit to azertyfun/pjproject that referenced this pull request Mar 20, 2026
The leak happens when

1. send_buf_pending is already in use
2. ssock_on_data_sent calls flush_circ_buf_output
3. send_buf_pending gets overwritten -> previous data in
   send_buf_pending is lost forever, its callback is never called, and
   therefore the reference to the tdata it contains is never
   decremented, preventing transport destruction
@azertyfun
Copy link
Copy Markdown
Author

azertyfun commented Mar 20, 2026

Fixed a second leak. At this point I've lost all confidence that the original implementation has any understanding of the importance of not losing track of ioqueue keys. I'll make a thorough review next week.

azertyfun added a commit to azertyfun/pjproject that referenced this pull request Mar 24, 2026
The fast-track remains enabled by default, but this is useful for
troubleshooting.

The ioqueue has two widly different behaviors:

1. In the fast-track (almost always used in low-load scenarios), sending
   a write_op always immediately returns with PJ_SUCCESS. The semantics
   in this case are that the _caller_ is responsible for calling any
   relevant callbacks.
2. Outside the fast track (relevant in high-load scenarios, which
   unfortunately do not seem to be extensively covered by regression
   tests), sending a write_op returns PJ_EPENDING in the happy path
   which changes the semantics; now the _ioqueue_ becomes responsible
   for the data owned by write_op and for calling the relevant
   callbacks.
   See pjsip#4864, pjsip#4878 for why this is a tricky behavior that can easily be
   missed and cause bugs.

Disabling the fast track allows for more easily testing the second case.
azertyfun added a commit to azertyfun/pjproject that referenced this pull request Mar 24, 2026
The leak happens when
1. send_buf_pending is already in use
2. ssock_on_data_sent calls flush_circ_buf_output
3. send_buf_pending gets overwritten -> previous data in
   send_buf_pending is lost forever, its callback is never called, and
   therefore the reference to the tdata it contains is never
   decremented, preventing transport destruction
@azertyfun azertyfun force-pushed the activesock-txdata-leak branch from 4a54cd5 to b3440a7 Compare March 24, 2026 16:09
azertyfun added a commit to azertyfun/pjproject that referenced this pull request Mar 24, 2026
Before bugfixes in pjsip#4878, the test fails:

16:58:29.999 Test "concurrent_sends[i].sent" != 0 fails in activesock.c:469 (sent callback not called)
azertyfun added a commit to azertyfun/pjproject that referenced this pull request Mar 25, 2026
Before bugfixes in pjsip#4878, the test fails:

16:58:29.999 Test "concurrent_sends[i].sent" != 0 fails in activesock.c:469 (sent callback not called)
@azertyfun azertyfun force-pushed the activesock-txdata-leak branch from b3440a7 to 47cead4 Compare March 25, 2026 09:46
@azertyfun
Copy link
Copy Markdown
Author

Alright, I think I have reviewed all the relevant code paths I could think of for this issue. This led me to add a couple additional fixes compared to before (same logic of always invoking the callback in on_data_read + fix for pj_activesock_close()`. I also took the opportunity to add a regression test.

This is the mind map I used to review the possible journeys taken by a tdata object sent from pjsip_transport_send().

tdata journey

@azertyfun azertyfun marked this pull request as ready for review March 25, 2026 09:55
azertyfun added a commit to azertyfun/pjproject that referenced this pull request Mar 25, 2026
The fast-track remains enabled by default, but this is useful for
troubleshooting.

The ioqueue has two widly different behaviors:

1. In the fast-track (almost always used in low-load scenarios), sending
   a write_op always immediately returns with PJ_SUCCESS. The semantics
   in this case are that the _caller_ is responsible for calling any
   relevant callbacks.
2. Outside the fast track (relevant in high-load scenarios, which
   unfortunately do not seem to be extensively covered by regression
   tests), sending a write_op returns PJ_EPENDING in the happy path
   which changes the semantics; now the _ioqueue_ becomes responsible
   for the data owned by write_op and for calling the relevant
   callbacks.
   See pjsip#4864, pjsip#4878 for why this is a tricky behavior that can easily be
   missed and cause bugs.

Disabling the fast track allows for more easily testing the second case.
azertyfun added a commit to azertyfun/pjproject that referenced this pull request Mar 25, 2026
The leak happens when
1. send_buf_pending is already in use
2. ssock_on_data_sent calls flush_circ_buf_output
3. send_buf_pending gets overwritten -> previous data in
   send_buf_pending is lost forever, its callback is never called, and
   therefore the reference to the tdata it contains is never
   decremented, preventing transport destruction
azertyfun added a commit to azertyfun/pjproject that referenced this pull request Mar 25, 2026
Before bugfixes in pjsip#4878, the test fails:

16:58:29.999 Test "concurrent_sends[i].sent" != 0 fails in activesock.c:469 (sent callback not called)
azertyfun pushed a commit to azertyfun/pjproject that referenced this pull request Mar 25, 2026
@azertyfun azertyfun force-pushed the activesock-txdata-leak branch from 47cead4 to 31f22f4 Compare March 25, 2026 14:05
azertyfun added a commit to azertyfun/pjproject that referenced this pull request Mar 25, 2026
@azertyfun azertyfun force-pushed the activesock-txdata-leak branch from 31f22f4 to bc22897 Compare March 25, 2026 14:06
azertyfun added a commit to azertyfun/pjproject that referenced this pull request Mar 25, 2026
Before bugfixes in pjsip#4878, the test fails:

16:58:29.999 Test "concurrent_sends[i].sent" != 0 fails in activesock.c:469 (sent callback not called)
azertyfun added a commit to azertyfun/pjproject that referenced this pull request Mar 27, 2026
@azertyfun azertyfun force-pushed the activesock-txdata-leak branch from 10a568e to f1d7044 Compare March 27, 2026 14:25
azertyfun added a commit to azertyfun/pjproject that referenced this pull request Mar 30, 2026
Before bugfixes in pjsip#4878, the test fails:

16:58:29.999 Test "concurrent_sends[i].sent" != 0 fails in activesock.c:469 (sent callback not called)
azertyfun added a commit to azertyfun/pjproject that referenced this pull request Mar 30, 2026
@azertyfun azertyfun force-pushed the activesock-txdata-leak branch from f1d7044 to 63b1f04 Compare March 30, 2026 08:54
azertyfun added a commit to azertyfun/pjproject that referenced this pull request Mar 30, 2026
Before bugfixes in pjsip#4878, the test fails:

16:58:29.999 Test "concurrent_sends[i].sent" != 0 fails in activesock.c:469 (sent callback not called)
azertyfun added a commit to azertyfun/pjproject that referenced this pull request Mar 30, 2026
@azertyfun azertyfun force-pushed the activesock-txdata-leak branch from 63b1f04 to 8eb1190 Compare March 30, 2026 09:41
Was comparing bytes (pj_ssize_t data length) against PJ_EPENDING
instead of status. Pre-existing bug unrelated to pjsip#4878.

Co-Authored-By: Claude Code
@nanangizz nanangizz force-pushed the activesock-txdata-leak branch from 8eb1190 to 60e1f69 Compare April 1, 2026 04:14
nanangizz added a commit to azertyfun/pjproject that referenced this pull request Apr 1, 2026
…ip#4878)

When pj_ioqueue_unregister() is called with pending async writes,
on_write_complete callbacks were silently skipped. Upper layers (e.g.,
SIP transport) rely on these callbacks to release resources such as
tdata references, causing reference leaks under high load.

The fix drains pending write callbacks during key unregistration:
- write_cb_list (completed ops, deferred callback): invoked with
  op->written (success status), respecting write_callback_thread
  serialization.
- write_list (pending ops, never sent): invoked with -PJ_ECANCELLED.

Changes across all maintained ioqueue backends (epoll, kqueue, select,
IOCP). Also includes:
- PJ_IOQUEUE_FAST_TRACK config to disable fast-track for testing
- activesock: forward on_data_sent callback when SHUT_TX is set
- ssl_sock: prevent send_buf_pending overwrite (tdata leak)
- Regression test for send callbacks on activesock close

Co-Authored-By: Nathan Monfils <nathanmonfils@gmail.com>
Co-Authored-By: Claude Code
@nanangizz nanangizz requested a review from Copilot April 1, 2026 04:15
@nanangizz nanangizz added this to the release-2.17 milestone Apr 1, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.

Comments suppressed due to low confidence (1)

pjlib/src/pj/ssl_sock_imp_common.c:604

  • This new branch returns PJ_ENOMEM when alloc_send_data() fails while send_buf_pending already contains data. That status can now propagate to ssl_send()/flush_delayed_send() and be treated as a hard error (e.g., renegotiation path treats anything other than SUCCESS/EPENDING as failure). Consider returning a backpressure status that callers already treat as non-fatal (e.g., PJ_EBUSY/PJ_EPENDING), or updating the relevant callers to handle this specific condition without tearing down the connection.
    wdata = alloc_send_data(ssock, needed_len);
    if (wdata == NULL && ssock->send_buf_pending.data_len) {
        /* Buffer full and there's already pending data. Cannot overwrite
         * send_buf_pending as that would lose the previous data and its
         * callback reference (causing a tdata leak). See #4878.
         */
        pj_lock_release(ssock->write_mutex);
        return PJ_ENOMEM;
    } else if (wdata == NULL) {
        /* Oops, the send buffer is full, let's just
         * queue it for sending and return PJ_EPENDING.
         */
        ssock->send_buf_pending.data_len = needed_len;
        ssock->send_buf_pending.app_key = send_key;
        ssock->send_buf_pending.flags = flags;
        ssock->send_buf_pending.plain_data_len = orig_len;
        pj_lock_release(ssock->write_mutex);
        return PJ_EPENDING;
    }

nanangizz added a commit to azertyfun/pjproject that referenced this pull request Apr 1, 2026
…ip#4878)

When pj_ioqueue_unregister() is called with pending async writes,
on_write_complete callbacks were silently skipped. Upper layers (e.g.,
SIP transport) rely on these callbacks to release resources such as
tdata references, causing reference leaks under high load.

The fix drains pending write callbacks during key unregistration:
- write_cb_list (completed ops, deferred callback): invoked with
  op->written (success status), respecting write_callback_thread
  serialization.
- write_list (pending ops, never sent): invoked with -PJ_ECANCELLED.

Changes across all maintained ioqueue backends (epoll, kqueue, select,
IOCP). Also includes:
- PJ_IOQUEUE_FAST_TRACK config to disable fast-track for testing
- activesock: forward on_data_sent callback when SHUT_TX is set
- ssl_sock: prevent send_buf_pending overwrite (tdata leak)
- Regression test for send callbacks on activesock close

Co-Authored-By: Nathan Monfils <nathanmonfils@gmail.com>
Co-Authored-By: Claude Code
@nanangizz nanangizz force-pushed the activesock-txdata-leak branch from 60e1f69 to 45d60d4 Compare April 1, 2026 04:40
nanangizz added a commit to azertyfun/pjproject that referenced this pull request Apr 1, 2026
…ip#4878)

When pj_ioqueue_unregister() is called with pending async writes,
on_write_complete callbacks were silently skipped. Upper layers (e.g.,
SIP transport) rely on these callbacks to release resources such as
tdata references, causing reference leaks under high load.

The fix drains pending write callbacks during key unregistration:
- write_cb_list (completed ops, deferred callback): invoked with
  op->written (success status), respecting write_callback_thread
  serialization.
- write_list (pending ops, never sent): invoked with -PJ_ECANCELLED.

Changes across all maintained ioqueue backends (epoll, kqueue, select,
IOCP). Also includes:
- PJ_IOQUEUE_FAST_TRACK config to disable fast-track for testing
- activesock: forward on_data_sent callback when SHUT_TX is set
- ssl_sock: prevent send_buf_pending overwrite (tdata leak)
- Regression test for send callbacks on activesock close

Co-Authored-By: Nathan Monfils <nathanmonfils@gmail.com>
Co-Authored-By: Claude Code
@nanangizz nanangizz force-pushed the activesock-txdata-leak branch from 45d60d4 to d395b2b Compare April 1, 2026 05:03
…ip#4878)

When pj_ioqueue_unregister() is called with pending async writes,
on_write_complete callbacks were silently skipped. Upper layers (e.g.,
SIP transport) rely on these callbacks to release resources such as
tdata references, causing reference leaks under high load.

The fix drains pending write callbacks during key unregistration:
- write_cb_list (completed ops, deferred callback): invoked with
  op->written (success status), respecting write_callback_thread
  serialization.
- write_list (pending ops, never sent): invoked with -PJ_ECANCELLED.

Changes across all maintained ioqueue backends (epoll, kqueue, select,
IOCP). Also includes:
- PJ_IOQUEUE_FAST_TRACK config to disable fast-track for testing
- activesock: forward on_data_sent callback when SHUT_TX is set
- ssl_sock: prevent send_buf_pending overwrite (tdata leak)
- Regression test for send callbacks on activesock close

Co-Authored-By: Nathan Monfils <nathanmonfils@gmail.com>
Co-Authored-By: Claude Code
@nanangizz nanangizz force-pushed the activesock-txdata-leak branch from d395b2b to bc60815 Compare April 1, 2026 05:23
@nanangizz nanangizz requested a review from sauwming April 1, 2026 06:04
@sauwming sauwming linked an issue Apr 2, 2026 that may be closed by this pull request
@nanangizz
Copy link
Copy Markdown
Member

@azertyfun We've tried to address the issues mentioned earlier. Please let us know if you notice any further issues or have additional comments.

@azertyfun
Copy link
Copy Markdown
Author

We deployed the earlier fix (which sends garbage over the wire – but that doesn't really matter in our use-case). However we had to roll it back because we unexpectedly encountered crashes.

I am currently investigating the cause. I still think the fix is sound in principle, but it may be triggering other existing race conditions. In particular this bit seems faulty: if the ioqueue is busy, ssl_send returns PJ_EPENDING, and wp is not removed from ssock->write_pending. However it has already been added to write_list, so it will be duplicated. I still need to look further to understand why this is only happening now.

@nanangizz
Copy link
Copy Markdown
Member

You're right. Asked the AI to check that area, and it found several issues!!!

1. Wrong app_key passed to ssl_send
flush_delayed_send passes &wp->key (the internal SSL write_data_t key) instead of wp->app_key (the original caller's op_key). When the async callback fires, the TLS transport casts op_key to pjsip_tx_data_op_key* — but &wp->key is NOT a valid pjsip_tx_data_op_key, so accessing tdata_op_key->tdata and tdata_op_key->callback accesses invalid memory. Potential crash/UB.

2. PJ_EPENDING not handled — double-send
When ssl_send returns PJ_EPENDING (data encrypted + queued), wp stays in write_pending. Next flush_delayed_send call re-processes wp: ssl_write encrypts the same plaintext again → duplicate data on the wire, potential TLS record corruption.

3. Missing callback on PJ_SUCCESS — tdata leak
When delayed data is sent synchronously (PJ_SUCCESS), no on_data_sent callback fires. The app (which received PJ_EPENDING from the original pj_ssl_sock_send) expects a callback → tdata ref never decremented → leak.

4. send_buf_pending not cancelled on error in ssock_on_data_sent
When sent < 0 (error/cancel) and the app callback returns PJ_TRUE, ssock_on_data_sent tries to flush_circ_buf_output for send_buf_pending on a dead socket. This fails, and the send_buf_pending callback is never invoked → tdata leak.

Here is the draft patch (compile tested only, will test further next week).

@azertyfun
Copy link
Copy Markdown
Author

After further investigation, the crash I described above was one of the outcomes, but the more common one was that we saw ssl0x7fa1caca6a00 Renegotiation failed: Not enough memory (PJ_ENOMEM) sometimes followed by a deadlock. In every case where there was a deadlock, there was this PJ_ENOMEM error. Unfortunately I do not have a workable core dump of those occurrences.

Now it is clearly related to the return PJ_ENOMEM this PR adds. What's unclear for now is how.

Thanks to the error message we know t hat ssl_do_handshake failed with PJ_ENOMEM due to the change I made. Before in the same situation it was silently failing with PJ_EPENDING instead, which essentially a no-op.

From there I can only speculate why there was a deadlock, but on_error calls on_handshake_complete which calls on_connect_complete which triggers a bunch of things in asterisk... so who knows what may have happened from there.

The whole logic for SSL handshaking is spread out, has multiple redundancies, and probably just as buggy as the rest of the code touched by this PR. Another thorough review is needed, but that's going to be quite a few additional man-hours which I'm starting to run short on as other priorities are catching up after several weeks working on this already.

And anyway, how can we know when we've squashed the last critical bug? How can we verify our fix when the interfaces are so complex and behave in radically different ways under load vs not, including in edge-cases like SSL renegotiation under load ? Unfortunately given that I caused a major incident by pushing a partial fix, we won't get approval for another attempt unless we have a complete explanation for all issues encountered and a way to reliably reproduce our race conditions and verify the correctness of any fixes.

Perhaps the solution for us will be to wait until PJ_IOQUEUE_CALLBACK_NO_LOCK is stabilized&released, and just rely on the TLS connection never breaking so we never have to encounter these buggy and untested edge-cases. Or perhaps switch to UDP/TCP through a VPN. Which I realize aren't real solutions and they do not help you. Given the scope of the architural and implementation issues in the SSL subsystem, short of a complete rewrite I honestly lack confidence that pjproject can ensure correctness.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

tdata refcounting issue prevents SIPTLS transport shutdown

6 participants