Fix callback not being called after activesock shutdown (#4864) by azertyfun · Pull Request #4878 · pjsip/pjproject

azertyfun · 2026-03-19T17:26:39Z

This is critical to avoid a reference leak. The callback is responsible for calling pjsip_tx_data_dec_ref.

Background: When pj_ioqueue_unregister() is called with pending async writes, on_write_complete callbacks are silently skipped. This causes tdata reference leaks since upper layers rely on these callbacks to call pjsip_tx_data_dec_ref(). The bug only manifests when the ioqueue fast-track fails (high load, full send buffer), making it nearly invisible in normal testing. Exists with or without PJ_IOQUEUE_CALLBACK_NO_LOCK.

Approach: Drain pending write callbacks during key unregistration, handling two lists with different semantics: write_cb_list (completed ops, deferred callback) invoked with success status respecting write_callback_thread serialization, then write_list (pending unsent ops) invoked with -PJ_ECANCELLED.

Maintainer rework addressing review feedback:

No deadlock: callbacks invoked without holding key/ioqueue locks. Key lifetime protected with grp_lock ref counting.
All maintained backends: epoll, kqueue, select, IOCP. Skipped deprecated symbian/uwp.
IOCP: cancelled overlapped ops invoke write callback with -PJ_ECANCELLED; write_callback_thread properly cleared in closing path.
activesock: forward on_data_sent when SHUT_TX set (not on_data_read).
SSL: prevent send_buf_pending overwrite (tdata leak).
PJ_IOQUEUE_FAST_TRACK config to disable fast-track for testing async path.
Regression test: concurrent sends + close before poll, with SO_SNDBUF shrink to force async path.

CLAassistant · 2026-03-19T17:26:50Z

All committers have signed the CLA.

azertyfun · 2026-03-19T17:32:18Z

This fixes the issue as described & reproduced in #4864.

I am moderately confident that this is a necessary and sufficient fix, however the whole architecture around TLS transport is quite difficult to get my head around. There are many points where a tdata can be delayed or cancelled. Hopefully someone a lot more familiar than me with this codebase can weigh in on this.

Copilot

Pull request overview

Fixes a shutdown edge-case in PJLIB’s active socket write-completion path so higher layers still receive the send-completion callback, preventing reference leaks (e.g., callers that decrement pjsip_tx_data refs in on_data_sent).

Changes:

Invoke on_data_sent when SHUT_TX is set in ioqueue_on_write_complete() (instead of returning silently).

pjlib/src/pj/activesock.c

)

azertyfun · 2026-03-20T13:28:26Z

Actually with further testing I encountered another case where the callback still isn't called. This requires more thorough troubleshooting.

One major pain point is that the issue is only reproducible in high-load environments or with the ioqueue "fast track" commented out; else pj_ioqueue_sendto always returns PJ_SUCCESS in which case the caller is responsible for invoking the callback. When the fast-track fails, pj_ioqueue_sendto always returns PJ_EPENDING in which case the callee is responsible for invoking the callback. However, since this case is not normally covered by the test suite, it is evidently not guaranteed to be correct.

Marking this as draft until I figure out a more thorough solution.

The leak happens when 1. send_buf_pending is already in use 2. ssock_on_data_sent calls flush_circ_buf_output 3. send_buf_pending gets overwritten -> previous data in send_buf_pending is lost forever, its callback is never called, and therefore the reference to the tdata it contains is never decremented, preventing transport destruction

azertyfun · 2026-03-20T16:19:28Z

Fixed a second leak. At this point I've lost all confidence that the original implementation has any understanding of the importance of not losing track of ioqueue keys. I'll make a thorough review next week.

The fast-track remains enabled by default, but this is useful for troubleshooting. The ioqueue has two widly different behaviors: 1. In the fast-track (almost always used in low-load scenarios), sending a write_op always immediately returns with PJ_SUCCESS. The semantics in this case are that the _caller_ is responsible for calling any relevant callbacks. 2. Outside the fast track (relevant in high-load scenarios, which unfortunately do not seem to be extensively covered by regression tests), sending a write_op returns PJ_EPENDING in the happy path which changes the semantics; now the _ioqueue_ becomes responsible for the data owned by write_op and for calling the relevant callbacks. See pjsip#4864, pjsip#4878 for why this is a tricky behavior that can easily be missed and cause bugs. Disabling the fast track allows for more easily testing the second case.

The leak happens when 1. send_buf_pending is already in use 2. ssock_on_data_sent calls flush_circ_buf_output 3. send_buf_pending gets overwritten -> previous data in send_buf_pending is lost forever, its callback is never called, and therefore the reference to the tdata it contains is never decremented, preventing transport destruction

Before bugfixes in pjsip#4878, the test fails: 16:58:29.999 Test "concurrent_sends[i].sent" != 0 fails in activesock.c:469 (sent callback not called)

azertyfun · 2026-03-25T09:55:46Z

Alright, I think I have reviewed all the relevant code paths I could think of for this issue. This led me to add a couple additional fixes compared to before (same logic of always invoking the callback in on_data_read + fix for pj_activesock_close()`. I also took the opportunity to add a regression test.

This is the mind map I used to review the possible journeys taken by a tdata object sent from pjsip_transport_send().

The fast-track remains enabled by default, but this is useful for troubleshooting. The ioqueue has two widly different behaviors: 1. In the fast-track (almost always used in low-load scenarios), sending a write_op always immediately returns with PJ_SUCCESS. The semantics in this case are that the _caller_ is responsible for calling any relevant callbacks. 2. Outside the fast track (relevant in high-load scenarios, which unfortunately do not seem to be extensively covered by regression tests), sending a write_op returns PJ_EPENDING in the happy path which changes the semantics; now the _ioqueue_ becomes responsible for the data owned by write_op and for calling the relevant callbacks. See pjsip#4864, pjsip#4878 for why this is a tricky behavior that can easily be missed and cause bugs. Disabling the fast track allows for more easily testing the second case.

The leak happens when 1. send_buf_pending is already in use 2. ssock_on_data_sent calls flush_circ_buf_output 3. send_buf_pending gets overwritten -> previous data in send_buf_pending is lost forever, its callback is never called, and therefore the reference to the tdata it contains is never decremented, preventing transport destruction

Before bugfixes in pjsip#4878, the test fails: 16:58:29.999 Test "concurrent_sends[i].sent" != 0 fails in activesock.c:469 (sent callback not called)

Was comparing bytes (pj_ssize_t data length) against PJ_EPENDING instead of status. Pre-existing bug unrelated to pjsip#4878. Co-Authored-By: Claude Code

…ip#4878) When pj_ioqueue_unregister() is called with pending async writes, on_write_complete callbacks were silently skipped. Upper layers (e.g., SIP transport) rely on these callbacks to release resources such as tdata references, causing reference leaks under high load. The fix drains pending write callbacks during key unregistration: - write_cb_list (completed ops, deferred callback): invoked with op->written (success status), respecting write_callback_thread serialization. - write_list (pending ops, never sent): invoked with -PJ_ECANCELLED. Changes across all maintained ioqueue backends (epoll, kqueue, select, IOCP). Also includes: - PJ_IOQUEUE_FAST_TRACK config to disable fast-track for testing - activesock: forward on_data_sent callback when SHUT_TX is set - ssl_sock: prevent send_buf_pending overwrite (tdata leak) - Regression test for send callbacks on activesock close Co-Authored-By: Nathan Monfils <nathanmonfils@gmail.com> Co-Authored-By: Claude Code

Copilot

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.

Comments suppressed due to low confidence (1)

pjlib/src/pj/ssl_sock_imp_common.c:604

This new branch returns PJ_ENOMEM when alloc_send_data() fails while send_buf_pending already contains data. That status can now propagate to ssl_send()/flush_delayed_send() and be treated as a hard error (e.g., renegotiation path treats anything other than SUCCESS/EPENDING as failure). Consider returning a backpressure status that callers already treat as non-fatal (e.g., PJ_EBUSY/PJ_EPENDING), or updating the relevant callers to handle this specific condition without tearing down the connection.

    wdata = alloc_send_data(ssock, needed_len);
    if (wdata == NULL && ssock->send_buf_pending.data_len) {
        /* Buffer full and there's already pending data. Cannot overwrite
         * send_buf_pending as that would lose the previous data and its
         * callback reference (causing a tdata leak). See #4878.
         */
        pj_lock_release(ssock->write_mutex);
        return PJ_ENOMEM;
    } else if (wdata == NULL) {
        /* Oops, the send buffer is full, let's just
         * queue it for sending and return PJ_EPENDING.
         */
        ssock->send_buf_pending.data_len = needed_len;
        ssock->send_buf_pending.app_key = send_key;
        ssock->send_buf_pending.flags = flags;
        ssock->send_buf_pending.plain_data_len = orig_len;
        pj_lock_release(ssock->write_mutex);
        return PJ_EPENDING;
    }

pjlib/src/pjlib-test/activesock.c

pjlib/src/pj/ioqueue_common_abs.c

pjlib/src/pjlib-test/activesock.c

…ip#4878) When pj_ioqueue_unregister() is called with pending async writes, on_write_complete callbacks were silently skipped. Upper layers (e.g., SIP transport) rely on these callbacks to release resources such as tdata references, causing reference leaks under high load. The fix drains pending write callbacks during key unregistration: - write_cb_list (completed ops, deferred callback): invoked with op->written (success status), respecting write_callback_thread serialization. - write_list (pending ops, never sent): invoked with -PJ_ECANCELLED. Changes across all maintained ioqueue backends (epoll, kqueue, select, IOCP). Also includes: - PJ_IOQUEUE_FAST_TRACK config to disable fast-track for testing - activesock: forward on_data_sent callback when SHUT_TX is set - ssl_sock: prevent send_buf_pending overwrite (tdata leak) - Regression test for send callbacks on activesock close Co-Authored-By: Nathan Monfils <nathanmonfils@gmail.com> Co-Authored-By: Claude Code

nanangizz · 2026-04-02T04:12:49Z

@azertyfun We've tried to address the issues mentioned earlier. Please let us know if you notice any further issues or have additional comments.

azertyfun · 2026-04-02T10:24:00Z

We deployed the earlier fix (which sends garbage over the wire – but that doesn't really matter in our use-case). However we had to roll it back because we unexpectedly encountered crashes.

I am currently investigating the cause. I still think the fix is sound in principle, but it may be triggering other existing race conditions. In particular this bit seems faulty: if the ioqueue is busy, ssl_send returns PJ_EPENDING, and wp is not removed from ssock->write_pending. However it has already been added to write_list, so it will be duplicated. I still need to look further to understand why this is only happening now.

nanangizz · 2026-04-02T11:54:08Z

You're right. Asked the AI to check that area, and it found several issues!!!

1. Wrong app_key passed to ssl_send
flush_delayed_send passes &wp->key (the internal SSL write_data_t key) instead of wp->app_key (the original caller's op_key). When the async callback fires, the TLS transport casts op_key to pjsip_tx_data_op_key* — but &wp->key is NOT a valid pjsip_tx_data_op_key, so accessing tdata_op_key->tdata and tdata_op_key->callback accesses invalid memory. Potential crash/UB.

2. PJ_EPENDING not handled — double-send
When ssl_send returns PJ_EPENDING (data encrypted + queued), wp stays in write_pending. Next flush_delayed_send call re-processes wp: ssl_write encrypts the same plaintext again → duplicate data on the wire, potential TLS record corruption.

3. Missing callback on PJ_SUCCESS — tdata leak
When delayed data is sent synchronously (PJ_SUCCESS), no on_data_sent callback fires. The app (which received PJ_EPENDING from the original pj_ssl_sock_send) expects a callback → tdata ref never decremented → leak.

4. send_buf_pending not cancelled on error in ssock_on_data_sent
When sent < 0 (error/cancel) and the app callback returns PJ_TRUE, ssock_on_data_sent tries to flush_circ_buf_output for send_buf_pending on a dead socket. This fails, and the send_buf_pending callback is never invoked → tdata leak.

Here is the draft patch (compile tested only, will test further next week).

azertyfun · 2026-04-02T13:40:04Z

After further investigation, the crash I described above was one of the outcomes, but the more common one was that we saw ssl0x7fa1caca6a00 Renegotiation failed: Not enough memory (PJ_ENOMEM) sometimes followed by a deadlock. In every case where there was a deadlock, there was this PJ_ENOMEM error. Unfortunately I do not have a workable core dump of those occurrences.

Now it is clearly related to the return PJ_ENOMEM this PR adds. What's unclear for now is how.

Thanks to the error message we know t hat ssl_do_handshake failed with PJ_ENOMEM due to the change I made. Before in the same situation it was silently failing with PJ_EPENDING instead, which essentially a no-op.

From there I can only speculate why there was a deadlock, but on_error calls on_handshake_complete which calls on_connect_complete which triggers a bunch of things in asterisk... so who knows what may have happened from there.

The whole logic for SSL handshaking is spread out, has multiple redundancies, and probably just as buggy as the rest of the code touched by this PR. Another thorough review is needed, but that's going to be quite a few additional man-hours which I'm starting to run short on as other priorities are catching up after several weeks working on this already.

And anyway, how can we know when we've squashed the last critical bug? How can we verify our fix when the interfaces are so complex and behave in radically different ways under load vs not, including in edge-cases like SSL renegotiation under load ? Unfortunately given that I caused a major incident by pushing a partial fix, we won't get approval for another attempt unless we have a complete explanation for all issues encountered and a way to reliably reproduce our race conditions and verify the correctness of any fixes.

Perhaps the solution for us will be to wait until PJ_IOQUEUE_CALLBACK_NO_LOCK is stabilized&released, and just rely on the TLS connection never breaking so we never have to encounter these buggy and untested edge-cases. Or perhaps switch to UDP/TCP through a VPN. Which I realize aren't real solutions and they do not help you. Given the scope of the architural and implementation issues in the SSL subsystem, short of a complete rewrite I honestly lack confidence that pjproject can ensure correctness.

azertyfun force-pushed the activesock-txdata-leak branch from c60348f to e608796 Compare March 19, 2026 17:27

sauwming requested a review from Copilot March 20, 2026 04:56

Copilot started reviewing on behalf of sauwming March 20, 2026 04:56 View session

Copilot AI reviewed Mar 20, 2026

View reviewed changes

pjlib/src/pj/activesock.c Outdated Show resolved Hide resolved

pjlib/src/pj/activesock.c Show resolved Hide resolved

pjlib/src/pj/activesock.c Show resolved Hide resolved

azertyfun pushed a commit to azertyfun/pjproject that referenced this pull request Mar 20, 2026

Call on_data_sent with bytes_sent even if asock is shut down (pjsip#4878

226b170

)

azertyfun added a commit to azertyfun/pjproject that referenced this pull request Mar 20, 2026

Call on_data_sent with bytes_sent even if asock is shut down (pjsip#4878

af16a17

)

azertyfun force-pushed the activesock-txdata-leak branch from 226b170 to af16a17 Compare March 20, 2026 12:14

azertyfun marked this pull request as draft March 20, 2026 13:28

azertyfun force-pushed the activesock-txdata-leak branch from 4a54cd5 to b3440a7 Compare March 24, 2026 16:09

azertyfun force-pushed the activesock-txdata-leak branch from b3440a7 to 47cead4 Compare March 25, 2026 09:46

azertyfun marked this pull request as ready for review March 25, 2026 09:55

azertyfun mentioned this pull request Mar 25, 2026

[bug]: TLS connection cuts with "sslv3 alert bad record mac" when sending a lot of outbound requests asterisk/asterisk#168

Open

1 task

azertyfun pushed a commit to azertyfun/pjproject that referenced this pull request Mar 25, 2026

Fix unrelated test not correct checking for PJ_EPENIDNG (pjsip#4878)

31f22f4

azertyfun force-pushed the activesock-txdata-leak branch from 47cead4 to 31f22f4 Compare March 25, 2026 14:05

azertyfun added a commit to azertyfun/pjproject that referenced this pull request Mar 25, 2026

Fix unrelated test not correct checking for PJ_EPENIDNG (pjsip#4878)

bc22897

azertyfun force-pushed the activesock-txdata-leak branch from 31f22f4 to bc22897 Compare March 25, 2026 14:06

azertyfun added a commit to azertyfun/pjproject that referenced this pull request Mar 27, 2026

Fix unrelated test not correct checking for PJ_EPENIDNG (pjsip#4878)

f1d7044

azertyfun force-pushed the activesock-txdata-leak branch from 10a568e to f1d7044 Compare March 27, 2026 14:25

azertyfun added a commit to azertyfun/pjproject that referenced this pull request Mar 30, 2026

Fix unrelated test not correct checking for PJ_EPENIDNG (pjsip#4878)

63b1f04

azertyfun force-pushed the activesock-txdata-leak branch from f1d7044 to 63b1f04 Compare March 30, 2026 08:54

azertyfun added a commit to azertyfun/pjproject that referenced this pull request Mar 30, 2026

Fix unrelated test not correct checking for PJ_EPENIDNG (pjsip#4878)

8eb1190

azertyfun force-pushed the activesock-txdata-leak branch from 63b1f04 to 8eb1190 Compare March 30, 2026 09:41

Fix incorrect PJ_EPENDING check in ioq_tcp test

90f8061

Was comparing bytes (pj_ssize_t data length) against PJ_EPENDING instead of status. Pre-existing bug unrelated to pjsip#4878. Co-Authored-By: Claude Code

nanangizz force-pushed the activesock-txdata-leak branch from 8eb1190 to 60e1f69 Compare April 1, 2026 04:14

nanangizz requested a review from Copilot April 1, 2026 04:15

nanangizz added type: bug component: pjlib labels Apr 1, 2026

nanangizz added this to the release-2.17 milestone Apr 1, 2026

Copilot started reviewing on behalf of nanangizz April 1, 2026 04:15 View session

Copilot AI reviewed Apr 1, 2026

View reviewed changes

nanangizz force-pushed the activesock-txdata-leak branch from 60e1f69 to 45d60d4 Compare April 1, 2026 04:40

nanangizz force-pushed the activesock-txdata-leak branch from 45d60d4 to d395b2b Compare April 1, 2026 05:03

nanangizz force-pushed the activesock-txdata-leak branch from d395b2b to bc60815 Compare April 1, 2026 05:23

nanangizz requested a review from sauwming April 1, 2026 06:04

sauwming approved these changes Apr 1, 2026

View reviewed changes

sauwming linked an issue Apr 2, 2026 that may be closed by this pull request

tdata refcounting issue prevents SIPTLS transport shutdown #4864

Open

Conversation

azertyfun commented Mar 19, 2026 • edited by nanangizz Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

CLAassistant commented Mar 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

azertyfun commented Mar 19, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

azertyfun commented Mar 20, 2026

Uh oh!

azertyfun commented Mar 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

azertyfun commented Mar 25, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

nanangizz commented Apr 2, 2026

Uh oh!

azertyfun commented Apr 2, 2026

Uh oh!

nanangizz commented Apr 2, 2026

Uh oh!

azertyfun commented Apr 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

azertyfun commented Mar 19, 2026 •

edited by nanangizz

Loading

CLAassistant commented Mar 19, 2026 •

edited

Loading

azertyfun commented Mar 20, 2026 •

edited

Loading