
Notify pending transactions on TCP/TLS connection close #288

Merged
emiago merged 3 commits into emiago:main from Odinvt:fix/tcp-conn-close-notify on Mar 14, 2026

Conversation

@Odinvt (Contributor) commented Mar 9, 2026

Scenario:

When a TCP/TLS connection is closed by the remote end, readConnection exits without notifying the transaction layer. Pending client transactions on that connection wait the full Timer_B (32s) before timing out.

Fix:

This adds a callback from the transport to the transaction layer on connection close. Matching client transactions receive client_input_transport_err, which is already handled in every FSM state. The goroutine pattern matches existing transport error handling in ack()/resend().

How this was found:

During mTLS testing against Kamailio with require_certificate=yes, a client connecting with an invalid certificate completes the TLS handshake (TLS 1.2 sends client cert after ServerHello) but the server immediately closes the connection upon evaluating the cert. Our REGISTER transaction was left hanging for 32 seconds despite the connection being dead.

@emiago (Owner) commented Mar 9, 2026

Hi @Odinvt. I see the issue you want to catch.
I could only see this behind an experimental feature flag, just some global that could disable this behavior.
It is not something I can cover with the RFC. Also, the timers do save you from a burst effect.

Also, why not add WS/WSS as well?

@Odinvt (Contributor, Author) commented Mar 9, 2026

Thanks for the feedback @emiago, all addressed:

Feature flag: Added WithTransactionLayerTerminateOnConnClose() as an opt-in TransactionLayerOption, marked // Experimental. Default behavior unchanged.

Burst effect: You're right, Timer_B naturally spreads termination since transactions are created at different times, while this flag collapses that into one event. Go scheduling provides fairness across the spawned goroutines, but the application is deliberately making that tradeoff, hence the opt-in 👍.

WS/WSS: Added. TransportWS now has onConnClose + defer in readConnection. WSS inherits via embedding.

Note: I separated signal from action. The connection-close notification is always wired (all 4 transports → OnConnectionClose) per RFC 3261 §17.1.4 and §18.4, which require the transport layer to inform the transport user of connection failures. The RFC also says the client transaction SHOULD transition to Terminated on transport failure. I kept the flag for safety regardless: the termination logic is a private method only called when the flag is set. OnConnectionClose is exported so it can serve future use cases without another PR.

@emiago (Owner) commented Mar 9, 2026

ok nice. I would use a global, but ok, it can be a transport layer parameter.
This could be merged as is. I am just wondering why we would not include server transactions as well.

@Odinvt (Contributor, Author) commented Mar 9, 2026

I would use a global, but ok, it can be a transport layer parameter.

Sorry for not looking deeper into the global var pattern before submitting, I should have checked your conventions first.

I am just wondering why we would not include server transactions as well.

Our use case is UAC-heavy (SIP load testing), so client transactions are where I hit this in practice. I haven't faced the server tx side yet, which is why it wasn't included. But looking at the code, here's what I found (please do correct me if I'm wrong):

Server tx lifecycle relative to readConnection:

  • readConnection goroutine (per conn)
    • conn.Read(buf) → parseStream(data) → handler(msg) (= txl.handleMessage)
      • go handleRequestBackground(req) — NEW goroutine, detached from read loop
        • serverTxCreate → GetConnection(source) — retrieves SAME conn from pool
        • NewServerTx(key, req, conn) — tx holds that conn reference
        • txl.reqHandler(req, tx) — app handler runs in this goroutine

The go statement in handleMessage means the read loop doesn't block on server tx creation or the app handler; they run detached.

When connection dies:

  1. readConnection gets EOF
  2. Defers fire LIFO: onConnClose(conn) first, then pool.CloseAndDelete(conn, raddr)
  3. The server tx still holds the same conn pointer
  4. The next tx.Respond() call hits conn.WriteMsg() on the closed conn; the resulting error surfaces as server_input_transport_err, which the FSM handles by transitioning to Terminated

So server transactions already self-terminate on the next write attempt. The difference from client transactions: there's no Timer_B equivalent sitting idle for 32s. The server tx lifecycle is bounded by application code duration (handler runs, calls Respond(), fails, done). And TerminateGracefully() runs after the handler returns as a safety net.

The concern with including server tx: firing server_input_transport_err on connection close would race with the application handler goroutine that's still running. The handler might be mid-work (DB queries, external calls) before calling tx.Respond(). If I terminate the tx under it, the handler sees an unexpected terminated state. That's a behavioral change for application code.

It could still make sense as a separate opt-in, but it needs its own analysis since the failure mode is different: client tx = stuck waiting for a response that will never come, server tx = app handler racing with early termination. Happy to look into it as a follow-up if you think it's worth exploring.

@Odinvt (Contributor, Author) commented Mar 9, 2026

Actually thinking about this more, you're right. OnConnectionClose is on the transaction layer, not on individual transactions. Applications can't hook into it to clean up their server tx handlers.

Server transactions already have tx.Done() and tx.Err() as notification channels, and the FSM handles server_input_transport_err in every state. Terminating server tx on connection close would work: applications that select on tx.Done() get notified immediately, while applications that don't simply hit the write error on Respond() as before, which is harmless since the tx is already terminated by then. The fsmMu lock prevents the race between termination and Respond().

Up to you if you want me to add server transactions to the same flag in this PR or keep it as a follow-up.

@emiago (Owner) commented Mar 12, 2026

yeah, we should add server transactions as well. I see no reason not to add this.
The RFC sections you cited only say to react to a transport error when there is an actual request being sent, which is what the lib does. That is also the reason for the timers.

I think there could be some reason to do this, like your example, but as I said it should be a flag for now.
There is maybe a more generally correct approach: SIPgo transactions are not coupled to a connection. If you know SIP, it tries to carry network information to keep that P2P communication possible, and in some sense transactions are not tied to connection state, especially if you consider UDP.
In other words, a broken connection should be recoverable; but practically, with today's network complexity, it is mostly solved a different way or not possible at all, which is also why the SIPgo design gains some performance in most cases.

@Odinvt (Contributor, Author) commented Mar 12, 2026

Added server transaction termination under the same flag. Thanks for the feedback and the insight on SIP transaction/connection coupling, that is indeed a useful framing.

emiago merged commit 9bb53f1 into emiago:main on Mar 14, 2026
1 check passed