
Use Web Worker heartbeat to keep connection alive during blocking dialogs #5784

Open
evnchn wants to merge 3 commits into zauberzeug:main from evnchn:heartbeat-worker-keep-alive

@evnchn (Collaborator) commented Feb 14, 2026

Motivation

When blocking browser dialogs (alert(), confirm(), print()) freeze the main JavaScript thread, Socket.IO cannot respond to server pings. This causes the server to consider the client disconnected, eventually deleting the client and forcing a full page reload — losing all user state.

Addresses #2410

Implementation

  • Web Worker heartbeat (nicegui-heartbeat.js): A lightweight Web Worker runs on a separate thread, sending periodic HTTP POST requests to a new /_nicegui/heartbeat endpoint. Since Web Workers are unaffected by main thread blocking, heartbeats continue even during blocking dialogs.
  • Server-side heartbeat handler (nicegui.py): Receives heartbeat POSTs and calls client._handle_heartbeat(), which cancels any pending delete tasks to keep the client alive.
  • Stale next_message_id fix (nicegui.js): On disconnect, Socket.IO reconnects using the original query parameters from page load. After many messages, the old next_message_id has been pruned from the outbox history, causing try_rewind to fail and trigger a bare window.location.reload(). Fixed by updating options.query.next_message_id = window.nextMessageId in the disconnect handler.
  • reconnect_timeout passed to client JS (client.py, index.html): The heartbeat interval is derived from reconnect_timeout * 0.5 (minimum 0.5s) so the worker adapts to the configured timeout.
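
The interval rule in the last bullet can be sketched as plain arithmetic. This is a minimal illustration, not the PR's code: the function name `heartbeat_interval` and the `floor` parameter are hypothetical, and the PR may compute the value in client-side JavaScript rather than in Python.

```python
def heartbeat_interval(reconnect_timeout: float, floor: float = 0.5) -> float:
    """Return half the reconnect timeout, but never less than `floor` seconds.

    Hypothetical helper for illustration only; not part of NiceGUI.
    """
    return max(reconnect_timeout * 0.5, floor)


# A 3.0 s reconnect timeout yields a 1.5 s heartbeat interval.
default_interval = heartbeat_interval(3.0)
# Very short timeouts are clamped to the floor.
clamped_interval = heartbeat_interval(0.6)
```

Sending at half the timeout means at least one heartbeat should arrive within each timeout window even if a single request is delayed, while the floor bounds the request rate per client.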

Progress

  • I chose a meaningful title that completes the sentence: "If applied, this PR will..."
  • The implementation is complete.
  • If this PR addresses a security issue, it has been coordinated via the security advisory process.
  • Pytests have been added (or are not necessary).
  • Documentation has been added (or is not necessary).

…logs

When blocking browser dialogs (alert, confirm, print) freeze the main JS thread,
Socket.IO cannot respond to server pings, causing disconnection and page reload.
This adds a Web Worker that sends periodic HTTP heartbeat requests on a separate
thread, keeping the server-side client alive. Also fixes stale next_message_id
on reconnection by updating Socket.IO query params in the disconnect handler.

Closes zauberzeug#2410

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@evnchn added the "feature" label (Type/scope: New or intentionally changed behavior) Feb 14, 2026

@evnchn (Collaborator, Author) commented Feb 14, 2026

Testing method:

  • Open console
  • Turn on Preserve log
  • Run alert()
  • Have lunch
  • Return to press OK.
  • The page should NOT reload.

@evnchn (Collaborator, Author) commented Feb 14, 2026

What would happen if we moved the entire Socket.IO management into a Web Worker?

When the main thread is blocked, what happens to the messages though...

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@phifuh commented Feb 16, 2026

I can confirm this behaviour :D We had a simple JavaScript rename dialog, and just waiting a minute or so disconnected the client. Our solution was simple: use a real async ui.dialog() and await it.

Is it possible to add an async add_js_async element? Currently we are just injecting JS code, if I am not mistaken? Maybe this could be some sort of bridge, like your Web Worker idea.

@evnchn (Collaborator, Author) commented Feb 16, 2026

@phifuh On your add_js idea:

  • Async JS code will not salvage total-JS-blocking stuff like dialog(). If it did, we would not need a Web Worker.
  • As of writing, NiceGUI offers ui.run_javascript, which does not support getting the value of async JS*. Nevertheless, well-written userland JS can process user input asynchronously and submit the result to the server via emitEvent once it is available.

*: await ui.run_javascript('await new Promise(resolve => setTimeout(() => resolve("Hello World"), 100));') returns None...

Overall, I am not inclined to add another means of executing JS besides ui.run_javascript. Expanding its functionality to better deal with async JS may be a good idea, but that's for another PR.

@falkoschindler added the "review" label (Status: PR is open and needs review) Feb 17, 2026
@falkoschindler added this to the 3.10 milestone Feb 17, 2026

@evnchn (Collaborator, Author) commented Mar 29, 2026

@falkoschindler If memory serves, none of the files touched here were changed by what has been merged for 3.10, so this is a rather good candidate for 3.10 inclusion.

I'd imagine it's good to "change all files, but each just a little" over "10 PRs stabbed at a particular file" in one release because it gives us easier bisection for free.

@evnchn requested a review from falkoschindler March 29, 2026 18:49

@falkoschindler (Contributor) left a comment

Thanks for this PR — the Web Worker approach is a clever way to work around the main thread freeze.

I haven't done a detailed manual review yet because I had Claude look at it first, and it flagged a number of issues that seem worth addressing before I dig deeper. Posting them below so you can take a pass at them.

It would also be great if the PR description included a short "how to test manually" section (e.g., open a page, trigger alert(), wait past the reconnect timeout, dismiss, and verify state is preserved).

  1. Heartbeat cancels delete tasks permanently — client may never be cleaned up

    _handle_heartbeat() cancels all pending delete tasks but does not reschedule them. If the browser is truly gone (e.g., crashed with the worker still running briefly, or a misbehaving client), the delete task is cancelled and never recreated. The client instance will leak in Client.instances indefinitely — no mechanism re-creates the delete task after a heartbeat.

    The heartbeat should reset the timer rather than cancel it outright. Consider replacing the cancel-only approach with a "reschedule" pattern: cancel the current delete task and create a new one that sleeps for another reconnect_timeout period. This way, if heartbeats stop, cleanup still happens.

    def _handle_heartbeat(self) -> None:
        """Reset pending delete tasks to keep the client alive during blocking browser dialogs."""
        for document_id in list(self._delete_tasks):
            self._reschedule_delete_task(document_id)
  2. Heartbeat worker never stops on normal page unload

    The worker is started on page load but there is no beforeunload or pagehide listener to send a stop message to the worker. While the worker is garbage-collected when the page is destroyed in most browsers, this is not guaranteed (especially with bfcache). Adding a cleanup listener would be defensive and cheap:

    window.addEventListener("pagehide", () => {
      if (window.heartbeatWorker) {
        window.heartbeatWorker.postMessage({ type: "stop" });
        window.heartbeatWorker.terminate();
      }
    });
  3. On Air compatibility not verified

    With On Air, the browser talks to a relay server. The heartbeat POST goes to window.location.origin (the relay), which needs to proxy it back to the NiceGUI server via the _handle_http handler in air.py. This should work since the relay proxies arbitrary HTTP requests, but it hasn't been tested. Please manually verify heartbeat behavior with on_air=True and confirm it works.

  4. Heartbeat URL doesn't account for reverse proxy / non-origin deployments

    The heartbeat URL is built as window.location.origin + options.prefix + ..., but window.location.origin may not be correct behind a reverse proxy with path rewriting. Other NiceGUI network calls go through Socket.IO (which uses the configured path), so they don't have this issue. Consider using a relative URL instead:

    url: `${options.prefix}/_nicegui/heartbeat`,

    fetch() with a relative URL will use the page's origin automatically and works correctly behind proxies.

  5. Heartbeat interval may be too aggressive for large deployments

    With reconnect_timeout=3.0, the heartbeat fires every 1.5 seconds. For apps with many concurrent clients, this adds significant HTTP overhead. The minimum of 0.5s seems very low. Consider a higher floor (e.g., 2-3s minimum) and document the tradeoff.

  6. Tests use long time.sleep() calls (15s)

    test_connection_survives_alert_with_high_reconnect_timeout sleeps for 15 seconds in a single test. This will significantly slow down the test suite. Consider reducing reconnect_timeout and ping_interval/ping_timeout for this test to keep the sleep under 5s, or mark the test as @pytest.mark.slow if such a marker exists.

  7. window.heartbeatWorker pollutes the global namespace

    Consider using a local variable inside createApp instead of attaching to window, unless it needs to be accessed externally (e.g., for debugging). If it does, document why.

  8. .catch(() => {}) silently swallows all fetch errors in the worker

    This is understandable (heartbeat failures are expected during shutdown), but a console.debug would help with debugging connectivity issues during development.

  9. Test uses counter.__setitem__ lambda hack

    In test_connection_survives_alert_with_high_reconnect_timeout, this pattern is unnecessarily obscure:

    lambda: label.set_text(str(counter.__setitem__('value', counter['value'] + 1) or counter['value']))

    A simple helper function or nonlocal variable would be clearer.
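
The "reschedule" pattern suggested in item 1 can be sketched with plain asyncio. This is a minimal, self-contained illustration under stated assumptions: `FakeClient`, `schedule_delete`, and `handle_heartbeat` are hypothetical names, and the real NiceGUI client tracks delete tasks per document ID rather than as a single task.

```python
import asyncio
from typing import Optional, Tuple


class FakeClient:
    """Hypothetical stand-in for a NiceGUI client; not the real class."""

    def __init__(self, reconnect_timeout: float) -> None:
        self.reconnect_timeout = reconnect_timeout
        self.deleted = False
        self._delete_task: Optional[asyncio.Task] = None

    def schedule_delete(self) -> None:
        """Cancel any pending delete and start a fresh countdown."""
        if self._delete_task is not None:
            self._delete_task.cancel()
        self._delete_task = asyncio.create_task(self._delete_after_timeout())

    async def _delete_after_timeout(self) -> None:
        await asyncio.sleep(self.reconnect_timeout)
        self.deleted = True

    def handle_heartbeat(self) -> None:
        """Reset the countdown instead of cancelling it outright."""
        if self._delete_task is not None and not self._delete_task.done():
            self.schedule_delete()


async def demo() -> Tuple[bool, bool]:
    client = FakeClient(reconnect_timeout=0.3)
    client.schedule_delete()      # socket dropped: countdown starts
    for _ in range(3):            # heartbeats arrive every 0.1 s
        await asyncio.sleep(0.1)
        client.handle_heartbeat()
    alive_during_heartbeats = not client.deleted
    await asyncio.sleep(0.6)      # heartbeats stop: countdown expires
    return alive_during_heartbeats, client.deleted


result = asyncio.run(demo())
print(result)
```

While heartbeats keep arriving, each one replaces the pending task with a new one, so the client survives. Once they stop, the last scheduled task runs to completion and cleanup still happens, which is the property the cancel-only approach lacks.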

@falkoschindler modified the milestones: 3.10, 3.11 Apr 1, 2026
…elative URL, raise interval floor

- Heartbeat reschedules delete tasks instead of permanently canceling (prevents client leak)
- Add pagehide listener to stop/terminate worker on navigation
- Use relative heartbeat URL (works behind reverse proxies)
- Raise minimum heartbeat interval from 0.5s to 2s
- Reduce test sleep times (15s -> 5s)
- Replace counter.__setitem__ hack with simple list+helper

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
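
For reference, a "list + helper" replacement for the `counter.__setitem__` lambda might look like the sketch below. The PR's actual test code is not shown in this conversation, so everything here is an assumption: `FakeLabel` is a hypothetical stand-in that lets the sketch run without NiceGUI, and `increment` is an illustrative name.

```python
class FakeLabel:
    """Hypothetical stand-in for a ui.label; only mimics set_text()."""

    def __init__(self) -> None:
        self.text = ''

    def set_text(self, text: str) -> None:
        self.text = text


label = FakeLabel()
counter = [0]  # one-element list: mutable from the helper without `nonlocal`


def increment() -> None:
    """Bump the counter and mirror the new value in the label."""
    counter[0] += 1
    label.set_text(str(counter[0]))


# In a test, `increment` would be passed as the click handler
# in place of the lambda.
increment()
increment()
```

The helper does the same two steps as the lambda (update, then display) but as separate statements, so the intent is visible without relying on `__setitem__` returning `None`.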