Skip to content

Handle socket timeouts as non error to prevent leds to turn off/on#141

Open
marcoh242 wants to merge 5 commits intowebosbrew:mainfrom
marcoh242:fix-timeout-errors
Open

Handle socket timeouts as non error to prevent leds to turn off/on#141
marcoh242 wants to merge 5 commits intowebosbrew:mainfrom
marcoh242:fix-timeout-errors

Conversation

@marcoh242
Copy link
Copy Markdown

@marcoh242 marcoh242 commented Mar 30, 2026

Treat socket timeouts (EAGAIN/EWOULDBLOCK) as a non-error in hyperion_read(), returning 0 instead of -1. This prevents unnecessary disconnects when no data is available yet.

This fix was done by Claude 4.6

Instead of a timeout error where the leds go off shortly I only saw a small lag in Hyperion.NG live preview. Due to smoothing it wasnt even visible for me.

2026-03-30T20:41:04.387014Z [1894.501351000] user.debug hyperion-webos [] hyperion-webos DBGMSG {} unicapture_run: Framerate: 29.905573 FPS; timings - wait: 33533us, acquire: 88us, convert: 15us, process; 8us, send: 463us, release: 0us
2026-03-30T20:41:04.387104Z [1894.501434660] user.debug hyperion-webos [] hyperion-webos DBGMSG {} unicapture_run:         UI: pixfmt: 0; 0x0
2026-03-30T20:41:04.387147Z [1894.501478320] user.debug hyperion-webos [] hyperion-webos DBGMSG {} unicapture_run:      VIDEO: pixfmt: 2; 320x180
2026-03-30T20:41:04.387173Z [1894.501503500] user.debug hyperion-webos [] hyperion-webos DBGMSG {} unicapture_run: CONV    UI: pixfmt: 0; 0x0
2026-03-30T20:41:04.387196Z [1894.501526780] user.debug hyperion-webos [] hyperion-webos DBGMSG {} unicapture_run: CONV VIDEO: pixfmt: 2; 320x180
2026-03-30T20:41:06.392949Z [1896.507286520] user.debug hyperion-webos [] hyperion-webos DBGMSG {} unicapture_run: Framerate: 29.913967 FPS; timings - wait: 32784us, acquire: 1026us, convert: 13us, process; 9us, send: 444us, release: 1us
2026-03-30T20:41:06.393041Z [1896.507372200] user.debug hyperion-webos [] hyperion-webos DBGMSG {} unicapture_run:         UI: pixfmt: 0; 0x0
2026-03-30T20:41:06.393070Z [1896.507401280] user.debug hyperion-webos [] hyperion-webos DBGMSG {} unicapture_run:      VIDEO: pixfmt: 2; 320x180
2026-03-30T20:41:06.393094Z [1896.507425200] user.debug hyperion-webos [] hyperion-webos DBGMSG {} unicapture_run: CONV    UI: pixfmt: 0; 0x0
2026-03-30T20:41:06.393116Z [1896.507446600] user.debug hyperion-webos [] hyperion-webos DBGMSG {} unicapture_run: CONV VIDEO: pixfmt: 2; 320x180

In this log you can see acquire timing gone up to 1026 when i saw a short lag. I guess this created a error before this fix.
Instead of this change a Option to increase the timeout limit should also work for users with this problem.

Treat socket timeouts (EAGAIN/EWOULDBLOCK) as a non-error in hyperion_read(), returning 0 instead of -1. This prevents unnecessary disconnects when no data is available yet.

This fix was done by Claude 4.6
@marcoh242 marcoh242 changed the title Update hyperion_client.c Handle socket timeouts as non error to prevent leds to turn off/on Mar 30, 2026
@TBSniller
Copy link
Copy Markdown
Collaborator

TBSniller commented Mar 31, 2026

@marcoh242 Thanks for the PR!

Unfortunately I remember that we had an issue where hyperion-webos needeed a full restart after a short network disconnect or longer idle period. I think this was the supposed fix to this problem.

While your finding is still a bug, it might need a different approach. If you want to continue working on this I created a Claude planning prompt (/plan) for you with ChatGPT:

Please produce a minimal, correct patch for webosbrew/hyperion-webos to make hyperion_read() robust for stream sockets. Please check the repository and create a plan first. Have a critical view on the following problems/requirements and determinate if these really are problems that need to be fixed. Don't create workarounds. Tell if you would propose a different approach.

Current Problem main problem:
- `src/hyperion_client.c` triggers a reconnect when no data is available. Back then it was intended as on a short network disconnect/longer idle period the stream was broken which required a application Restart.

Further problems which need to be checked to create a full fix for both problems (:
- src/hyperion_client.c assumes one read() returns the full 4-byte header and one read() returns the full body.
- src/service.c reconnects whenever hyperion_read() returns < 0.
- EAGAIN/EWOULDBLOCK from SO_RCVTIMEO should not be treated as a disconnect.
- EOF, invalid frame size, and real socket errors should still be fatal.

Requirements:
- handle short reads for header and body
- handle EINTR
- treat EAGAIN/EWOULDBLOCK as non-fatal
- keep partial receive state across calls
- reset rx state on connect/destroy
- validate frame size before body read
- keep outer reconnect model unchanged
- keep patch small and reviewable

Please output:
1) brief analysis of current bug
2) proposed design
3) concise rationale
4) Implementation plan

@marcoh242
Copy link
Copy Markdown
Author

marcoh242 commented Mar 31, 2026

@TBSniller

I followed your suggestions and let Claude implemented the generated plan which resulted in the same initially error i got.
After a second interation it fixed my issue again as far as I can tell.
I updated the branch with that generated code.
Idk if it produced something to consider 😅

Plan: Robust hyperion_read() for stream sockets

TL;DR

Make hyperion_read() robust against short reads, EINTR, and EAGAIN/EWOULDBLOCK on stream sockets by introducing a small state machine with persistent rx state, while keeping the outer reconnect model in service.c unchanged.

1. Bug Analysis

Main bug (confirmed): hyperion_read() with the current EAGAIN fix returns 0 mid-header or mid-body read if a timeout fires between the header and body reads. This silently discards partial data — the next call starts fresh and interprets stale/partial bytes as a new header, corrupting the stream.

Short reads (confirmed real problem): TCP read() can return fewer bytes than requested. A 4-byte header read could return 1–3 bytes; a body read could return a partial frame. The current code does not loop to complete the read, so a short read corrupts the framing.

EINTR (confirmed real problem): read() can return -1/EINTR if a signal is delivered. This is transient and should be retried, not treated as fatal.

EAGAIN/EWOULDBLOCK (confirmed real problem, partially fixed): The current patch handles this but is incomplete — returning 0 after reading a partial header/body discards the partial data.

EOF handling (needs fix): read() returning 0 means peer closed the connection (EOF). Current code doesn't check for this — it would proceed to parse headbuff with whatever was in it.

service.c reconnect logic (no change needed): The connection_loop() already correctly reconnects when hyperion_read() < 0 and stays connected when it returns 0. This contract is correct and should be preserved.

2. Proposed Design

Introduce a persistent receive state struct and a helper _read_exact() that handles short reads, EINTR, and EAGAIN in one place. hyperion_read() becomes a two-phase state machine (HEADER → BODY → dispatch → reset to HEADER).

Return values (unchanged contract):

  • 0 = OK (message received, or no data yet — EAGAIN/timeout)
  • -1 = fatal error (EOF, socket error, invalid frame) → caller reconnects

New static state:

enum rx_phase { RX_HEADER, RX_BODY };

static struct {
    enum rx_phase phase;
    uint8_t  header[4];
    uint32_t body_len;
    uint32_t received;     // bytes received so far in current phase
} rx;

Helper _read_exact():

Returns:

  • > 0 = number of bytes read this call (may be partial)
  • 0 = EAGAIN/EWOULDBLOCK (no data, non-fatal)
  • -1 = fatal (EOF or real error)

Handles EINTR internally by retrying.

hyperion_read() logic:

  1. If RX_HEADER: read into rx.header + rx.received, up to 4 - rx.received bytes.
    • On EAGAIN with 0 bytes so far → return 0 (no data yet)
    • On EAGAIN with partial header → return 0 (will resume next call)
    • When 4 bytes complete → decode body_len, validate, transition to RX_BODY
  2. If RX_BODY: read into recvBuff + rx.received, up to rx.body_len - rx.received bytes.
    • On EAGAIN → return 0 (will resume next call)
    • When complete → _parse_reply(), reset to RX_HEADER, return 0
  3. Any fatal error or EOF in either phase → return -1

Reset points:

  • hyperion_client() → reset rx state (new connection)
  • hyperion_destroy() → reset rx state

3. Rationale

Why a state machine instead of just looping inside read()?
The socket has SO_RCVTIMEO set to 1 second. If we loop blocking inside hyperion_read() waiting for all bytes, we could block indefinitely and starve the connection_loop_running check in service.c. A non-blocking state machine lets hyperion_read() return promptly on timeout, resume on next call, and still allows the outer loop to check its stop flag.

Why not change service.c?
The reconnect model is correct: -1 = broken connection, 0 = alive. No change needed.

Why not use non-blocking sockets + poll()?
That would be a larger refactor. SO_RCVTIMEO + state machine achieves the same robustness with minimal code change.

4. Critical Assessment of Requirements

Requirement Real problem? Action
Handle short reads YES — TCP can fragment Fix with _read_exact + state
Handle EINTR YES — signals interrupt read Fix in _read_exact
EAGAIN/EWOULDBLOCK non-fatal YES — SO_RCVTIMEO causes this Fix with state machine
Keep partial state across calls YES — required by above rx struct
Reset rx on connect/destroy YES — stale state = corruption Reset in both
Validate frame size YES — already done, keep it Validate after header complete
Keep reconnect model YES — it's correct as-is No changes to service.c
EOF detection YES — missing today read() == 0 → return -1

Steps

Phase 1: hyperion_client.c changes

  1. Add rx_phase enum and rx state struct as static variables (after existing statics)

    • depends on nothing
  2. Add static _rx_reset() helper to zero the rx state

    • depends on step 1
  3. Add static _read_exact(int fd, void* buf, size_t len, size_t* received) helper

    • Calls read(fd, buf + *received, len - *received)
    • On EINTR → retry (loop)
    • On EAGAIN/EWOULDBLOCK → return 0
    • On n == 0 (EOF) → return -1
    • On n < 0 (other error) → return -1
    • On n > 0*received += n, return n
    • depends on nothing
  4. Rewrite hyperion_read() as a state machine using _read_exact() and the rx struct

    • depends on steps 1-3
  5. Call _rx_reset() at end of hyperion_client() (after socket connects) and in hyperion_destroy()

    • depends on step 2

Phase 2: No changes to service.c

  • The return value contract is preserved: 0 = OK, -1 = fatal

Phase 3: No changes to hyperion_client.h

  • Public API unchanged

2nd Iteration:

Let me look at the socket setup code to understand how SO_RCVTIMEO works on this platform:

Searched for regex and reviewed hyperion_client.c
I see the issue. My _read_exact() treats read() returning 0 as a fatal EOF. However, on some embedded Linux kernels (like the one webOS uses), SO_RCVTIMEO can cause read() to return 0 instead of -1 with EAGAIN when the timeout fires. The old code didn't check n == 0, so it just kept going — my code treats it as a peer disconnect and returns -1.

The fix: treat n == 0 as a timeout (same as EAGAIN) rather than EOF. Broken connections will be detected on the next send() call instead.

Updated hyperion_client.c
File editing tools are currently disabled. Here's the one-line change needed in hyperion_client.c:62:

Root cause: On webOS's embedded Linux kernel, read() with SO_RCVTIMEO may return 0 on timeout instead of -1 with EAGAIN. My _read_exact() treats n == 0 as EOF (peer closed) and returns -1, which service.c interprets as a fatal error and triggers a reconnect. The old simple fix didn't check n == 0, so it just kept going.

Fix: Change line 62 from:

        if (n == 0)            return -1; /* EOF — peer closed */
to:

        if (n == 0)            return 0; /* timeout or EOF — treat as transient */
This makes _read_exact() treat n == 0 the same as EAGAIN — return 0 (try again later). If the peer truly closed the connection, the next hyperion_set_image() call from the capture thread will fail on send(), which handles detection through a different path.

@TBSniller
Copy link
Copy Markdown
Collaborator

TBSniller commented Mar 31, 2026

I think this looks better, but the 2nd iteratrion still has brought back the missing reconnect issue.
I think we need an approach for both cases. Maybe with a forced reconnect after certain timeout?

Please review the current receive-path implementation in webosbrew/hyperion-webos and propose the safest minimal fix based on the actual control flow.

Context:
- The current patch already added a good RX state machine for partial header/body reads.
- Short reads must not lose state.
- EAGAIN/EWOULDBLOCK must not immediately trigger reconnect.
- On this target, treating `read() == 0` as fatal caused the original issue to reappear for the reporter.
- But treating `read() == 0` / timeout-like events as always transient may suppress reconnect forever if the connection becomes stale or Hyperion/server-side idle handling effectively drops the session.
- `service.c` reconnects only when `hyperion_read()` returns < 0.
- Write/send failures are not the main reconnect driver here.

What I want you to do:
1. Re-evaluate the current code path end-to-end, not just generic socket theory.
2. Check whether the current implementation can get stuck forever in a non-fatal “no progress” state.
3. Propose the safest minimal fix that preserves:
   - partial-read RX state
   - non-fatal short timeouts
   - eventual reconnect for stale/dead connections
4. If appropriate, prefer a small “stale connection / no progress timeout” approach over immediate fatal handling.
5. Then implement the patch.

Please output:
- Findings
- Recommended fix
- Unified diff patch
- Brief rationale

Important constraints:
- Keep the current RX state machine idea.
- Keep the outer reconnect model in service.c unchanged if possible.
- Do not hand-wave.
- Base your proposal on this repository’s actual reconnect behavior.
- Keep in mind that we need to fix two things: Too fast reconnect in no data case and no reconnect causing a dropped connection on hyperion side

1. Add `#include <time.h>` to system includes
2. Add `#define RX_STALE_SECS 120` after `recvBuff`
3. Add `static time_t rx_last_data;` after `rx` struct
4. Add `rx_last_data = time(NULL);` to `_rx_reset()`
5. Add `_check_stale()` helper after `_read_exact()`, before `hyperion_client()`
6. Modify `hyperion_read()`:
   - After `_read_exact()` returns > 0: `rx_last_data = time(NULL);`
   - Replace `return 0` at idle exits with `return _check_stale();`
@marcoh242
Copy link
Copy Markdown
Author

Branch is updated. Here is the output:

Plan: Robust hyperion_read() for stream sockets

TL;DR

Make hyperion_read() robust against short reads, EINTR, and EAGAIN/EWOULDBLOCK on stream sockets by introducing a small state machine with persistent rx state, while keeping the outer reconnect model in service.c unchanged.

1. Bug Analysis

Main bug (confirmed): hyperion_read() with the current EAGAIN fix returns 0 mid-header or mid-body read if a timeout fires between the header and body reads. This silently discards partial data — the next call starts fresh and interprets stale/partial bytes as a new header, corrupting the stream.

Short reads (confirmed real problem): TCP read() can return fewer bytes than requested. A 4-byte header read could return 1–3 bytes; a body read could return a partial frame. The current code does not loop to complete the read, so a short read corrupts the framing.

EINTR (confirmed real problem): read() can return -1/EINTR if a signal is delivered. This is transient and should be retried, not treated as fatal.

EAGAIN/EWOULDBLOCK (confirmed real problem, partially fixed): The current patch handles this but is incomplete — returning 0 after reading a partial header/body discards the partial data.

EOF handling (needs fix): read() returning 0 means peer closed the connection (EOF). Current code doesn't check for this — it would proceed to parse headbuff with whatever was in it.

service.c reconnect logic (no change needed): The connection_loop() already correctly reconnects when hyperion_read() < 0 and stays connected when it returns 0. This contract is correct and should be preserved.

2. Proposed Design

Introduce a persistent receive state struct and a helper _read_exact() that handles short reads, EINTR, and EAGAIN in one place. hyperion_read() becomes a two-phase state machine (HEADER → BODY → dispatch → reset to HEADER).

Return values (unchanged contract):

  • 0 = OK (message received, or no data yet — EAGAIN/timeout)
  • -1 = fatal error (EOF, socket error, invalid frame) → caller reconnects

New static state:

enum rx_phase { RX_HEADER, RX_BODY };

static struct {
    enum rx_phase phase;
    uint8_t  header[4];
    uint32_t body_len;
    uint32_t received;     // bytes received so far in current phase
} rx;

Helper _read_exact():

Returns:

  • > 0 = number of bytes read this call (may be partial)
  • 0 = EAGAIN/EWOULDBLOCK (no data, non-fatal)
  • -1 = fatal (EOF or real error)

Handles EINTR internally by retrying.

hyperion_read() logic:

  1. If RX_HEADER: read into rx.header + rx.received, up to 4 - rx.received bytes.
    • On EAGAIN with 0 bytes so far → return 0 (no data yet)
    • On EAGAIN with partial header → return 0 (will resume next call)
    • When 4 bytes complete → decode body_len, validate, transition to RX_BODY
  2. If RX_BODY: read into recvBuff + rx.received, up to rx.body_len - rx.received bytes.
    • On EAGAIN → return 0 (will resume next call)
    • When complete → _parse_reply(), reset to RX_HEADER, return 0
  3. Any fatal error or EOF in either phase → return -1

Reset points:

  • hyperion_client() → reset rx state (new connection)
  • hyperion_destroy() → reset rx state

3. Rationale

Why a state machine instead of just looping inside read()?
The socket has SO_RCVTIMEO set to 1 second. If we loop blocking inside hyperion_read() waiting for all bytes, we could block indefinitely and starve the connection_loop_running check in service.c. A non-blocking state machine lets hyperion_read() return promptly on timeout, resume on next call, and still allows the outer loop to check its stop flag.

Why not change service.c?
The reconnect model is correct: -1 = broken connection, 0 = alive. No change needed.

Why not use non-blocking sockets + poll()?
That would be a larger refactor. SO_RCVTIMEO + state machine achieves the same robustness with minimal code change.

4. Critical Assessment of Requirements

Requirement Real problem? Action
Handle short reads YES — TCP can fragment Fix with _read_exact + state
Handle EINTR YES — signals interrupt read Fix in _read_exact
EAGAIN/EWOULDBLOCK non-fatal YES — SO_RCVTIMEO causes this Fix with state machine
Keep partial state across calls YES — required by above rx struct
Reset rx on connect/destroy YES — stale state = corruption Reset in both
Validate frame size YES — already done, keep it Validate after header complete
Keep reconnect model YES — it's correct as-is No changes to service.c
EOF detection YES — missing today read() == 0 → return -1

Completed: State machine + partial-read handling (already implemented)

The rx state machine, _read_exact(), _rx_reset() calls in hyperion_client()/hyperion_destroy() are already in place.

Remaining fix: stale-connection detection via time-based idle check

Problem

Current code treats _read_exact() → 0 (both EAGAIN and EOF) as always transient. This prevents reconnect when Hyperion actually drops the connection (FIN) or the connection is half-open.

Solution

Add time_t rx_last_data timestamp. Reset on connect/data. After RX_STALE_SECS (120s) of no received bytes, hyperion_read() returns -1.

Steps

  1. Add #include <time.h> to system includes
  2. Add #define RX_STALE_SECS 120 after recvBuff
  3. Add static time_t rx_last_data; after rx struct
  4. Add rx_last_data = time(NULL); to _rx_reset()
  5. Add _check_stale() helper after _read_exact(), before hyperion_client()
  6. Modify hyperion_read():
    • After _read_exact() returns > 0: rx_last_data = time(NULL);
    • Replace return 0 at idle exits with return _check_stale();

No changes to:

  • _read_exact() (n==0 stays transient)
  • service.c (return contract preserved)
  • hyperion_client.h (public API unchanged)

Verification

  1. Build: wsl bash /mnt/c/Users/xxx/webos/piccap/build_backend.sh
  2. Review: hyperion_read() returns 0 on short timeouts, -1 after 120s idle or on socket error
  3. Confirm _rx_reset() resets both rx state and timestamp
  4. Confirm partial-read state is preserved across calls

@TBSniller
Copy link
Copy Markdown
Collaborator

@marcoh242 Nice, thank you! We might just need to lower the timeout value a bit.

It looks good now for me, but maybe one of the experienced devs @sundermann @tuxuser @Informatic @mariotaku can have a quick eye on this

@TBSniller
Copy link
Copy Markdown
Collaborator

@marcoh242 and a few linting fixes are needed

@marcoh242
Copy link
Copy Markdown
Author

@TBSniller Should be fine now

@TBSniller
Copy link
Copy Markdown
Collaborator

@codex review

1 similar comment
@TBSniller
Copy link
Copy Markdown
Collaborator

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9e92d0f6d4

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@TBSniller
Copy link
Copy Markdown
Collaborator

TBSniller commented Apr 6, 2026

@marcoh242 Can you set the timeout to like 5 or less seconds and test for your fix again? 120s are quite long and maybe not needed. In worst case this would mean the user needs to wait 120 seconds till his LEDs are working again in case of a short connection loss. If it still works for you we are done I guess

https://github.com/webosbrew/hyperion-webos/pull/141/changes#diff-20a731bde2aa57b022b9b736d666ebb430e93587d3dd48d62e3d5e13a70fc265R32

@marcoh242
Copy link
Copy Markdown
Author

Ok gonna test with 5 seconds and see what happens when i change WiFi Channel or restart hyperion

@marcoh242
Copy link
Copy Markdown
Author

I tested it a while and my issue with the leds turning on/off is gone but If I force a network disconnect by shortly unplugging the network I get the issue u mentioned at first. For me personally its still an improvement but I dont have enough knowledge to work on that further.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants