Handle socket timeouts as non error to prevent leds to turn off/on by marcoh242 · Pull Request #141 · webosbrew/hyperion-webos

marcoh242 · 2026-03-30T21:47:21Z

Treat socket timeouts (EAGAIN/EWOULDBLOCK) as a non-error in hyperion_read(), returning 0 instead of -1. This prevents unnecessary disconnects when no data is available yet.

This fix was done by Claude 4.6

Instead of a timeout error where the leds go off shortly I only saw a small lag in Hyperion.NG live preview. Due to smoothing it wasnt even visible for me.

2026-03-30T20:41:04.387014Z [1894.501351000] user.debug hyperion-webos [] hyperion-webos DBGMSG {} unicapture_run: Framerate: 29.905573 FPS; timings - wait: 33533us, acquire: 88us, convert: 15us, process; 8us, send: 463us, release: 0us
2026-03-30T20:41:04.387104Z [1894.501434660] user.debug hyperion-webos [] hyperion-webos DBGMSG {} unicapture_run:         UI: pixfmt: 0; 0x0
2026-03-30T20:41:04.387147Z [1894.501478320] user.debug hyperion-webos [] hyperion-webos DBGMSG {} unicapture_run:      VIDEO: pixfmt: 2; 320x180
2026-03-30T20:41:04.387173Z [1894.501503500] user.debug hyperion-webos [] hyperion-webos DBGMSG {} unicapture_run: CONV    UI: pixfmt: 0; 0x0
2026-03-30T20:41:04.387196Z [1894.501526780] user.debug hyperion-webos [] hyperion-webos DBGMSG {} unicapture_run: CONV VIDEO: pixfmt: 2; 320x180
2026-03-30T20:41:06.392949Z [1896.507286520] user.debug hyperion-webos [] hyperion-webos DBGMSG {} unicapture_run: Framerate: 29.913967 FPS; timings - wait: 32784us, acquire: 1026us, convert: 13us, process; 9us, send: 444us, release: 1us
2026-03-30T20:41:06.393041Z [1896.507372200] user.debug hyperion-webos [] hyperion-webos DBGMSG {} unicapture_run:         UI: pixfmt: 0; 0x0
2026-03-30T20:41:06.393070Z [1896.507401280] user.debug hyperion-webos [] hyperion-webos DBGMSG {} unicapture_run:      VIDEO: pixfmt: 2; 320x180
2026-03-30T20:41:06.393094Z [1896.507425200] user.debug hyperion-webos [] hyperion-webos DBGMSG {} unicapture_run: CONV    UI: pixfmt: 0; 0x0
2026-03-30T20:41:06.393116Z [1896.507446600] user.debug hyperion-webos [] hyperion-webos DBGMSG {} unicapture_run: CONV VIDEO: pixfmt: 2; 320x180

In this log you can see acquire timing gone up to 1026 when i saw a short lag. I guess this created a error before this fix.
Instead of this change a Option to increase the timeout limit should also work for users with this problem.

Treat socket timeouts (EAGAIN/EWOULDBLOCK) as a non-error in hyperion_read(), returning 0 instead of -1. This prevents unnecessary disconnects when no data is available yet. This fix was done by Claude 4.6

TBSniller · 2026-03-31T10:42:07Z

@marcoh242 Thanks for the PR!

Unfortunately I remember that we had an issue where hyperion-webos needeed a full restart after a short network disconnect or longer idle period. I think this was the supposed fix to this problem.

While your finding is still a bug, it might need a different approach. If you want to continue working on this I created a Claude planning prompt (/plan) for you with ChatGPT:

Please produce a minimal, correct patch for webosbrew/hyperion-webos to make hyperion_read() robust for stream sockets. Please check the repository and create a plan first. Have a critical view on the following problems/requirements and determinate if these really are problems that need to be fixed. Don't create workarounds. Tell if you would propose a different approach.

Current Problem main problem:
- `src/hyperion_client.c` triggers a reconnect when no data is available. Back then it was intended as on a short network disconnect/longer idle period the stream was broken which required a application Restart.

Further problems which need to be checked to create a full fix for both problems (:
- src/hyperion_client.c assumes one read() returns the full 4-byte header and one read() returns the full body.
- src/service.c reconnects whenever hyperion_read() returns < 0.
- EAGAIN/EWOULDBLOCK from SO_RCVTIMEO should not be treated as a disconnect.
- EOF, invalid frame size, and real socket errors should still be fatal.

Requirements:
- handle short reads for header and body
- handle EINTR
- treat EAGAIN/EWOULDBLOCK as non-fatal
- keep partial receive state across calls
- reset rx state on connect/destroy
- validate frame size before body read
- keep outer reconnect model unchanged
- keep patch small and reviewable

Please output:
1) brief analysis of current bug
2) proposed design
3) concise rationale
4) Implementation plan

see: webosbrew#141

marcoh242 · 2026-03-31T17:12:45Z

@TBSniller

I followed your suggestions and let Claude implemented the generated plan which resulted in the same initially error i got.
After a second interation it fixed my issue again as far as I can tell.
I updated the branch with that generated code.
Idk if it produced something to consider 😅

Plan: Robust hyperion_read() for stream sockets

TL;DR

Make hyperion_read() robust against short reads, EINTR, and EAGAIN/EWOULDBLOCK on stream sockets by introducing a small state machine with persistent rx state, while keeping the outer reconnect model in service.c unchanged.

1. Bug Analysis

Main bug (confirmed): hyperion_read() with the current EAGAIN fix returns 0 mid-header or mid-body read if a timeout fires between the header and body reads. This silently discards partial data — the next call starts fresh and interprets stale/partial bytes as a new header, corrupting the stream.

Short reads (confirmed real problem): TCP read() can return fewer bytes than requested. A 4-byte header read could return 1–3 bytes; a body read could return a partial frame. The current code does not loop to complete the read, so a short read corrupts the framing.

EINTR (confirmed real problem): read() can return -1/EINTR if a signal is delivered. This is transient and should be retried, not treated as fatal.

EAGAIN/EWOULDBLOCK (confirmed real problem, partially fixed): The current patch handles this but is incomplete — returning 0 after reading a partial header/body discards the partial data.

EOF handling (needs fix): read() returning 0 means peer closed the connection (EOF). Current code doesn't check for this — it would proceed to parse headbuff with whatever was in it.

service.c reconnect logic (no change needed): The connection_loop() already correctly reconnects when hyperion_read() < 0 and stays connected when it returns 0. This contract is correct and should be preserved.

2. Proposed Design

Introduce a persistent receive state struct and a helper _read_exact() that handles short reads, EINTR, and EAGAIN in one place. hyperion_read() becomes a two-phase state machine (HEADER → BODY → dispatch → reset to HEADER).

Return values (unchanged contract):

0 = OK (message received, or no data yet — EAGAIN/timeout)
-1 = fatal error (EOF, socket error, invalid frame) → caller reconnects

New static state:

enum rx_phase { RX_HEADER, RX_BODY };

static struct {
    enum rx_phase phase;
    uint8_t  header[4];
    uint32_t body_len;
    uint32_t received;     // bytes received so far in current phase
} rx;

Helper _read_exact():

Returns:

> 0 = number of bytes read this call (may be partial)
0 = EAGAIN/EWOULDBLOCK (no data, non-fatal)
-1 = fatal (EOF or real error)

Handles EINTR internally by retrying.

hyperion_read() logic:

If RX_HEADER: read into rx.header + rx.received, up to 4 - rx.received bytes.
- On EAGAIN with 0 bytes so far → return 0 (no data yet)
- On EAGAIN with partial header → return 0 (will resume next call)
- When 4 bytes complete → decode body_len, validate, transition to RX_BODY
If RX_BODY: read into recvBuff + rx.received, up to rx.body_len - rx.received bytes.
- On EAGAIN → return 0 (will resume next call)
- When complete → _parse_reply(), reset to RX_HEADER, return 0
Any fatal error or EOF in either phase → return -1

Reset points:

hyperion_client() → reset rx state (new connection)
hyperion_destroy() → reset rx state

3. Rationale

Why a state machine instead of just looping inside read()?
The socket has SO_RCVTIMEO set to 1 second. If we loop blocking inside hyperion_read() waiting for all bytes, we could block indefinitely and starve the connection_loop_running check in service.c. A non-blocking state machine lets hyperion_read() return promptly on timeout, resume on next call, and still allows the outer loop to check its stop flag.

Why not change service.c?
The reconnect model is correct: -1 = broken connection, 0 = alive. No change needed.

Why not use non-blocking sockets + poll()?
That would be a larger refactor. SO_RCVTIMEO + state machine achieves the same robustness with minimal code change.

4. Critical Assessment of Requirements

Requirement	Real problem?	Action
Handle short reads	YES — TCP can fragment	Fix with _read_exact + state
Handle EINTR	YES — signals interrupt read	Fix in _read_exact
EAGAIN/EWOULDBLOCK non-fatal	YES — SO_RCVTIMEO causes this	Fix with state machine
Keep partial state across calls	YES — required by above	rx struct
Reset rx on connect/destroy	YES — stale state = corruption	Reset in both
Validate frame size	YES — already done, keep it	Validate after header complete
Keep reconnect model	YES — it's correct as-is	No changes to service.c
EOF detection	YES — missing today	read() == 0 → return -1

Steps

Phase 1: hyperion_client.c changes

Add rx_phase enum and rx state struct as static variables (after existing statics)
- depends on nothing
Add static _rx_reset() helper to zero the rx state
- depends on step 1
Add static _read_exact(int fd, void* buf, size_t len, size_t* received) helper
- Calls read(fd, buf + *received, len - *received)
- On EINTR → retry (loop)
- On EAGAIN/EWOULDBLOCK → return 0
- On n == 0 (EOF) → return -1
- On n < 0 (other error) → return -1
- On n > 0 → *received += n, return n
- depends on nothing
Rewrite hyperion_read() as a state machine using _read_exact() and the rx struct
- depends on steps 1-3
Call _rx_reset() at end of hyperion_client() (after socket connects) and in hyperion_destroy()
- depends on step 2

Phase 2: No changes to service.c

The return value contract is preserved: 0 = OK, -1 = fatal

Phase 3: No changes to hyperion_client.h

Public API unchanged

2nd Iteration:

Let me look at the socket setup code to understand how SO_RCVTIMEO works on this platform:

Searched for regex and reviewed hyperion_client.c
I see the issue. My _read_exact() treats read() returning 0 as a fatal EOF. However, on some embedded Linux kernels (like the one webOS uses), SO_RCVTIMEO can cause read() to return 0 instead of -1 with EAGAIN when the timeout fires. The old code didn't check n == 0, so it just kept going — my code treats it as a peer disconnect and returns -1.

The fix: treat n == 0 as a timeout (same as EAGAIN) rather than EOF. Broken connections will be detected on the next send() call instead.

Updated hyperion_client.c
File editing tools are currently disabled. Here's the one-line change needed in hyperion_client.c:62:

Root cause: On webOS's embedded Linux kernel, read() with SO_RCVTIMEO may return 0 on timeout instead of -1 with EAGAIN. My _read_exact() treats n == 0 as EOF (peer closed) and returns -1, which service.c interprets as a fatal error and triggers a reconnect. The old simple fix didn't check n == 0, so it just kept going.

Fix: Change line 62 from:

if (n == 0) return -1; /* EOF — peer closed */
to:

if (n == 0) return 0; /* timeout or EOF — treat as transient */
This makes _read_exact() treat n == 0 the same as EAGAIN — return 0 (try again later). If the peer truly closed the connection, the next hyperion_set_image() call from the capture thread will fail on send(), which handles detection through a different path.

TBSniller · 2026-03-31T18:55:01Z

I think this looks better, but the 2nd iteratrion still has brought back the missing reconnect issue.
I think we need an approach for both cases. Maybe with a forced reconnect after certain timeout?

Please review the current receive-path implementation in webosbrew/hyperion-webos and propose the safest minimal fix based on the actual control flow.

Context:
- The current patch already added a good RX state machine for partial header/body reads.
- Short reads must not lose state.
- EAGAIN/EWOULDBLOCK must not immediately trigger reconnect.
- On this target, treating `read() == 0` as fatal caused the original issue to reappear for the reporter.
- But treating `read() == 0` / timeout-like events as always transient may suppress reconnect forever if the connection becomes stale or Hyperion/server-side idle handling effectively drops the session.
- `service.c` reconnects only when `hyperion_read()` returns < 0.
- Write/send failures are not the main reconnect driver here.

What I want you to do:
1. Re-evaluate the current code path end-to-end, not just generic socket theory.
2. Check whether the current implementation can get stuck forever in a non-fatal “no progress” state.
3. Propose the safest minimal fix that preserves:
   - partial-read RX state
   - non-fatal short timeouts
   - eventual reconnect for stale/dead connections
4. If appropriate, prefer a small “stale connection / no progress timeout” approach over immediate fatal handling.
5. Then implement the patch.

Please output:
- Findings
- Recommended fix
- Unified diff patch
- Brief rationale

Important constraints:
- Keep the current RX state machine idea.
- Keep the outer reconnect model in service.c unchanged if possible.
- Do not hand-wave.
- Base your proposal on this repository’s actual reconnect behavior.
- Keep in mind that we need to fix two things: Too fast reconnect in no data case and no reconnect causing a dropped connection on hyperion side

1. Add `#include <time.h>` to system includes 2. Add `#define RX_STALE_SECS 120` after `recvBuff` 3. Add `static time_t rx_last_data;` after `rx` struct 4. Add `rx_last_data = time(NULL);` to `_rx_reset()` 5. Add `_check_stale()` helper after `_read_exact()`, before `hyperion_client()` 6. Modify `hyperion_read()`: - After `_read_exact()` returns > 0: `rx_last_data = time(NULL);` - Replace `return 0` at idle exits with `return _check_stale();`

marcoh242 · 2026-03-31T19:20:56Z

Branch is updated. Here is the output:

Plan: Robust hyperion_read() for stream sockets

TL;DR

Make hyperion_read() robust against short reads, EINTR, and EAGAIN/EWOULDBLOCK on stream sockets by introducing a small state machine with persistent rx state, while keeping the outer reconnect model in service.c unchanged.

1. Bug Analysis

Main bug (confirmed): hyperion_read() with the current EAGAIN fix returns 0 mid-header or mid-body read if a timeout fires between the header and body reads. This silently discards partial data — the next call starts fresh and interprets stale/partial bytes as a new header, corrupting the stream.

Short reads (confirmed real problem): TCP read() can return fewer bytes than requested. A 4-byte header read could return 1–3 bytes; a body read could return a partial frame. The current code does not loop to complete the read, so a short read corrupts the framing.

EINTR (confirmed real problem): read() can return -1/EINTR if a signal is delivered. This is transient and should be retried, not treated as fatal.

EAGAIN/EWOULDBLOCK (confirmed real problem, partially fixed): The current patch handles this but is incomplete — returning 0 after reading a partial header/body discards the partial data.

EOF handling (needs fix): read() returning 0 means peer closed the connection (EOF). Current code doesn't check for this — it would proceed to parse headbuff with whatever was in it.

service.c reconnect logic (no change needed): The connection_loop() already correctly reconnects when hyperion_read() < 0 and stays connected when it returns 0. This contract is correct and should be preserved.

2. Proposed Design

Introduce a persistent receive state struct and a helper _read_exact() that handles short reads, EINTR, and EAGAIN in one place. hyperion_read() becomes a two-phase state machine (HEADER → BODY → dispatch → reset to HEADER).

Return values (unchanged contract):

0 = OK (message received, or no data yet — EAGAIN/timeout)
-1 = fatal error (EOF, socket error, invalid frame) → caller reconnects

New static state:

enum rx_phase { RX_HEADER, RX_BODY };

static struct {
    enum rx_phase phase;
    uint8_t  header[4];
    uint32_t body_len;
    uint32_t received;     // bytes received so far in current phase
} rx;

Helper _read_exact():

Returns:

> 0 = number of bytes read this call (may be partial)
0 = EAGAIN/EWOULDBLOCK (no data, non-fatal)
-1 = fatal (EOF or real error)

Handles EINTR internally by retrying.

hyperion_read() logic:

If RX_HEADER: read into rx.header + rx.received, up to 4 - rx.received bytes.
- On EAGAIN with 0 bytes so far → return 0 (no data yet)
- On EAGAIN with partial header → return 0 (will resume next call)
- When 4 bytes complete → decode body_len, validate, transition to RX_BODY
If RX_BODY: read into recvBuff + rx.received, up to rx.body_len - rx.received bytes.
- On EAGAIN → return 0 (will resume next call)
- When complete → _parse_reply(), reset to RX_HEADER, return 0
Any fatal error or EOF in either phase → return -1

Reset points:

hyperion_client() → reset rx state (new connection)
hyperion_destroy() → reset rx state

3. Rationale

Why a state machine instead of just looping inside read()?
The socket has SO_RCVTIMEO set to 1 second. If we loop blocking inside hyperion_read() waiting for all bytes, we could block indefinitely and starve the connection_loop_running check in service.c. A non-blocking state machine lets hyperion_read() return promptly on timeout, resume on next call, and still allows the outer loop to check its stop flag.

Why not change service.c?
The reconnect model is correct: -1 = broken connection, 0 = alive. No change needed.

Why not use non-blocking sockets + poll()?
That would be a larger refactor. SO_RCVTIMEO + state machine achieves the same robustness with minimal code change.

4. Critical Assessment of Requirements

Requirement	Real problem?	Action
Handle short reads	YES — TCP can fragment	Fix with _read_exact + state
Handle EINTR	YES — signals interrupt read	Fix in _read_exact
EAGAIN/EWOULDBLOCK non-fatal	YES — SO_RCVTIMEO causes this	Fix with state machine
Keep partial state across calls	YES — required by above	rx struct
Reset rx on connect/destroy	YES — stale state = corruption	Reset in both
Validate frame size	YES — already done, keep it	Validate after header complete
Keep reconnect model	YES — it's correct as-is	No changes to service.c
EOF detection	YES — missing today	read() == 0 → return -1

Completed: State machine + partial-read handling (already implemented)

The rx state machine, _read_exact(), _rx_reset() calls in hyperion_client()/hyperion_destroy() are already in place.

Remaining fix: stale-connection detection via time-based idle check

Problem

Current code treats _read_exact() → 0 (both EAGAIN and EOF) as always transient. This prevents reconnect when Hyperion actually drops the connection (FIN) or the connection is half-open.

Solution

Add time_t rx_last_data timestamp. Reset on connect/data. After RX_STALE_SECS (120s) of no received bytes, hyperion_read() returns -1.

Steps

Add #include <time.h> to system includes
Add #define RX_STALE_SECS 120 after recvBuff
Add static time_t rx_last_data; after rx struct
Add rx_last_data = time(NULL); to _rx_reset()
Add _check_stale() helper after _read_exact(), before hyperion_client()
Modify hyperion_read():
- After _read_exact() returns > 0: rx_last_data = time(NULL);
- Replace return 0 at idle exits with return _check_stale();

No changes to:

_read_exact() (n==0 stays transient)
service.c (return contract preserved)
hyperion_client.h (public API unchanged)

Verification

Build: wsl bash /mnt/c/Users/xxx/webos/piccap/build_backend.sh
Review: hyperion_read() returns 0 on short timeouts, -1 after 120s idle or on socket error
Confirm _rx_reset() resets both rx state and timestamp
Confirm partial-read state is preserved across calls

TBSniller · 2026-03-31T19:33:43Z

@marcoh242 Nice, thank you! We might just need to lower the timeout value a bit.

It looks good now for me, but maybe one of the experienced devs @sundermann @tuxuser @Informatic @mariotaku can have a quick eye on this

TBSniller · 2026-03-31T19:35:50Z

@marcoh242 and a few linting fixes are needed

fix linting

run clang-format

marcoh242 · 2026-04-03T20:48:57Z

@TBSniller Should be fine now

TBSniller · 2026-04-04T08:54:38Z

@codex review

TBSniller · 2026-04-06T09:51:34Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9e92d0f6d4

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

src/hyperion_client.c

TBSniller · 2026-04-06T10:05:42Z

@marcoh242 Can you set the timeout to like 5 or less seconds and test for your fix again? 120s are quite long and maybe not needed. In worst case this would mean the user needs to wait 120 seconds till his LEDs are working again in case of a short connection loss. If it still works for you we are done I guess

https://github.com/webosbrew/hyperion-webos/pull/141/changes#diff-20a731bde2aa57b022b9b736d666ebb430e93587d3dd48d62e3d5e13a70fc265R32

marcoh242 · 2026-04-06T10:08:16Z

Ok gonna test with 5 seconds and see what happens when i change WiFi Channel or restart hyperion

marcoh242 · 2026-04-06T21:05:30Z

I tested it a while and my issue with the leds turning on/off is gone but If I force a network disconnect by shortly unplugging the network I get the issue u mentioned at first. For me personally its still an improvement but I dont have enough knowledge to work on that further.

Update hyperion_client.c

4bd49e6

Treat socket timeouts (EAGAIN/EWOULDBLOCK) as a non-error in hyperion_read(), returning 0 instead of -1. This prevents unnecessary disconnects when no data is available yet. This fix was done by Claude 4.6

marcoh242 changed the title ~~Update hyperion_client.c~~ Handle socket timeouts as non error to prevent leds to turn off/on Mar 30, 2026

Update hyperion_client.c

c6dc520

see: webosbrew#141

marcoh242 added 2 commits March 31, 2026 21:51

Update hyperion_client.c

23ba1a2

fix linting

Update hyperion_client.c

9e92d0f

run clang-format

chatgpt-codex-connector bot reviewed Apr 6, 2026

View reviewed changes

src/hyperion_client.c Show resolved Hide resolved

Conversation

marcoh242 commented Mar 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

TBSniller commented Mar 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

marcoh242 commented Mar 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Plan: Robust hyperion_read() for stream sockets

TL;DR

1. Bug Analysis

2. Proposed Design

Return values (unchanged contract):

New static state:

Helper _read_exact():

hyperion_read() logic:

Reset points:

3. Rationale

4. Critical Assessment of Requirements

Steps

Phase 1: hyperion_client.c changes

Phase 2: No changes to service.c

Phase 3: No changes to hyperion_client.h

2nd Iteration:

Uh oh!

TBSniller commented Mar 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

marcoh242 commented Mar 31, 2026

Plan: Robust hyperion_read() for stream sockets

TL;DR

1. Bug Analysis

2. Proposed Design

Return values (unchanged contract):

New static state:

Helper _read_exact():

hyperion_read() logic:

Reset points:

3. Rationale

4. Critical Assessment of Requirements

Completed: State machine + partial-read handling (already implemented)

Remaining fix: stale-connection detection via time-based idle check

Problem

Solution

Steps

No changes to:

Verification

Uh oh!

TBSniller commented Mar 31, 2026

Uh oh!

TBSniller commented Mar 31, 2026

Uh oh!

marcoh242 commented Apr 3, 2026

Uh oh!

TBSniller commented Apr 4, 2026

Uh oh!

TBSniller commented Apr 6, 2026

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

TBSniller commented Apr 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

marcoh242 commented Apr 6, 2026

Uh oh!

marcoh242 commented Apr 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

marcoh242 commented Mar 30, 2026 •

edited

Loading

TBSniller commented Mar 31, 2026 •

edited

Loading

marcoh242 commented Mar 31, 2026 •

edited

Loading

TBSniller commented Mar 31, 2026 •

edited

Loading

TBSniller commented Apr 6, 2026 •

edited

Loading