Skip to content

catalog: redirect to /login on AUTH_REFRESH_FAILED; guard first-login race#4923

Open
drernie wants to merge 2 commits into
masterfrom
vir-sso-refresh-failure-catalog
Open

catalog: redirect to /login on AUTH_REFRESH_FAILED; guard first-login race#4923
drernie wants to merge 2 commits into
masterfrom
vir-sso-refresh-failure-catalog

Conversation

@drernie
Copy link
Copy Markdown
Member

@drernie drernie commented May 28, 2026

Summary

  • Detect typed auth errors (AUTH_REFRESH_FAILED, NOT_LOGGED_IN) on GraphQL responses via graphQLErrors[].extensions.code and dispatch authLost so the user is bounced to /login instead of seeing a generic error banner.
  • Keep legacy message matchers (Token invalid., Not logged in, Failed to refresh SSO access token) for compatibility while the registry change rolls out.
  • Suppress NOT_LOGGED_IN while a sign-in handshake is in flight (redux state SIGNING_IN / REFRESHING) — fixes the post-OAuth-callback hydration race that the KB workarounds didn't fully address.

Context

Vir reported a "session goes dead and only a manual sign-in recovers" symptom: catalog held a session whose Okta refresh token had been rejected, every GraphQL call 401'd Failed to refresh SSO access token, and the catalog rendered a generic error instead of guiding the user back to login.

Sibling registry PR: https://github.com/quiltdata/enterprise/pull/1073

Files

  • app/containers/Auth/urqlExchange.tsx — new isAuthLostError / isOnlyNotLoggedIn helpers, subscribed-to waiting flag via a ref to gate the race-prone code.
  • app/containers/Auth/urqlExchange.spec.ts — unit coverage for both helpers.

Test plan

  • npx vitest run app/containers/Auth/urqlExchange.spec.ts passes (8 tests)
  • tsc --noEmit — no new errors in changed files
  • Manual: revoke a refresh token on a staging stack, observe catalog redirects to /login on next GraphQL call
  • Manual: complete an Okta sign-in flow and confirm no flicker / no spurious redirect during the post-callback handshake

🤖 Generated with Claude Code

Greptile Summary

This PR broadens auth-error handling in the urql exchange: it detects typed registry error codes (AUTH_REFRESH_FAILED, NOT_LOGGED_IN) in addition to legacy message strings, dispatches authLost to redirect expired sessions to /login, and suppresses NOT_LOGGED_IN during SIGNING_IN/REFRESHING to prevent false redirects from the post-OAuth-callback race.

  • isAuthLostError unifies typed extension codes and legacy message strings into a single predicate; isOnlyNotLoggedIn identifies the specific race-prone error shape so it can be gated by waitingRef.current.
  • The waitingRef pattern (ref updated every render, read synchronously in the async callback) is the correct way to avoid stale-closure issues in this exchange.
  • The race suppression only fires when the registry sends the new typed NOT_LOGGED_IN extension code; legacy [GraphQL] Not logged in messages (no extension) will still dispatch authLost during a handshake until the sibling registry PR is fully deployed.

Confidence Score: 4/5

Safe to merge; the auth-redirect path is preserved for all production cases and the race guard adds a narrow, well-commented suppression window.

The core logic is correct: the waitingRef pattern avoids stale closures, authLost is still dispatched for every non-suppressed auth error, and existing legacy message matching is preserved. The only gap is that the first-login race fix is silently ineffective against legacy-format registries during the rollout window.

urqlExchange.tsx lines 125-127 — the race-suppression branch — deserves a second look to confirm expected behavior during the registry upgrade window.

Important Files Changed

Filename Overview
catalog/app/containers/Auth/urqlExchange.tsx Adds isAuthLostError/isOnlyNotLoggedIn helpers, broadens auth-error detection from a single legacy message to typed extension codes, and suppresses NOT_LOGGED_IN during SIGNING_IN/REFRESHING via a ref-based state snapshot. Logic is sound but the race-suppression only fires when the registry sends the new typed code.
catalog/app/containers/Auth/urqlExchange.spec.ts New test file covering both helpers with 8 cases; missing one path: legacy [GraphQL] Not logged in message (no extension code) inside isAuthLostError.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[GraphQL response lands in handleResult] --> B{r.error?}
    B -- No --> Z[return r]
    B -- Yes --> C{handleInvalidToken?}
    C -- No --> Z
    C -- Yes --> D{isAuthLostError?}
    D -- No --> Z
    D -- Yes --> E{isOnlyNotLoggedIn?}
    E -- No --> G[dispatch authLost]
    E -- Yes --> F{waitingRef.current?}
    F -- Yes --> H[Suppress - return r]
    F -- No --> G
Loading

Reviews (1): Last reviewed commit: "catalog: redirect to /login on AUTH_REFR..." | Re-trigger Greptile

Greptile also left 2 inline comments on this PR.

…first login

Companion to enterprise PR that types the registry's SSO refresh-failure
401 with `AUTH_REFRESH_FAILED` (and the no-Authorization-header 401 with
`NOT_LOGGED_IN`).

The urql auth exchange previously only redirected on a single legacy
message string ("Token invalid."). Now it inspects
`graphQLErrors[].extensions.code` for either of the new typed codes and
also keeps the legacy message matchers for compatibility. On a hit it
dispatches `authLost` — the existing path that clears tokens and routes
to /login — instead of falling through to a generic error banner.

To avoid making the post-OAuth-callback hydration race worse, suppress
NOT_LOGGED_IN when redux is in a SIGNING_IN / REFRESHING state: a GraphQL
query firing before `POST /api/login` settles would otherwise clear
tokens mid-handshake and bounce the user back to /login. AUTH_REFRESH_FAILED
is not race-prone (it implies a token was sent and rejected) so it is
always treated as auth lost.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@codecov
Copy link
Copy Markdown

codecov Bot commented May 28, 2026

Codecov Report

❌ Patch coverage is 68.42105% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 46.56%. Comparing base (3975b45) to head (18b2fbe).

Files with missing lines Patch % Lines
catalog/app/containers/Auth/urqlExchange.tsx 68.42% 4 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4923      +/-   ##
==========================================
+ Coverage   46.55%   46.56%   +0.01%     
==========================================
  Files         832      832              
  Lines       34120    34138      +18     
  Branches     5833     5838       +5     
==========================================
+ Hits        15884    15897      +13     
- Misses      16237    16241       +4     
- Partials     1999     2000       +1     
Flag Coverage Δ
api-python 93.14% <ø> (ø)
catalog 21.59% <68.42%> (+0.04%) ⬆️
lambda 96.63% <ø> (ø)
py-shared 98.18% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Comment on lines +125 to +127
if (isOnlyNotLoggedIn(r.error) && waitingRef.current) {
return r
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Race suppression won't fire for legacy Not logged in messages

isOnlyNotLoggedIn checks extensions.code === 'NOT_LOGGED_IN', so it returns false when the registry sends the old-format [GraphQL] Not logged in message (without an extension code). In that case, even with waitingRef.current === true, the suppression branch is skipped and authLost is dispatched, bouncing the user to /login mid-handshake. The fix only kicks in for registries that have already deployed the new typed codes from the sibling enterprise PR. If there are staging stacks on the old registry version, the first-login race can still trigger during the rollout window.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 18b2fbeisOnlyNotLoggedIn now also returns true for the pre-typed [GraphQL] Not logged in message, so the race suppression applies during the rollout window when a stack is still on the old registry build.

Comment on lines +36 to +40
const err = combinedError([{ message: 'Failed to refresh SSO access token' }])
expect(isAuthLostError(err)).toBe(true)
})

it('does not flag unrelated errors', () => {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Missing test for legacy Not logged in message path

LEGACY_AUTH_LOST_MESSAGES contains '[GraphQL] Not logged in' but the spec only explicitly exercises the Token invalid. and Failed to refresh SSO access token legacy paths. Adding a case for the Not logged in message (without an extensions object) would confirm the third entry in the set is also reachable and documents the interaction with isOnlyNotLoggedIn returning false for it.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 18b2fbe — added spec cases for the legacy Not logged in message (no extension) covering both isAuthLostError (still a redirect trigger) and isOnlyNotLoggedIn (suppressed while waiting).

Addresses greptile review on #4923:
- isOnlyNotLoggedIn also matches the pre-typed `[GraphQL] Not logged in`
  message; otherwise older registry builds (during the rollout) would
  still bounce users mid-handshake.
- Adds spec coverage for the legacy message in both isAuthLostError
  (kept as redirect-trigger) and isOnlyNotLoggedIn (suppressed while
  waiting).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@Austin-s-h
Copy link
Copy Markdown
Collaborator

@drernie can we add tests for this behavior across all SSO providers?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants