
[WIP / debug-only] feat(ai): structured logs to diagnose Bedrock stalls#699

Draft
jorgeraad wants to merge 18 commits into canary from feat/instrumentation-tier1

Conversation


@jorgeraad jorgeraad commented May 1, 2026

⚠️ Not for merge. This is a temporary diagnostic patch to confirm what's
happening during the multi-hour Bedrock stream stalls on jraad-deploy. Once
we have signal from the next stall and identify the root cause, this branch
will either be reverted or replaced with a properly scoped instrumentation
PR (Tier 2/3 from the design notes). Do not merge as-is.

Adds always-on structured logging around Bedrock streamText calls so we can finally see what's happening during the multi-hour recon stalls on jraad-deploy.

Two pieces:

  • A wrapFetchWithBedrockLogs helper in instrumentation.ts that emits per-fetch lifecycle events — start, headers, first byte, done, body errors, signal abort — each tagged with a callId and ageMs since fetch start. It now also owns the streaming-fetch-timeout composition so its signal_abort log fires when the 15-min backstop trips (today the caller never sees that signal directly).

  • Retry-decision logs in ai.ts that bypass the silent: true flag — which OffensiveSecurityAgent hardcodes for CLI quietness, also hiding the same warnings we need server-side. Covers rate-limit retries, the new stream-idle-resume path, context-length compaction / summarization, the streamText onError handler, and unrecoverable stream errors. Each carries a logicalCallId so a multi-retry chain joins to a single logical call in CloudWatch Insights.

CLI behavior is unchanged — the existing console.warn calls stay gated by silent. The new events go to a separate stdout channel: [apex.instrumentation] {json}.
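The lifecycle-event shape described above can be sketched roughly as follows. The event names, the `callId`/`ageMs` fields, and the `[apex.instrumentation]` prefix are taken from this PR's description; the `emit` helper and the wrapper internals are illustrative, and the sketch only shows `start`, `headers`, and `signal_abort` (the real wrapper also tracks first byte, completion, and body errors from the response stream):

```typescript
// Sketch of the per-fetch lifecycle logging. Event names and the
// [apex.instrumentation] prefix follow the PR; internals are illustrative.
type LifecycleEvent =
  | "start" | "headers" | "first_byte" | "done"
  | "body_error" | "body_cancel" | "signal_abort";

function emit(
  event: LifecycleEvent,
  callId: string,
  startedAt: number,
  extra: Record<string, unknown> = {},
): void {
  // ageMs lets you locate a stall precisely relative to fetch start.
  const payload = { event, callId, ageMs: Date.now() - startedAt, ...extra };
  console.log(`[apex.instrumentation] ${JSON.stringify(payload)}`);
}

function wrapFetchWithBedrockLogs(baseFetch: typeof fetch): typeof fetch {
  let counter = 0;
  return async (input, init) => {
    const callId = `fetch-${++counter}`;
    const startedAt = Date.now();
    emit("start", callId, startedAt);
    // Fires when the composed streaming timeout (or any caller abort) trips.
    init?.signal?.addEventListener("abort", () =>
      emit("signal_abort", callId, startedAt),
    );
    const res = await baseFetch(input, init);
    emit("headers", callId, startedAt, { status: res.status });
    return res;
  };
}
```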

Rebased onto canary to pick up the new withIdleTimeout work; my retry-log block now also fires on the new apex.retry.stream_idle_resume path. Bundles the unmerged feat/surface-integration (#664) commits as well — the PR diff will include those until #664 lands.

Smoke-tested locally against a hung Bun server and confirmed all six lifecycle events fire, including signal_abort on the composed timeout.

Plan

  1. Bump the apex submodule in console to this branch on jraad-deploy only.
  2. Reproduce the stall (or wait for the next one).
  3. Pull [apex.instrumentation] lines from CloudWatch and confirm whether retry chains fire / whether the timeout signal aborts / whether bytes stop flowing.
  4. Use the answer to scope the real fix; revert this PR.

@jorgeraad jorgeraad changed the title feat(ai): structured logs to diagnose Bedrock stream stalls [WIP / debug-only] feat(ai): structured logs to diagnose Bedrock stalls May 1, 2026
Test and others added 18 commits May 1, 2026 17:47
Adds the npm dep that subsequent integration tasks will import from.
Workspace-hoisted at the console worktree root; apex's own bun.lock
is unchanged because the parent console workspace owns the lockfile.
…classifier

Defines ConsolidatedEndpoint (one record per (file, path) with method[]) and
classifyEndpoint() implementing the page-vs-API rules from the design's
section 1.4. Next.js page/route convention drives PAGE classification; v1
deliberately leaves Rails/Django/FastAPI/Spring view-rendering routes as
their HTTP method (per design — fallback path covers them).

7/7 unit tests cover App Router pages, Pages Router, route handlers, Server
Actions, WebSocket, Express, and multi-method consolidation.
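The consolidation record described above can be sketched as below. The field names follow the commit message (one record per `(file, path)` with a `method[]` array); the classification regex is a simplified stand-in for the design's section 1.4 rules, not the real implementation:

```typescript
// Illustrative shape of the per-(file, path) consolidation record.
type HttpMethod = "GET" | "POST" | "PUT" | "PATCH" | "DELETE" | "PAGE";

interface ConsolidatedEndpoint {
  file: string;
  path: string;
  method: HttpMethod[]; // methods merged for the same (file, path)
}

// Next.js page/route conventions drive PAGE classification; anything
// else keeps its HTTP methods (the design's fallback path).
function classifyEndpoint(e: ConsolidatedEndpoint): HttpMethod[] {
  const isNextPage = /(^|\/)(app\/.*page|pages\/[^_].*)\.tsx?$/.test(e.file);
  return isNextPage ? ["PAGE"] : e.method;
}
```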
Single workflow entry point per app. Calls surface.map() with
includeInternal:false, applies fallback gate (no frameworks OR zero
endpoints), consolidates per-(file,path) and runs the page-vs-API
classifier. Returns a discriminated union: { mode: 'fallback', reason }
or { mode: 'surface', endpoints, frameworks }.

10 unit tests cover consolidation (multi-method same path, distinct files
same path), the two fallback conditions, and end-to-end classification on
a synthetic MapResult.
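The discriminated-union return and the fallback gate described above might look like this. The union variants are quoted from the commit message; the gate logic and the `decide` helper name are illustrative:

```typescript
// Hypothetical sketch of the fallback gate: no frameworks OR zero endpoints.
interface Endpoint { file: string; path: string; method: string[] }

type MapDecision =
  | { mode: "fallback"; reason: string }
  | { mode: "surface"; endpoints: Endpoint[]; frameworks: string[] };

function decide(frameworks: string[], endpoints: Endpoint[]): MapDecision {
  if (frameworks.length === 0) return { mode: "fallback", reason: "no frameworks detected" };
  if (endpoints.length === 0) return { mode: "fallback", reason: "zero endpoints" };
  // reason is only typed on the fallback branch of the union.
  return { mode: "surface", endpoints, frameworks };
}
```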
… path

Replaces per-app pages+apiEndpoints CodeAgent pair on the surface-driven
path with a single per-app enrichment CodeAgent that receives the full
deterministic endpoint list and emits one document_asset per endpoint.

Per design Phase 1.2: agent told NOT to grep for new routes (surface
already enumerated them); pre-fills authRequired from auth signals;
preserves existing description/pentestObjectives on unchanged endpoints
when the session has prior state.

Adds export keywords to WHITEBOX_CODE_AGENT_SYSTEM_PROMPT, AppInfoSchema,
and DiscoverySummarySchema in whiteboxAttackSurface.ts so the new module
can reuse them — no logic changes to the workflow itself.

11 unit tests cover the objective-builder prompt content (numbered
endpoint list, 'Do NOT re-discover routes' invariant, document_asset
asks, page vs api method rendering).
Replaces the per-app pages+apiEndpoints CodeAgent pair with a per-app
dispatch: surface-driven via mapAppWithSurface + runEnrichmentAgent
when surface supports the framework, falling back to the existing
two-agent flow when the fallback gate fires.

Cloud-resource apps still route to the existing cloudResourceEndpoints
agent — surface is HTTP-route-focused (per design Non-Goals).

Phases 1 (apps discovery), 1.5 (app.json), 3 (assets read), 4 (risk
scoring), 5 (assembly) byte-for-byte unchanged.
runIncrementalWhiteboxAttackSurfaceWorkflow untouched (Phase 3 of the
design is a follow-up).

Subagent events preserved: enrich-${app.name} for the surface path,
pages-${app.name} / apiEndpoints-${app.name} for fallback,
cloudResourceEndpoints-${app.name} for cloud apps.

Per-app log line indicates which path was taken.
- Update bun.lock to include @pensar/surface@0.1.1 (workspace install in
  the development context hoisted to the parent lockfile, leaving apex's
  own lockfile out of sync — fix the standalone install).
- Prettier --write across the seven new/modified files. No logic changes.
Without this, Phase 1's appsAgent has document_endpoint in its default
tool registry, and the shared system prompt heavily instructs every
agent to call document_endpoint per route. Phase 1 over-reaches: after
documenting apps via document_app, it keeps going and tries to call
document_endpoint for every route — long before Phase 2 (where surface
runs) gets a chance.

Adds excludeTools: ['document_endpoint'] symmetric to Phase 2's existing
excludeTools: ['document_app'] in spawnDiscoveryAgent + the enrichment
agent. Each phase now has the right tool surface for its job: apps-only
for Phase 1, endpoints-only for Phase 2 (surface or fallback).
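The symmetric per-phase exclusions can be sketched as a simple filter over the default registry. The tool names come from this PR; the registry contents and the `toolsForPhase` helper are illustrative:

```typescript
// Hypothetical default registry; only the two document_* names are from the PR.
const DEFAULT_TOOLS = ["document_app", "document_endpoint", "read_file", "grep"];

function toolsForPhase(excludeTools: string[]): string[] {
  return DEFAULT_TOOLS.filter((t) => !excludeTools.includes(t));
}

// Phase 1 (apps discovery) loses document_endpoint;
// Phase 2 (endpoint enumeration) loses document_app.
const phase1Tools = toolsForPhase(["document_endpoint"]);
const phase2Tools = toolsForPhase(["document_app"]);
```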
…e 1 objective

The previous fix removed document_endpoint from Phase 1's tool registry,
but the agent worked around it by calling document_app on individual
endpoints (e.g. `document_app GET /api/products`) since the shared system
prompt's 'document every route' directive is so strong.

Add an explicit exclusion bullet to the IMPORTANT block ('Individual API
routes, web pages, or HTTP endpoints — endpoint enumeration is handled
by a separate phase') and reinforce in the closing line. The agent now
has matching tool-level + objective-level constraints; no improvising.
The previous fix removed document_endpoint from Phase 1's tool registry
and added an objective-level exclusion, but the agent kept improvising
(calling document_app on individual routes) because the shared
WHITEBOX_CODE_AGENT_SYSTEM_PROMPT still contains a heavy ## document_endpoint
section + Working Approach references. Even with the tool unavailable,
the prompt was strong enough that the agent treated routes as documentable
and reached for the closest-shaped tool.

Add WHITEBOX_APPS_DISCOVERY_SYSTEM_PROMPT — a stripped variant of the
shared prompt with all document_endpoint mentions removed. The prompt
now tells Phase 1 explicitly that endpoint enumeration is a separate
phase's job. Phase 2 agents (pages/apiEndpoints/cloudResource/enrichment)
keep using the original WHITEBOX_CODE_AGENT_SYSTEM_PROMPT — they
legitimately need that guidance.

After this: Phase 1 has matching tool-level + objective-level + system-
prompt-level constraints, all consistent. No improvising surface left.
…_endpoint shape

Two bugs the user surfaced from a real Coffee Shop run:

1. The enrichment agent was using WHITEBOX_CODE_AGENT_SYSTEM_PROMPT,
   which heavily instructs every agent to 'orient first, list files,
   grep, search-then-read, be thorough — discover N routes.' The
   enrichment agent was reading that as discovery framing and ignored
   its 'list above is complete' objective, exploring the repo from
   scratch and re-finding routes surface had already enumerated.

2. The objective told the agent to call `document_asset` with a
   nested `details` block. That tool name was renamed to
   `document_endpoint` (with a flat schema) in canary, and the
   agent was hunting for a tool that doesn't exist — falling back
   to default discovery behavior in confusion.

Add WHITEBOX_ENRICHMENT_SYSTEM_PROMPT — purpose-built for enrichment.
It tells the agent: 'you have a deterministic list, read just the
handler at file:line, document_endpoint with these flat fields, do not
list_files or grep for new routes.' Drops the discovery Working Approach
in favor of enrichment-only guidance.

Update buildEnrichmentObjective to:
- Use the actual document_endpoint tool name + flat schema
  (routePath/method/file/line/handler/authRequired/endpointType/riskLevel),
  matching the schema in offSecAgent/tools/documentEndpoint.ts.
- Pre-derive endpointType per entry (web-endpoint for PAGE, otherwise
  api-endpoint).
- Drop the pentestObjectives ask — document_endpoint generates them
  automatically via threatModelGenerator on the tool side.
- End with the explicit count: 'must equal the number above: ${N}'.

Phase 2 fallback agents (pages/apiEndpoints/cloudResource) keep using
WHITEBOX_CODE_AGENT_SYSTEM_PROMPT — they are doing discovery and need
that framing.
Refactors the surface-driven path so each endpoint gets its own
enrichment subagent, surfacing in the subagent view as a distinct row
(e.g. "Coffee Shop: POST /api/admin/diagnostics") rather than a
single "enrich-Coffee Shop" agent making N document_endpoint calls.

Why per-endpoint:
- Matches the original issue #662 design verbatim.
- Each agent has tiny scope: read one handler, document one endpoint,
  return. Zero cross-endpoint coupling.
- The cross-endpoint reasoning advantage of per-app enrichment is moot
  now that pentestObjectives are auto-generated by document_endpoint
  (via generateThreatModelForEndpoint) — the agent only writes
  description + riskLevel + auth refinement, none of which need
  app-wide context.
- Token caps can't truncate: each agent's output is bounded by one
  endpoint.
- Subagent view becomes self-documenting (one row per endpoint with
  method + path in the display name).

Changes:
- enrichmentAgent.ts:
  - runEnrichmentAgent now takes EndpointEnrichmentInput (single endpoint)
    and produces subagentId 'enrich-<app-slug>-<route-slug>' + display
    name '<app.name>: <method> <path>'.
  - New runAppEnrichment wrapper fans out N agents via
    runWithBoundedConcurrency at ENRICHMENT_CONCURRENCY=5.
  - Hard-excludes list_files/grep/document_app on the per-endpoint agent
    so the model can't fall back to discovery behavior.
  - buildEnrichmentObjective rewritten for single-endpoint scope with
    explicit Workflow section ('1. read_file the handler  2. document_endpoint
    once  3. response').

- whiteboxAttackSurface.ts: Phase 2 surface-driven branch swaps
  runEnrichmentAgent for runAppEnrichment. No other workflow changes.

- enrichmentAgent.test.ts: rewrites the prompt-builder test for
  per-endpoint shape + adds cases for PAGE single-method serialization,
  multi-method JSON-array serialization, and 'no auth signals' prefill.
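The fan-out described above (N per-endpoint agents, at most `ENRICHMENT_CONCURRENCY = 5` in flight) can be sketched with a worker-pool helper. `runWithBoundedConcurrency`'s real signature may differ; this is a minimal illustration of the pattern:

```typescript
const ENRICHMENT_CONCURRENCY = 5;

// Run task() over items with at most `limit` promises in flight,
// preserving result order by index.
async function runWithBoundedConcurrency<T, R>(
  items: T[],
  limit: number,
  task: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++; // safe: no await between the check and the increment
      results[i] = await task(items[i]);
    }
  }
  const workers = Array.from(
    { length: Math.min(limit, items.length) },
    () => worker(),
  );
  await Promise.all(workers);
  return results;
}
```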
Phase 1's apps-discovery agent typically returns 'app/' (or a similar
subdirectory) for single-app repos where the routes live there.
Pointing surface at that subdir misses the parent's package.json
(or requirements.txt, go.mod, etc.) — surface's framework detection
returns 'frameworks: []', the gate triggers fallback, and the user
sees the legacy two-agent discovery path even though surface would
have worked from the repo root.

Reproducible against pensarai/coffee-shop:
  map('/coffee-shop')        → frameworks=['nextjs'], endpoints=8
  map('/coffee-shop/app')    → frameworks=[],         endpoints=0  ← bug

Add findDependencyRoot(appPath, repoRoot): walks up from appPath
toward repoRoot looking for the nearest directory containing a
recognized dep manifest. Used by mapAppWithSurface to pick the
right scan root before invoking surface.

For single-app repos (Coffee Shop with location='app/'), the walk
finds the repo's package.json and scans from there. For monorepos
where each package has its own package.json, the walk stops at the
package directory immediately — no over-broadening. Bounded by
repoRoot so we never escape the project.

Workflow now passes codebasePath as the second arg to bound the walk.

7 new unit tests cover: walk-up to parent, deep-nested walk-up,
monorepo package boundary, root with own dep file, no-walk-past-root,
and graceful fallback when no dep file exists. End-to-end smoke against
the real coffee-shop layout: was returning fallback, now returns
surface mode with 5 consolidated endpoints from 8 raw rows.
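The walk-up described above might look like the sketch below. The function name and the repoRoot bound are from the commit message; the manifest list and the fs calls are assumptions about the implementation:

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Assumed set of recognized dependency manifests.
const MANIFESTS = ["package.json", "requirements.txt", "go.mod", "Cargo.toml"];

// Walk up from appPath toward repoRoot, returning the nearest directory
// containing a recognized manifest; fall back to appPath when none exists.
function findDependencyRoot(appPath: string, repoRoot: string): string {
  let dir = path.resolve(appPath);
  const root = path.resolve(repoRoot);
  while (true) {
    if (MANIFESTS.some((m) => fs.existsSync(path.join(dir, m)))) return dir;
    if (dir === root) return appPath; // graceful fallback: no manifest found
    dir = path.dirname(dir);
    if (!dir.startsWith(root)) return appPath; // never escape the project
  }
}
```

For a monorepo, the walk stops immediately at the package's own manifest, so scans never over-broaden.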
…apex classifier

Bumps @pensar/surface to ~0.2.1, which now emits the route's categorical
role (api / page / action / websocket) on EndpointInfo directly. Apex's
hand-written file-pattern classifier becomes redundant — the kind field
is the source of truth.

Changes:
- Bump @pensar/surface to ~0.2.1 (skips deprecated 0.2.0).
- Delete src/core/integrations/surface/classifier.{ts,test.ts}.
- Add 'kind: EndpointKind' to ConsolidatedEndpoint; re-export EndpointKind.
- Replace post-consolidation classification with a one-line ternary in
  the integration helper: kind==='page' ? method=['PAGE'] : pass-through.
- Simplify EnrichmentEndpoint to a pass-through alias for ConsolidatedEndpoint
  (drops the classifiedMethod + isPage workaround fields).
- Drop the bridging map in workflow Phase 2 — surfaceResult.endpoints flow
  straight to runAppEnrichment.
- Update tests: fixtures use kind directly; assert kind==='page' produces
  method=['PAGE'].

Coverage gain: Next.js page routes (app/page.tsx, pages/*.tsx) now flow
end-to-end as kind=page → endpointType=web-endpoint. Smoke against
coffee-shop returns 10 endpoints (8 api + 2 page) where the previous
classifier shim returned only 8.
…hten dedup

- Drop pass-through `EnrichmentEndpoint` alias; use `ConsolidatedEndpoint` directly
- Remove unused `HttpMethod`/`EndpointKind` re-exports from surface/types
- Replace the linear-scan handler-string dedup in consolidateBySameRoute with a Set
- Strip narrating WHAT comments, version tags, and design-doc section refs
…lication

- Move AppInfoSchema, AppsDiscoveryResultSchema, DiscoverySummarySchema and
  their inferred types into agents/specialized/whiteboxAttackSurface/types.ts.
  enrichmentAgent.ts no longer reaches up into workflows/ for them.
- Move the three workflow-phase system prompts (apps-discovery, enrichment,
  general discovery) into agents/specialized/whiteboxAttackSurface/prompts.ts
  and compose them from shared snippets so a tweak in tool guidance hits
  every prompt.
- Rename WHITEBOX_CODE_AGENT_SYSTEM_PROMPT -> WHITEBOX_DISCOVERY_SYSTEM_PROMPT
  (it's the fallback + incremental prompt now, not "the" code-agent prompt).
- Drop the AppMetadata interface in favor of Pick<AppInfo, ...> + a
  toAppMetadata helper, eliminating three near-duplicate inline metadata
  shapes.
- mapAppWithSurface is no longer async (surface.map is sync, the speculative
  future-proofing comment is gone) and FallbackDecision is now a discriminated
  union so reason is typed only on the fallback branch.
- Drop the method=["PAGE"] substitution from the integration layer. The
  surface integration is a pure pass-through; kind stays authoritative for
  page/api classification. Apex's method="PAGE" storage convention is
  applied once at the document_endpoint write boundary in
  buildEnrichmentObjective.
- Trim field-level prose in the enrichment objective; the tool schema
  already documents each field.
… cross-app leakage

mapAppWithSurface used to return every endpoint surface found anywhere
under the climbed-up scan root, with no filter restricting them to
app.location. When phase 1 set `location` to a path without its own
dep manifest (e.g. an SST IaC file under `infra/`), the climb walked
to the repo root, surface scanned the whole monorepo, and every such
app received the same union of routes — observed on a Console recon
where four SST-defined apps each got 89-122 endpoints sourced from
`console/`, `packages/`, AND `infra/`.

New decision sequence in mapAppWithSurface:

  1. Force fallback when app.location is the repo root — phase 1 didn't
     disambiguate; a no-op scope filter would re-attribute everything
     to one app.
  2. Try a narrow `map(appPath)` first. When appPath has its own
     manifest this is the correct, fastest path AND avoids surface's
     global (method, path) dedup eating routes across siblings.
  3. Otherwise climb to a parent manifest, scan there, then filter
     endpoints whose `file` resolves under appPath's subtree. The
     file-equality branch handles `app.location` being a file (e.g.
     `infra/api.ts`); the prefix-startsWith branch handles directories.

`scopeEndpointsToApp` is exported for direct testing.

Adds a regression test and four sibling cases covering: shared root
manifest, file-path location, own-manifest narrow scan, repo-root
fallback, and frameworks-detected-but-empty-scoped fallback.
…alls

Adds an `instrumentation.ts` module that emits structured stdout JSON for:

- Bedrock fetch lifecycle: start, headers, first_byte, done, body_error,
  body_cancel, signal_abort. Each event carries a per-fetch `callId` and an
  `ageMs` since fetch start, so stalls can be located precisely (e.g. "got
  headers + first byte then went silent for 14 minutes before the timeout
  signal fired").

- Apex retry decisions: rate-limit retry (`apex.retry.rate_limit`),
  context-length compaction / summarization, streamText `onError` handler,
  and unrecoverable stream errors. Each carries the `logicalCallId` for the
  enclosing streamResponse so multi-retry chains are joinable in CloudWatch
  Insights.

These logs bypass the `silent: true` flag (which OffensiveSecurityAgent
hardcodes for CLI/TUI quietness) so server contexts always have visibility
into retry storms and hung Bedrock streams. The user-facing console.warn
calls remain gated by `silent` for backward-compatible CLI behavior.

The `wrapFetchWithBedrockLogs` wrapper now also owns the `composeSignal`
step, so its `signal_abort` log fires for the composed timeout (the caller
never sees it directly otherwise).

Format: `[apex.instrumentation] {json}` — greppable; CloudWatch Insights can
parse with `parse @message "[apex.instrumentation] *" as raw_json`.

Why now: production recon stalls have been silent for 3+ hours with the
process alive, no errors, no logs. We don't know whether the 15-min
streaming timeout fires, whether retries cycle, or where bytes stop
flowing. This module surfaces all three.
@jorgeraad jorgeraad force-pushed the feat/instrumentation-tier1 branch from b48835a to 5a29ef8 on May 1, 2026 21:48
@jorgeraad jorgeraad changed the base branch from feat/surface-integration to canary May 1, 2026 21:48