Update agent process persistence plan: universal persistence, file relocation, remove AutoConfiguration, field analysis #344

Open
Copilot wants to merge 10 commits into main from copilot/explore-storage-engines-kafka-rocksdb

Copilot AI commented Apr 13, 2026

The persistence plan needed several revisions: persist all AgentProcess instances (not just tasks), relocate the plan file, drop the AutoConfiguration override suggestion, use direct SimpleAgentProcess creation for rehydration, and add a comprehensive field analysis to ensure full state restoration.

Changes

  • Universal persistence — Replaced "Selective Persistence: Task Processes Only" with "Universal Persistence: All AgentProcess Instances". Memory distillation processes now persist alongside task assignments, since they also use tick-based processing and benefit from crash recovery.

  • Removed AutoConfiguration references — Dropped the note proposing @ConditionalOnMissingBean upstream to Embabel and removed the erroneous META-INF/spring/org.springframework.boot.autoconfigure.AutoConfiguration.imports file from the repo root. The @Bean @Primary in AgenticAggregateServiceApplication is sufficient; no AutoConfiguration changes needed.

  • Relocated plan — Moved plans/agent-process-persistence/plan.md → plans/agent-process-persistence-plan.md (flat file, consistent with other plans).

  • Direct SimpleAgentProcess rehydration — Replaced the wrapper/decorator rehydration approach with direct SimpleAgentProcess creation using the restored Blackboard and original process ID. AgentPlatform.getAgentProcess() internally delegates to agentProcessRepository.findById(id), which returns the recreated process. No wrapper needed.

  • Accurate orphan recovery description — Updated Phase 5 and Architect Decision #4 to reflect that the runtime emits failure events (AgentTaskFinishedEvent(FAILED) / MemoryDistillationFailedEvent) to clear orphaned state, rather than just logging a warning.

  • SimpleAgentProcess / AbstractAgentProcess field analysis — Decompiled and documented all fields (16 in AbstractAgentProcess, 3 in SimpleAgentProcess) with persist/skip decisions for each. Identified replanBlacklist (Set<String>) as missing from the original snapshot — this field tracks failed action names and must be persisted to prevent retry loops after restart. Added it to AgentProcessSnapshot. Documented that _lastWorldState and _goal are auto-reconstructed from the Blackboard on the first tick(), so they don't need persistence.

  • Expanded rehydration flow — Detailed all 12 steps including exact SimpleAgentProcess constructor parameters (id, parentId, agent, ProcessOptions.DEFAULT, blackboard, platformServices, plannerFactory, createdAt) and mutable state restoration (status, history, failureInfo, replanBlacklist contents via reflection + addAll()).
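
The reflection step for replanBlacklist described in the bullets above could look roughly like the sketch below. It is an illustration under assumptions, not the plan's code: `DemoProcess` is a hypothetical stand-in for SimpleAgentProcess (only the field name `replanBlacklist` comes from the field analysis), and `ReplanBlacklistRestorer` is an invented helper name. The sketch reads the existing private final set and fills it with `addAll()` instead of reassigning the field.

```java
import java.lang.reflect.Field;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical stand-in for SimpleAgentProcess; only the field name is taken from the plan.
final class DemoProcess {
    private final Set<String> replanBlacklist = new LinkedHashSet<>();

    boolean isBlacklisted(String actionName) {
        return replanBlacklist.contains(actionName);
    }
}

// Invented helper: copies persisted action names into the live set via reflection.
final class ReplanBlacklistRestorer {
    private ReplanBlacklistRestorer() {}

    @SuppressWarnings("unchecked")
    static void restore(Object process, Set<String> persisted) {
        try {
            Field field = process.getClass().getDeclaredField("replanBlacklist");
            field.setAccessible(true); // the field is private, so access must be opened
            Set<String> live = (Set<String>) field.get(process);
            live.addAll(persisted);    // mutate in place; the final reference is never reassigned
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not restore replanBlacklist", e);
        }
    }
}
```

Because the lookup is by field name, a rename in a later Embabel release would only surface at runtime, which is presumably why the plan lists this as a risk.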

Key sections updated

| Section | Change |
|---------|--------|
| Problem Statement | Added orphaned MemoryDistillation entries alongside AssignedTask |
| Field Analysis | New section: complete field tables for AbstractAgentProcess and SimpleAgentProcess |
| Process table | Memory distillation row: Short-lived, run-to-completion / No → Tick-based / Yes |
| AgentProcessSnapshot | Added replanBlacklist field; added createdAt mapping comment |
| What NOT to Serialize | Expanded with WorldStateDeterminer, _lastWorldState, _goal, _terminationRequest, telemetry fields |
| Key Insights | Added insights #6 (WorldState/Goal auto-reconstruction) and #7 (replanBlacklist importance) |
| Rehydration Flow | Expanded to 12 steps with exact constructor params and mutable state restoration |
| Phase 2 Rehydration | Wrapper/decorator → direct SimpleAgentProcess creation with full field detail |
| Phase 3 deliverables | pendingWrites buffer stages all processes, not just "persistent processes" |
| Phase 5 Runtime Integration | Orphan recovery emits failure events (not just logs) |
| Architect Decision #1 | Updated with complete field restoration approach and replanBlacklist reflection note |
| Architect Decision #3 | Selective → Universal persistence |
| Architect Decision #4 | Orphan recovery emits failure events to clear state |
| Architect Decision #7 | Removed @ConditionalOnMissingBean language |
| Risk Analysis | Updated rehydration risk with field analysis confirmation; added replanBlacklist reflection risk |

Copilot AI and others added 5 commits April 13, 2026 09:01
…er every tick, selective persistence, RocksDB as default

Agent-Logs-Url: https://github.com/elasticsoftwarefoundation/akces-framework/sessions/ac318fda-093c-481b-87c7-1d6b00224139

Co-authored-by: jwijgerd <914840+jwijgerd@users.noreply.github.com>
…tion and Spring wiring strategy

Agent-Logs-Url: https://github.com/elasticsoftwarefoundation/akces-framework/sessions/1804059f-c47a-4fa9-b18f-4596cb4f5279

Co-authored-by: jwijgerd <914840+jwijgerd@users.noreply.github.com>
…ntProcess instances, remove AutoConfiguration references

Agent-Logs-Url: https://github.com/elasticsoftwarefoundation/akces-framework/sessions/e8684b23-6d25-42f1-86d2-bdab0cdd5152

Co-authored-by: jwijgerd <914840+jwijgerd@users.noreply.github.com>
@jwijgerd jwijgerd marked this pull request as ready for review April 13, 2026 10:26
Copilot AI review requested due to automatic review settings April 13, 2026 10:26

Copilot AI left a comment

Pull request overview

Updates the agent process persistence design document to reflect revised persistence scope (all AgentProcess instances), updated Spring wiring guidance, and a plan file relocation.

Changes:

  • Added the revised persistence plan document at plans/agent-process-persistence-plan.md (universal persistence, no upstream AutoConfiguration override proposal).
  • Added META-INF/spring/org.springframework.boot.autoconfigure.AutoConfiguration.imports with an Embabel auto-configuration entry.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
|------|-------------|
| plans/agent-process-persistence-plan.md | New/relocated persistence plan documenting universal AgentProcess persistence and transaction coordination approach. |
| META-INF/spring/org.springframework.boot.autoconfigure.AutoConfiguration.imports | Adds an AutoConfiguration imports entry (currently placed at repo root). |

Comment thread plans/agent-process-persistence-plan.md Outdated
Comment on lines +511 to +513
- Note: Orphan recovery (missing process for assigned task) is **already handled** by the existing
`resumeNextAgentTask()` code which logs a warning and skips the tick. With persistence, this
situation should be extremely rare (only if RocksDB + Kafka both lose data).

Copilot AI Apr 13, 2026

This section claims orphan recovery for a missing AgentProcess “logs a warning and skips the tick”. In current code, KafkaAgenticAggregateRuntime.handleMissingAgentProcess emits failure events (AgentTaskFinishedEvent(FAILED) or MemoryDistillationFailedEvent) specifically to clear state and avoid retry loops. Please update the plan to reflect the event-emitting behavior.

Suggested change
-- Note: Orphan recovery (missing process for assigned task) is **already handled** by the existing
-`resumeNextAgentTask()` code which logs a warning and skips the tick. With persistence, this
-situation should be extremely rare (only if RocksDB + Kafka both lose data).
+- Note: Orphan recovery (missing process for assigned task or memory distillation) is **already
+handled** by the existing runtime code. When `KafkaAgenticAggregateRuntime` detects a missing
+`AgentProcess`, it emits failure events (for example `AgentTaskFinishedEvent(FAILED)` or
+`MemoryDistillationFailedEvent`) to clear orphaned aggregate state and avoid retry loops. With
+persistence, this situation should be extremely rare (for example, after data loss or repository
+inconsistency), but the failure-event path remains an important safety net.
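
To make the event-emitting path concrete, here is a rough sketch of the decision it describes. Everything in it is a hypothetical stand-in: `PendingWork`, `WorkKind`, `FailureEvent` and `OrphanRecoverySketch` are invented for illustration, the event names are carried as plain strings, and the real `KafkaAgenticAggregateRuntime.handleMissingAgentProcess` may be shaped quite differently.

```java
import java.util.Optional;
import java.util.function.Predicate;

// Hypothetical stand-ins; these are not the Akces runtime or event classes.
enum WorkKind { ASSIGNED_TASK, MEMORY_DISTILLATION }

record PendingWork(String agentProcessId, WorkKind kind) {}

record FailureEvent(String eventType, String agentProcessId, String reason) {}

final class OrphanRecoverySketch {
    private OrphanRecoverySketch() {}

    // Returns a failure event when the referenced process is gone, otherwise empty.
    static Optional<FailureEvent> handleMissingAgentProcess(PendingWork work,
                                                            Predicate<String> processExists) {
        if (processExists.test(work.agentProcessId())) {
            return Optional.empty(); // process found: nothing to clear
        }
        String eventType = switch (work.kind()) {
            case ASSIGNED_TASK -> "AgentTaskFinishedEvent(FAILED)";
            case MEMORY_DISTILLATION -> "MemoryDistillationFailedEvent";
        };
        return Optional.of(new FailureEvent(eventType, work.agentProcessId(), "AgentProcess not found"));
    }
}
```

Returning an event rather than only logging is what lets the aggregate drop the orphaned entry, so the same missing process is not retried on every tick.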

Comment thread plans/agent-process-persistence-plan.md Outdated
Comment on lines +552 to +554
4. **Orphan Recovery**: **Already handled.** The existing `resumeNextAgentTask()` code handles missing
processes by logging a warning and skipping the tick. With persistent storage, this scenario should
be extremely rare.

Copilot AI Apr 13, 2026

Architect Decision #4 repeats that missing processes are handled by “logging a warning and skipping the tick”, but the runtime now emits failure events to clear the orphaned AssignedTask/MemoryDistillation entries (KafkaAgenticAggregateRuntime.handleMissingAgentProcess). Please align this decision text with the actual behavior.

Suggested change
-4. **Orphan Recovery**: **Already handled.** The existing `resumeNextAgentTask()` code handles missing
-processes by logging a warning and skipping the tick. With persistent storage, this scenario should
-be extremely rare.
+4. **Orphan Recovery**: **Already handled.** The existing runtime handles missing processes by
+emitting failure events that clear orphaned `AssignedTask` and `MemoryDistillation` entries when
+a referenced `AgentProcess` cannot be found. With persistent storage, this scenario should be
+extremely rare.

@@ -0,0 +1 @@
+com.embabel.agent.autoconfigure.platform.AgentPlatformAutoConfiguration
\ No newline at end of file

Copilot AI Apr 13, 2026

This adds an AutoConfiguration.imports file at repository root (META-INF/spring/...), but Akces modules register auto-config via files under each module’s src/main/resources/META-INF/spring/ (e.g., main/client and main/query-support). As placed here, it won’t be packaged into any jar and won’t have any effect. If this is intended, move it into the correct module’s resources (likely main/agentic) and include the standard license header used in the other AutoConfiguration.imports files; otherwise remove it to avoid confusion.

Contributor

remove this file

Contributor Author

Removed in 90232c4.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@@ -0,0 +1 @@
+com.embabel.agent.autoconfigure.platform.AgentPlatformAutoConfiguration
\ No newline at end of file
Contributor

remove this file

Comment thread plans/agent-process-persistence-plan.md Outdated
- Handle graceful degradation for unresolvable types
- Unit tests for rehydration with mock AgentPlatform

**Approach — Wrapper/Decorator Pattern**: Create a `RestoredAgentProcess` that wraps a freshly created
Contributor

Create a SimpleAgentProcess directly with the restored Blackboard, not via AgentPlatform.createAgentProcess() (which would create a new ID). No wrapper needed.

Contributor Author

Updated Phase 2 to create SimpleAgentProcess directly with the restored Blackboard and original process ID. Removed the wrapper/decorator approach. Also updated Architect Decision #1 and Risk Analysis to reflect this. See 90232c4.

Comment thread plans/agent-process-persistence-plan.md Outdated
- Unit tests for rehydration with mock AgentPlatform

**Approach — Wrapper/Decorator Pattern**: Create a `RestoredAgentProcess` that wraps a freshly created
`AgentProcess` (via `AgentPlatform.createAgentProcess()`) and implements the exact Embabel
Contributor

AgentPlatform.getAgentProcess() will internally call agentProcessRepository.findById(id), which should return the recreated AgentProcess.

Contributor Author

Added this to the rehydration approach — AgentProcessRepository.findById(id) returns the recreated process, and AgentPlatform.getAgentProcess(id) internally delegates to it. See 90232c4.
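
The delegation described in this thread might be pictured as in the sketch below. All types are hypothetical stand-ins rather than the Embabel `AgentProcessRepository` or `AgentPlatform` interfaces, and recreating the process lazily on a cache miss is an assumption of the sketch, not something the plan states. The point it illustrates is that the recreated process keeps the original ID, so a lookup through the platform needs no wrapper.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins; not the Embabel AgentProcessRepository or AgentPlatform types.
record ProcessSnapshot(String id, String status) {}

record RehydratedProcess(String id, String status) {}

final class SnapshotBackedRepository {
    private final Map<String, ProcessSnapshot> snapshots = new HashMap<>();
    private final Map<String, RehydratedProcess> live = new HashMap<>();

    void saveSnapshot(ProcessSnapshot snapshot) {
        snapshots.put(snapshot.id(), snapshot);
    }

    // Returns the live process, recreating it from its snapshot (same ID) if needed; null if unknown.
    RehydratedProcess findById(String id) {
        RehydratedProcess cached = live.get(id);
        if (cached != null) {
            return cached;
        }
        ProcessSnapshot snapshot = snapshots.get(id);
        if (snapshot == null) {
            return null;
        }
        RehydratedProcess restored = new RehydratedProcess(snapshot.id(), snapshot.status());
        live.put(id, restored);
        return restored;
    }
}

final class PlatformSketch {
    private final SnapshotBackedRepository repository;

    PlatformSketch(SnapshotBackedRepository repository) {
        this.repository = repository;
    }

    // Mirrors the described behaviour: the platform lookup simply delegates to the repository.
    RehydratedProcess getAgentProcess(String id) {
        return repository.findById(id);
    }
}
```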

…ydration, fix orphan recovery descriptions

Agent-Logs-Url: https://github.com/elasticsoftwarefoundation/akces-framework/sessions/62febcd3-67d0-47fc-8cdf-c0b7c9856033

Co-authored-by: jwijgerd <914840+jwijgerd@users.noreply.github.com>
Copilot AI requested a review from jwijgerd April 13, 2026 11:17
Comment thread plans/agent-process-persistence-plan.md Outdated
| Risk | Impact | Mitigation |
|------|--------|------------|
-| Wrapper/decorator doesn't perfectly replicate Embabel process behavior | High | Comprehensive testing; track Embabel version upgrades |
+| Rehydrated `SimpleAgentProcess` doesn't perfectly replicate original process behavior | High | Comprehensive testing; track Embabel version upgrades |
Contributor

Analyze SimpleAgentProcess and its superclass and see what we are currently missing in the snapshot. The goal is to be able to restore the full state.

Contributor Author

Analyzed all fields in SimpleAgentProcess (3 fields) and AbstractAgentProcess (16 fields). Added a complete field-by-field table with persist/skip decisions for each.

Key finding: replanBlacklist (Set<String> in SimpleAgentProcess) was missing from the snapshot — it tracks failed action names to prevent re-selection during replanning. Added it to AgentProcessSnapshot and the rehydration flow.

Fields that need persistence: id, parentId, agentName, status, timestamp/createdAt, failureInfo, blackboard (entries + bindings + hidden + conditions), history, replanBlacklist.

Fields reconstructed automatically: _lastWorldState and _goal (recomputed from Blackboard on first tick()), WorldStateDeterminer (derived from Agent conditions + Blackboard), Planner (from PlannerFactory).

Note: restoring replanBlacklist requires reflection since it's a private final Set<String>. Flagged this in the risk analysis.

See 5b9ddfd.
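
As a rough picture of the persisted fields listed above, a snapshot could be shaped like the record below. Only the field names follow the list; every type is an assumption made for the sketch (in particular the blackboard and history are flattened to placeholder collections), and the record name is deliberately not `AgentProcessSnapshot` to avoid suggesting this is the plan's actual definition.

```java
import java.time.Instant;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch only: field names follow the list above; every type here is an assumption.
record AgentProcessSnapshotSketch(
        String id,
        String parentId,                 // assumed nullable for root processes
        String agentName,
        String status,                   // assumed stored as the status name
        Instant createdAt,
        String failureInfo,              // assumed nullable; serialized form is a placeholder
        Map<String, String> blackboard,  // placeholder for entries, bindings, hidden and conditions
        List<String> history,            // placeholder for serialized history entries
        Set<String> replanBlacklist) {   // failed action names, per the field analysis

    // Defensive copies so later changes to the caller's collections cannot alter the snapshot.
    AgentProcessSnapshotSketch {
        blackboard = Map.copyOf(blackboard);
        history = List.copyOf(history);
        replanBlacklist = Set.copyOf(replanBlacklist);
    }
}
```

_lastWorldState, _goal, WorldStateDeterminer and Planner are intentionally absent, matching the note that they are reconstructed rather than persisted.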

Copilot AI and others added 2 commits April 13, 2026 11:32
…nBlacklist to snapshot, detail rehydration flow

Agent-Logs-Url: https://github.com/elasticsoftwarefoundation/akces-framework/sessions/37efe0a7-d623-4c7c-9c9c-9fea5427468c

Co-authored-by: jwijgerd <914840+jwijgerd@users.noreply.github.com>
Copilot AI changed the title from "Update agent process persistence plan: universal persistence, file relocation, remove AutoConfiguration" to "Update agent process persistence plan: universal persistence, file relocation, remove AutoConfiguration, field analysis" Apr 13, 2026
Copilot AI requested a review from jwijgerd April 13, 2026 11:36