Update agent process persistence plan: universal persistence, file relocation, remove AutoConfiguration, field analysis #344
Agent-Logs-Url: https://github.com/elasticsoftwarefoundation/akces-framework/sessions/8e03ca97-0f84-4984-87d4-a08e48d6cd20 Co-authored-by: jwijgerd <914840+jwijgerd@users.noreply.github.com>
…er every tick, selective persistence, RocksDB as default Agent-Logs-Url: https://github.com/elasticsoftwarefoundation/akces-framework/sessions/ac318fda-093c-481b-87c7-1d6b00224139 Co-authored-by: jwijgerd <914840+jwijgerd@users.noreply.github.com>
…tion and Spring wiring strategy Agent-Logs-Url: https://github.com/elasticsoftwarefoundation/akces-framework/sessions/1804059f-c47a-4fa9-b18f-4596cb4f5279 Co-authored-by: jwijgerd <914840+jwijgerd@users.noreply.github.com>
…ntProcess instances, remove AutoConfiguration references Agent-Logs-Url: https://github.com/elasticsoftwarefoundation/akces-framework/sessions/e8684b23-6d25-42f1-86d2-bdab0cdd5152 Co-authored-by: jwijgerd <914840+jwijgerd@users.noreply.github.com>
Pull request overview
Updates the agent process persistence design document to reflect revised persistence scope (all AgentProcess instances), updated Spring wiring guidance, and a plan file relocation.
Changes:
- Added the revised persistence plan document at `plans/agent-process-persistence-plan.md` (universal persistence, no upstream AutoConfiguration override proposal).
- Added `META-INF/spring/org.springframework.boot.autoconfigure.AutoConfiguration.imports` with an Embabel auto-configuration entry.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| plans/agent-process-persistence-plan.md | New/relocated persistence plan documenting universal AgentProcess persistence and transaction coordination approach. |
| META-INF/spring/org.springframework.boot.autoconfigure.AutoConfiguration.imports | Adds an AutoConfiguration imports entry (currently placed at repo root). |
- Note: Orphan recovery (missing process for assigned task) is **already handled** by the existing
  `resumeNextAgentTask()` code which logs a warning and skips the tick. With persistence, this
  situation should be extremely rare (only if RocksDB + Kafka both lose data).
This section claims orphan recovery for a missing AgentProcess “logs a warning and skips the tick”. In current code, KafkaAgenticAggregateRuntime.handleMissingAgentProcess emits failure events (AgentTaskFinishedEvent(FAILED) or MemoryDistillationFailedEvent) specifically to clear state and avoid retry loops. Please update the plan to reflect the event-emitting behavior.
- Note: Orphan recovery (missing process for assigned task or memory distillation) is **already
  handled** by the existing runtime code. When `KafkaAgenticAggregateRuntime` detects a missing
  `AgentProcess`, it emits failure events (for example `AgentTaskFinishedEvent(FAILED)` or
  `MemoryDistillationFailedEvent`) to clear orphaned aggregate state and avoid retry loops. With
  persistence, this situation should be extremely rare (for example, after data loss or repository
  inconsistency), but the failure-event path remains an important safety net.
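The sketch below is a rough illustration of the failure-event path described in the suggestion above. It maps a pending work item whose `AgentProcess` is missing to a failure event. It is not the actual `KafkaAgenticAggregateRuntime.handleMissingAgentProcess` implementation. `PendingWork`, the simplified `AssignedTask`/`MemoryDistillation` records, and the event records with their fields are hypothetical stand-ins chosen for this example.

```java
import java.util.Map;
import java.util.Optional;

public class OrphanRecoverySketch {
    // Hypothetical stand-ins for aggregate state entries that reference an AgentProcess by id.
    sealed interface PendingWork permits AssignedTask, MemoryDistillation {
        String agentProcessId();
    }
    record AssignedTask(String taskId, String agentProcessId) implements PendingWork {}
    record MemoryDistillation(String distillationId, String agentProcessId) implements PendingWork {}

    // Hypothetical, simplified failure events; the real events carry different data.
    sealed interface FailureEvent permits AgentTaskFinishedEvent, MemoryDistillationFailedEvent {}
    enum TaskOutcome { COMPLETED, FAILED }
    record AgentTaskFinishedEvent(String taskId, TaskOutcome outcome, String reason) implements FailureEvent {}
    record MemoryDistillationFailedEvent(String distillationId, String reason) implements FailureEvent {}

    /**
     * Returns a failure event when the referenced process cannot be found, so the orphaned
     * entry is cleared instead of being retried on every tick. Returns empty when the
     * process still exists and normal processing can continue.
     */
    static Optional<FailureEvent> handleMissingProcess(PendingWork work, Map<String, Object> processes) {
        if (processes.containsKey(work.agentProcessId())) {
            return Optional.empty();
        }
        String reason = "AgentProcess " + work.agentProcessId() + " not found";
        if (work instanceof AssignedTask task) {
            return Optional.of(new AgentTaskFinishedEvent(task.taskId(), TaskOutcome.FAILED, reason));
        }
        // The interface is sealed, so the only other permitted type is MemoryDistillation.
        MemoryDistillation distillation = (MemoryDistillation) work;
        return Optional.of(new MemoryDistillationFailedEvent(distillation.distillationId(), reason));
    }

    public static void main(String[] args) {
        Map<String, Object> processes = Map.of("p-1", new Object());
        System.out.println(handleMissingProcess(new AssignedTask("t-1", "p-1"), processes));
        System.out.println(handleMissingProcess(new AssignedTask("t-2", "p-missing"), processes));
    }
}
```

Emitting an event, rather than only logging, lets the aggregate remove the orphaned entry so the next tick does not pick it up again.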
4. **Orphan Recovery**: **Already handled.** The existing `resumeNextAgentTask()` code handles missing
   processes by logging a warning and skipping the tick. With persistent storage, this scenario should
   be extremely rare.
Architect Decision #4 repeats that missing processes are handled by “logging a warning and skipping the tick”, but the runtime now emits failure events to clear the orphaned AssignedTask/MemoryDistillation entries (KafkaAgenticAggregateRuntime.handleMissingAgentProcess). Please align this decision text with the actual behavior.
4. **Orphan Recovery**: **Already handled.** The existing runtime handles missing processes by
   emitting failure events that clear orphaned `AssignedTask` and `MemoryDistillation` entries when
   a referenced `AgentProcess` cannot be found. With persistent storage, this scenario should be
   extremely rare.
@@ -0,0 +1 @@
com.embabel.agent.autoconfigure.platform.AgentPlatformAutoConfiguration
No newline at end of file
This adds an AutoConfiguration.imports file at repository root (META-INF/spring/...), but Akces modules register auto-config via files under each module’s src/main/resources/META-INF/spring/ (e.g., main/client and main/query-support). As placed here, it won’t be packaged into any jar and won’t have any effect. If this is intended, move it into the correct module’s resources (likely main/agentic) and include the standard license header used in the other AutoConfiguration.imports files; otherwise remove it to avoid confusion.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
- Handle graceful degradation for unresolvable types
- Unit tests for rehydration with mock AgentPlatform

**Approach — Wrapper/Decorator Pattern**: Create a `RestoredAgentProcess` that wraps a freshly created
Create a `SimpleAgentProcess` directly with the restored Blackboard, not via `AgentPlatform.createAgentProcess()` (which would create a new ID). No wrapper is needed.
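The toy model below shows why the reviewer wants direct construction. A factory that always mints a fresh id cannot restore the original identity. A constructor that accepts the persisted id can. This is a sketch under stated assumptions: `ProcessSketch` and `Snapshot` are hypothetical stand-ins for Embabel's `SimpleAgentProcess` and the plan's `AgentProcessSnapshot`, and the code does not reproduce Embabel's real constructor signature.

```java
import java.util.Map;
import java.util.UUID;

public class DirectRehydrationSketch {
    // Hypothetical stand-in for a persisted process snapshot.
    record Snapshot(String id, String parentId, Map<String, Object> blackboard) {}

    // Hypothetical stand-in for an agent process; the id is fixed at construction time.
    record ProcessSketch(String id, String parentId, Map<String, Object> blackboard) {}

    // Mimics a platform-style factory: every call mints a new id, so identity is lost.
    static ProcessSketch createViaFactory(Map<String, Object> blackboard) {
        return new ProcessSketch(UUID.randomUUID().toString(), null, blackboard);
    }

    // Direct construction: reuses the persisted id and the restored blackboard, no wrapper.
    static ProcessSketch rehydrate(Snapshot snapshot) {
        return new ProcessSketch(snapshot.id(), snapshot.parentId(), Map.copyOf(snapshot.blackboard()));
    }

    public static void main(String[] args) {
        Snapshot snapshot = new Snapshot("proc-42", null, Map.of("question", "What is Akces?"));
        System.out.println(rehydrate(snapshot).id());                     // proc-42
        System.out.println(createViaFactory(snapshot.blackboard()).id()); // a random UUID
    }
}
```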
- Unit tests for rehydration with mock AgentPlatform

**Approach — Wrapper/Decorator Pattern**: Create a `RestoredAgentProcess` that wraps a freshly created
`AgentProcess` (via `AgentPlatform.createAgentProcess()`) and implements the exact Embabel
`AgentPlatform.getAgentProcess()` will internally call `agentProcessRepository.findById(id)`, which should return the recreated `AgentProcess`.
Added this to the rehydration approach — AgentProcessRepository.findById(id) returns the recreated process, and AgentPlatform.getAgentProcess(id) internally delegates to it. See 90232c4.
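A small in-memory model can illustrate the delegation described in this thread. The platform's lookup forwards to the repository. Once the rehydrated process is saved under its original id, a lookup by that id returns it. `ProcessRepository`, `InMemoryProcessRepository`, `PlatformSketch`, and `AgentProcessStub` are hypothetical names for this sketch. They are not Embabel's `AgentProcessRepository` or `AgentPlatform` APIs.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

public class RepositoryDelegationSketch {
    // Hypothetical minimal process representation.
    record AgentProcessStub(String id, String status) {}

    // Hypothetical repository abstraction keyed by process id.
    interface ProcessRepository {
        Optional<AgentProcessStub> findById(String id);
        void save(AgentProcessStub process);
    }

    static final class InMemoryProcessRepository implements ProcessRepository {
        private final Map<String, AgentProcessStub> processes = new ConcurrentHashMap<>();

        @Override public Optional<AgentProcessStub> findById(String id) {
            return Optional.ofNullable(processes.get(id));
        }

        @Override public void save(AgentProcessStub process) {
            processes.put(process.id(), process);
        }
    }

    // Hypothetical platform facade: lookups are delegated straight to the repository.
    record PlatformSketch(ProcessRepository repository) {
        AgentProcessStub getAgentProcess(String id) {
            return repository.findById(id).orElse(null);
        }
    }

    public static void main(String[] args) {
        ProcessRepository repository = new InMemoryProcessRepository();
        PlatformSketch platform = new PlatformSketch(repository);
        // After a restart, the rehydrated process is saved under its original id...
        repository.save(new AgentProcessStub("proc-42", "RUNNING"));
        // ...so the platform lookup finds it without creating a new process.
        System.out.println(platform.getAgentProcess("proc-42"));
    }
}
```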
…ydration, fix orphan recovery descriptions Agent-Logs-Url: https://github.com/elasticsoftwarefoundation/akces-framework/sessions/62febcd3-67d0-47fc-8cdf-c0b7c9856033 Co-authored-by: jwijgerd <914840+jwijgerd@users.noreply.github.com>
| Risk | Impact | Mitigation |
|------|--------|------------|
| Wrapper/decorator doesn't perfectly replicate Embabel process behavior | High | Comprehensive testing; track Embabel version upgrades |
| Rehydrated `SimpleAgentProcess` doesn't perfectly replicate original process behavior | High | Comprehensive testing; track Embabel version upgrades |
Analyze `SimpleAgentProcess` and its superclass and see what we are currently missing in the snapshot. The goal is to be able to restore the full state.
Analyzed all fields in SimpleAgentProcess (3 fields) and AbstractAgentProcess (16 fields). Added a complete field-by-field table with persist/skip decisions for each.
Key finding: `replanBlacklist` (`Set<String>` in `SimpleAgentProcess`) was missing from the snapshot; it tracks failed action names to prevent re-selection during replanning. Added it to `AgentProcessSnapshot` and the rehydration flow.
Fields that need persistence: id, parentId, agentName, status, timestamp/createdAt, failureInfo, blackboard (entries + bindings + hidden + conditions), history, replanBlacklist.
Fields reconstructed automatically: _lastWorldState and _goal (recomputed from Blackboard on first tick()), WorldStateDeterminer (derived from Agent conditions + Blackboard), Planner (from PlannerFactory).
Note: restoring `replanBlacklist` requires reflection since it's a `private final Set<String>`. Flagged this in the risk analysis.
See 5b9ddfd.
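Based on the field list in the reply above, a snapshot covering the persisted state might look like the record below. This is a sketch. The field names follow the reply, but every type is an assumption (for example, `String` for status and `List<String>` for history), and the plan's real `AgentProcessSnapshot` may differ. Derived state such as `_lastWorldState` and `_goal` is deliberately absent, because the reply notes it is recomputed from the Blackboard on the first `tick()`.

```java
import java.time.Instant;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical shape of AgentProcessSnapshot; all component types are assumptions. */
public record AgentProcessSnapshot(
        String id,
        String parentId,                 // null for top-level processes
        String agentName,
        String status,                   // assumed: the status enum's name()
        Instant createdAt,
        String failureInfo,              // assumed: null unless the process failed
        BlackboardState blackboard,
        List<String> history,            // assumed representation of the action history
        Set<String> replanBlacklist) {   // failed action names excluded from replanning

    /** Blackboard contents: entries plus bindings, hidden objects, and conditions. */
    public record BlackboardState(
            List<Object> entries,
            Map<String, Object> bindings,
            List<Object> hidden,
            Map<String, Boolean> conditions) {}

    // Defensive copies keep the collections immutable once the snapshot is created.
    public AgentProcessSnapshot {
        history = List.copyOf(history);
        replanBlacklist = Set.copyOf(replanBlacklist);
    }
}
```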
…nBlacklist to snapshot, detail rehydration flow Agent-Logs-Url: https://github.com/elasticsoftwarefoundation/akces-framework/sessions/37efe0a7-d623-4c7c-9c9c-9fea5427468c Co-authored-by: jwijgerd <914840+jwijgerd@users.noreply.github.com>
…estore contents via addAll Agent-Logs-Url: https://github.com/elasticsoftwarefoundation/akces-framework/sessions/37efe0a7-d623-4c7c-9c9c-9fea5427468c Co-authored-by: jwijgerd <914840+jwijgerd@users.noreply.github.com>
The persistence plan needed several revisions: persist all `AgentProcess` instances (not just tasks), relocate the plan file, drop the AutoConfiguration override suggestion, use direct `SimpleAgentProcess` creation for rehydration, and add a comprehensive field analysis to ensure full state restoration.

Changes

- **Universal persistence:** Replaced "Selective Persistence: Task Processes Only" with "Universal Persistence: All AgentProcess Instances". Memory distillation processes now persist alongside task assignments, since they also use tick-based processing and benefit from crash recovery.
- **Removed AutoConfiguration references:** Dropped the note proposing `@ConditionalOnMissingBean` upstream to Embabel and removed the erroneous `META-INF/spring/org.springframework.boot.autoconfigure.AutoConfiguration.imports` file from the repo root. The `@Bean @Primary` in `AgenticAggregateServiceApplication` is sufficient; no AutoConfiguration changes needed.
- **Relocated plan:** Moved `plans/agent-process-persistence/plan.md` → `plans/agent-process-persistence-plan.md` (flat file, consistent with other plans).
- **Direct `SimpleAgentProcess` rehydration:** Replaced the wrapper/decorator rehydration approach with direct `SimpleAgentProcess` creation using the restored Blackboard and original process ID. `AgentPlatform.getAgentProcess()` internally delegates to `agentProcessRepository.findById(id)`, which returns the recreated process. No wrapper needed.
- **Accurate orphan recovery description:** Updated Phase 5 and Architect Decision #4 to reflect that the runtime emits failure events (`AgentTaskFinishedEvent(FAILED)` / `MemoryDistillationFailedEvent`) to clear orphaned state, rather than just logging a warning.
- **`SimpleAgentProcess`/`AbstractAgentProcess` field analysis:** Decompiled and documented all fields (16 in `AbstractAgentProcess`, 3 in `SimpleAgentProcess`) with persist/skip decisions for each. Identified `replanBlacklist` (`Set<String>`) as missing from the original snapshot; this field tracks failed action names and must be persisted to prevent retry loops after restart. Added it to `AgentProcessSnapshot`. Documented that `_lastWorldState` and `_goal` are auto-reconstructed from the Blackboard on the first `tick()`, so they don't need persistence.
- **Expanded rehydration flow:** Detailed all 12 steps, including the exact `SimpleAgentProcess` constructor parameters (`id`, `parentId`, `agent`, `ProcessOptions.DEFAULT`, `blackboard`, `platformServices`, `plannerFactory`, `createdAt`) and mutable state restoration (`status`, `history`, `failureInfo`, and `replanBlacklist` contents via reflection + `addAll()`).

Key sections updated

- `MemoryDistillation` entries alongside `AssignedTask`
- `AbstractAgentProcess` and `SimpleAgentProcess`
- Short-lived, run-to-completion / No → Tick-based / Yes
- `replanBlacklist` field; added `createdAt` mapping comment
- `WorldStateDeterminer`, `_lastWorldState`, `_goal`, `_terminationRequest`, telemetry fields
- `WorldState`/`Goal` auto-reconstruction) and #7 (`replanBlacklist` importance)
- `SimpleAgentProcess` creation with full field detail
- `pendingWrites` buffer stages all processes, not just "persistent processes"
- `replanBlacklist` reflection note
- `@ConditionalOnMissingBean` language
- `replanBlacklist` reflection risk
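Plain `java.lang.reflect` can illustrate the "reflection + `addAll()`" step for `replanBlacklist`. Because the field is `private final`, the sketch reads the existing `Set` instance and adds to it rather than assigning a new set. `ProcessWithBlacklist` is a hypothetical stand-in for `SimpleAgentProcess`, and the sketch assumes the field holds a mutable set. Against the real class, `setAccessible(true)` may throw `InaccessibleObjectException` if the library is loaded as a named module that does not open its package.

```java
import java.lang.reflect.Field;
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

public class ReplanBlacklistRestore {
    // Hypothetical stand-in for SimpleAgentProcess: a private final, mutable set with no setter.
    static final class ProcessWithBlacklist {
        private final Set<String> replanBlacklist = new HashSet<>();

        Set<String> blacklistView() {
            return Set.copyOf(replanBlacklist);
        }
    }

    /**
     * Copies persisted action names into the existing set. The final field itself is never
     * reassigned; only the contents of the set it references change. getDeclaredField only
     * searches the runtime class, so this assumes the object is exactly the declaring class.
     */
    @SuppressWarnings("unchecked")
    static void restoreReplanBlacklist(Object process, Collection<String> persisted)
            throws ReflectiveOperationException {
        Field field = process.getClass().getDeclaredField("replanBlacklist");
        field.setAccessible(true);
        Set<String> target = (Set<String>) field.get(process);
        target.addAll(persisted);
    }

    public static void main(String[] args) throws ReflectiveOperationException {
        ProcessWithBlacklist process = new ProcessWithBlacklist();
        restoreReplanBlacklist(process, Set.of("searchWeb", "summarize")); // hypothetical action names
        System.out.println(process.blacklistView().size()); // 2
    }
}
```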