Open-source memory system converts agent interaction traces into reusable guidelines, boosting hard-task reliability by 14.2% on the AppWorld benchmark.
Researchers released ALTK-Evolve, an open-source long-term memory framework for AI agents that replaces raw transcript replay with distilled, scored guidelines. The system captures full agent trajectories via OpenTelemetry-based observability tools like Langfuse, extracts structural patterns, and injects only relevant principles at inference time. On the AppWorld benchmark — multi-step API tasks averaging 9.5 APIs across 1.8 apps — it delivered a 14.2% reliability improvement on hard cases. It integrates with Claude Code, Codex, and IBM Bob out of the box.
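The distillation step — turning raw trajectories into short, scored principles rather than storing full transcripts — can be pictured with a minimal sketch. Note this is purely illustrative: the article does not show ALTK-Evolve's API, so the names below (`Step`, `Guideline`, `distill`) and the failure-then-success heuristic are hypothetical stand-ins.

```python
# Illustrative sketch only: ALTK-Evolve's real API is not shown in the
# article; these types and the distillation heuristic are hypothetical.
from dataclasses import dataclass

@dataclass
class Step:
    action: str       # e.g. an API call the agent made
    succeeded: bool

@dataclass
class Guideline:
    text: str
    score: float      # confidence accumulated across trajectories

def distill(steps: list[Step]) -> list[Guideline]:
    """Turn a raw trajectory into short, reusable principles
    instead of replaying the full transcript."""
    guidelines = []
    # Toy heuristic: a failure immediately followed by a success
    # suggests a recovery pattern worth remembering.
    for prev, nxt in zip(steps, steps[1:]):
        if not prev.succeeded and nxt.succeeded:
            guidelines.append(Guideline(
                text=f"After '{prev.action}' fails, try '{nxt.action}'.",
                score=1.0,
            ))
    return guidelines

trace = [Step("venmo.pay", False), Step("venmo.login", True)]
print(distill(trace)[0].text)
# → After 'venmo.pay' fails, try 'venmo.login'.
```

The point of the sketch is the shape of the output: a few scored one-liners per trajectory, small enough to retrieve selectively at inference time.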
ALTK-Evolve slots into existing agent pipelines via OpenTelemetry and intercepts execution traces to build a scored, pruned guideline library — no fine-tuning, no RAG overhaul. The 14.2% hard-task gain on AppWorld is meaningful because hard cases are exactly where naive transcript replay fails: complex control flow, multi-app dependencies, edge cases. The just-in-time retrieval model keeps context lean rather than dumping the full memory store into every prompt.
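The just-in-time retrieval model described above can be sketched as a scored, pruned library queried per task. Again, this is a hedged illustration under stated assumptions — the lexical-overlap scoring and the threshold-based pruning below are stand-ins, not ALTK-Evolve's actual logic, which likely uses embeddings.

```python
# Hedged sketch of just-in-time guideline retrieval; the scoring and
# pruning heuristics here are stand-ins, not ALTK-Evolve's actual logic.
def relevance(guideline: str, task: str) -> float:
    """Crude lexical-overlap score; a real system would embed both sides."""
    g, t = set(guideline.lower().split()), set(task.lower().split())
    return len(g & t) / max(len(g), 1)

def inject(library: dict[str, float], task: str, k: int = 2,
           min_score: float = 0.2) -> list[str]:
    """Return only the top-k relevant, sufficiently scored guidelines,
    keeping the prompt lean instead of dumping the whole store."""
    candidates = [(relevance(g, task) * w, g)
                  for g, w in library.items() if w >= min_score]
    candidates.sort(reverse=True)
    return [g for s, g in candidates[:k] if s > 0]

# Library maps guideline text -> accumulated score.
library = {
    "paginate results when an api returns a cursor": 0.9,
    "retry login before retrying a payment call": 0.8,
    "prefer batch endpoints for bulk updates": 0.1,  # pruned: low score
}
print(inject(library, "handle api pagination cursor for search results"))
```

Only the pagination guideline survives: the low-scored entry is pruned before scoring, and the irrelevant one scores zero against the task, so the prompt receives a single targeted principle rather than the full memory store.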
If you're running a ReAct or tool-calling agent on Claude Code or Codex, clone the ALTK-Evolve repo this week, wire it to your existing Langfuse traces, and run the AppWorld eval against your baseline to confirm the 14.2% delta holds on your task distribution.
Navigate to the ALTK-Evolve GitHub repo and clone it: git clone https://github.com/altk/altk-evolve