Over the last couple of years we have watched Large Language Model (LLM) usage, popularly known as “GenAI”, move well beyond prompt engineering. We moved through context engineering, supplying prompts and supporting information as context to get better and more accurate one-shot answers, and then in software engineering we saw “spec-driven development”, where the context is specialised into software requirements, design and other specifications that give the language model a container and boundary within which to build a solution.
Yet through all of this we still worked in single threads of conversation, without ongoing training of the model or any long-term awareness of our “global context” to drive better tonality, context and outcomes specific to individual needs or personas.
Writing this in January 2026, I believe the next wave will be memory architecture for AI systems, because the question I keep returning to is this: if AI is acting as a collaborator in long-running systems, where does its memory live and how is it architected?
If you have been thinking along the same lines, then let’s explore this topic together: pin down what has been done in this space pre-2026 and what we can expect to see in the year ahead.
This post frames the shift from context tricks to deliberate memory design, then lands on a practical 2026 reference stack and patterns you can implement.
1. Waves of AI in software engineering
The progression has been consistent. Each wave increases leverage, and each wave increases the blast radius of mistakes.
- Autocomplete and predictive text
- Prompt engineering
- Context engineering
- Spec-driven development
- Memory architecture
Characteristics:
- Autocomplete improved velocity, not architecture
- Prompt engineering improved expressiveness, not reproducibility
- Context engineering improved grounding, not continuity
- Spec-driven development improved alignment, not persistence
- Memory architecture becomes the layer that reduces rework across time
Observation:
Most teams upgraded model capability faster than they upgraded system memory. The result is familiar: agents appear smart inside a single interaction and become expensive across a program of work.
2. What “memory” actually means in LLM systems
“Memory” is overloaded. If the team does not disambiguate it early, the solution defaults to longer chat history plus a vector store.
2.1 Parametric memory
Knowledge baked into model weights.
Strengths:
- Broad general knowledge
- Strong priors and reasoning patterns
Weaknesses:
- Slow to update
- Unreliable for organisation-specific and time-sensitive context
- Not auditable in system terms
2.2 Ephemeral or context memory
The current prompt window: chat history, retrieved documents, scratchpads, and tool outputs.
Constraints:
- Context windows behave like RAM, not a database
- The working set is rewritten every turn
- Retention and eviction are usually accidental, not designed
2.3 External or long-term memory
Stores the system owns: vector databases, document stores, SQL/NoSQL, knowledge graphs, logs, traces, and event streams.
2.4 Agent state vs system memory
Agent state is what a single run sees. System memory is what the organisation retains.
This maps cleanly to distributed systems work: state lives somewhere, ownership matters, retention matters, and observability matters. Memory is architecture.
Call-out:
Today, most teams say “memory” but really mean “slightly longer chat history plus a vector DB”.
3. Techniques that got us here
The current mainstream is valuable. It is also insufficient for long-running workflows.
3.1 Vector DB and vanilla RAG
How it works:
- Chunk text
- Embed chunks
- Store embeddings
- Retrieve top-k by similarity
- Optionally rerank
- Inject into the context window
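A minimal sketch of that pipeline, with a toy `embed()` function and an in-memory list standing in for a real embedding model and vector database; the chunk text is invented for illustration:

```python
import math

def embed(text: str) -> list[float]:
    # Stand-in for a real embedding model: a toy character-frequency vector
    # so the example runs without any external service.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    # Vectors are already normalised, so the dot product is cosine similarity.
    return sum(x * y for x, y in zip(a, b))

# Chunk (here: whole sentences), embed, store.
chunks = [
    "The orders service owns payment retries.",
    "The claims lifecycle involves four actors.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

# Retrieve top-k by similarity, optionally rerank, then inject into the prompt.
def retrieve(query: str, k: int = 2) -> list[str]:
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

prompt_context = "\n".join(retrieve("who owns payment retries?"))
```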
Where it shines:
- Document Q&A
- Codebase search
- Knowledge base assistants
- “Find the paragraph that says X”
Where it breaks down:
- Multi-episode work where history matters
- Questions that depend on relationships and time
- Auditability of evidence across months
- Growth of unbounded corpora and retrieval noise
3.2 Graph RAG and knowledge graphs
Graph approaches bring explicit structure back into retrieval.
Pros:
- Better “who/what/when/how related” queries
- Supports reasoning over relationships
- Improves explainability via traversals and edges
Cons:
- Modelling effort
- Governance overhead
- Drift between real system behaviour and the curated graph
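To make the contrast with similarity search concrete, here is a toy sketch of the kind of traversal a graph approach enables; the entities, relations, and edge directions are invented for illustration:

```python
from collections import defaultdict

# (subject, relation, object) triples, e.g. derived from tickets, ADRs, or telemetry.
triples = [
    ("orders-service", "publishes", "order-created"),
    ("order-created", "consumed_by", "billing-service"),
    ("billing-service", "owned_by", "payments-team"),
]

outgoing = defaultdict(list)
for subj, rel, obj in triples:
    outgoing[subj].append((rel, obj))

def related(entity: str, depth: int = 3) -> list[tuple[str, str, str]]:
    """Walk outward from an entity, keeping the labelled edges along the way."""
    found, frontier = [], [entity]
    for _ in range(depth):
        next_frontier = []
        for node in frontier:
            for rel, obj in outgoing[node]:
                found.append((node, rel, obj))
                next_frontier.append(obj)
        frontier = next_frontier
    return found

# "How is orders-service related to payments-team?" becomes a traversal with an
# explainable path, rather than a similarity score over text chunks.
print(related("orders-service"))
```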
3.3 Other patterns
- Conversation buffers and windowed chat memory
- Rolling summaries
- Hybrid retrieval (BM25 plus embeddings plus reranking)
- Tool-aware retrieval that prioritises structured sources
TODO: Add a short comparison table (size, latency, relationship handling, governance overhead, operational cost).
4. Core constraints that shape memory design
Most “memory stacks” fail for predictable reasons. These three constraints are the reusable lens.
4.1 Capacity
Memory grows faster than teams expect.
- Every message, tool call, event, and log line is a write
- Every write creates retrieval noise unless it is consolidated
- Unbounded storage becomes unbounded spend
Design decisions that must be explicit:
- Raw text vs embeddings vs summaries
- Retention per tier
- Consolidation frequency
- Deletion strategy
4.2 Latency and search cost
Users expect chat-speed. Memory systems naturally drift toward batch processing.
Latency sources:
- Embedding computation
- ANN search
- Reranking
- Multi-hop retrieval
- Repeated retrieval inside a task
Architectural levers:
- Precompute embeddings at ingestion
- Cache retrieval results per episode
- Tiered indexes for hot vs cold memory
- Deterministic limits on retrieval hops
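Two of these levers sketched in code: a per-episode retrieval cache in front of a placeholder retrieval call, and a hard cap on retrieval hops. The `retrieve_uncached` body, the cache size, and the hop limit are illustrative assumptions, not recommendations:

```python
from functools import lru_cache

MAX_HOPS = 2  # deterministic limit on multi-hop retrieval

def retrieve_uncached(query: str, episode_id: str) -> tuple[str, ...]:
    # Placeholder for the real path: ANN search over embeddings that were
    # precomputed at ingestion time, plus optional reranking.
    return ()

@lru_cache(maxsize=1024)
def retrieve(query: str, episode_id: str) -> tuple[str, ...]:
    # Repeated retrievals inside the same episode hit this cache instead of
    # paying for ANN search and reranking again.
    return retrieve_uncached(query, episode_id)
```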
4.3 Relationships and structure
Similarity is not structure.
If the question depends on sequences, causality, ownership, and time, a pure vector index will underperform.
Examples from enterprise domains:
- Incident timelines across services
- Integration flows across sagas
- Claim or case lifecycles with multiple actors
- Architectural decisions and downstream outcomes
Structure options:
- Episodic timelines
- Entity indexes
- Event-sourced style logs
- Lightweight graphs derived from episodes
Rule of thumb:
If your questions involve “who did what, when, and why”, you need explicit structure, not just a bigger vector DB.
5. The shift from recall to episodic and lifelong memory
The next step is not a larger vector DB. The shift is towards meaningfully bounded memory units and deliberate consolidation.
5.1 Episodic memory
An episode is a coherent unit of experience: an incident, support case, sprint, deployment, design session, or a decision record plus outcomes.
Chunks are arbitrary. Episodes are bounded and auditable.
A practical episodic store includes:
- Episode metadata (time range, actors, entities, intent, outcome)
- Evidence links (tool outputs, tickets, commits, dashboards)
- Summaries at multiple resolutions (short, medium, deep)
- Optional embeddings for episode-level recall and clustering
TODO: Define a minimal episode schema (episode_id, start_ts, end_ts, actors, entities, intent, outcome, evidence pointers, summaries, embedding pointers).
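Ahead of that fuller definition, a hedged sketch of what the minimal schema above could look like as a dataclass; the types and defaults are assumptions layered on the field names already listed:

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Episode:
    episode_id: str
    start_ts: datetime
    end_ts: datetime | None              # open episodes have no end yet
    actors: list[str]                    # people, services, agents involved
    entities: list[str]                  # systems, tickets, components touched
    intent: str                          # what the episode set out to do
    outcome: str | None = None           # filled in when the episode closes
    evidence: list[str] = field(default_factory=list)        # pointers to tool outputs, commits, dashboards
    summaries: dict[str, str] = field(default_factory=dict)  # keyed "short" / "medium" / "deep"
    embedding_ref: str | None = None     # pointer into an embedding store, if used
```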
5.2 Lifelong agents and multi-tier memory
Long-running agents need tiers:
- Working memory: current task, short-lived context
- Episodic memory: past tasks, incidents, conversations, decisions
- Semantic memory: distilled facts, preferences, policies derived from episodes
Key design move:
Explicit consolidation. Without it, the system accumulates clutter and retrieval becomes probabilistic noise.
5.3 Memory as resource management
Treat the context window as RAM. Treat external stores as disk. Treat retrieval as paging.
This framing forces policies:
- What stays hot
- What is paged in
- What is summarised
- What is evicted
- What is protected and audited
6. A 2026 memory stack for enterprise agents
This is a reference architecture, not a product recommendation. It fits existing enterprise stacks and makes memory observable.
6.1 Ingress
Inputs that generate memory writes:
- User requests
- Tool outputs
- System events
- Logs and traces
- Documents and specs
6.2 Working set
Session-scoped, fast, aggressively bounded.
Policies:
- Max tokens
- Eviction rules
- Summarisation triggers
- Tool output compaction rules
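A sketch of how these policies could be made explicit rather than accidental; the numbers, and the `count_tokens` and `summarise` callables, are placeholders:

```python
from dataclasses import dataclass

@dataclass
class WorkingSetPolicy:
    max_tokens: int = 8_000              # hard budget for the session-scoped working set
    summarise_after_tokens: int = 6_000  # trigger consolidation before the budget is hit
    keep_last_n_turns: int = 4           # eviction rule: always keep the most recent turns verbatim
    compact_tool_outputs: bool = True

def apply(policy: WorkingSetPolicy, turns: list[str], count_tokens, summarise) -> list[str]:
    """Enforce the policy: summarise older turns, keep the recent tail verbatim."""
    total = sum(count_tokens(t) for t in turns)
    if total <= policy.summarise_after_tokens:
        return turns
    head = turns[: -policy.keep_last_n_turns]
    tail = turns[-policy.keep_last_n_turns :]
    return [summarise(head)] + tail
```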
6.3 Episodic layer
The system of record for what happened.
Storage options:
- Relational
- Document
- Event-sourced log with derived views
Core capabilities:
- Append events into an episode
- Close an episode with an outcome
- Generate summaries
- Attach evidence pointers
- Index by time and entities
- Optional episode embeddings
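A sketch of the shape these capabilities could take; the class, its method names, and the in-memory dictionaries are assumptions for illustration, not a reference to any specific product:

```python
from datetime import datetime, timezone

class EpisodicStore:
    """In-memory stand-in for the episodic system of record."""

    def __init__(self):
        self.episodes: dict[str, dict] = {}

    def append_event(self, episode_id: str, event: dict) -> None:
        ep = self.episodes.setdefault(
            episode_id,
            {"events": [], "evidence": [], "outcome": None,
             "opened": datetime.now(timezone.utc)},
        )
        ep["events"].append(event)

    def attach_evidence(self, episode_id: str, pointer: str) -> None:
        self.episodes[episode_id]["evidence"].append(pointer)

    def close(self, episode_id: str, outcome: str, summary: str) -> None:
        ep = self.episodes[episode_id]
        ep["outcome"], ep["summary"] = outcome, summary
        ep["closed"] = datetime.now(timezone.utc)

    def by_entity(self, entity: str) -> list[str]:
        # Index by entity: a naive scan here; a real store would maintain an index.
        return [eid for eid, ep in self.episodes.items()
                if any(entity in str(event.values()) for event in ep["events"])]
```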
6.4 Semantic layer
Distilled memory for reuse.
Examples:
- Organisation constraints
- Architectural decisions and standards
- Preferred patterns and vendor constraints
- Known failure modes and mitigations
Storage options:
- Relational or graph, depending on query patterns
- Keep it smaller than episodic memory by design
6.5 Access patterns
Before inference:
- Retrieve semantic facts and constraints
- Retrieve episodes relevant by entity and time
- Retrieve evidence snippets only after selecting episodes
After inference:
- Write tool calls and outputs
- Update episode state
- Generate a short “episode delta” summary
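A sketch of the read and write paths around a single inference call, covering the before- and after-inference steps above; the `llm`, `semantic_store`, and `episodic_store` objects and their method names are assumed to be provided by the surrounding system:

```python
def run_turn(llm, semantic_store, episodic_store, episode_id: str,
             user_request: str, entities: list[str]) -> str:
    # Before inference: constraints and facts first, then relevant episodes,
    # then evidence snippets only for the episodes actually selected.
    constraints = semantic_store.constraints_for(entities)    # hypothetical API
    episodes = episodic_store.recent_for(entities, limit=3)   # hypothetical API
    evidence = [episodic_store.evidence_snippets(ep) for ep in episodes]

    prompt = "\n\n".join(str(part) for part in (constraints, episodes, evidence, user_request))
    answer = llm(prompt)

    # After inference: record what happened, then a short "episode delta";
    # a real system would summarise rather than truncate.
    episodic_store.append_event(episode_id, {"request": user_request, "answer": answer})
    episodic_store.append_event(episode_id, {"delta": answer[:200]})
    return answer
```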
Background jobs:
- Consolidation and pruning
- Contradiction detection across semantic memory
- Drift detection between standards and observed outcomes
TODO: Add a diagram showing these layers and read/write flows around the LLM.
7. Practical patterns you can implement now
These patterns work because they constrain scope and make memory an engineered asset.
Pattern 1: Conversation buffer plus episodic log plus periodic summaries
Use case: Ops and support copilots.
- Working set for the active incident
- Episodic store per incident and post-incident actions
- Weekly or monthly summaries by service and failure mode
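A sketch of the periodic summary step, assuming closed incidents arrive as dictionaries with `service`, `failure_mode`, and `summary` fields (invented names for illustration):

```python
from collections import defaultdict

def weekly_rollup(closed_incidents: list[dict]) -> dict[tuple[str, str], list[str]]:
    """Group incident summaries by (service, failure_mode) for a periodic digest."""
    rollup: dict[tuple[str, str], list[str]] = defaultdict(list)
    for incident in closed_incidents:
        key = (incident["service"], incident["failure_mode"])
        rollup[key].append(incident["summary"])
    return dict(rollup)

# The digest itself would typically be produced by an LLM summarising each group,
# then written back into semantic memory as distilled failure-mode knowledge.
```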
Trade-offs:
- Low modelling overhead
- Strong audit trail
- Requires consolidation discipline
Pattern 2: Tool-call journal plus vector index plus lightweight graph
Use case: Integration and distributed design assistants.
- Every tool call becomes a structured event
- Embeddings support recall
- Lightweight graph derived from episodes supports “related systems and flows”
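A sketch of how tool calls could become structured events and how a lightweight graph might be derived from them rather than curated by hand; the event fields and helper names are assumptions:

```python
from collections import defaultdict

journal: list[dict] = []   # append-only tool-call journal
edges = defaultdict(set)   # lightweight graph derived from the journal

def record_tool_call(episode_id: str, tool: str, target_system: str, summary: str) -> None:
    event = {"episode": episode_id, "tool": tool, "system": target_system, "summary": summary}
    journal.append(event)
    # Derive edges instead of curating a graph by hand: episodes connect the
    # systems they touched, which later answers "related systems and flows".
    edges[episode_id].add(target_system)

def related_systems(system: str) -> set[str]:
    related = set()
    for episode, systems in edges.items():
        if system in systems:
            related |= systems
    return related - {system}
```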
Trade-offs:
- Higher build cost
- Better relationship queries
- Better explainability for “why this decision was made”
Pattern 3: Preference and policy memory
Use case: Any enterprise agent that must behave consistently.
- Store constraints as semantic memory (security rules, cost limits, approved stacks, non-negotiables)
- Inject constraints into planning, spec generation, and validation steps
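A sketch of constraint injection and a crude validation check; the constraint records and identifiers are invented for illustration:

```python
# Illustrative constraint records; in practice these live in governed semantic memory.
CONSTRAINTS = [
    {"id": "SEC-001", "rule": "No customer PII in logs", "owner": "security"},
    {"id": "COST-004", "rule": "Monthly spend per agent stays under the approved budget", "owner": "platform"},
]

def planning_prompt(task: str) -> str:
    # Inject constraints into planning and spec generation, not only final review.
    rules = "\n".join(f"- [{c['id']}] {c['rule']}" for c in CONSTRAINTS)
    return f"Task: {task}\n\nNon-negotiable constraints:\n{rules}"

def unreferenced(plan: str) -> list[str]:
    # Crude validation stand-in: flag constraints the generated plan never mentions.
    return [c["id"] for c in CONSTRAINTS if c["id"] not in plan]
```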
Trade-offs:
- Small store, high leverage
- Requires ownership and governance
- Failure mode is stale policy becoming hidden bias
TODO: For each pattern, add: when to use it, data shape, retrieval policy, consolidation policy, and metrics.
8. How memory changes architecture and design work
Memory becomes a first-class element in solution architecture diagrams, alongside queues, APIs, and databases.
8.1 Governance
- Ownership of memory schemas
- Retention and residency rules
- Audit trail for what the agent knew at a given time
- Schema evolution and backwards compatibility
8.2 New non-functional requirements
- Latency budgets for retrieval and reranking
- Storage and retention costs
- Quality metrics for recall and consolidation
- Privacy controls and redaction workflows
8.3 Checklist for architects starting an AI initiative
- Define the episodes in this domain
- Define memory tiers and retention per tier
- Define retrieval policy deterministically
- Define consolidation jobs and schedules
- Define evaluation harness and metrics
- Define audit and access controls
9. Where this series goes next
This post sets the baseline. The series becomes valuable when it moves from concepts to measurable designs.
Planned follow-ups:
- Episodic schema and consolidation jobs
- Metrics and observability for memory quality
- Memory safety and access control for multi-agent systems
- Cost modelling for retention and retrieval
- Patterns for graph-derived memory from event streams