Most “prompt engineering” problems are really context engineering problems. What the model sees—retrieved evidence, tool outputs, user history, policies, and constraints—matters more than the wording of a single system message.
In production, context is also your main cost and latency driver. If you do not manage it deliberately, quality becomes unstable: the model hallucinates when evidence is missing, refuses too often when policies are unclear, and times out when the token budget is blown.
Start with a context map
Break the input into explicit layers so you can version, test and observe each one:
- Instructions. Stable policies, role boundaries, safety constraints and response format.
- User intent. The question plus any task context, goals, and preferences.
- Evidence. Retrieved snippets, structured records, and citations (with permissions applied).
- Tools. Function schemas, tool constraints, and results from tool calls.
- Memory. Session state, durable preferences, and summaries (if you use them at all).
Once the layers are separated, you can apply different controls to each: retrieval recall tests for evidence, schema validation for tools, and policy tests for instructions. This is also the foundation for evaluation loops that catch regressions.
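As a minimal sketch of what "explicit layers" can look like in code, here is one way to carry the layers through a request as a single versioned object. This assumes a Python stack; the names (`ContextMap`, `render`) are illustrative, not a specific library.

```python
# A minimal sketch of a layered context object. All names are illustrative.
from dataclasses import dataclass, field


@dataclass
class ContextMap:
    """One object per request, so each layer can be versioned, logged, and tested."""
    instructions: str          # stable policies, role boundaries, response format
    instructions_version: str  # e.g. "policy-2024-06-01", for release tracking
    user_intent: str           # the question plus task context, goals, preferences
    evidence: list[dict] = field(default_factory=list)      # retrieved snippets with IDs
    tool_results: list[dict] = field(default_factory=list)  # structured tool outputs
    memory: str = ""           # rolling summary or durable preferences, if any


def render(ctx: ContextMap) -> str:
    """Assemble the layers in a fixed order so diffs between two runs are meaningful."""
    evidence_block = "\n".join(f"[{item['id']}] {item['text']}" for item in ctx.evidence)
    tools_block = "\n".join(str(result) for result in ctx.tool_results)
    return (
        f"# Instructions ({ctx.instructions_version})\n{ctx.instructions}\n\n"
        f"# Evidence\n{evidence_block}\n\n"
        f"# Tool results\n{tools_block}\n\n"
        f"# Memory\n{ctx.memory}\n\n"
        f"# User request\n{ctx.user_intent}"
    )
```

The fixed assembly order is the point: when something changes between runs, you can attribute it to a specific layer rather than to "the prompt".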
Retrieval is part of the interface
For RAG systems, the retrieval layer is your “truth supply chain”. Bad retrieval shows up downstream as hallucination, even when the model itself is behaving correctly.
- Query shaping. Rewrite queries from user language into domain language, and add filters (region, product, permission tier) explicitly.
- Chunking and metadata. Index content with stable IDs, source timestamps and owners so you can reason about freshness and provenance.
- Permission enforcement. Apply ACL filters before retrieval results ever reach the model; do not rely on prompt instructions for access control.
If you want practical patterns, see Practical RAG Patterns and the failure modes in Why Most RAG Implementations Underperform.
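To make the first and third points concrete, here is a hedged sketch of query shaping with permission filters applied at retrieval time. The `search_index(query, filters=..., top_k=...)` call and the synonym map are assumptions standing in for whatever your vector store or search API provides.

```python
# A sketch of query shaping plus ACL enforcement, assuming a hypothetical
# `search_index(query, filters, top_k)` retrieval API; adapt to your own store.
DOMAIN_SYNONYMS = {"sign in": "authentication", "plan": "subscription tier"}


def shape_query(user_query: str, user: dict) -> tuple[str, dict]:
    """Rewrite user language into domain language and attach explicit filters."""
    query = user_query.lower()
    for user_term, domain_term in DOMAIN_SYNONYMS.items():
        query = query.replace(user_term, domain_term)
    filters = {
        "region": user["region"],             # explicit, not inferred by the model
        "permission_tier": user["acl_tier"],  # enforced here, never via prompt text
    }
    return query, filters


def retrieve(user_query: str, user: dict, search_index) -> list[dict]:
    query, filters = shape_query(user_query, user)
    # The index applies the ACL filter before any result reaches the model.
    return search_index(query, filters=filters, top_k=8)
```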
Memory is not a single bucket
“Memory” is useful, but it is also a liability. Many teams get better outcomes by treating memory as a set of narrow mechanisms:
- Session state. Short-lived state used to complete multi-step tasks (safe default).
- Durable preferences. Opt-in preferences stored outside the model (tone, format, language).
- Summarised history. A rolling summary that is regenerated and tested, not appended endlessly.
Decide upfront what you will not remember (sensitive identifiers, confidential content) and enforce that through storage controls and redaction, not just policy text.
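A small sketch of these mechanisms, with redaction enforced at write time rather than in policy text. The regex, the allow-list, and the three stores are illustrative assumptions, not a prescription.

```python
# Memory as narrow mechanisms, with redaction applied before anything is stored.
import re

SENSITIVE = re.compile(r"\b\d{16}\b|\b[\w.+-]+@[\w-]+\.[\w.]+\b")  # e.g. card numbers, emails

session_state: dict = {}        # short-lived, dropped when the task completes
durable_preferences: dict = {}  # opt-in only: tone, format, language


def remember_preference(user_id: str, key: str, value: str) -> None:
    """Store only allow-listed preference keys, and redact anything sensitive."""
    if key not in {"tone", "format", "language"}:
        return  # not on the allow-list: never stored, regardless of what the prompt says
    durable_preferences.setdefault(user_id, {})[key] = SENSITIVE.sub("[REDACTED]", value)


def update_summary(previous_summary: str, new_turns: list[str], summarise) -> str:
    """Regenerate the rolling summary instead of appending turns endlessly."""
    return summarise(previous_summary, new_turns)  # `summarise` is your own tested call
```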
Context compression and budgeting
When the context window is tight, teams tend to “truncate and hope”. The better approaches are deterministic:
- Relevance filtering. Keep only top evidence passages, and drop near-duplicates.
- Structured tool outputs. Return compact JSON with stable field names instead of verbose prose.
- Summarise by intent. Summarise only the parts relevant to the user’s current task, and include citations to the original.
- Token budgets. Allocate a fixed budget per layer (instructions/evidence/tools/memory) so no layer starves the others.
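A sketch of the budgeting and filtering steps together, assuming a `count_tokens` helper (a tokenizer call in practice); the budget split itself is an illustrative choice, not a recommendation.

```python
# Per-layer token budgets with deterministic, relevance-ordered truncation.
BUDGETS = {"instructions": 1_000, "evidence": 4_000, "tools": 2_000, "memory": 500}


def fit_to_budget(passages: list[str], budget: int, count_tokens) -> list[str]:
    """Keep passages in relevance order until the layer budget is spent."""
    kept, used, seen = [], 0, set()
    for passage in passages:              # already sorted by relevance score
        key = passage.strip().lower()
        if key in seen:                   # drop duplicates (exact match, in this sketch)
            continue
        cost = count_tokens(passage)
        if used + cost > budget:
            break                         # deterministic cut-off: no "truncate and hope"
        kept.append(passage)
        used += cost
        seen.add(key)
    return kept
```

Because each layer has its own budget, a verbose tool result cannot silently crowd out the evidence the answer depends on.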
Make context observable and testable
Log context composition: prompt version, retrieval IDs, tool schemas, tool results, and the final token counts. Without that, post-incident analysis is guesswork and quality regressions take weeks to isolate.
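In code, that can be as small as one structured log line per request. This sketch reuses the `ContextMap` object from the earlier example; the field names are assumptions, and the point is that composition is recorded as structured data rather than prose.

```python
# One structured log entry per request, capturing what the model actually saw.
import json
import logging
import time

logger = logging.getLogger("context")


def log_context(ctx, rendered_prompt: str, count_tokens) -> None:
    logger.info(json.dumps({
        "timestamp": time.time(),
        "prompt_version": ctx.instructions_version,
        "retrieval_ids": [item["id"] for item in ctx.evidence],
        "num_tool_results": len(ctx.tool_results),
        "total_tokens": count_tokens(rendered_prompt),
    }))
```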
Finally, treat context changes like releases. If you can’t explain “what changed” between two runs, you can’t operate the system confidently.