Version: 2.0

Context engineering

Context engineering is the deliberate crafting of everything the LLM reads on a given turn, apart from the user's own message: the system instructions, the tool descriptions and schemas, the tool outputs, the retrieved snippets, the reminders, and whatever has been summarized or hidden by compaction. Getting this right is the practical difference between a proof-of-concept agent and a production one. It's also how you keep token cost and latency in check as sessions grow.

This page is a practical guide to the knobs Vectara exposes for the job, and what to prioritize when tuning an agent.

What to work on, in order

Seven levers, in rough order of impact. Invest top to bottom.

Retrieval leads because it caps answer quality for any knowledge agent — a perfect prompt can't rescue bad snippets. Monitoring sits last because it's a continuous discipline that shapes the other six, not a one-time configuration task.

  1. Retrieval. For agents grounded in enterprise data, the quality of the snippets reaching the model is an upper bound on the quality of the answer.
  2. System instructions. The single largest lever you control.
  3. Tool configuration. Tool names, descriptions, default arguments, and per-step visibility all live in the prompt.
  4. Delegation with sub-agents. Keep the parent context small by isolating specialist work — or work you want to parallelize — in a sub-agent that only hands back a result.
  5. Long-session management. Compaction, reminders, and skills keep the agent coherent as the session grows.
  6. Tool-output offloading. Stop a runaway tool response from poisoning the rest of the session.
  7. Monitoring and iteration. Agents get better because you watch them, not because you reasoned hard at design time.

1. Retrieval

For a knowledge agent, corpora_search is the primary delivery mechanism for the tokens the model actually needs. Tuning it is the highest-leverage work you can do.

Do not move on until retrieval is good. Every downstream lever — prompting, tools, sub-agents, compaction — compounds on top of the snippets you feed the model. Fix this first.

Start here:

  • Improving search quality — the diagnostic checklist for bad retrieval.
  • Hybrid search — combine semantic and lexical matching so exact terms (SKUs, part numbers, internal codenames) are not lost.
  • Rerankers — reorder results by relevance. The multilingual reranker, MMR, and chain rerankers cover most production needs.
  • Filters — narrow results by metadata. Multi-tenant isolation lives here.
  • Citations — return traceable references alongside results so the agent can cite its sources.
  • Configure query parameters — every knob on a corpora_search tool maps to a field on the Advanced Query API.
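As a sketch of how these knobs come together, a corpora_search tool might combine hybrid search, a reranker, and a tenant filter. The exact field names below are assumptions mapped from the Advanced Query API (lexical_interpolation, reranker, metadata_filter); confirm them against the query parameter reference before relying on them:

```json
{
  "type": "corpora_search",
  "name": "knowledge_search",
  "description": "Search the internal product documentation.",
  "argument_override": {
    "lexical_interpolation": 0.1,
    "reranker": { "type": "mmr", "diversity_bias": 0.3 },
    "metadata_filter": "doc.tenant_id = '${session.metadata.tenant_id}'"
  }
}
```

Pinning the metadata_filter through argument_override is what makes this safe for multi-tenant use: the model can never widen the filter on its own.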

Once retrieval is solid, everything else is refinement on top of it.

2. System instructions

Instructions are the system prompt for an agent. They are where you should budget the most ongoing attention.

  • Write instructions like a spec, not a summary. "Be helpful and friendly" is not an instruction. "If the user asks about billing, confirm their account tier before quoting prices, and never quote a discount unless session.metadata.tier == 'enterprise'" is.
  • Iterate. Most failure modes found in production are a single-line fix in the instructions. Budget time to watch real sessions and patch the prompt as you go.
  • Use Velocity templates for runtime data. Reference session metadata with ${session.metadata.field} instead of hardcoding user-specific values into the prompt.
  • Version your instructions. Inline instructions are fine for one-offs; production agents should use named, versioned reference instructions so you can roll back cleanly when an edit regresses behavior.
  • Keep the system prompt lean. A long prompt costs tokens on every turn, spreads the model's attention thin, and drives up latency. Push specialist content into skills that load on demand, and put organization-specific vocabulary into a glossary instead of inlining it.

See the Instructions documentation for the Velocity template language, the reference/inline distinction, and examples.

3. Tool configuration

A tool's name, description, and input schema are part of the system message on every turn. Configuring tools carefully keeps the prompt tight and prevents the model from making choices you don't want it to make.

  • argument_override on a tool configuration pins argument values the LLM can't change. Use it to enforce access control (a metadata_filter computed from session.metadata.user_role), lock safe defaults, and prevent the model from making expensive mistakes with external APIs.

    "argument_override": { "include_domains": ["docs.vectara.com"] }
  • allowed_tools on an agent step scopes which tools are visible during that step. A classifier step doesn't need web_search cluttering its prompt.

  • Name tools for their purpose, not their type. knowledge_search says more to the LLM than corpora_search, and a good name reduces confusion when the agent has to pick between tools.

See Built-in tools for the full catalog of tool types and their schemas.

4. Delegation with sub-agents

One of the largest context-engineering wins in any non-trivial workflow is to stop trying to do everything in one conversation. Sub-agents let a parent agent delegate a scoped piece of work to another agent, whose entire session — its own instructions, its own tool set, its own turn-by-turn reasoning — stays isolated from the parent's context. The parent only sees the final result.

Reach for a sub-agent whenever:

  • A sub-task can be isolated. A researcher sub-agent that iterates through five web searches, refines its query, and returns a clean summary is far cheaper to the parent than having the parent do all five searches itself and carry the outputs forward.
  • You want specialist instructions. The sub-agent can have a different system prompt and a narrower tool set than the parent.
  • You want to parallelize. The parent can invoke several sub-agents in parallel, each reasoning independently, then combine the results.
  • You want to cap the blast radius of a long side quest. If the sub-agent fails or goes off track, it doesn't leave 30 turns of dead-end reasoning in the parent's history.

Sub-agents are the counterpart to steps: steps change the system prompt but preserve the whole session history; sub-agents give you a fresh session for a side task. Use sub-agents when the parent should only see the result; use steps when the conversation needs to continue coherently.

When to reach for steps

Most agents work fine as a single step. Reach for steps when the agent genuinely moves through distinct phases — for example triage → research → draft → review — and each phase deserves a different system prompt or tool set. Steps preserve the full session history across transitions, which is what distinguishes them from sub-agents.
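A hedged sketch of that phased layout, with all field names assumed rather than taken from the reference (each step would also carry its own instructions):

```json
{
  "steps": [
    { "name": "triage",   "allowed_tools": [] },
    { "name": "research", "allowed_tools": ["knowledge_search", "web_search"] },
    { "name": "draft",    "allowed_tools": [] },
    { "name": "review",   "allowed_tools": ["knowledge_search"] }
  ]
}
```

Note what does not change between steps: the session history. Only the system prompt and the visible tool set swap out at each transition.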

5. Long-session management

Three pressures build up as a session grows: the prompt gets expensive per turn, the model begins to forget earlier content, and the LLM's recency bias means the most recent tokens dominate its attention. Four primitives address these pressures.

  • Compaction lets an agent keep going past its context window by summarizing older turns once usage crosses a threshold. Pair it with the search_session_history tool when the agent needs to recover detail from hidden turns on demand.
  • Reminders turn recency bias in your favor — rules you re-inject on every turn carry more weight than the same rule buried at the top of a long session. Glossaries are a specialized reminder for expanding internal acronyms, codenames, and team names.
  • Skills unlock progressive disclosure: the system prompt stays short and only advertises a skill's name and summary, while the full instruction body loads on demand. This is how you scale the number of specialized behaviors without bloating every turn.
  • Memory lets the agent offload state to disk — plans, notes, synthesized decisions — instead of keeping it in the prompt. In-session memory uses artifacts; cross-session memory uses a Vectara corpus the agent can query with corpora_search.
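A sketch of how these primitives might sit together in an agent configuration. Every field name here is an assumption for illustration, not the documented schema:

```json
{
  "compaction": { "enabled": true, "context_usage_threshold": 0.8 },
  "reminders": [
    "Never quote a discount unless session.metadata.tier == 'enterprise'."
  ],
  "glossary": { "KMS": "Key Management Service" }
}
```

The reminder string is deliberately short: it is re-injected on every turn, so its token cost recurs for the rest of the session.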

6. Tool-output offloading

Some tools can return very large payloads — a multi-megabyte API response, a long log file, the full text of a 500-page document. Tool-output offloading catches outputs above a configurable size and writes them to a session artifact instead of letting them enter the prompt. The agent sees a compact reference (size, line count, shape, artifact id) and pulls out what it actually needs with artifact_read, artifact_grep, or artifact_jq.

Turn this on for any agent whose tools can return big blobs. It pays for itself on the first runaway response.
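A minimal sketch of enabling offloading with a size threshold; the field names are assumptions, so verify them against the tool configuration reference:

```json
{
  "tool_output_offloading": {
    "enabled": true,
    "threshold_bytes": 32768
  }
}
```

Any tool response above the threshold becomes an artifact reference in the prompt, and the agent drills into it with artifact_read, artifact_grep, or artifact_jq only when it needs to.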

7. Monitor, evaluate, iterate

Watch for regressions after prompt edits, retrieval drift as your corpus grows, and rising tool error rates; these failure modes only show up in production. When you find one, close the loop: patch the instructions, the retrieval configuration, the tool set, or the skills accordingly.