Tool-output offloading
Most tool outputs are small. Some are not. A database query that
returns 500 rows, an API call that returns a 3 MB JSON document, a
document_conversion on a 400-page PDF — any of these can drop a
payload into the prompt that dwarfs the rest of the conversation and
blows through the model's context window in a single turn.
Tool-output offloading is the fix: when a tool's output exceeds a
configurable size, the platform writes the full output to a session
artifact and hands the agent a compact reference instead. The agent
sees a summary envelope with the artifact id, size, line count, and
shape; it can then pull out what it actually needs with
artifact_read, artifact_grep, or — for JSON outputs —
artifact_jq.
Offloading is on by default for any agent that has one of
artifact_read, artifact_grep, or artifact_jq configured in its
tool_configurations — the platform can only offload safely if the
agent has a way to read artifacts back. Most agents in practice just
get offloading for free as part of enabling those tools. If your agent
has none of them, the platform leaves offloading off so the agent
isn't left with references it can't follow.
When to reach for this
Turn offloading on when you see the agent dragging large payloads through the prompt:
- An agent running corpora_search with a high num_results and stuffing every hit into context.
- document_conversion on large PDFs producing tens of thousands of tokens per call.
- SQL tools returning hundreds of rows when the agent only needs a handful of fields.
How it works
With offloading active, every tool's output is measured before it enters the conversation. If the output is under the configured threshold, it passes through unchanged. If it's over, the platform:
- Writes the full output to a session artifact (same storage as user-uploaded files).
- Computes a compact reference containing the artifact id, byte size, line count, a compact shape descriptor of the content, and a per-tool map of access hints for whichever artifact tools the agent has configured.
- Sends that reference to the LLM in place of the original output.
The agent reads the reference, decides whether it needs detail, and
calls artifact_read for line ranges, artifact_grep for regex
search, or artifact_jq for structured extraction from JSON outputs.
The original output is never in the prompt — it only comes back
through an explicit tool call.
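The decision path above can be sketched in a few lines of Python. This is an illustration of the mechanism, not the platform's implementation: the helper name, the dict standing in for artifact storage, and the abbreviated reference (no shape or how_to_access map) are all simplifications.

```python
import json
import uuid


def maybe_offload(output: str, threshold_bytes: int, store: dict) -> str:
    """Pass small outputs through unchanged; offload large ones to the
    store and return a compact JSON reference in their place."""
    data = output.encode("utf-8")
    if len(data) <= threshold_bytes:
        return output  # under threshold: enters the conversation inline

    # Over threshold: persist the full output, hand back a reference.
    artifact_id = f"art_tool_output_{uuid.uuid4().hex[:4]}"
    store[artifact_id] = output  # stands in for session-artifact storage
    return json.dumps({
        "artifact_id": artifact_id,
        "size_bytes": len(data),
        "line_count": output.count("\n") + 1,
    })
```

The real reference also carries the shape descriptor and access hints; the point here is only that the original payload never crosses into the prompt once it exceeds the threshold.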
Get the defaults by configuring an artifact tool
Configuring any of artifact_read, artifact_grep, or artifact_jq
is enough — offloading turns itself on, uses default thresholds, and
the agent has a tool it can use to read offloaded outputs. Configure
all three when you can; each handles a different access pattern.
Minimum setup — offloading enables itself.
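A minimal agent configuration might look like the sketch below. Only tool_configurations and the artifact tool names come from this page; the surrounding structure and the enabled field are assumptions, so check the API reference for the exact schema.

```json
{
  "tool_configurations": {
    "artifact_read": { "enabled": true },
    "artifact_grep": { "enabled": true },
    "artifact_jq": { "enabled": true }
  }
}
```

No tool_output_offloading block is needed here: with any of these tools present, offloading activates with default thresholds.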
The compact reference the agent receives includes a how_to_access
map with entries for whichever of artifact_read, artifact_grep,
and artifact_jq are configured, so the LLM knows which tool call to
use.
Tune the thresholds
For most workloads the defaults are fine. When you do want to tune,
the agent's tool_output_offloading exposes four knobs. The defaults
are conservative — see the API reference for exact values.
- enabled — explicit on/off. Only necessary to force offloading off on an agent that has one of the artifact tools configured.
- context_percentage — the threshold as a fraction of the model's context window. Lower it if your agent makes many tool calls per turn and you want aggressive offloading; raise it if you're seeing the agent get confused by references to artifacts for outputs it could have handled inline.
- min_threshold_bytes — a floor for tiny outputs. Anything under this is never offloaded, because reading a small artifact is slower than just letting it pass through inline.
- max_threshold_bytes — a ceiling for large outputs. Caps the absolute size of anything allowed into the prompt so a single tool response can't dominate the conversation, even on a 200k-token model where a percentage-based threshold would be enormous.
The effective threshold scales with the model's context window, clamped between the floor and ceiling. In practice most large-context models hit the ceiling as the binding constraint.
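The clamping rule can be written out as a short sketch. The default values and the bytes-per-token ratio below are illustrative placeholders, not the platform's actual defaults:

```python
def effective_threshold(context_window_tokens: int,
                        context_percentage: float = 0.25,
                        min_threshold_bytes: int = 10_000,
                        max_threshold_bytes: int = 400_000,
                        bytes_per_token: int = 4) -> int:
    """Threshold scales with the context window, clamped to the
    [min_threshold_bytes, max_threshold_bytes] band."""
    raw = int(context_window_tokens * bytes_per_token * context_percentage)
    return max(min_threshold_bytes, min(raw, max_threshold_bytes))
```

With these placeholder numbers, a small-context model lands on the floor, a mid-size model falls in the percentage-scaled band, and a very large-context model hits the ceiling — matching the note above that the ceiling is usually the binding constraint for large-context models.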
What the agent sees
Instead of the raw output, the LLM receives a small JSON envelope like:
{
  "artifact_id": "art_tool_output_9k2x",
  "size_bytes": 2457600,
  "line_count": 14020,
  "shape": { "rows": "array(500) of object(6 keys)", "next_cursor": "string" },
  "how_to_access": {
    "artifact_read": "Read full content or a line range (start_line/end_line)",
    "artifact_grep": "Search for patterns with grep",
    "artifact_jq": "Query with jq expressions, e.g. '.results[0]', '.[] | select(.score > 0.5)'"
  }
}
The shape is a compact descriptor for the LLM to eyeball — not a
JSON-schema document. For JSON outputs the agent typically follows up
with artifact_jq; for text it uses artifact_grep or
artifact_read. The artifact tools themselves are not offloadable —
their outputs always return inline, so the agent doesn't end up
chasing references to references.
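A typical follow-up on an envelope like the one above would be a single artifact_jq call. The argument names in this sketch are assumptions — only the tool name, the artifact id format, and the jq expression style come from this page:

```json
{
  "tool": "artifact_jq",
  "arguments": {
    "artifact_id": "art_tool_output_9k2x",
    "expression": ".rows[] | select(.score > 0.5) | {id, score}"
  }
}
```

The agent pulls back just the matching rows and fields instead of the full 2.4 MB payload.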
Limits and trade-offs
- Artifact reads cost tool turns. The agent pays an extra LLM turn every time it decides to read an offloaded artifact. For outputs that are borderline, the agent is sometimes better off with the output inline — which is why small outputs pass through by default.
- Tools without structure are harder for the agent to navigate. A 2 MB JSON document with a clean shape is easy for artifact_jq to query. A 2 MB free-form text blob is less so. Consider whether your custom tools can return structured output before relying on offloading for them.
- Artifacts live for the session. Offloaded outputs are stored as normal session artifacts and clean up when the session ends. If you need the output to outlive the session, persist it somewhere yourself.
- Threshold is byte-based, not token-based. The defaults are conservative because token-to-byte ratios vary by language and content type. Err on the side of offloading more.
- Some content needs to stay inline. Short, structured responses that the agent will reference in most turns (e.g., "the user's current cart") aren't a good fit for offloading — you'll pay a read on every turn. Keep those tools below the threshold.
Related
- Context engineering overview
- Artifacts — the storage offloaded outputs are written to.
- Built-in tools — artifact_read, artifact_grep, artifact_jq, and the rest.