Version: 2.0

Tool-output offloading

Most tool outputs are small. Some are not. A database query that returns 500 rows, an API call that returns a 3 MB JSON document, a document_conversion on a 400-page PDF — any of these can drop a payload into the prompt that dwarfs the rest of the conversation and blows through the model's context window in a single turn.

Tool-output offloading is the fix: when a tool's output exceeds a configurable size, the platform writes the full output to a session artifact and hands the agent a compact reference instead. The agent sees a summary envelope with the artifact id, size, line count, and shape; it can then pull out what it actually needs with artifact_read, artifact_grep, or — for JSON outputs — artifact_jq.

Offloading is on by default for any agent that has one of artifact_read, artifact_grep, or artifact_jq configured in its tool_configurations — the platform can only offload safely if the agent has a way to read artifacts back. Most agents in practice just get offloading for free as part of enabling those tools. If your agent has none of them, the platform leaves offloading off so the agent isn't left with references it can't follow.

When to reach for this

Turn offloading on when you see the agent dragging large payloads through the prompt:

  • An agent running corpora_search with high num_results and stuffing every hit into context.
  • document_conversion on large PDFs producing tens of thousands of tokens per call.
  • SQL tools returning hundreds of rows when the agent only needs a handful of fields.

How it works

With offloading active, every tool's output is measured before it enters the conversation. If the output is under the configured threshold, it passes through unchanged. If it's over, the platform:

  1. Writes the full output to a session artifact (same storage as user-uploaded files).
  2. Computes a compact reference containing the artifact id, byte size, line count, a compact shape descriptor of the content, and a per-tool map of access hints for whichever artifact tools the agent has configured.
  3. Sends that reference to the LLM in place of the original output.

The agent reads the reference, decides whether it needs detail, and calls artifact_read for line ranges, artifact_grep for regex search, or artifact_jq for structured extraction from JSON outputs. The original output is never in the prompt — it only comes back through an explicit tool call.
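The pass-through-or-offload step can be sketched as follows. This is a minimal illustration, not the platform's implementation: `process_tool_output`, the in-memory `artifact_store`, and the `art_tool_output_` id scheme are assumptions made for the example.

```python
import json


def process_tool_output(output: str, threshold_bytes: int, artifact_store: dict) -> str:
    """Sketch: outputs under the threshold pass through unchanged;
    larger ones are stored whole and replaced by a compact reference."""
    size = len(output.encode("utf-8"))
    if size < threshold_bytes:
        return output  # under threshold: inline, untouched

    # Over threshold: write the full output to the artifact store...
    artifact_id = f"art_tool_output_{len(artifact_store):04x}"
    artifact_store[artifact_id] = output

    # ...and hand the agent a reference envelope instead.
    return json.dumps({
        "artifact_id": artifact_id,
        "size_bytes": size,
        "line_count": output.count("\n") + 1,
        "how_to_access": {"artifact_read": "Read full content or a line range"},
    })


store: dict = {}
inline = process_tool_output("42 rows", threshold_bytes=1024, artifact_store=store)
ref = process_tool_output("x" * 4096, threshold_bytes=1024, artifact_store=store)
```

Note that the full output lands only in the store; the string returned in its place stays small no matter how large the original was.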

Get the defaults by configuring an artifact tool

Configuring any of artifact_read, artifact_grep, or artifact_jq is enough — offloading turns itself on, uses default thresholds, and the agent has a tool it can use to read offloaded outputs. Configure all three when you can; each handles a different access pattern.

Minimum setup: offloading enables itself

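A minimal configuration sketch: listing the artifact tools under `tool_configurations` (the key named in the text above) is the whole setup, with no explicit offloading block required. The empty per-tool objects are illustrative; consult the API reference for each tool's actual options.

```json
{
  "tool_configurations": {
    "artifact_read": {},
    "artifact_grep": {},
    "artifact_jq": {}
  }
}
```

With any one of these present, offloading activates with default thresholds; with all three, the agent can pick whichever access pattern fits the output.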

The compact reference the agent receives includes a how_to_access map with entries for whichever of artifact_read, artifact_grep, and artifact_jq are configured, so the LLM knows which tool call to use.

Tune the thresholds

For most workloads the defaults are fine. When you do want to tune, the agent's tool_output_offloading configuration exposes four knobs. The defaults are conservative — see the API reference for exact values.

  • enabled — explicit on/off. Only necessary to force offloading off on an agent that has one of the artifact tools configured.
  • context_percentage — the threshold as a fraction of the model's context window. Lower it if your agent makes many tool calls per turn and you want aggressive offloading; raise it if you're seeing the agent get confused by references to artifacts for outputs it could have handled inline.
  • min_threshold_bytes — a floor for tiny outputs. Anything under this is never offloaded, because reading a small artifact is slower than just letting it pass through inline.
  • max_threshold_bytes — a ceiling for large outputs. Caps the absolute size of anything allowed into the prompt so a single tool response can't dominate the conversation, even on a 200k-token model where a percentage-based threshold would be enormous.

The effective threshold scales with the model's context window, clamped between the floor and ceiling. In practice most large-context models hit the ceiling as the binding constraint.
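That clamping rule can be written out directly. The numbers below are placeholders, and the bytes-per-token heuristic is an assumption for the sketch, not the platform's documented conversion:

```python
def effective_threshold_bytes(
    context_window_tokens: int,
    context_percentage: float,
    min_threshold_bytes: int,
    max_threshold_bytes: int,
    bytes_per_token: float = 4.0,  # rough heuristic; real ratios vary by content
) -> int:
    # Fraction of the context window, converted to an approximate byte budget...
    raw = int(context_window_tokens * bytes_per_token * context_percentage)
    # ...then clamped between the floor and the ceiling.
    return max(min_threshold_bytes, min(raw, max_threshold_bytes))


# Large-context model: the ceiling binds (800,000 raw bytes capped at 50,000).
big = effective_threshold_bytes(1_000_000, 0.2, 10_000, 50_000)

# Small model: the floor binds (6,400 raw bytes raised to 10,000).
small = effective_threshold_bytes(8_000, 0.2, 10_000, 50_000)
```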

What the agent sees

Instead of the raw output, the LLM receives a small JSON envelope like:

{
  "artifact_id": "art_tool_output_9k2x",
  "size_bytes": 2457600,
  "line_count": 14020,
  "shape": { "rows": "array(500) of object(6 keys)", "next_cursor": "string" },
  "how_to_access": {
    "artifact_read": "Read full content or a line range (start_line/end_line)",
    "artifact_grep": "Search for patterns with grep",
    "artifact_jq": "Query with jq expressions, e.g. '.results[0]', '.[] | select(.score > 0.5)'"
  }
}

The shape is a compact descriptor for the LLM to eyeball — not a JSON-schema document. For JSON outputs the agent typically follows up with artifact_jq; for text it uses artifact_grep or artifact_read. The artifact tools themselves are not offloadable — their outputs always return inline, so the agent doesn't end up chasing references to references.
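For instance, a follow-up query against the envelope above might look like the call below. The argument names (artifact_id, expression) are illustrative, not the tool's confirmed schema:

```json
{
  "tool": "artifact_jq",
  "arguments": {
    "artifact_id": "art_tool_output_9k2x",
    "expression": ".rows[0]"
  }
}
```

Pulling a single row first lets the agent see the object's six keys before writing a narrower extraction query.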

Limits and trade-offs

  • Artifact reads cost tool turns. The agent pays an extra LLM turn every time it decides to read an offloaded artifact. For outputs that are borderline, the agent is sometimes better off with the output inline — which is why small outputs pass through by default.
  • Tools without structure are harder for the agent to navigate. A 2 MB JSON document with a clean shape is easy for artifact_jq to query. A 2 MB free-form text blob is less so. Consider whether your custom tools can return structured output before relying on offloading for them.
  • Artifacts live for the session. Offloaded outputs are stored as normal session artifacts and clean up when the session ends. If you need the output to outlive the session, persist it somewhere yourself.
  • Threshold is byte-based, not token-based. The defaults are conservative because token-to-byte ratios vary by language and content type. Err on the side of offloading more.
  • Some content needs to stay inline. Short, structured responses that the agent will reference in most turns (e.g., "the user's current cart") aren't a good fit for offloading — you'll pay a read on every turn. Keep those tools below the threshold.