
Tune retrieval

This guide is for any caller that constructs Vectara query requests directly — REST API consumers, SDK users, code generated by a coding agent, or anyone tuning queries through the Vectara console's Query / Retrieval tab. The mental model and the JSON parameters are the same in every case. (For agent builders configuring a corpora_search tool, see Tune retrieval for agents.)

The guide is opinionated and ordered by impact. Sections build on each other; if results are weak, work top to bottom — earlier fixes pay back the most.

Endpoints

Vectara has three query endpoints. Use the one that matches the shape of the call you're making.

Endpoint                           Method   When to use
/v2/corpora/{corpus_key}/search    GET      Quick lookups against a single corpus, query in URL.
/v2/corpora/{corpus_key}/query     POST     Single-corpus retrieval with full tuning options.
/v2/query                          POST     Multi-corpus retrieval with full tuning options.

Both POST endpoints accept the same tuning parameters; the multi-corpus form takes a corpora array instead of a path-based corpus_key. Every example below uses the POST form. Authentication is x-api-key or Authorization: Bearer <jwt> on every request.

Quick reference: a strong default query

If you want a working baseline to copy and tune from later, use this:

RECOMMENDED BASELINE QUERY

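A minimal version of that request as a curl call might look like the following (the corpus key, API key variable, and query text are placeholders to adapt):

```bash
curl -X POST "https://api.vectara.io/v2/corpora/my_corpus/query" \
  -H "Content-Type: application/json" \
  -H "x-api-key: $VECTARA_API_KEY" \
  -d '{
    "query": "how do I rotate an API key?",
    "search": {
      "limit": 10,
      "context_configuration": {
        "sentences_before": 2,
        "sentences_after": 2,
        "start_tag": "<em>",
        "end_tag": "</em>"
      },
      "reranker": {
        "type": "chain",
        "rerankers": [
          { "type": "customer_reranker", "reranker_name": "Rerank_Multilingual_v1", "limit": 50, "cutoff": 0.5 },
          { "type": "mmr", "diversity_bias": 0.3, "limit": 10 }
        ]
      }
    }
  }'
```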

That request pulls 50 candidates, drops anything the neural reranker scores below 0.5, deduplicates with MMR, and returns the top 10, each with two sentences of surrounding context and the matching text wrapped in <em> tags. The rest of this guide explains why each piece is there and how to tune it.

1. Ingest is the heart of retrieval quality

Nothing the query side can do will recover information that was lost or mangled during ingest. The four ingest decisions that most affect retrieval are:

  • Parts carry meaning, not just text. A document is broken into parts; each part is what gets embedded, reranked, and returned in search_results. Sentence-level chunks (the file upload default) are fine for most prose; pre-structured documents should use the structured/core indexing APIs.
  • Tables and images need real descriptions. Tables become their own parts when extract_tables is set; the description field is what gets embedded. Images are retrievable through their summary, not pixels — generate the summary with a domain-specific prompt.
  • Declare filter attributes early. Only fields declared as filter_attributes on the corpus can appear in a metadata_filter at query time. Adding them later requires reindexing.
  • Agent-generated metadata is fair game. Use a pipeline to add summaries, tags, or "who/what/when" annotations the source didn't carry.

Full mechanics: Data ingestion.

2. Construct the query body

The minimum required fields are query (the text to search for) and either corpus_key in the URL (single-corpus endpoints) or search.corpora (multi-corpus endpoint).

Single corpus

POST /v2/corpora/{corpus_key}/query:

SINGLE-CORPUS QUERY BODY

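A minimal body, with illustrative values for the optional fields:

```json
{
  "query": "how do I rotate an API key?",
  "search": {
    "limit": 10,
    "offset": 0,
    "metadata_filter": "doc.lang = 'eng'",
    "lexical_interpolation": 0.025,
    "context_configuration": {
      "sentences_before": 2,
      "sentences_after": 2
    }
  }
}
```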

Multi-corpus

POST /v2/query:

MULTI-CORPUS QUERY BODY

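A sketch of the multi-corpus form; the corpus keys and per-corpus settings are illustrative:

```json
{
  "query": "how do I rotate an API key?",
  "search": {
    "corpora": [
      {
        "corpus_key": "product-docs",
        "metadata_filter": "doc.lang = 'eng'",
        "lexical_interpolation": 0.025
      },
      {
        "corpus_key": "support-tickets",
        "lexical_interpolation": 0.1
      }
    ],
    "limit": 10,
    "offset": 0,
    "context_configuration": {
      "sentences_before": 2,
      "sentences_after": 2
    },
    "reranker": {
      "type": "customer_reranker",
      "reranker_name": "Rerank_Multilingual_v1"
    }
  }
}
```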

Each corpus in the corpora array can have its own metadata_filter and lexical_interpolation. The limit, offset, context_configuration, and reranker apply to the merged result set.

Field reference

Field                  Type     Default     What it does
query                  string   (required)  The search text.
search.limit           integer  10          Max results returned after reranking. For tuning, set this on the reranker, not here.
search.offset          integer  0           Number of results to skip. Used for pagination.
metadata_filter        string   none        SQL-like filter on declared filter attributes. See section 3.
lexical_interpolation  float    0.025       Blend keyword scoring into semantic scoring. See section 6.
semantics              string   default     Override query/response interpretation. Rarely needed.

3. Filter on metadata

Filters are the highest-precision lever you have. Use them whenever the user's intent narrows the search space — by date, by tenant, by product area, by document type.

A filter expression is SQL-like and references declared filter attributes only. Document-level attributes are prefixed doc., part-level attributes are prefixed part.:

doc.year = 2024 AND doc.lang = 'eng'
doc.created_at > '2025-01-01'
part.section = 'summary' OR part.section = 'intro'
doc.tags @> 'security'
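
In a request body the filter is just a string inside search (or inside a corpora entry on the multi-corpus endpoint); for example, combining two of the expressions above:

```json
{
  "query": "open security advisories",
  "search": {
    "metadata_filter": "doc.year = 2024 AND doc.lang = 'eng'",
    "limit": 10
  }
}
```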

If the corpus doesn't declare a field as a filter attribute, filtering on it returns an error — declare it on the corpus first (PUT /v2/corpora/{corpus_key}/replace_filter_attributes) and reindex the affected metadata.
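
As a rough sketch of that declaration (the attribute names are examples, and the exact shape of the attribute objects is an assumption to verify against the corpus API reference):

```json
{
  "filter_attributes": [
    { "name": "year", "level": "document", "type": "integer", "indexed": true },
    { "name": "lang", "level": "document", "type": "text", "indexed": true },
    { "name": "section", "level": "part", "type": "text", "indexed": true }
  ]
}
```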

To enumerate the actual values present for a filter attribute (useful when constructing filters dynamically), call:

GET /v2/corpora/{corpus_key}/filter_attribute_stats

which returns the distinct values and counts.

Extracting filters from natural language

intelligent_query_rewriting (Tech Preview) extracts a metadata_filter from the natural-language query at search time. Set it on the request body:

AUTO-EXTRACT METADATA FILTER FROM THE QUERY

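For example (the query text is illustrative):

```json
{
  "query": "security advisories about the billing service from 2024",
  "search": {
    "limit": 10
  },
  "intelligent_query_rewriting": true
}
```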

The response includes rewritten_queries[].filter_extraction showing what was extracted. If the corpus has no matching filter attributes, or extraction fails, the response includes a warnings array. Useful for low-effort filter coverage; for high-precision use cases, write filters explicitly.

See Intelligent query rewriting and Metadata filters overview.

4. Match context_configuration to your chunking

context_configuration controls how much surrounding text is included with each result. It's the single biggest lever on result size and on how readable a result is downstream.

Fields:

  • sentences_before / sentences_after (integer, default 0) — preferred.
  • characters_before / characters_after (integer) — fallback for content without clean sentence terminators.
  • start_tag / end_tag (string) — wrap the matching part inside the expanded text so callers can highlight it.

You can use sentence-based or character-based bounds, not both — if both are set, sentence-based wins.

A practical default is 2 sentences before and 2 after with <em> markers:

SENSIBLE DEFAULT CONTEXT CONFIGURATION

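```json
{
  "search": {
    "context_configuration": {
      "sentences_before": 2,
      "sentences_after": 2,
      "start_tag": "<em>",
      "end_tag": "</em>"
    }
  }
}
```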

Tune from there:

  • Long parts (full sections): drop surrounding context to 0 — duplicating it just inflates response size.
  • Short parts (FAQ entries, list items, captions): bump sentences_before/after to 3–5 so each result is self-contained.
  • Code or structured content: prefer characters_before/after.

5. Build a reranker chain

A single reranker is rarely the right answer. The strongest setups chain three:

  1. Neural reranker with a calibrated cutoff — does the heavy lifting on relevance.
  2. MMR — reduces redundancy.
  3. Optional UDF — applies business signal (recency, popularity).

Neural reranker with a cutoff

Start with the multilingual neural reranker (customer_reranker / Rerank_Multilingual_v1). Its scores are normalized to roughly 0.0–1.0, which is what makes a cutoff meaningful — that's not true of raw hybrid scores.

Start at cutoff: 0.5 — Vectara's recommended default. Tune from there:

  • Below 0.3: noise leaks through.
  • Above 0.7: zero-result responses on real queries become common.

Combine cutoff with a generous limit — let the cutoff drop junk, let the limit cap the absolute count.

If the reranker supports instructions (instruction-following rerankers do), use the instructions field to bias toward the task. e.g. "instructions": "Prefer policy documents over marketing material."

MMR for diversity

Neural rerankers reward relevance, not diversity. Without MMR you routinely get back five near-duplicates of the highest-scoring paragraph. Apply MMR after the neural reranker with a modest diversity_bias (start at 0.3) and a limit matching what you actually want to return (typically 5–15).

UDF for time and business signal

A userfn reranker applies arbitrary scoring on top of metadata. Two cases come up constantly:

  • Recency: down-weight stale documents using doc.created_at.
  • Authority: boost documents with high part.upvotes, doc.is_official, or a custom dimension.

Run UDF last in the chain so the neural reranker isn't fighting business rules.

Chain example

NEURAL + MMR + RECENCY UDF RERANKER CHAIN

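A sketch of the full chain. The limits and cutoff follow the recommendations above; the user_function expression is illustrative only, assumes doc.created_at is stored as a unix timestamp, and should be checked against the UDF expression-language reference before use:

```json
{
  "search": {
    "limit": 10,
    "reranker": {
      "type": "chain",
      "rerankers": [
        {
          "type": "customer_reranker",
          "reranker_name": "Rerank_Multilingual_v1",
          "limit": 50,
          "cutoff": 0.5
        },
        {
          "type": "mmr",
          "diversity_bias": 0.3,
          "limit": 10
        },
        {
          "type": "userfn",
          "user_function": "if (get('$.document_metadata.created_at') < to_unix_timestamp(now() - hours(8760))) get('$.score') * 0.5 else get('$.score')"
        }
      ]
    }
  }
}
```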

The UDF reads as: if the document is more than 8760 hours (≈1 year) old, halve its score; otherwise leave it alone. Full UDF expression language: User Defined Function reranker.

See also: Reranking overview, Limits and cutoffs, Chain reranker.

6. Lexical signal (lexical_interpolation)

lexical_interpolation (often called lambda) blends keyword scoring into the embedding score. 0.0 is pure semantic; 1.0 is pure lexical. The default is 0.025 — a light keyword sprinkle.

Tune by query type, not by intuition:

Query style                                    Suggested λ
Conceptual questions ("how does X work")       0.0–0.025
Mixed natural-language with key terms          0.025–0.1
Identifier or codename heavy ("error E_42")    0.1–0.3
Exact-string lookup                            0.3–0.6
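
For an identifier-heavy query, the only change is the lexical_interpolation value (the query and value here are illustrative):

```json
{
  "query": "what causes error E_42 during checkout?",
  "search": {
    "lexical_interpolation": 0.2,
    "limit": 10
  }
}
```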

If you find yourself wanting λ > 0.5, the underlying problem is usually that the term you care about isn't being parsed as one token by the embedder — fix it at ingest (e.g. ensure E_42 survives tokenization) before reaching for more lexical weight.

7. Handle results

The successful response from a POST query looks like:

RESPONSE SHAPE

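An abbreviated sketch with illustrative values (fields not discussed below are omitted):

```json
{
  "search_results": [
    {
      "text": "... <em>matching part text</em> plus surrounding context ...",
      "score": 0.83,
      "document_id": "kb-00123",
      "document_metadata": { "created_at": "2025-03-14", "lang": "eng" },
      "part_metadata": { "section": "summary" },
      "request_corpora_index": 0,
      "table": null,
      "image": null
    }
  ]
}
```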

Key fields:

  • search_results — array of reranked results. The order is the final order; don't re-sort by score (each reranker stage rescores, so a final score is comparable only within the response).
  • document_id — opaque, stable. Use it as a key for caching or to fetch the full document.
  • document_metadata / part_metadata — whatever you indexed, unfiltered. Anything you need at display time should live here.
  • request_corpora_index / corpus_key — for multi-corpus responses, identifies which corpus each result came from.
  • table / image — populated when the part is a table or image part; otherwise null.

Pagination

Use search.offset to walk through results page by page. Pagination applies to the post-rerank result set:

{ "search": { "limit": 10, "offset": 20 } }

Returns results 21–30. Note that paging deeply with a heavy reranker chain is expensive — every call reranks the full candidate set again. For paginated UX, cache the result list client-side after the first call when possible.

Big results

Tables aren't split — a single table part can be many KB on its own. Long contexts × high limit × big results compound quickly. If your client is forwarding results to an LLM downstream, watch for context bloat: cap limit to what the LLM can actually use, drop unused metadata fields, and consider a tighter context_configuration.

If you're feeding results into an agent runtime (Claude, Cursor, your own agent), look at tool-output offloading — the agent runtime can write large search responses to an artifact and let the agent navigate them with artifact_jq instead of loading them into the prompt.

8. Generation, streaming, and citations

If you want a generated answer alongside the results, add a generation block:

ADDING GENERATION TO A QUERY

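A sketch; the preset name and other values are illustrative and should match what is enabled for your account:

```json
{
  "query": "how do I rotate an API key?",
  "search": {
    "limit": 10
  },
  "generation": {
    "generation_preset_name": "vectara-summary-ext-24-05-med-omni",
    "max_used_search_results": 5,
    "response_language": "eng",
    "citations": { "style": "numeric" },
    "enable_factual_consistency_score": true
  },
  "stream_response": false
}
```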

Set stream_response: true at the top level for Server-Sent Events streaming.

Generation is a separate concern from retrieval — once retrieval is good, the generation knobs control how the answer is written, not which documents inform it. Details: Vectara Prompt Engine, Citations.

9. How to iterate

A tuning loop that works in practice:

  1. Build a small eval set. 20–50 real queries with the answer document(s) labeled. Skipping this is the most common reason teams ship bad retrieval.
  2. Run with reranking off ("reranker": { "type": "none" }) and look at the pre-rerank results. If the right document isn't in the top 50, no reranker will save you — go back to ingest, filters, or lexical_interpolation.
  3. Turn the neural reranker on and check that the right document is now in the top 5. If not, tune instructions or your context_configuration (the reranker scores text including surrounding context when include_context: true).
  4. Add MMR and verify you're not just shipping duplicates.
  5. Add the UDF and check that recency/authority hasn't pushed a genuinely best answer off the list.
  6. Inspect query history. GET /v2/queries/{query_id} returns spans[] — each span shows pre-rerank, post-rerank, and rewritten-query data. Set save_history: true on the request to capture it.

Common pitfalls

  • Filtering on a non-declared field returns an error, not zero results. If a metadata_filter you expect to work fails, check the corpus's filter_attributes.
  • Cutoffs only behave predictably with neural rerankers. With raw hybrid scores or BM25 outputs, scores are unbounded and a static cutoff is unreliable.
  • High lexical_interpolation is usually the wrong fix for missing exact-term hits. Fix tokenization at ingest first.
  • Rerankers run in chain order, top-down. A userfn that returns null removes the result from the set before later stages see it.
  • search.limit caps the post-rerank count. Set the retrieval cap on the neural reranker's limit (or pre-rerank with none and a high search.limit) — otherwise you may rerank fewer candidates than you think.
  • Multi-corpus responses interleave by score. Don't assume the first N results are all from the first corpus. Use request_corpora_index or corpus_key if origin matters.