Tune retrieval
This guide is for any caller that constructs Vectara query requests
directly — REST API consumers, SDK users, code generated by a coding
agent, or anyone tuning queries through the Vectara console's Query /
Retrieval tab. The mental model and the JSON parameters are the same
in every case. (For agent builders configuring a corpora_search tool,
see Tune retrieval for agents.)
The guide is opinionated and ordered by impact. Sections build on each other; if results are weak, work top to bottom — earlier fixes pay back the most.
Endpoints
Vectara has three query endpoints. Use the one that matches the shape of the call you're making.
| Endpoint | Method | When to use |
|---|---|---|
| /v2/corpora/{corpus_key}/search | GET | Quick lookups against a single corpus, query in URL. |
| /v2/corpora/{corpus_key}/query | POST | Single-corpus retrieval with full tuning options. |
| /v2/query | POST | Multi-corpus retrieval with full tuning options. |
Both POST endpoints accept the same tuning parameters; the multi-corpus
form takes a corpora array instead of a path-based corpus_key.
Every example below uses the POST form. Authentication is x-api-key
or Authorization: Bearer <jwt> on every request.
Quick reference: a strong default query
If you want a working baseline to copy and tune from later, use this:
RECOMMENDED BASELINE QUERY
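A sketch of that baseline as a curl call. The corpus key my-corpus, the query text, and the VECTARA_API_KEY environment variable are placeholders to adapt; the body fields follow the parameters described in the rest of this guide:

```shell
# Baseline: 50 neural-reranked candidates, cutoff 0.5, MMR dedup, top 10 back.
curl -X POST "https://api.vectara.io/v2/corpora/my-corpus/query" \
  -H "Content-Type: application/json" \
  -H "x-api-key: $VECTARA_API_KEY" \
  -d '{
    "query": "how do I rotate an API key?",
    "search": {
      "limit": 10,
      "context_configuration": {
        "sentences_before": 2,
        "sentences_after": 2,
        "start_tag": "<em>",
        "end_tag": "</em>"
      },
      "reranker": {
        "type": "chain",
        "rerankers": [
          {
            "type": "customer_reranker",
            "reranker_name": "Rerank_Multilingual_v1",
            "limit": 50,
            "cutoff": 0.5
          },
          { "type": "mmr", "diversity_bias": 0.3, "limit": 10 }
        ]
      }
    }
  }'
```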
That request: pulls 50 candidates, drops anything the neural reranker
scores under 0.5, deduplicates with MMR, returns the top 10 with two
sentences of surrounding context highlighted with <em> tags. The
rest of this guide explains why each piece is there and how to tune it.
1. Ingest is the heart of retrieval quality
Nothing the query side can do will recover information that was lost or mangled during ingest. The four ingest decisions that most affect retrieval are:
- Parts carry meaning, not just text. A document is broken into parts; each part is what gets embedded, reranked, and returned in search_results. Sentence-level chunks (the file upload default) are fine for most prose; pre-structured documents should use the structured/core indexing APIs.
- Tables and images need real descriptions. Tables become their own parts when extract_tables is set; the description field is what gets embedded. Images are retrievable through their summary, not pixels — generate the summary with a domain-specific prompt.
- Declare filter attributes early. Only fields declared as filter_attributes on the corpus can appear in a metadata_filter at query time. Adding them later requires reindexing.
- Agent-generated metadata is fair game. Use a pipeline to add summaries, tags, or "who/what/when" annotations the source didn't carry.
Full mechanics: Data ingestion.
2. Construct the query body
The minimum required fields are query (the text to search for) and
either corpus_key in the URL (single-corpus endpoints) or
search.corpora (multi-corpus endpoint).
Single corpus
POST /v2/corpora/{corpus_key}/query:
SINGLE-CORPUS QUERY BODY
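A minimal single-corpus body, assuming hypothetical query text and a doc.lang filter attribute; only query is required, the rest is shown for orientation:

```json
{
  "query": "how do I rotate an API key?",
  "search": {
    "limit": 10,
    "metadata_filter": "doc.lang = 'eng'",
    "lexical_interpolation": 0.025
  }
}
```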
Multi-corpus
POST /v2/query:
MULTI-CORPUS QUERY BODY
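A sketch of the multi-corpus form. The corpus keys docs and kb and both filters are placeholders; note that per-corpus settings sit inside the corpora entries while shared settings sit alongside them:

```json
{
  "query": "how do I rotate an API key?",
  "search": {
    "corpora": [
      {
        "corpus_key": "docs",
        "metadata_filter": "doc.year = 2024",
        "lexical_interpolation": 0.025
      },
      {
        "corpus_key": "kb",
        "lexical_interpolation": 0.1
      }
    ],
    "limit": 10,
    "context_configuration": {
      "sentences_before": 2,
      "sentences_after": 2
    }
  }
}
```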
Each corpus in the corpora array can have its own metadata_filter
and lexical_interpolation. The limit, offset,
context_configuration, and reranker apply to the merged result
set.
Field reference
| Field | Type | Default | What it does |
|---|---|---|---|
query | string | — | Required. The search text. |
search.limit | integer | 10 | Max results returned after reranking. For tuning, set this on the reranker, not here. |
search.offset | integer | 0 | Number of results to skip. Used for pagination. |
metadata_filter | string | none | SQL-like filter on declared filter attributes. See section 3. |
lexical_interpolation | float | 0.025 | Blend keyword scoring into semantic scoring. See section 6. |
semantics | string | default | Override query/response interpretation. Rarely needed. |
3. Filter on metadata
Filters are the highest-precision lever you have. Use them whenever the user's intent narrows the search space — by date, by tenant, by product area, by document type.
A filter expression is SQL-like and references declared filter
attributes only. Document-level attributes are prefixed doc.,
part-level attributes are prefixed part.:
doc.year = 2024 AND doc.lang = 'eng'
doc.created_at > '2025-01-01'
part.section = 'summary' OR part.section = 'intro'
doc.tags @> 'security'
If the corpus doesn't declare a field as a filter attribute, filtering
on it returns an error — declare it on the corpus first
(PUT /v2/corpora/{corpus_key}/replace_filter_attributes) and reindex
the affected metadata.
To enumerate the actual values present for a filter attribute (useful when constructing filters dynamically):
GET /v2/corpora/{corpus_key}/filter_attribute_stats
returns the distinct values and counts.
Extracting filters from natural language
intelligent_query_rewriting (Tech Preview) extracts a
metadata_filter from the natural-language query at search time. Set
it on the request body:
AUTO-EXTRACT METADATA FILTER FROM THE QUERY
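A sketch of enabling the feature on a request; the query text is a placeholder chosen so there is something for extraction to find (a year and a language):

```json
{
  "query": "security advisories from 2024 in English",
  "search": {
    "limit": 10
  },
  "intelligent_query_rewriting": true
}
```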
The response includes rewritten_queries[].filter_extraction showing
what was extracted. If the corpus has no matching filter attributes,
or extraction fails, the response includes a warnings array. Useful
for low-effort filter coverage; for high-precision use cases, write
filters explicitly.
See Intelligent query rewriting and Metadata filters overview.
4. Match context_configuration to your chunking
context_configuration controls how much surrounding text is
included with each result. It's the single biggest lever on result
size and on how readable a result is downstream.
Fields:
- sentences_before / sentences_after (integer, default 0) — preferred.
- characters_before / characters_after (integer) — fallback for content without clean sentence terminators.
- start_tag / end_tag (string) — wrap the matching part inside the expanded text so callers can highlight it.
You can use sentence-based or character-based bounds, not both — if both are set, sentence-based wins.
A practical default is 2 sentences before and 2 after with <em>
markers:
SENSIBLE DEFAULT CONTEXT CONFIGURATION
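As a search-block fragment, that default looks like:

```json
{
  "search": {
    "context_configuration": {
      "sentences_before": 2,
      "sentences_after": 2,
      "start_tag": "<em>",
      "end_tag": "</em>"
    }
  }
}
```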
Tune from there:
- Long parts (full sections): drop surrounding context to 0 — duplicating it just inflates response size.
- Short parts (FAQ entries, list items, captions): bump sentences_before/after to 3–5 so each result is self-contained.
- Code or structured content: prefer characters_before/after.
5. Build a reranker chain
A single reranker is rarely the right answer. The strongest setups chain three:
- Neural reranker with a calibrated cutoff — does the heavy lifting on relevance.
- MMR — reduces redundancy.
- Optional UDF — applies business signal (recency, popularity).
Neural reranker with a cutoff
Start with the multilingual neural reranker (customer_reranker /
Rerank_Multilingual_v1). Its scores are normalized to roughly
0.0–1.0, which is what makes a cutoff meaningful — that's not true
of raw hybrid scores.
Start at cutoff: 0.5 — Vectara's recommended default. Tune from
there:
- Below 0.3: noise leaks through.
- Above 0.7: zero-result responses on real queries become common.
Combine cutoff with a generous limit — let the cutoff drop junk,
let the limit cap the absolute count.
If the reranker supports instructions (instruction-following rerankers
do), use the instructions field to bias toward the task, e.g.
"instructions": "Prefer policy documents over marketing material."
MMR for diversity
Neural rerankers reward relevance, not diversity. Without MMR you
routinely get back five near-duplicates of the highest-scoring
paragraph. Apply MMR after the neural reranker with a modest
diversity_bias (start at 0.3) and a limit matching what you
actually want to return (typically 5–15).
UDF for time and business signal
A userfn reranker applies arbitrary scoring on top of metadata. Two
cases come up constantly:
- Recency: down-weight stale documents using doc.created_at.
- Authority: boost documents with high part.upvotes, doc.is_official, or a custom dimension.
Run UDF last in the chain so the neural reranker isn't fighting business rules.
Chain example
NEURAL + MMR + RECENCY UDF RERANKER CHAIN
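A sketch of the three-stage chain. The neural and MMR stages use the parameters recommended above; the user_function value is pseudocode standing in for the recency rule, not the real UDF expression syntax — take the exact expression language from the User Defined Function reranker reference:

```json
{
  "search": {
    "reranker": {
      "type": "chain",
      "rerankers": [
        {
          "type": "customer_reranker",
          "reranker_name": "Rerank_Multilingual_v1",
          "limit": 50,
          "cutoff": 0.5
        },
        {
          "type": "mmr",
          "diversity_bias": 0.3,
          "limit": 10
        },
        {
          "type": "userfn",
          "user_function": "PSEUDOCODE: if doc.created_at older than 8760 hours then score * 0.5 else score"
        }
      ]
    }
  }
}
```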
The UDF reads as: if the document is more than 8760 hours (≈1 year) old, halve its score; otherwise leave it alone. Full UDF expression language: User Defined Function reranker.
See also: Reranking overview, Limits and cutoffs, Chain reranker.
6. Lexical signal (lexical_interpolation)
lexical_interpolation (often called lambda) blends keyword scoring
into the embedding score. 0.0 is pure semantic; 1.0 is pure
lexical. The default is 0.025 — a light keyword sprinkle.
Tune by query type, not by intuition:
| Query style | Suggested λ |
|---|---|
| Conceptual questions ("how does X work") | 0.0–0.025 |
| Mixed natural-language with key terms | 0.025–0.1 |
| Identifier or codename heavy ("error E_42") | 0.1–0.3 |
| Exact-string lookup | 0.3–0.6 |
If you find yourself wanting λ > 0.5, the underlying problem is
usually that the term you care about isn't being parsed as one token
by the embedder — fix it at ingest (e.g. ensure E_42 survives
tokenization) before reaching for more lexical weight.
7. Handle results
The successful response from a POST query looks like:
RESPONSE SHAPE
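A trimmed sketch of the shape, using the fields described below; all values are illustrative and real responses carry additional fields:

```json
{
  "search_results": [
    {
      "text": "... two sentences before. The <em>matching part</em>. Two sentences after ...",
      "score": 0.81,
      "document_id": "doc-123",
      "document_metadata": { "year": 2024 },
      "part_metadata": { "section": "intro" },
      "request_corpora_index": 0,
      "table": null,
      "image": null
    }
  ]
}
```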
Key fields:
- search_results — array of reranked results. The order is the final order; don't re-sort by score (each reranker stage rescores, so a final score is comparable only within the response).
- document_id — opaque, stable. Use it as a key for caching or to fetch the full document.
- document_metadata / part_metadata — whatever you indexed, unfiltered. Anything you need at display time should live here.
- request_corpora_index / corpus_key — for multi-corpus responses, identifies which corpus each result came from.
- table / image — populated when the part is a table or image part; otherwise null.
Pagination
Use search.offset to walk through results page by page. Pagination
applies to the post-rerank result set:
{ "search": { "limit": 10, "offset": 20 } }
Returns results 21–30. Note that paging deeply with a heavy reranker chain is expensive — every call reranks the full candidate set again. For paginated UX, cache the result list client-side after the first call when possible.
Big results
Tables aren't split — a single table part can be many KB on its own.
Long contexts × high limit × big results compound quickly. If
your client is forwarding results to an LLM downstream, watch for
context bloat: cap limit to what the LLM can actually use, drop
unused metadata fields, and consider a tighter context_configuration.
If you're feeding results into an agent runtime (Claude, Cursor, your
own agent), look at
tool-output offloading — the agent
runtime can write large search responses to an artifact and let the
agent navigate them with artifact_jq instead of loading them into
the prompt.
8. Generation, streaming, and citations
If you want a generated answer alongside the results, add a
generation block:
ADDING GENERATION TO A QUERY
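A sketch of a query with generation attached; the query text and the preset name are placeholders (consult the Prompt Engine documentation for the presets available on your account):

```json
{
  "query": "how do I rotate an API key?",
  "search": {
    "limit": 10
  },
  "generation": {
    "generation_preset_name": "example-preset",
    "max_used_search_results": 5,
    "citations": { "style": "numeric" }
  },
  "stream_response": false
}
```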
Set stream_response: true at the top level for Server-Sent Events
streaming.
Generation is a separate concern from retrieval — once retrieval is good, the generation knobs control how the answer is written, not which documents inform it. Details: Vectara Prompt Engine, Citations.
9. How to iterate
A tuning loop that works in practice:
- Build a small eval set. 20–50 real queries with the answer document(s) labeled. Skipping this is the most common reason teams ship bad retrieval.
- Run with reranking off ("reranker": { "type": "none" }) and look at the pre-rerank results. If the right document isn't in the top 50, no reranker will save you — go back to ingest, filters, or lexical_interpolation.
- Turn the neural reranker on and check that the right document is now in the top 5. If not, tune instructions or your context_configuration (the reranker scores text including surrounding context when include_context: true).
- Add MMR and verify you're not just shipping duplicates.
- Add the UDF and check that recency/authority hasn't pushed a genuinely best answer off the list.
- Inspect query history. GET /v2/queries/{query_id} returns spans[] — each span shows pre-rerank, post-rerank, and rewritten-query data. Set save_history: true on the request to capture it.
Common pitfalls
- Filtering on a non-declared field returns an error, not zero results. If a metadata_filter you expect to work fails, check the corpus's filter_attributes.
- Cutoffs only behave predictably with neural rerankers. With raw hybrid scores or BM25 outputs, scores are unbounded and a static cutoff is unreliable.
- High lexical_interpolation is usually the wrong fix for missing exact-term hits. Fix tokenization at ingest first.
- Rerankers run in chain order, top-down. A userfn that returns null removes the result from the set before later stages see it.
- search.limit caps the post-rerank count. Set the retrieval cap on the neural reranker's limit (or pre-rerank with none and a high search.limit) — otherwise you may rerank fewer candidates than you think.
- Multi-corpus responses interleave by score. Don't assume the first N results are all from the first corpus. Use request_corpora_index or corpus_key if origin matters.
Related
- Hybrid search
- Reranking overview
- Filters
- Custom dimensions
- Citations
- Tune retrieval for agents — same content, framed for agent tool configuration