Version: 2.0

The platform stack

The Vectara platform is a stack of layers. Every application you build uses the same blocks: different corpora, different tools, different agents, but the same primitives. This page walks each layer top-down and tells you which parts you configure versus what Vectara runs on your behalf.

Vectara exposes a consistent set of platform primitives across deployment models: interfaces, agent runtime, tools, LLM gateway, retrieval, corpora and ingestion, and foundation controls. The headline numbers:

HHEM grades every answer in under 50ms.
Boomerang leads XQuAD-R cross-lingual retrieval at 76.2%, with a substantial lead on low-resource languages.
35+ built-in agent tools, custom tools deployed in minutes.
SOC 2 Type II certified, HIPAA on request, deployable as SaaS, in your VPC, on-prem, or air-gapped.

Layer 1: Interfaces

The surfaces through which your code and your operators talk to the platform.

REST API: the primary integration point. Anything that speaks HTTP can drive Vectara: web frontends, mobile apps, backend services, even other agents. Generated from the OpenAPI spec, with first-class request and response schemas. See the REST API reference.
Vectara Skills for coding agents: a packaged set of skills that Claude Code, Cursor, and similar coding agents load to build Vectara applications. Same primitives, surfaced as natural-language operations. See Agent skills for coding agents.
Admin Console: the operator surface. Inspect sessions, replay retrieval traces, tune HHEM thresholds, manage corpora, manage agents, manage pipelines. See the Console quickstart.

You decide which interface drives which workload. Build a chat UI? REST. Bootstrap a new corpus from a coding agent? Skills. Triage a regression in production? Console.

Layer 2: Agent runtime

The orchestration engine that turns declarative agent configuration into deterministic, auditable execution.

Stepped state machines: an agent is a graph of named steps. Each step has its own instructions, scoped tool set, and output schema. Conditional routing between steps is the agent's logic. The LLM picks what to do within a step; the platform picks which step runs next.
Sub-agent delegation: any agent can be called as a tool by another agent. Specialists run in their own sessions, with their own models and RBAC, and report results back.
Structured-output gating: a step can emit JSON matching a schema. Downstream conditions read typed fields via get('$.path'), so routing never depends on parsing values out of free-form prose.
Cross-session approvals: a step can pause waiting for a human decision. The session resumes on webhook, possibly days later.

Configure agents in JSON or via the Console wizard. See Agent concepts and Steps for the full configuration model.
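
The split between "the LLM acts within a step" and "the platform picks the next step" can be sketched in a few lines of Python. This is an illustration only, not Vectara's configuration schema: the STEPS layout, the step names, the tool names, and this toy get helper (which handles only dotted '$.a.b' paths) are all assumptions made for the example. Canned step outputs stand in for real LLM calls.

```python
from typing import Any, Optional


def get(output: dict, path: str) -> Any:
    """Resolve a simple '$.a.b' path against a step's JSON output (toy helper)."""
    if not path.startswith("$."):
        raise ValueError(f"unsupported path: {path}")
    value: Any = output
    for key in path[2:].split("."):
        value = value[key]
    return value


# Hypothetical agent graph: each step lists its scoped tools and its routes.
# A route is (condition on the step's typed output, name of the next step).
STEPS = {
    "triage": {
        "allowed_tools": ["search_knowledge"],  # hypothetical tool name
        "routes": [
            (lambda out: get(out, "$.severity") == "high", "escalate"),
            (lambda out: True, "answer"),  # fallback route
        ],
    },
    "escalate": {"allowed_tools": ["notify_oncall"], "routes": []},
    "answer": {"allowed_tools": ["search_knowledge"], "routes": []},
}


def next_step(current: str, output: dict) -> Optional[str]:
    """Return the target of the first route whose condition holds, else None."""
    for condition, target in STEPS[current]["routes"]:
        if condition(output):
            return target
    return None


def run(start: str, outputs: dict) -> list:
    """Walk the graph, using canned outputs in place of real LLM calls."""
    visited, step = [], start
    while step is not None:
        visited.append(step)
        step = next_step(step, outputs.get(step, {}))
    return visited
```

Because routing reads a typed field rather than the model's prose, the same step output always produces the same transition, which is what makes the execution auditable.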

Layer 3: Tools

What an agent can do. A tool is a small, named, schema-typed function the LLM can call during a conversation. Three sources fill an agent's tool list.

Built-in catalog: 35+ ready tools across retrieval, corpus read+write, documents, images, data, code, and orchestration. List them in tool_configurations and they work. See Agent tools.
Python Lambda tools: for pure-Python business logic the catalog does not cover: validators, calculators, scorers, parsers. POST the code to /v2/tools. Vectara runs it sandboxed. See Lambda tools.
MCP servers: any Model Context Protocol server. Register the URL with an auth token and every tool the server exposes becomes callable by the agent. See Model Context Protocol.
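
As a rough idea of the kind of pure-Python logic that fits a Lambda tool, here is a small validator. The entry-point name process, the order ID format, and the shape of the returned dictionary are assumptions made for this sketch; the Lambda tools reference defines the actual contract and the request body expected by /v2/tools.

```python
import re

# Hypothetical business rule: order IDs look like "ORD-" followed by six digits.
ORDER_ID = re.compile(r"ORD-\d{6}")


def process(order_id: str) -> dict:
    """Validate and normalize an order ID, returning a typed result."""
    normalized = order_id.strip().upper()
    if ORDER_ID.fullmatch(normalized):
        return {"valid": True, "order_id": normalized}
    return {"valid": False, "reason": "expected format ORD-######"}
```

The function has no I/O and no third-party imports, which is the shape of logic that suits a sandboxed tool.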

Wrap any external REST API with web_get and an OAuth or token credential. Salesforce, Google Drive, Jira, ServiceNow, Confluence: same shape, different argument_override.

Per-step allowed_tools narrows the catalog at each state. Least privilege by default. A jailbreak in one step cannot reach tools scoped to another.
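
The least-privilege idea can be illustrated with a small dispatcher that refuses any tool outside the current step's allowlist. This is a conceptual sketch, not the platform's implementation: call_tool, ToolNotAllowed, and the two tools in REGISTRY are hypothetical names.

```python
class ToolNotAllowed(Exception):
    """Raised when a step requests a tool outside its allowlist."""


# Hypothetical tool registry for the whole agent.
REGISTRY = {
    "lookup_order": lambda order_id: {"order_id": order_id, "status": "shipped"},
    "issue_refund": lambda order_id: {"order_id": order_id, "refunded": True},
}


def call_tool(step_allowed: set, registry: dict, name: str, **arguments):
    """Run a tool only if the current step's allowlist names it."""
    if name not in step_allowed:
        raise ToolNotAllowed(f"{name!r} is not allowed in this step")
    return registry[name](**arguments)
```

The check runs outside the model, so a prompt that talks the LLM into requesting issue_refund from a lookup-only step still fails at dispatch.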

Layer 4: LLM gateway

The platform's bring-your-own-model surface. Pick the LLM per agent, per step, or per cost tier.

Hosted models: Anthropic, OpenAI, Gemini, and on-prem industry-specialized models, all out of the box.
BYO LLM: bring your own deployment. The platform calls it the same way it calls the hosted models. See Bring your own LLM.
Hallucination Corrector: rewrites spans of an answer that the retrieved sources do not support, rather than leaving fabricated text in place. Pairs with HHEM grading. See Vectara Hallucination Corrector.

The agent's model field picks the provider. Swap the model on the specialist without touching the orchestrator.

Layer 5: Retrieval engine

A six-stage pipeline, tunable at every stage. Not a one-shot vector lookup.

Documents → Chunking → Boomerang → Hybrid (BM25 + dense + filters)
          → Slingshot reranker → Citations → Generation-ready context

Chunking: sentence or max-chars chunking; never crosses section boundaries. See Chunking strategies.
Boomerang: Vectara's multilingual embedding model. Leads XQuAD-R cross-lingual at 76.2%, with substantial gains on low-resource languages.
Hybrid search: BM25 + dense retrieval with metadata filters applied before generation. See Hybrid search.
Slingshot reranker: chain rerankers (knee, MMR, multilingual, UDF) to compose the right pipeline per workload. See Reranking.
Citations: every retrieved chunk travels with its source. The generated answer cites the documents it grounded on. See Citations.

Corpora scale horizontally. Per-query metadata filters narrow the search space before generation. Citations and HHEM scores travel with every chunk.
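
To make the hybrid stage concrete, the sketch below filters candidates on metadata and then ranks them by a weighted blend of a lexical score and a dense score. It is a toy model under stated assumptions: the scores are pre-normalized stand-ins for BM25 and embedding similarity, lexical_weight is a hypothetical parameter name, and the linear blend is one common fusion method, not a description of Vectara's internals.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    doc_id: str
    lexical: float  # stand-in for a normalized BM25 score in [0, 1]
    dense: float    # stand-in for an embedding similarity in [0, 1]
    metadata: dict = field(default_factory=dict)


def hybrid_search(chunks, lexical_weight, required, limit=3):
    """Filter on metadata, then rank by a weighted lexical/dense blend."""
    candidates = [
        c for c in chunks
        if all(c.metadata.get(k) == v for k, v in required.items())
    ]

    def score(c):
        return lexical_weight * c.lexical + (1 - lexical_weight) * c.dense

    ranked = sorted(candidates, key=score, reverse=True)
    # Each hit keeps its source ID so a citation can travel with it.
    return [(c.doc_id, score(c)) for c in ranked[:limit]]


CHUNKS = [
    Chunk("a", lexical=0.9, dense=0.2, metadata={"doc_type": "manual"}),
    Chunk("b", lexical=0.1, dense=0.8, metadata={"doc_type": "manual"}),
    Chunk("c", lexical=1.0, dense=1.0, metadata={"doc_type": "ticket"}),
]
```

With a low lexical_weight the semantically closer chunk b ranks first; with a high one the keyword match a wins. Chunk c never appears for a doc_type of manual, whatever its scores.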

Layer 6: Corpora & ingestion

The persistent context layer. A corpus is more than a vector store: it is writable, filterable, multilingual semantic storage. The same primitive serves three jobs:

Knowledge: documents the agent should know (manuals, KB articles, support tickets, policies).
Memory: what the agent learned about a user (preferences, history, prior decisions).
State and scratchpad: cached tool results, derived facts, structured outputs written back so future agents find them by meaning.

Filter attributes (user_id, session_id, doc_type, tier) turn a corpus into structured storage. Multi-corpus search at query time lets one call span knowledge + memory + cache, weighted differently. RBAC is enforced at retrieval: a user only ever sees chunks their identity is entitled to see.
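
A minimal sketch of one call spanning knowledge and memory, with different weights and an identity filter, might look like this. The record layout, the allow callback, and the per-corpus weights are assumptions for illustration; in the platform these map to filter attributes and query-time settings, not Python callables.

```python
from typing import Callable


def multi_corpus_search(
    corpora: dict,
    weights: dict,
    allow: Callable[[dict], bool],
    limit: int = 5,
) -> list:
    """Merge hits from several corpora, dropping records the caller may not see."""
    hits = []
    for name, records in corpora.items():
        for record in records:
            if allow(record["attrs"]):  # entitlement check before ranking
                weighted = weights.get(name, 1.0) * record["score"]
                hits.append((name, record["text"], weighted))
    return sorted(hits, key=lambda hit: hit[2], reverse=True)[:limit]


CORPORA = {
    "knowledge": [
        {"text": "reset guide", "score": 0.8, "attrs": {"tier": "public"}},
    ],
    "memory": [
        {"text": "prefers email", "score": 0.9, "attrs": {"user_id": "u1"}},
        {"text": "prefers sms", "score": 0.95, "attrs": {"user_id": "u2"}},
    ],
}


def visible_to_u1(attrs: dict) -> bool:
    """Shared records have no user_id; private ones must belong to u1."""
    return attrs.get("user_id") in (None, "u1")
```

Calling multi_corpus_search(CORPORA, {"knowledge": 1.0, "memory": 0.5}, visible_to_u1) returns the shared guide and u1's own preference, and never the record that belongs to u2.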

Ingestion runs as pipelines. A pipeline pulls records from a source (S3, SharePoint, web, Salesforce, Slack, Notion, Google Drive, GitHub, custom) on a schedule (cron, interval, manual, webhook) and fans each record out into a fresh agent session for verification and indexing. See Pipelines quickstart and Knowledge.

For the corpus primitive serving knowledge, memory, and scratchpad, see Knowledge and Context & memory.

Layer 7: Foundation

Foundation controls reduce the risk of data exposure by enforcing tenant isolation, authentication, authorization, retrieval-time access controls, encryption, audit logging, and deployment-level governance.

Tenant isolation: Every retrieval is filtered by tenant, RBAC, and metadata before reaching the LLM. There is no path for a model to see data the requesting identity cannot access.
IdP / SSO: OIDC and SAML. Group claims map to Vectara roles and per-corpus scopes. See Authentication.
RBAC by corpus: Permissions filter at retrieval time. See Role-based access control.
Audit and traces: Every query, retrieval source, prompt, LLM call, citation, and HHEM score is logged and streamable to your SIEM. See Query observability.
Encryption: AES-256 at rest, TLS 1.3 in transit, KMS-managed keys per tenant. Optional BYOK in private cloud.
Compliance: SOC 2 Type II certified. HIPAA on request. See Privacy overview.
No training on your data: Vectara does not use your corpora to train its models.

For private deployments, including VPC, on-premises, and air-gapped, see Private deployment.

How the application talks to the stack

A typical integration flows like this, whether your team writes the application or Vectara Managed Agents delivers it for you:

Hold the user's identity in the application and call the Vectara REST API.
Declare agents: steps, allowed tools, prompts, skills. Store as JSON in your source control.
Point retrieval at one or more corpora with metadata filters bound to session metadata.
Choose your LLM: hosted (Anthropic, OpenAI, Gemini), on-prem, or BYO.
Set HHEM threshold and decline-vs-answer policy.
Stream traces to your observability sink (Splunk, Grafana, your warehouse).

The same six configuration choices ship with every Vectara application. Platform releases, including new rerankers, Boomerang versions, and HHEM revisions, reach you without requiring reimplementation.
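
The HHEM threshold and decline-vs-answer choice can be pictured as a small gate applied to each generated answer. The sketch assumes a score between 0 and 1 where higher means better grounded; the function name, the default threshold of 0.5, the policy names, and the decline message are hypothetical and not platform defaults.

```python
def apply_policy(answer: str, hhem_score: float,
                 threshold: float = 0.5, on_low_score: str = "decline") -> dict:
    """Decide what to return to the user based on the answer's HHEM score."""
    if not 0.0 <= hhem_score <= 1.0:
        raise ValueError("hhem_score must be between 0 and 1")
    if hhem_score >= threshold:
        return {"action": "answer", "text": answer, "hhem_score": hhem_score}
    if on_low_score == "decline":
        return {
            "action": "decline",
            "text": "I could not find a well-grounded answer.",
            "hhem_score": hhem_score,
        }
    if on_low_score == "answer_with_warning":
        return {"action": "answer_with_warning", "text": answer,
                "hhem_score": hhem_score}
    raise ValueError(f"unknown policy: {on_low_score}")
```

Returning the score alongside the action keeps the decision visible in traces, whichever policy is chosen.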

Related

Agent anatomy: the parts inside one agent.
Request lifecycle: what happens on a single user turn.
Hallucination evaluation: how HHEM grades every answer.
Security: tenant isolation, RBAC, audit, compliance.
