Version: 2.0

Indexing Documents

This guide explains how to create, query, and maintain documents in a corpus while avoiding common problems such as data sprawl and inconsistent metadata. It covers both indexing new documents and managing existing ones, which makes it useful for building scalable search solutions or automating content governance. This guide shows how to:

  • Create structured or core documents with custom metadata
  • Index documents from text content or upload files
  • List and filter documents in a corpus
  • Retrieve, update, and delete documents by ID
  • Summarize content using LLM-powered tools
Prerequisites

This guide assumes you have a corpus called my-docs. If you haven't created a corpus yet, follow the Quick Start guide to set up your first corpus.

Create a structured document

Create and index a structured document into your corpus to make it searchable. Structured documents are organized into sections, each with optional titles and metadata, making them ideal for contracts, reports, or other organized content.

The documents.create method corresponds to the HTTP POST /v2/corpora/{corpus_key}/documents endpoint.

Key Parameters:

  • id (string, required): Unique identifier for the document within the corpus
  • type (string, required): Must be "structured" for section-based documents
  • sections (array, required): List of document sections with text content
  • metadata (object, optional): Document-level metadata for filtering

Section Parameters:

  • title (string, optional): Section heading or title
  • text (string, required): The actual content text for this section
  • metadata (object, optional): Section-level metadata for fine-grained filtering

Use structured documents for organized content like employee handbooks, policies, or technical manuals where clear section organization improves searchability.
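As a minimal sketch of the request described above, the following builds a structured-document body and the matching POST request using only the Python standard library. The field names come from the parameter lists above; the base URL `https://api.vectara.io`, the `x-api-key` header, and the sample ID, sections, and metadata are assumptions for illustration. The request is constructed but not sent.

```python
import json
import urllib.request

BASE_URL = "https://api.vectara.io"  # assumption: adjust for your deployment


def build_structured_document(doc_id, sections, metadata=None):
    """Return a request body for a structured (section-based) document."""
    for section in sections:
        if not section.get("text"):
            raise ValueError("every section requires non-empty 'text'")
    body = {"id": doc_id, "type": "structured", "sections": sections}
    if metadata:
        body["metadata"] = metadata  # document-level metadata, used for filtering
    return body


def build_create_request(corpus_key, body, api_key):
    """Build (but do not send) POST /v2/corpora/{corpus_key}/documents."""
    return urllib.request.Request(
        f"{BASE_URL}/v2/corpora/{corpus_key}/documents",
        data=json.dumps(body).encode("utf-8"),
        # assumption: API-key authentication via an x-api-key header
        headers={"Content-Type": "application/json", "x-api-key": api_key},
        method="POST",
    )


# Hypothetical sample content: one titled section with metadata, one bare section.
doc = build_structured_document(
    "handbook-2024",
    [
        {
            "title": "Leave policy",
            "text": "Employees accrue leave monthly.",
            "metadata": {"topic": "leave"},
        },
        {"text": "Contact HR with questions."},
    ],
    metadata={"department": "hr"},
)
request = build_create_request("my-docs", doc, api_key="YOUR_API_KEY")
print(request.get_method(), request.full_url)
```

Passing `request` to `urllib.request.urlopen` would send it; the sketch stops short of that so it can be run without credentials.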

Error Handling:

  • 400 Bad Request: Invalid document structure or parameters
  • 403 Forbidden: Insufficient permissions - ensure API key has indexing rights
  • 404 Not Found: Corpus doesn't exist
  • 409 Conflict: Document with the same ID already exists

Create a core document

Create and index a core document using document parts. Core documents are more flexible than structured documents and work well for unstructured content like support articles, FAQs, or knowledge base entries.

Key Differences from Structured Documents:

  • Uses document_parts instead of sections
  • Parts don't have titles, only text content and optional metadata
  • Better suited for unstructured or semi-structured content
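The differences listed above can be sketched as a small body builder. The `document_parts` field comes from this guide; the `"core"` value for `type` is an assumption inferred by analogy with `"structured"`, and the sample ID and text are hypothetical.

```python
def build_core_document(doc_id, parts, metadata=None):
    """Return a request body for a core document made of untitled parts."""
    document_parts = []
    for part in parts:
        if "title" in part:
            # Parts carry only text and optional metadata.
            raise ValueError("document parts have no title; use a structured document")
        if not part.get("text"):
            raise ValueError("every part requires non-empty 'text'")
        entry = {"text": part["text"]}
        if part.get("metadata"):
            entry["metadata"] = part["metadata"]
        document_parts.append(entry)

    # assumption: core documents are marked with type "core"
    body = {"id": doc_id, "type": "core", "document_parts": document_parts}
    if metadata:
        body["metadata"] = metadata
    return body


faq = build_core_document(
    "faq-refunds",
    [
        {"text": "Refunds are processed within five business days."},
        {"text": "Contact support to check a refund.", "metadata": {"channel": "support"}},
    ],
    metadata={"source": "help-center"},
)
print(len(faq["document_parts"]), "parts")
```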

Use Core Documents When:

  • Content doesn't have clear section structure
  • You want maximum flexibility in document organization
  • Working with imported content from various sources

To change a document's content, delete it with client.documents.delete() and then re-index it; direct updates to content are not supported. Re-indexing with the same ID and different content results in a 409 error.
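The delete-then-re-index sequence can be sketched as follows. The `send(method, path, body)` callable is a hypothetical transport that returns an HTTP status code, and the in-memory fake stands in for the service so the sketch runs offline. Treating a 404 on delete as "nothing to remove" is also an assumption.

```python
def replace_document(send, corpus_key, body):
    """Delete any existing document with body['id'], then index the new content."""
    doc_path = f"/v2/corpora/{corpus_key}/documents/{body['id']}"
    status = send("DELETE", doc_path, None)
    # assumption: 404 here means the document was not indexed yet
    if not (200 <= status < 300 or status == 404):
        raise RuntimeError(f"delete failed with status {status}")
    status = send("POST", f"/v2/corpora/{corpus_key}/documents", body)
    if not 200 <= status < 300:
        raise RuntimeError(f"indexing failed with status {status}")
    return status


def make_fake_send(store):
    """Hypothetical in-memory transport that mimics the 409 behavior described above."""
    def send(method, path, body):
        if method == "POST":
            if body["id"] in store:
                return 409  # same ID already indexed
            store[body["id"]] = body
            return 201
        if method == "DELETE":
            doc_id = path.rsplit("/", 1)[-1]
            return 204 if store.pop(doc_id, None) is not None else 404
        return 400
    return send


store = {}
send = make_fake_send(store)
old = {"id": "faq-1", "type": "core", "document_parts": [{"text": "Old answer."}]}
new = {"id": "faq-1", "type": "core", "document_parts": [{"text": "New answer."}]}

send("POST", "/v2/corpora/my-docs/documents", old)
conflict = send("POST", "/v2/corpora/my-docs/documents", new)  # rejected by the fake
replace_document(send, "my-docs", new)                          # delete, then re-index
print(conflict, store["faq-1"]["document_parts"][0]["text"])
```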

Error Handling:

  • 400 Bad Request: Invalid document structure or parameters
  • 403 Forbidden: Insufficient permissions - ensure API key has indexing rights
  • 404 Not Found: Corpus doesn't exist
  • 409 Conflict: Document with the same ID already exists with different content
  • 413 Payload Too Large: Document exceeds size limit
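One possible way to surface these status codes in client code is a small lookup that raises a descriptive exception. The `IndexingError` class and `check_indexing_status` helper are hypothetical names for illustration; the messages paraphrase the list above.

```python
INDEXING_ERRORS = {
    400: "Invalid document structure or parameters",
    403: "Insufficient permissions: check that the API key has indexing rights",
    404: "Corpus doesn't exist",
    409: "A document with the same ID already exists",
    413: "Document exceeds the size limit",
}


class IndexingError(Exception):
    """Hypothetical exception type carrying the HTTP status code."""

    def __init__(self, status, message):
        super().__init__(f"{status}: {message}")
        self.status = status


def check_indexing_status(status):
    """Return None for any 2xx status; raise IndexingError otherwise."""
    if 200 <= status < 300:
        return None
    raise IndexingError(status, INDEXING_ERRORS.get(status, "Unexpected error"))


try:
    check_indexing_status(409)
except IndexingError as err:
    print(err)
```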

List documents in a corpus

List the documents in a corpus, optionally filtered by metadata and returned one page at a time.

The documents.list method corresponds to the HTTP GET /v2/corpora/{corpus_key}/documents endpoint. For more details on request and response parameters, see the List Documents REST API.

Parameters:

  • corpus_key (string, required): Unique identifier for the corpus
  • limit (int, optional): Maximum number of documents to return per page (default: 10)
  • metadata_filter (string, optional): Filter expression for document metadata
  • page_key (string, optional): Token to fetch the next page of results

Returns: Iterator of Document objects (containing id and metadata, but not full content).

Use metadata filters to find specific document types or categories. The method returns paginated results for efficient handling of large document collections.
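The paging behavior can be sketched as a generator that keeps requesting pages until no `page_key` comes back. The query parameter names are taken from the list above; the response shape (`documents` plus `metadata.page_key`), the filter expression syntax, and the `fetch_page` callable are assumptions, and the canned pages replace real HTTP calls.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_list_url(corpus_key, limit=10, metadata_filter=None, page_key=None):
    """Build the path and query string for GET /v2/corpora/{corpus_key}/documents."""
    params = {"limit": limit}
    if metadata_filter:
        params["metadata_filter"] = metadata_filter
    if page_key:
        params["page_key"] = page_key
    return f"/v2/corpora/{corpus_key}/documents?{urlencode(params)}"


def iter_documents(fetch_page, corpus_key, **filters):
    """Yield documents across all pages, following page_key tokens."""
    page_key = None
    while True:
        page = fetch_page(build_list_url(corpus_key, page_key=page_key, **filters))
        yield from page.get("documents", [])
        # assumption: the next-page token is returned under metadata.page_key
        page_key = page.get("metadata", {}).get("page_key")
        if not page_key:
            break


# Hypothetical canned responses keyed by the page_key that requests them.
PAGES = {
    None: {"documents": [{"id": "a"}, {"id": "b"}], "metadata": {"page_key": "p2"}},
    "p2": {"documents": [{"id": "c"}], "metadata": {}},
}


def fake_fetch_page(url):
    query = parse_qs(urlsplit(url).query)
    return PAGES[query.get("page_key", [None])[0]]


ids = [doc["id"] for doc in iter_documents(fake_fetch_page, "my-docs", limit=2)]
print(ids)  # ['a', 'b', 'c']
```

A generator keeps memory use flat for large corpora, because only one page is held at a time.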


Get a document by ID

Retrieve a single document by its unique ID, for example to inspect or display its content.

The documents.get method corresponds to the HTTP GET /v2/corpora/{corpus_key}/documents/{document_id} endpoint.

Parameters:

  • corpus_key (string, required): Unique identifier of the corpus
  • document_id (string, required): Unique identifier of the document

Returns: Document object with full text content and metadata.

Use this method when you need to retrieve the complete document content, not just the metadata returned by the list operation.
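A minimal sketch of the GET request, built with the standard library and not sent. Document IDs may contain characters such as `/` or spaces, so the sketch percent-encodes both path segments. The base URL, the `x-api-key` header, and the sample ID are assumptions.

```python
import urllib.request
from urllib.parse import quote

BASE_URL = "https://api.vectara.io"  # assumption: adjust for your deployment


def document_url(corpus_key, document_id):
    """Return the document URL with both path segments percent-encoded."""
    return (
        f"{BASE_URL}/v2/corpora/{quote(corpus_key, safe='')}"
        f"/documents/{quote(document_id, safe='')}"
    )


def build_get_request(corpus_key, document_id, api_key):
    """Build (but do not send) GET /v2/corpora/{corpus_key}/documents/{document_id}."""
    return urllib.request.Request(
        document_url(corpus_key, document_id),
        # assumption: API-key authentication via an x-api-key header
        headers={"Accept": "application/json", "x-api-key": api_key},
        method="GET",
    )


request = build_get_request("my-docs", "reports/2024 Q1.pdf", "YOUR_API_KEY")
print(request.full_url)
# https://api.vectara.io/v2/corpora/my-docs/documents/reports%2F2024%20Q1.pdf
```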


Update document metadata

Update a document's metadata fields, for example to tag or categorize it or to track its status.

The documents.update method corresponds to the HTTP PATCH /v2/corpora/{corpus_key}/documents/{document_id} endpoint.

Parameters:

  • corpus_key (string, required): Unique identifier of the corpus
  • document_id (string, required): Unique identifier of the document
  • metadata (object, required): New metadata to merge with existing metadata

The update operation merges the provided metadata with existing metadata, allowing you to add new fields or modify existing ones without losing other data.
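The merge semantics can be illustrated locally. This sketch assumes a shallow, top-level merge in which provided fields overwrite matching fields and all others are kept; how nested objects are merged is not covered by this guide. The sample field names are hypothetical.

```python
import json


def merge_metadata(existing, changes):
    """Local model of the merge: changed keys win, untouched keys survive."""
    # assumption: shallow merge at the top level only
    return {**existing, **changes}


def build_update_body(changes):
    """Serialize the PATCH body; only the fields being changed are sent."""
    return json.dumps({"metadata": changes})


existing = {"department": "hr", "status": "draft"}
changes = {"status": "published", "reviewed_by": "legal"}

print(build_update_body(changes))
print(merge_metadata(existing, changes))
# {'department': 'hr', 'status': 'published', 'reviewed_by': 'legal'}
```

Sending only the changed fields keeps the request small and avoids overwriting metadata that another process may have set.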


Delete a document

Permanently remove a document from a corpus, for example as part of data cleanup or lifecycle management.

The documents.delete method corresponds to the HTTP DELETE /v2/corpora/{corpus_key}/documents/{document_id} endpoint.

Parameters:

  • corpus_key (string, required): Unique identifier of the corpus
  • document_id (string, required): Unique identifier of the document to delete

Caution: Deletion is permanent and cannot be undone. Ensure you have backups if the document might be needed later.
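Because deletion cannot be undone, one option is to save a copy before deleting. The sketch below assumes two hypothetical callables, `get_document` and `delete_document`, which wrap the get and delete operations; in-memory stand-ins are used so it runs offline.

```python
import json
import tempfile
from pathlib import Path
from urllib.parse import quote


def delete_with_backup(get_document, delete_document, corpus_key, document_id, backup_dir):
    """Write the document to a JSON file, then delete it. Returns the backup path."""
    document = get_document(corpus_key, document_id)
    # Percent-encode names so characters like "/" cannot escape backup_dir.
    file_name = f"{quote(corpus_key, safe='')}__{quote(document_id, safe='')}.json"
    backup_path = Path(backup_dir) / file_name
    backup_path.write_text(json.dumps(document, indent=2), encoding="utf-8")
    # Delete only after the backup has been written successfully.
    delete_document(corpus_key, document_id)
    return backup_path


# Hypothetical in-memory stand-ins for the get and delete operations.
store = {("my-docs", "faq/1"): {"id": "faq/1", "metadata": {"status": "stale"}}}


def fake_get(corpus_key, document_id):
    return store[(corpus_key, document_id)]


def fake_delete(corpus_key, document_id):
    del store[(corpus_key, document_id)]


with tempfile.TemporaryDirectory() as backup_dir:
    path = delete_with_backup(fake_get, fake_delete, "my-docs", "faq/1", backup_dir)
    restored = json.loads(path.read_text(encoding="utf-8"))

print(path.name, restored["id"])
```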


Summarize a document

Generate LLM-powered summaries for specific documents in your corpus. Use this for content previews, search snippets, or generative UI applications.

The documents.summarize method corresponds to the HTTP POST /v2/corpora/{corpus_key}/documents/{document_id}/summarize endpoint.

Parameters:

  • corpus_key (string, required): Unique identifier of the corpus
  • document_id (string, required): Unique identifier of the document
  • llm_name (string, optional): LLM model to use for summarization
  • prompt_template (string, optional): Custom prompt with $document_content placeholder

Returns: Summary response object with the generated summary text.

Use custom prompt templates to tailor summaries for specific use cases like customer support, technical documentation, or content previews.
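A custom template could be assembled as in the sketch below. The `$document_content` placeholder and the `prompt_template` and `llm_name` fields come from the parameter list above; the helper names and the sample template are hypothetical. The local preview uses plain string replacement only to show roughly what the model might receive; the service performs the actual substitution.

```python
import json

PLACEHOLDER = "$document_content"


def build_summarize_body(prompt_template=None, llm_name=None):
    """Return the JSON-serializable body for the summarize request."""
    body = {}
    if prompt_template is not None:
        if PLACEHOLDER not in prompt_template:
            raise ValueError(f"prompt_template must contain {PLACEHOLDER}")
        body["prompt_template"] = prompt_template
    if llm_name is not None:
        body["llm_name"] = llm_name
    return body


def preview_prompt(prompt_template, sample_text):
    """Local preview only (assumption: simple placeholder replacement)."""
    return prompt_template.replace(PLACEHOLDER, sample_text)


template = (
    "Summarize this document for a support agent in two sentences:\n"
    "$document_content"
)
body = build_summarize_body(prompt_template=template)
preview = preview_prompt(template, "Refunds are processed within five business days.")

print(json.dumps(body))
print(preview)
```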


Workflow: Create corpus and index document

Establishing a new knowledge base in Vectara is a two-step workflow:

  1. Corpus creation: The first step creates a new corpus with a unique identifier (key) and a human-readable name. The corpus acts as a namespace for your documents and defines important characteristics like metadata schemas, filter attributes, and access controls. Handle the common case where the corpus already exists.
  2. Document ingestion: The second step uploads and indexes a structured document into the corpus. The document is parsed into searchable sections, with each section containing both text content and optional metadata. Vectara processes the content automatically, making it immediately queryable through the search API.
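The two steps can be sketched with a hypothetical `send(method, path, body)` transport that returns an HTTP status code. The corpus creation path `/v2/corpora`, its `key` and `name` fields, and the use of 409 for an existing corpus are assumptions not stated in this guide; the in-memory fake replaces the real service so the sketch runs offline.

```python
def ensure_corpus(send, key, name):
    """Create the corpus; return False if it already existed."""
    # assumption: POST /v2/corpora creates a corpus and answers 409 if the key is taken
    status = send("POST", "/v2/corpora", {"key": key, "name": name})
    if status == 409:
        return False
    if not 200 <= status < 300:
        raise RuntimeError(f"corpus creation failed with status {status}")
    return True


def index_document(send, corpus_key, document):
    """Index one document into an existing corpus."""
    status = send("POST", f"/v2/corpora/{corpus_key}/documents", document)
    if not 200 <= status < 300:
        raise RuntimeError(f"indexing failed with status {status}")
    return status


def bootstrap(send, key, name, document):
    """Step 1: make sure the corpus exists. Step 2: index the document."""
    created = ensure_corpus(send, key, name)
    index_document(send, key, document)
    return created


def make_fake_send():
    """Hypothetical in-memory service: corpus key -> {document id: body}."""
    corpora = {}

    def send(method, path, body):
        if method == "POST" and path == "/v2/corpora":
            if body["key"] in corpora:
                return 409
            corpora[body["key"]] = {}
            return 201
        if method == "POST" and path.endswith("/documents"):
            corpus_key = path.split("/")[3]
            if corpus_key not in corpora:
                return 404
            if body["id"] in corpora[corpus_key]:
                return 409
            corpora[corpus_key][body["id"]] = body
            return 201
        return 400

    return send, corpora


send, corpora = make_fake_send()
intro = {"id": "intro", "type": "structured", "sections": [{"text": "Welcome."}]}
setup = {"id": "setup", "type": "structured", "sections": [{"text": "Install steps."}]}

first = bootstrap(send, "my-docs", "My Docs", intro)   # corpus is created
second = bootstrap(send, "my-docs", "My Docs", setup)  # corpus already exists
print(first, second, sorted(corpora["my-docs"]))
```

Confirming the corpus before indexing means a failed creation stops the workflow early instead of surfacing later as a 404 on the document request.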

Best Practices

  • Descriptive naming: Use meaningful corpus keys and names that clearly identify the content domain and purpose.
  • Consistent metadata: Establish a uniform metadata schema across all documents within a corpus to enable effective filtering.
  • Robust error handling: Implement comprehensive logic that handles both creation failures and "already exists" scenarios gracefully.
  • Verification steps: Confirm corpus creation success before attempting document indexing to avoid orphaned content.
  • Resource management: Consider using unique corpus keys for testing to avoid conflicts with existing resources.

Next steps

After understanding document management and indexing, you can:

  • Query documents: Use client.query() to search across document content; see the Query guide
  • Upload files: Use client.upload.file() to index PDFs, DOCX, and other file formats; see the Upload Files guide
  • Manage corpora: Create and configure corpora with client.corpora.create(); see the Corpora guide
  • Batch operations: Process multiple documents efficiently for large-scale content management
  • Advanced filtering: Leverage metadata for sophisticated document organization