Version: 2.0

Upload a file to the corpus

POST 

/v2/corpora/:corpus_key/upload_file

Supported API Key Type:
Index Service, Personal

Upload a file to a corpus for automatic text extraction, chunking, and indexing. This endpoint is designed for unstructured documents where you want Vectara to handle parsing for you. Each uploaded file can be up to 10 MB.

Supported file types include:

  • Markdown (.md)
  • PDF/A (.pdf)
  • OpenOffice documents (.odt)
  • Microsoft Word (.doc, .docx)
  • Microsoft PowerPoint (.ppt, .pptx)
  • Plain text (.txt)
  • HTML (.html)
  • LXML (.lxml)
  • RTF (.rtf)
  • EPUB (.epub)
  • Email files (RFC 822)
note

For semi-structured documents that require more control over fields or metadata, use the Create Corpus Document API instead.

Additional format support through Vectara Ingest

If you need to ingest additional file types or data sources, you can use the open-source Vectara Ingest Python framework. It supports connectors for websites, RSS feeds, CSV, Confluence, HubSpot, ServiceNow, Jira, Notion, Slack, MediaWiki, GitHub, SharePoint, Twitter/X, YouTube, and more.

caution

Vectara Ingest is provided as an open-source example and is not officially supported.

Multipart form fields

This endpoint expects a multipart/form-data request with the following fields:

  • metadata (optional): JSON metadata to attach to the parsed document.
    Example: metadata={"key":"value"}
  • chunking_strategy (optional): Controls how extracted text is chunked.
    Defaults to sentence-based chunking (one chunk per sentence).
    Example (sentence chunking): {"type":"sentence_chunking_strategy"}
    Example (max-character chunking): {"type":"max_chars_chunking_strategy","max_chars_per_chunk":512}
    A full request using this field appears after the basic example below.
  • table_extraction_config (optional): Enables extraction of tables from supported file types such as PDFs.
    Example: {"extract_tables": true}
  • file (required): The file to upload.
  • filename (required): The desired document ID, specified within the file upload field.

In addition to these form fields, the server expects a valid API key or JWT token in the request headers, as shown here with the x-api-key header:

$ curl -L -X POST 'https://api.vectara.io/v2/corpora/:corpus_key/upload_file' \
-H 'Content-Type: multipart/form-data' \
-H 'Accept: application/json' \
-H 'x-api-key: zwt_123456' \
-F 'metadata={"key": "value"};type=application/json' \
-F 'file=@/path/to/file/file.pdf;filename=desired_filename.pdf'
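
For example, here is a sketch of the same request that switches to max-character chunking by adding the chunking_strategy form field. The corpus key, API key, file path, and filename are placeholders:

$ curl -L -X POST 'https://api.vectara.io/v2/corpora/:corpus_key/upload_file' \
-H 'Accept: application/json' \
-H 'x-api-key: zwt_123456' \
-F 'chunking_strategy={"type":"max_chars_chunking_strategy","max_chars_per_chunk":512};type=application/json' \
-F 'file=@/path/to/file/file.pdf;filename=desired_filename.pdf'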

Filenames with non-ASCII characters

When uploading files whose names contain non-ASCII characters (for example, Russian or Chinese), make sure the filename is URL encoded. The Vectara REST API follows web standards, which require URL-encoded filenames.
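
For example, a filename such as отчет.pdf can be percent-encoded before it is passed to curl. The python3 helper below is just one way to produce the encoding, and the other values are placeholders:

$ ENCODED=$(python3 -c 'import urllib.parse; print(urllib.parse.quote("отчет.pdf"))')
$ echo "$ENCODED"
%D0%BE%D1%82%D1%87%D0%B5%D1%82.pdf
$ curl -L -X POST 'https://api.vectara.io/v2/corpora/:corpus_key/upload_file' \
-H 'Accept: application/json' \
-H 'x-api-key: zwt_123456' \
-F "file=@/path/to/отчет.pdf;filename=$ENCODED"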

Set the document ID

To set a custom Document ID, pass it as the filename in the Content-Disposition header:

Content-Disposition: form-data; name="file"; filename="your_document_id"

For more information about Content-Disposition, see the Mozilla documentation on headers.
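
With curl, the filename part of the -F option sets this header. For example, a minimal sketch that indexes report.pdf under the document ID my-custom-doc-id (corpus key, API key, and path are placeholders):

$ curl -L -X POST 'https://api.vectara.io/v2/corpora/:corpus_key/upload_file' \
-H 'Accept: application/json' \
-H 'x-api-key: zwt_123456' \
-F 'file=@/path/to/report.pdf;filename=my-custom-doc-id'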

Attach additional metadata

You can attach additional metadata to the file by specifying a metadata form field, which can contain a JSON string:

{ "filesize": 1234 }

Tabular data extraction and summarization

Setting table_extraction_config.extract_tables to true enables extraction of tabular data, such as the tables in financial filings (10-K, 10-Q, S-1). You can also apply custom prompt templates to summarize table content during upload.
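
For example, here is a sketch of an upload that turns on table extraction for a PDF filing (corpus key, API key, and file are placeholders):

$ curl -L -X POST 'https://api.vectara.io/v2/corpora/:corpus_key/upload_file' \
-H 'Accept: application/json' \
-H 'x-api-key: zwt_123456' \
-F 'table_extraction_config={"extract_tables": true};type=application/json' \
-F 'file=@/path/to/10-K.pdf;filename=10-K.pdf'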

caution

Table extraction does not support scanned images of tables.

Custom table summarization with prompt templates

Vectara supports table summarization using custom prompt templates during document upload. These templates control how the LLM interprets and summarizes table data during extraction. By customizing the prompt_template, you can tailor summaries for domain-specific language, analytical perspectives, or formatting preferences.
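
As a rough sketch only, a prompt template might be supplied alongside the table extraction settings, as in the form field below. The placement of prompt_template inside table_extraction_config is an assumption here, not a confirmed schema, so check the API reference before relying on it:

-F 'table_extraction_config={"extract_tables": true, "prompt_template": "Summarize each table from the perspective of a financial analyst."};type=application/json'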

Image support

You can include images in structured documents using the Indexing API with Base64 encoding. You cannot send images directly with individual query requests. To retrieve a specific image embedded within a document, use the Retrieve Image API.
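
If you are preparing an image for the Indexing API, the Base64 string itself can be produced with standard tooling; the document structure that the string is embedded in is outside the scope of this page:

$ base64 -w0 diagram.png > diagram.b64   # GNU coreutils; on macOS use: base64 -i diagram.png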


Responses

The extracted document has been parsed and added to the corpus.