Upload a file to the corpus
POST /v2/corpora/:corpus_key/upload_file
Upload a file to a corpus for automatic text extraction, chunking, and indexing. This endpoint is designed for unstructured documents where you want Vectara to handle parsing for you. Each uploaded file can be up to 10 MB.
Supported file types include:
- Markdown (.md)
- PDF/A (.pdf)
- OpenOffice documents (.odt)
- Microsoft Word (.doc, .docx)
- Microsoft PowerPoint (.ppt, .pptx)
- Plain text (.txt)
- HTML (.html)
- LXML (.lxml)
- RTF (.rtf)
- EPUB (.epub)
- Email files (RFC 822)
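A client can check the size limit and file type before sending anything. The following is a minimal sketch, not part of the Vectara API: the function name `check_upload_candidate` is hypothetical, and it assumes that "10 MB" means 10 * 1024 * 1024 bytes, which this page does not specify.

```python
from pathlib import Path

# Assumption: "10 MB" is read as 10 * 1024 * 1024 bytes; the documentation
# does not say whether the limit uses binary or decimal megabytes.
MAX_UPLOAD_BYTES = 10 * 1024 * 1024

# Extensions taken from the supported file types list. RFC 822 email files
# are omitted because no extension is named for them.
SUPPORTED_EXTENSIONS = {
    ".md", ".pdf", ".odt", ".doc", ".docx", ".ppt", ".pptx",
    ".txt", ".html", ".lxml", ".rtf", ".epub",
}


def check_upload_candidate(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means no problem was found."""
    problems = []
    extension = Path(filename).suffix.lower()
    if extension not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported extension: {extension or '(none)'}")
    if size_bytes > MAX_UPLOAD_BYTES:
        problems.append(f"file is {size_bytes} bytes, limit is {MAX_UPLOAD_BYTES}")
    return problems
```

A local check like this only avoids obviously invalid requests; the server remains the authority on what it accepts.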
For semi-structured documents that require more control over fields or metadata, use the Create Corpus Document API instead.
Additional format support through Vectara Ingest
If you need to ingest additional file types or data sources, you can use the open-source Vectara Ingest Python framework. It supports connectors for websites, RSS feeds, CSV, Confluence, HubSpot, ServiceNow, Jira, Notion, Slack, MediaWiki, GitHub, SharePoint, Twitter/X, YouTube, and more.
Vectara Ingest is provided as an open-source example and is not officially supported.
Multipart form fields
This endpoint expects a multipart/form-data request with the following fields:
- metadata (optional): JSON metadata to attach to the parsed document.
  Example: metadata={"key":"value"}
- chunking_strategy (optional): Controls how extracted text is chunked. Defaults to sentence-based chunking (one chunk per sentence).
  Example: {"type":"sentence_chunking_strategy"}
  Example for max character chunking: {"type":"max_chars_chunking_strategy","max_chars_per_chunk":512}
- table_extraction_config (optional): Enables extraction of tables from supported file types such as PDFs.
  Example: {"extract_tables": true}
- file (required): The file to upload.
- filename (required): The desired document ID, specified within the file upload field.
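The optional fields are JSON values sent as strings. As a rough sketch, they could be serialized as follows; `build_form_fields` and its parameters are hypothetical names used for illustration, and only the JSON shapes come from the examples in the list above.

```python
import json


def build_form_fields(metadata=None, max_chars_per_chunk=None, extract_tables=False):
    """Serialize the optional multipart form fields as JSON strings."""
    fields = {}
    if metadata is not None:
        fields["metadata"] = json.dumps(metadata)
    if max_chars_per_chunk is not None:
        # Omitting this field leaves the default sentence-based chunking in place.
        fields["chunking_strategy"] = json.dumps({
            "type": "max_chars_chunking_strategy",
            "max_chars_per_chunk": max_chars_per_chunk,
        })
    if extract_tables:
        fields["table_extraction_config"] = json.dumps({"extract_tables": True})
    return fields
```

Using `json.dumps` rather than hand-written strings keeps quoting and boolean literals (`true` rather than `True`) valid.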
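In addition to these fields, the server expects valid credentials, such as an API key or a JWT token, in the HTTP headers. The following example passes an API key in the x-api-key header: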
$ curl -L -X POST 'https://api.vectara.io/v2/corpora/:corpus_key/upload_file' \
-H 'Content-Type: multipart/form-data' \
-H 'Accept: application/json' \
-H 'x-api-key: zwt_123456' \
-F 'metadata={"key": "value"};type=application/json' \
-F 'file=@/path/to/file/file.pdf;filename=desired_filename.pdf'
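For readers who are not using curl, the same request can be assembled with the Python standard library. This is a sketch under assumptions rather than an official client: `build_upload_request` is a hypothetical helper, the `application/octet-stream` default for the file part is an assumption (this page does not say how the part's content type is used), and the document ID is assumed to be ASCII with no double quotes.

```python
import json
import urllib.request
import uuid


def build_upload_request(corpus_key, api_key, document_id, file_bytes,
                         metadata=None, content_type="application/octet-stream"):
    """Build (but do not send) a multipart/form-data upload request."""
    boundary = uuid.uuid4().hex
    body = bytearray()

    if metadata is not None:
        # Optional "metadata" part, sent as a JSON string.
        body += (
            f"--{boundary}\r\n"
            'Content-Disposition: form-data; name="metadata"\r\n'
            "Content-Type: application/json\r\n"
            "\r\n"
            f"{json.dumps(metadata)}\r\n"
        ).encode("utf-8")

    # Required "file" part; its filename becomes the document ID.
    body += (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{document_id}"\r\n'
        f"Content-Type: {content_type}\r\n"
        "\r\n"
    ).encode("utf-8")
    body += file_bytes
    body += f"\r\n--{boundary}--\r\n".encode("utf-8")

    return urllib.request.Request(
        url=f"https://api.vectara.io/v2/corpora/{corpus_key}/upload_file",
        data=bytes(body),
        method="POST",
        headers={
            "Content-Type": f"multipart/form-data; boundary={boundary}",
            "Accept": "application/json",
            "x-api-key": api_key,
        },
    )
```

Passing the returned object to `urllib.request.urlopen` would send it. The function stops short of that so the request can be inspected first.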
Filenames with non-ASCII characters
When uploading files whose names contain non-ASCII characters, such as Cyrillic or Chinese characters, ensure that the filename is URL-encoded. The Vectara REST API follows web standards, which require URL-encoded filenames.
Set the document ID
To set a custom Document ID, pass it as the filename in the Content-Disposition header:
Content-Disposition: form-data; name="file"; filename="your_document_id"
For more information about Content-Disposition, see the Mozilla documentation on headers.
Attach additional metadata
You can attach additional metadata to the file by specifying a metadata form field, which can contain a JSON string:
{ "filesize": 1234 }
Tabular data extraction and summarization
Setting table_extraction_config.extract_tables = true enables extraction of tabular data from documents such as financial filings (for example, 10-K, 10-Q, and S-1 forms). You can also apply custom prompt templates to summarize table content during upload.
Table extraction does not support scanned images of tables.
Custom table summarization with prompt templates
Vectara supports table summarization with custom prompt templates during document upload. A template controls how the LLM interprets and summarizes table data during extraction. By customizing the prompt_template, you can tailor summaries for domain-specific language, analytical perspectives, or formatting preferences.
Image support
You can include images in structured documents using the Indexing API with Base64 encoding. You cannot send images directly with individual query requests. If you want to retrieve a specific image that is embedded within a document, use the Retrieve image API.
Request
Responses
- 201: The extracted document has been parsed and added to the corpus.
- 400: Upload file request was malformed.
- 403: Permissions do not allow uploading a file to the corpus.
- 404: Corpus not found.
- 415: The media type of the uploaded file is not supported.
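A client might map these status codes to messages when reporting upload results. The following is an illustrative sketch only: the names `UPLOAD_STATUS_MESSAGES` and `describe_upload_status` are hypothetical, and the messages are copied from the responses listed above.

```python
# Status codes and descriptions as listed for this endpoint.
UPLOAD_STATUS_MESSAGES = {
    201: "The extracted document has been parsed and added to the corpus.",
    400: "Upload file request was malformed.",
    403: "Permissions do not allow uploading a file to the corpus.",
    404: "Corpus not found.",
    415: "The media type of the uploaded file is not supported.",
}


def describe_upload_status(status_code: int) -> str:
    """Return the documented description, or a fallback for other codes."""
    return UPLOAD_STATUS_MESSAGES.get(
        status_code, f"Unexpected status code: {status_code}"
    )


def upload_succeeded(status_code: int) -> bool:
    """Only 201 is documented as a successful upload."""
    return status_code == 201
```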