Upload a file to the corpus
POST /v2/corpora/:corpus_key/upload_file
The File Upload API extracts text from unstructured documents in common file types such as PDF, Microsoft Word, plain text, HTML, and Markdown. It also supports extracting table data from PDFs, which improves analysis and querying of structured tabular data. Each file you upload can be up to 10 MB in size. We recommend the File Upload API when you have not already written your own extraction logic.
This endpoint expects a multipart/form-data request with the following fields:
- metadata: An optional JSON object containing additional metadata to associate with the document.
  Example: metadata={"key": "value"}
- chunking_strategy: An optional JSON object that sets the chunking method for text extraction.
  - By default, the platform uses sentence-based chunking (one chunk per sentence).
  - Example for explicit sentence chunking:
    chunking_strategy={"type":"sentence_chunking_strategy"}
  - Example for max chars chunking:
    chunking_strategy={"type":"max_chars_chunking_strategy","max_chars_per_chunk":512}
- table_extraction_config: An optional JSON object to control table extraction from supported file types (e.g., PDF).
  Example: table_extraction_config={"extract_tables": true}
- file: The file to upload. Attach your file as the value for this field.
- filename: The desired name for the uploaded file. Specify as part of the file field in your request.
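The JSON-valued form fields above can be assembled programmatically. The following is a minimal Python sketch, not an official client: the helper name build_form_fields and its parameters are hypothetical, and it only serializes the optional fields described above into the strings a multipart/form-data request would carry.

```python
import json


def build_form_fields(metadata=None, max_chars_per_chunk=None, extract_tables=None):
    """Return the optional upload form fields as JSON strings.

    Hypothetical helper for illustration; the field names and JSON shapes
    follow the examples in the field list above.
    """
    fields = {}
    if metadata is not None:
        fields["metadata"] = json.dumps(metadata)
    if max_chars_per_chunk is not None:
        # Switch from the default sentence chunking to max chars chunking.
        fields["chunking_strategy"] = json.dumps(
            {
                "type": "max_chars_chunking_strategy",
                "max_chars_per_chunk": max_chars_per_chunk,
            }
        )
    if extract_tables is not None:
        fields["table_extraction_config"] = json.dumps(
            {"extract_tables": extract_tables}
        )
    return fields


fields = build_form_fields(
    metadata={"key": "value"}, max_chars_per_chunk=512, extract_tables=True
)
for name, value in fields.items():
    print(f"{name}={value}")
```

Omitting an argument leaves the corresponding field out of the request, so the platform defaults (such as sentence-based chunking) apply.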
In addition to these fields, the server expects valid credentials in the HTTP headers: either a JWT token or, as in the example below, an API key passed in the x-api-key header.
curl -L -X POST 'https://api.vectara.io/v2/corpora/:corpus_key/upload_file' \
-H 'Content-Type: multipart/form-data' \
-H 'Accept: application/json' \
-H 'x-api-key: zwt_123456' \
-F 'metadata={"key": "value"};type=application/json' \
-F 'file=@/path/to/file/file.pdf;filename=desired_filename.pdf'
Filenames with Non-ASCII Characters
When uploading files with non-ASCII (non-English) characters, such as Russian or Chinese, ensure that the filename is URL encoded. API v2 follows web standards which require URL-encoded file names.
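As an illustration of the encoding step, here is a small Python sketch using the standard library's urllib.parse.quote. The helper name encode_filename and the sample filename are assumptions made for this example; they are not part of the API.

```python
from urllib.parse import quote, unquote


def encode_filename(filename: str) -> str:
    """Percent-encode a filename as UTF-8 so it contains only ASCII."""
    # safe="" also encodes "/" so the result is a single name, not a path.
    return quote(filename, safe="")


encoded = encode_filename("файл.pdf")
print(encoded)           # %D1%84%D0%B0%D0%B9%D0%BB.pdf
print(unquote(encoded))  # файл.pdf
```

The encoded string is what you would pass as the filename in the file field of the request.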
Image Support
You can include images in structured documents using the Indexing API with Base64 encoding. You cannot send images directly with individual query requests. If you want to retrieve a specific image that is embedded within a document, use the Retrieve Image API.
Set the Document ID
The Content-Disposition header lets you specify the Document ID of a file
when you use the following format:
Content-Disposition: form-data; name="file"; filename="your_document_id"
where name is the form field name (file), and filename is the Document ID
that you want. The primary purpose of this header is to describe the
form-data part, so using filename as the Document ID is specific to our
platform. For more information about Content-Disposition, see
the Mozilla documentation on headers.
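To make the header concrete, the sketch below hand-encodes the single multipart part that carries the file, with the Document ID in the filename parameter. It is illustrative only: the function name encode_file_part, the sample Document ID, and the application/pdf part type are assumptions, and most HTTP clients generate this part for you.

```python
import uuid


def encode_file_part(document_id: str, content: bytes, boundary: str) -> bytes:
    """Encode one multipart/form-data part for the `file` field.

    The filename parameter of Content-Disposition carries the Document ID.
    Assumes document_id contains no double quotes or line breaks.
    """
    header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{document_id}"\r\n'
        "Content-Type: application/pdf\r\n"  # assumed type for this example
        "\r\n"
    )
    closing = f"\r\n--{boundary}--\r\n"
    return header.encode("utf-8") + content + closing.encode("utf-8")


boundary = uuid.uuid4().hex
body = encode_file_part("quarterly-report-2024", b"%PDF-1.7 ...", boundary)
print(body.decode("latin-1"))
```

The same boundary string must also appear in the request's Content-Type header, in the form multipart/form-data; boundary=<boundary>.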
Uploading PDFs with Tables
Set extract_tables to true in the table_extraction_config field to extract
table data from a PDF. This feature is particularly useful for financial
reports like 10-Q, 10-K, and S-1 filings. With table extraction enabled, you
can query specific table cells using the Query API.
This feature does not support extracting data from scanned-in images of tables.
Custom Table Summarization with Prompt Templates
Vectara supports table summarization using custom prompt templates during
document upload. By customizing the prompt_template, you control how the
LLM interprets and summarizes table data during extraction, and you can
tailor summaries for domain-specific language, analytical perspectives, or
formatting preferences.
Attach Additional Metadata
You can attach additional metadata to the file by specifying a metadata
form field, which can contain a JSON string:
{ "filesize": 1234 }
Responses
- 201: The extracted document has been parsed and added to the corpus.
- 400: Upload file request was malformed.
- 403: Permissions do not allow uploading a file to the corpus.
- 404: Corpus not found.
- 415: The media type of the uploaded file is not supported.
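A client can map these status codes to readable messages when handling the upload response. The following Python sketch is one possible approach; the names UPLOAD_STATUS_MEANINGS and describe_upload_status are hypothetical, and the messages are taken from the response descriptions above.

```python
# Status codes documented for the upload endpoint and their meanings.
UPLOAD_STATUS_MEANINGS = {
    201: "The extracted document has been parsed and added to the corpus.",
    400: "Upload file request was malformed.",
    403: "Permissions do not allow uploading a file to the corpus.",
    404: "Corpus not found.",
    415: "The media type of the uploaded file is not supported.",
}


def describe_upload_status(status_code: int) -> str:
    """Return the documented meaning of a status code, or a fallback."""
    return UPLOAD_STATUS_MEANINGS.get(
        status_code, f"Unexpected status code: {status_code}"
    )


def upload_succeeded(status_code: int) -> bool:
    """Only 201 indicates that the document was added to the corpus."""
    return status_code == 201


print(describe_upload_status(415))
```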