Upload a file to the corpus
POST /v2/corpora/:corpus_key/upload_file
Upload files such as PDFs and Word documents for automatic text extraction and metadata parsing.
The request must use the multipart/form-data format and can contain the following parts:
metadata
- (Optional) Specifies a JSON object representing any additional metadata to be associated with the extracted document. For example, 'metadata={"key": "value"};type=application/json'
chunking_strategy
- (Optional) Specifies the chunking strategy for the platform to use. If you do not set this option, the platform uses the default strategy, which creates one chunk per sentence. For example, 'chunking_strategy={"type":"max_chars_chunking_strategy","max_chars_per_chunk":200};type=application/json'
table_extraction_config
- (Optional) Specifies whether to extract table data from the uploaded file. If you do not set this option, the platform does not extract tables from PDF files. For example, 'table_extraction_config={"extract_tables":true};type=application/json'
file
- Specifies the file that you want to upload.
filename
- Specified as part of the file field. Sets the file name that you want to associate with the uploaded file. With curl, use the following syntax: 'file=@/path/to/file/file.pdf;filename=desired_filename.pdf'
For more detailed information, see the File Upload API guide.
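As a rough illustration of the parts listed above, the following Python sketch assembles them as `(filename, content, content_type)` tuples, the shape that HTTP clients such as `requests` accept for multipart uploads. The helper names `build_upload_parts` and `upload_path`, the sample values, and the `application/octet-stream` content type for the file part are assumptions made for this example. No request is sent.

```python
import json


def build_upload_parts(file_bytes, filename, metadata=None,
                       max_chars_per_chunk=None, extract_tables=None):
    """Assemble the multipart parts described above (hypothetical helper)."""
    parts = {}
    # Optional JSON parts use a filename of None so they are sent as plain fields.
    # Metadata is added first so it is the first part of the request.
    if metadata is not None:
        parts["metadata"] = (None, json.dumps(metadata), "application/json")
    if max_chars_per_chunk is not None:
        strategy = {
            "type": "max_chars_chunking_strategy",
            "max_chars_per_chunk": max_chars_per_chunk,
        }
        parts["chunking_strategy"] = (None, json.dumps(strategy), "application/json")
    if extract_tables is not None:
        config = {"extract_tables": extract_tables}
        parts["table_extraction_config"] = (None, json.dumps(config), "application/json")
    # The file part carries the desired file name.
    # The content type here is an assumption, not taken from the reference.
    parts["file"] = (filename, file_bytes, "application/octet-stream")
    return parts


def upload_path(corpus_key):
    """Build the endpoint path for a given corpus key."""
    return f"/v2/corpora/{corpus_key}/upload_file"


parts = build_upload_parts(
    b"%PDF-1.4 placeholder bytes",
    "desired_filename.pdf",
    metadata={"key": "value"},
    max_chars_per_chunk=200,
    extract_tables=True,
)
```

With the `requests` library, a dictionary of this shape could be passed as the `files=` argument of `requests.post`. Authentication headers and the host name are deliberately left out of the sketch.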
Request
Path Parameters
corpus_key
Possible values: <= 50 characters. Value must match regular expression [a-zA-Z0-9_\=\-]+$
The unique key identifying the corpus to upload the file to.
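The constraints on the corpus key can be checked on the client before a request is made. The following is a minimal sketch: the function name `is_valid_corpus_key` is hypothetical, and using `re.fullmatch` (which requires the whole string to match) is a slightly stricter reading of the pattern than the end anchor alone implies.

```python
import re

# Character class and length limit taken from the path parameter description.
CORPUS_KEY_PATTERN = re.compile(r"[a-zA-Z0-9_=\-]+")
MAX_CORPUS_KEY_LENGTH = 50


def is_valid_corpus_key(corpus_key: str) -> bool:
    """Return True if the key satisfies the documented length and pattern."""
    return (
        len(corpus_key) <= MAX_CORPUS_KEY_LENGTH
        and CORPUS_KEY_PATTERN.fullmatch(corpus_key) is not None
    )
```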
Header Parameters
Possible values: >= 1
The API will make a best effort to complete the request in the specified seconds or time out.
Possible values: >= 1
The API will make a best effort to complete the request in the specified milliseconds or time out.
- multipart/form-data
Body
Upload a file for the Vectara platform to attempt to parse and turn into a document within the corpus. The first part of the multipart request can contain any document metadata to attach to the parsed document. Only one document may be uploaded at a time.
- MaxCharsChunkingStrategy
metadata object
Arbitrary object that will be attached as document metadata to the extracted document.
chunking_strategy object
(Optional) Choose how to split documents into chunks during indexing. If you do not set a chunking strategy, the platform uses the default strategy, which creates one chunk (docpart) per sentence.
Default value: max_chars_chunking_strategy
When setting the type to max_chars_chunking_strategy, you can control the size of chunks (docparts).
Possible values: >= 100
Specifies the maximum number of characters per chunk.
The platform adds sentences to a chunk until the total number of characters exceeds the limit.
If a single sentence exceeds the limit, the platform splits that sentence across chunks. Note: This is the only case where a chunk may not contain a complete sentence.
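To make the behavior described above more concrete, here is a small Python sketch of sentence-based chunking with a character limit. It is an illustration of one possible reading, not the platform's actual implementation: the function name `chunk_by_max_chars` and the naive sentence splitter are assumptions, and the sketch keeps every chunk at or below the limit.

```python
import re


def chunk_by_max_chars(text: str, max_chars_per_chunk: int) -> list:
    """Group sentences into chunks of at most max_chars_per_chunk characters."""
    if max_chars_per_chunk < 100:
        raise ValueError("max_chars_per_chunk must be >= 100")
    # Naive sentence splitter (assumption): break after ., ! or ? plus whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        # A sentence longer than the limit is the only case that gets split.
        while len(sentence) > max_chars_per_chunk:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars_per_chunk])
            sentence = sentence[max_chars_per_chunk:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars_per_chunk:
            # The sentence still fits, so keep growing the current chunk.
            current = candidate
        else:
            # The sentence does not fit: close the chunk and start a new one.
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

In this sketch, two 61-character sentences followed by one 251-character sentence, chunked with a limit of 150, produce one chunk holding both short sentences and two chunks holding the pieces of the long one.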
table_extraction_config object
(Optional) Configuration for table extraction from the document.
If set to true, the platform will attempt to extract tables from the document. The tables will be indexed as separate document parts.
Optional multipart section to override the filename.
Binary file contents. The file name is used as the document ID.
Responses
- 201
- 400
- 403
- 404
The extracted document has been parsed and added to the corpus.
- application/json
- Schema
- Example (from schema)
Schema
The document ID.
metadata object
The document metadata.
tables object[]
The tables that this document contains. Tables are not available when table extraction is not enabled.
The unique ID of the table within the document.
The title of the table.
data object
The data of the table.
The headers of the table.
The rows in the data.
The description of the table.
parts object[]
The parts that make up the document. Parts are not available when retrieving a list of documents or when creating a document. This property is only available when retrieving a document by ID.
The text of the document part.
metadata object
The metadata for a document part. These may be used in metadata filters at query time if filter attributes are configured on the corpus.
The context text for the document part.
custom_dimensions object
The custom dimensions as additional weights.
storage_usage object
How much storage the document used. This information is currently not returned when retrieving the document, and only returned when indexing a document.
Number of bytes used by the document that count towards the maximum corpus size and towards any billing plans.
Number of metadata bytes used by a document.
extraction_usage object
How much extraction quota the document used. This information is currently not returned when retrieving the document, and only returned when indexing a document.
The number of pages from the document that consumed the extraction quota.
{
  "id": "my-doc-id",
  "metadata": {},
  "tables": [
    {
      "id": "table_1",
      "title": "string",
      "data": {
        "headers": [
          [
            {
              "text_value": "string",
              "int_value": 0,
              "float_value": 0,
              "bool_value": true,
              "colspan": 0,
              "rowspan": 0
            }
          ]
        ],
        "rows": [
          [
            {
              "text_value": "string",
              "int_value": 0,
              "float_value": 0,
              "bool_value": true,
              "colspan": 0,
              "rowspan": 0
            }
          ]
        ]
      },
      "description": "string"
    }
  ],
  "parts": [
    {
      "text": "I'm a nice document part.",
      "metadata": {
        "nice_rank": 9000
      },
      "context": "string",
      "custom_dimensions": {}
    }
  ],
  "storage_usage": {
    "bytes_used": 0,
    "metadata_bytes_used": 0
  },
  "extraction_usage": {
    "table_extraction_used": 0
  }
}
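A client will typically read only a few fields from a 201 response. The sketch below pulls out the document ID, the table IDs, and the storage figures, using only field names that appear in the example above. The function name `summarize_upload_response` and the trimmed sample body are assumptions for illustration.

```python
import json


def summarize_upload_response(body: dict) -> dict:
    """Extract a few fields from a parsed 201 response body."""
    # Tables are absent when table extraction is not enabled.
    tables = body.get("tables") or []
    storage = body.get("storage_usage") or {}
    return {
        "document_id": body.get("id"),
        "table_ids": [table.get("id") for table in tables],
        "bytes_used": storage.get("bytes_used", 0),
        "metadata_bytes_used": storage.get("metadata_bytes_used", 0),
    }


# Trimmed-down sample body based on the schema example.
sample = json.loads(
    '{"id": "my-doc-id", "metadata": {}, "tables": [{"id": "table_1"}],'
    ' "storage_usage": {"bytes_used": 0, "metadata_bytes_used": 0}}'
)
summary = summarize_upload_response(sample)
```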
Upload file request was malformed.
- application/json
- Schema
- Example (from schema)
Schema
field_errors object
The errors that relate to specific fields in the request.
The ID of the request that can be used to help Vectara support debug what went wrong.
{
  "field_errors": {},
  "messages": [
    "string"
  ],
  "request_id": "string"
}
Permissions do not allow uploading a file to the corpus.
- application/json
- Schema
- Example (from schema)
Schema
The messages describing why the error occurred.
The ID of the request that can be used to help Vectara support debug what went wrong.
{
  "messages": [
    "Internal server error."
  ],
  "request_id": "string"
}
Corpus not found.
- application/json
- Schema
- Example (from schema)
Schema
The ID that cannot be found.
The ID of the request that can be used to help Vectara support debug what went wrong.
{
  "id": "string",
  "messages": [
    "string"
  ],
  "request_id": "string"
}
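The 400, 403, and 404 bodies share the messages and request_id fields, with field_errors and id appearing only in some of them. One way to handle all three with a single helper is sketched below. The function name `describe_error` and the output layout are assumptions, and the sample values in the usage note are invented.

```python
def describe_error(status_code: int, body: dict) -> str:
    """Format an error response body as a short, readable report."""
    lines = [f"HTTP {status_code}"]
    for message in body.get("messages") or []:
        lines.append(f"  message: {message}")
    # field_errors appears only in the 400 schema above.
    for field, error in (body.get("field_errors") or {}).items():
        lines.append(f"  field {field}: {error}")
    # id appears only in the 404 schema above.
    if body.get("id"):
        lines.append(f"  not found: {body['id']}")
    # request_id is worth logging so that support can trace the request.
    if body.get("request_id"):
        lines.append(f"  request_id: {body['request_id']}")
    return "\n".join(lines)
```

For example, `describe_error(404, {"id": "my-corpus", "messages": ["Corpus not found."], "request_id": "abc"})` returns a four-line report ending with the request ID.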