Upload a file to the corpus
POST /v2/corpora/:corpus_key/upload_file
Upload files such as PDFs and Word Documents for automatic text extraction and metadata parsing.
The request body must use the multipart/form-data format and contain the following parts:
metadata
- (Optional) Specifies a JSON object representing any additional metadata to be associated with the extracted document. For example: 'metadata={"key": "value"};type=application/json'
chunking_strategy
- (Optional) Specifies the chunking strategy for the platform to use. If you do not set this option, the platform uses the default strategy, which creates one chunk per sentence. For example: 'chunking_strategy={"type":"max_chars_chunking_strategy","max_chars_per_chunk":200};type=application/json'
file
- Specifies the file that you want to upload.
filename
- Specified as part of the file field. Sets the file name that you want to associate with the uploaded file. With curl, use the following syntax: 'file=@/path/to/file/file.pdf;filename=desired_filename.pdf'
For more detailed information, see this File Upload API guide.
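The parts listed above can be assembled by hand. The following is a minimal sketch using only the Python standard library, not an official client. The helper name build_upload_body and the application/octet-stream content type on the file part are assumptions made for illustration. Authentication and the actual HTTP call are left out because this section does not describe them.

```python
import json
import uuid


def build_upload_body(file_bytes, filename, metadata=None, chunking_strategy=None):
    """Return (content_type, body) for a multipart/form-data upload request.

    Hypothetical helper: part names follow the list above, everything else
    (boundary choice, file part content type) is an assumption.
    """
    boundary = uuid.uuid4().hex
    chunks = []

    def add_part(headers, payload):
        # Each part starts with the boundary line, then headers, a blank
        # line, the payload bytes, and a trailing CRLF.
        chunks.append(f"--{boundary}\r\n".encode("ascii"))
        for header in headers:
            chunks.append(f"{header}\r\n".encode("utf-8"))
        chunks.append(b"\r\n")
        chunks.append(payload)
        chunks.append(b"\r\n")

    # Optional JSON parts, each declared as application/json.
    if metadata is not None:
        add_part(
            ['Content-Disposition: form-data; name="metadata"',
             "Content-Type: application/json"],
            json.dumps(metadata).encode("utf-8"),
        )
    if chunking_strategy is not None:
        add_part(
            ['Content-Disposition: form-data; name="chunking_strategy"',
             "Content-Type: application/json"],
            json.dumps(chunking_strategy).encode("utf-8"),
        )

    # Required file part. The filename is not escaped here, so it must not
    # contain double quotes or line breaks.
    add_part(
        [f'Content-Disposition: form-data; name="file"; filename="{filename}"',
         "Content-Type: application/octet-stream"],
        file_bytes,
    )

    # Closing boundary.
    chunks.append(f"--{boundary}--\r\n".encode("ascii"))
    return f"multipart/form-data; boundary={boundary}", b"".join(chunks)


content_type, body = build_upload_body(
    b"placeholder file bytes",
    "desired_filename.pdf",
    metadata={"key": "value"},
    chunking_strategy={"type": "max_chars_chunking_strategy",
                       "max_chars_per_chunk": 200},
)
```

Send body with content_type as the Content-Type header of the POST request. In practice an HTTP library that builds multipart bodies for you is usually a better choice than hand-rolled encoding.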
Request
Path Parameters
Possible values: <= 50 characters, Value must match regular expression [a-zA-Z0-9_\=\-]+$
The unique key identifying the corpus to which the file is uploaded.
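A corpus key can be checked on the client before the request is sent. This is a sketch under assumptions: the function names are hypothetical, and it assumes the whole key must match the documented pattern (the pattern as written only anchors the end).

```python
import re

# Pattern and length limit copied from the path parameter description.
CORPUS_KEY_PATTERN = re.compile(r"[a-zA-Z0-9_\=\-]+$")
MAX_CORPUS_KEY_LENGTH = 50


def is_valid_corpus_key(corpus_key: str) -> bool:
    """Return True if the key satisfies the documented constraints."""
    if len(corpus_key) > MAX_CORPUS_KEY_LENGTH:
        return False
    # fullmatch requires the entire string to match (assumption, see above).
    return CORPUS_KEY_PATTERN.fullmatch(corpus_key) is not None


def upload_path(corpus_key: str) -> str:
    """Build the request path for the upload endpoint."""
    if not is_valid_corpus_key(corpus_key):
        raise ValueError(f"invalid corpus key: {corpus_key!r}")
    return f"/v2/corpora/{corpus_key}/upload_file"
```

For example, upload_path("my-corpus") yields the path for a corpus keyed my-corpus, while a key containing a space raises ValueError before any network call is made.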
Header Parameters
Possible values: >= 1
The API will make a best effort to complete the request in the specified seconds or time out.
Possible values: >= 1
The API will make a best effort to complete the request in the specified milliseconds or time out.
- multipart/form-data
Body
Upload a file for the Vectara platform to attempt to parse and turn into a document within the corpus. The first part of the multipart request can contain any document metadata to attach to the parsed document. Only one document may be uploaded at a time.
- MaxCharsChunkingStrategy
metadata object
Arbitrary object that will be attached as document metadata to the extracted document.
chunking_strategy object
(Optional) Choose how to split documents into chunks during indexing. If you do not set a chunking strategy, the platform uses the default strategy, which creates one chunk (docpart) per sentence.
Default value: max_chars_chunking_strategy
When setting the type to max_chars_chunking_strategy, you can control the size of chunks (docparts).
Possible values: >= 100
Specifies the maximum number of characters per chunk.
The platform adds sentences to a chunk until the total number of characters exceeds the limit.
If a single sentence exceeds the limit, it splits the sentence across chunks. Note: This is the only case where the chunk may not contain a complete sentence.
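The value of the chunking_strategy part can be produced by a small helper that enforces the documented minimum before the request is sent. This is a sketch only, and the function name is hypothetical. The field names and the lower bound of 100 come from the description above.

```python
import json

# Lower bound taken from "Possible values: >= 100".
MIN_MAX_CHARS_PER_CHUNK = 100


def max_chars_chunking_part(max_chars_per_chunk: int) -> str:
    """Return the JSON string for the chunking_strategy multipart section."""
    if max_chars_per_chunk < MIN_MAX_CHARS_PER_CHUNK:
        raise ValueError(
            f"max_chars_per_chunk must be >= {MIN_MAX_CHARS_PER_CHUNK}, "
            f"got {max_chars_per_chunk}"
        )
    return json.dumps(
        {
            "type": "max_chars_chunking_strategy",
            "max_chars_per_chunk": max_chars_per_chunk,
        }
    )
```

Checking the bound on the client avoids a round trip that would presumably end in a 400 response.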
Optional multipart section to override the filename.
Binary file contents. The name of the file is used as the document ID.
Responses
- 201
- 400
- 403
- 404
The extracted document has been parsed and added to the corpus.
- application/json
- Schema
- Example (from schema)
Schema
The document ID.
metadata object
The document metadata.
parts object[]
The parts that make up the document. Parts are not available when retrieving a list of documents or when creating a document. This property is only available when retrieving a document by ID.
The text of the document part.
metadata object
The metadata for a document part. These may be used in metadata filters at query time if filter attributes are configured on the corpus.
The context text for the document part.
custom_dimensions object
The custom dimensions as additional weights.
storage_usage object
How much storage the document used. This information is currently returned only when indexing a document, not when retrieving it.
Number of bytes used by the document that count towards the maximum corpus size and towards any billing plans.
Number of metadata bytes used by a document.
{
  "id": "my-doc-id",
  "metadata": {},
  "parts": [
    {
      "text": "I'm a nice document part.",
      "metadata": {
        "nice_rank": 9000
      },
      "context": "string",
      "custom_dimensions": {}
    }
  ],
  "storage_usage": {
    "bytes_used": 0,
    "metadata_bytes_used": 0
  }
}
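A 201 response body shaped like the example above can be reduced to the fields most callers need. The sketch below is illustrative: the UploadResult class and parse_upload_response function are hypothetical names, and missing storage_usage values are treated as 0 by assumption.

```python
import json
from dataclasses import dataclass


@dataclass
class UploadResult:
    document_id: str
    bytes_used: int
    metadata_bytes_used: int


def parse_upload_response(raw: str) -> UploadResult:
    """Extract the document ID and storage usage from a 201 response body."""
    payload = json.loads(raw)
    # storage_usage may be absent or null; fall back to an empty object.
    usage = payload.get("storage_usage") or {}
    return UploadResult(
        document_id=payload["id"],
        bytes_used=usage.get("bytes_used", 0),
        metadata_bytes_used=usage.get("metadata_bytes_used", 0),
    )
```

The parts array is ignored here because, per the schema description, parts are not available when creating a document.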
Upload file request was malformed.
- application/json
- Schema
- Example (from schema)
Schema
field_errors object
The errors that relate to specific fields in the request.
The ID of the request that can be used to help Vectara support debug what went wrong.
{
  "field_errors": {},
  "messages": [
    "string"
  ],
  "request_id": "string"
}
Permissions do not allow uploading a file to the corpus.
- application/json
- Schema
- Example (from schema)
Schema
The messages describing why the error occurred.
The ID of the request that can be used to help Vectara support debug what went wrong.
{
  "messages": [
    "Internal server error."
  ],
  "request_id": "string"
}
Corpus not found.
- application/json
- Schema
- Example (from schema)
Schema
The ID cannot be found.
The ID of the request that can be used to help Vectara support debug what went wrong.
{
  "id": "string",
  "messages": [
    "string"
  ],
  "request_id": "string"
}
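The 400, 403, and 404 bodies share the messages and request_id fields, so one function can turn any of them into a readable line for logs or support tickets. This is a sketch under assumptions: describe_error is a hypothetical helper, and the output format is arbitrary.

```python
import json

# Summaries taken from the response descriptions in this section.
STATUS_SUMMARIES = {
    400: "Upload file request was malformed.",
    403: "Permissions do not allow uploading a file to the corpus.",
    404: "Corpus not found.",
}


def describe_error(status: int, raw_body: str) -> str:
    """Summarize an error response, tolerating missing or non-JSON bodies."""
    try:
        payload = json.loads(raw_body)
    except json.JSONDecodeError:
        payload = {}
    if not isinstance(payload, dict):
        payload = {}

    summary = STATUS_SUMMARIES.get(status, f"Unexpected status {status}.")
    details = list(payload.get("messages") or [])
    # field_errors is only documented for the 400 response.
    for field, problem in (payload.get("field_errors") or {}).items():
        details.append(f"{field}: {problem}")

    text = f"{status} {summary}"
    if details:
        text += " " + "; ".join(details)
    request_id = payload.get("request_id")
    if request_id:
        # Include the request ID so support can trace the request.
        text += f" (request_id: {request_id})"
    return text
```

Keeping the request_id in the message matters most, since it is the value support uses to debug what went wrong.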