Add a document to a corpus
POST /v2/corpora/:corpus_key/documents
Add a document to a corpus. This endpoint supports two document formats, structured and core.
- Structured documents have a more conventional structure: you provide a logical document structure made of sections, and Vectara's proprietary strategy automatically partitions it into document parts.
- Core documents differ in that they follow an advanced, granular structure that explicitly defines each document part in an array. Each part becomes a distinct, searchable item in query results. You have precise control over the document structure and content.
For more details, see Indexing.
Request
Path Parameters
Possible values: <= 50 characters, Value must match regular expression [a-zA-Z0-9_\=\-]+$
The unique key identifying the queried corpus.
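The constraints on the corpus key can be checked on the client before a request is sent. The following is a minimal sketch; the function name `is_valid_corpus_key` is hypothetical and not part of any Vectara SDK. It applies the documented length limit and character set.

```python
import re

# Character set and length limit taken from the path parameter constraints above.
# The trailing "$" of the documented pattern is replaced by fullmatch(), which
# requires the whole string to consist of allowed characters.
CORPUS_KEY_PATTERN = re.compile(r"[a-zA-Z0-9_=\-]+")
MAX_CORPUS_KEY_LENGTH = 50


def is_valid_corpus_key(corpus_key: str) -> bool:
    """Return True if corpus_key satisfies the documented constraints."""
    if len(corpus_key) > MAX_CORPUS_KEY_LENGTH:
        return False
    return CORPUS_KEY_PATTERN.fullmatch(corpus_key) is not None
```

A key containing spaces or more than 50 characters fails this check, so the problem can be reported locally rather than through an error response.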
Header Parameters
Possible values: >= 1
The API will make a best effort to complete the request in the specified seconds or time out.
Possible values: >= 1
The API will make a best effort to complete the request in the specified milliseconds or time out.
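As a rough illustration of how the path and header parameters fit together, the sketch below builds (but does not send) a request with Python's standard library. Several names are assumptions that do not appear on this page: the base URL `https://api.vectara.io`, the `x-api-key` authentication header, and the `Request-Timeout` header name for the timeout in seconds. Check them against your account's API settings before relying on them.

```python
import json
import urllib.request

# Assumptions, not taken from this page: the base URL, the "x-api-key"
# authentication header, and the "Request-Timeout" header name.
BASE_URL = "https://api.vectara.io"


def build_add_document_request(corpus_key, document, api_key, timeout_seconds=None):
    """Build a POST request for /v2/corpora/:corpus_key/documents."""
    headers = {
        "Content-Type": "application/json",
        "Accept": "application/json",
        "x-api-key": api_key,
    }
    if timeout_seconds is not None:
        # The timeout header only accepts values >= 1.
        if timeout_seconds < 1:
            raise ValueError("timeout_seconds must be >= 1")
        headers["Request-Timeout"] = str(timeout_seconds)
    return urllib.request.Request(
        url=f"{BASE_URL}/v2/corpora/{corpus_key}/documents",
        data=json.dumps(document).encode("utf-8"),
        headers=headers,
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it; building the request separately makes it easy to inspect the URL, headers, and body first.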
- application/json
Body
- CoreDocument
- StructuredDocument
- MaxCharsChunkingStrategy
The document ID must be unique within the corpus.
Default value: core
When the type of the indexed document is core, the rest of the object is expected to follow this schema. This schema allows precise specification of document chunks that are translated directly into retrieved search results.
metadata object
Arbitrary object of document level metadata. Properties of this object can be used by document filters if defined as a corpus filter attribute.
tables object[]
The tables that this document contains.
The unique ID of the table within the document.
The title of the table.
data object
The data of the table.
The headers of the table.
The rows in the data.
The description of the table.
document_parts object[]required
Possible values: >= 1
The parts that make up the document.
The text of the document part.
metadata object
The metadata for a document part. These may be used in metadata filters at query time if filter attributes are configured on the corpus.
The ID of the table that this document part belongs to.
The context text for the document part.
custom_dimensions object
The custom dimensions as additional weights.
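To make the core schema concrete, here is a small sketch that assembles a core document body. The helper `make_core_document` is hypothetical. The field names `id`, `type`, `metadata`, and `document_parts` follow the schema above, while `text` and `metadata` inside each part are taken from the part shape shown in the response example, on the assumption that the request uses the same names.

```python
def make_core_document(doc_id, parts, metadata=None):
    """Build a core document body from (text, part_metadata) pairs."""
    # document_parts is required and must contain at least one part.
    if not parts:
        raise ValueError("document_parts requires at least one part")
    return {
        "id": doc_id,
        "type": "core",
        "metadata": metadata or {},
        "document_parts": [
            {"text": text, "metadata": part_metadata}
            for text, part_metadata in parts
        ],
    }


core_document = make_core_document(
    "my-doc-id",
    [
        ("I'm a nice document part.", {"nice_rank": 9000}),
        ("I'm another document part.", {"nice_rank": 1}),
    ],
    metadata={"source": "example"},
)
```

Because each entry in `document_parts` becomes a distinct searchable item, the caller decides exactly where one part ends and the next begins.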
The document ID must be unique within the corpus.
Default value: structured
When the type of the indexed document is structured, the rest of the object is expected to follow this schema. It allows you to create a document that follows normal document conventions. The Vectara platform then creates document parts using its internal algorithm.
The title of the document.
The description of the document.
metadata object
The metadata for a document as an arbitrary JSON object. Properties of this object can be used by document level filter attributes.
custom_dimensions object
The custom dimensions as additional weights.
sections object[]required
Possible values: >= 1
The subsection of the document.
The section ID. This gets converted to a metadata field automatically.
The section title.
The text of the section.
metadata object
Arbitrary object that becomes document part level metadata on any document part created by this section. Properties of this object can be used by document part level filters if defined as a corpus filter attribute.
tables object[]
The tables that this section contains.
The unique ID of the table within the document.
The title of the table.
data object
The data of the table.
The headers of the table.
The rows in the data.
The description of the table.
The sections that this section contains.
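Since sections can contain further sections, a structured document is a small tree. The sketch below shows one possible body and a hypothetical helper, `count_sections`, that walks the nesting. The section field names `title`, `text`, and `sections` are assumptions inferred from the descriptions above; the section ID is omitted because its type is not stated here.

```python
structured_document = {
    "id": "my-doc-id",
    "type": "structured",
    "title": "Example handbook",
    "description": "A short document used for illustration.",
    "metadata": {"lang": "en"},
    "sections": [
        {
            "title": "Introduction",
            "text": "This handbook explains the basics.",
            "metadata": {"chapter": 1},
            # Nested sections use the same shape as top-level sections.
            "sections": [
                {"title": "Audience", "text": "It is written for new users."}
            ],
        },
        {"title": "Setup", "text": "Install the tools before you begin."},
    ],
}


def count_sections(sections):
    """Count sections at every nesting level."""
    return sum(1 + count_sections(s.get("sections", [])) for s in sections)
```

Here `count_sections(structured_document["sections"])` counts two top-level sections plus one nested section. The platform, not the caller, decides how the text in each section is divided into document parts.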
chunking_strategy object
(Optional) Choose how to split documents into chunks during indexing. If you do not set a chunking strategy, the platform uses the default strategy which creates one chunk (docpart) per sentence.
Default value: max_chars_chunking_strategy
When setting the type to max_chars_chunking_strategy, you can control the size of chunks (docparts).
Possible values: >= 100
Specifies the maximum number of characters per chunk.
The platform adds sentences to a chunk until the total number of characters exceeds the limit.
If a single sentence exceeds the limit, it splits the sentence across chunks. Note: This is the only case where the chunk may not contain a complete sentence.
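The packing behavior described above can be approximated locally to get a feel for how a character limit affects chunk boundaries. This is only one reading of the description, not Vectara's actual implementation: it assumes sentences are already split, joins them with a single space, and never lets a chunk exceed the limit. The function name `pack_sentences` is hypothetical, and the field name `max_chars_per_chunk` in the request fragment is an assumption that does not appear on this page.

```python
def pack_sentences(sentences, max_chars):
    """Greedily pack whole sentences into chunks of at most max_chars."""
    chunks = []
    current = ""
    for sentence in sentences:
        if len(sentence) > max_chars:
            # Oversized sentence: flush the open chunk, then split the
            # sentence itself. Only here can a chunk hold a partial sentence.
            if current:
                chunks.append(current)
                current = ""
            for start in range(0, len(sentence), max_chars):
                chunks.append(sentence[start:start + max_chars])
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) > max_chars:
            # The sentence does not fit, so it starts a new chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Request fragment; "max_chars_per_chunk" is an assumed field name.
# The documented minimum for the limit is 100.
chunking_strategy = {
    "type": "max_chars_chunking_strategy",
    "max_chars_per_chunk": 512,
}
```

The sketch accepts any positive limit so that small inputs are easy to try, whereas the API itself only accepts values of 100 or more.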
Responses
- 201
- 400
- 403
- 404
Document added to the corpus.
- application/json
- Schema
- Example (from schema)
Schema
The document ID.
metadata object
The document metadata.
tables object[]
The tables that this document contains. Tables are not available when table extraction is not enabled.
The unique ID of the table within the document.
The title of the table.
data object
The data of the table.
The headers of the table.
The rows in the data.
The description of the table.
parts object[]
The parts that make up the document. Parts are not returned when retrieving a list of documents or when creating a document; this property is only available when retrieving a document by ID.
The text of the document part.
metadata object
The metadata for a document part. These may be used in metadata filters at query time if filter attributes are configured on the corpus.
The context text for the document part.
custom_dimensions object
The custom dimensions as additional weights.
storage_usage object
How much storage the document used. This information is currently returned only when indexing a document, not when retrieving it.
Number of bytes used by the document that count towards the maximum corpus size and towards any billing plans.
Number of metadata bytes used by a document.
{
  "id": "my-doc-id",
  "metadata": {},
  "tables": [
    {
      "id": "table_1",
      "title": "string",
      "data": {
        "headers": [
          [
            {
              "text_value": "string",
              "int_value": 0,
              "float_value": 0,
              "bool_value": true,
              "colspan": 0,
              "rowspan": 0
            }
          ]
        ],
        "rows": [
          [
            {
              "text_value": "string",
              "int_value": 0,
              "float_value": 0,
              "bool_value": true,
              "colspan": 0,
              "rowspan": 0
            }
          ]
        ]
      },
      "description": "string"
    }
  ],
  "parts": [
    {
      "text": "I'm a nice document part.",
      "metadata": {
        "nice_rank": 9000
      },
      "context": "string",
      "custom_dimensions": {}
    }
  ],
  "storage_usage": {
    "bytes_used": 0,
    "metadata_bytes_used": 0
  }
}
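A 201 response body can be read with a few lines of standard-library code. The sketch below uses a trimmed copy of the example above; the helper name `summarize_created_document` is hypothetical. It treats `parts` and `storage_usage` as optional, since the schema notes that parts are not returned when creating a document.

```python
import json

# Trimmed copy of the schema example above.
response_body = """
{
  "id": "my-doc-id",
  "metadata": {},
  "storage_usage": {"bytes_used": 0, "metadata_bytes_used": 0}
}
"""


def summarize_created_document(body: str) -> dict:
    """Extract the document ID and storage figures from a 201 response body."""
    doc = json.loads(body)
    usage = doc.get("storage_usage", {})
    return {
        "id": doc["id"],
        "bytes_used": usage.get("bytes_used", 0),
        "metadata_bytes_used": usage.get("metadata_bytes_used", 0),
        # Parts are documented as unavailable on creation, so this is
        # normally False for this endpoint.
        "has_parts": bool(doc.get("parts")),
    }


summary = summarize_created_document(response_body)
```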
Document creation request was malformed.
- application/json
- Schema
- Example (from schema)
Schema
field_errors object
The errors that relate to specific fields in the request.
The ID of the request that can be used to help Vectara support debug what went wrong.
{
  "field_errors": {},
  "messages": [
    "string"
  ],
  "request_id": "string"
}
Permissions do not allow adding a document to the corpus.
- application/json
- Schema
- Example (from schema)
Schema
The messages describing why the error occurred.
The ID of the request that can be used to help Vectara support debug what went wrong.
{
  "messages": [
    "Internal server error."
  ],
  "request_id": "string"
}
Corpus not found.
- application/json
- Schema
- Example (from schema)
Schema
The ID cannot be found.
The ID of the request that can be used to help Vectara support debug what went wrong.
{
  "id": "string",
  "messages": [
    "string"
  ],
  "request_id": "string"
}
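The three error responses share the `messages` and `request_id` fields, so a single helper can turn any of them into a readable line for logs. This is a sketch under assumptions: the helper `describe_error` is hypothetical, and `field_errors` is assumed to map field names to error strings, which the schema above does not spell out.

```python
# Status descriptions as given in the Responses section.
STATUS_LABELS = {
    400: "Document creation request was malformed.",
    403: "Permissions do not allow adding a document to the corpus.",
    404: "Corpus not found.",
}


def describe_error(status: int, body: dict) -> str:
    """Combine the status description, messages, and request ID into one line."""
    parts = [STATUS_LABELS.get(status, f"Unexpected status {status}.")]
    parts.extend(body.get("messages", []))
    # field_errors only appears on 400 responses; its shape is assumed here.
    for field, error in (body.get("field_errors") or {}).items():
        parts.append(f"{field}: {error}")
    request_id = body.get("request_id")
    if request_id:
        parts.append(f"request_id={request_id}")
    return " | ".join(parts)
```

Including the request ID in the output is useful because it is the value Vectara support asks for when debugging a failed request.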