Create a corpus
POST /v2/corpora
The Create Corpus API lets you create a corpus to store and manage your documents. A corpus is a container for documents and their associated metadata. When creating a corpus, you can specify various settings such as the corpus key, name, description, encoder, and filter attributes.
Corpus object
When you create a corpus object, the corpus_key property is required and uniquely identifies the corpus. The name parameter is optional and defaults to the value of the key. The optional description property lets you provide additional information about the corpus. You can choose a custom corpus_key that follows a naming convention of your choice, which makes your corpora easier to identify, manage, and reference in your application.
You can specify whether to treat queries or documents in the corpus as questions or answers using the queries_are_answers and documents_are_questions boolean properties. These settings affect the semantics of the encoder used at query time and indexing time.
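As a rough illustration of the properties described above, the following sketch assembles a request body as a plain dictionary. The helper name build_corpus_request is hypothetical, and sending the corpus key in a JSON field named "key" is an assumption (this page refers to it both as corpus_key and key), so check the field names against the request schema before relying on them.

```python
import json
from typing import Any, Dict, Optional


def build_corpus_request(
    corpus_key: str,
    name: Optional[str] = None,
    description: Optional[str] = None,
    queries_are_answers: Optional[bool] = None,
    documents_are_questions: Optional[bool] = None,
) -> Dict[str, Any]:
    """Assemble a create-corpus request body (hypothetical helper)."""
    if not corpus_key:
        raise ValueError("corpus_key is required")

    # Assumption: the corpus key travels in a JSON field named "key".
    body: Dict[str, Any] = {"key": corpus_key}

    # Optional properties are omitted when unset so the service applies
    # its own defaults (for example, name falling back to the key).
    optional = {
        "name": name,
        "description": description,
        "queries_are_answers": queries_are_answers,
        "documents_are_questions": documents_are_questions,
    }
    body.update({k: v for k, v in optional.items() if v is not None})
    return body


payload = build_corpus_request(
    "support-faq-2024",
    description="Customer support FAQ articles",
    documents_are_questions=True,
)
print(json.dumps(payload, indent=2))
```

Leaving unset properties out of the body, rather than sending explicit nulls or guessed defaults, keeps the sketch neutral about what the service does when a value is missing.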
Add metadata as filter attributes
When creating a corpus with this endpoint or the Vectara Console, you define metadata fields using the filter_attributes object. This ensures the corpus supports filtering on specific metadata attributes, either at the document level or the part level.
Filter attributes enable you to attach metadata to your data at the document (doc) or part level, which you can use later in filter expressions to narrow the scope of your queries. A filter attribute must specify a unique name (up to 64 characters long) and a level, which indicates whether it exists in the doc-level or part-level metadata. At indexing time, metadata with this name is extracted and made available for filter expressions to operate on.
Doc and part filter levels
The doc attribute applies to the entire document. Use this for metadata that is consistent across the whole document, such as author, publication date, and document ID.
The part attribute applies to specific sections or chunks within a document. Use this for metadata that may vary between different parts of the document, such as sections, page numbers, and sentiment scores.
If indexed is true, the system will build an index on the extracted values to further improve the performance of filter expressions involving the attribute.
Filter attributes must specify a type, which is validated when documents are indexed. The four supported types are integer, which stores signed whole-number values up to eight bytes in length; real, which stores floating-point values in IEEE 754 8-byte format; text, which stores textual strings in UTF-8 encoding; and boolean, which stores true/false values.
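The constraints on filter attributes can be mirrored in a small client-side check before a request is sent. This is only a sketch: the function names are hypothetical, and the level and type strings ("doc", "part", "integer", "real", "text", "boolean") follow the wording on this page, so the exact values accepted by the request schema are an assumption.

```python
from typing import Any, Dict

# Values taken from the prose above; the schema's exact strings may differ.
ALLOWED_LEVELS = {"doc", "part"}
ALLOWED_TYPES = {"integer", "real", "text", "boolean"}
MAX_NAME_LENGTH = 64
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1  # signed eight-byte range


def make_filter_attribute(
    name: str, level: str, attr_type: str, indexed: bool = False
) -> Dict[str, Any]:
    """Build one filter attribute definition after basic validation."""
    if not name or len(name) > MAX_NAME_LENGTH:
        raise ValueError("name must be 1 to 64 characters long")
    if level not in ALLOWED_LEVELS:
        raise ValueError(f"level must be one of {sorted(ALLOWED_LEVELS)}")
    if attr_type not in ALLOWED_TYPES:
        raise ValueError(f"type must be one of {sorted(ALLOWED_TYPES)}")
    return {"name": name, "level": level, "type": attr_type, "indexed": indexed}


def value_matches_type(attr_type: str, value: Any) -> bool:
    """Check whether a metadata value fits the declared attribute type."""
    if attr_type == "boolean":
        return isinstance(value, bool)
    if isinstance(value, bool):
        return False  # bool is a subclass of int in Python; reject it here
    if attr_type == "integer":
        return isinstance(value, int) and INT64_MIN <= value <= INT64_MAX
    if attr_type == "real":
        return isinstance(value, (int, float))
    if attr_type == "text":
        return isinstance(value, str)
    raise ValueError(f"unknown type: {attr_type}")


year = make_filter_attribute("publication_year", "doc", "integer", indexed=True)
sentiment = make_filter_attribute("sentiment_score", "part", "real")
print(year)
print(value_matches_type("integer", 2021), value_matches_type("integer", "2021"))
```

The service performs its own type validation at indexing time, so a check like this only helps catch mistakes earlier; it does not replace that validation.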
After you define filter attributes, you can use them within your queries. For example:
- Document-level attribute: doc.publication_year > 2020
- Part-level attribute: part.sentiment_score > 0.7
Custom dimensions
Custom dimensions let you add context to your data in the form of user-defined values, in addition to what Vectara automatically extracts and stores from the text. For example, upvotes can be a custom dimension. For more information, see Add custom dimensions to boost content.
Request
Responses
- 201: The response message returns a unique id that you use to reference the corpus. The name does not need to be unique within an account.
- 400: Invalid request body in the create corpus request.
- 403: Permissions do not allow creating a corpus.
- 409: The corpus already exists.
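A caller might branch on these status codes as in the following sketch. The base URL and the x-api-key header are assumptions that do not appear on this page, and create_corpus and classify_status are hypothetical helper names; only the four status codes and their meanings come from the list above.

```python
import json
import urllib.error
import urllib.request
from typing import Any, Dict, Tuple

# Assumption: host and auth header are not specified on this page.
API_URL = "https://api.vectara.io/v2/corpora"

STATUS_MEANINGS = {
    201: "created",          # corpus created; body carries its unique id
    400: "invalid_request",  # invalid request body
    403: "forbidden",        # permissions do not allow creating a corpus
    409: "already_exists",   # the corpus already exists
}


def classify_status(status_code: int) -> str:
    """Map a documented status code to a short label."""
    return STATUS_MEANINGS.get(status_code, "unexpected")


def create_corpus(body: Dict[str, Any], api_key: str) -> Tuple[str, Dict[str, Any]]:
    """POST the body and return (label, parsed JSON response or {})."""
    request = urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-api-key": api_key},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request) as response:
            status, raw = response.status, response.read()
    except urllib.error.HTTPError as err:
        # 4xx responses surface as HTTPError; the body is still readable.
        status, raw = err.code, err.read()
    try:
        parsed = json.loads(raw) if raw else {}
    except ValueError:
        parsed = {}  # non-JSON body; keep only the status label
    return classify_status(status), parsed


print(classify_status(409))
```

Treating 409 as its own outcome lets a setup script that runs repeatedly distinguish "already created" from a genuine failure such as 400 or 403.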