Create Corpus API Definition
The Create Corpus API lets you create a corpus to store and manage your documents. A corpus is a container for documents and their associated metadata. When creating a corpus, you can specify various settings such as the corpus key, name, description, encoder, and filter attributes.
Corpus Object
When you create a corpus object, the key property is required to uniquely identify the corpus. The name property is optional and defaults to the value of key. The optional description property lets you provide additional information about the corpus.
You can specify whether to treat queries or documents in the corpus as questions or answers using the queries_are_answers and documents_are_questions boolean properties. These settings affect the semantics of the encoder used at query time and indexing time.
The encoder_id property allows you to choose the encoder for the corpus. If not specified, it defaults to the latest Vectara encoder.
To reference metadata in filter expressions, the attributes must be declared at creation time in the filter_attributes array. This list cannot be changed once the corpus is created.
Scale users can specify custom_dimensions to allow weighting of document parts during indexing and querying. Like filter attributes, custom dimensions cannot be changed after corpus creation.
The response message returns a unique id that you use to reference the corpus. The name does not need to be unique within an account.
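As a rough illustration of the properties described above, the following Python sketch assembles a corpus definition as a dictionary. The property names come from this page; the helper name build_corpus_body, the flat JSON layout, and the sample values are assumptions made for illustration, so check the API Reference for the authoritative schema.

```python
import json
from typing import Any, Dict, List, Optional


def build_corpus_body(
    key: str,
    name: Optional[str] = None,
    description: Optional[str] = None,
    queries_are_answers: Optional[bool] = None,
    documents_are_questions: Optional[bool] = None,
    encoder_id: Optional[str] = None,
    filter_attributes: Optional[List[Dict[str, Any]]] = None,
) -> Dict[str, Any]:
    """Assemble a corpus definition (hypothetical helper, not part of any SDK)."""
    if not key:
        raise ValueError("key is required")

    # Mirror the documented default locally: name falls back to key.
    body: Dict[str, Any] = {"key": key, "name": name if name is not None else key}

    # Optional properties are only included when the caller sets them, so the
    # service applies its own defaults (for example, the latest encoder).
    optional = {
        "description": description,
        "queries_are_answers": queries_are_answers,
        "documents_are_questions": documents_are_questions,
        "encoder_id": encoder_id,
    }
    body.update({k: v for k, v in optional.items() if v is not None})

    # Filter attributes cannot be changed after creation, so declare them now.
    if filter_attributes:
        body["filter_attributes"] = list(filter_attributes)
    return body


# Sample values below are made up for illustration.
body = build_corpus_body("support-docs", description="Help center articles")
print(json.dumps(body, indent=2))
```

Leaving unset properties out of the body, rather than sending explicit nulls, keeps the request small and lets the documented defaults apply.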
Filter Attribute
In order to reference metadata in filter expressions, the referenceable attributes must be declared at creation time in the filter attributes. This list cannot be changed once the corpus is created.
For information on custom dimensions, a Scale-only feature, please see Custom Dimensions. Like filter attributes, custom dimensions cannot be changed after the corpus is created.
A filter attribute must specify a name and a level, which indicates whether it exists in the document or part level metadata. At indexing time, metadata with this name will be extracted and made available for filter expressions to operate on.
If indexed is true, the system will build an index on the extracted values to further improve the performance of filter expressions involving the attribute.
Finally, filter attributes must specify a type, which is validated when documents are indexed. The four supported types are integer, which stores signed whole-number values up to eight bytes in length; real, for storing floating point values in IEEE 754 8-byte format; text, for storing textual strings in UTF-8 encoding; and boolean, for storing true/false values.
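The rules above can be mirrored on the client side before a corpus is created. The sketch below is only an illustration: the FilterAttribute class and its accepts method are hypothetical, the level strings "document" and "part" and the default indexed=False are assumptions, and the service performs its own validation when documents are indexed.

```python
from dataclasses import dataclass

SUPPORTED_TYPES = {"integer", "real", "text", "boolean"}
SUPPORTED_LEVELS = {"document", "part"}  # assumed spellings of the two levels
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1  # signed eight-byte integer range


@dataclass(frozen=True)
class FilterAttribute:
    """Client-side model of a filter attribute (hypothetical, for illustration)."""

    name: str
    level: str
    type: str
    indexed: bool = False  # assumed default; the source does not state one

    def __post_init__(self) -> None:
        if not self.name:
            raise ValueError("name is required")
        if self.level not in SUPPORTED_LEVELS:
            raise ValueError(f"unsupported level: {self.level!r}")
        if self.type not in SUPPORTED_TYPES:
            raise ValueError(f"unsupported type: {self.type!r}")

    def accepts(self, value: object) -> bool:
        """Return True if value fits the declared type."""
        if self.type == "boolean":
            return isinstance(value, bool)
        # bool is a subclass of int in Python, so reject it for numeric types.
        if isinstance(value, bool):
            return False
        if self.type == "integer":
            return isinstance(value, int) and INT64_MIN <= value <= INT64_MAX
        if self.type == "real":
            # Python floats are IEEE 754 doubles; whole numbers are allowed too.
            return isinstance(value, (int, float))
        return isinstance(value, str)


year = FilterAttribute(name="year", level="document", type="integer", indexed=True)
print(year.accepts(2024), year.accepts("2024"))
```

Checking values locally like this can surface type mismatches before indexing, which matters because the declared attributes cannot be changed once the corpus exists.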
REST 2.0 URL
Create Corpus REST 2.0 Endpoint
Vectara exposes a REST endpoint at the following URL to create a corpus: https://api.vectara.io/v2/corpora
The API Reference shows the full Create Corpus REST definition.
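A minimal sketch of calling this endpoint from Python with only the standard library might look like the following. The URL is the one given above; the use of POST, the x-api-key header, the JSON body shape, and the placeholder values are assumptions, so confirm them against the API Reference before relying on them.

```python
import json
import urllib.request

CREATE_CORPUS_URL = "https://api.vectara.io/v2/corpora"


def make_create_corpus_request(body: dict, api_key: str) -> urllib.request.Request:
    """Build, but do not send, a create-corpus request (hypothetical helper)."""
    return urllib.request.Request(
        CREATE_CORPUS_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Accept": "application/json",
            # Assumed authentication header; your account may use another scheme.
            "x-api-key": api_key,
        },
        method="POST",
    )


# Placeholder key and corpus key, for illustration only.
request = make_create_corpus_request({"key": "support-docs"}, api_key="YOUR_API_KEY")
print(request.get_method(), request.full_url)

# Sending the request needs a valid key and network access:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

Building the request separately from sending it makes the payload and headers easy to inspect or log before any network call is made.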
gRPC Example
You can find the full Create Corpus gRPC definition at admin.proto.
The CreateCorpusRequest message contains a Corpus message with the name, description, and other customization options. The CreateCorpusResponse provides the response with the new Corpus ID and status.