Version: 2.0

Create a corpus

POST /v2/corpora

Create a corpus, which is a container that stores documents and their associated metadata. In the request, you define the unique corpus_key that identifies the corpus. You can choose the corpus_key according to your preferred naming convention, which makes it easier to manage the corpus's data and reference it in queries. For more information, see Corpus Key Definition.

Request

Header Parameters

    Request-Timeout integer

    Possible values: >= 1

    The API makes a best effort to complete the request within the specified number of seconds; otherwise, the request times out.

    Request-Timeout-Millis integer

    Possible values: >= 1

    The API makes a best effort to complete the request within the specified number of milliseconds; otherwise, the request times out.
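The following Python sketch shows one way the two timeout headers might be attached to a POST /v2/corpora request. It only builds the request object and does not send it. The base URL, the x-api-key authentication header, and the helper name build_create_corpus_request are assumptions for illustration and are not stated in this section; the minimal body uses only the required key field described under Body.

```python
import json
import urllib.request

# Assumption: base URL of the API; it is not given in this section.
API_BASE = "https://api.vectara.io"


def build_create_corpus_request(api_key, body, timeout_seconds=None, timeout_millis=None):
    """Build (but do not send) a POST /v2/corpora request.

    Hypothetical helper: only the path, the HTTP method, and the two
    timeout header names come from this reference.
    """
    headers = {
        "Content-Type": "application/json",
        "x-api-key": api_key,  # assumption: API-key authentication header
    }
    # Both timeout headers are optional integers with a minimum of 1.
    if timeout_seconds is not None:
        if timeout_seconds < 1:
            raise ValueError("Request-Timeout must be >= 1")
        headers["Request-Timeout"] = str(int(timeout_seconds))
    if timeout_millis is not None:
        if timeout_millis < 1:
            raise ValueError("Request-Timeout-Millis must be >= 1")
        headers["Request-Timeout-Millis"] = str(int(timeout_millis))

    return urllib.request.Request(
        url=API_BASE + "/v2/corpora",
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )


# Example: ask the API to give up after roughly 30 seconds.
request = build_create_corpus_request(
    "YOUR_API_KEY", {"key": "my-corpus"}, timeout_seconds=30
)
```

Passing the resulting object to urllib.request.urlopen would send it; that step is left out here so the sketch runs without network access or credentials.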

Body

    key CorpusKey required

    Possible values: <= 50 characters, Value must match regular expression [a-zA-Z0-9_\=\-]+$

    A user-provided key for a corpus.
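As a rough illustration, a client could check a candidate key against these constraints before sending the request. The sketch below assumes that the whole key must match the documented pattern (the pattern as written only anchors the end of the string) and that an empty key is rejected; the function name is_valid_corpus_key is hypothetical.

```python
import re

# Pattern and length limit copied from the "Possible values" line above.
CORPUS_KEY_PATTERN = re.compile(r"[a-zA-Z0-9_\=\-]+")
MAX_CORPUS_KEY_LENGTH = 50


def is_valid_corpus_key(key: str) -> bool:
    """Return True if the key fits the documented length and character rules.

    Assumption: the entire key must match, so fullmatch() is used.
    """
    if len(key) > MAX_CORPUS_KEY_LENGTH:
        return False
    return CORPUS_KEY_PATTERN.fullmatch(key) is not None


print(is_valid_corpus_key("support-docs_2024"))  # letters, digits, '-', '_'
print(is_valid_corpus_key("support docs"))       # contains a space
```

The server remains the authority on what it accepts; a local check like this only catches obvious mistakes early.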

    name string

    The name for the corpus. This value defaults to the key.

    description string

    Description of the corpus.

    queries_are_answers boolean

    Default value: false

    Queries made to this corpus are considered answers, and not questions.

    documents_are_questions boolean

    Default value: false

    Documents inside this corpus are considered questions, and not answers.

    encoder_id string deprecated

    Possible values: Value must match regular expression enc_[0-9]+$

    Deprecated: Use encoder_name instead.

    encoder_name string

    The name of the encoder used by the corpus, such as boomerang-2023-q3.

    filter_attributes object[]

    The new filter attributes of the corpus. If unset, the corpus has no filter attributes.

    Array [
    name string required

    The JSON path of the filter attribute in a document or document part metadata.

    level string required

    Possible values: [document, part]

    Indicates whether this is a document or document part metadata filter.

    description string

    Description of the filter. May be omitted.

    indexed boolean

    Default value: true

    Indicates whether an index should be created for the filter. Creating an index will improve query latency when using the filter.

    type string required

    Possible values: [integer, real_number, text, boolean, list[integer], list[real_number], list[text]]

    The value type of the filter.

    ]
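The filter attribute fields above can be assembled into the filter_attributes array along these lines. This is a sketch only: the helper make_filter_attribute and the attribute names lang and page are hypothetical, while the allowed level and type values are taken from the field descriptions.

```python
# Allowed values copied from the "Possible values" lines above.
ALLOWED_LEVELS = {"document", "part"}
ALLOWED_TYPES = {
    "integer", "real_number", "text", "boolean",
    "list[integer]", "list[real_number]", "list[text]",
}


def make_filter_attribute(name, level, type_, description=None, indexed=True):
    """Build one filter_attributes entry, rejecting unknown enum values."""
    if level not in ALLOWED_LEVELS:
        raise ValueError(f"level must be one of {sorted(ALLOWED_LEVELS)}")
    if type_ not in ALLOWED_TYPES:
        raise ValueError(f"type must be one of {sorted(ALLOWED_TYPES)}")
    attribute = {"name": name, "level": level, "type": type_, "indexed": indexed}
    if description is not None:  # description may be omitted
        attribute["description"] = description
    return attribute


# Hypothetical attributes: one at document level, one at part level.
filter_attributes = [
    make_filter_attribute("lang", "document", "text", description="Document language"),
    make_filter_attribute("page", "part", "integer", indexed=False),
]
```

The resulting list would be placed under the filter_attributes field of the request body.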
    custom_dimensions object[]

    A custom dimension is an additional numerical field attached to a document part. At query time, you can multiply this field by a query-time custom dimension of the same name. This allows you to boost (or bury) document parts for arbitrary reasons. This feature is enabled only for Pro and Enterprise customers.

    Array [
    name string required

    The name of the custom dimension.

    description string

    Description of the custom dimension.

    indexing_default double

    Default value of a custom dimension on a document part if the custom dimension value is not specified when the document part is indexed.

    A value of 0 means that the custom dimension is not considered.

    querying_default double

    Default value of a custom dimension for a query if the value of the custom dimension is not specified when querying the corpus.

    A value of 0 means that the custom dimension is not considered.

    ]
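To make the role of indexing_default and querying_default concrete, the sketch below resolves the value on each side and multiplies the two, as the custom_dimensions description states. It is an illustration under assumptions: the function dimension_products and the dimension name importance are hypothetical, a missing default is treated as 0, and this section does not describe how the product is combined with the final relevance score.

```python
def dimension_products(dimensions, part_values, query_values):
    """Multiply each document-part dimension by its query-time counterpart.

    dimensions:   list of custom_dimensions entries from the corpus definition
    part_values:  values supplied when the document part was indexed
    query_values: values supplied with the query
    """
    products = {}
    for dimension in dimensions:
        name = dimension["name"]
        # Fall back to the corpus-level defaults when a value is not supplied.
        index_value = part_values.get(name, dimension.get("indexing_default", 0.0))
        query_value = query_values.get(name, dimension.get("querying_default", 0.0))
        products[name] = index_value * query_value
    return products


dimensions = [{"name": "importance", "indexing_default": 0.0, "querying_default": 0.0}]

# Both sides supply a value: 2.0 * 0.5
boosted = dimension_products(dimensions, {"importance": 2.0}, {"importance": 0.5})

# The query omits the dimension, so querying_default (0) applies and the
# dimension contributes nothing.
ignored = dimension_products(dimensions, {"importance": 2.0}, {})
```

With a default of 0 on either side, the product is 0, which matches the statement that a value of 0 means the custom dimension is not considered.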

Responses

The corpus has been created.

Schema
    id string

    Possible values: Value must match regular expression crp_[0-9]+$

    Vectara ID of the corpus.

    key CorpusKey

    Possible values: <= 50 characters, Value must match regular expression [a-zA-Z0-9_\=\-]+$

    A user-provided key for a corpus.

    name string

    Name for the corpus. This value defaults to the key.

    description string

    Corpus description.

    enabled boolean

    Specifies whether the corpus is enabled.

    chat_history_corpus boolean

    Indicates that this corpus does not store documents and stores chats instead.

    queries_are_answers boolean

    Default value: false

    Queries made to this corpus are considered answers, and not questions. This swaps the semantics of the encoder used at query time.

    documents_are_questions boolean

    Default value: false

    Documents inside this corpus are considered questions, and not answers. This swaps the semantics of the encoder used at indexing.

    encoder_id string deprecated

    Possible values: Value must match regular expression enc_[0-9]+$

    The encoder used by the corpus. Deprecated: Use encoder_name instead.

    encoder_name string

    The name of the encoder used by the corpus, such as boomerang-2023-q3.

    filter_attributes object[]

    The new filter attributes of the corpus.

    Array [
    name string required

    The JSON path of the filter attribute in a document or document part metadata.

    level string required

    Possible values: [document, part]

    Indicates whether this is a document or document part metadata filter.

    description string

    Description of the filter. May be omitted.

    indexed boolean

    Default value: true

    Indicates whether an index should be created for the filter. Creating an index will improve query latency when using the filter.

    type string required

    Possible values: [integer, real_number, text, boolean, list[integer], list[real_number], list[text]]

    The value type of the filter.

    ]
    custom_dimensions object[]

    The custom dimensions of all document parts inside the corpus.

    Array [
    name string required

    The name of the custom dimension.

    description string

    Description of the custom dimension.

    indexing_default double

    Default value of a custom dimension on a document part if the custom dimension value is not specified when the document part is indexed.

    A value of 0 means that the custom dimension is not considered.

    querying_default double

    Default value of a custom dimension for a query if the value of the custom dimension is not specified when querying the corpus.

    A value of 0 means that the custom dimension is not considered.

    ]
    limits object
    used_docs int64

    The number of documents contained in the corpus.

    used_parts int64

    The number of document parts contained in the corpus.

    used_bytes int64

    NOTE: This field is currently not populated by the system. The number of bytes contained in the corpus. This includes the document metadata, document part metadata, and document contents.

    used_characters int64

    The number of characters contained in the corpus. This includes the document metadata, document part metadata, and document contents.

    max_bytes int64

    NOTE: This field is currently not populated by the system. The maximum size of the corpus, in bytes.

    max_metadata_bytes int64

    The maximum size, in bytes, of the metadata on a document.

    index_rate int64

    NOTE: This field is currently not populated by the system. The maximum number of new documents that can be added to the corpus per second.

    created_at date-time

    Indicates when the corpus was created.
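A client reading this response might handle it roughly as follows. The payload below is invented for illustration and is not real API output; only the field names and the id pattern come from the schema above. The sketch reads the limits fields with dict.get so that fields the system does not currently populate (used_bytes, max_bytes, index_rate) come back as None instead of raising an error. The function name summarize_corpus is hypothetical.

```python
import json
import re
from datetime import datetime

# Invented example payload; field names follow the schema above.
sample_response = json.loads("""
{
  "id": "crp_123",
  "key": "my-corpus",
  "name": "my-corpus",
  "enabled": true,
  "limits": {"used_docs": 0, "used_parts": 0, "used_characters": 0},
  "created_at": "2024-01-01T00:00:00Z"
}
""")


def summarize_corpus(corpus):
    """Extract a few fields from a create-corpus response body."""
    if not re.fullmatch(r"crp_[0-9]+", corpus["id"]):
        raise ValueError("unexpected corpus id format")
    limits = corpus.get("limits", {})
    # Replace a trailing 'Z' so fromisoformat() also works before Python 3.11.
    created_at = datetime.fromisoformat(corpus["created_at"].replace("Z", "+00:00"))
    return {
        "id": corpus["id"],
        "key": corpus["key"],
        "used_docs": limits.get("used_docs"),
        "used_bytes": limits.get("used_bytes"),  # may be absent: not populated
        "created_at": created_at,
    }


summary = summarize_corpus(sample_response)
```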
