Metadata filters
This section helps you learn about metadata filter expressions and how to use them with your data.
- What are metadata filters?
- Document-level and part-level metadata
- Using metadata
- Functions and operators
- Data types
- Metadata use case examples
What are metadata filters?
Metadata filter expressions are attached to queries and to their
corpus keys. These filter expressions serve to restrict the search to only the
part of the corpus that matches the expression. In both form and function,
they are a simpler version of a WHERE clause's search condition
in ANSI SQL, see §7.6.
A filter expression operates on the metadata attached to documents that are indexed in Vectara. Because you can associate this metadata with either the entire document or specific parts within it, you must explicitly specify the scope for every metadata reference in the expression.
When defining filter attributes in the UI, do not include the 'doc.' prefix.
Only use the prefix when writing filter expressions.
After you define filter attributes, you can use them within your queries.
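As a rough illustration of attaching a filter expression to a corpus key, the sketch below builds a query request body in Python. The field names (query, search, corpora, corpus_key, metadata_filter), the corpus key my-corpus, and the attribute doc.category are assumptions made for this example; check them against the Query API reference before relying on them.

```python
import json


def build_query_body(query_text: str, corpus_key: str, metadata_filter: str) -> dict:
    """Pair a corpus key with a filter expression in one request body.

    All field names here are assumptions for illustration only.
    """
    return {
        "query": query_text,
        "search": {
            "corpora": [
                {
                    "corpus_key": corpus_key,
                    # The expression carries the doc. prefix, unlike the
                    # attribute definition in the UI.
                    "metadata_filter": metadata_filter,
                }
            ]
        },
    }


body = build_query_body(
    "What is deep learning?", "my-corpus", "doc.category = 'Science'"
)
print(json.dumps(body, indent=2))
```

Keeping the filter next to the corpus key makes it clear which corpus the expression restricts.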
Document-level and part-level metadata
Metadata can be associated with the entire document (document-level) or with
specific sections of the document (part-level). The valid scope prefixes are
doc. and part., for document-level and part-level metadata, respectively.
When indexing data in Vectara, you associate metadata at these levels:
- Document-level scope
  - Applied across the entire document. Use document-level filtering for metadata that does not vary and remains consistent across the whole document.
  - Examples:
    doc.author = 'John Doe' and doc.publication_year > 2024
    doc.publication_date >= '2023-01-01' AND doc.publication_date < '2024-01-01' AND doc.category IN ('Technology', 'Science')
- Part-level scope
  - Applied to specific sections or chunks within a document. Use part-level filtering when properties vary within different parts of the document.
  - Examples:
    part.section = 'Introduction'
    part.clause_type = 'Liability' AND part.risk_level = 25 AND part.is_boilerplate = false
For more information, check out some of our Metadata Examples and Use Cases.
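To make the two scopes concrete, here is a minimal Python sketch of a toy in-memory document. The Document and Part classes and the resolve helper are hypothetical and exist only for this illustration; they are not part of any Vectara SDK.

```python
from dataclasses import dataclass, field


@dataclass
class Part:
    text: str
    metadata: dict = field(default_factory=dict)


@dataclass
class Document:
    metadata: dict
    parts: list


def resolve(reference: str, doc: Document, part: Part):
    """Look up a scoped reference such as 'doc.author' or 'part.section'."""
    scope, _, name = reference.partition(".")
    if scope == "doc":
        return doc.metadata.get(name)
    if scope == "part":
        return part.metadata.get(name)
    raise ValueError(f"reference must start with 'doc.' or 'part.': {reference!r}")


doc = Document(
    metadata={"author": "John Doe"},
    parts=[
        Part("Opening remarks", {"section": "Introduction"}),
        Part("Limits of liability", {"section": "Terms"}),
    ],
)

# doc.author is the same for every part; part.section differs per part.
authors = [resolve("doc.author", doc, p) for p in doc.parts]
sections = [resolve("part.section", doc, p) for p in doc.parts]
print(authors)
print(sections)
```

The same document-level value is visible from every part, which is why metadata that never varies within a document belongs at the document level.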
Metadata data types
When creating metadata fields for filtering, you must define the appropriate data type for each field. Vectara supports the following metadata field types:
- Integer: Stores signed whole numbers. Suitable for fields like year, count, or ID. For example:
  doc.publication_year = 2021
- Real Number (Float): Stores decimal numbers, often used for scores, percentages, or measurements. For example:
  part.sentiment_score > 0.7
- Text: Stores UTF-8 strings. Ideal for storing names, categories, or labels. For example:
  doc.category = 'Science'
- Boolean: Stores true or false values, commonly used for toggles or binary states. For example:
  doc.is_featured = true
- Null: Indicates the absence of a metadata field for a document. For example:
  doc.status IS NULL
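When filter expressions are assembled in application code, each data type above needs its own literal form. The following Python sketch shows one way to do that. The helper names are hypothetical, and escaping a single quote by doubling it follows SQL convention, which is an assumption here rather than confirmed behavior.

```python
def to_filter_literal(value) -> str:
    """Render a Python value as a literal for a filter expression.

    Doubling single quotes inside text is an assumption borrowed from SQL.
    """
    # bool must be checked before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return repr(value)
    if isinstance(value, str):
        return "'" + value.replace("'", "''") + "'"
    raise TypeError(f"unsupported metadata type: {type(value).__name__}")


def equals(field_ref: str, value) -> str:
    """Build an equality test, using IS NULL for a missing value."""
    if value is None:
        return f"{field_ref} IS NULL"
    return f"{field_ref} = {to_filter_literal(value)}"


print(equals("doc.publication_year", 2021))
print(equals("doc.category", "Science"))
print(equals("doc.is_featured", True))
print(equals("doc.status", None))
```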
Selecting the best data type
Select the correct data type to ensure your queries run efficiently and produce accurate results. Consider these tips:
- Match the data to the type:
  - Use numeric types (integer or float) for numerical comparisons and calculations.
  - Use text for fields that require exact matches or keyword searches.
  - Use integer for year values to enable range queries, instead of text.
- Avoid mixing types: Keep numerical data in numeric types and text data in text fields. Mixing them can cause inefficient queries and unexpected behavior. A bad example is using doc.year = '2021' (as a text field).
- Indexing considerations: Metadata fields marked as indexed: True allow faster querying but may increase storage overhead. Choose indexing selectively based on usage patterns.
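The advice against storing numbers as text can be seen with plain Python comparisons. This sketch shows the general hazard of comparing numeric strings character by character; it uses Python's own string ordering and does not claim to reproduce how Vectara compares text values.

```python
years_as_text = ["2021", "999", "1987"]
years_as_int = [2021, 999, 1987]

# Text compares character by character, so "999" counts as greater than "2000".
text_matches = [y for y in years_as_text if y > "2000"]

# Integers compare by value, so only 2021 passes the range test.
int_matches = [y for y in years_as_int if y > 2000]

print(text_matches)
print(int_matches)
```

The text comparison lets "999" through a filter meant to select years after 2000, which is the kind of unexpected behavior a numeric type avoids.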
Using metadata
To effectively use metadata filters in Vectara, you need to configure the metadata fields that your queries can filter on. This process involves creating or updating metadata attributes during corpus setup or for an existing corpus.
This section explains how to create or add metadata filters and provides helpful context for planning and implementing metadata fields. Let's look at the ways to create and add metadata filters to your corpus data.
- Add metadata during corpus creation: Use the Create Corpus API.
- Upload documents with metadata: Use the File Upload API.
- Update metadata for an existing corpus: Use the Update Corpus Document API.
- Replace metadata for an existing corpus: Use the Replace Corpus Document Metadata API.
Updating or replacing metadata applies only to document-level metadata.
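As a sketch of what planning metadata fields for a corpus might look like in code, the Python below assembles filter attribute definitions and rejects names that carry a scope prefix, since attributes are defined without the doc. prefix. The body layout and field names (key, filter_attributes, name, level, type, indexed) and the allowed level values are assumptions for this example; consult the Create Corpus API reference for the actual schema.

```python
def filter_attribute(name: str, level: str, type_: str, indexed: bool = True) -> dict:
    """Describe one filterable metadata field (hypothetical schema)."""
    # Definitions use the bare name; the scope prefix belongs in expressions.
    if name.startswith(("doc.", "part.")):
        raise ValueError("define the attribute without a doc. or part. prefix")
    if level not in ("document", "part"):
        raise ValueError("level must be 'document' or 'part'")
    return {"name": name, "level": level, "type": type_, "indexed": indexed}


corpus_body = {
    "key": "my-corpus",  # hypothetical corpus key
    "filter_attributes": [
        filter_attribute("publication_year", "document", "integer"),
        filter_attribute("section", "part", "text"),
    ],
}
print(corpus_body)
```

A query against this corpus would then reference the fields as doc.publication_year and part.section.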
Default metadata filters
A few metadata fields are filterable out of the box, including the document ID, the language, and the title flag. These filters are useful in a variety of situations.
doc.id field
Each document is assigned a unique identifier at indexing. You can use the
doc.id field to retrieve or filter specific Document IDs in your corpus.
Valid filter expressions include:
doc.id = 'my-document-2023.pdf'
doc.id = 'my-document-2022.pdf' OR 'my-document-2023.pdf'
doc.id = 'my-document-2023.pdf' AND 'my-document-2024.pdf'
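When the list of document IDs comes from application code, a small helper can assemble the expression. The sketch below uses the IN operator, which appears in the operator table and in the document-level examples. The function name is hypothetical, and escaping quotes by doubling them is an assumption borrowed from SQL.

```python
def doc_id_filter(doc_ids: list) -> str:
    """Build a doc.id filter for one or more document IDs."""
    if not doc_ids:
        raise ValueError("at least one document ID is required")
    # Quote each ID; doubling embedded single quotes is an assumption.
    quoted = ["'" + d.replace("'", "''") + "'" for d in doc_ids]
    if len(quoted) == 1:
        return f"doc.id = {quoted[0]}"
    return "doc.id IN (" + ", ".join(quoted) + ")"


print(doc_id_filter(["my-document-2023.pdf"]))
print(doc_id_filter(["my-document-2022.pdf", "my-document-2023.pdf"]))
```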
part.lang field
Each section of a document is evaluated for its language at index time, and the
part.lang field is added with a 3-character lowercase language code
(ISO 639-2). For example, if the section was detected as English, then
part.lang would contain eng, and if it was detected as German, then part.lang
would contain deu.
Valid filter expressions for this field include:
part.lang = 'eng'
part.lang = 'deu'
part.lang = 'eng' OR part.lang = 'deu'
part.is_title field
When adding content, Vectara adds a special Boolean
field to indicate whether the field is a title field or not. This is useful
for a few different cases depending on how you model your data. For example,
some users want to only match on a title field, or never match on a title field,
in which case this field can be used to filter.
This field uses three-valued logic: true, false, and unset. It is designed this way to avoid creating excess metadata, because customers are billed for metadata. Here is how it works, using a search for "neural networks" and an example document:
- Title: "Neural Networks and Deep Learning"
- Section 1: "Introduction to Neural Networks"
- Section 2: "Applications of Neural Networks in AI"
- Section 3: "Conclusion"

- To filter for only title fields, use part.is_title = true. You get results with "neural networks" in the title, such as "Neural Networks and Deep Learning".
- To return only non-title sections, use part.is_title = false. You get results for sections that contain "neural networks" but are not titles, such as "Introduction to Neural Networks," "Applications of Neural Networks in AI," and "Conclusion". You do not get titles with that term in the results.
- However, not all documents have titles. To include sections with no title set, use part.is_title <> true. You could get a variety of results that do not have specific title designations but contain the term "neural networks".
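The three cases described for part.is_title can be modeled with a toy Python example in which the field is true, false, or absent. This only mirrors the behavior described in this section; the sample parts and helper functions are hypothetical, and the third part is invented to show the unset case.

```python
parts = [
    {"text": "Neural Networks and Deep Learning", "is_title": True},
    {"text": "Introduction to Neural Networks", "is_title": False},
    {"text": "Notes on neural networks"},  # is_title is unset
]


def titles_only(items):
    """Models part.is_title = true."""
    return [p["text"] for p in items if p.get("is_title") is True]


def non_titles(items):
    """Models part.is_title = false; unset parts are excluded."""
    return [p["text"] for p in items if p.get("is_title") is False]


def not_marked_title(items):
    """Models part.is_title <> true; includes false and unset parts."""
    return [p["text"] for p in items if p.get("is_title") is not True]


print(titles_only(parts))
print(non_titles(parts))
print(not_marked_title(parts))
```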
Functions and operators
Most operators in Vectara have the same precedence and are left-associative. Use parentheses to enforce a different precedence.
The following table indicates the supported operators and their precedence (highest to lowest). Non-binary operators do not specify associativity.
| Operator | Associativity | Description |
|---|---|---|
| +, - | - | unary plus and minus |
| *, /, % | left | multiplication, division, modulo |
| +, - | left | addition, subtraction |
| <, <=, >, >= | left | comparison |
| =, ==, !=, <> | left | comparison |
| IS NULL, IS NOT NULL | - | NULL comparison |
| IN | - | range containment |
| NOT | - | logical negation |
| AND | left | logical conjunction |
| OR | left | logical disjunction |
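The effect of AND ranking above OR in the table can be illustrated with Python's own and/or operators, which follow the same relative ordering. This is an analogy only; the variable names are hypothetical stand-ins for conditions such as doc.is_featured = true.

```python
is_featured = True   # stands in for doc.is_featured = true
is_science = False   # stands in for doc.category = 'Science'
is_recent = False    # stands in for doc.publication_year > 2020

# AND binds tighter than OR, so these two groupings are equivalent.
implicit = is_featured or is_science and is_recent
explicit = is_featured or (is_science and is_recent)

# Parentheses force the OR to be evaluated first, which changes the result.
regrouped = (is_featured or is_science) and is_recent

print(implicit, explicit, regrouped)
```

Writing the parentheses explicitly, even when they match the default grouping, makes a mixed AND/OR filter easier to review.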
These operators provide a powerful way to filter and retrieve documents. By using them effectively, users can create complex queries to find the most relevant documents for their specific use cases. Let's look at these operators in more detail:
Unary plus and minus operators (+, -)
The unary plus and minus operators indicate a positive or negative numeric value. Use them when you need to filter documents based on numeric fields that can have both positive and negative values, such as scores, ratings, or temperatures.
For example, to filter documents by a score above or below a signed threshold:
- Unary plus: Filter documents with a score greater than positive 10:
  doc.score > 10
- Unary minus: Filter documents with a score less than negative 5:
  doc.score < -5