Metadata filters
This section helps you learn about metadata filter expressions and how to use them with your data.
- What are metadata filters?
- Document-level and part-level metadata
- Using metadata
- Functions and operators
- Data types
- Metadata use case examples
What are metadata filters?
Metadata filter expressions are attached to queries and to their
corpus keys. These filter expressions serve to restrict the search to only the
part of the corpus that matches the expression. In both form and function,
they are a simpler version of a WHERE clause's search condition
in ANSI SQL (see §7.6).
A filter expression operates on the metadata attached to documents that are indexed in Vectara. Because you can associate this metadata to either the entire document, or to specific parts within it, the scope must be explicitly specified for every metadata reference in the expression.
After you define filter attributes, you can use them within your queries. For example:
[Code example (JSON): filter attribute example]
Document-level and part-level metadata
Metadata can be associated with the entire document (document-level) or
with specific sections of the document (part-level). The valid scopes are doc.
and part., for document-level and part-level metadata, respectively.
When indexing data in Vectara, you associate metadata at these levels:
- Document-level scope
  - Applied across the entire document. Use document-level filtering for metadata that does not vary and remains consistent across the whole document.
  - Examples:
    doc.author = 'John Doe' AND doc.publication_year > 2024
    doc.publication_date >= '2023-01-01' AND doc.publication_date < '2024-01-01' AND doc.category IN ('Technology', 'Science')
- Part-level scope
  - Applied to specific sections or chunks within a document. Use part-level filtering when properties vary within different parts of the document.
  - Examples:
    part.section = 'Introduction'
    part.clause_type = 'Liability' AND part.risk_level = 25 AND part.is_boilerplate = false
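As a rough illustration of the scoping rule, the sketch below lists the scoped references in a filter expression. It is not Vectara's parser: the helper name scoped_references and both regular expressions are assumptions made for this example.

```python
import re

# Text literals are single-quoted; a quote inside is escaped by doubling it.
_STRING_LITERAL = re.compile(r"'(?:[^']|'')*'")
# A scoped reference is "doc." or "part." followed by a field name.
_SCOPED_REF = re.compile(r"\b(doc|part)\.([A-Za-z_]\w*)")


def scoped_references(expression):
    """Return (scope, field) pairs that appear outside string literals."""
    without_strings = _STRING_LITERAL.sub("''", expression)
    return _SCOPED_REF.findall(without_strings)


refs = scoped_references(
    "doc.author = 'John Doe' AND part.section = 'Introduction'"
)
print(refs)  # [('doc', 'author'), ('part', 'section')]
```

Blanking the string literals first keeps text such as 'part.one' inside a quoted value from being mistaken for a metadata reference.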
For more information, check out some of our Metadata Examples and Use Cases.
Metadata data types
When creating metadata fields for filtering, you must define the appropriate data type for each field. Vectara supports the following metadata field types:
- Integer: Stores signed whole numbers. Suitable for fields like year, count, or ID. For example:
  doc.publication_year = 2021
- Real Number (Float): Stores decimal numbers, often used for scores, percentages, or measurements. For example:
  part.sentiment_score > 0.7
- Text: Stores UTF-8 strings. Ideal for storing names, categories, or labels. For example:
  doc.category = 'Science'
- Boolean: Stores true or false values, commonly used for toggles or binary states. For example:
  doc.is_featured = true
- Null: Indicates the absence of a metadata field for a document. For example:
  doc.status IS NULL
Selecting the best data type
Select the correct data type to ensure your queries run efficiently and produce accurate results. Consider these tips:
- Match the data to the type:
  - Use numeric types (integer or float) for numerical comparisons and calculations.
  - Use text for fields that require exact matches or keyword searches.
  - Use integer for year values to enable range queries, instead of text.
- Avoid mixing types: Keep numerical data in numeric types and text data in text fields. Mixing them can cause inefficient queries and unexpected behavior. A bad example is using doc.year = '2021' (as a text field).
- Indexing considerations: Metadata fields marked as indexed: True allow faster querying but may increase storage overhead. Choose indexing selectively based on usage patterns.
Using metadata
To effectively use metadata filters in Vectara, you need to configure the metadata fields that your queries can filter on. This process involves creating or updating metadata attributes during corpus setup or for an existing corpus.
This section explains how to create or add metadata filters and provides helpful context for planning and implementing metadata fields. Let's look at the ways to create and add metadata filters to your corpus data.
- Add metadata during corpus creation: Use the Create Corpus API.
- Upload documents with metadata: Use the File Upload API.
- Update metadata for an existing corpus: Use the Update Corpus Document API.
- Replace metadata for an existing corpus: Use the Replace Corpus Document Metadata API.
Updating or replacing metadata is limited to document-level metadata.
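To make the configuration step more concrete, here is a minimal sketch of how filter attribute definitions might be assembled before sending them to the Create Corpus API. Everything about the payload shape is an assumption for illustration: the field names (key, filter_attributes, name, level, type, indexed), the type strings, and the corpus key are not taken from this page, so check the API reference before relying on them.

```python
import json


def filter_attribute(name, level, value_type, indexed=True):
    """Describe one filterable metadata field (hypothetical payload shape)."""
    if level not in ("document", "part"):
        raise ValueError("level must be 'document' or 'part'")
    return {"name": name, "level": level, "type": value_type, "indexed": indexed}


corpus_request = {
    "key": "cloud-computing-references",  # hypothetical corpus key
    "filter_attributes": [
        filter_attribute("status", "document", "text"),
        filter_attribute("publication_year", "document", "integer"),
        filter_attribute("section", "part", "text", indexed=False),
    ],
}

print(json.dumps(corpus_request, indent=2))
```

Declaring the level per attribute mirrors the doc. and part. scopes used later in filter expressions.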
Default metadata filters
A few metadata fields are filterable out of the box, including document ID, language, and title. These filters are useful in a variety of situations.
doc.id field
Each document is assigned a unique identifier at indexing. You can use the
doc.id field to retrieve or filter specific Document IDs in your corpus.
Valid filter expressions include:
doc.id = 'my-document-2023.pdf'
doc.id = 'my-document-2022.pdf' OR doc.id = 'my-document-2023.pdf'
doc.id IN ('my-document-2023.pdf', 'my-document-2024.pdf')
part.lang field
Each section of a document is evaluated for its language at index time, and the
part.lang field is added with a 3-character lowercase language code
(ISO 639-2). For example, if the section was detected as English, then part.lang
would contain eng, and if it was detected as German, then part.lang would contain deu.
Valid filter expressions for this would be something like:
part.lang = 'eng'
part.lang = 'deu'
part.lang = 'eng' OR part.lang = 'deu'
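When the set of languages comes from user input, the OR chain can be generated rather than typed by hand. The following is a minimal sketch; the helper name lang_filter is hypothetical, and the validation simply enforces the 3-letter lowercase form described above.

```python
import re


def lang_filter(codes):
    """Build an OR chain over part.lang for a list of language codes."""
    if not codes:
        raise ValueError("at least one language code is required")
    for code in codes:
        if not re.fullmatch(r"[a-z]{3}", code):
            raise ValueError(f"expected a 3-letter lowercase code, got {code!r}")
    return " OR ".join(f"part.lang = '{code}'" for code in codes)


print(lang_filter(["eng", "deu"]))  # part.lang = 'eng' OR part.lang = 'deu'
```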
part.is_title field
When adding content, Vectara adds a special Boolean
field to indicate whether a part is a title. This is useful
for a few different cases depending on how you model your data. For example,
some users want to match only on titles, or never on titles,
in which case this field can be used to filter.
This field uses three-valued logic: true, false, and unset. It is designed this way to avoid creating excess metadata, because customers are billed for metadata. Here is how it works, using a search for "neural networks" against an example document:
- Title: "Neural Networks and Deep Learning"
- Section 1: "Introduction to Neural Networks"
- Section 2: "Applications of Neural Networks in AI"
- Section 3: "Conclusion"
- To filter for only title fields, use part.is_title = true. You get results with "neural networks" in the title, such as "Neural Networks and Deep Learning".
- To return only non-title sections, use part.is_title = false. You get results for sections that contain "neural networks" but are not titles, such as "Introduction to Neural Networks," "Applications of Neural Networks in AI," and "Conclusion." You do not get titles with that term in the results.
- However, not all documents have titles. To include sections with no title set, use part.is_title <> true. You could get a variety of results that do not have specific title designations but contain the term "neural networks".
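The three cases can be modeled locally. This sketch only mirrors the behavior described above and is not how the engine evaluates filters; the sample parts, the use of None for "unset", and the helper names are assumptions for illustration.

```python
# None stands for "unset" in this model.
parts = [
    {"text": "Neural Networks and Deep Learning", "is_title": True},
    {"text": "Introduction to Neural Networks", "is_title": False},
    {"text": "Notes on neural networks", "is_title": None},
]


def matches(part, operator, operand):
    """Apply '=' or '<>' to part.is_title, treating unset as its own value."""
    value = part["is_title"]
    if operator == "=":
        return value is operand
    if operator == "<>":
        return value is not operand
    raise ValueError(f"unsupported operator: {operator}")


def select(operator, operand):
    return [p["text"] for p in parts if matches(p, operator, operand)]


print(select("=", True))    # titles only
print(select("=", False))   # parts explicitly marked as non-titles
print(select("<>", True))   # non-titles plus parts with no title flag
```

In this model, only the <> true form picks up parts whose flag was never set, which is why it is the form suggested for documents without titles.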
Functions and operators
Most operators in Vectara have the same precedence and are left-associative. Use parentheses to enforce a different precedence.
The following table indicates the supported operators and their precedence (highest to lowest). Non-binary operators do not specify associativity.
| Operator | Associativity | Description |
|---|---|---|
| +, - | - | unary plus and minus |
| *, /, % | left | multiplication, division, modulo |
| +, - | left | addition, subtraction |
| <, <=, >, >= | left | comparison |
| =, ==, !=, <> | left | comparison |
| IS NULL, IS NOT NULL | - | NULL comparison |
| IN | - | range containment |
| NOT | - | logical negation |
| AND | left | logical conjunction |
| OR | left | logical disjunction |
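Python's not, and, and or happen to have the same relative ordering as NOT, AND, and OR in the table, so a short Python snippet can show why grouping matters. This is only an analogy, assuming the filter engine follows the table; the variable names are made up for the example.

```python
status_is_active = True
status_is_pending = False
score_above_90 = False

# AND binds tighter than OR, so these two lines are equivalent.
implicit = status_is_active or status_is_pending and score_above_90
explicit = status_is_active or (status_is_pending and score_above_90)

# Parentheses force the OR to be evaluated first and change the result.
regrouped = (status_is_active or status_is_pending) and score_above_90

print(implicit, explicit, regrouped)  # True True False
```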
These operators provide a powerful way to filter and retrieve documents. By using them effectively, users can create complex queries to find the most relevant documents for their specific use cases. Let's look at these operators in more detail:
Unary plus and minus operators (+, -)
The unary plus and minus operators indicate a positive or negative numeric value. Use them when you need to filter documents based on numeric fields that can have both positive and negative values, such as scores, ratings, or temperatures.
For example, filter documents with a score above or below a signed threshold:
- Unary plus - Filters documents with a score greater than positive 10:
  doc.score > +10
- Unary minus - Filters documents with a score less than negative 5:
  doc.score < -5
Multiplication, division, and modulo operators (*, /, %)
These operators multiply, divide, and find the remainder of numeric values. Use them for tasks such as calculating prices with taxes, determining the number of pages or items per group, or finding documents with specific numeric patterns.
For example, filter on a price after a markup, on the number of pages per group, or on page counts divisible by a given number.
- Multiplication - Filters documents where the price increased by 10% is greater than 100:
doc.price * 1.1 > 100
- Division - Filters documents where the total number of pages divided by 10 is less than 20:
doc.totalpages / 10 < 20
- Modulo - Filters documents where the page count is divisible by 3:
doc.pagecount % 3 = 0
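To check what these three conditions select, the same arithmetic can be run over a few sample records in Python. This is a local sketch with made-up documents, not a call to Vectara. It also assumes that / behaves like ordinary division, which this page does not specify for integer fields.

```python
docs = [
    {"id": "a", "price": 95.0, "totalpages": 150, "pagecount": 10},
    {"id": "b", "price": 80.0, "totalpages": 250, "pagecount": 9},
]


def ids_where(predicate):
    """Return the ids of the sample documents that satisfy the predicate."""
    return [d["id"] for d in docs if predicate(d)]


print(ids_where(lambda d: d["price"] * 1.1 > 100))     # ['a']
print(ids_where(lambda d: d["totalpages"] / 10 < 20))  # ['a']
print(ids_where(lambda d: d["pagecount"] % 3 == 0))    # ['b']
```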
Addition and subtraction operators (+, -)
The addition and subtraction operators perform arithmetic on numeric values. Use them for tasks like adjusting scores or prices based on specific criteria or comparing values with a certain threshold.
For example, filter on scores above a specific number or prices after discount.
- Addition - Filters documents where the score plus 10 is greater than or equal to 80:
  doc.score + 10 >= 80
- Subtraction - Filters documents where the price minus the discount is less than or equal to 50:
  doc.price - doc.discount <= 50
Less and greater comparison operators (<, <=, >, >=)
These comparison operators filter documents by ordering conditions. Use them for a wide range of cases, such as finding documents within a certain price range, date range, or range of any other numeric or comparable values.
For example, filter on prices below a specific number, ratings below a threshold, publish dates after a specific date, and scores above a specific number.
- Less than (<) - Filters documents where the price is less than 100:
  doc.price < 100
- Less than or equal to (<=) - Filters documents where the rating is less than or equal to 4.5:
  doc.rating <= 4.5
- Greater than (>) - Filters documents published after January 1, 2022:
  doc.publishdate > '2022-01-01'
- Greater than or equal to (>=) - Filters documents with a score greater than or equal to 80:
  doc.score >= 80
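The publish-date example compares a field against a text literal. Assuming publishdate is stored as text in zero-padded YYYY-MM-DD form and that text values are compared lexicographically (neither is stated on this page), string order and calendar order coincide, as this small check illustrates:

```python
from datetime import date

publish_dates = ["2021-12-31", "2022-01-01", "2022-03-15", "2023-07-04"]
threshold = "2022-01-01"

# Comparing the strings directly...
by_text = [d for d in publish_dates if d > threshold]

# ...selects the same values as comparing real calendar dates.
by_date = [
    d for d in publish_dates
    if date.fromisoformat(d) > date.fromisoformat(threshold)
]

print(by_text)  # ['2022-03-15', '2023-07-04']
print(by_text == by_date)  # True
```

The equivalence breaks for formats such as 1/2/2022 or 2022-1-2, so a consistent zero-padded format matters if dates are kept as text.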
Equality and inequality operators (=, ==, !=, <>)
These comparison operators check whether the two sides of an expression are equal or unequal. Use them for filtering documents based on specific values of fields, such as categories, statuses, or names.
For example, filter on a specific category or status, or filter all except for that category.
- Equals (= or ==) - Filters documents where the category is "Technology", or where the status is "active":
  doc.category = 'Technology'
  doc.status == 'active'
- Does not equal (!= or <>) - Filters documents where the category is not "Sports", or where the category is not "Entertainment":
  doc.category != 'Sports'
  doc.category <> 'Entertainment'
NULL comparison operators (IS NULL, IS NOT NULL)
These operators check whether a value is NULL (empty or missing). Use them for filtering documents based on the presence or absence of values in specific fields.
For example, filter on documents with no author, or only on documents that have a description.
- Value is null - Filters documents where the author field is empty or missing:
  doc.author IS NULL
- Value is not null - Filters documents where the description field has a value:
  doc.description IS NOT NULL
Range containment operator (IN)
The IN operator checks if a value is within a specified set. Use it for
filtering documents based on multiple possible values for a field,
such as categories, tags, or statuses.
For example, filter on two specific categories or statuses.
- Value is in a category - Filters documents where the category is either "Science" or "History":
  doc.category IN ('Science', 'History')
- Value is a particular status - Filters documents where the status is either "active" or "pending":
  doc.status IN ('active', 'pending')
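If the list of allowed values is only known at run time, the IN clause can be assembled programmatically. This is a minimal sketch with a hypothetical helper name, in_filter. It quotes each value in single quotes and doubles any embedded single quote, following the text literal syntax described in the Data types table.

```python
def in_filter(field, values):
    """Build '<field> IN (...)' from a non-empty list of text values."""
    if not values:
        raise ValueError("IN needs at least one value")
    quoted = ", ".join("'" + v.replace("'", "''") + "'" for v in values)
    return f"{field} IN ({quoted})"


print(in_filter("doc.category", ["Science", "History"]))
# doc.category IN ('Science', 'History')
```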
Negation operator (NOT)
The NOT operator negates a condition, returning documents that do
not match the specified criteria. Use it for excluding certain documents
based on specific field values.
For example, filter on everything except a specific category, or on everything not below a certain score.
- Value is not in a specific category - Filters documents where the category is not "Technology":
  NOT (doc.category = 'Technology')
- Value is not less than a score of 50 - Filters documents where the score is greater than or equal to 50:
  NOT (doc.score < 50)
Conjunction operator (AND)
The AND operator combines multiple conditions, requiring all conditions to
be true. Use it for narrowing down search results based on multiple
factors.
For example, filter on score and publish date ranges, or on a specific category and author.
- Specify score and publish date - Filters documents with a score greater than 80 and published after January 1, 2022:
  doc.score > 80 AND doc.publishdate > '2022-01-01'
- Specify category and author - Filters documents where the category is "Technology" and the author is "John Smith":
  doc.category = 'Technology' AND doc.author = 'John Smith'
Logical disjunction (OR)
The OR operator combines multiple conditions, requiring at least one
condition to be true. Use it for broadening search results based on
multiple possible values.
For example, filter on documents in one of two specific categories, or on documents that are either active or above a certain score.
- Specify one of two possible categories - Filters documents where the category is either "Technology" or "Business":
  doc.category = 'Technology' OR doc.category = 'Business'
- Specify one of two attributes - Filters documents where the status is "active" or the score is greater than 90:
  doc.status = 'active' OR doc.score > 90
Operator combinations
Combining different operators enables you to create more specific filtering conditions. By using parentheses and combining these operators in different ways, you can narrow or broaden your query results to find the most relevant documents. The following examples show combinations such as "IN and AND," "NOT and AND," "OR and AND," and "NOT, IN, and AND."
- Specify one of two possible categories and a published year - Filters documents where the category is either "Science" or "Technology" AND the published year is greater than 2020:
  doc.category IN ('Science', 'Technology') AND doc.publishedyear > 2020
- Value is NOT both a status and a category - Filters out documents that are both in the "draft" status AND in the "Technology" category:
  NOT (doc.status = 'draft' AND doc.category = 'Technology')
- Specify a status or qualified score - Filters documents where the status is "active", or where the status is "pending" and the score is greater than 90:
  doc.status = 'active' OR (doc.status = 'pending' AND doc.score > 90)
- Value is not in a category and with a specific score - Filters documents that are NOT in the "Sports" or "Entertainment" category AND have a score greater than or equal to 50:
  NOT (doc.category IN ('Sports', 'Entertainment')) AND doc.score >= 50
- Specify one of two possible categories after a date and with a specific status - Filters documents where the category is either "Business" or "Finance", the publish date is after January 1, 2022, AND the status is "published":
  doc.category IN ('Business', 'Finance') AND doc.publishdate > '2022-01-01' AND doc.status = 'published'
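Where the parentheses sit around NOT changes which documents match. The sketch below uses plain Python booleans and a made-up document to contrast negating a whole conjunction with negating only its first condition. It illustrates the logic only and does not call the filter engine.

```python
doc = {"category": "News", "score": 30}

in_excluded = doc["category"] in ("Sports", "Entertainment")
score_ok = doc["score"] >= 50

# NOT applied to the whole conjunction: true unless BOTH conditions hold.
not_whole = not (in_excluded and score_ok)

# NOT applied to the category test only: the score must still be >= 50.
not_category_only = (not in_excluded) and score_ok

print(not_whole, not_category_only)  # True False
```

A low-scoring document outside the excluded categories passes the first form but not the second, so it is worth checking that the expression matches the intended wording.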
Data types
This section lists the data types supported by Vectara, along with the literal syntax for each, to help you choose the right type for your metadata.
| Data Type | Description | Metadata Literal Syntax |
|---|---|---|
| Integer | The value is a signed integer up to eight bytes in length. | Any number of digits without a period. |
| Real (Float) | The value is a floating point number corresponding to a Java double, and is of IEEE 754 float64 format. | Any number of digits with a period. |
| Text | The value is UTF-8 text. | A string is enclosed in single quotes ('). You can escape a ' inside text by having two quotes (''). |
| Boolean | The value is Boolean. | true or false |
| Null | If metadata is not present, its absence is indicated by NULL. | null |
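The literal syntax in the table can be captured in a small formatting helper. This is a sketch under the rules listed above; the function name to_filter_literal is hypothetical, and the handling of very large or very small floats (falling back to fixed-point notation so the literal still contains a period) is an assumption rather than documented behavior.

```python
import math


def to_filter_literal(value):
    """Render a Python value using the literal syntax from the table."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # check bool before int: bool is an int subclass
        return "true" if value else "false"
    if isinstance(value, int):
        return str(value)
    if isinstance(value, float):
        if not math.isfinite(value):
            raise ValueError("NaN and infinity have no literal form here")
        text = repr(value)
        # Avoid exponent notation so the literal is digits with a period.
        return text if "e" not in text else format(value, "f")
    if isinstance(value, str):
        return "'" + value.replace("'", "''") + "'"
    raise TypeError(f"unsupported type: {type(value).__name__}")


print(to_filter_literal("O'Brien"))  # 'O''Brien'
print(to_filter_literal(0.7))        # 0.7
print(to_filter_literal(True))       # true
```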
Metadata use case examples
Metadata filters enable highly versatile and granular control over query results. This section provides real-world examples and use cases to illustrate how metadata filters can be applied to solve common business and technical challenges.
Language-specific filtering
In multilingual documents, different sections may be in different languages. Use part-level metadata to target specific language segments.
Example: Filter for German-language customer reviews with a rating higher than 3 stars.
[Code example (SQL): language filter]
The lang metadata tag used in this example is detected and set automatically
by the platform at indexing time. It's set at the part level for accuracy,
because a single document may contain content in multiple languages.
Date-specific document retrieval
More complicated expressions are possible, such as the one below, which checks for documents with a publication date in 2021.
Example: Retrieve documents published in 2021 using epoch time.
[Code example (SQL): date-specific filter]
Here, pub_epoch stores the date in epoch time.
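The epoch bounds for a calendar year can be computed instead of looked up. The sketch below assumes pub_epoch holds seconds since the Unix epoch in UTC and is a document-level field (doc.pub_epoch); the page gives the field name but not its unit or scope, and the helper name year_filter is hypothetical.

```python
from datetime import datetime, timezone


def year_filter(field, year):
    """Build a half-open range filter covering one calendar year in UTC."""
    start = int(datetime(year, 1, 1, tzinfo=timezone.utc).timestamp())
    end = int(datetime(year + 1, 1, 1, tzinfo=timezone.utc).timestamp())
    return f"{field} >= {start} AND {field} < {end}"


print(year_filter("doc.pub_epoch", 2021))
# doc.pub_epoch >= 1609459200 AND doc.pub_epoch < 1640995200
```

Using an inclusive lower bound and an exclusive upper bound avoids both gaps and overlaps between adjacent years.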
You can find a full list of supported operations on the Functions and Operators page, and a full list of how to specify literals on Data Types.
Filter by document status
For auditing purposes, you may want to limit results to documents marked as
Published instead of Draft:
doc.status = 'Published'
Filter by custom tag
Custom metadata fields enable filtering based on business-specific criteria, such as priority, category, or internal tags.
Example: Filter documents tagged as High Priority in the Technology category.
[Code example (SQL): business-specific criteria]
Example query with a document-level filter
This example asks the question "What are the key benefits of cloud computing?"
of the Cloud Computing References corpus. Within the corpora object, we
specify a metadata_filter that restricts results to published documents:
"metadata_filter": "doc.status = 'Published'"
[Code example (JSON): metadata example]
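As a rough sketch of how such a request body could be assembled in code: only the corpora object and the metadata_filter key are named in the text above. The surrounding structure (a top-level query string, a search wrapper, a corpus_key field) and the corpus key value are assumptions for illustration and may not match the actual API schema.

```python
import json


def build_query(question, corpus_key, metadata_filter):
    """Assemble a query body with one corpus and one metadata filter."""
    return {
        "query": question,
        "search": {  # assumed wrapper; only "corpora" is named in the text
            "corpora": [
                {"corpus_key": corpus_key, "metadata_filter": metadata_filter}
            ]
        },
    }


body = build_query(
    "What are the key benefits of cloud computing?",
    "cloud-computing-references",  # hypothetical corpus key
    "doc.status = 'Published'",
)
print(json.dumps(body, indent=2))
```

Passing a different filter string, for instance a part-level expression, changes only the metadata_filter value and leaves the rest of the body untouched.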
Example response with a document-level filter
The example response returns documents with "status": "Published" in the document
metadata. This response also shows other metadata associated with each document_id.
[Code example (JSON): response example]
Example query with part-level metadata
Now let's send a query with a part-level metadata filter, part.concept = 'Overview'.
We will only change the metadata_filter value from the previous example so
that it filters for this part-level metadata:
[Code example (JSON): metadata example]
Example response with part-level metadata
[Code example (JSON): part-level metadata example]