Metadata and Filtering
Metadata lets you tag documents and document parts with structured information, such as type, department, creation date, or custom business attributes.
With Vectara, metadata powers precise search, filtering, and vertical-specific retrieval, enabling smarter RAG and analytics use cases.
Metadata is a dictionary of key-value pairs associated with each document or document part. You use metadata to:
- Enable fast filtering (doc.department = 'finance')
- Control vertical-specific queries (doc.type = 'contract')
- Add business context (part.customer_id, doc.location)
- Support structured retrieval for complex applications
This guide assumes you have a corpus called my-docs. If you haven't created a corpus yet, follow the Quick Start guide to set up your first corpus.
Create corpus with filter attributes
Before you can filter by metadata, you must define filter attributes when creating your corpus. These attributes tell Vectara which metadata fields should be indexed for fast filtering.
Critical: Filter attributes must be defined at corpus creation time. You cannot add filter attributes to an existing corpus later.
Filter Attribute Parameters:
- name (string): The metadata field name to make filterable
- level (string): Either "document" or "part", depending on where the metadata is attached
- type (string): Data type, one of "text", "integer", "real", or "boolean"
- indexed (boolean): Set to true for fast filtering performance
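The four parameters above can be combined into a corpus definition. The following is a minimal sketch that only builds the definition as a plain Python dictionary. It does not use the Python SDK. The helper name filter_attribute and the "key" and "filter_attributes" field names are assumptions, so check them against the current API reference before relying on them.

```python
import json

ALLOWED_LEVELS = ("document", "part")
ALLOWED_TYPES = ("text", "integer", "real", "boolean")


def filter_attribute(name, level, type_, indexed=True):
    """Build one filter attribute from the four parameters listed above.

    Hypothetical helper for illustration; not part of any SDK.
    """
    if level not in ALLOWED_LEVELS:
        raise ValueError(f"level must be one of {ALLOWED_LEVELS}, got {level!r}")
    if type_ not in ALLOWED_TYPES:
        raise ValueError(f"type must be one of {ALLOWED_TYPES}, got {type_!r}")
    return {"name": name, "level": level, "type": type_, "indexed": indexed}


# Assumed payload shape: a corpus key plus a list of filter attributes.
corpus_definition = {
    "key": "my-docs",
    "filter_attributes": [
        filter_attribute("department", "document", "text"),
        filter_attribute("year", "document", "integer"),
        filter_attribute("doc_type", "document", "text"),
        filter_attribute("section_type", "part", "text"),
    ],
}

print(json.dumps(corpus_definition, indent=2))
```

Validating level and type locally catches typos before the corpus is created, which matters because the attributes are fixed at creation time.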
Add metadata at ingestion
Add metadata when indexing documents using the Python SDK. You can set metadata at:
- Document level (applies to the whole doc)
- Part/Section level (applies to a section/part)
Example: Ingest a document with metadata
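A minimal sketch of an ingestion payload that carries metadata at both levels. It builds a plain dictionary rather than calling the Python SDK. The helper build_document, the "type": "core" value, the "document_parts" field name, the document ID, and the sample text are all assumptions made for illustration.

```python
import json


def build_document(doc_id, doc_metadata, parts):
    """Assemble an ingestion payload with document-level and part-level metadata.

    Hypothetical helper; `parts` is a list of (text, part_metadata) pairs.
    """
    return {
        "id": doc_id,
        "type": "core",  # assumed document type; check the API reference
        "metadata": dict(doc_metadata),  # applies to the whole document
        "document_parts": [
            {"text": text, "metadata": dict(part_metadata)}  # applies to this part only
            for text, part_metadata in parts
        ],
    }


document = build_document(
    "expense-policy-2024",  # hypothetical document ID
    {"department": "finance", "year": 2024, "doc_type": "policy"},
    [
        ("Employees must submit expense reports within 30 days.", {"section_type": "summary"}),
        ("Receipts are required for every claim.", {"section_type": "details"}),
    ],
)

print(json.dumps(document, indent=2))
```

Note that year is stored as an integer, not the string "2024", so that numeric comparisons such as doc.year >= 2023 behave as expected.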
Important: The metadata field names (department, year, doc_type, section_type) must exactly match the filter attribute names defined in your corpus.
Querying with metadata filters
Filter your queries using metadata fields to target only relevant documents or parts.
- Document-level filter: Applies to whole documents.
- Part-level filter: Targets individual sections/parts based on their metadata.
Example: Query with a metadata filter
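A minimal sketch of a query body with a document-level filter, again as a plain dictionary rather than an SDK call. The helper build_query, the "search", "metadata_filter", and "limit" field names, and the query text are assumptions for illustration.

```python
import json


def build_query(text, metadata_filter, limit=10):
    """Build a query body that restricts results with a metadata filter.

    Hypothetical helper; field names are assumed, not confirmed by this guide.
    """
    return {
        "query": text,
        "search": {
            "metadata_filter": metadata_filter,  # SQL-like filter expression
            "limit": limit,  # maximum number of results to return
        },
    }


# Only documents whose department metadata equals 'finance' are searched.
query_body = build_query(
    "What is the expense reimbursement policy?",  # hypothetical query text
    "doc.department = 'finance'",
)

print(json.dumps(query_body, indent=2))
```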
Example: Part-level metadata filtering
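A short sketch of a part-level filter. The only difference from a document-level filter is the part. prefix on the field reference. The helper field_ref and the query body field names are assumptions for illustration.

```python
# Maps the filter attribute level to the prefix used in filter expressions.
PREFIXES = {"document": "doc", "part": "part"}


def field_ref(level, name):
    """Return the prefixed reference for a metadata field (hypothetical helper)."""
    return f"{PREFIXES[level]}.{name}"


# Matches individual parts tagged as summaries, regardless of their document.
part_filter = f"{field_ref('part', 'section_type')} = 'summary'"

query_body = {
    "query": "key points of the expense policy",  # hypothetical query text
    "search": {"metadata_filter": part_filter},  # assumed field names
}

print(part_filter)
```

Deriving the prefix from the attribute's level keeps the filter consistent with how the attribute was defined in the corpus.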
Example: Complex metadata filtering
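A minimal sketch that composes a larger filter from small pieces. All helper names are hypothetical. It assumes parentheses group conditions as they do in SQL, and it deliberately rejects booleans and strings containing single quotes because this guide does not cover how those are written.

```python
OPERATORS = ("=", "!=", ">", ">=", "<", "<=")


def literal(value):
    """Render a value as a filter literal: strings get single quotes."""
    if isinstance(value, bool):
        raise TypeError("boolean literals are not covered in this sketch")
    if isinstance(value, (int, float)):
        return str(value)
    if "'" in value:
        raise ValueError("escaping embedded single quotes is not covered in this sketch")
    return f"'{value}'"


def compare(ref, op, value):
    """Build a single comparison such as doc.year >= 2023."""
    if op not in OPERATORS:
        raise ValueError(f"unsupported operator: {op!r}")
    return f"{ref} {op} {literal(value)}"


def is_in(ref, values):
    """Build an IN condition over several allowed values."""
    return f"{ref} IN ({', '.join(literal(v) for v in values)})"


def all_of(*conditions):
    """Join conditions with AND, wrapped in parentheses."""
    return "(" + " AND ".join(conditions) + ")"


def any_of(*conditions):
    """Join conditions with OR, wrapped in parentheses."""
    return "(" + " OR ".join(conditions) + ")"


expression = all_of(
    compare("doc.department", "=", "finance"),
    compare("doc.year", ">=", 2023),
    any_of(
        is_in("doc.doc_type", ["policy", "procedure"]),
        compare("part.section_type", "=", "summary"),
    ),
)

print(expression)
```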
- Filter syntax is similar to SQL. Use single quotes for strings.
- Combine multiple conditions with AND or OR.
- Use comparison operators: =, !=, >, >=, <, <=
- Use IN for multiple values: doc.type IN ('policy', 'procedure')
- You can only filter on metadata fields defined as filter attributes in your corpus.
Metadata best practices
- Plan filter fields: When creating a corpus, define which metadata keys should be indexed for filtering.
- Use consistent types: Stick to string, number, or boolean values for predictable filtering.
- Be explicit: Set metadata at both document and section level if your queries require fine-grained filtering.
- Keep keys lowercase: Avoid spaces and special characters in metadata keys.
- Match filter attributes: Ensure metadata field names exactly match the filter attribute names defined in your corpus.
Troubleshooting metadata filters
The error INVALID_ARGUMENT: The filter expression contains an error. Unrecognized references: doc.department, doc.year occurs when:
- Filter attributes not defined: The corpus doesn't have filter attributes for the metadata fields you're trying to filter on.
- Name mismatch: The metadata field names don't exactly match the filter attribute names.
- Wrong level: Using the doc. prefix for part-level attributes, or vice versa.
Solutions:
- Ensure filter attributes are defined when creating the corpus (cannot be added later)
- Verify metadata field names exactly match filter attribute names
- Use the doc. prefix for document-level filters and the part. prefix for part-level filters
- Check for typos and use single quotes for string values
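The checks above can be approximated locally before a query is sent. The sketch below is a hypothetical pre-flight helper, not a Vectara feature. It extracts doc. and part. references from a filter expression with a regular expression and reports any that do not match a defined filter attribute at the same level.

```python
import re

# Filter attributes defined on the corpus, as (prefix, name) pairs.
# These mirror the example names used in this guide.
DEFINED = {
    ("doc", "department"),
    ("doc", "year"),
    ("doc", "doc_type"),
    ("part", "section_type"),
}

REFERENCE = re.compile(r"\b(doc|part)\.(\w+)")
STRING_LITERAL = re.compile(r"'[^']*'")


def unrecognized_references(expression, defined=DEFINED):
    """Return references in `expression` that match no defined filter attribute."""
    # Blank out string literals so text like 'doc.x' inside quotes is ignored.
    stripped = STRING_LITERAL.sub("''", expression)
    missing = []
    for prefix, name in REFERENCE.findall(stripped):
        ref = f"{prefix}.{name}"
        if (prefix, name) not in defined and ref not in missing:
            missing.append(ref)
    return missing


# Name mismatch: 'dept' was never defined.
print(unrecognized_references("doc.dept = 'finance' AND doc.year >= 2023"))
# Wrong level: section_type is a part-level attribute.
print(unrecognized_references("doc.section_type = 'summary'"))
```

An empty list means every reference matches a defined attribute at the right level. It does not guarantee that the rest of the expression is valid.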
Complete working example
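A sketch of the full workflow using only the Python standard library instead of the Python SDK. The base URL, endpoint paths, x-api-key header, and payload field names are assumptions drawn from a REST-style API and should be verified against the current API reference. The function names, document ID, and sample text are hypothetical.

```python
import json
import urllib.request

BASE_URL = "https://api.vectara.io/v2"  # assumed endpoint; verify before use
CORPUS_KEY = "my-docs"


def post(path, body, api_key):
    """Build (but do not send) an authenticated JSON POST request."""
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-api-key": api_key},
        method="POST",
    )


def workflow_requests(api_key):
    """Return the three requests of the workflow, in order."""
    # 1. Create the corpus with filter attributes.
    create = post("/corpora", {
        "key": CORPUS_KEY,
        "filter_attributes": [
            {"name": "department", "level": "document", "type": "text", "indexed": True},
            {"name": "year", "level": "document", "type": "integer", "indexed": True},
            {"name": "section_type", "level": "part", "type": "text", "indexed": True},
        ],
    }, api_key)
    # 2. Index a document whose metadata keys match the attribute names.
    index = post(f"/corpora/{CORPUS_KEY}/documents", {
        "id": "expense-policy-2024",
        "type": "core",
        "metadata": {"department": "finance", "year": 2024},
        "document_parts": [
            {"text": "Submit expense reports within 30 days.",
             "metadata": {"section_type": "summary"}},
        ],
    }, api_key)
    # 3. Query with a filter that references only defined attributes.
    query = post(f"/corpora/{CORPUS_KEY}/query", {
        "query": "expense report deadline",
        "search": {"metadata_filter": "doc.department = 'finance' AND doc.year >= 2023"},
    }, api_key)
    return [create, index, query]


def run(api_key):
    """Send the requests in order; urlopen raises HTTPError on a failed call."""
    for request in workflow_requests(api_key):
        with urllib.request.urlopen(request) as response:
            print(request.full_url, response.status)
```

Calling run("YOUR_API_KEY") would send the requests. Nothing is sent when the module is merely imported or the functions are defined.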
This complete example shows the proper workflow:
- Create corpus with filter attributes
- Index documents with matching metadata
- Query with metadata filters
The key is ensuring metadata field names exactly match the filter attribute names defined in your corpus.