Visual data
Vectara makes visual content, such as pictures, charts, and diagrams, retrievable in two ways: by indexing a text representation of the image, or by embedding the referenced image for image-based matching on a corpus with image encoding enabled. This page explains both retrieval mechanisms and the common ingestion paths for visual content, so you can choose the path that matches your data and your retrieval goal.
For request and response details, see the File Upload API and Create Corpus Document API references.
How visual content becomes retrievable
A document is retrieved through its parts. Each part is the unit that gets embedded and scored at query time. For visual content there are two ways a part can be matched:
| Mechanism | What gets embedded | When it applies |
|---|---|---|
| Text representation | The part's text (a caption, description, or summary you supply or generate) | Any corpus. Findable by keyword and semantic text queries. |
| Image-based matching | An embedding generated from the image referenced by the part | Corpora that support image encoding. A text query is matched against the image embedding, and matching image parts are returned as results. |
Use one or both mechanisms depending on how users search. A text query can match a text representation, and on an image-capable corpus the same text query can also match an image embedding. A single image can carry both, so one query can match either its text or its image.
The image_part_mode field on a document part selects the mechanism:
- text: the part is matched on its text.
- image: the part is matched by similarity between the query and the referenced image's embedding. Its text, if set, also makes the part findable through keyword search.
- image_and_text: the part is matched against both the referenced image and its text together.
When image_part_mode is omitted, the mode is inferred. On a corpus that does not support image encoding, a part is always text. Otherwise it defaults to image when it references an image and has no text, image_and_text when it references an image and has text, and text otherwise.
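The inference rules above can be read as a small decision function. The following is a minimal sketch of that logic only, not Vectara's implementation; the function name and its boolean parameters are hypothetical and exist just to make the rules explicit.

```python
def infer_image_part_mode(
    corpus_supports_image_encoding: bool,
    references_image: bool,
    has_text: bool,
) -> str:
    """Return the mode a part would get when image_part_mode is omitted.

    Hypothetical helper that mirrors the documented rules; the parameter
    names are illustrative and are not API fields.
    """
    # A corpus without image encoding always treats the part as text.
    if not corpus_supports_image_encoding:
        return "text"
    # An image reference without text is matched on the image alone;
    # an image reference with text is matched on both.
    if references_image:
        return "image_and_text" if has_text else "image"
    # No image reference: plain text part.
    return "text"
```

Setting image_part_mode explicitly on the part avoids relying on this inference when the intended behavior matters.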
You query with text, not with an image. You cannot send an image as a query input. On an image-capable corpus, a text query is matched against image embeddings, and matching image parts are returned as results. An agent lets you start from an image: you can attach an image to an agent session, and the agent reads it and issues a text query on your behalf. The corpus itself is still queried with text. To fetch the original image file by image_id, use the Retrieve Image API.
Choose an ingestion path
The common paths to get visual content into a corpus are file upload, creating a Core document, and an agent-based pipeline transform. They differ in how much you supply and how much is produced for you. The table shows what each path does. The sections that follow explain when to use each one.
| Capability | File upload | Core document | Pipeline transform |
|---|---|---|---|
| Extract text from supported files | Yes | You supply the text | Yes, with a vision model |
| Extract tables from supported files | Yes, when configured | You define the tables | Yes, region by region |
| Generate an image description for you | No | No | Yes, with a vision model |
| Image-based matching on an image-capable corpus | No | Yes, with image_part_mode | Yes, when it indexes image parts |
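One way to read the table is as a lookup from required capabilities to candidate paths. The sketch below encodes it under two simplifying assumptions: a capability marked "when configured" or "when it indexes image parts" counts as available, and "you supply" or "you define" counts as not produced for you. The capability and path identifiers are hypothetical labels, not API values.

```python
from typing import Dict, Iterable, List, Set

# What each path produces for you, per the table above (illustrative labels).
PATH_CAPABILITIES: Dict[str, Set[str]] = {
    "file_upload": {"extract_text", "extract_tables"},
    "core_document": {"image_matching"},
    "pipeline_transform": {
        "extract_text",
        "extract_tables",
        "generate_description",
        "image_matching",
    },
}


def paths_providing(required: Iterable[str]) -> List[str]:
    """Return the paths that cover every required capability."""
    needed = set(required)
    return [
        path
        for path, provided in PATH_CAPABILITIES.items()
        if needed <= provided
    ]
```

For example, requiring only "image_matching" leaves the Core document and pipeline transform paths, while requiring "generate_description" leaves only the pipeline transform.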
Image artifacts are a separate surface from corpus ingestion. An agent can read image artifacts in a session for analysis, and those files are session-scoped unless you add them to a corpus with the index tool. See Artifacts. This page covers ingesting visual content into a corpus.
Path 1: Upload a file
POST /v2/corpora/{corpus_key}/upload_file extracts text and tables from a document and indexes the result. This path makes the text of a document retrievable, including text laid out alongside pictures, charts, and diagrams. Table extraction turns each detected table into its own part with a generated description. See Table data for that flow.
File upload indexes the text of a document, not the visual content of its images. If retrieval depends on what is inside a picture, chart, diagram, or scanned region rather than on the text near it, use one of the next two paths: a Core document or a pipeline transform, where you explicitly provide or generate the image representation.
File Upload table extraction does not support scanned images of tables. A table that exists only as pixels in a scanned page is not extracted as structured rows. To get structured rows from a table image, use an agent-based pipeline transform, where the agent can detect table regions, read their cells, and index the result as structured table data.
For the supported file formats, see Supported file formats.
Path 2: Create a Core document
POST /v2/corpora/{corpus_key}/documents with a Core document gives you direct control. You attach images to the document and reference them from parts. No caption or description is generated for you, so retrieval quality depends on the image reference, image_part_mode, and the text representation you provide.
An image carries Base64 image_data with a mime_type, and optional title, caption, and description. On a corpus that supports image encoding, the description is indexed as the image's text body for keyword search. A part then references the image by image_id and sets image_part_mode to control how it is matched.
You can attach more than one part to the same image. For example, use one part that describes the image as a whole, plus separate parts for individual regions or referenced text, so that different query angles match the right facet of the image.
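The request shape described in this section can be sketched as a Python dictionary that serializes to the JSON body. The field names image_data, mime_type, title, caption, description, image_id, and image_part_mode come from this page. The exact nesting (an images list, a data field inside image_data, a document_parts list, and type set to core) is an assumption to confirm against the Create Corpus Document API reference. The IDs and text values are hypothetical.

```python
import base64
import json


def build_core_document(doc_id: str, image_bytes: bytes) -> dict:
    """Build a Core document body with one image and two parts that reference it.

    The nesting below is an assumed layout, not a confirmed schema.
    """
    image_id = "figure-1"  # hypothetical image ID
    return {
        "id": doc_id,
        "type": "core",  # assumed discriminator for Core documents
        "images": [
            {
                "id": image_id,
                "title": "Pump performance curve",
                "caption": "Pressure versus flow rate",
                "description": "Line chart of outlet pressure in bar "
                "against flow rate in liters per minute.",
                "image_data": {
                    # Base64-encode the raw bytes so they fit in JSON.
                    "data": base64.b64encode(image_bytes).decode("ascii"),
                    "mime_type": "image/png",
                },
            }
        ],
        "document_parts": [
            {
                # Whole-image part: matched on the image and on its text.
                "text": "Pump performance curve: outlet pressure in bar "
                "against flow rate in liters per minute.",
                "image_id": image_id,
                "image_part_mode": "image_and_text",
            },
            {
                # Region part: matched only on text read from the figure.
                "text": "Axis label: maximum rated flow 120 L/min.",
                "image_id": image_id,
                "image_part_mode": "text",
            },
        ],
    }


# Serialize for use as the POST body; the bytes here are placeholders.
body = json.dumps(build_core_document("pump-manual-p12", b"\x89PNG placeholder"))
```

Keeping the whole-image part and the region part separate lets a broad query and a query for a specific label each match a different part of the same image.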
A text representation is only as findable as the language in it. A description that names the entities, units, and domain terms a user would search for is retrievable. A generic description, such as "a diagram with lines and boxes", is unlikely to help with domain-specific retrieval.
For the full request body, see the Create Corpus Document API reference.
Path 3: Pipeline transform
In a pipeline, the stage that processes each source record is called the transform. A pipeline uses an agent to transform data input into indexed chunks. Each source file is uploaded to a fresh agent session as the first input, and the agent processes it. This is the path where a text representation can be generated for you: the agent reads a visual file with its tools and produces searchable text (a caption or description, extracted text, or table content) before indexing it. What it produces depends on the tools and instructions you give it.
An agent reads visual files with a vision model and produces searchable text before indexing it. For a standalone image, it generates a caption or description, so the image is findable by a text query, and image-matchable on an image-capable corpus. For a PDF, it can lift the embedded figures out and process each one the same way. The agent then adds its output to the corpus with the index tool, where it is indexed through the mechanism in Path 2.
You can steer how each visual file is turned into text:
- Generate a single summary of the image as one caption or description.
- Process the image region by region, handling tables, pictures, and text separately so each becomes its own searchable part.
- Customize the prompts that guide captioning, text transcription, and table summarization, so the generated text uses the domain language your users search in.
Supported image formats are PNG, JPEG, GIF, WEBP, BMP, and TIFF.
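If you want to filter a source folder before it reaches the pipeline, a client-side check against this list is one option. The sketch below assumes the format can be judged from the file extension, which is a simplification; the helper name and the extension mapping are hypothetical and not part of any Vectara API.

```python
from pathlib import Path
from typing import Optional

# Common file extensions for the formats listed above (assumed mapping).
SUPPORTED_IMAGE_EXTENSIONS = {
    ".png": "PNG",
    ".jpg": "JPEG",
    ".jpeg": "JPEG",
    ".gif": "GIF",
    ".webp": "WEBP",
    ".bmp": "BMP",
    ".tif": "TIFF",
    ".tiff": "TIFF",
}


def supported_image_format(filename: str) -> Optional[str]:
    """Return the format name for a supported image file, or None otherwise."""
    suffix = Path(filename).suffix.lower()
    return SUPPORTED_IMAGE_EXTENSIONS.get(suffix)
```

A file such as chart.svg returns None under this check, so it can be routed to a different path or converted first.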
Use this path when you want this processing to run automatically over a source of files rather than per request. The Console ships an indexing-agent template that wires up these visual-processing tools for you.
For how transforms and agents fit together, see Pipeline concepts and the Pipelines quickstart. For the tools an agent can run, see Agent tools. For session files and how they become corpus documents, see Artifacts.
Verify retrieval after ingest
A text representation that reads well to a human can still miss the words a user types. After ingesting visual content, confirm it is findable:
- Query the corpus with the terms your users would actually use, not the words already in the description.
- Confirm the image part appears in the results.
- If your corpus and query path support image-based matching, confirm that the query returns the parts you set to image or image_and_text.
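The checklist above can be automated as a small regression check. The sketch below is deliberately independent of any client library: search is a hypothetical callable that you would wrap around your own corpus query, and is_expected is a predicate you write against whatever result shape that query returns. The in-memory index and the document_id key are stand-ins that only make the example runnable.

```python
from typing import Callable, Iterable, List


def find_missed_queries(
    search: Callable[[str], Iterable[dict]],
    queries: Iterable[str],
    is_expected: Callable[[dict], bool],
    top_k: int = 10,
) -> List[str]:
    """Return the queries whose top_k results do not contain the expected part."""
    missed = []
    for query in queries:
        top_results = list(search(query))[:top_k]
        if not any(is_expected(result) for result in top_results):
            missed.append(query)
    return missed


# Stand-in for a real corpus query, used only to make the sketch runnable.
FAKE_INDEX = {
    "pump pressure curve": [{"document_id": "manual-7"}],
    "flow rate chart": [{"document_id": "manual-2"}],
}


def fake_search(query: str) -> List[dict]:
    return FAKE_INDEX.get(query, [])


missed = find_missed_queries(
    fake_search,
    ["pump pressure curve", "flow rate chart", "impeller diagram"],
    lambda result: result["document_id"] == "manual-7",
)
print(missed)
```

Each query in the returned list is a candidate for revising the part's text or its image_part_mode before reindexing.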
If a part does not surface, revise its text to name the entities and terms users search for, or adjust its image_part_mode, and reindex the affected documents. Image-based matching is separate from retrieving the original image file. To fetch an embedded image by image_id, use the Retrieve Image API.