When you create a corpus via the API or the UI, you will have the option to create it and "don't store the text," also known as a "textless" mode. When this is enabled, several things happen in the platform. This document talks about when it's appropriate to enable textless, what happens on the platform, and what benefits and limitations it brings.
What happens when textless is enabled
When you enable
textless on a corpus, Vectara discards
the text content of the document immediately after it converts the text to a
vector. At that point, the text is no longer recoverable and won't be
returned in any Vectara APIs.
Note that Vectara does retain any metadata -- including document IDs -- that were supplied alongside the text. This allows you to retrieve the document from a separate system of record based on the ID to show it and also allows Vectara to perform any metadata-based filtering on the document.
One use case for
textless is when you have very sensitive text content. By
enabling this feature, the text content becomes unrecoverable
to Vectara or to any user that manages to query for and
find the document.
In general, this feature is optimal for use cases where the cost of any information leakage is very high. Note that Vectara does encrypt documents
Currently, the reranking capability relies on the text being stored. As a result, attempting to rerank search results on any corpora where text storage has been turned off will not work at this time.