Add Custom Dimensions to Enhance Scoring
Custom dimensions let our Scale users define a fixed set of additional "dimensions" that hold user-defined numerical values. These are stored alongside the dimensions that Vectara automatically extracts from the text. At query time, you can use these custom dimensions to raise or lower the resulting scores dynamically, query by query.
For example, suppose you want to boost forum posts based on how many "upvotes" each post has received. You can create the corpus with a "votes" custom dimension as follows:
curl -X POST \
  -H "Authorization: Bearer ${JWT_TOKEN}" \
  -H "customer-id: ${CUSTOMER_ID}" \
  https://api.vectara.io:443/v1/create-corpus \
  -d @- <<END;
{
  "corpus": {
    "name": "Acme Forums",
    "description": "Contents of the Acme Forum",
    "custom_dimensions": [
      {
        "name": "votes",
        "description": "Log of the number of votes received by this post",
        "serving_default": 0.0,
        "indexing_default": 0.0
      }
    ]
  }
}
END
Then, at index time, you can attach the value of the custom dimension as follows:
{
  "documentId": "237a8b63-2826-4ee1-8d83-14c2451a3357",
  "parts": [
    {
      "context": "...",
      "text": "Yesterday I woke up and observed a rainbow out of my window.",
      "custom_dims": [
        {
          "name": "votes",
          "value": 1.235
        }
      ]
    }
  ]
}
To boost documents based on the value of these custom dimensions, issue a query as follows:
curl -X POST \
  -H "Authorization: Bearer ${JWT_TOKEN}" \
  -H "customer-id: ${CUSTOMER_ID}" \
  https://api.vectara.io:443/v1/query \
  -d @- <<END;
{
  "query": [
    {
      "query": "When was the last time you saw a rainbow?",
      "num_results": 5,
      "corpus_key": [
        {
          "customer_id": ${CUSTOMER_ID},
          "corpus_id": ${CORPUS_ID},
          "dim": [
            {
              "name": "votes",
              "value": 0.01
            }
          ]
        }
      ]
    }
  ]
}
END
How custom dimensions affect scores
To calculate the final score for a query and document that both have custom dimensions, Vectara takes the dot product of the query's custom dimensions with the document's custom dimensions and adds the result to the text retrieval score.
Negative values decrease the overall score (sometimes called "burying") and positive values increase the overall score (sometimes called "boosting"). A dot product of 0 does not affect the underlying text retrieval score.
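The arithmetic described above can be sketched in a few lines of Python. This is an illustrative model only, not Vectara's implementation: the function name `final_score`, the dictionary representation of dimensions, and the sample text score of 0.5 are all assumptions made for the example.

```python
def final_score(text_score, query_dims, doc_dims):
    """Add the dot product of query and document custom dimensions to a text score.

    Hypothetical helper. Dimensions are modeled as {name: value} dicts, and a
    dimension missing from the document is treated as 0.0, mirroring the 0.0
    defaults used in the corpus example.
    """
    dot = sum(weight * doc_dims.get(name, 0.0) for name, weight in query_dims.items())
    return text_score + dot


doc = {"votes": 1.235}          # value stored at index time
text_score = 0.5                # assumed text retrieval score

boosted = final_score(text_score, {"votes": 0.01}, doc)    # positive weight: boost
buried = final_score(text_score, {"votes": -0.01}, doc)    # negative weight: bury
unchanged = final_score(text_score, {"votes": 0.0}, doc)   # zero dot product: no effect

print(boosted, buried, unchanged)
```

With a query value of 0.01 and a stored value of 1.235, the adjustment is 0.01235 in either direction, which is small relative to a score range of -1 to 1.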
For more information on how scores can be interpreted in general, see the documentation on interpreting scores.
Choosing values for custom dimensions
Because scores in Vectara range from -1 to 1, it's generally best to make sure that the dot product of the custom dimension values you store in your documents and the custom dimension values in the query is between -1 and 1.
Indexing
If you're tracking some underlying value that increases or decreases linearly (like upvotes, number of responses, total units sold, etc.), then you would typically take the log() of the value first before storing it in a document to ensure that it cannot dominate the overall score too much.
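As a rough illustration of the log transform described above, the sketch below compresses a raw vote count before it is stored. The helper name `votes_to_dimension` is hypothetical, and the use of `log1p` (that is, log(1 + votes)) instead of a plain log is an assumption of this example, chosen so that a post with zero votes maps to 0.0 rather than raising an error.

```python
import math


def votes_to_dimension(votes):
    """Compress a raw, non-negative count before storing it as a custom dimension.

    Hypothetical helper. log1p(votes) == log(1 + votes), so zero votes maps to 0.0.
    """
    if votes < 0:
        raise ValueError("votes must be non-negative")
    return math.log1p(votes)


# A 10,000x increase in raw votes becomes a much smaller increase in the stored value.
for votes in (0, 10, 1_000, 100_000):
    print(votes, round(votes_to_dimension(votes), 3))
```

The returned number is what would go into the "value" field of the custom dimension at index time.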
It can also be useful to bound the boost or penalty for a field. For example, a longer content length might warrant a boost while older documents might warrant being buried, but in either case there may be a point at which "even longer" or "even older" doesn't really matter. In these cases, it can be useful to apply a sigmoid function to the content length or age at indexing time.
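One way to bound a value at indexing time is a logistic sigmoid, sketched below. The function name `bounded_value` and the `midpoint` and `steepness` parameters are assumptions for illustration; suitable values depend entirely on your data.

```python
import math


def bounded_value(raw, midpoint, steepness):
    """Map a raw value into (0, 1) with a logistic sigmoid.

    Hypothetical helper. `midpoint` is the raw value that maps to 0.5 and
    `steepness` controls how quickly the curve saturates. The two branches
    avoid overflow in math.exp for very large or very small inputs.
    """
    z = steepness * (raw - midpoint)
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


# Content length in characters: the boost saturates, so "even longer" stops mattering.
length_dim = bounded_value(50_000, midpoint=2_000, steepness=0.002)

# Document age in days: store the bounded value and bury it with a negative
# query value, or negate it here and use a positive query value.
age_dim = bounded_value(30, midpoint=365, steepness=0.01)

print(length_dim, age_dim)
```

Because the output always lies between 0 and 1, the size of the boost or penalty is then controlled entirely by the query value.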
Querying
Even when the absolute value of a custom dimension is small, it can still have a large impact on the score. Try to keep the document values in the -100 to +100 range so that you don't need to suppress these values further. Depending on how your document values scale, query values for a custom dimension should normally be in the range of -0.1 to 0.1, or even smaller, like -0.01 to 0.01, if document values are on the larger side of that range.
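A simple way to pick a query value is to work backwards from the largest stored document value, as in the sketch below. The helper `max_query_value` and its `max_score_shift` parameter are hypothetical; the default of 1.0 reflects the guidance to keep the dot product between -1 and 1.

```python
def max_query_value(max_abs_doc_value, max_score_shift=1.0):
    """Largest query value w such that |w * doc_value| <= max_score_shift.

    Hypothetical helper. `max_abs_doc_value` is the largest absolute value
    stored for this custom dimension across your documents.
    """
    if max_abs_doc_value <= 0:
        raise ValueError("max_abs_doc_value must be positive")
    return max_score_shift / max_abs_doc_value


# Document values up to 100 in magnitude: query values of roughly -0.01 to 0.01.
print(max_query_value(100.0))

# Document values up to 10 in magnitude: query values of roughly -0.1 to 0.1.
print(max_query_value(10.0))
```

Treat the result as an upper bound. In practice you would usually start well below it and increase the query value only if the boost is too weak.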