Supported File Types
Raw document types
The upload endpoint supports several raw document types. Vectara extracts text
from these documents and sections them as best it can. This is a convenient way
to index text, but the caller has less control than when providing the Document
proto message directly. The following raw document types are supported:
- Commonmark / Markdown (md extension).
- PDF/A (pdf).
- Open Office (odt).
- Microsoft Word (doc, docx).
- Microsoft Powerpoint (ppt, pptx).
- Text files (txt).
- HTML files (.html).
- LXML files (.lxml).
- RTF files (.rtf).
- ePUB files (.epub).
- Email files conforming to RFC 822.
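A client may want to check a file's extension against the list above before uploading. The following is a minimal sketch of such a check. The names RAW_EXTENSIONS and is_supported_raw_type are hypothetical and not part of any Vectara SDK. Email files are left out because the list names no extension for them, and the check does not verify that a PDF actually conforms to PDF/A.

```python
from pathlib import Path

# Extensions taken from the list of supported raw document types above.
# Email files (RFC 822) are omitted: the list names no extension for them.
RAW_EXTENSIONS = frozenset({
    "md", "pdf", "odt", "doc", "docx", "ppt", "pptx",
    "txt", "html", "lxml", "rtf", "epub",
})


def is_supported_raw_type(filename: str) -> bool:
    """Return True if the file's extension appears in the raw-type list.

    Hypothetical helper: it only inspects the extension, not the contents.
    """
    ext = Path(filename).suffix.lower().lstrip(".")
    return ext in RAW_EXTENSIONS
```

A caller could use this to skip files such as images before attempting an upload, while still letting the service make the final decision about what it accepts.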
Semi-structured documents
In gRPC, the upload endpoint also accepts semi-structured documents that
reflect a Document proto message. These can be sent in the following formats:
- pb: Contains a binary serialized Document proto message.
- pbtxt: Contains a Document proto message in proto text format.
- json: Contains a Document proto message in JSON text format.
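Of the three formats, pb is binary while pbtxt and json are text, which matters when reading a file before sending it. The sketch below assumes the format names correspond to file extensions, which the list suggests but does not state outright. The names document_format and open_mode are hypothetical helpers, not Vectara APIs.

```python
from pathlib import Path
from typing import Optional

# Format names from the list above, assumed here to be file extensions.
BINARY_FORMATS = frozenset({"pb"})
TEXT_FORMATS = frozenset({"pbtxt", "json"})


def document_format(filename: str) -> Optional[str]:
    """Return the semi-structured format name, or None if not recognized."""
    ext = Path(filename).suffix.lower().lstrip(".")
    if ext in BINARY_FORMATS or ext in TEXT_FORMATS:
        return ext
    return None


def open_mode(filename: str) -> str:
    """Return the mode for open(): "rb" for binary pb, "r" for text formats."""
    fmt = document_format(filename)
    if fmt is None:
        raise ValueError(f"not a semi-structured document format: {filename!r}")
    return "rb" if fmt in BINARY_FORMATS else "r"
```

For example, open(path, open_mode(path)) would read a pb file as bytes and a pbtxt or json file as text.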
In REST API v2, use the Indexing API v2 endpoint instead.