Supported File Types

Raw document types

The upload endpoint supports several raw document types. Vectara extracts the text from these documents and splits it into sections as best it can. This is a convenient way to index text, but it gives the caller less control than supplying a Document proto message directly. The following raw document types are supported:

  • CommonMark / Markdown (.md).
  • PDF/A (.pdf).
  • Open Office (.odt).
  • Microsoft Word (.doc, .docx).
  • Microsoft PowerPoint (.ppt, .pptx).
  • Text files (.txt).
  • HTML files (.html).
  • LXML files (.lxml).
  • RTF files (.rtf).
  • ePUB files (.epub).
  • Email files conforming to RFC 822.
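
As a rough client-side check before uploading, the list above can be turned into an extension lookup. This is only a sketch: the names `RAW_EXTENSIONS` and `is_supported_raw_type` are made up for illustration and are not part of any Vectara library, and the service may decide on file types differently (for example, by inspecting content rather than the extension).

```python
from pathlib import Path

# Extensions copied from the list of raw document types above.
# RFC 822 email files have no extension given in the list, so they
# are not covered by this check.
RAW_EXTENSIONS = {
    "md", "pdf", "odt", "doc", "docx", "ppt", "pptx",
    "txt", "html", "lxml", "rtf", "epub",
}


def is_supported_raw_type(filename: str) -> bool:
    """Return True if the file's extension appears in the raw-type list."""
    ext = Path(filename).suffix.lower().lstrip(".")
    return ext in RAW_EXTENSIONS
```

For instance, `is_supported_raw_type("report.PDF")` returns `True` because the comparison is case-insensitive, while a file with no extension is rejected by this check.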

Semi-structured documents

In addition, the upload endpoint accepts semi-structured documents that represent a Document proto message. These can be sent in the following formats:

  • pb: Contains a binary-serialized Document proto message.

  • pbtxt: Contains a Document proto message in proto text format.

  • json: Contains a Document proto message in JSON text format.
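
To illustrate the json option, here is a minimal sketch that serializes a document to JSON text. The field names used (`documentId`, `title`, `section`, `text`) are assumptions about the Document message, not taken from this page, and `build_document_json` is a hypothetical helper; check the formatting document for the actual schema before relying on it.

```python
import json


def build_document_json(document_id, title, section_texts):
    """Serialize a simple document to JSON text.

    The keys below are assumed field names for the Document message;
    verify them against the formatting document.
    """
    doc = {
        "documentId": document_id,
        "title": title,
        "section": [{"text": text} for text in section_texts],
    }
    return json.dumps(doc, indent=2)


payload = build_document_json(
    "doc-001",
    "Example document",
    ["First section text.", "Second section text."],
)
```

The resulting string could then be written to a file with a `.json` extension and sent to the upload endpoint in place of a raw document.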

For more details on how to format these types of files, read the formatting document.