Version: 2.0

S3

The S3 source reads objects from Amazon S3 or any S3-compatible service (MinIO, Ceph, etc.). Each object in the bucket becomes a record in the pipeline — its contents are uploaded to a new agent session for processing.

Configuration

SOURCE FIELD (S3)

Code example with json syntax.

Fields

Field | Required | Description
bucket | Yes | The S3 bucket name.
region | Yes | The region of the S3-compatible service (e.g. us-east-1).
access_key_id | Yes | AWS access key ID. Encrypted at rest and never returned in responses.
secret_access_key | Yes | AWS secret access key. Encrypted at rest and never returned in responses.
prefix | No | Key prefix to scope ingestion to a subset of objects (e.g. legal/contracts/).
endpoint_url | No | Custom endpoint URL for S3-compatible services. If omitted, defaults to AWS S3.
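As a rough illustration of the table above, the following Python sketch checks that a source configuration carries the four required fields. The function name `validate_s3_source` and the placeholder values are hypothetical, and the sketch assumes the source block is a flat mapping of the documented field names; the actual envelope around these fields may differ.

```python
from typing import Dict, List

# Field names taken from the table above.
REQUIRED_FIELDS = ("bucket", "region", "access_key_id", "secret_access_key")


def validate_s3_source(config: Dict[str, str]) -> List[str]:
    """Return a list of problems; an empty list means no required field is missing."""
    problems = []
    for name in REQUIRED_FIELDS:
        if not config.get(name):
            problems.append(f"missing required field: {name}")
    return problems


# Hypothetical example values, not real credentials.
example = {
    "bucket": "my-bucket",
    "region": "us-east-1",
    "access_key_id": "EXAMPLE_KEY_ID",
    "secret_access_key": "EXAMPLE_SECRET",
    "prefix": "legal/contracts/",  # optional
}
print(validate_s3_source(example))
```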

Using with S3-compatible services

Set endpoint_url to point at your service. The rest of the configuration works the same as with AWS S3.

SOURCE FIELD (S3-COMPATIBLE, MINIO)

Code example with json syntax.
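To show what setting endpoint_url changes in practice, here is a small Python sketch that turns a source configuration into keyword arguments for an S3 client. It assumes a boto3-style client (`boto3.client("s3", ...)`); the client library the pipeline actually uses is not stated here. The helper name `s3_client_kwargs` and the MinIO address are hypothetical.

```python
from typing import Dict


def s3_client_kwargs(source: Dict[str, str]) -> Dict[str, str]:
    """Build keyword arguments for a boto3-style S3 client from a source config."""
    kwargs = {
        "region_name": source["region"],
        "aws_access_key_id": source["access_key_id"],
        "aws_secret_access_key": source["secret_access_key"],
    }
    # Only pass endpoint_url when set; leaving it out targets AWS S3.
    endpoint = source.get("endpoint_url")
    if endpoint:
        kwargs["endpoint_url"] = endpoint
    return kwargs


# Hypothetical MinIO deployment on its conventional local port.
minio_source = {
    "bucket": "my-bucket",
    "region": "us-east-1",
    "access_key_id": "EXAMPLE_KEY_ID",
    "secret_access_key": "EXAMPLE_SECRET",
    "endpoint_url": "http://localhost:9000",
}
print(s3_client_kwargs(minio_source))
# With boto3 installed: client = boto3.client("s3", **s3_client_kwargs(minio_source))
```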

How records are fetched

Each run lists objects in the bucket using the S3 ListObjectsV2 API, scoped by the optional prefix. Folder markers (empty objects whose keys end in /) are skipped. The pipeline paginates through the full listing and processes objects concurrently in small batches.

The pipeline captures an upper-bound timestamp at the start of each run and only processes objects whose lastModified is at or before that timestamp. This ensures that objects added to the bucket while a run is in progress are left for the next run — they aren't partially processed.
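The selection rules above can be sketched in Python as a filter over ListObjectsV2 pages. This is an illustration, not the connector's code: it assumes pages shaped like boto3's `list_objects_v2` responses (a `Contents` list whose entries have `Key`, `Size`, and `LastModified`), and it treats a folder marker as a zero-byte object whose key ends in `/`. The names `is_folder_marker` and `select_for_run` are hypothetical.

```python
from datetime import datetime, timezone
from typing import Dict, Iterable, Iterator


def is_folder_marker(obj: Dict) -> bool:
    """Assumed reading of 'folder marker': zero-byte object with a key ending in '/'."""
    return obj["Key"].endswith("/") and obj["Size"] == 0


def select_for_run(pages: Iterable[Dict], upper_bound: datetime) -> Iterator[Dict]:
    """Yield the objects a run would process, walking every page of the listing."""
    for page in pages:
        for obj in page.get("Contents", []):
            if is_folder_marker(obj):
                continue
            # Objects modified after the run started are left for the next run.
            if obj["LastModified"] > upper_bound:
                continue
            yield obj


# The upper bound is captured once, before listing begins.
run_start = datetime(2024, 1, 2, tzinfo=timezone.utc)
pages = [
    {"Contents": [
        {"Key": "docs/", "Size": 0, "LastModified": datetime(2024, 1, 1, tzinfo=timezone.utc)},
        {"Key": "docs/a.txt", "Size": 10, "LastModified": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    ]},
    {"Contents": [
        {"Key": "docs/late.txt", "Size": 5, "LastModified": datetime(2024, 1, 2, 0, 5, tzinfo=timezone.utc)},
    ]},
]
selected_keys = [o["Key"] for o in select_for_run(pages, run_start)]
print(selected_keys)
```

With boto3, the pages could come from `client.get_paginator("list_objects_v2").paginate(Bucket=..., Prefix=...)`.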

Incremental sync

When sync_mode is incremental (the default), the pipeline tracks a watermark based on each object's lastModified timestamp. After a successful run, the watermark advances to the upper bound captured at the start of that run.

On the next run, only objects with lastModified > stored_watermark are processed. This ensures:

  • New files are picked up.
  • Modified files (S3 updates lastModified on overwrite) are reprocessed.
  • Unchanged files are skipped, keeping costs low.

Deleted files are not explicitly tracked — they simply stop appearing in the listing.
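The watermark behaviour described above can be illustrated with two simulated runs. This Python sketch is an assumption-laden model rather than the pipeline's implementation: the function name `run_incremental` is hypothetical, objects are plain dicts with `Key` and `LastModified`, and a first run is modelled by a stored watermark of `None`.

```python
from datetime import datetime, timezone
from typing import Dict, List, Optional, Tuple


def run_incremental(
    objects: List[Dict],
    stored_watermark: Optional[datetime],
    upper_bound: datetime,
) -> Tuple[List[Dict], datetime]:
    """Return (objects to process, new watermark) for one successful run."""
    selected = [
        o for o in objects
        if (stored_watermark is None or o["LastModified"] > stored_watermark)
        and o["LastModified"] <= upper_bound
    ]
    # After a successful run the watermark advances to the run's upper bound.
    return selected, upper_bound


def ts(day: int, hour: int = 0) -> datetime:
    return datetime(2024, 1, day, hour, tzinfo=timezone.utc)


# Run 1: two objects exist; nothing has been synced yet.
bucket = [
    {"Key": "a.txt", "LastModified": ts(1, 10)},
    {"Key": "b.txt", "LastModified": ts(1, 11)},
]
first, watermark = run_incremental(bucket, None, ts(2))

# Before run 2: b.txt is overwritten and c.txt is added; a.txt is unchanged.
bucket = [
    {"Key": "a.txt", "LastModified": ts(1, 10)},
    {"Key": "b.txt", "LastModified": ts(2, 9)},
    {"Key": "c.txt", "LastModified": ts(2, 10)},
]
second, watermark = run_incremental(bucket, watermark, ts(3))
print([o["Key"] for o in first], [o["Key"] for o in second])
```

Because the second run compares against the first run's upper bound, the unchanged a.txt is skipped while the overwritten b.txt and the new c.txt are selected.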

Source metadata

Each record carries source metadata that the connector resolves at fetch time.

system_metadata:

Key | Description
size | Object size in bytes.
etag | The object's ETag.

user_metadata contains the object's user-defined metadata (the x-amz-meta-* headers), when present.

acl_metadata holds the object's ACL in the source-independent ACL metadata shape. S3 has no notion of comment access or membership-resolved groups, so commenters and the group_* buckets are always null. Entries in owners, editors, and readers are the grantee's email address (when the grant is by email) or the AWS canonical user ID. The buckets map to S3 grants as follows:

Bucket | S3 grant
owners | the object owner
editors | grantees with FULL_CONTROL or WRITE
readers | grantees with READ
public_access | the predefined AllUsers group
org_wide_access | the predefined AuthenticatedUsers group

If the storage service doesn't support the object-ACL API, acl_metadata is empty (every ACL bucket is null).
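The grant mapping in the table can be sketched as a function over an object ACL. The input follows the shape of an S3 GetObjectAcl response (`Owner`, and `Grants` with `Grantee` and `Permission`), and the two group URIs are the standard S3 predefined groups. Several details are assumptions made for illustration: the function name `map_acl`, representing public_access and org_wide_access as booleans, using empty lists where nothing matches, and ignoring permissions the table does not mention (READ_ACP, WRITE_ACP). The always-null commenters and group_* buckets are omitted.

```python
from typing import Dict

ALL_USERS = "http://acs.amazonaws.com/groups/global/AllUsers"
AUTHENTICATED_USERS = "http://acs.amazonaws.com/groups/global/AuthenticatedUsers"


def map_acl(acl: Dict) -> Dict:
    """Map an S3 object ACL onto owners/editors/readers and the two access flags."""
    result = {
        "owners": [],
        "editors": [],
        "readers": [],
        "public_access": False,    # assumed boolean representation
        "org_wide_access": False,  # assumed boolean representation
    }
    owner_id = acl.get("Owner", {}).get("ID")
    if owner_id:
        result["owners"].append(owner_id)

    for grant in acl.get("Grants", []):
        grantee = grant["Grantee"]
        permission = grant["Permission"]
        if grantee.get("Type") == "Group":
            if grantee.get("URI") == ALL_USERS:
                result["public_access"] = True
            elif grantee.get("URI") == AUTHENTICATED_USERS:
                result["org_wide_access"] = True
            continue
        # Prefer the email when the grant is by email, else the canonical user ID.
        identity = grantee.get("EmailAddress") or grantee.get("ID")
        if identity is None:
            continue
        if permission in ("FULL_CONTROL", "WRITE"):
            result["editors"].append(identity)
        elif permission == "READ":
            result["readers"].append(identity)
    return result


sample_acl = {
    "Owner": {"ID": "owner-canonical-id"},
    "Grants": [
        {"Grantee": {"Type": "CanonicalUser", "ID": "owner-canonical-id"}, "Permission": "FULL_CONTROL"},
        {"Grantee": {"Type": "AmazonCustomerByEmail", "EmailAddress": "reader@example.com"}, "Permission": "READ"},
        {"Grantee": {"Type": "Group", "URI": ALL_USERS}, "Permission": "READ"},
    ],
}
mapped = map_acl(sample_acl)
print(mapped)
```

In this sketch an owner who also holds an explicit FULL_CONTROL grant appears in both owners and editors; whether the connector deduplicates such entries is not specified.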

Permissions

The credentials you provide need these S3 permissions on the bucket:

  • s3:ListBucket — to enumerate objects in the bucket (scoped by prefix).
  • s3:GetObject — to download each object's contents.

Example IAM policy:

MINIMAL IAM POLICY

Code example with json syntax.
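Since the two permissions apply to different resources (s3:ListBucket to the bucket itself, s3:GetObject to the objects in it), a small Python sketch that generates such a policy may help. It is one common minimal form, offered as an assumption rather than the exact policy this page originally showed. The function name `minimal_policy`, the bucket name, and the optional prefix scoping via the `s3:prefix` condition key are illustrative choices.

```python
import json
from typing import Dict


def minimal_policy(bucket: str, prefix: str = "") -> Dict:
    """Build an IAM policy granting only s3:ListBucket and s3:GetObject."""
    list_statement = {
        "Effect": "Allow",
        "Action": "s3:ListBucket",
        # ListBucket is a bucket-level action, so the resource is the bucket ARN.
        "Resource": f"arn:aws:s3:::{bucket}",
    }
    if prefix:
        # Optional tightening: only allow listing under the configured prefix.
        list_statement["Condition"] = {"StringLike": {"s3:prefix": [f"{prefix}*"]}}

    get_statement = {
        "Effect": "Allow",
        "Action": "s3:GetObject",
        # GetObject is an object-level action, so the resource is an object ARN pattern.
        "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
    }
    return {"Version": "2012-10-17", "Statement": [list_statement, get_statement]}


# Hypothetical bucket and prefix.
policy = minimal_policy("my-bucket", "legal/contracts/")
print(json.dumps(policy, indent=2))
```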