feat(cantor-multicloudj): add multi-cloud object storage module via multicloudj #172

Open
p-konduru wants to merge 1 commit into salesforce:master from p-konduru:feature/cantor-multicloudj

p-konduru commented Jun 30, 2026

Overview

Introduces cantor-multicloudj, a new Maven module that implements Cantor's Objects and Events interfaces on top of com.salesforce.multicloudj BucketClient. One codebase targets AWS S3, Alibaba Cloud OSS, and GCP Cloud Storage through a single cloud-agnostic abstraction, so deployments can switch backends by swapping the underlying BucketClient without touching Cantor application code.

This change is purely additive. The only edit outside the new module is registering it in the parent pom.xml; no existing module's behavior changes.

What's added

  • CantorOnMulticloudj — top-level facade exposing objects() and events(), with BucketClient and convenience constructors.
  • ObjectsOnMulticloudj — full Objects contract: store, get, delete, keys, size, and streaming variants, backed by object-storage primitives.
  • EventsOnMulticloudj — Events contract via a buffer-and-flush model with client-side filtering on metadata/dimensions; flushes buffered events before namespace expiry so in-flight writes are not lost.
  • MulticloudjUtils — shared helpers for listing, batched deletes, and namespace key trimming.
  • AbstractBaseMulticloudjNamespaceable — hoisted base class for shared namespace create / drop / exists logic and constants, de-duplicated across Objects and Events.
  • Module README.md with usage, supported backends, and an explicit Known Limitations / Trade-offs section.
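For reviewers who want a concrete picture of the MulticloudjUtils helpers, the following is a minimal sketch of namespace key trimming and delete batching. It is not the code in this PR: the class name KeyHelpers, the method names, the "<namespace>/" key prefix layout, and the caller-supplied batch size are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the shared helpers; names and key layout are assumptions. */
final class KeyHelpers {
    private KeyHelpers() {}

    /** Strips an assumed "<namespace>/" prefix from a full object key. */
    static String trimNamespace(final String namespacePrefix, final String objectKey) {
        if (!objectKey.startsWith(namespacePrefix)) {
            throw new IllegalArgumentException("key is outside namespace: " + objectKey);
        }
        return objectKey.substring(namespacePrefix.length());
    }

    /** Splits keys into consecutive batches of at most batchSize, preserving order. */
    static List<List<String>> partition(final List<String> keys, final int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        final List<List<String>> batches = new ArrayList<>();
        for (int i = 0; i < keys.size(); i += batchSize) {
            final int end = Math.min(i + batchSize, keys.size());
            // copy the view so each batch is independent of the source list
            batches.add(new ArrayList<>(keys.subList(i, end)));
        }
        return batches;
    }
}
```

Each batch would then be handed to a single delete call against the bucket. The right batch size depends on the backend's limits, which is why it is a parameter in this sketch.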

Security hardening

  • Bounded blob downloads — max-size guard on object reads to prevent OOM on adversarial or accidentally huge objects.
  • Restrictive event-buffer directory permissions — buffer dir is created with owner-only permissions (0700).
  • Path traversal validation — buffer directory path is validated and canonicalized before use to block ..-style escapes.
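To illustrate the bounded-download guard, here is a minimal sketch of reading a stream with a hard byte limit. The class BoundedReads and the method readAtMost are hypothetical names, and the actual guard in ObjectsOnMulticloudj may be structured differently (for example, by checking object metadata before downloading).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper, not the module's actual class. */
final class BoundedReads {
    private BoundedReads() {}

    /** Reads the stream fully into memory, failing as soon as more than maxBytes have been read. */
    static byte[] readAtMost(final InputStream in, final long maxBytes) throws IOException {
        if (maxBytes < 0) {
            throw new IllegalArgumentException("maxBytes must be non-negative");
        }
        final ByteArrayOutputStream out = new ByteArrayOutputStream();
        final byte[] chunk = new byte[8192];
        long total = 0;
        int read;
        while ((read = in.read(chunk)) != -1) {
            total += read;
            if (total > maxBytes) {
                // stop before buffering the oversize chunk, so memory use stays bounded
                throw new IOException("object exceeds maximum allowed size of " + maxBytes + " bytes");
            }
            out.write(chunk, 0, read);
        }
        return out.toByteArray();
    }
}
```

Counting bytes while streaming means the limit holds even when the reported object size is missing or wrong, at the cost of discovering the violation only after part of the object has been transferred.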

Tests

  • 118 tests covering Objects, Events, namespace lifecycle, buffering/flush semantics, and edge cases.
  • No cloud credentials required — tests run against the blob-inmemory multicloudj provider, so CI and local dev work offline.

Known limitations / trade-offs

  • No server-side filtering / S3 Select equivalent. multicloudj's BucketClient does not expose an S3 Select-style API across backends, so Events filtering is performed client-side after fetching the candidate blobs. This is the main perf trade-off versus a native S3 implementation and is called out in the module README.
  • Eventually-consistent listing semantics are inherited from the underlying object store; callers needing strong read-your-writes on keys() should account for backend behavior.
  • Events use a buffer-and-flush model, not per-event writes, so very recent events may not be visible until the buffer flushes (flush is forced before namespace expiry to prevent data loss).
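The client-side filtering trade-off in the first bullet can be pictured with a small sketch: every candidate event is fetched first, then matched in memory. The Event class below is a simplified, hypothetical shape, and the matching rules (exact metadata equality, per-dimension minimums, start-inclusive and end-exclusive timestamps) are assumptions chosen for brevity. They are not Cantor's actual query syntax.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of in-memory event filtering; not the module's implementation. */
final class ClientSideEventFilter {
    private ClientSideEventFilter() {}

    /** Simplified event shape assumed for this example. */
    static final class Event {
        final long timestampMillis;
        final Map<String, String> metadata;
        final Map<String, Double> dimensions;

        Event(final long timestampMillis,
              final Map<String, String> metadata,
              final Map<String, Double> dimensions) {
            this.timestampMillis = timestampMillis;
            this.metadata = metadata;
            this.dimensions = dimensions;
        }
    }

    /** Keeps events in [startInclusive, endExclusive) that satisfy every metadata and dimension condition. */
    static List<Event> filter(final List<Event> candidates,
                              final long startInclusive,
                              final long endExclusive,
                              final Map<String, String> requiredMetadata,
                              final Map<String, Double> minDimensions) {
        final List<Event> matches = new ArrayList<>();
        for (final Event event : candidates) {
            if (event.timestampMillis < startInclusive || event.timestampMillis >= endExclusive) {
                continue;
            }
            if (hasAllMetadata(event, requiredMetadata) && meetsMinimums(event, minDimensions)) {
                matches.add(event);
            }
        }
        return matches;
    }

    private static boolean hasAllMetadata(final Event event, final Map<String, String> required) {
        for (final Map.Entry<String, String> entry : required.entrySet()) {
            if (!entry.getValue().equals(event.metadata.get(entry.getKey()))) {
                return false;
            }
        }
        return true;
    }

    private static boolean meetsMinimums(final Event event, final Map<String, Double> minimums) {
        for (final Map.Entry<String, Double> entry : minimums.entrySet()) {
            final Double value = event.dimensions.get(entry.getKey());
            if (value == null || value < entry.getValue()) {
                return false;
            }
        }
        return true;
    }
}
```

The cost is proportional to the number of candidate events downloaded, not to the number returned, which is the performance gap relative to server-side selection.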

Test plan

  • mvn -pl cantor-multicloudj -am clean install builds cleanly from a fresh checkout
  • mvn -pl cantor-multicloudj test passes all 118 tests with no cloud credentials configured
  • Verify cantor-multicloudj appears as a module in the root pom.xml and no other module's build output changes
  • Spot-check ObjectsOnMulticloudj against the Objects interface contract (store/get/delete/keys/stream round-trip)
  • Spot-check EventsOnMulticloudj buffer-and-flush behavior and client-side filtering on metadata + dimensions
  • Confirm event buffer directory is created with 0700 perms and rejects path-traversal inputs
  • Confirm blob download size guard rejects oversize objects with a clear error rather than OOM
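For the buffer-directory check in the plan above, this is a minimal sketch of what "validated, then created with 0700" could look like using java.nio.file. The class BufferDirectories, the method createOwnerOnly, and the idea of an allowed root directory are assumptions for illustration. The sketch is POSIX-only, and it normalizes paths lexically without resolving symlinks, so it is weaker than full canonicalization.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.PosixFilePermission;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.Set;

/** Hypothetical helper, not the module's actual class. Requires a POSIX file system. */
final class BufferDirectories {
    private BufferDirectories() {}

    /** Resolves requested under allowedRoot, rejects escapes, and creates it with owner-only access. */
    static Path createOwnerOnly(final Path allowedRoot, final String requested) throws IOException {
        final Path root = allowedRoot.toAbsolutePath().normalize();
        // normalize() collapses "." and ".." segments so the containment check sees the real target
        final Path resolved = root.resolve(requested).normalize();
        if (!resolved.startsWith(root)) {
            throw new IllegalArgumentException("buffer directory escapes allowed root: " + requested);
        }
        final Set<PosixFilePermission> ownerOnly = PosixFilePermissions.fromString("rwx------");
        Files.createDirectories(resolved, PosixFilePermissions.asFileAttribute(ownerOnly));
        // createDirectories leaves an existing directory untouched, so enforce the mode explicitly
        Files.setPosixFilePermissions(resolved, ownerOnly);
        return resolved;
    }
}
```

A manual check along these lines is to call the helper with a value such as "../outside" and confirm it throws, then inspect the created directory with ls -ld.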

Server Integration

cantor-server is now wired to construct a cantor-multicloudj Cantor at runtime based on config.

  • CantorFactory — new multicloudj branch alongside the existing s3 / mysql / h2 types. Reads provider, bucket, region from config; builds a BucketClient via the multicloudj builder API; supports optional endpoint.override (.withEndpoint(URI)), optional proxy (.withProxyEndpoint(URI)), and optional buffer.directory for EventsOnMulticloudj. Because Sets is not implemented by this backend (matches cantor-s3), the factory sources Sets from another cantor type configured under multicloudj.sets.type.
  • Constants.java — 8 new keys under the multicloudj config namespace: provider, bucket, region, proxy.host, proxy.port, endpoint.override, buffer.directory, sets.type.
  • cantor-server.conf — default template block:
    multicloudj = {
        # supported providers: aws, gcp, ali
        provider=aws
        bucket=bucket-placeholder
        region=us-west-2
        sets.type=h2
    }
    
  • cantor-server/pom.xml — adds cantor-multicloudj compile dep, and declares blob-aws / blob-ali / blob-gcp (multicloudj 0.4.0) as runtime + optional deps so deployers can choose which cloud provider(s) to ship.

Usage

Set cantor.storage.type = multicloudj in cantor-server.conf, configure the multicloudj { ... } block, and ensure the matching blob-<provider> runtime dependency is on the classpath (they are declared optional in the server pom so consumers pick per-deployment).

Server integration test plan

  • mvn -pl cantor-server -am compile succeeds
  • Starting cantor-server with storage.type = multicloudj and a valid multicloudj config block produces a working Cantor that routes Objects/Events to the configured cloud provider and Sets to the type named in multicloudj.sets.type
  • Config validation rejects empty/missing provider or bucket with clear error messages
  • Optional endpoint.override, proxy, and buffer.directory are honored when present and skipped when absent
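The validation item above could be exercised against logic like the following sketch. It uses a plain Map in place of the server's real config object, and the class MulticloudjConfigCheck, the method validate, and the error wording are hypothetical. Rejecting providers outside aws, gcp, and ali is an extra assumption based on the comment in the default template, not a behavior stated in this PR.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical validation sketch; the real CantorFactory reads a config object, not a Map. */
final class MulticloudjConfigCheck {
    // provider names taken from the comment in the default cantor-server.conf block
    private static final Set<String> SUPPORTED_PROVIDERS =
            new HashSet<>(Arrays.asList("aws", "gcp", "ali"));

    private MulticloudjConfigCheck() {}

    /** Throws IllegalArgumentException with a descriptive message when a required key is unusable. */
    static void validate(final Map<String, String> multicloudjConfig) {
        final String provider = requireNonBlank(multicloudjConfig, "provider");
        requireNonBlank(multicloudjConfig, "bucket");
        if (!SUPPORTED_PROVIDERS.contains(provider)) {
            // assumption: unknown providers are rejected up front rather than at client build time
            throw new IllegalArgumentException(
                    "multicloudj.provider must be one of " + SUPPORTED_PROVIDERS + " but was '" + provider + "'");
        }
    }

    private static String requireNonBlank(final Map<String, String> config, final String key) {
        final String value = config.get(key);
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalArgumentException("multicloudj." + key + " must be set and non-empty");
        }
        return value.trim();
    }
}
```

Failing at startup with the offending key in the message keeps misconfiguration from surfacing later as an opaque provider error.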

@salesforce-cla

Thanks for the contribution! Unfortunately we can't verify the commit author(s): p-konduru <p***@s***.com>. One possible solution is to add that email to your GitHub account. Alternatively you can change your commits to another email and force push the change. After getting your commits associated with your GitHub account, refresh the status of this Pull Request.

…server integration

New Maven module cantor-multicloudj that implements Cantor's Objects and Events
interfaces on top of com.salesforce.multicloudj BucketClient. Supports AWS S3,
Alibaba Cloud OSS, and GCP Cloud Storage through a single cloud-agnostic abstraction.

Module (cantor-multicloudj):
- CantorOnMulticloudj facade with BucketClient and convenience constructors
- ObjectsOnMulticloudj: full Objects contract (store/get/delete/keys/stream)
- EventsOnMulticloudj: buffer-and-flush with client-side filtering
- MulticloudjUtils: shared helpers (listing, batched deletes, namespace trimming)
- Security hardening: bounded downloads, restrictive buffer permissions, path traversal validation
- 118 tests using blob-inmemory provider (no cloud credentials needed)
- Module README with Known Limitations / Trade-offs

Server integration (cantor-server):
- CantorFactory wired with 'multicloudj' storage type
- Full config support: provider, bucket, region, endpoint override, proxy, buffer directory
- Default cantor-server.conf template
- Provider runtimes (blob-aws, blob-ali, blob-gcp) as optional runtime deps
@p-konduru p-konduru force-pushed the feature/cantor-multicloudj branch from f9a0527 to 7a1a8e8 Compare July 1, 2026 17:10