Add Lakekeeper catalog support in docs #4177

Open · wants to merge 2 commits into main

Conversation

@somratdutta (Contributor) commented Jul 28, 2025

Summary

Adds documentation for using Lakekeeper as an Iceberg REST catalog with ClickHouse's DataLakeCatalog database engine.

Checklist

@somratdutta (Contributor, Author) commented:

Testing Instructions

This PR depends on a recently merged fix that is not yet available as a Docker image. Below are comprehensive testing instructions to validate the changes locally.

Prerequisites

Download the appropriate ClickHouse binary for your platform from the build artifacts. For macOS on Apple Silicon, use the arm_darwin build. A quick way to check your platform is sketched below.
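
If you are unsure which build you need, this minimal sketch prints your OS and CPU architecture (the mapping from uname output to build names is inferred from the arm_darwin example above):

# Prints OS and CPU architecture, e.g. "Darwin arm64" -> the arm_darwin build.
uname -sm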

Environment Setup

  1. Initialize ClickHouse Server

    chmod +x clickhouse
    ./clickhouse server
  2. Deploy Supporting Infrastructure

    Create a docker-compose.yaml file with the following configuration (a sketch for bringing the stack up follows the file):

    services:
      jupyter:
        image: quay.io/jupyter/pyspark-notebook:2024-10-14
        depends_on:
          lakekeeper:
            condition: service_healthy
          initialwarehouse:
            condition: service_completed_successfully
        command: start-notebook.sh --NotebookApp.token=''
        volumes:
          - ./notebooks:/home/jovyan/examples/
        ports:
          - "8888:8888"
    
      lakekeeper:
        image: quay.io/lakekeeper/catalog:latest-main  # same tag as the migrate service so the server matches the migrated schema
        environment:
          - LAKEKEEPER__PG_ENCRYPTION_KEY=This-is-NOT-Secure!
          - LAKEKEEPER__PG_DATABASE_URL_READ=postgresql://postgres:postgres@db:5432/postgres
          - LAKEKEEPER__PG_DATABASE_URL_WRITE=postgresql://postgres:postgres@db:5432/postgres
          - RUST_LOG=info
        command: ["serve"]
        healthcheck:
          test: ["CMD", "/home/nonroot/lakekeeper", "healthcheck"]
          interval: 1s
          timeout: 10s
          retries: 10
          start_period: 30s
        depends_on:
          migrate:
            condition: service_completed_successfully
          db:
            condition: service_healthy
          minio:
            condition: service_healthy
        ports:
          - 8181:8181
    
      migrate:
        image: quay.io/lakekeeper/catalog:latest-main
        environment:
          - LAKEKEEPER__PG_ENCRYPTION_KEY=This-is-NOT-Secure!
          - LAKEKEEPER__PG_DATABASE_URL_READ=postgresql://postgres:postgres@db:5432/postgres
          - LAKEKEEPER__PG_DATABASE_URL_WRITE=postgresql://postgres:postgres@db:5432/postgres
          - RUST_LOG=info
        restart: "no"
        command: ["migrate"]
        depends_on:
          db:
            condition: service_healthy
    
      bootstrap:
        image: curlimages/curl
        depends_on:
          lakekeeper:
            condition: service_healthy
        restart: "no"
        command:
          - -w
          - "%{http_code}"
          - "-X"
          - "POST"
          - "-v"
          - "http://lakekeeper:8181/management/v1/bootstrap"
          - "-H"
          - "Content-Type: application/json"
          - "--data"
          - '{"accept-terms-of-use": true}'
          - "-o"
          - "/dev/null"
    
      initialwarehouse:
        image: curlimages/curl
        depends_on:
          lakekeeper:
            condition: service_healthy
          bootstrap:
            condition: service_completed_successfully
        restart: "no"
        command:
          - -w
          - "%{http_code}"
          - "-X"
          - "POST"
          - "-v"
          - "http://lakekeeper:8181/management/v1/warehouse"
          - "-H"
          - "Content-Type: application/json"
          - "--data"
          - '{"warehouse-name": "demo", "project-id": "00000000-0000-0000-0000-000000000000", "storage-profile": {"type": "s3", "bucket": "warehouse-rest", "key-prefix": "", "assume-role-arn": null, "endpoint": "http://minio:9000", "region": "local-01", "path-style-access": true, "flavor": "minio", "sts-enabled": true}, "storage-credential": {"type": "s3", "credential-type": "access-key", "aws-access-key-id": "minio", "aws-secret-access-key": "ClickHouse_Minio_P@ssw0rd"}}'
          - "-o"
          - "/dev/null"
    
      db:
        image: bitnami/postgresql:16.3.0
        environment:
          - POSTGRESQL_USERNAME=postgres
          - POSTGRESQL_PASSWORD=postgres
          - POSTGRESQL_DATABASE=postgres
        healthcheck:
          test: ["CMD-SHELL", "pg_isready -U postgres -p 5432 -d postgres"]
          interval: 2s
          timeout: 10s
          retries: 5
          start_period: 10s
    
      minio:
        image: bitnami/minio:2025.4.22
        environment:
          - MINIO_ROOT_USER=minio
          - MINIO_ROOT_PASSWORD=ClickHouse_Minio_P@ssw0rd
          - MINIO_API_PORT_NUMBER=9000
          - MINIO_CONSOLE_PORT_NUMBER=9001
          - MINIO_SCHEME=http
          - MINIO_DEFAULT_BUCKETS=warehouse-rest
        networks: 
          default:
            aliases:
              - warehouse-rest.minio
        ports:
          - "9002:9000"
          - "9003:9001"
        healthcheck:
          test: ["CMD", "mc", "ls", "local", "|", "grep", "warehouse-rest"]
          interval: 2s
          timeout: 10s
          retries: 3
          start_period: 15s
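
With the compose file saved, here is a minimal sketch for bringing the stack up and confirming it has settled (service names as defined above; the one-shot jobs should exit cleanly):

# Start every service in the background.
docker compose up -d

# lakekeeper, db, and minio should report "healthy"; migrate, bootstrap,
# and initialwarehouse should show "exited (0)".
docker compose ps -a

# Optional: follow the catalog logs to confirm it is serving requests.
docker compose logs -f lakekeeper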

Data Ingestion via PyIceberg

Create a notebook at notebooks/Pyiceberg.ipynb to write test data in the Apache Iceberg format (a sketch for running it follows the code):

!pip install -q pyiceberg
from pyiceberg.catalog.rest import RestCatalog
import logging
import pandas as pd
import pyarrow as pa

# Uncomment for detailed logging
# logging.basicConfig(level=logging.DEBUG)

CATALOG_URL = "http://lakekeeper:8181/catalog"
DEMO_WAREHOUSE = "demo"

catalog = RestCatalog(
    name="my_catalog",
    warehouse=DEMO_WAREHOUSE,
    uri=CATALOG_URL,
    token="dummy",
)

# Initialize namespace
test_namespace = ("pyiceberg_namespace",)
if test_namespace not in catalog.list_namespaces():
    catalog.create_namespace(test_namespace)

# Prepare test dataset
test_table = ("pyiceberg_namespace", "my_table")
df = pd.DataFrame({
    "id": [1, 2, 3],
    "data": ["a", "b", "c"],
})
pa_df = pa.Table.from_pandas(df)

# Clean existing table if present
if test_table in catalog.list_tables(namespace=test_namespace):
    catalog.drop_table(test_table)

# Create and populate table
table = catalog.create_table(
    test_table,
    schema=pa_df.schema,
    properties={"write.metadata.compression-codec": "none"},
)

table.append(pa_df)

# Verify data ingestion
table = catalog.load_table(test_table)
print(table.scan().to_pandas())
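
Since the jupyter service starts with token authentication disabled, the notebook can be run interactively at http://localhost:8888, or executed headlessly from the host (a sketch; the container path follows from the volume mount in the compose file):

# Run the notebook in place inside the jupyter container.
docker compose exec jupyter \
  jupyter nbconvert --to notebook --execute --inplace \
  /home/jovyan/examples/Pyiceberg.ipynb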

Integration Testing

After executing the notebook, connect to ClickHouse and validate the DataLakeCatalog integration:

./clickhouse client

Execute the following SQL commands to verify functionality (a scripted variant follows the block):

-- Enable experimental Iceberg support
SET allow_experimental_database_iceberg = 1;

-- Configure DataLakeCatalog with REST catalog backend
CREATE DATABASE demo
ENGINE = DataLakeCatalog('http://localhost:8181/catalog', 'minio', 'ClickHouse_Minio_P@ssw0rd')
SETTINGS 
    catalog_type = 'rest', 
    storage_endpoint = 'http://localhost:9002/warehouse-rest', 
    warehouse = 'demo';

-- Verify table discovery
SHOW TABLES FROM demo;

-- Validate data retrieval
SELECT * FROM demo.`pyiceberg_namespace.my_table`;
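
The same validation can be scripted from the shell for quick re-runs (a sketch; it assumes the demo database was already created in the session above and the server is still running):

# The table should return the three rows written by PyIceberg.
./clickhouse client --query "SELECT count() FROM demo.\`pyiceberg_namespace.my_table\`"

# When finished, tear down the supporting services (-v also removes volumes).
docker compose down -v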
