Add Lakekeeper catalog support in docs #4177

Open · wants to merge 2 commits into main

Conversation

@somratdutta (Contributor) commented Jul 28, 2025

Summary

Adds documentation for using Lakekeeper as an Iceberg REST catalog with ClickHouse's DataLakeCatalog database engine.

Checklist

@somratdutta (Contributor, Author) commented:

Testing Instructions

This PR depends on a recently merged fix that is not yet available as a Docker image. Below are comprehensive testing instructions to validate the changes locally.

Prerequisites

Download the appropriate ClickHouse binary for your platform from the build artifacts. For macOS on Apple Silicon, use the arm_darwin build. A quick way to check your platform is sketched below.
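
If you are unsure which build you need, this minimal sketch prints your OS and CPU architecture (the mapping from uname output to build names is inferred from the arm_darwin example above):

# Prints OS and CPU architecture, e.g. "Darwin arm64" -> the arm_darwin build.
uname -sm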

Environment Setup

  1. Initialize ClickHouse Server

    chmod +x clickhouse
    ./clickhouse server
  2. Deploy Supporting Infrastructure

    Create a docker-compose.yaml file with the following configuration (a sketch for bringing the stack up follows the file):

    services:
      jupyter:
        image: quay.io/jupyter/pyspark-notebook:2024-10-14
        depends_on:
          lakekeeper:
            condition: service_healthy
          initialwarehouse:
            condition: service_completed_successfully
        command: start-notebook.sh --NotebookApp.token=''
        volumes:
          - ./notebooks:/home/jovyan/examples/
        ports:
          - "8888:8888"
    
      lakekeeper:
        image: quay.io/lakekeeper/catalog:latest-main  # same tag as the migrate service so the server matches the migrated schema
        environment:
          - LAKEKEEPER__PG_ENCRYPTION_KEY=This-is-NOT-Secure!
          - LAKEKEEPER__PG_DATABASE_URL_READ=postgresql://postgres:postgres@db:5432/postgres
          - LAKEKEEPER__PG_DATABASE_URL_WRITE=postgresql://postgres:postgres@db:5432/postgres
          - RUST_LOG=info
        command: ["serve"]
        healthcheck:
          test: ["CMD", "/home/nonroot/lakekeeper", "healthcheck"]
          interval: 1s
          timeout: 10s
          retries: 10
          start_period: 30s
        depends_on:
          migrate:
            condition: service_completed_successfully
          db:
            condition: service_healthy
          minio:
            condition: service_healthy
        ports:
          - 8181:8181
    
      migrate:
        image: quay.io/lakekeeper/catalog:latest-main
        environment:
          - LAKEKEEPER__PG_ENCRYPTION_KEY=This-is-NOT-Secure!
          - LAKEKEEPER__PG_DATABASE_URL_READ=postgresql://postgres:postgres@db:5432/postgres
          - LAKEKEEPER__PG_DATABASE_URL_WRITE=postgresql://postgres:postgres@db:5432/postgres
          - RUST_LOG=info
        restart: "no"
        command: ["migrate"]
        depends_on:
          db:
            condition: service_healthy
    
      bootstrap:
        image: curlimages/curl
        depends_on:
          lakekeeper:
            condition: service_healthy
        restart: "no"
        command:
          - -w
          - "%{http_code}"
          - "-X"
          - "POST"
          - "-v"
          - "http://lakekeeper:8181/management/v1/bootstrap"
          - "-H"
          - "Content-Type: application/json"
          - "--data"
          - '{"accept-terms-of-use": true}'
          - "-o"
          - "/dev/null"
    
      initialwarehouse:
        image: curlimages/curl
        depends_on:
          lakekeeper:
            condition: service_healthy
          bootstrap:
            condition: service_completed_successfully
        restart: "no"
        command:
          - -w
          - "%{http_code}"
          - "-X"
          - "POST"
          - "-v"
          - "http://lakekeeper:8181/management/v1/warehouse"
          - "-H"
          - "Content-Type: application/json"
          - "--data"
          - '{"warehouse-name": "demo", "project-id": "00000000-0000-0000-0000-000000000000", "storage-profile": {"type": "s3", "bucket": "warehouse-rest", "key-prefix": "", "assume-role-arn": null, "endpoint": "http://minio:9000", "region": "local-01", "path-style-access": true, "flavor": "minio", "sts-enabled": true}, "storage-credential": {"type": "s3", "credential-type": "access-key", "aws-access-key-id": "minio", "aws-secret-access-key": "ClickHouse_Minio_P@ssw0rd"}}'
          - "-o"
          - "/dev/null"
    
      db:
        image: bitnami/postgresql:16.3.0
        environment:
          - POSTGRESQL_USERNAME=postgres
          - POSTGRESQL_PASSWORD=postgres
          - POSTGRESQL_DATABASE=postgres
        healthcheck:
          test: ["CMD-SHELL", "pg_isready -U postgres -p 5432 -d postgres"]
          interval: 2s
          timeout: 10s
          retries: 5
          start_period: 10s
    
      minio:
        image: bitnami/minio:2025.4.22
        environment:
          - MINIO_ROOT_USER=minio
          - MINIO_ROOT_PASSWORD=ClickHouse_Minio_P@ssw0rd
          - MINIO_API_PORT_NUMBER=9000
          - MINIO_CONSOLE_PORT_NUMBER=9001
          - MINIO_SCHEME=http
          - MINIO_DEFAULT_BUCKETS=warehouse-rest
        networks: 
          default:
            aliases:
              - warehouse-rest.minio
        ports:
          - "9002:9000"
          - "9003:9001"
        healthcheck:
          test: ["CMD", "mc", "ls", "local", "|", "grep", "warehouse-rest"]
          interval: 2s
          timeout: 10s
          retries: 3
          start_period: 15s
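
With the compose file saved, here is a minimal sketch for bringing the stack up and confirming it has settled (service names as defined above; the one-shot jobs should exit cleanly):

# Start every service in the background.
docker compose up -d

# lakekeeper, db, and minio should report "healthy"; migrate, bootstrap,
# and initialwarehouse should show "exited (0)".
docker compose ps -a

# Optional: follow the catalog logs to confirm it is serving requests.
docker compose logs -f lakekeeper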

Data Ingestion via PyIceberg

Create a notebook at notebooks/Pyiceberg.ipynb to write test data in the Apache Iceberg format (a sketch for running it follows the code):

!pip install -q pyiceberg
from pyiceberg.catalog.rest import RestCatalog
import logging
import pandas as pd
import pyarrow as pa

# Uncomment for detailed logging
# logging.basicConfig(level=logging.DEBUG)

CATALOG_URL = "http://lakekeeper:8181/catalog"
DEMO_WAREHOUSE = "demo"

catalog = RestCatalog(
    name="my_catalog",
    warehouse=DEMO_WAREHOUSE,
    uri=CATALOG_URL,
    token="dummy",
)

# Initialize namespace
test_namespace = ("pyiceberg_namespace",)
if test_namespace not in catalog.list_namespaces():
    catalog.create_namespace(test_namespace)

# Prepare test dataset
test_table = ("pyiceberg_namespace", "my_table")
df = pd.DataFrame({
    "id": [1, 2, 3],
    "data": ["a", "b", "c"],
})
pa_df = pa.Table.from_pandas(df)

# Clean existing table if present
if test_table in catalog.list_tables(namespace=test_namespace):
    catalog.drop_table(test_table)

# Create and populate table
table = catalog.create_table(
    test_table,
    schema=pa_df.schema,
    properties={"write.metadata.compression-codec": "none"},
)

table.append(pa_df)

# Verify data ingestion
table = catalog.load_table(test_table)
print(table.scan().to_pandas())
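
Since the jupyter service starts with token authentication disabled, the notebook can be run interactively at http://localhost:8888, or executed headlessly from the host (a sketch; the container path follows from the volume mount in the compose file):

# Run the notebook in place inside the jupyter container.
docker compose exec jupyter \
  jupyter nbconvert --to notebook --execute --inplace \
  /home/jovyan/examples/Pyiceberg.ipynb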

Integration Testing

After executing the notebook, connect to ClickHouse and validate the DataLakeCatalog integration:

./clickhouse client

Execute the following SQL commands to verify functionality (a scripted variant follows the block):

-- Enable experimental Iceberg support
SET allow_experimental_database_iceberg = 1;

-- Configure DataLakeCatalog with REST catalog backend
CREATE DATABASE demo
ENGINE = DataLakeCatalog('http://localhost:8181/catalog', 'minio', 'ClickHouse_Minio_P@ssw0rd')
SETTINGS 
    catalog_type = 'rest', 
    storage_endpoint = 'http://localhost:9002/warehouse-rest', 
    warehouse = 'demo';

-- Verify table discovery
SHOW TABLES FROM demo;

-- Validate data retrieval
SELECT * FROM demo.`pyiceberg_namespace.my_table`;
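
The same validation can be scripted from the shell for quick re-runs (a sketch; it assumes the demo database was already created in the session above and the server is still running):

# The table should return the three rows written by PyIceberg.
./clickhouse client --query "SELECT count() FROM demo.\`pyiceberg_namespace.my_table\`"

# When finished, tear down the supporting services (-v also removes volumes).
docker compose down -v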
