
Add Lakekeeper catalog support in docs #4177

Open · wants to merge 2 commits into base: main
1 change: 1 addition & 0 deletions docs/integrations/index.mdx
@@ -246,6 +246,7 @@ We are actively compiling this list of ClickHouse integrations below, so it's no
|Redis|<Redissvg alt="Redis logo" style={{width: '3rem', 'height': '3rem'}}/>|Data ingestion|Allows ClickHouse to use [Redis](https://redis.io/) as a dictionary source.|[Documentation](/sql-reference/dictionaries/index.md#redis)|
|Redpanda|<Image img={redpanda} alt="Redpanda logo" size="logo"/>|Data ingestion|Redpanda is the streaming data platform for developers. It's API-compatible with Apache Kafka, but 10x faster, much easier to use, and more cost-effective.|[Blog](https://redpanda.com/blog/real-time-olap-database-clickhouse-redpanda)|
|REST Catalog||Data ingestion|Integration with the REST Catalog specification for Iceberg tables, supporting multiple catalog providers including Tabular.io.|[Documentation](/use-cases/data-lake/rest-catalog)|
|Lakekeeper||Data ingestion|Integration with Lakekeeper, an open-source REST catalog implementation for Apache Iceberg with multi-tenant support.|[Documentation](/use-cases/data-lake/lakekeeper-catalog)|
|Rust|<Image img={rust} size="logo" alt="Rust logo"/>|Language client|A typed client for ClickHouse|[Documentation](/integrations/language-clients/rust.md)|
|SQLite|<Sqlitesvg alt="Sqlite logo" style={{width: '3rem', 'height': '3rem'}}/>|Data ingestion|Allows importing and exporting data to SQLite, and supports queries against SQLite tables directly from ClickHouse.|[Documentation](/engines/table-engines/integrations/sqlite)|
|Superset|<Supersetsvg alt="Superset logo" style={{width: '3rem'}}/>|Data visualization|Explore and visualize your ClickHouse data with Apache Superset.|[Documentation](/integrations/data-visualization/superset-and-clickhouse.md)|
1 change: 1 addition & 0 deletions docs/use-cases/data_lake/index.md
@@ -14,3 +14,4 @@ ClickHouse supports integration with multiple catalogs (Unity, Glue, REST, Polar
| [Querying data in S3 using ClickHouse and the Glue Data Catalog](/use-cases/data-lake/glue-catalog) | Query your data in S3 buckets using ClickHouse and the Glue Data Catalog. |
| [Querying data in S3 using ClickHouse and the Unity Data Catalog](/use-cases/data-lake/unity-catalog) | Query your data using the Unity Catalog. |
| [Querying data in S3 using ClickHouse and the REST Catalog](/use-cases/data-lake/rest-catalog) | Query your data using the REST Catalog (Tabular.io). |
| [Querying data in S3 using ClickHouse and the Lakekeeper Catalog](/use-cases/data-lake/lakekeeper-catalog) | Query your data using the Lakekeeper Catalog. |
368 changes: 368 additions & 0 deletions docs/use-cases/data_lake/lakekeeper_catalog.md
@@ -0,0 +1,368 @@
---
slug: /use-cases/data-lake/lakekeeper-catalog
sidebar_label: 'Lakekeeper Catalog'
title: 'Lakekeeper Catalog'
pagination_prev: null
pagination_next: null
description: 'In this guide, we will walk you through the steps to query your data using ClickHouse and the Lakekeeper Catalog.'
keywords: ['Lakekeeper', 'REST', 'Tabular', 'Data Lake', 'Iceberg']
show_related_blogs: true
---

import ExperimentalBadge from '@theme/badges/ExperimentalBadge';

<ExperimentalBadge/>

:::note
Integration with the Lakekeeper Catalog works with Iceberg tables only.
The integration supports AWS S3 as well as other cloud storage providers.
:::

ClickHouse supports integration with multiple catalogs (Unity, Glue, REST, Polaris, etc.). This guide will walk you through the steps to query your data using ClickHouse and the [Lakekeeper](https://docs.lakekeeper.io/) catalog.

Lakekeeper is an open-source REST catalog implementation for Apache Iceberg that provides:
- **Rust-native** implementation for high performance and reliability
- **REST API** compliance with the Iceberg REST catalog specification
- **Cloud storage** integration with S3-compatible storage

:::note
As this feature is experimental, you will need to enable it using:
`SET allow_experimental_database_iceberg = 1;`
:::

## Local Development Setup {#local-development-setup}

For local development and testing, you can use a containerized Lakekeeper setup. This approach is ideal for learning, prototyping, and development environments.

### Prerequisites {#local-prerequisites}

1. **Docker and Docker Compose**: Ensure Docker is installed and running
2. **Sample Setup**: You can use the Lakekeeper docker-compose setup described below
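
You can quickly verify the prerequisites before continuing (standard Docker CLI commands; the exact version output will vary):

```bash
# Confirm Docker and the Compose plugin are installed and the daemon is running
docker --version
docker compose version
docker info > /dev/null && echo "Docker daemon is running"
```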

### Setting up Local Lakekeeper Catalog {#setting-up-local-lakekeeper-catalog}

You can use the official [Lakekeeper docker-compose setup](https://github.com/lakekeeper/lakekeeper/tree/main/examples/minimal) which provides a complete environment with Lakekeeper, PostgreSQL metadata backend, and MinIO for object storage.

**Step 1:** Create a new folder in which to run the example, then create a file `docker-compose.yml` with the following configuration:

```yaml
version: '3.8'

services:
  lakekeeper:
    image: quay.io/lakekeeper/catalog:latest
    environment:
      - LAKEKEEPER__PG_ENCRYPTION_KEY=This-is-NOT-Secure!
      - LAKEKEEPER__PG_DATABASE_URL_READ=postgresql://postgres:postgres@db:5432/postgres
      - LAKEKEEPER__PG_DATABASE_URL_WRITE=postgresql://postgres:postgres@db:5432/postgres
      - RUST_LOG=info
    command: ["serve"]
    healthcheck:
      test: ["CMD", "/home/nonroot/lakekeeper", "healthcheck"]
      interval: 1s
      timeout: 10s
      retries: 10
      start_period: 30s
    depends_on:
      migrate:
        condition: service_completed_successfully
      db:
        condition: service_healthy
      minio:
        condition: service_healthy
    ports:
      - 8181:8181
    networks:
      - iceberg_net

  migrate:
    image: quay.io/lakekeeper/catalog:latest-main
    environment:
      - LAKEKEEPER__PG_ENCRYPTION_KEY=This-is-NOT-Secure!
      - LAKEKEEPER__PG_DATABASE_URL_READ=postgresql://postgres:postgres@db:5432/postgres
      - LAKEKEEPER__PG_DATABASE_URL_WRITE=postgresql://postgres:postgres@db:5432/postgres
      - RUST_LOG=info
    restart: "no"
    command: ["migrate"]
    depends_on:
      db:
        condition: service_healthy
    networks:
      - iceberg_net

  bootstrap:
    image: curlimages/curl
    depends_on:
      lakekeeper:
        condition: service_healthy
    restart: "no"
    command:
      - -w
      - "%{http_code}"
      - "-X"
      - "POST"
      - "-v"
      - "http://lakekeeper:8181/management/v1/bootstrap"
      - "-H"
      - "Content-Type: application/json"
      - "--data"
      - '{"accept-terms-of-use": true}'
      - "-o"
      - "/dev/null"
    networks:
      - iceberg_net

  initialwarehouse:
    image: curlimages/curl
    depends_on:
      lakekeeper:
        condition: service_healthy
      bootstrap:
        condition: service_completed_successfully
    restart: "no"
    command:
      - -w
      - "%{http_code}"
      - "-X"
      - "POST"
      - "-v"
      - "http://lakekeeper:8181/management/v1/warehouse"
      - "-H"
      - "Content-Type: application/json"
      - "--data"
      - '{"warehouse-name": "demo", "project-id": "00000000-0000-0000-0000-000000000000", "storage-profile": {"type": "s3", "bucket": "warehouse-rest", "key-prefix": "", "assume-role-arn": null, "endpoint": "http://minio:9000", "region": "local-01", "path-style-access": true, "flavor": "minio", "sts-enabled": true}, "storage-credential": {"type": "s3", "credential-type": "access-key", "aws-access-key-id": "minio", "aws-secret-access-key": "ClickHouse_Minio_P@ssw0rd"}}'
      - "-o"
      - "/dev/null"
    networks:
      - iceberg_net

  db:
    image: bitnami/postgresql:16.3.0
    environment:
      - POSTGRESQL_USERNAME=postgres
      - POSTGRESQL_PASSWORD=postgres
      - POSTGRESQL_DATABASE=postgres
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres -p 5432 -d postgres"]
      interval: 2s
      timeout: 10s
      retries: 5
      start_period: 10s
    volumes:
      - postgres_data:/bitnami/postgresql
    networks:
      - iceberg_net

  minio:
    image: bitnami/minio:2025.4.22
    environment:
      - MINIO_ROOT_USER=minio
      - MINIO_ROOT_PASSWORD=ClickHouse_Minio_P@ssw0rd
      - MINIO_API_PORT_NUMBER=9000
      - MINIO_CONSOLE_PORT_NUMBER=9001
      - MINIO_SCHEME=http
      - MINIO_DEFAULT_BUCKETS=warehouse-rest
    networks:
      iceberg_net:
        aliases:
          - warehouse-rest.minio
    ports:
      - "9002:9000"
      - "9003:9001"
    healthcheck:
      test: ["CMD", "mc", "ls", "local", "|", "grep", "warehouse-rest"]
      interval: 2s
      timeout: 10s
      retries: 3
      start_period: 15s
    volumes:
      - minio_data:/bitnami/minio/data

  clickhouse:
    image: clickhouse/clickhouse-server:head
    container_name: lakekeeper-clickhouse
    user: '0:0' # Ensures root permissions
    ports:
      - "8123:8123"
      - "9000:9000"
    volumes:
      - clickhouse_data:/var/lib/clickhouse
      - ./clickhouse/data_import:/var/lib/clickhouse/data_import # Mount dataset folder
    networks:
      - iceberg_net
    environment:
      - CLICKHOUSE_DB=default
      - CLICKHOUSE_USER=default
      - CLICKHOUSE_DO_NOT_CHOWN=1
      - CLICKHOUSE_PASSWORD=
    depends_on:
      lakekeeper:
        condition: service_healthy
      minio:
        condition: service_healthy

volumes:
  postgres_data:
  minio_data:
  clickhouse_data:

networks:
  iceberg_net:
    driver: bridge
```

**Step 2:** Run the following command to start the services:

```bash
docker compose up -d
```

**Step 3:** Wait for all services to be ready. You can check the logs:

```bash
docker compose logs -f
```

:::note
The Lakekeeper setup requires that sample data be loaded into the Iceberg tables first. Make sure the environment has created and populated the tables before attempting to query them through ClickHouse. The availability of tables depends on the specific docker-compose setup and sample data loading scripts.
:::
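
Before connecting from ClickHouse, you can optionally confirm from the host that Lakekeeper is up and bootstrapped. This is a quick check against the management API on the mapped port 8181 (the exact response fields may vary between Lakekeeper versions):

```bash
# Should return server info, including whether the catalog has been bootstrapped
curl -s http://localhost:8181/management/v1/info
```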

### Connecting to Local Lakekeeper Catalog {#connecting-to-local-lakekeeper-catalog}

Connect to your ClickHouse container:

```bash
docker exec -it lakekeeper-clickhouse clickhouse-client
```

Then create the database connection to the Lakekeeper catalog:

```sql
SET allow_experimental_database_iceberg = 1;

CREATE DATABASE demo
ENGINE = DataLakeCatalog('http://lakekeeper:8181/catalog', 'minio', 'ClickHouse_Minio_P@ssw0rd')
SETTINGS catalog_type = 'rest', storage_endpoint = 'http://minio:9000/warehouse-rest', warehouse = 'demo';
```
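
If the `CREATE DATABASE` statement succeeds, the catalog is reachable. As a quick sanity check, you can confirm the database was registered with the expected engine (a plain system-table query, nothing Lakekeeper-specific):

```sql
SELECT name, engine
FROM system.databases
WHERE name = 'demo';
```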

## Querying Lakekeeper catalog tables using ClickHouse {#querying-lakekeeper-catalog-tables-using-clickhouse}

Now that the connection is in place, you can start querying via the Lakekeeper catalog. For example:

```sql
USE demo;

SHOW TABLES;
```

If your setup includes sample data (such as the taxi dataset), you should see tables like:

```sql title="Response"
┌─name──────────┐
│ default.taxis │
└───────────────┘
```

:::note
If you don't see any tables, this usually means:
1. The environment hasn't created the sample tables yet
2. The Lakekeeper catalog service isn't fully initialized
3. The sample data loading process hasn't completed

If your setup includes a data-loading service such as Spark, you can check its logs to see the table creation progress:
```bash
docker compose logs spark
```
:::

To query a table (if available):

```sql
SELECT count(*) FROM `default.taxis`;
```

```sql title="Response"
┌─count()─┐
│ 2171187 │
└─────────┘
```

:::note Backticks required
Backticks are required because ClickHouse doesn't support more than one namespace level, so the Iceberg namespace and table name are combined into a single identifier.
:::
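
Regular SQL works against these tables as usual. For example, assuming the sample taxi data is present, an aggregation over the Iceberg table might look like this:

```sql
-- Illustrative query over the sample dataset: trip counts and
-- average distance per payment type
SELECT
    payment_type,
    count() AS trips,
    round(avg(trip_distance), 2) AS avg_distance
FROM `default.taxis`
GROUP BY payment_type
ORDER BY trips DESC;
```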

To inspect the table DDL:

```sql
SHOW CREATE TABLE `default.taxis`;
```

```sql title="Response"
┌─statement─────────────────────────────────────────────────────────────────────────────────────┐
│ CREATE TABLE demo.`default.taxis` │
│ ( │
│ `VendorID` Nullable(Int64), │
│ `tpep_pickup_datetime` Nullable(DateTime64(6)), │
│ `tpep_dropoff_datetime` Nullable(DateTime64(6)), │
│ `passenger_count` Nullable(Float64), │
│ `trip_distance` Nullable(Float64), │
│ `RatecodeID` Nullable(Float64), │
│ `store_and_fwd_flag` Nullable(String), │
│ `PULocationID` Nullable(Int64), │
│ `DOLocationID` Nullable(Int64), │
│ `payment_type` Nullable(Int64), │
│ `fare_amount` Nullable(Float64), │
│ `extra` Nullable(Float64), │
│ `mta_tax` Nullable(Float64), │
│ `tip_amount` Nullable(Float64), │
│ `tolls_amount` Nullable(Float64), │
│ `improvement_surcharge` Nullable(Float64), │
│ `total_amount` Nullable(Float64), │
│ `congestion_surcharge` Nullable(Float64), │
│ `airport_fee` Nullable(Float64) │
│ ) │
│ ENGINE = Iceberg('http://minio:9000/warehouse-rest/warehouse/default/taxis/', 'minio', '[HIDDEN]') │
└───────────────────────────────────────────────────────────────────────────────────────────────┘
```
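
If you only need column names and types, `DESCRIBE TABLE` gives a more compact view of the same schema:

```sql
DESCRIBE TABLE `default.taxis`;
```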

## Loading data from your Data Lake into ClickHouse {#loading-data-from-your-data-lake-into-clickhouse}

If you need to load data from the Lakekeeper catalog into ClickHouse, start by creating a local ClickHouse table:

```sql
CREATE TABLE taxis
(
`VendorID` Int64,
`tpep_pickup_datetime` DateTime64(6),
`tpep_dropoff_datetime` DateTime64(6),
`passenger_count` Float64,
`trip_distance` Float64,
`RatecodeID` Float64,
`store_and_fwd_flag` String,
`PULocationID` Int64,
`DOLocationID` Int64,
`payment_type` Int64,
`fare_amount` Float64,
`extra` Float64,
`mta_tax` Float64,
`tip_amount` Float64,
`tolls_amount` Float64,
`improvement_surcharge` Float64,
`total_amount` Float64,
`congestion_surcharge` Float64,
`airport_fee` Float64
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(tpep_pickup_datetime)
ORDER BY (VendorID, tpep_pickup_datetime, PULocationID, DOLocationID);
```

Then load the data from your Lakekeeper catalog table via an `INSERT INTO SELECT`:

```sql
INSERT INTO taxis
SELECT * FROM demo.`default.taxis`;
```
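
You can then verify that the copy is complete by comparing row counts between the local MergeTree table and the Iceberg source (with the sample dataset above, both should return 2171187):

```sql
-- Both subqueries should return the same count once the INSERT has finished
SELECT
    (SELECT count() FROM taxis) AS local_rows,
    (SELECT count() FROM demo.`default.taxis`) AS iceberg_rows;
```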


3 changes: 2 additions & 1 deletion sidebars.js
@@ -168,7 +168,8 @@ const sidebars = {
items: [
"use-cases/data_lake/glue_catalog",
"use-cases/data_lake/unity_catalog",
"use-cases/data_lake/rest_catalog"
"use-cases/data_lake/rest_catalog",
"use-cases/data_lake/lakekeeper_catalog"
]
},
{