From 4d86648950730a1b9da59f401e0356307167e0b5 Mon Sep 17 00:00:00 2001
From: somratdutta
Date: Thu, 24 Jul 2025 12:17:14 +0530
Subject: [PATCH 1/2] Add Lakekeeper catalog documentation to data lake use
 cases

---
 docs/integrations/index.mdx                   |   1 +
 docs/use-cases/data_lake/index.md             |   1 +
 .../use-cases/data_lake/lakekeeper_catalog.md | 321 ++++++++++++++++++
 sidebars.js                                   |   3 +-
 4 files changed, 325 insertions(+), 1 deletion(-)
 create mode 100644 docs/use-cases/data_lake/lakekeeper_catalog.md

diff --git a/docs/integrations/index.mdx b/docs/integrations/index.mdx
index 299c2b25fcb..5e703973ef5 100644
--- a/docs/integrations/index.mdx
+++ b/docs/integrations/index.mdx
@@ -246,6 +246,7 @@ We are actively compiling this list of ClickHouse integrations below, so it's no
 |Redis||Data ingestion|Allows ClickHouse to use [Redis](https://redis.io/) as a dictionary source.|[Documentation](/sql-reference/dictionaries/index.md#redis)|
 |Redpanda|Redpanda logo|Data ingestion|Redpanda is the streaming data platform for developers. It's API-compatible with Apache Kafka, but 10x faster, much easier to use, and more cost effective|[Blog](https://redpanda.com/blog/real-time-olap-database-clickhouse-redpanda)|
 |REST Catalog||Data ingestion|Integration with REST Catalog specification for Iceberg tables, supporting multiple catalog providers including Tabular.io.|[Documentation](/use-cases/data-lake/rest-catalog)|
+|Lakekeeper||Data ingestion|Integration with Lakekeeper, an open-source REST catalog implementation for Apache Iceberg with multi-tenant support.|[Documentation](/use-cases/data-lake/lakekeeper-catalog)|
 |Rust|Rust logo|Language client|A typed client for ClickHouse|[Documentation](/integrations/language-clients/rust.md)|
 |SQLite||Data ingestion|Allows to import and export data to SQLite and supports queries to SQLite tables directly from ClickHouse.|[Documentation](/engines/table-engines/integrations/sqlite)|
 |Superset||Data visualization|Explore and visualize your ClickHouse data with Apache Superset.|[Documentation](/integrations/data-visualization/superset-and-clickhouse.md)|

diff --git a/docs/use-cases/data_lake/index.md b/docs/use-cases/data_lake/index.md
index 0d0380e2d34..52716c41e03 100644
--- a/docs/use-cases/data_lake/index.md
+++ b/docs/use-cases/data_lake/index.md
@@ -14,3 +14,4 @@ ClickHouse supports integration with multiple catalogs (Unity, Glue, REST, Polar
 | [Querying data in S3 using ClickHouse and the Glue Data Catalog](/use-cases/data-lake/glue-catalog) | Query your data in S3 buckets using ClickHouse and the Glue Data Catalog. |
 | [Querying data in S3 using ClickHouse and the Unity Data Catalog](/use-cases/data-lake/unity-catalog) | Query your data using the Unity Catalog. |
 | [Querying data in S3 using ClickHouse and the REST Catalog](/use-cases/data-lake/rest-catalog) | Query your data using the REST Catalog (Tabular.io). |
+| [Querying data in S3 using ClickHouse and the Lakekeeper Catalog](/use-cases/data-lake/lakekeeper-catalog) | Query your data using the Lakekeeper Catalog. |
diff --git a/docs/use-cases/data_lake/lakekeeper_catalog.md b/docs/use-cases/data_lake/lakekeeper_catalog.md
new file mode 100644
index 00000000000..4d489d7fcec
--- /dev/null
+++ b/docs/use-cases/data_lake/lakekeeper_catalog.md
@@ -0,0 +1,321 @@
+---
+slug: /use-cases/data-lake/lakekeeper-catalog
+sidebar_label: 'Lakekeeper Catalog'
+title: 'Lakekeeper Catalog'
+pagination_prev: null
+pagination_next: null
+description: 'In this guide, we will walk you through the steps to query
+  your data using ClickHouse and the Lakekeeper Catalog.'
+keywords: ['Lakekeeper', 'REST', 'Data Lake', 'Iceberg']
+show_related_blogs: true
+---
+
+import ExperimentalBadge from '@theme/badges/ExperimentalBadge';
+
+<ExperimentalBadge/>
+
+:::note
+Integration with the Lakekeeper Catalog works with Iceberg tables only.
+This integration supports AWS S3 and other S3-compatible storage providers.
+:::
+
+ClickHouse supports integration with multiple catalogs (Unity, Glue, REST, Polaris, etc.). This guide will walk you through the steps to query your data using ClickHouse and the [Lakekeeper](https://github.com/lakekeeper/lakekeeper) catalog.
+
+Lakekeeper is an open-source REST catalog implementation for Apache Iceberg that provides:
+- **REST API** compliance with the Iceberg REST catalog specification
+- **Multi-tenant** support for managing multiple warehouses
+- **Cloud storage** integration with S3-compatible storage
+- **Production-ready** deployment capabilities
+
+:::note
+As this feature is experimental, you will need to enable it using:
+`SET allow_experimental_database_iceberg = 1;`
+:::
+
+## Local Development Setup {#local-development-setup}
+
+For local development and testing, you can use a containerized Lakekeeper setup. This approach is ideal for learning, prototyping, and development environments.
+
+### Prerequisites {#local-prerequisites}
+
+1. **Docker and Docker Compose**: Ensure Docker is installed and running
+2. **Sample Setup**: You can use the official Lakekeeper docker-compose setup described below
+
+### Setting up Local Lakekeeper Catalog {#setting-up-local-lakekeeper-catalog}
+
+You can use the official Lakekeeper docker-compose setup which provides a complete environment with Lakekeeper, PostgreSQL metadata backend, and MinIO for object storage.
+
+**Step 1:** Create a new folder in which to run the example, then create a file `docker-compose.yml` with the following configuration:
+
+```yaml
+version: '3.8'
+
+services:
+  postgres:
+    image: postgres:15
+    container_name: lakekeeper-postgres
+    environment:
+      POSTGRES_USER: iceberg
+      POSTGRES_PASSWORD: iceberg
+      POSTGRES_DB: iceberg
+    ports:
+      - "5432:5432"
+    volumes:
+      - postgres_data:/var/lib/postgresql/data
+    networks:
+      - iceberg_net
+
+  minio:
+    image: minio/minio:latest
+    container_name: lakekeeper-minio
+    environment:
+      MINIO_ROOT_USER: admin
+      MINIO_ROOT_PASSWORD: password
+      MINIO_DOMAIN: minio
+    ports:
+      - "9001:9001"
+      - "9000:9000"
+    command: ["server", "/data", "--console-address", ":9001"]
+    volumes:
+      - minio_data:/data
+    networks:
+      - iceberg_net
+
+  # Initialize MinIO with required buckets
+  mc:
+    image: minio/mc:latest
+    container_name: lakekeeper-mc
+    depends_on:
+      - minio
+    entrypoint: >
+      /bin/sh -c "
+      until (/usr/bin/mc config host add minio http://minio:9000 admin password) do echo '...waiting...' && sleep 1; done;
+      /usr/bin/mc mb minio/warehouse;
+      /usr/bin/mc policy set public minio/warehouse;
+      exit 0;
+      "
+    networks:
+      - iceberg_net
+
+  lakekeeper:
+    image: lakekeeper/lakekeeper:latest
+    container_name: lakekeeper-catalog
+    depends_on:
+      - postgres
+      - minio
+    environment:
+      LAKEKEEPER__PG_ENCRYPTION_KEY: "abcdefghijklmnopqrstuvwxyz123456"
+      LAKEKEEPER__PG_DATABASE_URL_READ: "postgresql://iceberg:iceberg@postgres:5432/iceberg"
+      LAKEKEEPER__PG_DATABASE_URL_WRITE: "postgresql://iceberg:iceberg@postgres:5432/iceberg"
+      LAKEKEEPER__STORAGE__S3__ENDPOINT: "http://minio:9000"
+      LAKEKEEPER__STORAGE__S3__ACCESS_KEY_ID: "admin"
+      LAKEKEEPER__STORAGE__S3__SECRET_ACCESS_KEY: "password"
+      LAKEKEEPER__STORAGE__S3__REGION: "us-east-1"
+      LAKEKEEPER__STORAGE__S3__BUCKET: "warehouse"
+      LAKEKEEPER__STORAGE__S3__PATH_STYLE_ACCESS: "true"
+    ports:
+      - "8080:8080"
+    networks:
+      - iceberg_net
+
+  clickhouse:
+    image: clickhouse/clickhouse-server:head
+    container_name: lakekeeper-clickhouse
+    user: '0:0'  # Ensures root permissions
+    ports:
+      - "8123:8123"
+      - "9002:9000"
+    volumes:
+      - ./clickhouse:/var/lib/clickhouse
+      - ./clickhouse/data_import:/var/lib/clickhouse/data_import # Mount dataset folder
+    networks:
+      - iceberg_net
+    environment:
+      - CLICKHOUSE_DB=default
+      - CLICKHOUSE_USER=default
+      - CLICKHOUSE_DO_NOT_CHOWN=1
+      - CLICKHOUSE_PASSWORD=
+
+volumes:
+  postgres_data:
+  minio_data:
+
+networks:
+  iceberg_net:
+    driver: bridge
+```
+
+**Step 2:** Run the following command to start the services:
+
+```bash
+docker compose up -d
+```
+
+**Step 3:** Wait for all services to be ready. You can check the logs:
+
+```bash
+docker-compose logs -f lakekeeper
+```
+
+**Step 4:** Verify that Lakekeeper is running by checking the catalog status:
+
+```bash
+curl http://localhost:8080/v1/config
+```
+
+You should see a JSON response indicating the catalog configuration.
+
+:::note
+The Lakekeeper setup requires that the MinIO buckets be created first. The `mc` service in the docker-compose file handles this initialization. Make sure all services are healthy before attempting to query them through ClickHouse.
+:::
+
+### Connecting to Local Lakekeeper Catalog {#connecting-to-local-lakekeeper-catalog}
+
+Connect to your ClickHouse container:
+
+```bash
+docker exec -it lakekeeper-clickhouse clickhouse-client
+```
+
+Then create the database connection to the Lakekeeper catalog:
+
+```sql
+SET allow_experimental_database_iceberg = 1;
+
+CREATE DATABASE lakekeeper_demo
+ENGINE = DataLakeCatalog('http://lakekeeper:8080/v1', '', '')
+SETTINGS
+    catalog_type = 'rest',
+    storage_endpoint = 'http://minio:9000/warehouse',
+    warehouse = 'demo'
+```
+
+## Creating Sample Data {#creating-sample-data}
+
+Before querying tables, let's create some sample data using a simple Python script or by using the Iceberg Python library to create tables in Lakekeeper.
+
+**Step 1:** Create a simple table using the REST API:
+
+```bash
+# First, create a namespace (database)
+curl -X POST http://localhost:8080/v1/namespaces \
+  -H "Content-Type: application/json" \
+  -d '{"namespace": ["demo"], "properties": {}}'
+
+# Then create a table (this is a simplified example - in practice you would use Iceberg clients)
+```
+
+:::note
+For production use, you would typically use Iceberg-compatible tools like Apache Spark, PyIceberg, or other Iceberg clients to create and populate tables. The Lakekeeper catalog acts as the metadata layer that coordinates table operations.
+::: + +## Querying Lakekeeper catalog tables using ClickHouse {#querying-lakekeeper-catalog-tables-using-clickhouse} + +Now that the connection is in place, you can start querying via the Lakekeeper catalog. For example: + +```sql +USE lakekeeper_demo; + +SHOW TABLES; +``` + +If your setup includes sample data, you should see tables created in the demo namespace. + +:::note +If you don't see any tables, this usually means: +1. No tables have been created in the Lakekeeper catalog yet +2. The Lakekeeper service isn't fully initialized +3. The namespace doesn't exist + +You can check the Lakekeeper logs to see the catalog activity: +```bash +docker-compose logs lakekeeper +``` +::: + +To create and query a sample table (assuming you have created one through Iceberg clients): + +```sql +-- Example query if you have created sample tables +SELECT count(*) FROM `demo.sample_table`; +``` + +:::note Backticks required +Backticks are required because ClickHouse doesn't support more than one namespace. +::: + +To inspect a table DDL (if available): + +```sql +SHOW CREATE TABLE `demo.sample_table`; +``` + +## Loading data from your Data Lake into ClickHouse {#loading-data-from-your-data-lake-into-clickhouse} + +If you need to load data from the Lakekeeper catalog into ClickHouse, start by creating a local ClickHouse table that matches your Iceberg table schema: + +```sql +-- Example table structure - adjust based on your actual Iceberg table schema +CREATE TABLE local_sample_table +( + `id` Int64, + `name` String, + `timestamp` DateTime64(6), + `value` Float64 +) +ENGINE = MergeTree() +PARTITION BY toYYYYMM(timestamp) +ORDER BY (id, timestamp); +``` + +Then load the data from your Lakekeeper catalog table via an `INSERT INTO SELECT`: + +```sql +INSERT INTO local_sample_table +SELECT * FROM lakekeeper_demo.`demo.sample_table`; +``` + +## Managing the Lakekeeper Catalog {#managing-lakekeeper-catalog} + +### Accessing the MinIO Console + +You can access the MinIO console at `http://localhost:9001` using: +- Username: `admin` +- Password: `password` + +### Monitoring Lakekeeper + +Lakekeeper provides REST endpoints for monitoring and management: + +```bash +# Check catalog health +curl http://localhost:8080/health + +# List namespaces +curl http://localhost:8080/v1/namespaces + +# Get catalog configuration +curl http://localhost:8080/v1/config +``` + +### Cleanup + +To stop and remove all containers: + +```bash +docker-compose down -v +``` + +This will remove all containers and their associated volumes, including the PostgreSQL metadata and MinIO data. + +## Production Considerations {#production-considerations} + +When deploying Lakekeeper in production: + +1. **Security**: Configure proper authentication and authorization +2. **Persistence**: Use persistent volumes for PostgreSQL and MinIO data +3. **High Availability**: Deploy multiple Lakekeeper instances behind a load balancer +4. **Monitoring**: Set up proper monitoring and alerting for all components +5. **Backup**: Implement backup strategies for metadata and object storage + +For more information, refer to the [Lakekeeper documentation](https://github.com/lakekeeper/lakekeeper). 
\ No newline at end of file
diff --git a/sidebars.js b/sidebars.js
index c7eb0b5f3c1..9de8c6165bb 100644
--- a/sidebars.js
+++ b/sidebars.js
@@ -168,7 +168,8 @@ const sidebars = {
       items: [
         "use-cases/data_lake/glue_catalog",
         "use-cases/data_lake/unity_catalog",
-        "use-cases/data_lake/rest_catalog"
+        "use-cases/data_lake/rest_catalog",
+        "use-cases/data_lake/lakekeeper_catalog"
       ]
     },
     {

From 25853d9aec2882784050db0daef417a791783b6d Mon Sep 17 00:00:00 2001
From: somratdutta
Date: Tue, 29 Jul 2025 01:25:39 +0530
Subject: [PATCH 2/2] minor changes

---
 .../use-cases/data_lake/lakekeeper_catalog.md | 361 ++++++++++--------
 1 file changed, 204 insertions(+), 157 deletions(-)

diff --git a/docs/use-cases/data_lake/lakekeeper_catalog.md b/docs/use-cases/data_lake/lakekeeper_catalog.md
index 4d489d7fcec..7613b665e97 100644
--- a/docs/use-cases/data_lake/lakekeeper_catalog.md
+++ b/docs/use-cases/data_lake/lakekeeper_catalog.md
@@ -19,13 +19,12 @@ Integration with the Lakekeeper Catalog works with Iceberg tables only.
 This integration supports AWS S3 and other S3-compatible storage providers.
 :::
 
-ClickHouse supports integration with multiple catalogs (Unity, Glue, REST, Polaris, etc.). This guide will walk you through the steps to query your data using ClickHouse and the [Lakekeeper](https://github.com/lakekeeper/lakekeeper) catalog.
+ClickHouse supports integration with multiple catalogs (Unity, Glue, REST, Polaris, etc.). This guide will walk you through the steps to query your data using ClickHouse and the [Lakekeeper](https://docs.lakekeeper.io/) catalog.
 
 Lakekeeper is an open-source REST catalog implementation for Apache Iceberg that provides:
+- **Rust-native** implementation for high performance and reliability
 - **REST API** compliance with the Iceberg REST catalog specification
-- **Multi-tenant** support for managing multiple warehouses
 - **Cloud storage** integration with S3-compatible storage
-- **Production-ready** deployment capabilities
 
 :::note
 As this feature is experimental, you will need to enable it using:
@@ -43,7 +42,7 @@ For local development and testing, you can use a containerized Lakekeeper setup
 
 ### Setting up Local Lakekeeper Catalog {#setting-up-local-lakekeeper-catalog}
 
-You can use the official Lakekeeper docker-compose setup which provides a complete environment with Lakekeeper, PostgreSQL metadata backend, and MinIO for object storage.
+You can use the official [Lakekeeper docker-compose setup](https://github.com/lakekeeper/lakekeeper/tree/main/examples/minimal) which provides a complete environment with Lakekeeper, PostgreSQL metadata backend, and MinIO for object storage.
 
 **Step 1:** Create a new folder in which to run the example, then create a file `docker-compose.yml` with the following configuration:
 
@@ -51,82 +50,144 @@ You can use the official Lakekeeper docker-compose setup which provides a comple
 version: '3.8'
 
 services:
-  postgres:
-    image: postgres:15
-    container_name: lakekeeper-postgres
+  lakekeeper:
+    image: quay.io/lakekeeper/catalog:latest-main
     environment:
-      POSTGRES_USER: iceberg
-      POSTGRES_PASSWORD: iceberg
-      POSTGRES_DB: iceberg
+      - LAKEKEEPER__PG_ENCRYPTION_KEY=This-is-NOT-Secure!
+      - LAKEKEEPER__PG_DATABASE_URL_READ=postgresql://postgres:postgres@db:5432/postgres
+      - LAKEKEEPER__PG_DATABASE_URL_WRITE=postgresql://postgres:postgres@db:5432/postgres
+      - RUST_LOG=info
+    command: ["serve"]
+    healthcheck:
+      test: ["CMD", "/home/nonroot/lakekeeper", "healthcheck"]
+      interval: 1s
+      timeout: 10s
+      retries: 10
+      start_period: 30s
+    depends_on:
+      migrate:
+        condition: service_completed_successfully
+      db:
+        condition: service_healthy
+      minio:
+        condition: service_healthy
     ports:
-      - "5432:5432"
-    volumes:
-      - postgres_data:/var/lib/postgresql/data
+      - 8181:8181
     networks:
      - iceberg_net
 
-  minio:
-    image: minio/minio:latest
-    container_name: lakekeeper-minio
+  migrate:
+    image: quay.io/lakekeeper/catalog:latest-main
     environment:
-      MINIO_ROOT_USER: admin
-      MINIO_ROOT_PASSWORD: password
-      MINIO_DOMAIN: minio
-    ports:
-      - "9001:9001"
-      - "9000:9000"
-    command: ["server", "/data", "--console-address", ":9001"]
-    volumes:
-      - minio_data:/data
+      - LAKEKEEPER__PG_ENCRYPTION_KEY=This-is-NOT-Secure!
+      - LAKEKEEPER__PG_DATABASE_URL_READ=postgresql://postgres:postgres@db:5432/postgres
+      - LAKEKEEPER__PG_DATABASE_URL_WRITE=postgresql://postgres:postgres@db:5432/postgres
+      - RUST_LOG=info
+    restart: "no"
+    command: ["migrate"]
+    depends_on:
+      db:
+        condition: service_healthy
     networks:
       - iceberg_net
 
-  # Initialize MinIO with required buckets
-  mc:
-    image: minio/mc:latest
-    container_name: lakekeeper-mc
+  bootstrap:
+    image: curlimages/curl
     depends_on:
-      - minio
-    entrypoint: >
-      /bin/sh -c "
-      until (/usr/bin/mc config host add minio http://minio:9000 admin password) do echo '...waiting...' && sleep 1; done;
-      /usr/bin/mc mb minio/warehouse;
-      /usr/bin/mc policy set public minio/warehouse;
-      exit 0;
-      "
+      lakekeeper:
+        condition: service_healthy
+    restart: "no"
+    command:
+      - -w
+      - "%{http_code}"
+      - "-X"
+      - "POST"
+      - "-v"
+      - "http://lakekeeper:8181/management/v1/bootstrap"
+      - "-H"
+      - "Content-Type: application/json"
+      - "--data"
+      - '{"accept-terms-of-use": true}'
+      - "-o"
+      - "/dev/null"
     networks:
       - iceberg_net
 
-  lakekeeper:
-    image: lakekeeper/lakekeeper:latest
-    container_name: lakekeeper-catalog
+  initialwarehouse:
+    image: curlimages/curl
     depends_on:
-      - postgres
-      - minio
+      lakekeeper:
+        condition: service_healthy
+      bootstrap:
+        condition: service_completed_successfully
+    restart: "no"
+    command:
+      - -w
+      - "%{http_code}"
+      - "-X"
+      - "POST"
+      - "-v"
+      - "http://lakekeeper:8181/management/v1/warehouse"
+      - "-H"
+      - "Content-Type: application/json"
+      - "--data"
+      - '{"warehouse-name": "demo", "project-id": "00000000-0000-0000-0000-000000000000", "storage-profile": {"type": "s3", "bucket": "warehouse-rest", "key-prefix": "", "assume-role-arn": null, "endpoint": "http://minio:9000", "region": "local-01", "path-style-access": true, "flavor": "minio", "sts-enabled": true}, "storage-credential": {"type": "s3", "credential-type": "access-key", "aws-access-key-id": "minio", "aws-secret-access-key": "ClickHouse_Minio_P@ssw0rd"}}'
+      - "-o"
+      - "/dev/null"
+    networks:
+      - iceberg_net
+
+  db:
+    image: bitnami/postgresql:16.3.0
     environment:
-      LAKEKEEPER__PG_ENCRYPTION_KEY: "abcdefghijklmnopqrstuvwxyz123456"
-      LAKEKEEPER__PG_DATABASE_URL_READ: "postgresql://iceberg:iceberg@postgres:5432/iceberg"
-      LAKEKEEPER__PG_DATABASE_URL_WRITE: "postgresql://iceberg:iceberg@postgres:5432/iceberg"
-      LAKEKEEPER__STORAGE__S3__ENDPOINT: "http://minio:9000"
-      LAKEKEEPER__STORAGE__S3__ACCESS_KEY_ID: "admin"
-      LAKEKEEPER__STORAGE__S3__SECRET_ACCESS_KEY: "password"
-      LAKEKEEPER__STORAGE__S3__REGION: "us-east-1"
-      LAKEKEEPER__STORAGE__S3__BUCKET: "warehouse"
-      LAKEKEEPER__STORAGE__S3__PATH_STYLE_ACCESS: "true"
-    ports:
-      - "8080:8080"
+      - POSTGRESQL_USERNAME=postgres
+      - POSTGRESQL_PASSWORD=postgres
+      - POSTGRESQL_DATABASE=postgres
+    healthcheck:
+      test: ["CMD-SHELL", "pg_isready -U postgres -p 5432 -d postgres"]
+      interval: 2s
+      timeout: 10s
+      retries: 5
+      start_period: 10s
+    volumes:
+      - postgres_data:/bitnami/postgresql
     networks:
       - iceberg_net
 
+  minio:
+    image: bitnami/minio:2025.4.22
+    environment:
+      - MINIO_ROOT_USER=minio
+      - MINIO_ROOT_PASSWORD=ClickHouse_Minio_P@ssw0rd
+      - MINIO_API_PORT_NUMBER=9000
+      - MINIO_CONSOLE_PORT_NUMBER=9001
+      - MINIO_SCHEME=http
+      - MINIO_DEFAULT_BUCKETS=warehouse-rest
+    networks:
+      iceberg_net:
+        aliases:
+          - warehouse-rest.minio
+    ports:
+      - "9002:9000"
+      - "9003:9001"
+    healthcheck:
+      test: ["CMD-SHELL", "mc ls local | grep warehouse-rest"]
+      interval: 2s
+      timeout: 10s
+      retries: 3
+      start_period: 15s
+    volumes:
+      - minio_data:/bitnami/minio/data
+
   clickhouse:
     image: clickhouse/clickhouse-server:head
     container_name: lakekeeper-clickhouse
     user: '0:0'  # Ensures root permissions
     ports:
       - "8123:8123"
-      - "9002:9000"
+      - "9000:9000"
     volumes:
-      - ./clickhouse:/var/lib/clickhouse
+      - clickhouse_data:/var/lib/clickhouse
       - ./clickhouse/data_import:/var/lib/clickhouse/data_import # Mount dataset folder
     networks:
       - iceberg_net
@@ -135,10 +196,16 @@ services:
     environment:
       - CLICKHOUSE_DB=default
       - CLICKHOUSE_USER=default
       - CLICKHOUSE_DO_NOT_CHOWN=1
       - CLICKHOUSE_PASSWORD=
+    depends_on:
+      lakekeeper:
+        condition: service_healthy
+      minio:
+        condition: service_healthy
 
 volumes:
   postgres_data:
   minio_data:
+  clickhouse_data:
 
 networks:
   iceberg_net:
@@ -154,19 +221,11 @@ docker compose up -d
 **Step 3:** Wait for all services to be ready. You can check the logs:
 
 ```bash
-docker-compose logs -f lakekeeper
-```
-
-**Step 4:** Verify that Lakekeeper is running by checking the catalog status:
-
-```bash
-curl http://localhost:8080/v1/config
+docker compose logs -f
 ```
 
-You should see a JSON response indicating the catalog configuration.
-
 :::note
-The Lakekeeper setup requires that the MinIO buckets be created first. The `mc` service in the docker-compose file handles this initialization. Make sure all services are healthy before attempting to query them through ClickHouse.
+The Lakekeeper setup requires that sample data be loaded into the Iceberg tables first. Make sure the environment has created and populated the tables before attempting to query them through ClickHouse. The availability of tables depends on the specific docker-compose setup and sample data loading scripts.
 :::
 
 ### Connecting to Local Lakekeeper Catalog {#connecting-to-local-lakekeeper-catalog}
 
@@ -182,140 +241,128 @@
 Connect to your ClickHouse container:
 
 ```bash
 docker exec -it lakekeeper-clickhouse clickhouse-client
 ```
 
 Then create the database connection to the Lakekeeper catalog:
 
 ```sql
 SET allow_experimental_database_iceberg = 1;
 
-CREATE DATABASE lakekeeper_demo
-ENGINE = DataLakeCatalog('http://lakekeeper:8080/v1', '', '')
-SETTINGS
-    catalog_type = 'rest',
-    storage_endpoint = 'http://minio:9000/warehouse',
-    warehouse = 'demo'
+CREATE DATABASE demo
+ENGINE = DataLakeCatalog('http://lakekeeper:8181/catalog', 'minio', 'ClickHouse_Minio_P@ssw0rd')
+SETTINGS catalog_type = 'rest', storage_endpoint = 'http://minio:9000/warehouse-rest', warehouse = 'demo'
 ```
 
-## Creating Sample Data {#creating-sample-data}
-
-Before querying tables, let's create some sample data using a simple Python script or by using the Iceberg Python library to create tables in Lakekeeper.
-
-**Step 1:** Create a simple table using the REST API:
-
-```bash
-# First, create a namespace (database)
-curl -X POST http://localhost:8080/v1/namespaces \
-  -H "Content-Type: application/json" \
-  -d '{"namespace": ["demo"], "properties": {}}'
-
-# Then create a table (this is a simplified example - in practice you would use Iceberg clients)
-```
-
-:::note
-For production use, you would typically use Iceberg-compatible tools like Apache Spark, PyIceberg, or other Iceberg clients to create and populate tables. The Lakekeeper catalog acts as the metadata layer that coordinates table operations.
-:::
-
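+If your environment doesn't ship with a sample data loading script, you can create a small Iceberg table yourself through the catalog. The following is a hypothetical sketch (not part of this setup) using [PyIceberg](https://py.iceberg.apache.org/) with the `*_if_not_exists` helpers and PyArrow; the URI, warehouse name, and MinIO credentials come from the `docker-compose.yml` above, and the script is assumed to run somewhere the `lakekeeper` and `minio` hostnames resolve (for example, a container attached to `iceberg_net`):
+
+```python
+import pyarrow as pa
+from pyiceberg.catalog import load_catalog
+
+# Connect to the Lakekeeper REST catalog (values from docker-compose.yml above)
+catalog = load_catalog(
+    "lakekeeper",
+    **{
+        "uri": "http://lakekeeper:8181/catalog",
+        "warehouse": "demo",
+        "s3.endpoint": "http://minio:9000",
+        "s3.access-key-id": "minio",
+        "s3.secret-access-key": "ClickHouse_Minio_P@ssw0rd",
+        "s3.region": "local-01",
+    },
+)
+
+# Create a namespace and a small table, then append a few rows
+catalog.create_namespace_if_not_exists("default")
+rows = pa.table({
+    "id": pa.array([1, 2, 3], type=pa.int64()),
+    "name": pa.array(["one", "two", "three"], type=pa.string()),
+})
+table = catalog.create_table_if_not_exists(("default", "sample"), schema=rows.schema)
+table.append(rows)
+```
+
+A table created this way would then show up as `default.sample` in the `SHOW TABLES` output below, alongside any tables created by your own loading scripts.
+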
 ## Querying Lakekeeper catalog tables using ClickHouse {#querying-lakekeeper-catalog-tables-using-clickhouse}
 
 Now that the connection is in place, you can start querying via the Lakekeeper catalog. For example:
 
 ```sql
-USE lakekeeper_demo;
+USE demo;
 
 SHOW TABLES;
 ```
 
-If your setup includes sample data, you should see tables created in the demo namespace.
+If your setup includes sample data (such as the taxi dataset), you should see tables like:
+
+```sql title="Response"
+┌─name──────────┐
+│ default.taxis │
+└───────────────┘
+```
 
 :::note
 If you don't see any tables, this usually means:
-1. No tables have been created in the Lakekeeper catalog yet
-2. The Lakekeeper service isn't fully initialized
-3. The namespace doesn't exist
+1. The environment hasn't created the sample tables yet
+2. The Lakekeeper catalog service isn't fully initialized
+3. The sample data loading process hasn't completed
 
 You can check the Lakekeeper logs to see the catalog activity:
 ```bash
-docker-compose logs lakekeeper
+docker compose logs lakekeeper
 ```
 :::
 
-To create and query a sample table (assuming you have created one through Iceberg clients):
+To query a table (if available):
 
 ```sql
--- Example query if you have created sample tables
-SELECT count(*) FROM `demo.sample_table`;
+SELECT count(*) FROM `default.taxis`;
+```
+
+```sql title="Response"
+┌─count()─┐
+│ 2171187 │
+└─────────┘
 ```
 
 :::note Backticks required
 Backticks are required because ClickHouse doesn't support more than one namespace.
 :::
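+
+If you want to cross-check what ClickHouse sees against the catalog itself, you can list namespaces and tables through the REST catalog directly. This is a hypothetical PyIceberg sketch (PyIceberg is not part of this setup), reusing the connection values from `docker-compose.yml`:
+
+```python
+from pyiceberg.catalog import load_catalog
+
+# Metadata-only operations need just the catalog URI and warehouse name
+catalog = load_catalog(
+    "lakekeeper",
+    **{"uri": "http://lakekeeper:8181/catalog", "warehouse": "demo"},
+)
+
+print(catalog.list_namespaces())       # e.g. [('default',)]
+print(catalog.list_tables("default"))  # e.g. [('default', 'taxis')]
+```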
 
-To inspect a table DDL (if available):
+To inspect the table DDL:
 
 ```sql
-SHOW CREATE TABLE `demo.sample_table`;
+SHOW CREATE TABLE `default.taxis`;
+```
+
+```sql title="Response"
+┌─statement─────────────────────────────────────────────────────────────────────────────────────┐
+│ CREATE TABLE demo.`default.taxis`                                                              │
+│ (                                                                                              │
+│     `VendorID` Nullable(Int64),                                                                │
+│     `tpep_pickup_datetime` Nullable(DateTime64(6)),                                            │
+│     `tpep_dropoff_datetime` Nullable(DateTime64(6)),                                           │
+│     `passenger_count` Nullable(Float64),                                                       │
+│     `trip_distance` Nullable(Float64),                                                         │
+│     `RatecodeID` Nullable(Float64),                                                            │
+│     `store_and_fwd_flag` Nullable(String),                                                     │
+│     `PULocationID` Nullable(Int64),                                                            │
+│     `DOLocationID` Nullable(Int64),                                                            │
+│     `payment_type` Nullable(Int64),                                                            │
+│     `fare_amount` Nullable(Float64),                                                           │
+│     `extra` Nullable(Float64),                                                                 │
+│     `mta_tax` Nullable(Float64),                                                               │
+│     `tip_amount` Nullable(Float64),                                                            │
+│     `tolls_amount` Nullable(Float64),                                                          │
+│     `improvement_surcharge` Nullable(Float64),                                                 │
+│     `total_amount` Nullable(Float64),                                                          │
+│     `congestion_surcharge` Nullable(Float64),                                                  │
+│     `airport_fee` Nullable(Float64)                                                            │
+│ )                                                                                              │
+│ ENGINE = Iceberg('http://minio:9000/warehouse-rest/warehouse/default/taxis/', 'minio', '[HIDDEN]') │
+└───────────────────────────────────────────────────────────────────────────────────────────────┘
 ```
 
 ## Loading data from your Data Lake into ClickHouse {#loading-data-from-your-data-lake-into-clickhouse}
 
-If you need to load data from the Lakekeeper catalog into ClickHouse, start by creating a local ClickHouse table that matches your Iceberg table schema:
+If you need to load data from the Lakekeeper catalog into ClickHouse, start by creating a local ClickHouse table:
 
 ```sql
--- Example table structure - adjust based on your actual Iceberg table schema
-CREATE TABLE local_sample_table
+CREATE TABLE taxis
 (
-    `id` Int64,
-    `name` String,
-    `timestamp` DateTime64(6),
-    `value` Float64
+    `VendorID` Int64,
+    `tpep_pickup_datetime` DateTime64(6),
+    `tpep_dropoff_datetime` DateTime64(6),
+    `passenger_count` Float64,
+    `trip_distance` Float64,
+    `RatecodeID` Float64,
+    `store_and_fwd_flag` String,
+    `PULocationID` Int64,
+    `DOLocationID` Int64,
+    `payment_type` Int64,
+    `fare_amount` Float64,
+    `extra` Float64,
+    `mta_tax` Float64,
+    `tip_amount` Float64,
+    `tolls_amount` Float64,
+    `improvement_surcharge` Float64,
+    `total_amount` Float64,
+    `congestion_surcharge` Float64,
+    `airport_fee` Float64
 )
 ENGINE = MergeTree()
-PARTITION BY toYYYYMM(timestamp)
-ORDER BY (id, timestamp);
+PARTITION BY toYYYYMM(tpep_pickup_datetime)
+ORDER BY (VendorID, tpep_pickup_datetime, PULocationID, DOLocationID);
 ```
 
 Then load the data from your Lakekeeper catalog table via an `INSERT INTO SELECT`:
 
 ```sql
-INSERT INTO local_sample_table
-SELECT * FROM lakekeeper_demo.`demo.sample_table`;
+INSERT INTO taxis
+SELECT * FROM demo.`default.taxis`;
 ```
 
-## Managing the Lakekeeper Catalog {#managing-lakekeeper-catalog}
-
-### Accessing the MinIO Console
-
-You can access the MinIO console at `http://localhost:9001` using:
-- Username: `admin`
-- Password: `password`
-
-### Monitoring Lakekeeper
-
-Lakekeeper provides REST endpoints for monitoring and management:
-
-```bash
-# Check catalog health
-curl http://localhost:8080/health
-
-# List namespaces
-curl http://localhost:8080/v1/namespaces
-
-# Get catalog configuration
-curl http://localhost:8080/v1/config
-```
-
-### Cleanup
-
-To stop and remove all containers:
-
-```bash
-docker-compose down -v
-```
-
-This will remove all containers and their associated volumes, including the PostgreSQL metadata and MinIO data.
-
-## Production Considerations {#production-considerations}
-
-When deploying Lakekeeper in production:
-
-1. **Security**: Configure proper authentication and authorization
-2. **Persistence**: Use persistent volumes for PostgreSQL and MinIO data
-3. **High Availability**: Deploy multiple Lakekeeper instances behind a load balancer
-4. **Monitoring**: Set up proper monitoring and alerting for all components
-5. **Backup**: Implement backup strategies for metadata and object storage
-
-For more information, refer to the [Lakekeeper documentation](https://github.com/lakekeeper/lakekeeper).
\ No newline at end of file
+
+For more information, refer to the [Lakekeeper documentation](https://docs.lakekeeper.io/).
\ No newline at end of file