---
kind:
- Solution
products:
- Alauda Application Services
ProductsVersion:
- 4.x
id: KB251000010
---

# lakeFS Data Version Control Solution Guide

## Background

### The Challenge

Modern data lakes face significant challenges in managing data versioning, reproducibility, and collaboration. Traditional approaches often lead to:

- **Data Quality Issues**: Difficulty tracking changes and rolling back problematic data updates
- **Reproducibility Problems**: Inability to recreate specific data states for analysis or debugging
- **Collaboration Conflicts**: Multiple teams working on the same data without proper isolation
- **Testing Complexity**: Challenges in testing data transformations before applying to production

### The Solution

lakeFS provides Git-like version control for data lakes, enabling:

- **Branching and Merging**: Isolate changes in branches and merge them safely
- **Data Versioning**: Track changes to data with commit-like semantics
- **Reproducible Analytics**: Reference specific data versions for consistent results
- **CI/CD for Data**: Implement testing and validation workflows for data pipelines

## Environment Information

Applicable versions: ACP 4.1.0 or later; lakeFS 1.70.1 or later

## Quick Reference

### Key Concepts
- **Repository**: A collection of branches, tags, and commits that track changes to data
- **Branch**: An isolated line of development within a repository
- **Commit**: A snapshot of the repository at a specific point in time
- **Merge**: Combining changes from one branch into another
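
These concepts map directly onto `lakectl`, the lakeFS command-line client. A minimal sketch of the lifecycle — the repository, bucket, and branch names are illustrative, and the commands assume a `lakectl` already configured against your lakeFS endpoint:

```shell
# Create a repository backed by an object-store bucket (names are examples)
lakectl repo create lakefs://analytics s3://lakefs-data

# Branch off main, commit work on the branch, then merge it back
lakectl branch create lakefs://analytics/feature-clean-events \
  --source lakefs://analytics/main
lakectl commit lakefs://analytics/feature-clean-events -m "clean event data"
lakectl merge lakefs://analytics/feature-clean-events lakefs://analytics/main
```

Each `lakectl commit` produces an immutable snapshot that can later be referenced by ID, which is what makes reproducible analytics possible.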

### Common Use Cases

| Scenario | Recommended Approach | Section Reference |
|----------|---------------------|------------------|
| **Data Versioning** | Create repositories and commit changes | [Basic Operations](https://docs.lakefs.io/) |
| **Collaborative Development** | Use feature branches for isolated work | [Branching Strategy](https://docs.lakefs.io/) |
| **Data Quality Validation** | Implement pre-commit hooks and testing | [Data Validation](https://docs.lakefs.io/) |
| **Production Deployment** | Merge validated changes to main branch | [Production Workflows](https://docs.lakefs.io/) |
## Prerequisites

Before implementing lakeFS, ensure you have:

- ACP v4.1.0 or later
- PostgreSQL instance for metadata storage
- Object storage backend (Ceph RGW recommended; MinIO also supported)
- Basic understanding of Git workflows and data lake concepts

### Storage Requirements

- **PostgreSQL**: Minimum 10GB storage for metadata
- **Object Storage**: Sufficient capacity for your data assets
- **Backup Strategy**: Regular backups of PostgreSQL database

## Installation Guide

### Chart Upload

Download the lakeFS chart from the Marketplace in the Alauda Customer Portal and upload it to your ACP catalog:

```bash
CHART=lakefs.ALL.1.7.9.tgz
ADDR="https://your-acp-domain.com"
USER="[email protected]"
PASS="your-password"

violet push $CHART \
  --platform-address "$ADDR" \
  --platform-username "$USER" \
  --platform-password "$PASS"
```

### Backend Storage Configuration

#### Recommended: Ceph RGW Setup

1. Deploy Ceph storage system following the [Ceph installation guide](https://docs.alauda.io/container_platform/4.1/storage/storagesystem_ceph/installation/create_service_stand.html)

2. Create Ceph Object Store User:

```yaml
apiVersion: ceph.rook.io/v1
kind: CephObjectStoreUser
metadata:
  name: lakefs-user
  namespace: rook-ceph
spec:
  store: object-pool
  displayName: object-pool
  quotas:
    maxBuckets: 100
    maxSize: -1
    maxObjects: -1
  capabilities:
    user: "*"
    bucket: "*"
```

3. Retrieve access credentials:

```bash
user_secret=$(kubectl -n rook-ceph get cephobjectstoreuser lakefs-user -o jsonpath='{.status.info.secretName}')
ACCESS_KEY=$(kubectl -n rook-ceph get secret $user_secret -o jsonpath='{.data.AccessKey}' | base64 --decode)
SECRET_KEY=$(kubectl -n rook-ceph get secret $user_secret -o jsonpath='{.data.SecretKey}' | base64 --decode)
```

#### Alternative: MinIO Setup

Deploy MinIO following the [MinIO installation guide](https://docs.alauda.io/container_platform/4.1/storage/storagesystem_minio/installation.html)

### PostgreSQL Database Setup

1. Deploy PostgreSQL following the [PostgreSQL installation guide](https://docs.alauda.io/postgresql/4.1/installation.html)

2. Create database for lakeFS:

```sql
CREATE DATABASE lakefs;
```
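
If PostgreSQL runs inside the cluster, the statement can be run through `kubectl exec`. A sketch — the pod name, namespace, and database user below are placeholders for your deployment:

```shell
# Run the CREATE DATABASE statement against the in-cluster PostgreSQL
# (pod name, namespace, and user are placeholders)
kubectl exec -it postgres-pod -n your-namespace -- \
  psql -U postgres -c "CREATE DATABASE lakefs;"
```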

### lakeFS Deployment

1. Access ACP web console and navigate to "Applications" → "Create" → "Create from Catalog"

2. Select the lakeFS chart

3. Configure deployment values:

```yaml
image:
  repository: your-registry-domain.com/3rdparty/treeverse/lakefs

lakefsConfig: |
  database:
    type: postgres
  blockstore:
    type: s3
    s3:
      force_path_style: true
      endpoint: "http://rook-ceph-rgw-object-pool.rook-ceph.svc:7480"
      discover_bucket_region: false
      credentials:
        # Replace with the values retrieved from the Ceph user secret above
        access_key_id: <YOUR_ACCESS_KEY>
        secret_access_key: <YOUR_SECRET_KEY>
secrets:
  databaseConnectionString: "postgres://postgres:password@postgres-service:5432/lakefs"

service:
  type: NodePort

livenessProbe:
  failureThreshold: 30
  periodSeconds: 10
  timeoutSeconds: 2
```

4. Deploy and verify the application reaches "Ready" status

## Configuration Guide

### Accessing lakeFS

1. Retrieve the NodePort service endpoint:

```bash
kubectl get svc lakefs-service -n your-namespace
```

2. Access the lakeFS web UI through the NodePort

3. Download initial credentials from the web UI

### Getting Started with lakeFS

For detailed usage instructions, workflows, and advanced features, please refer to the official [lakeFS documentation](https://docs.lakefs.io/).

The official documentation covers:
- Basic operations (branching, committing, merging)
- Advanced features (hooks, retention policies, cross-repository operations)
- Integration with data tools (Spark, Airflow, dbt, etc.)
- API reference and CLI usage
- Best practices and use cases

## Troubleshooting

### Common Issues

#### Authentication Problems

**Symptoms**: Unable to access lakeFS UI or API

**Solutions**:
- Verify credentials are correctly set in deployment
- Check PostgreSQL connection string format
- Validate object storage credentials
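
A malformed connection string is a common culprit. As a quick local sanity check before redeploying — the string below is the placeholder value from the deployment example — a shell pattern match catches obviously broken formats:

```shell
# Placeholder connection string -- substitute the one from your deployment
CONN="postgres://postgres:password@postgres-service:5432/lakefs"

# Rough shape check: postgres://user:password@host:port/dbname
case "$CONN" in
  postgres://*:*@*:*/?*) echo "format looks valid" ;;
  *) echo "unexpected format" >&2 ;;
esac
```

This only verifies the shape, not that the credentials or host are correct.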

#### Performance Issues

**Symptoms**: Slow operations or timeouts

**Solutions**:
- Monitor PostgreSQL performance
- Check object storage latency
- Review network connectivity between components

### Diagnostic Commands

Check lakeFS health:

```bash
curl http://lakefs-service:8000/health
```

Verify PostgreSQL connection:

```bash
kubectl exec -it lakefs-pod -- pg_isready -h postgres-service -p 5432
```

## Best Practices

### Repository Structure

- Organize data by domain or team
- Use descriptive branch names (feature/, bugfix/, hotfix/)
- Implement clear commit message conventions
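
The branch naming convention can be enforced mechanically, for example as part of a pre-merge check. A minimal sketch — the prefix list mirrors the suggestion above, and the function name is our own:

```shell
# Accept only branch names that use the agreed prefixes
is_conventional_branch() {
  case "$1" in
    main|feature/*|bugfix/*|hotfix/*) return 0 ;;
    *) return 1 ;;
  esac
}

is_conventional_branch "feature/clean-events" && echo "accepted"
is_conventional_branch "misc-work" || echo "rejected: use feature/, bugfix/, or hotfix/"
```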

### Security Considerations

- Regularly rotate access credentials
- Implement principle of least privilege for repository access
- Enable audit logging for sensitive operations

### Backup Strategy

- Regular backups of PostgreSQL metadata database
- Object storage redundancy through backend configuration
- Test restoration procedures periodically

## Reference

### Configuration Parameters

**lakeFS Deployment:**
- `databaseConnectionString`: PostgreSQL connection string
- `blockstore.type`: Storage backend type (s3, gs, azure)
- `blockstore.s3.endpoint`: Object storage endpoint
- `blockstore.s3.credentials`: Access credentials
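
The `databaseConnectionString` follows the standard PostgreSQL URI form. Assembling it from its parts — all values below are placeholders matching the deployment example:

```shell
# Placeholder values -- substitute your own deployment's settings
DB_USER="postgres"
DB_PASSWORD="password"
DB_HOST="postgres-service"
DB_PORT="5432"
DB_NAME="lakefs"

CONN="postgres://${DB_USER}:${DB_PASSWORD}@${DB_HOST}:${DB_PORT}/${DB_NAME}"
echo "$CONN"
```

Note that a password containing reserved URI characters (such as `@` or `:`) must be percent-encoded before being embedded in the string.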

### Useful Links

- [lakeFS Documentation](https://docs.lakefs.io/) - Comprehensive usage guide and API reference
- [PostgreSQL Operator Documentation](https://docs.alauda.io/postgresql/4.1/functions/index.html)
- [Ceph Object Storage Guide](https://docs.alauda.io/container_platform/4.1/storage/storagesystem_ceph/how_to/create_object_user.html)

## Summary

This guide provides comprehensive instructions for implementing lakeFS on Alauda Container Platform. The solution delivers Git-like version control for data lakes, enabling:

- **Reproducible Data Analytics**: Track and reference specific data versions
- **Collaborative Development**: Isolate changes with branching and merging
- **Data Quality Assurance**: Implement validation workflows
- **Production Reliability**: Controlled promotion of data changes

By following these practices, organizations can significantly improve their data management capabilities while maintaining the flexibility and scalability of modern data lake architectures.