---
kind:
  - Solution
products:
  - Alauda Application Services
ProductsVersion:
  - 4.x
id: KB251000010
---

# lakeFS Data Version Control Solution Guide

## Background

### The Challenge

Modern data lakes face significant challenges in managing data versioning, reproducibility, and collaboration. Traditional approaches often lead to:

- **Data Quality Issues**: Difficulty tracking changes and rolling back problematic data updates
- **Reproducibility Problems**: Inability to recreate specific data states for analysis or debugging
- **Collaboration Conflicts**: Multiple teams working on the same data without proper isolation
- **Testing Complexity**: Challenges in testing data transformations before applying them to production

### The Solution

lakeFS provides Git-like version control for data lakes, enabling:

- **Branching and Merging**: Isolate changes in branches and merge them safely
- **Data Versioning**: Track changes to data with commit-like semantics
- **Reproducible Analytics**: Reference specific data versions for consistent results
- **CI/CD for Data**: Implement testing and validation workflows for data pipelines

## Environment Information

Applicable versions: ACP >= 4.1.0, lakeFS >= 1.70.1

## Quick Reference

### Key Concepts

- **Repository**: A collection of branches, tags, and commits that track changes to data
- **Branch**: An isolated line of development within a repository
- **Commit**: A snapshot of the repository at a specific point in time
- **Merge**: Combining changes from one branch into another

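To make these Git-like semantics concrete, the toy in-memory model below shows how a branch isolates changes until they are merged. This is purely illustrative and not how lakeFS is implemented (lakeFS versions metadata over object storage rather than copying data):

```python
# Toy model of lakeFS-style versioning: a "commit" snapshots the data,
# a "branch" points at a commit, and a "merge" replays changes onto the
# target branch. Illustrative only -- not the real lakeFS implementation.

class Repository:
    def __init__(self):
        self.commits = {}                            # commit id -> snapshot
        self.branches = {"main": self._commit({})}

    def _commit(self, snapshot):
        cid = f"c{len(self.commits)}"
        self.commits[cid] = dict(snapshot)
        return cid

    def branch(self, name, source="main"):
        # A new branch simply points at the source branch's commit.
        self.branches[name] = self.branches[source]

    def commit(self, branch, changes):
        snapshot = dict(self.commits[self.branches[branch]])
        snapshot.update(changes)
        self.branches[branch] = self._commit(snapshot)

    def read(self, ref):
        return self.commits[self.branches[ref]]

    def merge(self, source, dest):
        # Simplified: replay the source snapshot onto the destination.
        self.commit(dest, self.read(source))

repo = Repository()
repo.branch("feature/clean-events")                  # isolate work on a branch
repo.commit("feature/clean-events", {"events.parquet": "v2"})
assert "events.parquet" not in repo.read("main")     # main is untouched
repo.merge("feature/clean-events", "main")
assert repo.read("main")["events.parquet"] == "v2"   # changes promoted
```

The key property mirrored here is isolation: until the merge, readers of `main` never see the in-progress changes on the feature branch.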
### Common Use Cases

| Scenario | Recommended Approach |
|----------|----------------------|
| **Data Versioning** | Create repositories and commit changes |
| **Collaborative Development** | Use feature branches for isolated work |
| **Data Quality Validation** | Implement pre-commit hooks and testing |
| **Production Deployment** | Merge validated changes to the main branch |

For detailed workflows behind each approach, see the official [lakeFS documentation](https://docs.lakefs.io/).

## Prerequisites

Before implementing lakeFS, ensure you have:

- ACP v4.1.0 or later
- A PostgreSQL instance for metadata storage
- An object storage backend (Ceph RGW recommended; MinIO also supported)
- A basic understanding of Git workflows and data lake concepts

### Storage Requirements

- **PostgreSQL**: Minimum 10 GB of storage for metadata
- **Object Storage**: Sufficient capacity for your data assets
- **Backup Strategy**: Regular backups of the PostgreSQL database

## Installation Guide

### Chart Upload

Download the lakeFS chart from the Marketplace in the Alauda Customer Portal and upload it to your ACP catalog:

```bash
CHART=lakefs.ALL.1.7.9.tgz
ADDR="https://your-acp-domain.com"
USER="[email protected]"
PASS="your-password"

violet push $CHART \
  --platform-address "$ADDR" \
  --platform-username "$USER" \
  --platform-password "$PASS"
```

### Backend Storage Configuration

#### Recommended: Ceph RGW Setup

1. Deploy the Ceph storage system following the [Ceph installation guide](https://docs.alauda.io/container_platform/4.1/storage/storagesystem_ceph/installation/create_service_stand.html).

2. Create a Ceph object store user:

```yaml
apiVersion: ceph.rook.io/v1
kind: CephObjectStoreUser
metadata:
  name: lakefs-user
  namespace: rook-ceph
spec:
  store: object-pool
  displayName: object-pool
  quotas:
    maxBuckets: 100
    maxSize: -1
    maxObjects: -1
  capabilities:
    user: "*"
    bucket: "*"
```

3. Retrieve the access credentials:

```bash
user_secret=$(kubectl -n rook-ceph get cephobjectstoreuser lakefs-user -o jsonpath='{.status.info.secretName}')
ACCESS_KEY=$(kubectl -n rook-ceph get secret $user_secret -o jsonpath='{.data.AccessKey}' | base64 --decode)
SECRET_KEY=$(kubectl -n rook-ceph get secret $user_secret -o jsonpath='{.data.SecretKey}' | base64 --decode)
```

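Before wiring these credentials into lakeFS, it can be worth verifying them directly against the RGW S3 endpoint. A minimal sketch, assuming the AWS CLI is installed and the in-cluster service address shown below (adjust to your deployment):

```shell
# Assumed in-cluster RGW endpoint; adjust namespace/store name as needed.
ENDPOINT="http://rook-ceph-rgw-object-pool.rook-ceph.svc:7480"

# List buckets with the retrieved credentials; an error here usually means
# the keys or the endpoint are wrong.
AWS_ACCESS_KEY_ID="$ACCESS_KEY" AWS_SECRET_ACCESS_KEY="$SECRET_KEY" \
  aws --endpoint-url "$ENDPOINT" s3 ls
```
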
#### Alternative: MinIO Setup

Deploy MinIO following the [MinIO installation guide](https://docs.alauda.io/container_platform/4.1/storage/storagesystem_minio/installation.html).

### PostgreSQL Database Setup

1. Deploy PostgreSQL following the [PostgreSQL installation guide](https://docs.alauda.io/postgresql/4.1/installation.html).

2. Create a database for lakeFS:

```sql
CREATE DATABASE lakefs;
```

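One way to run the statement against an in-cluster instance is through `kubectl exec`. A sketch with a hypothetical pod name and superuser (adjust both to your deployment):

```shell
# Hypothetical pod name, namespace, and user -- adjust to your deployment.
DB_NAME="lakefs"
kubectl exec -it postgres-0 -n your-namespace -- \
  psql -U postgres -c "CREATE DATABASE ${DB_NAME};"
```
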
### lakeFS Deployment

1. Access the ACP web console and navigate to "Applications" → "Create" → "Create from Catalog".

2. Select the lakeFS chart.

3. Configure the deployment values:

```yaml
image:
  repository: your-registry-domain.com/3rdparty/treeverse/lakefs

lakefsConfig: |
  database:
    type: postgres
  blockstore:
    type: s3
    s3:
      force_path_style: true
      endpoint: "http://rook-ceph-rgw-object-pool.rook-ceph.svc:7480"
      discover_bucket_region: false
      credentials:
        access_key_id: "<YOUR_ACCESS_KEY>"
        secret_access_key: "<YOUR_SECRET_KEY>"

secrets:
  databaseConnectionString: "postgres://postgres:<password>@postgres-service:5432/lakefs"

service:
  type: NodePort

livenessProbe:
  failureThreshold: 30
  periodSeconds: 10
  timeoutSeconds: 2
```

Replace `<YOUR_ACCESS_KEY>` and `<YOUR_SECRET_KEY>` with the credentials retrieved from the Ceph object store user secret in the previous section, and `<password>` with your PostgreSQL password.

4. Deploy and verify that the application reaches the "Ready" status.

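The `databaseConnectionString` value follows the standard PostgreSQL URI format (`postgres://user:password@host:port/dbname`). A quick, illustrative way to sanity-check such a string before deploying it, using only the Python standard library:

```python
# Sanity-check a PostgreSQL connection URI using only the standard library.
from urllib.parse import urlparse

def check_pg_uri(uri: str) -> dict:
    parts = urlparse(uri)
    assert parts.scheme in ("postgres", "postgresql"), "unexpected scheme"
    assert parts.hostname and parts.port, "host and port are required"
    dbname = parts.path.lstrip("/")
    assert dbname, "database name is missing"
    return {"user": parts.username, "host": parts.hostname,
            "port": parts.port, "dbname": dbname}

info = check_pg_uri("postgres://postgres:password@postgres-service:5432/lakefs")
print(info)  # parsed user/host/port/dbname
```

A malformed string (missing port, empty database name) fails fast here instead of surfacing later as an opaque startup error in the lakeFS pod.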
## Configuration Guide

### Accessing lakeFS

1. Retrieve the NodePort service endpoint:

```bash
kubectl get svc lakefs-service -n your-namespace
```

2. Access the lakeFS web UI through the NodePort.

3. Download the initial credentials from the web UI.

### Getting Started with lakeFS

For detailed usage instructions, workflows, and advanced features, refer to the official [lakeFS documentation](https://docs.lakefs.io/), which covers:

- Basic operations (branching, committing, merging)
- Advanced features (hooks, retention policies, cross-repository operations)
- Integration with data tools (Spark, Airflow, dbt, etc.)
- API reference and CLI usage
- Best practices and use cases

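As a taste of the CLI, the sequence below sketches a branch-commit-merge cycle with `lakectl`. The repository, bucket, and file names are hypothetical, and exact flags may vary by version, so treat this as a sketch and check the official docs:

```shell
# Hypothetical repository and branch names.
REPO="lakefs://analytics"
BRANCH="feature/clean-events"

lakectl repo create "$REPO" s3://analytics-bucket    # create a repository
lakectl branch create "$REPO/$BRANCH" --source "$REPO/main"
lakectl fs upload "$REPO/$BRANCH/events.parquet" --source ./events.parquet
lakectl commit "$REPO/$BRANCH" -m "clean event data"
lakectl merge "$REPO/$BRANCH" "$REPO/main"           # promote to main
```
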
## Troubleshooting

### Common Issues

#### Authentication Problems

**Symptoms**: Unable to access the lakeFS UI or API

**Solutions**:
- Verify that credentials are correctly set in the deployment
- Check the PostgreSQL connection string format
- Validate the object storage credentials

#### Performance Issues

**Symptoms**: Slow operations or timeouts

**Solutions**:
- Monitor PostgreSQL performance
- Check object storage latency
- Review network connectivity between components

### Diagnostic Commands

Check lakeFS health:

```bash
curl http://lakefs-service:8000/_health
```

Verify the PostgreSQL connection:

```bash
kubectl exec -it lakefs-pod -- pg_isready -h postgres-service -p 5432
```

## Best Practices

### Repository Structure

- Organize data by domain or team
- Use descriptive branch prefixes (`feature/`, `bugfix/`, `hotfix/`)
- Establish clear commit message conventions

### Security Considerations

- Rotate access credentials regularly
- Apply the principle of least privilege to repository access
- Enable audit logging for sensitive operations

### Backup Strategy

- Back up the PostgreSQL metadata database regularly
- Provide object storage redundancy through the backend configuration
- Test restoration procedures periodically

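A minimal metadata backup can be as simple as a scheduled `pg_dump` of the lakeFS database. An illustrative sketch assuming a hypothetical in-cluster pod `postgres-0`; production setups should prefer the PostgreSQL operator's own backup mechanism:

```shell
# Hypothetical pod, namespace, and user -- adjust to your deployment.
BACKUP_FILE="lakefs-metadata-$(date +%Y%m%d).sql"
kubectl exec postgres-0 -n your-namespace -- \
  pg_dump -U postgres lakefs > "$BACKUP_FILE"
```
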
## Reference

### Configuration Parameters

**lakeFS Deployment:**

- `databaseConnectionString`: PostgreSQL connection string
- `blockstore.type`: Storage backend type (`s3`, `gs`, `azure`)
- `blockstore.s3.endpoint`: Object storage endpoint
- `blockstore.s3.credentials`: Access credentials

### Useful Links

- [lakeFS Documentation](https://docs.lakefs.io/) - Comprehensive usage guide and API reference
- [PostgreSQL Operator Documentation](https://docs.alauda.io/postgresql/4.1/functions/index.html)
- [Ceph Object Storage Guide](https://docs.alauda.io/container_platform/4.1/storage/storagesystem_ceph/how_to/create_object_user.html)

## Summary

This guide provides comprehensive instructions for implementing lakeFS on the Alauda Container Platform. The solution delivers Git-like version control for data lakes, enabling:

- **Reproducible Data Analytics**: Track and reference specific data versions
- **Collaborative Development**: Isolate changes with branching and merging
- **Data Quality Assurance**: Implement validation workflows
- **Production Reliability**: Controlled promotion of data changes

By following these practices, organizations can significantly improve their data management capabilities while maintaining the flexibility and scalability of modern data lake architectures.