
Install VDK Control Service with custom SDK


Overview

In this tutorial, we will install the Versatile Data Kit Control Service using a custom-built SDK.

This SDK will be used automatically by all data jobs deployed to the Control Service, and any change to the SDK will be applied automatically to all deployed data jobs (starting from their next run).

Prerequisites

These are the minimum prerequisites for installing the VDK Control Service with a custom SDK.

1. Git and Docker repository.

This tutorial assumes GitHub will be used. Go to https://github.com/new and create a repository. For this example, we have created "github.com/tozka/demo-vdk.git".

1.2. Generate GitHub Token.

You will need this GitHub token later, so make sure to save it somewhere safe.

Make sure to grant it permissions for both repo and packages, as we will use it for both the git repository and the Docker registry.

See example:

(Screenshot: generating a GitHub token with the repo and packages scopes selected.)
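
As a quick check that the token works (and because we will push images to the GitHub Container Registry later), you can log in to ghcr.io with it. The variable names are just the ones used in this tutorial:

export GITHUB_ACCOUNT_NAME=tozka
export GITHUB_TOKEN='<your token>'
echo "$GITHUB_TOKEN" | docker login ghcr.io -u "$GITHUB_ACCOUNT_NAME" --password-stdin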

2. Python (PyPI) repository

This is where we will release (upload) our custom SDK. For POC purposes, we will use https://test.pypi.org.

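After registering at https://test.pypi.org, keep your upload credentials handy. The variable names below are just what this tutorial uses later with twine; they are not required by PyPI itself:

export PIP_REPO_UPLOAD_USER_NAME='<your test.pypi.org username>'
export PIP_REPO_UPLOAD_USER_PASSWORD='<your test.pypi.org password or API token>'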

3. Kubernetes and Helm

We need a Kubernetes cluster to run the Control Service and Helm to install it.

In production, you may want to use a managed Kubernetes offering like GKE, TKG, EKS, or another three-letter abbreviation.

In this example, though, we will use kind and set things up locally.

  • First, install kind
  • Create a demo cluster using:
kind create cluster --name demo
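
Once the cluster is up, you can verify kubectl can reach it (kind names the context kind-<cluster name>):

kubectl cluster-info --context kind-demo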

Optional integrations

VDK comes with optional integrations with third-party systems that provide more value and can be enabled through configuration only.

These will not be covered in this tutorial. Start a new discussion or contact us on Slack about how to integrate them, since the options are not as clearly documented as we would like.

1. External Logging

All job logs can be forwarded to a centralized logging system.

Prerequisites: Syslog or Fluentd

2. Notifications

SMTP server for mail notifications. It is configured in both the SDK and the Control Service.

Prerequisites: SMTP Server

3. Integration with a monitoring system (e.g. Prometheus).

See the list of supported metrics here. See more in the monitoring configuration.

Prerequisites: Prometheus or Wavefront or similar

4. Advanced Alerting rules

You can define more advanced monitoring rules. The Helm chart comes with prepared PrometheusRules (e.g. Job Delay alerting) that can be used with AlertManager and Prometheus.

Prerequisites: the out-of-the-box rules require AlertManager

5. SSO Support

The Control Service supports OAuth2-based authorization of all operations, making it easy to integrate with a company SSO.

See more in the security section of the Control Service Helm chart.

Prerequisites: OAuth2

6. Access Control Webhooks

Access Control Webhooks enable you to define more complex rules about who is allowed to perform which operations in the Control Service (for cases where OAuth2 is not enough).

Prerequisites: Webhook endpoint

Install Versatile Data Kit with custom SDK

Here we will install the Versatile Data Kit.

First, we will create our custom SDK. This is a very simple process. If you are familiar with Python packaging using setuptools, you will find these steps trivial.

1. Create custom VDK

(Diagram: the custom SDK creation and release process.)

1. Create a directory for our SDK

mkdir my-org-vdk
cd my-org-vdk

2. Create and edit setup.py

Open setup.py in your favorite IDE.

We want to create an SDK that supports:

  • Database queries against both Postgres and Snowflake
  • Ingesting data into Postgres and Snowflake, as well as over HTTP and into local files
  • Control Service operations, such as deploying data jobs

In install_requires we specify the plugins we need to achieve that:

import setuptools

setuptools.setup(
    name="my-org-vdk",
    version="1.0",
    install_requires=[
        "vdk-core",
        "vdk-plugin-control-cli",
        "vdk-postgres",
        "vdk-snowflake",
        "vdk-ingest-http",
        "vdk-ingest-file",
    ]
)
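
Before publishing, you can sanity-check the package locally. Installing it in editable mode pulls the listed plugins from PyPI and should put the vdk command on your PATH:

pip install -e .
vdk --help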

3. Upload our SDK distribution to a PyPI repository

In order for our Python SDK to be installable and usable, we need to release it.

  • First, we build and package it:
python setup.py sdist --formats=gztar
  • Then we upload it to test.pypi.org. Set PIP_REPO_UPLOAD_USER_NAME and PIP_REPO_UPLOAD_USER_PASSWORD to the credentials from the Python (PyPI) repository prerequisite.
twine upload --repository-url https://test.pypi.org/legacy/ -u "$PIP_REPO_UPLOAD_USER_NAME" -p "$PIP_REPO_UPLOAD_USER_PASSWORD" dist/my-org-vdk-1.0.tar.gz

2. Create SDK Docker image

We need to create a simple Docker image with our SDK installed, which will be used by all jobs managed by the VDK Control Service.

1. Create Dockerfile with our SDK installed

Create an empty file named Dockerfile-vdk-base and open it with a text editor or IDE. The content of the Dockerfile is simply this:

FROM python:3.7-slim

WORKDIR /vdk

ENV VDK_VERSION $vdk_version

# Install VDK
RUN pip install --extra-index-url https://test.pypi.org/simple my-org-vdk

As you can see, it's pretty basic. We just want to install VDK.

2. Build and publish the Docker image

Make sure you are logged in to ghcr.io with your GitHub token (see the prerequisites) and tag the image both with the version of the SDK and with the tag "release".

For example (replace tozka with your own GitHub account name from the prerequisites):

docker build -t ghcr.io/tozka/my-org-vdk:1.0 -t ghcr.io/tozka/my-org-vdk:release -f Dockerfile-vdk-base .

docker push ghcr.io/tozka/my-org-vdk:release
docker push ghcr.io/tozka/my-org-vdk:1.0
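
Optionally, run a quick smoke test to confirm the SDK is actually inside the image (vdk-core provides the vdk entry point on the PATH):

docker run --rm ghcr.io/tozka/my-org-vdk:release vdk --help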

3. Install Versatile Data Kit Control Service with Helm.

1. Create and edit a new file values.yaml

Here we will use the GitHub token, account name, and repository created in the Git and Docker repository prerequisite.

In my case those are

  • GITHUB_ACCOUNT_NAME = tozka
  • GITHUB_URL = github.com/tozka/demo-vdk.git

The content of the values.yaml is:


resources:
   limits:
      memory: 0
   requests:
      memory: 0

cockroachdb:
   statefulset:
      resources:
         limits:
            memory: 0
         requests:
            memory: 0  
   init:
      resources:
         limits:
            cpu: 0
            memory: 0
         requests:
            cpu: 0
            memory: 0


deploymentGitUrl: "${GITHUB_URL}"
deploymentGitUsername: "${GITHUB_ACCOUNT_NAME}"
deploymentGitPassword: "${GITHUB_TOKEN}"
uploadGitReadWriteUsername: "${GITHUB_ACCOUNT_NAME}"
uploadGitReadWritePassword: "${GITHUB_TOKEN}"
deploymentDockerRegistryType: generic
deploymentDockerRegistryUsernameReadOnly: "${GITHUB_ACCOUNT_NAME}"
deploymentDockerRegistryPasswordReadOnly: "${GITHUB_TOKEN}"
deploymentDockerRegistryUsername: "${GITHUB_ACCOUNT_NAME}"
deploymentDockerRegistryPassword: "${GITHUB_TOKEN}"
deploymentDockerRepository: "ghcr.io/${GITHUB_ACCOUNT_NAME}/data-jobs/demo-vdk"
proxyRepositoryURL: "ghcr.io/${GITHUB_ACCOUNT_NAME}/data-jobs/demo-vdk"


deploymentVdkDistributionImage:

  registryUsernameReadOnly: "${GITHUB_ACCOUNT_NAME}"
  registryPasswordReadOnly: "${GITHUB_TOKEN}"

  registry: ghcr.io/${GITHUB_ACCOUNT_NAME}
  repository: "my-org-vdk"
  tag: "release"

security:
  enabled: False
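
Note that the ${...} placeholders above are not expanded by Helm. Substitute them with your real values before installing, either by editing the file by hand or, for example, with envsubst (if available on your system):

export GITHUB_ACCOUNT_NAME=tozka
export GITHUB_URL=github.com/tozka/demo-vdk.git
export GITHUB_TOKEN='<your token>'
envsubst < values.yaml > values-rendered.yaml

If you go this route, pass values-rendered.yaml to helm install in the next step instead of values.yaml.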

2. Install VDK Helm chart

helm repo add vdk-gitlab https://gitlab.com/api/v4/projects/28814611/packages/helm/stable
helm repo update

helm install my-vdk-runtime vdk-gitlab/pipelines-control-service -f values.yaml
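
You can check that the release came up correctly before continuing (pod names will differ in your environment):

helm status my-vdk-runtime
kubectl get pods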

3. Expose Control Service API

In order to access the application from our browser, we need to expose it using the kubectl port-forward command:

kubectl port-forward service/my-vdk-runtime-svc 8092:8092
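
To confirm the port-forward works, make any request to the forwarded port from another terminal; receiving any HTTP response at all means the service is reachable:

curl -i http://localhost:8092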

Use

Now let's create and deploy a data job:

Install custom VDK

pip install --extra-index-url https://test.pypi.org/simple my-org-vdk

Configure VDK to know about Control Service

export VDK_CONTROL_SERVICE_REST_API_URL=http://localhost:8092

Create a sample data job

This will create a data job and register it in the Control Service. Locally, it will create a directory with the sample files of a data job:

vdk create --name example --team my-team --path .

Develop the data job

Browse the files in the example directory to get familiar with the structure of a data job.
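
You can also run the job locally with the custom SDK before deploying it (vdk run executes a data job from a local directory, assuming the sample job's dependencies are satisfied locally):

vdk run example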

Deploy the data job

It's a single "click" (or CLI command). Behind the scenes, VDK will package the job, install all its dependencies, create a Docker image and container, and finally schedule it for execution (if a schedule is configured).

vdk deploy --job-path example --reason "reason"

We can see some details about our job

vdk show --name example --team my-team

Note that there are both a VDK version and a job version, and they are deployed independently. The VDK version is taken from the Control Service configuration and managed centrally, while the job version is separate and controlled by the data engineer developing the job.

Both the VDK version and the job version can be changed if needed with the vdk deploy --update command.
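
The exact flags for updating a deployment can vary between VDK versions, so when in doubt, check the built-in help:

vdk deploy --help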

