diff --git a/.cursor/instructions.mdc b/.cursor/instructions.mdc
new file mode 100644
index 0000000..e2bd3fb
--- /dev/null
+++ b/.cursor/instructions.mdc
@@ -0,0 +1,141 @@
---
description: Reference guide for adapting this MLOps template to custom ML solutions on Databricks.
globs:
  - "**/*.ipynb"
  - "**/*.yml"
  - "**/*.yaml"
---

# Databricks E2E MLOps Template — Adaptation Guide

This repository is an end-to-end MLOps template on Databricks. It uses the Iris classification dataset as a placeholder — adapt it to any dataset and ML problem type (regression, clustering, classification, forecasting, etc.).

## Repository Structure

```
notebooks/
  1_data_preprocessing/        → Data ingestion & feature engineering
  2_model_training_and_deployment/
    model_training.ipynb       → Train & register model in Unity Catalog
    model_deployment/          → MLflow 3 deployment pipeline (evaluate → approve → deploy)
  3_inference/
    batch_inference.ipynb      → Batch scoring with the Champion model
    realtime_inference.ipynb   → Serving endpoint example
resources/            → Databricks Asset Bundle job definitions (one YAML per job)
databricks.yml        → Bundle config with dev/prod targets and shared variables
azure_pipelines.yml   → CD pipeline for Azure DevOps (GitHub Actions alternative in .github/)
```

## Parameterization Convention

All notebooks accept parameters via `dbutils.widgets`. The three core parameters wired through every notebook and job are:

| Parameter | Purpose | Default |
|-----------|---------|---------|
| `catalog_name` | Unity Catalog catalog for all tables and models | Set per target in `databricks.yml` |
| `schema_name` | Schema within the catalog | `iris` |
| `model_version` | Model version (deployment notebooks only) | `1` |

Three-level Unity Catalog references always use `{catalog_name}.{schema_name}.<object>` — never hardcode a schema or catalog name inside notebook code.
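As a minimal sketch of this convention: on Databricks the values arrive through widgets (`dbutils` exists only inside a Databricks runtime, so the fallback to the dev defaults below is purely so the sketch runs elsewhere):

```python
# Sketch of the parameter convention used in every notebook. `dbutils` is
# provided by the Databricks runtime; the NameError fallback is only so this
# snippet also runs outside Databricks, using the dev defaults.
try:
    catalog_name = dbutils.widgets.get("catalog_name")  # noqa: F821
    schema_name = dbutils.widgets.get("schema_name")    # noqa: F821
except NameError:
    catalog_name, schema_name = "mlops_quickstart_dev", "iris"  # dev defaults

# Three-level Unity Catalog reference -- catalog and schema are never hardcoded.
table_name = f"{catalog_name}.{schema_name}.iris_data"
print(table_name)
```

Every table, model, and endpoint reference in the notebooks is built this way, which is what lets the same code run unchanged against the dev and prod catalogs.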
+ +Job YAML files in `resources/` declare these as job-level `parameters` (sourced from `${var.*}` bundle variables) and pass them to notebooks via `base_parameters`. + +## How to Adapt for a Custom ML Solution + +### 1. Data Ingestion (`data_ingestion.ipynb`) + +- Replace the `sklearn.datasets.load_iris()` call with your own data source (cloud storage, JDBC, API, Delta table, etc.). +- Rename `iris_data` to a meaningful table name (e.g., `customer_churn_features`). +- Adjust column transformations, data types, and primary key constraints to match your schema. +- If you need multi-stage preprocessing (bronze → silver → gold), add more notebooks under `1_data_preprocessing/` and add corresponding tasks to the job YAML. + +### 2. Model Training (`model_training.ipynb`) + +- Replace `DecisionTreeClassifier` with the algorithm suited to your problem: + - **Regression**: `LinearRegression`, `XGBRegressor`, `LGBMRegressor`, etc. + - **Clustering**: `KMeans`, `DBSCAN`, etc. (note: clustering has no target column). + - **Classification**: `RandomForestClassifier`, `LogisticRegression`, `XGBClassifier`, etc. + - **Forecasting**: Prophet, ARIMA, or deep learning approaches. +- Update the `features` list and `target` variable to match your dataset columns. +- Adjust the logged metrics — use metrics appropriate for your problem type (e.g., `rmse`/`mae` for regression, `silhouette_score` for clustering, `accuracy`/`f1` for classification). +- Update `selected_metric` in the best-run selection logic accordingly. +- Update `model_name` (currently `iris_model`) to describe your model. +- The model signature (`infer_signature`) is inferred automatically — just ensure you pass representative `X_train` and `y_train`. + +### 3. Model Deployment Pipeline (`model_deployment/`) + +This follows the [MLflow 3 Deployment Jobs](https://docs.databricks.com/gcp/en/mlflow/deployment-job) pattern: **evaluate → approve → deploy**. 
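At its core, the gate between stages can be sketched in plain Python. The `approval` tag name follows the template's approval notebook; the rest is an illustrative stand-in for the MLflow client calls the real notebooks make, not the actual implementation:

```python
# Illustrative sketch of the evaluate -> approve -> deploy gate.
# "approval" is the tag the template's approval notebook checks on the model
# version; version_tags stands in for an MLflow client tag lookup.
def next_action(evaluation_passed: bool, version_tags: dict) -> str:
    if not evaluation_passed:
        return "stop: evaluation failed"
    if version_tags.get("approval", "").lower() != "approved":
        return "wait: approval tag not set on the model version"
    return "deploy: promote to Champion and update the serving endpoint"

# A version that passed evaluation and carries the approval tag moves on.
print(next_action(True, {"approval": "Approved"}))
```

Each stage maps to one notebook below, so a rejected or unapproved version simply never reaches the deployment step.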
- **`1_model_evaluation.ipynb`**: Update the `eval_data`, `target`, and `model_type` to match your problem. For regression use `model_type = "regressor"`. You can add custom evaluation logic here.
- **`2_model_approval.ipynb`**: This is a human-in-the-loop gate — it checks for an `approval` tag on the model version. Customize the tag name or approval logic as needed.
- **`3_model_deployment.ipynb`**: Creates/updates a serving endpoint. Adjust `workload_size`, `scale_to_zero_enabled`, and endpoint naming conventions as needed. If you only need batch inference (no serving endpoint), you can remove or skip this notebook.
- After registering your first model version, [connect it to the deployment job](https://docs.databricks.com/gcp/en/mlflow/deployment-job#connect-the-deployment-job-to-a-model) in the Databricks UI.

### 4. Inference (`3_inference/`)

- **Batch** (`batch_inference.ipynb`): Replace the sample data source with your actual new/unseen data. Update column names, prediction post-processing, and the inference table name (`iris_inferences` → your table). The CDF (Change Data Feed) setting at the end enables [Lakehouse Monitoring](https://docs.databricks.com/aws/en/lakehouse-monitoring/) — keep it if you plan to monitor model performance.
- **Realtime** (`realtime_inference.ipynb`): Update `sample_input` to match your model's input schema. The endpoint name is derived as `{catalog_name}-{schema_name}-{model}-endpoint`.

### 5. Job Definitions (`resources/*.yml`)

Each YAML file defines a Databricks Workflows job. When adapting:

- Rename jobs and task keys to reflect your pipeline.
- Update `notebook_path` if you rename or add notebooks.
- Add new `base_parameters` if your notebooks need extra parameters.
- To use dedicated job clusters instead of serverless, uncomment and configure the `job_clusters` section.
- Update `email_notifications` with your team's addresses.

### 6. Bundle Configuration (`databricks.yml`)

- Change `bundle.name` to your project name.
- Set `catalog_name` and `schema_name` per target (dev/prod) under `targets.*.variables`.
- Update permission groups (`PowerUsers`, `Developers`) to match your organization.
- Add new variables under the top-level `variables` block if your notebooks need them.

### 7. CI/CD

- **Azure DevOps**: Update `WORKSPACE_HOST_NAME` in `azure_pipelines.yml` and add `DATABRICKS_CLIENT_ID` / `DATABRICKS_CLIENT_SECRET` as pipeline secrets.
- **GitHub Actions**: Update `WORKSPACE_HOST_NAME` in `.github/workflows/databricks_deployment.yml` and add secrets in repo settings.
- Branch mapping: `dev` branch → dev target, `master`/`main` branch → prod target.

#### Adding a Staging Environment

The template ships with `dev` and `prod` targets. To add a `staging` (or any intermediate) environment:

1. **`databricks.yml`** — Add a new target block between `dev` and `prod`:
   ```yaml
   staging:
     mode: production
     workspace:
       root_path: /Workspace/Users/${workspace.current_user.userName}/.bundle/${bundle.name}/${bundle.target}
     variables:
       catalog_name:
       schema_name: default
       environment: staging
     resources:
       jobs:
         data_ingestion_job:
           permissions:
             - level: "CAN_MANAGE"
               group_name: "PowerUsers"
             - level: "CAN_MANAGE_RUN"
               group_name: "Developers"
         # ... repeat for each job with the desired permission levels
   ```
   Use a dedicated catalog (e.g., `myproject_staging`) to keep staging data isolated from dev and prod.

2. **CI/CD pipeline** — Add a branch trigger for the staging environment:
   - **Azure DevOps** (`azure_pipelines.yml`): Add `staging` to the `branches.include` list and extend the environment-detection script with a new condition mapping `refs/heads/staging` → `staging` target.
+ - **GitHub Actions** (`.github/workflows/databricks_deployment.yml`): Add an equivalent branch condition that runs `databricks bundle deploy --target staging` on pushes to the `staging` branch. + +3. **Service principal** — If staging uses a different workspace, add a separate set of `DATABRICKS_CLIENT_ID` / `DATABRICKS_CLIENT_SECRET` secrets for the staging environment, and update the pipeline to select the correct credentials based on the target. + +4. **Branch strategy** — A typical flow is `feature/* → dev → staging → master/main`, where staging acts as a pre-production validation gate. Protect the `staging` branch with required reviews or status checks to enforce quality before promotion to prod. + +## Key Conventions + +- **Challenger/Champion pattern**: newly trained models are registered with the `challenger` alias. After passing evaluation and approval, the deployment notebook promotes them to `Champion`. Inference notebooks always load the `@champion` alias. +- **MLflow experiment naming**: experiments are scoped per user and catalog — `{user}/{model}_{catalog}` — to avoid collisions during development. +- **Serving endpoint naming**: derived as `{catalog_name}-{schema_name}-{model}-endpoint` (dots replaced with dashes) to ensure valid endpoint names. +- **Idempotent notebooks**: ingestion and inference notebooks check for table existence before deciding whether to create or append/overwrite. diff --git a/.gitignore b/.gitignore index 4f25aa2..f3172ab 100644 --- a/.gitignore +++ b/.gitignore @@ -3,3 +3,6 @@ __pycache__ .idea/ .env +.databricks/ +.cursor/* +!.cursor/instructions.mdc \ No newline at end of file diff --git a/README.md b/README.md index b618eba..33debab 100644 --- a/README.md +++ b/README.md @@ -1,10 +1,10 @@ -# Databricks e2e Data Project +# Databricks MLOps Quickstart This is a simple end-to-end example of a Databricks MLOps project that uses the [Iris Dataset](https://scikit-learn.org/1.4/auto_examples/datasets/plot_iris_dataset.html). 
 The goal of this project is to create a model that allows you to automatically classify flowers into different species based on their properties, and to have a [CI/CD](https://en.wikipedia.org/wiki/CI/CD) pipeline enabled that will allow you to easily track and deploy your code to different environments, such as Development and Production. As a part of this project, you will set up:
-- A job for ingesting the Iris data into a feature table (which, simply speaking, is a Delta table with a Primary Key)
+- A job for ingesting the Iris data into a feature table
 - A job for training a simple classification model and storing the model in the Unity Catalog
 - A job that uses the model to run inferences on top of newly collected (unidentified) flowers
 - An MLflow 3 deployment job for automating the process of defining the model used for running the production inferences
@@ -31,14 +31,14 @@ Having a *Continuous Deployment (CD)* pipeline enabled means that if you make an
 CI/CD Process
-**Wrap up:** as a part of this demo, you have created notebooks and jobs for a fictional MLOps end-to-end pipeline. By connecting your dev and prod workspaces and preparing your CI/CD setup, you will be able to deploy automatic updates to your Dev and Prod environments with a click of a button! Below, you can check the list of jobs that will get created in the demo, which can be easily filtered by using the tag **Project**: "e2e-data-project".
+**Wrap up:** as a part of this demo, you have created notebooks and jobs for a fictional MLOps end-to-end pipeline. By connecting your dev and prod workspaces and preparing your CI/CD setup, you will be able to deploy automatic updates to your Dev and Prod environments with a click of a button! Below, you can check the list of jobs that will get created in the demo, which can be easily filtered by using the tag **Project**: "mlops-quickstart".
CI/CD Process ## Notes - If you want to clone this Repo to reproduce it on your end, don't forget to change the **host** and **catalog_name** from the [databricks.yml](databricks.yml) file, and: - - If you are using GitHub Actions, you need to add 2 Secrets called **DATABRICKS_CLIENT_ID** and **DATABRICKS_CLIENT_SECRET** with your service principal authentication details and also update the **DATABRICKS_HOST** variable in the [.github/workflows/databricks-deployment.yml](.github/workflows/databricks_deployment.yml) file; - - If you use Azure DevOps, then you also have to add these 2 parameters as pipeline secrets (**DATABRICKS_CLIENT_ID** and **DATABRICKS_CLIENT_SECRET**), and update the **DATABRICKS_HOST** variable in the [azure-pipelines.yml](azure_pipelines.yml) file. + - If you are using GitHub Actions, you need to add 2 Secrets called **DATABRICKS_CLIENT_ID** and **DATABRICKS_CLIENT_SECRET** with your service principal authentication details and also update the **WORKSPACE_HOST_NAME** secret in the [.github/workflows/databricks-deployment.yml](.github/workflows/databricks_deployment.yml) file; + - If you use Azure DevOps, then you also have to add these 2 parameters as pipeline secrets (**DATABRICKS_CLIENT_ID** and **DATABRICKS_CLIENT_SECRET**), and update the **WORKSPACE_HOST_NAME** variable in the [azure-pipelines.yml](azure_pipelines.yml) file. - For this simple tutorial, we are using the same workspace and service principal for Dev and Prod, for the simplicity of demonstrating it. Make sure to check our documentation for more details on authentication and on how to manage different environments! 
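As an illustration of how such secrets are typically surfaced to the Databricks CLI, a workflow step might look like the fragment below. This is a hypothetical sketch, not the template's actual workflow: the step name and structure are assumptions, while `DATABRICKS_HOST`, `DATABRICKS_CLIENT_ID`, and `DATABRICKS_CLIENT_SECRET` are the standard environment variables the CLI reads for service principal (OAuth) authentication.

```yaml
# Hypothetical GitHub Actions step wiring the secrets into the Databricks CLI.
- name: Deploy bundle
  run: databricks bundle deploy --target dev
  env:
    DATABRICKS_HOST: ${{ secrets.WORKSPACE_HOST_NAME }}
    DATABRICKS_CLIENT_ID: ${{ secrets.DATABRICKS_CLIENT_ID }}
    DATABRICKS_CLIENT_SECRET: ${{ secrets.DATABRICKS_CLIENT_SECRET }}
```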
## Learn more diff --git a/databricks.yml b/databricks.yml index d44befc..27eecb2 100644 --- a/databricks.yml +++ b/databricks.yml @@ -1,5 +1,5 @@ bundle: - name: e2e-data-project + name: mlops-quickstart include: - resources/*.yml @@ -11,7 +11,8 @@ targets: root_path: /Workspace/Users/${workspace.current_user.userName}/.bundle/${bundle.name}/${bundle.target} default: true variables: - catalog_name: pedroz_e2edata_dev + catalog_name: mlops_quickstart_dev + schema_name: iris environment: dev # Override permissions for dev environment resources: @@ -46,7 +47,8 @@ targets: workspace: root_path: /Workspace/Users/${workspace.current_user.userName}/.bundle/${bundle.name}/${bundle.target} variables: - catalog_name: pedroz_e2edata_prod + catalog_name: mlops_quickstart_prod + schema_name: iris environment: prod # Override permissions for prod environment (only PowerUsers group) @@ -72,7 +74,10 @@ targets: variables: catalog_name: description: "Name of the Unity Catalog to use" - default: "default_catalog" + default: "mlops_quickstart_dev" + schema_name: + description: "Name of the schema to use within the catalog" + default: "iris" environment: description: "Environment name (dev/prod)" default: "dev" \ No newline at end of file diff --git a/notebooks/1_data_preprocessing/data_ingestion.ipynb b/notebooks/1_data_preprocessing/data_ingestion.ipynb index 56544cd..0b29dad 100644 --- a/notebooks/1_data_preprocessing/data_ingestion.ipynb +++ b/notebooks/1_data_preprocessing/data_ingestion.ipynb @@ -1,269 +1,271 @@ { - "cells": [ - { - "cell_type": "code", - "execution_count": 0, - "metadata": { - "application/vnd.databricks.v1+cell": { - "cellMetadata": { - "byteLimit": 2048000, - "rowLimit": 10000 - }, - "inputWidgets": {}, - "nuid": "5647212e-8e74-46e9-b783-93b1ccf19a4c", - "showTitle": false, - "tableResultSettingsMap": {}, - "title": "" - } - }, - "outputs": [], - "source": [ - "# This notebook is meant to extract the data from sklearn.datasets and ingest it into a table in 
the UC\n", - "# Dummy change" - ] - }, - { - "cell_type": "code", - "execution_count": 0, - "metadata": { - "application/vnd.databricks.v1+cell": { - "cellMetadata": { - "byteLimit": 2048000, - "rowLimit": 10000 - }, - "inputWidgets": {}, - "nuid": "c9929a75-c378-4333-8f21-99bc5c62e505", - "showTitle": false, - "tableResultSettingsMap": {}, - "title": "" - } - }, - "outputs": [], - "source": [ - "from sklearn import datasets \n", - "import pandas as pd\n", - "from pyspark.sql.functions import monotonically_increasing_id" - ] - }, - { - "cell_type": "code", - "execution_count": 0, - "metadata": { - "application/vnd.databricks.v1+cell": { - "cellMetadata": { - "byteLimit": 2048000, - "rowLimit": 10000 - }, - "inputWidgets": {}, - "nuid": "766e6b89-5006-48d0-bcdc-d2f7bc9d0079", - "showTitle": false, - "tableResultSettingsMap": {}, - "title": "" - } - }, - "outputs": [], - "source": [ - "dbutils.widgets.text(\"catalog_name\", \"pedroz_e2edata_dev\")\n", - "catalog_name = dbutils.widgets.get(\"catalog_name\")" - ] - }, - { - "cell_type": "code", - "execution_count": 0, - "metadata": { - "application/vnd.databricks.v1+cell": { - "cellMetadata": { - "byteLimit": 2048000, - "rowLimit": 10000 - }, - "inputWidgets": {}, - "nuid": "4c99ff27-a91d-4c82-9cd1-197bb8d31d95", - "showTitle": false, - "tableResultSettingsMap": {}, - "title": "" - } - }, - "outputs": [], - "source": [ - "iris_data = datasets.load_iris(as_frame=True)\n", - "df_iris = pd.DataFrame(data = iris_data['data'], columns = iris_data['feature_names'])\n", - "df_iris.columns = df_iris.columns.str.replace(' ', '_').str.replace('(', '').str.replace(')', '')\n", - "df_iris['species'] = iris_data['target']\n", - "df_iris.head()" - ] - }, - { - "cell_type": "code", - "execution_count": 0, - "metadata": { - "application/vnd.databricks.v1+cell": { - "cellMetadata": { - "byteLimit": 2048000, - "rowLimit": 10000 - }, - "inputWidgets": {}, - "nuid": "479e45b2-54b4-4040-bada-b6a9d58242c0", - "showTitle": false, - 
"tableResultSettingsMap": {}, - "title": "" - } - }, - "outputs": [], - "source": [ - "df_iris['id'] = range(1, len(df_iris) + 1)\n", - "df_iris.head()" - ] - }, - { - "cell_type": "code", - "execution_count": 0, - "metadata": { - "application/vnd.databricks.v1+cell": { - "cellMetadata": { - "byteLimit": 2048000, - "rowLimit": 10000 - }, - "inputWidgets": {}, - "nuid": "f69d1295-0c2a-480f-9224-f3c2db701289", - "showTitle": false, - "tableResultSettingsMap": {}, - "title": "" - } - }, - "outputs": [], - "source": [ - "spark_df_iris = spark.createDataFrame(df_iris)" - ] - }, - { - "cell_type": "code", - "execution_count": 0, - "metadata": { - "application/vnd.databricks.v1+cell": { - "cellMetadata": { - "byteLimit": 2048000, - "rowLimit": 10000 - }, - "inputWidgets": {}, - "nuid": "22f52809-27d7-4a26-86e0-c76cf8f7c710", - "showTitle": false, - "tableResultSettingsMap": {}, - "title": "" - } - }, - "outputs": [], - "source": [ - "try:\n", - " display(spark.table(f\"{catalog_name}.default.iris_data\").limit(5))\n", - " table_exists = True\n", - "except:\n", - " table_exists = False" - ] - }, - { - "cell_type": "code", - "execution_count": 0, - "metadata": { - "application/vnd.databricks.v1+cell": { - "cellMetadata": { - "byteLimit": 2048000, - "rowLimit": 10000 - }, - "inputWidgets": {}, - "nuid": "d2e69e48-8356-4b05-93a8-59ff968640c3", - "showTitle": false, - "tableResultSettingsMap": {}, - "title": "" - } - }, - "outputs": [], - "source": [ - "if not table_exists:\n", - " spark_df_iris.write.mode(\"overwrite\").saveAsTable(f\"{catalog_name}.default.iris_data\")\n", - " spark.sql(f\"ALTER TABLE {catalog_name}.default.iris_data ALTER COLUMN id SET NOT NULL\")\n", - " spark.sql(f\"ALTER TABLE {catalog_name}.default.iris_data ADD CONSTRAINT pk_id PRIMARY KEY (id)\")\n", - " print(\"Created table and added a primary key to it\")\n", - "else:\n", - " spark_df_iris.write.mode(\"overwrite\").saveAsTable(f\"{catalog_name}.default.iris_data\")\n", - " print(\"Overwrote table 
data\")" - ] - }, - { - "cell_type": "code", - "execution_count": 0, - "metadata": { - "application/vnd.databricks.v1+cell": { - "cellMetadata": { - "byteLimit": 2048000, - "rowLimit": 10000 - }, - "inputWidgets": {}, - "nuid": "dc805f7d-439d-4d88-a1cb-e6a551ee022f", - "showTitle": false, - "tableResultSettingsMap": {}, - "title": "" - } - }, - "outputs": [], - "source": [ - "df = spark.sql(f\"SELECT * FROM {catalog_name}.default.iris_data\")\n", - "display(df.limit(5))" - ] - } - ], - "metadata": { - "application/vnd.databricks.v1+notebook": { - "computePreferences": null, - "dashboards": [], - "environmentMetadata": { - "base_environment": "", - "environment_version": "2" - }, - "inputWidgetPreferences": null, - "language": "python", - "notebookMetadata": { - "mostRecentlyExecutedCommandWithImplicitDF": { - "commandId": 6519546643539454, - "dataframes": [ - "_sqldf" - ] + "cells": [ + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "inputWidgets": {}, + "nuid": "5647212e-8e74-46e9-b783-93b1ccf19a4c", + "showTitle": false, + "tableResultSettingsMap": {}, + "title": "" + } + }, + "outputs": [], + "source": [ + "# This notebook is meant to extract the data from sklearn.datasets and ingest it into a table in the UC\n", + "# Dummy change" + ] }, - "pythonIndentUnit": 4 - }, - "notebookName": "data-ingestion", - "widgets": { - "catalog_name": { - "currentValue": "pedroz_e2edata_dev", - "nuid": "a2e50b0e-7349-4502-b8bf-17b480305d84", - "typedWidgetInfo": { - "autoCreated": false, - "defaultValue": "pedroz_e2edata_dev", - "label": null, - "name": "catalog_name", - "options": { - "validationRegex": null, - "widgetDisplayType": "Text" + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "inputWidgets": {}, + "nuid": 
"c9929a75-c378-4333-8f21-99bc5c62e505", + "showTitle": false, + "tableResultSettingsMap": {}, + "title": "" + } }, - "parameterDataType": "String" - }, - "widgetInfo": { - "defaultValue": "pedroz_e2edata_dev", - "label": null, - "name": "catalog_name", - "options": { - "autoCreated": null, - "validationRegex": null, - "widgetType": "text" + "outputs": [], + "source": [ + "from sklearn import datasets \n", + "import pandas as pd\n", + "from pyspark.sql.functions import monotonically_increasing_id" + ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "inputWidgets": {}, + "nuid": "766e6b89-5006-48d0-bcdc-d2f7bc9d0079", + "showTitle": false, + "tableResultSettingsMap": {}, + "title": "" + } + }, + "outputs": [], + "source": [ + "dbutils.widgets.text(\"catalog_name\", \"mlops_quickstart_dev\")\n", + "dbutils.widgets.text(\"schema_name\", \"iris\")\n", + "catalog_name = dbutils.widgets.get(\"catalog_name\")\n", + "schema_name = dbutils.widgets.get(\"schema_name\")" + ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "inputWidgets": {}, + "nuid": "4c99ff27-a91d-4c82-9cd1-197bb8d31d95", + "showTitle": false, + "tableResultSettingsMap": {}, + "title": "" + } + }, + "outputs": [], + "source": [ + "iris_data = datasets.load_iris(as_frame=True)\n", + "df_iris = pd.DataFrame(data = iris_data['data'], columns = iris_data['feature_names'])\n", + "df_iris.columns = df_iris.columns.str.replace(' ', '_').str.replace('(', '').str.replace(')', '')\n", + "df_iris['species'] = iris_data['target']\n", + "df_iris.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + 
"inputWidgets": {}, + "nuid": "479e45b2-54b4-4040-bada-b6a9d58242c0", + "showTitle": false, + "tableResultSettingsMap": {}, + "title": "" + } + }, + "outputs": [], + "source": [ + "df_iris['id'] = range(1, len(df_iris) + 1)\n", + "df_iris.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "inputWidgets": {}, + "nuid": "f69d1295-0c2a-480f-9224-f3c2db701289", + "showTitle": false, + "tableResultSettingsMap": {}, + "title": "" + } + }, + "outputs": [], + "source": [ + "spark_df_iris = spark.createDataFrame(df_iris)" + ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "inputWidgets": {}, + "nuid": "22f52809-27d7-4a26-86e0-c76cf8f7c710", + "showTitle": false, + "tableResultSettingsMap": {}, + "title": "" + } }, - "widgetType": "text" - } + "outputs": [], + "source": [ + "try:\n", + " display(spark.table(f\"{catalog_name}.{schema_name}.iris_data\").limit(5))\n", + " table_exists = True\n", + "except:\n", + " table_exists = False" + ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "inputWidgets": {}, + "nuid": "d2e69e48-8356-4b05-93a8-59ff968640c3", + "showTitle": false, + "tableResultSettingsMap": {}, + "title": "" + } + }, + "outputs": [], + "source": [ + "if not table_exists:\n", + " spark_df_iris.write.mode(\"overwrite\").saveAsTable(f\"{catalog_name}.{schema_name}.iris_data\")\n", + " spark.sql(f\"ALTER TABLE {catalog_name}.{schema_name}.iris_data ALTER COLUMN id SET NOT NULL\")\n", + " spark.sql(f\"ALTER TABLE {catalog_name}.{schema_name}.iris_data ADD CONSTRAINT pk_id PRIMARY KEY (id)\")\n", + " print(\"Created table and added a primary key to 
it\")\n", + "else:\n", + " spark_df_iris.write.mode(\"overwrite\").saveAsTable(f\"{catalog_name}.{schema_name}.iris_data\")\n", + " print(\"Overwrote table data\")" + ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "inputWidgets": {}, + "nuid": "dc805f7d-439d-4d88-a1cb-e6a551ee022f", + "showTitle": false, + "tableResultSettingsMap": {}, + "title": "" + } + }, + "outputs": [], + "source": [ + "df = spark.sql(f\"SELECT * FROM {catalog_name}.{schema_name}.iris_data\")\n", + "display(df.limit(5))" + ] + } + ], + "metadata": { + "application/vnd.databricks.v1+notebook": { + "computePreferences": null, + "dashboards": [], + "environmentMetadata": { + "base_environment": "", + "environment_version": "2" + }, + "inputWidgetPreferences": null, + "language": "python", + "notebookMetadata": { + "mostRecentlyExecutedCommandWithImplicitDF": { + "commandId": 6519546643539454, + "dataframes": [ + "_sqldf" + ] + }, + "pythonIndentUnit": 4 + }, + "notebookName": "data-ingestion", + "widgets": { + "catalog_name": { + "currentValue": "pedroz_e2edata_dev", + "nuid": "a2e50b0e-7349-4502-b8bf-17b480305d84", + "typedWidgetInfo": { + "autoCreated": false, + "defaultValue": "pedroz_e2edata_dev", + "label": null, + "name": "catalog_name", + "options": { + "validationRegex": null, + "widgetDisplayType": "Text" + }, + "parameterDataType": "String" + }, + "widgetInfo": { + "defaultValue": "pedroz_e2edata_dev", + "label": null, + "name": "catalog_name", + "options": { + "autoCreated": null, + "validationRegex": null, + "widgetType": "text" + }, + "widgetType": "text" + } + } + } + }, + "language_info": { + "name": "python" } - } }, - "language_info": { - "name": "python" - } - }, - "nbformat": 4, - "nbformat_minor": 0 + "nbformat": 4, + "nbformat_minor": 0 } diff --git a/notebooks/2_model_training_and_deployment/model_deployment/1_model_evaluation.ipynb 
b/notebooks/2_model_training_and_deployment/model_deployment/1_model_evaluation.ipynb index 15c2a3e..9b946e2 100644 --- a/notebooks/2_model_training_and_deployment/model_deployment/1_model_evaluation.ipynb +++ b/notebooks/2_model_training_and_deployment/model_deployment/1_model_evaluation.ipynb @@ -1,211 +1,215 @@ { - "cells": [ - { - "cell_type": "code", - "execution_count": 0, - "metadata": { - "application/vnd.databricks.v1+cell": { - "cellMetadata": { - "byteLimit": 2048000, - "rowLimit": 10000 - }, - "inputWidgets": {}, - "nuid": "916d34b1-28d2-471f-b3ea-fb0064cc4504", - "showTitle": false, - "tableResultSettingsMap": {}, - "title": "" - } - }, - "outputs": [], - "source": [ - "%pip install mlflow=='3.4.0'\n", - "dbutils.library.restartPython()" - ] - }, - { - "cell_type": "code", - "execution_count": 0, - "metadata": { - "application/vnd.databricks.v1+cell": { - "cellMetadata": { - "byteLimit": 2048000, - "rowLimit": 10000 - }, - "inputWidgets": {}, - "nuid": "f79706be-1a13-4eff-baa0-789b89020bef", - "showTitle": false, - "tableResultSettingsMap": {}, - "title": "" - } - }, - "outputs": [], - "source": [ - "import pandas as pd\n", - "import mlflow\n", - "from sklearn import datasets" - ] - }, - { - "cell_type": "code", - "execution_count": 0, - "metadata": { - "application/vnd.databricks.v1+cell": { - "cellMetadata": { - "byteLimit": 2048000, - "rowLimit": 10000 - }, - "inputWidgets": {}, - "nuid": "0004f1e0-7188-476a-a232-684e5a2fcece", - "showTitle": false, - "tableResultSettingsMap": {}, - "title": "" - } - }, - "outputs": [], - "source": [ - "dbutils.widgets.text(\"model_name\", \"pedroz_e2edata_dev.default.iris_model\")\n", - "dbutils.widgets.text(\"model_version\", \"1\")\n", - "\n", - "model_name = dbutils.widgets.get(\"model_name\")\n", - "model_version = dbutils.widgets.get(\"model_version\")" - ] - }, - { - "cell_type": "code", - "execution_count": 0, - "metadata": { - "application/vnd.databricks.v1+cell": { - "cellMetadata": { - "byteLimit": 
2048000, - "rowLimit": 10000 - }, - "inputWidgets": {}, - "nuid": "b89ad4f4-2db8-41df-91d4-8bb98b183348", - "showTitle": false, - "tableResultSettingsMap": {}, - "title": "" - } - }, - "outputs": [], - "source": [ - "# Pull the dataset for running the inference\n", - "iris_samples = datasets.load_iris(as_frame=True)\n", - "df_samples = pd.DataFrame(data = iris_samples['data'], columns = iris_samples['feature_names'])\n", - "df_samples.columns = df_samples.columns.str.replace(' ', '_').str.replace('(', '').str.replace(')', '')\n", - "df_samples['species'] = iris_samples.target.astype(int)\n", - "df_samples.head()" - ] - }, - { - "cell_type": "code", - "execution_count": 0, - "metadata": { - "application/vnd.databricks.v1+cell": { - "cellMetadata": { - "byteLimit": 2048000, - "rowLimit": 10000 - }, - "inputWidgets": {}, - "nuid": "8f46cecb-fa41-40b9-8ea7-c242372b5e07", - "showTitle": false, - "tableResultSettingsMap": {}, - "title": "" - } - }, - "outputs": [], - "source": [ - "# REQUIRED: add evaluation dataset and target here\n", - "eval_data = df_samples\n", - "target = \"species\"\n", - "# REQUIRED: add model type here (e.g. 
\"regressor\", \"databricks-agent\", etc.)\n", - "model_type = \"classifier\"\n", - "\n", - "model_uri = f'models:/{model_name}/{model_version}'\n", - "# can also fetch model ID and use that for URI instead as described below\n", - "\n", - "with mlflow.start_run(run_name=\"evaluation\") as run:\n", - " mlflow.models.evaluate(\n", - " model=model_uri,\n", - " data=eval_data,\n", - " targets=target,\n", - " model_type=model_type\n", - " )" - ] - } - ], - "metadata": { - "application/vnd.databricks.v1+notebook": { - "computePreferences": null, - "dashboards": [], - "environmentMetadata": { - "base_environment": "", - "environment_version": "2" - }, - "inputWidgetPreferences": null, - "language": "python", - "notebookMetadata": { - "pythonIndentUnit": 4 - }, - "notebookName": "model-evaluation", - "widgets": { - "model_name": { - "currentValue": "pedroz_e2edata_dev.default.iris_model", - "nuid": "8e3ab24c-d5b7-400f-b969-4374bd46a6de", - "typedWidgetInfo": { - "autoCreated": false, - "defaultValue": "iris_model", - "label": null, - "name": "model_name", - "options": { - "validationRegex": null, - "widgetDisplayType": "Text" + "cells": [ + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "inputWidgets": {}, + "nuid": "916d34b1-28d2-471f-b3ea-fb0064cc4504", + "showTitle": false, + "tableResultSettingsMap": {}, + "title": "" + } }, - "parameterDataType": "String" - }, - "widgetInfo": { - "defaultValue": "iris_model", - "label": null, - "name": "model_name", - "options": { - "autoCreated": null, - "validationRegex": null, - "widgetType": "text" + "outputs": [], + "source": [ + "%pip install mlflow=='3.4.0'\n", + "dbutils.library.restartPython()" + ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "inputWidgets": 
{}, + "nuid": "f79706be-1a13-4eff-baa0-789b89020bef", + "showTitle": false, + "tableResultSettingsMap": {}, + "title": "" + } + }, + "outputs": [], + "source": [ + "import pandas as pd\n", + "import mlflow\n", + "from sklearn import datasets" + ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "inputWidgets": {}, + "nuid": "0004f1e0-7188-476a-a232-684e5a2fcece", + "showTitle": false, + "tableResultSettingsMap": {}, + "title": "" + } + }, + "outputs": [], + "source": [ + "dbutils.widgets.text(\"catalog_name\", \"mlops_quickstart_dev\")\n", + "dbutils.widgets.text(\"schema_name\", \"iris\")\n", + "dbutils.widgets.text(\"model_version\", \"1\")\n", + "\n", + "catalog_name = dbutils.widgets.get(\"catalog_name\")\n", + "schema_name = dbutils.widgets.get(\"schema_name\")\n", + "model_version = dbutils.widgets.get(\"model_version\")\n", + "\n", + "model_name = f\"{catalog_name}.{schema_name}.iris_model\"" + ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "inputWidgets": {}, + "nuid": "b89ad4f4-2db8-41df-91d4-8bb98b183348", + "showTitle": false, + "tableResultSettingsMap": {}, + "title": "" + } }, - "widgetType": "text" - } + "outputs": [], + "source": [ + "# Pull the dataset for running the inference\n", + "iris_samples = datasets.load_iris(as_frame=True)\n", + "df_samples = pd.DataFrame(data = iris_samples['data'], columns = iris_samples['feature_names'])\n", + "df_samples.columns = df_samples.columns.str.replace(' ', '_').str.replace('(', '').str.replace(')', '')\n", + "df_samples['species'] = iris_samples.target.astype(int)\n", + "df_samples.head()" + ] }, - "model_version": { - "currentValue": "1", - "nuid": "8927867b-5018-486b-9577-7f545068e5b7", - "typedWidgetInfo": { - "autoCreated": false, - 
"defaultValue": "1", - "label": null, - "name": "model_version", - "options": { - "validationRegex": null, - "widgetDisplayType": "Text" + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "inputWidgets": {}, + "nuid": "8f46cecb-fa41-40b9-8ea7-c242372b5e07", + "showTitle": false, + "tableResultSettingsMap": {}, + "title": "" + } + }, + "outputs": [], + "source": [ + "# REQUIRED: add evaluation dataset and target here\n", + "eval_data = df_samples\n", + "target = \"species\"\n", + "# REQUIRED: add model type here (e.g. \"regressor\", \"databricks-agent\", etc.)\n", + "model_type = \"classifier\"\n", + "\n", + "model_uri = f'models:/{model_name}/{model_version}'\n", + "# can also fetch model ID and use that for URI instead as described below\n", + "\n", + "with mlflow.start_run(run_name=\"evaluation\") as run:\n", + " mlflow.models.evaluate(\n", + " model=model_uri,\n", + " data=eval_data,\n", + " targets=target,\n", + " model_type=model_type\n", + " )" + ] + } + ], + "metadata": { + "application/vnd.databricks.v1+notebook": { + "computePreferences": null, + "dashboards": [], + "environmentMetadata": { + "base_environment": "", + "environment_version": "2" }, - "parameterDataType": "String" - }, - "widgetInfo": { - "defaultValue": "1", - "label": null, - "name": "model_version", - "options": { - "autoCreated": null, - "validationRegex": null, - "widgetType": "text" + "inputWidgetPreferences": null, + "language": "python", + "notebookMetadata": { + "pythonIndentUnit": 4 }, - "widgetType": "text" - } + "notebookName": "model-evaluation", + "widgets": { + "model_name": { + "currentValue": "pedroz_e2edata_dev.default.iris_model", + "nuid": "8e3ab24c-d5b7-400f-b969-4374bd46a6de", + "typedWidgetInfo": { + "autoCreated": false, + "defaultValue": "iris_model", + "label": null, + "name": "model_name", + "options": { + "validationRegex": null, + 
"widgetDisplayType": "Text" + }, + "parameterDataType": "String" + }, + "widgetInfo": { + "defaultValue": "iris_model", + "label": null, + "name": "model_name", + "options": { + "autoCreated": null, + "validationRegex": null, + "widgetType": "text" + }, + "widgetType": "text" + } + }, + "model_version": { + "currentValue": "1", + "nuid": "8927867b-5018-486b-9577-7f545068e5b7", + "typedWidgetInfo": { + "autoCreated": false, + "defaultValue": "1", + "label": null, + "name": "model_version", + "options": { + "validationRegex": null, + "widgetDisplayType": "Text" + }, + "parameterDataType": "String" + }, + "widgetInfo": { + "defaultValue": "1", + "label": null, + "name": "model_version", + "options": { + "autoCreated": null, + "validationRegex": null, + "widgetType": "text" + }, + "widgetType": "text" + } + } + } + }, + "language_info": { + "name": "python" } - } }, - "language_info": { - "name": "python" - } - }, - "nbformat": 4, - "nbformat_minor": 0 + "nbformat": 4, + "nbformat_minor": 0 } diff --git a/notebooks/2_model_training_and_deployment/model_deployment/2_model_approval.ipynb b/notebooks/2_model_training_and_deployment/model_deployment/2_model_approval.ipynb index ba7cc28..4f7d3d7 100644 --- a/notebooks/2_model_training_and_deployment/model_deployment/2_model_approval.ipynb +++ b/notebooks/2_model_training_and_deployment/model_deployment/2_model_approval.ipynb @@ -1,206 +1,210 @@ { - "cells": [ - { - "cell_type": "code", - "execution_count": 0, - "metadata": { - "application/vnd.databricks.v1+cell": { - "cellMetadata": { - "byteLimit": 2048000, - "rowLimit": 10000 - }, - "inputWidgets": {}, - "nuid": "ab6629f3-78e2-4944-a1ae-45c47d2b455f", - "showTitle": false, - "tableResultSettingsMap": {}, - "title": "" - } - }, - "outputs": [], - "source": [ - "%pip install mlflow=='3.4.0'\n", - "dbutils.library.restartPython()" - ] - }, - { - "cell_type": "code", - "execution_count": 0, - "metadata": { - "application/vnd.databricks.v1+cell": { - "cellMetadata": { - 
"byteLimit": 2048000, - "rowLimit": 10000 - }, - "inputWidgets": {}, - "nuid": "dd45f111-9993-472a-ab4c-b6d5419d376a", - "showTitle": false, - "tableResultSettingsMap": {}, - "title": "" - } - }, - "outputs": [], - "source": [ - "dbutils.widgets.text(\"model_name\", \"pedroz_e2edata_dev.default.iris_model\")\n", - "dbutils.widgets.text(\"model_version\", \"1\")\n", - "dbutils.widgets.text(\"approval_tag_name\", \"approval\")\n", - "\n", - "model_name = dbutils.widgets.get(\"model_name\")\n", - "model_version = dbutils.widgets.get(\"model_version\")\n", - "approval_tag_name = dbutils.widgets.get(\"approval_tag_name\")" - ] - }, - { - "cell_type": "code", - "execution_count": 0, - "metadata": { - "application/vnd.databricks.v1+cell": { - "cellMetadata": {}, - "inputWidgets": {}, - "nuid": "8f1acdc6-00ff-4424-a382-1345da234ca2", - "showTitle": false, - "tableResultSettingsMap": {}, - "title": "" - } - }, - "outputs": [], - "source": [ - "from mlflow import MlflowClient" - ] - }, - { - "cell_type": "code", - "execution_count": 0, - "metadata": { - "application/vnd.databricks.v1+cell": { - "cellMetadata": { - "byteLimit": 2048000, - "rowLimit": 10000 - }, - "inputWidgets": {}, - "nuid": "d0e6eb3f-dce9-48e4-a25d-92fa9af54289", - "showTitle": false, - "tableResultSettingsMap": {}, - "title": "" - } - }, - "outputs": [], - "source": [ - "client = MlflowClient(registry_uri=\"databricks-uc\")\n", - "\n", - "# fetch the model version's UC tags\n", - "model_tags = client.get_model_version(model_name, model_version).tags\n", - "\n", - "# check if any tag matches the approval tag name\n", - "if not any(tag == approval_tag_name for tag in model_tags.keys()):\n", - " raise Exception(\"Model version not approved for deployment\")\n", - "else:\n", - " # if tag is found, check if it is approved\n", - " if model_tags.get(approval_tag_name).lower() == \"approved\":\n", - " print(\"Model version approved for deployment\")\n", - " else:\n", - " raise Exception(\"Model version not 
approved for deployment\")" - ] - } - ], - "metadata": { - "application/vnd.databricks.v1+notebook": { - "computePreferences": null, - "dashboards": [], - "environmentMetadata": { - "base_environment": "", - "environment_version": "2" - }, - "inputWidgetPreferences": null, - "language": "python", - "notebookMetadata": { - "pythonIndentUnit": 4 - }, - "notebookName": "model-approval", - "widgets": { - "approval_tag_name": { - "currentValue": "approved", - "nuid": "f2036e90-1ce2-40e5-9486-c5b637d08249", - "typedWidgetInfo": { - "autoCreated": false, - "defaultValue": "approved", - "label": null, - "name": "approval_tag_name", - "options": { - "validationRegex": null, - "widgetDisplayType": "Text" - }, - "parameterDataType": "String" - }, - "widgetInfo": { - "defaultValue": "approved", - "label": null, - "name": "approval_tag_name", - "options": { - "autoCreated": null, - "validationRegex": null, - "widgetType": "text" + "cells": [ + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "inputWidgets": {}, + "nuid": "ab6629f3-78e2-4944-a1ae-45c47d2b455f", + "showTitle": false, + "tableResultSettingsMap": {}, + "title": "" + } }, - "widgetType": "text" - } + "outputs": [], + "source": [ + "%pip install mlflow=='3.4.0'\n", + "dbutils.library.restartPython()" + ] }, - "model_name": { - "currentValue": "pedroz_e2edata_dev.default.iris_model", - "nuid": "e9d29f87-03e5-49dc-9fb0-f1ff3ca9ebd4", - "typedWidgetInfo": { - "autoCreated": false, - "defaultValue": "pedroz_e2edata_dev.default.iris_model", - "label": null, - "name": "model_name", - "options": { - "validationRegex": null, - "widgetDisplayType": "Text" + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "inputWidgets": {}, + "nuid": 
"dd45f111-9993-472a-ab4c-b6d5419d376a", + "showTitle": false, + "tableResultSettingsMap": {}, + "title": "" + } }, - "parameterDataType": "String" - }, - "widgetInfo": { - "defaultValue": "pedroz_e2edata_dev.default.iris_model", - "label": null, - "name": "model_name", - "options": { - "autoCreated": null, - "validationRegex": null, - "widgetType": "text" + "outputs": [], + "source": [ + "dbutils.widgets.text(\"catalog_name\", \"mlops_quickstart_dev\")\n", + "dbutils.widgets.text(\"schema_name\", \"iris\")\n", + "dbutils.widgets.text(\"model_version\", \"1\")\n", + "dbutils.widgets.text(\"approval_tag_name\", \"approval\")\n", + "\n", + "catalog_name = dbutils.widgets.get(\"catalog_name\")\n", + "schema_name = dbutils.widgets.get(\"schema_name\")\n", + "model_version = dbutils.widgets.get(\"model_version\")\n", + "approval_tag_name = dbutils.widgets.get(\"approval_tag_name\")\n", + "\n", + "model_name = f\"{catalog_name}.{schema_name}.iris_model\"" + ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": {}, + "inputWidgets": {}, + "nuid": "8f1acdc6-00ff-4424-a382-1345da234ca2", + "showTitle": false, + "tableResultSettingsMap": {}, + "title": "" + } }, - "widgetType": "text" - } + "outputs": [], + "source": [ + "from mlflow import MlflowClient" + ] }, - "model_version": { - "currentValue": "1", - "nuid": "f884f2a0-a720-4bba-b2c1-bc035c5947e2", - "typedWidgetInfo": { - "autoCreated": false, - "defaultValue": "1", - "label": null, - "name": "model_version", - "options": { - "validationRegex": null, - "widgetDisplayType": "Text" + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "inputWidgets": {}, + "nuid": "d0e6eb3f-dce9-48e4-a25d-92fa9af54289", + "showTitle": false, + "tableResultSettingsMap": {}, + "title": "" + } }, - "parameterDataType": "String" - }, - 
"widgetInfo": { - "defaultValue": "1", - "label": null, - "name": "model_version", - "options": { - "autoCreated": null, - "validationRegex": null, - "widgetType": "text" + "outputs": [], + "source": [ + "client = MlflowClient(registry_uri=\"databricks-uc\")\n", + "\n", + "# fetch the model version's UC tags\n", + "model_tags = client.get_model_version(model_name, model_version).tags\n", + "\n", + "# check if any tag matches the approval tag name\n", + "if not any(tag == approval_tag_name for tag in model_tags.keys()):\n", + " raise Exception(\"Model version not approved for deployment\")\n", + "else:\n", + " # if tag is found, check if it is approved\n", + " if model_tags.get(approval_tag_name).lower() == \"approved\":\n", + " print(\"Model version approved for deployment\")\n", + " else:\n", + " raise Exception(\"Model version not approved for deployment\")" + ] + } + ], + "metadata": { + "application/vnd.databricks.v1+notebook": { + "computePreferences": null, + "dashboards": [], + "environmentMetadata": { + "base_environment": "", + "environment_version": "2" + }, + "inputWidgetPreferences": null, + "language": "python", + "notebookMetadata": { + "pythonIndentUnit": 4 }, - "widgetType": "text" - } + "notebookName": "model-approval", + "widgets": { + "approval_tag_name": { + "currentValue": "approved", + "nuid": "f2036e90-1ce2-40e5-9486-c5b637d08249", + "typedWidgetInfo": { + "autoCreated": false, + "defaultValue": "approved", + "label": null, + "name": "approval_tag_name", + "options": { + "validationRegex": null, + "widgetDisplayType": "Text" + }, + "parameterDataType": "String" + }, + "widgetInfo": { + "defaultValue": "approved", + "label": null, + "name": "approval_tag_name", + "options": { + "autoCreated": null, + "validationRegex": null, + "widgetType": "text" + }, + "widgetType": "text" + } + }, + "model_name": { + "currentValue": "pedroz_e2edata_dev.default.iris_model", + "nuid": "e9d29f87-03e5-49dc-9fb0-f1ff3ca9ebd4", + "typedWidgetInfo": { + 
"autoCreated": false, + "defaultValue": "pedroz_e2edata_dev.default.iris_model", + "label": null, + "name": "model_name", + "options": { + "validationRegex": null, + "widgetDisplayType": "Text" + }, + "parameterDataType": "String" + }, + "widgetInfo": { + "defaultValue": "pedroz_e2edata_dev.default.iris_model", + "label": null, + "name": "model_name", + "options": { + "autoCreated": null, + "validationRegex": null, + "widgetType": "text" + }, + "widgetType": "text" + } + }, + "model_version": { + "currentValue": "1", + "nuid": "f884f2a0-a720-4bba-b2c1-bc035c5947e2", + "typedWidgetInfo": { + "autoCreated": false, + "defaultValue": "1", + "label": null, + "name": "model_version", + "options": { + "validationRegex": null, + "widgetDisplayType": "Text" + }, + "parameterDataType": "String" + }, + "widgetInfo": { + "defaultValue": "1", + "label": null, + "name": "model_version", + "options": { + "autoCreated": null, + "validationRegex": null, + "widgetType": "text" + }, + "widgetType": "text" + } + } + } + }, + "language_info": { + "name": "python" } - } }, - "language_info": { - "name": "python" - } - }, - "nbformat": 4, - "nbformat_minor": 0 + "nbformat": 4, + "nbformat_minor": 0 } diff --git a/notebooks/2_model_training_and_deployment/model_deployment/3_model_deployment.ipynb b/notebooks/2_model_training_and_deployment/model_deployment/3_model_deployment.ipynb index 351ca22..a392384 100644 --- a/notebooks/2_model_training_and_deployment/model_deployment/3_model_deployment.ipynb +++ b/notebooks/2_model_training_and_deployment/model_deployment/3_model_deployment.ipynb @@ -1,222 +1,226 @@ { - "cells": [ - { - "cell_type": "code", - "execution_count": 0, - "metadata": { - "application/vnd.databricks.v1+cell": { - "cellMetadata": { - "byteLimit": 2048000, - "rowLimit": 10000 - }, - "inputWidgets": {}, - "nuid": "c8b631ab-95ef-4c7d-be0d-7d18ef95a9a5", - "showTitle": false, - "tableResultSettingsMap": {}, - "title": "" - } - }, - "outputs": [], - "source": [ - "%pip install 
mlflow=='3.4.0'\n", - "dbutils.library.restartPython()" - ] - }, - { - "cell_type": "code", - "execution_count": 0, - "metadata": { - "application/vnd.databricks.v1+cell": { - "cellMetadata": { - "byteLimit": 2048000, - "rowLimit": 10000 - }, - "inputWidgets": {}, - "nuid": "508c8944-2593-4a57-96f0-30e2ef1b73a0", - "showTitle": false, - "tableResultSettingsMap": {}, - "title": "" - } - }, - "outputs": [], - "source": [ - "dbutils.widgets.text(\"model_name\", \"pedroz_e2edata_dev.default.iris_model\")\n", - "dbutils.widgets.text(\"model_version\", \"1\")\n", - "\n", - "model_name = dbutils.widgets.get(\"model_name\")\n", - "model_version = dbutils.widgets.get(\"model_version\")" - ] - }, - { - "cell_type": "code", - "execution_count": 0, - "metadata": { - "application/vnd.databricks.v1+cell": { - "cellMetadata": { - "byteLimit": 2048000, - "rowLimit": 10000 - }, - "inputWidgets": {}, - "nuid": "cd61ab38-2e42-49f9-b367-51240416fbee", - "showTitle": false, - "tableResultSettingsMap": {}, - "title": "" - } - }, - "outputs": [], - "source": [ - "from databricks.sdk import WorkspaceClient\n", - "from databricks.sdk.service.serving import (\n", - " ServedEntityInput,\n", - " EndpointCoreConfigInput\n", - ")\n", - "from databricks.sdk.errors import ResourceDoesNotExist\n", - "from mlflow import MlflowClient" - ] - }, - { - "cell_type": "code", - "execution_count": 0, - "metadata": { - "application/vnd.databricks.v1+cell": { - "cellMetadata": { - "byteLimit": 2048000, - "rowLimit": 10000 - }, - "inputWidgets": {}, - "nuid": "9c19d472-4d27-4b3c-adf7-a4a8e3ca0fb2", - "showTitle": false, - "tableResultSettingsMap": {}, - "title": "" - } - }, - "outputs": [], - "source": [ - "# Promote the model version to Champion\n", - "\n", - "client = MlflowClient()\n", - "\n", - "client.set_registered_model_alias(\n", - " f'{model_name}', \n", - " \"Champion\", \n", - " model_version\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": 0, - "metadata": { - 
"application/vnd.databricks.v1+cell": { - "cellMetadata": { - "byteLimit": 2048000, - "rowLimit": 10000 - }, - "inputWidgets": {}, - "nuid": "2120abec-6433-4138-8970-da4a9e8aba7b", - "showTitle": false, - "tableResultSettingsMap": {}, - "title": "" - } - }, - "outputs": [], - "source": [ - "# Create a serving endpoint for the model\n", - "\n", - "# REQUIRED: Enter serving endpoint name\n", - "serving_endpoint_name = model_name.replace('.', '-') + \"-endpoint\"\n", - "\n", - "w = WorkspaceClient() # Assumes DATABRICKS_HOST and DATABRICKS_TOKEN are set\n", - "served_entities=[\n", - " ServedEntityInput(\n", - " entity_name=model_name,\n", - " entity_version=model_version,\n", - " workload_size=\"Small\",\n", - " scale_to_zero_enabled=True\n", - " )\n", - "]\n", - "\n", - "# Update serving endpoint if it already exists, otherwise create the serving endpoint\n", - "try:\n", - " w.serving_endpoints.update_config(name=serving_endpoint_name, served_entities=served_entities)\n", - "except ResourceDoesNotExist:\n", - " w.serving_endpoints.create(name=serving_endpoint_name, config=EndpointCoreConfigInput(served_entities=served_entities))" - ] - } - ], - "metadata": { - "application/vnd.databricks.v1+notebook": { - "computePreferences": null, - "dashboards": [], - "environmentMetadata": { - "base_environment": "", - "environment_version": "2" - }, - "inputWidgetPreferences": null, - "language": "python", - "notebookMetadata": { - "pythonIndentUnit": 4 - }, - "notebookName": "model-deployment", - "widgets": { - "model_name": { - "currentValue": "pedroz_e2edata_dev.default.iris_model", - "nuid": "107eb075-0cd1-4d46-967f-9397401e09dc", - "typedWidgetInfo": { - "autoCreated": false, - "defaultValue": "pedroz_e2edata_dev.default.iris_model", - "label": null, - "name": "model_name", - "options": { - "validationRegex": null, - "widgetDisplayType": "Text" + "cells": [ + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "application/vnd.databricks.v1+cell": { + 
"cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "inputWidgets": {}, + "nuid": "c8b631ab-95ef-4c7d-be0d-7d18ef95a9a5", + "showTitle": false, + "tableResultSettingsMap": {}, + "title": "" + } }, - "parameterDataType": "String" - }, - "widgetInfo": { - "defaultValue": "pedroz_e2edata_dev.default.iris_model", - "label": null, - "name": "model_name", - "options": { - "autoCreated": null, - "validationRegex": null, - "widgetType": "text" + "outputs": [], + "source": [ + "%pip install mlflow=='3.4.0'\n", + "dbutils.library.restartPython()" + ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "inputWidgets": {}, + "nuid": "508c8944-2593-4a57-96f0-30e2ef1b73a0", + "showTitle": false, + "tableResultSettingsMap": {}, + "title": "" + } + }, + "outputs": [], + "source": [ + "dbutils.widgets.text(\"catalog_name\", \"mlops_quickstart_dev\")\n", + "dbutils.widgets.text(\"schema_name\", \"iris\")\n", + "dbutils.widgets.text(\"model_version\", \"1\")\n", + "\n", + "catalog_name = dbutils.widgets.get(\"catalog_name\")\n", + "schema_name = dbutils.widgets.get(\"schema_name\")\n", + "model_version = dbutils.widgets.get(\"model_version\")\n", + "\n", + "model_name = f\"{catalog_name}.{schema_name}.iris_model\"" + ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "inputWidgets": {}, + "nuid": "cd61ab38-2e42-49f9-b367-51240416fbee", + "showTitle": false, + "tableResultSettingsMap": {}, + "title": "" + } + }, + "outputs": [], + "source": [ + "from databricks.sdk import WorkspaceClient\n", + "from databricks.sdk.service.serving import (\n", + " ServedEntityInput,\n", + " EndpointCoreConfigInput\n", + ")\n", + "from databricks.sdk.errors import ResourceDoesNotExist\n", + "from mlflow import 
MlflowClient" + ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "inputWidgets": {}, + "nuid": "9c19d472-4d27-4b3c-adf7-a4a8e3ca0fb2", + "showTitle": false, + "tableResultSettingsMap": {}, + "title": "" + } }, - "widgetType": "text" - } + "outputs": [], + "source": [ + "# Promote the model version to Champion\n", + "\n", + "client = MlflowClient()\n", + "\n", + "client.set_registered_model_alias(\n", + " f'{model_name}', \n", + " \"Champion\", \n", + " model_version\n", + ")" + ] }, - "model_version": { - "currentValue": "1", - "nuid": "a0decf9a-501b-4dc9-916d-cc858a20694b", - "typedWidgetInfo": { - "autoCreated": false, - "defaultValue": "1", - "label": null, - "name": "model_version", - "options": { - "validationRegex": null, - "widgetDisplayType": "Text" + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "inputWidgets": {}, + "nuid": "2120abec-6433-4138-8970-da4a9e8aba7b", + "showTitle": false, + "tableResultSettingsMap": {}, + "title": "" + } + }, + "outputs": [], + "source": [ + "# Create a serving endpoint for the model\n", + "\n", + "# REQUIRED: Enter serving endpoint name\n", + "serving_endpoint_name = model_name.replace('.', '-') + \"-endpoint\"\n", + "\n", + "w = WorkspaceClient() # Assumes DATABRICKS_HOST and DATABRICKS_TOKEN are set\n", + "served_entities=[\n", + " ServedEntityInput(\n", + " entity_name=model_name,\n", + " entity_version=model_version,\n", + " workload_size=\"Small\",\n", + " scale_to_zero_enabled=True\n", + " )\n", + "]\n", + "\n", + "# Update serving endpoint if it already exists, otherwise create the serving endpoint\n", + "try:\n", + " w.serving_endpoints.update_config(name=serving_endpoint_name, served_entities=served_entities)\n", + "except ResourceDoesNotExist:\n", + 
" w.serving_endpoints.create(name=serving_endpoint_name, config=EndpointCoreConfigInput(served_entities=served_entities))" + ] + } + ], + "metadata": { + "application/vnd.databricks.v1+notebook": { + "computePreferences": null, + "dashboards": [], + "environmentMetadata": { + "base_environment": "", + "environment_version": "2" }, - "parameterDataType": "String" - }, - "widgetInfo": { - "defaultValue": "1", - "label": null, - "name": "model_version", - "options": { - "autoCreated": null, - "validationRegex": null, - "widgetType": "text" + "inputWidgetPreferences": null, + "language": "python", + "notebookMetadata": { + "pythonIndentUnit": 4 }, - "widgetType": "text" - } + "notebookName": "model-deployment", + "widgets": { + "model_name": { + "currentValue": "pedroz_e2edata_dev.default.iris_model", + "nuid": "107eb075-0cd1-4d46-967f-9397401e09dc", + "typedWidgetInfo": { + "autoCreated": false, + "defaultValue": "pedroz_e2edata_dev.default.iris_model", + "label": null, + "name": "model_name", + "options": { + "validationRegex": null, + "widgetDisplayType": "Text" + }, + "parameterDataType": "String" + }, + "widgetInfo": { + "defaultValue": "pedroz_e2edata_dev.default.iris_model", + "label": null, + "name": "model_name", + "options": { + "autoCreated": null, + "validationRegex": null, + "widgetType": "text" + }, + "widgetType": "text" + } + }, + "model_version": { + "currentValue": "1", + "nuid": "a0decf9a-501b-4dc9-916d-cc858a20694b", + "typedWidgetInfo": { + "autoCreated": false, + "defaultValue": "1", + "label": null, + "name": "model_version", + "options": { + "validationRegex": null, + "widgetDisplayType": "Text" + }, + "parameterDataType": "String" + }, + "widgetInfo": { + "defaultValue": "1", + "label": null, + "name": "model_version", + "options": { + "autoCreated": null, + "validationRegex": null, + "widgetType": "text" + }, + "widgetType": "text" + } + } + } + }, + "language_info": { + "name": "python" } - } }, - "language_info": { - "name": "python" - } - 
}, - "nbformat": 4, - "nbformat_minor": 0 + "nbformat": 4, + "nbformat_minor": 0 } diff --git a/notebooks/2_model_training_and_deployment/model_training.ipynb b/notebooks/2_model_training_and_deployment/model_training.ipynb index 1e6f1ce..a5d5056 100644 --- a/notebooks/2_model_training_and_deployment/model_training.ipynb +++ b/notebooks/2_model_training_and_deployment/model_training.ipynb @@ -1,350 +1,352 @@ { - "cells": [ - { - "cell_type": "code", - "execution_count": 0, - "metadata": { - "application/vnd.databricks.v1+cell": { - "cellMetadata": { - "byteLimit": 2048000, - "rowLimit": 10000 - }, - "inputWidgets": {}, - "nuid": "45a4ded4-8725-46c1-a0c2-c4ed531e8d6b", - "showTitle": false, - "tableResultSettingsMap": {}, - "title": "" - } - }, - "outputs": [], - "source": [ - "# This notebook is meant to train a classification model from the Iris dataset and save it to the UC" - ] - }, - { - "cell_type": "code", - "execution_count": 0, - "metadata": { - "application/vnd.databricks.v1+cell": { - "cellMetadata": { - "byteLimit": 2048000, - "rowLimit": 10000 - }, - "collapsed": true, - "inputWidgets": {}, - "nuid": "c7beae74-8e38-40e5-a07a-f8b53bd21541", - "showTitle": false, - "tableResultSettingsMap": {}, - "title": "" - } - }, - "outputs": [], - "source": [ - "%pip install mlflow=='3.4.0'\n", - "dbutils.library.restartPython()" - ] - }, - { - "cell_type": "code", - "execution_count": 0, - "metadata": { - "application/vnd.databricks.v1+cell": { - "cellMetadata": { - "byteLimit": 2048000, - "rowLimit": 10000 - }, - "inputWidgets": {}, - "nuid": "0b781b85-5fb9-4af5-8642-f5f3fb3993a7", - "showTitle": false, - "tableResultSettingsMap": {}, - "title": "" - } - }, - "outputs": [], - "source": [ - "import pandas as pd\n", - "from sklearn.model_selection import train_test_split\n", - "from sklearn.tree import DecisionTreeClassifier\n", - "from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score\n", - "import mlflow\n", - "import mlflow.sklearn\n", 
- "from mlflow.models.signature import infer_signature\n", - "from mlflow.tracking.client import MlflowClient\n", - "import requests\n", - "from datetime import datetime" - ] - }, - { - "cell_type": "code", - "execution_count": 0, - "metadata": { - "application/vnd.databricks.v1+cell": { - "cellMetadata": { - "byteLimit": 2048000, - "rowLimit": 10000 - }, - "inputWidgets": {}, - "nuid": "a94d72ec-d6d4-4fd3-9792-cc37870736da", - "showTitle": false, - "tableResultSettingsMap": {}, - "title": "" - } - }, - "outputs": [], - "source": [ - "dbutils.widgets.text(\"catalog_name\", \"pedroz_e2edata_dev\")\n", - "catalog_name = dbutils.widgets.get(\"catalog_name\")" - ] - }, - { - "cell_type": "code", - "execution_count": 0, - "metadata": { - "application/vnd.databricks.v1+cell": { - "cellMetadata": { - "byteLimit": 2048000, - "rowLimit": 10000 - }, - "inputWidgets": {}, - "nuid": "70c8ec82-ceec-4e6f-bd77-bc1ba02016d8", - "showTitle": false, - "tableResultSettingsMap": {}, - "title": "" - } - }, - "outputs": [], - "source": [ - "model_name = 'iris_model'" - ] - }, - { - "cell_type": "code", - "execution_count": 0, - "metadata": { - "application/vnd.databricks.v1+cell": { - "cellMetadata": { - "byteLimit": 2048000, - "rowLimit": 10000 - }, - "inputWidgets": {}, - "nuid": "9c01992c-8f50-4c1a-872d-8840ac2e2d1c", - "showTitle": false, - "tableResultSettingsMap": {}, - "title": "" - } - }, - "outputs": [], - "source": [ - "feature_table_name = f'{catalog_name}.default.iris_data'" - ] - }, - { - "cell_type": "code", - "execution_count": 0, - "metadata": { - "application/vnd.databricks.v1+cell": { - "cellMetadata": { - "byteLimit": 2048000, - "rowLimit": 10000 - }, - "inputWidgets": {}, - "nuid": "f1ea3d22-39fb-48cb-9023-6f5442449f6f", - "showTitle": false, - "tableResultSettingsMap": {}, - "title": "" - } - }, - "outputs": [], - "source": [ - "experiment_name = 
f\"/Users/{dbutils.notebook.entry_point.getDbutils().notebook().getContext().userName().get()}/{model_name}_{catalog_name}\"" - ] - }, - { - "cell_type": "code", - "execution_count": 0, - "metadata": { - "application/vnd.databricks.v1+cell": { - "cellMetadata": { - "byteLimit": 2048000, - "rowLimit": 10000 - }, - "inputWidgets": {}, - "nuid": "f9183695-cffb-40d6-9377-a10a5285b1e2", - "showTitle": false, - "tableResultSettingsMap": {}, - "title": "" - } - }, - "outputs": [], - "source": [ - "import mlflow\n", - "\n", - "# Create an MLFlow experiment\n", - "mlflow.set_experiment(experiment_name)" - ] - }, - { - "cell_type": "code", - "execution_count": 0, - "metadata": { - "application/vnd.databricks.v1+cell": { - "cellMetadata": { - "byteLimit": 2048000, - "rowLimit": 10000 - }, - "inputWidgets": {}, - "nuid": "2ed03c85-3569-48d7-823d-4e73110e36ff", - "showTitle": false, - "tableResultSettingsMap": {}, - "title": "" - } - }, - "outputs": [], - "source": [ - "# If you want to autolog the model, use the following command\n", - "# Note that some of the auto-logging capabilities were set to false because we are logging some metrics \n", - "mlflow.autolog(log_input_examples=False,log_model_signatures=False,log_models=False,log_datasets=False,)\n", - "\n", - "# Start a training run\n", - "with mlflow.start_run() as run:\n", - " # Load data from Unity Catalog table\n", - " df_iris = spark.table(feature_table_name).toPandas()\n", - " features = ['sepal_length_cm', 'sepal_width_cm', 'petal_length_cm', 'petal_width_cm']\n", - " target = 'species'\n", - "\n", - " X = df_iris[features]\n", - " y = df_iris[target]\n", - "\n", - " X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7)\n", - "\n", - " # Train the model\n", - " model = DecisionTreeClassifier()\n", - " model.fit(X_train, y_train)\n", - "\n", - " # Make predictions\n", - " y_pred = model.predict(X_test)\n", - "\n", - " # Calculate and log metrics\n", - " accuracy = accuracy_score(y_test, 
y_pred)\n", - " recall = recall_score(y_test, y_pred, average='macro')\n", - " precision = precision_score(y_test, y_pred, average='macro')\n", - " f1 = f1_score(y_test, y_pred, average='macro')\n", - "\n", - " mlflow.log_metric(\"test_accuracy\", accuracy)\n", - " mlflow.log_metric(\"test_recall\", recall)\n", - " mlflow.log_metric(\"test_precision\", precision)\n", - " mlflow.log_metric(\"test_f1\", f1)\n", - "\n", - " # Infer model signature\n", - " signature = infer_signature(X_train, y_train)\n", - "\n", - " # Log the model\n", - " mlflow.sklearn.log_model(\n", - " sk_model=model,\n", - " name='model',\n", - " signature=signature,\n", - " input_example=X_train.head()\n", - " )\n", - "\n", - " # Log input dataset for lineage\n", - " data_source = mlflow.data.load_delta(table_name=feature_table_name)\n", - " mlflow.log_input(data_source, context=\"training\")" - ] - }, - { - "cell_type": "code", - "execution_count": 0, - "metadata": { - "application/vnd.databricks.v1+cell": { - "cellMetadata": { - "byteLimit": 2048000, - "rowLimit": 10000 - }, - "inputWidgets": {}, - "nuid": "0e85f8e0-b6a0-49da-863c-f090b655a311", - "showTitle": false, - "tableResultSettingsMap": {}, - "title": "" - } - }, - "outputs": [], - "source": [ - "# Out of all runs in the experiment, only register the run with the best selected metric\n", - "# Important note: this logic is optional and totally depends on your processes, so feel free to customize it!\n", - "# If you want, you can simply register the latest run instead, for example. 
\n", - "\n", - "selected_metric = 'test_accuracy'\n", - "client = mlflow.tracking.MlflowClient()\n", - "experiment = client.get_experiment_by_name(experiment_name)\n", - "\n", - "runs = client.search_runs(experiment_ids=[experiment.experiment_id], order_by=[f\"metrics.{selected_metric} DESC\"], max_results=1)\n", - "best_run_id = runs[0].info.run_id\n", - "\n", - "model_uri = f\"runs:/{best_run_id}/model\"\n", - "registered_model = mlflow.register_model(model_uri, f\"{catalog_name}.default.{model_name}\")\n", - "client.set_registered_model_alias(name=registered_model.name, alias=\"challenger\", version=registered_model.version)" - ] - } - ], - "metadata": { - "application/vnd.databricks.v1+notebook": { - "computePreferences": { - "hardware": { - "accelerator": null, - "gpuPoolId": null, - "memory": null - } - }, - "dashboards": [], - "environmentMetadata": { - "base_environment": "", - "environment_version": "2" - }, - "inputWidgetPreferences": null, - "language": "python", - "notebookMetadata": { - "mostRecentlyExecutedCommandWithImplicitDF": { - "commandId": 5753565242613738, - "dataframes": [ - "_sqldf" - ] + "cells": [ + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "inputWidgets": {}, + "nuid": "45a4ded4-8725-46c1-a0c2-c4ed531e8d6b", + "showTitle": false, + "tableResultSettingsMap": {}, + "title": "" + } + }, + "outputs": [], + "source": [ + "# This notebook is meant to train a classification model from the Iris dataset and save it to the UC" + ] }, - "pythonIndentUnit": 4 - }, - "notebookName": "model-training", - "widgets": { - "catalog_name": { - "currentValue": "pedroz_e2edata_dev", - "nuid": "c3deff61-8418-4ca2-ae02-acb938772fc8", - "typedWidgetInfo": { - "autoCreated": false, - "defaultValue": "pedroz_e2edata_dev", - "label": null, - "name": "catalog_name", - "options": { - "validationRegex": null, - "widgetDisplayType": 
"Text" + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "collapsed": true, + "inputWidgets": {}, + "nuid": "c7beae74-8e38-40e5-a07a-f8b53bd21541", + "showTitle": false, + "tableResultSettingsMap": {}, + "title": "" + } }, - "parameterDataType": "String" - }, - "widgetInfo": { - "defaultValue": "pedroz_e2edata_dev", - "label": null, - "name": "catalog_name", - "options": { - "autoCreated": null, - "validationRegex": null, - "widgetType": "text" + "outputs": [], + "source": [ + "%pip install mlflow=='3.4.0'\n", + "dbutils.library.restartPython()" + ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "inputWidgets": {}, + "nuid": "0b781b85-5fb9-4af5-8642-f5f3fb3993a7", + "showTitle": false, + "tableResultSettingsMap": {}, + "title": "" + } + }, + "outputs": [], + "source": [ + "import pandas as pd\n", + "from sklearn.model_selection import train_test_split\n", + "from sklearn.tree import DecisionTreeClassifier\n", + "from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score\n", + "import mlflow\n", + "import mlflow.sklearn\n", + "from mlflow.models.signature import infer_signature\n", + "from mlflow.tracking.client import MlflowClient\n", + "import requests\n", + "from datetime import datetime" + ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "inputWidgets": {}, + "nuid": "a94d72ec-d6d4-4fd3-9792-cc37870736da", + "showTitle": false, + "tableResultSettingsMap": {}, + "title": "" + } + }, + "outputs": [], + "source": [ + "dbutils.widgets.text(\"catalog_name\", \"mlops_quickstart_dev\")\n", + "dbutils.widgets.text(\"schema_name\", \"iris\")\n", + 
"catalog_name = dbutils.widgets.get(\"catalog_name\")\n", + "schema_name = dbutils.widgets.get(\"schema_name\")" + ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "inputWidgets": {}, + "nuid": "70c8ec82-ceec-4e6f-bd77-bc1ba02016d8", + "showTitle": false, + "tableResultSettingsMap": {}, + "title": "" + } + }, + "outputs": [], + "source": [ + "model_name = 'iris_model'" + ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "inputWidgets": {}, + "nuid": "9c01992c-8f50-4c1a-872d-8840ac2e2d1c", + "showTitle": false, + "tableResultSettingsMap": {}, + "title": "" + } + }, + "outputs": [], + "source": [ + "feature_table_name = f'{catalog_name}.{schema_name}.iris_data'" + ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "inputWidgets": {}, + "nuid": "f1ea3d22-39fb-48cb-9023-6f5442449f6f", + "showTitle": false, + "tableResultSettingsMap": {}, + "title": "" + } + }, + "outputs": [], + "source": [ + "experiment_name = f\"/Users/{dbutils.notebook.entry_point.getDbutils().notebook().getContext().userName().get()}/{model_name}_{catalog_name}\"" + ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "inputWidgets": {}, + "nuid": "f9183695-cffb-40d6-9377-a10a5285b1e2", + "showTitle": false, + "tableResultSettingsMap": {}, + "title": "" + } + }, + "outputs": [], + "source": [ + "import mlflow\n", + "\n", + "# Create an MLFlow experiment\n", + "mlflow.set_experiment(experiment_name)" + ] + }, + { + "cell_type": "code", + "execution_count": 0, + 
"metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "inputWidgets": {}, + "nuid": "2ed03c85-3569-48d7-823d-4e73110e36ff", + "showTitle": false, + "tableResultSettingsMap": {}, + "title": "" + } + }, + "outputs": [], + "source": [ + "# If you want to autolog the model, use the following command\n", + "# Note that some of the auto-logging capabilities were set to false because we are logging some metrics \n", + "mlflow.autolog(log_input_examples=False,log_model_signatures=False,log_models=False,log_datasets=False,)\n", + "\n", + "# Start a training run\n", + "with mlflow.start_run() as run:\n", + " # Load data from Unity Catalog table\n", + " df_iris = spark.table(feature_table_name).toPandas()\n", + " features = ['sepal_length_cm', 'sepal_width_cm', 'petal_length_cm', 'petal_width_cm']\n", + " target = 'species'\n", + "\n", + " X = df_iris[features]\n", + " y = df_iris[target]\n", + "\n", + " X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7)\n", + "\n", + " # Train the model\n", + " model = DecisionTreeClassifier()\n", + " model.fit(X_train, y_train)\n", + "\n", + " # Make predictions\n", + " y_pred = model.predict(X_test)\n", + "\n", + " # Calculate and log metrics\n", + " accuracy = accuracy_score(y_test, y_pred)\n", + " recall = recall_score(y_test, y_pred, average='macro')\n", + " precision = precision_score(y_test, y_pred, average='macro')\n", + " f1 = f1_score(y_test, y_pred, average='macro')\n", + "\n", + " mlflow.log_metric(\"test_accuracy\", accuracy)\n", + " mlflow.log_metric(\"test_recall\", recall)\n", + " mlflow.log_metric(\"test_precision\", precision)\n", + " mlflow.log_metric(\"test_f1\", f1)\n", + "\n", + " # Infer model signature\n", + " signature = infer_signature(X_train, y_train)\n", + "\n", + " # Log the model\n", + " mlflow.sklearn.log_model(\n", + " sk_model=model,\n", + " name='model',\n", + " signature=signature,\n", + " 
input_example=X_train.head()\n", + " )\n", + "\n", + " # Log input dataset for lineage\n", + " data_source = mlflow.data.load_delta(table_name=feature_table_name)\n", + " mlflow.log_input(data_source, context=\"training\")" + ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "inputWidgets": {}, + "nuid": "0e85f8e0-b6a0-49da-863c-f090b655a311", + "showTitle": false, + "tableResultSettingsMap": {}, + "title": "" + } }, - "widgetType": "text" - } + "outputs": [], + "source": [ + "# Out of all runs in the experiment, only register the run with the best selected metric\n", + "# Important note: this logic is optional and totally depends on your processes, so feel free to customize it!\n", + "# If you want, you can simply register the latest run instead, for example. \n", + "\n", + "selected_metric = 'test_accuracy'\n", + "client = mlflow.tracking.MlflowClient()\n", + "experiment = client.get_experiment_by_name(experiment_name)\n", + "\n", + "runs = client.search_runs(experiment_ids=[experiment.experiment_id], order_by=[f\"metrics.{selected_metric} DESC\"], max_results=1)\n", + "best_run_id = runs[0].info.run_id\n", + "\n", + "model_uri = f\"runs:/{best_run_id}/model\"\n", + "registered_model = mlflow.register_model(model_uri, f\"{catalog_name}.{schema_name}.{model_name}\")\n", + "client.set_registered_model_alias(name=registered_model.name, alias=\"challenger\", version=registered_model.version)" + ] + } + ], + "metadata": { + "application/vnd.databricks.v1+notebook": { + "computePreferences": { + "hardware": { + "accelerator": null, + "gpuPoolId": null, + "memory": null + } + }, + "dashboards": [], + "environmentMetadata": { + "base_environment": "", + "environment_version": "2" + }, + "inputWidgetPreferences": null, + "language": "python", + "notebookMetadata": { + "mostRecentlyExecutedCommandWithImplicitDF": { + "commandId": 
5753565242613738, + "dataframes": [ + "_sqldf" + ] + }, + "pythonIndentUnit": 4 + }, + "notebookName": "model-training", + "widgets": { + "catalog_name": { + "currentValue": "pedroz_e2edata_dev", + "nuid": "c3deff61-8418-4ca2-ae02-acb938772fc8", + "typedWidgetInfo": { + "autoCreated": false, + "defaultValue": "pedroz_e2edata_dev", + "label": null, + "name": "catalog_name", + "options": { + "validationRegex": null, + "widgetDisplayType": "Text" + }, + "parameterDataType": "String" + }, + "widgetInfo": { + "defaultValue": "pedroz_e2edata_dev", + "label": null, + "name": "catalog_name", + "options": { + "autoCreated": null, + "validationRegex": null, + "widgetType": "text" + }, + "widgetType": "text" + } + } + } + }, + "language_info": { + "name": "python" } - } }, - "language_info": { - "name": "python" - } - }, - "nbformat": 4, - "nbformat_minor": 0 + "nbformat": 4, + "nbformat_minor": 0 } diff --git a/notebooks/3_inference/batch_inference.ipynb b/notebooks/3_inference/batch_inference.ipynb index f348524..1ad5d4c 100644 --- a/notebooks/3_inference/batch_inference.ipynb +++ b/notebooks/3_inference/batch_inference.ipynb @@ -1,969 +1,415 @@ { - "cells": [ - { - "cell_type": "code", - "execution_count": 0, - "metadata": { - "application/vnd.databricks.v1+cell": { - "cellMetadata": { - "byteLimit": 2048000, - "rowLimit": 10000 - }, - "inputWidgets": {}, - "nuid": "eecd0c39-51b6-4846-a778-2705405d3ba2", - "showTitle": false, - "tableResultSettingsMap": {}, - "title": "" - } - }, - "outputs": [], - "source": [ - "# This notebook is meant to run batch inference on the top of new iris samples" - ] - }, - { - "cell_type": "code", - "execution_count": 0, - "metadata": { - "application/vnd.databricks.v1+cell": { - "cellMetadata": { - "byteLimit": 2048000, - "rowLimit": 10000 - }, - "collapsed": true, - "inputWidgets": {}, - "nuid": "c5ba3b37-1f41-465f-b63a-f21c7b310d5a", - "showTitle": false, - "tableResultSettingsMap": {}, - "title": "" - } - }, - "outputs": [], - "source": [ 
- "%pip install mlflow=='3.4.0'\n", - "dbutils.library.restartPython()" - ] - }, - { - "cell_type": "code", - "execution_count": 0, - "metadata": { - "application/vnd.databricks.v1+cell": { - "cellMetadata": { - "byteLimit": 2048000, - "rowLimit": 10000 - }, - "inputWidgets": {}, - "nuid": "34837c3f-e57b-4f68-8820-9676cc8687f5", - "showTitle": false, - "tableResultSettingsMap": {}, - "title": "" - } - }, - "outputs": [], - "source": [ - "from sklearn import datasets\n", - "from mlflow.pyfunc import load_model\n", - "import pandas as pd\n", - "import mlflow\n", - "from datetime import datetime" - ] - }, - { - "cell_type": "code", - "execution_count": 0, - "metadata": { - "application/vnd.databricks.v1+cell": { - "cellMetadata": { - "byteLimit": 2048000, - "rowLimit": 10000 - }, - "inputWidgets": {}, - "nuid": "dea4f346-38f2-4ac8-8ad5-12dbea72581e", - "showTitle": false, - "tableResultSettingsMap": {}, - "title": "" - } - }, - "outputs": [], - "source": [ - "dbutils.widgets.text(\"catalog_name\", \"pedroz_e2edata_dev\")\n", - "catalog_name = dbutils.widgets.get(\"catalog_name\")" - ] - }, - { - "cell_type": "code", - "execution_count": 0, - "metadata": { - "application/vnd.databricks.v1+cell": { - "cellMetadata": { - "byteLimit": 2048000, - "rowLimit": 10000 - }, - "inputWidgets": {}, - "nuid": "1f14ab9f-3de3-42ff-9acd-28ad69f99a0f", - "showTitle": false, - "tableResultSettingsMap": {}, - "title": "" - } - }, - "outputs": [], - "source": [ - "model_name = f'{catalog_name}.default.iris_model'" - ] - }, - { - "cell_type": "code", - "execution_count": 0, - "metadata": { - "application/vnd.databricks.v1+cell": { - "cellMetadata": { - "byteLimit": 2048000, - "rowLimit": 10000 - }, - "inputWidgets": {}, - "nuid": "36c73312-834e-461b-840b-4f6733d88317", - "showTitle": false, - "tableResultSettingsMap": {}, - "title": "" - } - }, - "outputs": [ + "cells": [ { - "name": "stderr", - "output_type": "stream", - "text": [ - 
"/home/spark-95fe552f-1bd8-4fee-8597-dd/.ipykernel/7148/command-8412231637893746-4118003963:4: FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.\n", - " df_samples.columns = df_samples.columns.str.replace(' ', '_').str.replace('(', '').str.replace(')', '')\n", - "/home/spark-95fe552f-1bd8-4fee-8597-dd/.ipykernel/7148/command-8412231637893746-4118003963:4: FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.\n", - " df_samples.columns = df_samples.columns.str.replace(' ', '_').str.replace('(', '').str.replace(')', '')\n" - ] + "cell_type": "code", + "execution_count": 0, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "inputWidgets": {}, + "nuid": "eecd0c39-51b6-4846-a778-2705405d3ba2", + "showTitle": false, + "tableResultSettingsMap": {}, + "title": "" + } + }, + "outputs": [], + "source": [ + "# This notebook is meant to run batch inference on the top of new iris samples" + ] }, { - "data": { - "text/html": [ - "
[stripped HTML rendering of the 5-row df_samples preview; the same data appears in the text/plain output that follows]
" - ], - "text/plain": [ - " sepal_length_cm sepal_width_cm petal_length_cm petal_width_cm\n", - "0 5.1 3.5 1.4 0.2\n", - "1 4.9 3.0 1.4 0.2\n", - "2 4.7 3.2 1.3 0.2\n", - "3 4.6 3.1 1.5 0.2\n", - "4 5.0 3.6 1.4 0.2" + "cell_type": "code", + "execution_count": 0, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "collapsed": true, + "inputWidgets": {}, + "nuid": "c5ba3b37-1f41-465f-b63a-f21c7b310d5a", + "showTitle": false, + "tableResultSettingsMap": {}, + "title": "" + } + }, + "outputs": [], + "source": [ + "%pip install mlflow=='3.4.0'\n", + "dbutils.library.restartPython()" ] - }, - "execution_count": 4, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Pull the dataset for running the inference\n", - "iris_samples = datasets.load_iris(as_frame=True)\n", - "df_samples = pd.DataFrame(data = iris_samples['data'], columns = iris_samples['feature_names'])\n", - "df_samples.columns = df_samples.columns.str.replace(' ', '_').str.replace('(', '').str.replace(')', '')\n", - "df_samples.head()" - ] - }, - { - "cell_type": "code", - "execution_count": 0, - "metadata": { - "application/vnd.databricks.v1+cell": { - "cellMetadata": { - "byteLimit": 2048000, - "rowLimit": 10000 - }, - "inputWidgets": {}, - "nuid": "470892ae-0135-4e8e-a1ea-decd40f883d3", - "showTitle": false, - "tableResultSettingsMap": {}, - "title": "" - } - }, - "outputs": [], - "source": [ - "model_uri = f\"models:/{model_name}@champion\"\n", - "model = load_model(model_uri)" - ] - }, - { - "cell_type": "code", - "execution_count": 0, - "metadata": { - "application/vnd.databricks.v1+cell": { - "cellMetadata": { - "byteLimit": 2048000, - "rowLimit": 10000 - }, - "inputWidgets": {}, - "nuid": "cda93c22-763e-48fd-9587-23662f51bf25", - "showTitle": false, - "tableResultSettingsMap": {}, - "title": "" - } - }, - "outputs": [ + }, { - "data": { - "text/html": [ - "
[stripped HTML rendering of the df_samples preview with the added prediction column; the same data appears in the text/plain output that follows]
" - ], - "text/plain": [ - " sepal_length_cm sepal_width_cm petal_length_cm petal_width_cm prediction\n", - "0 5.1 3.5 1.4 0.2 0\n", - "1 4.9 3.0 1.4 0.2 0\n", - "2 4.7 3.2 1.3 0.2 0\n", - "3 4.6 3.1 1.5 0.2 0\n", - "4 5.0 3.6 1.4 0.2 0" + "cell_type": "code", + "execution_count": 0, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "inputWidgets": {}, + "nuid": "34837c3f-e57b-4f68-8820-9676cc8687f5", + "showTitle": false, + "tableResultSettingsMap": {}, + "title": "" + } + }, + "outputs": [], + "source": [ + "from sklearn import datasets\n", + "from mlflow.pyfunc import load_model\n", + "import pandas as pd\n", + "import mlflow\n", + "from datetime import datetime" ] - }, - "execution_count": 6, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "predictions = model.predict(df_samples)\n", - "df_samples['prediction'] = predictions\n", - "\n", - "df_samples.head()" - ] - }, - { - "cell_type": "code", - "execution_count": 0, - "metadata": { - "application/vnd.databricks.v1+cell": { - "cellMetadata": { - "byteLimit": 2048000, - "rowLimit": 10000 - }, - "inputWidgets": {}, - "nuid": "8c157a55-1d6b-4fee-811a-8704cc487a46", - "showTitle": false, - "tableResultSettingsMap": {}, - "title": "" - } - }, - "outputs": [ + }, { - "data": { - "text/html": [ - "
[stripped HTML rendering of the df_samples preview with the prediction and actual_label columns; the same data appears in the text/plain output that follows]
" - ], - "text/plain": [ - " sepal_length_cm sepal_width_cm ... prediction actual_label\n", - "0 5.1 3.5 ... 0 0\n", - "1 4.9 3.0 ... 0 0\n", - "2 4.7 3.2 ... 0 0\n", - "3 4.6 3.1 ... 0 0\n", - "4 5.0 3.6 ... 0 0\n", - "\n", - "[5 rows x 6 columns]" + "cell_type": "code", + "execution_count": 0, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "inputWidgets": {}, + "nuid": "dea4f346-38f2-4ac8-8ad5-12dbea72581e", + "showTitle": false, + "tableResultSettingsMap": {}, + "title": "" + } + }, + "outputs": [], + "source": [ + "dbutils.widgets.text(\"catalog_name\", \"mlops_quickstart_dev\")\n", + "dbutils.widgets.text(\"schema_name\", \"iris\")\n", + "catalog_name = dbutils.widgets.get(\"catalog_name\")\n", + "schema_name = dbutils.widgets.get(\"schema_name\")" ] - }, - "execution_count": 7, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df_samples['actual_label'] = iris_samples['target']\n", - "df_samples.head()" - ] - }, - { - "cell_type": "code", - "execution_count": 0, - "metadata": { - "application/vnd.databricks.v1+cell": { - "cellMetadata": { - "byteLimit": 2048000, - "rowLimit": 10000 - }, - "inputWidgets": {}, - "nuid": "6a6fb952-66ab-4dc3-b843-9d9ba4ead94d", - "showTitle": false, - "tableResultSettingsMap": {}, - "title": "" - } - }, - "outputs": [], - "source": [ - "# Adding the model_id and prediction_timestamp columns to the dataframe - \n", - "# these are required if, in the future, you want to use Lakehouse Monitoring to track the performance of the model\n", - "mlflow_client = mlflow.tracking.MlflowClient()\n", - "model_version = mlflow_client.get_model_version_by_alias(model_name, \"champion\").version\n", - "\n", - "df_samples['prediction_timestamp'] = datetime.now().strftime('%Y-%m-%d %H:%M:%S')\n", - "df_samples['model_id'] = model_version\n", - "\n", - "display(df_samples)" - ] - }, - { - "cell_type": "code", - "execution_count": 0, - 
"metadata": { - "application/vnd.databricks.v1+cell": { - "cellMetadata": { - "byteLimit": 2048000, - "rowLimit": 10000 - }, - "inputWidgets": {}, - "nuid": "87940370-a537-4860-bcc3-698374a09938", - "showTitle": false, - "tableResultSettingsMap": {}, - "title": "" - } - }, - "outputs": [], - "source": [ - "df_spark = spark.createDataFrame(df_samples)" - ] - }, - { - "cell_type": "code", - "execution_count": 0, - "metadata": { - "application/vnd.databricks.v1+cell": { - "cellMetadata": { - "byteLimit": 2048000, - "rowLimit": 10000 - }, - "inputWidgets": {}, - "nuid": "3bc4d933-5319-4adc-93d6-61c177070df0", - "showTitle": false, - "tableResultSettingsMap": {}, - "title": "" - } - }, - "outputs": [ + }, { - "data": { - "text/html": [ - "
[stripped HTML rendering of the iris_data table preview: columns sepal_length_cm, sepal_width_cm, petal_length_cm, petal_width_cm, species, id; the row values appear in the data arrays that follow]
" + "cell_type": "code", + "execution_count": 0, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "inputWidgets": {}, + "nuid": "1f14ab9f-3de3-42ff-9acd-28ad69f99a0f", + "showTitle": false, + "tableResultSettingsMap": {}, + "title": "" + } + }, + "outputs": [], + "source": [ + "model_name = f'{catalog_name}.{schema_name}.iris_model'" ] - }, - "metadata": { - "application/vnd.databricks.v1+output": { - "addedWidgets": {}, - "aggData": [], - "aggError": "", - "aggOverflow": false, - "aggSchema": [], - "aggSeriesLimitReached": false, - "aggType": "", - "arguments": {}, - "columnCustomDisplayInfos": {}, - "data": [ - [ - 5.1, - 3.5, - 1.4, - 0.2, - 0, - 1 - ], - [ - 4.9, - 3, - 1.4, - 0.2, - 0, - 2 - ], - [ - 4.7, - 3.2, - 1.3, - 0.2, - 0, - 3 - ], - [ - 4.6, - 3.1, - 1.5, - 0.2, - 0, - 4 - ], - [ - 5, - 3.6, - 1.4, - 0.2, - 0, - 5 - ] - ], - "datasetInfos": [], - "dbfsResultPath": null, - "isJsonSchema": true, - "metadata": {}, - "overflow": false, - "plotOptions": { - "customPlotOptions": {}, - "displayType": "table", - "pivotAggregation": null, - "pivotColumns": null, - "xColumns": null, - "yColumns": null - }, - "removedWidgets": [], - "schema": [ - { - "metadata": "{}", - "name": "sepal_length_cm", - "type": "\"double\"" - }, - { - "metadata": "{}", - "name": "sepal_width_cm", - "type": "\"double\"" - }, - { - "metadata": "{}", - "name": "petal_length_cm", - "type": "\"double\"" - }, - { - "metadata": "{}", - "name": "petal_width_cm", - "type": "\"double\"" - }, - { - "metadata": "{}", - "name": "species", - "type": "\"long\"" - }, - { - "metadata": "{}", - "name": "id", - "type": "\"long\"" + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "inputWidgets": {}, + "nuid": "36c73312-834e-461b-840b-4f6733d88317", + "showTitle": false, + 
"tableResultSettingsMap": {}, + "title": "" } - ], - "type": "table" - } - }, - "output_type": "display_data" - } - ], - "source": [ - "try:\n", - " display(spark.table(f\"{catalog_name}.default.iris_data\").limit(5))\n", - " table_exists = True\n", - "except:\n", - " table_exists = False" - ] - }, - { - "cell_type": "code", - "execution_count": 0, - "metadata": { - "application/vnd.databricks.v1+cell": { - "cellMetadata": { - "byteLimit": 2048000, - "rowLimit": 10000 - }, - "inputWidgets": {}, - "nuid": "ebf03859-daa6-409e-a0f8-19c65d4492d2", - "showTitle": false, - "tableResultSettingsMap": {}, - "title": "" - } - }, - "outputs": [], - "source": [ - "if table_exists: # append\n", - " df_spark.write.mode(\"append\").saveAsTable(f\"{catalog_name}.default.iris_inferences\")\n", - "else: # create table from scratch\n", - " df_spark.write.mode(\"overwrite\").saveAsTable(f\"{catalog_name}.default.iris_inferences\")" - ] - }, - { - "cell_type": "code", - "execution_count": 0, - "metadata": { - "application/vnd.databricks.v1+cell": { - "cellMetadata": { - "byteLimit": 2048000, - "rowLimit": 10000 - }, - "inputWidgets": {}, - "nuid": "a1ef9455-6e30-4e46-abb2-780440abe526", - "showTitle": false, - "tableResultSettingsMap": {}, - "title": "" - } - }, - "outputs": [ + }, + "outputs": [], + "source": [ + "# Pull the dataset for running the inference\n", + "iris_samples = datasets.load_iris(as_frame=True)\n", + "df_samples = pd.DataFrame(data = iris_samples['data'], columns = iris_samples['feature_names'])\n", + "df_samples.columns = df_samples.columns.str.replace(' ', '_').str.replace('(', '').str.replace(')', '')\n", + "df_samples.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "inputWidgets": {}, + "nuid": "470892ae-0135-4e8e-a1ea-decd40f883d3", + "showTitle": false, + "tableResultSettingsMap": {}, + "title": "" + } + }, + 
"outputs": [], + "source": [ + "model_uri = f\"models:/{model_name}@champion\"\n", + "model = load_model(model_uri)" + ] + }, { - "data": { - "text/html": [ - "
[stripped HTML rendering of the iris_inferences table preview: columns sepal_length_cm, sepal_width_cm, petal_length_cm, petal_width_cm, prediction, actual_label, prediction_timestamp, model_id; the row values appear in the data arrays that follow]
" + "cell_type": "code", + "execution_count": 0, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "inputWidgets": {}, + "nuid": "cda93c22-763e-48fd-9587-23662f51bf25", + "showTitle": false, + "tableResultSettingsMap": {}, + "title": "" + } + }, + "outputs": [], + "source": [ + "predictions = model.predict(df_samples)\n", + "df_samples['prediction'] = predictions\n", + "\n", + "df_samples.head()" ] - }, - "metadata": { - "application/vnd.databricks.v1+output": { - "addedWidgets": {}, - "aggData": [], - "aggError": "", - "aggOverflow": false, - "aggSchema": [], - "aggSeriesLimitReached": false, - "aggType": "", - "arguments": {}, - "columnCustomDisplayInfos": {}, - "data": [ - [ - 5.1, - 3.5, - 1.4, - 0.2, - 0, - 0, - "2025-08-11 17:47:22", - "21" - ], - [ - 4.9, - 3, - 1.4, - 0.2, - 0, - 0, - "2025-08-11 17:47:22", - "21" - ], - [ - 4.7, - 3.2, - 1.3, - 0.2, - 0, - 0, - "2025-08-11 17:47:22", - "21" - ], - [ - 4.6, - 3.1, - 1.5, - 0.2, - 0, - 0, - "2025-08-11 17:47:22", - "21" - ], - [ - 5, - 3.6, - 1.4, - 0.2, - 0, - 0, - "2025-08-11 17:47:22", - "21" - ] - ], - "datasetInfos": [], - "dbfsResultPath": null, - "isJsonSchema": true, - "metadata": {}, - "overflow": false, - "plotOptions": { - "customPlotOptions": {}, - "displayType": "table", - "pivotAggregation": null, - "pivotColumns": null, - "xColumns": null, - "yColumns": null - }, - "removedWidgets": [], - "schema": [ - { - "metadata": "{}", - "name": "sepal_length_cm", - "type": "\"double\"" - }, - { - "metadata": "{}", - "name": "sepal_width_cm", - "type": "\"double\"" - }, - { - "metadata": "{}", - "name": "petal_length_cm", - "type": "\"double\"" - }, - { - "metadata": "{}", - "name": "petal_width_cm", - "type": "\"double\"" - }, - { - "metadata": "{}", - "name": "prediction", - "type": "\"long\"" - }, - { - "metadata": "{}", - "name": "actual_label", - "type": "\"long\"" - }, - { - "metadata": "{}", - "name": 
"prediction_timestamp", - "type": "\"string\"" - }, - { - "metadata": "{}", - "name": "model_id", - "type": "\"string\"" + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "inputWidgets": {}, + "nuid": "8c157a55-1d6b-4fee-811a-8704cc487a46", + "showTitle": false, + "tableResultSettingsMap": {}, + "title": "" } - ], - "type": "table" - } - }, - "output_type": "display_data" - } - ], - "source": [ - "display(spark.sql(f\"SELECT * FROM {catalog_name}.default.iris_inferences LIMIT 5\"))" - ] - }, - { - "cell_type": "code", - "execution_count": 0, - "metadata": { - "application/vnd.databricks.v1+cell": { - "cellMetadata": { - "byteLimit": 2048000, - "rowLimit": 10000 - }, - "inputWidgets": {}, - "nuid": "69d7f892-4b87-4bba-9699-49f455bc128f", - "showTitle": false, - "tableResultSettingsMap": {}, - "title": "" - } - }, - "outputs": [ + }, + "outputs": [], + "source": [ + "df_samples['actual_label'] = iris_samples['target']\n", + "df_samples.head()" + ] + }, { - "data": { - "text/plain": [ - "DataFrame[]" + "cell_type": "code", + "execution_count": 0, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "inputWidgets": {}, + "nuid": "6a6fb952-66ab-4dc3-b843-9d9ba4ead94d", + "showTitle": false, + "tableResultSettingsMap": {}, + "title": "" + } + }, + "outputs": [], + "source": [ + "# Adding the model_id and prediction_timestamp columns to the dataframe - \n", + "# these are required if, in the future, you want to use Lakehouse Monitoring to track the performance of the model\n", + "mlflow_client = mlflow.tracking.MlflowClient()\n", + "model_version = mlflow_client.get_model_version_by_alias(model_name, \"champion\").version\n", + "\n", + "df_samples['prediction_timestamp'] = datetime.now().strftime('%Y-%m-%d %H:%M:%S')\n", + "df_samples['model_id'] = 
model_version\n", + "\n", + "display(df_samples)" ] - }, - "execution_count": 13, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Enabling the Change Data Feed is a recommended practice for Inference Monitoring using Lakehouse Monitoring\n", - "# When CDF is enabled, only newly appended data is processed. \n", - "spark.sql(f\"ALTER TABLE {catalog_name}.default.iris_inferences SET TBLPROPERTIES (delta.enableChangeDataFeed = true)\")" - ] - } - ], - "metadata": { - "application/vnd.databricks.v1+notebook": { - "computePreferences": { - "hardware": { - "accelerator": null, - "gpuPoolId": null, - "memory": null - } - }, - "dashboards": [], - "environmentMetadata": { - "base_environment": "", - "environment_version": "2" - }, - "inputWidgetPreferences": null, - "language": "python", - "notebookMetadata": { - "mostRecentlyExecutedCommandWithImplicitDF": { - "commandId": 7372515407401739, - "dataframes": [ - "_sqldf" - ] }, - "pythonIndentUnit": 4 - }, - "notebookName": "batch-inference", - "widgets": { - "catalog_name": { - "currentValue": "pedroz_e2edata_dev", - "nuid": "6ef15a2c-f408-4625-aa2e-4b36d2e9efcb", - "typedWidgetInfo": { - "autoCreated": false, - "defaultValue": "pedroz_e2edata_dev", - "label": null, - "name": "catalog_name", - "options": { - "validationRegex": null, - "widgetDisplayType": "Text" + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "inputWidgets": {}, + "nuid": "87940370-a537-4860-bcc3-698374a09938", + "showTitle": false, + "tableResultSettingsMap": {}, + "title": "" + } }, - "parameterDataType": "String" - }, - "widgetInfo": { - "defaultValue": "pedroz_e2edata_dev", - "label": null, - "name": "catalog_name", - "options": { - "autoCreated": null, - "validationRegex": null, - "widgetType": "text" + "outputs": [], + "source": [ + "df_spark = spark.createDataFrame(df_samples)" + ] + }, + { 
+ "cell_type": "code", + "execution_count": 0, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "inputWidgets": {}, + "nuid": "3bc4d933-5319-4adc-93d6-61c177070df0", + "showTitle": false, + "tableResultSettingsMap": {}, + "title": "" + } + }, + "outputs": [], + "source": [ + "try:\n", + " display(spark.table(f\"{catalog_name}.{schema_name}.iris_data\").limit(5))\n", + " table_exists = True\n", + "except:\n", + " table_exists = False" + ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "inputWidgets": {}, + "nuid": "ebf03859-daa6-409e-a0f8-19c65d4492d2", + "showTitle": false, + "tableResultSettingsMap": {}, + "title": "" + } + }, + "outputs": [], + "source": [ + "if table_exists: # append\n", + " df_spark.write.mode(\"append\").saveAsTable(f\"{catalog_name}.{schema_name}.iris_inferences\")\n", + "else: # create table from scratch\n", + " df_spark.write.mode(\"overwrite\").saveAsTable(f\"{catalog_name}.{schema_name}.iris_inferences\")" + ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "inputWidgets": {}, + "nuid": "a1ef9455-6e30-4e46-abb2-780440abe526", + "showTitle": false, + "tableResultSettingsMap": {}, + "title": "" + } + }, + "outputs": [], + "source": [ + "display(spark.sql(f\"SELECT * FROM {catalog_name}.{schema_name}.iris_inferences LIMIT 5\"))" + ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "inputWidgets": {}, + "nuid": "69d7f892-4b87-4bba-9699-49f455bc128f", + "showTitle": false, + "tableResultSettingsMap": {}, + "title": "" + } + }, + "outputs": [], + "source": [ + "# 
Enabling the Change Data Feed is a recommended practice for Inference Monitoring using Lakehouse Monitoring\n", + "# When CDF is enabled, only newly appended data is processed. \n", + "spark.sql(f\"ALTER TABLE {catalog_name}.{schema_name}.iris_inferences SET TBLPROPERTIES (delta.enableChangeDataFeed = true)\")" + ] + } + ], + "metadata": { + "application/vnd.databricks.v1+notebook": { + "computePreferences": { + "hardware": { + "accelerator": null, + "gpuPoolId": null, + "memory": null + } + }, + "dashboards": [], + "environmentMetadata": { + "base_environment": "", + "environment_version": "2" + }, + "inputWidgetPreferences": null, + "language": "python", + "notebookMetadata": { + "mostRecentlyExecutedCommandWithImplicitDF": { + "commandId": 7372515407401739, + "dataframes": [ + "_sqldf" + ] + }, + "pythonIndentUnit": 4 }, - "widgetType": "text" - } + "notebookName": "batch-inference", + "widgets": { + "catalog_name": { + "currentValue": "pedroz_e2edata_dev", + "nuid": "6ef15a2c-f408-4625-aa2e-4b36d2e9efcb", + "typedWidgetInfo": { + "autoCreated": false, + "defaultValue": "pedroz_e2edata_dev", + "label": null, + "name": "catalog_name", + "options": { + "validationRegex": null, + "widgetDisplayType": "Text" + }, + "parameterDataType": "String" + }, + "widgetInfo": { + "defaultValue": "pedroz_e2edata_dev", + "label": null, + "name": "catalog_name", + "options": { + "autoCreated": null, + "validationRegex": null, + "widgetType": "text" + }, + "widgetType": "text" + } + } + } + }, + "language_info": { + "name": "python" } - } }, - "language_info": { - "name": "python" - } - }, - "nbformat": 4, - "nbformat_minor": 0 + "nbformat": 4, + "nbformat_minor": 0 } diff --git a/notebooks/3_inference/realtime_inference.ipynb b/notebooks/3_inference/realtime_inference.ipynb index 558f170..f4686b3 100644 --- a/notebooks/3_inference/realtime_inference.ipynb +++ b/notebooks/3_inference/realtime_inference.ipynb @@ -1,125 +1,133 @@ { - "cells": [ - { - "cell_type": "code", - 
"execution_count": 0, - "metadata": { - "application/vnd.databricks.v1+cell": { - "cellMetadata": {}, - "inputWidgets": {}, - "nuid": "c349d0aa-264d-41f5-837b-72703eb01f87", - "showTitle": false, - "tableResultSettingsMap": {}, - "title": "" + "cells": [ + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": {}, + "inputWidgets": {}, + "nuid": "c349d0aa-264d-41f5-837b-72703eb01f87", + "showTitle": false, + "tableResultSettingsMap": {}, + "title": "" + } + }, + "outputs": [], + "source": [ + "# This notebook is meant to show a simple example of how to use the near real-time inference endpoint\n", + "\n", + "dbutils.widgets.text(\"catalog_name\", \"mlops_quickstart_dev\")\n", + "dbutils.widgets.text(\"schema_name\", \"iris\")\n", + "\n", + "catalog_name = dbutils.widgets.get(\"catalog_name\")\n", + "schema_name = dbutils.widgets.get(\"schema_name\")" + ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": {}, + "inputWidgets": {}, + "nuid": "62d91fa5-5a0d-4fbc-a81b-1850fd7b8d2f", + "showTitle": false, + "tableResultSettingsMap": {}, + "title": "" + } + }, + "outputs": [], + "source": [ + "from databricks.sdk import WorkspaceClient" + ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "inputWidgets": {}, + "nuid": "f82f4dff-f1f0-42da-b982-2a83442d3ba6", + "showTitle": false, + "tableResultSettingsMap": {}, + "title": "" + } + }, + "outputs": [], + "source": [ + "w = WorkspaceClient()\n", + "\n", + "endpoint_name = f\"{catalog_name}-{schema_name}-iris_model-endpoint\"\n", + "\n", + "sample_input = [\n", + " {\n", + " \"sepal_length_cm\": 5.1,\n", + " \"sepal_width_cm\": 3.5,\n", + " \"petal_length_cm\": 1.4,\n", + " \"petal_width_cm\": 0.2\n", + " }\n", + "]\n", + "\n", + "prediction = 
w.serving_endpoints.query(\n", + " name=endpoint_name,\n", + " dataframe_records=sample_input\n", + ")\n", + "\n", + "display(prediction)" + ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "inputWidgets": {}, + "nuid": "6f25d6d3-1170-47a6-837e-69f90f17e1dd", + "showTitle": false, + "tableResultSettingsMap": {}, + "title": "" + } + }, + "outputs": [], + "source": [ + "print(\n", + " 'Input:',\n", + " sample_input,\n", + " '\\nOutput: ',\n", + " prediction.predictions\n", + ")" + ] } - }, - "outputs": [], - "source": [ - "# This notebook is meant to show a simple example of how to use the near real-time inference endpoint" - ] - }, - { - "cell_type": "code", - "execution_count": 0, - "metadata": { - "application/vnd.databricks.v1+cell": { - "cellMetadata": {}, - "inputWidgets": {}, - "nuid": "62d91fa5-5a0d-4fbc-a81b-1850fd7b8d2f", - "showTitle": false, - "tableResultSettingsMap": {}, - "title": "" - } - }, - "outputs": [], - "source": [ - "from databricks.sdk import WorkspaceClient" - ] - }, - { - "cell_type": "code", - "execution_count": 0, - "metadata": { - "application/vnd.databricks.v1+cell": { - "cellMetadata": { - "byteLimit": 2048000, - "rowLimit": 10000 - }, - "inputWidgets": {}, - "nuid": "f82f4dff-f1f0-42da-b982-2a83442d3ba6", - "showTitle": false, - "tableResultSettingsMap": {}, - "title": "" - } - }, - "outputs": [], - "source": [ - "w = WorkspaceClient()\n", - "\n", - "sample_input = [\n", - " {\n", - " \"sepal_length_cm\": 5.1,\n", - " \"sepal_width_cm\": 3.5,\n", - " \"petal_length_cm\": 1.4,\n", - " \"petal_width_cm\": 0.2\n", - " }\n", - "]\n", - "\n", - "prediction = w.serving_endpoints.query(\n", - " name=\"pedroz_e2edata_dev-default-iris_model-endpoint\",\n", - " dataframe_records=sample_input\n", - ")\n", - "\n", - "display(prediction)" - ] - }, - { - "cell_type": "code", - "execution_count": 0, - "metadata": 
{ - "application/vnd.databricks.v1+cell": { - "cellMetadata": { - "byteLimit": 2048000, - "rowLimit": 10000 - }, - "inputWidgets": {}, - "nuid": "6f25d6d3-1170-47a6-837e-69f90f17e1dd", - "showTitle": false, - "tableResultSettingsMap": {}, - "title": "" + ], + "metadata": { + "application/vnd.databricks.v1+notebook": { + "computePreferences": null, + "dashboards": [], + "environmentMetadata": { + "base_environment": "", + "environment_version": "4" + }, + "inputWidgetPreferences": null, + "language": "python", + "notebookMetadata": { + "pythonIndentUnit": 4 + }, + "notebookName": "realtime-inference", + "widgets": {} + }, + "language_info": { + "name": "python" } - }, - "outputs": [], - "source": [ - "print(\n", - " 'Input:',\n", - " sample_input,\n", - " '\\nOutput: ',\n", - " prediction.predictions\n", - ")" - ] - } - ], - "metadata": { - "application/vnd.databricks.v1+notebook": { - "computePreferences": null, - "dashboards": [], - "environmentMetadata": { - "base_environment": "", - "environment_version": "4" - }, - "inputWidgetPreferences": null, - "language": "python", - "notebookMetadata": { - "pythonIndentUnit": 4 - }, - "notebookName": "realtime-inference", - "widgets": {} }, - "language_info": { - "name": "python" - } - }, - "nbformat": 4, - "nbformat_minor": 0 + "nbformat": 4, + "nbformat_minor": 0 } diff --git a/resources/1_data_preprocessing_job.yml b/resources/1_data_preprocessing_job.yml index a203672..6c42518 100644 --- a/resources/1_data_preprocessing_job.yml +++ b/resources/1_data_preprocessing_job.yml @@ -14,6 +14,8 @@ resources: parameters: - name: "catalog_name" default: "${var.catalog_name}" + - name: "schema_name" + default: "${var.schema_name}" # If you want to define a shared job cluster for data preprocessing, # uncomment the following section - this is an example that works for Azure Databricks @@ -27,7 +29,7 @@ resources: # node_type_id: "Standard_D4s_v3" # custom_tags: # Environment: "${var.environment}" - # Project: "e2e-data-project" + 
# Project: "mlops-quickstart" # Purpose: "data-preprocessing" # Job tasks @@ -43,6 +45,7 @@ resources: notebook_path: "../notebooks/1_data_preprocessing/data_ingestion.ipynb" base_parameters: catalog_name: "{{job.parameters.catalog_name}}" + schema_name: "{{job.parameters.schema_name}}" # Timeout and retry settings timeout_seconds: 3600 @@ -51,4 +54,4 @@ resources: # Tags for organization tags: Environment: "${var.environment}" - Project: "e2e-data-project" \ No newline at end of file + Project: "mlops-quickstart" \ No newline at end of file diff --git a/resources/2_1_model_training_job.yml b/resources/2_1_model_training_job.yml index e5c134b..dab9ca6 100644 --- a/resources/2_1_model_training_job.yml +++ b/resources/2_1_model_training_job.yml @@ -14,6 +14,8 @@ resources: parameters: - name: "catalog_name" default: "${var.catalog_name}" + - name: "schema_name" + default: "${var.schema_name}" # If you want to define a shared job cluster for data preprocessing, # uncomment the following section - this is an example that works for Azure Databricks @@ -27,7 +29,7 @@ resources: # node_type_id: "Standard_D4s_v3" # custom_tags: # Environment: "${var.environment}" - # Project: "e2e-data-project" + # Project: "mlops-quickstart" # Purpose: "model-training" # Job tasks @@ -43,6 +45,7 @@ resources: notebook_path: "../notebooks/2_model_training_and_deployment/model_training.ipynb" base_parameters: catalog_name: "{{job.parameters.catalog_name}}" + schema_name: "{{job.parameters.schema_name}}" # Timeout and retry settings timeout_seconds: 3600 @@ -51,4 +54,4 @@ resources: # Tags for organization tags: Environment: "${var.environment}" - Project: "e2e-data-project" \ No newline at end of file + Project: "mlops-quickstart" \ No newline at end of file diff --git a/resources/2_2_model_deployment_job.yml b/resources/2_2_model_deployment_job.yml index 8e9d9ff..58e456a 100644 --- a/resources/2_2_model_deployment_job.yml +++ b/resources/2_2_model_deployment_job.yml @@ -12,8 +12,12 @@ 
resources: # Job parameters parameters: + - name: "catalog_name" + default: "${var.catalog_name}" + - name: "schema_name" + default: "${var.schema_name}" - name: "model_name" - default: "${var.catalog_name}.default.iris_model" + default: "${var.catalog_name}.${var.schema_name}.iris_model" - name: "model_version" default: "1" @@ -26,7 +30,8 @@ resources: notebook_task: notebook_path: "../notebooks/2_model_training_and_deployment/model_deployment/1_model_evaluation.ipynb" base_parameters: - model_name: "{{job.parameters.model_name}}" + catalog_name: "{{job.parameters.catalog_name}}" + schema_name: "{{job.parameters.schema_name}}" model_version: "{{job.parameters.model_version}}" - task_key: "approval" @@ -38,7 +43,8 @@ resources: notebook_task: notebook_path: "../notebooks/2_model_training_and_deployment/model_deployment/2_model_approval.ipynb" base_parameters: - model_name: "{{job.parameters.model_name}}" + catalog_name: "{{job.parameters.catalog_name}}" + schema_name: "{{job.parameters.schema_name}}" model_version: "{{job.parameters.model_version}}" - task_key: "deployment" @@ -50,7 +56,8 @@ resources: notebook_task: notebook_path: "../notebooks/2_model_training_and_deployment/model_deployment/3_model_deployment.ipynb" base_parameters: - model_name: "{{job.parameters.model_name}}" + catalog_name: "{{job.parameters.catalog_name}}" + schema_name: "{{job.parameters.schema_name}}" model_version: "{{job.parameters.model_version}}" # Timeout and retry settings @@ -60,4 +67,4 @@ resources: # Tags for organization tags: Environment: "${var.environment}" - Project: "e2e-data-project" + Project: "mlops-quickstart" diff --git a/resources/3_batch_inference_job.yml b/resources/3_batch_inference_job.yml index 02d07ab..ef2b136 100644 --- a/resources/3_batch_inference_job.yml +++ b/resources/3_batch_inference_job.yml @@ -14,6 +14,8 @@ resources: parameters: - name: "catalog_name" default: "${var.catalog_name}" + - name: "schema_name" + default: "${var.schema_name}" # If you want 
to define a shared job cluster for data preprocessing, # uncomment the following section - this is an example that works for Azure Databricks @@ -27,7 +29,7 @@ resources: # node_type_id: "Standard_D4s_v3" # custom_tags: # Environment: "${var.environment}" - # Project: "e2e-data-project" + # Project: "mlops-quickstart" # Purpose: "batch-inference" # Job tasks @@ -43,6 +45,7 @@ resources: notebook_path: "../notebooks/3_inference/batch_inference.ipynb" base_parameters: catalog_name: "{{job.parameters.catalog_name}}" + schema_name: "{{job.parameters.schema_name}}" # Timeout and retry settings timeout_seconds: 3600 @@ -51,4 +54,4 @@ resources: # Tags for organization tags: Environment: "${var.environment}" - Project: "e2e-data-project" \ No newline at end of file + Project: "mlops-quickstart" \ No newline at end of file