Merged
125 changes: 8 additions & 117 deletions deidentify_string/deidentify_string.ipynb
@@ -18,102 +18,6 @@
"This notebook will show you how to install a function for deidentifying strings and tokenizing PII in unstructured data using a Skyflow Vault."
]
},
{
"cell_type": "markdown",
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {},
"inputWidgets": {},
"nuid": "5f5dc62c-e14f-4e79-a6f7-275a449bf50d",
"showTitle": false,
"tableResultSettingsMap": {},
"title": ""
}
},
"source": [
"## Configure secrets in Databricks (optional)"
]
},
{
"cell_type": "markdown",
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {},
"inputWidgets": {},
"nuid": "de4c8638-b117-48ce-8c24-1d764466d9b2",
"showTitle": false,
"tableResultSettingsMap": {},
"title": ""
}
},
"source": [
"To use this function as written you must configure a Secret in Databricks for storing the Skyflow API credentials. Alternately, for testing, when calling the function you can manually pass credentials as an argument.\n",
"\n",
"#### Install the CLI\n",
"\n",
"If you use Homebrew on a Mac, the below commands will complete the install.\n",
"\n",
"```\n",
"brew tap databricks/tap\n",
"brew install databricks\n",
"```\n",
"\n",
"For more detailed instructions for any dev environment see the official documentation: [Databricks | Install or update the Databricks CLI](https://docs.databricks.com/aws/en/dev-tools/cli/install)\n",
"\n",
"##### Configure the CLI\n",
"\n",
"To get started and create a configuration profile on your machine, run `databricks configure`.\n",
"\n",
"You should be prompted for `Databricks Host` and a `Personal Access Token`. \n",
"\n",
"To get a Personal Access Token (PAT) for development login to the Databricks UI, open Settings, click Developer, then Access Tokens.\n",
"\n",
"For more information on authenticating the Databricks CLI see the official documentation: [Databricks | Authentication for the Databricks CLI](https://docs.databricks.com/aws/en/dev-tools/cli/authentication)\n",
"\n",
"#### Configure a secret scope in Databricks\n",
"\n",
"Now that you've configured and authenticated the Databricks CLI, run the following command to create a 'scope' for your secrets in Databricks: \n",
"\n",
"`databricks secrets create-scope <scope-name>`\n",
"\n",
"For the rest of this demo we'll use the scope `sky-agentic-demo`.\n",
"\n",
"`databricks secrets create-scope sky-agentic-demo`\n",
"\n",
"#### Get details from Skyflow\n",
"\n",
"- Create or log into your account at [skyflow.com](https://skyflow.com) and generate an API key: [docs.skyflow.com](https://docs.skyflow.com/api-authentication/)\n",
"- Copy your API key, Vault URL, and Vault ID\n",
"\n",
"\n",
"#### Store the secrets in Databricks\n",
"\n",
"Create your secrets using the JSON syntax:\n",
"\n",
"```sh\n",
"databricks secrets put-secret --json '{\n",
" \"scope\": \"sky-agentic-demo\",\n",
" \"key\": \"sky_api_key\",\n",
" \"string_value\": \"--sky_api_key--\"\n",
"}'\n",
"```\n",
"\n",
"To confirm the secrets have been uploaded successfully, run `databricks secrets list-secrets sky-agentic-demo` to see a list of the keys you provided and an updated timestamp.\n",
"\n",
"Example:\n",
"\n",
"```sh\n",
"Key Last Updated Timestamp\n",
"sky_api_key 1739998630197\n",
"```\n",
"\n",
"Then to read a secret in a Notebook, use `dbutils.secrets`:\n",
"\n",
"`sky_api_key = dbutils.secrets.get(scope = \"sky-agentic-demo\", key = \"sky_api_key\")`\n",
"\n",
"To learn more about Secrets in Databricks, see the official documentation: [Secret Management | Databricks](https://docs.databricks.com/aws/en/security/secrets)\n"
]
},
{
"cell_type": "markdown",
"metadata": {
@@ -129,7 +33,8 @@
"source": [
"## Install the function\n",
"\n",
"Before you install, make sure you set your `vault_id` and `vault_url`. These are hardcoded values in our function, though you can modify it to also accept parameters for these values from the user invoking the function or use Databricks environment variables.\n",
"Before you install, make sure you set your `vault_id`, `vault_url`, and `bearer_token`. \n",
"These are hardcoded values in our function, though you can modify it to also accept parameters for these values from the user invoking the function or use Databricks environment variables or Secrets.\n",
"\n"
]
},
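The cell above mentions reading these values from environment variables or Secrets instead of hardcoding them. A minimal sketch of the environment-variable option — the variable names (`SKYFLOW_VAULT_ID`, `SKYFLOW_VAULT_URL`, `SKYFLOW_BEARER_TOKEN`) and the helper itself are illustrative assumptions, not part of the shipped function:

```python
import os

def resolve_vault_config():
    # Fall back to the hardcoded placeholders from the function body
    # when no environment variables are set.
    vault_id = os.environ.get("SKYFLOW_VAULT_ID", "SKYFLOW_VAULT_ID")
    vault_url = os.environ.get("SKYFLOW_VAULT_URL", "https://sample.vault.skyflowapis.com")
    bearer_token = os.environ.get("SKYFLOW_BEARER_TOKEN", "<YOUR_BEARER_TOKEN>")
    return vault_id, vault_url, bearer_token
```

Inside the UDF body you would then replace the hardcoded assignments with a call to this helper.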
@@ -151,13 +56,12 @@
"%sql\n",
"CREATE OR REPLACE FUNCTION\n",
"agentic.default.deidentify_string (\n",
" input_text STRING COMMENT 'The string to be de-identified.',\n",
" sky_api_key STRING COMMENT 'Optional: The API key for the Skyflow API.'\n",
" input_text STRING COMMENT 'The string to be de-identified.'\n",
")\n",
"RETURNS STRING\n",
"LANGUAGE PYTHON\n",
"DETERMINISTIC\n",
"COMMENT 'Deidentify a string using the Skyflow API. Removes any sensitive data from the string and returns a safe string with placeholders in place of sensitive data tokens.'\n",
"COMMENT 'De-identify a string using the Skyflow API. Removes any sensitive data from the string and returns a safe string with placeholders in place of sensitive data tokens.'\n",
"AS $$\n",
" import sys\n",
" import json\n",
@@ -166,17 +70,12 @@
" \n",
" vault_id = \"SKYFLOW_VAULT_ID\"\n",
" vault_url = \"https://sample.vault.skyflowapis.com\"\n",
" bearer_token = '<YOUR_BEARER_TOKEN>'\n",
" \n",
" sys_stdout = sys.stdout\n",
" redirected_output = StringIO()\n",
" sys.stdout = redirected_output\n",
"\n",
" if sky_api_key is None or sky_api_key == '':\n",
" # try to fetch the API key from env variables\n",
" # bearer_token = os.environ.get(\"SKY_API_KEY\")\n",
" bearer_token = '<YOUR_API_KEY>'\n",
" else:\n",
" bearer_token = sky_api_key\n",
"\n",
" api_path = \"/v1/detect/deidentify/string\"\n",
" api_url = vault_url + api_path\n",
" headers = {\n",
@@ -240,11 +139,6 @@
},
"outputs": [],
"source": [
"# Retrieve an access token from Databricks Secrets.\n",
"sky_api_key = dbutils.secrets.get(scope=\"sky-agentic-demo\", key=\"sky_api_key\")\n",
"# Alternately, you can hardcode the API key here.\n",
"# sky_api_key = \"yourkey\"\n",
"\n",
"# Provide some sample text. In practice you'll read this from a file or table.\n",
"input_text = \"Hi my name is Joseph McCarron and I live in Austin TX\"\n",
"\n",
@@ -255,7 +149,7 @@
"\n",
"# Create the result dataframe and pass the API key and the input dataframe\n",
"result_df = spark.sql(f\"\"\"\n",
"SELECT agentic.default.deidentify_string(input_text, '{sky_api_key}') AS deidentified_text\n",
"SELECT agentic.default.deidentify_string(input_text) AS deidentified_text\n",
"FROM input_table\n",
"\"\"\")\n",
"\n",
@@ -386,12 +280,9 @@
},
"outputs": [],
"source": [
"# Retrieve an access token from Databricks Secrets. Alternately hardcode your API key here for test use.\n",
"sky_api_key = dbutils.secrets.get(scope=\"sky-agentic-demo\", key=\"sky_api_key\")\n",
"\n",
"# Note: if you're using your own table, modify the query below to use your table name and column names.\n",
"result_df = spark.sql(f\"\"\"\n",
"SELECT chat_id, user_id, timestamp, agentic.default.deidentify_string(user_message, '{sky_api_key}') AS deidentified_user_message, bot_response, user_name, user_email\n",
"SELECT chat_id, user_id, timestamp, agentic.default.deidentify_string(user_message) AS deidentified_user_message, bot_response, user_name, user_email\n",
"FROM chats\n",
"\"\"\")\n",
"\n",