dbxredact

PII/PHI detection and redaction for Databricks, with a management web app.

Disclaimer: This is a Databricks Solution Accelerator -- a starting point for your project. Evaluate, test, and customize for your use case. Detection accuracy depends on your data and configuration.

What It Does

dbxredact detects and redacts Protected Health Information (PHI) and Personally Identifiable Information (PII) in text data stored in Unity Catalog. It ships as:

A Python library (src/dbxredact/) with Spark UDFs for detection, alignment, redaction, and evaluation
Databricks notebooks for running redaction and benchmarking pipelines
A web management app (FastAPI + React, deployed as a Databricks App) for configuration, pipeline execution, review, and labeling

Key Capabilities

Three detection methods: Presidio (rule-based NLP), AI Query (LLM endpoints), GLiNER (nvidia/gliner-PII transformer NER)
Multi-language support: The AI Query and rule-based detectors support Spanish in addition to English
Entity alignment: Union (recall-focused) or consensus (precision-focused) across detectors
Typed or generic redaction: [PERSON] / [EMAIL] or [REDACTED]
Block / safe lists: Deny lists (always flag) and allow lists (suppress false positives), stored in UC tables, applied uniformly across all detectors
Benchmarking: Automated evaluation (precision, recall, F1), AI judge grading, and improvement recommendations
Streaming: Incremental processing via Structured Streaming with checkpoint-based deduplication
GPU acceleration: GPU cluster profiles for GLiNER inference
Cost estimation: Pre-run token and compute cost estimates in the management app
Safety guards: Governance floors on thresholds, destructive-write confirmation, PII-output confirmation
Audit logging: Per-document entity-type counts written to an audit table (no raw PII stored)

Prerequisites

Databricks CLI >= 0.283.0
Poetry >= 2.0
Node.js / npm >= 18 (for the web app frontend build)
Python >= 3.10
A Databricks workspace with Unity Catalog enabled
A SQL Warehouse -- set the ID in variables.yml (sql_warehouse_id) and in your dev.env / prod.env as WAREHOUSE_ID

Quickstart

1. Clone the repository

git clone https://github.com/databricks-industry-solutions/dbxredact.git
cd dbxredact

2. Install prerequisites

Make sure you have the tools listed in Prerequisites installed: Databricks CLI (>= 0.283.0), Poetry (>= 2.0), Node.js/npm (>= 18), and Python (>= 3.10).

Authenticate the Databricks CLI to your workspace:

databricks auth login --host https://your-workspace.cloud.databricks.com

3. Create Unity Catalog volumes

Run these SQL statements in your workspace (e.g., via a SQL editor or notebook). Replace your_catalog and your_schema with your actual catalog and schema names:

CREATE SCHEMA IF NOT EXISTS your_catalog.your_schema;
CREATE VOLUME IF NOT EXISTS your_catalog.your_schema.wheels;
CREATE VOLUME IF NOT EXISTS your_catalog.your_schema.cluster_logs;
CREATE VOLUME IF NOT EXISTS your_catalog.your_schema.checkpoints;

wheels stores the Python library. cluster_logs is used by the benchmark job. checkpoints is used by the streaming pipeline.

Note: The wheel volume only needs to exist in the deployment schema (the SCHEMA in your env file). The redaction pipeline itself can read from and write to any fully qualified table in any schema.

4. Configure environment

cp example.env dev.env

Open dev.env and fill in your values:

DATABRICKS_HOST=https://your-workspace.cloud.databricks.com
CATALOG=your_catalog
SCHEMA=your_schema
WAREHOUSE_ID=your_warehouse_id

# Optional: set to "false" to deploy jobs only (no app)
# DEPLOY_APP=false

WAREHOUSE_ID is the ID of a SQL warehouse in your workspace. Find it under SQL Warehouses > your warehouse > Connection Details. It is required when deploying the app. If you set DEPLOY_APP=false, it can be omitted.
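
The rule above (WAREHOUSE_ID is required unless DEPLOY_APP=false) can be checked before running the deploy script. This is a minimal sketch, not part of deploy.sh: parse_env and missing_keys are hypothetical helpers, and treating DATABRICKS_HOST, CATALOG, and SCHEMA as always required is an assumption based on the example file shown earlier.

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"')
    return values


def missing_keys(env: dict[str, str]) -> list[str]:
    """Return required keys that are absent or empty (assumed key set)."""
    required = ["DATABRICKS_HOST", "CATALOG", "SCHEMA"]
    # WAREHOUSE_ID is only needed when the app is deployed.
    if env.get("DEPLOY_APP", "true").lower() != "false":
        required.append("WAREHOUSE_ID")
    return [key for key in required if not env.get(key)]


sample = """
DATABRICKS_HOST=https://your-workspace.cloud.databricks.com
CATALOG=your_catalog
SCHEMA=your_schema
# DEPLOY_APP=false
"""
print(missing_keys(parse_env(sample)))  # ['WAREHOUSE_ID']
```

With DEPLOY_APP=false uncommented, the same sample would pass the check without a warehouse ID.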

5. Deploy

./deploy.sh dev

The script is interactive -- it prompts before each step (press Enter to proceed, n to skip, q to quit). It will:

Generate databricks.yml from the template using your env file values
Build the Python wheel with Poetry
Upload the wheel to your Unity Catalog volume
Validate and deploy the Databricks Asset Bundle (jobs, app, and artifacts)
Grant UC permissions to the app service principal
Start the management app

Use DEPLOY_APP=false in your env file to skip app-related steps and deploy jobs only.

6. Run a pipeline

From the app (recommended): Open the deployed Databricks App, go to "Run Pipeline", select your source table, choose a cluster profile, and click "Launch".

From CLI:

databricks bundle run redaction_pipeline_cpu_small -t dev \
  --notebook-params source_table=catalog.schema.source,text_column=text,output_table=catalog.schema.redacted

Alternative: Git Folder (Not Recommended)

This approach is for quick evaluation only. You get the notebooks and Python library, but not the management app, pre-configured jobs, or cluster profiles. For the full experience, use the DAB deployment above.

The management app cannot run via Git Folder because it requires the Databricks App runtime (which builds the React frontend and binds job resources at deploy time).

To use this approach, clone the repo into a Databricks Git Folder, then install the library in your notebook:

# Install directly from GitHub (no local build needed):
%pip install git+https://github.com/databricks-industry-solutions/dbxredact.git

# Or, if you have a pre-built wheel in a UC volume:
%pip install /Volumes/your_catalog/your_schema/wheels/dbxredact-0.1.2-py3-none-any.whl

Then open notebooks/4_redaction_pipeline.py, configure the widgets at the top of the notebook, and run all cells. You will need to attach the notebook to an ML-runtime cluster yourself.

Detection Methods

Method	Model	When to Use
Presidio	spaCy `en_core_web_trf` (auto-falls back to `_lg`/`_sm`)	Fast, deterministic, no API calls
AI Query	Databricks LLM endpoints (default: `databricks-gpt-oss-120b`)	Context-aware detection of complex patterns
GLiNER	`nvidia/gliner-PII` (55+ entity types)	Transformer NER, GPU-accelerated, no API calls

Detection methods can be used individually or in ensemble. Ensemble results are merged via configurable alignment (union or consensus).
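
To make the difference between union and consensus concrete, here is a minimal sketch that merges spans by character-level voting. It is an illustration only: the align function, the two-vote consensus rule, and the span representation are assumptions, and the library's alignment.py may match entities differently.

```python
from collections import Counter

Span = tuple[int, int]  # (start, end), end exclusive


def align(detections: dict[str, list[Span]], mode: str = "union") -> list[Span]:
    """Merge spans by character-level voting across detectors."""
    min_votes = 1 if mode == "union" else 2  # assumed consensus rule
    votes = Counter()
    for spans in detections.values():
        covered = set()
        for start, end in spans:
            covered.update(range(start, end))
        votes.update(covered)  # each detector votes once per character
    kept = sorted(pos for pos, n in votes.items() if n >= min_votes)

    merged: list[Span] = []
    for pos in kept:
        if merged and merged[-1][1] == pos:
            merged[-1] = (merged[-1][0], pos + 1)  # extend the current run
        else:
            merged.append((pos, pos + 1))
    return merged


detections = {
    "presidio": [(8, 18)],
    "ai_query": [(8, 18), (25, 36)],
    "gliner": [(13, 18)],
}
print(align(detections, "union"))      # [(8, 18), (25, 36)]
print(align(detections, "consensus"))  # [(8, 18)]
```

Union keeps the span that only one detector found (higher recall); consensus drops it (higher precision).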

Detection Profiles

Profile	Detectors	GLiNER Chunk	Presidio	Best For
Fast (default)	AI Query (low) + GLiNER + Presidio (pattern-only)	256 words	Pattern-only (no spaCy)	Routine redaction, large-scale batch
Deep	All three	256 words	spaCy trf	Compliance-critical, maximum recall
Custom	Manual	Manual	Manual	Experimenting with specific configs

The fast profile achieves an F1 of about 0.89 (overlap matching) across the benchmark datasets by combining AI Query with GLiNER and Presidio in pattern-only mode (deterministic regex for SSN, phone, MRN, dates -- no spaCy required).

Presidio Pattern-Only Mode

When Presidio is enabled, it normally loads spaCy NER models (400+ MB). If you want deterministic pattern-based detection as a lightweight backup (SSN, phone, MRN, dates, reference IDs) without the spaCy dependency, pass presidio_pattern_only=True:

result_df = run_detection_pipeline(
    spark=spark, source_df=df, doc_id_column="doc_id", text_column="text",
    use_presidio=True, use_ai_query=True, use_gliner=True,
    presidio_pattern_only=True,  # regex-only, no spaCy
)

Pattern-only mode includes 25 high-precision regex recognizers covering HIPAA Safe Harbor identifiers: SSN (with and without dashes), phone, email, credit card, IP, MRN, reference IDs, age/gender, DD/MM dates, DEA numbers, NPI, labeled DOB, fax, health plan IDs, account numbers, VIN, EIN, license numbers, MBI (Medicare Beneficiary ID), passport, labeled ZIP codes, routing numbers, and ITIN. NER-based recognizers (PERSON, LOCATION) are skipped.
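
As a rough picture of what a regex recognizer does, the sketch below finds two identifier types with Python's re module and returns entities in the same dict shape used elsewhere in this README. The patterns and the detect_patterns helper are illustrative assumptions; the library's recognizers are more numerous and likely stricter.

```python
import re

# Illustrative patterns only; not the library's actual recognizers.
PATTERNS = {
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def detect_patterns(text: str) -> list[dict]:
    """Return one entity dict per regex match, sorted by start offset."""
    entities = []
    for entity_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            entities.append({
                "entity": match.group(),
                "start": match.start(),
                "end": match.end(),
                "entity_type": entity_type,
            })
    return sorted(entities, key=lambda e: e["start"])


text = "SSN: 123-45-6789, contact jane.doe@example.com"
for e in detect_patterns(text):
    print(e["entity_type"], e["start"], e["end"], e["entity"])
```

Because the output is deterministic, the same input always yields the same spans, which is what makes this mode a useful backup next to the model-based detectors.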

Benchmark Interpretation

Ground-truth annotations may not be perfect -- some edge cases (3-letter hospital abbreviations, bare number sequences, standalone day names) represent a ceiling that no general model will fully reach. When evaluating benchmark results:

Focus on aligned recall (the aligned output is what drives redaction) and precision (false positives erode user trust).
Use block lists to force-flag known entities that detectors miss, and safe lists to suppress recurring false positives.
Re-run benchmarks after config changes to measure the actual impact.
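
The sketch below shows how strict and overlap matching can give different precision and recall for the same predictions. It is a simplified assumption of how such metrics are computed, not the code in evaluation.py; the score function and the span tuples are hypothetical.

```python
Span = tuple[int, int]  # (start, end), end exclusive


def overlaps(a: Span, b: Span) -> bool:
    """True when the two spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def score(predicted: list[Span], truth: list[Span], mode: str = "overlap") -> dict:
    """Span-level precision, recall, and F1 under strict or overlap matching."""
    match = overlaps if mode == "overlap" else (lambda a, b: a == b)
    tp = sum(any(match(p, t) for t in truth) for p in predicted)
    found = sum(any(match(p, t) for p in predicted) for t in truth)
    precision = tp / len(predicted) if predicted else 0.0
    recall = found / len(truth) if truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


truth = [(8, 18), (25, 36), (49, 59)]
predicted = [(8, 18), (26, 36), (70, 75)]  # second span is off by one character

print(score(predicted, truth, "strict"))
print(score(predicted, truth, "overlap"))
```

Here the off-by-one span counts as a miss under strict matching but as a hit under overlap matching, so overlap scores are the more forgiving of the two.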

Cluster Profiles

Six pre-configured job variants ship with the bundle:

Profile	Workers	Instance (default)	Use Case
CPU Small	2	i3.xlarge	Development, small datasets
CPU Medium	5	i3.xlarge	Medium workloads
CPU Large	10	i3.xlarge	Production scale
GPU Small	2	g5.xlarge	GLiNER, small datasets
GPU Medium	5	g5.xlarge	GLiNER, medium datasets
GPU Large	10	g5.xlarge	GLiNER at scale

GPU profiles use Databricks Runtime 17.3.x-gpu-ml-scala2.13.

Cloud-Specific Node Types

The default instance types are AWS-specific. Azure workspaces must override cpu_node_type and gpu_node_type in variables.yml or their target config:

Type	AWS (default)	Azure equivalent
CPU	i3.xlarge	Standard_DS3_v2
GPU	g5.xlarge	Standard_NC4as_T4_v3

If you see "Node type X is not supported" during deployment, set the correct node type for your cloud in variables.yml:

cpu_node_type:
  default: "Standard_DS3_v2"
gpu_node_type:
  default: "Standard_NC4as_T4_v3"

Management App

dbxredact includes a web application deployed as a Databricks App. The app provides:

Configuration: Create and manage detection configurations (enable/disable detectors, set thresholds, choose endpoints)
Block / Safe Lists: Add terms to block lists (always flag) or safe lists (suppress false positives) via the UI
Run Pipeline: Select a source table, choose a cluster profile (CPU/GPU, Small/Medium/Large), view cost estimates, and trigger redaction jobs
Job History: Monitor pipeline runs with status tracking, cost estimates, and links to Databricks job runs
Review: Side-by-side comparison of original and redacted text
Labeling: Annotate entities by highlighting text to build ground-truth datasets for evaluation
Admin: PII retention monitoring, annotation purge (POST /api/admin/purge-annotations), and audit log viewer

The app is defined in apps/dbxredact-app/ and deployed via the Databricks Asset Bundle (resources/app.yml).

App Environment Variables

Variable	Default	Description
`DATABRICKS_WAREHOUSE_ID`	(required)	SQL warehouse for app queries
`CATALOG` / `SCHEMA`	(required)	Unity Catalog location for app tables
`ALLOWED_ORIGINS`	`*`	Comma-separated CORS origins (set for production)
`RETENTION_DAYS`	`90`	Days before PII retention warnings fire for annotation/ground-truth tables
`DEBUG`	`false`	When `true`, error responses include exception details

Architecture

Detection Pipeline

flowchart LR
    SRC[(Source Table)] --> P["Presidio\nRule-based + NLP"]
    SRC --> AI["AI Query\nLLM Endpoint"]
    SRC --> GL["GLiNER\nnvidia/gliner-PII"]
    P & AI & GL --> ALN["Alignment\nunion | consensus"]
    ALN --> FILT["Block/Safe\nList Filter"]
    FILT --> RED["Redaction\ngeneric | typed"]
    RED --> OUT[(Redacted Table)]

Management App

flowchart TB
    subgraph app [Management App]
        CONFIG[Configure] --> RUN[Run Pipeline]
        RUN --> HISTORY[Job History]
        REVIEW[Review Results]
        LABEL[Label Entities]
        LISTS[Block/Safe Lists]
    end
    RUN -->|triggers| JOB[Databricks Job]
    JOB -->|writes| OUT[(Redacted Table)]
    LABEL -->|writes| GT[(Ground Truths Table)]
    LISTS -->|read by| JOB

Pipeline Module Structure

flowchart TB
    subgraph lib["src/dbxredact/"]
        config["config.py\nPrompts, thresholds, RedactionConfig"]
        analyzer["analyzer.py\nPresidio analyzer setup"]
        presidio["presidio.py\nBatch UDF formatting"]
        ai_det["ai_detector.py\nAI Query prompt + UDF"]
        gliner_det["gliner_detector.py\nChunking, caching, offset remap"]
        detection["detection.py\nOrchestrator for all detectors"]
        alignment["alignment.py\nMulti-source entity merge"]
        entity_filter["entity_filter.py\nBlock/safe list filtering"]
        redaction["redaction.py\nText replacement UDFs"]
        evaluation["evaluation.py\nMetrics, error analysis"]
        judge["judge.py\nAI judge + next actions"]
        pipeline["pipeline.py\nEnd-to-end orchestration"]
    end

    config --> analyzer & ai_det & gliner_det & judge
    analyzer --> presidio
    presidio & ai_det & gliner_det --> detection
    detection --> pipeline
    alignment --> pipeline
    entity_filter --> pipeline
    redaction --> pipeline
    evaluation -.->|"benchmarking only"| pipeline
    judge -.->|"benchmarking only"| pipeline

Benchmarking & Development Feedback Loop

flowchart TB
    subgraph Local["Local Dev Environment"]
        DEV["dev.env\nCATALOG, SCHEMA, HOST"]
        SCRIPT["scripts/run_benchmark.sh\n-p job_params"]
        DEPLOY["deploy.sh\npoetry build + bundle deploy"]
        LOGS["benchmark_results/\nstdout, stderr, summary"]
    end

    subgraph DAB["Databricks Asset Bundle"]
        direction TB
        VALIDATE["databricks bundle validate"]
        BDEPLOY["databricks bundle deploy"]
        WHEEL["dist/*.whl uploaded"]
    end

    subgraph BenchmarkJob["Benchmark Job (Databricks)"]
        direction TB
        N1["1. Detection\nPresidio + AI + GLiNER"]
        N2["2. Evaluation\nTP/FP/FN/TN, F1, Recall\nstrict + overlap matching"]
        N3["3. Redaction\nApply to detected entities"]
        N5["5. AI Judge\nPASS / PARTIAL / FAIL"]
        N6["6. Audit\nConsolidate all metrics"]
        N7["7. Next Actions\nAI recommendations"]

        N1 --> N2 --> N3 --> N5 --> N6 --> N7
    end

    subgraph ClusterLogs["Cluster Log Delivery"]
        CL["stdout with\n[BENCHMARK_RESULTS] tags"]
        VOL[("/Volumes/.../cluster_logs")]
    end

    DEV --> SCRIPT
    SCRIPT --> VALIDATE --> BDEPLOY --> WHEEL
    WHEEL --> N1
    N7 -->|"logs"| CL --> VOL
    SCRIPT -->|"databricks fs cp"| VOL
    VOL -->|"download + extract"| LOGS
    LOGS -->|"review & iterate"| DEV

Notebooks

Notebook	Purpose
`4_redaction_pipeline.py`	Production redaction (full or incremental)
`0_load_benchmark_data.py`	Upload benchmark CSVs to Unity Catalog
`1_benchmarking_detection.py`	Run all detection methods
`2_benchmarking_evaluation.py`	Precision, recall, F1 (strict + overlap matching)
`3_benchmarking_redaction.py`	Apply redaction to detection results
`5_benchmarking_judge.py`	AI judge grades redacted output
`6_benchmarking_audit.py`	Consolidate metrics into audit table
`7_benchmarking_next_actions.py`	AI-generated improvement recommendations
`9_gliner_fine_tuning.py`	Fine-tune GLiNER on custom labeled data

Synthetic benchmark data is included in data/ with ground-truth PII annotations (NAME, DATE, LOCATION, IDNUM, CONTACT):

File	Domain	Docs	Annotations
`synthetic_benchmark_medical.csv`	Clinical (discharge summaries, lab reports, etc.)	10	~180
`synthetic_benchmark_finance.csv`	Financial (wire transfers, loans, KYC, tax, etc.)	10	~250
`synthetic_benchmark.csv`	Combined	20	~430

Upload a CSV to a Unity Catalog table and use it as both the source and ground truth for the benchmarking notebooks. To regenerate or customize: python scripts/generate_synthetic_benchmark.py --domain medical|finance|all.

Important: After regenerating CSVs, you must re-upload the data to your Unity Catalog table for the updated annotations to take effect in benchmarking. Example:

DROP TABLE IF EXISTS your_catalog.your_schema.synthetic_benchmark_medical;
-- Then re-create from CSV upload or use the Databricks UI file upload

For larger-scale evaluation, supply your own labeled dataset and update the widget defaults accordingly.

API Reference

Full Pipeline

from dbxredact import RedactionConfig, run_redaction_pipeline

config = RedactionConfig(
    use_presidio=True, use_ai_query=True, use_gliner=False,
    redaction_strategy="typed",
)

result_df = run_redaction_pipeline(
    spark=spark,
    source_table="catalog.schema.medical_notes",
    text_column="note_text",
    output_table="catalog.schema.medical_notes_redacted",
    config=config,
)

Detection Only

from dbxredact import run_detection_pipeline

result_df = run_detection_pipeline(
    spark=spark,
    source_df=source_df,
    doc_id_column="doc_id",
    text_column="text",
    use_presidio=True,
    use_ai_query=True,
    endpoint="databricks-gpt-oss-120b",
)

Simple Text Redaction

from dbxredact import redact_text

text = "Patient John Smith (SSN: 123-45-6789) visited on 2024-01-15."
entities = [
    {"entity": "John Smith", "start": 8, "end": 18, "entity_type": "PERSON"},
    {"entity": "123-45-6789", "start": 25, "end": 36, "entity_type": "US_SSN"},
]
result = redact_text(text, entities, strategy="typed")
# "Patient [PERSON] (SSN: [US_SSN]) visited on 2024-01-15."
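
For intuition about what typed and generic replacement involve, here is a standalone sketch that substitutes spans from right to left so that earlier offsets stay valid. redact_sketch is a hypothetical stand-in, not the library's redact_text, and it assumes the spans do not overlap (the library's redaction module also merges spans).

```python
def redact_sketch(text: str, entities: list[dict], strategy: str = "typed") -> str:
    """Replace entity spans right-to-left so earlier offsets stay valid."""
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        label = f"[{ent['entity_type']}]" if strategy == "typed" else "[REDACTED]"
        text = text[: ent["start"]] + label + text[ent["end"]:]
    return text


text = "Patient John Smith (SSN: 123-45-6789) visited on 2024-01-15."
entities = [
    {"entity": "John Smith", "start": 8, "end": 18, "entity_type": "PERSON"},
    {"entity": "123-45-6789", "start": 25, "end": 36, "entity_type": "US_SSN"},
]
print(redact_sketch(text, entities, "typed"))
# Patient [PERSON] (SSN: [US_SSN]) visited on 2024-01-15.
print(redact_sketch(text, entities, "generic"))
# Patient [REDACTED] (SSN: [REDACTED]) visited on 2024-01-15.
```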

Block / Safe Lists

Block and safe lists are applied as post-processing filters after detection and alignment, not as Presidio custom recognizers. This means they work uniformly across all three detection methods. Block lists force specific terms to always be flagged as PII. Safe lists suppress false positives by removing matches. They are stored as Unity Catalog tables (redact_block_list, redact_safe_list) and can be managed through the app UI or SQL.

See src/dbxredact/entity_filter.py for the EntityFilter API and load_filter_from_table to load lists from Unity Catalog tables.
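
The post-processing behavior described above can be sketched in plain Python. This is not the EntityFilter API: apply_lists, the case-insensitive exact matching, and the BLOCK_LIST entity type are assumptions chosen for illustration.

```python
import re


def apply_lists(text, entities, block_list, safe_list):
    """Drop safe-listed entities, then add block-listed terms not already covered."""
    safe = {term.lower() for term in safe_list}
    kept = [e for e in entities if e["entity"].lower() not in safe]
    for term in block_list:
        for m in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
            covered = any(e["start"] <= m.start() and m.end() <= e["end"] for e in kept)
            if not covered:
                kept.append({"entity": m.group(), "start": m.start(),
                             "end": m.end(), "entity_type": "BLOCK_LIST"})
    return sorted(kept, key=lambda e: e["start"])


text = "Seen at Mercy General by Dr. Lee on Monday."


def ent(value: str, entity_type: str) -> dict:
    """Build an entity dict by locating `value` in the sample text."""
    start = text.index(value)
    return {"entity": value, "start": start, "end": start + len(value),
            "entity_type": entity_type}


detected = [ent("Dr. Lee", "PERSON"), ent("Monday", "DATE")]
result = apply_lists(text, detected, block_list=["Mercy General"], safe_list=["monday"])
print([e["entity"] for e in result])  # ['Mercy General', 'Dr. Lee']
```

Running the safe list first and the block list second means a block-listed term is always flagged, whichever detectors produced the input entities.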

Streaming (Incremental) Mode

The redaction pipeline supports incremental processing via Structured Streaming. Select incremental for the "Refresh Approach" widget in 4_redaction_pipeline.py, or call run_redaction_pipeline_streaming directly.

Key operational notes

Deduplication (best effort): The streaming pipeline deduplicates by doc_id within each micro-batch before writing via MERGE INTO. It does not guarantee cross-batch deduplication: if the same doc_id appears in two different micro-batches, the later batch overwrites the earlier result rather than skipping it. For strict deduplication guarantees, use the batch pipeline, which calls .distinct() on the full source before processing.
Checkpoint coupling: The streaming checkpoint is tightly coupled to the Spark query plan. If you change which detectors are enabled, switch alignment mode, or modify detection logic, delete the checkpoint directory before restarting the stream.
mergeSchema is on: Switching between production and validation output strategies will widen the output table automatically.
AI failure flagging: When AI Query returns an error for a row, the output includes _ai_detection_failed = True and a warning is logged. These rows still flow through redaction (using other detectors if available) but should be reviewed or retried.
LLM non-determinism: If a micro-batch is retried after a transient failure, AI Query may produce slightly different results for the same document.
max_files_per_trigger: Controls how many files each micro-batch ingests (default 10). Set to 0 / None for unlimited. Useful for throttling first-run backfill on large tables.
Checkpoint path: Should be a Unity Catalog Volume path (/Volumes/catalog/schema/volume_name/...). Non-Volume paths (DBFS, local) may not persist across cluster restarts; a warning is emitted if detected.
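
The deduplication note above can be simulated without Spark. In this sketch a dict stands in for the output table and process_batch for one micro-batch; both are hypothetical, and keeping the first row per doc_id within a batch is an assumption.

```python
def process_batch(target: dict, batch: list[dict]) -> None:
    """Dedup by doc_id within the batch, then upsert into the target."""
    deduped = {}
    for row in batch:
        deduped.setdefault(row["doc_id"], row)  # keep first occurrence per doc_id
    for doc_id, row in deduped.items():
        target[doc_id] = row["text"]  # MERGE-like upsert: later batches overwrite


target = {}
process_batch(target, [{"doc_id": 1, "text": "v1"}, {"doc_id": 1, "text": "v1-dup"}])
process_batch(target, [{"doc_id": 1, "text": "v2"}, {"doc_id": 2, "text": "a"}])
print(target)  # {1: 'v2', 2: 'a'}
```

The duplicate inside the first batch is dropped, but the second batch replaces the stored result for doc_id 1, which is the cross-batch overwrite described above.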

Safety Guards

The pipeline enforces several safety gates to prevent accidental PII exposure or data loss:

Guard	Trigger	Required Opt-in
Destructive write	`output_mode="in_place"`	`confirm_destructive=True`
PII in output table	`output_strategy="validation"`	`confirm_validation_output=True`
Low-recall alignment	`alignment_mode="consensus"`	`allow_consensus_redaction=True`

Detection thresholds have governance floors to prevent configs that silently disable detection: score_threshold must be >= 0.1 and gliner_threshold must be >= 0.05. The RedactionConfig dataclass and the app's API schema both enforce these bounds.
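
A governance floor and an opt-in guard of this kind can be expressed as validation in a dataclass. The sketch below is not the real RedactionConfig: ConfigSketch, its defaults, and the "separate" output mode name are hypothetical. Only the floor values (0.1 and 0.05) and the in_place / confirm_destructive pairing come from this README.

```python
from dataclasses import dataclass

SCORE_FLOOR = 0.1    # floor for score_threshold
GLINER_FLOOR = 0.05  # floor for gliner_threshold


@dataclass
class ConfigSketch:
    score_threshold: float = 0.5    # assumed default
    gliner_threshold: float = 0.5   # assumed default
    output_mode: str = "separate"   # hypothetical name for the non-destructive mode
    confirm_destructive: bool = False

    def __post_init__(self):
        if self.score_threshold < SCORE_FLOOR:
            raise ValueError(f"score_threshold must be >= {SCORE_FLOOR}")
        if self.gliner_threshold < GLINER_FLOOR:
            raise ValueError(f"gliner_threshold must be >= {GLINER_FLOOR}")
        if self.output_mode == "in_place" and not self.confirm_destructive:
            raise ValueError("in_place output requires confirm_destructive=True")


try:
    ConfigSketch(score_threshold=0.01)
except ValueError as err:
    print(err)  # score_threshold must be >= 0.1

ok = ConfigSketch(output_mode="in_place", confirm_destructive=True)
print(ok.output_mode)  # in_place
```

Failing at construction time means a misconfigured run stops before any data is read or written.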

Pipeline Details

Output Columns

The pipeline reads only doc_id and the specified text_column from the source table. Other source columns are not carried to the output.

Production mode (default): Output contains doc_id, {text_column}_redacted, _detection_status, and _entity_count.
Validation mode: Output includes all intermediate columns (raw detector results, aligned entities, redacted text) for debugging. Requires confirm_validation_output=True since it persists raw PII.

Column	Description
`doc_id`	Document identifier (join key)
`{text_column}_redacted`	Redacted text
`_detection_status`	`ok`, `no_entities`, or `detection_error`
`_entity_count`	Number of entities detected in the document

To join redacted output back to your original table, use doc_id as the key:

SELECT o.*, r.text_redacted
FROM catalog.schema.original o
JOIN catalog.schema.redacted r ON o.doc_id = r.doc_id

Multiple Column Redaction

Currently, each pipeline run processes a single text column. Multi-column support is on the roadmap. For now, run the pipeline once per column and join outputs downstream on doc_id:

for col in ["notes", "address", "comments"]:
    run_redaction_pipeline(spark, source_table=..., text_column=col,
                           output_table=f"..._{col}_redacted", ...)

Document Length

No explicit limit. Practical limits depend on the detection method:

GLiNER: Handles long texts internally via word-boundary chunking with automatic offset correction (_chunk_and_predict). No user-side chunking needed.
AI Query: Subject to the LLM endpoint's context window / token limit. Documents exceeding the limit will be truncated by the endpoint.
Presidio: Processes text in-memory via spaCy. No hard cap, but very large documents may be slow.
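
Word-boundary chunking with offset correction can be sketched as follows. This is an illustration of the idea, not the library's _chunk_and_predict: chunk_words, predict_chunked, and the stand-in predictor are hypothetical, and the 256-word default mirrors the chunk size listed under Detection Profiles.

```python
import re


def chunk_words(text: str, max_words: int = 256) -> list[tuple[int, str]]:
    """Split text on word boundaries; return (char_offset, chunk_text) pairs."""
    words = list(re.finditer(r"\S+", text))
    chunks = []
    for i in range(0, len(words), max_words):
        group = words[i : i + max_words]
        start, end = group[0].start(), group[-1].end()
        chunks.append((start, text[start:end]))
    return chunks


def predict_chunked(text, predict, max_words=256):
    """Run `predict` on each chunk and shift spans back to document offsets."""
    entities = []
    for offset, chunk in chunk_words(text, max_words):
        for ent in predict(chunk):
            entities.append({**ent, "start": ent["start"] + offset,
                             "end": ent["end"] + offset})
    return entities


def fake_predict(chunk: str) -> list[dict]:
    """Stand-in for a model call: flags every occurrence of 'Smith'."""
    return [{"entity": m.group(), "start": m.start(), "end": m.end(),
             "entity_type": "PERSON"} for m in re.finditer(r"Smith", chunk)]


text = "word " * 300 + "Dr Smith arrived"
found = predict_chunked(text, fake_predict)
print(len(chunk_words(text)), text[found[0]["start"]:found[0]["end"]])  # 2 Smith
```

A simple splitter like this would miss an entity that straddles a chunk boundary; overlapping windows are one common way to handle that case.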

In-Place Redaction

By default the pipeline writes to a separate output table. To overwrite the text column directly in the source table, use output_mode="in_place". This uses MERGE INTO so only processed rows are touched.

run_redaction_pipeline(
    spark=spark,
    source_table="catalog.schema.notes",
    text_column="note_text",
    output_mode="in_place",
    confirm_destructive=True,  # required -- operation is irreversible
    config=config,
)

RedactionConfig

Instead of passing 25+ keyword arguments, use RedactionConfig to bundle pipeline settings. When config is provided, its fields override any matching kwargs.

from dbxredact import RedactionConfig, run_redaction_pipeline

config = RedactionConfig(
    use_presidio=True,
    use_ai_query=True,
    use_gliner=False,
    presidio_pattern_only=True,
    score_threshold=0.5,
    redaction_strategy="typed",
    output_strategy="production",
)

result_df = run_redaction_pipeline(
    spark=spark,
    source_table="catalog.schema.notes",
    text_column="note_text",
    output_table="catalog.schema.notes_redacted",
    config=config,
)

Audit Logging

Pass audit_table to write per-document entity-type counts (no raw PII) for compliance tracking:

run_redaction_pipeline(
    ...,
    audit_table="catalog.schema.redact_audit_log",
)

The audit log records run_id, doc_id, entity_type, entity_count, detection_status, detectors_used, and a config snapshot. The management app creates this table automatically on startup.
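
The "counts only, no raw PII" property can be illustrated with a small aggregation. audit_rows is a hypothetical helper that covers just four of the fields listed above; the real audit table also records detection_status, detectors_used, and the config snapshot.

```python
from collections import Counter


def audit_rows(run_id: str, doc_id: str, entities: list[dict]) -> list[dict]:
    """Aggregate entities to per-type counts; entity text is never copied."""
    counts = Counter(e["entity_type"] for e in entities)
    return [
        {"run_id": run_id, "doc_id": doc_id, "entity_type": t, "entity_count": n}
        for t, n in sorted(counts.items())
    ]


entities = [
    {"entity": "John Smith", "entity_type": "PERSON"},
    {"entity": "Jane Roe", "entity_type": "PERSON"},
    {"entity": "123-45-6789", "entity_type": "US_SSN"},
]
for row in audit_rows("run-001", "doc-42", entities):
    print(row)
```

Each output row carries a type and a count but none of the matched strings, so the audit table can be shared more widely than the source data.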

Project Structure

dbxredact/
  databricks.yml.template    # DAB config template (deploy.sh generates databricks.yml)
  deploy.sh                  # Build, configure, and deploy script
  pyproject.toml             # Poetry dependencies and build config
  variables.yml              # Bundle variables (catalog, schema, etc.)
  example.env                # Template for dev.env / prod.env
  src/dbxredact/             # Core Python library
    config.py                #   Prompts, thresholds, RedactionConfig dataclass
    analyzer.py              #   Custom Presidio recognizers (HIPAA Safe Harbor)
    detection.py             #   Orchestrator for all detectors
    gliner_detector.py       #   GLiNER batch UDF with worker-level model cache
    presidio.py              #   Presidio batch UDF
    ai_detector.py           #   AI Query prompt construction + UDF
    alignment.py             #   Multi-source entity merge (union/consensus)
    redaction.py             #   Text replacement UDFs (with span merging)
    entity_filter.py         #   Block/safe list filtering
    pipeline.py              #   End-to-end orchestration + safety guards
    evaluation.py            #   Metrics (precision, recall, F1)
    judge.py                 #   AI judge + next actions
    cost.py                  #   Token cost estimation constants
  notebooks/                 # Databricks notebooks
    4_redaction_pipeline.py  #   Production redaction pipeline
    0_load_benchmark_data.py #   Upload benchmark CSVs to UC
    1-7_benchmarking_*.py    #   Benchmarking pipeline steps
    9_gliner_fine_tuning.py  #   GLiNER fine-tuning notebook
  apps/dbxredact-app/        # Management web app
    app.py                   #   FastAPI entry point + SPA serving
    api/                     #   Backend API routes and services
    src/                     #   React frontend source
    package.json             #   Node.js dependencies
  resources/                 # DAB resource definitions
    jobs.yml                 #   Job definitions (CPU/GPU x Small/Medium/Large)
    app.yml                  #   App resource definition
  scripts/                   # Utility scripts
  data/                      # Synthetic benchmark datasets
  tests/                     # Unit tests

Testing

poetry install --with dev
poetry run pytest tests/ -x -q --ignore=tests/integration

Integration tests (tests/integration/) require a live Spark cluster and are excluded from local/CI runs.

Compute Types

Use an ML-runtime cluster; serverless compute is not currently supported. GLiNER performs better on GPU, but the other detection methods do not.

Libraries

Core Dependencies

Library	Version	License	Description	PyPI
presidio-analyzer	2.2.358	MIT	Microsoft Presidio PII detection engine	PyPI
presidio-anonymizer	2.2.358	MIT	Microsoft Presidio anonymization engine	PyPI
spacy	3.8.7	MIT	Industrial-strength NLP library	PyPI
gliner	>=0.2.0	Apache 2.0	Generalist NER using bidirectional transformers	PyPI
rapidfuzz	>=3.0.0	MIT	Fast fuzzy string matching	PyPI
pydantic	>=2.0.0	MIT	Data validation using Python type hints	PyPI
pyyaml	>=6.0.1	MIT	YAML parser and emitter	PyPI
databricks-sdk	>=0.30.0	Apache 2.0	Databricks SDK for Python	PyPI

GLiNER Model

Model	License	Description	HuggingFace
nvidia/gliner-PII	NVIDIA Open Model License	PII/PHI-focused NER model with 55+ entity types	HuggingFace

This solution accelerator uses the nvidia/gliner-PII model in accordance with the NVIDIA Open Model License. The NVIDIA Open Model License permits commercial use and is compatible with the DB License under which this accelerator is released. However, use of the model itself is governed by NVIDIA's license terms, not the DB License. Customers are responsible for reviewing and complying with the NVIDIA Open Model License independently before deploying in production.

spaCy Models (for Presidio)

The default is en_core_web_trf (RoBERTa transformer, NER F1 ~90.2%). The code auto-falls back to _lg or _sm if _trf is not installed. Install the best model available for your cluster:

Model	NER F1	Size	GPU	License	Install
en_core_web_trf (recommended)	90.2%	438 MB	Recommended	MIT	spaCy Models
en_core_web_lg	85.4%	560 MB	No	MIT	spaCy Models
en_core_web_sm	84.6%	12 MB	No	MIT	spaCy Models

Runtime Dependencies (provided by Databricks)

Library	License	Description
pandas	BSD-3-Clause	Data manipulation library
pyspark	Apache 2.0	Apache Spark Python API
pyarrow	Apache 2.0	Apache Arrow Python bindings

All dependencies use permissive open-source licenses (MIT, Apache 2.0, BSD-3-Clause). No copyleft (GPL) dependencies.

Deploying to Production

Set these variables in variables.yml before running ./deploy.sh prod:

current_user: Service principal or user email for run_as permissions (e.g. deployer@company.com)
current_working_directory: Workspace path for artifact deployment (e.g. /Workspace/Shared/dbxredact)

The prod target enforces run_as permissions and requires explicit user configuration.

Compliance and Responsibility

This is a solution accelerator -- it provides tooling to assist with PII/PHI detection and redaction, but all compliance obligations remain with the user. This includes but is not limited to:

HIPAA: You are responsible for ensuring your deployment meets HIPAA requirements (encryption, access controls, audit logging, BAAs, etc.)
GDPR, CCPA, and other privacy regulations: Evaluate whether your use of this tool satisfies applicable data protection laws
Validation: You must verify that redaction results are complete and accurate for your specific data and use case
Data Encryption: Enable encryption at rest and in transit in your Databricks workspace
Access Controls: Configure appropriate table/catalog permissions in Unity Catalog
Audit Logging: Enable workspace audit logs for compliance tracking

Databricks makes no guarantees that use of this tool alone is sufficient for regulatory compliance.

License

DB License
