PII/PHI detection and redaction for Databricks, with a management web app.
Disclaimer: This is a Databricks Solution Accelerator -- a starting point for your project. Evaluate, test, and customize for your use case. Detection accuracy depends on your data and configuration.
dbxredact detects and redacts Protected Health Information (PHI) and Personally Identifiable Information (PII) in text data stored in Unity Catalog. It ships as:
- A Python library (`src/dbxredact/`) with Spark UDFs for detection, alignment, redaction, and evaluation
- Databricks notebooks for running redaction and benchmarking pipelines
- A web management app (FastAPI + React, deployed as a Databricks App) for configuration, pipeline execution, review, and labeling
- Three detection methods: Presidio (rule-based NLP), AI Query (LLM endpoints), GLiNER (`nvidia/gliner-PII` transformer NER)
- Multi-language support: AI Query and rule-based approaches support Spanish
- Entity alignment: Union (recall-focused) or consensus (precision-focused) across detectors
- Typed or generic redaction: `[PERSON]`/`[EMAIL]` or `[REDACTED]`
- Block / safe lists: deny lists (always flag) and allow lists (suppress false positives), stored in UC tables, applied uniformly across all detectors
- Benchmarking: Automated evaluation (precision, recall, F1), AI judge grading, and improvement recommendations
- Streaming: Incremental processing via Structured Streaming with checkpoint-based deduplication
- GPU acceleration: GPU cluster profiles for GLiNER inference
- Cost estimation: Pre-run token and compute cost estimates in the management app
- Safety guards: Governance floors on thresholds, destructive-write confirmation, PII-output confirmation
- Audit logging: Per-document entity-type counts written to an audit table (no raw PII stored)
- Databricks CLI >= 0.283.0
- Poetry >= 2.0
- Node.js / npm >= 18 (for the web app frontend build)
- Python >= 3.10
- A Databricks workspace with Unity Catalog enabled
- A SQL Warehouse -- set the ID in `variables.yml` (`sql_warehouse_id`) and in your `dev.env`/`prod.env` as `WAREHOUSE_ID`
```bash
git clone https://github.com/databricks-industry-solutions/dbxredact.git
cd dbxredact
```

Make sure you have the tools listed in Prerequisites installed: Databricks CLI (>= 0.283.0), Poetry (>= 2.0), Node.js/npm (>= 18), and Python (>= 3.10).
Authenticate the Databricks CLI to your workspace:
```bash
databricks auth login --host https://your-workspace.cloud.databricks.com
```

Run these SQL statements in your workspace (e.g., via a SQL editor or notebook). Replace `your_catalog` and `your_schema` with your actual catalog and schema names:
```sql
CREATE SCHEMA IF NOT EXISTS your_catalog.your_schema;
CREATE VOLUME IF NOT EXISTS your_catalog.your_schema.wheels;
CREATE VOLUME IF NOT EXISTS your_catalog.your_schema.cluster_logs;
CREATE VOLUME IF NOT EXISTS your_catalog.your_schema.checkpoints;
```

`wheels` stores the Python library. `cluster_logs` is used by the benchmark job. `checkpoints` is used by the streaming pipeline.
Note: The wheel volume only needs to exist in the deployment schema (the `SCHEMA` in your env file). The redaction pipeline itself can read from and write to any fully qualified table in any schema.
```bash
cp example.env dev.env
```

Open `dev.env` and fill in your values:

```bash
DATABRICKS_HOST=https://your-workspace.cloud.databricks.com
CATALOG=your_catalog
SCHEMA=your_schema
WAREHOUSE_ID=your_warehouse_id
# Optional: set to "false" to deploy jobs only (no app)
# DEPLOY_APP=false
```
`WAREHOUSE_ID` is the ID of a SQL warehouse in your workspace. Find it under SQL Warehouses > your warehouse > Connection Details. It is required when deploying the app; if you set `DEPLOY_APP=false`, it can be omitted.
```bash
./deploy.sh dev
```

The script is interactive -- it prompts before each step (press Enter to proceed, `n` to skip, `q` to quit). It will:

- Generate `databricks.yml` from the template using your env file values
- Build the Python wheel with Poetry
- Upload the wheel to your Unity Catalog volume
- Validate and deploy the Databricks Asset Bundle (jobs, app, and artifacts)
- Grant UC permissions to the app service principal
- Start the management app
Use `DEPLOY_APP=false` in your env file to skip app-related steps and deploy jobs only.
From the app (recommended): Open the deployed Databricks App, go to "Run Pipeline", select your source table, choose a cluster profile, and click "Launch".
From CLI:
```bash
databricks bundle run redaction_pipeline_cpu_small -t dev \
  --notebook-params source_table=catalog.schema.source,text_column=text,output_table=catalog.schema.redacted
```

This approach is for quick evaluation only. You get the notebooks and Python library, but not the management app, pre-configured jobs, or cluster profiles. For the full experience, use the DAB deployment above.
The management app cannot run via Git Folder because it requires the Databricks App runtime (which builds the React frontend and binds job resources at deploy time).
To use this approach, clone the repo into a Databricks Git Folder, then install the library in your notebook:
```python
# Install directly from GitHub (no local build needed):
%pip install git+https://github.com/databricks-industry-solutions/dbxredact.git

# Or, if you have a pre-built wheel in a UC volume:
%pip install /Volumes/your_catalog/your_schema/wheels/dbxredact-0.1.2-py3-none-any.whl
```

Then open `notebooks/4_redaction_pipeline.py`, configure the widgets at the top of the notebook, and run all cells. You will need to attach the notebook to an ML-runtime cluster yourself.
| Method | Model | When to Use |
|---|---|---|
| Presidio | spaCy `en_core_web_trf` (auto-falls back to `_lg`/`_sm`) | Fast, deterministic, no API calls |
| AI Query | Databricks LLM endpoints (default: `databricks-gpt-oss-120b`) | Context-aware detection of complex patterns |
| GLiNER | `nvidia/gliner-PII` (55+ entity types) | Transformer NER, GPU-accelerated, no API calls |
Detection methods can be used individually or in ensemble. Ensemble results are merged via configurable alignment (union or consensus).
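The two alignment modes can be pictured with a small standalone sketch. This is an illustration of the idea only, not the library's `alignment.py` implementation (which also reconciles entity types and confidence scores): union keeps every span any detector found, while consensus keeps only spans corroborated by at least one other detector via character overlap.

```python
def spans_overlap(a, b):
    """True if two entity dicts' (start, end) character spans overlap."""
    return a["start"] < b["end"] and b["start"] < a["end"]

def align_entities(per_detector_results, mode="union"):
    """Merge entity lists from several detectors.

    union: keep every span any detector found (recall-focused).
    consensus: keep only spans that overlap a span from at least one
    other detector (precision-focused).
    """
    all_entities = [
        dict(e, detector=name)
        for name, ents in per_detector_results.items()
        for e in ents
    ]
    if mode == "union":
        return all_entities
    kept = []
    for ent in all_entities:
        corroborated = any(
            other["detector"] != ent["detector"] and spans_overlap(ent, other)
            for other in all_entities
        )
        if corroborated:
            kept.append(ent)
    return kept
```

Union maximizes recall at the cost of duplicate or spurious spans; consensus drops anything only one detector saw, trading recall for precision.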
| Profile | Detectors | GLiNER Chunk | Presidio | Best For |
|---|---|---|---|---|
| Fast (default) | AI Query (low) + GLiNER + Presidio (pattern-only) | 256 words | Pattern-only (no spaCy) | Routine redaction, large-scale batch |
| Deep | All three | 256 words | spaCy trf | Compliance-critical, maximum recall |
| Custom | Manual | Manual | Manual | Experimenting with specific configs |
The fast profile achieves F1~0.89 (overlap) across benchmark datasets by combining AI Query with GLiNER and Presidio in pattern-only mode (deterministic regex for SSN, phone, MRN, dates -- no spaCy required).
When Presidio is enabled, it normally loads spaCy NER models (400+ MB). If you want deterministic pattern-based detection as a lightweight backup (SSN, phone, MRN, dates, reference IDs) without the spaCy dependency, pass presidio_pattern_only=True:
```python
from dbxredact import run_detection_pipeline

result_df = run_detection_pipeline(
    spark=spark, source_df=df, doc_id_column="doc_id", text_column="text",
    use_presidio=True, use_ai_query=True, use_gliner=True,
    presidio_pattern_only=True,  # regex-only, no spaCy
)
```

Pattern-only mode includes 25 high-precision regex recognizers covering HIPAA Safe Harbor identifiers: SSN (with and without dashes), phone, email, credit card, IP, MRN, reference IDs, age/gender, DD/MM dates, DEA numbers, NPI, labeled DOB, fax, health plan IDs, account numbers, VIN, EIN, license numbers, MBI (Medicare Beneficiary ID), passport, labeled ZIP codes, routing numbers, and ITIN. NER-based recognizers (PERSON, LOCATION) are skipped.
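For intuition, a pattern-only recognizer boils down to compiled regexes emitting entity dicts. The three patterns below are deliberately simplified illustrations, not the shipped recognizer set, which is broader and tuned for higher precision:

```python
import re

# Illustrative subset only -- the shipped set covers 25 identifier
# families with stricter validation than these toy patterns.
PATTERNS = {
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE_NUMBER": re.compile(r"\b\(?\d{3}\)?[-.\s]\d{3}[-.\s]\d{4}\b"),
    "EMAIL_ADDRESS": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def pattern_detect(text):
    """Return Presidio-style entity dicts for every regex hit."""
    return [
        {"entity": m.group(), "start": m.start(), "end": m.end(), "entity_type": etype}
        for etype, rx in PATTERNS.items()
        for m in rx.finditer(text)
    ]
```

Because there is no model involved, this path is deterministic and loads no spaCy weights -- which is why pattern-only mode suits lightweight clusters.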
Ground-truth annotations may not be perfect -- some edge cases (3-letter hospital abbreviations, bare number sequences, standalone day names) represent a ceiling that no general model will fully reach. When evaluating benchmark results:
- Focus on aligned recall (the aligned output is what drives redaction) and precision (false positives erode user trust).
- Use block lists to force-flag known entities that detectors miss, and safe lists to suppress recurring false positives.
- Re-run benchmarks after config changes to measure the actual impact.
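Strict vs. overlap matching can be made concrete with a minimal per-document scorer. This is a sketch only; the library's `evaluation.py` additionally reconciles entity types and aggregates across documents:

```python
def evaluate(predicted, gold, match="strict"):
    """Precision/recall/F1 for one document (simplified).

    strict: a prediction counts only if its (start, end) offsets match
    a gold span exactly. overlap: any character overlap counts.
    """
    def matches(p, g):
        if match == "strict":
            return p["start"] == g["start"] and p["end"] == g["end"]
        return p["start"] < g["end"] and g["start"] < p["end"]

    tp = sum(1 for g in gold if any(matches(p, g) for p in predicted))
    fn = len(gold) - tp
    fp = sum(1 for p in predicted if not any(matches(p, g) for g in gold))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

A detection that is off by one character scores zero under strict matching but full credit under overlap -- which is why the two numbers are reported side by side.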
Six pre-configured job variants ship with the bundle:
| Profile | Workers | Instance (default) | Use Case |
|---|---|---|---|
| CPU Small | 2 | i3.xlarge | Development, small datasets |
| CPU Medium | 5 | i3.xlarge | Medium workloads |
| CPU Large | 10 | i3.xlarge | Production scale |
| GPU Small | 2 | g5.xlarge | GLiNER, small datasets |
| GPU Medium | 5 | g5.xlarge | GLiNER, medium datasets |
| GPU Large | 10 | g5.xlarge | GLiNER at scale |
GPU profiles use Databricks Runtime 17.3.x-gpu-ml-scala2.13.
The default instance types are AWS-specific. Azure workspaces must override `cpu_node_type` and `gpu_node_type` in `variables.yml` or their target config:
| Type | AWS (default) | Azure equivalent |
|---|---|---|
| CPU | i3.xlarge | Standard_DS3_v2 |
| GPU | g5.xlarge | Standard_NC4as_T4_v3 |
If you see `Node type X is not supported` during deployment, set the correct node type for your cloud in `variables.yml`:

```yaml
cpu_node_type:
  default: "Standard_DS3_v2"
gpu_node_type:
  default: "Standard_NC4as_T4_v3"
```

dbxredact includes a web application deployed as a Databricks App. The app provides:
- Configuration: Create and manage detection configurations (enable/disable detectors, set thresholds, choose endpoints)
- Block / Safe Lists: Add terms to block lists (always flag) or safe lists (suppress false positives) via the UI
- Run Pipeline: Select a source table, choose a cluster profile (CPU/GPU, Small/Medium/Large), view cost estimates, and trigger redaction jobs
- Job History: Monitor pipeline runs with status tracking, cost estimates, and links to Databricks job runs
- Review: Side-by-side comparison of original and redacted text
- Labeling: Annotate entities by highlighting text to build ground-truth datasets for evaluation
- Admin: PII retention monitoring, annotation purge (`POST /api/admin/purge-annotations`), and audit log viewer
The app is defined in apps/dbxredact-app/ and deployed via the Databricks Asset Bundle (resources/app.yml).
| Variable | Default | Description |
|---|---|---|
| `DATABRICKS_WAREHOUSE_ID` | (required) | SQL warehouse for app queries |
| `CATALOG` / `SCHEMA` | (required) | Unity Catalog location for app tables |
| `ALLOWED_ORIGINS` | `*` | Comma-separated CORS origins (set for production) |
| `RETENTION_DAYS` | `90` | Days before PII retention warnings fire for annotation/ground-truth tables |
| `DEBUG` | `false` | When `true`, error responses include exception details |
```mermaid
flowchart LR
    SRC[(Source Table)] --> P["Presidio\nRule-based + NLP"]
    SRC --> AI["AI Query\nLLM Endpoint"]
    SRC --> GL["GLiNER\nnvidia/gliner-PII"]
    P & AI & GL --> ALN["Alignment\nunion | consensus"]
    ALN --> FILT["Block/Safe\nList Filter"]
    FILT --> RED["Redaction\ngeneric | typed"]
    RED --> OUT[(Redacted Table)]
```
```mermaid
flowchart TB
    subgraph app [Management App]
        CONFIG[Configure] --> RUN[Run Pipeline]
        RUN --> HISTORY[Job History]
        REVIEW[Review Results]
        LABEL[Label Entities]
        LISTS[Block/Safe Lists]
    end
    RUN -->|triggers| JOB[Databricks Job]
    JOB -->|writes| OUT[(Redacted Table)]
    LABEL -->|writes| GT[(Ground Truths Table)]
    LISTS -->|read by| JOB
```
```mermaid
flowchart TB
    subgraph lib["src/dbxredact/"]
        config["config.py\nPrompts, thresholds, RedactionConfig"]
        analyzer["analyzer.py\nPresidio analyzer setup"]
        presidio["presidio.py\nBatch UDF formatting"]
        ai_det["ai_detector.py\nAI Query prompt + UDF"]
        gliner_det["gliner_detector.py\nChunking, caching, offset remap"]
        detection["detection.py\nOrchestrator for all detectors"]
        alignment["alignment.py\nMulti-source entity merge"]
        entity_filter["entity_filter.py\nBlock/safe list filtering"]
        redaction["redaction.py\nText replacement UDFs"]
        evaluation["evaluation.py\nMetrics, error analysis"]
        judge["judge.py\nAI judge + next actions"]
        pipeline["pipeline.py\nEnd-to-end orchestration"]
    end
    config --> analyzer & ai_det & gliner_det & judge
    analyzer --> presidio
    presidio & ai_det & gliner_det --> detection
    detection --> pipeline
    alignment --> pipeline
    entity_filter --> pipeline
    redaction --> pipeline
    evaluation -.->|"benchmarking only"| pipeline
    judge -.->|"benchmarking only"| pipeline
```
```mermaid
flowchart TB
    subgraph Local["Local Dev Environment"]
        DEV["dev.env\nCATALOG, SCHEMA, HOST"]
        SCRIPT["scripts/run_benchmark.sh\n-p job_params"]
        DEPLOY["deploy.sh\npoetry build + bundle deploy"]
        LOGS["benchmark_results/\nstdout, stderr, summary"]
    end
    subgraph DAB["Databricks Asset Bundle"]
        direction TB
        VALIDATE["databricks bundle validate"]
        BDEPLOY["databricks bundle deploy"]
        WHEEL["dist/*.whl uploaded"]
    end
    subgraph BenchmarkJob["Benchmark Job (Databricks)"]
        direction TB
        N1["1. Detection\nPresidio + AI + GLiNER"]
        N2["2. Evaluation\nTP/FP/FN/TN, F1, Recall\nstrict + overlap matching"]
        N3["3. Redaction\nApply to detected entities"]
        N5["5. AI Judge\nPASS / PARTIAL / FAIL"]
        N6["6. Audit\nConsolidate all metrics"]
        N7["7. Next Actions\nAI recommendations"]
        N1 --> N2 --> N3 --> N5 --> N6 --> N7
    end
    subgraph ClusterLogs["Cluster Log Delivery"]
        CL["stdout with\n[BENCHMARK_RESULTS] tags"]
        VOL[("/Volumes/.../cluster_logs")]
    end
    DEV --> SCRIPT
    SCRIPT --> VALIDATE --> BDEPLOY --> WHEEL
    WHEEL --> N1
    N7 -->|"logs"| CL --> VOL
    SCRIPT -->|"databricks fs cp"| VOL
    VOL -->|"download + extract"| LOGS
    LOGS -->|"review & iterate"| DEV
```
| Notebook | Purpose |
|---|---|
| `4_redaction_pipeline.py` | Production redaction (full or incremental) |
| `0_load_benchmark_data.py` | Upload benchmark CSVs to Unity Catalog |
| `1_benchmarking_detection.py` | Run all detection methods |
| `2_benchmarking_evaluation.py` | Precision, recall, F1 (strict + overlap matching) |
| `3_benchmarking_redaction.py` | Apply redaction to detection results |
| `5_benchmarking_judge.py` | AI judge grades redacted output |
| `6_benchmarking_audit.py` | Consolidate metrics into audit table |
| `7_benchmarking_next_actions.py` | AI-generated improvement recommendations |
| `9_gliner_fine_tuning.py` | Fine-tune GLiNER on custom labeled data |
Synthetic benchmark data is included in `data/` with ground-truth PII annotations (NAME, DATE, LOCATION, IDNUM, CONTACT):

| File | Domain | Docs | Annotations |
|---|---|---|---|
| `synthetic_benchmark_medical.csv` | Clinical (discharge summaries, lab reports, etc.) | 10 | ~180 |
| `synthetic_benchmark_finance.csv` | Financial (wire transfers, loans, KYC, tax, etc.) | 10 | ~250 |
| `synthetic_benchmark.csv` | Combined | 20 | ~430 |

Upload a CSV to a Unity Catalog table and use it as both the source and ground truth for the benchmarking notebooks. To regenerate or customize:

```bash
python scripts/generate_synthetic_benchmark.py --domain medical|finance|all
```

Important: After regenerating CSVs, you must re-upload the data to your Unity Catalog table for the updated annotations to take effect in benchmarking. Example:

```sql
DROP TABLE IF EXISTS your_catalog.your_schema.synthetic_benchmark_medical;
-- Then re-create from CSV upload or use the Databricks UI file upload
```

For larger-scale evaluation, supply your own labeled dataset and update the widget defaults accordingly.
```python
from dbxredact import RedactionConfig, run_redaction_pipeline

config = RedactionConfig(
    use_presidio=True, use_ai_query=True, use_gliner=False,
    redaction_strategy="typed",
)
result_df = run_redaction_pipeline(
    spark=spark,
    source_table="catalog.schema.medical_notes",
    text_column="note_text",
    output_table="catalog.schema.medical_notes_redacted",
    config=config,
)
```

To run detection only, without redaction:

```python
from dbxredact import run_detection_pipeline

result_df = run_detection_pipeline(
    spark=spark,
    source_df=source_df,
    doc_id_column="doc_id",
    text_column="text",
    use_presidio=True,
    use_ai_query=True,
    endpoint="databricks-gpt-oss-120b",
)
```

To redact text directly with precomputed entities:

```python
from dbxredact import redact_text

text = "Patient John Smith (SSN: 123-45-6789) visited on 2024-01-15."
entities = [
    {"entity": "John Smith", "start": 8, "end": 18, "entity_type": "PERSON"},
    {"entity": "123-45-6789", "start": 25, "end": 36, "entity_type": "US_SSN"},
]
result = redact_text(text, entities, strategy="typed")
# "Patient [PERSON] (SSN: [US_SSN]) visited on 2024-01-15."
```

Block and safe lists are applied as post-processing filters after detection and alignment, not as Presidio custom recognizers, so they work uniformly across all three detection methods. Block lists force specific terms to always be flagged as PII; safe lists suppress false positives by removing matches. Both are stored as Unity Catalog tables (`redact_block_list`, `redact_safe_list`) and can be managed through the app UI or SQL.
See `src/dbxredact/entity_filter.py` for the `EntityFilter` API and `load_filter_from_table` to load lists from Unity Catalog tables.
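One way to picture the post-processing step -- drop safe-list hits, then force-flag block-list terms -- is the standalone helper below. The function name `apply_lists` and the `BLOCK_LIST` entity type are illustrative; the real implementation in `entity_filter.py` is richer:

```python
def apply_lists(text, entities, block_list, safe_list):
    """Post-detection filter: drop safe-list matches, force-flag block-list terms.

    Illustrative sketch only -- not the EntityFilter API.
    """
    safe = {term.lower() for term in safe_list}
    kept = [e for e in entities if e["entity"].lower() not in safe]
    flagged_spans = {(e["start"], e["end"]) for e in kept}
    lowered = text.lower()
    for term in block_list:
        start = lowered.find(term.lower())
        while start != -1:
            span = (start, start + len(term))
            if span not in flagged_spans:
                kept.append({"entity": text[span[0]:span[1]], "start": span[0],
                             "end": span[1], "entity_type": "BLOCK_LIST"})
                flagged_spans.add(span)
            start = lowered.find(term.lower(), start + 1)
    return kept
```

Because the filter runs after alignment, it sees one merged entity list and does not care which detector produced each span.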
The redaction pipeline supports incremental processing via Structured Streaming. Select `incremental` in the "Refresh Approach" widget in `4_redaction_pipeline.py`, or call `run_redaction_pipeline_streaming` directly.
- Deduplication (best effort): The streaming pipeline makes a best effort to deduplicate by `doc_id` within each micro-batch before writing via `MERGE INTO`. However, it does not guarantee cross-batch deduplication -- if the same `doc_id` appears in two different micro-batches, the later batch overwrites the earlier result rather than skipping it. For strict deduplication guarantees, use the batch pipeline, which calls `.distinct()` on the full source before processing.
- Checkpoint coupling: The streaming checkpoint is tightly coupled to the Spark query plan. If you change which detectors are enabled, switch alignment mode, or modify detection logic, delete the checkpoint directory before restarting the stream.
- `mergeSchema` is on: Switching between `production` and `validation` output strategies will widen the output table automatically.
- AI failure flagging: When AI Query returns an error for a row, the output includes `_ai_detection_failed = True` and a warning is logged. These rows still flow through redaction (using other detectors if available) but should be reviewed or retried.
- LLM non-determinism: If a micro-batch is retried after a transient failure, AI Query may produce slightly different results for the same document.
- `max_files_per_trigger`: Controls how many files each micro-batch ingests (default 10). Set to `0`/`None` for unlimited. Useful for throttling first-run backfill on large tables.
- Checkpoint path: Should be a Unity Catalog Volume path (`/Volumes/catalog/schema/volume_name/...`). Non-Volume paths (DBFS, local) may not persist across cluster restarts; a warning is emitted if one is detected.
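The within-batch deduplication caveat can be illustrated in a few lines of plain Python (illustrative policy only -- the pipeline's actual micro-batch handling runs in Spark before `MERGE INTO`, and which duplicate survives within a batch is an implementation detail):

```python
def dedupe_batch(rows):
    """Keep one row per doc_id within a single micro-batch.

    Sketch of best-effort within-batch dedup; here the last occurrence
    wins. Duplicates arriving in *different* batches are NOT caught --
    that is exactly the cross-batch caveat described above.
    """
    by_id = {}
    for row in rows:
        by_id[row["doc_id"]] = row  # later rows overwrite earlier ones
    return list(by_id.values())
```

A second batch containing an already-seen `doc_id` would pass straight through this function and overwrite the prior result at merge time.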
The pipeline enforces several safety gates to prevent accidental PII exposure or data loss:
| Guard | Trigger | Required Opt-in |
|---|---|---|
| Destructive write | `output_mode="in_place"` | `confirm_destructive=True` |
| PII in output table | `output_strategy="validation"` | `confirm_validation_output=True` |
| Low-recall alignment | `alignment_mode="consensus"` | `allow_consensus_redaction=True` |
Detection thresholds have governance floors to prevent configs that silently disable detection: `score_threshold` must be >= 0.1 and `gliner_threshold` must be >= 0.05. The `RedactionConfig` dataclass and the app's API schema both enforce these bounds.
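The bound-checking amounts to validating fields at construction time. A standalone sketch (this `ThresholdConfig` dataclass and its defaults are illustrative, not the actual `RedactionConfig`):

```python
from dataclasses import dataclass

# Floors as documented: configs below these would silently disable detection.
SCORE_FLOOR = 0.1
GLINER_FLOOR = 0.05

@dataclass
class ThresholdConfig:
    score_threshold: float = 0.5    # illustrative default
    gliner_threshold: float = 0.3   # illustrative default

    def __post_init__(self):
        if self.score_threshold < SCORE_FLOOR:
            raise ValueError(f"score_threshold must be >= {SCORE_FLOOR}")
        if self.gliner_threshold < GLINER_FLOOR:
            raise ValueError(f"gliner_threshold must be >= {GLINER_FLOOR}")
```

Enforcing the floor in `__post_init__` means no code path -- library call or app API -- can construct a config that turns detection off by accident.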
The pipeline reads only `doc_id` and the specified `text_column` from the source table. Other source columns are not carried to the output.
- Production mode (default): Output contains `doc_id`, `{text_column}_redacted`, `_detection_status`, and `_entity_count`.
- Validation mode: Output includes all intermediate columns (raw detector results, aligned entities, redacted text) for debugging. Requires `confirm_validation_output=True` since it persists raw PII.
| Column | Description |
|---|---|
| `doc_id` | Document identifier (join key) |
| `{text_column}_redacted` | Redacted text |
| `_detection_status` | `ok`, `no_entities`, or `detection_error` |
| `_entity_count` | Number of entities detected in the document |
To join redacted output back to your original table, use doc_id as the key:
```sql
SELECT o.*, r.text_redacted
FROM catalog.schema.original o
JOIN catalog.schema.redacted r ON o.doc_id = r.doc_id
```

Currently, each pipeline run processes a single text column. Multi-column support is on the roadmap. For now, run the pipeline once per column and join outputs downstream on `doc_id`:
```python
for col in ["notes", "address", "comments"]:
    run_redaction_pipeline(spark, source_table=..., text_column=col,
                           output_table=f"..._{col}_redacted", ...)
```
- GLiNER: Handles long texts internally via word-boundary chunking with automatic offset correction (`_chunk_and_predict`). No user-side chunking needed.
- AI Query: Subject to the LLM endpoint's context window / token limit. Documents exceeding the limit will be truncated by the endpoint.
- Presidio: Processes text in-memory via spaCy. No hard cap, but very large documents may be slow.
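The chunk-then-remap idea behind GLiNER's long-text handling can be sketched generically: split on word boundaries, run the detector per chunk, and shift each hit's offsets back into full-document coordinates. This is a simplified illustration, not the actual `_chunk_and_predict` in `gliner_detector.py`:

```python
def chunk_and_detect(text, detector, chunk_words=256):
    """Run `detector` over word-boundary chunks and remap offsets.

    `detector` takes a string and returns entity dicts with chunk-local
    (start, end) offsets; the returned entities use full-document offsets.
    """
    words = text.split(" ")
    entities, pos = [], 0
    for i in range(0, len(words), chunk_words):
        chunk = " ".join(words[i:i + chunk_words])
        for ent in detector(chunk):
            entities.append(dict(ent, start=ent["start"] + pos, end=ent["end"] + pos))
        pos += len(chunk) + 1  # +1 for the space joining consecutive chunks
    return entities
```

The offset correction is what lets downstream redaction slice the original, unchunked text with the detector's spans.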
By default the pipeline writes to a separate output table. To overwrite the text column directly in the source table, use `output_mode="in_place"`. This uses `MERGE INTO`, so only processed rows are touched.
```python
from dbxredact import run_redaction_pipeline

run_redaction_pipeline(
    spark=spark,
    source_table="catalog.schema.notes",
    text_column="note_text",
    output_mode="in_place",
    confirm_destructive=True,  # required -- operation is irreversible
    config=config,
)
```

Instead of passing 25+ keyword arguments, use `RedactionConfig` to bundle pipeline settings. When `config` is provided, its fields override any matching kwargs.
```python
from dbxredact import RedactionConfig, run_redaction_pipeline

config = RedactionConfig(
    use_presidio=True,
    use_ai_query=True,
    use_gliner=False,
    presidio_pattern_only=True,
    score_threshold=0.5,
    redaction_strategy="typed",
    output_strategy="production",
)
result_df = run_redaction_pipeline(
    spark=spark,
    source_table="catalog.schema.notes",
    text_column="note_text",
    output_table="catalog.schema.notes_redacted",
    config=config,
)
```

Pass `audit_table` to write per-document entity-type counts (no raw PII) for compliance tracking:
```python
run_redaction_pipeline(
    ...,
    audit_table="catalog.schema.redact_audit_log",
)
```

The audit log records `run_id`, `doc_id`, `entity_type`, `entity_count`, `detection_status`, `detectors_used`, and a config snapshot. The management app creates this table automatically on startup.
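The shape of those audit rows -- counts per document and entity type, with no raw PII retained -- can be sketched with a small helper. `audit_rows` is illustrative only (the pipeline writes these via Spark, and the real rows also carry status fields and the config snapshot):

```python
from collections import Counter

def audit_rows(run_id, doc_id, entities):
    """Collapse detected entities into per-document, per-type counts.

    Only entity types and counts survive -- the entity text is dropped,
    which is how the audit table avoids storing raw PII.
    """
    counts = Counter(e["entity_type"] for e in entities)
    return [
        {"run_id": run_id, "doc_id": doc_id, "entity_type": etype, "entity_count": n}
        for etype, n in sorted(counts.items())
    ]
```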
```
dbxredact/
  databricks.yml.template      # DAB config template (deploy.sh generates databricks.yml)
  deploy.sh                    # Build, configure, and deploy script
  pyproject.toml               # Poetry dependencies and build config
  variables.yml                # Bundle variables (catalog, schema, etc.)
  example.env                  # Template for dev.env / prod.env
  src/dbxredact/               # Core Python library
    config.py                  # Prompts, thresholds, RedactionConfig dataclass
    analyzer.py                # Custom Presidio recognizers (HIPAA Safe Harbor)
    detection.py               # Orchestrator for all detectors
    gliner_detector.py         # GLiNER batch UDF with worker-level model cache
    presidio.py                # Presidio batch UDF
    ai_detector.py             # AI Query prompt construction + UDF
    alignment.py               # Multi-source entity merge (union/consensus)
    redaction.py               # Text replacement UDFs (with span merging)
    entity_filter.py           # Block/safe list filtering
    pipeline.py                # End-to-end orchestration + safety guards
    evaluation.py              # Metrics (precision, recall, F1)
    judge.py                   # AI judge + next actions
    cost.py                    # Token cost estimation constants
  notebooks/                   # Databricks notebooks
    4_redaction_pipeline.py    # Production redaction pipeline
    0_load_benchmark_data.py   # Upload benchmark CSVs to UC
    1-7_benchmarking_*.py      # Benchmarking pipeline steps
    9_gliner_fine_tuning.py    # GLiNER fine-tuning notebook
  apps/dbxredact-app/          # Management web app
    app.py                     # FastAPI entry point + SPA serving
    api/                       # Backend API routes and services
    src/                       # React frontend source
    package.json               # Node.js dependencies
  resources/                   # DAB resource definitions
    jobs.yml                   # Job definitions (CPU/GPU x Small/Medium/Large)
    app.yml                    # App resource definition
  scripts/                     # Utility scripts
  data/                        # Synthetic benchmark datasets
  tests/                       # Unit tests
```
```bash
poetry install --with dev
poetry run pytest tests/ -x -q --ignore=tests/integration
```

Integration tests (`tests/integration/`) require a live Spark cluster and are excluded from local/CI runs.
Use an ML-runtime cluster; serverless compute is not currently supported. GLiNER inference benefits from GPU acceleration, but the other detectors do not.
| Library | Version | License | Description | PyPI |
|---|---|---|---|---|
| presidio-analyzer | 2.2.358 | MIT | Microsoft Presidio PII detection engine | PyPI |
| presidio-anonymizer | 2.2.358 | MIT | Microsoft Presidio anonymization engine | PyPI |
| spacy | 3.8.7 | MIT | Industrial-strength NLP library | PyPI |
| gliner | >=0.2.0 | Apache 2.0 | Generalist NER using bidirectional transformers | PyPI |
| rapidfuzz | >=3.0.0 | MIT | Fast fuzzy string matching | PyPI |
| pydantic | >=2.0.0 | MIT | Data validation using Python type hints | PyPI |
| pyyaml | >=6.0.1 | MIT | YAML parser and emitter | PyPI |
| databricks-sdk | >=0.30.0 | Apache 2.0 | Databricks SDK for Python | PyPI |
| Model | License | Description | HuggingFace |
|---|---|---|---|
| nvidia/gliner-PII | NVIDIA Open Model License | PII/PHI-focused NER model with 55+ entity types | HuggingFace |
This solution accelerator uses the nvidia/gliner-PII model in accordance with the NVIDIA Open Model License. The NVIDIA Open Model License permits commercial use and is compatible with the DB License under which this accelerator is released. However, use of the model itself is governed by NVIDIA's license terms, not the DB License. Customers are responsible for reviewing and complying with the NVIDIA Open Model License independently before deploying in production.
The default is en_core_web_trf (RoBERTa transformer, NER F1 ~90.2%). The code auto-falls back to _lg or _sm if _trf is not installed. Install the best model available for your cluster:
| Model | NER F1 | Size | GPU | License | Install |
|---|---|---|---|---|---|
| en_core_web_trf (recommended) | 90.2% | 438 MB | Recommended | MIT | spaCy Models |
| en_core_web_lg | 85.4% | 560 MB | No | MIT | spaCy Models |
| en_core_web_sm | 84.6% | 12 MB | No | MIT | spaCy Models |
| Library | License | Description |
|---|---|---|
| pandas | BSD-3-Clause | Data manipulation library |
| pyspark | Apache 2.0 | Apache Spark Python API |
| pyarrow | Apache 2.0 | Apache Arrow Python bindings |
All dependencies use permissive open-source licenses (MIT, Apache 2.0, BSD-3-Clause). No copyleft (GPL) dependencies.
Set these variables in `variables.yml` before running `./deploy.sh prod`:

- `current_user`: Service principal or user email for `run_as` permissions (e.g. `deployer@company.com`)
- `current_working_directory`: Workspace path for artifact deployment (e.g. `/Workspace/Shared/dbxredact`)
The prod target enforces run_as permissions and requires explicit user configuration.
This is a solution accelerator -- it provides tooling to assist with PII/PHI detection and redaction, but all compliance obligations remain with the user. This includes but is not limited to:
- HIPAA: You are responsible for ensuring your deployment meets HIPAA requirements (encryption, access controls, audit logging, BAAs, etc.)
- GDPR, CCPA, and other privacy regulations: Evaluate whether your use of this tool satisfies applicable data protection laws
- Validation: You must verify that redaction results are complete and accurate for your specific data and use case
- Data Encryption: Enable encryption at rest and in transit in your Databricks workspace
- Access Controls: Configure appropriate table/catalog permissions in Unity Catalog
- Audit Logging: Enable workspace audit logs for compliance tracking
Databricks makes no guarantees that use of this tool alone is sufficient for regulatory compliance.