Add Presidio text anonymization scaffold #233

Open
XxSURYANSHxX wants to merge 1 commit into healthyinc:dev from XxSURYANSHxX:gsoc-week2-presidio-text-anonymization

@XxSURYANSHxX

@pradeeban, @karthiksathishjeemain, @Chali-healthy: this PR starts the Week 2 work for my GSoC project:

Bio-Block: Advanced PHI Anonymization and Hybrid Data Retrieval Pipeline

The main goal of this PR is to replace the Week 1 text placeholder handler with a real clinical text anonymization pipeline.

In Week 1, the ingestion endpoint could detect file modality and route files to the correct handler, but all handlers were still placeholders. This PR makes only the text handler real.

After this PR:

  • Text uploads now use real anonymization.
  • CSV uploads still use the placeholder handler.
  • DICOM uploads still use the placeholder handler.
  • NIfTI uploads still use the placeholder handler.
  • WSI uploads still use the placeholder handler.
  • IPFS, encryption, indexing, and blockchain steps still remain pending.

This keeps the PR small, focused, and limited to the Week 2 text anonymization scope.


Why This PR Is Needed

The Week 2 proposal scope is:

  • Presidio integration
  • Custom clinical recognizers
  • Deterministic surrogate generation
  • Safe entity summaries
  • Integration with the Week 1 ingestion endpoint

Before this PR, the text ingestion flow only selected a placeholder handler. It did not actually anonymize clinical text.

This PR adds the first real anonymization path for uploaded text files while avoiding unrelated backend changes.


Main Changes

1. Added a Clinical Text Anonymization Service

A new service was added:

python_backend/services/text_anonymization.py

This service exposes a focused function:

anonymize_clinical_text(
    text: str,
    profile: str = "strict",
    study_salt: str | None = None,
) -> dict

The function:

  • Validates the input text.
  • Rejects empty text.
  • Runs the text through Presidio's analyzer.
  • Registers custom clinical recognizers.
  • Replaces detected PHI with safe surrogates or redaction tokens.
  • Returns anonymized text.
  • Returns only safe entity counts.
  • Does not return raw detected PHI values.

Example response shape:

{
  "anonymization_status": "completed",
  "anonymized_text": "Patient has MRN_A1B2C3D4 and email <REDACTED_EMAIL>.",
  "detected_entities": {
    "MEDICAL_RECORD_NUMBER": 1,
    "EMAIL_ADDRESS": 1
  }
}
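
As a rough illustration of how a response of this shape could be assembled, here is a minimal sketch. It is not the code in this PR: the `Detection` record, the `build_response` helper, and the use of `ValueError` for empty input are all assumptions made for the example. It applies replacements from right to left so earlier offsets stay valid, and reports only per-type counts.

```python
from collections import Counter
from typing import NamedTuple


class Detection(NamedTuple):
    """Hypothetical detection record; not the PR's actual type."""
    entity_type: str
    start: int
    end: int
    replacement: str


def build_response(text: str, detections: list[Detection]) -> dict:
    """Replace detected spans and return anonymized text plus safe counts."""
    if not text or not text.strip():
        # The real service's error type is not shown in the PR; ValueError is assumed.
        raise ValueError("text must not be empty")

    anonymized = text
    # Work backwards through the text so that replacing a later span
    # never shifts the offsets of an earlier one.
    for det in sorted(detections, key=lambda d: d.start, reverse=True):
        anonymized = anonymized[:det.start] + det.replacement + anonymized[det.end:]

    return {
        "anonymization_status": "completed",
        "anonymized_text": anonymized,
        # Only entity types and counts leave this function, never raw values.
        "detected_entities": dict(Counter(d.entity_type for d in detections)),
    }
```

The sketch assumes the detections do not overlap; overlapping spans would need to be merged or prioritized before this step.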

Presidio Integration

This PR uses Microsoft Presidio for the text anonymization pipeline.

The new service uses Presidio's AnalyzerEngine with custom PatternRecognizer instances.

One important implementation detail is that the new Week 2 text service avoids hidden runtime model downloads.

Presidio's default setup can try to load or download a spaCy model if none is available. To avoid that, this PR uses a blank local spaCy tokenizer for pattern-based recognition. This keeps the new text ingestion path deterministic and safe for local development and CI.

No spaCy model download is required for this new Week 2 text ingestion path.


Custom Clinical Recognizers Added

This PR adds custom recognizers for clinical identifiers that general PII recognizers may miss.

The added clinical entity types are:

  • MEDICAL_RECORD_NUMBER
  • PATIENT_ID
  • HEALTH_PLAN_ID
  • ACCESSION_NUMBER
  • DEVICE_ID

The recognizers are designed to be conservative.

For example, this should be detected:

MRN: 123456

But this should not be blindly treated as an MRN:

The room number 123456 was cleaned.

This is important because clinical notes often contain many numbers that are not identifiers.


Medical Record Number Recognition

The MRN recognizer supports clinical context such as:

  • MRN
  • medical record
  • medical record number
  • hospital number
  • chart number

Supported examples include:

MRN: 123456
MRN 123456
Medical Record Number - 123456

The recognizer is case-insensitive, so both of these are supported:

MRN: 123456
mrn: 123456
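
To make the context-anchored matching concrete, here is a small standard-library sketch. It is not the Presidio recognizer from this PR: the regular expression, the accepted separators, and the 4 to 12 digit range are assumptions chosen for illustration. Only the context phrases come from the list above.

```python
import re

# Context phrases from the list above; longer phrases come first so that
# "medical record number" is preferred over "medical record".
MRN_CONTEXT = r"(?:medical record number|medical record|hospital number|chart number|mrn)"

# Assumed shape: context phrase, an optional separator (":", "#" or "-"),
# then a 4-12 digit value. The digit range is an illustrative guess.
MRN_PATTERN = re.compile(
    rf"\b{MRN_CONTEXT}\b\s*[:#-]?\s*(?P<value>\d{{4,12}})\b",
    re.IGNORECASE,
)


def find_mrns(text: str) -> list[tuple[int, int]]:
    """Return (start, end) offsets of the MRN value only, not the label."""
    return [m.span("value") for m in MRN_PATTERN.finditer(text)]


print(find_mrns("MRN: 123456"))                          # [(5, 11)]
print(find_mrns("Medical Record Number - 123456"))       # [(24, 30)]
print(find_mrns("The room number 123456 was cleaned."))  # []
```

Requiring the label before the digits is what keeps a bare number such as a room number from being treated as an identifier.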

Patient ID Recognition

The Patient ID recognizer supports context such as:

  • patient id
  • patient number
  • patient identifier
  • pt id

Example:

Patient ID PT-1001 was admitted.

The raw ID is replaced with a deterministic surrogate like:

PATIENT_ID_XXXXXXXX

Health Plan / Insurance ID Recognition

The health plan recognizer supports context such as:

  • health plan
  • beneficiary
  • insurance
  • policy
  • member id
  • subscriber id

Example:

Insurance ID ABC123456789 was verified.

The raw value is replaced with a deterministic surrogate like:

HEALTH_PLAN_XXXXXXXX

Additional Clinical Recognizers

This PR also adds recognizers for:

Accession Number

Example contexts:

  • accession
  • accession number
  • acc no
  • accession no

Replacement format:

ACCESSION_XXXXXXXX

Device ID / Serial Number

Example contexts:

  • device
  • device id
  • serial
  • serial number
  • implant
  • equipment

Replacement format:

DEVICE_XXXXXXXX
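
One way to keep several recognizers like these consistent is to describe them as data and build the patterns from a single template. The sketch below is a hypothetical standard-library version, not the Presidio `PatternRecognizer` setup used in this PR. The `ClinicalRecognizer` class, the optional "id/no/number" suffix, and the value shape (4 to 20 letters, digits, or hyphens with at least one digit) are assumptions; the context terms and prefixes are taken from the sections above.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ClinicalRecognizer:
    """One row of a hypothetical recognizer table (not the PR's actual type)."""
    entity_type: str
    surrogate_prefix: str
    context_terms: tuple[str, ...]

    def pattern(self) -> re.Pattern:
        # Longest terms first so "device id" wins over "device".
        terms = sorted(self.context_terms, key=len, reverse=True)
        alternation = "|".join(re.escape(t) for t in terms)
        return re.compile(
            rf"\b(?:{alternation})\b(?:\s+(?:number|no|id))?\s*[:#-]?\s*"
            # Assumed value shape: must contain at least one digit.
            rf"(?P<value>(?=[A-Za-z0-9-]*\d)[A-Za-z0-9][A-Za-z0-9-]{{3,19}})\b",
            re.IGNORECASE,
        )


RECOGNIZERS = (
    ClinicalRecognizer("PATIENT_ID", "PATIENT_ID",
                       ("patient id", "patient number", "patient identifier", "pt id")),
    ClinicalRecognizer("HEALTH_PLAN_ID", "HEALTH_PLAN",
                       ("health plan", "beneficiary", "insurance", "policy",
                        "member id", "subscriber id")),
    ClinicalRecognizer("ACCESSION_NUMBER", "ACCESSION",
                       ("accession", "accession number", "acc no", "accession no")),
    ClinicalRecognizer("DEVICE_ID", "DEVICE",
                       ("device", "device id", "serial", "serial number",
                        "implant", "equipment")),
)

# Compile once at import time rather than on every call.
PATTERNS = {rec.entity_type: rec.pattern() for rec in RECOGNIZERS}


def find_clinical_ids(text: str) -> list[tuple[str, int, int]]:
    """Return (entity_type, start, end) for each matched identifier value."""
    return [
        (entity_type, *m.span("value"))
        for entity_type, pattern in PATTERNS.items()
        for m in pattern.finditer(text)
    ]
```

With these assumptions, "Patient ID PT-1001 was admitted." yields a single `PATIENT_ID` span covering `PT-1001`, while "The device was replaced." yields nothing because no digit-bearing value follows the context word.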

Deterministic Surrogate Generation

This PR adds deterministic surrogate generation using salted SHA-256 hashes.

The behavior is:

  • Same original value plus same salt gives the same surrogate.
  • Same original value plus different salt gives a different surrogate.
  • Different original values give different surrogates, barring a collision in the shortened hash suffix.
  • The surrogate does not contain the original value.
  • No reversible mapping is stored.
  • Raw PHI values are not returned in the API response.

Examples:

MRN: 123456

becomes something like:

MRN_A1B2C3D4

Patient ID PT-1001

becomes something like:

PATIENT_ID_9F8E7D6C

Insurance ID ABC123456789

becomes something like:

HEALTH_PLAN_12AB34CD

The implementation uses SHA-256 from the Python standard library.
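
For readers who want to see the idea in code, here is a minimal sketch of salted SHA-256 surrogate generation. The function name `make_surrogate`, the `salt:value` concatenation, and the 8-character uppercase suffix are assumptions inferred from the example outputs above, not a copy of the service code.

```python
import hashlib


def make_surrogate(prefix: str, value: str, salt: str, length: int = 8) -> str:
    """Derive a stable, non-reversible token for a detected identifier."""
    # How the salt and value are combined is an assumption of this sketch.
    digest = hashlib.sha256(f"{salt}:{value}".encode("utf-8")).hexdigest()
    # Keep only a short uppercase suffix, matching the MRN_A1B2C3D4 style.
    return f"{prefix}_{digest[:length].upper()}"


same_a = make_surrogate("MRN", "123456", salt="study-a")
same_b = make_surrogate("MRN", "123456", salt="study-a")
other = make_surrogate("MRN", "123456", salt="study-b")
print(same_a == same_b)  # True: same value and salt give the same surrogate
print(same_a == other)   # False: a different salt changes the surrogate
```

Because only a short prefix of the digest is kept, two different inputs can in principle map to the same suffix; a longer suffix reduces that risk at the cost of longer tokens.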


Common PHI Handling

This PR also handles several common PHI patterns.

Email Addresses

Email addresses are replaced with:

<REDACTED_EMAIL>

Example:

john.doe@example.com

becomes:

<REDACTED_EMAIL>

Phone Numbers

Phone numbers are replaced with:

<REDACTED_PHONE>

Example:

555-123-4567

becomes:

<REDACTED_PHONE>

SSNs

SSN-like values are replaced with:

<REDACTED_SSN>

Dates

Common date patterns are currently replaced with:

<REDACTED_DATE>

Date shifting is not implemented in this first Week 2 slice. This PR keeps date handling simple and safe by redacting common date formats for now.
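
The redaction behavior described in this section can be approximated with plain regular expressions. The sketch below is only an illustration: the patterns are simplified guesses, and the entity names other than `EMAIL_ADDRESS` are assumed (they follow Presidio's built-in naming). It does not reproduce the recognizers registered in this PR.

```python
import re

# (entity type, replacement token, pattern). Order matters: SSNs are handled
# before phone numbers so the two digit-and-hyphen shapes cannot be confused.
REDACTION_RULES = [
    ("EMAIL_ADDRESS", "<REDACTED_EMAIL>",
     re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    ("US_SSN", "<REDACTED_SSN>",
     re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("PHONE_NUMBER", "<REDACTED_PHONE>",
     re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")),
    # Only ISO (YYYY-MM-DD) and slash-separated dates are covered here.
    ("DATE_TIME", "<REDACTED_DATE>",
     re.compile(r"\b(?:\d{4}-\d{2}-\d{2}|\d{1,2}/\d{1,2}/\d{2,4})\b")),
]


def redact_common_phi(text: str) -> tuple[str, dict[str, int]]:
    """Replace common PHI patterns and return the text plus per-type counts."""
    counts: dict[str, int] = {}
    for entity_type, token, pattern in REDACTION_RULES:
        text, replaced = pattern.subn(token, text)
        if replaced:
            counts[entity_type] = replaced
    return text, counts


redacted, counts = redact_common_phi(
    "Contact john.doe@example.com or 555-123-4567 on 2024-01-15."
)
print(redacted)  # Contact <REDACTED_EMAIL> or <REDACTED_PHONE> on <REDACTED_DATE>.
```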


Ingestion Endpoint Integration

The Week 1 ingestion endpoint is:

POST /api/v1/ingest

This PR updates the ingestion flow so that text files now go through the real anonymization service.

For text uploads, the endpoint now:

  • Detects the uploaded file as text.
  • Reads the text content safely.
  • Applies a 256 KiB text upload limit.
  • Decodes the file as UTF-8.
  • Rejects unsupported encoding clearly.
  • Calls the clinical text anonymization service.
  • Returns anonymization_status: "completed".
  • Returns anonymized text.
  • Returns safe entity counts only.
  • Keeps downstream pipeline steps as pending.

For non-text uploads, the endpoint still returns the Week 1 placeholder response.
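
The split between the real text path and the placeholder path could look roughly like the sketch below. Everything here is hypothetical: the extension map, the response keys other than `anonymization_status`, and the injected `anonymize` callable stand in for the actual endpoint code, which is not shown in this description. Size and encoding checks are left out to keep the routing visible.

```python
from pathlib import PurePath
from typing import Callable

# Illustrative extension-to-modality map; the real detection logic may differ.
MODALITY_BY_SUFFIX = {
    ".txt": "text",
    ".csv": "csv",
    ".dcm": "dicom",
    ".nii": "nifti",
    ".svs": "wsi",
}

PENDING_STEPS = ("ipfs", "encryption", "indexing", "blockchain")


def ingest(filename: str, raw: bytes, anonymize: Callable[[str], dict]) -> dict:
    """Route an upload: text is anonymized, other modalities stay placeholders."""
    suffix = PurePath(filename).suffix.lower()
    modality = MODALITY_BY_SUFFIX.get(suffix)
    if modality is None:
        raise ValueError(f"unsupported file type: {filename}")

    response = {
        "filename": filename,
        "modality": modality,
        "anonymization_status": "placeholder",
        # Downstream steps are reported as pending for every modality.
        "downstream": {step: "pending" for step in PENDING_STEPS},
    }
    if modality == "text":
        # The service result overrides the placeholder status.
        response.update(anonymize(raw.decode("utf-8")))
    return response
```

Passing the anonymizer in as a parameter keeps the routing easy to test with a stub, which is one possible design and not necessarily the one used in `python_backend/services/ingestion.py`.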


Response Safety

This PR is careful about not exposing raw PHI.

The API response may include:

  • Filename
  • Detected modality
  • Privacy profile
  • Handler name
  • Routing status
  • Anonymization status
  • Anonymized text
  • Safe detected entity summary
  • Downstream pending states

The API response does not include:

  • Raw uploaded text separately
  • Raw detected PHI values
  • Raw entity examples
  • Debug traces
  • Internal stack traces
  • Reversible mappings

The entity summary only includes entity types and counts.

Safe example:

{
  "MEDICAL_RECORD_NUMBER": 1,
  "EMAIL_ADDRESS": 1
}

Unsafe example not used by this PR:

{
  "detected_values": ["123456", "john.doe@example.com"]
}
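
A simple way to check this property in a test is to serialize the whole response and confirm that none of the known raw input values appear in it. The helper below is a hypothetical sketch of that idea, not a function from this PR.

```python
import json


def leaks_raw_values(response: dict, raw_values: list[str]) -> bool:
    """Return True if any known raw PHI value appears anywhere in the response."""
    # ensure_ascii=False keeps non-ASCII characters unescaped so that
    # substring checks also work for values such as accented names.
    serialized = json.dumps(response, ensure_ascii=False)
    return any(value in serialized for value in raw_values)


raw = ["123456", "john.doe@example.com"]
print(leaks_raw_values({"MEDICAL_RECORD_NUMBER": 1, "EMAIL_ADDRESS": 1}, raw))  # False
print(leaks_raw_values({"detected_values": raw}, raw))                          # True
```

This is a coarse substring check, so a short raw value could also match unrelated text; it is meant as a test-time guard, not a proof of safety.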

Text Upload Size Guard

This PR adds a text upload size guard:

256 KiB

If a text upload is larger than the limit, the endpoint returns a clear error.

This avoids accidentally reading very large text files into memory during this first implementation slice.

Streaming or chunked anonymization is not implemented yet.
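
A size guard of this kind can be written so that an oversized upload is detected without reading all of it. The sketch below assumes a buffered binary stream with a `read(n)` method; the `UploadTooLarge` exception and the `read_limited` helper are hypothetical names, not the ones used in the endpoint.

```python
import io
from typing import BinaryIO

MAX_TEXT_UPLOAD_BYTES = 256 * 1024  # 256 KiB


class UploadTooLarge(ValueError):
    """Hypothetical error type for uploads above the limit."""


def read_limited(stream: BinaryIO, limit: int = MAX_TEXT_UPLOAD_BYTES) -> bytes:
    """Read at most `limit` bytes, rejecting anything larger."""
    # Ask for one byte more than the limit: that is enough to tell that the
    # upload is too large without loading the rest of it into memory.
    data = stream.read(limit + 1)
    if len(data) > limit:
        raise UploadTooLarge(f"Text uploads are limited to {limit // 1024} KiB")
    return data


small = read_limited(io.BytesIO(b"short clinical note"))
print(len(small))  # 19
```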


UTF-8 Handling

Text files are decoded as UTF-8.

If the uploaded text file is not valid UTF-8, the endpoint returns a clear error instead of failing with an internal exception.

Example error:

Text uploads must be UTF-8 encoded
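
In Python, this behavior usually comes down to catching `UnicodeDecodeError` and turning it into a client-facing error. The sketch below shows that pattern; the `decode_text_upload` name and the use of `ValueError` are assumptions, and the real endpoint presumably maps the failure to an HTTP error response.

```python
def decode_text_upload(raw: bytes) -> str:
    """Decode uploaded bytes as strict UTF-8 or fail with a clear message."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        # Re-raise with a stable message instead of leaking decoder details.
        raise ValueError("Text uploads must be UTF-8 encoded") from exc


print(decode_text_upload("Patient café visit".encode("utf-8")))  # Patient café visit
```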

What Was Intentionally Not Changed

This PR does not implement any non-text anonymization work.

Not included:

  • DICOM metadata scrubbing
  • NIfTI metadata scrubbing
  • OCR pixel redaction
  • WSI tiling
  • CSV k-anonymity
  • l-diversity
  • t-closeness
  • BM25
  • ChromaDB retrieval changes
  • Semantic search
  • RRF/MMR
  • IPFS integration
  • CID encryption
  • Blockchain transaction flow

This PR also does not touch:

  • /store
  • /store_enhanced
  • Existing search endpoints
  • ChromaDB collections
  • Blockchain/IPFS code
  • Unrelated preview logic

The goal was to keep this PR focused only on Week 2 text anonymization.


Files Changed

Added

python_backend/services/text_anonymization.py
python_backend/tests/test_text_anonymization.py

Updated

python_backend/services/ingestion.py
python_backend/main.py
python_backend/tests/test_ingestion.py

Tests Added

A new service-level test file was added:

python_backend/tests/test_text_anonymization.py

The ingestion test file was also updated:

python_backend/tests/test_ingestion.py

The tests cover:

  • MRN detection with context
  • Avoiding MRN false positives for random numbers
  • Case-insensitive MRN recognition
  • Separator variations for MRN values
  • Patient ID detection and replacement
  • Health plan / insurance ID detection and replacement
  • Deterministic surrogate behavior with the same salt
  • Different surrogate behavior with a different salt
  • Email redaction
  • Phone redaction
  • Medical term preservation
  • No-PHI text success
  • Empty text rejection
  • Safe entity summary shape
  • Overlapping email handling
  • Text ingestion through /api/v1/ingest
  • MIME mismatch routing by .txt extension
  • Unsupported encoding rejection
  • Large text upload rejection
  • Non-text modalities still returning placeholders

Test Results

I ran the focused Week 2 tests.

Text anonymization service tests

Command:

py -3.11 -m pytest tests/test_text_anonymization.py

Result:

14 passed

Ingestion endpoint tests

Command:

py -3.11 -m pytest tests/test_ingestion.py

Result:

15 passed

There were a few existing dependency warnings during the ingestion tests, but there were no test failures.


Dependency Notes

No new dependencies were added in this PR.

The required packages were already present in:

python_backend/requirements.txt

Already present:

presidio-analyzer
presidio-anonymizer
spacy

No spaCy model download is required for the new Week 2 text ingestion path.


Current Behavior by Modality

Text

Text now uses real anonymization.

Status:

completed

CSV

CSV still uses the placeholder handler.

Status:

placeholder

DICOM

DICOM still uses the placeholder handler.

Status:

placeholder

NIfTI

NIfTI still uses the placeholder handler.

Status:

placeholder

WSI

WSI still uses the placeholder handler.

Status:

placeholder

Current Limitations

This is the first Week 2 implementation slice, so a few things are intentionally left for later.

Current limitations:

  • PERSON (name) detection is not enabled, because it needs a reliable spaCy model and this path does not load one.
  • Date shifting is not implemented yet.
  • Common date patterns are currently redacted instead of shifted.
  • The default salt is only a development fallback.
  • The real production source for study_salt still needs mentor confirmation.
  • Streaming/chunked anonymization for large text files is not implemented yet.

Privacy Notes

This PR uses only synthetic test examples.

Examples used in tests include fake values such as:

John Doe
MRN: 123456
john.doe@example.com
555-123-4567
PT-1001
ABC123456789

No real PHI was added.

The implementation does not print raw uploaded text or raw detected values.

The entity summary is safe because it only contains counts by entity type.

Please let me know your feedback. If any changes are required, I will push more commits as I continue to improve this. Thank you so much :)
