Add Presidio text anonymization scaffold#233
Open
XxSURYANSHxX wants to merge 1 commit into
@pradeeban, @karthiksathishjeemain, @Chali-healthy. This PR starts the Week 2 work for my GSoC project:
Bio-Block: Advanced PHI Anonymization and Hybrid Data Retrieval Pipeline
The main goal of this PR is to replace the Week 1 text placeholder handler with a real clinical text anonymization pipeline.
In Week 1, the ingestion endpoint could detect file modality and route files to the correct handler, but all handlers were still placeholders. This PR makes only the text handler real.
After this PR:
This keeps the PR small, focused, and limited to the Week 2 text anonymization scope.
Why This PR Is Needed
The Week 2 proposal scope is:
Before this PR, the text ingestion flow only selected a placeholder handler. It did not actually anonymize clinical text.
This PR adds the first real anonymization path for uploaded text files while avoiding unrelated backend changes.
Main Changes
1. Added a Clinical Text Anonymization Service
A new service was added:
This service exposes a focused function:
The function:
Example response shape:
{
  "anonymization_status": "completed",
  "anonymized_text": "Patient has MRN_A1B2C3D4 and email <REDACTED_EMAIL>.",
  "detected_entities": {
    "MEDICAL_RECORD_NUMBER": 1,
    "EMAIL_ADDRESS": 1
  }
}
Presidio Integration
This PR uses Microsoft Presidio for the text anonymization pipeline.
The new service uses Presidio's AnalyzerEngine with custom PatternRecognizer instances.
One important implementation detail is that the new Week 2 text service avoids hidden runtime model downloads.
Presidio's default setup can try to load or download a spaCy model if none is available. To avoid that, this PR uses a blank local spaCy tokenizer for pattern-based recognition. This keeps the new text ingestion path deterministic and safe for local development and CI.
No spaCy model download is required for this new Week 2 text ingestion path.
Custom Clinical Recognizers Added
This PR adds custom recognizers for clinical identifiers that general PII recognizers may miss.
The added clinical entity types are:
MEDICAL_RECORD_NUMBER
PATIENT_ID
HEALTH_PLAN_ID
ACCESSION_NUMBER
DEVICE_ID
The recognizers are designed to be conservative.
For example, this should be detected:
But this should not be blindly treated as an MRN:
This is important because clinical notes often contain many numbers that are not identifiers.
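As a rough illustration of context-gated matching, the sketch below accepts a number as an MRN only when it directly follows a context term. It uses the standard-library `re` module rather than Presidio, and the pattern, the ID shape (up to three letters followed by 6 to 10 digits), and the name `find_mrn_spans` are assumptions made for this example, not the recognizer shipped in this PR.

```python
import re

# Hypothetical context terms and ID shape, chosen only for illustration.
MRN_PATTERN = re.compile(
    r"\b(?:MRN|medical record(?: number)?|hospital number|chart number)"
    r"\s*(?:#|:|no\.?)?\s*"
    r"(?P<value>[A-Z]{0,3}\d{6,10})\b",
    re.IGNORECASE,  # context terms match regardless of case
)


def find_mrn_spans(text: str) -> list[tuple[int, int]]:
    """Return (start, end) offsets of ID values that follow an MRN context term."""
    return [m.span("value") for m in MRN_PATTERN.finditer(text)]


if __name__ == "__main__":
    note = "Patient MRN: 12345678 was admitted."
    for start, end in find_mrn_spans(note):
        # Print offsets only; the matched value itself is never echoed.
        print(f"MRN-like value at characters {start}-{end}")
```

A bare number with no context term nearby produces no match, which is the conservative behavior described above.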
Medical Record Number Recognition
The MRN recognizer supports clinical context such as:
MRN
medical record
medical record number
hospital number
chart number
Supported examples include:
The recognizer is case-insensitive, so both of these are supported:
Patient ID Recognition
The Patient ID recognizer supports context such as:
patient id
patient number
patient identifier
pt id
Example:
The raw ID is replaced with a deterministic surrogate like:
Health Plan / Insurance ID Recognition
The health plan recognizer supports context such as:
health plan
beneficiary
insurance
policy
member id
subscriber id
Example:
The raw value is replaced with a deterministic surrogate like:
Additional Clinical Recognizers
This PR also adds recognizers for:
Accession Number
Example contexts:
accession
accession number
acc no
accession no
Replacement format:
Device ID / Serial Number
Example contexts:
device
device id
serial
serial number
implant
equipment
Replacement format:
Deterministic Surrogate Generation
This PR adds deterministic surrogate generation using salted SHA-256 hashes.
The behavior is:
Examples:
becomes something like:
becomes something like:
becomes something like:
The implementation uses SHA-256 from the Python standard library.
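A minimal sketch of how a salted SHA-256 surrogate could be derived is shown below. The function name `make_surrogate`, the `salt:value` concatenation, and the truncation to eight uppercase hex characters are assumptions for illustration; the truncation is inferred only from the `MRN_A1B2C3D4` shape in the example response, and the PR's actual scheme may differ.

```python
import hashlib


def make_surrogate(entity_prefix: str, raw_value: str, study_salt: str) -> str:
    """Map a raw identifier to a stable surrogate such as MRN_<8 hex chars>.

    The same (salt, value) pair always yields the same surrogate, so repeated
    mentions of one identifier stay linkable without exposing the raw value.
    """
    # Assumed construction: hash the salt and value joined by a colon.
    digest = hashlib.sha256(f"{study_salt}:{raw_value}".encode("utf-8")).hexdigest()
    return f"{entity_prefix}_{digest[:8].upper()}"


if __name__ == "__main__":
    first = make_surrogate("MRN", "12345678", study_salt="example-salt")
    second = make_surrogate("MRN", "12345678", study_salt="example-salt")
    print(first == second)  # the mapping is deterministic for a fixed salt
```

If the salt is meant to act as a secret key, the standard library's `hmac` module with SHA-256 is the more conventional construction and could be swapped in without changing the output format.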
Common PHI Handling
This PR also handles several common PHI patterns.
Email Addresses
Email addresses are replaced with:
Example:
becomes:
Phone Numbers
Phone numbers are replaced with:
Example:
becomes:
SSNs
SSN-like values are replaced with:
Dates
Common date patterns are currently replaced with:
Date shifting is not implemented in this first Week 2 slice. This PR keeps date handling simple and safe by redacting common date formats for now.
Ingestion Endpoint Integration
The Week 1 ingestion endpoint is:
This PR updates the ingestion flow so that text files now go through the real anonymization service.
For text uploads, the endpoint now:
anonymization_status: "completed".
For non-text uploads, the endpoint still returns the Week 1 placeholder response.
Response Safety
This PR is careful about not exposing raw PHI.
The API response may include:
The API response does not include:
The entity summary only includes entity types and counts.
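The counts-only summary can be sketched as follows. The detection record shape (a dict with `entity_type`, `start`, and `end`) and the name `summarize_entities` are hypothetical; the point is only that matched text is never copied into the result.

```python
from collections import Counter


def summarize_entities(detections: list[dict]) -> dict[str, int]:
    """Count detections by entity type; raw matched values are never included."""
    return dict(Counter(d["entity_type"] for d in detections))


if __name__ == "__main__":
    # Hypothetical detections carrying offsets only, no raw text.
    found = [
        {"entity_type": "MEDICAL_RECORD_NUMBER", "start": 12, "end": 20},
        {"entity_type": "EMAIL_ADDRESS", "start": 31, "end": 51},
    ]
    print(summarize_entities(found))
```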
Safe example:
{ "MEDICAL_RECORD_NUMBER": 1, "EMAIL_ADDRESS": 1 }
Unsafe example not used by this PR:
{ "detected_values": ["123456", "john.doe@example.com"] }
Text Upload Size Guard
This PR adds a text upload size guard:
If a text upload is larger than the limit, the endpoint returns a clear error.
This avoids accidentally reading very large text files into memory during this first implementation slice.
Streaming or chunked anonymization is not implemented yet.
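One way such a guard could work is to read one byte more than the limit and reject the upload if that extra byte arrives. The sketch below assumes a file-like binary stream; `read_limited`, `UploadTooLargeError`, and the `max_bytes` parameter are hypothetical names, and the actual limit used by this PR is not restated here.

```python
import io
from typing import BinaryIO


class UploadTooLargeError(ValueError):
    """Raised when an upload exceeds the configured byte limit."""


def read_limited(stream: BinaryIO, max_bytes: int) -> bytes:
    """Read at most max_bytes from stream; raise if more data is available."""
    # Asking for one extra byte reveals an oversized upload without
    # pulling the whole file into memory.
    data = stream.read(max_bytes + 1)
    if len(data) > max_bytes:
        raise UploadTooLargeError(f"text upload exceeds {max_bytes} bytes")
    return data


if __name__ == "__main__":
    print(len(read_limited(io.BytesIO(b"short note"), max_bytes=1024)))
```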
UTF-8 Handling
Text files are decoded as UTF-8.
If the uploaded text file is not valid UTF-8, the endpoint returns a clear error instead of failing with an internal exception.
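A minimal sketch of that decoding step, assuming the upload is already available as bytes: the decode failure is converted into a dedicated error that an endpoint could map to a client-facing response. `decode_utf8_upload` and `InvalidTextEncodingError` are hypothetical names, and the message deliberately omits the offending bytes so no uploaded content leaks into the error.

```python
class InvalidTextEncodingError(ValueError):
    """Raised when an uploaded text file is not valid UTF-8."""


def decode_utf8_upload(data: bytes) -> str:
    """Decode upload bytes as strict UTF-8, converting failures to a clear error."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # Do not include the raw bytes in the message; they may contain PHI.
        raise InvalidTextEncodingError(
            "uploaded text file is not valid UTF-8"
        ) from exc


if __name__ == "__main__":
    print(decode_utf8_upload("caf\u00e9".encode("utf-8")))
```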
Example error:
What Was Intentionally Not Changed
This PR does not implement any non-text anonymization work.
Not included:
This PR also does not touch:
/store/store_enhanced
The goal was to keep this PR focused only on Week 2 text anonymization.
Files Changed
Added
Updated
Tests Added
A new service-level test file was added:
The ingestion test file was also updated:
The tests cover:
/api/v1/ingest
.txt extension
Test Results
I ran the focused Week 2 tests.
Text anonymization service tests
Command:
Result:
Ingestion endpoint tests
Command:
Result:
There were a few existing dependency warnings during the ingestion tests, but there were no test failures.
Dependency Notes
No new dependencies were added in this PR.
The required packages were already present in:
Already present:
No spaCy model download is required for the new Week 2 text ingestion path.
Current Behavior by Modality
Text
Text now uses real anonymization.
Status:
CSV
CSV still uses the placeholder handler.
Status:
DICOM
DICOM still uses the placeholder handler.
Status:
NIfTI
NIfTI still uses the placeholder handler.
Status:
WSI
WSI still uses the placeholder handler.
Status:
Current Limitations
This is the first Week 2 implementation slice, so a few things are intentionally left for later.
Current limitations:
study_salt still needs mentor confirmation.
Privacy Notes
This PR uses only synthetic test examples.
Examples used in tests include fake values such as:
No real PHI was added.
The implementation does not print raw uploaded text or raw detected values.
The entity summary is safe because it only contains counts by entity type.
Please let me know your feedback. If any changes are required, I will push more commits to this PR as I improve it further. Thank you so much :)