#2820: Replace dlib with AuraFace for face recognition by douglas125 · Pull Request #2823 · sepinf-inc/IPED

douglas125 · 2026-03-08T18:53:50Z

Replaces the legacy dlib/face_recognition pipeline (2017-era HOG detector, 128-d embeddings) with a modern ONNX-based pipeline: RetinaFace-R50 for detection + AuraFace-v1 for recognition (512-d, L2-normalized embeddings).

Benchmark (377 images, RTX 3060)

	dlib ×4 CPU	AuraFace ×1 CPU	AuraFace ×4 CPU	AuraFace ×1 GPU	AuraFace ×4 GPU
Faces detected	561	1,383	1,383	1,383	1,383
Wall time	46 s	359 s	148 s	57 s	22 s
Avg / image	122 ms	953 ms	393 ms	151 ms	60 ms
Embedding dims	128	512	512	512	512

2.5× more faces detected (RetinaFace vs HOG)
GPU (×4) is 2× faster than dlib while detecting far more faces
CPU-only is slower but still practical; GPU is the recommended path

Licensing

The default stack is fully open-source and free for commercial use:

RetinaFace-R50 — MIT license (detection)
AuraFace-v1 — Apache 2.0 license (recognition)
No insightface Python package required for default mode

Optional buffalo_l / buffalo_s models (InsightFace non-commercial license) are available as a config option for users willing to accept that license.

What changed

Detection: HOG → RetinaFace-R50 standalone ONNX model (MIT)
Recognition: dlib 128-d → AuraFace 512-d L2-normalized embeddings (Apache 2.0)
Distance metric: squared Euclidean → cosine distance (1 − dot product)
GPU support: ONNX Runtime auto-detects CUDA; falls back to CPU transparently
Models: auto-downloaded on first run (~365 MB); download_insightface_models.py for offline use
Atomic downloads: temp file + rename, safe for parallel subprocess startup
Model URLs centralized in FaceRecognitionModelConfig.py
Config: new options for model selection, GPU toggle, confidence threshold

Files changed (9 modified/new)

FaceRecognitionProcess.py — ONNX detection + recognition pipeline (standalone RetinaFace + AuraFace)
FaceRecognitionTask.py — subprocess orchestration, Python env setup, IPC
FaceRecognitionModelConfig.py (new) — centralized model URLs and filenames
download_insightface_models.py (new) — offline model download helper
FaceRecognitionConfig.txt — updated config with new options and descriptions
SimilarFacesSearch.java — cosine distance, 512-d embeddings, dimension guard
ElasticSearchIndexTask.java — 512-d vectors, cosinesimil metric
environment.yml (new) — conda environment spec
.gitignore — ignore CLAUDE.md

How to test

Set enableFaceRecognition = true in IPEDConfig.txt
Install Python dependencies:
```
pip install numpy onnxruntime-gpu opencv-python pillow
```
(use onnxruntime instead of onnxruntime-gpu on CPU-only machines)
If not using bundled Python, set pythonPath in conf/FaceRecognitionConfig.txt:
```
pythonPath = /path/to/python
```
Run IPED — models auto-download on first use (~365 MB)
In the analysis UI: Options → search for similar faces

Questions for maintainers

Bundled Python artifact: The current artifact (python-jep-dlib) ships dlib + face_recognition. Should it be updated to include onnxruntime + opencv-python + pillow instead? Or should we add a pip install step to the build?
Model bundling: Should models (~365 MB) be bundled in the release artifact, or keep the current approach of auto-download on first run? The download_insightface_models.py script handles offline pre-download.

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.6 noreply@anthropic.com

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…ace + RetinaFace) Swap face detection/encoding from dlib (128-d, HOG) to InsightFace (512-d ArcFace embeddings, RetinaFace detector) for significantly better accuracy. Update distance metric from squared Euclidean to cosine distance. Add dimension guard for backwards compatibility with old 128-d indexes. Include download script for offline/portable model provisioning. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Use face.normed_embedding instead of face.embedding: InsightFace returns raw (unnormalized) embeddings by default; cosine distance requires L2-normalized vectors - Use root= parameter in FaceAnalysis instead of INSIGHTFACE_HOME env var: env var was unreliable, root= is the correct API Both issues found and verified via testing with buffalo_l on t1.jpg (6 faces detected, all embeddings norm=1.0, cosine distance correct). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Add environment.yml for conda env setup (Python 3.10, insightface, onnxruntime-gpu, cudnn=9 for CUDA 12/13 users) - Add /models/ to .gitignore to exclude downloaded InsightFace models - Use os._exit(0) after main() to avoid ONNX Runtime GPU session hanging on cleanup Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…dings, batch IPC Fix 1 (biggest impact): default numFaceRecognitionProcesses to 1. Previously numThreads/2 subprocesses were spawned, each loading the 300MB InsightFace model independently (~30s each). On a 22-thread machine that was ~11 × 30s = 5+ min before any inference started. The config property still overrides this for CPU-parallel setups. Fix 2 (IPC win): pack 512 embedding floats onto one line. Replaces 512 individual print()/readline() round-trips per face with a single space-separated line using repr() for float precision. Reader side: np.array(line.split(), dtype=np.float32). Fix 3 (batch IPC): send N images per subprocess round-trip. New batchSize config property (default 1, set 8-16 for GPU). FaceRecognitionProcess.py: new process_one_image() helper + batch:N command prefix in the main loop. FaceRecognitionTask.py: per-instance _batch_items buffer, _flushBatch(), processQueueEnd()/sendToNextTask() pattern (same as AgeEstimationTask) so items are held until their batch results are available. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…, fix finish() race The batch architecture (Fix 3) introduced a deadlock: with maxProcesses=1 there is only one subprocess proc. Worker threads that finished their items early would call finish(), remove the single proc from the queue, and terminate it — leaving other workers blocked at processQueue.get() forever. This commit: - Reverts the batch/sendToNextTask/processQueueEnd architecture - Restores the original direct process() flow (proven to work) - Keeps Fix 1: default maxProcesses=1 (avoids N × 30s model loads) - Keeps Fix 2: packed embeddings (512 floats on one line, not 512 readlines) - Fixes the finish() race: uses a counter so only the LAST worker thread terminates the subprocess (same pattern as AgeEstimationTask.finish()) - Removes the not-yet-implemented batchSize config entry Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…lity Two bugs when createExternalProcess() fails with maxProcesses=1: 1. numCreatedProcs was incremented before the Popen/ping attempt. On failure it was never reset to 0, so all other worker threads blocked forever at processQueue.get() (no proc ever enters the queue). Fix: decrement numCreatedProcs in the except branch and return early. 2. log_stderr was only started after a successful ping, so import errors and model-download progress were invisible during startup. Fix: start the stderr logging thread immediately after Popen. 3. processQueue.get(block=True) had no timeout — on any failure the remaining workers would deadlock indefinitely. Fix: timeout=300 (5 min) with a clear error message on expiry. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The bundled IPED pythonw does not have insightface installed. Set pythonPath to the conda env that has the required packages. Users on other machines should update this path or install insightface into the IPED bundled Python. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…tibility NumPy 2.x broke binary ABI compatibility; insightface native extensions were compiled against NumPy 1.x and crash with AttributeError: _ARRAY_API. NumPy 1.26.x is the latest 1.x series (Feb 2024) and fully compatible. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ean up config - Fix critical bug: processing block was unreachable dead code inside except _queue.Empty after a return statement - Remove numpy dependency from Jep context (use plain Python list instead of np.array) to avoid NumPy ABI conflicts with bundled Python's dlib - Comment out machine-specific pythonPath in config - Add clear install instructions for CPU-only and GPU setups Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…de quality - Add AuraFace mode: MediaPipe detection + AuraFace-v1 recognition (fully Apache 2.0) - Keep buffalo_l/buffalo_s as options (non-commercial InsightFace license) - Auto-scale numProcesses to min(4, numThreads/2) for CPU parallelism - Cache ONNX input name lookup (was per-face, now per-session) - Derive bbox from 5 alignment points instead of iterating all 478 landmarks - Extract rescale_bbox helper to deduplicate scale correction code - Replace eval() with safe tuple parsing in FaceRecognitionTask - Remove duplicate numProcs config read and dead protocol string re-assignments - Remove unnecessary 3s sleep on subprocess failure - Add MediaPipe landmarker download to offline download script - Update environment.yml and config for auraface defaults Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…to InsightFace detection Replace PIL image loading with cv2.imread (IMREAD_COLOR) which auto-applies EXIF rotation, fixing bounding box mismatch on portrait/rotated images. Also switch auraface mode from MediaPipe to InsightFace RetinaFace detection with batch AuraFace recognition for better accuracy and performance. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…pdate descriptions - Update header comments to reflect RetinaFace-R50 (MIT) + AuraFace (Apache 2.0) stack - Fix error message to list actual dependencies (onnxruntime, opencv-python, numpy, pillow) - Change embedded python path from pythonw to python (pythonw hides stderr) - Update config comments and download script for standalone RetinaFace model Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

… downloads, remove dead code Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

douglas125 · 2026-03-08T19:04:42Z

@lfcnassif @wladimirleite I've tested this meticulously. This stack feels much more accurate and it is faster using a consumer grade GPU.

lfcnassif · 2026-03-10T02:11:56Z

Thank you very much @douglas125 for this great contribution! Detecting 2,5x more faces is awesome!

But I'm concerned about a 3x slowdown on CPU, most of our users don't have a decent GPU card. Is your test database public? Many years ago, I had to optimize a lot the fast HOG detector to make it feasible to run on CPU without becoming a bottleneck, I managed to achive a 10x speed up. The main optimization was to resize high resolution images to a maxium of 1024px in each dimension, resulting in much less area to scan, but loosing small faces, it was a trade-off decision.

I would suggest to keep the old implementation for CPU users and a parameter to switch to the new one for GPU users.

PS: Let's run performace tests with current proposal before.

lfcnassif · 2026-03-10T02:14:05Z

Did you see it is possible to switch from HOG to a CNN detector with the old implementation? Could you try to run that with your test set?

douglas125 · 2026-03-10T02:50:45Z

I can give that a try. In my experience, even embedded GPUs will give times comparable to dlib on CPU. Also, though I did not report that, using 8x processes comes very close to matching dlib x4 on CPU. I think it's very hard to tell without using the actual hardware that will run this, and I think it would be good for an informed decision.

I didn't use any dataset in particular. I just picked up a sizeable chunk of my own personal pictures and ran the pipeline.

Let me try to get some other benchmarks.

In any case, this PR enables detecting faces in much more diverse orientations AND putting a name to it.

I still need to know if you want the models to be embedded in the repo or if they can be downloaded via a script (e.g. before uploading to a flash drive). I'll try and come back with results.

lfcnassif · 2026-03-10T03:55:29Z

I think downloading the model the first time is fine.

I didn't use any dataset in particular. I just picked up a sizeable chunk of my own personal pictures and ran the pipeline.

So I guess they are high resolution? Try to increase the max resolution limit option in the old config to see if it helps the old implementation.

Let me try to get some other benchmarks

If possible, a memory usage evaluation would be very interesting.

Thanks again for this great work!

douglas125 · 2026-03-10T04:48:35Z

AuraFace vs dlib — updated analysis

Benchmark (377 images, RTX 3060, corrected)

Correction from previous benchmark: the "AuraFace CPU" rows were mislabelled.
CUDA_VISIBLE_DEVICES='' does not prevent onnxruntime-gpu from using CUDA. Only
uninstalling the GPU package does. Those rows were actually GPU-accelerated. A second
run with onnxruntime-gpu uninstalled (confirmed by the provider warning
Available providers: AzureExecutionProvider, CPUExecutionProvider) produced the
corrected CPU-only numbers below.

The dlib VRAM numbers (586–1770 MB) were system baseline (browser, compositor), not
dlib using the GPU — dlib has no ONNX/CUDA dependency.

Config	Faces	Wall time	ms/img	ms/face	Dim	Peak RAM	Notes
dlib HOG x1 CPU	314	62s	166ms	198ms	128	1.0 GB	true CPU
dlib HOG x2 CPU	314	31s	82ms	98ms	128	2.0 GB	true CPU
dlib HOG x4 CPU	314	18s	47ms	56ms	128	3.8 GB	true CPU
dlib CNN x1 CPU	421	52s	138ms	124ms	128	1.3 GB	true CPU
dlib CNN x2 CPU	421	25s	65ms	58ms	128	2.5 GB	true CPU
dlib CNN x4 CPU	421	15s	40ms	36ms	128	4.7 GB	true CPU
AuraFace x1 CPU	1,460	474s	1259ms	325ms	512	0.9 GB	true CPU (onnxruntime, no GPU pkg)
AuraFace x2 CPU	1,460	228s	606ms	157ms	512	1.7 GB	true CPU
AuraFace x4 CPU	1,460	164s	435ms	113ms	512	3.3 GB	true CPU
AuraFace x1 GPU	1,460	62s	165ms	43ms	512	1.3 GB	onnxruntime-gpu, CUDAExecutionProvider
AuraFace x2 GPU	1,460	29s	77ms	20ms	512	2.5 GB	onnxruntime-gpu
AuraFace x4 GPU	1,460	19s	50ms	13ms	512	4.9 GB	onnxruntime-gpu

I can wire dlib back in this PR, but I (still) advise against it. Here's my reasoning:

1. Detection: dlib misses most faces

On WIDER FACE Hard (small, occluded, profile faces — the realistic scenario):

Detector	Easy	Medium	Hard
RetinaFace-R50 (MIT license)	96.5%	95.6%	90.4%
dlib CNN (MMOD)	~70%	~60%	~30%
dlib HOG	even lower	—	—

Our own benchmark on 377 real forensic images confirms this:

Detector	Faces found
dlib HOG	314
dlib CNN	421
RetinaFace (AuraFace)	1,460

dlib HOG misses 78% of the faces AuraFace finds. Even dlib CNN misses 71%. Faces
that aren't detected are invisible to the investigation — they can never be matched,
regardless of how fast the detector runs.

2. Recognition: dlib has 1.8x higher error rate

On LFW (Labeled Faces in the Wild):

Model	Accuracy	Error rate
AuraFace (ArcFace architecture)	99.65%	0.35%
dlib `face_recognition`	99.38%	0.62%

dlib's error rate is 1.8x higher. In a forensic case with 10,000 face comparisons,
dlib produces ~62 wrong matches vs AuraFace's ~35. Every false positive wastes
investigator time; every false negative is a missed lead.

On harder benchmarks the gap widens further:

Benchmark	AuraFace	dlib
CFP-FP (cross-pose)	95.19%	not published
AgeDB (age variation)	96.10%	not published

dlib's 128-d embedding doesn't capture enough information for robust matching across pose
and aging — scenarios common in forensic investigations.

3. Speed: it depends on whether a GPU is available

This is where the picture is more nuanced than the previous report suggested.

With a GPU (onnxruntime-gpu): AuraFace and dlib are roughly comparable in wall time,
but AuraFace processes 3.5–4.6x more faces. Normalized per detected face, AuraFace is
faster:

Config	Wall time	Faces	ms/img	ms/face
dlib CNN x4 CPU	15s	421	40ms	36ms
AuraFace x4 GPU	19s	1,460	50ms	13ms

Without a GPU (plain onnxruntime, CPU-only): AuraFace is significantly slower:

Config	Wall time	Faces	ms/img	ms/face
dlib CNN x4 CPU	15s	421	40ms	36ms
AuraFace x4 CPU (true)	164s	1,460	435ms	113ms

AuraFace CPU-only is ~11x slower in wall time than dlib CNN x4. For CPU-only users
this is a real regression, offset by the fact that it still finds 3.5x more faces, but
whether that trade-off is acceptable depends on case size and time constraints.

4. Embedding quality: 512-d vs 128-d

dlib uses 128-dimensional embeddings, which omit facial detail and struggle with
similarity search across pose and aging. AuraFace uses 512-dimensional embeddings from a
ResNet-100 backbone trained with ArcFace (Additive Angular Margin) loss on millions of
identities, which means far more discriminative power for distinguishing similar-looking individuals.

5. Licensing: fully open-source

Component	License
RetinaFace-R50 detection	MIT
AuraFace-v1 recognition	Apache 2.0
dlib	Boost
dlib `face_recognition` wrapper	MIT

AuraFace was specifically designed for commercial/institutional use with clean training
data (no MS-Celeb-1M licensing concerns).

6. Maintenance cost of keeping dlib

Supporting both models would require:

Two Python dependency sets (dlib + face_recognition vs onnxruntime + opencv-python)
Two code paths in the subprocess script
Two embedding dimensions (128-d vs 512-d) that cannot cross-match — cases
processed with one model are incompatible with the other
User confusion about which to pick

7. Summary

Metric	dlib HOG/CNN	AuraFace + RetinaFace	Winner
Detection (WIDER Hard)	~30%	90.4%	AuraFace by 3x
Recognition error rate (LFW)	0.62%	0.35%	AuraFace 1.8x lower
Cross-pose (CFP-FP)	not published	95.19%	AuraFace
Age variation (AgeDB)	not published	96.10%	AuraFace
Faces found (377 images)	314–421	1,460	AuraFace by 3.5–4.6x
Embedding dimensions	128	512	AuraFace
License	Boost	MIT + Apache 2.0	AuraFace
ms/face, GPU available (x4)	36ms	13ms	AuraFace 2.8x faster
ms/face, CPU-only (x4)	36ms	113ms	dlib 3x faster
Wall time, GPU available (x4)	15–18s	19s	roughly equal
Wall time, CPU-only (x4)	15–18s	164s	dlib ~10x faster

For users with a GPU, there is no dimension where dlib is the better choice. AuraFace
detects 3.5–4.6x more faces, has a 1.8x lower recognition error rate, handles pose and
aging better, uses a cleaner license, and runs at comparable speed.

For CPU-only users, dlib is currently faster, but at the cost of missing 71–78% of
faces and producing lower-quality embeddings. Whether that trade-off is acceptable is a
policy decision: a faster scan that misses most faces may be worse than a slower scan
that finds them all.

Given that IPED targets forensic professionals who typically run on capable hardware, and
that the performance gap can be narrowed further by tuning ONNX thread counts for CPU
inference, I recommend proceeding with AuraFace only. If CPU-only support becomes a
stated requirement, it can be added as a config-file option later without breaking any
existing cases (since the embedding dimension change is already a breaking change
regardless).

Sources

douglas125 · 2026-03-10T04:52:08Z

I think you should give it a shot. It came out really good. I hope you don't mind me using it to organize my personal pictures LOL

lfcnassif · 2026-03-10T20:01:38Z

Thank you very much @douglas125 for all your tests! We will review and test it for sure as soon as we find some available time.

I just wonder why tests on CPU are using VRAM, the more CPU threads used the more VRAM. Are you sure the GPU card wasn't used? Also seems the number of faces detected by dlib decreased compared to your original tests, while the faces detected by the new implementation increased comparing to initial tests. Any tip why?

douglas125 · 2026-03-10T20:32:23Z

Nice catch. You are right. I ran the tests again and the results look a lot worse for the CPU version. Let me remove the previous version and re-analyze things

douglas125 · 2026-03-10T20:41:22Z

I updated my previous comment. You were right: CUDA_VISIBLE_DEVICES in my script was being ignored and overriden.
I can put dlib as a fallback but I still think you'd be losing so much. Let me know

douglas125 and others added 15 commits March 8, 2026 00:58

Add CLAUDE.md with project notes and face recognition improvement plan

d2bd299

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add CLAUDE.md to .gitignore and untrack it

b5490d2

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

sepinf-inc#2820: Centralize model URLs, enable GPU by default, atomic…

e14b1d8

… downloads, remove dead code Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

#2820: Replace dlib with AuraFace for face recognition#2823

#2820: Replace dlib with AuraFace for face recognition#2823
douglas125 wants to merge 15 commits into
sepinf-inc:masterfrom
douglas125:#2820_improve_face_recognition

douglas125 commented Mar 8, 2026

Uh oh!

douglas125 commented Mar 8, 2026

Uh oh!

lfcnassif commented Mar 10, 2026 •

edited

Loading

Uh oh!

lfcnassif commented Mar 10, 2026

Uh oh!

douglas125 commented Mar 10, 2026

Uh oh!

lfcnassif commented Mar 10, 2026

Uh oh!

douglas125 commented Mar 10, 2026 •

edited

Loading

Uh oh!

douglas125 commented Mar 10, 2026

Uh oh!

lfcnassif commented Mar 10, 2026

Uh oh!

douglas125 commented Mar 10, 2026

Uh oh!

douglas125 commented Mar 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

douglas125 commented Mar 8, 2026

Benchmark (377 images, RTX 3060)

Licensing

What changed

Files changed (9 modified/new)

How to test

Questions for maintainers

Uh oh!

douglas125 commented Mar 8, 2026

Uh oh!

lfcnassif commented Mar 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

lfcnassif commented Mar 10, 2026

Uh oh!

douglas125 commented Mar 10, 2026

Uh oh!

lfcnassif commented Mar 10, 2026

Uh oh!

douglas125 commented Mar 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

AuraFace vs dlib — updated analysis

Benchmark (377 images, RTX 3060, corrected)

1. Detection: dlib misses most faces

2. Recognition: dlib has 1.8x higher error rate

3. Speed: it depends on whether a GPU is available

4. Embedding quality: 512-d vs 128-d

5. Licensing: fully open-source

6. Maintenance cost of keeping dlib

7. Summary

Sources

Uh oh!

douglas125 commented Mar 10, 2026

Uh oh!

lfcnassif commented Mar 10, 2026

Uh oh!

douglas125 commented Mar 10, 2026

Uh oh!

douglas125 commented Mar 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

lfcnassif commented Mar 10, 2026 •

edited

Loading

douglas125 commented Mar 10, 2026 •

edited

Loading