
#2820: Replace dlib with AuraFace for face recognition #2823

Open
douglas125 wants to merge 15 commits into sepinf-inc:master from douglas125:#2820_improve_face_recognition

@douglas125


Fixes #2820.

Replaces the legacy dlib/face_recognition pipeline (2017-era HOG detector, 128-d embeddings) with a modern ONNX-based pipeline: RetinaFace-R50 for detection + AuraFace-v1 for recognition (512-d, L2-normalized embeddings).

Benchmark (377 images, RTX 3060)

| | dlib ×4 CPU | AuraFace ×1 CPU | AuraFace ×4 CPU | AuraFace ×1 GPU | AuraFace ×4 GPU |
|---|---|---|---|---|---|
| Faces detected | 561 | 1,383 | 1,383 | 1,383 | 1,383 |
| Wall time | 46 s | 359 s | 148 s | 57 s | 22 s |
| Avg / image | 122 ms | 953 ms | 393 ms | 151 ms | 60 ms |
| Embedding dims | 128 | 512 | 512 | 512 | 512 |

  • 2.5× more faces detected (RetinaFace vs HOG)
  • GPU (×4) is 2× faster than dlib while detecting far more faces
  • CPU-only is slower but still practical; GPU is the recommended path

Licensing

The default stack is fully open-source and free for commercial use:

  • RetinaFace-R50 — MIT license (detection)
  • AuraFace-v1 — Apache 2.0 license (recognition)
  • No insightface Python package required for default mode

Optional buffalo_l / buffalo_s models (InsightFace non-commercial license) are available as a config option for users willing to accept that license.

What changed

  • Detection: HOG → RetinaFace-R50 standalone ONNX model (MIT)
  • Recognition: dlib 128-d → AuraFace 512-d L2-normalized embeddings (Apache 2.0)
  • Distance metric: squared Euclidean → cosine distance (1 − dot product)
  • GPU support: ONNX Runtime auto-detects CUDA; falls back to CPU transparently
  • Models: auto-downloaded on first run (~365 MB); download_insightface_models.py for offline use
  • Atomic downloads: temp file + rename, safe for parallel subprocess startup
  • Model URLs centralized in FaceRecognitionModelConfig.py
  • Config: new options for model selection, GPU toggle, confidence threshold
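
The metric change can be illustrated with a minimal, stdlib-only sketch. It assumes embeddings arrive as plain lists of floats; the function names are hypothetical and are not taken from SimilarFacesSearch.java or the Python scripts. For L2-normalized vectors, cosine distance reduces to 1 − dot product, and the length check mirrors the idea of a dimension guard between old 128-d and new 512-d vectors.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (raises on a zero vector)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_distance(a, b):
    """Cosine distance for unit-length vectors: 1 - dot(a, b)."""
    if len(a) != len(b):
        # Dimension guard: e.g. an old 128-d vector compared with a new 512-d one.
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    return 1.0 - sum(x * y for x, y in zip(a, b))


a = l2_normalize([3.0, 4.0])   # [0.6, 0.8]
b = l2_normalize([4.0, 3.0])   # [0.8, 0.6]
d = cosine_distance(a, b)      # dot = 0.96, so d is about 0.04
```

Identical unit vectors give a distance of 0 and opposite ones give 2, so any match threshold tuned for squared Euclidean distance on 128-d vectors would need to be re-tuned.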

Files changed (9 modified/new)

  • FaceRecognitionProcess.py — ONNX detection + recognition pipeline (standalone RetinaFace + AuraFace)
  • FaceRecognitionTask.py — subprocess orchestration, Python env setup, IPC
  • FaceRecognitionModelConfig.py (new) — centralized model URLs and filenames
  • download_insightface_models.py (new) — offline model download helper
  • FaceRecognitionConfig.txt — updated config with new options and descriptions
  • SimilarFacesSearch.java — cosine distance, 512-d embeddings, dimension guard
  • ElasticSearchIndexTask.java — 512-d vectors, cosinesimil metric
  • environment.yml (new) — conda environment spec
  • .gitignore — ignore CLAUDE.md
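
For reviewers unfamiliar with the "temp file + rename" pattern mentioned under "What changed", here is a minimal sketch. It is an assumption about the general technique, not a copy of download_insightface_models.py: the helper name `atomic_write` is hypothetical, and the payload is passed in as bytes so the example stays self-contained instead of fetching a real model URL.

```python
import os
import tempfile


def atomic_write(dest_path, data):
    """Write bytes to a temp file next to dest_path, then rename it into place."""
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    os.makedirs(dest_dir, exist_ok=True)
    # The temp file lives in the destination directory so the rename
    # never crosses a filesystem boundary.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        # os.replace swaps the file in one step; readers see either the old
        # file or the complete new one, never a partial download.
        os.replace(tmp_path, dest_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If several subprocesses start at once and each downloads the same model, the last rename wins and every reader still sees a complete file.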

How to test

  1. Set enableFaceRecognition = true in IPEDConfig.txt
  2. Install Python dependencies:
    pip install numpy onnxruntime-gpu opencv-python pillow
    
    (use onnxruntime instead of onnxruntime-gpu on CPU-only machines)
  3. If not using bundled Python, set pythonPath in conf/FaceRecognitionConfig.txt:
    pythonPath = /path/to/python
    
  4. Run IPED — models auto-download on first use (~365 MB)
  5. In the analysis UI: Options → search for similar faces

Questions for maintainers

  1. Bundled Python artifact: The current artifact (python-jep-dlib) ships dlib + face_recognition. Should it be updated to include onnxruntime + opencv-python + pillow instead? Or should we add a pip install step to the build?

  2. Model bundling: Should models (~365 MB) be bundled in the release artifact, or keep the current approach of auto-download on first run? The download_insightface_models.py script handles offline pre-download.

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

douglas125 and others added 15 commits March 8, 2026 00:58
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ace + RetinaFace)

Swap face detection/encoding from dlib (128-d, HOG) to InsightFace
(512-d ArcFace embeddings, RetinaFace detector) for significantly
better accuracy. Update distance metric from squared Euclidean to
cosine distance. Add dimension guard for backwards compatibility
with old 128-d indexes. Include download script for offline/portable
model provisioning.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Use face.normed_embedding instead of face.embedding: InsightFace
  returns raw (unnormalized) embeddings by default; cosine distance
  requires L2-normalized vectors
- Use root= parameter in FaceAnalysis instead of INSIGHTFACE_HOME
  env var: env var was unreliable, root= is the correct API

Both issues found and verified via testing with buffalo_l on t1.jpg
(6 faces detected, all embeddings norm=1.0, cosine distance correct).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add environment.yml for conda env setup (Python 3.10, insightface,
  onnxruntime-gpu, cudnn=9 for CUDA 12/13 users)
- Add /models/ to .gitignore to exclude downloaded InsightFace models
- Use os._exit(0) after main() to avoid ONNX Runtime GPU session
  hanging on cleanup

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…dings, batch IPC

Fix 1 (biggest impact): default numFaceRecognitionProcesses to 1.
Previously numThreads/2 subprocesses were spawned, each loading the
300MB InsightFace model independently (~30s each). On a 22-thread
machine that was ~11 × 30s = 5+ min before any inference started.
The config property still overrides this for CPU-parallel setups.

Fix 2 (IPC win): pack 512 embedding floats onto one line.
Replaces 512 individual print()/readline() round-trips per face with
a single space-separated line using repr() for float precision.
Reader side: np.array(line.split(), dtype=np.float32).

Fix 3 (batch IPC): send N images per subprocess round-trip.
New batchSize config property (default 1, set 8-16 for GPU).
FaceRecognitionProcess.py: new process_one_image() helper + batch:N
command prefix in the main loop.
FaceRecognitionTask.py: per-instance _batch_items buffer, _flushBatch(),
processQueueEnd()/sendToNextTask() pattern (same as AgeEstimationTask)
so items are held until their batch results are available.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
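
A minimal sketch of the single-line embedding format from Fix 2, assuming plain Python floats on both sides. The commit's reader uses np.array(line.split(), dtype=np.float32); the stdlib parsing below is a stand-in so the example runs without NumPy, and the function names are hypothetical.

```python
def pack_embedding(values):
    """Serialize floats as one space-separated line; repr() round-trips Python floats."""
    return " ".join(repr(float(v)) for v in values)


def unpack_embedding(line, expected_dim=None):
    """Parse one line back into floats, optionally checking the dimension."""
    values = [float(tok) for tok in line.split()]
    if expected_dim is not None and len(values) != expected_dim:
        raise ValueError(f"expected {expected_dim} values, got {len(values)}")
    return values


embedding = [i / 512.0 for i in range(512)]
line = pack_embedding(embedding)                 # one write instead of 512
restored = unpack_embedding(line, expected_dim=512)
```

One write and one readline per face replace 512 round-trips, which is where the IPC saving comes from.
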
…, fix finish() race

The batch architecture (Fix 3) introduced a deadlock: with maxProcesses=1
there is only one subprocess proc. Worker threads that finished their items
early would call finish(), remove the single proc from the queue, and
terminate it — leaving other workers blocked at processQueue.get() forever.

This commit:
- Reverts the batch/sendToNextTask/processQueueEnd architecture
- Restores the original direct process() flow (proven to work)
- Keeps Fix 1: default maxProcesses=1 (avoids N × 30s model loads)
- Keeps Fix 2: packed embeddings (512 floats on one line, not 512 readlines)
- Fixes the finish() race: uses a counter so only the LAST worker thread
  terminates the subprocess (same pattern as AgeEstimationTask.finish())
- Removes the not-yet-implemented batchSize config entry

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
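
The "only the last worker terminates the subprocess" idea can be sketched as follows. This is an illustration of the counter pattern under assumed names (`SharedProcessGuard`, `on_last_finish`), not the code in FaceRecognitionTask.py or AgeEstimationTask.

```python
import threading


class SharedProcessGuard:
    """Counts active workers and runs cleanup only when the last one finishes."""

    def __init__(self, num_workers, on_last_finish):
        self._remaining = num_workers
        self._lock = threading.Lock()
        self._on_last_finish = on_last_finish

    def finish(self):
        # Decrement under the lock so exactly one caller observes zero.
        with self._lock:
            self._remaining -= 1
            is_last = self._remaining == 0
        if is_last:
            self._on_last_finish()


terminated = []
guard = SharedProcessGuard(4, lambda: terminated.append("terminated"))
threads = [threading.Thread(target=guard.finish) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Workers that finish early only decrement the counter, so the shared subprocess stays available to the workers that are still running.
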
…lity

Three bugs when createExternalProcess() fails with maxProcesses=1:

1. numCreatedProcs was incremented before the Popen/ping attempt.
   On failure it was never reset to 0, so all other worker threads
   blocked forever at processQueue.get() (no proc ever enters the queue).
   Fix: decrement numCreatedProcs in the except branch and return early.

2. log_stderr was only started after a successful ping, so import errors
   and model-download progress were invisible during startup.
   Fix: start the stderr logging thread immediately after Popen.

3. processQueue.get(block=True) had no timeout — on any failure the
   remaining workers would deadlock indefinitely.
   Fix: timeout=300 (5 min) with a clear error message on expiry.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
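
A minimal sketch of the two queue-related fixes above: rolling back the creation counter when startup fails, and waiting on the queue with a timeout. The class and method names are hypothetical, and the `factory` callable stands in for the real Popen/ping sequence.

```python
import queue
import threading


class ProcessPool:
    """Tracks how many subprocesses were created and hands them out via a queue."""

    def __init__(self, max_procs):
        self.max_procs = max_procs
        self.num_created = 0
        self._lock = threading.Lock()
        self._queue = queue.Queue()

    def try_create(self, factory):
        with self._lock:
            if self.num_created >= self.max_procs:
                return False
            self.num_created += 1  # reserve the slot before the slow startup
        try:
            proc = factory()
        except Exception:
            with self._lock:
                self.num_created -= 1  # roll back so another worker can retry
            return False
        self._queue.put(proc)
        return True

    def acquire(self, timeout_s=300):
        try:
            return self._queue.get(block=True, timeout=timeout_s)
        except queue.Empty:
            raise RuntimeError(
                f"no face recognition process became available within {timeout_s} s"
            ) from None


def failing_factory():
    raise OSError("simulated startup failure")


pool = ProcessPool(max_procs=1)
created = pool.try_create(failing_factory)   # False, counter rolled back to 0
```

With the rollback in place a later worker can attempt creation again, and the timeout turns any remaining failure into an error instead of a silent hang.
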
The bundled IPED pythonw does not have insightface installed.
Set pythonPath to the conda env that has the required packages.
Users on other machines should update this path or install
insightface into the IPED bundled Python.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tibility

NumPy 2.x broke binary ABI compatibility; insightface native extensions
were compiled against NumPy 1.x and crash with AttributeError: _ARRAY_API.
NumPy 1.26.x is the latest 1.x series (Feb 2024) and fully compatible.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ean up config

- Fix critical bug: processing block was unreachable dead code inside
  except _queue.Empty after a return statement
- Remove numpy dependency from Jep context (use plain Python list instead
  of np.array) to avoid NumPy ABI conflicts with bundled Python's dlib
- Comment out machine-specific pythonPath in config
- Add clear install instructions for CPU-only and GPU setups

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…de quality

- Add AuraFace mode: MediaPipe detection + AuraFace-v1 recognition (fully Apache 2.0)
- Keep buffalo_l/buffalo_s as options (non-commercial InsightFace license)
- Auto-scale numProcesses to min(4, numThreads/2) for CPU parallelism
- Cache ONNX input name lookup (was per-face, now per-session)
- Derive bbox from 5 alignment points instead of iterating all 478 landmarks
- Extract rescale_bbox helper to deduplicate scale correction code
- Replace eval() with safe tuple parsing in FaceRecognitionTask
- Remove duplicate numProcs config read and dead protocol string re-assignments
- Remove unnecessary 3s sleep on subprocess failure
- Add MediaPipe landmarker download to offline download script
- Update environment.yml and config for auraface defaults

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
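
To illustrate why replacing eval() matters, here is a minimal sketch of safe tuple parsing. The message format (a four-integer tuple such as a face box) and the function name are assumptions for illustration; the actual parsing in FaceRecognitionTask.py may differ.

```python
import ast


def parse_int_tuple(text, expected_len=4):
    """Parse text like '(10, 20, 110, 140)' without executing any code."""
    try:
        # literal_eval only accepts Python literals, never calls or names.
        value = ast.literal_eval(text.strip())
    except (ValueError, SyntaxError, TypeError) as exc:
        raise ValueError(f"not a plain literal: {text!r}") from exc
    if not isinstance(value, tuple) or len(value) != expected_len:
        raise ValueError(f"expected a tuple of {expected_len} items: {text!r}")
    if not all(isinstance(v, int) and not isinstance(v, bool) for v in value):
        raise ValueError(f"expected integers only: {text!r}")
    return value


box = parse_int_tuple("(10, 20, 110, 140)")
```

A line such as `__import__('os').system('...')` is rejected with ValueError here, whereas eval() would have executed it.
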
…to InsightFace detection

Replace PIL image loading with cv2.imread (IMREAD_COLOR) which auto-applies
EXIF rotation, fixing bounding box mismatch on portrait/rotated images.
Also switch auraface mode from MediaPipe to InsightFace RetinaFace detection
with batch AuraFace recognition for better accuracy and performance.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…pdate descriptions

- Update header comments to reflect RetinaFace-R50 (MIT) + AuraFace (Apache 2.0) stack
- Fix error message to list actual dependencies (onnxruntime, opencv-python, numpy, pillow)
- Change embedded python path from pythonw to python (pythonw hides stderr)
- Update config comments and download script for standalone RetinaFace model

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… downloads, remove dead code

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@douglas125

Author

@lfcnassif @wladimirleite I've tested this meticulously. This stack feels much more accurate, and it is faster when a consumer-grade GPU is available.

@lfcnassif

lfcnassif commented Mar 10, 2026

Member

Thank you very much @douglas125 for this great contribution! Detecting 2.5x more faces is awesome!

But I'm concerned about the 3x slowdown on CPU, since most of our users don't have a decent GPU. Is your test database public? Many years ago I had to heavily optimize the fast HOG detector to make it feasible to run on CPU without becoming a bottleneck, and I managed to achieve a 10x speedup. The main optimization was resizing high-resolution images to a maximum of 1024 px in each dimension, which leaves much less area to scan but loses small faces. It was a trade-off decision.
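
The resize trade-off can be made concrete with a small sketch. It assumes a simple aspect-preserving cap on the longer side; the function names are hypothetical and this is not the code from the old dlib implementation.

```python
def capped_size(width, height, max_side=1024):
    """Return (new_w, new_h, scale) so that neither side exceeds max_side."""
    scale = min(1.0, max_side / max(width, height))
    new_w = max(1, round(width * scale))
    new_h = max(1, round(height * scale))
    return new_w, new_h, scale


def bbox_to_original(bbox, scale):
    """Map a box found on the resized image back to original pixel coordinates."""
    return tuple(round(v / scale) for v in bbox)


w, h, s = capped_size(4000, 3000)                      # 1024 x 768, scale 0.256
area_ratio = (4000 * 3000) / (w * h)                   # about 15x fewer pixels to scan
box = bbox_to_original((100, 50, 200, 150), s)
```

A 12-megapixel photo shrinks to roughly one fifteenth of its area, which is where the speedup comes from. A face 40 px wide in the original becomes about 10 px wide after resizing, which is why small faces get lost.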

I would suggest keeping the old implementation for CPU users and adding a parameter to switch to the new one for GPU users.

PS: Let's run performance tests with the current proposal first.

@lfcnassif

Member

Did you see it is possible to switch from HOG to a CNN detector with the old implementation? Could you try to run that with your test set?

@douglas125

Author

I can give that a try. In my experience, even embedded GPUs will give times comparable to dlib on CPU. Also, though I did not report that, using 8x processes comes very close to matching dlib x4 on CPU. I think it's very hard to tell without using the actual hardware that will run this, and I think it would be good for an informed decision.

I didn't use any dataset in particular. I just picked up a sizeable chunk of my own personal pictures and ran the pipeline.

Let me try to get some other benchmarks.

In any case, this PR enables detecting faces in much more diverse orientations AND putting a name to it.

I still need to know if you want the models to be embedded in the repo or if they can be downloaded via a script (e.g. before uploading to a flash drive). I'll try and come back with results.

@lfcnassif

Member

I think downloading the model the first time is fine.

> I didn't use any dataset in particular. I just picked up a sizeable chunk of my own personal pictures and ran the pipeline.

So I guess they are high resolution? Try to increase the max resolution limit option in the old config to see if it helps the old implementation.

> Let me try to get some other benchmarks

If possible, a memory usage evaluation would be very interesting.

Thanks again for this great work!

@douglas125

douglas125 commented Mar 10, 2026

Author

AuraFace vs dlib — updated analysis

Benchmark (377 images, RTX 3060, corrected)

Correction from previous benchmark: the "AuraFace CPU" rows were mislabelled.
In my benchmark script, setting CUDA_VISIBLE_DEVICES='' did not stop onnxruntime-gpu
from using CUDA (the variable was being ignored and overridden), so those rows were
actually GPU-accelerated. I re-ran them with onnxruntime-gpu uninstalled (confirmed by
the provider warning Available providers: AzureExecutionProvider, CPUExecutionProvider),
which produced the corrected CPU-only numbers below.
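
One way to make the CPU/GPU choice explicit, rather than relying on environment variables, is to select the execution providers in code. The helper below is a hypothetical, stdlib-only sketch of that selection logic; it is not taken from FaceRecognitionProcess.py.

```python
def choose_providers(available, use_gpu):
    """Return an ordered provider list; CPU is always the final fallback."""
    chosen = []
    if use_gpu and "CUDAExecutionProvider" in available:
        chosen.append("CUDAExecutionProvider")
    chosen.append("CPUExecutionProvider")
    return chosen


gpu_build = ["CUDAExecutionProvider", "CPUExecutionProvider"]
cpu_build = ["AzureExecutionProvider", "CPUExecutionProvider"]

forced_cpu = choose_providers(gpu_build, use_gpu=False)
auto_gpu = choose_providers(gpu_build, use_gpu=True)
fallback = choose_providers(cpu_build, use_gpu=True)
```

With ONNX Runtime installed, `available` would come from `onnxruntime.get_available_providers()` and the result would be passed as the `providers=` argument of `onnxruntime.InferenceSession`. Calling `get_providers()` on the session afterwards shows what is actually in use, which is a useful check to log in a benchmark.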

The dlib VRAM numbers (586–1770 MB) were system baseline (browser, compositor), not
dlib using the GPU — dlib has no ONNX/CUDA dependency.

| Config | Faces | Wall time | ms/img | ms/face | Dim | Peak RAM | Notes |
|---|---|---|---|---|---|---|---|
| dlib HOG x1 CPU | 314 | 62s | 166ms | 198ms | 128 | 1.0 GB | true CPU |
| dlib HOG x2 CPU | 314 | 31s | 82ms | 98ms | 128 | 2.0 GB | true CPU |
| dlib HOG x4 CPU | 314 | 18s | 47ms | 56ms | 128 | 3.8 GB | true CPU |
| dlib CNN x1 CPU | 421 | 52s | 138ms | 124ms | 128 | 1.3 GB | true CPU |
| dlib CNN x2 CPU | 421 | 25s | 65ms | 58ms | 128 | 2.5 GB | true CPU |
| dlib CNN x4 CPU | 421 | 15s | 40ms | 36ms | 128 | 4.7 GB | true CPU |
| AuraFace x1 CPU | 1,460 | 474s | 1259ms | 325ms | 512 | 0.9 GB | true CPU (onnxruntime, no GPU pkg) |
| AuraFace x2 CPU | 1,460 | 228s | 606ms | 157ms | 512 | 1.7 GB | true CPU |
| AuraFace x4 CPU | 1,460 | 164s | 435ms | 113ms | 512 | 3.3 GB | true CPU |
| AuraFace x1 GPU | 1,460 | 62s | 165ms | 43ms | 512 | 1.3 GB | onnxruntime-gpu, CUDAExecutionProvider |
| AuraFace x2 GPU | 1,460 | 29s | 77ms | 20ms | 512 | 2.5 GB | onnxruntime-gpu |
| AuraFace x4 GPU | 1,460 | 19s | 50ms | 13ms | 512 | 4.9 GB | onnxruntime-gpu |

I can wire dlib back in this PR, but I (still) advise against it. Here's my reasoning:

1. Detection: dlib misses most faces

On WIDER FACE Hard (small, occluded, profile faces — the realistic scenario):

| Detector | Easy | Medium | Hard |
|---|---|---|---|
| RetinaFace-R50 (MIT license) | 96.5% | 95.6% | 90.4% |
| dlib CNN (MMOD) | ~70% | ~60% | ~30% |
| dlib HOG | even lower | | |

Our own benchmark on 377 real-world images (my personal photo collection) confirms this:

| Detector | Faces found |
|---|---|
| dlib HOG | 314 |
| dlib CNN | 421 |
| RetinaFace (AuraFace) | 1,460 |

dlib HOG misses 78% of the faces AuraFace finds. Even dlib CNN misses 71%. Faces
that aren't detected are invisible to the investigation — they can never be matched,
regardless of how fast the detector runs.

2. Recognition: dlib has 1.8x higher error rate

On LFW (Labeled Faces in the Wild):

| Model | Accuracy | Error rate |
|---|---|---|
| AuraFace (ArcFace architecture) | 99.65% | 0.35% |
| dlib face_recognition | 99.38% | 0.62% |

dlib's error rate is 1.8x higher. In a forensic case with 10,000 face comparisons,
that works out to ~62 wrong decisions for dlib vs ~35 for AuraFace. Every false positive
wastes investigator time; every false negative is a missed lead.
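
The arithmetic behind those figures, as a short sketch. It assumes the LFW accuracies quoted above carry over unchanged to case data, which is an optimistic simplification, and the function name is hypothetical.

```python
def expected_errors(accuracy_pct, comparisons):
    """Expected number of wrong verification decisions at a given accuracy."""
    return round(comparisons * (100.0 - accuracy_pct) / 100.0)


dlib_errors = expected_errors(99.38, 10_000)
auraface_errors = expected_errors(99.65, 10_000)
error_ratio = (100.0 - 99.38) / (100.0 - 99.65)   # about 1.77, quoted as 1.8x
```

LFW accuracy counts false accepts and false rejects together, so these totals mix both kinds of error.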

On harder benchmarks the gap widens further:

| Benchmark | AuraFace | dlib |
|---|---|---|
| CFP-FP (cross-pose) | 95.19% | not published |
| AgeDB (age variation) | 96.10% | not published |

dlib's 128-d embedding doesn't capture enough information for robust matching across pose
and aging — scenarios common in forensic investigations.

3. Speed: it depends on whether a GPU is available

This is where the picture is more nuanced than the previous report suggested.

With a GPU (onnxruntime-gpu): AuraFace and dlib are roughly comparable in wall time,
but AuraFace processes 3.5–4.6x more faces. Normalized per detected face, AuraFace is
faster:

| Config | Wall time | Faces | ms/img | ms/face |
|---|---|---|---|---|
| dlib CNN x4 CPU | 15s | 421 | 40ms | 36ms |
| AuraFace x4 GPU | 19s | 1,460 | 50ms | 13ms |

Without a GPU (plain onnxruntime, CPU-only): AuraFace is significantly slower:

| Config | Wall time | Faces | ms/img | ms/face |
|---|---|---|---|---|
| dlib CNN x4 CPU | 15s | 421 | 40ms | 36ms |
| AuraFace x4 CPU (true) | 164s | 1,460 | 435ms | 113ms |

AuraFace CPU-only is ~11x slower in wall time than dlib CNN x4. For CPU-only users
this is a real regression, offset by the fact that it still finds 3.5x more faces, but
whether that trade-off is acceptable depends on case size and time constraints.
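
The ratios quoted in this section can be recomputed from the table values. The sketch below uses only the wall times and face counts reported above; because the wall times are rounded to whole seconds, recomputed per-image and per-face figures may differ from the table by about 1 ms.

```python
IMAGES = 377

runs = {
    "dlib CNN x4 CPU": {"wall_s": 15, "faces": 421},
    "AuraFace x4 GPU": {"wall_s": 19, "faces": 1460},
    "AuraFace x4 CPU": {"wall_s": 164, "faces": 1460},
}


def ms_per_image(run):
    return 1000.0 * run["wall_s"] / IMAGES


def ms_per_face(run):
    return 1000.0 * run["wall_s"] / run["faces"]


dlib = runs["dlib CNN x4 CPU"]
cpu = runs["AuraFace x4 CPU"]

wall_ratio = cpu["wall_s"] / dlib["wall_s"]   # CPU-only slowdown in wall time
face_ratio = cpu["faces"] / dlib["faces"]     # how many more faces are found
per_face = {name: ms_per_face(run) for name, run in runs.items()}
per_image = {name: ms_per_image(run) for name, run in runs.items()}
```

Here `wall_ratio` comes out near 11 and `face_ratio` near 3.5, matching the figures in the text.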

4. Embedding quality: 512-d vs 128-d

dlib uses 128-dimensional embeddings, which omit facial detail and struggle with
similarity search across pose and aging. AuraFace uses 512-dimensional embeddings from a
ResNet-100 backbone trained with ArcFace (Additive Angular Margin) loss on millions of
identities, which means far more discriminative power for distinguishing similar-looking individuals.
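
For readers unfamiliar with the ArcFace loss mentioned above, its standard per-sample form is shown below as background. The symbols are the usual ones from the ArcFace formulation, not values taken from the AuraFace training setup: $\theta_j$ is the angle between the L2-normalized embedding and the weight vector of class $j$, $y$ is the true identity, $s$ is a scale factor and $m$ is the additive angular margin.

```latex
L = -\log
  \frac{e^{\,s\cos(\theta_{y} + m)}}
       {e^{\,s\cos(\theta_{y} + m)} + \sum_{j \neq y} e^{\,s\cos\theta_{j}}}
```

Because the margin is applied to the angle itself, training pushes embeddings of the same identity closer together on the unit hypersphere, which is why cosine distance is the natural comparison metric for these embeddings.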

5. Licensing: fully open-source

| Component | License |
|---|---|
| RetinaFace-R50 detection | MIT |
| AuraFace-v1 recognition | Apache 2.0 |
| dlib | Boost |
| dlib face_recognition wrapper | MIT |

AuraFace was specifically designed for commercial/institutional use with clean training
data (no MS-Celeb-1M licensing concerns).

6. Maintenance cost of keeping dlib

Supporting both models would require:

  • Two Python dependency sets (dlib + face_recognition vs onnxruntime + opencv-python)
  • Two code paths in the subprocess script
  • Two embedding dimensions (128-d vs 512-d) that cannot cross-match — cases
    processed with one model are incompatible with the other
  • User confusion about which to pick

7. Summary

| Metric | dlib HOG/CNN | AuraFace + RetinaFace | Winner |
|---|---|---|---|
| Detection (WIDER Hard) | ~30% | 90.4% | AuraFace by 3x |
| Recognition error rate (LFW) | 0.62% | 0.35% | AuraFace 1.8x lower |
| Cross-pose (CFP-FP) | not published | 95.19% | AuraFace |
| Age variation (AgeDB) | not published | 96.10% | AuraFace |
| Faces found (377 images) | 314–421 | 1,460 | AuraFace by 3.5–4.6x |
| Embedding dimensions | 128 | 512 | AuraFace |
| License | Boost | MIT + Apache 2.0 | AuraFace |
| ms/face, GPU available (x4) | 36ms | 13ms | AuraFace 2.8x faster |
| ms/face, CPU-only (x4) | 36ms | 113ms | dlib 3x faster |
| Wall time, GPU available (x4) | 15–18s | 19s | roughly equal |
| Wall time, CPU-only (x4) | 15–18s | 164s | dlib ~10x faster |

For users with a GPU, there is no dimension where dlib is the better choice. AuraFace
detects 3.5–4.6x more faces, has a 1.8x lower recognition error rate, handles pose and
aging better, uses a cleaner license, and runs at comparable speed.

For CPU-only users, dlib is currently faster, but at the cost of missing 71–78% of
the faces AuraFace finds and producing lower-quality embeddings. Whether that trade-off
is acceptable is a policy decision: a faster scan that misses most faces may be worse
than a slower scan that finds far more of them.

Given that IPED targets forensic professionals who typically run on capable hardware, and
that the performance gap can be narrowed further by tuning ONNX thread counts for CPU
inference, I recommend proceeding with AuraFace only. If CPU-only support becomes a
stated requirement, it can be added as a config-file option later without breaking any
existing cases (since the embedding dimension change is already a breaking change
regardless).


@douglas125

Author

I think you should give it a shot. It came out really good. I hope you don't mind me using it to organize my personal pictures LOL

@lfcnassif

Member

Thank you very much @douglas125 for all your tests! We will review and test it for sure as soon as we find some available time.

I just wonder why the CPU tests are using VRAM: the more CPU threads used, the more VRAM. Are you sure the GPU wasn't used? Also, the number of faces detected by dlib seems to have decreased compared to your original tests, while the number detected by the new implementation increased. Any idea why?

@douglas125

Author

Nice catch. You are right. I ran the tests again and the results look a lot worse for the CPU version. Let me remove the previous version and re-analyze things.

@douglas125

Author

I updated my previous comment. You were right: CUDA_VISIBLE_DEVICES in my script was being ignored and overridden.
I can put dlib back as a fallback, but I still think you'd be losing so much. Let me know.
