#2820: Replace dlib with AuraFace for face recognition#2823
douglas125 wants to merge 15 commits into
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ace + RetinaFace) Swap face detection/encoding from dlib (128-d, HOG) to InsightFace (512-d ArcFace embeddings, RetinaFace detector) for significantly better accuracy. Update distance metric from squared Euclidean to cosine distance. Add dimension guard for backwards compatibility with old 128-d indexes. Include download script for offline/portable model provisioning. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
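A minimal Python sketch of the cosine distance and dimension guard described in this commit. The PR's actual guard lives in the Java search code, so the function name and the choice to return `None` on a mismatch are assumptions made for illustration only.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cos(a, b), or None when the vectors cannot be compared."""
    # Dimension guard: a legacy 128-d embedding and a new 512-d embedding
    # live in different spaces, so refuse to compare them.
    if len(a) != len(b):
        return None
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return None  # a zero vector has no direction
    return 1.0 - dot / (norm_a * norm_b)
```

Returning a sentinel instead of raising lets a search loop skip entries from an old index without aborting the whole query; whether the PR does this is not stated in the commit message.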
- Use face.normed_embedding instead of face.embedding: InsightFace returns raw (unnormalized) embeddings by default; cosine distance requires L2-normalized vectors
- Use root= parameter in FaceAnalysis instead of INSIGHTFACE_HOME env var: env var was unreliable, root= is the correct API
Both issues found and verified via testing with buffalo_l on t1.jpg (6 faces detected, all embeddings norm=1.0, cosine distance correct).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
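To illustrate why normalization matters here, the following sketch (pure Python, with hypothetical helper names and toy 2-d vectors) normalizes two vectors and checks the standard identity that, for unit vectors, squared Euclidean distance equals twice the cosine distance. It does not call InsightFace.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Toy stand-ins for a raw embedding and a reference embedding.
a = l2_normalize([3.0, 4.0])
b = l2_normalize([1.0, 0.0])

# With unit vectors the cosine distance reduces to 1 - dot product.
cos_dist = 1.0 - dot(a, b)
# The squared Euclidean distance on the same unit vectors is 2 * cos_dist.
sq_euclid = sum((x - y) ** 2 for x, y in zip(a, b))
```

On unnormalized vectors the shortcut `1 - dot` is no longer a cosine distance, which is consistent with the problem this commit describes.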
- Add environment.yml for conda env setup (Python 3.10, insightface, onnxruntime-gpu, cudnn=9 for CUDA 12/13 users)
- Add /models/ to .gitignore to exclude downloaded InsightFace models
- Use os._exit(0) after main() to avoid ONNX Runtime GPU session hanging on cleanup
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…dings, batch IPC
Fix 1 (biggest impact): default numFaceRecognitionProcesses to 1. Previously numThreads/2 subprocesses were spawned, each loading the 300MB InsightFace model independently (~30s each). On a 22-thread machine that was ~11 × 30s = 5+ min before any inference started. The config property still overrides this for CPU-parallel setups.
Fix 2 (IPC win): pack 512 embedding floats onto one line. Replaces 512 individual print()/readline() round-trips per face with a single space-separated line using repr() for float precision. Reader side: np.array(line.split(), dtype=np.float32).
Fix 3 (batch IPC): send N images per subprocess round-trip. New batchSize config property (default 1, set 8-16 for GPU).
- FaceRecognitionProcess.py: new process_one_image() helper + batch:N command prefix in the main loop.
- FaceRecognitionTask.py: per-instance _batch_items buffer, _flushBatch(), processQueueEnd()/sendToNextTask() pattern (same as AgeEstimationTask) so items are held until their batch results are available.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
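The packed-line protocol from Fix 2 can be sketched as follows. The function names are hypothetical, and the reader uses plain `float()` rather than the `np.array(..., dtype=np.float32)` call mentioned in the commit so that the example stays dependency-free.

```python
def pack_embedding(values):
    """Serialize an embedding as one space-separated line.

    repr() of a Python float round-trips exactly, so no precision is lost.
    """
    return " ".join(repr(float(v)) for v in values)


def unpack_embedding(line):
    """Parse one packed line back into a list of floats."""
    return [float(token) for token in line.split()]


# Writer side: one print() per face instead of one per float.
line = pack_embedding([0.1, -2.5, 1e-07])
# Reader side: one readline() per face.
restored = unpack_embedding(line)
```

The design trades a slightly longer line for a single pipe round-trip per face, which is where the commit reports the IPC saving.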
…, fix finish() race
The batch architecture (Fix 3) introduced a deadlock: with maxProcesses=1 there is only one subprocess proc. Worker threads that finished their items early would call finish(), remove the single proc from the queue, and terminate it, leaving other workers blocked at processQueue.get() forever.
This commit:
- Reverts the batch/sendToNextTask/processQueueEnd architecture
- Restores the original direct process() flow (proven to work)
- Keeps Fix 1: default maxProcesses=1 (avoids N × 30s model loads)
- Keeps Fix 2: packed embeddings (512 floats on one line, not 512 readlines)
- Fixes the finish() race: uses a counter so only the LAST worker thread terminates the subprocess (same pattern as AgeEstimationTask.finish())
- Removes the not-yet-implemented batchSize config entry
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
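The "last worker terminates" pattern mentioned for the finish() race might look roughly like this. The class and attribute names are invented for the sketch and are not taken from FaceRecognitionTask.py or AgeEstimationTask.

```python
import threading


class SharedProcess:
    """Hypothetical wrapper around one subprocess shared by several workers."""

    def __init__(self, num_workers, terminate):
        self._remaining = num_workers  # workers that have not finished yet
        self._lock = threading.Lock()
        self._terminate = terminate    # callable that stops the subprocess

    def finish(self):
        """Called once by each worker; only the last caller terminates."""
        with self._lock:
            self._remaining -= 1
            is_last = self._remaining == 0
        if is_last:
            self._terminate()
        return is_last
```

Because the decrement and the comparison happen under the same lock, exactly one caller sees the counter reach zero, so workers that finish early can no longer kill a process that others still need.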
…lity
Three problems when createExternalProcess() fails with maxProcesses=1:
1. numCreatedProcs was incremented before the Popen/ping attempt. On failure it was never reset to 0, so all other worker threads blocked forever at processQueue.get() (no proc ever enters the queue). Fix: decrement numCreatedProcs in the except branch and return early.
2. log_stderr was only started after a successful ping, so import errors and model-download progress were invisible during startup. Fix: start the stderr logging thread immediately after Popen.
3. processQueue.get(block=True) had no timeout: on any failure the remaining workers would deadlock indefinitely. Fix: timeout=300 (5 min) with a clear error message on expiry.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
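A sketch of the counter rollback and the bounded queue wait described in this commit, using only the standard library. `ProcessPool`, `try_create`, and `acquire` are hypothetical names; the real task code is organized differently.

```python
import queue
import threading


class ProcessPool:
    """Hypothetical pool; names are illustrative, not taken from the PR."""

    def __init__(self, max_processes, start_process):
        self.max_processes = max_processes
        self.start_process = start_process  # returns a ready process or raises
        self.num_created = 0
        self.lock = threading.Lock()
        self.process_queue = queue.Queue()

    def try_create(self):
        with self.lock:
            if self.num_created >= self.max_processes:
                return False
            self.num_created += 1  # reserve a slot before the slow startup
        try:
            proc = self.start_process()
        except Exception:
            with self.lock:
                self.num_created -= 1  # roll back so another worker can retry
            return False
        self.process_queue.put(proc)
        return True

    def acquire(self, timeout_s=300):
        """Wait for a process, failing loudly instead of blocking forever."""
        try:
            return self.process_queue.get(block=True, timeout=timeout_s)
        except queue.Empty:
            raise RuntimeError(
                f"no face recognition process available after {timeout_s} s"
            ) from None
```

Without the rollback, a single failed startup leaves the counter at its maximum and the queue empty, which is the deadlock the commit message describes.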
The bundled IPED pythonw does not have insightface installed. Set pythonPath to the conda env that has the required packages. Users on other machines should update this path or install insightface into the IPED bundled Python. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tibility NumPy 2.x broke binary ABI compatibility; insightface native extensions were compiled against NumPy 1.x and crash with AttributeError: _ARRAY_API. NumPy 1.26.x is the latest 1.x series (Feb 2024) and fully compatible. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ean up config
- Fix critical bug: processing block was unreachable dead code inside except _queue.Empty after a return statement
- Remove numpy dependency from Jep context (use plain Python list instead of np.array) to avoid NumPy ABI conflicts with bundled Python's dlib
- Comment out machine-specific pythonPath in config
- Add clear install instructions for CPU-only and GPU setups
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…de quality
- Add AuraFace mode: MediaPipe detection + AuraFace-v1 recognition (fully Apache 2.0)
- Keep buffalo_l/buffalo_s as options (non-commercial InsightFace license)
- Auto-scale numProcesses to min(4, numThreads/2) for CPU parallelism
- Cache ONNX input name lookup (was per-face, now per-session)
- Derive bbox from 5 alignment points instead of iterating all 478 landmarks
- Extract rescale_bbox helper to deduplicate scale correction code
- Replace eval() with safe tuple parsing in FaceRecognitionTask
- Remove duplicate numProcs config read and dead protocol string re-assignments
- Remove unnecessary 3s sleep on subprocess failure
- Add MediaPipe landmarker download to offline download script
- Update environment.yml and config for auraface defaults
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
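Two of the items above lend themselves to a short sketch: parsing a tuple without eval(), and deriving a box from a handful of alignment points. The input format `"(10, 20, 30, 40)"`, both function names, and the tight min/max box (no margin) are assumptions for illustration; the PR may use a different format and padding.

```python
def parse_int_tuple(text):
    """Parse a string such as "(10, 20, 30, 40)" without calling eval().

    Anything that is not a plain integer raises ValueError instead of
    being executed as code.
    """
    inner = text.strip()
    if inner.startswith("(") and inner.endswith(")"):
        inner = inner[1:-1]
    parts = [p.strip() for p in inner.split(",") if p.strip()]
    return tuple(int(p) for p in parts)


def bbox_from_points(points):
    """Return (min_x, min_y, max_x, max_y) covering the given (x, y) points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))
```

Scanning five points instead of several hundred landmarks gives the same kind of extent at a fraction of the work, at the cost of a box that hugs the inner face unless a margin is added.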
…to InsightFace detection Replace PIL image loading with cv2.imread (IMREAD_COLOR) which auto-applies EXIF rotation, fixing bounding box mismatch on portrait/rotated images. Also switch auraface mode from MediaPipe to InsightFace RetinaFace detection with batch AuraFace recognition for better accuracy and performance. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…pdate descriptions
- Update header comments to reflect RetinaFace-R50 (MIT) + AuraFace (Apache 2.0) stack
- Fix error message to list actual dependencies (onnxruntime, opencv-python, numpy, pillow)
- Change embedded python path from pythonw to python (pythonw hides stderr)
- Update config comments and download script for standalone RetinaFace model
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… downloads, remove dead code Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
@lfcnassif @wladimirleite I've tested this meticulously. This stack feels much more accurate and it is faster using a consumer grade GPU. |
|
Thank you very much @douglas125 for this great contribution! Detecting 2.5x more faces is awesome! But I'm concerned about a 3x slowdown on CPU, since most of our users don't have a decent GPU card. Is your test database public? Many years ago, I had to heavily optimize the fast HOG detector to make it feasible to run on CPU without becoming a bottleneck, and I managed to achieve a 10x speedup. The main optimization was to resize high-resolution images to a maximum of 1024px in each dimension, resulting in much less area to scan but losing small faces; it was a trade-off decision. I would suggest keeping the old implementation for CPU users and adding a parameter to switch to the new one for GPU users. PS: Let's run performance tests with the current proposal first. |
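For reference, the resize trade-off described in this comment amounts to something like the sketch below: compute one scale factor that caps the longer side at 1024 px, detect on the smaller image, then map boxes back to original coordinates. The function names and the 4-tuple box layout are assumptions, not IPED's actual code.

```python
def downscale_factor(width, height, max_side=1024):
    """Return the factor (<= 1.0) that caps both dimensions at max_side."""
    largest = max(width, height)
    return 1.0 if largest <= max_side else max_side / largest


def rescale_bbox(bbox, factor):
    """Map a box found on the downscaled image back to original pixels."""
    return tuple(int(round(v / factor)) for v in bbox)


# A 4096x2048 photo is scanned at a quarter of its linear size.
factor = downscale_factor(4096, 2048)
original_box = rescale_bbox((10, 20, 30, 40), factor)
```

A face that is 40 px wide in the original shrinks to 10 px at this factor, which is how small faces get lost.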
|
Did you see it is possible to switch from HOG to a CNN detector with the old implementation? Could you try to run that with your test set? |
|
I can give that a try. In my experience, even embedded GPUs will give times comparable to dlib on CPU. Also, though I did not report that, using 8x processes comes very close to matching dlib x4 on CPU. I think it's very hard to tell without using the actual hardware that will run this, and I think it would be good for an informed decision. I didn't use any dataset in particular. I just picked up a sizeable chunk of my own personal pictures and ran the pipeline. Let me try to get some other benchmarks. In any case, this PR enables detecting faces in much more diverse orientations AND putting a name to it. I still need to know if you want the models to be embedded in the repo or if they can be downloaded via a script (e.g. before uploading to a flash drive). I'll try and come back with results. |
|
I think downloading the model the first time is fine.
So I guess they are high resolution? Try to increase the max resolution limit option in the old config to see if it helps the old implementation.
If possible, a memory usage evaluation would be very interesting. Thanks again for this great work! |
AuraFace vs dlib: updated analysis
Benchmark (377 images, RTX 3060, corrected)
I can wire dlib back in this PR, but I (still) advise against it. Here's my reasoning:
1. Detection: dlib misses most faces
On WIDER FACE Hard (small, occluded, profile faces; the realistic scenario):
Our own benchmark on 377 real forensic images confirms this:
dlib HOG misses 78% of the faces AuraFace finds. Even dlib CNN misses 71%. Faces
2. Recognition: dlib has 1.8x higher error rate
On LFW (Labeled Faces in the Wild):
dlib's error rate is 1.8x higher. In a forensic case with 10,000 face comparisons,
On harder benchmarks the gap widens further:
dlib's 128-d embedding doesn't capture enough information for robust matching across pose
3. Speed: it depends on whether a GPU is available
This is where the picture is more nuanced than the previous report suggested.
With a GPU (onnxruntime-gpu): AuraFace and dlib are roughly comparable in wall time,
Without a GPU (plain onnxruntime, CPU-only): AuraFace is significantly slower:
AuraFace CPU-only is ~11x slower in wall time than dlib CNN x4. For CPU-only users
4. Embedding quality: 512-d vs 128-d
dlib uses 128-dimensional embeddings, which omit facial detail and struggle with
5. Licensing: fully open-source
AuraFace was specifically designed for commercial/institutional use with clean training
6. Maintenance cost of keeping dlib
Supporting both models would require:
7. Summary
For users with a GPU, there is no dimension where dlib is the better choice. AuraFace
For CPU-only users, dlib is currently faster, but at the cost of missing 71–78% of
Given that IPED targets forensic professionals who typically run on capable hardware, and
Sources |
|
I think you should give it a shot. It came out really good. I hope you don't mind me using it to organize my personal pictures LOL |
|
Thank you very much @douglas125 for all your tests! We will review and test it for sure as soon as we find some available time. I just wonder why the tests on CPU are using VRAM: the more CPU threads used, the more VRAM. Are you sure the GPU card wasn't used? It also seems the number of faces detected by dlib decreased compared to your original tests, while the faces detected by the new implementation increased compared to the initial tests. Any idea why? |
|
Nice catch. You are right. I ran the tests again and the results look a lot worse for the CPU version. Let me remove the previous version and re-analyze things |
|
I updated my previous comment. You were right: CUDA_VISIBLE_DEVICES in my script was being ignored and overridden. |
Fixes #2820.
Replaces the legacy dlib/face_recognition pipeline (2017-era HOG detector, 128-d embeddings) with a modern ONNX-based pipeline: RetinaFace-R50 for detection + AuraFace-v1 for recognition (512-d, L2-normalized embeddings).
Benchmark (377 images, RTX 3060)
Licensing
The default stack is fully open-source and free for commercial use:
insightface Python package required for default mode
Optional buffalo_l/buffalo_s models (InsightFace non-commercial license) are available as a config option for users willing to accept that license.
What changed
- download_insightface_models.py for offline use
- FaceRecognitionModelConfig.py
Files changed (9 modified/new)
- FaceRecognitionProcess.py: ONNX detection + recognition pipeline (standalone RetinaFace + AuraFace)
- FaceRecognitionTask.py: subprocess orchestration, Python env setup, IPC
- FaceRecognitionModelConfig.py (new): centralized model URLs and filenames
- download_insightface_models.py (new): offline model download helper
- FaceRecognitionConfig.txt: updated config with new options and descriptions
- SimilarFacesSearch.java: cosine distance, 512-d embeddings, dimension guard
- ElasticSearchIndexTask.java: 512-d vectors, cosinesimil metric
- environment.yml (new): conda environment spec
- .gitignore: ignore CLAUDE.md
How to test
- enableFaceRecognition = true in IPEDConfig.txt
- onnxruntime instead of onnxruntime-gpu on CPU-only machines)
- pythonPath in conf/FaceRecognitionConfig.txt:
Questions for maintainers
- Bundled Python artifact: The current artifact (python-jep-dlib) ships dlib + face_recognition. Should it be updated to include onnxruntime + opencv-python + pillow instead? Or should we add a pip install step to the build?
- Model bundling: Should models (~365 MB) be bundled in the release artifact, or keep the current approach of auto-download on first run? The download_insightface_models.py script handles offline pre-download.
🤖 Generated with Claude Code
Co-Authored-By: Claude Opus 4.6 noreply@anthropic.com