Chart Diversity Scorer (VOL Focus)

A standalone tool to compute diversity scores for collections of chart images using DINOv2 embeddings, with primary focus on the VOL (Volume) metric.

What is VOL?

VOL (Volume) is a diversity metric that measures how well a collection of images spans the embedding space. It's computed as the geometric mean of eigenvalues of the Gram matrix (similarity matrix) of L2-normalized embeddings.

Interpretation

Range: [0, 1]
Higher VOL = more diverse/spread out images
VOL = 1 indicates maximum diversity (orthogonal embeddings)
VOL → 0 indicates low diversity (similar/duplicate images)
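
The [0, 1] range follows directly from the definition, assuming the Gram matrix S is built from unit-norm embeddings as described above. A short derivation, where e_i denotes the i-th normalized embedding, n the number of images, and λ_i the eigenvalues of S (these symbols are introduced here for illustration):

```latex
\begin{aligned}
S_{ii} &= \lVert e_i \rVert^2 = 1
  \;\Longrightarrow\; \sum_{i=1}^{n} \lambda_i = \operatorname{tr}(S) = n, \\
\mathrm{VOL} &= \Bigl(\prod_{i=1}^{n} \lambda_i\Bigr)^{1/n}
  \;\le\; \frac{1}{n}\sum_{i=1}^{n} \lambda_i = 1
  \qquad \text{(AM-GM, valid because every } \lambda_i \ge 0\text{)}.
\end{aligned}
```

Equality holds only when every λ_i = 1, that is, when S is the identity and all embeddings are mutually orthogonal. At the other extreme, two identical embeddings make S singular, so at least one eigenvalue is 0 and the product collapses to 0.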

Why VOL?

VOL is particularly useful for:

Detecting duplicate or near-duplicate images
Measuring dataset diversity
Evaluating synthetic data generation quality
Comparing different chart collections

Features

🚀 Uses state-of-the-art DINOv2 vision transformer
📊 Focuses on VOL metric with supporting metrics
🖼️ Supports multiple image formats (PNG, JPG, SVG, etc.)
⚡ GPU-accelerated (with CPU fallback)
💾 Optional saving of embeddings and scores
📈 Eigenvalue statistics for deeper analysis

Installation

Requirements

Python 3.9 or higher
CUDA-capable GPU (optional)
uv - Fast Python package installer

Quick Setup

Option 1: Automated Setup (Recommended)

bash setup.sh

This script will:

Check Python version
Install uv if needed
Install all dependencies
Verify the installation

Option 2: Manual Setup

Install uv (if not already installed):

# Via curl (Linux/macOS)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Via Homebrew (macOS)
brew install uv

# Via pip (any platform)
pip install uv

Install dependencies:

With virtual environment (recommended):

uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
uv pip install -e .

Or system-wide:

uv pip install --system -e .

Why uv? It is typically much faster than pip (its maintainers report 10-100x speedups) and includes a more robust dependency resolver.

Usage

Basic Usage

python compute_diversity.py /path/to/your/charts

Save Results to File

python compute_diversity.py /path/to/your/charts --save-scores

Save Both Scores and Embeddings

python compute_diversity.py /path/to/your/charts \
    --save-scores \
    --save-embeddings \
    --output-dir results

Use CPU Only (No GPU)

python compute_diversity.py /path/to/your/charts --cpu

Adjust Batch Size

# Larger batch size for more GPU memory
python compute_diversity.py /path/to/your/charts --batch-size 32

# Smaller batch size for less GPU memory
python compute_diversity.py /path/to/your/charts --batch-size 8

Command-Line Options

positional arguments:
  input_dir             Folder with chart images (supports PNG, JPG, SVG, etc.)

optional arguments:
  -h, --help            Show help message and exit
  --output-dir DIR      Output folder for results (default: diversity_out)
  --batch-size N        Batch size for processing (default: 16)
  --cpu                 Force CPU usage (disable GPU)
  --save-embeddings     Save embeddings to file
  --save-scores         Save scores to file
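
For readers who want to script around the tool, the options above can be mirrored with a small argparse parser. This is a sketch reconstructed from the help text, not the actual source of compute_diversity.py; the function name build_parser and the description string are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Compute diversity scores for a folder of chart images."
    )
    parser.add_argument("input_dir", help="Folder with chart images")
    parser.add_argument("--output-dir", default="diversity_out", metavar="DIR",
                        help="Output folder for results")
    parser.add_argument("--batch-size", type=int, default=16, metavar="N",
                        help="Batch size for processing")
    parser.add_argument("--cpu", action="store_true",
                        help="Force CPU usage (disable GPU)")
    parser.add_argument("--save-embeddings", action="store_true",
                        help="Save embeddings to file")
    parser.add_argument("--save-scores", action="store_true",
                        help="Save scores to file")
    return parser


# Parse an example command line instead of sys.argv.
args = build_parser().parse_args(["charts", "--batch-size", "32", "--save-scores"])
print(args.batch_size, args.output_dir, args.cpu, args.save_scores)
# prints: 32 diversity_out False True
```

Note that argparse converts the dashes in flag names to underscores, so --batch-size is read back as args.batch_size.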

Output

Console Output

The script prints comprehensive results including:

Primary Metric: VOL score
Eigenvalue Statistics: Mean, std, min, max
Supporting Metrics: MPD, RAD, Q10, ENT

Example output:

============================================================
🎯 DIVERSITY SCORES (VOL FOCUS)
============================================================
Number of images: 50
Embedding dimension: 768
------------------------------------------------------------

🔷 PRIMARY METRIC:
  VOL (Volume):                   0.234567

  Interpretation:
  • VOL represents the 'volume' of the convex hull in embedding space
  • Higher VOL = more diverse/spread out images
  • Range: [0, 1], where 1 = maximum diversity
  • Computed as geometric mean of eigenvalues
------------------------------------------------------------

📈 EIGENVALUE STATISTICS:
  Mean:    1.234567
  Std:     0.123456
  Min:     0.012345
  Max:     2.345678
------------------------------------------------------------

📏 SUPPORTING METRICS:
  MPD (Mean Pairwise Distance):   0.456789
  RAD (Minimum Distance):         0.012345
  Q10 (10th Percentile):          0.123456
  ENT (Normalized Entropy):       0.789012
============================================================

File Output

When using --save-scores, a detailed text file is saved with:

All metrics and scores
Eigenvalue statistics
VOL interpretation guide

When using --save-embeddings, a NumPy array file (.npy) is saved containing the DINOv2 embeddings for all images.
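
Saved embeddings can be reused without re-running the model. The sketch below shows one way to load an .npy file and look for the most similar pair of images. The file name embeddings.npy is hypothetical (the actual name inside the output directory is not stated here), and the demo writes random stand-in data so it can run on its own. It also assumes one row per image; whether the saved rows are already normalized is not specified, so the code normalizes them again.

```python
import os
import tempfile

import numpy as np

# Stand-in for a file produced by --save-embeddings (hypothetical file name).
rng = np.random.default_rng(0)
demo_path = os.path.join(tempfile.mkdtemp(), "embeddings.npy")
np.save(demo_path, rng.normal(size=(5, 768)).astype(np.float32))

embeddings = np.load(demo_path)  # expected shape: (n_images, embedding_dim)

# L2-normalize defensively; this is harmless if rows are already unit length.
unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
similarity = unit @ unit.T  # cosine similarity between every pair of images

# Find the most similar pair of distinct images (a likely near-duplicate).
off_diag = similarity.copy()
np.fill_diagonal(off_diag, -np.inf)
i, j = np.unravel_index(np.argmax(off_diag), off_diag.shape)
print(embeddings.shape, int(i), int(j), float(similarity[i, j]))
```

Mapping row indices back to file names requires knowing the order in which the tool processed the images, which is not documented here.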

Supported Image Formats

PNG (.png)
JPEG (.jpg, .jpeg)
WebP (.webp)
BMP (.bmp)
TIFF (.tif, .tiff)
SVG (.svg) - requires cairosvg
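
To check in advance which files in a folder would qualify, the extension list above can be applied with pathlib. This is a minimal sketch: the names SUPPORTED_EXTENSIONS and find_images are illustrative, and it assumes case-insensitive matching and no recursion into subfolders, neither of which is confirmed for the tool itself.

```python
import tempfile
from pathlib import Path

# Extensions taken from the list above.
SUPPORTED_EXTENSIONS = {
    ".png", ".jpg", ".jpeg", ".webp", ".bmp", ".tif", ".tiff", ".svg",
}


def find_images(folder):
    """Return supported image files directly inside `folder`, sorted by path."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS
    )


# Small demonstration on a temporary folder.
demo_dir = Path(tempfile.mkdtemp())
for name in ["a.png", "b.JPG", "c.svg", "notes.txt"]:
    (demo_dir / name).touch()

print([p.name for p in find_images(demo_dir)])
# prints: ['a.png', 'b.JPG', 'c.svg']
```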

Understanding the Metrics

Primary Metric

VOL (Volume): Geometric mean of eigenvalues of the Gram matrix. Measures the "volume" spanned by embeddings in high-dimensional space.

Supporting Metrics

MPD (Mean Pairwise Distance): Average distance between all pairs of images
RAD (Radius, labeled "Minimum Distance" in the console output): Smallest pairwise distance; a value near 0 signals duplicate or near-duplicate images
Q10 (10th Percentile): 10th percentile of pairwise distances (robust to outliers)
ENT (Normalized Entropy): Entropy of eigenvalue distribution (normalized by log(n))
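
The four supporting metrics can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the tool's code: it uses Euclidean distance between L2-normalized embeddings (the tool may use cosine or another distance), and the function name supporting_metrics is made up for the example. It needs at least two images, since ENT divides by log(n).

```python
import numpy as np


def supporting_metrics(embeddings):
    """Compute MPD, RAD, Q10 and ENT from an (n, d) embedding array.

    Assumes Euclidean distance on L2-normalized rows; the tool's own
    distance choice may differ.
    """
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    gram = unit @ unit.T
    n = gram.shape[0]

    # For unit vectors, ||a - b||^2 = 2 - 2 * (a . b); use each pair once.
    upper = np.triu_indices(n, k=1)
    dists = np.sqrt(np.clip(2.0 - 2.0 * gram[upper], 0.0, None))

    # Eigenvalues of the symmetric Gram matrix, rescaled to sum to 1.
    eigvals = np.clip(np.linalg.eigvalsh(gram), 0.0, None)
    p = eigvals / eigvals.sum()
    nonzero = p[p > 0]  # treat 0 * log(0) as 0
    entropy = -np.sum(nonzero * np.log(nonzero)) / np.log(n)

    return {
        "MPD": float(dists.mean()),
        "RAD": float(dists.min()),
        "Q10": float(np.percentile(dists, 10)),
        "ENT": float(entropy),
    }


# Three mutually orthogonal embeddings plus an exact duplicate of the first.
demo = np.array([[1.0, 0, 0], [0, 1.0, 0], [0, 0, 1.0], [1.0, 0, 0]])
print(supporting_metrics(demo))
```

In this demo RAD is 0 because of the duplicated row, while MPD stays large; this is why the minimum distance is the more sensitive duplicate detector.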

Eigenvalue Statistics

Mean: Average eigenvalue (for a Gram matrix of unit-norm embeddings the eigenvalues sum to n, so this should be 1; it mainly serves as a sanity check)
Std: Standard deviation (indicates variability in dimensions)
Min: Smallest eigenvalue (detects collapsed dimensions)
Max: Largest eigenvalue (detects dominant directions)

Technical Details

DINOv2 Model

This tool uses DINOv2 ViT-Base-Patch14 (vit_base_patch14_dinov2), a self-supervised vision transformer trained on diverse image data. It produces 768-dimensional embeddings that capture semantic visual features.

Why DINOv2?

State-of-the-art for visual similarity
No task-specific fine-tuning needed
Robust to image variations
Excellent for chart/diagram understanding

Volume Computation

1. Compute similarity matrix: S = E @ E^T (where E are normalized embeddings)
2. Compute eigenvalues: λ₁, λ₂, ..., λₙ = eig(S)
3. VOL = (∏ λᵢ)^(1/n) = geometric mean of eigenvalues
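
The three steps above can be sketched with NumPy as follows. This is a minimal illustration rather than the tool's actual implementation: the function name vol_score and the eps floor are assumptions, added so that the logarithm stays finite when the Gram matrix is singular.

```python
import numpy as np


def vol_score(embeddings, eps=1e-12):
    """Geometric mean of Gram-matrix eigenvalues for an (n, d) embedding array.

    `eps` is a hypothetical floor that keeps log() finite for singular
    Gram matrices; the tool may handle that case differently.
    """
    # L2-normalize each embedding (row) so that S has ones on its diagonal.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    # Step 1: similarity (Gram) matrix S = E @ E^T.
    gram = unit @ unit.T
    # Step 2: eigenvalues; eigvalsh is appropriate because S is symmetric.
    eigvals = np.linalg.eigvalsh(gram)
    # Step 3: geometric mean, computed in log space for numerical stability.
    return float(np.exp(np.mean(np.log(np.clip(eigvals, eps, None)))))


orthogonal = np.eye(4)                                      # maximally diverse
with_duplicate = np.vstack([np.eye(4)[:3], np.eye(4)[:1]])  # first row repeated
print(vol_score(orthogonal))      # 1.0 up to floating-point rounding
print(vol_score(with_duplicate))  # close to 0
```

One consequence of this definition is worth keeping in mind: the Gram matrix has rank at most the embedding dimension (768 here), so for collections with more images than that, some eigenvalues are exactly zero and the unregularized score is 0. How the tool treats that case is not described here.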

Performance Tips

For Large Datasets

Use GPU: Ensure PyTorch is installed with CUDA support
Increase batch size: Use --batch-size 32 or higher if GPU memory allows
Monitor memory: Reduce batch size if you encounter OOM errors

For Small Datasets

Use CPU: Add the --cpu flag when GPU startup overhead outweighs the speedup (roughly fewer than 100 images)
Smaller batch size: Use --batch-size 8 to reduce memory usage

Expected Processing Times

100 images: ~30 seconds (GPU) / ~2 minutes (CPU)
1000 images: ~5 minutes (GPU) / ~20 minutes (CPU)
10000 images: ~45 minutes (GPU) / ~3 hours (CPU)

Times are approximate and depend on hardware.

Troubleshooting

"CUDA out of memory" Error

Reduce batch size:

python compute_diversity.py /path/to/charts --batch-size 4

"No images found" Error

Ensure your folder contains supported image formats and check file permissions.

SVG Rendering Issues

If SVG files fail to load:

# On macOS
brew install cairo

# On Ubuntu/Debian
sudo apt-get install libcairo2-dev

# Then reinstall cairosvg
uv pip install --upgrade cairosvg

This tool is provided as-is for research and evaluation purposes.
