A modular toolkit for evaluating semantic embeddings and NLP models.
Current Version: v0.1.1 | Next Release: v0.2.0 (Core Expansion) | Stability: Beta - Production-Ready Foundations
✅ v0.1.1 Released: Complete testing infrastructure (153 tests, 51% coverage), CLI interface, error handling & logging, and CI/CD pipeline.
Existing benchmarks are excellent for standardized comparisons, but sometimes you need:
- ✅ Quick prototyping with your own data format
- ✅ Domain-specific testing without extensive setup
- ✅ Small-scale validation (100s of samples vs 1000s)
- ✅ Custom evaluation tasks tailored to your use case
- ✅ Embedding space insights beyond task performance
- Simple JSON format - Drop your data and go
- Modular design - Use only what you need
- Language-agnostic - Any language, any model
- Extensible - Add your own tasks and metrics
- Fast iteration - Minutes from data to insights
- Rapid prototyping and experimentation
- Domain-specific embedding evaluation
- Testing models on proprietary data
- RAG system optimization
- Research on new embedding architectures
- Educational projects and learning
💡 SemEval complements existing benchmarks by focusing on flexibility and ease of use for custom evaluation scenarios.
4 Core Tasks (13+ planned)
- Information Retrieval - NDCG, MRR, MAP metrics
- Semantic Similarity - Triplet evaluation with margin analysis
- Linguistic Robustness - Test stability under variations (typos, morphology, negation)
- Vector Arithmetic - Analogy and compositional semantics
- Embedding Quality Metrics (coming in v0.2.0)
- Isotropy & Uniformity Analysis
- Representation Health Checks
- Label-free Quality Assessment
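For intuition on what these label-free checks measure: uniformity (Wang & Isola, 2020) and a simple isotropy proxy can be computed straight from an embedding matrix. A minimal NumPy sketch, independent of whatever API v0.2.0 actually ships:

```python
import numpy as np

def uniformity(emb: np.ndarray, t: float = 2.0) -> float:
    """Wang & Isola (2020): log mean Gaussian potential over all pairs
    on the unit sphere. More negative = embeddings spread more uniformly."""
    x = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sq_dists = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    i, j = np.triu_indices(len(x), k=1)
    return float(np.log(np.mean(np.exp(-t * sq_dists[i, j]))))

def isotropy_proxy(emb: np.ndarray) -> float:
    """Ratio of smallest to largest singular value of the centered
    embeddings; closer to 1 means directions are used more evenly."""
    s = np.linalg.svd(emb - emb.mean(axis=0), compute_uv=False)
    return float(s[-1] / s[0])
```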
- ✅ CLI Interface - Fast, user-friendly command-line tools (eval, validate, init, version, info)
- ✅ Instant Startup - ~0.2s CLI performance with lazy loading optimization
- ✅ Comprehensive Testing - 153 tests, 51% coverage (core metrics 89-100%)
- ✅ Type Safety - Full Pydantic V2 validation
- ✅ Error Handling - 17 custom exceptions with rich context
- ✅ Structured Logging - Centralized logging with performance tracking
- ✅ CI/CD Pipeline - GitHub Actions with automated testing
- Multiple Encoders: Sentence Transformers, HuggingFace, custom encoders
- Type-Safe: Full Pydantic V2 validation
- Performance Optimized: Automatic GPU/CPU detection, batch processing
- Rich Exports: JSON, CSV, Markdown reports
- Installation
- Quick Start
- CLI Commands
- Usage Examples
- Evaluation Tasks
- Configuration
- Supported Encoders
- Data Format
- Export & Reporting
- Python 3.8 or higher
- PyTorch 1.9+
- sentence-transformers
- transformers
- pydantic>=2.0
- pydantic-settings
- pyyaml
# Using uv (recommended - fast and modern)
uv pip install -e .
# Using pip
pip install -e .
# With development dependencies
pip install -e ".[dev]"# Check if CLI is working
semeval version
# Get system info
semeval info
# Create a test template
semeval init -o test.json

The fastest way to get started with SemEval:
# Create a template test file
semeval init --template basic -o my_test.json
# Validate your test data
semeval validate my_test.json
# Run evaluation
semeval eval --model "sentence-transformers/all-MiniLM-L6-v2" --data my_test.json
# Check version and system info
semeval version
semeval info

Available Templates:
- basic - Semantic similarity examples
- ir - Information retrieval examples
- similarity - Comprehensive similarity tests
- robustness - Linguistic robustness tests
CLI Performance: Instant startup (~0.2s) with lazy loading of ML dependencies.
from semeval import TaskRunner, SentenceTransformerEncoder
# 1. Create an encoder with any sentence-transformers model
encoder = SentenceTransformerEncoder(
"sentence-transformers/all-MiniLM-L6-v2"
)
# 2. Create a runner
runner = TaskRunner(encoder=encoder, verbose=True)
# 3. Run all evaluation tasks
result = runner.run("data/test_data.json")
# 4. Get results
summary = result.get_summary()
print(f"Total runtime: {summary['total_runtime']:.2f}s")
# 5. Access task-specific metrics
for task_name, task_info in summary['tasks'].items():
print(f"\n{task_name}: {task_info['status']}")# English
encoder = SentenceTransformerEncoder("sentence-transformers/all-MiniLM-L6-v2")
# Multilingual
encoder = SentenceTransformerEncoder("Alibaba-NLP/gte-multilingual-base")
# Domain-specific (Turkish example)
encoder = SentenceTransformerEncoder("emrecan/bert-base-turkish-cased-mean-nli-stsb-tr")
# Your custom model
encoder = SentenceTransformerEncoder("your-organization/your-model")SemEval provides a powerful command-line interface for quick evaluations and automation.
Generate template test data files with proper schema:
# Create basic semantic similarity template
semeval init
# Create information retrieval template
semeval init --template ir -o ir_test.json
# Create robustness testing template
semeval init --template robustness -o robustness_test.json
# Overwrite existing file
semeval init --template similarity -o test.json --force

Available Templates:
- basic - Semantic similarity with 2 triplet examples
- ir - Information retrieval with sample corpus and queries
- similarity - Comprehensive semantic similarity tests
- robustness - Linguistic robustness (morphology, typos, negation)
Check your test data for errors before running evaluation:
# Basic validation
semeval validate test_data.json
# Strict mode (fail on warnings)
semeval validate test_data.json --strict
# Generate HTML validation report (planned)
semeval validate test_data.json --report validation.html

Validation Features:
- ✅ Schema validation with detailed error messages
- ✅ Data statistics (task counts, sample sizes)
- ✅ Quality warnings (small dataset warnings)
- ✅ Metadata verification
Evaluate models with your test data:
# Basic evaluation
semeval eval --model "sentence-transformers/all-MiniLM-L6-v2" --data test.json
# Run specific tasks only
semeval eval -m "model-name" -d test.json --tasks "ir,similarity"
# Use different encoder type
semeval eval -m "bert-base-uncased" -d test.json --encoder huggingface
# Specify device and output directory
semeval eval -m "model" -d test.json --device cuda --output results/
# Verbose mode
semeval eval -m "model" -d test.json --verbose
# Use config file (planned)
semeval eval --config config.yaml

Options:
- --model, -m - Model name or path (HuggingFace/SentenceTransformers)
- --data, -d - Path to test data JSON file
- --output, -o - Output directory for results (default: results/)
- --tasks, -t - Comma-separated task list (default: all tasks)
- --device - Device to use: auto, cpu, cuda, mps (default: auto)
- --encoder, -e - Encoder type: sentence-transformer, huggingface (default: sentence-transformer)
- --verbose, -v - Verbose output
- --config, -c - Config file path (planned)
Compare multiple models side-by-side (planned):
semeval compare --models "model1,model2,model3" --data test.json

Generate formatted reports from evaluation results (planned):
# Generate HTML report
semeval report results.json
# Generate Markdown report
semeval report results.json --format markdown
# Custom output path
semeval report results.json -o my_report.html

Show SemEval version:
semeval version

Display system and environment information:
semeval info

Shows:
- Python version
- Platform information
- PyTorch version
- CUDA availability
- MPS (Apple Silicon) availability
- GPU count
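Under the hood, such checks reduce to standard Python and torch calls; roughly (a sketch; torch.backends.mps requires a reasonably recent PyTorch):

```python
import platform
import sys

import torch

print("Python:", sys.version.split()[0])
print("Platform:", platform.platform())
print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("MPS available:", torch.backends.mps.is_available())
print("GPU count:", torch.cuda.device_count())
```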
SemEval CLI is optimized for instant startup with lazy loading:
- Help/Version/Info: ~0.2s (no ML dependencies loaded)
- Init/Validate: ~0.2s (lightweight operations)
- Eval: Model loading time + evaluation time (ML dependencies loaded only when needed)
This is achieved through:
- Lazy module imports using Python's __getattr__ (see the sketch below)
- Function-level imports for heavy dependencies
- No unnecessary torch/transformers loading for simple commands
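For reference, a module-level __getattr__ (PEP 562) lets a package resolve attributes on first access, so heavy modules load only when actually used. A minimal sketch with illustrative module paths, not SemEval's actual layout:

```python
# semeval/__init__.py (sketch)
import importlib

_LAZY = {
    # illustrative paths; SemEval's real module layout may differ
    "TaskRunner": "semeval.core.runner",
    "SentenceTransformerEncoder": "semeval.core.encoders",
}

def __getattr__(name):
    # Called only when normal attribute lookup fails, so torch/transformers
    # get imported on first use instead of at package import time.
    if name in _LAZY:
        return getattr(importlib.import_module(_LAZY[name]), name)
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
```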
from semeval import TaskRunner, SentenceTransformerEncoder, load_settings
# Load settings from config.yaml
settings = load_settings()
# Create encoder using config
encoder = SentenceTransformerEncoder(
settings.model.name,
device=settings.model.device
)
# Run with settings
runner = TaskRunner(encoder=encoder, settings=settings)
result = runner.run("data/test_data.json")Output:
[INFO] Starting Evaluation
[INFO] Loading test data from: data/test_data.json
[INFO] Model: sentence-transformers/all-MiniLM-L6-v2
[INFO] Running Information Retrieval Task
[INFO] Running Semantic Similarity Task
[INFO] Running Linguistic Robustness Task
[INFO] Running Vector Arithmetic Task
✅ Evaluation complete: 2.75s
from semeval.postprocess import ResultsExporter, ReportGenerator
exporter = ResultsExporter()
output_dir = settings.output.base_dir
# Export all formats
exporter.export_csv(result, f"{output_dir}/results.csv")
exporter.export_json(result, f"{output_dir}/results.json")
exporter.export_markdown(result, f"{output_dir}/results.md")
# Export per-task files
task_paths = exporter.export_per_task(result, output_dir)

from semeval import TaskRunner, SentenceTransformerEncoder
encoder = SentenceTransformerEncoder("model-name")
runner = TaskRunner(encoder=encoder)
# Run only Semantic Similarity task
result = runner.run_task("semantic_similarity", "data/test_data.json")
print(f"Triplet Accuracy: {result.metrics['accuracy']:.2%}")
print(f"Average Margin: {result.metrics['avg_margin']:.3f}")# Set environment variables
export SEMEVAL_MODEL__NAME="Alibaba-NLP/gte-multilingual-base"
export SEMEVAL_MODEL__DEVICE="cuda"
export SEMEVAL_LOGGING__VERBOSE="true"from semeval import load_settings, TaskRunner, SentenceTransformerEncoder
# Settings automatically load from env vars
settings = load_settings()
encoder = SentenceTransformerEncoder(
settings.model.name, # Uses env var
device=settings.model.device
)
runner = TaskRunner(encoder=encoder, settings=settings)
result = runner.run("data/test_data.json")

from semeval import TaskRunner, SentenceTransformerEncoder
from semeval.postprocess import ReportGenerator
# Run evaluation
encoder = SentenceTransformerEncoder("model-name")
runner = TaskRunner(encoder=encoder)
result = runner.run("data/test_data.json")
# Generate comprehensive markdown report
generator = ReportGenerator()
generator.generate_report(
result,
"output/comprehensive_report.md",
model_name="My Model",
include_recommendations=True
)

from semeval import TaskRunner, SentenceTransformerEncoder
import pandas as pd
models = [
"sentence-transformers/all-MiniLM-L6-v2",
"Alibaba-NLP/gte-multilingual-base",
"your-custom-model"
]
results = []
for model_name in models:
encoder = SentenceTransformerEncoder(model_name)
runner = TaskRunner(encoder=encoder, verbose=False)
result = runner.run("data/test_data.json")
summary = result.get_summary()
results.append({
'model': model_name,
'ndcg@10': summary['tasks']['information_retrieval']['metrics'].get('cosine-NDCG@10', 0),
'triplet_acc': summary['tasks']['semantic_similarity']['metrics'].get('accuracy', 0),
'runtime': summary['total_runtime']
})
df = pd.DataFrame(results)
print(df)

SemEval includes 4 comprehensive evaluation tasks:
Evaluates the model's ability to retrieve relevant documents for queries.
Metrics:
- NDCG@k (Normalized Discounted Cumulative Gain)
- MRR@k (Mean Reciprocal Rank)
- MAP@k (Mean Average Precision)
- Precision@k, Recall@k, Accuracy@k
Usage:
result = runner.run_task("information_retrieval", "data/test_data.json")
print(f"NDCG@10: {result.metrics['cosine-NDCG@10']:.4f}")Data Requirements:
- Corpus of documents
- Query set
- Relevance judgments (query-doc pairs with scores 0-2)
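For reference, NDCG@k can be computed from graded relevance in a few lines; a standalone sketch using the linear-gain convention, not SemEval's internal implementation:

```python
import math

def ndcg_at_k(ranked_relevances, k):
    """NDCG@k for one query; ranked_relevances are graded scores (0-2)
    in the order the model ranked the documents."""
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0

# Model ranked a highly relevant doc (2) first, then 0, 1, 0:
print(ndcg_at_k([2, 0, 1, 0], k=3))  # ~0.95
```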
Tests the model's ability to distinguish between semantically similar and dissimilar text pairs using triplet evaluation.
Metrics:
- Triplet Accuracy
- Average Margin (positive_sim - negative_sim)
- Margin Distribution (> 0.1, > 0.2)
- Performance by difficulty level
- Performance by subcategory
Usage:
result = runner.run_task("semantic_similarity", "data/test_data.json")
print(f"Accuracy: {result.metrics['accuracy']:.2%}")
print(f"Avg Margin: {result.metrics['avg_margin']:.3f}")Data Requirements:
- Triplets: anchor, positive, negative texts
- Optional: difficulty labels, categories
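The computation behind these metrics is straightforward; a self-contained sketch of triplet accuracy and average margin under cosine similarity (illustrative, not SemEval's internal code):

```python
import numpy as np

def triplet_metrics(anchors, positives, negatives):
    """Each argument: embedding array of shape (n, dim). A triplet counts
    as correct when the anchor is closer to its positive than its negative."""
    def unit(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)
    a, p, n = unit(anchors), unit(positives), unit(negatives)
    margins = np.sum(a * p, axis=1) - np.sum(a * n, axis=1)  # pos_sim - neg_sim
    return {
        "accuracy": float(np.mean(margins > 0)),
        "avg_margin": float(np.mean(margins)),
    }
```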
Evaluates model stability under linguistic variations (typos, morphological changes, negations).
Metrics:
- Overall robustness score
- Morphology robustness (case, number, tense variations)
- Typo robustness (spelling errors)
- Negation robustness (handling of negation)
- Embedding stability metrics
Usage:
result = runner.run_task("linguistic_robustness", "data/test_data.json")
print(f"Overall Robustness: {result.metrics['overall_robustness']:.2%}")Data Requirements:
- Original texts with linguistic variations
- Variation types (morphology, typo, negation)
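A score like this plausibly reduces to the cosine similarity between each original text's embedding and its perturbed variant's, averaged over pairs; a hedged sketch assuming encoder.encode returns an (n, dim) NumPy array:

```python
import numpy as np

def robustness_score(encoder, originals, variants):
    """Mean cosine similarity between originals and their variants;
    1.0 means perturbations leave the embeddings unchanged."""
    e_o = encoder.encode(originals)
    e_v = encoder.encode(variants)
    e_o = e_o / np.linalg.norm(e_o, axis=1, keepdims=True)
    e_v = e_v / np.linalg.norm(e_v, axis=1, keepdims=True)
    return float(np.mean(np.sum(e_o * e_v, axis=1)))
```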
Tests compositional semantic understanding through analogy and vector operations.
Metrics:
- Analogy accuracy
- Category-specific performance
- Subcategory breakdown
- Average cosine similarity to expected results
Usage:
result = runner.run_task("vector_arithmetic", "data/test_data.json")
print(f"Analogy Accuracy: {result.metrics['accuracy']:.2%}")Data Requirements:
- Analogy pairs: (a, b, c, expected_d)
- Categories and subcategories
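The classic setup: for an analogy a : b :: c : d, check whether the candidate nearest to emb(b) - emb(a) + emb(c) is the expected d. A minimal sketch under the same encoder assumption as above:

```python
import numpy as np

def solve_analogy(encoder, a, b, c, candidates):
    """Return the candidate most cosine-similar to emb(b) - emb(a) + emb(c)."""
    e_a, e_b, e_c = encoder.encode([a, b, c])
    target = e_b - e_a + e_c
    target = target / np.linalg.norm(target)
    cand = encoder.encode(candidates)
    cand = cand / np.linalg.norm(cand, axis=1, keepdims=True)
    return candidates[int(np.argmax(cand @ target))]

# e.g. solve_analogy(enc, "man", "king", "woman", ["queen", "prince", "table"])
```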
SemEval uses a powerful YAML-based configuration system with environment variable overrides.
- config.yaml: Base configuration
- config.dev.yaml: Development settings (verbose, quick metrics)
- config.prod.yaml: Production settings (optimized, extended metrics)
# config.yaml
model:
name: "sentence-transformers/all-MiniLM-L6-v2"
device: "auto" # auto, cuda, mps, cpu
batch_size: 32
output:
base_dir: "output"
export_formats:
- json
- csv
- markdown
save_comprehensive_report: true
tasks:
information_retrieval:
enabled: true
ndcg_at_k: [1, 3, 5, 10]
map_at_k: [1, 3, 5, 10]
mrr_at_k: [1, 3, 5, 10]
semantic_similarity:
enabled: true
report_failed_triplets: 5
linguistic_robustness:
enabled: true
similarity_threshold: 0.8
vector_arithmetic:
enabled: true
top_k: 1
logging:
verbose: false
level: "INFO"Settings can be overridden using environment variables with the prefix SEMEVAL_:
# Model settings
export SEMEVAL_MODEL__NAME="Alibaba-NLP/gte-multilingual-base"
export SEMEVAL_MODEL__DEVICE="cuda"
export SEMEVAL_MODEL__BATCH_SIZE="64"
# Output settings
export SEMEVAL_OUTPUT__BASE_DIR="custom_output"
# Logging
export SEMEVAL_LOGGING__VERBOSE="true"
export SEMEVAL_LOGGING__LEVEL="DEBUG"from semeval import load_settings
# Load default config.yaml
settings = load_settings()
# Load environment-specific config
settings = load_settings(env="dev") # loads config.dev.yaml
settings = load_settings(env="prod") # loads config.prod.yaml
# Access settings
print(f"Model: {settings.model.name}")
print(f"Device: {settings.model.device}")
print(f"Output: {settings.output.base_dir}")Settings are loaded with the following priority (highest to lowest):
1. Environment variables (SEMEVAL_*)
2. .env file
3. Environment-specific YAML (config.{env}.yaml)
4. Base YAML (config.yaml)
5. Default values in code
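The SEMEVAL_MODEL__NAME-style names map onto nested fields through pydantic-settings' env_nested_delimiter. A minimal sketch of how such a Settings class is typically declared (illustrative field names, not necessarily SemEval's exact classes):

```python
from pydantic import BaseModel
from pydantic_settings import BaseSettings, SettingsConfigDict

class ModelSettings(BaseModel):
    name: str = "sentence-transformers/all-MiniLM-L6-v2"
    device: str = "auto"
    batch_size: int = 32

class Settings(BaseSettings):
    # SEMEVAL_ prefix + "__" delimiter: SEMEVAL_MODEL__NAME -> model.name
    model_config = SettingsConfigDict(
        env_prefix="SEMEVAL_", env_nested_delimiter="__"
    )
    model: ModelSettings = ModelSettings()

# With SEMEVAL_MODEL__DEVICE=cuda set in the environment,
# Settings().model.device == "cuda" without touching any YAML.
```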
from semeval.core.encoders import SentenceTransformerEncoder
encoder = SentenceTransformerEncoder(
model_name="sentence-transformers/all-MiniLM-L6-v2",
device="auto" # auto, cuda, mps, cpu
)

from semeval.core.encoders import HuggingFaceEncoder
encoder = HuggingFaceEncoder(
model_name="bert-base-uncased",
device="cuda",
max_length=512
)

from semeval.core.base_encoder import BaseEncoder
class MyEncoder(BaseEncoder):
    def __init__(self, embed_fn, dim: int = 768):
        self._embed_fn = embed_fn  # any callable: list[str] -> (n, dim) array
        self._dim = dim

    def encode(self, texts, **kwargs):
        # Your encoding logic; must return an array of shape (len(texts), dim)
        return self._embed_fn(texts)

    def get_embedding_dim(self) -> int:
        return self._dim

    @property
    def model_name(self) -> str:
        return "my-model"

Test data is provided in JSON format:
{
"metadata": {
"version": "1.0",
"description": "Semantic Evaluation Test Suite",
"language": "en",
"total_tasks": 4
},
"tasks": {
"information_retrieval": { ... },
"semantic_similarity": { ... },
"linguistic_robustness": { ... },
"vector_arithmetic": { ... }
}
}

See USAGE.md for detailed data format specifications.
from semeval.postprocess import ResultsExporter, ReportGenerator
exporter = ResultsExporter()
generator = ReportGenerator()
# Export to different formats
df = exporter.export_csv(result, "output/results.csv")
exporter.export_json(result, "output/results.json")
exporter.export_markdown(result, "output/results.md")
# Generate comprehensive report
generator.generate_report(
result,
"output/comprehensive_report.md",
model_name="My Model",
include_recommendations=True
)

Export individual files for each task:
# Export each task to separate JSON and Markdown files
task_paths = exporter.export_per_task(
result,
"output",
export_formats=['json', 'markdown']
)
# Generated files:
# - information_retrieval_result.json
# - information_retrieval_result.md
# - semantic_similarity_result.json
# - semantic_similarity_result.md
# - linguistic_robustness_result.json
# - linguistic_robustness_result.md
# - vector_arithmetic_result.json
# - vector_arithmetic_result.md

output/
├── results.csv                         # All metrics in CSV
├── results.json                        # Complete results in JSON
├── results.md                          # Summary markdown
├── comprehensive_report.md             # Detailed report with recommendations
├── information_retrieval_result.json   # Per-task exports
├── information_retrieval_result.md
├── semantic_similarity_result.json
├── semantic_similarity_result.md
├── linguistic_robustness_result.json
├── linguistic_robustness_result.md
├── vector_arithmetic_result.json
└── vector_arithmetic_result.md
| Metric | Range | Interpretation |
|---|---|---|
| NDCG@k | [0, 1] | Ranking quality with graded relevance |
| MRR@k | [0, 1] | Reciprocal rank of first relevant doc |
| MAP@k | [0, 1] | Mean average precision |
| Metric | Range | Interpretation |
|---|---|---|
| Triplet Accuracy | [0, 1] | Fraction of correctly ordered triplets |
| Average Margin | [-1, 1] | Mean difference (pos_sim - neg_sim) |
| Metric | Range | Interpretation |
|---|---|---|
| Overall Robustness | [0, 1] | Average stability across variations |
| Morphology Robustness | [0, 1] | Stability under morphological changes |
| Typo Robustness | [0, 1] | Stability under typos |
| Metric | Range | Interpretation |
|---|---|---|
| Analogy Accuracy | [0, 1] | Fraction of correct analogies |
| Avg Cosine Similarity | [-1, 1] | Average similarity to expected |
Focus: Production readiness foundations
- ✅ Testing Infrastructure - 153 tests, 51% coverage (metrics: 89-100%, core: 80-96%)
- ✅ CLI Interface - Fast command-line tools (eval, validate, init, version, info) with ~0.2s startup
- ✅ Error Handling & Logging - 17 custom exceptions, centralized structured logging
- ✅ Data Validation - Schema validation with helpful error messages
- ✅ Performance Optimization - Lazy loading for instant CLI startup (20x speedup)
- ✅ CI/CD Pipeline - GitHub Actions with multi-Python testing (3.8-3.11)
Why this matters: Solid foundations ensure reliability and great developer experience before expanding features.
Focus: Advanced metrics and new tasks
- Isotropy & Uniformity Metrics - Label-free embedding quality analysis
- Semantic Textual Similarity (STS) - Continuous similarity scoring
- Clustering Evaluation - Unsupervised quality metrics
- Paraphrase Detection - Binary classification task
- Caching Layer - Speed up repeated evaluations
- Advanced CLI Features:
  - semeval compare - Side-by-side model comparison with statistical tests
  - semeval report - HTML/PDF report generation with charts
  - semeval benchmark - Run standardized benchmarks
  - Config file support for the eval command
Focus: Advanced analysis and ecosystem integration
- CKA (Centered Kernel Alignment) - Compare model representations
- Token-level Alignment - Fine-grained semantic matching
- Question Answering Retrieval - QA-specific evaluation
- HuggingFace Hub Integration - Direct model/dataset access
- Performance Monitoring - Profiling and optimization tools
- Interactive Dashboard - Web UI for evaluation
Current Version: v0.1.1 (Beta) | License: MIT | Python: 3.8+ | Maintainer: @omrylcn
SemEval aims to be:
- Simple - From JSON to insights in minutes
- Flexible - Any language, any domain, any model
- Comprehensive - Beyond accuracy, understand embedding quality
- Production-Ready - Testing, error handling, monitoring
- Community-Driven - Open source