diff --git a/README.md b/README.md
index a54f33885..c7dcedc54 100644
--- a/README.md
+++ b/README.md
@@ -29,8 +29,9 @@ The following diagram shows the Nemo Retriever extraction pipeline.
 1. [What NeMo Retriever Extraction Is](#what-nvidia-ingest-is)
 2. [Prerequisites](#prerequisites)
 3. [Quickstart](#library-mode-quickstart)
-4. [GitHub Repository Structure](#nv-ingest-repository-structure)
-5. [Notices](#notices)
+4. [Benchmarking](#benchmarking)
+5. [GitHub Repository Structure](#nv-ingest-repository-structure)
+6. [Notices](#notices)
 
 ## What NeMo Retriever Extraction Is
 
@@ -296,6 +297,44 @@ Please keep in mind that this response is purely humorous and interpretative, as
 > Please also checkout our [demo using a retrieval pipeline on build.nvidia.com](https://build.nvidia.com/nvidia/multimodal-pdf-data-extraction-for-enterprise-rag) to query over document content pre-extracted w/ NVIDIA Ingest.
 
+## Benchmarking
+
+nv-ingest includes a comprehensive testing framework for benchmarking performance and evaluating retrieval accuracy.
+
+### Quick Start
+
+```bash
+cd scripts/tests
+
+# Run end-to-end benchmark
+python run.py --case=e2e --dataset=bo767
+
+# Evaluate retrieval accuracy
+python run.py --case=e2e_recall --dataset=bo767
+```
+
+### Available Benchmarks
+
+- **End-to-End Performance** - Measure ingestion throughput, latency, and resource utilization
+- **Retrieval Accuracy** - Evaluate recall@k metrics against ground truth datasets
+- **MIG Benchmarking** - Test performance with NVIDIA Multi-Instance GPU (MIG) configurations
+
+### Documentation
+
+- **[Testing Framework Guide](https://docs.nvidia.com/nemo/retriever/extraction/benchmarking/)** - Complete guide to benchmarking and testing nv-ingest (same as `scripts/tests/README.md`)
+- **[MIG Benchmarking](https://docs.nvidia.com/nemo/retriever/extraction/mig-benchmarking/)** - GPU partitioning for multi-tenant deployments on Kubernetes/Helm
+
+### Benchmark Datasets
+
+- **bo767** - 767 PDF documents with ground truth for recall evaluation
+- **bo20** - 20 PDF documents for quick validation
+- **single** - a single multimodal PDF for quick validation
+- **earnings** - earnings reports dataset (PPT and PDF)
+- **financebench** - financial documents with ground truth for recall evaluation
+- **Custom datasets** - Use your own datasets with the testing framework
+
+For more information, see the [benchmarking documentation](https://docs.nvidia.com/nemo/retriever/extraction/benchmarking/).
+
 ## GitHub Repository Structure
diff --git a/docs/docs/extraction/benchmarking.md b/docs/docs/extraction/benchmarking.md
new file mode 100644
index 000000000..ecd062efb
--- /dev/null
+++ b/docs/docs/extraction/benchmarking.md
@@ -0,0 +1,826 @@
+# nv-ingest Integration Testing Framework
+
+A configurable, dataset-agnostic testing framework for end-to-end validation of nv-ingest pipelines. This framework uses structured YAML configuration for type safety, validation, and parameter management.
+
+## Quick Start
+
+### Prerequisites
+- Docker and Docker Compose running
+- Python environment with nv-ingest-client
+- Access to test datasets
+
+### Run Your First Test
+
+```bash
+# 1. Navigate to the tests directory
+cd scripts/tests
+
+# 2. 
Run with a pre-configured dataset (assumes services are running)
+python run.py --case=e2e --dataset=bo767
+
+# Or use a custom path that uses the "active" configuration
+python run.py --case=e2e --dataset=/path/to/your/data
+
+# With managed infrastructure (starts/stops services)
+python run.py --case=e2e --dataset=bo767 --managed
+```
+
+**Important**: All test commands should be run from the `scripts/tests/` directory.
+
+## Configuration System
+
+### YAML Configuration (`test_configs.yaml`)
+
+The framework uses a structured YAML file for all test configuration. Configuration is organized into logical sections:
+
+#### Active Configuration
+
+The `active` section contains your current test settings. Edit these values directly for your test runs:
+
+```yaml
+active:
+  # Dataset
+  dataset_dir: /path/to/your/dataset
+  test_name: null # Auto-generated if null
+
+  # API Configuration
+  api_version: v1 # v1 or v2
+  pdf_split_page_count: null # V2 only: pages per chunk (null = default 32)
+
+  # Infrastructure
+  hostname: localhost
+  readiness_timeout: 600
+  profiles: [retrieval, table-structure]
+
+  # Runtime
+  sparse: true
+  gpu_search: false
+  embedding_model: auto
+
+  # Extraction
+  extract_text: true
+  extract_tables: true
+  extract_charts: true
+  extract_images: false
+  extract_infographics: true
+  text_depth: page
+  table_output_format: markdown
+
+  # Pipeline (optional steps)
+  enable_caption: false
+  enable_split: false
+  split_chunk_size: 1024
+  split_chunk_overlap: 150
+
+  # Storage
+  spill_dir: /tmp/spill
+  artifacts_dir: null
+  collection_name: null
+```
+
+#### Pre-Configured Datasets
+
+Each dataset includes its path, extraction settings, and recall evaluator in one place:
+
+```yaml
+datasets:
+  bo767:
+    path: /raid/jioffe/bo767
+    extract_text: true
+    extract_tables: true
+    extract_charts: true
+    extract_images: false
+    extract_infographics: false
+    recall_dataset: bo767 # Evaluator for recall testing
+
+  bo20:
+    path: /raid/jioffe/bo20
+    extract_text: true
+    extract_tables: true
+    extract_charts: true
+    extract_images: true
+    extract_infographics: false
+    recall_dataset: null # bo20 does not have recall
+
+  earnings:
+    path: /raid/jioffe/earnings_consulting
+    extract_text: true
+    extract_tables: true
+    extract_charts: true
+    extract_images: false
+    extract_infographics: false
+    recall_dataset: earnings # Evaluator for recall testing
+```
+
+**Automatic Configuration**: When you use `--dataset=bo767`, the framework automatically:
+- Sets the dataset path
+- Applies the correct extraction settings (text, tables, charts, images, infographics)
+- Configures the recall evaluator (if applicable)
+
+**Usage:**
+```bash
+# Single dataset - configs applied automatically
+python run.py --case=e2e --dataset=bo767
+
+# Multiple datasets (sweeping) - each gets its own config
+python run.py --case=e2e --dataset=bo767,earnings,bo20
+
+# Custom path still works (uses active section config)
+python run.py --case=e2e --dataset=/custom/path
+```
+
+**Dataset Extraction Settings:**
+
+| Dataset | Text | Tables | Charts | Images | Infographics | Recall |
+|---------|------|--------|--------|--------|--------------|--------|
+| `bo767` | ✅ | ✅ | ✅ | ❌ | ❌ | ✅ |
+| `earnings` | ✅ | ✅ | ✅ | ❌ | ❌ | ✅ |
+| `bo20` | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
+| `financebench` | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
+| `single` | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
+
+### Configuration Precedence
+
+Settings are applied in order of priority:
+
+**Environment variables > Dataset-specific config (path + extraction + recall_dataset) > YAML active 
config** + +**Note**: CLI arguments are only used for runtime decisions (which test to run, which dataset, execution mode). All configuration values come from YAML or environment variables. + +Example: +```bash +# YAML active section has api_version: v1 +# Dataset bo767 has extract_images: false +# Override via environment variable (highest priority) +EXTRACT_IMAGES=true API_VERSION=v2 python run.py --case=e2e --dataset=bo767 +# Result: Uses bo767 path, but extract_images=true (env override) and api_version=v2 (env override) +``` + +**Precedence Details:** +1. **Environment variables** - Highest priority, useful for CI/CD overrides +2. **Dataset-specific config** - Applied automatically when using `--dataset=` + - Includes: path, extraction settings, recall_dataset + - Only applies if dataset is defined in `datasets` section +3. **YAML active config** - Base configuration, used as fallback + +### Configuration Options Reference + +#### Core Options +- `dataset_dir` (string, required): Path to dataset directory +- `test_name` (string): Test identifier (auto-generated if null) +- `api_version` (string): API version - `v1` or `v2` +- `pdf_split_page_count` (integer): PDF splitting page count (V2 only, 1-128) + +#### Extraction Options +- `extract_text`, `extract_tables`, `extract_charts`, `extract_images`, `extract_infographics` (boolean): Content extraction toggles +- `text_depth` (string): Text extraction granularity - `page`, `document`, `block`, `line`, etc. +- `table_output_format` (string): Table output format - `markdown`, `html`, `latex`, `pseudo_markdown`, `simple` + +#### Pipeline Options +- `enable_caption` (boolean): Enable image captioning +- `enable_split` (boolean): Enable text chunking +- `split_chunk_size` (integer): Chunk size for text splitting +- `split_chunk_overlap` (integer): Overlap for text splitting + +#### Infrastructure Options +- `hostname` (string): Service hostname +- `readiness_timeout` (integer): Docker startup timeout in seconds +- `profiles` (list): Docker compose profiles + +#### Runtime Options +- `sparse` (boolean): Use sparse embeddings +- `gpu_search` (boolean): Use GPU for search +- `embedding_model` (string): Embedding model name (`auto` for auto-detection) +- `llm_summarization_model` (string): LLM model for summarization (used by `e2e_with_llm_summary`) + +#### Storage Options +- `spill_dir` (string): Temporary processing directory +- `artifacts_dir` (string): Test output directory (auto-generated if null) +- `collection_name` (string): Milvus collection name (auto-generated as `{test_name}_multimodal` if null, deterministic - no timestamp) + +### Valid Configuration Values + +**text_depth**: `block`, `body`, `document`, `header`, `line`, `nearby_block`, `other`, `page`, `span` + +**table_output_format**: `html`, `image`, `latex`, `markdown`, `pseudo_markdown`, `simple` + +**api_version**: `v1`, `v2` + +Configuration is validated on load with helpful error messages. 
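+
+As a concrete illustration of how the precedence chain and load-time validation fit together, the sketch below shows the general shape of the logic. It is a simplified, hypothetical rendering (`load_effective_config` is not a real framework function; the actual implementation lives in `config.py` and differs in detail):
+
+```python
+import os
+import yaml
+
+VALID_API_VERSIONS = {"v1", "v2"}
+
+def load_effective_config(dataset: str | None = None) -> dict:
+    """Hypothetical sketch: env vars > dataset config > YAML active."""
+    with open("test_configs.yaml") as f:
+        raw = yaml.safe_load(f)
+
+    # 1. Lowest priority: start from the YAML `active` section.
+    cfg = dict(raw["active"])
+
+    # 2. Overlay dataset-specific config when --dataset matches an entry.
+    ds = raw.get("datasets", {}).get(dataset) if dataset else None
+    if ds:
+        cfg["dataset_dir"] = ds["path"]
+        for key in ("extract_text", "extract_tables", "extract_charts",
+                    "extract_images", "extract_infographics"):
+            cfg[key] = ds[key]
+
+    # 3. Highest priority: environment variable overrides (CI/CD friendly).
+    if "API_VERSION" in os.environ:
+        cfg["api_version"] = os.environ["API_VERSION"]
+    if "EXTRACT_IMAGES" in os.environ:
+        cfg["extract_images"] = os.environ["EXTRACT_IMAGES"].lower() == "true"
+
+    # Validate on load and fail early with a helpful message.
+    if cfg["api_version"] not in VALID_API_VERSIONS:
+        raise ValueError(f"api_version must be one of {sorted(VALID_API_VERSIONS)}")
+    return cfg
+```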
+ +## Running Tests + +### Basic Usage + +```bash +# Run with default YAML configuration (assumes services are running) +python run.py --case=e2e --dataset=bo767 + +# With document-level analysis +python run.py --case=e2e --dataset=bo767 --doc-analysis + +# With managed infrastructure (starts/stops services) +python run.py --case=e2e --dataset=bo767 --managed +``` + +### Dataset Sweeping + +Run multiple datasets in a single command - each dataset automatically gets its native extraction configuration: + +```bash +# Sweep multiple datasets +python run.py --case=e2e --dataset=bo767,earnings,bo20 + +# Each dataset runs sequentially with its own: +# - Extraction settings (from dataset config) +# - Artifact directory (timestamped per dataset) +# - Results summary at the end + +# With managed infrastructure (services start once, shared across all datasets) +python run.py --case=e2e --dataset=bo767,earnings,bo20 --managed + +# E2E+Recall sweep (each dataset ingests then evaluates recall) +python run.py --case=e2e_recall --dataset=bo767,earnings + +# Recall-only sweep (evaluates existing collections) +python run.py --case=recall --dataset=bo767,earnings +``` + +**Sweep Behavior:** +- Services start once (if `--managed`) before the sweep +- Each dataset gets its own artifact directory +- Each dataset automatically applies its extraction config from `datasets` section +- Summary printed at end showing success/failure for each dataset +- Services stop once at end (unless `--keep-up`) + +### Using Environment Variables + +```bash +# Override via environment (useful for CI/CD) +API_VERSION=v2 EXTRACT_TABLES=false python run.py --case=e2e + +# Temporary changes without editing YAML +DATASET_DIR=/custom/path python run.py --case=e2e +``` + +## Test Scenarios + +### Available Tests + +| Name | Description | Configuration Needed | Status | +|------|-------------|----------------------|--------| +| `e2e` | Dataset-agnostic E2E ingestion | `active` section only | ✅ Primary (YAML config) | +| `e2e_with_llm_summary` | E2E with LLM summarization via UDF | `active` section only | ✅ Available (YAML config) | +| `recall` | Recall evaluation against existing collections | `active` + `recall` sections | ✅ Available (YAML config) | +| `e2e_recall` | Fresh ingestion + recall evaluation | `active` + `recall` sections | ✅ Available (YAML config) | + +**Note**: Legacy test cases (`dc20_e2e`, `dc20_v2_e2e`) have been moved to `scripts/private_local`. + +### Configuration Synergy + +**For E2E-only users:** +- Only configure `active` section +- `collection_name` in active: auto-generates from `test_name` or dataset basename if `null` (deterministic, no timestamp) +- Collection name pattern: `{test_name}_multimodal` (e.g., `bo767_multimodal`, `earnings_consulting_multimodal`) +- `recall` section is optional (not used unless running recall tests) +- **Note**: You can run `recall` later against the same collection created by `e2e` + +**For Recall-only users:** +- Configure `active` section: `hostname`, `sparse`, `gpu_search`, etc. (for evaluation) +- Configure `recall` section: `recall_dataset` (required) +- Set `test_name` in active to match your existing collection (collection must be `{test_name}_multimodal`) +- `collection_name` in active is ignored (recall generates `{test_name}_multimodal`) + +**For E2E+Recall users:** +- Configure `active` section: `dataset_dir`, `test_name`, extraction settings, etc. 
+- Configure `recall` section: `recall_dataset` (required) +- Collection naming: e2e_recall automatically creates `{test_name}_multimodal` collection +- `collection_name` in active is ignored (e2e_recall forces `{test_name}_multimodal` pattern) + +### Example Configurations + +**V2 API with PDF Splitting:** +```yaml +# Edit test_configs.yaml active section: +active: + api_version: v2 + pdf_split_page_count: 32 + extract_text: true + extract_tables: true + extract_charts: true +``` + +**Text-Only Processing:** +```yaml +active: + extract_text: true + extract_tables: false + extract_charts: false + extract_images: false + extract_infographics: false +``` + +**RAG with Text Chunking:** +```yaml +active: + enable_split: true + split_chunk_size: 1024 + split_chunk_overlap: 150 +``` + +**Multimodal with Image Extraction:** +```yaml +active: + extract_text: true + extract_tables: true + extract_charts: true + extract_images: true + extract_infographics: true + enable_caption: true +``` + +## Recall Testing + +Recall testing evaluates retrieval accuracy against ground truth query sets. Two test cases are available: + +### Test Cases + +**`recall`** - Recall-only evaluation against existing collections: +- Skips ingestion (assumes collections already exist) +- Loads existing collections from Milvus +- Evaluates recall using multimodal queries (all datasets are multimodal-only) +- Supports reranker comparison (no reranker, with reranker, or reranker-only) + +**`e2e_recall`** - Fresh ingestion + recall evaluation: +- Performs full ingestion pipeline +- Creates multimodal collection during ingestion +- Evaluates recall immediately after ingestion +- Combines ingestion metrics with recall metrics + +### Reranker Configuration + +Three modes via `reranker_mode` setting: + +1. **No reranker** (default): `reranker_mode: none` + - Runs evaluation without reranker only + +2. **Both modes**: `reranker_mode: both` + - Runs evaluation twice: once without reranker, once with reranker + - Useful for comparing reranker impact + +3. **Reranker only**: `reranker_mode: with` + - Runs evaluation with reranker only + - Faster when you only need reranked results + +### Collection Naming + +**Deterministic Collection Names (No Timestamps)** + +All test cases use deterministic collection names (no timestamps) to enable: +- Reusing collections across test runs +- Running recall evaluation after e2e ingestion +- Consistent collection naming patterns + +**Collection Name Patterns:** + +All test cases use the same consistent pattern: `{test_name}_multimodal` + +| Test Case | Pattern | Example | +|-----------|---------|---------| +| `e2e` | `{test_name}_multimodal` | `bo767_multimodal` | +| `e2e_with_llm_summary` | `{test_name}_multimodal` | `bo767_multimodal` | +| `e2e_recall` | `{test_name}_multimodal` | `bo767_multimodal` | +| `recall` | `{test_name}_multimodal` | `bo767_multimodal` | + +**Benefits:** +- ✅ Run `e2e` then `recall` separately - they use the same collection +- ✅ Consistent naming across all test cases +- ✅ Deterministic names (no timestamps) enable collection reuse + +**Recall Collections:** +- A single multimodal collection is created for recall evaluation +- Pattern: `{test_name}_multimodal` +- Example: `bo767_multimodal` +- All datasets evaluate against this multimodal collection (no modality-specific collections) + +**Note**: Artifact directories still use timestamps for tracking over time (e.g., `bo767_20251106_180859_UTC`), but collection names are deterministic. 
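+
+The naming rule itself is small enough to sketch. The helper below is hypothetical (the framework derives the name internally), but it captures the documented behavior:
+
+```python
+import os
+
+def collection_name(test_name: str | None, dataset_dir: str) -> str:
+    # Deterministic: derived from test_name (or the dataset basename),
+    # always suffixed with `_multimodal`, never timestamped.
+    base = test_name or os.path.basename(os.path.normpath(dataset_dir))
+    return f"{base}_multimodal"
+
+assert collection_name(None, "/raid/jioffe/bo767") == "bo767_multimodal"
+assert collection_name("bo767", "/any/path") == "bo767_multimodal"
+```
+
+Because the name is stable across runs, an `e2e` run followed later by a `recall` run resolves to the same collection.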
+ +### Multimodal-Only Evaluation + +All datasets use **multimodal-only** evaluation: +- Ground truth queries contain all content types (text, tables, charts) +- Single collection contains all extracted content types +- Simplified evaluation interface (no modality filtering) + +### Ground Truth Files + +**bo767 dataset:** +- Ground truth file: `bo767_query_gt.csv` (consolidated multimodal queries) +- Located in repo `data/` directory +- Default `ground_truth_dir: null` automatically uses `data/` directory +- Custom path can be specified via `ground_truth_dir` config + +**Other datasets** (finance_bench, earnings, audio): +- Ground truth files must be obtained separately (not in public repo) +- Set `ground_truth_dir` to point to your ground truth directory +- Dataset-specific evaluators are extensible (see `recall_utils.py`) + +### Configuration + +Edit the `recall` section in `test_configs.yaml`: + +```yaml +recall: + # Reranker configuration + reranker_mode: none # Options: "none", "with", "both" + + # Recall evaluation settings + recall_top_k: 10 + ground_truth_dir: null # null = use repo data/ directory + recall_dataset: bo767 # Required: must be explicitly set (bo767, finance_bench, earnings, audio) +``` + +### Usage Examples + +**Recall-only (existing collections):** +```bash +# Evaluate existing bo767 collections (no reranker) +# recall_dataset automatically set from dataset config +python run.py --case=recall --dataset=bo767 + +# With reranker only (set reranker_mode in YAML recall section) +python run.py --case=recall --dataset=bo767 + +# Sweep multiple datasets for recall evaluation +python run.py --case=recall --dataset=bo767,earnings +``` + +**E2E + Recall (fresh ingestion):** +```bash +# Fresh ingestion with recall evaluation +# recall_dataset automatically set from dataset config +python run.py --case=e2e_recall --dataset=bo767 + +# Sweep multiple datasets (each ingests then evaluates) +python run.py --case=e2e_recall --dataset=bo767,earnings +``` + +**Dataset configuration:** +- **Dataset path**: Automatically set from `datasets` section when using `--dataset=` +- **Extraction settings**: Automatically applied from `datasets` section +- **recall_dataset**: Automatically set from `datasets` section (e.g., `bo767`, `earnings`, `finance_bench`) + - Can be overridden via environment variable: `RECALL_DATASET=bo767` +- **test_name**: Auto-generated from dataset name or basename of path (can set in YAML `active` section) +- **Collection naming**: `{test_name}_multimodal` (automatically generated for recall cases) +- All datasets evaluate against the same `{test_name}_multimodal` collection (multimodal-only) + +### Output + +Recall results are included in `results.json`: +```json +{ + "recall_results": { + "no_reranker": { + "1": 0.554, + "3": 0.746, + "5": 0.807, + "10": 0.857 + }, + "with_reranker": { + "1": 0.601, + "3": 0.781, + "5": 0.832, + "10": 0.874 + } + } +} +``` + +Metrics are also logged via `kv_event_log()`: +- `recall_multimodal_@{k}_no_reranker` +- `recall_multimodal_@{k}_with_reranker` +- `recall_eval_time_s_no_reranker` +- `recall_eval_time_s_with_reranker` + +## Sweeping Parameters + +### Dataset Sweeping (Recommended) + +The easiest way to test multiple datasets is using dataset sweeping: + +```bash +# Test multiple datasets - each gets its native config automatically +python run.py --case=e2e --dataset=bo767,earnings,bo20 + +# Each dataset runs with its pre-configured extraction settings +# Results are organized in separate artifact directories +``` + +### 
Parameter Sweeping
+
+To sweep through different parameter values:
+
+1. **Edit** `test_configs.yaml` - Update values in the `active` section
+2. **Run** the test: `python run.py --case=e2e --dataset=<dataset>`
+3. **Analyze** results in `artifacts/<test_name>_<timestamp>/`
+4. **Repeat** steps 1-3 for the next parameter combination
+
+Example parameter sweep workflow:
+```bash
+# Test 1: Baseline V1
+vim test_configs.yaml # Set: api_version=v1, extract_tables=true
+python run.py --case=e2e --dataset=bo767
+
+# Test 2: V2 with 32-page splitting
+vim test_configs.yaml # Set: api_version=v2, pdf_split_page_count=32
+python run.py --case=e2e --dataset=bo767
+
+# Test 3: V2 with 8-page splitting
+vim test_configs.yaml # Set: pdf_split_page_count=8
+python run.py --case=e2e --dataset=bo767
+
+# Test 4: Tables disabled (override via env var)
+EXTRACT_TABLES=false python run.py --case=e2e --dataset=bo767
+```
+
+**Note**: Each test run creates a new timestamped artifact directory, so you can compare results across sweeps.
+
+## Execution Modes
+
+### Attach Mode (Default)
+
+```bash
+python run.py --case=e2e --dataset=bo767
+```
+
+- **Default behavior**: Assumes services are already running
+- Runs test case only (no service management)
+- Faster for iterative testing
+- Use when Docker services are already up
+- `--no-build` and `--keep-up` flags are ignored in attach mode
+
+### Managed Mode
+
+```bash
+python run.py --case=e2e --dataset=bo767 --managed
+```
+
+- Starts Docker services automatically
+- Waits for service readiness (configurable timeout)
+- Runs test case
+- Collects artifacts
+- Stops services after test (unless `--keep-up`)
+
+**Managed mode options:**
+```bash
+# Skip Docker image rebuild (faster startup)
+python run.py --case=e2e --dataset=bo767 --managed --no-build
+
+# Keep services running after test (useful for multi-test scenarios)
+python run.py --case=e2e --dataset=bo767 --managed --keep-up
+```
+
+## Artifacts and Logging
+
+All test outputs are collected in timestamped directories:
+
+```
+scripts/tests/artifacts/<test_name>_<timestamp>_UTC/
+├── results.json # Consolidated test metadata and results
+├── stdout.txt # Complete test output
+└── e2e.json # Structured metrics and events
+```
+
+**Note**: Artifact directories use timestamps for tracking test runs over time, while collection names are deterministic (no timestamps) to enable collection reuse and recall evaluation.
+
+### Results Structure
+
+`results.json` contains:
+- **Runner metadata**: case name, timestamp, git commit, infrastructure mode
+- **Test configuration**: API version, extraction settings, dataset info
+- **Test results**: chunks created, timing, performance metrics
+
+### Document Analysis
+
+Enable per-document element breakdown:
+
+```bash
+python run.py --case=e2e --doc-analysis
+```
+
+**Sample Output:**
+```
+Document Analysis:
+  document1.pdf: 44 elements (text: 15, tables: 13, charts: 15, images: 0, infographics: 1)
+  document2.pdf: 14 elements (text: 9, tables: 0, charts: 4, images: 0, infographics: 1)
+```
+
+This provides:
+- Element counts by type for each document
+- Useful for understanding dataset characteristics
+- Helps identify processing bottlenecks
+- Validates extraction completeness
+
+## Architecture
+
+### Framework Components
+
+**1. 
Configuration Layer** +- `test_configs.yaml` - Structured configuration file + - Active test configuration (edit directly) + - Dataset shortcuts for quick access +- `config.py` - Configuration management + - YAML loading and parsing + - Type-safe config dataclass + - Validation logic with helpful errors + - Environment variable override support + +**2. Test Runner** +- `run.py` - Main orchestration + - Configuration loading with precedence chain + - Docker service management (managed mode) + - Test case execution with config injection + - Artifact collection and consolidation + +**3. Test Cases** +- `cases/e2e.py` - Primary E2E test (✅ YAML-based) + - Accepts config object directly + - Type-safe parameter access + - Full pipeline validation (extract → embed → VDB → retrieval) + - Transparent configuration logging +- `cases/e2e_with_llm_summary.py` - E2E with LLM (✅ YAML-based) + - Adds UDF-based LLM summarization + - Same config-based architecture as e2e.py +- `cases/recall.py` - Recall evaluation (✅ YAML-based) + - Evaluates retrieval accuracy against existing collections + - Requires `recall_dataset` in config (from dataset config or env var) + - Supports reranker comparison modes (none, with, both) + - Multimodal-only evaluation against `{test_name}_multimodal` collection +- `cases/e2e_recall.py` - E2E + Recall (✅ YAML-based) + - Combines ingestion (via e2e.py) with recall evaluation (via recall.py) + - Automatically creates collection during ingestion + - Requires `recall_dataset` in config (from dataset config or env var) + - Merges ingestion and recall metrics in results + +**4. Shared Utilities** +- `interact.py` - Common testing utilities + - `embed_info()` - Embedding model detection + - `milvus_chunks()` - Vector database statistics + - `segment_results()` - Result categorization by type + - `kv_event_log()` - Structured logging + - `pdf_page_count()` - Dataset page counting + +### Configuration Flow + +``` +test_configs.yaml → load_config() → TestConfig object → test case + (active + (applies (validated, + datasets) dataset config) type-safe) + ↑ ↑ + Env overrides Dataset configs + (highest) (auto-applied) +``` + +**Configuration Loading:** +1. Start with `active` section from YAML +2. If `--dataset=` matches a configured dataset: + - Apply dataset path + - Apply dataset extraction settings + - Apply dataset `recall_dataset` (if set) +3. Apply environment variable overrides (if any) +4. Validate and create `TestConfig` object + +All test cases receive a validated `TestConfig` object with typed fields, eliminating string parsing errors. + +## Development Guide + +### Adding New Test Cases + +1. **Create test script** in `cases/` directory + +2. **Accept config parameter**: + ```python + def main(config, log_path: str = "test_results") -> int: + """ + Test case entry point. + + Args: + config: TestConfig object with all settings + log_path: Path for structured logging + + Returns: + Exit code (0 = success) + """ + # Access config directly (type-safe) + data_dir = config.dataset_dir + api_version = config.api_version + extract_text = config.extract_text + # ... + ``` + +3. **Add transparent logging**: + ```python + print("=== Test Configuration ===") + print(f"Dataset: {config.dataset_dir}") + print(f"API: {config.api_version}") + print(f"Extract: text={config.extract_text}, tables={config.extract_tables}") + print("=" * 60) + ``` + +4. 
**Use structured logging**: + ```python + from interact import kv_event_log + + kv_event_log("ingestion_time_s", elapsed_time, log_path) + kv_event_log("text_chunks", num_text_chunks, log_path) + ``` + +5. **Register case** in `run.py`: + ```python + CASES = ["e2e", "e2e_with_llm_summary", "your_new_case"] + ``` + +### Extending Configuration + +To add new configurable parameters: + +1. **Add to `TestConfig` dataclass** in `config.py`: + ```python + @dataclass + class TestConfig: + # ... existing fields + new_param: bool = False # Add with type and default + ``` + +2. **Add to YAML** `active` section: + ```yaml + active: + # ... existing config + new_param: false # Match Python default + ``` + +3. **Add environment variable mapping** in `config.py` (if needed): + ```python + env_mapping = { + # ... existing mappings + "NEW_PARAM": ("new_param", parse_bool), + } + ``` + +4. **Add validation** (if needed) in `TestConfig.validate()`: + ```python + def validate(self) -> List[str]: + errors = [] + # ... existing validation + if self.new_param and self.some_other_field is None: + errors.append("new_param requires some_other_field to be set") + return errors + ``` + +5. **Update this README** with parameter description + +### Testing Different Datasets + +The framework is dataset-agnostic and supports multiple approaches: + +**Option 1: Use pre-configured dataset (Recommended)** +```bash +# Dataset configs automatically applied +python run.py --case=e2e --dataset=bo767 +``` + +**Option 2: Add new dataset to YAML** +```yaml +datasets: + my_dataset: + path: /path/to/your/dataset + extract_text: true + extract_tables: true + extract_charts: true + extract_images: false + extract_infographics: false + recall_dataset: null # or set to evaluator name if applicable +``` +Then use: `python run.py --case=e2e --dataset=my_dataset` + +**Option 3: Use custom path (uses active section config)** +```bash +python run.py --case=e2e --dataset=/path/to/your/dataset +``` + +**Option 4: Environment variable override** +```bash +# Override specific settings via env vars +EXTRACT_IMAGES=true python run.py --case=e2e --dataset=bo767 +``` + +**Best Practice**: For repeated testing, add your dataset to the `datasets` section with its native extraction settings. This ensures consistent configuration and enables dataset sweeping. + +## Additional Resources + +- **Configuration**: See `config.py` for complete field list and validation logic +- **Test utilities**: See `interact.py` for shared helper functions +- **Docker setup**: See project root README for service management commands +- **API documentation**: See `docs/` for API version differences + +The framework prioritizes clarity, type safety, and validation to support reliable testing of nv-ingest pipelines. 
diff --git a/docs/mkdocs.yml b/docs/mkdocs.yml index e23025ba2..da2424c6d 100644 --- a/docs/mkdocs.yml +++ b/docs/mkdocs.yml @@ -71,6 +71,7 @@ nav: - Notebooks: extraction/notebooks.md - Use the CLI: extraction/nv-ingest_cli.md - Use the API: extraction/nv-ingest-python-api.md + - V2 API Guide: extraction/v2-api-guide.md - Split Documents: extraction/chunking.md - Upload Data: extraction/data-store.md - Filter Search Results: extraction/custom-metadata.md @@ -81,6 +82,8 @@ nav: - Telemetry: extraction/telemetry.md - Customize Your Pipeline: - Add User-defined Stages: extraction/user-defined-stages.md + - Benchmarking: + - Overview: extraction/benchmarking.md - Reference: - Generate Your NGC Keys: extraction/ngc-api-key.md - Content Metadata: extraction/content-metadata.md