# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Overview

viral-classify is a set of scripts and tools for taxonomic identification, classification, and filtering of NGS data, with a focus on viral applications. It is a Docker-centric Python project built on top of viral-core, with wrappers for various metagenomics classifiers (Kraken, Kaiju, BLAST, etc.) and utilities for read depletion.

## Development Commands

### Testing

**Note:** These commands assume you're inside a properly configured environment (either in a Docker container or with all dependencies installed locally). For running tests in Docker, see "Running Tests in Docker" below.

Run all unit tests:
```bash
pytest -rsxX -n auto test/unit
```

Run a specific test file:
```bash
pytest test/unit/test_taxonomy.py
```

Run with slow (integration) tests included:
```bash
pytest -rsxX -n auto --runslow test/unit
```

Run with coverage:
```bash
pytest --cov
```

Show fixture durations:
```bash
pytest --fixture-durations=10 test/unit
```

### Running Tests in Docker

**IMPORTANT:** Tests must be run in a Docker container with all dependencies pre-installed. There are two approaches:

#### Option 1: Use the viral-classify Docker image (recommended for testing)

This image has all conda dependencies pre-installed and is ready to run tests immediately:

```bash
# Run all tests
docker run --rm \
    -v $(pwd):/opt/viral-ngs/viral-classify \
    -v $(pwd)/test:/opt/viral-ngs/source/test \
    quay.io/broadinstitute/viral-classify \
    bash -c "cd /opt/viral-ngs/viral-classify && pytest -rsxX -n auto test/unit"

# Run a specific test class
docker run --rm \
    -v $(pwd):/opt/viral-ngs/viral-classify \
    -v $(pwd)/test:/opt/viral-ngs/source/test \
    quay.io/broadinstitute/viral-classify \
    bash -c "cd /opt/viral-ngs/viral-classify && pytest -v test/unit/test_taxon_filter.py::TestFilterLastal"

# Run a single test method
docker run --rm \
    -v $(pwd):/opt/viral-ngs/viral-classify \
    -v $(pwd)/test:/opt/viral-ngs/source/test \
    quay.io/broadinstitute/viral-classify \
    bash -c "cd /opt/viral-ngs/viral-classify && pytest -v test/unit/test_taxon_filter.py::TestFilterLastal::test_filter_lastal_bam_polio"
```

**Note:** Two volume mounts are required:
- `-v $(pwd):/opt/viral-ngs/viral-classify` - mounts the source code
- `-v $(pwd)/test:/opt/viral-ngs/source/test` - mounts the test inputs (shared with viral-core)

#### Option 2: Use the viral-core base image (for development with dependency changes)

If you're modifying conda dependencies, start from viral-core and install the dependencies yourself:

```bash
# Interactive shell for development
docker run -it --rm \
    -v $(pwd):/opt/viral-ngs/viral-classify \
    -v $(pwd)/test:/opt/viral-ngs/source/test \
    quay.io/broadinstitute/viral-core

# Inside the container, install dependencies:
/opt/viral-ngs/viral-classify/docker/install-dev-layer.sh

# Then run tests:
cd /opt/viral-ngs/viral-classify
pytest -rsxX -n auto test/unit
```

### Docker Development Workflow

The development paradigm is intentionally Docker-centric.

**For quick testing without dependency changes:** Use the pre-built viral-classify image (see "Running Tests in Docker" above).

**For development with dependency changes:**

1. Mount your local checkout into a viral-core container:
```bash
docker run -it --rm \
    -v $(pwd):/opt/viral-ngs/viral-classify \
    -v $(pwd)/test:/opt/viral-ngs/source/test \
    quay.io/broadinstitute/viral-core
```

2. Inside the container, install this module's dependencies:
```bash
/opt/viral-ngs/viral-classify/docker/install-dev-layer.sh
```

3. Test interactively within the container:
```bash
cd /opt/viral-ngs/viral-classify
pytest -rsxX -n auto test/unit
```

4. Optionally snapshot your container with dependencies installed:
```bash
# From the host machine, in another terminal
docker commit <container_id> local/viral-classify-dev
```

**Important:** Always use both volume mounts (`-v` flags) as shown above. The test input files are shared between viral-core and viral-classify, so both paths must be mounted.

### Common Docker Testing Issues

**Tests fail with "can't open file" or "file not found" errors:**
- Ensure you're using BOTH volume mounts: `-v $(pwd):/opt/viral-ngs/viral-classify` AND `-v $(pwd)/test:/opt/viral-ngs/source/test`
- Test input files live in a location shared between viral-core and viral-classify

**Tests fail with "command not found" for tools like lastdb, kraken, etc.:**
- Use the `quay.io/broadinstitute/viral-classify` image, not `viral-core`
- Or run `install-dev-layer.sh` inside the viral-core container before testing

**Platform warnings (linux/amd64 vs linux/arm64):**
- These warnings are expected on ARM Macs and can be ignored
- Docker will use emulation automatically

### Docker Build

Build the Docker image:
```bash
docker build -t viral-classify .
```

The Dockerfile layers viral-classify on top of viral-core:2.3.3, installing conda dependencies into four separate environments (main plus env2/env3/env4 for conflicting dependencies), then copying in the source code.

## Architecture

### Main Entry Points

- **`metagenomics.py`** - Main CLI for taxonomic classification and database operations
- **`taxon_filter.py`** - Main CLI for read depletion and filtering pipelines
- **`kmer_utils.py`** - K-mer-based utility operations

All three use argparse for the CLI and util.cmd for command registration via the `__commands__` list.

### Core Classification Commands (metagenomics.py)

Key subcommands available via `metagenomics.py <command>`:
- `kraken` - Classify reads using the Kraken taxonomic classifier
- `kraken2` - Classify reads using Kraken2
- `kaiju` - Classify reads using the Kaiju protein-based classifier
- `krona` - Create a Krona HTML visualization from classification results
- `blast_contigs` - BLAST contigs for taxonomic assignment
- `diamond` - Diamond protein alignment for classification
- `taxonomy_db` - Download and manage NCBI taxonomy databases
- `filter_bam_to_taxa` - Filter a BAM to specific taxonomic groups
- `align_rna` - Align RNA sequences for taxonomic assignment

### Core Depletion Commands (taxon_filter.py)

Key subcommands available via `taxon_filter.py <command>`:
- `deplete` - Run the full depletion pipeline (BWA → BMTagger → BLASTN)
- `deplete_bwa` - Deplete reads matching a BWA database
- `deplete_bmtagger` - Deplete reads using BMTagger
- `deplete_blastn` - Deplete reads matching a BLASTN database
- `filter_lastal` - Filter reads using the LAST aligner

### Module Structure

- **`classify/`** - Tool wrapper modules for taxonomic classification
  - `kraken.py` - Kraken/KrakenUniq classifier wrapper
  - `kraken2.py` - Kraken2 classifier wrapper
  - `kaiju.py` - Kaiju protein classifier wrapper
  - `krona.py` - Krona visualization wrapper
  - `blast.py` - BLAST+ blastn and makeblastdb wrappers
  - `diamond.py` - Diamond protein aligner wrapper
  - `bmtagger.py` - BMTagger read depletion wrapper
  - `last.py` - LAST aligner wrapper
  - `megan.py` - MEGAN metagenomics analyzer wrapper
  - `kmc.py` - K-mer Counter (KMC) wrapper

- **`taxon_id_scripts/`** - Perl scripts for BLAST-based taxonomic analysis
  - `retrieve_top_blast_hits_LCA_for_each_sequence.pl` - LCA computation from BLAST hits
  - `LCA_table_to_kraken_output_format.pl` - Convert LCA tables to Kraken output format
  - `filter_LCA_matches.pl` - Filter LCA results
  - `blastoff.sh` - BLAST wrapper script

- **`test/`** - pytest-based test suite
  - `test/unit/` - Unit and integration tests
  - `conftest.py` - pytest fixtures and configuration
  - `test/__init__.py` - Test utilities (TestCaseWithTmp, assertion helpers)
  - `test/stubs.py` - Test stubs and mocks
  - `test/input/` - Static test input files organized by test class name

### Dependencies from viral-core

viral-classify imports core utilities from viral-core (not in this repository):
- `util.cmd` - Command-line parsing and command registration
- `util.file` - File handling utilities
- `util.misc` - Miscellaneous utilities
- `read_utils` - Read processing utilities
- `tools.*` - Tool wrapper base classes and common tools (picard, samtools, bwa, etc.)
  - All tool wrappers inherit from the `tools.Tool` base class

### Conda Dependencies

The project uses **4 separate conda environments** to handle dependency conflicts:

- **Main environment** (`requirements-conda.txt`): blast, bmtagger, kmc, last, perl
- **env2** (`requirements-conda-env2.txt`): Tools with incompatible dependencies
- **env3** (`requirements-conda-env3.txt`): Additional isolated tools
- **env4** (`requirements-conda-env4.txt`): Additional isolated tools

All environments are added to PATH in the Dockerfile.
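
As an illustration only, adding multiple conda environments to PATH in a Dockerfile might look like the fragment below. The environment prefix `/opt/miniconda/envs` is an assumption for this sketch, not the actual path used by this repository's Dockerfile; check the real file before relying on it.

```dockerfile
# Assumed env prefix; the real Dockerfile may differ.
ENV PATH="/opt/miniconda/envs/env2/bin:/opt/miniconda/envs/env3/bin:/opt/miniconda/envs/env4/bin:$PATH"
```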

## Testing Requirements

- pytest is used with parallelized execution (`-n auto`)
- Tests use fixtures from `conftest.py` providing scoped temp directories
- Test input files are in `test/input/<TestClassName>/`
- Access test inputs via `util.file.get_test_input_path(self)` in test classes
- **New tests should add no more than ~20-30 seconds to testing time**
- **Tests taking longer must be marked with `@pytest.mark.slow`**
- Run slow tests with `pytest --runslow`
- **New functionality must include unit tests covering basic use cases and confirming successful execution of the underlying binaries**
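
A minimal sketch of how a long-running test would carry the slow marker (the test name and body here are hypothetical; the `slow` marker and `--runslow` option come from `conftest.py`):

```python
import pytest

# Any test expected to exceed the ~20-30 second budget carries the `slow`
# marker so the default run skips it; `pytest --runslow` re-enables it.
@pytest.mark.slow
def test_build_toy_kraken_db():
    assert True  # placeholder body for illustration
```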

### Test Fixtures and Utilities

From `conftest.py`:
- `tmpdir_session`, `tmpdir_module`, `tmpdir_class`, `tmpdir_function` - Scoped temp directories
- `monkeypatch_function_result` - Patch function results for specific arguments
- `--runslow` option to enable slow/integration tests
- `--fixture-durations` to profile fixture performance
- Set the `VIRAL_NGS_TMP_DIRKEEP` environment variable to preserve temp dirs for debugging

From `test/__init__.py`:
- `TestCaseWithTmp` - Base class with temp dir support
- `assert_equal_contents()` - Compare file contents
- `assert_equal_bam_reads()` - Compare BAM files (converted to SAM)
- `assert_md5_equal_to_line_in_file()` - Verify checksums
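
To illustrate the kind of comparison these helpers perform, here is a simplified byte-for-byte file comparison. This is not the real `assert_equal_contents()` from `test/__init__.py` (whose exact signature is not shown in this file); the function name below is invented for the sketch:

```python
# Illustrative only: compares two files byte-for-byte, roughly what the
# real assertion helper does before raising on a mismatch.
def files_have_equal_contents(path_a, path_b):
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        return fa.read() == fb.read()
```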

## CI/CD

The GitHub Actions workflow (`.github/workflows/build.yml`) runs on push/PR:
- Docker image build and push to quay.io/broadinstitute/viral-classify
  - Master branch: tagged as `latest` and with the version number
  - Non-master branches: pushed with an ephemeral branch-specific tag
- Unit and integration tests with pytest
- Coverage reporting to coveralls.io
- Documentation build validation (the actual docs are hosted on Read the Docs)

## Key Design Patterns

### Command Registration

Commands are registered by appending `(command_name, parser_function)` tuples to `__commands__`. Each command has:
- A parser function (`parser_<command_name>`) that creates the argparse parser
- A main function (`main_<command_name>`) that implements the logic
- A connection between the two via `util.cmd.attach_main(parser, main_function)`

Example:
```python
def parser_classify_kraken(parser=argparse.ArgumentParser()):
    parser.add_argument('inBam', help='Input BAM file')
    parser.add_argument('outReads', help='Output reads')
    util.cmd.attach_main(parser, main_classify_kraken)
    return parser

def main_classify_kraken(args):
    # Implementation
    pass

__commands__.append(('kraken', parser_classify_kraken))
```

### Tool Wrapper Pattern

All classification tools in `classify/` inherit from `tools.Tool`:
- Define a `BINS` dict mapping logical names to executable names
- Implement a `version()` method
- Implement tool-specific methods (build, classify, filter, report, etc.)
- Use `self.execute()` to run commands with proper option formatting
- Define install methods (usually `tools.PrexistingUnixCommand` for conda-installed tools)

Example structure:
```python
class Kraken(tools.Tool):
    BINS = {
        'classify': 'kraken',
        'build': 'kraken-build',
        'filter': 'kraken-filter',
        'report': 'kraken-report'
    }

    def __init__(self, install_methods=None):
        if not install_methods:
            install_methods = [tools.PrexistingUnixCommand(shutil.which('kraken'))]
        super(Kraken, self).__init__(install_methods=install_methods)

    def version(self):
        return KRAKEN_VERSION
```

### Taxonomy Database Handling

The `TaxonomyDb` class in `metagenomics.py`:
- Loads NCBI taxonomy data (nodes.dmp, names.dmp, gi_taxid_*.dmp)
- Supports lazy loading via the `load_gis`, `load_nodes`, and `load_names` flags
- Provides LCA (lowest common ancestor) computation via `get_ordered_ancestors()`
- Can load from local files or S3, with automatic decompression
- Is used for BLAST hit analysis and taxonomic filtering
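
As a toy illustration of the ancestor-path approach to LCA, the sketch below walks each taxid up a hand-built parent map until the paths meet. This is not the actual `TaxonomyDb` code (its loading, caching, and method signatures are more involved); the function names and toy parent map are invented for the example:

```python
def ordered_ancestors(parents, taxid):
    """Walk from taxid up to the root (in nodes.dmp, the root is its own parent)."""
    path = [taxid]
    while parents[path[-1]] != path[-1]:
        path.append(parents[path[-1]])
    return path

def lowest_common_ancestor(parents, a, b):
    """Return the first ancestor of b that is also an ancestor of a."""
    ancestors_a = set(ordered_ancestors(parents, a))
    for node in ordered_ancestors(parents, b):
        if node in ancestors_a:
            return node
    return None

# Toy parent map (NOT real NCBI lineage, just illustrative taxids):
# 1 = root, 2 = Bacteria, 10239 = Viruses, with two nested viral nodes.
toy_parents = {1: 1, 2: 1, 10239: 1, 11118: 10239, 694009: 11118}
```

For example, `lowest_common_ancestor(toy_parents, 694009, 11118)` yields `11118`, while the LCA of a viral taxid and the bacterial taxid `2` climbs all the way to the root.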

### Depletion Pipeline Flow

The typical depletion workflow (via the `deplete` command):
1. Revert BAM formatting with Picard
2. Deplete with BWA against host/contaminant databases
3. Deplete with BMTagger against additional databases
4. Deplete with BLASTN for more sensitive filtering
5. Each stage outputs an intermediate BAM for inspection

Individual depletion tools can be run separately:
- `deplete_bwa` - BWA-based depletion only
- `deplete_bmtagger` - BMTagger-based depletion only
- `deplete_blastn` - BLASTN-based depletion only

## Code Style and Linting

Configuration files in the repository root:
- `.flake8` - Flake8 linting configuration
- `.pylintrc` - Pylint configuration
- `.style.yapf` - YAPF code formatting style

Use these tools with their respective configs when modifying code.

## Documentation

Documentation is built with Sphinx and hosted on Read the Docs:
- Source files are in the `docs/` directory (reStructuredText format)
- Uses `sphinx-argparse` to auto-generate CLI documentation from argparse parsers
- The build process clones viral-core during the docs build (see `docs/conf.py`)
- GitHub Actions validates that the docs build; deployment is handled separately by Read the Docs

Read the docs at: http://viral-classify.readthedocs.org/

## Common Development Tasks

### Adding a New Classification Tool

1. Create a wrapper class in `classify/<tool>.py` inheriting from `tools.Tool`
2. Define the tool's binaries, version, and installation methods
3. Add the conda dependency to the appropriate `requirements-conda*.txt`
4. Add a command parser and main function to `metagenomics.py` or `taxon_filter.py`
5. Register the command in the `__commands__` list
6. Add unit tests to `test/unit/`
7. Add test input files to `test/input/<TestClassName>/`

### Adding a New Conda Dependency

1. Check whether the package exists: `conda search -c bioconda <package_name>`
2. Add it to the appropriate `requirements-conda*.txt` file (or env2/env3/env4 if conflicts exist)
3. Test in a Docker container with `install-conda-dependencies.sh`
4. Update viral-core if adding to the base-layer dependencies
5. Document any new environment requirements

### Debugging Test Failures

1. Set `VIRAL_NGS_TMP_DIRKEEP=1` to preserve temp directories
2. Run a single test: `pytest -v test/unit/test_file.py::TestClass::test_method`
3. Use `pytest -s` to see stdout/stderr
4. Use `--fixture-durations` to identify slow fixtures
5. Check the test input files in `test/input/<TestClassName>/`