
Commit bcdde8a

Merge pull request #53 from broadinstitute/dp-patch
Add CLAUDE.md and fix TestFilterLastal for last v1648
2 parents 3c88169 + 1d07fa2 commit bcdde8a

File tree

3 files changed: +389 −1 lines changed


CLAUDE.md

Lines changed: 388 additions & 0 deletions
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Overview

viral-classify is a set of scripts and tools for taxonomic identification, classification, and filtering of NGS data, with a focus on viral applications. It is a Docker-centric Python project built on top of viral-core, with wrappers for various metagenomics classifiers (Kraken, Kaiju, BLAST, etc.) and utilities for read depletion.

## Development Commands

### Testing

**Note:** These commands assume you're inside a properly configured environment (either in a Docker container or with all dependencies installed locally). For running tests in Docker, see "Running Tests in Docker" below.
Run all unit tests:

```bash
pytest -rsxX -n auto test/unit
```

Run a specific test file:

```bash
pytest test/unit/test_taxonomy.py
```

Run with slow (integration) tests:

```bash
pytest -rsxX -n auto --runslow test/unit
```

Run with coverage:

```bash
pytest --cov
```

Show fixture durations:

```bash
pytest --fixture-durations=10 test/unit
```
### Running Tests in Docker

**IMPORTANT:** Tests must be run in a Docker container with all dependencies pre-installed. There are two approaches:

#### Option 1: Use viral-classify Docker image (recommended for testing)

This image has all conda dependencies pre-installed and is ready to run tests immediately:

```bash
# Run all tests
docker run --rm \
  -v $(pwd):/opt/viral-ngs/viral-classify \
  -v $(pwd)/test:/opt/viral-ngs/source/test \
  quay.io/broadinstitute/viral-classify \
  bash -c "cd /opt/viral-ngs/viral-classify && pytest -rsxX -n auto test/unit"

# Run a specific test class
docker run --rm \
  -v $(pwd):/opt/viral-ngs/viral-classify \
  -v $(pwd)/test:/opt/viral-ngs/source/test \
  quay.io/broadinstitute/viral-classify \
  bash -c "cd /opt/viral-ngs/viral-classify && pytest -v test/unit/test_taxon_filter.py::TestFilterLastal"

# Run a single test method
docker run --rm \
  -v $(pwd):/opt/viral-ngs/viral-classify \
  -v $(pwd)/test:/opt/viral-ngs/source/test \
  quay.io/broadinstitute/viral-classify \
  bash -c "cd /opt/viral-ngs/viral-classify && pytest -v test/unit/test_taxon_filter.py::TestFilterLastal::test_filter_lastal_bam_polio"
```

**Note:** Two volume mounts are required:
- `-v $(pwd):/opt/viral-ngs/viral-classify` - mounts the source code
- `-v $(pwd)/test:/opt/viral-ngs/source/test` - mounts test inputs (shared with viral-core)
#### Option 2: Use viral-core base image (for development with dependency changes)

If you're modifying conda dependencies, start from viral-core and install dependencies:

```bash
# Interactive shell for development
docker run -it --rm \
  -v $(pwd):/opt/viral-ngs/viral-classify \
  -v $(pwd)/test:/opt/viral-ngs/source/test \
  quay.io/broadinstitute/viral-core

# Inside the container, install dependencies:
/opt/viral-ngs/viral-classify/docker/install-dev-layer.sh

# Then run tests:
cd /opt/viral-ngs/viral-classify
pytest -rsxX -n auto test/unit
```
### Docker Development Workflow

The development paradigm is intentionally Docker-centric.

**For quick testing without dependency changes:** Use the pre-built viral-classify image (see "Running Tests in Docker" above).

**For development with dependency changes:**

1. Mount your local checkout into a viral-core container:
   ```bash
   docker run -it --rm \
     -v $(pwd):/opt/viral-ngs/viral-classify \
     -v $(pwd)/test:/opt/viral-ngs/source/test \
     quay.io/broadinstitute/viral-core
   ```
2. Inside the container, install this module's dependencies:
   ```bash
   /opt/viral-ngs/viral-classify/docker/install-dev-layer.sh
   ```
3. Test interactively within the container:
   ```bash
   cd /opt/viral-ngs/viral-classify
   pytest -rsxX -n auto test/unit
   ```
4. Optionally, snapshot your container with dependencies installed:
   ```bash
   # From the host machine, in another terminal
   docker commit <container_id> local/viral-classify-dev
   ```

**Important:** Always use both volume mounts (`-v` flags) as shown above. The test input files are shared between viral-core and viral-classify, so both paths must be mounted.
### Common Docker Testing Issues

**Tests fail with "can't open file" or "file not found" errors:**
- Ensure you're using BOTH volume mounts: `-v $(pwd):/opt/viral-ngs/viral-classify` AND `-v $(pwd)/test:/opt/viral-ngs/source/test`
- Test input files live in a location shared between viral-core and viral-classify

**Tests fail with "command not found" for tools like lastdb, kraken, etc.:**
- Use the `quay.io/broadinstitute/viral-classify` image, not `viral-core`
- Or run `install-dev-layer.sh` inside the viral-core container before testing

**Platform warnings (linux/amd64 vs linux/arm64):**
- These warnings are expected on ARM Macs and can be ignored
- Docker will use emulation automatically

### Docker Build

Build the docker image:

```bash
docker build -t viral-classify .
```

The Dockerfile layers viral-classify on top of viral-core:2.3.3, installing conda dependencies into 4 separate environments (main + env2/env3/env4 for dependency conflicts), then copying in the source code.
## Architecture

### Main Entry Points

- **`metagenomics.py`** - Main CLI for taxonomic classification and database operations
- **`taxon_filter.py`** - Main CLI for read depletion and filtering pipelines
- **`kmer_utils.py`** - K-mer based utility operations

All use argparse for the CLI and util.cmd for command registration via the `__commands__` list.

### Core Classification Commands (metagenomics.py)

Key subcommands available via `metagenomics.py <command>`:
- `kraken` - Classify reads using the Kraken taxonomic classifier
- `kraken2` - Classify reads using Kraken2
- `kaiju` - Classify reads using the Kaiju protein-based classifier
- `krona` - Create Krona HTML visualizations from classification results
- `blast_contigs` - BLAST contigs for taxonomic assignment
- `diamond` - Diamond protein alignment for classification
- `taxonomy_db` - Download and manage NCBI taxonomy databases
- `filter_bam_to_taxa` - Filter a BAM to specific taxonomic groups
- `align_rna` - Align RNA sequences for taxonomic assignment

### Core Depletion Commands (taxon_filter.py)

Key subcommands available via `taxon_filter.py <command>`:
- `deplete` - Run the full depletion pipeline (BWA → BMTagger → BLASTN)
- `deplete_bwa` - Deplete reads matching a BWA database
- `deplete_bmtagger` - Deplete reads using BMTagger
- `deplete_blastn` - Deplete reads matching a BLASTN database
- `filter_lastal` - Filter reads using the LAST aligner
### Module Structure

- **`classify/`** - Tool wrapper modules for taxonomic classification
  - `kraken.py` - Kraken/KrakenUniq classifier wrapper
  - `kraken2.py` - Kraken2 classifier wrapper
  - `kaiju.py` - Kaiju protein classifier wrapper
  - `krona.py` - Krona visualization wrapper
  - `blast.py` - BLAST+ blastn and makeblastdb wrappers
  - `diamond.py` - Diamond protein aligner wrapper
  - `bmtagger.py` - BMTagger read depletion wrapper
  - `last.py` - LAST aligner wrapper
  - `megan.py` - MEGAN metagenomics analyzer wrapper
  - `kmc.py` - K-mer Counter (KMC) wrapper

- **`taxon_id_scripts/`** - Perl scripts for BLAST-based taxonomic analysis
  - `retrieve_top_blast_hits_LCA_for_each_sequence.pl` - LCA computation from BLAST
  - `LCA_table_to_kraken_output_format.pl` - Convert LCA to Kraken format
  - `filter_LCA_matches.pl` - Filter LCA results
  - `blastoff.sh` - BLAST wrapper script

- **`test/`** - pytest-based test suite
  - `test/unit/` - Unit and integration tests
  - `conftest.py` - pytest fixtures and configuration
  - `test/__init__.py` - Test utilities (TestCaseWithTmp, assertion helpers)
  - `test/stubs.py` - Test stubs and mocks
  - `test/input/` - Static test input files organized by test class name
### Dependencies from viral-core

viral-classify imports core utilities from viral-core (not in this repository); a typical import block is sketched after this list:
- `util.cmd` - Command-line parsing and command registration
- `util.file` - File handling utilities
- `util.misc` - Miscellaneous utilities
- `read_utils` - Read processing utilities
- `tools.*` - Tool wrapper base classes and common tools (picard, samtools, bwa, etc.)
- All tool wrappers inherit from the `tools.Tool` base class
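
A minimal sketch of what these imports look like in an entry-point script; the exact set varies by script, and these modules resolve only when viral-core is on the Python path (as it is inside the Docker images):

```python
# Hedged sketch: typical viral-core imports in an entry-point script.
# These modules live in viral-core, not this repository, so they resolve
# only when viral-core is on PYTHONPATH (e.g., inside the Docker images).
import util.cmd    # command registration / CLI plumbing
import util.file   # file handling helpers (temp files, test input paths)
import util.misc   # miscellaneous helpers
import read_utils  # read processing utilities
import tools       # Tool base class and PrexistingUnixCommand
```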

### Conda Dependencies

The project uses **4 separate conda environments** to handle dependency conflicts:

- **Main environment** (`requirements-conda.txt`): blast, bmtagger, kmc, last, perl
- **env2** (`requirements-conda-env2.txt`): Tools with incompatible dependencies
- **env3** (`requirements-conda-env3.txt`): Additional isolated tools
- **env4** (`requirements-conda-env4.txt`): Additional isolated tools

All environments are added to PATH in the Dockerfile.
## Testing Requirements

- pytest is used with parallelized execution (`-n auto`)
- Tests use fixtures from `conftest.py` providing scoped temp directories
- Test input files are in `test/input/<TestClassName>/`
- Access test inputs via `util.file.get_test_input_path(self)` in test classes
- **New tests should add no more than ~20-30 seconds to testing time**
- **Tests taking longer must be marked with `@pytest.mark.slow`** (see the sketch after this list)
- Run slow tests with `pytest --runslow`
- **New functionality must include unit tests covering basic use cases and confirming successful execution of underlying binaries**
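
A minimal sketch of marking a slow test, assuming the `slow` marker is registered via `conftest.py` (which also provides the `--runslow` option); the test name here is hypothetical:

```python
import pytest

@pytest.mark.slow  # skipped unless pytest is invoked with --runslow
def test_full_kraken_db_build():  # hypothetical long-running test
    ...
```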

### Test Fixtures and Utilities

From `conftest.py`:
- `tmpdir_session`, `tmpdir_module`, `tmpdir_class`, `tmpdir_function` - Scoped temp directories
- `monkeypatch_function_result` - Patch function results for specific args
- `--runslow` option to enable slow/integration tests
- `--fixture-durations` to profile fixture performance
- Set the `VIRAL_NGS_TMP_DIRKEEP` environment variable to preserve temp dirs for debugging

From `test/__init__.py` (combined in the sketch after this list):
- `TestCaseWithTmp` - Base class with temp dir support
- `assert_equal_contents()` - Compare file contents
- `assert_equal_bam_reads()` - Compare BAM files (converted to SAM)
- `assert_md5_equal_to_line_in_file()` - Verify checksums
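
A hedged sketch combining these utilities in the prevailing test style; `TestMyTool` and its input files are hypothetical, and `util.file.mkstempfname` is assumed to be the viral-core temp-file helper:

```python
import os.path
import util.file
from test import TestCaseWithTmp, assert_equal_bam_reads

class TestMyTool(TestCaseWithTmp):  # hypothetical test class
    def test_basic(self):
        # inputs resolve to test/input/TestMyTool/ via the class name
        in_bam = os.path.join(util.file.get_test_input_path(self), 'in.bam')
        out_bam = util.file.mkstempfname('.bam')  # assumed viral-core helper
        # ... run the code under test to produce out_bam ...
        expected = os.path.join(util.file.get_test_input_path(self), 'expected.bam')
        assert_equal_bam_reads(self, out_bam, expected)
```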

## CI/CD

The GitHub Actions workflow (`.github/workflows/build.yml`) runs on push/PR:
- Docker image build and push to quay.io/broadinstitute/viral-classify
- Master branch: tagged as `latest` and with the version number
- Non-master branches: tagged as `quay.io/broadinstitute/viral-classify` (ephemeral)
- Unit and integration tests with pytest
- Coverage reporting to coveralls.io
- Documentation build validation (actual docs are hosted on Read the Docs)
## Key Design Patterns

### Command Registration

Commands are registered by appending `(command_name, parser_function)` tuples to `__commands__`. Each command has:
- A parser function (`parser_<command_name>`) that creates the argparse parser
- A main function (`main_<command_name>`) that implements the logic
- A connection via `util.cmd.attach_main(parser, main_function)`

Example:
```python
def parser_classify_kraken(parser=argparse.ArgumentParser()):
    parser.add_argument('inBam', help='Input BAM file')
    parser.add_argument('outReads', help='Output reads')
    util.cmd.attach_main(parser, main_classify_kraken)
    return parser

def main_classify_kraken(args):
    # Implementation
    pass

__commands__.append(('kraken', parser_classify_kraken))
```

### Tool Wrapper Pattern

All classification tools in `classify/` inherit from `tools.Tool`:
- Define a `BINS` dict mapping logical names to executable names
- Implement a `version()` method
- Implement tool-specific methods (build, classify, filter, report, etc.)
- Use `self.execute()` to run commands with proper option formatting
- Define install methods (usually `tools.PrexistingUnixCommand` for conda-installed tools)

Example structure:
```python
class Kraken(tools.Tool):
    BINS = {
        'classify': 'kraken',
        'build': 'kraken-build',
        'filter': 'kraken-filter',
        'report': 'kraken-report'
    }

    def __init__(self, install_methods=None):
        if not install_methods:
            install_methods = [tools.PrexistingUnixCommand(shutil.which('kraken'))]
        super(Kraken, self).__init__(install_methods=install_methods)

    def version(self):
        return KRAKEN_VERSION
```

### Taxonomy Database Handling

The `TaxonomyDb` class in `metagenomics.py` (see the usage sketch after this list):
- Loads NCBI taxonomy data (nodes.dmp, names.dmp, gi_taxid_*.dmp)
- Supports lazy loading with `load_gis`, `load_nodes`, `load_names` flags
- Provides LCA (Lowest Common Ancestor) computation via `get_ordered_ancestors()`
- Can load from local files or S3 with automatic decompression
- Used for BLAST hit analysis and taxonomic filtering
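
A hedged usage sketch based on the description above; the constructor arguments shown are assumptions, so verify against the actual signature in `metagenomics.py`:

```python
from metagenomics import TaxonomyDb

# Assumed constructor arguments; check metagenomics.py for the real signature.
db = TaxonomyDb(tax_dir='/path/to/taxonomy',  # dir with nodes.dmp, names.dmp
                load_nodes=True, load_names=True, load_gis=False)

# Walk a taxid's lineage toward the root (11234 = measles morbillivirus).
ancestors = db.get_ordered_ancestors(11234)
```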

### Depletion Pipeline Flow

Typical depletion workflow (via the `deplete` command):
1. Revert BAM formatting with Picard
2. Deplete with BWA against host/contaminant databases
3. Deplete with BMTagger against additional databases
4. Deplete with BLASTN for more sensitive filtering
5. Each stage outputs an intermediate BAM for inspection

Individual depletion tools can be run separately (or driven programmatically, as sketched after this list):
- `deplete_bwa` - BWA-based depletion only
- `deplete_bmtagger` - BMTagger-based depletion only
- `deplete_blastn` - BLASTN-based depletion only
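
Commands can also be invoked programmatically through their argparse parsers, mirroring how the unit tests drive them (e.g., `TestFilterLastal`). A minimal sketch with placeholder paths:

```python
import argparse
import taxon_filter

# Build the parser for one subcommand, parse placeholder arguments,
# and dispatch to the main function attached via util.cmd.attach_main().
parser = taxon_filter.parser_filter_lastal_bam(argparse.ArgumentParser())
args = parser.parse_args(['input.bam', '/path/to/lastdb_prefix', 'filtered.bam'])
args.func_main(args)
```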

## Code Style and Linting

Configuration files in the repository root:
- `.flake8` - Flake8 linting configuration
- `.pylintrc` - Pylint configuration
- `.style.yapf` - YAPF code formatting style

Use these tools with their respective configs when modifying code.

## Documentation

Documentation is built with Sphinx and hosted on Read the Docs:
- Source files are in the `docs/` directory (reStructuredText format)
- Uses `sphinx-argparse` to auto-generate CLI documentation from argparse parsers
- The build process clones viral-core during the docs build (see `docs/conf.py`)
- GitHub Actions validates that the docs build; deployment is handled separately by Read the Docs

Read the docs at: http://viral-classify.readthedocs.org/

## Common Development Tasks

### Adding a New Classification Tool

1. Create a wrapper class in `classify/<tool>.py` inheriting from `tools.Tool` (see the skeleton after this list)
2. Define tool binaries, version, and installation methods
3. Add the conda dependency to the appropriate `requirements-conda*.txt`
4. Add a command parser and main function to `metagenomics.py` or `taxon_filter.py`
5. Register the command in the `__commands__` list
6. Add unit tests to `test/unit/`
7. Add test input files to `test/input/<TestClassName>/`
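
A hedged skeleton for steps 1-2; `mytool` and its binary name are hypothetical, and the `classify()` body is left as a stub since `self.execute()` calling conventions vary across the existing wrappers:

```python
# classify/mytool.py -- hypothetical new wrapper
import shutil
import tools

MYTOOL_VERSION = '1.0.0'  # placeholder

class MyTool(tools.Tool):
    BINS = {'classify': 'mytool'}  # logical name -> executable name

    def __init__(self, install_methods=None):
        if not install_methods:
            install_methods = [tools.PrexistingUnixCommand(shutil.which('mytool'))]
        super(MyTool, self).__init__(install_methods=install_methods)

    def version(self):
        return MYTOOL_VERSION

    def classify(self, in_bam, out_reads):
        # Invoke the underlying binary via self.execute(); see existing
        # wrappers in classify/ for the exact calling convention.
        raise NotImplementedError
```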

### Adding a New Conda Dependency

1. Check if the package exists: `conda search -c bioconda <package_name>`
2. Add it to the appropriate `requirements-conda*.txt` file (or env2/env3/env4 if conflicts exist)
3. Test in a Docker container with `install-conda-dependencies.sh`
4. Update viral-core if adding to base layer dependencies
5. Document any new environment requirements

### Debugging Test Failures

1. Set `VIRAL_NGS_TMP_DIRKEEP=1` to preserve temp directories
2. Run a single test: `pytest -v test/unit/test_file.py::TestClass::test_method`
3. Use `pytest -s` to see stdout/stderr
4. Use `--fixture-durations` to identify slow fixtures
5. Check test input files in `test/input/<TestClassName>/`
test/input/TestDepleteHuman/expected/test-reads.taxfilt.imperfect-2.bam

Binary file not shown.

test/unit/test_taxon_filter.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -52,7 +52,7 @@ def test_filter_lastal_bam_polio(self):
         args = taxon_filter.parser_filter_lastal_bam(argparse.ArgumentParser()).parse_args([
             inBam, self.lastdb_path, outBam])
         args.func_main(args)
-        expectedOut = os.path.join(util.file.get_test_input_path(), 'TestDepleteHuman', 'expected', 'test-reads.taxfilt.imperfect.bam')
+        expectedOut = os.path.join(util.file.get_test_input_path(), 'TestDepleteHuman', 'expected', 'test-reads.taxfilt.imperfect-2.bam')
         assert_equal_bam_reads(self, outBam, expectedOut)

     def test_lastal_empty_input(self):
```
