Merged
2 changes: 1 addition & 1 deletion doc/nonhybrid.rst
@@ -26,7 +26,7 @@ Since the input does not contain useful per-target gene labels, a gene
annotation database is required and used to label genes in the outputs::

cnvkit.py batch Sample1.bam Sample2.bam -n Control1.bam Control2.bam \
-m wgs -f hg19.fasta --annotate refFlat.txt
-m wgs -f hg38.fasta --annotate data/refFlat_hg38.txt

To speed up and/or improve the accuracy of WGS analyses, try any or all of the
following:
47 changes: 26 additions & 21 deletions doc/pipeline.rst
@@ -27,8 +27,8 @@ Run the CNVkit pipeline on one or more BAM files::

# From baits and tumor/normal BAMs
cnvkit.py batch *Tumor.bam --normal *Normal.bam \
--targets my_baits.bed --annotate refFlat.txt \
--fasta hg19.fasta --access data/access-5kb-mappable.hg19.bed \
--targets my_baits.bed --annotate data/refFlat_hg38.txt \
--fasta hg38.fasta --access data/access-10kb.hg38.bed \
--output-reference my_reference.cnn --output-dir results/ \
--diagram --scatter

@@ -38,7 +38,7 @@ Run the CNVkit pipeline on one or more BAM files::
# Reusing targets and antitargets to build a new reference, but no analysis
cnvkit.py batch -n *Normal.bam --output-reference new_reference.cnn \
-t my_targets.bed -a my_antitargets.bed \
-f hg19.fasta -g data/access-5kb-mappable.hg19.bed
-f hg38.fasta -g data/access-10kb.hg38.bed

With the ``-p`` option, process each of the BAM files in parallel, as separate
subprocesses. The status messages logged to the console will be somewhat
@@ -51,15 +51,15 @@ complete sooner.

The pipeline executed by the ``batch`` command is equivalent to::

cnvkit.py access hg19.fa -o access.hg19.bed
cnvkit.py autobin *.bam -t baits.bed -g access.hg19.bed [--annotate refFlat.txt --short-names]
cnvkit.py access hg38.fa -o access.hg38.bed
cnvkit.py autobin *.bam -t baits.bed -g access.hg38.bed [--annotate data/refFlat_hg38.txt --short-names]

# For each sample...
cnvkit.py coverage Sample.bam baits.target.bed -o Sample.targetcoverage.cnn
cnvkit.py coverage Sample.bam baits.antitarget.bed -o Sample.antitargetcoverage.cnn

# With all normal samples...
cnvkit.py reference *Normal.{,anti}targetcoverage.cnn --fasta hg19.fa -o my_reference.cnn
cnvkit.py reference *Normal.{,anti}targetcoverage.cnn --fasta hg38.fa -o my_reference.cnn

# For each tumor sample...
cnvkit.py fix Sample.targetcoverage.cnn Sample.antitargetcoverage.cnn my_reference.cnn -o Sample.cnr
@@ -94,7 +94,7 @@ Prepare a BED file of baited regions for use with CNVkit.

::

cnvkit.py target my_baits.bed --annotate refFlat.txt --split -o my_targets.bed
cnvkit.py target my_baits.bed --annotate data/refFlat_hg38.txt --split -o my_targets.bed

The BED file should be the baited genomic regions for your target capture kit,
as provided by your vendor. Since these regions (usually exons) may be of
@@ -167,8 +167,10 @@ Labeling target regions
In case the vendor BED file does not label each region with a corresponding gene
name, the ``--annotate`` option can add or replace these labels.
Gene annotation databases, e.g. RefSeq or Ensembl, are available in "flat"
format from UCSC (e.g. `refFlat.txt for hg19
<http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/refFlat.txt.gz>`_).
format from UCSC (e.g. `refFlat.txt for hg38
<http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/refFlat.txt.gz>`_).
A pre-downloaded ``refFlat_hg38.txt`` is included in the CNVkit ``data/``
directory.
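
To illustrate what labeling by annotation involves, here is a minimal Python sketch. It is not CNVkit's own implementation: it only assumes the refFlat column layout (gene name in column 1, chromosome in column 3, transcript start and end in columns 5 and 6). The function names ``load_genes`` and ``label_region`` are hypothetical, the ``-`` placeholder for unlabeled regions is an assumption, and the example rows use made-up coordinates.

```python
from collections import defaultdict


def load_genes(refflat_lines):
    """Index refFlat rows by chromosome as (tx_start, tx_end, gene_name)."""
    genes = defaultdict(list)
    for line in refflat_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed rows
        name, chrom = fields[0], fields[2]
        genes[chrom].append((int(fields[4]), int(fields[5]), name))
    return genes


def label_region(genes, chrom, start, end, default="-"):
    """Return a comma-joined, sorted list of genes overlapping [start, end)."""
    hits = {name for g_start, g_end, name in genes.get(chrom, [])
            if g_start < end and start < g_end}
    return ",".join(sorted(hits)) if hits else default


# Toy refFlat rows with made-up coordinates (not real annotations).
REFFLAT = [
    "GENEA\tNM_0001\tchr1\t+\t1000\t5000\t1200\t4800\t1\t1000,\t5000,",
    "GENEB\tNM_0002\tchr1\t-\t4500\t9000\t4600\t8900\t1\t4500,\t9000,",
]
genes = load_genes(REFFLAT)
print(label_region(genes, "chr1", 4600, 4700))    # overlaps both toy genes
print(label_region(genes, "chr1", 20000, 20100))  # no overlap, default label
```

A region overlapping several genes gets all of their names joined by commas, which mirrors the comma-separated labels discussed in this section.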

In other cases the region labels are a combination of human-readable gene names
and database accession codes, separated by commas (e.g.
@@ -193,7 +195,7 @@ reference genome, output as a BED file.

::

cnvkit.py access hg19.fa -x excludes.bed -o access-excludes.hg19.bed
cnvkit.py access hg38.fa -x excludes.bed -o access-excludes.hg38.bed
cnvkit.py access mm10.fasta -s 10000 -o access-10kb.mm10.bed

Many fully sequenced genomes, including the human genome, contain large regions
@@ -214,18 +216,21 @@ This option can be used more than once to exclude several BED files listing
different sets of regions.
For example, regions of poor mappability have been precalculated by others and
are available from the `UCSC FTP Server
<ftp://hgdownload.soe.ucsc.edu/goldenPath/>`_ (see `here for hg19
<ftp://hgdownload.soe.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeMapability/>`_).
<ftp://hgdownload.soe.ucsc.edu/goldenPath/>`_ (see `hg38 bigWig files
<https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/>`_,
or `hg19 ENCODE mappability
<ftp://hgdownload.soe.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeMapability/>`_
for legacy workflows).

If there are many small excluded/inaccessible regions in the genome, then small,
less-reliable antitarget bins would be squeezed into the remaining accessible
regions. The ``-s`` option ignores short regions that would otherwise be
excluded, allowing larger antitarget bins to overlap them.
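
The effect of such a size threshold can be sketched in Python as follows. This is an illustration of the idea rather than CNVkit's code: accessible intervals on one chromosome are joined whenever the excluded gap between them is shorter than a hypothetical ``min_gap`` value, so a short gap no longer splits the region. The function name and coordinates are made up for the example.

```python
def join_short_gaps(regions, min_gap):
    """Join (start, end) intervals separated by gaps shorter than min_gap."""
    merged = []
    for start, end in sorted(regions):
        if merged and start - merged[-1][1] < min_gap:
            # Gap is too short to matter: extend the previous interval.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Toy accessible regions: a 1 kb gap, then a 30 kb gap.
accessible = [(0, 40000), (41000, 90000), (120000, 150000)]
print(join_short_gaps(accessible, 10000))
```

With a 10 kb threshold, the 1 kb gap is absorbed and the 30 kb gap is kept, leaving two larger regions in which antitarget bins can be placed.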

An "access" file precomputed for the UCSC reference human genome build hg19,
with some know low-mappability regions excluded, is included in the CNVkit
source distribution under the ``data/`` directory
(``data/access-5kb-mappable.hg19.bed``).
Precomputed "access" files are included in the CNVkit source distribution under
the ``data/`` directory. For the hg38/GRCh38 human genome build, use
``data/access-10kb.hg38.bed``. An hg19 access file
(``data/access-5k-mappable.hg19.bed``) is also available for legacy workflows.


.. _antitarget:
@@ -239,7 +244,7 @@ off-target/"antitarget" regions.

::

cnvkit.py antitarget my_targets.bed -g data/access-5kb-mappable.hg19.bed -o my_antitargets.bed
cnvkit.py antitarget my_targets.bed -g data/access-10kb.hg38.bed -o my_antitargets.bed

Certain genomic regions cannot be mapped by short-read resequencing (see
:ref:`access`); we can avoid them when calculating the antitarget locations by
@@ -287,9 +292,9 @@ estimated average read depths and recommended bin sizes on standard output.

::

cnvkit.py autobin *.bam -t my_targets.bed -g access.hg19.bed
cnvkit.py autobin *.bam -t my_targets.bed -g data/access-10kb.hg38.bed
cnvkit.py autobin *.bam -m amplicon -t my_targets.bed
cnvkit.py autobin *.bam -m wgs -b 50000 -g access.hg19.bed --annotate refFlat.txt
cnvkit.py autobin *.bam -m wgs -b 50000 -g data/access-10kb.hg38.bed --annotate data/refFlat_hg38.txt

The BAM index (.bai) is used to quickly determine the total number of reads
present in a file, and random sampling of targeted regions (``-t``) is used to
@@ -388,7 +393,7 @@ Paired or pooled normals
Provide the ``*.targetcoverage.cnn`` and ``*.antitargetcoverage.cnn`` files
created by the :ref:`coverage` command::

cnvkit.py reference *coverage.cnn -f ucsc.hg19.fa -o Reference.cnn
cnvkit.py reference *coverage.cnn -f hg38.fa -o Reference.cnn

To analyze a cohort sequenced on a single platform, we recommend combining all
normal samples into a pooled reference, even if matched tumor-normal pairs were
@@ -430,7 +435,7 @@ still computes the GC content of each region if the reference genome is given.

::

cnvkit.py reference -o FlatReference.cnn -f ucsc.hg19.fa -t targets.bed -a antitargets.bed
cnvkit.py reference -o FlatReference.cnn -f hg38.fa -t targets.bed -a antitargets.bed
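
For a sense of what computing the GC content of each region involves, a minimal Python sketch follows. It is not CNVkit's implementation: the function name ``gc_fraction`` and the toy sequence are assumptions for illustration, and excluding ambiguous bases (``N``) from the denominator is just one possible convention.

```python
def gc_fraction(sequence):
    """Fraction of G/C among unambiguous bases; None if there are none."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else None


# Toy "genome" with one short contig; real input would come from a FASTA file.
genome = {"chrT": "ACGTGGCCNNATAT"}
regions = [("chrT", 0, 4), ("chrT", 4, 8), ("chrT", 8, 14)]
for chrom, start, end in regions:
    print(chrom, start, end, gc_fraction(genome[chrom][start:end]))
```

Each region is sliced out of the contig using BED-style half-open coordinates before the fraction is computed.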

Possible uses for a flat reference include:

19 changes: 10 additions & 9 deletions doc/quickstart.rst
@@ -25,9 +25,10 @@ website and download:

1. Your species' reference genome sequence, in FASTA format [required]
2. Gene annotation database, via RefSeq or Ensembl, in BED or "RefFlat" format
(e.g. `refFlat.txt
<http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/refFlat.txt.gz>`_)
[optional]
(e.g. `refFlat.txt for hg38
<http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/refFlat.txt.gz>`_)
[optional] -- a pre-downloaded ``refFlat_hg38.txt`` is included in the
CNVkit ``data/`` directory

You probably already have the reference genome sequence. If your species' genome
is not available from UCSC, use whatever reference sequence you have. CNVkit
@@ -91,8 +92,8 @@ samples share the suffix "Normal.bam" and tumor samples "Tumor.bam", a complete
``batch`` command could be::

cnvkit.py batch *Tumor.bam --normal *Normal.bam \
--targets my_baits.bed --fasta hg19.fasta \
--access data/access-5kb-mappable.hg19.bed \
--targets my_baits.bed --fasta hg38.fasta \
--access data/access-10kb.hg38.bed \
--output-reference my_reference.cnn --output-dir example/
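
When the suffix convention above is followed, the same command can also be assembled programmatically. The Python sketch below is one way to do it, assuming ``cnvkit.py`` is on the ``PATH`` and treating every file name as a placeholder. It only builds and prints the command, using the options shown above; the helper name ``build_batch_command`` is hypothetical.

```python
import shlex


def build_batch_command(bam_files, targets, fasta, access, reference, outdir):
    """Split BAMs by suffix and build a `cnvkit.py batch` argument list."""
    tumors = sorted(f for f in bam_files if f.endswith("Tumor.bam"))
    normals = sorted(f for f in bam_files if f.endswith("Normal.bam"))
    return ["cnvkit.py", "batch", *tumors, "--normal", *normals,
            "--targets", targets, "--fasta", fasta, "--access", access,
            "--output-reference", reference, "--output-dir", outdir]


# Placeholder file names following the Tumor.bam / Normal.bam convention.
bams = ["P1_Tumor.bam", "P1_Normal.bam", "P2_Tumor.bam", "P2_Normal.bam"]
cmd = build_batch_command(bams, "my_baits.bed", "hg38.fasta",
                          "data/access-10kb.hg38.bed",
                          "my_reference.cnn", "example/")
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell-quoting problems if a file name contains spaces.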

See the built-in help message to see what these options do, and for additional
@@ -104,8 +105,8 @@ If you have no normal samples to use for the :ref:`reference`, you can create a
"flat" reference which assumes equal coverage in all bins by using the
``--normal/-n`` flag without specifying any additional BAM files::

cnvkit.py batch *Tumor.bam -n -t my_baits.bed -f hg19.fasta \
--access data/access-5kb-mappable.hg19.bed \
cnvkit.py batch *Tumor.bam -n -t my_baits.bed -f hg38.fasta \
--access data/access-10kb.hg38.bed \
--output-reference my_flat_reference.cnn -d example2/

In either case, you should run this command with the reference genome sequence
@@ -116,8 +117,8 @@ normal sample.
If your targets are missing gene names, you can add them here with the
``--annotate`` argument::

cnvkit.py batch *Tumor.bam -n *Normal.bam -t my_baits.bed -f hg19.fasta \
--annotate refFlat.txt --access data/access-5kb-mappable.hg19.bed \
cnvkit.py batch *Tumor.bam -n *Normal.bam -t my_baits.bed -f hg38.fasta \
--annotate data/refFlat_hg38.txt --access data/access-10kb.hg38.bed \
--output-reference my_flat_reference.cnn -d example3/

.. note:: **Which BED file should I use?**
2 changes: 1 addition & 1 deletion doc/scripts.rst
@@ -44,7 +44,7 @@ Additional scripts
boundaries for enriched regions. (This is usually much slower than the
guided approach.) ::

guess_baits.py -g access.hg19.bed Sample1.bam Sample2.bam -o baits.bed
guess_baits.py -g data/access-10kb.hg38.bed Sample1.bam Sample2.bam -o baits.bed

In either mode, the input region coordinates can be provided in any of the
formats handled by skgenome.tabio, but it's best to first run them through