diff --git a/doc/nonhybrid.rst b/doc/nonhybrid.rst
index d8513a77..ccced958 100644
--- a/doc/nonhybrid.rst
+++ b/doc/nonhybrid.rst
@@ -26,7 +26,7 @@ Since the input does not contain useful per-target gene labels, a gene
 annotation database is required and used to label genes in the outputs::
 
     cnvkit.py batch Sample1.bam Sample2.bam -n Control1.bam Control2.bam \
-        -m wgs -f hg19.fasta --annotate refFlat.txt
+        -m wgs -f hg38.fasta --annotate data/refFlat_hg38.txt
 
 To speed up and/or improve the accuracy of WGS analyses, try any or all of the
 following:
diff --git a/doc/pipeline.rst b/doc/pipeline.rst
index 5467257f..da0f1f0c 100644
--- a/doc/pipeline.rst
+++ b/doc/pipeline.rst
@@ -27,8 +27,8 @@ Run the CNVkit pipeline on one or more BAM files::
 
     # From baits and tumor/normal BAMs
     cnvkit.py batch *Tumor.bam --normal *Normal.bam \
-        --targets my_baits.bed --annotate refFlat.txt \
-        --fasta hg19.fasta --access data/access-5kb-mappable.hg19.bed \
+        --targets my_baits.bed --annotate data/refFlat_hg38.txt \
+        --fasta hg38.fasta --access data/access-10kb.hg38.bed \
         --output-reference my_reference.cnn --output-dir results/ \
         --diagram --scatter
 
@@ -38,7 +38,7 @@ Run the CNVkit pipeline on one or more BAM files::
     # Reusing targets and antitargets to build a new reference, but no analysis
     cnvkit.py batch -n *Normal.bam --output-reference new_reference.cnn \
         -t my_targets.bed -a my_antitargets.bed \
-        -f hg19.fasta -g data/access-5kb-mappable.hg19.bed
+        -f hg38.fasta -g data/access-10kb.hg38.bed
 
 With the ``-p`` option, process each of the BAM files in parallel, as separate
 subprocesses. The status messages logged to the console will be somewhat
@@ -51,15 +51,15 @@ complete sooner.
 
 The pipeline executed by the ``batch`` command is equivalent to::
 
-    cnvkit.py access hg19.fa -o access.hg19.bed
-    cnvkit.py autobin *.bam -t baits.bed -g access.hg19.bed [--annotate refFlat.txt --short-names]
+    cnvkit.py access hg38.fa -o access.hg38.bed
+    cnvkit.py autobin *.bam -t baits.bed -g access.hg38.bed [--annotate data/refFlat_hg38.txt --short-names]
 
     # For each sample...
     cnvkit.py coverage Sample.bam baits.target.bed -o Sample.targetcoverage.cnn
     cnvkit.py coverage Sample.bam baits.antitarget.bed -o Sample.antitargetcoverage.cnn
 
     # With all normal samples...
-    cnvkit.py reference *Normal.{,anti}targetcoverage.cnn --fasta hg19.fa -o my_reference.cnn
+    cnvkit.py reference *Normal.{,anti}targetcoverage.cnn --fasta hg38.fa -o my_reference.cnn
 
     # For each tumor sample...
     cnvkit.py fix Sample.targetcoverage.cnn Sample.antitargetcoverage.cnn my_reference.cnn -o Sample.cnr
@@ -94,7 +94,7 @@ Prepare a BED file of baited regions for use with CNVkit.
 
 ::
 
-    cnvkit.py target my_baits.bed --annotate refFlat.txt --split -o my_targets.bed
+    cnvkit.py target my_baits.bed --annotate data/refFlat_hg38.txt --split -o my_targets.bed
 
 The BED file should be the baited genomic regions for your target capture kit,
 as provided by your vendor. Since these regions (usually exons) may be of
@@ -167,8 +167,10 @@ Labeling target regions
 In case the vendor BED file does not label each region with a corresponding gene
 name, the ``--annotate`` option can add or replace these labels.
 Gene annotation databases, e.g. RefSeq or Ensembl, are available in "flat"
-format from UCSC (e.g. `refFlat.txt for hg19
-`_).
+format from UCSC (e.g. `refFlat.txt for hg38
+`_).
+A pre-downloaded ``refFlat_hg38.txt`` is included in the CNVkit ``data/``
+directory.
 
 In other cases the region labels are a combination of human-readable gene names
 and database accession codes, separated by commas (e.g.
@@ -193,7 +195,7 @@ reference genome, output as a BED file.
 
 ::
 
-    cnvkit.py access hg19.fa -x excludes.bed -o access-excludes.hg19.bed
+    cnvkit.py access hg38.fa -x excludes.bed -o access-excludes.hg38.bed
     cnvkit.py access mm10.fasta -s 10000 -o access-10kb.mm10.bed
 
 Many fully sequenced genomes, including the human genome, contain large regions
@@ -214,18 +216,21 @@ This option can be used more than once to exclude several BED files listing
 different sets of regions.
 For example, regions of poor mappability have been precalculated by others and
 are available from the `UCSC FTP Server
-`_ (see `here for hg19
-`_).
+`_ (see `hg38 bigWig files
+`_,
+or `hg19 ENCODE mappability
+`_
+for legacy workflows).
 
 If there are many small excluded/inaccessible regions in the genome, then
 small, less-reliable antitarget bins would be squeezed into the remaining
 accessible regions. The ``-s`` option ignores short regions that would otherwise
 be excluded, allowing larger antitarget bins to overlap them.
 
-An "access" file precomputed for the UCSC reference human genome build hg19,
-with some know low-mappability regions excluded, is included in the CNVkit
-source distribution under the ``data/`` directory
-(``data/access-5kb-mappable.hg19.bed``).
+Precomputed "access" files are included in the CNVkit source distribution under
+the ``data/`` directory. For the hg38/GRCh38 human genome build, use
+``data/access-10kb.hg38.bed``. An hg19 access file
+(``data/access-5k-mappable.hg19.bed``) is also available for legacy workflows.
 
 
 .. _antitarget:
@@ -239,7 +244,7 @@ off-target/"antitarget" regions.
 
 ::
 
-    cnvkit.py antitarget my_targets.bed -g data/access-5kb-mappable.hg19.bed -o my_antitargets.bed
+    cnvkit.py antitarget my_targets.bed -g data/access-10kb.hg38.bed -o my_antitargets.bed
 
 Certain genomic regions cannot be mapped by short-read resequencing (see
 :ref:`access`); we can avoid them when calculating the antitarget locations by
@@ -287,9 +292,9 @@ estimated average read depths and recommended bin sizes on standard output.
 
 ::
 
-    cnvkit.py autobin *.bam -t my_targets.bed -g access.hg19.bed
+    cnvkit.py autobin *.bam -t my_targets.bed -g data/access-10kb.hg38.bed
     cnvkit.py autobin *.bam -m amplicon -t my_targets.bed
-    cnvkit.py autobin *.bam -m wgs -b 50000 -g access.hg19.bed --annotate refFlat.txt
+    cnvkit.py autobin *.bam -m wgs -b 50000 -g data/access-10kb.hg38.bed --annotate data/refFlat_hg38.txt
 
 The BAM index (.bai) is used to quickly determine the total number of reads
 present in a file, and random sampling of targeted regions (``-t``) is used to
@@ -388,7 +393,7 @@ Paired or pooled normals
 Provide the ``*.targetcoverage.cnn`` and ``*.antitargetcoverage.cnn`` files
 created by the :ref:`coverage` command::
 
-    cnvkit.py reference *coverage.cnn -f ucsc.hg19.fa -o Reference.cnn
+    cnvkit.py reference *coverage.cnn -f hg38.fa -o Reference.cnn
 
 To analyze a cohort sequenced on a single platform, we recommend combining all
 normal samples into a pooled reference, even if matched tumor-normal pairs were
@@ -430,7 +435,7 @@ still computes the GC content of each region if the reference genome is given.
 
 ::
 
-    cnvkit.py reference -o FlatReference.cnn -f ucsc.hg19.fa -t targets.bed -a antitargets.bed
+    cnvkit.py reference -o FlatReference.cnn -f hg38.fa -t targets.bed -a antitargets.bed
 
 Possible uses for a flat reference include:
 
diff --git a/doc/quickstart.rst b/doc/quickstart.rst
index b2641fbd..56a69521 100644
--- a/doc/quickstart.rst
+++ b/doc/quickstart.rst
@@ -25,9 +25,10 @@ website and download:
 
 1. Your species' reference genome sequence, in FASTA format [required]
 2. Gene annotation database, via RefSeq or Ensembl, in BED or "RefFlat" format
-   (e.g. `refFlat.txt
-   `_)
-   [optional]
+   (e.g. `refFlat.txt for hg38
+   `_)
+   [optional] -- a pre-downloaded ``refFlat_hg38.txt`` is included in the
+   CNVkit ``data/`` directory
 
 You probably already have the reference genome sequence. If your species' genome
 is not available from UCSC, use whatever reference sequence you have.
CNVkit
@@ -91,8 +92,8 @@ samples share the suffix "Normal.bam" and tumor samples "Tumor.bam", a complete
 ``batch`` command could be::
 
     cnvkit.py batch *Tumor.bam --normal *Normal.bam \
-        --targets my_baits.bed --fasta hg19.fasta \
-        --access data/access-5kb-mappable.hg19.bed \
+        --targets my_baits.bed --fasta hg38.fasta \
+        --access data/access-10kb.hg38.bed \
         --output-reference my_reference.cnn --output-dir example/
 
 See the built-in help message to see what these options do, and for additional
@@ -104,8 +105,8 @@ If you have no normal samples to use for the :ref:`reference`, you can create a
 "flat" reference which assumes equal coverage in all bins by using the
 ``--normal/-n`` flag without specifying any additional BAM files::
 
-    cnvkit.py batch *Tumor.bam -n -t my_baits.bed -f hg19.fasta \
-        --access data/access-5kb-mappable.hg19.bed \
+    cnvkit.py batch *Tumor.bam -n -t my_baits.bed -f hg38.fasta \
+        --access data/access-10kb.hg38.bed \
         --output-reference my_flat_reference.cnn -d example2/
 
 In either case, you should run this command with the reference genome sequence
@@ -116,8 +117,8 @@ normal sample.
 If your targets are missing gene names, you can add them here with the
 ``--annotate`` argument::
 
-    cnvkit.py batch *Tumor.bam -n *Normal.bam -t my_baits.bed -f hg19.fasta \
-        --annotate refFlat.txt --access data/access-5kb-mappable.hg19.bed \
+    cnvkit.py batch *Tumor.bam -n *Normal.bam -t my_baits.bed -f hg38.fasta \
+        --annotate data/refFlat_hg38.txt --access data/access-10kb.hg38.bed \
         --output-reference my_flat_reference.cnn -d example3/
 
 .. note:: **Which BED file should I use?**
diff --git a/doc/scripts.rst b/doc/scripts.rst
index 2e2e862e..a4f7c7af 100644
--- a/doc/scripts.rst
+++ b/doc/scripts.rst
@@ -44,7 +44,7 @@ Additional scripts
     boundaries for enriched regions. (This is usually much slower then the
     guided approach.)
::
 
-        guess_baits.py -g access.hg19.bed Sample1.bam Sample2.bam -o baits.bed
+        guess_baits.py -g data/access-10kb.hg38.bed Sample1.bam Sample2.bam -o baits.bed
 
     In either mode, the input region coordinates can be provided in any of the
     formats handled by skgenome.tabio, but it's best to first run them through