README.md: 41 additions & 1 deletion
@@ -2,6 +2,22 @@
2
2
3
3
This repository contains a pipeline to process ATAC-seq data. It performs adapter trimming, mapping, and peak calling, and creates bigwig tracks, TSS enrichment files, and other outputs.
4
4
5
+
## Pipeline features outlined
6
+
7
+
**Decoy alignments.** Before aligning to the genome, the pipeline first aligns to decoy sequences. This has several advantages: it speeds up alignment dramatically, reduces noise from erroneous alignments, and makes it possible to analyze signal at repeats. The pipeline aligns *sequentially* to these decoy sequences (if provided):
8
+
9
+
- chrM (doubled; for non-circular aligners, to draw away reads from NuMTs)
10
+
- Alu elements
11
+
- alpha satellites
12
+
- rDNA
13
+
- repbase
14
+
15
+
We provide indexed assemblies **for human** for each of these, available for download in the [ref_decoy](https://github.com/databio/ref_decoy) repository (excluding repbase, which is not publicly available). Any assemblies not provided are skipped.
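The sequential strategy can be sketched in a few lines of Python. This is a toy illustration, not the pipeline's implementation: the helper names `plan_alignments` and `align_sequentially`, the decoy labels passed in, and the `maps_to` callback (a stand-in for a real aligner) are all assumptions made for the example.

```python
# Toy sketch of sequential decoy alignment. All names here are hypothetical.

def plan_alignments(genome, decoy_order, available):
    """Return alignment targets in order: available decoys first, genome last.

    Decoys without an available assembly are skipped.
    """
    steps = [name for name in decoy_order if name in available]
    steps.append(genome)
    return steps


def align_sequentially(reads, steps, maps_to):
    """Align reads to each target in turn; only unmapped reads move on.

    maps_to(read, target) is a placeholder for a real aligner call.
    Returns (reads mapped per target, reads left unmapped at the end).
    """
    counts = {}
    remaining = list(reads)
    for target in steps:
        counts[target] = 0
        unmapped = []
        for read in remaining:
            if maps_to(read, target):
                counts[target] += 1
            else:
                unmapped.append(read)
        remaining = unmapped
    return counts, remaining
```

Because each step only sees reads that earlier steps failed to place, reads drawn away by a decoy never reach the genome alignment, which is where the speed-up and noise reduction described above would come from.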
16
+
17
+
**Fraction of reads in peaks (FRIP).** By default, the pipeline calculates the FRIP as a quality-control metric, using the peaks it identifies internally. Optionally, it can **additionally** calculate a FRIP using a reference set of peaks (for example, from another experiment). To do this, provide a reference peak set (as a BED file) by adding a column named `FRIP_ref` to your annotation sheet (see [pipeline_interface.yaml](/config/pipeline_interface.yaml)). In that column, specify the reference peak filename (or use a derived column and specify the path in the `data_sources` section of the project config file).
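To make the metric concrete, here is a minimal sketch of a FRIP calculation in Python. It is not the pipeline's own code: the function name `frip` and the tuple-based input format are assumptions for illustration, and intervals are assumed to be 0-based and half-open, as in BED files.

```python
# Minimal FRIP sketch; the function name and input format are hypothetical.

def frip(reads, peaks):
    """Fraction of reads that overlap at least one peak.

    reads and peaks are lists of (chrom, start, end) tuples,
    with 0-based, half-open coordinates.
    """
    peaks_by_chrom = {}
    for chrom, start, end in peaks:
        peaks_by_chrom.setdefault(chrom, []).append((start, end))

    in_peaks = 0
    for chrom, start, end in reads:
        # Two half-open intervals overlap if each starts before the other ends.
        if any(start < p_end and p_start < end
               for p_start, p_end in peaks_by_chrom.get(chrom, [])):
            in_peaks += 1

    return in_peaks / len(reads) if reads else 0.0
```

Passing the pipeline's own peak calls as `peaks` would correspond to the default FRIP, and passing a reference peak set would correspond to the additional one. The linear scan over peaks is fine for a sketch but would be slow on real data.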
18
+
19
+
20
+
5
21
## Installing
6
22
7
23
**Prerequisites**. This pipeline uses [pypiper](https://github.com/epigen/pypiper) to run a pipeline for a single sample, and [looper](https://github.com/epigen/looper) to handle multi-sample projects (for either local or cluster computation). You can do a user-specific install of both like this:
**Required executables**. To run the pipeline, you will also need some common bioinformatics tools installed. The list is specified in the `tools` section of the pipeline configuration file ([pipelines/ATACseq.yaml](pipelines/ATACseq.yaml)).
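One quick way to check that the required executables are on your `PATH` is a small Python helper like the one below. This is only a sketch: `missing_tools` is a hypothetical name, and the list of tools to check should be taken from the `tools` section of the configuration file rather than from this example.

```python
import shutil


def missing_tools(tools, which=shutil.which):
    """Return the tool names that cannot be found on PATH.

    `which` defaults to shutil.which and can be swapped out for testing.
    """
    return [tool for tool in tools if which(tool) is None]
```

For example, `missing_tools(["samtools"])` returns an empty list if `samtools` is installed and `["samtools"]` otherwise (the tool name here is just a placeholder).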
20
36
21
-
**Genome resources**. This pipeline requires genome assemblies produced by [refgenie](https://github.com/databio/refgenie). The pipeline aligns serially to decoy sequences if you have them set up, which greatly improves pipeline performance. You can set up the decoy sequences using [ref_decoy](https://github.com/databio/ref_decoy).
37
+
**Genome resources**. This pipeline requires genome assemblies produced by [refgenie](https://github.com/databio/refgenie). You can set up the (optional) decoy sequences using [ref_decoy](https://github.com/databio/ref_decoy).
22
38
23
39
**Clone the pipeline**. Then, clone this repository using one of these methods:
- using HTTPS: `git clone https://github.com/databio/ATACseq.git`
26
42
27
43
## Configuring
44
+
28
45
You can either set up environment variables to fit the default configuration, or change the configuration file to fit your environment. For the Chang lab, there is a pre-made config file and project template. Follow the instructions on the [Chang lab configuration](examples/chang_project) page.
@@ -68,6 +85,29 @@ Your annotation file must specify these columns:
68
85
69
86
Run your project as above, by passing your project config file to `looper run`. More detailed instructions and advanced options for how to define your project are in the [Looper documentation on defining a project](http://looper.readthedocs.io/en/latest/define-your-project.html). Of particular interest may be the section on [using looper derived columns](http://looper.readthedocs.io/en/latest/advanced.html#pointing-to-flexible-data-with-derived-columns).
70
87
88
+
## TSS enrichments
89
+
90
+
To calculate TSS enrichments, you will need a TSS annotation file in your reference genome directory. Here's code to generate that.
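As an illustration only, and not the repository's own snippet, the following Python sketch derives single-base TSS records from gene annotation rows. The function name `tss_bed_records`, the input keys (`chrom`, `strand`, `txStart`, `txEnd`, `name`), and the six-column BED-like output are assumptions. Check the file name and column layout the pipeline actually expects before using anything like this.

```python
# Hypothetical sketch: derive TSS positions from gene annotation rows.
# Assumes 0-based, half-open txStart/txEnd coordinates (UCSC table style).

def tss_bed_records(rows):
    """Return (chrom, start, end, name, score, strand) tuples, one per gene.

    On the + strand the TSS is the first transcribed base (txStart);
    on the - strand it is the last base of the interval (txEnd - 1).
    """
    records = []
    for row in rows:
        if row["strand"] == "+":
            tss = int(row["txStart"])
        else:
            tss = int(row["txEnd"]) - 1
        records.append((row["chrom"], tss, tss + 1, row["name"], 0, row["strand"]))
    return records
```

Each record can then be written as a tab-separated line to produce a BED-style TSS file.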
Once you've specified your project to work with this pipeline, you also inherit all the power of looper for your project. You can submit these jobs to a cluster with a simple change to your configuration file. Follow the instructions in [configuring looper to use a cluster](http://looper.readthedocs.io/en/latest/cluster-computing.html).
I used the resulting file [metadata/annocomb_gold_atac_gse.csv](metadata/annocomb_gold_atac_gse.csv) to create the looper metadata sheet, [metadata/gold_atac_annotation.csv](metadata/gold_atac_annotation.csv).
15
+
16
+
I created a project config file and sampled test data. The SRA fastq files should be stored in a folder `${SRAFQ}`; the project will then run with looper with no additional changes.
17
+
18
+
## Run pipeline
19
+
20
+
```
looper run ${CODE}ATACseq/examples/gold_atac/metadata/project_config.yaml -d
```