Skip to content

Commit 1b0503e

Browse files
authored
Merge pull request #2 from databio/dev
Development changes into master
2 parents f172220 + 47dc737 commit 1b0503e

20 files changed

+522
-181
lines changed

.gitignore

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,2 @@
1+
*.pyc
2+
.~lock*

CHANGELOG.md

Lines changed: 20 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,20 @@
1+
# Change log
2+
All notable changes to this project will be documented in this file.
3+
4+
## [0.2.0]
5+
### Added
6+
- FRiP can now be calculated based on reference peaks
7+
- Pipeline now reports Picard estimated library size statistic
8+
- Added option for pyadapt trimming
9+
- Added example project using 'gold standard' data
10+
- Added new resource package grades
11+
- Added preliminary 'exact cuts' scripts, but they are not yet used
12+
13+
### Changed
14+
- Improved README
15+
- Changed filename of the TSS file
16+
- Reorganized structure of alignment code
17+
18+
## [0.1.0]
19+
### Added
20+
- First release of ATAC-seq pypiper pipeline

README.md

Lines changed: 41 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -2,6 +2,22 @@
22

33
This repository contains a pipeline to process ATAC-seq data. It does adapter trimming, mapping, peak calling, and creates bigwig tracks, TSS enrichment files, and other outputs.
44

5+
## Pipeline features outlined
6+
7+
**Decoy alignments.** Before aligning to the genome, we first align to decoy sequences. This has several advantages: it speeds up the process dramatically, reduces noise from erroneous alignments, and provides potential to analyze signal at repeats. The pipeline will align *sequentially* to these decoy sequences (if provided):
8+
9+
- chrM (doubled; for non-circular aligners, to draw away reads from NuMTs)
10+
- Alu elements
11+
- alpha satellites
12+
- rDNA
13+
- repbase
14+
15+
We have provided indexed assemblies for download for each of these **for human** in the [ref_decoy](https://github.com/databio/ref_decoy) repository (excluding repbase, which is not publicly available). Any assemblies not provided are skipped.
16+
17+
**Fraction of reads in peaks (FRIP).** By default, the pipeline will calculate the FRIP as a quality control, using the peaks it identifies internally. If you want, it will **additionally** calculate a FRIP using a reference set of peaks (for example, from another experiment). For this you must provide a reference peak set (as a bed file) to the pipeline. You can do this by adding a column named `FRIP_ref` to your annotation sheet (see [pipeline_interface.yaml](/config/pipeline_interface.yaml)). Specify the reference peak filename (or use a derived column and specify the path in the project config file `data_sources` section).
18+
19+
20+
521
## Installing
622

723
**Prerequisites**. This pipeline uses [pypiper](https://github.com/epigen/pypiper) to run a pipeline for a single sample, and [looper](https://github.com/epigen/looper) to handle multi-sample projects (for either local or cluster computation). You can do a user-specific install of both like this:
@@ -18,13 +34,14 @@ export PATH=$PATH:~/.local/bin
1834

1935
**Required executables**. To run the pipeline, you will also need some common bioinformatics tools installed. The list is specified in the pipeline configuration file ([pipelines/ATACseq.yaml](pipelines/ATACseq.yaml)) tools section.
2036

21-
**Genome resources**. This pipeline requires genome assemblies produced by [refgenie](https://github.com/databio/refgenie). The pipeline aligns serially to decoy sequences if you have them set up, which greatly improves pipeline performance. You can set up the decoy sequences using [ref_decoy](https://github.com/databio/ref_decoy).
37+
**Genome resources**. This pipeline requires genome assemblies produced by [refgenie](https://github.com/databio/refgenie). You can set up the (optional) decoy sequences using [ref_decoy](https://github.com/databio/ref_decoy).
2238

2339
**Clone the pipeline**. Then, clone this repository using one of these methods:
2440
- using SSH: `git clone [email protected]:databio/ATACseq.git`
2541
- using HTTPS: `git clone https://github.com/databio/ATACseq.git`
2642

2743
## Configuring
44+
2845
You can either set up environment variables to fit the default configuration, or change the configuration file to fit your environment. For the Chang lab, there is a pre-made config file and project template. Follow the instructions on the [Chang lab configuration](examples/chang_project) page.
2946

3047
Option 1: **Default configuration** ([pipelines/ATACseq.yaml](pipelines/ATACseq.yaml)).
@@ -68,6 +85,29 @@ Your annotation file must specify these columns:
6885

6986
Run your project as above, by passing your project config file to `looper run`. More detailed instructions and advanced options for how to define your project are in the [Looper documentation on defining a project](http://looper.readthedocs.io/en/latest/define-your-project.html). Of particular interest may be the section on [using looper derived columns](http://looper.readthedocs.io/en/latest/advanced.html#pointing-to-flexible-data-with-derived-columns).
7087

88+
## TSS enrichments
89+
90+
In order to calculate TSS enrichments, you will need a TSS annotation file in your reference genome directory. Here's code to generate that.
91+
92+
From refGene:
93+
94+
```
95+
# Provide genome string and gene file
96+
GENOME="hg38"
97+
URL="http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/refGene.txt.gz"
98+
99+
wget -O ${GENOME}_TSS_full.txt.gz ${URL}
100+
zcat ${GENOME}_TSS_full.txt.gz | awk '{if($4=="+"){print $3"\t"$5"\t"$5"\t"$4"\t"$13}else{print $3"\t"$6"\t"$6"\t"$4"\t"$13}}' | LC_COLLATE=C sort -k1,1 -k2,2n -u > ${GENOME}_TSS.tsv
101+
echo ${GENOME}_TSS.tsv
102+
```
103+
104+
Another option from Gencode GTF:
105+
106+
```
107+
grep "level 1" ${GENOME}.gtf | grep "gene" | awk '{if($7=="+"){print $1"\t"$4"\t"$4"\t"$7}else{print $1"\t"$5"\t"$5"\t"$7}}' | LC_COLLATE=C sort -u -k1,1V -k2,2n > ${GENOME}_TSS.tsv
108+
109+
```
110+
71111
## Using a cluster
72112

73113
Once you've specified your project to work with this pipeline, you will also inherit all the power of looper for your project. You can submit these jobs to a cluster with a simple change to your configuration file. Follow instructions in [configuring looper to use a cluster](http://looper.readthedocs.io/en/latest/cluster-computing.html).

cmd.sh

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,3 @@
1-
# using pre-fix of fastq file
2-
#python pipelines/ATACseq.py -P 3 -M 100 -O test_out -R -S liver -G mm9 -Q paired -C ATACseq.yaml -gs mm -I test_data/liver-CD31_test_R1.fastq.gz -I2 test_data/liver-CD31_test_R2.fastq.gz
3-
python pipelines/ATACseq.py -P 3 -M 100 -O test_out -R -S liver -G hg19 -Q paired -C ATACseq.yaml -gs mm -I test_data/liver-CD31_test_R1.fastq.gz -I2 test_data/liver-CD31_test_R2.fastq.gz
1+
# using pre-fix of fastq file
2+
#python pipelines/ATACseq.py -P 3 -M 100 -O test_out -R -S liver -G mm9 -Q paired -C ATACseq.yaml -gs mm -I test_data/liver-CD31_test_R1.fastq.gz -I2 test_data/liver-CD31_test_R2.fastq.gz
3+
python pipelines/ATACseq.py -P 3 -M 100 -O test_out -R -S liver -G hg19 -Q paired -C ATACseq.yaml -gs mm -I examples/test_data/liver-CD31_test_R1.fastq.gz -I2 examples/test_data/liver-CD31_test_R2.fastq.gz

config/pipeline_interface.yaml

Lines changed: 13 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -11,10 +11,21 @@ ATACseq.py:
1111
"--input2": read2
1212
"-gs mm": null
1313
"--single-or-paired": read_type
14+
optional_arguments:
15+
"--frip-ref-peaks": FRIP_ref
1416
resources:
1517
default:
1618
file_size: "0"
19+
cores: "2"
20+
mem: "4000"
21+
time: "0-04:00:00"
22+
normal:
23+
file_size: "0.5"
24+
cores: "4"
25+
mem: "16000"
26+
time: "2-00:00:00"
27+
large:
28+
file_size: "6"
1729
cores: "8"
1830
mem: "32000"
19-
time: "2-00:00:00"
20-
partition: "parallel"
31+
time: "3-00:00:00"

config/protocol_mappings.yaml

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1 +1,2 @@
1-
ATAC: ATACseq.py
1+
ATAC: ATACseq.py
2+
ATAC-SEQ: ATACseq.py

examples/gold_atac/README.md

Lines changed: 22 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,22 @@
1+
2+
# Gold ATAC
3+
4+
Testing ATAC-seq pipeline on gold standard public ATAC-seq data.
5+
6+
## Grab data, project setup
7+
8+
Download raw `fastq.gz` files (use `fastq-dump` from SRA. You may also use `get_geo.py` to download raw ATAC-seq reads from SRA and metadata from GEO:
9+
10+
```
11+
python get_geo.py -i ~/code/ATACseq/examples/gold_atac/metadata/gold_atac_gse.csv -r --fastq
12+
```
13+
14+
I used resulting file [metadata/annocomb_gold_atac_gse.csv](metadata/annocomb_gold_atac_gse.csv) to create the looper metadata sheet, [metadata/gold_atac_annotation.csv](metadata/gold_atac_annotation.csv).
15+
16+
I create project config file and sampled test data. The SRA fastq files should be stored in a folder `${SRAFQ}`, and then this will run with looper with no additional changes.
17+
18+
## Run pipeline
19+
20+
```
21+
looper run ${CODE}ATACseq/examples/gold_atac/metadata/project_config.yaml -d
22+
```
Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,6 @@
1+
sample_name,Sample_title,Sample_source_name_ch1,organism,Sample_organism_ch1,library,Sample_library_selection,Sample_library_strategy,data_source,Sample_type,SRR,SRX,Sample_geo_accession,Sample_series_id,single_or_paired,Sample_instrument_model
2+
ATAC-seq_from_dendritic_cell_(ENCLB065VMV),ATAC-seq from dendritic cell (ENCLB065VMV),Homo sapiens dendritic in vitro differentiated cells treated with 0 ng/mL Lipopolysaccharide for 0 hours,Homo sapiens,Homo sapiens,,other,ATAC-seq,SRA,SRA,SRR5210416,SRX2523872,GSM2471255,GSE94182,PAIRED,Illumina HiSeq 2000
3+
ATAC-seq_from_dendritic_cell_(ENCLB811FLK),ATAC-seq from dendritic cell (ENCLB811FLK),Homo sapiens dendritic in vitro differentiated cells treated with 0 ng/mL Lipopolysaccharide for 0 hours,Homo sapiens,Homo sapiens,,other,ATAC-seq,SRA,SRA,SRR5210450,SRX2523906,GSM2471300,GSE94222,PAIRED,Illumina HiSeq 2000
4+
ATAC-seq_from_dendritic_cell_(ENCLB887PKE),ATAC-seq from dendritic cell (ENCLB887PKE),Homo sapiens dendritic in vitro differentiated cells treated with 0 ng/mL Lipopolysaccharide for 0 hours,Homo sapiens,Homo sapiens,,other,ATAC-seq,SRA,SRA,SRR5210398,SRX2523862,GSM2471249,GSE94177,PAIRED,Illumina NextSeq 500
5+
ATAC-seq_from_dendritic_cell_(ENCLB586KIS),ATAC-seq from dendritic cell (ENCLB586KIS),Homo sapiens dendritic in vitro differentiated cells treated with 0 ng/mL Lipopolysaccharide for 0 hours,Homo sapiens,Homo sapiens,,other,ATAC-seq,SRA,SRA,SRR5210428,SRX2523884,GSM2471269,GSE94196,PAIRED,Illumina HiSeq 2000
6+
ATAC-seq_from_dendritic_cell_(ENCLB384NOX),ATAC-seq from dendritic cell (ENCLB384NOX),Homo sapiens dendritic in vitro differentiated cells treated with 0 ng/mL Lipopolysaccharide for 0 hours,Homo sapiens,Homo sapiens,,other,ATAC-seq,SRA,SRA,SRR5210390,SRX2523854,GSM2471245,GSE94173,PAIRED,Illumina HiSeq 2000
Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,7 @@
1+
sample_name,sample_description,treatment_description,organism,library,data_source,SRR,SRX,Sample_geo_accession,Sample_series_id,single_or_paired,Sample_instrument_model,read1,read2
2+
test1,ATAC-seq from dendritic cell (ENCLB065VMV),Homo sapiens dendritic in vitro differentiated cells treated with 0 ng/mL Lipopolysaccharide for 0 hours,human,ATAC-seq,SRA,SRR5210416,SRX2523872,GSM2471255,GSE94182,PAIRED,Illumina HiSeq 2000,TEST_1,TEST_2
3+
gold1,ATAC-seq from dendritic cell (ENCLB065VMV),Homo sapiens dendritic in vitro differentiated cells treated with 0 ng/mL Lipopolysaccharide for 0 hours,human,ATAC-seq,SRA,SRR5210416,SRX2523872,GSM2471255,GSE94182,PAIRED,Illumina HiSeq 2000,SRA_1,SRA_2
4+
gold2,ATAC-seq from dendritic cell (ENCLB811FLK),Homo sapiens dendritic in vitro differentiated cells treated with 0 ng/mL Lipopolysaccharide for 0 hours,human,ATAC-seq,SRA,SRR5210450,SRX2523906,GSM2471300,GSE94222,PAIRED,Illumina HiSeq 2000,SRA_1,SRA_2
5+
gold3,ATAC-seq from dendritic cell (ENCLB887PKE),Homo sapiens dendritic in vitro differentiated cells treated with 0 ng/mL Lipopolysaccharide for 0 hours,human,ATAC-seq,SRA,SRR5210398,SRX2523862,GSM2471249,GSE94177,PAIRED,Illumina NextSeq 500,SRA_1,SRA_2
6+
gold4,ATAC-seq from dendritic cell (ENCLB586KIS),Homo sapiens dendritic in vitro differentiated cells treated with 0 ng/mL Lipopolysaccharide for 0 hours,human,ATAC-seq,SRA,SRR5210428,SRX2523884,GSM2471269,GSE94196,PAIRED,Illumina HiSeq 2000,SRA_1,SRA_2
7+
gold5,ATAC-seq from dendritic cell (ENCLB384NOX),Homo sapiens dendritic in vitro differentiated cells treated with 0 ng/mL Lipopolysaccharide for 0 hours,human,ATAC-seq,SRA,SRR5210390,SRX2523854,GSM2471245,GSE94173,PAIRED,Illumina HiSeq 2000,SRA_1,SRA_2
Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,5 @@
1+
GSE94182
2+
GSE94222
3+
GSE94177
4+
GSE94196
5+
GSE94173

0 commit comments

Comments
 (0)