Step 3a: Tractor GWAS (Hail)

Taotao Tan edited this page Nov 8, 2022 · 4 revisions

This page describes the implementation of the Tractor local-ancestry-aware GWAS model.
Updates: for a more complete tutorial, please go to this page

Joint-analysis association model

The joint local ancestry aware association model we describe in our manuscript for quantitative traits is:

Y ~ b0 + b1 * X1 + b2 * X2 + b3 * X3 + ... + bk * Xk

where X1 is the number of haplotypes of the index ancestry present at that locus for each individual, X2 is the number of copies of the risk allele coming from the first ancestry, X3 is copies coming from the second ancestry, and X4 to Xk are other covariates such as PCs or another metric of overall admixture fraction. This model produces ancestry-specific effect size estimates and p values, which can be extremely useful in post-GWAS efforts. A thorough description of Tractor and this GWAS model can be found in our paper.
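Concretely, for a two-way admixed individual, the regressors X1, X2, and X3 can be tallied from phased genotypes and local ancestry calls at a locus. A minimal plain-Python illustration with hypothetical data (ancestry 0 is taken as the index ancestry):

```python
# Tally the Tractor regressors for one individual at one locus.
# ancestry: per-haplotype local ancestry call (0 = index ancestry, 1 = other)
# alleles:  per-haplotype allele (1 = risk allele, 0 = other allele)
def tractor_regressors(ancestry, alleles):
    x1 = sum(1 for a in ancestry if a == 0)                   # haplotypes of index ancestry
    x2 = sum(g for a, g in zip(ancestry, alleles) if a == 0)  # risk alleles on ancestry-0 haplotypes
    x3 = sum(g for a, g in zip(ancestry, alleles) if a == 1)  # risk alleles on ancestry-1 haplotypes
    return x1, x2, x3

# One haplotype from each ancestry; the risk allele sits on the ancestry-1 copy.
print(tractor_regressors(ancestry=[0, 1], alleles=[0, 1]))  # -> (1, 0, 1)
```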

The recommended implementation of the Tractor joint-analysis model uses Hail (https://hail.is/), a free, scalable software framework that can be run locally or on the cloud. It can be useful to build and test the pipeline in a Jupyter notebook when applying it to your own cohort data. For this reason, we supply an example Jupyter notebook, Tractor-Example-GWAS.ipynb, written in Python with Hail, that can be adapted for user data.

The specific commands to load in and run linear regression on ancestry-deconvolved files are as follows:

  1. Reading in and formatting the Tractor dosage files in Hail. This is a bit more involved than a standard VCF or matrix table load because we need to run GWAS on entries rather than rows. Run the following for each ancestry dosage and haplotype-count file (here shown for anc0; repeat for anc1, etc.). Ensure that the header row is not commented out.
    row_fields={'CHROM': hl.tstr, 'POS': hl.tint, 'ID': hl.tstr, 'REF': hl.tstr, 'ALT': hl.tstr}
    anc0dos = hl.import_matrix_table('gs://path-to-file.txt.gz', force_bgz=True, row_fields=row_fields, row_key=[], min_partitions=32)
    anc0dos = anc0dos.key_rows_by().drop('row_id')
    anc0dos = anc0dos.key_rows_by(locus=hl.locus(anc0dos.CHROM, anc0dos.POS))
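The dosage files this import expects are tab-delimited text with the five row fields named above, followed by one numeric column per sample. A plain-Python sketch of that layout (hypothetical sample IDs and values), handy for sanity-checking a file before the Hail import:

```python
import csv
import io

# Hypothetical two-sample dosage file in the layout import_matrix_table expects:
# the five row fields, then one dosage column per sample. Note the header is
# not commented out (CHROM, not #CHROM).
dosage_text = (
    "CHROM\tPOS\tID\tREF\tALT\tSAMPLE1\tSAMPLE2\n"
    "chr1\t12345\trs1\tA\tG\t0\t2\n"
    "chr1\t23456\trs2\tC\tT\t1\t0\n"
)

reader = csv.DictReader(io.StringIO(dosage_text), delimiter="\t")
rows = list(reader)
print(rows[0]["POS"], rows[0]["SAMPLE2"])  # -> 12345 2
```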

  2. Join the ancestry and haplotype dosage files together into a Hail matrix table.
    2 way: mt = anc0dos.annotate_entries(anc1dos = anc1dos[anc0dos.locus, anc0dos.col_id], hapcounts0 = hapcounts0[anc0dos.locus, anc0dos.col_id])
    3 way: mt = anc0dos.annotate_entries(anc1dos = anc1dos[anc0dos.locus, anc0dos.col_id], anc2dos = anc2dos[anc0dos.locus, anc0dos.col_id], hapcounts0 = hapcounts0[anc0dos.locus, anc0dos.col_id], hapcounts1 = hapcounts1[anc0dos.locus, anc0dos.col_id])
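Conceptually, each annotate_entries lookup above is a keyed fetch: for every (locus, sample) entry in the anc0 table, Hail pulls the matching entry out of the other tables. A rough plain-Python analog of the 2-way join, using hypothetical toy data keyed by (locus, sample):

```python
# Entries keyed by (locus, sample_id); values are dosages / haplotype counts.
anc0dos    = {("chr1:12345", "S1"): 0, ("chr1:12345", "S2"): 2}
anc1dos    = {("chr1:12345", "S1"): 2, ("chr1:12345", "S2"): 0}
hapcounts0 = {("chr1:12345", "S1"): 0, ("chr1:12345", "S2"): 2}

# Analog of: mt = anc0dos.annotate_entries(anc1dos=..., hapcounts0=...)
mt = {
    key: {"anc0dos": d, "anc1dos": anc1dos[key], "hapcounts0": hapcounts0[key]}
    for key, d in anc0dos.items()
}
print(mt[("chr1:12345", "S1")])  # -> {'anc0dos': 0, 'anc1dos': 2, 'hapcounts0': 0}
```

This is why all input files must line up on the same loci and sample IDs before the join.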

  3. Load in and annotate with phenotype/covariate information.
    From here on we now utilize standard Hail commands. Key by the ID fieldname that matches that of the genotype files.
    phenos = hl.import_table('gs://path/to/pheno_covariate_file.txt').key_by('IID')
    mt = mt.annotate_cols(phenos = phenos[mt.col_id])

  4. Run the linear regression.
    For a single phenotype, for example TC, along with relevant covariates; here the covariates are sex, age, blood dilution level, and an estimate of global African ancestry fraction.
    mt = mt.annotate_rows(TC = hl.agg.linreg(hl.float(mt.phenos.TC),[1.0, mt.hapcounts0.x, mt.x, mt.anc1dos.x, mt.phenos.isMale, mt.phenos.age, mt.phenos.dilutionFactor, hl.float(mt.phenos.admixFracAFR)]))

Doing this, we are annotating the matrix table with the results of the GWAS. The results are saved as a struct for each phenotype (here a new row annotation named 'TC') containing arrays of effect sizes, standard errors, t statistics, and p values (the beta, standard_error, t_stat, and p_value fields). Each array is indexable in the order of input: the intercept is [0], the haplotype dosage is [1], the anc0 dosage is [2], the anc1 dosage is [3], and covariates begin at [4].
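For example, given one row's result struct (mocked here as a plain dict with hypothetical numbers), the ancestry-specific effects sit at fixed positions in each array, following the covariate list passed to hl.agg.linreg:

```python
# Hypothetical linreg output for one variant; array order follows the
# covariate list passed to hl.agg.linreg: intercept, haplotype count,
# anc0 dosage, anc1 dosage, then the remaining covariates.
result = {
    "beta":    [0.10, -0.02, 0.35, 0.12, 0.01, 0.00, 0.03, -0.20],
    "p_value": [0.90,  0.70, 1e-8, 0.04, 0.50, 0.80, 0.60,  0.30],
}

ANC0_DOSE, ANC1_DOSE = 2, 3  # intercept = 0, haplotype count = 1, covariates start at 4
print("anc0 beta:", result["beta"][ANC0_DOSE], "p:", result["p_value"][ANC0_DOSE])
print("anc1 beta:", result["beta"][ANC1_DOSE], "p:", result["p_value"][ANC1_DOSE])
```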

To run multiple phenotypes in batch, one can make a list of phenotypes and then cycle through them annotating the mt with structs named after each:
phenonames = ['TC', 'TG', 'HDLC', 'LDLC']
mt = mt.annotate_rows(results=hl.struct(**{pheno: hl.agg.linreg(hl.float(mt.phenos[pheno]), [1.0, mt.hapcounts0.x, mt.x, mt.anc1dos.x, mt.phenos.isMale, mt.phenos.age, mt.phenos.dilutionFactor, hl.float(mt.phenos.admixFracAFR)]) for pheno in phenonames}))

We are also developing an alternative pipeline to run the joint-analysis GWAS model outside of Hail. Please see the Tractor_haul GitHub page for additional documentation and code. This strategy is particularly useful for binary traits, as Hail does not yet have an implementation of logistic regression on entries.


GWAS on deconvolved ancestral segments

Caution: we have noticed that this approach typically introduces some downward bias in effect size estimates due to omitted-variable bias.

Some users may instead prefer to run their GWAS on the deconvolved pieces of admixed cohorts rather than use our joint-analysis model. As described in the manuscript, this analysis style can also be useful in large mixed collections where most individuals come from a homogeneous group and a smaller subset is admixed, such that only one ancestry component needs to be included alongside the larger group.

The previous Tractor step, Extracting tracts and ancestral dosages, produces, in addition to the ancestry dosages, multiple VCF files each containing only the genotype information from one ancestral background. Other ancestries are blanked out as missing data to remove concerns about population structure. These files can then be loaded into plink to run standard logistic (or linear) regression as follows:

plink --vcf cohort.anc0.vcf --vcf-half-call haploid --out cohort.anc0 --make-bed --biallelic-only strict
plink --bfile cohort.anc0 --pheno pheno_file.txt --logistic --pheno-name <pheno col name> --allow-no-sex --covar covariateFile.txt --out cohort.pheno.anc0 --ci 0.95

Both ancestries will need to be run separately following these steps to produce ancestry-specific summary statistics.

Documentation for additional plink flags and file structure can be found on the plink websites at https://zzz.bwh.harvard.edu/plink/ and https://www.cog-genomics.org/plink/.