Skip to content

Atkinson-Lab/Tractor-Mix

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

12 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Tractor-Mix

Tractor-Mix is a GWAS method tailored for admixed populations with related individuals. It shares the core intuition of Tractor by assigning genetic markers based on local ancestry and jointly modeling them in a regression framework. To address sample relatedness, we developed a linear/logistic mixed model incorporating an estimated GRM, allowing us to generate joint and ancestry-specific p-values, as well as effect size estimates.

Tractor-Mix consists of 6 steps, namely:

  • Phasing and local ancestry inference
  • Deconvolute genetic markers by local ancestry
  • Estimate covariates with PC-Air
  • Estimate GRM/kinship with PC-Relate
  • Fit the null model with GMMAT
  • Run Tractor-Mix

Note: The current implementation can handle a cohort with ~15000 individuals

Image Title

 
 

Step1: Phasing and local ancestry inference

Similar to Tractor, Tractor-Mix relies on the effective performance of phasing and local ancestry inference. We have developed a tutorial on using Shapeit2/Shapeit5 and Rfmix2; for more details, please refer to this page.

When selecting a reference panel, it’s important to choose continental or subcontinental populations that closely match the query admixture. For example, in an African-American cohort, using all continental AFR and EUR populations from the 1000 Genomes Project would be advisable as a reference panel. Excessive ancestry calls can diminish statistical power. For instance, a cohort with [AFR, EUR, AMR] = [60%, 38%, 2%] is not ideal for Tractor-Mix due to the low 2% contribution from AMR. In such cases, treating the cohort as a 2-way admixture is more appropriate.

 
 

Step2: Deconvolute genetic markers by local ancestry

To allocate alleles to each local ancestry background, we will use the extract_tracts.py script from Tractor. For detailed instructions, please refer to this page. Briefly, you can execute the following code for this step:

python3 extract_tracts.py \
  --vcf file.phased.vcf.gz \
  --msp file.deconvoluted.msp.tsv \
  --num-ancs 2

For a 2-way admixture, this process will generate six files in total (2 VCFs, 2 hapcount.txt, and 2 dosage.txt files). In our Tractor-Mix model, we will only use the 2 dosage.txt files.

 
 

Step3: Estimate covariates with PC-Air [or simple PC]

In the linear/logistic mixed model, principal components (PCs) or admixture proportions are typically used to account for population stratification. For cohorts with related samples, a common approach is to first partition the samples into independent and related subsets. We construct the PC space using the independent samples and then project the related individuals into this space, ensuring that the principal components are not influenced by family structure. This method is implemented in the R package GENESIS, which can be installed via Bioconductor. A tutorial on running PC-Air can be found here.

Briefly, if you have VCF files, you might perform LD pruning using PLINK and then convert the bim/bam/fam files to GDS format with SNPRelate::snpgdsBED2GDS. In addition to genotype data, GENESIS requires a KING-robust estimate as input. You can compute the KING-robust matrix using SNPRelate, as instructed in the GENESIS vignette, or use PLINK2 with the plink2 --make-king square argument.

Alternatively, you may use PLINK to calculate the PC.

[Note: In our simulations and the benchmark from the GENESIS paper, the difference between raw PCA vs PC-Air is minor.]

 
 

Step4: Estimate GRM/kinship with PC-Relate [or standard GRM]

After completing PC-Air, you can run PC-Relate to obtain the estimated kinship matrix, using several top PCs from PC-Air (refer to the same tutorial). Please note that GENESIS may occasionally scramble the order of samples in the output file, especially if your sample IDs are numerical, so it's important to ensure the order is correct.

To improve computational efficiency, we recommend creating a sparse GRM using the following code:

library(Matrix)
PCRelatemat = pcrelateToMatrix(pcrelate_res, sample.include = iids, scaleKin = 2)[iids,iids]
PCRelatemat_sparse = PCRelatemat
# mask values < 0.05
PCRelatemat_sparse[PCRelatemat_sparse < 0.05] = 0
PCRelatemat_sparse = as(PCRelatemat_sparse, "sparseMatrix") 

Alternatively, you may use PLINK/GCTA/SAIGE to calculate the GRM.

[Note: In our simulations and the benchmark from the GENESIS paper, the difference between raw GRM vs PC-Relate is minor]

 
 

Step5: Fit the null model with GMMAT

GMMAT can be installed using install.packages("GMMAT"). To fit the null model, use the following commands:

For a continuous outcome:

# continuous
Model_Null =   glmmkin(fixed = Pheno ~  Covariates, 
                       data = YourDf, id = "ID", kins = GRM, 
                       family = gaussian())

# dichotomous
Model_Null = glmmkin(fixed = Pheno ~ Covariates, 
                     data = YourDf, id = "ID", kins = GRM, 
                     family = binomial())

In essence, we fit a linear or logistic mixed model that includes covariates as the fixed effect term and the GRM as the random effect term. This null model will later be used for the genome-wide scan.

Please ensure that the order of IDs in your covariate file matches the order in the GRM and dosage files.

 
 

Step6: Run Tractor-Mix

Now you should have the null model and ancestry-specific genotype dosage files. You can run Tractor-Mix with:

source("TractorMix.score.R")

# continuous
TractorMix.score(obj = Model_Null, 
                 infiles = c("Genotype.anc0.dosage.txt", "Genotype.anc1.dosage.txt"),
                 outfiles = "result.tsv", 
                 AC_threshold = 50)
                 
# dichotomous (need to specify a threshold to filter out variants with low ancestry-specific allele counts)
TractorMix.score(obj = Model_Null, 
                 infiles = c("Genotype.anc0.dosage.txt", "Genotype.anc1.dosage.txt"),
                 outfiles = "result.tsv", 
                 AC_threshold = 50)

From TractorMix, you should obtain:

  • P: a joint p-value
  • Eff_anc0, Eff_anc1: ancestry-specific effect size estimates
  • SE_anc0, SE_anc1: ancestry-specific standard error
  • Pval_anc0, Pval_anc1: ancestry-specific p values
  • include_anc0, include_anc1: indicate if ancestry-specific dosages are included (to prevent false positive inflation due to low MAC)

In our UKBB analysis with 9,000 samples and 8.5M variants, it took approximately 3 hours per trait with 22 chromosomes run in parallel on a typical high performance computing cluster setup (using sparse GRM, 32GB RAM, 8 CPUs per chromosome).

 
 

Optional: Wald test of Tractor-Mix (TBD)

Wald test should only be applied to a smaller set of markers (e.g. less than 100 variants).

About

The source code for Tractor-Mix

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published