Pedigree Simulator is an improved version of a tool originally developed by Staples et al. (2014) to generate simulated pedigrees for benchmarking PRIMUS in its original publication. This tool also uses code from the IBDsims program developed by Morrison (2013). We developed this version to introduce compute optimizations for use in a variety of contexts; specifically, for benchmarking COMPADRE, a tool that unifies PRIMUS, ERSA, and PADRE.
First, click the green Code button at the top of this page and select a cloning option.
Dependency and reference data installation takes place using Docker, which must first be installed and launched on your machine.
Navigate into the project directory cloned from GitHub:
cd pedigree-simulator
Build the Docker image:
docker build -t pedigreesim .
Note: The build process will take 10 to 20 minutes due to the size of the reference data being downloaded (approx. 40 gigabytes after being unzipped).
After building the Docker image, enter the container using docker run. During this step, you should also set your local volume mount (for writing the output), specified by the -v flag.
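Note: Make sure to provide the absolute path to your local output folder in this flag. A bare folder name will not work, because Docker treats a name without a path separator as a named volume rather than a local folder. Even if you run the docker run command from inside the top level of the repository folder, you must use at least ./output instead of just output.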
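One way to avoid typing the absolute path by hand is to build it with shell command substitution. The following is a minimal sketch, assuming you run it from the top level of the cloned repository; the variable names out_dir and mount_spec are illustrative and not part of the tool. It only prints the resulting command so you can inspect it before running it.

```shell
# Build an absolute host path for the output folder from the current directory.
# Assumption: the current directory is the top level of the cloned repository.
out_dir="$(pwd)/output"

# Host path and container path joined in the form expected by the -v flag.
mount_spec="$out_dir:/usr/src/output"

# Print the command instead of running it, so the path can be checked first.
echo "docker run -v $mount_spec -it --entrypoint /bin/bash pedigreesim:latest"
```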
docker run -v \
/your/path/to/pedigree-simulator/output:/usr/src/output \
-it --entrypoint /bin/bash pedigreesim:latest
Once inside the container, you can run the tool from the command line:
perl main.pl 100 uniform3 20 EUR parallel
The main.pl script takes several positional arguments. The following descriptions use the above command as an example.
100: The simulation "number", or the unique identifier for the output folder/files.
uniform3: The simulation "type." Currently, the script supports uniform3, uniform2, and halfsib3. The key distinction here is that halfsib3 offers half-sibling relationships in the pedigree. The trailing number represents the average number of offspring per node in the pedigree.
20: The number of individuals in the pedigree.
EUR: The 1000 Genomes superpopulation from which founder genotypes are drawn. Currently, the script supports EUR (European) and AMR (Admixed American) superpopulation seeding.
parallel: Enables parallel processing of the genotype adding step with 22 threads (one per chromosome). This is much faster but very RAM intensive and not recommended outside of HPC/server environments. The genotype adding step will run using a single thread if this argument is not used.
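Because the arguments are positional, a typo in one of them is easy to miss. The following is a rough sketch of a wrapper that checks the documented values before printing the command. The function name build_sim_cmd is hypothetical and not part of the tool, and the checks reflect only the values listed above; main.pl itself may accept or reject other inputs.

```shell
# Hypothetical helper: validate the documented argument values and print
# the corresponding main.pl command line. It does not run the simulation.
build_sim_cmd() {
  sim_id=$1
  sim_type=$2
  size=$3
  pop=$4
  mode=${5:-}

  # Simulation types listed in this README.
  case "$sim_type" in
    uniform3|uniform2|halfsib3) ;;
    *) echo "unsupported simulation type: $sim_type" >&2; return 1 ;;
  esac

  # Pedigree size must be a non-empty string of digits.
  case "$size" in
    ''|*[!0-9]*) echo "pedigree size must be a whole number: $size" >&2; return 1 ;;
  esac

  # Superpopulations listed in this README.
  case "$pop" in
    EUR|AMR) ;;
    *) echo "unsupported superpopulation: $pop" >&2; return 1 ;;
  esac

  # The fifth argument is optional; only "parallel" is documented.
  case "$mode" in
    ''|parallel) ;;
    *) echo "unknown mode: $mode" >&2; return 1 ;;
  esac

  echo "perl main.pl $sim_id $sim_type $size $pop${mode:+ $mode}"
}

build_sim_cmd 100 uniform3 20 EUR parallel
```

Omitting the fifth argument, as in build_sim_cmd 101 halfsib3 20 AMR, prints the single-threaded form of the command.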
This tool generates a full pedigree as well as incrementally missing versions (up to 20% of all pedigree nodes). For example, a size 20 pedigree output will contain versions with up to 4 nodes missing. This behavior is left over from our development for the COMPADRE benchmarking, where we evaluated pedigree reconstruction success as pedigrees became more sparse. If you want to change the maximum percentage of samples removed in this incremental process, update the global $missing_denominator variable on line 45 of src/main.pl before building the Docker image. The default value of 5 divides the total pedigree size by 5, so 1/5 (20%) of all nodes are removed by the last incrementally missing version of the pedigree. If you want more missingness than 20%, decrease the value (4 gives 25%, 2 gives 50%); if you want less, increase it.
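The relationship between the denominator and the maximum number of removed nodes can be sketched as simple arithmetic. The function max_missing below is illustrative only. It assumes integer division, which matches the size 20 example (20 / 5 = 4); how main.pl rounds when the pedigree size is not evenly divisible is not stated here.

```shell
# Illustrative only: maximum number of removed nodes for a given pedigree
# size and denominator, assuming integer division.
max_missing() {
  echo $(( $1 / $2 ))
}

# Compare a few denominators for a size 20 pedigree.
for d in 2 4 5 10; do
  echo "denominator=$d: up to $(max_missing 20 "$d") of 20 nodes removed"
done
```

A smaller denominator removes more nodes: 2 removes up to 10 of 20, while 10 removes up to 2 of 20.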
Tools like COMPADRE utilize shared IBD segments alongside typical genotype data. We generated IBD segments in our benchmarking of COMPADRE by first phasing the simulated output VCF files with SHAPEIT5 (using unique BioVU haplotypes as a reference panel), then performing segment detection with GERMLINE2. We provide a basic script in the tools/ folder to highlight the command structure we used in our own benchmarking. Note that this script expects a .env file with several executable paths, including those for SHAPEIT, GERMLINE, PLINK2, and Python 3, as well as the path to a genetic map file folder.
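Before running a script that depends on a .env file, it can help to confirm that every expected variable is set. The following is a generic sketch: the function check_env_vars is hypothetical, and it assumes the .env file uses shell-style NAME=value lines that can be sourced. The variable names in the commented usage are placeholders; check the script in tools/ for the names it actually reads.

```shell
# Hypothetical helper: report which of the named variables are unset or empty.
# Returns 0 if all are set, 1 otherwise.
check_env_vars() {
  missing=0
  for name in "$@"; do
    # Look up the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage sketch with placeholder names (not the script's real variable names):
# . ./.env
# check_env_vars SHAPEIT_PATH GERMLINE_PATH PLINK2_PATH PYTHON3_PATH GENETIC_MAP_DIR
```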
Another important note about IBD segment simulation: if you are using tools like GERMLINE2 that expect phased input data, you might want to use a fileset other than the provided 1000 Genomes set as a haplotype reference to avoid reference overlap. For example, in our benchmarking of COMPADRE, we used phased, population-matched BioVU data as the haplotype reference.
An alternative option is to perform phase-free segment detection with tools like IBIS. This is a useful option if you are trying to evaluate tools like Bonsai that expect unphased data.
Please email contact AT compadre DOT dev with the subject line "Pedigree Simulator Help" or submit an issue report/pull request on GitHub.
If you use this tool in your research, please cite the following:
Evans, G. F., Baker, J. T., Petty, L. E., ... & Below, J. E. (2025). COMPADRE:
Combined pedigree-aware distant relatedness estimation for improved pedigree reconstruction.
The American Journal of Human Genetics. DOI: 10.1016/j.ajhg.2025.09.011
Pedigree Simulator was developed by the Below Lab in the Division of Genetic Medicine at Vanderbilt University Medical Center, Nashville, TN, USA.
Pedigree Simulator is distributed under the Apache 2.0 license: https://compadre.dev/licenses/sim_license.txt