
An Applied Genomics project developed during the Master’s Degree in Bioinformatics at the University of Bologna, focused on designing and evaluating a simulated workflow for the de novo genome assembly, functional annotation, and population genomics of the Malayan horned frog (Pelobatrachus nasutus), with an assigned budget of €100,000.


Markus2409/Applied_Genomics_Project


Applied Genomics Project – De novo genome assembly and functional genomics of Pelobatrachus nasutus as a model for phenotypic plasticity and antimicrobial defenses




Introduction & State of the art

This repository documents the Applied Genomics Simulation Project developed during the Applied Genomics course of the Master’s Degree in Bioinformatics (University of Bologna, 2025).

The project was designed under an assigned budget of €100,000, and represents a simulation exercise rather than a real sequencing effort.

The main goal was to design a working pipeline for the de novo assembly of the genome of the Malayan horned frog (Pelobatrachus nasutus), which as of 26th August 2025 still lacks a published nuclear reference genome.

In addition to genome assembly, the project focused on the annotation of genes relevant to biomedical applications (e.g., antimicrobial peptides, skin-related genes) and the investigation of phenotypic plasticity (cutaneous tubercles, mucous glands, pigmentation differences) across contrasting habitats.

This repository also provides a concise overview of the designed pipeline, the bioinformatics software used, and the budget estimates. It additionally stores supporting material such as the figures used in the project, the presentation slides, and the final report.

The repository is organized into the following folders:

  • Report and Powerpoint/
    Contains the final written report document (PDF) and the presentation slides (PowerPoint) prepared for the Applied Genomics course.

  • Images and Tables/
    Contains all the figures, diagrams, and tables included in the report and presentation. These are provided separately to facilitate reuse in other documents or presentations.


Software & Toolkits

| Tool / Software | Purpose | Reference |
|---|---|---|
| hifiasm | Fast haplotype-resolved de novo assembler optimized for PacBio HiFi reads, capable of producing phased assemblies. | Cheng 2021 |
| 3D-DNA | Automated scaffolding tool using Hi-C data to assemble contigs into chromosome-length scaffolds. | Dudchenko 2017 |
| Juicebox | Interactive visualization and manual curation tool for genome assemblies using Hi-C contact maps. | Robinson 2018 |
| RepeatModeler2 | De novo detection of transposable elements and repeats in genomic sequences. | Flynn 2020 |
| RepeatMasker | Uses a repeat library (e.g. from RepeatModeler2) to mask repetitive elements across the genome. | Smit 2015 |
| BRAKER2 (uses STAR + AUGUSTUS) | Automated gene annotation pipeline combining RNA-seq alignments (STAR) with ab initio predictions (AUGUSTUS) to refine exon–intron boundaries. | Gabriel 2024; Stanke 2004; Dobin 2013 |
| MAKER | Annotation pipeline integrating ab initio predictions and homology evidence into final gene models. | Cantarel 2008 |
| BWA-MEM2 | High-performance sequence aligner for mapping Illumina reads to the reference genome. | Md 2019 |
| GATK | Toolkit for variant discovery and genotyping, using the gVCF workflow for joint calling. | Van der Auwera 2013 |
| ANGSD | Population genomics tool for calculating allele frequencies and summary statistics from NGS data. | Korneliussen 2014 |
| PCAngsd | Performs PCA based on genotype likelihoods for low-coverage sequencing datasets. | Meisner 2018 |
| BayPass | Detects SNPs associated with environmental variables while accounting for population structure. | Gautier 2025 |

Methodological Pipeline

Workflow of the experimental and analytical pipeline

Sample Collection & Ethics

  • 1 high-quality individual sample from blood for reference genome assembly (HMW DNA).
  • 1 skin sample from the same individual for RNA-seq and Hi-C scaffolding.
  • Skin biopsy samples from 10 individuals (5 from protected forest, 5 from fragmented habitat) for resequencing (15× each).

All activities were conducted in full compliance with the Nagoya Protocol, with prior informed consent (PIC), mutually agreed terms (MAT), and collection permits obtained from the Malaysian Government, Sarawak Forestry Department, and private landowners (Wilmar Oil Palm Plantation).

DNA & RNA Extraction

  • Genomic DNA: Qiagen Genomic-tip, QC with Qubit fluorometer and NanoDrop (A260/280 ≈ 1.8–2.0, A260/230 > 2.0), PFGE (>50 kb).
  • Total RNA (skin): Qiagen RNeasy, stored in RNAlater, RIN ≥ 7.
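The acceptance thresholds listed above can be written down as a small check. This is a minimal sketch, not part of the project's pipeline: the class and function names are hypothetical, and treating the approximate A260/280 window (≈ 1.8–2.0) as hard bounds is an assumption made only for illustration.

```python
from dataclasses import dataclass


@dataclass
class DnaQc:
    """QC readings for one genomic DNA extraction (hypothetical container)."""
    a260_280: float     # NanoDrop purity ratio
    a260_230: float     # NanoDrop purity ratio
    fragment_kb: float  # fragment size estimated by PFGE, in kb


def dna_passes_qc(sample: DnaQc) -> bool:
    """Apply the gDNA thresholds from the list above.

    Assumption: the '≈ 1.8–2.0' window is treated as inclusive hard bounds.
    """
    return (
        1.8 <= sample.a260_280 <= 2.0
        and sample.a260_230 > 2.0
        and sample.fragment_kb > 50
    )


def rna_passes_qc(rin: float) -> bool:
    """Total RNA is accepted when RIN >= 7."""
    return rin >= 7


if __name__ == "__main__":
    print(dna_passes_qc(DnaQc(a260_280=1.85, a260_230=2.1, fragment_kb=60)))
    print(rna_passes_qc(6.5))
```

In practice a borderline purity ratio would probably be judged together with the Qubit concentration rather than rejected automatically.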

Genome Sequencing

  • PacBio HiFi (30–40× coverage, 15–20 kb reads): long, high-accuracy reads.
  • Hi-C (Illumina NovaSeq PE150): chromatin conformation for scaffolding, recovered from the skin tissue sample of the reference individual.
  • RNA-seq (skin of the reference individual, 2–3 replicates, NovaSeq PE150, ~50M reads each).
  • Illumina sequencing for downstream population analysis (10 individuals, 15× each, NovaSeq PE150).
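The coverage targets above translate into sequencing yield once a genome size is fixed. The sketch below shows the arithmetic only. The genome size is a placeholder assumption, not a measured value for P. nasutus (which has no published nuclear reference), and the function names are hypothetical.

```python
import math

# ASSUMPTION: placeholder genome size used only to illustrate the arithmetic.
ASSUMED_GENOME_SIZE_BP = 3_500_000_000


def required_yield_gb(genome_size_bp: int, coverage: float) -> float:
    """Total bases needed (in Gb) to reach a target mean coverage."""
    return genome_size_bp * coverage / 1e9


def read_pairs_needed(genome_size_bp: int, coverage: float, read_length_bp: int) -> int:
    """Number of paired-end fragments (2 x read_length_bp) for a target coverage."""
    bases_needed = genome_size_bp * coverage
    return math.ceil(bases_needed / (2 * read_length_bp))


if __name__ == "__main__":
    # PacBio HiFi: 30-40x on the reference individual.
    low = required_yield_gb(ASSUMED_GENOME_SIZE_BP, 30)
    high = required_yield_gb(ASSUMED_GENOME_SIZE_BP, 40)
    print(f"HiFi yield: {low:.0f}-{high:.0f} Gb")

    # Illumina PE150 resequencing: 15x per individual, 10 individuals.
    pairs = read_pairs_needed(ASSUMED_GENOME_SIZE_BP, 15, 150)
    print(f"Read pairs per individual: {pairs:,}")
    print(f"Total resequencing yield: {10 * required_yield_gb(ASSUMED_GENOME_SIZE_BP, 15):.0f} Gb")
```

Because yield scales linearly with genome size, a different size estimate rescales every number in the same proportion.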

Assembly & QC

Functional Annotation

Population Genomics

Data submission

  • All raw sequencing reads, the assembled and annotated genome, and the VCF files are submitted to the European Nucleotide Archive (ENA) to ensure compliance with the FAIR principles and to enable reproducibility by the scientific community.

Budget & Estimated Costs

The estimated costs reported in this project were obtained, whenever possible, by directly contacting the information services of Illumina, PacBio, Qiagen and ThermoFisher for updated sequencing, extraction and conservation kit prices.
For consumables and contingency expenses, a general estimate was made based on standard laboratory practices (plasticware, barcoding materials, cryogenic storage, shipping logistics), with an additional margin included to cover unexpected costs.

| Item | Estimated Cost (€) | Notes |
|---|---|---|
| Fieldwork & permits | 4,800 | Sampling in Gunung Gading NP & Wilmar plantation; ABS/Nagoya compliance |
| DNA extraction (HMW + resequencing) | 3,250 | Qiagen Genomic-tip |
| RNA extraction (skin only) | 1,420 | RNAlater + Qiagen RNeasy kits |
| PacBio HiFi sequencing | 34,780 | 30–40× coverage, 1 HQ individual |
| Hi-C sequencing | 8,230 | Hi-C libraries from fresh tissue, sequenced on Illumina NovaSeq PE150 |
| Illumina resequencing (10 ind., 15×) | 14,650 | Illumina NovaSeq PE150 |
| RNA-seq (skin, 2–3 replicates) | 6,180 | Illumina NovaSeq PE150 |
| Computational costs & Cloud storage | 11,740 | Assembly, annotation, population genomics analyses |
| Consumables & contingency | 8,560 | Tubes, barcoding, dry shipper |
| TOTAL | 93,610 | Within the €100,000 budget |
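The total can be re-derived from the line items. The short sketch below sums the figures from the budget table and reports the margin left under the €100,000 cap. The variable names are illustrative and not taken from the project files.

```python
BUDGET_EUR = 100_000

# Line items copied from the budget table (euros).
ITEMS_EUR = {
    "Fieldwork & permits": 4_800,
    "DNA extraction (HMW + resequencing)": 3_250,
    "RNA extraction (skin only)": 1_420,
    "PacBio HiFi sequencing": 34_780,
    "Hi-C sequencing": 8_230,
    "Illumina resequencing (10 ind., 15x)": 14_650,
    "RNA-seq (skin, 2-3 replicates)": 6_180,
    "Computational costs & Cloud storage": 11_740,
    "Consumables & contingency": 8_560,
}

total_eur = sum(ITEMS_EUR.values())
remaining_eur = BUDGET_EUR - total_eur
largest_item = max(ITEMS_EUR, key=ITEMS_EUR.get)

if __name__ == "__main__":
    print(f"Total: {total_eur:,} EUR")               # Total: 93,610 EUR
    print(f"Unallocated margin: {remaining_eur:,} EUR")  # Unallocated margin: 6,390 EUR
    print(f"Largest item: {largest_item}")
    for name, cost in ITEMS_EUR.items():
        print(f"{name}: {cost / total_eur:.1%} of planned spend")
```

The €6,390 left unallocated sits on top of the contingency already included in the consumables line.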

References

  • Cheng, H., Concepcion, G. T., Feng, X., Zhang, H., & Li, H. (2021). Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nature Methods, 18(2). https://doi.org/10.1038/s41592-020-01056-5
  • Dudchenko, O., Batra, S. S., Omer, A. D., Nyquist, S. K., Hoeger, M., Durand, N. C., Shamim, M. S., Machol, I., Lander, E. S., Aiden, A. P., & Lieberman Aiden, E. (2017). De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds. Science, 356(6333). https://doi.org/10.1126/science.aal3327
  • Robinson, J. T., Turner, D., Durand, N. C., Thorvaldsdóttir, H., Mesirov, J. P., & Lieberman Aiden, E. (2018). Juicebox.js provides a cloud-based visualization system for Hi-C data. Cell Systems, 6(2). https://doi.org/10.1016/j.cels.2018.01.001
  • Flynn, J. M., Hubley, R., Goubert, C., Rosen, J., Clark, A. G., Feschotte, C., & Smit, A. F. (2020). RepeatModeler2 for automated genomic discovery of transposable element families. PNAS, 117(17). https://doi.org/10.1073/pnas.1921046117
  • Smit, A. F. A., Hubley, R., & Green, P. (2015). RepeatMasker Open-4.0. http://www.repeatmasker.org
  • Stanke, M., Steinkamp, R., Waack, S., & Morgenstern, B. (2004). AUGUSTUS: A web server for gene finding in eukaryotes. Nucleic Acids Research, 32. https://doi.org/10.1093/nar/gkh379
  • Gabriel, L., Brůna, T., Hoff, K. J., Ebel, M., Lomsadze, A., Borodovsky, M., & Stanke, M. (2024). BRAKER3: Fully automated genome annotation using RNA-seq and protein evidence with GeneMark-ETP, AUGUSTUS, and TSEBRA. Genome Research, 34(5), 769–777. https://doi.org/10.1101/gr.278090.123
  • Dobin, A., Davis, C. A., Schlesinger, F., Drenkow, J., Zaleski, C., Jha, S., Batut, P., Chaisson, M., & Gingeras, T. R. (2013). STAR: Ultrafast universal RNA-seq aligner. Bioinformatics, 29(1). https://doi.org/10.1093/bioinformatics/bts635
  • Cantarel, B. L., Korf, I., Robb, S. M. C., Parra, G., Ross, E., Moore, B., Holt, C., Sánchez Alvarado, A., & Yandell, M. (2008). MAKER: An easy-to-use annotation pipeline designed for emerging model organism genomes. Genome Research, 18(1). https://doi.org/10.1101/gr.6743907
  • Md, V., Misra, S., Li, H., & Aluru, S. (2019). Efficient architecture-aware acceleration of BWA-MEM for multicore systems. In IPDPS 2019. https://doi.org/10.1109/IPDPS.2019.00041
  • Van der Auwera, G. A., Carneiro, M. O., Hartl, C., Poplin, R., Del Angel, G., Levy-Moonshine, A., Jordan, T., Shakir, K., Roazen, D., Thibault, J., Banks, E., Garimella, K. V., Altshuler, D., Gabriel, S., & DePristo, M. A. (2013). From FastQ data to high-confidence variant calls: The Genome Analysis Toolkit best practices pipeline. Current Protocols in Bioinformatics, 43, 11.10.1–11.10.33.
  • Korneliussen, T. S., Albrechtsen, A., & Nielsen, R. (2014). ANGSD: Analysis of Next Generation Sequencing Data. BMC Bioinformatics, 15. https://doi.org/10.1186/s12859-014-0356-4
  • Meisner, J., & Albrechtsen, A. (2018). Inferring population structure and admixture proportions in low-depth NGS data. Genetics, 210(2). https://doi.org/10.1534/genetics.118.301336
  • Gautier, M. (2025). BayPass software for population genomics. http://www1.montpellier.inra.fr/CBGP/software/baypass/ (Accessed: 25 Aug 2025)
  • Tegenfeldt, F., Kuznetsov, D., Manni, M., Berkeley, M., Zdobnov, E. M., & Kriventseva, E. V. (2025). OrthoDB and BUSCO update: annotation of orthologs with wider sampling of genomes. Nucleic Acids Research, 53(D1), D516–D522. https://doi.org/10.1093/nar/gkae987
  • Rhie, A., Walenz, B. P., Koren, S., & Phillippy, A. M. (2020). Merqury: Reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biology, 21, 245. https://doi.org/10.1186/s13059-020-02134-9
  • Ou, S., Chen, J., & Jiang, N. (2018). Assessing genome assembly quality using the LTR Assembly Index (LAI). Nucleic Acids Research, 46(21), e126. https://doi.org/10.1093/nar/gky730
  • Wright, S. (1978). Evolution and the Genetics of Populations, Vol. 4: Variability within and among natural populations. University of Chicago Press.
  • Wang, G., Li, X., & Wang, Z. (2016). APD3: The antimicrobial peptide database as a tool for research and education. Nucleic Acids Research, 44(D1), D1087–D1093. https://doi.org/10.1093/nar/gkv1278
  • Li, Y., Ren, Y., Zhang, D., Jiang, H., Wang, Z., Li, X., & Rao, D. (2019). Chromosome-level assembly of the mustache toad genome using third-generation DNA sequencing and Hi-C analysis. GigaScience, 8(9). https://doi.org/10.1093/gigascience/giz114

Contacts

For any questions, suggestions, or contributions, feel free to open an issue or contact the maintainer:

Marco Cuscunà
