Applied Genomics Project – De novo genome assembly and functional genomics of Pelobatrachus nasutus as a model for phenotypic plasticity and antimicrobial defenses

Introduction & State of the art

This repository documents the Applied Genomics Simulation Project developed during the Applied Genomics course of the Master’s Degree in Bioinformatics (University of Bologna, 2025).

The project was designed under an assigned budget of €100,000, and represents a simulation exercise rather than a real sequencing effort.

The main goal was to design a working pipeline for the de novo assembly of the genome of the Malayan horned frog (Pelobatrachus nasutus), which, as of 26 August 2025, still lacks a published nuclear reference genome.

In addition to genome assembly, the project focused on the annotation of genes relevant to biomedical applications (e.g., antimicrobial peptides, skin-related genes) and the investigation of phenotypic plasticity (cutaneous tubercles, mucous glands, pigmentation differences) across contrasting habitats.

This repository also provides a concise overview of the designed pipeline, the bioinformatics software used, and the budget estimates. It also stores supporting material: the figures used in the project, the presentation slides, and the final report.

The repository as intended is organized into the following folders:

Report and Powerpoint/
Contains the final written report document (PDF) and the presentation slides (PowerPoint) prepared for the Applied Genomics course.
Images and Tables/
Contains all the figures, diagrams, and tables included in the report and presentation. These are provided separately to facilitate reuse in other documents or presentations.

Software & Toolkits

Tool / Software	Purpose	Reference
hifiasm	Fast haplotype-resolved de novo assembler optimized for PacBio HiFi reads, capable of producing phased assemblies.	Cheng 2021
3D-DNA	Automated scaffolding tool using Hi-C data to assemble contigs into chromosome-length scaffolds.	Dudchenko 2017
Juicebox	Interactive visualization and manual curation tool for genome assemblies using Hi-C contact maps.	Robinson 2018
RepeatModeler2	De novo detection of transposable elements and repeats in genomic sequences.	Flynn 2020
RepeatMasker	Uses a repeat library (e.g. from RepeatModeler2) to mask repetitive elements across the genome.	Smit 2015
BRAKER2 (uses STAR + AUGUSTUS)	Automated gene annotation pipeline combining RNA-seq alignments (STAR) with ab initio predictions (AUGUSTUS) to refine exon–intron boundaries.	Gabriel 2024; Stanke 2004; Dobin 2013
MAKER	Annotation pipeline integrating ab initio predictions and homology evidence into final gene models.	Cantarel 2008
BWA-MEM2	High-performance sequence aligner for mapping Illumina reads to the reference genome.	Md 2019
GATK	Toolkit for variant discovery and genotyping, using the gVCF workflow for joint calling.	Van der Auwera 2013
ANGSD	Population genomics tool for calculating allele frequencies and summary statistics from NGS data.	Korneliussen 2014
PCAngsd	Performs PCA based on genotype likelihoods for low-coverage sequencing datasets.	Meisner 2018
BayPass	Detects SNPs associated with environmental variables or population structure.	Gautier 2025

Methodological Pipeline

Sample Collection & Ethics

One blood sample from a single high-quality individual for reference genome assembly (HMW DNA).
One skin sample from the same reference individual for RNA-seq and Hi-C scaffolding.
Skin biopsy samples from 10 individuals (5 from protected forest, 5 from fragmented habitat) for resequencing (15× each).

All activities were conducted in full compliance with the Nagoya Protocol, with prior informed consent (PIC), mutually agreed terms (MAT), and collection permits obtained from the Malaysian Government, Sarawak Forestry Department, and private landowners (Wilmar Oil Palm Plantation).

DNA & RNA Extraction

Genomic DNA: Qiagen Genomic-tip, QC with Qubit fluorometer and NanoDrop (A260/280 ≈ 1.8–2.0, A260/230 > 2.0), PFGE (>50 kb).
Total RNA (skin): tissue preserved in RNAlater, extracted with Qiagen RNeasy, QC threshold RIN ≥ 7.
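
The acceptance thresholds above can be written down as a small check, which may help when screening many extractions at once. This is a minimal sketch: the function names are hypothetical, and treating "A260/280 ≈ 1.8–2.0" as a closed interval is an assumption made for illustration.

```python
def dna_passes_qc(a260_280: float, a260_230: float, fragment_kb: float) -> bool:
    """Check a genomic DNA extraction against the thresholds listed above.

    Assumption: the approximate A260/280 window (1.8-2.0) is treated as a
    closed interval; a real lab protocol may allow some tolerance.
    """
    purity_ok = 1.8 <= a260_280 <= 2.0 and a260_230 > 2.0
    size_ok = fragment_kb > 50  # PFGE fragment size, in kb
    return purity_ok and size_ok


def rna_passes_qc(rin: float) -> bool:
    """Check a total RNA extraction against the RIN >= 7 threshold."""
    return rin >= 7


if __name__ == "__main__":
    # Illustrative values only, not measurements from the project.
    print(dna_passes_qc(a260_280=1.85, a260_230=2.1, fragment_kb=60))
    print(rna_passes_qc(rin=6.5))
```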

Genome Sequencing

PacBio HiFi (30–40× coverage, 15–20 kb reads): long, high-accuracy reads.
Hi-C (Illumina NovaSeq PE150): chromatin conformation capture data for scaffolding, generated from the skin tissue sample of the reference individual.
RNA-seq (skin of the reference individual, 2–3 replicates, NovaSeq PE150, ~50M reads each).
Illumina sequencing for downstream population analysis (10 individuals, 15× each, NovaSeq PE150).
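
The coverage targets above translate into sequencing yield through the relation yield = genome size × coverage. The sketch below illustrates that arithmetic. The genome size is not stated in this README, so `PLACEHOLDER_GENOME_SIZE_BP` is a hypothetical value used only to make the example runnable; the function names are likewise illustrative.

```python
import math

# Hypothetical placeholder: the README does not report a genome size.
PLACEHOLDER_GENOME_SIZE_BP = 3_000_000_000


def required_bases(genome_size_bp: int, coverage: int) -> int:
    """Total bases needed to reach the given mean coverage."""
    return genome_size_bp * coverage


def required_reads(genome_size_bp: int, coverage: int, bases_per_read: int) -> int:
    """Number of reads (or read pairs) needed, rounded up.

    For PE150 data, pass bases_per_read=300 to count read pairs.
    """
    return math.ceil(required_bases(genome_size_bp, coverage) / bases_per_read)


if __name__ == "__main__":
    g = PLACEHOLDER_GENOME_SIZE_BP
    # HiFi target range from the list above: 30-40x.
    for cov in (30, 40):
        print(f"HiFi {cov}x: {required_bases(g, cov) / 1e9:.0f} Gb")
    # Resequencing: 15x per individual with PE150 read pairs.
    print(f"PE150 pairs per individual: {required_reads(g, 15, 300):,}")
```

Replacing the placeholder with a measured or estimated genome size (for example from flow cytometry or k-mer profiling) gives the figures to compare against vendor quotes.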

Assembly & QC

hifiasm for contig-level assembly (Cheng 2021).
Hi-C scaffolding with 3D-DNA (Dudchenko 2017) + Juicebox (Robinson 2018) with manual curation.
QC metrics: N50 > 20 Mb, BUSCO completeness (Tegenfeldt 2025), QV ≥ 30 (Rhie 2020), LAI (Ou 2018).
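
Two of the QC metrics above have simple definitions that are easy to verify by hand. N50 is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly, and a Phred-scaled QV relates to the per-base error rate as QV = -10·log10(error). The sketch below is illustrative only; in the pipeline these values would come from the assembly and Merqury outputs, and the function names are hypothetical.

```python
import math
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one sequence length is required")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first sequence that brings the sum to >= 50% of total.
        if running * 2 >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


def qv_to_error_rate(qv: float) -> float:
    """Convert a Phred-scaled consensus QV into a per-base error rate."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate: float) -> float:
    """Convert a per-base error rate into a Phred-scaled QV."""
    return -10 * math.log10(error_rate)


if __name__ == "__main__":
    print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
    print(qv_to_error_rate(30))
```

Under this definition, the QV ≥ 30 target corresponds to at most about one consensus error per 1,000 bases.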

Functional Annotation

Repeat discovery: RepeatModeler2 (Flynn et al., 2020) + masking with RepeatMasker (Smit et al., 2015).
Gene models: BRAKER2 (Gabriel et al., 2024) combines RNA-seq alignments (STAR, Dobin et al., 2013) with ab initio gene prediction (AUGUSTUS, Stanke et al., 2004) to improve exon–intron boundary prediction.
Refinement: MAKER (Cantarel et al., 2008) integrating BRAKER2 results with homology from the closely related Leptobrachium ailaonicum genome (Li et al., 2019).
Major target gene families: antimicrobial peptides (AMPs, validated through APD3, Wang et al., 2016), keratins, mucins, extracellular matrix proteins.

Population Genomics

Mapping: BWA-MEM2 (Md et al., 2019) against the reference genome.
Variant calling: GATK (Van der Auwera et al., 2013) in gVCF mode, followed by joint calling.
Genotype likelihoods: ANGSD (Korneliussen et al., 2014) for allele frequencies and summary statistics.
Population structure: PCAngsd (Meisner & Albrechtsen, 2018) for PCA.
Differentiation: $F_{ST}$ (Wright, 1978).
Environmental associations: BayPass (Gautier, 2025) to identify SNPs correlated with habitat (protected vs fragmented).
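
For intuition about the differentiation step, the textbook single-locus definition F_ST = (H_T − H_S) / H_T can be computed directly from allele frequencies in the two habitats. This is a minimal sketch assuming a biallelic SNP and two equally weighted populations; it is not the estimator that ANGSD applies to genotype likelihoods, and the function name is hypothetical.

```python
import math


def wright_fst(p_protected: float, p_fragmented: float) -> float:
    """Single-locus F_ST for a biallelic SNP in two equally weighted populations.

    p_protected and p_fragmented are the frequencies of the same allele in
    each population. Returns NaN when the site is monomorphic overall.
    """
    freqs = (p_protected, p_fragmented)
    # H_S: mean expected heterozygosity within populations.
    h_s = sum(2 * p * (1 - p) for p in freqs) / len(freqs)
    # H_T: expected heterozygosity of the pooled population.
    p_bar = sum(freqs) / len(freqs)
    h_t = 2 * p_bar * (1 - p_bar)
    if h_t == 0:
        return math.nan
    return (h_t - h_s) / h_t


if __name__ == "__main__":
    # Illustrative frequencies only, not results from the project.
    print(wright_fst(0.2, 0.8))
    print(wright_fst(0.3, 0.3))
```

With only five individuals per habitat, per-site estimates are noisy, which is one reason to work with genotype likelihoods and windowed or genome-wide summaries rather than hard-called single-SNP values.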

Data submission

All raw sequencing reads, the assembled and annotated genome, and the VCF files are submitted to the European Nucleotide Archive (ENA) to ensure compliance with the FAIR principles and to enable reproducibility by the scientific community.

Budget & Estimated Costs

The estimated costs reported in this project were obtained, whenever possible, by directly contacting the information services of Illumina, PacBio, Qiagen and ThermoFisher for updated sequencing, extraction and conservation kit prices.
For consumables and contingency expenses, a general estimate was made based on standard laboratory practices (plasticware, barcoding materials, cryogenic storage, shipping logistics), with an additional margin included to cover unexpected costs.

Item	Estimated Cost (€)	Notes
Fieldwork & permits	4,800	Sampling in Gunung Gading NP & Wilmar plantation; ABS/Nagoya compliance
DNA extraction (HMW + resequencing)	3,250	Qiagen Genomic-tip
RNA extraction (skin only)	1,420	RNAlater + Qiagen RNeasy kits
PacBio HiFi sequencing	34,780	30–40× coverage, 1 HQ individual
Hi-C sequencing	8,230	Hi-C libraries from fresh tissue, sequenced on Illumina NovaSeq PE150
Illumina resequencing (10 ind., 15×)	14,650	Illumina NovaSeq PE150
RNA-seq (skin, 2–3 replicates)	6,180	Illumina NovaSeq PE150
Computational costs & Cloud storage	11,740	Assembly, annotation, population genomics analyses
Consumables & contingency	8,560	Tubes, barcoding, dry shipper
TOTAL	93,610	Within the €100,000 budget
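
The total in the table can be re-derived from the individual items, which is a quick way to re-check the budget if any quote changes. The sketch below copies the figures from the table; the variable names are illustrative.

```python
BUDGET_EUR = 100_000

# Line items copied from the budget table above (values in euros).
costs_eur = {
    "Fieldwork & permits": 4_800,
    "DNA extraction (HMW + resequencing)": 3_250,
    "RNA extraction (skin only)": 1_420,
    "PacBio HiFi sequencing": 34_780,
    "Hi-C sequencing": 8_230,
    "Illumina resequencing (10 ind., 15x)": 14_650,
    "RNA-seq (skin, 2-3 replicates)": 6_180,
    "Computational costs & cloud storage": 11_740,
    "Consumables & contingency": 8_560,
}

total_eur = sum(costs_eur.values())
remaining_eur = BUDGET_EUR - total_eur

if __name__ == "__main__":
    # List items from most to least expensive with their share of the total.
    for item, cost in sorted(costs_eur.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{item:<40} {cost:>7,} EUR  {cost / total_eur:6.1%}")
    print(f"{'TOTAL':<40} {total_eur:>7,} EUR")
    print(f"{'Unallocated margin':<40} {remaining_eur:>7,} EUR")
```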

References

Cheng, H., Concepcion, G. T., Feng, X., Zhang, H., & Li, H. (2021). Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nature Methods, 18(2). https://doi.org/10.1038/s41592-020-01056-5
Dudchenko, O., Batra, S. S., Omer, A. D., Nyquist, S. K., Hoeger, M., Durand, N. C., Shamim, M. S., Machol, I., Lander, E. S., Aiden, A. P., & Lieberman Aiden, E. (2017). De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds. Science, 356(6333). https://doi.org/10.1126/science.aal3327
Robinson, J. T., Turner, D., Durand, N. C., Thorvaldsdóttir, H., Mesirov, J. P., & Lieberman Aiden, E. (2018). Juicebox.js provides a cloud-based visualization system for Hi-C data. Cell Systems, 6(2). https://doi.org/10.1016/j.cels.2018.01.001
Flynn, J. M., Hubley, R., Goubert, C., Rosen, J., Clark, A. G., Feschotte, C., & Smit, A. F. (2020). RepeatModeler2 for automated genomic discovery of transposable element families. PNAS, 117(17). https://doi.org/10.1073/pnas.1921046117
Smit, A. F. A., Hubley, R., & Green, P. (2015). RepeatMasker Open-4.0. http://www.repeatmasker.org
Stanke, M., Steinkamp, R., Waack, S., & Morgenstern, B. (2004). AUGUSTUS: A web server for gene finding in eukaryotes. Nucleic Acids Research, 32. https://doi.org/10.1093/nar/gkh379
Gabriel, L., Brůna, T., Hoff, K. J., Ebel, M., Lomsadze, A., Borodovsky, M., & Stanke, M. (2024). BRAKER3: Fully automated genome annotation using RNA-seq and protein evidence with GeneMark-ETP, AUGUSTUS, and TSEBRA. Genome Research, 34(5), 769–777. https://doi.org/10.1101/gr.278090.123
Dobin, A., Davis, C. A., Schlesinger, F., Drenkow, J., Zaleski, C., Jha, S., Batut, P., Chaisson, M., & Gingeras, T. R. (2013). STAR: Ultrafast universal RNA-seq aligner. Bioinformatics, 29(1). https://doi.org/10.1093/bioinformatics/bts635
Cantarel, B. L., Korf, I., Robb, S. M. C., Parra, G., Ross, E., Moore, B., Holt, C., Sánchez Alvarado, A., & Yandell, M. (2008). MAKER: An easy-to-use annotation pipeline designed for emerging model organism genomes. Genome Research, 18(1). https://doi.org/10.1101/gr.6743907
Md, V., Misra, S., Li, H., & Aluru, S. (2019). Efficient architecture-aware acceleration of BWA-MEM for multicore systems. In IPDPS 2019. https://doi.org/10.1109/IPDPS.2019.00041
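Van der Auwera, G. A., Carneiro, M. O., Hartl, C., Poplin, R., Del Angel, G., Levy-Moonshine, A., Jordan, T., Shakir, K., Roazen, D., Thibault, J., Banks, E., Garimella, K. V., Altshuler, D., Gabriel, S., & DePristo, M. A. (2013). From FastQ data to high-confidence variant calls: The Genome Analysis Toolkit best practices pipeline. Current Protocols in Bioinformatics, 43(1), 11.10.1–11.10.33. https://doi.org/10.1002/0471250953.bi1110s43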
Korneliussen, T. S., Albrechtsen, A., & Nielsen, R. (2014). ANGSD: Analysis of Next Generation Sequencing Data. BMC Bioinformatics, 15. https://doi.org/10.1186/s12859-014-0356-4
Meisner, J., & Albrechtsen, A. (2018). Inferring population structure and admixture proportions in low-depth NGS data. Genetics, 210(2). https://doi.org/10.1534/genetics.118.301336
Gautier, M. (2025). BayPass software for population genomics. http://www1.montpellier.inra.fr/CBGP/software/baypass/ (Accessed: 25 Aug 2025)
Tegenfeldt, F., Kuznetsov, D., Manni, M., Berkeley, M., Zdobnov, E. M., & Kriventseva, E. V. (2025). OrthoDB and BUSCO update: annotation of orthologs with wider sampling of genomes. Nucleic Acids Research, 53(D1), D516–D522. https://doi.org/10.1093/nar/gkae987
Rhie, A., Walenz, B. P., Koren, S., & Phillippy, A. M. (2020). Merqury: Reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biology, 21, 245. https://doi.org/10.1186/s13059-020-02134-9
Ou, S., Chen, J., & Jiang, N. (2018). Assessing genome assembly quality using the LTR Assembly Index (LAI). Nucleic Acids Research, 46(21), e126. https://doi.org/10.1093/nar/gky730
Wright, S. (1978). Evolution and the Genetics of Populations, Vol. 4: Variability within and among natural populations. University of Chicago Press.
Wang, G., Li, X., & Wang, Z. (2016). APD3: The antimicrobial peptide database as a tool for research and education. Nucleic Acids Research, 44(D1), D1087–D1093. https://doi.org/10.1093/nar/gkv1278
Li, Y., Ren, Y., Zhang, D., Jiang, H., Wang, Z., Li, X., & Rao, D. (2019). Chromosome-level assembly of the mustache toad genome using third-generation DNA sequencing and Hi-C analysis. GigaScience, 8(9). https://doi.org/10.1093/gigascience/giz114

Contacts

For any questions, suggestions, or contributions, feel free to open an issue or contact the maintainer:

Marco Cuscunà

[email protected]
ORCID: 0009-0008-4017-8328


Markus2409/Applied_Genomics_Project
