MASV, a misassembly detection and variant calling pipeline for long reads data
Author
Fuentes Palacios, Diego
Entity
UAM. Departamento de Ingeniería Informática
Date
2020-02
Subjects
De novo assembly; Third generation sequencing; Mis-assemblies; Biology and Biomedicine / Biology; Computer Science
Note
Master's thesis in Bioinformatics and Computational Biology
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International license.
Abstract
In de novo genome assembly, each genome sequenced and assembled presents its own challenges,
such as sample quantity, DNA integrity, repetitiveness and heterozygosity, but mis-assemblies
are often the most difficult to tackle. Fortunately, the longer reads produced by third
generation sequencing technologies allow a better characterization of complex regions, which
are usually distinguished by their large number of repeats [1],[2]. This master's project aims
to develop an automated pipeline for detecting large structural variants (SVs) in de novo
assemblies produced from long reads, since such variants may indicate errors in the assembly
process. By mapping the reads back to their assembly, we may be able to pinpoint mis-assemblies,
or sequence blocks that differ substantially from the real genomic fragment from which a read
was derived.
Methodology: To this end, a Snakemake pipeline was developed. It incorporates two of the most
widely used aligners, Minimap2 [3] and Ngmlr [4], as well as two SV prediction tools,
Sniffles [4] and Svim [5]. It also includes custom scripts that measure recall, precision, F1
and the precision-recall trade-off for evaluation purposes, along with custom scripts for VCF
(Variant Call Format) formatting and conversion.
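As an illustration of the evaluation metrics named above, the following is a minimal sketch of how precision, recall and F1 can be derived from counts of true positive, false positive and false negative calls. The function name and the choice to return 0.0 for undefined ratios are assumptions made for this example; they are not taken from the MASV scripts.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Return (precision, recall, F1) from raw call counts.

    Hypothetical helper for illustration; undefined ratios are reported as 0.0.
    """
    called = true_positives + false_positives      # all calls made by the SV caller
    expected = true_positives + false_negatives    # all SVs in the truth set
    precision = true_positives / called if called else 0.0
    recall = true_positives / expected if expected else 0.0
    denominator = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Example counts chosen arbitrarily for demonstration.
    print(precision_recall_f1(80, 20, 20))
```

A precision-recall trade-off can then be traced by recomputing these values while varying a minimum score threshold on the calls.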
Experiments conducted: First, the SV prediction performance was benchmarked by replicating
an experiment from the Svim paper [5]: NA12878 nanopore raw reads (obtained from the
Nanopore WGS consortium [6]) were mapped against the hg19 human reference genome, with 2676
high-confidence deletions and 68 high-confidence insertions serving as the truth SV dataset
(validated in a previous study using PacBio and Moleculo reads [7]). After successfully
replicating the experiment and setting the default parameters, the pipeline was tested for
mis-assembly detection. This experiment involved simulating long reads from a reference
genome into which SVs had been introduced at known positions. The simulated reads were then
mapped to the unaltered reference in order to detect the rearrangements (i.e. the
“mis-assemblies”). In other words, the unaltered hg19 reference was meant to resemble a
de novo assembly that is mis-assembled with respect to the reads simulated from the
rearranged reference: the reads are treated as the "ground truth", and the locations of the
real SVs are known beforehand. The hg19 reference genome (restricted to chromosomes 21 and 22
to limit computation time and keep the environment controlled) was rearranged by introducing
simulated homozygous SVs: 200 deletions, 100 inversions, 200 tandem duplications and
100 insertions (cut & paste, akin to conservative transposition) were introduced using the
R package RSVsim [8], providing a high-confidence “truth” SV dataset. The SimLoRD [9]
Python package was used to simulate PacBio reads (x53 coverage) from the rearranged
hg19 reference. Two conditions were tested: reads simulated from the rearranged reference
mapped against the hg19 reference, to test prediction in homozygosity; and a merge of reads
simulated from the rearranged reference (x26 coverage) and reads simulated from the normal
reference (x26 coverage), for a total of x52 coverage of heterozygous reads mapped against
the normal reference.
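Because the simulated SVs sit at known positions, predicted calls can be scored against the truth set. The sketch below shows one simple way to do this, assuming that a call counts as correct when it has the same chromosome and SV type as a truth entry and both breakpoints fall within a tolerance. The `SV` record, the function names and the 1000 bp default tolerance are hypothetical choices for this example, not the criteria used in the thesis.

```python
from collections import namedtuple

# Hypothetical minimal SV record: chromosome, start, end and type (e.g. "DEL").
SV = namedtuple("SV", ["chrom", "start", "end", "svtype"])


def matches(call, truth, tolerance=1000):
    """True if call and truth share chromosome and type and both breakpoints are close."""
    return (
        call.chrom == truth.chrom
        and call.svtype == truth.svtype
        and abs(call.start - truth.start) <= tolerance
        and abs(call.end - truth.end) <= tolerance
    )


def count_matches(calls, truth_set, tolerance=1000):
    """Return (true_positives, false_positives, false_negatives).

    Each truth SV can be matched by at most one call.
    """
    found = set()  # indices of truth SVs already matched
    true_positives = false_positives = 0
    for call in calls:
        hit = next(
            (i for i, truth in enumerate(truth_set)
             if i not in found and matches(call, truth, tolerance)),
            None,
        )
        if hit is None:
            false_positives += 1
        else:
            found.add(hit)
            true_positives += 1
    false_negatives = len(truth_set) - len(found)
    return true_positives, false_positives, false_negatives


if __name__ == "__main__":
    truth = [SV("chr21", 1000, 2000, "DEL"), SV("chr22", 5000, 6000, "INV")]
    calls = [SV("chr21", 1050, 1980, "DEL"), SV("chr21", 900000, 901000, "DUP")]
    print(count_matches(calls, truth, tolerance=100))
```

This is deliberately simplified: insertions, for instance, may need to be compared by insertion point and length rather than by start and end coordinates.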
Discussion: The results obtained are promising. With the caveat that only simulated data
were used instead of an actual assembly, they suggest that long-read SV detection methods
can be used as a tool for mis-assembly detection. Although it was not implemented, it would
be interesting to merge the calls from both SV predictors, generating high-confidence
consensus calls to reduce the impact of spurious calls on complex genome assemblies, such as
those from the plant kingdom [10]. Future work would focus on developing an algorithm for
reducing the number of mis-assemblies in draft genomes by rearranging the target assembly
according to the calls made by the MASV pipeline.
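The consensus idea mentioned in the discussion was not implemented in the pipeline. Purely as an illustration of what it could look like, the sketch below keeps a call from one predictor only when the other predictor reports an SV of the same type on the same chromosome with nearby breakpoints. The tuple layout `(chrom, start, end, svtype)`, the function name and the tolerance value are all assumptions introduced for this example.

```python
def consensus_calls(calls_a, calls_b, tolerance=1000):
    """Return the calls in calls_a that are supported by a similar call in calls_b.

    Each call is a hypothetical tuple (chrom, start, end, svtype). Two calls agree
    when chromosome and type are equal and both breakpoints differ by at most
    `tolerance` base pairs.
    """
    consensus = []
    for chrom_a, start_a, end_a, type_a in calls_a:
        for chrom_b, start_b, end_b, type_b in calls_b:
            if (
                chrom_a == chrom_b
                and type_a == type_b
                and abs(start_a - start_b) <= tolerance
                and abs(end_a - end_b) <= tolerance
            ):
                consensus.append((chrom_a, start_a, end_a, type_a))
                break  # one supporting call is enough
    return consensus


if __name__ == "__main__":
    first_caller = [("chr21", 1000, 2000, "DEL"), ("chr21", 50000, 51000, "INV")]
    second_caller = [("chr21", 1010, 1990, "DEL")]
    print(consensus_calls(first_caller, second_caller, tolerance=100))
```

Requiring agreement between callers trades recall for precision, which fits the stated goal of reducing spurious calls.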
Except where otherwise noted, this item's license is described as https://creativecommons.org/licenses/by-nc-nd/4.0/