Development and application of novel genome assembly algorithms
that use multiple data sources
Birden Fazla Veri Kaynağı Kullanabilen Yeni Genom Birleştirme
Algoritmalarının Tasarımı Ve Uygulanması
- Scientific and Technological Research Council of Turkey
(TÜBİTAK-1001-112E135), 2012-2015
- PI: Can Alkan
- Students: Elif Dal, Fatma Kahveci, Shatlyk Ashyralyyev
- Total 313,152 TL for three years
(approx. €134,000).
- The goal of this project is to develop assembly algorithms to
more reliably construct de novo
genome assemblies using data from multiple sources.
Abstract
The application of high-throughput sequencing (HTS) technologies is
revolutionizing the field of genomics, providing unprecedented
resolution to study the genomes of different species, as well as normal and
disease-causing human genetic variation. Although significant advances have been
made in analyzing HTS data, there are still several hurdles to fully
utilizing the power of HTS.
Although we can now generate data at a rate previously unimaginable, the
analysis of the data is proceeding at a slower pace because: 1)
unprecedented amounts of data introduce challenges in computational
infrastructure in terms of both storage and processing power; 2) reads
often have high sequencing error rates and short lengths;
and 3) currently available algorithms to analyze HTS data and the HTS
data themselves show different biases against different regions of the
genome. Due to these problems, the information available in the
sequencing datasets is not completely mined. There is a need to forge an
alliance between computer science and genomics to devise better methods
to use the massive amount of sequence data to unleash the full power of
HTS methodologies.
Thanks to the substantially reduced cost of genome sequencing, there is
now great interest in sequencing the genomes of thousands of species to
better understand the genomic diversity across different organisms,
organismal biology and genome evolution. In the last few years many
genomes have been sequenced: plants such as rice, grape, wheat, potato, corn,
and cucumber; and animals such as the giant panda, turkey, gorilla,
orangutan, bonobo, opossum, and elephant. More recently, more ambitious
projects such as the Genome 10K Consortium have been started to sequence the
genomes of 10,000 vertebrate species. However, the aforementioned
limitations of HTS technologies also affect de novo sequencing
studies that aim to construct the reference genomes of various species.
This is mainly due to the repetitive structure of the genomes of most
species, the short sequence reads generated by current platforms, and
the increased error rate. Thus, the accuracy of assembled genomes
still needs to be improved; otherwise, biological
conclusions derived from inaccurate genome assemblies could be
incorrect.
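As a toy illustration of why repeats and short reads make assembly ambiguous, the sketch below builds a de Bruijn graph from the k-mers of a made-up sequence and reports nodes with more than one possible successor. It is not one of the project's algorithms: the function name, the sequence, and the use of k as a stand-in for read length are all assumptions made for illustration only.

```python
from collections import defaultdict


def de_bruijn_branching_nodes(sequence, k):
    """Return the (k-1)-mers that have more than one distinct successor
    in the de Bruijn graph built from all k-mers of `sequence`."""
    successors = defaultdict(set)
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        # Edge from the k-mer's prefix node to its suffix node.
        successors[kmer[:-1]].add(kmer[1:])
    return sorted(node for node, nxt in successors.items() if len(nxt) > 1)


# Hypothetical toy genome: the 7-base repeat occurs twice,
# each time followed by different flanking sequence.
REPEAT = "GATTACA"
genome = "CCGT" + REPEAT + "TTGC" + REPEAT + "AGGA"

# k = 6: graph nodes (5-mers) are shorter than the repeat, so the path
# leaving the repeat is ambiguous.
short_k_branches = de_bruijn_branching_nodes(genome, 6)

# k = 9: graph nodes (8-mers) span the whole repeat plus flanking bases,
# so no node has more than one successor.
long_k_branches = de_bruijn_branching_nodes(genome, 9)

print(short_k_branches)
print(long_k_branches)
```

In this toy case the ambiguity disappears once the unit of evidence is longer than the repeat, which is the intuition behind adding longer reads or long-range pairing information to short-read data.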
Reasoning from the previous observations and empirical evidence that all
current HTS platforms show different strengths and biases, we propose to
devise novel genome assembly algorithms that use data from multiple
sources, including, when available, data derived from laboratory
experiments, to better assemble the genomes of new species. We will test
our algorithms with 1) a set of bacterial artificial chromosomes (BACs)
generated from a hydatidiform mole resource and sequenced using
both the Illumina and Pacific Biosciences platforms, assessing
assembly accuracy by comparison with high-quality assemblies of the same
resource based on capillary sequence data; and 2) whole genome shotgun sequence
libraries generated from a haploid genome (hydatidiform mole) and
sequenced using 454/Roche and Illumina platforms, several BAC end
sequences from the same library sequenced using capillary sequencing,
and physical fingerprinting data. The base-calling accuracy of the
Illumina platform, coupled with longer mate pairs from 454/Roche, long
sequences from Pacific Biosciences, long “jumps” from BAC end
sequencing, and the physical ordering of the BACs from the fingerprint
data, will be used together to improve the genome assembly. In the long
term, we will also incorporate methodologies that utilize data from
upcoming nanotechnology-based sequencing platforms such as Oxford
Nanopore Technologies. Enhanced algorithms that can better assemble
genomes will improve our understanding of the biology of genomes.
Dissemination
- Early postzygotic mutations contribute to de novo variation in a
healthy monozygotic twin pair. Gülşah M Dal, Bekir Ergüner, Mahmut S
Sağıroğlu, Bayram Yüksel, Onur Emre Onat, Can Alkan, Tayfun Özçelik.
J Med Genet, 51(7):455-459, 2014.
- Whole genome sequencing of Turkish genomes reveals functional private
alleles and impact of genetic interactions with Europe, Asia and
Africa. Can Alkan, Pınar Kavak, Mehmet Somel, Omer Gokcumen, Serkan
Uğurlu, Ceren Saygı, Elif Dal, Kuyaş Buğra, Tunga Güngör, S Cenk
Sahinalp, Nesrin Özören and Cemalettin Bekpen. BMC Genomics,
15(1):963, 2014.
- A hypergraph-based model for hybrid de novo assembly. Shatlyk
Ashyralyyev, Can Firtina, Cevdet Aykanat, Can Alkan. Bertinoro
Computational Biology Meeting, June 14-18, Bertinoro, Italy, 2015.
- Evaluation of genome scaffolding tools using pooled clone sequencing.
Elif Dal and Can Alkan. Turkish J of Biology, 42, 471-476, 2018.