Development and application of novel genome assembly algorithms that use multiple data sources

Birden Fazla Veri Kaynağı Kullanabilen Yeni Genom Birleştirme Algoritmalarının Tasarımı Ve Uygulanması

Abstract

The application of high throughput sequencing (HTS) technologies are revolutionizing the field of genomics, providing unprecedented resolution to study genomes of different species, and normal and disease causing human genetic variation. Although significant advances have been made to analyze HTS data, there are still several hurdles in fully utilizing the power of HTS.
Although we can now generate data at a rate previously unimaginable, the analysis of the data is proceeding at a slower pace because: 1) unprecedented amounts of data introduce challenges in computational infrastructure in terms of both storage and processing power; 2) reads are often associated with high sequence errors and shorter read length; and 3) currently available algorithms to analyze HTS data and the HTS data themselves show different biases against different regions of the genome. Due to these problems, the information available in the sequencing datasets is not completely mined. There is a need to forge an alliance between computer science and genomics to devise better methods to use the massive amount of sequence data to unleash the full power of HTS methodologies.

Thanks to the substantially reduced cost of genome sequencing, there is now great interest in sequencing the genomes of thousands of species to better understand the genomic diversity across different organisms, organismal biology and genome evolution. In the last few years many genomes are sequenced: plants such as rice, grape, wheat, potato, corn, cucumber; and animals such as the giant panda, turkey, gorilla, orangutan, bonobo, opossum, elephant, etc. Recently more ambitious projects like the Genome 10K Consortium are started to sequence the genomes of 10.000 vertebrate species. However, the aforementioned limitations of the HTS technologies also affected de novo sequencing studies that aim to construct the reference genomes of various species. This is mainly due to the repetitive structure of the genomes of most species, the short sequence reads generated by current platforms, and the increased error rate. Thus there are still problems to solve to increase the accuracy of the assembled genomes; otherwise any biological conclusions derived from non-accurate genome assemblies would be incorrect.

Reasoning from the previous observations and empirical evidence that all current HTS platforms show different strengths and biases, we propose to devise novel genome assembly algorithms that use data from multiple sources, including, when available, data derived from laboratory experiments to better assemble the genomes of new species. We will test our algorithms with 1) a set of bacterial artificial chromosomes (BACs) generated from a hydatidiform mole resource that were sequenced using both the Illumina and Pacific Biosciences platforms, and test the assembly accuracy by comparing with high quality assemblies of the same resource using capillary sequence data; 2) whole genome shotgun sequence libraries generated from a haploid genome (hydatidiform mole) and sequenced using 454/Roche and Illumina platforms, several BAC end sequences from the same library sequenced using capillary sequencing, and physical fingerprinting data. The basepair calling accuracy of the Illumina platform coupled with longer matepairs from the 454/Roche, long sequences from Pacific Biosciences, long “jumps” from BAC end sequencing, and the physical ordering of the BACs from the fingerprint data will be used in harmony to improve the genome assembly. In long term, we will also incorporate methodologies that utilize data from upcoming nanotechnology-based sequencing platforms such as the Oxford Nanopore Technologies. Enhanced algorithms that can better assemble genomes will improve our understanding of the biology of genomes.

Dissemination