Algorithms for structural variation discovery using hybrid
sequencing technologies and library preparation protocols.
Scientific and Technical Research Council of Turkey
(TÜBİTAK-1001-215E172), 2016-2018
- PI: Can Alkan
- Students: Can Fırtına, Arda Söylev, Can Koçkan
- Collaborations: Fereydoun Hormozdiari (UC Davis), Thong Le (UC
Davis), Iman Hajirasouliha (Weill Cornell), Camir Ricketts (Weill
Cornell), Ercüment Çiçek (Bilkent)
- Total 330,637 TL for three years
(approx. €98,935).
- The goal of this project is to develop algorithms for the
discovery and characterization of structural variants using
multiple sequencing platforms, linked-reads, and read clouds.
Abstract
Genomic structural variation (SV) is defined by the 1000 Genomes Project
as variation that affects more than 50 basepairs. These variations can
be in different forms such as deletion, insertion, inversion,
translocation, retrotransposition, or interspersed or tandem
duplications. Although there are far fewer SVs than single nucleotide
polymorphisms (SNPs) (3.5 million SNPs vs. 10-15 thousand SVs), the
total number of basepairs affected by SVs is substantially higher (3.5
Mbp for SNPs vs. 15-20 Mbp for SVs).
The widespread occurrence of SVs in non-cancer genomes was first shown
in 2004 by Iafrate et al. It was later understood that SVs also cause
several complex diseases such as Crohn’s disease, schizophrenia, and
autism.
Array comparative genomic hybridization (array CGH) was the dominant
technology specifically for copy number variation (CNV) discovery;
however, high throughput sequencing (HTS) platforms became more popular
for such studies after their introduction in 2007. Still, as
demonstrated in the 1000 Genomes Project, HTS platforms produce either
short reads (Illumina, Complete Genomics, Ion Torrent, SOLiD) or long
reads with high error rates (Pacific Biosciences, Oxford Nanopore). As a
result, although CNV discovery is relatively successful, reliable
algorithms for characterizing complex SVs such as inversions,
translocations, and novel sequence insertions are still lacking. Such
complex variation usually occurs in highly repetitive regions of the
genome, which makes it harder to align HTS reads. This negatively
affects our ability to understand the genetic causes of several complex
diseases, and therefore limits progress on the missing heritability
problem.
Although all sequencing technologies have limitations in read length,
base calling accuracy, or error profiles, a weakness of one technology
may be a strength of another. For example, Illumina reads are short but
highly accurate (>99.9%), while Pacific Biosciences produces long reads
with a high error rate (>15%). In addition, new library preparation
techniques have recently been developed independently of the sequencing
technology, such as Illumina TSLR, 10X Genomics, Dovetail Genomics, and
pooled clone sequencing. These methods make it possible to obtain long
range contiguity information without changing the sequencing technology
itself.
In this project, we propose to use different sequencing techniques and
library preparation protocols in an integrated fashion to reliably
characterize structural variation. This way, we will be able to
combine the complementary strengths of different technologies and
correct for their biases. These algorithms will enable better
characterization of complex structural variation such as inversions and
translocations, and help solve the missing heritability problem.
Dissemination