HapCol: Accurate and memory-efficient haplotype assembly from long reads

Yuri Pirola, Simone Zaccaria, Riccardo Dondi, Gunnar W. Klau, Nadia Pisanti, Paola Bonizzoni

Research output: Contribution to JournalArticleAcademicpeer-review

Abstract

Motivation: Haplotype assembly is the computational problem of reconstructing haplotypes in diploid organisms and is of fundamental importance for characterizing the effects of single-nucleotide polymorphisms on the expression of phenotypic traits. Haplotype assembly highly benefits from the advent of 'future-generation' sequencing technologies and their capability to produce long reads at increasing coverage. Existing methods are not able to deal with such data in a fully satisfactory way, either because accuracy or performances degrade as read length and sequencing coverage increase or because they are based on restrictive assumptions. Results: By exploiting a feature of future-generation technologies - the uniform distribution of sequencing errors - we designed an exact algorithm, called HapCol, that is exponential in the maximum number of corrections for each single-nucleotide polymorphism position and that minimizes the overall error-correction score. We performed an experimental analysis, comparing HapCol with the current state-of-the-art combinatorial methods both on real and simulated data. On a standard benchmark of real data, we show that HapCol is competitive with state-of-the-art methods, improving the accuracy and the number of phased positions. Furthermore, experiments on realistically simulated datasets revealed that HapCol requires significantly less computing resources, especially memory. Thanks to its computational efficiency, HapCol can overcome the limits of previous approaches, allowing to phase datasets with higher coverage and without the traditional all-heterozygous assumption. Availability and implementation: Our source code is available under the terms of the GNU General Public License at http://hapcol.algolab.eu/. Supplementary information: Supplementary data are available at Bioinformatics online.

Original languageEnglish
Pages (from-to)1610-1617
Number of pages8
JournalBioinformatics
Volume32
Issue number11
DOIs
Publication statusPublished - 1 Jun 2016

Fingerprint

Haplotype
Nucleotides
Polymorphism
Haplotypes
Sequencing
Data storage equipment
Coverage
Single nucleotide Polymorphism
Error correction
Bioinformatics
Computational efficiency
Social Responsibility
Availability
Single Nucleotide Polymorphism
Experimental Analysis
Error Correction
Exact Algorithms
Uniform distribution
Computational Efficiency
Technology

Cite this

Pirola, Y., Zaccaria, S., Dondi, R., Klau, G. W., Pisanti, N., & Bonizzoni, P. (2016). HapCol: Accurate and memory-efficient haplotype assembly from long reads. Bioinformatics, 32(11), 1610-1617. https://doi.org/10.1093/bioinformatics/btv495
Pirola, Yuri ; Zaccaria, Simone ; Dondi, Riccardo ; Klau, Gunnar W. ; Pisanti, Nadia ; Bonizzoni, Paola. / HapCol : Accurate and memory-efficient haplotype assembly from long reads. In: Bioinformatics. 2016 ; Vol. 32, No. 11. pp. 1610-1617.
@article{7ad5354117c24a07b7c99178ba729827,
title = "HapCol: Accurate and memory-efficient haplotype assembly from long reads",
abstract = "Motivation: Haplotype assembly is the computational problem of reconstructing haplotypes in diploid organisms and is of fundamental importance for characterizing the effects of single-nucleotide polymorphisms on the expression of phenotypic traits. Haplotype assembly highly benefits from the advent of 'future-generation' sequencing technologies and their capability to produce long reads at increasing coverage. Existing methods are not able to deal with such data in a fully satisfactory way, either because accuracy or performances degrade as read length and sequencing coverage increase or because they are based on restrictive assumptions. Results: By exploiting a feature of future-generation technologies - the uniform distribution of sequencing errors - we designed an exact algorithm, called HapCol, that is exponential in the maximum number of corrections for each single-nucleotide polymorphism position and that minimizes the overall error-correction score. We performed an experimental analysis, comparing HapCol with the current state-of-the-art combinatorial methods both on real and simulated data. On a standard benchmark of real data, we show that HapCol is competitive with state-of-the-art methods, improving the accuracy and the number of phased positions. Furthermore, experiments on realistically simulated datasets revealed that HapCol requires significantly less computing resources, especially memory. Thanks to its computational efficiency, HapCol can overcome the limits of previous approaches, allowing to phase datasets with higher coverage and without the traditional all-heterozygous assumption. Availability and implementation: Our source code is available under the terms of the GNU General Public License at http://hapcol.algolab.eu/. Supplementary information: Supplementary data are available at Bioinformatics online.",
author = "Yuri Pirola and Simone Zaccaria and Riccardo Dondi and Klau, {Gunnar W.} and Nadia Pisanti and Paola Bonizzoni",
year = "2016",
month = "6",
day = "1",
doi = "10.1093/bioinformatics/btv495",
language = "English",
volume = "32",
pages = "1610--1617",
journal = "Bioinformatics",
issn = "1367-4803",
publisher = "Oxford University Press",
number = "11",

}

Pirola, Y, Zaccaria, S, Dondi, R, Klau, GW, Pisanti, N & Bonizzoni, P 2016, 'HapCol: Accurate and memory-efficient haplotype assembly from long reads' Bioinformatics, vol. 32, no. 11, pp. 1610-1617. https://doi.org/10.1093/bioinformatics/btv495

HapCol : Accurate and memory-efficient haplotype assembly from long reads. / Pirola, Yuri; Zaccaria, Simone; Dondi, Riccardo; Klau, Gunnar W.; Pisanti, Nadia; Bonizzoni, Paola.

In: Bioinformatics, Vol. 32, No. 11, 01.06.2016, p. 1610-1617.

Research output: Contribution to JournalArticleAcademicpeer-review

TY - JOUR

T1 - HapCol

T2 - Accurate and memory-efficient haplotype assembly from long reads

AU - Pirola, Yuri

AU - Zaccaria, Simone

AU - Dondi, Riccardo

AU - Klau, Gunnar W.

AU - Pisanti, Nadia

AU - Bonizzoni, Paola

PY - 2016/6/1

Y1 - 2016/6/1

N2 - Motivation: Haplotype assembly is the computational problem of reconstructing haplotypes in diploid organisms and is of fundamental importance for characterizing the effects of single-nucleotide polymorphisms on the expression of phenotypic traits. Haplotype assembly highly benefits from the advent of 'future-generation' sequencing technologies and their capability to produce long reads at increasing coverage. Existing methods are not able to deal with such data in a fully satisfactory way, either because accuracy or performances degrade as read length and sequencing coverage increase or because they are based on restrictive assumptions. Results: By exploiting a feature of future-generation technologies - the uniform distribution of sequencing errors - we designed an exact algorithm, called HapCol, that is exponential in the maximum number of corrections for each single-nucleotide polymorphism position and that minimizes the overall error-correction score. We performed an experimental analysis, comparing HapCol with the current state-of-the-art combinatorial methods both on real and simulated data. On a standard benchmark of real data, we show that HapCol is competitive with state-of-the-art methods, improving the accuracy and the number of phased positions. Furthermore, experiments on realistically simulated datasets revealed that HapCol requires significantly less computing resources, especially memory. Thanks to its computational efficiency, HapCol can overcome the limits of previous approaches, allowing to phase datasets with higher coverage and without the traditional all-heterozygous assumption. Availability and implementation: Our source code is available under the terms of the GNU General Public License at http://hapcol.algolab.eu/. Supplementary information: Supplementary data are available at Bioinformatics online.

AB - Motivation: Haplotype assembly is the computational problem of reconstructing haplotypes in diploid organisms and is of fundamental importance for characterizing the effects of single-nucleotide polymorphisms on the expression of phenotypic traits. Haplotype assembly highly benefits from the advent of 'future-generation' sequencing technologies and their capability to produce long reads at increasing coverage. Existing methods are not able to deal with such data in a fully satisfactory way, either because accuracy or performances degrade as read length and sequencing coverage increase or because they are based on restrictive assumptions. Results: By exploiting a feature of future-generation technologies - the uniform distribution of sequencing errors - we designed an exact algorithm, called HapCol, that is exponential in the maximum number of corrections for each single-nucleotide polymorphism position and that minimizes the overall error-correction score. We performed an experimental analysis, comparing HapCol with the current state-of-the-art combinatorial methods both on real and simulated data. On a standard benchmark of real data, we show that HapCol is competitive with state-of-the-art methods, improving the accuracy and the number of phased positions. Furthermore, experiments on realistically simulated datasets revealed that HapCol requires significantly less computing resources, especially memory. Thanks to its computational efficiency, HapCol can overcome the limits of previous approaches, allowing to phase datasets with higher coverage and without the traditional all-heterozygous assumption. Availability and implementation: Our source code is available under the terms of the GNU General Public License at http://hapcol.algolab.eu/. Supplementary information: Supplementary data are available at Bioinformatics online.

UR - http://www.scopus.com/inward/record.url?scp=84973369410&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84973369410&partnerID=8YFLogxK

U2 - 10.1093/bioinformatics/btv495

DO - 10.1093/bioinformatics/btv495

M3 - Article

VL - 32

SP - 1610

EP - 1617

JO - Bioinformatics

JF - Bioinformatics

SN - 1367-4803

IS - 11

ER -

Pirola Y, Zaccaria S, Dondi R, Klau GW, Pisanti N, Bonizzoni P. HapCol: Accurate and memory-efficient haplotype assembly from long reads. Bioinformatics. 2016 Jun 1;32(11):1610-1617. https://doi.org/10.1093/bioinformatics/btv495