On the Minimum Error Correction Problem for Haplotype Assembly in Diploid and Polyploid Genomes

Paola Bonizzoni, Riccardo Dondi, Gunnar W. Klau, Yuri Pirola, Nadia Pisanti, Simone Zaccaria

Research output: Contribution to JournalArticleAcademicpeer-review

Abstract

In diploid genomes, haplotype assembly is the computational problem of reconstructing the two parental copies, called haplotypes, of each chromosome starting from sequencing reads, called fragments, possibly affected by sequencing errors. Minimum error correction (MEC) is a prominent computational problem for haplotype assembly and, given a set of fragments, aims at reconstructing the two haplotypes by applying the minimum number of base corrections. MEC is computationally hard to solve, but some approximation-based or fixed-parameter approaches have been proved capable of obtaining accurate results on real data. In this work, we expand the current characterization of the computational complexity of MEC from the approximation and the fixed-parameter tractability point of view. In particular, we show that MEC is not approximable within a constant factor, whereas it is approximable within a logarithmic factor in the size of the input. Furthermore, we answer open questions on the fixed-parameter tractability for parameters of classical or practical interest: the total number of corrections and the fragment length. In addition, we present a direct 2-approximation algorithm for a variant of the problem that has also been applied in the framework of clustering data. Finally, since polyploid genomes, such as those of plants and fishes, are composed of more than two copies of the chromosomes, we introduce a novel formulation of MEC, namely the k-ploid MEC problem, that extends the traditional problem to deal with polyploid genomes. We show that the novel formulation is still both computationally hard and hard to approximate. Nonetheless, from the parameterized point of view, we prove that the problem is tractable for parameters of practical interest such as the number of haplotypes and the coverage, or the number of haplotypes and the fragment length.

Original languageEnglish
Pages (from-to)718-736
Number of pages19
JournalJournal of Computational Biology
Volume23
Issue number9
DOIs
Publication statusPublished - 1 Sep 2016

Fingerprint

Polyploidy
Haplotype
Error correction
Error Correction
Diploidy
Haplotypes
Genome
Genes
Fragment
Fixed-parameter Tractability
Chromosomes
Sequencing
Chromosome
Approximation algorithms
Formulation
Data Clustering
Fish
Approximation
Cluster Analysis
Computational complexity

Keywords

  • combinatorial optimization
  • graph theory
  • haplotypes
  • next-generation sequencing

Cite this

Bonizzoni, Paola ; Dondi, Riccardo ; Klau, Gunnar W. ; Pirola, Yuri ; Pisanti, Nadia ; Zaccaria, Simone. / On the Minimum Error Correction Problem for Haplotype Assembly in Diploid and Polyploid Genomes. In: Journal of Computational Biology. 2016 ; Vol. 23, No. 9. pp. 718-736.
@article{2ab1ecfc55c448f2b3f97cd11952f3be,
title = "On the Minimum Error Correction Problem for Haplotype Assembly in Diploid and Polyploid Genomes",
abstract = "In diploid genomes, haplotype assembly is the computational problem of reconstructing the two parental copies, called haplotypes, of each chromosome starting from sequencing reads, called fragments, possibly affected by sequencing errors. Minimum error correction (MEC) is a prominent computational problem for haplotype assembly and, given a set of fragments, aims at reconstructing the two haplotypes by applying the minimum number of base corrections. MEC is computationally hard to solve, but some approximation-based or fixed-parameter approaches have been proved capable of obtaining accurate results on real data. In this work, we expand the current characterization of the computational complexity of MEC from the approximation and the fixed-parameter tractability point of view. In particular, we show that MEC is not approximable within a constant factor, whereas it is approximable within a logarithmic factor in the size of the input. Furthermore, we answer open questions on the fixed-parameter tractability for parameters of classical or practical interest: the total number of corrections and the fragment length. In addition, we present a direct 2-approximation algorithm for a variant of the problem that has also been applied in the framework of clustering data. Finally, since polyploid genomes, such as those of plants and fishes, are composed of more than two copies of the chromosomes, we introduce a novel formulation of MEC, namely the k-ploid MEC problem, that extends the traditional problem to deal with polyploid genomes. We show that the novel formulation is still both computationally hard and hard to approximate. Nonetheless, from the parameterized point of view, we prove that the problem is tractable for parameters of practical interest such as the number of haplotypes and the coverage, or the number of haplotypes and the fragment length.",
keywords = "combinatorial optimization, graph theory, haplotypes, next-generation sequencing",
author = "Paola Bonizzoni and Riccardo Dondi and Klau, {Gunnar W.} and Yuri Pirola and Nadia Pisanti and Simone Zaccaria",
year = "2016",
month = "9",
day = "1",
doi = "10.1089/cmb.2015.0220",
language = "English",
volume = "23",
pages = "718--736",
journal = "Journal of Computational Biology",
issn = "1066-5277",
publisher = "Mary Ann Liebert Inc.",
number = "9",

}

Bonizzoni, P, Dondi, R, Klau, GW, Pirola, Y, Pisanti, N & Zaccaria, S 2016, 'On the Minimum Error Correction Problem for Haplotype Assembly in Diploid and Polyploid Genomes' Journal of Computational Biology, vol. 23, no. 9, pp. 718-736. https://doi.org/10.1089/cmb.2015.0220

On the Minimum Error Correction Problem for Haplotype Assembly in Diploid and Polyploid Genomes. / Bonizzoni, Paola; Dondi, Riccardo; Klau, Gunnar W.; Pirola, Yuri; Pisanti, Nadia; Zaccaria, Simone.

In: Journal of Computational Biology, Vol. 23, No. 9, 01.09.2016, p. 718-736.

Research output: Contribution to JournalArticleAcademicpeer-review

TY - JOUR

T1 - On the Minimum Error Correction Problem for Haplotype Assembly in Diploid and Polyploid Genomes

AU - Bonizzoni, Paola

AU - Dondi, Riccardo

AU - Klau, Gunnar W.

AU - Pirola, Yuri

AU - Pisanti, Nadia

AU - Zaccaria, Simone

PY - 2016/9/1

Y1 - 2016/9/1

N2 - In diploid genomes, haplotype assembly is the computational problem of reconstructing the two parental copies, called haplotypes, of each chromosome starting from sequencing reads, called fragments, possibly affected by sequencing errors. Minimum error correction (MEC) is a prominent computational problem for haplotype assembly and, given a set of fragments, aims at reconstructing the two haplotypes by applying the minimum number of base corrections. MEC is computationally hard to solve, but some approximation-based or fixed-parameter approaches have been proved capable of obtaining accurate results on real data. In this work, we expand the current characterization of the computational complexity of MEC from the approximation and the fixed-parameter tractability point of view. In particular, we show that MEC is not approximable within a constant factor, whereas it is approximable within a logarithmic factor in the size of the input. Furthermore, we answer open questions on the fixed-parameter tractability for parameters of classical or practical interest: the total number of corrections and the fragment length. In addition, we present a direct 2-approximation algorithm for a variant of the problem that has also been applied in the framework of clustering data. Finally, since polyploid genomes, such as those of plants and fishes, are composed of more than two copies of the chromosomes, we introduce a novel formulation of MEC, namely the k-ploid MEC problem, that extends the traditional problem to deal with polyploid genomes. We show that the novel formulation is still both computationally hard and hard to approximate. Nonetheless, from the parameterized point of view, we prove that the problem is tractable for parameters of practical interest such as the number of haplotypes and the coverage, or the number of haplotypes and the fragment length.

AB - In diploid genomes, haplotype assembly is the computational problem of reconstructing the two parental copies, called haplotypes, of each chromosome starting from sequencing reads, called fragments, possibly affected by sequencing errors. Minimum error correction (MEC) is a prominent computational problem for haplotype assembly and, given a set of fragments, aims at reconstructing the two haplotypes by applying the minimum number of base corrections. MEC is computationally hard to solve, but some approximation-based or fixed-parameter approaches have been proved capable of obtaining accurate results on real data. In this work, we expand the current characterization of the computational complexity of MEC from the approximation and the fixed-parameter tractability point of view. In particular, we show that MEC is not approximable within a constant factor, whereas it is approximable within a logarithmic factor in the size of the input. Furthermore, we answer open questions on the fixed-parameter tractability for parameters of classical or practical interest: the total number of corrections and the fragment length. In addition, we present a direct 2-approximation algorithm for a variant of the problem that has also been applied in the framework of clustering data. Finally, since polyploid genomes, such as those of plants and fishes, are composed of more than two copies of the chromosomes, we introduce a novel formulation of MEC, namely the k-ploid MEC problem, that extends the traditional problem to deal with polyploid genomes. We show that the novel formulation is still both computationally hard and hard to approximate. Nonetheless, from the parameterized point of view, we prove that the problem is tractable for parameters of practical interest such as the number of haplotypes and the coverage, or the number of haplotypes and the fragment length.

KW - combinatorial optimization

KW - graph theory

KW - haplotypes

KW - next-generation sequencing

UR - http://www.scopus.com/inward/record.url?scp=84986260123&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84986260123&partnerID=8YFLogxK

U2 - 10.1089/cmb.2015.0220

DO - 10.1089/cmb.2015.0220

M3 - Article

VL - 23

SP - 718

EP - 736

JO - Journal of Computational Biology

JF - Journal of Computational Biology

SN - 1066-5277

IS - 9

ER -