Performance of Reference-Independent Classifiers for Incomplete Genomes

With the advent of whole-genome shotgun sequencing, it has become possible to obtain the entire sequence of a bacterial genome quickly and at low cost. The bottleneck in genome analysis now lies in processing the highly fragmented data produced by next-generation sequencing platforms. Current assembly algorithms struggle to reconstruct complete bacterial genome sequences from these reads, so results based on one assembly may not be reproducible when a different algorithm is used. Metagenomic datasets introduce an additional level of complexity: whereas in single-genome analyses all reads can be assumed to belong to a single genome, the origin of each read in a metagenomic dataset is unknown. A key question in both single-genome and metagenome analyses is how to distinguish closely related organisms. For single-genome analyses, this ability could be used to determine whether two bacterial samples share the same origin and to track specific bacterial strains; for metagenomic samples, it translates into the ability to assign reads to the correct organism. Here, we analyze the performance of common comparative genomic methods as a function of read size, using real read data from 104 bacterial genomes with verified taxonomy. The results allow us to confirm the range of applicability of each method and its expected error rates.


Gleb Goussarov (1,2)
Ilse Cleenwerck (1)
Mohamed Mysara (2)
Natalie Leys (2)
Peter Vandamme (1)
Pieter Monsieurs (2)


Ghent University (1)

Presenting author

Gleb Goussarov, PhD Student, SCK-CEN