Benchmarks

  • De novo assembly of viral quasispecies (HIV-1)

    Details

    This benchmark uses full-length HIV-1 data sets designed for evaluating haplotype reconstruction methods. Five well-studied HIV-1 strains (HXB2, 89.6, JR-CSF, NL4-3, and YU-2) were mixed and sequenced with Illumina MiSeq and 454/Roche GSJunior.

    This benchmark is based on the dataset found at https://github.com/cbg-ethz/5-virus-mix.

    The results for other genome assemblers have been extracted from the following independent report:

    Baaijens, Jasmijn A et al. “De novo assembly of viral quasispecies using overlap graphs.” Genome research vol. 27,5 (2017): 835-848. doi:10.1101/gr.215038.116

    Results

    Assembler                        N50 (bp)  Genome fraction (%)  Max contig length (bp)
    SAVAGE-de-novo                        588                 92.6                   1,221
    SGA                                   635                 32.4                   1,034
    SOAPdenovo2                           591                 41.9                     984
    SPAdes                                591                 42.6                   2,952
    metaSPAdes                          3,266                 53.7                   4,543
    s-aligner (using Illumina data)     9,641                100.0                   9,646
    s-aligner (using Roche data)        9,062                100.0                   9,062

    195% higher N50 than the second-best assembler (s-aligner on Illumina data: 9,641; metaSPAdes: 3,266).
    1,540% higher N50 than the worst-performing assembler (SAVAGE-de-novo: 588).
    8% higher genome fraction than the second-best assembler (s-aligner: 100.0%; SAVAGE-de-novo: 92.6%).
    209% higher genome fraction than the worst-performing assembler (SGA: 32.4%).
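
    The percentages above follow from the results table by plain relative-increase arithmetic. The sketch below is only an illustration: it assumes the comparison uses the s-aligner run on Illumina data (that choice reproduces the stated figures), and the helper name relative_increase is hypothetical.

```python
def relative_increase(new: float, baseline: float) -> float:
    """Return by how many percent `new` exceeds `baseline`."""
    return (new / baseline - 1.0) * 100.0


# N50 (bp) and genome fraction (%) of the other assemblers, copied from the table.
n50 = {"SAVAGE-de-novo": 588, "SGA": 635, "SOAPdenovo2": 591,
       "SPAdes": 591, "metaSPAdes": 3266}
genome_fraction = {"SAVAGE-de-novo": 92.6, "SGA": 32.4, "SOAPdenovo2": 41.9,
                   "SPAdes": 42.6, "metaSPAdes": 53.7}

# Assumption: the s-aligner run on Illumina data is the one being compared.
S_ALIGNER_N50 = 9641
S_ALIGNER_FRACTION = 100.0

n50_vs_best = relative_increase(S_ALIGNER_N50, max(n50.values()))
n50_vs_worst = relative_increase(S_ALIGNER_N50, min(n50.values()))
gf_vs_best = relative_increase(S_ALIGNER_FRACTION, max(genome_fraction.values()))
gf_vs_worst = relative_increase(S_ALIGNER_FRACTION, min(genome_fraction.values()))

print(round(n50_vs_best), round(n50_vs_worst), round(gf_vs_best), round(gf_vs_worst))
# 195 1540 8 209
```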

    Download here the contigs obtained by s-aligner for the Illumina MiSeq data.

    Download here the contigs obtained by s-aligner for the 454/Roche GSJunior data.

  • De novo assembly of simulated viral samples (HIV, HCV, ZIKV)

    Details

    The sequencing reads cover the full reference genomes at an average coverage of 20,000×.

    "We created five simulated data sets for benchmarking, consisting of 2 × 250-bp Illumina MiSeq reads and representing quasispecies infections from different viruses: human immunodeficiency virus (HIV), hepatitis C virus (HCV), and Zika virus (ZIKV). We varied the number of strains per sample as well as the relative abundances of those strains and the pairwise divergence between strains. To get data sets as realistic as possible, we used true viral genomes from the NCBI database and Illumina MiSeq error profiles during simulations."

    The dataset and results for other genome assemblers have been extracted from the following independent report:

    Baaijens, Jasmijn A et al. “De novo assembly of viral quasispecies using overlap graphs.” Genome research vol. 27,5 (2017): 835-848. doi:10.1101/gr.215038.116

    Results

    Data set  Assembler       N50 (bp)  Genome fraction (%)  Max contig length (bp)
    HIV       SAVAGE-de-novo     4,913                 99.8                   9,413
    HIV       SGA                  650                 32.4                   1,034
    HIV       SOAPdenovo2          516                 35.7                     844
    HIV       SPAdes             5,873                 91.7                   9,789
    HIV       metaSPAdes         5,159                 32.7                   7,044
    HIV       s-aligner          9,916                100.0                   9,916
    HCV       SAVAGE-de-novo     8,248                 99.6                   9,297
    HCV       SGA                  638                 18.1                     832
    HCV       SOAPdenovo2          531                 22.0                     926
    HCV       SPAdes             8,582                 91.3                   9,311
    HCV       metaSPAdes         1,549                 45.9                   3,041
    HCV       s-aligner          9,285                 96.3                   9,305
    ZIKV      SAVAGE-de-novo     2,103                 99.4                   9,282
    ZIKV      SGA                    0                  0.0                       0
    ZIKV      SOAPdenovo2          562                 21.0                   1,025
    ZIKV      SPAdes             2,577                 65.6                  10,269
    ZIKV      metaSPAdes         3,926                 17.5                   6,495
    ZIKV      s-aligner         10,161                 95.0                  10,165

    79% higher N50 than the second-best assembler, averaged over the per-data-set increases for the three data sets.
    1,726% higher N50 than the worst-performing assembler, averaged the same way (excluding invalid results, i.e. the SGA N50 of 0 on the Zika data set).
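
    The two figures above are not ratios of averaged N50 values. They can be reproduced by computing the relative increase separately for each data set and then averaging the three percentages. A minimal sketch under that assumption (the averaging scheme is inferred from the numbers, not stated in the source; all names are hypothetical):

```python
from statistics import mean

# N50 (bp) of the other assemblers per data set, copied from the results table.
OTHER_ASSEMBLERS = {
    "HIV":  {"SAVAGE-de-novo": 4913, "SGA": 650, "SOAPdenovo2": 516,
             "SPAdes": 5873, "metaSPAdes": 5159},
    "HCV":  {"SAVAGE-de-novo": 8248, "SGA": 638, "SOAPdenovo2": 531,
             "SPAdes": 8582, "metaSPAdes": 1549},
    "Zika": {"SAVAGE-de-novo": 2103, "SGA": 0, "SOAPdenovo2": 562,
             "SPAdes": 2577, "metaSPAdes": 3926},
}
S_ALIGNER = {"HIV": 9916, "HCV": 9285, "Zika": 10161}


def pct_increase(new: float, baseline: float) -> float:
    """Return by how many percent `new` exceeds `baseline`."""
    return (new / baseline - 1.0) * 100.0


# Per data set: compare against the best of the other assemblers, then average.
vs_second_best = mean([
    pct_increase(S_ALIGNER[name], max(n50.values()))
    for name, n50 in OTHER_ASSEMBLERS.items()
])

# Per data set: compare against the worst valid result (an N50 of 0 is skipped).
vs_worst = mean([
    pct_increase(S_ALIGNER[name], min(v for v in n50.values() if v > 0))
    for name, n50 in OTHER_ASSEMBLERS.items()
])

print(round(vs_second_best), round(vs_worst))  # 79 1726
```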

    Download here the contigs obtained by s-aligner.

  • Assembly of SARS-CoV-2

    Details

    A set of runs was selected at random from the Sequence Read Archive (SRA) and assembled with different assemblers.
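
    The results in this section are reported as NG50. As a reminder of what the metric measures, here is a minimal sketch of N50 and NG50 computed from a list of contig lengths. The function name nx50 and the example contigs are hypothetical, returning 0 when the contigs cover less than half of the genome is a convention assumed here, and 29,903 bp is the length of the SARS-CoV-2 reference genome (NC_045512.2), used only for illustration.

```python
from typing import Iterable, Optional


def nx50(contig_lengths: Iterable[int], genome_length: Optional[int] = None) -> int:
    """Return N50, or NG50 when `genome_length` is given.

    N50: length of the contig at which the running total of contig lengths,
    taken from longest to shortest, reaches half of the assembly size.
    NG50: same, but the target is half of the reference genome length.
    """
    lengths = sorted(contig_lengths, reverse=True)
    total = genome_length if genome_length is not None else sum(lengths)
    target = total / 2
    running = 0
    for length in lengths:
        running += length
        if running >= target:
            return length
    return 0  # assumed convention: contigs cover less than half of the target


contigs = [12000, 2000, 1500, 1000]            # hypothetical assembly
print(nx50(contigs))                           # N50  -> 12000
print(nx50(contigs, genome_length=29903))      # NG50 -> 1500
```

    Because NG50 is anchored to the genome length rather than to the assembly size, a fragmented or incomplete assembly cannot score well simply by being small, which makes the metric suitable for comparing runs of very different quality.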

    Results

    Run Id       Hardware  Library design  SPAdes NG50 (bp)
    SRR12351628  MiSeq     ARTIC                      4,371
    SRR13684392  MiSeq     ARTIC                     29,404
    SRR11410529  MiSeq     ARTIC                     19,294
    SRR12045777  MiSeq     ARTIC                     19,338
    SRR12623307  MiSeq     ARTIC                     19,283
    SRR11772204  MiSeq     ARTIC                     29,837
    SRR12045770  MiSeq     ARTIC                      1,412
    SRR11410528  MiSeq     ARTIC                     19,291
    SRR13660064  MiSeq     ARTIC                     16,463
    SRR13623050  MiSeq     ARTIC                     29,842
    SRR13623049  MiSeq     ARTIC                     29,833
    SRR13574254  Illumina  ARTIC                      1,000
    SRR13727443  Illumina  ARTIC                      1,631
    SRR13731834  Illumina  ARTIC                     29,687
    SRR13727440  Illumina  ARTIC                          0
    Average                                          16,712
    Std. deviation                                   11,991

    Run Id       Hardware  Library design  SPAdes NG50 (bp)  s-aligner NG50 (bp)
    SRR12819233            Random                    21,585               29,845
    SRR12445029            Random                     4,980               29,299
    SRR10903401            Random                    29,877
    SRR12481157            Random                    23,583               29,836
    SRR12445036            Random                     5,104               29,112
    SRR13615951            Random                    29,858               29,846
    SRR13615945            Random                    29,852               29,797
    SRR13615944            Random                    29,852               29,829
    SRR13615947            Random                    29,856               29,837
    SRR13615942            Random                    28,307               29,754
    SRR13300938            Random                                         18,500
    SRR12445040            Random                     5,560               29,340
    SRR12445032            Random                     2,674               29,351
    SRR13050769            Random                         0               25,854
    SRR13495171            Random                         0               29,804
    Average                                          17,220               28,654
    Std. deviation                                   13,063                2,984

    Protocols that combine s-aligner with a random-primer library design reach an average NG50 about 71% higher than ARTIC + SPAdes protocols (28,654 vs 16,712). In addition, s-aligner + random produced an almost-perfect assembly in 13 out of 15 runs: an 87% success rate, against 33% (5 out of 15) for ARTIC + SPAdes protocols.
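
    The summary rows for the ARTIC + SPAdes runs can be checked with Python's statistics module. The sketch below is an illustration, not the author's procedure: the dot in the source figures is read as a thousands separator, and the 29,000 bp cut-off for an "almost-perfect" assembly is a hypothetical threshold (the source does not define one) that happens to reproduce the 5-out-of-15 count.

```python
from statistics import mean, stdev

# SPAdes NG50 (bp) of the 15 ARTIC runs, copied from the first results table.
artic_spades_ng50 = [4371, 29404, 19294, 19338, 19283, 29837, 1412, 19291,
                     16463, 29842, 29833, 1000, 1631, 29687, 0]

average = mean(artic_spades_ng50)   # arithmetic mean
spread = stdev(artic_spades_ng50)   # sample standard deviation (n - 1)

# Hypothetical cut-off: the source gives no definition of "almost perfect".
NEAR_COMPLETE_BP = 29000
successes = sum(ng50 >= NEAR_COMPLETE_BP for ng50 in artic_spades_ng50)
success_rate = successes / len(artic_spades_ng50) * 100.0

print(round(average), round(spread), successes, round(success_rate))
# 16712 11991 5 33
```

    The spread reported under the average matches the sample standard deviation of the NG50 values (in bp), not their variance.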

    Download here the data.

  • Assembly of large DNA-viruses from samples containing multiple strains

    Details

    To produce a benchmark dataset of mixed viral strains that also includes the technical artifacts introduced during experimental data generation, the authors of the report cited below created in vitro viral strain mixtures mimicking clinical samples from patients with mixed-strain infections. They combined viral DNA of the human cytomegalovirus (HCMV) strains TB40/E BAC4 and AD169 (designated “TA”), derived directly from bacterial artificial chromosomes (BACs) carrying these viral genomes and prepared from Escherichia coli, or of the strains TB40/E BAC4 and Merlin (designated “TM”), which were amplified in human cell cultures. Both combinations were prepared at mixing ratios of 1:1, 1:10 and 1:50.

    The benchmark set as well as the results for the other genome assemblers have been extracted from the following independent report:

    Zhi-Luo Deng, Akshay Dhingra, Adrian Fritz, Jasper Götting, Philipp C Münch, Lars Steinbrück, Thomas F Schulz, Tina Ganzenmüller, Alice C McHardy, Evaluating assembly and variant calling software for strain-resolved analysis of large DNA viruses, Briefings in Bioinformatics, bbaa123, https://doi.org/10.1093/bib/bbaa123

    In a second stage, the contigs produced by s-aligner were assembled again with Flye, a long-read assembler usually applied to PacBio or Nanopore data, which yields longer contigs. Results are reported both for the Flye output on its own and for the Flye output merged with the s-aligner contigs.

    Results

    Largest alignment (bp)

    Sample       MetaSPAdes      IVA  s-aligner  s-aligner + flye  s-aligner + flye (just flye results)
    TA-0-1           85,249  179,316    157,405           167,902                               167,902
    TA-1-0          192,883  227,574     65,300           193,963                               193,963
    TA-1-10 (T)     198,958    9,085     87,525           167,345                               167,345
    TA-1-10 (A)     164,861   13,953     87,747           219,917                               219,917
    TA-1-1 (T)       84,628  189,434     37,839           192,848                               192,848
    TA-1-1 (A)       84,408  166,414     37,839           170,557                               170,557
    TA-1-50 (T)     162,614  162,754     37,839           195,030                               195,030
    TA-1-50 (A)     168,031  168,389    102,740           171,783                               171,783
    TM-0-1          169,369  170,711    102,966           169,618                               169,618
    TM-1-0          192,629  193,287     78,494           104,573                               104,573
    TM-1-10 (T)      96,882  161,686     57,309           216,544                               216,544
    TM-1-10 (M)     106,000  170,137     57,288           192,116                               192,116
    TM-1-1 (T)       53,377   14,836     37,222           226,447                               226,447
    TM-1-1 (M)       53,377   11,649     36,475           191,661                               191,661
    TM-1-50 (T)     160,252  162,554     48,656           192,934                               192,934
    TM-1-50 (M)     169,370  170,711     49,135           179,208                               179,208

    Genome fraction

    Sample       MetaSPAdes      IVA  s-aligner  s-aligner + flye  s-aligner + flye (just flye results)
    TA-0-1              95%      95%       100%              100%                                  100%
    TA-1-0              99%      99%       100%              100%                                  100%
    TA-1-10 (T)        100%      38%       100%              100%                                   90%
    TA-1-10 (A)         99%      44%       100%              100%                                   99%
    TA-1-1 (T)         100%      99%       100%              100%                                   99%
    TA-1-1 (A)          99%      98%        99%               99%                                   93%
    TA-1-50 (T)         99%      99%        89%               89%                                   85%
    TA-1-50 (A)         99%      99%        95%               95%                                   90%
    TM-0-1              95%      95%        96%               96%                                   79%
    TM-1-0              99%      99%       100%              100%                                   84%
    TM-1-10 (T)         97%     100%       100%              100%                                   95%
    TM-1-10 (M)         98%      97%        94%               94%                                   92%
    TM-1-1 (T)         100%      47%       100%              100%                                  100%
    TM-1-1 (M)          98%      43%        93%               93%                                   92%
    TM-1-50 (T)         99%      99%        96%               96%                                   95%
    TM-1-50 (M)         97%      96%        96%               96%                                   96%

    48% increased performance in the Largest Alignment metric over the second best assembler.

    Download here the contigs obtained from all assemblers.

  • De novo assembly of human transcriptome

    Details

    This benchmark was produced by applying different de novo assemblers to a human transcriptome sample: whole-tissue RNA-seq of human melanoma metastases from patients before and during anti-CD20 antibody therapy.

    Results

    Metric                               rnaSPAdes  Megahit  s-aligner  rnaSPAdes + Megahit  s-aligner + rnaSPAdes + Megahit
    == BASIC TRANSCRIPTS METRICS ==
    Transcripts                              26609    17066      73587                43675                           117262
    Transcripts > 500 bp                      8893     8801      13469                17694                            31163
    Transcripts > 1000 bp                     2904     2310       2615                 5214                             7829
    == ALIGNMENT METRICS FOR NON-MISASSEMBLED TRANSCRIPTS ==
    Avg. aligned fraction                    0.865     0.93       0.84                0.893                            0.859
    Avg. alignment length                  501.495  607.126    331.203              546.325                          409.139
    Avg. mismatches per transcript           6.023    8.441       4.93                 7.05                            5.698
    == ALIGNMENT METRICS FOR MISASSEMBLED (CHIMERIC) TRANSCRIPTS ==
    Misassemblies                              143      180        284                  323                              607
    == ASSEMBLY COMPLETENESS (SENSITIVITY) ==
    Database coverage                        0.019    0.025      0.025                0.025                            0.034
    Duplication ratio                        1.084    1.672      1.672                1.584                            2.443
    50%-assembled genes                       3041     2948       3363                 3661                             4237
    95%-assembled genes                        573      461        450                  740                              839
    50%-covered genes                         3451     3354       4141                 4148                             4901
    95%-covered genes                          638      527        687                  875                             1077
    50%-assembled isoforms                    3428     3194       4577                 4730                             6711
    95%-assembled isoforms                     591      461        452                  771                              883
    50%-covered isoforms                      3924     3677       6087                 5476                             8351
    95%-covered isoforms                       658      527        693                  912                             1148
    Mean isoform coverage                    0.407    0.427       0.39                0.441                            0.429
    Mean isoform assembly                    0.375    0.396      0.343                0.406                            0.386
    == ASSEMBLY SPECIFICITY ==
    50%-matched                              15223    12160      46102                27383                            73485
    95%-matched                               6704     7418      18370                14122                            32492
    Unannotated                               2701     1487       7096                 4188                            11284
    Mean fraction of transcript matched      0.709    0.793      0.699                0.744                            0.716

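    Adding s-aligner to the combined rnaSPAdes + Megahit assembly (s-aligner + rnaSPAdes + Megahit vs rnaSPAdes + Megahit) gives:
    16% more 50%-assembled genes (4237 vs 3661).
    13% more 95%-assembled genes (839 vs 740).
    18% more 50%-covered genes (4901 vs 4148).
    23% more 95%-covered genes (1077 vs 875).
    19% fewer mismatches per transcript (5.698 vs 7.05).
    36% higher database coverage (0.034 vs 0.025).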

    Download here the transcripts obtained by s-aligner.

    Download here the raw reads.