HIV-1 full-length benchmarking data sets for haplotype reconstruction methods, sequenced with Illumina MiSeq and 454/Roche GSJunior. Five well-studied HIV-1 strains (HXB2, 89.6, JR-CSF, NL4-3, and YU-2) have been mixed and sequenced.
This benchmark is based on the dataset found at https://github.com/cbg-ethz/5-virus-mix.
The results for other genome assemblers have been extracted from the following independent report:
Baaijens, Jasmijn A et al. “De novo assembly of viral quasispecies using overlap graphs.” Genome research vol. 27,5 (2017): 835-848. doi:10.1101/gr.215038.116
Assembler | N50 | Genome fraction (%) | Max contig length |
---|---|---|---|
SAVAGE-de-novo | 588 | 92.6 | 1,221 |
SGA | 635 | 32.4 | 1,034 |
SOAPdenovo2 | 591 | 41.9 | 984 |
SPAdes | 591 | 42.6 | 2,952 |
metaSPAdes | 3,266 | 53.7 | 4,543 |
s-aligner (using Illumina data) | 9,641 | 100.0 | 9,646 |
s-aligner (using Roche data) | 9,062 | 100.0 | 9,062 |
The sequencing reads cover the full reference genomes at an average depth of 20,000×.
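For readers less familiar with the N50 and NG50 columns used in these tables, the following is a minimal sketch of the usual definitions. It is not taken from any of the benchmarked tools. The function name `nx50` and the contig lengths are hypothetical and chosen only for illustration. Returning 0 when the assembly never reaches half of the reference length is an assumption about convention, not something the source states.

```python
def nx50(contig_lengths, total_length=None):
    """Length L such that contigs of length >= L cover at least half of total_length.

    With total_length=None the assembly size is used, which gives N50.
    Passing a reference genome length instead gives NG50.
    Returns 0 if the contigs never reach half of total_length (assumed convention).
    """
    lengths = sorted(contig_lengths, reverse=True)
    if total_length is None:
        total_length = sum(lengths)
    target = total_length / 2
    running = 0
    for length in lengths:
        running += length
        if running >= target:
            return length
    return 0


# Hypothetical contig lengths in bp, not taken from the benchmark.
contigs = [4000, 3000, 2000, 1000]

n50 = nx50(contigs)                               # half of 10,000 is reached at the 3,000 bp contig
ng50 = nx50(contigs, total_length=16000)          # half of 16,000 is reached at the 2,000 bp contig
ng50_short = nx50(contigs, total_length=30000)    # the assembly never reaches 15,000 bp

print(n50, ng50, ng50_short)  # 3000 2000 0
```

Because NG50 is measured against the reference length rather than the assembly size, it penalizes assemblies that recover only part of the genome, which N50 alone does not.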
"We created five simulated data sets for benchmarking, consisting of 2 × 250-bp Illumina MiSeq reads and representing quasispecies infections from different viruses: human immunodeficiency virus (HIV), hepatitis C virus (HCV), and Zika virus (ZIKV). We varied the number of strains per sample as well as the relative abundances of those strains and the pairwise divergence between strains. To get data sets as realistic as possible, we used true viral genomes from the NCBI database and Illumina MiSeq error profiles during simulations."
The dataset and results for other genome assemblers have been extracted from the following independent report:
Baaijens, Jasmijn A et al. “De novo assembly of viral quasispecies using overlap graphs.” Genome research vol. 27,5 (2017): 835-848. doi:10.1101/gr.215038.116
Assembler | N50 | Genome fraction (%) | Max contig length |
---|---|---|---|
HIV | |||
SAVAGE-de-novo | 4,913 | 99.8 | 9,413 |
SGA | 650 | 32.4 | 1,034 |
SOAPdenovo2 | 516 | 35.7 | 844 |
SPAdes | 5,873 | 91.7 | 9,789 |
metaSPAdes | 5,159 | 32.7 | 7,044 |
s-aligner | 9,916 | 100.0 | 9,916 |
HCV | |||
SAVAGE-de-novo | 8,248 | 99.6 | 9,297 |
SGA | 638 | 18.1 | 832 |
SOAPdenovo2 | 531 | 22.0 | 926 |
SPAdes | 8,582 | 91.3 | 9,311 |
metaSPAdes | 1,549 | 45.9 | 3,041 |
s-aligner | 9,285 | 96.3 | 9,305 |
ZIKV | |||
SAVAGE-de-novo | 2,103 | 99.4 | 9,282 |
SGA | 0 | 0.0 | 0 |
SOAPdenovo2 | 562 | 21.0 | 1,025 |
SPAdes | 2,577 | 65.6 | 10,269 |
metaSPAdes | 3,926 | 17.5 | 6,495 |
s-aligner | 10,161 | 95.0 | 10,165 |
Some data sets were selected at random from the SRA archive and assembled using different assemblers.
Run Id | Hardware | Library design | SPAdes NG50 |
---|---|---|---|
SRR12351628 | MiSeq | ARTIC | 4.371 |
SRR13684392 | MiSeq | ARTIC | 29.404 |
SRR11410529 | MiSeq | ARTIC | 19.294 |
SRR12045777 | MiSeq | ARTIC | 19.338 |
SRR12623307 | MiSeq | ARTIC | 19.283 |
SRR11772204 | MiSeq | ARTIC | 29.837 |
SRR12045770 | MiSeq | ARTIC | 1.412 |
SRR11410528 | MiSeq | ARTIC | 19.291 |
SRR13660064 | MiSeq | ARTIC | 16.463 |
SRR13623050 | MiSeq | ARTIC | 29.842 |
SRR13623049 | MiSeq | ARTIC | 29.833 |
SRR13574254 | Illumina | ARTIC | 1.000 |
SRR13727443 | Illumina | ARTIC | 1.631 |
SRR13731834 | Illumina | ARTIC | 29.687 |
SRR13727440 | Illumina | ARTIC | 0 |
Average | | | 16.712 |
Standard deviation | | | 11.991 |
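The two summary figures under the ARTIC table can be reproduced from the listed values. The sketch below uses the SPAdes NG50 column exactly as printed and Python's standard `statistics` module; the variable names are illustrative. The mean rounds to the printed 16.712, and the second summary figure, 11.991, is reproduced by the sample standard deviation (n − 1 denominator). The variance of the same values would be about 143.8.

```python
from statistics import mean, stdev, variance

# SPAdes NG50 values copied from the ARTIC table above, in row order.
spades_ng50 = [
    4.371, 29.404, 19.294, 19.338, 19.283,
    29.837, 1.412, 19.291, 16.463, 29.842,
    29.833, 1.000, 1.631, 29.687, 0.0,
]

average = mean(spades_ng50)
spread = stdev(spades_ng50)        # sample standard deviation, n - 1 denominator
squared = variance(spades_ng50)    # sample variance, for comparison

print(f"average            = {average:.3f}")
print(f"sample std. dev.   = {spread:.2f}")
print(f"sample variance    = {squared:.1f}")
```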
Run Id | Hardware | Library design | SPAdes NG50 | s-aligner NG50 |
---|---|---|---|---|
SRR12819233 | | Random | 21.585 | 29.845 |
SRR12445029 | | Random | 4.980 | 29.299 |
SRR10903401 | | Random | 29.877 | |
SRR12481157 | | Random | 23.583 | 29.836 |
SRR12445036 | | Random | 5.104 | 29.112 |
SRR13615951 | | Random | 29.858 | 29.846 |
SRR13615945 | | Random | 29.852 | 29.797 |
SRR13615944 | | Random | 29.852 | 29.829 |
SRR13615947 | | Random | 29.856 | 29.837 |
SRR13615942 | | Random | 28.307 | 29.754 |
SRR13300938 | | Random | | 18.500 |
SRR12445040 | | Random | 5.560 | 29.340 |
SRR12445032 | | Random | 2.674 | 29.351 |
SRR13050769 | | Random | 0 | 25.854 |
SRR13495171 | | Random | 0 | 29.804 |
Average | | | 17.220 | 28.654 |
Standard deviation | | | 13.063 | 2.984 |
To produce a benchmark dataset of mixed viral strains that also includes the technical artifacts introduced during experimental data generation, the authors of the report cited below created in vitro viral strain mixtures mimicking clinical samples from patients with mixed-strain infections. For this, they combined viral DNA of the HCMV strains TB40/E BAC4 and AD169 (designated “TA”), derived directly from bacterial artificial chromosomes (BACs) carrying these viral genomes and prepared from Escherichia coli, or of the strains TB40/E BAC4 and Merlin (designated “TM”), which were amplified in human cell cultures, at mixing ratios of 1:1, 1:10 and 1:50.
The benchmark set as well as the results for the other genome assemblers have been extracted from the following independent report:
Zhi-Luo Deng, Akshay Dhingra, Adrian Fritz, Jasper Götting, Philipp C Münch, Lars Steinbrück, Thomas F Schulz, Tina Ganzenmüller, Alice C McHardy, Evaluating assembly and variant calling software for strain-resolved analysis of large DNA viruses, Briefings in Bioinformatics, bbaa123, https://doi.org/10.1093/bib/bbaa123
In a second stage, the contigs produced by s-aligner were assembled again with Flye, a long-read assembler usually applied to PacBio or Nanopore data, in order to obtain longer contigs. The Flye results are shown both on their own and merged with the contigs obtained by s-aligner.
Largest alignment | TA-0-1 | TA-1-0 | TA-1-10 (T) | TA-1-10 (A) | TA-1-1 (T) | TA-1-1 (A) | TA-1-50 (T) | TA-1-50 (A) | TM-0-1 | TM-1-0 | TM-1-10 (T) | TM-1-10 (M) | TM-1-1 (T) | TM-1-1 (M) | TM-1-50 (T) | TM-1-50 (M) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
MetaSPAdes | 85,249 | 192,883 | 198,958 | 164,861 | 84,628 | 84,408 | 162,614 | 168,031 | 169,369 | 192,629 | 96,882 | 106,000 | 53,377 | 53,377 | 160,252 | 169,370 |
IVA | 179,316 | 227,574 | 9,085 | 13,953 | 189,434 | 166,414 | 162,754 | 168,389 | 170,711 | 193,287 | 161,686 | 170,137 | 14,836 | 11,649 | 162,554 | 170,711 |
s-aligner | 157,405 | 65,300 | 87,525 | 87,747 | 37,839 | 37,839 | 37,839 | 102,740 | 102,966 | 78,494 | 57,309 | 57,288 | 37,222 | 36,475 | 48,656 | 49,135 |
s-aligner + flye | 167,902 | 193,963 | 167,345 | 219,917 | 192,848 | 170,557 | 195,030 | 171,783 | 169,618 | 104,573 | 216,544 | 192,116 | 226,447 | 191,661 | 192,934 | 179,208 |
s-aligner + flye (just flye results) | 167,902 | 193,963 | 167,345 | 219,917 | 192,848 | 170,557 | 195,030 | 171,783 | 169,618 | 104,573 | 216,544 | 192,116 | 226,447 | 191,661 | 192,934 | 179,208 |
Genome fraction | TA-0-1 | TA-1-0 | TA-1-10 (T) | TA-1-10 (A) | TA-1-1 (T) | TA-1-1 (A) | TA-1-50 (T) | TA-1-50 (A) | TM-0-1 | TM-1-0 | TM-1-10 (T) | TM-1-10 (M) | TM-1-1 (T) | TM-1-1 (M) | TM-1-50 (T) | TM-1-50 (M) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
MetaSPAdes | 95% | 99% | 100% | 99% | 100% | 99% | 99% | 99% | 95% | 99% | 97% | 98% | 100% | 98% | 99% | 97% |
IVA | 95% | 99% | 38% | 44% | 99% | 98% | 99% | 99% | 95% | 99% | 100% | 97% | 47% | 43% | 99% | 96% |
s-aligner | 100% | 100% | 100% | 100% | 100% | 99% | 89% | 95% | 96% | 100% | 100% | 94% | 100% | 93% | 96% | 96% |
s-aligner + flye | 100% | 100% | 100% | 100% | 100% | 99% | 89% | 95% | 96% | 100% | 100% | 94% | 100% | 93% | 96% | 96% |
s-aligner + flye (just flye results) | 100% | 100% | 90% | 99% | 99% | 93% | 85% | 90% | 79% | 84% | 95% | 92% | 100% | 92% | 95% | 96% |
This benchmark was produced by applying different de novo assembly software to a human transcriptome sample.
Whole tissue RNA-seq of human melanoma metastasis of patients prior and on anti-CD20 antibody therapy.
Metric | rnaSPAdes | Megahit | s-aligner | rnaSPAdes + Megahit | s-aligner + rnaSPAdes + Megahit |
---|---|---|---|---|---|
== BASIC TRANSCRIPTS METRICS == | |||||
Transcripts | 26609 | 17066 | 73587 | 43675 | 117262 |
Transcripts > 500 bp | 8893 | 8801 | 13469 | 17694 | 31163 |
Transcripts > 1000 bp | 2904 | 2310 | 2615 | 5214 | 7829 |
== ALIGNMENT METRICS FOR NON-MISASSEMBLED TRANSCRIPTS == | |||||
Avg. aligned fraction | 0.865 | 0.93 | 0.84 | 0.893 | 0.859 |
Avg. alignment length | 501.495 | 607.126 | 331.203 | 546.325 | 409.139 |
Avg. mismatches per transcript | 6.023 | 8.441 | 4.93 | 7.05 | 5.698 |
== ALIGNMENT METRICS FOR MISASSEMBLED (CHIMERIC) TRANSCRIPTS == | |||||
Misassemblies | 143 | 180 | 284 | 323 | 607 |
== ASSEMBLY COMPLETENESS (SENSITIVITY) == | |||||
Database coverage | 0.019 | 0.025 | 0.025 | 0.025 | 0.034 |
Duplication ratio | 1.084 | 1.672 | 1.672 | 1.584 | 2.443 |
50%-assembled genes | 3041 | 2948 | 3363 | 3661 | 4237 |
95%-assembled genes | 573 | 461 | 450 | 740 | 839 |
50%-covered genes | 3451 | 3354 | 4141 | 4148 | 4901 |
95%-covered genes | 638 | 527 | 687 | 875 | 1077 |
50%-assembled isoforms | 3428 | 3194 | 4577 | 4730 | 6711 |
95%-assembled isoforms | 591 | 461 | 452 | 771 | 883 |
50%-covered isoforms | 3924 | 3677 | 6087 | 5476 | 8351 |
95%-covered isoforms | 658 | 527 | 693 | 912 | 1148 |
Mean isoform coverage | 0.407 | 0.427 | 0.39 | 0.441 | 0.429 |
Mean isoform assembly | 0.375 | 0.396 | 0.343 | 0.406 | 0.386 |
== ASSEMBLY SPECIFICITY == | |||||
50%-matched | 15223 | 12160 | 46102 | 27383 | 73485 |
95%-matched | 6704 | 7418 | 18370 | 14122 | 32492 |
Unannotated | 2701 | 1487 | 7096 | 4188 | 11284 |
Mean fraction of transcript matched | 0.709 | 0.793 | 0.699 | 0.744 | 0.716 |
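The source does not say how the two combined columns were produced. The count-type rows, however, are consistent with the combined assemblies being plain concatenations of the individual transcript sets. The sketch below checks this arithmetic on the rows copied from the table above; the helper `is_additive` is hypothetical and the conclusion is an inference from the numbers, not a statement from the authors.

```python
# Count-type rows copied from the table above. Column order:
# (rnaSPAdes, Megahit, s-aligner, rnaSPAdes + Megahit, s-aligner + rnaSPAdes + Megahit)
count_rows = {
    "Transcripts": (26609, 17066, 73587, 43675, 117262),
    "Transcripts > 500 bp": (8893, 8801, 13469, 17694, 31163),
    "Transcripts > 1000 bp": (2904, 2310, 2615, 5214, 7829),
    "Misassemblies": (143, 180, 284, 323, 607),
    "50%-matched": (15223, 12160, 46102, 27383, 73485),
    "95%-matched": (6704, 7418, 18370, 14122, 32492),
    "Unannotated": (2701, 1487, 7096, 4188, 11284),
}


def is_additive(row):
    """True if each combined column equals the sum of its component columns."""
    rnaspades, megahit, s_aligner, pair, all_three = row
    return rnaspades + megahit == pair and pair + s_aligner == all_three


additive = {name: is_additive(row) for name, row in count_rows.items()}
for name, ok in additive.items():
    print(f"{name}: {'additive' if ok else 'not additive'}")
```

If the combined sets are indeed concatenations, the higher gene and isoform counts in the combined columns come with the correspondingly higher duplication ratio (2.443) shown in the table, since transcripts recovered by more than one assembler are counted more than once.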