Contignant - Benchmarks

De novo assembly of viral quasispecies (HIV-1)

Details

HIV-1 full-length benchmarking data sets for haplotype reconstruction methods, sequenced with Illumina MiSeq and 454/Roche GSJunior. Five well-studied HIV-1 strains (HXB2, 89.6, JR-CSF, NL4-3, and YU-2) have been mixed and sequenced.

This benchmark is based on the dataset found at https://github.com/cbg-ethz/5-virus-mix.

The results for other genome assemblers have been extracted from the following independent report:

Baaijens, Jasmijn A et al. “De novo assembly of viral quasispecies using overlap graphs.” Genome research vol. 27,5 (2017): 835-848. doi:10.1101/gr.215038.116

Results

	N50	Genome fraction (%)	Max contig length
SAVAGE-de-novo	588	92.6	1,221
SGA	635	32.4	1,034
SOAPdenovo2	591	41.9	984
SPAdes	591	42.6	2,952
metaSPAdes	3,266	53.7	4,543
s-aligner (using Illumina data)	9,641	100.0	9,646
s-aligner (using Roche data)	9,062	100.0	9,062

195% higher N50 than the second-best assembler, metaSPAdes (9,641 vs 3,266, with s-aligner run on the Illumina data).

1,540% higher N50 than the worst-performing assembler, SAVAGE-de-novo (9,641 vs 588).

8% higher genome fraction than the second-best assembler, SAVAGE-de-novo (100.0% vs 92.6%).

209% higher genome fraction than the worst-performing assembler, SGA (100.0% vs 32.4%).
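
The percentages above can be reproduced from the results table as relative increases, (s-aligner - other) / other x 100, taking the s-aligner Illumina row as the reference. The following Python sketch is only an illustration of that arithmetic; the dictionary and function names are made up for this example and do not belong to any tool.

```python
# N50 and genome fraction (%) of the other assemblers, copied from the table.
n50 = {"SAVAGE-de-novo": 588, "SGA": 635, "SOAPdenovo2": 591,
       "SPAdes": 591, "metaSPAdes": 3266}
genome_fraction = {"SAVAGE-de-novo": 92.6, "SGA": 32.4, "SOAPdenovo2": 41.9,
                   "SPAdes": 42.6, "metaSPAdes": 53.7}

# s-aligner results on the Illumina data.
S_ALIGNER_N50 = 9641
S_ALIGNER_GF = 100.0


def pct_increase(new, old):
    """Relative increase of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


def vs_best_and_worst(ours, others):
    """Rounded gain over the best and over the worst competitor."""
    values = list(others.values())
    return (round(pct_increase(ours, max(values))),
            round(pct_increase(ours, min(values))))


n50_gain = vs_best_and_worst(S_ALIGNER_N50, n50)
gf_gain = vs_best_and_worst(S_ALIGNER_GF, genome_fraction)
print(n50_gain)  # (195, 1540)
print(gf_gain)   # (8, 209)
```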

Download here the contigs obtained by s-aligner for the Illumina MiSeq data.

Download here the contigs obtained by s-aligner for the 454/Roche GSJunior data.

De novo assembly of simulated viral samples (HIV, HCV, ZIKV)

Details

The sequencing reads covered the full reference genomes at an average coverage of 20,000×.

"We created five simulated data sets for benchmarking, consisting of 2 × 250-bp Illumina MiSeq reads and representing quasispecies infections from different viruses: human immunodeficiency virus (HIV), hepatitis C virus (HCV), and Zika virus (ZIKV). We varied the number of strains per sample as well as the relative abundances of those strains and the pairwise divergence between strains. To get data sets as realistic as possible, we used true viral genomes from the NCBI database and Illumina MiSeq error profiles during simulations."

The dataset and results for other genome assemblers have been extracted from the following independent report:

Baaijens, Jasmijn A et al. “De novo assembly of viral quasispecies using overlap graphs.” Genome research vol. 27,5 (2017): 835-848. doi:10.1101/gr.215038.116

Results

	N50	Genome fraction (%)	Max contig length
HIV
SAVAGE-de-novo	4,913	99.8	9,413
SGA	650	32.4	1,034
SOAPdenovo2	516	35.7	844
SPAdes	5,873	91.7	9,789
metaSPAdes	5,159	32.7	7,044
s-aligner	9,916	100.0	9,916
HCV
SAVAGE-de-novo	8,248	99.6	9,297
SGA	638	18.1	832
SOAPdenovo2	531	22.0	926
SPAdes	8,582	91.3	9,311
metaSPAdes	1,549	45.9	3,041
s-aligner	9,285	96.3	9,305
ZIKV
SAVAGE-de-novo	2,103	99.4	9,282
SGA	0	0.0	0
SOAPdenovo2	562	21.0	1,025
SPAdes	2,577	65.6	10,269
metaSPAdes	3,926	17.5	6,495
s-aligner	10,161	95.0	10,165

79% higher N50 than the second-best assembler, averaged over the three data sets.

1,726% higher N50 than the worst-performing assembler, averaged over the three data sets (excluding invalid results, such as the N50 of 0 reported for SGA).
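
Both figures appear to be averages of per-data-set gains: for each virus, the s-aligner N50 is compared with the best (or the worst non-zero) N50 among the other assemblers, and the three percentages are then averaged. That reading is an assumption, not something stated in the source, but it reproduces the numbers, as this Python sketch with illustrative names shows.

```python
# N50 of the other assemblers per data set, in table order:
# SAVAGE-de-novo, SGA, SOAPdenovo2, SPAdes, metaSPAdes.
competitors = {
    "HIV": [4913, 650, 516, 5873, 5159],
    "HCV": [8248, 638, 531, 8582, 1549],
    "ZIKV": [2103, 0, 562, 2577, 3926],
}
s_aligner = {"HIV": 9916, "HCV": 9285, "ZIKV": 10161}


def mean_gain(pick):
    """Average percent gain of s-aligner over the competitor chosen by `pick`.

    `pick` is max (best competitor) or min (worst competitor). An N50 of 0
    is treated as an invalid result and skipped.
    """
    gains = []
    for virus, ours in s_aligner.items():
        valid = [v for v in competitors[virus] if v > 0]
        reference = pick(valid)
        gains.append((ours - reference) / reference * 100.0)
    return sum(gains) / len(gains)


vs_second_best = round(mean_gain(max))
vs_worst = round(mean_gain(min))
print(vs_second_best, vs_worst)  # 79 1726
```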

Download here the contigs obtained by s-aligner.

Assembly of SARS-CoV-2

Details

Several data sets were selected at random from the SRA and assembled using different assemblers.
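
The tables in this section report NG50: with contigs sorted from longest to shortest, it is the length of the contig at which the cumulative length first reaches half of the reference genome length (N50 uses half of the total assembly length instead). The Python sketch below illustrates the definition only. The contig lengths are invented, and 29,903 bp is the length of the Wuhan-Hu-1 reference genome, which is an assumption about the reference rather than something stated here.

```python
def nx50(contig_lengths, target_total):
    """Length of the contig at which the running sum reaches half of target_total."""
    half = target_total / 2
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= half:
            return length
    return 0  # the assembly never reaches half of the target


def n50(contig_lengths):
    """N50: the target is the total length of the assembly itself."""
    return nx50(contig_lengths, sum(contig_lengths))


def ng50(contig_lengths, genome_length):
    """NG50: the target is the length of the reference genome."""
    return nx50(contig_lengths, genome_length)


GENOME_LENGTH = 29903                 # assumed reference length in bp
contigs = [12000, 3000, 2500, 2000]   # hypothetical contig lengths in bp

print(n50(contigs))                   # 12000
print(ng50(contigs, GENOME_LENGTH))   # 3000
print(ng50([5000], GENOME_LENGTH))    # 0
```

The last call shows why an assembly can have an NG50 of 0: its contigs together cover less than half of the reference genome.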

Results

Run Id	Hardware	Library design	SPAdes NG50 (kbp)
SRR12351628	MiSeq	ARTIC	4.371
SRR13684392	MiSeq	ARTIC	29.404
SRR11410529	MiSeq	ARTIC	19.294
SRR12045777	MiSeq	ARTIC	19.338
SRR12623307	MiSeq	ARTIC	19.283
SRR11772204	MiSeq	ARTIC	29.837
SRR12045770	MiSeq	ARTIC	1.412
SRR11410528	MiSeq	ARTIC	19.291
SRR13660064	MiSeq	ARTIC	16.463
SRR13623050	MiSeq	ARTIC	29.842
SRR13623049	MiSeq	ARTIC	29.833
SRR13574254	Illumina	ARTIC	1.000
SRR13727443	Illumina	ARTIC	1.631
SRR13731834	Illumina	ARTIC	29.687
SRR13727440	Illumina	ARTIC	0
Average			16.712
Standard deviation			11.991

Run Id	Library design	SPAdes NG50 (kbp)	s-aligner NG50 (kbp)
SRR12819233	Random	21.585	29.845
SRR12445029	Random	4.980	29.299
SRR10903401	Random	29.877
SRR12481157	Random	23.583	29.836
SRR12445036	Random	5.104	29.112
SRR13615951	Random	29.858	29.846
SRR13615945	Random	29.852	29.797
SRR13615944	Random	29.852	29.829
SRR13615947	Random	29.856	29.837
SRR13615942	Random	28.307	29.754
SRR13300938	Random		18.500
SRR12445040	Random	5.560	29.340
SRR12445032	Random	2.674	29.351
SRR13050769	Random	0	25.854
SRR13495171	Random	0	29.804
Average		17.220	28.654
Standard deviation		13.063	2.984

Protocols that combine s-aligner with a random-primer library design reach an average NG50 about 71% higher than ARTIC + SPAdes protocols (28.654 vs 16.712). In addition, s-aligner + random produced an almost-perfect assembly in 13 of 15 runs, an 87% success rate, against a 33% success rate for ARTIC + SPAdes protocols.
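
The ARTIC + SPAdes figures quoted above can be recomputed from the per-run values in the first table. The sketch below does so with Python's statistics module under two assumptions that are not stated in the source: the NG50 values are read as kbp exactly as printed, and a run counts as almost perfect when its NG50 is at least 29 kbp. With these values, the spread printed under the average matches the sample standard deviation (statistics.stdev).

```python
from statistics import mean, stdev

# SPAdes NG50 for the 15 ARTIC runs, in table order.
artic_spades = [4.371, 29.404, 19.294, 19.338, 19.283, 29.837, 1.412,
                19.291, 16.463, 29.842, 29.833, 1.000, 1.631, 29.687, 0.0]

NEAR_COMPLETE = 29.0  # assumed threshold for an "almost-perfect" assembly

avg = mean(artic_spades)
spread = stdev(artic_spades)  # sample standard deviation (n - 1 denominator)
successes = sum(1 for x in artic_spades if x >= NEAR_COMPLETE)
success_rate = successes / len(artic_spades)

print(f"average NG50: {avg:.3f}")
print(f"standard deviation: {spread:.3f}")
print(f"almost-perfect runs: {successes}/{len(artic_spades)} ({success_rate:.0%})")
```

The same calculation cannot be repeated exactly for the s-aligner column, because one of its per-run values is missing from the table.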

Download here the data.

Assembly of large DNA-viruses from samples containing multiple strains

Details

To produce a benchmark data set of mixed viral strains that also includes the technical artifacts introduced during experimental data generation, the authors of the report cited below created in vitro viral strain mixtures mimicking clinical samples from patients with mixed-strain infections. They combined viral DNA from two pairs of human cytomegalovirus (HCMV) strains at mixing ratios of 1:1, 1:10 and 1:50. The first pair, TB40/E BAC4 and AD169 (designated "TA"), was derived directly from bacterial artificial chromosomes (BACs) carrying these viral genomes and prepared from Escherichia coli. The second pair, TB40/E BAC4 and Merlin (designated "TM"), was amplified in human cell cultures.

The benchmark set as well as the results for the other genome assemblers have been extracted from the following independent report:

Zhi-Luo Deng, Akshay Dhingra, Adrian Fritz, Jasper Götting, Philipp C Münch, Lars Steinbrück, Thomas F Schulz, Tina Ganzenmüller, Alice C McHardy, Evaluating assembly and variant calling software for strain-resolved analysis of large DNA viruses, Briefings in Bioinformatics, bbaa123, https://doi.org/10.1093/bib/bbaa123

In a second stage, the contigs produced by s-aligner were assembled with Flye, a long-read assembler usually applied to PacBio or Nanopore data, in order to obtain longer contigs. The Flye results are reported both on their own and merged with the contigs obtained by s-aligner.

Results

Largest alignment	TA-0-1	TA-1-0	TA-1-10 (T)	TA-1-10 (A)	TA-1-1 (T)	TA-1-1 (A)	TA-1-50 (T)	TA-1-50 (A)	TM-0-1	TM-1-0	TM-1-10 (T)	TM-1-10 (M)	TM-1-1 (T)	TM-1-1 (M)	TM-1-50 (T)	TM-1-50 (M)
MetaSPAdes	85,249	192,883	198,958	164,861	84,628	84,408	162,614	168,031	169,369	192,629	96,882	106,000	53,377	53,377	160,252	169,370
IVA	179,316	227,574	9,085	13,953	189,434	166,414	162,754	168,389	170,711	193,287	161,686	170,137	14,836	11,649	162,554	170,711
s-aligner	157,405	65,300	87,525	87,747	37,839	37,839	37,839	102,740	102,966	78,494	57,309	57,288	37,222	36,475	48,656	49,135
s-aligner + flye	167,902	193,963	167,345	219,917	192,848	170,557	195,030	171,783	169,618	104,573	216,544	192,116	226,447	191,661	192,934	179,208
s-aligner + flye (just flye results)	167,902	193,963	167,345	219,917	192,848	170,557	195,030	171,783	169,618	104,573	216,544	192,116	226,447	191,661	192,934	179,208

Genome fraction	TA-0-1	TA-1-0	TA-1-10 (T)	TA-1-10 (A)	TA-1-1 (T)	TA-1-1 (A)	TA-1-50 (T)	TA-1-50 (A)	TM-0-1	TM-1-0	TM-1-10 (T)	TM-1-10 (M)	TM-1-1 (T)	TM-1-1 (M)	TM-1-50 (T)	TM-1-50 (M)
MetaSPAdes	95%	99%	100%	99%	100%	99%	99%	99%	95%	99%	97%	98%	100%	98%	99%	97%
IVA	95%	99%	38%	44%	99%	98%	99%	99%	95%	99%	100%	97%	47%	43%	99%	96%
s-aligner	100%	100%	100%	100%	100%	99%	89%	95%	96%	100%	100%	94%	100%	93%	96%	96%
s-aligner + flye	100%	100%	100%	100%	100%	99%	89%	95%	96%	100%	100%	94%	100%	93%	96%	96%
s-aligner + flye (just flye results)	100%	100%	90%	99%	99%	93%	85%	90%	79%	84%	95%	92%	100%	92%	95%	96%

48% increased performance in the Largest Alignment metric over the second best assembler.

Download here the contigs obtained from all assemblers.

De novo assembly of human transcriptome

Details

This benchmark was produced by applying different de novo assembly software to a human transcriptome sample.

The sample is whole-tissue RNA-seq of human melanoma metastases from patients prior to and during anti-CD20 antibody therapy.

Results

	rnaSPAdes	Megahit	s-aligner	rnaSPAdes + Megahit	s-aligner + rnaSPAdes + Megahit
== BASIC TRANSCRIPTS METRICS ==
Transcripts	26609	17066	73587	43675	117262
Transcripts > 500 bp	8893	8801	13469	17694	31163
Transcripts > 1000 bp	2904	2310	2615	5214	7829
== ALIGNMENT METRICS FOR NON-MISASSEMBLED TRANSCRIPTS ==
Avg. aligned fraction	0.865	0.93	0.84	0.893	0.859
Avg. alignment length	501.495	607.126	331.203	546.325	409.139
Avg. mismatches per transcript	6.023	8.441	4.93	7.05	5.698
== ALIGNMENT METRICS FOR MISASSEMBLED (CHIMERIC) TRANSCRIPTS ==
Misassemblies	143	180	284	323	607
== ASSEMBLY COMPLETENESS (SENSITIVITY) ==
Database coverage	0.019	0.025	0.025	0.025	0.034
Duplication ratio	1.084	1.672	1.672	1.584	2.443
50%-assembled genes	3041	2948	3363	3661	4237
95%-assembled genes	573	461	450	740	839
50%-covered genes	3451	3354	4141	4148	4901
95%-covered genes	638	527	687	875	1077
50%-assembled isoforms	3428	3194	4577	4730	6711
95%-assembled isoforms	591	461	452	771	883
50%-covered isoforms	3924	3677	6087	5476	8351
95%-covered isoforms	658	527	693	912	1148
Mean isoform coverage	0.407	0.427	0.39	0.441	0.429
Mean isoform assembly	0.375	0.396	0.343	0.406	0.386
== ASSEMBLY SPECIFICITY ==
50%-matched	15223	12160	46102	27383	73485
95%-matched	6704	7418	18370	14122	32492
Unannotated	2701	1487	7096	4188	11284
Mean fraction of transcript matched	0.709	0.793	0.699	0.744	0.716

Adding s-aligner to rnaSPAdes + Megahit (last column compared with the rnaSPAdes + Megahit column) gives:

16% more 50%-assembled genes.

13% more 95%-assembled genes.

18% more 50%-covered genes.

23% more 95%-covered genes.

19% fewer mismatches per transcript.

36% higher database coverage.
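
These percentages correspond to the relative change between the "rnaSPAdes + Megahit" column and the "s-aligner + rnaSPAdes + Megahit" column of the table. A small Python sketch of that comparison, with values copied from the table and names chosen only for this example:

```python
# Metrics for rnaSPAdes + Megahit (baseline) and for the same pair
# combined with s-aligner, copied from the results table.
baseline = {
    "50%-assembled genes": 3661,
    "95%-assembled genes": 740,
    "50%-covered genes": 4148,
    "95%-covered genes": 875,
    "Avg. mismatches per transcript": 7.05,
    "Database coverage": 0.025,
}
combined = {
    "50%-assembled genes": 4237,
    "95%-assembled genes": 839,
    "50%-covered genes": 4901,
    "95%-covered genes": 1077,
    "Avg. mismatches per transcript": 5.698,
    "Database coverage": 0.034,
}


def pct_change(new, old):
    """Relative change from `old` to `new`, in percent (negative = decrease)."""
    return (new - old) / old * 100.0


changes = {name: round(pct_change(combined[name], baseline[name]))
           for name in baseline}

for name, change in changes.items():
    print(f"{name}: {change:+d}%")
```

A negative value, as for mismatches per transcript, is an improvement in that metric.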

Download here the transcripts obtained by s-aligner.

Download here the raw reads.