Human transcriptome benchmark result

After some additional problems to adapt my software to large datasets I finally got a de novo transcriptome assembly. The problems included hardware limitations, logistic problems to get new hardware, and some exponential-explosion effects in parts of the algorithm that were irrelevant for shorter datasets. Now that the problems are fixed, I hope to make new studies at larger scale. But here it is the first result.

This result corresponds to assembling a transcriptome of a human melanoma cancer cells that were implanted into a mouse. The benchmark results were obtained with rnaQUAST, a standard tool for comparing transcriptome results. Despite this tool makes some dubious or directly inconsistent analysis, this is considered the de facto standard tool for transcriptome benchmarking.

The results show that s-aligner gets a bigger number of large transcripts, a bigger coverage of the target human transcripts and in general more complete assemblies. I got a considerable 36% increment in the coverage of the human transcriptome, which means the assembly gets a 36% higher number of base pairs correctly assembled.

	rnaSPAdes	Megahit	s-aligner	rnaSPAdes + Mehahit	s-aligner + rnaSPAdes + Megahit
== BASIC TRANSCRIPTS METRICS ==
Transcripts	26609	17066	73587	43675	117262
Transcripts > 500 bp	8893	8801	13469	17694	31163
Transcripts > 1000 bp	2904	2310	2615	5214	7829
== ALIGNMENT METRICS FOR NON-MISASSEMBLED TRANSCRIPTS ==
Avg. aligned fraction	0.865	0.93	0.84	0.893	0.859
Avg. alignment length	501.495	607.126	331.203	546.325	409.139
Avg. mismatches per transcript	6.023	8.441	4.93	7.05	5.698
== ALIGNMENT METRICS FOR MISASSEMBLED (CHIMERIC) TRANSCRIPTS ==
Misassemblies	143	180	284	323	607
== ASSEMBLY COMPLETENESS (SENSITIVITY) ==
Database coverage	0.019	0.025	0.025	0.025	0.034
Duplication ratio	1.084	1.672	1.672	1.584	2.443
50%-assembled genes	3041	2948	3363	3661	4237
95%-assembled genes	573	461	450	740	839
50%-covered genes	3451	3354	4141	4148	4901
95%-covered genes	638	527	687	875	1077
50%-assembled isoforms	3428	3194	4577	4730	6711
95%-assembled isoforms	591	461	452	771	883
50%-covered isoforms	3924	3677	6087	5476	8351
95%-covered isoforms	658	527	693	912	1148
Mean isoform coverage	0.407	0.427	0.39	0.441	0.429
Mean isoform assembly	0.375	0.396	0.343	0.406	0.386
== ASSEMBLY SPECIFICITY ==
50%-matched	15223	12160	46102	27383	73485
95%-matched	6704	7418	18370	14122	32492
Unannotated	2701	1487	7096	4188	11284
Mean fraction of transcript matched	0.709	0.793	0.699	0.744	0.716

I don't want to extend much on why most of these metrics are dubious when not directly inconsistent, but some hints:

If in the same transcriptome you filter out shorter transcripts you get a higher mean of transcripts matched, higher mean isoform coverage, but higher number of mismatches per transcript.
In many metrics it doesn't discard repeated transcripts, altering this way greatly the significance of the metric.
If you don't filter out short transcripts you get for example more 50%.covered genes. By pure logic, if you just add the reads as contigs, these will cover a bigger percentage due to contigs usually not including non-assembled reads.

For these reasons I made tests filtering out transcripts in my assembly at different minimal lengths. Despite there were differences, the trend is the same: my assembly gets significatively larger transcripts and more coverage. The results in the table correspond to filtering out transcripts shorter than 180 bp. Note that this even discards some correctly short transcripts.

In addition to rnaQUAST I made some scripts to verify independently the performance of my assembler. With a method similar to what I explained in a previous blog post I found that a significative amount of base pairs in my assembly correspond to segments of the transcriptome that are not covered by the other two assemblers. From 56,626,688 base pairs in the transcriptome obtained with the combination of all three assemblies 4,481,140 base pairs were only assembled by s-aligner. That means that in combination with s-aligner you get an about 8% larger transcriptome.

Blog

Human transcriptome benchmark result

Tags

Share On

Other posts in this blog:

Comments

Write a Comment