Human transcriptome benchmark result

After some additional problems to adapt my software to large datasets I finally got a de novo transcriptome assembly. The problems included hardware limitations, logistic problems to get new hardware, and some exponential-explosion effects in parts of the algorithm that were irrelevant for shorter datasets. Now that the problems are fixed, I hope to make new studies at larger scale. But here it is the first result.

This result corresponds to assembling a transcriptome of a human melanoma cancer cells that were implanted into a mouse. The benchmark results were obtained with rnaQUAST, a standard tool for comparing transcriptome results. Despite this tool makes some dubious or directly inconsistent analysis, this is considered the de facto standard tool for transcriptome benchmarking.

The results show that s-aligner gets a bigger number of large transcripts, a bigger coverage of the target human transcripts and in general more complete assemblies. I got a considerable 36% increment in the coverage of the human transcriptome, which means the assembly gets a 36% higher number of base pairs correctly assembled.

rnaSPAdes Megahit s-aligner rnaSPAdes + Mehahit s-aligner + rnaSPAdes + Megahit
Transcripts > 500 bp88938801134691769431163
Transcripts > 1000 bp29042310261552147829
Avg. aligned fraction0.8650.930.840.8930.859
Avg. alignment length501.495607.126331.203546.325409.139
Avg. mismatches per transcript6.0238.4414.937.055.698
Database coverage0.0190.0250.0250.0250.034
Duplication ratio1.0841.6721.6721.5842.443
50%-assembled genes30412948336336614237
95%-assembled genes573461450740839
50%-covered genes34513354414141484901
95%-covered genes6385276878751077
50%-assembled isoforms34283194457747306711
95%-assembled isoforms591461452771883
50%-covered isoforms39243677608754768351
95%-covered isoforms6585276939121148
Mean isoform coverage0.4070.4270.390.4410.429
Mean isoform assembly0.3750.3960.3430.4060.386
Mean fraction of transcript matched0.7090.7930.6990.7440.716

I don't want to extend much on why most of these metrics are dubious when not directly inconsistent, but some hints:

  • If in the same transcriptome you filter out shorter transcripts you get a higher mean of transcripts matched, higher mean isoform coverage, but higher number of mismatches per transcript.
  • In many metrics it doesn't discard repeated transcripts, altering this way greatly the significance of the metric.
  • If you don't filter out short transcripts you get for example more 50%.covered genes. By pure logic, if you just add the reads as contigs, these will cover a bigger percentage due to contigs usually not including non-assembled reads.

For these reasons I made tests filtering out transcripts in my assembly at different minimal lengths. Despite there were differences, the trend is the same: my assembly gets significatively larger transcripts and more coverage. The results in the table correspond to filtering out transcripts shorter than 180 bp. Note that this even discards some correctly short transcripts.

In addition to rnaQUAST I made some scripts to verify independently the performance of my assembler. With a method similar to what I explained in a previous blog post I found that a significative amount of base pairs in my assembly correspond to segments of the transcriptome that are not covered by the other two assemblers. From 56,626,688 base pairs in the transcriptome obtained with the combination of all three assemblies 4,481,140 base pairs were only assembled by s-aligner. That means that in combination with s-aligner you get an about 8% larger transcriptome.


Write a Comment