After some additional problems to adapt my software to large datasets I finally got a de novo transcriptome assembly. The problems included hardware limitations, logistic problems to get new hardware, and some exponential-explosion effects in parts of the algorithm that were irrelevant for shorter datasets. Now that the problems are fixed, I hope to make new studies at larger scale. But here it is the first result.
This result corresponds to assembling a transcriptome of a human melanoma cancer cells that were implanted into a mouse. The benchmark results were obtained with rnaQUAST, a standard tool for comparing transcriptome results. Despite this tool makes some dubious or directly inconsistent analysis, this is considered the de facto standard tool for transcriptome benchmarking.
The results show that s-aligner gets a bigger number of large transcripts, a bigger coverage of the target human transcripts and in general more complete assemblies. I got a considerable 36% increment in the coverage of the human transcriptome, which means the assembly gets a 36% higher number of base pairs correctly assembled.
rnaSPAdes | Megahit | s-aligner | rnaSPAdes + Mehahit | s-aligner + rnaSPAdes + Megahit | |
---|---|---|---|---|---|
== BASIC TRANSCRIPTS METRICS == | |||||
Transcripts | 26609 | 17066 | 73587 | 43675 | 117262 |
Transcripts > 500 bp | 8893 | 8801 | 13469 | 17694 | 31163 |
Transcripts > 1000 bp | 2904 | 2310 | 2615 | 5214 | 7829 |
== ALIGNMENT METRICS FOR NON-MISASSEMBLED TRANSCRIPTS == | |||||
Avg. aligned fraction | 0.865 | 0.93 | 0.84 | 0.893 | 0.859 |
Avg. alignment length | 501.495 | 607.126 | 331.203 | 546.325 | 409.139 |
Avg. mismatches per transcript | 6.023 | 8.441 | 4.93 | 7.05 | 5.698 |
== ALIGNMENT METRICS FOR MISASSEMBLED (CHIMERIC) TRANSCRIPTS == | |||||
Misassemblies | 143 | 180 | 284 | 323 | 607 |
== ASSEMBLY COMPLETENESS (SENSITIVITY) == | |||||
Database coverage | 0.019 | 0.025 | 0.025 | 0.025 | 0.034 |
Duplication ratio | 1.084 | 1.672 | 1.672 | 1.584 | 2.443 |
50%-assembled genes | 3041 | 2948 | 3363 | 3661 | 4237 |
95%-assembled genes | 573 | 461 | 450 | 740 | 839 |
50%-covered genes | 3451 | 3354 | 4141 | 4148 | 4901 |
95%-covered genes | 638 | 527 | 687 | 875 | 1077 |
50%-assembled isoforms | 3428 | 3194 | 4577 | 4730 | 6711 |
95%-assembled isoforms | 591 | 461 | 452 | 771 | 883 |
50%-covered isoforms | 3924 | 3677 | 6087 | 5476 | 8351 |
95%-covered isoforms | 658 | 527 | 693 | 912 | 1148 |
Mean isoform coverage | 0.407 | 0.427 | 0.39 | 0.441 | 0.429 |
Mean isoform assembly | 0.375 | 0.396 | 0.343 | 0.406 | 0.386 |
== ASSEMBLY SPECIFICITY == | |||||
50%-matched | 15223 | 12160 | 46102 | 27383 | 73485 |
95%-matched | 6704 | 7418 | 18370 | 14122 | 32492 |
Unannotated | 2701 | 1487 | 7096 | 4188 | 11284 |
Mean fraction of transcript matched | 0.709 | 0.793 | 0.699 | 0.744 | 0.716 |
I don't want to extend much on why most of these metrics are dubious when not directly inconsistent, but some hints:
For these reasons I made tests filtering out transcripts in my assembly at different minimal lengths. Despite there were differences, the trend is the same: my assembly gets significatively larger transcripts and more coverage. The results in the table correspond to filtering out transcripts shorter than 180 bp. Note that this even discards some correctly short transcripts.
In addition to rnaQUAST I made some scripts to verify independently the performance of my assembler. With a method similar to what I explained in a previous blog post I found that a significative amount of base pairs in my assembly correspond to segments of the transcriptome that are not covered by the other two assemblers. From 56,626,688 base pairs in the transcriptome obtained with the combination of all three assemblies 4,481,140 base pairs were only assembled by s-aligner. That means that in combination with s-aligner you get an about 8% larger transcriptome.