Comparing transcriptome and metagenome assemblies

There are many tools for generating genome assemblies. Each tool has different properties and performs differently on different datasets. Therefore, one important task in genetics is comparing genome assemblies in order to know what software performs better in every circumstance.

For assembling genomes from samples designed to capture a specific genome (cultivating the organism, taking a tissue sample directly from the organism, or capturing specific sequences by library design), the comparison of results tends to be an easy concept. In case you have a reference genome you just map your assembly to the reference and you get a quite clear idea of how has the assembly performed. This is a technique not free of errors and inconsistencies but in general works relatively well. A tool frequently used for this task is QUAST.

In case you don’t have a reference sequence, or also in case you have it but you want to get another measure of completeness, you can try to identify genes that are usually present in organisms of the family of the organism that you are sequencing. A tool that takes a measure of completeness this way is BUSCO.

But metagenomes and transcriptomes belong to a different case. Now, you don’t know exactly what you are sequencing. In the case of transcriptomes, you are looking for thousands of different mRNA sequences, and despite sometimes you have references, these references are a work in progress. Also, when you assemble a transcriptome you are quite often looking for a mutation, for example in cancer transcriptome samples. These mutations often are chimeras from two genes, or from a gene and a transposable element. Therefore, you can’t just rely on references. In the cast of metagenomes, you don’t even know what you are looking for. There are potentially hundreds of microorganisms in the sample and your goal is to find the largest genome sequence from all of them. Often, these are even new species that have never been sequenced. So, you can’t either rely on references in this case.

For cases in which you can’t rely on references a common method for evaluating the performance of the assembly consists on mapping the reads to every assembly. The assembly to which more percentage of reads map is therefore understood to be the one that did a better job finding a sequence for each read. But that is not necessarily true or accurate. One assembly could have thousands of short contigs to which a large percentage of reads map but still be useless for biological studies.

This week I have been working on a different approach. Instead of mapping reads to each assembly, just map contigs from one assembly to the other one. As long as I know that is not much in use (if anyone does indeed use it) as a method for comparing assemblies, and yet I find it to be quite useful.

Here for example there is a mapping of some contigs in one transcriptomic assembly to a contig in another assembly.

k141_81226	6238	62	762	-	42703__length_700	700	0	700	693	700	32	NM:i:7	ms:i:1358	AS:i:1358	nn:i:0	tp:A:P	cm:i:115	s1:i:631	s2:i:621	de:f:0.0100	rl:i:26	cg:Z:700M
k141_81226	6238	111	762	+	4805__length_651	651	0	651	649	651	0	NM:i:2	ms:i:1290	AS:i:1290	nn:i:0	tp:A:S	cm:i:114	s1:i:621	de:f:0.0031	rl:i:26	cg:Z:651M
k141_81226	6238	2951	3396	+	6117__length_445	445	0	445	445	445	60	NM:i:0	ms:i:890	AS:i:890	nn:i:0	tp:A:P	cm:i:78	s1:i:436	s2:i:0	de:f:0	rl:i:26	cg:Z:445M
k141_81226	6238	4936	5334	-	32681__length_398	398	0	398	398	398	60	NM:i:0	ms:i:796	AS:i:796	nn:i:0	tp:A:P	cm:i:68	s1:i:383	s2:i:0	de:f:0	rl:i:26	cg:Z:398M
k141_81226	6238	2660	2985	-	56442__length_325	325	0	325	324	325	60	NM:i:1	ms:i:644	AS:i:644	nn:i:0	tp:A:P	cm:i:54	s1:i:297	s2:i:0	de:f:0.0031	rl:i:26	cg:Z:325M
k141_81226	6238	1891	2202	+	55945__length_312	312	0	312	311	312	35	NM:i:1	ms:i:616	AS:i:616	nn:i:0	tp:A:P	cm:i:55	s1:i:284	s2:i:274	de:f:0.0032	rl:i:26	cg:Z:299M1D12M
k141_81226	6238	1898	2202	-	36701__length_308	308	0	308	303	308	0	NM:i:5	ms:i:586	AS:i:586	nn:i:0	tp:A:S	cm:i:53	s1:i:274	de:f:0.0098	rl:i:26	cg:Z:11M3D2M1D291M
k141_81226	6238	5369	5556	+	23817__length_187	187	0	187	187	187	60	NM:i:0	ms:i:374	AS:i:374	nn:i:0	tp:A:P	cm:i:27	s1:i:179	s2:i:0	de:f:0	rl:i:26	cg:Z:187M
k141_81226	6238	5562	5737	-	10842__length_175	175	0	175	175	175	60	NM:i:0	ms:i:350	AS:i:350	nn:i:0	tp:A:P	cm:i:32	s1:i:167	s2:i:0	de:f:0	rl:i:26	cg:Z:175M
k141_81226	6238	5790	5940	-	37907__length_150	150	0	150	150	150	60	NM:i:0	ms:i:300	AS:i:300	nn:i:0	tp:A:P	cm:i:19	s1:i:136	s2:i:0	de:f:0	rl:i:26	cg:Z:150M
k141_81226	6238	2368	2518	-	2143__length_150	150	0	150	149	150	60	NM:i:1	ms:i:294	AS:i:294	nn:i:0	tp:A:P	cm:i:23	s1:i:144	s2:i:0	de:f:0.0067	rl:i:26	cg:Z:150M

Find the PAF format description if you want to understand this table. But it basically says which contig matches whitch other contig and at which positions.

We can see how one large contig (6238 bps) maps to a few shorter contigs in the other assembly (none larger than 1000bps). We can understand that in this case the first assembly most likely did a better job and the second one just failed to assemble the sequence. But it could indeed be exactly the opposite. It could be that the first assembly got a misassembly merging a few mRNAs.

There are a few ways to check which case is the right one. You can, for example, use BLAST to find matches in an mRNA database. In that case you will find that k141_81226 matches "Homo sapiens SIN3 transcription regulator family member B (SIN3B), transcript variant 3, mRNA" at 66% identity, while the other contigs match in most cases to the same sequence but in smaller fraction. That seems already quite indicative of what has went on with these assemblies. It seems highly likely that the smaller contigs are just incomplete contigs and do not represent real mRNA sequences present in the sample, but we also find that the large sequence could in addition be a misassembly.

Even if you don’t have references or you find it unpractical to make a BLAST search for every contig, there is another way to get clues on what has gone on. You can make a multiple alignment of all the contigs. Then you can visualize the result and deduce what is a misassembly and what is just legit larger contigs.

This is what you get when the reads are incomplete sequences.

This is what you typically get when the large read is a misassembly.

And this is what you get when both assemblies are complementary and capture different regions of the same sequence.

The last case is, indeed, under my experience, the most frequent, even if one assembly gets larger and more numerous contigs. Therefore you usually can complement your assembly even with an assembly of lower quality.

I guess I should write a short paper with more details on this method and even develop a more robust tool, but I have no time and no resources to start side projects. Which in my opinion is a shame for the actual organization of science. Funds should go to where there is a demonstrated new idea with potential and not to where there is reputation and influences. One of the many things that need to be changed in science and innovation organization. Of course, getting private capital for such a project isn't either ever going to happen. Things are really so bad. No wonder why the promised future at last was not flying cars but... impotence to fight a pandemic.


Write a Comment