s-aligner and personalized medicine

At its core, S-aligner is just software that aligns sequences, as its name indicates. That is basically all it does every time it runs: it takes a list of sequences and finds ways to align them. How can that be related to personalized medicine?
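To make "aligning sequences" concrete, here is a minimal sketch of a classic textbook method, Needleman-Wunsch global alignment of two sequences. This is not how s-aligner works internally (the post does not describe its algorithm); the function name and the scoring values are assumptions chosen only for illustration.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align two sequences; return (score, aligned_a, aligned_b).

    Textbook dynamic programming, shown only to illustrate what an
    alignment is. The scoring values are arbitrary assumptions.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per character.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def pair(i, j):
        # Score of aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + pair(i, j),  # align both characters
                score[i - 1][j] + gap,             # gap in b
                score[i][j - 1] + gap,             # gap in a
            )

    # Walk back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    total, top, bottom = needleman_wunsch("ACGT", "AGT")
    print(total)
    print(top)
    print(bottom)
```

Real aligners have to handle millions of reads, so they rely on indexing and heuristics rather than filling a full table for every pair, but the output is the same kind of object: two sequences placed one over the other, with gaps where they differ.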

It turns out that aligning sequences is a fundamental step in almost every bioinformatics process. You need to align sequences for almost everything you do: generating phylogenetic trees, assembling genomes, finding variants compared to a reference genome, validating genome assemblies, annotating genomes…

In the case of personalized medicine, a fundamental step before applying any kind of genetic treatment is knowing the particularities of the patient's genome. This can be done in a few ways, for example by sequencing the patient's whole genome or their transcriptome.

Sequencing the whole genome of the patient, despite many news reports announcing that it can now be done for $1000, is not really within reach of current treatments. You know how sensationalism now dominates the news industry, and science is not an island free of these general problems. Sensationalism and biased opinions driven by orchestrated campaigns from influential organizations are mainstream here too. So in some cases you can order your genome to be sequenced for $1000, but what you will get is far from being your real whole genome. Some parts of the human genome had in fact not been sequenced in any human until very recently, many years after the first complete human genome was announced across the media, and sequencing them required new, expensive technologies. Accurate and affordable whole-genome sequencing is still an entry barrier for personalized medicine, even though many investigations and treatments are already possible with current methods.

S-aligner has not been tested for assembling a full human genome. The main problem is the cost. S-aligner, as we have seen previously, can help lower the cost of genome sequencing, but to do so it increases the cost of genome assembly. Assembly is a marginal part of the total cost of genome sequencing, but even a marginal part is not negligible for a currently unfunded project. In the experimentation stage this cost can even be 20 times the real cost, because many tests have to be repeated until we find the parameters that work for that use case.

But there is an alternative to whole-genome sequencing for personalized medicine research. Instead of sequencing the full genome, you can sequence only the RNA sequences that are active in the cells you want to treat. The human genome consists of a large number of genes, but only a small percentage are active in any given cell. Being active means that these genes are transcribed into RNA sequences, which can later be translated into the active compounds that affect our biological processes. Therefore, if we read the RNA sequences, we are reading the genes that are active in that cell, and we can design treatments and diagnostics based exclusively on these genes.

Sequencing the set of RNA molecules active in a cell or set of cells is called transcriptomics and, despite involving fewer genes, it also has some challenges still pending.

The main problem is again the assembly. Despite requiring the assembly of fewer genes and, in general, being a less costly process, the complexity of the assembly is considered even greater than in whole-genome sequencing. The problem is that in this case you are not trying to assemble a single sequence but many independent ones, all mixed in the same sample, and in many cases these sequences are very similar to each other due to splicing, the diploid nature of the human genome, and many genes being modified copies of other genes.
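A toy sketch can show why this similarity is a problem. Below, two invented "isoforms" of the same gene are built from three made-up exons, one of them skipping the middle exon. The sequences, the names and the k-mer size are all assumptions for illustration and have nothing to do with real data or with how s-aligner or Megahit work internally.

```python
def kmers(sequence, k):
    """Return the set of all substrings of length k in the sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


# Made-up exons, for illustration only.
EXON_1 = "ATGGCCATTGTA"
EXON_2 = "CCGGGTTTAAAC"
EXON_3 = "GGATCCTGATAG"

isoform_a = EXON_1 + EXON_2 + EXON_3  # includes the middle exon
isoform_b = EXON_1 + EXON_3           # skips the middle exon

K = 5
kmers_a = kmers(isoform_a, K)
kmers_b = kmers(isoform_b, K)

shared = kmers_a & kmers_b    # short fragments that fit both isoforms
only_in_b = kmers_b - kmers_a  # fragments spanning the exon 1 / exon 3 junction

if __name__ == "__main__":
    print("k-mers shared by both isoforms:", len(shared))
    print("k-mers found only in isoform B:", sorted(only_in_b))
```

Any short read that falls entirely inside exon 1 or exon 3 is compatible with both isoforms. Only the few fragments that span a junction tell them apart, and an assembler that joins the wrong pieces produces a contig that looks longer but does not correspond to any real RNA.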

Because of this lack of precision in assembling transcriptomes, most personalized-medicine treatments are still based on designing specific peptides that affect other peptides in the organism, rather than on strategies that interact directly with the RNA, as would be desirable for more precise, faster and more affordable treatments. Personalized medicine based on RNA is the next revolution coming.

S-aligner is already being tested in transcriptomics. My current problem is the lack of computational capacity or funds to run massive tests, and since many other tests got my attention first (viral genome assembly, bacterial genome assembly, assembling a fungal genome, metagenome assembly…), I have so far spent only a limited amount of resources on studying the capacity of s-aligner to help in transcriptomics.

One case I am currently working on is a transcriptomic sample from an ALS patient. I have assembled the transcriptome with Megahit to compare the results. I also tried RNASpades, but it failed due to the high computational requirements of the software. And I already have an assembly from s-aligner. For now it is a very limited one, built from a very limited number of reads. The reason is, again, the cost of the computational resources required for a larger study.

Megahit has produced an assembly containing 128 MB of data, while s-aligner has produced 22 MB so far. Remember that s-aligner works incrementally, and you can continue a study from partial results like these without having to repeat the work already done. The number of contigs is also larger with Megahit: 128k vs 48k.
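Figures like these (total size and number of contigs) can be computed from any assembly in FASTA format. The following is a minimal sketch of how such a summary might be obtained; the function names are hypothetical, the tiny FASTA in the example is made up, and N50 is included only as a commonly reported extra metric, not one of the numbers quoted above.

```python
def read_contig_lengths(lines):
    """Return the length of every record in an iterable of FASTA lines."""
    lengths = []
    current = None  # length of the record being read, None before the first header
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def assembly_stats(lines):
    """Summarize an assembly: contig count, total bases, longest contig, N50."""
    lengths = read_contig_lengths(lines)
    return {
        "contigs": len(lengths),
        "total_bases": sum(lengths),
        "longest": max(lengths, default=0),
        "n50": n50(lengths),
    }


if __name__ == "__main__":
    # Made-up toy assembly, not real data.
    toy_fasta = """>contig_1
ACGTACGTAC
>contig_2
ACGTAC
GT
>contig_3
AC
""".splitlines()
    print(assembly_stats(toy_fasta))
```

For a real assembly you would pass an open file instead of the toy lines, for example `assembly_stats(open(path))` with `path` pointing at the contigs file written by the assembler. Keep in mind that none of these numbers says anything about correctness: a larger assembly is not necessarily a better one.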

Comparing the results of Megahit and s-aligner, we also see that in most cases the contig in the Megahit assembly is longer than the contig for the same RNA in s-aligner. Only in a small number of cases do we obtain longer contigs for the same RNA sequence with s-aligner than with Megahit. But that is not the end of the story. Further analysis reveals that in many cases the Megahit contigs are longer, but incorrectly so, due to misassemblies.

At the very least, we already have a tool that can help in transcriptomics to obtain longer contigs for some RNA sequences and to identify potential misassemblies. Even if it only improves the quality of current transcriptomics studies by 10-20%, it will already be worth it. But this is only the beginning. In every scenario in which we tested s-aligner we started with disappointing results, and in every scenario we ended up improving those results as we ran more tests, adjusted the parameters and applied more computational power. Therefore I am confident this will also be the case in transcriptomics for personalized medicine. I hope to be able to report new progress soon.

