I have been busy the last few weeks trying to assemble a fungal genome with s-aligner for the first time. It has been quite a difficult task. Despite its demonstrated performance with viral genomes (under 300 kbp), s-aligner had never been tested with genomes larger than 5 Mbp. In this case, the fungal genome is around 30 Mbp long: a significantly different problem with different requirements.
This was a project requested by an international company. They had obtained a rather poor assembly for one of their customers using a hybrid approach with PacBio and Illumina data, so after a few successful cases in which I had helped them with other assemblies, they sent me this one to see if I could help.
The first problem I found was the processing speed of my software. Although adequate for smaller genomes, it turned out to be insufficient for larger ones. I tried running it on virtual machines from Google Cloud and AWS, but despite using up to 80 threads it still didn't run faster. Most of the time, only a few threads were actually running concurrently. It turned out that the problem was the inefficiency of accessing the same file concurrently from different threads. Once this issue was fixed, the processing speed skyrocketed, achieving double-digit speed-up factors.
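s-aligner's internals are not shown here, but the general pattern behind this kind of fix is simple: instead of funneling every thread through one shared, lock-guarded file object, each worker opens its own handle and reads its own region of the file. A minimal Python sketch of that idea (file name, thread count, and chunking are hypothetical, and a real reader would align chunks to record boundaries):

```python
import os
import threading

INPUT = "reads.fastq"   # hypothetical input file
N_THREADS = 8

def worker(offset, length):
    # Each worker opens its *own* handle and seeks to its own region,
    # so threads never contend on a shared, lock-guarded file object.
    with open(INPUT, "rb") as f:
        f.seek(offset)
        data = f.read(length)
        # ... process `data` here (align to record boundaries in a real reader) ...

size = os.path.getsize(INPUT)
step = size // N_THREADS
threads = []
for i in range(N_THREADS):
    start = i * step
    length = size - start if i == N_THREADS - 1 else step
    t = threading.Thread(target=worker, args=(start, length))
    t.start()
    threads.append(t)
for t in threads:
    t.join()
```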
So I left it running on cloud virtual machines for hours, hoping to finally get my assembly. Due to the incremental nature of s-aligner, I first ran some partial tests with a limited number of reads. These first tests already produced significantly better results than the hybrid approach.
Hybrid assembly (PacBio + Illumina):

C:16.5%[S:16.5%,D:0%],F:16.83%,M:66.67%,n:303
- 50 Complete BUSCOs (C)
- 50 Complete and single-copy BUSCOs (S)
- 0 Complete and duplicated BUSCOs (D)
- 51 Fragmented BUSCOs (F)
- 202 Missing BUSCOs (M)
- 303 Total BUSCO groups searched

s-aligner partial test (Illumina single-read only):

C:92.5%[S:3.5%,D:89.0%],F:4.7%,M:2.8%,n:255
- 236 Complete BUSCOs (C)
- 9 Complete and single-copy BUSCOs (S)
- 227 Complete and duplicated BUSCOs (D)
- 12 Fragmented BUSCOs (F)
- 7 Missing BUSCOs (M)
- 255 Total BUSCO groups searched
But I wanted to see if I could get even better results. Then things started to get messy. For no apparent reason, my VM instances running in the cloud didn't complete the job: they died or were preempted every time I launched them.
I use the preemptible (spot) mode of AWS and Google Cloud to save costs. In this mode, Google or Amazon can interrupt the instance at any time, without prior warning, if they need the capacity for something else. This saves money (up to 80%) but carries the risk of sometimes having to repeat tasks because the VM was preempted.
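Preemption is not entirely blind, though: Google Cloud gives a preemptible VM roughly 30 seconds of notice, which usually reaches running processes as a SIGTERM during the OS shutdown, and AWS spot instances publish a two-minute interruption warning through the instance metadata service. A minimal sketch, assuming the job can trap that signal and that `save_checkpoint` (hypothetical here) can persist enough state to resume on a new VM:

```python
import signal
import sys

def save_checkpoint():
    # Hypothetical: persist whatever partial state the run has produced
    # (e.g. to a persistent disk or object storage) so a new VM can resume.
    ...

def handle_preemption(signum, frame):
    # Preemptible/spot VMs get only a short notice, so do the minimum
    # needed to make the work resumable and exit cleanly.
    save_checkpoint()
    sys.exit(0)

signal.signal(signal.SIGTERM, handle_preemption)

# ... long-running assembly work goes here ...
```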
It turned out to be a memory issue. At some point, with large genomes, memory consumption was growing exponentially and Linux was killing the process when it reached the system limit. It took many attempts to discover what was going on.
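Diagnosing this kind of failure is easier if you can see the growth curve before the OOM killer acts. One simple approach (a sketch, Linux-only, with an arbitrary sampling interval) is a background thread that periodically logs the process's resident set size from /proc:

```python
import threading
import time

def rss_mb():
    # Resident set size of the current process, read from /proc (Linux only).
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1]) / 1024  # kB -> MB
    return 0.0

def log_memory(interval_s=30):
    while True:
        print(f"[mem] RSS = {rss_mb():.0f} MB", flush=True)
        time.sleep(interval_s)

# Run the sampler as a daemon thread alongside the real workload.
threading.Thread(target=log_memory, daemon=True).start()
```

When a process has already been killed, `dmesg` usually records the OOM killer's action, which is how you can confirm that memory, and not the cloud provider, ended the run.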
At first, I worked around the problem by renting VMs with more RAM, but at some point even that wasn't enough. Many more attempts (each requiring many hours of processing) followed until I realized this was not going to be a viable option: the processing costs would be too high (the more RAM, the higher the hourly cost of the VM).
So once again I had to turn to the option I always try to avoid when working for a customer: modifying the software. Sometimes fixing the software takes just a few hours, but sometimes you enter a spiral that is hard to exit for weeks without risking that the future development of the project remains flawed forever due to unresolved and undocumented issues you introduced while trying to fix another one.
But I consulted the customer, and they confirmed they were not in a hurry to get the results, so I went on. After a few failed attempts I finally got the RAM consumption issue under control: peak usage now stays below 100 GB. That is a reasonable limit, one that does not raise the VM cost per hour too much.
Now that I could finally process the sample data, there was still the issue that my software needed the peak amount of RAM only at some steps of the process, and a high number of concurrent cores only at other steps. Provisioning the peak number of cores and the peak amount of RAM for the entire run was too inefficient.
The solution was to make my software even more incremental: it now saves its state at different stages of processing, so each stage can run on a different VM with different requirements.
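The idea is essentially checkpointing between stages: each stage writes its result to storage shared between VMs, and the driver skips any stage whose output already exists, so a new (differently sized) VM can pick up where the previous one stopped. A minimal sketch of that pattern, with hypothetical stage names and functions:

```python
import os
import pickle

CHECKPOINT_DIR = "checkpoints"   # hypothetical location shared between VMs

def run_stage(name, func, *args):
    """Run `func` unless a saved result for this stage already exists."""
    os.makedirs(CHECKPOINT_DIR, exist_ok=True)
    path = os.path.join(CHECKPOINT_DIR, f"{name}.pkl")
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)      # resume: reuse the saved result
    result = func(*args)
    with open(path, "wb") as f:
        pickle.dump(result, f)         # checkpoint for the next VM
    return result

# Hypothetical pipeline: each stage can be launched on a VM sized for it
# (e.g. many cores for one step, lots of RAM for another).
# index    = run_stage("index", build_index, "reads.fastq")
# overlaps = run_stage("overlap", find_overlaps, index)
# contigs  = run_stage("assemble", build_contigs, overlaps)
```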
After all these modifications, I finally got a more satisfying assembly.
Getting this assembly required many weeks of work, a lot of accumulated VM hours, and an overall cost of over $300 (in VMs alone). But with the software modifications in place, and without the failed tests, a similar assembly should in the future take less than 24 hours and cost under $30 (even using cloud VMs inefficiently). Considering that I only used single-read Illumina data, an assembly of this size can be obtained while saving the cost of PacBio sequencing and the extra cost of Illumina paired-end sequencing. That means cutting the Illumina sequencing cost in half and saving about two thirds of the total sequencing cost, while adding only about $30 in software processing. Overall, that means reducing costs by 62% while increasing the completeness of the assembly by 445%. This looks like an interesting result for this case, even if more tests are still required to draw more general conclusions.
| | N50 (bp) | Largest contig (bp) | Complete & partial BUSCOs | Time | Estimated cost |
|---|---|---|---|---|---|
| Hybrid (PacBio + Illumina) | 528,493 | 3,279,816 | 16.5% | 10-72 hours | $4,000-$7,000 |
| Illumina paired-end and Megahit | 46,435 | 428,316 | 94.1% | 10-72 hours | $1,000-$1,500 |
| Illumina single-read and s-aligner | 53,111 | 581,962 | 94.5% | 24-72 hours | $530-$1,380 |