Training

We will show here a case of use with real data.

Download

You can download the data from here and test yourself how s-aligner works. The data corresponds to the run SRR12956185. This corresponds to an Unbiased Deep Sequencing with Nextera XT on Illumina MiSeq for an African-lineage Zika virus.

Download data
Assembling it
  1. Unzip the file that you downloaded.
  2. Index the fasta file.

    :~$ ./index.sh raw-zika-sample1 PATH_TO_UNCOMPRESSED_ZIP/sra_data.fasta

  3. Assemble making a search with the first 500 reads.

    :~$ ./assemble.sh raw-zika-sample1 500 > results/raw-zika-sample1-500.fa

    You will get an output in you terminal like this:

  4. Take a look to the result file and see if the contigs have a reasonable length and are reasonably similar to a real Zika sequence.

    You can do that in many ways. For example you could use the Quast tool and the reference genome. A simpler way and often more informative is just making an online Blast search in the NCBI portal and see if the sequence matches a zika virus.

    In this case, we see how the contig we introduced (3261 bp long) is a perfect match for the sequence MK028860.1, corresponding to a Zika virus assembly. So, we are in the good direction.

  5. Now, we can go ahead with more confidence and assemble making a search with the next 2000 reads.

    :~$ ./assemble.sh raw-zika-sample1 2500 -sid THE_SESSION_ID_IN_PREVIOUS_STEP > results/raw-zika-sample1-2500.fa

  6. Finally, we likely got a reasonable assembly. We can check that using Quast.

    Important: Note that every time you run s-aligner with the same data you will likely get different results. In some cases even very different results. That's due to the randomness introduced by the order variation caused by multithreading.

    Your quast result, in any case should look very similar to that:

    You can compare these results with results from other software tools. You can use the compressed fastq file in the zip that you downloaded as input for the other tools.

    LGA50 NGA50 Genome fraction (%) Max contig length Largest alignment Misassembled length
    Velvetit doesn't contain contigs >= 500 bp
    SPAdes--0,72880770
    rnaSPAdes31.64298,044.0702.84833.248
    s-aligner110.15599,9010.26310.15524.483