Contignant - Guide

Installation

Dependencies

The software only works for now under a Linux 64 bits OS. Despite it is a standalone software, some prepared pipelines make use of third-party software that is highly recommended to install. This is, indeed, a de-facto requirement for the actual distribution as only the instructions for executing the scripts in these pipelines will be provided for now.

DEPENDENCIES:

OpenSSL-devel 1.1

Installing it

To install the software unzip the file you downloaded into a folder, then run install.sh from that folder.

:~$ chmod 744 install.sh; ./install.sh

The install script will install the dependencies. If it foesn't work please consult how to install dependencies in your system.

That folder will be your workfolder and its structure should not be modified as the software depends on it.

Guide

Basic principles

IMPORTANT: This instructions are valid for the version 1.02 of s-aligner.
Run that command to know your version.

:~$ LD_LIBRARY_PATH=. ./saligner -version

If you don't have the last version run:

:~$ ./update.sh

This software is basically a sequence aligner. It makes use of some new algorithms which are not usually the best on any aspect but just the right compromise between speed and efficiency for the uses it is intended to.

For genome assembly (the main use of the software) you need to have in mind that this still is just an alignment software. It just gets as seeds some reads and starts aligning other reads departing from these seed reads. Overall, the software maskes use of a variant of overlap-layout-consensus algorithm.

De novo genome assembly

Indexing for de novo genome assembly

s-aligner needs to index the reads before starting to work with them as an important aspect of its efficiency is based on being able to find overlapping reads fast and accurately.

It accepts FASTA amd FASTQ files as input but FASTQ files will be automatically converted to FASTA. You can use your preferred tool to convert it instead if you like. The input files can also be compressed with gzip. In that case the extension myst be .fastq.gz, .fq.gz, fasta.gz or fa.gz.

The basic command to index a set of reads is something like this:

:~$ ./index.sh sars-cov-2-sample1 your_path/your_reads.fa

where "sars-cov-2-sample1" is the id we want to put to the sample and the second parameter is the file with the reads. The id will be used later to assemble the genome.

Some ways in which you can modify the behaviour of this script are based on the following parameters:

-n [NUMBER]: specifies the number of reads to be randomly selected. Default 100.000.
-complete: specifies to index the full set of reads (not recommended for assembly tasks).
-silence: for disabling the report of progress in the standard error channel.
-subid [NUMBER between 2 and 99]: if you want to obtain two separate indexes for the same data (usually with different number of selected reads) you can specify a secondary id for the index. By default the secondary id is "71"

So, if you want to generate an alternative index with 150.000 selected reads for the former example you can use this command:

:~$ ./index.sh sars-cov-2-sample1 your_path/your_reads.fa -n 150000 -subid 10

Assembling

Once you have indexed the reads you can produce an assembly with a command like this one:

:~$ ./assemble.sh sars-cov-2-sample1 500

The first parameter is the id you used in the indexing step and the second one is the number of reads to be processed. You usually will want to start processing a few reads and see how the result looks like. You can then adjust parameters if something doesn't look ok.

You can also specify an index id with the corresponding optional parameter.

You will usually want to redirect the output to a file.The output is a FASTA file with the assembled contigs.

:~$ ./assemble.sh sars-cov-2-sample1 500 > results/sars-cov-2-sample1-500.fa

Once the partial assembly is completed you can continue assembling from the same state without processing again the previous reads. For example, you could get an output like this one from the previous command:

******************************************************************************************
*******************               Looking for matches            *************************
******************************************************************************************
	Sequences processed: 500 out of 500 (100%)      Sequences matched:9969
******************************************************************************************
*******************   State saved. Session id: 4085840A70CC5EB5  *************************
******************************************************************************************
*******************               Processing matches             *************************
******************************************************************************************
	Alignments processed: 26824 out of 26824 (100%)
******************************************************************************************
*******************               Assembling contigs             *************************
******************************************************************************************
	Contigs processed: 22 out of 22 (100%)
******************************************************************************************

You can use the session id in the third line to continue assembling departing from this partial assembly:

:~$ ./assemble.sh sars-cov-2-sample1 2500 -sid 4085840A70CC5EB5 > results/sars-cov-2-sample1-500.fa

There are other parameters that let you specify the output format. For example this command will generate a FASTQ file instead of FASTA and a Multifasta file for every contig showing how the reads assemble to form the contig.

:~$ ./assemble.sh sars-cov-2-sample1 3500 -sid 4085840A70CC5EB5 -output alignments -format fastq > results/sars-cov-2-sample1-500.fa

That's the aspect that a typical alignment of reads in a contig will have if you visualize the FASTA file with a tool like AliView:

Here there is the complete list of available parameters for assemble.sh:

-output [contigs(default)|alignments]: specifies if the result will be only a list of contigs or also a set of multifasta files with the alignments. The multifasta files will be generated at the "results" folder
-format [fasta(default)|fastq]: lets you specify the format of the contigs file. The quality measurements in the fastq file will be based on the consensus of the reads used to generate the alignment.
-speed [slow|normal(default)|fast|fastest]: the fastest the algorithm goes the less accuracy it has.
-seeds [NUMBER]: for specifying the number of seeds from which start assembling.
-seed_id [STRING]: to specify a read from which start assembling (makes sense for partial transcriptome assembly).
-silence: for disabling the report of progress in the standard error channel.
-subid [NUMBER between 2 and 99]: the id you used in the indexing step
-t|-threads [NUMBER]: the number of threads that will be avalable during the assembly. By default it's 12. That number may be limited by the license terms: consult your license terms.
-of: Redirect the output channel to a file in the "results" folder with an automatic name including all the aprameters used in the assembly.

Checking reads quality

Scaffolding

Once you get an assembly you usually can continue the assembling increasing the depth parameter or you can directly try to scaffold to get larger contigs.

There are several ways to scaffold. You can, for example use Flye or Canu (long-read assemblers). That's usually the best option for assembling genomes over 100kbp.

For shorter genomes it works better a tool included in this software. You can use the following command, for example to scaffold the contigs obtained assembling at depth 500.

:~$ ./scaffold.sh results/sars-cov-2-sample1-500.fa

That will generate the file results/sars-cov-2-sample1-500.fa.scaffolds.fa that will contain a set of scaffolds obtained from the contigs. Note that this is an unguided process (without a reference) and therefore it will scaffold misassemblies too. Usually you will get contigs that score for a larger NG50 this way.

Checking reads quality

Mapping reads to a reference genome

The quality of the reads is an important factor for the suitability of the assemblies. One important factor contributing to bad results usually is the fact that the reads contain control sequences used by the sequencing technology that don't really pertain to the genome (adapters, barcodes...). A fast and easy way to check if there are issues with your reads is aligning them to a reference genome. This way you can see if the reads contain adapters or any extra sequences. That's something tou can also do with other doftware but it isn't usually easy to do as most alternative software doesn't let you visualize what it considers insertions.

For example, if you align the reads and get something like this:

you can see how the extremes of the reads contain sequences that look like garbage and do not overlapp with other reads. That's a conclusive evidence that you need clipping your data with software like Timmomatic or TrimGalore.

In contrast, that's what you will find if the reads are already correctly trimmed:

Note how the reads (shorter) are perfectly aligned to the reference in the center.

A typical command for mapping reads to a reference is:

:~$ ./map.sh sars-cov-2-sample1 /mnt/disk2/reads.fa /mnt/disk2/reference.fa 5000 -t 12

Note that we used only .5000 reads for the mapping. A number between 5.000 and 50.000 is usually enough as the resulting files can be quite large due to their multifasta nature.

This is the full list of accepted parameters for the map.sh script:

-silence: for disabling the report of progress in the standard error channel.
-t|-threads [NUMBER]: the number of threads that will be avalable during the mapping. By default it's 1. That number may be limited by the license terms: consult your license terms.

De novo transcriptome assembly

Indexing for de novo transcriptome assembly

This feature is experimental. We have not made tests in depth to figure out the possibilities of this method. We'd like to focus on that for next releases.

Indexing for de novo transcriptome assembly is done almost exactly the same way than indexing for genome assembly. The only difference is that in transcriptome assembly you sometimes have larger amounts of data. Depending on your needs you can index different number of reads. Selecting fewer reads you will likely get assemblies for the most frequent transcripts in your sample. If you want to assemble the least frequent transcripts you'll likely have to select a larger set of reads or filter somehow the reads to select those reads that likely belong to your target transcript.

In the experiments we have made, we see that usually a reads set of 300.000 reads will have a good ballance to start working on, but that number can vary with your dataset.

A typical transcriptome indexing, therefore, will look like that:

:~$ ./index.sh transcriptome-sample1 /mnt/diskA/transcriptome-reads.fa -n 300000

Assembling transcripts

The assembly also works almost identically to genome assembly. In that case you have to consider that you are likely assembling thousands of transcripts and not a single genome. Departing from one single seed doesn't seem the best strategy.

Therefore, you will usually set a higher initial number of seeds.

:~$ ./assemble.sh transcriptome-sample1 9500 -seeds 1000 > results/transcriptome-sample1-9500.fa

In case you know the id of a read matching to a transcript of your interest (for example, by mapping reads to a list of reference transcripts), you can set that id as seed and the chances that you obtain sooner an assembly of the transcript are increased (it's not assured, as other transcripts could also match that read).

:~$ ./assemble.sh transcriptome-sample1 9500 -seed_id _10001.1 > results/transcriptome-sample1-9500.fa

Remember to use the character '_' at the beginning of the read id and not including the character '>'.