RNA Pipeline Quickstart¶

1) Download and install the gemtools¶

If that goes well, you will have a “gemtools” command line tool available. Please check the GEMTools homepage for download and installation instructions.

2) Create a genome index¶

In order to run the pipeline for any given genome, you need to create a gem index for that particular genome. The gem-indexer takes a single single fasta file as input. Assume you have a genome.fa file, you can create the .gem index by calling:

gemtools index -i genome.fa -t 8

Here -t 8 indicates to use 8 threads/cpus, put this to the number of cpus available to you. The command will create 3 files just next to the fasta file:

genome.gem  -- the gem index
genome.hash -- hash version of the genome. You probably don't need this
               and the creation can be avoided using --no-hash
genome.log  -- log file of the indexing process

3) Transcriptome index¶

If you have a genome annotation in GTF format available, you should create a transriptome index from that GTF file. Say you have an annotation.gtf, file. You can create a transcriptome index like this:

gemtools t-index -a annotation.gtf -i genome.gem -t 8

The transcriptome indexer takes the annotation in GTF format and the previously generated .gem index as input. The -t 8 again indicates that 8 threads should be used by the process.

This creates 5 files in the current folder:

annotation.gtf.junctions – the splice junctions annotation.gtf.junctions.fa – the transcriptome annotation.gtf.junctions.gem – the transcriptome index annotation.gtf.junctions.keys – keys to translate from transcriptome to genome annotation.gtf.junctions.log – indexer log file

The pipeline needs all the files except the log file and its searching for them automatically just next to the annotation fiel (if you do not specify the paths explicitly). For a quick start, just keep the files next to the annotation.

4) Run the pipeline¶

With the index and the optional transcript index you can run the pipeline. For this example, lets assume you have paired-end reads in two files reads_1.fastq.gz and reads_2.fastq.gz.

Run this command to get an overview of what will happen:

gemtools rna-pipeline -i genome.gem -a annotation.gtf -f reads_1.fastq -q 33 -t 8 --dry

The tool will complain if anything is missing, otherwise it will print an overview of the pipeline steps. Note that we assume here that everything is in the same folder for the sake of simplicity. The main paramters are:

-i genome.gem     -- the gem genome index
-a annotation.gtf -- the annotation. You can just skip this if you don't
                     have one. Otherwise the tool will search for the transcriptome
                     index just next to the annotation. If you put it
                     somewhere else you have to explicitly set the paths. Call
                     gemtools rna-pipeline --help for an overview of the available
                     options.

-f reads_1.fastq  -- the input file. NOTE that by default we assume you have
                     paired end reads and we search for a second file called
                     reads_2.fastq.gz (works also for other variants like
                     reads.0.fastq, reads_0.fq). If it doesn't find the
                     second file, specify it explicitly like
                     -f reads_1.fastq.gz reads_2.fastq.gz. If you do NOT
                     HAVE PAREID reads, just specify the input file and
                     add --single-end to the paramters.

-q 33             -- this is the quality offset. Should be 33 or 64, but
                     you have to specify this as figuring this out automatically
                     can be expensive.
-t 8              -- threads again. This is important as it significantly
                     speeds up the runs!

Now, if everything looks good, start the same command but remove the --dry to actually start the run. With the default configuration you will get these output files:

reads.map.gz    -- the gem aligned reads
reads.bam       -- alignments in bam format
reads.junctions -- the denovo junction sites found
reads.stats*    -- two sets of stats, *all* for all the mappings found,
                   *best* considering only the best mappings. You get two
                   files each. .txt is the human readable form, .json is
                   in JSON format so you can easily read the stats with
                   more or less any modern programming language.

5) Run a stats report on the output¶

This step is optional, but allows a quick result check:

gemtools report -i reads.stats.all.json -p

The report command will create a .zip file with a graphical report (-p indicates paired end reads). Unzip the file and open the index.html in your web browser.

Notes¶

the pipeline is restartable. When it failed at some point you can just try to restart the run, it will skip the already completed steps.
you can save pipeline configuration and reuse them later. This is handy if you have more datasets. For example:
```
gemtools rna-pipeline -i genome.gem -a annotation.gtf -q 33 -t 8 --save config.gt
```
will not run anything but save the configuration to a config.gt file. You can reuse the configuration like this:
```
gemtools rna-pipeline --load config.gt -f reads_A_1.fastq.gz
gemtools rna-pipeline --load config.gt -f reads_B_1.fastq.gz
gemtools rna-pipeline --load config.gt -f reads_C_1.fastq.gz
....
```
All the configuration is taken from the config.gt file and you override the input file from the command line. Note that the configuration file is stored in JSON format. You can read this easily and also create configuration files automatically.

gemtools has a lot of options. Take a look with gemtools rna-pipeline --help. Here are a few important ones:

--name         -- specify an output name, otherwise the input file name
                  is used as a template
--compress-all -- if you have very limited disk space, you can tell the
                  pipeline to compress all intermediate files on the fly.
                  This costs performance though!
--direct-input -- in case of limited disk space, add this to stream the
                  initial data directly into gem rather then creating a
                  dedicated input file.

LAST BUT VERY IMPORTANT NOTE

The pipeline expects to find samtools installed on the system. Try to get the lates samtools from their github repository (https://github.com/samtools/samtools – clone or download and call make to build it). The latest version is multi-threaded (i.e. samtools view --help will show a -@ paramter). Also, see if you have pigz installed in the system you try to run gemtools on. pigz is a parallel compressor and the pipeline makes use of it if it is available. It will speed up compression steps a lot!

Running into problems?¶

Please do not hesitate to contact us if you run into any problems or you would like to see other features implemented. Please consider using the GEMTools issue tracker.

Scoring Scheme¶

The rna-pipeline computes scores for each alignment. These scores are transformed to the SAM MAPQ field in the resulting SAM file. Here is the basic scheme, how the gem scores are calculated.

Best strata¶

The score for the best stratum is set as follows:

Match class		GEM score	SAM score
Unique, no subdominant match	Best	65504	254
Unique, no subdominant match	Worst	64512	252
Unique, no subdominant match(es)	Best	32736	180
Unique, no subdominant match(es)	Worst	31744	177
Not unique, different score	Best	16352	127
Not unique, different score	Worst	15360	123
Perfect tie	Best	14304	119
Perfect tie	Worst	13312	114

Next strata¶

If the read contains more than one mapping, the score for the next stratum is calculated as:

Match class		GEM score	SAM score
Unique, no subdominant match	Best	8160	89
Unique, no subdominant match	Worst	5120	71
Unique, no subdominant match(es)	Best	12256	110
Unique, no subdominant match(es)	Worst	9216	95
Not unique, different score	Best	12256	110
Not unique, different score	Worst	9216	95
Perfect tie	Best	12256	110
Perfect tie	Worst	9216	95

In addition, the optional SAM field XT is set to R for all perfect ties and to U for all other alignments.

GEMTools 1.6 documentation