Introduction to the quickstart

This quickstart implements a basic workflow that aligns reads against a reference genome. To do so, the developer needs to create components. A component is a workflow step. The components to create in this tutorial are:

  • BWAIndex in order to index the reference genome,
  • BWAmem to align a set of reads against an indexed reference.

Once the components are created, a workflow linking the 2 components should be created. The resulting workflow and components are provided in the jflow sources under the workflows/quickstart/ directory.

Step #1 create the folder tree

The first thing to do, before implementing the components and the workflow, is to create the folder tree. A workflow in jflow is a Python package defined by a folder with an __init__.py file.

Within the jflow sources, add a package named myQuickStart under the workflows/ directory and, within this package, create another package named components where all the components specific to myQuickStart will be stored. Note that if a component is shared by multiple workflows, you should add it to the workflows/components/ folder instead.

In this workflow, 2 components will be implemented, so create 2 empty files named bwaindex.py and bwamem.py. These 2 files will be used in step #2 to implement the components. You should obtain the following folder tree:

jflow/
├── bin/
├── docs/
├── src/
├── workflows/
│   ├── components/
│   ├── extparsers/
│   ├── myQuickStart/         [ create the folder myQuickStart ]
│   │   ├── components/       [ create the folder components where the components will be stored ]
│   │   │   ├── __init__.py   [ create an empty __init__.py file to make this directory a package ]
│   │   │   ├── bwaindex.py   [ create an empty bwaindex.py file for the BWAIndex component ]
│   │   │   ├── bwamem.py     [ create an empty bwamem.py file for the BWAmem component ]
│   │   ├── __init__.py       [ create an empty __init__.py file for the workflow definition ]
│   ├── __init__.py
│   ├── formats.py
│   └── types.py
├── applications.properties
└── README
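
The folder tree above can also be created with a few lines of Python. This is only a convenience sketch: create_quickstart_tree() is a hypothetical helper that is not part of jflow, and it assumes the path passed to it is the root of your jflow sources.

```python
from pathlib import Path


def create_quickstart_tree(jflow_root):
    """Create the myQuickStart package skeleton under <jflow_root>/workflows/."""
    workflow_dir = Path(jflow_root) / "workflows" / "myQuickStart"
    components_dir = workflow_dir / "components"
    # parents=True also creates workflows/myQuickStart/ when it is missing
    components_dir.mkdir(parents=True, exist_ok=True)

    created = []
    for path in (
        workflow_dir / "__init__.py",
        components_dir / "__init__.py",
        components_dir / "bwaindex.py",
        components_dir / "bwamem.py",
    ):
        path.touch(exist_ok=True)  # empty file; an existing file keeps its content
        created.append(path)
    return created
```

Calling create_quickstart_tree() with the path of your own jflow checkout returns the 4 files listed in the tree.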

Step #2 create required components

To create a component in jflow, you only need to implement a Python class inheriting from the jflow.component.Component class. Inheriting from this class forces the developer to overload the define_parameters() and process() methods. The first one lets the developer define all the parameters the component takes to run the command line. The second one specifies how the command line should be built.

import os

from jflow.component import Component
from weaver.function import PythonFunction, ShellFunction

class MyComponent (Component):

    def define_parameters(self, ...):
        # define the parameters
        pass

    def process(self):
        # define how the command line should be built
        pass

In this tutorial, 2 components are created: BWAIndex and BWAmem.
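
The idea that a subclass has to provide both methods can be illustrated with a small stand-alone sketch. ComponentSketch below is a hypothetical stand-in written with Python's abc module; it is not jflow's Component class, and jflow may enforce the overloads in a different way.

```python
from abc import ABC, abstractmethod


class ComponentSketch(ABC):
    """Hypothetical stand-in for jflow.component.Component (not the real class)."""

    @abstractmethod
    def define_parameters(self, *args, **kwargs):
        """Declare the inputs, outputs and parameters of the component."""

    @abstractmethod
    def process(self):
        """Build the command line(s) to run."""


class Incomplete(ComponentSketch):
    # process() is missing, so this class cannot be instantiated
    def define_parameters(self):
        pass


class Complete(ComponentSketch):
    def define_parameters(self):
        self.parameters = {}

    def process(self):
        return "command line goes here"


try:
    Incomplete()
    error = None
except TypeError as exc:
    error = exc  # raised because an abstract method is not overloaded

print(type(error).__name__)
print(Complete().process())
```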

BWAIndex component

The BWAIndex command line should look like this:

bwa index -a bwtsw -p input_file.fasta input_file.fasta > input_file.stdout 2>> input_file.stderr

Where bwa index is the executable, -a bwtsw sets the indexing algorithm to "bwtsw", -p input_file.fasta names the final output databank after the input file, input_file.fasta gives the fasta file to index, > input_file.stdout captures the stdout messages and 2>> input_file.stderr captures the stderr messages. From this, we can split the command line into different inputs and parameters as follows:

  • algorithm: defines the -a option, which specifies the indexing algorithm to use. "bwtsw" is one example, the other available values are "div" and "is". -a is a parameter and can be added as such by using the add_parameter() method available from the jflow.component.Component class. As this parameter only accepts 3 different values, the user's choice can be restricted to these values with the choices option of the add_parameter() method. Just like all add_[*] methods provided by the jflow.component.Component class, this method requires 2 arguments: the parameter name (here "algorithm") and the parameter help (here "Which algorithm should be used to index the fasta file").
  • databank: bwa index produces as output a databank whose name is defined by the -p option. Here, we choose to name it after the input file: input_file.fasta. An output can be added to a component with the add_output_file() method. This method, in addition to the parameter name and the parameter help, takes the filename option to define the name of the produced file.
  • input_fasta: defines the input file input_file.fasta as a component parameter. To do so, jflow provides the add_input_file() method. The file is required to build the command line; to enforce this, the required option can be set to True. In the same way, BWAIndex can only be run on fasta files, which can be specified with the file_format option.
  • stdout: to trace the command line, the produced stdout file (input_file.stdout) can be added as output file, just like it has been done for the databank parameter.
  • stderr: to trace the command line errors, the stderr file (input_file.stderr) can also be added as output file.
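
To get a feeling for what restricting a parameter to a set of choices means, here is a stand-alone sketch based on Python's argparse module, which offers a choices option of the same name. This is not jflow code: the parser and the "bwaindex-sketch" program name are made up for the illustration, and this tutorial does not state how jflow validates the value internally.

```python
import argparse

# Stand-in for the "algorithm" parameter: same name, help text and choices
# as in the tutorial, but declared on a plain argparse parser.
parser = argparse.ArgumentParser(prog="bwaindex-sketch")
parser.add_argument(
    "--algorithm",
    default="bwtsw",
    choices=["bwtsw", "div", "is"],
    help="Which algorithm should be used to index the fasta file",
)

args = parser.parse_args(["--algorithm", "is"])
print(args.algorithm)

# A value outside the allowed choices makes argparse exit with an error message.
try:
    parser.parse_args(["--algorithm", "unknown"])
    rejected = False
except SystemExit:
    rejected = True
print(rejected)
```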

To be added to a component, all these parameters should be specified within the define_parameters() method as follows:

import os

from jflow.component import Component
from weaver.function import ShellFunction

class BWAIndex (Component):

    def define_parameters(self, input_fasta, algorithm="bwtsw"):
        self.add_input_file("input_fasta", "Which fasta file should be indexed",
                            file_format="fasta", default=input_fasta, required=True)
        self.add_parameter("algorithm", "Which algorithm should be used to index the fasta file",
                           default=algorithm, choices=["bwtsw", "div", "is"])
        self.add_output_file("databank", "The indexed databank",
                             filename=os.path.basename(input_fasta))
        self.add_output_file("stdout", "The BWAIndex stdout file", filename="bwaindex.stdout")
        self.add_output_file("stderr", "The BWAIndex stderr file", filename="bwaindex.stderr")

    def process(self):
        # define how the command line should be built
        pass

Including these parameter definitions, the resulting command line should have the following structure:

[EXE] index -a [algorithm] -p [databank] [input_fasta] > [stdout] 2>> [stderr]

In the following, this structure will help us build the command line. To build a command line, jflow provides a function named ShellFunction to which the command line structure can be given (nb: other functions are available, such as PythonFunction to run an internal function). ShellFunction takes 2 arguments: the command line structure, which is required, and cmd_format, which defines the parameter ordering.

Considering cmd_format="{EXE} {IN} {OUT}", which is a classic value for this option, jflow will consider the following order for inputs and outputs: input_fasta, databank, stdout and then stderr, resulting in the following command structure:

[EXE] index -a [algorithm] -p $2 $1 > $3 2>> $4
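
The numbering of $1 to $4 can be checked with a small stand-alone sketch that numbers the inputs first and the outputs next, then fills the command structure. fill_positional_arguments() and the data/ and out/ paths are hypothetical and only mimic the ordering described above; this is not how jflow or weaver actually run the command.

```python
import re


def fill_positional_arguments(command, inputs, outputs):
    """Replace $1, $2, ... in command, numbering inputs first and outputs next."""
    arguments = list(inputs) + list(outputs)

    def substitute(match):
        index = int(match.group(1))
        return arguments[index - 1]  # $1 is the first argument

    return re.sub(r"\$(\d+)", substitute, command)


command = fill_positional_arguments(
    "bwa index -a bwtsw -p $2 $1 > $3 2>> $4",
    inputs=["data/input_file.fasta"],
    outputs=["out/input_file.fasta", "out/bwaindex.stdout", "out/bwaindex.stderr"],
)
print(command)
# bwa index -a bwtsw -p out/input_file.fasta data/input_file.fasta > out/bwaindex.stdout 2>> out/bwaindex.stderr
```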

All executable paths are accessible using the get_exec_path() method. This leads to the following implementation of the process() method:

import os

from jflow.component import Component
from weaver.function import PythonFunction, ShellFunction

class BWAIndex (Component):

    def process(self):        
        bwaindex = ShellFunction("ln -s $1 $2; " + self.get_exec_path("bwa") + " index -a " + \
                                 self.algorithm + " -p $2 $1 > $3 2>> $4", 
                                 cmd_format="{EXE} {IN} {OUT}")
        bwaindex(inputs=self.input_fasta, outputs=[self.databank, self.stdout, self.stderr])

In this example, the bwa index command line is preceded by the creation of a symbolic link. This is done because bwa aln|mem|... take as input the prefix of the created databank and not directly the files generated by bwa index. The final class is given by:

import os

from jflow.component import Component
from weaver.function import ShellFunction

class BWAIndex (Component):
    
    def define_parameters(self, input_fasta, algorithm="bwtsw"):
        self.add_input_file("input_fasta", "Which fasta file should be indexed", 
                            file_format="fasta", default=input_fasta, required=True)
        self.add_parameter("algorithm", "Which algorithm should be used to index the fasta file", 
                           default=algorithm, choices=["bwtsw", "div", "is"])
        self.add_output_file("databank", "The indexed databank", 
                             filename=os.path.basename(input_fasta))
        self.add_output_file("stdout", "The BWAIndex stdout file", filename="bwaindex.stdout")
        self.add_output_file("stderr", "The BWAIndex stderr file", filename="bwaindex.stderr")
        
    def process(self):        
        bwaindex = ShellFunction("ln -s $1 $2; " + self.get_exec_path("bwa") + " index -a " + \
                                 self.algorithm + " -p $2 $1 > $3 2>> $4", 
                                 cmd_format="{EXE} {IN} {OUT}")
        bwaindex(inputs=self.input_fasta, outputs=[self.databank, self.stdout, self.stderr])
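
The role of the symbolic link can be illustrated outside jflow. The sketch below reproduces "ln -s $1 $2" with os.symlink() in a temporary directory and lists the files that bwa index usually writes next to the prefix (.amb, .ann, .bwt, .pac and .sa). link_databank() is a hypothetical helper, bwa itself is not run, and the code assumes a platform where symbolic links can be created.

```python
import os
import tempfile

# File extensions usually written by "bwa index" next to the chosen prefix
BWA_INDEX_EXTENSIONS = (".amb", ".ann", ".bwt", ".pac", ".sa")


def link_databank(input_fasta, databank):
    """Equivalent of "ln -s input_fasta databank"; returns the expected index files."""
    os.symlink(input_fasta, databank)
    return [databank + extension for extension in BWA_INDEX_EXTENSIONS]


with tempfile.TemporaryDirectory() as workdir:
    # a tiny fasta file standing in for the reference genome
    fasta = os.path.join(workdir, "input_file.fasta")
    with open(fasta, "w") as handle:
        handle.write(">seq1\nACGT\n")

    # the databank is named after the input file, in a separate output folder
    out_dir = os.path.join(workdir, "out")
    os.mkdir(out_dir)
    databank = os.path.join(out_dir, os.path.basename(fasta))

    index_files = link_databank(fasta, databank)
    databank_is_link = os.path.islink(databank)
    index_names = [os.path.basename(path) for path in index_files]

print(databank_is_link)
print(index_names)
```

With the link in place, the databank output is a real path that a later step can receive, and the index files sit next to it under the same prefix.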

BWAmem component

In the same way, the BWAmem component command line should look like this:

bwa mem reference.fasta sample.fastq > sample.sam 2>> sample.stderr

The main difference from the previous component is that we will give BWAmem the ability to process multiple files, to obtain the following:

bwa mem reference.fasta sample.fastq > sample.sam 2>> sample.stderr
bwa mem reference.fasta sample2.fastq > sample2.sam 2>> sample2.stderr
bwa mem reference.fasta sample3.fastq > sample3.sam 2>> sample3.stderr
...

To do so, we will introduce, in this section, the methods add_input_file_list() and add_output_file_list() and the notion of abstraction. The parameters that can be defined are:

  • reads: defines the sample.fastq file. Here, we will allow the component to iterate through multiple input files. This is possible by using the add_input_file_list() method available from the jflow.component.Component class. The options of this method are the same as the ones available with add_input_file().
  • reference_genome: defines the reference.fasta file. This parameter is a single file and can be added using the add_input_file() method as described in the BWAIndex component.
  • sam_files: defines the output file sample.sam as an output file list parameter. This one can be added to the component using the add_output_file_list() method. To do so, 2 options have to be given to the method: pattern and items. The first one defines the output filename pattern and the second one gives the list of items through which the component will iterate. {basename_woext} retrieves the basename, without extension, of each item, set in this example from the input file list.
  • stderr: to trace the command line, the produced error messages will be stored as an output file list (1 stderr file per execution).
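
The way pattern and items combine can be sketched in plain Python. expand_pattern() is a hypothetical helper, not a jflow function, and it assumes that {basename_woext} means the file name with its directory and its last extension removed.

```python
import os


def expand_pattern(pattern, items):
    """Build one output file name per item from a pattern such as '{basename_woext}.sam'."""
    names = []
    for item in items:
        basename = os.path.basename(item)
        # Assumption: only the last extension is removed (sample.fastq -> sample)
        basename_woext = os.path.splitext(basename)[0]
        names.append(pattern.format(basename_woext=basename_woext))
    return names


reads = ["/data/sample.fastq", "/data/sample2.fastq", "/data/sample3.fastq"]
sam_files = expand_pattern("{basename_woext}.sam", reads)
stderr_files = expand_pattern("{basename_woext}.stderr", reads)
print(sam_files)     # ['sample.sam', 'sample2.sam', 'sample3.sam']
print(stderr_files)  # ['sample.stderr', 'sample2.stderr', 'sample3.stderr']
```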

Just like in the previous example, ShellFunction is used to define the command line structure. However, whereas in BWAIndex the function was executed directly, here we want BWAmem to iterate through the input file list. This can be done by using an abstraction. In this example we will use MultiMap to map an input to multiple outputs, but several other abstractions exist. To use an abstraction, you only need to call the abstraction function (here MultiMap) on the ShellFunction previously defined:

MultiMap(bwamem, inputs=[...], outputs=[...], includes=[...])

The final class is then given by:

from jflow.component import Component
from jflow.abstraction import MultiMap

from weaver.function import ShellFunction


class BWAmem (Component):

    def define_parameters(self, reference_genome, reads):
        self.add_input_file_list( "reads", "Which reads files should be used.", 
                                  default=reads, required=True )
        self.add_input_file("reference_genome", "Which reference file should be used", 
                            default=reference_genome, required=True)
        self.add_output_file_list("sam_files", "The BWA output files",
                                  pattern='{basename_woext}.sam', items=self.reads)
        self.add_output_file_list("stderr", "The BWA stderr file", 
                                  pattern='{basename_woext}.stderr', items=self.reads)

    def process(self):
        bwamem = ShellFunction(self.get_exec_path("bwa") + " mem " + self.reference_genome + \
                               " $1 > $2 2>> $3", cmd_format='{EXE} {IN} {OUT}')
        bwamem = MultiMap(bwamem, inputs=[self.reads], outputs=[self.sam_files, self.stderr], 
                          includes=[self.reference_genome])
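
What the MultiMap call amounts to can be sketched without jflow: for each input file, one command is produced in which $1 is the input and $2, $3 are the outputs found at the same position in each output list. multi_map_commands() is a hypothetical helper that only prints the resulting command lines; the includes argument is not modelled, and the real abstraction may behave differently.

```python
import re


def multi_map_commands(template, inputs, outputs):
    """Return one command per input; outputs holds one list per output slot."""
    commands = []
    for position, input_file in enumerate(inputs):
        # $1 is the current input, $2.. are the outputs sharing its position
        arguments = [input_file] + [output_list[position] for output_list in outputs]
        commands.append(
            re.sub(r"\$(\d+)", lambda match: arguments[int(match.group(1)) - 1], template)
        )
    return commands


reads = ["sample.fastq", "sample2.fastq", "sample3.fastq"]
sam_files = ["sample.sam", "sample2.sam", "sample3.sam"]
stderr_files = ["sample.stderr", "sample2.stderr", "sample3.stderr"]

commands = multi_map_commands(
    "bwa mem reference.fasta $1 > $2 2>> $3", reads, [sam_files, stderr_files]
)
for command in commands:
    print(command)
```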

Step #3 create the workflow

Creating a workflow in jflow is quite similar to creating a component. It requires implementing a Python class inheriting from the jflow.workflow.Workflow class. Inheriting from this class forces the developer to overload 3 methods:

  • get_description(): should return a workflow description useful for the end user,
  • define_parameters(): similar to the method described for the components,
  • process(): in charge of creating the workflow by linking the different components.

from jflow.workflow import Workflow

class MyWorkflow (Workflow):

    def get_description(self):
        return "a description"

    def define_parameters(self, function="process"):
        # define the parameters
        pass

    def process(self):
        # add and link the components
        pass

The first thing to do is to overload the get_description() method to give a description to our new workflow.

from jflow.workflow import Workflow

class MyQuickStart (Workflow):

    def get_description(self):
        return "Align reads against a reference genome"

In this tutorial, the final workflow will only take 2 parameters: a list of read files (add_input_file_list()) and a reference genome (add_input_file()). Just like for a component, this can be defined as follows:

from jflow.workflow import Workflow

class MyQuickStart (Workflow):

    def define_parameters(self, function="process"):
        self.add_input_file_list("reads", "Which read files should be used", 
                                 file_format="fastq", required=True)
        self.add_input_file("reference_genome", "Which genome should the reads be aligned on",
                            file_format="fasta", required=True)

NB: The function="process" option links a set of parameters to an execution function name (here "process").

Finally, we can add the components to the workflow by using the add_component() method. This method takes as arguments the name of the component (given by the component class name) and the component parameters. All the outputs defined within a component are accessible as attributes of the component. Thus it is easy to link the different components to each other.

from jflow.workflow import Workflow

class MyQuickStart (Workflow):

    def process(self):
        # index the reference genome
        bwaindex = self.add_component("BWAIndex", [self.reference_genome])
        # align reads against the indexed genome
        bwamem = self.add_component("BWAmem", [bwaindex.databank, self.reads])

In this example, BWAmem takes as input the databank parameter produced by BWAIndex. The final class representing the workflow should look like this:

from jflow.workflow import Workflow

class MyQuickStart (Workflow):

    def get_description(self):
        return "Align reads against a reference genome"

    def define_parameters(self, function="process"):
        self.add_input_file_list("reads", "Which read files should be used", 
                                 file_format="fastq", required=True)
        self.add_input_file("reference_genome", "Which genome should the reads be aligned on",
                            file_format="fasta", required=True)

    def process(self):
        # index the reference genome
        bwaindex = self.add_component("BWAIndex", [self.reference_genome])
        # align reads against the indexed genome
        bwamem = self.add_component("BWAmem", [bwaindex.databank, self.reads])
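
The linking done in process() relies on the outputs of a component being exposed as attributes. The toy classes below are hypothetical stand-ins (they are not jflow components and run nothing); they only show how the databank attribute of the first step becomes the reference_genome of the second one. The /data and /work paths are made up.

```python
import os


class BWAIndexSketch:
    """Toy stand-in: exposes its output as an attribute, like the tutorial's BWAIndex."""

    def __init__(self, input_fasta, output_dir):
        self.input_fasta = input_fasta
        # the databank is named after the input file, inside the component output folder
        self.databank = os.path.join(output_dir, os.path.basename(input_fasta))


class BWAmemSketch:
    """Toy stand-in: simply records the parameters it receives."""

    def __init__(self, reference_genome, reads):
        self.reference_genome = reference_genome
        self.reads = reads


bwaindex = BWAIndexSketch("/data/reference.fasta", "/work/BWAIndex")
# the output of the first step is passed as an input of the second one
bwamem = BWAmemSketch(bwaindex.databank, ["/data/sample.fastq"])
print(bwamem.reference_genome)
```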

Step #4 test your workflow

From your install directory, enter:

python bin/jflow_cli.py -h
usage: jflow_cli.py [-h]
[...]
Available sub commands:
  {rerun,reset,delete,execution-graph,status,quickstart,myquickstart}
    [...]
    myquickstart        Align reads against a reference genome
    [...]

A new line with "myquickstart" should now be listed among the available sub commands. Run it on your own data ...
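
This check can also be scripted. The sketch below parses a help text shaped like the one shown above and looks for the new sub command; list_subcommands() is a hypothetical helper, and the help text is pasted in by hand rather than produced by a real jflow installation.

```python
import re


def list_subcommands(help_text):
    """Extract the names listed between braces in an argparse-style help text."""
    match = re.search(r"\{([^}]*)\}", help_text)
    return match.group(1).split(",") if match else []


help_text = """usage: jflow_cli.py [-h]
Available sub commands:
  {rerun,reset,delete,execution-graph,status,quickstart,myquickstart}
    myquickstart        Align reads against a reference genome
"""

subcommands = list_subcommands(help_text)
print("myquickstart" in subcommands)
```

On a real installation, the help text could be captured with subprocess.run([sys.executable, "bin/jflow_cli.py", "-h"], capture_output=True, text=True).stdout instead of being pasted in.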