What is a component

A component is a command line or a collection of command lines, which can be external scripts or Python code. In jflow, a component is represented by a Python class inheriting from jflow.component.Component. It lists all the inputs, outputs and parameters required to run the command line(s) and defines its structure.

Where to add a new component

New components must be added in a Python package. Two different locations are possible in order to be imported by jflow:

  • workflows.components: the component will be visible by all workflows,
  • workflows.myWorkflow.components: the component will only be available for myWorkflow.

The following tree represents the structure of the sources and the locations where component packages can be added.

jflow/
├── bin/
├── docs/
├── src/
├── workflows/
│   ├── myWorkflow/
│   │   ├── components/          [ workflow specific components ]
│   │   │   └── myComponent.py   [ the component code ]
│   │   └── __init__.py
│   ├── components/              [ general components ]
│   │   ├── __init__.py
│   │   └── myComponent.py       [ the component code ]
│   ├── extparsers/
│   ├── __init__.py
│   ├── formats.py
│   └── types.py
├── applications.properties
└── README

The Component class

In jflow, a component is a class defined in the myComponent.py file. In order to add a new component, the developer has to:

  • implement a class inheriting from the jflow.component.Component class,
  • overload the define_parameters() method to add the component inputs, outputs and parameters,
  • overload the process() method to define the command line(s) structure.

The class skeleton is given by

from jflow.component import Component

class MyComponent (Component):

    def define_parameters(self, param1, param2, ...):
        # define the parameters

    def process(self):
        # define the command line(s) structure

Define parameters

The define_parameters() method is used to add component parameters, inputs and outputs. To do so, several methods are available. Once defined, the new parameters are available as object attributes, and are thus accessible through self.parameter_name.

Several types of parameters can be added, all described in the following sections. All have two required positional arguments: name and help. The other arguments are optional and can be passed to the method by keyword.
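
For example, here is a minimal sketch of a hypothetical component declaring a single parameter and reading it back through self (the component and parameter names are only illustrative):

from jflow.component import Component

class MyTrimmer (Component):

    def define_parameters(self, quality=20):
        # the declared parameter becomes available as self.quality
        self.add_parameter("quality", "The quality threshold.",
                           type="int", default=quality)

    def process(self):
        # self.quality holds the value provided by the workflow and
        # can be used here to build the command line(s)
        pass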

Parameters

Parameters can be added to handle a single element or a list of elements. Thus, the add_parameter() method can be used to force the final user to provide one and only one value, whereas the add_parameter_list() method allows the final user to give as many values as they want.

add_parameter()

Example

In the following example, a parameter named sequencer is added to the component. It has a list of choices and the default value is "HiSeq2000".

self.add_parameter("sequencer",
    		   "The sequencer type.", 
    		   choices = ["HiSeq2000", "ILLUMINA","SLX","SOLEXA","454","UNKNOWN"], 
    		   default="HiSeq2000")

Options

There are two positional arguments: name and help. All other options are keyword options.

Name Type Required Default value Description
name string true None The name of the parameter. The parameter value is accessible within the workflow object through the attribute named self.parameter_name.
help string true None The parameter help message.
default - false None The default parameter value. Its type depends on the parameter type.
type string false "str" The parameter type. The value provided by the final user will be casted and checked against this type. All built-in Python types are available "int", "str", "float", "bool", "date", ... To create customized types, refere to the Add a data type documentation.
choices list false [] A list of the allowed values.
required boolean false false Whether or not the parameter can be omitted.
flag string false None The command line flag (if the value is None, the flag will be --name).
group string false "default" The value is used to group a list of parameters in sections. The group is used in both command line and GUI.
display_name string false None The parameter name that should be displayed on the final form.
cmd_format string false "" The command format is the parameter skeleton required to build the final command line.
argpos integer false -1 The parameter position in the command line.

add_parameter_list()

The add_parameter_list() method takes the same arguments as add_parameter(). However, with this method the final user is allowed to enter multiple values for this parameter, and the object attribute self.parameter_name will be set as a Python list.
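
For example, a hypothetical declaration allowing the final user to pass several sample names could be:

self.add_parameter_list("samples",
                        "The sample names to process.",
                        required=True)

After this call, self.samples is a Python list containing all the values given by the final user.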

Inputs

Just like parameters, inputs can be added to handle a single file or a list of files. Thus, the add_input_file() method can be used to force the component to take one and only one file, whereas the add_input_file_list() method allows the component to take as many files as needed.

add_input_file()

Example

In the following example, an input named reads is added to the component. The provided file is required and should be in fastq format. No file size limitation is specified.

self.add_input_file_list("reads", 
                         "Which read files should be used", 
                         file_format="fastq", 
                         required=True)

Options

There are two positional arguments: name and help. All other options are keyword options.

Name Type Required Default value Description
name string true None The name of the parameter. The parameter value is accessible within the workflow object through the attribute named self.parameter_name.
help string true None The parameter help message.
default string false None The default path value.
file_format string false "any" The file format is checked before running the workflow. To create customized format, refere to the Add a file format documentation.
type string false "inputfile" The type can be "inputfile", "localfile", "urlfile" or "browsefile". An "inputfile" allows the final user to provide a "localfile" or an "urlfile" or a "browsefile". A "localfile" restricts the final user to provide a path to a file visible by jflow. An "urlfile" only permits the final user to give an URL as input, where a "browsefile" force the final user to upload a file from its own computer. This last option is only available from the GUI and is considered as a "localfile" from the command line. All the uploading process is handled by jflow.
required boolean false false Whether or not the parameter can be omitted.
flag string false None The command line flag (if the value is None, the flag will be --name).
group string false "default" The value is used to group a list of parameters in sections. The group is used in both command line and GUI.
display_name string false None The parameter name that should be displayed on the final form.
cmd_format string false "" The command format is the parameter skeleton required to build the final command line.
argpos integer false -1 The parameter position in the command line.

add_input_file_list()

This method takes the same arguments as add_input_file(). However, with this method the component can take a list of files, and the object attribute self.parameter_name will be set as a Python list.
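
Reusing the example above in its list form, a declaration could be:

self.add_input_file_list("reads",
                         "Which read files should be used",
                         file_format="fastq",
                         required=True)

After this call, self.reads is a Python list of file paths.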

add_input_directory()

Input directories are handled in the same way as input files.

Example

In the following example, an input directory named reference is added to the component. The provided directory is required.

self.add_input_directory("reference",
                         "Folder containing the reference genome",
                         required=True)

Options

There are two positional arguments: name and help. All other options are keyword options.

Name Type Required Default value Description
name string true None The name of the parameter. The parameter value is accessible within the workflow object through the attribute named self.parameter_name.
help string true None The parameter help message.
default string false None The default path value.
required boolean false false Whether or not the parameter can be omitted.
flag string false None The command line flag (if the value is None, the flag will be --name).
group string false "default" The value is used to group a list of parameters in sections. The group is used in both command line and GUI.
display_name string false None The parameter name that should be displayed on the final form.
cmd_format string false "" The command format is the parameter skeleton required to build the final command line.
argpos integer false -1 The parameter position in the command line.

add_input_object()

Adds a Python object as input. The object is an instance of a class inheriting from object.

Example

Considering the following object:


class MyObject(object):

    def __init__(self, value):
        self.value = value

    def add_five(self):
        self.value = self.value + 5

    def multiply_by_ten(self):
        self.value = self.value * 10

We instantiate this object and pass it as a component argument. Inside the component, the input is defined as follows:

self.add_input_object("i_object", "The input object to add 5", default=i_object)

Note: the object is only accessible from within a Python function loaded through an "add_python_execution".

Options

There are two positional arguments: name and help. All other options are keyword options.

Name Type Required Default value Description
name string true None The name of the parameter. The parameter value is accessible within the workflow object through the attribute named self.parameter_name.
help string true None The parameter help message.
default string false None The default object.
required boolean false false Whether or not the parameter can be omitted.

add_input_object_list()

This method takes the same arguments as add_input_object(). However, with this method the component can take a list of objects, and the object attribute self.parameter_name will be set as a Python list.
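
A hypothetical declaration, reusing the MyObject class defined above, could be:

objects = [MyObject(1), MyObject(2)]
self.add_input_object_list("i_objects", "The input objects to process", default=objects)

After this call, self.i_objects is a Python list of objects.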

Outputs

Just like inputs, outputs can be added to handle a single file or a list of files. Thus, the add_output_file() method can be used to force the component to produce one and only one file, whereas the add_output_file_list() method allows the component to produce as many files as needed.

add_output_file()

Example

In the following example, an output named databank is defined. The process has to produce the file, otherwise the workflow will fail. The file written to disk will have the same name as the one stored in the variable input_fasta.

self.add_output_file("databank", 
		     "The indexed databank", 
		     filename=os.path.basename(input_fasta))

Options

The two positional arguments name and help are always present.

Name Type Required Default value Description
name string true None The name of the parameter. The parameter value is accessible within the workflow object through the attribute named self.parameter_name.
help string true None The parameter help message.
filename string false None The expected name of the output file.
file_format string false "any" The file format is checked before running the workflow. To create customized format, refere to the Add a file format documentation.
group string false "default" The value is used to group a list of parameters in sections. The group is used in both command line and GUI.
display_name string false None The parameter name that should be displayed on the final form.
cmd_format string false "" The command format is the parameter skeleton required to build the final command line.
argpos integer false -1 The parameter position in the command line.

add_output_file_list()

Example

In the following example, an output named sam_files is defined. The files written to disk will all follow the pattern {basename_woext}.sam applied to the items of the self.reads variable. The resulting list gathers files with the same basenames as self.reads but with the extension substituted by ".sam".

self.add_output_file_list("sam_files", 
			  "The BWA output files", 
			  pattern='{basename_woext}.sam', 
			  items=self.reads)

Options

The two positional arguments name and help are always present.

Name Type Required Default value Description
name string true None The name of the parameter. The parameter value is accessible within the workflow object through the attribute named self.parameter_name.
help string true None The parameter help message.
items list false None A list of elements over which add_output_file_list() iterates to produce the output list.
pattern string false '{basename_woext}.out' The pattern is used to produce the output list. add_output_file_list() maps the pattern on each item to produce the final value. The pattern accepts these predefined values:
  • {fullpath}, {FULL} for full input file path,
  • {basename}, {BASE} for base input file name,
  • {fullpath_woext}, {FULLWE} for full input file path without extension,
  • {basename_woext}, {BASEWE} for base input file name without extension.
file_format string false "any" The file format is checked before running the workflow. To create customized format, refere to the Add a file format documentation.
group string false "default" The value is used to group a list of parameters in sections. The group is used in both command line and GUI.
display_name string false None The parameter name that should be displayed on the final form.
cmd_format string false "" The command format is the parameter skeleton required to build the final command line.
argpos integer false -1 The parameter position in the command line.

add_output_file_endswith()

Example

In the following example, an output named sam_files is defined. The list will be made of all the files written to disk with an extension equal to .sam. This is performed at the end of the execution and can be useful when the component produces an unknown number of output files.

self.add_output_file_endswith("sam_files",
                              "The BWA output files",
                              pattern='.sam')

Options

This method is very different from add_output_file_list() because it should only be used when the number of output files returned by the component is unknown. Three options are required: name, help and pattern.

Name Type Required Default value Description
name string true None The name of the parameter. The parameter value is accessible within the workflow object through the attribute named self.parameter_name.
help string true None The parameter help message.
pattern string true None The extension of the files to return.
behaviour string false "include" How to process selected files. Other values than "include" mean that all files not ending with the pattern will be selected.
file_format string false "any" The file format is checked before running the workflow. To create customized format, refere to the Add a file format documentation.
group string false "default" The value is used to group a list of parameters in sections. The group is used in both command line and GUI.
display_name string false None The parameter name that should be displayed on the final form.
cmd_format string false "" The command format is the parameter skeleton required to build the final command line.
argpos integer false -1 The parameter position in the command line.

add_output_file_pattern()

Example

In the following example, an output named sam_files is defined. The list will be made of all the files written to disk matching the pattern *_R1.sam. This is performed at the end of the execution and can be useful when the component produces an unknown number of output files.

self.add_output_file_pattern("sam_files",
                             "The BWA output files",
                             pattern='*_R1.sam')

Options

This method is slightly different from add_output_file_endswith() because it returns all the files matching the given pattern instead of only an extension. All the options are the same, only pattern differs.

Name Type Required Default value Description
pattern string true None The regexp string used to retrieve the output files.

add_output_object()

Example

In the following example, an output named o_object is defined. The Python function defined for execution has to return a Python object, which will be stored in this output.

self.add_output_object("o_object", "The output object")

Options

Name Type Required Default value Description
name string true None The name of the output object.
help string true None The description of the object.
required bool false false Whether or not the output is required.

add_output_object_list()

Example

In the following example, an output named o_object is defined. The Python function defined for execution has to return a list of Python objects, with exactly the number of items given in the nb_items argument (2 in this example).

self.add_output_object_list("o_object", "The output object", nb_items=2)

Options

Name Type Required Default value Description
name string true None The name of the output object.
help string true None The description of the object.
nb_items int true 0 The number of objects in the list.
required bool false false Whether or not the output is required.

Process

The process() method is in charge of specifying the executables used to process the data (a command line or a Python function) and of defining the pattern of execution that determines how the functions are applied to the data, which is named hereunder an abstraction. To build the process, jflow provides two main functions, ShellFunction and PythonFunction, and two main abstractions: Map and MultiMap.
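
As an illustration, a typical process() method builds a function and applies it through an abstraction (the executable and the attribute names are hypothetical; ShellFunction and Map are detailed in the following sections):

def process(self):
    # define the command line structure: $1 is the input, $2 the output
    fasta_trim = ShellFunction("fastaTrim.pl --length 50 $1 > $2",
                               cmd_format="{EXE} {IN} {OUT}")
    # apply the function on each input file to produce the matching output file
    Map(fasta_trim, inputs=self.input_fasta, outputs=self.output_fasta)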

Functions

The two provided functions allow the developer to specify the executables used to process the data.

ShellFunction

The ShellFunction can be called when the workflow requires running an external command line. This function allows the developer to define the command line structure so that jflow can build and run it automatically on the final user inputs.

Example

Considering the following blastall command line:
blastall -p [program_name] -i [query_file] -d [database] -o [file_out]

When using a jflow function, the command format has to be given in order to set the inputs, outputs and arguments order. Let's set it to cmd_format="{EXE} {IN} {OUT}", which is a classic value for this option. Doing so, jflow will consider the following inputs and outputs order: query_file, database and then file_out, resulting in the following command structure:

blastall -p [program_name] -i [$1] -d [$2] -o [$3]

The ShellFunction can then be applied as follows:

blast = ShellFunction("blastall -p blastn -i $1 -d $2 -o $3", cmd_format="{EXE} {IN} {OUT}")

It can then be executed by calling the newly created function:

blast( inputs=[query_file, database], outputs=[file_out] )

Options

Name Type Required Default value Description
source string true None The command line structure defining inputs, outputs and arguments positions.
shell string false "sh" Which shell should be used to interpret the command line, the value can be "sh" | "ksh" | "bash" | "csh" | "tcsh".
cmd_format string false '{EXE} {ARG} {IN} > {OUT}' The cmd_format supports the following fields:
  • {executable}, {EXE} for the path to the executable,
  • {inputs}, {IN} for the inputs files,
  • {outputs}, {OUT} for the output files,
  • {arguments}, {ARG} for the arguments.

PythonFunction

The PythonFunction can be called when the workflow requires running an internal Python function. This function allows the developer to define the way the function should be called so that jflow can call and run it automatically on the final user inputs.

Example

Considering a function named fastq2fasta defined by:

def fastq2fasta(fastq_file, fasta_file):
    # python import lib goes here
    import jflow.seqio as seqio
    
    # python code goes here

When using a jflow function, the command format has to be given in order to set the inputs, outputs and arguments order. Let's set it to cmd_format="{EXE} {IN} {OUT}", which is a classic value for this option. Doing so, jflow will pass the variables to the function in the following order: fastq_file and then fasta_file. The PythonFunction can be used as follows:

fastq2a = PythonFunction(fastq2fasta, cmd_format="{EXE} {IN} {OUT}")

It can then be executed by calling the newly created function:

fastq2a( inputs=[fastq_file], outputs=[fasta_file] )

Options

Name Type Required Default value Description
function function true None The Python function used to process the data.
add_path string false None A path to a Python library required to run the function. This is useful in case the library is not in the path and not visible by jflow.
cmd_format string false '{EXE} {ARG} {IN} > {OUT}' The cmd_format supports the following fields:
  • {executable}, {EXE} for the path to the executable,
  • {inputs}, {IN} for the inputs files,
  • {outputs}, {OUT} for the output files,
  • {arguments}, {ARG} for the arguments.

Abstractions

An abstraction defines the pattern of execution that determines how the functions (ShellFunction or PythonFunction) are applied to the data.

__call__

This first abstraction is executed when calling the ShellFunction or the PythonFunction as a basic Python function.

Example

fasta_trim = ShellFunction( "fastaTrim.pl --length 50 $1 > $2", cmd_format="{EXE} {IN} {OUT}" )
fasta_trim( inputs="splA.fasta", outputs="splA_trim.fasta" )

This abstraction will lead to the execution of the following command line:

fastaTrim.pl --length 50 splA.fasta > splA_trim.fasta

Options

Name Type Required Default value Description
inputs list | string true None The input files list required to run the function.
outputs list | string false None The output files list created by the function.
arguments string false None The arguments to provide to the function.
includes list | string false None Files to include for this task.
collect boolean false false Whether or not to mark files for garbage collection.
local boolean false false Whether or not to force local execution.

Map

The Map abstraction allows to map one input file list to one output file list.

Example

fasta_list = ["splA.fasta", "splB.fasta", "splC.fasta"]
out_list = ["splA_trim.fasta", "splB_trim.fasta", "splC_trim.fasta"]

fasta_trim = ShellFunction( "fastaTrim.pl --length 50 $1 > $2", cmd_format="{EXE} {IN} {OUT}" )
Map( fasta_trim, inputs=fasta_list, outputs=out_list )

This abstraction will lead to the execution of the following command lines:

fastaTrim.pl --length 50 splA.fasta > splA_trim.fasta
fastaTrim.pl --length 50 splB.fasta > splB_trim.fasta
fastaTrim.pl --length 50 splC.fasta > splC_trim.fasta

Options

Name Type Required Default value Description
function function true None The ShellFunction or the PythonFunction to use to process the data.
inputs list | string true None The input files list required to run the function.
outputs list | string false None The output files list created by the function.
includes list | string false None Files to include for each task.
collect boolean false false Whether or not to mark files for garbage collection.
local boolean false false Whether or not to force local execution.

MultiMap

The MultiMap abstraction allows to map n input file lists to n output file lists, all of the same length.

Example

fastq_list = ["splA.fastq", "splB.fastq", "splC.fastq"]
out_list = [ ["splA.fasta", "splA.qual"],
             ["splB.fasta", "splB.qual"],
             ["splC.fasta", "splC.qual"] ]

fastq2fasta = ShellFunction( "fastq2fasta.py --input $1 --fasta $2 --qual $3", 
                             cmd_format="{EXE} {IN} {OUT}" )
MultiMap( fastq2fasta, inputs=fastq_list, outputs=out_list )

This abstraction will lead to the execution of the following command lines:

fastq2fasta.py --input splA.fastq --fasta splA.fasta --qual splA.qual
fastq2fasta.py --input splB.fastq --fasta splB.fasta --qual splB.qual
fastq2fasta.py --input splC.fastq --fasta splC.fasta --qual splC.qual

Options

Name Type Required Default value Description
function function true None The ShellFunction or the PythonFunction to use to process the data.
inputs list | string true None The input files list required to run the function.
outputs list | string false None The output files list created by the function.
includes list | string false None Files to include for each task.
collect boolean false false Whether or not to mark files for garbage collection.
local boolean false false Whether or not to force local execution.

Other methods

Pre process

pre_process() is executed before running the process method. Unlike process(), this method does not allow calling a ShellFunction or a PythonFunction, but it can be useful when implementing an application that requires preparing some data before running the component (inserting / recovering information from a database, ...).

Post process

post_process() is executed right after the process method and cannot be used to call a ShellFunction or a PythonFunction. This method can be useful to perform some database transactions and to synchronize data.
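
A minimal sketch of such overloads could look like this (the database helpers are placeholders, not part of jflow):

def pre_process(self):
    # executed before process(): for example, recover some records
    # from a database and keep them for later use
    self.records = fetch_records_from_db()    # hypothetical helper

def post_process(self):
    # executed after process(): for example, register the produced
    # files in a database or synchronize them elsewhere
    store_results_in_db(self.sam_files)       # hypothetical helper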

Get shared resources

The get_resource() method, given a specific resource name, returns the value defined within the resources section of the jflow configuration file.

Options

There is one required argument: resource.

Name Type Required Default value Description
resource string true None The name of the resource whose configured value is requested.
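
For example, assuming a resource named blast_db is declared in the resources section of the configuration file (the resource name is hypothetical), its value can be recovered with:

databank = self.get_resource("blast_db")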

Get cpu

get_cpu() returns the cpu value of the batch_options configuration for this component (available for local and sge only). This option is useful to set a software parameter that needs more than one CPU.

For more information about how to configure a component, see the advanced configuration documentation.

Get memory

get_memory() returns the memory value of the batch_options configuration for this component (available for local and sge only). This option is useful to set a software parameter that needs a specific amount of memory.

For more information about how to configure a component, see the advanced configuration documentation.
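
For example, both values can be used to parameterize a multi-threaded tool in process() (the executable and attribute names are hypothetical):

def process(self):
    # adapt the tool options to the CPU and memory allocated to this component
    aligner = ShellFunction("my_aligner --threads " + str(self.get_cpu()) +
                            " --max-mem " + str(self.get_memory()) +
                            " $1 > $2", cmd_format="{EXE} {IN} {OUT}")
    Map(aligner, inputs=self.reads, outputs=self.sam_files)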

External components

To ease the component creation step, jflow offers the possibility to create parsers by extending the jflow.extparser.ExternalParser class. The new component parser must be created in the workflows.extparsers package:

jflow/
├── bin/
├── docs/
├── src/
├── workflows/
│   ├── components/ 
│   ├── extparsers/
│   │   ├── __init__.py
│   │   ├── myParser.py   [ a new component parser ]
│   │   └── mobyle.py     [ mobyle component parser ]
│   ├── __init__.py
│   ├── formats.py
│   └── types.py
├── applications.properties
└── README

The ExternalParser class

Jflow provides an abstract class jflow.extparser.ExternalParser that must be extended to define a parser for external components. The parse() function must be overloaded and has to return a new component type definition. Hereunder is the skeleton of the class:

from jflow.extparser import ExternalParser

class MyExternalParser(ExternalParser):

    def parse(self, component_file):
        # file parsing
        # code ...
    	
        # use self.build_component() to return a new component definition
        return self.build_component(component_name, fn_define_parameters, **kwargs)

The build_component() method must be used to return the new component type. The newly created component type must have a definition of define_parameters(). If the process() method is not redefined, the developer has to overload get_abstraction() and get_command(). See process for more details.

parse()

The parse() function is the only function to implement. This function will be called by jflow and will be executed on any file present in the jflow components packages that is not recognized as an internal component. The function must return a type inheriting from jflow.component.Component, which can easily be done with the help of the build_component() method.

Example

import xml.etree.ElementTree as ET

from jflow.extparser import ExternalParser

class MobyleParser(ExternalParser):
    
    def parse(self, component_file):
        parameters = []
        parameters_names = []
        tree = ET.parse(component_file)
        root = tree.getroot()
        
        # get command string
        command = root.findtext(".//head/command", None)
        
        # retrieve all parameters from xml file 
        for parameterNode in root.findall('.//parameters/parameter[name]'):
            attrs = parameterNode.attrib
            param = self._parseParameter(parameterNode)
            if param['name'] in parameters_names:
                raise Exception('Duplicated parameter (%s)'%param['name'])
            parameters.append(param)
            parameters_names.append(param['name'])
        
        def fn_get_command(self):
            # code
        
        def fn_get_abstraction(self):
            # code
            
        def fn_define_parameters(self, **kwargs):
            # code
        
        # ...
        return self.build_component(component_name, fn_define_parameters,
                                    get_command=fn_get_command,
                                    get_abstraction=fn_get_abstraction)

Options

Name Type Required Default value Description
component_file string true None The path to the file where the external component is defined.

build_component()

The build_component() method is used to return a new class definition created using the Python type() function.

Example

In this example, the functions get_command and get_abstraction have been overloaded in the new component class.

def parse(self, component_file):
        
    def fn_get_command(self):
        # code
        
    def fn_get_abstraction(self):
        # code
        
    def fn_define_parameters(self):
        # code
    
    # ...
    return self.build_component(component_name, fn_define_parameters,
                                get_command=fn_get_command,
                                get_abstraction=fn_get_abstraction)

Options

Name Type Required Default value Description
component_name string true None The name of the newly created type.
fn_define_parameters callable true None A callable which overloads the define_parameters() method of the component.
**kwargs any false None Any other method or attribute that must be added or overloaded in this new class definition.