What is a component

A component is a command line or a collection of command lines, which can be external scripts or Python code. In jflow, a component is represented by a Python class inheriting from jflow.component.Component. It lists all the inputs, outputs and parameters required to run the command line(s) and defines its structure.

Where to add a new component

New components must be added in a Python package. Two different locations are possible in order to be imported by jflow:

  • workflows.components: the component will be visible by all workflows,
  • workflows.myWorkflow.components: the component will only be available for myWorkflow.

The following tree represents the structure of the sources and the locations where component packages can be added.

jflow/
├── bin/
├── docs/
├── src/
├── workflows/
│   ├── myWorkflow/
│   │   ├── components/          [ workflow specific components ]
│   │   │   └── myComponent.py   [ the component code ]
│   │   └── __init__.py
│   ├── components/              [ general components ]
│   │   ├── __init__.py
│   │   └── myComponent.py       [ the component code ]
│   ├── extparsers/
│   ├── __init__.py
│   ├── formats.py
│   └── types.py
├── applications.properties
└── README

The Component class

In jflow, a component is a class defined in the myComponent.py file. In order to add a new component, the developer has to:

  • implement a class inheriting from the jflow.component.Component class,
  • overload the define_parameters() method to add the component inputs, outputs and parameters,
  • overload the process() method to define the command line(s) structure.

The class skeleton is given by

from jflow.component import Component

class MyComponent (Component):

    def define_parameters(self, param1, param2, ...):
        # define the parameters

    def process(self):
        # define the command line(s) structure

Define parameters

The define_parameters() method is used to add component parameters, inputs and outputs. To do so, several methods are available. Once defined, the new parameters are available as object attributes, and are thus accessible through self.parameter_name.

Several types of parameters can be added, all described in the following sections. All have two required positional arguments: name and help. The other arguments are optional and can be passed to the method by keyword.
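
For example, here is a minimal sketch of a hypothetical component declaring a single parameter and reading it back through self (the component and parameter names are only illustrative):

from jflow.component import Component

class MyTrimmer (Component):

    def define_parameters(self, quality=20):
        # the declared parameter becomes available as self.quality
        self.add_parameter("quality", "The quality threshold.",
                           type="int", default=quality)

    def process(self):
        # self.quality holds the value provided by the workflow and
        # can be used here to build the command line(s)
        pass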

Parameters

Parameters can be added to handle a single element or a list of elements. Thus, the add_parameter() method can be used to force the final user to provide one and only one value, whereas the add_parameter_list() method allows the final user to give as many values as they want.

add_parameter()

Example

In the following example, a parameter named sequencer is added to the component. It has a list of choices and the default value is "HiSeq2000".

self.add_parameter("sequencer",
    		   "The sequencer type.", 
    		   choices = ["HiSeq2000", "ILLUMINA","SLX","SOLEXA","454","UNKNOWN"], 
    		   default="HiSeq2000")

Options

There are two positional arguments: name and help. All other options are keyword options.

Name Type Required Default value Description
name string true None The name of the parameter. The parameter value is accessible within the workflow object through the attribute named self.parameter_name.
help string true None The parameter help message.
default - false None The default parameter value. Its type depends on the parameter type.
type string false "str" The parameter type. The value provided by the final user will be casted and checked against this type. All built-in Python types are available "int", "str", "float", "bool", "date", ... To create customized types, refere to the Add a data type documentation.
choices list false [] A list of the allowed values.
required boolean false false Whether or not the parameter can be omitted.
flag string false None The command line flag (if the value is None, the flag will be --name).
group string false "default" The value is used to group a list of parameters in sections. The group is used in both command line and GUI.
display_name string false None The parameter name that should be displayed on the final form.
cmd_format string false "" The command format is the parameter skeleton required to build the final command line.
argpos integer false -1 The parameter position in the command line.

add_parameter_list()

The add_parameter_list() method takes the same arguments as add_parameter(). However, with this method the final user is allowed to enter multiple values for this parameter, and the object attribute self.parameter_name will be set as a Python list.
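
For example, a hypothetical declaration allowing the final user to pass several sample names could be:

self.add_parameter_list("samples",
                        "The sample names to process.",
                        required=True)

After this call, self.samples is a Python list containing all the values given by the final user.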

Inputs

Just like parameters, inputs can be added to handle a single file or a list of files. Thus, the add_input_file() method can be used to force the component to take one and only one file, whereas the add_input_file_list() method allows the component to take as many files as needed.

add_input_file()

Example

In the following example, an input named reads is added to the component. The provided file is required and should be in fastq format. No file size limitation is specified.

self.add_input_file_list("reads", 
                         "Which read files should be used", 
                         file_format="fastq", 
                         required=True)

Options

There are two positional arguments: name and help. All other options are keyword options.

Name Type Required Default value Description
name string true None The name of the parameter. The parameter value is accessible within the workflow object through the attribute named self.parameter_name.
help string true None The parameter help message.
default string false None The default path value.
file_format string false "any" The file format is checked before running the workflow. To create customized format, refere to the Add a file format documentation.
type string false "inputfile" The type can be "inputfile", "localfile", "urlfile" or "browsefile". An "inputfile" allows the final user to provide a "localfile" or an "urlfile" or a "browsefile". A "localfile" restricts the final user to provide a path to a file visible by jflow. An "urlfile" only permits the final user to give an URL as input, where a "browsefile" force the final user to upload a file from its own computer. This last option is only available from the GUI and is considered as a "localfile" from the command line. All the uploading process is handled by jflow.
required boolean false false Whether or not the parameter can be omitted.
flag string false None The command line flag (if the value is None, the flag will be --name).
group string false "default" The value is used to group a list of parameters in sections. The group is used in both command line and GUI.
display_name string false None The parameter name that should be displayed on the final form.
cmd_format string false "" The command format is the parameter skeleton required to build the final command line.
argpos integer false -1 The parameter position in the command line.

add_input_file_list()

This method takes the same arguments as add_input_file(). However, with this method the component can take a list of files, and the object attribute self.parameter_name will be set as a Python list.
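
Reusing the example above in its list form, a declaration could be:

self.add_input_file_list("reads",
                         "Which read files should be used",
                         file_format="fastq",
                         required=True)

After this call, self.reads is a Python list of file paths.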

add_input_directory()

Input directories are handled in the same way as input files.

Example

In the following example, an input directory named reference is added to the component. The provided directory is required.

self.add_input_directory("reference",
                         "Folder containing the reference genome",
                         required=True)

Options

There are two positional arguments: name and help. All other options are keyword options.

Name Type Required Default value Description
name string true None The name of the parameter. The parameter value is accessible within the workflow object through the attribute named self.parameter_name.
help string true None The parameter help message.
default string false None The default path value.
required boolean false false Whether or not the parameter can be omitted.
flag string false None The command line flag (if the value is None, the flag will be --name).
group string false "default" The value is used to group a list of parameters in sections. The group is used in both command line and GUI.
display_name string false None The parameter name that should be displayed on the final form.
cmd_format string false "" The command format is the parameter skeleton required to build the final command line.
argpos integer false -1 The parameter position in the command line.

add_input_object()

Adds a Python object as input. The object is an instance of a class inheriting from object.

Example

Considering the following object:


class MyObject(object):

    def __init__(self, value):
        self.value = value

    def add_five(self):
        self.value = self.value + 5

    def multiply_by_ten(self):
        self.value = self.value * 10

We instantiate this object and pass it as a component argument. Inside the component, the input is defined as follows:

self.add_input_object("i_object", "The input object to add 5", default=i_object)

Note: the object is only accessible from within a Python function loaded through an "add_python_execution".

Options

There are two positional arguments: name and help. All other options are keyword options.

Name Type Required Default value Description
name string true None The name of the parameter. The parameter value is accessible within the workflow object through the attribute named self.parameter_name.
help string true None The parameter help message.
default string false None The default object.
required boolean false false Whether or not the parameter can be omitted.

add_input_object_list()

This method takes the same arguments as add_input_object(). However, with this method the component can take a list of objects, and the object attribute self.parameter_name will be set as a Python list.
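
A hypothetical declaration, reusing the MyObject class defined above, could be:

objects = [MyObject(1), MyObject(2)]
self.add_input_object_list("i_objects", "The input objects to process", default=objects)

After this call, self.i_objects is a Python list of objects.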

Outputs

Just like inputs, outputs can be added to handle a single file or a list of files. Thus, the add_output_file() method can be used to force the component to produce one and only one file, whereas the add_output_file_list() method allows the component to produce as many files as needed.

add_output_file()

Example

In the following example, an output named databank is defined. The process has to produce the file, otherwise the workflow will fail. The file written to disk will have the same name as the one stored in the variable input_fasta.

self.add_output_file("databank", 
		     "The indexed databank", 
		     filename=os.path.basename(input_fasta))

Options

The two positional arguments name and help are always present.

Name Type Required Default value Description
name string true None The name of the parameter. The parameter value is accessible within the workflow object through the attribute named self.parameter_name.
help string true None The parameter help message.
filename string false None The expected name of the output file.
file_format string false "any" The file format is checked before running the workflow. To create customized format, refere to the Add a file format documentation.
group string false "default" The value is used to group a list of parameters in sections. The group is used in both command line and GUI.
display_name string false None The parameter name that should be displayed on the final form.
cmd_format string false "" The command format is the parameter skeleton required to build the final command line.
argpos integer false -1 The parameter position in the command line.

add_output_file_list()

Example

In the following example, an output named sam_files is defined. The files written to disk will all follow the pattern {basename_woext}.sam applied to the items of the self.reads variable. The resulting list gathers files with the same basenames as self.reads but with the extension substituted by ".sam".

self.add_output_file_list("sam_files", 
			  "The BWA output files", 
			  pattern='{basename_woext}.sam', 
			  items=self.reads)

Options

The two positional arguments name and help are always present.

Name Type Required Default value Description
name string true None The name of the parameter. The parameter value is accessible within the workflow object through the attribute named self.parameter_name.
help string true None The parameter help message.
items list false None A list of elements over which add_output_file_list() iterates to produce the output list.
pattern string false '{basename_woext}.out' The pattern is used to produce the output list. add_output_file_list() maps the pattern on each item to produce the final value. The pattern accepts these predefined values:
  • {fullpath}, {FULL} for full input file path,
  • {basename}, {BASE} for base input file name,
  • {fullpath_woext}, {FULLWE} for full input file path without extension,
  • {basename_woext}, {BASEWE} for base input file name without extension.
file_format string false "any" The file format is checked before running the workflow. To create customized format, refere to the Add a file format documentation.
group string false "default" The value is used to group a list of parameters in sections. The group is used in both command line and GUI.
display_name string false None The parameter name that should be displayed on the final form.
cmd_format string false "" The command format is the parameter skeleton required to build the final command line.
argpos integer false -1 The parameter position in the command line.

add_output_file_endswith()

Example

In the following example, an output named sam_files is defined. The list will be made of all the files written to disk with an extension equal to .sam. This is performed at the end of the execution and can be useful when the component produces an unknown number of output files.

self.add_output_file_endswith("sam_files",
                              "The BWA output files",
                              pattern='.sam')

Options

This method is very different from add_output_file_list() because it should only be used when the number of output files returned by the component is unknown. Three options are required: name, help and pattern.

Name Type Required Default value Description
name string true None The name of the parameter. The parameter value is accessible within the workflow object through the attribute named self.parameter_name.
help string true None The parameter help message.
pattern string true None The extension of the files to return.
behaviour string false "include" How to process selected files. Other values than "include" mean that all files not ending with the pattern will be selected.
file_format string false "any" The file format is checked before running the workflow. To create customized format, refere to the Add a file format documentation.
group string false "default" The value is used to group a list of parameters in sections. The group is used in both command line and GUI.
display_name string false None The parameter name that should be displayed on the final form.
cmd_format string false "" The command format is the parameter skeleton required to build the final command line.
argpos integer false -1 The parameter position in the command line.

add_output_file_pattern()

Example

In the following example, an output named sam_files is defined. The list will be made of all the files written to disk matching the pattern *_R1.sam. This is performed at the end of the execution and can be useful when the component produces an unknown number of output files.

self.add_output_file_pattern("sam_files",
                             "The BWA output files",
                             pattern='*_R1.sam')

Options

This method is slightly different from add_output_file_endswith() because it returns all the files matching the given pattern instead of only an extension. All the options are the same, only pattern differs.

Name Type Required Default value Description
pattern string true None The regexp string used to retrieve the output files.

add_output_object()

Example

In the following example, an output named o_object is defined. The Python function defined for execution has to return a Python object, which will be stored in this output.

self.add_output_object("o_object", "The output object")

Options

Name Type Required Default value Description
name string true None The name of the output object.
help string true None The description of the object.
required bool false false Whether or not the output is required.

add_output_object_list()

Example

In the following example, an output named o_object is defined. The Python function defined for execution has to return a list of Python objects, with exactly the number of items given in the nb_items argument (2 in this example).

self.add_output_object_list("o_object", "The output object", nb_items=2)

Options

Name Type Required Default value Description
name string true None The name of the output object.
help string true None The description of the object.
nb_items int true 0 The number of objects in the list.
required bool false false Whether or not the output is required.

Process

The process() method is in charge of specifying the executables used to process the data (a command line or a Python function) and of defining the pattern of execution that determines how the functions are applied to the data, which is named hereunder an abstraction. To build the process, jflow provides two main functions, ShellFunction and PythonFunction, and two main abstractions: Map and MultiMap.
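
As an illustration, a typical process() method builds a function and applies it through an abstraction (the executable and the attribute names are hypothetical; ShellFunction and Map are detailed in the following sections):

def process(self):
    # define the command line structure: $1 is the input, $2 the output
    fasta_trim = ShellFunction("fastaTrim.pl --length 50 $1 > $2",
                               cmd_format="{EXE} {IN} {OUT}")
    # apply the function on each input file to produce the matching output file
    Map(fasta_trim, inputs=self.input_fasta, outputs=self.output_fasta)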

Functions

The two provided functions allow the developer to specify the executables used to process the data.

ShellFunction

The ShellFunction can be called when the workflow requires running an external command line. This function allows the developer to define the command line structure so that jflow can build and run it automatically on the final user inputs.

Example

Considering the following blastall command line:
blastall -p [program_name] -i [query_file] -d [database] -o [file_out]

When using a jflow function, the command format has to be given in order to set the inputs, outputs and arguments order. Let's set it to cmd_format="{EXE} {IN} {OUT}", which is a classic value for this option. Doing so, jflow will consider the following inputs and outputs order: query_file, database and then file_out, resulting in the following command structure:

blastall -p [program_name] -i [$1] -d [$2] -o [$3]

The ShellFunction can then be applied as follows:

blast = ShellFunction("blastall -p blastn -i $1 -d $2 -o $3", cmd_format="{EXE} {IN} {OUT}")

It can then be executed by calling the newly created function:

blast( inputs=[query_file, database], outputs=[file_out] )

Options

Name Type Required Default value Description
source string true None The command line structure defining inputs, outputs and arguments positions.
shell string false "sh" Which shell should be used to interpret the command line, the value can be "sh" | "ksh" | "bash" | "csh" | "tcsh".
cmd_format string false '{EXE} {ARG} {IN} > {OUT}' The cmd_format supports the following fields:
  • {executable}, {EXE} for the path to the executable,
  • {inputs}, {IN} for the inputs files,
  • {outputs}, {OUT} for the output files,
  • {arguments}, {ARG} for the arguments.

PythonFunction

The PythonFunction can be called when the workflow requires running an internal Python function. This function allows the developer to define the way the function should be called so that jflow can call and run it automatically on the final user inputs.

Example

Considering a function named fastq2fasta defined by:

def fastq2fasta(fastq_file, fasta_file):
    # python import lib goes here
    import jflow.seqio as seqio
    
    # python code goes here

When using a jflow function, the command format has to be given in order to set the inputs, outputs and arguments order. Let's set it to cmd_format="{EXE} {IN} {OUT}", which is a classic value for this option. Doing so, jflow will pass the variables to the function in the following order: fastq_file and then fasta_file. The PythonFunction can be used as follows:

fastq2a = PythonFunction(fastq2fasta, cmd_format="{EXE} {IN} {OUT}")

It can then be executed by calling the newly created function:

fastq2a( inputs=[fastq_file], outputs=[fasta_file] )

Options

Name Type Required Default value Description
function function true None The Python function used to process the data.
add_path string false None A path to a Python library required to run the function. This is useful in case the library is not in the path and not visible by jflow.
cmd_format string false '{EXE} {ARG} {IN} > {OUT}' The cmd_format supports the following fields:
  • {executable}, {EXE} for the path to the executable,
  • {inputs}, {IN} for the inputs files,
  • {outputs}, {OUT} for the output files,
  • {arguments}, {ARG} for the arguments.

Abstractions

An abstraction defines the pattern of execution that determines how the functions (ShellFunction or PythonFunction) are applied to the data.

__call__

This first abstraction is executed when calling the ShellFunction or the PythonFunction as a basic Python function.

Example

fasta_trim = ShellFunction( "fastaTrim.pl --length 50 $1 > $2", cmd_format="{EXE} {IN} {OUT}" )
fasta_trim( inputs="splA.fasta", outputs="splA_trim.fasta" )

This abstraction will lead to the execution of the following command line:

fastaTrim.pl --length 50 splA.fasta > splA_trim.fasta

Options

Name Type Required Default value Description
inputs list | string true None The input files list required to run the function.
outputs list | string false None The output files list created by the function.
arguments string false None The arguments to provide to the function.
includes list | string false None Files to include for this task.
collect boolean false false Whether or not to mark files for garbage collection.
local boolean false false Whether or not to force local execution.

Map

The Map abstraction allows to map one input file list to one output file list.

Example

fasta_list = ["splA.fasta", "splB.fasta", "splC.fasta"]
out_list = ["splA_trim.fasta", "splB_trim.fasta", "splC_trim.fasta"]

fasta_trim = ShellFunction( "fastaTrim.pl --length 50 $1 > $2", cmd_format="{EXE} {IN} {OUT}" )
Map( fasta_trim, inputs=fasta_list, outputs=out_list )

This abstraction will lead to the execution of the following command lines:

fastaTrim.pl --length 50 splA.fasta > splA_trim.fasta
fastaTrim.pl --length 50 splB.fasta > splB_trim.fasta
fastaTrim.pl --length 50 splC.fasta > splC_trim.fasta

Options

Name Type Required Default value Description
function function true None The ShellFunction or the PythonFunction to use to process the data.
inputs list | string true None The input files list required to run the function.
outputs list | string false None The output files list created by the function.
includes list | string false None Files to include for each task.
collect boolean false false Whether or not to mark files for garbage collection.
local boolean false false Whether or not to force local execution.

MultiMap

The MultiMap abstraction allows to map n input file lists to n output file lists, all of the same length.

Example

fastq_list = ["splA.fastq", "splB.fastq", "splC.fastq"]
out_list = [ ["splA.fasta", "splA.qual"],
             ["splB.fasta", "splB.qual"],
             ["splC.fasta", "splC.qual"] ]

fastq2fasta = ShellFunction( "fastq2fasta.py --input $1 --fasta $2 --qual $3", 
                             cmd_format="{EXE} {IN} {OUT}" )
MultiMap( fastq2fasta, inputs=fastq_list, outputs=out_list )

This abstraction will lead to the execution of the following command lines:

fastq2fasta.py --input splA.fastq --fasta splA.fasta --qual splA.qual
fastq2fasta.py --input splB.fastq --fasta splB.fasta --qual splB.qual
fastq2fasta.py --input splC.fastq --fasta splC.fasta --qual splC.qual

Options

Name Type Required Default value Description
function function true None The ShellFunction or the PythonFunction to use to process the data.
inputs list | string true None The input files list required to run the function.
outputs list | string false None The output files list created by the function.
includes list | string false None Files to include for each task.
collect boolean false false Whether or not to mark files for garbage collection.
local boolean false false Whether or not to force local execution.

Other methods

Pre process

pre_process() is executed before running the process method. Unlike process(), this method does not allow calling a ShellFunction or a PythonFunction, but it can be useful when implementing an application that requires preparing some data before running the component (inserting / recovering information from a database, ...).

Post process

post_process() is executed right after the process method and cannot be used to call a ShellFunction or a PythonFunction. This method can be useful to perform some database transactions and to synchronize data.
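
A minimal sketch of such overloads could look like this (the database helpers are placeholders, not part of jflow):

def pre_process(self):
    # executed before process(): for example, recover some records
    # from a database and keep them for later use
    self.records = fetch_records_from_db()    # hypothetical helper

def post_process(self):
    # executed after process(): for example, register the produced
    # files in a database or synchronize them elsewhere
    store_results_in_db(self.sam_files)       # hypothetical helper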

Get shared resources

The get_resource() method, given a specific resource name, returns the value defined within the resources section of the jflow configuration file.

Options

There is one required argument: resource.

Name Type Required Default value Description
resource string true None The name of the resource whose configured value is requested.
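
For example, assuming a resource named blast_db is declared in the resources section of the configuration file (the resource name is hypothetical), its value can be recovered with:

databank = self.get_resource("blast_db")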

Get cpu

get_cpu() returns the cpu value of the batch_options configuration for this component (available for local and sge only). This option is useful to set a software parameter that needs more than one CPU.

For more information about how to configure a component, see the advanced configuration documentation.

Get memory

get_memory() returns the memory value of the batch_options configuration for this component (available for local and sge only). This option is useful to set a software parameter that needs a specific amount of memory.

For more information about how to configure a component, see the advanced configuration documentation.
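
For example, both values can be used to parameterize a multi-threaded tool in process() (the executable and attribute names are hypothetical):

def process(self):
    # adapt the tool options to the CPU and memory allocated to this component
    aligner = ShellFunction("my_aligner --threads " + str(self.get_cpu()) +
                            " --max-mem " + str(self.get_memory()) +
                            " $1 > $2", cmd_format="{EXE} {IN} {OUT}")
    Map(aligner, inputs=self.reads, outputs=self.sam_files)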

External components

To ease the component creation step, jflow offers the possibility to create parsers by extending the jflow.extparser.ExternalParser class. The new component parser must be created in the workflows.extparsers package:

jflow/
├── bin/
├── docs/
├── src/
├── workflows/
│   ├── components/ 
│   ├── extparsers/
│   │   ├── __init__.py
│   │   ├── myParser.py   [ a new component parser ]
│   │   └── mobyle.py     [ mobyle component parser ]
│   ├── __init__.py
│   ├── formats.py
│   └── types.py
├── applications.properties
└── README

The ExternalParser class

Jflow provides an abstract class jflow.extparser.ExternalParser that must be extended to define a parser for external components. The parse() function must be overloaded and has to return a new component type definition. Hereunder is the skeleton of the class:

from jflow.extparser import ExternalParser

class MyExternalParser(ExternalParser):

    def parse(self, component_file):
        # file parsing
        # code ...
    	
        # use self.build_component() to return a new component definition
        return self.build_component(component_name, fn_define_parameters, **kwargs)

The build_component() method must be used to return the new component type. The newly created component type must have a definition of define_parameters(). If the process() method is not redefined, the developer has to overload get_abstraction() and get_command(). See process for more details.

parse()

The parse() function is the only function to implement. This function will be called by jflow and will be executed on any file present in the jflow components packages that is not recognized as an internal component. The function must return a type inheriting from jflow.component.Component, which can easily be done with the help of the build_component() method.

Example

import xml.etree.ElementTree as ET

from jflow.extparser import ExternalParser

class MobyleParser(ExternalParser):
    
    def parse(self, component_file):
        parameters = []
        parameters_names = []
        tree = ET.parse(component_file)
        root = tree.getroot()
        
        # get command string
        command = root.findtext(".//head/command", None)
        
        # retrieve all parameters from xml file 
        for parameterNode in root.findall('.//parameters/parameter[name]'):
            attrs = parameterNode.attrib
            param = self._parseParameter(parameterNode)
            if param['name'] in parameters_names:
                raise Exception('Duplicated parameter (%s)'%param['name'])
            parameters.append(param)
            parameters_names.append(param['name'])
        
        def fn_get_command(self):
            # code
        
        def fn_get_abstraction(self):
            # code
            
        def fn_define_parameters(self, **kwargs):
            # code
        
        # ...
        return self.build_component(component_name, fn_define_parameters,
                                    get_command=fn_get_command,
                                    get_abstraction=fn_get_abstraction)

Options

Name Type Required Default value Description
component_file string true None The path to the file where the external component is defined.

build_component()

The build_component() method is used to return a new class definition created using the Python type() function.

Example

In this example, the functions get_command and get_abstraction have been overloaded in the new component class.

def parse(self, component_file):
        
    def fn_get_command(self):
        # code
        
    def fn_get_abstraction(self):
        # code
        
    def fn_define_parameters(self):
        # code
    
    # ...
    return self.build_component(component_name, fn_define_parameters,
                                get_command=fn_get_command,
                                get_abstraction=fn_get_abstraction)

Options

Name Type Required Default value Description
component_name string true None The name of the newly created type.
fn_define_parameters callable true None A callable which overloads the define_parameters() method of the component.
**kwargs any false None Any other method or attribute that must be added or overloaded in this new class definition.