NAME
Bio::SeqAlignment::Components::SeqMapping::Dataflow::LinearGeneric - A role to implement a linear dataflow for a Generic Mapper
VERSION
version 0.03
DESCRIPTION
This role provides a linear dataflow for sequence mapping. By that we mean that the input, i.e. a reference to a (list) of any object that can hold biological sequences such as BioX::Seq objects, FASTA files, etc) undergoes a single step process for mapping, i.e. similarity search, usually through a (pseudo)alignment method and mapping to a single best candidate, which is the final (?definitive) step of similarity search. The module is intended to be applied as a role to the class Bio::SeqAlignment::Components::SeqMapping::Mapper::Generic The latter is pluggable Moose mapper that provides the necessary logic to implement the methods required by this role module.
The method requires the implementation of the following methods by the GenericMapper class to which the role will be applied:
* seq_align : aligns an atomic unit of work provided as the single argument and returns the similarity metric and the match result. This method is passed only one argument: 1. The atomic unit of work This function returns a reference to an array containing the sequence ID, the reference sequence that was mapped to and the similarity metric used to decide which reference sequence to map to. In essence this method performs ALL the steps of the similarity search and mapping in one go (as opposed to the linear-linear dataflow where the similarity search, extraction of similarity measures, their reduction and mapping are distinct steps. * cleanup : does any cleanup necessary after the mapping has been done if the user desires so. The function will be provided with a single argument: 1. the output of the seq_align method
Note that these functions are provided as code references to the constructor of the class that composes the role. The role does not provide any default implementation for these methods, as they are highly dependent on the implementation of the similarity search and the similarity metric reduction steps. The user is free to implement these methods in any way they see fit, e.g. using BioPerl, BioX, BioPython, command line and use Perl to wrap them into functions that are provided as arguments to: Bio::SeqAlignment::Components::SeqMapping::Mapper::Generic
METHODS
sim_seq_search
This is the single public method provided by the role. It takes the following arguments: * $workload : an array reference containing the atomic units of work to be processed. The atomic units of work can be any object that can hold a biological sequence, and the format depends entirely on the class composing the role. This gives immense flexibility to the user to define the space the dataflow operates in, e.g. in-memory, on-disk, etc. * %args : a hash containing the following optional arguments, that control parallelization and cleanup after the mapping has been done. The keys are: * max_workers : the number of workers to use for parallelization. Default value is 1, i.e. no parallelization. * chunk_size : the number of atomic units of work to be processed by each worker. Default value is 1. * cleanup : a boolean value that indicates whether to do cleanup after the mapping has been done. Default value is 1, i.e. cleanup after the mess.
$mapper->sim_seq_search(
$workload,
max_workers => 4,
chunk_size => 10,
cleanup => 1
);
SEE ALSO
Bio::SeqAlignment::Components::SeqMapping::Mapper::Generic
Documentation of the Generic Mapper class that uses this role.
Bio::SeqAlignment::Examples::EnhancingEdlib
Example of how to use the Generic Mapper with the LinearLinearGeneric and the LinearGeneric Dataflow roles, along with the Edlib alignment library.
AUTHOR
Christos Argyropoulos, <chrisarg at cpan.org>
COPYRIGHT AND LICENSE
This software is copyright (c) 2024 by Christos Argyropoulos.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.