NAME
Bio::SeqAlignment::Components::SeqMapping::Dataflow::LinearLinear - A role to implement a linear-linear dataflow for a non-generic sequence mapper.
VERSION
version 0.02
DESCRIPTION
This role provides a linear-linear dataflow for sequence mapping. The input is a reference to a list of objects that can hold biological sequences (BioX::Seq objects, FASTA files, etc.), and it undergoes a two-step mapping process. The first step carries out a similarity search against a database of reference sequences, using (pseudo)alignment, and extracts a similarity metric. The second step reduces the similarity metric to a single value that identifies the reference sequence each user-provided sequence is most similar to. The output is a list containing the sequence ID, the reference sequence that was mapped to, and the similarity metric used to decide which reference sequence to map to.
The "linear-linear" part of the name refers to two facts: each atomic unit of work is processed independently of all others in both the alignment and the reduction steps, and Perl is given access to the intermediate results of the similarity search before directing them to the reduction step.
Because there are no dependencies between the atomic units of work, the dataflow can optionally be parallelized using the MCE module. The user can also run the dataflow in a single thread, which is useful when the dataflow should consume the resources of a single core and leave the remaining cores free for, e.g., multithreaded logic that implements the similarity search and the similarity metric reduction steps.
The module is intended to be composed as a role by a class that provides the logic for the methods required by this role.
The role requires the class composing it to implement the following methods:
* seq_align : aligns an atomic unit of work provided as the single argument and returns the similarity metric. It is passed only one argument:
    1. The atomic unit of work
  The method may return any object that can hold the similarity metric for each sequence in the atomic unit of work. This could be, for example, a file, a hashref, an arrayref, etc. A specialized extraction method will then be used to extract the similarity metric from the output of this method.
* extract_sim_metric : extracts the similarity metric for each sequence in the atomic unit of work from the search results. It is passed only one argument:
    1. The output of the seq_align method
  The method should return a hashref where the keys are the sequence IDs and the values are references to arrays that contain the similarity metric for each sequence in an atomic unit of work.
* reduce_sim_metric : reduces the similarity metric for each sequence to a single value. It receives one argument:
    1. A hashref containing the similarity metric for each sequence in the atomic unit of work. The keys are the sequence IDs and the values are references to arrays that contain the similarity metric for each sequence against each reference sequence in the database used for searching. All the values of the hash are hashrefs themselves.
  This function returns a reference to an array containing the sequence ID, the reference sequence that was mapped to and the similarity metric used to decide which reference sequence to map to.
* cleanup : does any cleanup necessary after the mapping has been done, if the user desires so. The function will be provided with three arguments:
    1. The output of the seq_align method
    2. The output of the extract_sim_metric method
    3. The output of the reduce_sim_metric method
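The contract described above can be sketched with a toy class. This is only an illustration under stated assumptions, not code from the distribution: the package name Toy::Mapper, the constructor, the { id => sequence } shape of an atomic unit of work, the position-by-position match count used as a "similarity metric", and the [ id, reference, score ] triples returned by reduce_sim_metric are all hypothetical. The class does not compose the role; the four methods are driven by hand in a serial loop so that the example runs with core Perl only.

```perl
use strict;
use warnings;

# Hypothetical toy mapper; none of these names come from the distribution.
package Toy::Mapper;

sub new {
    my ( $class, %refs ) = @_;
    return bless { refs => {%refs} }, $class;
}

# Toy "alignment": count the positions at which query and reference agree.
# ASSUMPTION: an atomic unit of work is a hashref { sequence_id => sequence }.
# Returns an arrayref of [ query_id, ref_id, score ] records.
sub seq_align {
    my ( $self, $unit ) = @_;
    my @hits;
    for my $id ( sort keys %$unit ) {
        my $query = $unit->{$id};
        for my $ref_id ( sort keys %{ $self->{refs} } ) {
            my $ref = $self->{refs}{$ref_id};
            my $len = length($query) < length($ref) ? length($query) : length($ref);
            # grep in scalar context returns the number of matching positions
            my $score = grep { substr( $query, $_, 1 ) eq substr( $ref, $_, 1 ) } 0 .. $len - 1;
            push @hits, [ $id, $ref_id, $score ];
        }
    }
    return \@hits;
}

# Group the raw records by sequence ID: { id => [ [ ref_id, score ], ... ] }.
sub extract_sim_metric {
    my ( $self, $hits ) = @_;
    my %metric;
    push @{ $metric{ $_->[0] } }, [ $_->[1], $_->[2] ] for @$hits;
    return \%metric;
}

# Keep the best-scoring reference per sequence (ties broken by reference ID).
# ASSUMPTION: the result is an arrayref of [ id, ref_id, score ] triples.
sub reduce_sim_metric {
    my ( $self, $metric ) = @_;
    my @best;
    for my $id ( sort keys %$metric ) {
        my ($top) = sort { $b->[1] <=> $a->[1] or $a->[0] cmp $b->[0] } @{ $metric->{$id} };
        push @best, [ $id, @$top ];
    }
    return \@best;
}

# Nothing to clean up in this in-memory toy; a file-based mapper might
# delete temporary search output here.
sub cleanup { return }

package main;

my $mapper   = Toy::Mapper->new( refA => 'ACGTACGT', refB => 'TTTTACGT' );
my @workload = ( { read1 => 'ACGTACGA' }, { read2 => 'TTTTACGA' } );

# Serial stand-in for the dataflow: each unit is handled independently.
my @mapped;
for my $unit (@workload) {
    my $raw    = $mapper->seq_align($unit);
    my $metric = $mapper->extract_sim_metric($raw);
    my $best   = $mapper->reduce_sim_metric($metric);
    $mapper->cleanup( $raw, $metric, $best );
    push @mapped, @$best;
}
printf "%s -> %s (%d)\n", @$_ for @mapped;
```

With the two toy references above, read1 maps to refA and read2 maps to refB, each with a score of 7. A real class would compose the role and call sim_seq_search instead of looping by hand.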
METHODS
sim_seq_search
This is the single public method provided by the role. It takes the following arguments:
* $workload : an array reference containing the atomic units of work to be processed. The atomic units of work can be any object that can hold a biological sequence, and the format depends entirely on the class composing the role. This lets the user define the space the dataflow operates in, e.g. in-memory, on-disk, etc.
* %args : a hash of optional arguments that control parallelization and cleanup after the mapping has been done. The keys are:
    * max_workers : the number of workers to use for parallelization. Default value is 1, i.e. no parallelization.
    * chunk_size : the number of atomic units of work to be processed by each worker. Default value is 1.
    * cleanup : a boolean value that indicates whether to clean up after the mapping has been done. Default value is 1, i.e. cleanup is performed.
    $mapper->sim_seq_search(
        $workload,
        max_workers => 4,
        chunk_size  => 10,
        cleanup     => 1
    );
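To give a feel for how chunk_size relates to the size of the workload, the following plain-Perl sketch splits a workload into chunks of at most chunk_size units. It is an illustration only: chunk_workload is a hypothetical helper that is not part of the role, and the actual distribution of chunks to workers is handled by MCE and is not reproduced here.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the role: split a workload (array
# reference) into chunks holding at most $chunk_size atomic units each.
sub chunk_workload {
    my ( $workload, $chunk_size ) = @_;
    die "chunk_size must be a positive integer\n"
      unless $chunk_size && $chunk_size > 0;
    my @pending = @$workload;    # copy, so the caller's array is untouched
    my @chunks;
    push @chunks, [ splice @pending, 0, $chunk_size ] while @pending;
    return \@chunks;
}

# 25 placeholder units with chunk_size => 10 give chunks of 10, 10 and 5.
my @workload = map { "unit$_" } 1 .. 25;
my $chunks   = chunk_workload( \@workload, 10 );
printf "chunk %d holds %d units\n", $_ + 1, scalar @{ $chunks->[$_] }
  for 0 .. $#$chunks;
```

Under this assumption, a workload of 25 units with chunk_size => 10 yields only three chunks, so a fourth worker would have nothing to do. A smaller chunk_size spreads the work over more workers, at the cost of more handoffs.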
TODO
* Provide examples of non-generic classes that compose this role
AUTHOR
Christos Argyropoulos, <chrisarg at cpan.org>
COPYRIGHT AND LICENSE
This software is copyright (c) 2024 by Christos Argyropoulos.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.