NAME

BioX::Seq::Fetch - Fetch records from indexed FASTA non-sequentially

SYNOPSIS

use BioX::Seq::Fetch;

my $parser = BioX::Seq::Fetch->new($filename);

my $seq = $parser->fetch_seq('seq_ABC');
my $sub = $parser->fetch_seq('seq_XYZ', 8 => 15);

DESCRIPTION

BioX::Seq::Fetch provides non-sequential access to records from indexed sequence files. Currently only FASTA files indexed using samtools faidx or another compatible method are supported. The module creates samtools-compatible index files automatically if they are missing.
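
For orientation, a samtools-style .fai index is a tab-separated text file with one line per sequence: name, sequence length, byte offset of the first base, bases per line, and bytes per line. The sketch below shows how such a record can be turned into a byte offset for a 1-based coordinate. It illustrates the index format only and is not taken from this module's internals; the helper names parse_fai_line and base_offset and the sample index line are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: split one .fai line into named fields.
sub parse_fai_line {
    my ($line) = @_;
    chomp $line;
    my ($name, $length, $offset, $line_bases, $line_bytes) = split /\t/, $line;
    return {
        name       => $name,
        length     => $length,
        offset     => $offset,        # byte offset of the first base
        line_bases => $line_bases,    # bases per full line
        line_bytes => $line_bytes,    # bytes per full line, incl. newline
    };
}

# Hypothetical helper: byte offset in the FASTA file of a 1-based position.
sub base_offset {
    my ($rec, $pos) = @_;
    die "position out of range\n" if $pos < 1 || $pos > $rec->{length};
    my $zero = $pos - 1;
    return $rec->{offset}
         + int($zero / $rec->{line_bases}) * $rec->{line_bytes}
         + $zero % $rec->{line_bases};
}

# Made-up index line: a 100-base record wrapped at 60 bases per line,
# preceded by the 9-byte header line ">seq_XYZ\n".
my $rec = parse_fai_line("seq_XYZ\t100\t9\t60\t61\n");
printf "base 61 starts at byte %d\n", base_offset($rec, 61);
```

Because every full line has the same width, the offset of any base follows from arithmetic alone, which is what makes seek()-based access possible without scanning the file.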

CONSTRUCTOR

new

my $parser = BioX::Seq::Fetch->new(
    $filename,
    with_descriptions => 1,
);

Creates a new BioX::Seq::Fetch parser. Requires an input filename; STDIN and open filehandles are not supported, because a filename is needed to find the corresponding index file and to ensure that seek()-ing is supported. Takes one optional boolean argument ('with_descriptions') indicating whether to backtrack from the sequence to find and include any description present on the defline (normally the description is absent, because the FASTA index stores the offset to the sequence itself and not to the defline). This option is currently experimental and may slow down sequence fetches, so it is turned off by default.

METHODS

fetch_seq

my $seq = $parser->fetch_seq(
    $name,
    $start,
    $end,
);

Returns the requested sequence as a BioX::Seq object, or undef if no matching sequence is found. Requires a valid sequence identifier and optionally 1-based start and end coordinates to retrieve a substring (the entire sequence is returned by default). A fatal error is thrown if the provided coordinates fall outside the range [1, length(sequence)].
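
The two failure modes described above (undef for an unknown identifier, a fatal error for out-of-range coordinates) call for different handling. The following is a minimal sketch, assuming the module is installed and that the returned BioX::Seq object exposes its sequence through a seq accessor; the scratch file, record names, and sequences are invented for the example.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use BioX::Seq::Fetch;

# Write a small FASTA file to a scratch directory (contents are made up).
my $dir = tempdir(CLEANUP => 1);
my $fa  = "$dir/demo.fa";
open my $out, '>', $fa or die "open $fa: $!";
print {$out} ">seq_ABC first record\nACGTACGTAC\nGGTTAACC\n";
print {$out} ">seq_XYZ second record\nTTTTCCCCAA\nAAGGGGTTTT\n";
close $out;

my $parser = BioX::Seq::Fetch->new($fa);

# Whole record. A return value of undef means the ID was not found.
my $seq = $parser->fetch_seq('seq_ABC');
print $seq->seq, "\n" if defined $seq;

my $missing = $parser->fetch_seq('no_such_id');
warn "no_such_id not found\n" if ! defined $missing;

# Subsequence using 1-based start and end coordinates.
my $sub = $parser->fetch_seq('seq_XYZ', 8 => 15);

# Out-of-range coordinates are fatal, so trap them with eval.
my $bad = eval { $parser->fetch_seq('seq_XYZ', 15 => 500) };
my $err = $@;
warn "fetch failed: $err" if $err;
```

Wrapping only the coordinate-based call in eval keeps the common path simple while still letting a batch job skip a bad region instead of aborting.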

write_index

$parser->write_index();
$parser->write_index( 'path/to/file.fa.fai' );

Writes a samtools-compatible index file for the underlying sequence file. Accepts one optional argument specifying the path of the file to create (the default, which should usually not be changed, is the same as the underlying sequence file with a '.fai' extension added).

This method is now called automatically if a FASTA file is opened with no index file present.

ids

my @seq_ids = $parser->ids;

Returns an array of sequence IDs, ordered by their occurrence in the underlying file.

length

my $len = $parser->length( $seq_id );

Returns the length of the sequence given by $seq_id. May be marginally faster than fetching the sequence object and then finding the length.
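
Together, ids and length make it possible to summarize a file without fetching any sequence data. A minimal sketch, assuming the module is installed; the scratch file and its records are invented for the example.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use BioX::Seq::Fetch;

# Write a small FASTA file to a scratch directory (contents are made up).
my $dir = tempdir(CLEANUP => 1);
my $fa  = "$dir/demo.fa";
open my $out, '>', $fa or die "open $fa: $!";
print {$out} ">seq_ABC\nACGTACGTAC\nGGTTAACC\n";
print {$out} ">seq_XYZ\nTTTTCCCCAA\nAAGGGGTTTT\n";
close $out;

my $parser = BioX::Seq::Fetch->new($fa);

# Print a two-column table of ID and length, in file order.
my @seq_ids = $parser->ids;
for my $id (@seq_ids) {
    printf "%s\t%d\n", $id, $parser->length($id);
}
```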

COMPRESSION

BioX::Seq::Fetch supports files compressed with blocked gzip (BGZF), typically using the bgzip utility. This allows for pseudo-random access without the need for full file decompression. The Compress::BGZF module is required for this functionality.

CAVEATS AND BUGS

Please report any bugs or feature requests to the issue tracker at https://github.com/jvolkening/p5-BioX-Seq.

AUTHOR

Jeremy Volkening <jeremy *at* base2bio.com>

COPYRIGHT AND LICENSE

Copyright 2014-2017 Jeremy Volkening

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.