NAME

Bio::Trace::ABIF - Perl extension for reading and parsing ABIF (Applied Biosystems, Inc. Format) files

VERSION

Version 0.01

SYNOPSIS

The ABIF file format is a binary format for storing data, developed by Applied Biosystems, Inc. This module provides general methods for accessing any chunk of information contained in an ABIF file. It also provides shortcut methods for reading the most commonly used parts of the file, and methods for manipulating the data (e.g., for computing LOR scores).

use Bio::Trace::ABIF;

my $ab1 = Bio::Trace::ABIF->new();
$ab1->open_abif('/Path/to/my/file.ab1');

print $ab1->sample_name(), "\n";
my @quality_values = $ab1->quality_values();
my $sequence = $ab1->sequence();
# etc...

$ab1->close_abif();

If you cannot find a method to retrieve the information you need, you may use get_data_item() or get_directory().

CONSTRUCTOR

new()

Usage    : my $ab1 = Bio::Trace::ABIF->new();
Returns  : an instance of ABIF.

Creates an ABIF object.

METHODS

open_abif()

Usage    : $ab1->open_abif($pathname) or die 'Wrong filename or format';
Returns  : 1 if the file is successfully opened and it is an ABIF file;
           0 otherwise. 

Opens the specified file in binary format and checks whether it is in ABIF format.

close_abif()

Usage    : $ab1->close_abif();
Returns  : Nothing.

Closes the currently opened file.

is_abif_open()

Usage    : if ($ab1->is_abif_open()) { # ...
Returns  : 1 if an ABIF file is open; 0 otherwise.

is_abif_format()

Usage    : if ($ab1->is_abif_format()) { # Ok, it is ABIF
Returns  : 1 if the file is in ABIF format, 0 otherwise.

Checks that the file is in ABIF format. This method is called automatically when a file is opened.

abif_version()

Usage    : my $v = $ab1->abif_version();
Returns  : The ABIF file version number (e.g., '1.01').

Used to determine the ABIF file version number.

num_dir_entries()

Usage    : my $n = $ab1->num_dir_entries();
Returns  : The number of data items contained in the file.

Used to determine the number of directory entries in the ABIF file.

data_offset()

Usage    : my $n = $ab1->data_offset();
Returns  : The offset, in bytes, of the first directory entry
           with respect to the beginning of the file.

Used to determine the data offset of the directory entries.

get_directory()

Usage    : my %DirEntry = $ab1->get_directory($tag_name, $tag_num);
Returns  : a hash with the content of the specified directory entry,
           or () if the tag is not found.
           

Retrieves the directory entry identified by the pair ($tag_name, $tag_num). The $tag_name is a four-letter ASCII code and $tag_num is an integer (typically, 1 <= $tag_num <= 1000). The returned hash has the following keys:

TAG_NAME: the tag name;
TAG_NUMBER: the tag number;
ELEMENT_TYPE: a string indicating the type of the data item
              ('char', 'byte', 'float', 'pString', etc...);
ELEMENT_SIZE: the size, in bytes, of one element;
NUM_ELEMENTS: the number of elements in the data item;
DATA_SIZE: the size, in bytes, of the data item;
DATA_ITEM: the raw sequence of bytes of the data item.

Nota Bene: it is up to the caller to interpret the data item field correctly (typically, by unpack()ing the item).
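
As a minimal sketch of that interpretation step, the following builds a made-up directory entry with the keys listed above and unpacks its DATA_ITEM. The tag, the values and the 's>*' template (big-endian signed 16-bit integers) are assumptions chosen for illustration; a real entry would come from get_directory().

```perl
use strict;
use warnings;

# Hypothetical directory entry, shaped like the hash returned by
# get_directory(). All values are invented for this example.
my %dir_entry = (
    TAG_NAME     => 'DATA',
    TAG_NUMBER   => 1,
    ELEMENT_TYPE => 'short',
    ELEMENT_SIZE => 2,
    NUM_ELEMENTS => 4,
    DATA_SIZE    => 8,
    DATA_ITEM    => pack('s>*', 12, -3, 250, 1024),
);

# Interpret the raw bytes according to the element type
# (assumed here: big-endian signed shorts).
my @values = unpack('s>*', $dir_entry{DATA_ITEM});
print "@values\n";    # prints "12 -3 250 1024"
```

The '>' modifier requires Perl 5.10 or later.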

Refer to the "SEE ALSO" Section for further information.

get_data_item()

Usage    : my @data = $ab1->get_data_item($tag_name, $tag_num, $template);
Returns  : (), if the tag is not found; otherwise:
           a list of elements unpacked according to $template.

Retrieves the data item specified by the pair ($tag_name, $tag_num) and unpacks it according to $template. The $tag_name is a four-letter ASCII code and $tag_num is an integer (typically, 1 <= $tag_num <= 1000). The $template uses the same syntax as the template of the pack() function.
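
The snippet below illustrates how such a template behaves, using plain unpack() on made-up bytes rather than a real file. It assumes that a 'pString' item is laid out as one length byte followed by the characters, for which 'C/a' is a suitable template; the same kind of string would be passed as $template.

```perl
use strict;
use warnings;

# Made-up bytes standing in for a 'pString' data item (assumed layout:
# one length byte, then that many characters).
my $raw = pack('C/a*', 'KB 1.3.0');

# 'C/a' reads the length byte, then extracts that many characters.
my ($text) = unpack('C/a', $raw);
print "$text\n";    # prints "KB 1.3.0"

# A fixed-size item of four one-byte characters can be split with four 'a's.
my @bases = unpack('a a a a', 'GATC');
print "@bases\n";   # prints "G A T C"
```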

Refer to the "SEE ALSO" Section for further information.

search_tag()

Usage    : if (search_tag($tag_name, $tag_num)) { # etc...
Returns  : 1 if the tag is found; 0, otherwise

Performs a linear scan of the directory entries until the specified data tag is matched.

If the tag is found, then the file handle is positioned just after the tag number (ready to read the element type).

AB1 COMMON TAGS

The following methods work for files from the AB 3730/3730xl Data Collection Software v2.0 and v3.0 on the Applied Biosystems 3730/3730xl Genetic/DNA Analyzer or from ABI Prism(R) 3100/3100-Avant Analyzer Data Collection Software v2.0.

sample_name()

Usage    : my $name = sample_name();
Returns  : a string containing the sample name;
           '' if this tag is not in the file.

data_collection_software_version()

Usage    : my $v = data_collection_software_version();
Returns  : the data collection software version;
           '' if this tag is not in the file.

data_collection_firmware_version()

Usage    : my $v = data_collection_firmware_version();
Returns  : the data collection firmware version;
           '' if this tag is not in the file.

official_instrument_name()

Usage    : my $v = official_instrument_name();
Returns  : the official instrument name;
           '' if this tag is not in the file.

well_id()

Usage    : my $well_id = well_id();
Returns  : the well ID;
           '' if this tag is not in the file.

capillary_number()

Usage    : my $cap_n = capillary_number();
Returns  : the LANE/Capillary number;
           0 if this tag is not in the file.

user()

Usage    : my $user = user();
Returns  : the name of the user who created the plate;
           '' if this tag is not in the file.

Nota Bene: this field is optional.

instrument_name_and_serial_number()

Usage    : my $sn = instrument_name_and_serial_number();
Returns  : a string with the instrument name and the serial number;
           '' if this tag is not in the file.

sequencing_analysis_param_filename()

Usage    : my $f = sequencing_analysis_param_filename();
Returns  : the Sequencing Analysis parameters filename;
           '' if this tag is not in the file.

comment()

Usage    : my $comment = comment();
Returns  : the comment associated to the file;
           '' if this tag is not in the file.

This is an optional field.

base_order()

Usage    : my @bo = $ab->base_order();
Returns  : the order in which the bases are stored in the file;
           () if this tag is not in the file.

order_base()

Usage    : my %bases = $ab->order_base();
Returns  : the indices of the bases, as they are stored in the file;
           () if the base order is not present in the file.

This method is the inverse of base_order(): it maps each base to its index instead of listing the bases in order.
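
A sketch of the relationship between the two methods, using a made-up base order. It assumes zero-based indices, which the text above does not specify.

```perl
use strict;
use warnings;

# Made-up example of what base_order() might return.
my @base_order = ('G', 'A', 'T', 'C');

# Inverse mapping, as order_base() is described: base => index.
# Zero-based indices are an assumption of this sketch.
my %order_base;
@order_base{@base_order} = (0 .. $#base_order);

print join(' ', map { "$_=$order_base{$_}" } sort keys %order_base), "\n";
# prints "A=1 C=3 G=0 T=2"
```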

num_capillaries()

Usage    : my $nc = num_capillaries();
Returns  : the number of capillaries;
           0 if this tag is not in the file.

sample_tracking_id()

Usage    : my $sample_id = sample_tracking_id();
Returns  : the sample tracking ID;
           '' if this tag is not in the file.

analysis_protocol_xml()

Usage    : my $xml = analysis_protocol_xml();
Returns  : the Analysis Protocol XML string;
           '' if this tag is not in the file.

analysis_protocol_settings_name()

Usage    : my $name = analysis_protocol_settings_name();
Returns  : the Analysis Protocol settings name;
           '' if this tag is not in the file.

analysis_protocol_settings_version()

Usage    : my $name = analysis_protocol_settings_version();
Returns  : the Analysis Protocol settings version;
           '' if this tag is not in the file.

analysis_protocol_xml_schema_version()

Usage    : my $name = analysis_protocol_xml_schema_version();
Returns  : the Analysis Protocol XML schema version;
           '' if this tag is not in the file.

results_group()

Usage    : my $name = results_group();
Returns  : the results group name;
           '' if this tag is not in the file.

run_module_name()

Usage    : my $name = run_module_name();
Returns  : the run module name;
           '' if this tag is not in the file.

run_module_version()

Usage    : my $name = run_module_version();
Returns  : the run module version;
           '' if this tag is not in the file.

raw_data_for_channel()

Usage    : my @data = raw_data_for_channel($channel_number);
Returns  : the channel $channel_number raw data;
           () if the channel number is out of range
           or the tag is not in the file.

There are four channels in an ABIF file, numbered from 1 to 4. An optional channel number 5 exists in some files. If the specified channel number is outside this range, the empty list is returned. If channel 5 data is requested but not found, undef is returned.

num_dyes()

Usage    : my $n = num_dyes();
Returns  : the number of dyes;
           0 if this tag is not in the file.

dye_name()

Usage    : my $n = dye_name($n);
Returns  : the name of dye number $n;
           '' if this tag is not in the file or $n is not
           in the range [1..4].
           

AB1 NEWER TAGS

The following methods work for files from the AB 3730/3730xl Data Collection Software v3.0 on the Applied Biosystems 3730/3730xl Genetic Analyzer.

container_owner()

Usage    : my $owner = container_owner();
Returns  : the container's owner;
           '' if this tag is not in the file.

SEQSCAPE V2.5 AND SEQUENCING ANALYSIS V5.2 TAGS

The following methods work for files processed by SeqScape v2.5 or Sequencing Analysis v5.2.

quality_values()

Usage    : my @qv = quality_values();
Returns  : the list of quality values;
           () if this tag is not in the file.

quality_values_ref()

Usage    : my $ref_to_qv = quality_values_ref();
Returns  : a reference to the list of quality values;
           a reference to the empty list if this tag is not in the file.

edited_quality_values()

Usage    : my @qv = edited_quality_values();
Returns  : the list of edited quality values;
           () if this tag is not in the file.

edited_quality_values_ref()

Usage    : my $ref_to_qv = edited_quality_values_ref();
Returns  : a reference to the list of edited quality values;
           a reference to the empty list if this tag is not in the file.

sequence()

Usage    : my $sequence = sequence();
Returns  : the string of the base called sequence;
           '' if this tag is not in the file.

sequence_length()

Usage    : my $l = sequence_length();
Returns  : the length of the base called sequence;
           0 if the sequence is not in the file.

edited_sequence()

Usage    : my $sequence = edited_sequence();
Returns  : the string of the edited base called sequence;
           '' if this tag is not in the file.

edited_sequence_length()

Usage    : my $l = edited_sequence_length();
Returns  : the length of the edited base called sequence;
           0 if the sequence is not in the file.

analyzed_data_for_channel()

Usage    : my @data = analyzed_data_for_channel($channel_number);
Returns  : the channel $channel_number analyzed data;
           () if the channel number is out of range
           or this tag is not in the file.

There are four channels in an ABIF file, numbered from 1 to 4. An optional channel number 5 exists in some files. If the specified channel number is outside this range, the empty list is returned. If channel 5 data is requested but not found, undef is returned.

peak1_location_orig()

Usage    : my $pl = peak1_location_orig();
Returns  : The peak 1 location (orig);
           0 if this tag is not in the file.

peak1_location()

Usage    : my $pl = peak1_location();
Returns  : The peak 1 location;
           0 if this tag is not in the file.

base_spacing()

Usage    : my $spacing = base_spacing();
Returns  : the spacing (a float);
           0.0 if this tag is not in the file.

basecaller_version()

Usage    : my $v = basecaller_version();
Returns  : a string indicating the basecaller version (e.g., 'KB 1.3.0');
           '' if this tag is not in the file.

base_locations()

Usage    : my @bl = base_locations();
Returns  : the list of base locations;
           () if this tag is not in the file.

base_locations_edited()

Usage    : my @bl = base_locations_edited();
Returns  : the list of base locations (edited);
           () if this tag is not in the file.

basecaller_bcp_dll()

Usage    : my $v = basecaller_bcp_dll();
Returns  : a string with the basecaller BCP/DLL;
           '' if this tag is not in the file.

signal_level()

Usage    : my %signal_level = signal_level();
Returns  : the signal level for each dye;
           () if this tag is not in the file.
           

raw_trace()

Usage    : my @trace = raw_trace($base);
Returns  : the raw trace corresponding to base $base;
           () if the data is not in the file.

The possible values for $base are 'A', 'C', 'G' and 'T' (case insensitive).

trace()

Usage    : my @trace = trace($base);
Returns  : the (analyzed) trace corresponding to base $base;
           () if the data is not in the file.

The possible values for $base are 'A', 'C', 'G' and 'T'.

noise()

Usage    : my %noise = $ab->noise();
Returns  : the estimated noise for each dye;
           () if this tag is not in the file.

This method works only with files containing data processed by the KB Basecaller.

METHODS FOR ASSESSING QUALITY

The following methods compute values that help assess the quality of the data.

avg_signal_to_noise_ratio()

Usage    : my $sn_ratio = $ab->avg_signal_to_noise_ratio();
Returns  : the average signal to noise ratio (only for the KB Basecaller);
           0 if the information needed to compute such value is missing.

This method works only with files containing data processed by the KB Basecaller.

clear_range_start()

Usage    : my $cl_start = $ab->clear_range_start();
           my $cl_start = $ab->clear_range_start($window_width,
                                                 $bad_bases_threshold,
                                                 $quality_threshold);
Returns  : the clear range start position (counting from zero);
           -1 if $window_width is greater than the number of quality values;
           -1 if the information needed to compute such value is missing.

The Sequencing Analysis program determines the clear range of the sequence by trimming bases from the 5' to 3' ends until fewer than 4 bases out of 20 have a quality value less than 20. You can change these parameters by explicitly passing arguments to this method (the default values are $window_width = 20, $bad_bases_threshold = 4, $quality_threshold = 20). Note that Sequencing Analysis counts the bases starting from one, so you have to add one to the return values to get consistent results.
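
The sliding-window rule described above can be sketched as follows. This is an illustration under stated assumptions, not the module's code: sketch_clear_range_start() is a hypothetical helper, the quality values are made up, and returning -1 when no window qualifies is a choice of this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the zero-based start of the first window of
# $width bases containing fewer than $max_bad bases with quality below
# $min_qv, or -1 if there is no such window.
sub sketch_clear_range_start {
    my ($qv, $width, $max_bad, $min_qv) = @_;
    my $n = scalar @$qv;
    return -1 if $width > $n;

    # Count the low-quality bases in the first window.
    my $bad = grep { $_ < $min_qv } @{$qv}[0 .. $width - 1];

    for (my $start = 0; ; $start++) {
        return $start if $bad < $max_bad;
        my $next = $start + $width;           # base entering the window
        last if $next >= $n;
        $bad-- if $qv->[$start] < $min_qv;    # base leaving the window
        $bad++ if $qv->[$next]  < $min_qv;
    }
    return -1;
}

my @qv    = (5, 8, 30, 6, 35, 40, 42, 38, 41, 39);    # made-up values
my $start = sketch_clear_range_start(\@qv, 4, 2, 20);
print "$start\n";    # prints "2" (position 3 when counting from one)
```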

clear_range_stop()

Usage    : my $cl_stop = $ab->clear_range_stop();
           my $cl_stop = $ab->clear_range_stop($window_width,
                                               $bad_bases_threshold,
                                               $quality_threshold);
Returns  : the clear range stop position (counting from zero);
           -1 if $window_width is greater than the number of quality values;
           -1 if the information needed to compute such value is missing.

The Sequencing Analysis program determines the clear range of the sequence by trimming bases from the 5' to 3' ends until fewer than 4 bases out of 20 have a quality value less than 20. You can change these parameters by explicitly passing arguments to this method (the default values are $window_width = 20, $bad_bases_threshold = 4, $quality_threshold = 20).

sample_score()

Usage    : my $ss = $ab->sample_score();
           my $ss = $ab->sample_score($window_width, $bad_bases_threshold,
                                      $quality_threshold);
Returns  : the sample score associated to the sequence;
           0 if the information needed to compute such value is missing.

The sample score is the average quality value of the bases in the clear range of the sequence. See the clear_range_start() and clear_range_stop() methods for the meaning of the optional arguments of this method.
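
As a small illustration of that definition, the following averages made-up quality values over a made-up clear range. Treating the stop position as inclusive is an assumption of this sketch.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Made-up quality values and clear range (zero-based, assumed inclusive).
my @qv = (5, 8, 30, 20, 40, 30, 6, 4);
my ($start, $stop) = (2, 5);

# Average the quality values inside the clear range.
my @clear = @qv[$start .. $stop];
my $score = @clear ? sum(@clear) / @clear : 0;
print "$score\n";    # prints "30"
```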

num_medium_quality_bases()

Usage    : my $n = $ab->num_medium_quality_bases($min_qv, $max_qv, $start, $stop);
           my $n = $ab->num_medium_quality_bases($min_qv, $max_qv);
Returns  : the number of bases in the range [$start, $stop], or in the whole
           sequence if no range is specified, with $min_qv <= quality value <= $max_qv;
           -1 if the information needed to compute such value is missing.

num_high_quality_bases()

Usage    : my $n = $ab->num_high_quality_bases($threshold, $start, $stop);
           my $n = $ab->num_high_quality_bases($threshold);
Returns  : the number of bases in the range [$start, $stop], or in the whole
           sequence if no range is specified, with quality value >= $threshold;
           -1 if the information needed to compute such value is missing.
           

num_low_quality_bases()

Usage    : my $n = $ab->num_low_quality_bases($threshold, $start, $stop);
           my $n = $ab->num_low_quality_bases($threshold);
Returns  : the number of bases in the range [$start, $stop], or in the whole
           sequence if no range is specified, with quality value <= $threshold;
           -1 if the information needed to compute such value is missing.
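
The three counting methods above differ only in the comparison applied to each quality value. A minimal sketch with made-up values follows; sketch_count() is a hypothetical helper, and the range is assumed to be zero-based and inclusive.

```perl
use strict;
use warnings;

my @qv = (5, 12, 20, 25, 30, 45, 50, 8);    # made-up quality values

# Hypothetical helper: counts the values in [$start, $stop] (the whole list
# by default) for which the code reference $test returns true.
sub sketch_count {
    my ($test, $qv, $start, $stop) = @_;
    $start //= 0;
    $stop  //= $#{$qv};
    return scalar grep { $test->($_) } @{$qv}[$start .. $stop];
}

my $high   = sketch_count(sub { $_[0] >= 20 }, \@qv);                  # threshold 20
my $low    = sketch_count(sub { $_[0] <= 10 }, \@qv);                  # threshold 10
my $medium = sketch_count(sub { $_[0] >= 15 && $_[0] <= 30 }, \@qv);   # 15..30
my $ranged = sketch_count(sub { $_[0] >= 20 }, \@qv, 2, 5);            # positions 2..5
print "$high $low $medium $ranged\n";    # prints "5 2 3 4"
```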
           

contiguous_read_length()

Usage    : my ($crl_start, $crl_stop) = $ab->contiguous_read_length();
           my ($crl_start, $crl_stop) =
              $ab->contiguous_read_length($windowWidth, $quality_threshold);
Returns  : the beginning and ending position of the CRL (Contiguous Read Length)

The CRL is (the length of) the longest uninterrupted stretch in a read such that the average quality of any interval of $windowWidth bases (20 by default) that is inside such stretch never goes below $quality_threshold (20 by default). The threshold must be at least 10. The ends of the CRL are further trimmed until there are no bases with quality values less than 10 within the first five and the last five bases. The positions are counted from zero. If there is more than one CRL, the position of the first one is reported.
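
One possible reading of this definition is sketched below. It is not the module's implementation: sketch_crl() is a hypothetical helper, the quality values are made up, the returned positions are assumed to be inclusive, the sketch returns the empty list when no window qualifies, and it does not enforce the minimum threshold of 10.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical helper. Returns ($start, $stop), zero-based and inclusive,
# or () if no window reaches the threshold.
sub sketch_crl {
    my ($qv, $width, $threshold) = @_;
    my $n = scalar @$qv;
    return () if $width < 1 || $width > $n;

    # Step 1: longest run of consecutive windows with average >= $threshold.
    my ($best_start, $best_stop, $run_start) = (-1, -2, undef);
    for my $i (0 .. $n - $width) {
        my $avg = sum(@{$qv}[$i .. $i + $width - 1]) / $width;
        if ($avg >= $threshold) {
            $run_start = $i unless defined $run_start;
            my $stop = $i + $width - 1;
            # Strict '>' keeps the first stretch among equally long ones.
            ($best_start, $best_stop) = ($run_start, $stop)
                if $stop - $run_start > $best_stop - $best_start;
        }
        else {
            undef $run_start;
        }
    }
    return () if $best_start < 0;

    # Step 2: trim each end while a base with quality < 10 lies among
    # its five outermost bases.
    my ($s, $e) = ($best_start, $best_stop);
    while ($s <= $e) {
        my $last = $s + 4 < $e ? $s + 4 : $e;
        my @bad = grep { $qv->[$_] < 10 } $s .. $last;
        last unless @bad;
        $s = $bad[-1] + 1;
    }
    while ($s <= $e) {
        my $first = $e - 4 > $s ? $e - 4 : $s;
        my @bad = grep { $qv->[$_] < 10 } $first .. $e;
        last unless @bad;
        $e = $bad[0] - 1;
    }
    return $s <= $e ? ($s, $e) : ();
}

my @qv = (4, 6, 30, 8, 35, 40, 38, 42, 36, 5, 3);    # made-up values
my ($crl_start, $crl_stop) = sketch_crl(\@qv, 3, 20);
print "$crl_start $crl_stop\n";    # prints "4 8"
```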

length_of_read()

Usage    : my $LOR = $ab->length_of_read($windowWidth, $quality_threshold);
           my $LOR = $ab->length_of_read($windowWidth, $quality_threshold, $method);
Returns  : the Length Of Read (LOR) value, computed using a window of
           $windowWidth bases, a threshold equal to $quality_threshold and,
           optionally, according to the specified $method
           (which, currently, is either the string 'SequencingAnalysis' or the string
           'GoodQualityWindows');
           0 if there are fewer than $windowWidth quality values;
           0 if the information needed to compute such value is missing.

The Length Of Read (LOR) score gives an approximate measure of the usable range of high-quality or high-accuracy bases determined by quality values. Such range can be determined in several ways. Two possible procedures are currently implemented and described below.

The 'SequencingAnalysis' method (used by default) computes the LOR as the widest range starting and ending with $windowWidth bases whose average quality is greater than or equal to $quality_threshold.

The 'GoodQualityWindows' method computes the LOR as the number of intervals of $windowWidth bases whose average quality is greater than or equal to $quality_threshold.
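
The two procedures can be compared with a short sketch. It follows the descriptions above literally and is not the module's code: sketch_lor() is a hypothetical helper, and the quality values are made up.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical helper mirroring the two descriptions above.
sub sketch_lor {
    my ($qv, $width, $threshold, $method) = @_;
    $method //= 'SequencingAnalysis';
    my $n = scalar @$qv;
    return 0 if $width < 1 || $n < $width;

    # Start positions of all windows whose average quality is >= $threshold.
    my @good = grep {
        sum(@{$qv}[$_ .. $_ + $width - 1]) / $width >= $threshold
    } 0 .. $n - $width;
    return 0 unless @good;

    # 'GoodQualityWindows': number of good windows.
    return scalar @good if $method eq 'GoodQualityWindows';

    # 'SequencingAnalysis': widest range that starts with the first good
    # window and ends with the last good window.
    return $good[-1] + $width - $good[0];
}

my @qv   = (4, 6, 30, 8, 35, 40, 38, 42, 36, 5, 3);    # made-up values
my $lor1 = sketch_lor(\@qv, 3, 20);
my $lor2 = sketch_lor(\@qv, 3, 20, 'GoodQualityWindows');
print "$lor1 $lor2\n";    # prints "8 6"
```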

INTERNAL FUNCTIONS

The following methods are meant for internal use only and should not be used as user functions.

_ieee_single_prec_float()

Usage    : _ieee_single_prec_float($string_32bits)
Returns  : the floating number corresponding to the given 32 bit string.

Interprets the 32 bit string in the standard IEEE format:

<sign (1 bit)><exponent (8 bits)><mantissa (23 bits)>

The value is computed as:

sign * 1.mantissa * 2**(exponent - 127)
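
A sketch of this computation follows, assuming the argument is a string of 32 '0' and '1' characters and that a sign bit of 1 means a negative number. sketch_ieee_single() is a hypothetical helper that handles normalized numbers only (zero, denormals, infinities and NaN are ignored); the result is cross-checked against Perl's own 'f>' unpack template.

```perl
use strict;
use warnings;

# Hypothetical helper: decodes a normalized IEEE single precision number
# given as a string of 32 bits.
sub sketch_ieee_single {
    my ($bits) = @_;
    my $sign     = substr($bits, 0, 1) eq '1' ? -1 : 1;
    my $exponent = oct('0b' . substr($bits, 1, 8));
    my $fraction = oct('0b' . substr($bits, 9, 23)) / 2**23;
    return $sign * (1 + $fraction) * 2**($exponent - 127);
}

# Sign 0, exponent 130, mantissa 1001000...: 1.5625 * 2**3.
my $bits  = '0' . '10000010' . '1001' . ('0' x 19);
my $value = sketch_ieee_single($bits);
print "$value\n";    # prints "12.5"

# Cross-check with the built-in big-endian float template (Perl 5.10+).
my $check = unpack('f>', pack('B32', $bits));
print "$check\n";    # prints "12.5"
```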

_bit_string_as_unsigned_integer()

Usage    : my $n = _bit_string_as_unsigned_integer('10000101');
Returns  : the decimal value of the given bit string.

Interprets the given bit string as an unsigned integer, and returns its decimal representation (in the example above, it returns 133).

_bit_string_as_decimal_fraction()

Usage    : my $f = _bit_string_as_decimal_fraction('010.01000');
Returns  : the decimal fraction corresponding to the given
           bit string.

Interprets the given bit string as an unsigned fractional value and returns the corresponding decimal value. For example, 010.0100 is converted into 2.25.
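
Both bit string conversions can be sketched in a few lines. The helpers below are hypothetical stand-ins written for illustration, not the module's internal functions.

```perl
use strict;
use warnings;

# Hypothetical helper: value of a string of '0'/'1' characters read as an
# unsigned binary integer.
sub sketch_bits_to_uint {
    my ($bits) = @_;
    my $value = 0;
    $value = 2 * $value + $_ for split //, $bits;
    return $value;
}

# Hypothetical helper: value of a bit string with an optional binary point.
sub sketch_bits_to_fraction {
    my ($bits) = @_;
    my ($int, $frac) = split /\./, $bits, 2;
    my $value  = sketch_bits_to_uint($int);
    my $weight = 0.5;    # weight of the first bit after the point
    for my $bit (split //, $frac // '') {
        $value  += $weight * $bit;
        $weight /= 2;
    }
    return $value;
}

my $uint     = sketch_bits_to_uint('10000101');
my $fraction = sketch_bits_to_fraction('010.0100');
print "$uint $fraction\n";    # prints "133 2.25"
```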

AUTHOR

Nicola Vitacolonna, <vitacolonna at appliedgenomics.org>

BUGS

Please report any bugs or feature requests to bug-bio-trace-abif at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Bio-Trace-ABIF. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this module with the perldoc command.

perldoc Bio::Trace::ABIF

SEE ALSO

See http://www.appliedbiosystems.com/support/ for the ABIF format file specification sheet.

There is an ABI module on CPAN (http://search.cpan.org/~malay/ABI-0.01/).

bioperl-ext also parses ABIF files and other trace formats.

You are welcome at http://www.appliedgenomics.org!

ACKNOWLEDGEMENTS

Thanks to Simone Scalabrin for many helpful suggestions and for the first implementation of the length_of_read() method the way Sequencing Analysis does it!

Some explanations of how Sequencing Analysis computes certain parameters were found at http://keck.med.yale.edu/dnaseq/.

COPYRIGHT & LICENSE

Copyright 2006 Nicola Vitacolonna, all rights reserved.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

DISCLAIMER

This software is provided "as is" without warranty of any kind.