NAME
Bio::ToolBox::Data::Feature - Objects representing rows in a data table
DESCRIPTION
A Bio::ToolBox::Data::Feature is an object representing a row in the data table. Usually, this in turn represents an annotated feature or segment in the genome. As such, this object provides convenient methods for accessing and manipulating the values in a row, as well as methods for working with the represented genomic feature.
This class should NOT be used directly by the user. Rather, Feature objects are generated from a Bio::ToolBox::Data::Iterator object (generated itself from the row_stream function in Bio::ToolBox::Data), or the iterate function in Bio::ToolBox::Data. Please see the respective documentation for more information.
Example of working with a stream object.
use Bio::ToolBox::Data;
my $Data = Bio::ToolBox::Data->new(file => $file);
# stream method
my $stream = $Data->row_stream;
while (my $row = $stream->next_row) {
    # each $row is a Bio::ToolBox::Data::Feature object
    # representing the row in the data table
    my $value = $row->value($index);
    # do something with $value
}
# iterate method
$Data->iterate( sub {
    my $row = shift;
    my $number = $row->value($index);
    my $log_number = log($number);
    $row->value($index, $log_number);
} );
METHODS
General information methods
- row_index
-
Returns the index position of the current data row within the data table. Useful for knowing where you are within the data table.
- feature_type
-
Returns one of three specific values describing the contents of the data table inferred by the presence of specific column names. This provides a clue as to whether the table features represent genomic regions (defined by coordinate positions) or named database features. The return values include:
coordinate: Table includes at least chromosome and start
named: Table includes name, type, and/or Primary_ID
unknown: unrecognized
- column_name
-
Returns the column name for the given index.
- data
-
Returns the parent Bio::ToolBox::Data object, in case you may have lost it by going out of scope.
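As a rough illustration of how a table type could be inferred from column names, here is a minimal sketch in plain Perl. The infer_feature_type helper and its matching rules are hypothetical and only mirror the three return values described for feature_type; the module's own detection may differ, for example in which column name aliases it accepts.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Bio::ToolBox): classify a table
# using nothing but its column names.
sub infer_feature_type {
    my %has = map { lc($_) => 1 } @_;

    # coordinate: at least a chromosome and a start column
    return 'coordinate' if $has{chromosome} && $has{start};

    # named: a name, type, and/or Primary_ID column
    return 'named' if $has{name} || $has{type} || $has{primary_id};

    return 'unknown';
}

print infer_feature_type(qw(Chromosome Start Stop)), "\n";
print infer_feature_type(qw(Primary_ID Name Type)),  "\n";
print infer_feature_type(qw(Sample Score)),          "\n";
```

In this sketch the coordinate test comes first, so a table carrying both coordinate and name columns is reported as coordinate.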
Methods to access row feature attributes
These methods return the corresponding value, if present in the data table, based on the column header name. If the row represents a named database object, try calling the "feature" method first. This will retrieve the database SeqFeature object, and the attributes can then be retrieved using the methods below or on the actual database SeqFeature object.
In cases where there is a table column and a corresponding SeqFeature object, for example a start column and a parsed SeqFeature object, the table value takes precedence and is returned. You can always obtain the SeqFeature's value separately and directly.
These methods do not set attribute values. If you need to change the values in a table, use the "value" method below.
- seq_id
-
The name of the chromosome the feature is on.
- start
- end
- stop
-
The coordinates of the feature or segment. Coordinates from known 0-based file formats, e.g. BED, are returned as 1-based. Coordinates must be integers to be returned. Zero or negative start coordinates are assumed to be accidents or poor programming and transformed to 1. Use the "value" method if you don't want this to happen.
- strand
-
The strand of the feature or segment. Returns -1, 0, or 1. Default is 0, or unstranded.
- midpoint
-
The calculated midpoint position of the feature.
- peak
-
For features in a narrowPeak file, this will report the peak coordinate, transformed into a genomic coordinate.
- name
- display_name
-
The name of the feature.
- coordinate
-
Returns a coordinate string formatted as seqid:start-stop. The start coordinate is converted to 1-based where relevant, in concordance with HTS tools (samtools and tabix).
- type
-
The type of feature. Typically either primary_tag or primary_tag:source_tag. In a GFF3 file, this represents columns 3 and 2, respectively. In annotation databases such as Bio::DB::SeqFeature::Store, the type is used to restrict to one of many different types of features, e.g. gene, mRNA, or exon.
- id
- primary_id
-
Here, this represents the primary_ID in the database. Note that this number is generally unique to a specific database, and not portable between databases.
- length
-
The length of the feature or segment.
- score
-
Returns the value of the Score column, if one is available. Typically associated with defined file formats, such as GFF files (6th column), BED and related Peak files (5th column), and bedGraph (4th column).
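The coordinate conventions described for start, end, midpoint, length, and coordinate can be sketched in plain Perl as follows. The normalize_bed_interval helper is hypothetical, and details such as rounding the midpoint down are assumptions rather than documented behavior of the module.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Bio::ToolBox): turn a 0-based, half-open
# BED interval into 1-based, inclusive values.
sub normalize_bed_interval {
    my ( $chrom, $bed_start, $bed_end ) = @_;
    my $start = $bed_start + 1;    # 0-based start becomes 1-based
    $start = 1 if $start < 1;      # zero or negative starts are set to 1
    my $end = $bed_end;            # a half-open end equals the 1-based end
    return {
        seq_id     => $chrom,
        start      => $start,
        end        => $end,
        length     => $end - $start + 1,
        midpoint   => int( ( $start + $end ) / 2 ),    # rounding down is an assumption
        coordinate => "$chrom:$start-$end",
    };
}

my $f = normalize_bed_interval( 'chr1', 99, 200 );
printf "%s length=%d midpoint=%d\n", @{$f}{qw(coordinate length midpoint)};
```

With a real row object, the raw, untransformed table value remains available through the value method.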
Accessing and setting values in the row.
- value
-
# retrieve a value
my $v = $row->value($index);
# set a value
$row->value($index, $v + 1);
Returns or sets the value at a specific column index in the current data row. Null values return a '.', symbolizing an internal null value.
- row_values
-
Returns an array or array reference representing all the values in the current data row.
Special feature attributes
GFF and VCF files have special attributes in the form of key => value pairs. These are stored as specially formatted, character-delimited lists in certain columns. These methods will parse this information and return it as a convenient hash reference. The keys and values of this hash may be changed, deleted, or added to as desired. To write the changes back to the file, use the "rewrite_attributes" method, which writes the attributes back with the proper formatting.
- attributes
-
Generic method that calls either "gff_attributes" or "vcf_attributes" depending on the data table format.
- gff_attributes
-
Parses the 9th column of GFF files. URL-escaped characters are converted back to text. Returns a hash reference of key => value pairs.
- vcf_attributes
-
Parses the INFO (8th column) and all sample columns (10th and higher columns) in a VCF file. The sample columns use the FORMAT column (9th column) as keys. The returned hash reference has two levels: the first level keys are both the column names and index (1-based). The second level keys are the individual attribute keys to each value. For example:
my $attr = $row->vcf_attributes;
# access by column name
my $genotype = $attr->{sample1}{GT};
my $depth = $attr->{INFO}{ADP};
# access by 1-based column index
$genotype = $attr->{10}{GT};
$depth = $attr->{8}{ADP};
- rewrite_attributes
-
Generic method that either calls "rewrite_gff_attributes" or "rewrite_vcf_attributes" depending on the data table format.
- rewrite_gff_attributes
-
Rewrites the GFF attributes column (the 9th column) based on the contents of the attributes hash that was previously generated with the "gff_attributes" method. Useful when you have modified the contents of the attributes hash.
- rewrite_vcf_attributes
-
Rewrites the VCF attributes for the INFO (8th column), FORMAT (9th column), and sample columns (10th and higher columns) based on the contents of the attributes hash that was previously generated with the "vcf_attributes" method. Useful when you have modified the contents of the attributes hash.
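To make the parse, modify, and rewrite cycle concrete, here is a minimal sketch for a GFF3 attribute column in plain Perl. Both helpers are hypothetical stand-ins and are not the module's implementation; the set of characters that gets re-escaped is an assumption, and commas are left alone because they may separate multiple values.

```perl
use strict;
use warnings;

# Hypothetical helper: parse a GFF3 attribute column into a hash reference.
sub parse_gff_attributes {
    my $column = shift;
    my %attr;
    foreach my $pair ( split /\s*;\s*/, $column ) {
        my ( $key, $value ) = split /=/, $pair, 2;
        next unless defined $key && length $key && defined $value;
        $value =~ s/%([0-9A-Fa-f]{2})/chr( hex $1 )/ge;    # undo URL escaping
        $attr{$key} = $value;
    }
    return \%attr;
}

# Hypothetical helper: write the hash back as a GFF3 attribute column.
sub format_gff_attributes {
    my $attr = shift;
    my @pairs;
    foreach my $key ( sort keys %{$attr} ) {
        my $value = $attr->{$key};
        # re-escape characters with a reserved meaning (assumed set)
        $value =~ s/([%;=&\t])/sprintf( '%%%02X', ord $1 )/ge;
        push @pairs, "$key=$value";
    }
    return join ';', @pairs;
}

my $attr = parse_gff_attributes('ID=gene1;Name=YFG1;Note=kinase%3B putative');
$attr->{Note} .= ' (reviewed)';    # change a value
delete $attr->{Name};              # delete a key
$attr->{Alias} = 'ABC1';           # add a key
print format_gff_attributes($attr), "\n";
```

With a real row object, the equivalent steps would be a call to gff_attributes, edits to the returned hash, and a call to rewrite_attributes.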
Convenience Methods to database functions
The following methods are convenience methods for using the attributes in the current data row to interact with databases. They are wrappers to methods in the Bio::ToolBox::db_helper module.
- seqfeature
- feature
-
Returns a SeqFeature object representing the feature or item in the current row. If the SeqFeature object is stored in the parent $Data object (usually from parsing an annotation file), it is immediately returned. Otherwise, the SeqFeature object is retrieved from the database using the name and type values in the current Data table row. The SeqFeature object is requested from the database named in the general metadata. If an alternate database is desired, you should change it first using the $Data->database() method. If the feature name or type is not present in the table, then nothing is returned.
See Bio::ToolBox::SeqFeature and Bio::SeqFeatureI for more information about working with these objects. See Bio::DB::SeqFeature::Store about working with database features.
This method normally only works with "named" feature types in a Bio::ToolBox::Data Data table. If your Data table has coordinate information, i.e. chromosome, start, and stop columns, then it will likely be recognized as a "coordinate" feature_type and not work.
Pass a true value to this method to force the seqfeature lookup. This will still require the presence of Name, ID, and/or Type columns to perform the database lookup. The Bio::ToolBox::Data method feature() is used to determine the type if a Type column is not present.
- segment
-
Returns a database Segment object corresponding to the coordinates defined in the Data table row. If a named feature and type are present instead of coordinates, then the feature is first automatically retrieved and a Segment returned based on its coordinates. The database named in the general metadata is used to establish the Segment object. If a different database is desired, it should be changed first using the general "database" method.
See Bio::DB::SeqFeature::Segment and Bio::RangeI for more information about working with Segment objects.
- get_features
-
my @overlap_features = $row->get_features(type => $type);
Returns seqfeature objects from a database that overlap the Feature or interval in the current Data table row. This is essentially a convenience wrapper for a Bio::DB style features method using the coordinates of the Feature. Optionally pass an array of key value pairs to specify alternate coordinates if so desired. Potential keys include
- seq_id
- start
- end
- type
-
The type of database features to retrieve.
- db
-
An alternate database object to collect from.
- get_sequence
-
Fetches genomic sequence based on the coordinates of the current seqfeature or interval in the current Feature. This requires a database that contains the genomic sequence, either the database specified in the Data table metadata or an external indexed genomic fasta file.
If the Feature represents a transcript or gene, then a concatenated sequence of the selected subfeatures may be generated and returned. Note that redundant or overlapping subfeatures are NOT merged, and unexpected results may be obtained.
The sequence is returned as a simple string. If the feature is on the reverse strand, then the reverse complement sequence is automatically returned.
Pass an array of key value pairs to specify alternate coordinates if so desired. Potential keys include
- subfeature
-
Pass a text string representing the type of subfeature from which to collect the sequence. Acceptable values include
exon
cds
5p_utr
3p_utr
intron
- seq_id
- start
- end
- strand
- extend
-
Indicate additional basepairs of sequence added to both sides
- db
-
The fasta file or database from which to fetch the sequence
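The effect of the extend and strand options on the returned sequence can be illustrated with a small stand-alone sketch. The fetch_interval_sequence helper is hypothetical and reads from an in-memory string rather than a database or indexed fasta file; clamping the extended interval to the chromosome ends is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Bio::ToolBox): cut a 1-based, inclusive
# interval out of a chromosome string, reverse complementing it when the
# feature is on the reverse strand.
sub fetch_interval_sequence {
    my %args   = @_;
    my $extend = $args{extend} || 0;
    my $start  = $args{start} - $extend;
    my $end    = $args{end} + $extend;
    my $max    = length $args{chromosome};
    $start = 1    if $start < 1;       # stay within the chromosome
    $end   = $max if $end > $max;

    my $seq = substr $args{chromosome}, $start - 1, $end - $start + 1;
    if ( ( $args{strand} || 0 ) < 0 ) {
        $seq = reverse $seq;               # scalar context reverses the string
        $seq =~ tr/ACGTacgt/TGCAtgca/;     # complement each base
    }
    return $seq;
}

my $chr = 'AAACCCGGGTTT';
print fetch_interval_sequence( chromosome => $chr, start => 4, end => 6 ), "\n";
print fetch_interval_sequence(
    chromosome => $chr, start => 4, end => 6, strand => -1, extend => 1
), "\n";
```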
Data collection
The following methods allow for data collection from various sources, including bam, bigwig, bigbed, useq, Bio::DB databases, etc.
- calculate_reference($position)
-
Calculates and returns the absolute genomic coordinate for a relative reference position taking into account feature orientation (strand). This is not explicitly data collection, but often used in conjunction with such. Provide an integer representing the relative position point:
3 representing 3' end coordinate
4 representing mid point coordinate
5 representing 5' end coordinate
9 representing peak summit in narrowPeak formatted files
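The position codes can be sketched as a small strand-aware lookup. The reference_coordinate helper below is hypothetical; in particular, it assumes that a peak summit is supplied as an absolute coordinate and that the midpoint is rounded down.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Bio::ToolBox): resolve a position code
# to an absolute coordinate, taking the strand into account.
sub reference_coordinate {
    my %f        = @_;    # start, end, strand, position, optionally peak
    my $position = $f{position};
    return unless defined $position;
    my $reverse = ( $f{strand} || 0 ) < 0;

    if ( $position == 5 ) {
        return $reverse ? $f{end} : $f{start};    # 5' end
    }
    elsif ( $position == 3 ) {
        return $reverse ? $f{start} : $f{end};    # 3' end
    }
    elsif ( $position == 4 ) {
        return int( ( $f{start} + $f{end} ) / 2 );    # midpoint
    }
    elsif ( $position == 9 ) {
        return $f{peak};    # assumed to already be an absolute summit
    }
    return;
}

my %gene = ( start => 1000, end => 2000, strand => -1 );
foreach my $code ( 5, 4, 3 ) {
    print "$code => ", reference_coordinate( %gene, position => $code ), "\n";
}
```

For a reverse-strand feature the 5' end is the larger coordinate, which is why the strand has to be consulted before start or end is chosen.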
If necessary, an array or array reference may be provided as an alternative parameter with keys including position, strand, practical_start, and practical_end if alternate or adjusted coordinates should be used instead of the given row feature coordinates.
- get_score
-
my $score = $row->get_score(
    dataset => 'scores.bw',
    method => 'max',
);
This method collects a single score over the feature or interval. Usually a mathematical or statistical value is employed to derive the single score. Pass an array of key value pairs to control data collection. Keys include the following:
- db
- ddb
-
Specify a Bio::DB database from which to collect the data. The default value is the database specified in the Data table metadata, if present. Examples include a Bio::DB::SeqFeature::Store or Bio::DB::BigWigSet database.
- dataset
-
Specify the name of the dataset. If a database was specified, then this value would be the primary_tag or type:source feature found in the database. Otherwise, the name of a data file, such as a bam, bigWig, bigBed, or USeq file, would be provided here. This option is required!
- method
-
Specify the mathematical or statistical method combining multiple scores over the interval into one value. Options include the following:
mean
sum
min
max
median
count
Count all overlapping items.
pcount
Precisely count only containing (not overlapping) items.
ncount
Count overlapping unique names only.
range
The difference between minimum and maximum values.
stddev
Standard deviation.
- strandedness
-
Specify the strand from which the data should be taken, with respect to the Feature strand. Three options are available. Only really relevant for data sources that support strand.
sense
The same strand as the Feature.
antisense
The opposite strand as the Feature.
all
Strand is ignored, all is taken (default).
- subfeature
-
Specify the subfeature type from which to collect the scores. Typically a SeqFeature object representing a transcript is provided, and the indicated subfeatures are collected from the object. Pass the name of the subfeature to use. Accepted values include the following.
exon
cds
5p_utr
3p_utr
intron
- extend
-
Specify the number of basepairs that the Data table Feature's coordinates should be extended in both directions. Ignored when used with the subfeature option.
- seq_id
- chromo
- start
- end
- stop
- strand
-
Optionally specify zero or more alternate coordinates to use. By default, these are obtained from the Data table Feature.
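To show what the numeric method options amount to, here is a minimal sketch that reduces a list of already collected scores to a single value. The combine_scores helper is hypothetical and does not touch any data file; the population form of the standard deviation is an assumption, and the counting methods are omitted because they depend on the overlapping items rather than on scores.

```perl
use strict;
use warnings;
use List::Util qw(sum min max);

# Hypothetical helper (not part of Bio::ToolBox): combine scores with one
# of the numeric methods.
sub combine_scores {
    my ( $method, @scores ) = @_;
    return unless @scores;
    my $n = scalar @scores;
    if    ( $method eq 'sum' )   { return sum(@scores) }
    elsif ( $method eq 'mean' )  { return sum(@scores) / $n }
    elsif ( $method eq 'min' )   { return min(@scores) }
    elsif ( $method eq 'max' )   { return max(@scores) }
    elsif ( $method eq 'range' ) { return max(@scores) - min(@scores) }
    elsif ( $method eq 'median' ) {
        my @sorted = sort { $a <=> $b } @scores;
        my $mid    = int( $n / 2 );
        return $n % 2 ? $sorted[$mid] : ( $sorted[ $mid - 1 ] + $sorted[$mid] ) / 2;
    }
    elsif ( $method eq 'stddev' ) {
        my $mean = sum(@scores) / $n;
        my $ss   = sum( map { ( $_ - $mean )**2 } @scores );
        return sqrt( $ss / $n );    # population form, an assumption
    }
    die "unknown method '$method'\n";
}

my @scores = ( 2, 4, 4, 4, 5, 5, 7, 9 );
foreach my $method (qw(mean median min max range stddev)) {
    printf "%-6s %s\n", $method, combine_scores( $method, @scores );
}
```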
- get_relative_point_position_scores
-
while (my $row = $stream->next_row) {
    my $pos2score = $row->get_relative_point_position_scores(
        'ddb' => '/path/to/BigWigSet/',
        'dataset' => 'MyData',
        'position' => 5,
        'extend' => 1000,
    );
}
This method collects indexed position scores centered around a specific reference point. The returned data is a hash of relative positions (example -20, -10, 1, 10, 20) and their score values. Pass an array of key value pairs to control data collection. Keys include the following:
- db
- ddb
-
Specify a Bio::DB database from which to collect the data. The default value is the database specified in the Data table metadata, if present. Examples include a Bio::DB::SeqFeature::Store or Bio::DB::BigWigSet database.
- dataset
-
Specify the name of the dataset. If a database was specified, then this value would be the primary_tag or type:source feature found in the database. Otherwise, the name of a data file, such as a bam, bigWig, bigBed, or USeq file, would be provided here. This option is required!
- position
-
Indicate the position of the reference point relative to the Data table Feature. 5 is the 5' coordinate, 3 is the 3' coordinate, and 4 is the midpoint (get it? it's between 5 and 3). Default is 5.
- extend
-
Indicate the number of base pairs to extend from the reference coordinate. This option is required!
- coordinate
-
Optionally provide the real chromosomal coordinate as the reference point.
- absolute
-
Boolean option to indicate that the returned hash of positions and scores should not be transformed into relative positions but kept as absolute chromosomal coordinates.
- avoid
-
Provide a primary_tag or type:source database feature type to avoid overlapping scores. Each found score is checked for overlapping features and is discarded if found to do so. The database should be set to use this.
- strandedness
-
Specify the strand from which the data should be taken, with respect to the Feature strand. Three options are available. Only really relevant for data sources that support strand.
sense
The same strand as the Feature.
antisense
The opposite strand as the Feature.
all
Strand is ignored, all is taken (default).
- method
-
Only required when counting objects.
count
Count all overlapping items.
pcount
Precisely count only containing (not overlapping) items.
ncount
Count overlapping unique names only.
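The difference between absolute and relative positions in the returned hash can be sketched as follows. The to_relative_positions helper is hypothetical; it reports the reference point itself as 0, which is an assumption, and the module may number the reference position differently.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Bio::ToolBox): convert absolute positions
# to positions relative to a reference coordinate, flipping the direction
# for reverse-strand features.
sub to_relative_positions {
    my ( $reference, $strand, $pos2score ) = @_;
    my %relative;
    foreach my $position ( keys %{$pos2score} ) {
        my $offset =
            $strand < 0
            ? $reference - $position
            : $position - $reference;
        $relative{$offset} = $pos2score->{$position};
    }
    return \%relative;
}

my %absolute = ( 990 => 1.5, 1000 => 3.0, 1010 => 2.5 );
my $forward  = to_relative_positions( 1000, 1,  \%absolute );
my $reverse  = to_relative_positions( 1000, -1, \%absolute );
foreach my $offset ( sort { $a <=> $b } keys %{$forward} ) {
    print "$offset\t$forward->{$offset}\t$reverse->{$offset}\n";
}
```

On the reverse strand, a score that lies at a higher chromosomal coordinate than the reference ends up at a negative relative position, that is, upstream of the reference point.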
- get_region_position_scores
-
while (my $row = $stream->next_row) {
    my $pos2score = $row->get_region_position_scores(
        'ddb' => '/path/to/BigWigSet/',
        'dataset' => 'MyData',
        'position' => 5,
        'extend' => 1000,
    );
}
This method collects indexed position scores across a defined region or interval. The returned data is a hash of positions and their score values. The positions are by default relative to a region coordinate, usually to the 5' end. Pass an array of key value pairs to control data collection. Keys include the following:
- db
- ddb
-
Specify a Bio::DB database from which to collect the data. The default value is the database specified in the Data table metadata, if present. Examples include a Bio::DB::SeqFeature::Store or Bio::DB::BigWigSet database.
- dataset
-
Specify the name of the dataset. If a database was specified, then this value would be the primary_tag or type:source feature found in the database. Otherwise, the name of a data file, such as a bam, bigWig, bigBed, or USeq file, would be provided here. This option is required!
- subfeature
-
Specify the subfeature type from which to collect the scores. Typically a SeqFeature object representing a transcript is provided, and the indicated subfeatures are collected from the object. When converting to relative coordinates, the coordinates will be relative to the length of the sum of the subfeatures, i.e. the length of the introns will be ignored.
Pass the name of the subfeature to use. Accepted values include the following.
exon
cds
5p_utr
3p_utr
intron
- extend
-
Specify the number of basepairs that the Data table Feature's coordinates should be extended in both directions.
- seq_id
- chromo
- start
- end
- stop
- strand
-
Optionally specify zero or more alternate coordinates to use. By default, these are obtained from the Data table Feature.
- position
-
Indicate the position of the reference point relative to the Data table Feature. 5 is the 5' coordinate, 3 is the 3' coordinate, and 4 is the midpoint (get it? it's between 5 and 3). Default is 5.
- coordinate
-
Optionally provide the real chromosomal coordinate as the reference point.
- absolute
-
Boolean option to indicate that the returned hash of positions and scores should not be transformed into relative positions but kept as absolute chromosomal coordinates.
- avoid
-
Provide a primary_tag or type:source database feature type to avoid overlapping scores. Each found score is checked for overlapping features and is discarded if found to do so. The database should be set to use this.
- strandedness
-
Specify the strand from which the data should be taken, with respect to the Feature strand. Three options are available. Only really relevant for data sources that support strand.
sense
The same strand as the Feature.
antisense
The opposite strand as the Feature.
all
Strand is ignored, all is taken (default).
- method
-
Only required when counting objects.
count
Count all overlapping items.
pcount
Precisely count only containing (not overlapping) items.
ncount
Count overlapping unique names only.
- fetch_alignments
-
my $sam = $Data->open_database('/path/to/file.bam');
my $alignment_data = { mapq => [] };
my $callback = sub {
    my ($a, $data) = @_;
    push @{ $data->{mapq} }, $a->qual;
};
while (my $row = $stream->next_row) {
    $row->fetch_alignments(
        'db' => $sam,
        'data' => $alignment_data,
        'callback' => $callback,
    );
}
This function allows you to iterate over alignments in a Bam file, allowing custom information to be collected based on a callback code reference that is provided.
Three parameters are required: db, data, and callback. A true value (1) is returned upon success.
- db
-
Provide an opened, high-level, Bam database object.
- callback
-
Provide a code callback reference to use when iterating over the alignments. Two objects are passed to this code function: the alignment object and the data structure that is provided. See the Bam adapter documentation for details on low-level fetch through the Bam index object.
- data
-
This is a reference to a HASH data object for storing information. It is passed to the callback function along with the alignment. Three new key => value pairs are automatically added: start, end, and strand. These correspond to the values for the current queried interval. Coordinates are automatically transformed to 0-base coordinate system to match low level alignment objects.
- subfeature
-
If the feature has subfeatures, such as exons, introns, etc., pass the name of the subfeature to restrict iteration only over the indicated subfeatures. The data object will inherit the coordinates for each subfeature. Allowed subfeatures include the following:
exon
cds
5p_utr
3p_utr
intron
- start
- stop
- end
-
Provide alternate, custom start and stop coordinates for the row feature. Ignored with subfeatures.
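The callback pattern used by fetch_alignments can be tried out without a Bam file using the following stand-alone sketch. The iterate_alignments function and the plain hashes standing in for alignment objects are hypothetical; real alignment objects come from the Bam adapter and expose methods rather than hash keys. Treating the interval as 0-based and half-open is also an assumption.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the iteration: each alignment overlapping the
# interval is handed to the callback together with the shared data hash.
sub iterate_alignments {
    my %args = @_;
    my ( $data, $callback ) = @args{qw(data callback)};

    # interval details are added to the data hash, with a 0-based start
    $data->{start}  = $args{start} - 1;
    $data->{end}    = $args{end};
    $data->{strand} = $args{strand};

    foreach my $aln ( @{ $args{alignments} } ) {
        next if $aln->{end} <= $data->{start} or $aln->{start} >= $data->{end};
        $callback->( $aln, $data );
    }
    return 1;
}

# mock alignments with 0-based, half-open coordinates
my @alignments = (
    { start => 90,  end => 140, mapq => 60 },
    { start => 150, end => 200, mapq => 20 },
    { start => 300, end => 350, mapq => 40 },
);
my $data     = { mapq => [] };
my $callback = sub {
    my ( $aln, $store ) = @_;
    push @{ $store->{mapq} }, $aln->{mapq};
};
iterate_alignments(
    alignments => \@alignments,
    start      => 101,
    end        => 200,
    strand     => 1,
    data       => $data,
    callback   => $callback,
);
print "collected MAPQ values: @{ $data->{mapq} }\n";
```

Because the same data hash is passed on every call, the callback can accumulate values across all rows of a table.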
Feature Export
These methods allow the feature to be exported in industry standard formats, including the BED format and the GFF format. Both methods return a formatted tab-delimited text string suitable for printing to file. The string does not include a line ending character.
These methods rely on coordinates being present in the source table. If the row feature represents a database item, the "feature" method should be called prior to these methods, allowing the feature to be retrieved from the database and coordinates obtained.
- bed_string
-
Returns a BED formatted string. By default, a 6-element string is generated, unless otherwise specified. Pass an array of key values to control how the string is generated. The following arguments are supported.
- bed
-
Specify the number of BED elements to include. The number of elements corresponds to the number of columns in the BED file specification. A minimum of 3 (chromosome, start, stop) is required, and a maximum of 6 is allowed (chromosome, start, stop, name, score, strand).
- chromo
- seq_id
-
Provide a text string of an alternative chromosome or sequence name.
- start
- stop
- end
-
Provide alternative integers for the start and stop coordinates. Note that start values are automatically converted to 0-base by subtracting 1.
- strand
-
Provide an alternative strand value.
- name
-
Provide an alternate or missing name value to be used as text in the 4th column. If no name is provided or available, a default name is generated.
- score
-
Provide a numerical value to be included as the score. BED files typically use integer values ranging from 0 to 1000.
- gff_string
-
Returns a GFF3 formatted string. Pass an array of key values to control how the string is generated. The following arguments are supported.
- chromo
- seq_id
- start
- stop
- end
- strand
-
Provide alternate values from those defined or missing in the current row Feature.
- source
-
Provide a text string to be used as the source_tag value in the 2nd column. The default value is null ".".
- primary_tag
-
Provide a text string to be used as the primary_tag value in the 3rd column. The default value is null ".".
- type
-
Provide a text string. This can be either a "primary_tag:source_tag" value as used by GFF based BioPerl databases, or "primary_tag" alone.
- score
-
Provide a numerical value to be included as the score. The default value is null ".".
- name
-
Provide an alternate or missing name value to be used as the display_name. If no name is provided or available, a default name is generated.
- attributes
-
Provide an anonymous array reference of one or more row Feature indices to be used as GFF attributes. The name of the column is used as the GFF attribute key.
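As an illustration of the coordinate shift and column count handling described for bed_string, here is a minimal stand-alone sketch. The format_bed helper is hypothetical; the default name (a coordinate string), the default score (0), and the "." used for an unstranded feature are assumptions and may not match what the module generates.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Bio::ToolBox): format 1-based coordinates
# as a BED line of 3 to 6 columns, without a line ending.
sub format_bed {
    my %f       = @_;
    my $columns = $f{bed} || 6;
    die "bed must be between 3 and 6\n" if $columns < 3 or $columns > 6;

    my $s      = $f{strand} || 0;
    my $strand = $s < 0 ? '-' : $s > 0 ? '+' : '.';
    my @fields = (
        $f{seq_id},
        $f{start} - 1,    # BED starts are 0-based
        $f{end},
        defined $f{name}  ? $f{name}  : "$f{seq_id}:$f{start}-$f{end}",
        defined $f{score} ? $f{score} : 0,
        $strand,
    );
    return join "\t", @fields[ 0 .. $columns - 1 ];
}

print format_bed(
    seq_id => 'chr1', start => 100, end => 200, name => 'geneA', strand => -1
), "\n";
print format_bed( seq_id => 'chr1', start => 100, end => 200, bed => 3 ), "\n";
```

The caller appends the line ending when printing, matching the statement that the returned string does not include one.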
AUTHOR
Timothy J. Parnell, PhD
Dept of Oncological Sciences
Huntsman Cancer Institute
University of Utah
Salt Lake City, UT, 84112
This package is free software; you can redistribute it and/or modify it under the terms of the Artistic License 2.0.