NAME
Bio::ToolBox::Parser::bed - Parser for BED-style formats
SYNOPSIS
use Bio::ToolBox::Parser;
my $filename = 'file.bed';
my $Parser = Bio::ToolBox::Parser->new(
file => $filename,
) or die "unable to open bed file!\n";
# the Parser will taste the file and open the appropriate
# subclass parser, bed in this case
while (my $feature = $Parser->next_top_feature() ) {
# each $feature is a parent SeqFeature object
printf "%s:%d-%d\n", $feature->seq_id, $feature->start, $feature->end;
}
DESCRIPTION
This is the parser subclass specific to BED-style formats. It inherits generic methods from the parent Bio::ToolBox::Parser class. Supported file formats include the following.
- Bed
-
Bed files may have 3-12 columns, where the first 3-6 columns are basic information about the feature itself, and columns 7-12 are usually for defining subfeatures of a transcript model, including exons, UTRs (thin portions), and CDS (thick portions) subfeatures. This parser will parse these extra fields as appropriate into subfeature SeqFeature objects. Bed files are recognized with the file extension .bed.
- Bedgraph
-
BedGraph files are a type of wiggle format in Bed format, where the 4th column is a score instead of a name. BedGraph files are recognized by the file extension .bedgraph or .bdg.
- narrowPeak
-
narrowPeak files are a specialized Encode variant of bed files with 10 columns (typically denoted as bed6+4), where the extra 4 fields represent score attributes of a narrow ChIPSeq peak. These files are parsed as a typical bed6 file, and the extra four fields are assigned to the SeqFeature attribute tags signalValue, pValue, qValue, and peak, respectively. NarrowPeak files are recognized by the file extension .narrowPeak.
- broadPeak
-
broadPeak files, like narrowPeak, are an Encode variant with 9 columns (bed6+3) representing a broad or extended interval of ChIP enrichment without a single "peak". The extra three fields are assigned to the SeqFeature attribute tags signalValue, pValue, and qValue, respectively. BroadPeak files are recognized by the file extension .broadPeak.
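The column-to-tag assignment described for narrowPeak files can be sketched in plain Perl. This is an illustration only and not the module's own code: the helper name parse_narrowpeak_line and the hash layout are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Bio::ToolBox: split one narrowPeak line
# into the bed6 fields plus the four extra attribute tags.
sub parse_narrowpeak_line {
    my ($line) = @_;
    chomp $line;
    my @col = split /\t/, $line;
    return unless @col == 10;    # narrowPeak is bed6+4
    my %feature = (
        seq_id => $col[0],
        start  => $col[1] + 1,   # 0-based bed start becomes 1-based
        end    => $col[2],
        name   => $col[3],
        score  => $col[4],
        strand => $col[5],
    );
    # columns 7-10 map onto the attribute tags, in order
    @{ $feature{attributes} }{qw(signalValue pValue qValue peak)} = @col[ 6 .. 9 ];
    return \%feature;
}

my $peak = parse_narrowpeak_line(
    join( "\t", 'chr1', 1000, 1500, 'peak1', 850, '.', 12.5, 8.2, 6.1, 230 ) . "\n" );
printf "%s:%d-%d signalValue=%s peak=%s\n",
    @{$peak}{qw(seq_id start end)},
    @{ $peak->{attributes} }{qw(signalValue peak)};
# prints "chr1:1001-1500 signalValue=12.5 peak=230"
```

A broadPeak line would follow the same pattern with 9 columns and only the signalValue, pValue, and qValue tags.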
Track and Browser lines are generally ignored, although a track definition line containing a type key will be interpreted if it matches one of the above file types.
SeqFeature default values
The SeqFeature objects built from the bed file intervals will have some inferred defaults.
- Coordinate system
-
SeqFeature objects use the 1-based coordinate system, per the specification of Bio::SeqFeatureI, so the 0-based start coordinates of bed files will always be parsed into 1-based coordinates.
- display_name
-
SeqFeature objects will use the name field (4th column in bed files), if present, as the display_name. The SeqFeature object should default to the primary_id if a name was not provided.
- primary_id
-
SeqFeature objects will use a concatenation of the sequence ID, start (original 0-based), and stop coordinates as the primary_id, for example 'chr1:0-100'.
- primary_tag
-
Bed files don't have an inherent attribute of feature type (they are all the same type), so a default primary_tag is assigned based on the file type. For peak files (narrowPeak and broadPeak) this is peak, for gappedPeak this is gappedPeak and peak (subfeatures), and for bed12 files with transcript models, the transcripts will be set to either mRNA or ncRNA, depending on the presence of interpreted CDS start and stop (thick coordinates).
- source_tag
-
Bed files don't have a concept of a source; the default is an empty string.
- Attribute tags
-
Extra columns in the narrowPeak and broadPeak formats are assigned to attribute tags as described above. The rgb values set in bed12 files are also set to an attribute tag.
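The coordinate, primary_id, and display_name defaults described above can be illustrated with a short stand-alone sketch. The helper bed_defaults and its hash keys are hypothetical and only mirror the rules as stated here; the module's internals may differ.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration only: apply the documented defaults
# to the first columns of a bed line.
sub bed_defaults {
    my ( $chrom, $start0, $end, $name ) = @_;

    # primary_id keeps the original 0-based start, e.g. 'chr1:0-100'
    my $primary_id = sprintf "%s:%d-%d", $chrom, $start0, $end;
    return {
        seq_id       => $chrom,
        start        => $start0 + 1,    # convert to 1-based start
        end          => $end,           # half-open bed end equals 1-based end
        primary_id   => $primary_id,
        display_name => defined $name ? $name : $primary_id,
    };
}

my $f = bed_defaults( 'chr1', 0, 100 );
printf "%s %d-%d (%s)\n", @{$f}{qw(primary_id start end display_name)};
# prints "chr1:0-100 1-100 (chr1:0-100)"
```

Only the start coordinate shifts, because the end of a 0-based half-open bed interval is numerically the same as the end of the 1-based inclusive interval.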
METHODS
Initializing the parser object
In most cases, users should initialize an object using the generic Bio::ToolBox::Parser object.
These are class methods to initialize the parser with an annotation file and modify the parsing behavior. Most parameters can be set either upon initialization or as class methods on the object. Unpredictable behavior may occur if you implement these in the midst of parsing a file.
Do not open subsequent files with the same object. Always create a new object to parse a new file.
- new
-
my $parser = Bio::ToolBox::Parser::bed->new($filename);
my $parser = Bio::ToolBox::Parser::bed->new(
    file    => 'file.bed',
    do_gene => 1,
    do_cds  => 1,
);
Initiate a new Bed file parser object. Pass a single value (the bed file name) to open the file for parsing. Alternatively, pass an array of key value pairs to control how the table is parsed. These options are primarily for parsing bed12 files with subfeatures. Options include the following.
- file
-
Provide the path and file name for a Bed file. The file may be gzip compressed.
- source
-
Pass a string to be added as the source tag value of the SeqFeature objects.
- do_exon
- do_cds
- do_utr
- do_codon
-
For Bed12 formats that represent transcripts, pass a boolean (1 or 0) value to parse certain subfeatures, including exon, CDS, five_prime_UTR, three_prime_UTR, stop_codon, and start_codon features. Default is false.
- class
-
Pass the name of a Bio::SeqFeatureI compliant class that will be used to create the SeqFeature objects. The default is to use Bio::ToolBox::SeqFeature, which is lighter-weight and consumes less memory. A suitable BioPerl alternative is Bio::SeqFeature::Lite.
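To show what the subfeature options above operate on, the following stand-alone sketch derives exon coordinates from the block columns of a bed12 line and picks a transcript type from the thick coordinates. It is not the module's implementation: the helper bed12_exons is hypothetical, and treating a non-empty thick region as the sign of a coding transcript is an assumption of this example.

```perl
use strict;
use warnings;

# Hypothetical helper, for illustration only.
sub bed12_exons {
    my ($line) = @_;
    chomp $line;
    my @col = split /\t/, $line;
    return unless @col == 12;
    my ( $start0, $thick_start, $thick_end ) = @col[ 1, 6, 7 ];
    my @sizes  = split /,/, $col[10];    # a trailing comma is dropped by split
    my @starts = split /,/, $col[11];    # block starts are relative to column 2
    my @exons;
    for my $i ( 0 .. $col[9] - 1 ) {
        push @exons, [
            $start0 + $starts[$i] + 1,             # 1-based exon start
            $start0 + $starts[$i] + $sizes[$i],    # exon end
        ];
    }

    # Assumption: a non-empty thick region marks a coding transcript
    my $type = $thick_end > $thick_start ? 'mRNA' : 'ncRNA';
    return ( $type, \@exons );
}

my $line = join "\t", 'chr1', 1000, 2000, 'tx1', 0, '+', 1100, 1900, '0,0,0', 2,
    '300,400,', '0,600,';
my ( $type, $exons ) = bed12_exons($line);
print "$type: ", join( ', ', map { "$_->[0]-$_->[1]" } @$exons ), "\n";
# prints "mRNA: 1001-1300, 1601-2000"
```

CDS and UTR subfeatures could be derived in the same spirit by intersecting each exon with the thick interval.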
Other methods
Additional methods for working with the parser object and the parsed SeqFeature objects.
- typelist
-
Returns a comma-delimited string of the SeqFeature types expected in the file. Currently this returns generic strings: 'mRNA,ncRNA,exon,CDS' for bed12 and 'feature' for everything else.
SEE ALSO
Bio::ToolBox::Parser, Bio::ToolBox::SeqFeature
AUTHOR
Timothy J. Parnell, PhD
Huntsman Cancer Institute
University of Utah
Salt Lake City, UT, 84112
This package is free software; you can redistribute it and/or modify it under the terms of the Artistic License 2.0.