NAME
Bio::DB::BigFile -- Low-level interface to BigWig & BigBed files
SYNOPSIS
use Bio::DB::BigFile;
use Bio::DB::BigFile::Constants;
my $wig = Bio::DB::BigFile->bigWigFileOpen('hg18_methylcytosine.bw');
# query each of the intervals (fixed or variable step values)
my $intervals = $wig->bigWigIntervalQuery('chr1',5_000_000 => 8_000_000);
for (my $i=$intervals->head;$i;$i=$i->next) {
my $start = $i->start;
my $end = $i->end;
my $val = $i->val;
}
# get 500 bins of statistical summary data
my $summary = $bigWigSummaryArray('chr1',5_000_000=>8_000_000,bbiSumMean,500);
for (my $i=0;$i<$bin;$i++) {
print "bin $i: ",$summary->[$i],"\n";
}
# get 500 bins of extended summary data
my $summary_e = $bigWigSummaryArrayExtended('chr1',5_000_000=>8_000_000,500);
for (my $i=0;$i<$bin;$i++) {
my $s = $summary_e->[$i];
print "bin $i: min=$s->{minVal} max=$s->{maxVal} sum=$s->{sumData}\n";
}
# single summary over a bin
my $mean = $wig->bigWigSingleSummary('chr1',5_000_000=>6_000_000,bbiSumMean);
DESCRIPTION
This module provides a low-level interface to Jim Kent's BigWig and BigBed files, which are indexed genome feature databases that can be randomly accessed across the network. Please see http://genome.ucsc.edu/FAQ/FAQformat.html for information about creating these files.
For the high-level interface, please see Bio::DB::BigWig and Bio::DB::BigBed.
INSTALLATION
Installation requires a compiled version of Jim Kent's source tree, including the main library, jkweb.a. Please see the README in the Bio::DB::BigFile distribution directory for instructions.
CLASS METHODS
Please note that all genomic coordinates consumed or returned by this module are zero-based half-open intervals. This is not true of the "high level" interfaces.
- $wig = Bio::DB::BigFile->bigWigFileOpen('/path/to/file.bw');
-
Open a preexisting BigWig file and return its object handle. The returned object object will be of type Bio::DB::bbiFile.
- $bed = Bio::DB::BigFile->bigBedFileOpen('/path/to/file.bb');
-
Open a preexisting BigBed file and return its object handle. The returned object object will be of type Bio::DB::bbiFile.
- Bio::DB::BigFile->createBigWig($infile,$chrom_sizes,$outfile,$args)
-
Create a BigWig file from a text .wig file (without track definition lines). The arguments are identical to those used by the UCSC wigToBigWig utility.
$infile is the path to the input .wig file.
$chrom_sizes points to a file of chromosome sizes formatted in two whitespace-separated columns consisting of chromosome name and size.
$outfile is a path to the BigWig file you wish to create.
$args is a hash reference containing the following options:
Option Value Default ------ ----- ------- blockSize Record block size 1024 itemsPerSlot Record batching 512 clipDontDie If values are given 1 that fall outside chromosome boundaries then warn, but don't exit. compress Compress the BigWig 1 file to save space.
If an exception occurs (for example, the output location is not writable), then this method will terminate the process. The exception cannot be caught by an eval {}.
Note that there is no equivalent method for creating BigBed files, and this method may also be deprecated in the future. Jim Kent recommends using the wigToBigWig and bedToBigBed command-line utilities instead.
- Bio::DB::BigFile->udcSetDefaultDir('/path/')
-
When the BigWig/BigBed library accesses remote Big{Wig,Bed} files, it creates a series of cache files located in /tmp/udcCache by default. To change the location of the cache files, call this method, passing it the path to the preferred directory.
- $path = Bio::DB::BigFile->udcGetDefaultDir()
-
This class method returns the current UDC cache default directory.
OBJECT METHODS
Once a Bio::DB::bbiFile object is created, you can query it using the methods described in this section.
Please note that all genomic coordinates consumed or returned by this module are zero-based half-open intervals. This is not true of the "high level" interfaces.
BigWig File Methods
- $wig->bigWigIntervalDump($seqid,$start,$end [,$max,$fh])
-
For the indicated region (chromosome,start,end), convert the BigWig file into text WIG format and write it to standard output. If $max is provided, then limit the dump to $max values. If $fh is provided, then write to the indicated file handle. Note that only real filehandles work: tied filehandles such as IO::String will cause a core dump.
- $chromosome_list = $wig->chromList()
-
Return the head to a linked list of chromosomes known to the BigWig file. The head of the list has one method named head() which returns the first Bio::DB::ChromInfo object. Each ChromInfo object has the following methods:
next() Return the next ChromInfo in the list, or undef if this is the last element in the list. name() Return the name of the chromosome. id() Return the ID of the chromosome (usually a small integer). size() Return the size of the chromosome.
For example, to iterate over all chromosomes known to the BigWig:
my $list = $wig->chromList(); my $next = $list->head; while ($next) { print $next->name,": ",$next->size,"\n"; $next = $next->next; }
Do not undef the list object while you are still iterating through its chromInfo objects.
- $size = $wig->chromSize('chr1')
-
Return the size of a single named chromosome, or undef if there is no chromosome of this size.
- $interval_head = $wig->bigWigIntervalQuery($chrom,$start,$end)
-
For the region indicated by the chromosome name, start and end, return the head of a linked list of Bio::DB::bbiInterval objects for which there is wig file data. Each interval corresponds to a single data line in the original WIG file.
The head of the list has one method named head(), which returns the first Bio::DB::bbiInterval object. Each object has the following methods:
next() Return the next bbiInterval object in the list, or undef if this is the last element in the list. start() The start of this interval end() The end of this interval value() The numeric value of this interval
For example, to iterate over all intervals on the first megabase of chromosome 3:
my $list = $wig->bigWigIntervalQuery('chr3',0=>1_000_000); my $next = $list->head; while ($next) { print $next->start,"..",$next->end,": ",$next->val,"\n"; $next = $next->next; }
Do not undef the list head object while you are still iterating through its elements.
- $summaryarray = $wig->bigWigSummaryArray($chrom,$start,$end,$operation,$bins)
-
For the region indicated by $chrom, $start and $end, divide the interval into $bins subregions and compute summary information according to $operation. The result is returned in an array reference of $bins elements in length.
The operation is one of the following, defined in Bio::DB::BigFile::Constants:
Constant Operation -------- --------- bbiSumMean The mean value of all intervals in the bin. bbiSumMax The maximum value of all intervals in the bin. bbiSumMin The minimum value of all intervals in the bin. bbiSumCoverage The count of all intervals in the bin. bbiSumStandardDeviation The standard deviation of all intervals in the bin.
For example, to divide the first megabase of chromosome 3 into 100 bins and find the mean value of the intervals in each bin:
my $bins = $wig->bigWigSummaryArray('chr3',0=>1_000_000,bbiSumMean,100); for my $value (@$bins) { print $value,"\n"; }
If the interval is invalid, returns undef.
- $value = $wig->bigWigSingleSummary($chrom,$start,$end,$operation)
-
Return statistical summary information about a single interval. $operation corresponds to one of the constants described in bigWigSummaryArray().
- $summaryarray=$wig->bigWigSummaryArrayExtended($chrom,$start,$end,$bins)
-
This method is similar to bigWigSummaryArray(), except that instead of returning an arrayref of numeric values, the returned arrayref points to a list of hashes describing the contents of each bin. Hash keys are the following:
Key Value --- --------- validCount Number of intervals in the bin maxVal Maximum value in the bin minVal Minimum value in the bin sumData Sum of the intervals in the bin sumSquares Sum of the squares of the intervals in the bin
sumData and sumSquares can be used to compute the mean and standard deviation of the bin, and to compute these values when multiple bins are combined.
For example, to calculate the means of 100 bins across an interval:
my $bins = $wig->bigWigSummaryArrayExtended('chr3',0=>1_000_000,100); for (my $i=0;$i<@$bins;$i++) { my $mean = $bins->[$i]{sumData}/$bins->[$i]{validCount}; }
- $summaryobj=$wig->bigWigSummary($chrom,$start,$end,$bins)
-
This is similar to the previous method, except that it returns a summary object rather than an arrayref. This object, of type Bio::DB::bbiExtendedSummary, has the following methods:
$summaryobj->size() Number of bins in the summary. $summaryobj->validCount($bin) Count of intervals in bin $bin. $summaryobj->minVal($bin) Minimum value in bin $bin. $summaryobj->maxVal($bin) Maximum value in bin $bin. $summaryobj->sumData($bin) Sum of the values in bin $bin. $summaryobj->sumSquares($bin) Sum of the squares of the values in bin $bin
This method may be slightly more memory-efficient than bigWigSummaryArrayExtended.
- $arrayref = $wig->bigWigBinStats($chrom,$start,$end,$bins)
-
This is similar to the previous two methods, but returns a reference to an array of objects with vaidCount(), minVal(), maxVal(), sumData() and sumSquares() methods.
Example:
my $bins = $wig->bigWigBinStats('chr3',0=>1_000_000,100); for (my $i=0;$i<@$bins;$i++) { my $mean = $bins->[$i]->sumData()/$bins->[$i]->validCount(); }
This method is about 30% slower than the previous methods, and may be deprecated in the future.
BigBed File Methods
These methods apply to previously opened BigBed files.
- $count = $bed->bigBedItemCount()
-
Returns the number of items in the BigBed file.
- $chromosome_list = $bed->chromList()
-
This is identical to the BigWig chromList() method and returns an object that points to a linked list of chromosome information objects.
- $list_head = $bed->bigBedIntervalQuery($chrom,$start,$end [,$max])
-
For the indicated interval, return the head to a linked list of BigBed interval objects (Bio::DB::BigBedInterval). $max specifies the maximum number of items to return; unlimited if absent or 0. The head object has a single method named head() that returns the first interval object. Each interval object has the following methods:
next() Return the next interval in the list start() Start of this interval end() End of this interval rest() Return a string corresponding to all of the BED fields following the end field. This will be whitespace-delimited, but otherwise unparsed.
Here is a simple bigBedToBed file dumper:
my $chroms = $bed->chromList; for (my $c = $chroms->head; $c; $c=$c->next) { dump_chrom($c); } sub dump_chrom { my $chrom = shift; my $name = $chrom->name; my $size = $chrom->size; my $intervals = $bed->bigBedIntervalQuery($name,0,$size); for (my $i=$intervals->head;$i;$i=$i->next) { print join("\t",$name,$i->start,$i->end,$i->rest),"\n"; } }
- $summaryarray = $wig->bigBedSummaryArray($chrom,$start,$end,$operation,$bins)
-
For the region indicated by $chrom, $start and $end, divide the interval into $bins subregions and compute summary information according to $operation. The result is returned in an array reference of $bins elements in length.
The operation is one of the following, defined in Bio::DB::BigFile::Constants:
Constant Operation -------- --------- bbiSumMean The mean value of all intervals in the bin. bbiSumMax The maximum value of all intervals in the bin. bbiSumMin The minimum value of all intervals in the bin. bbiSumCoverage The count of all intervals in the bin. bbiSumStandardDeviation The standard deviation of all intervals in the bin.
For example, to divide the first megabase of chromosome 3 into 100 bins and find the mean value of the intervals in each bin:
my $bins = $wig->bigBedSummaryArray('chr3',0=>1_000_000,bbiSumMean,100); for my $value (@$bins) { print $value,"\n"; }
If the interval is invalid, returns undef.
- $summaryarray=$wig->bigBedSummaryArrayExtended($chrom,$start,$end,$bins)
-
This method is similar to bigBedSummaryArray(), except that instead of returning an arrayref of numeric values, the returned arrayref points to a list of hashes describing the contents of each bin. Hash keys are the following:
Key Value --- --------- validCount Number of intervals in the bin maxVal Maximum value in the bin minVal Minimum value in the bin sumData Sum of the intervals in the bin sumSquares Sum of the squares of the intervals in the bin
sumData and sumSquares can be used to compute the mean and standard deviation of the bin, and to compute these values when multiple bins are combined.
For example, to calculate the means of 100 bins across an interval:
my $bins = $wig->bigBedSummaryArrayExtended('chr3',0=>1_000_000,100); for (my $i=0;$i<@$bins;$i++) { my $mean = $bins->[$i]{sumData}/$bins->[$i]{validCount}; }
- $summaryobj=$wig->bigBedSummary($chrom,$start,$end,$bins)
-
This is similar to the previous method, except that it returns a summary object rather than an arrayref. This object, of type Bio::DB::bbiExtendedSummary, has the following methods:
$summaryobj->size() Number of bins in the summary. $summaryobj->validCount($bin) Count of intervals in bin $bin. $summaryobj->minVal($bin) Minimum value in bin $bin. $summaryobj->maxVal($bin) Maximum value in bin $bin. $summaryobj->sumData($bin) Sum of the values in bin $bin. $summaryobj->sumSquares($bin) Sum of the squares of the values in bin $bin
This method may be slightly more memory-efficient than bigBedSummaryArrayExtended.
- $sql = $bed->bigBedAutoSqlText()
-
Return the autoSQL text associated with this BigBed file, if any.
- $as_object = $bed->bigBedAs()
-
Return a parsed object corresponding to the AutoSQL data. See AutoSQL Methods for a description of what you can do with this object.
AutoSQL Methods
The bigBedAs() method returns a parsed AutoSQL definition object of type Bio::DB::asObject. A full description of this object is beyond the scope of this document; please see the Jim Kent include file asParse.h for definitions of the various objects and methods that are not discussed in detail.
Bio::DB::asObject
This corresponds to a SQL table and its linked C definition.
- $asObject = $as->next
-
Return the next asObject in a linked list.
- $string = $as->name
-
Return the name of the object.
- $string = $as->comment
-
Return the comment for the object.
- $bool = $as->isTable
-
Return true if the object is a SQL table.
- $bool = $as->isSimple
-
Return true if the object is a simple object.
- $column_list = $as->columnList
-
Returns a linked list of autosql object columns.
Bio::DB::asColumn
This corresponds to a column in a SQL table and its corresponding C struct field.
- $as_column = $ac->next
-
Return the next column in the linked list.
- $string = $ac->name
-
Column name.
- $string = $ac->comment
-
Column comment.
- $as_column_type = $ac->lowType
-
Column type, a Bio::DB::asTypeInfo object.
- $string = $ac->obName
- $string = $ac->obType
- $as_column = $ac->linkedSize
- $int = $ac->fixedSize
- $string = $ac->linkedSizeName
- $bool = $ac->isList
- $bool = $ac->isArray
-
Not documented here.
Bio::DB::asTypeInfo
This corresponds to the SQL and C struct types of an autosql column.
- $int = $ati->type
-
Numeric ID of this type.
- $string = $ati->name
-
AutoSQL name for the type.
- $string = $ati->sqlName
-
SQL name for the type.
- $string = $ati->cName
-
C struct name for the type.
- $bool = $ati->isUnsigned
- $bool = $ati->stringy
- $bool = $ati->listyName
- $bool = $ati->nummyName
- $bool = $ati->outFormat
-
Not documented here.
SEE ALSO
Bio::Perl, Bio::Graphics, Bio::Graphics::Browser2
AUTHOR
Lincoln Stein <lincoln.stein@oicr.on.ca>. <lincoln.stein@bmail.com>
Copyright (c) 2010 Ontario Institute for Cancer Research.
This package and its accompanying libraries is free software; you can redistribute it and/or modify it under the terms of the GPL (either version 1, or at your option, any later version) or the Artistic License 2.0. Refer to LICENSE for the full license text. In addition, please see DISCLAIMER.txt for disclaimers of warranty.