NAME
Bio::ViennaNGS::Util - Utility routines for Next-Generation Sequencing data analysis
SYNOPSIS
use Bio::ViennaNGS::Util;
# make bigWig from BED or BAM
$type = "bam";
$strand = "+";
$bwfile = bed_or_bam2bw($type,$infile,$cs_in,$strand,$destdir,$wantnorm,$size_p,$scale,$logfile);
# make bigBed from BED
my $bb = bed2bigBed($bed_in,$cs_in,$destdir,$logfile);
# sort a BED file
sortbed($bed_in,$destdir,$bed_out,$rm_orig,$logfile)
DESCRIPTION
Bio::ViennaNGS::Util is a collection of utility subroutines for building efficient Next-Generation Sequencing (NGS) data analysis pipelines.
ROUTINES
- bed_or_bam2bw($type,$infile,$chromsizes,$strand,$dest,$want_norm,$size,$scale,$log)
-
Creates stranded, normalized BigWig coverage profiles from strand-specific BAM or BED files (provided via
$infile
). The routine expects a file type 'bam' or 'bed' via the$type
argument.$chromsizes
is the chromosome.sizes files,$strand
is either "+" or "-" and$dest
contains the output path for results. For normlization of bigWig profiles, additional attributes are required:$want_norm
triggers normalization with values 0 or 1.$size
is the number of fragments/elements in the BAM or BED file and$scale
gives the number to which data is normalized (ie. every bedGraph entry is multiplied by a factor ($scale
/$size
).$log
is expected to contain either the full path and file name of log file or 'undef'. The routine returns the full file name of the newly generated bigWig file.While this routine can handle non-straned BAM/BED files (in which case
$strand
should be set to "+" and hence all coverage profiles will be created with a positive sign, even if they map to the negative strand), usage of strand-specific data is highly recommended. For BAM file, this is easily achieved by calling the bam_split routine (see above) prior to this one, thus creating dedicated BAM files containing exclusively reads mapped to the positive or negative strand, respectively.It is important to know that this routine does not extract reads mapped to either strand from a non-stranded BAM/BED file if the
$strand
argument is given. It rather adjusts the sign of all mapped reads/features in a BAM/BED file and then creates bigWig files. See the split_bam routine for extracting reads mapped to either strand.Stranded bigWigs can easily be visualized via TrackHubs in the UCSC Genome Browser. Internally, the conversion from BAM/BED to bigWig is accomplished via two third-party applications: genomeCoverageBed and bedGraphToBigWig. Intermediate bedGraph files are removed automatically once the bigWig files are ready.
- sortbed($infile,$dest,$outfile,$rm_orig,$log)
-
Sorts BED file
$infile
with bedtools sort.$dest
andoutfile
name path and filename of the resulting sorted BED file.$rm_infile
is either 1 or 0 and indicated whether the original$infile
should be deleted.$log
holds path and name of log file. - bed2bigBed($infile,$chromsizes,$dest,$log)
-
Creates an indexed bigBed file from a BED file.
$infile
is the BED file to be transformed,$chromsizes
is the chromosome.sizes file and$dest
contains the output path for results.$log
is the name of a log file, or undef if no logging is reuqired. A '.bed', '.bed6' or '.bed12' suffix in$infile
will be replaced by '.bb' in the output. Else, the name of the output bigBed file will be the value of$infile
plus '.bb' appended.The conversion from BED to bigBed is done by a third-party utility (bedToBigBed), which is executed by IPC::Cmd.
- unique_array($arrayref)
-
Takes a reference to an array and uniques entries via hash copy. Returns a reference to an array containing unique values.
- kmer_enrichment($seqs,$klen)
-
Takes a reference to a sequence array and the length of the desired kmer. Counts occurences of kmers of given length and returns a reference to a hash containing the kmer as key and occurences as value.
- extend_chain($chromsizes,$chain,$left,$right,$upstream,$downstream,$extension)
-
Takes a hash containing chromosomes and their corresponding sizes, as e.g. returned by the fetch_chrom_sizes method. Second argument is a chain as generated by parse_bed6.
This chain is then either extended to the desired minimal size, extend from either the 5' and/or 3' end of the chain or returns only the 5' or 3' extended region plus an optional extension into the existing chain.
- parse_bed6($bedfile)
-
Takes a path to a bed6 file optionally containing additional fields and returns a reference to an array of ViennaNGS::Feature objects. This array can then easily be used for creation of an ViennaNGS::FeatureChain object.
- fetch_chrom_sizes($species)
-
Takes the name of the desired species (e.g. hg19, mm9, ..) and retrieves annotated chromosomes and their sizes from UCSC. This methods returns a reference to a hash containing chromosomes as kays and their corresponding size as value. Either uses the fetchChromSizes util of the UCSC bin collection or alternatively and if mysql is available via a mysql command.
- mkdircheck(@dirs)
-
Takes the path to one or more directories, converts them to an OS specific path and creates directories.
- rmdircheck(@dirs)
-
Takes the path to one or more directories, converts them to an OS specific path and removes directories.
DEPENDENCIES
- Bio::ViennaNGS::FeatureChain
- Bio::Perl >= 1.00690001
- File::Basename
- File::Path
- Path::Class
- IPC::Cmd
- Math::Round
- Carp
Bio::ViennaNGS uses third-party tools for computing intersections of BED files: bedtools intersect from the BEDtools suite is used to compute overlaps and bedtools sort is used to sort BED output files. Make sure that those third-party utilities are available on your system, and that hey can be found and executed by the Perl interpreter. We recommend installing the latest version of BEDtools on your system.
Examples for using most of the provided methods can be found at ViennaNGS::Tutorial.
AUTHORS
COPYRIGHT AND LICENSE
Copyright (C) 2015 Michael T. Wolfinger <michael@wolfinger.eu>
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.10.0 or, at your option, any later version of Perl 5 you may have available.
This software is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.