NAME
Bio::ViennaNGS - A Perl extension for Next-Generation Sequencing data analysis
SYNOPSIS
use Bio::ViennaNGS;
# split a single-end or paired-end BAM file by strands
@result = split_bam($bam_in,$rev,$want_uniq,$want_bed,$destdir,$logfile);
# extract unique and multi mappers from a BAM file
@result = uniquify_bam($bam_in,$outdir,$logfile);
# make bigWig from BED or BAM
$type = "bam";
$strand = "+";
$bwfile = bed_or_bam2bw($type,$infile,$cs_in,$strand,$destdir,$wantnorm,$size_p,$scale,$logfile);
# make bigBed from BED
my $bb = bed2bigBed($bed_in,$cs_in,$destdir,$logfile);
# sort a BED file
sortbed($bed_in,$destdir,$bed_out,$rm_orig,$logfile)
# compute transcript abundance in TPM
$meanTPM = computeTPM($sample,$readlength);
# parse a bedtools multicov compatible file
$conds = parse_multicov($infile);
# write bedtools multicov compatible file
write_multicov("TPM",$destdir,$basename);
DESCRIPTION
Bio::ViennaNGS is a collection of utilities and subroutines for building efficient Next-Generation Sequencing (NGS) data analysis pipelines.
ROUTINES
- split_bam($bam,$reverse,$want_uniq,$want_bed,$dest_dir,$log)
-
Splits BAM file $bam according to [+] and [-] strand.
$reverse
,$want_uniq
and$want_bed
are switches with values of 0 or 1, triggering forced reversion of strand mapping (due to RNA-seq protocol constraints), filtering of unique mappers (identified via NH:i:1 SAM argument), and forced output of a BED file corresponding to strand-specific mapping, respectively.$log
holds name and path of the log file.Strand-splitting is done in a way that in paired-end alignments, FIRST and SECOND mates (reads) are treated as _one_ fragment, ie FIRST_MATE reads determine the strand, while SECOND_MATE reads are assigned the opposite strand per definitionem. This also holds if the reads are not mapped in proper pairs and even if there is no mapping partner at all.
Sometimes the library preparation protocol causes inversion of the read assignment (with respect to the underlying annotation). In those cases, the natural mapping of the reads can be obtained by the
$reverse
flag.This routine returns an array whose fist two elements are the file names of the newly generate BAM files with reads mapped to the positive, and negative strand, respectively. Elements three and four are the number of fragments mapped to the positive and negative strand. If the
$want_bed
option was given elements fiveand six are the file names of the output BED files for positive and negative strand, respectively.NOTE: Filtering of unique mappers is only safe for single-end experiments; In paired-end experiments, read and mate are treated separately, thus allowing for scenarios where eg. one read is a multi-mapper, whereas its associate mate is a unique mapper, resulting in an ambiguous alignment of the entire fragment.
- uniquify_bam($bam,$dest,$log)
-
Extract unique and multi mapping reads from BAM file
$bam
. New BAM files for unique and multi mappers are created in the output folder$dest
, which are named basename.uniq.bam and basename.mult.bam, respectively. If defined, a logfile named$log
is created in the output folder.This routine returns an array holding file names of the newly created BAm files for unique and multi mappers, respectively.
- bed_or_bam2bw($type,$infile,$chromsizes,$strand,$dest,$want_norm,$size,$scale,$log)
-
Creates stranded, normalized BigWig coverage profiles from strand-specific BAM or BED files (provided via
$infile
). The routine expects a file type 'bam' or 'bed' via the$type
argument.$chromsizes
is the chromosome.sizes files,$strand
is either "+" or "-" and$dest
contains the output path for results. For normlization of bigWig profiles, additional attributes are required:$want_norm
triggers normalization with values 0 or 1.$size
is the number of fragments/elements in the BAM or BED file and$scale
gives the number to which data is normalized (ie. every bedGraph entry is multiplied by a factor ($scale
/$size
).$log
is expected to contain either the full path and file name of log file or 'undef'. The routine returns the full file name of the newly generated bigWig file.While this routine can handle non-straned BAM/BED files (in which case
$strand
should be set to "+" and hence all coverage profiles will be created with a positive sign, even if they map to the negative strand), usage of strand-specific data is highly recommended. For BAM file, this is easily achieved by calling the bam_split routine (see above) prior to this one, thus creating dedicated BAM files containing exclusively reads mapped to the positive or negative strand, respectively.It is important to know that this routine does not extract reads mapped to either strand from a non-stranded BAM/BED file if the
$strand
argument is given. It rather adjusts the sign of all mapped reads/features in a BAM/BED file and then creates bigWig files. See the split_bam routine for extracting reads mapped to either strand.Stranded bigWigs can easily be visualized via TrackHubs in the UCSC Genome Browser. Internally, the conversion from BAM/BED to bigWig is accomplished via two third-party applications: genomeCoverageBed and bedGraphToBigWig. Intermediate bedGraph files are removed automatically once the bigWig files are ready.
- sortbed($infile,$dest,$outfile,$rm_orig,$log)
-
Sorts BED file
$infile
with bedtools sortt.$dest
andoutfile
name path and filename of the resulting sorted BED file.$rm_infile
is either 1 or 0 and indicated whether the original$infile
should be deleted.$log
holds path and name of log file. - bed2bigBed($infile,$chromsizes,$dest,$log)
-
Creates an indexed bigBed file from a BED file.
$infile
is the BED file to be transformed,$chromsizes
is the chromosome.sizes file and$dest
contains the output path for results.$log
is the name of a log file, or undef if no logging is reuqired. A '.bed', '.bed6' or '.bed12' suffix in$infile
will be replace by '.bb' in the output. Else, the name of the output bigBed file will be the value of$infile
plus '.bb' appended.The conversion from BED to bigBed is done by a third-party utility (bedToBigBed), which is executed by IPC::Cmd.
- computeTPM($featCount_sample,$rl)
-
Computes expression in Transcript per Million (TPM) [Wagner et.al. Theory Biosci. (2012)].
$featCount_sample
is a reference to a Hash of Hashes data straucture where keys are feature names and values hold a hash that must at least contain length and raw read counts. Practically,$featCount_sample
is represented by _one_ element of@featCount
, which is populated from a multicov file byparse_multicov()
.$rl
is the read length of the sequencing run.Returns the mean TPM of the processed sample, which is invariant among samples. (TPM models relative molar concentration and thus fulfills the invariant average criterion.)
- parse_multicov($file)
-
Parse a bedtools multicov (multiBamCov) file, i.e. an extended BED6 file, into an Array of Hash of Hashes data structure (
@featCount
).$file
is the input file. Returns the number of samples present in the multicov file, ie. the numner of columns extending the canonical BED6 columns in the input multicov file. - write_multicov($item,$dest_dir,$base_name)
-
Write
@featCount
data to a bedtools multicov (multiBamCov)-type file.$item
specifies the type of information from@featCount
HoH entries, e.g. TPM or RPKM. These values must have been computed and inserted into@featCount
beforehand by e.g.computeTPM()
.$dest_dir
gives the absolute path and$base_name
the basename (will be extended by$item
.csv) of the output file.
DEPENDENCIES
- Bio::Perl >= 1.00690001
- BIO::DB::Sam >= 1.39
- File::Basename
- File::Temp
- Path::Class
- IPC::Cmd
- Carp
Bio::ViennaNGS uses third-party tools for computing intersections of BED files: bedtools intersect from the BEDtools suite is used to compute overlaps and bedtools sort is used to sort BED output files. Make sure that those third-party utilities are available on your system, and that hey can be found and executed by the Perl interpreter. We recommend installing the latest version of BEDtools on your system.
SEE ALSO
- Bio::ViennaNGS::AnnoC
- Bio::ViennaNGS::UCSC
- Bio::ViennaNGS::SpliceJunc
- Bio::ViennaNGS::Fasta
- Bio::ViennaNGS::FeatureChain
AUTHORS
- Michael T. Wolfinger <michael@wolfinger.eu>
- Jörg Fallmann <fall@tbi.univie.ac.at>
- Florian Eggenhofer <florian.eggenhofer@tbi.univie.ac.at>
- Fabian Amman <fabian@tbi.univie.ac.at<gt>
COPYRIGHT AND LICENSE
Copyright (C) 2014 Michael T. Wolfinger <michael@wolfinger.eu>
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.12.4 or, at your option, any later version of Perl 5 you may have available.
This software is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
1 POD Error
The following errors were encountered while parsing the POD:
- Around line 1020:
Non-ASCII character seen before =encoding in 'Jörg'. Assuming UTF-8