NAME

Bio::ViennaNGS - Perl extension for Next-Generation Sequencing analysis

SYNOPSIS

use ViennaNGS;

# get Bio::PrimarySeq::Fasta object
my @fo = get_fasta_ids($fasta_in);
foreach my $id (@fo) {
  $fastaobj{$id} = $fastadb->get_Seq_by_id($id);
}

# get strand-specific genomic sequence between $start and $end
$seq = get_stranded_subsequence($obj,$start,$stop,$fastaobj{$id});

# split a single-end  or paired-end BAM file by strands
@result = split_bam($bam_in,$rev,$want_uniq,$want_bed,$destdir,$logfile);

# make bigWig from BAM
bam2bw($bam_in,$chromsizes)

# make bigWig from BED
bed2bw($bed_in,$cs_in,"+",$destdir,$wantnorm,$size_p,$scale,$logfile);

# make bigBed from BED
my $bb = bed2bigBed($bed_in,$cs_in,$destdir,$logfile);

# compute transcript abundance in TPM
$meanTPM = computeTPM($sample,$readlength);

# parse a bedtools multicov compatible file
$conds = parse_multicov($infile);

# write bedtools multicov compatible file
write_multicov("TPM", $destdir, $basename);

DESCRIPTION

ViennaNGS is a collection of utilities and subroutines often used for Next-Generation Sequencing (NGS) data analysis.

get_stranded_subsequence($object,$start,$stop,$strand)

Returns the actual DNA/RNA sequence from $start to $end. $object is a Bio::PrimarySeq::Fasta object, which obeys the Bio::PrimarySeqI conventions. To recover the entire raw DNA or protein sequence, call $object->seq(). $strand is 1 or -1 for [+] or [-] strand, respectively.

split_bam($bam,$reverse,$want_uniq,$want_bed,$dest_dir,$log)

Splits BAM file $bam according to [+] and [-] strand. $reverse, $want_uniq and $want_bed are switches with values of 0 or 1, triggering forced reversion of strand mapping (due to RNA-seq protocol constraints), filtering of unique mappers (identified via NH:i:1 SAM argument), and forced output of a BED file corresponding to strand-specific mapping, respectively. $log holds name and path of the log file.

Strand-splitting is done in a way that in paired-end alignments, FIRST and SECOND mates (reads) are treated as _one_ fragment, ie FIRST_MATE reads determine the strand, while SECOND_MATE reads are assigned the opposite strand per definitionem. (This also holds if the reads are not mapped in proper pairs and even if ther is no mapping partner at all)

Sometimes the library preparation protocol causes inversion of the read assignment (with respect to the underlying annotation). In those cases, the natural mapping of the reads can be obtained by the $reverse flag.

NOTE: Filtering of unique mappers is only safe for single-end experiments; In paired-end experiments, read and mate are treated separately, thus allowing for scenarios where eg. one read is a multi-mapper, whereas its associate mate is a unique mapper, resulting in an ambiguous alignment of the entire fragment.

bam2bw($bam,$chromsizes)

Creates BedGraph and BigWig coverage profiles from BAM files. These can easily be visualized as TrackHubs within the UCSC Genome Browser (http://genome.ucsc.edu/goldenPath/help/hgTrackHubHelp.html). Internally, the conversion is accomplished by two third-party applications: genomeCoverageBed (from BEDtools, see http://bedtools.readthedocs.org/en/latest/content/tools/genomecov.html) and bedGraphToBigWig (from the UCSC Genome Browser, see http://hgdownload.cse.ucsc.edu/admin/exe/).

bed2bw($infile,$chromsizes,$strand,$dest,$want_norm,$size,$scale,$log)

Creates stranded, normalized BigWig coverage profiles from BED files. $chromsizes is the chromosome.sizes files, $strand is either "+" or "-" and $dest contains the output path for results. For normlization of bigWig profiles, additional attributes are required: $want_norm triggers normalization with values 0 or 1. $size is the number of elements (features) in the BED file and $scale gives the number to which data is normalized (ie. every bedGraph entry is multiplied by a factor ($scale/$size). $log holds path and name of log file.

Stranded BigWigs can easily be visualized via TrackHubs in the UCSC Genome Browser. Internally, the conversion is accomplished by two third-party applications: genomeCoverageBed (from BEDtools, see http://bedtools.readthedocs.org/en/latest/content/tools/genomecov.html) and bedGraphToBigWig (from the UCSC Genome Browser, see http://hgdownload.cse.ucsc.edu/admin/exe/). Intermediate bedGraph files are removed automatically.

bed2bigBed($infile,$chromsizes,$dest,$log)

Creates an indexed bigBed file from a BED file. $infile is the BED file to be transformed, $chromsizes is the chromosome.sizes file and $dest contains the output path for results. $log is the name of a log file, or undef if no logging is reuqired. A '.bed', '.bed6' or '.bed12' suffix in $infile will be replace by '.bb' in the output. Else, the name of the output bigBed file will be the value of $infile plus '.bb' appended.

The conversion from BED to bigBed is done by a third-party utility (bedToBigBed), which is executed by IPC::Cmd.

computeTPM($featCount_sample,$rl)

Computes expression in Transcript per Million (TPM) [Wagner et.al. Theory Biosci. (2012)]. $featCount_sample is a reference to a HoH data straucture where keys are feature names and values hold a hash that must at least contain length and raw read counts. Practically, $featCount_sample is represented by _one_ element of @featCount, which is populated from a multicov file by parse_multicov(). $rl is the read length of the sequencing run.

Returns the mean TPM of the processed sample, which is invariant among samples. (TPM models relative molar concentration and thus fulfills the invariant average criterion.)

parse_multicov($file)

Parse a bedtools multicov (multiBamCov) file, i.e. an extended BED6 file, into an Array of Hash of Hashes data structure (@featCount). $file is the input file. Returns the number of samples present in the multicov file, ie. the numner of columns extending the canonical BED6 columns in the input multicov file.

write_multicov($item,$dest_dir,$base_name)

Write @featCount data to a bedtools multicov (multiBamCov)-type file. $item specifies the type of information from @featCount HoH entries, e.g. TPM or RPKM. These values must have been computed and inserted into @featCount beforehand by e.g. computeTPM(). $dest_dir gives the absolute path and $base_name the basename (will be extended by $item.csv) of the output file.

DEPENDENCIES

Bio::Perl >= 1.006924
BIO::DB::Sam >= 1.39
File::Basename
File::Temp
Path::Class
IPC::Cmd
Carp

ViennaNGS::SpliceJunc uses third-party tools for computing intersections of BED files: bedtools intersect from the BEDtools suite is used to compute overlaps and bedtools sort is used to sort BED output files. Make sure that those third-party utilities are available on your system, and that hey can be found and executed by the perl interpreter. We recommend installing the latest version of BEDtools on your system.

SEE ALSO

perldoc Bio::ViennaNGS::AnnoC
perldoc Bio::ViennaNGS::UCSC
perldoc Bio::ViennaNGS::SpliceJunc

AUTHORS

Michael T. Wolfinger <michael@wolfinger.eu> Florian Eggenhofer <egg@tbi.univie.ac.at> Joerg Fallmann <fall@tbi.univie.ac.at>

COPYRIGHT AND LICENSE

Copyright (C) 2014 Michael T. Wolfinger <michael@wolfinger.eu>

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.12.4 or, at your option, any later version of Perl 5 you may have available.

This software is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.