NAME
Bio::ViennaNGS::SpliceJunc - Perl extension for alternative splicing analysis
SYNOPSIS
use Bio::ViennaNGS::SpliceJunc;
my $c;
my %fastaobj;
my (@fo,@res);
# get Bio::PrimarySeq::Fasta object
@fo = get_fasta_ids($fasta_in);
foreach my $id (@fo) {
$fastaobj{$id} = $fastadb->get_Seq_by_id($id);
}
# Extract annotated splice sites from BED12
bed6_ss_from_bed12($bed12_in,$p_annot,$window,$want_canonical,\%fastaobj);
# Extract mapped splice junctions from RNA-seq data
bed6_ss_from_rnaseq($s_in,$p_mapped,$window,$mincov,$want_canonical,\%fastaobj);
# Check for each splice junction seen in RNA-seq if it overlaps with
# any annotated splice junction
@res = intersect_sj($p_annot,$p_mapped,$outdir,$prefix,$window,$mil);
# Check whether a splice junction is canonical
$c = ss_isCanonical($chr,$pos5,$pos3,\%fastaobj);
DESCRIPTION
Bio::ViennaNGS::SpliceJunc is a Perl module for alternative splicing (AS) analysis. It provides routines for identification and characterization of novel and existing (annotated) splice junctions from RNA-seq data.
Identification of novel splice junctions is based on intersecting potentially novel splice junctions from RNA-seq data with annotated splice junctions.
SUBROUTINES
- bed6_ss_from_bed12($bed12,$dest,$window,$fastaobjR)
Extracts splice junctions from a BED12 file (provided via argument $bed12) and writes a BED6 file for each transcript to $dest, containing all its splice junctions. Output splice junctions can be flanked by a window of +/- $window nt. $fastaobjR is a reference to a Bio::PrimarySeq::Fasta object holding the underlying reference genome. Each splice junction is represented as two BED lines in the output BED6.
- bed6_ss_from_rnaseq($bed_in,$dest,$window,$mcov)
Extracts splice junctions from mapped RNA-seq data. The input BED6 file should contain coordinates of introns in the following syntax:
chr1 3913 3996 splits:97:97:97:N:P 0 +
The fourth column in this BED file (corresponding to the 'name' field according to the BED specification) should be a colon-separated string of six elements, where the first element should be 'splits' and the second element is assumed to hold the number of reads supporting this splice junction. The fifth element indicates the splice junction type: a capital 'N' denotes a normal splice junction, whereas 'C' indicates circular and 'T' indicates trans-splice junctions, respectively. Only normal splice junctions ('N') are considered; all others are skipped. Elements 3, 4 and 6 are not further processed.
We recommend using segemehl (http://www.bioinf.uni-leipzig.de/Software/segemehl/) for generating this type of BED6 file. This routine is, however, not limited to segemehl output. BED6 files containing splice junction information from other short read mappers or third-party sources will be processed if they are formatted as described above.
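The name-field layout described above can be illustrated with a minimal sketch. The sub parse_junction_line() and its return layout are hypothetical and not part of this module, and the sketch assumes that the coverage threshold is a minimum number of supporting reads:

```perl
use strict;
use warnings;

# Parse one BED6 intron line in the format described above.
# Returns a hash reference for normal ('N') junctions supported by at least
# $mincov reads, and undef otherwise.
# NOTE: parse_junction_line() is a hypothetical helper, not module API.
sub parse_junction_line {
    my ($line, $mincov) = @_;
    chomp $line;
    my ($chr, $start, $end, $name, $score, $strand) = split ' ', $line;
    return unless defined $strand;                  # fewer than six columns

    my @f = split /:/, $name;
    return unless @f == 6 && $f[0] eq 'splits';     # unexpected name field
    my ($reads, $type) = @f[1, 4];
    return unless $reads =~ /^\d+$/;                # read count must be numeric
    return unless $type eq 'N';                     # skip 'C' and 'T' junctions
    return if $reads < $mincov;                     # assumed: minimum read support

    return { chr => $chr, start => $start, end => $end,
             reads => $reads, strand => $strand };
}

my $sj = parse_junction_line("chr1\t3913\t3996\tsplits:97:97:97:N:P\t0\t+", 10);
print "$sj->{chr}:$sj->{start}-$sj->{end} reads=$sj->{reads}\n" if $sj;
```

Elements 3, 4 and 6 of the name field are ignored here, mirroring the description above.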
This routine writes a BED6 file for each splice junction provided in the input to $dest. Output splice junctions can be flanked by a window of +/- $window nt. Each splice junction is represented as two BED lines in the output BED6.
- intersect_sj($p_annot,$p_mapped,$dest,$prefix,$window,$mil)
Intersects all splice junctions identified in an RNA-seq experiment with annotated splice junctions. Identifies and characterizes novel and existing splice junctions. Each BED6 file in $p_mapped is intersected with those transcript splice junction BED6 files in $p_annot whose genomic location spans the query splice junction. This prevents the tool from intersecting each splice site found in the mapped RNA-seq data with all annotated transcripts. $mil specifies a maximum intron length.
The intersection operations are performed with bedtools intersect from the BEDtools suite. BED sorting operations are performed with bedtools sort.
Writes two BED6 files to $dest (optionally prefixed by $prefix), which contain novel and existing splice junctions, respectively.
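The pre-selection of annotated transcripts that span a query splice junction can be sketched in plain Perl. This is only an illustration of the idea: the module itself operates on BED6 files on disk, and the sub name spanning_transcripts() as well as the in-memory record layout are assumptions made for this example.

```perl
use strict;
use warnings;

# Return the names of annotated transcripts whose genomic span contains the
# query splice junction on the same chromosome.
# NOTE: spanning_transcripts() and the hash layout are hypothetical.
sub spanning_transcripts {
    my ($query, $transcripts) = @_;
    return map  { $_->{name} }
           grep { $_->{chr}   eq $query->{chr}
               && $_->{start} <= $query->{start}
               && $_->{end}   >= $query->{end} } @$transcripts;
}

my @annot = (
    { name => 'tx1', chr => 'chr1', start => 1000, end => 5000 },
    { name => 'tx2', chr => 'chr1', start => 4000, end => 9000 },
    { name => 'tx3', chr => 'chr2', start => 1000, end => 5000 },
);
my $query = { chr => 'chr1', start => 3913, end => 3996 };
print join(',', spanning_transcripts($query, \@annot)), "\n";   # prints "tx1"
```

Only the transcripts returned by such a filter would then need to be passed to bedtools intersect, which keeps the number of pairwise intersections small.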
- ss_isCanonical($chr,$p5,$p3,$fastaobjR)
Checks whether a given splice junction is canonical, i.e. whether the first and last two nucleotides of the enclosed intron correspond to a certain nucleotide motif. $chr is the chromosome name, $p5 and $p3 are the 5' and 3' ends of the splice junction, and $fastaobjR is a reference to a Bio::PrimarySeq::Fasta object holding the underlying reference genome.
This routine does not explicitly consider strandedness, in the sense that splice junction motifs are evaluated in terms of the forward strand of the underlying reference sequence. This is best explained by an example: Consider the splice junction motif GU->AG on the reverse strand. In 5' to 3' direction of the forward strand, this junction reads CT->AC. A splice junction is canonical if its motif corresponds to one of the following cases:
5'===]GT|CT....AG|AC[====3' i.e. GT->AG or CT->AC
5'===]GC|CT....AG|GC[====3' i.e. GC->AG or CT->GC
5'===]AT|GT....AC|AT[====3' i.e. AT->AC or GT->AT
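The forward-strand motif test listed above can be sketched without any BioPerl objects. The following is a minimal illustration, not the module's implementation: is_canonical_intron() is a hypothetical name, the chromosome is passed as a plain string, and the intron coordinates are assumed to be 1-based and inclusive.

```perl
use strict;
use warnings;

# Forward-strand motif pairs taken from the list above
# (first two intron nucleotides, last two intron nucleotides).
my %CANONICAL = map { $_ => 1 } qw(GT-AG CT-AC GC-AG CT-GC AT-AC GT-AT);

# $seq:   forward-strand chromosome sequence as a plain string
# $start: first intron position, $end: last intron position
#         (assumed 1-based, inclusive; the module's convention may differ)
# NOTE: is_canonical_intron() is a hypothetical helper, not module API.
sub is_canonical_intron {
    my ($seq, $start, $end) = @_;
    return 0 if $start < 1 || $end > length($seq) || $end - $start + 1 < 4;
    my $first = uc substr($seq, $start - 1, 2);   # first two intron nucleotides
    my $last  = uc substr($seq, $end - 2, 2);     # last two intron nucleotides
    return $CANONICAL{"$first-$last"} ? 1 : 0;
}

print is_canonical_intron('AAAGTCCCCAGTTT', 4, 11), "\n";   # GT->AG, prints 1
print is_canonical_intron('AAACTCCCCACTTT', 4, 11), "\n";   # CT->AC, prints 1
print is_canonical_intron('AAAGGCCCCTTTTT', 4, 11), "\n";   # GG->TT, prints 0
```

Because both the forward-strand motif and its reverse-strand counterpart are in the lookup table, no strand argument is needed, which matches the behaviour described above.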
DEPENDENCIES
This module depends on the following Perl modules:
Bio::ViennaNGS::SpliceJunc uses third-party tools for computing intersections of BED files: bedtools intersect from the BEDtools suite is used to compute overlaps and bedtools sort is used to sort BED output files. Make sure that those third-party utilities are available on your system, and that they can be found and executed by the Perl interpreter. We recommend installing the latest version of BEDtools on your system.
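One way to verify that the required executable can be found before running an analysis is the can_run function from the core IPC::Cmd module. The helper missing_tools() below is a hypothetical name used for illustration only; whether this module performs a comparable check internally is not stated here.

```perl
use strict;
use warnings;
use IPC::Cmd qw(can_run);

# Return the names of the given executables that cannot be located.
# NOTE: missing_tools() is a hypothetical helper, not part of Bio::ViennaNGS.
sub missing_tools {
    return grep { my $tool = $_; !can_run($tool) } @_;
}

# 'bedtools intersect' and 'bedtools sort' are subcommands of the single
# 'bedtools' executable, so checking for that one binary is sufficient.
my @missing = missing_tools('bedtools');
warn "Missing third-party tools: @missing\n" if @missing;
```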
SEE ALSO
AUTHOR
Michael T. Wolfinger <michael@wolfinger.eu>
COPYRIGHT AND LICENSE
Copyright (C) 2014 Michael T. Wolfinger <michael@wolfinger.eu>
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.12.4 or, at your option, any later version of Perl 5 you may have available.
This software is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.