The London Perl and Raku Workshop takes place on 26th Oct 2024. If your company depends on Perl, please consider sponsoring and/or attending.

NAME

Tutorial_pipeline01.pl - An example pipeline for the ViennaNGS toolbox

SYNOPSIS

./Tutorial_Pipeline01.pl path/to/R/libraries Where path/to/R/libraries should point to the directory containing ggplot2

DESCRIPTION

This script is a showcase for the usage of ViennaNGS in a real NGS example. We start from a file containing ENSEMBL annotation information for human protein-coding genes. We are insterested in finding sequence motifs in close proximity to the gene start (50nt upstream, 10nt into the gene).

PIPELINE

The first step is to initialize some variables and generate a chromosome_sizes hash.

<my $bed = 'hg19_highlyexpressed.bed';> <my $name = (split(/\./,$bed))[0];> <my $upstream = 50;> <my $into = 10;> <my $outfile = "$name.ext$upstream\_fromStart_$into\_downstream.bed";> <my $outfile2 = "$name.ext$upstream\_upstream.bed";> <my %sizes = %{fetch_chrom_sizes('hg19')};>

Generate a Bio::ViennaNGS::FeatureChain object

The bed file of interest is parsed, a feature array is generated and passed on to Bio::ViennaNGS::FeatureChain, which creates a new Moose Object of type FeatureChain, containing the original bed entries

<my @featurelist = @{parse_bed6($bed)}; ### Now we create a Bio::ViennaNGS::FeatureChain from the featurelist above my $chain = Bio::ViennaNGS::FeatureChain-new('type'=>'original','chain'=>\@featurelist);>>

Extend the existing chain for motif analysis

The newly created FeatureChain object will now be extended 50nt upstream of the gene start and 10nt into the gene, to retrieve a bed file which contains the putative sequence motifs.

<my $extended_chain = extend_chain(\%sizes,$chain,0,$into,$upstream,0);>

Extended chains are now print out to make them available for external tools like bedtools.

<my $out = $extended_chain-print_chain(); print $Out $out; close ($Out)>>

Summary of so far used methods

fetch_chrom_sizes<as_string>

Returns a chromosome-sizes hash reference for the specified species, e.g. hg19, mm9, mm10, etc.

parse_bed6<as_string>

Reads a bed6 file and returns a feature array.

Bio::ViennaNGS::FeatureChain->new()<as_method>

Generates a new Bio::ViennaNGS::FeatureChain object from a feature array

Bio::ViennaNGS(extend_chain)<as_method>

Extends a Bio::ViennaNGS::FeatureChain object by given constraints

Sequence analysis

 We now generate FASTA files from the extended bed files using bedtools getfasta method.
C<<my $bedtools = `bedtools getfasta -s -fi hg19_chromchecked.fa -bed $outfile -fo $name.ext$upstream\_fromStart_$into\_downstream.fa`;>>
C<<print STDERR "$bedtools\n" if $?;>>
C<<$bedtools = `bedtools getfasta -s -fi hg19_chromchecked.fa -bed $outfile2 -fo $name.ext$upstream\_upstream.fa`;>>
C<<print STDERR "$bedtools\n" if $?;>>

To analyze putative sequence motifs in the newly generated Fasta files, we use two approaches. First we analyze the k-mer content with the Bio::ViennaNGS(kmer_enrichment) method for k-mers of length 6 to 8 nt.

<open(IN,"<","$name.ext$upstream\_fromStart_$into\_downstream.fa") || die ("Could not open $name.ext$upstream\_fromStart_$into\_downstream.fa!\n@!\n");>

<my @fastaseqs;> <while(<IN){ chomp (my $raw = $_); next if ($_ =~ /^>/); push @fastaseqs, $raw; }>> <close(IN);>

<for (6..8){ my %kmer = %{kmer_enrichment(\@fastaseqs, $_)}; my $total = sum values %kmer; ### Print Output open(KMER,"","$_\_mers") or die "Could not open file $_\_mers$!\n"; print KMER "$_\-mer\tCount\tRatio\n"; print KMER "TOTAL\t$total\t1\n"; foreach my $key (sort {$kmer{$b} <=> $kmer{$a} } keys %kmer) { my $ratio = nearest(.0001,$kmer{$key}/$total); print KMER "$key\t$kmer{$key}\t$ratio\n"; } close(KMER); }>>

MEME

In a second approach we run MEME to retrieve the 20 most over-represented motifs of length 8. meme hg19_highexpressed.ext50_fromStart_10_downstream.fa -oc MEME_hg19_highexpressed.ext50_fromStart_10_downstream.fa -w 8 -dna -nmotifs 20

Once the meme run is done, we want to have a nice figure which shows the e-value and site coverage of the top 10 motifs

<my $cmd = "perl ../scripts/MEME_xml_motif_extractor.pl -f Example_Pipeline_meme.xml -r $RLIBPATH -t Example_Pipeline"; my( $success, $error_message, $full_buf, $stdout_buf, $stderr_buf ) = run(command = $cmd, verbose => 0);

if(!$success){ my $this_function = (caller(0))[3]; print STDERR "ERROR: MEME_xml_motif_extractor.pl run unsuccessful\n"; print join "", @$full_buf; unless ($r) { warn "If you do not provide a valid R-lib-path and ggplot is not found in the standard R path, this pipeline will not be able to parse the MEME xml output.\n"; pod2usage(-verbose => 0); } }>>

AUTHOR

Joerg Fallmann <joerg.fallmann@univie.ac.at>