NAME

fasta-decoy.pl - decoy input databanks following several methods

DESCRIPTION

Reads an input fasta file and produces a decoyed databank with one of several methods:

reverse: simply reverse each sequence
shuffle: shuffle the amino acids (AA) in each sequence
shuffle & avoid known cleaved peptides: shuffle each sequence but avoid producing known tryptic peptides
Markov model: learn a Markov chain distribution of a given level, then produce entries following this distribution
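
The first two methods can be illustrated with a minimal Python sketch. It is not the script's own code: the function names `reverse_decoy` and `shuffle_decoy` and the sample sequence are hypothetical, chosen only for illustration.

```python
import random


def reverse_decoy(sequence: str) -> str:
    """Return the sequence read backwards."""
    return sequence[::-1]


def shuffle_decoy(sequence: str, rng: random.Random) -> str:
    """Return a random permutation of the residues (composition is preserved)."""
    residues = list(sequence)
    rng.shuffle(residues)
    return "".join(residues)


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed only to make the demo repeatable
    seq = "MKWVTFISLLLLFSSAYSR"  # illustrative sequence, not taken from this page
    print(reverse_decoy(seq))
    print(shuffle_decoy(seq, rng))
```

Both decoys keep the length and amino acid composition of the original entry; only the order of the residues changes.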

SYNOPSIS

# reverse sequences for a local (optionally compressed) file
fasta-decoy.pl --in=/tmp/uniprot_sprot.fasta.gz --method=reverse

# download a databank from the web, uncompress it and shuffle the sequences
wget --quiet -O - ftp://ftp.expasy.org/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz | zcat | fasta-decoy.pl --method=shuffle

# use a .dat file (with splice forms) as an input
uniprotdat2fasta.pl --in=uniprot_sprot_human.dat | fasta-decoy.pl --method=markovmodel

ARGUMENTS

--in=infile.fasta

An input fasta file (will be uncompressed if ending with gz)

--out=outfile.fasta

A .fasta file [default is stdout]

--method=(reverse|shuffle|markovmodel)

Set the decoying method

OPTIONS

--method=shuffle options

--shuffle-reshufflecleavedpeptides

Re-shuffle peptides of size >=6 that were detected as cleaved peptides in the original databank

--shuffle-reshufflecleavedpeptides-minlength [default 6]

Set the minimum size of the peptides to be reshuffled if they already exist in the original databank
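
The reshuffling idea described above can be sketched in Python as follows. This is a hedged illustration, not the script's implementation: `digest`, `known_peptides`, `shuffle_avoiding_known` and the `max_tries` limit are hypothetical names and choices, and the way the actual script reshuffles clashing peptides may differ.

```python
import random
import re

# Default enzyme pattern as quoted for --shuffle-cleaveenzyme
TRYPSIN = re.compile(r"(?<=[KR])(?=[^P])")


def digest(sequence, enzyme=TRYPSIN):
    """Cut the sequence at every zero-width match of the enzyme pattern."""
    cuts = [m.start() for m in enzyme.finditer(sequence)]
    bounds = [0] + cuts + [len(sequence)]
    return [sequence[a:b] for a, b in zip(bounds, bounds[1:]) if a < b]


def known_peptides(sequences, min_length=6):
    """Collect cleaved peptides of at least min_length from the original entries."""
    return {p for s in sequences for p in digest(s) if len(p) >= min_length}


def shuffle_avoiding_known(sequence, known, rng, min_length=6, max_tries=20):
    """Shuffle a sequence, then reshuffle any cleaved peptide found in `known`."""
    residues = list(sequence)
    rng.shuffle(residues)
    decoy = "".join(residues)
    for _ in range(max_tries):  # hypothetical cap; gives up silently afterwards
        peptides = digest(decoy)
        clashes = [i for i, p in enumerate(peptides)
                   if len(p) >= min_length and p in known]
        if not clashes:
            break
        for i in clashes:
            chars = list(peptides[i])
            rng.shuffle(chars)
            peptides[i] = "".join(chars)
        decoy = "".join(peptides)
    return decoy


if __name__ == "__main__":
    seq = "MKWVTFISLLLLFSSAYSRGVFRR"  # illustrative sequence
    known = known_peptides([seq])
    print(shuffle_avoiding_known(seq, known, random.Random(0)))
```

The decoy is digested again after each pass because moving residues inside a peptide can change where the enzyme cuts.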

--shuffle-reshufflecleavedpeptides-crc=int

Building a hash of known cleaved peptides can be quite demanding for memory (uniprot_trembl => ~4GB). Therefore, a solution is to build a bit array stating whether or not a peptide with the corresponding CRC code was found.
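
A minimal Python sketch of such a bit array is given below, under stated assumptions: it uses `zlib.crc32` as the CRC function and takes a hypothetical `n_bits` parameter for the array size. How the integer passed to --shuffle-reshufflecleavedpeptides-crc maps to the array size, and which CRC variant the script uses, are not specified here.

```python
import zlib


class CrcBitArray:
    """One bit per CRC bucket: set when a peptide with that code was seen."""

    def __init__(self, n_bits):
        self.n_bits = n_bits  # hypothetical size parameter
        self.bits = bytearray((n_bits + 7) // 8)

    def _index(self, peptide):
        # Assumption: CRC-32 of the peptide string, folded into the array size
        return zlib.crc32(peptide.encode("ascii")) % self.n_bits

    def add(self, peptide):
        i = self._index(peptide)
        self.bits[i >> 3] |= 1 << (i & 7)

    def __contains__(self, peptide):
        i = self._index(peptide)
        return bool(self.bits[i >> 3] & (1 << (i & 7)))


if __name__ == "__main__":
    seen = CrcBitArray(1 << 20)  # 2**20 bits = 128 KiB
    seen.add("WVTFISLLLLFSSAYSR")
    print("WVTFISLLLLFSSAYSR" in seen)
```

The memory cost is fixed by the array size rather than by the number of peptides. The trade-off is that two different peptides can share a CRC code, so a lookup may report a peptide as known when it is not; a peptide that was added is never reported as missing.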

--shuffle-cleaveenzyme=regexp

Set a regular expression for the enzyme [default is trypsin: '(?<=[KR])(?=[^P])']

--shuffle-testenzyme

Just digest the entries with the set enzyme and produce space-separated peptides (to check the enzyme)
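
To show what the default enzyme pattern does, here is a small Python sketch of a digestion check. The function name `digest` and the sample sequence are hypothetical; the pattern is the trypsin default quoted above, which cuts after K or R unless the next residue is P.

```python
import re


def digest(sequence, enzyme_regexp=r"(?<=[KR])(?=[^P])"):
    """Split a sequence at every zero-width match of the enzyme pattern."""
    cuts = [m.start() for m in re.finditer(enzyme_regexp, sequence)]
    bounds = [0] + cuts + [len(sequence)]
    return [sequence[a:b] for a, b in zip(bounds, bounds[1:]) if a < b]


if __name__ == "__main__":
    # Space-separated peptides, in the spirit of --shuffle-testenzyme
    print(" ".join(digest("MKRPAKLLR")))  # prints: MK RPAK LLR
```

In the sample, the R followed by P is not a cleavage site, so `RPAK` stays in one piece.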

--method=markovmodel options

--markovmodel-level=int [default 3]

Set the length of the model (0 means only the AA distribution will be respected, 3 means the distribution of chains of length 3, etc.). Setting a length >3 can lead to memory exhaustion.
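
The Markov model method can be sketched in Python as below. This is an illustration under assumptions, not the script's algorithm: `level` is interpreted here as the number of preceding residues used as context, and `learn_model`, `generate` and the restart rule for unseen contexts are hypothetical.

```python
import random
from collections import Counter, defaultdict


def learn_model(sequences, level):
    """Count which residue follows each context of `level` residues."""
    model = defaultdict(Counter)
    for seq in sequences:
        for i in range(len(seq) - level):
            model[seq[i:i + level]][seq[i + level]] += 1
    return model


def generate(model, level, length, rng):
    """Produce a sequence whose transitions follow the learned counts."""
    if not model:
        raise ValueError("no training sequence is longer than the level")
    contexts = sorted(model)  # sorted so a seeded rng gives repeatable output
    out = list(rng.choice(contexts)) if level else []
    while len(out) < length:
        context = "".join(out[len(out) - level:]) if level else ""
        counts = model.get(context)
        if not counts:
            # Context never seen with a successor: restart from a known one
            counts = model[rng.choice(contexts)]
        residues = sorted(counts)
        weights = [counts[r] for r in residues]
        out.append(rng.choices(residues, weights=weights)[0])
    return "".join(out[:length])


if __name__ == "__main__":
    training = ["MKWVTFISLLLLFSSAYSR", "MKWVTFLLLLFSSAYSRGVFRR"]  # illustrative
    model = learn_model(training, level=2)
    print(generate(model, 2, 20, random.Random(0)))
```

With 20 standard amino acids there can be up to 20^level distinct contexts, each with its own successor counts, which is consistent with the memory warning for levels above 3.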

misc

--noprogressbar

Do not display the terminal progress bar (if possible)

--help

--man

--verbose

Setting the environment variable DO_NOT_DELETE_TEMP=1 will keep the temporary file after the script exits

EXAMPLE

COPYRIGHT

Copyright (C) 2004-2006 Geneva Bioinformatics www.genebio.com

This library is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 2.1 of the License, or (at your option) any later version.

This library is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.

You should have received a copy of the GNU Lesser General Public License along with this library; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA

AUTHORS

Alexandre Masselot, www.genebio.com