NAME

fascut - select biosequence record data by character or field ranges

SYNOPSIS

fascut [OPTION]... [INDEX-SET] [MULTIFASTA-FILE]...

RANGE SPECIFICATION

RANGE-LIST  : {  RANGE  |  RANGE,RANGE-LIST }
     RANGE  : {  INDEX  |   FROM>SEP<TO     | FROM>SEP<TO:BY } 
     >SEP<  : {   ..    |        :          |       -        }
     INDEX  : nonzero integer
   FROM,TO  : nonzero integer or the empty string 
               -- positive integers count from the first character or field of the data
               -- negative integers count back from the last character or field of the data
        BY  : nonzero integer
               -- positive integers step forward       
               -- negative integers step backward

DESCRIPTION

fascut takes biological sequence records on input, and on output, transforms one component (e.g. sequence, description or identifier) of each record as the concatenation of index- and range-based selections of input data. Data is indexed by character or optionally by field. By default, fascut operates character-wise on sequence data. The sequence of data selections is specified in the first argument as a comma-separated list of ranges. A range is either a single index integer or a range of integers, with reversals and variable step sizes allowed. Arbitrary repetition of indices is allowed across ranges in the range-list. fascut outputs the concatenation of the ordered data selections specified by each range.

The one mandatory argument to fascut is a sequence of indices or ranges separated by commas (,). Ranges may be specified in Perl style (or Genbank coordinate style) like "from..to", in R/Octave style like "from:to", or in UNIX cut style like "from-to". If the index bounds of a simple range are missing, "from" defaults to "1" and "to" defaults to "-1". An optional ":by" suffix specifies a non-zero integer step-increment, which may be positive or negative. Negative step-increments imply a reversal of input data.
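The grammar and defaults described above can be illustrated with a small parser. This is a minimal sketch in Python, not fascut's own (Perl) code; the names RANGE_RE, parse_range and parse_range_list are hypothetical, and the handling of malformed input (raising ValueError) is an assumption.

```python
import re

# Illustrative only: these names are hypothetical and not part of fascut.
RANGE_RE = re.compile(
    r"(?P<start>-?\d+)?"      # optional FROM
    r"(?:\.\.|:|-)"           # separator: "..", ":" or "-"
    r"(?P<stop>-?\d+)?"       # optional TO
    r"(?::(?P<step>-?\d+))?"  # optional ":BY" suffix
)

def parse_range(text):
    """Parse one RANGE into (start, stop, step); a lone INDEX n gives (n, n, 1)."""
    # A bare integer is an INDEX, so test for it before trying the range pattern.
    if re.fullmatch(r"-?\d+", text):
        index = int(text)
        if index == 0:
            raise ValueError("indices must be nonzero")
        return (index, index, 1)
    match = RANGE_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"cannot parse range: {text!r}")
    # Missing bounds default to 1 and -1; a missing step defaults to 1.
    start = int(match["start"]) if match["start"] else 1
    stop = int(match["stop"]) if match["stop"] else -1
    step = int(match["step"]) if match["step"] else 1
    if 0 in (start, stop, step):
        raise ValueError("FROM, TO and BY must be nonzero")
    return (start, stop, step)

def parse_range_list(text):
    """Parse a comma-separated RANGE-LIST into a list of (start, stop, step)."""
    return [parse_range(part) for part in text.split(",")]

print(parse_range_list("1..4,2:6,-1"))
print(parse_range("8..1:-1"))
```

In this sketch all three separator styles produce the same tuple, so "1..8", "1:8" and "1-8" are interchangeable.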

One-based indexing applies consistently in fascut whether indexing data by character (like cut -c) or by field. fascut -f cuts descriptions by field, after descriptions are split by strings of white-space, or optionally a user-defined Perl regex. Negative indices, starting with -1, count backwards from last characters and fields.

After converting negative indices to positive indices for a given input sequence, fascut expects the "from" parameter of every range to be less than its respective "to" parameter, unless a negative "by" parameter is given, in which case "from" is expected to be greater than "to". If this expectation is violated for a given range and sequence, that range is skipped for that sequence. If, after conversion, "from" is still negative, it defaults to "1". If "to" is greater than the sequence length, it defaults to "-1". However, if strict mode (-s) is enabled, both types of bad range specification abort processing of the sequence record, with no output for that sequence except a warning.
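The range-checking rules above can be sketched as follows. This is an illustrative Python model of the documented behaviour, not fascut's implementation: the function names are hypothetical, ranges are given as (from, to, by) tuples, and two points the text leaves open are treated as assumptions: a converted "from" of 0 is clamped like a negative one, and any bound still outside the data after the two documented defaults causes the range to be skipped.

```python
# Illustrative only: to_positive and cut are hypothetical names.

def to_positive(index, length):
    """Convert a one-based index; negative values count back from the end."""
    return index if index > 0 else length + index + 1

def cut(data, ranges, strict=False):
    """Concatenate selections from data; return None if strict mode rejects it."""
    n = len(data)
    pieces = []
    for start, stop, step in ranges:
        lo = to_positive(start, n)
        hi = to_positive(stop, n)
        clamped = False
        if lo < 1:            # "from" before the data: defaults to 1
            lo, clamped = 1, True
        if hi > n:            # "to" beyond the data: defaults to the last position
            hi, clamped = n, True
        # Forward ranges need from <= to; reversed ranges need from >= to.
        in_order = lo <= hi if step > 0 else lo >= hi
        # Assumption: bounds still outside the data make the range unusable.
        in_bounds = 1 <= lo <= n and 1 <= hi <= n
        usable = in_order and in_bounds
        if strict and (clamped or not usable):
            return None       # strict mode: no output for this record
        if not usable:
            continue          # default mode: skip only this range
        # One-based, inclusive at both ends.
        positions = range(lo, hi + 1, step) if step > 0 else range(lo, hi - 1, step)
        pieces.append("".join(data[i - 1] for i in positions))
    return "".join(pieces)

print(cut("ACGTACGT", [(1, 4, 1), (2, 6, 1), (-1, -1, 1)]))  # ACGTCGTACT
print(cut("ACGTACGT", [(-3, 1000, 1)]))                      # CGT
print(cut("ACGTACGT", [(-3, 1000, 1)], strict=True))         # None
```

Under these assumptions the out-of-bounds range -3..1000 is trimmed to the last three characters by default and rejected in strict mode.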

When the range-list argument starts with a negative index integer, you will need to terminate option processing with "--" before supplying the range-string argument to fascut.

Options specific to fascut:

    -i, --identifier             cuts identifiers (by character)
    -d, --description            cuts descriptions (by character)
    -f, --field                  cuts descriptions (by field)
    -S, --split-on-regex=<regex> split description to fields using <regex>
    -j, --join=<string>          join selected field ranges using <string>
    -s, --strict                 strict range-checking; skips sequences with
                                 warnings if ranges are out-of-bounds

Options general to FAST:

    -h, --help                   print a brief help message
    --man                        print full documentation
    --version                    print version
    -l, --log                    create/append to logfile
    -L, --logname=<string>       use logfile name <string>
    -C, --comment=<string>       save comment <string> to log
    --format=<format>            use alternative format for input
    --moltype=<[dna|rna|protein]> specify input sequence type
    -q, --fastq                  use fastq format as input and output

INPUT AND OUTPUT

fascut is part of FAST, the FAST Analysis of Sequences Toolbox, based on Bioperl. Most core FAST utilities expect input and return output in multifasta format. Input can occur in one or more files or on STDIN. Output occurs to STDOUT. The FAST utility fasconvert can reformat other formats to and from multifasta.

OPTIONS

-i --identifier

Cut identifiers by character. Use the -S option to alter how the identifier data is split.

-d --description

Cut descriptions by character. Use the -f option to split descriptions by strings of whitespace instead. Use the -S option to split the description with an arbitrary regex.

-f --field

Cut descriptions by field. By default, the description is split on strings of white-space.

-S regex --split-on-regex regex

Use regex to split data. Special characters in the regex option argument must be quoted to protect them from the shell.

-j string --join=string

Join the selected fields with string to form the new description. The default is a single space character " ". Use "\t" to indicate a tab character.
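The interaction of field splitting (-f, -S) and joining (-j) can be pictured with a short sketch. This is a simplified Python model, not fascut's code: cut_fields is a hypothetical name, it accepts only single one-based indices rather than full ranges, and silently dropping out-of-range indices is an assumption.

```python
import re

# Illustrative only: cut_fields is a hypothetical helper, not part of fascut.
def cut_fields(description, indices, split_on=r"\s+", join_with=" "):
    """Split description on a regex, pick one-based fields, and join them."""
    fields = re.split(split_on, description)
    n = len(fields)
    picked = []
    for index in indices:
        # Negative indices count back from the last field (-1 is the last one).
        position = index if index > 0 else n + index + 1
        if 1 <= position <= n:  # assumption: out-of-range indices are dropped
            picked.append(fields[position - 1])
    return join_with.join(picked)

# Default behaviour: split on runs of whitespace, join with a single space.
print(cut_fields("Homo sapiens chromosome 1", [1, 2]))

# Custom separator and joiner, analogous to -S '\|' and -j '\t'.
print(cut_fields("gene=abc|len=120|org=E. coli", [-1, 1], split_on=r"\|", join_with="\t"))
```

Note that the pipe character must be escaped in the regex, just as special characters passed to -S must be quoted on the command line.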

-s --strict

This option implements strict range checking on the coordinates. When used, it skips any sequence for which the coordinates are out of range (the default is to output the longest valid subsequence contained within the range).

-h, --help

Print a brief help message and exit.

--man

Print the manual page and exit.

--version

Print version information and exit.

-l, --log

Creates, or appends to, a generic FAST logfile in the current working directory. The logfile records date/time of execution, full command with options and arguments, and an optional comment.

-L [string], --logname=[string]

Use [string] as the name of the logfile. Default is "FAST.log.txt".

-C [string], --comment=[string]

Include comment [string] in logfile. No comment is saved by default.

--format=[format]

Use alternative format for input. See man page for "fasconvert" for allowed formats. This is for convenience; the FAST tools are designed to exchange data in Fasta format, and "fasta" is the default format for this tool.

-m [dna|rna|protein], --moltype=[dna|rna|protein]

Specify the type of sequence on input (should not be needed in most cases, but sometimes Bioperl cannot guess and complains when processing data).

-q, --fastq

Use fastq format as input and output.

EXAMPLES

Get the fourth base/aa of each sequence:

    fascut 4 < in.fas > out.fas

Get the first 8 bases/aas:

    fascut 1..8 < in.fas > out.fas

Get all but the first 3 bases/aas:

    fascut 4..-1 < in.fas > out.fas

Get the last 3 bases/aas:

    fascut -- -3..-1 < in.fas > out.fas

Get from the third-to-last base/aa through position 1000, if possible:

    fascut -- -3..1000 < in.fas > out.fas

Concatenate positions 1 to 4, positions 2 to 6, and the last base/aa:

    fascut 1..4,2..6,-1 < in.fas > out.fas

SEE ALSO

man perlre
perldoc perlre

Documentation on Perl regular expressions.

man FAST
perldoc FAST

Introduction and cookbook for FAST.

The FAST Home Page

CITING

If you use FAST, please cite Lawrence et al. (2015), "FAST: FAST Analysis of Sequences Toolbox", and Bioperl (Stajich et al.).