NAME

fastaxsort - sort sequence records based on NCBI taxonomy

SYNOPSIS

fastaxsort [OPTION]... [NODES-FILE] [NAMES-FILE] [MULTIFASTA-FILE]...

fastaxsort --tax-id-mode [OPTION]... [NODES-FILE] [MULTIFASTA-FILE]...

DESCRIPTION

fastaxsort takes NCBI Taxonomy data and multifasta-format sequence or alignment data as input. When directed to valid NCBI taxonomic names or IDs in the sequence records, it outputs those records sorted taxonomically; more specifically, sorted by pre-order depth-first traversal of the NCBI Taxonomy tree.
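To illustrate what sorting by pre-order depth-first traversal means, here is a minimal Python sketch on a toy tree. It is not fastaxsort's implementation: the tree, the record tuples, and the function names are hypothetical, and visiting siblings in ascending ID order is an assumption made only so the result is deterministic.

```python
from collections import defaultdict

# Hypothetical toy taxonomy: tax ID -> parent tax ID.
# The root is its own parent, as node 1 is in NCBI's nodes.dmp.
PARENT = {1: 1, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3}


def preorder_rank(parent):
    """Return {tax_id: position} for a pre-order depth-first walk from the root."""
    children = defaultdict(list)
    root = None
    for node, par in parent.items():
        if node == par:
            root = node
        else:
            children[par].append(node)
    rank = {}
    stack = [root]
    while stack:
        node = stack.pop()
        rank[node] = len(rank)  # a node is numbered before any of its descendants
        # Push children in reverse so the smallest ID is popped (visited) first.
        # Sibling order is an assumption of this sketch.
        stack.extend(sorted(children[node], reverse=True))
    return rank


def sort_records(records, parent):
    """Sort (tax_id, sequence) pairs by the pre-order position of their tax ID."""
    rank = preorder_rank(parent)
    return sorted(records, key=lambda rec: rank[rec[0]])


records = [(6, "ACGT"), (4, "GGCC"), (3, "TTAA"), (5, "CCGG"), (2, "AATT")]
sorted_records = sort_records(records, PARENT)
```

In this toy tree the walk visits 1, 2, 4, 5, 3, 6, so each taxon is followed immediately by all of its descendants before any sibling taxon appears.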

NCBI Taxonomy data must be downloaded separately from NCBI Taxonomy <http://www.ncbi.nlm.nih.gov/taxonomy>, specifically one of the files marked "taxdump", available from, for example, <ftp://ftp.ncbi.nih.gov/pub/taxonomy>. Only the files "nodes.dmp" and "names.dmp" in the downloaded data are used.
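For readers unfamiliar with the taxdump layout, the following Python sketch shows one way to read the two files. It relies on the taxdump convention that fields are separated by a tab, a pipe, and a tab, and that each row ends with a tab and a pipe. The function names are hypothetical, the sample rows are shortened for illustration (real "nodes.dmp" rows carry more columns), and keeping only names of class "scientific name" is an assumption of this sketch, not a statement about which names fastaxsort accepts.

```python
def parse_dmp_line(line):
    """Split one taxdump row into its fields."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]  # drop the row terminator
    return line.split("\t|\t")


def read_nodes(lines):
    """Map each tax ID to its parent tax ID (first two columns of nodes.dmp)."""
    parent = {}
    for line in lines:
        fields = parse_dmp_line(line)
        parent[int(fields[0])] = int(fields[1])
    return parent


def read_scientific_names(lines):
    """Map tax ID to name, using columns: tax ID, name, unique name, name class."""
    names = {}
    for line in lines:
        tax_id, name, _unique, name_class = parse_dmp_line(line)[:4]
        if name_class == "scientific name":  # assumption: ignore synonyms etc.
            names[int(tax_id)] = name
    return names


# Shortened sample rows, for illustration only.
nodes_sample = [
    "1\t|\t1\t|\tno rank\t|\n",
    "2\t|\t131567\t|\tsuperkingdom\t|\n",
]
names_sample = [
    "2\t|\tBacteria\t|\tBacteria <bacteria>\t|\tscientific name\t|\n",
    "2\t|\teubacteria\t|\t\t|\tgenbank common name\t|\n",
]

parent_of = read_nodes(nodes_sample)
name_of = read_scientific_names(names_sample)
```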

At least some part of each sequence record must contain an NCBI taxonomic name or ID. By default, the entire description is expected to match exactly one NCBI taxonomic name (or ID in --tax-id-mode). fastaxsort can optionally sort sequence records taxonomically by their identifiers, or by indexed fields within descriptions or identifiers. Fields are generated by splitting on a delimiter, by default one or more white-space characters. Alternative delimiters may be specified by a user-defined regex. Positive integers index fields from the beginning; the first field has index one. Negative integers index fields from the end.

Options specific to fastaxsort:

    -T, --tax-id-mode             sort records using NCBI taxonomic IDs in data
    -i, --identifier              sort records using sequence identifiers
    -f, --field=<int>             sort records using fields
    -S, --split-on-regex=<regex>  split descriptions or identifiers using regex
    -a, --annotate                annotate records with a dot-hex taxonomic address
    -j, --join=<string>           use <string> to join taxonomic addresses to descriptions
    --index                       output an index mapping dot-hex addresses to NCBI Taxonomy

Options general to FAST:

    -h, --help                          print a brief help message
    --man                               print full documentation
    --version                           print version
    -l, --log                           create/append to logfile
    -L, --logname=<string>              use logfile name <string>
    -C, --comment=<string>              save comment <string> to log
    --format=<format>                   use alternative format for input
    -m, --moltype=<[dna|rna|protein]>   specify input sequence type
    -q, --fastq                         use fastq format as input and output

INPUT AND OUTPUT

fastaxsort is part of FAST, the FAST Analysis of Sequences Toolbox, based on Bioperl. Most core FAST utilities expect input and return output in multifasta format. Input can occur in one or more files or on STDIN. Output occurs to STDOUT. The FAST utility fasconvert can reformat other formats to and from multifasta.

OPTIONS

-T --tax-id-mode

NCBI Taxonomic data in sequence records are IDs, not names.

-i --identifier

Taxa are sorted using sequence identifiers (default uses whole descriptions)

-f --field

Sort sequence records by values at a specific field in sequence descriptions or identifiers. With this option, the description or identifier is split into fields using strings of white space as field delimiters (the default Perl delimiter for splitting lines of data into fields; white-space characters are invalid in sequence identifiers).

This option takes a mandatory integer option argument giving the index for which field to sort by. One-based indexing is used, so the first field after the sequence identifier has index 1. As standard in Perl, negative indices count backwards from the last field; field "-1" is the last field, "-2" is the second-to-last etc. Sequence records for which the specified field does not exist will sort on a null key.

-S --split-on-regex

Use regex <regex> to split descriptions/identifiers for the -f option instead of the Perl default (which splits on one or more whitespace characters). Special characters must be quoted to protect them from the shell.
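The interaction of field indexing and splitting can be sketched in Python as follows. This is an illustration of the indexing rules described for -f and -S, not the tool's own code: the function name and the sample description are hypothetical, and rejecting index 0 is an assumption of this sketch, since the documentation only defines positive and negative indices.

```python
import re


def select_field(text, index, pattern=None):
    """Return the field at a one-based index (negative counts from the end).

    Returns None when the field does not exist, mirroring the "null key" case.
    """
    if pattern is None:
        fields = text.split()  # default: runs of white space delimit fields
    else:
        fields = re.split(pattern, text)  # user-supplied delimiter regex
    if index == 0:
        # Assumption: 0 is treated as invalid because indexing is one-based.
        raise ValueError("field indices are one-based; 0 is not valid")
    position = index - 1 if index > 0 else index
    try:
        return fields[position]
    except IndexError:
        return None


# Hypothetical description with " | " between fields.
description = "Ala | AGC | Homo sapiens"

third = select_field(description, 3, r" \| ")
last = select_field(description, -1, r" \| ")
first_word = select_field("Homo sapiens", 1)
missing = select_field("Homo sapiens", 5)
```

Here both the third field and the last field are "Homo sapiens", while the default white-space split of "Homo sapiens" yields "Homo" as field 1. This is why a taxonomic name containing spaces needs a delimiter regex other than the default.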

-a --annotate

Add FAST taxonomic addresses in dot-hex notation to sequence record descriptions

-j [string], --join=[string]

Use [string] to append FAST taxonomic addresses to sequence record descriptions, instead of default " ". Use "\t" to indicate a tab-character.

--index

Instead of printing the sorted sequence records, print a key that maps the taxonomic addresses generated by fastaxsort, in dot-hexadecimal notation, to NCBI taxonomic names or IDs.
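The exact encoding of a dot-hex address is not specified here, so the following Python sketch is only one plausible reading: it assumes an address is the path from the root to a taxon, with each step written as the zero-based position of the child among its siblings in hexadecimal, joined by dots. The toy tree, the function names, and the encoding itself are assumptions for illustration; fastaxsort's real addresses may differ.

```python
from collections import defaultdict

# Hypothetical toy taxonomy: tax ID -> parent tax ID (root is its own parent).
PARENT = {1: 1, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3}


def dot_hex_addresses(parent):
    """Assign each node a dotted path of hexadecimal sibling positions."""
    children = defaultdict(list)
    root = None
    for node, par in parent.items():
        if node == par:
            root = node
        else:
            children[par].append(node)
    addresses = {root: "0"}  # assumption: the root gets address "0"
    stack = [root]
    while stack:
        node = stack.pop()
        for position, child in enumerate(sorted(children[node])):
            addresses[child] = f"{addresses[node]}.{position:x}"
            stack.append(child)
    return addresses


def address_key(address):
    """Compare addresses component by component as numbers, not as text."""
    return tuple(int(part, 16) for part in address.split("."))


addresses = dot_hex_addresses(PARENT)
# An index in the spirit of --index: address -> tax ID, in taxonomic order.
index = sorted(
    ((addr, tax_id) for tax_id, addr in addresses.items()),
    key=lambda pair: address_key(pair[0]),
)
```

Under this assumed encoding, ordering the addresses component by component reproduces the pre-order traversal of the tree, which is what makes such an address usable as a sort key.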

-h, --help

Print a brief help message and exit.

--man

Print the manual page and exit.

--version

Print version information and exit.

-l, --log

Creates, or appends to, a generic FAST logfile in the current working directory. The logfile records date/time of execution, full command with options and arguments, and an optional comment.

-L [string], --logname=[string]

Use [string] as the name of the logfile. Default is "FAST.log.txt".

-C [string], --comment=[string]

Include comment [string] in logfile. No comment is saved by default.

--format=[format]

Use alternative format for input. See man page for "fasconvert" for allowed formats. This is for convenience; the FAST tools are designed to exchange data in Fasta format, and "fasta" is the default format for this tool.

-m [dna|rna|protein], --moltype=[dna|rna|protein]

Specify the type of sequence on input (should not be needed in most cases, but sometimes Bioperl cannot guess and complains when processing data).

-q --fastq

Use fastq format as input and output.

EXAMPLES

Sort sequences where the taxonomic identifier is found in the third field of the description:

    fastaxsort -f 3 -S " \| " nodes.dmp names.dmp tRNAdb-CE.sample2000.fas

SEE ALSO

man perlre
perldoc perlre

Documentation on perl regular expressions.

man FAST
perldoc FAST

Introduction and cookbook for FAST

The FAST Home Page

CITING

If you use FAST, please cite Lawrence et al. (2015), "FAST: FAST Analysis of Sequences Toolbox", Bioinformatics, and Bioperl (Stajich et al.).