NAME
fetch-tax.pl - Fetch (and format) information from the NCBI Taxonomy database
VERSION
version 0.242020
USAGE
fetch-tax.pl <infiles> --taxdir=<dir> [optional arguments]
REQUIRED ARGUMENTS
- <infiles>
-
Path to input IDL files [repeatable argument].
- --taxdir=<dir>
-
Path to local mirror of the NCBI Taxonomy database.
OPTIONAL ARGUMENTS
- --from-must
-
Consider the input file as generated by ed/treeplot [default: no]. Currently, this switches to the legacy .lis format (instead of the modern .idl format).
- --col[umn]=<n>
-
Column number providing the string to be used as the item [default: 1]. Columns are numbered as the shell would number them, i.e., starting at 1.
- --sep[arator]=<str>
-
Separator used to split columns [default: '\t'].
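The interplay of --column and --separator can be pictured with a minimal Python sketch. It is not the script's own code: the helper name extract_item and the sample rows are hypothetical, and only the 1-based numbering and the tab default come from the option descriptions above.

```python
def extract_item(line, column=1, separator="\t"):
    """Return the field found at the 1-based `column` of `line`.

    Hypothetical helper: mirrors the documented defaults
    (column 1, tab separator), not the script's actual parser.
    """
    if column < 1:
        raise ValueError("columns are numbered from 1")
    fields = line.rstrip("\n").split(separator)
    if column > len(fields):
        raise IndexError(f"line has only {len(fields)} column(s)")
    # Shell-style column n is list index n - 1.
    return fields[column - 1]


if __name__ == "__main__":
    row = "first\tsecond\tthird"  # made-up tab-separated row
    print(extract_item(row))                              # default column
    print(extract_item("a,b,c", column=2, separator=","))  # custom separator
```

With the defaults, the first tab-separated field is taken as the item; asking for a column beyond the end of the row is treated as an error in this sketch, which may differ from what the script does.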
- --item-type=<str>
-
Type of the items listed in the infile [default: mustid]. The following types are available:
- mustid (standard MUST ids, including the '@')
- baseid (base MUST ids, truncated before the '@')
- strain (catalog strains to coerce to NCBI names)
- name (NCBI names)
- lineage (NCBI (or SILVA) lineages separated by ';')
- taxid (NCBI taxon ids or GCA/GCF accessions)
- gi (NCBI GIs, complete accessions are allowed)
mustid and baseid items will both be analyzed by the Bio::MUST::Core::Taxonomy heuristics, whereas name items will be considered in full as NCBI names (which may correspond to higher taxa or include very detailed strain information). strain items will be coerced to NCBI names using the same heuristics as mustid and baseid items. In contrast, taxid items will be directly used to get the corresponding NCBI taxa.
As an additional possibility, gi items can be used. These are given either as mere GI numbers (e.g., 158280253 or gi|158280253) or as complete NCBI accessions beginning with the GI number (e.g., gi|158280253|gb|EDP06011.1|), which helps analysing BLAST reports obtained from searches against NCBI databases.
Using gi items requires having installed the GI-to-taxid mapper during setup of the local mirror of the NCBI Taxonomy database (see setup-taxdir.pl for details).
- --keep-strain
-
Include the NCBI strain in the generated mustid [default: no]. The original strain is slightly transformed and stripped of its non-alphanumeric characters for maximal compatibility with other software.
- --missing=<str>
-
String to substitute for missing taxonomies [default: none].
- --[no]item
-
[Don't] include list item in output [default: yes].
- --[no]taxid
-
[Don't] include NCBI taxon id in output [default: yes].
- --[no]mustid
-
[Don't] include base MUST id in output [default: yes].
- --[no]lineage
-
[Don't] include NCBI lineage in output [default: yes].
- --levels=<level>...
-
List of whitespace-separated levels to be displayed in NCBI lineages [default: all].
Only taxa corresponding to the specified levels will be kept; all others will be pruned. Taxon order will follow the input level order. Beware that invalid or missing levels will result in undef values in the corresponding slots.
Valid levels are: superkingdom, kingdom, subkingdom, superphylum, phylum, subphylum, superclass, class, subclass, infraclass, superorder, order, suborder, infraorder, parvorder, superfamily, family, subfamily, tribe, subtribe, genus, subgenus, 'species group', 'species subgroup', species, subspecies, varietas, forma.
Levels can also be specified as numbers but this only makes sense for the highest levels in the hierarchy (i.e., 3 to 5).
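The pruning and ordering rules for --levels can be illustrated with a small Python sketch. It is only a model of the behaviour described above, not the script's implementation: the function name prune_lineage, the (rank, taxon) pair representation and the sample lineage are assumptions made for the example, and Python's None stands in for Perl's undef.

```python
def prune_lineage(ranked_lineage, levels):
    """Keep only the taxa whose rank is listed in `levels`.

    `ranked_lineage` is a list of (rank, taxon) pairs (hypothetical
    representation). The result follows the order of `levels`, and a
    level absent from the lineage yields None (undef in Perl).
    """
    by_rank = {rank: taxon for rank, taxon in ranked_lineage}
    return [by_rank.get(level) for level in levels]


if __name__ == "__main__":
    # Illustrative lineage only, not fetched from any database.
    lineage = [
        ("superkingdom", "Eukaryota"),
        ("kingdom", "Metazoa"),
        ("phylum", "Chordata"),
        ("class", "Mammalia"),
        ("order", "Primates"),
        ("family", "Hominidae"),
        ("genus", "Homo"),
        ("species", "Homo sapiens"),
    ]
    print(prune_lineage(lineage, ["genus", "phylum", "tribe"]))
```

Here the requested order (genus before phylum) is preserved in the output, and the tribe slot is empty because the sample lineage has no taxon at that level.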
- --org-mapper
-
IDM output switch [default: no]. When specified, the output can be used as an IDM file listing the base MUST id => NCBI taxon id pairs. This option overrides all other output switches except the next one. Such IDM files are also compatible with 42's yaml-generator.pl.
- --legacy-nom=<level|file.fra>
-
Enable generation of a NOM file associating base MUST ids with groups, for use with MUSTED. Groups will be set to the taxa ranked at the specified level. Again, the level can be given as a number. When specified, this option overrides all other output switches.
Alternatively, the leaves of the systematic frame contained in the specified FRA file can be used to fine-tune the rank for each taxon, i.e., each base MUST id is associated with the first terminal taxon that is part of its lineage.
- --version
- --usage
- --help
- --man
-
Print the usual program information.
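As a companion to the gi item type described under --item-type, the following Python sketch shows one way the three documented forms (a bare GI number, gi|<number>, and a complete accession beginning with the GI number) could be reduced to the GI itself. The pattern and the function name gi_number are assumptions made for illustration; the parsing actually performed by Bio::MUST::Core may differ.

```python
import re

# Optional "gi|" prefix, then the GI digits, then optionally "|" and
# the rest of a complete accession (e.g. "|gb|EDP06011.1|").
GI_PATTERN = re.compile(r"^(?:gi\|)?(\d+)(?:\|.*)?$")


def gi_number(item):
    """Return the GI as an int, or None if `item` is not a gi item.

    Hypothetical helper, not part of fetch-tax.pl.
    """
    match = GI_PATTERN.match(item.strip())
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    for item in ("158280253", "gi|158280253", "gi|158280253|gb|EDP06011.1|"):
        print(item, "->", gi_number(item))
```

All three example items from the option description reduce to the same GI number, which is the key that the GI-to-taxid mapper would then need to resolve.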
AUTHOR
Denis BAURAIN <denis.baurain@uliege.be>
COPYRIGHT AND LICENSE
This software is copyright (c) 2013 by University of Liege / Unit of Eukaryotic Phylogenomics / Denis BAURAIN.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.