NAME

fetch-tax.pl - Fetch (and format) information from the NCBI Taxonomy database

VERSION

version 0.242020

USAGE

fetch-tax.pl <infiles> --taxdir=<dir> [optional arguments]

REQUIRED ARGUMENTS

<infiles>

Path to input IDL files [repeatable argument].

--taxdir=<dir>

Path to local mirror of the NCBI Taxonomy database.

OPTIONAL ARGUMENTS

--from-must

Treat the input files as generated by ed/treeplot [default: no]. Currently, this switches to the legacy .lis format (instead of the modern .idl format).

--col[umn]=<n>

Number of the column providing the string to be used as the item [default: 1]. Columns are numbered as the shell would number them, i.e., starting at 1.

--sep[arator]=<str>

Separator used to split columns [default: '\t'].
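
As an illustration of how --column and --separator combine, here is a minimal Python sketch (not the script's own Perl code). The function name extract_item and the sample row are hypothetical; only the 1-based numbering and the tab default come from the descriptions above.

```python
def extract_item(line, column=1, separator="\t"):
    """Return the field at a 1-based column index, as --column counts them."""
    fields = line.rstrip("\n").split(separator)
    if column < 1 or column > len(fields):
        raise IndexError(f"column {column} is out of range for this line")
    # Shell-style column 1 is Python index 0.
    return fields[column - 1]


row = "Homo sapiens\t9606\tprimate"  # made-up input line
print(extract_item(row))             # first column (the default)
print(extract_item(row, column=2))   # second column
```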

--item-type=<str>

Type of the items listed in the infile [default: mustid]. The following types are available:

- mustid  (standard MUST ids, including the '@')
- baseid  (base MUST ids, truncated before the '@')
- strain  (catalog strains to coerce to NCBI names)
- name    (NCBI names)
- lineage (NCBI (or SILVA) lineages separated by ';')
- taxid   (NCBI taxon ids or GCA/GCF accessions)
- gi      (NCBI GIs, complete accessions are allowed)

mustid and baseid items will both be analyzed by the Bio::MUST::Core::Taxonomy heuristics, whereas name items will be considered in full as NCBI names (which may correspond to higher taxa or include very detailed strain information). strain items will be coerced to NCBI names using the same heuristics as mustid and baseid items. In contrast, taxid items will be directly used to get the corresponding NCBI taxa.
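
The difference between mustid and baseid items can be pictured with a short Python sketch. It only shows the truncation before the '@'; the taxonomic heuristics of Bio::MUST::Core::Taxonomy are not reproduced. The function name base_id and the sample id are invented for this example.

```python
def base_id(must_id):
    """Keep only the part of a MUST id that precedes the first '@'."""
    return must_id.split("@", 1)[0]


# Hypothetical id, made up for this example.
print(base_id("Genus species_12345@ACC0001"))
```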

As an additional possibility, gi items can be used. These are given either as bare GI numbers (e.g., 158280253 or gi|158280253) or as complete NCBI accessions beginning with the GI number (e.g., gi|158280253|gb|EDP06011.1|), which helps when analysing BLAST reports obtained from searches against NCBI databases.

Using gi items requires having installed the GI-to-taxid mapper during setup of the local mirror of the NCBI Taxonomy database (see setup-taxdir.pl for details).
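
The three accepted forms of gi items can be reduced to a plain GI number as in the following Python sketch. The regular expression and the function name gi_number are assumptions made for illustration; the script's actual parsing may differ.

```python
import re

# Optional 'gi|' prefix, then the digits, then either '|' or the end of the item.
GI_PATTERN = re.compile(r"^(?:gi\|)?(\d+)(?:\||$)")


def gi_number(item):
    """Return the GI number as a string, or None if the item has no leading GI."""
    match = GI_PATTERN.match(item.strip())
    return match.group(1) if match else None


for item in ("158280253", "gi|158280253", "gi|158280253|gb|EDP06011.1|"):
    print(item, "->", gi_number(item))
```

An item such as EDP06011.1, which carries no leading GI number, yields None in this sketch.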

--keep-strain

Include the NCBI strain in the generated mustid [default: no]. The original strain is slightly transformed and stripped of its non-alphanumeric characters for maximal compatibility with other software.
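
The stripping step can be sketched in Python as follows. This covers only the removal of non-alphanumeric characters; the "slight transformation" mentioned above is not specified here and is therefore not modelled. The function name clean_strain is hypothetical.

```python
import re


def clean_strain(strain):
    """Drop every character that is not an ASCII letter or digit."""
    return re.sub(r"[^A-Za-z0-9]", "", strain)


print(clean_strain("K-12 substr. MG1655"))
```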

--missing=<str>

String to substitute for missing taxonomies [default: none].

--[no]item

[Don't] include list item in output [default: yes].

--[no]taxid

[Don't] include NCBI taxon id in output [default: yes].

--[no]mustid

[Don't] include base MUST id in output [default: yes].

--[no]lineage

[Don't] include NCBI lineage in output [default: yes].

--levels=<level>...

List of whitespace-separated levels to be displayed in NCBI lineages [default: all].

Only taxa corresponding to the specified levels will be kept; all others will be pruned. Taxon order will follow the input level order. Beware that invalid or missing levels will result in undef values at the corresponding slots.

Valid levels are: superkingdom, kingdom, subkingdom, superphylum, phylum, subphylum, superclass, class, subclass, infraclass, superorder, order, suborder, infraorder, parvorder, superfamily, family, subfamily, tribe, subtribe, genus, subgenus, 'species group', 'species subgroup', species, subspecies, varietas, forma.

Levels can also be specified as numbers, but this only makes sense for the highest levels in the hierarchy (i.e., 3 to 5).
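
The pruning behaviour for named levels can be sketched in Python as below. The data layout (a list of rank/taxon pairs) and the function name select_levels are assumptions for illustration; numeric levels are left out because their exact meaning is not detailed here. Python's None stands in for Perl's undef.

```python
def select_levels(lineage, levels):
    """Pick taxa by rank, in the order the levels were requested.

    lineage: list of (rank, taxon) pairs; levels: requested rank names.
    A level absent from the lineage (or invalid) yields None in its slot.
    """
    by_rank = {rank: taxon for rank, taxon in lineage}
    return [by_rank.get(level) for level in levels]


# Abridged lineage of Homo sapiens, for illustration only.
lineage = [
    ("superkingdom", "Eukaryota"),
    ("kingdom", "Metazoa"),
    ("phylum", "Chordata"),
    ("class", "Mammalia"),
    ("order", "Primates"),
    ("family", "Hominidae"),
    ("genus", "Homo"),
    ("species", "Homo sapiens"),
]
print(select_levels(lineage, ["genus", "phylum", "tribe"]))
# prints ['Homo', 'Chordata', None]
```

The output follows the requested order (genus before phylum), and the missing tribe level leaves an empty slot rather than shifting the other values.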

--org-mapper

IDM output switch [default: no]. When specified, the output can be used as an IDM file listing the base MUST id => NCBI taxon id pairs. This option overrides all other output switches except --legacy-nom. Such IDM files are also compatible with 42's yaml-generator.pl.

--legacy-nom=<level|file.fra>

Enable generation of a NOM file associating base MUST ids with groups, for use with MUSTED. Groups will be set to the taxa ranked at the specified level. As with --levels, the level can be given as a number. When specified, this option overrides all other output switches.

Alternatively, the leaves of the systematic frame contained in the specified FRA file can be used to fine-tune the rank for each taxon, i.e., each base MUST id is associated with the first terminal taxon that is part of its lineage.
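
One possible reading of the frame-based assignment is sketched below in Python. It assumes that "first" refers to the order of the leaves in the frame, which the description does not state explicitly; the function name assign_group and the sample frame are hypothetical, and FRA file parsing is not shown.

```python
def assign_group(lineage, frame_leaves):
    """Return the first frame leaf that occurs in the lineage, or None."""
    taxa = set(lineage)
    for leaf in frame_leaves:  # assumption: leaves are tried in frame order
        if leaf in taxa:
            return leaf
    return None


frame_leaves = ["Fungi", "Metazoa", "Viridiplantae"]  # made-up frame leaves
lineage = ["Eukaryota", "Opisthokonta", "Metazoa", "Chordata", "Homo sapiens"]
print(assign_group(lineage, frame_leaves))
```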

--version
--usage
--help
--man

Print the usual program information.

AUTHOR

Denis BAURAIN <denis.baurain@uliege.be>

COPYRIGHT AND LICENSE

This software is copyright (c) 2013 by University of Liege / Unit of Eukaryotic Phylogenomics / Denis BAURAIN.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.