NAME
Bio::LITE::Taxonomy::NCBI::Gi2taxid - Fast mapping of NCBI GIs to Taxids with a very low memory footprint.
SYNOPSIS
Creation of a new GI to Taxid dictionary (binary mapping file):
use Bio::LITE::Taxonomy::NCBI::Gi2taxid qw/new_dict/;
new_dict (in => "gi_taxid_prot.dmp",
out => "gi_taxid_prot.bin");
Usage of the dictionary:
use Bio::LITE::Taxonomy::NCBI::Gi2taxid;
my $dict = Bio::LITE::Taxonomy::NCBI::Gi2taxid->new(dict=>"gi_taxid_prot.bin");
my $taxid = $dict->get_taxid(12553);
DESCRIPTION
The NCBI site offers a file that maps gene and protein sequences (GIs) to their corresponding taxon of origin (Taxids). If you want to use this information inside a Perl script, you will find that, given the large number of sequences available, it is fairly inefficient to store it in, for example, a regular hash. Just creating such a hash requires more than 10 GB of system memory.
This is a very simple module designed to map NCBI GIs to Taxids efficiently, with speed as the primary goal and with low memory usage. Lookups are even faster than retrieving the mappings from an SQL database or from a local DBHash.
To achieve this, it uses a binary index that can be created with the function new_dict. This index has to be created only once for each mapping file.
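To illustrate why a binary index can be both fast and small, the following is a minimal sketch of one possible fixed-width layout. It is an assumption made for illustration only and is not necessarily the format that new_dict writes; the names build_index, lookup and $REC_LEN are hypothetical. Each GI is used as a record number, so a lookup is plain offset arithmetic with no hash and no text parsing.

```perl
use strict;
use warnings;

# Hypothetical layout (not the module's documented format): record N holds
# the taxid for GI N as a 32-bit big-endian unsigned integer; 0 = no mapping.
my $REC_LEN = 4;

# Build the packed index from [gi, taxid] pairs.
sub build_index {
    my (@pairs) = @_;
    my $index = '';
    for my $pair (@pairs) {
        my ($gi, $taxid) = @$pair;
        # Pad with zero bytes so that the record for this GI exists.
        my $need = ($gi + 1) * $REC_LEN;
        $index .= "\0" x ($need - length $index) if length($index) < $need;
        substr($index, $gi * $REC_LEN, $REC_LEN) = pack 'N', $taxid;
    }
    return $index;
}

# Look up a GI by computing its byte offset in the packed string.
sub lookup {
    my ($index, $gi) = @_;
    my $offset = $gi * $REC_LEN;
    return 0 if $offset + $REC_LEN > length $index;
    return unpack 'N', substr($index, $offset, $REC_LEN);
}

# Illustrative values only.
my $index = build_index([2, 9913], [5, 9606], [9, 562]);
print lookup($index, 5), "\n";   # prints 9606
print lookup($index, 7), "\n";   # prints 0 (no mapping stored for GI 7)
```

In this sketch the cost per GI is a fixed four bytes, whereas a Perl hash entry carries considerably more overhead per key and value.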
The original mapping files can be downloaded from the NCBI site at the following address: ftp://ftp.ncbi.nih.gov/pub/taxonomy/.
FUNCTIONS
new_dict
This function creates a new binary dictionary from the NCBI mapping file. The file must be uncompressed before it is passed to the function.
*WARNING* Version 0.05 uses a more compact binary file. Binary files created with earlier versions will not work with this one, and vice versa. You need to create the binary dictionary again with this version.
The function accepts the following parameters:
- in
This is the uncompressed mapping file from the NCBI. The function accepts a filename or a filehandle.
- out
Optional. Where the binary dictionary will be written. The function accepts a filename or a filehandle (which must be opened for writing). If absent, STDOUT is assumed.
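Because new_dict expects an uncompressed mapping file, a small guard can catch a forgotten decompression step before the dictionary is built. The sketch below is not part of the module; the helper name looks_gzipped is hypothetical. It only checks for the two gzip magic bytes at the start of the file, so other compression formats would go unnoticed.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the module): true if the file starts
# with the two gzip magic bytes 0x1f 0x8b.
sub looks_gzipped {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $n = read $fh, my $magic, 2;
    close $fh;
    return defined $n && $n == 2 && $magic eq "\x1f\x8b";
}

# Check the file named on the command line, if any.
my $in = shift @ARGV;
if (defined $in) {
    die "$in looks gzip-compressed; uncompress it first\n"
        if looks_gzipped($in);
    print "$in does not look gzip-compressed\n";
}
```

Once the check passes, the same filename can be given to new_dict as the in parameter.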
CONSTRUCTOR
new
Once the binary dictionary has been created, it can be used as an object through this constructor. It accepts the following parameters:
- dict
This is the binary dictionary obtained with the new_dict function. A filename or a filehandle is accepted.
- save_mem
Optional. Use this option to avoid loading the binary dictionary into memory. This saves almost 1 GB of system memory, but Taxid lookups will be ~20% slower. This option is off by default.
METHODS
get_taxid
This method receives a GI and returns the corresponding Taxid.
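A common use of get_taxid is to tally how many GIs in a list belong to each taxon. The following is a minimal sketch; the function name count_taxids is hypothetical, and it assumes that a GI without a mapping yields a false value (undef or 0), which is not stated in this documentation.

```perl
use strict;
use warnings;

# Count how many of the given GIs map to each taxid. $dict is any object
# that provides get_taxid, for example one created as in the SYNOPSIS:
#   my $dict = Bio::LITE::Taxonomy::NCBI::Gi2taxid->new(dict => "gi_taxid_prot.bin");
sub count_taxids {
    my ($dict, @gis) = @_;
    my %count;
    for my $gi (@gis) {
        my $taxid = $dict->get_taxid($gi);
        # Assumption: unmapped GIs come back as a false value; skip them.
        next unless $taxid;
        $count{$taxid}++;
    }
    return \%count;
}

# Usage, once $dict exists:
#   my $counts = count_taxids($dict, 12553, 12554);
#   print "$_\t$counts->{$_}\n" for sort { $a <=> $b } keys %$counts;
```

Keeping the dictionary object outside the function lets the same object serve many calls, so the binary file is loaded only once.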
SEE ALSO
AUTHOR
Miguel Pignatelli
Any comments should be addressed to emepyc@gmail.com
LICENSE
Copyright 2009 Miguel Pignatelli, all rights reserved.
This library is free software; you may redistribute it and/or modify it under the same terms as Perl itself.