NAME

Data::Babel - Translator for biological identifiers

VERSION

Version 1.10_01

SYNOPSIS

  use Data::Babel;
  use Data::Babel::Config;
  use Class::AutoDB;
  use DBI;

  # open database containing Babel metadata
  my $autodb=new Class::AutoDB(database=>'test');

  # try to get existing Babel from database
  my $babel=old Data::Babel(name=>'test',autodb=>$autodb);
  unless ($babel) {              
    # Babel does not yet exist, so we'll create it
    # idtypes, masters, maptables are names of configuration files that define 
    #   the Babel's component objects
    $babel=new Data::Babel
      (name=>'test',idtypes=>'examples/idtype.ini',masters=>'examples/master.ini',
       maptables=>'examples/maptable.ini');
  }
  # open database containing real data
  my $dbh=DBI->connect("dbi:mysql:database=test",undef,undef);

  # CAUTION: rest of SYNOPSIS assumes you've loaded the real database somehow
  # translate several Entrez Gene ids to other types
  my $table=$babel->translate
    (input_idtype=>'gene_entrez',
     input_ids=>[1,2,3],
     output_idtypes=>[qw(gene_symbol gene_ensembl chip_affy probe_affy)]);
  # print a few columns from each row of result
  for my $row (@$table) {
    print "Entrez gene=$row->[0]\tsymbol=$row->[1]\tEnsembl gene=$row->[2]\n";
  }
  # same translation but limit results to Affy hgu133a
  $table=$babel->translate
    (input_idtype=>'gene_entrez',
     input_ids=>[1,2,3],
     filters=>{chip_affy=>'hgu133a'},
     output_idtypes=>[qw(gene_symbol gene_ensembl chip_affy probe_affy)]);
  # generate a table mapping all Entrez Gene ids to UniProt ids
  $table=$babel->translate
    (input_idtype=>'gene_entrez',
     output_idtypes=>[qw(protein_uniprot)]);
  # convert to HASH for easy programmatic lookups
  my %gene2uniprot=map {$_->[0]=>$_->[1]} @$table;

DESCRIPTION

Data::Babel translates biological identifiers based on information contained in a database. Each Data::Babel object provides a unique mapping over a set of identifier types. The system as a whole can contain multiple Data::Babel objects; these may share some or all identifier types, and may provide the same or different mappings over the shared types.

The principal method is 'translate', which converts identifiers of one type into identifiers of one or more output types. In typical usage, you call 'translate' with a list of input ids to convert. You can also call it without any input ids (or with the special option 'input_ids_all' set) to generate a complete mapping of the input type to the output types; this is convenient if you want to keep the mapping for repeated use. You can also filter the output based on values of other identifier types.
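
One input id can map to several output ids, so a flat hash built from a complete mapping keeps only one match per key. The following is a minimal sketch of a lookup structure that keeps them all, assuming a result in the ARRAY of ARRAYs shape that 'translate' returns. The rows and the id values in them are invented sample data, not real query output.

```perl
use strict;
use warnings;

# Invented sample rows in the shape returned by 'translate':
# each row is [input_id, output_id].
my $table = [
  [1, 'uniprotA'],
  [2, 'uniprotB'],
  [2, 'uniprotC'],
];

# Group output ids by input id so one-to-many mappings survive.
my %gene2uniprot;
for my $row (@$table) {
  my ($gene, $uniprot) = @$row;
  push @{ $gene2uniprot{$gene} }, $uniprot;
}

for my $gene (sort { $a <=> $b } keys %gene2uniprot) {
  print "$gene => @{ $gene2uniprot{$gene} }\n";
}
```

Each hash value is an ARRAY of output ids, so a lookup such as $gene2uniprot{2} yields every id connected to that input.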

CAVEAT: Some features of Data::Babel are overly specific to the procedure we use to construct the underlying Babel database. We note such cases when they arise in the documentation below.

The main components of a Data::Babel object are

1. a list of Data::Babel::IdType objects, each representing a type of identifier
2. a list of Data::Babel::Master objects, one per IdType, providing
  • a master list of valid values for the type, and
  • optionally, a history mapping old values to current ones [NOT YET IMPLEMENTED]
3. a list of Data::Babel::MapTable objects which implement the mapping

One typically defines these components using configuration files whose basic format is defined in Config::IniFiles. See examples in "Configuration files" and the examples directory of the distribution.

Each MapTable represents a relational table stored in the database and provides a mapping over a subset of the Babel's IdTypes; the ensemble of MapTables must, of course, cover all the IdTypes. The ensemble of MapTables must also be non-redundant as explained in "Technical details".

You need not explicitly define Masters for all IdTypes; Babel will create 'implicit' Masters for any IdTypes lacking explicit ones. An implicit Master has a list of valid identifiers but no history and could be implemented as a view over all MapTables containing the IdType. In the current implementation, we use views for IdTypes contained in single MapTables but construct actual tables for IdTypes contained in multiple MapTables.

Configuration files

Our configuration files use 'ini' format as described in Config::IniFiles: 'ini' format files consist of a number of sections, each preceded with the section name in square brackets, followed by parameter names and their values.

There are separate config files for IdTypes, Masters, and MapTables. There are complete example files in the distribution. Here are some excerpts:

IdType

  [chip_affy]
  display_name=Affymetrix array
  referent=array
  defdb=affy
  meta=name
  format=/^[a-z]+\d+/
  sql_type=VARCHAR(32)

The section name is the IdType name. The parameters are

  • display_name. human readable name for this type

  • referent. the type of things to which this type of identifier refers

  • defdb. the database, if any, responsible for assigning this type of identifier

  • meta. some identifiers are purely synthetic (eg, Entrez gene IDs) while others have some mnemonic content; legal values are

    • eid (meaning synthetic)

    • symbol

    • name

    • description

  • format. Perl regular expression that valid identifiers must match

  • sql_type. SQL data type
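
As an illustration of how an application might use the 'format' parameter, here is a minimal sketch that compiles the value from the excerpt above and checks candidate ids against it. How Data::Babel itself applies 'format' is not described here, so treat this as one possible use; the candidate ids are invented.

```perl
use strict;
use warnings;

# 'format' value as it appears in the IdType excerpt
my $format = '/^[a-z]+\d+/';

# Strip the enclosing slashes and compile the pattern.
my ($pattern) = $format =~ m{\A/(.*)/\z}
  or die "format is not of the form /.../: $format";
my $regex = qr/$pattern/;

# Invented candidate ids
my @candidates = qw(hgu133a HGU133A 133a mouse4302);
my @valid = grep { $_ =~ $regex } @candidates;
print "valid: @valid\n";
```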

Master

  [gene_entrez_master]
  inputs=<<INPUTS
  MainData/GeneInformation
  INPUTS
  query=<<QUERY
  SELECT locus_link_eid AS gene_entrez FROM gene_information 
  QUERY

The section name is the Master name; the name of the IdType is the same but without the '_master'. The parameters are used by our database construction procedure and may not be useful in other settings.
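
The excerpt uses here-document values (name=<<TAG ... TAG), and the IdType name is derived from the section name. The sketch below shows both with a deliberately small, hand-rolled reader that handles only the subset of 'ini' syntax seen in these excerpts. The function parse_ini is hypothetical and is not how Config::IniFiles is implemented; real code should use Config::IniFiles.

```perl
use strict;
use warnings;

# Hypothetical minimal reader: [section] headers, name=value lines,
# and name=<<TAG ... TAG here-document values. Illustration only.
sub parse_ini {
  my ($text) = @_;
  my (%config, $section);
  my @lines = split /\n/, $text;
  while (defined(my $line = shift @lines)) {
    next if $line =~ /^\s*$/;
    if ($line =~ /^\s*\[(.+?)\]\s*$/) {
      $section = $1;
      $config{$section} ||= {};
    } elsif ($line =~ /^\s*(\w+)\s*=\s*<<(\w+)\s*$/) {
      my ($name, $tag) = ($1, $2);
      my @body;
      # collect lines up to the closing tag
      while (defined(my $next = shift @lines)) {
        last if $next =~ /^\s*\Q$tag\E\s*$/;
        push @body, $next;
      }
      $config{$section}{$name} = join "\n", @body;
    } elsif ($line =~ /^\s*(\w+)\s*=\s*(.*?)\s*$/) {
      $config{$section}{$1} = $2;
    }
  }
  return \%config;
}

my $ini = <<'END';
[gene_entrez_master]
inputs=<<INPUTS
MainData/GeneInformation
INPUTS
query=<<QUERY
SELECT locus_link_eid AS gene_entrez FROM gene_information
QUERY
END

my $config = parse_ini($ini);
for my $master (sort keys %$config) {
  # the IdType name is the Master name without '_master'
  (my $idtype = $master) =~ s/_master$//;
  print "$master is the Master for $idtype\n";
  print "  query: $config->{$master}{query}\n";
}
```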

MapTable

  [gene_entrez_information]
  inputs=MainData/GeneInformation 
  idtypes=gene_entrez gene_symbol gene_description organism_name_common
  query=<<QUERY
  SELECT 
         GENE.locus_link_eid AS gene_entrez, 
         GENE.symbol AS gene_symbol, 
         GENE.description AS gene_description,
         ORG.common_name AS organism_name_common
  FROM 
         gene_information AS GENE
         LEFT OUTER JOIN
         organism AS ORG ON GENE.organism_id=ORG.organism_id
  QUERY

  [% maptable %]
  inputs=MainData/GeneUnigene
  idtypes=gene_entrez gene_unigene
  query=<<QUERY
  SELECT UG.locus_link_eid AS gene_entrez, UG.unigene_eid AS gene_unigene
  FROM   gene_unigene AS UG
  QUERY

This excerpt has two MapTable definitions which illustrate two ways that MapTables can be named. The first uses a normal section name; the second invokes a Template Toolkit macro which generates unique names of the form 'maptable_001'. This is very convenient because Babel databases typically contain a large number of MapTables, and it's hard to come up with good names for most of them. In any case, the names don't matter much, because software generates the queries that operate on these tables.

The 'inputs' and 'query' parameters are used by our database construction procedure and may not be useful in other settings.

Input ids that do not connect to any outputs

The 'translate' method does not return any output for input identifiers that do not connect to any identifiers of the desired output types. In other words, 'translate' never returns output rows in which the output columns are all NULL.

An input identifier can fail to connect for several reasons:

1. The identifier does not exist in the Master table for the input IdType; this generally means that the input id is not valid.
2. The identifier exists in the Master table for the input IdType (hence is valid) but is not present in any MapTables; this is rare, because it means the identifier is valid but does not participate in any relationships.
3. The identifier exists in the Master table for the input IdType and one or more MapTables, but the rows that match the input contain NULLs for all output IdTypes; this is normal and simply means that the input doesn't connect to any ids of the desired output types.

If no output IdTypes are specified, 'translate' returns a row containing one element, namely, the input identifier, for each input id that exists in the corresponding Master table. This is the only way at present for the application to distinguish non-existent ids from ones that exist but don't connect.
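
An application can combine the two kinds of call to classify its input ids. The sketch below assumes 'translate' has been called twice, once with no output IdTypes and once with the desired ones. The two result tables are invented stand-ins so that the example runs without a database.

```perl
use strict;
use warnings;

my @input_ids = (1, 2, 3, 999);

# Invented stand-ins for two 'translate' results:
# $exists:    no output IdTypes, so one row per valid input id
# $connected: called with the desired output IdTypes
my $exists    = [[1], [2], [3]];
my $connected = [[1, 'symbolA'], [2, 'symbolB']];

my %exists    = map { ($_->[0] => 1) } @$exists;
my %connected = map { ($_->[0] => 1) } @$connected;

my %status;
for my $id (@input_ids) {
  $status{$id} = !$exists{$id}   ? 'not in Master'
               : $connected{$id} ? 'connected'
               :                   'valid but unconnected';
}
print "$_: $status{$_}\n" for @input_ids;
```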

Technical details

A basic Babel property is that translations are stable. You can add output types to a query without changing the answer for the types you had before, you can remove output types from the query without changing the answer for the ones that remain, and if you "reverse direction" and swap the input type with one of the outputs, you get everything that was in the original answer.

We accomplish this by requiring that the database of MapTables satisfy the universal relation property (a well-known concept in relational database theory), and that 'translate' retrieve a sub-table of the universal relation. Concretely, the universal relation is the natural full outer join of all the MapTables. 'translate' performs natural left outer joins starting with the Master table for the input IdType, and then including enough tables to connect the input and output IdTypes. Left outer joins suffice, because 'translate' starts with the Master.
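
To make the join semantics concrete, here is a minimal in-memory sketch of a natural left outer join that starts from a Master. It only illustrates the relational operation: Data::Babel works against database tables, and the function natural_left_join, the table contents and the id values are all hypothetical.

```perl
use strict;
use warnings;

# Natural left outer join of two tables held as ARRAYs of HASHes.
# Returns the joined rows and the joined column list.
sub natural_left_join {
  my ($left, $left_cols, $right, $right_cols) = @_;
  my %in_left    = map { ($_ => 1) } @$left_cols;
  my @shared     = grep {  $in_left{$_} } @$right_cols;
  my @right_only = grep { !$in_left{$_} } @$right_cols;
  my @rows;
  for my $l (@$left) {
    my $matched = 0;
    for my $r (@$right) {
      # natural join: rows must agree on every shared column
      next if grep { !defined $l->{$_} || !defined $r->{$_}
                       || $l->{$_} ne $r->{$_} } @shared;
      push @rows, { %$l, map { ($_ => $r->{$_}) } @right_only };
      $matched = 1;
    }
    # left outer: keep unmatched left rows, padded with NULLs
    push @rows, { %$l, map { ($_ => undef) } @right_only } unless $matched;
  }
  return (\@rows, [@$left_cols, @right_only]);
}

# Invented Master and MapTable contents
my $master  = [ { gene_entrez => 1 }, { gene_entrez => 2 }, { gene_entrez => 3 } ];
my $symbols = [ { gene_entrez => 1, gene_symbol => 'symbolA' },
                { gene_entrez => 2, gene_symbol => 'symbolB' } ];
my $unigene = [ { gene_entrez => 1, gene_unigene => 'unigeneA' } ];

my ($rows, $cols) = natural_left_join($master, ['gene_entrez'],
                                      $symbols, [qw(gene_entrez gene_symbol)]);
($rows, $cols) = natural_left_join($rows, $cols,
                                   $unigene, [qw(gene_entrez gene_unigene)]);

# Like 'translate', drop rows whose output columns are all NULL.
my @outputs = qw(gene_symbol gene_unigene);
$rows = [ grep { my $row = $_; grep { defined $row->{$_} } @outputs } @$rows ];

for my $row (@$rows) {
  print join("\t", map { defined $row->{$_} ? $row->{$_} : 'NULL' } @$cols), "\n";
}
```

Because the join starts from the Master, every valid input id is present before the first join, and no right-hand row can introduce an input id that the Master lacks.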

We further require that the database of MapTables be non-redundant. The basic idea is that a given IdType may not be present in multiple MapTables, unless it is being used as a join column. More technically, we require that the MapTables form a tree schema (another well-known concept in relational database theory), and any pair of MapTables have at most one IdType in common. As a consequence, there is essentially a single path between any pair of IdTypes.

To represent the connections between IdTypes and MapTables we use an undirected graph whose nodes represent IdTypes and MapTables, and whose edges go between each MapTable and the IdTypes it contains. In this representation, a non-redundant schema is a tree.

'translate' uses this graph to find the MapTables it must join to connect the input and output IdTypes. The algorithm is simple: start at the leaves and recursively prune back branches that do not contain the input or output IdTypes.
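
The pruning step can be sketched as follows. This is one possible realization of "start at the leaves and prune", not the module's actual code; the schema, the function maptables_to_join and the 'type:name' node labels are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical schema: MapTable name => IdTypes it contains.
my %maptables = (
  maptable_001 => [qw(gene_entrez gene_symbol)],
  maptable_002 => [qw(gene_entrez gene_unigene)],
  maptable_003 => [qw(gene_entrez probe_affy)],
  maptable_004 => [qw(probe_affy chip_affy)],
);

# Undirected graph as an adjacency hash. Node labels carry a type
# prefix so IdType and MapTable nodes cannot collide.
my %adj;
while (my ($maptable, $idtypes) = each %maptables) {
  for my $idtype (@$idtypes) {
    $adj{"maptable:$maptable"}{"idtype:$idtype"} = 1;
    $adj{"idtype:$idtype"}{"maptable:$maptable"} = 1;
  }
}

# Return the MapTables needed to connect the given IdTypes.
sub maptables_to_join {
  my ($adj, @wanted) = @_;
  my %keep = map { ("idtype:$_" => 1) } @wanted;
  # work on a copy so the schema graph is left intact
  my %g = map { my $n = $_; ($n => { %{ $adj->{$n} } }) } keys %$adj;
  my @leaves = grep { scalar(keys %{ $g{$_} }) <= 1 } keys %g;
  while (defined(my $leaf = shift @leaves)) {
    next if $keep{$leaf} || !exists $g{$leaf};
    # remove the leaf; its neighbor may become a leaf in turn
    for my $neighbor (keys %{ delete $g{$leaf} }) {
      delete $g{$neighbor}{$leaf};
      push @leaves, $neighbor if scalar(keys %{ $g{$neighbor} }) <= 1;
    }
  }
  my @needed = map { /^maptable:(.*)/ ? $1 : () } keys %g;
  return sort @needed;
}

my @joins = maptables_to_join(\%adj, qw(gene_symbol chip_affy));
print "join: @joins\n";
```

With gene_symbol and chip_affy as the wanted IdTypes, the branch through gene_unigene is pruned and the three MapTables on the path between the two IdTypes remain.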

METHODS AND FUNCTIONS

new

 Title   : new 
 Usage   : $babel=new Data::Babel
                      name=>$name,
                      idtypes=>$idtypes,masters=>$masters,maptables=>$maptables 
 Function: Create new Data::Babel object or fetch existing object from database
           and update its components.  Store the new or updated object.
 Returns : Data::Babel object
 Args    : name        eg, 'test'
           idtypes, masters, maptables
                       define component objects; see below
           old         existing Data::Babel object in case program already
                       fetched it (typically via 'old')
           autodb      Class::AutoDB object for database containing Babel.
                       class attribute, often set before running 'new'
 Notes   : 'name' is required. All other args are optional

The component object parameters can be any of the following:

1. filenames referring to configuration files that define the component objects
2. any other file descriptors that can be handled by the new method of Config::IniFiles, eg, filehandles and IO::File objects
3. objects of the appropriate type for each component, namely, Data::Babel::IdType, Data::Babel::Master, Data::Babel::MapTable, respectively
4. ARRAYs of the above

old

 Title   : old 
 Usage   : $babel=old Data::Babel($name)
           -- OR --
           $babel=old Data::Babel(name=>$name)
 Function: Fetch existing Data::Babel object from database          
 Returns : Data::Babel object or undef
 Args    : name of Data::Babel object, eg, 'test'
           if keyword form used, can also specify autodb to set the
           corresponding class attribute

attributes

The available object attributes are

  name       eg, 'test' 
  id         name prefixed with 'babel', eg, 'babel:test'. not really used.  
             exists for compatibility with component objects
  idtypes    ARRAY of this Babel's Data::Babel::IdType objects
  masters    ARRAY of this Babel's Data::Babel::Master objects
  maptables  ARRAY of this Babel's Data::Babel::MapTable objects

The available class attributes are

  autodb     Class::AutoDB object for database containing Babel

translate

 Title   : translate 
 Usage   : $table=$babel->translate
                     (input_idtype=>'gene_entrez',
                      input_ids=>[1,2,3],
                      filters=>{chip_affy=>'hgu133a'},
                      output_idtypes=>[qw(transcript_refseq transcript_ensembl)],
                      limit=>100)
 Function: Translate the input ids to ids of the output types
 Returns : table represented as an ARRAY of ARRAYS. Each inner ARRAY is one row
           of the result. The first element of each row is an input id; the rest
           are outputs in the same order as output_idtypes
 Args    : input_idtype   name of Data::Babel::IdType object or object
           input_ids      ARRAY of ids to be translated. If absent or undef, all
                          ids of the input type are translated. If an empty
                          array, ie, [], no ids are translated and the result
                          will be empty.
           input_ids_all  a more explicit way to specify that all ids of the 
                          input type should be translated.
           filters        HASH of conditions limiting the output; see below.
           output_idtypes ARRAY of names of Data::Babel::IdType objects or
                          objects
           limit          maximum number of rows to retrieve (optional)
 Notes   : Duplicate output columns are retained. 
           Does not return output rows in which the output columns are all NULL.
           If no output idtypes are specified, returns rows for which the input
           id exists in the corresponding Master table.
           The order of output rows is arbitrary.
           If input_ids is an empty ARRAY, ie, [], the result will be empty.
           It is an error to set both input_ids and input_ids_all.

The 'filters' argument is a HASH of types and values. The types can be names of Data::Babel::IdType objects or objects themselves. The values can be single values or ARRAYs of values. For example

  filters=>{chip_affy=>'hgu133a'}
  filters=>{chip_affy=>['hgu133a','hgu133plus2']}
  filters=>{chip_affy=>['hgu133a','hgu133plus2'],pathway_kegg_id=>4610}
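
The sketch below shows one plausible reading of these conditions, applied to rows held in memory. It assumes that the values listed for one type are alternatives and that conditions on different types must all hold; the text above does not spell this out, so check it against the module's behavior. The function apply_filters and the rows are hypothetical.

```perl
use strict;
use warnings;

# Keep rows (HASHes of IdType name => value) that satisfy every filter.
# ASSUMPTION: values for one type are ORed, different types are ANDed,
# and a NULL value never satisfies a filter.
sub apply_filters {
  my ($rows, $filters) = @_;
  my %allowed;
  while (my ($idtype, $values) = each %$filters) {
    # a filter value may be a single value or an ARRAY of values
    my @values = ref $values eq 'ARRAY' ? @$values : ($values);
    $allowed{$idtype} = { map { ($_ => 1) } @values };
  }
  return [ grep {
    my $row = $_;
    !grep { !defined $row->{$_} || !$allowed{$_}{ $row->{$_} } } keys %allowed
  } @$rows ];
}

# Invented rows
my $rows = [
  { gene_entrez => 1, chip_affy => 'hgu133a',     probe_affy => 'probeA' },
  { gene_entrez => 1, chip_affy => 'hgu95av2',    probe_affy => 'probeB' },
  { gene_entrez => 2, chip_affy => 'hgu133plus2', probe_affy => 'probeC' },
];

my $kept = apply_filters($rows, { chip_affy => ['hgu133a', 'hgu133plus2'] });
print "$_->{gene_entrez}\t$_->{chip_affy}\t$_->{probe_affy}\n" for @$kept;
```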

show

 Title   : show
 Usage   : $babel->show
 Function: Print object in readable form
 Returns : nothing useful
 Args    : none

check_schema

 Title   : check_schema
 Usage   : @errstrs=$babel->check_schema
           -- OR --
           $ok=$babel->check_schema
 Function: Validate schema. Presently checks that the schema graph is a tree
           and that every IdType is contained in some MapTable
 Returns : in array context, list of errors
           in scalar context, true if schema is good, false if schema is bad
 Args    : none
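
The two checks named above can be sketched on the schema graph as follows. A graph is a tree when it is connected and has one fewer edge than it has nodes. This is an illustration under those definitions, not the module's code; the function schema_errors and both schemas are hypothetical.

```perl
use strict;
use warnings;

# Return a list of error strings for a schema given as a list of
# IdType names plus a HASH of MapTable name => contained IdTypes.
sub schema_errors {
  my ($idtypes, $maptables) = @_;
  my (%adj, @errors);
  my $edges = 0;
  for my $maptable (sort keys %$maptables) {
    for my $idtype (@{ $maptables->{$maptable} }) {
      next if $adj{"maptable:$maptable"}{"idtype:$idtype"};
      $adj{"maptable:$maptable"}{"idtype:$idtype"} = 1;
      $adj{"idtype:$idtype"}{"maptable:$maptable"} = 1;
      $edges++;
    }
  }
  for my $idtype (@$idtypes) {
    push @errors, "IdType $idtype is not in any MapTable"
      unless $adj{"idtype:$idtype"};
  }
  my @nodes = sort keys %adj;
  if (@nodes) {
    # breadth-first search from an arbitrary node
    my %seen = ($nodes[0] => 1);
    my @todo = ($nodes[0]);
    while (defined(my $node = shift @todo)) {
      for my $next (keys %{ $adj{$node} }) {
        push @todo, $next unless $seen{$next}++;
      }
    }
    if (scalar(keys %seen) < scalar(@nodes)) {
      push @errors, 'schema graph is not connected';
    } elsif ($edges != scalar(@nodes) - 1) {
      push @errors, 'schema graph contains a cycle';
    }
  }
  return @errors;
}

my @good = schema_errors(
  [qw(gene_entrez gene_symbol gene_unigene)],
  { maptable_001 => [qw(gene_entrez gene_symbol)],
    maptable_002 => [qw(gene_entrez gene_unigene)] });
my @bad = schema_errors(
  [qw(gene_entrez gene_symbol chip_affy)],
  { maptable_001 => [qw(gene_entrez gene_symbol)],
    maptable_002 => [qw(gene_entrez gene_symbol)] });

print @good ? "first schema is bad\n" : "first schema is good\n";
print "error: $_\n" for @bad;
```

The second schema fails both checks: chip_affy appears in no MapTable, and the two MapTables share two IdTypes, which creates a cycle.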

check_contents - NOT YET IMPLEMENTED

 Title   : check_contents
 Usage   : $babel->check_contents
 Function: Validate contents of Babel database. Checks consistency of explicit
           Masters and MapTables
 Returns : boolean
 Args    : none

Objects have names and ids: names are strings like 'gene_entrez' and are unique for a given class of object; ids have a short form of the type prepended to the name, eg, 'idtype:gene_entrez', and are unique across all classes. We use ids as nodes in schema and query graphs. In most cases, applications should use names.

The methods in this section map names or ids to component objects, or (as a trivial convenience), convert ids to names.

name2idtype

 Title   : name2idtype
 Usage   : $idtype=$babel->name2idtype('gene_entrez')
 Function: Get the IdType object given its name
 Returns : Data::Babel::IdType object or undef
 Args    : name of object
 Notes   : only looks at this Babel's component objects

name2master

 Title   : name2master
 Usage   : $master=$babel->name2master('gene_entrez_master')
 Function: Get the Master object given its name
 Returns : Data::Babel::Master object or undef
 Args    : name of object
 Notes   : only looks at this Babel's component objects

name2maptable

 Title   : name2maptable
 Usage   : $maptable=$babel->name2maptable('maptable_012')
 Function: Get the MapTable object given its name
 Returns : Data::Babel::MapTable object or undef
 Args    : name of object
 Notes   : only looks at this Babel's component objects

id2object

 Title   : id2object
 Usage   : $object=$babel->id2object('idtype:gene_entrez')
 Function: Get object given its id
 Returns : Data::Babel::IdType, Data::Babel::Master, Data::Babel::MapTable
           object or undef
 Args    : id of object
 Notes   : only looks at this Babel's component objects

id2name

 Title   : id2name
 Usage   : $name=$babel->id2name('idtype:gene_entrez')
           -- OR --
           $name=Data::Babel->id2name('idtype:gene_entrez')
 Function: Convert object id to name
 Returns : string
 Args    : id of object
 Notes   : trivial convenience method

METHODS AND ATTRIBUTES OF COMPONENT CLASS Data::Babel::IdType

new

 Title   : new 
 Usage   : $idtype=new Data::Babel::IdType name=>$name,...
 Function: Create new Data::Babel::IdType object or fetch existing object from 
           database and update its components. Store the new or updated object.
 Returns : Data::Babel::IdType object
 Args    : any attributes listed in the attributes section below, except 'id'
           (because it is computed from name)
           old         existing Data::Babel::IdType object in case program
                       already fetched it (typically via 'old')
           autodb      Class::AutoDB object for database containing Babel.
                       class attribute, often set before running 'new'
 Notes   : 'name' is required. All other args are optional

old

 Title   : old 
 Usage   : $idtype=old Data::Babel::IdType($name)
           -- OR --
           $idtype=old Data::Babel::IdType(name=>$name)
 Function: Fetch existing Data::Babel::IdType object from database          
 Returns : Data::Babel::IdType object or undef
 Args    : name of Data::Babel::IdType object, eg, 'gene_entrez'
           if keyword form used, can also specify autodb to set the
           corresponding class attribute

attributes

The available object attributes are

  name          eg, 'gene_entrez' 
  id            name prefixed with 'idtype', eg, 'idtype:gene_entrez'
  master        Data::Babel::Master object for this IdType
  maptables     ARRAY of Data::Babel::MapTable objects containing this IdType
  display_name  human readable name, eg, 'Entrez Gene ID'
  referent      the type of things to which this type of identifier refers
  defdb         the database, if any, which assigns identifiers
  meta          meta-type: eid (meaning synthetic), symbol, name, description
  format        Perl regular expression matching valid identifiers, eg, /^\d+$/
  perl_format   synonym for format
  sql_type      SQL data type, eg, INT(11)

The available class attributes are

  autodb     Class::AutoDB object for database containing Babel

degree

 Title   : degree 
 Usage   : $number=$idtype->degree
 Function: Tell how many Data::Babel::MapTables contain this IdType          
 Returns : number
 Args    : none

METHODS AND ATTRIBUTES OF COMPONENT CLASS Data::Babel::Master

new

 Title   : new 
 Usage   : $master=new Data::Babel::Master name=>$name,idtype=>$idtype,...
 Function: Create new Data::Babel::Master object or fetch existing object from 
           database and update its components. Store the new or updated object.
 Returns : Data::Babel::Master object
 Args    : any attributes listed in the attributes section below, except 'id'
           (because it is computed from name)
           old         existing Data::Babel::Master object in case program
                       already fetched it (typically via 'old')
           autodb      Class::AutoDB object for database containing Babel.
                       class attribute, often set before running 'new'
 Notes   : 'name' is required. All other args are optional

old

 Title   : old 
 Usage   : $master=old Data::Babel::Master($name)
           -- OR --
           $master=old Data::Babel::Master(name=>$name)
 Function: Fetch existing Data::Babel::Master object from database          
 Returns : Data::Babel::Master object or undef
 Args    : name of Data::Babel::Master object, eg, 'gene_entrez_master'
           if keyword form used, can also specify autodb to set the
           corresponding class attribute

attributes

The available object attributes are

  name          eg, 'gene_entrez_master' 
  id            name prefixed with 'master', eg, 'master:gene_entrez_master'
  idtype        Data::Babel::IdType object for which this is the Master
  implicit      boolean indicating whether Master is implicit
  explicit      opposite of implicit
  view          boolean indicating whether Master is implemented as a view
  inputs, namespace, query
                used by our database construction procedure

The available class attributes are

  autodb     Class::AutoDB object for database containing Babel

degree

 Title   : degree 
 Usage   : $number=$master->degree
 Function: Tell how many Data::Babel::MapTables contain this Master's IdType          
 Returns : number
 Args    : none

METHODS AND ATTRIBUTES OF COMPONENT CLASS Data::Babel::MapTable

new

 Title   : new 
 Usage   : $maptable=new Data::Babel::MapTable name=>$name,idtypes=>$idtypes,...
 Function: Create new Data::Babel::MapTable object or fetch existing object from 
           database and update its components. Store the new or updated object.
 Returns : Data::Babel::MapTable object
 Args    : any attributes listed in the attributes section below, except 'id'
           (because it is computed from name)
           old         existing Data::Babel::MapTable object in case program
                       already fetched it (typically via 'old')
           autodb      Class::AutoDB object for database containing Babel.
                       class attribute, often set before running 'new'
 Notes   : 'name' is required. All other args are optional

old

 Title   : old 
 Usage   : $maptable=old Data::Babel::MapTable($name)
           -- OR --
           $maptable=old Data::Babel::MapTable(name=>$name)
 Function: Fetch existing Data::Babel::MapTable object from database          
 Returns : Data::Babel::MapTable object or undef
 Args    : name of Data::Babel::MapTable object, eg, 'gene_entrez_information'
           if keyword form used, can also specify autodb to set the
           corresponding class attribute

attributes

The available object attributes are

  name          eg, 'gene_entrez_information' 
  id            name prefixed with 'maptable', eg, 'maptable:gene_entrez_information'
  idtypes       ARRAY of Data::Babel::IdType objects contained by this MapTable
  inputs, namespace, query
                used by our database construction procedure

The available class attributes are

  autodb     Class::AutoDB object for database containing Babel

SEE ALSO

I'm not aware of anything.

AUTHOR

Nat Goodman, <natg at shore.net>

BUGS AND CAVEATS

Please report any bugs or feature requests to bug-data-babel at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Data-Babel. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

Known Bugs and Caveats

1. The attributes of Master and MapTable objects are overly specific to the procedure we use to construct databases and may not be useful in other settings.
2. This class uses Class::AutoDB to store its metadata and inherits all the Known Bugs and Caveats of that module.

SUPPORT

You can find documentation for this module with the perldoc command.

    perldoc Data::Babel

ACKNOWLEDGEMENTS

This module extends a version developed by Victor Cassen.

LICENSE AND COPYRIGHT

Copyright 2010 Institute for Systems Biology

This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License.

See http://dev.perl.org/licenses/ for more information.