NAME

Bio::Graph::ProteinGraph - a representation of a protein interaction graph.

SYNOPSIS

# Read in from file
my $graphio = Bio::Graph::IO->new(-file   => 'myfile.dat',
                                  -format => 'dip');
my $graph   = $graphio->next_network();

Using ProteinGraph

  # Remove duplicate interactions from within a dataset
  $graph->remove_dup_edges();

  # Get a node (represented by a sequence object) from the graph.
  my $seqobj = $graph->nodes_by_id('P12345');

  # Get clustering coefficient of a given node.
  my $cc = $graph->clustering_coefficient($graph->nodes_by_id('NP_023232'));
  if ($cc != -1) {  ## result is -1 if it cannot be calculated
    print "CC for NP_023232 is $cc\n";
  }

  # Get graph density
  my $density = $graph->density();

  # Get connected subgraphs
  my @graphs = $graph->components();

  # Remove a node
  $graph->remove_nodes($graph->nodes_by_id('P12345'));

  # How many interactions are there?
  my $count = $graph->edge_count;

  # How many nodes are there?
  my $ncount = $graph->node_count();

  # Let's get interactions above a threshold confidence score.
  my $edges = $graph->edges;
  for my $edge (keys %$edges) {
    if (defined($edges->{$edge}->weight()) &&
        $edges->{$edge}->weight() > 0.6) {
      print $edges->{$edge}->object_id(), "\t",
            $edges->{$edge}->weight(), "\n";
    }
  }

  # Get interactors of your favourite protein
  my $node      = $graph->nodes_by_id('NP_023232');
  my @neighbors = $graph->neighbors($node);
  print "NP_023232 interacts with ";
  print join ", ", map { $_->object_id() } @neighbors;
  print "\n";

  # Annotate your sequences with interaction info
  my @seqs; ## array of sequence objects
  for my $seq (@seqs) {
    if ($graph->has_node($seq->accession_number)) {
      my $node      = $graph->nodes_by_id($seq->accession_number);
      my @neighbors = $graph->neighbors($node);
      for my $n (@neighbors) {
        my $ft = Bio::SeqFeature::Generic->new(
                     -primary_tag => 'Interactor',
                     -tag         => { id => $n->accession_number }
                     );
        $seq->add_SeqFeature($ft);
      }
    }
  }

  # Get proteins with > 10 interactors
  my @nodes = $graph->nodes();
  my @hubs;
  for my $node (@nodes) {
    if ($graph->neighbor_count($node) > 10) {
       push @hubs, $node;
    }
  }
  print "the following proteins have > 10 interactors:\n";
  print join "\n", map{$_->object_id()} @hubs;

  # Merge graphs 1 and 2 and flag duplicate edges
  $g1->union($g2);
  my @duplicates = $g1->dup_edges();
  print "these interactions exist in both graphs:\n";
  print join "\n", map{$_->object_id} @duplicates;

Creating networks from your own data

If you have interaction data in your own format, e.g.

  edgeid  node1  node2  score

you can build a graph from it like this:

  my $io = Bio::Root::IO->new(-file => 'mydata');
  my $gr = Bio::Graph::ProteinGraph->new();
  my %seen = (); # to record seen nodes
  while (my $l = $io->_readline() ) {

    # Parse out your data...
    my ($e_id, $n1, $n2, $sc) = split /\s+/, $l;

    # ...then make nodes if they don't already exist in the graph...
    my @nodes = ();
    for my $n ($n1, $n2) {
      if (!exists($seen{$n})) {
        push @nodes, Bio::Seq->new(-accession_number => $n);
        $seen{$n} = $nodes[$#nodes];
      } else {
        push @nodes, $seen{$n};
      }
    }

    # ...and add a new edge to the graph
    my $edge = Bio::Graph::Edge->new(-nodes  => \@nodes,
                                     -id     => $e_id,
                                     -weight => $sc);
    $gr->add_edge($edge);
  }

DESCRIPTION

A ProteinGraph is a representation of a protein interaction network. It derives most of its functionality from the Bio::Graph::SimpleGraph module, but is adapted to be able to use protein identifiers to identify the nodes.

This graph can use any objects that implement the Bio::AnnotatableI and Bio::IdentifiableI interfaces. Bio::Seq (but not Bio::PrimarySeqI) objects can therefore be used for the nodes, but any object that supports annotation objects and the object_id() method should work fine.

At present it is fairly 'lightweight' in that it represents nodes and edges but does not contain all the data about experiment ids etc. found in the Proteomics Standards Initiative (PSI) schema. Hopefully that will be available soon.

A dataset may contain duplicate or redundant interactions. Duplicate interactions are interactions that occur twice in the dataset but with a different interaction ID, perhaps from a different experiment. The dup_edges method will retrieve these.

Redundant interactions are interactions that occur twice or more in a dataset with the same interaction ID. These are more likely to be due to database errors. The redundant_edge method will retrieve these. Both methods are useful when merging two datasets using the union() method: interactions present in both datasets, with different IDs, will be duplicate edges.

For Developers

In this module, nodes are represented by Bio::Seq::RichSeq objects containing all possible database identifiers but no sequence, as parsed from the interaction files. However, a node represented by a Bio::PrimarySeq object should work fine too.

Edges are represented by Bio::Graph::Edge objects, which implement Bio::IdentifiableI. In order to work with SimpleGraph these objects must be array references, with the first two elements being references to the two nodes. More data can be added in $e[2], etc.

At present edges only have an identifier and a weight() method, to hold confidence data, but subclasses of this could hold all the interaction data held in an XML document.

So, a graph has the following data:

1. A hash of nodes ('_nodes'), where keys are the text representation of a node's memory address and values are the sequence object references.

2. A hash of neighbors ('_neighbors'), where keys are the text representation of a node's memory address and a value is a reference to a list of neighboring node references.

3. A hash of edges ('_edges'), where a key is a text representation of the two nodes, e.g., "address1,address2" as a string, and values are Bio::Graph::Edge objects.

4. A look-up hash ('_id_map') for finding a node by any of its ids.

5. A look-up hash for edges ('_edge_id_map') for retrieving an edge object from its identifier.

6. Hash ('_components').

7. An array of duplicate edges ('_dup_edges').

8. Hash ('_is_connected').
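
The first five structures in the list above can be pictured with plain Perl hashes. The following is only an illustrative sketch, not the module's code: the nodes are plain hash references standing in for sequence objects, %graph and the 'ids' field are hypothetical names, and sorting the two addresses to build the edge key is an assumption made here so that the key does not depend on node order.

```perl
use strict;
use warnings;

# Stand-in nodes: plain hash refs instead of Bio::Seq objects (assumption).
my $n1 = { ids => ['P12345', 'NP_000001'] };
my $n2 = { ids => ['Q99999'] };

# Stand-in edge: an array ref whose first two elements are the node refs.
my $edge = [ $n1, $n2, { id => 'E1', weight => 0.8 } ];

my %graph = (
    _nodes       => {},   # "$node" (stringified address) => node reference
    _neighbors   => {},   # "$node" => [ neighboring node references ]
    _edges       => {},   # "address1,address2" => edge
    _id_map      => {},   # any identifier => node reference
    _edge_id_map => {},   # edge identifier => edge
);

for my $node ($n1, $n2) {
    $graph{_nodes}{"$node"} = $node;
    $graph{_id_map}{$_} = $node for @{ $node->{ids} };
}
push @{ $graph{_neighbors}{"$n1"} }, $n2;
push @{ $graph{_neighbors}{"$n2"} }, $n1;

# Assumption: sort the addresses so (n1,n2) and (n2,n1) share one key.
my $key = join ',', sort { $a cmp $b } "$n1", "$n2";
$graph{_edges}{$key}       = $edge;
$graph{_edge_id_map}{'E1'} = $edge;

# Find a node by a secondary identifier, then list its neighbors' first ids.
my $found       = $graph{_id_map}{'NP_000001'};
my @partner_ids = map { $_->{ids}[0] } @{ $graph{_neighbors}{"$found"} };
print "NP_000001 interacts with @partner_ids\n";
```

Keying '_nodes' and '_neighbors' on the stringified reference is what lets one node carry several identifiers while still being stored once.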

REQUIREMENTS

To use this code you will need the Clone.pm module available from CPAN. You also need Class::AutoClass, available from CPAN as well. To read in XML data you will need XML::Twig, also available from CPAN.

SEE ALSO

Bio::Graph::SimpleGraph, Bio::Graph::IO, Bio::Graph::Edge, Bio::Graph::IO::dip, Bio::Graph::IO::psi_xml

FEEDBACK

Mailing Lists

User feedback is an integral part of the evolution of this and other Bioperl modules. Send your comments and suggestions preferably to one of the Bioperl mailing lists. Your participation is much appreciated.

bioperl-l@bioperl.org                  - General discussion
http://bioperl.org/wiki/Mailing_lists  - About the mailing lists

Reporting Bugs

Report bugs to the Bioperl bug tracking system to help us keep track of the bugs and their resolution. Bug reports can be submitted via the web:

http://bugzilla.open-bio.org/

AUTHORS

Richard Adams - this module, Graph::IO modules.

Email richard.adams@ed.ac.uk

AUTHOR2

Nat Goodman - SimpleGraph.pm, and all underlying graph algorithms.

has_node

Name      : has_node
Purpose   : Is a protein in the graph?
Usage     : if ($g->has_node('NP_23456')) {....}
Returns   : 1 if true, 0 if false
Arguments : A sequence identifier.

nodes_by_id

Name      : nodes_by_id
Purpose   : get node memory address from an id
Usage     : my @neighbors = $self->neighbors($self->nodes_by_id('O232322'))
Returns   : a SimpleGraph node representation (a text representation
            of a node needed for other graph methods, e.g.,
            neighbors(), edges())
Arguments : a protein identifier, e.g., its accession number.

union

Name        : union
Purpose     : To merge two graphs together, flagging interactions as 
              duplicate.
Usage       : $g1->union($g2), where g1 and g2 are 2 graph objects. 
Returns     : void, $g1 is modified
Arguments   : A Graph object of the same class as the calling object. 
Description : This method merges 2 graphs. The calling graph is modified;
              the parameter graph ($g2 in usage) is unchanged. To take
              account of differing IDs identifying the same protein, all
              ids are compared. The following rules are used to modify $g1.

              First of all both graphs are scanned for nodes that share
              an id in common.

        1. If 2 nodes (proteins) share an interaction in both graphs,
           the edge in graph 2 is copied to graph 1 and added as a
           duplicate edge to graph 1.

        2. If 2 nodes interact in $g2 but not $g1, but both nodes exist
           in $g1, the attributes of the interaction in $g2 are
           used to make a new edge in $g1.

        3. If 2 nodes interact in $g2 but not $g1, and 1 of them is a new
           protein, that protein is put in $g1 and a new edge made to
           it.

        4. At present, if there is an interaction in $g2 composed of a
           pair of interactors that are not present in $g1, they are 
           not copied to $g1. This is rather conservative but prevents
           the problem of having redundant nodes in $g1 due to the same
           protein being identified by different ids in the same graph.

        So, for example 

             Edge   N1  N2 Comment

   Graph 1:  E1     P1  P2
             E2     P3  P4
             E3     P1  P4

   Graph 2:  X1     P1  P2 - will be added as duplicate to Graph1
             X2     P1  X4 - X4 added to Graph 1 and new edge made
             X3     P2  P3 - new edge links existing proteins in G1
             X4     Z4  Z5 - not added to Graph1. Are these different
                             proteins or synonyms for proteins in G1?
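
The four rules can be traced on the example table with a small stand-alone sketch. It is not the module's implementation: each protein is assumed to have a single id (the real method compares all ids of each node), graphs are plain hashes of edge id to node-id pairs, and the names %outcome and pair_key are hypothetical. Membership is tested against a snapshot of graph 1 taken before merging, which is also an assumption.

```perl
use strict;
use warnings;

# Each graph maps an edge id to its two node ids (one id per protein
# is a simplifying assumption).
my %g1 = ( E1 => ['P1', 'P2'], E2 => ['P3', 'P4'], E3 => ['P1', 'P4'] );
my %g2 = ( X1 => ['P1', 'P2'], X2 => ['P1', 'X4'],
           X3 => ['P2', 'P3'], X4 => ['Z4', 'Z5'] );

# Order-independent key for a pair of node ids.
sub pair_key { return join ',', sort { $a cmp $b } @_ }

# Snapshot of the nodes and node pairs in graph 1 before merging.
my (%g1_nodes, %g1_pairs);
for my $pair (values %g1) {
    $g1_nodes{$_} = 1 for @$pair;
    $g1_pairs{ pair_key(@$pair) } = 1;
}

my (%outcome, @dup_edges);
for my $id (sort keys %g2) {
    my ($x, $y) = @{ $g2{$id} };
    my $known = grep { $g1_nodes{$_} } $x, $y;   # nodes already in graph 1
    if ($g1_pairs{ pair_key($x, $y) }) {         # rule 1: same interaction
        push @dup_edges, $id;
        $outcome{$id} = 'duplicate';
    }
    elsif ($known == 2) {                        # rule 2: both nodes exist
        $g1{$id} = [$x, $y];
        $outcome{$id} = 'new edge';
    }
    elsif ($known == 1) {                        # rule 3: one new protein
        $g1{$id} = [$x, $y];
        $outcome{$id} = 'new node and edge';
    }
    else {                                       # rule 4: both unknown
        $outcome{$id} = 'skipped';
    }
    print "$id: $outcome{$id}\n";
}
```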

edge_count

Name     : edge_count
Purpose  : returns number of unique interactions, excluding 
           redundancies/duplicates
Arguments: void
Returns  : An integer
Usage    : my $count  = $graph->edge_count;

node_count

Name     : node_count
Purpose  : returns number of nodes.
Arguments: void
Returns  : An integer
Usage    : my $count = $graph->node_count;

neighbor_count

Name      : neighbor_count
Purpose   : returns number of neighbors of a given node
Usage     : my $count = $gr->neighbor_count($node)
Arguments : a node object
Returns   : an integer

_get_ids_by_db

Name     : _get_ids_by_db
Purpose  : gets all ids for a node, assuming it is a Bio::Seq object
Arguments: A Bio::SeqI object
Returns  : A hash: Keys are db ids, values are accessions
Usage    : my %ids = $gr->_get_ids_by_db($seqobj);

add_edge

Name        : add_edge
Purpose     : adds an interaction to a graph.
Usage       : $gr->add_edge($edge)
Arguments   : a Bio::Graph::Edge object, or a reference to a 2 element list. 
Returns     : void
Description : This is the method to use to add an interaction to a graph.
              It contains the logic used to determine if an edge is a
              new edge, a duplicate (an existing interaction with a
              different edge id) or a redundant edge (same interaction,
              same edge id).
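
The three-way decision described above can be sketched without the module. This is a hedged illustration only: nodes are plain id strings, and classify_and_add, %ids_for_pair, @dup_edges and @redundant_edges are hypothetical names, not part of the ProteinGraph API.

```perl
use strict;
use warnings;

# For each node pair, the set of edge ids already seen for it.
my (%ids_for_pair, @dup_edges, @redundant_edges);

# Returns 'new', 'duplicate' or 'redundant' for an edge id and its two nodes.
sub classify_and_add {
    my ($id, $n1, $n2) = @_;
    my $pair = join ',', sort { $a cmp $b } $n1, $n2;

    # No edge between these two nodes yet: a new interaction.
    if (!exists $ids_for_pair{$pair}) {
        $ids_for_pair{$pair} = { $id => 1 };
        return 'new';
    }
    # Same interaction, same edge id: redundant.
    if ($ids_for_pair{$pair}{$id}) {
        push @redundant_edges, $id;
        return 'redundant';
    }
    # Same interaction, different edge id: duplicate.
    $ids_for_pair{$pair}{$id} = 1;
    push @dup_edges, $id;
    return 'duplicate';
}

my @results;
for my $e (['E1', 'P1', 'P2'], ['X1', 'P2', 'P1'],
           ['E1', 'P1', 'P2'], ['E2', 'P3', 'P4']) {
    push @results, classify_and_add(@$e);
}
print join(', ', @results), "\n";
```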

subgraph

Name      : subgraph
Purpose   : To construct a subgraph of nodes from the main network. This
            method overrides that of Bio::Graph::SimpleGraph in its dealings with
            Edge objects.
Usage     : my $sg = $gr->subgraph(@nodes);
Returns   : A subgraph of the same class as the original graph. Edge objects are 
            cloned from the original graph but node objects are shared, so beware if you 
            start deleting nodes from the parent graph whilst operating on subgraph nodes. 
Arguments : A list of node objects.

add_dup_edge

Name       : add_dup_edge
Purpose    : to flag an interaction as a duplicate, taking advantage of
             edge ids. The idea is that interactions from 2 sources with
             different interaction ids can be used to provide more
             evidence for an interaction being true, while preventing
             redundancy of the same interaction being present more than
             once in the same dataset.
Returns    : 1 on successful addition, 0 on there being an existing
             duplicate.
Usage      : $gr->add_dup_edge(Bio::Graph::Edge->new(-nodes  => [$n1, $n2],
                                                     -weight => $score,
                                                     -id     => $id));
Arguments  : an EdgeI implementing object.

edge_by_id

Name        : edge_by_id
Purpose     : retrieve data about an edge from its id
Arguments   : a text identifier
Returns     : a Bio::Graph::Edge object or undef
Usage       : my $edge = $gr->edge_by_id('1000E');

remove_dup_edges

Name        : remove_dup_edges
Purpose     : removes duplicate edges from graph
Arguments   : none         - removes all duplicate edges
              edge id list - removes specified edges
Returns     : void
Usage       :    $gr->remove_dup_edges()
              or $gr->remove_dup_edges($edgeid1, $edgeid2);

redundant_edge

Name        : redundant_edge
Purpose     : adds/retrieves redundant edges to graph
Usage       : $gr->redundant_edge($edge)
Arguments   : none (getter) or a Bio::Graph::Edge object (setter).
Description : redundant edges are edges in a graph that have the
              same edge id, i.e., are 2 identical interactions.
              With an edge argument, adds it to the list; otherwise
              returns the list as a reference.

redundant_edges

Name         : redundant_edges
Purpose      : alias for redundant_edge

remove_redundant_edges

Name        : remove_redundant_edges
Purpose     : removes redundant edges from graph; used by remove_nodes().
              May be better as an internal method.
Arguments   : none         - removes all redundant edges
              edge id list - removes specified edges
Returns     : void
Usage       :    $gr->remove_redundant_edges()
              or $gr->remove_redundant_edges($edgeid1, $edgeid2);

clustering_coefficient

Name      : clustering_coefficient
Purpose   : determines the clustering coefficient of a node, a number
            in range 0-1 indicating the extent to which the neighbors of
            a node are interconnected.
Arguments : A sequence object (preferred) or a text identifier
Returns   : The clustering coefficient. 0 is a valid result.
            If the CC is not calculable (if the node has <2 neighbors),
            returns -1.
Usage     : my $node = $gr->nodes_by_id('P12345');
            my $cc   = $gr->clustering_coefficient($node);
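
As a worked illustration, the sketch below uses the usual definition of the clustering coefficient: with k neighbors and e edges among those neighbors, CC = 2e / (k(k-1)). It runs on a plain adjacency hash with hypothetical node names and does not call the module; whether the module computes the value exactly this way is an assumption.

```perl
use strict;
use warnings;

# Undirected adjacency as a hash of hashes; node names are hypothetical.
my %adj;
sub link_nodes {
    my ($x, $y) = @_;
    $adj{$x}{$y} = 1;
    $adj{$y}{$x} = 1;
}
link_nodes(@$_) for ['A', 'B'], ['A', 'C'], ['A', 'D'], ['B', 'C'], ['D', 'E'];

# Fraction of possible links among a node's neighbors that are present.
# Returns -1 when the node has fewer than 2 neighbors.
sub clustering_coefficient {
    my ($node) = @_;
    my @nbrs = sort keys %{ $adj{$node} || {} };
    my $k    = @nbrs;
    return -1 if $k < 2;

    my $links = 0;
    for my $i (0 .. $k - 2) {
        for my $j ($i + 1 .. $k - 1) {
            $links++ if $adj{ $nbrs[$i] }{ $nbrs[$j] };
        }
    }
    return 2 * $links / ($k * ($k - 1));
}

for my $node (sort keys %adj) {
    printf "%s: %.3f\n", $node, clustering_coefficient($node);
}
```

Here node A has three neighbors (B, C, D) with one link among them (B-C), giving 2*1/(3*2) = 1/3, while node E has a single neighbor and so yields -1.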

remove_nodes

Name      : remove_nodes
Purpose   : to delete a node from a graph, e.g., to simulate effect 
            of mutation
Usage     : $gr->remove_nodes($seqobj);
Arguments : a single $seqobj or list of seq objects (nodes)
Returns   : 1 on success

unconnected_nodes

Name      : unconnected_nodes
Purpose   : return a list of nodes with no connections. 
Arguments : none
Returns   : an array or array reference of unconnected nodes
Usage     : my @ucnodes = $gr->unconnected_nodes();

articulation_points

Name      : articulation_points
Purpose   : to find nodes in a graph that, if removed, will fragment
            the graph into islands.
Usage     : my @aps = $gr->articulation_points();
            for my $node (@aps) {
                print $node->accession_number, "\n";
            }
Arguments : none
Returns   : a list of references to nodes that will fragment the graph
            if deleted.
Notes     : This is a "slow but sure" method that works with graphs
            up to a few hundred nodes reasonably fast.
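
One way to picture a "slow but sure" approach is brute force: leave out each node in turn and check whether the number of connected components goes up. The sketch below does this on a plain adjacency hash with hypothetical node names. It is an illustration of the idea only and is not claimed to be the algorithm the module uses.

```perl
use strict;
use warnings;

# Undirected adjacency as a hash of hashes; node names are hypothetical.
my %adj;
sub link_nodes {
    my ($x, $y) = @_;
    $adj{$x}{$y} = 1;
    $adj{$y}{$x} = 1;
}
# A triangle A-B-C with a tail C-D-E.
link_nodes(@$_) for ['A', 'B'], ['B', 'C'], ['C', 'A'], ['C', 'D'], ['D', 'E'];

# Count connected components, ignoring the node named in $skip (if any).
sub component_count {
    my ($skip) = @_;
    my %seen;
    $seen{$skip} = 1 if defined $skip;
    my $count = 0;
    for my $start (sort keys %adj) {
        next if $seen{$start};
        $count++;
        $seen{$start} = 1;
        my @queue = ($start);
        # Breadth-first walk over everything reachable from $start.
        while (defined(my $node = shift @queue)) {
            for my $nbr (keys %{ $adj{$node} }) {
                next if $seen{$nbr};
                $seen{$nbr} = 1;
                push @queue, $nbr;
            }
        }
    }
    return $count;
}

# A node is an articulation point if leaving it out adds a component.
my $baseline = component_count();
my @artic = grep { component_count($_) > $baseline } sort keys %adj;
print "articulation points: @artic\n";
```

Each test costs one traversal of the whole graph, so the total work grows roughly with nodes times (nodes + edges), which is in line with the note above about graphs of a few hundred nodes.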

is_articulation_point

Name      : is_articulation_point
Purpose   : to determine if a given node is an articulation point or not. 
Usage     : if ($gr->is_articulation_point($node)) {.... 
Arguments : a text identifier for the protein or the node itself
Returns   : 1 if node is an articulation point, 0 if it is not