NAME
Bio::Graph::ProteinGraph - a representation of a protein interaction graph.
SYNOPSIS
# Read in from file
my $graphio = Bio::Graph::IO->new(-file => 'myfile.dat',
-format => 'dip');
my $graph = $graphio->next_network();
Using ProteinGraph
# Remove duplicate interactions from within a dataset
$graph->remove_dup_edges();
# Get a node (represented by a sequence object) from the graph.
my $seqobj = $graph->nodes_by_id('P12345');
# Get clustering coefficient of a given node.
my $cc = $graph->clustering_coefficient($graph->nodes_by_id('NP_023232'));
if ($cc != -1) { ## result is -1 if it cannot be calculated
print "CC for NP_023232 is $cc\n";
}
# Get graph density
my $density = $graph->density();
# Get connected subgraphs
my @graphs = $graph->components();
# Remove a node
$graph->remove_nodes($graph->nodes_by_id('P12345'));
# How many interactions are there?
my $count = $graph->edge_count;
# How many nodes are there?
my $ncount = $graph->node_count();
# Let's get interactions above a threshold confidence score.
my $edges = $graph->edges;
for my $edge (keys %$edges) {
if (defined($edges->{$edge}->weight()) &&
$edges->{$edge}->weight() > 0.6) {
print $edges->{$edge}->object_id(), "\t",
$edges->{$edge}->weight(),"\n";
}
}
# Get interactors of your favourite protein
my $node = $graph->nodes_by_id('NP_023232');
my @neighbors = $graph->neighbors($node);
print "NP_023232 interacts with ";
print join ", ", map { $_->object_id() } @neighbors;
print "\n";
# Annotate your sequences with interaction info
my @seqs; ## array of sequence objects
for my $seq(@seqs) {
if ($graph->has_node($seq->accession_number)) {
my $node = $graph->nodes_by_id( $seq->accession_number);
my @neighbors = $graph->neighbors($node);
for my $n (@neighbors) {
my $ft = Bio::SeqFeature::Generic->new(
-primary_tag => 'Interactor',
-tags => { id => $n->accession_number }
);
$seq->add_SeqFeature($ft);
}
}
}
# Get proteins with > 10 interactors
my @nodes = $graph->nodes();
my @hubs;
for my $node (@nodes) {
if ($graph->neighbor_count($node) > 10) {
push @hubs, $node;
}
}
print "the following proteins have > 10 interactors:\n";
print join "\n", map{$_->object_id()} @hubs;
# Merge graphs 1 and 2 and flag duplicate edges
$g1->union($g2);
my @duplicates = $g1->dup_edges();
print "these interactions exist in both graphs:\n";
print join "\n", map{$_->object_id} @duplicates;
Creating networks from your own data
If you have interaction data in your own format, e.g.
edgeid node1 node2 score
you can build a graph from it like this:
my $io = Bio::Root::IO->new(-file => 'mydata');
my $gr = Bio::Graph::ProteinGraph->new();
my %seen = (); # to record seen nodes
while (my $l = $io->_readline() ) {
chomp $l;
# Parse out your data...
my ($e_id, $n1, $n2, $sc) = split /\s+/, $l;
# ...then make nodes if they don't already exist in the graph...
my @nodes = ();
for my $n ($n1, $n2) {
if (!exists($seen{$n})) {
push @nodes, Bio::Seq->new(-accession_number => $n);
$seen{$n} = $nodes[$#nodes];
} else {
push @nodes, $seen{$n};
}
}
# ...and add a new edge to the graph. This happens inside the loop,
# so that each input line contributes its own edge id and score.
my $edge = Bio::Graph::Edge->new(-nodes => \@nodes,
-id => $e_id,
-weight => $sc);
$gr->add_edge($edge);
}
DESCRIPTION
A ProteinGraph is a representation of a protein interaction network. It derives most of its functionality from the Bio::Graph::SimpleGraph module, but is adapted to be able to use protein identifiers to identify the nodes.
This graph can use any objects that implement Bio::AnnotatableI and Bio::IdentifiableI interfaces. Bio::Seq (but not Bio::PrimarySeqI) objects can therefore be used for the nodes but any object that supports annotation objects and the object_id() method should work fine.
At present it is fairly 'lightweight' in that it represents nodes and edges but does not contain all the data about experiment ids etc. found in the Proteomics Standards Initiative schema. Hopefully that will be available soon.
A dataset may contain duplicate or redundant interactions. Duplicate interactions are interactions that occur twice in the dataset but with a different interaction ID, perhaps from a different experiment. The dup_edges method will retrieve these.
Redundant interactions are interactions that occur twice or more in a dataset with the same interaction ID. These are more likely to be due to database errors. The duplicate and redundant edge methods are useful when merging 2 datasets using the union() method: interactions present in both datasets, with different IDs, will be duplicate edges.
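The distinction between new, duplicate and redundant interactions can be sketched in plain Perl. This is only an illustration under stated assumptions: classify_edges is a hypothetical helper that is not part of this module, edges are simple [id, node1, node2] triples rather than Bio::Graph::Edge objects, and node names stand in for proteins.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Bio::Graph::ProteinGraph.
# Each edge is [ $edge_id, $node1, $node2 ]; returns one label per edge.
sub classify_edges {
    my @edges = @_;
    my %ids_for_pair;    # "node1,node2" => { edge ids already seen for that pair }
    my @labels;
    for my $e (@edges) {
        my ($id, $n1, $n2) = @$e;
        # Interactions are undirected, so sort the pair to get a single key
        my $pair = join ',', sort { $a cmp $b } $n1, $n2;
        my $seen = $ids_for_pair{$pair} ||= {};
        my $label;
        if    (!keys %$seen) { $label = 'new' }
        elsif ($seen->{$id}) { $label = 'redundant' }    # same pair, same id
        else                 { $label = 'duplicate' }    # same pair, different id
        $seen->{$id} = 1;
        push @labels, "$id:$label";
    }
    return @labels;
}

print "$_\n" for classify_edges(
    [ 'E1', 'P1', 'P2' ],
    [ 'E2', 'P2', 'P1' ],
    [ 'E1', 'P1', 'P2' ],
    [ 'E3', 'P3', 'P4' ],
);
```

Here E2 is labelled a duplicate of E1 (same pair of proteins, different id), while the second E1 is labelled redundant (same pair, same id).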
For Developers
In this module, nodes are represented by Bio::Seq::RichSeq objects containing all possible database identifiers but no sequence, as parsed from the interaction files. However, a node represented by a Bio::PrimarySeq object should work fine too.
Edges are represented by Bio::Graph::Edge objects, which implement Bio::IdentifiableI. In order to work with SimpleGraph these objects must be array references, with the first 2 elements being references to the 2 nodes. More data can be added in $e[2], etc.
At present edges only have an identifier and a weight() method, to hold confidence data, but subclasses of this could hold all the interaction data held in an XML document.
So, a graph has the following data:
1. A hash of nodes ('_nodes'), where keys are the text representation of a node's memory address and values are the sequence object references.
2. A hash of neighbors ('_neighbors'), where keys are the text representation of a node's memory address and a value is a reference to a list of neighboring node references.
3. A hash of edges ('_edges'), where a key is a text representation of the 2 nodes, e.g., "address1,address2" as a string, and values are Bio::Graph::Edge objects.
4. A lookup hash ('_id_map') for finding a node by any of its ids.
5. A lookup hash for edges ('_edge_id_map') for retrieving an edge object from its identifier.
6. Hash ('_components').
7. An array of duplicate edges ('_dup_edges').
8. Hash ('_is_connected').
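As a rough picture of items 1 to 5, the following plain-Perl sketch builds the same kind of hashes by hand. It is an assumption-laden illustration, not the module's code: nodes and edges are plain hashes standing in for sequence and Bio::Graph::Edge objects, and the stringified reference ("HASH(0x...)") is used as the text representation of a node's memory address.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for sequence objects: plain hashes carrying ids.
my $p1 = { ids => [ 'P12345', 'NP_023232' ] };
my $p2 = { ids => [ 'Q99999' ] };

my %graph = (
    _nodes       => {},
    _neighbors   => {},
    _edges       => {},
    _id_map      => {},
    _edge_id_map => {},
);

for my $node ($p1, $p2) {
    my $addr = "$node";                      # e.g. "HASH(0x55d0...)"
    $graph{_nodes}{$addr} = $node;
    # Any of a node's ids can be used to find it again
    $graph{_id_map}{$_} = $node for @{ $node->{ids} };
}

# One undirected interaction between the two nodes
my $edge = { id => 'E1', nodes => [ $p1, $p2 ], weight => 0.8 };
push @{ $graph{_neighbors}{"$p1"} }, $p2;
push @{ $graph{_neighbors}{"$p2"} }, $p1;
$graph{_edges}{"$p1,$p2"} = $edge;
$graph{_edge_id_map}{ $edge->{id} } = $edge;

my $found = $graph{_id_map}{'NP_023232'};
print "NP_023232 has ", scalar @{ $graph{_neighbors}{"$found"} }, " neighbor(s)\n";
```

Keying by address means two distinct objects for the same protein would be stored as two nodes, which is why the id lookup hash matters.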
REQUIREMENTS
To use this code you will need the Clone.pm module available from CPAN. You also need Class::AutoClass, available from CPAN as well. To read in XML data you will need XML::Twig available from CPAN.
SEE ALSO
Bio::Graph::SimpleGraph, Bio::Graph::IO, Bio::Graph::Edge, Bio::Graph::IO::dip, Bio::Graph::IO::psi_xml
FEEDBACK
Mailing Lists
User feedback is an integral part of the evolution of this and other Bioperl modules. Send your comments and suggestions preferably to one of the Bioperl mailing lists. Your participation is much appreciated.
bioperl-l@bioperl.org - General discussion
http://bioperl.org/wiki/Mailing_lists - About the mailing lists
Reporting Bugs
Report bugs to the Bioperl bug tracking system to help us keep track of the bugs and their resolution. Bug reports can be submitted via the web:
http://bugzilla.open-bio.org/
AUTHORS
Richard Adams - this module, Graph::IO modules.
Email richard.adams@ed.ac.uk
AUTHOR2
Nat Goodman - SimpleGraph.pm, and all underlying graph algorithms.
has_node
Name : has_node
Purpose : Is a protein in the graph?
Usage : if ($g->has_node('NP_23456')) {....}
Returns : 1 if true, 0 if false
Arguments : A sequence identifier.
nodes_by_id
Name : nodes_by_id
Purpose : get node memory address from an id
Usage : my @neighbors = $self->neighbors($self->nodes_by_id('O232322'));
Returns : a SimpleGraph node representation (a text representation
of a node needed for other graph methods, e.g.,
neighbors(), edges())
Arguments : a protein identifier, e.g., its accession number.
union
Name : union
Purpose : To merge two graphs together, flagging interactions as
duplicate.
Usage : $g1->union($g2), where g1 and g2 are 2 graph objects.
Returns : void, $g1 is modified
Arguments : A Graph object of the same class as the calling object.
Description : This method merges 2 graphs. The calling graph is modified;
the parameter graph ($g2 in usage) is unchanged. To take
account of differing IDs identifying the same protein, all
ids are compared. The following rules are used to modify $g1.
First of all, both graphs are scanned for nodes that share
an id in common.
1. If 2 nodes (proteins) share an interaction in both graphs,
the edge in $g2 is copied to $g1 and added as a
duplicate edge to $g1.
2. If 2 nodes interact in $g2 but not $g1, but both nodes exist
in $g1, the attributes of the interaction in $g2 are
used to make a new edge in $g1.
3. If 2 nodes interact in $g2 but not $g1, and 1 of them is a new
protein, that protein is put in $g1 and a new edge made to
it.
4. At present, if there is an interaction in $g2 composed of a
pair of interactors that are not present in $g1, they are
not copied to $g1. This is rather conservative but prevents
the problem of having redundant nodes in $g1 due to the same
protein being identified by different ids in the same graph.
So, for example
Edge N1 N2 Comment
Graph 1: E1 P1 P2
E2 P3 P4
E3 P1 P4
Graph 2: X1 P1 P2 - will be added as duplicate to Graph1
X2 P1 X4 - X4 added to Graph 1 and new edge made
X3 P2 P3 - new edge links existing proteins in G1
X4 Z4 Z5 - not added to Graph1. Are these different
proteins or synonyms for proteins in G1?
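The four rules can be traced on the example table with a small stand-alone sketch. It makes several assumptions that are not taken from the module: sketch_union is a hypothetical name, graphs are plain hashes of edge id => [node1, node2], node names stand in for proteins already matched by id, and the decision for each edge is made against the original contents of Graph 1.

```perl
use strict;
use warnings;

# Hypothetical sketch of the merge rules; does not call union() itself.
sub sketch_union {
    my ($g1, $g2) = @_;
    my (%nodes, %pairs);
    for my $e (values %$g1) {
        $nodes{$_} = 1 for @$e;                          # proteins known to G1
        $pairs{ join ',', sort { $a cmp $b } @$e } = 1;  # interactions known to G1
    }
    my %result;
    for my $id (sort keys %$g2) {
        my ($n1, $n2) = @{ $g2->{$id} };
        my $known = grep { $nodes{$_} } $n1, $n2;        # interactors G1 already has
        my $pair  = join ',', sort { $a cmp $b } $n1, $n2;
        $result{$id} = $pairs{$pair} ? 'duplicate edge'            # rule 1
                     : $known == 2   ? 'new edge, existing nodes'  # rule 2
                     : $known == 1   ? 'new node and new edge'     # rule 3
                     :                 'not copied';               # rule 4
    }
    return \%result;
}

my %graph1 = (E1 => [qw(P1 P2)], E2 => [qw(P3 P4)], E3 => [qw(P1 P4)]);
my %graph2 = (X1 => [qw(P1 P2)], X2 => [qw(P1 X4)],
              X3 => [qw(P2 P3)], X4 => [qw(Z4 Z5)]);

my $outcome = sketch_union(\%graph1, \%graph2);
print "$_: $outcome->{$_}\n" for sort keys %$outcome;
```

The outcome for X1 to X4 matches the comments in the table above.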
edge_count
Name : edge_count
Purpose : returns number of unique interactions, excluding
redundancies/duplicates
Arguments: void
Returns : An integer
Usage : my $count = $graph->edge_count;
node_count
Name : node_count
Purpose : returns number of nodes.
Arguments: void
Returns : An integer
Usage : my $count = $graph->node_count;
neighbor_count
Name : neighbor_count
Purpose : returns number of neighbors of a given node
Usage : my $count = $gr->neighbor_count($node)
Arguments : a node object
Returns : an integer
_get_ids_by_db
Name : _get_ids_by_db
Purpose : gets all ids for a node, assuming it is a Bio::Seq object
Arguments: A Bio::SeqI object
Returns : A hash: Keys are db ids, values are accessions
Usage : my %ids = $gr->_get_ids_by_db($seqobj);
add_edge
Name : add_edge
Purpose : adds an interaction to a graph.
Usage : $gr->add_edge($edge)
Arguments : a Bio::Graph::Edge object, or a reference to a 2 element list.
Returns : void
Description : This is the method to use to add an interaction to a graph.
It contains the logic used to determine if an edge is a
new edge, a duplicate (an existing interaction with a
different edge id) or a redundant edge (same interaction,
same edge id).
subgraph
Name : subgraph
Purpose : To construct a subgraph of nodes from the main network. This
method overrides that of Bio::Graph::SimpleGraph in its dealings with
Edge objects.
Usage : my $sg = $gr->subgraph(@nodes);
Returns : A subgraph of the same class as the original graph. Edge objects are
cloned from the original graph but node objects are shared, so beware if you
start deleting nodes from the parent graph whilst operating on subgraph nodes.
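The consequence of copied edges but shared nodes can be seen in a small plain-Perl sketch. It is not the module's implementation: sketch_subgraph is a hypothetical helper, nodes and edges are plain hashes, and a shallow hash copy stands in for whatever cloning the module performs.

```perl
use strict;
use warnings;

# Hypothetical stand-ins: nodes and edges are plain hashes.
my %node  = map { $_ => { id => $_ } } qw(P1 P2 P3);
my @edges = (
    { id => 'E1', weight => 0.9, nodes => [ @node{qw(P1 P2)} ] },
    { id => 'E2', weight => 0.4, nodes => [ @node{qw(P2 P3)} ] },
);

# Keep edges whose two nodes are both wanted: copy the edge, share the nodes.
sub sketch_subgraph {
    my ($all_edges, @wanted) = @_;
    my %keep = map { $_ => 1 } @wanted;      # keyed by stringified reference
    return map  { +{ %$_, nodes => [ @{ $_->{nodes} } ] } }
           grep { $keep{ $_->{nodes}[0] } && $keep{ $_->{nodes}[1] } }
           @$all_edges;
}

my ($copy) = sketch_subgraph(\@edges, @node{qw(P1 P2)});

$copy->{weight} = 0.1;                    # edge is a copy: parent keeps 0.9
$copy->{nodes}[0]{id} = 'P1_renamed';     # node is shared: parent sees this

print "parent edge weight: $edges[0]{weight}\n";
print "parent node id:     $node{P1}{id}\n";
```

Changes to an edge in the subgraph stay local, whereas changes to a node are visible from the parent graph.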
Arguments : A list of node objects.
add_dup_edge
Name : add_dup_edge
Purpose : to flag an interaction as a duplicate, taking advantage of
edge ids. The idea is that interactions from 2 sources with
different interaction ids can be used to provide more
evidence for an interaction being true, while preventing
redundancy of the same interaction being present more than
once in the same dataset.
Returns : 1 on successful addition, 0 on there being an existing
duplicate.
Usage : $gr->add_dup_edge(Bio::Graph::Edge->new(-nodes => [$n1, $n2],
-weight => $score,
-id => $id));
Arguments : an EdgeI implementing object.
edge_by_id
Name : edge_by_id
Purpose : retrieve data about an edge from its id
Arguments : a text identifier
Returns : a Bio::Graph::Edge object or undef
Usage : my $edge = $gr->edge_by_id('1000E');
remove_dup_edges
Name : remove_dup_edges
Purpose : removes duplicate edges from graph
Arguments : none - removes all duplicate edges
edge id list - removes specified edges
Returns : void
Usage : $gr->remove_dup_edges()
or $gr->remove_dup_edges($edgeid1, $edgeid2);
redundant_edge
Name : redundant_edge
Purpose : adds/retrieves redundant edges to graph
Usage : $gr->redundant_edge($edge)
Arguments : none (getter) or a Bio::Graph::Edge object (setter).
Description : redundant edges are edges in a graph that have the
same edge id, i.e., are 2 identical interactions.
With an edge argument, adds it to the list; otherwise returns the list as a reference.
redundant_edges
Name : redundant_edges
Purpose : alias for redundant_edge
remove_redundant_edges
Name : remove_redundant_edges
Purpose : removes redundant edges from graph, used by remove_nodes(),
may be better as an internal method
Arguments : none - removes all redundant edges
edge id list - removes specified edges
Returns : void
Usage : $gr->remove_redundant_edges()
or $gr->remove_redundant_edges($edgeid1, $edgeid2);
clustering_coefficient
Name : clustering_coefficient
Purpose : determines the clustering coefficient of a node, a number
in range 0-1 indicating the extent to which the neighbors of
a node are interconnected.
Arguments : A sequence object (preferred) or a text identifier
Returns : The clustering coefficient. 0 is a valid result.
If the CC is not calculable (if the node has <2 neighbors),
returns -1.
Usage : my $node = $gr->nodes_by_id('P12345');
my $cc = $gr->clustering_coefficient($node);
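For reference, the usual definition of the clustering coefficient of a node with k neighbors is 2n / (k(k-1)), where n is the number of links between those neighbors. The stand-alone sketch below computes that on a plain adjacency hash; sketch_cc is a hypothetical helper, and the module's own implementation may differ in detail.

```perl
use strict;
use warnings;

# Hypothetical stand-alone version: $adj is a hash of hashes where
# $adj->{A}{B} is true when A and B interact (stored in both directions).
sub sketch_cc {
    my ($adj, $node) = @_;
    my @nb = sort keys %{ $adj->{$node} || {} };
    my $k  = @nb;
    return -1 if $k < 2;                  # not calculable, as documented above
    my $links = 0;
    for my $i (0 .. $#nb - 1) {
        for my $j ($i + 1 .. $#nb) {
            $links++ if $adj->{ $nb[$i] }{ $nb[$j] };
        }
    }
    # Fraction of the k*(k-1)/2 possible neighbor pairs that are linked
    return 2 * $links / ($k * ($k - 1));
}

my %adj;
for my $e ([qw(A B)], [qw(A C)], [qw(A D)], [qw(B C)]) {
    my ($x, $y) = @$e;
    $adj{$x}{$y} = $adj{$y}{$x} = 1;
}

printf "CC for %s is %.3f\n", $_, sketch_cc(\%adj, $_) for qw(A B D);
```

Node A has three neighbors of which one pair (B, C) is linked, giving 1/3; node D has a single neighbor, so the sketch returns -1.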
remove_nodes
Name : remove_nodes
Purpose : to delete a node from a graph, e.g., to simulate effect
of mutation
Usage : $gr->remove_nodes($seqobj);
Arguments : a single $seqobj or list of seq objects (nodes)
Returns : 1 on success
unconnected_nodes
Name : unconnected_nodes
Purpose : return a list of nodes with no connections.
Arguments : none
Returns : an array or array reference of unconnected nodes
Usage : my @ucnodes = $gr->unconnected_nodes();
articulation_points
Name : articulation_points
Purpose : to find nodes in a graph that, if removed, will fragment
the graph into islands.
Usage : my $edgeref = $gr->articulation_points();
for my $e (keys %$edgeref) {
print $e->[0]->accession_number. "-".
$e->[1]->accession_number ."\n";
}
Arguments : none
Returns : a list of references to nodes that will fragment the graph
if deleted.
Notes : This is a "slow but sure" method that works with graphs
up to a few hundred nodes reasonably fast.
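One simple way to realise a "slow but sure" search is to remove each node in turn and check whether the number of connected components goes up. The sketch below does this on a plain adjacency hash. It is an assumption about the general approach, not a copy of the module's algorithm, and both helper names are hypothetical.

```perl
use strict;
use warnings;

# Count connected components of $adj (hash of hashes), ignoring node $skip.
# Assumes the empty string is never used as a node name.
sub count_components {
    my ($adj, $skip) = @_;
    $skip = '' unless defined $skip;
    my %seen;
    my $count = 0;
    for my $start (sort keys %$adj) {
        next if $start eq $skip || $seen{$start};
        $count++;
        $seen{$start} = 1;
        my @stack = ($start);
        while (@stack) {                      # depth-first walk of one component
            my $n = pop @stack;
            for my $nb (keys %{ $adj->{$n} }) {
                next if $nb eq $skip || $seen{$nb};
                $seen{$nb} = 1;
                push @stack, $nb;
            }
        }
    }
    return $count;
}

# A node is an articulation point if deleting it increases the component count.
sub sketch_articulation_points {
    my ($adj) = @_;
    my $base = count_components($adj);
    return grep { count_components($adj, $_) > $base } sort keys %$adj;
}

my %net;
for my $e ([qw(A B)], [qw(B C)], [qw(C D)], [qw(C E)], [qw(D E)]) {
    my ($x, $y) = @$e;
    $net{$x}{$y} = $net{$y}{$x} = 1;
}
print join(', ', sketch_articulation_points(\%net)), "\n";
```

Each candidate costs one full traversal, so the run time grows roughly with nodes times (nodes + edges), which is consistent with the note above about graphs of a few hundred nodes.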
is_articulation_point
Name : is_articulation_point
Purpose : to determine if a given node is an articulation point or not.
Usage : if ($gr->is_articulation_point($node)) {....}
Arguments : a text identifier for the protein or the node itself
Returns : 1 if node is an articulation point, 0 if it is not