NAME
Graph::Dijkstra - Dijkstra's shortest path algorithm with methods to input/output graph datasets from/to supported file formats
SYNOPSIS
# create the object
use Graph::Dijkstra;
my $graph = Graph::Dijkstra->new();
# input graph from a supported file format
$graph->inputGraphfromGML('mygraphfile.gml');
$graph->inputGraphfromCSV('mygraphfile.csv');
$graph->inputGraphfromJSON('mygraphfile.json');
$graph->inputGraphfromGraphML('mygraphfile.graphml.xml', {keyEdgeValueID => 'weight', keyNodeLabelID => 'name'} );
$graph->inputGraphfromGEXF('mygraphfile.gexf.xml' );
#SET methods to create graph nodes and edges "manually"
$graph->node( 0, 'nodeA');
$graph->node( 1, 'nodeB');
$graph->node( 2, 'nodeC');
$graph->edge(0, 1, 3); #create or update an edge between source and target; cost(dist) must be > 0
$graph->edge(1, 2, 2);
$graph->removeNode( 0 ); #removes the node with ID = 0 and all associated edges; returns "self" or undef if node not found
$graph->removeEdge( 0, 1 ); #removes the edge that connects the two nodes; returns "self" or undef if either node or the edge not found
#GET methods for graph nodes and edges
$graph->node( 0 ); #returns the label associated with node ID '0' or undef if there is no node with that id
$graph->nodeExists( 0 ); #returns true if a node with an ID value of '0' has been defined; false if not.
$graph->edge( 0, 1 ); #returns the cost (distance) associated with the edge between node's 0 and 1; undef if there is no edge
$graph->edgeExists( 0, 1 ); #returns true if an edge between nodes with id values of '0' and '1' has been defined; false if not.
$graph->nodeList(); #returns 2 dimensional array of all nodes, each node (array) element contains ID and LABEL values
$graph->adjacent( 0, 1 ); #returns true or false if there is an edge between nodeA and nodeB
$graph->adjacentNodes( 0 ); #returns a list of node IDs of immediately connected nodes (edges)
#methods to output internal graph to a supported file format
$graph->outputGraphtoGML('mygraphfile.gml', 'creator name');
$graph->outputGraphtoCSV('mygraphfile.csv');
$graph->outputGraphtoJSON('mygraphfile.json');
$graph->outputGraphtoGraphML('mygraphfile.graphml.xml', {keyEdgeWeightID => 'weight',keyEdgeWeightAttrName => 'weight', keyNodeLabelID => 'name', keyNodeLabelID => 'name'});
$graph->outputGraphtoGEXF('mygraphfile.gexf.xml');
#Dijkstra shortest path methods
use Data::Dumper;
my %Solution = ();
#shortest path to farthest node from origin node
if (my $solutionCost = $graph->farthestNode( 0, \%Solution )) {
print Dumper(\%Solution);
}
#shortest path between two nodes (from origin to destination)
if (my $pathCost = $graph->shortestPath( 0, 2, \%Solution ) ) {
print Dumper(\%Solution);
}
#Jordan or vertex center with all points shortest path matrix
my %solutionMatrix = ();
if (my $graphMinMax = $graph->vertexCenter(\%solutionMatrix) ) {
print "Center Node Set 'eccentricity', minimal greatest distance to all other nodes $graphMinMax\n";
print "Center Node Set = ", join(',', @{$solutionMatrix{centerNodeSet}} ), "\n";
my @nodeList = (sort keys %{$solutionMatrix{row}});
print 'From/To,', join(',',@nodeList), "\n";
foreach my $fromNode (@nodeList) {
print "$fromNode";
foreach my $toNode (@nodeList) {
print ",$solutionMatrix{row}{$fromNode}{$toNode}";
}
print "\n";
}
$graph->outputAPSPmatrixtoCSV(\%solutionMatrix, 'APSP.csv');
}
DESCRIPTION
Efficient implementation of Dijkstras shortest path algorithm in Perl using a Minimum Priority Queue (Array::Heap::ModifiablePriorityQueue**).
Computation methods.
farthestNode() Shortest path to farthest node (from an origin node) C<$graph->farthestNode()>
shortestPath() Shortest path between two nodes
vertexCenter() Jordan center node set (vertex center) with all points shortest path (APSP) matrix
Methods that input and output graph datasets from/to the following file formats.
GML (Graph Modelling Language, not to be confused with Geospatial Markup Language),
JSON Graph Specification (latest draft specification, edge weights/costs input using metadata "value" attribute),
GraphML (XML based),
GEXF (Graph Exchange XML Format), and
CSV (a simple row column format modelled after GML developed by author).
At this time, Graph::Dijkstra supports undirected graphs only. All edges are treated as bi-directional (undirected). The inputGraphfrom[file format] methods croak if they encounter a graph dataset with directed edges.
In this initial release, the internal graph data model is minimal. Nodes (vertices) contain an ID value (simple scalar), a label (string), and a list (hash) of edges (the ID values of connected nodes). Edges contain the ID values of the target and source nodes and the numeric cost (value/distance/amount). The edge cost must be a positive number (integer or real).
The outputGraphto[file format] methods output data elements from the internal graph. If converting between two supported formats (eg., GML and GraphML), unsupported attributes from the input file (which are not saved in the internal graph) are *not* be written to the output file. Later releases will extend the internal graph data model.
This initial release has not been sufficiently tested with real-world graph data sets. It can handle rather large datasets (tens of thousands of nodes, hundreds of thousands of edges).
If you encounter a problem or have suggested improvements, please email the author and include a sample dataset. If providing a sample data set, please scrub it of any sensitive or confidential data.
**Array::Heap::ModifiablePriorityQueue, written in Perl, uses Array::Heap, an xs module.
PLATFORM DEPENDENCIES
Graph::Dijkstra uses XML::LibXML to input and output GraphML format files. At this time, XML::LibXML is not available on the ActiveState PPM repository for Windows x86 or x64. Author has emailed ActiveState requesting that they make it available. Until such time as XML::LibXML is available on the ActiveState PPM repository for Windows, this module will not build on the ActiveState PPM repository for Windows.
Conversely, this module was developed using ActiveState Perl 5.20.2 for Windows (x86) using the XML::LibXML PPM installed from the Bribes PPM repository. It should be possible for ActiveState Windows users to install Graph::Dijkstra as follows.
Install XML::LibXML from the Bribes PPM repository http://www.bribes.org/perl/ppmdir.html using the following command.
ppm install http://www.bribes.org/perl/ppm/XML-LibXML.ppd
Then install the following from the ActiveState PPM repository as normal (using the PPM install command).
Array::Heap::ModifiablePriorityQueue
Regexp::Common
Scalar::Util
HTML::Entities
You should then be able to install this module directly from CPAN using the CPAN command.
METHODS
- Graph::Dijkstra->VERBOSE( $bool );
-
Class method that turns on or off informational output to STDOUT.
- my $graph = Graph::Dijsktra->new();
-
Create a new, empty graph object. Returns the object on success.
Node methods
- $graph->node( $id, $label );
-
SET method that adds new or updates existing node and returns self. Node ID values must be simple scalars.
- $graph->node( $id );
-
GET method that returns the label associated with the node ID or returns undef if the node ID does not exist. Note: nodes may have blank ('') labels. Use nodeExists method to test for existance.
- $graph->nodeExists( $id );
-
GET method that returns true if a node with that ID values exists or false if not.
- my @list = $graph->nodeList();
-
Returns unsorted, 2 dimensional array (list of lists) of all nodes in the graph. Each list element includes a node ID value (element 0) and LABEL value (element 1). $list[0][0] is the ID value of the first node and $list[0][1] is the LABEL value of the first node.
- $graph->removeNode( $id );
-
Removes node identified by $id and all connecting edges and returns self. Returns undef if $id does not exist.
Edge methods
- $graph->edge( $sourceID, $targetID, $cost );
-
SET method that creates new or updates existing edge between $sourceID and $targetID and returns $self. $cost must be > 0. Returns undef if $cost <= 0 or if $sourceID or $targetID do not exist.
- $graph->edge( $sourceID, $targetID );
-
GET method that returns existing cost (distance) of edge between $sourceID and $targetID. Returns 0 if there is no edge between $sourceID and $targetID. Returns undef if $sourceID or $targetID do not exist.
- $graph->edgeExists( $sourceID, $targetID );
-
GET method that returns true if an edge connects the source and target IDs or false if an edge has not been defined.
- $graph->adjacent( $sourceID, $targetID );
-
GET method that returns true if an edge connects $sourceID and $targetID or false if not. Returns undef if $sourceID or $targetID do not exist.
- my @list = $graph->adjacentNodes( $id );
-
Returns unsorted list of node IDs connected to node $id by an edge. Returns undef if $id does not exist.
- $graph->removeEdge( $sourceID, $targetID );
-
Removes edge between $source and $target (and $target and $source) and returns self. Returns undef if $source or $target do not exist.
Dijkstra computation methods
- my $solutioncost = $graph->farthestNode( $originID, $solutionHref );
-
Returns the cost of the shortest path to the farthest node from the origin. Carps and returns 0 if $originID does not exist. When there is more than one solution (two or more farthest nodes from the origin with the same cost), the solution hash includes multiple "path" elements, one for each solution.
Sample code
my %Solution = (); if ( my $solutionCost = $graph->farthestNode('I', \%Solution) ) { print Dumper(\%Solution); foreach my $i (1 .. $Solution{count}) { print "From origin $Solution{origin}, solution path ($i) to farthest node $Solution{path}{$i}{destination} at cost $Solution{cost}\n"; foreach my $edgeHref (@{$Solution{path}{$i}{edges}}) { print "\tsource='$edgeHref->{source}' target='$edgeHref->{target}' cost='$edgeHref->{cost}'\n"; } } }
Produces the following output
$VAR1 = { 'cost' => 18, 'origin' => 'I', 'desc' => 'farthest', 'count' => 2, 'path' => { '1' => { 'destination' => 'A', 'edges' => [ { 'source' => 'I', 'target' => 'L', 'cost' => 4 }, { 'cost' => 6, 'target' => 'H', 'source' => 'L' }, { 'source' => 'H', 'target' => 'D', 'cost' => 5 }, { 'cost' => 3, 'target' => 'A', 'source' => 'D' } ] }, '2' => { 'destination' => 'C', 'edges' => [ { 'source' => 'I', 'target' => 'J', 'cost' => 2 }, { 'cost' => 9, 'target' => 'K', 'source' => 'J' }, { 'target' => 'G', 'source' => 'K', 'cost' => 2 }, { 'cost' => 5, 'source' => 'G', 'target' => 'C' } ] } } }; From origin I, solution path (1) to farthest node A at cost 18 source='I' target='L' cost='4' source='L' target='H' cost='6' source='H' target='D' cost='5' source='D' target='A' cost='3' From origin I, solution path (2) to farthest node C at cost 18 source='I' target='J' cost='2' source='J' target='K' cost='9' source='K' target='G' cost='2' source='G' target='C' cost='5'
- my $solutioncost = $graph->shortestPath( $originID, $destinationID, $solutionHref );
-
Returns cost of shortest path between $originID and $destinationID or 0 if there is no path between $originID and $destinationID. Carps if $originID or $destinationID do not exist. Populates $solutionHref (hash reference) with origin, destination, cost, and shortest path edges.
Sample code
my %Solution = (); if ( my $pathCost = $graph->shortestPath('I','A', \%Solution) ) { print Dumper(\%Solution); print "Solution path from origin $Solution{origin} to destination $Solution{destination} at cost $Solution{cost}\n"; foreach my $edgeHref (@{$Solution{edges}}) { print "\tsource='$edgeHref->{source}' target='$edgeHref->{target}' cost='$edgeHref->{cost}'\n"; } }
Produces the following output
$VAR1 = { 'destination' => 'A', 'cost' => 18, 'desc' => 'path', 'origin' => 'I', 'edges' => [ { 'cost' => 4, 'source' => 'I', 'target' => 'L' }, { 'target' => 'H', 'source' => 'L', 'cost' => 6 }, { 'target' => 'D', 'cost' => 5, 'source' => 'H' }, { 'cost' => 3, 'source' => 'D', 'target' => 'A' } ] }; Solution path from origin I to destination A at cost 18 source='I' target='L' cost='4' source='L' target='H' cost='6' source='H' target='D' cost='5' source='D' target='A' cost='3'
- my $graphMinMax = $graph->vertexCenter($solutionMatrixHref);
-
Returns the graph "eccentricity", the minimal greatest distance to all other nodes from the "center node" set or Jordan center. Carps if graph contains disconnected nodes (nodes with no edges) which are excluded. If graph contains a disconnected sub-graph (a set of connected nodes isoluated / disconnected from all other nodes), the return value is 0 -- as the center nodes are undefined.
The $solutionMatrix hash (reference) is updated to include the center node set (the list of nodes with the minimal greatest distance to all other nodes) and the all points shortest path matrix. In the all points shortest path matrix, an infinite value (1.#INF) indicates that there is no path from the origin to the destination node. In this case, the center node set is empty and the return value is 0.
Includes a class "helper" method that outputs the All Points Shortest Path matrix to a CSV file.
NOTE: The size of the All Points Shortest Path matrix is nodes^2 (expontial). A graph with a thousand nodes results in a million entry matrix that will take a long time to compute. Have not evaluated the practical limit on the number of nodes.
Sample code
my %solutionMatrix = (); my $graphMinMax = $graph->vertexCenter(\%solutionMatrix); print "Center Node Set = ", join(',', @{$solutionMatrix{centerNodeSet}} ), "\n"; print "Center Node Set 'eccentricity', minimal greatest distance to all other nodes $graphMinMax\n"; my @nodeList = (sort keys %{$solutionMatrix{row}}); # or (sort {$a <=> $b} keys %{$solutionMatrix{row}}) if the $nodeID values are numeric print 'From/To,', join(',',@nodeList), "\n"; foreach my $fromNode (@nodeList) { print "$fromNode"; foreach my $toNode (@nodeList) { print ",$solutionMatrix{row}{$fromNode}{$toNode}"; } print "\n"; } # Output All Points Shortest Path matrix to a .CSV file. # If the nodeID values are numeric, include a third parameter, 'numeric' to sort the nodeID values numerically. $graph->outputAPSPmatrixtoCSV(\%solutionMatrix, 'APSP.csv');
Input graph methods
- $graph->inputGraphfromJSON($filename);
-
Inputs nodes and edges from a JSON format file following the draft JSON Graph Specification. Edge values/weights/costs are input using the metadata "value" attribute.
Example edge that includes metadata value attribute per JSON Graph Specification. { "source" : "1", "target" : "2", "metadata" : { "value" : "0.5" } },
See JSON Graph Specification https://www.npmjs.com/package/json-graph-specification
- $graph->inputGraphfromGML($filename);
-
Inputs nodes and edges from a Graphics Modelling Language format file (not to be confused with the Geospatial Markup Language XML format). Implemented using pattern matching (regexp's) on "node" and "edge" constructs. An unmatched closing bracket (']') inside a quoted string attribute value will break the pattern matching. Quoted string attribute values (e.g., a label value) should not normally include an unmatched closing bracket. Report as a bug and I'll work on re-implementing using a parser.
See Graph Modelling Language https://en.wikipedia.org/wiki/Graph_Modelling_Language
- $graph->inputGraphfromCSV($filename);
-
Inputs nodes and edges from a CSV format file loosely modelled after GML. The first column in each "row" is either a "node" or "edge" value. For "node" rows, the next two columns are the ID and LABEL values. For "edge" rows, the next three columns are "source", "target", and "value" values. No column header row.
Example
node,A,"one" node,B,"two" node,C,"three" node,D,"four" node,E,"five" node,F,"six" edge,A,B,4 edge,A,F,5 edge,A,E,7 edge,A,D,3 edge,D,E,5
- $graph->inputGraphfromGraphML($filename, {keyEdgeValueID => 'weight', keyNodeLabelID => 'name'} );
-
Inputs nodes and edges from an XML format file following the GraphML specification. EXPERIMENTAL... has not been tested with real world data sets.
Input files must contain only a single graph and cannot contain embedded graphs. Hyperedges are not supported.
The options hash reference (second parameter following the filename) is used to provide the key element ID values for edge weight/value/cost/distance and node label/name/description.
If either is not provided, the method will search the key elements for (1) edge attributes (for="edge") with an attr.name value of weight, value, cost, or distance; and (2) node attributes (for="node") with an attr.name value of label, name, description, or nlabel.
Graphs must contain a "key" attribute for edges that identifies the edge weight/value/cost/distance such as
for="edge" attrib.name="weight"
. If this key element includes a child element that specifies a default value, that default value will be used to populate the weight (cost/value/distance) for each edge node that does not include a weight/value/cost/distance data element. Seems odd to specify a default edge weight but it will be accepted.<key id="d1" for="edge" attr.name="weight" attr.type="double"> <default>2.2</default> </key> <edge id="7" source="1" target="2"> <data key="weight">0.5</data> </edge>
Graphs should contain a "key" attribute for nodes that identifies the node label / name / description such as
for="node" attrib.name="name"
orfor="node" attrib.name="label"
. These are used to populate the internal graph "label" value for each node. If not included, the internal node labels will be empty strings.<key id="name" for="node" attr.name="name" attr.type="string"/> <node id="4"> <data key="name">josh</data> </node>
See GraphML Primer http://graphml.graphdrawing.org/primer/graphml-primer.html and GraphML example http://gephi.org/users/supported-graph-formats/graphml-format/
- $graph->inputGraphfromGEXF($filename );
-
Inputs nodes and edges from an XML format file following the GEXF draft 1.2 specification. Will also accept draft 1.1 specification files.
Input files must contain only a single graph. Hierarchial nodes are not supported.
Node elements are expected to contain a label element. Edge elements are expected to contain a weight attribute.
Note: Author has seen a GEXF file where the edge weights were specified using a <attvalues><attvalue> element under each <edge> element (see following) where there was no corresponding <attributes><attribute> definition of weight. This doesn't appear to be correct. If it is, please email the author.
<attvalues> <attvalue for="weight" value="1.0"></attvalue> </attvalues>
Output graph methods
- $graph->outputGraphtoGML($filename, $creator);
-
Using the internal graph, outputs a file in GML format. Includes a "creator" entry on the first line of the file with a date and timestamp. Note that non-numeric node IDs and label values are HTML encoded.
- $graph->outputGraphtoJSON($filename);
-
Using the internal graph, outputs a file following the JSON graph specification. See the inputGraphfromJSON method for format details.
- $graph->outputGraphtoCSV($filename);
-
Using the internal graph, outputs a file in CSV format. See the inputGraphfromCSV method for format details.
- $graph->outputGraphtoGraphML($filename, {keyEdgeWeightID => '',keyEdgeWeightAttrName => '', keyNodeLabelID => '', keyNodeLabelID => ''} );
-
Using the internal graph, outputs a file in XML format following the GraphML specification (schema). The option attributes keyEdgeWeightID and keyEdgeWeightAttrName both default to 'weight'. keyNodeLabelID and keyNodeLabelID both default to 'name'. Set these attributes values only if you need to output different values for these key attributes.
- $graph->outputGraphtoGEXF($filename);
-
Using the internal graph, outputs a file in XML format following the GEXF draft 1.2 specification (schema).
PERFORMANCE
Performance measurements were recorded on an unremarkable laptop (Intel Core i5) running Microsoft Windows 10 (64bit) and ActiveState Perl 5.20.2 (x86). Reported times are Perl "use benchmark" measurements of the runtime of the core algorithm within the referenced methods. Timings exclude data loading. Measurements are indicative at best.
With a test graph of 16+K nodes and 121+K edges, both farthest and shortestPath completed in under 1 second (~0.45seconds). Did not attempt to run the vertexCenter method given that the resulting All Points Shortest Path matrix would have over 256M elements (requiring the computation of ~128M unique paths).
With a smaller test graph of 809 nodes and 1081 edges, both farthestNode and shortestPath completed in under 0.01 seconds. The vertexCenter method took much longer to complete at 56.61 seconds. For a graph with 809 nodes, creating the All Points Shortest Path matrix required computation of 326,836 unique paths ((809^2 - 809) / 2). The shortest path computation rate was ~5,774 paths per second or 0.000173 seconds per path. Next version will include an alternative implementation, the Floyd Warshall algorithm, to create the All Points Shortest Path matrix.
For the smaller test graph run, Windows Task Manager reported that Perl consumed 30-33% of CPU capacity and allocated 58.5MB of memory.
With a GraphML (XML) dataset containing over 700K nodes and 1M edges (>6M lines of text, ~175MB file size), the perl process ran out of memory (exceeded the 2GB per process limit for 32bit applications under 64bit MS Windows). The memory allocation limit was reached in the libxml2 (c) library before control was returned to Perl. Using the 64bit version of Perl should (may) avoid this problem. The GraphML file is available at http://sonetlab.fbk.eu/data/social_networks_of_wikipedia/ under the heading "Large Wikipedias (2 networks extracted automatically with the 2 algorithms)". Filename is eswiki-20110203-pages-meta-current.graphml.
LIMITATIONS
Node ID values must be simple scalars.
Currently only works with undirected graphs. Next version will support directed as well as undirected graphs.
For simplicity, InputGraphfromGML implemented using pattern matching (regexp's). An unmatched closing bracket inside a quoted string (value) will break it. Will re-implement using a parser as necessary.
TODOs
Support input and output of addditional data attributes for nodes and edges in a comprehensive manner. For example, add support for edge IDs and labels. Add support for graph coordinates to nodes. Update $graph->inputGraphfrom*, $graph->outputGraphto*, $graph->node*, and $graph->edge* methods.
Support input, Dijkstra path computations, and output of directed graphs (graphs with directed / uni-directional edges).
Evaluate vertexCenter method with larger graphs. Current implementation uses Dijkstra shortest path algorithm. Possibly replace with Floyd Warshall algorithm or add second implementation.
Test very large graph datasets using a 64bit version of perl (without the 2GB process limit).
Rewrite the input/output methods for CSV files to process nodes and edges in separate files (nodes file and edges file) with each containing an appropriate column header.
Add support for the Pajek "NET" graph file format.
Input welcome. Please email author with suggestions. Graph data sets for testing (purged of sensitive/confidential data) are welcome and appreciated.
SEE ALSO
Array::Heap::ModifiablePriorityQueue
Graph Modelling Language https://en.wikipedia.org/wiki/Graph_Modelling_Language
JSON Graph Specification https://www.npmjs.com/package/json-graph-specification
GraphML Primer http://graphml.graphdrawing.org/primer/graphml-primer.html
GraphML example http://gephi.org/users/supported-graph-formats/graphml-format/
GEXF File Format http://www.gexf.net/format/index.html
AUTHOR
D. Dewey Allen <ddallen16@gmail.com>
COPYRIGHT/LICENSE
Copyright (C) 2015, D. Dewey Allen
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.