
NAME

AI::NeuralNet::SOM - A simple Kohonen Self-Organizing Map.

SYNOPSIS

        use AI::NeuralNet::SOM;

        # Create a new self-organizing map.
        $som = AI::NeuralNet::SOM->new();
        
        # Create a data set to initialize and train.
        @data = (
        13.575570, 12.656892, -1.424328, -2.302774, 404.921600,
        13.844373, 12.610620, -1.435429, -1.964423, 404.978180,
        13.996934, 12.669785, -1.384147, -1.830788, 405.187378,
        14.060876, 12.755087, -1.378407, -2.020230, 404.892548,
        14.095317, 12.877163, -1.363435, -2.072163, 404.698822,
        13.975704, 12.888503, -1.351579, -1.832351, 404.479889,
        13.713181, 12.836812, -1.338311, -1.997729, 403.891724,
        13.834728, 12.809576, -1.333899, -2.002055, 403.270264,
        13.744470, 12.770656, -1.343199, -2.241165, 402.820709,
        13.982540, 12.697198, -1.372424, -1.922313, 402.433960,
        14.064130, 12.691656, -1.377368, -1.752657, 403.218475,
        14.035974, 12.764489, -1.354782, -1.970408, 402.411560,
        14.037183, 12.913648, -1.322078, -2.069336, 402.292755,
        13.985688, 12.954960, -1.345922, -1.613702, 404.184143,
        14.054778, 12.941310, -1.384624, -1.703977, 399.970612,
        13.915499, 13.089429, -1.313017, -1.429557, 399.338287,
        14.590042, 13.462692, -1.290192, -1.537785, 399.777039,
        15.501397, 14.348173, -1.275527, -1.680045, 399.398071,
        15.630893, 15.530425, -1.280694, -1.917952, 400.034485,
        16.435490, 17.209114, -1.305744, -1.094125, 399.959900);

        # Initialize map.
        $som->initialize(3,3,5,'hexa','bubble','linear',0,\@data);

        # Find quantization error before training and print it.
        $qerr = $som->qerror(\@data);
        print "Mean quantization error before trainig= $qerr\n";

        # Train map with the same data set.
        $som->train(500,0.05,3,'linear',\@data);

        # Find quantization error after training and print it.
        $qerr = $som->qerror(\@data);
        print "Mean quantization error after trainig= $qerr\n\n";

        # Create a data set to label map.
        @label_data = (
        23.508335, 21.359016, 3.906102, 4.884908, 404.440765,
        23.823174, 21.731325, 4.295785, 5.244288, 405.100342,
        24.207268, 22.070162, 4.646249, 5.030964, 404.812225,
        24.284208, 22.401424, 4.806539, 5.006081, 404.735596,
        24.401838, 22.588514, 4.957213, 5.011020, 404.176880,
        25.824610, 24.155489, 5.976608, 6.708979, 405.040466,
        26.197090, 24.353720, 6.272694, 6.843574, 405.728119,
        26.347252, 24.720333, 6.518201, 6.950599, 405.758606,
        26.537718, 24.976704, 6.661457, 7.163557, 404.037567,
        27.041384, 25.309855, 6.979992, 7.488787, 404.839081,
        27.193167, 25.601683, 7.173965, 7.920047, 404.749054);

        # Label map with "fault" patterns.
        $patterns_count = scalar(@label_data) / $som->i_dim;
        for $i (0..$patterns_count-1){
                @pattern = splice(@label_data, 0, $som->i_dim);
                ($x, $y) = $som->winner(\@pattern);
                $som->set_label($x, $y, "fault");
        }

        # Create a data set to test map.
        @test_data = (
        23.508335, 21.359016, 3.906102, 4.884908, 'X',
        23.823174, 21.731325, 4.295785, 5.244288, 405.100342,
        24.207268, 22.070162, 4.646249, 5.030964, 404.812225,
        13.575570, 12.656892, -1.424328, -2.302774, 404.921600,
        24.284208, 22.401424, 4.806539, 5.006081, 404.735596,
        24.401838, 22.588514, 4.957213, 5.011020, 404.176880,
        13.844373, 12.610620, -1.435429, -1.964423, 404.978180,
        24.628309, 23.015909, 5.075150, 5.560286, 403.773132,
        13.996934, 12.669785, -1.384147, -1.830788, 405.187378,
        25.551638, 23.864803, 5.774306, 6.208019, 403.946777,
        26.347252, 24.720333, 6.518201, 6.950599, 405.758606,
        26.537718, 24.976704, 6.661457, 7.163557, 404.037567,
        'X', 15.601683, 'X', 'X', 404.749054,
        27.041384, 25.309855, 6.979992, 7.488787, 404.839081);

        # Test map and print results.
        $patterns_count = scalar(@test_data) / $som->i_dim;
        for $i (0..$patterns_count-1){
                @pattern = splice(@test_data, 0, $som->i_dim);
                ($x, $y) = $som->winner(\@pattern);
                $label=$som->label($x, $y);
                if (defined($label)) {
                        print "@pattern - $label\n";
                }
                else {
                        print "@pattern\n";
                }
        }

DESCRIPTION

The principle of the SOM

The Self-Organizing Map represents the result of a vector quantization algorithm that places a number of reference, or codebook, vectors into a high-dimensional input data space to approximate its data sets in an ordered fashion. When local-order relations are defined between the reference vectors, the relative values of the latter are made to depend on each other as if their neighboring values lay along an "elastic surface". By means of the self-organizing algorithm, this "surface" becomes defined as a kind of nonlinear regression of the reference vectors through the data points. A mapping from a high-dimensional data space R^n onto, say, a two-dimensional lattice of points is thereby also defined. Such a mapping can effectively be used to visualize metric ordering relations of input samples. In practice, the mapping is obtained as an asymptotic state in a learning process.

A typical application of this kind of SOM is in the analysis of complex experimental vectorial data such as process states, where the data elements may even be related to each other in a highly nonlinear fashion.

There exist many versions of the SOM. The basic philosophy, however, is very simple and already effective as such, and has been implemented by the procedures contained in this package.

The SOM here defines a mapping from the input data space R^n onto a regular two-dimensional array of nodes. With every node i, a parametric reference vector mi in R^n is associated. The lattice type of the array can be defined as rectangular or hexagonal in this package; the latter is more effective for visual display. An input vector x in R^n is compared with the mi, and the best match is defined as "winner": the input is thus mapped onto this location.

One might say that the SOM is a "nonlinear projection" of the probability density function of the high-dimensional input data onto the two-dimensional display. Let x in R^n be an input data vector. It may be compared with all the mi in any metric; in practical applications, the smallest of the Euclidean distances ||x - mi|| is usually made to define the best-matching node, signified by the subscript c:

||x - mc|| = min_i {||x - mi||} ; or

c = arg min_i {||x - mi||} ; (1)

Thus x is mapped onto the node c relative to the parameter values mi.
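
In code, the winner search of Eq. (1) is simply an arg-min over Euclidean distances. The following is a minimal sketch independent of this module's internals; @codebook and the helper name are hypothetical:

        # Find the best-matching ("winning") unit per Eq. (1).
        # $input is an array reference holding the input vector;
        # @codebook holds one array reference per map unit.
        sub find_winner {
                my ($input, @codebook) = @_;
                my ($best, $best_dist) = (0, 9e99);
                for my $i (0 .. $#codebook) {
                        my $d = 0;
                        $d += ($input->[$_] - $codebook[$i][$_]) ** 2
                                for 0 .. $#{$input};
                        ($best, $best_dist) = ($i, $d) if $d < $best_dist;
                }
                return ($best, sqrt($best_dist));
        }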

During learning, those nodes that are topographically close in the array up to a certain distance will activate each other to learn from the same input. Without mathematical proof we state that useful values of the mi can be found as convergence limits of the following learning process, whereby the initial values of the mi(0) can be arbitrary, e.g., random:

mi(t + 1) = mi(t) + hci(t)[x(t) - mi(t)] ; (2)

where t is an integer, the discrete-time coordinate, and hci(t) is the so-called neighborhood kernel; it is a function defined over the lattice points. Usually hci(t) = h(||rc - ri||; t), where rc in R^2 and ri in R^2 are the radius vectors of nodes c and i, respectively, in the array. With increasing ||rc - ri||, hci goes to 0. The average width and form of hci define the "stiffness" of the "elastic surface" to be fitted to the data points. Notice that it is usually not desirable to describe the exact form of the probability density p(x), especially if x is very high-dimensional; it is more important to be able to automatically find those dimensions and domains in the signal space where x has significant amounts of sample values.
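
A single step of Eq. (2) for one map unit is a one-liner. A minimal sketch with hypothetical names; $h carries the kernel value hci(t) for this unit:

        # One learning step of Eq. (2): move unit $m (array ref) toward
        # input $x (array ref) by the kernel value $h = hci(t).
        sub update_unit {
                my ($m, $x, $h) = @_;
                $m->[$_] += $h * ($x->[$_] - $m->[$_]) for 0 .. $#{$m};
        }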

This package contains two options for the definition of hci(t). The simpler of them refers to a neighborhood set of array points around node c. Let this index set be denoted Nc (notice that we can define Nc = Nc(t) as a function of time), whereby hci = alpha(t) if i in Nc and hci = 0 if i not in Nc, where alpha(t) is some monotonically decreasing function of time (0 < alpha(t) < 1). This kind of kernel is nicknamed "bubble", because it relates to certain activity "bubbles" in laterally connected neural networks [Kohonen 1989]. Another widely applied neighborhood kernel can be written in terms of the Gaussian function,

hci = alpha(t) * exp(-(||rc-ri||^2)/(2 rad^2(t))); (3)

where alpha(t) is another scalar-valued "learning rate", and the parameter rad(t) defines the width of the kernel; the latter corresponds to the radius of Nc above. Both alpha(t) and rad(t) are some monotonically decreasing functions of time, and their exact forms are not critical; they could thus be selected to be linear. In this package it is further possible to use a function of the type alpha(t) = A/(B + t), where A and B are constants; the inverse-time function is justified theoretically, approximately at least, by the so-called stochastic approximation theory. It is advisable to use the inverse-time type function with large maps and long training runs, to allow more balanced fine-tuning of the reference vectors. Effective choices for these functions and their parameters have so far only been determined experimentally; such default definitions have been used in this package.
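
Both kernels and both learning-rate schedules are short enough to write out. A minimal sketch with hypothetical names; the constants in the inverse-time schedule are illustrative, since the module's own defaults are not documented here:

        # Learning-rate schedules: linear decay to 0 over $t_max steps,
        # and inverse-time decay alpha(t) = A / (B + t).
        sub alpha_linear    { my ($a0, $t, $t_max) = @_; $a0 * (1 - $t / $t_max) }
        sub alpha_inverse_t { my ($a0, $t) = @_; $a0 * 100 / (100 + $t) }  # A = 100*$a0, B = 100: illustrative

        # Neighborhood kernels for a unit at lattice distance $d from the
        # winner: "bubble" (the set Nc above) and "gaussian" (Eq. (3)).
        sub h_bubble   { my ($alpha, $d, $rad) = @_; $d <= $rad ? $alpha : 0 }
        sub h_gaussian { my ($alpha, $d, $rad) = @_; $alpha * exp(-$d**2 / (2 * $rad**2)) }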

The next step is calibration of the map, in order to be able to locate images of different input data items on it. In the practical applications for which such maps are intended, it is usually self-evident from daily routines how a particular input data set ought to be interpreted. By inputting a number of typical, manually analyzed data sets and looking where the best matches on the map according to Eq. (1) lie, the map, or at least a subset of its nodes, can be labeled to delineate a "coordinate system" or at least a set of characteristic reference points on it according to their manual interpretation. Since this mapping is assumed to be continuous along some hypothetical "elastic surface", unknown data can then be interpreted by interpolation and extrapolation with respect to these calibrated points.

METHODS

$som = AI::NeuralNet::SOM->new();

Creates a new empty Self-Organizing Map object.

$som->initialize($xdim, $ydim, $idim, $topology, $neighborhood, $init_type, $random_seed, \@data);

Initializes the SOM object. Sets the map dimensions to $xdim x $ydim and the input vector dimension to $idim. $topology may be either "rect" or "hexa"; $neighborhood may be "bubble" or "gaussian"; $init_type, the initialization type of the SOM object, may be "linear" or "random"; $random_seed is any non-negative integer. \@data is a reference to the array containing the initialization data.

$som->train($train_length, $alpha, $radius, $alpha_type, \@data);

Trains the Self-Organizing Map. $train_length is the number of training epochs; $alpha is the learning rate; $radius is the initial training radius, which decreases to 1 during the training process; $alpha_type sets the type of the learning-rate decrease function and can be "linear" or "inverse_t"; \@data is a reference to the array containing the training data.
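
For large maps and long training runs, the inverse-time schedule described above may give more balanced fine-tuning. For example (the parameter values here are illustrative, not recommendations):

        # A longer run with inverse-time learning-rate decay.
        $som->train(5000, 0.02, 5, 'inverse_t', \@data);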

$som->qerror;

Returns quantization error of the trained map.

($x, $y, $dist) = $som->winner(\@data);

Finds the "winned" neuron for the mapped data vector \@data and returns its coordinates $x and $y and $dist - Euclidean distance between the neuron and the input vector.

$som->set_label($x, $y, $label);

Sets the label for the neuron with coordinates $x and $y.

$som->clear_all_labels;

Clears all the labels on the map.

$som->save(*FILE);

Saves the Self-Organizing Map to the file represented by the filehandle *FILE, which may be *STDOUT. The reference vectors are stored in ASCII form. The format of the entries is similar to that used in the input data files, except that the optional items on the first line of data files (topology type, x- and y-dimensions and neighborhood type) are now compulsory. In map files it is possible to include several labels for each entry.

An example: The map file code.cod contains a map of three-dimensional vectors, with three times two map units.

      code.cod:

       3 hexa 3 2 bubble
       191.105   199.014   21.6269
       215.389   156.693   63.8977
       242.999   111.141   106.704
       241.07    214.011   44.4638
       231.183   140.824   67.8754
       217.914   71.7228   90.2189

The x-coordinates of the map (column numbers) may be thought to range from 0 to n - 1, where n is the x-dimension of the map, and the y-coordinates (row numbers) from 0 to m - 1, respectively, where m is the y-dimension of the map. The reference vectors of the map are stored in the map file in the following order:

 1       The unit with coordinates (0; 0).
 2       The unit with coordinates (1; 0).
         ...
 n       The unit with coordinates (n - 1; 0).
 n + 1   The unit with coordinates (0; 1).
         ...
 nm      The last unit is the one with coordinates (n - 1; m - 1).


    (0,0) - (1,0) - (2,0) - (3,0)         (0,0) - (1,0) - (2,0) - (3,0)
      |       |       |       |               \   /   \   /   \   /   \
    (0,1) - (1,1) - (2,1) - (3,1)             (0,1) - (1,1) - (2,1) - (3,1)
      |       |       |       |                  /   \   /   \   /   \  /
    (0,2) - (1,2) - (2,2) - (3,2)         (0,2) - (1,2) - (2,2) - (3,2)

          Rectangular                             Hexagonal

The picture above shows the locations of the units in the two possible topological structures. The distance between two units in the map is computed as a Euclidean distance in the (two-dimensional) map topology.
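
As a small illustration, the row-major storage order above and the hexagonal distance can be written out directly. The offset convention in hexa_distance (odd rows shifted right by half a unit, rows sqrt(3)/2 apart) is the usual one for hexagonal lattices and is an assumption here, not taken from the module's source:

        # Row-major storage: unit (x, y) is entry y * n + x (0-based),
        # where n is the x-dimension of the map.
        sub unit_index {
                my ($x, $y, $xdim) = @_;
                return $y * $xdim + $x;
        }

        # Euclidean distance between units in the hexagonal topology,
        # assuming the usual offset convention (see the note above).
        sub hexa_distance {
                my ($x1, $y1, $x2, $y2) = @_;
                my $dx = ($x1 + 0.5 * ($y1 % 2)) - ($x2 + 0.5 * ($y2 % 2));
                my $dy = ($y1 - $y2) * sqrt(3) / 2;
                return sqrt($dx * $dx + $dy * $dy);
        }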

$som->load(*FILE);

Loads the Self-Organizing Map from the file represented by the filehandle *FILE.

$som->umatrix;

Calculates the Umatrix for an existing map and returns a reference to an array containing the Umatrix data.

The Umatrix is a way of representing the distances between the reference vectors of neighboring map units. Although somewhat laborious to calculate, it can effectively be used to visualize the map in an interpretable manner.

The Umatrix algorithm calculates the distances between neighboring neurons and stores them in a grid (matrix) corresponding to the topology type in use. From that grid, a proper visualization can be generated by picking the distance values for each neuron (4 for the rectangular and 6 for the hexagonal topology). The distance values are scaled to the range between 0 and 1 and are shown as colors when the Umatrix is visualized.

Example:

        ...
        $umat = $som->umatrix;
        $row_len = $som->x_dim * 2 - 1;
        for $j (0..$som->y_dim*2-2) {
                for $i (0..$row_len-1) {
                        print "$umat->[$j * $row_len + $i] ";
                }
                print "\n";
        }
        ...

$som->x_dim;

Returns the x dimension of the map.

$som->y_dim;

Returns the y dimension of the map.

$som->i_dim;

Returns the input vector dimension.

$som->topology;

Returns the map topology.

$som->neighborhood;

Returns the neighborhood function type.

$som->map($x, $y, $z);

Returns element $z of the reference vector of the neuron with coordinates $x and $y, where 0 < $z <= $som->i_dim.

$som->label($x, $y);

Returns the label corresponding to the neuron with coordinates $x and $y.

NOTES

Using missing values

You can use missing values in the data sets used to initialize and train the map. I recommend using the symbol "X" to indicate missing values, but you can use any alphabetic symbol for this purpose.
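
For example, a five-dimensional input vector whose third component is unknown could be written as follows (the surrounding values are arbitrary):

        # The third component of this vector is missing.
        @data = (
        13.575570, 12.656892, 'X', -2.302774, 404.921600);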

Some particular parts of this documentation were taken from the documentation for SOM_PAK <http://www.cis.hut.fi/research/som-research/nnrc-programs.shtml>.

BUGS

This is an alpha release of AI::NeuralNet::SOM, and there are probably bugs in here which I just have not found yet. If you find bugs in this module, I would appreciate it greatly if you could report them to me at <voischev@mail.ru>, or, even better, try to patch them yourself, figure out why the bug is being buggy, and send me the patched code, again at <voischev@mail.ru>.

HISTORY

AI-NeuralNet-SOM-0.01 - The first alpha version.

AI-NeuralNet-SOM-0.02 - Fixed bugs in the "load" method and added the new method "umatrix".

AUTHOR

Voischev Alexander <voischev@mail.ru>

Copyright (c) 2000 Voischev Alexander. All rights reserved. AI::NeuralNet::SOM is free software; you can redistribute it and/or modify it under the same terms as Perl itself. THIS SOFTWARE COMES WITHOUT WARRANTY OF ANY KIND.