== Learning algorithm ==
The goal of learning in the self-organizing map is to cause different parts of the network to respond similarly to certain input patterns. This is partly motivated by how visual, auditory or other [[sense|sensory]] information is handled in separate parts of the [[cerebral cortex]] in the [[human brain]].<ref name="Haykin">{{cite book |first=Simon |last=Haykin |title=Neural networks - A comprehensive foundation |chapter=9. Self-organizing maps |edition=2nd |publisher=Prentice-Hall |year=1999 |isbn=978-0-13-908385-3 }}</ref>

[[Image:Somtraining.svg|thumb|500px|An illustration of the training of a self-organizing map. The blue blob is the distribution of the training data, and the small white disc is the current training datum drawn from that distribution. At first (left) the SOM nodes are arbitrarily positioned in the data space. The node (highlighted in yellow) which is nearest to the training datum is selected. It is moved towards the training datum, as (to a lesser extent) are its neighbors on the grid. After many iterations the grid tends to approximate the data distribution (right).]]

The weights of the neurons are initialized either to small random values or sampled evenly from the subspace spanned by the two largest [[principal component]] [[eigenvectors]]. With the latter alternative, learning is much faster because the initial weights already give a good approximation of SOM weights.<ref name="SOMIntro">{{cite web |title=Intro to SOM |first=Teuvo |last=Kohonen |work=SOM Toolbox |url=http://www.cis.hut.fi/projects/somtoolbox/theory/somalgorithm.shtml |year=2005<!-- last updated 18 March 2005 --> |access-date=2006-06-18 }}</ref>

The network must be fed a large number of example vectors that represent, as closely as possible, the kinds of vectors expected during mapping. The examples are usually presented several times, in iterations.

The training utilizes [[competitive learning]]. When a training example is fed to the network, its [[Euclidean distance]] to all weight vectors is computed. The neuron whose weight vector is most similar to the input is called the '''best matching unit''' (BMU). The weights of the BMU and neurons close to it in the SOM grid are adjusted towards the input vector. The magnitude of the change decreases with time and with the grid-distance from the BMU. The update formula for a neuron ''v'' with weight vector '''W<sub>v</sub>'''(''s'') is

:<math>W_{v}(s + 1) = W_{v}(s) + \theta(u, v, s) \cdot \alpha(s) \cdot (D(t) - W_{v}(s))</math>,

where ''s'' is the step index, ''t'' is an index into the training sample, ''u'' is the index of the BMU for the input vector '''D'''(''t''), ''α''(''s'') is a [[monotonically decreasing]] learning coefficient, and ''θ''(''u'', ''v'', ''s'') is the [[Neighbourhood (mathematics)|neighborhood]] function, which depends on the grid distance between the neuron ''u'' and the neuron ''v'' in step ''s''.<ref name="Scholarpedia">{{cite journal|title=Kohonen network|last1=Kohonen|first1=Teuvo|last2=Honkela|first2=Timo|year=2011<!-- last approved revision 2011-11-15 -->|journal=Scholarpedia|volume=2|issue=1|pages=1568|doi=10.4249/scholarpedia.1568|doi-access=free|bibcode=2007SchpJ...2.1568K}}<!-- Begin Quote Consider first data items that are n-dimensional Euclidean vectors x(t)=[ξ1(t),ξ2(t),…,ξn(t)]. Here t is the index of the data item in a given sequence. Let the ith model be mi(t)=[μi1(t),μi2(t),…,μin(t)], where now t denotes the index in the sequence in which the models are generated. End Quote The equation mi(t+1)=mi(t)+α(t)hci(t)[x(t)−mi(t)] thus uses the symbol t to mean *two different things*: the t of x(t) is not the t of m, α and h. This is why we use s and t here. Ultsch & Siemon 1990 also use three nested loops when describing Kohonen's algorithm: the outer one is over the training steps (and controls the decay of θ and α (called n and η, respectively, in their paper)), the middle one is over the data items, and the inner is over the neurons. --></ref> Depending on the implementation, ''t'' can scan the training data set systematically (''t'' is 0, 1, 2...''T''-1, then repeat, ''T'' being the training sample's size), be randomly drawn from the data set ([[bootstrap sampling]]), or implement some other sampling method (such as [[Jackknife_resampling|jackknifing]]).
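A minimal sketch of a single training step under these definitions, assuming the weight vectors are stored as rows of a NumPy array and that each node has known grid coordinates; the Gaussian neighborhood and the names <code>som_step</code>, <code>grid_coords</code> and <code>sigma</code> are illustrative choices rather than part of any standard implementation:

<syntaxhighlight lang="python">
import numpy as np

def som_step(weights, grid_coords, x, alpha, sigma):
    """One SOM update step: find the BMU for input x and pull nearby nodes toward x.

    weights:     (n_nodes, dim) array of weight vectors W_v(s)
    grid_coords: (n_nodes, 2) array of node positions on the SOM grid
    x:           (dim,) input vector D(t)
    alpha:       learning coefficient alpha(s)
    sigma:       neighborhood radius of an (assumed) Gaussian theta(u, v, s)
    """
    # Best matching unit u: the node whose weight vector is closest to x (Euclidean distance).
    bmu = np.argmin(np.linalg.norm(weights - x, axis=1))
    # Distance on the SOM grid (not in data space) from the BMU to every node.
    grid_dist = np.linalg.norm(grid_coords - grid_coords[bmu], axis=1)
    # Gaussian neighborhood function theta(u, v, s).
    theta = np.exp(-grid_dist**2 / (2 * sigma**2))
    # W_v(s+1) = W_v(s) + theta(u, v, s) * alpha(s) * (D(t) - W_v(s))
    return weights + theta[:, None] * alpha * (x - weights)
</syntaxhighlight>

Calling such a step repeatedly while decaying <code>alpha</code> and <code>sigma</code> reproduces the scheme described above.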
The neighborhood function ''θ''(''u'', ''v'', ''s'') (also called the ''function of lateral interaction'') depends on the grid-distance between the BMU (neuron ''u'') and neuron ''v''. In the simplest form it is 1 for all neurons close enough to the BMU and 0 for others, but the [[Gaussian function|Gaussian]] and [[mexican hat wavelet|Mexican-hat]]<ref name="Vrieze">{{cite book |last1=Vrieze |first1=O.J. |title=Artificial Neural Networks |chapter=Kohonen Network |chapter-url=https://link.springer.com/content/pdf/10.1007%2FBFb0027024.pdf |website=Springer |series=Lecture Notes in Computer Science |year=1995 |volume=931 |pages=83–100 |publisher=University of Limburg, Maastricht |doi=10.1007/BFb0027024 |isbn=978-3-540-59488-8 |access-date=1 July 2020 |ref=Vrieze}}</ref> functions are common choices, too. Regardless of the functional form, the neighborhood function shrinks with time.<ref name="Haykin" /> At the beginning, when the neighborhood is broad, the self-organization takes place on the global scale. When the neighborhood has shrunk to just a couple of neurons, the weights converge to local estimates. In some implementations, the learning coefficient ''α'' and the neighborhood function ''θ'' decrease steadily with increasing ''s''; in others (in particular those where ''t'' scans the training data set) they decrease in step-wise fashion, once every ''T'' steps.
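As an illustration, the simple "bubble" and Gaussian neighborhoods together with a shrinking radius might be written as follows; the exponential decay of the radius is one plausible schedule among many, and the function names are only illustrative:

<syntaxhighlight lang="python">
import numpy as np

def bubble_neighborhood(grid_dist, radius):
    # Simplest form: 1 for every node within the current radius of the BMU, 0 otherwise.
    return (grid_dist <= radius).astype(float)

def gaussian_neighborhood(grid_dist, radius):
    # Influence decreases smoothly with grid distance from the BMU.
    return np.exp(-grid_dist**2 / (2 * radius**2))

def radius_schedule(s, lam, r_start, r_end=1.0):
    # Broad neighborhood early (global ordering), shrinking toward a few nodes (local tuning).
    return r_start * (r_end / r_start) ** (s / lam)
</syntaxhighlight>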
[[File:TrainSOM.gif|thumb|Training process of SOM on a two-dimensional data set]]

This process is repeated for each input vector for a (usually large) number of cycles ''λ''. The network winds up associating output nodes with groups or patterns in the input data set. If these patterns can be named, the names can be attached to the associated nodes in the trained net.

During mapping, there will be one single ''winning'' neuron: the neuron whose weight vector lies closest to the input vector. This can be determined simply by calculating the Euclidean distance between the input vector and the weight vectors.

While representing input data as vectors has been emphasized in this article, any kind of object which can be represented digitally, which has an appropriate distance measure associated with it, and in which the necessary operations for training are possible can be used to construct a self-organizing map. This includes matrices, continuous functions or even other self-organizing maps.

=== Algorithm ===
# Randomize the node weight vectors in a map
# For <math>s = 0, 1, 2, ..., \lambda</math>
## Randomly pick an input vector <math>{D}(t)</math>
## Find the node in the map closest to the input vector. This node is the '''best matching unit''' (BMU). Denote it by <math>u</math>
## For each node <math>v</math>, update its vector by pulling it closer to the input vector: <math display="block">W_{v}(s + 1) = W_{v}(s) + \theta(u, v, s) \cdot \alpha(s) \cdot (D(t) - W_{v}(s)) </math>

The variable names mean the following, with vectors in bold:
* <math>s</math> is the current iteration
* <math>\lambda</math> is the iteration limit
* <math>t</math> is the index of the target input data vector in the input data set <math>\mathbf{D}</math>
* <math>{D}(t)</math> is a target input data vector
* <math>v</math> is the index of the node in the map
* <math>\mathbf{W}_v</math> is the current weight vector of node <math>v</math>
* <math>u</math> is the index of the best matching unit (BMU) in the map
* <math>\theta (u, v, s)</math> is the neighbourhood function
* <math>\alpha (s)</math> is the learning rate schedule

The key design choices are the shape of the SOM, the neighbourhood function, and the learning rate schedule. The idea of the neighborhood function is that the BMU is updated the most, its immediate neighbors a little less, and so on. The idea of the learning rate schedule is that the map updates are large at the start and gradually stop.

For example, if we want to learn a SOM using a square grid, we can index it using <math>(i, j)</math> where both <math>i, j \in 1:N</math>. The neighborhood function can make it so that the BMU updates in full, the nearest neighbors update by half, their neighbors by half again, etc.
<math display="block">\theta((i, j), (i', j'), s) = \frac{1}{2^{|i-i'| + |j-j'|}} = \begin{cases} 1 & \text{if }i=i', j = j' \\ 1/2 & \text{if }|i-i'| + |j-j'| = 1 \\ 1/4 & \text{if }|i-i'| + |j-j'| = 2 \\ \cdots & \cdots \end{cases} </math>
And we can use a simple linear learning rate schedule <math>\alpha(s) = 1-s/\lambda</math>, as in the sketch below.

Notice in particular that the update rate does ''not'' depend on where the point is in the Euclidean space, only on where it is in the SOM itself. For example, the points <math>(1,1), (1,2)</math> are close on the SOM, so they will always update in similar ways, even when they are far apart in the Euclidean space. In contrast, even if the points <math>(1,1), (1, 100)</math> end up overlapping each other (such as if the SOM looks like a folded towel), they still do not update in similar ways.
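One way to write this loop in NumPy for the square-grid example, using the <math>2^{-(|i-i'| + |j-j'|)}</math> neighborhood and the linear schedule <math>\alpha(s) = 1-s/\lambda</math>; the data set is assumed to be an array with one sample per row, and the helper names are illustrative:

<syntaxhighlight lang="python">
import numpy as np

def train_som(D, N, lam, seed=0):
    """Train an N x N SOM on data D of shape (n_samples, dim) for lam steps."""
    rng = np.random.default_rng(seed)
    # Step 1: randomize the node weight vectors.
    W = rng.random((N, N, D.shape[1]))
    # Grid indices (i, j) of every node, used only by the neighborhood function.
    I, J = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
    for s in range(lam):
        alpha = 1 - s / lam                      # linear learning rate schedule
        x = D[rng.integers(len(D))]              # step 2.1: pick an input vector D(t) at random
        # Step 2.2: best matching unit = node with the smallest Euclidean distance to x.
        i_u, j_u = np.unravel_index(np.argmin(np.linalg.norm(W - x, axis=2)), (N, N))
        # Step 2.3: pull every node toward x, weighted by theta = 2^-(|i - i'| + |j - j'|).
        theta = 0.5 ** (np.abs(I - i_u) + np.abs(J - j_u))
        W += (theta * alpha)[..., None] * (x - W)
    return W
</syntaxhighlight>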
=== Alternative algorithm ===
# Randomize the map's nodes' weight vectors
# Traverse each input vector in the input data set
## Traverse each node in the map
### Use the [[Euclidean distance]] formula to find the similarity between the input vector and the map's node's weight vector
### Track the node that produces the smallest distance (this node is the best matching unit, BMU)
## Update the nodes in the neighborhood of the BMU (including the BMU itself) by pulling them closer to the input vector
### <math>W_{v}(s + 1) = W_{v}(s) + \theta(u, v, s) \cdot \alpha(s) \cdot (D(t) - W_{v}(s))</math>
# Increase <math>s</math> and repeat from step 2 while <math>s < \lambda</math>

=== Initialization options ===
Selection of initial weights as good approximations of the final weights is a well-known problem for all iterative methods of artificial neural networks, including self-organizing maps. Kohonen originally proposed random initialization of weights.<ref>{{cite book |first=T. |last=Kohonen |title=Self-Organization and Associative Memory |publisher=Springer |orig-year=1988 |edition=2nd |isbn=978-3-662-00784-6 |year=2012}}</ref> (This approach is reflected by the algorithms described above.) More recently, principal component initialization, in which initial map weights are chosen from the space of the first principal components, has become popular due to the exact reproducibility of the results.<ref>{{cite conference |first1=A. |last1=Ciampi |first2=Y. |last2=Lechevallier |title=Clustering large, multi-level data sets: An approach based on Kohonen self organizing maps |editor-first=D.A. |editor-last=Zighed |editor2-first=J. |editor2-last=Komorowski |editor3-first=J. |editor3-last=Zytkow |date=2000 |publisher=Springer |volume=1910 |pages=353–358 |doi=10.1007/3-540-45372-5_36 |book-title=Principles of Data Mining and Knowledge Discovery: 4th European Conference, PKDD 2000 Lyon, France, September 13–16, 2000 Proceedings |series=Lecture notes in computer science |isbn=3-540-45372-5 |doi-access=free }}</ref>

[[File:Self oraganizing map cartography.jpg|thumb|Cartographical representation of a self-organizing map ([[U-Matrix]]) based on Wikipedia featured article data (word frequency). Distance is inversely proportional to similarity. The "mountains" are edges between clusters. The red lines are links between articles.]]

A careful comparison of random initialization to principal component initialization for a one-dimensional map, however, found that the advantages of principal component initialization are not universal. The best initialization method depends on the geometry of the specific dataset. Principal component initialization was preferable (for a one-dimensional map) when the principal curve approximating the dataset could be univalently and linearly projected on the first principal component (quasilinear sets). For nonlinear datasets, however, random initialization performed better.<ref>{{cite journal |last1=Akinduko |first1=A.A. |last2=Mirkes |first2=E.M. |last3=Gorban |first3=A.N. |year=2016 |title=SOM: Stochastic initialization versus principal components |url=https://www.researchgate.net/publication/283768202 |journal=Information Sciences |volume=364–365 |pages=213–221 |doi=10.1016/j.ins.2015.10.013 }}</ref>
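For illustration, principal component initialization might be sketched as follows, spreading the initial weights over a regular grid in the plane of the first two principal components of the data; scaling the grid by the component standard deviations is one reasonable convention rather than a fixed part of the method:

<syntaxhighlight lang="python">
import numpy as np

def pca_init(D, n_rows, n_cols):
    """Place initial SOM weights on the plane spanned by the first two principal components of D."""
    mean = D.mean(axis=0)
    # Principal directions (rows of Vt) and singular values of the centered data.
    _, svals, Vt = np.linalg.svd(D - mean, full_matrices=False)
    scale = svals[:2] / np.sqrt(len(D) - 1)   # standard deviations along PC1 and PC2
    # Regular grid of coefficients in [-1, 1] along the two leading components.
    A, B = np.meshgrid(np.linspace(-1, 1, n_rows), np.linspace(-1, 1, n_cols), indexing="ij")
    W = mean + A[..., None] * scale[0] * Vt[0] + B[..., None] * scale[1] * Vt[1]
    return W                                  # shape (n_rows, n_cols, dim)
</syntaxhighlight>

Unlike random initialization, this starting map is fully determined by the data, which is what makes the results exactly reproducible.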