==Types of activation function==
{{Main|Activation function}}

The activation function of a neuron is chosen to have a number of properties which either enhance or simplify the network containing the neuron. Crucially, for instance, any [[multilayer perceptron]] using a linear activation function has an equivalent single-layer network; a ''non''-linear function is therefore necessary to gain the advantages of a multi-layer network.{{Citation needed|date=May 2018}}

Below, <math>u</math> refers in all cases to the weighted sum of all the inputs to the neuron, i.e. for <math>n</math> inputs,

: <math>u = \sum_{i=1}^n w_i x_i</math>

where <math>w</math> is a vector of synaptic weights and <math>x</math> is a vector of inputs.

===Step function===
{{Main|Step function}}

The output <math>y</math> of this activation function is binary, depending on whether the input meets a specified threshold, <math>\theta</math> (theta). The "signal" is sent, i.e. the output is set to 1, if the activation meets or exceeds the threshold.

: <math>y = \begin{cases} 1 & \text{if }u \ge \theta \\ 0 & \text{if }u < \theta \end{cases}</math>

This function is used in [[perceptron]]s and appears in many other models. It performs a division of the [[Vector space|space]] of inputs by a [[hyperplane]]. It is especially useful in the last layer of a network intended, for example, to perform binary classification of the inputs.

===Linear combination===
{{Main|Linear combination}}

In this case, the output of the unit is simply the weighted sum of its inputs plus a bias term. A number of such linear neurons perform a linear transformation of the input vector. This is usually more useful in the early layers of a network. A number of analysis tools based on linear models, such as [[harmonic analysis]], exist, and they can all be used in neural networks with this linear neuron. The bias term makes it possible to apply [[homogeneous coordinates|affine transformations]] to the data.

===Sigmoid===
{{Main|Sigmoid function}}

A fairly simple nonlinear function, a [[sigmoid function]] such as the logistic function has an easily calculated derivative, which can be important when calculating the weight updates in the network. It thus makes the network more easily manipulable mathematically, and was attractive to early computer scientists who needed to minimize the computational load of their simulations. It was previously commonly seen in [[multilayer perceptron]]s. However, recent work has shown sigmoid neurons to be less effective than [[Rectifier (neural networks)|rectified linear]] neurons. The reason is that the gradients computed by the [[backpropagation]] algorithm tend to diminish towards zero as activations propagate through layers of sigmoidal neurons, making it difficult to optimize neural networks using multiple layers of sigmoidal neurons.<!-- This part of the article needs to be expanded -->

===Rectifier===
{{Main|Rectifier (neural networks)}}

In the context of [[artificial neural network]]s, the '''rectifier''' or '''ReLU (Rectified Linear Unit)''' is an [[activation function]] defined as the positive part of its argument:

: <math>f(x) = x^+ = \max(0, x),</math>

where <math>x</math> is the input to a neuron. This is also known as a [[ramp function]] and is analogous to [[half-wave rectification]] in electrical engineering. This activation function was first introduced to a dynamical network by Hahnloser et al. in a 2000 paper in ''[[Nature (journal)|Nature]]''<ref name="Hahnloser2000">{{cite journal | last1=Hahnloser | first1=Richard H. R. | last2=Sarpeshkar | first2=Rahul | last3=Mahowald | first3=Misha A. | last4=Douglas | first4=Rodney J. | last5=Seung | first5=H. Sebastian | title=Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit | journal=Nature | volume=405 | issue=6789 | year=2000 | issn=0028-0836 | doi=10.1038/35016072 | pmid=10879535 | pages=947–951| bibcode=2000Natur.405..947H | s2cid=4399014 }}</ref> with strong [[biological]] motivations and mathematical justifications.<ref name="Hahnloser2001">{{cite conference |author=R Hahnloser |author2=H.S. Seung |year=2001 |title=Permitted and Forbidden Sets in Symmetric Threshold-Linear Networks|conference=NIPS 2001}}</ref> It was first demonstrated in 2011 to enable better training of deeper networks,<ref name="glorot2011">{{cite conference |author1=Xavier Glorot |author2=Antoine Bordes |author3=[[Yoshua Bengio]] |year=2011 |title=Deep sparse rectifier neural networks |conference=AISTATS |url=http://jmlr.org/proceedings/papers/v15/glorot11a/glorot11a.pdf}}</ref> compared to the activation functions widely used before 2011, i.e., the [[Logistic function|logistic sigmoid]] (which is inspired by [[probability theory]]; see [[logistic regression]]) and its more practical<ref>{{cite encyclopedia |author=[[Yann LeCun]] |author2=[[Leon Bottou]] |author3=Genevieve B. Orr |author4=[[Klaus-Robert Müller]] |year=1998 |url=http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf |title=Efficient BackProp |editor=G. Orr |editor2=K. Müller |encyclopedia=Neural Networks: Tricks of the Trade |publisher=Springer}}</ref> counterpart, the [[hyperbolic tangent]].

A commonly used variant of the ReLU activation function is the leaky ReLU, which allows a small, positive gradient when the unit is not active:

: <math>f(x) = \begin{cases} x & \text{if } x > 0, \\ ax & \text{otherwise}, \end{cases}</math>

where <math>x</math> is the input to the neuron and <math>a</math> is a small positive constant (set to 0.01 in the original paper).<ref name="maas2014">Andrew L. Maas, Awni Y. Hannun, Andrew Y. Ng (2014). [https://ai.stanford.edu/~amaas/papers/relu_hybrid_icml2013_final.pdf Rectifier Nonlinearities Improve Neural Network Acoustic Models].</ref>
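The activation functions described above can be illustrated with a minimal [[Python (programming language)|Python]] sketch. It assumes [[NumPy]] is available; the helper names are chosen for this example only and do not come from any particular neural-network library.

<syntaxhighlight lang="python">
import numpy as np

def weighted_sum(w, x):
    """Pre-activation u = sum_i w_i * x_i for weights w and inputs x."""
    return np.dot(w, x)

def step(u, theta=0.0):
    """Step function: output 1 if u meets or exceeds the threshold theta, else 0."""
    return np.where(u >= theta, 1.0, 0.0)

def linear(u, bias=0.0):
    """Linear combination: the weighted sum plus a bias term."""
    return u + bias

def logistic_sigmoid(u):
    """Logistic sigmoid 1 / (1 + e^(-u)); its derivative is s(u) * (1 - s(u))."""
    return 1.0 / (1.0 + np.exp(-u))

def relu(u):
    """Rectifier (ReLU): the positive part of the argument, max(0, u)."""
    return np.maximum(0.0, u)

def leaky_relu(u, a=0.01):
    """Leaky ReLU: u if u > 0, otherwise a * u for a small positive constant a."""
    return np.where(u > 0, u, a * u)

# Example (illustrative values): one neuron with three inputs.
w = np.array([0.4, -0.2, 0.1])   # synaptic weights
x = np.array([1.0, 2.0, 3.0])    # inputs
u = weighted_sum(w, x)           # 0.4 - 0.4 + 0.3 = 0.3
print(step(u), logistic_sigmoid(u), relu(u), leaky_relu(-u))
</syntaxhighlight>

Because each helper operates element-wise on NumPy arrays, the same code applies unchanged to a single pre-activation value or to a whole layer of them.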