Swish as an Activation Function in Neural Networks

The choice of activation function is very important and can greatly influence the accuracy and training time of a model. Swish is a relatively new activation function, first proposed in 2017 using a combination of exhaustive and reinforcement-learning-based search. The authors of the paper introducing the Swish activation function found that, when used as a drop-in replacement for ReLU in existing model architectures, it outperforms ReLU and related activation functions such as Leaky ReLU (LReLU), Parameterized ReLU (PReLU), Softplus, the Exponential Linear Unit (ELU), the Scaled Exponential Linear Unit (SELU) and the Gaussian Error Linear Unit (GELU) on standard datasets such as ImageNet and CIFAR.

The Swish activation function is continuous at all points. Its shape looks similar to ReLU, being unbounded above and bounded below. Unlike ReLU, however, it is differentiable at all points and is non-monotonic.
The reason it looks so much like ReLU is that for large positive values of x, σ(x) becomes approximately equal to 1, so the value of the Swish function becomes approximately equal to x. Similarly, for large negative values of x, σ(x) becomes approximately equal to 0, so the value of the Swish function becomes approximately equal to 0.

Mathematically, it is defined as:

y = swish(x) = x · σ(βx)

where σ(x) = 1/(1 + exp(-x)) is the sigmoid function. β can either be a constant fixed before training or a parameter that is learned during training. The graph of the Swish function is given below:
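
As a quick illustration of this definition, here is a minimal NumPy sketch of the Swish function (the function names and the default β = 1, which corresponds to the commonly used SiLU variant, are illustrative choices, not code from the original paper):

```python
import numpy as np

def sigmoid(x):
    # Logistic function: sigma(x) = 1 / (1 + exp(-x)).
    return 1.0 / (1.0 + np.exp(-x))

def swish(x, beta=1.0):
    # swish(x) = x * sigma(beta * x); beta = 1 is the commonly used default (SiLU).
    return x * sigmoid(beta * x)

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(swish(x))           # ~0 for large negative x, ~x for large positive x
print(swish(x, beta=2.0)) # a larger beta pushes the curve closer to ReLU
```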

Fig 1. Swish Activation Function

The value of β also greatly influences the shape of the curve, and hence the output, accuracy and training time: as β approaches 0, Swish approaches the linear function x/2, while as β grows large it approaches ReLU. The graph below compares the Swish function for various values of β.

Fig 2. Comparison of the Swish Activation Function for various values of β
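
To make the effect of β more concrete, the short sketch below evaluates an illustrative Swish helper at a few sample points for several β values (the particular grid and β values are arbitrary):

```python
import numpy as np

def swish(x, beta=1.0):
    # Same definition as above: swish(x) = x * sigma(beta * x).
    return x / (1.0 + np.exp(-beta * x))

xs = np.linspace(-4.0, 4.0, 9)
for beta in (0.1, 1.0, 10.0):
    print(f"beta={beta:>4}:", np.round(swish(xs, beta), 3))
# Small beta behaves almost like the linear function x/2;
# large beta is nearly indistinguishable from ReLU, max(0, x).
```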

The derivative of the Swish activation function is:

f'(x) = β f(x) + σ(βx)(1 - β f(x))
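
As a sanity check on this formula, the sketch below compares the closed-form derivative with a central finite-difference approximation (the helper names and the step size h are illustrative choices):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x, beta=1.0):
    return x * sigmoid(beta * x)

def swish_grad(x, beta=1.0):
    # f'(x) = beta * f(x) + sigma(beta * x) * (1 - beta * f(x))
    f = swish(x, beta)
    return beta * f + sigmoid(beta * x) * (1.0 - beta * f)

x = np.linspace(-3.0, 3.0, 7)
h = 1e-5
numeric = (swish(x + h, 1.5) - swish(x - h, 1.5)) / (2 * h)  # central difference
print(np.allclose(swish_grad(x, 1.5), numeric, atol=1e-6))   # expected: True
```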

The graph below compares the derivatives of the Swish function for various values of β.

Fig 3. Comparison of derivatives of the Swish Activation Function for various values of β

Pros

  • It is continuous and differentiable at all points.
  • It is simple and easy to use.
  • Unlike ReLU, it does not suffer from the dying-neuron problem, since its gradient is non-zero for negative inputs (see the sketch after this list).
  • It performs better than other activation functions such as ReLU, Leaky ReLU, Parameterized ReLU, ELU, SELU and GELU when compared on standard datasets such as CIFAR and ImageNet.
  • Because it is unbounded above, its output does not saturate for large positive inputs, which helps avoid the vanishing-gradient problem that squashing activations such as sigmoid and tanh suffer from.
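
To illustrate the dying-neuron point from the list above, the sketch below compares the gradients of ReLU and Swish on a few negative inputs (the gradient formulas are the standard ones; the sample inputs are arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu_grad(x):
    # ReLU's gradient is exactly zero for negative inputs, so a neuron stuck
    # in the negative region receives no learning signal ("dies").
    return (x > 0).astype(float)

def swish_grad(x, beta=1.0):
    # Derivative from the formula above: beta*f(x) + sigma(beta*x)*(1 - beta*f(x)).
    f = x * sigmoid(beta * x)
    return beta * f + sigmoid(beta * x) * (1.0 - beta * f)

x = np.array([-3.0, -1.0, -0.5])
print("ReLU gradients: ", relu_grad(x))                 # [0. 0. 0.]
print("Swish gradients:", np.round(swish_grad(x), 4))   # non-zero, unlike ReLU
```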

Cons

  • It is slower to compute than ReLU and its variants such as Leaky ReLU and Parameterized ReLU, because evaluating the output requires computing the sigmoid function (a rough timing sketch follows this list).
  • Some follow-up research has reported that the gains from Swish are not consistent across models and tasks, and cannot be predicted a priori.
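
As a rough illustration of the first drawback, the sketch below times a NumPy ReLU against a NumPy Swish with Python's timeit module; the array size and repetition count are arbitrary, and actual timings will vary with hardware and framework:

```python
import timeit
import numpy as np

x = np.random.randn(1_000_000)

def relu(x):
    return np.maximum(x, 0.0)

def swish(x, beta=1.0):
    return x / (1.0 + np.exp(-beta * x))  # x * sigmoid(beta * x)

t_relu = timeit.timeit(lambda: relu(x), number=100)
t_swish = timeit.timeit(lambda: swish(x), number=100)
print(f"ReLU : {t_relu:.3f} s for 100 calls")
print(f"Swish: {t_swish:.3f} s for 100 calls")  # typically slower, due to exp()
```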