New DNN research: ISRLU – an innovative activation function

Our group has published a new research paper introducing ISRLU:

“Improving Deep Learning by Inverse Square Root Linear Units (ISRLUs)”
Brad Carlile, Guy Delamarter, Paul Kinney, Akiko Marti & Brian Whitney
https://arxiv.org/pdf/1710.09967.pdf

This 2017 paper has been submitted to a 2018 conference. ISRLU (pronounced is-rah-lou, /ɪz rɑː luː/) and ISRU (pronounced is-rue, /ɪz ruː/) are faster activation functions for all DNNs:

  • ISRLU has all of the great properties of ELU, but can be implemented much faster in SW or in HW/SW-codesigned DNN hardware (both functions are defined in the sketch after this list).
  • ISRLU has continuous first and second derivatives (ELU and ReLU do not).
  • ISRU has a curve similar to tanh and sigmoid for RNNs, but is much faster to evaluate than either.
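
For concreteness, here is a minimal NumPy sketch of the two functions as defined in the paper; the helper names below are illustrative, not from the paper, and α is the tunable parameter that sets where ISRLU saturates on the negative side (at -1/√α).

```python
import numpy as np

def isru(x, alpha=1.0):
    """Inverse Square Root Unit: x / sqrt(1 + alpha * x^2)."""
    return x / np.sqrt(1.0 + alpha * x * x)

def isrlu(x, alpha=1.0):
    """Inverse Square Root Linear Unit: identity for x >= 0,
    the ISRU curve for x < 0, saturating smoothly toward -1/sqrt(alpha)."""
    return np.where(x >= 0.0, x, isru(x, alpha))

def isrlu_grad(x, alpha=1.0):
    """First derivative: 1 for x >= 0 and (1/sqrt(1 + alpha*x^2))**3 for x < 0;
    the two pieces meet at x = 0, so there is no ReLU-style kink."""
    return np.where(x >= 0.0, 1.0, (1.0 / np.sqrt(1.0 + alpha * x * x)) ** 3)
```

Like ELU, the negative side saturates, which pushes mean activations toward zero; unlike ELU, no exponential is evaluated.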

This innovation came about by applying real performance-engineering skills.

  • We saw the trend that convolutions are getting smaller and that other strategies (e.g., Winograd) are reducing the time spent in each convolution; as a result, activation functions account for a growing share of runtime.
  • Activation functions have typically been based on exponentials, which are comparatively slow to evaluate.
  • We know the inverse square root can always be made faster to evaluate; if it is the same speed or slower, it is time to optimize the inverse-square-root implementation.
  • We remembered Paul Kinney’s 1986 invention of a fast approximate inverse-square-root technique, 16 years before Carmack & Mathisen rediscovered it and made it famous in 2002 (a sketch of the well-known bit-trick variant follows this list).
  • We constructed a “sigmoid-like” function from the inverse square root and married it to ELU’s innovation.
  • …Basically, understanding performance, backed by a huge range of experience, led us to an innovation that the subject-matter experts had missed.
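
As an aside on the inverse-square-root point above, the widely published bit-level trick (the variant popularized via Quake III, with the 0x5f3759df constant and one Newton-Raphson step) gives a feel for why an approximate inverse square root is so cheap. This is only an illustrative sketch of that well-known variant, not necessarily Kinney's 1986 formulation, and real DNN kernels would lean on hardware rsqrt approximations instead:

```python
import struct

def fast_inv_sqrt(x: float) -> float:
    """Approximate 1/sqrt(x) via the classic bit trick:
    reinterpret the float's bits as an integer, apply the magic constant,
    then refine with one Newton-Raphson step."""
    i = struct.unpack('<I', struct.pack('<f', x))[0]   # float bits -> uint32
    i = 0x5f3759df - (i >> 1)                          # magic initial estimate
    y = struct.unpack('<f', struct.pack('<I', i))[0]   # uint32 bits -> float
    return y * (1.5 - 0.5 * x * y * y)                 # one refinement step
```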

Remember, this kind of thinking comes from having experience with full-stack innovation and with optimizing HPC, databases, Java, ERP, NoSQL, Spark, SQL, graph analytics, intrinsics, parallelism, microarchitectures, HW/SW codesign, always optimizing at scale,…

ISRU can replace tanh directly and sigmoid after a simple rescaling (add 1, then scale by 1/2). It is duly noted that activation-function performance matters less for RNNs today because the matrix multiplications dominate, though future optimizations may change this characteristic.
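
As a sketch of that drop-in replacement (the helper names are illustrative, not from the paper): with α = 1, ISRU already spans (-1, 1) like tanh, and the shift-and-rescale maps it into (0, 1) like the logistic sigmoid.

```python
import numpy as np

def isru(x, alpha=1.0):
    return x / np.sqrt(1.0 + alpha * x * x)

def isru_tanh_like(x):
    # With alpha = 1, ISRU already ranges over (-1, 1), like tanh.
    return isru(x, alpha=1.0)

def isru_sigmoid_like(x):
    # Add 1, scale by 1/2: maps (-1, 1) into (0, 1), like the logistic sigmoid.
    return 0.5 * (isru(x, alpha=1.0) + 1.0)
```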

We are of course working on other AI innovations… we’ll keep you posted.

* Pronunciation guide from Wikipedia: https://en.wikipedia.org/wiki/Help:IPA/English
