Our group has published a new research paper introducing ISRLU:
![Carlile et al, 2017 ISRLU](https://i1.wp.com/aiperf.com/blog/wp-content/uploads/2017/10/blog-image-isrlu-copy.png?w=600)
“Improving Deep Learning by Inverse Square Root Linear Units (ISRLUs)”
Brad Carlile, Guy Delamarter, Paul Kinney, Akiko Marti & Brian Whitney
https://arxiv.org/pdf/1710.09967.pdf
This 2017 paper has been submitted to a 2018 conference. ISRLU (pronounced is-rah-lou, /ɪz rɑː luː/) and ISRU (pronounced is-rue, /ɪz ruː/) are faster activation functions for all DNNs:
- ISRLU has all of the great properties of ELU, but can be implemented much faster in software or in HW/SW codesigned DNN hardware (a short sketch of both functions follows this list).
- ISRLU has continuous 1st and 2nd derivatives (ELU and ReLU do not).
- ISRU has a curve similar to tanh and sigmoid for RNNs, but is much faster than both.
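For concreteness, here is a minimal NumPy sketch of the two activation functions as defined in the paper; `alpha` is the paper's α hyperparameter, which controls how quickly the curve saturates:

```python
import numpy as np

def isru(x, alpha=1.0):
    # ISRU(x) = x / sqrt(1 + alpha * x^2): a smooth, sigmoid-like curve
    # bounded in (-1/sqrt(alpha), +1/sqrt(alpha))
    return x / np.sqrt(1.0 + alpha * x * x)

def isrlu(x, alpha=1.0):
    # ISRLU(x) = x for x >= 0, and x / sqrt(1 + alpha * x^2) for x < 0,
    # giving ELU-like smooth saturation on the negative side
    return np.where(x >= 0, x, x / np.sqrt(1.0 + alpha * x * x))
```

Note that the only expensive operation in either function is the inverse square root itself, which is exactly where the speed advantage over exponential-based activations comes from.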
This innovation came about through applying real performance-engineering skills:
- We saw the trend that convolutions are becoming smaller and that other strategies (e.g., Winograd) are reducing the time spent in convolutions; therefore activation functions are becoming a larger share of the work.
- Activation functions have been based on exponential functions, which are always slower to evaluate than an inverse square root.
- We know the inverse square root can always be made faster to evaluate; if it is the same speed or slower, it is time to optimize the inverse-square-root implementation.
- We remembered Paul Kinney’s 1986 invention of a fast approximate inverse-square-root technique, 16 years before Carmack & Mathisen rediscovered it and made it famous in 2002 (see the sketch after this list).
- We constructed a “sigmoid-like function” from the inverse square root and married that to ELU’s innovation.
- …Basically, understanding performance and having a huge background of experience led us to an innovation that the subject-matter experts had missed.
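Kinney's 1986 method itself is not reproduced here; purely as an illustration of the general idea, this is the widely circulated bit-level approximation of 1/sqrt(x) (the Quake III-style magic constant), plus one Newton-Raphson refinement step, sketched in NumPy:

```python
import numpy as np

def fast_inv_sqrt(x):
    """Approximate 1/sqrt(x) for float32 inputs: bit-level initial guess
    (the well-known 0x5F3759DF constant, shown only for illustration)
    plus one Newton-Raphson refinement step."""
    x = np.asarray(x, dtype=np.float32)
    i = x.view(np.int32)                      # reinterpret the float bits as an integer
    i = np.int32(0x5F3759DF) - (i >> 1)       # cheap initial estimate of 1/sqrt(x)
    y = i.view(np.float32)                    # reinterpret back as a float
    return y * (np.float32(1.5) - np.float32(0.5) * x * y * y)  # one refinement step

print(fast_inv_sqrt(np.array([0.25, 1.0, 4.0], dtype=np.float32)))  # ~[2.0, 1.0, 0.5]
```

On real hardware one would more likely use the approximate reciprocal-square-root instructions that CPUs and GPUs already provide (e.g., x86's rsqrtps) rather than this bit trick; the sketch is only meant to show why 1/sqrt(x) is so cheap to approximate.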
Remember, this kind of thinking comes from having experience with full-stack innovations and with optimizing HPC, databases, Java, ERP, NoSQL, Spark, SQL, graph analytics, intrinsics, parallelism, always optimizing at scale, microarchitectures, HW/SW codesign, …
ISRU can replace tanh and sigmoid (add 1, then scale by 1/2). It is duly noted that activation-function performance is less important for RNNs today because the matrix multiplications dominate, but future optimizations may change this characteristic.
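A minimal sketch of that replacement, assuming the same `isru` definition as above: plain ISRU stands in for tanh, and shifting ISRU's (-1, 1) output by 1 and halving it gives a sigmoid-shaped curve on (0, 1):

```python
import numpy as np

def isru(x, alpha=1.0):
    return x / np.sqrt(1.0 + alpha * x * x)

def isru_sigmoid(x, alpha=1.0):
    # Map ISRU's (-1, 1) range onto (0, 1): add 1, scale by 1/2
    return 0.5 * (isru(x, alpha) + 1.0)

x = np.linspace(-4, 4, 9)
print(isru(x))          # tanh-like replacement
print(isru_sigmoid(x))  # sigmoid-like replacement
```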
![Carlile et al, 2017 ISRLU](https://i1.wp.com/aiperf.com/blog/wp-content/uploads/2017/10/blog-image-isru-copy-1024x583.png?w=600)
We are of course working on other AI innovations… we’ll keep you posted.
* Pronunciation guide from Wikipedia: https://en.wikipedia.org/wiki/Help:IPA/English