New DNN research: ISRLU – an innovative activation function

Our group has published a new research paper introducing ISRLU:

“Improving Deep Learning by Inverse Square Root Linear Units (ISRLUs)”
Brad Carlile, Guy Delamarter, Paul Kinney, Akiko Marti & Brian Whitney
https://arxiv.org/pdf/1710.09967.pdf

This 2017 paper has been submitted to a 2018 conference. ISRLU (pronounced is-rah-lou, /ɪz rɑː luː/) and ISRU (pronounced is-rue, /ɪz ruː/) are faster activation functions for all DNNs:

  • ISRLU has all of the great properties of ELU, but can be implemented much faster in SW or in HW/SW codesigned DNN hardware.
  • ISRLU has continuous 1st and 2nd derivatives (ELU and ReLU do not).
  • ISRU has a similar curve to tanh & sigmoid for RNNs, but is much faster than both. (A short sketch of both functions follows this list.)
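
For reference, here is a minimal NumPy sketch of the two activations; the formulas follow the paper, with α > 0 controlling how quickly the negative side saturates:

```python
import numpy as np

def isru(x, alpha=1.0):
    # Inverse Square Root Unit: a tanh-like S-curve built from 1/sqrt
    return x / np.sqrt(1.0 + alpha * x * x)

def isrlu(x, alpha=1.0):
    # Inverse Square Root Linear Unit: identity for x >= 0,
    # smoothly saturating (ELU-like) curve for x < 0
    return np.where(x >= 0.0, x, x / np.sqrt(1.0 + alpha * x * x))
```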

This innovation came about through applying real performance-engineering skills.

  • We saw the trend that convolutions are becoming smaller and other strategies (e.g., Winograd) are reducing the time to do a convolution; therefore activation functions are becoming more important.
  • Activation functions have been based on exponential functions, which are always slower to evaluate.
  • We know the inverse square root can always be faster to evaluate; if it is the same speed or slower, it is time to optimize the inverse-square-root implementation.
  • We remembered Paul Kinney’s 1986 invention of the fast approximate inverse-square-root technique, 16 years before Carmack & Mathisen rediscovered it and made it famous in 2002. (A transcription follows this list.)
  • We constructed a “sigmoid-like function” from the inverse square root and married that to ELU’s innovation.
  • …Basically, understanding performance, backed by a huge range of experience, led us to an innovation that subject-matter experts had missed.
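
For those who haven’t seen it, here is the well-known bit-trick form of the fast approximate inverse square root (the variant Carmack made famous, transcribed to Python for illustration; this is not the paper’s implementation):

```python
import struct

def fast_inv_sqrt(x):
    # Reinterpret the float32 bit pattern as a 32-bit integer
    i = struct.unpack('<i', struct.pack('<f', x))[0]
    i = 0x5f3759df - (i >> 1)              # magic-constant initial guess
    y = struct.unpack('<f', struct.pack('<i', i))[0]
    return y * (1.5 - 0.5 * x * y * y)     # one Newton-Raphson refinement
```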

Remember, this kind of thinking comes from having experience with full-stack innovation and with optimizing HPC, databases, Java, ERP, NoSQL, Spark, SQL, graph analytics, intrinsics, parallelism, always optimizing at scale, microarchitectures, HW/SW codesign, …

ISRU can replace tanh directly, and sigmoid as well (add 1, then scale by 1/2). While it is duly noted that activation-function performance is less important for RNNs due to the matrix multiplications, future optimizations may change this characteristic.
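
To make the substitution concrete, here is a small sketch (repeating isru from above so the block stands alone):

```python
import numpy as np

def isru(x, alpha=1.0):
    return x / np.sqrt(1.0 + alpha * x * x)

def sigmoid_like(x, alpha=1.0):
    # For alpha = 1, isru maps onto (-1, 1); adding 1 and scaling by
    # 1/2 lands on (0, 1), the range of the logistic sigmoid
    return 0.5 * (isru(x, alpha) + 1.0)
```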

We are of course working on other AI innovations… we’ll keep you posted.

* Pronunciation guide from Wikipedia: https://en.wikipedia.org/wiki/Help:IPA/English

Research Paper: DNN One Feature at a Time

A summary of an interesting DNN research paper we are reading…

“Building effective deep neural network architectures one feature at a time”
Martin Mundt, Tobias Weis, Kishore Konda, Visvanathan Ramesh
https://arxiv.org/pdf/1705.06778.pdf

This 2017 paper looks at adding one feature map at a time to a CNN.

One Feature at a Time summary:

  • Instead of starting with an arbitrary number of feature maps at every level of a CNN, they start with one, then add one feature map at a time as determined by a metric.
  • The end state of their model challenges the long-standing “rule of thumb” that one should monotonically increase feature maps at the higher levels (shown in red in the bar graph above). The final state of the feature-at-a-time model (shown in green) has a very different profile.
  • The metric used is how much a feature has changed with respect to its initial state, i.e., features whose structure changes greatly from the initial state are more likely to play a vital role.
  • Growing one feature at a time comes at less computational cost than training too large an architecture.
  • More effective CNN architectures should grow the number of features as one moves toward the middle layers (getting to a higher-dimensional state) and then reduce the number of features to push toward the better aggregation needed at the final layers. (A sketch of one such change metric follows this list.)
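
As a rough illustration of “change with respect to the initial state,” here is a minimal sketch that scores a feature map by its cosine distance from its initial weights. The paper defines its own metric of structural change, so treat this, and the hypothetical add_feature_map / threshold names, as illustrative only:

```python
import numpy as np

def feature_change(w_now, w_init, eps=1e-8):
    # Cosine distance between current and initial weights of one
    # feature map: 0 = unchanged, up to 2 = fully reversed
    cos = np.dot(w_now.ravel(), w_init.ravel()) / (
        np.linalg.norm(w_now) * np.linalg.norm(w_init) + eps)
    return 1.0 - cos

# Hypothetical growth loop: keep adding feature maps to a layer while
# the newest one still shows substantial structural change.
# while feature_change(w_now, w_init) > threshold:
#     add_feature_map(layer)   # placeholder for the growth step
```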

The arbitrary number of feature maps has always bothered me. I find this paper to be quite inspiring.

We quickly and crudely tried just having more features in the middle of some of our TensorFlow CNN models, and it improved our test accuracy (reducing our cross-entropy loss). Now we need to try their full feature-at-a-time approach! …Is any of this code posted on GitHub?

The only thing that bothers me is that it seems there must be a slightly more universal metric than the amount of change from the initial state, because what if the initialization happened to land in a very good state? I completely buy their argument that this is unlikely for the randomized-initialization cases, but what if other techniques are used to initialize? I’m going to spend some time thinking about alternatives.

All the best to the authors!

This is definitely an interesting area of research that our team is thinking more about…

AI/ML Public Lecture at the Institute for Advanced Study, Friday Oct 27, Princeton

www.ias.edu

Sanjeev Arora & Richard Zemel gave a public lecture, “Machines: How Do They Learn and Where Are They Headed?,” on Friday, October 27, at the Institute for Advanced Study.

The lecture was recorded and you can see it at: https://www.youtube.com/watch?v=jUUh_hfeBxw

See the URL for more info: https://www.ias.edu/news/2017/arora-zemel-publiclecture

DNN: A Tutorial and Survey

A kinder, gentler DNN research paper summary…

“Efficient Processing of Deep Neural Networks: A Tutorial and Survey”
Vivienne Sze, Yu-Hsin Chen, Tien-Ju Yang, and Joel Emer
https://arxiv.org/pdf/1703.09039.pdf

This 2017 paper provides a good overview of the latest trends in DNNs (Deep Neural Networks) by our friends at MIT. For those not reading all of the research papers, this will serve as a kinder, gentler introduction to the latest work.

DNN Tutorial Survey:

  • An intro to DNNs
  • DNN models that are currently in use
  • Hardware for DNN and implications for future energy efficient designs
  • HW/SW Codesign optimizations

This broad survey packs a lot into its 32 pages, covers most of the latest developments, and of course has a great reference section (162 references). After the first 12-13 pages, it focuses more on HW design implications.

Research Paper: Large Cyclical Learning Rates

A summary of an interesting DNN research paper we are reading…

“Super-Convergence: Very Fast Training of Residual Networks Using Large Learning Rates”
Leslie N. Smith and Nicholay Topin
https://arxiv.org/pdf/1708.07120.pdf

This 2017 paper looks at cyclical learning rates (CLR) and a large max learning rate to speed up training… and maybe even lead to “super-convergence.”

Cyclical Learning with a large learning rate summary:

  • Some NN training can be dramatically sped up by using cyclical learning rates (CLR)
  • On the CIFAR-10 dataset, their model reached 81.4% validation accuracy in only 25k iterations (“super-convergence”) vs. 70k iterations with standard hyperparameters
  • One specifies the min and max learning-rate boundaries and the number of iterations per cycle; a cycle consists of increasing and then decreasing the learning rate (see the schedule sketch after this list)
  • Increasing the learning rate may be an effective way of escaping saddle points
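
For concreteness, here is a minimal sketch of the triangular CLR schedule from Smith’s earlier CLR work (variable names are ours):

```python
import numpy as np

def triangular_clr(iteration, step_size, base_lr, max_lr):
    # One full cycle = 2 * step_size iterations: the learning rate
    # ramps linearly from base_lr up to max_lr, then back down
    cycle = np.floor(1 + iteration / (2 * step_size))
    x = np.abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)

# e.g., sweep 0.1 -> 1.0 -> 0.1 over 10k-iteration cycles:
# lr = triangular_clr(step, step_size=5000, base_lr=0.1, max_lr=1.0)
```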

While this technique doesn’t always produce “super-convergence” for DNNs, it can happen for some.

For an intuitive feel of the algorithm we offer the following: at the beginning, the learning rate must be small to allow progress in an appropriate direction, since the curvature may change drastically. As the slope decreases, one can make more progress with bigger learning rates. Finally, smaller learning rates are needed when the solution does “fine-scale” maneuvering. In the last stages, a valley may have fine-scale features, such as a variety of troughs, that must be traveled to find the local minimum.

For a visual of what a CLR cycle looks like, see the image below. In the paper they discuss that an appropriately large max_lr may cause this super-convergence.

The architectures and code to replicate the data in the paper are available at github.com/lnsmith54/super-convergence. For earlier work on CLR see: https://github.com/bckenstler/CLR

They present evidence that training with large learning rates applied in the right fashion improves performance by regularizing the network.

This is an interesting area of research, and it has given our team thoughts on how this idea may also help with new parallel implementations.

Research Paper: Dilated Convolutions

A summary of an interesting DNN research paper we are reading…

“Multi-Scale Context Aggregation by Dilated Convolutions”
Fisher Yu, Vladlen Koltun
https://arxiv.org/pdf/1511.07122.pdf

This 2016 paper introduced dilated convolutions. We expect it to get wider attention.

Dilated Convolutions summary:

  • basically Conv filters with gaps between the filter elements
  • adds a level of scale invariance
  • broader view of the input, can capture multi-scale contextual information
  • one doesn’t need to lose resolution or analyze rescaled images

Typically, when we talk about strides in convolutions for DNNs, it means striding the filter over the input and applying the filter to adjacent input elements: one skips some number of input elements and reapplies the filter. In dilated convolutions, consecutive filter elements are applied to non-adjacent input elements.

In dilated convolutions the filter is applied with defined gaps. For example, if a 3×3 filter is applied to a 2D image, a dilation of k=1 is just a normal convolution; with k=2 one skips one pixel between filter taps, spreading the filter values out over the input; k=4 means skipping 3 pixels.
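
To make this concrete, here is a naive NumPy sketch of a 2D dilated convolution (illustrative only; real frameworks implement dilated/atrous convolutions far more efficiently):

```python
import numpy as np

def dilated_conv2d(image, kernel, k=1):
    # k is the dilation rate: k=1 is an ordinary convolution; k=2
    # leaves a one-pixel gap between filter taps; k=4 skips 3 pixels
    kh, kw = kernel.shape
    eff_h = (kh - 1) * k + 1          # effective (dilated) filter height
    eff_w = (kw - 1) * k + 1          # effective (dilated) filter width
    out = np.zeros((image.shape[0] - eff_h + 1,
                    image.shape[1] - eff_w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Sample the input at non-adjacent, dilated positions
            patch = image[i:i + eff_h:k, j:j + eff_w:k]
            out[i, j] = np.sum(patch * kernel)
    return out
```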

Figure 1 below (from the paper) shows dilated 2D convolution on 2D data.

The caption has confused some people on the web, so we offer the following additional explanation: the red dots show which 3×3 filter elements are applied to which input elements. (a) shows a normal 3×3 filter; the 9 elements of the filter are applied to consecutive elements of the input. (b) shows the same 3×3 filter, but notice that the same 9 elements are applied to different input points with a gap of 1 between them {dilation k=2}. (c) shows the same 3×3 filter, but notice it is applied to different input points with a gap of 3 {dilation k=4}. The blue/green shading shows the receptive field captured by the next layer.

It gets us thinking about creating new ways to add scale and rotational invariance to DNNs.

Gluon: A new ML library from AWS & Microsoft

Gluon is a new ML library from AWS & Microsoft.


Gluon looks like an interesting interface for building NN & ML models. I like the ability to ‘hybridize’ between symbolic representations and the ‘imperative’ workflow that is more familiar to many people. From what I can see, Gluon should let everyone speed up writing models in general while still allowing researchers to customize/hack one part of a model and leverage the simplicity of the other common layers.
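
As a rough sketch of what that hybrid style looks like (based on the Gluon examples; details may differ across versions):

```python
import mxnet as mx
from mxnet.gluon import nn

net = nn.HybridSequential()
with net.name_scope():
    net.add(nn.Dense(256, activation='relu'))  # written imperatively...
    net.add(nn.Dense(10))
net.initialize()
net.hybridize()   # ...then compiled down to a symbolic graph for speed

out = net(mx.nd.ones((1, 512)))  # first call triggers the hybrid compile
```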

My group is preparing a paper for submission to the upcoming ICLR 2018. We did all of this work using TensorFlow, but it would have been interesting to do our research using Gluon/MXNet (or the upcoming Gluon/Microsoft Cognitive Toolkit).

More details at: https://lnkd.in/gfV_td8