Brad – AI Perf Blog

AI/ML Lecture at Institute for Advanced Study: Yann LeCun

www.ias.edu

Yann LeCun (NYU & Facebook AI Research) gave a public lecture entitled “Theoretical Machine Learning Lecture Series: How Could Machines Learn as Efficiently as Animals and Humans?” at the Institute for Advanced Study on 12-Dec-2017.

The lecture was recorded and you can see it at: https://www.youtube.com/watch?v=0BUr4_ZkA1w

See the URL for more info: https://www.ias.edu/ideas/2017/machine-learning-lecun

New AI Innovation: Columnar DB for AI Featurization

Multiverse. Copyright 2007 Brad Carlile

Our new paper: “Columnar Database Techniques for Creating AI Features”
Brad Carlile, Akiko Marti & Guy Delamarter
http://arxiv.org/abs/1712.02882 (upcoming conference submission)
(special thanks to Brian Whitney & Paul Kinney for their assistance).

In this paper we show:

Our new innovation to improve the performance of columnar databases (NoSQL & RDBMS) for the purpose of AI featurization (ML & DL).
Our additions to columnar database dictionaries enable efficient feature creation calculations, by halving the required data movement and optimizing the compute.
We propose a fully integrated DB+AI architecture, rich with information flows and learning feedback mechanisms, that would further improve the whole analytics cycle. We point out other innovations in this area.
We also give a survey of the various techniques so that subject matter experts in the diverse fields of Database, Machine Learning, and Deep Learning can understand critical aspects that are outside their expertise.
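
To make the dictionary idea a little more concrete, here is a minimal Python sketch. It assumes a dictionary-encoded column (a list of distinct values plus one integer code per row) and computes a feature once per distinct value rather than once per row. The function names, the sample data, and the choice of a log transform are all illustrative assumptions; this is not the implementation from the paper.

```python
import math

def dictionary_encode(column):
    """Encode a column as (distinct values, per-row integer codes)."""
    dictionary = []
    index_of = {}
    codes = []
    for value in column:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index_of[value])
    return dictionary, codes

def featurize_via_dictionary(dictionary, codes, transform):
    """Apply transform once per distinct value, then look results up by code."""
    feature_dictionary = [transform(v) for v in dictionary]
    return [feature_dictionary[c] for c in codes]

# Hypothetical column with many repeated values.
prices = [10.0, 100.0, 10.0, 10.0, 1000.0, 100.0]
dictionary, codes = dictionary_encode(prices)
log_prices = featurize_via_dictionary(dictionary, codes, math.log10)
```

In this toy case the transform runs 3 times instead of 6; the saving grows with the amount of repetition in the column.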

This innovation came about by using real Performance Engineering skills.

We have a cross-stack understanding of the full AI stack (data source to AI). This comes from our deep experience with performance optimizations in each part of this stack.
We look across the traditional silos (DB analytics, ML analytics, DL analytics), which opens doors to new innovations and performance opportunities.
What most people see as an analytics pipeline, we see as an analytics cycle. Again, this opens up new opportunities.

Remember, this kind of thinking comes from experience with full-stack innovation and with performance optimization on every type of enterprise application: http://www.aiperf.com/bio/unique-experiences.html

Our recent DL innovation paper:
“Improving Deep Learning by Inverse Square Root Linear Units (ISRLUs)”
Brad Carlile, Guy Delamarter, Paul Kinney, Akiko Marti & Brian Whitney
https://arxiv.org/pdf/1710.09967.pdf

…We are of course working on other AI innovations… we’ll keep you posted.

WSJ: Why companies should hire teams – Not individuals

“Why Companies Should Hire Teams, Not Individuals” is Sydney Finkelstein’s new article in the Wall Street Journal (C-Suite Strategies, front page of the print edition, Oct. 30, 2017).

Some companies are breaking out of the typical problems of piecemeal hiring by hiring pre-existing groups.

Why do this? Because talented teams have already demonstrated excellence in a specialized task or function, and it’s easier and cheaper to tap those teams than to create new teams of your own. They also may have unique skills honed by unique experiences.

The Benefits of hiring a team:

It allows companies to hire more reliably, based on proven track record. You know there is a good mix of personalities & skills for success.
Teams shake things up in a good way, because if you need innovation then you need a certain level of disruption.
Teams are able to contribute much more quickly.
Teams are more likely to retain their fresh perspectives.
…and finally, research shows that job interviews poorly predict future job performance.

Please take a look at the article, which ends with:

“It requires that they rethink their processes from top to bottom, taking on sacred cows and best practices. From that perspective, hiring teams might be exactly what leaders should start working on — not least because their less adventurous competitors aren’t.”

The full URL:
https://www.wsj.com/articles/why-companies-should-hire-teams-not-individuals-1509329580

Latest Brain research Nov-2017

There is some interesting new brain research. These high-level articles are less techy, but they may provide some inspiration. We believe it is important to learn more about how the brain works for future AI work. I often use the following analogy: “When designing planes it is inspiring to look at bird flight, but don’t try to stick with all of their design principles when designing a plane for Mach 6.7.” The same applies to AI.

How memories ripple through the brain:
https://www.sciencedaily.com/releases/2017/10/171031084843.htm
from here you can find the link to their journal article that goes into more detail.

Creating Virtual Human Brain:
http://www.npr.org/sections/health-shots/2017/10/25/559819023/scientists-and-surgeons-team-up-to-create-models-of-living-human-brain-cells
Allen Institute’s awesome .gif

New DNN research: ISRLU – an innovative activation function

Our group has published a new research paper introducing ISRLU:

“Improving Deep Learning by Inverse Square Root Linear Units (ISRLUs)”
Brad Carlile, Guy Delamarter, Paul Kinney, Akiko Marti & Brian Whitney
https://arxiv.org/pdf/1710.09967.pdf

This 2017 paper has been submitted to a 2018 conference. ISRLU (pronounced is-rah-lou, ɪz rɑː luː) and ISRU (pronounced is-rue, ɪz rːuː) are faster activation functions for all DNNs:

ISRLU has all of the great properties of ELU, but can be implemented much faster in SW or in HW/SW codesigned DNN hardware.
ISRLU has continuous first and second derivatives (ELU and ReLU do not).
ISRU has a similar curve to tanh & sigmoid for RNNs, but is much faster than both.
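
For readers who want to see the shape of the function, here is a minimal Python sketch of ISRLU and its first derivative, using the definition from the paper: identity for x >= 0 and x / sqrt(1 + alpha*x^2) for x < 0. The function names and the default alpha=1.0 are our choices for illustration, and plain Python like this says nothing about speed; it only shows the math.

```python
import math

def isrlu(x, alpha=1.0):
    """ISRLU: identity for x >= 0, x / sqrt(1 + alpha * x^2) for x < 0."""
    if x >= 0.0:
        return x
    return x / math.sqrt(1.0 + alpha * x * x)

def isrlu_grad(x, alpha=1.0):
    """First derivative: 1 for x >= 0, (1 + alpha * x^2)^(-3/2) for x < 0."""
    if x >= 0.0:
        return 1.0
    return (1.0 + alpha * x * x) ** -1.5

# The negative side saturates toward -1/sqrt(alpha), much like ELU
# saturates toward -alpha.
for x in (-100.0, -1.0, -1e-6, 0.0, 2.0):
    print(x, isrlu(x), isrlu_grad(x))
```

The derivative approaches 1 as x approaches 0 from the left, matching the slope on the positive side.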

This innovation came about by using real Performance Engineering skills.

We saw the trend that convolutions are becoming smaller and that other strategies (Winograd) are reducing the time to do a convolution; therefore activation functions are becoming more important.
Activation functions have been based on exponential functions, which are always slower to evaluate.
We know the inverse square root can always be evaluated faster; if it is the same speed or slower, it is time to optimize the inverse square root implementation.
We remembered Paul Kinney’s 1986 invention of a fast approximation technique for the inverse square root, 16 years before Carmack & Mathisen rediscovered it and made it famous in 2002.
We constructed a “sigmoid-like function” from the inverse square root and married that to ELU’s innovation.
…Basically, understanding performance and having a huge background of experience led us to an innovation that SMEs had missed.
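
As a rough illustration of what a fast approximate inverse square root looks like, here is the widely circulated 32-bit variant (bit-level first guess plus one Newton-Raphson step) sketched in Python. This is the commonly published form of the trick, not necessarily the 1986 technique mentioned above, and the function name is ours. Python is used only for readability; any real speedup would come from a C, intrinsic, or hardware implementation.

```python
import struct

def fast_inv_sqrt(x):
    """Approximate 1/sqrt(x) for positive x using the well-known 32-bit trick."""
    # Reinterpret the float32 bit pattern as an unsigned 32-bit integer.
    i = struct.unpack("<I", struct.pack("<f", x))[0]
    # Magic constant minus half the bit pattern gives a rough first guess.
    i = 0x5F3759DF - (i >> 1)
    y = struct.unpack("<f", struct.pack("<I", i))[0]
    # One Newton-Raphson step refines the guess.
    return y * (1.5 - 0.5 * x * y * y)

for x in (0.25, 1.0, 4.0, 100.0):
    print(x, fast_inv_sqrt(x), x ** -0.5)
```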

Remember, this kind of thinking comes from experience with full-stack innovation and with optimizing HPC, Database, Java, ERP, NoSQL, Spark, SQL, Graph analytics, intrinsics, parallelism, always optimizing at scale, microarchitectures, HW/SW codesign,…

ISRU can replace tanh directly, and it can replace sigmoid once shifted and scaled (add 1, then scale by 1/2). We note that activation function performance is less important for RNNs because the matrix multiplications dominate, but future optimizations may change this characteristic.
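
A small Python sketch of that substitution, assuming alpha=1 so that ISRU has the same (-1, 1) range as tanh. The name `isru_sigmoid` is ours, for illustration only. The curves have similar shapes but are not numerically identical to tanh and the logistic sigmoid.

```python
import math

def isru(x, alpha=1.0):
    """ISRU: x / sqrt(1 + alpha * x^2), with range (-1/sqrt(alpha), 1/sqrt(alpha))."""
    return x / math.sqrt(1.0 + alpha * x * x)

def isru_sigmoid(x, alpha=1.0):
    """Sigmoid-like curve from ISRU: add 1, then scale by 1/2 (range (0, 1) for alpha=1)."""
    return 0.5 * (isru(x, alpha) + 1.0)

# Compare against the functions being replaced.
for x in (-2.0, 0.0, 2.0):
    logistic = 1.0 / (1.0 + math.exp(-x))
    print(x, isru(x), math.tanh(x), isru_sigmoid(x), logistic)
```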

We are of course working on other AI innovations… we’ll keep you posted.

* Pronunciation guide from Wikipedia: https://en.wikipedia.org/wiki/Help:IPA/English

Research Paper: DNN one feature at a time

A summary of an interesting DNN research paper we are reading…

“Building effective deep neural network architectures one feature at a time”
Martin Mundt, Tobias Weis, Kishore Konda, Visvanathan Ramesh
https://arxiv.org/pdf/1705.06778.pdf

This 2017 paper looks at adding one feature map at a time to a CNN.

One Feature at a Time summary:

Instead of starting with an arbitrary number of feature maps at every level of a CNN, they start with one, then add one feature map at a time as determined by a metric.
The end state of their model challenges the long-standing “rule of thumb” that one should monotonically increase feature maps at the higher levels (in red in the bar graph shown above). The final state of the feature-at-a-time model (shown in green) has a very different profile.
The metric used is how much a feature has changed with respect to its initial state, i.e. features with a large amount of structural change from their initial state are more likely to play a vital role.
Growing one feature at a time comes at a lower computational cost than training too large an architecture.
More effective CNN architectures should grow the number of features as one moves toward the middle layers (reaching a higher-dimensional state) and then reduce the number of features to push toward the better aggregation needed at the final layers.
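
To give a feel for what a “change from initial state” score could look like, here is a deliberately simple Python stand-in: one minus the cosine similarity between a feature’s initial and current weight vectors. This is a hypothetical proxy we made up for illustration; the paper defines its own metric, and the function name and example weights below are assumptions.

```python
import math

def structural_change(initial, current):
    """Hypothetical change score: 1 - cosine similarity of two weight vectors.

    Assumes both vectors are non-zero. A pure rescaling scores 0; a feature
    that has rotated to an orthogonal direction scores 1.
    """
    dot = sum(a * b for a, b in zip(initial, current))
    norm_initial = math.sqrt(sum(a * a for a in initial))
    norm_current = math.sqrt(sum(b * b for b in current))
    return 1.0 - dot / (norm_initial * norm_current)

# Made-up flattened filter weights for three features in one layer.
initial_weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
current_weights = [[2.0, 0.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
scores = [structural_change(i, c) for i, c in zip(initial_weights, current_weights)]
print(scores)
```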

The arbitrary numbers of feature maps have always bothered me. I find this paper to be quite inspiring.

We quickly tried a crude version, simply adding more features in the middle of some of our TensorFlow CNN models, and it improved our test accuracy (reducing our cross-entropy loss). Now we need to try their full feature-at-a-time approach! …Is any of this code posted on GitHub?

The only thing that bothers me is that it seems there must be a slightly more universal metric than the amount of change from the initial state, because what if the initialization happened to be a very good state? I completely buy their argument that this is unlikely for randomized initialization, but what if other techniques are used to initialize? I’m going to be spending some time thinking about alternatives.

All the best to the authors!

This is definitely an interesting area of research that our team is thinking more about…

AI/ML Public Lecture at Institute for Advanced Study Friday Oct 27 Princeton

www.ias.edu

Sanjeev Arora & Richard Zemel gave a public lecture, “Machines: How Do They Learn and Where Are They Headed?,” on Friday, October 27, at the Institute for Advanced Study.

The lecture was recorded and you can see it at: https://www.youtube.com/watch?v=jUUh_hfeBxw

See the URL for more info: https://www.ias.edu/news/2017/arora-zemel-publiclecture

DNN: A Tutorial and Survey

A kinder, gentler DNN research paper summary…

“Efficient Processing of Deep Neural Networks: A Tutorial and Survey”
Vivienne Sze, Yu-Hsin Chen, Tien-Ju Yang, and Joel Emer
https://arxiv.org/pdf/1703.09039.pdf

This 2017 paper by our friends at MIT provides a good overview of the latest trends in DNNs (Deep Neural Networks). For those not reading all of the research papers, it will serve as a kinder, gentler introduction to the latest work.

DNN Tutorial Survey:

An intro to DNNs
DNN models that are currently in use
Hardware for DNN and implications for future energy efficient designs
HW/SW Codesign optimizations

This broad survey packs a lot into its 32 pages, covers most of the latest work, and of course has a great reference section (162 references). After the first 12-13 pages it focuses more on HW design implications.

Research Paper: Large Cyclical Learning Rates

A summary of an interesting DNN research paper we are reading…

“Super-Convergence: Very Fast Training of Residual Networks Using Large Learning Rates”
Leslie N. Smith and Nicholay Topin
https://arxiv.org/pdf/1708.07120.pdf

This 2017 paper looks at cyclical learning rates (CLR) and a large max learning rate to speed up training… and maybe even lead to “super-convergence.”

Cyclical Learning with a large learning rate summary:

Some NN training can be dramatically sped up by using cyclical learning rates (CLR).
On the CIFAR-10 dataset, their model reached 81.4% validation accuracy in only 25k iterations (“super-convergence”) vs. 70k iterations with standard hyperparameters.
One specifies the min and max learning rate boundaries and the number of iterations per cycle; a cycle consists of an increasing phase and a decreasing phase of the learning rate.
Increasing the learning rate may be an effective way of escaping saddle points
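
The basic cycle described above can be sketched in a few lines of Python. This follows the triangular policy from the earlier CLR work; the function name, the parameter `step_size` (half a cycle, in iterations), and the example values are our assumptions for illustration, not the hyperparameters used in the paper.

```python
import math

def triangular_clr(iteration, min_lr, max_lr, step_size):
    """Triangular cyclical learning rate.

    The rate rises linearly from min_lr to max_lr over step_size iterations,
    then falls back to min_lr over the next step_size iterations, and repeats.
    """
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    return min_lr + (max_lr - min_lr) * max(0.0, 1.0 - x)

# Illustrative values only: one full cycle takes 2000 iterations here.
for it in (0, 500, 1000, 1500, 2000):
    print(it, triangular_clr(it, min_lr=0.1, max_lr=3.0, step_size=1000))
```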

While this technique doesn’t always get into “super-convergence” for DNNs, it can happen for some.

For an intuitive feel of the algorithm we offer the following: At the beginning, the learning rate must be small to allow progress in an appropriate direction, since the curvature may change drastically. As the slope decreases, one can make more progress with bigger learning rates. Finally, smaller learning rates are needed when the solution does “fine-scale” maneuvering. In the last stages, a valley may have fine-scale features, such as a variety of troughs that must be traveled to find the local minimum.

For a visual on what a CLR cycle looks like, see the image below. In the paper they discuss that an appropriately large max_lr may cause this super-convergence.

The architectures and code to replicate the data in the paper are available at github.com/lnsmith54/super-convergence. For earlier work on CLR see: https://github.com/bckenstler/CLR

They present evidence that training with large learning rates applied in the right fashion improves performance by regularizing the network.

This is an interesting area of research and it has also given our team thoughts on how this idea may also help with new parallel implementations.

Research Paper: Dilated Convolutions

A summary of an interesting DNN research paper we are reading…

“Multi-Scale Context Aggregation by Dilated Convolutions”
Fisher Yu, Vladlen Koltun
https://arxiv.org/pdf/1511.07122.pdf

This 2016 paper introduced dilated convolutions. We expect it to get wider attention.

Dilated Convolutions summary:

basically Conv filters with gaps between the filter elements
adds a level of scale invariance
broader view of the input, can capture multi-scale contextual information
one doesn’t need to lose resolution or analyze rescaled images

Typically, when we talk about strides in convolutions for DNNs, we mean how the filter moves across the input: the filter is applied to adjacent input elements, then one skips some number of input elements and reapplies the filter. In dilated filters, consecutive filter elements are applied to non-adjacent input elements.

In dilated convolutions the filter is applied with defined gaps. For example, if a 3×3 filter is applied to a 2D image, a dilation of k=1 is just a normal convolution. When k=2, one skips one pixel between consecutive filter elements, so the filter values are spread out over the input. k=4 means skipping 3 pixels.
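
Here is a minimal pure-Python sketch of that idea, assuming “valid” borders (no padding), stride 1, and the cross-correlation convention that most DNN libraries call convolution. The function name and the toy 5×5 input are ours, for illustration.

```python
def dilated_conv2d(image, kernel, k=1):
    """2D dilated convolution with dilation k (k=1 is a normal convolution).

    Filter element (i, j) is applied to input element (r + i*k, c + j*k),
    so consecutive filter elements are k pixels apart on the input.
    """
    kh, kw = len(kernel), len(kernel[0])
    span_h = (kh - 1) * k + 1  # input rows covered by the spread-out filter
    span_w = (kw - 1) * k + 1  # input columns covered by the spread-out filter
    out_h = len(image) - span_h + 1
    out_w = len(image[0]) - span_w + 1
    out = [[0.0] * out_w for _ in range(out_h)]
    for r in range(out_h):
        for c in range(out_w):
            out[r][c] = sum(
                kernel[i][j] * image[r + i * k][c + j * k]
                for i in range(kh)
                for j in range(kw)
            )
    return out

# Toy 5x5 input whose values encode their position, and an all-ones 3x3 filter.
image = [[float(r * 5 + c) for c in range(5)] for r in range(5)]
ones = [[1.0] * 3 for _ in range(3)]
normal = dilated_conv2d(image, ones, k=1)   # 3x3 output
dilated = dilated_conv2d(image, ones, k=2)  # 1x1 output, taps at rows/cols 0, 2, 4
```

Note that the same nine filter weights are used in both calls; only the input positions they touch change.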

Figure 1 below (from the paper) shows dilated 2D convolution on 2D data.

The caption has confused some people on the web so we offer the following additional explanation: The Red dots show which 3×3 filter elements are applied to which input elements. (a) shows a normal 3×3 filter, the 9 elements of the filters are applied to consecutive elements of the input, (b) shows the same 3×3 filter but notice the same 9 elements are applied to different input points with a gap of 1 between them {dilation k=2}, (c) shows the same 3×3 filter but notice it is applied to different input points with a gap of 3 {dilation k=4}. The blue/green shading shows the receptive field captured with the next layer.
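
The receptive field sizes in that figure (3×3, 7×7, 15×15) can be checked with a little arithmetic: each stacked stride-1 layer with an n×n filter and dilation k widens the receptive field by (n - 1) * k. A short Python sketch, with a function name of our own choosing:

```python
def receptive_fields(kernel_size, dilations):
    """Receptive field width after each stacked stride-1 dilated layer."""
    size = 1
    sizes = []
    for k in dilations:
        size += (kernel_size - 1) * k
        sizes.append(size)
    return sizes

# 3x3 filters with dilations 1, 2, 4, as in panels (a), (b), (c).
print(receptive_fields(3, [1, 2, 4]))  # [3, 7, 15]
```

With exponentially growing dilation the receptive field grows exponentially while the number of weights per layer stays at nine.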

It gets us thinking about creating new ways to add scale and rotational invariance to DNNs.