Research Paper: Large Cyclical Learning Rates

A summary of an interesting DNN research paper we are reading…

“Super-Convergence: Very Fast Training of Residual Networks Using Large Learning Rates”
Leslie N. Smith and Nicholay Topin
https://arxiv.org/pdf/1708.07120.pdf

This 2017 paper looks at cyclical learning rates (CLR) and a large max learning rate to speed up training… and maybe even lead to “super-convergence.”

Summary of cyclical learning with a large learning rate:

  • Some NN training can be dramatically sped up by using cyclical learning rates (CLR)
  • On the CIFAR-10 dataset, their model reached 81.4% validation accuracy in only 25k iterations (“super-convergence”) vs. 70k iterations with standard hyperparameters
  • One specifies the min and max learning rate boundaries and the number of iterations per cycle; a cycle consists of increasing and then decreasing the learning rate (see the sketch after this list)
  • Increasing the learning rate may be an effective way of escaping saddle points
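
To make the cycle concrete, here is a minimal sketch of a triangular CLR schedule in Python. The function and parameter names (triangular_clr, base_lr, max_lr, step_size) are our own for illustration, not taken from the paper’s code, and the example values are only meant to show the shape of the cycle.

    def triangular_clr(iteration, base_lr, max_lr, step_size):
        """Triangular cyclical learning rate.

        The rate ramps linearly from base_lr up to max_lr over step_size
        iterations, then back down to base_lr; one full cycle is therefore
        2 * step_size iterations.
        """
        cycle = iteration // (2 * step_size)            # which cycle we are in
        x = abs(iteration / step_size - 2 * cycle - 1)  # distance from the peak, in [0, 1]
        return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)

    # Example: a 10k-iteration cycle (5k up, 5k down); the rates are illustrative.
    for it in (0, 2500, 5000, 7500, 10000):
        print(it, triangular_clr(it, base_lr=0.1, max_lr=3.0, step_size=5000))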

While this technique doesn’t always lead to “super-convergence” for DNNs, it can for some.

For an intuitive feel of the algorithm we offer the following: At the beginning the learning rate must be small to allow progress in an appropriate direction, since the curvature may change drastically. As the slope decreases, one can make more progress with larger learning rates. Finally, smaller learning rates are needed when the solution requires “fine-scale” maneuvering. In the last stages, a valley may have fine-scale features such as a variety of troughs that must be traversed to find the local minimum.
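
To put rough numbers on that intuition, the sketch below (again our own illustration, not the paper’s exact schedule) stretches one large up-and-down cycle over most of training and then tails off to an even smaller rate for the final fine-scale phase; total_iters and the rate values are assumed.

    def one_long_cycle(iteration, total_iters, base_lr, max_lr, final_lr):
        """One large up/down learning-rate cycle followed by a short low-LR tail.

        Small LR early (curvature may change drastically), large LR in the
        middle (faster progress where the slope is gentler), then rates below
        base_lr at the end for fine-scale maneuvering near the minimum.
        """
        tail_start = int(0.9 * total_iters)       # last 10% of training: the tail
        step_size = tail_start // 2               # half the cycle up, half down
        if iteration < tail_start:
            x = abs(iteration / step_size - 1.0)  # distance from the peak, in [0, 1]
            return base_lr + (max_lr - base_lr) * (1.0 - x)
        # Tail: decay linearly from base_lr down to final_lr.
        frac = (iteration - tail_start) / max(1, total_iters - tail_start)
        return base_lr + (final_lr - base_lr) * frac

    for it in (0, 5000, 11250, 22500, 23500, 25000):
        print(it, round(one_long_cycle(it, total_iters=25000, base_lr=0.1,
                                       max_lr=3.0, final_lr=0.001), 4))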

For a visual of what a CLR cycle looks like, see the image below. In the paper they discuss that an appropriately large max_lr may cause this super-convergence.

The architectures and code to replicate the results in the paper are available at github.com/lnsmith54/super-convergence. For earlier work on CLR see: https://github.com/bckenstler/CLR
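
For readers who want to experiment before digging into that repository, here is a hedged sketch of how a CLR schedule with a large max_lr could be wired into a PyTorch training loop, using PyTorch’s built-in CyclicLR scheduler rather than the authors’ code. The model, batch data, and hyperparameter values are placeholders of ours, not the authors’ setup.

    import torch
    import torch.nn as nn

    # Placeholder model and data; in practice this would be e.g. a Resnet on CIFAR-10.
    model = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32 * 3, 10))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

    # Triangular CLR: ramp from base_lr to a large max_lr and back over one cycle.
    # The boundaries and step sizes are illustrative, not the paper's exact settings.
    scheduler = torch.optim.lr_scheduler.CyclicLR(
        optimizer, base_lr=0.1, max_lr=3.0,
        step_size_up=12500, step_size_down=12500, mode="triangular")

    loss_fn = nn.CrossEntropyLoss()
    for step in range(100):                  # stand-in for the real training loop
        inputs = torch.randn(64, 3, 32, 32)  # fake batch of CIFAR-10-sized images
        targets = torch.randint(0, 10, (64,))
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()
        scheduler.step()                     # update the learning rate every iteration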

They present evidence that training with large learning rates applied in the right fashion improves performance by regularizing the network.

This is an interesting area of research, and it has given our team ideas about how this approach might also help with new parallel implementations.
