License: Creative Commons Attribution 3.0 Unported license (CC BY 3.0)
When quoting this document, please refer to the following
DOI: 10.4230/LIPIcs.ITCS.2021.63
URN: urn:nbn:de:0030-drops-136025
URL: http://dagstuhl.sunsite.rwth-aachen.de/volltexte/2021/13602/
Go to the corresponding LIPIcs Volume Portal


van den Brand, Jan ; Peng, Binghui ; Song, Zhao ; Weinstein, Omri

Training (Overparametrized) Neural Networks in Near-Linear Time

pdf-format:
LIPIcs-ITCS-2021-63.pdf (0.6 MB)


Abstract

The slow convergence rate and pathological curvature issues of first-order gradient methods for training deep neural networks, initiated an ongoing effort for developing faster second-order optimization algorithms beyond SGD, without compromising the generalization error. Despite their remarkable convergence rate (independent of the training batch size n), second-order algorithms incur a daunting slowdown in the cost per iteration (inverting the Hessian matrix of the loss function), which renders them impractical. Very recently, this computational overhead was mitigated by the works of [Zhang et al., 2019; Cai et al., 2019], yielding an O(mn²)-time second-order algorithm for training two-layer overparametrized neural networks of polynomial width m.
We show how to speed up the algorithm of [Cai et al., 2019], achieving an Õ(mn)-time backpropagation algorithm for training (mildly overparametrized) ReLU networks, which is near-linear in the dimension (mn) of the full gradient (Jacobian) matrix. The centerpiece of our algorithm is to reformulate the Gauss-Newton iteration as an ?₂-regression problem, and then use a Fast-JL type dimension reduction to precondition the underlying Gram matrix in time independent of M, allowing to find a sufficiently good approximate solution via first-order conjugate gradient. Our result provides a proof-of-concept that advanced machinery from randomized linear algebra - which led to recent breakthroughs in convex optimization (ERM, LPs, Regression) - can be carried over to the realm of deep learning as well.

BibTeX - Entry

@InProceedings{vandenbrand_et_al:LIPIcs.ITCS.2021.63,
  author =	{Jan van den Brand and Binghui Peng and Zhao Song and Omri Weinstein},
  title =	{{Training (Overparametrized) Neural Networks in Near-Linear Time}},
  booktitle =	{12th Innovations in Theoretical Computer Science Conference (ITCS 2021)},
  pages =	{63:1--63:15},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-177-1},
  ISSN =	{1868-8969},
  year =	{2021},
  volume =	{185},
  editor =	{James R. Lee},
  publisher =	{Schloss Dagstuhl--Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/opus/volltexte/2021/13602},
  URN =		{urn:nbn:de:0030-drops-136025},
  doi =		{10.4230/LIPIcs.ITCS.2021.63},
  annote =	{Keywords: Deep learning theory, Nonconvex optimization}
}

Keywords: Deep learning theory, Nonconvex optimization
Collection: 12th Innovations in Theoretical Computer Science Conference (ITCS 2021)
Issue Date: 2021
Date of publication: 04.02.2021


DROPS-Home | Fulltext Search | Imprint | Privacy Published by LZI