The Heavy-Tail Phenomenon in Decentralized Stochastic Gradient Descent
The stochastic gradient descent (SGD) method is one of the most popular optimization techniques in machine learning, particularly for training deep neural networks (DNNs). The gradient noise in this method is often modeled as Gaussian or assumed to have finite variance. However, empirical evidence suggests that the gradient noise can be highly non-Gaussian and often exhibits heavy tails. The heaviness of these tails is directly related to the generalization performance of the algorithm.
In this candidacy paper we discuss material from three papers. We first present a tail-index analysis of SGD, which shows empirically that the gradient noise can have heavy tails and, through a metastability analysis, that heavy-tailed SGD validates the wide-minima phenomenon. We then present a paper on the heavy-tail phenomenon in SGD that investigates the origins of the heavy tails. We show that the heaviness of the tail is related to the choice of stepsize, batch size, and other hyperparameters of the algorithm. Finally, we discuss the heavy-tail phenomenon in decentralized SGD. We conclude the candidacy paper by proposing a few directions for future research.
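To make the notion of a tail index concrete, the following is a minimal sketch of how one might estimate it from a sample of noise values. It uses the classical Hill estimator, which is an assumption made here for illustration; the papers discussed may rely on a different estimator. The function name `hill_tail_index`, the sample size, and the cutoff `k` are hypothetical choices, and the synthetic Pareto and Gaussian samples merely stand in for real gradient-noise measurements.

```python
import math
import random


def hill_tail_index(samples, k):
    """Hill estimate of the tail index from the k largest magnitudes.

    Illustrative helper (not taken from the papers): a smaller value
    indicates a heavier tail.
    """
    # Sort magnitudes in descending order, dropping exact zeros.
    mags = sorted((abs(x) for x in samples if x != 0.0), reverse=True)
    if not 0 < k < len(mags):
        raise ValueError("k must satisfy 0 < k < number of nonzero samples")
    threshold = mags[k]  # the (k+1)-th largest magnitude
    # Average log-ratio of the top-k magnitudes to the threshold.
    mean_log_excess = sum(math.log(m / threshold) for m in mags[:k]) / k
    return 1.0 / mean_log_excess


rng = random.Random(0)  # fixed seed for reproducibility
n, k = 20000, 2000      # assumed sample size and number of tail points

# Synthetic stand-ins for gradient noise: Pareto (true tail index 1.5)
# versus standard Gaussian (light-tailed).
heavy = [rng.paretovariate(1.5) for _ in range(n)]
light = [rng.gauss(0.0, 1.0) for _ in range(n)]

alpha_heavy = hill_tail_index(heavy, k)
alpha_light = hill_tail_index(light, k)
print(f"Pareto sample:   estimated tail index = {alpha_heavy:.2f}")
print(f"Gaussian sample: estimated tail index = {alpha_light:.2f}")
```

On the Pareto sample the estimate should land near the true index of 1.5, while the Gaussian sample yields a noticeably larger value, reflecting its lighter tail. The estimate is sensitive to the choice of `k`, so in practice one would inspect it over a range of cutoffs rather than trust a single value.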
A copy of the presentation can be found here