Paper: Training trajectories, mini-batch losses, and the curious role of the learning rate.

The paper examines the role of the learning rate in the optimization of deep networks with Stochastic Gradient Descent (SGD). The authors show that the loss on an individual mini-batch can be accurately modeled by a quadratic function. They propose a simple model and geometric interpretation that analyze the relationship between mini-batch gradients and the full-batch gradient, and how the learning rate connects iterate averaging to specific learning rate schedules. Their results suggest that an even simpler averaging technique, averaging just two points a few steps apart, can improve accuracy. The authors also show a dramatic difference between the loss as a function of the learning rate when evaluated on the training batch versus a held-out batch. The findings are validated on ImageNet and other datasets using ResNet architectures.
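
As a rough illustration of two of these claims, the quadratic shape of the per-batch loss along the step direction and the gap between the training batch and a held-out batch, the following numpy sketch evaluates the loss along a single mini-batch gradient step at several learning rates. The toy least-squares problem, batch size, and step sizes are illustrative assumptions, not the paper's setup.

    # Minimal sketch (not the authors' code): the per-batch loss along the step
    # direction is exactly quadratic in the learning rate for this toy model, and a
    # step tuned on the training batch looks very different on a held-out batch.
    import numpy as np

    rng = np.random.default_rng(0)
    d = 50
    w_true = rng.normal(size=d)
    X = rng.normal(size=(2000, d))
    y = X @ w_true + 0.1 * rng.normal(size=2000)

    def batch_loss(w, idx):
        """Mean squared error on the mini-batch given by row indices idx."""
        r = X[idx] @ w - y[idx]
        return 0.5 * np.mean(r ** 2)

    def batch_grad(w, idx):
        r = X[idx] @ w - y[idx]
        return X[idx].T @ r / len(idx)

    w = np.zeros(d)                                          # current point on the trajectory
    train_idx = rng.choice(2000, size=64, replace=False)     # "seen" mini-batch
    heldout_idx = rng.choice(2000, size=64, replace=False)   # "unseen" mini-batch
    g = batch_grad(w, train_idx)                             # step direction from the training batch

    # Loss at w - lr * g as a function of the learning rate lr:
    for lr in [0.0, 0.01, 0.05, 0.1, 0.2]:
        w_step = w - lr * g
        print(f"lr={lr:4.2f}  train-batch loss={batch_loss(w_step, train_idx):.4f}  "
              f"held-out batch loss={batch_loss(w_step, heldout_idx):.4f}")
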
The authors then examine the behavior of the loss during training with SGD, studying the impact of a single step on the loss for both the training batch and unseen batches. They find that a single step along the training-batch direction can bring the loss down to values considerably lower than the best average loss of a fully trained model, and that this behavior is typical throughout the training trajectory. They also investigate whether the loss basin remains the same regardless of the starting point by following the loss on a fixed mini-batch while branching off from different parts of the training trajectory, and they find that it does. They also propose an analytical model that describes the behavior of the weight vector during training, discussed in more detail below. They further explore the interpolation between the two fixed-batch trajectories started at θ_t and θ_{t+1}, using it to show that the best loss on the full distribution is reached before the end of the trajectory; this also yields better accuracy on the held-out batch. A series of figures illustrates these findings.
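
The sketch below illustrates the interpolation measurement in a simplified form: it interpolates directly between two consecutive SGD iterates θ_t and θ_{t+1} and evaluates the full-data loss along that segment, rather than between the two fixed-batch trajectories the authors use. The toy problem and hyperparameters are assumptions made for illustration.

    # Simplified sketch: evaluate the full-data loss along the segment between two
    # consecutive SGD iterates theta_t and theta_{t+1} on a toy least-squares problem.
    import numpy as np

    rng = np.random.default_rng(1)
    d, n = 50, 2000
    w_true = rng.normal(size=d)
    X = rng.normal(size=(n, d))
    y = X @ w_true + 0.1 * rng.normal(size=n)

    def full_loss(w):
        return 0.5 * np.mean((X @ w - y) ** 2)

    # Run a few SGD steps to reach some point theta_t on the trajectory.
    w, lr = np.zeros(d), 0.05
    for _ in range(20):
        idx = rng.choice(n, size=64, replace=False)
        r = X[idx] @ w - y[idx]
        w -= lr * X[idx].T @ r / len(idx)

    theta_t = w.copy()
    idx = rng.choice(n, size=64, replace=False)          # one more mini-batch step
    r = X[idx] @ theta_t - y[idx]
    theta_t1 = theta_t - lr * X[idx].T @ r / len(idx)    # theta_{t+1}

    # Full-distribution loss along the segment theta_t -> theta_{t+1}:
    for alpha in np.linspace(0.0, 1.0, 6):
        w_mix = (1 - alpha) * theta_t + alpha * theta_t1
        print(f"alpha={alpha:.1f}  full loss={full_loss(w_mix):.4f}")
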
The authors' analytical model of the weight vector during training captures many of the phenomena observed when training large deep neural networks. They present an alternative derivation of the result based on the Fokker-Planck equation, using a scaling factor that generalizes the result to an arbitrary averaging kernel. They also examine the expected change in the weight norm for a fixed learning rate and show that a natural geometric interpretation arises from the basic properties of stochastic gradient descent. In particular, they find that the distance to the global minimum is proportional to √λ, the square root of the learning rate, as described by their equation (13).
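
A toy check of that scaling, assuming a simple isotropic quadratic loss with additive gradient noise rather than the paper's setting, is sketched below: for each learning rate, noisy SGD is run to approximate stationarity and the mean distance to the minimum is compared against √λ.

    # Toy check (not the paper's equation (13)): how the stationary distance to the
    # minimum of a noisy quadratic scales with the learning rate.
    import numpy as np

    rng = np.random.default_rng(2)
    d = 20
    w_star = np.zeros(d)          # minimum of the quadratic loss 0.5 * ||w||^2
    noise_std = 0.5               # stand-in for mini-batch gradient noise

    def stationary_distance(lr, steps=20000, burn_in=5000):
        w = rng.normal(size=d)
        dists = []
        for t in range(steps):
            grad = (w - w_star) + noise_std * rng.normal(size=d)   # noisy gradient
            w = w - lr * grad
            if t >= burn_in:
                dists.append(np.linalg.norm(w - w_star))
        return np.mean(dists)

    for lr in [0.001, 0.004, 0.016, 0.064]:
        dist = stationary_distance(lr)
        print(f"lr={lr:.3f}  mean distance={dist:.4f}  distance/sqrt(lr)={dist/np.sqrt(lr):.3f}")

If the √λ scaling holds, the last column should stay roughly constant across learning rates.
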
They further show that weight averaging is equivalent to training with a reduced learning rate along the trajectory, both for Stochastic Weight Averaging (SWA) and for an Exponential Moving Average (EMA) of the weights, and they examine the effect on the stationary distribution. The same results are observed on ImageNet, and a table demonstrates the equivalence of the different averaging methods to a reduced learning rate for a fixed window size.
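
The following sketch, again on an assumed noisy quadratic rather than the authors' ImageNet experiments, shows the qualitative effect: both an EMA of the weights and SWA-style tail averaging end up much closer to the minimum than the last iterate, mimicking a run with a smaller learning rate.

    # Sketch of EMA and SWA-style tail averaging of SGD iterates on a noisy quadratic.
    import numpy as np

    rng = np.random.default_rng(3)
    d, lr, noise_std = 20, 0.05, 0.5
    w = rng.normal(size=d)
    ema, ema_decay = w.copy(), 0.99
    swa_sum, swa_count = np.zeros(d), 0

    for t in range(20000):
        grad = w + noise_std * rng.normal(size=d)       # noisy gradient of 0.5*||w||^2
        w = w - lr * grad
        ema = ema_decay * ema + (1 - ema_decay) * w     # EMA of the weights
        if t >= 10000:                                  # SWA: average the tail of the trajectory
            swa_sum += w
            swa_count += 1

    swa = swa_sum / swa_count
    print(f"last iterate distance: {np.linalg.norm(w):.4f}")
    print(f"EMA distance:          {np.linalg.norm(ema):.4f}")
    print(f"SWA distance:          {np.linalg.norm(swa):.4f}")
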
Finally, the authors investigate the behavior of SGD further by comparing the effects of different learning rate schedules on training, such as a fixed learning rate versus a linearly decaying one. They observe that weight averaging can improve model performance and reduce overfitting, but that the technique is sensitive to the time scale of the underlying weight evolution. They introduce a synthetic model to study the fast and slow convergence regimes of SGD and observe that weight averaging can be simulated by a properly chosen learning rate schedule. The authors suggest that these findings may pave the way for a deeper understanding of the role of the learning rate in training, and they propose further studies in this direction.
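
As a closing illustration, the toy comparison below contrasts a constant learning rate, a linearly decaying schedule, and a constant learning rate combined with tail averaging, all on the same assumed noisy quadratic; the similarity of the last two runs is in the spirit of the claim that weight averaging can be simulated by a properly chosen learning rate schedule.

    # Toy comparison of learning rate schedules versus weight averaging on a noisy quadratic.
    import numpy as np

    rng = np.random.default_rng(4)
    d, steps, lr0, noise_std = 20, 20000, 0.05, 0.5

    def run(schedule, average_tail=False):
        w = rng.normal(size=d)
        avg, count = np.zeros(d), 0
        for t in range(steps):
            grad = w + noise_std * rng.normal(size=d)   # noisy gradient of 0.5*||w||^2
            w = w - schedule(t) * grad
            if average_tail and t >= steps // 2:        # average the second half of the run
                avg += w
                count += 1
        return np.linalg.norm(avg / count) if average_tail else np.linalg.norm(w)

    constant = lambda t: lr0
    linear_decay = lambda t: lr0 * (1 - t / steps)

    print(f"constant lr, last iterate:        {run(constant):.4f}")
    print(f"linearly decaying lr, last point: {run(linear_decay):.4f}")
    print(f"constant lr + tail averaging:     {run(constant, average_tail=True):.4f}")
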