1 Introduction
Batch normalization [1]
(BN) is one of the most important techniques for training deep neural networks and has proven extremely effective in avoiding gradient blowups during backpropagation and speeding up convergence. In its original introduction
[1], the desirable effects of BN are attributed to the so-called “reduction of covariate shift”. However, it is unclear what this statement means in precise mathematical terms. To date, there is no comprehensive theoretical analysis of the effect of batch normalization. In this paper, we study the convergence and stability of gradient descent with batch normalization (BNGD) via a modeling approach. More concretely, we consider a simplified supervised learning problem, ordinary least squares regression, and analyze precisely the effect of BNGD when applied to this problem. Much akin to the mathematical modeling of physical processes, the least-squares problem serves as an idealized “model” of the effect of BN for general supervised learning tasks. A key reason for this choice is that the dynamics of GD without BN (hereafter called GD for simplicity) in least-squares regression is completely understood, thus allowing us to isolate and contrast the additional effects of batch normalization.
The modeling approach proceeds in the following steps. First, we derive precise mathematical results on the convergence and stability of BNGD applied to the least-squares problem. In particular, we show that BNGD converges for any constant learning rate ε for the weights, regardless of the conditioning of the regression problem. This is in stark contrast with GD, where the condition number of the problem adversely affects stability and convergence. Many insights can be distilled from the analysis of the OLS model. For instance, we may attribute the stability of BNGD to an interesting scaling law governing ε and the initial condition; this scaling law is not present in GD. The preceding analysis also implies that if we are allowed to use different learning rates for the BN rescaling variables (ε_a) and the remaining trainable variables (ε), then BNGD on our model converges for any ε > 0 as long as ε_a ∈ (0, 1]. Furthermore, we discover an asymptotic acceleration effect of BNGD, and moreover, there exist regions of ε such that the performance of BNGD is insensitive to changes in ε, which helps to explain the robustness of BNGD to the choice of learning rates. We reiterate that, contrary to many previous works, all the preceding statements are precise mathematical results that we derive for our simplified model.
The last step in our modeling approach is also the most important: we need to demonstrate that these insights are not specific features of our idealized model. Indeed, they should be true characteristics, at least in an approximate sense, of BNGD for general supervised learning problems. We do this by numerically investigating the convergence, stability and scaling behaviors of BNGD on various datasets and model architectures. We find that the key insights derived from our idealized analysis indeed correspond to practical scenarios.
1.1 Related work
Batch normalization was originally introduced in [1] and subsequently studied in further detail in [2]. Since its introduction, it has become an important practical tool to improve stability and efficiency of training deep neural networks [3, 4]
. Initial heuristic arguments attributed the desirable features of BN to concepts such as “covariate shift”, which lack a concrete mathematical interpretation; alternative explanations have also been given
[5]. Recent theoretical studies of BN include [6], where the authors proposed a variant of BN, the diminishing batch normalization (DBN) algorithm, and analyzed its convergence, showing that it converges to a stationary point of the loss function. More recently,
[7] demonstrated that the higher learning rates enabled by batch normalization induce a regularizing effect. Most relevant to the present work is [8], where the authors also considered the convergence properties of BNGD on linear networks (similar to the least-squares problem). The authors showed that for a particular adaptive choice of dynamic learning rate schedule, which can be seen as a fixed effective step size in our terminology (see equation (11) and the discussion therein), BNGD converges linearly. The present research is independent, and the key difference in our analysis is that we prove that convergence occurs for constant learning rates (and in fact, arbitrarily large learning rates ε for the weights, as long as ε_a ∈ (0, 1]). This result is quite different from those in both [8] and [6], where a specialized learning rate schedule is employed. This is an important distinction: while a decaying or dynamic learning rate is sometimes used in practice, in the case of BN it is critical to analyze the non-asymptotic, constant learning rate case, precisely because one of the key practical advantages of BN is that a bigger learning rate can be used than in GD. Hence, it is desirable, as in the results presented in this work, to perform the analysis in this regime.
Finally, through the lens of the leastsquares example, BN can be viewed as a type of overparameterization, where additional parameters, which do not increase model expressivity, are introduced to improve algorithm convergence and stability. In this sense, this is related in effect to the recent analysis of the implicit acceleration effects of overparameterization on gradient descent [9].
1.2 Organization
Our paper is organized as follows. In Section 2, we outline the ordinary least squares (OLS) problem and present GD and BNGD as alternative means to solve it. In Section 3, we demonstrate and analyze the convergence of BNGD for the OLS model and, in particular, contrast the results with the behavior of GD, which is completely understood for this model. We also discuss the important insights into BNGD that these results provide. We then validate these findings on more general supervised learning problems in Section 4. Finally, we conclude in Section 5.
2 Background
Consider the simple linear regression model where x ∈ R^d is a random input column vector and y ∈ R is the corresponding output variable. Since batch normalization is applied for each feature separately, it is sufficient to consider a scalar output in order to gain the key insights. A noisy linear relationship is assumed between the dependent variable y and the independent variables x, i.e. y = u^T x + noise, where u ∈ R^d is the parameter vector. Denote the following moments:
(1) H := E[xx^T], g := E[xy].
To simplify the analysis, we assume the covariance matrix H of x is positive definite and the mean of x is zero. The eigenvalues of H are denoted by λ_1 ≤ λ_2 ≤ ⋯ ≤ λ_d. In particular, the maximum and minimum eigenvalues of H are denoted by λ_max and λ_min respectively, and the condition number of H is defined as κ := λ_max/λ_min. Note that the positive definiteness of H allows us to define the vector norms ∥z∥_H := (z^T H z)^{1/2} and ∥z∥_{H^{-1}} := (z^T H^{-1} z)^{1/2}.
2.1 Ordinary least squares
The ordinary least squares (OLS) method for estimating the unknown parameters u leads to the following optimization problem:
(2) min_{u ∈ R^d} J_0(u) := (1/2) E[(y − u^T x)^2].
The gradient of J_0 with respect to u is ∇J_0(u) = Hu − g, and the unique minimizer is u* = H^{-1}g. The gradient descent (GD) method (with step size, or learning rate, ε) for solving the optimization problem (2) is given by the iterating sequence
(3) u_{k+1} = u_k − ε(Hu_k − g) = (I − εH)u_k + εg,
which converges if 0 < ε < 2/λ_max, and the convergence rate is determined by the spectral radius of I − εH:
(4) ρ(I − εH) = max{|1 − ελ_min|, |1 − ελ_max|}.
It is well known (see, for example, Chapter 4 of [10]) that the optimal learning rate is ε* = 2/(λ_max + λ_min), for which the convergence estimate is governed by the condition number κ:
(5) ∥u_k − u*∥ ≤ (ρ*)^k ∥u_0 − u*∥, where ρ* := ρ(I − ε*H) = (κ − 1)/(κ + 1).
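The stability threshold and the optimal rate above are easy to verify numerically. The following is a minimal illustration with toy values of H and g (chosen here for demonstration; not part of the paper's experiments):

```python
import numpy as np

# Illustration (toy values, not from the paper): GD on the OLS
# objective J0(u) = 0.5*E(y - u^T x)^2 with moments H, g.
# Iteration (3): u_{k+1} = u_k - eps*(H u_k - g).
H = np.diag([1.0, 10.0])            # condition number kappa = 10
g = np.array([1.0, 1.0])
u_star = np.linalg.solve(H, g)      # unique minimizer u* = H^{-1} g

def gd(eps, steps=500):
    u = np.zeros(2)
    for _ in range(steps):
        u = u - eps * (H @ u - g)
    return u

lam_max, lam_min = 10.0, 1.0
eps_opt = 2.0 / (lam_max + lam_min)   # optimal step size from (5)
u_ok  = gd(0.9 * 2.0 / lam_max)       # just inside the stability region
u_opt = gd(eps_opt)
u_bad = gd(1.1 * 2.0 / lam_max)       # just outside: iteration diverges

print(np.allclose(u_opt, u_star))            # converged to u*
print(np.linalg.norm(u_bad - u_star) > 1e3)  # diverged
```

The sharp transition at ε = 2/λ_max is exactly the behavior that BNGD avoids, as shown in Section 3.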
2.2 Batch normalization
Batch normalization is a feature-wise normalization procedure typically applied to the output of a layer, which in this case is simply z = w^T x. The normalization transform is defined as follows:
(6) N(z) := (z − E[z])/(Var[z])^{1/2} = w^T x/∥w∥_H,
where ∥w∥_H = (w^T H w)^{1/2}; the second equality uses E[x] = 0 and Var[w^T x] = w^T H w. After this rescaling, N(w^T x) will be of order 1, and hence, in order to reintroduce the scale, we multiply by a rescaling parameter a (note that the shift parameter can be set to zero since E[x] = 0). Hence, we get the BN version of the OLS problem (2):
(7) min_{a ∈ R, w ∈ R^d} J(a, w) := (1/2) E[(y − a w^T x/∥w∥_H)^2].
The objective function J is no longer convex. In fact, it has trivial critical points {(a, w) : a = 0, g^T w = 0}, which are saddle points of J.
We are interested in the nontrivial critical points, which satisfy the relations
(8) w = s H^{-1}g (s ≠ 0), a = sgn(s) ∥H^{-1}g∥_H.
It is easy to check that the nontrivial critical points are global minimizers, and that the Hessian matrix at each such critical point is degenerate. Nevertheless, the saddle points are strict (details can be found in the Appendix), which typically simplifies the analysis of gradient descent on non-convex objectives [11, 12].
Consider the gradient descent method applied to problem (7), which we hereafter call batch normalization gradient descent (BNGD). We set the learning rates for a and w to be ε_a and ε respectively. These may be different, for reasons which will become clear in the subsequent analysis. We thus have the following discrete-time dynamical system:
(9) a_{k+1} = a_k − ε_a ∂J/∂a(a_k, w_k) = (1 − ε_a) a_k + ε_a g^T w_k/∥w_k∥_H,
(10) w_{k+1} = w_k − ε ∇_w J(a_k, w_k) = w_k + ε (a_k/∥w_k∥_H)(g − (g^T w_k/∥w_k∥_H^2) H w_k).
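For concreteness, the BNGD iteration can be sketched in a few lines of NumPy (an illustration with assumed toy values of H and g, not the authors' experimental code); upon convergence, the iterates should satisfy the critical-point relations (8):

```python
import numpy as np

# Minimal sketch of the BNGD iteration (9)-(10) on the OLS model
# (toy values of H and g assumed; not the authors' code).
H = np.diag([1.0, 4.0, 25.0])
g = np.array([1.0, 1.0, 1.0])
u_star = np.linalg.solve(H, g)            # u* = H^{-1} g

def H_norm(v):
    return np.sqrt(v @ H @ v)             # ||v||_H

def bngd(eps_a, eps, a0=0.1, steps=20000):
    a, w = a0, np.ones(3)
    for _ in range(steps):
        nw = H_norm(w)
        a_hat = (g @ w) / nw
        grad_w = -(a / nw) * (g - ((g @ w) / nw**2) * (H @ w))
        # simultaneous update of a and w with separate learning rates
        a, w = (1 - eps_a) * a + eps_a * a_hat, w - eps * grad_w
    return a, w

a_inf, w_inf = bngd(eps_a=0.5, eps=0.05)
# Critical-point relations (8): w becomes parallel to u* (in the
# H-inner product) and |a| approaches ||u*||_H.
cos_H = (w_inf @ H @ u_star) / (H_norm(w_inf) * H_norm(u_star))
print(abs(abs(cos_H) - 1.0) < 1e-4, abs(abs(a_inf) - H_norm(u_star)) < 1e-3)
```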
We now begin a concrete mathematical analysis of the above iteration sequence.
3 Mathematical analysis of BNGD on OLS
In this section, we discuss several mathematical results one can derive concretely for BNGD on the OLS problem (7). First, we establish a simple but useful scaling property, which then allows us to prove a convergence result for (effectively) arbitrary constant learning rates. We also derive the asymptotic properties of the “effective” learning rate of BNGD (to be precisely defined subsequently), which shows some interesting sensitivity behavior of BNGD on the chosen learning rates. Detailed proofs of all results presented here can be found in the Appendix.
3.1 Scaling property
In this section, we discuss a straightforward, but useful, scaling property that the BNGD iterations possess. Note that the dynamical properties of the BNGD iteration are governed by a set of numbers, or a configuration, (H, g, ε_a, ε, a_0, w_0).
Definition 3.1 (Equivalent configuration).
Two configurations, (H, g, ε_a, ε, a_0, w_0) and (H′, g′, ε_a′, ε′, a_0′, w_0′), are said to be equivalent if, for the iterates (a_k, w_k) and (a_k′, w_k′) following these configurations respectively, there is an invertible linear transformation T and a nonzero constant c such that w_k′ = T w_k and a_k′ = c a_k for all k. It is easy to check that the system has the following scaling law.
Proposition 3.2 (Scaling property).
Suppose μ > 0 and γ ≠ 0. Then:
(1) The configurations (H, g, ε_a, ε, a_0, w_0) and (μ^2 H, μg, ε_a, ε, a_0, w_0) are equivalent.
(2) The configurations (H, g, ε_a, ε, a_0, w_0) and (H, g, ε_a, γ^2 ε, a_0, γ w_0) are equivalent.
It is worth noting that the scaling property (2) in Proposition 3.2 originates from the batch normalization procedure itself and is independent of the specific structure of the loss function. Hence, it is valid for general problems where BN is used (Lemma A.9).
Despite being a simple result, the scaling property is important in determining the dynamics of BNGD, and is useful in our subsequent analysis of its convergence and stability properties. For example, one may observe that scaling property (2) implies that it is sufficient to consider the case of small learning rates when establishing stability: an unstable iteration sequence will reach a large enough ∥w_k∥, after which the remaining iterations may be seen, by the scaling property, as “restarting” the sequence with a small learning rate. We shall now make use of this property to prove a convergence result for BNGD on the OLS problem.
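The scaling property (2) can also be checked directly: rescaling w_0 by γ and ε by γ^2 should reproduce the same trajectory up to the factor γ. A minimal numerical sketch (toy values of H and g assumed):

```python
import numpy as np

# Numerical check (illustration, toy values) of scaling property (2):
# the configurations (a0, w0, eps_a, eps) and (a0, gamma*w0, eps_a,
# gamma^2*eps) should give trajectories related by w -> gamma*w.
H = np.diag([2.0, 7.0])
g = np.array([1.0, -1.0])

def bngd_traj(a, w, eps_a, eps, steps=50):
    traj = []
    for _ in range(steps):
        nw = np.sqrt(w @ H @ w)
        grad_w = -(a / nw) * (g - ((g @ w) / nw**2) * (H @ w))
        a, w = (1 - eps_a) * a + eps_a * (g @ w) / nw, w - eps * grad_w
        traj.append((a, w.copy()))
    return traj

gamma = 3.0
t1 = bngd_traj(0.2, np.array([1.0, 1.0]), 0.5, 0.01)
t2 = bngd_traj(0.2, gamma * np.array([1.0, 1.0]), 0.5, gamma**2 * 0.01)
ok = all(np.isclose(a1, a2) and np.allclose(gamma * w1, w2)
         for (a1, w1), (a2, w2) in zip(t1, t2))
print(ok)
```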
3.2 Batch normalization converges for arbitrary step size
We have the following convergence result.
Theorem 3.3 (Convergence of BNGD).
The BNGD iteration (9)–(10) converges to a critical point of J for any learning rates ε > 0 and ε_a ∈ (0, 1], and any initial value (a_0, w_0) with w_0 ≠ 0.
Sketch of Proof. We first prove that the algorithm converges for small enough ε and ε_a and any initial value such that g^T w_0 ≠ 0. Then, using the scaling property and the fact that ∥w_k∥ is non-decreasing, we obtain the convergence for arbitrary ε by the simple “restarting” argument outlined previously. Finally, using the positive definiteness of H, we can prove that the iteration converges to either a minimizer or a saddle point.
It is important to note that BNGD converges for all step sizes ε of w, independently of the spectral properties of H. This is a significant advantage and is in stark contrast with GD, where the step size is limited by 2/λ_max and the condition number of H intimately controls the stability and convergence rate.
Although (a_k, w_k) could converge to a saddle point, one can prove, using the “strict saddle point” arguments in [11, 12], that the set of initial values for which (a_k, w_k) converges to strict saddle points has Lebesgue measure 0, provided the learning rate is sufficiently small. We note that even for large learning rates, in experiments with initial values drawn from typical distributions, we have not encountered convergence to saddles.
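As an illustration of this contrast (a sketch with assumed toy values, not one of the paper's experiments), one can compare GD and BNGD at a learning rate well above the GD stability threshold 2/λ_max:

```python
import numpy as np

# Illustration (toy values) consistent with Theorem 3.3: GD diverges
# once eps > 2/lambda_max, while BNGD still converges at the same eps.
H = np.diag([1.0, 10.0])              # 2/lambda_max = 0.2
g = np.array([1.0, 1.0])
u_star = np.linalg.solve(H, g)

def gd_error(eps, steps=50):
    u = np.zeros(2)
    for _ in range(steps):
        u = u - eps * (H @ u - g)
    return np.linalg.norm(u - u_star)

def bngd_error(eps, eps_a=1.0, steps=50000):
    a, w = 0.1, np.array([1.0, 1.0])
    for _ in range(steps):
        nw = np.sqrt(w @ H @ w)
        grad_w = -(a / nw) * (g - ((g @ w) / nw**2) * (H @ w))
        a, w = (1 - eps_a) * a + eps_a * (g @ w) / nw, w - eps * grad_w
    v = a * w / np.sqrt(w @ H @ w)    # effective regression vector
    return np.linalg.norm(v - u_star)

eps_big = 2.0                          # 10x the GD stability threshold
gd_err, bn_err = gd_error(eps_big), bngd_error(eps_big)
print(gd_err > 1e6, bn_err < 1e-3)
```

The mechanism is the "restarting" effect described above: the unstable phase grows ∥w_k∥, which shrinks the effective step size until the iteration stabilizes.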
3.3 Convergence rate, acceleration and asymptotic sensitivity
Now, let us consider the convergence rate of BNGD when it converges to a minimizer. Compared with GD, the update coefficient in front of H w_k in equation (10) changes from ε to a more complicated term, which we name the effective step size or effective learning rate:
(11) ε̂_k := ε a_k g^T w_k/∥w_k∥_H^3,
and the recurrence relation in place of (3) is
(12) w_{k+1} = (I − ε̂_k H) w_k + ε̂_k (∥w_k∥_H^2/(g^T w_k)) g.
Consider the dynamics of the residual e_k := a_k w_k/∥w_k∥_H − H^{-1}g, which equals zero if and only if (a_k, w_k) is a global minimizer. Using the properties of the norm ∥·∥_H (see Section A.1), we observe that the effective learning rate determines the convergence rate of e_k via
(13) ∥e_{k+1}∥_H ≤ ρ(I − ε̂_k H) ∥e_k∥_H,
where ρ(I − ε̂_k H) is the spectral radius of the matrix I − ε̂_k H. The inequality (13) shows that the convergence of e_k is linear provided ε̂_k ∈ [c_1, 2/λ_max − c_2] for some positive numbers c_1 and c_2. It is worth noting that the convergence of the loss function value is implied by the convergence of e_k (Lemma A.19).
Next, let us discuss an asymptotic acceleration effect of BNGD over GD. When (a_k, w_k) is close to a minimizer, we can approximate the iteration (9)–(10) by a linearized system. The Hessian matrix of J at a minimizer, restricted to w, is proportional to the matrix H_a, where
(14) H_a := H − (H u*)(H u*)^T/∥u*∥_H^2, u* := H^{-1}g.
The matrix H_a is positive semi-definite and has better spectral properties than H, such as a lower pseudo-condition number κ_a := λ_max^a/λ_min^a, where λ_max^a and λ_min^a are the maximal and minimal nonzero eigenvalues of H_a respectively. In particular, κ_a ≤ κ, with strict inequality for almost all u* (see Section A.1). This property leads to the asymptotic acceleration effect of BNGD: when ∥e_k∥_H is small, the contraction coefficient in (13) can be improved to a lower coefficient. More precisely, we have the following result:
Proposition 3.4.
For any positive number δ, if (a_k, w_k) is sufficiently close to a minimizer (with the required closeness depending on δ), then we have
(15) ∥e_{k+1}∥_H ≤ (ρ_a(I − ε̂_k H_a) + δ) ∥e_k∥_H,
where ρ_a(I − εH_a) := max{|1 − ελ_max^a|, |1 − ελ_min^a|} is the pseudo-spectral radius (Definition A.4).
Generally, we have ρ_a(I − εH_a) ≤ ρ(I − εH) for ε > 0, and the optimal rate is (κ_a − 1)/(κ_a + 1) ≤ (κ − 1)/(κ + 1), where the inequality is strict for almost all u*. Hence, the estimate (15) indicates that optimally tuned BNGD can have a faster convergence rate than optimally tuned GD, especially when κ_a is much smaller than κ and δ is small enough.
Finally, we discuss the dependence of the effective learning rate ε̂_k (and by extension, the effective convergence rate in (13) or (15)) on ε. This is in essence a sensitivity analysis of the performance of BNGD with respect to the choice of learning rate. The explicit dependence of ε̂_k on ε is quite complex, but we can nevertheless give the following asymptotic estimates.
Proposition 3.5.
Suppose ε_a ∈ (0, 1] and g^T w_0 ≠ 0. Then:
(1) When ε is small enough, the effective step size has the same order as ε, i.e. there are two positive constants C_1 ≤ C_2, independent of ε, such that C_1 ε ≤ ε̂_k ≤ C_2 ε.
(2) When ε is large enough, the effective step size has order 1/ε, i.e. there are two positive constants C_1 ≤ C_2, independent of ε, such that C_1/ε ≤ ε̂_k ≤ C_2/ε.
Observe that for each finite k, ε̂_k is a differentiable function of ε. Therefore, the above result implies, via the mean value theorem, the existence of some ε at which dε̂_k/dε = 0. Consequently, there is at least some small interval of learning rates on which the performance of BNGD is insensitive to the choice of ε. In fact, empirically this is one commonly observed advantage of BNGD over GD: the former typically allows a variety of (large) learning rates to be used without adversely affecting performance. The same is not true for GD, where the convergence rate depends sensitively on the choice of learning rate. We will see later in Section 4 that, although we only have a local insensitivity result above, the interval of insensitivity is actually quite large in practice.
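This insensitivity can be probed numerically by sweeping ε over a wide range and recording the final effective step size (11). In the following sketch (assumed toy values, not the paper's Figure 1 experiment), the raw learning rates span a factor of 8, yet the resulting effective step sizes all remain in a band below the stability threshold:

```python
import numpy as np

# Sweep eps over an 8x range (toy values) and record the final
# effective step size eps_hat = eps*a*(g^T w)/||w||_H^3 from (11).
H = np.diag([1.0, 10.0])               # GD would require eps < 0.2 here
g = np.array([1.0, 1.0])

def final_eps_hat(eps, eps_a=0.9, steps=3000):
    a, w = 0.1, np.array([1.0, 1.0])
    for _ in range(steps):
        nw = np.sqrt(w @ H @ w)
        grad_w = -(a / nw) * (g - ((g @ w) / nw**2) * (H @ w))
        a, w = (1 - eps_a) * a + eps_a * (g @ w) / nw, w - eps * grad_w
    nw = np.sqrt(w @ H @ w)
    return eps * a * (g @ w) / nw**3

hats = [final_eps_hat(e) for e in (0.5, 1.0, 2.0, 4.0)]
# BNGD self-limits: all effective step sizes end up bounded, even
# though the raw eps varies by a factor of 8.
print(hats)
```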
4 Experiments
Let us first summarize our key findings and insights from the analysis of BNGD on the OLS problem.

- A scaling law (Proposition 3.2) governs BNGD, under which certain configurations can be deemed equivalent.
- BNGD converges for any learning rate ε > 0 for the weights, provided that ε_a ∈ (0, 1]. In particular, different learning rates can be used for the BN variables compared with the remaining trainable variables.
- There exist intervals of ε for which the performance of BNGD is not sensitive to the choice of ε.
In the subsequent sections, we first validate these claims numerically on the OLS model, and then show that the insights go beyond the simple OLS model considered in the theoretical framework. In fact, many of the uncovered properties are observed in general applications of BNGD in deep learning.
4.1 Experiments on OLS
Here we test the convergence and stability of BNGD for the OLS model. Consider a diagonal matrix H = diag(λ_1, …, λ_d), where {λ_i} is an increasing sequence. The scaling property (Proposition 3.2) allows us to set the initial value w_0 to have the same norm as u* = H^{-1}g, i.e. ∥w_0∥ = ∥u*∥. Of course, one can verify that the scaling property holds strictly in this case.
Figure 1 gives examples of H with different condition numbers κ. We tested the loss of BNGD, compared against the optimal GD (i.e. GD with the optimal step size ε* = 2/(λ_max + λ_min)), over a large range of step sizes ε_a and ε, and with different initial values of a_0. Another quantity we observe is the effective step size ε̂ of BN. The results are encoded by four different colors: whether ε̂ is close to the optimal step size ε*, and whether the loss of BNGD is lower than that of the optimal GD. The results indicate that the optimal convergence rate of BNGD can be better than that of GD in some configurations. This acceleration phenomenon is ascribed to the pseudo-condition number κ_a of H_a (which discards the only zero eigenvalue) being less than κ. This advantage of BNGD is significant when the discrepancy between the (pseudo-)condition numbers of H_a and H is large. However, if this difference is small, the acceleration is imperceptible. This is consistent with our analysis in Section 3.3.
Another important observation is that there is a region of (ε_a, ε) such that ε̂ is close to ε*; in other words, BNGD significantly extends the range of “optimal” step sizes. Consequently, we can choose step sizes in BNGD at greater liberty and still obtain almost the same or a better convergence rate than the optimal GD. However, the size of this region is inversely dependent on the initial condition. Hence, a small initialization in the first steps may improve robustness; on the other hand, too small an initialization will weaken the performance of BN. The phenomenon suggests that improper initialization of the BN parameters weakens the power of BN. This is encountered in practice, for instance in [13], where higher initial values of the BN parameters are found to be detrimental to the optimization of RNN models.
4.2 Experiments on practical deep learning problems
We conduct experiments on deep learning applied to standard classification datasets: MNIST [14], Fashion MNIST [15] and CIFAR-10 [16]. The goal is to explore whether the key findings outlined at the beginning of this section continue to hold in more general settings. For the MNIST and Fashion MNIST datasets, we use two different networks: (1) a one-layer fully connected network (784→10) with softmax mean-square loss; (2) a four-layer convolutional network (Conv-MaxPool-Conv-MaxPool-FC-FC) with ReLU activation function and cross-entropy loss. For the CIFAR-10 dataset, we use a five-layer convolutional network (Conv-MaxPool-Conv-MaxPool-FC-FC-FC). All the trainable parameters are randomly initialized by the Glorot scheme
[17] before training. For all three datasets, we use a minibatch size of 100 for computing stochastic gradients. In the BNGD experiments, batch normalization is performed on all layers, the BN parameters are initialized to transform the input to zero-mean/unit-variance distributions, and a small regularization parameter δ = 1e-3 is added to the variance to avoid division by zero.
Scaling property. Theoretically, the scaling property (Proposition 3.2) holds for any layer using BN. However, it may be slightly biased by the regularization parameter δ. Here, we test the scaling property in practical settings. Figure 2 gives the loss of network (2) (2CNN+2FC) at epoch 1 for different learning rates, where the norms of all weights and biases are rescaled by a common factor γ. We observe that the scaling property remains true for relatively large γ. However, when γ is small, the norms of the weights are small; the effect of the regularization parameter δ then becomes significant, causing the curves to be shifted.
Stability for large learning rates. We use the loss value at the end of the first epoch to characterize the performance of the BNGD and GD methods. Although the training of the models has generally not converged at this point, it is enough to extract some relative rate information. Figure 3 shows the loss values of the networks on the three datasets. It is observed that GD, and BNGD with identical learning rates for the weights and the BN parameters, exhibit a maximum allowed learning rate, beyond which the iterations become unstable. On the other hand, BNGD with separate learning rates exhibits a much larger range of stability with respect to the learning rate for the non-BN parameters, consistent with our theoretical results in Theorem 3.3.
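Returning to the scaling property: the bias introduced by a variance regularizer δ can be reproduced even in the OLS model, where the loss with regularized normalization is exactly computable. A minimal sketch (assumed toy values, not the network experiment of Figure 2):

```python
import numpy as np

# Illustration (toy values): a regularizer delta inside the BN
# variance breaks the exact scaling invariance w -> gamma*w of the
# OLS-BN loss when ||w|| is small, but is negligible when it is large.
H = np.diag([1.0, 5.0])
g = np.array([1.0, 1.0])
delta = 1e-3

def loss(a, w):
    # J(a, w) up to an additive constant, with regularized
    # normalization w^T x / sqrt(||w||_H^2 + delta):
    nw = np.sqrt(w @ H @ w + delta)
    return -a * (g @ w) / nw + 0.5 * a**2

w = np.array([1.0, 1.0])
ideal = -0.5 * (g @ w) / np.sqrt(w @ H @ w) + 0.5 * 0.5**2  # delta = 0
err_big   = abs(loss(0.5, 1e3 * w) - ideal)   # large gamma: invariant
err_small = abs(loss(0.5, 1e-3 * w) - ideal)  # small gamma: shifted
print(err_big < 1e-6, err_small > 1e-3)
```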
Insensitivity of performance to learning rates. Observe that BN accelerates convergence more significantly for deep networks, whereas for one-layer networks the best performances of BNGD and GD are similar. Furthermore, in most cases, the range of optimal learning rates in BNGD is quite large, which is in agreement with the OLS analysis (Proposition 3.5). This phenomenon is potentially crucial for understanding the acceleration of BNGD in deep neural networks. Heuristically, the “optimal” learning rates of GD in distinct layers (depending on some effective notion of “condition number”) may be vastly different. Hence, GD with a shared learning rate across all layers may not achieve the best convergence rates for all layers at the same time. In this case, it is plausible that the acceleration of BNGD is a result of the decreased sensitivity of its convergence rate to the learning rate parameter over a large range of choices.
5 Conclusion and Outlook
In this paper, we adopted a modeling approach to investigate the dynamical properties of batch normalization. The OLS problem is chosen as a point of reference because of its simplicity and the availability of complete convergence results for gradient descent. Even in such a simple setting, we saw that BNGD exhibits interesting non-trivial behavior, including scaling laws, robust convergence properties, asymptotic acceleration, and insensitivity of performance to the choice of learning rates. Although these results are derived only for the OLS model, we showed via experiments that they remain qualitatively valid for general scenarios encountered in deep learning, and they point to a concrete way of uncovering the reasons behind the effectiveness of batch normalization.
Interesting future directions include the extension of the results for the OLS model to more general settings of BNGD, where we believe the scaling law (Proposition 3.2) should play a significant role. In addition, we have not touched upon another empirically observed advantage of batch normalization, which is better generalization errors. It will be interesting to see how far the current approach takes us in investigating such probabilistic aspects of BNGD.
References

[1] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International Conference on Machine Learning, pp. 448–456, 2015.
[2] S. Ioffe, “Batch renormalization: Towards reducing minibatch dependence in batch-normalized models,” CoRR, vol. abs/1702.03275, 2017.
[3] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[4] L. Bottou, F. E. Curtis, and J. Nocedal, “Optimization methods for large-scale machine learning,” SIAM Review, vol. 60, no. 2, pp. 223–311, 2018.
[5] S. Santurkar, D. Tsipras, A. Ilyas, and A. Madry, “How does batch normalization help optimization? (No, it is not about internal covariate shift),” arXiv e-prints, May 2018.
[6] Y. Ma and D. Klabjan, “Convergence analysis of batch normalization for deep neural nets,” CoRR, vol. abs/1705.08011, 2017.
[7] J. Bjorck, C. Gomes, and B. Selman, “Understanding batch normalization,” arXiv e-prints, May 2018.
[8] J. Kohler, H. Daneshmand, A. Lucchi, M. Zhou, K. Neymeyr, and T. Hofmann, “Towards a theoretical understanding of batch normalization,” arXiv e-prints, May 2018.
[9] S. Arora, N. Cohen, and E. Hazan, “On the optimization of deep networks: Implicit acceleration by overparameterization,” in Proceedings of the 35th International Conference on Machine Learning (J. Dy and A. Krause, eds.), vol. 80 of Proceedings of Machine Learning Research, pp. 244–253, PMLR, 2018.
[10] Y. Saad, Iterative Methods for Sparse Linear Systems, vol. 82. SIAM, 2003.
[11] J. D. Lee, M. Simchowitz, M. I. Jordan, and B. Recht, “Gradient descent converges to minimizers,” arXiv e-prints, Feb. 2016.
[12] I. Panageas and G. Piliouras, “Gradient descent only converges to minimizers: Non-isolated critical points and invariant regions,” in 8th Innovations in Theoretical Computer Science Conference (ITCS 2017) (C. H. Papadimitriou, ed.), vol. 67 of Leibniz International Proceedings in Informatics (LIPIcs), pp. 2:1–2:12, Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, 2017.
[13] T. Cooijmans, N. Ballas, C. Laurent, and A. C. Courville, “Recurrent batch normalization,” CoRR, vol. abs/1603.09025, 2016.
[14] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[15] H. Xiao, K. Rasul, and R. Vollgraf, “Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms,” 2017.
[16] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” tech. rep., Citeseer, 2009.
[17] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256, 2010.
Appendix A Proof of Theorems
A.1 Gradients and Hessian matrix
The gradients of J are:
(17) ∂J/∂a = a − g^T w/∥w∥_H,
(18) ∇_w J = −(a/∥w∥_H)(g − (g^T w/∥w∥_H^2) H w).
The Hessian matrix is
(19) ∇^2 J = [[∂^2J/∂a^2, (∂^2J/∂a∂w)^T], [∂^2J/∂a∂w, ∂^2J/∂w^2]],
where
(20) ∂^2J/∂a^2 = 1, ∂^2J/∂a∂w = −(1/∥w∥_H)(g − (g^T w/∥w∥_H^2) H w),
(21) ∂^2J/∂w^2 = (a/∥w∥_H^3)(g(Hw)^T + (Hw)g^T + (g^T w)H − 3(g^T w)(Hw)(Hw)^T/∥w∥_H^2).
The objective function J has trivial critical points {(a, w) : a = 0, g^T w = 0}. For such w, it is obvious that a = 0 is the minimizer of the function a ↦ J(a, w), but (0, w) is not a local minimizer of J unless g = 0; hence the trivial critical points are saddle points of J. The Hessian matrix at those saddle points has at least one negative eigenvalue, hence the saddle points are strict.
On the other hand, the nontrivial critical points satisfy the relations
(22) w = s H^{-1}g (s ≠ 0), a = sgn(s) ∥H^{-1}g∥_H,
where the sign of a depends on the direction of w, i.e. sgn(a) = sgn(g^T w). It is easy to check that the nontrivial critical points are global minimizers. The Hessian matrix at those minimizers, restricted to w, is (a^2/∥w∥_H^2) H_a, where the matrix H_a is
(23) H_a := H − (Hu)(Hu)^T/∥u∥_H^2, u := H^{-1}g,
which is positive semi-definite and has a zero eigenvalue corresponding to the eigenvector u, i.e. H_a u = 0.
Lemma A.1.
If H is positive definite and H_a is defined as H_a := H − (Hu)(Hu)^T/(u^T H u), then the eigenvalues of H_a and H satisfy the following inequalities:
(24) 0 = λ_1(H_a) ≤ λ_1(H) ≤ λ_2(H_a) ≤ λ_2(H) ≤ ⋯ ≤ λ_d(H_a) ≤ λ_d(H).
Here λ_i(·) means the i-th smallest eigenvalue of a matrix.
Proof.
(1) According to the definition, we have H_a u = 0, and for any z,
(25) z^T H_a z = ∥z∥_H^2 − (z^T H u)^2/∥u∥_H^2 ≥ 0,
where the inequality is the Cauchy–Schwarz inequality in the H-inner product; this implies that H_a is positive semi-definite and λ_1(H_a) = 0. Furthermore, we have the following equality:
(26) z^T H_a z = ∥z∥_H^2 sin^2 θ_H(z, u),
where θ_H(z, u) is the angle between z and u in the H-inner product.
(2) We will prove that λ_i(H_a) ≤ λ_i(H) for all i. In fact, using the Min-Max Theorem and (25), we have
λ_i(H_a) = min_{dim(S)=i} max_{z∈S, z≠0} (z^T H_a z)/(z^T z) ≤ min_{dim(S)=i} max_{z∈S, z≠0} (z^T H z)/(z^T z) = λ_i(H).
(3) We will prove that λ_i(H) ≤ λ_{i+1}(H_a) for all i ≤ d − 1. In fact, for any subspace S with dim(S) = i + 1, the subspace S ∩ V, where V := {z : z^T H u = 0}, has dimension at least i, so
max_{z∈S, z≠0} (z^T H_a z)/(z^T z) ≥ max_{z∈S∩V, z≠0} (z^T H z)/(z^T z) ≥ λ_i(H),
where we have used the fact that z^T H_a z = z^T H z whenever z^T H u = 0. Taking the minimum over all such S gives λ_{i+1}(H_a) ≥ λ_i(H). ∎
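The statement of Lemma A.1 is easy to verify numerically (an illustrative check with a random positive definite H, not part of the original text):

```python
import numpy as np

# Numerical check of Lemma A.1 (illustration): for
# H_a := H - (Hu)(Hu)^T/(u^T H u), the eigenvalues of H_a and H
# interlace, and u spans the kernel of H_a.
rng = np.random.default_rng(0)
d = 6
A = rng.standard_normal((d, d))
H = A @ A.T + d * np.eye(d)          # random positive definite H
u = rng.standard_normal(d)

Hu = H @ u
Ha = H - np.outer(Hu, Hu) / (u @ Hu)

lam   = np.sort(np.linalg.eigvalsh(H))
lam_a = np.sort(np.linalg.eigvalsh(Ha))

tol = 1e-9
print(np.allclose(Ha @ u, 0.0))               # H_a u = 0
print(abs(lam_a[0]) < tol)                    # lambda_1(H_a) = 0
print(np.all(lam_a <= lam + tol))             # lambda_i(H_a) <= lambda_i(H)
print(np.all(lam[:-1] <= lam_a[1:] + tol))    # lambda_i(H) <= lambda_{i+1}(H_a)
```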
There are several corollaries related to the spectral properties of H_a. We first give some definitions. Since H_a is positive semi-definite, we can define the induced seminorm.
Definition A.2.
The seminorm ∥z∥_a of a vector z is defined as ∥z∥_a := (z^T H_a z)^{1/2}. We have ∥z∥_a = 0 if and only if z is parallel to u.
Definition A.3.
The pseudo-condition number of H_a is defined as κ_a := λ_max^a/λ_min^a, where λ_max^a and λ_min^a are the maximal and minimal nonzero eigenvalues of H_a.
Definition A.4.
For any real number ε, the pseudo-spectral radius of the matrix I − εH_a is defined as ρ_a(I − εH_a) := max{|1 − ελ_max^a|, |1 − ελ_min^a|}.
The following corollaries are direct consequences of Lemma A.1, hence we omit the proofs.
Corollary A.5.
The pseudo-condition number of H_a is less than or equal to the condition number of H:
(27) κ_a ≤ κ,
where equality holds if and only if u is an eigenvector of H corresponding to an eigenvalue λ with λ_min < λ < λ_max.
Corollary A.6.
For any vector z and any real number ε, we have ∥(I − εH_a)z∥_a ≤ ρ_a(I − εH_a) ∥z∥_a.
Corollary A.7.
For any positive number ε, we have
(28) ρ_a(I − εH_a) ≤ ρ(I − εH),
where the inequality is strict if λ_min^a > λ_min and λ_max^a < λ_max.
A.2 Scaling property
The dynamical system defined by equations (9)–(10) is completely determined by the configuration (H, g, ε_a, ε, a_0, w_0). It is easy to check that the system has the following scaling property:
Lemma A.8 (Scaling property).
Suppose μ > 0 and γ ≠ 0. Then:
(1) The configurations (H, g, ε_a, ε, a_0, w_0) and (μ^2 H, μg, ε_a, ε, a_0, w_0) are equivalent.
(2) The configurations (H, g, ε_a, ε, a_0, w_0) and (H, g, ε_a, γ^2 ε, a_0, γ w_0) are equivalent.
The scaling property is valid for general loss functions provided batch normalization is used. Consider a general problem
(29) 
and its BN version
(30) 
Then the gradient descent method gives the following iteration,
(31)  
(32) 
where the gradient of the original problem is:
(33) 
It is easy to check that the general BNGD iteration has the following property:
Lemma A.9 (General scaling property).
Suppose γ ≠ 0. Then the configurations (a_0, w_0, ε_a, ε, ∗) and (a_0, γw_0, ε_a, γ^2 ε, ∗) are equivalent. Here the sign ∗ stands for the remaining parameters, which are left unchanged.
A.3 Proof of Theorem 3.3
Recall the BNGD iterations
The scaling property simplifies our analysis by allowing us to normalize parts of the configuration; in the rest of this section we fix such a normalization.
For step sizes ε_a > 2, it is easy to check that |a_k| tends to infinity for suitable choices of the problem data and the initial value. Hence we only consider ε_a ∈ (0, 1], which makes the iteration for a_k bounded by some constant c_a.
Lemma A.10 (Boundedness of a_k).
If the step size ε_a ∈ (0, 1], then the sequence {a_k} is bounded, for any ε > 0 and any initial value (a_0, w_0).
Proof.
Define â_k := g^T w_k/∥w_k∥_H, which is bounded: |â_k| ≤ c_a := ∥H^{-1}g∥_H by the Cauchy–Schwarz inequality. Then a_{k+1} = (1 − ε_a) a_k + ε_a â_k.
Since ε_a ∈ (0, 1], we have |a_{k+1}| ≤ max{|a_k|, c_a}, and hence |a_k| ≤ max{|a_0|, c_a} for all k. ∎
According to the iterations (A.3), we have
(34) 
Define
(35)  
(36)  
(37) 
and using the properties of the norm ∥·∥_H, we have
(38) 
Therefore, we have the following lemma, which ensures that the iteration converges:
Lemma A.11.
If there are two positive numbers c_1 and c_2 such that the effective step size satisfies
(39) c_1 ≤ ε̂_k ≤ 2/λ_max − c_2
for all k large enough, then the iterations (A.3) converge to a minimizer.
Proof.
Without loss of generality, we assume the inequality (39) is satisfied for all k. We will prove that ∥w_k∥ converges and that the direction of w_k converges to the direction of H^{-1}g.
(1) Since ∥w_k∥ is non-decreasing, we only need to prove that it is bounded. We have
(40)  
(41)  
(42)  
(43) 
The inequalities in the last lines are based on the fact that a_k and â_k are bounded by a constant. Next, we will prove that the increments are summable, which implies that {∥w_k∥} is bounded.
According to the estimate (38), we have
(44)  