Quiz show

BN was proposed by Sergey Ioffe and Christian Szegedy. Which one of the following papers was also published by Christian Szegedy?

A. (DeepID2) Deep Learning Face Representation by Joint Identification-Verification
B. (Joint Bayesian) Bayesian Face Revisited: A Joint Formulation
C. Robust Multi-Resolution Pedestrian Detection in Traffic Scenes
D. RASL: Robust alignment by sparse and low-rank decomposition for linearly correlated images
E. (GoogLeNet) Going Deeper with Convolutions
  • E

What's normalization?


What normalization was used before BN?

  • Caffe
    • Local Response Normalization
    • Mean-Variance Normalization
  • MatConvNet
    • Cross-Channel Normalization
    • Spatial Normalization

Cross-Channel Normalization

For each output channel k

G(k) is a corresponding subset of input channels

$$ G(k) \subset \{1, 2, \dots, D\} $$
$$ y_{ijk} = \dfrac{x_{ijk}}{\left(\kappa + \alpha \sum_{m \in G(k)} x^2_{ijm}\right)^{\beta}} $$
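A minimal NumPy sketch of this rule, assuming \(G(k)\) is a window of channels centred on \(k\); the function and parameter names (`window`, `kappa`, `alpha`, `beta`) are illustrative, not MatConvNet's API:

```python
import numpy as np

def cross_channel_normalize(x, window=5, kappa=1.0, alpha=1e-4, beta=0.75):
    """Cross-channel normalization sketch for x of shape (H, W, D).

    G(k) is taken to be a window of `window` channels centred on k,
    clipped at the channel boundaries.
    """
    H, W, D = x.shape
    half = window // 2
    sq = x ** 2
    y = np.empty_like(x)
    for k in range(D):
        lo, hi = max(0, k - half), min(D, k + half + 1)          # G(k)
        denom = (kappa + alpha * sq[:, :, lo:hi].sum(axis=2)) ** beta
        y[:, :, k] = x[:, :, k] / denom
    return y
```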

Spatial Normalization

For each output channel k

$$ n^2_{ijk} = \dfrac{1}{H'W'} \sum_{1 \leq i' \leq H',\ 1 \leq j' \leq W'} x^2_{i+i'-1-\lfloor (H'-1)/2 \rfloor,\ j+j'-1-\lfloor (W'-1)/2 \rfloor,\ k} $$
$$ y_{ijk} = \dfrac{x_{ijk}}{\left(1 + \alpha\, n^2_{ijk}\right)^{\beta}} $$
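A minimal NumPy sketch under the same reading, assuming zero padding outside the image; `Hp`, `Wp` and the default constants are illustrative:

```python
import numpy as np

def spatial_normalize(x, Hp=3, Wp=3, alpha=1e-4, beta=0.75):
    """Spatial normalization sketch for x of shape (H, W, D).

    n2[i, j, k] is the average of squared activations in an Hp x Wp
    window centred (as in the formula) on (i, j), within channel k.
    """
    H, W, D = x.shape
    off_i, off_j = (Hp - 1) // 2, (Wp - 1) // 2
    sq = np.zeros((H + Hp, W + Wp, D))
    sq[off_i:off_i + H, off_j:off_j + W, :] = x ** 2             # zero padding
    n2 = np.zeros_like(x)
    for di in range(Hp):
        for dj in range(Wp):
            n2 += sq[di:di + H, dj:dj + W, :]
    n2 /= Hp * Wp
    return x / (1.0 + alpha * n2) ** beta
```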

Local Response Normalization

Two modes:

  • ACROSS_CHANNELS
    • across nearby channels
    • but not over the spatial extent
  • WITHIN_CHANNEL
    • extends spatially
    • but within separate channels
$$ y_{i} = \dfrac{x_{i}}{\left(1 + (\alpha/n) \sum_{j \in N(i)} x^2_{j}\right)^{\beta}} $$

where \(n\) is the size of the local region \(N(i)\).
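A rough NumPy sketch of the WITHIN_CHANNEL mode (the sum runs over a spatial window inside each channel); border handling and parameter defaults are illustrative and may differ from Caffe's implementation:

```python
import numpy as np

def lrn_within_channel(x, size=5, alpha=1e-4, beta=0.75):
    """LRN sketch, WITHIN_CHANNEL mode, for x of shape (H, W, D).

    For each location the sum runs over a size x size spatial window in
    the same channel; n is the number of elements in a full window.
    """
    H, W, D = x.shape
    n = size * size
    half = size // 2
    sq = x ** 2
    y = np.empty_like(x)
    for i in range(H):
        for j in range(W):
            i0, i1 = max(0, i - half), min(H, i + half + 1)
            j0, j1 = max(0, j - half), min(W, j + half + 1)
            local = sq[i0:i1, j0:j1, :].sum(axis=(0, 1))
            y[i, j, :] = x[i, j, :] / (1.0 + (alpha / n) * local) ** beta
    return y
```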

Mean-Variance Normalization

Two modes:

  • ACROSS_CHANNELS
    • across nearby channels
    • but not over the spatial extent
  • WITHIN_CHANNEL
    • extends spatially
    • but within separate channels
$$ y_{i} = \dfrac{x_{i} - \mu(x)}{\epsilon + \sqrt{Var(x)}} $$
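A minimal NumPy sketch of mean-variance normalization, assuming statistics are taken over the spatial dimensions (optionally also over channels); the `across_channels` flag and names are illustrative:

```python
import numpy as np

def mvn(x, eps=1e-9, across_channels=False):
    """Mean-variance normalization sketch for x of shape (H, W, D):
    subtract the mean and divide by the standard deviation."""
    axes = (0, 1, 2) if across_channels else (0, 1)
    mu = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mu) / (np.sqrt(var) + eps)
```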

What's batch normalization?

Motivation

Problem: internal covariate shift

Change in the distribution of network activations due to the change in network parameters during training


Motivation

Idea: ensure the distribution of nonlinearity inputs remains more stable

$$ \hat{x} = \dfrac{x - E[x]}{\sqrt{Var[x]}} $$
$$ E[\hat{x}] = 0, \quad Var[\hat{x}] = 1 $$
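A quick numeric check of this property on arbitrary data (a toy example, not a BN layer):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=100_000)     # arbitrary activations
x_hat = (x - x.mean()) / np.sqrt(x.var())
print(x_hat.mean(), x_hat.var())                     # approx. 0 and 1
```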

Forward

$$ x, y \in \Re^{H \times W \times K \times M}, \quad \gamma, \beta \in \Re^{K} $$
$$ \mu_k = \dfrac{1}{HWM} \sum_{i=1}^{H} \sum_{j=1}^{W} \sum_{m=1}^{M} x_{ijkm} $$
$$ \sigma_k^2 = \dfrac{1}{HWM} \sum_{i=1}^{H} \sum_{j=1}^{W} \sum_{m=1}^{M} (x_{ijkm} - \mu_k)^2 $$
$$ \hat{x}_{ijkm} = \dfrac{x_{ijkm} - \mu_{k}}{\sqrt{\sigma_k^2 + \epsilon}} $$
$$ y_{ijkm} = \gamma_k \times \hat{x}_{ijkm} + \beta_k $$
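A minimal NumPy sketch of the forward pass above, assuming `x` has shape `(H, W, K, M)` and `gamma`, `beta` have shape `(K,)`:

```python
import numpy as np

def bn_forward(x, gamma, beta, eps=1e-5):
    """Batch-norm forward sketch: statistics per feature map k,
    taken over height, width and the mini-batch (axes 0, 1, 3)."""
    mu = x.mean(axis=(0, 1, 3), keepdims=True)        # shape (1, 1, K, 1)
    var = x.var(axis=(0, 1, 3), keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    y = gamma.reshape(1, 1, -1, 1) * x_hat + beta.reshape(1, 1, -1, 1)
    return y, x_hat, mu, var
```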


Backward

For one feature map:

$$ \small{N = H \times W \times M} $$
$$ \dfrac{\partial l}{\partial \gamma} = \sum_{i=1}^{N} \dfrac{\partial l}{\partial y_i} \cdot \hat{x}_i $$
$$ \dfrac{\partial l}{\partial \beta} = \sum_{i=1}^{N} \dfrac{\partial l}{\partial y_i} $$

The parameters of the BN layer can be updated using the above equations.
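A minimal NumPy sketch of these two gradients for a single feature map, assuming `dy` and `x_hat` hold \(\partial l/\partial y\) and \(\hat{x}\) for that map:

```python
import numpy as np

def bn_param_grads(dy, x_hat):
    """Gradients of gamma and beta for one feature map; dy and x_hat
    have shape (H, W, M) and are summed over all N = H*W*M positions."""
    dgamma = np.sum(dy * x_hat)     # dl/dgamma = sum_i dl/dy_i * x_hat_i
    dbeta = np.sum(dy)              # dl/dbeta  = sum_i dl/dy_i
    return dgamma, dbeta
```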

Backward

For one feature map k:

$$ \dfrac{\partial l}{\partial x_{ijkm}} = \sum_{i'j'm'} \dfrac{\partial l}{\partial y_{i'j'km'}} \dfrac{\partial y_{i'j'km'}}{\partial x_{ijkm}} $$
$$ \dfrac{\partial y_{i'j'km'}}{\partial x_{ijkm}} = \gamma_k \left( \left( \delta_{ii'} \delta_{jj'} \delta_{mm'} - \dfrac{\partial \mu_k}{\partial x_{ijkm}} \right) \dfrac{1}{\sqrt{\sigma_k^2 + \epsilon}} - \dfrac{1}{2} (x_{i'j'km'} - \mu_k) (\sigma_k^2 + \epsilon)^{-3/2} \dfrac{\partial \sigma_k^2}{\partial x_{ijkm}} \right) $$
$$ \dfrac{\partial \mu_k}{\partial x_{ijkm}} = \dfrac{1}{HWM} = \dfrac{1}{N} $$
$$ \dfrac{\partial \sigma^2_k}{\partial x_{ijkm}} = \dfrac{2}{N} (x_{ijkm} - \mu_k) $$

The gradient can be back-propagated through the BN layer using the above equations.
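A minimal NumPy sketch that collects these terms for a single feature map (it recomputes the forward statistics for clarity; names are illustrative):

```python
import numpy as np

def bn_backward(dy, x, gamma, eps=1e-5):
    """Gradient w.r.t. the inputs of one feature map; x and dy have
    shape (H, W, M), N = H*W*M."""
    N = x.size
    mu = x.mean()
    var = x.var()
    x_hat = (x - mu) / np.sqrt(var + eps)
    dx_hat = dy * gamma
    # contributions through sigma^2 and mu, then the direct path
    dvar = np.sum(dx_hat * (x - mu)) * -0.5 * (var + eps) ** -1.5
    dmu = -np.sum(dx_hat) / np.sqrt(var + eps) + dvar * np.mean(-2.0 * (x - mu))
    dx = dx_hat / np.sqrt(var + eps) + dvar * 2.0 * (x - mu) / N + dmu / N
    return dx
```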

Parameters of BN in practice

Network: xd_net_12m


Quiz show

The picture below describes what kind of Normalization?

A. Cross-Channel Normalization
B. Spatial Normalization
C. Batch Normalization
D. Local Response Normalization
E. Mean-Variance Normalization
  • C

Quiz show

The number of \(\gamma\) parameters in a BN layer equals?

A. Batch size
B. The number of feature maps
C. The number of activations
  • B
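As a quick check, assuming PyTorch is available, the learnable scale of a 2-D batch-norm layer has exactly one \(\gamma\) (and one \(\beta\)) per feature map:

```python
import torch

bn = torch.nn.BatchNorm2d(64)            # a layer with 64 feature maps
print(bn.weight.shape, bn.bias.shape)    # torch.Size([64]) torch.Size([64])
```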

Why batch normalization?

Experiments in the paper


Our experiments

  • CIFAR-10
  • Training samples: 50,000

Our experiments

  • DeepID2
  • Training samples: 500,000

Our experiments (from Sun Yi)

  • BN seems to be sensitive to the learning rate or weight initialization?
| Network | lr | iter | pca dim | accuracy |
| --- | --- | --- | --- | --- |
| (tile conv) sn01, bn | 0.01 | 150000 | 300 | 0.984000 |
| (tile conv) sn02, bn | 0.05 | 150000 | 400/500/700 | 0.991167 |
| (full conv) tn03 7x6x1024(3)->7x6x256(1)->512, bn | 0.03 | 150000 | 300 | 0.989000 |
| (full conv) tn01 7x6x1024(3)->7x6x256(1)->512, bn | 0.05 | 150000 | 400 | 0.992667 |
| (full conv) tn04 7x6x1024(3)->7x6x256(1)->512, bn | 0.1 | 150000 | 800/900 | 0.990167 |
| (full conv) np04 tn01 -> no bn | 0.05 | 150000 | 400 | 0.990500 |
| (full conv) np05 tn01 -> no bn | 0.01 | 150000 | 500/800 | 0.990000 |

Our experiments (from Sun Gang)

  • Accelerates the training of VGG
  • Does not improve accuracy (even slightly lower)
  • The improvement reported in the paper is inconclusive: the structure used in the BN paper differs from GoogLeNet

Statistics from BN's citations


<Thank You!>