
Backpropagation In Convolutional Neural Networks

Cross Correlation

Given an input image \(I\) and a filter \(K\) of dimensions \(k_1 \times k_2\), the cross correlation operation is defined as:

\[(I \otimes K)_{ij} = \sum^{k_1 - 1}_{m=0}\sum^{k_2 - 1}_{n=0} I(i + m, j + n)K(m, n)\]
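
As a quick sanity check, this definition maps directly onto a few lines of NumPy. This is only a sketch; the helper name cross_correlate2d is made up for illustration.

import numpy as np

def cross_correlate2d(I, K):
    # valid cross correlation: slide K over I without any padding
    k1, k2 = K.shape
    H, W = I.shape
    out = np.zeros((H - k1 + 1, W - k2 + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(I[i:i+k1, j:j+k2] * K)
    return out

I = np.arange(16, dtype=float).reshape(4, 4)
K = np.array([[1.0, 0.0], [0.0, -1.0]])
print(cross_correlate2d(I, K))  # 3 x 3 output map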

Convolution

The convolution operation is defined as:

\[(I * K)_{ij} = \sum^{k_1 - 1}_{m=0}\sum^{k_2 - 1}_{n=0} I(i - m, j - n)K(m, n)\]

which is equivalent to cross correlation with the kernel flipped by 180 degrees:

\[(I * K)_{ij} = \sum^{k_1 - 1}_{m=0}\sum^{k_2 - 1}_{n=0} I(i + m, j + n) \, \text{rot}_{180}(K)(m, n)\]
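
This equivalence is easy to verify numerically. A minimal sketch, assuming SciPy is available; np.rot90(K, 2) performs the 180-degree flip:

import numpy as np
from scipy.signal import convolve2d, correlate2d

I = np.arange(16, dtype=float).reshape(4, 4)
K = np.array([[1.0, 2.0], [3.0, 4.0]])

# true 2D convolution vs. cross correlation with the kernel rotated by 180 degrees
conv = convolve2d(I, K, mode='valid')
corr_flipped = correlate2d(I, np.rot90(K, 2), mode='valid')
print(np.allclose(conv, corr_flipped))  # True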

CNNs

A convolution layer consists of:

  1. An input map \(I \in \mathbb{R}^{H \times W \times C}\) with height \(H\), width \(W\), and channels \(C\).
  2. A bank of \(D\) filters \(K \in \mathbb{R}^{k_1 \times k_2 \times C \times D}\), where each filter \(K^d \in \mathbb{R}^{k_1 \times k_2 \times C}\).
  3. Bias \(b \in \mathbb{R}^{D}\) (one for each filter).
  4. An activation function \(f\) applied element-wise.

In general, the forward output of applying filter \(d\) on the input map is:

\[(I * K^d)_{ij} = \sum^{k_1 - 1}_{m=0}\sum^{k_2 - 1}_{n=0} \sum^{C}_{c=1} I(i + m, j + n, c) \cdot \text{rot}_{180}(K^d)(m, n, c) + b^d\]

However, in CNNs, we do not care about the orientation of the filters (i.e., whether they are flipped or not), because the weights are going to be learned regardless of the orientation. Thus, we can safely drop the flip and treat the convolution operation as cross correlation:

\[(I * K^d)_{ij} = \sum^{k_1 - 1}_{m=0}\sum^{k_2 - 1}_{n=0} \sum^{C}_{c=1} I(i + m, j + n, c) \cdot K^d(m, n, c) + b^d\]

For simplicity, we first focus on the 2D case (i.e., a grayscale image with \(C = 1\)) and a single filter:

\[(I * K)_{ij} = \sum^{k_1 - 1}_{m=0}\sum^{k_2 - 1}_{n=0} I(i + m, j + n)K(m, n)\]

Notations

  1. \(l\) denotes the \(l\)th layer, where \(l = 1\) is the first layer and \(L\) is the last layer.
  2. Input \(\mathbf{x}\) has dimension \(H \times W\).
  3. The filter weight at each layer is denoted as \(w^l_{m, n}, \; 0 \leq m < k^l_1, \; 0 \leq n < k^l_2\)
  4. The convolved input at each layer is denoted as \(y^{l}_{i, j}\): \[y^{l}_{i, j} = \sum^{k^l_1 - 1}_{m=0}\sum^{k^l_2 - 1}_{n=0} w^{l}_{m, n} x^{l - 1}_{i+m, j+n} + b^l\]
  5. The output of each convolution layer after applying the activation function is denoted as: \[x^{l}_{ij} = f(y^{l}_{i, j})\]

Forward Pass

The forward pass is essentially characterized by the equations above.
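
As a minimal sketch of one layer's forward pass (assuming a single-channel input, one filter, and ReLU as the activation \(f\); the name forward_layer is illustrative only):

import numpy as np

def forward_layer(x_prev, w, b, f=lambda y: np.maximum(y, 0.0)):
    # y^l = cross correlation of x^{l-1} with w^l plus b^l, then x^l = f(y^l)
    k1, k2 = w.shape
    H, W = x_prev.shape
    y = np.zeros((H - k1 + 1, W - k2 + 1))
    for i in range(y.shape[0]):
        for j in range(y.shape[1]):
            y[i, j] = np.sum(x_prev[i:i+k1, j:j+k2] * w) + b
    return y, f(y)

x0 = np.random.randn(8, 8)
w1 = np.random.randn(3, 3)
y1, x1 = forward_layer(x0, w1, b=0.1)
print(y1.shape, x1.shape)  # (6, 6) (6, 6)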

Backward Pass

Let the mean squared error loss be defined as:

\[L = \frac{1}{N} \sum^{N}_{i=1} (t_i - y^L_i)^2\]

Where \(y^L_i\) is the prediction from the final layer and \(t_i\) is the target value.
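
For a concrete example, a direct NumPy transcription of this loss with made-up numbers:

import numpy as np

t = np.array([1.0, 0.0, 0.0])    # targets
y_L = np.array([0.8, 0.1, 0.2])  # final-layer predictions
L = np.mean((t - y_L) ** 2)
print(L)  # 0.03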

In backward propagation, we seek to compute the gradient of the loss w.r.t. the weights, \(\frac{\partial L}{\partial w^{l}_{m, n}}\), for each layer, as well as the gradient w.r.t. the output map of each layer, \(\frac{\partial L}{\partial y^l_{i, j}}\).

Gradient of the Weights

Since the output map \(\mathbf{y}^l\) has dimensions \(H^{l} = H^{l - 1} - k^l_{1} + 1\) and \(W^{l} = W^{l - 1} - k^{l}_2 + 1\), applying the chain rule (with \(\mathbf{y}^{l}\) reshaped into an \(H^{l} W^{l} \times 1\) vector) gives:

\[\begin{aligned} \frac{\partial L}{\partial w^{l}_{m^{\prime}, n^{\prime}}} &= \frac{\partial L}{\partial \mathbf{y}^l} \frac{\partial \mathbf{y}^{l}}{\partial w^{l}_{m^{\prime}, n^{\prime}}}\\ &= \sum^{H^{l} - 1}_{i=0} \sum^{W^{l} - 1}_{j=0}\frac{\partial L}{\partial y^l_{i, j}}\frac{\partial y_{i, j}^{l}}{\partial w^{l}_{m^{\prime}, n^{\prime}}}\\ \end{aligned}\]

By expanding \(\frac{\partial y_{i, j}^{l}}{\partial w^{l}_{m^{\prime}, n^{\prime}}}\) we have:

\[\begin{aligned} \frac{\partial y_{i, j}^{l}}{\partial w^{l}_{m^{\prime}, n^{\prime}}} &= \frac{\partial}{\partial w^{l}_{m^{\prime}, n^{\prime}}}\left(\sum^{k^l_1 - 1}_{m=0}\sum^{k^l_2 - 1}_{n=0} w^{l}_{m, n} x^{l - 1}_{i+m, j+n} + b^l\right)\\ &= x^{l - 1}_{i + m^{\prime}, j + n^{\prime}} \end{aligned}\]

By substituting the above result back into the previous gradient equation we have:

\[\frac{\partial L}{\partial w^{l}_{m^{\prime}, n^{\prime}}} = \sum^{H^{l} - 1}_{i=0} \sum^{W^{l} - 1}_{j=0}\frac{\partial L}{\partial y^l_{i, j}} x^{l - 1}_{i + m^{\prime}, j + n^{\prime}}\]

We can see that this is the cross correlation of the input feature map with the gradient map \(\frac{\partial L}{\partial \mathbf{y}^l}\):

\[\frac{\partial L}{\partial w^{l}_{m^{\prime}, n^{\prime}}} = (X^{l - 1} \otimes \frac{\partial L}{\partial \mathbf{y}^l})_{m^{\prime}, n^{\prime}}\]

Where \(\frac{\partial L}{\partial \mathbf{y}^l}\) is reshaped into an \(H^l \times W^l\) matrix.
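
The sketch below checks this identity with random data, assuming the delta map \(\frac{\partial L}{\partial \mathbf{y}^l}\) is already given (e.g. propagated from the layer above); cross_correlate2d is the same illustrative helper as before, redefined here so the snippet runs on its own.

import numpy as np

def cross_correlate2d(I, K):
    # valid cross correlation, as defined earlier
    k1, k2 = K.shape
    out = np.zeros((I.shape[0] - k1 + 1, I.shape[1] - k2 + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(I[i:i+k1, j:j+k2] * K)
    return out

x_prev = np.random.randn(6, 6)   # x^{l-1}
delta_y = np.random.randn(4, 4)  # dL/dy^l for a 3x3 kernel: 6 - 3 + 1 = 4

# the identity: dL/dw is the cross correlation of the input map with the delta map
grad_w = cross_correlate2d(x_prev, delta_y)  # shape (3, 3)

# the explicit double sum from the derivation, for comparison
grad_w_explicit = np.zeros((3, 3))
for m in range(3):
    for n in range(3):
        for i in range(4):
            for j in range(4):
                grad_w_explicit[m, n] += delta_y[i, j] * x_prev[i + m, j + n]

print(np.allclose(grad_w, grad_w_explicit))  # True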

Gradients of the Inputs

Recall that \(\mathbf{x}^l = f(\mathbf{y}^l)\), so they have the same dimensions. Since \(f\) is applied element-wise, \(x^l_{i, j}\) depends only on \(y^l_{i, j}\); in addition, \(x^l_{i^{\prime}, j^{\prime}}\) appears in \(y^{l + 1}_{i^{\prime} - m, j^{\prime} - n}\) with coefficient \(w^{l + 1}_{m, n}\) (terms where \(i^{\prime} - m\) or \(j^{\prime} - n\) falls outside the output map are taken to be zero). Therefore:

\[\begin{aligned} \frac{\partial L}{\partial y^l_{i, j}} &= \frac{\partial L}{\partial x^l_{i, j}} \frac{\partial x^l_{i, j}}{\partial y_{i, j}^l}\\ &= \frac{\partial L}{\partial x^l_{i, j}} f^{\prime} (y^l_{i, j})\\ \\ \frac{\partial L}{\partial x^l_{i^{\prime}, j^{\prime}}} &= \frac{\partial L}{\partial \mathbf{y}^{l + 1}} \frac{\partial \mathbf{y}^{l + 1}}{\partial x^l_{i^{\prime}, j^{\prime}}}\\ &= \sum^{k^{l + 1}_1 - 1}_{m=0} \sum^{k^{l + 1}_2 - 1}_{n=0} \frac{\partial L}{\partial y^{l + 1}_{i^{\prime} - m, j^{\prime} - n}} w^{l + 1}_{m, n} \end{aligned}\]

This is exactly the full convolution of the gradient map with the kernel, or equivalently a full cross correlation with the kernel flipped by 180 degrees (notice that here we do have to flip the kernel):

\[\frac{\partial L}{\partial x^l_{i^{\prime}, j^{\prime}}} = (\frac{\partial L}{\partial \mathbf{y}^{l + 1}} * W^{l + 1} )_{i^{\prime}, j^{\prime}}\]

Where \(W^{l + 1}\) and \(\frac{\partial L}{\partial \mathbf{y}^{l + 1}}\) are reshaped into \(k^{l + 1}_1 \times k^{l + 1}_2\) and \(H^{l + 1} \times W^{l + 1}\) matrices, respectively, and \(\frac{\partial L}{\partial \mathbf{y}^{l + 1}}\) is zero padded (by \(k^{l + 1}_1 - 1\) rows and \(k^{l + 1}_2 - 1\) columns on each side) so that the full convolution covers every input position.
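
The same identity can be checked numerically. In the sketch below, the delta map \(\frac{\partial L}{\partial \mathbf{y}^{l + 1}}\) is zero padded by \(k - 1\) on each side and cross correlated with the 180-degree-flipped kernel (which is what a full convolution amounts to), and the result is compared against the explicit sum from the derivation; the shapes are arbitrary.

import numpy as np

k = 3
w_next = np.random.randn(k, k)        # w^{l+1}
delta_y_next = np.random.randn(4, 4)  # dL/dy^{l+1}, so x^l is 6 x 6

# full convolution: pad the delta map by k - 1 and cross correlate with rot180(w)
padded = np.pad(delta_y_next, k - 1)
w_flip = np.rot90(w_next, 2)
grad_x = np.zeros((6, 6))
for i in range(6):
    for j in range(6):
        grad_x[i, j] = np.sum(padded[i:i+k, j:j+k] * w_flip)

# explicit sum: dL/dx_{i', j'} = sum_{m, n} dL/dy^{l+1}_{i'-m, j'-n} * w^{l+1}_{m, n}
grad_x_explicit = np.zeros((6, 6))
for ip in range(6):
    for jp in range(6):
        for m in range(k):
            for n in range(k):
                if 0 <= ip - m < 4 and 0 <= jp - n < 4:
                    grad_x_explicit[ip, jp] += delta_y_next[ip - m, jp - n] * w_next[m, n]

print(np.allclose(grad_x, grad_x_explicit))  # True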

Implementation

A plain NumPy implementation of the forward pass (cross correlation with padding p and stride s):

import numpy as np

def conv2d(images, kernel, bias, p, s):
    # images: (n, c_in, h_in, w_in), kernel: (c_out, c_in, k, k), bias: (c_out,)
    n, c_in, h_in, w_in = images.shape
    c_out, _, k, _ = kernel.shape
    h_out = (h_in + 2 * p - k) // s + 1
    w_out = (w_in + 2 * p - k) // s + 1
    # zero pad the spatial dimensions only
    pad_images = np.pad(images, [(0, 0), (0, 0), (p, p), (p, p)])
    out_images = np.zeros((n, c_out, h_out, w_out))
    _, _, pad_h, pad_w = pad_images.shape
    for i in range(n):          # loop over images
        for j in range(c_out):  # loop over filters
            row = out_row = 0
            while row + k <= pad_h:
                col = out_col = 0
                while col + k <= pad_w:
                    # sum over the k x k x c_in window times filter j, plus its bias
                    out_images[i, j, out_row, out_col] = \
                        np.sum(pad_images[i, :, row:row+k, col:col+k] * kernel[j]) + bias[j]
                    col += s
                    out_col += 1
                row += s
                out_row += 1

    return out_images
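
A quick shape check of the routine above with random data (padding 1 and stride 1 with a 3x3 kernel should preserve the spatial size):

images = np.random.randn(2, 3, 8, 8)  # 2 images, 3 channels, 8 x 8
kernel = np.random.randn(4, 3, 3, 3)  # 4 filters, 3 input channels, 3 x 3
bias = np.zeros(4)
out = conv2d(images, kernel, bias, p=1, s=1)
print(out.shape)  # (2, 4, 8, 8)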
