Yuhang He's Blog


Delving Deeper into Convolution Operation

The Convolutional Neural Network (CNN) has been serving as the hallmark for various vision tasks. But do we really understand the nuts and bolts of the image convolution operation? Let's set aside the mathematical analysis of convolution, which has already been thoroughly covered by a variety of online blogs. Instead, the question that haunts my brain is how convolution actually works, especially when the input has multiple channels and the output has a different number of channels:

For example, the input Blob is $N \times C_1 \times H_1 \times W_1$, and the output Blob after convolution is $N \times C_2 \times H_2 \times W_2$. While the changes in Blob height and width are easy to understand, how does the convolution operation change $C_1$ to $C_2$?
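Before diving into the code, a bit of shape bookkeeping already hints at the answer (assuming group $=1$, padding $p$, stride $s$): the weight Blob stores one $C_1 \times K_h \times K_w$ filter per output channel, so both channel counts live inside $W$, while the spatial sizes follow the usual formula:

$$W \in \mathbb{R}^{C_2 \times C_1 \times K_h \times K_w}, \qquad H_2 = \left\lfloor \frac{H_1 + 2p - K_h}{s} \right\rfloor + 1, \qquad W_2 = \left\lfloor \frac{W_1 + 2p - K_w}{s} \right\rfloor + 1$$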

I turned to Google for help but found nothing useful. Finally, I decided to read the Caffe source code, the most reliable guide for figuring out any such problem, to dig out the answer.

How Does Caffe Address It Step by Step?

The core code for Caffe's convolution operation lives in conv_layer.cpp and base_conv_layer.cpp (we only consider the weight $w$ and ignore the bias $b$ for conciseness):

//conv_layer.cpp
template <typename Dtype>
void ConvolutionLayer<Dtype>::Forward_cpu(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top) {
  const Dtype* weight = this->blobs_[0]->cpu_data();
  for (int i = 0; i < bottom.size(); ++i) {
    const Dtype* bottom_data = bottom[i]->cpu_data();
    Dtype* top_data = top[i]->mutable_cpu_data();
    //num_ is the batch size
    //weight size: [conv_out_channels_ x kernel_dim_]
    //kernel_dim_: conv_in_channels_ x kernel_h x kernel_w (group_ == 1)
    //bottom_dim_: conv_in_channels_ x input_height x input_width
    for (int n = 0; n < this->num_; ++n) {
      this->forward_cpu_gemm(bottom_data + n * this->bottom_dim_, weight,
          top_data + n * this->top_dim_);
      if (this->bias_term_) {
        const Dtype* bias = this->blobs_[1]->cpu_data();
        this->forward_cpu_bias(top_data + n * this->top_dim_, bias);
      }
    }
  }
}
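As a quick sanity check on the pointer arithmetic: Blobs are stored in NCHW order, so with bottom_dim_ $= C_1 H_1 W_1$ and top_dim_ $= C_2 H_2 W_2$, the offset bottom_data + n * this->bottom_dim_ points exactly at the start of the $n$-th image. The batch dimension $N$ therefore never enters the matrix multiplication itself; forward_cpu_gemm handles one image at a time.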
//base_conv_layer.cpp
template <typename Dtype>
void BaseConvolutionLayer<Dtype>::forward_cpu_gemm(const Dtype* input,
    const Dtype* weights, Dtype* output, bool skip_im2col) {
  const Dtype* col_buff = input;
  if (!is_1x1_) {
    if (!skip_im2col) {
      conv_im2col_cpu(input, col_buffer_.mutable_cpu_data());
    }
    col_buff = col_buffer_.cpu_data();
  }
  //weights: conv_out_channels_ x kernel_dim_
  //col_buff: kernel_dim_ x conv_out_spatial_dim_
  //conv_out_spatial_dim_: top blob height x width
  for (int g = 0; g < group_; ++g) {
    caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans, conv_out_channels_ /
        group_, conv_out_spatial_dim_, kernel_dim_,
        (Dtype)1., weights + weight_offset_ * g, col_buff + col_offset_ * g,
        (Dtype)0., output + output_offset_ * g);
  }
}
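With group_ $=1$, this call to caffe_cpu_gemm is one plain matrix product, and writing out its dimensions makes the channel mapping explicit (for instance, $C_1 = 3$ and $K_h = K_w = 3$ give kernel_dim_ $= 27$, so a layer with $C_2 = 64$ multiplies a $64 \times 27$ weight matrix against a $27 \times H_2 W_2$ column buffer):

$$\underbrace{W}_{C_2 \times (C_1 K_h K_w)} \cdot \underbrace{\text{col\_buff}}_{(C_1 K_h K_w) \times (H_2 W_2)} = \underbrace{\text{output}}_{C_2 \times (H_2 W_2)}$$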

From the above two code snippets, we can see the convolution pipeline: within a mini-batch, each iteration takes one bottom image, unrolls it with im2col, and multiplies it by the weight matrix to produce the corresponding top image. That is, the bottom Blob is mapped to the top Blob via a weight matrix $W$. Since $W$ has conv_out_channels_ rows and each row produces one output channel, this matrix product is precisely how $C_1$ is turned into $C_2$.
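To make the channel mapping concrete outside of Caffe, below is a minimal sketch of the same convolution written naively, without im2col or GEMM; the function naive_conv and the names C1, C2, K are my own, and it assumes stride 1, no padding, and group $=1$. Every output channel owns a full $C_1 \times K \times K$ filter and sums over all input channels, which is exactly what one row of $W$ does in the GEMM:

// naive_conv.cpp -- illustrative sketch only, not Caffe code
#include <vector>

// input : C1 x H1 x W1, stored as input[(c1 * H1 + y) * W1 + x]
// weight: C2 x C1 x K x K, one K x K kernel per (c2, c1) pair
// output: C2 x H2 x W2 with H2 = H1 - K + 1, W2 = W1 - K + 1
void naive_conv(const std::vector<float>& input,
                const std::vector<float>& weight,
                std::vector<float>& output,
                int C1, int H1, int W1, int C2, int K) {
  const int H2 = H1 - K + 1, W2 = W1 - K + 1;
  output.assign(C2 * H2 * W2, 0.f);
  for (int c2 = 0; c2 < C2; ++c2)          // one pass per output channel
    for (int y = 0; y < H2; ++y)
      for (int x = 0; x < W2; ++x) {
        float acc = 0.f;
        for (int c1 = 0; c1 < C1; ++c1)    // sum over ALL input channels
          for (int ky = 0; ky < K; ++ky)
            for (int kx = 0; kx < K; ++kx)
              acc += weight[((c2 * C1 + c1) * K + ky) * K + kx] *
                     input[(c1 * H1 + y + ky) * W1 + x + kx];
        output[(c2 * H2 + y) * W2 + x] = acc;  // c2-th top channel
      }
}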

Graphical Anatomy

Reading source code gives us a rigorous but machine-oriented analysis; a graphical illustration, on the other hand, provides an intuitive and direct understanding. From my perspective, the best way to absorb the convolution operation is through graphical visualization.

The overall graphical illustration is given below:

[Figure: conv_operation_img — overall illustration of the convolution operation]