Convolutional Neural Networks (CNNs) have served as the workhorse for various vision tasks. But do we really understand the nuts and bolts of the image convolution operation? Let’s set aside the mathematical analysis of convolution, which has been covered thoroughly by a variety of online blogs. Instead, the question that has been haunting me is how convolution actually works, especially when the input has multiple channels and the output has a different number of channels:
For example, the input Blob is $N \times C_1 \times H_1 \times W_1$, the output Blob after convolution is $N \times C_2 \times H_2 \times W_2$. While the Blob height and width are easy to understand, how does convolution operation change $C_1$ to $C_2$ ?
I turned to Google for help but found nothing useful. Finally, I decided to read the Caffe source code, the most authoritative guide for figuring out any such problem, and dig out the answer myself.
How Does Caffe Address It Step by Step?
The core code for Caffe’s convolution operation lives in conv_layer.cpp
and base_conv_layer.cpp
(we consider only the weight $w$ and ignore the bias $b$ for conciseness):
(code listing from conv_layer.cpp)

(code listing from base_conv_layer.cpp)
From the above two code snippets, we can see the convolution pipeline: within a minibatch, each iteration takes one bottom instance and multiplies it by a weight matrix to produce the corresponding top instance. That is, the bottom blob is mapped to the top blob via a weight matrix $W$. $W$ has conv_out_channels_
rows, and it is these rows that map the bottom blob’s channels to the top blob’s channels.
Graphical Anatomy
Reading source code gives us a rigorous but machine-oriented analysis; a graphical illustration, on the other hand, provides an intuitive and direct understanding. From my perspective, the best way to absorb the convolution operation is through graphical visualization.
The overall graphical illustration is given below: