Deep diving into convolutions
- Morshed Adnan
- May 15, 2020
- 3 min read
In this blog post I am going to discuss the content of the paper named Going Deeper with Convolutions. My last post covered the general writing style of this paper; here I will explain its technical details and the associated concepts.
Required Background Knowledge: Convolutional Neural Networks (CNNs) are a category of neural networks that have proven very effective in areas such as image recognition and classification. A CNN comprises several kinds of layers, including convolutional, non-linear, pooling and fully connected layers. The convolutional and fully connected layers have learnable parameters; the pooling and non-linear layers, on the other hand, do not. The reason I describe these fundamental concepts is that the paper introduces an inception module with reduced dimensions (figure 2) built on the naïve inception module illustrated in figure 1.
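To make the distinction concrete, here is a small sketch of the parameter counts involved. The layer sizes (3*3 kernel, 64 input and 128 output channels) are made-up illustrative values, not taken from the paper:

```python
# Illustrative parameter counts for common CNN layers
# (example sizes are assumptions, not the paper's configuration).

def conv_params(k, c_in, c_out):
    """Weights (k*k*c_in per filter) plus one bias per output channel."""
    return k * k * c_in * c_out + c_out

def pool_params():
    """Max/average pooling has no learnable parameters."""
    return 0

# A 3x3 convolution mapping 64 -> 128 channels:
print(conv_params(3, 64, 128))  # 73856
print(pool_params())            # 0
```

This is why the paper's cost analysis focuses on the convolutional layers: they are where all the parameters (and most of the computation) live.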
Inception module: In general, an Inception network consists of inception modules stacked upon each other, with occasional max-pooling layers with stride 2. The inception module illustrated in figure 1 is restricted to 1*1, 3*3 and 5*5 convolutional filters. Since pooling operations have been essential to the success of current convolutional networks, the authors suggest that adding an alternative parallel pooling path at each stage should bring additional benefits. However, in this naïve model even a modest number of 5*5 convolutions can be prohibitively expensive on top of a convolutional layer with a large number of filters. The problem becomes even more pronounced once pooling units are added in parallel, because the pooling branch passes all of its input channels through, leading to an inevitable increase in the number of outputs from stage to stage. The idea of dimensionality reduction is introduced to overcome this problem.

Figure 1: Inception module, naïve version (Szegedy et al., 2014)
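The channel blow-up in the naïve module can be sketched with a few lines of arithmetic. The branch filter counts below are assumptions for illustration, not the paper's actual configuration:

```python
# Sketch of the naive module's output growth (channel counts are
# made-up illustrative values, not from the paper): the max-pool
# branch preserves every input channel, so the concatenated output
# can only grow from stage to stage.

def naive_inception_out(c_in, c1x1, c3x3, c5x5):
    # Conv branches contribute their filter counts; the pooling
    # branch passes all c_in input channels through unchanged.
    return c1x1 + c3x3 + c5x5 + c_in

c = 256
for stage in range(3):
    c = naive_inception_out(c, 128, 192, 96)
    print(stage, c)  # 0 672, 1 1088, 2 1504
```

Even with fixed filter counts per branch, the channel dimension keeps climbing, which is exactly the inefficiency the reduced-dimension module addresses.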
The inception module with reduced dimensions, shown below, uses 1*1 convolutions to compute reductions before the expensive 3*3 and 5*5 convolutions. These 1*1 convolutions are also followed by rectified linear activations, making them dual-purpose: they reduce dimensions and add non-linearity.

Figure 2: Inception module with reduced dimensions (Szegedy et al., 2014)
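A back-of-the-envelope multiply count shows why the 1*1 reduction pays off. The feature-map and channel sizes here are illustrative assumptions, not the paper's exact numbers:

```python
# Multiply counts for a 5x5 branch, with and without a 1x1 reduction
# (28x28 map, 192 -> 32 channels are assumed example sizes).

def conv_mults(h, w, k, c_in, c_out):
    """Multiplies for a k x k convolution over an h x w feature map."""
    return h * w * k * k * c_in * c_out

direct = conv_mults(28, 28, 5, 192, 32)
reduced = conv_mults(28, 28, 1, 192, 16) + conv_mults(28, 28, 5, 16, 32)
print(direct)                        # 120422400
print(reduced)                       # 12443648
print(round(direct / reduced, 1))    # 9.7
```

Squeezing 192 channels down to 16 before the 5*5 convolution cuts the multiplies by roughly an order of magnitude in this sketch, at the price of a cheap extra 1*1 layer.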
A useful aspect of this architecture is that it allows for increasing the number of units at each stage significantly without an uncontrolled blow-up in computational complexity at later stages. This is achieved by the ubiquitous use of dimensionality reduction prior to expensive convolutions with larger patch sizes.
Based on the inception module with reduced dimensions, the authors propose a new architecture called GoogLeNet for large-scale image detection and classification in the ILSVRC 2014 competition. The GoogLeNet architecture is depicted partially in figure 3. The authors also included one deeper and wider Inception network of slightly superior quality, which improved the results marginally. As we can see in the figure below, the outputs of all the layers in an inception module are concatenated and fed into the next inception module stacked on top of it.

Figure 3: GoogLeNet architecture (partial)
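The concatenation between modules is just a "depth concat": branch outputs must agree spatially, and their channel counts add up. A minimal shape-bookkeeping sketch (the branch shapes are hypothetical, not from the paper):

```python
# Minimal sketch of the depth concatenation joining an inception
# module's branches (shape bookkeeping only, no real tensors;
# the example shapes are made up for illustration).

def depth_concat(shapes):
    """Each shape is (height, width, channels); branches must match
    spatially, and their channel counts simply sum."""
    h, w = shapes[0][0], shapes[0][1]
    assert all(s[0] == h and s[1] == w for s in shapes), "spatial mismatch"
    return (h, w, sum(s[2] for s in shapes))

# Hypothetical branch outputs of one module:
print(depth_concat([(28, 28, 64), (28, 28, 128), (28, 28, 32), (28, 28, 32)]))
# (28, 28, 256)
```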
The network is 22 layers deep when counting only layers with parameters, and the overall number of layers used in its construction is about 100. The authors found that moving from fully connected layers to an average pooling layer improved top-1 accuracy by about 0.6%. GoogLeNet was trained with the DistBelief (Dean et al., n.d.) distributed machine learning system, using a modest amount of model and data parallelism. The network was trained on 1.2 million training images, with 50 thousand images for validation and another 100 thousand for testing. GoogLeNet stood first in the ILSVRC 2014 detection challenge with a mean average precision of 43.9%.
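The parameter savings from swapping the final fully connected layer for global average pooling can be sketched with a quick count. The feature-map size (7x7x1024) and class count (1000) are assumed illustrative values, and biases are ignored:

```python
# Why global average pooling saves parameters versus a final fully
# connected layer (sizes are illustrative assumptions; biases ignored).

def fc_params(h, w, c, classes):
    # Flattened h*w*c feature map fully connected to the class scores.
    return h * w * c * classes

def gap_params(c, classes):
    # Global average pooling itself is parameter-free; only the
    # final linear classifier over the c pooled channels remains.
    return c * classes

print(fc_params(7, 7, 1024, 1000))  # 50176000
print(gap_params(1024, 1000))       # 1024000
```

In this sketch the pooled variant needs roughly 2% of the parameters of the fully connected one, which is the motivation behind the design choice the authors report.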
The authors effectively explain the breakdown of their proposed architecture and demonstrate the accuracy of the model in large-scale image classification challenges. In addition, they clearly state the problem and propose a novel approach to solving it from the standpoint of computational expense. However, the authors do not discuss the scope of using this architecture on relatively small datasets; applying the same network to small datasets might increase computational costs instead. The paper was published at the CVPR 2015 conference of the Computer Vision Foundation, and since then it has received 21,243 citations.
I would recommend reading the paper in order to grasp its core ideas and to enhance your understanding of large convolutional models for image classification.
Dear Morshed
Thank you for the interesting review. It was a really good summary of the paper and I enjoyed reading it. I was impressed by how they managed to increase the size of the network without increasing the complexity of the computations. You did a great job explaining the content and the method.
All the best,
Afrooz