MobileNetV2 is a convolutional neural network architecture designed to perform efficiently on mobile devices. It features an inverted residual structure, where the residual connections occur between the bottleneck layers. The model includes an intermediate expansion layer that utilizes lightweight depthwise convolutions to filter features, providing a source of non-linearity.
MobileNetV2 was trained on the ImageNet dataset and is optimized for deployment on mobile devices and other low-power hardware. The model has 155 layers, so visualizing the full architecture takes some effort. It performs efficiently across a range of tasks, including object detection, image segmentation, and classification. The architecture is distinguished by three key features:
Depthwise separable convolutions
Narrow input and output bottlenecks between layers
Shortcut connections among bottleneck layers
A depthwise separable convolution splits a standard convolution into two specialized steps, each serving a distinct purpose:
Depthwise Convolution: A separate filter is applied to each input channel. Instead of one filter spanning all channels at once, each channel gets its own spatial filter, so the model learns per-channel spatial features independently.
Pointwise Convolution: A 1x1 convolution then combines the outputs of the depthwise step, mixing information across the different input channels to produce the final feature maps.
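The two steps can be sketched in plain NumPy. This is a minimal illustration with stride 1 and no padding; the function and variable names are my own, not from any library:

```python
import numpy as np

def depthwise_separable_conv(x, dw_filters, pw_filters):
    """Depthwise then pointwise convolution (stride 1, no padding).

    x:          input feature map, shape (H, W, C_in)
    dw_filters: one k x k filter per input channel, shape (k, k, C_in)
    pw_filters: 1x1 filters that mix channels, shape (C_in, C_out)
    """
    H, W, C_in = x.shape
    k = dw_filters.shape[0]
    H_out, W_out = H - k + 1, W - k + 1

    # Depthwise step: each channel is filtered independently by its own kernel.
    dw = np.zeros((H_out, W_out, C_in))
    for c in range(C_in):
        for i in range(H_out):
            for j in range(W_out):
                dw[i, j, c] = np.sum(x[i:i+k, j:j+k, c] * dw_filters[:, :, c])

    # Pointwise step: a 1x1 convolution mixes information across channels.
    return dw @ pw_filters  # shape (H_out, W_out, C_out)
```

With a 5x5x3 input of ones, 3x3 depthwise filters of ones, and a 3x4 pointwise matrix of ones, every output value is 9 x 3 = 27: the depthwise step sums a 3x3 patch per channel, and the pointwise step sums the three channels.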
In a traditional convolution, every filter spans the entire depth of the input tensor, which makes the operation parameter-heavy and computationally expensive. Separable convolutions provide a streamlined alternative.
By breaking the operation into the two steps described above, a depthwise separable convolution substantially reduces both the parameter count and the amount of computation: with 3x3 kernels, roughly 8 to 9 times less than a conventional convolution. The accompanying drop in accuracy is minimal, so performance is largely preserved.
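The 8-to-9-times figure follows directly from counting multiply-adds per output position. A short sketch (the helper is my own, written only to show the arithmetic):

```python
def cost_ratio(k, c_out):
    """Ratio of separable to standard convolution multiply-adds.

    Standard:  k * k * c_in * c_out      per output position
    Separable: k * k * c_in + c_in * c_out  per output position
    c_in and the spatial size cancel, leaving 1/c_out + 1/k^2.
    """
    return (k * k + c_out) / (k * k * c_out)

# With 3x3 kernels and many output channels the ratio approaches 1/9,
# i.e. roughly an 8-9x reduction in computation.
r = cost_ratio(3, 256)
print(f"separable / standard = {r:.4f}")
```

For k = 3 and 256 output channels the ratio is about 0.115, i.e. the separable version is about 8.7x cheaper.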
Inverted Residuals
MobileNetV2 introduces inverted residuals with linear bottlenecks. Each block's input and output are kept in a low-dimensional bottleneck space, while the intermediate depthwise convolution operates in a higher-dimensional expanded space; the shortcut connections link the thin bottlenecks rather than the wide intermediate layers. Because the expanded layers use cheap per-channel depthwise filters and the bottlenecks stay narrow, computational cost remains low. The inverted residual block is composed of three distinct layers:
1x1 Convolution (Expansion Layer): This layer serves to expand the input channels by a specified scaling factor, thereby increasing the dimensionality of the data.
Depthwise Convolution: In this layer, a depthwise convolution is applied independently to each expanded channel, facilitating spatial convolution operations.
1x1 Convolution (Projection Layer): This layer projects the expanded data back down to the intended number of output channels. The projection is linear, with no activation after it, which is why the block is described as having a linear bottleneck.
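To make the expansion concrete, here is a sketch that counts the weights in one inverted residual block. Biases and batch-norm parameters are omitted, and the function name is illustrative, not from any framework:

```python
def inverted_residual_params(c_in, c_out, expansion, k=3):
    """Weight count for one inverted residual block (biases/BN omitted)."""
    c_mid = c_in * expansion       # expanded channel count
    expand = c_in * c_mid          # 1x1 expansion layer
    depthwise = k * k * c_mid      # one k x k filter per expanded channel
    project = c_mid * c_out        # 1x1 linear projection layer
    return expand + depthwise + project

# e.g. a 24 -> 24 channel block with expansion factor 6:
print(inverted_residual_params(24, 24, 6))  # 8208
```

Note how the two 1x1 layers dominate the count (3456 weights each here) while the 3x3 depthwise layer contributes only 1296, which is exactly why running the spatial convolution per channel keeps the block cheap.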
MobileNet V2 Network Structure
MobileNet V2 features a streamlined architecture that includes:
Initial Convolution Layer: This is a standard convolution layer with 32 filters and a stride of 2.
Series of Inverted Residual Blocks: The network is composed of several stages, each containing a specific number of inverted residual blocks. The expansion factors, output channels, and strides vary across these stages to manage computational complexity and enhance the receptive field.
Final Convolution Layer: This includes a 1x1 convolution layer followed by a global average pooling layer.
Fully Connected Layer: This layer uses softmax activation for classification tasks.
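For reference, the stage configuration can be written down and sanity-checked in a few lines. The table values follow the MobileNetV2 paper's stage table; the variable names are my own:

```python
# Stage configuration as given in the MobileNetV2 paper:
# t = expansion factor, c = output channels, n = number of blocks, s = stride
# (only the first block in a stage uses stride s; the repeats use stride 1)
STAGES = [
    (1, 16, 1, 1),
    (6, 24, 2, 2),
    (6, 32, 3, 2),
    (6, 64, 4, 2),
    (6, 96, 3, 1),
    (6, 160, 3, 2),
    (6, 320, 1, 1),
]

total_blocks = sum(n for _, _, n, _ in STAGES)
final_stage_channels = STAGES[-1][1]

print(total_blocks)          # 17 inverted residual blocks in total
print(final_stage_channels)  # 320 channels feed the final 1x1 conv
```

The 320-channel output of the last stage is what the final 1x1 convolution widens before global average pooling and the softmax classifier.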