Character Recognition Using Pre-Trained Models and Performance Variation Based on Dataset Size: A Survey

— The Convolutional Neural Network (CNN) has proven to be the most efficient and effective mechanism for extracting features from an image, and it is used in many fields (optical character recognition, image classification, object recognition, facial recognition, etc.). In this paper, we study the character classification problem using pre-trained models based on Convolutional Neural Networks (CNN), and how their performance varies with the dataset they are given. To that end, we used five pre-trained models: VGG16, VGG19, ResNet, Xception and MobileNet. The experiments show that Xception achieved the best performance rate across all datasets, while the performance of VGG16/19 varies depending on the dataset. However, the experiments also show that ResNet achieved the worst accuracy rate compared to the other methods.


I. INTRODUCTION
Classification is an important process in any automatic image classification system [1]. Image classification has become a challenging problem in Artificial Intelligence, as has the field of image processing [2], which is one of the key research objectives; this is mainly due to the diversity of techniques and methods used in this context, whether for classification or for object recognition [3].
Character classification [4] is the cornerstone of extracting the information we want from an image by converting it into a typed, machine-readable form, and it has had a major impact in areas seeking to move from written documents to digitized ones.
Nowadays, the Deep Learning algorithms [5] used in image classification show good results and high predictive accuracy. Many novel image classification algorithms have been developed as the field steps into a new era thanks to the sustained progress of machine learning, such as k-Nearest Neighbors, Support Vector Machines and Convolutional Neural Networks (CNN) [6].
The Convolutional Neural Network (CNN) has been recognized as the most powerful and effective mechanism for feature extraction, but traditional classifiers connected to a CNN do not fully exploit the extracted features. We therefore present a comparison of proposed solutions to the image classification problem using CNNs.
In this research paper we discuss the use of Deep Learning techniques, especially Convolutional Neural Networks (CNN), and focus on applying pre-trained [7] models from the Keras library [8], a deep learning API written in Python that runs on top of TensorFlow [9] and was developed to make experimentation fast and smooth.
Our goal is to apply Transfer Learning [10] to three different datasets: the first has 26 classes (letters a to z in lowercase), while the second and third have 62 classes (letters a to z in lowercase and uppercase, and digits 0 to 9); they have different training, testing and evaluation parameters. For this purpose, we selected five models from the Keras library (VGG16/19 [14], Xception [15], MobileNet [16] and ResNet [17]) and trained them on the three datasets. We obtained mixed results, between 80% and 100% for the VGG, Xception and ResNet models, except for the MobileNet model, whose results were poorer than we expected.
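As an illustration of this setup, the following is a minimal transfer-learning sketch using one of the Keras applications; the classification head, optimizer and loss are assumptions for illustration, not the exact configuration used in the experiments.

```python
# Minimal transfer-learning sketch: load a pre-trained backbone without
# its ImageNet head and attach a new classifier for the 26-class dataset.
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

num_classes = 26                      # dataset 1: lowercase letters a-z
base = VGG16(weights="imagenet", include_top=False,
             input_shape=(32, 32, 3))
base.trainable = False                # keep the pre-trained features frozen

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),     # illustrative head size
    layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",  # integer class labels
              metrics=["accuracy"])
```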

II. RELATED WORK
Convolutional Neural Networks (CNN) automatically extract features from an image by stacking many different layers [20] that generate a feature hierarchy. The shallower, earlier convolutional layers use a smaller receptive field, which allows learning of local features of the image, while the deeper, later convolutional layers use a larger receptive field and can learn more abstract features (such as object size, position and orientation).
A new CapsNet-based algorithm [21], [22] has recently been proposed, providing viable ideas for further refinement of the results. We believe CapsNet has the potential to achieve better performance with some changes to its hyperparameters.
In their work [14], the Visual Geometry Group studied the effect of convolutional network depth on accuracy in the large-scale image recognition setting. Their main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolutional filters, showing that a significant improvement over prior-art configurations can be obtained by pushing the depth to 16-19 weight layers.
In the Xception paper [15], the authors showed that regular convolution and depthwise separable convolution lie at either end of a discrete spectrum, with the Inception module representing an intermediate point in between. These observations led them to propose replacing Inception modules with depthwise separable convolutions in neural computer vision architectures. They presented a new architecture based on this idea, called Xception, which has a similar number of parameters to Inception V3 [11]; Xception shows small gains in classification performance on the ImageNet [12] dataset and large gains on the JFT dataset [13].
Google Inc. in [16] introduced an efficient class of models called MobileNets for mobile and embedded vision applications, based on a streamlined architecture that uses depthwise separable convolutions to build lightweight deep neural networks. They introduced two global hyperparameters that efficiently trade off latency and accuracy.

III. METHODS:
A. Networks:
1) VGG16 and VGG19: The number 16 in VGG16 refers to the 16 weight layers of the deep neural network (VGGNet). VGG16 is a large network with over 138 million parameters; even by today's standards it is massive. On the other hand, the simplicity of the VGGNet-16 architecture is what makes the network appealing: its entire architecture can be grasped at a glance.
A few convolution layers are followed by a pooling layer that decreases the height and width of the image. As for the number of filters, the first blocks use roughly 64, which doubles to 128 and subsequently to 256, while the final layers use 512 filters. Figure I shows the VGG16 architecture.
The VGG19 model (also known as VGGNet-19) is similar to VGG16 except that it has 19 layers; the numbers "16" and "19" refer to the models' weight layers. VGG19 contains three more convolutional layers than VGG16.
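As a quick check of the two variants' size, the stock Keras applications can be instantiated and their parameters counted (a sketch, assuming the standard Keras models with their original classification heads):

```python
# Instantiate the two VGG variants and compare their parameter counts.
from tensorflow.keras.applications import VGG16, VGG19

vgg16 = VGG16(weights=None)   # ~138M parameters, 16 weight layers
vgg19 = VGG19(weights=None)   # ~144M parameters, 19 weight layers
print(vgg16.count_params(), vgg19.count_params())
```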

2) MobileNet:
MobileNets are built from depthwise separable convolution layers; each depthwise separable convolution layer consists of a depthwise convolution followed by a pointwise convolution. Counting depthwise and pointwise convolutions separately, MobileNet contains 28 layers. A standard MobileNet has 4.2 million parameters, a number that can be reduced further by adjusting the width multiplier hyperparameter. Figure II details the MobileNet architecture.
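The following is a rough sketch of one such depthwise-separable block in Keras, with batch normalization and ReLU after each convolution; the kernel and filter sizes are illustrative assumptions.

```python
# One depthwise-separable block: a depthwise convolution followed by a
# 1x1 pointwise convolution, each with batch normalization and ReLU.
from tensorflow.keras import layers

def depthwise_separable_block(x, pointwise_filters, strides=1):
    x = layers.DepthwiseConv2D(3, strides=strides, padding="same",
                               use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.Conv2D(pointwise_filters, 1, padding="same",
                      use_bias=False)(x)   # pointwise (1x1) convolution
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)
```

In the Keras application the width multiplier is exposed as the `alpha` argument, so for example `MobileNet(alpha=0.5)` builds a thinner network with fewer parameters.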

3) ResNet:
The essential innovation of ResNet, short for Residual Network, is that it allows extraordinarily deep neural networks with 150+ layers to be trained effectively. The ResNet-50 model is divided into five stages, each with its own convolution and identity blocks; each convolution block has three convolution layers, as does each identity block. ResNet-50 has around 23 million trainable parameters, and Figure III shows its architecture.
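As a rough illustration of the identity block described here, the following Keras functional-API sketch builds one such block; the filter sizes are illustrative, and the last filter count must match the number of channels of the block input for the skip addition to be valid.

```python
# A ResNet identity block: three convolutions whose output is added back
# to the block input through a skip connection.
from tensorflow.keras import layers

def identity_block(x, filters):
    f1, f2, f3 = filters          # f3 must equal the input channel count
    shortcut = x
    x = layers.Conv2D(f1, 1, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.Conv2D(f2, 3, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.Conv2D(f3, 1, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Add()([x, shortcut])   # residual (skip) connection
    return layers.ReLU()(x)
```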

4) Xception:
The name Xception stands for "extreme Inception". It outperforms Inception-v3 by using a modified depthwise separable convolution, known as SeparableConv. SeparableConv layers take the place of Inception modules throughout the architecture, and all flows have residual (or shortcut/skip) connections, which were first proposed in ResNet, as shown in Figure IV.
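The following is a rough sketch of an Xception-style block built from SeparableConv layers with a residual connection, loosely following the entry-flow blocks of the architecture; the filter count and the strided 1x1 convolution on the shortcut branch are illustrative assumptions.

```python
# An Xception-style block: two SeparableConv2D layers and a max-pooling
# step, with a strided 1x1 convolution providing the residual shortcut.
from tensorflow.keras import layers

def xception_block(x, filters):
    shortcut = layers.Conv2D(filters, 1, strides=2, padding="same")(x)
    x = layers.SeparableConv2D(filters, 3, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.SeparableConv2D(filters, 3, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.MaxPooling2D(3, strides=2, padding="same")(x)
    return layers.Add()([x, shortcut])   # residual (skip) connection
```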

B. Data augmentation:
We used the OpenCV library [23] to resize the images in our datasets, and the Keras pre-processing utilities to prepare and process the images in our system. In addition, we normalized the pixel values (dividing by 255 as float32) to preserve the full information of each image and remove distortions.
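A minimal sketch of this pre-processing, assuming OpenCV for resizing and a simple division by 255 for normalization (the function name and default size are illustrative):

```python
# Resize an image with OpenCV and normalize pixel values to [0, 1].
import cv2
import numpy as np

def preprocess(path, size=32):
    img = cv2.imread(path)                  # BGR, uint8
    img = cv2.resize(img, (size, size))     # e.g. 32x32 (71x71 for Xception)
    return img.astype(np.float32) / 255.0   # normalization
```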

C. Hyperparameters:
We split each dataset into 80% for training and 20% for testing, and the validation data is obtained from a further 80/20 split of the training portion; Table I shows the details of the datasets.
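A minimal sketch of these splits, assuming scikit-learn's train_test_split (the splitting tool is not named in the paper) and dummy data for illustration:

```python
# 80/20 train/test split, then a further 80/20 split of the training
# portion to obtain the validation set.
import numpy as np
from sklearn.model_selection import train_test_split

# X: pre-processed images, y: integer class labels (dummy data here)
X = np.random.rand(1000, 32, 32, 3).astype(np.float32)
y = np.random.randint(0, 26, size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.20, stratify=y_train, random_state=42)
```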

IV. EXPERIMENTS:
D. Datasets:
The first dataset has only alphabetic characters in lowercase, with 26 classes; the second dataset has alphabetic characters in lowercase and uppercase plus the digits 0 to 9, with 62 classes; and the third dataset likewise has alphabetic characters and digits, with 62 classes. All datasets contain images with a white background and a black character in the middle of the frame. For all models we resize the images to 32x32 as the input shape, except for the Xception model, where we resize the images to 71x71 because it only accepts input shapes of 71x71 or larger (a sketch of the full training and evaluation procedure is given at the end of this section). Figure V shows a sample of the datasets used.

E. Results on dataset 1:
In this section we examine the results of all models on dataset 1; for clarity and ease of reading, the results are grouped in a graph, and all models were trained for 12 epochs. In Figure VI, Xception, VGG16 and VGG19 show high training and test accuracy, becoming more stable the longer they train, with slight overfitting [21], whereas MobileNet and ResNet show poor results, with a wobbling validation (test) accuracy. Table II presents the evaluation scores of the models, showing that VGG19 and Xception achieve the best results in comparison with the other models.

F. Results on dataset 2:
Figure VIII shows how the training accuracy and validation (test) accuracy changed over the 12 epochs. Xception, VGG16 and VGG19 show high training and test accuracy, although Xception's validation accuracy is lower than on dataset 1; they become more stable the longer they train, with slight overfitting, whereas MobileNet and ResNet show unsatisfying results, with low validation (test) accuracy.

G. Results on dataset 3:
Figure X shows how the training accuracy and validation (test) accuracy changed over the 12 epochs. Xception, VGG16 and VGG19 show high training and test accuracy, although Xception's validation accuracy is slightly lower than on dataset 1; the overfitting has shrunk compared to the results obtained in the previous sections, and the curves become more stable the longer they train. Figure XI shows how the training loss and validation (test) loss changed over the 12 epochs. Xception, VGG16 and VGG19 show low training and test loss, although Xception's validation loss is slightly higher than on dataset 1; again the overfitting has shrunk compared to the previous sections, and the curves stabilize the longer they train. We also note that ResNet achieves a medium loss, while MobileNet shows a bad result on both the training and validation (test) loss.
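To summarize the experimental procedure of this section, the following is a minimal sketch of the training and evaluation loop written against the Keras applications API; the classification head, optimizer and loss are illustrative assumptions rather than the exact configuration used here, and the data arrays and labels are assumed to come from the splitting sketch in the Hyperparameters subsection.

```python
# Train each frozen backbone with the same small classification head for
# 12 epochs and evaluate it on the held-out test split.
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import (VGG16, VGG19, Xception,
                                           MobileNet, ResNet50)

num_classes = 26          # dataset 1; 62 for datasets 2 and 3
backbones = {
    "VGG16": (VGG16, 32), "VGG19": (VGG19, 32),
    "MobileNet": (MobileNet, 32), "ResNet50": (ResNet50, 32),
    "Xception": (Xception, 71),   # Xception needs 71x71 inputs or larger
}

scores = {}
for name, (ctor, size) in backbones.items():
    # Resize the image arrays to the input size expected by this backbone.
    Xtr = tf.image.resize(X_train, (size, size)).numpy()
    Xva = tf.image.resize(X_val, (size, size)).numpy()
    Xte = tf.image.resize(X_test, (size, size)).numpy()

    base = ctor(weights="imagenet", include_top=False,
                input_shape=(size, size, 3))
    base.trainable = False
    model = models.Sequential([
        base,
        layers.Flatten(),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(Xtr, y_train, validation_data=(Xva, y_val), epochs=12)
    scores[name] = model.evaluate(Xte, y_test)
```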

H. Results:
To conclude the experiments section, we obtained varying results depending on the dataset size and the model used. We note that some models may perform better on a small dataset while their performance on larger ones is low or medium (for example, VGG19); some models achieve high performance only on the large dataset (for example, VGG16); and others perform well on any dataset size (for example, Xception). This variation may be due to the architecture of the models or to the shape of the images in the dataset. Modifying the layers of the models using the Fine-Tuning technique can improve their performance.
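As a rough sketch of such fine-tuning, assuming the `base` and `model` objects and the data splits from the earlier sketches, one can unfreeze the last few backbone layers and continue training with a much smaller learning rate so the pre-trained features are not destroyed; the number of unfrozen layers, learning rate and epoch count below are illustrative.

```python
# Fine-tuning: unfreeze the last few layers of the backbone and keep
# training with a reduced learning rate.
import tensorflow as tf

base.trainable = True
for layer in base.layers[:-4]:        # keep all but the last 4 layers frozen
    layer.trainable = False

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=5)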
V. CONCLUSION:
As presented in this paper, we compared different pre-trained models from the Keras library; the results of Transfer Learning differ from one dataset to another, depending on the dataset size. The efficiency of a model on certain datasets does not mean that it is efficient on all datasets.
Upcoming research should focus on the model architecture, so that it can adapt to any kind of dataset using Fine-Tuning [18].
In future work, we plan to improve the character classification results by combining several methods and semantic information.