Human Activities Detection using Deep Learning Technique-YOLOv8

Abstract. Wearing a mask during the pandemic has at times been both crucial and difficult. Universal mask use can greatly lower, and possibly even stop, the spread of viruses within communities. Mask detection has therefore become a critical task for security agencies in buildings, government offices, and other public places. With the advent of GPUs, high-performance computing machines, and Deep Convolutional Neural Networks (DCNN), automatic face and mask detection is possible by exploiting the image-processing ability to extract three-dimensional shape cues from two-dimensional images. This paper discusses the YOLOv8 model and confirms its overall applicability on two datasets, namely FDDB and MASK. This helps to examine the behaviour of features learned from the Mask dataset, which is intended for COVID-19 mask detection alone; Mask is the main dataset in this experiment. In addition, the ImageNet dataset is used for pretraining, and the FDDB (Face Detection Data Set and Benchmark) dataset is used for recognizing human faces. The precision of the model is 58.9% on FDDB and 66.5% on the MASK dataset.


Introduction
The inception of computer vision as a field was made possible by a better understanding of how biological neurons in the visual cortex of the brain work. This understanding revealed that visual input processing begins with the recognition of basic shapes, such as edges, before progressing to more complex structures. The Neocognitron was invented by Kunihiko Fukushima in 1980 [4]. Yann LeCun applied the backpropagation algorithm to a Neocognitron-style architecture in 1989, leading to the first practical convolutional neural networks (the LeNet family). Later, Krizhevsky et al. introduced AlexNet in 2012, an 8-layer convolutional neural network with non-saturating ReLU activations [5]. The network won the 2012 ImageNet Large Scale Visual Recognition Challenge [6], achieving a top-5 error of 15.3%, and the AlexNet architecture came to be considered one of the most influential neural network architectures.
In computer vision, face detection concerns locating human faces in images. Several approaches are discussed below. While deep-learning face detectors provide substantially better accuracy and robustness, OpenCV's Haar cascades continue to serve a useful purpose: they are lightweight and very fast, running even on low-resource devices, with a model size of only around 930 KB. On the other hand, Haar cascades have a number of flaws, including being more prone to false-positive detections and less reliable than their HOG + Linear SVM, SSD, YOLO, and other equivalents.
The most popular classical algorithm was first presented by Viola and Jones [7]. The method extracts rectangular regions from the supplied image. Each region is then run through a cascade of weak classifiers that scan for simple Haar-like features. A Haar-like feature is the difference between the sums of pixel intensities in several adjacent rectangular areas. If the region successfully passes all levels of the cascade, it is deemed to contain a face; otherwise, it is rejected. Repeating this process over rectangular regions of varying sizes yields the detections. The classifier is trained using AdaBoost. The fundamental advantage of the method is that, since integral images are employed, the time needed to evaluate a Haar-like feature is constant regardless of its size. An integral image, also known as a summed-area table, is a data structure in which each cell holds the sum of all pixel values above and to its left; it can be computed in a single pass over the image. Many well-known computer vision libraries, including OpenCV, provide Haar cascade face detection.
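A minimal sketch of the integral image and a two-rectangle Haar-like feature illustrates why evaluation is constant-time (the function names here are illustrative, not OpenCV's API):

```python
import numpy as np

def integral_image(img):
    """Summed-area table: each cell holds the sum of all pixels above and to its left."""
    return img.cumsum(axis=0).cumsum(axis=1)

def region_sum(ii, r0, c0, r1, c1):
    """Sum of pixels in the rectangle [r0, r1) x [c0, c1), in O(1) via four lookups."""
    total = ii[r1 - 1, c1 - 1]
    if r0 > 0:
        total -= ii[r0 - 1, c1 - 1]
    if c0 > 0:
        total -= ii[r1 - 1, c0 - 1]
    if r0 > 0 and c0 > 0:
        total += ii[r0 - 1, c0 - 1]
    return total

def haar_two_rect(ii, r0, c0, h, w):
    """A simple two-rectangle Haar-like feature: left half minus right half."""
    left = region_sum(ii, r0, c0, r0 + h, c0 + w // 2)
    right = region_sum(ii, r0, c0 + w // 2, r0 + h, c0 + w)
    return left - right
```

Whatever the size of the rectangle, `region_sum` always performs at most four table lookups, which is the property the Viola-Jones cascade exploits.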
An alternative approach to this task is a feature descriptor called the histogram of oriented gradients (HOG), first proposed by the authors of [8] for detection tasks. The algorithm computes the gradient of pixel intensities at each pixel in the image. The image is then divided into small cells, and each cell produces a histogram of its gradient orientations, so that the most prominent gradients are retained. Finally, the resulting HOG descriptor is classified using a Support Vector Machine (SVM). The library Dlib provides face detection based on HOG.
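As a rough sketch (not Dlib's implementation), the per-cell orientation histogram at the heart of HOG can be computed as follows; the bin count of 9 is the conventional choice, and the cell size is illustrative:

```python
import numpy as np

def hog_cell_histogram(cell, n_bins=9):
    """Gradient-orientation histogram for one cell, the building block of HOG."""
    gy, gx = np.gradient(cell.astype(float))   # per-pixel intensity gradients
    magnitude = np.hypot(gx, gy)
    # Unsigned orientation in [0, 180) degrees, as in the original HOG descriptor.
    orientation = np.degrees(np.arctan2(gy, gx)) % 180.0
    bins = np.minimum((orientation / (180.0 / n_bins)).astype(int), n_bins - 1)
    hist = np.zeros(n_bins)
    np.add.at(hist, bins.ravel(), magnitude.ravel())  # magnitude-weighted voting
    return hist

# A horizontal intensity ramp has purely horizontal gradients,
# so all of the histogram mass lands in the first (0-degree) bin:
cell = np.tile(np.arange(8.0), (8, 1))
hist = hog_cell_histogram(cell)
```

In the full descriptor these per-cell histograms are normalised over overlapping blocks and concatenated before being passed to the SVM.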
Multi-Task Cascaded Convolutional Neural Networks (MTCNN), a deep learning technique for face detection, was first presented in [9]. The technique recognises the face and facial landmarks in a single pass. It makes use of a cascade of convolutional neural networks in which an image pyramid is first created by resizing the provided image. An image pyramid is a representation of an image at various scales; it enables objects to be detected at different sizes, and is frequently paired with a sliding window that can localize objects at different positions. Compared to other facial landmark identification systems, MTCNN locates only five landmarks: the two eyes, the tip of the nose, and the corners of the mouth.
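As an illustrative sketch (not MTCNN's actual resizing, which uses proper interpolation), an image pyramid can be built by repeated downscaling until the image falls below the detector's minimum input size:

```python
import numpy as np

def image_pyramid(img, scale=0.5, min_size=16):
    """Repeatedly downscale an image (nearest-neighbour for simplicity),
    yielding each level -- the multi-scale input a cascade detector scans."""
    levels = [img]
    while True:
        h, w = levels[-1].shape[:2]
        nh, nw = int(h * scale), int(w * scale)
        if nh < min_size or nw < min_size:
            break
        rows = (np.arange(nh) / scale).astype(int)
        cols = (np.arange(nw) / scale).astype(int)
        levels.append(levels[-1][rows][:, cols])
    return levels
```

A detector with a fixed receptive field, run over every level, effectively sees faces of many different sizes in the original image.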
In R-CNN-style detectors, a region-centred detection method is used: a region-proposal component generates many promising regions, each expected to contain a single object, which is then followed by a CNN classifier. The main idea behind this model was to turn the multiple-object detection problem into a set of single-object classification problems. Because the region-proposal methods are slow, the classification stage is delayed. The major drawback of this approach is that accuracy and speed cannot be achieved simultaneously in time-critical situations [26].

Proposed Architecture Based on MobileNet and YOLOv8
In this paper, the foundation is based on the MobileNets neural network architecture [22]. This decision was made because the architecture is suitable for software that needs to balance processing speed and accuracy on embedded or mobile platforms. According to MobileNets' creators, a convolutional layer can be divided into "depthwise" and "pointwise" operations while still keeping a significant portion of the network's representational power. Due to this division, 3 x 3 convolutions require far fewer operations and parameters. In the MobileNets architecture, the linear bottleneck layers are built from separable convolutions, each followed by a further pointwise layer with linear activation to produce a "bottleneck". This bottleneck expands the feature maps before reducing them, spreading the input into a higher-dimensional space so as to exploit the nonlinear power of ReLU activation without sacrificing information. To improve backpropagation and speed up computation-graph execution, the MobileNets authors also inserted residual connections from earlier work. Since each residual block's input and output tensors are smaller than the expanded tensors processed between the bottlenecks, these skip connections permit an execution order in which memory usage is determined primarily by the smaller bottleneck tensors.
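The parameter saving of the depthwise/pointwise split can be verified with simple arithmetic: a standard k x k convolution from C_in to C_out channels costs k^2 * C_in * C_out weights, while the separable version costs k^2 * C_in (depthwise) plus C_in * C_out (pointwise). A minimal sketch, with an illustrative 256-channel layer:

```python
def standard_conv_params(k, c_in, c_out):
    """Parameters in a standard k x k convolution (ignoring bias)."""
    return k * k * c_in * c_out

def separable_conv_params(k, c_in, c_out):
    """Depthwise (k x k per input channel) plus pointwise (1 x 1) convolution."""
    return k * k * c_in + c_in * c_out

# A typical 3x3 layer with 256 -> 256 channels:
std = standard_conv_params(3, 256, 256)    # 589,824 parameters
sep = separable_conv_params(3, 256, 256)   # 67,840 parameters
print(f"separable uses {sep / std:.1%} of the parameters")  # prints "separable uses 11.5% of the parameters"
```

For 3 x 3 kernels the saving approaches a factor of nine as the channel counts grow, which is what makes the architecture attractive on mobile hardware.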
Epochs: the model was trained for 100 epochs to observe performance across iterations.
One crucial feature of YOLOv8 is its extensibility. It is designed as a framework compatible with all earlier iterations of YOLO, making it simple to move between versions and evaluate their effectiveness. For those who want to take advantage of the most recent YOLO technology while maintaining the functionality of their existing YOLO models, YOLOv8 is therefore the ideal choice. The same model is then used to conduct an inclusive investigation over the Mask dataset. The conclusion acknowledges that YOLOv8 is an outstanding model for detecting objects, human activity, or masks.

A. Parameter Setting For Neural Network
The network was inspired by GoogLeNet and has 24 convolutional layers followed by two fully connected (FC) layers. The system can take photos of any size as input and reshapes them to 448 x 448 prior to sending them through the network. The input images are divided into a 7 x 7 grid. Each grid cell predicts three bounding boxes, each with four box coordinates and one confidence score, plus thirty class-specific scores. Thus, the result is a 7 x 7 x (3 x 5 + 30) tensor.
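Taking the numbers above at face value, the output tensor size can be checked with a few lines of arithmetic:

```python
# Grid size, boxes per cell, and class scores per cell, as stated in the text.
S, B, C = 7, 3, 30
per_cell = B * 5 + C          # each box: 4 coordinates + 1 confidence score
output_shape = (S, S, per_cell)
print(output_shape, S * S * per_cell)   # prints "(7, 7, 45) 2205"
```

So each forward pass emits 2,205 numbers, from which the final detections are decoded by thresholding confidences and applying non-maximum suppression.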

B. Cost Function
A cost function indicates how well a neural network is learning. During the object-recognition task, the network predicts bounding boxes together with class-specific scores. The overall cost, Cost_total, combines these terms, where S is the grid size, N is the number of bounding boxes predicted per grid cell, and C is the number of classes. The no-object scale, object scale, and class scale are variables used to weight the different cost terms. P_object,n is the likelihood that an object is present in box n of grid cell m. According to the target output, isObj indicates whether an object is present in the current box. P_object,bestpredict is the P_object of the prediction with the best IOU among the n boxes. IOU_bestpredict is the intersection-over-union of the ground-truth bounding box with the best prediction in grid cell m. P_class,i is the predicted probability of class i, and P_truth,i is the ground-truth probability of class i in grid cell m, which is either 0 or 1.
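The equation itself appears to have been lost from the text. A plausible reconstruction from the symbols defined above, following the standard YOLO-style formulation (and omitting the coordinate-regression terms for brevity), would be:

```latex
\begin{aligned}
Cost_{total} ={}& \lambda_{object} \sum_{m=1}^{S^2} isObj_m \,
      \bigl(P_{object,bestpredict} - IOU_{bestpredict}\bigr)^2 \\
  &+ \lambda_{noobject} \sum_{m=1}^{S^2} \sum_{n=1}^{N}
      (1 - isObj_{m,n})\, P_{object,n}^{\,2} \\
  &+ \lambda_{class} \sum_{m=1}^{S^2} isObj_m
      \sum_{i=1}^{C} \bigl(P_{class,i} - P_{truth,i}\bigr)^2
\end{aligned}
```

Here λ_object, λ_noobject, and λ_class are the object scale, no-object scale, and class scale mentioned above; only the best-IOU box in each occupied cell contributes to the object term.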

C. Evaluation Criteria
To validate and confirm the object-detection results, two criteria are used, viz. overall precision and overall recall. For direction estimation, direction accuracy is used. The schematic diagram in Figure 1 illustrates overall precision and overall recall.

Fig. 1. Precision and Recall [15]
The calculation of overall precision and overall recall is specified in the equation below. Overall precision is defined as the fraction of predicted objects that are relevant objects, while overall recall is defined as the fraction of relevant objects that are predicted.

F. ImageNet Dataset
This is one of the largest datasets, with around 1,000 classes and about 1.4 million images in total. All of these images are used for pretraining, i.e. before the model is trained on the target task.
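A minimal sketch of these two definitions in terms of detection counts (the example numbers are illustrative):

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Overall precision: share of predictions that are correct.
       Overall recall: share of ground-truth objects that were found."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall

# e.g. 40 correct detections, 10 spurious boxes, 20 missed faces:
p, r = precision_recall(40, 10, 20)
print(f"precision={p:.2f}, recall={r:.2f}")   # prints "precision=0.80, recall=0.67"
```

A detector that predicts very few, very confident boxes drives precision up at the expense of recall, which is why the two are always reported together.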

G. Face Dataset [FDDB]
This dataset consists of one and only one class, known as "face". A total of 5,171 faces appear across its 2,845 images. The dataset size is around 523 MB.

Results Analysis
A. Dataset FDDB

Conclusion
In this paper, the YOLOv8 model was trained on several datasets, namely FDDB and ImageNet. The experimental results show that YOLOv8 performed extremely well on the FDDB dataset compared to the benchmark.
Eventually, this work leads to the conclusion that YOLOv8 achieved 84.6% accuracy with 66.5% recall at 0.030 s per picture when the model was applied to the Mask benchmark. These results are encouraging and highlight that the YOLO model is a better tool for identifying the different objects needed in the field of self-driving vehicles.
Looking at the exponential rise from version 5 to version 7, it is clear that training time was a major problem; version 8, however, takes less time to deliver results with a greater mean average precision, so the problem of lengthy training is partially addressed here. YOLOv8 balances precision against training cost more successfully. A newly developed network backbone, an anchor-free detection head, and a novel loss (cost) function have made it much faster.