Motion Prediction for Autonomous Vehicle using Deep Learning Architecture and Transfer Learning

Abstract. Autonomous Vehicle (AV) technology has become one of the more futuristic approaches in the automotive industry, since it aims to enhance driving safety, driving comfort, and economy while reducing the rate of obstacle collisions and traffic accidents. Motion planning plays a vital part in autonomous driving, serving as a fundamental building block that enables the AV to move. Although there are several traditional approaches to motion planning, challenges remain in guaranteeing performance and safety under all driving circumstances. With the impressive advancement of deep learning technologies, many researchers have developed end-to-end motion planning approaches that employ Deep Neural Networks (DNNs) to directly map raw sensor data (e.g., point clouds and images) to planned trajectories (e.g., yaw velocity and steering angle). However, accurate motion prediction remains critical for autonomous driving, and transfer learning, which emphasizes reusing a trained model across applications, can steadily improve its accuracy. With this in mind, a transfer learning based motion prediction approach for autonomous vehicles is proposed in this work.


Introduction
Self-driving cars are also known as Autonomous Vehicles (AVs): vehicles that can drive automatically without the external influence (manual override) of a human being. Building such vehicles is very much prevalent in the 21st century, as anything that can be thought of is in the process of being automated. Automating vehicles comes with a notable number of advantages. The main advantage considered when implementing automation in the workforce or daily life is that it increases productivity, thereby saving time. According to the United States Department of Transportation, it is also predicted that with more AVs on the road, road accidents would be reduced by 90%. Automated driving can also improve fuel efficiency, thereby contributing to a more sustainable environment that uses less fossil fuel overall.

Autonomy is the state of being able to function by itself; in terms of an AV, it refers to the vehicle's ability to operate on its own. There are five levels of autonomy for AVs [7]. Level 1 includes a single automated function, such as simple cruise control. Level 2 automates acceleration, deceleration, and steering. Level 3 automates all safety functions except the manual override necessary during emergency situations. Levels 4 and 5 describe fully autonomous vehicles; the difference is that a Level 4 vehicle can handle everything within a specific set of driving scenarios, such as highways, whereas a Level 5 vehicle can handle any scenario under all conditions.

While designing an AV, it is important to consider all the factors a real-time driver would experience and what they would do in such situations. The overview of how these vehicles work can be mapped to every idea and rule a person is taught while learning to drive on any terrain. A plethora of factors that contribute to the design of an AV are modelled on a real person driving a car manually. The following are some of the points that must be considered while a person drives a car manually, and that an AV should therefore be able to carry out automatically.
• A person must be mindful of their surroundings. It is recommended to look around carefully before deciding to move the car.
• They should be aware of the rules of the road for a seamless driving experience. Recognizing and following such rules and signs while maneuvering the vehicle is crucial.
• They should be able to effectively communicate their decision to maneuver the car in a certain way before doing it (signaling to other drivers).
• They should be able to steer the vehicle safely to the required destination.
An AV must be able to complete all these tasks instantly with minimal error. And so, a basic framework is needed to build the car keeping all the above points, and more, in mind.
The modules/components that build an AV are as follows: Perception, Planning, Control and Redirection.
As a human needs eyes and ears to sense the surroundings before moving a car, an AV requires similar sensory organs to do the same thing. From a broad study, it is observed that the sensors most used for AV modelling are radar, LiDAR, GPS, and cameras. Cameras act as the eyes of the AV: they help the car sense its surroundings to get an idea of what objects (cars, trucks, people) are in its environment. However, cameras alone are not sufficient, as they cannot sense the depth of the imaged scene. LiDAR helps the AV sense the depth of objects in its vicinity. Depth calculation is quite important, as it helps the car compute the distance to each object and avoid crashes. Cameras also perform poorly in harsh weather, and that is where radars come in. Radars are among the most widely used sensors for object tracking (e.g., tracking vehicles) and they perform well in harsh weather, but their resolution is limited. A Global Positioning System (GPS) is used to determine the latitude, longitude, and altitude of the vehicle; however, this device is only useful for long-range maneuvering, and it is used to plot a suitable course to the destination the car needs to reach.
In simple terms, sensors help an AV gather the data it needs to make calculated decisions on the road. Coming to "perception", this deals with understanding what the AV perceives as its surroundings. In short, this part of the architecture collects the information from all the sensors mentioned above and compiles it into information the AV can understand. This process is called sensor fusion.
Perception is followed by two key strategies: localization and detection. Localization uses the data received from the GPS to let the AV know its precise location. Detection uses the information from the camera, LiDAR, and radar to detect what is in the surroundings: crucial objects on the road such as other cars, trucks, buses, and bikes, as well as road signs and lane markings. In short, detection helps the AV determine what types of objects are in its surroundings. This segment of the AV architecture helps the AV answer crucial questions like "Where am I?" and "What is around me?". Once the AV has gathered all the information it needs from the sensors, it requires a plan to be carried out, that is, plans that fix the course of action the AV will take under specific conditions. This segment of the architecture is made up of four sub-segments: Route Planning, Prediction, Behavior Planning, and Trajectory Planning [11]. The route planning process deals with plotting the most favorable course, that is, which highway or road to take to avoid traffic. This process is vital, as it makes the whole non-driving process even more seamless.
Prediction deals with predicting what other objects in the AV's path or vicinity will do. To elaborate, assume the AV is cruising on a highway with a truck right in front of it, a person suddenly attempting to cross the road, and many cars behind. There are millions of possibilities, and some of the outcomes could turn out badly. So it is critical to analyze the different possibilities (the probability of each event happening) and have the AV plan its route and response accordingly. With the help of machine learning and artificial intelligence, the AV is empowered to identify objects correctly so that it does not collide with them. As a particular use case, if an AV misidentifies a human being as a lamp post, it will not anticipate that the object might move, thereby causing a crash. In such cases, the more fleshed out the ML model, trained on many examples, conditions, and scenarios, the better the prediction system works. Prediction systems are therefore crucial to an AV.
Behavior prediction might sound the same as prediction but is vastly different. While "prediction" uses probability to build ML models based on previous events such as crashes, cutting in, and overtaking, predicting the movement of other vehicles, this segment takes care of the behavior of the AV itself. In short, behavior prediction is a vital tool that helps an AV calculate its own behavior in scenarios that occur in real time on the road.
Trajectory planning takes the information from the segments above and uses it to plan a trajectory that accelerates and slows down at a comfortable rate, braking gradually, and so on. It essentially accounts for the comfort of the passenger to make the whole journey pleasant. Prediction assists the AV in considering the various ways traffic could move around it, thereby helping it plan its maneuvers in the safest and most efficient way possible.
Finally, with all the required information gathered and the decisions made, all that is left is actually maneuvering the AV. The module responsible for maneuvering is the control segment. It is dedicated to sending the accumulated predictive and decision-making data to the steering wheel, accelerator, brakes, clutch, and gears to make the AV drive on roads. To carry out the task of controlling the car, controllers like the PID controller and the Model Predictive Controller are used. The four pillars of building an AV have been discussed in brief. Without these four segments, building an AV would be impossible; each is essential to make the AV work as intended.
This research work focuses on the Motion Prediction module of the AV architecture. Motion prediction is vital to an AV, as it is the part that enables the AV to reason about all the different ways an object in the environment could behave. The ability of a car to navigate safely through traffic is made possible by a robust motion prediction system, which should be able to predict all the outcomes a normal human driver could think of, and more. However, motion prediction is plagued by a resource issue: multiple models are needed to power the prediction system in the AV. Fortunately, with the help of deep learning models and neural networks, connecting the dots in the required data has become much more effective and reliable [14], and the entire process is streamlined with respect to utilizing more models of data.
This paper aims to build a novel motion prediction model that provides better predictions. The motion prediction model is crucial for the smooth functioning of an AV, so the entire pipeline must be tightly integrated, efficient, and powerful. This research work therefore presents an efficient deep learning model intended to improve the prediction part of the AV architecture, thereby making the security and the efficiency of prediction in AVs much more dependable.

Literature Review
Multiple Object Tracking (MOT) is an experimental paradigm that helps a visual system keep track of multiple moving objects. MOT is particularly crucial for motion prediction, as quality predictions and courses of action can only be decided by an AV when it can track the moving objects around it effectively. MOT enables an AV to keep track of various moving objects around it, such as cars, trucks, bikes, or even people. MOT has been carried out by various methods in the past, of which five MOT models are discussed below.

Traditional MOT
Traditional MOT comprises three major modules that help achieve the desired output: data segmentation, data association, and filtering (Petrovskaya 2012). Data segmentation deals with segmenting the raw data received from the AV's sensors into clusters using pattern recognition techniques. Those clusters of data must then be associated with the objects on the road as moving targets (moving obstacles), which is achieved by data association techniques. For filtering the data, that is, tracking the position of those moving objects, filters like the Kalman filter are used. In 2013, a cube bounding box was built to identify whether the cluster within the box is a vehicle or not. In 2015, a method was proposed to track and detect vehicles using a 3D LiDAR sensor, from which LiDAR point clouds are formed. Identifying objects in those clusters of points was achieved through the K-Nearest Neighbor (KNN) algorithm, which uses Euclidean distance: a new object is compared with the properties of its neighbors to identify itself as a particular object.
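The KNN-with-Euclidean-distance idea above can be sketched in a few lines. This is an illustrative toy only, not the cited authors' implementation: the (length, width, height) extent features and the labelled examples are hypothetical.

```python
import math

def euclidean(a, b):
    # straight-line distance between two feature vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(query, labelled_clusters, k=3):
    """Label a cluster by majority vote among its k nearest neighbours.

    labelled_clusters: list of (feature_vector, label) pairs, e.g. the
    (length, width, height) extents of previously identified clusters.
    """
    neighbours = sorted(labelled_clusters,
                        key=lambda item: euclidean(query, item[0]))[:k]
    votes = {}
    for _, label in neighbours:
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)

# Hypothetical (length, width, height) extents in metres
known = [((4.5, 1.8, 1.5), "car"), ((4.2, 1.7, 1.4), "car"),
         ((1.8, 0.6, 1.7), "pedestrian"), ((2.1, 0.7, 1.2), "bike")]
print(knn_classify((4.4, 1.75, 1.45), known))  # → car
```

In a real pipeline the feature vector would come from the segmented LiDAR cluster, but the nearest-neighbour vote works the same way.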

Model based MOT
As the name suggests, this MOT (2012) is based on models of target identification built directly upon the data provided by the sensors, and it also utilizes geometric models of objects together with non-parametric filters such as particle filters. Model-based MOT was used by the AV "Junior", which represented Stanford University in the 2007 DARPA Urban Challenge and placed second; its model-based MOT was presented by Petrovskaya and Thrun in 2009. In this approach, the steps of data segmentation and data association are removed and replaced with a formulation that combines the Kalman and Rao-Blackwellized particle filters to find the vehicle's pose and geometry. In 2015, a similar method was adopted, but the object categories were not assumed. A Bayes filter is responsible for figuring out the pose of the sensor, the geometry of the static local background, and the dynamics. The geometry information includes boundary points observed with the help of a 2D LiDAR sensor. The whole system is based on iteratively applying new measurement and location updates to the previous target details.

Stereo based MOT
In short, this type of MOT is based on the color and depth information provided by stereo pairs of images to detect and track the objects within an environment. In 2010, a method was proposed for obstacle detection and recognition that uses synchronized video from a forward-looking stereo camera. For detecting obstacles, a Support Vector Machine (SVM) with a Histogram of Oriented Gradients (HOG) was used to categorize each candidate as an obstacle or not. For obstacle tracking, hypotheses were formed from the already available data and then verified. In 2016, a semi-global matching algorithm that computes a disparity map from a stereo image pair was used. Such disparity maps go a long way when it comes to applying simple linear clustering techniques such as coplanar, hinge, and occlusion to get the boundaries for object segmentation.
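The reason disparity maps carry depth information is the stereo triangulation relation Z = f·B/d. A minimal sketch, with a hypothetical rig (the focal length and baseline below are invented example numbers):

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Triangulate depth from a stereo disparity value.

    disparity_px: horizontal pixel offset of a feature between the
    left and right images; focal_px: focal length in pixels;
    baseline_m: distance between the two cameras in metres.
    """
    if disparity_px <= 0:
        return float("inf")  # zero disparity = point at infinity
    return focal_px * baseline_m / disparity_px

# Hypothetical rig: 700 px focal length, 0.54 m baseline
print(depth_from_disparity(20.0, 700.0, 0.54))  # → 18.9 (metres)
```

Applied per pixel of a semi-global-matching disparity map, this is what turns a stereo pair into the dense depth used for segmentation.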

Grid Map based MOT
As the name suggests, this approach starts with the construction of a grid map of the dynamic environment around the vehicle, initially introduced by Petrovskaya et al. in 2012. The construction of this map includes data association, data segmentation, and filtering to get an accurate 3D layout for navigating through the environment. Azim and Aycard in 2014 used a 3D local occupancy grid map that divides the environment into occupied, free, and unknown voxels (3D pixels). In 2017, Ge et al. utilized a 2.5D occupancy grid map to model the static background and the dynamic objects within it.
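The occupied/free/unknown voxel idea can be sketched with a sparse dictionary keyed by voxel index. This is a simplified illustration, not the cited authors' data structure; the 0.5 m voxel size is an arbitrary example.

```python
class VoxelGrid:
    """Minimal 3-D occupancy grid: space divided into fixed-size voxels
    that are 'occupied', 'free', or (by default) 'unknown'."""

    def __init__(self, voxel_size=0.5):
        self.voxel_size = voxel_size
        self.cells = {}  # (i, j, k) -> "occupied" | "free"

    def _index(self, x, y, z):
        s = self.voxel_size
        return (int(x // s), int(y // s), int(z // s))

    def mark(self, x, y, z, state):
        self.cells[self._index(x, y, z)] = state

    def query(self, x, y, z):
        # anything never observed stays "unknown"
        return self.cells.get(self._index(x, y, z), "unknown")

grid = VoxelGrid(voxel_size=0.5)
grid.mark(1.2, 3.4, 0.1, "occupied")   # a LiDAR return lands here
print(grid.query(1.3, 3.3, 0.2))       # → occupied (same voxel)
print(grid.query(5.0, 5.0, 5.0))       # → unknown  (never observed)
```

Real systems additionally ray-trace from the sensor to mark traversed voxels "free", which is how the three-way occupied/free/unknown split arises.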

Sensor Fusion based MOT
ITM Web of Conferences 57, 01002 (2023) ICAECT 2023 https://doi.org/10.1051/itmconf/20235701002

Sensor fusion based MOT has a development history very similar to the other modes of MOT discussed briefly above. As the name indicates, it fuses all the data collected by the different sensors involved in an AV. Notably, this method was used by the AV "Boss" belonging to Carnegie Mellon University, which won first place in the 2007 DARPA Urban Challenge. The MOT system is divided into two layers, where first the sensor data is used to associate the various observed features with either a point or a box model. This was further used in 2014 by Cho et al. on the car that was the official entry of Carnegie Mellon University in the DARPA challenge. In 2013, Mertz et al. used scan lines from a 2D LiDAR projected onto a 2D plane. In 2015, Na et al. used multiple sensors such as radars, 2D LiDARs, and 3D LiDARs to merge the tracks of moving obstacles. Recently, in 2021, Tuopu Wen et al. used a monocular camera for roadside HD map object reconstruction.

Challenges in the design of AV
In 2021, Qiping Chen et al. raised three major concerns for the AV industry to address before it can boom. These challenges can be divided into technical, complex environment, and application challenges. Technical challenges are those that cause problems with the sensors, radars, and other technology involved in the AV. A few concerns that need to be addressed are the poor visibility of camera sensors in low light, the limited detection distance of LiDAR, and, last but not least, serious data omissions from the MMW (millimeter wave) radar. One of the major algorithmic issues of previous MOT models has been the lack of well-developed fusion sensing technology. In layman's terms, fusion sensing technology fuses all the pieces of information gathered from the sensors present in the AV into one big chunk of data that can be used for object detection and motion prediction.
Complex environmental situations such as heavy rainfall, dew, blizzards, and plateaus (the higher the land, the thinner the air, which means the cooling system will take much longer to cool the system) all play a crucial role in how the AV functions in its environment. For instance, if it rains, the camera will get wet, reducing the quality of the images rendered by the camera sensors. Such extreme environmental conditions make object detection a herculean task for even the greatest of algorithms. Unfortunately, weather conditions cannot be controlled, but they can be used as prerequisites to make the object detection and motion prediction algorithms as robust as possible, so that such calamities are faced with better confidence.
Finally, the application of AVs is not limited to bringing the idea to market, where it must make a good impression on people and persuade them to buy. It also comes with further concerns, such as whether the sensor placement keeps the vehicle appealing to customers, or whether the reduced trunk space is enough. There will be many more concerns and challenges that the development of AVs must face and overcome. To solve certain problems involving sharper object detection, and thereby make the entire AV system more robust with emphasis on object detection and motion prediction, this paper aims to make a difference in the field of developing algorithms for the AV industry.

Proposed Architecture
Figure 1 shows the overall architecture of the proposed motion planning system in an autonomous vehicle. Data related to the autonomous vehicle's (ego agent's) position and its surroundings (agents) are perceived through various sensors. The perceived information about the ego agent and its surrounding agents is combined to comprehend the perceptual scene of the AV. The perception scene is further overlaid with aerial and semantic maps to get a broader representation of the autonomous vehicle in global and local context. Rasterisation is performed to convert the scenic information into rasters, which facilitate better learning by deep learning models. EfficientNet, a deep learning model, is preferred in the design to predict the next move of the autonomous vehicle. The prediction is made in a way that ensures a smooth and safe journey. The predicted information is passed to the controller to automate the movement of the autonomous vehicle.

Few terminologies in the design of autonomous vehicles
• Frame - A time-stamped record that contains the location and rotation of the AV.
• Agent - Any movable entity (a vehicle, pedestrian, and so on) with a unique ID for tracking between frames.
• Agent's data - The features corresponding to the agent. These may include the centroid, extent, yaw velocity, track-id, and label probability of the corresponding agent.
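These terminologies map naturally onto simple record types. A minimal sketch (the field types are assumptions for illustration, not the dataset's exact schema):

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class Agent:
    track_id: int                        # unique ID, stable across frames
    centroid: Tuple[float, float]        # position in world coordinates
    extent: Tuple[float, float, float]   # length, width, height
    yaw: float                           # heading angle in radians
    velocity: Tuple[float, float]        # velocity vector
    label_probabilities: Dict[str, float] = field(default_factory=dict)

@dataclass
class Frame:
    timestamp: int                                # time-stamped record
    ego_translation: Tuple[float, float, float]   # AV location
    ego_rotation: Tuple[float, float, float]      # AV orientation
    agents: List[Agent] = field(default_factory=list)

f = Frame(timestamp=1, ego_translation=(0.0, 0.0, 0.0),
          ego_rotation=(0.0, 0.0, 0.0))
f.agents.append(Agent(7, (12.0, 3.0), (4.5, 1.8, 1.5), 0.1, (3.0, 0.0),
                      {"car": 0.97}))
```

A perception scene is then simply an ordered list of such frames.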

Perception scene generator
Perception in an autonomous system is the ability to sense and observe the surroundings. The autonomous vehicle needs to sense its velocity and position, as well as the relative displacement of other agents around it, to help the motion planning system make safe and informed decisions. Sensors like radar, GPS, and LiDAR are used to capture information about the autonomous vehicle and the agents. The captured data are pre-processed to retrieve information related to the autonomous vehicle (ego translation, ego rotation, time stamp) and the agents (centroid, extent, yaw velocity, track-id, and label probability for each agent). This information is combined frame by frame to generate the perception scene, which provides a complete scenario of the autonomous vehicle over a period of time.

Global and local context generator of autonomous vehicle
The perception scene data are generally available in vector graphic form, which is computationally challenging to handle. Therefore, a rasterisation approach is preferred to represent the collective information over frames in an efficient way. Rasterisation is a process in which vector graphics are converted into pixels, dots, or lines, which are then grouped together to form meaningful images called rasters. It is essential to view the global and local context of the autonomous vehicle with respect to the other agents and the surroundings. The local context relates the autonomous vehicle to its immediate surroundings, static objects, and agents, whereas the global context relates the autonomous vehicle to a bigger locality to understand its global position. The proposed approach uses semantic maps to incorporate the local context in the design of motion planning; information about static objects of the environment, such as lanes, junctions, traffic signals, and speed breakers, is provided in the semantic maps. This information is overlaid with the perception scene to infer the local context of the current frame. A box rasteriser is used in the proposed system to render the local context and mark the agents as 2D boxes. The global view is obtained through aerial maps, which provide information about the roads connecting different places. This information is also overlaid with the rasters and local context to provide a better view of the current situation for the autonomous vehicle. Satellite rasterisers are preferred in this approach to render an oriented crop of the scene from a satellite map. The overlaid local and global context is called the Bird's Eye View (BEV) raster, samples of which are shown in Fig. 2.
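The box-rasteriser step can be sketched as follows. This is a deliberately simplified illustration: agents are drawn as axis-aligned boxes on a single channel, whereas the real rasteriser rotates boxes by yaw and renders multiple channels; the grid size and pixel scale are arbitrary example values.

```python
def rasterise_agents(agents, raster_size=20, pixel_size=1.0):
    """Render agents as filled axis-aligned 2-D boxes on a BEV grid.

    agents: list of (cx, cy, length, width) in metres, ego-centred.
    Returns a raster_size x raster_size grid of 0/1 pixels with the
    ego vehicle at the centre.
    """
    grid = [[0] * raster_size for _ in range(raster_size)]
    half = raster_size // 2
    for cx, cy, length, width in agents:
        x0 = int((cx - length / 2) / pixel_size) + half
        x1 = int((cx + length / 2) / pixel_size) + half
        y0 = int((cy - width / 2) / pixel_size) + half
        y1 = int((cy + width / 2) / pixel_size) + half
        for y in range(max(y0, 0), min(y1 + 1, raster_size)):
            for x in range(max(x0, 0), min(x1 + 1, raster_size)):
                grid[y][x] = 1  # pixel covered by the agent's box
    return grid

bev = rasterise_agents([(4.0, 0.0, 4.0, 2.0)])  # one car 4 m ahead
```

Overlaying such agent channels on semantic-map and satellite-map channels produces the multi-channel BEV raster described above.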

Deep learning model for Motion Planning through Transfer Learning
The objective of this module is to predict the next move of the autonomous vehicle. The collective information on the AV, the agents, and the surroundings, in the form of the Bird's Eye View raster, is used to model the motion planning system. Deep learning models are chosen in the design because the BEV raster data are high-dimensional features with complex relations, and the various deep neural network architectures and learning algorithms provide an opportunity for effective modelling of the motion planning system. To implement motion prediction, a simple baseline standard CNN architecture, EfficientNet, is preferred in the proposed design. EfficientNet is a convolutional neural network that scales uniformly well in all dimensions and can learn from pretrained models. Since the BEV rasters are high dimensional and the features depend on the number of agents, EfficientNet is preferred over other CNN architectures. Moreover, transfer learning is incorporated in the proposed design to improve the performance of the EfficientNet motion prediction model. Using the baseline architecture, the layers of the EfficientNet are modified according to the input and output features of the motion planning system; the first and last layers are modified according to the application's requirements. A default three-channel convolutional layer would not be enough to rasterise different semantic information in different layers; hence, a five-channel input with multiple input features is considered, and the number of channels in the first convolutional layer for this application is determined by the number of agents considered. Table 1 shows the layer composition of the EfficientNet deep learning model used in the proposed design. The proposed motion planning model is trained using a transfer learning approach on the labelled BEV raster data.
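The transfer-learning recipe, reuse the pretrained backbone and retrain only the replaced input/output layers, can be shown with a toy stand-in. This is purely illustrative: a tiny two-stage linear model plays the role of EfficientNet, and all weights and data below are invented.

```python
# "Backbone": weights assumed to be learnt on a source task and frozen.
def features(x, pretrained_w):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in pretrained_w]

# New task-specific head (e.g. predicting yaw angle / angular velocity).
def predict(x, pretrained_w, head_w):
    h = features(x, pretrained_w)
    return sum(wi * hi for wi, hi in zip(head_w, h))

pretrained_w = [[0.5, -0.2], [0.1, 0.8]]  # reused as-is, never updated
head_w = [0.0, 0.0]                        # the only part trained anew

# One gradient-descent step on the head alone, under an MSE loss.
x, target, lr = [1.0, 2.0], 1.5, 0.1
h = features(x, pretrained_w)
err = predict(x, pretrained_w, head_w) - target
head_w = [w - lr * 2 * err * hi for w, hi in zip(head_w, h)]
```

In the actual design this corresponds to loading pretrained EfficientNet weights, swapping the first convolution for a five-channel one and the final layer for the regression output, and fine-tuning on the labelled BEV rasters.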
The re-rasterised frame obtained as output from the rasteriser yields the predicted trajectory of the autonomous vehicle. The angular velocity and the yaw angle of the autonomous vehicle are predicted as the output. The predicted information is subsequently passed on to the controller to trigger the succeeding motion of the autonomous vehicle based on the yaw angle obtained.

Experimental Evaluation
The proposed motion prediction system is modelled with the Lyft dataset. The Lyft prediction dataset contains fused sensor information collected over 1,118 hours, which enables the design of the motion prediction system. The data also contain high-definition scene data with bounding boxes and class probabilities for the different agents. To detail the semantic information about the surroundings, high-resolution aerial images are provided in the dataset.
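Experiments of this kind are typically driven by a small configuration file. The fragment below is an illustrative sketch in the general shape of an L5Kit-style configuration; the exact keys and values are assumptions and may differ between library versions.

```yaml
# Illustrative only: key names follow the general L5Kit config layout.
raster_params:
  raster_size: [224, 224]     # output raster in pixels
  pixel_size: [0.5, 0.5]      # metres per pixel
  ego_center: [0.25, 0.5]     # AV placed left of centre, facing right
  map_type: "py_semantic"     # or "py_satellite" for the aerial view
model_params:
  history_num_frames: 10      # past frames fed to the model
  future_num_frames: 50       # trajectory horizon to predict
```

The same configuration is then reused for both the ResNet50 baseline and the EfficientNet model, so that the two are compared on identical rasters.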

Experimental Analysis
In this work, the motion prediction methods are evaluated by two universally agreed, standard measures: the Metric Mean (ME) and the Loss Function (LF). The experimental results evaluated against these metrics for the baseline model ResNet50 and for EfficientNet are presented in Table 3. To scrutinise and distinguish the dynamics of both ResNet50 and EfficientNet, the Lyft dataset along with L5Kit is fed into each network, configured to produce the predicted path in adjustable frames of the predicted trajectory. On processing the test data with the ResNet50 architecture, the observed average regression loss is 227.19, corresponding to Fig. 6. In addition, the negative log-likelihood metric is found to be 7,700.53, as seen in Fig. 7 with reference to Table 3.
Though ResNet50 produced affirmative results, it has some notable drawbacks:
• ResNet50 has a large, complex architecture containing 49 convolutional layers and 1 fully connected layer at the end of the network.
• Its functioning is somewhat slower.
How EfficientNet overcomes the limitations of ResNet50, taking the same evaluation metrics into consideration, is discussed in the following sections.
Since this is a regression problem, where a real-valued quantity is predicted, the output layer configuration is one node with a linear activation unit, and the loss function is evaluated in terms of the Mean Squared Error (MSE), computing the regression loss. Implementing the EfficientNet baseline architecture and taking the loss metrics into consideration, the achieved average regression loss is 149.25 on training the model with the Lyft dataset and L5Kit; the training loss is plotted in Fig. 9. This average loss is found to be lower than ResNet50's average regression loss (Fig. 6). A further evaluation metric is the negative log-likelihood: a cost function used as a loss for machine learning models that indicates the deterioration in a neural network's performance, the lower the better. The negative log-likelihood of the ground truth coordinates under the predicted distribution is found to be 7,553.51 in the case of EfficientNet, which is lower than that of ResNet50, as shown in Fig. 8.
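The two quantities compared above can be computed as follows. This is a generic sketch of MSE and of a Gaussian negative log-likelihood (the unit variance and the toy predictions are assumptions; the paper's NLL is evaluated over full trajectory distributions).

```python
import math

def mse(pred, target):
    """Mean squared error: the regression loss reported above."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def gaussian_nll(pred, target, sigma=1.0):
    """Negative log-likelihood of the ground truth under a Gaussian
    centred on each prediction: lower is better."""
    nll = 0.0
    for p, t in zip(pred, target):
        nll += 0.5 * math.log(2 * math.pi * sigma ** 2) \
             + (t - p) ** 2 / (2 * sigma ** 2)
    return nll

pred, truth = [1.0, 2.0, 3.0], [1.5, 2.0, 2.5]
print(mse(pred, truth))  # → ~0.1667
```

Both behave the same way in the comparison: the model with predictions closer to the ground truth scores lower on each, which is why EfficientNet's smaller values indicate the better model.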
Another parameter, the Metric Mean (ME), is taken into evaluation. A metric is again a function used to judge the performance of a model. Metric functions are similar to loss functions, except that the results from evaluating a metric are not used when training the model. The metric mean is plotted against the time displacement of the autonomous vehicle in the scatter plot shown in Fig. 10.
The visualisation of the predictions is presented in Fig. 11: the trajectories are plotted for the autonomous vehicle and the surrounding agents through the ego-trajectory and agent-trajectory at a particular (adjustable) frame, varying back and forth the locations of both the autonomous vehicle and the agents. With the experimentally evaluated parameters of both the network architectures, ResNet50 and EfficientNet, using the same Lyft dataset and L5Kit configurations, the following results are inferred.
• EfficientNet not only achieves better accuracy (with comparatively lower regression loss) than ResNet50; it is also lightweight and thus faster to run, whereas ResNet50 lags somewhat.
• EfficientNet produces a comparatively lower negative log-likelihood, which is expected of a sound prediction model.
The aforementioned features demonstrate that EfficientNet works much more efficiently than the ResNet50 baseline network when employed for motion prediction and path visualisation, which is ultimately the objective of this paper.

Conclusion and Future Scope
Autonomous Vehicles (AVs) are without a doubt the future of the transportation industry on Earth, and perhaps later of space-based exploration. With such a tremendous advancement of technology whose base has been developed in recent decades, it is crucial that the existing systems be functional to the highest order. To reach levels of functionality high enough for AVs to become common in the industry, advancements must be made such that the efficiency of the system, security, usability, robustness, safety, economic concerns, and finally comfort are all considered. This paper focuses on making the safety and the efficiency of the system better than existing models. The improvement was made possible by migrating from the ResNet-50 architecture to the EfficientNet architecture. In 2019, Google Research found that, using FLOPs (floating point operations) as the reference for computing cost, the widely used ResNet-50 achieved a top-1 accuracy of 76.3 percent, whereas EfficientNet-B4 surpassed it with an astounding increase of 6.3 percentage points, reaching 82.6 percent. Logically, utilizing a CNN architecture that offers better accuracy is bound to increase the accuracy of object detection within an AV. This also explains why traditional methods were not used: they would be considered outdated in the presence of such capable CNNs, which use the idea of transfer learning and constantly scale up their accuracy with the help of AutoML. The future of AV research has great potential to make current object detection and motion planning systems more efficient while improving their accuracy at the same time. The future is also not limited to vehicles that use wheels to move; these concepts can most likely be applied to cargo ships, planes, and maybe even rockets (mostly for unmanned missions). As far as the future of AVs is concerned, the sky is the limit.

Figure 1. Outline of the Proposed Architecture.

Table 1. Stages vs corresponding resolution of the EfficientNet model [23].
Transfer learning is preferred in the proposed design. Transfer learning generally uses the learnt information of the pre-trained model and effectively improves the learning with respect to the application considered.
Table 2 shows the statistical distribution of the dataset. The datasets are present in zarr format. Training.zarr comprises 32,01,24,624 agents, 16,265 scenes, 40,39,527 frames, and 3,87,35,988 traffic light faces. Validation.zarr consists of 31,26,17,887 agents, 16,220 scenes, 40,30,296 frames, and 2,92,77,930 traffic light faces. Test.zarr contains 8,85,94,921 agents, 11,314 scenes, 11,31,400 frames, and 78,54,144 traffic light faces. The experiments are done with a selection of 83 percent of the data for training.