A Hybrid Deep Learning Based Visual System for In-Vehicle Safety

Abstract-In the automotive industry, researchers, AI experts, and developers are actively pursuing deep learning based approaches for in-vehicle safety. In this paper, we propose a hybrid deep learning based visual system that provides feedback to the driver in a non-intrusive manner. We describe a hybrid SSD-RBM model for facial feature identification. In this system, object detection, object tracking, and observations are processed through a full image-processing pipeline that detects the driver's movements and generates a safe and efficient action plan in real time. This in-vehicle interactive system assists drivers in regulating driving performance and avoiding hazards.

Index Terms-Computer Vision, Deep Learning, Driver Alert, In-vehicle Safety.

I. INTRODUCTION
According to the Global Status Report on Road Safety 2018, the number of road traffic deaths continues to rise steadily, reaching 1.35 million in 2016. The report further states that vehicle safety is increasingly critical to the prevention of crashes and has been shown to contribute to substantial reductions in the number of deaths and serious injuries resulting from road traffic crashes [1]. According to NHTSA research records (2018), the most common causes of traffic accidents are:
• Frontal crashes, where vehicles driving in opposite directions collide;
• Lane departure collisions, where a lane-changing vehicle collides with a vehicle in an adjacent lane;
• Crashes with surrounding vehicles while parking, passing through an intersection or a narrow alley, etc.;
• Failures to see or recognize the road signalization (traffic signs), which consequently cause a traffic accident due to inappropriate driving [2].
Steven Landry et al. proposed an in-vehicle sonification (alert with no speech) prototyping tool for driver state and performance data. In that paper, the authors state that automotive user interface (UI) designers are searching for alternative avenues for delivering information to the driver that complement the mostly visual task of driving. Auditory channels provide the flexibility to display a wider variety of information to the driver without increasing the workload of the driving task. It is important to identify types of auditory displays and sonification strategies that provide integral information necessary for the driving task and do not overload the driver with unnecessary or intrusive data. Their system is intended to integrate driving performance data and driver affective state in real time. Automobiles and in-vehicle safety systems have been improving over the past several decades [3].

Deep Learning:
Deep learning is a specific subfield of machine learning: a new take on learning representations from data that puts an emphasis on learning successive layers of increasingly meaningful representations. In recent years, deep learning (DL), reinforcement learning (RL), and deep RL methods have been poised to reshape the future of machine learning (ML) [4]. The core concept of DL is to learn data representations through increasing levels of abstraction. At almost all levels, more abstract representations at a higher level are learned by defining them in terms of less abstract representations at lower levels. This type of hierarchical learning process is very powerful, as it allows a system to comprehend and learn complex representations directly from raw data [5], making it useful in many disciplines [6]. Several DL architectures have been reported in the literature, including the deep neural network (DNN), recurrent neural network (RNN), convolutional neural network (CNN), deep autoencoder (DA), deep Boltzmann machine (DBM), deep belief network (DBN), deep residual network, deep convolutional inverse graphics network, and so on. In this research, we used deep learning methods such as SSD and DBM, as they are efficient tools for object detection, object tracking, and image segmentation of the driver's movements.

II. COMPUTER VISION
The authors of "Practical Machine Learning with Python" state that computer vision is the art and science of making machines understand high-level, useful patterns and representations from images and video, so that they can make intelligent decisions similar to those a human would make upon observing the surroundings [7]. Feedback has also played an important role in some vision tasks. For example, feedback has been used to select the internal attention to achieve better object recognition performance [8]. Computer vision solutions are in use today in manned vehicles for improved safety or comfort [9].

A. Driver Alert
A Hybrid Deep Learning Based Visual System for In-Vehicle Safety Rajkumar Joghee Bhojan, D. Ramyachitra, Subramanian Ganesan, Ragavi Rajkumar

Because the driver's visual system is mostly occupied by the driving task, active safety systems can be designed around other sensory channels, such as the auditory channel. According to multiple resource theory, each task has a vector that describes the amount and qualitative level of the additional resources it demands. Driving is assumed to involve visual, spatial, and manual resources, to a degree that depends on the task demands. When no task demands are placed on the driver, the driver tends to drift into a semi-sleep state during long drives. Therefore, it is important to identify the driver's visual state and the strategies that can provide integral information necessary for the driving task without overloading the driver with unnecessary or intrusive data.

III. METHOD
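A. Architecture
The Driver Alert system consists of three components: the Front-end Component (FC), the Communication Component (CC), and the Back-end Component (BC), as shown in Figure 1.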

B. Front-end Component (FC):
The FC module is deployed on the edge device. As shown in the top box in Figure 1, it consists of three submodules: image pre-processing (e.g., blurry image detection), face detectors, and filter-based segmentation. The image pre-processing module first produces a clear original image for segmentation. Next, the face detector is combined with different filters to segment the original image, which yields a clear, segmented image. These images are transferred to the server via the Communication Component (introduced below) for further classification.
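The paper does not say how blurry images are detected in the pre-processing submodule. As a minimal sketch, assuming the common variance-of-Laplacian heuristic and a hypothetical threshold that would need tuning on real camera frames, a blur check on a grayscale image (a 2D list of intensities) could look like this:

```python
from statistics import pvariance


def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over the interior pixels of a 2D grayscale image."""
    height, width = len(gray), len(gray[0])
    responses = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            lap = (gray[y - 1][x] + gray[y + 1][x]
                   + gray[y][x - 1] + gray[y][x + 1]
                   - 4 * gray[y][x])
            responses.append(lap)
    return pvariance(responses)


def is_blurry(gray, threshold=100.0):
    """Flag the frame as blurry when edge energy is low (threshold is an assumed value)."""
    return laplacian_variance(gray) < threshold


# A checkerboard has strong edges; a flat gray frame has none.
sharp = [[255 if (x + y) % 2 == 0 else 0 for x in range(8)] for y in range(8)]
flat = [[128] * 8 for _ in range(8)]
```

Frames flagged as blurry would be dropped before face detection and segmentation, so that only clear images are sent on to the back end.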

C. Communication Component (CC):
The CC provides two channels for communication between the FC and the Back-end Component (BC). It transfers the image data from the FC to the BC via the Input Channel, and it passes the detection results from the BC to the FC via the Output Channel.
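The two-channel design can be illustrated with an in-process stand-in. The sketch below assumes one queue per direction; the class and method names are hypothetical, and a real deployment between an edge device and a cloud server would replace the queues with a network transport that the paper does not specify.

```python
import queue


class CommunicationComponent:
    """In-process stand-in for the CC: one queue per direction."""

    def __init__(self):
        self.input_channel = queue.Queue()   # FC -> BC: segmented images
        self.output_channel = queue.Queue()  # BC -> FC: detection results

    def send_image(self, frame_id, image):
        self.input_channel.put((frame_id, image))

    def next_image(self):
        return self.input_channel.get_nowait()

    def send_result(self, frame_id, result):
        self.output_channel.put((frame_id, result))

    def next_result(self):
        return self.output_channel.get_nowait()


cc = CommunicationComponent()
cc.send_image(1, [[0, 255], [255, 0]])   # FC side: push a segmented image
frame_id, image = cc.next_image()        # BC side: pull it for classification
cc.send_result(frame_id, "fatigued")     # BC side: illustrative label only
```

Tagging each image with a frame identifier lets the FC match a result on the Output Channel to the frame that produced it.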

D. Back-end Component (BC):
The BC module runs on the cloud server, which is configured to use OpenCV2 for SSD model training and testing. OpenCV (Open Source Computer Vision Library) is an open source computer vision and machine learning software library. It was built to provide a common infrastructure for computer vision applications and to accelerate the use of machine perception in commercial products [10]. The trained model is then deployed on the server and used to classify the image. More specifically, the segmented image is first passed through our proposed ConvNet, which then generates the features.

IV. HYBRID APPROACH
To characterize face similarities from different aspects, such as fatigued or not, we concatenate the features extracted from different face region pairs by different supervised learning methods, such as decision tree and random forest methods. The resulting high-dimensional relational features are classified by supervised learning for face verification. In supervised training, we started from a model pretrained on AffectNet and identified the facial state of the driver. The features for identifying the facial states such as, A Decision Tree (DT) is a binary classifier that resembles a tree, in which the features are represented by nodes, the edges leaving a node are labeled by the feature weight, and the leaves represent the categories. The decision tree algorithm stops when no other selection can be made [11]. The decision tree finally outputs whether the subject is fatigued or not, and this result is fed into the ConvNet analysis.
In the ConvNet, we input the data we want to use (images, non-performance activity, etc.), and the data is passed through the successive layers of the network. Each layer transforms its input values into a representation that is more useful and predictive for the model.
The EigenFaces face recognizer views the training images of all subjects as a whole and tries to deduce their components. It keeps the components that are necessary and helpful and discards the rest; in this way, it not only extracts the essential elements from the training data but also saves memory by rejecting the less critical segments. After pre-training each supervised learner and the ConvNet separately, the entire hybrid network is jointly optimized to further improve the accuracy.

A. RBM Learning
The Boltzmann machine, proposed by Geoffrey Hinton and colleagues in 1983 [12], is a well-known example of a stochastic neural network that can learn internal representations and solve combinatorial optimization problems. The Boltzmann machine is a fully connected network comprising two-state units. It employs simulated annealing for transitioning between the possible network states. The units flip their states on the basis of the current state of their neighbors and the corresponding edge weights to maximize a global consensus function, which is equivalent to minimizing the network energy. According to the authors in [13], a Restricted Boltzmann Machine (RBM) is an undirected graphical model with stochastic visible variables v ∈ {0, 1} and stochastic hidden variables h ∈ {0, 1}, where each visible variable is connected to each hidden variable. An RBM is a variant of the Boltzmann machine, with the restriction that the visible units and hidden units must form a bipartite graph. This restriction allows for more efficient training algorithms, in particular the gradient-based contrastive divergence algorithm [14].
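To make the bipartite structure and contrastive divergence concrete, the following is a minimal sketch of a binary RBM with a single-step contrastive divergence (CD-1) update. It uses the standard energy E(v, h) = -a·v - b·h - vᵀWh; the class name, layer sizes, learning rate, and toy input pattern are assumptions for illustration and are not the configuration used in this paper.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyRBM:
    """Binary RBM: visible and hidden units form a bipartite graph (no intra-layer weights)."""

    def __init__(self, n_visible, n_hidden, seed=0):
        self.rng = random.Random(seed)
        self.W = [[self.rng.gauss(0.0, 0.1) for _ in range(n_hidden)]
                  for _ in range(n_visible)]
        self.a = [0.0] * n_visible  # visible biases
        self.b = [0.0] * n_hidden   # hidden biases

    def energy(self, v, h):
        """E(v, h) = -a.v - b.h - v^T W h"""
        e = -sum(ai * vi for ai, vi in zip(self.a, v))
        e -= sum(bj * hj for bj, hj in zip(self.b, h))
        e -= sum(v[i] * self.W[i][j] * h[j]
                 for i in range(len(v)) for j in range(len(h)))
        return e

    def p_hidden(self, v):
        """P(h_j = 1 | v); hidden units are independent given the visible layer."""
        return [sigmoid(self.b[j] + sum(v[i] * self.W[i][j] for i in range(len(v))))
                for j in range(len(self.b))]

    def p_visible(self, h):
        """P(v_i = 1 | h); visible units are independent given the hidden layer."""
        return [sigmoid(self.a[i] + sum(self.W[i][j] * h[j] for j in range(len(h))))
                for i in range(len(self.a))]

    def sample(self, probs):
        return [1 if self.rng.random() < p else 0 for p in probs]

    def cd1_step(self, v0, lr=0.1):
        """One CD-1 update: positive phase on the data, negative phase on a one-step reconstruction."""
        ph0 = self.p_hidden(v0)
        h0 = self.sample(ph0)
        v1 = self.sample(self.p_visible(h0))
        ph1 = self.p_hidden(v1)
        for i in range(len(v0)):
            for j in range(len(ph0)):
                self.W[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
            self.a[i] += lr * (v0[i] - v1[i])
        for j in range(len(ph0)):
            self.b[j] += lr * (ph0[j] - ph1[j])


rbm = TinyRBM(n_visible=4, n_hidden=2, seed=1)
for _ in range(50):
    rbm.cd1_step([1, 1, 0, 0])  # toy binary pattern, not real face features
```

Because of the bipartite restriction, each conditional distribution factorizes over units, which is why both `p_hidden` and `p_visible` can be computed in a single pass without sampling within a layer.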
Facial emotion recognition has been described as fundamentally identifying the emotions that shape human self-control and reactions according to the situation and the environment to which people belong. The same research explains that Skybiometry is considered state of the art in recognizing and detecting facial expressions. Mollahosseini et al. created the AffectNet dataset for facial expressions [15]. AffectNet serves as the largest database of facial expressions, valence, and arousal, represented in two different emotion models. Measured by the evaluation metrics, deep neural network baselines can perform better than conventional learning methods [16]. In this paper, we used the AffectNet dataset to identify facial states (especially fatigue) while the driver is driving. Manually labeling datasets with object masks is extremely time consuming [17].

B. Single Shot Detector (SSD):
The authors in [18] describe how the Single Shot MultiBox Detector (SSD) discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location. At prediction time, the network generates scores for the presence of each object category in each default box and produces adjustments to the box to better match the object shape. Additionally, the network combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes. Because SSD works on a single-network principle, it completely eliminates proposal generation and the subsequent pixel or feature resampling stages and encapsulates all computation in a single network. This makes SSD easy to train and straightforward to integrate into systems that require a detection component. The authors further report that experimental results on the PASCAL VOC, COCO, and ILSVRC datasets confirm that SSD has competitive accuracy to methods that utilize an additional object proposal step and is much faster, while providing a unified framework for both training and inference [18]. The architecture of SSD is shown in Fig. 2.
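The default-box idea described above can be sketched in a few lines. The example assumes square feature maps, normalized image coordinates, and the linear scale schedule from the SSD paper [18]; the function names and the particular aspect ratios are illustrative choices, not the settings used in this work.

```python
import math


def scales(num_maps, s_min=0.2, s_max=0.9):
    """Linearly spaced box scales, one per feature map (smallest scale for the finest map)."""
    if num_maps == 1:
        return [s_min]
    return [s_min + (s_max - s_min) * k / (num_maps - 1) for k in range(num_maps)]


def default_boxes(fmap_size, scale, aspect_ratios=(1.0, 2.0, 0.5)):
    """Return (cx, cy, w, h) default boxes for every cell of a square feature map."""
    boxes = []
    for row in range(fmap_size):
        for col in range(fmap_size):
            cx = (col + 0.5) / fmap_size  # cell centre, normalized to [0, 1]
            cy = (row + 0.5) / fmap_size
            for ar in aspect_ratios:
                # width / height = ar, while the box area stays scale**2
                boxes.append((cx, cy, scale * math.sqrt(ar), scale / math.sqrt(ar)))
    return boxes


# A 4x4 feature map with three aspect ratios yields 4 * 4 * 3 = 48 default boxes.
boxes = default_boxes(fmap_size=4, scale=scales(6)[0])
```

At prediction time the network would output, for each of these boxes, one score per object category plus four offsets that adjust the box toward the object shape.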

A. Camera
In this research, a camera is mounted near the dashboard, facing the driver's seat, so that it captures the driver's movements in real time. The camera is used mostly for object detection, recognition, and tracking, for example of the driver's eye movement, yawning, face movement, and body movement.

B. Object Detection
Object detectors have become significantly more accurate and have gained important new capabilities [19]. As stated by Shaoshan Liu et al. [20], recent years have seen the rapid development of vision-based deep learning technology, which enables highly accurate object detection and tracking. The convolutional neural network (CNN), a deep neural network (DNN) that is widely used in object-recognition tasks, has a four-layer evaluation process:
• The convolution layer contains a variety of filters to extract features from the input image. Because each filter contains a set of learnable parameters that are derived during the training stage, the convolution layer can be the most computationally intensive layer in the CNN.
• The activation layer has mechanisms for deciding whether to activate the target neuron.
• The pooling layer reduces the representation's spatial size to reduce the number of its parameters and, consequently, the required computation.
• The fully connected layer contains neurons with full connections to all activations in the pooling layer.
The authors in [21] further state that prior works have shown that, with a large amount of annotated data, CNNs can be trained to achieve super-human performance on various visual recognition tasks. As tremendous efforts are dedicated to the discovery of effective network architectures and training methods to further advance performance, it is also important to investigate effective approaches for data annotation, because data annotation is essential but expensive.
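The four-layer evaluation process listed above can be traced on a toy input. The sketch below is a plain-Python forward pass with a hand-picked edge filter and hand-picked fully connected weights; in a trained CNN these parameters would be learned, and all names and values here are illustrative assumptions.

```python
def conv2d(image, kernel):
    """Valid (no padding) 2D convolution layer with a single filter."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[y + i][x + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for x in range(out_w)]
            for y in range(out_h)]


def relu(fmap):
    """Activation layer: a neuron fires only if its response is positive."""
    return [[max(0.0, v) for v in row] for row in fmap]


def max_pool(fmap, size=2):
    """Pooling layer: keep the maximum of each non-overlapping size x size window."""
    return [[max(fmap[y + i][x + j] for i in range(size) for j in range(size))
             for x in range(0, len(fmap[0]) - size + 1, size)]
            for y in range(0, len(fmap) - size + 1, size)]


def fully_connected(fmap, weights, bias):
    """Fully connected layer: every output sees every pooled activation."""
    flat = [v for row in fmap for v in row]
    return [sum(w * v for w, v in zip(ws, flat)) + b for ws, b in zip(weights, bias)]


# 5x5 toy image with a vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
edge_filter = [[-1, 1], [-1, 1]]                      # responds to left-to-right brightening

features = conv2d(image, edge_filter)                 # 4x4 feature map
activated = relu(features)
pooled = max_pool(activated)                          # 2x2 after pooling
score = fully_connected(pooled, [[1, 1, 1, 1]], [0])  # single output neuron
```

The pooling step halves each spatial dimension, which is where the reduction in parameters and computation mentioned above comes from: the fully connected layer here needs 4 weights per output instead of 16.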
In this research, we used a single shot detector as an easy and lightweight model. The SSD network is based on a feedforward convolutional network, which produces a fixed-size set of bounding boxes and scores for the detected objects. As shown in Fig. 3, the driver's face and movements are captured in real time using the camera mounted on the dashboard.

C. Object Tracking
When the subject (driver) is fatigued or in a semi-sleep state, there is a chance of slumping either forward or sideways, as shown in Figure 4. In this situation, the Driver Alert algorithm is used for body tracking. The SSD algorithm is used to capture the driver's movements continuously without interruption. The main goal of our algorithm is to train an object detector that takes video images as input. To train such an object detector, the training and validation images are annotated with a bounding box per object and its category. Such annotation is commonly seen in public datasets, including PASCAL VOC and MS COCO. Object tracking aims to automatically track the moving object's trajectory and send the results to the decision-making stage. The main goal is to ensure that the vehicle does not collide with a moving object, whether a vehicle or a person crossing the road.
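The paper does not describe how per-frame detections are linked into tracks. One simple possibility, shown here purely as an assumption, is greedy matching by intersection over union (IoU) between the boxes tracked in the previous frame and the boxes detected in the current one. Track names, box coordinates, and the IoU threshold are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_tracks(tracks, detections, min_iou=0.3):
    """Greedily assign each track id to the unused detection it overlaps most.

    tracks: dict of track id -> box from the previous frame
    detections: list of boxes from the current frame
    Returns a dict of track id -> index into detections.
    """
    pairs = sorted(((iou(box, det), tid, di)
                    for tid, box in tracks.items()
                    for di, det in enumerate(detections)),
                   reverse=True)
    assigned, used = {}, set()
    for score, tid, di in pairs:
        if score < min_iou:
            break  # remaining pairs overlap too little to be the same object
        if tid in assigned or di in used:
            continue
        assigned[tid] = di
        used.add(di)
    return assigned


previous = {"head": (10, 10, 50, 50), "torso": (0, 60, 80, 140)}
current = [(2, 62, 82, 142), (14, 12, 54, 52)]  # detector output order is arbitrary
matches = match_tracks(previous, current)
```

With the tracks linked across frames, a sustained downward or sideways shift of the head box relative to the torso box is the kind of trajectory that could be passed on to the decision-making stage.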

D. Model Training
SSD training is used to handle multiple object categories. To deploy AI-based solutions for Driver Alert, we follow three major building blocks: data preparation, model generation, and model deployment. Data preparation focuses on getting data ready for training and testing neural networks, covering topics such as data recording, ground truth labeling, big data storage, etc. Model generation involves developing network architectures, training the networks, and evaluating the trained models. A model is considered "trained" if the difference between its outputs and the corresponding ground truth labels (the expected outputs)

E. Decision-making:
At this stage, the Driver Alert component makes a decision and acts according to the detected conditions. The first response is a speaker-based alert to the driver, and the second is shaking the backrest of the driver's seat.

VI. EXPERIMENTS
In our research, as shown in the flowchart (Fig. 5), video images are captured by a camera mounted on the vehicle's dashboard. The algorithm developed for Driver Alert keeps watching the driver's movements using both object detection and object tracking. If the driver is fatigued, it checks the next condition, whether yawning is occurring. If that condition also holds, the algorithm examines eye movement and body movement using the detection method. The trained model acts like an agent sitting in front of the driver and making the decision. When Driver Alert predicts that an alert is needed, it invokes a voice-based alert or shakes the driver's seat.
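The cascade of checks described above can be written as a small decision function. This is a sketch under stated assumptions: the field names are hypothetical, the eye and body checks are reduced to two booleans, and escalating from the voice alert to the seat shake only after a voice alert has already been given is one possible reading of the two alert options, not a rule stated in the paper.

```python
from dataclasses import dataclass


@dataclass
class DriverObservation:
    fatigued: bool      # output of the facial-state classifier
    yawning: bool       # mouth detection
    eyes_closed: bool   # eye-movement check (simplified to a flag)
    body_slumped: bool  # body-tracking check (simplified to a flag)


def decide_alert(obs, voice_alert_already_given=False):
    """Return "none", "voice", or "seat_shake" following the flow-chart order of checks."""
    if not obs.fatigued:
        return "none"
    if not obs.yawning:
        return "none"
    if not (obs.eyes_closed or obs.body_slumped):
        return "none"
    # Assumed escalation: speaker alert first, seat shake if the driver still needs alerting.
    return "seat_shake" if voice_alert_already_given else "voice"


first = decide_alert(DriverObservation(True, True, True, False))
second = decide_alert(DriverObservation(True, True, False, True),
                      voice_alert_already_given=True)
```

Keeping the decision as a pure function of the current observation makes it straightforward to test the alert logic separately from the detection and tracking models.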

VII. CONCLUSION
Although many researchers have studied fatigue detection, the fundamental problem to solve remains the avoidance of road hazards. This paper has proposed a new hybrid SSD-RBM model for driver face identification and verification. The model directly and jointly learns to extract relational visual features from eye movement and body movement under the supervision of face identities. Both the feature extraction and recognition stages are unified under a single deep network architecture and RBM. The components are jointly optimized for the target of face verification, and the result is finally communicated to the Driver Alert component.

Fig. 3. Object Detection - The subject is in a driving state.

Fig. 4. Object Tracking - The subject is in a semi-sleep state.

Fig. 5. Flow Chart for Driver Alert.