Deep Learning Model for A Driver Assistance System to Increase Visibility on A Foggy Road

For many years, considerable research has been devoted to developing Advanced Driver Assistance Systems (ADAS) based on integrated systems. The main objective is to help drivers and thereby keep them safe under different driving conditions. Visibility remains the biggest problem drivers face on a foggy road. In this paper, we examine a system that uses deep neural networks to substantially enhance visibility. Recent deep-learning research on removing fog from images has shown that end-to-end systems are effective models. However, the idea still needs to be extended to end-to-end, real-time video dehazing. In this paper, we introduce an image dehazing model based on Convolutional Neural Networks (CNN), which serves as the basis for developing the video dehazing model. In addition, we concatenate our model with Faster R-CNN (regions with convolutional neural networks) to detect objects on the road in real time. The experimental results on our image datasets show the performance of our model in terms of Peak Signal-to-Noise Ratio (PSNR = 19.823) and Structural Similarity (SSIM = 0.8501). On the dataset of synthesized videos, our model achieved PSNR = 21.4032 and SSIM = 0.9354. Moreover, when our dehazing model is concatenated with Faster R-CNN, the proposed system displays desirable visual quality and a remarkable improvement in object detection on blurred images, with a mean Average Precision (mAP) of 0.933 during the day and 0.804 at night.


Introduction
Advanced Driver Assistance Systems (ADAS) are mainly designed to help vehicle drivers and thereby minimize potential threats to their safety. The majority of these systems are based on image processing algorithms, such as those that detect nearby vehicles and pedestrians and recognize road signs. For this reason, many such systems are installed in vehicles. Although they are widely employed to alert drivers as soon as a potential threat appears, they perform less effectively under adverse weather conditions in which visibility is reduced, most notably in the presence of fog.
Eliminating or reducing the fog in an image captured by an ADAS is a difficult and ill-posed problem. Significant developments in mainstream deep learning have paved the way for considerable results in improving vision degraded by fog [1]. The elimination of fog requires an estimate of the depth map. Moreover, systems that use a single image as input need prior assumptions to estimate that depth map.
Recently, many algorithms have been proposed for detecting objects and eliminating fog [1,2]. Traditional algorithms need two crucial elements: information gathered about the environment and a learning stage. In addition, most object detection and fog elimination algorithms are not suited to real-time use because of their considerable computation time.
For this reason, authors in this area of study have introduced deep learning approaches for degraded vision so that images can be restored and reconstructed [1]-[3]. Yet these methods cannot be applied directly to removing fog from an image. F. Hussain and J. Jeong proposed an approach using a Deep Neural Network (DNN). They assumed that the fog in an image can be modeled mathematically by an unknown complex function [4]. Li Chongyi et al. proposed a cascaded CNN model composed of three components: shared hidden layers that extract the common features, a sub-network for the global estimation of the atmospheric light, and a sub-network for the medium transmission [5]. Li Boyi et al. introduced an image dehazing model built with a convolutional neural network (CNN), named the All-in-One Dehazing Network (AOD-Net) [1,6]. It is designed on the basis of a reformulated atmospheric diffusion model. B. Cai et al. proposed an end-to-end trainable model for estimating the medium transmission, called DehazeNet [2]. It takes a hazy image as input and outputs its transmission matrix; the dehazed image is then recovered with the atmospheric diffusion model. W. Ren et al. used a multi-scale CNN (MSCNN) that generates a coarse-scale transmission matrix and then refines it [7].
With regard to video dehazing, most approaches rely heavily on a phase that corrects temporal inconsistencies. Kim et al. suggested inserting temporal coherence into the cost function, with a clock filter to accelerate the processing [8]. Cai et al. devised a spatio-temporal optimization for dehazing video in real time [2].
In recent years, interest has grown in video modeling using CNNs for a large number of tasks. Examples include super-resolution (SR) [9], deblurring [10], classification [11,12] and style transfer [13]. In [9], the authors studied various structure configurations for video SR. In [11,12], the authors made similar attempts by exploring different connectivity options for video classification. Liu et al. introduced an adjustable system in which they placed a spatial alignment network between the images [14]. Others introduced an end-to-end CNN that learns how to accumulate information across several hazy images or video frames [10]. For video style transfer, Chen et al. integrated short-term and long-term coherence and showed the superiority of multi-image approaches over single-image approaches [13].
In this paper, we present a video dehazing system based on an image dehazing model that uses convolutional neural networks (CNN). We then concatenate our model with Faster R-CNN to detect objects on the foggy road in real time. Moreover, we deploy our system to validate the quantitative and visual results obtained.

Network Design
Our proposed work consists of two important steps. First, we developed an end-to-end CNN model [15] that explicitly learns the mapping between the raw images and their associated transmission maps in order to recover fog-free images. Second, we integrated the Faster R-CNN model [16] for object detection in video. The algorithm summarizes the general steps of our model for dehazing and detecting objects in a video.

Atmospheric diffusion model
The atmospheric diffusion (scattering) model is the usual description of how a hazy image is formed. It was proposed by McCartney [17] and later developed by Nayar and Narasimhan [18]. The model is written as follows [1,2,19]:
$$I(x) = J(x)\,t(x) + A\,\bigl(1 - t(x)\bigr) \qquad (1)$$
where I(x) is the observed hazy image, J(x) is the haze-free image recovered once A and t(x) have been estimated, A is the global atmospheric light and t(x) is the transmission map, given by
$$t(x) = e^{-\beta d(x)} \qquad (2)$$
In equation (2), β is the atmospheric diffusion coefficient and d(x) is the distance between the scene point and the camera.
Equation (2) indicates that when d(x) tends to infinity, t(x) approaches zero. With equation (1), we then have:
$$A = I(x), \qquad d(x) \to \infty$$
In reality, d(x) cannot be infinite, but it can be a long distance, which gives a very low transmission t_0. The global atmospheric light A is then estimated by the following formula:
$$A = \max_{y \in \{x \,\mid\, t(x) \le t_0\}} I(y)$$
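The relations above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the thresholds `t0` and `t_min`, and the scalar atmospheric light used in the example are assumptions made only for illustration. The sketch applies equations (1) and (2) to synthesize haze, estimates A from the pixels with the lowest transmission, and inverts equation (1) to recover J(x).

```python
import numpy as np

def transmission(depth, beta):
    """Equation (2): t(x) = exp(-beta * d(x))."""
    return np.exp(-beta * depth)

def add_haze(clean, t, airlight):
    """Equation (1): I(x) = J(x) t(x) + A (1 - t(x))."""
    t3 = t[..., None]                      # broadcast over the colour channels
    return clean * t3 + airlight * (1.0 - t3)

def estimate_airlight(hazy, t, t0=0.1):
    """Per-channel maximum of I(y) over pixels whose transmission is <= t0.

    The threshold t0 is an illustrative assumption.
    """
    mask = t <= t0
    if not mask.any():                     # no pixel is far enough: use the farthest
        mask = t == t.min()
    return hazy[mask].max(axis=0)

def recover(hazy, t, airlight, t_min=0.05):
    """Invert equation (1): J(x) = (I(x) - A) / t(x) + A."""
    t3 = np.maximum(t, t_min)[..., None]   # clamp to avoid dividing by ~0
    return (hazy - airlight) / t3 + airlight

# Small synthetic example: a random "clean" image and a linear depth ramp.
rng = np.random.default_rng(0)
clean = rng.uniform(0.0, 0.8, size=(4, 4, 3))
depth = np.linspace(0.5, 2.5, 16).reshape(4, 4)
t = transmission(depth, beta=1.0)
hazy = add_haze(clean, t, airlight=0.9)
restored = recover(hazy, t, airlight=0.9)
estimated_A = estimate_airlight(hazy, t)
```

With the true t(x) and A, the inversion returns the clean image up to floating-point error. The estimate of A is slightly below the true value here, because even the most distant pixels still carry a small contribution from the scene.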

Construction of the proposed CNN model
The proposed model for an image frame consists of an estimation module, which uses convolutional layers to estimate the transmission map t(x) (see Figure 1), followed by a clean image generation module, which consists of an element-wise multiplication layer and addition layers that generate the recovered image (see Figure 2).
The estimation module is the essential component, responsible for estimating the depth and the relative haze level. As Figure 1 shows, we use five convolutional layers and merge filters of various sizes (see Table 1).
After comparing the results obtained by different CNN architectures (see Table 1), we concluded that the model consisting of five convolutional layers with 3x3 filters is the most efficient, with a PSNR of 19.8231 and an SSIM of 0.8501. Influenced by the multi-scale network of [7], we concatenated the features of the "pool1" layer with those of the "conv1" and "conv2" layers. We did the same for "pool2" with "conv2" and "conv3", and for "pool3" with "conv1", "conv2", "conv3" and "conv4". The model thus captures image features at different scales, and the intermediate connections compensate for the loss of information during convolutions. Notably, each convolutional layer in our proposed model uses only three filters. Consequently, our model is lighter and performs well in terms of PSNR and SSIM in comparison with other existing deep methods.
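To give a sense of how light such a network is, the following sketch counts the trainable parameters of a five-layer model with three filters per layer. It rests on several assumptions that the text does not confirm: all kernels are taken as 3x3, the "pool" layers are read as weight-free concatenation points, and the wiring in `WIRING` is only one plausible reading of the connections described above. The names `WIRING`, `conv_params` and `count_parameters` are hypothetical.

```python
def conv_params(in_channels, out_channels, kernel=3):
    """Weights plus biases of one 2-D convolution layer."""
    return in_channels * out_channels * kernel * kernel + out_channels

FILTERS = 3  # filters per convolutional layer, as stated in the text

# Assumed inputs of each layer; "image" is the 3-channel RGB input.
# conv3 reads conv1+conv2, conv4 reads conv2+conv3, conv5 reads conv1..conv4.
WIRING = {
    "conv1": ["image"],
    "conv2": ["conv1"],
    "conv3": ["conv1", "conv2"],
    "conv4": ["conv2", "conv3"],
    "conv5": ["conv1", "conv2", "conv3", "conv4"],
}

def count_parameters(wiring=WIRING, filters=FILTERS, kernel=3):
    """Return the parameter count of every layer under the assumed wiring."""
    channels = {"image": 3}
    per_layer = {}
    for name, sources in wiring.items():
        in_channels = sum(channels[s] for s in sources)
        per_layer[name] = conv_params(in_channels, filters, kernel)
        channels[name] = filters
    return per_layer

per_layer = count_parameters()
total = sum(per_layer.values())
print(per_layer, total)
```

Under these assumptions the whole estimation module has fewer than a thousand parameters, which is consistent with the claim that the model is light. The real count depends on the kernel sizes listed in Table 1.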

Model of Object detection
Faster R-CNN [16] was proposed to detect objects in images accurately. With ResNet, Faster R-CNN achieved a mean Average Precision (mAP) of 76.4% on the PASCAL VOC 2007 and 2012 datasets (see Table 2). By combining the region proposals and the classifier in one large network, it becomes possible to learn good feature representations for the task automatically.
In this paper, we examine the detection and recognition of objects in the presence of fog, with a view to improving high-level vision tasks by combining detection with the video dehazing model. We opted for Faster R-CNN as the base algorithm for robust object detection in real time (see Table 2), validated on synthetic and natural blurred images.
We modified the Faster R-CNN Caffe source code and used ResNet [20] as the convolutional backbone of Faster R-CNN. We found that using ResNet provides a substantial improvement over other architectures. The time reweighting of the chain gives further improvements.
Table 2 (surviving row). Method: Faster R-CNN-ResNet [20]; mAP: 0.764; FPS: 5; batch size: 1.
Table 2 compares the mAP and the speed in FPS (frames per second), together with the batch size, on the PASCAL VOC dataset for Fast R-CNN, YOLO, SSD300, Faster R-CNN VGG-16 and Faster R-CNN-ResNet. Our chosen detection method (Faster R-CNN-ResNet) surpasses all the other methods in terms of mAP. Although some methods run at higher speeds, they have lower accuracy (mAP). Faster R-CNN-ResNet remains the best of these methods for our real-time setting, reaching a mAP of 76.4%.
Beyond the dehazing part, temporal coherence should also be taken into account for object detection in order to reach satisfactory results. With our proposed model, we therefore adapt the Faster R-CNN model so that it is better suited to video [22]. The first two convolutional layers of the single-image Faster R-CNN model are split into three equivalent branches that receive the previous, current and next frames, in this order. The features of the three frames are concatenated after the second convolutional layer and pass through the remaining layers to predict the bounding boxes of the objects in the current frame.
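The three-frame input described above can be sketched as follows. This is an illustration only: the helper names `frame_triplets` and `fuse_features` are hypothetical, the repetition of border frames at the start and end of a clip is an assumption, and `branch` is a stand-in for the first two convolutional layers (here a single function is reused for all three frames for simplicity).

```python
import numpy as np

def frame_triplets(frames):
    """Yield (previous, current, next) for every frame of a clip.

    At the clip boundaries the border frame is repeated (an assumption).
    """
    n = len(frames)
    for i in range(n):
        previous_frame = frames[max(i - 1, 0)]
        next_frame = frames[min(i + 1, n - 1)]
        yield previous_frame, frames[i], next_frame

def fuse_features(triplet, branch):
    """Run the per-frame branch on each frame and concatenate along channels."""
    return np.concatenate([branch(frame) for frame in triplet], axis=-1)

# Toy clip of four single-channel 2x2 "frames" filled with their index.
frames = [np.full((2, 2, 1), k) for k in range(4)]
triplets = list(frame_triplets(frames))
fused = fuse_features(triplets[1], branch=lambda frame: frame)
```

The fused tensor for frame 1 stacks the features of frames 0, 1 and 2 along the channel axis. In the system described above, this tensor would then pass through the remaining detection layers.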

Dehazing an image
For the dataset, we took around 25,000 images from the NYU2 database [23] to train our model and more than 3,000 images as non-overlapping test data (Test1). We also collected a dataset of natural foggy images to validate and evaluate the performance of our model. First, we synthesized foggy images from equations (1) and (2), using the ground truth images and depth maps of the NYU2 indoor depth database. The input data are RGB images with a resolution of 640 x 480, and the output is a depth map with a resolution of 320 x 240.
About 12 epochs are enough for our model to converge, and it performs well after that. During the learning phase, the weights of our network are initialized randomly. We used ReLU as the activation function because it proved more efficient in our specific context. We opted for the Mean Square Error (MSE) loss function, which promotes the PSNR (Peak Signal-to-Noise Ratio), the SSIM (Structural Similarity) and the visual quality.
We compared our proposed basic model with several dehazing methods, in particular: Automatic Atmospheric Light Recovery (ATM) [24], Boundary Constrained Context Regularization (BCCR) [25], Non-local Image Dehazing (NLD) [26], Fast Visibility Restoration (FVR) [27], Dark Channel Prior (DCP) [28], MSCNN [7], DehazeNet [2], Color Attenuation Prior (CAP) [29] and AOD-Net [1].
Our synthesized hazy images come with ground truth images, which allows us to measure PSNR and SSIM and to check whether the results remain accurate.
As our model is optimized end to end under the MSE loss, it is not surprising that its PSNR performance is superior to that of other methods. Furthermore, even though SSIM is not directly used as an optimization criterion, our model obtains higher SSIM values than the other models compared.
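The link between the MSE loss and PSNR can be made concrete with a short NumPy sketch. The function names are illustrative, images are assumed to be scaled to [0, 1], and `global_ssim` is a deliberate simplification: it evaluates the SSIM formula once over the whole image, whereas the standard metric averages it over local windows, so its values are not comparable with those reported in the tables.

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two images."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.mean((a - b) ** 2))

def psnr(a, b, max_value=1.0):
    """PSNR in dB: 10 * log10(max_value^2 / MSE)."""
    err = mse(a, b)
    if err == 0.0:
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / err))

def global_ssim(a, b, max_value=1.0):
    """SSIM formula applied once to the whole image (simplified, no windows)."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    c1 = (0.01 * max_value) ** 2
    c2 = (0.03 * max_value) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    numerator = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    denominator = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return float(numerator / denominator)

# A uniform error of 0.1 on a [0, 1] image gives an MSE of 0.01, i.e. 20 dB.
reference = np.zeros((8, 8))
degraded = np.full((8, 8), 0.1)
print(psnr(reference, degraded))
```

Because PSNR is a decreasing function of the MSE, minimizing the MSE loss directly maximizes PSNR. On a [0, 1] scale, a PSNR close to 20 dB corresponds to an MSE of roughly 0.01.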
It is well known that SSIM reflects human perception more accurately, because it measures more than pixel-level errors. The consistent SSIM improvements we achieved give us confidence in our model. Table 3 shows the promising performance of our model compared to the others in terms of PSNR and SSIM.
Compared with AOD-Net, our method has an advantage of more than 0.2 dB in PSNR and 0.03 in SSIM.

Video sequences dehazing
First, we created a dataset of synthetic foggy videos based on equation (1), using 20 videos chosen from the TUM RGB-D dataset [30], which consists of different video sequences. The depth information is refined with the filling model of Silberman et al. [31]. We then split our dataset into a training set made up of 12 videos with 120,000 images and a non-overlapping test set called Test2, consisting of 8 short video sequences containing a total of 450 images. Finally, we collected a set of natural hazy video sequences to validate and evaluate the performance of our model.
To train our model, we adopted the mean squared error (MSE) loss, which is aligned with the SSIM, the PSNR and the visual quality.
Owing to its light structure, our proposed model needs only 15 epochs (160,000 iterations) to become stable. Among video-based methods, we compared our model with EVD-Net [19] and STMRF [32] (a method not based on CNN). Our model outperformed both of these approaches, with a difference of 0.5 dB in PSNR and 0.03 in SSIM.

Object detection model
We retrained our adaptive Faster R-CNN model on a training set provided by Foggy Cityscapes [33] and on a set of synthesized data of our own. Foggy Cityscapes is a synthetic foggy dataset that simulates fog on real scenes. The images are rendered using the Cityscapes images and depth maps [8]. It contains 2,975 images in the training set and 500 images in the validation set. In this experiment, we report our findings on the categories person, car, truck, bus, motorcycle and bicycle. We chose the average score of the channel separately for each class, based on validation performance.
On the validation set, we achieve a mAP of 0.777 (see Table 5 for a full breakdown between classes).

The proposed model Architecture for object detection in hazy video
Merging our video dehazing model with the adapted Faster R-CNN model (Figure 5) yields our general model, which naturally displays an interesting, locally linked tree structure and is subjected to joint optimization.
On the validation set of our proposed model, we achieved a mAP of 0.929 (see Table 6).

Deployment of our system
As an embedded platform, we used a Raspberry Pi 3 Model B+. For our experiments, we made predictions on new video sequences using the Raspberry Pi with the camera module v2.1, which is based on the Sony IMX219 CMOS sensor with a resolution of 8 MP (3280 × 2464 pixels), and we loaded our pre-trained model onto the Pi. We used Raspbian Stretch 9, because TensorFlow 1.9 officially supports the Raspberry Pi under Raspbian 9.
We performed tests on images taken during the day and at night, first using the adaptive Faster R-CNN model alone and then using the proposed general model (image dehazing + adaptive Faster R-CNN). Tables 7 and 8 show the results obtained.

The Quantitative results
For the quantitative results, we calculated the mean Average Precision (mAP) across all images, using both the adaptive Faster R-CNN model alone and the dehazing model concatenated with the adaptive Faster R-CNN (see Table 7). During the day, heavy haze degrades the mAP to around 0.27. Adding the dehazing model improves the mAP by 0.12 for detecting objects in light haze, by 0.23 in medium fog and by 0.27 in thick fog. During the night, the dehazing model improves the mAP by 0.28 for detecting objects in light haze and by 0.21 in medium fog (see Table 7).
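For reference, the mAP used above is the mean over classes of the average precision (AP), the area under the precision-recall curve of the ranked detections. The sketch below is a minimal NumPy version and is not the evaluation code used in the experiments. It assumes all-point interpolation in the PASCAL VOC style, since the text does not state which AP variant was used, and it assumes that each detection has already been marked as a true or false positive by an overlap test. The function names and the toy scores are illustrative.

```python
import numpy as np

def average_precision(scores, is_true_positive, num_ground_truth):
    """Area under the precision-recall curve (all-point interpolation)."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    tp = np.asarray(is_true_positive, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(1.0 - tp)
    recall = cum_tp / num_ground_truth
    precision = cum_tp / (cum_tp + cum_fp)
    # Add sentinel points, then make precision non-increasing in recall.
    recall = np.concatenate([[0.0], recall, [1.0]])
    precision = np.concatenate([[0.0], precision, [0.0]])
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    # Sum the rectangles at the positions where recall changes.
    steps = np.nonzero(recall[1:] != recall[:-1])[0]
    return float(np.sum((recall[steps + 1] - recall[steps]) * precision[steps + 1]))

def mean_average_precision(per_class):
    """Mean of the per-class AP values; per_class maps a name to AP arguments."""
    return float(np.mean([average_precision(*args) for args in per_class.values()]))

# Toy example with two classes (scores, true-positive flags, ground-truth count).
detections = {
    "car": ([0.9, 0.8], [1, 1], 2),
    "person": ([0.9, 0.8, 0.7], [1, 0, 1], 2),
}
map_value = mean_average_precision(detections)
```

In this toy example every car detection is correct, so its AP is 1, while the false positive ranked second for the person class lowers that class's AP to 5/6.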

The results visualized
For the visualized results, Table 8 gives a visual comparison of object detection on hazy images. We illustrate five cases:
• During the day:
- Detection of objects in the heavy haze image with adaptive Faster R-CNN before and after dehazing.
- Detection of objects in the medium haze image with adaptive Faster R-CNN before and after dehazing.
- Detection of objects in the light haze image with adaptive Faster R-CNN before and after dehazing.
• At night:
- Detection of objects in the medium haze image with adaptive Faster R-CNN before and after dehazing.
- Detection of objects in the light haze image with adaptive Faster R-CNN before and after dehazing.
The experiments we conducted lead us to this conclusion: as the haze becomes heavier at night, detecting objects becomes less dependable. Above all, under every fog condition, whether light, medium or heavy, our system consistently improves object detection.

Conclusion
In this paper, we presented an intelligent system that aims to enhance visibility in atmospheric fog by combining dehazing and object detection. The proposed system is based on a deep learning model: a light CNN of five convolutional layers that performs well in terms of PSNR and SSIM compared to other existing deep methods.
We modified the Faster R-CNN model, using ResNet as its convolutional backbone. We concluded that using ResNet offers a substantial improvement over other architectures. Further, we concatenated the CNN video dehazing model with Faster R-CNN to obtain a model that is robust compared to the others. Our objective was to detect objects on the road in real time with significant performance in terms of mAP.
Based on the quantitative and visual results, our system demonstrated both efficiency and superiority over other existing systems in terms of PSNR, SSIM, mAP and visual quality.
It should be noted that, although the proposed system proved effective in enhancing visibility for drivers in fog, it has some limitations. First, we did not test the system in a rainy, foggy environment. Second, the system is not applicable in very dense fog. Lastly, it is also not applicable when a large number of objects must be detected. Consequently, in our future work we will seek to enhance the performance of our system with an automated chatbot that interprets the captured image and issues audio warnings in critical situations. This project mainly aims at developing a system for situations in which fog and rain occur at the same time.