Video pre-processing: moving from humans to machines

With the development of autonomous driving (AD) and Advanced Driver Assistance Systems (ADAS), the requirements for storing, processing and transmitting data are rapidly increasing. For example, one minute of raw camera footage at 720p resolution and 30 frames per second takes up almost 5 GB. Higher resolutions such as Full HD and UHD multiply this figure further. Because bandwidth is limited, the data needs to be compressed both for transmission within the car and for sending to the cloud.
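The 5 GB figure can be checked with a short calculation. The sketch below assumes a 1280 × 720 frame and 3 bytes per pixel (8-bit RGB); the text does not state the pixel format, so both values are assumptions.

```python
# Back-of-the-envelope size of uncompressed video footage.
# Assumption: 3 bytes per pixel (8-bit RGB); the article does not state the raw format.
def raw_video_bytes(width, height, fps, seconds, bytes_per_pixel=3):
    """Return the size in bytes of raw footage with the given parameters."""
    return width * height * bytes_per_pixel * fps * seconds


size_720p = raw_video_bytes(1280, 720, fps=30, seconds=60)
size_uhd = raw_video_bytes(3840, 2160, fps=30, seconds=60)

print(f"720p, one minute: {size_720p / 1e9:.2f} GB")
print(f"UHD is {size_uhd // size_720p}x larger at the same frame rate")
```

Under these assumptions one minute of 720p footage comes to about 4.98 GB, and UHD needs nine times as much.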

The most commonly used compression methods are JPEG for images and Motion-JPEG and H.264 for video. In their usual configurations, all three are lossy. More importantly, they are optimized for the human visual system, which means that they discard information in ways that humans may not perceive. In AD and ADAS applications, however, the end receiver of the data is not a human but a machine.

Good quality of the video and image streams from the cameras attached to a vehicle is of paramount importance for tasks involving the perception of the car's surroundings. These computer vision tasks include understanding traffic signs, estimating the distance to other road users and predicting their trajectories. In all these cases, detecting the objects of interest, e.g. other vehicles, pedestrians or traffic lights, is the fundamental first step. It is therefore necessary to understand what impact image and video compression has on the accuracy of computer vision tasks and how this might influence the performance of AI-based systems, particularly in safety-critical applications. In this article, we study how the accuracy of an object detection task decreases as the compression strength increases.

Figure 1: Example of car detection using the YOLOv3 network.

Measuring the accuracy of detection

To measure the detection accuracy on a chosen test dataset, we use the mean average precision (mAP), which is also used in standard benchmarks such as PASCAL VOC [1] and COCO [2]. To understand how this measure works, we need to introduce several terms. For each image, the detector outputs a set of bounding boxes describing the locations of the objects it found. First, we need to check whether the detected bounding boxes align with the ground truth. Intersection over Union (IoU), the area of overlap between a detected box and a ground-truth box divided by the area of their union (see Fig. 2), is used to separate correct from incorrect detections, the so-called True Positives and False Positives. When the detected bounding box strongly overlaps with the ground truth, in our case when IoU exceeds 0.5, the detection is classified as correct. Now, we can define:

  • Precision as the ratio of correct detections to all detections made.
  • Recall as the ratio of correct detections to all object occurrences in the dataset.

To put it differently, Precision is an estimate of the probability that a single detection is correct. Recall, on the other hand, is the expected fraction of the objects that should be detected that actually are detected.
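The two definitions can be written down directly. This is a minimal sketch: the function name and the example counts are hypothetical and chosen only for illustration.

```python
def precision_recall(true_positives, false_positives, num_ground_truth):
    """Return (precision, recall) from detection counts.

    true_positives:   detections matched to a ground-truth object
    false_positives:  detections with no matching ground-truth object
    num_ground_truth: all object occurrences in the dataset
    """
    num_detections = true_positives + false_positives
    precision = true_positives / num_detections if num_detections else 0.0
    recall = true_positives / num_ground_truth if num_ground_truth else 0.0
    return precision, recall


# Hypothetical counts: 100 detections, 80 of them correct, 160 labelled objects.
precision, recall = precision_recall(80, 20, 160)
print(precision, recall)  # 0.8 0.5
```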

Figure 2: Visualization of Intersection over Union (IoU). We classify a detection as correct when IoU is larger than 0.5. [3]
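A minimal sketch of the IoU test follows. It assumes axis-aligned boxes given as (x_min, y_min, x_max, y_max) tuples; the box format and function names are illustrative assumptions, not taken from the benchmark code.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x_min, y_min, x_max, y_max)."""
    # Width and height of the overlapping rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def is_correct(detected_box, ground_truth_box, threshold=0.5):
    """A detection counts as correct when IoU is larger than the threshold."""
    return iou(detected_box, ground_truth_box) > threshold


ground_truth = (0, 0, 10, 10)
print(is_correct((1, 0, 11, 10), ground_truth))  # True: the boxes mostly overlap
print(is_correct((5, 0, 15, 10), ground_truth))  # False: IoU is only 1/3
```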

Starting with a single image and adding images one at a time, we compute a series of Precision and Recall values. In this way, we can plot Precision as a function of Recall separately for each relevant object label in the dataset. Average Precision (AP) is defined as the area under this Precision-Recall curve, which is smoothed so that Precision is monotonically decreasing [4]. This yields a robust accuracy measure for each class of interest: it increases with the number of correct detections and decreases with the number of false detections.
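The sketch below computes AP for one class under two assumptions that are not spelled out in the text: detections are accumulated in order of decreasing confidence, as in the PASCAL VOC evaluation, and the area is taken under the full smoothed curve rather than at a fixed set of Recall points. The input format is hypothetical.

```python
def average_precision(detections, num_ground_truth):
    """AP for a single class.

    detections:       list of (confidence, is_true_positive) pairs
    num_ground_truth: number of labelled objects of this class
    """
    if num_ground_truth == 0 or not detections:
        return 0.0

    # Accumulate detections from most to least confident.
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    precisions, recalls = [], []
    true_positives = 0
    for rank, (_, is_tp) in enumerate(ranked, start=1):
        true_positives += 1 if is_tp else 0
        precisions.append(true_positives / rank)
        recalls.append(true_positives / num_ground_truth)

    # Smoothing: replace each precision by the maximum precision at any
    # higher recall, so the curve never increases.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Area under the resulting step curve.
    area, previous_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        area += (r - previous_recall) * p
        previous_recall = r
    return area


# Three detections (correct, wrong, correct) for two labelled objects.
ap = average_precision([(0.9, True), (0.8, False), (0.7, True)], num_ground_truth=2)
print(round(ap, 3))  # 0.833
```

In the example, the false detection in second place lowers Precision at full Recall to 2/3, which pulls AP below 1.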

Finally, to measure detection accuracy on the whole dataset, we take the mean of the APs obtained for all classes: the mean average precision (mAP).

Benchmarking the impact of compression

In the benchmarks, we used one of the most popular convolutional neural networks (CNNs) for object detection: YOLOv3. YOLOv3 is a one-stage object detector offering an excellent trade-off between accuracy and latency. We evaluated this CNN on a KITTI test dataset [5] consisting of 7481 losslessly compressed images of size 1242 × 375 pixels. The dataset contains 31656 labelled instances of cars, 511 trains and 4709 persons.

While keeping all other parameters fixed, we compressed all test files with JPEG and H.264 encoders using a range of quality factors. We began by encoding all available images at the highest possible quality and evaluating the accuracy of the object detection. Then, we systematically increased the compression strength.
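The JPEG part of such a sweep could look roughly like the sketch below. It assumes the Pillow library is available and uses a small synthetic image as a stand-in for a KITTI frame; the quality values are illustrative and are not the settings used in the benchmark.

```python
import io

from PIL import Image  # assumption: Pillow is installed


def jpeg_sizes(image, qualities):
    """Encode the image as JPEG at each quality factor and return the byte sizes."""
    sizes = {}
    for quality in qualities:
        buffer = io.BytesIO()
        image.save(buffer, format="JPEG", quality=quality)
        sizes[quality] = len(buffer.getvalue())
    return sizes


# Synthetic stand-in for a camera frame (a real run would load the dataset images).
width, height = 128, 64
pixels = bytes(
    (x * 7 + y * 13 + channel * 50) % 256
    for y in range(height)
    for x in range(width)
    for channel in range(3)
)
frame = Image.frombytes("RGB", (width, height), pixels)

# Start at high quality, then increase the compression strength step by step.
sizes = jpeg_sizes(frame, qualities=[95, 75, 50, 25, 10])
for quality, size in sizes.items():
    print(f"quality={quality:3d}  size={size} bytes")
```

In a full benchmark, each encoded image would be decoded again and passed to the detector, and the mAP would be recorded for every quality factor.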

Figure 3: Mean average precision of YOLOv3 on the KITTI test dataset at various compression levels, compared to default JPEG settings.

Fig. 3 presents the benchmarking results. For comparison, we also include the results for JPEG encoding with default parameters and express all other settings as additional compression relative to it. As the data rate savings increase, the accuracy of object detection decreases. Initially this drop is mild, but once the compression reaches a high level, the detection becomes very inaccurate.
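One way to express data rate savings relative to a baseline encoding is sketched below. The function name and the byte counts are made up for illustration; they are not measurements from the benchmark.

```python
def data_rate_savings(baseline_bytes, compressed_bytes):
    """Fraction of data saved relative to a baseline encoding.

    0.0 means the same size as the baseline, 0.75 means the compressed
    data is a quarter of the baseline size.
    """
    if baseline_bytes <= 0:
        raise ValueError("baseline size must be positive")
    return 1.0 - compressed_bytes / baseline_bytes


# Hypothetical sizes: default-JPEG baseline vs. a stronger compression setting.
savings = data_rate_savings(baseline_bytes=200_000, compressed_bytes=50_000)
print(f"{savings:.0%} smaller than the baseline")  # 75% smaller than the baseline
```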

The YOLOv3 network was trained on high-quality JPEG images, and at mild compression levels it performs better on data encoded with JPEG than with H.264. However, as the compression rate increases, H.264 outperforms JPEG, yielding considerably smaller files at an equal detection mAP. We attribute the degradation of the detection accuracy to the loss of data and the appearance of blocking artefacts when low-quality parameters are used. The effect is more severe for JPEG than for H.264. This could stem from the fact that, unlike JPEG, the H.264 standard applies a deblocking filter in addition to the quantization module, which reduces the distortion of the frame.

Conclusion

Because cars collect large volumes of image and video data, efficient compression methods are essential for storing and processing that data at very low latency. Standard encoding algorithms such as JPEG and H.264 were designed for human viewers. Especially when higher data rate savings are required, these encoding algorithms lead to a considerable degradation of the accuracy of machine learning algorithms.

We at Teraki specialize in encoder-decoder processing products optimized for machine perception. Our embedded edge technology overcomes the shortcomings of existing codecs created for the human visual system. Teraki technology is designed to prevent accuracy degradation and to let our customers impose desired accuracy levels by adapting the quality of the incoming video streams.

Sources
1. Everingham M. et al., The Pascal Visual Object Classes Challenge: A Retrospective, International Journal of Computer Vision, 2015, Volume 111, Issue 1, pp 98–136
2. Lin T.-Y. et al., Microsoft COCO: Common Objects in Context, European Conference on Computer Vision, ECCV 2014: Computer Vision – ECCV 2014 pp 740-755
3. Source: https://www.pyimagesearch.com/2016/11/07/intersection-over-union-iou-for-object-detection/
4. Everingham M. et al., The PASCAL Visual Object Classes (VOC) Challenge, International Journal of Computer Vision, 2010, Volume 88, Issue 2, pp 303–338
5. Geiger A. et al., Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite, Conference on Computer Vision and Pattern Recognition (CVPR), 2012
