Estimating the distance of objects in an image from the camera using depth images and an object detection mechanism

Shivam Khandelwal
14 min read · Apr 28, 2021


This blog is based on my M.Tech thesis.

“Yesterday is history, the future is a mystery, the present is a gift.”

Table of Contents

  1. Introduction
  2. Mapping to a Machine Learning Problem
  3. Data-set Description
  4. Evaluation Metric
  5. Featurization and Exploratory Data Analysis
  6. Machine learning models
  7. Results obtained from machine learning models
  8. Failed experiments and results
  9. Final Model or solution
  10. Future Work
  11. References
  12. Deployment using Flask
  13. GitHub repository
  14. LinkedIn profile

1. INTRODUCTION

The challenge is to use a depth map, and features retrieved from it, to predict the distance of an object in a camera image. Several other features discovered during the investigation will also be discussed, along with the use of an object detection mechanism, which helps detect and localize objects in images. Estimating the distance of objects from a camera remains a hot research topic in computer vision, with several applications. Before going further, it is necessary to understand a few terms. Reliable and accurate range calculation is still a difficult challenge in computer vision. A few methodologies exist for this, but they are very complex, relying on sensors and other heavy machinery to measure the distance of objects. Understanding the distance between two points is critical in robotics for avoiding collisions and picking up objects. Sensors can do this, but they have a number of disadvantages. These constraints, however, can be addressed through image processing.
As a result, it was ultimately decided to use image processing to calculate an object’s distance. Depth images and object detection mechanisms are used heavily to solve this problem and to build a self-contained artificial intelligence.

2. Mapping to a machine learning problem

Without the help of any sensors or heavy hardware, can a robot or a model form a rough idea of its vicinity? The robot needs to learn the spatial location of objects in its vicinity. The proposed mechanism makes use of depth images and an object detection mechanism to build a spatial graph of the environment. In other words, by looking at an image, is it possible to estimate the distance of an object from the camera?

This is a regression problem in machine learning, where the task is to predict a real-valued output: the distance of an object in the image from the camera that captured it. The predicted value should be greater than zero.

3. Dataset Description

A dataset is required before proceeding with any machine learning or deep learning problem. This chapter gives a brief overview of existing datasets and of all the steps taken when constructing a dataset from scratch. Before continuing, it is worth revisiting the problem statement and motivation. Using an object detection methodology and depth images, the goal is to construct a machine learning methodology that can detect the objects present in an image and estimate their distance from the camera, that is, the real distance from the camera to each object at the time the image was captured. There was little prior work or literature addressing distance estimation using machine learning. The scarcity of work and research in this area is what motivates tackling the problem from the ground up. A critical question arises: what kind of dataset will be used to complete this task?

3.1 What the dataset looks like

The dataset must be an image dataset containing one or more objects, along with the distance of each object seen in the image, which should be recorded and will serve as ground truth. To keep track of these records, an Excel file should be maintained with the image name, the object present in that image, and the distance corresponding to that object.

CSV file example

In the above table, the column names are

  1. Image name, which contains the names of the images.
  2. Object present, which contains the objects captured in the image.
  3. Distance, the ground truth value: the real distance of the given object from the camera, recorded at the time the image was captured.
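Based on the example discussed below, an illustrative slice of the file could look like this (the values are taken from the description that follows):

  image name | object present | distance (cm)
  a.png      | bottle         | 120
  a.png      | bag            | 150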

Reading the table row by row gives useful information: image a.png contains an object named bottle that was captured 120 cm away from the camera. The same image a.png appears again, this time for a bag that was 150 cm away from the camera when captured. This is the information that can be derived from the above table, and this was the raw, general picture of how the dataset could look and which columns it should provide to serve our analysis. It also follows that the image a.png contains two objects: a bottle and a bag.

3.1.1 Dataset Creation

The dataset was developed by taking two or three objects and placing them at different distances from the camera, while attempting to maintain relative distances between the objects so that it is clear which object is closer to the camera and which is farther away. The distance of each object from the camera is then recorded, and the image is saved.

The YOLOv4 algorithm was used to detect typical indoor objects.
As previously mentioned, the dataset consists of regular photographs obtained from the camera. These images are fed into the YOLOv4 algorithm, and at the same time the regular images are transformed into depth maps, so the dataset now consists of the original images and their related depth maps or depth images. With the goal of the Excel file described above in mind, the comma-separated values (CSV) file is populated using YOLOv4 and the depth maps, as will be covered in a later chapter. The names of the images were given in the format numerical.png, and their corresponding depth maps in the format numerical depth.png. For example, the regular image name 1.png corresponds to the depth map name 1 depth.png.

Various steps were taken when constructing the dataset, including

  1. attempting to collect at most three data points from a single image;
  2. attempting to preserve the greatest possible relative distance between the objects, so that the depth image would yield clearly different pixel values for different objects;
  3. recording the distance of the different objects, and naming the images and their depth maps numerically to ensure one-to-one correspondence;
  4. retaining the height of objects, in order to include it as one of the input features;
  5. keeping the backdrop a white wall, with all objects on the ground;
  6. not varying the camera configuration;
  7. taking all photographs with the camera at a fixed spot on the floor.

All of these steps, including the conversion of a regular image to a depth map and feeding the image into the YOLOv4 object detection technique, lay the groundwork for the following sections on pre-processing and feature extraction. A total of 390 images were collected and converted into depth maps, and a total of 1080 entries were obtained using YOLOv4, which were helpful in extracting details about the different objects present in the 390 images. Sample images are shown below.

Left: normal image; middle: image obtained from YOLOv4; right: depth map

Let’s visualize the flow diagram of the dataset creation process.

Flow diagram of dataset creation

The depth map was generated by using a pre-trained model that had been trained on the NYU dataset.

The above figure depicts the pipeline used to produce data points from an image. The following observations can be made:
1. An image is captured by the camera.

2. It is then fed into the YOLOv4 object detection mechanism.

3. The original image is also fed into a model pre-trained on the NYU dataset to produce a depth image.

4. The depth images and the YOLOv4 output are used to fill the entries in the table, as sketched below.
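A minimal sketch of this pipeline in code. This is an illustration, not the thesis code: the `detector` and `depth_model` callables are hypothetical stand-ins for the YOLOv4 detector and the NYU-pretrained depth model, and their interfaces are assumptions.

```python
import csv

import cv2

def build_dataset_rows(image_path, depth_path, csv_path, detector, depth_model):
    """Append one CSV row per detected object for a single captured image."""
    image = cv2.imread(image_path)

    # Assumed interface: detector(image) -> [(label, (x, y, w, h)), ...]
    detections = detector(image)

    # Assumed interface: depth_model(image) -> depth map as an image array
    depth_map = depth_model(image)
    cv2.imwrite(depth_path, depth_map)  # save "<numerical> depth.png"

    with open(csv_path, "a", newline="") as f:
        writer = csv.writer(f)
        for label, box in detections:
            writer.writerow([image_path, label, *box])
```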

This covers dataset creation, which lays the groundwork for feature engineering and feature extraction.

4. Evaluation metric

The root mean squared logarithmic error (RMSLE) is the metric used; it appears in many research papers when the predicted values are greater than zero. Its value is non-negative, and model performance is judged by how close it is to zero: the closer RMSLE is to zero, the better the model performs. It is given by
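$$\mathrm{RMSLE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(\log(p_i + 1) - \log(a_i + 1)\bigr)^2}$$

where $p_i$ is the predicted distance, $a_i$ the actual distance, and $n$ the number of data points. A minimal NumPy sketch of this metric:

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error for positive-valued targets."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))
```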

5. Featurization and Exploratory Data Analysis

The steps taken while extracting features are as follows:
• After collecting the image and its depth map, the image is fed into YOLOv4.

• YOLOv4 produces a bounding box for each detected object.

• The region of the depth image under each bounding box is cropped.

• This is how information is collected from a depth image.

Still, a critical question arises: how will feature extraction from depth maps be accomplished, or what methods are appropriate for it? This is illustrated in the figure below.

What to do with the rectangular area of the image?

5.1. Mean of region of interest, bounding box coordinates, and object height as input

The region of interest is the bounding box region obtained using YOLOv4. To use this section of the image, it was first cropped, and then the mean of the cropped region was computed, yielding mean values of r, g, and b, which were used as input features. So, for the first experiment, the mean of the pixel values from the depth map was taken, with the bounding box coordinates added to it. Apart from that, the object height and the bounding box width and height are considered input features. As it turns out, there is a link between bounding box width (w) and bounding box height (h), which is used in the triangle similarity method. As a result, the input features are r, g, b, w, h, and object height. The target is the object’s true distance from the camera. The dataset looks as shown below.

CSV file obtained after processing
Code to extract mean r, g, b from the cropped image
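A minimal sketch of this step, assuming the YOLOv4 box is given as (x, y, w, h) in pixel coordinates and the depth map is stored as a 3-channel image:

```python
import cv2

def mean_rgb_of_box(depth_map_path, box):
    """Crop the bounding box region from a depth map and return its mean (r, g, b)."""
    img = cv2.imread(depth_map_path)   # OpenCV loads images as BGR
    x, y, w, h = box
    crop = img[y:y + h, x:x + w]       # region of interest from the depth map
    b, g, r = cv2.mean(crop)[:3]       # per-channel means of the crop
    return r, g, b
```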

5.1.1. Exploratory Data Analysis

5.1.1.1. Scatter plots: bivariate analysis of pixel values

r pixel value vs distance
  • The plot above is a scatter plot of r, with the x-axis representing r’s pixel value and the y-axis representing the distance value.
  • There appears to be a roughly inverse relationship between r and distance: as the r value increases, the distance value decreases.
  • Though there is a relationship between r and distance, it is non-linear, and this observation will be helpful in further processing.
g pixel value vs distance
  • The plot above is a scatter plot of g, with the x-axis representing g’s pixel value and the y-axis representing the distance value.
  • There appears to be a roughly direct relationship between g and distance: as the g value increases, so does the distance value.
  • The rate of increase is not linear: as the pixel value increases, the distance increases non-linearly. Though there is a relationship between g and distance, it is non-linear, and this observation will be useful in further processing.
  • The plot also shows some anomalies: a small number of points with a large g value at a short distance.
b pixel value vs distance
  • The plot above is a scatter plot of b, with the x-axis representing b’s pixel value and the y-axis representing the distance value.
  • There appears to be a fairly clear inverse relationship between b and distance: as the value of b increases, the value of distance decreases.
  • The rate of change is not linear. Though there is a relationship between b and distance, it is non-linear, and this observation will be useful in further processing.
  • The plot also shows some inconsistencies: a few points with a high b value at a short distance.

5.1.1.2. Multivariate analysis: correlation matrix

The correlation coefficient is a statistical metric used to assess the strength of the relationship between two random variables. Its value ranges from -1 to +1; negative values represent an inverse relationship between the two random variables, while positive values represent a direct relationship. In machine learning terminology, it is useful for characterizing the interaction between two input features.
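A short sketch of how such a matrix can be computed and visualized; the CSV file name and column names here are assumptions based on the features described above:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical file/column names for the processed dataset from section 5.1.
df = pd.read_csv("processed_features.csv")
corr = df[["r", "g", "b", "x", "y", "w", "h", "object_height"]].corr()

sns.heatmap(corr, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation matrix of input features")
plt.show()
```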

Correlation matrix
  1. The combined features r, g, and b were not used together because of their strong mutual dependence, as the following observations show.
  2. The correlation value between r and g is -0.984805, which is close to -1: the two are strongly negatively dependent.
  3. The correlation value between r and b is -0.798360, also a high value, again tending towards negative dependence.
  4. The correlation value between b and g is 0.870295, also a high value, indicating that they are strongly dependent.

The preceding observations confirm that r, g, and b are not worth using as input features, because they are closely related to, and dependent on, each other. As a consequence, taking the mean value is not a good way to continue with modelling, as can be seen later in the results section.

  • The features x, y, w, h, and object height performed significantly well because they are largely independent of each other, as seen in the correlation matrix above.
  • Their pairwise correlation values are close to zero, as seen in the figure above.
  • Adding these features improves the model’s performance dramatically, as seen in the results section.

All of this information is derived from exploratory data analysis, performed on the mean values of the regions of interest extracted from depth images using YOLOv4 object detection.

5.2. Flattening of the cropped image

Can all the pixel values within the bounding box be used instead of the mean of a cropped portion of the image? To do this, the cropped image is transformed to a matrix and then flattened to create a data point.

Flattening of the cropped image

After flattening, each pixel can be used as an input feature. This feature extraction technique was used to make the most of the cropped image.

So the required features are the height of the object, the bounding box coordinates obtained from YOLOv4, and 972 features obtained from flattening the cropped portion of the depth image. A sketch of this step is shown below.
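A minimal sketch of the flattening step. The count of 972 features suggests the crop was resized to a fixed shape before flattening, for example 18 × 18 × 3 = 972; that exact shape is an assumption, not something stated above.

```python
import cv2

def flatten_box(depth_map, box, target=(18, 18)):
    """Crop the bounding box from a depth map, resize it to a fixed shape,
    and flatten it into a 1-D feature vector (18 * 18 * 3 = 972 values)."""
    x, y, w, h = box
    crop = depth_map[y:y + h, x:x + w]
    crop = cv2.resize(crop, target)    # fixed size so every row has 972 features
    return crop.reshape(-1)            # flatten H x W x 3 into one vector
```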

6. Machine learning models

From the data exploration and many experiments, it was observed that non-linear models perform far better than linear models. It was decided to move forward with two models: XGBoost and random forest.

6.1. XGboost

The eXtreme Gradient Boosting (XGBoost) algorithm is a scalable and optimised variant of the gradient boosting algorithm, designed for efficiency, computational speed, and model performance. It is an open-source library from the Distributed Machine Learning Community. XGBoost is a seamless combination of software and hardware capabilities designed to improve on existing boosting methods with high precision in the shortest time. A quick comparison of XGBoost with other gradient boosting algorithms, each trained on a random forest with 500 trees, is shown below.

Code for XGBoost with grid search CV
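A minimal sketch of such a search. The synthetic data stands in for the real feature matrix (972 flattened pixels plus x, y, w, h and object height = 977 columns), and the parameter grid is illustrative, not the exact one used in the thesis:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from xgboost import XGBRegressor

# Synthetic stand-in: 200 rows of 977 features, positive distances in cm.
rng = np.random.default_rng(0)
X = rng.random((200, 977))
y = rng.uniform(50, 400, 200)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [3, 5],
    "learning_rate": [0.05, 0.1],
}
search = GridSearchCV(XGBRegressor(objective="reg:squarederror"), param_grid, cv=3)
search.fit(X_train, y_train)
print(search.best_params_)
```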

6.2. Random Forest

A random forest consists of multiple random decision trees. Two types of randomness are built into the trees. First, each tree is built on a random sample of the original data. Second, at each tree node, a subset of features is randomly selected to generate the best split.

Random forest with grid search CV
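A comparable sketch for the random forest, reusing X_train and y_train from the XGBoost example; again the grid is illustrative:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5],
}
rf_search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid, cv=3)
rf_search.fit(X_train, y_train)    # X_train, y_train as in the XGBoost sketch
print(rf_search.best_params_)
```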

7. Results obtained from various machine learning models

Machine learning model outputs

Observation

  • RMSLE, random forest (test): 0.21366621116016435
  • RMSLE, random forest (train): 0.16829662370607318
  • RMSLE, XGBoost (test): 0.21208309514939927
  • RMSLE, XGBoost (train): 0.1285440295707081

8. Failed experiments and results

MLP and XGBoost comparison

Observation

  • Deep learning techniques did not prove fruitful.
  • The lack of data points seems to be the major reason for this.
  • The machine learning models perform better on this problem.
  • So the final conclusion is to use machine learning rather than deep learning for this case study.
  • A simple MLP was applied to the same features used as input to XGBoost.
  • Linear regression is not a good choice for this dataset, as it underfits badly.
  • Using only the depth image features as input, without the bounding box and object height, degrades performance significantly.

9. Final model or solution

XGBoost and random forest are the final models, as they work significantly better on this dataset.

Error plot: random forest
Error plot: XGBoost

Observation

  • The curves above are the PDFs of the errors obtained from random forest and XGBoost.
  • The mean of the error is centered near zero in both curves.
  • Each curve is somewhat bell-shaped, resembling a normal distribution, which is satisfactory.
  • The output from XGBoost and random forest is quite remarkable, as it satisfies the condition that the mean of the error should be centered at zero. A sketch of how such a plot can be produced follows.
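A minimal sketch of producing such an error-distribution plot, assuming the fitted rf_search and the test split from the earlier sketches:

```python
import matplotlib.pyplot as plt
import seaborn as sns

errors = y_test - rf_search.predict(X_test)    # residuals on the test set
sns.kdeplot(errors, fill=True)                 # smoothed PDF of the error
plt.axvline(0, linestyle="--", color="gray")   # reference line at zero error
plt.xlabel("error (cm)")
plt.title("Error distribution: random forest")
plt.show()
```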

10. Future work

  1. The dataset was created in a single room with a mobile camera, resulting in a small number of images; it should be expanded.
  2. Focus more on feature extraction and feature engineering.
  3. The height of the camera can be considered as an input feature.
  4. From this project, it can be concluded that machine learning or deep learning can solve this problem of distance estimation, but the lack of a dataset is a major issue.

11. References

  1. Zhu et al., “Learning Object-Specific Distance From a Monocular Image,” ICCV 2019. https://openaccess.thecvf.com/content_ICCV_2019/papers/Zhu_Learning_Object-Specific_Distance_From_a_Monocular_Image_ICCV_2019_paper.pdf
  2. https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8559642
  3. Muhammad Abdul Haseeb, Jianyu Guan, Danijela Ristić-Durrant, Axel Gräser, “DisNet: A novel method for distance estimation from monocular camera.”
  4. https://www.pyimagesearch.com/2015/01/19/find-distance-camera-objectmarker-using-python-opencv/
  5. Special thanks and regards to https://www.appliedaicourse.com

12. Deployment using Flask
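The trained model was served as a web application. A minimal Flask sketch of how such a deployment could look; the model file name, route, and payload format are illustrative assumptions, not the exact deployment code:

```python
import pickle

import numpy as np
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical artifact: the trained model saved with pickle.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Expected JSON body: {"features": [...977 values...]}
    features = np.array(request.json["features"]).reshape(1, -1)
    distance = float(model.predict(features)[0])
    return jsonify({"distance_cm": distance})

if __name__ == "__main__":
    app.run(debug=True)
```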

13. GitHub Repository

14. LinkedIn

linkedin.com/in/shivam-khandelwal-b35749107
