From Coursera, Visual Perception for Self-Driving Cars by University of Toronto
https://www.coursera.org/learn/visual-perception-self-driving-cars
Semantic Segmentation
The Semantic Segmentation Problem
- Given an input image, we want to classify each pixel into a set of preset categories. The categories can be static road elements (e.g., roads, sidewalks, traffic lights), dynamic obstacles (e.g., cars, pedestrians, and cyclists), or a background class that encompasses any category not covered by the preset categories.
- Semantic segmentation is still modeled as a function estimation problem.
- Given an image, we take every pixel as an input and output a vector of class scores per pixel.
- A pixel belongs to the class with the highest class score.
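The per-pixel decision can be sketched as follows, assuming the network output is an H×W×K score map; the shapes and random scores below are illustrative, not from the course:

```python
import numpy as np

# Hypothetical network output: a class-score vector per pixel for a
# 4x4 image over K = 3 classes (shape H x W x K).
scores = np.random.rand(4, 4, 3)

# Each pixel is assigned the class with the highest class score.
labels = np.argmax(scores, axis=-1)  # shape (4, 4), values in {0, 1, 2}
```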
Performance of semantic segmentation
- True Positive (TP): the number of correctly classified pixels belonging to class X
- False Positive (FP): the number of pixels that do not belong to class X in the ground truth but are classified as that class by the algorithm
- False Negative (FN): the number of pixels that do belong to class X in the ground truth but are not classified as that class by the algorithm
$$IOU_{class} = \frac{TP}{TP+FP+FN}$$
- Class IOU over all the data is calculated by computing the sums of TP, FP, and FN over all images first
- Averaging the class IOU is usually not a very good idea!
- Cityscapes Segmentation Dataset
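The dataset-level class IoU described above (summing TP, FP, and FN over all images before dividing, rather than averaging per-image IoUs) can be sketched as follows; the tiny label maps are made-up examples:

```python
import numpy as np

def class_iou(preds, gts, cls):
    """Class IoU over a dataset: accumulate TP, FP, and FN across all
    images first, then divide -- not an average of per-image IoUs."""
    tp = fp = fn = 0
    for pred, gt in zip(preds, gts):
        tp += np.sum((pred == cls) & (gt == cls))
        fp += np.sum((pred == cls) & (gt != cls))
        fn += np.sum((pred != cls) & (gt == cls))
    return tp / (tp + fp + fn)

# Two tiny 2x2 "images": predicted and ground-truth label maps.
preds = [np.array([[1, 0], [1, 1]]), np.array([[0, 0], [1, 1]])]
gts   = [np.array([[1, 0], [0, 1]]), np.array([[1, 0], [1, 1]])]
iou = class_iou(preds, gts, 1)  # TP=4, FP=1, FN=1 -> 4/6
```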
ConvNets for Semantic Segmentation
- Semantic segmentation takes a camera images input and provides a category classification for every pixel in that image as output. This problem can be modeled as a function approximation problem, and ConvNets can once again be used to approximate the required function.
- The ConvNet model can be chosen as the same ConvNet model we used for object detection. That is, a feature extractor followed by an output layer. We do not need anchors here as we are not trying to localize objects.
- In VGG architecture,
- the input image is the size of MxNx3
- As with object detection, the resolution will be reduced by half after every pooling layer. On the other hand, the depth will increase due to the convolutional layers.
- The output feature map of the convolutional feature extractor will be downsampled by 16 times.
- We can add as many convolutional pooling blocks as needed, but that will further shrink our feature map.
- However, the output of semantic segmentation must provide a classification for every pixel in the image. How can we achieve this given a downsampled feature map that is 16 times smaller than our original input?
- A simple solution would be the following:
- First, compute the output by passing the 16-times-downsampled feature map through a softmax layer.
- Then, determine the class of every pixel in the subsampled output by taking the highest class score obtained from the softmax layer.
- Finally, upsample the downsampled output back to the original image resolution.
- However, this naive upsampling might produce inadequate results.
- one of the most challenging problems in semantic segmentation is attaining smooth and accurate boundaries around the objects.
- Smooth boundaries are hard to attain, as boundaries in the image space are generally ambiguous, especially for thin objects such as poles, similar-looking objects such as roads and sidewalks, and faraway objects.
- Naive upsampling also induces an additional problem:
- Any object less than 16 pixels in width or height usually disappears entirely in the upsampled image, as it falls below the minimum dimensions required for its pixels to survive from the original image. This affects both thin objects such as poles and faraway objects.
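A toy illustration of this disappearance effect, using strided subsampling as a crude stand-in for the network's 16× downsampling (the label map and pole position are invented):

```python
import numpy as np

# A 16x16 label map with a 1-pixel-wide vertical pole (class 1) on
# background (class 0).
labels = np.zeros((16, 16), dtype=int)
labels[:, 7] = 1

# Naive pipeline at stride 16: predict on the 16x-downsampled grid
# (here simply the top-left pixel of each 16x16 cell), then upsample
# back with nearest neighbor.
stride = 16
coarse = labels[::stride, ::stride]  # 1x1 "prediction"
upsampled = np.repeat(np.repeat(coarse, stride, axis=0), stride, axis=1)

# The thin pole has vanished entirely from the upsampled output.
```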
- To remedy these problems, researchers have formulated what is commonly referred to as a feature decoder.
- The feature decoder can be thought of as a mirror image of the feature extractor.
- Instead of using the convolution pooling paradigm to downsample the resolution, it uses upsampling layers followed by a convolutional layer to upsample the resolution of the feature map.
- The upsampling, usually performed with nearest-neighbor methods, achieves the opposite effect to pooling but results in an inaccurate feature map.
- The following convolutional layers are then used to correct the features in the upsampled feature map with learnable filter banks.
- This correction usually provides the required smooth boundaries as we go forward through the feature decoder.
- Similar to the feature extractor, each upsampling convolution block is referred to as a deconvolution.
- As we go through the first deconvolution block, the input feature map is upsampled to twice the input resolution.
- The depth is controlled by how many filters are defined for each successive convolutional layer.
- As we go forward through the rest of the decoder, we finally arrive at a feature map of similar resolution to the input image.
- The output can be a linear output layer, followed by a softmax function. This layer is very similar to the classification output layer in object detection.
- This layer provides a k-dimensional vector per pixel, with the kth element indicating how confident the neural network is that the pixel belongs to the kth class.
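The nearest-neighbor upsampling at the heart of each deconvolution block can be sketched as follows; a learnable convolution would normally follow to correct the blocky result (the 2×2 feature map is illustrative):

```python
import numpy as np

def nn_upsample(fmap, factor=2):
    """Nearest-neighbor upsampling: repeat every value factor x factor
    times -- the opposite of pooling, yielding a blocky feature map."""
    return np.repeat(np.repeat(fmap, factor, axis=0), factor, axis=1)

fmap = np.array([[1.0, 2.0],
                 [3.0, 4.0]])
up = nn_upsample(fmap)  # 4x4: each input value becomes a 2x2 block
```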
- Classification Loss: $L_{cls}=\frac{1}{N_{total}} \sum_i CrossEntropy(s^*_i,s_i)$
- $N_{total}$ is the total number of classified pixels in all the images of a mini-batch. Usually, a mini-batch comprises 8 to 16 images; the choice of this number depends on how much memory your GPU has.
- $s_i$ is the output of the neural network
- $s^*_i$ is the ground truth classification
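A minimal sketch of this loss, assuming the per-pixel scores $s_i$ are flattened to an $(N_{total}, K)$ array and the ground truth $s^*_i$ to class indices (the numbers below are illustrative):

```python
import numpy as np

def segmentation_loss(scores, gt):
    """Cross-entropy averaged over all N_total classified pixels.

    scores: (N_total, K) raw class scores s_i
    gt:     (N_total,)   ground-truth class indices s*_i
    """
    # Softmax over class scores, then the negative log-likelihood of
    # the ground-truth class, averaged over N_total pixels.
    exp = np.exp(scores - scores.max(axis=1, keepdims=True))
    probs = exp / exp.sum(axis=1, keepdims=True)
    n = len(gt)
    return -np.mean(np.log(probs[np.arange(n), gt]))

scores = np.array([[5.0, 0.0], [0.0, 5.0], [5.0, 0.0]])
gt = np.array([0, 1, 0])
loss = segmentation_loss(scores, gt)  # small: predictions match gt
```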
Semantic Segmentation for Road Scene Understanding
3D Drivable Surface Estimation
Steps:
- Generate semantic segmentation output
- Associate 3D point coordinates with 2D image pixels
- Choose 3D points belonging to the Drivable Surface category
- Estimate 3D drivable surface model
Plane Model: $$ax+by+z=d$$
Least squares formulation: $$p=[a,b,d]^T, \quad p^* = \arg\min_p \|Ap-B\|^2$$ $$A = \begin{bmatrix} x_1 & y_1 & -1 \\ x_2 & y_2 & -1 \\ \vdots & \vdots & \vdots \\ x_N & y_N & -1 \end{bmatrix}, \quad B = \begin{bmatrix} -z_1 \\ -z_2 \\ \vdots \\ -z_N \end{bmatrix}$$
Solution: $p=(A^TA)^{-1}A^TB$
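The fit can be sketched with NumPy: rearranging $ax+by+z=d$ per point gives the row $ax_i + by_i - d = -z_i$. The synthetic plane parameters below are invented, and `np.linalg.lstsq` computes the same minimizer as $(A^TA)^{-1}A^TB$:

```python
import numpy as np

# Synthetic, noiseless points on the plane 0.1x + 0.2y + z = 3,
# i.e. a = 0.1, b = 0.2, d = 3.
rng = np.random.default_rng(0)
x, y = rng.uniform(-10, 10, 50), rng.uniform(-10, 10, 50)
z = 3 - 0.1 * x - 0.2 * y

# One row of A per point: [x_i, y_i, -1], with B_i = -z_i.
A = np.column_stack([x, y, -np.ones_like(x)])
B = -z
p, *_ = np.linalg.lstsq(A, B, rcond=None)  # p = [a, b, d]
```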
- Minimum number of points to estimate model: 3 non-collinear points
- RANSAC Algorithm:
- From the data, randomly select 3 points.
- Compute model parameters a, b, and d using least squares estimation.
- Compute number of inliers, N.
- If N > threshold, terminate and return the computed plane parameters. Else, go back to step 1.
- Recompute the model parameters using all the inliers in the inlier set.
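The loop above can be sketched as follows; the tolerance, iteration count, and synthetic data are illustrative choices, and the plane-equation residual is used as a simple proxy for point-to-plane distance:

```python
import numpy as np

def ransac_plane(points, n_iters=100, inlier_tol=0.05, min_inliers=None, seed=0):
    """RANSAC for the plane ax + by + z = d over an (N, 3) point array."""
    rng = np.random.default_rng(seed)
    if min_inliers is None:
        min_inliers = len(points) // 2  # illustrative stopping threshold
    best_mask = None
    for _ in range(n_iters):
        # 1. Randomly select 3 points.
        sample = points[rng.choice(len(points), 3, replace=False)]
        # 2. Least-squares estimate of p = [a, b, d].
        A = np.column_stack([sample[:, :2], -np.ones(3)])
        p, *_ = np.linalg.lstsq(A, -sample[:, 2], rcond=None)
        a, b, d = p
        # 3. Count inliers via the plane-equation residual.
        resid = np.abs(a * points[:, 0] + b * points[:, 1] + points[:, 2] - d)
        mask = resid < inlier_tol
        if best_mask is None or mask.sum() > best_mask.sum():
            best_mask = mask
        # 4. Terminate once enough inliers are found.
        if mask.sum() >= min_inliers:
            break
    # 5. Recompute the parameters using all points in the inlier set.
    pts = points[best_mask]
    A = np.column_stack([pts[:, :2], -np.ones(len(pts))])
    p, *_ = np.linalg.lstsq(A, -pts[:, 2], rcond=None)
    return p, best_mask

# 80 points on the plane 0.1x + 0.2y + z = 3 plus 20 gross outliers.
rng = np.random.default_rng(1)
x, y = rng.uniform(-10, 10, 100), rng.uniform(-10, 10, 100)
z = 3 - 0.1 * x - 0.2 * y
z[:20] += 5.0  # outliers lifted off the plane
p, inliers = ransac_plane(np.column_stack([x, y, z]))
```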
Semantic Lane Estimation
- Estimate the lane, the area where the car can drive on the drivable surface
- Estimate what is at the boundaries of the lane:
- Curb
- Road
- Car
- Steps:
- Extract segmentation mask from pixels belonging to lane separators such as lane markings or curbs.
- Extract edges from this segmentation mask using an edge detector.
- Linear Lane Model: use the Hough transform to detect lines in the output edge map.
- Filter lines based on slope to remove horizontal lines.
- Remove any line that does not belong to the drivable space.
- Determine which classes occur at the boundary of the lane.
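The slope-based filtering step can be sketched as below; the `[x1, y1, x2, y2]` segment format matches what `cv2.HoughLinesP` returns, and the threshold is an illustrative choice:

```python
import numpy as np

def remove_horizontal(lines, min_abs_slope=0.3):
    """Keep only line segments whose slope magnitude exceeds a
    threshold, discarding near-horizontal detections.

    lines: sequence of segments [x1, y1, x2, y2].
    """
    lines = np.asarray(lines, dtype=float)
    dx = lines[:, 2] - lines[:, 0]
    dy = lines[:, 3] - lines[:, 1]
    slopes = dy / np.where(dx == 0, 1e-9, dx)  # vertical -> huge slope
    return lines[np.abs(slopes) >= min_abs_slope]

segments = [[0, 0, 100, 5],    # near-horizontal: removed
            [50, 0, 60, 100],  # steep lane boundary: kept
            [10, 10, 10, 90]]  # vertical: kept
kept = remove_horizontal(segments)
```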
- Materials:
- Hough Transform Line Detection: https://docs.opencv.org/3.4.3/d9/db0/tutorial_hough_lines.html
- Canny Edge Detection: https://docs.opencv.org/3.4.3/da/d22/tutorial_py_canny.html