From Coursera, Visual Perception for Self-Driving Cars by the University of Toronto
https://www.coursera.org/learn/visual-perception-self-driving-cars
Basics of 3D Computer Vision
The Camera Sensor
Pinhole Camera Model:
- focal length: the distance between the pinhole and the image plane. It defines the size of the object projected onto the image
- camera center: The coordinates of the center of the pinhole. These coordinates define the location on the imaging sensor where the object's projection will appear.
Camera Projective Geometry
Let's define the problem we need to solve: a point $O_{world}$ is defined at a particular location in the world coordinate frame, and we want to project this point from the world frame onto the camera image plane.
- Light travels from the point $O_{world}$ on the object through the camera aperture to the sensor surface.
- The projection onto the sensor surface through the aperture results in flipped images of the objects in the world.
- We need to develop a model for how to project a point from the world frame coordinates x, y, and z to image coordinates u and v:
- First, select a world frame in which to define the coordinates of all objects and the camera.
- define the camera coordinate frame as the coordinate frame attached to the center of our lens aperture, known as the optical center.
- We refer to the parameters of the camera pose as the extrinsic parameters, as they are external to the camera and specific to the location of the camera in the world coordinate frame.
- define image coordinate frame as the coordinate frame attached to our virtual image plane emanating from the optical center. The image pixel coordinate system is attached to the top left corner of the virtual image plane.
- So we need to adjust the pixel locations to the image coordinate frame.
- we define the focal length as the distance between the camera and image coordinate frames along the z-axis of the camera coordinate frame.
- Finally, the projection problem reduces to two steps.
- We first need to project from the world to the camera coordinates, then we project from the camera coordinates to the image coordinates.
- We can then transform image coordinates to pixel coordinates through scaling and offset.
- Computing the projection:
- World -> Camera: $$o_{camera} = [R|t]O_{world}$$
- Camera -> Image: $$o_{image} = \begin{bmatrix} f & 0 & u_0 \\ 0 & f & v_0 \\ 0 & 0 & 1 \end{bmatrix} o_{camera} = Ko_{camera}$$
- K is a 3x3 matrix, which depends on camera intrinsic parameters: camera geometry and the camera lens characteristics
- World -> Image: $$P = K[R|t]$$
- therefore, $o_{image} = PO_{world} = K[R|t]O_{world}$
- Image coordinates to pixel coordinates: $[x\ y\ z]^T \rightarrow [u\ v\ 1]^T = \frac{1}{z}[x\ y\ z]^T$
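As a concrete illustration, here is a minimal numpy sketch of the projection pipeline above. The intrinsic values $(f, u_0, v_0)$, the extrinsic pose $(R, t)$, and the world point are made-up numbers chosen only to show the mechanics.

```python
import numpy as np

# Hypothetical intrinsic parameters: focal length f and principal point (u0, v0).
f, u0, v0 = 800.0, 320.0, 240.0
K = np.array([[f, 0, u0],
              [0, f, v0],
              [0, 0, 1.0]])

# Hypothetical extrinsic parameters: camera aligned with the world frame,
# translated 1.5 m along the z-axis.
R = np.eye(3)
t = np.array([[0.0], [0.0], [1.5]])
P = K @ np.hstack((R, t))            # 3x4 projection matrix P = K[R|t]

# A world point in homogeneous coordinates [X, Y, Z, 1]^T.
O_world = np.array([1.0, 0.5, 4.0, 1.0])

o_image = P @ O_world                # homogeneous image coordinates [x, y, z]^T
u, v = o_image[0] / o_image[2], o_image[1] / o_image[2]   # pixel coordinates
print(u, v)
```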
The digital image:
- an image is represented digitally as an M by N by three array of pixels, with each pixel representing the projection of a 3D point onto the 2D image plane.
Camera Calibration
The camera calibration problem is defined as finding these unknown intrinsic and extrinsic camera parameters, given n known 3D point coordinates and their corresponding projection to the image plane.
- Our approach consists of first estimating the P matrix, and then decomposing it into the intrinsic parameters K and the extrinsic rotation parameters R and translation parameters t.
- Use scenes with known geometry to:
- Correspond 2D image coordinates to 3D world coordinates
- Find the least squares solution (or a non-linear solution) for the parameters of P
- The most commonly used example would be a 3D checkerboard, with squares of known size providing a map of fixed point locations to observe.
- If we have N 3D points and their corresponding N 2D projections, set up a homogeneous linear system
- Solve it with the Singular Value Decomposition (SVD)
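A minimal sketch of this step is shown below, assuming the 2D–3D correspondences are already available as sequences of tuples; the function name `estimate_P` and the input format are illustrative rather than taken from the course.

```python
import numpy as np

def estimate_P(world_pts, image_pts):
    """Direct linear transform: estimate the 3x4 projection matrix P from
    n >= 6 correspondences between 3D world points and 2D image points."""
    A = []
    for (X, Y, Z), (u, v) in zip(world_pts, image_pts):
        Xh = [X, Y, Z, 1.0]
        # Each correspondence contributes two rows to the homogeneous system A p = 0.
        A.append([*Xh, 0, 0, 0, 0, *[-u * c for c in Xh]])
        A.append([0, 0, 0, 0, *Xh, *[-v * c for c in Xh]])
    A = np.asarray(A)
    # The least squares solution with ||p|| = 1 is the right singular vector
    # associated with the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1].reshape(3, 4)
```

Since each correspondence yields two equations and P has twelve entries (eleven degrees of freedom up to scale), at least six points are needed.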
Visual Depth Perception - Stereopsis
Stereo Camera Model
- A stereo sensor is usually created by two cameras with parallel optical axes.
- Given a known rotation and translation between the two cameras and a known projection of a point $O$ in 3D to the two camera frames resulting in pixel locations $O_L$ and $O_R$ respectively, we can formulate the necessary equations to compute the 3D coordinates of the point $O$.
- Assumptions:
- First, we assume that the two cameras used to construct the stereo sensors are identical.
- Second, we will assume that while manufacturing the stereo sensor, we tried as hard as possible to keep the two cameras' optical axes aligned.
- We project the setup to a bird's-eye view for easier visualization.
- Some parameters:
- focal length: the distance between the camera center and the image plane.
- the baseline is defined as the distance along the shared x-axis between the left and right camera centers.
- By defining a baseline to represent the transformation between the two camera coordinate frames, we are assuming that the rotation matrix is identity and there is only a non-zero x component in the translation vector. The $[R|t]$ transformation therefore boils down to a single baseline parameter b.
- Define the quantities to compute:
- We want to compute the x and z coordinates of the point $O$ with respect to the left camera frame.
- The y coordinate can be estimated easily after the x and z coordinates are computed.
- From similar triangles, we get $\frac{Z}{f} = \frac{X}{x_L}$ and $\frac{Z}{f} = \frac{X-b}{x_R}$
- Defining the disparity $d = x_L - x_R$, these equations give $Z = \frac{fb}{d}$, and then $X = \frac{Zx_L}{f}$, $Y = \frac{Zy_L}{f}$
Derive the location of a point in 3D
- Two main problems:
- We need to know $f, b, u_0, v_0$: Use stereo camera calibration
- We need to find corresponding $x_R$ for each $x_L$: Use disparity computation algorithms
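Putting the pieces together, a small numeric sketch of recovering a 3D location from a single stereo match might look like the following; the calibration values and pixel coordinates are made up for illustration.

```python
# Hypothetical stereo calibration values (from stereo camera calibration).
f = 700.0              # focal length [pixels]
b = 0.12               # baseline [meters]
u0, v0 = 320.0, 240.0  # principal point [pixels]

# Hypothetical pixel locations of the same 3D point O in the two images
# (same row because the images are rectified).
u_L, v_L = 400.0, 260.0   # left image
u_R = 365.0               # right image

x_L, y_L = u_L - u0, v_L - v0   # image coordinates relative to the principal point
x_R = u_R - u0

d = x_L - x_R        # disparity
Z = f * b / d        # depth along the left camera z-axis
X = Z * x_L / f
Y = Z * y_L / f
print(X, Y, Z)
```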
Visual Depth Perception - Computing the Disparity
Disparity: the difference in image location of the same 3D point as observed under perspective projection by two different cameras
- Correspond pixels in the left image to those in the right image to find matches
Estimate the disparity through stereo matching
- If we move our 3D point along the line connecting it with the left camera's center, its projection on the left camera image plane does not change.
- However, its projection on the right camera's image plane moves along a horizontal line.
- This is called an epipolar line and follows directly from the fixed lateral offset and image plane alignment of the two cameras in a stereo pair. We can constrain our correspondence search to be along the epipolar line, reducing the search from 2D to 1D.
- One thing to note is that horizontal epipolar lines only occur if the optical axes of the two cameras are parallel.
- In the case of non-parallel optical axes, the epipolar lines are skewed.
- We can use stereo rectification to warp images originating from two cameras with non-parallel optical axes to force the epipolar lines to be horizontal.
- A Basic Stereo Algorithm
- Given: Rectified Images and Stereo Calibration.
- For each epipolar line, take a pixel on this line in the left image and compare it to every pixel in the right image along the same epipolar line.
- Select the right image pixel that matches the left pixel most closely, which can be done by minimizing a matching cost.
- a very simple cost can be the squared difference in pixel intensities.
- Finally, we can compute the disparity by subtracting the right image location from the left one.
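A minimal sketch of this algorithm is given below, assuming rectified grayscale images stored as numpy arrays. For robustness it compares a small window around each pixel rather than a single pixel intensity, which is a common variation on the per-pixel squared difference described above.

```python
import numpy as np

def compute_disparity(img_left, img_right, max_disparity=64, window=5):
    """Naive stereo matching on rectified grayscale images: for each pixel in
    the left image, search along the same row (epipolar line) in the right
    image and keep the shift minimizing the sum of squared differences."""
    h, w = img_left.shape
    half = window // 2
    left = img_left.astype(np.float32)
    right = img_right.astype(np.float32)
    disparity = np.zeros((h, w), dtype=np.float32)
    for y in range(half, h - half):
        for x in range(half, w - half):
            patch_l = left[y - half:y + half + 1, x - half:x + half + 1]
            best_cost, best_d = np.inf, 0
            # The matching pixel in the right image can only lie to the left
            # of x, so only non-negative disparities are searched.
            for d in range(0, min(max_disparity, x - half) + 1):
                patch_r = right[y - half:y + half + 1,
                                x - d - half:x - d + half + 1]
                cost = np.sum((patch_l - patch_r) ** 2)
                if cost < best_cost:
                    best_cost, best_d = cost, d
            disparity[y, x] = best_d
    return disparity
```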
Image filtering
Cross correlation
- The idea for reducing salt-and-pepper noise is to compute the mean of the whole neighborhood and replace the outlier pixel with this mean value: $$G[u,v] = \frac{1}{(2k+1)^2} \sum^k_{i=-k} \sum^k_{j=-k} I[u-i, v-j]$$
- where $(2k+1)$ is the filter size and $(u,v)$ are the coordinates of the center pixel
- The mean equation can be generalized by adding a weight to every pixel in the neighborhood, resulting in cross-correlation. The weight matrix H is called a kernel.
- The kernel could be, for example, a mean filter or a Gaussian filter
- Application:
- Template matching: The pixel with the highest response from Cross-correlation is the location of the template in an image
- Image gradient computation: Define a finite difference kernel, and apply it to the image to get the image gradient
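A minimal sketch of cross-correlation with an arbitrary kernel, together with the two kernels mentioned above (a mean filter and a finite-difference kernel), is shown below; the zero-padded border handling and float32 types are choices made here for simplicity.

```python
import numpy as np

def cross_correlate(image, kernel):
    """Apply a (2k+1)x(2k+1) kernel to a grayscale image by cross-correlation.
    Border pixels where the kernel does not fully fit are left at zero."""
    k = kernel.shape[0] // 2
    img = image.astype(np.float32)
    out = np.zeros_like(img)
    for u in range(k, img.shape[0] - k):
        for v in range(k, img.shape[1] - k):
            patch = img[u - k:u + k + 1, v - k:v + k + 1]
            out[u, v] = np.sum(kernel * patch)
    return out

# Mean filter: every neighbour receives the same weight 1 / (2k+1)^2.
mean_kernel = np.ones((5, 5), dtype=np.float32) / 25.0

# Finite-difference kernel for a horizontal image gradient.
grad_kernel = np.array([[0, 0, 0],
                        [-1, 0, 1],
                        [0, 0, 0]], dtype=np.float32)
```

For template matching, the template itself is used as the kernel and the location of the maximum response is taken as the match.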
Convolution
- A convolution is a cross-correlation where the filter is flipped both horizontally and vertically before being applied to the image
- Unlike cross-correlation, convolution is associative. If H and F are filter kernels, then: $H*(F*I) = (H*F)*I$
- Precompute the filter convolution $(H*F)$, then apply it once to the image to reduce runtime.
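The sketch below illustrates both points with `scipy.signal.convolve2d` using full (uncropped) convolution, so the associativity holds exactly up to floating-point error; the kernels and the random test image are placeholders.

```python
import numpy as np
from scipy.signal import convolve2d

# Convolution is cross-correlation with the kernel flipped both
# horizontally and vertically: H_flipped = H[::-1, ::-1].
H = np.array([[1., 2., 3.],
              [4., 5., 6.],
              [7., 8., 9.]])
F = np.ones((3, 3)) / 9.0            # mean filter
image = np.random.rand(64, 64)

# Associativity: filtering twice equals filtering once with the
# precomputed combined kernel H * F.
two_pass = convolve2d(convolve2d(image, F), H)   # apply F, then H
combined = convolve2d(H, F)                      # precompute H * F (5x5 kernel)
one_pass = convolve2d(image, combined)           # single pass over the image
print(np.allclose(two_pass, one_pass))           # True
```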