Part 2: Pose Optimization
Once our features are detected and associated, we aim to estimate the movement between the two observations.
Part 1 focused on feature detection and association; we now turn to estimating our position and orientation over time.
Choosing the Pose Model to Optimize
A pose represents the position and orientation of our robot in space. We can express it in whatever form and coordinate system suits us best (I’ve seen spherical representations, but here we prefer Cartesian).
The position of our observer will be a 3-vector, with the origin centered on the starting point of our robot. Orientation is often represented by a rotation matrix or a quaternion. I prefer using quaternions, which are easier to handle than rotation matrices.
In truth, the choice of our pose model will depend mostly on the regression technique we want to apply.
However, it’s worth noting that using a quaternion or a rotation matrix “overparameterizes” the problem: we end up with \(N \geq 4\) variables to define an orientation that has only 3 degrees of freedom (rotations about the \(x\), \(y\), and \(z\) axes in Cartesian coordinates). We’ll revisit this point later, but it is fundamental.
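To make this concrete, here is a minimal C++/Eigen sketch of such a pose model (the struct and field names are mine, for illustration only):

```cpp
#include <Eigen/Core>
#include <Eigen/Geometry>

// A minimal pose model: 3 parameters for the position, 4 for the orientation.
// Note the overparameterization: 7 stored values for only 6 degrees of freedom,
// with the implicit constraint that the quaternion stays unit-norm.
struct Pose {
    Eigen::Vector3d position;       // x, y, z, relative to the starting point
    Eigen::Quaterniond orientation; // w, x, y, z, assumed normalized
};
```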
Isn’t the ‘spherical’ representation of rotation free of this overparameterization? Why not use it?
Indeed, this representation is not overparameterized, but it has its own set of problems. Composing successive rotations is not trivial, and each component is defined on a periodic domain tied to the irrational number \(\pi\).
It turns out that a rotation matrix already encodes a rotation by angles about the X, Y, and Z axes, just in a much more usable form. See this link for more information on the subject!
Pose Optimization
Here, we aim to estimate the transformation between our two sets of associated features, i.e., the rotation and translation to apply to each feature of set 2 to recover set 1. To my knowledge, the result can be expressed either as a pose relative to set 1 or as an absolute pose (relative to the world origin).
Generally, optimizing an absolute pose has fewer applications than recovering a series of relative poses, and it is heavier and (locally) less precise. We still return the absolute pose to the user.
Ideally, our map should be defined in terms of relative poses, so that the positions of map elements can be re-optimized when performing a loop closure.
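To sketch what returning an absolute pose means in this setting, here is how a chain of relative poses can be composed into an absolute one with Eigen (the function name is hypothetical):

```cpp
#include <vector>
#include <Eigen/Geometry>

// Compose a sequence of relative poses into the absolute pose returned to
// the user. Each relative pose is expressed in the frame of the previous one.
Eigen::Isometry3d accumulatePoses(const std::vector<Eigen::Isometry3d>& relativePoses) {
    Eigen::Isometry3d absolute = Eigen::Isometry3d::Identity(); // world origin
    for (const Eigen::Isometry3d& rel : relativePoses)
        absolute = absolute * rel; // right-multiply: apply in chronological order
    return absolute;
}
```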
Optimization Method
Optimization is based on the transformation between two sets of associated features, from which we estimate the parameters of the movement. The literature shows a variety of techniques, most of them built on nonlinear least-squares methods such as gradient descent, Gauss-Newton, bundle adjustment, etc.
One of my favorites is representing the problem as a graph: the pose of our observer becomes a node connected to the previous pose, each pose is connected to its associated features, and the features are connected to each other. Some links are marked as “elastic,” meaning optimizable. Links between features are always considered rigid, as optimizing them has little utility. If we mark all links as elastic except those between poses and their features, we obtain the final pose after optimization. We can also mark the pose-to-feature links as elastic; the optimization will take longer, but we then obtain an error associated with each feature, which can be useful when constructing the map. This algorithm is implemented in g2o under the name SparseOptimizer (see an example of use in Light Visual Tracking (LVT), an efficient visual odometry method for feature-rich environments).
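As a rough illustration, here is how such a graph can be assembled with g2o’s SparseOptimizer (a sketch only: exact type names and the solver construction vary between g2o versions, and the intrinsics, IDs, and measurements below are placeholder values):

```cpp
#include <memory>
#include <Eigen/Core>
#include <g2o/core/sparse_optimizer.h>
#include <g2o/core/block_solver.h>
#include <g2o/core/optimization_algorithm_levenberg.h>
#include <g2o/solvers/eigen/linear_solver_eigen.h>
#include <g2o/types/sba/types_six_dof_expmap.h>

void buildAndOptimizeGraph() {
    g2o::SparseOptimizer optimizer;

    // Levenberg-Marquardt over a 6-DOF-pose / 3D-landmark block structure.
    auto linearSolver = std::make_unique<
        g2o::LinearSolverEigen<g2o::BlockSolver_6_3::PoseMatrixType>>();
    optimizer.setAlgorithm(new g2o::OptimizationAlgorithmLevenberg(
        std::make_unique<g2o::BlockSolver_6_3>(std::move(linearSolver))));

    // One vertex per observer pose; fixing a vertex makes its links "rigid".
    auto* pose = new g2o::VertexSE3Expmap();
    pose->setId(0);
    pose->setFixed(true);              // anchor the graph to the world
    pose->setEstimate(g2o::SE3Quat()); // identity as initial guess
    optimizer.addVertex(pose);

    // One vertex per feature; leaving it free makes its links "elastic".
    auto* point = new g2o::VertexSBAPointXYZ();
    point->setId(1);
    point->setEstimate(Eigen::Vector3d(0.0, 0.0, 1.0)); // placeholder
    point->setMarginalized(true); // landmarks are eliminated in the sparse solve
    optimizer.addVertex(point);

    // Reprojection edge: links the feature to the pose that observed it.
    auto* edge = new g2o::EdgeSE3ProjectXYZ();
    edge->setVertex(0, point);
    edge->setVertex(1, pose);
    edge->setMeasurement(Eigen::Vector2d(320.0, 240.0)); // observed pixel
    edge->setInformation(Eigen::Matrix2d::Identity());
    edge->fx = 525.0; edge->fy = 525.0; edge->cx = 319.5; edge->cy = 239.5;
    optimizer.addEdge(edge);

    optimizer.initializeOptimization();
    optimizer.optimize(10); // up to 10 iterations
}
```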
I have also read a paper proposing a method that optimizes the rotation of the pose first and uses this rotation as a constant for the bundle adjustment optimization of the position (see RGB-D SLAM with Structural Regularities (2021)).
Bundle adjustment involves a series of optimizations, adjusting the final parameters between each iteration. In the case of a surfel-based SLAM (small surfaces around a feature point), surfels belonging to the same “object” can be merged between each optimization step (see BAD SLAM: Bundle Adjusted Direct RGB-D SLAM (2019) for the complete method).
To obtain the relative pose of my robot, I use the Levenberg-Marquardt algorithm.
For my part, I am leaning towards the method described in Using Quaternions for Parametrizing 3D Rotations in Unconstrained Non-Linear Optimization (2001). This method seems more intuitive to me than those presented earlier. Instead of optimizing the quaternion components directly, we optimize displacements on the hyperplane tangent to the unit sphere at our initial quaternion, and map them back to obtain the final quaternion. These displacements are linear and easily optimizable. However, I have some reservations about optimizing rotations that are far apart from each other: in my opinion, the distance on the tangent plane becomes limiting for large rotations.
Simplified Mathematical Formulation
Let’s take our initial quaternion \(k_0 = (k_w, k_x, k_y, k_z)\).
We will calculate a basis \(M\) of the hyperplane from \(k_0\): $$ M = \begin{pmatrix} -\frac{k_x}{k_w} & -\frac{k_y}{k_w} & -\frac{k_z}{k_w} \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} $$ The columns of \(M\) span the hyperplane tangent to the unit sphere at \(k_0\).
Next, we obtain the matrix \(U\) from the singular value decomposition of \(M\): $$ M = U \Sigma V^* $$
Now we choose a three-dimensional vector \(v\), which will be the input of our parametrization. \(U\) is a \(4 \times 3\) matrix. We obtain a transformed quaternion by performing the operations $$ v_4 = U v $$ $$ k_f = \cos{(\lVert v_4 \rVert)} \cdot k_0 + \sin{(\lVert v_4 \rVert)} \cdot \frac{v_4}{\lVert v_4 \rVert} $$ (the division by \(\lVert v_4 \rVert\) keeps \(k_f\) on the unit sphere). Note that setting all components of \(v\) to 0 produces the quaternion \(k_0\).
During optimization, we simply need to transform \(v\) into the quaternion \(k_f\) and use this quaternion for our optimization.
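Here is a sketch of this parametrization with Eigen, following the formulas above (the function name is mine; the SVD serves to orthonormalize the tangent-plane basis):

```cpp
#include <cmath>
#include <Eigen/Core>
#include <Eigen/SVD>

// Map the 3-vector v, living on the hyperplane tangent to the unit sphere
// at k0, to a unit quaternion k_f. Quaternions are stored as (w, x, y, z).
Eigen::Vector4d applyParametrization(const Eigen::Vector4d& k0,
                                     const Eigen::Vector3d& v) {
    const double kw = k0(0), kx = k0(1), ky = k0(2), kz = k0(3);

    // Basis of the hyperplane tangent to the unit sphere at k0.
    Eigen::Matrix<double, 4, 3> M;
    M << -kx / kw, -ky / kw, -kz / kw,
          1, 0, 0,
          0, 1, 0,
          0, 0, 1;

    // Orthonormalize the basis via the SVD M = U * Sigma * V^*.
    Eigen::JacobiSVD<Eigen::Matrix<double, 4, 3>> svd(M, Eigen::ComputeFullU);
    const Eigen::Matrix<double, 4, 3> U = svd.matrixU().leftCols<3>();

    const Eigen::Vector4d v4 = U * v;
    const double n = v4.norm();
    if (n < 1e-12)
        return k0; // v = 0: no displacement, we recover k0

    // Move along the great circle; the result stays on the unit sphere.
    return std::cos(n) * k0 + std::sin(n) * (v4 / n);
}
```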
Scoring
To estimate the quality of our pose estimation, we need to establish a series of metrics applied to each pair of associated points. Our optimization function returns a residual vector of size \(N\), with \(N\) being the number of associated features. For the optimization to be complete, the values of this vector must approach 0 as closely as possible.
Let \(P_W\) be our 3D point in space, and \(P_A\) its associated 2D feature.
Projection of the Reference Point in Camera Coordinates
We begin by transforming our vector \(v\) into a quaternion using the formula seen previously. We then compute the rotation matrix \(R_w\) from our quaternion \(k_f\) and turn it into a world-to-camera transformation using the position \(T_w\): $$ T_s = -R_w^\top \cdot T_w $$ $$ TR_s = \left( R_w^\top \mid T_s \right) $$ We then project our reference point using the transformation matrix \(TR_s\) (a \(3 \times 4\) matrix), which maps points from homogeneous world coordinates to 3D camera coordinates: $$ P_c = TR_s \cdot P_W $$ Let \(d\) be the 3rd component of \(P_c\). $$ P_s = \frac{c_f \cdot P_c}{d} + c_c $$ where \(c_f\) and \(c_c\) are 2D vectors representing the focal length and the principal point (intrinsic parameters of the camera used); the product with \(c_f\) applies component-wise to the first two components of \(P_c\).
Our 2D point \(P_s\) is the projection of our initial point \(P_W\) in camera coordinates.
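A sketch of this projection step with Eigen, following the formulas above (function and variable names are mine):

```cpp
#include <Eigen/Core>

// Project a 3D world point P_W into 2D camera (pixel) coordinates.
// R_w, T_w: orientation and position of the camera in the world.
// cf, cc:   focal lengths and principal point (camera intrinsics).
Eigen::Vector2d projectToScreen(const Eigen::Matrix3d& R_w,
                                const Eigen::Vector3d& T_w,
                                const Eigen::Vector2d& cf,
                                const Eigen::Vector2d& cc,
                                const Eigen::Vector3d& P_W) {
    // World-to-camera transform: P_c = R_w^T * P_W + T_s.
    const Eigen::Vector3d T_s = -R_w.transpose() * T_w;
    const Eigen::Vector3d P_c = R_w.transpose() * P_W + T_s;

    const double d = P_c.z(); // depth along the optical axis
    // Pinhole projection, applied component-wise on x and y.
    return cf.cwiseProduct(P_c.head<2>()) / d + cc;
}
```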
Comparison of the Two Associated Features
We can then compare \(P_s\) and \(P_A\) using a distance metric. The scale of the resulting value is not really important; what matters is that it shrinks toward 0 as the match improves.
I initially used the Manhattan distance, which provides quick and more than satisfactory results. Some papers suggest using the Mahalanobis distance, which is more robust to varied point distributions but more computationally intensive.
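For illustration, the two metrics side by side in Eigen (the covariance \(S\) in the Mahalanobis case is assumed to describe the measurement uncertainty of the feature):

```cpp
#include <cmath>
#include <Eigen/Dense>

// Manhattan distance between the projection P_s and the feature P_A.
double manhattanDistance(const Eigen::Vector2d& Ps, const Eigen::Vector2d& Pa) {
    return (Ps - Pa).cwiseAbs().sum(); // |dx| + |dy|: cheap and effective
}

// Mahalanobis distance: weights the error by the measurement covariance S.
double mahalanobisDistance(const Eigen::Vector2d& Ps, const Eigen::Vector2d& Pa,
                           const Eigen::Matrix2d& S) {
    const Eigen::Vector2d d = Ps - Pa;
    return std::sqrt(d.dot(S.inverse() * d));
}
```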
Weighting of Distances
Our optimization step could stop there, but we can add weighting to further reduce the influence of outliers. We can use a robust estimator such as Huber, Cauchy, or SC-DCS, or a specialized one as presented in At All Costs: A Comparison of Robust Cost Functions for Camera Correspondence Outliers (2015), for example.
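As an example, the Huber estimator boils down to a simple per-residual weight (the threshold is application-dependent; this is the classic reweighting form, not taken from any of the papers above):

```cpp
#include <cmath>

// Huber weight: residuals below the threshold keep their full influence,
// larger ones are down-weighted so they grow linearly instead of quadratically.
double huberWeight(double residual, double threshold) {
    const double r = std::abs(residual);
    return (r <= threshold) ? 1.0 : threshold / r;
}
```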
We can also re-run the optimization after removing the points flagged as outliers during the first pass. The processing will be heavier, but the final estimate will be much better.