SLAM - Part 1

Implementing my own localization and mapping system

Part 1: Introduction & Feature Detection

Some people claim to have a poor sense of direction. They might compare themselves to a robot and feel much better.

It’s been a while since I posted anything here! However, I haven’t been inactive, rest assured.

Since September 2020, I’ve been working on creating my own SLAM system, a classic in robotics. This project started with a long phase of research, documentation, and planning, and more recently, the initial implementations, followed by the first results.

In this series of articles, I’ll go back in time a bit and describe the precise functioning of my approach.

For those already familiar with the concept of SLAM, especially feature detection, descriptors, and the association of these features, jump directly to Part 2.

What is SLAM?

Simultaneous Localization And Mapping, or SLAM for short, refers to a system’s ability to map an unknown environment while locating itself within it. We usually think of robots exploring an unknown environment, but many other applications rely on it as well, such as object digitization (see: structure from motion).

What about loop closing?

Imagine leaving your home, taking a tour of the neighborhood, and then returning home. How do you know you’re back home?

Look at a postcard of a familiar place. What visually allows you to know that it’s a familiar place? Could you recognize it if it were, for example, covered in snow or in black and white? There’s a good chance you could.

This problem of recognizing a previously visited place is called the loop closing problem. A SLAM system accumulates position errors which, even if individually small, eventually add up to a significant global error. Loop closing eliminates this drift by recognizing an area already observed in the past (when the error was lower) and adjusting our position, and that of our environment, based on the known one. This reduces the accumulated error and brings the pose and the map back closer to reality.

The SLAM problem is not yet a solved problem; many limitations remain. Notably, long-term localization (life-long mapping), real-time loop closing, and the handling of dynamic environments are major problems that still lack definitive solutions.

Here, we will limit ourselves to visual SLAM, based on images from one or two cameras.

We can divide SLAM approaches into two major categories:

  • Dense: Based on processing all pixels in the image. All information is exploited! The associated processing is generally heavy and not suitable for real-time use.
  • Sparse: Based on features (points, lines, planes, etc.), generally faster.

I decided to go for a sparse SLAM, based on points, lines, planes, and superquadric shapes. I aim for real-time performance (> 30 FPS), and I will use images from an RGB-D camera.

My system will be based on extracting features from each image, followed by a feature matching phase between the features in one image and the features stored in a local map, then a pose optimization. Loop closing and memory mapping will be done later.
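To make the overall pipeline more concrete, here is a rough sketch of what one iteration could look like. The helper names (detect_features, match_with_local_map, optimize_pose) are hypothetical placeholders, not my actual code:

```python
# Rough sketch of one SLAM iteration (hypothetical helpers, not my real API).
def process_frame(rgb, depth, local_map, current_pose):
    # 1. Extract features (points, lines, planes, ...) from the new RGB-D frame.
    features = detect_features(rgb, depth)
    # 2. Match them against the features stored in the local map.
    matches = match_with_local_map(features, local_map, current_pose)
    # 3. Optimize the camera pose so that the matched features line up.
    new_pose = optimize_pose(matches, current_pose)
    # 4. Update the local map with the newly observed features.
    local_map.update(features, new_pose)
    return new_pose
```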

Several iterations will take place:

  • V0:
    • Based on points, lines, planes, and cylinders in a presumed static environment.
    • No loop closing
  • V1:
    • Use of superquadrics (depending on results)
    • Multithreading
  • V2:
    • Loop closing
    • Dynamic environment

In early October 2021, I completed the implementation of the basic functions of V0 (feature detection, pose optimization).

First step: Feature Detection

The detection of feature points is, to my knowledge, one of the oldest topics in computer vision. It is consequently a highly developed field, with many methods, implementations, and plenty of documentation. A wide range of choices!

Detection

First of all, even though the problem of feature point detection is “mastered,” the process remains slow enough to be a limiting factor for real-time applications.

We can do better than simply re-detecting points in every image: once the points have been detected in the first image, we track them with a method called optical flow. This method is faster, and it directly gives us feature matches between consecutive images, since each point is followed from one image to the next.

// TODO IMAGE OPTICAL FLOW
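As a minimal sketch of this idea (using OpenCV’s pyramidal Lucas-Kanade implementation, which is one possible choice, not necessarily my exact setup), tracking could look like this:

```python
import cv2

def track_points(prev_gray, curr_gray, prev_pts):
    """Track keypoints from the previous grayscale frame into the current one
    with pyramidal Lucas-Kanade optical flow (a sketch, not my exact code)."""
    curr_pts, status, _err = cv2.calcOpticalFlowPyrLK(
        prev_gray, curr_gray, prev_pts, None,
        winSize=(21, 21), maxLevel=3)
    ok = status.flatten() == 1
    # Each surviving pair (prev_pts[i], curr_pts[i]) is already a feature
    # match between the two images: no descriptor comparison is needed.
    return prev_pts[ok], curr_pts[ok]

# Usage (prev_gray / curr_gray are assumed to be consecutive grayscale frames):
# prev_pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=500,
#                                    qualityLevel=0.01, minDistance=10)
# old_pts, new_pts = track_points(prev_gray, curr_gray, prev_pts)
```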

To avoid losing all the points when the motion is too large, point detection is re-run on the new image while masking the areas around the points already being tracked (see Keypoint detection in RGBD images based on an Efficient Viewpoint-Covariant Multiscale Representation (2016)).
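A sketch of the masking idea (not the exact method of the paper above): black out a disc around every tracked point, then only detect in the remaining area. The radius and detector choice are illustrative.

```python
import cv2
import numpy as np

def redetect_with_mask(gray, tracked_pts, radius=15, max_new=200):
    """Detect new keypoints only in areas not already covered by tracked points."""
    mask = np.full(gray.shape, 255, dtype=np.uint8)
    for x, y in tracked_pts.reshape(-1, 2):
        # Forbid detection in a disc around every point already being tracked.
        cv2.circle(mask, (int(x), int(y)), radius, 0, thickness=-1)
    return cv2.goodFeaturesToTrack(gray, maxCorners=max_new,
                                   qualityLevel=0.01, minDistance=10,
                                   mask=mask)  # None if nothing new is found
```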

In our SLAM, we will also seek to detect high-level features, such as lines, planes, and cylinders. For lines, I will use LSD (Line Segment Detector). The detection of planes and cylinders will be performed by CAPE, presented in a previous article (De CAPE et d’opes).
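For illustration, OpenCV ships an LSD implementation (it was removed from some releases for licensing reasons and later reintroduced, so this assumes a build where it is available; the file names below are placeholders):

```python
import cv2

gray = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)  # hypothetical input image

# Line Segment Detector: returns segments as (x1, y1, x2, y2) endpoints.
lsd = cv2.createLineSegmentDetector()
lines, widths, precisions, nfas = lsd.detect(gray)

# Draw the detected segments for visual inspection.
vis = lsd.drawSegments(cv2.cvtColor(gray, cv2.COLOR_GRAY2BGR), lines)
cv2.imwrite("lines.png", vis)
```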

We can also detect lines formed by the intersection of two planes, which provide a “strong” constraint during pose optimization. Some publications suggest using “supposed” planes and lines (see Point-Plane SLAM Using Supposed Planes for Indoor Environments (2019)), but their results are only evaluated in orthogonal environments, where the angles between two planes are close to 90° and assumed to be perpendicular. This assumption works well in an orthogonal environment (inside a building, for example), but its results are limited in non-orthogonal environments (most of the planet!).

Descriptors

To associate one point with another, we need to compare points and quantify their similarity. For this purpose, we use descriptors, which represent a point and the space immediately around it, and allow points to be compared consistently.

There are many types of descriptors, each with different characteristics, computation costs, and comparison methods. What interests me here is a descriptor that is quick to compute and fast to compare. Rotation and scale invariance are not critical in our case: at 30 frames per second, descriptors are updated regularly, so their precision is secondary.
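As an example of such a descriptor, ORB is binary, cheap to compute, and compared with a simple Hamming distance (one possible choice among many, not necessarily the one my SLAM will keep; the file names are placeholders):

```python
import cv2

gray1 = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)  # hypothetical inputs
gray2 = cv2.imread("frame2.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=1000)
kp1, desc1 = orb.detectAndCompute(gray1, None)
kp2, desc2 = orb.detectAndCompute(gray2, None)

# Brute-force matching on Hamming distance; crossCheck keeps mutual best matches.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(desc1, desc2), key=lambda m: m.distance)
```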

Many of these descriptors rely only on the 2D information of the point and its surroundings, whereas we could also use the depth measurement provided by our camera. For some examples of more advanced point descriptors, take a look at:

Of course, in our case, we have features other than points (currently, planes and lines). Descriptors exist for lines and planes, but this field is less developed than that of point descriptors. In our real-time case, using descriptors for planes and lines is not necessary: we can safely assume that a feature detected in one image will reappear, at least partially, in the next image. The association can therefore be done reliably by comparing the “strong” characteristics of our features. For planes, we can, for example, compare the normal vectors of the surfaces, compute the IoU (Intersection over Union) of the two observed surfaces, and so on. For lines, we limit ourselves to comparing the endpoints, orientation, position in the image, etc.
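As a rough sketch of such a plane comparison (the thresholds, and the idea of combining a normal angle with a mask IoU, are illustrative rather than my final criteria):

```python
import numpy as np

def planes_match(normal_a, normal_b, mask_a, mask_b,
                 max_angle_deg=15.0, min_iou=0.3):
    """Decide whether two plane observations from consecutive frames are the
    same plane: close normals and sufficient overlap of their image masks."""
    # Angle between the two unit normals (orientation sign ignored).
    cos_angle = np.clip(abs(np.dot(normal_a, normal_b)), 0.0, 1.0)
    angle = np.degrees(np.arccos(cos_angle))

    # Intersection over Union of the two binary masks (same image size).
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    iou = inter / union if union > 0 else 0.0

    return angle < max_angle_deg and iou > min_iou
```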

Yet, even if this association works, our features still lack a descriptor that would let us recognize them during a loop closing. In that case, the normals will probably not be the same, the IoU no longer makes much sense, and neither do the endpoints. We will therefore still use descriptors for these “high-level” features.

For planes, a remarkably effective technique uses an autoencoder to build a very robust plane descriptor (see PlaneMatch: Patch Coplanarity Prediction for Robust RGB-D Reconstruction (2018)). For lines, common methods exist (like the LSD Line Segment Descriptor, not to be confused with the LSD Line Segment Detector), but I am also considering a machine learning-based approach: ML has long demonstrated its effectiveness over “hand-crafted” features.

Second Step: Feature Association

We’re getting into the interesting part. We now need to associate the features detected in our images. This step is crucial because even a minor error will, at best, slow down pose optimization and, at worst, distort the result.

We have already mentioned several ways to associate “high-level” features like lines and planes in the previous section. Despite their simplicity, these methods are reliable, because there are generally very few high-level features in an image: fewer planes than lines, and fewer lines than points.

For points, I use RANSAC, already mentioned in a different post (Monocular Depth Map). Using this method for point association is fast and effective, but it relies on the descriptors computed earlier. Since these descriptors are not always reliable, I added a maximum search radius around each point to be matched, to exclude candidates that are too far away. This approximation only holds for real-time applications, where points move little between images.
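A sketch of this association step, assuming binary descriptors: keep only descriptor matches that stay within a pixel radius, then let RANSAC (here via a fundamental-matrix fit, one possible geometric model) reject the remaining outliers. The radius and thresholds are illustrative.

```python
import cv2
import numpy as np

def match_points(kp1, desc1, kp2, desc2, max_radius=50.0):
    """Match keypoints between two frames: descriptor matching, a search-radius
    filter, then RANSAC-based outlier rejection (a sketch, values illustrative)."""
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(desc1, desc2)

    # In real time the motion between frames is small, so reject matches whose
    # points are far apart in image space.
    good = [m for m in matches
            if np.linalg.norm(np.subtract(kp1[m.queryIdx].pt,
                                          kp2[m.trainIdx].pt)) < max_radius]
    if len(good) < 8:
        return good

    pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in good])
    # RANSAC drops matches inconsistent with a single epipolar geometry.
    _F, inliers = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 3.0, 0.99)
    if inliers is None:
        return good
    return [m for m, keep in zip(good, inliers.ravel()) if keep]
```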

Dynamic Environment

Before moving on, we need to mention an important point: dynamic environments. So far, we have assumed that our environment is static (i.e., each feature is invariant from one image to the next). This assumption is false in 99% of real cases.

To estimate our position, we rely on a rigid optimization, which is disturbed by moving features. Eliminating these moving features in real time is not easy, as their movement is generally slow compared to the image refresh rate.

Some approaches rely on the detection of known dynamic objects (humans, cars, animals, etc.) to exclude them from the optimization (see Toward Real-time Semantic RGBD-SLAM in Dynamic environments (2021)). The limitation of this approach is, of course, that not all moving objects are in the training database. The researchers add a detection criterion that I find particularly interesting: assuming that a dynamic element always has a surface, they segment the image into detection zones and determine over time which zones contain dynamic objects. These zones are then entirely excluded from pose optimization and mapping.

I am still too early in this project to try anything other than assuming that the environment is static. This processing will be implemented in V1.

See you in Part 2 to discuss pose optimization!

