In recent years, research teams worldwide have developed new methods for simultaneous localization and mapping (SLAM).
These techniques can be used to construct or update maps of a given environment in real time, while simultaneously tracking an artificial agent or robot’s location within these maps.
Most existing SLAM approaches rely heavily on the use of range-based or vision-based sensors, both to sense the environment and a robot’s movements.
These sensors, however, can be very expensive and typically require significant computational power to operate properly.
Aware of these limitations, researchers at the Singapore University of Technology and Design, Southwest University of Science and Technology, the University of Moratuwa and Nanyang Technological University have recently developed a new technique for collaborative SLAM that does rely on range-based or vision-based sensors.
This technique, presented in a paper prepublished on arXiv, could enable more effective robot navigation within unknown indoor environments at a cost significantly lower than that of most previously proposed methods.
“We aimed to utilize low cost and low computational sensor as a replacement for range-based or visual-based sensors,” Chau Yuen, one of the researchers who carried out the study, told TechXplore.
“Since modern buildings typically have Wi-Fi network coverage, our goal is to utilize such freely available information to perform SLAM.”
To exploit radio features that are readily available in most urban environments, the researchers developed an approach for collaborative simultaneous localization and radio fingerprint mapping called C-SLAM-RF.
Their technique works by crowdsensing Wi-Fi measurements in large indoor environments and then using these measurements to generate maps or locate artificial agents.
“Our goal is to generate a radio map of the environment with the least human effort possible by leveraging in-built sensing capabilities of commonly used smart phones,” Yuen explained.
The system developed by Yuen and his colleagues receives information about the strength of the signal coming from pre-existing Wi-Fi access points spread around a given environment, as well as from pedestrian dead reckoning (PDR) processes (i.e., calculations of someone’s current position) derived from a smart phone.
It then uses these signals to build a map of the environment without requiring prior knowledge of the environment or the distribution of the access points within it.
The C-SLAM-RF tool devised by the researchers can also determine whether the robot has returned to a previously visited location, known as “loop closure,” by assessing the similarity between different signals’ radio fingerprints.
“Existing SLAM approaches often use dedicated devices, for example, visual cameras or range-based LIDAR sensors, to measure the similarity of observations by scan matching or feature matching, which are computationally expensive,” U-Xuan Tan, another researcher involved in the study, told TechXplore.
“Our solution exploits the possibility to perform localization and mapping with low-cost, ubiquitous IoT devices like smartphones, due to growing popularity of Wi-Fi wireless networks.”
Yuen, Tan and their colleagues tested their technique in an indoor environment with an area of 130 meters x 70 meters. Their results were highly promising, as their system’s performance exceeded that of several other existing techniques for SLAM, often by a considerable margin.
“We evaluated our approach in a large scale environment and a positioning accuracy of 0.6 meters is achieved without any prior knowledge of the environment,” Ran Liu, another researcher involved in the study, told TechXplore.
“This accuracy outperforms state-of-the-art fingerprinting-based localization approach, which requires a tedious survey of the environment. The computational time required by our approach is insignificant when compared to the range-based or visual SLAM.”
In the future, the approach for collaborative SLAM devised by this team of researchers could help to enhance robot navigation in unknown environments.
In addition, the fact that it does not require the use of expensive sensors and relies on existing Wi-Fi hotspots makes it a more feasible solution for large-scale implementations.
“We would now like to explore the possibility of fusing different radio signals, for example cellular signal, to improve the positioning accuracy,” Yong Liang Guan, another researcher involved in the study.
“Another plan is to use the produced radio map for the purpose of localization.
A combination with other sensors for example laser range finders to accelerate and improve the mapping in large scale environment will be also one of our future research.”
Localization and navigation are the key technologies of autonomous mobile service robots, and simultaneous localization and mapping (SLAM) is considered as an essential basis for this.
The main principle of SLAM is to detect the surrounding environment through sensors on the robot, and to construct the map of the environment while estimating the pose (including both location and orientation) of the robot.
Since SLAM was first put forward in 1988, it was growing very fast, and many different schemes have been formed. Depending on the main sensors applied, there are two mainstream practical approaches: LiDAR (light detection and Ranging)-SLAM and Visual-SLAM.
LiDAR can detect the distance of the obstacles, and it is the best sensor to construct a grid map, which represents the structure and obstacles on the robot running plane.
The early SLAM research often used LiDAR as the main sensor. Extended Kalman filter (EKF) was applied to estimate the pose and of the robot , but the performance was not ideal. For some strong nonlinear systems, this method will bring more truncation errors, which may lead to inaccurate positioning and mapping.
Particle filter approaches [2,3] were introduced because they can effectively avoid the nonlinear problem, but it also leads to the problem of increasing the amount of calculation with the increase of particle number.
In 2007, Grisetti proposed a milestone of LiDAR-SLAM method called Gmapping , based on improved Rao-Blackwellized particle filter (RBPF), it improves the positioning accuracy and reduces the computational complexity by improving the proposed distribution and adaptive re-sampling technique.
As an effective alternative to probabilistic approaches, optimization-based methods are popular in recent years. In 2010, Kurt Konolige proposed such a representative method called Karto-SLAM , which uses sparse pose adjustment to solve the problem of matrix direct solution in nonlinear optimization.
Hector SLAM  proposed in 2011 uses the Gauss-Newton method to solve the problem of scanning matching, this method does not need odometer information, but high precision LiDAR is required. In 2016, Google put forward a notable method called Cartographer , by applying the laser loop closing to both sub-maps and global map, the accumulative error is reduced.
Using visual sensors to build environment map is another hot spot for robot navigation. Compared with LiDAR-SLAM, Visual-SLAM is more complex, because image carries too much information, but has difficulty in distance measurement.
Estimating robot motion by matching extracted image features under different poses to build a feature map is a common method for Visual-SLAM.Mono-SLAM , proposed in 2007, is considered the origin of many Visual-SLAM.
Extended Kalman filter (EKF) is used as the back-end to track the sparse feature points in the front-end. The uncertainty is expressed by a probability density function. From the observation model and recursive calculation, the mean and variance of the posterior probability distribution are obtained. Reference  used RBPF to realize visual-SLAM.
This method avoids the problem of non-linear and has high precision, but it needs a large number of particles, which increase the computational complexity. PTAM  is a representative work of visual-SLAM, it proposed a simple and effective method to extract key frames, as well as a parallel framework of a real-time tracking thread and a back-end nonlinear optimization mapping thread.
It is the first time to put forward the concept of separating the front and back ends, leading the structure design of many SLAM methods.ORB-SLAM  is considered as a milestone of visual-SLAM.
By applying Oriented FAST and Rotated BRIEF (ORB) features and bag-of-words (BOW) model, this method can create a feature map of the environment in real-time stably in many cases.
Loop detection and closing via BOW is an outstanding contribution of this work, it effectively prevents the cumulative error and can be quickly retrieved after the tracking lost.In recent years, different from feature-based methods, some direct methods of visual-SLAM were explored, by estimating robot motion through pixel value directly.
Dense image alignment based on each pixel of the images proposed in Reference  can build a dense 3D map of the environment. The work in Reference  built a semi-dense map by estimating the depth values of pixels with a large gradient in the image.
Engel et al. proposed LSD-SLAM (Large-Scale Direct Monocular SLAM) algorithm . The core of LSD-SLAM algorithm is to apply a direct method to semi-dense monocular SLAM, which is rarely seen in the previous direct method. Forster et al. proposed SVO (Semi-Direct Monocular Visual Odometry)  in 2014, called “sparse direct method”, which combines feature points with direct methods to track some key points (such as corners, etc.), and then estimates the camera motion and position according to the information around the key points.
This method runs fast for Unmanned Aerial Vehicles (UAV), by adding special constraints and optimization to such applications.RGB-D camera can provide both color and depth information in its view field.
It is the most capable sensor for building a complete 3D scene map. Reference  proposes Kinect fusion method, which uses the depth images acquired by Kinect to measure the minimum distance of each pixel in each frame, and fuses all the depth images to obtain global map information.
Reference  constructs error function by using photo-metric and geometric information of image pixels. Camera pose is obtained by minimizing the error function. Map problem is treated as pose graph representation.
Reference  is a better direct RGB-D SLAM method. This method combines the intensity error and depth error of pixels as error functions, and minimizes the cost function to obtain the optimal camera pose. This process is implemented by g2o. Entropy-based key frame extraction and closed-loop detection method are both proposed, thus greatly reducing the path error.
1.3. Multi-Sensor Fusion
Introducing assistant sensor data can improve the robustness of the SLAM system. Currently, for LiDAR-SLAM and Visual-SLAM, the most commonly used assistant sensors are encoder and Inertial Measurement Unit (IMU), which can provide additional motion data of the robot.
SLAM systems [2,5,17,18,19,20] with such assistant sensors usually perform better.
In recent years, based on the works of LiDAR-SLAM and Visual-SLAM, some researchers have started to carry out the research of integrating such two main sensors [21,22,23,24,25]. In , the authors applied a visual odometer to provide initial values for two-dimensional laser Iterative Closets Point (ICP) on a small UAV, and achieved good results in real-time and accuracy.
In , a graph structure-based SLAM with monocular camera and laser is introduced, with the assumption that the wall is normal to the ground and vertically flat.  integrates different state-of-the art SLAM methods based on vision, laser and inertial measurements using an EKF for UAV in indoor.  presents a localization method based in cooperation between aerial and ground robots in an indoor environment, a 2.5D elevation map is built by RGB-D sensor and 2D LiDAR attached on UAV.  provides a scale estimation and drift correction method by combining mono laser range finder and camera for mono-SLAM. In , a visual SLAM system that combines images acquired from a camera and sparse depth information obtained from 3D LiDAR is proposed, by using the direct method.
In , EKF fusion is performed on the poses calculated by LiDAR module and vision module, and an improved tracking strategy is introduced to cope with the tracking failure problem of Vision SLAM. As camera and LiDAR becoming standard configurations for robots, laser-vision fusion will be a hot research topic for SLAM, because it can provide a more robust result for real applications.
1.4. Problems in Application
Generally speaking, LiDAR-SLAM methods build occupy grid map, which is ready for path-planning and navigation control. However, closed-loop detection and correction are needed for larger map building, and that is not easy for a grid map.
Because the scan data acquired are two-dimensional point cloud data, which have no obvious features and are very similar to each other, the closed-loop detection based on scan data directly is often ineffective.
And this flaw also extends to fast relocation function when robot running with a given map. In the navigation package provided by the Robot Operating System (ROS), the robot needs a manually given initial pose before automotive navigation and moving. On the other hand, Most Visual-SLAM approaches generate feature maps, which are good at localization, but not good for path-planning. RGB-D or 3D-LiDAR are capable of building a complete 3D scene map, but it is limited in use because of high calculating or founding cost.For consumer robots, the cost of sensors and processing hardware is sensitive.
Low-cost LiDAR sensors become popular. However, to realize a robust low-cost navigation system is not easy work. Because low-cost LiDAR sensors have much poorer performance in frequency, resolution and precision than normal ones.
In one of our previous work , we have introduced methods to improve the performance of scan-matching for such low-cost LiDAR, however, that only works for adjacent poses. The accumulating errors may usually grow too fast and cause failure to larger map building. To find an effective and robust SLAM and relocation solution with low computation cost and the low founding cost is still a challenging work for commercially-used service robots.
1.5. The Contributions of this Paper
This paper proposes a robust low-cost SLAM frame work, with the combination of low-cost LiDAR and camera. By combining the laser points cloud data and image feature points data as constraints, a graph optimization method is used to optimize the pose of the robot. At the same time, the BoW based on visual features is used for loop closure detection, and then the grid map built by laser is further optimized. Finally, a 2.5D map presenting both obstacles occupy and vision feature is built, which is ready for fast re-localization.The main contribution/advantages of this paper are:
- Proposes a new low-cost laser-vision slam system. For graph optimization, a new cost-function considering both laser and feature constraints. We also added image feature-based loop-closure to the system to remove accumulative errors.
- Proposes a 2.5D map, it is fast in re-localization and loop-closing, compared with feature map, it is ready for path-planning.
The remaining parts of this paper are organized as follows: Section 2 introduces the slam framework based on graph optimization; Section 3 introduces the back-end optimization and loop closing method; Section 4 introduces the 2.5D map and fast relocation algorithm; Section 5 is the experiment; and Section 6 is the conclusion part.
2. SLAM Framework Based on Graph Optimization
A graph-based SLAM framework could be divided into two parts: Front-end and back-end, as shown in Figure 1 below.
Figure 1. A general graph-based light detection and ranging (SLAM) framework.The front-end is mainly used to estimate the position and pose of the robot by sensor data. However, the observed data contain varying degrees of noise whether in images or in laser. Especially for low-cost LiDAR and cameras. The noise will lead to cumulative errors in pose estimation, and such errors will increase with time.
Through filtering or optimization algorithms, the back-end part can eliminate the cumulative errors and improve the positioning and map accuracy.In this paper, graph optimization is used as the back-end, and the error is minimized by searching the descending gradient through nonlinear optimization. Graph optimization simply describes an optimization problem in the form of a graph. The node of the graph represents the position and attitude, and the edge represents the constraint relationship between the position and the attitude and the observation. The sketch map of graph optimization is shown in Figure 2.
Figure 2. Sketch map of graph optimization.In Figure 2, X represents the position and pose of the robot, and Z represents the observation. In this paper, the robot observes the external environment through both LiDAR and camera.
As a result, Z could be a combination of three-dimensional spatial feature points detected by camera or obstacle points detected by LiDAR. Matching result and errors of visual feature points and scan data are an essential part to generate the nodes and constrain for graph optimization.
For the matching of visual feature points, the error is usually represented by re-projection error.
The calculation of re-projection error needs to give two cameras corresponding to adjacent frames, two-dimensional coordinates of matched feature points in two images and three-dimensional coordinates of corresponding three-dimensional space points.
As shown in Figure 3, for the feature point pairs p1 and p2 in adjacent frames, a real-world spatial point P corresponding to p1 could be localized at first, and then P is re-projected onto the latter frame to form the feature point position p^2 of the image. Due to the error of position and pose estimation, the existence of p^2 and p2 is not a coincidence. The distance between these two points could be denoted as a re-projection error.
Pure vision SLAM usually extracts and matches feature points, approaches like EPnP  can be applied to estimate the pose transformation (R,t) of adjacent frames. Where R represents the rotation matrix, and t represents the transformation matrix.
Figure 3. The re-projection error.According to Figure 3, the re-projection errors could be calculated as follows:Firstly, calculate the 3D position [X′,Y′,Z′] of P′ (corresponding point of P) in camera coordinate, through the pose and transformation relation [R,t]:
More information: Collaborative SLAM based on Wifi fingerprint similarity and motion information. arXiv:2001.02759 [cs.NI]. arxiv.org/abs/2001.02759