Visually Augmented Navigation (VAN)

Real-Time Visual Perception

One of the major challenges of a feature-based Simultaneous Localization & Mapping (SLAM) methodology is identifying features from raw sensor data. In man-made structured environments, typically composed of plane, line, and corner primitives, features can be detected relatively easily [1]. However, unstructured natural environments, like the underwater seafloor, pose a more challenging task for feature extraction and matching. In contrast, view-based SLAM methodologies do not explicitly represent features in the environment. Rather, they rely on scan-matching approaches [2] that are data driven and do not require an explicit representation of features. These techniques match raw data directly to extract pose alignment, and have seen recent success when used as constraints in a pose-graph framework [3] [4] [5] [6].
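As a minimal illustration of data-driven scan matching, the rigid alignment between two corresponded point sets can be recovered in closed form via an SVD-based (Kabsch/Horn) solution, which is the least-squares step at the core of iterative scan-matching methods such as ICP. The sketch below is illustrative only; the planar scan data and noise-free correspondences are hypothetical:

```python
import numpy as np

def align_scans(P, Q):
    """Closed-form rigid alignment (Kabsch/Horn): find R, t that map scan P
    onto scan Q, assuming point correspondences are already known."""
    mp, mq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - mp).T @ (Q - mq)                 # cross-covariance of centered scans
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # guard against reflection
    R = Vt.T @ D @ U.T
    t = mq - R @ mp
    return R, t

# Hypothetical scan pair: 100 planar points rotated 30 degrees and translated.
theta = np.deg2rad(30.0)
R_true = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])
t_true = np.array([2.0, -1.0])
P = np.random.default_rng(0).uniform(-5.0, 5.0, size=(100, 2))
Q = P @ R_true.T + t_true
R_est, t_est = align_scans(P, Q)              # recovers the true transform
```

In a real system the correspondences are unknown and noisy, which is why this step is wrapped in an iterative matching loop; the closed-form core, however, is exactly this least-squares alignment.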

VAN fuses camera-based motion estimates derived from overlapping seafloor imagery with vehicle navigation data to constrain navigation error. In this framework, error does not monotonically accumulate with time as it does in dead-reckoned navigation systems; instead, it is bounded by the network topology of camera constraints.
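The bounded-error property can be seen in a toy one-dimensional example: in information (inverse-covariance) form, a chain of odometry links accumulates variance linearly with chain length, while a single "zero-drift" camera constraint tying the last pose back to the first caps it. The pose count and link weights below are hypothetical:

```python
import numpy as np

n = 10                      # poses in the dead-reckoned chain
w_odo = 1.0 / 0.1           # information weight of each odometry link
w_cam = 1.0 / 0.05          # information weight of the camera constraint

def chain_information(camera_link):
    """Information (inverse-covariance) matrix of a 1-D pose chain."""
    L = np.zeros((n, n))
    L[0, 0] += 1e6                                  # strong prior anchors pose 0
    link = np.array([[1.0, -1.0], [-1.0, 1.0]])
    for i in range(n - 1):                          # odometry links i <-> i+1
        L[i:i+2, i:i+2] += w_odo * link
    if camera_link:                                 # camera link pose 0 <-> pose n-1
        L[np.ix_([0, n-1], [0, n-1])] += w_cam * link
    return L

var_dr  = np.linalg.inv(chain_information(False))[n-1, n-1]   # grows with chain length
var_van = np.linalg.inv(chain_information(True))[n-1, n-1]    # bounded by camera link
```

Dead reckoning alone yields a terminal variance equal to the sum of the link variances, while the single camera constraint pulls it down to roughly the parallel combination of the chain and camera variances.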

Our underwater visual navigation approach uses a camera as an accurate and inexpensive perceptual sensor to collect near-seafloor imagery that can be used in a scan-matching capacity. Without any knowledge of extrinsic camera information, robust image registration techniques must rely solely upon encoding features in a viewpoint-invariant way. For example, rotational and scale differences between images render simple correlation-based similarity metrics useless. To overcome these limitations, advanced techniques generally rely on encoding some form of locally invariant feature descriptor, such as differential invariants [7], generalized image moments (Zernike) [8] [9], and affine-invariant regions [10] [11] [12]. However, these higher-order descriptions also tend to be computationally expensive.
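The failure of plain correlation under viewpoint change is easy to demonstrate: the zero-mean normalized cross-correlation of a patch with itself is 1, but a mere 90-degree in-plane rotation of the same patch drives the score toward zero. (The patch here is synthetic noise, purely for illustration.)

```python
import numpy as np

def ncc(a, b):
    """Zero-mean normalized cross-correlation of two equal-size patches."""
    a = a - a.mean()
    b = b - b.mean()
    return float((a * b).sum() / (np.linalg.norm(a) * np.linalg.norm(b)))

patch = np.random.default_rng(1).standard_normal((64, 64))   # synthetic patch
score_same    = ncc(patch, patch)             # identical viewpoint: score of 1
score_rotated = ncc(patch, np.rot90(patch))   # 90-degree rotation: near zero
```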

Fusion of dead-reckoned navigation sensor data with "zero-drift" camera measurements.

INS-fused visual perception: Unlike other structure-from-motion (SFM) research within the computer vision and robotics community [13] [14] [15] [16] [17] [18] [19], our research investigates how information from inertial navigation systems can be probabilistically exploited within a visual perception SLAM framework to obtain computationally efficient, robust image registration metrics. For example, in the case of an instrumented platform with absolute measurements of orientation, we can use sensor-derived information to our advantage to relax the demands of the image feature encoding while making it a more discriminatory metric. Consider the case of complex Zernike moments [20] [21]: rotational invariance requires that we discard the phase and retain only the magnitude; however, if we can pre-compensate for the rotational differences between images, then the phase information can also be used to make the similarity metric more discriminatory [22]. Since attitude sensors provide information on the 3D orientation of cameras in a fixed reference frame, combining this type of a priori pose information with visual vocabulary methods, such as [23] [24] [25], can provide a fast discriminatory metric for identifying candidate image pairs with similar scene content while maintaining robustness to the peculiarities of underwater imaging (e.g., low-contrast imagery, backscatter).
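To sketch the pre-compensation idea, recall that under an in-plane rotation by theta, a complex moment of angular order m picks up a phase factor exp(-i*m*theta). Magnitude-only matching cannot distinguish two scenes whose moments happen to share a magnitude, but with theta supplied by the attitude sensor the phase can be restored and used. The single-moment "descriptors" below are hypothetical stand-ins for a full Zernike expansion:

```python
import numpy as np

# Hypothetical single complex moment of angular order m, standing in for a
# full Zernike descriptor; under rotation by theta it transforms as
# Z -> Z * exp(-1j * m * theta).
m = 2
Z_a = 0.8 * np.exp(1j * 0.3)     # descriptor of scene A
Z_b = 0.8 * np.exp(1j * 2.1)     # a different scene with the SAME magnitude
theta = 0.7                      # relative rotation reported by the attitude sensor
Z_a_rot = Z_a * np.exp(-1j * m * theta)   # scene A observed from the rotated view

# Magnitude-only (rotation-invariant) matching cannot tell A from B...
mag_score = abs(abs(Z_a_rot) - abs(Z_b))                   # ~0: A and B look identical
# ...but pre-compensating the phase with the sensed rotation separates them.
comp_score = abs(Z_a_rot * np.exp(1j * m * theta) - Z_b)   # clearly nonzero
```

The sensed orientation thus lets the similarity metric keep the phase, making it strictly more discriminatory than its rotation-invariant counterpart.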

Pose-Constrained Correspondence Search: When a good prior exists for camera motion, we can exploit it to greatly improve the robustness of putative correspondence selection, which is particularly challenging in SLAM during loop closing. If we extend matching to image sets, then we can perform joint matching across the sets to exploit the fact that the relative-pose prior is well known within each set. We look for correspondences that are jointly compatible across the two sets (similar to the joint compatibility data association test of Neira and Tardós [26]). This effectively increases the signal-to-noise ratio for correspondence establishment, since feature similarity measures can be considered in aggregate. The figure below depicts the pairwise pose-constrained correspondence search (PCCS).

A priori temporal pose information from the navigation sensors can be exploited during image registration to restrict putative correspondence selection.
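In spirit, PCCS restricts putative matches to a pose-predicted search region rather than scanning the entire second image. One common way to express such a gate is a Mahalanobis (chi-square) test against the predicted feature location and its prior covariance; the sketch below assumes this formulation and uses made-up pixel data:

```python
import numpy as np

def pccs_gate(u_pred, S_pred, features, chi2_gate=5.99):
    """Return indices of features that fall inside the Mahalanobis gate of the
    pose-predicted match location u_pred with prior covariance S_pred
    (5.99 is the 95% chi-square gate for 2 degrees of freedom)."""
    r = features - u_pred
    d2 = np.einsum('ij,jk,ik->i', r, np.linalg.inv(S_pred), r)
    return np.nonzero(d2 <= chi2_gate)[0]

# Made-up data: 50 detected features in a 512x512 image; the navigation prior
# predicts the true match (feature 7) to within a few pixels.
rng = np.random.default_rng(2)
features = rng.uniform(0.0, 512.0, size=(50, 2))
u_pred = features[7] + rng.normal(0.0, 1.0, size=2)   # prior-predicted location
S_pred = np.diag([9.0, 9.0])                          # (3 px)^2 prior uncertainty
candidates = pccs_gate(u_pred, S_pred, features)      # few survivors, not all 50
```

Shrinking the candidate set this way is what raises the effective signal-to-noise ratio of the subsequent appearance-based matching.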

Multi-Scalar Representation

Over the years, a number of different approaches have been proposed to curb the computational cost of SLAM map updates. Submaps [27] [28] [29], postponement [30] [31], Rao-Blackwellized approaches [32], covariance intersection [33] [34], and more recent pose-graph approaches [35] [36] [37] [38], all attempt to decompose large-scale mapping into more computationally manageable representations. For example, Eustice et al. showed [39] [40] that for a pose-graph representation, expressing the SLAM estimation problem in an extended information filter (EIF) (i.e., the dual of an extended Kalman filter (EKF)) results in an information matrix (i.e., inverse of the covariance matrix) that is exactly sparse without any approximation. This allows view-based SLAM systems to take advantage of the sparse information parameterization to realize highly efficient, scalable SLAM algorithms that are capable of generating large-area results, such as those shown below.
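The sparsity claim can be checked numerically on a toy pose chain: in information form, each odometry or camera constraint between poses i and j fills in only the corresponding (i, j) entries, so the information matrix stays exactly sparse, whereas its inverse (the equivalent EKF covariance) is fully dense. (Scalar poses and hypothetical weights, for illustration only.)

```python
import numpy as np

n = 8
L = np.zeros((n, n))                      # information matrix over 8 scalar poses
L[0, 0] = 1e3                             # prior on the first pose
link = 10.0 * np.array([[1.0, -1.0], [-1.0, 1.0]])
for i in range(n - 1):                    # odometry chain i <-> i+1 (tridiagonal)
    L[i:i+2, i:i+2] += link
L[np.ix_([1, 6], [1, 6])] += link         # camera loop closure: only 4 entries touched

nnz_info = np.count_nonzero(L)            # 24 of 64 entries: exactly sparse
Sigma = np.linalg.inv(L)                  # equivalent EKF covariance
nnz_cov = np.count_nonzero(Sigma)         # 64 of 64 entries: fully dense
```

Sparse solvers exploit exactly this structure, which is what makes the information parameterization scale to the large-area maps shown below.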

VAN mapping results from a 2004 ROV survey of the RMS Titanic using monocular imagery (previously published by Eustice et al. in [41] [42] [43]). The mapped area encompasses over 3,100 m² and more than 3.4 km of traveled path length.

RMS Titanic: The wreck of the RMS Titanic was surveyed during the summer of 2004 by the deep-sea ROV Hercules, operated by the Institute for Exploration of the Mystic Aquarium. The ROV was equipped with a standard suite of oceanographic dead-reckoned navigation sensors measuring heading, attitude, altitude, and XYZ bottom-referenced Doppler velocities, as well as a pressure sensor for depth. The vehicle surveyed the wreck athwartships while maintaining a constant altitude of approximately 7.5 m above the deck. The survey consisted of a boustrophedon trajectory at a horizontal speed of approximately 10 cm/s. The vehicle was equipped with a calibrated down-looking 12-bit digital-still camera that collected imagery at a rate of 1 frame every 8 seconds. This yielded a sequence of roughly 900 images with slightly over 50% along-track (temporal) overlap and roughly 25% cross-track (spatial) overlap. The figure above shows visual SLAM results from applying our VAN technique to this data set.

Current Research

Many current SLAM methods have the capacity, with extension, to map areas of very large spatial extent. However, the type of representation suitable for very long time durations is unclear. In other words, imagine deploying a SLAM system for weeks or months, as would be the case for an AUV at a seafloor ocean observatory. Most prior SLAM research has focused on mapping and exploration of new areas, and little research has been done regarding long-term localization and mapping within a previously mapped environment. Due to the curse of dimensionality, most statistical representations are limited to a finite description, leaving them to track only the most probable statistics, such as the mean or mode(s) of the distribution. Therefore, when a fault occurs, such as a false loop closure or false data correspondence, the system is typically unable to cope, which then leads to divergence. These types of faults will occur with non-negligible probability given sufficient time. Therefore, we posit that one of the next research challenges in SLAM is long-term autonomy.

Persistent Multi-Scalar SLAM: An ongoing area of research within PeRL is to investigate probabilistic representations for multi-scalar SLAM that are tolerant to false data correspondence and false loop closure, and that can accommodate long-term autonomy. We are investigating hierarchical information-form representations for modeling large-scale topology, combined with a regenerative Monte Carlo approach for multi-hypothesis diagnosis of false loop closures and data correspondences.