LEGS

Incrementally Building Room-Scale Language-Embedded Gaussian Splats (LEGS) with a Mobile Robot

1 The AUTOLab at UC Berkeley

2 The Toyota Research Institute

*Denotes Equal Contribution

IROS 2024 (under review)



Overview

Building semantic 3D maps can be valuable for searching offices, warehouses, stores, and homes for objects of interest. We present a multi-camera mapping system that incrementally builds a Language-Embedded Gaussian Splat (LEGS), a detailed 3D scene representation that encodes both appearance and semantics in a unified representation. LEGS is trained online as the robot traverses its environment, enabling localization of open-vocabulary object queries. We evaluate LEGS on three room-scale scenes, querying randomly selected objects in each scene to assess the system's ability to capture semantic meaning. Compared to LERF on these scenes, LEGS achieves a comparable object-query success rate while training over 3.5x faster. Qualitative results suggest that the multi-camera setup and incremental bundle adjustment improve visual reconstruction quality under constrained robot trajectories, and experimental results suggest that LEGS can localize objects with up to 66% accuracy across three large indoor environments while producing high-fidelity Gaussian Splats online by integrating bundle adjustment updates.
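To make the open-vocabulary querying concrete, here is a minimal illustrative sketch (not the authors' implementation) of how a text query could be scored against per-Gaussian language embeddings to produce heatmap-style activations. The `encode_text` helper, feature dimensionality, and stored feature tensor are assumptions standing in for a CLIP-style text encoder and the embeddings a LEGS-like system would learn.

```python
import numpy as np

def encode_text(query: str, dim: int = 512) -> np.ndarray:
    """Hypothetical stand-in for a CLIP-style text encoder (returns a unit vector)."""
    rng = np.random.default_rng(abs(hash(query)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def query_relevancy(gaussian_lang_feats: np.ndarray, query: str) -> np.ndarray:
    """Cosine similarity between the query embedding and each Gaussian's
    language feature; higher scores correspond to stronger activations."""
    q = encode_text(query, gaussian_lang_feats.shape[1])
    feats = gaussian_lang_feats / np.linalg.norm(gaussian_lang_feats, axis=1, keepdims=True)
    return feats @ q

# Example: score 100k Gaussians for the query "coffee mug" and take the
# highest-scoring Gaussian as a coarse 3D localization of the object.
lang_feats = np.random.default_rng(0).standard_normal((100_000, 512))
scores = query_relevancy(lang_feats, "coffee mug")
best_gaussian = int(np.argmax(scores))
```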

Large-scale language-embedded Gaussian splatting setup. The Gaussian Splat 3D reconstruction is used to render a novel view of a large-scale environment. Given open-vocabulary queries, LEGS localizes the desired objects, as shown by the heatmap activations.