Building semantic 3D maps can be valuable for searching offices, warehouses, stores, and homes for objects of interest. We present a multi-camera mapping system that incrementally builds a Language-Embedded Gaussian Splat (LEGS), a detailed 3D scene representation that jointly encodes appearance and semantics. LEGS is trained online as the robot traverses its environment, enabling localization of open-vocabulary object queries. We evaluate LEGS on three room-scale scenes, querying randomly selected objects to assess the system's ability to capture semantic meaning. Comparing our system to LERF on these three scenes, we find that while both systems achieve comparable object-query success rates, LEGS trains over 3.5x faster than LERF. Qualitative results suggest that a multi-camera setup and incremental bundle adjustment improve visual reconstruction quality under constrained robot trajectories, and experimental results suggest LEGS can localize objects with up to 66% accuracy across three large indoor environments and produce high-fidelity Gaussian Splats online by integrating bundle adjustment updates.
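To make the open-vocabulary query step concrete, below is a minimal sketch (not the released implementation) of how one might rank Gaussians by the similarity between their language embeddings and a CLIP text embedding. The arrays gaussian_means and gaussian_lang_feats, and the localize helper, are hypothetical stand-ins for a trained LEGS model's outputs, and the sketch assumes the per-Gaussian language features live in CLIP's embedding space.

import torch
import open_clip

# Load a CLIP text encoder (model choice here is illustrative, not the paper's exact configuration).
model, _, _ = open_clip.create_model_and_transforms("ViT-B-16", pretrained="laion2b_s34b_b88k")
tokenizer = open_clip.get_tokenizer("ViT-B-16")

def localize(query: str, gaussian_means: torch.Tensor, gaussian_lang_feats: torch.Tensor, top_k: int = 100) -> torch.Tensor:
    """Return a coarse 3D location for a text query by averaging the most relevant Gaussian centers.

    gaussian_means:      (N, 3) hypothetical Gaussian centers from a trained LEGS scene.
    gaussian_lang_feats: (N, D) hypothetical per-Gaussian language embeddings (CLIP-aligned).
    """
    with torch.no_grad():
        text_feat = model.encode_text(tokenizer([query]))                  # (1, D) text embedding
    text_feat = torch.nn.functional.normalize(text_feat, dim=-1)
    lang_feats = torch.nn.functional.normalize(gaussian_lang_feats, dim=-1)
    sims = (lang_feats @ text_feat.T).squeeze(-1)                          # cosine similarity per Gaussian
    top = sims.topk(min(top_k, sims.numel())).indices                      # most query-relevant Gaussians
    return gaussian_means[top].mean(dim=0)                                 # centroid as the localization estimate

In practice, the same relevancy scores can also be splatted into 2D renders to visualize a query heatmap over the scene.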
If you use this work or find it helpful, please consider citing:
@article{yu2024language,
  title={Language-Embedded Gaussian Splats (LEGS): Incrementally Building Room-Scale Representations with a Mobile Robot},
  author={Yu, Justin and Hari, Kush and Srinivas, Kishore and El-Refai, Karim and Rashid, Adam and Kim, Chung Min and Kerr, Justin and Cheng, Richard and Irshad, Muhammad Zubair and Balakrishna, Ashwin and Kollar, Thomas and Goldberg, Ken},
  journal={arXiv preprint arXiv:2409.18108},
  year={2024}
}
This work was supported by the Toyota Research Institute (TRI).
yujustin@berkeley.edu, kush_hari@berkeley.edu