Simon Schwaiger, MSc

I am a lecturer and researcher at the University of Applied Sciences Technikum Wien and a doctoral student at Graz University of Technology, working on machine learning and modern control approaches in robotics, as well as on personal projects.

From Words to Poses: Enhancing Novel Object Pose Estimation with Vision Language Models

Tessa Pulli¹, Stefan Thalhammer², Simon Schwaiger² and Markus Vincze¹

¹ Vision for Robotics Laboratory, Automation and Control Institute, TU Wien, Austria

² University of Applied Sciences Technikum Wien, Faculty of Industrial Engineering, 1200 Vienna, Austria

pulli@acin.tuwien.ac.at

Abstract

Robots are increasingly envisioned to interact in real-world scenarios, where they must continuously adapt to new situations. To detect and grasp novel objects, zero-shot pose estimators determine poses without prior knowledge of the object. Recently, vision language models (VLMs) have shown considerable advances in robotics applications by establishing a shared understanding between language input and image input. In our work, we take advantage of the zero-shot capabilities of VLMs and transfer them to 6D object pose estimation. We propose a novel framework for promptable zero-shot 6D object pose estimation using language embeddings. The idea is to derive a coarse location of an object from the relevancy map of a language-embedded NeRF (LERF) reconstruction and to compute the pose estimate with a point cloud registration method. Additionally, we provide an analysis of LERF’s suitability for open-set object pose estimation. We examine hyperparameters, such as activation thresholds for relevancy maps, and investigate the zero-shot capabilities at the instance level and the category level. Furthermore, we plan to conduct robotic grasping experiments in a real-world setting.
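
The two stages described in the abstract (coarse localization from a relevancy map, then pose estimation by point cloud registration) can be illustrated with a small sketch. This is not the authors' implementation: the function names, the threshold value of 0.5, and the synthetic data are assumptions made for illustration, and the registration step is replaced by a plain Kabsch alignment that assumes point correspondences are already known, which a real registration method would have to estimate.

```python
import numpy as np


def crop_by_relevancy(points, relevancy, threshold=0.5):
    """Keep scene points whose relevancy score exceeds the activation threshold.

    `threshold` is a hypothetical value; the abstract treats it as a
    hyperparameter to be examined.
    """
    mask = relevancy > threshold
    return points[mask]


def kabsch(source, target):
    """Least-squares rigid transform (R, t) mapping `source` onto `target`.

    Simplifying assumption: row i of `source` corresponds to row i of `target`.
    """
    src_c = source.mean(axis=0)
    tgt_c = target.mean(axis=0)
    # 3x3 cross-covariance of the centered point sets
    H = (source - src_c).T @ (target - tgt_c)
    U, _, Vt = np.linalg.svd(H)
    # Correct for a possible reflection so that det(R) = +1
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = tgt_c - R @ src_c
    return R, t


def demo():
    """Synthetic example: recover a known pose of a model inside a cluttered scene."""
    rng = np.random.default_rng(0)
    model = rng.uniform(-0.1, 0.1, size=(50, 3))

    # Ground-truth pose: 30 degree rotation about z plus a translation
    angle = np.deg2rad(30.0)
    c, s = np.cos(angle), np.sin(angle)
    R_true = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    t_true = np.array([0.5, -0.2, 0.8])

    # Scene = transformed object points followed by unrelated clutter points
    object_in_scene = model @ R_true.T + t_true
    clutter = rng.uniform(-1.0, 1.0, size=(200, 3))
    scene = np.vstack([object_in_scene, clutter])

    # Made-up relevancy scores: high on the object, low on the clutter
    relevancy = np.concatenate(
        [rng.uniform(0.7, 1.0, size=50), rng.uniform(0.0, 0.3, size=200)]
    )

    # Stage 1: coarse localization by thresholding the relevancy scores
    cropped = crop_by_relevancy(scene, relevancy, threshold=0.5)
    # Stage 2: rigid alignment of the model to the cropped points
    R_est, t_est = kabsch(model, cropped)
    return R_est, t_est, R_true, t_true


if __name__ == "__main__":
    R_est, t_est, R_true, t_true = demo()
    print("rotation error:", np.abs(R_est - R_true).max())
    print("translation error:", np.abs(t_est - t_true).max())
```

In this idealized setting the crop returns exactly the object points in their original order, so the alignment recovers the pose up to numerical precision. With real relevancy maps, the threshold choice decides how much clutter survives the crop, which is why it matters as a hyperparameter.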

Citation

If you use this work in your research, please cite our paper:

@misc{Pulli2024FromWords,
    title               = {From Words to Poses: Enhancing Novel Object Pose Estimation with Vision Language Models},
    eprint              = {2409.05413},
    archivePrefix       = {arXiv},
    author              = {Tessa Pulli and Stefan Thalhammer and Simon Schwaiger and Markus Vincze},
    year                = {2024},
    url                 = {https://arxiv.org/abs/2409.05413}
}
