Simon Schwaiger, MSc

I am a Lecturer/Researcher at the University of Applied Sciences Technikum Wien and a doctoral student at Graz University of Technology. I work on machine learning and modern control approaches in robotics, alongside personal projects.

OTAS: Open-vocabulary Token Alignment for Outdoor Segmentation

Simon Schwaiger¹,², Stefan Thalhammer², Wilfried Wöber² and Gerald Steinbauer-Wagner¹

This work was supported by the city of Vienna (MA23 – Economic Affairs, Labour and Statistics) through the project Stadt Wien Kompetenzteam für Drohnentechnik in der Fachhochschulausbildung (DrohnFH, MA23 project 35-02).

¹ Graz University of Technology, Faculty of Computer Science and Biomedical Engineering, Institute of Software Engineering and Artificial Intelligence, Inffeldgasse 16b/II, 8010 Graz, Austria

² University of Applied Sciences Technikum Wien, Faculty of Industrial Engineering, Research Group Digital Manufacturing, Automation and Robotics, 1200 Vienna, Austria

schwaige@technikum-wien.at


[Figure: OTAS capability overview (animated GIF)]
[Figure: OTAS semantic mapping demo (animated GIF)]

Abstract

Understanding open-world semantics is critical for robotic planning and control, particularly in unstructured outdoor environments. Existing vision-language mapping approaches typically rely on object-centric segmentation priors, which often fail outdoors due to semantic ambiguities and indistinct class boundaries. We propose OTAS—an Open-vocabulary Token Alignment method for outdoor Segmentation. OTAS addresses the limitations of open-vocabulary segmentation models by extracting semantic structure directly from the output tokens of pre-trained vision models. By clustering semantically similar structures across single and multiple views and grounding them in language, OTAS reconstructs a geometrically consistent feature field that supports open-vocabulary segmentation queries. Our method operates in a zero-shot manner, without scene-specific fine-tuning, and achieves real-time performance of up to ~17 fps. On the Off-Road Freespace Detection dataset, OTAS yields a modest IoU improvement over fine-tuned and open-vocabulary 2D segmentation baselines. In 3D segmentation on TartanAir, it achieves up to a 151% relative IoU improvement compared to existing open-vocabulary mapping methods. Real-world reconstructions further demonstrate OTAS' applicability to robotic deployment. Code and a ROS node are available on GitHub.
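To make the reported metric concrete, the following is a minimal sketch of how intersection over union (IoU) and a relative IoU improvement are commonly computed. It is not taken from the OTAS code base; the function names `iou` and `relative_improvement` and the toy masks are assumptions chosen for illustration, and the numbers are not results from the paper.

```python
def iou(pred, target):
    """Intersection over union of two binary masks given as flat 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    # Convention assumed here: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


def relative_improvement(new, baseline):
    """Relative change of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


# Toy 1x8 masks, invented for illustration only.
target = [1, 1, 1, 1, 0, 0, 0, 0]
baseline_pred = [1, 0, 0, 0, 0, 0, 0, 1]
improved_pred = [1, 1, 1, 0, 0, 0, 0, 0]

iou_baseline = iou(baseline_pred, target)  # 1 overlapping pixel / 5 in the union
iou_improved = iou(improved_pred, target)  # 3 overlapping pixels / 4 in the union
gain = relative_improvement(iou_improved, iou_baseline)

print(f"baseline IoU {iou_baseline:.2f}, improved IoU {iou_improved:.2f}, "
      f"relative gain {gain:.0f}%")
```

Under this definition, a 151% relative improvement means the new IoU is 2.51 times the baseline IoU, not 151 percentage points higher.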


Citation

If you use this work in your research, please cite our paper:

@misc{Schwaiger2025OTAS,
    title               = {OTAS: Open-vocabulary Token Alignment for Outdoor Segmentation},
    howpublished        = {arXiv preprint arXiv:2507.08851},
    author              = {Simon Schwaiger and Stefan Thalhammer and Wilfried Wöber and Gerald Steinbauer-Wagner},
    year                = {2025},
    url                 = {https://arxiv.org/abs/2507.08851}
}