Open-Vocabulary Robotic Object Manipulation using Foundation Models

Abstract

Classical vision-language-action models are limited by unidirectional communication, hindering natural human-robot interaction. The recent CrossT5 embeds an efficient vision-action pathway into an LLM, but it lacks visual generalization, restricting actions to objects seen during training. We introduce OWL×T5, which integrates the OWLv2 object detection model into CrossT5 to enable robot actions on unseen objects. OWL×T5 is trained on a simulated dataset using the NICO humanoid robot and evaluated on the new CLAEO dataset, which features interactions with unseen objects. Results show that OWL×T5 achieves zero-shot object recognition for robotic manipulation while efficiently integrating vision-language-action capabilities.
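For illustration only (this is not the paper's code): OWLv2 is available through the Hugging Face Transformers library, and the sketch below shows how open-vocabulary text queries yield bounding boxes of unseen objects that a downstream action model such as CrossT5 could condition on. The checkpoint name, image file, and text queries are assumptions.

```python
# Illustrative sketch: zero-shot object detection with OWLv2 via
# Hugging Face Transformers. Checkpoint, image, and queries are
# hypothetical; the OWL×T5 integration itself is not shown here.
import torch
from PIL import Image
from transformers import Owlv2Processor, Owlv2ForObjectDetection

processor = Owlv2Processor.from_pretrained("google/owlv2-base-patch16-ensemble")
model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16-ensemble")

image = Image.open("tabletop_scene.jpg")    # hypothetical robot camera frame
queries = [["a red apple", "a blue mug"]]   # open-vocabulary text queries

inputs = processor(text=queries, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw logits/boxes to (score, label, box) triples in image coordinates.
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
results = processor.post_process_object_detection(
    outputs, threshold=0.3, target_sizes=target_sizes
)[0]

for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(f"{queries[0][label.item()]}: score={score:.2f}, box={box.tolist()}")
```

The detected boxes for objects never seen during action training are what allow the action pathway to generalize beyond the training vocabulary.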

Publication
2025 European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning
Cornelius Weber
WP3 Researcher
Stefan Wermter
Networking Lead Expert