Open-Vocabulary Robotic Object Manipulation using Foundation Models

Abstract

Classical vision-language-action models are limited by unidirectional communication, hindering natural human-robot interaction. The recent CrossT5 embeds an efficient vision-action pathway into an LLM, but it lacks visual generalization, restricting actions to objects seen during training. We introduce OWL×T5, which integrates the OWLv2 object detection model into CrossT5 to enable robot actions on unseen objects. OWL×T5 is trained on a simulated dataset using the NICO humanoid robot and evaluated on the new CLAEO dataset, which features interactions with unseen objects. Results show that OWL×T5 achieves zero-shot object recognition for robotic manipulation while efficiently integrating vision-language-action capabilities.
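For illustration only (this is not the paper's code): OWLv2 is available through the Hugging Face Transformers library, and the sketch below shows how open-vocabulary text queries yield bounding boxes of unseen objects that a downstream action model such as CrossT5 could condition on. The checkpoint name, image file, and text queries are assumptions.

```python
# Illustrative sketch: zero-shot object detection with OWLv2 via
# Hugging Face Transformers. Checkpoint, image, and queries are
# hypothetical; the OWL×T5 integration itself is not shown here.
import torch
from PIL import Image
from transformers import Owlv2Processor, Owlv2ForObjectDetection

processor = Owlv2Processor.from_pretrained("google/owlv2-base-patch16-ensemble")
model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16-ensemble")

image = Image.open("tabletop_scene.jpg")    # hypothetical robot camera frame
queries = [["a red apple", "a blue mug"]]   # open-vocabulary text queries

inputs = processor(text=queries, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw logits/boxes to (score, label, box) triples in image coordinates.
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
results = processor.post_process_object_detection(
    outputs, threshold=0.3, target_sizes=target_sizes
)[0]

for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(f"{queries[0][label.item()]}: score={score:.2f}, box={box.tolist()}")
```

The detected boxes for objects never seen during action training are what allow the action pathway to generalize beyond the training vocabulary.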

Publication
2025 European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning
Cornelius Weber
WP3 Researcher
Stefan Wermter
Networking Lead Expert