Safe and Smart Robotics

D²DINO: Dense Descriptors from DINO for Pixel-Level Object Understanding
Authors: Paolo Sebeto (TU Wien), Jean-Baptiste Weibel (BOKU University), Christian Hartl-Nesic (TU Wien), Markus Vincze (TU Wien)
Learning dense, pose‑aware object descriptors is a key ingredient for generalizing robotic manipulation across novel instances and viewpoints. Intermediate features from self‑supervised models like DINO and Stable Diffusion can serve as powerful dense descriptors for semantic correspondence, yet these features degrade under large viewpoint changes.
To address this, we introduce D²DINO, a descriptor prediction model for pixel-level object understanding. Our model attaches a lightweight convolutional head to a frozen DINOv3 encoder and trains it to produce low-dimensional (16‑D), pixel-wise descriptors at full input resolution. The head fuses multi-scale ViT features and progressively upsamples them, yielding compact descriptors that can be used directly for dense matching. Supervision comes from Normalized Object Coordinate Space (NOCS) annotations, exploiting consistent 2D–3D mappings across frames.
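To make "used directly for dense matching" concrete, the following is a minimal NumPy sketch (not the authors' implementation) of how compact per-pixel descriptors support correspondence: each query descriptor is compared against every pixel of a target descriptor map by cosine similarity, and the best-scoring pixel is returned. The function name, shapes, and the 16‑D toy data are illustrative assumptions.

```python
import numpy as np

def match_descriptors(desc_a, desc_b):
    """Illustrative dense matching: for each query pixel in image A,
    find the best-matching pixel in image B.

    desc_a: (N, D) L2-normalized descriptors of N query pixels in image A.
    desc_b: (H, W, D) dense, L2-normalized descriptor map of image B.
    Returns an (N, 2) array of (row, col) matches in image B.
    """
    h, w, d = desc_b.shape
    flat_b = desc_b.reshape(-1, d)                  # (H*W, D)
    sims = desc_a @ flat_b.T                        # cosine similarities, (N, H*W)
    best = sims.argmax(axis=1)                      # best flat index per query
    return np.stack([best // w, best % w], axis=1)  # back to (row, col)

# Toy example: a 4x4 map of random 16-D unit descriptors.
rng = np.random.default_rng(0)
desc_map = rng.normal(size=(4, 4, 16))
desc_map /= np.linalg.norm(desc_map, axis=-1, keepdims=True)

# Query with the descriptor at pixel (2, 3); it should match itself.
query = desc_map[2, 3][None, :]
print(match_descriptors(query, desc_map))  # [[2 3]]
```

Because the descriptors are only 16‑D, this exhaustive comparison stays cheap even at full image resolution, which is the practical payoff of a compact descriptor space.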
We optimize D²DINO with a contrastive objective and further distinguish between negatives on other objects or the background and negatives on the same object, down-weighting the latter to encourage intra-object variation. We show that D²DINO yields higher point matching accuracy than raw DINOv3 features with upscaled inputs, while requiring only a single forward pass at the original image resolution and a much lower descriptor dimensionality.
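The down-weighting idea can be sketched as an InfoNCE-style loss in which each negative carries its own weight. This is a hedged NumPy illustration, not the paper's actual objective: the temperature `tau`, the weight value 0.2 for same-object negatives, and the function name are assumptions made for the example.

```python
import numpy as np

def weighted_contrastive_loss(anchor, positive, negatives, neg_weights, tau=0.1):
    """InfoNCE-style contrastive loss with per-negative weights.

    anchor, positive: (D,) L2-normalized descriptors of a matching pixel pair.
    negatives: (K, D) descriptors of non-matching pixels.
    neg_weights: (K,) weights, e.g. 1.0 for background/other-object negatives
        and a smaller value (assumed 0.2 here) for same-object negatives.
    """
    pos_logit = np.exp(anchor @ positive / tau)       # similarity of the true pair
    neg_logits = np.exp(negatives @ anchor / tau)     # similarities of negatives
    # Down-weighted negatives contribute less to the denominator, so the
    # loss penalizes same-object similarity more gently.
    return -np.log(pos_logit / (pos_logit + np.sum(neg_weights * neg_logits)))

# Toy data: five random 16-D unit descriptors.
rng = np.random.default_rng(1)
d = rng.normal(size=(5, 16))
d /= np.linalg.norm(d, axis=1, keepdims=True)
anchor, positive = d[0], d[0]              # perfect match, for illustration
negatives = d[1:]
weights = np.array([1.0, 1.0, 0.2, 0.2])   # last two: same-object negatives
print(weighted_contrastive_loss(anchor, positive, negatives, weights))
```

Down-weighting shrinks the denominator relative to uniform weights, so the loss applied to same-object negatives is softer, leaving room for the intra-object descriptor variation the abstract describes.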
How to Cite: Sebeto, P., Weibel, J., Hartl-Nesic, C. & Vincze, M. (2026) “D²DINO: Dense Descriptors from DINO for Pixel‑Level Object Understanding”, Proceedings of the Austrian Symposium on AI, Robotics, and Vision. 3(1).