Learning Multi-Task Robot Trajectory Segmentation
from Visual and Kinematic Streams

¹University of California, Berkeley
*Equal contribution
RoboSegNet overview: visual and kinematic streams fused to predict trajectory segment boundaries

Figure 1. RoboSegNet processes an image stream (DINOv3) and a kinematic stream (FAST), fuses them with cross-attention (frequency synchronization), and uses a transformer encoder–decoder to predict transition probabilities over time, for example segmenting an insertion into phases such as Grasp, Insert, and Nudge.

Abstract

Segmenting robot demonstration trajectories into semantically coherent segments is key to efficient policy learning, skill reuse, and failure recovery. We present RoboSegNet, a multi-task framework that learns jointly from visual and proprioceptive kinematic signals. Kinematic trajectories are encoded with a tokenizer based on the Discrete Cosine Transform (DCT); images are encoded with a vision transformer. The two modalities are fused with bidirectional cross-modal attention, and transition boundaries are predicted via Hungarian matching. We introduce RoboSegData, a benchmark built from the Agibot dataset with dense frame-level transition annotations. RoboSegNet achieves strong performance on RoboSegData and generalizes zero-shot to unseen scenes, tasks, and skills.
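To make the idea of a DCT-based kinematic tokenizer concrete, the following is a minimal sketch and not the RoboSegNet or FAST implementation. It assumes a single 1-D kinematic channel (for example one joint angle over a short window), applies an orthonormal DCT-II, and rounds the coefficients to integers. The function names (`dct_ii`, `idct_ii`, `tokenize`, `detokenize`) and the parameters `step` and `keep` are hypothetical and chosen only for illustration.

```python
import math


def _scale(k, n):
    # Orthonormal DCT-II scaling factor (assumed convention for this sketch).
    return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)


def dct_ii(signal):
    """Orthonormal DCT-II of a 1-D sequence of floats."""
    n = len(signal)
    return [
        _scale(k, n)
        * sum(x * math.cos(math.pi * (i + 0.5) * k / n) for i, x in enumerate(signal))
        for k in range(n)
    ]


def idct_ii(coeffs):
    """Inverse of the orthonormal DCT-II defined above."""
    n = len(coeffs)
    return [
        sum(
            _scale(k, n) * c * math.cos(math.pi * (i + 0.5) * k / n)
            for k, c in enumerate(coeffs)
        )
        for i in range(n)
    ]


def tokenize(signal, step=0.05, keep=None):
    """Quantize DCT coefficients to integer tokens.

    step: hypothetical quantization step size.
    keep: if given, retain only the `keep` lowest-frequency coefficients.
    """
    coeffs = dct_ii(signal)
    if keep is not None:
        coeffs = coeffs[:keep]
    return [round(c / step) for c in coeffs]


def detokenize(tokens, length, step=0.05):
    """Approximate reconstruction; dropped high frequencies are set to zero."""
    coeffs = [t * step for t in tokens] + [0.0] * (length - len(tokens))
    return idct_ii(coeffs)
```

For instance, `tokenize([1.0, 1.0, 1.0, 1.0], step=0.5)` yields `[4, 0, 0, 0]`: a constant signal puts all of its energy in the zero-frequency coefficient. Because the transform is orthonormal, the reconstruction error from rounding is bounded by the quantization step, which is one reason a frequency-domain representation suits smooth kinematic trajectories.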