Segmenting robot demonstration trajectories into semantically coherent parts is key to efficient policy learning, skill reuse, and recovery from failure. We present RoboSegNet, a multi-task framework that jointly learns from visual and kinematic proprioceptive signals. Kinematic trajectories are encoded with a Discrete Cosine Transform (DCT)-based tokenizer; images are encoded with a visual transformer. The two modalities are fused with bidirectional cross-modal attention, and transition boundaries are predicted via Hungarian matching. We introduce RoboSegData, a benchmark built from the Agibot dataset with dense frame-level transition annotations. RoboSegNet achieves strong performance on RoboSegData and generalizes zero-shot to unseen scenes, tasks, and skills.
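To make the idea of a DCT-based tokenizer concrete, the following is a minimal sketch and not the RoboSegNet implementation. It assumes a single kinematic channel (for example, one joint angle) split into fixed-length windows, keeps the lowest-frequency DCT coefficients, and quantizes them uniformly into integer tokens. The function names, the number of retained coefficients (`num_coeffs`), and the quantization step (`step`) are hypothetical choices made for illustration only.

```python
import math


def dct_ii(signal):
    """Orthonormal DCT-II of a 1-D sequence of floats."""
    n = len(signal)
    coeffs = []
    for k in range(n):
        total = sum(
            x * math.cos(math.pi * (i + 0.5) * k / n)
            for i, x in enumerate(signal)
        )
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        coeffs.append(scale * total)
    return coeffs


def idct_ii(coeffs, n):
    """Inverse of dct_ii; coefficients beyond len(coeffs) are treated as zero."""
    out = []
    for i in range(n):
        total = 0.0
        for k, c in enumerate(coeffs):
            scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            total += scale * c * math.cos(math.pi * (i + 0.5) * k / n)
        out.append(total)
    return out


def tokenize_window(signal, num_coeffs=4, step=0.05):
    """Keep the lowest-frequency coefficients and quantize them to integers.

    num_coeffs and step are illustrative assumptions, not values from the paper.
    """
    coeffs = dct_ii(signal)[:num_coeffs]
    return [round(c / step) for c in coeffs]


def detokenize_window(tokens, length, step=0.05):
    """Approximate reconstruction of the window from its integer tokens."""
    return idct_ii([t * step for t in tokens], length)


# Hypothetical window: one joint angle (radians) sampled at 8 time steps.
joint_angle = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
tokens = tokenize_window(joint_angle)
recon = detokenize_window(tokens, len(joint_angle))
max_err = max(abs(a - b) for a, b in zip(joint_angle, recon))
```

In this sketch, smooth motions concentrate their energy in the first few coefficients, so a short integer sequence can approximate the window. A smaller `step` or a larger `num_coeffs` would trade longer or finer-grained token sequences for lower reconstruction error.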