Comparison with related simulation benchmarks. (a): LIBERO couples each scene with homogeneous trajectories, limiting diversity and making test conditions mirror training. (b): LIBERO extensions reuse the original trajectories and apply only independent perturbations, missing more complex distribution shifts. (c): LIBERO-X builds multi-task scenes with diverse trajectories and a multi-level evaluation that reveals how performance drops as difficulty increases.
The main contributions of this work are three-fold.
Overview of LIBERO-X. LIBERO-X provides a high-diversity training dataset constructed through human teleoperation, along with multi-level and multi-label evaluation data.
Comparison with other simulation benchmarks. Training Data: availability of a training set. Testing Data: availability of an independent testing set. Multi-level Evaluation: adoption of the multi-level framework (L1: Local Spatial Perturbation; L2: Extended Spatial Perturbation; L3: Scene Topology Reconstruction; L4: Visual Attribute Variation; L5: Semantic-Equivalent Reformulation). Multi-label Evaluation: task labels across multiple dimensions (IT: Interaction Type; SC: Subtask Count; SR: Spatial Relation; OA: Object Attribute).
Put the yellow bowl in the bottom drawer of the white cabinet.
Close the bottom drawer of the white cabinet and put the red bowl on the plate.
Close the bottom drawer of the white cabinet and put the yellow bowl on the plate.
Close the bottom drawer of the white cabinet.
Put the green bowl in the bottom drawer of the white cabinet and close the drawer.
Put the green bowl in the bottom drawer of the white cabinet.
Put the red bowl in the bottom drawer of the white cabinet and close the drawer.
Put the red bowl in the bottom drawer of the white cabinet.
Put the yellow bowl in the bottom drawer of the white cabinet and close the drawer.
Close the bottom drawer of the white cabinet and put the green bowl on the plate.
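As a concrete illustration of the multi-label scheme described above, the sketch below annotates two of the listed instructions along the IT, SC, SR, and OA dimensions. The `TaskLabels` structure and the specific label values are our own illustrative assumptions, not the benchmark's actual annotation format.

```python
from dataclasses import dataclass


@dataclass
class TaskLabels:
    """Illustrative multi-label record (not the official LIBERO-X schema)."""
    instruction: str
    interaction_type: str   # IT: e.g., articulated-object vs. pick-and-place interaction
    subtask_count: int      # SC: number of chained subtasks in the instruction
    spatial_relation: str   # SR: spatial relation the goal state must satisfy
    object_attribute: str   # OA: attribute used to pick out the target object


# Two of the instructions above, labeled by hand for illustration.
examples = [
    TaskLabels(
        instruction="Close the bottom drawer of the white cabinet.",
        interaction_type="articulated",   # single drawer-closing interaction
        subtask_count=1,
        spatial_relation="bottom drawer of",
        object_attribute="white cabinet",
    ),
    TaskLabels(
        instruction=(
            "Put the green bowl in the bottom drawer of the white cabinet "
            "and close the drawer."
        ),
        interaction_type="pick-place + articulated",  # two interaction types chained
        subtask_count=2,
        spatial_relation="in",
        object_attribute="green",
    ),
]

for ex in examples:
    print(f"[SC={ex.subtask_count}] {ex.instruction}")
```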
Accuracy under Multi-level Evaluation.
Level 4 Accuracy: Overall, Unseen Objects (UO), and Confounding Objects (CO).
Success Rate Decline across Multi-level Evaluation.
Multi-label evaluation across five levels. Success rates decline from Level 1 to Level 5 as complexity rises, yet relative strengths remain consistent: the difficult dimensions are broadly shared across models, and performance patterns stay aligned across levels.
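A minimal sketch of how the per-level, per-label success rates behind this comparison could be aggregated from rollout logs; the episode record fields (`level`, `label`, `success`) are assumptions, not the benchmark's actual logging format.

```python
from collections import defaultdict

# Hypothetical rollout records: (evaluation level, label value, success flag).
episodes = [
    {"level": 1, "label": "SC=1", "success": True},
    {"level": 1, "label": "SC=2", "success": False},
    {"level": 5, "label": "SC=2", "success": False},
    # ... one record per evaluated rollout
]

# Aggregate successes and episode counts per (level, label) cell.
totals = defaultdict(lambda: [0, 0])  # (level, label) -> [num_success, num_episodes]
for ep in episodes:
    key = (ep["level"], ep["label"])
    totals[key][0] += int(ep["success"])
    totals[key][1] += 1

for (level, label), (wins, n) in sorted(totals.items()):
    print(f"Level {level} | {label}: {wins / n:.1%} ({n} episodes)")
```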
Time-limit sensitivity. We scale episode time limits relative to the average human demonstration duration (from 0.8x to 1.5x). Tight limits (0.8x) sharply degrade performance, while relaxing them yields large gains up to around 1.1x. Benefits saturate beyond roughly 1.3x, indicating that the remaining failures stem from core capability gaps rather than time scarcity.
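The sketch below shows one way such scaled time limits could be applied during evaluation: the per-task cap is a multiple of the average human demonstration length, and episodes that exceed it count as failures. The `env`/`policy` interfaces and the step-based accounting are assumptions for illustration, not the benchmark's actual evaluation code.

```python
import math


def scaled_time_limit(human_avg_steps: int, factor: float) -> int:
    """Cap on environment steps, scaled relative to the average human demo length."""
    return math.ceil(factor * human_avg_steps)


def rollout_with_limit(env, policy, human_avg_steps: int, factor: float = 1.1) -> bool:
    """Run one episode; return True only if the task succeeds within the scaled limit."""
    max_steps = scaled_time_limit(human_avg_steps, factor)
    obs = env.reset()
    for _ in range(max_steps):
        action = policy(obs)
        obs, done, info = env.step(action)   # assumed step signature
        if done:
            return bool(info.get("success", False))
    return False  # exceeding the time limit counts as a failure


# Example sweep over the factors reported above (0.8x to 1.5x), for a 300-step demo average.
factors = [0.8, 0.9, 1.0, 1.1, 1.3, 1.5]
print([scaled_time_limit(300, f) for f in factors])
```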
@misc{libero-x2025,
  title  = {LIBERO-X: Robustness Litmus for Vision-Language-Action Models},
  author = {?},
  year   = {?},
  note   = {To be updated},
}