LIBERO-X

Robustness Litmus for Vision-Language-Action Models

Guodong Wang, Chenkai Zhang, Qingjie Liu, Jinjin Zhang, Jiancheng Cai, Junjie Liu*, Xinmin Liu
Equal Contribution. Project Leader. *Corresponding Author.
wangguodong20@meituan.com, zhangchenkai@buaa.edu.cn

Overview

Reliable benchmarking is critical for advancing Vision-Language-Action (VLA) models, as it reveals how well they generalize, remain robust, and align perception with language-driven manipulation. However, existing benchmarks often provide limited or misleading assessments because their evaluation protocols fail to capture real-world distribution shifts. This work systematically rethinks VLA benchmarking from both the evaluation and data perspectives, introducing LIBERO-X, a benchmark featuring: 1) A hierarchical evaluation protocol with progressive difficulty levels targeting three core capabilities: spatial generalization, object recognition, and task instruction understanding. This design enables fine-grained analysis of performance degradation under increasing environmental and task complexity; 2) A high-diversity training dataset collected via human teleoperation, where each scene supports multiple fine-grained manipulation objectives to bridge the train-evaluation distribution gap. Experiments with representative VLA models reveal significant performance drops under cumulative perturbations, exposing persistent limitations in scene comprehension and instruction grounding. By integrating hierarchical evaluation with diverse training data, LIBERO-X offers a more reliable foundation for assessing and advancing VLA development.
Overview of Method

Comparison with related simulation benchmarks. (a): LIBERO couples scenes with homogeneous trajectories, limiting diversity and making tests mirror training. (b): Extensions reuse original trajectories with independent perturbations, missing complex distribution shifts. (c): LIBERO-X uses multi-task scenes with diverse trajectories and a multi-level evaluation that reveals performance drops as difficulty increases.

The main contributions of this work are three-fold.

  • We propose LIBERO-X, a comprehensive manipulation benchmark that introduces a progressively challenging evaluation framework with multi-label annotations, jointly perturbing spatial layouts, object properties, and instruction semantics to systematically characterize model performance under multi-dimensional distribution shifts (a simplified sketch of how such perturbations compose appears after this list).
  • We construct a high-diversity training dataset comprising 2,520 demonstrations, 600 tasks, and 100 scenes collected via human teleoperation, substantially increasing scene diversity and task granularity.
  • Extensive experiments on fine-tuned representative VLA models reveal notable performance degradation as task and scene complexity increase, highlighting critical limitations in scene comprehension and instruction grounding and offering empirical insights for future model design.
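To make the notion of joint perturbation concrete, the sketch below illustrates, in simplified form, how spatial, attribute, and instruction perturbations could be composed when constructing evaluation episodes. The scene representation and function names are illustrative assumptions, not the LIBERO-X implementation.

import random

def perturb_spatial(scene, extent="local"):
    # Jitter object positions; "extended" allows larger displacements (cf. Levels 1-2).
    radius = 0.05 if extent == "local" else 0.20
    for obj in scene["objects"]:
        obj["xy"] = tuple(c + random.uniform(-radius, radius) for c in obj["xy"])
    return scene

def perturb_attributes(scene, colors=("red", "green", "yellow")):
    # Vary visual attributes such as bowl color (cf. Level 4).
    for obj in scene["objects"]:
        if obj["category"] == "bowl":
            obj["color"] = random.choice(colors)
    return scene

def reformulate_instruction(instruction):
    # Placeholder for a semantic-equivalent rewording of the command (cf. Level 5).
    return instruction.replace("Put", "Place")

# Cumulative application: higher levels stack perturbations on top of lower ones.
scene = {"objects": [{"category": "bowl", "color": "yellow", "xy": (0.4, 0.1)}]}
scene = perturb_attributes(perturb_spatial(scene, extent="extended"))
instruction = reformulate_instruction("Put the yellow bowl on the plate.")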

Benchmark

Statistics


Overview of LIBERO-X. LIBERO-X provides a high-diversity training dataset constructed through human teleoperation, along with multi-level and multi-label evaluation data.

Dataset Comparison


Comparison with other simulation benchmarks. Training Data: availability of a training set. Testing Data: availability of an independent testing set. Multi-level Evaluation: adoption of a multi-level evaluation framework. L1: Local Spatial Perturbation, L2: Extended Spatial Perturbation, L3: Scene Topology Reconstruction, L4: Visual Attribute Variation, and L5: Semantic-Equivalent Reformulation. Multi-label Evaluation: task labels across multiple dimensions. IT: Interaction Type. SC: Subtask Count. SR: Spatial Relation. OA: Object Attribute.
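As a concrete illustration of the multi-level and multi-label scheme defined above, the following minimal sketch shows one way a task annotation could be represented. The schema and field names are assumptions for exposition, not the official LIBERO-X format.

from dataclasses import dataclass
from enum import Enum

class Level(Enum):
    # The five evaluation levels listed in the comparison above.
    L1 = "local spatial perturbation"
    L2 = "extended spatial perturbation"
    L3 = "scene topology reconstruction"
    L4 = "visual attribute variation"
    L5 = "semantic-equivalent reformulation"

@dataclass
class TaskAnnotation:
    instruction: str        # natural-language command given to the policy
    level: Level            # which perturbation level the episode belongs to
    interaction_type: str   # IT, e.g. "pick-place" or "open-close"
    subtask_count: int      # SC, number of chained sub-goals in the instruction
    spatial_relation: str   # SR, e.g. "in" or "on"
    object_attribute: str   # OA, e.g. "yellow bowl"

# Illustrative annotation for a single episode.
example = TaskAnnotation(
    instruction="Put the yellow bowl in the bottom drawer of the white cabinet.",
    level=Level.L1,
    interaction_type="pick-place",
    subtask_count=1,
    spatial_relation="in",
    object_attribute="yellow bowl",
)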

Examples from LIBERO-X Training Set

Put the yellow bowl in the bottom drawer of the white cabinet.

Close the bottom drawer of the white cabinet and put the red bowl on the plate.

Close the bottom drawer of the white cabinet and put the yellow bowl on the plate.

Close the bottom drawer of the white cabinet.

Put the green bowl in the bottom drawer of the white cabinet and close the drawer.

Put the green bowl in the bottom drawer of the white cabinet.

Put the red bowl in the bottom drawer of the white cabinet and close the drawer.

Put the red bowl in the bottom drawer of the white cabinet.

Put the yellow bowl in the bottom drawer of the white cabinet and close the drawer.

Close the bottom drawer of the white cabinet and put the green bowl on the plate.

Evaluation

Multi-level evaluation accuracy

Accuracy under Multi-level Evaluation.

Level 4 accuracy

Level 4 Accuracy - Overall, Unseen Objects (UO), and Confounding Objects (CO).

Multi-level evaluation success rate decline

Success Rate Decline across Multi-level Evaluation.

Multi-label evaluation

Multi-label evaluation across five levels. Success rates contract from Level 1 to 5 as complexity rises, yet relative strengths remain consistent: difficult dimensions are broadly shared across models and performance patterns stay aligned across levels.
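For readers reproducing this kind of analysis, a minimal sketch of how per-episode outcomes could be aggregated into a level-by-label success-rate grid is given below; the episode record format is an assumption.

from collections import defaultdict

def success_rates(episodes):
    # episodes: list of dicts with keys "level", "label", and "success" (bool).
    totals = defaultdict(lambda: [0, 0])  # (level, label) -> [successes, trials]
    for ep in episodes:
        key = (ep["level"], ep["label"])
        totals[key][0] += int(ep["success"])
        totals[key][1] += 1
    return {key: wins / trials for key, (wins, trials) in totals.items()}

# Usage: rates = success_rates(episodes); rates[("L3", "SC")] would give the
# Level 3 success rate on the Subtask Count dimension for a given model.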

Task horizon length

Time-limit sensitivity. We scale limits relative to the human-average duration (e.g., 0.8x to 1.5x). Tight limits (0.8x) sharply hurt performance, while relaxing them yields large gains up to around 1.1x. Benefits saturate beyond ~1.3x, indicating that the remaining failures stem from core capability issues rather than time scarcity.
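A minimal sketch of the time-limit sweep described above, assuming the human-average duration of each task is known and expressed in control steps; the step count used here is illustrative.

def time_limit(human_avg_steps, scale):
    # Maximum rollout length as a multiple of the human-average duration.
    return int(round(scale * human_avg_steps))

for scale in (0.8, 1.0, 1.1, 1.3, 1.5):
    print(f"{scale:.1f}x -> {time_limit(human_avg_steps=400, scale=scale)} steps")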

Case Study

Per-level case studies: LEVEL 1 through LEVEL 5.

BibTeX

@misc{libero-x2025,
      title={LIBERO-X: Robustness Litmus for Vision-Language-Action Models},
      author={?},
      year={?},
      note={To be updated},
}