Robust Reward Alignment via Hypothesis Space Batch Cutting

Zhixian Xie*¹, Haode Zhang*², Yizhe Feng², Wanxin Jin¹
¹Arizona State University, ²Shanghai Jiao Tong University
TL;DR: We propose a novel geometric view of reward alignment as an iterative cutting process over the hypothesis space. Our batched cutting method significantly improves data efficiency by maximizing the value of each human preference query. We introduce a conservative cutting algorithm that ensures robustness to unknown erroneous human preferences without explicitly identifying them.

Method
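
To make the cutting idea from the TL;DR concrete, here is a minimal sketch, not the authors' implementation. It assumes a finite set of sampled hypotheses and a linear reward r = theta · phi(trajectory); the names `preference_votes`, `batch_cut`, and `max_violations` are hypothetical and chosen only for illustration. Each preference in a batch defines a cut: a hypothesis agrees with it if it ranks the preferred trajectory higher. A strict batch cut keeps only hypotheses that agree with every preference, while a conservative cut tolerates a fixed number of disagreements, so a few erroneous labels do not have to be identified in order to keep the true reward in the surviving set.

```python
import numpy as np


def preference_votes(thetas, feats_a, feats_b, labels):
    """Count how many preferences in the batch each hypothesis agrees with.

    thetas:  (H, d) candidate reward parameters (assumed linear reward)
    feats_a: (B, d) feature sums of trajectory A in each query
    feats_b: (B, d) feature sums of trajectory B in each query
    labels:  (B,)   +1 if the human preferred A, -1 if the human preferred B
    """
    # margins[h, i] > 0 means hypothesis h ranks trajectory A above B in query i.
    margins = thetas @ (feats_a - feats_b).T
    agrees = (margins * labels[None, :]) > 0
    return agrees.sum(axis=1)


def batch_cut(thetas, feats_a, feats_b, labels, max_violations=0):
    """Keep hypotheses that disagree with at most `max_violations` preferences.

    max_violations=0 is a strict cut; a positive value is a conservative cut
    that tolerates that many possibly erroneous labels in the batch.
    """
    votes = preference_votes(thetas, feats_a, feats_b, labels)
    keep = votes >= len(labels) - max_violations
    return thetas[keep]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    theta_true = np.array([1.0, -0.5, 2.0])  # illustrative ground truth
    thetas = np.vstack([theta_true, rng.normal(size=(499, 3))])

    feats_a = rng.normal(size=(10, 3))
    feats_b = rng.normal(size=(10, 3))
    labels = np.sign((feats_a - feats_b) @ theta_true)
    labels[:2] *= -1  # simulate a 20% false rate by flipping two labels

    strict = batch_cut(thetas, feats_a, feats_b, labels, max_violations=0)
    conservative = batch_cut(thetas, feats_a, feats_b, labels, max_violations=2)
    print("hypotheses kept by strict cut:      ", len(strict))
    print("hypotheses kept by conservative cut:", len(conservative))
```

In this toy setup the strict cut removes the true parameter as soon as one label is flipped, whereas the conservative cut retains it whenever the number of flipped labels does not exceed `max_violations`. The price is a larger surviving set per batch, which is the trade-off between robustness and how aggressively each batch shrinks the hypothesis space.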

Results

We evaluate HSBC on a diverse set of model predictive control (MPC) tasks, including locomotion (Cartpole-Swingup, Walker-Walk, Humanoid-Standup, Go2-Standup) and dexterous manipulation (Allegro-Cube, Allegro-Bunny).

In noise-free settings, our method performs comparably to or better than state-of-the-art baselines such as PREF-BI and Disagreement Learning. When a large fraction of the human feedback is incorrect (up to 30%), HSBC significantly outperforms existing approaches in both reward accuracy and task performance.

These results demonstrate HSBC's strong robustness and sample efficiency in aligning with true reward functions, even in the presence of substantial label noise.



Figure panels: Cartpole, 20% false rate; Cartpole, 30% false rate; Walker, 20% false rate; Walker, 30% false rate.

Task panels: Go2 Standup; Humanoid Standup; Allegro Cube; Allegro Bunny; Cartpole Swingup; Walker Walk.




BibTeX

@misc{xie2025robustrewardalignmenthypothesis,
  title        = {Robust Reward Alignment via Hypothesis Space Batch Cutting}, 
  author       = {Zhixian Xie and Haode Zhang and Yizhe Feng and Wanxin Jin},
  year         = {2025},
  eprint       = {2502.02921},
  archivePrefix= {arXiv},
  primaryClass = {cs.LG},
  url          = {https://arxiv.org/abs/2502.02921}
}