Action Supervision for Learning from Corrective Feedback
Under Review
Abstract. Behavior cloning (BC) learns policies by treating human demonstrations as pointwise action labels. While effective with accurate action labels, this formulation is brittle in practice: when a human-provided action is imperfect, treating each label as an exact target can steer the policy away from the underlying desired behavior, particularly for expressive models such as energy-based models. Noting that this brittleness stems from pointwise supervision, we propose a human-in-the-loop online learning alternative that reformulates supervision as set-valued action targets derived from corrective feedback.
We introduce Contrastive policy Learning from Interactive Corrections (CLIC). CLIC constructs desired action sets from human corrections and trains the policy to place probability mass over these sets rather than a single action target. This formulation naturally accommodates both absolute and relative corrections, and enables the aggregation of multiple corrections into progressively refined desired action sets. Extensive simulation and real-robot experiments show that the proposed set-valued supervision leads to effective policy learning across diverse feedback settings: CLIC outperforms state-of-the-art baselines, remains competitive under accurate feedback, is substantially more robust under noisy, relative, and partial feedback, and supports complex multi-modal behaviors.
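As a rough illustration of what set-valued supervision could look like in practice, the sketch below builds a desired action set from a single correction (a half-space for a relative correction, a ball for an absolute one) and trains an energy-based policy to place probability mass on that set. Everything here, including `energy_fn`, the action sampling scheme, and the dimensions, is an illustrative assumption, not CLIC's actual implementation.

```python
import torch

# Hypothetical energy model E(s, a): lower energy = more preferred action.
# State dim 4 and action dim 2 are arbitrary choices for this sketch.
energy_fn = torch.nn.Sequential(
    torch.nn.Linear(4 + 2, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1)
)

def in_desired_set(a, a_robot, correction, radius=None):
    """Membership test for a desired action set built from one correction.

    Relative correction (directional nudge h): the half-space
      {a : <a - a_robot, h> > 0}.
    Absolute correction (demonstrated action a*): the ball
      {a : ||a - a*|| <= radius}.
    """
    if radius is None:  # relative correction
        return (a - a_robot) @ correction > 0
    return torch.linalg.norm(a - correction, dim=-1) <= radius

def set_valued_loss(state, a_robot, correction, n_samples=256, radius=None):
    # Sample candidate actions and split them by desired-set membership.
    cands = torch.rand(n_samples, 2) * 2 - 1  # actions in [-1, 1]^2
    mask = in_desired_set(cands, a_robot, correction, radius)
    if mask.sum() == 0 or mask.sum() == n_samples:
        return torch.tensor(0.0, requires_grad=True)  # uninformative batch
    s = state.expand(n_samples, -1)
    energies = energy_fn(torch.cat([s, cands], dim=-1)).squeeze(-1)
    # Contrastive objective: maximize the probability mass that the
    # energy-induced distribution places on the desired set.
    log_p = torch.log_softmax(-energies, dim=0)
    return -torch.logsumexp(log_p[mask], dim=0)

# Usage: one gradient step from a single relative correction.
state = torch.randn(4)
a_robot = torch.zeros(2)
h = torch.tensor([1.0, 0.0])  # relative correction: "push the action rightward"
set_valued_loss(state, a_robot, h).backward()
```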
Highlights
CLIC aggregates corrective feedback into progressively refined desired action sets (gray) while updating the policy accordingly.
Pointwise supervision in Implicit BC arises as a limiting case of CLIC’s set-valued supervision: with circular desired action sets, the set collapses to a single action as the radius goes to zero, and the policy-weighted loss reduces to a uniform Bayes loss.
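To make this limit concrete, one plausible formalization (our notation and an assumed energy-based policy, not necessarily the paper's exact objective) is sketched below.

```latex
% Set-valued loss over a desired action set A* (a ball of radius r around a*),
% with an energy-based policy pi_theta(a|s) proportional to exp(-E_theta(s,a)):
\mathcal{L}(\theta)
  = -\log \int_{\mathcal{A}^{\star}} \pi_\theta(a \mid s)\, da,
\qquad
\mathcal{A}^{\star} = \{\, a : \|a - a^{\star}\| \le r \,\}.
% As r -> 0 the integral concentrates at a*, so up to an additive constant
% (the log-volume of the shrinking ball, independent of theta) the loss
% reduces to the pointwise objective -log pi_theta(a* | s) of Implicit BC.
```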
Ball catching: Quick coordination with partial feedback to the end effector or the robot hand.
Water pouring: Learning full pose control to precisely pour liquid (marbles) into a bowl.
Insert-T: Learning a long-horizon, multi-modal task of inserting the T-shaped object into the U-shaped object.
Simulation benchmarks
We evaluate CLIC on four simulation tasks under diverse feedback settings, including accurate, noisy, relative, and partial feedback. These experiments show that set-valued supervision remains competitive under accurate feedback and is more robust under imperfect feedback. Example post-training rollouts under accurate demonstration feedback are shown below:
Real World Insert-T Task
The Insert-T task requires the robot to insert a T-shaped object into a U-shaped object by pushing. We group task instances into three difficulty levels; example post-training policy rollouts for each level are shown below:
Insert-T 'Easy' category:
Insert-T 'Medium' category:
Insert-T 'Hard' category:
Team
Omitted for anonymous review.
Acknowledgements
Omitted for anonymous review.