Action Chunking and Exploratory Data Collection Yield Exponential Improvements in Behavior Cloning

Thomas T. Zhang, Daniel Pfrommer, Chaoyi Pan, Nikolai Matni, and Max Simchowitz

TL;DR: We explain why Action-Chunking, an immensely popular practice in modern robot learning, works. See our full paper.

Overview of Action Chunking and Noise Injection

Introduction

Behavior Cloning (BC), or learning from human demonstration, is a foundational component of modern robotic learning. Though BC has been studied for decades in other domains (Pomerleau, 1988), only recently has it been seriously applied to robotic manipulation tasks. While these successes have been attributed to the use of generative models (e.g. diffusion models) as parametrizations for control policies, a second ingredient of equal importance is the practice of action-chunking (AC), where policies predict and execute open-loop sequences of actions, or “chunks” (Chi et al., 2023; Zhao et al., 2023).

In this writeup, we aim to answer:

1. Why does action-chunking help behavior cloning in robotic manipulation?

2. When action-chunking doesn’t work, what else can be done?

Action-chunking policies predict entire sequences of future control actions and execute them without querying the policy again. These actions are generally used as inputs to a lower-level position-based controller.

Our findings crucially depend on the properties of both the open-loop dynamics (where actions are generated without access to the underlying state) and the closed-loop dynamics (where actions are based on the current state and we must consider the combined environment + policy system as a whole).
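To make the open-loop vs. closed-loop distinction concrete, here is a minimal sketch (with placeholder `policy` and `dynamics` callables that are not from our codebase) of how a chunking policy interacts with an environment:

```python
import numpy as np

def rollout_chunked(policy, dynamics, x0, horizon, chunk_len):
    """Roll out a chunking policy: query it once every `chunk_len` steps,
    then execute the returned action sequence open-loop.

    `policy(x)` is assumed to return an array of shape (chunk_len, action_dim);
    `dynamics(x, u)` returns the next state. Both names are placeholders.
    """
    x, states = x0, [x0]
    for t in range(horizon):
        if t % chunk_len == 0:                 # re-query only at chunk boundaries
            chunk = policy(x)                  # predict the next `chunk_len` actions
        x = dynamics(x, chunk[t % chunk_len])  # execute open-loop within the chunk
        states.append(x)
    return np.array(states)
```

Within a chunk the state feedback loop is broken: the environment evolves, but the actions were committed to in advance.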

Summary of Findings

There are a number of speculated benefits associated with action-chunking, including improved representation learning, multi-modal prediction, and receding-horizon control. However, we identify a more fundamental benefit: action-chunking encourages stability of the learned policy in closed-loop interaction with the environment, mitigating compounding errors.

Finding 1:

When the underlying environment is inherently stabilizing (i.e. open-loop stable), action-chunking alone suffices to prevent compounding errors, leading to horizon-free imitation guarantees. In contrast, without action-chunking, prior work (Simchowitz et al., 2025) shows that errors can grow exponentially in the problem horizon, even when the dynamics are stable.

We validate this finding in simulated robotic manipulation tasks from RoboMimic, where we observe that increasing chunk-lengths leads to significant improvements in task success rates. This holds even for deterministic policies imitating a deterministic expert, confirming that the benefits of action-chunking can be realized even in the absence of multi-modality or partial observability.


RoboMimic tool-hang task success, as a function of both prediction horizon and evaluated chunk length. Center: Chunk length ablation, 100 training trajectories. Right: Ablation on noise injection vs no noise injection, 50 training trajectories.

Informally, stability of a dynamical system measures its sensitivity to compounding errors: stable systems attenuate small perturbations over time, while unstable systems amplify them. As described below, open-loop stability is a valid assumption for many robotic manipulation settings, where the robot interacts with objects in a quasi-static manner via end-effector control.

However, when the underlying environment is not open-loop stable, action-chunking alone is insufficient to prevent compounding errors, and can even make compounding error worse. In fact, Simchowitz et al. ‘25 show that, in this regime, no algorithmic modification suffices to mitigate error compounding. Instead, we need better data. To this end, we consider a simple practice where the expert demonstrator adds a small amount of well-conditioned noise to their actions during data collection, but collects ground-truth action labels.

Finding 2:

We show that a simple exploratory data collection procedure suffices to prevent compounding errors, even when the underlying environment is not inherently stabilizing, provided that the expert demonstrator is capable of correcting from errors.

The effect of noise injection during demonstration collection for unstable environments can be easily understood by examining reward accumulation over a single trajectory in a continuous control environment such as the MuJoCo continuous control suite:


Mean accumulated reward for Half-Cheetah environment by timestep, with differing levels of noise injection and using the clean expert actions vs noised expert actions for the training labels.

For the adventurous reader, we will now introduce the general framework we use to make precise these fuzzy notions of stability and performance. This requires elements from Control Theory with which many Roboticists and RL theorists may be unfamiliar. We build up our analytical framework in a notation-light and broadly informal manner.

Preliminaries

To isolate the effects of compounding error, we consider the minimal setting of a fully observed, continuous state-action environment. Specifically, we adopt the language of a control system as a natural abstraction for decision making in robot learning.

A deterministic1, discrete-time, continuous state-action control system is defined by

  1. States
  2. Control inputs , which correspond to actions.
  3. Dynamics evolving deterministically as a function of the current state and input.

Thus, a control system is just a Markov Decision Process (MDP) with deterministic transitions and no rewards, tailored to continuous state and action spaces. We assume the initial state is drawn from some distribution fixed throughout.

Imitation Learning in Control Systems

Our goal is to learn a policy which mimics the behavior of a given expert policy, (e.g. a human demonstrator) given demonstration data. Formally:

A deterministic policy maps histories of states, inputs, and the current time step to a control input . Given two deterministic policies , we let denote the expectation over sequences under the dynamics and using respectively, where . Our aim is thus to learn some policy which accumulates low squared-trajectory error2:

We say is Markovian and time-invariant if we can simply express . In this case, we define the closed-loop dynamics , and .

Throughout the rest of this post we will assume that we have been given access to some number of demonstration trajectories sampled from a demonstration distribution, which we generally take to be trajectories sampled using the expert policy. An offline Imitation Learning algorithm can thus be formalized as a (possibly randomized) map from a set of fixed-length demonstration trajectories to a policy. The resulting policy suffers from some on-expert error2,

The Compounding Error Problem

One of the key challenges in Imitation Learning is that of Compounding Error (Simchowitz et al., 2025), which we can now formalize explicitly:

Definition (Compounding Error):

An imitation learning algorithm suffers from exponential compounding errors on given a demonstration distribution if, for some and any choice of demonstration length ,

In other words, imitating via empirical risk minimization on a given demonstration distribution leads to learned policies that suffer exponentially more trajectory error when rolled out in closed-loop compared to their on-expert regression error.

As proposed in prior work (Pfrommer et al., 2022; Tu et al., 2022), compounding error can be understood through the lens of control-theoretic stability, which describes the sensitivity of the dynamics to perturbations of the state or input. We specifically consider the following notion of incremental stability, which models the rate at which two trajectories converge or diverge under different control inputs.

Definition (EISS):

A system is -exponentially incrementally input-to-state stable (EISS) if for all pairs of initial conditions and input sequences , there exist constants , such that for any :

We say a policy-dynamics pair is -EISS if the induced closed-loop dynamics is -EISS.

Intuitively, one constant captures the sensitivity of the state with respect to errors in control inputs, while the other captures the rate at which errors from previous steps decay. In particular, we note that for the closed-loop system, the “input” is the perturbation on top of the policy, meaning the term above captures the policy-relative error, and not the difference in control inputs given to the dynamics. We can visualize this as follows:

Visualization of EISS (Exponential-Incremental-Input-to-State Stability)

EISS captures the ability of a system to naturally correct errors.
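As a toy illustration (not the paper's formal setup), consider the scalar linear system x_{t+1} = rho * x_t + u_t with rho < 1: two rollouts driven by identical inputs contract toward each other at rate rho, which is exactly the geometric decay of past errors that EISS captures.

```python
import numpy as np

# Toy illustration of incremental stability: for x_{t+1} = rho * x_t + u_t
# with rho < 1, two trajectories driven by the SAME inputs but different
# initial conditions contract toward each other geometrically at rate rho.
def rollout(x0, inputs, rho):
    x, traj = x0, [x0]
    for u in inputs:
        x = rho * x + u
        traj.append(x)
    return np.array(traj)

rho, T = 0.8, 30
u = np.random.default_rng(0).normal(size=T)
gap = np.abs(rollout(1.0, u, rho) - rollout(0.0, u, rho))
# the gap at time t is exactly rho**t: an initial perturbation decays away
assert np.allclose(gap, rho ** np.arange(T + 1))
```

With rho > 1 the same computation would show the gap growing exponentially, i.e. the unstable regime where errors are amplified.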

Is EISS of the open-loop dynamics a reasonable assumption?

We believe so. Many robotic manipulation settings involve quasi-static interactions between the robot and objects, where the learned policy controls the position of the end-effector directly, and the objects move in response to contact forces. In these settings, the dynamics from end-effector commands to object states are often stable, as small perturbations in end-effector position lead to small perturbations in object position, without amplification over time.

This means that for most settings, the components of the state corresponding to the environment are at worst marginally stable (), while the components corresponding to the robot’s state are EISS () due to the position-based control paradigm.

Open-loop EISS abstracts the presence of a stabilizing lower-level control algorithm, such as a PID-based position controller.
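A hypothetical sketch of this point: a proportional position controller with gain k (a made-up parameter here, standing in for whatever low-level controller is deployed) contracts the end-effector toward the commanded setpoint at rate (1 - k), so the command-to-position dynamics seen by the imitation policy are open-loop stable.

```python
import numpy as np

# Hypothetical low-level proportional position controller:
#   p_{t+1} = p_t + k * (cmd_t - p_t),  0 < k < 1.
# Two end-effector trajectories given the SAME command sequence contract
# toward each other at rate (1 - k), so small command errors are not amplified.
def step(p, cmd, k=0.5):
    return p + k * (cmd - p)

p_a, p_b = 1.0, -1.0        # two different initial positions
for cmd in np.zeros(20):    # identical command sequence
    p_a, p_b = step(p_a, cmd), step(p_b, cmd)
# the initial gap of 2.0 has shrunk by a factor of (1 - k)**20
assert abs((p_a - p_b) - 2.0 * 0.5**20) < 1e-12
```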

Given an EISS system (for either or the closed-loop ), it may appear that Imitation Learning is inherently easy. As errors decay exponentially over time, small control errors yield only minor differences in state, which again yields a small input error…and so forth.

However, this is unfortunately not the case. This surprising fact is formalized in our prior work (Simchowitz et al., 2025), which we restate (informally) here:

Theorem (IL Hardness Lower Bounds):

There exist families , of dynamics, expert pairs such that drawn using is identical across all instances and:

  1. For every , the open-loop and closed-loop are both EISS and are Lipschitz and smooth. However any algorithm which returns smooth, Lipschitz, Markovian policies with state-independent stochasticity must suffer exponential-in- compounding error for some .

  2. For every , is not necessarily EISS, but the closed-loop is EISS and are Lipschitz and smooth. However any algorithm , without restriction, must suffer exponential-in- compounding error for some .

Indeed, it is possible to construct fairly simple (dynamics, expert) pairs where both the open-loop dynamics and the closed-loop dynamics under the expert are EISS, yet the closed-loop using the learned policy is potentially not EISS.

This naturally sets the stage for our twin results:

  1. For any EISS-stable , using action chunking provably avoids compounding error. This bypasses the first hardness lower bound as chunking policies maintain internal state (the chunk sequence) and are therefore non-Markovian. By choosing action-chunk lengths that are sufficiently long, we can guarantee that the closed-loop under the learned policy is always EISS.

  2. For any EISS-stable , where may potentially be unstable, a simple modification to the data collection procedure using noise-injection (à la DART, Laskey et al. (2017)) is sufficient to guarantee that under the learned policy is EISS.

    The hardness results rely on experts (and therefore, demonstration distributions) which explore only a subspace of the actually reachable portion of the state space. This allows all instances to share the same distribution and makes the “true” choice of dynamics opaque to the learning algorithm. By modifying the data collection procedure, we can make the dynamics uniquely identifiable from the data and bypass the lower bound.


Footnotes

  1. For systems with stochastic state transitions, we can “de-randomize” the system by absorbing the stochasticity into the initial state.

  2. We use squared trajectory error so that large expected errors cannot be driven by vanishingly rare events.

Action-Chunking Does Prevent Exponential Compounding Errors in Open-Loop Stable Systems

Action-chunking is a popular practice in modern sequential modeling pipelines, where a policy predicts a sequence of actions, of which some number are played in open-loop. There are various intuitions for the practical benefits of action-chunking, ranging from:

  1. Robustness to non-Markovian / partial observability quirks in the data.
  2. Amenability to multi-modal prediction.
  3. Improved representation learning via multi-step prediction.
  4. Simulating Model-Predictive Control.

We show a different mechanism, one not described by the past literature: action-chunking can leverage the open-loop stability of the system to stabilize the learned policy.

Definition (Chunking Policy):

A chunking policy is specified by a chunk-length , and a chunking policy such that where , i.e. we predict -length sequences which are then executed “open-loop” without feedback from until the chunk has been exhausted.

For convenience we also write and denote a chunking policy as . For chunked policies, our demonstration loss becomes:

We now formalize action-chunking for imitating deterministic expert policies:

Intervention 1 (Learning over Chunked Policies):

We sample i.i.d. trajectories drawn from the expert distribution . Instead of learning a single-step policy, we learn a -chunked-policy that attains low on-expert error, e.g., by empirical risk minimization.
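One way to set up this empirical risk minimization, sketched here with placeholder names, is to slice each expert trajectory into (state, action-chunk) regression pairs at chunk boundaries:

```python
import numpy as np

def make_chunked_dataset(states, actions, chunk_len):
    """Slice expert trajectories into (state, action-chunk) regression pairs.

    `states`: list of arrays of shape (T+1, state_dim); `actions`: list of
    arrays of shape (T, action_dim). Each pair maps the state at a chunk
    boundary to the next `chunk_len` actions, which the chunked policy
    must predict jointly.
    """
    X, Y = [], []
    for xs, us in zip(states, actions):
        T = len(us)
        for t in range(0, T - chunk_len + 1, chunk_len):
            X.append(xs[t])
            Y.append(us[t:t + chunk_len])  # the length-h action-chunk label
    return np.array(X), np.array(Y)
```

Any regression model mapping states to flattened chunks can then be fit to `(X, Y)` by least squares.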

Our novel motivation behind this common intervention is that, by making the chunk length long enough, the learned policy inherits the open-loop stability of the dynamics .

Visualization of the stabilizing effect of action chunks

-EISS can be visualized as a “funnel” which coerces the system state towards the commanded trajectories. By making the chunk-length sufficiently long such that , we can guarantee that is also EISS, i.e. contractive.
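Assuming the sufficient-chunk-length condition takes the form C * rho**h < 1 (the EISS constants are stripped from this rendering, so this exact form is our assumption), the minimal stabilizing chunk length can be computed directly:

```python
def min_stabilizing_chunk(C, rho):
    """Smallest chunk length h with C * rho**h < 1, i.e. the open-loop
    contraction accumulated over one chunk dominates the EISS overshoot
    constant C. (C, rho) play the role of the EISS constants."""
    assert C >= 1 and 0 < rho < 1
    h = 1
    while C * rho ** h >= 1:
        h += 1
    return h
```

For example, an overshoot of C = 10 at contraction rate rho = 0.5 requires chunks of length at least 4, since 10 * 0.5**4 = 0.625 < 1 while 10 * 0.5**3 = 1.25 is not.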

Key Result:

Let the true dynamics be -EISS, and suppose all chunk mappings in consideration are -Lipschitz. For a sufficiently long chunk-length:

Then we have the trajectory-error bound for any -chunking policy ,

This implies that when the ambient dynamics are EISS, then a sufficiently chunked imitator policy will accrue limited compounding errors—horizon-free—relative to the on-expert error it sees.

Our result follows from the following fact: under natural assumptions, the learner's chunked policies are all closed-loop EISS. This circumvents the lower bound given earlier, in which it is hard for the learner to find policies which stabilize the dynamics if those policies must predict a single action at a time.

Noise Injection Mitigates Compounding Error under Smooth, Unstable Dynamics

We now consider the difficult setting where the ambient dynamics may not be open-loop stable. In this case, purely algorithmic interventions like action-chunking are generally insufficient, as erroneous actions can quickly lead to unstable behavior. This necessitates altering the demonstration distribution beyond the expert’s , i.e., some form of additional exploratory data collection is required.

Definition (Noise Injection):

We define the expert distribution under noise injection as the distribution over trajectories with , and for , where is drawn uniformly over the unit ball.

Our key innovation over prior algorithms such as DAgger or DART is that we learn using a weighted mixture of both the noise-injected and the “vanilla” expert data distribution .

Using a mixture is provably better, particularly in the high data regime with large . This is an intuitive result: when the on-expert error is already low, using only demonstrations with the fixed noise level may explore too much and have low coverage on the expert distribution.

Intervention (Exploratory Data Collection):

For the noise-injected distribution defined above, we collect a sample of trajectories, where a fraction of the trajectories are drawn i.i.d. from the noise-injected distribution, and the remaining trajectories are drawn i.i.d. from the clean expert distribution. Define the corresponding mixture distribution. We then find a policy that attains low error on this mixture, e.g., by empirical risk minimization.
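A minimal sketch of this data-collection loop, with placeholder `expert` and `dynamics` callables and Gaussian noise standing in for the paper's uniform-ball noise: the executed action is perturbed, but the recorded label is always the clean expert action.

```python
import numpy as np

def collect_noise_injected(expert, dynamics, x0, T, sigma, rng):
    """Roll out the expert while injecting exploration noise into the
    *executed* action, but keep the clean expert action as the training
    label. `expert(x)` and `dynamics(x, u)` are placeholder callables;
    Gaussian noise stands in for uniform-ball noise for simplicity.
    """
    x, data = x0, []
    for _ in range(T):
        u_clean = expert(x)                                  # ground-truth label
        u_exec = u_clean + sigma * rng.normal(size=u_clean.shape)
        data.append((x, u_clean))   # label with the CLEAN action...
        x = dynamics(x, u_exec)     # ...but step the system with the noisy one
    return data
```

Mixing these trajectories with clean expert rollouts (an alpha / 1-alpha split) then yields the mixture dataset used for empirical risk minimization.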

Exploratory data collection via noise injection

We can think of the data mixture as ensuring coverage both on-expert, as well as in a “tube” around the expert trajectories. Using only one or the other is suboptimal, due to a lack of either on-expert or off-expert data.

Our results in this domain make extensive use of the analysis tools introduced in Pfrommer et al. (2022), which provides strong guarantees when imitating a closed-loop EISS expert in an adversarial manner.

There are many technical subtleties that we gloss over here but explore in detail in our full manuscript. Notably, our analysis is carefully constructed to consider coverage only on the manifold of reachable states. Performing this analysis in a technically rigorous manner requires careful Control-Theoretic analysis involving concepts such as the Controllability Gramian. We additionally make several simplifying assumptions regarding first-order smoothness (i.e. that the relevant maps are differentiable with Lipschitz derivatives).

Key Result:

Let the dynamics and expert policy be -smooth, respectively, and suppose all policies are -Lipschitz. Assume that the closed-loop system induced by , , is -EISS. Let be a -Lipschitz, -smooth policy. Then, for any and,

we have:

In particular, setting (i.e. some -independent constant), we have:

Experimental Validation

Action Chunking

To validate our predictions about the stability-theoretic benefits of action-chunking, we conduct experiments on robotic imitation tasks in the RoboMimic framework. We find that:

  • Executing action chunks matters more than simply predicting longer sequences of actions. This demonstrates that the benefit of action-chunking is more than a simple consequence of representation learning, or a simulation of receding-horizon control.
  • The merits of action-chunking persist in deterministic, state-based control. This reveals that action-chunking still improves performance independently of partial observability or compatibility with generative control policies.
  • End-effector control enables the benefits of action-chunking. This is because end-effector control renders the closed loop between system state and end-effector prediction incrementally stable. Hence, the low-level end-effector controller transforms imitating the position policy into imitation in an open-loop stable dynamical system, precisely the regime where we prescribe our AC guarantees.

We visualize performance as a function of noise injection and chunk length for the MuJoCo HalfCheetah environment, and show performance relative to both DAgger and DART on HalfCheetah and Humanoid.

Noise Injection

We seek to validate our hypotheses about the exploratory benefits of noise injection. We conduct experiments on MuJoCo continuous control environments, where we imitate pre-trained expert policies. To summarize:

  • Noise injection as in Intervention 2 provides the exploration necessary to mitigate compounding errors, bringing performance on par with interactive methods such as DAgger and DART. We note that Intervention 2 collects data in one shot, without ever observing learned-policy rollouts.
  • Larger noise scales (within tolerance) improve performance, in contrast to prior understanding, which necessitates a noise level set proportional to the on-expert error, i.e. very small for policies with low on-expert error.
  • A mixture of noise-injected and clean expert trajectories is beneficial, and the difference is small when provided more data. This matches the theoretical intuition that noise-injection is necessary up until is “locally stabilized” sufficiently well around , and thus only enters the trajectory error as a higher-order term.
Noise-injection sweep results: noisy trajectories, the noise scale, and the mixture parameters.

Discussion and Limitations

Our combined action-chunking, noise-injection procedure relies on a structural assumption of either or being EISS.

Without either of these assumptions, if is unstable, errors may always compound in the worst-case (Simchowitz et al., 2025). This setting is, to some degree, uninteresting for Imitation Learning, as it means that the expert is inherently bad and cannot correct from failure.

For settings where an external oracle can stabilize the dynamics (e.g. a low-level position-based control loop), the dynamics can be reformulated such that is open-loop EISS. As such, we believe our results cover the full spectrum of situations where learning is reasonable.

References

Block, A., Jadbabaie, A., Pfrommer, D., Simchowitz, M., & Tedrake, R. (2023). Provable guarantees for generative behavior cloning: Bridging low-level stability and high-level behavior. Advances in Neural Information Processing Systems, 36, 48534–48547.
Chi, C., Feng, S., Du, Y., Xu, Z., Cousineau, E., Burchfiel, B., & Song, S. (2023). Diffusion Policy: Visuomotor Policy Learning via Action Diffusion. Robotics: Science and Systems.
Pfrommer, D., Zhang, T., Tu, S., & Matni, N. (2022). Tasil: Taylor series imitation learning. Advances in Neural Information Processing Systems, 35, 20162–20174.
Pomerleau, D. A. (1988). Alvinn: An autonomous land vehicle in a neural network. Advances in Neural Information Processing Systems, 1.
Simchowitz, M., Pfrommer, D., & Jadbabaie, A. (2025). The pitfalls of imitation learning when actions are continuous. arXiv Preprint arXiv:2503.09722.
Tu, S., Robey, A., Zhang, T., & Matni, N. (2022). On the sample complexity of stability constrained imitation learning. Learning for Dynamics and Control Conference, 180–191.
Zhao, T., Kumar, V., Levine, S., & Finn, C. (2023). Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware. Robotics: Science and Systems XIX.