Multi-Intention Inverse Reinforcement Learning
Disentangling internally misaligned demonstrations and learning multiple reward functions using gradient-based classification with MaxEnt IRL.
Overview
When learning from demonstrations, you usually assume the demonstrators all want the same thing. In practice, that assumption breaks: different people have different preferences, biases, and goals, and trying to fit a single reward function to a misaligned mixture often produces a policy that satisfies no one. Multi-Intention IRL (MIIRL) addresses this by learning K reward functions from N demonstrations, partitioning the demonstrations by which underlying intent they reflect.
This project, joint with Misha Lvovsky, proposes a framework for augmenting standard IRL algorithms into MIIRL algorithms by using the gradients of the candidate reward functions to classify demonstrations. We apply this approach to Maximum Entropy IRL and characterize when it works, when it fails, and why.
Background
The MIIRL problem learns K reward functions (intents) from a set of N demonstrations, where each demonstration is assumed to correspond to one of the K classes. Prior approaches mostly use Expectation-Maximization in the inner loop — Babes-Vroman et al. combine EM with MaxEnt; Bighashdel et al. extend it to deep networks; Hausman et al. add a latent intent vector to GAIL.
EM requires running full IRL inside each iteration, which is expensive. Our approach sidesteps this by using a single quantity each step — the gradient magnitude of each demonstration with respect to each candidate reward — to softly classify demonstrations into K sets, then runs the standard MaxEnt update on the resulting weighted mixture.
Method
For MaxEnt IRL specifically, the gradient of the log-likelihood for a demonstration under reward parameters θ is the difference between the demonstration’s empirical feature counts and the feature counts under the policy induced by θ. Demonstrations whose feature counts are closer to those produced by reward k will have smaller gradient magnitudes under θₖ. We exploit this by computing per-demonstration, per-reward gradient magnitudes, applying a softmin to get assignment probabilities, and using those probabilities to weight the feature-count averages used in the MaxEnt update.
The general framework:
- Compute per-demonstration, per-reward gradients
- Convert gradient magnitudes to soft cluster assignments via softmin
- Build K weighted datasets from the assignment logits
- Apply each IRL update using its weighted dataset
- Repeat until convergence
The full algorithm is general: any IRL method whose gradient is informative about per-demonstration fit can be extended this way. We instantiate it with MaxEnt for tractability.
Experimental Setup
Experiments use 15×15 randomly generated gridworld environments built on OpenAI Gym, with 5 features distributing color-coded rewards across cells. Ground-truth reward functions are generated as small perturbations around a baseline vector — typically perturbation magnitude m = 0.1 to mimic the close-but-not-identical preferences of real demonstrators. We use 64 demonstrations per reward function.
Findings
Performance is bimodal
On unconstrained random gridworlds and reward pairs, the algorithm either converges cleanly (cross-entropy near zero) or fails to disentangle the demonstrations at all. There is little middle ground.
Failure modes are pair-dependent, not environment-dependent
Boxplots across many environments with different reward pairs show only mild variation in median performance — no individual gridworld is consistently “bad.” A heatmap of (environment, reward) pairs reveals that the difficulty is in the interaction between specific environments and specific reward functions, not in either alone.
Demonstration feature counts can be nearly indistinguishable
For small reward perturbations, the feature-count distributions of the two demonstration classes overlap heavily. When the underlying rewards are too similar, no amount of clever classification will separate them — the signal isn’t there.
Initialization matters more than expected
Comparing random reward proposals against orthogonal proposals (where the second proposal is the negation of the first) showed dramatic differences. Orthogonal initialization consistently produced lower cross-entropy, even when the true reward functions were similar. The intuition: starting with proposals that are far apart gives the algorithm room to separate demonstrations before the proposals collapse toward each other.
Reward similarity is the dominant difficulty axis
Sweeping the perturbation magnitude from 0.01 to 1.0 confirms what theory suggests: the closer the two ground-truth reward functions, the harder the problem. At perturbation 1.0 the runs converge cleanly almost every time; at 0.01–0.1 (the realistic regime for human-generated data) the bimodal failure mode dominates.
Conclusions
The gradient-augmented MIIRL framework works, but its sensitivity to initialization and reward similarity is significant. The bimodal failure pattern is largely a property of MaxEnt’s optimizer — converged solutions are correct, but bad initializations get stuck. Most environment-reward combinations are learnable in principle; getting there requires either orthogonal initialization or rewards separated by more than ~0.1.
For real-world deployment, this suggests two practical levers: (1) use orthogonal or otherwise diverse reward proposals, and (2) be aware that the method has a floor on how similar the underlying intents can be while remaining recoverable.
Future Work
The most natural extension is applying the gradient-augmentation framework to a stronger IRL backbone — AIRL is the obvious target, since its gradient also carries per-demonstration information about reward fit. Beyond that, applying this to deliberately biased demonstrator pools (left-handed vs right-handed, novice vs expert) would test whether the method recovers semantically meaningful intents from real human data. There’s also a connection to KL-constrained preference learning for LLMs worth exploring, where MIIRL-style separation could let preference budgets be allocated per niche rather than globally averaged.