Manifold Bandits

Abstract

Reinforcement learning (RL) is a central approach for improving reasoning capabilities in large language models (LLMs), where training efficiency depends critically on how problems are sampled during optimization. Existing adaptive curriculum learning methods typically prioritize prompts of intermediate difficulty, treating problem selection as a standard bandit problem with independent arms and overlooking the structured, heterogeneous nature of the task space. In this work, we frame problem sampling as a manifold-structured bandit problem with endogenous non-stationarity: problems are related through the model's latent representation space, and sampling decisions can steer how learning signals evolve across that space. To operationalize this perspective, we introduce Bayesian Manifold Curriculum (BMC), a structure-aware framework that organizes problems into a hierarchical task tree and applies Bayesian learning to guide sampling. Empirically, we find that different sampling strategies induce non-trivial tradeoffs between productivity (learning signal), diversity (coverage of the task manifold), and utility (evaluation relevance). These results show that prioritizing difficulty alone is insufficient to ensure strong downstream performance and highlight the importance of incorporating structure and type-awareness in problem sampling.

Latent Task Trees

A Latent Task Tree is a hierarchical partition of training prompts induced from the policy model's own hidden representations. It gives the curriculum a structured action space over related regions of the task manifold, without relying on human-written labels or external taxonomies.

Because the tree is built from the model's own representations, it provides a practical interface for curriculum learning:

General

Every model has a latent space over its inputs. Latent Task Trees can therefore be constructed across languages, domains, and modalities.

Adaptive

Tree construction adapts to the policy's latent geometry: different models and datasets induce different task organizations.

Convenient

Requires only a model and a prompt dataset. No task labels, difficulty labels, external models, or hand-built taxonomies are needed.

Efficient

Built from forward-only representation extraction and unsupervised clustering, making tree construction relatively cheap compared to the autoregressive rollouts used during RL training.
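
As a rough illustration of the clustering step, the sketch below recursively bisects prompt embeddings into a binary tree. It is a minimal sketch under stated assumptions, not the paper's construction: the embeddings are assumed to be precomputed vectors (for example, pooled hidden states from one forward pass per prompt), bisecting 2-means stands in for whichever unsupervised clustering method the authors use, and `two_means`, `build_tree`, `leaves`, `min_leaf`, and `max_depth` are hypothetical names.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def two_means(points, iters=20):
    """Split vectors into two groups with plain 2-means (farthest-point init)."""
    centers = [points[0], max(points, key=lambda p: sq_dist(p, points[0]))]
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [0 if sq_dist(p, centers[0]) <= sq_dist(p, centers[1]) else 1
                  for p in points]
        for c in (0, 1):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

def build_tree(ids, embeddings, min_leaf=2, depth=0, max_depth=4):
    """Recursively bisect prompt ids into nested {'ids', 'children'} dicts."""
    node = {"ids": list(ids), "children": []}
    if len(ids) < 2 * min_leaf or depth >= max_depth:
        return node
    assign = two_means([embeddings[i] for i in ids])
    left = [i for i, a in zip(ids, assign) if a == 0]
    right = [i for i, a in zip(ids, assign) if a == 1]
    if len(left) < min_leaf or len(right) < min_leaf:
        return node  # refuse degenerate splits
    node["children"] = [
        build_tree(left, embeddings, min_leaf, depth + 1, max_depth),
        build_tree(right, embeddings, min_leaf, depth + 1, max_depth),
    ]
    return node

def leaves(node):
    """Return the sorted prompt ids of every leaf, left to right."""
    if not node["children"]:
        return [sorted(node["ids"])]
    return [leaf for child in node["children"] for leaf in leaves(child)]

# Toy 2-D "embeddings" (assumed values): two well-separated groups of prompts.
embeddings = {0: [0.0, 0.1], 1: [0.1, 0.0], 2: [0.0, 0.0],
              3: [5.0, 5.1], 4: [5.1, 5.0], 5: [5.0, 5.0]}
tree = build_tree(list(embeddings), embeddings, min_leaf=1, max_depth=1)
print(leaves(tree))  # [[0, 1, 2], [3, 4, 5]]
```

No rollouts are generated anywhere in this sketch, which is the source of the cost advantage described above: the only model calls needed are the forward passes that produce the embeddings.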

Latent Trees Across Domains

The resulting tree can vary greatly across datasets, reflecting how the policy organizes different domains in representation space.

Math: DAPO-Math-17K

Code: DeepCoder-Preview

Medicine: AlphaMed-19K

Law: BarExamQA

Finance: Agentar-DeepFinance-100K

Multi-domain Reasoning: GURU-92K

Instruction Following: IF-RLVR

Visual Math: Geometry3K

Bayesian Manifold Curriculum

BMC treats curriculum learning as structured bandit optimization over the Latent Task Tree. It samples top-down, updates prompt-level beliefs from observed rollout learning signal, and aggregates information bottom-up so related regions share statistical strength.

BMC works over the Latent Task Tree at multiple levels of abstraction. Its key behavior is to balance focus and dispersion: concentrate effort where learning signal appears strongest, but spread effort across distinct types when several regions look comparably promising.

Focus

If one part of the tree looks especially promising, BMC concentrates more samples there, allocating effort where the policy seems most able to improve.

Dispersion

If several parts of the tree look similarly promising, BMC spreads samples across them, preserving diversity across distinct problem types.

Key idea: BMC treats curriculum learning as multi-scale allocation over the latent task space, focusing training effort where learning signal is high while preserving coverage across distinct problem types.
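
The top-down sampling and bottom-up aggregation described above can be sketched with Thompson Sampling over Beta posteriors. This is an illustrative sketch, not the paper's algorithm: it assumes a binary notion of learning signal (a rollout group counts as informative when its rewards are not all equal), a Beta(1, 1) prior at every node, and aggregation by pooling counts over a subtree. `Node`, `pooled_counts`, `thompson_pick`, `sample_prompt`, and `update` are hypothetical names.

```python
import random

class Node:
    """Tree node; leaves hold prompt ids, internal nodes hold children."""
    def __init__(self, prompts=(), children=()):
        self.prompts = list(prompts)
        self.children = list(children)

def descendants(node):
    """All prompt ids stored in the leaves below this node."""
    if not node.children:
        return node.prompts
    return [p for child in node.children for p in descendants(child)]

def pooled_counts(node, stats):
    """Bottom-up aggregation: pool (informative, uninformative) counts over a subtree."""
    hits = sum(stats[p][0] for p in descendants(node))
    misses = sum(stats[p][1] for p in descendants(node))
    return hits, misses

def thompson_pick(options, counts, rng):
    """Draw once from Beta(1 + hits, 1 + misses) per option; keep the largest draw."""
    draws = [rng.betavariate(1 + h, 1 + m) for h, m in counts]
    return options[draws.index(max(draws))]

def sample_prompt(root, stats, rng):
    """Top-down sampling: choose a child at each level, then a prompt in the leaf."""
    node = root
    while node.children:
        counts = [pooled_counts(child, stats) for child in node.children]
        node = thompson_pick(node.children, counts, rng)
    return thompson_pick(node.prompts, [stats[p] for p in node.prompts], rng)

def update(stats, prompt, rewards):
    """Count a rollout group as informative if its rewards are not all equal."""
    stats[prompt][0 if max(rewards) > min(rewards) else 1] += 1

# Toy example: leaf "a" has looked productive so far, leaf "b" has not.
rng = random.Random(0)
root = Node(children=[Node(prompts=["a1", "a2"]), Node(prompts=["b1", "b2"])])
stats = {"a1": [25, 0], "a2": [25, 0], "b1": [0, 25], "b2": [0, 25]}
picks = [sample_prompt(root, stats, rng) for _ in range(200)]
share_a = sum(p.startswith("a") for p in picks) / len(picks)
```

In this sketch, focus and dispersion both fall out of the posterior draws: when one subtree's posterior clearly dominates, it wins almost every draw, and when several subtrees have similar posteriors, the winning draw alternates among them.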

This makes BMC a generally applicable curriculum sampler for RL training: it can be paired with different reward signals, optimization algorithms, and task domains.

Reward-compatible

BMC is not limited to binary-verifier settings typical of RLVR. It can handle continuous rewards, enabling applications to RLHF and rubric-based RL.

Algorithm-compatible

BMC is compatible with group-relative RL algorithms that generate multiple rollouts per prompt, including GRPO, DAPO, and GSPO.

Domain-general

BMC inherits the generality of Latent Task Trees: the same curriculum sampler can operate across domains, languages, and modalities.

Results: Three Axes of Curriculum Design

We focus this website on the main DAPO-Math-17K experiments with Qwen3-8B-Base, which provide the clearest view of the curriculum tradeoffs studied in the paper. The full paper additionally reports other model sizes, policy optimization algorithms, ablations, and appendix experiments in additional domains. Here, we organize the main evidence around three interacting axes of curriculum design: productivity, diversity, and utility.

Productivity

Productivity measures whether sampled prompts produce useful RLVR updates. The most productive problems often lie in a “Goldilocks zone” of difficulty: not so easy that the model always succeeds, and not so hard that it always fails. These problems create variation across attempts, giving the policy a learning signal for improvement.
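
A short worked calculation makes the Goldilocks intuition concrete for a binary verifier. The symbols are assumptions introduced here for illustration: p is the policy's per-rollout success probability on a prompt, G is the number of rollouts per prompt, and rollouts are treated as independent.

```latex
% Binary verifier reward r \in \{0, 1\} with per-prompt success probability p.
\operatorname{Var}(r) = p(1 - p), \qquad
\max_{p} \operatorname{Var}(r) = \tfrac{1}{4} \ \text{at} \ p = \tfrac{1}{2}.

% Probability that G independent rollouts are not all identical:
\Pr[\text{mixed outcomes}] = 1 - p^{G} - (1 - p)^{G}.

% Illustrative values for an assumed group size G = 8:
%   p = 0.50:  1 - 2 (0.5)^{8}          \approx 0.992
%   p = 0.95:  1 - 0.95^{8} - 0.05^{8}  \approx 0.337
```

Under these assumptions, a prompt the policy solves half the time almost always yields mixed outcomes, while one it solves 95% of the time yields them in roughly a third of groups.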

We track this in two ways. Effective Ratio is the fraction of sampled prompts that produce nonzero learning signal, as indicated by nonzero reward variance across a prompt's rollouts. Learning Speed measures how quickly the policy improves on the training set. Dynamic Sampling finds productive prompts by repeatedly searching for batches with reward variation, but this extra search increases wall-clock cost. BMC instead uses the Latent Task Tree to predict and target productive problems directly.
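
A minimal sketch of the Effective Ratio computation, assuming a batch is represented as a mapping from prompt id to that prompt's rollout rewards. The function name, the data layout, and the `tol` parameter are assumptions made for illustration.

```python
def effective_ratio(rollout_rewards, tol=1e-8):
    """Fraction of prompts whose rollout rewards are not all (nearly) equal.

    Reward variance is nonzero exactly when max(rewards) > min(rewards),
    so the range is checked instead of computing the variance itself.
    """
    informative = sum(
        1 for rewards in rollout_rewards.values()
        if max(rewards) - min(rewards) > tol
    )
    return informative / len(rollout_rewards)

# Toy batch (assumed values): four prompts with four rollouts each.
batch = {
    "p1": [1, 1, 1, 1],          # always solved: no signal
    "p2": [0, 1, 0, 1],          # mixed outcomes: signal
    "p3": [0, 0, 0, 0],          # never solved: no signal
    "p4": [0.2, 0.9, 0.4, 0.4],  # continuous rewards also work
}
print(effective_ratio(batch))  # 0.5
```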

Productivity results on Qwen3-8B-Base with GSPO. Effective Ratio tracks how often sampled prompts fall into the useful learning regime; Learning Speed tracks training-set pass@1 improvement. A representative result is shown; full protocol details are provided in the paper.

Takeaway: On Qwen3-8B-Base, BMC matches Dynamic Sampling's learning speed without its wall-clock overhead. More broadly, BMC provides a structured middle ground between cheap uniform sampling and expensive resampling-based curricula.

Productivity is only the first axis: the next sections ask whether productive sampling also preserves coverage and aligns with downstream evaluation goals.

Diversity

Diversity measures how well a sampling method preserves coverage of problem types on the latent task manifold. Since training datasets are often imbalanced, a flat, difficulty-driven sampler can focus narrowly on a small set of productive problem types. A diversity-only sampler (the “Tree Only” ablation of BMC) instead emphasizes broad type exposure. BMC is designed to jointly balance these objectives.
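
One simple way to quantify coverage of problem types is to look at which leaves of the Latent Task Tree a sampler visits and how evenly. The sketch below is an assumption for illustration, not necessarily the diversity metric used in the paper; `type_coverage` and its two output fields are hypothetical.

```python
import math
from collections import Counter

def type_coverage(sampled_leaves, n_leaves):
    """Summarize how broadly and how evenly samples cover the tree's leaves.

    sampled_leaves: leaf id of each sampled prompt.
    n_leaves: total number of leaves in the tree.
    """
    counts = Counter(sampled_leaves)
    n = len(sampled_leaves)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    max_entropy = math.log(n_leaves) if n_leaves > 1 else 1.0
    return {
        "fraction_visited": len(counts) / n_leaves,
        "normalized_entropy": entropy / max_entropy,
    }

# A sampler stuck on one of four leaves versus one that covers all four evenly.
narrow = type_coverage(["leaf_0"] * 8, n_leaves=4)
broad = type_coverage(["leaf_0", "leaf_1", "leaf_2", "leaf_3"] * 2, n_leaves=4)
```

In this toy comparison the narrow sampler visits a quarter of the leaves with zero entropy, while the broad one visits all of them with normalized entropy of one.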

Takeaways

Coverage is preserved

BMC jointly balances productivity and diversity, interpolating between flat difficulty-driven Thompson Sampling and the diversity-only Tree Only ablation while preserving broader type coverage.

Similar type, similar difficulty

Positive structure gain shows that learning signal, our proxy for difficulty, is more aligned with Latent Task Tree neighborhoods than with random groupings. In other words, related problem types tend to share more learning structure than unrelated types.
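
The exact definition of structure gain is given in the paper rather than here. As one plausible reading of the comparison against random groupings, the sketch below measures how much more homogeneous the learning signal is within tree groups than within randomly shuffled groups of the same sizes. The formula, the function names, and the permutation baseline are all assumptions.

```python
import random
from statistics import pvariance

def within_group_variance(signal, groups):
    """Size-weighted mean of the per-group variance of the learning signal."""
    total = sum(len(g) for g in groups)
    return sum(len(g) * pvariance([signal[p] for p in g]) for g in groups) / total

def structure_gain(signal, groups, n_perm=200, seed=0):
    """Return 1 - W_tree / W_random, with W_random averaged over shuffled groupings."""
    rng = random.Random(seed)
    ids = [p for g in groups for p in g]
    w_tree = within_group_variance(signal, groups)
    w_rand = 0.0
    for _ in range(n_perm):
        rng.shuffle(ids)
        shuffled, start = [], 0
        for g in groups:  # keep the original group sizes
            shuffled.append(ids[start:start + len(g)])
            start += len(g)
        w_rand += within_group_variance(signal, shuffled) / n_perm
    return 1.0 - w_tree / w_rand if w_rand > 0 else 0.0

# Toy per-prompt learning signal (assumed values): group "a" is high, group "b" is low.
signal = {"a1": 0.9, "a2": 0.8, "a3": 0.85, "a4": 0.95,
          "b1": 0.1, "b2": 0.2, "b3": 0.15, "b4": 0.05}
groups = [["a1", "a2", "a3", "a4"], ["b1", "b2", "b3", "b4"]]
gain = structure_gain(signal, groups)
```

Under this definition, a gain near zero means the grouping explains the signal no better than chance, and a gain near one means prompts in the same group have nearly identical signal.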

Utility

Utility asks whether a sampling method improves the capabilities or evaluations we actually care about. Productive prompts are not always evaluation-relevant prompts: a sampler can find high learning signal while under-sampling regions that matter for downstream tests. Likewise, broad diversity is not helpful if the additional coverage falls into evaluation deadzones: regions of the latent task manifold that are productive, but weakly related to the target evaluations.

Utility is evaluation-dependent: methods that improve one benchmark family can underperform on another. Full method-by-method breakdowns are provided in the paper.

Takeaways

Productivity is not sufficient

Dynamic Sampling is highly productive during training, but productivity does not guarantee downstream utility. It can find strong learning signal while focusing on problem types that transfer poorly to some evaluations, such as Chinese math.

Diversity is not sufficient

BMC is productive and samples across broader problem types, but broader coverage only helps when those types are relevant to the target evaluation. Some productive problem types can fall into evaluation deadzones: regions of the task manifold that are learnable during training, but weakly related to the target test distribution.

Utility Awareness: Steering Generalization with BMC-T

The evaluation deadzone interpretation raises a natural question: can we intentionally target problem types that are not only productive and diverse, but also relevant to a desired capability? To investigate this, we introduce a targeting extension of BMC, denoted BMC-T.

BMC-T builds a shared Latent Task Tree over both training prompts and target examples, such as held-out evaluation problems or a collection of tasks representing a desired capability. The target examples are never used for training. Instead, they help identify which training problem types overlap with, or lie near, the target capability on the tree.

Intuitively, BMC-T treats proximity on the Latent Task Tree as a coarse form of conceptual similarity: training on problem types near a target should be more likely to transfer than training on unrelated types. We evaluate two variants on DAPO-Math-17K: BMC-T (T = All), which uses the full evaluation mixture as the target distribution, and BMC-T (T = AIME2024), which uses only AIME2024 as the target distribution.
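
How BMC-T converts tree proximity into a sampling bias is specified in the paper. The sketch below shows one simple possibility under stated assumptions: every training prompt and target example has already been assigned to a leaf of the shared tree, and each leaf is weighted by its smoothed share of target examples. `target_weights`, `smoothing`, and the leaf ids are hypothetical.

```python
from collections import Counter

def target_weights(train_leaf, target_leaf, smoothing=1.0):
    """Weight each leaf that contains training prompts by its share of targets.

    train_leaf / target_leaf: mappings from example id to leaf id on the shared tree.
    smoothing: pseudo-count that keeps target-free leaves reachable.
    """
    leaves = sorted(set(train_leaf.values()))
    hits = Counter(leaf for leaf in target_leaf.values() if leaf in leaves)
    raw = {leaf: hits[leaf] + smoothing for leaf in leaves}
    total = sum(raw.values())
    return {leaf: weight / total for leaf, weight in raw.items()}

# Toy assignment (assumed): three training leaves, target examples in two of them.
train_leaf = {"t1": "leaf_0", "t2": "leaf_0", "t3": "leaf_1", "t4": "leaf_2"}
target_leaf = {"e1": "leaf_0", "e2": "leaf_0", "e3": "leaf_1"}
weights = target_weights(train_leaf, target_leaf)
```

The target examples only shape the weights; they never enter the training set. Applying the same weighting at internal nodes rather than only at leaves would also reward training types that lie near a target without sharing a leaf with it.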

The key question is whether utility can be treated as an independent axis of curriculum design. If BMC-T variants maintain similar productivity but produce different downstream evaluation profiles, then utility is not reducible to productivity alone.

Productivity metrics for BMC and BMC-T variants. BMC and the two BMC-T variants maintain similar productivity during training, with comparable learning speed and effective ratio.

Evaluation results for BMC and BMC-T variants. Despite similar productivity, the variants produce different downstream evaluation profiles depending on the target distribution used to bias sampling. Targeting AIME2024 improves AIME-style English competition math, while targeting the full evaluation mixture shifts performance toward a broader capability profile, including stronger Chinese math performance.

Takeaways

Utility is not productivity

BMC, BMC-T (T = All), and BMC-T (T = AIME2024) exhibit similar learning speed and effective ratio, yet produce different downstream evaluation profiles. Productivity explains whether training examples produce useful updates, but not which capabilities those updates improve.

Generalization is steered

Changing the target distribution changes which evaluations improve, suggesting that capability development can be shaped by where training effort falls on the latent task manifold. BMC-T uses the shared Latent Task Tree as a coarse map for steering training toward desired capabilities.

Main Takeaways

BMC treats curriculum learning as multi-scale allocation over a heterogeneous task space. The central lesson is simple: there are more variables to consider than just difficulty. Effective curricula must also account for problem type, coverage, and downstream utility.

Human-Policy Task Mismatch

Human-defined labels and categories can hide the structure the policy actually sees. What looks like one dataset or domain can contain many distinct task distributions and problem types, each with different difficulty, learning dynamics, and downstream relevance.

Type matters

Two curricula can produce similar learning signal while exposing the model to different problem types, leading to different downstream capability profiles.

Difficulty is not enough

Productive problems help the model learn faster, but high productivity alone does not guarantee better evaluations.

Utility is its own axis

Even a productive curriculum that spans many problem types can spend effort on types that contribute little to the capabilities we care about. Utility captures whether sampled types are not only learnable and diverse, but also relevant to the desired capability.

BibTeX

@misc{mckenzie2026manifoldbanditsbayesiancurriculum,
      title={Manifold Bandits: Bayesian Curriculum Learning over the Latent Geometry of Large Language Models}, 
      author={Darrien McKenzie and Nicklas Hansen and Xiaolong Wang},
      year={2026},
      eprint={2606.19750},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2606.19750}, 
}

Manifold Bandits: Bayesian Curriculum Learning over the Latent Geometry of Large Language Models

Bayesian Manifold Curriculum (BMC) builds RL training batches by sampling productive and diverse problem types from the policy model's latent task manifold. This improves the trade-off between training efficiency and broad data coverage.
