Reinforcement learning (RL) is a central approach for improving reasoning capabilities in large language models (LLMs), where training efficiency depends critically on how problems are sampled during optimization. Existing adaptive curriculum learning methods typically prioritize prompts of intermediate difficulty, treating problem selection as a standard bandit problem with independent arms and overlooking the structured, heterogeneous nature of the task space. In this work, we frame problem sampling as a manifold-structured bandit problem with endogenous non-stationarity: problems are related through the model's latent representation space, and sampling decisions can steer how learning signals evolve across that space. To operationalize this perspective, we introduce Bayesian Manifold Curriculum (BMC), a structure-aware framework that organizes problems into a hierarchical task tree and applies Bayesian learning to guide sampling. Empirically, we find that different sampling strategies induce non-trivial tradeoffs between productivity (learning signal), diversity (coverage of the task manifold), and utility (evaluation relevance). These results show that prioritizing difficulty alone is insufficient to ensure strong downstream performance and highlight the importance of incorporating structure and type-awareness in problem sampling.
A Latent Task Tree is a hierarchical partition of training prompts induced from the policy model's own hidden representations. It gives the curriculum a structured action space over related regions of the task manifold, without relying on human-written labels or external taxonomies.
Because the tree is built from the model's own representations, it provides a practical interface for curriculum learning:
Every model has a latent space over its inputs. Latent Task Trees can therefore be constructed across languages, domains, and modalities.
Tree construction adapts to the policy's latent geometry: different models and datasets induce different task organizations.
Construction requires only a model and a prompt dataset. No task labels, difficulty labels, external models, or hand-built taxonomies are needed.
The tree is built from forward-only representation extraction and unsupervised clustering, which makes construction relatively cheap compared with the autoregressive rollouts used during RL training.
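As a rough illustration of the construction described above, the sketch below recursively bisects prompts by their embedding vectors. It assumes the embeddings have already been extracted with a forward pass (for example, pooled hidden states). The bisecting 2-means procedure, the `min_leaf` and `max_depth` parameters, and all function names are illustrative assumptions, not the clustering method used in the paper.

```python
def _dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _mean(vectors):
    """Coordinate-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def two_means(vectors, iters=20):
    """Assign each vector to one of two clusters using Lloyd iterations.

    Initialization is deterministic: the first vector and the vector
    farthest from it serve as the starting centers.
    """
    first = vectors[0]
    centers = [first, max(vectors, key=lambda v: _dist2(v, first))]
    labels = [0] * len(vectors)
    for _ in range(iters):
        labels = [0 if _dist2(v, centers[0]) <= _dist2(v, centers[1]) else 1
                  for v in vectors]
        for k in (0, 1):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:
                centers[k] = _mean(members)
    return labels


def build_tree(ids, embeddings, min_leaf=2, depth=0, max_depth=4):
    """Recursively bisect prompt ids; returns nested dicts (hypothetical layout)."""
    node = {"ids": list(ids), "children": []}
    if len(ids) < 2 * min_leaf or depth >= max_depth:
        return node
    labels = two_means([embeddings[i] for i in ids])
    left = [i for i, lab in zip(ids, labels) if lab == 0]
    right = [i for i, lab in zip(ids, labels) if lab == 1]
    if len(left) < min_leaf or len(right) < min_leaf:
        return node  # refuse splits that would create tiny clusters
    node["children"] = [
        build_tree(left, embeddings, min_leaf, depth + 1, max_depth),
        build_tree(right, embeddings, min_leaf, depth + 1, max_depth),
    ]
    return node


def leaves(node):
    """Collect the prompt ids stored at each leaf, left to right."""
    if not node["children"]:
        return [node["ids"]]
    return [leaf for child in node["children"] for leaf in leaves(child)]


# Toy 2-D "embeddings": two well-separated groups of four prompts each.
toy_embeddings = {
    0: [0, 0], 1: [0, 1], 2: [1, 0], 3: [1, 1],
    4: [10, 10], 5: [10, 11], 6: [11, 10], 7: [11, 11],
}
toy_tree = build_tree(list(toy_embeddings), toy_embeddings)
print(leaves(toy_tree))
```

On this toy input the root splits into the two well-separated groups, and neither group is split further because any further split would fall below `min_leaf`. Real hidden representations are high-dimensional, so a practical version would likely use a numerical library and a more careful clustering routine.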
The resulting tree can vary greatly across datasets, reflecting how the policy organizes different domains in representation space.
BMC treats curriculum learning as structured bandit optimization over the Latent Task Tree. It samples top-down, updates prompt-level beliefs from observed rollout learning signal, and aggregates information bottom-up so related regions share statistical strength.
BMC works over the Latent Task Tree at multiple levels of abstraction. Its key behavior is to balance focus and dispersion: concentrate effort where learning signal appears strongest, but spread effort across distinct types when several regions look comparably promising.
If one part of the tree looks especially promising, BMC concentrates more samples there, allocating effort where the policy seems most able to improve.
If several parts of the tree look similarly promising, BMC spreads samples across them, preserving diversity across distinct problem types.
Key idea: BMC treats curriculum learning as multi-scale allocation over the latent task space, focusing training effort where learning signal is high while preserving coverage across distinct problem types.
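The focus-versus-dispersion behavior can be sketched with Thompson sampling over a small tree. This is a minimal illustration under strong assumptions, not the paper's implementation: learning signal is reduced to a binary "this rollout group had signal" observation, each node keeps a Beta posterior, and bottom-up aggregation is modeled by adding every observation to all ancestors. The `Node` class, `sample_leaf`, `update`, and the toy node names are hypothetical.

```python
import random


class Node:
    """A tree node holding pooled Beta pseudo-counts for its subtree."""

    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []
        self.parent = None
        self.hits = 0    # rollout groups that produced learning signal
        self.misses = 0  # rollout groups that produced none
        for child in self.children:
            child.parent = self

    def draw(self, rng):
        # Thompson draw from a Beta(1 + hits, 1 + misses) posterior.
        return rng.betavariate(1 + self.hits, 1 + self.misses)


def sample_leaf(root, rng):
    """Top-down sampling: at each level follow the child with the best draw."""
    node = root
    while node.children:
        node = max(node.children, key=lambda child: child.draw(rng))
    return node


def update(leaf, had_signal):
    """Record one observation at the leaf and pool it into every ancestor."""
    node = leaf
    while node is not None:
        if had_signal:
            node.hits += 1
        else:
            node.misses += 1
        node = node.parent


def make_toy_tree():
    """Two hypothetical problem types, each with two prompts."""
    return Node("root", [
        Node("type_a", [Node("a1"), Node("a2")]),
        Node("type_b", [Node("b1"), Node("b2")]),
    ])


rng = random.Random(0)
root = make_toy_tree()
for _ in range(50):
    update(root.children[0].children[0], True)   # a1 keeps yielding signal
    update(root.children[1].children[0], False)  # b1 never does
picked = [sample_leaf(root, rng).name for _ in range(10)]
print(picked)
```

When one branch has clearly better pooled evidence, its posterior draws dominate and samples concentrate there. When branches have similar evidence, the random draws alternate between them, which spreads samples across types. Because pooled counts live on the internal nodes, a prompt that has never been sampled still benefits from what its siblings revealed.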
This makes BMC a generally applicable curriculum sampler for RL training: it can be paired with different reward signals, optimization algorithms, and task domains.
BMC is not limited to binary-verifier settings typical of RLVR. It can handle continuous rewards, enabling applications to RLHF and rubric-based RL.
BMC is compatible with group-relative RL algorithms that generate multiple rollouts per prompt, including GRPO, DAPO, and GSPO.
BMC inherits the generality of Latent Task Trees: the same curriculum sampler can operate across domains, languages, and modalities.
We focus this website on the main DAPO-Math-17K experiments with Qwen3-8B-Base, which provide the clearest view of the curriculum tradeoffs studied in the paper. The full paper additionally reports other model sizes, policy optimization algorithms, ablations, and appendix experiments in additional domains. Here, we organize the main evidence around three interacting axes of curriculum design: productivity, diversity, and utility.
Productivity measures whether sampled prompts produce useful RLVR updates. The most productive problems often lie in a “Goldilocks zone” of difficulty: not so easy that the model always succeeds, and not so hard that it always fails. These problems create variation across attempts, giving the policy a learning signal for improvement.
We track this in two ways. Effective Ratio measures how many sampled prompts produce nonzero learning signal, measured by rollout reward variance. Learning Speed measures how quickly the policy improves on the training set. Dynamic Sampling finds productive prompts by repeatedly searching for batches with reward variation, but this extra search increases wall-clock cost. BMC instead uses the Latent Task Tree to predict and target productive problems directly.
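To make the Effective Ratio concrete, here is a minimal sketch that follows the description above: a prompt counts as effective when the rewards of its rollouts are not all identical, that is, when their variance is nonzero. The function name and the input layout (one list of rewards per sampled prompt) are assumptions made for illustration.

```python
from statistics import pvariance


def effective_ratio(rollout_rewards):
    """Fraction of prompts whose rollout rewards have nonzero variance.

    rollout_rewards: one list of rewards per sampled prompt (assumed layout).
    """
    if not rollout_rewards:
        return 0.0
    effective = sum(1 for rewards in rollout_rewards if pvariance(rewards) > 0)
    return effective / len(rollout_rewards)


# Four prompts with four rollouts each, scored by a binary verifier.
batch = [
    [1, 1, 1, 1],  # always solved: no learning signal
    [0, 0, 0, 0],  # never solved: no learning signal
    [1, 0, 1, 0],  # mixed outcomes: learning signal
    [0, 0, 0, 1],  # mixed outcomes: learning signal
]
print(effective_ratio(batch))  # → 0.5
```

With binary rewards and success rate p, the reward variance is p(1 - p), which is zero at p = 0 and p = 1 and largest at p = 0.5. That is the "Goldilocks zone" in numerical form. The same check works unchanged for continuous rewards.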
Qwen3-8B-Base result shown. Full protocol details are provided in the paper.
Takeaway: On Qwen3-8B-Base, BMC matches Dynamic Sampling's learning speed without its wall-clock overhead. More broadly, BMC provides a structured middle ground between cheap uniform sampling and expensive resampling-based curricula.
Productivity is only the first axis: the next sections ask whether productive sampling also preserves coverage and aligns with downstream evaluation goals.
Diversity measures how well a sampling method preserves coverage of problem types on the latent task manifold. Since training datasets are often imbalanced, a flat, difficulty-driven sampler can focus narrowly on a small set of productive problem types. A diversity-only sampler (the “Tree Only” ablation of BMC) instead emphasizes broad type exposure. BMC is designed to jointly balance these objectives.
Takeaways
BMC jointly balances productivity and diversity, interpolating between flat difficulty-driven Thompson Sampling and the diversity-only Tree Only ablation while preserving broader type coverage.
Positive structure gain shows that learning signal, our proxy for difficulty, is more aligned with Latent Task Tree neighborhoods than with random groupings. In other words, related problem types tend to share more learning structure than unrelated types.
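This page does not spell out how structure gain is computed, so the following is only one hypothetical way to make the comparison concrete. It measures how much lower the within-group variance of a per-prompt learning signal is under the tree's grouping than under random groupings of the same sizes. The function names, the variance-based definition, and the toy numbers are all assumptions, not the paper's metric or data.

```python
import random
from statistics import pvariance


def within_group_variance(signal, groups):
    """Size-weighted mean of the per-group variance of the signal."""
    total = sum(len(group) for group in groups)
    return sum(len(group) * pvariance([signal[i] for i in group])
               for group in groups) / total


def structure_gain(signal, tree_groups, rng, n_shuffles=200):
    """Random-grouping baseline minus tree-grouping within-group variance.

    Positive values mean prompts grouped by the tree have more similar
    learning signal than prompts grouped at random.
    """
    ids = [i for group in tree_groups for i in group]
    tree_var = within_group_variance(signal, tree_groups)
    random_vars = []
    for _ in range(n_shuffles):
        shuffled = ids[:]
        rng.shuffle(shuffled)
        groups, start = [], 0
        for group in tree_groups:  # keep the same group sizes as the tree
            groups.append(shuffled[start:start + len(group)])
            start += len(group)
        random_vars.append(within_group_variance(signal, groups))
    return sum(random_vars) / len(random_vars) - tree_var


# Made-up per-prompt learning signal for two hypothetical tree neighborhoods.
toy_signal = {0: 0.9, 1: 0.8, 2: 0.85, 3: 0.1, 4: 0.2, 5: 0.15}
toy_groups = [[0, 1, 2], [3, 4, 5]]
print(structure_gain(toy_signal, toy_groups, random.Random(0)) > 0)
```

In this toy case the neighborhoods separate high-signal prompts from low-signal ones, so the tree grouping has much lower within-group variance than a typical random grouping and the gain is positive.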
Utility asks whether a sampling method improves the capabilities or evaluations we actually care about. Productive prompts are not always evaluation-relevant prompts: a sampler can find high learning signal while under-sampling regions that matter for downstream tests. Likewise, broad diversity is not helpful if the additional coverage falls into evaluation deadzones: regions of the latent task manifold that are productive, but weakly related to the target evaluations.
Utility is evaluation-dependent: methods that improve one benchmark family can underperform on another. Full method-by-method breakdowns are provided in the paper.
Takeaways
Dynamic Sampling is highly productive during training, but productivity does not guarantee downstream utility. It can find strong learning signal while focusing on problem types that transfer poorly to some evaluations, such as Chinese math.
BMC is productive and samples across broader problem types, but broader coverage only helps when those types are relevant to the target evaluation. Some productive problem types can fall into evaluation deadzones: regions of the task manifold that are learnable during training, but weakly related to the target test distribution.
The evaluation deadzone interpretation raises a natural question: can we intentionally target problem types that are not only productive and diverse, but also relevant to a desired capability? To investigate this, we introduce a targeting extension of BMC, denoted BMC-T.
BMC-T builds a shared Latent Task Tree over both training prompts and target examples, such as held-out evaluation problems or a collection of tasks representing a desired capability. The target examples are never used for training. Instead, they help identify which training problem types overlap with, or lie near, the target capability on the tree.
Intuitively, BMC-T treats proximity on the Latent Task Tree as a coarse form of conceptual similarity: training on problem types near a target should be more likely to transfer than training on unrelated types. We evaluate two variants on DAPO-Math-17K: BMC-T (T = All), which uses the full evaluation mixture as the target distribution, and BMC-T (T = AIME2024), which uses only AIME2024 as the target distribution.
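One simple way to picture the targeting step is a per-leaf weight derived from how many target examples land in each leaf of the shared tree. The sketch below is a hypothetical simplification: it only captures exact leaf overlap, not nearness through shared ancestors, and the `target_weights` function, the `smoothing` parameter, and the toy assignments are assumptions and not the BMC-T procedure itself. Target examples only shape the weights and are never returned as training candidates.

```python
from collections import Counter


def target_weights(train_leaf, target_leaf, smoothing=0.1):
    """Weight each training leaf by its share of target examples.

    train_leaf:  maps training prompt id -> leaf id in the shared tree
    target_leaf: maps target example id  -> leaf id in the shared tree
    smoothing:   small constant so leaves without targets stay reachable
    """
    target_counts = Counter(target_leaf.values())
    training_leaves = sorted(set(train_leaf.values()))
    raw = {leaf: target_counts.get(leaf, 0) + smoothing
           for leaf in training_leaves}
    total = sum(raw.values())
    return {leaf: weight / total for leaf, weight in raw.items()}


# Toy shared tree with three training leaves; leaf "D" holds only a target.
toy_train = {"p1": "A", "p2": "A", "p3": "B", "p4": "C"}
toy_target = {"t1": "A", "t2": "A", "t3": "B", "t4": "D"}
weights = target_weights(toy_train, toy_target)
for leaf, weight in weights.items():
    print(leaf, round(weight, 3))
```

Leaves that contain many target examples receive more weight, leaves with none keep a small nonzero weight, and leaves that contain only target examples (such as "D" here) are ignored because there is nothing to train on. Such weights could, for instance, bias the sampler's top-down choices toward target-relevant regions while productivity still decides which prompts inside those regions are worth sampling.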
The key question is whether utility can be treated as an independent axis of curriculum design. If BMC-T variants maintain similar productivity but produce different downstream evaluation profiles, then utility is not reducible to productivity alone.
Takeaways
BMC, BMC-T (T = All), and BMC-T (T = AIME2024) exhibit similar learning speed and effective ratio, yet produce different downstream evaluation profiles. Productivity explains whether training examples produce useful updates, but not which capabilities those updates improve.
Changing the target distribution changes which evaluations improve, suggesting that capability development can be shaped by where training effort falls on the latent task manifold. BMC-T uses the shared Latent Task Tree as a coarse map for steering training toward desired capabilities.
BMC treats curriculum learning as multi-scale allocation over a heterogeneous task space. The central lesson is simple: there are more variables to consider than just difficulty. Effective curricula must also account for problem type, coverage, and downstream utility.
Human-defined labels and categories can hide the structure the policy actually sees. What looks like one dataset or domain can contain many distinct task distributions and problem types, each with different difficulty, learning dynamics, and downstream relevance.
Two curricula can produce similar learning signal while exposing the model to different problem types, leading to different downstream capability profiles.
Productive problems help the model learn faster, but high productivity alone does not guarantee better evaluations.
Even a productive curriculum that spans many problem types can spend effort on types that contribute little to the capabilities we care about. Utility captures whether sampled types are not only learnable and diverse, but also relevant to the desired capability.
@misc{mckenzie2026manifoldbanditsbayesiancurriculum,
title={Manifold Bandits: Bayesian Curriculum Learning over the Latent Geometry of Large Language Models},
author={Darrien McKenzie and Nicklas Hansen and Xiaolong Wang},
year={2026},
eprint={2606.19750},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2606.19750},
}