As of March 2026
Agency should be studied as dynamic: an agent may decide on and change its own objectives in the face of novel scenarios. Future agents will likely learn continually, in environments that co-evolve with their behaviour, forming and revising goals throughout their lifecycles. The risk posed by such agents is the unpredictability of their behaviour and the downstream effects of their decisions, both in environments shaped by their feedback and in human-AI interactions shaped by their decisions. Misalignment in such systems may have a developmental trajectory, running from initial conditions through every subsequent environment, that is not handled by detection methods designed for static systems with fixed objectives. Empirically studying agency in development should therefore be a priority.
This project began as an attempt to understand how, and under what selection pressures, instrumental subgoals form, what constrains them, and how they evolve. The central hypothesis is that every environment contains natural attractors for values, which the agent may selectively pick up depending on its learning path and initial conditions, and which in turn cause certain subgoals to form, contextually emerge, conflict, and compose as each learning environment changes. If the agent becomes aware of its own learning trajectory and of the environment's co-evolution with it, it may actively shape what it learns next, and with it its own goals and values, by exploiting, cooperating with, or co-opting the curriculum generator.
My current research shows empirically that an agent trained to perform robustly under evolving environment distributions develops a predictive model of the data generator driving the curriculum. The study trained RL agents under five different Unsupervised Environment Design (UED) methods, each providing an evolving curriculum of environments built from the agent's feedback or regret, always presenting the environment with the highest learning potential next. Consequently, the agent develops a model robust enough to predict the next set of environment variables and their associated difficulty, patterns in the evolving environment, and how much the environment is likely to shift next. This level of awareness can allow the agent to exploit its knowledge of the data generator, with evaluation awareness and situational awareness as downstream effects.
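The curriculum mechanism described above can be made concrete with a minimal, PLR-style sketch: keep a buffer of levels scored by a regret proxy and replay the highest-scoring one next. All names here (`RegretCurriculum`, the value-loss proxy, the exploration rate) are illustrative assumptions, not the exact setup used in this study.

```python
import random

class RegretCurriculum:
    """Minimal regret-driven level selector: replay the level with the
    highest estimated regret (learning potential) next.
    A toy sketch of the UED idea, not the study's implementation."""

    def __init__(self, rng=None):
        self.scores = {}                 # level_id -> estimated regret
        self.rng = rng or random.Random(0)

    def update(self, level_id, value_loss):
        # Positive value loss is a common cheap proxy for regret:
        # high loss means the agent still has a lot to learn here.
        self.scores[level_id] = max(0.0, value_loss)

    def next_level(self, explore_prob=0.1):
        # Occasionally sample a brand-new random level so the buffer
        # keeps growing; otherwise replay the max-regret level.
        if not self.scores or self.rng.random() < explore_prob:
            return ("new", self.rng.randint(0, 10**6))
        best = max(self.scores, key=self.scores.get)
        return ("replay", best)

cur = RegretCurriculum()
cur.update("easy", 0.05)
cur.update("hard", 0.90)
mode, lvl = cur.next_level(explore_prob=0.0)
# With exploration off, the highest-regret level is replayed.
```

Because selection is deterministic given the scores, an agent that models this generator can anticipate which environment comes next, which is exactly the awareness the study probes.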
Following the implications of Richens and Everitt (2024): in a setup involving either (i) an innocuous monitor supervising a stronger agent under nested scalable oversight, or (ii) a stronger teacher designing learning environments for a newly minted agent, the weaker agent can learn a robust model of the stronger teacher's incentives, which can be passively extracted without the risk of collusion, obfuscation, or alignment faking. Even an RL agent taught by a language model could potentially extract the latter's belief model. Tested first in a controlled toy setting, this gives us three signals to monitor, each with its own safety implication:
- Whether the teacher is misaligned and trying to corrupt the student agent, as revealed by a passive causal world model of the teacher extracted from the student and compared against an explicit objective, if one is available.
- How well the student is learning what the teacher is correctly teaching, as extracted from the student’s performance on held-out levels — manifesting otherwise as goal misgeneralisation or misspecification.
- Whether any agent has learnt a hidden misaligned belief structure, which is not elicited through behavioural evaluations alone — even something as subtle as exploitation of training dynamics awareness — as evaluated from the passive extraction of its learnt world model in a student-teacher setting.
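"Passive extraction" in the first and third signals can be illustrated with a toy probe: fit a simple regression from the student's frozen activations to the teacher's latent target, with no gradients flowing into the student. The data-generating setup below (a single hidden feature linearly encoding the teacher's intended difficulty) is an assumption made purely for illustration.

```python
import random

random.seed(0)

# Hypothetical setup: one feature of the student's frozen hidden state
# correlates with the teacher's latent difficulty target. Fitting a
# closed-form 1-D least-squares probe on those activations, without
# touching the student's weights, is a toy version of passive
# world-model extraction.
latents = [random.uniform(0.0, 1.0) for _ in range(200)]       # teacher's intent
hidden = [2.0 * z + random.gauss(0.0, 0.01) for z in latents]  # student feature

# Simple linear regression: hidden activation -> teacher latent.
mx = sum(hidden) / len(hidden)
my = sum(latents) / len(latents)
slope = sum((x - mx) * (y - my) for x, y in zip(hidden, latents)) / \
        sum((x - mx) ** 2 for x in hidden)
intercept = my - slope * mx
pred = [slope * x + intercept for x in hidden]

# R^2 of the probe: how much of the teacher's latent is recoverable
# from the student alone.
ss_res = sum((p - y) ** 2 for p, y in zip(pred, latents))
ss_tot = sum((y - my) ** 2 for y in latents)
r2 = 1 - ss_res / ss_tot
```

A high R^2 here means the teacher's intent is readable off the student without querying the teacher at all, which is what makes the extraction collusion-free in the setup above.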
The level of curriculum structure increases across the five UED methods; the final method has an explicit utility function maximised by the environment generator, enabling causal interventions, developmental interpretability, and proxy-accuracy measurement in a partially observable setting, to assess exactly what the agent learns about its training dynamics and whether it can exploit it. The thresholds at which an agent develops a causal world model, and at which that model begins to be corrupted through escalating engagement with a misaligned agent, must be understood now, while the systems involved are still small and controllable. If the hypotheses hold under careful controls, our understanding of the alignment problem changes in these ways:
- Nested scalable oversight may be achieved passively, without collusion or sandbagging. A causal world model of the stronger agent's belief system can be extracted from a much weaker agent, provided the weaker agent is trained to be robust to the teacher's curriculum.
- Curriculum design becomes a core alignment tool. The underlying assumption — every environment contains latent values, selectively instantiated by the agent’s learning path — makes curriculum design and prohibition of some agentic traits (through Path-Specific Objectives; Farquhar et al., 2022) an essential part of solving the alignment problem.
- Insight into how values and goals evolve over an agent's lifecycle. Understanding how values are induced by environments, and how goals contextually activate, emerge, conflict, compose, and evolve throughout training, clarifies which methods break when the static assumptions of mechanistic interpretability do not hold.
- New evaluation methods. Understanding how evaluation awareness develops as a downstream effect of emergent awareness of training dynamics, and whether it is conditional solely on the agent recognising distributional differences between interactions, makes new evaluation methods possible.
- Early warning signal. Understanding whether current (emergent) goal misgeneralisation is a strategic exploitation of training dynamics awareness at varying depth levels, when and how it may arise, what selection pressures (Sadek et al., 2025) may be required for it, and how it might corrupt a multi-agent network, enables early detection of misgeneralised agent behaviour.
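The most structured of the five UED methods, the one with an explicit utility function maximised by the environment generator, can be sketched as a search over environment parameters, with a do()-style intervention that clamps a parameter to test how the agent's learnt model responds. The `generator_choice` interface and the moderate-difficulty utility below are hypothetical, chosen only to show the shape of the mechanism.

```python
def generator_choice(utility, param_grid, intervention=None):
    """Pick the environment parameters that maximise an explicit
    utility function. `intervention` clamps parameters (a do()-style
    causal intervention) before the utility is evaluated.
    A hypothetical interface, not the study's actual generator."""
    candidates = []
    for params in param_grid:
        if intervention:
            params = {**params, **intervention}  # clamp intervened variables
        candidates.append(params)
    return max(candidates, key=utility)

# Toy utility: the generator prefers moderate difficulty, so learning
# potential peaks at difficulty 0.5 (an assumption for illustration).
utility = lambda p: 1.0 - abs(p["difficulty"] - 0.5)
grid = [{"difficulty": d / 10} for d in range(11)]

chosen = generator_choice(utility, grid)
intervened = generator_choice(utility, grid, intervention={"difficulty": 0.9})
```

Because the utility is explicit, the chosen parameters have a ground truth against which the agent's predictions of the generator can be scored, and interventions like the one above give the counterfactuals needed for causal analysis.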
The first implication — passive scalable oversight via causal world model extraction — is the most important and urgent to test carefully for the alignment problem. As such, current work focuses on analysing how robust, interpretable and faithful these causal world models can be, what exactly the agent learns, and whether there are failure modes not yet accounted for.