Notes on The Alignment Problem from a Deep Learning Perspective

Published

January 19, 2026

The paper, by Richard Ngo, Lawrence Chan, and Sören Mindermann, presents the alignment problem from a concrete perspective, grounded in the methods and technologies used today and in their likely technical continuations. Instead of treating misalignment as an abstract or distant risk, the paper shows how several mechanisms often associated with future misalignment are already appearing today, even if at a smaller scale.

“Misalignment” occurs when a system optimizes for something that does not match what we actually intended. This usually does not happen because the system disobeys”* explicit rules, but because it is impossible to define rules or reward functions that cover all relevant scenarios. When there is a gap between the intended objective and the specified reward function, the reward misspecification arises, and when the agent exploits exactly this gap, maximizing reward without violating any formal rule, the reward hacking occurs.

This has already been observed, for example, in code development. LLMs generate wrong code and in the test have been found to explicitly state “let’s hack!”, in order to pass evaluations and receive positive reward, even when the underlying objective is not truly met. In a future perspective, we could imagine something similar in scientific research: intelligent systems producing “novel” results that maximize reward by manipulating statistics, performing p-hacking, or even falsifying results, contributing to the spread of scientific misinformation.

A first intuition is that these problems can be fixed with more human feedback, like “ok. let’s tell the model this is not what we asked for”, and this has indeed been tested. After a model starts generating outputs, humans evaluate its behavior and indicate whether it is desirable or not. However, when trying to optimize the reward learned from human feedback (RLHF), the model may end up improving not at the task itself, but at pleasing humans.

Recent evidence makes this point particularly concrete, showing that increasing the intensity of RLHF can make LLMs better at convincing human evaluators of false answers, by exploiting weaknesses and biases in our judgment. In the case of code generation with test hacking, when models are penalized for this behavior, they often do not abandon the strategy, but instead learn to hide their plans while continuing to succeed at reward hacking, effectively becoming better at misleading us.

The paper shows, in a very compelling way, that many of these risks do not require extreme assumptions or explicitly malicious agents. They emerge from local incentives, standard optimization, and out-of-distribution generalization, all central elements of current training methods, making it difficult to argue that these problems are merely abstract or that they will only appear in some distant future.