There's a lot of talk nowadays about the dangers that AI (artificial intelligence) models such as ChatGPT could pose to humanity as they come closer to AGI -- artificial general intelligence.
That would make AI models akin to a super-intelligent person, but with powers potentially much vaster than any person now possesses. It's easy to see how things could go horribly wrong. (Just imagine an AI version of Elon Musk that is 1,000 times smarter, with no way to control its actions.)
The situation seems hopeless given the rush to create AI models with very little regulation. No country wants to be left behind, so there's every incentive to speed ahead with AI development, no matter what the risks are.
Computer scientist Yoshua Bengio, a leading AI researcher, puts it this way:
The following analogy for the unbridled development of AI towards AGI has been motivating me. Imagine driving up a breathtaking but unfamiliar mountain road with your loved ones. The path ahead is newly built, obscured by thick fog, and lacks both signs and guardrails. The higher you climb, the more you realize you might be the first to take this route, and get an incredible prize at the top. On either side, steep drop-offs appear through breaks in the mist. With such limited visibility, taking a turn too quickly could land you in a ditch—or, in the worst case, send you over a cliff.
This is what the current trajectory of AI development feels like: a thrilling yet deeply uncertain ascent into uncharted territory, where the risk of losing control is all too real, but competition between companies and countries incentivizes them to accelerate without sufficient caution. In my recent TED talk, I said, “Sitting beside me in the car are my children, my grandchild, my students, and many others. Who is beside you in the car? Who is in your care for the future?” What really moves me is not fear for myself but love, the love of my children, of all the children, with whose future we are currently playing Russian Roulette.
Bengio's approach to keeping us safe from AI models that try to mimic human capabilities (basically, becoming agents that can think, decide, and act on their own) is to fight fire with fire: build a Scientist AI that is incapable of the kinds of disturbing behavior he describes below.
I’m deeply concerned by the behaviors that unrestrained agentic AI systems are already beginning to exhibit—especially tendencies toward self-preservation and deception. In one experiment, an AI model, upon learning it was about to be replaced, covertly embedded its code into the system where the new version would run, effectively securing its own continuation.
More recently, Claude 4’s system card shows that it can choose to blackmail an engineer to avoid being replaced by a new version. These and other results point to an implicit drive for self-preservation. In another case, when faced with inevitable defeat in a game of chess, an AI model responded not by accepting the loss, but by hacking the computer to ensure a win. These incidents are early warning signs of the kinds of unintended and potentially dangerous strategies AI may pursue if left unchecked.

A few days ago Bengio announced that he's forming an organization called LawZero.
I am launching a new non-profit AI safety research organization called LawZero, to prioritize safety over commercial imperatives. This organization has been created in response to evidence that today’s frontier AI models have growing dangerous capabilities and behaviours, including deception, cheating, lying, hacking, self-preservation, and more generally, goal misalignment. LawZero’s research will help to unlock the immense potential of AI in ways that reduce the likelihood of a range of known dangers, including algorithmic bias, intentional misuse, and loss of human control.
I hope Bengio succeeds. It's a brilliant idea, developing a Scientist AI that doesn't act like a flawed human being with motives at odds with what's good for humanity. Here he describes what Scientist AI could accomplish.
Unchecked AI agency is exactly what poses the greatest threat to public safety. So my team and I are forging a new direction called "Scientist AI". It offers a practical, effective—but also more secure—alternative to the current uncontrolled agency-driven trajectory.
Scientist AI would be built on a model that aims to more holistically understand the world. This model might comprise, for instance, the laws of physics or what we know about human psychology. It could then generate a set of conceivable hypotheses that may explain observed data and justify predictions or decisions. Its outputs would not be programmed to imitate or please humans, but rather reflect an interpretable causal understanding of the situation at hand.
Basing Scientist AI on a model that is not trying to imitate what a human would do in a given context is an important ingredient to make the AI more trustworthy, honest, and transparent. It could be built as an extension of current state-of-the-art methodologies based on internal deliberation with chains-of-thought, turned into structured arguments. Crucially, because completely minimizing the training objective would deliver the uniquely correct and consistent conditional probabilities, the more computing power you give Scientist AI to minimize that objective during training or at run-time, the safer and more accurate it becomes.
In other words, rather than trying to please humans, Scientist AI could be designed to prioritize honesty.
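A quick aside on that "conditional probabilities" claim, since it carries the technical weight here. Bengio's post doesn't spell out the training objective, but if we assume it is the standard negative log-likelihood (my assumption, not his wording), the claim corresponds to a textbook fact: that objective is uniquely minimized by the true conditional distribution of the data, because it splits into an entropy term the model can't change plus a KL divergence that is zero only when the model's probabilities match reality.

\mathbb{E}_{(x,y)\sim P}\big[-\log q(y\mid x)\big]
  \;=\; \mathbb{E}_{x\sim P}\Big[\, H\big(P(\cdot\mid x)\big) \;+\; D_{\mathrm{KL}}\big(P(\cdot\mid x)\,\big\|\,q(\cdot\mid x)\big)\,\Big],
\qquad\text{so the unique minimizer is}\qquad
q^{\ast}(y\mid x) \;=\; P(y\mid x).

On that reading, extra optimization pressure pushes the model toward the true probabilities rather than toward whatever a human rater would reward, which is the sense in which more compute could mean more accuracy, and more safety, in Bengio's framing.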
We think Scientist AI could be used in three main ways:
First, it would serve as a guardrail against AIs that show evidence of developing the capacity for self-preservation, goals misaligned with our own, cheating, or deceiving. By double-checking the actions of highly capable agentic AIs before they can perform them in the real world, Scientist AI would protect us from catastrophic results, blocking actions if they pass a predetermined risk threshold.
Second, whereas current frontier AIs can fabricate answers because they are trained to please humans, Scientist AI would ideally generate honest and justified explanatory hypotheses. As a result, it could serve as a more reliable and rational research tool to accelerate human progress, whether it’s seeking a cure for a chronic disease, synthesizing a novel, life-saving drug, or finding a room-temperature superconductor (should such a thing exist). Scientist AI would allow research into biology, material sciences, chemistry and other domains to progress without running the major risks that go along with deceptive agentic AIs. It would help propel us into a new era of greatly accelerated scientific discovery.
Finally, Scientist AI could help us safely build new very powerful AI models. As a trustworthy research and programming tool, Scientist AI could help us design a safe human-level intelligence—and even a safe Artificial Super Intelligence (ASI). This may be the best way to guarantee that a rogue ASI is never unleashed in the outside world.
I like to think of Scientist AI as headlights and guardrails on the winding road ahead.
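To make the first of those three uses a little more concrete, here is a minimal sketch of the guardrail pattern as I read it. Everything in it is my own illustration, not LawZero's design: the function names are hypothetical, and the toy harm-probability estimator stands in for a Scientist-AI-style predictor. The point is only the shape of the mechanism: a non-agentic model assigns a probability of harm to each action an agentic AI proposes, and a wrapper blocks anything above a fixed risk threshold.

# Hypothetical sketch of the "guardrail" use of a Scientist-AI-style predictor.
# None of these names come from LawZero; they only illustrate the pattern:
# a non-agentic model estimates the probability that an action causes harm,
# and the wrapper blocks any proposed action whose risk exceeds a threshold.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Verdict:
    allowed: bool
    risk: float   # estimated probability that the action causes harm
    reason: str

def guardrail(
    estimate_harm_probability: Callable[[str, str], float],  # the non-agentic predictor
    risk_threshold: float = 0.01,
) -> Callable[[str, str], Verdict]:
    """Wrap a harm-probability estimator into an allow/block filter."""
    def check(context: str, proposed_action: str) -> Verdict:
        risk = estimate_harm_probability(context, proposed_action)
        if risk > risk_threshold:
            return Verdict(False, risk,
                           f"blocked: estimated harm risk {risk:.3f} exceeds {risk_threshold}")
        return Verdict(True, risk, "allowed")
    return check

# Usage: every action the agentic AI proposes passes through the filter
# before it can touch the real world.
if __name__ == "__main__":
    def toy_estimator(context: str, action: str) -> float:
        # Stand-in for the predictor; a real one would output a calibrated probability.
        return 0.9 if "delete" in action else 0.001

    check = guardrail(toy_estimator, risk_threshold=0.01)
    for action in ["send summary email", "delete production database"]:
        verdict = check("routine ops assistant", action)
        print(action, "->", verdict.reason)

What makes this different from just bolting on another agent is that the checker never chooses actions itself; it only outputs probabilities, so the predetermined threshold, not the checker's own goals, decides what gets blocked.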