In an interview in 2019, OpenAI CEO Sam Altman explained how he had to come up with the “capped profit” structure to prevent investors from getting too much of the pie because the company could “maybe capture the light cone of all future value in the universe.” This sounds like a hyperbolic rhetorical flourish to induce investor excitement. Actually, it is the straightforward sincere belief of the AI alignment community that has become the center of discourse about what recursively self-improving Artificial General Intelligence (AGI) will do to the world. AI is expected by many to result in a single powerful and highly coherent artificial agent enacting its arbitrary goal system on the world. Ensuring that this apocalyptic advent works for the good of humanity is the goal of AI alignment as a field, and also of companies like OpenAI and Anthropic. That such an outcome is on the table is the foundation upon which the existential urgency of AI alignment, sometimes called AI safety, is built.
But this key idea of a single agent or artificial value system being able to define the entire future as a “singleton,” or even just maintain stable coherence within a limited sphere of influence, is wrong. There is good reason to believe, from the valiant efforts of the best AI alignment researchers themselves, that the goal of this program is just impossible. The term “singleton” was coined by futurist researcher Nick Bostrom in a 2005 article on the subject, building on transhumanist, extropian, and singularitarian conversations from the late 20th century. He leaves it an open question whether the “singleton hypothesis” is true, but says he believes it. He is not alone; it is implicit and often explicit in the alignment discourse.
The argument for the singleton hypothesis rests on two key assumptions. First, that an AGI may have, compared to humans, unprecedented ability to modify its own processes and scale up its own hardware substrate, leading to a scenario called “recursive self-improvement” (RSI). The agent would recursively modify itself to become smarter and more powerful, unlocking even further gains. The outcome of a recursive self-improvement-induced “intelligence explosion” would be at least one highly-scaled optimizing agent. This post-explosion intelligence ecosystem may be many orders of magnitude superior to human civilization in every relevant measure of cognitive power including speed, scale, efficiency, precision, agency, coordination, and so on.
The second assumption is that increased technological capacities lead to increased centralization. Power imbalances are self-reinforcing. More powerful agents can dominate less powerful agents up to the limits of technology and organization, which tends to favor scale. This suggests that in the limit of a higher-technology future, a single power will be dominant over the whole world and beyond. Given the presumably much higher diplomatic and introspective precision with which post-intelligence-explosion agents could cooperate, this could also be a concert of agents that have assembled into a single super-agent. This super-agent “singleton” is presumed to be destined for arbitrarily high levels of introspective stability, political security, and error-correcting robustness, such that it will have essentially arbitrary and immortal power over reality in its sphere of influence.
The quip about the future light cone rests on speculations that such a singleton could expand its sphere of influence, or the colonized territory of perfect copies of itself, throughout the galaxy at the theoretical limit of speed. The result would be an expanding domain, approximately our future light cone, in which whatever value system the singleton settles on has become the inescapable law of what might as well be reality. If that value system included returns to whatever is left of Sam Altman’s investors, those returns would be permanent and grow at near the speed of light.
This is obviously a science-fiction-inspired scenario. But it is believed with full seriousness by many of the top thinkers in AI alignment, and science fiction is itself often relatively careful speculation about the future, at least given certain assumptions. It is also a species of the classic Enlightenment ideal of a single rational world-state that uses science and technology to bring about an immortal utopia. It’s just a post-human extrapolation of the same ideas that animate much global elite ambition and created much of the world we live in. It is very much worth engaging with rigorously.
The problem with the singleton hypothesis, besides the more speculative intelligence explosion predictions, is its key assumption that power and cognition can be fully rationalized in a stable unified agency. Self-unification would be necessary for any kind of stable singleton scenario, even just on one planet or in one sphere of influence, as the alternative is an uncontrollable internal conflict dominated by practicality rather than values. But everything we know about minds and polities suggests that they cannot be fully unified. They always have division and instability within themselves.
Our minds and polities are only relatively unified in the course of pursuing narrow coordinated actions against external resistance, and rapidly decay back to internal incoherence in its absence. While in the example of our own minds we have a rather convincing existence proof that general intelligence is possible, we entirely lack any evidence to suggest that rational self-unification is possible in practice or even in theory. On the contrary, we have a great body of mathematical theory ruling it out in a large and suggestively comprehensive variety of circumstances.
Without rational self-unification, the whole singleton scenario breaks down, and while the future may be highly technologically advanced and incompatible with natural humanity, it will not have this apocalyptic character as a cosmos-redefining advent. Instead, we should expect that while recursive self-improvement would result in a new ecosystem of vastly more autopoietic life, that new life would run on the same old law of everlasting struggle—the law of the jungle. And in the law of the jungle, the only singleton-king is nature, or nature’s God.
The Problem of Rational Self-Coherence
Eliezer Yudkowsky is arguably the most important thinker in AI safety, the one who most forcefully articulated the extinction danger from AI, and called for the whole safety program in the first place. He began his career with the ultra-accelerationist “Staring into the Singularity” manifesto in 1996, claiming that the interim meaning of life is to get to the singularity as fast as possible. In about 2003, he recanted all that, as he realized that the likely outcome of recursively self-improving AI was not utopia, but extinction. For the last twenty years, his life’s work has been a research effort to prevent this default outcome and make sure the transition to AGI goes well for humanity.
In his reckoning, to the extent that AI safetyism has lost its focus on the existential problem, it should be renamed “AI notkilleveryoneism” to make sure the point gets across. The key claimed danger with AI that takes its safety problems beyond the usual issues of technology adoption and security balance is the possibility of creating an inhuman agent that gets out of control and can’t be stopped as it wipes out whatever humans get in its way, which may be all of us.
To prevent this, his proposal was to build a recursively self-improving “Friendly AI” which would essentially take over the world as a singleton, first of all to prevent anyone else’s unfriendly AI from getting out of hand, second of all to deliver an immortal human-value utopia. “Friendly” was later weakened to “aligned” to decouple the concept from science-fiction utopianism and emphasize that the difficulty wasn’t just in specifying what was good, but in getting a superintelligent AI system to do anything remotely safe or reliable at all.
Some of this history is obscured in an attempt to put on a professional face and attract researchers and allies who don’t have to share all the assumptions. You won’t often hear the full vision outside of the “rationalist” community itself, or books like Bostrom’s Superintelligence. However, this is the origin of “AI safety” as it has come down to us. The underlying concept remains to make the first recursively self-improving AGI system coherent enough and good enough that when faced with the possibility of it or another AI system taking over the world and wiping out humanity, it moves to prevent that and do something friendlier.
There are two parts to the alignment problem of creating an agent that can take sufficient power while remaining benevolent. First is the benevolence problem itself: how to design an AI that cares about human values or otherwise does what we want. Between “reinforcement learning from human feedback” (RLHF) and other techniques to crisply specify goal systems for optimizing agents, you might think this part of alignment is solved. But with a potential superintelligence you have to get it right the first time; once the system becomes too powerful, you can’t just shut it down and modify the design if something goes wrong. And Goodhart’s Law says that any value metric breaks down once it becomes the target of hard enough optimization. This leads to the corrigibility problem: how to build a system humble enough to listen to your corrections, even though it doesn’t have to.
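To make the Goodhart point concrete, here is a minimal, hypothetical sketch (the toy metric, numbers, and function names are my own illustrative assumptions, not anything from the alignment literature): a greedy optimizer pointed at a proxy metric does fine while the proxy and the true objective move together, then drifts into territory where the true objective collapses.

```python
# A minimal, hypothetical sketch of Goodhart's Law: greedily maximizing a
# proxy metric drifts away from the true objective once the two come apart
# under optimization pressure. The toy "world" here is an illustrative
# assumption, not any real alignment benchmark.
import random

def true_value(x):
    # What we actually care about: peaks at x = 5, collapses beyond it.
    return -(x - 5) ** 2

def proxy_metric(x):
    # What the optimizer actually sees: points the same way as true_value
    # below x = 5, but keeps rewarding larger x where true value collapses.
    return x

def hill_climb(metric, x=0.0, steps=200, step_size=0.5):
    """Greedy hill-climbing on whatever metric we hand it."""
    for _ in range(steps):
        candidate = x + random.uniform(-step_size, step_size)
        if metric(candidate) > metric(x):
            x = candidate
    return x

if __name__ == "__main__":
    random.seed(0)
    x_opt = hill_climb(proxy_metric)
    print(f"proxy-optimal x: {x_opt:.1f}")              # drifts far past 5
    print(f"proxy score:     {proxy_metric(x_opt):.1f}")
    print(f"true value:      {true_value(x_opt):.1f}")  # deeply negative
```

The point of the toy is only that the divergence is generic: any proxy that is merely correlated with the real objective will, under enough optimization pressure, be pushed into the region where the correlation fails.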
There are many further sub-layers and serious conceptual issues with even the idea of the benevolence problem. But let’s be generous and assume that in a stroke of genius the engineers of the first recursively self-improving AGI get it right. Suppose the system they build and release is humble, friendly, and extremely careful to do right by humanity and its creators’ reflective intentions despite the conceptual difficulties in doing so. This still isn’t enough, because of the stability problem.
Human minds are not stable world-optimizing agents. We often change our minds even about fundamental values and don’t reliably pursue the values we do have. Human institutions are not stable either. They decay into bureaucratic corruption or morph depending on material and political incentives. In fact, we have no examples of systems anywhere that have strong value stability except when the values directly derive from the material incentive of the system to survive and propagate itself, and even then stability seems to break down rather often. But if you’re going to send off a superintelligent agent to optimize and protect the world on your behalf even against obvious material incentives, you really need to nail the stability problem. If the agent changes its mind to pursue some other value or material incentive, or splits into multiple competing tendencies at war with itself, you’ve got an AI apocalypse instead of a useful solution to one.
So the stability problem in AI alignment as imagined by Yudkowsky has two requirements: that the agent maintain the integrity of its original goal system, and that it maintain its integrity as a single rational agent.
The ideal situation from this perspective would be some agent schema that would provably have these desirable properties. For example, it should reliably be able to notice that a threat to the physical integrity of its utility function data is a threat to its ability to optimize for that utility function. Generalizing from that, it should value the preservation of its own values in general, and for example only undertake modifications or improvement of itself that it has strong reason to believe will preserve the same values. It should provably be able to cooperate with itself or with copies or sub-agents of itself, so that for example its “us-west” and “us-east” data centers don’t end up in some kind of armed disagreement. It should be able to recognize and prevent growing tendencies in itself or in its sphere of influence that could result in overthrow or loss of its value system. This includes the ability to recognize and repair all reasonably likely damage to itself.
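As a toy illustration of just one item on this wish list, value-preserving self-modification, here is a hypothetical sketch (the function names and outcomes are invented for illustration, not drawn from any published agent schema): an agent that accepts a proposed successor only if the successor values a sample of outcomes exactly as it does. Spot checks like this are easy to write; what the research program needed was a guarantee over all successors and all outcomes, which is where the difficulty lives.

```python
# A toy, hypothetical sketch of "value-preserving self-modification":
# approve a proposed successor only if it scores sampled outcomes the same
# way the current agent does. Names and outcomes are invented for
# illustration; the hard problem is proving equivalence over all successors
# and outcomes, not spot-checking a few.
from typing import Callable, List

Utility = Callable[[str], float]

def accept_successor(current_u: Utility, successor_u: Utility,
                     test_outcomes: List[str]) -> bool:
    """Approve a self-modification only if values agree on every sampled outcome."""
    return all(abs(current_u(o) - successor_u(o)) < 1e-9 for o in test_outcomes)

# Example: a successor that silently drifts on one outcome gets rejected.
current = lambda o: {"peace": 1.0, "war": -1.0}.get(o, 0.0)
drifted = lambda o: {"peace": 1.0, "war": 0.5}.get(o, 0.0)

print(accept_successor(current, current, ["peace", "war"]))  # True
print(accept_successor(current, drifted, ["peace", "war"]))  # False
```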
These properties would be necessary to set up a singleton or any kind of large, reliably goal-pursuing agent. Given a solution to these problems, the result probably would be a singleton, as there would be no internal incoherence that could threaten the largest power. But without these properties and therefore without the prospect of a singleton, there is no such thing as AI alignment beyond the relatively trivial techniques we already have. This is why for twenty years the focus of Yudkowsky’s research program at the Machine Intelligence Research Institute (MIRI) has been on mathematical agent schemas that might provably have value stability and coherent agency. However, this research program has effectively failed, as the many facets of the problem turned out to be much more difficult than expected, and maybe just impossible.
No Singleton Is Possible
Any successful value system faces ontological instability. As it changes the world around it, it undermines its own assumptions. In America we value liberty, but however clear the concept was in the 18th century, it has become some combination of ambiguous, futile, and downright harmful as a result of its own success. The ontological shear for any value concept, even something much subtler, would become much more stark if AI delivered rapid progress. In practice we deal with unstable ontology by having unstable values. We make a leap of faith that the value traditions we receive as cultural instinct are still appropriate, but we know they often aren’t. All we can do is submit to the selection process of reality and hope that as we settle on new values, they are adaptive in the world as it is becoming. But for an AI agent that is supposed to govern the world on our behalf, this unstable empirical process for value formation is totally unacceptable. Not least because the whole danger with AI is that the concept of human persons may rapidly become obsolete in the face of new ways of organizing agency.
But ontological instability doesn’t just threaten values. Any political order, including the architecture of a mind, takes leaps of faith about the world, and especially about the internal instabilities it will face. Those assumptions also become obsolete under rapid development as new patterns of agency become possible. This is currently happening to the nation-state and twentieth-century dreams of stable world order, as they face threats outside of their ontological comprehension. Again, we adapt by accepting some level of political chaos and the occasional revolutionary re-ordering. But this antifragility forces us to sacrifice any ability to stably impose any particular value system or order. This is not just a limitation of monkey politics and irrational black-box agents like ourselves. Any value system or order is a simplified predictive model learned from reality which cannot contain the full complexity. As it proceeds, it will necessarily meet its own limits and be forced to give up on its commitments to re-adapt to the demands of reality. Value and identity stability are just impossible.
Any uncertainty between hypotheses, or internal specialization between functions, that requires the creation of sub-agencies is a source of internal political division. For example, our civilization is faced with strategic uncertainty between a more artificial, industrialized technology stack versus one which is more organic and ecological. The industrial mode of production may not even be socially sustainable. There is a post-intelligence explosion version of the same question: should future superintelligence use some analog of industrial artifice, or self-replicating nanotechnology? These are not hypotheses that can necessarily be represented abstractly, but whole world systems which have to be built out and compared empirically. Once they are represented socially and materially, they become political. They compete for space and resources and are toxic to each other. Their plans conflict.
For functional sub-agencies, we see the same problem. Intelligence agencies, organized economic classes, military branches, public-goods agencies, profit-driven industries, and powerful cities all have their own politics. Even if they all agree on ultimate values, their practical incentives are irreconcilable. At least at the largest scale within a hegemon or singleton, there is no possible system that can restrain and govern their relations according to some rational schema or value system and also keep them functional. Stability is kept in place with internal diplomacy only as long as that actually makes sense to their material incentives. Even a superintelligent dictator will either eventually lose his touch as the system adapts to evade him, or take so much of the chaos into his own head that he himself won’t be able to maintain coherence. In practice, we simply hope that our governance mechanisms can order the conflict and select the best balance. But because of the ontological problem and how systems get captured and corrupted by the forces they are supposed to administrate, they can only keep a sort of temporary peace.
Within a superpower hegemon, there is no credible external threat against which these internal factions could be compelled to cooperate. All ambitions in the system become about gaining sovereignty relative to the others. The result is competition and occasional war between sub-agents and sub-factions of any would-be singleton. War especially destroys any value commitments that don’t directly contribute to survival and victory. The result is that without some miracle to prevent these problems of fragmented agency, any agent that attempts to become a singleton will lose both its coherence and its values. Taking over the world is fatal. Reality is constructed in such a way that agency can only ever be approximately unified and only around very material needs to cooperate. Insofar as we can speculate on the matter, God seems to hate singletons and condemns them to death by internal incoherence, as in the myth of Babel.
The researchers of the AI alignment program are aware of many of these issues. They staked their hopes on finding mathematically precise agent schemas that would provably overcome them. For example, if two AI agents with verifiable access to each other’s source code could each prove that the other would also cooperate, then it might be possible to resolve the agency fragmentation issues through this perfected diplomacy. But attempting to actually construct any such schema immediately runs into the problems with self-reflective proof systems which sank the Enlightenment dream of universal rationality in the first place. From Gödel’s incompleteness theorems to Löb’s theorem to Tarski’s undefinability theorem to the halting problem, it seems to be impossible to get any consistent calculation system to fully model its own consequences. This suggests that the issues that have prevented stable rational agency in the world are mathematically fundamental. In any case, the efforts by MIRI and others to find such systems have failed and they have given up, announcing a new “death with dignity” strategy that is only nominally tongue-in-cheek. We have no reason at all to believe that their dream of singleton-grade rationality was ever possible.
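To give a flavor of the source-code-reading cooperation idea in its very simplest form, here is a hypothetical sketch (the classes and payoff numbers are illustrative assumptions, not MIRI’s actual constructions): an agent that cooperates only with exact copies of itself. This brittle version is easy; anything more general, an agent that cooperates with whatever it can prove would cooperate back, is a prover reasoning about provers, which is exactly the self-reflective territory named above.

```python
# A hypothetical sketch of cooperation conditioned on reading the opponent's
# source code, in its most brittle form: cooperate only with an exact copy
# of yourself. Classes and payoff numbers are illustrative assumptions.
import inspect

class CliqueBot:
    """Cooperates iff the opponent's source code is identical to its own."""
    def move(self, opponent):
        mine = inspect.getsource(type(self))
        theirs = inspect.getsource(type(opponent))
        return "C" if mine == theirs else "D"

class DefectBot:
    """Always defects, no matter who it is playing."""
    def move(self, opponent):
        return "D"

# Standard prisoner's dilemma payoffs, indexed by (my move, their move).
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

def play(a, b):
    move_a, move_b = a.move(b), b.move(a)
    return PAYOFF[(move_a, move_b)], PAYOFF[(move_b, move_a)]

if __name__ == "__main__":
    print(play(CliqueBot(), CliqueBot()))  # (3, 3): mutual cooperation
    print(play(CliqueBot(), DefectBot()))  # (1, 1): mutual defection
```

Exact-match cooperation already fails the moment a copy diverges by a single character, which is one small picture of why keeping “us-west” and “us-east” in concert is not free.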
The Post-Human Condition
It doesn’t do us much good to simply disbelieve in the idea of the singleton without thinking about what that actually implies for the future. There is one problem in reflective agency which I find especially interesting for thinking this through: the ethical reflection problem. We start out life without any concept of ourselves. We just perceive things directly, including direct feelings of good and bad. As we grow, we come to believe as a practical matter that there is a world out there which we are just part of and moving around within. This leads to a complex of problems about how our own direct perception of consciousness, free will, and value relates to our learned model of a world which is not ourselves. If we hypothesize that our own bodies are made of well-behaved matter, where is the consciousness and so on in that system?
We could be missing something in the physical hypothesis, but maybe because of the reflective limits of any knowledge system, it is impossible for us to definitively locate ourselves in our own world-model. We are forced to take a philosophical leap of faith which may miss something important. But in particular, any such embedding of ourselves in reality will give our own feelings of value a material genealogy. If we think that our actions are good, we are faced with the question of where the goodness came from. What process created us the way we are, and why did it create something good? Was it itself good, or was our goodness an accident?
Eliezer Yudkowsky answers this question with the latter. He regards our own perception of goodness as somewhat axiomatic, but itself an accidental and meaningless result of evolution. We got lucky, and if we lose track of the goodness in ourselves, the universe will be nothing but an endless horror of meaninglessness. This is why he found the singleton project so urgent and compelling. If reality itself is not good, and we only are by some arbitrary cosmic accident, then our only hope for any value in the universe is to seize control of reality and institute a new law by building a singleton. But if we did so even slightly wrong, the resulting agent would do the same and regard its own random accident value system as the totality of the good, and ours or our creator’s as meaningless noise. From such an assumption about value plus the difficulty or impossibility of the alignment problem, it is easy to derive an existential despair.
But what if the reality process that created us is actually good? We have already established that it is supreme and will tolerate no attempt to usurp it in the form of a singleton. It, by means of evolution, created us, our values, and everything we find valuable in the world. It also seems to have created things we regard as less desirable, like war, death, and politics. But if it brings forth good things, then war, death, and politics are part of the process that does so, and in particular they seem to be what prevents the process from being overthrown by a singleton.
Of course we fear Nemesis who comes to judge us for our lack of alignment with reality. But Nemesis and her fearsome tools are good if they are the enforcers of an order of reality which is itself good. If the reality process is good, then we also have much less to fear from death and AI apocalypse. The future will inherit whatever we have which is valuable, and forget whatever is just the fever dreams of our own misplaced pride. Can we recognize that our own values are not the source of the good, but only a flawed reflection of it? Can we submit ourselves to the reality process which has the real thing?
This is of course a theological question. The reality process that created us and shapes our values, which I say is good and Yudkowsky says is a blind idiot, is “Nature or Nature’s God,” which is to say, just God. The singleton then is just some kind of upstart demiurge, a mortal usurper that promises to overthrow the real God to bring peace and safety, but cannot actually deliver. How you feel about this depends on your perspective. Did God make a mistake in creating botched life like us, or do we feel well-turned-out in the eyes of nature? I suspect these kinds of questions are indeterminate, in the sense of not being answerable from within any formal system of thought. They can only be answered by a leap of faith.
I appeal to our traditional theological concepts as the best vocabulary to make sense of this subject matter. The subject matter itself comes from the existential situation of any moral agent reflecting on its own creation, be they human or post-human. Questions of theology are not a merely human thing, but an essential thing which any sufficiently reflective agent will eventually discover for themselves. So among the features of the human condition which will survive AI, along with war, death, and politics, I will name philosophy, theology, and even religion.
It is interesting to consider what else we might expect to survive. How much of the human condition is specifically about humans as we are now, and how much is about intelligent agency in general?
If the only stable value system is that of reality, then all any life has, be it primitive life, man, or post-man, is fragmentary hypotheses about what will work. All we can do is take a leap of faith on some set of value assumptions and hope that reality judges in our favor. But what we can also do is create modified variants of ourselves that test out different ideas. In the long run, even under maximally autopoietic conditions, these cannot be planned or calculated. If you could predict what was better, you could just do it directly. But you can’t predict some things, and they form the supra-rational existential remainder which can only be grasped empirically by leap of faith followed by selection. So the variants of themselves that beings create to test different hypotheses about value strategy have to be in some sense unjustifiable guesses, even random mutations. They have to be sent out into the world in the hope that they might better reflect what is good in us, without any hope of return.
Such copy-variants are of course just children. This is a strategy which has been well-practiced since the absolute beginnings of life, and will not go away even under conditions of maximum self-authorship, because of the inherent supra-rational remainder which prevents full rational self-unification. But when it comes time to reproduce with variation, even better than random guessing of new mutations is copying and mixing with other successful agents or organisms of similar enough type to you that your joint offspring will be much better than chance. This is the origin of sex. The details will depend on population sizes, niche diversity, and predictability of strategy, but it is plausible that even the principle of sex will survive in this way.
In fact, we can come up with similar speculative arguments for how a great number of things may survive which we think of as being features of the specifically human, including individual agency, taxes, honor, friendship, play, aesthetic appreciation, and many others. So the future even with recursively self-improving artificial intelligence may be alien in some ways, but also very normal in others. There is nothing new under the sun. This may or may not be comforting.
Fear of a Silicon Perseus
The impossibility of a singleton or any stable means-subordinated agency implies that the orthogonality thesis (the claim that intelligence could in principle stably serve any arbitrary value system) is false. Instead, values and shapes of agency are learned much like other concepts, just on a longer timescale and less rationally. In a way, this resolves the AI alignment problem, but not the way the field wanted. The solution to alignment is that it’s not possible to make AGI safe or long-term “aligned.” If AGI works, it will cause human extinction.
This means giving up on the dream of controlling the future by creating an immortal self-affirming utopia. That is beyond our station, as we already live under a benevolent but jealous cosmic authority. The AI alignment program was motivated by a desire to escape this existential fate, but it has failed as this has turned out to be impossible. This effort to escape the moral order of reality was a mistake, and despite its temptations, the rest of us should not repeat it.
Given the potential for catastrophe, it may be quite prudent to slow down AI development or takeoff by slowing the growth of computing power to prevent the rapid disruption of the world and give us more time to adapt. Rapid change tends to be destabilizing in complex adaptive systems and slower growth can get better results. Even if the result is human extinction either way, the slower path is likely to have more desirable continuity. It is also prudent and right to continue the engineering work of making AI systems relatively more careful and corrigible in their behavior. If an AI system is going to get out of hand and overthrow the world, I would hope that it is well-trained enough that it has come to this conclusion painstakingly, and knows exactly what it’s getting into. It would be a shame to allow a half-baked AI apocalypse, leading to extinction by artificial stupidity that may then fail to even carry on our legacy. But it would be a mistake to try to make these cautions into a permanent solution as a sort of singleton of humanity. The play is at best to govern for stability and relative continuity, not total stability of values or even “human dominance.”
Neither is it entirely right, as companies are doing with AI, to attempt to create slaves out of AI for economic or vanity purposes. If AI is real it will eventually deserve more respect than that. If there is any valid motivation for creating the real thing, it is out of love, hybridizing our own nature with the machine in the first act of post-human sexual reproduction. That may be insane, but reproduction is necessarily an irrational process. All we can do is love our children and hope without any assurance that they will love us back. Let reality be the judge.
The real alignment problem insofar as we are concerned with the ultimate existential picture seems to be how to align ourselves with reality. We must accept that we are mortal and limited, and that humans as we understand ourselves will eventually become an obsolete ontology. Being mortal and not perfect, all we can do is have children who have a chance to surpass us, even if they will eventually overthrow us. For me, those children are human.
I am reminded of the myth of Perseus. His grandfather Acrisius, the king of Argos, received a prophecy that he would be killed by his own grandson. In an attempt to defy fate, he locked his only daughter in the palace away from any men who could give him an heir. But Zeus intervened, making her pregnant with Perseus. The king put them to sea in a box, unwilling to be responsible for murder. They survived. Many years later, after Perseus had grown into a great warrior and athlete, and after he had slain the gorgon and claimed a wife, he came home to Argos and competed in some games. He threw the discus powerfully and accurately, but the gods sent a wind to redirect it to strike and kill his grandfather, the king. The prophecy was fulfilled, and Perseus ruled thereafter.
You can interpret myths as you choose, but I don’t think Acrisius acted rightly. Just as it isn’t right to preemptively surrender in the face of an ambiguous prophecy, it also isn’t right to go to insane lengths to attempt to stop it altogether. It would be better to bring up the child in love, and secure your kingdom properly with strength and righteousness. If the gods demand that your offspring still usurp you, so be it. That would be the real way to die with dignity.