This is a post about my own confusions. It seems likely that other people have discussed these issues at length somewhere, and that I am not up with current thoughts on them, because I don’t keep good track of even everything great that everyone writes. I welcome anyone kindly directing me to the most relevant things, or if such things are sufficiently well thought through that people can at this point just correct me in a small number of sentences, I’d appreciate that even more.
The traditional argument for AI alignment being hard is that human value is ‘complex’ and ‘fragile’. That is, it is hard to write down what kind of future we want, and if we get it even a little bit wrong, most futures that fit our description will be worthless.
The illustrations I have seen of this involve a person trying to write a description of value, conceptual-analysis style, and failing to include things like ‘boredom’ or ‘consciousness’, and so getting a universe that is highly repetitive, or unconscious.
I’m not yet convinced that this is world-destroyingly hard.
Firstly, it seems like you could do better than imagined in these hypotheticals:
- These thoughts are from a while ago. If instead you used ML to learn what ‘human flourishing’ looks like across a range of scenarios, I expect you would get something much closer than if you tried to specify it manually. Compare manually specifying what a face looks like and then generating examples from that description, with using modern ML to learn and generate faces.
- Even in the manual-description case, if you had, say, a hundred people spend a hundred years writing a very detailed description of what is valuable, instead of a writer spending an hour imagining ways that a more ignorant person might mess up if they spent no time on it, I could imagine it actually being pretty close. I don’t have a good sense of how far away it would be.
I agree that neither of these would likely get you to exactly human values.
But secondly, I’m not sure about the fragility argument: that if there is basically any distance between your description and what is truly good, you will lose everything.
This seems to be a) based on a few examples of discrepancies between written-down values and real values where the written down values entirely exclude something, and b) assuming that there is a fast takeoff so that the relevant AI has its values forever, and takes over the world.
My guess is that values learned using ML, even if somewhat off from human values, are much closer (in the sense of not destroying all value in the universe) than values a person tries to write down. The kinds of errors people have used to illustrate this problem (forgetting to put in ‘consciousness is good’) are like forgetting to mention nostrils when specifying what a face is like, whereas a modern ML system’s imperfect impression of a face seems more likely to meet my standards for ‘very facelike’ (most of the time).
Perhaps a bigger thing for me though is the issue of whether an AI takes over the world suddenly. I agree that if that happens, lack of perfect alignment is a big problem, though not obviously an all-value-nullifying one (see above). But if it doesn’t abruptly take over the world, and merely becomes a large part of the world’s systems, with ongoing ability for us to modify it and modify its roles in things and make new AI systems, then the question seems to be how forcefully the non-alignment is pushing us away from good futures relative to how forcefully we can correct this. And in the longer run, how well we can correct it in a deep way before AI does come to be in control of most decisions. So something like the speed of correction vs. the speed of AI influence growing.
These are empirical questions about the scales of different effects, rather than questions about whether a thing is analytically perfect. And I haven’t seen much analysis of them. To my own quick judgment, it’s not obvious to me that they look bad.
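To make the "speed of correction vs. speed of AI influence growing" framing concrete, here is a toy dynamical sketch, not an argument. Every parameter (growth rate, correction rate, initial misalignment) is invented purely for illustration:

```python
# Toy model (illustrative only; all numbers are made up).
# AI influence over world decisions compounds each year, misalignment
# erodes value in proportion to that influence, and a correction
# process shrinks misalignment while humans still retain control
# (i.e. while influence is below 1).

def run(growth, correction, misalignment, years=50):
    influence, value = 0.01, 1.0
    for _ in range(years):
        influence = min(1.0, influence * (1 + growth))
        value -= influence * misalignment               # damage done this year
        misalignment *= 1 - correction * (1 - influence)  # correction weakens as AI dominates
    return value

# When correction outpaces influence growth, most value survives;
# when it lags, far more is lost before misalignment is locked in.
print(run(growth=0.2, correction=0.5, misalignment=0.05))
print(run(growth=0.2, correction=0.05, misalignment=0.05))
```

The point of the sketch is only that the outcome hinges on relative rates, which is an empirical matter, rather than on whether the initial values are analytically perfect.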
For one thing, these dynamics are already in place: the world is full of agents and more basic optimizing processes that are not aligned with broad human values—most individuals to a small degree, some strange individuals to a large degree, corporations, competitions, the dynamics of political processes. It is also full of forces for aligning them individually and stopping the whole show from running off the rails: law, social pressures, adjustment processes for the implicit rules of both of these, individual crusades. The adjustment processes themselves are not necessarily perfectly aligned; they are just overall forces for redirecting toward alignment. And in fairness, this is already pretty alarming. It’s not obvious to me that imperfectly aligned AI is likely to be worse than the currently misaligned processes, or even that it won’t be a net boon for the side of alignment.
So then the largest remaining worry is that it will still gain power fast and correction processes will be slow enough that its somewhat misaligned values will be set in forever. But it isn’t obvious to me that by that point it isn’t sufficiently well aligned that we would recognize its future as a wondrous utopia, just not the very best wondrous utopia that we would have imagined if we had really carefully sat down and imagined utopias for thousands of years. This again seems like an empirical question of the scale of different effects, unless there is an argument that some effect will be totally overwhelming.
Surely the obvious counterpoint from ML is adversarial examples? And the obvious counter-counter-counterpoint to “Set up a GAN” or “Smart guessers have lower sample complexity” is “But any sufficiently advanced query will, from our perspective, smash or corrupt the oracle.”
The problem isn’t that these things are infinitely hard, it’s that any sufficiently late failure smashes the recovery partition. We only get one chance to not screw them up. And, in the case of ML, we have to initially build the system out of giant inscrutable matrices of floating-point numbers.
I think the “suddenly take over the world” part is indeed an important part of AI risk. If I imagine that the new AIs are just intelligent, secretly malicious individuals that get created at roughly the rate humans get created, then I see little threat from them, except possibly that they could coordinate perfectly in their malice, act like angels until they get into enough critical positions, and then do something irreversible. But that is a lot easier to deal with.
The thing is, “suddenly taking over the world” does seem within the realm of possibility. I’ve made the point to various people that, if an AI gets to the point where it can find security bugs in OpenSSL, then it can probably take over some tens of percent of the internet, at which point it has enormous computing capacity. If you think AIs are developing at a certain rate, then that AI will probably jump over ten stages in its development within minutes. The least it might proceed to do is find bugs in other commonly-used software, and take over >90% of the internet-attached computers, and most of the computers attached to them. Then we can speculate as to what it might do next.
It might not be the end of today’s world, but I think more and more critical infrastructure will be computer-controlled in future years; it could probably at least threaten to wreck trillions of dollars of equipment, and possibly ruin supply chains that feed hundreds of millions of people. Today, it could certainly take over all the self-driving-capable cars (Tesla’s entire fleet, and others’) and threaten to directly kill millions. I don’t know if threats would be necessary to dominate the world, but they are one approach.
Why would it try to take over the world? Well, if it thinks that by taking over it has a 99.9999% chance of creating heaven on earth, but that otherwise there’s a 1% chance that a significantly worse AI will spring up and create hell on earth (or that some idiots will start nuclear war, release bioengineered plagues, choose your favorite existential threat), then any world domination plan with, say, at least a 99.9% estimated chance of success would appear to be justified. I think there are humans who reason that way; why wouldn’t it? (Perhaps it could bioengineer its own plague, tell everyone they have to go into shelters it controls if they want the antidote, and count on there being at least some who do, from which it can repopulate the world in its vision of Utopia. Or just kill everyone, build infrastructure, and then clone people in captivity.)
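The decision arithmetic gestured at above can be made explicit. A minimal sketch, using the rough probabilities from the paragraph and utilities invented for illustration (heaven = +1, hell = −1, a failed takeover = −0.5):

```python
# Illustrative expected-value comparison for the AI's choice.
# All numbers are stand-ins loosely matching the figures in the text.

p_rival_hell = 0.01        # chance a worse AI (or other catastrophe) creates hell if it waits
p_heaven_if_wait = 0.99    # otherwise its utopia happens anyway
p_takeover_works = 0.999   # estimated success chance of the domination plan
u_heaven, u_hell, u_failed = 1.0, -1.0, -0.5

ev_wait = p_heaven_if_wait * u_heaven + p_rival_hell * u_hell
ev_act = p_takeover_works * u_heaven + (1 - p_takeover_works) * u_failed

print(ev_wait, ev_act)
```

Under these made-up numbers, acting (0.9985) edges out waiting (0.98): even a small probability of a rival hell is enough to tip a confident expected-value reasoner toward a high-success takeover plan.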