2011 Summit Transcription - Open Problems

All right, so for the benefit of those of you who have no idea what this talk is about, very, very briefly: it's about how to build stable self-improving artificial intelligences.

There was a recent conference on artificial general intelligence, real AI, AIs that really think, at Google in Mountain View. And there was a poster there on Gödel machines, which are self-modifying AIs that prove each self-modification consistent before making it. It's the closest approach to the one we advocate. And they had eleven steps for what you need to do to make one of these, and step seven was to prove that making the self-modification has higher expected utility than leaving the code constant. And I went up to the person by this poster and I said, "You know, I think step seven is harder than all of your other steps." And in the first open problem I'm going to try to explain why.

So, this is not the problem of making a full artificial general intelligence; it's just a little toy problem. Let's say you've got a barrel that's one quarter diamonds, three quarters circles. Ninety percent of the diamonds are red, eighty percent of the circles are blue. A random object is chosen from this barrel, and you're told whether it's red or blue. You have to guess whether it's a diamond or a circle. If you guess correctly you get $10; if you guess incorrectly you get nothing.

So, the problem: prove that "guess diamond if red, guess circle if blue" is a better strategy than "always guess circle." And yes, we will be getting to more difficult problems shortly; I just wanted to introduce everything in the correct order. One very bad answer you could give to this problem is, "The environment is uncertain, so no one can prove anything. I mean, what if you see red and it's actually a circle instead of a diamond? Oh noes, you should have just guessed circle to begin with." An even worse answer is, "Mere logic can't solve problems like these, you've got to use intuition."

The correct, or as we like to say in the field, Bayesian, answer is that in this barrel here we see that there are 25 blue objects, one of which is a diamond and 24 of which are circles, and there are also 15 red objects, six of which are circles and nine of which are diamonds. From this it follows that if you see blue, you should guess it's a circle, and if you see red, you should guess it's a diamond. And I will skip this because I don't really have time for everything I want to show today.

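For readers following along, here is a minimal sketch of that calculation in Python. The 40-object barrel size and the variable names are my own choices for illustration; any barrel with the stated proportions gives the same answer.

```python
# A minimal sketch of the Bayesian answer above. The 40-object barrel and
# the names are illustrative; any barrel with the stated proportions works.
from fractions import Fraction

# Joint counts: 10 diamonds (90% red) and 30 circles (80% blue).
counts = {
    ("red", "diamond"): 9,  ("blue", "diamond"): 1,
    ("red", "circle"):  6,  ("blue", "circle"): 24,
}
PAYOFF = 10  # dollars for a correct guess

def posterior(shape, color):
    """P(shape | color), read straight off the joint counts."""
    color_total = sum(n for (c, _), n in counts.items() if c == color)
    return Fraction(counts[(color, shape)], color_total)

for color in ("red", "blue"):
    best = max(("diamond", "circle"), key=lambda s: posterior(s, color))
    print(color, "->", best,
          "expected payoff:", float(PAYOFF * posterior(best, color)))
# red -> diamond expected payoff: 6.0
# blue -> circle expected payoff: 9.6
```
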
Now let's talk about a more difficult challenge: build an AI that, given the previous problem specification, would write a computer program to solve it. Now, in principle we have a pretty good idea of how to do this. By "in principle" I mean it's easy so long as you give us an infinitely large computer. Since we know what constitutes a proof that a program solves this problem, we can just search the space of all possible programs along with all possible proofs that the program works, and select a program as soon as we find a proof that it works. And once, you know, you write that program, the hard part is done and then there's just the easy problem of speeding up that algorithm. The point is we know what sort of work it takes to do this in principle.

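Here is a schematic sketch of that brute-force idea, shrunk to the toy problem so it actually runs. The four possible guessing strategies stand in for "the space of all possible programs," and exhaustive expected-value checking stands in for the proof checker that a real search over (program, proof) pairs would need; the setup and names are hypothetical.

```python
# Schematic sketch of "enumerate candidate programs and keep one that is
# verified to work," shrunk to the finite toy problem. Exhaustive checking
# of the expected payoff plays the role of the proof checker here.
from fractions import Fraction
from itertools import product

counts = {("red", "diamond"): 9, ("blue", "diamond"): 1,
          ("red", "circle"): 6, ("blue", "circle"): 24}
total = sum(counts.values())

def expected_payoff(strategy):
    """Expected dollars per draw for a map color -> guess."""
    return sum(Fraction(n, total) * (10 if strategy[color] == shape else 0)
               for (color, shape), n in counts.items())

# Every "program": every function from {red, blue} to {diamond, circle}.
strategies = [dict(zip(("red", "blue"), guesses))
              for guesses in product(("diamond", "circle"), repeat=2)]

best = max(strategies, key=expected_payoff)
print(best, float(expected_payoff(best)))
# {'red': 'diamond', 'blue': 'circle'} 8.25
```
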
Let's take it up another level of meta. Build an AI that, shown the general format of this problem, invents a Bayesian updater to maximize the expected payoff, in general for problems like this, including more complicated ones. It may not look like it, but we've actually just taken a quantum leap upwards in the complexity of the sort of reasoning you need to do to understand this problem.

The standard citation for this problem is I. J. Good, "On the Principle of Total Evidence," and this is the standard citation for the proof that you ought to do Bayesian updating rather than refuse to look at your information, because the expected value of information for a rational agent is never negative. That is, you are never worse off on average because you updated on the evidence. It's easy to see this for any particular case, but the general proof is actually quite difficult. Or actually, pardon me, the general proof is quite simple, it looks quite simple in retrospect, but it was actually ten years between when the problem was posed in a sort of vague form and when I. J. Good put it into its exact present form and solved it, which is why he is the citation and not the seventeen distinguished philosophers of science who previously replied to the vague form of the question, "Why should we bother to update on the evidence anyway?"

It's actually worth noting that ten years passed between the question, "Why should we update on the evidence?" and a precise mathematical answer to it, because you occasionally hear people say, "We don't need to start on Friendly AI yet, it's too early." Well, every time you read a math textbook and there's, like, one equation with a citation date of 1957 and then another equation with a citation date of 1967, it means that ten years passed between equation one and equation two. Basic math research takes time, especially if you go about it in a loosely organized way.

This problem is not actually that difficult to solve, despite the rather intimidating-looking solution over here. The thing is that it wasn't posed in the correct form and no one was specifically trying to do it, which is why it took ten years. I'm actually going to simplify that for you and sketch the proof, again very quickly. The problem is as follows: we have an unconditional strategy of "always guess circle" with an expected payoff of $7.50 per guess. We have a conditional strategy of guessing "circle if blue," which pays $9.60 each time you do it, and "diamond if red," which pays $6.00 each time you do it. Why should it be the case that, in general, you improve your expected payoff each time you update using Bayes's theorem?

The key thing to notice is that the unconditional strategy is a mixture of the conditional strategies, weighted by the probability of seeing that particular evidence. In other words, if you keep your eyes closed, your expected payoff is the weighted sum of playing the same strategy if you'd kept your eyes open. This in turn means that you can view the conditional strategy as picking the maximizing element of each row, whereas with the unconditional strategy you have to pick a single column. And since picking the maximizing element of each row does at least as well as picking the single maximizing column, that is the core of I. J. Good's proof.

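Written out, assuming the usual notation (actions a, outcomes o, evidence e, utility U), the row-versus-column comparison is just:

```latex
\sum_{e} P(e)\,\max_{a} \sum_{o} U(o)\,P(o \mid a, e)
\;\ge\;
\max_{a} \sum_{e} P(e) \sum_{o} U(o)\,P(o \mid a, e)
\;=\;
\max_{a} \sum_{o} U(o)\,P(o \mid a),
```

since a maximum of averages can never exceed the average of the maxima. In the barrel example, the left side is 0.625 × $9.60 + 0.375 × $6.00 = $8.25 and the right side is $7.50.
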
And this is just a very small part of the kind of math that would go into step seven of a Gödel machine, where you're trying to prove that the modifications you make to your strategies increase expected utility. And again, note that even though this is very simple mathematically, the standard citation for the answer comes ten years after the standard citation for the question. Not necessarily because it's super difficult mathematically, but just because math actually does take ten years if you're not asking the right questions to start with and focusing specifically on the correct avenue for the answer.

But with that proof in hand we can see that the challenge of building a program which executes Bayesian updates in general involves more sophisticated math, but is nonetheless still straightforward given infinite computing power, compared to just solving that one particular problem.

So let's take it up another step of meta and ask how you build an AI that builds an AI that builds an expected utility maximizer. Again, contrary to what you're expecting, this is not all that much more complicated if you're familiar with modern mathematical logic, because even though the previous problem was solved in 1967, we're looking here at a problem that was solved in the 1930s by Gödel, namely, we need proofs about proofs.

So, to keep track of the levels of meta here: the object-level calculator just needs to execute one Bayesian update. One meta level up, we need to build that calculator, which means we need to do proofs about expected utility in general. To build the AI that builds that AI, and yes that is GLaDOS, we need to prove things about the AI proving things about expected utility.

So, one simple way of doing this would be to add an axiom on top of the base system that the first AI uses to search for its proofs. Let's call the system that the first AI uses P; we then have an axiom in P+1 which says "if something is provable in P, then it's true." In particular, if you can prove in P that something maximizes expected utility, then that thing maximizes expected utility.

Now of course we want the meta-meta AI that designs that first AI, but that's, you know, pretty easy again. We just have a system P+2 which has an axiom stating that if there exists a proof in P+1 that something maximizes expected utility, then it maximizes expected utility. So the recursion looks something like this: P proves blah blah blah, the base AI's proof system, the sort of proof system that we would use to follow I. J. Good's proof. It would have the sort of axioms you need in order to prove that doing a Bayesian update maximizes expected utility.

P+1, which builds that AI, has to prove that if something is provable in P, this little box symbol means provable, then that thing is true. Or rather, if there exists a proof in P of S, that implies S. Then if we want the meta-meta AI that builds that AI, you just need the language P+2 that trusts the proof system P+1, which states that if there exists a proof of S in P+1, that implies S. Now of course what we really want, this is step seven on that poster with eleven simple steps to build a Gödel machine, what we really want is a system which can prove that modifications of itself improve expected utility. Which you might think would be as simple as having a language containing an axiom which says: if a proof of S exists in this language, then S. I mean, it doesn't even seem like it ought to be adding anything to the power of the language, right? The language already trusts itself in some sense, so why not add this axiom?

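One standard way to write that tower down, using the box notation for provability (a plausible reading of the slide, not a quote from it):

```latex
P{+}1 \;=\; P \,\cup\, \{\, \Box_{P}\ulcorner S\urcorner \rightarrow S \;:\; S \text{ a sentence} \,\}, \qquad
P{+}2 \;=\; P{+}1 \,\cup\, \{\, \Box_{P{+}1}\ulcorner S\urcorner \rightarrow S \;:\; S \text{ a sentence} \,\},
```

where the box with subscript T abbreviates "there exists a proof of S in T." The tempting shortcut is a single system T containing the schema "box-sub-T of S implies S" for every S, and that is exactly what the theorem in the next paragraph rules out.
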
The reason you cannot do this is called Löb's theorem. Think of Gödel's theorem as the mathematical formalization of "This sentence is false," namely, "This sentence is not provable." Löb's theorem is based on the Santa Claus paradox: "If this sentence is true, then Santa Claus exists." Now consider, my friends, suppose that sentence actually was true. Well, if the sentence is true, then it's true that if the sentence is true then Santa Claus exists, so if the sentence is true, then Santa Claus exists, which is just what the sentence says, so it is true, and Santa Claus does exist. It also works for God.

So the mathematical formalization of this is: if the language P proves this sentence, then C. In other words, you have a recursive statement, L, which is encoded using a fancy trick called diagonalization, saying "if P proves L, that implies C." Now, you can always state Löb's sentence L over here. You can't always prove it, but you can always state it. And it's usually easy to show in P, like any language which extends Peano Arithmetic will show, that "if the system P proves this sentence L, then system P also proves C." Which means that if P also proves the statement "if P proves C, then C," then we can combine "provability of L implies provability of C" with "provability of C implies C" to get "provability of L implies C," which is just the sentence L itself; so L is provable, which gives us "provability of L," and then we get "C."

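Löb's theorem, stated for any theory T extending Peano Arithmetic with box for provability in T: if T proves "box C implies C," then T proves C. The derivation the paragraph above walks through, in symbols:

```latex
\begin{array}{ll}
1. & T \vdash L \leftrightarrow (\Box\ulcorner L\urcorner \rightarrow C) \quad\text{(diagonalization)}\\
2. & T \vdash \Box\ulcorner L\urcorner \rightarrow \Box\ulcorner \Box\ulcorner L\urcorner \rightarrow C\urcorner \quad\text{(from 1, derivability conditions)}\\
3. & T \vdash \Box\ulcorner L\urcorner \rightarrow (\Box\ulcorner \Box\ulcorner L\urcorner \urcorner \rightarrow \Box\ulcorner C\urcorner) \quad\text{(2, distribution)}\\
4. & T \vdash \Box\ulcorner L\urcorner \rightarrow \Box\ulcorner \Box\ulcorner L\urcorner \urcorner \quad\text{(internal necessitation)}\\
5. & T \vdash \Box\ulcorner L\urcorner \rightarrow \Box\ulcorner C\urcorner \quad\text{(3, 4)}\\
6. & T \vdash \Box\ulcorner C\urcorner \rightarrow C \quad\text{(the hypothesis)}\\
7. & T \vdash \Box\ulcorner L\urcorner \rightarrow C \quad\text{(5, 6)}\\
8. & T \vdash L \quad\text{(1, 7)}\\
9. & T \vdash \Box\ulcorner L\urcorner \quad\text{(8, necessitation)}\\
10. & T \vdash C \quad\text{(7, 9)}
\end{array}
```
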
So, Löb's theorem says that if P proves "if C is provable, then C is true," then you can use this to directly prove C. For example, you might want to have a language which says, "Well, if I can prove a statement of the form x plus y equals z, then x plus y equals z." But if you have that in general, you also have "If I can prove two plus two equals five, then two plus two equals five," from which we can directly obtain "two plus two equals five." This is bad, because two plus two does not equal five.

So, you are allowed to have a proof system P that proves a statement S. You can have a proof system P+1 that includes a general schema saying that provability of S in P implies S. You can have P+2, which says anything you can prove in P+1 is true. You can have P+ω, which says that for any finite n, if you can prove something in P+n, then it is true. And you can have P+ω+1, which, as you might guess, says that if you can prove something in P+ω, it's true. What you can't have is the language which contains a self-referential axiom saying "If a proof of S exists in this language, then it is true."

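In the same notation as before, one way to write the transfinite step (my transcription, not the slide's):

```latex
P{+}\omega \;=\; P \,\cup\, \{\, \Box_{P{+}n}\ulcorner S\urcorner \rightarrow S \;:\; n \in \mathbb{N},\; S \text{ a sentence} \,\}, \qquad
P{+}\omega{+}1 \;=\; P{+}\omega \,\cup\, \{\, \Box_{P{+}\omega}\ulcorner S\urcorner \rightarrow S \;:\; S \text{ a sentence} \,\}.
```

Each system trusts everything strictly below it; none of them trusts itself.
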
So, we can solve the problem of building an AI that builds an AI that builds an AI that builds an AI that builds an AI that builds an AI, pretend I just kept on talking for a billion years, that builds an AI that builds an expected utility maximizer. But what we actually want is an AI that rewrites itself directly.

Marcello Herreshoff and I once spent a summer sort of, like, messing around with paper toy systems, trying to figure out exactly what Löb's theorem was saying we couldn't do and whether there was any way to bypass it. We came up with a couple of cheap tricks; we didn't solve the core problem. An example of a trick that doesn't work: you can't have P₀ prove P₋₁ sound, P₋₁ prove P₋₂ sound, and so on, so as to allow an infinite descending sequence of self-modifications, because you can prove in P₀ that if there exists a proof of S in P₋₁, then you can map it onto a proof in P₀.

So, this is an open problem of Friendly AI. We do not know how to solve this problem. We do not know how to describe, even in principle, given infinite computing power, how to build an AI that could completely rewrite itself without decreasing the amount of trust it had in math every time it executed that self-rewrite.

You might ask, "Why not just have P+2^1024, or something else that would outlast the expected age of the universe?" One sort of answer would be, "Well, you know, what if we invent some new physics, and find some way to outlast the expected age of the universe, and then our civilization dies because its central AI broke down?" Another answer, a sort of deeper answer, is that you will run into this problem any time the AI notices it has a proof in memory; if it's a fully reflective AI that has full access to its own source code and reasons about its own source code, the AI is going to ask itself, "Wait a minute, just because I remember proving this theorem, why should I trust it? I mean, just because something is proven in my base language doesn't mean that it's true." And the deepest answer is that a conundrum like this tells us that there are things we don't really understand yet about self-modification, that we do not understand the math of an AI modifying itself even in principle as well as we understand, "Well, here's the game tree of chess, you can reach this position from that position, ..." We understood how to solve chess in principle, given infinite computing power, a long time before we could solve chess in practice. This is a problem we don't understand even in principle.

All right, next open problem of Friendly AI. Say I have ten minutes left. The prisoner's dilemma. I hope you have all heard of the prisoner's dilemma. Actually, I see that I messed up this particular slide over here. But the logic of the prisoner's dilemma is that if both sides defect, they each get one dollar; if both sides cooperate, they each get a hundred dollars; and, keeping the opponent's move fixed, you always get one dollar more for defecting than for cooperating.

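Those three constraints pin down the whole payoff matrix; here is the dominance argument run out in a few lines of Python (the exact off-diagonal values, $101 and $0, follow from "one dollar more for defecting").

```python
# Payoff matrix implied by the rules above: mutual cooperation pays $100,
# mutual defection $1, and defecting always pays exactly $1 more than
# cooperating against the same opponent move.
PAYOFF = {  # (my_move, their_move) -> my dollars
    ("C", "C"): 100, ("C", "D"): 0,
    ("D", "C"): 101, ("D", "D"): 1,
}

# Dominance: whatever the opponent does, D beats C by one dollar...
for their_move in ("C", "D"):
    print(their_move, PAYOFF[("D", their_move)] - PAYOFF[("C", their_move)])  # 1, 1

# ...yet if both players follow that reasoning, each gets $1 instead of $100.
print(PAYOFF[("D", "D")], "vs", PAYOFF[("C", "C")])
```
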
Now you might think that the obvious move here is just for both players to cooperate. I hope. If there are any of you who think it would be a good idea to defect in the prisoner's dilemma, remind me not to eat any food you might have poisoned. But according to conventional game theory, in the one-shot version of the prisoner's dilemma, where both players have to choose not knowing what the other player picked and have no way of enforcing contracts, player one reasons: "Well, if player two cooperates, I get one dollar more if I defect than if I cooperate. If player two defects, I get one dollar more if I defect than if I cooperate. Clearly I ought to defect."

And now that I've shown you this logic, you know that if, like, player one and player two are the same sort of decision agent, well, probably player two is going to decide exactly the same thing for exactly the same reasons and decide to defect as well, so you end up with the one-dollar/one-dollar payoff instead of the hundred-dollar/hundred-dollar payoff.

Now, in real life, people have a remarkable tendency to both cooperate and end up with the hundred dollars apiece. Even in tougher versions of this game, where this is more like two dollars and two dollars, and this is more like three dollars and three dollars, or something like that.

But just to see that classical game theory is not being completely stupid here, imagine you're player two and player one is a coin being flipped, or player one is a piece of paper that has "defect" written on it, and then you would see that you would indeed want to defect.

Now, you might think: well, maybe you ought to realize, "The other player's using pretty much the same line of reasoning as I am, they'll probably decide pretty much the same thing I'm deciding, so I'll cooperate," and then player two will probably decide the same thing and you'll both end up with the hundred dollars. Douglas Hofstadter proposed something like this informally in a Metamagical Themas article; he called it "superrationality," a term I would actually object to, for there is nothing above rationality.

If you wanted to formalize this argument, you'd pull up the expected utility formula and say: well, I'm supposed to choose an action such that when I assume this action is taken, my probability of getting nice consequences goes up. In other words, you're maximizing, over actions, the sum of the utility of each outcome times the probability of that outcome given that action. Now, if you were considering a prisoner's dilemma for the first time in your life, and you didn't know whether you were going to cooperate or defect, then as soon as you realized you were going to cooperate, or as soon as you realized you were going to defect, you would probably realize that the other person would do the same thing. So wouldn't the expected utility formula say that since you expect more cooperation assuming you yourself cooperated, you should therefore cooperate?

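In symbols, that is the evidential form of the expected utility formula, conditioning on the action as if it were an observation (standard notation, not a quote from the slide):

```latex
a^{*} \;=\; \operatorname*{arg\,max}_{a} \sum_{o} U(o)\, P(o \mid a).
```
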
Classical decision theory says "no," due to something called the smoking lesion problem. This was a thought experiment that was very influential in the modern formulation of decision theory. It's based on an embarrassing historical episode in which someone named Fisher, one of the great enemies of the Bayesians, that's how you know he's a bad guy, testified in front of Congress that the massive observed correlations between lung cancer and smoking didn't prove that smoking caused lung cancer, since after all it could be that there was a gene that caused people to like the taste of tobacco and also made them more likely to get lung cancer. Hence the saying, "Correlation doesn't prove causation, but it points a finger and makes suggestive eyebrow-wiggling motions as if to say, 'Look over there.'" And this also proves that everyone who is an enemy of the Bayesians is evil.

So the idea here is that, suppose this was actually true. It's a bit of a confusing example, but it's also a standard example, so I'm going to use it. Suppose it's actually true that smoking does not cause lung cancer directly, and there's just this genetic lesion which causes some people to enjoy the taste of tobacco and also to get lung cancer. Should you decide to smoke? Smoking can't cause you to develop lung cancer, but it presents you with the bad news that you're statistically more likely to have this genetic lesion. Clearly in this case, not in real life, you ought to smoke, because it can't actually make you get lung cancer.

And the rule here is that when you calculate the effect of an action in an expected utility formula, you update the causal descendants of the action, rather than its ancestors, in your Bayesian network. If you observe someone smoking, you update your probability that they have the genetic lesion and then you update your probability that they'll get lung cancer. So if you observe someone else, you update the straight way; but if you imagine choosing to smoke, then you modify the Bayesian network by severing this little link over here, and you only update the probability that you will enjoy the delicious, harmless taste of tobacco, rather than the probability that you have the genetic lesion, since of course smoking can't affect that one way or the other.

Or consider a somewhat simpler example: people who have higher incomes tend to buy larger houses, which in turn gives them more costly mortgage payments. If I see someone buy a large house, then ceteris paribus I should believe it's statistically more likely that they have a higher income. On the other hand, you cannot actually make yourself have a higher income by buying a larger house, contrary to what many people seemed to believe in 2008. So, the general rule in classical decision theory is that you calculate the expected utility of an action by looking at the effects descending from the action, without updating the probabilities of the apparent causes, which we write using the counterfactual conditional over there. So the real formula isn't the probability of O given that you see A, it's the probability of O given that you do A.

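One common way to write the contrast is with Pearl's do-operator standing in for the severed-network counterfactual described above (the slide itself uses a counterfactual conditional; the do-notation here is my substitution):

```latex
\text{evidential: } a^{*} = \operatorname*{arg\,max}_{a} \sum_{o} U(o)\, P(o \mid a)
\qquad\qquad
\text{causal: } a^{*} = \operatorname*{arg\,max}_{a} \sum_{o} U(o)\, P\big(o \mid \mathrm{do}(a)\big).
```

Conditioning on do(a) updates only the descendants of the action node, which is exactly the buy-a-big-house case: observing the purchase raises your estimate of someone's income, but do(purchase) does not.
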
And in the prisoner's dilemma, well, there is some sort of background probability that rational agents are cooperative, but according to classical decision theory, you can't make the other agent be cooperative by cooperating yourself. You are only supposed to do that which physically causes good things to happen; therefore, defect. And it was generally believed that it was impossible to write out a general computation, a general decision formula that's the same in all cases, which would cooperate on the one-shot prisoner's dilemma without also buying larger houses in order to have higher incomes. The sort of odd thing here is that this says you should defect on the prisoner's dilemma even against your own copy. In other words, let's say you're both AIs, because at the Singularity Institute we think in terms of AIs: classical decision theory says you should defect against an AI that is an identical copy of your own source code. Seems a little odd.

What we think should happen is that if the AIs know that they're copies of each other, they should prove that they will take the isomorphic action in both cases, and then decide to end up on the cooperate diagonal, rather than the defect diagonal.

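A much weaker stand-in for that idea, but one that runs: an agent that cooperates only when the opponent's source code is an exact textual copy of its own. This is not the proof-based approach described above, just a toy illustration; the function and the whole setup are hypothetical.

```python
import inspect

def agent(opponent_source: str) -> str:
    """Cooperate iff the opponent is an exact textual copy of this agent.

    A toy stand-in for "prove the opponent takes the isomorphic action":
    two identical copies land on the cooperate-cooperate diagonal
    without any enforceable contract.
    """
    my_source = inspect.getsource(agent)
    return "C" if opponent_source == my_source else "D"

if __name__ == "__main__":
    print(agent(inspect.getsource(agent)))      # C (playing against a copy)
    print(agent("def agent(s): return 'D'"))    # D (playing against a defector)
```
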
So there's a general principle, "choose as if controlling the mathematical output of your own computation," which we tried to formalize over here. You have a sort of self-referential statement: you maximize over A the utility of each outcome times the probability of that outcome, assuming that this current computation outputs A.

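One way to transcribe that self-referential formula, where "this computation = a" means "the output of this very decision computation is a" (my rendering, not a quotation of the slide):

```latex
a^{*} \;=\; \operatorname*{arg\,max}_{a} \sum_{o} U(o)\, P\big(o \,\big|\, \ulcorner \text{this computation} \urcorner = a \big).
```
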
And I had a bit more I wanted to say about this, and about Newcomb's problem, and about how if you have our type of decision agent working with a classical decision agent, it can blackmail it and take its lunch money, but I'm actually just going to go ahead and skip to some of the open problems in timeless decision theory, like proving a blackmail-free equilibrium among timeless strategists, avoiding proving a contradiction in the formula there, and better formalizing a hybrid of causal and mathematical inference. And here are all the other open problems in Friendly AI I totally didn't get to in this talk.