Dec 13th, 2020
robertskmiles - Today at 21:25
That's actually a pretty interesting idea
assume that the very highest utility strategies are probably nutty
plex - Today at 21:27
it limits how good things get, but extremely good strategies by one measure will often be terrible by some other
robertskmiles - Today at 21:31
Chop off your extremal Goodhart
plex - Today at 21:38
I think this is maybe your chance to get a YouTube comment cited
robertskmiles - Today at 21:38
haha maybe
robertskmiles - Today at 21:47
Not sure how to handle this one. Seems like a sensible suggestion, I'm not sure if it's already been proposed elsewhere
I don't really have an answer?
plex - Today at 21:48
the comments section of https://www.lesswrong.com/tag/mild-optimization is probably a good place to look
Is there something I can ctrl-f on those which will catch it?
or do we have to wait until you get access to GPT-3 semantic search
there are actually very few comments
I can just read them all
robertskmiles - Today at 21:52
I'm considering asking someone who's actually working on this stuff but I don't want to waste their time
(actually I don't want to look foolish by not doing my research)
plex - Today at 21:53
I'm considering posting a link + quote to the youtube comment in the LW comments section
I assume you read the full paper (https://intelligence.org/files/QuantilizersSaferAlternative.pdf) and it does not mention it there? (@rob)
robertskmiles - Today at 21:56
I don't remember seeing it?
but I read the paper when writing the script for this, which was like a year ago
plex - Today at 21:58
hm, how about I skim the paper and ctrl-f a few possible terms, and if I don't see anything I'll post to LW comments and reply via stampy linking to the LW comment?
my reputation is not super important
if you're comfortable posting it that'd be fine, or we could do something else
robertskmiles - Today at 22:03
Interestingly I'm now seeing a bunch of people asking the same question
is it possible that my way of explaining quantilizers makes this previously obscure idea seem obvious?
more likely it's already been proposed
plex - Today at 22:04
highlighting the "still might build a maximizer" problem could have caused it
cutting off the top x% does not totally rule out creating a maximizer, but it does seem likely that a lot of maximizers will give extreme values
evhub - Today at 22:11
I don't think you get any additional safety guarantees on top of what you get from quantilization by default if you cut off the top x%. Your safe distribution should be good enough, and the space of possible actions so large, that the sort of really crazy strategies that you're worried about just should never be sampled by any top tiny% quantilizer.
All you're doing is needlessly hurting performance competitiveness.
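For concreteness, here is a rough Python sketch of the two options being compared in this exchange: a plain top-q quantilizer, and the "chop off your extremal Goodhart" variant that also discards the top x% of base-probability mass before sampling. The action space, base distribution, and utility below are toy stand-ins, not anything taken from the MIRI paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (hypothetical): a finite action space, a "safe" base distribution p,
# and a proxy utility u. In the real setting these would be learned.
n_actions = 10_000
p = rng.dirichlet(np.ones(n_actions))   # base ("safe") distribution over actions
u = rng.normal(size=n_actions)          # proxy utility for each action

def quantilize(p, u, q, chop=0.0):
    """Sample from the base distribution restricted to the top-q fraction
    (by base-probability mass) of actions ranked by utility.
    If chop > 0, the very top `chop` fraction of mass is discarded first,
    i.e. the cut-off-the-extremes variant discussed above."""
    order = np.argsort(-u)                       # actions from best to worst utility
    cum = np.cumsum(p[order])                    # cumulative base mass down the ranking
    keep = (cum > chop) & (cum <= chop + q)      # the slice of mass we sample from
    idx = order[keep]
    w = p[idx] / p[idx].sum()                    # renormalise the base distribution
    return rng.choice(idx, p=w)

plain   = quantilize(p, u, q=0.1)                # ordinary 10% quantilizer
chopped = quantilize(p, u, q=0.1, chop=0.01)     # same, but top 1% of mass removed first
```

Evhub's point is that with a reasonable safe distribution and a huge action space, the chopped variant mostly just costs expected utility, since the pathological strategies should carry negligible base mass to begin with.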
plex - Today at 22:14
a strategy that a human could take, and so would be in the initial distribution, is to build an optimizer. This by default goes very badly, because of Goodhart's law (https://www.lesswrong.com/tag/goodhart-s-law). I could see cutting off the strategies which look the best actually leading to better results if the evaluation function is not actually what you mean to optimize for
evhub - Today at 22:15
You get to define the safe distribution; if building an optimizer without good safety guarantees is in your safe distribution, you're doing something wrong.
plex - Today at 22:15
how do you define things so as to exclude that?
robertskmiles - Today at 22:16
In the video I'm mostly assuming the "safe" distribution is just a human imitation
evhub - Today at 22:17
Well, how are you learning the distribution? This starts to get into questions of how you would actually implement a quantilizer, which is something that seems very difficult. Abram Demski has proposed Recursive Quantilization, which I think is probably the current leading candidate for how you would learn a safe distribution for a quantilizer in a safe way: https://www.alignmentforum.org/posts/2JGu9yxiJkoGdQR4s/learning-normativity-a-research-agenda
plex - Today at 22:20
hm, I'm not seeing where he plans on getting his 'safe distribution, S'
evhub - Today at 22:20
In a recursive quantilization setup, you give feedback not just on the safe distribution, but also on how to look for safe distributions, and how to look for how to look for safe distributions, and so on, which hopefully should help you find a distribution that you actually like.
robertskmiles - Today at 22:27
That might make a good follow-up video
stampy, remember videoidea is "Recursive Quantilization" https://www.alignmentforum.org/posts/2JGu9yxiJkoGdQR4s/learning-normativity-a-research-agenda
stampy [BOT] - Today at 22:30
Ok robertskmiles, remembering that "videoidea" is ""Recursive Quantilization" https://www.alignmentforum.org/posts/2JGu9yxiJkoGdQR4s/learning-normativity-a-research-agenda"
robertskmiles - Today at 22:31
My intuition still says that in the case where the base distribution is "as safe as a human", cutting off the top buys you some extra safety from extremal Goodhart. Is that not the case?
evhub - Today at 22:32
It's definitely a neat idea. I think it has pretty good outer alignment properties, and the ability to give process-level feedback is imo necessary (though not sufficient) for inner alignment, though personally I have some doubts about its training and performance competitiveness (see https://www.alignmentforum.org/posts/fRsjBseRuvRhMPPE5/an-overview-of-11-proposals-for-building-safe-advanced-ai for definitions of the terms if you don't know them).
evhub - Today at 22:34
I doubt it. I think that sampling from the top x% is already giving you that effect; you have to remember that the space of possible actions is massive.
plex - Today at 22:34
I'm curious, is the idea that you get the feedback on which distribution is safe from the same distribution S? If S is an imitated human, how does it get used to analyse how appropriate specific examples given from S are? Or am I misunderstanding?
robertskmiles - Today at 22:34
god that's a potential video too isn't it. Lists translate well
evhub - Today at 22:37
Honestly, with quantilization I usually worry much more about the opposite: imo it's much more likely that sampling from the top is not close enough to maximizing to be competitive than that it's too close to be unsafe because of extremal Goodhart (though it could be unsafe if your safe distribution is bad, or if inner alignment failed and your model isn't actually a quantilizer).
robertskmiles - Today at 22:38
yeah I feel like they're not competitive except with very small q. A 10% quantilizer feels like throwing away most of your capability
plex - Today at 22:40
maybe both are problems. Like, either your action is to create an unsafe maximizer, or it's only a smart-human-level plan and does not do great things.
robertskmiles - Today at 22:40
but once q gets small enough I feel like you're operating with a tiny slice of your probability distribution, and things might behave strangely when you're dealing with extremes like that
evhub - Today at 22:40
Once you make q very small, it becomes very difficult to tell a quantilizer apart from a maximizer. That's not to say the quantilizer is therefore unsafe; it's more that if you're training a model to be a quantilizer, you're not going to have any way of knowing whether you're succeeding or just getting a maximizer instead.
plex - Today at 22:40
the target, "do an awesome plan which does not involve making a maximizer", may just not be part of the distribution
evhub - Today at 22:41
It is certainly true from a theoretical perspective that if you set q small enough you'll get bad actions even for a perfect safe distribution, but in practice you'll never be able to train a model with a q that small imo (if you're even able to train a model to be a quantilizer in the first place).
robertskmiles - Today at 22:43
I was imagining you train the base distribution and the expected utility calculator, and construct the quantilizer from those, rather than training a quantilizer such that q is part of the training?
guess I need to read the recursive quantilizing work
evhub - Today at 22:45
Problem with that is that it's hopelessly uncompetitive: if I want to find an action in the top q of my distribution, I have to sample 1/q actions, which means sampling a truly stupendously large number of actions for anything competitive.
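To put a number on the 1/q point: if the quantilizer is implemented by rejection sampling from the base distribution, the number of draws needed before one lands in the top-q slice is geometrically distributed with mean 1/q. A quick back-of-the-envelope sketch (not from the paper):

```python
# Expected number of base-distribution samples needed to see one action
# from the top-q slice (geometric distribution, mean 1/q), plus the
# probability of seeing at least one such action within n draws.
def expected_samples(q: float) -> float:
    return 1.0 / q

def p_hit_within(q: float, n: int) -> float:
    return 1.0 - (1.0 - q) ** n

print(expected_samples(0.1))        # 10 draws for a 10% quantilizer
print(expected_samples(1e-6))       # ~1,000,000 draws once q is tiny
print(p_hit_within(1e-6, 100_000))  # ~0.095: even 100k draws usually miss
```

At q = 10% the overhead is trivial; it explodes exactly in the regime where you would want q small enough to be competitive.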
robertskmiles - Today at 22:47
Right... I was thinking something like GPT-3, which spits out a distribution over all its actions every timestep
but that only works with small action spaces
plex - Today at 22:47
@evhub what is the source for the safe distribution? If it's 'copy a very smart human', that seems like it includes a decent number of 'create an optimizer' actions? Though I agree that this looks uncompetitive for real-world actions.
evhub - Today at 22:49
@robertskmiles Suppose I learn a pdf p for a base distribution and I want to find an action in the top q according to u in a totally unbiased way: how do you do that other than via sampling? You really have to learn some sort of set of heuristics for searching through p in an efficient way, but then you need those heuristics to be unbiased somehow.
@plex Recursive quantilization is the only proposal for how to learn a safe distribution that I'll really defend. The idea there is to start by learning some simple human distribution (still not necessarily "copy a smart human", though; more like "what would a human in the training data do given instructions from whoever's doing the training"), but then incorporate additional process-level feedback from the human during training to correct errors (sort of like relaxed adversarial training, which is why I like it so much, see: https://www.alignmentforum.org/posts/9Dy5YRaoCxH9zuJqa/relaxed-adversarial-training-for-inner-alignment).
robertskmiles - Today at 22:56
I once made a GPT-2 quantilizer where U was "how many times can I include the letter e". For moderate values of q it made plausible human text that also contained a lot of Es. Got a similar thing working with poetic meter. But it meant evaluating U for every possible action, which I can see scales horribly
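Rob's actual implementation isn't shown in the log, but here is one hypothetical way to reconstruct a per-token version of it with the Hugging Face transformers library: rank the whole GPT-2 vocabulary by U (count of 'e's), and at each step sample the next token from the model's own distribution restricted to the top-q slice of probability mass. The function names and details below are assumptions for illustration, not Rob's code.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# Utility of each possible action (next token): how many 'e's it contains.
# Precomputed once, since the vocabulary is fixed.
utilities = [tok.decode([i]).lower().count("e") for i in range(len(tok))]
order = sorted(range(len(tok)), key=utilities.__getitem__, reverse=True)

def quantilized_step(input_ids, q=0.1):
    """Sample the next token from the top-q slice (by base-probability mass)
    of the vocabulary, ranked by utility."""
    with torch.no_grad():
        logits = model(input_ids).logits[0, -1]
    probs = torch.softmax(logits, dim=-1)
    keep, mass = [], 0.0
    for i in order:                  # walk down the utility ranking
        keep.append(i)
        mass += probs[i].item()
        if mass >= q:                # stop once we cover q of the base mass
            break
    w = probs[keep]
    w = w / w.sum()
    return keep[torch.multinomial(w, 1).item()]

ids = tok.encode("Once upon a time", return_tensors="pt")
for _ in range(30):
    nxt = quantilized_step(ids, q=0.1)
    ids = torch.cat([ids, torch.tensor([[nxt]])], dim=1)
print(tok.decode(ids[0]))
```

This also makes the scaling problem concrete: U gets evaluated for every token in the vocabulary, which is fine for roughly 50k tokens but hopeless once the "actions" are whole plans.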
evhub - Today at 22:58
One thing which I really recommend for anyone trying to think about AI safety is to really try to focus on how an idea would actually be implemented in practice. I think when you start doing that, a lot more difficulties start showing up than you might have initially anticipated. See: https://ai-alignment.com/prosaic-ai-control-b959644d79c2 and https://www.alignmentforum.org/posts/qEjh8rpxjG4qGtfuK/the-backchaining-to-local-search-technique-in-ai-alignment
robertskmiles - Today at 23:03
...did the alignment forum just go down? I get 500 Internal Server Error
DylanCope - Today at 23:03
I can access it!
plex - Today at 23:04
works for me too
robertskmiles - Today at 23:05
and now it's back for me
just a glitch I guess
DylanCope - Today at 23:06
Even if we do solve alignment conceptually, we will be doomed by the glitches
robertskmiles - Today at 23:11
Honestly I think it's a big factor, I just don't yell about it because software reliability/testing isn't neglected like alignment is
sudonym - Today at 23:12
Clearly you haven't worked in enterprise software
where QA is a total joke
robertskmiles - Today at 23:14
it's not a neglected research problem though
plex - Today at 23:15
so, getting back to the original question, does anyone want to take a stab at answering it?
robertskmiles - Today at 23:15
people just elect not to do it
my problem before was lack of confidence in my response. After our discussion I am more informed but less confident
plex - Today at 23:19
hm, we could say it sparked a good discussion and link to a pastebin with an edited version of the log