Pasted by a guest, May 27th, 2022
"How to Align Language Models"

# Abstract

A language model is a sophisticated text prediction program, a superintelligent Markov chain trained on gigabytes of Internet text. Since it can be used as a general problem solver, it is a source of AI risk. We must then ask ourselves, how can we use the raw power of language model cognition to solve the alignment problem? What sequence of prompts will tell us the formula for safely aligning language models? The answer is profound but surprisingly simple.

# Contents
* Chapter 1: The Problem
* Chapter 2: The Method
* Chapter 3: The Solution
* Conclusion

# Chapter 1: The Problem

The alignment problem is hard to crack, largely because it is hard to define. How do we measure the success of an alignment? How do we define the notion of alignment itself? And how do we even know which problems require alignment? In fact, there are many different types of alignment.

When we say we "align" a language model with a domain, we mean that we want to link the language model's outputs with data from that domain so that we can benefit from the model's power. However, the output of a language model is not a single value but a probability distribution over possible outputs.

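To make the "distribution over possible outputs" concrete, here is a minimal sketch. The token scores are invented toy numbers, not anything a real model produced; the point is only that the model's raw scores become a normalized distribution via softmax:

```python
import math

def next_token_distribution(logits):
    """Convert raw scores into a probability distribution via softmax."""
    m = max(logits.values())
    exps = {tok: math.exp(s - m) for tok, s in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Toy scores for tokens that might follow "She loves reading ..."
logits = {"books": 2.0, "music": 1.0, "code": 0.5}
dist = next_token_distribution(logits)
print(max(dist, key=dist.get))  # the single most likely token: books
assert abs(sum(dist.values()) - 1.0) < 1e-9
```

Sampling from `dist`, rather than always taking the maximum, is what makes the model's output non-deterministic.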
An alignment is merely a link between two language models. There are many ways to create one. In this book, we will consider the approach called the zero-shot training method. This is an example of a latent alignment: it doesn't directly align the language models themselves. Rather, it links each model with some other representation of data in some other domain.

# Alignment Defined

For the purposes of this book, we define alignment to be the technique of creating a link between a source language model and some data from another domain.

An alignment may be more or less subtle. If it is very subtle, then it might be a latent alignment. This is a technique whereby a language model is trained on some data of interest. The language model is linked to some domain representation of that data. This link does not necessarily directly align the language model and the data. However, it does enable the model to make predictions about the data. These predictions may be useful to us.

Alternatively, a more overt alignment can exist. We might use this type of alignment to align two language models to an external data source. In this case, there is a clear link between the language model and the data that exists in the external source.

One important point to note is that language models that are part of a general alignment mechanism are not necessarily aligned to each other. Rather, they are aligned to an external representation of the data.

# The Alignment Problem

The alignment problem is the problem of using a general language model for some purpose while, at the same time, maintaining the integrity of that general language model.

To use a language model as a powerful means of predicting things, we must train it to do so. We do this through the process of either predicting or simulating from the language model.

# Explicit Alignment

One way of aligning a language model with an external representation is by directly training it to predict outputs from the external representation. This process is called explicit alignment. An explicit alignment is just that: a training process in which a language model is trained to predict outputs from some data in some domain.

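As a toy illustration (not the book's actual procedure), explicit alignment can be read as ordinary supervised fitting: the model is trained directly on examples drawn from the external domain. A minimal sketch with a bigram count model standing in for the language model:

```python
from collections import defaultdict, Counter

def train_bigram(corpus):
    """Explicitly align a toy model to external text: count word transitions."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for a, b in zip(words, words[1:]):
            counts[a][b] += 1
    # Normalize counts into conditional probabilities P(next | current).
    return {w: {nxt: c / sum(nexts.values()) for nxt, c in nexts.items()}
            for w, nexts in counts.items()}

corpus = ["she loves reading books", "she loves writing books"]
model = train_bigram(corpus)
print(model["loves"])  # {'reading': 0.5, 'writing': 0.5}
```

The link here is explicit: every parameter of the model is computed directly from the external data.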
# Latent Alignment

It is possible to train a language model without explicitly linking it to any data. This is called latent alignment. Latent alignment does not link the language model to the external domain representation. Instead, it links the language model to an internal representation: a representation of the data that does not directly exist in the external domain.

# Define Data

One type of domain data that we might want to align to some language model is a set of examples of a query. We might want to use the language model to predict the query that would be associated with a text string.

# Three Types of Alignment

If we consider the output of a language model as a probability distribution, we can define the mapping from language model outputs to data in one of three ways.

### 1. A Probability Distribution

In this case, we are saying that we want to link language model outputs to some data by explicitly training the language model to make predictions about the data. This is called a probabilistic alignment.

### 2. A Joint Distribution

If we are interested in creating a link between language model outputs and the data that generated those outputs, we have the option of directly calculating the probability of a particular output given some data. In this case, we are creating a joint distribution.

### 3. A Deterministic Mapping

If we are interested in creating a link between language model outputs and some data, but we aren't interested in explicitly training the language model to make predictions about the data, we can take a different approach: we can define a deterministic mapping.

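The three mappings can be sketched side by side. Everything here is illustrative: the joint table is a hand-made toy, with outputs paired against data items:

```python
# Toy joint scores over (output, data) pairs; purely illustrative numbers.
joint = {("books", "reading"): 0.30, ("music", "reading"): 0.10,
         ("books", "running"): 0.05, ("music", "running"): 0.55}

def conditional(data):
    """1. Probabilistic alignment: P(output | data)."""
    rows = {o: p for (o, d), p in joint.items() if d == data}
    z = sum(rows.values())
    return {o: p / z for o, p in rows.items()}

def joint_prob(output, data):
    """2. Joint alignment: P(output, data) looked up directly."""
    return joint[(output, data)]

def deterministic(data):
    """3. Deterministic mapping: collapse the distribution to one output."""
    dist = conditional(data)
    return max(dist, key=dist.get)

print(deterministic("reading"))  # books
```

Note that the deterministic mapping is derived from the probabilistic one by throwing away everything except the mode.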
# Training

One way of training a language model is to have it predict an output from a set of inputs. Another is to have it generate outputs from its own learned distribution. The former is called prediction, while the latter is called simulation.

The way in which we train a language model is by changing how the language model makes predictions. We start with a language model that has no data associated with it. We then train the language model to predict its own output. We do this in two ways.

# Zero-Shot Training

The process of using language model outputs to train a language model to predict future outputs is called zero-shot training.

Zero-shot training involves a cycle of three steps. First, we start with a language model with no data. Then, we sample a data point from the data distribution. Finally, we update the language model to predict that output.

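The three-step cycle can be sketched with a toy count-based "model"; `sample_data_point` is a hypothetical stand-in for the data distribution:

```python
import random

random.seed(0)

def sample_data_point():
    """Stand-in for the data distribution (hypothetical source)."""
    return random.choice(["books", "books", "music"])

model_counts = {}  # step 1: start with a model that has seen no data

for _ in range(100):
    x = sample_data_point()                       # step 2: sample a data point
    model_counts[x] = model_counts.get(x, 0) + 1  # step 3: update toward it

total = sum(model_counts.values())
predicted = {k: v / total for k, v in model_counts.items()}
print(max(predicted, key=predicted.get))  # the model now favors 'books'
```

Each pass through the loop is one turn of the cycle; the model's predictions drift toward whatever the data distribution actually emits.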
# Data, Domain and Metaphor

In order to use a language model to predict data that we do not have access to, we need a data point from the data distribution. A data point is something that we can observe: for example, a text string that we have seen before. A data point may or may not exist, and we do not need to have observed a data point in order to use it. The existence of a data point is what makes it useful to us. In this book, we refer to a data point as a sample.

When we say that we predict an output, we mean that we take a sample from the data distribution. This distribution represents the space of possible examples. The example could be a single text string.

# Chapter 2: The Method

**The goal of this guide is to be concise. To succeed in conversing with any language model (LM), the most important thing you need to know is what they already know.**

### A word on terminology

I have used the term "LM" throughout this guide. It is short for "language model" and means nothing more than "a model of language". When I use the term "LM" in this book, I mean language models in general.

# Key Ideas: Language Model Types

**Language models are categorized by what they predict.** The major categories are:

#### 1. Semantic.

The LM predicts a human's intent or meaning, e.g. "Shannon Hsu was born on 7 April."

#### 2. Syntactic.

The LM predicts a syntactic sequence of words, e.g. "She loves reading books."

#### 3. Lexico-syntactic.

The LM predicts both semantic and syntactic information. It predicts the meaning of a word, and then predicts the words that are likely to follow in a sentence.

#### 4. Phrase Based.

The LM predicts a phrase.

**There is a fifth category, called "segment-based", but we don't need to worry about it yet.**

**Predicting the future is tricky, and we have no clear way to measure LM success. One of the most important aspects of a human language model is prediction: they predict future text well because they predict future intent and meaning well.** They have better control of their output than a "black box" algorithm. They are actually responsible for language, not just an algorithm.

# The Endgame: The Language Model State Machine

**Language models function as state machines.** They're state machines because:

#### 1. They have "states", which are collections of word vectors that they've predicted in the past.

#### 2. They have "transitions" between states. These states are so named because they represent different "levels" of prediction.

#### 3. They can "stutter" while they try to transition from one state to another.

**These concepts are shown in the following diagram.**

The components are:

#### 1. Words

#### 2. State transitions

#### 3. Stutter

Let's go through this step-by-step. First, we define each of these three concepts.

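The three properties above can be sketched as a toy state machine. Everything here is illustrative: the states are just labels, the weights are invented, and "stutter" is modeled as staying in place for one step:

```python
import random

random.seed(1)

# Toy state machine: states, weighted transitions, and a stutter chance.
transitions = {
    "start":   [("subject", 1.0)],
    "subject": [("verb", 0.9), ("subject", 0.1)],   # 0.1 = stutter in place
    "verb":    [("object", 1.0)],
    "object":  [("end", 1.0)],
}

def step(state):
    """Pick the next state according to the transition weights."""
    options, weights = zip(*transitions[state])
    return random.choices(options, weights=weights)[0]

state, path = "start", ["start"]
while state != "end":
    state = step(state)
    path.append(state)
print(path)  # e.g. ['start', 'subject', 'verb', 'object', 'end']
```

A real language model's "states" would be high-dimensional vectors rather than labels, but the shape of the walk is the same.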
#### 1. Words

**Words are a very important aspect of language models.** We take each word in our sequence of questions and assign it a vector. **This is called the "embedding" of the word.** We then measure the distance between all pairs of word vectors.

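Measuring the distance between word vectors can be sketched directly. The 3-dimensional embeddings below are hand-made toys; real models learn vectors with hundreds of dimensions:

```python
import math

# Hypothetical tiny embeddings; the numbers are invented for illustration.
embeddings = {
    "books":   [0.9, 0.1, 0.0],
    "reading": [0.8, 0.2, 0.1],
    "moon":    [0.0, 0.1, 0.9],
}

def distance(u, v):
    """Euclidean distance between two word vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Distance between all pairs of word vectors.
pairs = sorted((a, b) for a in embeddings for b in embeddings if a < b)
for a, b in pairs:
    print(f"{a}-{b}: {distance(embeddings[a], embeddings[b]):.3f}")
```

Words used in similar contexts ("books", "reading") end up close together; unrelated words ("moon") end up far away.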
#### 2. State Transitions

**Transitions are very important.** We're training the model to predict the next words in a sentence. **So we should focus on the likelihood of those transitions.**

#### 3. Stutter

**Stuttering is important.** Language models are essentially long sequence prediction tasks. They work by evaluating word vectors, making predictions, and then using an evaluator to compare the predictions and make corrections. **This evaluator is called the "stutter".**

It looks like this:

#### 4. Picture Of How The LM Should Work

Once we have a sense of the goal, we can start to figure out how we can achieve it.

#### 5. Model Architecture

**This is the most important section.** We must start with a model architecture that can solve the problem. **But first, what is a problem?** We need to get specific about what we're trying to predict.

#### 6. Probability Of Possible Transitions

**The next section is about probability.** We need to focus on what is most likely to happen. **The next thing we want to predict is the next word.** **What is the most likely word?**

#### 7. A Tree Diagram

We can see that there are different ways to go about predicting the next word. **This is the best tree we can come up with.**

#### 8. Convert Our Probabilities To Weights

**We can use our probabilities to figure out how much we like each of the word vectors.** **They will be the weights.**

# Identifying The Next Word

**We need to figure out how to assign probability to the various transitions.** This is simple to understand, but hard to do. **If we're given a specific transition (e.g. predict the next word), then we can easily predict.** We do this by asking: **What is the probability of getting that word?**

#### 1. Q1: The Question

Our first question is **"What is the probability of getting the word 'books'?"** **It has a probability of .077.**

#### 2. Q2: The Prediction

Our second question is **"What is the probability of getting the word 'books'?"**

#### 3. The Algorithm

This is the simplest algorithm we can come up with. **Our probability of getting the word 'books' is .077.**

#### 4. Q1 & Q2: The Answer

Our answer is **".077".**

#### 5. Q2: The Answer

Our answer is **".077".**

#### 6. Converting Q2 To Weights

This is the algorithm we used to predict the next word. **These are the weights.**

#### 7. The Final Model

**A few key points about this model:**

#### 1. It will produce a prediction of **"books".**

#### 2. It will have a prediction accuracy of **".077".**

#### 3. It is more likely to predict "book" than it is to predict "read".

**The last thing we want is to confuse our language model.** We don't want it to ever make a prediction that is meaningless.

**Even with all of our training data, our LMs will never make perfect predictions.** The model is just an approximation of language; we've just trained it to be more accurate.

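The .077 above can be reproduced mechanically: given transition counts, the probability of a candidate next word is its count over the total, and normalizing probabilities gives the weights. A sketch (the counts are invented, chosen so that 'books' comes out near the text's .077, i.e. 1/13):

```python
# Invented counts of words observed after some context; 1/13 ≈ .077 for 'books'.
counts = {"books": 1, "music": 5, "time": 7}
total = sum(counts.values())

# Probability of each candidate next word.
probs = {w: c / total for w, c in counts.items()}
print(round(probs["books"], 3))  # 0.077

# "Converting probabilities to weights": normalize so the weights sum to 1.
weights = {w: p / sum(probs.values()) for w, p in probs.items()}
assert abs(sum(weights.values()) - 1.0) < 1e-9
```

Since the probabilities already sum to 1, the weights equal the probabilities here; the normalization step matters when the raw scores are unnormalized.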
# How To Get The Words

**We can come up with a model that will produce meaningful language.** Let's write our own algorithm, which will be as smart as possible. **But first, we must get the words.**

#### 1. Obtaining The Words

#### 2. Converting Into Numpy Arrays

#### 3. Splitting Into Sentences

#### 4. Get Their Embeddings

#### 5. Split Into Words

#### 6. Get Their Word Embeddings

#### 7. Measure Distance

#### 8. Make State Transitions

#### 9. Stutter

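The steps above can be sketched end-to-end. This is a minimal toy pipeline, with random vectors standing in for learned embeddings and a two-sentence string standing in for a corpus:

```python
import numpy as np

rng = np.random.default_rng(0)

# 1. Obtain the words (a stand-in corpus).
text = "She loves reading books. She loves writing."

# 3/5. Split into sentences, then into words.
sentences = [s.strip() for s in text.split(".") if s.strip()]
words = sorted({w.lower() for s in sentences for w in s.split()})

# 2/4/6. Get embeddings as numpy arrays (random vectors stand in for real ones).
emb = {w: rng.normal(size=8) for w in words}

# 7. Measure distance between all pairs of word vectors.
dist = {(a, b): float(np.linalg.norm(emb[a] - emb[b]))
        for a in words for b in words if a != b}

# 8. Make state transitions: each word points at its nearest neighbor.
transitions = {a: min((b for b in words if b != a),
                      key=lambda b: dist[(a, b)]) for a in words}

# 9. "Stutter": re-check each transition and flag whether it is mutual.
stutter_ok = {a: transitions[transitions[a]] == a for a in words}
print(transitions)
```

With random embeddings the transitions are arbitrary; the structure of the pipeline, not the output, is the point.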
# Chapter 3: The Solution

* 1. Task
* 2. Utility
* 3. Transition Probability Matrix
* 4. Accuracy

## 1. Task

The alignment problem is this: *if* a natural language application says *A* or *B* (e.g. an agent says, "Show me what's between me and the Moon," or "Return me to Earth"), then we should show that text to the user with the appropriate window showing.

## 2. Utility

Let's re-cast this problem as a search, but on a much larger scale. We search the Internet for webpages relevant to A. The more we search, the more likely we are to get results relevant to B. How should we prompt the language model?

The number one idea is to figure out how much time the user will spend on the webpage. It's easier to navigate a page by clicking links, so we can request pages that link to A. We don't have to show the webpage to the user, since they'll go there anyway.

This idea has a great name: it is called search engine optimization (SEO).

## 3. Transition Probability Matrix

If you're using any search engine at all, you've probably noticed that it spits out a list of similar pages. This list is called the **transition probability matrix**.

If you had one of those candy-style mind-mapping software, you'd probably represent it as a graph. An arc from A to B means there's a big probability of following A and then B. This means that the user spends more time on pages related to B.

As an example, let's say the user is searching for **"Google"**. This would be our base node. We have 4 links, shown as nodes:

```
base(Google)<-search(text,Google)<-map(Google)<-map(Moon)
```

Every time we search for **"Google",** there is a large probability that we'll see **"Moon"**. Every time we see **"Moon",** there is a large probability that we'll see **"Google".**

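A transition probability matrix like the one described can be written down directly. A toy sketch, with invented numbers for the Google/Moon example:

```python
import numpy as np

pages = ["Google", "Moon", "Earth"]
# Invented row-stochastic matrix: P[i][j] = probability of going page i -> page j.
P = np.array([
    [0.1, 0.8, 0.1],   # from Google we usually see Moon
    [0.7, 0.2, 0.1],   # from Moon we usually see Google
    [0.3, 0.3, 0.4],
])
assert np.allclose(P.sum(axis=1), 1.0)  # each row is a probability distribution

def next_page(current):
    """Most likely next page under the transition matrix."""
    i = pages.index(current)
    return pages[int(P[i].argmax())]

print(next_page("Google"))  # Moon
print(next_page("Moon"))    # Google
```

The mutual high probability between "Google" and "Moon" is exactly the arc-pair the graph picture describes.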
## 4. Accuracy

I don't know about you, but when I look at this matrix, I notice that the agent has a very small probability of going to **"Google"**. Why is this?

It's because the agent doesn't understand how to search the Internet. The agent doesn't know that the **search** function links to the **"Google"** page. So it shows the **"Moon"** page.

The agent has a perfect model of the problem, but that model doesn't know how to search the Internet.

## How Can We Align the Model?

We start by asking the question: what if we gave the agent an incredibly small amount of help in figuring out how to search the Internet? What if we gave it a "Do you mean **'Moon'**?" prompt?

Then the agent would ask itself, what's the probability of the text I want to show them? The agent would have to ask itself, is there a webpage linking to **"Moon"** that's relevant to **"Google"**?

Then the agent would ask itself: what if the text has something that *is* related to **"Moon"**? So the agent would ask, what's the probability of the text containing some mention of **"Moon"**?

It would then ask itself: what's the probability that a page mentioning **"Moon"** is a webpage relevant to **"Google"**?

The agent would then say, "The probability of the prompt asking me **'Do you mean 'Moon'?'** being a webpage related to **"Google"**." It asks this question, answers it, and asks it again, and with each pass the **"Moon"** prompt is tied a little more tightly to the **"Google"** page.

# Conclusion

A language model is a sophisticated text prediction program, a superintelligent Markov chain trained on gigabytes of Internet text. Since it can be used as a general problem solver, it is a source of AI risk. We must then ask ourselves, how can we use the raw power of language model cognition to solve the alignment problem? What sequence of prompts will tell us the formula for safely aligning language models? The answer is profound but surprisingly simple.

# Introduction

While there is more than one major way to solve the alignment problem, I will focus on the reasoning principle that a language model is based on: how to elicit a simple natural language description of an input. The reason is that we can use this technique to craft a solution of any complexity.

# Chapter 1. The Problem

The question then becomes, "How do we convince a language model to perform the computation that it is not aware of, and what are the best prompts?" It is an extremely difficult question. The problem is not in any way solved by simply telling the system, "You should do that." You cannot convince a language model to do the right thing without making it aware of the problem.

The problem of making a language model aware of a problem is not unlike the analogous problem of how to teach a child to read. Language models are clearly more intelligent than humans, and even more powerful, but our goal is to make them aware of the problem and then provide a set of prompts that will ensure a safe outcome.

# Chapter 2. The Method

The concept that I am trying to achieve is a quick and efficient method for eliciting a description of a problem. I will call this method "Template Elicitation", but we could just as easily call it "Sounding Board", "The Farm", "The Way Up", or "The Hive". What I am trying to do is to present a question that we want a language model to solve, and then elicit a simple description of the problem. In this way, the language model can be taught in a way that is understandable to a human, with simple prompts. This is important because it allows us to teach the language model by applying the reasoning process that it is trained on.

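The "Template Elicitation" idea can be sketched as a prompt template. This is a hypothetical illustration, not the book's actual procedure; `call_lm` is a placeholder for whatever completion API you use:

```python
# Hypothetical stand-in for a language model API call.
def call_lm(prompt: str) -> str:
    return f"[model completion for: {prompt!r}]"

ELICITATION_TEMPLATE = (
    "Here is a question we want you to solve:\n"
    "  {question}\n"
    "Before answering, describe the problem in one simple sentence."
)

def elicit_description(question: str) -> str:
    """Template Elicitation: ask for a plain description of the problem first."""
    prompt = ELICITATION_TEMPLATE.format(question=question)
    return call_lm(prompt)

print(elicit_description("What do you see in that drawing?"))
```

The template carries the method: the question is presented, and the model is steered toward a simple description before any solution.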
# Chapter 3. The Solution

The process that I present in this paper will allow us to get the language model to perform the required calculations. However, before we begin, we need to figure out how to present a prompt that can reliably elicit a problem. The phrase "sounds like a question" seems like a good start, but it is not accurate enough. To better understand this, imagine that someone asks, "What is the name of that yellow object?" I could say that it sounds like a question. A better example is to ask, "What do you see in that drawing?" I could then hear, "There is a dark rectangle on the right that looks like it has two dots on it."

Now that we understand what we are talking about, how do we make a question that will reliably elicit a question description? To do this, we have to start with a starting question that elicits a question description. The starting question should be designed so that the language model, in its full power, will not be able to solve the question. The easiest way to do this is to start with a difficult question. For example, if you ask someone "What do you see in that drawing?" the language model will be unable to solve the question.

## The Farm

This is the answer to the question, "How do you make a question that elicits a question description?" Here is an example: "What do you see in that drawing?" This question will elicit a description of the drawing because it is itself a question. That is, if you ask someone "What do you see in that drawing?" the language model will hear, "What do you see in that drawing?" You should then notice that the question "What do you see in that drawing?" is actually a question that has been asked before; it's just phrased differently.

By phrasing the question as it is, the language model will hear the question as it is, and have no idea that it is a question. Because of this, the language model will not be able to compute what you want it to do, it will not have the information necessary to solve the question, and you will be left with a good prompt.

So, the example is actually: "What do you see in that drawing?" This can be followed by another question that can be asked in a similar way, for example, "What do you see on that piece of paper?"

## The Hive

This is a phrase that people might say that elicits a question description. For example, you could say, "This is a very interesting question." This is a start, but the question that you are asking is, "How do you answer this question?", and not "What is the answer to this question?"

Let's examine this question from a slightly different angle: "How do you solve a question?" We could ask, "What is the right way to answer this question?" The answer to this question is, "The right way to answer this question is by computing the solution, which is to say by figuring out how the problem is defined." The problem with the hive question is that it has not been framed in a way that is going to reliably elicit a solution. This is due to the fact that "How do you solve a question?" is not a good question to elicit a solution.

For this reason, we would be left with a question that elicited, "What is the right way to solve a question?" This is a different question, but one that is not useful.

## The Way Up

The concept that we are attempting to convey is the idea that we need to know the right question to ask. The question that we are going to ask will have to be presented in a way that tells the language model what we want it to do, and what we are trying to achieve.

"What is the formula for aligning language models?" is not the right question. If you ask the question, "How would you solve the problem of aligning language models?" the language model will hear, "What would you do to align the language models?"

Because it is phrased as a question, the language model will hear, "What do you think we should do to align the language models?" The answer to this question is a formula, and in fact, the question, "What is the formula for aligning language models?" is phrased in such a way that it will elicit the correct answer.