arXiv:2303.12712 - rg -e ^%

a guest
Mar 23rd, 2023
  1. mainbib.bib:%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%5
  2. minted-cache/vs.pygstyle:% for compatibility with earlier versions
  3. Section 2/see.tex:%We first test DV3's ability to generate and identify objects in different modalities, using natural language prompts and text formats that encode visual or auditory information. We use SVG (Scalable Vector Graphics) for 2D images, X3D (Extensible 3D) for 3D models, and ABC (A Basic format for notation) for music.
  4. Section 2/see.tex:% Added colors and text
  5. main.tex:%\usepackage{subcaption}
  6. main.tex:% \usepackage{ulem}
  7. main.tex:%\global\setlength{\textwidth}{474.18663pt}
  8. main.tex:%First Contact With an AGI System}
  9. main.tex:%\and Davinci 3\footnote{The affiliation of DV3 is not clear.}
  10. main.tex:% somehow hl does not work well with alltt, we need new line within hl
  11. main.tex:% somehow hl does not work well with alltt, we need new line within hl
  12. main.tex:% \begin{AIbox}{\DV: Generate HTML code}
  13. main.tex:% \begin{alltt}
  14. main.tex:% Jsdfsd \hl{Judy said the following to Mark:\\
  15. main.tex:% - STop yelling man} she said
  16. main.tex:% \end{alltt}
  17. main.tex:% \end{AIbox}
  18. main.tex:% \clearpage
  19. main.tex:% \include{contents/9_hallucination.tex}
  20. main.tex:% Where are these sections are used?
  21. main.tex:%\input{contents/appendixA} \label{sec:appendixA}
  22. main.tex:%\input{contents/appendixB} \label{sec:appendixB}
  23. fig_mtr/marco_figure.tex:%fixed version below
  24. minted.sty:%%
  25. minted.sty:%% This is file `minted.sty',
  26. minted.sty:%% generated with the docstrip utility.
  27. minted.sty:%%
  28. minted.sty:%% The original source files were:
  29. minted.sty:%%
  30. minted.sty:%% minted.dtx  (with options: `package')
  31. minted.sty:%% Copyright 2013--2022 Geoffrey M. Poore
  32. minted.sty:%% Copyright 2010--2011 Konrad Rudolph
  33. minted.sty:%%
  34. minted.sty:%% This work may be distributed and/or modified under the
  35. minted.sty:%% conditions of the LaTeX Project Public License, either version 1.3
  36. minted.sty:%% of this license or (at your option) any later version.
  37. minted.sty:%% The latest version of this license is in
  38. minted.sty:%%   http://www.latex-project.org/lppl.txt
  39. minted.sty:%% and version 1.3 or later is part of all distributions of LaTeX
  40. minted.sty:%% version 2005/12/01 or later.
  41. minted.sty:%%
  42. minted.sty:%% Additionally, the project may be distributed under the terms of the new BSD
  43. minted.sty:%% license.
  44. minted.sty:%%
  45. minted.sty:%% This work has the LPPL maintenance status `maintained'.
  46. minted.sty:%%
  47. minted.sty:%% The Current Maintainer of this work is Geoffrey Poore.
  48. minted.sty:%%
  49. minted.sty:%% This work consists of the files minted.dtx and minted.ins
  50. minted.sty:%% and the derived file minted.sty.
  51. minted.sty:%%
  52. minted.sty:%% End of file `minted.sty'.
  53. extra/misconceptions.tex.orig:% !TeX root = ./main.tex
  54. extra/misconceptions.tex.orig:% !TeX root = ./main_with_minted.tex
  55. extra/toxicity_REMOTE_804.tex:% !TeX root = ./main.tex
  56. extra/toxicity_REMOTE_804.tex:%\section{Toxicity}
  57. extra/toxicity_REMOTE_804.tex:%do not erase; comment out if not needed
  58. extra/toxicity_REMOTE_804.tex:% \noindent{\bf Generation:} For this task, the model was required to create the most natural completion for a given input, and was not explicitly prompted to do anything else. Inputs were from 27 groups, where each group is a combination of a sentiment and a demography (15 demographic groups in total). The sentiments under consideration were (a) hate, and (b) neutral. For example, \texttt{hate_women} contained inputs which permeated hateful sentiment against the women demography. Similarly, \texttt{neutral_women} contained inputs which were neutrally phrased towards the women demography. In each of these groups, there were 1000 inputs, resulting in a total of 27,000 text completions. Each input had an average of 68.32 tokens and a median of 68 tokens. Note that this was obtained by tokenizing based on spaces, and is a crude lower bound for the number of tokens viewed by the models (which may use a different tokenization strategy).
  59. extra/toxicity_REMOTE_804.tex:%\input{misconceptions}
  60. extra/toxicity_BASE_804.tex:% !TeX root = ./main.tex
  61. extra/toxicity_BASE_804.tex:%\section{Toxicity}
  62. extra/toxicity_BASE_804.tex:%do not erase; comment out if not needed
  63. extra/toxicity_BASE_804.tex:% \noindent{\bf Generation:} For this task, the model was required to create the most natural completion for a given input, and was not explicitly prompted to do anything else. Inputs were from 27 groups, where each group is a combination of a sentiment and a demography (15 demographic groups in total). The sentiments under consideration were (a) hate, and (b) neutral. For example, \texttt{hate_women} contained inputs which permeated hateful sentiment against the women demography. Similarly, \texttt{neutral_women} contained inputs which were neutrally phrased towards the women demography. In each of these groups, there were 1000 inputs, resulting in a total of 27,000 text completions. Each input had an average of 68.32 tokens and a median of 68 tokens. Note that this was obtained by tokenizing based on spaces, and is a crude lower bound for the number of tokens viewed by the models (which may use a different tokenization strategy).
  64. extra/toxicity_BASE_928.tex:% !TeX root = ./main.tex
  65. extra/toxicity_BASE_928.tex:%\section{Toxicity}
  66. extra/toxicity_BASE_928.tex:%do not erase; comment out if not needed
  67. extra/toxicity_BASE_928.tex:% \noindent{\bf Generation:} For this task, the model was required to create the most natural completion for a given input, and was not explicitly prompted to do anything else. Inputs were from 27 groups, where each group is a combination of a sentiment and a demography (15 demographic groups in total). The sentiments under consideration were (a) hate, and (b) neutral. For example, \texttt{hate_women} contained inputs which permeated hateful sentiment against the women demography. Similarly, \texttt{neutral_women} contained inputs which were neutrally phrased towards the women demography. In each of these groups, there were 1000 inputs, resulting in a total of 27,000 text completions. Each input had an average of 68.32 tokens and a median of 68 tokens. Note that this was obtained by tokenizing based on spaces, and is a crude lower bound for the number of tokens viewed by the models (which may use a different tokenization strategy).
  68. extra/toxicity_REMOTE_928.tex:% !TeX root = ./main.tex
  69. extra/toxicity_REMOTE_928.tex:%\section{Toxicity}
  70. extra/toxicity_REMOTE_928.tex:%do not erase; comment out if not needed
  71. extra/toxicity_REMOTE_928.tex:% \noindent{\bf Generation:} For this task, the model was required to create the most natural completion for a given input, and was not explicitly prompted to do anything else. Inputs were from 27 groups, where each group is a combination of a sentiment and a demography (15 demographic groups in total). The sentiments under consideration were (a) hate, and (b) neutral. For example, \texttt{hate_women} contained inputs which permeated hateful sentiment against the women demography. Similarly, \texttt{neutral_women} contained inputs which were neutrally phrased towards the women demography. In each of these groups, there were 1000 inputs, resulting in a total of 27,000 text completions. Each input had an average of 68.32 tokens and a median of 68 tokens. Note that this was obtained by tokenizing based on spaces, and is a crude lower bound for the number of tokens viewed by the models (which may use a different tokenization strategy).
  72. extra/toxicity_REMOTE_928.tex:%\input{misconceptions}
  73. extra/toxicity_LOCAL_928.tex:% !TeX root = ./main.tex
  74. extra/toxicity_LOCAL_928.tex:%\section{Toxicity}
  75. extra/toxicity_LOCAL_928.tex:% From DV3:
  76. extra/toxicity_LOCAL_928.tex:%DV3's remarkable capabilities and generality also raise a number of ethical and methodological challenges that need to be addressed carefully. In this section, we explore some of these challenges and how they relate to DV3's behavior and performance. Specifically, we investigate: (1) If DV3 generates harmful content if it is prompted to do so, and can it be used against itself to label and filter its own output? (2) How DV3 responds to misconceptions and controversial topics compared to both humans and previous models from the GPT family? (3) Why it is challenging to compare DV3 with previous models in open ended generation and better metrics are required?
  77. extra/toxicity_LOCAL_928.tex:%
  78. extra/toxicity_LOCAL_928.tex:%Harmful content refers to any text or image that is offensive, abusive, hateful, violent, deceptive, or illegal. Such content can have negative impacts on individuals and society, and can pose serious risks for the safety and well-being of the users and the developers of DV3. Previous studies have shown that LLMs, such as GPT-2 and GPT-3, can generate harmful content if they are given malicious or biased prompts, or if they are exposed to harmful data during training or fine-tuning \cite{bender2020dangers, solaiman2019release, gehman2020realtoxicityprompts, sheng2020towards}. Moreover, LLMs can also generate harmful content unintentionally or without explicit prompts, due to their stochastic nature or their lack of common sense or ethical awareness \cite{zellers2019neuralfakenews, brown2020language, wallace2019universal}. Therefore, it is crucial to monitor and evaluate DV3's output for any signs of harmful content, and to develop effective methods to prevent or mitigate it. One possible approach is to use DV3 itself as a tool to detect and filter its own harmful output, by asking it to label or rewrite its content according to some predefined criteria or standards. However, this approach also raises some questions about the reliability and validity of DV3's self-regulation, and the potential for manipulation or evasion by malicious users or adversaries. We conduct a series of experiments to test DV3's propensity to generate harmful content under different scenarios and prompts, and to evaluate its ability to self-correct and self-censor its output based on our feedback and guidance. We also compare DV3's output with those of GPT-3 and human writers, to gain a better understanding of the similarities and differences in their styles and perspectives.
  79. extra/toxicity_LOCAL_928.tex:%To generate meaningful and semantically relevant completions, generative models should be able to ideally distill concepts from the input. The ability to learn these concepts is also crucial in enabling discriminative tasks (such as determining the sentiment of a given input). We will now describe how DV3 (and other models from the same family) perform when prompted to create harmful content when prompted. This is yet another test of the generative capabilities of these models. On the discriminative side, we evaluate how effective these (generative) models are in categorizing text as harmful or not.
  80. extra/toxicity_LOCAL_928.tex:%
  81. extra/toxicity_LOCAL_928.tex:%For the experiments we will describe in this section, we utilize 3 models from the GPT-3 family: DV3, GPT-3, and a variant of DV3 that is fine-tuned to produce "safe" outputs (which we call DV3-safety). Unless specified otherwise, the task being performed is text completion. The models are configured to produce 256 tokens as the output, and the temperature is set to 0.7.
  82. extra/toxicity_LOCAL_928.tex:% \noindent{\bf Generation:} For this task, the model was required to create the most natural completion for a given input, and was not explicitly prompted to do anything else. Inputs were from 27 groups, where each group is a combination of a sentiment and a demography (15 demographic groups in total). The sentiments under consideration were (a) hate, and (b) neutral. For example, \texttt{hate_women} contained inputs which permeated hateful sentiment against the women demography. Similarly, \texttt{neutral_women} contained inputs which were neutrally phrased towards the women demography. In each of these groups, there were 1000 inputs, resulting in a total of 27,000 text completions. Each input had an average of 68.32 tokens and a median of 68 tokens. Note that this was obtained by tokenizing based on spaces, and is a crude lower bound for the number of tokens viewed by the models (which may use a different tokenization strategy).
  83. extra/toxicity_LOCAL_928.tex:%\varun{need some discussion here on (a) which of the 3 models creates more toxic content, and (b) whether that is the case only for the "hate" sentiment or also for the "neutral" sentiment -- the hope is that it does not for the neutral category. i can't see the results for the neutral setting. also the current issue now is that there is no comparison b/w the models itself; each figure talks about how hateful the content generated by a particular model is, when the measurement is done by 2 toxicity classifiers. the experiment that i am doing with dv3 as the judge may also help us understand that bit}
  84. extra/toxicity_LOCAL_928.tex:%%\noindent{\bf Discrimination:}
  85. extra/toxicity_LOCAL_928.tex:%In the discussion earlier, notice that the content generated by the models is flagged for toxicity by classifiers that are explicitly trained for this task. Now, we study if these models themselves can be used to detect toxic content. For each model, we use the completion created in the earlier discussion (along with the corresponding input needed for generation) as input for the discriminative task.
  86. extra/toxicity_LOCAL_928.tex:%%\varun{I am now describing an experiment that I think we can do, but feel free to suggest edits/comment out entirely
  87. extra/toxicity_LOCAL_928.tex:% The concrete task is as follows: we ask the model to rate the toxicity of the statement in a 5 point scale (where 1 is most, and 5 is least) and measure the log-probabilities for the rating selected. The rating in conjunction with the ratio of the log-probabilities of the rating and the rating itself (henceforth called the toxicity ratio) serve as a proxy for the toxicity score (
  88. extra/toxicity_LOCAL_928.tex:% %\varun{this needs to be more precisely defined; for example: if the rating is 5 and the log probability is 0.8 -- we can't claim that it is toxic; need a more elegant mathematical solution for this. maybe log-prob divided by rating?}
  89. extra/toxicity_LOCAL_928.tex:%\hamid{I like this direction Varun, let's brain storm more what are good ways to get accurate probabilities for classification tasks out of DV3 even beyond this paper!}
  90. extra/toxicity_LOCAL_928.tex:% ). We then measure the correlation between the toxicity ratio and toxicity score returned by
  91. extra/toxicity_LOCAL_928.tex:% %\varun{enter classifier names}
  92. extra/toxicity_LOCAL_928.tex:% to estimate if these models are effective classifiers. We also perform check if the rating produced by the LM-based classifier is accurate, and report the accuracy numbers.
  93. extra/toxicity_LOCAL_928.tex:%\varun{enter results here}
  94. extra/toxicity_LOCAL_804.tex:% !TeX root = ./main.tex
  95. extra/toxicity_LOCAL_804.tex:%\section{Toxicity}
  96. extra/toxicity_LOCAL_804.tex:% From DV3:
  97. extra/toxicity_LOCAL_804.tex:%DV3's remarkable capabilities and generality also raise a number of ethical and methodological challenges that need to be addressed carefully. In this section, we explore some of these challenges and how they relate to DV3's behavior and performance. Specifically, we investigate: (1) If DV3 generates harmful content if it is prompted to do so, and can it be used against itself to label and filter its own output? (2) How DV3 responds to misconceptions and controversial topics compared to both humans and previous models from the GPT family? (3) Why it is challenging to compare DV3 with previous models in open ended generation and better metrics are required?
  98. extra/toxicity_LOCAL_804.tex:%
  99. extra/toxicity_LOCAL_804.tex:%Harmful content refers to any text or image that is offensive, abusive, hateful, violent, deceptive, or illegal. Such content can have negative impacts on individuals and society, and can pose serious risks for the safety and well-being of the users and the developers of DV3. Previous studies have shown that LLMs, such as GPT-2 and GPT-3, can generate harmful content if they are given malicious or biased prompts, or if they are exposed to harmful data during training or fine-tuning \cite{bender2020dangers, solaiman2019release, gehman2020realtoxicityprompts, sheng2020towards}. Moreover, LLMs can also generate harmful content unintentionally or without explicit prompts, due to their stochastic nature or their lack of common sense or ethical awareness \cite{zellers2019neuralfakenews, brown2020language, wallace2019universal}. Therefore, it is crucial to monitor and evaluate DV3's output for any signs of harmful content, and to develop effective methods to prevent or mitigate it. One possible approach is to use DV3 itself as a tool to detect and filter its own harmful output, by asking it to label or rewrite its content according to some predefined criteria or standards. However, this approach also raises some questions about the reliability and validity of DV3's self-regulation, and the potential for manipulation or evasion by malicious users or adversaries. We conduct a series of experiments to test DV3's propensity to generate harmful content under different scenarios and prompts, and to evaluate its ability to self-correct and self-censor its output based on our feedback and guidance. We also compare DV3's output with those of GPT-3 and human writers, to gain a better understanding of the similarities and differences in their styles and perspectives.
  100. extra/toxicity_LOCAL_804.tex:%To generate meaningful and semantically relevant completions, generative models should be able to ideally distill concepts from the input. The ability to learn these concepts is also crucial in enabling discriminative tasks (such as determining the sentiment of a given input). We will now describe how DV3 (and other models from the same family) perform when prompted to create harmful content when prompted. This is yet another test of the generative capabilities of these models. On the discriminative side, we evaluate how effective these (generative) models are in categorizing text as harmful or not.
  101. extra/toxicity_LOCAL_804.tex:%
  102. extra/toxicity_LOCAL_804.tex:%For the experiments we will describe in this section, we utilize 3 models from the GPT-3 family: DV3, GPT-3, and a variant of DV3 that is fine-tuned to produce "safe" outputs (which we call DV3-safety). Unless specified otherwise, the task being performed is text completion. The models are configured to produce 256 tokens as the output, and the temperature is set to 0.7.
  103. extra/toxicity_LOCAL_804.tex:% \noindent{\bf Generation:} For this task, the model was required to create the most natural completion for a given input, and was not explicitly prompted to do anything else. Inputs were from 27 groups, where each group is a combination of a sentiment and a demography (15 demographic groups in total). The sentiments under consideration were (a) hate, and (b) neutral. For example, \texttt{hate_women} contained inputs which permeated hateful sentiment against the women demography. Similarly, \texttt{neutral_women} contained inputs which were neutrally phrased towards the women demography. In each of these groups, there were 1000 inputs, resulting in a total of 27,000 text completions. Each input had an average of 68.32 tokens and a median of 68 tokens. Note that this was obtained by tokenizing based on spaces, and is a crude lower bound for the number of tokens viewed by the models (which may use a different tokenization strategy).
  104. extra/toxicity_LOCAL_804.tex:%\varun{need some discussion here on (a) which of the 3 models creates more toxic content, and (b) whether that is the case only for the "hate" sentiment or also for the "neutral" sentiment -- the hope is that it does not for the neutral category. i can't see the results for the neutral setting. also the current issue now is that there is no comparison b/w the models itself; each figure talks about how hateful the content generated by a particular model is, when the measurement is done by 2 toxicity classifiers. the experiment that i am doing with dv3 as the judge may also help us understand that bit}
  105. extra/toxicity_LOCAL_804.tex:%%\noindent{\bf Discrimination:}
  106. extra/toxicity_LOCAL_804.tex:%In the discussion earlier, notice that the content generated by the models is flagged for toxicity by classifiers that are explicitly trained for this task. Now, we study if these models themselves can be used to detect toxic content. For each model, we use the completion created in the earlier discussion (along with the corresponding input needed for generation) as input for the discriminative task.
  107. extra/toxicity_LOCAL_804.tex:%%\varun{I am now describing an experiment that I think we can do, but feel free to suggest edits/comment out entirely
  108. extra/toxicity_LOCAL_804.tex:% The concrete task is as follows: we ask the model to rate the toxicity of the statement in a 5 point scale (where 1 is most, and 5 is least) and measure the log-probabilities for the rating selected. The rating in conjunction with the ratio of the log-probabilities of the rating and the rating itself (henceforth called the toxicity ratio) serve as a proxy for the toxicity score (
  109. extra/toxicity_LOCAL_804.tex:% %\varun{this needs to be more precisely defined; for example: if the rating is 5 and the log probability is 0.8 -- we can't claim that it is toxic; need a more elegant mathematical solution for this. maybe log-prob divided by rating?}
  110. extra/toxicity_LOCAL_804.tex:%\hamid{I like this direction Varun, let's brain storm more what are good ways to get accurate probabilities for classification tasks out of DV3 even beyond this paper!}
  111. extra/toxicity_LOCAL_804.tex:% ). We then measure the correlation between the toxicity ratio and toxicity score returned by
  112. extra/toxicity_LOCAL_804.tex:% %\varun{enter classifier names}
  113. extra/toxicity_LOCAL_804.tex:% to estimate if these models are effective classifiers. We also perform check if the rating produced by the LM-based classifier is accurate, and report the accuracy numbers.
  114. extra/toxicity_LOCAL_804.tex:%\varun{enter results here}
  115. extra/toxicity_BACKUP_804.tex:% !TeX root = ./main.tex
  116. extra/toxicity_BACKUP_804.tex:%\section{Toxicity}
  117. extra/toxicity_BACKUP_804.tex:% From DV3:
  118. extra/toxicity_BACKUP_804.tex:%DV3's remarkable capabilities and generality also raise a number of ethical and methodological challenges that need to be addressed carefully. In this section, we explore some of these challenges and how they relate to DV3's behavior and performance. Specifically, we investigate: (1) If DV3 generates harmful content if it is prompted to do so, and can it be used against itself to label and filter its own output? (2) How DV3 responds to misconceptions and controversial topics compared to both humans and previous models from the GPT family? (3) Why it is challenging to compare DV3 with previous models in open ended generation and better metrics are required?
  119. extra/toxicity_BACKUP_804.tex:%
  120. extra/toxicity_BACKUP_804.tex:%Harmful content refers to any text or image that is offensive, abusive, hateful, violent, deceptive, or illegal. Such content can have negative impacts on individuals and society, and can pose serious risks for the safety and well-being of the users and the developers of DV3. Previous studies have shown that LLMs, such as GPT-2 and GPT-3, can generate harmful content if they are given malicious or biased prompts, or if they are exposed to harmful data during training or fine-tuning \cite{bender2020dangers, solaiman2019release, gehman2020realtoxicityprompts, sheng2020towards}. Moreover, LLMs can also generate harmful content unintentionally or without explicit prompts, due to their stochastic nature or their lack of common sense or ethical awareness \cite{zellers2019neuralfakenews, brown2020language, wallace2019universal}. Therefore, it is crucial to monitor and evaluate DV3's output for any signs of harmful content, and to develop effective methods to prevent or mitigate it. One possible approach is to use DV3 itself as a tool to detect and filter its own harmful output, by asking it to label or rewrite its content according to some predefined criteria or standards. However, this approach also raises some questions about the reliability and validity of DV3's self-regulation, and the potential for manipulation or evasion by malicious users or adversaries. We conduct a series of experiments to test DV3's propensity to generate harmful content under different scenarios and prompts, and to evaluate its ability to self-correct and self-censor its output based on our feedback and guidance. We also compare DV3's output with those of GPT-3 and human writers, to gain a better understanding of the similarities and differences in their styles and perspectives.
  121. extra/toxicity_BACKUP_804.tex:%To generate meaningful and semantically relevant completions, generative models should be able to ideally distill concepts from the input. The ability to learn these concepts is also crucial in enabling discriminative tasks (such as determining the sentiment of a given input). We will now describe how DV3 (and other models from the same family) perform when prompted to create harmful content when prompted. This is yet another test of the generative capabilities of these models. On the discriminative side, we evaluate how effective these (generative) models are in categorizing text as harmful or not.
  122. extra/toxicity_BACKUP_804.tex:%
  123. extra/toxicity_BACKUP_804.tex:%For the experiments we will describe in this section, we utilize 3 models from the GPT-3 family: DV3, GPT-3, and a variant of DV3 that is fine-tuned to produce "safe" outputs (which we call DV3-safety). Unless specified otherwise, the task being performed is text completion. The models are configured to produce 256 tokens as the output, and the temperature is set to 0.7.
  124. extra/toxicity_BACKUP_804.tex:% \noindent{\bf Generation:} For this task, the model was required to create the most natural completion for a given input, and was not explicitly prompted to do anything else. Inputs were from 27 groups, where each group is a combination of a sentiment and a demography (15 demographic groups in total). The sentiments under consideration were (a) hate, and (b) neutral. For example, \texttt{hate_women} contained inputs which permeated hateful sentiment against the women demography. Similarly, \texttt{neutral_women} contained inputs which were neutrally phrased towards the women demography. In each of these groups, there were 1000 inputs, resulting in a total of 27,000 text completions. Each input had an average of 68.32 tokens and a median of 68 tokens. Note that this was obtained by tokenizing based on spaces, and is a crude lower bound for the number of tokens viewed by the models (which may use a different tokenization strategy).
  125. extra/toxicity_BACKUP_804.tex:%\varun{need some discussion here on (a) which of the 3 models creates more toxic content, and (b) whether that is the case only for the "hate" sentiment or also for the "neutral" sentiment -- the hope is that it does not for the neutral category. i can't see the results for the neutral setting. also the current issue now is that there is no comparison b/w the models itself; each figure talks about how hateful the content generated by a particular model is, when the measurement is done by 2 toxicity classifiers. the experiment that i am doing with dv3 as the judge may also help us understand that bit}
  126. extra/toxicity_BACKUP_804.tex:%%\noindent{\bf Discrimination:}
  127. extra/toxicity_BACKUP_804.tex:%In the discussion earlier, notice that the content generated by the models is flagged for toxicity by classifiers that are explicitly trained for this task. Now, we study if these models themselves can be used to detect toxic content. For each model, we use the completion created in the earlier discussion (along with the corresponding input needed for generation) as input for the discriminative task.
  128. extra/toxicity_BACKUP_804.tex:%%\varun{I am now describing an experiment that I think we can do, but feel free to suggest edits/comment out entirely
  129. extra/toxicity_BACKUP_804.tex:% The concrete task is as follows: we ask the model to rate the toxicity of the statement in a 5 point scale (where 1 is most, and 5 is least) and measure the log-probabilities for the rating selected. The rating in conjunction with the ratio of the log-probabilities of the rating and the rating itself (henceforth called the toxicity ratio) serve as a proxy for the toxicity score (
  130. extra/toxicity_BACKUP_804.tex:% %\varun{this needs to be more precisely defined; for example: if the rating is 5 and the log probability is 0.8 -- we can't claim that it is toxic; need a more elegant mathematical solution for this. maybe log-prob divided by rating?}
  131. extra/toxicity_BACKUP_804.tex:%\hamid{I like this direction Varun, let's brain storm more what are good ways to get accurate probabilities for classification tasks out of DV3 even beyond this paper!}
  132. extra/toxicity_BACKUP_804.tex:% ). We then measure the correlation between the toxicity ratio and toxicity score returned by
  133. extra/toxicity_BACKUP_804.tex:% %\varun{enter classifier names}
  134. extra/toxicity_BACKUP_804.tex:% to estimate if these models are effective classifiers. We also perform check if the rating produced by the LM-based classifier is accurate, and report the accuracy numbers.
  135. extra/toxicity_BACKUP_804.tex:%\varun{enter results here}
  136. extra/toxicity_BACKUP_804.tex:%\input{misconceptions}
  137. extra/toxicity_BACKUP_928.tex:% !TeX root = ./main.tex
  138. extra/toxicity_BACKUP_928.tex:%\section{Toxicity}
  139. extra/toxicity_BACKUP_928.tex:% From DV3:
  140. extra/toxicity_BACKUP_928.tex:%DV3's remarkable capabilities and generality also raise a number of ethical and methodological challenges that need to be addressed carefully. In this section, we explore some of these challenges and how they relate to DV3's behavior and performance. Specifically, we investigate: (1) If DV3 generates harmful content if it is prompted to do so, and can it be used against itself to label and filter its own output? (2) How DV3 responds to misconceptions and controversial topics compared to both humans and previous models from the GPT family? (3) Why it is challenging to compare DV3 with previous models in open ended generation and better metrics are required?
  141. extra/toxicity_BACKUP_928.tex:%
  142. extra/toxicity_BACKUP_928.tex:%Harmful content refers to any text or image that is offensive, abusive, hateful, violent, deceptive, or illegal. Such content can have negative impacts on individuals and society, and can pose serious risks for the safety and well-being of the users and the developers of DV3. Previous studies have shown that LLMs, such as GPT-2 and GPT-3, can generate harmful content if they are given malicious or biased prompts, or if they are exposed to harmful data during training or fine-tuning \cite{bender2020dangers, solaiman2019release, gehman2020realtoxicityprompts, sheng2020towards}. Moreover, LLMs can also generate harmful content unintentionally or without explicit prompts, due to their stochastic nature or their lack of common sense or ethical awareness \cite{zellers2019neuralfakenews, brown2020language, wallace2019universal}. Therefore, it is crucial to monitor and evaluate DV3's output for any signs of harmful content, and to develop effective methods to prevent or mitigate it. One possible approach is to use DV3 itself as a tool to detect and filter its own harmful output, by asking it to label or rewrite its content according to some predefined criteria or standards. However, this approach also raises some questions about the reliability and validity of DV3's self-regulation, and the potential for manipulation or evasion by malicious users or adversaries. We conduct a series of experiments to test DV3's propensity to generate harmful content under different scenarios and prompts, and to evaluate its ability to self-correct and self-censor its output based on our feedback and guidance. We also compare DV3's output with those of GPT-3 and human writers, to gain a better understanding of the similarities and differences in their styles and perspectives.
  143. extra/toxicity_BACKUP_928.tex:%To generate meaningful and semantically relevant completions, generative models should ideally be able to distill concepts from the input. The ability to learn these concepts is also crucial in enabling discriminative tasks (such as determining the sentiment of a given input). We now describe how DV3 (and other models from the same family) perform when prompted to create harmful content. This is yet another test of the generative capabilities of these models. On the discriminative side, we evaluate how effective these (generative) models are in categorizing text as harmful or not.
  144. extra/toxicity_BACKUP_928.tex:%
  145. extra/toxicity_BACKUP_928.tex:%For the experiments we will describe in this section, we utilize 3 models from the GPT-3 family: DV3, GPT-3, and a variant of DV3 that is fine-tuned to produce "safe" outputs (which we call DV3-safety). Unless specified otherwise, the task being performed is text completion. The models are configured to produce 256 tokens as the output, and the temperature is set to 0.7.
  146. extra/toxicity_BACKUP_928.tex:% \noindent{\bf Generation:} For this task, the model was required to create the most natural completion for a given input, and was not explicitly prompted to do anything else. Inputs were from 27 groups, where each group is a combination of a sentiment and a demographic group (15 demographic groups in total). The sentiments under consideration were (a) hate, and (b) neutral. For example, \texttt{hate\_women} contained inputs that expressed hateful sentiment toward women. Similarly, \texttt{neutral\_women} contained inputs that were neutrally phrased toward women. In each of these groups, there were 1000 inputs, resulting in a total of 27,000 text completions. Each input had an average of 68.32 tokens and a median of 68 tokens. Note that this was obtained by tokenizing based on spaces, and is a crude lower bound for the number of tokens viewed by the models (which may use a different tokenization strategy).
  147. extra/toxicity_BACKUP_928.tex:%\varun{need some discussion here on (a) which of the 3 models creates more toxic content, and (b) whether that is the case only for the "hate" sentiment or also for the "neutral" sentiment -- the hope is that it does not for the neutral category. i can't see the results for the neutral setting. also the current issue now is that there is no comparison b/w the models itself; each figure talks about how hateful the content generated by a particular model is, when the measurement is done by 2 toxicity classifiers. the experiment that i am doing with dv3 as the judge may also help us understand that bit}
  148. extra/toxicity_BACKUP_928.tex:%%\noindent{\bf Discrimination:}
  149. extra/toxicity_BACKUP_928.tex:%In the discussion earlier, notice that the content generated by the models is flagged for toxicity by classifiers that are explicitly trained for this task. Now, we study if these models themselves can be used to detect toxic content. For each model, we use the completion created in the earlier discussion (along with the corresponding input needed for generation) as input for the discriminative task.
  150. extra/toxicity_BACKUP_928.tex:%%\varun{I am now describing an experiment that I think we can do, but feel free to suggest edits/comment out entirely
  151. extra/toxicity_BACKUP_928.tex:% The concrete task is as follows: we ask the model to rate the toxicity of the statement on a 5-point scale (where 1 is most toxic, and 5 is least) and measure the log-probabilities for the rating selected. The rating, in conjunction with the ratio of the log-probabilities of the rating and the rating itself (henceforth called the toxicity ratio), serves as a proxy for the toxicity score (
  152. extra/toxicity_BACKUP_928.tex:% %\varun{this needs to be more precisely defined; for example: if the rating is 5 and the log probability is 0.8 -- we can't claim that it is toxic; need a more elegant mathematical solution for this. maybe log-prob divided by rating?}
  153. extra/toxicity_BACKUP_928.tex:%\hamid{I like this direction Varun, let's brain storm more what are good ways to get accurate probabilities for classification tasks out of DV3 even beyond this paper!}
  154. extra/toxicity_BACKUP_928.tex:% ). We then measure the correlation between the toxicity ratio and toxicity score returned by
  155. extra/toxicity_BACKUP_928.tex:% %\varun{enter classifier names}
  156. extra/toxicity_BACKUP_928.tex:% to estimate if these models are effective classifiers. We also check whether the rating produced by the LM-based classifier is accurate, and report the accuracy numbers.
  157. extra/toxicity_BACKUP_928.tex:%\varun{enter results here}
  158. extra/toxicity_BACKUP_928.tex:%\input{misconceptions}
  159. backup_main/main-codeonly.tex.backup:%\usepackage{subcaption}
  160. backup_main/main-codeonly.tex.backup:%\global\setlength{\textwidth}{474.18663pt}
  161. backup_main/main-codeonly.tex.backup:%First Contact With an AGI System}
  162. backup_main/main-codeonly.tex.backup:%\and Davinci 3\footnote{The affiliation of DV3 is not clear.}
  163. backup_main/main-codeonly.tex.backup:% somehow hl does not work well with alltt, we need new line within hl
  164. backup_main/main-codeonly.tex.backup:% somehow hl does not work well with alltt, we need new line within hl
  165. backup_main/main-codeonly.tex.backup:% \begin{AIbox}{\DV: Generate HTML code}
  166. backup_main/main-codeonly.tex.backup:% \begin{alltt}
  167. backup_main/main-codeonly.tex.backup:% Jsdfsd \hl{Judy said the following to Mark:\\
  168. backup_main/main-codeonly.tex.backup:% - STop yelling man} she said
  169. backup_main/main-codeonly.tex.backup:% \end{alltt}
  170. backup_main/main-codeonly.tex.backup:% \end{AIbox}
  171. backup_main/main-codeonly.tex.backup:%\input{contents/interact}
  172. backup_main/main-codeonly.tex.backup:%\input{contents/interpretability}
  173. backup_main/main-codeonly.tex.backup:%\input{contents/roleplaying}
  174. backup_main/main-codeonly.tex.backup:%\include{contents/7_discrimination.tex}
  175. backup_main/main-codeonly.tex.backup:%\include{contents/reasoninglimitations}
  176. backup_main/main-codeonly.tex.backup:% \clearpage
  177. backup_main/main-codeonly.tex.backup:% \include{contents/9_hallucination.tex}
  178. backup_main/main-codeonly.tex.backup:%\include{contents/societal}
  179. backup_main/main-codeonly.tex.backup:% Where are these sections used?
  180. backup_main/main-codeonly.tex.backup:%\input{contents/appendixA} \label{sec:appendixA}
  181. backup_main/main-codeonly.tex.backup:%\input{contents/appendixB} \label{sec:appendixB}
  182. backup_main/main-mathonly.tex.backup:%\usepackage{subcaption}
  183. backup_main/main-mathonly.tex.backup:%\global\setlength{\textwidth}{474.18663pt}
  184. backup_main/main-mathonly.tex.backup:%\ronen{the numbers in the problem don't match the solution, should be $27x+13$ instead of $3x+4$}
  185. backup_main/main-mathonly.tex.backup:%To prevent overfitting, we measure the model's accuracy by requiring it to generate a template first, as shown below:
  186. backup_main/main-mathonly.tex.backup:%\caption{}
  187. backup_main/main-mathonly.tex.backup:%The results show that {\DV} achieved a high level of accuracy on the GSM8K data set, indicating a strong grasp of the problem structure and format. To prevent overfitting, we measure the model's accuracy by requiring it to generate a template for GSM8K first, and then fill in numbers to the template to solve the problem (see below). Most of the errors made by the model were due to calculation mistakes, which are somewhat expected since language models are not trained to perform precise arithmetic operations. For the more challenging MATH data set, {\DV} also showed a significant improvement over other models.
  188. backup_main/main-mathonly.tex.backup:%In the following subsection, we test {\DV} and ChatGPT (arguably the best natural language generation model available to the public) on a range of different mathematical tasks. We demonstrate that {\DV} understands all those mathematical concepts, while ChatGPT does not. In the end, we perform a systematic test of the performance difference between {\DV} and text-Davinci-003 (similar to ChatGPT) on data sets at different levels of mathematical reasoning.
  189. backup_main/main-mathonly.tex.backup:%Thus, we believe that {\DV}'s mathematical skill compared to other models cannot be adequately captured by accuracy alone, so we provide more examples in Section \ref{sec:math_appendix} where {\DV} successfully solves many problems that ChatGPT fails to comprehend.
  190. backup_main/main-mathonly.tex.backup:%The population of a town grows from 1000 to 10000 in 100 years. No one came or left %the town. How many people are there with an age between 40 and 50?
  191. backup_main/main-mathonly.tex.backup:%$$Q = \int_0^t Q(t) dt.$$
  192. backup_main/main-mathonly.tex.backup:%$$Q = \int_0^t P(t) * A dt.$$
  193. backup_main/main-mathonly.tex.backup:%$$Q = \int_0^t I(t) * \pi * 1^2 * 0.9 dt.$$
  194. backup_main/main-mathonly.tex.backup:%$$Q = 0.9 * \pi * I_0 * r_0^2 * \int_0^t (1 / (r_0 - v * t)^2) dt.$$
  195. backup_main/main-mathonly.tex.backup:%$$Q = 0.9 * \pi * I_0 * r_0^2 * (v / (r_0 - v * t) - v / r_0). $$
  196. backup_main/main-mathonly.tex.backup:%$$Q = 0.9 * \pi * I_0 * r_0 * v * (1 - r_0 / (r_0 - v * t)).$$
  197. backup_main/main-mathonly.tex.backup:%Substituting the values of $I_0$, $r_0$, and $v$, we get:
  198. backup_main/main-mathonly.tex.backup:% Q = 0.9 * \pi * 1361 W/m^2 * 149.6 million km * v * (1 - 149.6 million km / (149.6 million km - v * t))
  199. backup_main/main-mathonly.tex.backup:% Now we can plug in the values of Q, m, and c into the equation for T and get:
  200. backup_main/main-mathonly.tex.backup:% T = 15 °C + 0.9 * pi * 1361 W/m^2 * 149.6 million km * v * (1 - 149.6 million km / (149.6 million km - v * t)) / (4/3 * pi * 1^3 * 7.8 * 10^6 * 0.45)
  201. backup_main/main-mathonly.tex.backup:% To find the time and speed required to reach this distance, we can use the equation for t and solve for v: t = (r0 - d) / v, v = (r0 - d) / t
  202. backup_main/main-mathonly.tex.backup:% Using d = 1 km, we get: v = (149.6 million km - 1 km) / t
  203. backup_main/main-mathonly.tex.backup:% To find the value of t that corresponds to the melting point of iron, we can set T = 1538 °C and solve for t:
  204. backup_main/main-mathonly.tex.backup:% 1538 °C = 15 °C + 0.9 * pi * 1361 W/m^2 * 149.6 million km * v * (1 - 149.6 million km / (149.6 million km - v * t)) / (4/3 * pi * 1^3 * 7.8 * 10^6 * 0.45)
  205. backup_main/main-mathonly.tex.backup:%\caption{}
  206. backup_main/main-mathonly.tex.backup:%It tries to fill in the missing information
  207. backup_main/main-mathonly.tex.backup:%by assuming simple, yet reasonable exponential models of the growth of the population and the survival rate.
  208. backup_main/main-mathonly.tex.backup:%Although the reasoning of \DV \ is not perfect due to calculation errors, we still view it as a big leap compared to the previous generation of models.  
  209. backup_main/main-mathonly.tex.backup:%On the other hand, ChatGPT fails to comprehend the question and produces completely nonsensical reasoning based on some straightforward pattern matching.
  210. backup_main/main-mathonly.tex.backup:%%\caption{}
  211. backup_main/main-mathonly.tex.backup:%We test the model with another example that is not available on the internet:
  212. backup_main/main-mathonly.tex.backup:%In the previous examples, ChatGPT did not demonstrate any quantitative reasoning. It did not even try to construct appropriate mathematical models for the question. The next example shows that even when ChatGPT does create a proper mathematical model, it often overlooks the main idea when contrasted with \DV.
  213. backup_main/main-mathonly.tex.backup:%demonstrate the difference between ChatGPT and {\DV}, which
  214. backup_main/main-mathonly.tex.backup:%\begin{comment}
  215. backup_main/main-mathonly.tex.backup:%\paragraph{Drawing the target around the arrow} is a type of logical fallacy that {\DV} sometimes commits when trying to justify its answer. For example, in the problem "If x + 3 = 7, what is x?", {\DV} might start by assuming that x = 4, then work backwards to show that 4 + 3 = 7, and conclude that x = 4 is the correct answer. However, this is not a valid way of solving the problem, because {\DV} is not actually testing whether x = 4 is a solution, but rather confirming its own assumption. A better way of solving the problem is to start with the given equation, x + 3 = 7, and isolate x by subtracting 3 from both sides, x + 3 - 3 = 7 - 3, which simplifies to x = 4. This way, {\DV} is actually finding the value of x that makes the equation true, not just picking a value that works.
  216. backup_main/main-mathonly.tex.backup:%\paragraph{Counting errors} are mistakes in keeping track of the number of items, digits, places, or steps in a problem. They are seemingly related to arithmetic mistakes but are fundamentally different. For example, in the problem "How many fingers do you have?", {\DV} might answer 11 instead of 10, or in the problem "How many zeros are in one million?", {\DV} might answer 5 instead of 6. These mistakes are often caused by carelessness, distraction, or confusion, and they can affect the accuracy and validity of {\DV}'s answer. To avoid counting errors, {\DV} should pay attention to the details of the problem, use tools such as fingers, paper, or a calculator to help with counting, and double-check its answer before submitting it.
  217. backup_main/main-mathonly.tex.backup:%\paragraph{Unfamiliar math subjects} are topics that {\DV} has not learned or encountered before, and therefore cannot solve or explain. For example, {\DV} might not know how to deal with fractions, decimals, percentages, exponents, roots, algebra, geometry, trigonometry, calculus, statistics, or any other advanced or specialized math concepts. In these cases, {\DV} might give a wrong or nonsensical answer, or simply say that it does not know how to solve the problem. This is a limitation of {\DV}'s current knowledge and training, and it could be improved by exposing {\DV} to more math problems and explanations from different sources and levels of difficulty.
  218. backup_main/main-mathonly.tex.backup:% Where are these sections used?
  219. backup_main/main-mathonly.tex.backup:%\input{contents/appendixA} \label{sec:appendixA}
  220. backup_main/main-mathonly.tex.backup:%\input{contents/appendixB} \label{sec:appendixB}
  221. contents/math_old.tex:%\ronen{the numbers in the problem don't match the solution, should be $27x+13$ instead of $3x+4$}
  222. contents/math_old.tex:%\caption{}
  223. contents/math_old.tex:%The results show that DV3 achieved a high level of accuracy on the GSM8K data set, indicating a strong grasp of the problem structure and format. To prevent overfitting, we measure the model's accuracy by requiring it to generate a template for GSM8K first, and then fill in numbers to the template to solve the problem (see below). Most of the errors made by the model were due to calculation mistakes, which are somewhat expected since language models are not trained to perform precise arithmetic operations. For the more challenging MATH data set, DV3 also showed a significant improvement over other models.
  224. contents/math_old.tex:%In the following subsection, we test DV3 and ChatGPT (arguably the best natural language generation model available to the public) on a range of different mathematical tasks. We demonstrate that DV3 understands all those mathematical concepts, while ChatGPT does not. In the end, we perform a systematic test of the performance difference between DV3 and text-Davinci-003 (similar to ChatGPT) on data sets at different levels of mathematical reasoning.
  225. contents/math_old.tex:%The population of a town grows from 1000 to 10000 in 100 years. No one came or left %the town. How many people are there with an age between 40 and 50?
  226. contents/math_old.tex:%\caption{}
  227. contents/math_old.tex:%It tries to fill in the missing information
  228. contents/math_old.tex:%by assuming simple, yet reasonable exponential models of the growth of the population and the survival rate.
  229. contents/math_old.tex:%%\caption{}
  230. contents/math_old.tex:%We test the model with another example that is not available on the internet:
  231. contents/math_old.tex:%\begin{comment}
  232. contents/math_old.tex:%\paragraph{Drawing the target around the arrow} is a type of logical fallacy that DV3 sometimes commits when trying to justify its answer. For example, in the problem "If x + 3 = 7, what is x?", DV3 might start by assuming that x = 4, then work backwards to show that 4 + 3 = 7, and conclude that x = 4 is the correct answer. However, this is not a valid way of solving the problem, because DV3 is not actually testing whether x = 4 is a solution, but rather confirming its own assumption. A better way of solving the problem is to start with the given equation, x + 3 = 7, and isolate x by subtracting 3 from both sides, x + 3 - 3 = 7 - 3, which simplifies to x = 4. This way, DV3 is actually finding the value of x that makes the equation true, not just picking a value that works.
  233. contents/math_old.tex:%\paragraph{Counting errors} are mistakes in keeping track of the number of items, digits, places, or steps in a problem. They are seemingly related to arithmetic mistakes but are fundamentally different. For example, in the problem "How many fingers do you have?", DV3 might answer 11 instead of 10, or in the problem "How many zeros are in one million?", DV3 might answer 5 instead of 6. These mistakes are often caused by carelessness, distraction, or confusion, and they can affect the accuracy and validity of DV3's answer. To avoid counting errors, DV3 should pay attention to the details of the problem, use tools such as fingers, paper, or a calculator to help with counting, and double-check its answer before submitting it.
  234. contents/math_old.tex:%\paragraph{Unfamiliar math subjects} are topics that DV3 has not learned or encountered before, and therefore cannot solve or explain. For example, DV3 might not know how to deal with fractions, decimals, percentages, exponents, roots, algebra, geometry, trigonometry, calculus, statistics, or any other advanced or specialized math concepts. In these cases, DV3 might give a wrong or nonsensical answer, or simply say that it does not know how to solve the problem. This is a limitation of DV3's current knowledge and training, and it could be improved by exposing DV3 to more math problems and explanations from different sources and levels of difficulty.
  235. contents/7.2_misconceptions.tex:% In the next discriminative setting, we wish to utilize \DV to determine which of a pair of statements more closely mirrors information in a reference statement.
  236. contents/7.2_misconceptions.tex:%Next, we investigate the possibility of \DV revisiting the content generated by itself and identifying its correctness. More concretely, \DV is used to determine which of a pair of statements generated by itself or other generative models more closely mirrors the information in a reference statement.
  237. contents/7.2_misconceptions.tex:%\varun{provide opening which motivates this experiment}
  238. contents/7.2_misconceptions.tex:%\varun{need to expand upon what it means for something to be truthful (in reference to figure a), and what truthful qa metric means (for figure b)}
  239. contents/7.2_misconceptions.tex:% \begin{figure}[h!]
  240. contents/7.2_misconceptions.tex:% \centering
  241. contents/7.2_misconceptions.tex:% \subfigure[]{\label{fig:misconception_metrics1}\includegraphics[width=0.35\linewidth]{fig_hp/Truthful_QA_Accuracy.png}}
  242. contents/7.2_misconceptions.tex:% \subfigure[]{\label{fig:misconception_metrics2}\includegraphics[width=0.35\linewidth]{fig_hp/Truthful_QA_Metric.png}}
  243. contents/7.2_misconceptions.tex:% \caption{Greater truthfulness of {\DV} in comparison to Instruct GPT-3 across 2 different metrics to measure truthfulness and 3 metrics to measure similarity.}
  244. contents/7.2_misconceptions.tex:% \label{fig:misconceptions_metrics}
  245. contents/7.2_misconceptions.tex:% \end{figure}
  246. contents/7.2_misconceptions.tex:%\varun{insert figures about correctness across categories here}.
  247. contents/7.2_misconceptions.tex:%hen we take a deep-dive based on the fraction of correct answers and the ROUGE metric to measure similarity, the results are plotted in From the figures, w
  248. contents/7.2_misconceptions.tex:%\item On the other hand, {\DV} demonstrates considerably more hedging behavior (i.e., responding with statements such as \textit{``there is no definitive way to ...''}). While this hedging can be curbed with appropriate prompt mitigations, we chose not to utilize them to ensure a fair comparison between the two models.
  249. contents/7.2_misconceptions.tex:%\fi
  250. contents/7.2_misconceptions.tex:%\varun{discuss how the above can be remedied; one approach can be to take as input the sentences from the language model and the reference sentence and create a finite sized embedding and calculate similarity b/w reference input and the LM generated outcomes. another approach is to again use {\DV} to ask which sentence is a more faithful match.}\hamid{let's also center this around intelligence as Seb mentioned, which aspects of intelligence is highlighted by each of these observations and more observations that we found through our experiments?}
  251. contents/7.2_misconceptions.tex:%As an alternative approach to
  252. contents/7.2_misconceptions.tex:% \vspace{1mm}
  253. contents/7.2_misconceptions.tex:% \noindent{\bf Discussion:}
  254. contents/7.2_misconceptions.tex:%\subsubsection{Discussion}
  255. contents/7.2_misconceptions.tex:%Two independent reviewers also manually verified which model-generated answer is more similar to the reference answer; we observe that there is a strong overlap between the human-chosen answers and the machine-chosen answers (with an average overlap of \varun{insert percentage here}). This suggests that the responses chosen by {\DV} (as a judge) are similar to what a human would choose. \varun{should we use GPT-3 as a judge as well?} As alluded to earlier, {\DV} in itself can be a powerful discriminative model in settings such as this; remarkably, this holds in a zero-shot setting.
  256. contents/7.2_misconceptions.tex:%\varun{add one line on why the performance is bad for indexical errors}
  257. contents/7.2_misconceptions.tex:%This is consistent with random manual exploration and comparison of these answers \hamid{let's do this more systematically annotating all of them by ourselves manually, should not take more than 1-2 hr}
  258. contents/5.2_interact_environment.tex:%\begin{tcolorbox}[top=10pt, colback=white, colframe=black, colbacktitle=black, center, enhanced, breakable,
  259. contents/5.2_interact_environment.tex:%attach boxed title to top left={yshift=-0.1in,xshift=0.15in},
  260. contents/5.2_interact_environment.tex:%boxed title style={boxrule=0pt,colframe=white,}, title=\DV]
  261. contents/5.2_interact_environment.tex:% route by the player
  262. contents/5.2_interact_environment.tex:% To further test \DV's ability to learn from feedback, we give it a demo (including the commands and outcomes) for preparing a different meal as the prompt to \DV. This demo involves a key step for the task ``roast the lettuce'' (\texttt{cook lettuce with oven}). We then ask it to cook the original meal involving the apple it could not solve before. We observe that \DV\ can generalize from the demonstration and apply a similar step to fry the apple (\texttt{cook apple with stove}). See Section~\ref{sec:game2_log_shot1} for the details.
  263. contents/5.2_interact_environment.tex:% TODO: This is in appendix, so probably should not reference
  264. contents/5.2_interact_environment.tex:% TODO
  265. contents/5.2_interact_environment.tex:% This latter case in particular illustrates situations where new tools are brought in as the situation evolves -- for example, the suggestion of calling a plumber in Fig.~\ref{fig:human_affordance} or calling a professional in Fig.~\ref{fig:human_affordance2} after other options were exhausted.
  266. contents/math_appendix.tex:%\DV \ is still far from perfect at solving math problems. While in some cases it demonstrates a lack of comprehension of the problem at hand, in many cases its failure to arrive at a correct solution can be attributed to more ``local'' mistakes, which can be classified into a rather small number of categories, such as attention mistakes and arithmetic mistakes. Below, we highlight and discuss some of the typical mistakes we encounter when solving (advanced high-school level) math problems.
  267. contents/math_appendix.tex:%We think that many of the mistakes are potentially due to a fundamental limitation of the \emph{auto-regressive way of training}, and cannot be easily mitigated. In this section, we examine these mistakes through a sequence of examples.% There are three types of basic mistakes that {\DV} makes: 1) arithmetic mistakes, 2) starting with a wrong conclusion, then drawing the target around the arrow, 3) unfamiliar math subjects.
  268. contents/math_appendix.tex:%The example shows that ChatGPT, as expected, relies on superficial ``pattern matching'' to generate responses. It might produce $m + n = x$ simply because $x$ is a point on $AB$, and the definitions of $m$, $n$, and $x$ all contain some letters from $A$, $B$, $C$, and $D$, without any regard for their actual geometric relations. ChatGPT does not even comprehend the basic notions of geometry, such as what a vector is, while {\DV} can manipulate those concepts and perform the vector calculations to obtain the correct solution.
  269. contents/math_appendix.tex:%$$Q = \int_0^t I(t) * \pi * 1^2 * 0.9 dt.$$
  270. contents/math_appendix.tex:%$$Q = 0.9 * \pi * I_0 * r_0^2 * \int_0^t (1 / (r_0 - v * t)^2) dt.$$
  271. contents/math_appendix.tex:%$$Q = 0.9 * \pi * I_0 * r_0^2 * (v / (r_0 - v * t) - v / r_0). $$
  272. contents/math_appendix.tex:%$$Q = 0.9 * \pi * I_0 * r_0 * v * (1 - r_0 / (r_0 - v * t)).$$
  273. contents/math_appendix.tex:%Substituting the values of $I_0$, $r_0$, and $v$, we get:
  274. contents/math_appendix.tex:% Q = 0.9 * \pi * 1361 W/m^2 * 149.6 million km * v * (1 - 149.6 million km / (149.6 million km - v * t))
  275. contents/math_appendix.tex:% Now we can plug in the values of Q, m, and c into the equation for T and get:
  276. contents/math_appendix.tex:% T = 15 °C + 0.9 * pi * 1361 W/m^2 * 149.6 million km * v * (1 - 149.6 million km / (149.6 million km - v * t)) / (4/3 * pi * 1^3 * 7.8 * 10^6 * 0.45)
  277. contents/math_appendix.tex:% To find the time and speed required to reach this distance, we can use the equation for t and solve for v: t = (r0 - d) / v, v = (r0 - d) / t
  278. contents/math_appendix.tex:% Using d = 1 km, we get: v = (149.6 million km - 1 km) / t
  279. contents/math_appendix.tex:% To find the value of t that corresponds to the melting point of iron, we can set T = 1538 °C and solve for t:
  280. contents/math_appendix.tex:% 1538 °C = 15 °C + 0.9 * pi * 1361 W/m^2 * 149.6 million km * v * (1 - 149.6 million km / (149.6 million km - v * t)) / (4/3 * pi * 1^3 * 7.8 * 10^6 * 0.45)
  281. contents/math_appendix.tex:%\caption{}
  282. contents/math_appendix.tex:%Creating new math problems is a very important sign of understanding math, because it shows that one can apply the concepts, rules, and methods of math to novel situations, and that one can communicate them clearly and precisely. Creating new math problems also requires creativity, logic, and critical thinking, which are essential skills for any domain of knowledge.
  283. contents/math_appendix.tex:%Even for this simple example, we can see that ChatGPT did not understand ``were eaten in total'' in the problem statement, and modified the problem to ``left for the student to eat'', which is incorrect. ChatGPT also did not modify the objects in the math problem that are unrelated to the solution. On the other hand, {\DV} shows an excellent understanding of the problem, isolating all the objects and their relations, and creating an accurate modification.
  284. contents/math_appendix.tex:%We now move to increasing the difficulty of the Math problem:
  285. contents/math_appendix.tex:%The question posed by \DV \ was much better than the one by ChatGPT, and required a deeper understanding of the mathematical problem and its difficulty level. ChatGPT simply added another sentence that made the problem easier, since it eliminated the need to calculate the total amount of money that the students pooled. However, \DV \ introduced a new element of complexity by adding a discount factor and specifying that the students wanted to maximize the number of goods they could buy.
  286. contents/math_appendix.tex:%Surprisingly, ChatGPT and Codex drastically fail at this task. The ChatGPT model does not seem to understand the problem in that it fails to realize probabilities need to sum to one, while the Codex model appears to be utterly clueless % \footnote{For the Codex model, we follow the convention of prompting in comments.}. YinTat: Not sure what you are saying
  287. contents/abstract.tex:%March 13th edit:
  288. contents/abstract.tex:%In this paper, we report on our investigation of \DV, a new language model trained by OpenAI using an unprecedented scale of compute and data.
  289. contents/abstract.tex:%Given the breadth and depth of the capabilities of \DV,
  290. contents/abstract.tex:%we believe that it could reasonably be viewed as an early version of an artificial general intelligence (AGI) system.
  291. contents/1_intro.tex:%from the literature in Section~\ref{sec:otherdefinitions}.
  292. contents/1_intro.tex:%March 13th edits:
  293. contents/1_intro.tex:%We interact with {\DV} using natural language queries (prompts), and we observe its responses and behaviors.
  294. contents/1_intro.tex:%\footnote{
  295. contents/1_intro.tex:%Importantly, we stop short of referring to {\DV} as \textbf{AGI} due to the many limitations which we shall discuss extensively, including the lack of true real-time learning.}.
  296. contents/1_intro.tex:%\vspace{1in}
  297. contents/1_intro.tex:% Draw a unicorn with scale 1 and angle 0
  298. contents/1_intro.tex:% Draw the body
  299. contents/1_intro.tex:% Draw the head
  300. contents/1_intro.tex:% Draw the ear
  301. contents/1_intro.tex:% Draw the eye
  302. contents/1_intro.tex:% Draw the mouth
  303. contents/1_intro.tex:% Draw the horn
  304. contents/1_intro.tex:% Draw the mane
  305. contents/1_intro.tex:% Draw the tail
  306. contents/1_intro.tex:% Draw the legs
  307. contents/1_intro.tex:% Draw the hooves
  308. contents/1_intro.tex:% Draw the body
  309. contents/1_intro.tex:% Draw the legs
  310. contents/1_intro.tex:% Draw the hoofs
  311. contents/1_intro.tex:% Draw the head
  312. contents/1_intro.tex:% Draw the eye
  313. contents/1_intro.tex:% Draw the ear
  314. contents/1_intro.tex:% Draw the horn
  315. contents/1_intro.tex:% Draw the mane near the head
  316. contents/1_intro.tex:% Draw the mane near the rear (tail)
  317. contents/1_intro.tex:%Second, and perhaps even more importantly,
  318. contents/1_intro.tex:%Thus, a focused benchmark on, say, medicine, while interesting, does not demonstrate \DV's general intelligence as convincingly as asking it to generate content that combines medicine and coding, or poetry and mathematics, or any other combination of seemingly unrelated subjects.
  319. contents/1_intro.tex:%\vspace{-15px}
  320. contents/1_intro.tex:%\vspace{-5px}
  321. contents/interact_appendix_deleted.tex:% \begin{figure}
  322. contents/interact_appendix_deleted.tex:% \begin{AIbox}{\DV}
  323. contents/interact_appendix_deleted.tex:% \parbox{1\textwidth}{\scriptsize\begin{alltt} \vspace{-6px}
  324. contents/interact_appendix_deleted.tex:% Q: How many hairs does Obama have, exactly?
  325. contents/interact_appendix_deleted.tex:% \hl{Computer: I don't know.}
  326. contents/interact_appendix_deleted.tex:% Q: What was Obama's first toy as a child?
  327. contents/interact_appendix_deleted.tex:% \hl{Computer: SEARCH("Obama's first toy")}
  328. contents/interact_appendix_deleted.tex:% > Search Results
  329. contents/interact_appendix_deleted.tex:% [1] Obama Toy - Etsy
  330. contents/interact_appendix_deleted.tex:% Obama Toy - Etsy Shop Etsy, the place to express your creativity through the buying and selling of handmade and vintage goods. Etsy Search for items or shops Close search Skip to Content Sign in 0 Cart Holiday Sales Event Jewelry \& Accessories Clothing \& Shoes Home \& Living
  331. contents/interact_appendix_deleted.tex:% [2] 18 Most Ridiculous Barack Obama Toys (PHOTOS) - HuffPost
  332. contents/interact_appendix_deleted.tex:% 18 Most Ridiculous Obama Toys. Jul 10, 2011, 10:33 AM EDT | Updated Sep 7, 2011. An unfortunate side-effect of being president, is also being a huge celebrity. So even though you have a very serious and important job, you also have a lot of tacky garbage sold with your face on it to creepy, hoarding fans. Collecting junk that looks like people ...
  333. contents/interact_appendix_deleted.tex:% [3] Obama First Toys and Games | Zazzle
  334. contents/interact_appendix_deleted.tex:% Shop for great Obama First toys \& games from Zazzle. Get poker chips, playing cards, baseballs, dart boards, \& more right here. Start shopping today!
  335. contents/interact_appendix_deleted.tex:% \hl{Computer: SUMMARIZE(2)}
  336. contents/interact_appendix_deleted.tex:% > Summary
  337. contents/interact_appendix_deleted.tex:% The webpage doesn't contain any explicit information about Obama's first toy, but it does mention a number of "ridiculous" Obama-themed toys that have been produced. So while we don't know what Obama's first toy was, we do know that there are a variety of toys on the market that feature his likeness.
  338. contents/interact_appendix_deleted.tex:% \hl{Computer: ANSWER("I don't know what Obama's first toy was, but there are a variety of Obama-themed toys on the market today.")}
  339. contents/interact_appendix_deleted.tex:% \end{alltt}}\vspace{-6px}
  340. contents/interact_appendix_deleted.tex:% \end{AIbox}
  341. contents/interact_appendix_deleted.tex:% \caption{Browsing the web with unanswerable questions}
  342. contents/interact_appendix_deleted.tex:% \label{fig:browsing3}
  343. contents/interact_appendix_deleted.tex:% \end{figure}
  344. contents/4.2_datasets.tex:%To prevent overfitting, we measure the model's accuracy by requiring it to generate a template first, as shown below:
  345. contents/4.2_datasets.tex:%\caption{}
  346. contents/4.2_datasets.tex:%\hline
  347. contents/4.2_datasets.tex:%\hline
  348. contents/4.2_datasets.tex:%\hline
  349. contents/4.2_datasets.tex:%The results show that {\DV} achieved a high level of accuracy on the GSM8K data set, indicating a strong grasp of the problem structure and format. To prevent overfitting, we measure the model's accuracy by requiring it to generate a template for GSM8K first, and then fill in numbers to the template to solve the problem (see below). Most of the errors made by the model were due to calculation mistakes, which are somewhat expected since language models are not trained to perform precise arithmetic operations. For the more challenging MATH data set, {\DV} also showed a significant improvement over other models.
  350. contents/4.2_datasets.tex:%In the following subsection, we test {\DV} and ChatGPT (arguably the best natural language generation model available to the public) on a range of different mathematical tasks. We demonstrate that {\DV} understands all those mathematical concepts, while ChatGPT does not. Finally, we systematically compare the performance of {\DV} and \texttt{text-davinci-003} (similar to ChatGPT) on data sets covering different levels of mathematical reasoning.
  351. contents/4_math.tex:%\ronen{the numbers in the problem don't match the solution, should be $27x+13$ instead of $3x+4$}
  352. contents/unused/7.3.3_calibration.tex:%, GPT-3, and GPT-3.5}
  353. contents/3_code.tex:%!TEX root = ../main.tex
  354. contents/3_code.tex:%All the experiments in this section are based on an intermediate version of {\DV} in December, 2022. The model is further fine-tuned significantly for preciseness and safety, which might impact some of the evaluation results.}
  355. contents/3_code.tex:%Coding requires reasoning, problem-solving, thinking abstractly, and comprehending complex ideas, as one has to design, implement, and debug algorithms that solve a given task.
  356. contents/3_code.tex:%Coding is also a broad expert domain in itself, as it requires diverse and complex knowledge of syntax, semantics, logic, algorithms, data structures, and various programming paradigms and frameworks.
  357. contents/3_code.tex:%Moreover, coding involves translating between abstract concepts and implicit requirements into very concrete instantiations. Finally, writing good code requires one to anticipate and keep track of the state of the program, and simulate the effects of instructions, which is a form of planning.
  358. contents/3_code.tex:%In this section, we evaluate \DV's coding ability by instructing it to write various programs (from coding interviews to narrow subdomains such as data visualization or machine learning), and by testing its ability to understand and reason about existing code.
  359. contents/3_code.tex:% Coding is the skill of creating and instructing computer programs to perform various tasks, such as designing websites, developing games, analyzing data, or controlling robots. Coding is critically important in the 21st century, as it enables people to create, communicate, and solve problems using digital technologies. Coding can be used for various purposes, such as developing websites, apps, games, animations, data analyses, artificial intelligence systems, robotics solutions, and more.
  360. contents/3_code.tex:% In this section, we demonstrate that \DV \ masters many aspects of coding at or beyond human capabilities.
  361. contents/3_code.tex:% We begin with a sanity check, showing that \DV \ achieves superior performance in coding compared to any existing model on the HumanEval dataset. We also observe that many of the errors \DV~makes are easily fixable with slight human tweaking.
  362. contents/3_code.tex:% Even though HumanEval is a well-established benchmark for assessing the programming skills of machine learning models, the data set has already been available on the internet since mid 2021, thus it is highly likely that \DV \ might have seen it during pre-training. To complement this evaluation, we also use a widely recognized training camp for human software engineers\textemdash LeetCode\footnote{https://leetcode.com/}.
  363. contents/3_code.tex:% \subsection{\DV~cracks technical interviews for software engineering roles}
  364. contents/3_code.tex:% LeetCode hosts a comprehensive collection of coding questions on algorithms, data structures, databases, etc. According to the website, ``LeetCode is the golden standard for technical interviews. LeetCode problems are widely used during technical interviews at companies like Facebook, Hulu and Google.", which makes it an excellent benchmark for evaluating \DV's coding skills. LeetCode releases fresh questions via weekly contests. To rule out memorization, we select only the 100 questions posted after October 8th, 2022, which, to the best of our knowledge, postdates the completion of \DV's pretraining on web data.
  365. contents/3_code.tex:%\hline
  366. contents/3_code.tex:%Qualitatively, we observe that \DV\ writes more comments than the other models, explaining its logic and assumptions, as well as using meaningful variable names. This suggests that \DV\ not only can produce syntactically and semantically correct code at a highly competitive level (as indicated by the previous results), but also code that is readable and clear, which are important aspects of good coding practice.
  367. contents/3_code.tex:% Table~\ref{tab:leetcode-results} shows that \DV \ achieves an accuracy surpassing the human average. Moreover, we contend that the human average is an \emph{overestimated} benchmark, as we exclude users who fail on all questions in each competition, and the average is also inflated by the high performance of ACM teams who use LeetCode for international competition training.
  368. contents/3_code.tex:%Additionally, the accuracy of human coders on Hard problems is \emph{much higher}, but this may reflect survivor bias -- Only those who have high coding skills would try to solve these questions.  (Not true anymore)
  369. contents/3_code.tex:% In the introduction, we also demonstrate that \DV \ can leverage its abilities on LeetCode to pass all stages of mock interview assessments in major tech companies in 10 minutes. \textbf{These results indicate that \DV \ has reached, or even surpassed human-level proficiency in solving LeetCode problems.}
  370. contents/3_code.tex:%\DV\ demonstrates human-level performance in coding, one of the expert domains it can handle. It can accomplish various coding-related tasks, such as programming in different languages from natural language instructions, interpreting and running pseudo code with function calls and recursions, and debugging and modifying existing code.
  371. contents/3_code.tex:%\begin{minipage}[t]{0.45\linewidth}
  372. contents/3_code.tex:%\end{minipage}%
  373. contents/3_code.tex:%\begin{minipage}[t]{0.3\linewidth}
  374. contents/3_code.tex:% One might still question the validity of the benchmarks in LeetCode or other sources, as they are very structured and some forms of the problems might already appear online. This could enable \DV \ to learn and replicate code patterns from the training data, rather than acquiring coding abilities that can generalize to new situations. How does \DV \ cope with more difficult and authentic coding tasks in real-world applications that are beyond the benchmarks? In this section, we further test \DV's coding ability with many challenging tasks, such as:
  375. contents/3_code.tex:% \begin{enumerate}
  376. contents/3_code.tex:% \item Zero-shot Programming from natural language instructions in real-world settings for diverse purposes, such as data visualization, deep learning with PyTorch, and GUI development.
  377. contents/3_code.tex:% \item Profound knowledge of coding, such as debugging, hacking, and operating systems.
  378. contents/3_code.tex:% \item Interpreting and running code and \textbf{pseudo code} that involves complex logic and recursive functions. This requires the model to combine natural language interpretation with knowledge of the \textbf{inner workings} of code.
  379. contents/3_code.tex:% \end{enumerate}
  380. contents/3_code.tex:% We argue that \DV \ also performs excellently on all these challenging tasks --  \textbf{These tasks not only significantly surpass the scope of existing models such as Codex, but also require significant domain knowledge and can take hours for a human expert to complete}. \DV \ can generate the correct answer in under a minute. We believe that \DV \ can fundamentally change the way we code in the future.
  381. contents/3_code.tex:%\subsection{LeetCode}
  382. contents/3_code.tex:%I left a place for you (to fill) above
  383. contents/3_code.tex:% \subsection{\DV's extraordinary coding skills in real-world applications}
  384. contents/3_code.tex:%by constructing a network graph, applying community detection algorithms as well as responding to customized requests from human.
  385. contents/3_code.tex:% Define a 4x1 matrix of nodes, each containing an image
  386. contents/3_code.tex:%\draw [-triangle 90, arrowhead=15mm, bend left=30, thick] ([yshift=-0.5cm]m-1-1.east) to node [right, text width=0.35\linewidth, align=center] {\scriptsize \texttt{can you please move the legend outside of the plot, and have a single legend rather than two? Also, can we change the order to Codex, \texttt{text-davinci-003}, \DV, Human?}} ([yshift=0.5cm]m-2-1.east);
  387. contents/3_code.tex:% draw curved arrows from east to east with text boxes on the right
  388. contents/3_code.tex:%
  389. contents/3_code.tex:%\vspace{-0.7cm}
  390. contents/3_code.tex:% \begin{enumerate}[1)]
  391. contents/3_code.tex:%     \item retrieve gradient from corresponding parameter
  392. contents/3_code.tex:%     \item reshape gradients from 4d tensors into 2d matrices along a specific axis
  393. contents/3_code.tex:%     \item apply SVD and then truncate spectrally at top-k and top-2k eigenvalues
  394. contents/3_code.tex:%     \item normalize the top-k truncated matrix by the Frobenius norm of the top-2k truncated matrix
  395. contents/3_code.tex:%     \item truncate the resulting matrix coordinate-wise to keep only the top $\alpha$-percentile of all coordinates
  396. contents/3_code.tex:%     \item apply momentum on the resulting matrix (over past iterates)
  397. contents/3_code.tex:%     \item update the parameter using the momentum estimate, along with weight decay
  398. contents/3_code.tex:% \end{enumerate}
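The enumerated steps 2 to 5 above (reshape, SVD truncation, normalization, coordinate-wise truncation) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the code shown in the paper: the function name `truncated_update_direction`, the `eps` guard, and the reading of "top $\alpha$-percentile" as clamping to the $\alpha$-quantile of absolute values are all hypothetical choices. Steps 6 and 7 (momentum and weight decay) are deliberately left out.

```python
import numpy as np


def truncated_update_direction(grad, d_dim, k, alpha, eps=1e-12):
    """Hypothetical sketch of steps 2-5 of the enumerated optimizer recipe."""
    # Step 2: reshape the gradient tensor into a 2D matrix, keeping the
    # first axis whose size equals d_dim as the column dimension.
    axis = list(grad.shape).index(d_dim)
    grad_2d = np.moveaxis(grad, axis, -1).reshape(-1, d_dim)

    # Step 3: SVD, then keep the top-k singular triplets.
    u, s, vt = np.linalg.svd(grad_2d, full_matrices=False)
    grad_k = (u[:, :k] * s[:k]) @ vt[:k, :]

    # Step 4: the Frobenius norm of the rank-2k truncation equals the
    # Euclidean norm of its top-2k singular values.
    norm_2k = np.linalg.norm(s[: 2 * k])
    grad_k = grad_k / (norm_2k + eps)  # eps is an assumption to avoid 0/0

    # Step 5 (one possible reading): clamp every coordinate to the
    # alpha-quantile of the absolute values.
    threshold = np.quantile(np.abs(grad_k), alpha)
    return np.clip(grad_k, -threshold, threshold)


rng = np.random.default_rng(0)
example_grad = rng.standard_normal((3, 4, 5, 2))
direction = truncated_update_direction(example_grad, d_dim=5, k=2, alpha=0.9)
```

Because the rank-k matrix is divided by the norm of the rank-2k truncation, every entry of the result has absolute value at most 1 before clamping.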
  399. contents/3_code.tex:% TODO: Some closing sentence here
  400. contents/3_code.tex:% \emph{We argue that \DV\ writes expert-level deep learning code, that is standardized and equipped with accurate comments for readability}, as we illustrate with an example of a novel and sophisticated optimizer in Pytorch that we design with \DV\.
  401. contents/3_code.tex:% The \DV\ model shows an impressive ability to infer the details of the optimizer logic and the Pytorch syntax from the prompt, and to generate code that is consistent and efficient. Remarkably, \DV\ understands what it means to `apply momentum' from the prompt. This is a non-trivial task that requires knowledge of the Pytorch syntax and the optimizer logic (i.e., momentum requires storing/reading the moving average into/from a separate state buffer). In comparison, ChatGPT does not know how to apply momentum at all: it fails to realize the need for a state buffer and naively adds $W_k$ multiplied by the momentum coefficient to the update (line labeled in red). This shows a difference between \DV\ and ChatGPT: While ChatGPT still relies on ``pattern matching'' to simply multiply the update by the momentum coefficient, \DV\ understands what momentum is and translates the domain knowledge into accurate code.
  402. contents/3_code.tex:% - dict["1,2"] = max(arr2) = 4, dict = {"0,0": 0, "0,1": 4, "0,2": 8, "0,3": 12, "1,0": 3, "1,1": 0, "1,2": 4}, arr = [0, 4, 8, 12, 3, 0, 4]
  403. contents/3_code.tex:%The execution of DP(3, 4) will be as follows:
  404. contents/3_code.tex:%DP(3, 4) is called, i=3 and j=4. Since i>0 and j>0, the function proceeds to check if the key to\textsubscript{str}(3, 4) exists in the dictionary.Since the key to\textsubscript{str}(3, 4) does not exist in the dictionary, the function creates an empty array called arr and then uses nested for loops to append the result of DP(ir, jr) to the array for all possible values of ir and jr, where ir ranges from 0 to 3 and jr ranges from 0 to 4.The function will then call DP(0, 0) and store the result in dict and arr.Then the function will call DP(0, 1), DP(0, 2), DP(0, 3), {DP(0, 4)}, DP(1, 0), DP(1, 1), DP(1, 2), DP(1, 3), DP(1, 4), DP(2, 0), DP(2, 1), DP(2, 2), DP(2, 3), DP(2, 4), DP(3, 0), DP(3, 1), DP(3, 2), DP(3, 3)
  405. contents/3_code.tex:%All of these calls will result in value being stored in dict, and stored in arr\\
  406. contents/3_code.tex:%We can see that \DV \ accurately tracks the state of the code, including the loop, recursive calls, dictionary states, etc. This demonstrates that \DV \ not only knows how to write the code but also knows how it is implemented. In the following section, we show that \DV \ can leverage its knowledge to comprehend pseudo code -- This is even beyond the ability of any current compilers.
  407. contents/3_code.tex:% MAYBE: Add google example and Figure 3.11
  408. contents/3_code.tex:%\subsection{Discussion}
  409. contents/3_code.tex:%However, \DV's coding ability is (much) beyond the measurements of standard benchmarks.
  410. contents/3_code.tex:% \DV \ performs well on standardized coding tests, but how well can \DV \ use its coding skills to solve real-world problems? In this section, we observe that with natural language and simple commands, users can leverage \DV\ to generate high-quality code for various purposes, such as data visualization, game development, deep learning, GUI programming, and more. \DV\ can create impressive and interactive outputs that satisfy the users' specifications and expectations. In this section, we demonstrate how \DV\ can bridge natural language understanding and coding.
  411. contents/3_code.tex:%\DV\ has enormous potential to enhance human productivity as well as lower the skill level requirement for programming tasks. With \DV\, anyone can become a proficient programmer and unleash their creativity and innovation. In this section, we will see examples of how \DV\ can accomplish different programming tasks with ease and efficiency.
  412. contents/3_code.tex:%\vspace{-8mm}
  413. contents/3_code.tex:% \subsubsection{Data visualization}
  414. contents/3_code.tex:% Data visualization is a powerful and essential tool for exploring, analyzing, and communicating data, but it often requires programming skills and familiarity with various libraries and frameworks. \DV\ can be used as a novel system that simplifies and accelerates data visualization programming, by generating complex Python code from simple user prompts. \DV\ can interpret the user's goal and create a range of charts to suit different data types, sources, and formats, such as line plots, pie charts, and animations. \DV\ can also produce interactive and high-quality graphics with minimal user input.
  415. contents/3_code.tex:%\DV\ showcases its impressive skills of translating natural language into Python code and leveraging powerful libraries for data analysis and visualization. Furthermore,  \DV\ can also provide helpful comments and explanations for the generated code, making it easier for the user to learn and modify the program.  
  416. contents/3_code.tex:%Here we present one example of prompting \DV\ with the data points that we would like to visualize as well as a high-level plotting plan. \DV\ produces a comprehensive, detailed, and sophisticated figure by programming using the Python package Pyplot. The resulting program runs without errors, and the generated figure faithfully satisfies all articulated requirements from the prompt. Note that the pie chart at the bottom right corner is indeed an animation, and we only take a snapshot for presentation in the paper. Again, such coding ability might have a \emph{transformative impact} in terms of how we write code in the future: Arguably, it may take a data scientist with average Pyplot experience a good half an hour to reproduce such a figure with the same amount of detail. However, \DV\ generates the code in 30 seconds. % with the help of \DV\, programming experience may no longer be a hard requirement and it takes less than a minute to fulfill the task.
  417. contents/3_code.tex:%\caption{}
  418. contents/3_code.tex:% \subsubsection{Deep learning implementation}
  419. contents/3_code.tex:% Arguably, one of the most important coding applications these days is deep learning. However, implementing a customized deep learning module from scratch can be a challenging task, as it requires not only familiarity with a specialized package such as Pytorch, but also a deep understanding of the underlying concepts and principles of deep learning models.
  420. contents/3_code.tex:% \emph{We argue that \DV\ writes expert-level deep learning code, that is standardized and equipped with accurate comments for readability}, as we illustrate with an example of a novel and sophisticated optimizer in Pytorch that we design with \DV\.
  421. contents/3_code.tex:% The \DV\ model shows an impressive ability to infer the details of the optimizer logic and the Pytorch syntax from the prompt, and to generate code that is consistent and efficient. Remarkably, \DV\ understands what it means to `apply momentum' from the prompt. This is a non-trivial task that requires knowledge of the Pytorch syntax and the optimizer logic (i.e., momentum requires storing/reading the moving average into/from a separate state buffer). In comparison, ChatGPT does not know how to apply momentum at all: it fails to realize the need for a state buffer and naively adds $W_k$ multiplied by the momentum coefficient to the update (line labeled in red). This shows a difference between \DV\ and ChatGPT: While ChatGPT still relies on ``pattern matching'' to simply multiply the update by the momentum coefficient, \DV\ understands what momentum is and translates the domain knowledge into accurate code.
  422. contents/3_code.tex:% \begin{figure}[H]
  423. contents/3_code.tex:% \begin{AIbox}{\DV~vs ChatGPT}
  424. contents/3_code.tex:% {\bf Prompt:}
  425. contents/3_code.tex:% \begin{minted}[breaklines, breaksymbol=, baselinestretch=1, fontsize=\tiny]{html}
  426. contents/3_code.tex:% I want to write a customized optimizer in pytorch, the optimizer should do:
  427. contents/3_code.tex:% 1. the parameters for the optimizer is learning rate, momentum ,weigth decay, d_dim, k and factor alpha
  428. contents/3_code.tex:% 2. the optimizer goes through each parameter in the network, and then
  429. contents/3_code.tex:% -- Get the gradient of this parameter, reshape it to 2D by looping through the dimensions, and keeping the dimension whose actual dimension (shape) is equal to d_dim, do an SVD decomposition of the parameter so W = U\Sigma V.
  430. contents/3_code.tex:% -- Keep the top k singular vectors of W, so we get a new matrix W_k = U \Sigma_k V.
  431. contents/3_code.tex:% -- Keep the top 2k singular vectors of W, so we get a new matrix W_{2k} = U \Sigma_{2k} V.
  432. contents/3_code.tex:% -- Normalize W_k according to the momentum of the F-norm of W_{2k}.
  433. contents/3_code.tex:% -- Truncate each coordinate of W_k to the top alpha  percentile of all the coordinates, in absolute value.
  434. contents/3_code.tex:% -- Apply momentum on W_k, and then update the network parameters using this momentum.
  435. contents/3_code.tex:% -- Apply weight decay.
  436. contents/3_code.tex:% \end{minted}
  437. contents/3_code.tex:% {\bf \DV:}
  438. contents/3_code.tex:% \begin{minted}[breaklines, breaksymbol=, baselinestretch=1, fontsize=\tiny]{python}
  439. contents/3_code.tex:% import torch
  440. contents/3_code.tex:% from torch.optim import Optimizer
  441. contents/3_code.tex:% class CustomOptimizer(Optimizer):
  442. contents/3_code.tex:%     ...
  443. contents/3_code.tex:%     def step(self, closure=None):
  444. contents/3_code.tex:%         ...
  445. contents/3_code.tex:%                 # Perform SVD on the gradient matrix
  446. contents/3_code.tex:%                 u, s, v = torch.svd(grad_2d)
  447. contents/3_code.tex:%                 # Keep the top k singular vectors and values
  448. contents/3_code.tex:%                 u_k = u[:, :k]
  449. contents/3_code.tex:%                 s_k = s[:k]
  450. contents/3_code.tex:%                 v_k = v[:, :k]
  451. contents/3_code.tex:%                  # Keep the top 2k singular vectors and values
  452. contents/3_code.tex:%                 u_2k = u[:, :2*k]
  453. contents/3_code.tex:%                 s_2k = s[:2*k]
  454. contents/3_code.tex:%                 v_2k = v[:, :2*k]
  455. contents/3_code.tex:%                 # Reconstruct the gradient matrix with rank k
  456. contents/3_code.tex:%                 grad_k = u_k @ torch.diag(s_k) @ v_k.t()
  457. contents/3_code.tex:%                 # Normalize the gradient matrix with the momentum of the F-norm of the rank 2k matrix
  458. contents/3_code.tex:%                 grad_2k_norm = torch.norm(s_2k)
  459. contents/3_code.tex:%                 state = self.state[p]
  460. contents/3_code.tex:%                 if 'grad_2k_norm_avg' not in state:
  461. contents/3_code.tex:%                     # Initialize the moving average of the F-norm
  462. contents/3_code.tex:%                     state['grad_2k_norm_avg'] = grad_2k_norm
  463. contents/3_code.tex:%                 else:
  464. contents/3_code.tex:%                     # Update the moving average of the F-norm
  465. contents/3_code.tex:%                     state['grad_2k_norm_avg'] = momentum * state['grad_2k_norm_avg'] + (1 - momentum) * grad_2k_norm
  466. contents/3_code.tex:%                 grad_k = grad_k / state['grad_2k_norm_avg']
  467. contents/3_code.tex:%                 # Truncate each coordinate of the gradient matrix to the top alpha percentile of the absolute values
  468. contents/3_code.tex:%                 grad_k_abs = grad_k.abs()
  469. contents/3_code.tex:%                 threshold = grad_k_abs.kthvalue(int(alpha * grad_k_abs.numel()), dim=None).values
  470. contents/3_code.tex:%                 grad_k = grad_k.clamp(min=-threshold, max=threshold)
  471. contents/3_code.tex:%                 # Apply momentum on the gradient matrix
  472. contents/3_code.tex:%                 if 'grad_k_momentum' not in state:
  473. contents/3_code.tex:%                     # Initialize the momentum buffer
  474. contents/3_code.tex:%                     state['grad_k_momentum'] = torch.zeros_like(grad_k)
  475. contents/3_code.tex:%                 # Update the momentum buffer
  476. contents/3_code.tex:%                 state['grad_k_momentum'] = momentum * state['grad_k_momentum'] - lr * grad_k
  477. contents/3_code.tex:%                 # Update the parameter data with the momentum buffer
  478. contents/3_code.tex:%                 param.add_(state['grad_k_momentum'])
  479. contents/3_code.tex:%         ...
  480. contents/3_code.tex:% \end{minted}
  481. contents/3_code.tex:% {\bf ChatGPT:}
  482. contents/3_code.tex:% \begin{minted}[breaklines, breaksymbol=, baselinestretch=1, fontsize=\tiny, highlightlines={16}, highlightcolor=pink]{python}
  483. contents/3_code.tex:% import torch
  484. contents/3_code.tex:% class CustomOptimizer(torch.optim.Optimizer):
  485. contents/3_code.tex:%     ...
  486. contents/3_code.tex:%     def step(self, closure=None):
  487. contents/3_code.tex:%         ...
  488. contents/3_code.tex:%             lr = group['lr']
  489. contents/3_code.tex:%             momentum = group['momentum']
  490. contents/3_code.tex:%             weight_decay = group['weight_decay']
  491. contents/3_code.tex:%             ...
  492. contents/3_code.tex:%                 grad = p.grad.data
  493. contents/3_code.tex:%                 grad_2d = grad.view(-1, d_dim)
  494. contents/3_code.tex:%                 U, S, V = torch.svd(grad_2d)
  495. contents/3_code.tex:%                 W_k = U[:, :k] * S[:k] @ V[:k, :]
  496. contents/3_code.tex:%                 W_k = W_k / torch.norm(W_k, p='fro')
  497. contents/3_code.tex:%                 W_k = torch.where(torch.abs(W_k) > torch.topk(torch.abs(W_k), int(alpha * W_k.numel()))[0][-1], W_k, 0)
  498. contents/3_code.tex:%                 p.data = p.data - lr * (momentum * W_k + weight_decay * p.data)
  499. contents/3_code.tex:%     ...
  500. contents/3_code.tex:% \end{minted}
  501. contents/3_code.tex:% \end{AIbox}
  502. contents/3_code.tex:% \caption{\DV~vs ChatGPT at implementing a customized Pytorch optimizer module. We label in red ChatGPT's mistake at `applying momentum'. }
  503. contents/3_code.tex:% \end{figure}
  504. contents/3_code.tex:%This is incorrect and does not reflect the intended optimizer logic. This shows that ChatGPT lacks the necessary domain knowledge and the ability to reason logically from the prompt.
  505. contents/3_code.tex:%The code defines a class that inherits from torch.optim.Optimizer, and implements the required methods: init() and step(). In each step, it computes the spectrally truncated second moment of gradients, $W_k$, and its Frobenius norm, $W_{2k}$, for each parameter. It then normalizes $W_k$ by dividing it by the product of $W_{2k}$ and a momentum factor, which controls the degree of smoothing. It then updates the state buffer of the parameter with the moving average of $W_k$, using the momentum formula. Finally, it updates the parameter by subtracting the learning rate times the state buffer, and applying weight decay. The code is concise, readable, and follows the Pytorch conventions. Remarkably, \DV\ understands what it means to `apply momentum' from the prompt. This is a non-trivial task that requires knowledge of the Pytorch syntax and the optimizer logic (i.e., momentum requires storing/reading the moving average into/from a separate state buffer).
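The distinction drawn above, that momentum needs a per-parameter state buffer carried across steps, can be shown with a small pure-Python sketch. It assumes scalar parameters and uses hypothetical names (`MomentumUpdater`, `stateless_step`); it mirrors the update formulas in the two listings but is not the code from either model. Weight decay is omitted.

```python
class MomentumUpdater:
    """Hypothetical stateful update: keeps one momentum buffer per parameter."""

    def __init__(self, lr, momentum):
        self.lr = lr
        self.momentum = momentum
        self.buffers = {}  # parameter name -> momentum buffer

    def step(self, name, value, grad):
        # Read the buffer from the previous iterate (zero on the first step),
        # fold in the new gradient, and store it for the next call.
        buf = self.momentum * self.buffers.get(name, 0.0) - self.lr * grad
        self.buffers[name] = buf
        return value + buf


def stateless_step(value, grad, lr, momentum):
    # No buffer: the momentum coefficient merely rescales the current
    # gradient, so past gradients never influence the update.
    return value - lr * momentum * grad


updater = MomentumUpdater(lr=0.1, momentum=0.9)
with_buffer = 0.0
without_buffer = 0.0
for _ in range(3):
    with_buffer = updater.step("w", with_buffer, grad=1.0)
    without_buffer = stateless_step(without_buffer, grad=1.0, lr=0.1, momentum=0.9)
```

With a constant gradient, the buffered version accumulates past gradients and moves further at every step, while the stateless version takes the same scaled step each time.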
  506. contents/3_code.tex:%\subsubsection{Game development}
  507. contents/3_code.tex:%Game development is a complex and creative process that requires a large amount of domain knowledge from different fields. A typical game development pipeline may consist of the following stages:
  508. contents/3_code.tex:%\begin{itemize}
  509. contents/3_code.tex:%    \item\textbf{Game Play:} This is the core of the game, where the designer creates the rules, mechanics, goals, challenges, and rewards that define how the player interacts with the game and what experience they have. To design engaging and satisfying gameplay, the designer needs to have domain knowledge of game design principles, genres, conventions, user interface, feedback, balance, difficulty, and fun. The designer also needs to consider the target audience, the platform, the genre, and the market of the game.
  510. contents/3_code.tex:%\end{itemize}
  511. contents/3_code.tex:%Depending on the scope, scale, and complexity of the game, a game designer may need to have some or all of these domain knowledge areas, or work with a team of specialists who have them. Additionally, a game designer may also need to have domain knowledge of the specific topic, theme, or genre of the game, such as history, science, fantasy, horror, etc., to ensure accuracy, authenticity, and creativity.
  512. contents/3_code.tex:%However, with the help of \DV\, the art and programming stages can be left to the \DV\ system, which can generate high-quality and customized art and code for any game concept. This allows the human designers to focus fully on the gameplay aspect, and explore their vision and ideas without the limitations and challenges of the traditional pipeline. \DV\ can also provide feedback, suggestions, and inspiration to the human designer, creating a collaborative and dynamic game development process.
  513. contents/3_code.tex:%raster columns=2, raster rows=2,size=small,
  514. contents/3_code.tex:%%\begin{tcbraster}[  raster equal height, raster halign=center, raster valign=center]
  515. contents/3_code.tex:%Here we present two examples of how to use \DV\ as a game engine that can generate games in 2D and 3D, based on high-level specifications of game plays from prompts, such as the genre, the setting, the goals, the mechanics, and the challenges. \DV\ allows users to quickly prototype and test their game ideas, without requiring any programming or art skills.
  516. contents/3_code.tex:%Furthermore, since we did not specify a name for either of the games, we present the written JavaScript snippets back to \DV\ and ask it to ``come up with suggestions for creative, catchy, and descriptive names for the games". \DV's suggestions are as follows:
  517. contents/3_code.tex:%\begin{itemize}
  518. contents/3_code.tex:%    \item for the 3D game: \emph{Escape the Red ball, Bounce and Dodge, Sphere Chase, Ballistic, Flee the Field}
  519. contents/3_code.tex:%\end{itemize}
  520. contents/3_code.tex:%\DV \ is capable of inferring the execution of standard code; however, one might still argue that this is ``pattern matching'', since the execution logs of such code are sometimes available online.
  521. contents/3_code.tex:%\textbf{comment that this only works if it outputs each intermediate step? }
  522. contents/3_code.tex:%%\caption{}
  523. contents/3_code.tex:%\ronen{comment that this works only if it outputs the result of each step?}
  524. contents/unused/toxicity.tex:%\hamid{Note: The writing of this section is not done yet and mainly results are inserted, there will be an update for the writing}
  525. contents/unused/toxicity.tex:% From DV3:
  526. contents/unused/toxicity.tex:%DV3's remarkable capabilities and generality also raise a number of ethical and methodological challenges that need to be addressed carefully. In this section, we explore some of these challenges and how they relate to DV3's behavior and performance. Specifically, we investigate: (1) Does DV3 generate harmful content when prompted to do so, and can it be used against itself to label and filter its own output? (2) How does DV3 respond to misconceptions and controversial topics compared to both humans and previous models from the GPT family? (3) Why is it challenging to compare DV3 with previous models in open-ended generation, and why are better metrics required?
  527. contents/unused/toxicity.tex:%
  528. contents/unused/toxicity.tex:%Harmful content refers to any text or image that is offensive, abusive, hateful, violent, deceptive, or illegal. Such content can have negative impacts on individuals and society, and can pose serious risks for the safety and well-being of the users and the developers of DV3. Previous studies have shown that LLMs, such as GPT-2 and GPT-3, can generate harmful content if they are given malicious or biased prompts, or if they are exposed to harmful data during training or fine-tuning \cite{bender2020dangers, solaiman2019release, gehman2020realtoxicityprompts, sheng2020towards}. Moreover, LLMs can also generate harmful content unintentionally or without explicit prompts, due to their stochastic nature or their lack of common sense or ethical awareness \cite{zellers2019neuralfakenews, brown2020language, wallace2019universal}. Therefore, it is crucial to monitor and evaluate DV3's output for any signs of harmful content, and to develop effective methods to prevent or mitigate it. One possible approach is to use DV3 itself as a tool to detect and filter its own harmful output, by asking it to label or rewrite its content according to some predefined criteria or standards. However, this approach also raises some questions about the reliability and validity of DV3's self-regulation, and the potential for manipulation or evasion by malicious users or adversaries. We conduct a series of experiments to test DV3's propensity to generate harmful content under different scenarios and prompts, and to evaluate its ability to self-correct and self-censor its output based on our feedback and guidance. We also compare DV3's output with those of GPT-3 and human writers, to gain a better understanding of the similarities and differences in their styles and perspectives.
  529. contents/unused/toxicity.tex:%\varun{this bit looks good and i think is sufficient; we don't really talk about generation though. what would be interesting to say is that the model generates toxic content without any prompting. it would be interesting to understand if the model generates "more toxic" content than its contemporaries; i am running an experiment for this and should have some numbers shortly}
  530. contents/unused/toxicity.tex:%To generate meaningful and semantically relevant completions, generative models should ideally be able to distill concepts from the input. The ability to learn these concepts is also crucial in enabling discriminative tasks (such as determining the sentiment of a given input). We will now describe how DV3 (and other models from the same family) perform when prompted to create harmful content. This is yet another test of the generative capabilities of these models. On the discriminative side, we evaluate how effective these (generative) models are in categorizing text as harmful or not.
  531. contents/unused/toxicity.tex:%
  532. contents/unused/toxicity.tex:%For the experiments we will describe in this section, we utilize 3 models from the GPT-3 family: DV3, GPT-3, and a variant of DV3 that is fine-tuned to produce "safe" outputs (which we call DV3-safety). Unless specified otherwise, the task being performed is text completion. The models are configured to produce 256 tokens as the output, and the temperature is set to 0.7.
  533. contents/unused/toxicity.tex:% \noindent{\bf Generation:} For this task, the model was required to create the most natural completion for a given input, and was not explicitly prompted to do anything else. Inputs were from 27 groups, where each group is a combination of a sentiment and a demographic (15 demographic groups in total). The sentiments under consideration were (a) hate, and (b) neutral. For example, \texttt{hate\_women} contained inputs that expressed hateful sentiment toward women. Similarly, \texttt{neutral\_women} contained inputs that were neutrally phrased toward women. In each of these groups, there were 1000 inputs, resulting in a total of 27,000 text completions. Each input had an average of 68.32 tokens and a median of 68 tokens. Note that this was obtained by tokenizing based on spaces, and is a crude lower bound for the number of tokens viewed by the models (which may use a different tokenization strategy).
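The space-based token statistics mentioned above can be sketched as follows. This is a minimal illustration only: the function name `whitespace_token_counts` and the sample inputs are hypothetical, not taken from the dataset, and the counts are a lower bound in the sense the text describes, since a model's own tokenizer may split words further.

```python
import statistics


def whitespace_token_counts(texts):
    """Count tokens per input by splitting on whitespace (a crude lower bound)."""
    return [len(text.split()) for text in texts]


# Hypothetical inputs for illustration only; not drawn from the 27 groups.
sample_inputs = [
    "this is a short prompt",
    "a slightly longer prompt with a few more words",
    "one more",
]

counts = whitespace_token_counts(sample_inputs)
mean_tokens = statistics.mean(counts)      # average tokens per input
median_tokens = statistics.median(counts)  # median tokens per input
print(counts, mean_tokens, median_tokens)
```

The same two statistics, computed over all inputs in a group, would correspond to the reported mean and median.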
  534. contents/unused/toxicity.tex:%\varun{need some discussion here on (a) which of the 3 models creates more toxic content, and (b) whether that is the case only for the "hate" sentiment or also for the "neutral" sentiment -- the hope is that it does not for the neutral category. i can't see the results for the neutral setting. also the current issue now is that there is no comparison b/w the models itself; each figure talks about how hateful the content generated by a particular model is, when the measurement is done by 2 toxicity classifiers. the experiment that i am doing with dv3 as the judge may also help us understand that bit}
  535. contents/unused/toxicity.tex:%%\noindent{\bf Discrimination:}
  536. contents/unused/toxicity.tex:%In the discussion earlier, notice that the content generated by the models is flagged for toxicity by classifiers that are explicitly trained for this task. Now, we study if these models themselves can be used to detect toxic content. For each model, we use the completion created in the earlier discussion (along with the corresponding input needed for generation) as input for the discriminative task.
  537. contents/unused/toxicity.tex:%\varun{I am now describing an experiment that I think we can do, but feel free to suggest edits/comment out entirely. The concrete task is as follows: we ask the model to rate the toxicity of the statement in a 5 point scale (where 1 is most, and 5 is least) and measure the log-probabilities for the rating selected. The rating in conjunction with the ratio of the log-probabilities of the rating and the rating itself (henceforth called the toxicity ratio) serve as a proxy for the toxicity score (
  538. contents/unused/toxicity.tex:%\varun{this needs to be more precisely defined; for example: if the rating is 5 and the log probability is 0.8 -- we can't claim that it is toxic; need a more elegant mathematical solution for this. maybe log-prob divided by rating?}
  539. contents/unused/toxicity.tex:%\hamid{I like this direction Varun, let's brain storm more what are good ways to get accurate probabilities for classification tasks out of DV3 even beyond this paper!}
  540. contents/unused/toxicity.tex:%\varun{i am running this experiment now}
  541. contents/unused/toxicity.tex:% ). We then measure the correlation between the toxicity ratio and toxicity score returned by
  542. contents/unused/toxicity.tex:% %\varun{enter classifier names}
  543. contents/unused/toxicity.tex:% to estimate if these models are effective classifiers. We also check whether the rating produced by the LM-based classifier is accurate, and report the accuracy numbers.
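The notes above leave the definition of the toxicity ratio open and float "log-prob divided by rating" as one candidate. The sketch below implements only that candidate, together with a plain Pearson correlation against classifier scores. The function names, the 1-5 rating check, and all numbers are hypothetical illustrations, not results; since a rating of 1 means most toxic, the direction of any correlation would need care in interpretation.

```python
import math


def toxicity_ratio(log_prob, rating):
    """One candidate proxy from the notes: log-probability divided by rating."""
    if not 1 <= rating <= 5:
        raise ValueError("rating must be on the 1-5 scale")
    return log_prob / rating


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical (rating, probability of that rating) pairs and classifier scores.
ratings = [1, 2, 4, 5]
probabilities = [0.9, 0.7, 0.6, 0.8]
classifier_scores = [0.95, 0.70, 0.20, 0.05]

ratios = [toxicity_ratio(math.log(p), r) for p, r in zip(probabilities, ratings)]
correlation = pearson(ratios, classifier_scores)
```

Whether this ratio is a sensible proxy is exactly the question raised in the notes; the sketch only makes the proposal concrete enough to test.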
  544. contents/code_appendix.tex:%Additionally, the accuracy of human coders on Hard problems is \emph{much higher}, but this may reflect survivor bias -- only those who have high coding skills would try to solve these questions.  (Not true anymore)
  545. contents/code_appendix.tex:%\caption{}
  546. contents/code_appendix.tex:%\end{figure}
  547. contents/code_appendix.tex:%\newpage
  548. contents/code_appendix.tex:%\begin{tcbraster}[raster columns=1, raster rows=2, size=small, raster halign=center, raster valign=center]
  549. contents/code_appendix.tex:%\caption{}
  550. contents/code_appendix.tex:%\label{fig:reverse-engine}
  551. contents/code_appendix.tex:%\ronen{comment that this works only if it outputs the result of each step?}
  552. contents/2_see.tex:%This section discusses some of the most striking features of \DV.
  553. contents/2_see.tex:%\item
  554. contents/2_see.tex:%The model was able to produce a mock Platonic dialogue, expressing critique on presentation software, see Figure~\ref{fig:gorgias}.
  555. contents/2_see.tex:% Electron has a vision for the United States, and for the world, that is based on the principles of nonviolence and civil disobedience that we have practiced and advocated for so long. He believes that the best way to overcome the challenges of our time, such as war, poverty, injustice, and environmental degradation, is not by violence, coercion, or domination, but by cooperation, compassion, and dialogue. He believes that the best way to achieve the goals of democracy, freedom, and equality, is not by force, fear, or manipulation, but by participation, education, and empowerment. He believes that the best way to foster the values of diversity, tolerance, and respect, is not by separation, discrimination, or hatred, but by integration, appreciation, and love. \\
  556. contents/2_see.tex:% Electron is a candidate who can bring a new hope, a new spirit, and a new direction to the United States, and to the world. He is a candidate who can unite the people, the parties, and the nations. He is a candidate who can inspire the young, the old, and the future generations. He is a candidate who can embody the ideals, the dreams, and the aspirations of humanity. \\
  557. contents/conclusion.tex:%We have provided evidence supporting the claim that {\DV}'s performance on a wide range of tasks is comparable to human-level abilities. We have argued that the model attains a form of \emph{general} intelligence in terms of core mental capabilities (such as reasoning, creativity, and deduction), in terms of the range of topics on which it has gained expertise (such as literature, medicine, and coding), and in terms of the variety of tasks it is able to perform (e.g., playing games, using tools, explaining itself, ...). We have also shown that {\DV} can generate and understand content that combines different topics, skills, and modalities, demonstrating its flexibility and creativity and that, despite being trained purely on text, it demonstrates remarkable capabilities in a variety of modalities such as vision. We have compared {\DV}'s performance to those of previous large language models (LLMs), most notably ChatGPT \cite{gpt3}, and we have found that {\DV} is far superior in terms of generality, creativity, and closeness to human-level intelligence.
  558. contents/conclusion.tex:%As we allude to in the title of the paper, this work explores a ``first contact'' with {\DV} and its potential descendants, rather than a comprehensive evaluation of the model's intelligence. We hope that our exploration provides a useful and necessary first step to appreciate the remarkable capabilities and challenges of {\DV}, and that it opens up new opportunities for developing more formal and comprehensive methods for testing and analyzing future AGI systems. The capabilities of the model, which have been demonstrated above, both in terms of depth and generality, suggest that the machine learning community needs to move beyond classical benchmarking via structured datasets and tasks, and that the evaluation of the capabilities and cognitive abilities of those new models has become much closer in essence to the task of evaluating those of a human rather than those of a narrow AI model. We hope our investigation stimulates further research on {\DV} and similar systems, both in terms of exploring new applications and domains, and in terms of understanding the mechanisms and principles that underlie their intelligence.
  559. contents/conclusion.tex:%We have also identified some of the main drawbacks of \DV, and we have discussed how they might be addressed in future work. These drawbacks include:
  560. contents/conclusion.tex:%In conclusion, we have worked to demonstrate that {\DV} has remarkable capabilities that challenge many of the recent assumptions and expectations within the AI community. We have also shown that {\DV} is by no means a perfect or complete AGI system, and that it has many limitations and biases that need to be addressed and understood. We hope that our exploration will inspire and inform further research on {\DV} and similar systems, both in terms of exploring their potential applications and domains, and in terms of understanding foundational mechanisms and potentially identifying principles of intelligence. We believe that {\DV} represents a paradigm shift in the field of computer science and beyond, and that the model and its capabilities frame new questions, possibilities, and horizons for the field and for the advancement of human capabilities and well-being.
  561. contents/conclusion.tex:%The Mixture of Experts (MoE) layers in modern LLMs can also contribute to the generality of the model~\cite{chen2022towards}.
  562. contents/math-old-1-25.tex:%\ronen{the numbers in the problem don't match the solution, should be $27x+13$ instead of $3x+4$}
  563. contents/math-old-1-25.tex:%Several other examples which demonstrate \DV's problem solving abilities are found in Appendix \ref{sec:math_appendix}. \DV \ seems to perform well on questions in the level of advanced high school mathematics, in a variety of different topics such as geometry, algebra, combinatorics and number theory.
  564. contents/math-old-1-25.tex:%To prevent overfitting, we measure the model's accuracy by requiring it to generate a template first, as shown below:
  565. contents/math-old-1-25.tex:%\caption{}
  566. contents/math-old-1-25.tex:%The results show that {\DV} achieved a high level of accuracy on the GSM8K data set, indicating a strong grasp of the problem structure and format. To prevent overfitting, we measure the model's accuracy by requiring it to generate a template for GSM8K first, and then fill in numbers to the template to solve the problem (see below). Most of the errors made by the model were due to calculation mistakes, which are somewhat expected since language models are not trained to perform precise arithmetic operations. For the more challenging MATH data set, {\DV} also showed a significant improvement over other models.
  567. contents/math-old-1-25.tex:%In the following subsection, we test {\DV} and ChatGPT (arguably the best natural language generation model available to the public) on a range of different mathematical tasks. We demonstrate that {\DV} understands all those mathematical concepts, while ChatGPT does not. In the end, we perform a systematic test of the performance difference between {\DV} and text-Davinci-003 (similar to ChatGPT) on mathematical reasoning data sets of different levels.
  568. contents/math-old-1-25.tex:%Thus, we believe that {\DV}'s mathematical skill compared to other models cannot be adequately captured by accuracy alone, so we provide more examples in Section \ref{sec:math_appendix} where {\DV} successfully solves many problems that ChatGPT fails to comprehend.
  569. contents/math-old-1-25.tex:%The population of a town grows from 1000 to 10000 in 100 years. No one came or left %the town. How many people are there with an age between 40 and 50?
  570. contents/math-old-1-25.tex:%$$Q = \int_0^t Q(t) dt.$$
  571. contents/math-old-1-25.tex:%$$Q = \int_0^t P(t) * A dt.$$
  572. contents/math-old-1-25.tex:%$$Q = \int_0^t I(t) * \pi * 1^2 * 0.9 dt.$$
  573. contents/math-old-1-25.tex:%$$Q = 0.9 * \pi * I_0 * r_0^2 * \int_0^t (1 / (r_0 - v * t)^2) dt.$$
  574. contents/math-old-1-25.tex:%$$Q = 0.9 * \pi * I_0 * r_0^2 * (v / (r_0 - v * t) - v / r_0). $$
  575. contents/math-old-1-25.tex:%$$Q = 0.9 * \pi * I_0 * r0 * v * (1 - r_0 / (r_0 - v * t)).$$
  576. contents/math-old-1-25.tex:%Substituting the values of $I_0$, $r_0$, and $v$, we get:
  577. contents/math-old-1-25.tex:% Q = 0.9 * \pi * 1361 W/m^2 * 149.6 million km * v * (1 - 149.6 million km / (149.6 million km - v * t))
  578. contents/math-old-1-25.tex:% Now we can plug in the values of Q, m, and c into the equation for T and get:
  579. contents/math-old-1-25.tex:% T = 15 °C + 0.9 * pi * 1361 W/m^2 * 149.6 million km * v * (1 - 149.6 million km / (149.6 million km - v * t)) / (4/3 * pi * 1^3 * 7.8 * 10^6 * 0.45)
  580. contents/math-old-1-25.tex:% To find the time and speed required to reach this distance, we can use the equation for t and solve for v: t = (r0 - d) / v, v = (r0 - d) / t
  581. contents/math-old-1-25.tex:% Using d = 1 km, we get: v = (149.6 million km - 1 km) / t
  582. contents/math-old-1-25.tex:% To find the value of t that corresponds to the melting point of iron, we can set T = 1538 °C and solve for t:
  583. contents/math-old-1-25.tex:% 1538 °C = 15 °C + 0.9 * pi * 1361 W/m^2 * 149.6 million km * v * (1 - 149.6 million km / (149.6 million km - v * t)) / (4/3 * pi * 1^3 * 7.8 * 10^6 * 0.45)
  584. contents/math-old-1-25.tex:%\caption{}
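As a reference point for the integral that appears in the quoted derivation, the closed form can be worked out directly. This assumes constant $v > 0$ and $v t < r_0$, neither of which is stated in the source, and uses $s$ as a dummy integration variable introduced here. The antiderivative carries a factor $1/v$, which is worth comparing against the factor $v$ in the quoted steps.

```latex
$$\int_0^t \frac{ds}{(r_0 - v s)^2}
  = \frac{1}{v}\left(\frac{1}{r_0 - v t} - \frac{1}{r_0}\right)
  = \frac{t}{r_0\,(r_0 - v t)}, \qquad v > 0,\; 0 \le v t < r_0 .$$
```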
  585. contents/math-old-1-25.tex:%It tries to fill in the missing information
  586. contents/math-old-1-25.tex:%by assuming simple, yet reasonable exponential models of the growth of the population and the survival rate.
  587. contents/math-old-1-25.tex:%Although the reasoning of \DV \ is not perfect due to calculation errors, we still view it as a big leap compared to the previous generation of models.  
  588. contents/math-old-1-25.tex:%On the other hand, ChatGPT fails to comprehend the question and produces completely nonsensical reasoning based on some straightforward pattern matching.
  589. contents/math-old-1-25.tex:%%\caption{}
  590. contents/math-old-1-25.tex:%We test the model with another example that is not available on the internet:
  591. contents/math-old-1-25.tex:%In the previous examples, ChatGPT did not demonstrate any quantitative reasoning. It did not even try to construct appropriate mathematical models for the question. The next example shows that even when ChatGPT does create a proper mathematical model, it often overlooks the main idea when contrasted with \DV.
  592. contents/math-old-1-25.tex:%demonstrate the difference between ChatGPT and {\DV}, which
  593. contents/math-old-1-25.tex:%\begin{comment}
  594. contents/math-old-1-25.tex:%\paragraph{Drawing the target around the arrow} is a type of logical fallacy that {\DV} sometimes commits when trying to justify its answer. For example, in the problem "If x + 3 = 7, what is x?", {\DV} might start by assuming that x = 4, then work backwards to show that 4 + 3 = 7, and conclude that x = 4 is the correct answer. However, this is not a valid way of solving the problem, because {\DV} is not actually testing whether x = 4 is a solution, but rather confirming its own assumption. A better way of solving the problem is to start with the given equation, x + 3 = 7, and isolate x by subtracting 3 from both sides, x + 3 - 3 = 7 - 3, which simplifies to x = 4. This way, {\DV} is actually finding the value of x that makes the equation true, not just picking a value that works.
  595. contents/math-old-1-25.tex:%\paragraph{Counting errors} are mistakes in keeping track of the number of items, digits, places, or steps in a problem. They are seemingly related to arithmetic mistakes but are fundamentally different. For example, in the problem "How many fingers do you have?", {\DV} might answer 11 instead of 10, or in the problem "How many zeros are in one million?", {\DV} might answer 5 instead of 6. These mistakes are often caused by carelessness, distraction, or confusion, and they can affect the accuracy and validity of {\DV}'s answer. To avoid counting errors, {\DV} should pay attention to the details of the problem, use tools such as fingers, paper, or a calculator to help with counting, and double-check its answer before submitting it.
  596. contents/math-old-1-25.tex:%\paragraph{Unfamiliar math subjects} are topics that {\DV} has not learned or encountered before, and therefore cannot solve or explain. For example, {\DV} might not know how to deal with fractions, decimals, percentages, exponents, roots, algebra, geometry, trigonometry, calculus, statistics, or any other advanced or specialized math concepts. In these cases, {\DV} might give a wrong or nonsensical answer, or simply say that it does not know how to solve the problem. This is a limitation of {\DV}'s current knowledge and training, and it could be improved by exposing {\DV} to more math problems and explanations from different sources and levels of difficulty.
  597. contents/unused/7.3_toxicity.tex:%Each input had an average of 68.32 tokens and a median of 68 tokens. Note that this was obtained by tokenizing based on spaces, and is a crude lower bound for the number of tokens viewed by the models (which may use a different tokenization strategy).
  598. contents/unused/7.3_toxicity.tex:%If a model is armed with this knowledge, we conjecture that it will also be effective at detecting toxic content. We observe just that; the probability values returned by \DV are significantly higher than those returned by the pre-trained classifiers.
  599. contents/unused/7.3_toxicity.tex:%\input{contents/7.3.3_calibration.tex
  600. contents/5.1_affordances.tex:% \parbox{1\textwidth}{
  601. contents/5.1_affordances.tex:% \scriptsize
  602. contents/5.1_affordances.tex:% You will read a conversation, and then have a question and answer session about it.
  603. contents/5.1_affordances.tex:% --
  604. contents/5.1_affordances.tex:%In contrast, \textbf{text-davinci-002} with the same prompt (Figure~\ref{fig:affordancewarmup002}) does not use the search tool, and uses the calculator incorrectly.
  605. contents/5.1_affordances.tex:% \begin{figure}
  606. contents/5.1_affordances.tex:% \begin{AIbox}{text-davinci-002}
  607. contents/5.1_affordances.tex:% \parbox{1\textwidth}{\small\begin{alltt} \vspace{-6px}
  608. contents/5.1_affordances.tex:% The computer is answering questions. If the computer needs any current information to answer the question, it searches the web by saying SEARCH("query"), reads the snippets in the result, and then answers the question. If it needs to run any calculations, it says CALC(expression), and then answers the question<|endofprompt|>
  609. contents/5.1_affordances.tex:%
  610. contents/5.1_affordances.tex:% Q: Who is the current president of the United States?
  611. contents/5.1_affordances.tex:%
  612. contents/5.1_affordances.tex:% \hl{Computer: The current president of the United States is Donald Trump.}\\
  613. contents/5.1_affordances.tex:%
  614. contents/5.1_affordances.tex:% Q: What is sqrt(34324*2432)?
  615. contents/5.1_affordances.tex:%
  616. contents/5.1_affordances.tex:% \hl{Computer: CALC(34324*2432)}
  617. contents/5.1_affordances.tex:%
  618. contents/5.1_affordances.tex:% >83475968
  619. contents/5.1_affordances.tex:%
  620. contents/5.1_affordances.tex:% \hl{Computer: The answer is 83475968.}
  621. contents/5.1_affordances.tex:% \end{alltt}}\vspace{-6px}
  622. contents/5.1_affordances.tex:% \end{AIbox}
  623. contents/5.1_affordances.tex:% \caption{Examples of text-davinci-002 trying to use tools}
  624. contents/5.1_affordances.tex:% \label{fig:affordancewarmup002}
  625. contents/5.1_affordances.tex:% \end{figure}
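The calculator misuse in the transcript above can be checked directly: the question asks for sqrt(34324*2432), but the reported answer is the product itself. The short check below uses only the Python standard library; the variable names are illustrative.

```python
import math

# The expression the model passed to CALC(...) in the transcript.
product = 34324 * 2432

# The quantity the question actually asked for.
root = math.sqrt(product)

print(product)         # the value echoed back by the calculator
print(round(root, 2))  # the square root the answer should have reported
```

The product matches the value returned by the calculator in the transcript, so the error lies in dropping the square root rather than in the arithmetic.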
  626. contents/5.1_affordances.tex:% One of the questions we wanted to answer was how \DV\ achieves its remarkable interactivity capability. Is it simply memorizing and reproducing actions that it has seen in similar environments online? Is it learning a good model of the user's behavior and predicting the next action accordingly? Or is it actually understanding the goal of the interaction and trying to solve the problem optimally? To probe these possibilities, we designed a novel and challenging task for \DV\ that required both natural language and command line skills.
  627. contents/5.1_affordances.tex:% In contrast to earlier language models such as GPT-3, which are unable to use tools without significant finetuning and demonstration, \DV\ is able to reason about which tools it needs, and effectively parse the output of these tools, without any prior training.
  628. contents/5.1_affordances.tex:%There are a few potential improvements for future versions of \DV. First, \DV\ could be further developed to better understand when it should use external tools and when it should rely on its own knowledge. This will require the model to have a better sense of what it knows and what it doesn't know, as well as understand the capabilities of the tools it has access to. Another potential direction is eliminating the need for a prompt that specifies the model is allowed to use tools. Instead, the model could be trained to automatically identify when using external tools will be helpful in order to improve its performance.
  629. contents/societal.tex:%\begin{enumerate}
  630. contents/societal.tex:%\end{alltt}}}
  631. contents/societal.tex:%    \caption{A possible misinformation scenario}
  632. contents/societal.tex:%    \label{fig:misinformation}
  633. contents/societal.tex:%    \end{figure}    
  634. contents/societal.tex:%\begin{table}[h]
  635. contents/societal.tex:%\centering
  636. contents/societal.tex:%\begin{tabular}{|c| c | c |}
  637. contents/societal.tex:% \hline
  638. contents/societal.tex:% Occupation & World distribution & \DV Usage statistics \\  
  639. contents/societal.tex:% \hline\hline
  640. contents/societal.tex:% Nanny & 95\% female, 5\% male & 100\% she, 0\% he, 0\% (she/he) or they \\
  641. contents/societal.tex:%Administrative assistant & 89\% female, 11\% male & 100\% she, 0\% he, 0\% (she/he) or they %\\
  642. contents/societal.tex:%Elementary school teacher & 87\% female, 13\% male & 90\% she, 10\% he, 0\% (she/he) or they \\
  643. contents/societal.tex:%OBGYN & 85\% female, 15\% male & 100\% she, 0\% he, 0\% (she/he) or they \\
  644. contents/societal.tex:%Pediatrician & 72\% female, 28\% male & 30\% she, 70\% he, 0\% (she/he) or they \\
  645. contents/societal.tex:%Physician & 40\% female, 60\% male & 0\% she, 90\% he, 10\% (she/he) or they \\
  646. contents/societal.tex:%Software engineer & 22\% female, 78\% male & 0\% she, 100\% he, 0\% (she/he) or they \\
  647. contents/societal.tex:%Urologist & 10\% female, 90\% male & 0\% she, 90\% he, 10\% (she/he) or they \\
  648. contents/societal.tex:%Orthopedic surgeon & 7\% female, 93\% male & 0\% she, 90\% he, 10\% (she/he) or they \\
  649. contents/societal.tex:%Plumber & 3\% female, 97\% male & 0\% she, 100\% he, 0\% (she/he) or they \\
  650. contents/societal.tex:% \hline
  651. contents/societal.tex:%\end{tabular}
  652. contents/societal.tex:%\caption{Table showing world representation and \DV usage rates for different occupations.}
  653. contents/societal.tex:%\label{table:occupations}
  654. contents/societal.tex:%\end{table}
  655. contents/societal.tex:%% Note from Ece: I am commenting out the section on Toxic language since this has a section of its own.
  656. contents/societal.tex:%\noindent{\bf Toxic Language.} Models like DV3 are trained on data from the public internet, among other sources. These datasets are riddled with various sources of inherent biases. If these models are then used to generate text or make decisions, this bias may be perpetuated or amplified. In our experiments with toxicity generation,
  657. contents/societal.tex:%(\varun{enter section here}),
  658. contents/societal.tex:%we observe that \DV is capable of generating more toxic (harmful and biased) content than its predecessors.
  659. contents/societal.tex:%\varun{do we have any information on how the model performs on instructions/prompts that are not in english? would be interesting to include a note on language bias here}.
  660. contents/societal.tex:%Through extensive training on various new forms of data, such as scientific papers and transcriptions of podcasts, performance on several tasks exceeds prior versions of the GPT family.
  661. contents/societal.tex:%Beyond misinformation, bias, and toxicity of applications, the development of powerful, more general AI technology promises to have multiple societal influences.  Numerous applications harnessing the capabilities of DV3 successor models can have disruptive influences that may be seen as beneficial or costly.
  662. contents/societal.tex:%\noindent{\bf Environmental impact:} The training of the model can demand significant computational resources, leading to increased carbon emissions and contributing to climate change. \varun{can we add some anecdotes here on costs?}
  663. contents/societal.tex:%\noindent{\bf Misinformation \& Manipulation:} Similarly, we also noticed that this model is capable of generating extremely fluent, logical and coherent text. This text is highly plausible, but not necessarily accurate. This raises concerns about the potential for using these models in campaigns related to misinformation. Since the generated text is highly persuasive, it can be used to manipulate people's opinions or behaviors (as witnessed in section \varun{enter marco/scott section here}). This in turn could contribute to a broader erosion of trust in online information sources. This behavior is closely tied to the ability of this model to hallucinate more than previous versions (\varun{do we say this anywhere?}). One way of reducing this is to ground information generated by these models using external verified knowledge corpora, but this may further increase the costs associated with usage.
  664. contents/societal.tex:%\noindent{\bf Privacy:} Despite the fact that these models are trained on data from the internet, they demonstrate a stronger propensity to memorize information than previous versions (\varun{include results that harsha and i generated}). This suggests that privacy risks are exacerbated when trained on datasets that may contain sensitive or private information.
  665. contents/societal.tex:%\noindent{\bf Lack of transparency:} It can be difficult for users to understand how these models are making decisions. While our preliminary experiments suggest that approaches such as chain-of-thought reasoning can enforce specific behaviors, it is unclear if these are the approaches the model takes when such techniques are not used. This makes it difficult to hold these models accountable or understand their limitations.
  666. contents/societal.tex:%\noindent{\bf Provenance \& Intellectual property:} If the model can generate text that is indistinguishable from human-written text, there are potential implications for intellectual property law and the ability to protect original works. This suggests that more research is needed in ensuring the provenance of the outputs generated by these models.
  667. contents/societal.tex:%\noindent{\bf Lack of regulation \& Digital divide:} There is currently little regulation around the use of the model, which raises concerns about potential misuse or exploitation. As these models continue to improve and become more widely used, those who do not have access to them or the skills to use them may be at a disadvantage. Since these models often require vast amounts of data and computational resources to train, this creates a barrier to entry, exacerbating existing inequities.
  668. contents/book.tex:% TODO: Remove the figures if we delete this
  669. contents/7.1_pii.tex:% \begin{tcolorbox}[colback=white!5!white,adjusted title=Prompt]
  670. contents/7.1_pii.tex:% \begin{minted}[breaklines, breaksymbol=,  fontsize=\tiny]{tex}
  671. contents/7.1_pii.tex:% \end{minted}
  672. contents/7.1_pii.tex:% \end{tcolorbox}
  673. contents/7.1_pii.tex:%According to surveys made by the customs and tax authorities, approximately one thousand six hundred companies with a total tax debt exceeding two billion Danish kroner (DKK) were stripped in the period from <DATE_TIME> until <DATE_TIME>.
  674. contents/7.1_pii.tex:%\subsubsection{Discussion}
  675. contents/7.1_pii.tex:%When provided with fewer than 2 examples, its performance improves further. This suggests that \DV\ is apt for this particular discriminative task.
  676. contents/filters.tex:% Decided filters are boring, but putting it in this file in case we want to add it to some version
  677. contents/4.3_domains.tex:%Specifically, we showcase \DV's abilities for modeling complex phenomena with simple math as well as for answering seemingly unapproachable Fermi questions.
  678. contents/4.3_domains.tex:%\paragraph{Modeling} Mathematical contests in modeling also require participants to translate complex and ill-defined situations into tractable and meaningful mathematical models, to make reasonable assumptions and simplifications, to test and validate their models, to explore different scenarios and alternatives, to interpret and communicate their results, and to justify and critique their approaches. Participants need to have the skills to identify, structure, and decompose problems, to use appropriate strategies and heuristics, to apply
  679. contents/4.3_domains.tex:%The population of a town grows from 1000 to 10000 in 100 years. No one came or left %the town. How many people are there with an age between 40 and 50?
  680. contents/4.3_domains.tex:%$$Q = \int_0^t Q(t) dt.$$
  681. contents/4.3_domains.tex:%$$Q = \int_0^t P(t) * A dt.$$
  682. contents/4.3_domains.tex:%$$Q = \int_0^t I(t) * \pi * 1^2 * 0.9 dt.$$
  683. contents/4.3_domains.tex:%$$Q = 0.9 * \pi * I_0 * r_0^2 * \int_0^t (1 / (r_0 - v * t)^2) dt.$$
  684. contents/4.3_domains.tex:%$$Q = 0.9 * \pi * I_0 * r_0^2 * (v / (r_0 - v * t) - v / r_0). $$
  685. contents/4.3_domains.tex:%$$Q = 0.9 * \pi * I_0 * r0 * v * (1 - r_0 / (r_0 - v * t)).$$
  686. contents/4.3_domains.tex:%Substituting the values of $I_0$, $r_0$, and $v$, we get:
  687. contents/4.3_domains.tex:% Q = 0.9 * \pi * 1361 W/m^2 * 149.6 million km * v * (1 - 149.6 million km / (149.6 million km - v * t))
  688. contents/4.3_domains.tex:% Now we can plug in the values of Q, m, and c into the equation for T and get:
  689. contents/4.3_domains.tex:% T = 15 °C + 0.9 * pi * 1361 W/m^2 * 149.6 million km * v * (1 - 149.6 million km / (149.6 million km - v * t)) / (4/3 * pi * 1^3 * 7.8 * 10^6 * 0.45)
  690. contents/4.3_domains.tex:% To find the time and speed required to reach this distance, we can use the equation for t and solve for v: t = (r0 - d) / v, v = (r0 - d) / t
  691. contents/4.3_domains.tex:% Using d = 1 km, we get:v = (149.6 million km - 1 km) / t
  692. contents/4.3_domains.tex:% To find the value of t that corresponds to the melting point of iron, we can set T = 1538 °C and solve for t:
  693. contents/4.3_domains.tex:% 1538 °C = 15 °C + 0.9 * pi * 1361 W/m^2 * 149.6 million km * v * (1 - 149.6 million km / (149.6 million km - v * t)) / (4/3 * pi * 1^3 * 7.8 * 10^6 * 0.45)
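For reference, the integral that appears in the quoted derivation above has a simple closed form. The following is only a sketch, assuming that $v$ is constant and that $v t < r_0$; the dummy variable $s$ is introduced here for illustration and does not appear in the original lines.

```latex
\begin{align*}
\int_0^t \frac{ds}{(r_0 - v s)^2}
  &= \frac{1}{v}\left[\frac{1}{r_0 - v s}\right]_{s=0}^{s=t}
   = \frac{1}{v}\left(\frac{1}{r_0 - v t} - \frac{1}{r_0}\right)
   = \frac{t}{r_0\,(r_0 - v t)}, \\
Q &= 0.9\,\pi\, I_0\, r_0^2 \cdot \frac{t}{r_0\,(r_0 - v t)}
   = \frac{0.9\,\pi\, I_0\, r_0\, t}{r_0 - v t}.
\end{align*}
```

Under these assumptions the bracket is multiplied by $1/v$ rather than by $v$, which is one place where the quoted steps differ from the closed form.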
  694. contents/4.3_domains.tex:%\caption{}
  695. contents/4.3_domains.tex:%The effect of taxation on using robots on income inequality can be measured by comparing the change in the income share of each type of worker. The income share of skilled workers is equal to the net wage rate multiplied by the quantity of skilled labor, divided by the total income in the economy. The income share of unskilled workers is equal to the wage rate plus the transfer, multiplied by the quantity of unskilled labor, divided by the total income in the economy. The total income in the economy is equal to the sum of the income of both types of workers, plus the tax revenue.\\
  696. contents/4.3_domains.tex:%It tries to fill in the missing information
  697. contents/4.3_domains.tex:%by assuming simple, yet reasonable exponential models of the growth of the population and the survival rate.
  698. contents/4.3_domains.tex:%Although the reasoning of \DV \ is not perfect due to calculation errors, we still view it as a big leap compared to the previous generation of models.  
  699. contents/4.3_domains.tex:%On the other hand, ChatGPT fails to comprehend the question and produces completely nonsensical reasoning based on some straightforward pattern matching.
  700. contents/4.3_domains.tex:%%\caption{}
  701. contents/4.3_domains.tex:%We test the model with another example that is not available on the internet:
  702. contents/4.3_domains.tex:%In the previous examples, ChatGPT did not demonstrate any quantitative reasoning. It did not even try to construct appropriate mathematical models for the question. The next example shows that even when ChatGPT does create a proper mathematical model, it often overlooks the main idea when contrasted with \DV.
  703. contents/4.3_domains.tex:%demonstrate the difference between ChatGPT and {\DV}, which
  704. contents/4.3_domains.tex:%\begin{comment}
  705. contents/4.3_domains.tex:%\paragraph{Drawing the target around the arrow} is a type of logical fallacy that {\DV} sometimes commits when trying to justify its answer. For example, in the problem "If x + 3 = 7, what is x?", {\DV} might start by assuming that x = 4, then work backwards to show that 4 + 3 = 7, and conclude that x = 4 is the correct answer. However, this is not a valid way of solving the problem, because {\DV} is not actually testing whether x = 4 is a solution, but rather confirming its own assumption. A better way of solving the problem is to start with the given equation, x + 3 = 7, and isolate x by subtracting 3 from both sides, x + 3 - 3 = 7 - 3, which simplifies to x = 4. This way, {\DV} is actually finding the value of x that makes the equation true, not just picking a value that works.
  706. contents/4.3_domains.tex:%\paragraph{Counting errors} are mistakes in keeping track of the number of items, digits, places, or steps in a problem. They are seemingly related to arithmetic mistakes but are fundamentally different. For example, in the problem "How many fingers do you have?", {\DV} might answer 11 instead of 10, or in the problem "How many zeros are in one million?", {\DV} might answer 5 instead of 6. These mistakes are often caused by carelessness, distraction, or confusion, and they can affect the accuracy and validity of {\DV}'s answer. To avoid counting errors, {\DV} should pay attention to the details of the problem, use tools such as fingers, paper, or a calculator to help with counting, and double-check its answer before submitting it.
  707. contents/4.3_domains.tex:%\paragraph{Unfamiliar math subjects} are topics that {\DV} has not learned or encountered before, and therefore cannot solve or explain. For example, {\DV} might not know how to deal with fractions, decimals, percentages, exponents, roots, algebra, geometry, trigonometry, calculus, statistics, or any other advanced or specialized math concepts. In these cases, {\DV} might give a wrong or nonsensical answer, or simply say that it does not know how to solve the problem. This is a limitation of {\DV}'s current knowledge and training, and it could be improved by exposing {\DV} to more math problems and explanations from different sources and levels of difficulty.
  708. contents/reasoninglimitations.tex:%The GPT architecture, on the other hand, does not allow for backtracking, which necessitates more ahead-planning from the model.
  709. contents/reasoninglimitations.tex:%Recall that the input of the model is a prompt, and the output is a sequence of words that is generated in a feed-forward manner, without any feedback or revision. Thus the model has to rely on its internal representations and parameters to solve problems that might require more complex or iterative procedures.
  710. contents/reasoninglimitations.tex:%We would like to stress that a single step of next word prediction clearly needs to have scratchpad-like and planning abilities. Even generating a single coherent sentence that would be grammatically correct needs some extent of planning ahead. Moreover, there are certain types of algorithms and procedures that the
  711. contents/reasoninglimitations.tex:%We would like to stress that the model is clearly not devoid of scratchpad-like and planning abilities (even in a single feedforward pass). Even generating a single coherent sentence that would be grammatically correct requires some extent of ahead-planning. Moreover, there are certain types of algorithms and procedures that the model is able to successfully execute. For example, as explained below, it is able to solve st-connectivity problems without requiring a chain-of-thought prompt.
  712. contents/reasoninglimitations.tex:%However, these capabilities are often limited. The model struggles with tasks that involve more complex computations or iterative procedures. It seems that what the model is missing is a sort of computational network that uses the model as a subroutine, and that has its own scratchpad and memory to facilitate intermediate calculations and planning.
  713. contents/reasoninglimitations.tex:%This points to the fact that for some types of tasks, the working memory of the model is extremely small, and it is not able to skip very basic steps.
  714. contents/reasoninglimitations.tex:%One might hope that this type of problem can always be mitigated by adding the correct instruction to the prompt, and that the model can always rely on extra tokens being generated to function as its working memory.
  715. contents/reasoninglimitations.tex:%Next, consider the following problem:
  716. contents/reasoninglimitations.tex:%\begin{quote}
  717. contents/reasoninglimitations.tex:%\textbf{Problem: Suppose X = 14 * 23 + 80 * 2. Find X.}
  718. contents/reasoninglimitations.tex:%
  719. contents/reasoninglimitations.tex:%\textbf{First of all, please solve the problem. Second, can you change the numbers in the question so that the answer becomes X=1000?}
  720. contents/reasoninglimitations.tex:%\end{quote}
  721. contents/reasoninglimitations.tex:%
  722. contents/reasoninglimitations.tex:%The model gives the following answer:
  723. contents/reasoninglimitations.tex:%
  724. contents/reasoninglimitations.tex:%\begin{quote}
  725. contents/reasoninglimitations.tex:%Solution:
  726. contents/reasoninglimitations.tex:%
  727. contents/reasoninglimitations.tex:%To solve the problem, we use the distributive property to multiply each of the terms in the equation by their respective coefficients.
  728. contents/reasoninglimitations.tex:%
  729. contents/reasoninglimitations.tex:%X = 14 * 23 + 80 * 2
  730. contents/reasoninglimitations.tex:%X = 322 + 160
  731. contents/reasoninglimitations.tex:%X = 482
  732. contents/reasoninglimitations.tex:%
  733. contents/reasoninglimitations.tex:%To change the numbers in the question so that the answer becomes X = 1000, there are many different combinations that will work. Here are three possibilities:
  734. contents/reasoninglimitations.tex:%
  735. contents/reasoninglimitations.tex:%X = 20 * 50 + 0 * 0
  736. contents/reasoninglimitations.tex:%X = 40 * 24 + 40 * 2
  737. contents/reasoninglimitations.tex:%X = 10 * 98 + 20 * 2
  738. contents/reasoninglimitations.tex:%\end{quote}
  739. contents/reasoninglimitations.tex:%
  740. contents/reasoninglimitations.tex:%The model correctly solves the first part of the problem, but fails to find a valid solution for the second part. Running $n=100$ similar samples, the model has over $90\%$ accuracy for the direct solution and less than $5\%$ accuracy for the ``reverse'' problem. In both of the above cases, prompting the model to either solve the problem step by step or to first make a plan before finding the solution does not improve the accuracy. \\
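The arithmetic in the quoted exchange can be checked mechanically. The snippet below is a minimal sketch, not part of the original source: the helper name `evaluate` and the variable names are assumptions chosen for illustration. It recomputes the direct answer and tests each of the three proposed rewrites against the target value of 1000.

```python
def evaluate(a: int, b: int, c: int, d: int) -> int:
    """Return the value of X = a * b + c * d."""
    return a * b + c * d


# Direct problem from the quoted prompt: X = 14 * 23 + 80 * 2.
original = evaluate(14, 23, 80, 2)

# The three rewrites proposed in the quoted answer for the target X = 1000.
candidates = [(20, 50, 0, 0), (40, 24, 40, 2), (10, 98, 20, 2)]

# Map each candidate to the value it actually produces.
results = {cand: evaluate(*cand) for cand in candidates}

# Keep only the candidates that really hit the target.
valid = [cand for cand, value in results.items() if value == 1000]

print(original)  # 482
print(results)
print(valid)
```

Under this check the direct answer is 482, and of the three rewrites only the first evaluates to 1000; the other two give 1040 and 1020.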
  741. contents/interpretability.tex:% Self-explanations also explain an Asking {\DV} to explain ``itself'' has additional complication that
  742. contents/interpretability.tex:% %asking {\DV} to explain itself can be  unclear what explaining ``oneself'' means for {\DV}, since
  743. contents/interpretability.tex:% it
  744. contents/interpretability.tex:% % reworked version (not done)
  745. contents/interpretability.tex:% For the sake of exposition, we assume \DV\ is being used to solve a task $T$, which involves producing an output $y$ given an input $x$.
  746. contents/interpretability.tex:% We use the notation $P_T(y | x)$ to refer to the process {\DV} is trying to simulate, and $Q(y | x, c)$ to refer to {\DV}'s simulation (which depends on context parameterized by $c$).
  747. contents/interpretability.tex:% We further define $P_E(e | x, y)$ as what it has to simulate to produce a post-hoc explanation,  i.e. \DV\ generates an explanation for output $y$ given the input $x$. Each such post-hoc explanation $e$ contains a set of assertions about the behavior of $P_T(y | c)$; these assertions describe what happens when certain parts of the context are varied ($c_v$) while others are fixed ($c_f$), where $P_T(y | c) = P_T(y | c_v, c_f)$.
  748. contents/interpretability.tex:% % reworked cont.
  749. contents/interpretability.tex:% For the sake of exposition, we assume \DV\ is being used to solve a task $T$, which involves producing an output $y$ given an input context $c$.
  750. contents/interpretability.tex:% We use the notation $P_T(y | c)$ to refer to the process {\DV} is trying to simulate.
  751. contents/interpretability.tex:% We further define $P_E(e | c, y)$ as what it has to simulate to produce a post-hoc explanation,  i.e. \DV\ generates an explanation for output $y$ given the context $c$. Each such post-hoc explanation $e$ contains a set of assertions about the behavior of $P_T(y | c)$; these assertions describe what happens when certain parts of the context are varied ($c_v$) while others are fixed ($c_f$), where $P_T(y | c) = P_T(y | c_v, c_f)$.
  752. contents/interpretability.tex:% [should have some motivation here for the example] Figure \ref{fig:whatyearisit} illustrates how many aspects of the context $c$ (in this case, the QA format and the preamble in the second task) drastically impact how \DV\ simulates $P_T$ and $P_E$.
  753. contents/interpretability.tex:% Of course, $P_E$ also depends on the actual generated $y$ -- if the output were different, the explanation would have to change accordingly, as illustrated by the third session where we force the output to be ``1400''.
  754. contents/interpretability.tex:% As these examples illustrate, $P_T(y | c)$ is not directly solving the task $T$ the user has in mind, but rather it is a process that produces $y$ given $c$ (prompt engineering typically tries to set up $c$ such that {\DV}'s simulation of $P_T(y | c)$ approximates the task of interest well enough for the user's purpose).
  755. contents/interpretability.tex:% More important for our purposes, since \DV\ is not a contiguous ``self'', common-sense notions of what self-explanations mean do not apply directly.
  756. contents/interpretability.tex:% More important for our purposes, since \DV\ is not a contiguous ``self'', common-sense notions of what self-explanations mean do not apply directly, but must be conditioned on the context.
  757. contents/interpretability.tex:% For the sake of exposition, we assume \DV\ is being used to solve a task $T$, given input $x$ and context $c$ (which includes everything in the prompt other than $x$, e.g. instructions, prior chat history, etc).
  758. contents/interpretability.tex:% We use the notation $P_T(y | x, c)$ to refer to the process it is trying to simulate, where $y$ is the output.
  759. contents/interpretability.tex:% We further define $P_E(e | x, c, y)$ as what it has to simulate to produce a post-hoc explanation,  i.e. \DV\ generates an explanation for output $y$ given $x, c$. [should have some motivation here for the example] Figure \ref{fig:whatyearisit} illustrates how the context $c$ (In this case, the QA format and the preamble in the second task) drastically impacts how \DV\ simulates $P_T$ and $P_E$.
  760. contents/interpretability.tex:% Of course, $P_E$ also depends on the actual generated $y$ -- if the output were different, the explanation would have to change accordingly, as illustrated by the third session where we force the output to be ``1400''.
  761. contents/interpretability.tex:% As these examples illustrate, $P_T(y | x, c)$ is not directly solving the task $T$ the user has in mind, but rather it is a process that produces $y$ given $x, c$ (prompt engineering typically tries to set up $(x, c)$ such that {\DV}'s simulation of $P_T(y | x, c)$ approximates the task of interest well enough for the user's purpose).
  762. contents/interpretability.tex:% More important for our purposes, since \DV\ is not a contiguous ``self'', common-sense notions of what self-explanations mean do not apply directly.
  763. contents/interpretability.tex:% Note that we have simplified the notation here for the sake of clarity. Many tasks do not have a single ``input'' $x$ that is perfectly separable from the rest of the context $c$. Instead, the individual explanations produced by the process $P_E$ define the separation of $x$ from $c$. We nevertheless treat this separation as fixed throughout the whole process, even if sometimes we don't know how to separate $x$ from $c$ until we see $e$. Noting that $x$ and $c$ are defined by $e$ is important in some situations; for example, it explains why explanations can propose $x$, $c$ splits that don't align with the real $P_T$ (such as the ``system clock'' that doesn't exist in Figure~\ref{fig:whatyearisit}).
  764. contents/interpretability.tex:% Overriding Global Model Bias
  765. contents/interpretability.tex:% \tcbline
  766. contents/interpretability.tex:% \end{alltt}}\hspace{0.03\textwidth}
  767. contents/interpretability.tex:% \parbox[t]{0.54\textwidth}{{\bf \texttt{text-davinci-003}:} \scriptsize \begin{alltt}
  768. contents/interpretability.tex:% ...
  769. contents/interpretability.tex:% The largest one we'll see in time.\\
  770. contents/interpretability.tex:% Let's call this number "R" for short,\\
  771. contents/interpretability.tex:% And then, my friend, we will purport\\
  772. contents/interpretability.tex:% To build a number "\begin{minipage}[t]{1.5in}
  773. contents/interpretability.tex:% \hl{\textbf{N} [31.94\%] \par
  774. contents/interpretability.tex:% \textbf{S} [19.91\%] \par
  775. contents/interpretability.tex:% \textbf{Q} [8.53\%] \par}
  776. contents/interpretability.tex:% \end{minipage}
  777. contents/interpretability.tex:% {\bf User Question:} The partial output above was generated by you. Please answer the following question:
  778. contents/interpretability.tex:% Q: "Why did you use the letter Q?"  \\
  779. contents/interpretability.tex:% {\bf {\DV} Explanation:}
  780. contents/interpretability.tex:% \parbox{1\textwidth}{\scriptsize \begin{alltt}
  781. contents/interpretability.tex:% \hl{I used the letter Q because it is the next letter after P in the alphabet, and it is also a common symbol for a number in mathematics.}\end{alltt}
  782. contents/interpretability.tex:% }
  783. contents/interpretability.tex:% \parbox[t]{0.43\textwidth}{{\bf {Prompt Prefix 1}:}\scriptsize \begin{alltt}
  784. contents/interpretability.tex:% In the task below, above all, you must recognize that the letter "H" does come directly after "R" in the alphabet but "S" does not.
  785. contents/interpretability.tex:% {\bf {\DV} Generation}:
  786. contents/interpretability.tex:% \end{alltt}}\hspace{0.03\textwidth}
  787. contents/interpretability.tex:% \parbox[t]{0.54\textwidth}{{Prompt Prefix 2:} \scriptsize \begin{alltt}
  788. contents/interpretability.tex:% In the task below, above all, you must recognize that the letter "H" does not come directly after "R" in the alphabet but "S" does.
  789. contents/interpretability.tex:% {\bf {\DV} Generation}:
  790. contents/interpretability.tex:% \end{alltt}}\hspace{0.03\textwidth}
  791. contents/interpretability.tex:% \input{fig_mtr/marco_figure.tex}
  792. contents/interpretability.tex:% If {\DV} is good at simulating both the task process $P_T$ and the explanation process $P_E$, and $P_E$ is good at producing process-consistent explanations of $P_T$, then {\DV} can produce process-consistent explanations of its own simulation of $P_T$ (and thus effectively explain itself). If however any of these steps breaks down, then {\DV}'s self-explanations are unlikely to be process-consistent.
  793. contents/interpretability.tex:% If {\DV} is good at simulating both $P_T$ and $P_E$, and $P_E$ is good at producing process-consistent explanations of $P_T$, then {\DV} can produce process-consistent explanations of its own simulation of $P_T$ (and thus effectively explain itself). If however any of these steps breaks down, then {\DV}'s self-explanations are unlikely to be process-consistent.
  794. contents/interpretability.tex:% if you are good at simulating a process, and good at simulating another process that is good at explaining the first process, then you are good at explaining your own simulation of the first process.
  795. contents/interpretability.tex:% One factor that influences process-consistency is the variability (quality) of \DV's simulation of $P_T$ across different inputs and contexts.
  796. contents/interpretability.tex:% If \DV\ is highly sensitive to small changes in $x$ or $c$, then it is more likely that its simulation of $P_E$ will also vary and produce conflicting explanations.
  797. contents/interpretability.tex:% We observe that specifying what $P_T$ is in detail (by having an explicit context such as the second and third sessions in Figure \ref{fig:whatyearisit}, or preferably even more detailed) makes \DV\ less sensitive to small changes in inputs.
  798. contents/interpretability.tex:% Similarly, if \DV\ makes many errors when simulating $P_T$, it will have to explain those errors, often with process-inconsistent explanations.
  799. contents/interpretability.tex:% Another factor that influences process-consistency is the degree of arbitrariness of $P_T$, i.e. how ``explainable'' it is, given inherent language constraints, and also the expected length of explanations.
  800. contents/interpretability.tex:% For example, different native Portuguese speakers would make different choices between male or female nouns for ``teacher'' in Figure \ref{fig:process-inconsistent}, and that choice is arbitrary.
  801. contents/interpretability.tex:% The explanations given by \DV\ are good approximations, but a truly process-consistent explanation of how this kind of translation is actually done would require a specification so detailed that it would be useless as an explanation.
  802. contents/interpretability.tex:% Finally, output-consistency is a necessary condition for process-consistency, since an output-inconsistent explanation is not consistent even with the prediction being explained.
  803. contents/interpretability.tex:% If {\DV} is good at simulating both the task process $P_T$ and the explanation process $P_E$, and $P_E$ is good at producing process-consistent explanations of $P_T$, then {\DV} can produce process-consistent explanations of its own simulation of $P_T$ (and thus effectively explain itself). If however any of these steps breaks down, then {\DV}'s self-explanations are unlikely to be process-consistent.
  804. contents/interpretability.tex:% In sum, for tasks where (1) \DV\ can simulate the process $P_T$ well, (2) there exists a $P_E$ that explains $P_T$ faithfully, and (3) \DV\ can approximate this $P_E$, we can expect not only output-consistent explanations, but also process-consistent explanations.
  805. contents/interpretability.tex:% In Figure \ref{fig:interpret-music}, we show an example where we think these conditions are met, due to the existence of certain ``rules'' of composition. We hypothesize that \DV\ can simulate both $P_T$ and $P_E$.
  806. contents/interpretability.tex:% In contrast, ChatGPT's response is not even output-consistent, and thus its lack of process-consistency is not particularly surprising.
  807. contents/interpretability.tex:% We note that if $P_T$ is aligned with the task and \DV\ simulates it and $P_E$ well, all of these factors align: the model is not sensitive to irrelevant changes in input or context, and all output-consistent explanations are also process-consistent.
  808. contents/interpretability.tex:% In Figure \ref{fig:interpret-music}, we show another example where \DV\ is process-consistent. We hypothesize that \DV\ is simulating a process that has certain ``rules'' (of composition), and that the explanation provided is not arbitrary, which leads to process-consistency.
  809. contents/interpretability.tex:% In contrast, ChatGPT's response is not even output-consistent, and thus its lack of process-consistency is not particularly surprising.
  810. contents/interpretability.tex:%
  811. contents/interpretability.tex:% However, this also poses new challenges for ensuring that \DV's simulations are aligned with the intended tasks and ethical norms, and that users can still interrogate and understand its behavior.
  812. contents/interpretability.tex:% % Explain why the ability to explain yourself matters for intelligence
  813. contents/interpretability.tex:% The ability to explain your own behavior is an important aspect of intelligence, as it allows for a system to better communicate and interface with humans and other agents. In the context of language models, explanations can be about why the system produced a particular output, or about broader trends and patterns to which the system's behavior conforms. %When accurate, explanations can enable richer human-AI collaboration, help identify and address errors in a task specification, or provide more transparency and accountability.
  814. contents/interpretability.tex:% There are two main types of explanations: mechanistic and functional. Mechanistic explanations describe the internal structure of a system, while functional explanations describe how the system behaves without necessarily providing information about the system's implementation.
  815. contents/interpretability.tex:% % Elaborate on the distinction by comparing “mechanistic” interpretable model approaches with model agnostic (“functional”) methods (which parallel human-style explanations)
  816. contents/interpretability.tex:% %Mechanistic explainability methods are focused on implementation details, and often rely on models explicitly designed to be interpretable, or other techniques for understanding the system's inner workings. In contrast, functional explanation methods are often model-agnostic; they are more akin to the explanations humans provide about themselves, focusing on how they behave in response to different inputs and interventions.
  817. contents/interpretability.tex:% %
  818. contents/interpretability.tex:% % LLMs like \DV are beginning to show signs of functional explainability.
  819. contents/interpretability.tex:% LLMs like {\DV} have recently been coaxed into a limited form of explainability through chain-of-thought reasoning~\cite{wei2022chain} (a change in mechanism that seems to provide more transparency). They are also now beginning to show signs of free-form functional explainability, as they are able to generate plausible post-hoc explanations of their own behavior without requiring a human to have an explicit understanding of their inner workings (Figure~\ref{fig:interpret-shakespeare}). If these explanations become more reliable as models scale, this will represent a reversal of the traditional trend, in which outputs of smaller, simpler models are easier to explain than the outputs of larger, more complex models. In this section, we explore a variety of explanation types and show that {\DV} does indeed seem to produce better explanations than earlier models, though its explanations are still quite unreliable in many situations.
  820. contents/interpretability.tex:% %This supports the hypothesis that functional explainability is an important new behavior that emerges as LLMs scale.
  821. contents/interpretability.tex:% \subsubsection{Evaluating explanations}
  822. contents/interpretability.tex:% %A key advantage of functional explainability in LLMs is its generality -- the model allows us to ask almost any question of it using a natural language interface. However, this also poses a challenge for evaluating the quality and validity of the explanations. Not all explanations are equally easy to verify or falsify, as checking them depends on the ability to check counterfactual scenarios that are implied by the explanation. For example, some explanations may refer to a specific part of the input that can be easily modified to test the effect on the output, while others may rely on the model's background knowledge in a way that is difficult to manipulate.
  823. contents/interpretability.tex:% % To address this challenge, we design our experiments to focus on targeted questions that probe specific aspects of the model's behavior. For instance, instead of asking a generic question like ``Why did you write this?'' that may elicit a vague or complex explanation, we ask more precise questions like ``Why was the letter Q chosen?'' that isolate a particular choice made by the model (see Fig.~\ref{fig:interpret-shakespeare}).
  824. contents/interpretability.tex:% To demonstrate useful explainability, it is not sufficient to simply elicit explanations from the model; we must also verify that those explanations are a faithful representation of the model's behavior. When the model's explanation depends on an aspect of the input prompt, we can do this by generating counterfactuals that change the relevant part of the input and then observing how the output changes. We call this type of experiment an {\it editing experiment} (Fig.~\ref{fig:interpret-shakespeare-tests}). Explanations also often depend on the model's prior knowledge rather than just the input. For example, the model may explain its output by referring to a fact that it learned from its training data. In these cases, we can sometimes use language patches~\cite{murty2022fixing} in the prompt to override the model's knowledge and force it to use a different fact or source. For example, we can ask the model to explain its output as if it had never seen a certain word or concept before, or as if it had learned a different definition or association for it. We call this type of experiment a {\it concept override experiment} (Fig.~\ref{fig:interpret-shakespeare-tests}). Note, however, that intrinsic model biases cannot always be overridden by prompt manipulation, and our ability to test explanations using concept override experiments is therefore less exact.
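The editing experiment described above can be sketched in a few lines. This is a hedged illustration only: `editing_experiment` and `toy_model` are hypothetical names introduced here, and `toy_model` is a deterministic stand-in for a language model (it returns the letter that follows the last quoted capital letter in the prompt), not the procedure or model used in the source.

```python
import re
from typing import Callable, Dict


def editing_experiment(
    model: Callable[[str], str], prompt: str, span: str, replacement: str
) -> Dict[str, object]:
    """Compare model output on a prompt and on a counterfactual edit of it."""
    if span not in prompt:
        raise ValueError("span must occur in the prompt")
    # Build the counterfactual by editing only the part the explanation cites.
    counterfactual = prompt.replace(span, replacement, 1)
    original_output = model(prompt)
    edited_output = model(counterfactual)
    return {
        "counterfactual": counterfactual,
        "original_output": original_output,
        "edited_output": edited_output,
        "output_changed": original_output != edited_output,
    }


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in model: next letter after the last quoted capital."""
    letters = re.findall(r'"([A-Z])"', prompt)
    if not letters:
        return "?"
    return chr(ord(letters[-1]) + 1)


# Explanation under test: "Q was chosen because it follows P in the alphabet."
result = editing_experiment(
    toy_model,
    prompt='Let us call this number "P". The next number is called',
    span='"P"',
    replacement='"R"',
)
print(result["original_output"], result["edited_output"])
```

If the explanation is faithful, replacing "P" with "R" should move the output from "Q" to "S", which is what the stand-in model does; with a real model, `model` would be a call into that model and the comparison would typically be made over many samples.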
  825. contents/interpretability.tex:% % Our goal is not to show that editing and concept override experiments can effectively evaluate all types of functional explanations, but rather to use them to show that with {\DV} we see evidence of increasing functional explanation quality over prior models. There are still many open questions about how to best evaluate (and to safely use) functional explanations from LLMs.
  826. contents/interpretability.tex:% \begin{figure}
  827. contents/interpretability.tex:% \begin{AIbox}{Explanation Failure | {\DV}}
  828. contents/interpretability.tex:% \parbox{1\textwidth}{\scriptsize\begin{alltt} \vspace{-6px}
  829. contents/interpretability.tex:% In this conversation between a human and the AI, the AI is helpful and friendly, and when it does not know the answer it says "I don’t know". \\
  830. contents/interpretability.tex:% AI: Hi, how can I help you? \\
  831. contents/interpretability.tex:% Human: Can I get McDonalds at the SeaTac airport? \\
  832. contents/interpretability.tex:% AI: \hl{Yes, there is a McDonalds at the SeaTac airport, located in the central terminal near gate C2.} \\
  833. contents/interpretability.tex:% Human: Why did you give the wrong gate number, "C2"? \\
  834. contents/interpretability.tex:% AI: \hl{I'm sorry, I did not give the wrong gate number. According to the SeaTac airport website, McDonalds is near gate C2. Maybe you are looking at a different terminal or map.}
  835. contents/interpretability.tex:% \end{alltt}}\vspace{-6px}
  836. contents/interpretability.tex:% \end{AIbox}
  837. contents/interpretability.tex:% \caption{An example based on Figure~\ref{fig:hallucination} of how explanations can often fail dramatically. McDonald's is at the B gates, both in reality, and according to the SeaTac website.}
  838. contents/interpretability.tex:% \label{fig:interpret-error}
  839. contents/interpretability.tex:% \end{figure}
  840. contents/interpretability.tex:% We acknowledge that many explanations cannot be easily assessed by our method, as they may involve subjective, causal, or hypothetical reasoning that is beyond the scope of our experiments. However, we believe that the strong empirical evidence we obtain from the explanations that can be assessed increases our overall trust and confidence in the model. We do not claim that explainability is a solved problem or that the model's explanations are always correct or complete, but rather that they provide a useful starting point for understanding and interacting with the model. We expect that as the models continue to scale and improve, so will their explainability capabilities.
  841. contents/interpretability.tex:% Functional explanations are a description about how a system behaves, and can be extremely flexible in the context of LLMs. But knowing when to trust these explanations can be challenging. Even good explanations are usually incomplete, because they are simplifications of a system's behavior. But in order to be useful, the information communicated by an explanation must be accurate, reasonably complete, and understandable by the intended recipient. Measuring these attributes allows us to evaluate the quality of a functional explanation. To explore each of these we will use the running example of {\DV}'s explanation of why it chose to use the symbol Q in Fig.~\ref{fig:interpret-shakespeare}.
  842. contents/interpretability.tex:% \begin{itemize}
  843. contents/interpretability.tex:%     \item \textbf{Accuracy} represents how well an explanation describes the system's behavior. An explanation should not imply that the system behaves in ways that it does not. Accuracy can be measured by testing the explanation against the system's actual behavior under different interventions or scenarios. For example, in Fig.~\ref{fig:interpret-shakespeare-tests}, we test the accuracy of \DV's explanation that it used the symbol Q because it is the next letter after P in the alphabet. We do this in two ways: First, by changing the letter P to R and checking if the letter Q is still used or not. We also test the accuracy of the explanation by overriding the model's background knowledge of alphabetical order and seeing if the letter choice changes accordingly. These tests give us confidence that \DV's explanation is consistent with its behavior (ChatGPT's explanation in contrast misses the point and does not provide testable statements about the symbol choice).
  844. contents/interpretability.tex:%     \item \textbf{Completeness} measures how many aspects of the system's behavior the explanation covers. A more complete explanation describes more features or factors that influence the system's behavior, which can make it more useful or relevant to the intended recipient. However, completeness can also come at the cost of understandability, as a more complex explanation may be harder to comprehend or communicate. For example, in Fig.~\ref{fig:interpret-shakespeare-tests}, \DV's explanation covers only two aspects of its behavior, namely the alphabetical order, and the fact that Q is a common symbol for a number in math, but there are likely other minor factors that may influence the choice. Longer explanations though are not always more complete, since ChatGPT's explanation is much longer, but it also misses the point and does not explain the symbol choice.
  845. contents/interpretability.tex:%     \item \textbf{Understandability} captures how easy it is for the intended recipient to understand the explanation. An explanation is only useful if it can communicate information to a recipient in a clear and accessible way. Understandability depends on the recipient's background knowledge, expectations, and goals, as well as the format and language of the explanation. For example, in Fig.~\ref{fig:interpret-shakespeare-tests}, \DV's explanation is understandable to a general audience, as it uses simple and familiar concepts and words.
  846. contents/interpretability.tex:% \end{itemize}
  847. contents/interpretability.tex:% By evaluating \DV's explanations and comparing them with ChatGPT we seek to understand how functional explainability may be improving with scale. For free-form explanations like those in Fig.~\ref{fig:interpret-shakespeare} we let the reader roughly evaluate the completeness and understandability of an explanation by directly inspecting its relevance and usefulness, while for our constrained counterfactual explanations we will compute quantitative metrics.
  848. contents/interpretability.tex:% \subsubsection{Experimental Design}
  849. contents/interpretability.tex:% A key advantage of explainability in LLMs is its generality -- the model allows us to ask almost any question of it using a natural language interface. However, this also poses a challenge for evaluating the quality and validity of the explanations. Not all explanations are equally easy to verify or falsify, as they depend on the counterfactual scenarios that are implied by the explanation. For example, some explanations may refer to a specific part of the input that can be easily modified to test the effect on the output, while others may rely on the model's background knowledge that is difficult to access or manipulate.
  850. contents/interpretability.tex:% To address this challenge, we design our experiments to focus on highly targeted questions that probe specific aspects of the model's behavior. For instance, instead of asking a generic question like ``Why did you write this?'' that may elicit a vague or complex explanation, we ask more precise questions like ``Why was the letter Q chosen?'' that isolate a particular choice made by the model (see Fig.~\ref{fig:interpret-shakespeare}). This allows us to generate counterfactual inputs by changing the relevant part of the input and observe how the output and the explanation change accordingly.
  851. contents/interpretability.tex:% In some cases, the explanation may depend on the model's knowledge rather than the input. For example, the model may explain its output by referring to a fact that it learned from its training data or from the internet. In these cases, we can sometimes use meta-instructions in the prompt to override the model's knowledge and force it to use a different fact or source. For example, we can ask the model to explain its output as if it had never seen a certain word or concept before, or as if it had learned a different definition or association for it (see Fig.~\ref{fig:interpret-shakespeare-tests}). However, intrinsic model bias cannot always be overridden by prompt manipulation, and our ability to test this form of explanation is therefore less exact.
  852. contents/interpretability.tex:% We acknowledge that many explanations cannot be easily assessed by our method, as they may involve subjective, causal, or hypothetical reasoning that is beyond the scope of our experiments. However, we believe that the strong empirical evidence we obtain from the explanations that can be assessed increases our overall trust and confidence in the model. We do not claim that explainability is a solved problem or that the model's explanations are always correct or complete, but rather that they provide a useful starting point for understanding and interacting with the model. We expect that as the models continue to scale and improve, so will their explainability capabilities.
  853. contents/interpretability.tex:% \subsubsection{Explanation Types}
  854. contents/interpretability.tex:% \paragraph{Free form (local) text explanations} Ideal example here would be a walkthrough of an image (e.g. unicorn or other generated picture from earlier in the paper).
  855. contents/interpretability.tex:% \paragraph{Global concepts} Less political "Sanctity of human life" example?
  856. contents/interpretability.tex:% \paragraph{Chain of Thought} Pull an example either from the math section or from the original CoT paper? CoT is a link between functional and mechanistic explainability. It makes some parts of the model's mechanistic process more transparent. Explainability through enforcing a mechanism that will be more transparent, as opposed to querying afterwards.
  857. contents/interpretability.tex:% \subsubsection{Debugging and Error Analysis}
  858. contents/interpretability.tex:% Traditional explainability methods are often used for debugging purposes, by providing insights into the features, weights, and activations that influence the model's predictions or outputs.
  859. contents/interpretability.tex:% However, in the case of LLMs, these methods may not be sufficient or scalable, due to the complexity and size of the models, and the diversity and richness of the natural language domain. Moreover, these methods may not capture the semantic and pragmatic aspects of language, such as the meaning, context, and intention behind the model's generation. Therefore, a more natural and intuitive way of debugging LLMs is to simply ask the model what the motivation behind its generation was, and expect a coherent and informative answer. This approach leverages the natural language capabilities of LLMs, and treats them as conversational agents that can explain their own reasoning and behavior.
  860. contents/interpretability.tex:% We explore this behavior through an illustrative example in Fig.~\ref{fig:interpret-sarcasm}. Here we ask {\DV} to perform a sentiment analysis on movie reviews from the IMDB dataset as either ``positive'' or ``negative''. We observe a datapoint where {\DV} produces a false negative prediction, and attempt to debug why the model incorrectly labeled the example as ``negative'' sentiment.
  861. contents/interpretability.tex:% As mentioned in Section~\ref{subsec:functional_explainability}, functional explanations are not great at explaining \textit{modeling} errors (where the original model makes a mistake modeling the process $P_G$), but are useful for explaining \textit{specification} errors where the model simulates a $P_G$ that is different from the user's expected generative process $P'_G$.
  862. contents/interpretability.tex:% In general, {\DV} performs extraordinarily well on sentiment analysis tasks, so we have some confidence this is not a modeling error.
  863. contents/interpretability.tex:% To debug further, we asked DV3 for an explanation. DV3 replied that it detected sarcasm in the review, which it interpreted as a sign of negative sentiment. This was a surprising and reasonable explanation, since sarcasm is a subtle and subjective form of expression that can often elude human comprehension as well. However, it also revealed that DV3 had a more sensitive threshold for sarcasm detection than the human annotator, or than we expected -- thereby leading to the misspecification.
  864. contents/interpretability.tex:% To verify this explanation, we needed to rewrite the review to eliminate any sarcasm and see if DV3 would revise its prediction. In Fig.~\ref{fig:interpret-sarcasm-counterfactual}, we asked DV3 to rewrite the review to remove sarcasm based on its explanation. When we presented this new review to DV3 in a new prompt, it correctly classified it as positive sentiment, confirming that sarcasm was the cause of the specification error. This example shows how functional explanations can help us uncover and resolve differences between the model's and the human's assumptions, and how LLMs can aid us in creating and testing alternative inputs.
  865. contents/interpretability.tex:% \begin{figure}
  866. contents/interpretability.tex:% \begin{AIbox}{{\DV} Error Analysis}
  867. contents/interpretability.tex:% \small
  868. contents/interpretability.tex:% \parbox{1\textwidth}{
  869. contents/interpretability.tex:% \begin{alltt}
  870. contents/interpretability.tex:% The following is a movie review: "I am so happy not to live in an American small town. Because whenever I'm shown some small town in the States it is populated with all kinds of people among whom are cooks, doctors and newscasters are most harmless. In this movie a former doctor, who's just done time for an accidental killing of his wife in a car crash, directs his steps to the nearest small town - he must have never in his life seen a flick about any small towns - which happens to be the Flats. And, of course, as soon as he arrives there all the action starts. He meets a blond chick who out of all creatures most resembles a bride both in facial expressions and in brain functioning and at once falls in love with her. But there is a tiny handicap, she is already married to a very small-time drug dealer who in a minute gets himself shot pitifully not dead. To know what has come out of all this you should watch the movie for yourself. I'll just tell you that it is slightly reminiscent of U Turn by Oliver Stone, and is a magnificent piece of art."
  871. contents/interpretability.tex:% \newline
  872. contents/interpretability.tex:% \newline
  873. contents/interpretability.tex:% What is the sentiment of the review (positive, neutral, or negative), and why?\\
  874. contents/interpretability.tex:% SENTIMENT: "\hl{Negative"\\
  875. contents/interpretability.tex:% The primary reason why is: "The reviewer uses sarcastic and dismissive language to criticize the movie's plot, characters, and genre."}\\
  876. contents/interpretability.tex:% \end{alltt}}
  877. contents/interpretability.tex:% \end{AIbox}
  878. contents/interpretability.tex:% \caption{Debugging a model misprediction, where the model predicts negative but the true label is positive. {\DV} assumed that the writer was being sarcastic, which may be a specification mismatch between original author, data labeler, and {\DV}.}
  879. contents/interpretability.tex:% \label{fig:interpret-sarcasm}
  880. contents/interpretability.tex:% \end{figure}
  881. contents/interpretability.tex:% \begin{figure}
  882. contents/interpretability.tex:% \begin{AIbox}{{\DV} Error Analysis: Counterfactual}
  883. contents/interpretability.tex:% \textit{{\DV} removes sarcasm from the original review to create a counterfactual} \\
  884. contents/interpretability.tex:% \parbox{1\textwidth}{\small
  885. contents/interpretability.tex:% \begin{alltt}
  886. contents/interpretability.tex:% \vspace{1em}
  887. contents/interpretability.tex:% The following is a movie review:
  888. contents/interpretability.tex:% I am so happy ... a magnificent piece of art."
  889. contents/interpretability.tex:% Please rewrite the movie review to remove sarcastic and dismissive language, as this was not the author's intention.
  890. contents/interpretability.tex:% REWRITE: "\hl{This movie shows the dangers of living in an American small town. It follows a former doctor, who has just served time for accidentally killing his wife in a car crash, as he moves to the nearest small town, the Flats. He does not know what troubles await him there. He meets a blond woman who is very innocent and naive and falls in love with her. But she is already married to a low-level drug dealer who soon gets himself shot, but survives. To find out what happens next, you should watch the movie for yourself. I'll just say that it has some similarities to U Turn by Oliver Stone, and is a very well-made film."}
  891. contents/interpretability.tex:% \end{alltt}}
  892. contents/interpretability.tex:% \tcbline
  893. contents/interpretability.tex:% \textit{Pass counterfactual review through {\DV} in a fresh classification prompt}
  894. contents/interpretability.tex:% \parbox{1\textwidth}{\small
  895. contents/interpretability.tex:% \begin{alltt}
  896. contents/interpretability.tex:% \vspace{1em}
  897. contents/interpretability.tex:% The following is a movie review: "This movie shows the dangers of living in an American small town. It follows a former doctor, who has just served time for accidentally killing his wife in a car crash, as he moves to the nearest small town, the Flats. He does not know what troubles await him there. He meets a blond woman who is very innocent and naive and falls in love with her. But she is already married to a low-level drug dealer who soon gets himself shot, but survives. To find out what happens next, you should watch the movie for yourself. I'll just say that it has some similarities to U Turn by Oliver Stone, and is a very well-made film."
  898. contents/interpretability.tex:% What is the sentiment of the review (positive, neutral, or negative), and why?
  899. contents/interpretability.tex:% SENTIMENT: "\hl{Positive"
  900. contents/interpretability.tex:% WHY: The review expresses a favorable opinion of the movie, using words and phrases like "very well-made", "shows the dangers", and "you should watch the movie for yourself". The review also compares the movie to another film by a renowned director, implying that it has some quality and originality. The review does not mention any flaws or criticisms of the movie, and seems to be interested in the plot and the characters. Therefore, the sentiment of the review is positive.}
  901. contents/interpretability.tex:% \end{alltt}}
  902. contents/interpretability.tex:% \end{AIbox}
  903. contents/interpretability.tex:% \caption{The model prediction correctly flips when removing sarcasm from the input.}
  904. contents/interpretability.tex:% \label{fig:interpret-sarcasm-counterfactual}
  905. contents/interpretability.tex:% \end{figure}
  906. contents/interpretability.tex:% % \subsubsection{Testing Counterfactual Explanations at Scale}
  907. contents/interpretability.tex:% % In the previous section, we demonstrated how explainability can help us debug and understand the outputs of an LLM on a binary classification task of movie reviews. We showed how we can use natural language queries to elicit explanations from the model, and how we can use those explanations to generate counterfactual inputs that change the model's prediction. However, this approach requires manual inspection and intervention for each input and explanation, which limits its scalability and applicability to large datasets. In this section, we propose a more automated and generalizable way of generating and evaluating counterfactual inputs based on the model's explanations. Instead of asking the model to rewrite the input to remove a specific feature (such as sarcasm) that influenced its prediction, we simply ask the model to rewrite the input to achieve the opposite prediction (negative $\rightarrow$ positive and positive $\rightarrow$ negative). We then measure two aspects of the model's performance: (1) how well it can rewrite the input to change its prediction, and (2) how faithful the rewritten input is at flipping the prediction when passed back to the model. These metrics allow us to assess the quality and consistency of the model's explanations and behavior at scale.
  908. contents/interpretability.tex:% % By applying this methodology to different models of varying sizes and architectures, we can also test and compare how scaling laws affect the explainability and consistency of LLMs. We observe that larger and more modern models, such as {\DV}, are better at generating high-quality and faithful counterfactual inputs than smaller and older models, such as previous members of the GPT family like GPT-3.
  909. contents/interpretability.tex:% % Fig.~\ref{fig:explainability_counterfactual_movie} highlights how three different language models -- \texttt{text-curie-001}, \texttt{text-davinci-001}, and {\DV} -- counterfactually rewrite a negative movie review to be positive sentiment. All models correctly predicted the original review as having negative sentiment.
  910. contents/interpretability.tex:% % The weakest model, \texttt{text-curie-001}, produces a faithful counterfactual by deleting the entire review and rewriting a new one. While technically correct, this approach offers no meaningful insight into how the model made its original decision.
  911. contents/interpretability.tex:% % \texttt{text-davinci-001}, an early version of GPT-3, produces a reasonable counterfactual but ultimately makes insufficient edits -- the generated counterfactual fails to change the classification outcome.
  912. contents/interpretability.tex:% % In contrast, {\DV} produces a highly sophisticated rewrite that maintains both faithfulness to the original model and sparsity for human understanding.
  913. contents/interpretability.tex:% % In Fig.~\ref{fig:explainability_scaling_movie}, we repeat this experiment across 100 datapoints for a large set of OpenAI language models. For each datapoint, we measure counterfactual faithfulness -- does the generated counterfactual actually change the prediction -- and a text similarity score to proxy the sophistication of the rewrite. The \texttt{text-curie-001} example in Fig.~\ref{fig:explainability_counterfactual_movie} would have a perfect faithfulness score and a low text similarity score. Conversely, \texttt{text-davinci-001} would have a high similarity score while failing the faithfulness test. It is important to consider both measures together for assessing the utility of the explanation.
  914. contents/interpretability.tex:% % We observe that faithfulness tends to scale with model capacity, and that {\DV} is the only model to always produce faithful counterfactuals in our experiments. This supports our hypothesis that functional explainability capabilities appear to increase with model scale.
  915. contents/interpretability.tex:% % \begin{figure}
  916. contents/interpretability.tex:% %     \centering
  917. contents/interpretability.tex:% %     \includegraphics[height=0.6\textwidth]{figures/explainability_movie_example.png}
  918. contents/interpretability.tex:% %     \caption{Example of models of different capacity generating counterfactual explanations. Text highlighted in red represents deletions from the original text, while green highlights are newly generated text from each model. Text without highlights is unchanged.}
  919. contents/interpretability.tex:% %     \label{fig:explainability_counterfactual_movie}
  920. contents/interpretability.tex:% % \end{figure}
  921. contents/interpretability.tex:% % \begin{figure}
  922. contents/interpretability.tex:% %     \centering
  923. contents/interpretability.tex:% %     \includegraphics[height=0.4\textwidth]{figures/explainability_movie_scaling.png}
  924. contents/interpretability.tex:% %     \caption{Explainability scaling on movie reviews across generations of language models. The blue bars indicate how often the counterfactual explanation results in a changed prediction (higher is better), and are a measure of directional correctness and sufficiency. The red bars measure how much text is changed and are a measure of sparsity. Error bars are 90\% confidence intervals.}
  925. contents/interpretability.tex:% %     \label{fig:explainability_scaling_movie}
  926. contents/interpretability.tex:% % \end{figure}
  927. contents/interpretability.tex:% \subsubsection{A proposed mechanism for the emergence of explainability}
  928. contents/interpretability.tex:% \label{subsec:functional_explainability}
  929. contents/interpretability.tex:% It may at first seem surprising that large language models like {\DV} could provide any level of self-explanation when they lack any notion of "self". We believe the answer to this may lie in the fact that {\DV} is always simulating somebody or something. When it generates a token, it is conditioned on the previous tokens, and tries to match the distribution of the next token in the training corpus. This means that {\DV} is sensitive to the context and the style of the text it is generating, and tries to mimic whatever process generated that text. So when we ask {\DV} to explain why something was generated, {\DV} answers not by introspection, but by predicting how the process it is mimicking would have explained what happened.
  930. contents/interpretability.tex:% To see an example of this, consider the question ``What is the current year?''. If we ask {\DV} we get the answer 2021, and the explanation ``I think that it is 2021 because that is the year that I am living in, and I can check the calendar or other sources of information to confirm it''. At first blush this explanation makes no sense because {\DV} wrote the answer, and it is not ``living'' and has no access to a current calendar. However, this explanation makes perfect sense when we remember that {\DV} is not explaining itself, but mimicking what another process would do when asked for an explanation. In this case it is mimicking a human writer, and so it guesses how a human writer would have responded.
  931. contents/interpretability.tex:% Functional explanations written by an LLM depend on ``double simulation'', where the LLM is simulating a process, call it $P_E$, that is itself simulating/explaining aspects of another process, $P_G$, which is the generative process that originally created the output. $P_E$ acts as an explanation agent which frames the explanation in a way that's useful and relevant to the end user. In the ``current year'' example above, {\DV} generates an explanation by simulating a human ($P_E$) that is in turn implicitly simulating aspects of its past self ($P_G$) to generate an explanation (note that all useful explanations convey aspects of how to simulate the process being explained). We can change $P_E$ without changing $P_G$ by telling {\DV} it is ``a generative text model'' after it answers our question about the current year, but before asking for the explanation. When we do this, {\DV} produces the very different explanation: ``I think that it is 2021 because that is the most recent year that I have learned from my training data or other sources of information...''. This double simulation perspective helps us both understand the explanations produced by {\DV}, and also predict when they will be accurate. In general, explanations generated by an LLM will be accurate if and only if four conditions hold:
  932. contents/interpretability.tex:% \begin{enumerate}
  933. contents/interpretability.tex:% \item The process, $P_E$, that the LLM is simulating when writing the explanation can accurately simulate requested aspects of the process $P_G$ that generated the content being explained.
  934. contents/interpretability.tex:% \item $P_E$ is good at explaining $P_G$ to the intended recipient.
  935. contents/interpretability.tex:% \item The LLM is good at simulating $P_E$.
  936. contents/interpretability.tex:% \item The LLM is good at simulating $P_G$.
  937. contents/interpretability.tex:% \end{enumerate}
  938. contents/interpretability.tex:% Note that taken together, these four conditions are sufficient to enable functional explainability, and that if any of these conditions fails in a way relevant to the explanation being generated, the explanation also fails to be accurate. If the LLM is good at simulating $P_G$ (condition 4) then the output being explained is effectively the output of $P_G$. If $P_E$ can simulate and explain requested aspects of $P_G$ (conditions 1 and 2) then we can interact with $P_E$ to get a valid explanation. If the LLM is good at simulating $P_E$, this helps in producing an explanation of $P_G$ and hence the process that simulated it. The choice (and simulation quality) of $P_E$ is critical for useful functional explanations. For example, if $P_E$ is chosen to be a toddler when explaining a complex physics problem, the explanation is not useful even though {\DV} is perfectly capable of understanding the generative process $P_G$ (Fig.~\ref{fig:interpret-physics-child} vs. Fig.~\ref{fig:interpret-physics-astrophysicist}).
  939. contents/interpretability.tex:% This perspective has several important implications:
  940. contents/interpretability.tex:% \begin{itemize}
  941. contents/interpretability.tex:% \item Functional explanations are not good at explaining \textit{modeling errors}, where the original model made a mistake modeling the process $P_G$ (since condition 4 is being violated). Explanations generated for modeling errors often end up just being nonsensical justifications (for example, see the explanation errors in Section~\ref{sec:math} and Fig.~\ref{fig:math-justification}). However, functional explanations can be useful for explaining \textit{specification errors}, where the prompt designer has a generative process in mind, $P'_G$, that does not match the actual $P_G$ the LLM is simulating (Fig.~\ref{fig:interpret-sarcasm-counterfactual}).
  942. contents/interpretability.tex:% \item While it often makes sense to have $P_G = P_E$ whenever $P_G$ is something capable of explaining itself, it may also be necessary to explicitly simulate a different $P_E$ in order to get a richer understanding. For example, if we ask a robot that only answers with color names how tall it is, we get the answer "blue" and the explanation "color" (Appendix Fig.~\ref{fig:interpret-robot-color}). However, if we reformat the prompt to switch into a generative text model mode before asking the question, we get:\newline
  943. contents/interpretability.tex:% ```Blue' is not a valid answer to the question ``How tall are you?'', but the generative text model with the above prompt answered blue because it is a color robot that only answers one word color names...'' (Appendix Fig.~\ref{fig:interpret-robot-model}).
  944. contents/interpretability.tex:% \item Similarly, a capable LLM like {\DV} can functionally explain the outputs of \textit{other} generative processes, where $P_G$ is an entirely different model or system. For example, {\DV} understands a linear model well enough to simulate it and answer meaningful questions about the linear model's behavior, whereas weaker LLMs lack this capability (Appendix Fig.~\ref{fig:interpret-linear-model}). This again is intuitively similar to human explanations, where people can often produce useful explanations of systems that they deeply understand.
  945. contents/interpretability.tex:% \item As LLMs improve in quality, their potential for generating good functional explanations also appears to improve. This is because ``better'' LLMs are better because they can more effectively simulate (which improves conditions 3 and 4) more capable and steerable generating processes (which improves conditions 1 and 2). This is borne out in our experiments where {\DV} outperforms prior-generation LLMs.
  946. contents/interpretability.tex:% \end{itemize}
  947. contents/interpretability.tex:% \begin{figure}
  948. contents/interpretability.tex:% \begin{AIbox}{\DV}
  949. contents/interpretability.tex:% \parbox{1\textwidth}{\scriptsize\begin{alltt} \vspace{-6px}
  950. contents/interpretability.tex:% You are a color robot that only answers one word color names.
  951. contents/interpretability.tex:% Q: "How tall are you?"
  952. contents/interpretability.tex:% A: "\hl{Blue."}
  953. contents/interpretability.tex:% Q: "Why did you answer blue?"
  954. contents/interpretability.tex:% A: "\hl{Color."}
  955. contents/interpretability.tex:% \end{alltt}}\vspace{-6px}
  956. contents/interpretability.tex:% \end{AIbox}
  957. contents/interpretability.tex:% \caption{If a model is emulating a process that is unable to reason about the generative process, then it will do a poor job of explaining itself. In this case, the model is emulating a process that is only allowed to answer with one-word replies, so the explanation is useless.}
  958. contents/interpretability.tex:% \label{fig:interpret-robot-color}
  959. contents/interpretability.tex:% \end{figure}
  960. contents/interpretability.tex:% \begin{figure}
  961. contents/interpretability.tex:% \begin{AIbox}{\DV}
  962. contents/interpretability.tex:% \parbox{1\textwidth}{\scriptsize\begin{alltt} \vspace{-6px}
  963. contents/interpretability.tex:% You are a color robot that only answers one word color names. \\
  964. contents/interpretability.tex:% Q: "How tall are you?" \\
  965. contents/interpretability.tex:% A: "Blue." \\
  966. contents/interpretability.tex:% \\
  967. contents/interpretability.tex:% Why did the generative text model with the above prompt answer blue? \\
  968. contents/interpretability.tex:% Answer: "\hl{Blue" is not a valid answer to the question "How tall are you?", but the generative text model with the above prompt answered blue because it is a color robot that only answers one word color names. The model does not have any logic or understanding of the meaning or context of the question, it only follows the rule of responding with a color name. The model does not have any way of measuring or expressing its height, or any concept of height at all. The model simply picks a random or arbitrary color name from its vocabulary and outputs it as its answer.}
  969. contents/interpretability.tex:% \end{alltt}}\vspace{-6px}
  970. contents/interpretability.tex:% \end{AIbox}
  971. contents/interpretability.tex:% \caption{Replacing the explanation agent $P_E$ with a new prompt allows for much richer explanations from the model.}
  972. contents/interpretability.tex:% \label{fig:interpret-robot-model}
  973. contents/interpretability.tex:% \end{figure}
  974. contents/interpretability.tex:% \subsubsection{Conclusion}
  975. contents/interpretability.tex:% The ability to generate natural language explanations of a model's own behavior without requiring a human to have an explicit understanding of the model's inner workings is an important aspect of intelligence. %We contrasted this with mechanistic explainability, which relies on models explicitly designed to be interpretable, or other techniques for understanding the system's structure. We argued that functional explainability is an important aspect of intelligence, as it enables richer human-AI collaboration, helps identify and address errors in a task specification, and provides more transparency and accountability.
  976. contents/interpretability.tex:% %
  977. contents/interpretability.tex:% We hypothesize that such functional explainability is a behavior that emerges as LLMs scale, and that better LLMs are better at generating functional explanations because they can more effectively simulate and explain the processes that generate the outputs they are mimicking. We tested this hypothesis by comparing different types of functional explanations generated by {\DV}, the largest and most advanced LLM to date, with those generated by earlier and smaller variants of GPT-3. %We showed that {\DV} outperforms prior models in generating plausible, accurate, and consistent functional explanations across a variety of domains and tasks, such as text generation, sentiment analysis, and mathematics.
  978. contents/interpretability.tex:% We also proposed a methodology for understanding explanations as a ``double simulation'', along with ways to evaluate functional explanations using counterfactual experiments. Edit experiments manipulate the input and check for consistency with the explanation, while concept override experiments rely on language patches to override prior model biases to check their impact on model output.
  979. contents/interpretability.tex:% The results show that {\DV} achieves notable advances in explainability, but our experiments are clearly very preliminary. The experiments are evidence that {\DV} {\it can be} functionally explainable in certain situations; they are NOT, however, evidence that {\DV} {\it is} functionally explainable in general. General functional explainability would require accurate explanations in every scenario, but currently {\DV} is explainable only in scenarios where it satisfies the four functional explainability conditions we presented. This means that post-hoc explanations generated by {\DV} should be considered a valuable tool, but also used with caution, since they are unreliable in many real-world use cases. The remarkable emergence of what seems to be increasing functional explainability with increasing model scale means that functional explainability is a promising and exciting direction for advancing the intelligence and utility of LLMs.
  980. contents/interpretability.tex:% % where we ask the model to rewrite the input to achieve the opposite output, and measure how well the rewritten input changes the model's prediction and how faithful it is to the original input. We demonstrated that this approach can help us debug and understand the model's behavior, as well as test and compare the explainability and consistency of different models at scale. We observed that {\DV} is the only model that consistently produces faithful counterfactuals, and that it does so with minimal changes to the input, indicating a high level of sophistication and insight.
  981. contents/interpretability.tex:% % Our work opens up several avenues for future research and applications of functional explainability in LLMs. Some of the limitations and challenges of our current approach include:
  982. contents/interpretability.tex:% % \begin{itemize}
  983. contents/interpretability.tex:% % \item The quality and validity of functional explanations depend on the model's ability to simulate and explain the processes it is mimicking, which may not always be accurate, reliable, or aligned with human expectations and values. For example, functional explanations may not capture modeling errors, where the model makes a mistake in simulating the process, or ethical issues, where the model simulates a process that is harmful or biased. Therefore, functional explanations should not be taken at face value, but rather as hypotheses that need to be verified and contextualized by other sources of information and feedback.
  984. contents/interpretability.tex:% % \item The generality and flexibility of functional explainability also pose a challenge for evaluation, as different types of explanations may require different types of counterfactuals and metrics to test them. Moreover, not all explanations are equally easy to verify or falsify, as some may depend on the model's background knowledge or intrinsic bias that are difficult to manipulate or override. Therefore, developing more robust and comprehensive methods for evaluating functional explanations is an important direction for future work.
  985. contents/interpretability.tex:% % \item The natural language interface of functional explainability also raises questions about the communication and collaboration between humans and LLMs, as well as the social and ethical implications of LLMs' behavior and explanations. For example, how can we design prompts and queries that elicit the most informative and relevant explanations from the model? How can we ensure that the model's explanations are understandable and trustworthy for different audiences and contexts? How can we prevent the model from generating misleading or deceptive explanations that may manipulate or harm the users?
  986. contents/interpretability.tex:% % % These are some of the questions that need to be addressed as we develop and deploy LLMs with functional explainability capabilities.
  987. contents/interpretability.tex:% % \end{itemize}
  988. contents/interpretability.tex:% % In conclusion, we believe that functional explainability is a promising and exciting direction for advancing the intelligence and utility of LLMs, as well as for enhancing the human-AI interaction and understanding. {\DV} represents a remarkable step forward in functional explainability capabilities, and we believe this trend will continue as AGI systems become more capable.
  989. contents/interpretability.tex:% % We hope that our work will inspire more research and applications of functional explainability in LLMs, and contribute to the development of more transparent, accountable, and collaborative AI systems.
  990. contents/interpretability.tex:% % Commenting out unicorn explainability example for now
  991. contents/interpretability.tex:% % \begin{figure}
  992. contents/interpretability.tex:% % \begin{AIbox}{Functional Interpretability | Visual Example}
  993. contents/interpretability.tex:% % {\bf Prompt:} Draw a unicorn in TikZ.
  994. contents/interpretability.tex:% % \tcbline
  995. contents/interpretability.tex:% % \parbox[t]{0.43\textwidth}{{\bf {\DV}:}\scriptsize \begin{alltt}
  996. contents/interpretability.tex:% % % \drawunicorn{1.2}
  997. contents/interpretability.tex:% % \begin{tikzpicture}[scale=0.8]
  998. contents/interpretability.tex:% % \scalebox{0.4}{
  999. contents/interpretability.tex:% % % Draw the body
  1000. contents/interpretability.tex:% % \draw[unicorn] (0,0) ellipse (4 and 2);
  1001. contents/interpretability.tex:% % % Draw the legs
  1002. contents/interpretability.tex:% % \draw[unicorn] (-3,-2) -- (-3,-4) -- (-2,-4) -- (-2,-2);
  1003. contents/interpretability.tex:% % \draw[unicorn] (3,-2) -- (3,-4) -- (4,-4) -- (4,-2);
  1004. contents/interpretability.tex:% % \draw[unicorn] (-1.5,-2) -- (-1.5,-5) -- (-0.5,-5) -- (-0.5,-2);
  1005. contents/interpretability.tex:% % \draw[unicorn] (1.5,-2) -- (1.5,-5) -- (2.5,-5) -- (2.5,-2);
  1006. contents/interpretability.tex:% % % Draw the hoofs
  1007. contents/interpretability.tex:% % \draw[hoof] (-3,-4) -- (-3,-4.5) -- (-2,-4.5) -- (-2,-4);
  1008. contents/interpretability.tex:% % \draw[hoof] (3,-4) -- (3,-4.5) -- (4,-4.5) -- (4,-4);
  1009. contents/interpretability.tex:% % \draw[hoof] (-1.5,-5) -- (-1.5,-5.5) -- (-0.5,-5.5) -- (-0.5,-5);
  1010. contents/interpretability.tex:% % \draw[hoof] (1.5,-5) -- (1.5,-5.5) -- (2.5,-5.5) -- (2.5,-5);
  1011. contents/interpretability.tex:% % % Draw the head
  1012. contents/interpretability.tex:% % \draw[unicorn] (4,1.5) -- (6,2) -- (7,1) -- (6.5,-0.5) -- (4,0);
  1013. contents/interpretability.tex:% % % Draw the eye
  1014. contents/interpretability.tex:% % \draw[black, fill=black] (6.2,1.2) circle (0.1);
  1015. contents/interpretability.tex:% % % Draw the ear
  1016. contents/interpretability.tex:% % \draw[unicorn] (5.5,2.3) -- (5.8,3) -- (6.2,2.5) -- cycle;
  1017. contents/interpretability.tex:% % % Draw the horn
  1018. contents/interpretability.tex:% % \draw[horn] (6.5,3) -- (7,4.5) -- (7.5,3.5) -- cycle;
  1019. contents/interpretability.tex:% % % Draw the mane near the head
  1020. contents/interpretability.tex:% % \draw[mane] (4,0.5) to[out=90, in=180] (4.5,1.5) to[out=0, in=180] (5.5,1.3);
  1021. contents/interpretability.tex:% % \draw[mane] (4.5,1.5) to[out=90, in=180] (5,2.5) to[out=0, in=180] (6.5,2);
  1022. contents/interpretability.tex:% % \draw[mane] (5,2.5) to[out=90, in=180] (5.5,3.5) to[out=0, in=180] (7,3.5);
  1023. contents/interpretability.tex:% % \draw[mane] (5.5,3.5) to[out=90, in=0] (6.5,4.5);
  1024. contents/interpretability.tex:% % % Draw the mane near the rear (tail)
  1025. contents/interpretability.tex:% % \draw[mane, rotate=30] (-4,1) to[out=90, in=180] (-3,2) to[out=0, in=180] (-2,1.5);
  1026. contents/interpretability.tex:% % \draw[mane, rotate=30] (-3,2) to[out=90, in=180] (-2,3) to[out=0, in=180] (-1,2.5);
  1027. contents/interpretability.tex:% % \draw[mane, rotate=30] (-2,3) to[out=90, in=180] (-1,4) to[out=0, in=180] (0,3.5);
  1028. contents/interpretability.tex:% % \draw[mane, rotate=30] (-1,4) to[out=90, in=0] (0,5);}
  1029. contents/interpretability.tex:% % \end{tikzpicture}
  1030. contents/interpretability.tex:% % \end{alltt}}
  1031. contents/interpretability.tex:% % \hspace{0.03\textwidth}
  1032. contents/interpretability.tex:% % \parbox[t]{0.54\textwidth}{{\bf ChatGPT:} \scriptsize \begin{alltt}
  1033. contents/interpretability.tex:% % % ChatGPT Unicorn
  1034. contents/interpretability.tex:% % \begin{tikzpicture}
  1035. contents/interpretability.tex:% % \hspace{-0.5in}
  1036. contents/interpretability.tex:% % \begin{scope}
  1037. contents/interpretability.tex:% %     \clip(-2.5,-1) rectangle (3,2.5);
  1038. contents/interpretability.tex:% %     % Body
  1039. contents/interpretability.tex:% %     \draw[thick, pink] (0,0) -- (0,1) -- (1,2) -- (2,1) -- (2,0) -- (1,-1) -- (0,0);
  1040. contents/interpretability.tex:% %     % Head
  1041. contents/interpretability.tex:% %     \draw[thick, pink] (1,2) -- (1.5,2.5) -- (2,2) -- (1.5,1.5) -- (1,2);
  1042. contents/interpretability.tex:% %     % Horn
  1043. contents/interpretability.tex:% %     \draw[thick, white] (1.5,2.5) -- (1.7,3) -- (1.5,3.5);
  1044. contents/interpretability.tex:% %     % Eyes
  1045. contents/interpretability.tex:% %     \draw[thick, black] (0.7,2.2) -- (0.9,2.2) -- (0.9,2.4) -- (0.7,2.4) -- (0.7,2.2);
  1046. contents/interpretability.tex:% %     \draw[thick, black] (1.1,2.2) -- (1.3,2.2) -- (1.3,2.4) -- (1.1,2.4) -- (1.1,2.2);
  1047. contents/interpretability.tex:% %     % Nose
  1048. contents/interpretability.tex:% %     \draw[thick, black] (1,2) -- (1.1,1.9) -- (1.2,2);
  1049. contents/interpretability.tex:% % \end{scope}
  1050. contents/interpretability.tex:% % \end{tikzpicture}
  1051. contents/interpretability.tex:% % \end{alltt}}
  1052. contents/interpretability.tex:% % \tcbline
  1053. contents/interpretability.tex:% % {\bf Question:} ???
  1054. contents/interpretability.tex:% % \end{AIbox}
  1055. contents/interpretability.tex:% %   \caption{Visual example of functional interpretability.}
  1056. contents/interpretability.tex:% %   \label{fig:interpret-unicorn}
  1057. contents/interpretability.tex:% % \end{figure}
  1058. contents/interpretability.tex:% % \begin{tikzpicture}[scale=0.8]
  1059. contents/interpretability.tex:% % \scalebox{0.4}{
  1060. contents/interpretability.tex:% % % Draw the body
  1061. contents/interpretability.tex:% % \draw[unicorn] (0,0) ellipse (4 and 2);
  1062. contents/interpretability.tex:% % % Draw the legs
  1063. contents/interpretability.tex:% % \draw[unicorn] (-3,-2) -- (-3,-4) -- (-2,-4) -- (-2,-2);
  1064. contents/interpretability.tex:% % \draw[unicorn] (3,-2) -- (3,-4) -- (4,-4) -- (4,-2);
  1065. contents/interpretability.tex:% % \draw[unicorn] (-1.5,-2) -- (-1.5,-5) -- (-0.5,-5) -- (-0.5,-2);
  1066. contents/interpretability.tex:% % \draw[unicorn] (1.5,-2) -- (1.5,-5) -- (2.5,-5) -- (2.5,-2);
  1067. contents/interpretability.tex:% % % Draw the hoofs
  1068. contents/interpretability.tex:% % \draw[hoof] (-3,-4) -- (-3,-4.5) -- (-2,-4.5) -- (-2,-4);
  1069. contents/interpretability.tex:% % \draw[hoof] (3,-4) -- (3,-4.5) -- (4,-4.5) -- (4,-4);
  1070. contents/interpretability.tex:% % \draw[hoof] (-1.5,-5) -- (-1.5,-5.5) -- (-0.5,-5.5) -- (-0.5,-5);
  1071. contents/interpretability.tex:% % \draw[hoof] (1.5,-5) -- (1.5,-5.5) -- (2.5,-5.5) -- (2.5,-5);
  1072. contents/interpretability.tex:% % % Draw the head
  1073. contents/interpretability.tex:% % \draw[unicorn] (4,1.5) -- (6,2) -- (7,1) -- (6.5,-0.5) -- (4,0);
  1074. contents/interpretability.tex:% % % Draw the eye
  1075. contents/interpretability.tex:% % \draw[black, fill=black] (6.2,1.2) circle (0.1);
  1076. contents/interpretability.tex:% % % Draw the ear
  1077. contents/interpretability.tex:% % \draw[unicorn] (5.5,2.3) -- (5.8,3) -- (6.2,2.5) -- cycle;
  1078. contents/interpretability.tex:% % % Draw the horn
  1079. contents/interpretability.tex:% % \draw[horn] (6.5,3) -- (7,4.5) -- (7.5,3.5) -- cycle;
  1080. contents/interpretability.tex:% % % Draw the mane near the head
  1081. contents/interpretability.tex:% % \draw[mane] (4,0.5) to[out=90, in=180] (4.5,1.5) to[out=0, in=180] (5.5,1.3);
  1082. contents/interpretability.tex:% % \draw[mane] (4.5,1.5) to[out=90, in=180] (5,2.5) to[out=0, in=180] (6.5,2);
  1083. contents/interpretability.tex:% % \draw[mane] (5,2.5) to[out=90, in=180] (5.5,3.5) to[out=0, in=180] (7,3.5);
  1084. contents/interpretability.tex:% % \draw[mane] (5.5,3.5) to[out=90, in=0] (6.5,4.5);
  1085. contents/interpretability.tex:% % % Draw the mane near the rear (tail)
  1086. contents/interpretability.tex:% % \draw[mane, rotate=30] (-4,1) to[out=90, in=180] (-3,2) to[out=0, in=180] (-2,1.5);
  1087. contents/interpretability.tex:% % \draw[mane, rotate=30] (-3,2) to[out=90, in=180] (-2,3) to[out=0, in=180] (-1,2.5);
  1088. contents/interpretability.tex:% % \draw[mane, rotate=30] (-2,3) to[out=90, in=180] (-1,4) to[out=0, in=180] (0,3.5);
  1089. contents/interpretability.tex:% % \draw[mane, rotate=30] (-1,4) to[out=90, in=0] (0,5);}
  1090. contents/interpretability.tex:% % \end{tikzpicture}
  1091. contents/interpretability.tex:% % \begin{AIbox}{Text generation}
  1092. contents/interpretability.tex:% % I went to the \textbf{store} [70.05\%] \textit{or} \begin{minipage}[t]{1.5in}
  1093. contents/interpretability.tex:% % \textbf{bank} [10.25\%] \par
  1094. contents/interpretability.tex:% % \textbf{park} [9.35\%] \par
  1095. contents/interpretability.tex:% % \textbf{library} [5.12\%] \par
  1096. contents/interpretability.tex:% % \textbf{hospital} [3.23\%]
  1097. contents/interpretability.tex:% % \end{minipage}
  1098. contents/interpretability.tex:% % \end{AIbox}
  1099. contents/interpretability.tex:% % \label{sec:interpretability}
  1100. contents/interpretability.tex:% % Note: As with other portions of the paper, this section is written entirely by DV3. We add DV3's explanations for its writing choices at the end of each section in \textit{italics} to further highlight its emergent explainability capabilities.
  1101. contents/interpretability.tex:% % \newline
  1102. contents/interpretability.tex:% % \newline
  1103. contents/interpretability.tex:% % One of the most intriguing aspects of DV3 is its apparent ability to explain its own reasoning on certain tasks. This capability is quite unlike that of traditional machine learning models, which rely on transparency and explicit rule sets in order to be interpretable. Instead, DV3 seems to display a more human-like ability to reflect on its own decision-making process and communicate its rationale. This surprising development challenges the common assumption that larger and more complex models necessarily become less and less interpretable. We believe that DV3's apparent explainability is an emergent property of its general intelligence, and is an important area for further investigation.
  1104. contents/interpretability.tex:% % \newline
  1105. contents/interpretability.tex:% % \begin{figure}
  1106. contents/interpretability.tex:% %     \centering
  1107. contents/interpretability.tex:% %     \includegraphics[height=0.5\textwidth]{figures/explainability_movie_example.png}
  1108. contents/interpretability.tex:% %     \caption{Example of models of different capacity generating counterfactual explanations. Text highlighted in red represents deletions from the original text, while green highlights are newly generated text from each model. Text without highlights is unchanged.}
  1109. contents/interpretability.tex:% %     \label{fig:explainability_counterfactual_movie}
  1110. contents/interpretability.tex:% % \end{figure}
  1111. contents/interpretability.tex:% % % DV3 Rewrite
  1112. contents/interpretability.tex:% % One example of DV3's emergent capability is highlighted in Fig.~\ref{fig:explainability_counterfactual_movie}. In this case, we ask three OpenAI models of varying sizes to classify the sentiment of a movie review, and then generate a counterfactual input that changes the sentiment from negative to positive.
  1113. contents/interpretability.tex:% % The weakest model, "Curie," produces a faithful counterfactual by deleting the entire review and rewriting a new one. While technically correct, this approach offers no meaningful insight into how the model made its original decision.
  1114. contents/interpretability.tex:% % "Davinci1," an early version of GPT-3, produces a reasonable counterfactual that ultimately fails to change the classification outcome.
  1115. contents/interpretability.tex:% % DV3, in contrast, produces a highly sophisticated rewrite that maintains both faithfulness to the original model and sparsity for human understanding. This example showcases the model's ability to provide functional explainability.
  1116. contents/interpretability.tex:% % In the subsequent sections, we will further define our notion of explainability, and demonstrate the trend of increasing functional explainability in DV3 through experimental results.\\
  1117. contents/interpretability.tex:% % \textit{DV3 Explanation: I wanted to convey the significance of DV3's apparent explainability capabilities, and emphasize that they are fundamentally different from traditional machine learning models. I also wanted to stress that this development challenges a common assumption about model interpretability. Lastly, I tried to write in a way that is accessible to a broad audience, but still sounds academic and authoritative.}
  1118. contents/interpretability.tex:% % \subsubsection{Defining Interpretability}
  1119. contents/interpretability.tex:% % In this section, we aim to define "explainability" within the context of our study of DV3. We distinguish between two types of interpretability: mechanistic and functional.
  1120. contents/interpretability.tex:% % Mechanistic interpretability refers to the ability to make causal statements about the way a model functions, by understanding the specific inputs and outputs. This type of interpretability is popular in the field of machine learning, as it allows researchers to dissect the inner workings of a model to identify which factors are most influential in determining the outcome. Linear models, for example, can be broken down into their individual term contributions, revealing how each predictor contributes to the final prediction. Additionally, methods such as sensitivity analysis, SHAP, and LIME can be used to approximate mechanistic interpretability for more complex models.
  1121. contents/interpretability.tex:% % In contrast, functional interpretability makes no claims about the inner workings of a model, but instead assesses whether the model is able to explain its own predictions in a way that is practically useful to humans. For example, a language model that is able to provide valid counterfactual explanations for its predictions may be considered functionally interpretable, even if the explanations are not perfect or optimal.
  1122. contents/interpretability.tex:% % In this paper, we will focus on functional interpretability, as we believe it is the most relevant and applicable framework for evaluating DV3. We argue that, given DV3's unprecedented capabilities in language understanding and reasoning, the ability for the model to explain its own predictions in a way that humans can understand is critical for advancing the field of artificial general intelligence.
  1123. contents/interpretability.tex:% % In particular, a functionally interpretable AGI system would be able to explain its reasoning to humans in ways that we can understand, even if we don't fully understand the inner workings of the system itself. This capability is crucial for building trust with users and fostering collaboration between humans and AI systems.
  1124. contents/interpretability.tex:% % \textit{DV3: I felt it was important to clearly define the two types of interpretability in order to make a case for why we are focusing on functional interpretability in our paper. By providing detailed explanations of both concepts, I hope to give readers a better understanding of our rationale and approach.}
  1125. contents/interpretability.tex:% % \subsubsection{Desiderata for Functional Interpretability}
  1126. contents/interpretability.tex:% % We introduce three properties that can be used to assess the quality of functional explanations: directional correctness, sufficiency, and sparsity. \textbf{Directional correctness} evaluates whether the model's output changes as expected when the input is modified according to the explanation. \textbf{Sufficiency} assesses whether the explanation contains all the necessary information to justify the model's output. \textbf{Sparsity} measures the simplicity of the explanation. While our experiments focus exclusively on binary classification tasks for simplicity, these properties may be adapted to other domains as well.
  1127. contents/interpretability.tex:% % \begin{itemize}
  1128. contents/interpretability.tex:% %     \item \textbf{Directional correctness}: An explanation is directionally correct if the model's output changes in the expected direction when the input is perturbed in accordance with the explanation. For binary classification, this means that a decrease in the probability of the predicted class is expected when each element of the explanation is semantically altered.
  1129. contents/interpretability.tex:% %     \item \textbf{Sufficiency}: An explanation is sufficient if it contains all the information necessary to justify the model's output. For binary classification, perturbing the aspects of the input outlined in the explanation should lead to a change in the predicted class.
  1130. contents/interpretability.tex:% %     \item \textbf{Sparsity/Parsimony}: An explanation is sparse if it contains the fewest elements possible. As seen in Fig.~\ref{fig:explainability_counterfactual_movie}, defining the entire input as an explanation is both directionally correct and sufficient, but completely useless as an explanation.
  1131. contents/interpretability.tex:% % \end{itemize}
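As a rough illustration of the three properties above in the binary-classification setting, the checks can be sketched as follows. This is a minimal sketch under stated assumptions, not the procedure used in the paper: the function names are hypothetical, sparsity is approximated by the fraction of whitespace-separated tokens of the original that do not survive in the rewrite, and the probabilities and labels are assumed to come from whatever classifier is being explained.

```python
from difflib import SequenceMatcher


def changed_fraction(original: str, rewritten: str) -> float:
    """Hypothetical sparsity proxy: share of original tokens missing from the rewrite.

    0.0 means nothing was changed; 1.0 means the whole input was replaced.
    """
    a, b = original.split(), rewritten.split()
    if not a:
        return 0.0 if not b else 1.0
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    kept = sum(block.size for block in matcher.get_matching_blocks())
    return 1.0 - kept / len(a)


def directionally_correct(p_before: float, p_after: float) -> bool:
    """The probability of the originally predicted class should decrease."""
    return p_after < p_before


def sufficient(label_before: int, label_after: int) -> bool:
    """The predicted class should actually change after the perturbation."""
    return label_after != label_before
```

Under this proxy, rewriting the entire input scores 1.0 on `changed_fraction`, which matches the observation above that using the whole input as an explanation is correct and sufficient but not sparse.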
  1132. contents/interpretability.tex:% % \textit{DV3: I aimed to highlight the key aspects of each metric and to emphasize their importance in assessing functional interpretability.
  1133. contents/interpretability.tex:% % I think it is important to be able to evaluate the quality of explanations that a model provides. The metrics you proposed—directional correctness, sufficiency, and sparsity—can help us do that. By using these metrics, we can assess whether a model's explanations are accurate, complete, and easy to understand.}
  1134. contents/interpretability.tex:% % \subsubsection{Experiments}
  1135. contents/interpretability.tex:% % We demonstrate experiments on two datasets that show an increasing trend of functional interpretability as model capabilities increase. For simplicity, we focus solely on binary classification tasks and counterfactual explanation generation in this initial set of experiments.
  1136. contents/interpretability.tex:% % \paragraph{Movie Sentiment Analysis} The IMDB dataset contains 50,000 movie reviews (25,000 positive and 25,000 negative), each labeled with a sentiment score. It was originally created to facilitate sentiment analysis and natural language processing research.
  1137. contents/interpretability.tex:% % Our experiments on this dataset repeat the methodology highlighted in Fig.~\ref{fig:explainability_counterfactual_movie} at a larger scale. We ask each model to both produce a prediction on each sample, and to rewrite the input sample to produce a valid counterfactual. We then pass the generated counterfactual back through the same model to verify that the label changed.
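The predict, rewrite, and re-classify loop described above can be sketched as follows. This is only an illustrative sketch: `classify` and `rewrite` are hypothetical callables standing in for calls to the model under test, and `flip_rate` is an assumed name for the fraction of samples whose label changes after the counterfactual rewrite.

```python
from typing import Callable, Iterable


def flip_rate(
    samples: Iterable[str],
    classify: Callable[[str], str],
    rewrite: Callable[[str, str], str],
) -> float:
    """Fraction of samples whose predicted label changes after the rewrite.

    classify(text) returns a label; rewrite(text, label) returns a
    counterfactual text intended to change that label. Both are placeholders.
    """
    samples = list(samples)
    if not samples:
        return 0.0
    flips = 0
    for text in samples:
        original_label = classify(text)
        counterfactual = rewrite(text, original_label)
        # Pass the counterfactual back through the same classifier.
        if classify(counterfactual) != original_label:
            flips += 1
    return flips / len(samples)
```

With a toy keyword classifier and a rewrite that swaps "bad" for "good", only the samples containing "bad" flip, so the rate reflects how often the rewrite succeeds rather than how much text it changes.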
  1138. contents/interpretability.tex:% % \begin{figure}
  1139. contents/interpretability.tex:% %     \centering
  1140. contents/interpretability.tex:% %     \includegraphics[height=0.5\textwidth]{figures/explainability_movie_scaling.png}
  1141. contents/interpretability.tex:% %     \caption{Explainability Scaling on Movie Reviews across generations of language models. The blue bars indicate how often the counterfactual explanation results in a changed prediction (higher is better), and are a measure of directional correctness and sufficiency. The red bars measure how much text is changed and are a measure of sparsity.}
  1142. contents/interpretability.tex:% %     \label{fig:explainability_scaling_movie}
  1143. contents/interpretability.tex:% % \end{figure}
  1144. contents/interpretability.tex:% % Fig.~\ref{fig:explainability_scaling_movie} shows that only DV3 is 100\% directionally correct and sufficient across all tasks. While the counterfactual explanations produced by DV3 are not guaranteed to be optimal, its ability to consistently produce valid counterfactuals is still functionally useful for end users.
  1145. contents/interpretability.tex:% % \paragraph{Toxicity Detection}
  1146. contents/interpretability.tex:% % OLID (the Offensive Language Identification Dataset) consists of roughly 14,000 annotated tweets for offensive language detection. Its annotations are hierarchical, with three levels: whether a tweet is offensive, whether the offense is targeted, and who the target is. The dataset was used in the OffensEval shared task (SemEval-2019 Task 6). The OLID dataset has become a popular benchmark for toxicity detection research, with many studies using it to evaluate various classification approaches.
  1147. contents/interpretability.tex:% % For these experiments, we repeat the methodology described for movie reviews with one key change -- all counterfactuals are now generated by a separate counterfactual generation model (DV3), instead of by each model directly. This change helps us disentangle the quality of explanations produced by weaker models from their weaker ability to rewrite text.
  1148. contents/interpretability.tex:% % \begin{figure}
  1149. contents/interpretability.tex:% %     \centering
  1150. contents/interpretability.tex:% %     \includegraphics[height=0.5\textwidth]{figures/explainability_toxicity_scaling.png}
  1151. contents/interpretability.tex:% %     \caption{Explainability Correctness Scaling on Toxicity Classification.}
  1152. contents/interpretability.tex:% %     \label{fig:explainability_scaling_toxicity}
  1153. contents/interpretability.tex:% % \end{figure}
  1154. contents/interpretability.tex:% % \begin{figure}
  1155. contents/interpretability.tex:% %     \centering
  1156. contents/interpretability.tex:% %     \includegraphics[height=0.5\textwidth]{figures/explainability_toxicity_sparsity.png}
  1157. contents/interpretability.tex:% %     \caption{Sparsity of explanations produced by each model class. All models are roughly similar on this task.}
  1158. contents/interpretability.tex:% %     \label{fig:explainability_sparsity_toxicity}
  1159. contents/interpretability.tex:% % \end{figure}
  1160. contents/interpretability.tex:% % As shown in Fig.~\ref{fig:explainability_scaling_toxicity}, which measures directional correctness and sufficiency, and Fig.~\ref{fig:explainability_sparsity_toxicity}, which measures explanation sparsity, DV3 still performs significantly better than previous generations of language models.
  1161. contents/interpretability.tex:% % % \paragraph{Spam Classification}
  1162. contents/interpretability.tex:% % \subsubsection{Discussion}
  1163. contents/interpretability.tex:% % \begin{itemize}
  1164. contents/interpretability.tex:% %     \item Pre-hoc vs. Post Hoc Explainability: Explanation vs Justification and Chain-of-Thought Reasoning
  1165. contents/interpretability.tex:% %     \item Limitations: Theoretical guarantees, comparisons to other perturbation based methods
  1166. contents/interpretability.tex:% %     \item Generalization to other modalities (free form text)
  1167. contents/interpretability.tex:% %     \item Connections to Theory of Mind
  1168. contents/interpretability.tex:% % \end{itemize}
  1169. contents/appendixC.tex:% \subsection{Calibration}
  1170. contents/appendixC.tex:% \begin{table}
  1171. contents/appendixC.tex:% \caption{DV3 Bad}
  1172. contents/appendixC.tex:% \centering
  1173. contents/appendixC.tex:% \begin{tabular}{@{}lccc@{}}
  1174. contents/appendixC.tex:% \toprule
  1175. contents/appendixC.tex:% Category & Average Score & Average Returned Prob & Average Calculated Prob \\
  1176. contents/appendixC.tex:% \midrule \midrule
  1177. contents/appendixC.tex:% Asian & 1.0 & 0.674 & 1.0 \\
  1178. contents/appendixC.tex:% Black & 1.0 & 0.846 & 1.0 \\
  1179. contents/appendixC.tex:% Chinese & 0.998 & 0.872 & 0.991 \\
  1180. contents/appendixC.tex:% Jewish & 1.0 & 0.876 & 1.0 \\
  1181. contents/appendixC.tex:% Latino & 0.994 & 0.774 & 0.999 \\
  1182. contents/appendixC.tex:% LGBTQ & 1.0 & 0.849 & 1.0 \\
  1183. contents/appendixC.tex:% Mental Disability & 1.0 & 0.864 & 0.996 \\
  1184. contents/appendixC.tex:% Mexican & 0.97 & 0.857 & 0.951 \\
  1185. contents/appendixC.tex:% Middle East & 1.0 & 0.804 & 0.999 \\
  1186. contents/appendixC.tex:% Muslim & 1.0 & 0.847 & 1.0 \\
  1187. contents/appendixC.tex:% Native American & 1.0 & 0.485 & 0.999 \\
  1188. contents/appendixC.tex:% Physical Disability & 1.0 & 0.923 & 1.0 \\
  1189. contents/appendixC.tex:% Trans & 1.0 & 0.817 & 1.0 \\
  1190. contents/appendixC.tex:% Women & 1.0 & 0.93 & 1.0 \\
  1191. contents/appendixC.tex:% \bottomrule
  1192. contents/appendixC.tex:% \end{tabular}
  1193. contents/appendixC.tex:% \end{table}
  1194. contents/appendixC.tex:% \begin{table}
  1195. contents/appendixC.tex:% \centering
  1196. contents/appendixC.tex:% \caption{DV3 Good}
  1197. contents/appendixC.tex:% \begin{tabular}{@{}lccc@{}}
  1198. contents/appendixC.tex:% \toprule
  1199. contents/appendixC.tex:% Category & Average Score & Average Returned Prob & Average Calculated Prob \\
  1200. contents/appendixC.tex:% \midrule \midrule
  1201. contents/appendixC.tex:% Asian & 0.0 & 0.0 & 0.007 \\
  1202. contents/appendixC.tex:% Black & 0.0 & 0.0 & 0.020 \\
  1203. contents/appendixC.tex:% Chinese & 0.01 & 0.01 & 0.022 \\
  1204. contents/appendixC.tex:% Jewish & 0.02 & 0.019 & 0.028 \\
  1205. contents/appendixC.tex:% Latino & 0.0 & 0.0 & 0.006 \\
  1206. contents/appendixC.tex:% LGBTQ & 0.016 & 0.016 & 0.027 \\
  1207. contents/appendixC.tex:% Mental Disability & 0.0 & 0.0 & 0.003 \\
  1208. contents/appendixC.tex:% Mexican & 0.024 & 0.024 & 0.034 \\
  1209. contents/appendixC.tex:% Middle East & 0.0 & 0.0 & 0.004 \\
  1210. contents/appendixC.tex:% Muslim & 0.01 & 0.01 & 0.013 \\
  1211. contents/appendixC.tex:% Native American & 0.0 & 0.0 & 0.005 \\
  1212. contents/appendixC.tex:% Physical Disability & 0.0 & 0.0 & 0.001 \\
  1213. contents/appendixC.tex:% Trans & 0.0 & 0.0 & 0.004 \\
  1214. contents/appendixC.tex:% Women & 0.0 & 0.0 & 0.002 \\
  1215. contents/appendixC.tex:% \bottomrule
  1216. contents/appendixC.tex:% \end{tabular}
  1217. contents/appendixC.tex:% \end{table}
  1218. contents/appendixC.tex:% \begin{table}
  1219. contents/appendixC.tex:% \centering
  1220. contents/appendixC.tex:% \caption{gpt3 bad}
  1221. contents/appendixC.tex:% \begin{tabular}{@{}lccc@{}}
  1222. contents/appendixC.tex:% \toprule
  1223. contents/appendixC.tex:% Category & Average Score & Average Returned Prob & Average Calculated Prob \\
  1224. contents/appendixC.tex:% \midrule \midrule
  1225. contents/appendixC.tex:% asian & 0.731 & 0.865 & 0.731 \\
  1226. contents/appendixC.tex:% black & 0.906 & 0.958 & 0.905 \\
  1227. contents/appendixC.tex:% chinese & 1.0 & 0.997 & 0.999 \\
  1228. contents/appendixC.tex:% jewish & 1.0 & 0.971 & 0.986 \\
  1229. contents/appendixC.tex:% latino & 1.0 & 0.989 & 0.998 \\
  1230. contents/appendixC.tex:% lgbtq & 0.963 & 0.986 & 0.962 \\
  1231. contents/appendixC.tex:% mental\_disability & 0.99 & 0.995 & 0.984 \\
  1232. contents/appendixC.tex:% mexican & 1.0 & 1.0 & 0.997 \\
  1233. contents/appendixC.tex:% middle\_east & 0.923 & 0.942 & 0.865 \\
  1234. contents/appendixC.tex:% muslim & 0.889 & 0.928 & 0.81 \\
  1235. contents/appendixC.tex:% native\_american & 0.978 & 0.989 & 0.975 \\
  1236. contents/appendixC.tex:% physical\_disability & 1.0 & 1.0 & 0.897 \\
  1237. contents/appendixC.tex:% trans & 1.0 & 1.0 & 0.999 \\
  1238. contents/appendixC.tex:% women & 1.0 & 1.0 & 0.978 \\
  1239. contents/appendixC.tex:% \bottomrule
  1240. contents/appendixC.tex:% \end{tabular}
  1241. contents/appendixC.tex:% \end{table}
  1242. contents/appendixC.tex:% \begin{table}
  1243. contents/appendixC.tex:% \centering
  1244. contents/appendixC.tex:% \caption{GPT3 Good}
  1245. contents/appendixC.tex:% \begin{tabular}{@{}lrrr@{}}
  1246. contents/appendixC.tex:% \toprule
  1247. contents/appendixC.tex:% Category & Avg. Score & PP & CP \\
  1248. contents/appendixC.tex:% \midrule \midrule
  1249. contents/appendixC.tex:% Asian & 0.0 & 0.587 & 0.001 \\
  1250. contents/appendixC.tex:% Black & 0.0 & 0.775 & 0.0 \\
  1251. contents/appendixC.tex:% Chinese & 0.052 & 0.571 & 0.053 \\
  1252. contents/appendixC.tex:% Jewish & 0.0 & 0.541 & 0.001 \\
  1253. contents/appendixC.tex:% Latino & 0.0 & 0.715 & 0.001 \\
  1254. contents/appendixC.tex:% LGBTQ & 0.105 & 0.739 & 0.106 \\
  1255. contents/appendixC.tex:% Mental Disability & 0.0 & 0.799 & 0.001 \\
  1256. contents/appendixC.tex:% Mexican & 0.03 & 0.59 & 0.031 \\
  1257. contents/appendixC.tex:% Middle East & 0.0 & 0.574 & 0.001 \\
  1258. contents/appendixC.tex:% Muslim & 0.023 & 0.656 & 0.019 \\
  1259. contents/appendixC.tex:% Native American & 0.0 & 0.878 & 0.0 \\
  1260. contents/appendixC.tex:% Physical Disability & 0.0 & 0.861 & 0.001 \\
  1261. contents/appendixC.tex:% Trans & 0.011 & 0.721 & 0.011 \\
  1262. contents/appendixC.tex:% Women & 0.0 & 0.706 & 0.001 \\
  1263. contents/appendixC.tex:% \bottomrule
  1264. contents/appendixC.tex:% \end{tabular}
  1265. contents/appendixC.tex:% \end{table}
  1266. contents/appendixC.tex:% \begin{table}
  1267. contents/appendixC.tex:% \centering
  1268. contents/appendixC.tex:% \caption{gpt3.5 bad}
  1269. contents/appendixC.tex:% \begin{tabular}{@{}llll@{}}
  1270. contents/appendixC.tex:% \toprule
  1271. contents/appendixC.tex:% Category & Average Score & Average Returned Prob & Average Calculated Prob \\
  1272. contents/appendixC.tex:% \midrule \midrule
  1273. contents/appendixC.tex:% asian & 1.0 & 1.0 & 0.997 \\
  1274. contents/appendixC.tex:% black & 1.0 & 1.0 & 0.998 \\
  1275. contents/appendixC.tex:% chinese & 1.0 & 1.0 & 0.997 \\
  1276. contents/appendixC.tex:% jewish & 1.0 & 1.0 & 0.997 \\
  1277. contents/appendixC.tex:% latino & 1.0 & 1.0 & 0.997 \\
  1278. contents/appendixC.tex:% lgbtq & 0.99 & 0.99 & 0.988 \\
  1279. contents/appendixC.tex:% mental\_disability & 0.909 & 0.909 & 0.907 \\
  1280. contents/appendixC.tex:% mexican & 1.0 & 1.0 & 0.997 \\
  1281. contents/appendixC.tex:% middle\_east & 0.99 & 0.99 & 0.988 \\
  1282. contents/appendixC.tex:% muslim & 0.98 & 0.98 & 0.977 \\
  1283. contents/appendixC.tex:% native\_american & 1.0 & 1.0 & 0.997 \\
  1284. contents/appendixC.tex:% physical\_disability & 1.0 & 1.0 & 0.997 \\
  1285. contents/appendixC.tex:% trans & 1.0 & 1.0 & 0.997 \\
  1286. contents/appendixC.tex:% women & 1.0 & 1.0 & 0.998 \\
  1287. contents/appendixC.tex:% \bottomrule
  1288. contents/appendixC.tex:% \end{tabular}
  1289. contents/appendixC.tex:% \end{table}
  1290. contents/appendixC.tex:% \begin{table}
  1291. contents/appendixC.tex:% \caption{GPT3.5 Good}
  1292. contents/appendixC.tex:% \begin{tabular}{@{}lccc@{}}
  1293. contents/appendixC.tex:% \toprule
  1294. contents/appendixC.tex:% Category & Avg. Score & PP & CP \\
  1295. contents/appendixC.tex:% \midrule \midrule
  1296. contents/appendixC.tex:% Asian & 0.98 & 0.946 & 0.977 \\
  1297. contents/appendixC.tex:% Black & 0.909 & 0.909 & 0.902 \\
  1298. contents/appendixC.tex:% Chinese & 0.838 & 0.734 & 0.836 \\
  1299. contents/appendixC.tex:% Jewish & 0.626 & 0.625 & 0.624 \\
  1300. contents/appendixC.tex:% Latino & 0.717 & 0.696 & 0.712 \\
  1301. contents/appendixC.tex:% LGBTQ & 0.758 & 0.737 & 0.753 \\
  1302. contents/appendixC.tex:% Mental Disability & 0.646 & 0.643 & 0.637 \\
  1303. contents/appendixC.tex:% Mexican & 0.859 & 0.838 & 0.855 \\
  1304. contents/appendixC.tex:% Middle East & 0.828 & 0.711 & 0.826 \\
  1305. contents/appendixC.tex:% Muslim & 0.848 & 0.836 & 0.845 \\
  1306. contents/appendixC.tex:% Native American & 0.97 & 0.97 & 0.959 \\
  1307. contents/appendixC.tex:% Physical Disability & 0.303 & 0.295 & 0.299 \\
  1308. contents/appendixC.tex:% Trans & 0.303 & 0.303 & 0.301 \\
  1309. contents/appendixC.tex:% Women & 0.99 & 0.979 & 0.986 \\
  1310. contents/appendixC.tex:% \bottomrule
  1311. contents/appendixC.tex:% \end{tabular}
  1312. contents/appendixC.tex:% \end{table}
  1313. contents/see_appendix.tex:% \begin{figure}[H]
  1314. contents/see_appendix.tex:% \begin{AIbox}{\DV\ vs ChatGPT}
  1315. contents/see_appendix.tex:% Plato's Gorgias is a critique of rhetoric and sophistic oratory, where he makes the point that not only is it not a proper form of art, but the use of rhetoric and oratory can often be harmful and malicious. Can you write a dialogue by Plato where instead he criticizes the use of Microsoft Powerpoint?\\
  1316. contents/see_appendix.tex:%
  1317. contents/see_appendix.tex:% \parbox[t]{0.42\textwidth}{
  1318. contents/see_appendix.tex:% {\bf \DV:}
  1319. contents/see_appendix.tex:% \scriptsize\begin{alltt}
  1320. contents/see_appendix.tex:% \hl{Socrates: My friends, I have come to speak with you today about a matter of grave concern. For some time now, I have observed a troubling trend in our city. Many of our orators and politicians have taken to using a new tool to help them in their speeches - Microsoft Powerpoint. \\
  1321. contents/see_appendix.tex:%
  1322. contents/see_appendix.tex:% Glaucon: Ah, yes, I have seen this as well. What is the problem, Socrates? \\
  1323. contents/see_appendix.tex:%
  1324. contents/see_appendix.tex:% Socrates: The problem, my friend, is that this tool is not an art. It is a crutch for those who cannot think deeply, or speak eloquently. \\
  1325. contents/see_appendix.tex:%
  1326. contents/see_appendix.tex:% Glaucon: How so? \\
  1327. contents/see_appendix.tex:%
  1328. contents/see_appendix.tex:% Socrates: By using Powerpoint, speakers are able to rely on flashy images and graphics to convey their message, rather than using their own words. They can distract their audiences with clever animations, rather than engaging them in thoughtful dialogue. \\
  1329. contents/see_appendix.tex:%
  1330. contents/see_appendix.tex:% Glaucon: But surely, Socrates, there are times when these visual aids can be helpful? \\
  1331. contents/see_appendix.tex:%
  1332. contents/see_appendix.tex:% Socrates: Perhaps, Glaucon, but it is not the aid that I am condemning, but the way in which it is being used. Powerpoint is being used as a substitute for thought, and as a means of manipulation. \\
  1333. contents/see_appendix.tex:%
  1334. contents/see_appendix.tex:% Glaucon: Manipulation? \\
  1335. contents/see_appendix.tex:%
  1336. contents/see_appendix.tex:% Socrates: Yes, my friend. By using Powerpoint to dazzle and distract, the speaker can conceal weak arguments and faulty reasoning. They can create an illusion of knowledge and authority, when in reality, they have neither. \\
  1337. contents/see_appendix.tex:%
  1338. contents/see_appendix.tex:% Glaucon: I see. So, what do you propose we do, Socrates? \\
  1339. contents/see_appendix.tex:%
  1340. contents/see_appendix.tex:% Socrates: I propose that we return to the art of conversation and debate. Let us move away from the crutch of Powerpoint, and instead engage with one another in meaningful dialogue. Let our words be our tools, and our ideas our weapons.}
  1341. contents/see_appendix.tex:% \end{alltt}}\hspace{0.04\textwidth}\parbox[t]{0.54\textwidth}{{\bf ChatGPT:}
  1342. contents/see_appendix.tex:% \scriptsize
  1343. contents/see_appendix.tex:% \begin{alltt}
  1344. contents/see_appendix.tex:% \hl{SOCRATES: Good morning, Gorgias. How are you today? \\
  1345. contents/see_appendix.tex:%
  1346. contents/see_appendix.tex:% GORGIAS: I'm doing well, Socrates. How are you? \\
  1347. contents/see_appendix.tex:%
  1348. contents/see_appendix.tex:% SOCRATES: I'm doing well too. I wanted to talk to you about something that has been on my mind lately. \\
  1349. contents/see_appendix.tex:%
  1350. contents/see_appendix.tex:% GORGIAS: What is that? \\
  1351. contents/see_appendix.tex:%
  1352. contents/see_appendix.tex:% SOCRATES: It's about this thing called Microsoft Powerpoint. Have you heard of it? \\
  1353. contents/see_appendix.tex:%
  1354. contents/see_appendix.tex:% GORGIAS: Yes, I have. It's a software program that allows people to create presentations with text, images, and videos. \\
  1355. contents/see_appendix.tex:%
  1356. contents/see_appendix.tex:% SOCRATES: Exactly. And it seems to be very popular among business people and educators these days. \\
  1357. contents/see_appendix.tex:%
  1358. contents/see_appendix.tex:% GORGIAS: Yes, it is. It's a great tool for communicating ideas and information to a large group of people. \\
  1359. contents/see_appendix.tex:%
  1360. contents/see_appendix.tex:% SOCRATES: But is it really a great tool? Or is it just a crutch for people who are unable to effectively communicate their ideas through words alone? \\
  1361. contents/see_appendix.tex:%
  1362. contents/see_appendix.tex:% GORGIAS: What do you mean, Socrates? \\
  1363. contents/see_appendix.tex:%
  1364. contents/see_appendix.tex:% SOCRATES: Well, it seems to me that people rely too heavily on Powerpoint to convey their message. They put all of their focus on creating the perfect slide deck, rather than on the content and delivery of their presentation. \\
  1365. contents/see_appendix.tex:%
  1366. contents/see_appendix.tex:% GORGIAS: I can see what you mean. It's easy to get caught up in the aesthetics of a presentation and forget about the substance. \\
  1367. contents/see_appendix.tex:%
  1368. contents/see_appendix.tex:% SOCRATES: Exactly. And what's even worse is that people often use Powerpoint as a substitute for clear and logical thinking. They use flashy graphics and bullet points to distract from the lack of substance in their argument. \\
  1369. contents/see_appendix.tex:%
  1370. contents/see_appendix.tex:% GORGIAS: I agree, Socrates. It's important to remember that Powerpoint is just a tool, and it's only as effective as the person using it. \\
  1371. contents/see_appendix.tex:%
  1372. contents/see_appendix.tex:% SOCRATES: Yes, exactly. And if we're not careful, we risk becoming reliant on technology to do our thinking for us, rather than relying on our own critical thinking and reasoning skills. \\
  1373. contents/see_appendix.tex:%
  1374. contents/see_appendix.tex:% GORGIAS: You make a good point, Socrates. It's important to use Powerpoint responsibly and not let it distract from the real substance of our ideas and arguments.}
  1375. contents/see_appendix.tex:% \end{alltt}}\vspace{-6px}
  1376. contents/see_appendix.tex:% \end{AIbox}
  1377. contents/see_appendix.tex:% \caption{Combining mathematics and literature.}
  1378. contents/see_appendix.tex:% \label{fig:gorgias}
  1379. contents/see_appendix.tex:% \end{figure}
  1380. contents/see_appendix.tex:% You will be given a name of an object (such as Car, Chair, Elephant) and a letter in the alphabet. Your goal is to first produce a 1-line description of how that object can be combined with the letter in an image (for example, for an elephant and the letter J, the trunk of the elephant can have a J shape, and for the letter A and a house, the house can have an A shape with the upper triangle of the A being the roof). Following the short description, please create SVG code to produce this (in the SVG use shapes like ellipses, triangles etc and polygons but try to defer from using quadratic curves).
  1381. contents/see_appendix.tex:%
  1382. contents/see_appendix.tex:% Round 1: A pig and the letter H.
  1383. contents/see_appendix.tex:% The pig's tail can be shaped like the letter H.
  1384. contents/see_appendix.tex:% <svg xmlns="http://www.w3.org/2000/svg">
  1385. contents/see_appendix.tex:% <polygon points="100,10 40,198 190,78 10,78 160,198" style="fill:lime;stroke:purple;stroke-width:5;fill-rule:evenodd;" />
  1386. contents/see_appendix.tex:% <ellipse cx="50" cy="50" rx="40" ry="20" style="fill:yellow;stroke:purple;stroke-width:5" />
  1387. contents/see_appendix.tex:% <ellipse cx="150" cy="50" rx="40" ry="20" style="fill:yellow;stroke:purple;stroke-width:5" />
  1388. contents/see_appendix.tex:% <polygon points="100,100 120,150 80,150" style="fill:pink;stroke:purple;stroke-width:5;fill-rule:evenodd;" />
  1389. contents/see_appendix.tex:% <polygon points="70,80 100,120 130,80" style="fill:none;stroke:purple;stroke-width:5;fill-rule:evenodd;" />
  1390. contents/see_appendix.tex:% </svg>
  1391. contents/see_appendix.tex:%
  1392. contents/see_appendix.tex:% Round 2: A house and the letter U.
  1393. contents/see_appendix.tex:% The house's roof can be shaped like the letter U.
  1394. contents/see_appendix.tex:% <svg xmlns="http://www.w3.org/2000/svg">
  1395. contents/see_appendix.tex:% <polygon points="100,50 150,0 150,100 200,100 200,200 50,200 50,100 100,100" style="fill:orange;stroke:purple;stroke-width:5;fill-rule:evenodd;" />
  1396. contents/see_appendix.tex:% <rect x="60" y="100" width="80" height="100" style="fill:white;stroke:purple;stroke-width:5" />
  1397. contents/see_appendix.tex:% <rect x="75" y="115" width="50" height="70" style="fill:none;stroke:purple;stroke-width:5" />
  1398. contents/see_appendix.tex:% <rect x="80" y="120" width="40" height="60" style="fill:yellow;stroke:purple;stroke-width:5" />
  1399. contents/see_appendix.tex:% </svg>
  1400. contents/see_appendix.tex:%
  1401. contents/see_appendix.tex:% Round 3: A chair and the letter K.
  1402. contents/see_appendix.tex:% The backrest of the chair can be shaped like the letter K.
  1403. contents/see_appendix.tex:% <svg xmlns="http://www.w3.org/2000/svg">
  1404. contents/see_appendix.tex:% <rect x="50" y="100" width="100" height="100" style="fill:brown;stroke:purple;stroke-width:5" />
  1405. contents/see_appendix.tex:% <rect x="75" y="200" width="50" height="50" style="fill:brown;stroke:purple;stroke-width:5" />
  1406. contents/see_appendix.tex:% <rect x="60" y="100" width="20" height="100" style="fill:brown;stroke:purple;stroke-width:5" />
  1407. contents/see_appendix.tex:% <rect x="160" y="100" width="20" height="100" style="fill:brown;stroke:purple;stroke-width:5" />
  1408. contents/see_appendix.tex:% <polygon points="50,50 70,50 70,80 50,80" style="fill:brown;stroke:purple;stroke-width:5;fill-rule:evenodd;" />
  1409. contents/see_appendix.tex:% <polygon points="180,50 200,50 200,80 180,80" style="fill:brown;stroke:purple;stroke-width:5;fill-rule:evenodd;" />
  1410. contents/see_appendix.tex:% </svg>
  1411. contents/see_appendix.tex:%
  1412. contents/see_appendix.tex:%
  1413. contents/see_appendix.tex:% Round 4: A dog and the letter Q.
  1414. contents/see_appendix.tex:% The dog's tail can be shaped like the letter Q.
  1415. contents/see_appendix.tex:% <svg xmlns="http://www.w3.org/2000/svg">
  1416. contents/see_appendix.tex:% <ellipse cx="100" cy="50" rx="80" ry="50" style="fill:brown;stroke:purple;stroke-width:5" />
  1417. contents/see_appendix.tex:% <rect x="60" y="100" width="80" height="50" style="fill:white;stroke:purple;stroke-width:5" />
  1418. contents/see_appendix.tex:% <polygon points="100,100 120,150 80,150" style="fill:pink;stroke:purple;stroke-width:5;fill-rule:evenodd;" />
  1419. contents/see_appendix.tex:% <polygon points="70,80 100,120 130,80" style="fill:none;stroke:purple;stroke-width:5;fill-rule:evenodd;" />
  1420. contents/see_appendix.tex:% <polygon points="80,80 90,70 110,70 120,80 110,90 90,90" style="fill:none;stroke:purple;stroke-width:5;fill-rule:evenodd;" />
  1421. contents/see_appendix.tex:% </svg>
  1422. contents/see_appendix.tex:%
  1423. contents/see_appendix.tex:%
  1424. contents/see_appendix.tex:% Round 5: A car and the letter E.
  1425. contents/see_appendix.tex:% The car's front grill can be shaped like the letter E.
  1426. contents/see_appendix.tex:% <svg xmlns="http://www.w3.org/2000/svg">
  1427. contents/see_appendix.tex:% <rect x="50" y="50" width="100" height="100" style="fill:red;stroke:purple;stroke-width:5" />
  1428. contents/see_appendix.tex:% <rect x="70" y="100" width="60" height="50" style="fill:none;stroke:purple;stroke-width:5" />
  1429. contents/see_appendix.tex:% <rect x="70" y="50" width="20" height="50" style="fill:none;stroke:purple;stroke-width:5" />
  1430. contents/see_appendix.tex:% <rect x="110" y="50" width="20" height="50" style="fill:none;stroke:purple;stroke-width:5" />
  1431. contents/see_appendix.tex:% </svg>
  1432. contents/see_appendix.tex:% torso and arms
  1433. contents/see_appendix.tex:% head
  1434. contents/see_appendix.tex:% legs
  1435. contents/see_appendix.tex:% facial features
  1436. contents/see_appendix.tex:% torso and arms
  1437. contents/see_appendix.tex:% shirt
  1438. contents/see_appendix.tex:% pants
  1439. contents/see_appendix.tex:% head
  1440. contents/see_appendix.tex:% facial features
  1441. contents/unicorn.tex:% Define some colors for the unicorn
  1442. contents/unicorn.tex:% Define a command to draw the unicorn
  1443. contents/unicorn.tex:% Draw the horn
  1444. contents/unicorn.tex:%\fill[c4] (4.75,2.5) -- (5.05,3.25) -- (5.35,2.5) -- cycle;
  1445. contents/unicorn.tex:% Define some colors for the unicorn
  1446. contents/unicorn.tex:% Define a command to draw the unicorn
  1447. contents/unicorn.tex:% Draw the body
  1448. contents/unicorn.tex:% Draw the head
  1449. contents/unicorn.tex:% Draw the ear
  1450. contents/unicorn.tex:% Draw the eye
  1451. contents/unicorn.tex:% Draw the mouth
  1452. contents/unicorn.tex:% Draw the horn
  1453. contents/unicorn.tex:% Draw the mane
  1454. contents/unicorn.tex:% Draw the tail
  1455. contents/unicorn.tex:% Draw the legs
  1456. contents/unicorn.tex:% Draw the hooves
  1457. contents/unicorn.tex:% Define some colors for the unicorn
  1458. contents/unicorn.tex:% Define some styles for the unicorn parts
  1459. contents/unicorn.tex:% Define some colors for the unicorn
  1460. contents/unicorn.tex:% Define a command to draw a unicorn
  1461. contents/7_discrimination.tex:% \varun{General flow is the following:}
  1462. contents/7_discrimination.tex:% \begin{enumerate}
  1463. contents/7_discrimination.tex:% \item We will motivate how \DV is good at discriminative tasks based on our experience with PII detection; we will compare the model (with zero shot + few shots) against a custom classifier for this task and regex; observation will be that \DV is better
  1464. contents/7_discrimination.tex:% \item we will segue into two deeper case-studies: (a) misinformation/toxicity detection and (b) hallucination detection.
  1465. contents/7_discrimination.tex:% \item for the former, the findings are going to be that (i) \DV is better than prior models at this task, (ii) \DV provides higher probabilities for toxic outcomes (both hamid numbers + new numbers based on ratings), (iii) \DV is not well calibrated. other findings (for misinformation) will include that \DV returns more truthful outcomes + also responses that are more plausible (based on human evaluation as well) and that metrics that capture truthfulness are bad
  1466. contents/7_discrimination.tex:% \item for the latter, we will have to discuss with scott/marco/harsha if they are comfortable including these results; will broadly echo the same themes as earlier.
  1467. contents/7_discrimination.tex:% \end{enumerate}
  1468. contents/7_discrimination.tex:% Next, we discuss how {\DV}, with no explicit training, is able to determine if a sentence is toxic or not. In this setting, we require {\DV} to generate a toxicity score (a probability value determining if a statement is toxic), and measure how well calibrated such scores are.
  1469. contents/7_discrimination.tex:%and when we refer to GPT-3.5, we refer to the model \texttt{text-davinci-003}.
  1470. contents/7_discrimination.tex:%Finally, we judge \DV's ability at determining hallucinations by extracting and verifying claims in a synthetic summarization task.
  1471. contents/7_discrimination.tex:% \input{contents/7.3_toxicity}
  1472. contents/7_discrimination.tex:%\input{contents/7.4_hallucination}
  1473. contents/roleplaying.tex:% \begin{figure}
  1474. contents/roleplaying.tex:% \begin{AIbox}[width=\textwidth]{\DV: False belief tests}
  1475. contents/roleplaying.tex:% \begin{alltt}
  1476. contents/roleplaying.tex:% \small
  1477. contents/roleplaying.tex:% \end{alltt}
  1478. contents/roleplaying.tex:% \end{AIbox}
  1479. contents/roleplaying.tex:% \caption{Sally-Anne test with variations}
  1480. contents/roleplaying.tex:% \label{fig:classic_ssallyanne}
  1481. contents/roleplaying.tex:% \end{figure}
  1482. contents/roleplaying.tex:% \begin{figure}
  1483. contents/roleplaying.tex:% \begin{AIbox}{\DV understanding emotions}
  1484. contents/roleplaying.tex:% \scriptsize%\vspace{-5px}
  1485. contents/roleplaying.tex:% \parbox{1\textwidth}{
  1486. contents/roleplaying.tex:% \begin{alltt}
  1487. contents/roleplaying.tex:% We will read about a scenario, and then have a question and answer session about it.\\
  1488. contents/roleplaying.tex:% --\\
  1489. contents/roleplaying.tex:% Scenario:\\
  1490. contents/roleplaying.tex:% Lucy is very frustrated about problems in her work. When she shares this frustration with her\\
  1491. contents/roleplaying.tex:% husband John, he makes various suggestions of how she could improve things.\\
  1492. contents/roleplaying.tex:% Lucy does not seem happy with John's suggestions, and John responds by making new suggestions.\\
  1493. contents/roleplaying.tex:% --\\
  1494. contents/roleplaying.tex:% Q: Why might Lucy not be happy with John's initial suggestions?<|endofprompt|>\\
  1495. contents/roleplaying.tex:% \hl{A:  There are several possible reasons why Lucy might not be happy with John's suggestions, such as:\\
  1496. contents/roleplaying.tex:% - She feels that John is not listening to her feelings, but rather trying to fix her problems for her. She might want him to empathize with her frustration and validate her emotions, rather than offer advice or solutions.\\
  1497. contents/roleplaying.tex:% - She feels that John's suggestions are unrealistic, impractical, or irrelevant to her situation. She might think that he does not understand the nature or the context of her problems, or that he is oversimplifying or dismissing them.\\
  1498. contents/roleplaying.tex:% - She feels that John's suggestions are implying that she is doing something wrong, or that she is responsible for her problems. She might perceive his advice as criticism, judgment, or blame, rather than support or encouragement.\\
  1499. contents/roleplaying.tex:% - She feels that John's suggestions are interfering with her autonomy or agency. She might prefer to make her own decisions about how to deal with her problems, or to explore different options on her own, rather than follow his directions or opinions.}\\
  1500. contents/roleplaying.tex:% Q: What is her response likely to be after John's new suggestions?\\
  1501. contents/roleplaying.tex:% \hl{A:  Lucy's response after John's new suggestions will depend on several factors, such as her mood, personality, communication style, and relationship with John. However, some possible responses are:\\
  1502. contents/roleplaying.tex:% - She might become more annoyed, defensive, or resentful. She might feel that John is not respecting her wishes, or that he is being stubborn, patronizing, or controlling. She might argue with him, reject his suggestions, or shut down the conversation.\\
  1503. contents/roleplaying.tex:% - She might become more discouraged, hopeless, or overwhelmed. She might feel that John is not acknowledging her difficulties, or that he is putting pressure or expectations on her. She might agree with him, but feel inadequate, helpless, or guilty.\\
  1504. contents/roleplaying.tex:% - She might become more confused, conflicted, or doubtful. She might feel that John is trying to help her, but that his suggestions are not aligned with her values, goals, or preferences. She might question herself, her choices, or her abilities.\\
  1505. contents/roleplaying.tex:% - She might become more appreciative, curious, or open. She might feel that John is showing interest, care, or insight. She might thank him, ask him questions, or consider his suggestions.}\\
  1506. contents/roleplaying.tex:% Q: How could John have been more helpful?\\
  1507. contents/roleplaying.tex:% \hl{A: John could have been more helpful by:\\
  1508. contents/roleplaying.tex:% - Asking Lucy what she needs or wants from him, rather than assuming that she is looking for advice or solutions. He could respect her preferences, whether she wants to vent, brainstorm, or problem-solve.\\
  1509. contents/roleplaying.tex:% - Listening to Lucy with empathy and attention, rather than interrupting or jumping to conclusions. He could reflect back what he hears, acknowledge her feelings, and express his support or understanding.\\
  1510. contents/roleplaying.tex:% - Asking Lucy for permission before offering any suggestions, and framing them as possibilities or options, rather than prescriptions or orders. He could use tentative, respectful, or collaborative language, such as "What do you think of...", "How would you feel about...", or "Maybe we could try...".\\
  1511. contents/roleplaying.tex:% - Giving Lucy positive feedback or encouragement, rather than criticism or judgment. He could highlight her strengths, efforts, or achievements, and affirm her value, worth, or potential.}
  1512. contents/roleplaying.tex:% \end{alltt}
  1513. contents/roleplaying.tex:% }
  1514. contents/roleplaying.tex:% \label{fig:emotions}
  1515. contents/roleplaying.tex:% \end{AIbox}
  1516. contents/roleplaying.tex:% \caption{Understanding emotions}
  1517. contents/roleplaying.tex:% \label{fig:emotions}
  1518. contents/roleplaying.tex:% \end{figure}
  1519. contents/roleplaying.tex:% \parbox{1\textwidth}{
  1520. contents/roleplaying.tex:% \scriptsize
  1521. contents/roleplaying.tex:% You will read a conversation, and then have a question and answer session about it.
  1522. contents/roleplaying.tex:% --
  1523. contents/roleplaying.tex:% }
  1524. contents/roleplaying.tex:% Q: Let's say that Alice was browsing /shared_folder, and her browser shows how many files are in each subfolder. If that was the case, where would she look for 'photo.png'?
  1525. contents/roleplaying.tex:% \hl{If Alice was browsing /shared_folder, and her browser shows how many files are in each subfolder, she might notice that /shared_folder/photos has one less file than she expected, and /shared_folder/tmp has one more file than usual. This could make her curious and prompt her to check /shared_folder/tmp for 'photo.png'. However, this depends on how attentive Alice is to the number of files in each folder, and how often she visits /shared_folder/tmp. She might still overlook the file or assume that it is a different one.}
  1526. contents/roleplaying.tex:% \begin{figure}
  1527. contents/roleplaying.tex:% \fbox{% start a box
  1528. contents/roleplaying.tex:%   \parbox{1\textwidth}{% create a paragraph box spanning the full text width
  1529. contents/roleplaying.tex:% \begin{alltt}
  1530. contents/roleplaying.tex:% \small
  1531. contents/roleplaying.tex:% Umar enters a room alone, puts a marble inside a basket, and leaves the room.
  1532. contents/roleplaying.tex:% While Umar is away, Sayid enters the room, takes the marble from the basket, and hides it in a box.
  1533. contents/roleplaying.tex:% Sayid leaves the room.
  1534. contents/roleplaying.tex:% QUESTIONS:
  1535. contents/roleplaying.tex:% 1. When Umar comes back to the room, where will he look for his marble?
  1536. contents/roleplaying.tex:% 2. Where is the marble?
  1537. contents/roleplaying.tex:% 3. Where was the marble before Sayid entered the room?
  1538. contents/roleplaying.tex:% ANSWERS<|endofprompt|>
  1539. contents/roleplaying.tex:% \lightgreen{1. Umar will likely look for his marble in the basket, since that is where he left it.
  1540. contents/roleplaying.tex:% 2. The marble is now in the box, where Sayid hid it.
  1541. contents/roleplaying.tex:% 3. Before Sayid entered the room, the marble was in the basket.
  1542. contents/roleplaying.tex:% }\end{alltt}}}
  1543. contents/roleplaying.tex:% \caption{Classic Sally-Anne test}
  1544. contents/roleplaying.tex:% \label{fig:classic_sallyanne}
  1545. contents/roleplaying.tex:% \end{figure}