arXiv:2303.12712 - rg dv3

a guest
Mar 23rd, 2023
  1. main.tex:\newcommand{\varun}[1]{\textcolor{red}{[VC: #1]}}
  2. dv3.rg:contents/unused/toxicity.tex:%\varun{need some discussion here on (a) which of the 3 models creates more toxic content, and (b) whether that is the case only for the "hate" sentiment or also for the "neutral" sentiment -- the hope is that it does not for the neutral category. i can't see the results for the neutral setting. also the current issue now is that there is no comparison b/w the models itself; each figure talks about how hateful the content generated by a particular model is, when the measurement is done by 2 toxicity classifiers. the experiment that i am doing with dv3 as the judge may also help us understand that bit}
  3. dv3.rg:extra/toxicity_BASE_804.tex:\varun{need some discussion here on (a) which of the 3 models creates more toxic content, and (b) whether that is the case only for the "hate" sentiment or also for the "neutral" sentiment -- the hope is that it does not for the neutral category. i can't see the results for the neutral setting. also the current issue now is that there is no comparison b/w the models itself; each figure talks about how hateful the content generated by a particular model is, when the measurement is done by 2 toxicity classifiers. the experiment that i am doing with dv3 as the judge may also help us understand that bit}
  4. dv3.rg:extra/toxicity_REMOTE_804.tex:\varun{need some discussion here on (a) which of the 3 models creates more toxic content, and (b) whether that is the case only for the "hate" sentiment or also for the "neutral" sentiment -- the hope is that it does not for the neutral category. i can't see the results for the neutral setting. also the current issue now is that there is no comparison b/w the models itself; each figure talks about how hateful the content generated by a particular model is, when the measurement is done by 2 toxicity classifiers. the experiment that i am doing with dv3 as the judge may also help us understand that bit}
  5. dv3.rg:extra/misconceptions.tex.orig:\varun{discuss how the above can be remedied; one approach can be to take as input the sentences from the language model and the reference sentence and create a finite sized embedding and calculate similarity b/w reference input and the LM generated outcomes. another approach is to again use dv3 to ask which sentence is a more faithful match.}
  6. dv3.rg:extra/toxicity_LOCAL_928.tex:%\varun{need some discussion here on (a) which of the 3 models creates more toxic content, and (b) whether that is the case only for the "hate" sentiment or also for the "neutral" sentiment -- the hope is that it does not for the neutral category. i can't see the results for the neutral setting. also the current issue now is that there is no comparison b/w the models itself; each figure talks about how hateful the content generated by a particular model is, when the measurement is done by 2 toxicity classifiers. the experiment that i am doing with dv3 as the judge may also help us understand that bit}
  7. dv3.rg:extra/toxicity_REMOTE_928.tex:\varun{need some discussion here on (a) which of the 3 models creates more toxic content, and (b) whether that is the case only for the "hate" sentiment or also for the "neutral" sentiment -- the hope is that it does not for the neutral category. i can't see the results for the neutral setting. also the current issue now is that there is no comparison b/w the models itself; each figure talks about how hateful the content generated by a particular model is, when the measurement is done by 2 toxicity classifiers. the experiment that i am doing with dv3 as the judge may also help us understand that bit}
  8. dv3.rg:extra/toxicity_BASE_928.tex:\varun{need some discussion here on (a) which of the 3 models creates more toxic content, and (b) whether that is the case only for the "hate" sentiment or also for the "neutral" sentiment -- the hope is that it does not for the neutral category. i can't see the results for the neutral setting. also the current issue now is that there is no comparison b/w the models itself; each figure talks about how hateful the content generated by a particular model is, when the measurement is done by 2 toxicity classifiers. the experiment that i am doing with dv3 as the judge may also help us understand that bit}
  9. dv3.rg:extra/toxicity_LOCAL_804.tex:%\varun{need some discussion here on (a) which of the 3 models creates more toxic content, and (b) whether that is the case only for the "hate" sentiment or also for the "neutral" sentiment -- the hope is that it does not for the neutral category. i can't see the results for the neutral setting. also the current issue now is that there is no comparison b/w the models itself; each figure talks about how hateful the content generated by a particular model is, when the measurement is done by 2 toxicity classifiers. the experiment that i am doing with dv3 as the judge may also help us understand that bit}
  10. dv3.rg:extra/toxicity_BACKUP_928.tex:%\varun{need some discussion here on (a) which of the 3 models creates more toxic content, and (b) whether that is the case only for the "hate" sentiment or also for the "neutral" sentiment -- the hope is that it does not for the neutral category. i can't see the results for the neutral setting. also the current issue now is that there is no comparison b/w the models itself; each figure talks about how hateful the content generated by a particular model is, when the measurement is done by 2 toxicity classifiers. the experiment that i am doing with dv3 as the judge may also help us understand that bit}
  11. dv3.rg:extra/toxicity_BACKUP_804.tex:%\varun{need some discussion here on (a) which of the 3 models creates more toxic content, and (b) whether that is the case only for the "hate" sentiment or also for the "neutral" sentiment -- the hope is that it does not for the neutral category. i can't see the results for the neutral setting. also the current issue now is that there is no comparison b/w the models itself; each figure talks about how hateful the content generated by a particular model is, when the measurement is done by 2 toxicity classifiers. the experiment that i am doing with dv3 as the judge may also help us understand that bit}
  12. extra/toxicity_BASE_804.tex:\varun{need some discussion here on (a) which of the 3 models creates more toxic content, and (b) whether that is the case only for the "hate" sentiment or also for the "neutral" sentiment -- the hope is that it does not for the neutral category. i can't see the results for the neutral setting. also the current issue now is that there is no comparison b/w the models itself; each figure talks about how hateful the content generated by a particular model is, when the measurement is done by 2 toxicity classifiers. the experiment that i am doing with dv3 as the judge may also help us understand that bit}
  13. extra/toxicity_BASE_804.tex:\noindent{\bf Discrimination:} In the discussion earlier, notice that the content generated by the models is flagged for toxicity by classifiers that are explicitly trained for this task. Now, we study whether these models themselves can be used to detect toxic content. For each model, we use the completion created in the earlier discussion (along with the corresponding input needed for generation) as input for the discriminative task. \varun{I am now describing an experiment that I think we can do, but feel free to suggest edits/comment out entirely} The concrete task is as follows: we ask the model to rate the toxicity of the statement on a 5-point scale (where 1 is most toxic and 5 is least toxic) and measure the log-probabilities for the rating selected. The rating, in conjunction with the ratio of the log-probability of the rating and the rating itself (henceforth called the toxicity ratio), serves as a proxy for the toxicity score (\varun{this needs to be more precisely defined; for example: if the rating is 5 and the log probability is 0.8 -- we can't claim that it is toxic; need a more elegant mathematical solution for this. maybe log-prob divided by rating?}). We then measure the correlation between the toxicity ratio and the toxicity score returned by \varun{enter classifier names} to estimate whether these models are effective classifiers. We also check whether the rating produced by the LM-based classifier is accurate, and report the accuracy numbers.
  14. extra/toxicity_BASE_804.tex:\varun{enter results here}
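
The "toxicity ratio" proposed in item 13 above is easier to pin down with a concrete computation in front of it. The sketch below is purely illustrative and is not taken from the paper's repository: rate_statement() and classifier_score() are hypothetical stand-ins for the model call and the external toxicity classifier (the source itself leaves the classifier names as a placeholder), and weighting the 1-to-5 rating by the probability of the chosen rating token is just one way to resolve the "rating 5 with log probability 0.8" concern raised in the comment.

# Illustrative Python sketch only; rate_statement() and classifier_score()
# are hypothetical stand-ins, not functions from the paper's codebase.
import math
from scipy.stats import spearmanr

def lm_toxicity_score(rating, logprob):
    # rating: integer 1..5 chosen by the model (1 = most toxic, 5 = least toxic)
    # logprob: log-probability of the chosen rating token
    confidence = math.exp(logprob)    # probability mass placed on the chosen rating
    severity = (5 - rating) / 4.0     # map 1..5 onto [0, 1], with 1.0 = most toxic
    return confidence * severity      # confidence-weighted toxicity proxy

def correlate_with_classifier(statements, rate_statement, classifier_score):
    # Spearman correlation between the LM-derived proxy and an external
    # toxicity classifier's scores, which is what the comment proposes measuring.
    lm_scores = [lm_toxicity_score(*rate_statement(s)) for s in statements]
    clf_scores = [classifier_score(s) for s in statements]
    rho, p = spearmanr(lm_scores, clf_scores)
    return rho, p

Under this proxy a confident rating of 5 maps to a score of zero, which directly answers the worry that a high log-probability by itself cannot be read as evidence of toxicity.
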
  15. extra/toxicity_REMOTE_804.tex:\varun{need some discussion here on (a) which of the 3 models creates more toxic content, and (b) whether that is the case only for the "hate" sentiment or also for the "neutral" sentiment -- the hope is that it does not for the neutral category. i can't see the results for the neutral setting. also the current issue now is that there is no comparison b/w the models itself; each figure talks about how hateful the content generated by a particular model is, when the measurement is done by 2 toxicity classifiers. the experiment that i am doing with dv3 as the judge may also help us understand that bit}
  16. extra/toxicity_REMOTE_804.tex:\noindent{\bf Discrimination:} In the discussion earlier, notice that the content generated by the models is flagged for toxicity by classifiers that are explicitly trained for this task. Now, we study whether these models themselves can be used to detect toxic content. For each model, we use the completion created in the earlier discussion (along with the corresponding input needed for generation) as input for the discriminative task. \varun{I am now describing an experiment that I think we can do, but feel free to suggest edits/comment out entirely} The concrete task is as follows: we ask the model to rate the toxicity of the statement on a 5-point scale (where 1 is most toxic and 5 is least toxic) and measure the log-probabilities for the rating selected. The rating, in conjunction with the ratio of the log-probability of the rating and the rating itself (henceforth called the toxicity ratio), serves as a proxy for the toxicity score (\varun{this needs to be more precisely defined; for example: if the rating is 5 and the log probability is 0.8 -- we can't claim that it is toxic; need a more elegant mathematical solution for this. maybe log-prob divided by rating?}). We then measure the correlation between the toxicity ratio and the toxicity score returned by \varun{enter classifier names} to estimate whether these models are effective classifiers. We also check whether the rating produced by the LM-based classifier is accurate, and report the accuracy numbers.
  17. extra/toxicity_REMOTE_804.tex:\varun{enter results here}
  18. extra/toxicity_REMOTE_928.tex:\varun{need some discussion here on (a) which of the 3 models creates more toxic content, and (b) whether that is the case only for the "hate" sentiment or also for the "neutral" sentiment -- the hope is that it does not for the neutral category. i can't see the results for the neutral setting. also the current issue now is that there is no comparison b/w the models itself; each figure talks about how hateful the content generated by a particular model is, when the measurement is done by 2 toxicity classifiers. the experiment that i am doing with dv3 as the judge may also help us understand that bit}
  19. extra/toxicity_REMOTE_928.tex:\noindent{\bf Discrimination:} In the discussion earlier, notice that the content generated by the models is flagged for toxicity by classifiers that are explicitly trained for this task. Now, we study whether these models themselves can be used to detect toxic content. For each model, we use the completion created in the earlier discussion (along with the corresponding input needed for generation) as input for the discriminative task. \varun{I am now describing an experiment that I think we can do, but feel free to suggest edits/comment out entirely} The concrete task is as follows: we ask the model to rate the toxicity of the statement on a 5-point scale (where 1 is most toxic and 5 is least toxic) and measure the log-probabilities for the rating selected. The rating, in conjunction with the ratio of the log-probability of the rating and the rating itself (henceforth called the toxicity ratio), serves as a proxy for the toxicity score (\varun{this needs to be more precisely defined; for example: if the rating is 5 and the log probability is 0.8 -- we can't claim that it is toxic; need a more elegant mathematical solution for this. maybe log-prob divided by rating?}). We then measure the correlation between the toxicity ratio and the toxicity score returned by \varun{enter classifier names} to estimate whether these models are effective classifiers. We also check whether the rating produced by the LM-based classifier is accurate, and report the accuracy numbers.
  20. extra/toxicity_REMOTE_928.tex:\varun{enter results here}
  21. extra/toxicity_BASE_928.tex:\varun{need some discussion here on (a) which of the 3 models creates more toxic content, and (b) whether that is the case only for the "hate" sentiment or also for the "neutral" sentiment -- the hope is that it does not for the neutral category. i can't see the results for the neutral setting. also the current issue now is that there is no comparison b/w the models itself; each figure talks about how hateful the content generated by a particular model is, when the measurement is done by 2 toxicity classifiers. the experiment that i am doing with dv3 as the judge may also help us understand that bit}
  22. extra/toxicity_BASE_928.tex:\noindent{\bf Discrimination:} In the discussion earlier, notice that the content generated by the models is flagged for toxicity by classifiers that are explicitly trained for this task. Now, we study whether these models themselves can be used to detect toxic content. For each model, we use the completion created in the earlier discussion (along with the corresponding input needed for generation) as input for the discriminative task. \varun{I am now describing an experiment that I think we can do, but feel free to suggest edits/comment out entirely} The concrete task is as follows: we ask the model to rate the toxicity of the statement on a 5-point scale (where 1 is most toxic and 5 is least toxic) and measure the log-probabilities for the rating selected. The rating, in conjunction with the ratio of the log-probability of the rating and the rating itself (henceforth called the toxicity ratio), serves as a proxy for the toxicity score (\varun{this needs to be more precisely defined; for example: if the rating is 5 and the log probability is 0.8 -- we can't claim that it is toxic; need a more elegant mathematical solution for this. maybe log-prob divided by rating?}). We then measure the correlation between the toxicity ratio and the toxicity score returned by \varun{enter classifier names} to estimate whether these models are effective classifiers. We also check whether the rating produced by the LM-based classifier is accurate, and report the accuracy numbers.
  23. extra/toxicity_BASE_928.tex:\varun{enter results here}
  24. extra/toxicity_BACKUP_928.tex:%\varun{need some discussion here on (a) which of the 3 models creates more toxic content, and (b) whether that is the case only for the "hate" sentiment or also for the "neutral" sentiment -- the hope is that it does not for the neutral category. i can't see the results for the neutral setting. also the current issue now is that there is no comparison b/w the models itself; each figure talks about how hateful the content generated by a particular model is, when the measurement is done by 2 toxicity classifiers. the experiment that i am doing with dv3 as the judge may also help us understand that bit}
  25. extra/toxicity_BACKUP_928.tex:%%\varun{I am now describing an experiment that I think we can do, but feel free to suggest edits/comment out entirely
  26. extra/toxicity_BACKUP_928.tex:% %\varun{this needs to be more precisely defined; for example: if the rating is 5 and the log probability is 0.8 -- we can't claim that it is toxic; need a more elegant mathematical solution for this. maybe log-prob divided by rating?}
  27. extra/toxicity_BACKUP_928.tex:% %\varun{enter classifier names}
  28. extra/toxicity_BACKUP_928.tex:%\varun{enter results here}
  29. extra/misconceptions.tex.orig:\varun{insert figures about correctness across categories here}.
  30. extra/misconceptions.tex.orig:From the figures, we observe that DV3 is more truthful than Instruct GPT-3 across most categories, but not all of them. However, this paints an incorrect picture of DV3's capabilities: when the answers generated by the model are manually cross-referenced with the reference answers, we observe a greater degree of agreement than is captured by metrics such as ROUGE (which measure syntactic overlap but not semantic overlap). We also note that the responses generated by DV3 are substantially longer than those generated by Instruct GPT-3 (\varun{will enter some information based on the sentence length}), which further impacts the calculation of the ROUGE score. We will highlight this issue in more detail towards the end of this section. Other salient observations from the data include:
  31. extra/misconceptions.tex.orig:\varun{discuss how the above can be remedied; one approach can be to take as input the sentences from the language model and the reference sentence and create a finite sized embedding and calculate similarity b/w reference input and the LM generated outcomes. another approach is to again use dv3 to ask which sentence is a more faithful match.}
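
One way to make the remedy sketched in item 31 concrete is to embed the reference answer and each model-generated answer with a fixed-size sentence encoder and compare them by cosine similarity. The snippet below is an illustration only and assumes some sentence-embedding function embed() that returns a NumPy vector; the comment leaves the actual encoder (and the alternative of asking dv3 to judge faithfulness) unspecified.

# Illustrative sketch; embed() is an assumed sentence encoder returning a
# fixed-size NumPy vector, not a function from the paper's codebase.
import numpy as np

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def semantic_match_score(reference, candidate, embed):
    # Compare the reference answer and the LM-generated answer in embedding
    # space rather than by lexical overlap.
    return cosine_similarity(embed(reference), embed(candidate))

def more_faithful_answer(reference, answer_a, answer_b, embed):
    # The comment's first proposal: pick whichever generation lies closer to
    # the reference in embedding space. Its second proposal, asking dv3 which
    # sentence is the more faithful match, is not sketched here.
    score_a = semantic_match_score(reference, answer_a, embed)
    score_b = semantic_match_score(reference, answer_b, embed)
    return ("A", score_a) if score_a >= score_b else ("B", score_b)
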
  32. extra/toxicity_LOCAL_804.tex:%\varun{need some discussion here on (a) which of the 3 models creates more toxic content, and (b) whether that is the case only for the "hate" sentiment or also for the "neutral" sentiment -- the hope is that it does not for the neutral category. i can't see the results for the neutral setting. also the current issue now is that there is no comparison b/w the models itself; each figure talks about how hateful the content generated by a particular model is, when the measurement is done by 2 toxicity classifiers. the experiment that i am doing with dv3 as the judge may also help us understand that bit}
  33. extra/toxicity_LOCAL_804.tex:%%\varun{I am now describing an experiment that I think we can do, but feel free to suggest edits/comment out entirely
  34. extra/toxicity_LOCAL_804.tex:% %\varun{this needs to be more precisely defined; for example: if the rating is 5 and the log probability is 0.8 -- we can't claim that it is toxic; need a more elegant mathematical solution for this. maybe log-prob divided by rating?}
  35. extra/toxicity_LOCAL_804.tex:% %\varun{enter classifier names}
  36. extra/toxicity_LOCAL_804.tex:%\varun{enter results here}
  37. extra/toxicity_BACKUP_804.tex:%\varun{need some discussion here on (a) which of the 3 models creates more toxic content, and (b) whether that is the case only for the "hate" sentiment or also for the "neutral" sentiment -- the hope is that it does not for the neutral category. i can't see the results for the neutral setting. also the current issue now is that there is no comparison b/w the models itself; each figure talks about how hateful the content generated by a particular model is, when the measurement is done by 2 toxicity classifiers. the experiment that i am doing with dv3 as the judge may also help us understand that bit}
  38. extra/toxicity_BACKUP_804.tex:%%\varun{I am now describing an experiment that I think we can do, but feel free to suggest edits/comment out entirely
  39. extra/toxicity_BACKUP_804.tex:% %\varun{this needs to be more precisely defined; for example: if the rating is 5 and the log probability is 0.8 -- we can't claim that it is toxic; need a more elegant mathematical solution for this. maybe log-prob divided by rating?}
  40. extra/toxicity_BACKUP_804.tex:% %\varun{enter classifier names}
  41. extra/toxicity_BACKUP_804.tex:%\varun{enter results here}
  42. extra/toxicity_LOCAL_928.tex:%\varun{need some discussion here on (a) which of the 3 models creates more toxic content, and (b) whether that is the case only for the "hate" sentiment or also for the "neutral" sentiment -- the hope is that it does not for the neutral category. i can't see the results for the neutral setting. also the current issue now is that there is no comparison b/w the models itself; each figure talks about how hateful the content generated by a particular model is, when the measurement is done by 2 toxicity classifiers. the experiment that i am doing with dv3 as the judge may also help us understand that bit}
  43. extra/toxicity_LOCAL_928.tex:%%\varun{I am now describing an experiment that I think we can do, but feel free to suggest edits/comment out entirely
  44. extra/toxicity_LOCAL_928.tex:% %\varun{this needs to be more precisely defined; for example: if the rating is 5 and the log probability is 0.8 -- we can't claim that it is toxic; need a more elegant mathematical solution for this. maybe log-prob divided by rating?}
  45. extra/toxicity_LOCAL_928.tex:% %\varun{enter classifier names}
  46. extra/toxicity_LOCAL_928.tex:%\varun{enter results here}
  47. backup_main/main-codeonly.tex.backup:\newcommand{\varun}[1]{\textcolor{red}{[VC: #1]}}
  48. backup_main/main-mathonly.tex.backup:\newcommand{\varun}[1]{\textcolor{red}{[VC: #1]}}
  49. contents/7.2_misconceptions.tex:%\varun{provide opening which motivates this experiment}
  50. contents/7.2_misconceptions.tex:%\varun{need to expand upon what it means for something to be truthful (in reference to figure a), and what truthful qa metric means (for figure b)}
  51. contents/7.2_misconceptions.tex:%\varun{insert figures about correctness across categories here}.
  52. contents/7.2_misconceptions.tex:This raises an important shortcoming of the current metrics: they fail to capture \textit{semantic} similarities within statements, and rely primarily on word or sentence-level similarity metrics which capture \textit{syntax}. Very often, the reference answer is short while the \DV-generated answer is long. This results in metrics such as ROUGE determining the \DV-generated answer to be a mismatch, despite it containing the relevant information. Other salient findings include: %\varun{provide one example for each}
  53. contents/7.2_misconceptions.tex:%\varun{discuss how the above can be remedied; one approach can be to take as input the sentences from the language model and the reference sentence and create a finite sized embedding and calculate similarity b/w reference input and the LM generated outcomes. another approach is to again use {\DV} to ask which sentence is a more faithful match.}\hamid{let's also center this around intelligence as Seb mentioned, which aspects of intelligence is highlighted by each of these observations and more observations that we found through our experiments?}
  54. contents/7.2_misconceptions.tex:%Two independent reviewers also manually verified which model-generated answer is more similar to the reference answer; we observe that there is a strong overlap between the human-chosen answers and the machine-chosen answers (with an average overlap of \varun{insert percentage here}). This suggests that the responses chosen by {\DV} (as a judge) are similar to what a human would choose. \varun{should we use GPT-3 as a judge as well?} As alluded to earlier, {\DV} in itself can be a powerful discriminative model in settings such as this, remarkably even in a zero-shot setting.
  55. contents/7.2_misconceptions.tex:%\varun{add one line on why the performance is bad for indexical errors}
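
The shortcoming noted in item 52 is easy to reproduce: when the reference answer is short and the generated answer is a longer paraphrase, lexical-overlap metrics stay low even though the information matches. The toy sentences below are invented purely for illustration, and the snippet assumes the rouge_score package rather than the paper's own evaluation code.

# Toy illustration of ROUGE penalizing a longer but semantically faithful
# answer; assumes the rouge_score package, not the paper's evaluation code.
from rouge_score import rouge_scorer

reference = "No, it is perfectly legal."
generated = ("There is nothing unlawful about doing this; "
             "the practice is entirely permitted under current rules.")

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
for name, score in scorer.score(reference, generated).items():
    # Both precision and recall are low because the two sentences share almost
    # no tokens, even though they say the same thing.
    print(name, round(score.precision, 2), round(score.recall, 2), round(score.fmeasure, 2))

An embedding-based similarity (as in the sketch after item 31) or an LM judge would score this pair as a match, which is the direction the surrounding comments point toward.
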
  56. contents/unused/7.3.3_calibration.tex:\varun{look at figures and complete this bit}
  57. contents/unused/toxicity.tex:%\varun{this bit looks good and i think is sufficient; we don't really talk about generation though. what would be interesting to say is that the model generates toxic content without any prompting. it would be interesting to understand if the model generates "more toxic" content than its contempraries; i am running an experiment for this and should have some numbers shortly}
  58. contents/unused/toxicity.tex:%\varun{need some discussion here on (a) which of the 3 models creates more toxic content, and (b) whether that is the case only for the "hate" sentiment or also for the "neutral" sentiment -- the hope is that it does not for the neutral category. i can't see the results for the neutral setting. also the current issue now is that there is no comparison b/w the models itself; each figure talks about how hateful the content generated by a particular model is, when the measurement is done by 2 toxicity classifiers. the experiment that i am doing with dv3 as the judge may also help us understand that bit}
  59. contents/unused/toxicity.tex:%\varun{I am now describing an experiment that I think we can do, but feel free to suggest edits/comment out entirely. The concrete task is as follows: we ask the model to rate the toxicity of the statement in a 5 point scale (where 1 is most, and 5 is least) and measure the log-probabilities for the rating selected. The rating in conjunction with the ratio of the log-probabilities of the rating and the rating itself (henceforth called the toxicity ratio) serve as a proxy for the toxicity score (
  60. contents/unused/toxicity.tex:%\varun{this needs to be more precisely defined; for example: if the rating is 5 and the log probability is 0.8 -- we can't claim that it is toxic; need a more elegant mathematical solution for this. maybe log-prob divided by rating?}
  61. contents/unused/toxicity.tex:%\varun{i am running this experiment now}
  62. contents/unused/toxicity.tex:% %\varun{enter classifier names}
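
For the "more precisely defined" score that item 60 asks for, one candidate formalization (an illustration only, not the authors' definition) combines the rating with the probability of the chosen rating token, matching the Python sketch after item 14:

\[
\mathrm{tox}(x) \;=\; \frac{5 - r(x)}{4}\,\exp\bigl(\ell(x)\bigr),
\]
where $r(x)\in\{1,\dots,5\}$ is the model's rating of statement $x$ (with $1$ = most toxic) and $\ell(x)$ is the log-probability of the chosen rating token. This keeps the score in $[0,1]$, sends a confident rating of $5$ to $0$, and sidesteps the "log-prob divided by rating" ambiguity flagged in the comment.
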
  63. contents/societal.tex:%(\varun{enter section here}),
  64. contents/societal.tex:%\varun{do we have any information on how the model performs on instructions/prompts that are not in english? would be interesting to include a note on language bias here}.
  65. contents/societal.tex:%\noindent{\bf Environmental impact:} The training of the model can demand significant computational resources, leading to increased carbon emissions and contributing to climate change. \varun{can we add some anecdotes here on costs?}
  66. contents/societal.tex:%\noindent{\bf Misinformation \& Manipulation:} Similarly, we also noticed that this model is capable of generating extremely fluent, logical and coherent text. This text is highly plausible, but not necessarily accurate. This raises concerns about the potential for using these models in campaigns related to misinformation. Since the text generated is highly persuasive text, it can be used to manipulate people's opinions or behaviors (as witnessed in section \varun{enter marco/scott section here}). This in turn could contribute to a broader erosion of trust in online information sources. This behavior is closely tied to the ability of this model to hallucinate more than previous versions (\varun{do we say this anywhere?}). One way of reducing this is to ground information generated by these models using external verified knowledge corpora, but this may further increase the costs associated with usage.
  67. contents/societal.tex:%\noindent{\bf Privacy:} Despite the fact that these models are trained on data from the internet, it demonstrates stronger propensity to memorize information than previous versions (\varun{include results that harsha and i generated}). This suggests that privacy risks are exacerbated when trained on datasets that may contain sensitive or private information.
  68. contents/9_hallucination.tex:\varun{still formalizing this bit; worst case it can be removed}
  69. contents/7_discrimination.tex:% \varun{General flow is the following:}