# Questionable or Fraudulent Practices in Machine Learning

|Practice|Description|Stage|Accidental?|
|---|---|---|---|
|**Contamination**||||
|Training Contamination|Training on the test set (e.g., because it appears in the web corpus); see the n-gram sketch below|Training|Plausibly|
|Prompt Contamination|Putting test data directly into the prompt (e.g., as few-shot examples)|Evaluation|Plausibly|
|RAG Contamination|Leaking benchmark data via Retrieval-Augmented Generation|Evaluation|Plausibly|
|Dirty Paraphrases|Rephrasing test data and training on it|Collection|No|
|Contamination Laundering|Generating training data with contaminated models|Collection|Plausibly|
|Thieved Test|Obtaining private test labels|Collection|No|
|User Contamination|Post-training on test data that appears in user prompts|Training|Plausibly|
|Over-hyping|Continuing to tune hyperparameters after evaluating on the test set|Training|Plausibly|
|Meta-contamination|Reusing contaminated hyperparameters/designs|Training|Plausibly|
|Split Twins|Train and test sets include near-identical points|Collection|Plausibly|
|||||
|**Cherrypicking**||||
|Baseline Nerfing|Spending less effort optimising the baselines' training parameters|Evaluation|Plausibly|
|Baseline Hacking|Choosing weak baselines to compare against|Evaluation|No|
|Runtime Nerfing|Spending less effort optimising the baselines' inference parameters|Evaluation|Plausibly|
|Runtime Hacking|Choosing the best inference parameters or decoding strategy post hoc|Evaluation|No|
|Benchmark Hacking|Choosing easier benchmarks|Evaluation|No|
|Subset Hacking|Subsetting the benchmark until you win|Evaluation|No|
|Harness Hacking|Choosing the best evaluation harness post hoc|Evaluation|No|
|Golden Seed|Training/tuning with many different seeds and reporting only the best|Training|No|
|Prompt Nerfing|Under-tuning the prompts of baseline models (e.g., few-shot examples)|Evaluation|Plausibly|
|Prompt Hacking|Choosing the best prompt strategy post hoc (few-shot examples, system/user prompt, CoT)|Evaluation|Plausibly|
|||||
|**Misreporting**||||
|Superfluous Cog|Adding a redundant module to claim novelty|Design|Plausibly|
|Whack-a-mole|Monitoring for specific failures and fine-tuning them away ad hoc|Training|No|
|Benchmark Decoration|Pretraining on benchmark/instruction data|Training|Plausibly|
|p-hacking|Flawed sampling or statistics when bolding SOTA results|Reporting|Plausibly|
|Point Scores|Reporting single-run results, i.e., no error bars; see the seed-variance sketch below|Reporting|No|
|Outright Lies|Fabricating results; included for completeness|Reporting|No|
|Over/Underclaiming|Misleading claims about model capabilities|Reporting|Plausibly|
|Reification|Drawing general claims from narrow ML benchmarks|Reporting|Plausibly|
|Nonzero-shot|Claiming 'zero-shot' while having trained on task examples|Reporting|Plausibly|
|Misarithmetic Mean|Using the arithmetic mean on normalised results; see the mean sketch below|Reporting|Plausibly|
|Parameter Smuggling|Under-reporting the model size, or substituting in more embedding parameters|Reporting|Plausibly|
|File Drawer|Failing to report negative benchmark studies|Reporting|Plausibly|
|||||
|**Amplifiers**||||
|Inductive Smuggling|Handcrafting inductive bias for a task|Design|No|
|Label Noise|Using benchmarks known to be error-ridden|Collection|Plausibly|
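
The table only names the practices, so here is a rough illustration of how training contamination is commonly screened for: an n-gram overlap check between benchmark items and the training corpus. The n-gram size, whitespace tokenisation, and all names below are assumptions for the sketch, not anything prescribed by the table.

```python
# Minimal sketch of an n-gram overlap check for training contamination.
# Assumptions (not from the table): 13-gram overlap, whitespace tokenisation.

def ngrams(text: str, n: int = 13) -> set:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(test_example: str, train_grams: set, n: int = 13) -> bool:
    """Flag a test example whose n-grams also occur in the training corpus."""
    return bool(ngrams(test_example, n) & train_grams)

# Hypothetical usage: flag benchmark items that leaked into the web corpus.
training_corpus = ["a web document ...", "another web document ..."]
train_grams = set().union(*(ngrams(doc) for doc in training_corpus))
benchmark = ["a benchmark question ..."]
flagged = [x for x in benchmark if is_contaminated(x, train_grams)]
```

Real decontamination pipelines differ in n-gram size and matching strategy; this is only the basic idea.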
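For "Misarithmetic Mean", a worked toy example (with made-up numbers) shows the problem: under the arithmetic mean, the apparent winner flips depending on which baseline the scores are normalised to, whereas the geometric mean of normalised scores does not.

```python
from statistics import mean, geometric_mean

# Hypothetical raw scores of two systems on two tasks (higher is better).
system_a = {"task1": 10.0, "task2": 1.0}
system_b = {"task1": 1.0, "task2": 10.0}

def normalise(scores, baseline):
    return [scores[t] / baseline[t] for t in scores]

# Normalise once to system A's scores, once to system B's scores.
for baseline in (system_a, system_b):
    a = normalise(system_a, baseline)
    b = normalise(system_b, baseline)
    # Arithmetic mean: the apparent winner flips with the baseline.
    print("arithmetic  A=%.2f  B=%.2f" % (mean(a), mean(b)))
    # Geometric mean: invariant to the normalisation baseline (both tie at 1.0).
    print("geometric   A=%.2f  B=%.2f" % (geometric_mean(a), geometric_mean(b)))
```

Reporting the geometric mean, or simply the raw per-task scores, avoids this baseline dependence.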
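"Point Scores" (and the related "Golden Seed") are avoided by reporting variation across repeated runs rather than a single number; the accuracies below are invented for illustration.

```python
from statistics import mean, stdev

# Hypothetical accuracies of the same model trained with five different seeds.
runs = [0.712, 0.698, 0.725, 0.704, 0.691]

# Report mean ± standard deviation across seeds instead of a single run,
# which also removes the incentive to pick a "golden" seed post hoc.
print(f"accuracy: {mean(runs):.3f} ± {stdev(runs):.3f} (n={len(runs)} seeds)")
```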