# Questionable or Fraudulent Practices in Machine Learning
|Practice|Description|Stage|Accidental?|
|---|---|---|---|
|**Contamination**||||
|Training Contamination|Training on the test set (e.g., test data present in the web corpus); see sketch below|Training|Plausibly|
|Prompt Contamination|Putting test data into the prompt (e.g., as few-shot examples)|Evaluation|Plausibly|
|RAG Contamination|Leaking benchmark data via Retrieval-Augmented Generation|Evaluation|Plausibly|
|Dirty Paraphrases|Rephrasing test data and training on it|Collection|No|
|Contamination Laundering|Contaminated models generating training data|Collection|Plausibly|
|Thieved Test|Obtaining private test labels|Collection|No|
|User Contamination|Post-training on test data that appeared in user prompts|Training|Plausibly|
|Over-hyping|Continuing to tune hyperparameters after evaluating on the test set|Training|Plausibly|
|Meta-contamination|Reusing contaminated hyperparameters/designs|Training|Plausibly|
|Split Twins|Train and test sets include near-identical points|Collection|Plausibly|
|||||
|**Cherrypicking**||||
|Baseline Nerfing|Spending less effort optimising baselines' training parameters|Evaluation|Plausibly|
|Baseline Hacking|Choosing weak baselines to compare against|Evaluation|No|
|Runtime Nerfing|Spending less effort optimising baselines' inference parameters|Evaluation|Plausibly|
|Runtime Hacking|Choosing the best inference parameters or decoding strategy post hoc|Evaluation|No|
|Benchmark Hacking|Choosing easier benchmarks|Evaluation|No|
|Subset Hacking|Subsetting the benchmark until you win|Evaluation|No|
|Harness Hacking|Choosing the best evaluation harness post hoc|Evaluation|No|
|Golden Seed|Training/tuning with many different seeds and reporting only the best (see sketch below)|Training|No|
|Prompt Nerfing|Under-tuning baseline models' prompts (e.g., few-shot examples)|Evaluation|Plausibly|
|Prompt Hacking|Choosing the best prompt strategy post hoc (few-shot examples, system/user prompt, CoT)|Evaluation|Plausibly|
|||||
|**Misreporting**||||
|Superfluous Cog|Redundant module added to claim novelty|Design|Plausibly|
|Whack-a-mole|Monitoring for specific failures and fine-tuning them away ad hoc|Training|No|
|Benchmark Decoration|Pretraining on benchmark/instruction data|Training|Plausibly|
|p-hacking|Flawed sampling when bolding SOTA results|Reporting|Plausibly|
|Point Scores|Reporting single-run results, i.e., no error bars|Reporting|No|
|Outright Lies|Fabricating results; included for completeness|Reporting|No|
|Over/Underclaiming|Misleading claims about model capabilities|Reporting|Plausibly|
|Reification|Drawing general claims from narrow ML benchmarks|Reporting|Plausibly|
|Nonzero-shot|Claiming 'zero-shot' while training on task examples|Reporting|Plausibly|
|Misarithmetic Mean|Using the arithmetic mean on normalised results (see sketch below)|Reporting|Plausibly|
|Parameter Smuggling|Under-reporting model size, or substituting in more embedding parameters|Reporting|Plausibly|
|File Drawer|Failing to report negative benchmark studies|Reporting|Plausibly|
|||||
|**Amplifiers**||||
|Inductive Smuggling|Handcrafting inductive bias for a task|Design|No|
|Label Noise|Using benchmarks known to be error-ridden|Collection|Plausibly|
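
A minimal sketch of how one might check for Training Contamination (and, with a near-duplicate metric, Split Twins): flag test examples whose word n-grams heavily overlap with the training corpus. The n-gram length and overlap threshold are arbitrary illustrative choices, not a standard.

```python
def ngrams(text: str, n: int = 8) -> set:
    """Lower-cased word n-grams of a document."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_report(train_docs, test_examples, n=8, threshold=0.5):
    """Flag test examples whose n-grams heavily overlap with the training corpus."""
    train_grams = set()
    for doc in train_docs:
        train_grams |= ngrams(doc, n)
    flagged = []
    for example in test_examples:
        grams = ngrams(example, n)
        if not grams:
            continue  # example shorter than n tokens; this heuristic cannot judge it
        overlap = len(grams & train_grams) / len(grams)
        if overlap >= threshold:
            flagged.append((example, overlap))
    return flagged
```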
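
A minimal sketch of the Golden Seed and Point Scores issues: `train_and_evaluate` is a made-up stand-in for a real training run and the accuracies are synthetic. Quoting only the maximum over seeds inflates the result; reporting every run with a dispersion estimate does not.

```python
import random
import statistics

def train_and_evaluate(seed: int) -> float:
    """Hypothetical stand-in for a full training + evaluation run."""
    rng = random.Random(seed)
    return 0.70 + rng.gauss(0.0, 0.02)  # pretend test accuracy around 0.70

seeds = [0, 1, 2, 3, 4]
accs = [train_and_evaluate(s) for s in seeds]

# Golden Seed / Point Scores: quoting only the single best run.
print(f"best seed only: {max(accs):.3f}")

# Preferred: report every run plus mean and standard deviation.
print(f"all runs: {[round(a, 3) for a in accs]}")
print(f"mean ± std over {len(seeds)} seeds: "
      f"{statistics.mean(accs):.3f} ± {statistics.stdev(accs):.3f}")
```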
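
A minimal sketch of the Misarithmetic Mean problem, using made-up scores for two hypothetical models on three benchmarks: the arithmetic mean of normalised scores flips the apparent winner depending on which model the scores are normalised to, whereas the geometric mean gives the same answer either way.

```python
import math

# Made-up benchmark scores (higher is better).
scores = {
    "model_a": [10.0, 2.0, 4.0],
    "model_b": [5.0, 4.0, 4.0],
}

def normalise(all_scores, reference):
    """Divide every model's scores by the reference model's per-benchmark scores."""
    ref = all_scores[reference]
    return {m: [s / r for s, r in zip(vals, ref)] for m, vals in all_scores.items()}

def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def geometric_mean(xs):
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

for reference in ("model_a", "model_b"):
    print(f"normalised to {reference}:")
    for model, vals in normalise(scores, reference).items():
        print(f"  {model}: arithmetic={arithmetic_mean(vals):.3f}, "
              f"geometric={geometric_mean(vals):.3f}")
```

With these numbers, the arithmetic mean makes whichever model the scores were not normalised to look better, while the geometric mean ranks the two as tied under both normalisations.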