# Questionable or Fraudulent Practices in Machine Learning

|Practice|Description|Stage|Accidental?|
|---|---|---|---|
|**Contamination**||||
|Training Contamination|Training on the test set (e.g., because it appears in the web corpus); see the n-gram sketch below|Training|Plausibly|
|Prompt Contamination|Putting test data directly into the prompt (e.g., as few-shot examples)|Evaluation|Plausibly|
|RAG Contamination|Leaking benchmark data via Retrieval-Augmented Generation|Evaluation|Plausibly|
|Dirty Paraphrases|Rephrasing test data and training on it|Collection|No|
|Contamination Laundering|Generating training data with contaminated models|Collection|Plausibly|
|Thieved Test|Obtaining private test labels|Collection|No|
|User Contamination|Post-training on test data that appears in user prompts|Training|Plausibly|
|Over-hyping|Continuing to tune hyperparameters after evaluating on the test set|Training|Plausibly|
|Meta-contamination|Reusing contaminated hyperparameters/designs|Training|Plausibly|
|Split Twins|Train and test sets include near-identical points|Collection|Plausibly|
|||||
|**Cherrypicking**||||
|Baseline Nerfing|Spending less effort optimising the baselines' training parameters|Evaluation|Plausibly|
|Baseline Hacking|Choosing weak baselines to compare against|Evaluation|No|
|Runtime Nerfing|Spending less effort optimising the baselines' inference parameters|Evaluation|Plausibly|
|Runtime Hacking|Choosing the best inference parameters or decoding strategy post hoc|Evaluation|No|
|Benchmark Hacking|Choosing easier benchmarks|Evaluation|No|
|Subset Hacking|Subsetting the benchmark until you win|Evaluation|No|
|Harness Hacking|Choosing the best evaluation harness post hoc|Evaluation|No|
|Golden Seed|Training/tuning with many different seeds and reporting only the best|Training|No|
|Prompt Nerfing|Under-tuning the prompts of baseline models (e.g., few-shot examples)|Evaluation|Plausibly|
|Prompt Hacking|Choosing the best prompt strategy post hoc (few-shot examples, system/user prompt, CoT)|Evaluation|Plausibly|
|||||
|**Misreporting**||||
|Superfluous Cog|Adding a redundant module to claim novelty|Design|Plausibly|
|Whack-a-mole|Monitoring for specific failures and fine-tuning them away ad hoc|Training|No|
|Benchmark Decoration|Pretraining on benchmark/instruction data|Training|Plausibly|
|p-hacking|Flawed sampling or statistics when bolding SOTA results|Reporting|Plausibly|
|Point Scores|Reporting single-run results, i.e., no error bars; see the seed-variance sketch below|Reporting|No|
|Outright Lies|Fabricating results; included for completeness|Reporting|No|
|Over/Underclaiming|Misleading claims about model capabilities|Reporting|Plausibly|
|Reification|Drawing general claims from narrow ML benchmarks|Reporting|Plausibly|
|Nonzero-shot|Claiming 'zero-shot' while having trained on task examples|Reporting|Plausibly|
|Misarithmetic Mean|Using the arithmetic mean on normalised results; see the mean sketch below|Reporting|Plausibly|
|Parameter Smuggling|Under-reporting the model size, or substituting in more embedding parameters|Reporting|Plausibly|
|File Drawer|Failing to report negative benchmark studies|Reporting|Plausibly|
|||||
|**Amplifiers**||||
|Inductive Smuggling|Handcrafting inductive bias for a task|Design|No|
|Label Noise|Using benchmarks known to be error-ridden|Collection|Plausibly|
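
The table only names the practices, so here is a rough illustration of how training contamination is commonly screened for: an n-gram overlap check between benchmark items and the training corpus. The n-gram size, whitespace tokenisation, and all names below are assumptions for the sketch, not anything prescribed by the table.

```python
# Minimal sketch of an n-gram overlap check for training contamination.
# Assumptions (not from the table): 13-gram overlap, whitespace tokenisation.

def ngrams(text: str, n: int = 13) -> set:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(test_example: str, train_grams: set, n: int = 13) -> bool:
    """Flag a test example whose n-grams also occur in the training corpus."""
    return bool(ngrams(test_example, n) & train_grams)

# Hypothetical usage: flag benchmark items that leaked into the web corpus.
training_corpus = ["a web document ...", "another web document ..."]
train_grams = set().union(*(ngrams(doc) for doc in training_corpus))
benchmark = ["a benchmark question ..."]
flagged = [x for x in benchmark if is_contaminated(x, train_grams)]
```

Real decontamination pipelines differ in n-gram size and matching strategy; this is only the basic idea.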
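For "Misarithmetic Mean", a worked toy example (with made-up numbers) shows the problem: under the arithmetic mean, the apparent winner flips depending on which baseline the scores are normalised to, whereas the geometric mean of normalised scores does not.

```python
from statistics import mean, geometric_mean

# Hypothetical raw scores of two systems on two tasks (higher is better).
system_a = {"task1": 10.0, "task2": 1.0}
system_b = {"task1": 1.0, "task2": 10.0}

def normalise(scores, baseline):
    return [scores[t] / baseline[t] for t in scores]

# Normalise once to system A's scores, once to system B's scores.
for baseline in (system_a, system_b):
    a = normalise(system_a, baseline)
    b = normalise(system_b, baseline)
    # Arithmetic mean: the apparent winner flips with the baseline.
    print("arithmetic  A=%.2f  B=%.2f" % (mean(a), mean(b)))
    # Geometric mean: invariant to the normalisation baseline (both tie at 1.0).
    print("geometric   A=%.2f  B=%.2f" % (geometric_mean(a), geometric_mean(b)))
```

Reporting the geometric mean, or simply the raw per-task scores, avoids this baseline dependence.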
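"Point Scores" (and the related "Golden Seed") are avoided by reporting variation across repeated runs rather than a single number; the accuracies below are invented for illustration.

```python
from statistics import mean, stdev

# Hypothetical accuracies of the same model trained with five different seeds.
runs = [0.712, 0.698, 0.725, 0.704, 0.691]

# Report mean ± standard deviation across seeds instead of a single run,
# which also removes the incentive to pick a "golden" seed post hoc.
print(f"accuracy: {mean(runs):.3f} ± {stdev(runs):.3f} (n={len(runs)} seeds)")
```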