- python run_openai.py --url http://127.0.0.1:5000/v1 --model Llama-3.1-8B-Instruct-exl2-8bpw-h8
- 2024-09
- {
- "comment": "",
- "server": {
- "url": "http://127.0.0.1:5000/v1",
- "model": "Llama-3.1-8B-Instruct-exl2-8bpw-h8",
- "timeout": 600.0
- },
- "inference": {
- "temperature": 0.0,
- "top_p": 1.0,
- "max_tokens": 2048,
- "system_prompt": "The following are multiple choice questions (with answers) about {subject}. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.",
- "style": "multi_chat"
- },
- "test": {
- "parallel": 1
- },
- "log": {
- "verbosity": 0,
- "log_prompt": true
- }
- }
- assigned subjects ['biology', 'business', 'chemistry', 'computer science', 'economics', 'engineering', 'health', 'history', 'law', 'math', 'philosophy', 'physics', 'psychology', 'other']
- Testing biology...
- 100%|##############################################################################| 717/717 [4:54:58<00:00, 24.68s/it]
- Finished testing biology in 4 hours, 54 minutes, 58 seconds.
- Total, 457/717, 63.74%
- Random Guess Attempts, 20/717, 2.79%
- Correct Random Guesses, 2/20, 10.00%
- Adjusted Score Without Random Guesses, 455/697, 65.28%
- Testing business...
- 100%|##############################################################################| 789/789 [5:13:47<00:00, 23.86s/it]
- Finished testing business in 5 hours, 13 minutes, 47 seconds.
- Total, 392/789, 49.68%
- Random Guess Attempts, 78/789, 9.89%
- Correct Random Guesses, 8/78, 10.26%
- Adjusted Score Without Random Guesses, 384/711, 54.01%
- Testing chemistry...
- 100%|############################################################################| 1132/1132 [7:55:15<00:00, 25.19s/it]
- Finished testing chemistry in 7 hours, 55 minutes, 15 seconds.
- Total, 418/1132, 36.93%
- Random Guess Attempts, 87/1132, 7.69%
- Correct Random Guesses, 6/87, 6.90%
- Adjusted Score Without Random Guesses, 412/1045, 39.43%
- Testing computer science...
- 100%|##############################################################################| 410/410 [2:44:56<00:00, 24.14s/it]
- Finished testing computer science in 2 hours, 44 minutes, 56 seconds.
- Total, 198/410, 48.29%
- Random Guess Attempts, 9/410, 2.20%
- Correct Random Guesses, 1/9, 11.11%
- Adjusted Score Without Random Guesses, 197/401, 49.13%
- Testing economics...
- 100%|##############################################################################| 844/844 [5:19:23<00:00, 22.71s/it]
- Finished testing economics in 5 hours, 19 minutes, 24 seconds.
- Total, 471/844, 55.81%
- Random Guess Attempts, 17/844, 2.01%
- Correct Random Guesses, 1/17, 5.88%
- Adjusted Score Without Random Guesses, 470/827, 56.83%
- Testing engineering...
- 100%|##############################################################################| 969/969 [6:38:31<00:00, 24.68s/it]
- Finished testing engineering in 6 hours, 38 minutes, 31 seconds.
- Total, 277/969, 28.59%
- Random Guess Attempts, 47/969, 4.85%
- Correct Random Guesses, 10/47, 21.28%
- Adjusted Score Without Random Guesses, 267/922, 28.96%
- Testing health...
- 100%|##############################################################################| 818/818 [5:23:08<00:00, 23.70s/it]
- Finished testing health in 5 hours, 23 minutes, 8 seconds.
- Total, 432/818, 52.81%
- Random Guess Attempts, 7/818, 0.86%
- Correct Random Guesses, 1/7, 14.29%
- Adjusted Score Without Random Guesses, 431/811, 53.14%
- Testing history...
- 100%|##############################################################################| 381/381 [2:40:14<00:00, 25.23s/it]
- Finished testing history in 2 hours, 40 minutes, 14 seconds.
- Total, 174/381, 45.67%
- Random Guess Attempts, 1/381, 0.26%
- Correct Random Guesses, 0/1, 0.00%
- Adjusted Score Without Random Guesses, 174/380, 45.79%
- Testing law...
- 100%|############################################################################| 1101/1101 [7:33:56<00:00, 24.74s/it]
- Finished testing law in 7 hours, 33 minutes, 56 seconds.
- Total, 339/1101, 30.79%
- Random Guess Attempts, 7/1101, 0.64%
- Correct Random Guesses, 0/7, 0.00%
- Adjusted Score Without Random Guesses, 339/1094, 30.99%
- Testing math...
- 100%|############################################################################| 1351/1351 [8:59:26<00:00, 23.96s/it]
- Finished testing math in 8 hours, 59 minutes, 26 seconds.
- Total, 609/1351, 45.08%
- Random Guess Attempts, 219/1351, 16.21%
- Correct Random Guesses, 18/219, 8.22%
- Adjusted Score Without Random Guesses, 591/1132, 52.21%
- Testing philosophy...
- 100%|##############################################################################| 499/499 [3:15:43<00:00, 23.53s/it]
- Finished testing philosophy in 3 hours, 15 minutes, 43 seconds.
- Total, 202/499, 40.48%
- Random Guess Attempts, 11/499, 2.20%
- Correct Random Guesses, 2/11, 18.18%
- Adjusted Score Without Random Guesses, 200/488, 40.98%
- Testing physics...
- 100%|############################################################################| 1299/1299 [7:10:37<00:00, 19.89s/it]
- Finished testing physics in 7 hours, 10 minutes, 38 seconds.
- Total, 507/1299, 39.03%
- Random Guess Attempts, 80/1299, 6.16%
- Correct Random Guesses, 8/80, 10.00%
- Adjusted Score Without Random Guesses, 499/1219, 40.94%
- Testing psychology...
- 100%|##############################################################################| 798/798 [5:27:07<00:00, 24.60s/it]
- Finished testing psychology in 5 hours, 27 minutes, 7 seconds.
- Total, 486/798, 60.90%
- Random Guess Attempts, 0/798, 0.00%
- Correct Random Guesses, division by zero error
- Adjusted Score Without Random Guesses, 486/798, 60.90%
- Testing other...
- 100%|##############################################################################| 924/924 [6:26:46<00:00, 25.12s/it]
- Finished testing other in 6 hours, 26 minutes, 47 seconds.
- Total, 447/924, 48.38%
- Random Guess Attempts, 11/924, 1.19%
- Correct Random Guesses, 3/11, 27.27%
- Adjusted Score Without Random Guesses, 444/913, 48.63%
- Finished the benchmark in 7 hours, 43 minutes, 58 seconds.
- Total, 5409/12032, 44.96%
- Random Guess Attempts, 594/12032, 4.94%
- Correct Random Guesses, 60/594, 10.10%
- Adjusted Score Without Random Guesses, 5349/11438, 46.77%
- Token Usage:
- Prompt tokens: min 913, average 1394, max 2669, total 16778053, tk/s 58.45
- Completion tokens: min 41, average 1643, max 2049, total 19766691, tk/s 68.86
- Markdown Table:
- | overall | biology | business | chemistry | computer science | economics | engineering | health | history | law | math | philosophy | physics | psychology | other |
- | ------- | ------- | -------- | --------- | ---------------- | --------- | ----------- | ------ | ------- | --- | ---- | ---------- | ------- | ---------- | ----- |
- | 44.96 | 63.74 | 49.68 | 36.93 | 48.29 | 55.81 | 28.59 | 52.81 | 45.67 | 30.79 | 45.08 | 40.48 | 39.03 | 60.90 | 48.38 |
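The "Adjusted Score Without Random Guesses" lines in this run are consistent with subtracting the random-guess counts from both the numerator and the denominator. The sketch below reproduces that arithmetic from the figures logged above and guards the zero-attempt case that printed "division by zero error" for psychology. The function names `percent` and `adjusted_score` are hypothetical helpers for illustration, not part of `run_openai.py`, whose actual implementation is not shown in this log.

```python
from typing import Optional, Tuple


def percent(numerator: int, denominator: int) -> Optional[float]:
    """Return a percentage rounded to two places, or None when undefined."""
    if denominator == 0:
        # No random guesses were attempted, so a success rate is undefined.
        return None
    return round(100.0 * numerator / denominator, 2)


def adjusted_score(
    correct: int, total: int, guess_attempts: int, guess_correct: int
) -> Tuple[int, int, Optional[float]]:
    """Drop randomly guessed questions from both the hits and the total."""
    adj_correct = correct - guess_correct
    adj_total = total - guess_attempts
    return adj_correct, adj_total, percent(adj_correct, adj_total)


# Figures copied from the biology, psychology, and overall lines of this run.
biology = adjusted_score(457, 717, 20, 2)
psychology = adjusted_score(486, 798, 0, 0)
psychology_guess_rate = percent(0, 0)
overall = adjusted_score(5409, 12032, 594, 60)

print("biology:", biology)
print("psychology:", psychology, "guess rate:", psychology_guess_rate)
print("overall:", overall)
```

Returning `None` for an undefined rate is one possible design choice; a caller could render it as "n/a" instead of surfacing an error string in the report.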
- ---
- python run_openai.py --url http://127.0.0.1:5000/v1 --model 1_Llama-3.1-8B-German-ORPO-8.0bpw-h8-exl2
- 2024-09
- {
- "comment": "",
- "server": {
- "url": "http://127.0.0.1:5000/v1",
- "model": "1_Llama-3.1-8B-German-ORPO-8.0bpw-h8-exl2",
- "timeout": 600.0
- },
- "inference": {
- "temperature": 0.0,
- "top_p": 1.0,
- "max_tokens": 2048,
- "system_prompt": "The following are multiple choice questions (with answers) about {subject}. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.",
- "style": "multi_chat"
- },
- "test": {
- "parallel": 1
- },
- "log": {
- "verbosity": 0,
- "log_prompt": true
- }
- }
- assigned subjects ['biology', 'business', 'chemistry', 'computer science', 'economics', 'engineering', 'health', 'history', 'law', 'math', 'philosophy', 'physics', 'psychology', 'other']
- Testing biology...
- 100%|##############################################################################| 717/717 [4:55:16<00:00, 24.71s/it]
- Finished testing biology in 4 hours, 55 minutes, 16 seconds.
- Total, 436/717, 60.81%
- Random Guess Attempts, 15/717, 2.09%
- Correct Random Guesses, 1/15, 6.67%
- Adjusted Score Without Random Guesses, 435/702, 61.97%
- Testing business...
- 100%|##############################################################################| 789/789 [5:21:54<00:00, 24.48s/it]
- Finished testing business in 5 hours, 21 minutes, 54 seconds.
- Total, 294/789, 37.26%
- Random Guess Attempts, 23/789, 2.92%
- Correct Random Guesses, 4/23, 17.39%
- Adjusted Score Without Random Guesses, 290/766, 37.86%
- Testing chemistry...
- 100%|############################################################################| 1132/1132 [7:59:51<00:00, 25.43s/it]
- Finished testing chemistry in 7 hours, 59 minutes, 51 seconds.
- Total, 372/1132, 32.86%
- Random Guess Attempts, 29/1132, 2.56%
- Correct Random Guesses, 5/29, 17.24%
- Adjusted Score Without Random Guesses, 367/1103, 33.27%
- Testing computer science...
- 100%|##############################################################################| 410/410 [2:52:45<00:00, 25.28s/it]
- Finished testing computer science in 2 hours, 52 minutes, 45 seconds.
- Total, 159/410, 38.78%
- Random Guess Attempts, 4/410, 0.98%
- Correct Random Guesses, 0/4, 0.00%
- Adjusted Score Without Random Guesses, 159/406, 39.16%
- Testing economics...
- 100%|##############################################################################| 844/844 [6:02:45<00:00, 25.79s/it]
- Finished testing economics in 6 hours, 2 minutes, 45 seconds.
- Total, 391/844, 46.33%
- Random Guess Attempts, 21/844, 2.49%
- Correct Random Guesses, 4/21, 19.05%
- Adjusted Score Without Random Guesses, 387/823, 47.02%
- Testing engineering...
- 100%|##############################################################################| 969/969 [6:58:06<00:00, 25.89s/it]
- Finished testing engineering in 6 hours, 58 minutes, 7 seconds.
- Total, 226/969, 23.32%
- Random Guess Attempts, 34/969, 3.51%
- Correct Random Guesses, 3/34, 8.82%
- Adjusted Score Without Random Guesses, 223/935, 23.85%
- Testing health...
- 100%|##############################################################################| 818/818 [5:34:26<00:00, 24.53s/it]
- Finished testing health in 5 hours, 34 minutes, 26 seconds.
- Total, 372/818, 45.48%
- Random Guess Attempts, 12/818, 1.47%
- Correct Random Guesses, 1/12, 8.33%
- Adjusted Score Without Random Guesses, 371/806, 46.03%
- Testing history...
- 100%|##############################################################################| 381/381 [2:38:36<00:00, 24.98s/it]
- Finished testing history in 2 hours, 38 minutes, 36 seconds.
- Total, 152/381, 39.90%
- Random Guess Attempts, 2/381, 0.52%
- Correct Random Guesses, 1/2, 50.00%
- Adjusted Score Without Random Guesses, 151/379, 39.84%
- Testing law...
- 100%|############################################################################| 1101/1101 [7:27:12<00:00, 24.37s/it]
- Finished testing law in 7 hours, 27 minutes, 12 seconds.
- Total, 238/1101, 21.62%
- Random Guess Attempts, 13/1101, 1.18%
- Correct Random Guesses, 2/13, 15.38%
- Adjusted Score Without Random Guesses, 236/1088, 21.69%
- Testing math...
- 100%|############################################################################| 1351/1351 [9:31:54<00:00, 25.40s/it]
- Finished testing math in 9 hours, 31 minutes, 54 seconds.
- Total, 525/1351, 38.86%
- Random Guess Attempts, 36/1351, 2.66%
- Correct Random Guesses, 4/36, 11.11%
- Adjusted Score Without Random Guesses, 521/1315, 39.62%
- Testing philosophy...
- 100%|##############################################################################| 499/499 [3:20:12<00:00, 24.07s/it]
- Finished testing philosophy in 3 hours, 20 minutes, 12 seconds.
- Total, 173/499, 34.67%
- Random Guess Attempts, 1/499, 0.20%
- Correct Random Guesses, 1/1, 100.00%
- Adjusted Score Without Random Guesses, 172/498, 34.54%
- Testing physics...
- 100%|############################################################################| 1299/1299 [8:41:32<00:00, 24.09s/it]
- Finished testing physics in 8 hours, 41 minutes, 32 seconds.
- Total, 374/1299, 28.79%
- Random Guess Attempts, 57/1299, 4.39%
- Correct Random Guesses, 8/57, 14.04%
- Adjusted Score Without Random Guesses, 366/1242, 29.47%
- Testing psychology...
- 100%|##############################################################################| 798/798 [5:30:23<00:00, 24.84s/it]
- Finished testing psychology in 5 hours, 30 minutes, 24 seconds.
- Total, 404/798, 50.63%
- Random Guess Attempts, 8/798, 1.00%
- Correct Random Guesses, 1/8, 12.50%
- Adjusted Score Without Random Guesses, 403/790, 51.01%
- Testing other...
- 100%|##############################################################################| 924/924 [6:25:31<00:00, 25.03s/it]
- Finished testing other in 6 hours, 25 minutes, 31 seconds.
- Total, 409/924, 44.26%
- Random Guess Attempts, 4/924, 0.43%
- Correct Random Guesses, 1/4, 25.00%
- Adjusted Score Without Random Guesses, 408/920, 44.35%
- Finished the benchmark in 11 hours, 20 minutes, 37 seconds.
- Total, 4525/12032, 37.61%
- Random Guess Attempts, 259/12032, 2.15%
- Correct Random Guesses, 36/259, 13.90%
- Adjusted Score Without Random Guesses, 4489/11773, 38.13%
- Token Usage:
- Prompt tokens: min 871, average 1352, max 2627, total 16266109, tk/s 54.21
- Completion tokens: min 40, average 2017, max 2077, total 24265055, tk/s 80.87
- Markdown Table:
- | overall | biology | business | chemistry | computer science | economics | engineering | health | history | law | math | philosophy | physics | psychology | other |
- | ------- | ------- | -------- | --------- | ---------------- | --------- | ----------- | ------ | ------- | --- | ---- | ---------- | ------- | ---------- | ----- |
- | 37.61 | 60.81 | 37.26 | 32.86 | 38.78 | 46.33 | 23.32 | 45.48 | 39.90 | 21.62 | 38.86 | 34.67 | 28.79 | 50.63 | 44.26 |
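As a cross-check on the "Finished the benchmark in 11 hours, 20 minutes, 37 seconds" line of this second run, the per-subject durations logged above can be summed directly. The sketch below does only that arithmetic. The durations are copied from the "Finished testing ..." lines, and the names `durations` and `breakdown` are illustrative. The sum comes to roughly 83 hours, which suggests the summary line shows only the part of the elapsed time beyond whole days. That reading is an inference from the numbers, not something the log states.

```python
# (hours, minutes, seconds) per subject, in the order they were tested.
durations = [
    (4, 55, 16),   # biology
    (5, 21, 54),   # business
    (7, 59, 51),   # chemistry
    (2, 52, 45),   # computer science
    (6, 2, 45),    # economics
    (6, 58, 7),    # engineering
    (5, 34, 26),   # health
    (2, 38, 36),   # history
    (7, 27, 12),   # law
    (9, 31, 54),   # math
    (3, 20, 12),   # philosophy
    (8, 41, 32),   # physics
    (5, 30, 24),   # psychology
    (6, 25, 31),   # other
]

# Convert everything to seconds and add it up.
total_seconds = sum(h * 3600 + m * 60 + s for h, m, s in durations)

# Split the total back into days, hours, minutes, and seconds.
days, remainder = divmod(total_seconds, 86400)
hours, remainder = divmod(remainder, 3600)
minutes, seconds = divmod(remainder, 60)
breakdown = (days, hours, minutes, seconds)

print(f"{total_seconds} s = {days} d {hours} h {minutes} min {seconds} s")
```

The remainder after removing whole days is within a few seconds of the logged summary, with the small gap plausibly being time spent between subjects.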