Pasted by a guest, May 29th, 2023
root@2b45959ac2a3:/workspace/exllama# python test_benchmark_inference.py -d /workspace/exllama/models/samantha-33B-GPTQ -p
 -- Loading model
 -- Tokenizer: /workspace/exllama/models/samantha-33B-GPTQ/tokenizer.model
 -- Model config: /workspace/exllama/models/samantha-33B-GPTQ/config.json
 -- Model: /workspace/exllama/models/samantha-33B-GPTQ/Samantha-33B-GPTQ-4bit.act-order.safetensors
 -- Sequence length: 2048
 -- Options: ['attention: switched', 'matmul: switched', 'mlp: switched', 'perf']
 ** Time, Load model: 6.89 seconds
 -- Groupsize (inferred): None
 -- Act-order (inferred): no
 ** VRAM, Model: [cuda:0] 15,936.28 MB
 -- Inference, first pass.
 ** Time, Inference: 1.90 seconds
 ** Speed: 1012.72 tokens/second
 -- Generating 128 tokens, 1920 token prompt...
 ** Speed: 35.26 tokens/second
 -- Generating 128 tokens, 4 token prompt...
 ** Speed: 40.48 tokens/second
 ** VRAM, Inference: [cuda:0] 3,964.67 MB
 ** VRAM, Total: [cuda:0] 19,900.95 MB
root@2b45959ac2a3:/workspace/exllama# python test_benchmark_inference.py -d /workspace/exllama/models/guanaco-65B-GPTQ/ -p
 -- Loading model
 -- Tokenizer: /workspace/exllama/models/guanaco-65B-GPTQ/tokenizer.model
 -- Model config: /workspace/exllama/models/guanaco-65B-GPTQ/config.json
 -- Model: /workspace/exllama/models/guanaco-65B-GPTQ/Guanaco-65B-GPTQ-4bit.act-order.safetensors
 -- Sequence length: 2048
 -- Options: ['attention: switched', 'matmul: switched', 'mlp: switched', 'perf']
 ** Time, Load model: 8.21 seconds
 -- Groupsize (inferred): None
 -- Act-order (inferred): no
 ** VRAM, Model: [cuda:0] 31,399.77 MB
 -- Inference, first pass.
 ** Time, Inference: 3.57 seconds
 ** Speed: 537.69 tokens/second
 -- Generating 128 tokens, 1920 token prompt...
 ** Speed: 19.13 tokens/second
 -- Generating 128 tokens, 4 token prompt...
 ** Speed: 19.27 tokens/second
 ** VRAM, Inference: [cuda:0] 6,155.17 MB
 ** VRAM, Total: [cuda:0] 37,554.94 MB
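The `VRAM, Total` line in each run is simply the sum of the `VRAM, Model` and `VRAM, Inference` lines. A minimal sketch that cross-checks the figures copied from the log above (the dict keys are just labels for this check, not exllama output fields):

```python
# Cross-check the VRAM figures reported in the two benchmark runs above:
# total should equal model weights VRAM plus inference (activations/cache) VRAM.
runs = {
    "samantha-33B-GPTQ": {"model_mb": 15936.28, "inference_mb": 3964.67, "total_mb": 19900.95},
    "guanaco-65B-GPTQ":  {"model_mb": 31399.77, "inference_mb": 6155.17, "total_mb": 37554.94},
}

for name, r in runs.items():
    # Allow a small tolerance for the two-decimal rounding in the log.
    diff = abs(r["model_mb"] + r["inference_mb"] - r["total_mb"])
    assert diff < 0.01, f"{name}: VRAM lines do not add up (off by {diff:.2f} MB)"
    print(f"{name}: {r['total_mb']:,.2f} MB total checks out")
```

Both runs check out, which is a quick way to confirm nothing was mistranscribed when comparing the 33B and 65B memory footprints.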