Qwen/Qwen2.5-Coder-32B-Instruct benchmark on 1x H100 with long context (10k tokens input)

I've tested with `Qwen/Qwen2.5-Coder-32B-Instruct` on 1x NVIDIA Hopper H100 (95,830 MiB VRAM), quantized to FP8 with an FP8 KV cache, run with TensorRT-LLM:

sudo docker run --network host -e HF_TOKEN=$HF_TOKEN \
    -v ~/inference-benchmarker-results:/opt/inference-benchmarker/results \
    inference_benchmarker inference-benchmarker --no-console \
    --url http://localhost:8000/v1 \
    --max-vus 8 --duration 120s --warmup 30s --benchmark-kind rate \
    --rates 1.0 --rates 2.0 --rates 3.0 --rates 4.0 --rates 8.0 \
    --prompt-options "min_tokens=10,max_tokens=16000,num_tokens=10000,variance=6000" \
    --decode-options "min_tokens=2000,max_tokens=10000,num_tokens=6000,variance=4000" \
    --model-name "Qwen/Qwen2.5-Coder-32B-Instruct" \
    --tokenizer-name "Qwen/Qwen2.5-Coder-32B-Instruct"
# Results:
┌─────────────────┬──────────────────────────────────────────────────────────────────────┐
│ Parameter       │ Value                                                                │
├─────────────────┼──────────────────────────────────────────────────────────────────────┤
│ Max VUs         │ 8                                                                    │
│ Duration        │ 120                                                                  │
│ Warmup Duration │ 30                                                                   │
│ Benchmark Kind  │ Rate                                                                 │
│ Rates           │ [1.0, 2.0, 3.0, 4.0, 8.0]                                            │
│ Num Rates       │ 10                                                                   │
│ Prompt Options  │ num_tokens=Some(10000),min_tokens=10,max_tokens=16000,variance=6000  │
│ Decode Options  │ num_tokens=Some(6000),min_tokens=2000,max_tokens=10000,variance=4000 │
│ Tokenizer       │ Qwen/Qwen2.5-Coder-32B-Instruct                                      │
│ Extra Metadata  │ N/A                                                                  │
└─────────────────┴──────────────────────────────────────────────────────────────────────┘
┌───────────────────┬────────────┬───────────────────┬────────────┬───────────┬───────────────────┬────────────┬─────────────────────┬─────────────────────────────┬──────────────────────────────┐
│ Benchmark         │ QPS        │ E2E Latency (avg) │ TTFT (avg) │ ITL (avg) │ Throughput        │ Error Rate │ Successful Requests │ Prompt tokens per req (avg) │ Decoded tokens per req (avg) │
├───────────────────┼────────────┼───────────────────┼────────────┼───────────┼───────────────────┼────────────┼─────────────────────┼─────────────────────────────┼──────────────────────────────┤
│ warmup            │ 0.04 req/s │ 26.45 sec         │ 980.16 ms  │ 14.43 ms  │ 66.57 tokens/sec  │ 0.00%      │ 2/2                 │ 10000.00                    │ 1761.00                      │
│ [email protected]/s │ 0.31 req/s │ 22.14 sec         │ 96.71 ms   │ 18.46 ms  │ 370.34 tokens/sec │ 0.00%      │ 36/36               │ 10000.00                    │ 1193.22                      │
│ [email protected]/s │ 0.32 req/s │ 21.67 sec         │ 71.41 ms   │ 18.55 ms  │ 373.51 tokens/sec │ 0.00%      │ 38/38               │ 10000.00                    │ 1163.68                      │
│ [email protected]/s │ 0.32 req/s │ 22.16 sec         │ 71.22 ms   │ 18.59 ms  │ 383.55 tokens/sec │ 0.00%      │ 37/37               │ 10000.00                    │ 1188.54                      │
│ [email protected]/s │ 0.32 req/s │ 23.19 sec         │ 71.54 ms   │ 18.61 ms  │ 401.18 tokens/sec │ 0.00%      │ 38/38               │ 10000.00                    │ 1242.42                      │
│ [email protected]/s │ 0.33 req/s │ 22.20 sec         │ 71.14 ms   │ 18.62 ms  │ 388.78 tokens/sec │ 0.00%      │ 39/39               │ 10000.00                    │ 1188.74                      │
└───────────────────┴────────────┴───────────────────┴────────────┴───────────┴───────────────────┴────────────┴─────────────────────┴─────────────────────────────┴──────────────────────────────┘

So this is 370-400 output tokens/sec of throughput on a single card, with an average of 10,000 input tokens and ~1,200 output tokens per request.
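
As a quick cross-check of the table, the throughput column is essentially QPS times the average number of decoded tokens per request; for example, for the 2.0 req/s row:

# Cross-check one results row: throughput ~= QPS * avg decoded tokens per request.
qps = 0.32                 # [email protected]/s row
decoded_per_req = 1163.68  # avg decoded tokens per request in that row
print(qps * decoded_per_req)  # ~372 tokens/sec, close to the reported 373.51 tokens/sec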

A single RTX 6000 Pro would probably perform even better.

TensorRT-LLM build instructions:

# Build a very recent TensorRT-LLM (probably not needed here, but I did other tests with FP4 on Blackwell where you need a recent version):
sudo apt-get update && sudo apt-get -y install git git-lfs && \
    git lfs install && \
    git clone --depth 1 -b v0.21.0rc1 https://github.com/NVIDIA/TensorRT-LLM.git && \
    cd TensorRT-LLM && \
    git submodule update --init --recursive && \
    git lfs pull
# For Hopper H100: 90-real, for Blackwell: 120-real
sudo make -C docker release_build CUDA_ARCHS="90-real"
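
A small optional check, run inside the built tensorrt_llm/release container, to confirm the image really contains the intended release (this assumes the tensorrt_llm Python package exposes __version__, as recent releases do):

import tensorrt_llm

# Should print something like "0.21.0rc1" for the v0.21.0rc1 tag checked out above.
print(tensorrt_llm.__version__)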

# SKIPPED here: download model https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct/tree/main to ~/mymodel/Qwen_Qwen2.5-Coder-32B-Instruct

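Since the download itself was skipped above, one way to fetch the weights on the host is the huggingface_hub library; a minimal sketch (the target path matches the bind mount used in the next step):

import os
from huggingface_hub import snapshot_download

# Download the full model repository into the directory that gets mounted as /mymodel below.
snapshot_download(
    repo_id="Qwen/Qwen2.5-Coder-32B-Instruct",
    local_dir=os.path.expanduser("~/mymodel/Qwen_Qwen2.5-Coder-32B-Instruct"),
)
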
# Run the Docker container and map the model directory:
sudo docker run --gpus all -it --rm \
    -v ~/mymodel/Qwen_Qwen2.5-Coder-32B-Instruct:/mymodel \
    --net=host --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
    tensorrt_llm/release
# Within the Docker container, create the FP8 quantized checkpoint (FP8 weights and FP8 KV cache):
cd /app/tensorrt_llm/examples/models/core/qwen
python3 /app/tensorrt_llm/examples/quantization/quantize.py \
    --model_dir /mymodel \
    --dtype bfloat16 \
    --qformat fp8 \
    --kv_cache_dtype fp8 \
    --output_dir /ckpt \
    --calib_size 512
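
For a rough sense of why this fits comfortably on a single ~95 GB card: FP8 halves the weight footprint compared to BF16. A back-of-the-envelope estimate, using an approximate parameter count of ~32.8B for Qwen2.5-Coder-32B (the exact count and a few non-quantized tensors shift this slightly):

# Approximate weight memory in BF16 (2 bytes/param) vs FP8 (1 byte/param).
params = 32.8e9  # ~32.8B parameters (approximate)
print(f"BF16: ~{params * 2 / 2**30:.0f} GiB, FP8: ~{params * 1 / 2**30:.0f} GiB")
# -> roughly 61 GiB vs. 31 GiB, leaving most of the card for KV cache and activations.
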
# Build the engine:
trtllm-build --checkpoint_dir /ckpt --output_dir /engine \
    --remove_input_padding enable \
    --kv_cache_type paged \
    --max_batch_size 8 \
    --max_num_tokens 16384 \
    --max_seq_len 26500 \
    --use_paged_context_fmha enable \
    --gemm_plugin disable \
    --multiple_profiles enable \
    --use_fp8_context_fmha enable
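
The engine limits line up with the benchmark workload: max_seq_len 26500 covers up to ~16k prompt tokens plus up to 10k decoded tokens with some margin, and with an FP8 KV cache a full batch of 8 long sequences still fits next to the weights. A rough estimate, assuming Qwen2.5-32B's published shape (64 layers, 8 KV heads, head dim 128; these values come from the model config, not from this paste):

# Rough FP8 KV-cache sizing for max_batch_size=8, max_seq_len=26500.
layers, kv_heads, head_dim, kv_bytes = 64, 8, 128, 1     # assumed Qwen2.5-32B shape, 1 byte per FP8 value
per_token = 2 * layers * kv_heads * head_dim * kv_bytes  # K + V: 131072 bytes = 128 KiB per token
per_seq_gib = per_token * 26500 / 2**30                  # ~3.2 GiB for one full-length sequence
print(f"{per_token / 1024:.0f} KiB/token, ~{per_seq_gib:.1f} GiB/seq, ~{8 * per_seq_gib:.0f} GiB for a batch of 8")
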
# Run the server:
export KV_CACHE_FREE_GPU_MEM_FRACTION=0.9 && \
    export ENGINE_DIR=/engine && \
    export TOKENIZER_DIR=/mymodel/ && \
    trtllm-serve ${ENGINE_DIR} --tokenizer ${TOKENIZER_DIR} --max_num_tokens 16384 --max_batch_size 8
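
Once trtllm-serve is up, it exposes the OpenAI-compatible API on port 8000 that the benchmark command at the top points at; a minimal smoke test with the openai Python client (the model name matches what the benchmarker used; the API key is a placeholder since none is required locally):

# Minimal smoke test against the locally served OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-Coder-32B-Instruct",
    messages=[{"role": "user", "content": "Write a Python one-liner that reverses a string."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)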