I've tested `Qwen/Qwen2.5-Coder-32B-Instruct` on 1x NVIDIA Hopper H100 (95830 MiB VRAM), quantized to FP8 with an FP8 K/V cache, running under TensorRT-LLM:
sudo docker run --network host -e HF_TOKEN=$HF_TOKEN \
    -v ~/inference-benchmarker-results:/opt/inference-benchmarker/results \
    inference_benchmarker inference-benchmarker --no-console \
    --url http://localhost:8000/v1 \
    --max-vus 8 --duration 120s --warmup 30s --benchmark-kind rate \
    --rates 1.0 --rates 2.0 --rates 3.0 --rates 4.0 --rates 8.0 \
    --prompt-options "min_tokens=10,max_tokens=16000,num_tokens=10000,variance=6000" \
    --decode-options "min_tokens=2000,max_tokens=10000,num_tokens=6000,variance=4000" \
    --model-name "Qwen/Qwen2.5-Coder-32B-Instruct" \
    --tokenizer-name "Qwen/Qwen2.5-Coder-32B-Instruct"
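
(Not part of the original notes: the `inference_benchmarker` image above is assumed to be a locally built tag of Hugging Face's inference-benchmarker; if you don't have it, a build along these lines should work, assuming the repo still ships a Dockerfile for the CLI:)
git clone https://github.com/huggingface/inference-benchmarker.git && \
    cd inference-benchmarker && \
    sudo docker build -t inference_benchmarker .
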
# Results:
┌─────────────────┬──────────────────────────────────────────────────────────────────────┐
│ Parameter       │ Value                                                                │
├─────────────────┼──────────────────────────────────────────────────────────────────────┤
│ Max VUs         │ 8                                                                    │
│ Duration        │ 120                                                                  │
│ Warmup Duration │ 30                                                                   │
│ Benchmark Kind  │ Rate                                                                 │
│ Rates           │ [1.0, 2.0, 3.0, 4.0, 8.0]                                            │
│ Num Rates       │ 10                                                                   │
│ Prompt Options  │ num_tokens=Some(10000),min_tokens=10,max_tokens=16000,variance=6000  │
│ Decode Options  │ num_tokens=Some(6000),min_tokens=2000,max_tokens=10000,variance=4000 │
│ Tokenizer       │ Qwen/Qwen2.5-Coder-32B-Instruct                                      │
│ Extra Metadata  │ N/A                                                                  │
└─────────────────┴──────────────────────────────────────────────────────────────────────┘
┌────────────────────┬────────────┬───────────────────┬────────────┬───────────┬───────────────────┬────────────┬─────────────────────┬─────────────────────────────┬──────────────────────────────┐
│ Benchmark          │ QPS        │ E2E Latency (avg) │ TTFT (avg) │ ITL (avg) │ Throughput        │ Error Rate │ Successful Requests │ Prompt tokens per req (avg) │ Decoded tokens per req (avg) │
├────────────────────┼────────────┼───────────────────┼────────────┼───────────┼───────────────────┼────────────┼─────────────────────┼─────────────────────────────┼──────────────────────────────┤
│ warmup             │ 0.04 req/s │ 26.45 sec         │ 980.16 ms  │ 14.43 ms  │ 66.57 tokens/sec  │ 0.00%      │ 2/2                 │ 10000.00                    │ 1761.00                      │
│ [email protected]/s │ 0.31 req/s │ 22.14 sec         │ 96.71 ms   │ 18.46 ms  │ 370.34 tokens/sec │ 0.00%      │ 36/36               │ 10000.00                    │ 1193.22                      │
│ [email protected]/s │ 0.32 req/s │ 21.67 sec         │ 71.41 ms   │ 18.55 ms  │ 373.51 tokens/sec │ 0.00%      │ 38/38               │ 10000.00                    │ 1163.68                      │
│ [email protected]/s │ 0.32 req/s │ 22.16 sec         │ 71.22 ms   │ 18.59 ms  │ 383.55 tokens/sec │ 0.00%      │ 37/37               │ 10000.00                    │ 1188.54                      │
│ [email protected]/s │ 0.32 req/s │ 23.19 sec         │ 71.54 ms   │ 18.61 ms  │ 401.18 tokens/sec │ 0.00%      │ 38/38               │ 10000.00                    │ 1242.42                      │
│ [email protected]/s │ 0.33 req/s │ 22.20 sec         │ 71.14 ms   │ 18.62 ms  │ 388.78 tokens/sec │ 0.00%      │ 39/39               │ 10000.00                    │ 1188.74                      │
└────────────────────┴────────────┴───────────────────┴────────────┴───────────┴───────────────────┴────────────┴─────────────────────┴─────────────────────────────┴──────────────────────────────┘
So this is roughly 370-400 tokens/sec generation throughput on a single card, with an average of 10000 prompt tokens in and ~1200 decoded tokens out per request (which checks out against the table: ~0.32 completed requests/sec × ~1190 decoded tokens per request ≈ 380 tokens/sec).
A single RTX 6000 Pro would probably perform even better.
TensorRT-LLM build instructions:
# Build a very recent TensorRT-LLM (probably not needed here, but I did other tests with FP4 on Blackwell where you need a recent version):
sudo apt-get update && sudo apt-get -y install git git-lfs && \
    git lfs install && \
    git clone --depth 1 -b v0.21.0rc1 https://github.com/NVIDIA/TensorRT-LLM.git && \
    cd TensorRT-LLM && \
    git submodule update --init --recursive && \
    git lfs pull

# For Hopper H100: 90-real, for Blackwell: 120-real
sudo make -C docker release_build CUDA_ARCHS="90-real"
# SKIPPED here: download the model https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct/tree/main to ~/mymodel/Qwen_Qwen2.5-Coder-32B-Instruct
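# One way to fetch it (assumption, not from the original notes; needs the huggingface_hub CLI):
# pip install -U "huggingface_hub[cli]"
huggingface-cli download Qwen/Qwen2.5-Coder-32B-Instruct \
    --local-dir ~/mymodel/Qwen_Qwen2.5-Coder-32B-Instruct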
# Run the Docker container and map the model directory into it:
sudo docker run --gpus all -it --rm \
    -v ~/mymodel/Qwen_Qwen2.5-Coder-32B-Instruct:/mymodel \
    --net=host --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
    tensorrt_llm/release
# Within the Docker container, create the FP8 quantized checkpoint (FP8 weights and FP8 K/V cache):
cd /app/tensorrt_llm/examples/models/core/qwen
python3 /app/tensorrt_llm/examples/quantization/quantize.py \
    --model_dir /mymodel \
    --dtype bfloat16 \
    --qformat fp8 \
    --kv_cache_dtype fp8 \
    --output_dir /ckpt \
    --calib_size 512
# Build the engine:
trtllm-build --checkpoint_dir /ckpt --output_dir /engine \
    --remove_input_padding enable \
    --kv_cache_type paged \
    --max_batch_size 8 \
    --max_num_tokens 16384 \
    --max_seq_len 26500 \
    --use_paged_context_fmha enable \
    --gemm_plugin disable \
    --multiple_profiles enable \
    --use_fp8_context_fmha enable
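# (Presumably --max_seq_len 26500 is sized for the benchmark above: up to 16000 prompt tokens plus up to 10000 decoded tokens, with a bit of headroom.)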
# Run the server:
export KV_CACHE_FREE_GPU_MEM_FRACTION=0.9 && \
    export ENGINE_DIR=/engine && \
    export TOKENIZER_DIR=/mymodel/ && \
    trtllm-serve ${ENGINE_DIR} --tokenizer ${TOKENIZER_DIR} --max_num_tokens 16384 --max_batch_size 8
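
Quick sanity check of the OpenAI-compatible endpoint before starting the benchmark (a sketch, not from the original notes; the model name the server expects may differ from the HF repo name, so list /v1/models first):
curl http://localhost:8000/v1/models
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "Qwen/Qwen2.5-Coder-32B-Instruct", "messages": [{"role": "user", "content": "Write hello world in Python."}], "max_tokens": 64}'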