I've tested `Qwen/Qwen2.5-Coder-32B-Instruct` on 1x NVIDIA Hopper H100 (95830 MiB VRAM), quantized to FP8 with an FP8 K/V cache, running under TensorRT-LLM:
sudo docker run --network host -e HF_TOKEN=$HF_TOKEN \
    -v ~/inference-benchmarker-results:/opt/inference-benchmarker/results \
    inference_benchmarker inference-benchmarker --no-console \
    --url http://localhost:8000/v1 \
    --max-vus 8 --duration 120s --warmup 30s --benchmark-kind rate \
    --rates 1.0 --rates 2.0 --rates 3.0 --rates 4.0 --rates 8.0 \
    --prompt-options "min_tokens=10,max_tokens=16000,num_tokens=10000,variance=6000" \
    --decode-options "min_tokens=2000,max_tokens=10000,num_tokens=6000,variance=4000" \
    --model-name "Qwen/Qwen2.5-Coder-32B-Instruct" \
    --tokenizer-name "Qwen/Qwen2.5-Coder-32B-Instruct"
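
(Not part of the original notes: the `inference_benchmarker` image above is assumed to be a locally built tag of Hugging Face's inference-benchmarker; if you don't have it, a build along these lines should work, assuming the repo still ships a Dockerfile for the CLI:)
git clone https://github.com/huggingface/inference-benchmarker.git && \
    cd inference-benchmarker && \
    sudo docker build -t inference_benchmarker .
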
# Results:
┌─────────────────┬──────────────────────────────────────────────────────────────────────┐
│ Parameter       │ Value                                                                │
├─────────────────┼──────────────────────────────────────────────────────────────────────┤
│ Max VUs         │ 8                                                                    │
│ Duration        │ 120                                                                  │
│ Warmup Duration │ 30                                                                   │
│ Benchmark Kind  │ Rate                                                                 │
│ Rates           │ [1.0, 2.0, 3.0, 4.0, 8.0]                                            │
│ Num Rates       │ 10                                                                   │
│ Prompt Options  │ num_tokens=Some(10000),min_tokens=10,max_tokens=16000,variance=6000  │
│ Decode Options  │ num_tokens=Some(6000),min_tokens=2000,max_tokens=10000,variance=4000 │
│ Tokenizer       │ Qwen/Qwen2.5-Coder-32B-Instruct                                      │
│ Extra Metadata  │ N/A                                                                  │
└─────────────────┴──────────────────────────────────────────────────────────────────────┘
┌────────────────────┬────────────┬───────────────────┬────────────┬───────────┬───────────────────┬────────────┬─────────────────────┬─────────────────────────────┬──────────────────────────────┐
│ Benchmark          │ QPS        │ E2E Latency (avg) │ TTFT (avg) │ ITL (avg) │ Throughput        │ Error Rate │ Successful Requests │ Prompt tokens per req (avg) │ Decoded tokens per req (avg) │
├────────────────────┼────────────┼───────────────────┼────────────┼───────────┼───────────────────┼────────────┼─────────────────────┼─────────────────────────────┼──────────────────────────────┤
│ warmup             │ 0.04 req/s │ 26.45 sec         │ 980.16 ms  │ 14.43 ms  │ 66.57 tokens/sec  │ 0.00%      │ 2/2                 │ 10000.00                    │ 1761.00                      │
│ [email protected]/s │ 0.31 req/s │ 22.14 sec         │ 96.71 ms   │ 18.46 ms  │ 370.34 tokens/sec │ 0.00%      │ 36/36               │ 10000.00                    │ 1193.22                      │
│ [email protected]/s │ 0.32 req/s │ 21.67 sec         │ 71.41 ms   │ 18.55 ms  │ 373.51 tokens/sec │ 0.00%      │ 38/38               │ 10000.00                    │ 1163.68                      │
│ [email protected]/s │ 0.32 req/s │ 22.16 sec         │ 71.22 ms   │ 18.59 ms  │ 383.55 tokens/sec │ 0.00%      │ 37/37               │ 10000.00                    │ 1188.54                      │
│ [email protected]/s │ 0.32 req/s │ 23.19 sec         │ 71.54 ms   │ 18.61 ms  │ 401.18 tokens/sec │ 0.00%      │ 38/38               │ 10000.00                    │ 1242.42                      │
│ [email protected]/s │ 0.33 req/s │ 22.20 sec         │ 71.14 ms   │ 18.62 ms  │ 388.78 tokens/sec │ 0.00%      │ 39/39               │ 10000.00                    │ 1188.74                      │
└────────────────────┴────────────┴───────────────────┴────────────┴───────────┴───────────────────┴────────────┴─────────────────────┴─────────────────────────────┴──────────────────────────────┘
So this is roughly 370-400 tokens/sec generation throughput on a single card, with an average of 10000 prompt tokens in and ~1200 decoded tokens out per request (which checks out against the table: ~0.32 completed requests/sec × ~1190 decoded tokens per request ≈ 380 tokens/sec).
A single RTX 6000 Pro would probably perform even better.
TensorRT-LLM build instructions:
# Build a very recent TensorRT-LLM (probably not needed here, but I did other tests with FP4 on Blackwell where you need a recent version):
sudo apt-get update && sudo apt-get -y install git git-lfs && \
    git lfs install && \
    git clone --depth 1 -b v0.21.0rc1 https://github.com/NVIDIA/TensorRT-LLM.git && \
    cd TensorRT-LLM && \
    git submodule update --init --recursive && \
    git lfs pull

# For Hopper H100: 90-real, for Blackwell: 120-real
sudo make -C docker release_build CUDA_ARCHS="90-real"
# SKIPPED here: download the model https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct/tree/main to ~/mymodel/Qwen_Qwen2.5-Coder-32B-Instruct
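# One way to fetch it (assumption, not from the original notes; needs the huggingface_hub CLI):
# pip install -U "huggingface_hub[cli]"
huggingface-cli download Qwen/Qwen2.5-Coder-32B-Instruct \
    --local-dir ~/mymodel/Qwen_Qwen2.5-Coder-32B-Instruct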
# Run the Docker container and map the model directory into it:
sudo docker run --gpus all -it --rm \
    -v ~/mymodel/Qwen_Qwen2.5-Coder-32B-Instruct:/mymodel \
    --net=host --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
    tensorrt_llm/release
# Within the Docker container, create the FP8 quantized checkpoint (FP8 weights and FP8 K/V cache):
cd /app/tensorrt_llm/examples/models/core/qwen
python3 /app/tensorrt_llm/examples/quantization/quantize.py \
    --model_dir /mymodel \
    --dtype bfloat16 \
    --qformat fp8 \
    --kv_cache_dtype fp8 \
    --output_dir /ckpt \
    --calib_size 512
# Build the engine:
trtllm-build --checkpoint_dir /ckpt --output_dir /engine \
    --remove_input_padding enable \
    --kv_cache_type paged \
    --max_batch_size 8 \
    --max_num_tokens 16384 \
    --max_seq_len 26500 \
    --use_paged_context_fmha enable \
    --gemm_plugin disable \
    --multiple_profiles enable \
    --use_fp8_context_fmha enable
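# (Presumably --max_seq_len 26500 is sized for the benchmark above: up to 16000 prompt tokens plus up to 10000 decoded tokens, with a bit of headroom.)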
# Run the server:
export KV_CACHE_FREE_GPU_MEM_FRACTION=0.9 && \
    export ENGINE_DIR=/engine && \
    export TOKENIZER_DIR=/mymodel/ && \
    trtllm-serve ${ENGINE_DIR} --tokenizer ${TOKENIZER_DIR} --max_num_tokens 16384 --max_batch_size 8
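
Quick sanity check of the OpenAI-compatible endpoint before starting the benchmark (a sketch, not from the original notes; the model name the server expects may differ from the HF repo name, so list /v1/models first):
curl http://localhost:8000/v1/models
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "Qwen/Qwen2.5-Coder-32B-Instruct", "messages": [{"role": "user", "content": "Write hello world in Python."}], "max_tokens": 64}'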