Qwen/Qwen2.5-Coder-32B-Instruct benchmark on 1x H100 with long context (10k tokens input) with vLLM

# Benchmarked VM: Microsoft Azure Standard NC40ads H100 v5 (40 vcpus, 320 GiB memory)
# Benchmarked server: vLLM
# you can compare these vLLM numbers with the Nvidia TensorRT-LLM test here: https://pastebin.com/Kc4Cbtfa

# run vLLM server:
sudo docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    vllm/vllm-openai:v0.9.1 \
    --max-model-len 26500 \
    --model textgeflecht/Qwen2.5-Coder-32B-Instruct-FP8-dynamic \
    --disable-log-requests \
    --max_num_batched_tokens 60000 \
    --kv-cache-dtype fp8 \
    --enable-chunked-prefill \
    --gpu_memory_utilization 0.9

# run benchmark:
sudo docker run --network host -e HF_TOKEN=$HF_TOKEN \
  -v ~/inference-benchmarker-results:/opt/inference-benchmarker/results \
  inference_benchmarker inference-benchmarker --no-console \
  --url http://localhost:8000/v1 \
  --max-vus 8 --duration 120s --warmup 30s --benchmark-kind rate \
  --rates 1.0 --rates 2.0 --rates 3.0 --rates 4.0 --rates 8.0 \
  --prompt-options "min_tokens=10,max_tokens=16000,num_tokens=10000,variance=6000" \
  --decode-options "min_tokens=2000,max_tokens=10000,num_tokens=6000,variance=4000" \
  --model-name "textgeflecht/Qwen2.5-Coder-32B-Instruct-FP8-dynamic" \
  --tokenizer-name "textgeflecht/Qwen2.5-Coder-32B-Instruct-FP8-dynamic"
# benchmark results:

┌─────────────────┬──────────────────────────────────────────────────────────────────────┐
│ Parameter       │ Value                                                                │
├─────────────────┼──────────────────────────────────────────────────────────────────────┤
│ Max VUs         │ 8                                                                    │
│ Duration        │ 120                                                                  │
│ Warmup Duration │ 30                                                                   │
│ Benchmark Kind  │ Rate                                                                 │
│ Rates           │ [1.0, 2.0, 3.0, 4.0, 8.0]                                            │
│ Num Rates       │ 10                                                                   │
│ Prompt Options  │ num_tokens=Some(10000),min_tokens=10,max_tokens=16000,variance=6000  │
│ Decode Options  │ num_tokens=Some(6000),min_tokens=2000,max_tokens=10000,variance=4000 │
│ Tokenizer       │ textgeflecht/Qwen2.5-Coder-32B-Instruct-FP8-dynamic                  │
│ Extra Metadata  │ N/A                                                                  │
└─────────────────┴──────────────────────────────────────────────────────────────────────┘


┌────────────────────┬────────────┬───────────────────┬────────────┬───────────┬───────────────────┬────────────┬─────────────────────┬─────────────────────────────┬──────────────────────────────┐
│ Benchmark          │ QPS        │ E2E Latency (avg) │ TTFT (avg) │ ITL (avg) │ Throughput        │ Error Rate │ Successful Requests │ Prompt tokens per req (avg) │ Decoded tokens per req (avg) │
├────────────────────┼────────────┼───────────────────┼────────────┼───────────┼───────────────────┼────────────┼─────────────────────┼─────────────────────────────┼──────────────────────────────┤
│ warmup             │ 0.05 req/s │ 19.82 sec         │ 1305.95 ms │ 16.69 ms  │ 55.96 tokens/sec  │ 0.00%      │ 2/2                 │ 10000.00                    │ 1109.00                      │
│ [email protected]/s │ 0.25 req/s │ 26.05 sec         │ 168.47 ms  │ 23.75 ms  │ 274.15 tokens/sec │ 0.00%      │ 29/29               │ 10000.00                    │ 1078.76                      │
│ [email protected]/s │ 0.29 req/s │ 24.51 sec         │ 57.47 ms   │ 23.52 ms  │ 295.21 tokens/sec │ 0.00%      │ 34/34               │ 10000.00                    │ 1028.29                      │
│ [email protected]/s │ 0.30 req/s │ 22.88 sec         │ 56.71 ms   │ 23.13 ms  │ 296.63 tokens/sec │ 0.00%      │ 35/35               │ 10000.00                    │ 983.80                       │
│ [email protected]/s │ 0.25 req/s │ 27.24 sec         │ 59.15 ms   │ 23.99 ms  │ 283.22 tokens/sec │ 0.00%      │ 30/30               │ 10000.00                    │ 1123.00                      │
│ [email protected]/s │ 0.25 req/s │ 26.14 sec         │ 55.66 ms   │ 23.68 ms  │ 269.53 tokens/sec │ 0.00%      │ 29/29               │ 10000.00                    │ 1097.28                      │
└────────────────────┴────────────┴───────────────────┴────────────┴───────────┴───────────────────┴────────────┴─────────────────────┴─────────────────────────────┴──────────────────────────────┘


# vLLM server output lines:
azureuser@qwentest:~/mymodel$ sudo docker run --runtime nvidia --gpus all     -v ~/.cache/huggingface:/root/.cache/huggingface     -p 8000:8000     vllm/vllm-openai:v0.9.1                                                                                                                                                  --max-model-len 26500     --model textgeflecht/Qwen2.5-Coder-32B-Instruct-FP8-dynamic     --disable-log-requests     --max_num_batched_tokens 60000     --kv-cache-dtype fp8                                                                                                                                                  --enable-chunked-prefill     --gpu_memory_utilization 0.9
INFO 06-12 06:51:55 [__init__.py:244] Automatically detected platform cuda.
INFO 06-12 06:51:59 [api_server.py:1287] vLLM API server version 0.9.1
INFO 06-12 06:51:59 [cli_args.py:309] non-default args: {'model': 'textgeflecht/Qwen2.5-Coder-32B-Instruct-FP8-dynamic', 'max_model_len': 26500, 'kv_cache_dtype': 'fp8', 'max_n                                                                                                                                             um_batched_tokens': 60000, 'enable_chunked_prefill': True, 'disable_log_requests': True}
INFO 06-12 06:52:09 [config.py:823] This model supports multiple tasks: {'classify', 'embed', 'reward', 'score', 'generate'}. Defaulting to 'generate'.
INFO 06-12 06:52:15 [config.py:1559] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy dro                                                                                                                                             p without a proper scaling factor
INFO 06-12 06:52:15 [config.py:2195] Chunked prefill is enabled with max_num_batched_tokens=60000.
WARNING 06-12 06:52:17 [env_override.py:17] NCCL_CUMEM_ENABLE is set to 0, skipping override. This may increase memory overhead with cudagraph+allreduce: https://github.com/NVI                                                                                                                                             DIA/nccl/issues/1234
INFO 06-12 06:52:19 [__init__.py:244] Automatically detected platform cuda.
INFO 06-12 06:52:21 [core.py:455] Waiting for init message from front-end.
INFO 06-12 06:52:21 [core.py:70] Initializing a V1 LLM engine (v0.9.1) with config: model='textgeflecht/Qwen2.5-Coder-32B-Instruct-FP8-dynamic', speculative_config=None, tokeni                                                                                                                                             zer='textgeflecht/Qwen2.5-Coder-32B-Instruct-FP8-dynamic', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, tr                                                                                                                                             ust_remote_code=False, dtype=torch.bfloat16, max_seq_len=26500, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom                                                                                                                                             _all_reduce=False, quantization=compressed-tensors, enforce_eager=False, kv_cache_dtype=fp8,  device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallbac                                                                                                                                             k=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None                                                                                                                                             , otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=textgeflecht/Qwen2.5-Coder-32B-Instruct-FP8-dynamic, num_scheduler_steps=1, multi_step_str                                                                                                                                             eam_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","                                                                                                                                             cache_dir":"","backend":"","custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"inducto                                                                                                                                             r_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,                                                                                                                                             472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,                                                                                                                                             120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":512,"local_cache_dir":null}
WARNING 06-12 06:52:22 [utils.py:2737] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_w                                                                                                                                             orker.Worker object at 0x7f1ea5c1ba70>
INFO 06-12 06:52:22 [parallel_state.py:1065] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
INFO 06-12 06:52:22 [topk_topp_sampler.py:49] Using FlashInfer for top-p & top-k sampling.
INFO 06-12 06:52:22 [gpu_model_runner.py:1595] Starting to load model textgeflecht/Qwen2.5-Coder-32B-Instruct-FP8-dynamic...
INFO 06-12 06:52:23 [gpu_model_runner.py:1600] Loading model from scratch...
INFO 06-12 06:52:23 [cuda.py:252] Using Flash Attention backend on V1 engine.
INFO 06-12 06:52:23 [weight_utils.py:292] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/7 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  14% Completed | 1/7 [00:00<00:02,  2.13it/s]
Loading safetensors checkpoint shards:  29% Completed | 2/7 [00:01<00:02,  1.88it/s]
Loading safetensors checkpoint shards:  43% Completed | 3/7 [00:01<00:02,  1.81it/s]
Loading safetensors checkpoint shards:  57% Completed | 4/7 [00:02<00:01,  1.78it/s]
Loading safetensors checkpoint shards:  71% Completed | 5/7 [00:02<00:01,  1.74it/s]
Loading safetensors checkpoint shards:  86% Completed | 6/7 [00:03<00:00,  1.74it/s]
Loading safetensors checkpoint shards: 100% Completed | 7/7 [00:03<00:00,  1.74it/s]
Loading safetensors checkpoint shards: 100% Completed | 7/7 [00:03<00:00,  1.78it/s]

INFO 06-12 06:52:27 [default_loader.py:272] Loading weights took 4.06 seconds
WARNING 06-12 06:52:27 [kv_cache.py:86] Checkpoint does not provide a q scaling factor. Setting it to k_scale. This only matters for the flash-attn backend.
WARNING 06-12 06:52:27 [kv_cache.py:99] Using KV cache scaling factor 1.0 for fp8_e4m3. This may cause accuracy issues. Please make sure k/v_scale scaling factors are available in the fp8 checkpoint.
WARNING 06-12 06:52:27 [kv_cache.py:130] Using uncalibrated q_scale 1.0 and/or prob_scale 1.0 with fp8 attention. This may cause accuracy issues. Please make sure q/prob scaling factors are available in the fp8 checkpoint.
INFO 06-12 06:52:28 [gpu_model_runner.py:1624] Model loading took 32.1798 GiB and 4.793205 seconds
INFO 06-12 06:52:40 [backends.py:462] Using cache directory: /root/.cache/vllm/torch_compile_cache/8b85eac542/rank_0_0 for vLLM's torch.compile
INFO 06-12 06:52:40 [backends.py:472] Dynamo bytecode transform time: 12.47 s
INFO 06-12 06:52:43 [backends.py:161] Cache the graph of shape None for later use
INFO 06-12 06:53:35 [backends.py:173] Compiling a graph for general shape takes 53.81 s
INFO 06-12 06:54:16 [monitor.py:34] torch.compile takes 66.28 s in total
INFO 06-12 06:54:17 [gpu_worker.py:227] Available KV cache memory: 38.00 GiB
INFO 06-12 06:54:17 [kv_cache_utils.py:715] GPU KV cache size: 311,280 tokens
INFO 06-12 06:54:17 [kv_cache_utils.py:719] Maximum concurrency for 26,500 tokens per request: 11.74x
INFO 06-12 06:54:51 [gpu_model_runner.py:2048] Graph capturing finished in 34 secs, took 1.23 GiB
INFO 06-12 06:54:51 [core.py:171] init engine (profile, create kv cache, warmup model) took 143.58 seconds
INFO 06-12 06:54:52 [loggers.py:137] Engine 000: vllm cache_config_info with initialization after num_gpu_blocks is: 19455
WARNING 06-12 06:54:52 [config.py:1363] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
INFO 06-12 06:54:52 [serving_chat.py:118] Using default chat sampling params from model: {'repetition_penalty': 1.05, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
INFO 06-12 06:54:52 [serving_completion.py:66] Using default completion sampling params from model: {'repetition_penalty': 1.05, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
INFO 06-12 06:54:52 [api_server.py:1349] Starting vLLM API server 0 on http://0.0.0.0:8000
INFO 06-12 06:54:52 [launcher.py:29] Available routes are:
INFO 06-12 06:54:52 [launcher.py:37] Route: /openapi.json, Methods: HEAD, GET
INFO 06-12 06:54:52 [launcher.py:37] Route: /docs, Methods: HEAD, GET
INFO 06-12 06:54:52 [launcher.py:37] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 06-12 06:54:52 [launcher.py:37] Route: /redoc, Methods: HEAD, GET
INFO 06-12 06:54:52 [launcher.py:37] Route: /health, Methods: GET
INFO 06-12 06:54:52 [launcher.py:37] Route: /load, Methods: GET
INFO 06-12 06:54:52 [launcher.py:37] Route: /ping, Methods: POST
INFO 06-12 06:54:52 [launcher.py:37] Route: /ping, Methods: GET
INFO 06-12 06:54:52 [launcher.py:37] Route: /tokenize, Methods: POST
INFO 06-12 06:54:52 [launcher.py:37] Route: /detokenize, Methods: POST
INFO 06-12 06:54:52 [launcher.py:37] Route: /v1/models, Methods: GET
INFO 06-12 06:54:52 [launcher.py:37] Route: /version, Methods: GET
INFO 06-12 06:54:52 [launcher.py:37] Route: /v1/chat/completions, Methods: POST
INFO 06-12 06:54:52 [launcher.py:37] Route: /v1/completions, Methods: POST
INFO 06-12 06:54:52 [launcher.py:37] Route: /v1/embeddings, Methods: POST
INFO 06-12 06:54:52 [launcher.py:37] Route: /pooling, Methods: POST
INFO 06-12 06:54:52 [launcher.py:37] Route: /classify, Methods: POST
INFO 06-12 06:54:52 [launcher.py:37] Route: /score, Methods: POST
INFO 06-12 06:54:52 [launcher.py:37] Route: /v1/score, Methods: POST
INFO 06-12 06:54:52 [launcher.py:37] Route: /v1/audio/transcriptions, Methods: POST
INFO 06-12 06:54:52 [launcher.py:37] Route: /rerank, Methods: POST
INFO 06-12 06:54:52 [launcher.py:37] Route: /v1/rerank, Methods: POST
INFO 06-12 06:54:52 [launcher.py:37] Route: /v2/rerank, Methods: POST
INFO 06-12 06:54:52 [launcher.py:37] Route: /invocations, Methods: POST
INFO 06-12 06:54:52 [launcher.py:37] Route: /metrics, Methods: GET
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
(...)
INFO 06-12 06:27:32 [chat_utils.py:420] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
INFO:     172.17.0.1:40368 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 06-12 06:27:40 [loggers.py:118] Engine 000: Avg prompt throughput: 1002.8 tokens/s, Avg generation throughput: 38.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.4%, Prefix cache hit rate: 0.0%
INFO 06-12 06:27:50 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 59.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.5%, Prefix cache hit rate: 0.0%
INFO:     172.17.0.1:40368 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 06-12 06:28:00 [loggers.py:118] Engine 000: Avg prompt throughput: 1002.8 tokens/s, Avg generation throughput: 59.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.4%, Prefix cache hit rate: 49.9%
INFO 06-12 06:28:10 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 60.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.6%, Prefix cache hit rate: 49.9%
INFO:     172.17.0.1:40368 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:39802 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:39808 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:39824 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:39830 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:39832 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:39802 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:39846 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:39862 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 06-12 06:28:20 [loggers.py:118] Engine 000: Avg prompt throughput: 9025.0 tokens/s, Avg generation throughput: 154.8 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 10.1%, Prefix cache hit rate: 72.7%
INFO:     172.17.0.1:39832 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:39832 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 06-12 06:28:30 [loggers.py:118] Engine 000: Avg prompt throughput: 2005.6 tokens/s, Avg generation throughput: 358.6 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.0%, Prefix cache hit rate: 76.8%
INFO:     172.17.0.1:39824 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 06-12 06:28:40 [loggers.py:118] Engine 000: Avg prompt throughput: 1002.8 tokens/s, Avg generation throughput: 355.6 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.7%, Prefix cache hit rate: 78.5%
INFO:     172.17.0.1:39846 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:39808 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:39802 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:39808 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 06-12 06:28:50 [loggers.py:118] Engine 000: Avg prompt throughput: 4011.3 tokens/s, Avg generation throughput: 334.9 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.7%, Prefix cache hit rate: 83.2%
INFO:     172.17.0.1:39832 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:39830 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 06-12 06:29:00 [loggers.py:118] Engine 000: Avg prompt throughput: 2005.7 tokens/s, Avg generation throughput: 325.9 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 12.0%, Prefix cache hit rate: 84.9%
INFO:     172.17.0.1:39830 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:39824 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 06-12 06:29:10 [loggers.py:118] Engine 000: Avg prompt throughput: 2005.6 tokens/s, Avg generation throughput: 312.2 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.4%, Prefix cache hit rate: 86.3%
INFO:     172.17.0.1:39802 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:39808 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 06-12 06:29:20 [loggers.py:118] Engine 000: Avg prompt throughput: 2005.5 tokens/s, Avg generation throughput: 311.2 tokens/s, Running: 6 reqs, Waiting: 0 reqs, GPU KV cache usage: 11.6%, Prefix cache hit rate: 87.4%
INFO:     172.17.0.1:39862 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:39808 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:39846 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 06-12 06:29:30 [loggers.py:118] Engine 000: Avg prompt throughput: 3008.5 tokens/s, Avg generation throughput: 303.6 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.3%, Prefix cache hit rate: 88.8%
INFO:     172.17.0.1:39830 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:39824 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:39830 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 06-12 06:29:40 [loggers.py:118] Engine 000: Avg prompt throughput: 3008.3 tokens/s, Avg generation throughput: 302.9 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.3%, Prefix cache hit rate: 89.9%
INFO:     172.17.0.1:39832 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 06-12 06:29:50 [loggers.py:118] Engine 000: Avg prompt throughput: 1002.8 tokens/s, Avg generation throughput: 303.5 tokens/s, Running: 5 reqs, Waiting: 0 reqs, GPU KV cache usage: 7.5%, Prefix cache hit rate: 90.2%
INFO:     172.17.0.1:40368 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:39808 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:39862 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:39802 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:40368 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:40368 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 06-12 06:30:00 [loggers.py:118] Engine 000: Avg prompt throughput: 6016.5 tokens/s, Avg generation throughput: 307.7 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.0%, Prefix cache hit rate: 91.8%
INFO:     172.17.0.1:39824 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:39830 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 06-12 06:30:10 [loggers.py:118] Engine 000: Avg prompt throughput: 2005.7 tokens/s, Avg generation throughput: 323.2 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.3%, Prefix cache hit rate: 92.2%
INFO 06-12 06:30:20 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 278.9 tokens/s, Running: 6 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.2%, Prefix cache hit rate: 92.2%
INFO 06-12 06:30:30 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 194.9 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 7.1%, Prefix cache hit rate: 92.2%
INFO 06-12 06:30:40 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 73.3 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.7%, Prefix cache hit rate: 92.2%
INFO 06-12 06:30:50 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 58.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.9%, Prefix cache hit rate: 92.2%
INFO 06-12 06:31:00 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 58.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 4.1%, Prefix cache hit rate: 92.2%
INFO:     172.17.0.1:39830 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:48642 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:48658 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:48662 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 06-12 06:31:10 [loggers.py:118] Engine 000: Avg prompt throughput: 4011.2 tokens/s, Avg generation throughput: 66.4 tokens/s, Running: 4 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.7%, Prefix cache hit rate: 92.9%
INFO:     172.17.0.1:37700 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:37712 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:37718 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:39830 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:37724 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:37700 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:37724 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 06-12 06:31:20 [loggers.py:118] Engine 000: Avg prompt throughput: 7019.7 tokens/s, Avg generation throughput: 334.6 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 7.5%, Prefix cache hit rate: 93.9%
INFO 06-12 06:31:30 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 357.6 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.6%, Prefix cache hit rate: 93.9%
INFO:     172.17.0.1:48658 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:37712 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:37718 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:48642 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:37712 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 06-12 06:31:40 [loggers.py:118] Engine 000: Avg prompt throughput: 5014.1 tokens/s, Avg generation throughput: 341.6 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.3%, Prefix cache hit rate: 94.4%
INFO:     172.17.0.1:37700 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:37724 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:37700 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 06-12 06:31:50 [loggers.py:118] Engine 000: Avg prompt throughput: 3008.5 tokens/s, Avg generation throughput: 333.9 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.6%, Prefix cache hit rate: 94.7%
INFO:     172.17.0.1:48662 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:37718 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 06-12 06:32:00 [loggers.py:118] Engine 000: Avg prompt throughput: 2005.7 tokens/s, Avg generation throughput: 313.6 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 11.9%, Prefix cache hit rate: 94.9%
INFO:     172.17.0.1:48642 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:37718 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:37724 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 06-12 06:32:10 [loggers.py:118] Engine 000: Avg prompt throughput: 3008.4 tokens/s, Avg generation throughput: 313.0 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.9%, Prefix cache hit rate: 95.1%
INFO:     172.17.0.1:37700 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:48658 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 06-12 06:32:20 [loggers.py:118] Engine 000: Avg prompt throughput: 2005.6 tokens/s, Avg generation throughput: 312.3 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 12.2%, Prefix cache hit rate: 95.3%
INFO:     172.17.0.1:37700 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:37712 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 06-12 06:32:30 [loggers.py:118] Engine 000: Avg prompt throughput: 2005.6 tokens/s, Avg generation throughput: 310.6 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.2%, Prefix cache hit rate: 95.4%
INFO:     172.17.0.1:39830 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:37718 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:48642 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:39830 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:48662 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 06-12 06:32:40 [loggers.py:118] Engine 000: Avg prompt throughput: 5014.1 tokens/s, Avg generation throughput: 330.4 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 11.0%, Prefix cache hit rate: 95.7%
INFO:     172.17.0.1:37724 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:48662 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:48658 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 06-12 06:32:50 [loggers.py:118] Engine 000: Avg prompt throughput: 3008.5 tokens/s, Avg generation throughput: 352.2 tokens/s, Running: 7 reqs, Waiting: 0 reqs, GPU KV cache usage: 7.7%, Prefix cache hit rate: 95.9%
INFO:     172.17.0.1:37700 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:37700 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 06-12 06:33:00 [loggers.py:118] Engine 000: Avg prompt throughput: 2005.8 tokens/s, Avg generation throughput: 349.0 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.8%, Prefix cache hit rate: 96.0%
INFO:     172.17.0.1:48642 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:37718 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:37712 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:37724 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 06-12 06:33:10 [loggers.py:118] Engine 000: Avg prompt throughput: 4011.5 tokens/s, Avg generation throughput: 343.1 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 11.3%, Prefix cache hit rate: 96.2%
INFO 06-12 06:33:20 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 271.2 tokens/s, Running: 6 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.5%, Prefix cache hit rate: 96.2%
INFO 06-12 06:33:30 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 222.6 tokens/s, Running: 3 reqs, Waiting: 0 reqs, GPU KV cache usage: 7.9%, Prefix cache hit rate: 96.2%
INFO 06-12 06:33:40 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 90.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.8%, Prefix cache hit rate: 96.2%
INFO:     172.17.0.1:37718 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:34772 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:34786 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:34796 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:34802 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:34786 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:34810 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:34818 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:34834 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 06-12 06:33:50 [loggers.py:118] Engine 000: Avg prompt throughput: 9025.0 tokens/s, Avg generation throughput: 129.5 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 10.0%, Prefix cache hit rate: 96.5%
INFO:     172.17.0.1:34810 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 06-12 06:34:00 [loggers.py:118] Engine 000: Avg prompt throughput: 1002.8 tokens/s, Avg generation throughput: 358.1 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 7.9%, Prefix cache hit rate: 96.6%
INFO:     172.17.0.1:34802 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 06-12 06:34:10 [loggers.py:118] Engine 000: Avg prompt throughput: 1002.9 tokens/s, Avg generation throughput: 354.5 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 11.9%, Prefix cache hit rate: 96.6%
INFO:     172.17.0.1:34796 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:34802 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:37718 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:34834 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:34772 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:34818 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:34834 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 06-12 06:34:20 [loggers.py:118] Engine 000: Avg prompt throughput: 7020.1 tokens/s, Avg generation throughput: 335.6 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 7.7%, Prefix cache hit rate: 96.8%
INFO:     172.17.0.1:34786 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:34786 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:34810 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 06-12 06:34:30 [loggers.py:118] Engine 000: Avg prompt throughput: 3008.5 tokens/s, Avg generation throughput: 330.1 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 7.7%, Prefix cache hit rate: 96.9%
INFO:     172.17.0.1:34796 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:34802 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:34772 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:34818 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 06-12 06:34:40 [loggers.py:118] Engine 000: Avg prompt throughput: 4011.5 tokens/s, Avg generation throughput: 350.8 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 10.8%, Prefix cache hit rate: 97.0%
INFO:     172.17.0.1:34802 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:37718 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 06-12 06:34:50 [loggers.py:118] Engine 000: Avg prompt throughput: 2005.6 tokens/s, Avg generation throughput: 350.8 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 11.4%, Prefix cache hit rate: 97.1%
INFO:     172.17.0.1:37718 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:34786 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:34810 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 06-12 06:35:00 [loggers.py:118] Engine 000: Avg prompt throughput: 3008.6 tokens/s, Avg generation throughput: 332.1 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.4%, Prefix cache hit rate: 97.2%
INFO:     172.17.0.1:34818 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:34834 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:34772 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:34818 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 06-12 06:35:10 [loggers.py:118] Engine 000: Avg prompt throughput: 4011.4 tokens/s, Avg generation throughput: 333.2 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.2%, Prefix cache hit rate: 97.3%
INFO 06-12 06:35:20 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 350.6 tokens/s, Running: 7 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.9%, Prefix cache hit rate: 97.3%
INFO:     172.17.0.1:37718 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:34786 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:34802 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:37718 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:34834 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 06-12 06:35:30 [loggers.py:118] Engine 000: Avg prompt throughput: 5014.6 tokens/s, Avg generation throughput: 314.2 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 11.8%, Prefix cache hit rate: 97.4%
INFO:     172.17.0.1:34772 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:34834 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 06-12 06:35:40 [loggers.py:118] Engine 000: Avg prompt throughput: 2005.7 tokens/s, Avg generation throughput: 314.2 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.2%, Prefix cache hit rate: 97.4%
INFO:     172.17.0.1:34796 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:34810 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 06-12 06:35:50 [loggers.py:118] Engine 000: Avg prompt throughput: 2005.7 tokens/s, Avg generation throughput: 317.6 tokens/s, Running: 5 reqs, Waiting: 0 reqs, GPU KV cache usage: 7.9%, Prefix cache hit rate: 97.5%
INFO 06-12 06:36:00 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 223.8 tokens/s, Running: 4 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.0%, Prefix cache hit rate: 97.5%
INFO 06-12 06:36:10 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 125.3 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 4.3%, Prefix cache hit rate: 97.5%
INFO:     172.17.0.1:34796 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:37718 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:50684 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:50692 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:50708 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:50720 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:50732 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:50742 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:50692 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 06-12 06:36:20 [loggers.py:118] Engine 000: Avg prompt throughput: 8022.7 tokens/s, Avg generation throughput: 143.6 tokens/s, Running: 7 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.9%, Prefix cache hit rate: 97.6%
INFO:     172.17.0.1:50742 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 06-12 06:36:30 [loggers.py:118] Engine 000: Avg prompt throughput: 2005.7 tokens/s, Avg generation throughput: 358.7 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 7.8%, Prefix cache hit rate: 97.6%
INFO 06-12 06:36:40 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 356.0 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.0%, Prefix cache hit rate: 97.6%
INFO:     172.17.0.1:50708 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:34796 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:50720 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:37718 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:34796 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:50692 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:50742 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 06-12 06:36:50 [loggers.py:118] Engine 000: Avg prompt throughput: 7019.5 tokens/s, Avg generation throughput: 346.4 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 11.1%, Prefix cache hit rate: 97.7%
INFO:     172.17.0.1:50692 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:50684 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 06-12 06:37:00 [loggers.py:118] Engine 000: Avg prompt throughput: 2005.6 tokens/s, Avg generation throughput: 325.8 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.3%, Prefix cache hit rate: 97.8%
INFO:     172.17.0.1:50720 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:37718 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 06-12 06:37:10 [loggers.py:118] Engine 000: Avg prompt throughput: 2005.8 tokens/s, Avg generation throughput: 313.9 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 11.8%, Prefix cache hit rate: 97.8%
INFO:     172.17.0.1:50720 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:50742 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:50692 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 06-12 06:37:20 [loggers.py:118] Engine 000: Avg prompt throughput: 3008.4 tokens/s, Avg generation throughput: 313.0 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 12.0%, Prefix cache hit rate: 97.8%
INFO:     172.17.0.1:50692 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:50708 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 06-12 06:37:30 [loggers.py:118] Engine 000: Avg prompt throughput: 2005.8 tokens/s, Avg generation throughput: 312.1 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.2%, Prefix cache hit rate: 97.9%
INFO:     172.17.0.1:50732 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:37718 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 06-12 06:37:40 [loggers.py:118] Engine 000: Avg prompt throughput: 2005.7 tokens/s, Avg generation throughput: 314.1 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 12.0%, Prefix cache hit rate: 97.9%
INFO:     172.17.0.1:50720 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:37718 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:34796 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 06-12 06:37:50 [loggers.py:118] Engine 000: Avg prompt throughput: 3008.2 tokens/s, Avg generation throughput: 314.7 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.6%, Prefix cache hit rate: 97.9%
INFO:     172.17.0.1:50684 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:50692 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:50684 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:50708 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 06-12 06:38:00 [loggers.py:118] Engine 000: Avg prompt throughput: 4011.4 tokens/s, Avg generation throughput: 326.7 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.1%, Prefix cache hit rate: 98.0%
INFO 06-12 06:38:10 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 316.0 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.1%, Prefix cache hit rate: 98.0%
INFO:     172.17.0.1:50742 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:50732 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:37718 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 06-12 06:38:20 [loggers.py:118] Engine 000: Avg prompt throughput: 3008.6 tokens/s, Avg generation throughput: 313.2 tokens/s, Running: 5 reqs, Waiting: 0 reqs, GPU KV cache usage: 7.6%, Prefix cache hit rate: 98.0%
INFO 06-12 06:38:30 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 215.4 tokens/s, Running: 4 reqs, Waiting: 0 reqs, GPU KV cache usage: 7.9%, Prefix cache hit rate: 98.0%
INFO 06-12 06:38:40 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 185.7 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 4.6%, Prefix cache hit rate: 98.0%
INFO:     172.17.0.1:50708 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:34796 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:37718 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:53600 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:53604 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:53608 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:53612 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:53626 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:34796 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:53608 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 06-12 06:38:50 [loggers.py:118] Engine 000: Avg prompt throughput: 10027.6 tokens/s, Avg generation throughput: 263.6 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 10.4%, Prefix cache hit rate: 98.1%
INFO:     172.17.0.1:53608 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 06-12 06:39:00 [loggers.py:118] Engine 000: Avg prompt throughput: 1002.8 tokens/s, Avg generation throughput: 357.9 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.3%, Prefix cache hit rate: 98.1%
INFO:     172.17.0.1:53612 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:53600 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 06-12 06:39:10 [loggers.py:118] Engine 000: Avg prompt throughput: 2005.5 tokens/s, Avg generation throughput: 353.1 tokens/s, Running: 7 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.3%, Prefix cache hit rate: 98.2%
INFO:     172.17.0.1:37718 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:53626 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:37718 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:53608 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 06-12 06:39:20 [loggers.py:118] Engine 000: Avg prompt throughput: 4011.5 tokens/s, Avg generation throughput: 348.0 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.6%, Prefix cache hit rate: 98.2%
INFO:     172.17.0.1:50708 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 06-12 06:39:30 [loggers.py:118] Engine 000: Avg prompt throughput: 1002.8 tokens/s, Avg generation throughput: 333.9 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 12.3%, Prefix cache hit rate: 98.2%
INFO:     172.17.0.1:50708 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:53612 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 06-12 06:39:40 [loggers.py:118] Engine 000: Avg prompt throughput: 2005.5 tokens/s, Avg generation throughput: 317.2 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.6%, Prefix cache hit rate: 98.2%
INFO:     172.17.0.1:37718 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:53600 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 06-12 06:39:50 [loggers.py:118] Engine 000: Avg prompt throughput: 2005.7 tokens/s, Avg generation throughput: 314.7 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 13.0%, Prefix cache hit rate: 98.2%
INFO:     172.17.0.1:53626 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:53600 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 06-12 06:40:00 [loggers.py:118] Engine 000: Avg prompt throughput: 2005.5 tokens/s, Avg generation throughput: 313.3 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 10.2%, Prefix cache hit rate: 98.3%
INFO:     172.17.0.1:50708 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:53608 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:53612 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:53608 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:53604 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:34796 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 06-12 06:40:10 [loggers.py:118] Engine 000: Avg prompt throughput: 6017.1 tokens/s, Avg generation throughput: 317.1 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 10.8%, Prefix cache hit rate: 98.3%
INFO:     172.17.0.1:34796 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 06-12 06:40:20 [loggers.py:118] Engine 000: Avg prompt throughput: 1002.8 tokens/s, Avg generation throughput: 351.4 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.6%, Prefix cache hit rate: 98.3%
INFO:     172.17.0.1:53600 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:53626 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 06-12 06:40:30 [loggers.py:118] Engine 000: Avg prompt throughput: 2005.6 tokens/s, Avg generation throughput: 329.6 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.9%, Prefix cache hit rate: 98.3%
INFO:     172.17.0.1:53612 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:53608 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     172.17.0.1:53612 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 06-12 06:40:40 [loggers.py:118] Engine 000: Avg prompt throughput: 3008.7 tokens/s, Avg generation throughput: 315.2 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.1%, Prefix cache hit rate: 98.4%
INFO:     172.17.0.1:34796 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 06-12 06:40:50 [loggers.py:118] Engine 000: Avg prompt throughput: 1002.9 tokens/s, Avg generation throughput: 309.2 tokens/s, Running: 7 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.1%, Prefix cache hit rate: 98.4%
INFO 06-12 06:41:00 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 264.9 tokens/s, Running: 6 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.4%, Prefix cache hit rate: 98.4%
INFO 06-12 06:41:10 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 97.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.7%, Prefix cache hit rate: 98.4%
INFO 06-12 06:41:20 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 49.6 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 98.4%
INFO 06-12 06:41:30 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 98.4%
^CINFO 06-12 06:48:46 [launcher.py:80] Shutting down FastAPI HTTP server.