Qwen/Qwen2.5-Coder-32B-Instruct benchmark on 1x H100 with long context (10k tokens input) with vLLM

a guest
Jun 12th, 2025
# Benchmarked VM: Microsoft Azure Standard NC40ads H100 v5 (40 vCPUs, 320 GiB memory)
# Benchmarked server: vLLM
# You can compare these vLLM numbers with the Nvidia TensorRT-LLM test here: https://pastebin.com/Kc4Cbtfa

# Run the vLLM server:
sudo docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:v0.9.1 \
  --max-model-len 26500 \
  --model textgeflecht/Qwen2.5-Coder-32B-Instruct-FP8-dynamic \
  --disable-log-requests \
  --max_num_batched_tokens 60000 \
  --kv-cache-dtype fp8 \
  --enable-chunked-prefill \
  --gpu_memory_utilization 0.9

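# Note: the engine takes a while to come up (the server output further down reports
# about 144 seconds for engine init), so it can help to wait for the server before
# starting the benchmark. A minimal Python sketch, assuming the port mapping above
# (8000) and the /health route that vLLM lists at startup. The helper names http_ok
# and wait_until_ready are made up for this example.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=5.0):
    """Return True if a GET request to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, non-2xx status: treat all as "not ready".
        return False


def wait_until_ready(probe, attempts=60, delay_s=5.0, sleep=time.sleep):
    """Call `probe()` until it returns True.

    Returns the number of probes that were needed, or None if the server
    never became ready within `attempts` tries.
    """
    for attempt in range(1, attempts + 1):
        if probe():
            return attempt
        sleep(delay_s)
    return None


# Example usage (not executed here, because it needs a running server):
#   wait_until_ready(lambda: http_ok("http://localhost:8000/health"))
```

# With the defaults this polls for up to 5 minutes (60 probes, 5 seconds apart).
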
# Run the benchmark:
sudo docker run --network host -e HF_TOKEN=$HF_TOKEN \
  -v ~/inference-benchmarker-results:/opt/inference-benchmarker/results \
  inference_benchmarker inference-benchmarker --no-console \
  --url http://localhost:8000/v1 \
  --max-vus 8 --duration 120s --warmup 30s --benchmark-kind rate \
  --rates 1.0 --rates 2.0 --rates 3.0 --rates 4.0 --rates 8.0 \
  --prompt-options "min_tokens=10,max_tokens=16000,num_tokens=10000,variance=6000" \
  --decode-options "min_tokens=2000,max_tokens=10000,num_tokens=6000,variance=4000" \
  --model-name "textgeflecht/Qwen2.5-Coder-32B-Instruct-FP8-dynamic" \
  --tokenizer-name "textgeflecht/Qwen2.5-Coder-32B-Instruct-FP8-dynamic"

# Benchmark results:

┌─────────────────┬──────────────────────────────────────────────────────────────────────┐
│ Parameter       │ Value                                                                │
├─────────────────┼──────────────────────────────────────────────────────────────────────┤
│ Max VUs         │ 8                                                                    │
│ Duration        │ 120                                                                  │
│ Warmup Duration │ 30                                                                   │
│ Benchmark Kind  │ Rate                                                                 │
│ Rates           │ [1.0, 2.0, 3.0, 4.0, 8.0]                                            │
│ Num Rates       │ 10                                                                   │
│ Prompt Options  │ num_tokens=Some(10000),min_tokens=10,max_tokens=16000,variance=6000  │
│ Decode Options  │ num_tokens=Some(6000),min_tokens=2000,max_tokens=10000,variance=4000 │
│ Tokenizer       │ textgeflecht/Qwen2.5-Coder-32B-Instruct-FP8-dynamic                  │
│ Extra Metadata  │ N/A                                                                  │
└─────────────────┴──────────────────────────────────────────────────────────────────────┘

┌────────────────────┬────────────┬───────────────────┬────────────┬───────────┬───────────────────┬────────────┬─────────────────────┬─────────────────────────────┬──────────────────────────────┐
│ Benchmark          │ QPS        │ E2E Latency (avg) │ TTFT (avg) │ ITL (avg) │ Throughput        │ Error Rate │ Successful Requests │ Prompt tokens per req (avg) │ Decoded tokens per req (avg) │
├────────────────────┼────────────┼───────────────────┼────────────┼───────────┼───────────────────┼────────────┼─────────────────────┼─────────────────────────────┼──────────────────────────────┤
│ warmup             │ 0.05 req/s │ 19.82 sec         │ 1305.95 ms │ 16.69 ms  │ 55.96 tokens/sec  │ 0.00%      │ 2/2                 │ 10000.00                    │ 1109.00                      │
│ constant@1.00req/s │ 0.25 req/s │ 26.05 sec         │ 168.47 ms  │ 23.75 ms  │ 274.15 tokens/sec │ 0.00%      │ 29/29               │ 10000.00                    │ 1078.76                      │
│ constant@2.00req/s │ 0.29 req/s │ 24.51 sec         │ 57.47 ms   │ 23.52 ms  │ 295.21 tokens/sec │ 0.00%      │ 34/34               │ 10000.00                    │ 1028.29                      │
│ constant@3.00req/s │ 0.30 req/s │ 22.88 sec         │ 56.71 ms   │ 23.13 ms  │ 296.63 tokens/sec │ 0.00%      │ 35/35               │ 10000.00                    │ 983.80                       │
│ constant@4.00req/s │ 0.25 req/s │ 27.24 sec         │ 59.15 ms   │ 23.99 ms  │ 283.22 tokens/sec │ 0.00%      │ 30/30               │ 10000.00                    │ 1123.00                      │
│ constant@8.00req/s │ 0.25 req/s │ 26.14 sec         │ 55.66 ms   │ 23.68 ms  │ 269.53 tokens/sec │ 0.00%      │ 29/29               │ 10000.00                    │ 1097.28                      │
└────────────────────┴────────────┴───────────────────┴────────────┴───────────┴───────────────────┴────────────┴─────────────────────┴─────────────────────────────┴──────────────────────────────┘

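# A rough sanity check of the results table, as a Python sketch. It assumes that
# E2E latency is roughly TTFT + ITL x decoded tokens, that the Throughput column
# counts decoded tokens only, and that --max-vus 8 caps the number of requests in
# flight, so achieved QPS cannot exceed 8 / E2E latency. Under those assumptions the
# flat 0.25 to 0.30 req/s across all target rates suggests the run was limited by
# the VU cap rather than by the requested rate. The numbers are copied from the
# table; the function name derived is illustrative.

```python
# (target req/s, achieved QPS, E2E s, TTFT ms, ITL ms, tokens/s, decoded tokens per request)
ROWS = [
    (1.0, 0.25, 26.05, 168.47, 23.75, 274.15, 1078.76),
    (2.0, 0.29, 24.51, 57.47, 23.52, 295.21, 1028.29),
    (3.0, 0.30, 22.88, 56.71, 23.13, 296.63, 983.80),
    (4.0, 0.25, 27.24, 59.15, 23.99, 283.22, 1123.00),
    (8.0, 0.25, 26.14, 55.66, 23.68, 269.53, 1097.28),
]
MAX_VUS = 8  # from --max-vus 8


def derived(row):
    """Recompute a few figures from one table row (all are approximations)."""
    target, qps, e2e_s, ttft_ms, itl_ms, tokens_per_s, decoded = row
    return {
        "target": target,
        # Time to first token plus one inter-token gap per decoded token.
        "e2e_estimate_s": (ttft_ms + itl_ms * decoded) / 1000.0,
        # Requests per second times decoded tokens per request.
        "throughput_estimate": qps * decoded,
        # With at most MAX_VUS requests in flight, QPS is bounded by this value.
        "qps_ceiling": MAX_VUS / e2e_s,
        # Little's law: average number of requests in flight.
        "in_flight": qps * e2e_s,
    }


for row in ROWS:
    d = derived(row)
    print(
        f"target {d['target']:.1f} req/s: "
        f"E2E ~{d['e2e_estimate_s']:.2f} s, "
        f"throughput ~{d['throughput_estimate']:.1f} tokens/s, "
        f"QPS ceiling {d['qps_ceiling']:.2f}, "
        f"in flight ~{d['in_flight']:.1f}"
    )
```

# The QPS column is rounded to two decimals, so the estimates only agree with the
# table to within a few percent.
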
# vLLM server output lines:
  63. azureuser@qwentest:~/mymodel$ sudo docker run --runtime nvidia --gpus all -v ~/.cache/huggingface:/root/.cache/huggingface -p 8000:8000 vllm/vllm-openai:v0.9.1 --max-model-len 26500 --model textgeflecht/Qwen2.5-Coder-32B-Instruct-FP8-dynamic --disable-log-requests --max_num_batched_tokens 60000 --kv-cache-dtype fp8 --enable-chunked-prefill --gpu_memory_utilization 0.9
  64. INFO 06-12 06:51:55 [__init__.py:244] Automatically detected platform cuda.
  65. INFO 06-12 06:51:59 [api_server.py:1287] vLLM API server version 0.9.1
  66. INFO 06-12 06:51:59 [cli_args.py:309] non-default args: {'model': 'textgeflecht/Qwen2.5-Coder-32B-Instruct-FP8-dynamic', 'max_model_len': 26500, 'kv_cache_dtype': 'fp8', 'max_num_batched_tokens': 60000, 'enable_chunked_prefill': True, 'disable_log_requests': True}
  67. INFO 06-12 06:52:09 [config.py:823] This model supports multiple tasks: {'classify', 'embed', 'reward', 'score', 'generate'}. Defaulting to 'generate'.
  68. INFO 06-12 06:52:15 [config.py:1559] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor
  69. INFO 06-12 06:52:15 [config.py:2195] Chunked prefill is enabled with max_num_batched_tokens=60000.
  70. WARNING 06-12 06:52:17 [env_override.py:17] NCCL_CUMEM_ENABLE is set to 0, skipping override. This may increase memory overhead with cudagraph+allreduce: https://github.com/NVIDIA/nccl/issues/1234
  71. INFO 06-12 06:52:19 [__init__.py:244] Automatically detected platform cuda.
  72. INFO 06-12 06:52:21 [core.py:455] Waiting for init message from front-end.
  73. INFO 06-12 06:52:21 [core.py:70] Initializing a V1 LLM engine (v0.9.1) with config: model='textgeflecht/Qwen2.5-Coder-32B-Instruct-FP8-dynamic', speculative_config=None, tokenizer='textgeflecht/Qwen2.5-Coder-32B-Instruct-FP8-dynamic', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=26500, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, kv_cache_dtype=fp8, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=textgeflecht/Qwen2.5-Coder-32B-Instruct-FP8-dynamic, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":512,"local_cache_dir":null}
  74. WARNING 06-12 06:52:22 [utils.py:2737] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f1ea5c1ba70>
  75. INFO 06-12 06:52:22 [parallel_state.py:1065] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
  76. INFO 06-12 06:52:22 [topk_topp_sampler.py:49] Using FlashInfer for top-p & top-k sampling.
  77. INFO 06-12 06:52:22 [gpu_model_runner.py:1595] Starting to load model textgeflecht/Qwen2.5-Coder-32B-Instruct-FP8-dynamic...
  78. INFO 06-12 06:52:23 [gpu_model_runner.py:1600] Loading model from scratch...
  79. INFO 06-12 06:52:23 [cuda.py:252] Using Flash Attention backend on V1 engine.
  80. INFO 06-12 06:52:23 [weight_utils.py:292] Using model weights format ['*.safetensors']
  81. Loading safetensors checkpoint shards: 0% Completed | 0/7 [00:00<?, ?it/s]
  82. Loading safetensors checkpoint shards: 14% Completed | 1/7 [00:00<00:02, 2.13it/s]
  83. Loading safetensors checkpoint shards: 29% Completed | 2/7 [00:01<00:02, 1.88it/s]
  84. Loading safetensors checkpoint shards: 43% Completed | 3/7 [00:01<00:02, 1.81it/s]
  85. Loading safetensors checkpoint shards: 57% Completed | 4/7 [00:02<00:01, 1.78it/s]
  86. Loading safetensors checkpoint shards: 71% Completed | 5/7 [00:02<00:01, 1.74it/s]
  87. Loading safetensors checkpoint shards: 86% Completed | 6/7 [00:03<00:00, 1.74it/s]
  88. Loading safetensors checkpoint shards: 100% Completed | 7/7 [00:03<00:00, 1.74it/s]
  89. Loading safetensors checkpoint shards: 100% Completed | 7/7 [00:03<00:00, 1.78it/s]
  90.  
  91. INFO 06-12 06:52:27 [default_loader.py:272] Loading weights took 4.06 seconds
  92. WARNING 06-12 06:52:27 [kv_cache.py:86] Checkpoint does not provide a q scaling factor. Setting it to k_scale. This only matters for the flash-attn backend.
  93. WARNING 06-12 06:52:27 [kv_cache.py:99] Using KV cache scaling factor 1.0 for fp8_e4m3. This may cause accuracy issues. Please make sure k/v_scale scaling factors are available in the fp8 checkpoint.
  94. WARNING 06-12 06:52:27 [kv_cache.py:130] Using uncalibrated q_scale 1.0 and/or prob_scale 1.0 with fp8 attention. This may cause accuracy issues. Please make sure q/prob scaling factors are available in the fp8 checkpoint.
  95. INFO 06-12 06:52:28 [gpu_model_runner.py:1624] Model loading took 32.1798 GiB and 4.793205 seconds
  96. INFO 06-12 06:52:40 [backends.py:462] Using cache directory: /root/.cache/vllm/torch_compile_cache/8b85eac542/rank_0_0 for vLLM's torch.compile
  97. INFO 06-12 06:52:40 [backends.py:472] Dynamo bytecode transform time: 12.47 s
  98. INFO 06-12 06:52:43 [backends.py:161] Cache the graph of shape None for later use
  99. INFO 06-12 06:53:35 [backends.py:173] Compiling a graph for general shape takes 53.81 s
  100. INFO 06-12 06:54:16 [monitor.py:34] torch.compile takes 66.28 s in total
  101. INFO 06-12 06:54:17 [gpu_worker.py:227] Available KV cache memory: 38.00 GiB
  102. INFO 06-12 06:54:17 [kv_cache_utils.py:715] GPU KV cache size: 311,280 tokens
  103. INFO 06-12 06:54:17 [kv_cache_utils.py:719] Maximum concurrency for 26,500 tokens per request: 11.74x
  104. INFO 06-12 06:54:51 [gpu_model_runner.py:2048] Graph capturing finished in 34 secs, took 1.23 GiB
  105. INFO 06-12 06:54:51 [core.py:171] init engine (profile, create kv cache, warmup model) took 143.58 seconds
  106. INFO 06-12 06:54:52 [loggers.py:137] Engine 000: vllm cache_config_info with initialization after num_gpu_blocks is: 19455
  107. WARNING 06-12 06:54:52 [config.py:1363] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
  108. INFO 06-12 06:54:52 [serving_chat.py:118] Using default chat sampling params from model: {'repetition_penalty': 1.05, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
  109. INFO 06-12 06:54:52 [serving_completion.py:66] Using default completion sampling params from model: {'repetition_penalty': 1.05, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
  110. INFO 06-12 06:54:52 [api_server.py:1349] Starting vLLM API server 0 on http://0.0.0.0:8000
  111. INFO 06-12 06:54:52 [launcher.py:29] Available routes are:
  112. INFO 06-12 06:54:52 [launcher.py:37] Route: /openapi.json, Methods: HEAD, GET
  113. INFO 06-12 06:54:52 [launcher.py:37] Route: /docs, Methods: HEAD, GET
  114. INFO 06-12 06:54:52 [launcher.py:37] Route: /docs/oauth2-redirect, Methods: HEAD, GET
  115. INFO 06-12 06:54:52 [launcher.py:37] Route: /redoc, Methods: HEAD, GET
  116. INFO 06-12 06:54:52 [launcher.py:37] Route: /health, Methods: GET
  117. INFO 06-12 06:54:52 [launcher.py:37] Route: /load, Methods: GET
  118. INFO 06-12 06:54:52 [launcher.py:37] Route: /ping, Methods: POST
  119. INFO 06-12 06:54:52 [launcher.py:37] Route: /ping, Methods: GET
  120. INFO 06-12 06:54:52 [launcher.py:37] Route: /tokenize, Methods: POST
  121. INFO 06-12 06:54:52 [launcher.py:37] Route: /detokenize, Methods: POST
  122. INFO 06-12 06:54:52 [launcher.py:37] Route: /v1/models, Methods: GET
  123. INFO 06-12 06:54:52 [launcher.py:37] Route: /version, Methods: GET
  124. INFO 06-12 06:54:52 [launcher.py:37] Route: /v1/chat/completions, Methods: POST
  125. INFO 06-12 06:54:52 [launcher.py:37] Route: /v1/completions, Methods: POST
  126. INFO 06-12 06:54:52 [launcher.py:37] Route: /v1/embeddings, Methods: POST
  127. INFO 06-12 06:54:52 [launcher.py:37] Route: /pooling, Methods: POST
  128. INFO 06-12 06:54:52 [launcher.py:37] Route: /classify, Methods: POST
  129. INFO 06-12 06:54:52 [launcher.py:37] Route: /score, Methods: POST
  130. INFO 06-12 06:54:52 [launcher.py:37] Route: /v1/score, Methods: POST
  131. INFO 06-12 06:54:52 [launcher.py:37] Route: /v1/audio/transcriptions, Methods: POST
  132. INFO 06-12 06:54:52 [launcher.py:37] Route: /rerank, Methods: POST
  133. INFO 06-12 06:54:52 [launcher.py:37] Route: /v1/rerank, Methods: POST
  134. INFO 06-12 06:54:52 [launcher.py:37] Route: /v2/rerank, Methods: POST
  135. INFO 06-12 06:54:52 [launcher.py:37] Route: /invocations, Methods: POST
  136. INFO 06-12 06:54:52 [launcher.py:37] Route: /metrics, Methods: GET
  137. INFO: Started server process [1]
  138. INFO: Waiting for application startup.
  139. INFO: Application startup complete.
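
# The KV cache figures in the startup log above can be reproduced with a little
# arithmetic. A sketch in Python, assuming vLLM's default block size of 16 tokens
# and the Qwen2.5-32B attention shape (64 layers, 8 KV heads, head dim 128). Neither
# appears in the log, so check the model's config.json before relying on them.

```python
import math

GIB = 1024 ** 3

# Assumptions, not taken from the log:
NUM_LAYERS = 64      # assumed number of transformer layers
NUM_KV_HEADS = 8     # assumed number of key/value heads (grouped-query attention)
HEAD_DIM = 128       # assumed dimension per attention head
BLOCK_SIZE = 16      # assumed tokens per KV cache block (vLLM default)

# Taken from the log:
KV_DTYPE_BYTES = 1           # --kv-cache-dtype fp8
AVAILABLE_KV_GIB = 38.00     # "Available KV cache memory: 38.00 GiB"
NUM_GPU_BLOCKS = 19455       # "num_gpu_blocks is: 19455"
MAX_MODEL_LEN = 26500        # --max-model-len 26500


def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes):
    """Bytes of KV cache per token: one key and one value vector per layer."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def max_concurrency(num_blocks, max_model_len, block_size):
    """How many full-length requests fit into the cache, counted in blocks."""
    blocks_per_request = math.ceil(max_model_len / block_size)
    return num_blocks / blocks_per_request


per_token = kv_bytes_per_token(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, KV_DTYPE_BYTES)
tokens_from_memory = int(AVAILABLE_KV_GIB * GIB) // per_token
tokens_from_blocks = NUM_GPU_BLOCKS * BLOCK_SIZE
concurrency = max_concurrency(NUM_GPU_BLOCKS, MAX_MODEL_LEN, BLOCK_SIZE)

print(f"KV bytes per token:        {per_token}")
print(f"tokens from 38.00 GiB:     {tokens_from_memory}")
print(f"tokens from logged blocks: {tokens_from_blocks}")
print(f"max concurrency:           {concurrency:.2f}x")
```

# Under these assumptions, 19,455 blocks x 16 tokens gives the logged 311,280 tokens,
# and 19,455 / ceil(26,500 / 16) gives the logged 11.74x. The estimate from the
# rounded 38.00 GiB figure lands one block higher (311,296 tokens).
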
  140. (...)
  141. INFO 06-12 06:27:32 [chat_utils.py:420] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
  142. INFO: 172.17.0.1:40368 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  143. INFO 06-12 06:27:40 [loggers.py:118] Engine 000: Avg prompt throughput: 1002.8 tokens/s, Avg generation throughput: 38.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.4%, Prefix cache hit rate: 0.0%
  144. INFO 06-12 06:27:50 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 59.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.5%, Prefix cache hit rate: 0.0%
  145. INFO: 172.17.0.1:40368 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  146. INFO 06-12 06:28:00 [loggers.py:118] Engine 000: Avg prompt throughput: 1002.8 tokens/s, Avg generation throughput: 59.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.4%, Prefix cache hit rate: 49.9%
  147. INFO 06-12 06:28:10 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 60.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.6%, Prefix cache hit rate: 49.9%
  148. INFO: 172.17.0.1:40368 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  149. INFO: 172.17.0.1:39802 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  150. INFO: 172.17.0.1:39808 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  151. INFO: 172.17.0.1:39824 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  152. INFO: 172.17.0.1:39830 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  153. INFO: 172.17.0.1:39832 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  154. INFO: 172.17.0.1:39802 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  155. INFO: 172.17.0.1:39846 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  156. INFO: 172.17.0.1:39862 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  157. INFO 06-12 06:28:20 [loggers.py:118] Engine 000: Avg prompt throughput: 9025.0 tokens/s, Avg generation throughput: 154.8 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 10.1%, Prefix cache hit rate: 72.7%
  158. INFO: 172.17.0.1:39832 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  159. INFO: 172.17.0.1:39832 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  160. INFO 06-12 06:28:30 [loggers.py:118] Engine 000: Avg prompt throughput: 2005.6 tokens/s, Avg generation throughput: 358.6 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.0%, Prefix cache hit rate: 76.8%
  161. INFO: 172.17.0.1:39824 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  162. INFO 06-12 06:28:40 [loggers.py:118] Engine 000: Avg prompt throughput: 1002.8 tokens/s, Avg generation throughput: 355.6 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.7%, Prefix cache hit rate: 78.5%
  163. INFO: 172.17.0.1:39846 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  164. INFO: 172.17.0.1:39808 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  165. INFO: 172.17.0.1:39802 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  166. INFO: 172.17.0.1:39808 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  167. INFO 06-12 06:28:50 [loggers.py:118] Engine 000: Avg prompt throughput: 4011.3 tokens/s, Avg generation throughput: 334.9 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.7%, Prefix cache hit rate: 83.2%
  168. INFO: 172.17.0.1:39832 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  169. INFO: 172.17.0.1:39830 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  170. INFO 06-12 06:29:00 [loggers.py:118] Engine 000: Avg prompt throughput: 2005.7 tokens/s, Avg generation throughput: 325.9 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 12.0%, Prefix cache hit rate: 84.9%
  171. INFO: 172.17.0.1:39830 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  172. INFO: 172.17.0.1:39824 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  173. INFO 06-12 06:29:10 [loggers.py:118] Engine 000: Avg prompt throughput: 2005.6 tokens/s, Avg generation throughput: 312.2 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.4%, Prefix cache hit rate: 86.3%
  174. INFO: 172.17.0.1:39802 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  175. INFO: 172.17.0.1:39808 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  176. INFO 06-12 06:29:20 [loggers.py:118] Engine 000: Avg prompt throughput: 2005.5 tokens/s, Avg generation throughput: 311.2 tokens/s, Running: 6 reqs, Waiting: 0 reqs, GPU KV cache usage: 11.6%, Prefix cache hit rate: 87.4%
  177. INFO: 172.17.0.1:39862 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  178. INFO: 172.17.0.1:39808 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  179. INFO: 172.17.0.1:39846 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  180. INFO 06-12 06:29:30 [loggers.py:118] Engine 000: Avg prompt throughput: 3008.5 tokens/s, Avg generation throughput: 303.6 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.3%, Prefix cache hit rate: 88.8%
  181. INFO: 172.17.0.1:39830 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  182. INFO: 172.17.0.1:39824 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  183. INFO: 172.17.0.1:39830 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  184. INFO 06-12 06:29:40 [loggers.py:118] Engine 000: Avg prompt throughput: 3008.3 tokens/s, Avg generation throughput: 302.9 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.3%, Prefix cache hit rate: 89.9%
  185. INFO: 172.17.0.1:39832 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  186. INFO 06-12 06:29:50 [loggers.py:118] Engine 000: Avg prompt throughput: 1002.8 tokens/s, Avg generation throughput: 303.5 tokens/s, Running: 5 reqs, Waiting: 0 reqs, GPU KV cache usage: 7.5%, Prefix cache hit rate: 90.2%
  187. INFO: 172.17.0.1:40368 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  188. INFO: 172.17.0.1:39808 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  189. INFO: 172.17.0.1:39862 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  190. INFO: 172.17.0.1:39802 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  191. INFO: 172.17.0.1:40368 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  192. INFO: 172.17.0.1:40368 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  193. INFO 06-12 06:30:00 [loggers.py:118] Engine 000: Avg prompt throughput: 6016.5 tokens/s, Avg generation throughput: 307.7 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.0%, Prefix cache hit rate: 91.8%
  194. INFO: 172.17.0.1:39824 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  195. INFO: 172.17.0.1:39830 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  196. INFO 06-12 06:30:10 [loggers.py:118] Engine 000: Avg prompt throughput: 2005.7 tokens/s, Avg generation throughput: 323.2 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.3%, Prefix cache hit rate: 92.2%
  197. INFO 06-12 06:30:20 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 278.9 tokens/s, Running: 6 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.2%, Prefix cache hit rate: 92.2%
  198. INFO 06-12 06:30:30 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 194.9 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 7.1%, Prefix cache hit rate: 92.2%
  199. INFO 06-12 06:30:40 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 73.3 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.7%, Prefix cache hit rate: 92.2%
  200. INFO 06-12 06:30:50 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 58.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.9%, Prefix cache hit rate: 92.2%
  201. INFO 06-12 06:31:00 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 58.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 4.1%, Prefix cache hit rate: 92.2%
  202. INFO: 172.17.0.1:39830 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  203. INFO: 172.17.0.1:48642 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  204. INFO: 172.17.0.1:48658 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  205. INFO: 172.17.0.1:48662 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  206. INFO 06-12 06:31:10 [loggers.py:118] Engine 000: Avg prompt throughput: 4011.2 tokens/s, Avg generation throughput: 66.4 tokens/s, Running: 4 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.7%, Prefix cache hit rate: 92.9%
  207. INFO: 172.17.0.1:37700 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  208. INFO: 172.17.0.1:37712 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  209. INFO: 172.17.0.1:37718 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  210. INFO: 172.17.0.1:39830 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  211. INFO: 172.17.0.1:37724 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  212. INFO: 172.17.0.1:37700 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  213. INFO: 172.17.0.1:37724 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  214. INFO 06-12 06:31:20 [loggers.py:118] Engine 000: Avg prompt throughput: 7019.7 tokens/s, Avg generation throughput: 334.6 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 7.5%, Prefix cache hit rate: 93.9%
  215. INFO 06-12 06:31:30 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 357.6 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.6%, Prefix cache hit rate: 93.9%
  216. INFO: 172.17.0.1:48658 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  217. INFO: 172.17.0.1:37712 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  218. INFO: 172.17.0.1:37718 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  219. INFO: 172.17.0.1:48642 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  220. INFO: 172.17.0.1:37712 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  221. INFO 06-12 06:31:40 [loggers.py:118] Engine 000: Avg prompt throughput: 5014.1 tokens/s, Avg generation throughput: 341.6 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.3%, Prefix cache hit rate: 94.4%
  222. INFO: 172.17.0.1:37700 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  223. INFO: 172.17.0.1:37724 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  224. INFO: 172.17.0.1:37700 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  225. INFO 06-12 06:31:50 [loggers.py:118] Engine 000: Avg prompt throughput: 3008.5 tokens/s, Avg generation throughput: 333.9 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.6%, Prefix cache hit rate: 94.7%
  226. INFO: 172.17.0.1:48662 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  227. INFO: 172.17.0.1:37718 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  228. INFO 06-12 06:32:00 [loggers.py:118] Engine 000: Avg prompt throughput: 2005.7 tokens/s, Avg generation throughput: 313.6 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 11.9%, Prefix cache hit rate: 94.9%
  229. INFO: 172.17.0.1:48642 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  230. INFO: 172.17.0.1:37718 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  231. INFO: 172.17.0.1:37724 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  232. INFO 06-12 06:32:10 [loggers.py:118] Engine 000: Avg prompt throughput: 3008.4 tokens/s, Avg generation throughput: 313.0 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.9%, Prefix cache hit rate: 95.1%
  233. INFO: 172.17.0.1:37700 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  234. INFO: 172.17.0.1:48658 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  235. INFO 06-12 06:32:20 [loggers.py:118] Engine 000: Avg prompt throughput: 2005.6 tokens/s, Avg generation throughput: 312.3 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 12.2%, Prefix cache hit rate: 95.3%
  236. INFO: 172.17.0.1:37700 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  237. INFO: 172.17.0.1:37712 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  238. INFO 06-12 06:32:30 [loggers.py:118] Engine 000: Avg prompt throughput: 2005.6 tokens/s, Avg generation throughput: 310.6 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.2%, Prefix cache hit rate: 95.4%
  239. INFO: 172.17.0.1:39830 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  240. INFO: 172.17.0.1:37718 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  241. INFO: 172.17.0.1:48642 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  242. INFO: 172.17.0.1:39830 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  243. INFO: 172.17.0.1:48662 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  244. INFO 06-12 06:32:40 [loggers.py:118] Engine 000: Avg prompt throughput: 5014.1 tokens/s, Avg generation throughput: 330.4 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 11.0%, Prefix cache hit rate: 95.7%
  245. INFO: 172.17.0.1:37724 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  246. INFO: 172.17.0.1:48662 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  247. INFO: 172.17.0.1:48658 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  248. INFO 06-12 06:32:50 [loggers.py:118] Engine 000: Avg prompt throughput: 3008.5 tokens/s, Avg generation throughput: 352.2 tokens/s, Running: 7 reqs, Waiting: 0 reqs, GPU KV cache usage: 7.7%, Prefix cache hit rate: 95.9%
  249. INFO: 172.17.0.1:37700 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  250. INFO: 172.17.0.1:37700 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  251. INFO 06-12 06:33:00 [loggers.py:118] Engine 000: Avg prompt throughput: 2005.8 tokens/s, Avg generation throughput: 349.0 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.8%, Prefix cache hit rate: 96.0%
  252. INFO: 172.17.0.1:48642 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  253. INFO: 172.17.0.1:37718 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  254. INFO: 172.17.0.1:37712 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  255. INFO: 172.17.0.1:37724 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  256. INFO 06-12 06:33:10 [loggers.py:118] Engine 000: Avg prompt throughput: 4011.5 tokens/s, Avg generation throughput: 343.1 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 11.3%, Prefix cache hit rate: 96.2%
  257. INFO 06-12 06:33:20 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 271.2 tokens/s, Running: 6 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.5%, Prefix cache hit rate: 96.2%
  258. INFO 06-12 06:33:30 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 222.6 tokens/s, Running: 3 reqs, Waiting: 0 reqs, GPU KV cache usage: 7.9%, Prefix cache hit rate: 96.2%
  259. INFO 06-12 06:33:40 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 90.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.8%, Prefix cache hit rate: 96.2%
  260. INFO: 172.17.0.1:37718 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  261. INFO: 172.17.0.1:34772 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  262. INFO: 172.17.0.1:34786 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  263. INFO: 172.17.0.1:34796 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  264. INFO: 172.17.0.1:34802 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  265. INFO: 172.17.0.1:34786 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  266. INFO: 172.17.0.1:34810 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  267. INFO: 172.17.0.1:34818 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  268. INFO: 172.17.0.1:34834 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  269. INFO 06-12 06:33:50 [loggers.py:118] Engine 000: Avg prompt throughput: 9025.0 tokens/s, Avg generation throughput: 129.5 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 10.0%, Prefix cache hit rate: 96.5%
  270. INFO: 172.17.0.1:34810 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  271. INFO 06-12 06:34:00 [loggers.py:118] Engine 000: Avg prompt throughput: 1002.8 tokens/s, Avg generation throughput: 358.1 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 7.9%, Prefix cache hit rate: 96.6%
  272. INFO: 172.17.0.1:34802 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  273. INFO 06-12 06:34:10 [loggers.py:118] Engine 000: Avg prompt throughput: 1002.9 tokens/s, Avg generation throughput: 354.5 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 11.9%, Prefix cache hit rate: 96.6%
  274. INFO: 172.17.0.1:34796 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  275. INFO: 172.17.0.1:34802 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  276. INFO: 172.17.0.1:37718 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  277. INFO: 172.17.0.1:34834 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  278. INFO: 172.17.0.1:34772 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  279. INFO: 172.17.0.1:34818 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  280. INFO: 172.17.0.1:34834 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  281. INFO 06-12 06:34:20 [loggers.py:118] Engine 000: Avg prompt throughput: 7020.1 tokens/s, Avg generation throughput: 335.6 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 7.7%, Prefix cache hit rate: 96.8%
  282. INFO: 172.17.0.1:34786 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  283. INFO: 172.17.0.1:34786 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  284. INFO: 172.17.0.1:34810 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  285. INFO 06-12 06:34:30 [loggers.py:118] Engine 000: Avg prompt throughput: 3008.5 tokens/s, Avg generation throughput: 330.1 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 7.7%, Prefix cache hit rate: 96.9%
  286. INFO: 172.17.0.1:34796 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  287. INFO: 172.17.0.1:34802 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  288. INFO: 172.17.0.1:34772 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  289. INFO: 172.17.0.1:34818 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  290. INFO 06-12 06:34:40 [loggers.py:118] Engine 000: Avg prompt throughput: 4011.5 tokens/s, Avg generation throughput: 350.8 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 10.8%, Prefix cache hit rate: 97.0%
  291. INFO: 172.17.0.1:34802 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  292. INFO: 172.17.0.1:37718 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  293. INFO 06-12 06:34:50 [loggers.py:118] Engine 000: Avg prompt throughput: 2005.6 tokens/s, Avg generation throughput: 350.8 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 11.4%, Prefix cache hit rate: 97.1%
  294. INFO: 172.17.0.1:37718 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  295. INFO: 172.17.0.1:34786 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  296. INFO: 172.17.0.1:34810 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  297. INFO 06-12 06:35:00 [loggers.py:118] Engine 000: Avg prompt throughput: 3008.6 tokens/s, Avg generation throughput: 332.1 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.4%, Prefix cache hit rate: 97.2%
  298. INFO: 172.17.0.1:34818 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  299. INFO: 172.17.0.1:34834 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  300. INFO: 172.17.0.1:34772 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  301. INFO: 172.17.0.1:34818 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  302. INFO 06-12 06:35:10 [loggers.py:118] Engine 000: Avg prompt throughput: 4011.4 tokens/s, Avg generation throughput: 333.2 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.2%, Prefix cache hit rate: 97.3%
  303. INFO 06-12 06:35:20 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 350.6 tokens/s, Running: 7 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.9%, Prefix cache hit rate: 97.3%
  304. INFO: 172.17.0.1:37718 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  305. INFO: 172.17.0.1:34786 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  306. INFO: 172.17.0.1:34802 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  307. INFO: 172.17.0.1:37718 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  308. INFO: 172.17.0.1:34834 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  309. INFO 06-12 06:35:30 [loggers.py:118] Engine 000: Avg prompt throughput: 5014.6 tokens/s, Avg generation throughput: 314.2 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 11.8%, Prefix cache hit rate: 97.4%
  310. INFO: 172.17.0.1:34772 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  311. INFO: 172.17.0.1:34834 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  312. INFO 06-12 06:35:40 [loggers.py:118] Engine 000: Avg prompt throughput: 2005.7 tokens/s, Avg generation throughput: 314.2 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.2%, Prefix cache hit rate: 97.4%
  313. INFO: 172.17.0.1:34796 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  314. INFO: 172.17.0.1:34810 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  315. INFO 06-12 06:35:50 [loggers.py:118] Engine 000: Avg prompt throughput: 2005.7 tokens/s, Avg generation throughput: 317.6 tokens/s, Running: 5 reqs, Waiting: 0 reqs, GPU KV cache usage: 7.9%, Prefix cache hit rate: 97.5%
  316. INFO 06-12 06:36:00 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 223.8 tokens/s, Running: 4 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.0%, Prefix cache hit rate: 97.5%
  317. INFO 06-12 06:36:10 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 125.3 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 4.3%, Prefix cache hit rate: 97.5%
  318. INFO: 172.17.0.1:34796 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  319. INFO: 172.17.0.1:37718 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  320. INFO: 172.17.0.1:50684 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  321. INFO: 172.17.0.1:50692 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  322. INFO: 172.17.0.1:50708 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  323. INFO: 172.17.0.1:50720 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  324. INFO: 172.17.0.1:50732 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  325. INFO: 172.17.0.1:50742 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  326. INFO: 172.17.0.1:50692 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  327. INFO 06-12 06:36:20 [loggers.py:118] Engine 000: Avg prompt throughput: 8022.7 tokens/s, Avg generation throughput: 143.6 tokens/s, Running: 7 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.9%, Prefix cache hit rate: 97.6%
  328. INFO: 172.17.0.1:50742 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  329. INFO 06-12 06:36:30 [loggers.py:118] Engine 000: Avg prompt throughput: 2005.7 tokens/s, Avg generation throughput: 358.7 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 7.8%, Prefix cache hit rate: 97.6%
  330. INFO 06-12 06:36:40 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 356.0 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.0%, Prefix cache hit rate: 97.6%
  331. INFO: 172.17.0.1:50708 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  332. INFO: 172.17.0.1:34796 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  333. INFO: 172.17.0.1:50720 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  334. INFO: 172.17.0.1:37718 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  335. INFO: 172.17.0.1:34796 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  336. INFO: 172.17.0.1:50692 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  337. INFO: 172.17.0.1:50742 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  338. INFO 06-12 06:36:50 [loggers.py:118] Engine 000: Avg prompt throughput: 7019.5 tokens/s, Avg generation throughput: 346.4 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 11.1%, Prefix cache hit rate: 97.7%
  339. INFO: 172.17.0.1:50692 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  340. INFO: 172.17.0.1:50684 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  341. INFO 06-12 06:37:00 [loggers.py:118] Engine 000: Avg prompt throughput: 2005.6 tokens/s, Avg generation throughput: 325.8 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.3%, Prefix cache hit rate: 97.8%
  342. INFO: 172.17.0.1:50720 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  343. INFO: 172.17.0.1:37718 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  344. INFO 06-12 06:37:10 [loggers.py:118] Engine 000: Avg prompt throughput: 2005.8 tokens/s, Avg generation throughput: 313.9 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 11.8%, Prefix cache hit rate: 97.8%
  345. INFO: 172.17.0.1:50720 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  346. INFO: 172.17.0.1:50742 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  347. INFO: 172.17.0.1:50692 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  348. INFO 06-12 06:37:20 [loggers.py:118] Engine 000: Avg prompt throughput: 3008.4 tokens/s, Avg generation throughput: 313.0 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 12.0%, Prefix cache hit rate: 97.8%
  349. INFO: 172.17.0.1:50692 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  350. INFO: 172.17.0.1:50708 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  351. INFO 06-12 06:37:30 [loggers.py:118] Engine 000: Avg prompt throughput: 2005.8 tokens/s, Avg generation throughput: 312.1 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.2%, Prefix cache hit rate: 97.9%
  352. INFO: 172.17.0.1:50732 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  353. INFO: 172.17.0.1:37718 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  354. INFO 06-12 06:37:40 [loggers.py:118] Engine 000: Avg prompt throughput: 2005.7 tokens/s, Avg generation throughput: 314.1 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 12.0%, Prefix cache hit rate: 97.9%
  355. INFO: 172.17.0.1:50720 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  356. INFO: 172.17.0.1:37718 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  357. INFO: 172.17.0.1:34796 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  358. INFO 06-12 06:37:50 [loggers.py:118] Engine 000: Avg prompt throughput: 3008.2 tokens/s, Avg generation throughput: 314.7 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.6%, Prefix cache hit rate: 97.9%
  359. INFO: 172.17.0.1:50684 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  360. INFO: 172.17.0.1:50692 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  361. INFO: 172.17.0.1:50684 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  362. INFO: 172.17.0.1:50708 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  363. INFO 06-12 06:38:00 [loggers.py:118] Engine 000: Avg prompt throughput: 4011.4 tokens/s, Avg generation throughput: 326.7 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.1%, Prefix cache hit rate: 98.0%
  364. INFO 06-12 06:38:10 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 316.0 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.1%, Prefix cache hit rate: 98.0%
  365. INFO: 172.17.0.1:50742 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  366. INFO: 172.17.0.1:50732 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  367. INFO: 172.17.0.1:37718 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  368. INFO 06-12 06:38:20 [loggers.py:118] Engine 000: Avg prompt throughput: 3008.6 tokens/s, Avg generation throughput: 313.2 tokens/s, Running: 5 reqs, Waiting: 0 reqs, GPU KV cache usage: 7.6%, Prefix cache hit rate: 98.0%
  369. INFO 06-12 06:38:30 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 215.4 tokens/s, Running: 4 reqs, Waiting: 0 reqs, GPU KV cache usage: 7.9%, Prefix cache hit rate: 98.0%
  370. INFO 06-12 06:38:40 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 185.7 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 4.6%, Prefix cache hit rate: 98.0%
  371. INFO: 172.17.0.1:50708 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  372. INFO: 172.17.0.1:34796 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  373. INFO: 172.17.0.1:37718 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  374. INFO: 172.17.0.1:53600 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  375. INFO: 172.17.0.1:53604 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  376. INFO: 172.17.0.1:53608 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  377. INFO: 172.17.0.1:53612 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  378. INFO: 172.17.0.1:53626 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  379. INFO: 172.17.0.1:34796 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  380. INFO: 172.17.0.1:53608 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  381. INFO 06-12 06:38:50 [loggers.py:118] Engine 000: Avg prompt throughput: 10027.6 tokens/s, Avg generation throughput: 263.6 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 10.4%, Prefix cache hit rate: 98.1%
  382. INFO: 172.17.0.1:53608 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  383. INFO 06-12 06:39:00 [loggers.py:118] Engine 000: Avg prompt throughput: 1002.8 tokens/s, Avg generation throughput: 357.9 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.3%, Prefix cache hit rate: 98.1%
  384. INFO: 172.17.0.1:53612 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  385. INFO: 172.17.0.1:53600 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  386. INFO 06-12 06:39:10 [loggers.py:118] Engine 000: Avg prompt throughput: 2005.5 tokens/s, Avg generation throughput: 353.1 tokens/s, Running: 7 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.3%, Prefix cache hit rate: 98.2%
  387. INFO: 172.17.0.1:37718 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  388. INFO: 172.17.0.1:53626 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  389. INFO: 172.17.0.1:37718 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  390. INFO: 172.17.0.1:53608 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  391. INFO 06-12 06:39:20 [loggers.py:118] Engine 000: Avg prompt throughput: 4011.5 tokens/s, Avg generation throughput: 348.0 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.6%, Prefix cache hit rate: 98.2%
  392. INFO: 172.17.0.1:50708 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  393. INFO 06-12 06:39:30 [loggers.py:118] Engine 000: Avg prompt throughput: 1002.8 tokens/s, Avg generation throughput: 333.9 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 12.3%, Prefix cache hit rate: 98.2%
  394. INFO: 172.17.0.1:50708 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  395. INFO: 172.17.0.1:53612 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  396. INFO 06-12 06:39:40 [loggers.py:118] Engine 000: Avg prompt throughput: 2005.5 tokens/s, Avg generation throughput: 317.2 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.6%, Prefix cache hit rate: 98.2%
  397. INFO: 172.17.0.1:37718 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  398. INFO: 172.17.0.1:53600 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  399. INFO 06-12 06:39:50 [loggers.py:118] Engine 000: Avg prompt throughput: 2005.7 tokens/s, Avg generation throughput: 314.7 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 13.0%, Prefix cache hit rate: 98.2%
  400. INFO: 172.17.0.1:53626 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  401. INFO: 172.17.0.1:53600 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  402. INFO 06-12 06:40:00 [loggers.py:118] Engine 000: Avg prompt throughput: 2005.5 tokens/s, Avg generation throughput: 313.3 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 10.2%, Prefix cache hit rate: 98.3%
  403. INFO: 172.17.0.1:50708 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  404. INFO: 172.17.0.1:53608 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  405. INFO: 172.17.0.1:53612 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  406. INFO: 172.17.0.1:53608 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  407. INFO: 172.17.0.1:53604 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  408. INFO: 172.17.0.1:34796 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  409. INFO 06-12 06:40:10 [loggers.py:118] Engine 000: Avg prompt throughput: 6017.1 tokens/s, Avg generation throughput: 317.1 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 10.8%, Prefix cache hit rate: 98.3%
  410. INFO: 172.17.0.1:34796 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  411. INFO 06-12 06:40:20 [loggers.py:118] Engine 000: Avg prompt throughput: 1002.8 tokens/s, Avg generation throughput: 351.4 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.6%, Prefix cache hit rate: 98.3%
  412. INFO: 172.17.0.1:53600 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  413. INFO: 172.17.0.1:53626 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  414. INFO 06-12 06:40:30 [loggers.py:118] Engine 000: Avg prompt throughput: 2005.6 tokens/s, Avg generation throughput: 329.6 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.9%, Prefix cache hit rate: 98.3%
  415. INFO: 172.17.0.1:53612 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  416. INFO: 172.17.0.1:53608 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  417. INFO: 172.17.0.1:53612 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  418. INFO 06-12 06:40:40 [loggers.py:118] Engine 000: Avg prompt throughput: 3008.7 tokens/s, Avg generation throughput: 315.2 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.1%, Prefix cache hit rate: 98.4%
  419. INFO: 172.17.0.1:34796 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  420. INFO 06-12 06:40:50 [loggers.py:118] Engine 000: Avg prompt throughput: 1002.9 tokens/s, Avg generation throughput: 309.2 tokens/s, Running: 7 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.1%, Prefix cache hit rate: 98.4%
  421. INFO 06-12 06:41:00 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 264.9 tokens/s, Running: 6 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.4%, Prefix cache hit rate: 98.4%
  422. INFO 06-12 06:41:10 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 97.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.7%, Prefix cache hit rate: 98.4%
  423. INFO 06-12 06:41:20 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 49.6 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 98.4%
  424. INFO 06-12 06:41:30 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 98.4%
  425. ^CINFO 06-12 06:48:46 [launcher.py:80] Shutting down FastAPI HTTP server.
  426.  
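# summarize the server log (sketch):
# A minimal sketch, assuming the periodic stats lines keep exactly the format shown in the log above
# (vLLM v0.9.1, loggers.py). The helper names parse_stats and summarize are hypothetical, and the
# "Running >= 8" filter is only an assumption about which intervals count as fully loaded.

```python
import re
from statistics import mean

# Matches the periodic engine stats line as it appears in the log above.
STATS_RE = re.compile(
    r"Avg prompt throughput: (?P<prompt>[\d.]+) tokens/s, "
    r"Avg generation throughput: (?P<gen>[\d.]+) tokens/s, "
    r"Running: (?P<running>\d+) reqs, Waiting: (?P<waiting>\d+) reqs, "
    r"GPU KV cache usage: (?P<kv>[\d.]+)%, "
    r"Prefix cache hit rate: (?P<hit>[\d.]+)%"
)


def parse_stats(lines):
    """Return one dict per stats line; HTTP access lines are skipped."""
    rows = []
    for line in lines:
        m = STATS_RE.search(line)
        if m:
            rows.append({
                "prompt_tps": float(m["prompt"]),
                "gen_tps": float(m["gen"]),
                "running": int(m["running"]),
                "waiting": int(m["waiting"]),
                "kv_cache_pct": float(m["kv"]),
                "prefix_hit_pct": float(m["hit"]),
            })
    return rows


def summarize(rows, min_running=8):
    """Mean generation throughput over intervals with >= min_running requests."""
    busy = [r["gen_tps"] for r in rows if r["running"] >= min_running]
    return {
        "intervals": len(busy),
        "mean_gen_tps": mean(busy) if busy else None,
    }


# Three stats lines and one access line copied from the log above.
sample = [
    "INFO 06-12 06:36:10 [loggers.py:118] Engine 000: "
    "Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 125.3 tokens/s, "
    "Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 4.3%, Prefix cache hit rate: 97.5%",
    'INFO: 172.17.0.1:50708 - "POST /v1/chat/completions HTTP/1.1" 200 OK',
    "INFO 06-12 06:36:30 [loggers.py:118] Engine 000: "
    "Avg prompt throughput: 2005.7 tokens/s, Avg generation throughput: 358.7 tokens/s, "
    "Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 7.8%, Prefix cache hit rate: 97.6%",
    "INFO 06-12 06:36:40 [loggers.py:118] Engine 000: "
    "Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 356.0 tokens/s, "
    "Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.0%, Prefix cache hit rate: 97.6%",
]

rows = parse_stats(sample)
summary = summarize(rows)
print(summary)
```

# To run it on the full log, save the server output to a file and pass the open file object
# to parse_stats instead of the sample list.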