Advertisement
Not a member of Pastebin yet?
Sign Up,
it unlocks many cool features!
- # Benchmarked VM: Microsoft Azure Standard NC40ads H100 v5 (40 vcpus, 320 GiB memory)
- # Benchmarked server: vLLM
- # you can compare these vLLM numbers with the Nvidia TensorRT-LLM test here: https://pastebin.com/Kc4Cbtfa
- # run vLLM server:
- sudo docker run --runtime nvidia --gpus all \
- -v ~/.cache/huggingface:/root/.cache/huggingface \
- -p 8000:8000 \
- vllm/vllm-openai:v0.9.1 \
- --max-model-len 26500 \
- --model textgeflecht/Qwen2.5-Coder-32B-Instruct-FP8-dynamic \
- --disable-log-requests \
- --max_num_batched_tokens 60000 \
- --kv-cache-dtype fp8 \
- --enable-chunked-prefill \
- --gpu_memory_utilization 0.9
- # run benchmark:
- sudo docker run --network host -e HF_TOKEN=$HF_TOKEN \
- -v ~/inference-benchmarker-results:/opt/inference-benchmarker/results \
- inference_benchmarker inference-benchmarker --no-console \
- --url http://localhost:8000/v1 \
- --max-vus 8 --duration 120s --warmup 30s --benchmark-kind rate \
- --rates 1.0 --rates 2.0 --rates 3.0 --rates 4.0 --rates 8.0 \
- --prompt-options "min_tokens=10,max_tokens=16000,num_tokens=10000,variance=6000" \
- --decode-options "min_tokens=2000,max_tokens=10000,num_tokens=6000,variance=4000" \
- --model-name "textgeflecht/Qwen2.5-Coder-32B-Instruct-FP8-dynamic" \
- --tokenizer-name "textgeflecht/Qwen2.5-Coder-32B-Instruct-FP8-dynamic"
- # benchmark results:
- ┌─────────────────┬──────────────────────────────────────────────────────────────────────┐
- │ Parameter │ Value │
- ├─────────────────┼──────────────────────────────────────────────────────────────────────┤
- │ Max VUs │ 8 │
- │ Duration │ 120 │
- │ Warmup Duration │ 30 │
- │ Benchmark Kind │ Rate │
- │ Rates │ [1.0, 2.0, 3.0, 4.0, 8.0] │
- │ Num Rates │ 10 │
- │ Prompt Options │ num_tokens=Some(10000),min_tokens=10,max_tokens=16000,variance=6000 │
- │ Decode Options │ num_tokens=Some(6000),min_tokens=2000,max_tokens=10000,variance=4000 │
- │ Tokenizer │ textgeflecht/Qwen2.5-Coder-32B-Instruct-FP8-dynamic │
- │ Extra Metadata │ N/A │
- └─────────────────┴──────────────────────────────────────────────────────────────────────┘
- ┌────────────────────┬────────────┬───────────────────┬────────────┬───────────┬───────────────────┬────────────┬─────────────────────┬─────────────────────────────┬──────────────────────────────┐
- │ Benchmark │ QPS │ E2E Latency (avg) │ TTFT (avg) │ ITL (avg) │ Throughput │ Error Rate │ Successful Requests │ Prompt tokens per req (avg) │ Decoded tokens per req (avg) │
- ├────────────────────┼────────────┼───────────────────┼────────────┼───────────┼───────────────────┼────────────┼─────────────────────┼─────────────────────────────┼──────────────────────────────┤
- │ warmup │ 0.05 req/s │ 19.82 sec │ 1305.95 ms │ 16.69 ms │ 55.96 tokens/sec │ 0.00% │ 2/2 │ 10000.00 │ 1109.00 │
- │ [email protected]/s │ 0.25 req/s │ 26.05 sec │ 168.47 ms │ 23.75 ms │ 274.15 tokens/sec │ 0.00% │ 29/29 │ 10000.00 │ 1078.76 │
- │ [email protected]/s │ 0.29 req/s │ 24.51 sec │ 57.47 ms │ 23.52 ms │ 295.21 tokens/sec │ 0.00% │ 34/34 │ 10000.00 │ 1028.29 │
- │ [email protected]/s │ 0.30 req/s │ 22.88 sec │ 56.71 ms │ 23.13 ms │ 296.63 tokens/sec │ 0.00% │ 35/35 │ 10000.00 │ 983.80 │
- │ [email protected]/s │ 0.25 req/s │ 27.24 sec │ 59.15 ms │ 23.99 ms │ 283.22 tokens/sec │ 0.00% │ 30/30 │ 10000.00 │ 1123.00 │
- │ [email protected]/s │ 0.25 req/s │ 26.14 sec │ 55.66 ms │ 23.68 ms │ 269.53 tokens/sec │ 0.00% │ 29/29 │ 10000.00 │ 1097.28 │
- └────────────────────┴────────────┴───────────────────┴────────────┴───────────┴───────────────────┴────────────┴─────────────────────┴─────────────────────────────┴──────────────────────────────┘
- # vLLM server output lines:
- azureuser@qwentest:~/mymodel$ sudo docker run --runtime nvidia --gpus all -v ~/.cache/huggingface:/root/.cache/huggingface -p 8000:8000 vllm/vllm-openai:v0.9.1 --max-model-len 26500 --model textgeflecht/Qwen2.5-Coder-32B-Instruct-FP8-dynamic --disable-log-requests --max_num_batched_tokens 60000 --kv-cache-dtype fp8 --enable-chunked-prefill --gpu_memory_utilization 0.9
- INFO 06-12 06:51:55 [__init__.py:244] Automatically detected platform cuda.
- INFO 06-12 06:51:59 [api_server.py:1287] vLLM API server version 0.9.1
- INFO 06-12 06:51:59 [cli_args.py:309] non-default args: {'model': 'textgeflecht/Qwen2.5-Coder-32B-Instruct-FP8-dynamic', 'max_model_len': 26500, 'kv_cache_dtype': 'fp8', 'max_n um_batched_tokens': 60000, 'enable_chunked_prefill': True, 'disable_log_requests': True}
- INFO 06-12 06:52:09 [config.py:823] This model supports multiple tasks: {'classify', 'embed', 'reward', 'score', 'generate'}. Defaulting to 'generate'.
- INFO 06-12 06:52:15 [config.py:1559] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy dro p without a proper scaling factor
- INFO 06-12 06:52:15 [config.py:2195] Chunked prefill is enabled with max_num_batched_tokens=60000.
- WARNING 06-12 06:52:17 [env_override.py:17] NCCL_CUMEM_ENABLE is set to 0, skipping override. This may increase memory overhead with cudagraph+allreduce: https://github.com/NVI DIA/nccl/issues/1234
- INFO 06-12 06:52:19 [__init__.py:244] Automatically detected platform cuda.
- INFO 06-12 06:52:21 [core.py:455] Waiting for init message from front-end.
- INFO 06-12 06:52:21 [core.py:70] Initializing a V1 LLM engine (v0.9.1) with config: model='textgeflecht/Qwen2.5-Coder-32B-Instruct-FP8-dynamic', speculative_config=None, tokeni zer='textgeflecht/Qwen2.5-Coder-32B-Instruct-FP8-dynamic', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, tr ust_remote_code=False, dtype=torch.bfloat16, max_seq_len=26500, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom _all_reduce=False, quantization=compressed-tensors, enforce_eager=False, kv_cache_dtype=fp8, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallbac k=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None , otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=textgeflecht/Qwen2.5-Coder-32B-Instruct-FP8-dynamic, num_scheduler_steps=1, multi_step_str eam_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":""," cache_dir":"","backend":"","custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"inducto r_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480, 472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128, 120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":512,"local_cache_dir":null}
- WARNING 06-12 06:52:22 [utils.py:2737] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_w orker.Worker object at 0x7f1ea5c1ba70>
- INFO 06-12 06:52:22 [parallel_state.py:1065] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
- INFO 06-12 06:52:22 [topk_topp_sampler.py:49] Using FlashInfer for top-p & top-k sampling.
- INFO 06-12 06:52:22 [gpu_model_runner.py:1595] Starting to load model textgeflecht/Qwen2.5-Coder-32B-Instruct-FP8-dynamic...
- INFO 06-12 06:52:23 [gpu_model_runner.py:1600] Loading model from scratch...
- INFO 06-12 06:52:23 [cuda.py:252] Using Flash Attention backend on V1 engine.
- INFO 06-12 06:52:23 [weight_utils.py:292] Using model weights format ['*.safetensors']
- Loading safetensors checkpoint shards: 0% Completed | 0/7 [00:00<?, ?it/s]
- Loading safetensors checkpoint shards: 14% Completed | 1/7 [00:00<00:02, 2.13it/s]
- Loading safetensors checkpoint shards: 29% Completed | 2/7 [00:01<00:02, 1.88it/s]
- Loading safetensors checkpoint shards: 43% Completed | 3/7 [00:01<00:02, 1.81it/s]
- Loading safetensors checkpoint shards: 57% Completed | 4/7 [00:02<00:01, 1.78it/s]
- Loading safetensors checkpoint shards: 71% Completed | 5/7 [00:02<00:01, 1.74it/s]
- Loading safetensors checkpoint shards: 86% Completed | 6/7 [00:03<00:00, 1.74it/s]
- Loading safetensors checkpoint shards: 100% Completed | 7/7 [00:03<00:00, 1.74it/s]
- Loading safetensors checkpoint shards: 100% Completed | 7/7 [00:03<00:00, 1.78it/s]
- INFO 06-12 06:52:27 [default_loader.py:272] Loading weights took 4.06 seconds
- WARNING 06-12 06:52:27 [kv_cache.py:86] Checkpoint does not provide a q scaling factor. Setting it to k_scale. This only matters for the flash-attn backend.
- WARNING 06-12 06:52:27 [kv_cache.py:99] Using KV cache scaling factor 1.0 for fp8_e4m3. This may cause accuracy issues. Please make sure k/v_scale scaling factors are available in the fp8 checkpoint.
- WARNING 06-12 06:52:27 [kv_cache.py:130] Using uncalibrated q_scale 1.0 and/or prob_scale 1.0 with fp8 attention. This may cause accuracy issues. Please make sure q/prob scaling factors are available in the fp8 checkpoint.
- INFO 06-12 06:52:28 [gpu_model_runner.py:1624] Model loading took 32.1798 GiB and 4.793205 seconds
- INFO 06-12 06:52:40 [backends.py:462] Using cache directory: /root/.cache/vllm/torch_compile_cache/8b85eac542/rank_0_0 for vLLM's torch.compile
- INFO 06-12 06:52:40 [backends.py:472] Dynamo bytecode transform time: 12.47 s
- INFO 06-12 06:52:43 [backends.py:161] Cache the graph of shape None for later use
- INFO 06-12 06:53:35 [backends.py:173] Compiling a graph for general shape takes 53.81 s
- INFO 06-12 06:54:16 [monitor.py:34] torch.compile takes 66.28 s in total
- INFO 06-12 06:54:17 [gpu_worker.py:227] Available KV cache memory: 38.00 GiB
- INFO 06-12 06:54:17 [kv_cache_utils.py:715] GPU KV cache size: 311,280 tokens
- INFO 06-12 06:54:17 [kv_cache_utils.py:719] Maximum concurrency for 26,500 tokens per request: 11.74x
- INFO 06-12 06:54:51 [gpu_model_runner.py:2048] Graph capturing finished in 34 secs, took 1.23 GiB
- INFO 06-12 06:54:51 [core.py:171] init engine (profile, create kv cache, warmup model) took 143.58 seconds
- INFO 06-12 06:54:52 [loggers.py:137] Engine 000: vllm cache_config_info with initialization after num_gpu_blocks is: 19455
- WARNING 06-12 06:54:52 [config.py:1363] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
- INFO 06-12 06:54:52 [serving_chat.py:118] Using default chat sampling params from model: {'repetition_penalty': 1.05, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
- INFO 06-12 06:54:52 [serving_completion.py:66] Using default completion sampling params from model: {'repetition_penalty': 1.05, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
- INFO 06-12 06:54:52 [api_server.py:1349] Starting vLLM API server 0 on http://0.0.0.0:8000
- INFO 06-12 06:54:52 [launcher.py:29] Available routes are:
- INFO 06-12 06:54:52 [launcher.py:37] Route: /openapi.json, Methods: HEAD, GET
- INFO 06-12 06:54:52 [launcher.py:37] Route: /docs, Methods: HEAD, GET
- INFO 06-12 06:54:52 [launcher.py:37] Route: /docs/oauth2-redirect, Methods: HEAD, GET
- INFO 06-12 06:54:52 [launcher.py:37] Route: /redoc, Methods: HEAD, GET
- INFO 06-12 06:54:52 [launcher.py:37] Route: /health, Methods: GET
- INFO 06-12 06:54:52 [launcher.py:37] Route: /load, Methods: GET
- INFO 06-12 06:54:52 [launcher.py:37] Route: /ping, Methods: POST
- INFO 06-12 06:54:52 [launcher.py:37] Route: /ping, Methods: GET
- INFO 06-12 06:54:52 [launcher.py:37] Route: /tokenize, Methods: POST
- INFO 06-12 06:54:52 [launcher.py:37] Route: /detokenize, Methods: POST
- INFO 06-12 06:54:52 [launcher.py:37] Route: /v1/models, Methods: GET
- INFO 06-12 06:54:52 [launcher.py:37] Route: /version, Methods: GET
- INFO 06-12 06:54:52 [launcher.py:37] Route: /v1/chat/completions, Methods: POST
- INFO 06-12 06:54:52 [launcher.py:37] Route: /v1/completions, Methods: POST
- INFO 06-12 06:54:52 [launcher.py:37] Route: /v1/embeddings, Methods: POST
- INFO 06-12 06:54:52 [launcher.py:37] Route: /pooling, Methods: POST
- INFO 06-12 06:54:52 [launcher.py:37] Route: /classify, Methods: POST
- INFO 06-12 06:54:52 [launcher.py:37] Route: /score, Methods: POST
- INFO 06-12 06:54:52 [launcher.py:37] Route: /v1/score, Methods: POST
- INFO 06-12 06:54:52 [launcher.py:37] Route: /v1/audio/transcriptions, Methods: POST
- INFO 06-12 06:54:52 [launcher.py:37] Route: /rerank, Methods: POST
- INFO 06-12 06:54:52 [launcher.py:37] Route: /v1/rerank, Methods: POST
- INFO 06-12 06:54:52 [launcher.py:37] Route: /v2/rerank, Methods: POST
- INFO 06-12 06:54:52 [launcher.py:37] Route: /invocations, Methods: POST
- INFO 06-12 06:54:52 [launcher.py:37] Route: /metrics, Methods: GET
- INFO: Started server process [1]
- INFO: Waiting for application startup.
- INFO: Application startup complete.
- (...)
- INFO 06-12 06:27:32 [chat_utils.py:420] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
- INFO: 172.17.0.1:40368 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO 06-12 06:27:40 [loggers.py:118] Engine 000: Avg prompt throughput: 1002.8 tokens/s, Avg generation throughput: 38.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.4%, Prefix cache hit rate: 0.0%
- INFO 06-12 06:27:50 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 59.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.5%, Prefix cache hit rate: 0.0%
- INFO: 172.17.0.1:40368 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO 06-12 06:28:00 [loggers.py:118] Engine 000: Avg prompt throughput: 1002.8 tokens/s, Avg generation throughput: 59.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.4%, Prefix cache hit rate: 49.9%
- INFO 06-12 06:28:10 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 60.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.6%, Prefix cache hit rate: 49.9%
- INFO: 172.17.0.1:40368 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:39802 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:39808 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:39824 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:39830 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:39832 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:39802 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:39846 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:39862 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO 06-12 06:28:20 [loggers.py:118] Engine 000: Avg prompt throughput: 9025.0 tokens/s, Avg generation throughput: 154.8 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 10.1%, Prefix cache hit rate: 72.7%
- INFO: 172.17.0.1:39832 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:39832 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO 06-12 06:28:30 [loggers.py:118] Engine 000: Avg prompt throughput: 2005.6 tokens/s, Avg generation throughput: 358.6 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.0%, Prefix cache hit rate: 76.8%
- INFO: 172.17.0.1:39824 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO 06-12 06:28:40 [loggers.py:118] Engine 000: Avg prompt throughput: 1002.8 tokens/s, Avg generation throughput: 355.6 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.7%, Prefix cache hit rate: 78.5%
- INFO: 172.17.0.1:39846 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:39808 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:39802 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:39808 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO 06-12 06:28:50 [loggers.py:118] Engine 000: Avg prompt throughput: 4011.3 tokens/s, Avg generation throughput: 334.9 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.7%, Prefix cache hit rate: 83.2%
- INFO: 172.17.0.1:39832 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:39830 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO 06-12 06:29:00 [loggers.py:118] Engine 000: Avg prompt throughput: 2005.7 tokens/s, Avg generation throughput: 325.9 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 12.0%, Prefix cache hit rate: 84.9%
- INFO: 172.17.0.1:39830 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:39824 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO 06-12 06:29:10 [loggers.py:118] Engine 000: Avg prompt throughput: 2005.6 tokens/s, Avg generation throughput: 312.2 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.4%, Prefix cache hit rate: 86.3%
- INFO: 172.17.0.1:39802 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:39808 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO 06-12 06:29:20 [loggers.py:118] Engine 000: Avg prompt throughput: 2005.5 tokens/s, Avg generation throughput: 311.2 tokens/s, Running: 6 reqs, Waiting: 0 reqs, GPU KV cache usage: 11.6%, Prefix cache hit rate: 87.4%
- INFO: 172.17.0.1:39862 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:39808 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:39846 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO 06-12 06:29:30 [loggers.py:118] Engine 000: Avg prompt throughput: 3008.5 tokens/s, Avg generation throughput: 303.6 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.3%, Prefix cache hit rate: 88.8%
- INFO: 172.17.0.1:39830 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:39824 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:39830 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO 06-12 06:29:40 [loggers.py:118] Engine 000: Avg prompt throughput: 3008.3 tokens/s, Avg generation throughput: 302.9 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.3%, Prefix cache hit rate: 89.9%
- INFO: 172.17.0.1:39832 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO 06-12 06:29:50 [loggers.py:118] Engine 000: Avg prompt throughput: 1002.8 tokens/s, Avg generation throughput: 303.5 tokens/s, Running: 5 reqs, Waiting: 0 reqs, GPU KV cache usage: 7.5%, Prefix cache hit rate: 90.2%
- INFO: 172.17.0.1:40368 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:39808 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:39862 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:39802 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:40368 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:40368 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO 06-12 06:30:00 [loggers.py:118] Engine 000: Avg prompt throughput: 6016.5 tokens/s, Avg generation throughput: 307.7 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.0%, Prefix cache hit rate: 91.8%
- INFO: 172.17.0.1:39824 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:39830 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO 06-12 06:30:10 [loggers.py:118] Engine 000: Avg prompt throughput: 2005.7 tokens/s, Avg generation throughput: 323.2 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.3%, Prefix cache hit rate: 92.2%
- INFO 06-12 06:30:20 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 278.9 tokens/s, Running: 6 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.2%, Prefix cache hit rate: 92.2%
- INFO 06-12 06:30:30 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 194.9 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 7.1%, Prefix cache hit rate: 92.2%
- INFO 06-12 06:30:40 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 73.3 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.7%, Prefix cache hit rate: 92.2%
- INFO 06-12 06:30:50 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 58.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.9%, Prefix cache hit rate: 92.2%
- INFO 06-12 06:31:00 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 58.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 4.1%, Prefix cache hit rate: 92.2%
- INFO: 172.17.0.1:39830 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:48642 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:48658 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:48662 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO 06-12 06:31:10 [loggers.py:118] Engine 000: Avg prompt throughput: 4011.2 tokens/s, Avg generation throughput: 66.4 tokens/s, Running: 4 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.7%, Prefix cache hit rate: 92.9%
- INFO: 172.17.0.1:37700 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:37712 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:37718 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:39830 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:37724 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:37700 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:37724 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO 06-12 06:31:20 [loggers.py:118] Engine 000: Avg prompt throughput: 7019.7 tokens/s, Avg generation throughput: 334.6 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 7.5%, Prefix cache hit rate: 93.9%
- INFO 06-12 06:31:30 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 357.6 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.6%, Prefix cache hit rate: 93.9%
- INFO: 172.17.0.1:48658 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:37712 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:37718 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:48642 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:37712 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO 06-12 06:31:40 [loggers.py:118] Engine 000: Avg prompt throughput: 5014.1 tokens/s, Avg generation throughput: 341.6 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.3%, Prefix cache hit rate: 94.4%
- INFO: 172.17.0.1:37700 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:37724 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:37700 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO 06-12 06:31:50 [loggers.py:118] Engine 000: Avg prompt throughput: 3008.5 tokens/s, Avg generation throughput: 333.9 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.6%, Prefix cache hit rate: 94.7%
- INFO: 172.17.0.1:48662 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:37718 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO 06-12 06:32:00 [loggers.py:118] Engine 000: Avg prompt throughput: 2005.7 tokens/s, Avg generation throughput: 313.6 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 11.9%, Prefix cache hit rate: 94.9%
- INFO: 172.17.0.1:48642 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:37718 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:37724 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO 06-12 06:32:10 [loggers.py:118] Engine 000: Avg prompt throughput: 3008.4 tokens/s, Avg generation throughput: 313.0 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.9%, Prefix cache hit rate: 95.1%
- INFO: 172.17.0.1:37700 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:48658 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO 06-12 06:32:20 [loggers.py:118] Engine 000: Avg prompt throughput: 2005.6 tokens/s, Avg generation throughput: 312.3 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 12.2%, Prefix cache hit rate: 95.3%
- INFO: 172.17.0.1:37700 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:37712 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO 06-12 06:32:30 [loggers.py:118] Engine 000: Avg prompt throughput: 2005.6 tokens/s, Avg generation throughput: 310.6 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.2%, Prefix cache hit rate: 95.4%
- INFO: 172.17.0.1:39830 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:37718 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:48642 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:39830 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:48662 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO 06-12 06:32:40 [loggers.py:118] Engine 000: Avg prompt throughput: 5014.1 tokens/s, Avg generation throughput: 330.4 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 11.0%, Prefix cache hit rate: 95.7%
- INFO: 172.17.0.1:37724 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:48662 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:48658 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO 06-12 06:32:50 [loggers.py:118] Engine 000: Avg prompt throughput: 3008.5 tokens/s, Avg generation throughput: 352.2 tokens/s, Running: 7 reqs, Waiting: 0 reqs, GPU KV cache usage: 7.7%, Prefix cache hit rate: 95.9%
- INFO: 172.17.0.1:37700 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:37700 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO 06-12 06:33:00 [loggers.py:118] Engine 000: Avg prompt throughput: 2005.8 tokens/s, Avg generation throughput: 349.0 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.8%, Prefix cache hit rate: 96.0%
- INFO: 172.17.0.1:48642 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:37718 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:37712 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:37724 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO 06-12 06:33:10 [loggers.py:118] Engine 000: Avg prompt throughput: 4011.5 tokens/s, Avg generation throughput: 343.1 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 11.3%, Prefix cache hit rate: 96.2%
- INFO 06-12 06:33:20 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 271.2 tokens/s, Running: 6 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.5%, Prefix cache hit rate: 96.2%
- INFO 06-12 06:33:30 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 222.6 tokens/s, Running: 3 reqs, Waiting: 0 reqs, GPU KV cache usage: 7.9%, Prefix cache hit rate: 96.2%
- INFO 06-12 06:33:40 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 90.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.8%, Prefix cache hit rate: 96.2%
- INFO: 172.17.0.1:37718 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:34772 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:34786 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:34796 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:34802 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:34786 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:34810 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:34818 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:34834 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO 06-12 06:33:50 [loggers.py:118] Engine 000: Avg prompt throughput: 9025.0 tokens/s, Avg generation throughput: 129.5 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 10.0%, Prefix cache hit rate: 96.5%
- INFO: 172.17.0.1:34810 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO 06-12 06:34:00 [loggers.py:118] Engine 000: Avg prompt throughput: 1002.8 tokens/s, Avg generation throughput: 358.1 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 7.9%, Prefix cache hit rate: 96.6%
- INFO: 172.17.0.1:34802 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO 06-12 06:34:10 [loggers.py:118] Engine 000: Avg prompt throughput: 1002.9 tokens/s, Avg generation throughput: 354.5 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 11.9%, Prefix cache hit rate: 96.6%
- INFO: 172.17.0.1:34796 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:34802 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:37718 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:34834 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:34772 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:34818 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:34834 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO 06-12 06:34:20 [loggers.py:118] Engine 000: Avg prompt throughput: 7020.1 tokens/s, Avg generation throughput: 335.6 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 7.7%, Prefix cache hit rate: 96.8%
- INFO: 172.17.0.1:34786 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:34786 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:34810 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO 06-12 06:34:30 [loggers.py:118] Engine 000: Avg prompt throughput: 3008.5 tokens/s, Avg generation throughput: 330.1 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 7.7%, Prefix cache hit rate: 96.9%
- INFO: 172.17.0.1:34796 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:34802 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:34772 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:34818 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO 06-12 06:34:40 [loggers.py:118] Engine 000: Avg prompt throughput: 4011.5 tokens/s, Avg generation throughput: 350.8 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 10.8%, Prefix cache hit rate: 97.0%
- INFO: 172.17.0.1:34802 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:37718 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO 06-12 06:34:50 [loggers.py:118] Engine 000: Avg prompt throughput: 2005.6 tokens/s, Avg generation throughput: 350.8 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 11.4%, Prefix cache hit rate: 97.1%
- INFO: 172.17.0.1:37718 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:34786 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:34810 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO 06-12 06:35:00 [loggers.py:118] Engine 000: Avg prompt throughput: 3008.6 tokens/s, Avg generation throughput: 332.1 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.4%, Prefix cache hit rate: 97.2%
- INFO: 172.17.0.1:34818 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:34834 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:34772 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:34818 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO 06-12 06:35:10 [loggers.py:118] Engine 000: Avg prompt throughput: 4011.4 tokens/s, Avg generation throughput: 333.2 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.2%, Prefix cache hit rate: 97.3%
- INFO 06-12 06:35:20 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 350.6 tokens/s, Running: 7 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.9%, Prefix cache hit rate: 97.3%
- INFO: 172.17.0.1:37718 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:34786 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:34802 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:37718 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:34834 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO 06-12 06:35:30 [loggers.py:118] Engine 000: Avg prompt throughput: 5014.6 tokens/s, Avg generation throughput: 314.2 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 11.8%, Prefix cache hit rate: 97.4%
- INFO: 172.17.0.1:34772 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:34834 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO 06-12 06:35:40 [loggers.py:118] Engine 000: Avg prompt throughput: 2005.7 tokens/s, Avg generation throughput: 314.2 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.2%, Prefix cache hit rate: 97.4%
- INFO: 172.17.0.1:34796 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:34810 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO 06-12 06:35:50 [loggers.py:118] Engine 000: Avg prompt throughput: 2005.7 tokens/s, Avg generation throughput: 317.6 tokens/s, Running: 5 reqs, Waiting: 0 reqs, GPU KV cache usage: 7.9%, Prefix cache hit rate: 97.5%
- INFO 06-12 06:36:00 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 223.8 tokens/s, Running: 4 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.0%, Prefix cache hit rate: 97.5%
- INFO 06-12 06:36:10 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 125.3 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 4.3%, Prefix cache hit rate: 97.5%
- INFO: 172.17.0.1:34796 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:37718 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:50684 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:50692 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:50708 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:50720 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:50732 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:50742 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:50692 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO 06-12 06:36:20 [loggers.py:118] Engine 000: Avg prompt throughput: 8022.7 tokens/s, Avg generation throughput: 143.6 tokens/s, Running: 7 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.9%, Prefix cache hit rate: 97.6%
- INFO: 172.17.0.1:50742 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO 06-12 06:36:30 [loggers.py:118] Engine 000: Avg prompt throughput: 2005.7 tokens/s, Avg generation throughput: 358.7 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 7.8%, Prefix cache hit rate: 97.6%
- INFO 06-12 06:36:40 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 356.0 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.0%, Prefix cache hit rate: 97.6%
- INFO: 172.17.0.1:50708 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:34796 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:50720 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:37718 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:34796 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:50692 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:50742 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO 06-12 06:36:50 [loggers.py:118] Engine 000: Avg prompt throughput: 7019.5 tokens/s, Avg generation throughput: 346.4 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 11.1%, Prefix cache hit rate: 97.7%
- INFO: 172.17.0.1:50692 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:50684 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO 06-12 06:37:00 [loggers.py:118] Engine 000: Avg prompt throughput: 2005.6 tokens/s, Avg generation throughput: 325.8 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.3%, Prefix cache hit rate: 97.8%
- INFO: 172.17.0.1:50720 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:37718 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO 06-12 06:37:10 [loggers.py:118] Engine 000: Avg prompt throughput: 2005.8 tokens/s, Avg generation throughput: 313.9 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 11.8%, Prefix cache hit rate: 97.8%
- INFO: 172.17.0.1:50720 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:50742 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:50692 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO 06-12 06:37:20 [loggers.py:118] Engine 000: Avg prompt throughput: 3008.4 tokens/s, Avg generation throughput: 313.0 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 12.0%, Prefix cache hit rate: 97.8%
- INFO: 172.17.0.1:50692 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:50708 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO 06-12 06:37:30 [loggers.py:118] Engine 000: Avg prompt throughput: 2005.8 tokens/s, Avg generation throughput: 312.1 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.2%, Prefix cache hit rate: 97.9%
- INFO: 172.17.0.1:50732 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:37718 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO 06-12 06:37:40 [loggers.py:118] Engine 000: Avg prompt throughput: 2005.7 tokens/s, Avg generation throughput: 314.1 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 12.0%, Prefix cache hit rate: 97.9%
- INFO: 172.17.0.1:50720 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:37718 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:34796 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO 06-12 06:37:50 [loggers.py:118] Engine 000: Avg prompt throughput: 3008.2 tokens/s, Avg generation throughput: 314.7 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.6%, Prefix cache hit rate: 97.9%
- INFO: 172.17.0.1:50684 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:50692 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:50684 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:50708 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO 06-12 06:38:00 [loggers.py:118] Engine 000: Avg prompt throughput: 4011.4 tokens/s, Avg generation throughput: 326.7 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.1%, Prefix cache hit rate: 98.0%
- INFO 06-12 06:38:10 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 316.0 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.1%, Prefix cache hit rate: 98.0%
- INFO: 172.17.0.1:50742 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:50732 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:37718 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO 06-12 06:38:20 [loggers.py:118] Engine 000: Avg prompt throughput: 3008.6 tokens/s, Avg generation throughput: 313.2 tokens/s, Running: 5 reqs, Waiting: 0 reqs, GPU KV cache usage: 7.6%, Prefix cache hit rate: 98.0%
- INFO 06-12 06:38:30 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 215.4 tokens/s, Running: 4 reqs, Waiting: 0 reqs, GPU KV cache usage: 7.9%, Prefix cache hit rate: 98.0%
- INFO 06-12 06:38:40 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 185.7 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 4.6%, Prefix cache hit rate: 98.0%
- INFO: 172.17.0.1:50708 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:34796 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:37718 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:53600 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:53604 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:53608 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:53612 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:53626 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:34796 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:53608 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO 06-12 06:38:50 [loggers.py:118] Engine 000: Avg prompt throughput: 10027.6 tokens/s, Avg generation throughput: 263.6 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 10.4%, Prefix cache hit rate: 98.1%
- INFO: 172.17.0.1:53608 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO 06-12 06:39:00 [loggers.py:118] Engine 000: Avg prompt throughput: 1002.8 tokens/s, Avg generation throughput: 357.9 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.3%, Prefix cache hit rate: 98.1%
- INFO: 172.17.0.1:53612 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:53600 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO 06-12 06:39:10 [loggers.py:118] Engine 000: Avg prompt throughput: 2005.5 tokens/s, Avg generation throughput: 353.1 tokens/s, Running: 7 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.3%, Prefix cache hit rate: 98.2%
- INFO: 172.17.0.1:37718 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:53626 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:37718 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:53608 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO 06-12 06:39:20 [loggers.py:118] Engine 000: Avg prompt throughput: 4011.5 tokens/s, Avg generation throughput: 348.0 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.6%, Prefix cache hit rate: 98.2%
- INFO: 172.17.0.1:50708 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO 06-12 06:39:30 [loggers.py:118] Engine 000: Avg prompt throughput: 1002.8 tokens/s, Avg generation throughput: 333.9 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 12.3%, Prefix cache hit rate: 98.2%
- INFO: 172.17.0.1:50708 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:53612 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO 06-12 06:39:40 [loggers.py:118] Engine 000: Avg prompt throughput: 2005.5 tokens/s, Avg generation throughput: 317.2 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.6%, Prefix cache hit rate: 98.2%
- INFO: 172.17.0.1:37718 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:53600 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO 06-12 06:39:50 [loggers.py:118] Engine 000: Avg prompt throughput: 2005.7 tokens/s, Avg generation throughput: 314.7 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 13.0%, Prefix cache hit rate: 98.2%
- INFO: 172.17.0.1:53626 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:53600 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO 06-12 06:40:00 [loggers.py:118] Engine 000: Avg prompt throughput: 2005.5 tokens/s, Avg generation throughput: 313.3 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 10.2%, Prefix cache hit rate: 98.3%
- INFO: 172.17.0.1:50708 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:53608 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:53612 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:53608 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:53604 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:34796 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO 06-12 06:40:10 [loggers.py:118] Engine 000: Avg prompt throughput: 6017.1 tokens/s, Avg generation throughput: 317.1 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 10.8%, Prefix cache hit rate: 98.3%
- INFO: 172.17.0.1:34796 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO 06-12 06:40:20 [loggers.py:118] Engine 000: Avg prompt throughput: 1002.8 tokens/s, Avg generation throughput: 351.4 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.6%, Prefix cache hit rate: 98.3%
- INFO: 172.17.0.1:53600 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:53626 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO 06-12 06:40:30 [loggers.py:118] Engine 000: Avg prompt throughput: 2005.6 tokens/s, Avg generation throughput: 329.6 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.9%, Prefix cache hit rate: 98.3%
- INFO: 172.17.0.1:53612 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:53608 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO: 172.17.0.1:53612 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO 06-12 06:40:40 [loggers.py:118] Engine 000: Avg prompt throughput: 3008.7 tokens/s, Avg generation throughput: 315.2 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.1%, Prefix cache hit rate: 98.4%
- INFO: 172.17.0.1:34796 - "POST /v1/chat/completions HTTP/1.1" 200 OK
- INFO 06-12 06:40:50 [loggers.py:118] Engine 000: Avg prompt throughput: 1002.9 tokens/s, Avg generation throughput: 309.2 tokens/s, Running: 7 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.1%, Prefix cache hit rate: 98.4%
- INFO 06-12 06:41:00 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 264.9 tokens/s, Running: 6 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.4%, Prefix cache hit rate: 98.4%
- INFO 06-12 06:41:10 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 97.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.7%, Prefix cache hit rate: 98.4%
- INFO 06-12 06:41:20 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 49.6 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 98.4%
- INFO 06-12 06:41:30 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 98.4%
- ^CINFO 06-12 06:48:46 [launcher.py:80] Shutting down FastAPI HTTP server.
Advertisement
Add Comment
Please, Sign In to add comment
Advertisement