a guest
May 30th, 2026
  1. Loading audio sample from LibriSpeech...
  2.  ▷ Loaded: 93680 samples, 16000 Hz, 5.9s
  3.  ▷ Reference transcript: MISTER QUILTER IS THE APOSTLE OF THE MIDDLE CLASSES AND WE ARE GLAD TO WELCOME HIS GOSPEL
  4.  
  5. Starting Benchmark: granite_speech (ibm-granite/granite-speech-3.3-2b)
  6. INFO 05-30 08:35:33 [utils.py:278] non-default args: {'max_model_len': 2048, 'gpu_memory_utilization': 0.9, 'disable_log_stats': True, 'enforce_eager': True, 'limit_mm_per_prompt': {'audio': 1}, 'model_impl': 'transformers', 'model': 'ibm-granite/granite-speech-3.3-2b'}
  7. Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
  8. INFO 05-30 08:35:47 [model.py:617] Resolved architecture: TransformersMultiModalForCausalLM
  9. INFO 05-30 08:35:47 [model.py:1751] Using max model len 2048
  10. INFO 05-30 08:35:47 [scheduler.py:239] Chunked prefill is enabled with max_num_batched_tokens=8192.
  11. INFO 05-30 08:35:47 [vllm.py:984] Asynchronous scheduling is enabled.
  12. WARNING 05-30 08:35:47 [vllm.py:1040] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
  13. WARNING 05-30 08:35:47 [vllm.py:1082] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
  14. INFO 05-30 08:35:47 [kernel.py:270] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['vllm_c', 'native'], fused_add_rms_norm=['vllm_c', 'native'])
  15. INFO 05-30 08:35:48 [vllm.py:1258] Cudagraph is disabled under eager mode
  16. INFO 05-30 08:35:48 [compilation.py:321] Enabled custom fusions: norm_quant, act_quant
  17. (EngineCore pid=92890) INFO 05-30 08:36:10 [core.py:112] Initializing a V1 LLM engine (v0.19.1rc1.dev72+g7b9de7c89.d20260407) with config: model='ibm-granite/granite-speech-3.3-2b', speculative_config=None, tokenizer='ibm-granite/granite-speech-3.3-2b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=None, quantization_config=None, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=ibm-granite/granite-speech-3.3-2b, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'ir_enable_torch_wrap': False, 'splitting_ops': [], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_vision_items_per_batch': 0, 'encoder_cudagraph_max_frames_per_batch': None, 'compile_sizes': [], 'compile_ranges_endpoints': [8192], 'inductor_compile_config': 
{'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False, 'fuse_rope_kvcache_cat_mla': False, 'fuse_act_padding': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['vllm_c', 'native'], fused_add_rms_norm=['vllm_c', 'native']), enable_flashinfer_autotune=True, moe_backend='auto', linear_backend='auto')
  18. (EngineCore pid=92890) Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
  19. (EngineCore pid=92890) INFO 05-30 08:36:17 [parallel_state.py:1422] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.128.0.11:38097 backend=nccl
  20. (EngineCore pid=92890) INFO 05-30 08:36:17 [parallel_state.py:1735] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
  21. (EngineCore pid=92890) INFO 05-30 08:36:18 [topk_topp_sampler.py:45] Using FlashInfer for top-p & top-k sampling.
  22. (EngineCore pid=92890) INFO 05-30 08:36:18 [gpu_model_runner.py:5036] Starting to load model ibm-granite/granite-speech-3.3-2b...
  23. (EngineCore pid=92890) INFO 05-30 08:36:18 [base.py:117] Using Transformers modeling backend.
  24. (EngineCore pid=92890) INFO 05-30 08:36:19 [cuda.py:378] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
  25. (EngineCore pid=92890) INFO 05-30 08:36:19 [flash_attn.py:636] Using FlashAttention version 2
  26. (EngineCore pid=92890) INFO 05-30 08:36:20 [weight_utils.py:922] Filesystem type for checkpoints: EXT4. Checkpoint size: 5.61 GiB. Available RAM: 116.37 GiB.
  27. (EngineCore pid=92890) INFO 05-30 08:36:20 [weight_utils.py:945] Auto-prefetch is disabled because the filesystem (EXT4) is not a recognized network FS (NFS/Lustre). If you want to force prefetching, start vLLM with --safetensors-load-strategy=prefetch.
  28. (EngineCore pid=92890)
  29. Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
  30. (EngineCore pid=92890)
  31. Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:01,  1.74it/s]
  32. (EngineCore pid=92890)
  33. Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:01<00:01,  1.68it/s]
  34. (EngineCore pid=92890)
  35. Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:01<00:00,  1.56it/s]
  36. (EngineCore pid=92890)
  37. Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:01<00:00,  2.07it/s]
  38. (EngineCore pid=92890)
  39. (EngineCore pid=92890) INFO 05-30 08:36:22 [default_loader.py:397] Loading weights took 1.94 seconds
  40. (EngineCore pid=92890) INFO 05-30 08:36:23 [gpu_model_runner.py:5131] Model loading took 5.61 GiB memory and 3.492085 seconds
  41. (EngineCore pid=92890) INFO 05-30 08:36:23 [gpu_model_runner.py:6140] Encoder cache will be initialized with a budget of 8192 tokens, and profiled with 2 audio items of the maximum feature size.
  42. (EngineCore pid=92890) INFO 05-30 08:36:27 [gpu_worker.py:469] Available KV cache memory: 13.53 GiB
  43. (EngineCore pid=92890) INFO 05-30 08:36:27 [kv_cache_utils.py:1733] GPU KV cache size: 177,296 tokens
  44. (EngineCore pid=92890) INFO 05-30 08:36:27 [kv_cache_utils.py:1734] Maximum concurrency for 2,048 tokens per request: 86.57x
  45. (EngineCore pid=92890) INFO 05-30 08:36:27 [jit_monitor.py:54] Kernel JIT monitor activated — Triton JIT compilations during inference will be logged as warnings.
  46. (EngineCore pid=92890) INFO 05-30 08:36:27 [core.py:309] init engine (profile, create kv cache, warmup model) took 4.13 s
  47. (EngineCore pid=92890) INFO 05-30 08:36:27 [vllm.py:984] Asynchronous scheduling is enabled.
  48. (EngineCore pid=92890) WARNING 05-30 08:36:27 [vllm.py:1040] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
  49. (EngineCore pid=92890) WARNING 05-30 08:36:27 [vllm.py:1082] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
  50. (EngineCore pid=92890) INFO 05-30 08:36:27 [kernel.py:270] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['vllm_c', 'native'], fused_add_rms_norm=['vllm_c', 'native'])
  51. (EngineCore pid=92890) INFO 05-30 08:36:27 [vllm.py:1258] Cudagraph is disabled under eager mode
  52. (EngineCore pid=92890) INFO 05-30 08:36:27 [compilation.py:321] Enabled custom fusions: norm_quant, act_quant
  53. [transformers] Ignoring clean_up_tokenization_spaces=True for BPE tokenizer CachedGPT2Tokenizer. The clean_up_tokenization post-processing step is designed for WordPiece tokenizers and is destructive for BPE (it strips spaces before punctuation). Set clean_up_tokenization_spaces=False to suppress this warning, or set clean_up_tokenization_spaces_for_bpe_even_though_it_will_corrupt_output=True to force cleanup anyway.
  54. (EngineCore pid=92890) WARNING 05-30 08:36:27 [jit_monitor.py:103] Triton kernel JIT compilation during inference: _compute_slot_mapping_kernel. This causes a latency spike; consider extending warmup to cover this shape/config.
  55. (EngineCore pid=92890) INFO 05-30 08:36:39 [core.py:1287] Shutdown initiated (timeout=0)
  56. (EngineCore pid=92890) INFO 05-30 08:36:39 [core.py:1310] Shutdown complete
  57.  ▷ Model loaded in 53.9s
  58.  ▷ Warming up...
  59.  ▷ Output (59 tokens):
  60.  ▷ Mister Quilterter is the apostle of the middle classes, and we are glad to welcome his gospel.
  61.  
  62. In written format:
  63.  
  64. Mister Quilterter is the apostle of the middle classes, and we are glad to welcome his gospel.
  65.  ▷ Metrics (avg of 3 runs):
  66.  ▷ E2E latency:     2958.6 ms
  67.  ▷ Throughput:      19.9 tokens/s
  68.  ▷ Best E2E:        2938.7 ms
  69.  ▷ Cooling down GPU for 5 seconds...
  70.  
  71.  
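Note on the throughput figure: the reported value appears consistent with dividing the output token count by the average E2E latency. The benchmark script is not part of this log, so the following is only a minimal sketch of that arithmetic under that assumption; the function name `tokens_per_second` is hypothetical.

```python
def tokens_per_second(num_tokens: int, e2e_latency_ms: float) -> float:
    """Output tokens divided by end-to-end latency, in tokens per second.

    Assumption: this mirrors how the benchmark derives "Throughput";
    the actual script is not shown in the log.
    """
    if e2e_latency_ms <= 0:
        raise ValueError("latency must be positive")
    return num_tokens / (e2e_latency_ms / 1000.0)


# Figures copied from the granite_speech run above:
# 59 output tokens, 2958.6 ms average E2E latency.
granite = tokens_per_second(59, 2958.6)
print(f"{granite:.1f} tokens/s")  # prints "19.9 tokens/s"
```

Because the latency is end-to-end, this figure would include audio encoding and prefill time, not only decoding, so it is likely to understate the pure decode rate.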
  72. Starting Benchmark: audioflamingo3 (nvidia/audio-flamingo-3-hf)
  73. INFO 05-30 08:37:02 [utils.py:278] non-default args: {'max_model_len': 2048, 'gpu_memory_utilization': 0.9, 'disable_log_stats': True, 'enforce_eager': True, 'limit_mm_per_prompt': {'audio': 1}, 'model_impl': 'transformers', 'model': 'nvidia/audio-flamingo-3-hf'}
  74. Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
  75. INFO 05-30 08:37:15 [model.py:617] Resolved architecture: TransformersMultiModalForCausalLM
  76. INFO 05-30 08:37:15 [model.py:2086] Downcasting torch.float32 to torch.bfloat16.
  77. INFO 05-30 08:37:15 [model.py:1751] Using max model len 2048
  78. INFO 05-30 08:37:15 [scheduler.py:239] Chunked prefill is enabled with max_num_batched_tokens=8192.
  79. INFO 05-30 08:37:15 [vllm.py:984] Asynchronous scheduling is enabled.
  80. WARNING 05-30 08:37:15 [vllm.py:1040] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
  81. WARNING 05-30 08:37:15 [vllm.py:1082] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
  82. INFO 05-30 08:37:15 [kernel.py:270] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['vllm_c', 'native'], fused_add_rms_norm=['vllm_c', 'native'])
  83. INFO 05-30 08:37:16 [vllm.py:1258] Cudagraph is disabled under eager mode
  84. INFO 05-30 08:37:16 [compilation.py:321] Enabled custom fusions: norm_quant, act_quant
  85. (EngineCore pid=96361) INFO 05-30 08:37:49 [core.py:112] Initializing a V1 LLM engine (v0.19.1rc1.dev72+g7b9de7c89.d20260407) with config: model='nvidia/audio-flamingo-3-hf', speculative_config=None, tokenizer='nvidia/audio-flamingo-3-hf', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=None, quantization_config=None, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=nvidia/audio-flamingo-3-hf, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'ir_enable_torch_wrap': False, 'splitting_ops': [], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_vision_items_per_batch': 0, 'encoder_cudagraph_max_frames_per_batch': None, 'compile_sizes': [], 'compile_ranges_endpoints': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 
'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False, 'fuse_rope_kvcache_cat_mla': False, 'fuse_act_padding': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['vllm_c', 'native'], fused_add_rms_norm=['vllm_c', 'native']), enable_flashinfer_autotune=True, moe_backend='auto', linear_backend='auto')
  86. (EngineCore pid=96361) Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
  87. (EngineCore pid=96361) INFO 05-30 08:38:01 [parallel_state.py:1422] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.128.0.11:38409 backend=nccl
  88. (EngineCore pid=96361) INFO 05-30 08:38:02 [parallel_state.py:1735] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
  89. (EngineCore pid=96361) INFO 05-30 08:38:02 [topk_topp_sampler.py:45] Using FlashInfer for top-p & top-k sampling.
  90. (EngineCore pid=96361) INFO 05-30 08:38:03 [gpu_model_runner.py:5036] Starting to load model nvidia/audio-flamingo-3-hf...
  91. (EngineCore pid=96361) INFO 05-30 08:38:03 [base.py:117] Using Transformers modeling backend.
  92. (EngineCore pid=96361) INFO 05-30 08:38:04 [cuda.py:378] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
  93. (EngineCore pid=96361) INFO 05-30 08:38:04 [flash_attn.py:636] Using FlashAttention version 2
  94. (EngineCore pid=96361) INFO 05-30 08:38:04 [weight_utils.py:922] Filesystem type for checkpoints: EXT4. Checkpoint size: 15.40 GiB. Available RAM: 115.72 GiB.
  95. (EngineCore pid=96361) INFO 05-30 08:38:04 [weight_utils.py:945] Auto-prefetch is disabled because the filesystem (EXT4) is not a recognized network FS (NFS/Lustre). If you want to force prefetching, start vLLM with --safetensors-load-strategy=prefetch.
  96. (EngineCore pid=96361)
  97. Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
  98. (EngineCore pid=96361)
  99. Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:01<00:04,  1.40s/it]
  100. (EngineCore pid=96361)
  101. Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:02<00:02,  1.41s/it]
  102. (EngineCore pid=96361)
  103. Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:04<00:01,  1.40s/it]
  104. (EngineCore pid=96361)
  105. Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:04<00:00,  1.06it/s]
  106. (EngineCore pid=96361)
  107. Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:04<00:00,  1.11s/it]
  108. (EngineCore pid=96361)
  109. (EngineCore pid=96361) INFO 05-30 08:38:09 [default_loader.py:397] Loading weights took 4.47 seconds
  110. (EngineCore pid=96361) INFO 05-30 08:38:09 [gpu_model_runner.py:5131] Model loading took 14.51 GiB memory and 5.603415 seconds
  111. (EngineCore pid=96361) INFO 05-30 08:38:10 [gpu_model_runner.py:6140] Encoder cache will be initialized with a budget of 8192 tokens, and profiled with 5 audio items of the maximum feature size.
  112. (EngineCore pid=96361) INFO 05-30 08:38:15 [gpu_worker.py:469] Available KV cache memory: 3.98 GiB
  113. (EngineCore pid=96361) INFO 05-30 08:38:15 [kv_cache_utils.py:1733] GPU KV cache size: 74,448 tokens
  114. (EngineCore pid=96361) INFO 05-30 08:38:15 [kv_cache_utils.py:1734] Maximum concurrency for 2,048 tokens per request: 36.35x
  115. (EngineCore pid=96361) INFO 05-30 08:38:15 [jit_monitor.py:54] Kernel JIT monitor activated — Triton JIT compilations during inference will be logged as warnings.
  116. (EngineCore pid=96361) INFO 05-30 08:38:15 [core.py:309] init engine (profile, create kv cache, warmup model) took 5.42 s
  117. (EngineCore pid=96361) INFO 05-30 08:38:15 [vllm.py:984] Asynchronous scheduling is enabled.
  118. (EngineCore pid=96361) WARNING 05-30 08:38:15 [vllm.py:1040] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
  119. (EngineCore pid=96361) WARNING 05-30 08:38:15 [vllm.py:1082] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
  120. (EngineCore pid=96361) INFO 05-30 08:38:15 [kernel.py:270] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['vllm_c', 'native'], fused_add_rms_norm=['vllm_c', 'native'])
  121. (EngineCore pid=96361) INFO 05-30 08:38:15 [vllm.py:1258] Cudagraph is disabled under eager mode
  122. (EngineCore pid=96361) INFO 05-30 08:38:15 [compilation.py:321] Enabled custom fusions: norm_quant, act_quant
  123. (EngineCore pid=96361) WARNING 05-30 08:38:15 [jit_monitor.py:103] Triton kernel JIT compilation during inference: _compute_slot_mapping_kernel. This causes a latency spike; consider extending warmup to cover this shape/config.
  124. (EngineCore pid=96361) INFO 05-30 08:39:01 [core.py:1287] Shutdown initiated (timeout=0)
  125. (EngineCore pid=96361) INFO 05-30 08:39:01 [core.py:1310] Shutdown complete
  126.  ▷ Model loaded in 73.7s
  127.  ▷ Warming up...
  128.  ▷ Output (200 tokens):
  129.  ▷  serving characteristic distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant dist
  130.  ▷ Metrics (avg of 3 runs):
  131.  ▷ E2E latency:     11523.4 ms
  132.  ▷ Throughput:      17.4 tokens/s
  133.  ▷ Best E2E:        11514.8 ms
  134.  ▷ Cooling down GPU for 5 seconds...
  135.  
  136.  
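Note on the "Maximum concurrency" lines: in both runs above, the logged value matches the GPU KV cache size in tokens divided by the max model length (2,048). The sketch below reproduces that arithmetic. It is checked only against the numbers in this log, not against the vLLM source, and the function name `max_concurrency` is hypothetical.

```python
def max_concurrency(kv_cache_tokens: int, max_model_len: int) -> float:
    """How many full-length requests fit in the KV cache at once.

    Assumption: plain division of cache capacity by per-request token budget,
    which matches the values printed in the log.
    """
    if max_model_len <= 0:
        raise ValueError("max_model_len must be positive")
    return kv_cache_tokens / max_model_len


# KV cache sizes copied from the two runs above; max_model_len is 2048 in both.
for name, kv_tokens in [("granite_speech", 177_296), ("audioflamingo3", 74_448)]:
    print(f"{name}: {max_concurrency(kv_tokens, 2048):.2f}x")
# prints:
# granite_speech: 86.57x
# audioflamingo3: 36.35x
```

The drop from 86.57x to 36.35x follows from the available KV cache memory shrinking from 13.53 GiB to 3.98 GiB once the larger checkpoint (14.51 GiB versus 5.61 GiB) is loaded under the same `gpu_memory_utilization` of 0.9.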
  137. Starting Benchmark: vibevoice_asr (microsoft/VibeVoice-ASR-HF)
  138. INFO 05-30 08:39:24 [utils.py:278] non-default args: {'max_model_len': 2048, 'gpu_memory_utilization': 0.9, 'disable_log_stats': True, 'enforce_eager': True, 'limit_mm_per_prompt': {'audio': 1}, 'model_impl': 'transformers', 'model': 'microsoft/VibeVoice-ASR-HF'}
  139. Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
  140. INFO 05-30 08:39:37 [model.py:617] Resolved architecture: TransformersMultiModalForCausalLM
  141. INFO 05-30 08:39:37 [model.py:1751] Using max model len 2048
  142. INFO 05-30 08:39:37 [scheduler.py:239] Chunked prefill is enabled with max_num_batched_tokens=8192.
  143. INFO 05-30 08:39:37 [vllm.py:984] Asynchronous scheduling is enabled.
  144. WARNING 05-30 08:39:37 [vllm.py:1040] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
  145. WARNING 05-30 08:39:37 [vllm.py:1082] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
  146. INFO 05-30 08:39:37 [kernel.py:270] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['vllm_c', 'native'], fused_add_rms_norm=['vllm_c', 'native'])
  147. INFO 05-30 08:39:38 [vllm.py:1258] Cudagraph is disabled under eager mode
  148. INFO 05-30 08:39:38 [compilation.py:321] Enabled custom fusions: norm_quant, act_quant
  149. [transformers] `VibeVoiceAsrProcessor` defines `feature_extractor_class = 'VibeVoiceAcousticTokenizerFeatureExtractor'`, which is deprecated. Register the correct mapping in `AutoFeatureExtractor` instead.
  150. (EngineCore pid=101082) INFO 05-30 08:40:12 [core.py:112] Initializing a V1 LLM engine (v0.19.1rc1.dev72+g7b9de7c89.d20260407) with config: model='microsoft/VibeVoice-ASR-HF', speculative_config=None, tokenizer='microsoft/VibeVoice-ASR-HF', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=None, quantization_config=None, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=microsoft/VibeVoice-ASR-HF, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'ir_enable_torch_wrap': False, 'splitting_ops': [], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_vision_items_per_batch': 0, 'encoder_cudagraph_max_frames_per_batch': None, 'compile_sizes': [], 'compile_ranges_endpoints': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 
'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False, 'fuse_rope_kvcache_cat_mla': False, 'fuse_act_padding': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['vllm_c', 'native'], fused_add_rms_norm=['vllm_c', 'native']), enable_flashinfer_autotune=True, moe_backend='auto', linear_backend='auto')
  151. (EngineCore pid=101082) Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
  152. (EngineCore pid=101082) [transformers] `VibeVoiceAsrProcessor` defines `feature_extractor_class = 'VibeVoiceAcousticTokenizerFeatureExtractor'`, which is deprecated. Register the correct mapping in `AutoFeatureExtractor` instead.
  153. (EngineCore pid=101082) INFO 05-30 08:40:24 [parallel_state.py:1422] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.128.0.11:40577 backend=nccl
  154. (EngineCore pid=101082) INFO 05-30 08:40:24 [parallel_state.py:1735] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
  155. (EngineCore pid=101082) INFO 05-30 08:40:25 [topk_topp_sampler.py:45] Using FlashInfer for top-p & top-k sampling.
  156. (EngineCore pid=101082) INFO 05-30 08:40:25 [gpu_model_runner.py:5036] Starting to load model microsoft/VibeVoice-ASR-HF...
  157. (EngineCore pid=101082) INFO 05-30 08:40:25 [base.py:117] Using Transformers modeling backend.
  158. (EngineCore pid=101082) INFO 05-30 08:40:26 [cuda.py:378] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
  159. (EngineCore pid=101082) INFO 05-30 08:40:26 [flash_attn.py:636] Using FlashAttention version 2
  160. (EngineCore pid=101082) INFO 05-30 08:40:26 [weight_utils.py:922] Filesystem type for checkpoints: EXT4. Checkpoint size: 15.52 GiB. Available RAM: 115.77 GiB.
  161. (EngineCore pid=101082) INFO 05-30 08:40:26 [weight_utils.py:945] Auto-prefetch is disabled because the filesystem (EXT4) is not a recognized network FS (NFS/Lustre). If you want to force prefetching, start vLLM with --safetensors-load-strategy=prefetch.
  162. (EngineCore pid=101082)
  163. Loading safetensors checkpoint shards:   0% Completed | 0/8 [00:00<?, ?it/s]
  164. (EngineCore pid=101082)
  165. Loading safetensors checkpoint shards:  12% Completed | 1/8 [00:00<00:04,  1.53it/s]
  166. (EngineCore pid=101082)
  167. Loading safetensors checkpoint shards:  25% Completed | 2/8 [00:01<00:04,  1.47it/s]
  168. (EngineCore pid=101082)
  169. Loading safetensors checkpoint shards:  38% Completed | 3/8 [00:02<00:03,  1.45it/s]
  170. (EngineCore pid=101082)
  171. Loading safetensors checkpoint shards:  50% Completed | 4/8 [00:02<00:02,  1.43it/s]
  172. (EngineCore pid=101082)
  173. Loading safetensors checkpoint shards:  62% Completed | 5/8 [00:03<00:02,  1.42it/s]
  174. (EngineCore pid=101082)
  175. Loading safetensors checkpoint shards:  75% Completed | 6/8 [00:04<00:01,  1.54it/s]
  176. (EngineCore pid=101082)
  177. Loading safetensors checkpoint shards:  88% Completed | 7/8 [00:04<00:00,  1.67it/s]
  178. (EngineCore pid=101082)
  179. Loading safetensors checkpoint shards: 100% Completed | 8/8 [00:04<00:00,  1.76it/s]
  180. (EngineCore pid=101082)
  181. (EngineCore pid=101082) INFO 05-30 08:40:31 [default_loader.py:397] Loading weights took 4.56 seconds
  182. (EngineCore pid=101082) INFO 05-30 08:40:32 [gpu_model_runner.py:5131] Model loading took 14.63 GiB memory and 5.895775 seconds
  183. (EngineCore pid=101082) INFO 05-30 08:40:32 [gpu_model_runner.py:6140] Encoder cache will be initialized with a budget of 8192 tokens, and profiled with 2 audio items of the maximum feature size.
  184. (EngineCore pid=101082) INFO 05-30 08:40:38 [gpu_worker.py:469] Available KV cache memory: 3.05 GiB
  185. (EngineCore pid=101082) INFO 05-30 08:40:38 [kv_cache_utils.py:1733] GPU KV cache size: 57,024 tokens
  186. (EngineCore pid=101082) INFO 05-30 08:40:38 [kv_cache_utils.py:1734] Maximum concurrency for 2,048 tokens per request: 27.84x
  187. (EngineCore pid=101082) INFO 05-30 08:40:38 [jit_monitor.py:54] Kernel JIT monitor activated — Triton JIT compilations during inference will be logged as warnings.
  188. (EngineCore pid=101082) INFO 05-30 08:40:38 [core.py:309] init engine (profile, create kv cache, warmup model) took 6.28 s
  189. (EngineCore pid=101082) INFO 05-30 08:40:38 [vllm.py:984] Asynchronous scheduling is enabled.
  190. (EngineCore pid=101082) WARNING 05-30 08:40:38 [vllm.py:1040] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
  191. (EngineCore pid=101082) WARNING 05-30 08:40:38 [vllm.py:1082] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
  192. (EngineCore pid=101082) INFO 05-30 08:40:38 [kernel.py:270] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['vllm_c', 'native'], fused_add_rms_norm=['vllm_c', 'native'])
  193. (EngineCore pid=101082) INFO 05-30 08:40:38 [vllm.py:1258] Cudagraph is disabled under eager mode
  194. (EngineCore pid=101082) INFO 05-30 08:40:38 [compilation.py:321] Enabled custom fusions: norm_quant, act_quant
  195. (EngineCore pid=101082) WARNING 05-30 08:40:39 [jit_monitor.py:103] Triton kernel JIT compilation during inference: _compute_slot_mapping_kernel. This causes a latency spike; consider extending warmup to cover this shape/config.
  196. (EngineCore pid=101082) INFO 05-30 08:41:25 [core.py:1287] Shutdown initiated (timeout=0)
  197. (EngineCore pid=101082) INFO 05-30 08:41:25 [core.py:1310] Shutdown complete
  198.  ▷ Model loaded in 75.1s
  199.  ▷ Warming up...
  200.  ▷ Output (200 tokens):
  201.  ▷ 何的ο星几�ksksいてв藏某恢复 localiosfefefefefefefefefefefefeemeeme豆uesomaomaca小心ad几เพ worried� AV Spanish בי Sta米 진 Op óioneues核心核心核心核心loglen核心log fund dos disc disc� ob dos인데인데eps�ues exiente� Pass ex ringمسいてв winsMione me fefefefefefefefefefefefefefefefefefefefefefeami核心 pose视频望指are Urban案 Gu se� smoke smoke�几几ami�几 gen geniente weiß It Itistefefefefefefefefefefefefefefefefefefefefefefefefefefefefefefeですね distinct� AV-se Bo minusave요ねv Element历史 strength Indians Indians Johnsonの forsa� Smoke smoke bomb
  202.  ▷ Metrics (avg of 3 runs):
  203.  ▷ E2E latency:     11505.7 ms
  204.  ▷ Throughput:      17.4 tokens/s
  205.  ▷ Best E2E:        11503.5 ms
  206.  ▷ Cooling down GPU for 5 seconds...
  207.  
  208.  
  209. Starting Benchmark: glmasr (zai-org/GLM-ASR-Nano-2512)
  210. INFO 05-30 08:41:47 [utils.py:278] non-default args: {'max_model_len': 2048, 'gpu_memory_utilization': 0.9, 'disable_log_stats': True, 'enforce_eager': True, 'limit_mm_per_prompt': {'audio': 1}, 'model_impl': 'transformers', 'model': 'zai-org/GLM-ASR-Nano-2512'}
  211. Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
  212. INFO 05-30 08:42:00 [model.py:617] Resolved architecture: TransformersMultiModalForCausalLM
  213. INFO 05-30 08:42:00 [model.py:1751] Using max model len 2048
  214. INFO 05-30 08:42:00 [scheduler.py:239] Chunked prefill is enabled with max_num_batched_tokens=8192.
  215. INFO 05-30 08:42:00 [vllm.py:984] Asynchronous scheduling is enabled.
  216. WARNING 05-30 08:42:00 [vllm.py:1040] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
  217. WARNING 05-30 08:42:00 [vllm.py:1082] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
  218. INFO 05-30 08:42:00 [kernel.py:270] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['vllm_c', 'native'], fused_add_rms_norm=['vllm_c', 'native'])
  219. INFO 05-30 08:42:01 [vllm.py:1258] Cudagraph is disabled under eager mode
  220. INFO 05-30 08:42:01 [compilation.py:321] Enabled custom fusions: norm_quant, act_quant
  221. (EngineCore pid=105801) INFO 05-30 08:42:26 [core.py:112] Initializing a V1 LLM engine (v0.19.1rc1.dev72+g7b9de7c89.d20260407) with config: model='zai-org/GLM-ASR-Nano-2512', speculative_config=None, tokenizer='zai-org/GLM-ASR-Nano-2512', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=None, quantization_config=None, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=zai-org/GLM-ASR-Nano-2512, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'ir_enable_torch_wrap': False, 'splitting_ops': [], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_vision_items_per_batch': 0, 'encoder_cudagraph_max_frames_per_batch': None, 'compile_sizes': [], 'compile_ranges_endpoints': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 
'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False, 'fuse_rope_kvcache_cat_mla': False, 'fuse_act_padding': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['vllm_c', 'native'], fused_add_rms_norm=['vllm_c', 'native']), enable_flashinfer_autotune=True, moe_backend='auto', linear_backend='auto')
  222. (EngineCore pid=105801) Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
  223. (EngineCore pid=105801) INFO 05-30 08:42:33 [parallel_state.py:1422] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.128.0.11:51913 backend=nccl
  224. (EngineCore pid=105801) INFO 05-30 08:42:33 [parallel_state.py:1735] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
  225. (EngineCore pid=105801) INFO 05-30 08:42:35 [topk_topp_sampler.py:45] Using FlashInfer for top-p & top-k sampling.
  226. (EngineCore pid=105801) INFO 05-30 08:42:35 [gpu_model_runner.py:5036] Starting to load model zai-org/GLM-ASR-Nano-2512...
  227. (EngineCore pid=105801) INFO 05-30 08:42:35 [base.py:117] Using Transformers modeling backend.
  228. (EngineCore pid=105801) INFO 05-30 08:42:36 [cuda.py:378] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
  229. (EngineCore pid=105801) INFO 05-30 08:42:36 [flash_attn.py:636] Using FlashAttention version 2
  230. (EngineCore pid=105801) INFO 05-30 08:42:36 [weight_utils.py:647] No model.safetensors.index.json found in remote.
  231. (EngineCore pid=105801) INFO 05-30 08:42:36 [weight_utils.py:922] Filesystem type for checkpoints: EXT4. Checkpoint size: 4.21 GiB. Available RAM: 116.29 GiB.
  232. (EngineCore pid=105801) INFO 05-30 08:42:36 [weight_utils.py:945] Auto-prefetch is disabled because the filesystem (EXT4) is not a recognized network FS (NFS/Lustre). If you want to force prefetching, start vLLM with --safetensors-load-strategy=prefetch.
  233. (EngineCore pid=105801)
  234. Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
  235. (EngineCore pid=105801)
  236. Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00,  1.30s/it]
  237. (EngineCore pid=105801)
  238. Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00,  1.30s/it]
  239. (EngineCore pid=105801)
  240. (EngineCore pid=105801) INFO 05-30 08:42:37 [default_loader.py:397] Loading weights took 1.38 seconds
  241. (EngineCore pid=105801) INFO 05-30 08:42:38 [gpu_model_runner.py:5131] Model loading took 4.0 GiB memory and 2.442925 seconds
  242. (EngineCore pid=105801) INFO 05-30 08:42:39 [gpu_model_runner.py:6140] Encoder cache will be initialized with a budget of 8192 tokens, and profiled with 5 audio items of the maximum feature size.
  243. (EngineCore pid=105801) INFO 05-30 08:42:42 [gpu_worker.py:469] Available KV cache memory: 15.18 GiB
  244. (EngineCore pid=105801) INFO 05-30 08:42:42 [kv_cache_utils.py:1733] GPU KV cache size: 284,272 tokens
  245. (EngineCore pid=105801) INFO 05-30 08:42:42 [kv_cache_utils.py:1734] Maximum concurrency for 2,048 tokens per request: 138.80x
  246. (EngineCore pid=105801) INFO 05-30 08:42:42 [jit_monitor.py:54] Kernel JIT monitor activated — Triton JIT compilations during inference will be logged as warnings.
  247. (EngineCore pid=105801) INFO 05-30 08:42:42 [core.py:309] init engine (profile, create kv cache, warmup model) took 4.01 s
  248. (EngineCore pid=105801) INFO 05-30 08:42:43 [vllm.py:984] Asynchronous scheduling is enabled.
  249. (EngineCore pid=105801) WARNING 05-30 08:42:43 [vllm.py:1040] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
  250. (EngineCore pid=105801) WARNING 05-30 08:42:43 [vllm.py:1082] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
  251. (EngineCore pid=105801) INFO 05-30 08:42:43 [kernel.py:270] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['vllm_c', 'native'], fused_add_rms_norm=['vllm_c', 'native'])
  252. (EngineCore pid=105801) INFO 05-30 08:42:43 [vllm.py:1258] Cudagraph is disabled under eager mode
  253. (EngineCore pid=105801) INFO 05-30 08:42:43 [compilation.py:321] Enabled custom fusions: norm_quant, act_quant
  254. (EngineCore pid=105801) WARNING 05-30 08:42:43 [jit_monitor.py:103] Triton kernel JIT compilation during inference: _compute_slot_mapping_kernel. This causes a latency spike; consider extending warmup to cover this shape/config.
  255. (EngineCore pid=105801) INFO 05-30 08:42:46 [core.py:1287] Shutdown initiated (timeout=0)
  256. (EngineCore pid=105801) INFO 05-30 08:42:46 [core.py:1310] Shutdown complete
  257.  ▷ Model loaded in 55.9s
  258.  ▷ Warming up...
  259.  ▷ Output (23 tokens):
  260.  ▷ Mister Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.
  261.  ▷ Metrics (avg of 3 runs):
  262.  ▷ E2E latency:     818.0 ms
  263.  ▷ Throughput:      28.1 tokens/s
  264.  ▷ Best E2E:        815.9 ms
  265.  ▷ Cooling down GPU for 5 seconds...
  266.  
  267.  
  268. FINAL SUMMARY
  269. Model                  E2E (ms)    Tok/s  Tokens  Output preview
  270. ----------------------------------------------------------------------
  271. granite_speech             2959     19.9      59  Mister Quilterter is the apostle of the
  272. audioflamingo3            11523     17.4     200   serving characteristic distant distant
  273. vibevoice_asr             11506     17.4     200  何的ο星几�ksksいてв藏某恢复 localiosfefefefefefefe
  274. glmasr                      818     28.1      23  Mister Quilter is the apostle of the mid
  275.  
  276. Detailed results saved to /home/harsh/workspace/benchmark_results.json
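
Note on the summary table: the Tok/s column appears consistent with output tokens divided by E2E latency. The benchmark script is not shown in this log, so that formula is an assumption, and the names below (ROWS, tokens_per_second, check_rows) are hypothetical helpers rather than part of the original tooling. A minimal sketch that re-derives the column from the values printed above:

```python
# Rows copied from the FINAL SUMMARY table above:
# (model, e2e_ms, reported_tok_s, tokens)
ROWS = [
    ("granite_speech", 2959, 19.9, 59),
    ("audioflamingo3", 11523, 17.4, 200),
    ("vibevoice_asr", 11506, 17.4, 200),
    ("glmasr", 818, 28.1, 23),
]


def tokens_per_second(tokens, e2e_ms):
    """Assumed formula: generated tokens divided by end-to-end latency in seconds."""
    return tokens / (e2e_ms / 1000.0)


def check_rows(rows, tol=0.1):
    """Return {model: (derived_tok_s, matches_reported)} for each table row."""
    results = {}
    for model, e2e_ms, reported, tokens in rows:
        derived = tokens_per_second(tokens, e2e_ms)
        # The table rounds to one decimal, so allow a small tolerance.
        results[model] = (round(derived, 1), abs(derived - reported) <= tol)
    return results


if __name__ == "__main__":
    for model, (derived, ok) in check_rows(ROWS).items():
        print(f"{model:<16} derived={derived:>5} tok/s  matches_reported={ok}")
```

Under this assumption, a high Tok/s figure says nothing about transcript quality: the audioflamingo3 and vibevoice_asr rows both ran to 200 tokens without reproducing the reference transcript, so their latency and throughput numbers are not comparable to glmasr's 23-token correct output.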