Loading audio sample from LibriSpeech...
 ▷ Loaded: 93680 samples, 16000 Hz, 5.9s
 ▷ Reference transcript: MISTER QUILTER IS THE APOSTLE OF THE MIDDLE CLASSES AND WE ARE GLAD TO WELCOME HIS GOSPEL
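
Note: the 5.9s duration reported for the loaded sample follows from the sample count and the sample rate. A minimal sketch of that arithmetic is below; the function name is hypothetical and is not taken from the benchmark script, which is not shown in this log.

```python
def duration_seconds(num_samples: int, sample_rate_hz: int) -> float:
    """Length of a mono audio clip in seconds (hypothetical helper)."""
    if sample_rate_hz <= 0:
        raise ValueError("sample rate must be positive")
    return num_samples / sample_rate_hz


# Figures copied from the "Loaded:" line of the log.
print(f"{duration_seconds(93680, 16000):.1f}s")  # prints "5.9s"
```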

Starting Benchmark: granite_speech (ibm-granite/granite-speech-3.3-2b)
INFO 05-30 08:35:33 [utils.py:278] non-default args: {'max_model_len': 2048, 'gpu_memory_utilization': 0.9, 'disable_log_stats': True, 'enforce_eager': True, 'limit_mm_per_prompt': {'audio': 1}, 'model_impl': 'transformers', 'model': 'ibm-granite/granite-speech-3.3-2b'}
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
INFO 05-30 08:35:47 [model.py:617] Resolved architecture: TransformersMultiModalForCausalLM
INFO 05-30 08:35:47 [model.py:1751] Using max model len 2048
INFO 05-30 08:35:47 [scheduler.py:239] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 05-30 08:35:47 [vllm.py:984] Asynchronous scheduling is enabled.
WARNING 05-30 08:35:47 [vllm.py:1040] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
WARNING 05-30 08:35:47 [vllm.py:1082] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
INFO 05-30 08:35:47 [kernel.py:270] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['vllm_c', 'native'], fused_add_rms_norm=['vllm_c', 'native'])
INFO 05-30 08:35:48 [vllm.py:1258] Cudagraph is disabled under eager mode
INFO 05-30 08:35:48 [compilation.py:321] Enabled custom fusions: norm_quant, act_quant
(EngineCore pid=92890) INFO 05-30 08:36:10 [core.py:112] Initializing a V1 LLM engine (v0.19.1rc1.dev72+g7b9de7c89.d20260407) with config: model='ibm-granite/granite-speech-3.3-2b', speculative_config=None, tokenizer='ibm-granite/granite-speech-3.3-2b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=None, quantization_config=None, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=ibm-granite/granite-speech-3.3-2b, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'ir_enable_torch_wrap': False, 'splitting_ops': [], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_vision_items_per_batch': 0, 'encoder_cudagraph_max_frames_per_batch': None, 'compile_sizes': [], 'compile_ranges_endpoints': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': 
False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False, 'fuse_rope_kvcache_cat_mla': False, 'fuse_act_padding': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['vllm_c', 'native'], fused_add_rms_norm=['vllm_c', 'native']), enable_flashinfer_autotune=True, moe_backend='auto', linear_backend='auto')
(EngineCore pid=92890) Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
(EngineCore pid=92890) INFO 05-30 08:36:17 [parallel_state.py:1422] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.128.0.11:38097 backend=nccl
(EngineCore pid=92890) INFO 05-30 08:36:17 [parallel_state.py:1735] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore pid=92890) INFO 05-30 08:36:18 [topk_topp_sampler.py:45] Using FlashInfer for top-p & top-k sampling.
(EngineCore pid=92890) INFO 05-30 08:36:18 [gpu_model_runner.py:5036] Starting to load model ibm-granite/granite-speech-3.3-2b...
(EngineCore pid=92890) INFO 05-30 08:36:18 [base.py:117] Using Transformers modeling backend.
(EngineCore pid=92890) INFO 05-30 08:36:19 [cuda.py:378] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(EngineCore pid=92890) INFO 05-30 08:36:19 [flash_attn.py:636] Using FlashAttention version 2
(EngineCore pid=92890) INFO 05-30 08:36:20 [weight_utils.py:922] Filesystem type for checkpoints: EXT4. Checkpoint size: 5.61 GiB. Available RAM: 116.37 GiB.
(EngineCore pid=92890) INFO 05-30 08:36:20 [weight_utils.py:945] Auto-prefetch is disabled because the filesystem (EXT4) is not a recognized network FS (NFS/Lustre). If you want to force prefetching, start vLLM with --safetensors-load-strategy=prefetch.
(EngineCore pid=92890)
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
(EngineCore pid=92890)
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:01,  1.74it/s]
(EngineCore pid=92890)
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:01<00:01,  1.68it/s]
(EngineCore pid=92890)
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:01<00:00,  1.56it/s]
(EngineCore pid=92890)
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:01<00:00,  2.07it/s]
(EngineCore pid=92890)
(EngineCore pid=92890) INFO 05-30 08:36:22 [default_loader.py:397] Loading weights took 1.94 seconds
(EngineCore pid=92890) INFO 05-30 08:36:23 [gpu_model_runner.py:5131] Model loading took 5.61 GiB memory and 3.492085 seconds
(EngineCore pid=92890) INFO 05-30 08:36:23 [gpu_model_runner.py:6140] Encoder cache will be initialized with a budget of 8192 tokens, and profiled with 2 audio items of the maximum feature size.
(EngineCore pid=92890) INFO 05-30 08:36:27 [gpu_worker.py:469] Available KV cache memory: 13.53 GiB
(EngineCore pid=92890) INFO 05-30 08:36:27 [kv_cache_utils.py:1733] GPU KV cache size: 177,296 tokens
(EngineCore pid=92890) INFO 05-30 08:36:27 [kv_cache_utils.py:1734] Maximum concurrency for 2,048 tokens per request: 86.57x
(EngineCore pid=92890) INFO 05-30 08:36:27 [jit_monitor.py:54] Kernel JIT monitor activated — Triton JIT compilations during inference will be logged as warnings.
(EngineCore pid=92890) INFO 05-30 08:36:27 [core.py:309] init engine (profile, create kv cache, warmup model) took 4.13 s
(EngineCore pid=92890) INFO 05-30 08:36:27 [vllm.py:984] Asynchronous scheduling is enabled.
(EngineCore pid=92890) WARNING 05-30 08:36:27 [vllm.py:1040] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(EngineCore pid=92890) WARNING 05-30 08:36:27 [vllm.py:1082] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(EngineCore pid=92890) INFO 05-30 08:36:27 [kernel.py:270] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['vllm_c', 'native'], fused_add_rms_norm=['vllm_c', 'native'])
(EngineCore pid=92890) INFO 05-30 08:36:27 [vllm.py:1258] Cudagraph is disabled under eager mode
(EngineCore pid=92890) INFO 05-30 08:36:27 [compilation.py:321] Enabled custom fusions: norm_quant, act_quant
[transformers] Ignoring clean_up_tokenization_spaces=True for BPE tokenizer CachedGPT2Tokenizer. The clean_up_tokenization post-processing step is designed for WordPiece tokenizers and is destructive for BPE (it strips spaces before punctuation). Set clean_up_tokenization_spaces=False to suppress this warning, or set clean_up_tokenization_spaces_for_bpe_even_though_it_will_corrupt_output=True to force cleanup anyway.
(EngineCore pid=92890) WARNING 05-30 08:36:27 [jit_monitor.py:103] Triton kernel JIT compilation during inference: _compute_slot_mapping_kernel. This causes a latency spike; consider extending warmup to cover this shape/config.
(EngineCore pid=92890) INFO 05-30 08:36:39 [core.py:1287] Shutdown initiated (timeout=0)
(EngineCore pid=92890) INFO 05-30 08:36:39 [core.py:1310] Shutdown complete
 ▷ Model loaded in 53.9s
 ▷ Warming up...
 ▷ Output (59 tokens):
 ▷ Mister Quilterter is the apostle of the middle classes, and we are glad to welcome his gospel.

In written format:

Mister Quilterter is the apostle of the middle classes, and we are glad to welcome his gospel.
 ▷ Metrics (avg of 3 runs):
 ▷ E2E latency:     2958.6 ms
 ▷ Throughput:      19.9 tokens/s
 ▷ Best E2E:        2938.7 ms
 ▷ Cooling down GPU for 5 seconds...
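
Note: the reported throughput is consistent with dividing the number of output tokens by the average E2E latency. The benchmark script is not shown here, so treating it as computed this way is an assumption; the helper name below is hypothetical.

```python
def tokens_per_second(num_tokens: int, latency_ms: float) -> float:
    """Output tokens divided by end-to-end latency in seconds.

    Assumption: the benchmark derives throughput this way; the log only
    shows the resulting figure, not the formula.
    """
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return num_tokens / (latency_ms / 1000.0)


# granite_speech figures from the log: 59 output tokens, 2958.6 ms average.
print(f"{tokens_per_second(59, 2958.6):.1f} tokens/s")  # prints "19.9 tokens/s"
```

Because the latency is end-to-end, this number also includes audio encoding and prefill time, so it is lower than a pure decode rate would be.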


Starting Benchmark: audioflamingo3 (nvidia/audio-flamingo-3-hf)
INFO 05-30 08:37:02 [utils.py:278] non-default args: {'max_model_len': 2048, 'gpu_memory_utilization': 0.9, 'disable_log_stats': True, 'enforce_eager': True, 'limit_mm_per_prompt': {'audio': 1}, 'model_impl': 'transformers', 'model': 'nvidia/audio-flamingo-3-hf'}
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
INFO 05-30 08:37:15 [model.py:617] Resolved architecture: TransformersMultiModalForCausalLM
INFO 05-30 08:37:15 [model.py:2086] Downcasting torch.float32 to torch.bfloat16.
INFO 05-30 08:37:15 [model.py:1751] Using max model len 2048
INFO 05-30 08:37:15 [scheduler.py:239] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 05-30 08:37:15 [vllm.py:984] Asynchronous scheduling is enabled.
WARNING 05-30 08:37:15 [vllm.py:1040] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
WARNING 05-30 08:37:15 [vllm.py:1082] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
INFO 05-30 08:37:15 [kernel.py:270] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['vllm_c', 'native'], fused_add_rms_norm=['vllm_c', 'native'])
INFO 05-30 08:37:16 [vllm.py:1258] Cudagraph is disabled under eager mode
INFO 05-30 08:37:16 [compilation.py:321] Enabled custom fusions: norm_quant, act_quant
(EngineCore pid=96361) INFO 05-30 08:37:49 [core.py:112] Initializing a V1 LLM engine (v0.19.1rc1.dev72+g7b9de7c89.d20260407) with config: model='nvidia/audio-flamingo-3-hf', speculative_config=None, tokenizer='nvidia/audio-flamingo-3-hf', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=None, quantization_config=None, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=nvidia/audio-flamingo-3-hf, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'ir_enable_torch_wrap': False, 'splitting_ops': [], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_vision_items_per_batch': 0, 'encoder_cudagraph_max_frames_per_batch': None, 'compile_sizes': [], 'compile_ranges_endpoints': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': 
False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False, 'fuse_rope_kvcache_cat_mla': False, 'fuse_act_padding': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['vllm_c', 'native'], fused_add_rms_norm=['vllm_c', 'native']), enable_flashinfer_autotune=True, moe_backend='auto', linear_backend='auto')
(EngineCore pid=96361) Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
(EngineCore pid=96361) INFO 05-30 08:38:01 [parallel_state.py:1422] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.128.0.11:38409 backend=nccl
(EngineCore pid=96361) INFO 05-30 08:38:02 [parallel_state.py:1735] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore pid=96361) INFO 05-30 08:38:02 [topk_topp_sampler.py:45] Using FlashInfer for top-p & top-k sampling.
(EngineCore pid=96361) INFO 05-30 08:38:03 [gpu_model_runner.py:5036] Starting to load model nvidia/audio-flamingo-3-hf...
(EngineCore pid=96361) INFO 05-30 08:38:03 [base.py:117] Using Transformers modeling backend.
(EngineCore pid=96361) INFO 05-30 08:38:04 [cuda.py:378] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(EngineCore pid=96361) INFO 05-30 08:38:04 [flash_attn.py:636] Using FlashAttention version 2
(EngineCore pid=96361) INFO 05-30 08:38:04 [weight_utils.py:922] Filesystem type for checkpoints: EXT4. Checkpoint size: 15.40 GiB. Available RAM: 115.72 GiB.
(EngineCore pid=96361) INFO 05-30 08:38:04 [weight_utils.py:945] Auto-prefetch is disabled because the filesystem (EXT4) is not a recognized network FS (NFS/Lustre). If you want to force prefetching, start vLLM with --safetensors-load-strategy=prefetch.
(EngineCore pid=96361)
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
(EngineCore pid=96361)
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:01<00:04,  1.40s/it]
(EngineCore pid=96361)
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:02<00:02,  1.41s/it]
(EngineCore pid=96361)
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:04<00:01,  1.40s/it]
(EngineCore pid=96361)
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:04<00:00,  1.06it/s]
(EngineCore pid=96361)
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:04<00:00,  1.11s/it]
(EngineCore pid=96361)
(EngineCore pid=96361) INFO 05-30 08:38:09 [default_loader.py:397] Loading weights took 4.47 seconds
(EngineCore pid=96361) INFO 05-30 08:38:09 [gpu_model_runner.py:5131] Model loading took 14.51 GiB memory and 5.603415 seconds
(EngineCore pid=96361) INFO 05-30 08:38:10 [gpu_model_runner.py:6140] Encoder cache will be initialized with a budget of 8192 tokens, and profiled with 5 audio items of the maximum feature size.
(EngineCore pid=96361) INFO 05-30 08:38:15 [gpu_worker.py:469] Available KV cache memory: 3.98 GiB
(EngineCore pid=96361) INFO 05-30 08:38:15 [kv_cache_utils.py:1733] GPU KV cache size: 74,448 tokens
(EngineCore pid=96361) INFO 05-30 08:38:15 [kv_cache_utils.py:1734] Maximum concurrency for 2,048 tokens per request: 36.35x
(EngineCore pid=96361) INFO 05-30 08:38:15 [jit_monitor.py:54] Kernel JIT monitor activated — Triton JIT compilations during inference will be logged as warnings.
(EngineCore pid=96361) INFO 05-30 08:38:15 [core.py:309] init engine (profile, create kv cache, warmup model) took 5.42 s
(EngineCore pid=96361) INFO 05-30 08:38:15 [vllm.py:984] Asynchronous scheduling is enabled.
(EngineCore pid=96361) WARNING 05-30 08:38:15 [vllm.py:1040] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(EngineCore pid=96361) WARNING 05-30 08:38:15 [vllm.py:1082] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(EngineCore pid=96361) INFO 05-30 08:38:15 [kernel.py:270] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['vllm_c', 'native'], fused_add_rms_norm=['vllm_c', 'native'])
(EngineCore pid=96361) INFO 05-30 08:38:15 [vllm.py:1258] Cudagraph is disabled under eager mode
(EngineCore pid=96361) INFO 05-30 08:38:15 [compilation.py:321] Enabled custom fusions: norm_quant, act_quant
(EngineCore pid=96361) WARNING 05-30 08:38:15 [jit_monitor.py:103] Triton kernel JIT compilation during inference: _compute_slot_mapping_kernel. This causes a latency spike; consider extending warmup to cover this shape/config.
(EngineCore pid=96361) INFO 05-30 08:39:01 [core.py:1287] Shutdown initiated (timeout=0)
(EngineCore pid=96361) INFO 05-30 08:39:01 [core.py:1310] Shutdown complete
 ▷ Model loaded in 73.7s
 ▷ Warming up...
 ▷ Output (200 tokens):
 ▷  serving characteristic distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant distant dist
 ▷ Metrics (avg of 3 runs):
 ▷ E2E latency:     11523.4 ms
 ▷ Throughput:      17.4 tokens/s
 ▷ Best E2E:        11514.8 ms
 ▷ Cooling down GPU for 5 seconds...
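
Note: the "Maximum concurrency" values in the engine logs match the GPU KV cache size in tokens divided by the maximum model length (2,048 here). The sketch below reproduces that arithmetic under that assumption; the helper name is hypothetical and this is not vLLM's own code.

```python
def max_concurrency(kv_cache_tokens: int, max_model_len: int) -> float:
    """How many full-length requests fit in the KV cache at once.

    Assumption: a simple ratio of cache capacity to per-request length,
    which matches the figures printed in the log.
    """
    if max_model_len <= 0:
        raise ValueError("max_model_len must be positive")
    return kv_cache_tokens / max_model_len


# KV cache sizes copied from the kv_cache_utils.py log lines.
for name, kv_tokens in [("granite_speech", 177_296), ("audioflamingo3", 74_448)]:
    print(f"{name}: {max_concurrency(kv_tokens, 2048):.2f}x")
# prints "granite_speech: 86.57x" then "audioflamingo3: 36.35x"
```

The larger checkpoint leaves less memory for the KV cache under the same gpu_memory_utilization setting, which is why its concurrency figure is lower.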


Starting Benchmark: vibevoice_asr (microsoft/VibeVoice-ASR-HF)
INFO 05-30 08:39:24 [utils.py:278] non-default args: {'max_model_len': 2048, 'gpu_memory_utilization': 0.9, 'disable_log_stats': True, 'enforce_eager': True, 'limit_mm_per_prompt': {'audio': 1}, 'model_impl': 'transformers', 'model': 'microsoft/VibeVoice-ASR-HF'}
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
INFO 05-30 08:39:37 [model.py:617] Resolved architecture: TransformersMultiModalForCausalLM
INFO 05-30 08:39:37 [model.py:1751] Using max model len 2048
INFO 05-30 08:39:37 [scheduler.py:239] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 05-30 08:39:37 [vllm.py:984] Asynchronous scheduling is enabled.
WARNING 05-30 08:39:37 [vllm.py:1040] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
WARNING 05-30 08:39:37 [vllm.py:1082] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
INFO 05-30 08:39:37 [kernel.py:270] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['vllm_c', 'native'], fused_add_rms_norm=['vllm_c', 'native'])
INFO 05-30 08:39:38 [vllm.py:1258] Cudagraph is disabled under eager mode
INFO 05-30 08:39:38 [compilation.py:321] Enabled custom fusions: norm_quant, act_quant
[transformers] `VibeVoiceAsrProcessor` defines `feature_extractor_class = 'VibeVoiceAcousticTokenizerFeatureExtractor'`, which is deprecated. Register the correct mapping in `AutoFeatureExtractor` instead.
(EngineCore pid=101082) INFO 05-30 08:40:12 [core.py:112] Initializing a V1 LLM engine (v0.19.1rc1.dev72+g7b9de7c89.d20260407) with config: model='microsoft/VibeVoice-ASR-HF', speculative_config=None, tokenizer='microsoft/VibeVoice-ASR-HF', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=None, quantization_config=None, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=microsoft/VibeVoice-ASR-HF, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'ir_enable_torch_wrap': False, 'splitting_ops': [], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_vision_items_per_batch': 0, 'encoder_cudagraph_max_frames_per_batch': None, 'compile_sizes': [], 'compile_ranges_endpoints': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': 
False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False, 'fuse_rope_kvcache_cat_mla': False, 'fuse_act_padding': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['vllm_c', 'native'], fused_add_rms_norm=['vllm_c', 'native']), enable_flashinfer_autotune=True, moe_backend='auto', linear_backend='auto')
(EngineCore pid=101082) Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
(EngineCore pid=101082) [transformers] `VibeVoiceAsrProcessor` defines `feature_extractor_class = 'VibeVoiceAcousticTokenizerFeatureExtractor'`, which is deprecated. Register the correct mapping in `AutoFeatureExtractor` instead.
(EngineCore pid=101082) INFO 05-30 08:40:24 [parallel_state.py:1422] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.128.0.11:40577 backend=nccl
(EngineCore pid=101082) INFO 05-30 08:40:24 [parallel_state.py:1735] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore pid=101082) INFO 05-30 08:40:25 [topk_topp_sampler.py:45] Using FlashInfer for top-p & top-k sampling.
(EngineCore pid=101082) INFO 05-30 08:40:25 [gpu_model_runner.py:5036] Starting to load model microsoft/VibeVoice-ASR-HF...
(EngineCore pid=101082) INFO 05-30 08:40:25 [base.py:117] Using Transformers modeling backend.
(EngineCore pid=101082) INFO 05-30 08:40:26 [cuda.py:378] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(EngineCore pid=101082) INFO 05-30 08:40:26 [flash_attn.py:636] Using FlashAttention version 2
(EngineCore pid=101082) INFO 05-30 08:40:26 [weight_utils.py:922] Filesystem type for checkpoints: EXT4. Checkpoint size: 15.52 GiB. Available RAM: 115.77 GiB.
(EngineCore pid=101082) INFO 05-30 08:40:26 [weight_utils.py:945] Auto-prefetch is disabled because the filesystem (EXT4) is not a recognized network FS (NFS/Lustre). If you want to force prefetching, start vLLM with --safetensors-load-strategy=prefetch.
(EngineCore pid=101082)
Loading safetensors checkpoint shards:   0% Completed | 0/8 [00:00<?, ?it/s]
(EngineCore pid=101082)
Loading safetensors checkpoint shards:  12% Completed | 1/8 [00:00<00:04,  1.53it/s]
(EngineCore pid=101082)
Loading safetensors checkpoint shards:  25% Completed | 2/8 [00:01<00:04,  1.47it/s]
(EngineCore pid=101082)
Loading safetensors checkpoint shards:  38% Completed | 3/8 [00:02<00:03,  1.45it/s]
(EngineCore pid=101082)
Loading safetensors checkpoint shards:  50% Completed | 4/8 [00:02<00:02,  1.43it/s]
(EngineCore pid=101082)
Loading safetensors checkpoint shards:  62% Completed | 5/8 [00:03<00:02,  1.42it/s]
(EngineCore pid=101082)
Loading safetensors checkpoint shards:  75% Completed | 6/8 [00:04<00:01,  1.54it/s]
(EngineCore pid=101082)
Loading safetensors checkpoint shards:  88% Completed | 7/8 [00:04<00:00,  1.67it/s]
(EngineCore pid=101082)
Loading safetensors checkpoint shards: 100% Completed | 8/8 [00:04<00:00,  1.76it/s]
(EngineCore pid=101082)
(EngineCore pid=101082) INFO 05-30 08:40:31 [default_loader.py:397] Loading weights took 4.56 seconds
(EngineCore pid=101082) INFO 05-30 08:40:32 [gpu_model_runner.py:5131] Model loading took 14.63 GiB memory and 5.895775 seconds
(EngineCore pid=101082) INFO 05-30 08:40:32 [gpu_model_runner.py:6140] Encoder cache will be initialized with a budget of 8192 tokens, and profiled with 2 audio items of the maximum feature size.
(EngineCore pid=101082) INFO 05-30 08:40:38 [gpu_worker.py:469] Available KV cache memory: 3.05 GiB
(EngineCore pid=101082) INFO 05-30 08:40:38 [kv_cache_utils.py:1733] GPU KV cache size: 57,024 tokens
(EngineCore pid=101082) INFO 05-30 08:40:38 [kv_cache_utils.py:1734] Maximum concurrency for 2,048 tokens per request: 27.84x
(EngineCore pid=101082) INFO 05-30 08:40:38 [jit_monitor.py:54] Kernel JIT monitor activated — Triton JIT compilations during inference will be logged as warnings.
(EngineCore pid=101082) INFO 05-30 08:40:38 [core.py:309] init engine (profile, create kv cache, warmup model) took 6.28 s
(EngineCore pid=101082) INFO 05-30 08:40:38 [vllm.py:984] Asynchronous scheduling is enabled.
(EngineCore pid=101082) WARNING 05-30 08:40:38 [vllm.py:1040] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(EngineCore pid=101082) WARNING 05-30 08:40:38 [vllm.py:1082] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(EngineCore pid=101082) INFO 05-30 08:40:38 [kernel.py:270] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['vllm_c', 'native'], fused_add_rms_norm=['vllm_c', 'native'])
(EngineCore pid=101082) INFO 05-30 08:40:38 [vllm.py:1258] Cudagraph is disabled under eager mode
(EngineCore pid=101082) INFO 05-30 08:40:38 [compilation.py:321] Enabled custom fusions: norm_quant, act_quant
(EngineCore pid=101082) WARNING 05-30 08:40:39 [jit_monitor.py:103] Triton kernel JIT compilation during inference: _compute_slot_mapping_kernel. This causes a latency spike; consider extending warmup to cover this shape/config.
(EngineCore pid=101082) INFO 05-30 08:41:25 [core.py:1287] Shutdown initiated (timeout=0)
(EngineCore pid=101082) INFO 05-30 08:41:25 [core.py:1310] Shutdown complete
 ▷ Model loaded in 75.1s
 ▷ Warming up...
 ▷ Output (200 tokens):
 ▷ 何的ο星几�ksksいてв藏某恢复 localiosfefefefefefefefefefefefeemeeme豆uesomaomaca小心ad几เพ worried� AV Spanish בי Sta米 진 Op óioneues核心核心核心核心loglen核心log fund dos disc disc� ob dos인데인데eps�ues exiente� Pass ex ringمسいてв winsMione me fefefefefefefefefefefefefefefefefefefefefefeami核心 pose视频望指are Urban案 Gu se� smoke smoke�几几ami�几 gen geniente weiß It Itistefefefefefefefefefefefefefefefefefefefefefefefefefefefefefefeですね distinct� AV-se Bo minusave요ねv Element历史 strength Indians Indians Johnsonの forsa� Smoke smoke bomb
 ▷ Metrics (avg of 3 runs):
 ▷ E2E latency:     11505.7 ms
 ▷ Throughput:      17.4 tokens/s
 ▷ Best E2E:        11503.5 ms
 ▷ Cooling down GPU for 5 seconds...


Starting Benchmark: glmasr (zai-org/GLM-ASR-Nano-2512)
INFO 05-30 08:41:47 [utils.py:278] non-default args: {'max_model_len': 2048, 'gpu_memory_utilization': 0.9, 'disable_log_stats': True, 'enforce_eager': True, 'limit_mm_per_prompt': {'audio': 1}, 'model_impl': 'transformers', 'model': 'zai-org/GLM-ASR-Nano-2512'}
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
INFO 05-30 08:42:00 [model.py:617] Resolved architecture: TransformersMultiModalForCausalLM
INFO 05-30 08:42:00 [model.py:1751] Using max model len 2048
INFO 05-30 08:42:00 [scheduler.py:239] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 05-30 08:42:00 [vllm.py:984] Asynchronous scheduling is enabled.
WARNING 05-30 08:42:00 [vllm.py:1040] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
WARNING 05-30 08:42:00 [vllm.py:1082] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
INFO 05-30 08:42:00 [kernel.py:270] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['vllm_c', 'native'], fused_add_rms_norm=['vllm_c', 'native'])
INFO 05-30 08:42:01 [vllm.py:1258] Cudagraph is disabled under eager mode
INFO 05-30 08:42:01 [compilation.py:321] Enabled custom fusions: norm_quant, act_quant
(EngineCore pid=105801) INFO 05-30 08:42:26 [core.py:112] Initializing a V1 LLM engine (v0.19.1rc1.dev72+g7b9de7c89.d20260407) with config: model='zai-org/GLM-ASR-Nano-2512', speculative_config=None, tokenizer='zai-org/GLM-ASR-Nano-2512', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=None, quantization_config=None, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=zai-org/GLM-ASR-Nano-2512, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'ir_enable_torch_wrap': False, 'splitting_ops': [], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_vision_items_per_batch': 0, 'encoder_cudagraph_max_frames_per_batch': None, 'compile_sizes': [], 'compile_ranges_endpoints': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False, 'fuse_rope_kvcache_cat_mla': False, 'fuse_act_padding': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['vllm_c', 'native'], fused_add_rms_norm=['vllm_c', 'native']), enable_flashinfer_autotune=True, moe_backend='auto', linear_backend='auto')
(EngineCore pid=105801) Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
(EngineCore pid=105801) INFO 05-30 08:42:33 [parallel_state.py:1422] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.128.0.11:51913 backend=nccl
(EngineCore pid=105801) INFO 05-30 08:42:33 [parallel_state.py:1735] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore pid=105801) INFO 05-30 08:42:35 [topk_topp_sampler.py:45] Using FlashInfer for top-p & top-k sampling.
(EngineCore pid=105801) INFO 05-30 08:42:35 [gpu_model_runner.py:5036] Starting to load model zai-org/GLM-ASR-Nano-2512...
(EngineCore pid=105801) INFO 05-30 08:42:35 [base.py:117] Using Transformers modeling backend.
(EngineCore pid=105801) INFO 05-30 08:42:36 [cuda.py:378] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(EngineCore pid=105801) INFO 05-30 08:42:36 [flash_attn.py:636] Using FlashAttention version 2
(EngineCore pid=105801) INFO 05-30 08:42:36 [weight_utils.py:647] No model.safetensors.index.json found in remote.
(EngineCore pid=105801) INFO 05-30 08:42:36 [weight_utils.py:922] Filesystem type for checkpoints: EXT4. Checkpoint size: 4.21 GiB. Available RAM: 116.29 GiB.
(EngineCore pid=105801) INFO 05-30 08:42:36 [weight_utils.py:945] Auto-prefetch is disabled because the filesystem (EXT4) is not a recognized network FS (NFS/Lustre). If you want to force prefetching, start vLLM with --safetensors-load-strategy=prefetch.
(EngineCore pid=105801)
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
(EngineCore pid=105801)
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00,  1.30s/it]
(EngineCore pid=105801)
(EngineCore pid=105801) INFO 05-30 08:42:37 [default_loader.py:397] Loading weights took 1.38 seconds
(EngineCore pid=105801) INFO 05-30 08:42:38 [gpu_model_runner.py:5131] Model loading took 4.0 GiB memory and 2.442925 seconds
(EngineCore pid=105801) INFO 05-30 08:42:39 [gpu_model_runner.py:6140] Encoder cache will be initialized with a budget of 8192 tokens, and profiled with 5 audio items of the maximum feature size.
(EngineCore pid=105801) INFO 05-30 08:42:42 [gpu_worker.py:469] Available KV cache memory: 15.18 GiB
(EngineCore pid=105801) INFO 05-30 08:42:42 [kv_cache_utils.py:1733] GPU KV cache size: 284,272 tokens
(EngineCore pid=105801) INFO 05-30 08:42:42 [kv_cache_utils.py:1734] Maximum concurrency for 2,048 tokens per request: 138.80x
(EngineCore pid=105801) INFO 05-30 08:42:42 [jit_monitor.py:54] Kernel JIT monitor activated — Triton JIT compilations during inference will be logged as warnings.
(EngineCore pid=105801) INFO 05-30 08:42:42 [core.py:309] init engine (profile, create kv cache, warmup model) took 4.01 s
(EngineCore pid=105801) INFO 05-30 08:42:43 [vllm.py:984] Asynchronous scheduling is enabled.
(EngineCore pid=105801) WARNING 05-30 08:42:43 [vllm.py:1040] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(EngineCore pid=105801) WARNING 05-30 08:42:43 [vllm.py:1082] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(EngineCore pid=105801) INFO 05-30 08:42:43 [kernel.py:270] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['vllm_c', 'native'], fused_add_rms_norm=['vllm_c', 'native'])
(EngineCore pid=105801) INFO 05-30 08:42:43 [vllm.py:1258] Cudagraph is disabled under eager mode
(EngineCore pid=105801) INFO 05-30 08:42:43 [compilation.py:321] Enabled custom fusions: norm_quant, act_quant
(EngineCore pid=105801) WARNING 05-30 08:42:43 [jit_monitor.py:103] Triton kernel JIT compilation during inference: _compute_slot_mapping_kernel. This causes a latency spike; consider extending warmup to cover this shape/config.
(EngineCore pid=105801) INFO 05-30 08:42:46 [core.py:1287] Shutdown initiated (timeout=0)
(EngineCore pid=105801) INFO 05-30 08:42:46 [core.py:1310] Shutdown complete
 ▷ Model loaded in 55.9s
 ▷ Warming up...
 ▷ Output (23 tokens):
 ▷ Mister Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.
 ▷ Metrics (avg of 3 runs):
 ▷ E2E latency:     818.0 ms
 ▷ Throughput:      28.1 tokens/s
 ▷ Best E2E:        815.9 ms
 ▷ Cooling down GPU for 5 seconds...


FINAL SUMMARY
Model                  E2E (ms)    Tok/s  Tokens  Output preview
----------------------------------------------------------------------
granite_speech             2959     19.9      59  Mister Quilterter is the apostle of the
audioflamingo3            11523     17.4     200   serving characteristic distant distant
vibevoice_asr             11506     17.4     200  何的ο星几�ksksいてв藏某恢复 localiosfefefefefefefe
glmasr                      818     28.1      23  Mister Quilter is the apostle of the mid

Detailed results saved to /home/harsh/workspace/benchmark_results.json
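
Note on the Tok/s column: the figures in the summary appear consistent with throughput being computed as generated tokens divided by E2E latency in seconds. The benchmark script itself is not shown here, so that definition is an assumption, and the helper names below (tokens_per_second, check_rows) are hypothetical. A minimal sketch that re-derives the column from the rows printed above:

```python
# Rows copied from the FINAL SUMMARY table above:
# (model, E2E latency in ms, reported tok/s, generated tokens)
ROWS = [
    ("granite_speech", 2959, 19.9, 59),
    ("audioflamingo3", 11523, 17.4, 200),
    ("vibevoice_asr", 11506, 17.4, 200),
    ("glmasr", 818, 28.1, 23),
]


def tokens_per_second(tokens: int, e2e_ms: float) -> float:
    """Assumed definition: generated tokens / E2E latency in seconds."""
    return tokens / (e2e_ms / 1000.0)


def check_rows(rows):
    """Return {model: (derived tok/s rounded to 1 decimal, reported tok/s)}."""
    results = {}
    for model, e2e_ms, reported, tokens in rows:
        derived = round(tokens_per_second(tokens, e2e_ms), 1)
        results[model] = (derived, reported)
    return results


if __name__ == "__main__":
    for model, (derived, reported) in check_rows(ROWS).items():
        status = "match" if abs(derived - reported) < 0.05 else "MISMATCH"
        print(f"{model:<20} derived={derived:>5.1f}  reported={reported:>5.1f}  {status}")
```

Under that assumed definition, all four rows round to the reported value. Because throughput here is derived from a single end-to-end latency, it folds audio encoding and prefill time into the per-token rate and should not be read as a pure decode speed.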