more logs

(APIServer pid=5356) INFO 05-02 22:27:08 [utils.py:299]
(APIServer pid=5356) INFO 05-02 22:27:08 [utils.py:299]        █     █     █▄   ▄█
(APIServer pid=5356) INFO 05-02 22:27:08 [utils.py:299]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.19.0
(APIServer pid=5356) INFO 05-02 22:27:08 [utils.py:299]   █▄█▀ █     █     █     █  model   H:\qwen3.6-windows-server\models\Qwen3.6-27B-int4-AutoRound
(APIServer pid=5356) INFO 05-02 22:27:08 [utils.py:299]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=5356) INFO 05-02 22:27:08 [utils.py:299]
(APIServer pid=5356) INFO 05-02 22:27:08 [utils.py:233] non-default args: {'model_tag': 'H:\\qwen3.6-windows-server\\models\\Qwen3.6-27B-int4-AutoRound', 'chat_template': 'H:\\qwen3.6-windows-server\\templates\\qwen3.5-enhanced.jinja', 'default_chat_template_kwargs': {'preserve_thinking': False}, 'enable_auto_tool_choice': True, 'tool_call_parser': 'qwen3_coder', 'host': '0.0.0.0', 'port': 5001, 'model': 'H:\\qwen3.6-windows-server\\models\\Qwen3.6-27B-int4-AutoRound', 'trust_remote_code': True, 'max_model_len': 120000, 'quantization': 'auto-round', 'served_model_name': ['qwen3.6-27b-autoround'], 'use_tqdm_on_load': False, 'attention_backend': 'TRITON_ATTN', 'reasoning_parser': 'qwen3', 'block_size': 32, 'gpu_memory_utilization': 0.948, 'kv_cache_dtype': 'fp8_e4m3', 'enable_prefix_caching': True, 'limit_mm_per_prompt': {'image': 0, 'video': 0}, 'max_num_batched_tokens': 4128, 'max_num_seqs': 1, 'enable_chunked_prefill': True, 'speculative_config': {'method': 'mtp', 'num_speculative_tokens': 4}}
(APIServer pid=5356) INFO 05-02 22:27:08 [system_utils.py:279] Windows detected, skipping ulimit adjustment.
(APIServer pid=5356) WARNING 05-02 22:27:08 [envs.py:1786] Unknown vLLM environment variable detected: VLLM_ATTENTION_BACKEND
(APIServer pid=5356) WARNING 05-02 22:27:08 [envs.py:1786] Unknown vLLM environment variable detected: VLLM_MODEL_DIR
(APIServer pid=5356) WARNING 05-02 22:27:08 [envs.py:1786] Unknown vLLM environment variable detected: VLLM_SLEEP_WHEN_IDLE
(APIServer pid=5356) INFO 05-02 22:27:08 [model.py:554] Resolved architecture: Qwen3_5ForConditionalGeneration
(APIServer pid=5356) INFO 05-02 22:27:08 [model.py:1685] Using max model len 120000
(APIServer pid=5356) INFO 05-02 22:27:09 [cache.py:253] Using fp8_e4m3 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor
(APIServer pid=5356) INFO 05-02 22:27:09 [model.py:554] Resolved architecture: Qwen3_5MTP
(APIServer pid=5356) INFO 05-02 22:27:09 [model.py:1685] Using max model len 262144
(APIServer pid=5356) WARNING 05-02 22:27:09 [speculative.py:521] Enabling num_speculative_tokens > 1 will run multiple times of forward on same MTP layer,which may result in lower acceptance rate
(APIServer pid=5356) INFO 05-02 22:27:09 [scheduler.py:238] Chunked prefill is enabled with max_num_batched_tokens=4128.
(APIServer pid=5356) WARNING 05-02 22:27:09 [config.py:306] Mamba cache mode is set to 'align' for Qwen3_5ForConditionalGeneration by default when prefix caching is enabled
(APIServer pid=5356) INFO 05-02 22:27:09 [config.py:326] Warning: Prefix caching in Mamba cache 'align' mode is currently enabled. Its support for Mamba layers is experimental. Please report any issues you may observe.
(APIServer pid=5356) INFO 05-02 22:27:09 [vllm.py:799] Asynchronous scheduling is enabled.
(APIServer pid=5356) INFO 05-02 22:27:09 [kernel.py:199] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'])
(APIServer pid=5356) WARNING 05-02 22:27:09 [vllm.py:1362] max_num_scheduled_tokens is set to 4128 based on the speculative decoding settings. This may lead to suboptimal performance. Consider increasing max_num_batched_tokens to accommodate the additional draft token slots, or decrease num_speculative_tokens or max_num_seqs.
(APIServer pid=5356) INFO 05-02 22:27:10 [registry.py:126] All limits of multimodal modalities supported by the model are set to 0, running in text-only mode.
(EngineCore pid=25592) INFO 05-02 22:27:19 [core.py:105] Initializing a V1 LLM engine (v0.19.0) with config: model='H:\\qwen3.6-windows-server\\models\\Qwen3.6-27B-int4-AutoRound', speculative_config=SpeculativeConfig(method='mtp', model='H:\\qwen3.6-windows-server\\models\\Qwen3.6-27B-int4-AutoRound', num_spec_tokens=4), tokenizer='H:\\qwen3.6-windows-server\\models\\Qwen3.6-27B-int4-AutoRound', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=120000, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=inc, quantization_config=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=fp8_e4m3, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='qwen3', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=qwen3.6-27b-autoround, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'ir_enable_torch_wrap': True, 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 
'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_images_per_batch': 0, 'compile_sizes': [], 'compile_ranges_endpoints': [4128], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 8, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['native']), enable_flashinfer_autotune=True, moe_backend='auto')
(EngineCore pid=25592) INFO 05-02 22:27:20 [registry.py:126] All limits of multimodal modalities supported by the model are set to 0, running in text-only mode.
(EngineCore pid=25592) INFO 05-02 22:27:20 [parallel_state.py:1455] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.19.0.1:55065 backend=gloo
[W502 22:27:20.000000000 socket.cpp:764] [c10d] The client socket has failed to connect to [Shush]:55065 (system error: 10049 - The requested address is not valid in its context.).
[W502 22:27:20.000000000 socket.cpp:764] [c10d] The client socket has failed to connect to [Shush]:55065 (system error: 10049 - The requested address is not valid in its context.).
(EngineCore pid=25592) INFO 05-02 22:27:21 [parallel_state.py:1767] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore pid=25592) ERROR 05-02 22:27:21 [core.py:1108] EngineCore failed to start.
(EngineCore pid=25592) ERROR 05-02 22:27:21 [core.py:1108] Traceback (most recent call last):
(EngineCore pid=25592) ERROR 05-02 22:27:21 [core.py:1108]   File "H:\qwen3.6-windows-server\python\Lib\site-packages\vllm\v1\engine\core.py", line 1082, in run_engine_core
(EngineCore pid=25592) ERROR 05-02 22:27:21 [core.py:1108]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=25592) ERROR 05-02 22:27:21 [core.py:1108]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=25592) ERROR 05-02 22:27:21 [core.py:1108]   File "H:\qwen3.6-windows-server\python\Lib\site-packages\vllm\tracing\otel.py", line 178, in sync_wrapper
(EngineCore pid=25592) ERROR 05-02 22:27:21 [core.py:1108]     return func(*args, **kwargs)
(EngineCore pid=25592) ERROR 05-02 22:27:21 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=25592) ERROR 05-02 22:27:21 [core.py:1108]   File "H:\qwen3.6-windows-server\python\Lib\site-packages\vllm\v1\engine\core.py", line 848, in __init__
(EngineCore pid=25592) ERROR 05-02 22:27:21 [core.py:1108]     super().__init__(
(EngineCore pid=25592) ERROR 05-02 22:27:21 [core.py:1108]   File "H:\qwen3.6-windows-server\python\Lib\site-packages\vllm\v1\engine\core.py", line 114, in __init__
(EngineCore pid=25592) ERROR 05-02 22:27:21 [core.py:1108]     self.model_executor = executor_class(vllm_config)
(EngineCore pid=25592) ERROR 05-02 22:27:21 [core.py:1108]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=25592) ERROR 05-02 22:27:21 [core.py:1108]   File "H:\qwen3.6-windows-server\python\Lib\site-packages\vllm\tracing\otel.py", line 178, in sync_wrapper
(EngineCore pid=25592) ERROR 05-02 22:27:21 [core.py:1108]     return func(*args, **kwargs)
(EngineCore pid=25592) ERROR 05-02 22:27:21 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=25592) ERROR 05-02 22:27:21 [core.py:1108]   File "H:\qwen3.6-windows-server\python\Lib\site-packages\vllm\v1\executor\abstract.py", line 109, in __init__
(EngineCore pid=25592) ERROR 05-02 22:27:21 [core.py:1108]     self._init_executor()
(EngineCore pid=25592) ERROR 05-02 22:27:21 [core.py:1108]   File "H:\qwen3.6-windows-server\python\Lib\site-packages\vllm\v1\executor\uniproc_executor.py", line 47, in _init_executor
(EngineCore pid=25592) ERROR 05-02 22:27:21 [core.py:1108]     self.driver_worker.init_device()
(EngineCore pid=25592) ERROR 05-02 22:27:21 [core.py:1108]   File "H:\qwen3.6-windows-server\python\Lib\site-packages\vllm\v1\worker\worker_base.py", line 312, in init_device
(EngineCore pid=25592) ERROR 05-02 22:27:21 [core.py:1108]     self.worker.init_device()  # type: ignore
(EngineCore pid=25592) ERROR 05-02 22:27:21 [core.py:1108]     ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=25592) ERROR 05-02 22:27:21 [core.py:1108]   File "H:\qwen3.6-windows-server\python\Lib\site-packages\vllm\tracing\otel.py", line 178, in sync_wrapper
(EngineCore pid=25592) ERROR 05-02 22:27:21 [core.py:1108]     return func(*args, **kwargs)
(EngineCore pid=25592) ERROR 05-02 22:27:21 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=25592) ERROR 05-02 22:27:21 [core.py:1108]   File "H:\qwen3.6-windows-server\python\Lib\site-packages\vllm\v1\worker\gpu_worker.py", line 311, in init_device
(EngineCore pid=25592) ERROR 05-02 22:27:21 [core.py:1108]     self.model_runner = GPUModelRunnerV1(self.vllm_config, self.device)
(EngineCore pid=25592) ERROR 05-02 22:27:21 [core.py:1108]                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=25592) ERROR 05-02 22:27:21 [core.py:1108]   File "H:\qwen3.6-windows-server\python\Lib\site-packages\vllm\v1\worker\gpu_model_runner.py", line 480, in __init__
(EngineCore pid=25592) ERROR 05-02 22:27:21 [core.py:1108]     self.sampler = Sampler(logprobs_mode=self.model_config.logprobs_mode)
(EngineCore pid=25592) ERROR 05-02 22:27:21 [core.py:1108]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=25592) ERROR 05-02 22:27:21 [core.py:1108]   File "H:\qwen3.6-windows-server\python\Lib\site-packages\vllm\v1\sample\sampler.py", line 64, in __init__
(EngineCore pid=25592) ERROR 05-02 22:27:21 [core.py:1108]     self.topk_topp_sampler = TopKTopPSampler(logprobs_mode)
(EngineCore pid=25592) ERROR 05-02 22:27:21 [core.py:1108]                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=25592) ERROR 05-02 22:27:21 [core.py:1108]   File "H:\qwen3.6-windows-server\python\Lib\site-packages\vllm\v1\sample\ops\topk_topp_sampler.py", line 40, in __init__
(EngineCore pid=25592) ERROR 05-02 22:27:21 [core.py:1108]     from vllm.v1.attention.backends.flashinfer import FlashInferBackend
(EngineCore pid=25592) ERROR 05-02 22:27:21 [core.py:1108]   File "H:\qwen3.6-windows-server\python\Lib\site-packages\vllm\v1\attention\backends\flashinfer.py", line 11, in <module>
(EngineCore pid=25592) ERROR 05-02 22:27:21 [core.py:1108]     from flashinfer import (
(EngineCore pid=25592) ERROR 05-02 22:27:21 [core.py:1108]   File "H:\qwen3.6-windows-server\python\Lib\site-packages\flashinfer\__init__.py", line 24, in <module>
(EngineCore pid=25592) ERROR 05-02 22:27:21 [core.py:1108]     from . import jit as jit
(EngineCore pid=25592) ERROR 05-02 22:27:21 [core.py:1108]   File "H:\qwen3.6-windows-server\python\Lib\site-packages\flashinfer\jit\__init__.py", line 125, in <module>
(EngineCore pid=25592) ERROR 05-02 22:27:21 [core.py:1108]     raise ValueError(
(EngineCore pid=25592) ERROR 05-02 22:27:21 [core.py:1108] ValueError: CUDA_LIB_PATH is not set. CUDA_LIB_PATH need to be set with the absolute path to CUDA root folder on Windows (for example, set CUDA_LIB_PATH=C:\CUDA\v12.4)
(EngineCore pid=25592) Process EngineCore:
(EngineCore pid=25592) Traceback (most recent call last):
(EngineCore pid=25592)   File "multiprocessing\process.py", line 314, in _bootstrap
(EngineCore pid=25592)   File "multiprocessing\process.py", line 108, in run
(EngineCore pid=25592)   File "H:\qwen3.6-windows-server\python\Lib\site-packages\vllm\v1\engine\core.py", line 1112, in run_engine_core
(EngineCore pid=25592)     raise e
(EngineCore pid=25592)   File "H:\qwen3.6-windows-server\python\Lib\site-packages\vllm\v1\engine\core.py", line 1082, in run_engine_core
(EngineCore pid=25592)     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=25592)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=25592)   File "H:\qwen3.6-windows-server\python\Lib\site-packages\vllm\tracing\otel.py", line 178, in sync_wrapper
(EngineCore pid=25592)     return func(*args, **kwargs)
(EngineCore pid=25592)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=25592)   File "H:\qwen3.6-windows-server\python\Lib\site-packages\vllm\v1\engine\core.py", line 848, in __init__
(EngineCore pid=25592)     super().__init__(
(EngineCore pid=25592)   File "H:\qwen3.6-windows-server\python\Lib\site-packages\vllm\v1\engine\core.py", line 114, in __init__
(EngineCore pid=25592)     self.model_executor = executor_class(vllm_config)
(EngineCore pid=25592)                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=25592)   File "H:\qwen3.6-windows-server\python\Lib\site-packages\vllm\tracing\otel.py", line 178, in sync_wrapper
(EngineCore pid=25592)     return func(*args, **kwargs)
(EngineCore pid=25592)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=25592)   File "H:\qwen3.6-windows-server\python\Lib\site-packages\vllm\v1\executor\abstract.py", line 109, in __init__
(EngineCore pid=25592)     self._init_executor()
(EngineCore pid=25592)   File "H:\qwen3.6-windows-server\python\Lib\site-packages\vllm\v1\executor\uniproc_executor.py", line 47, in _init_executor
(EngineCore pid=25592)     self.driver_worker.init_device()
(EngineCore pid=25592)   File "H:\qwen3.6-windows-server\python\Lib\site-packages\vllm\v1\worker\worker_base.py", line 312, in init_device
(EngineCore pid=25592)     self.worker.init_device()  # type: ignore
(EngineCore pid=25592)     ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=25592)   File "H:\qwen3.6-windows-server\python\Lib\site-packages\vllm\tracing\otel.py", line 178, in sync_wrapper
(EngineCore pid=25592)     return func(*args, **kwargs)
(EngineCore pid=25592)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=25592)   File "H:\qwen3.6-windows-server\python\Lib\site-packages\vllm\v1\worker\gpu_worker.py", line 311, in init_device
(EngineCore pid=25592)     self.model_runner = GPUModelRunnerV1(self.vllm_config, self.device)
(EngineCore pid=25592)                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=25592)   File "H:\qwen3.6-windows-server\python\Lib\site-packages\vllm\v1\worker\gpu_model_runner.py", line 480, in __init__
(EngineCore pid=25592)     self.sampler = Sampler(logprobs_mode=self.model_config.logprobs_mode)
(EngineCore pid=25592)                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=25592)   File "H:\qwen3.6-windows-server\python\Lib\site-packages\vllm\v1\sample\sampler.py", line 64, in __init__
(EngineCore pid=25592)     self.topk_topp_sampler = TopKTopPSampler(logprobs_mode)
(EngineCore pid=25592)                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=25592)   File "H:\qwen3.6-windows-server\python\Lib\site-packages\vllm\v1\sample\ops\topk_topp_sampler.py", line 40, in __init__
(EngineCore pid=25592)     from vllm.v1.attention.backends.flashinfer import FlashInferBackend
(EngineCore pid=25592)   File "H:\qwen3.6-windows-server\python\Lib\site-packages\vllm\v1\attention\backends\flashinfer.py", line 11, in <module>
(EngineCore pid=25592)     from flashinfer import (
(EngineCore pid=25592)   File "H:\qwen3.6-windows-server\python\Lib\site-packages\flashinfer\__init__.py", line 24, in <module>
(EngineCore pid=25592)     from . import jit as jit
(EngineCore pid=25592)   File "H:\qwen3.6-windows-server\python\Lib\site-packages\flashinfer\jit\__init__.py", line 125, in <module>
(EngineCore pid=25592)     raise ValueError(
(EngineCore pid=25592) ValueError: CUDA_LIB_PATH is not set. CUDA_LIB_PATH need to be set with the absolute path to CUDA root folder on Windows (for example, set CUDA_LIB_PATH=C:\CUDA\v12.4)
forrtl: error (200): program aborting due to window-CLOSE event
Image              PC                Routine            Line        Source
KERNELBASE.dll     00007FFDD586A3CD  Unknown               Unknown  Unknown
KERNEL32.DLL       00007FFDD794E8D7  Unknown               Unknown  Unknown
ntdll.dll          00007FFDD828C3FC  Unknown               Unknown  Unknown
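
The `ValueError` at the bottom of both tracebacks comes from `flashinfer\jit\__init__.py`, which the sampler imports during engine start-up, and it asks for `CUDA_LIB_PATH` to hold the absolute path to the CUDA root folder. A minimal sketch of a pre-launch check follows. It assumes that the CUDA toolkit root is already exposed through `CUDA_PATH` or `CUDA_HOME`, and that the variable is set in the launcher before vLLM is imported. The helper names `resolve_cuda_lib_path` and `ensure_cuda_lib_path` are hypothetical and not part of vLLM or FlashInfer.

```python
import os
from pathlib import Path

# Candidate variables, checked in order. Reusing CUDA_PATH / CUDA_HOME as the
# source for CUDA_LIB_PATH is an assumption of this sketch, not documented
# FlashInfer behaviour.
CANDIDATE_VARS = ("CUDA_LIB_PATH", "CUDA_PATH", "CUDA_HOME")


def resolve_cuda_lib_path(env=None):
    """Return the first candidate that names an existing directory, else None."""
    env = os.environ if env is None else env
    for name in CANDIDATE_VARS:
        value = env.get(name)
        if value and Path(value).is_dir():
            return str(Path(value).resolve())
    return None


def ensure_cuda_lib_path(env=None):
    """Set CUDA_LIB_PATH in `env` if a CUDA root can be found.

    Returns True when CUDA_LIB_PATH ends up pointing at an existing directory,
    False when no candidate was found and the variable was left untouched.
    """
    env = os.environ if env is None else env
    path = resolve_cuda_lib_path(env)
    if path is None:
        return False
    env["CUDA_LIB_PATH"] = path
    return True
```

Calling `ensure_cuda_lib_path()` at the top of the launch script, before any `vllm` import, would let the spawned EngineCore process inherit the variable. If it returns `False`, the path has to be set by hand, as the error message suggests.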