CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=1,0,2,4 VLLM_PP_LAYER_PARTITION="3,8,27,8" vllm serve \
    /mnt/llms/models/zai-org/GLM-4.5-Air-FP8/ \
    --served-model-name GLM \
    --swap-space 16 \
    --max-num-seqs 512 \
    --max-model-len 8192 \
    --max-seq-len-to-capture 8192 \
    --gpu-memory-utilization 0.9 \
    -pp 4 \
    --trust-remote-code \
    --disable-log-requests \
    --host 0.0.0.0 \
    --port 8000 --enforce-eager
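
Quick sanity check before launching, since VLLM_PP_LAYER_PARTITION needs one entry per -pp stage and, as far as I know, the entries have to sum to the model's num_hidden_layers. Sketch only, assuming jq is installed and the usual Hugging Face config.json layout:

    MODEL_DIR=/mnt/llms/models/zai-org/GLM-4.5-Air-FP8
    PARTITION="3,8,27,8"
    # layers the model actually has, per its config.json
    jq .num_hidden_layers "$MODEL_DIR/config.json"
    # stage count and layer total implied by VLLM_PP_LAYER_PARTITION
    echo "$PARTITION" | awk -F, '{s=0; for(i=1;i<=NF;i++) s+=$i; print NF" stages, "s" layers total"}'

If the partition total and the config layer count disagree, it is worth fixing before sitting through a long model load.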
- INFO 09-09 20:28:26 [__init__.py:216] Automatically detected platform cuda.
- WARNING 09-09 20:28:29 [__init__.py:1756] argument '--disable-log-requests' is deprecated and replaced with '--enable-log-requests'. This will be removed in v0.12.0.
- (APIServer pid=511717) INFO 09-09 20:28:29 [api_server.py:1896] vLLM API server version 0.10.2rc2.dev191+g6fb278816.d20250909
- (APIServer pid=511717) INFO 09-09 20:28:29 [utils.py:328] non-default args: {'model_tag': '/mnt/llms/models/zai-org/GLM-4.5-Air-FP8/', 'host': '0.0.0.0', 'model': '/mnt/llms/models/zai-org/GLM-4.5-Air-FP8/', 'trust_remote_code': True, 'max_model_len': 8192, 'enforce_eager': True, 'served_model_name': ['GLM'], 'pipeline_parallel_size': 4, 'swap_space': 16.0, 'max_num_seqs': 512}
- (APIServer pid=511717) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
- (APIServer pid=511717) INFO 09-09 20:28:35 [__init__.py:744] Resolved architecture: Glm4MoeForCausalLM
- (APIServer pid=511717) `torch_dtype` is deprecated! Use `dtype` instead!
- (APIServer pid=511717) INFO 09-09 20:28:35 [__init__.py:1772] Using max model len 8192
- (APIServer pid=511717) WARNING 09-09 20:28:35 [_ipex_ops.py:16] Import error msg: No module named 'intel_extension_for_pytorch'
- (APIServer pid=511717) INFO 09-09 20:28:35 [scheduler.py:222] Chunked prefill is enabled with max_num_batched_tokens=2048.
- (APIServer pid=511717) INFO 09-09 20:28:35 [__init__.py:3499] Cudagraph is disabled under eager mode
- INFO 09-09 20:28:40 [__init__.py:216] Automatically detected platform cuda.
- (EngineCore_DP0 pid=511959) INFO 09-09 20:28:42 [core.py:654] Waiting for init message from front-end.
- (EngineCore_DP0 pid=511959) INFO 09-09 20:28:42 [core.py:76] Initializing a V1 LLM engine (v0.10.2rc2.dev191+g6fb278816.d20250909) with config: model='/mnt/llms/models/zai-org/GLM-4.5-Air-FP8/', speculative_config=None, tokenizer='/mnt/llms/models/zai-org/GLM-4.5-Air-FP8/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=4, data_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=GLM, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=False, pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":null,"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":0,"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{},"max_capture_size":0,"local_cache_dir":null}
- (EngineCore_DP0 pid=511959) WARNING 09-09 20:28:42 [multiproc_worker_utils.py:273] Reducing Torch parallelism from 24 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
- (EngineCore_DP0 pid=511959) INFO 09-09 20:28:42 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1, 2, 3], buffer_handle=(4, 16777216, 10, 'psm_1a765819'), local_subscribe_addr='ipc:///tmp/d78e8889-6bf4-43df-a30a-613b5ff31999', remote_subscribe_addr=None, remote_addr_ipv6=False)
- INFO 09-09 20:28:46 [__init__.py:216] Automatically detected platform cuda.
- INFO 09-09 20:28:46 [__init__.py:216] Automatically detected platform cuda.
- INFO 09-09 20:28:46 [__init__.py:216] Automatically detected platform cuda.
- INFO 09-09 20:28:46 [__init__.py:216] Automatically detected platform cuda.
- INFO 09-09 20:28:49 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_a95d08bb'), local_subscribe_addr='ipc:///tmp/542d5194-cf38-44d5-9e62-cdd15b8bd909', remote_subscribe_addr=None, remote_addr_ipv6=False)
- INFO 09-09 20:28:49 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_98358131'), local_subscribe_addr='ipc:///tmp/b351aa86-c7e1-461c-9806-50449e56535f', remote_subscribe_addr=None, remote_addr_ipv6=False)
- INFO 09-09 20:28:50 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_bc961b2b'), local_subscribe_addr='ipc:///tmp/77374ef5-7f91-441b-88ad-95ac9000581f', remote_subscribe_addr=None, remote_addr_ipv6=False)
- INFO 09-09 20:28:50 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_8700974d'), local_subscribe_addr='ipc:///tmp/16999e09-b4d5-498a-ba40-055b6bebe96a', remote_subscribe_addr=None, remote_addr_ipv6=False)
- [W909 20:28:50.304204817 socket.cpp:755] [c10d] The client socket cannot be initialized to connect to [localhost]:60461 (errno: 97 - Address family not supported by protocol).
- [W909 20:28:50.377584382 socket.cpp:755] [c10d] The client socket cannot be initialized to connect to [localhost]:60461 (errno: 97 - Address family not supported by protocol).
- [W909 20:28:50.395981055 socket.cpp:755] [c10d] The client socket cannot be initialized to connect to [localhost]:60461 (errno: 97 - Address family not supported by protocol).
- [W909 20:28:50.400721762 socket.cpp:755] [c10d] The client socket cannot be initialized to connect to [localhost]:60461 (errno: 97 - Address family not supported by protocol).
- [W909 20:28:50.401253860 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
- [W909 20:28:50.704116670 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
- [W909 20:28:51.046605499 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
- [W909 20:28:51.052565047 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
- [Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
- [Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
- [Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
- [Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
- [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
- [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
- [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
- [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
- [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
- [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
- [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
- [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
- [Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
- [Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
- [Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
- [Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
- INFO 09-09 20:28:51 [__init__.py:1432] Found nccl from library libnccl.so.2
- INFO 09-09 20:28:51 [__init__.py:1432] Found nccl from library libnccl.so.2
- INFO 09-09 20:28:51 [pynccl.py:70] vLLM is using nccl==2.27.3
- INFO 09-09 20:28:51 [pynccl.py:70] vLLM is using nccl==2.27.3
- INFO 09-09 20:28:51 [__init__.py:1432] Found nccl from library libnccl.so.2
- INFO 09-09 20:28:51 [__init__.py:1432] Found nccl from library libnccl.so.2
- INFO 09-09 20:28:51 [pynccl.py:70] vLLM is using nccl==2.27.3
- INFO 09-09 20:28:51 [pynccl.py:70] vLLM is using nccl==2.27.3
- [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
- [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
- [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
- [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
- [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
- [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
- INFO 09-09 20:28:51 [parallel_state.py:1164] rank 0 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
- INFO 09-09 20:28:51 [parallel_state.py:1164] rank 2 in world size 4 is assigned as DP rank 0, PP rank 2, TP rank 0, EP rank 0
- [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
- [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
- INFO 09-09 20:28:51 [parallel_state.py:1164] rank 3 in world size 4 is assigned as DP rank 0, PP rank 3, TP rank 0, EP rank 0
- INFO 09-09 20:28:51 [parallel_state.py:1164] rank 1 in world size 4 is assigned as DP rank 0, PP rank 1, TP rank 0, EP rank 0
- WARNING 09-09 20:28:51 [topk_topp_sampler.py:69] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
- WARNING 09-09 20:28:51 [topk_topp_sampler.py:69] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
- WARNING 09-09 20:28:51 [topk_topp_sampler.py:69] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
- WARNING 09-09 20:28:51 [topk_topp_sampler.py:69] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
- (Worker_PP2 pid=512070) INFO 09-09 20:28:51 [gpu_model_runner.py:2178] Starting to load model /mnt/llms/models/zai-org/GLM-4.5-Air-FP8/...
- (Worker_PP0 pid=512068) INFO 09-09 20:28:51 [gpu_model_runner.py:2178] Starting to load model /mnt/llms/models/zai-org/GLM-4.5-Air-FP8/...
- (Worker_PP3 pid=512071) INFO 09-09 20:28:51 [gpu_model_runner.py:2178] Starting to load model /mnt/llms/models/zai-org/GLM-4.5-Air-FP8/...
- (Worker_PP1 pid=512069) INFO 09-09 20:28:51 [gpu_model_runner.py:2178] Starting to load model /mnt/llms/models/zai-org/GLM-4.5-Air-FP8/...
- (Worker_PP3 pid=512071) INFO 09-09 20:28:52 [gpu_model_runner.py:2210] Loading model from scratch...
- (Worker_PP1 pid=512069) INFO 09-09 20:28:52 [gpu_model_runner.py:2210] Loading model from scratch...
- (Worker_PP2 pid=512070) INFO 09-09 20:28:52 [gpu_model_runner.py:2210] Loading model from scratch...
- (Worker_PP0 pid=512068) INFO 09-09 20:28:52 [gpu_model_runner.py:2210] Loading model from scratch...
- (Worker_PP1 pid=512069) INFO 09-09 20:28:52 [cuda.py:340] Using Flash Attention backend on V1 engine.
- (Worker_PP3 pid=512071) INFO 09-09 20:28:52 [cuda.py:340] Using Flash Attention backend on V1 engine.
- (Worker_PP2 pid=512070) INFO 09-09 20:28:52 [cuda.py:340] Using Flash Attention backend on V1 engine.
- (Worker_PP0 pid=512068) INFO 09-09 20:28:52 [cuda.py:340] Using Flash Attention backend on V1 engine.
- Loading safetensors checkpoint shards: 0% Completed | 0/47 [00:00<?, ?it/s]
- Loading safetensors checkpoint shards: 6% Completed | 3/47 [00:00<00:01, 25.55it/s]
- Loading safetensors checkpoint shards: 13% Completed | 6/47 [00:00<00:04, 8.32it/s]
- Loading safetensors checkpoint shards: 17% Completed | 8/47 [00:00<00:04, 8.86it/s]
- Loading safetensors checkpoint shards: 23% Completed | 11/47 [00:00<00:03, 11.87it/s]
- Loading safetensors checkpoint shards: 30% Completed | 14/47 [00:01<00:02, 14.22it/s]
- Loading safetensors checkpoint shards: 36% Completed | 17/47 [00:01<00:01, 16.08it/s]
- Loading safetensors checkpoint shards: 43% Completed | 20/47 [00:01<00:01, 17.59it/s]
- Loading safetensors checkpoint shards: 47% Completed | 22/47 [00:01<00:02, 8.87it/s]
- Loading safetensors checkpoint shards: 53% Completed | 25/47 [00:02<00:01, 11.07it/s]
- Loading safetensors checkpoint shards: 60% Completed | 28/47 [00:02<00:01, 13.09it/s]
- Loading safetensors checkpoint shards: 66% Completed | 31/47 [00:02<00:01, 14.85it/s]
- Loading safetensors checkpoint shards: 72% Completed | 34/47 [00:02<00:00, 16.28it/s]
- Loading safetensors checkpoint shards: 77% Completed | 36/47 [00:02<00:00, 12.97it/s]
- Loading safetensors checkpoint shards: 81% Completed | 38/47 [00:02<00:00, 12.87it/s]
- Loading safetensors checkpoint shards: 87% Completed | 41/47 [00:03<00:00, 14.77it/s]
- Loading safetensors checkpoint shards: 94% Completed | 44/47 [00:03<00:00, 16.33it/s]
- Loading safetensors checkpoint shards: 100% Completed | 47/47 [00:03<00:00, 17.46it/s]
- Loading safetensors checkpoint shards: 100% Completed | 47/47 [00:03<00:00, 13.75it/s]
- (Worker_PP0 pid=512068)
- (Worker_PP0 pid=512068) INFO 09-09 20:28:55 [default_loader.py:266] Loading weights took 3.42 seconds
- (Worker_PP0 pid=512068) WARNING 09-09 20:28:55 [marlin_utils_fp8.py:80] Your GPU does not have native support for FP8 computation but FP8 quantization is being used. Weight-only FP8 compression will be used leveraging the Marlin kernel. This may degrade performance for compute-heavy workloads.
- (Worker_PP0 pid=512068) INFO 09-09 20:28:56 [gpu_model_runner.py:2232] Model loading took 5.7690 GiB and 3.784014 seconds
- (Worker_PP1 pid=512069) INFO 09-09 20:28:58 [default_loader.py:266] Loading weights took 6.04 seconds
- (Worker_PP1 pid=512069) WARNING 09-09 20:28:58 [marlin_utils_fp8.py:80] Your GPU does not have native support for FP8 computation but FP8 quantization is being used. Weight-only FP8 compression will be used leveraging the Marlin kernel. This may degrade performance for compute-heavy workloads.
- (Worker_PP3 pid=512071) INFO 09-09 20:28:58 [default_loader.py:266] Loading weights took 6.39 seconds
- (Worker_PP3 pid=512071) WARNING 09-09 20:28:58 [marlin_utils_fp8.py:80] Your GPU does not have native support for FP8 computation but FP8 quantization is being used. Weight-only FP8 compression will be used leveraging the Marlin kernel. This may degrade performance for compute-heavy workloads.
- (Worker_PP1 pid=512069) INFO 09-09 20:28:59 [gpu_model_runner.py:2232] Model loading took 17.4959 GiB and 6.603870 seconds
- (Worker_PP3 pid=512071) INFO 09-09 20:28:59 [gpu_model_runner.py:2232] Model loading took 18.6522 GiB and 6.952122 seconds
- (Worker_PP2 pid=512070) INFO 09-09 20:29:07 [default_loader.py:266] Loading weights took 15.03 seconds
- (Worker_PP2 pid=512070) WARNING 09-09 20:29:07 [marlin_utils_fp8.py:80] Your GPU does not have native support for FP8 computation but FP8 quantization is being used. Weight-only FP8 compression will be used leveraging the Marlin kernel. This may degrade performance for compute-heavy workloads.
- (Worker_PP2 pid=512070) INFO 09-09 20:29:08 [gpu_model_runner.py:2232] Model loading took 59.0111 GiB and 16.300322 seconds
- (Worker_PP0 pid=512068) INFO 09-09 20:29:10 [marlin_utils.py:353] You are running Marlin kernel with bf16 on GPUs before SM90. You can consider change to fp16 to achieve better performance if possible.
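
Possible follow-up to the Marlin bf16 hint just above (untested): relaunch with --dtype float16, a standard vLLM flag, so activations run in fp16 rather than bf16 on these pre-SM90 GPUs. Other flags from the original command are abbreviated here:

    CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=1,0,2,4 VLLM_PP_LAYER_PARTITION="3,8,27,8" vllm serve \
        /mnt/llms/models/zai-org/GLM-4.5-Air-FP8/ \
        --served-model-name GLM \
        --dtype float16 \
        -pp 4 --max-model-len 8192 \
        --host 0.0.0.0 --port 8000 --enforce-eager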
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] WorkerProc hit an exception.
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] Traceback (most recent call last):
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/v1/executor/multiproc_executor.py", line 649, in worker_busy_loop
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] output = func(*args, **kwargs)
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return func(*args, **kwargs)
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/v1/worker/gpu_worker.py", line 244, in determine_available_memory
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] self.model_runner.profile_run()
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/v1/worker/gpu_model_runner.py", line 2847, in profile_run
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] = self._dummy_run(self.max_num_tokens, is_profile=True)
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return func(*args, **kwargs)
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/v1/worker/gpu_model_runner.py", line 2624, in _dummy_run
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] outputs = self.model(
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return self._call_impl(*args, **kwargs)
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return forward_call(*args, **kwargs)
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/model_executor/models/glm4_moe.py", line 672, in forward
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] hidden_states = self.model(input_ids, positions, intermediate_tensors,
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/compilation/decorators.py", line 223, in __call__
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return self.forward(*args, **kwargs)
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/model_executor/models/glm4_moe.py", line 449, in forward
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] hidden_states, residual = layer(positions, hidden_states, residual)
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return self._call_impl(*args, **kwargs)
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return forward_call(*args, **kwargs)
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/model_executor/models/glm4_moe.py", line 376, in forward
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] hidden_states = self.mlp(hidden_states)
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return self._call_impl(*args, **kwargs)
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return forward_call(*args, **kwargs)
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/model_executor/models/glm4_moe.py", line 190, in forward
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] final_hidden_states = self.experts(
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return self._call_impl(*args, **kwargs)
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return forward_call(*args, **kwargs)
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/model_executor/layers/fused_moe/layer.py", line 1615, in forward
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] fused_output = torch.ops.vllm.moe_forward(
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/torch/_ops.py", line 1243, in __call__
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return self._op(*args, **kwargs)
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/model_executor/layers/fused_moe/layer.py", line 1881, in moe_forward
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return self.forward_impl(hidden_states, router_logits)
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/model_executor/layers/fused_moe/layer.py", line 1772, in forward_impl
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] final_hidden_states = self.quant_method.apply(
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py", line 929, in apply
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return torch.ops.vllm.fused_marlin_moe(
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/torch/_ops.py", line 1243, in __call__
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return self._op(*args, **kwargs)
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/model_executor/layers/fused_moe/fused_marlin_moe.py", line 204, in fused_marlin_moe
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return torch.sum(intermediate_cache3.view(*intermediate_cache3.shape),
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] torch.AcceleratorError: CUDA error: an illegal memory access was encountered
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654]
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] Traceback (most recent call last):
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/v1/executor/multiproc_executor.py", line 649, in worker_busy_loop
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] output = func(*args, **kwargs)
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return func(*args, **kwargs)
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/v1/worker/gpu_worker.py", line 244, in determine_available_memory
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] self.model_runner.profile_run()
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/v1/worker/gpu_model_runner.py", line 2847, in profile_run
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] = self._dummy_run(self.max_num_tokens, is_profile=True)
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return func(*args, **kwargs)
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/v1/worker/gpu_model_runner.py", line 2624, in _dummy_run
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] outputs = self.model(
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return self._call_impl(*args, **kwargs)
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return forward_call(*args, **kwargs)
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/model_executor/models/glm4_moe.py", line 672, in forward
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] hidden_states = self.model(input_ids, positions, intermediate_tensors,
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/compilation/decorators.py", line 223, in __call__
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return self.forward(*args, **kwargs)
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/model_executor/models/glm4_moe.py", line 449, in forward
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] hidden_states, residual = layer(positions, hidden_states, residual)
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return self._call_impl(*args, **kwargs)
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return forward_call(*args, **kwargs)
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/model_executor/models/glm4_moe.py", line 376, in forward
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] hidden_states = self.mlp(hidden_states)
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return self._call_impl(*args, **kwargs)
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return forward_call(*args, **kwargs)
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/model_executor/models/glm4_moe.py", line 190, in forward
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] final_hidden_states = self.experts(
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return self._call_impl(*args, **kwargs)
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return forward_call(*args, **kwargs)
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/model_executor/layers/fused_moe/layer.py", line 1615, in forward
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] fused_output = torch.ops.vllm.moe_forward(
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/torch/_ops.py", line 1243, in __call__
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return self._op(*args, **kwargs)
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/model_executor/layers/fused_moe/layer.py", line 1881, in moe_forward
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return self.forward_impl(hidden_states, router_logits)
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/model_executor/layers/fused_moe/layer.py", line 1772, in forward_impl
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] final_hidden_states = self.quant_method.apply(
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py", line 929, in apply
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return torch.ops.vllm.fused_marlin_moe(
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/torch/_ops.py", line 1243, in __call__
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return self._op(*args, **kwargs)
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/model_executor/layers/fused_moe/fused_marlin_moe.py", line 204, in fused_marlin_moe
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return torch.sum(intermediate_cache3.view(*intermediate_cache3.shape),
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] torch.AcceleratorError: CUDA error: an illegal memory access was encountered
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654]
- (Worker_PP1 pid=512069) ERROR 09-09 20:29:10 [multiproc_executor.py:654]
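
The error text above suggests CUDA_LAUNCH_BLOCKING=1 for debugging. A possible re-run with synchronous kernel launches (slow, but the reported stack then points at the kernel that actually faulted rather than a later API call); other flags from the original command omitted for brevity:

    CUDA_LAUNCH_BLOCKING=1 CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=1,0,2,4 \
        VLLM_PP_LAYER_PARTITION="3,8,27,8" vllm serve \
        /mnt/llms/models/zai-org/GLM-4.5-Air-FP8/ \
        --served-model-name GLM -pp 4 --max-model-len 8192 \
        --host 0.0.0.0 --port 8000 --enforce-eager

The TORCH_USE_CUDA_DSA option also mentioned in the traceback needs a PyTorch build compiled with device-side assertions, so only the environment variable is shown here.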
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] WorkerProc hit an exception.
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] Traceback (most recent call last):
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/v1/executor/multiproc_executor.py", line 649, in worker_busy_loop
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] output = func(*args, **kwargs)
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return func(*args, **kwargs)
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/v1/worker/gpu_worker.py", line 244, in determine_available_memory
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] self.model_runner.profile_run()
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/v1/worker/gpu_model_runner.py", line 2847, in profile_run
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] = self._dummy_run(self.max_num_tokens, is_profile=True)
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return func(*args, **kwargs)
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/v1/worker/gpu_model_runner.py", line 2624, in _dummy_run
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] outputs = self.model(
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return self._call_impl(*args, **kwargs)
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return forward_call(*args, **kwargs)
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/model_executor/models/glm4_moe.py", line 672, in forward
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] hidden_states = self.model(input_ids, positions, intermediate_tensors,
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/compilation/decorators.py", line 223, in __call__
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return self.forward(*args, **kwargs)
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/model_executor/models/glm4_moe.py", line 449, in forward
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] hidden_states, residual = layer(positions, hidden_states, residual)
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return self._call_impl(*args, **kwargs)
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return forward_call(*args, **kwargs)
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/model_executor/models/glm4_moe.py", line 376, in forward
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] hidden_states = self.mlp(hidden_states)
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return self._call_impl(*args, **kwargs)
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return forward_call(*args, **kwargs)
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/model_executor/models/glm4_moe.py", line 190, in forward
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] final_hidden_states = self.experts(
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return self._call_impl(*args, **kwargs)
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return forward_call(*args, **kwargs)
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/model_executor/layers/fused_moe/layer.py", line 1615, in forward
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] fused_output = torch.ops.vllm.moe_forward(
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/torch/_ops.py", line 1243, in __call__
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return self._op(*args, **kwargs)
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/model_executor/layers/fused_moe/layer.py", line 1881, in moe_forward
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return self.forward_impl(hidden_states, router_logits)
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/model_executor/layers/fused_moe/layer.py", line 1772, in forward_impl
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] final_hidden_states = self.quant_method.apply(
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py", line 929, in apply
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return torch.ops.vllm.fused_marlin_moe(
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/torch/_ops.py", line 1243, in __call__
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return self._op(*args, **kwargs)
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/model_executor/layers/fused_moe/fused_marlin_moe.py", line 204, in fused_marlin_moe
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return torch.sum(intermediate_cache3.view(*intermediate_cache3.shape),
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] torch.AcceleratorError: CUDA error: an illegal memory access was encountered
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654]
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] Traceback (most recent call last):
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/v1/executor/multiproc_executor.py", line 649, in worker_busy_loop
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] output = func(*args, **kwargs)
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return func(*args, **kwargs)
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/v1/worker/gpu_worker.py", line 244, in determine_available_memory
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] self.model_runner.profile_run()
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/v1/worker/gpu_model_runner.py", line 2847, in profile_run
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] = self._dummy_run(self.max_num_tokens, is_profile=True)
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return func(*args, **kwargs)
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/v1/worker/gpu_model_runner.py", line 2624, in _dummy_run
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] outputs = self.model(
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return self._call_impl(*args, **kwargs)
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return forward_call(*args, **kwargs)
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/model_executor/models/glm4_moe.py", line 672, in forward
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] hidden_states = self.model(input_ids, positions, intermediate_tensors,
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/compilation/decorators.py", line 223, in __call__
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return self.forward(*args, **kwargs)
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/model_executor/models/glm4_moe.py", line 449, in forward
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] hidden_states, residual = layer(positions, hidden_states, residual)
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return self._call_impl(*args, **kwargs)
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return forward_call(*args, **kwargs)
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/model_executor/models/glm4_moe.py", line 376, in forward
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] hidden_states = self.mlp(hidden_states)
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return self._call_impl(*args, **kwargs)
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return forward_call(*args, **kwargs)
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/model_executor/models/glm4_moe.py", line 190, in forward
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] final_hidden_states = self.experts(
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return self._call_impl(*args, **kwargs)
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return forward_call(*args, **kwargs)
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/model_executor/layers/fused_moe/layer.py", line 1615, in forward
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] fused_output = torch.ops.vllm.moe_forward(
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/torch/_ops.py", line 1243, in __call__
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return self._op(*args, **kwargs)
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/model_executor/layers/fused_moe/layer.py", line 1881, in moe_forward
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return self.forward_impl(hidden_states, router_logits)
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/model_executor/layers/fused_moe/layer.py", line 1772, in forward_impl
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] final_hidden_states = self.quant_method.apply(
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py", line 929, in apply
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return torch.ops.vllm.fused_marlin_moe(
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/torch/_ops.py", line 1243, in __call__
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return self._op(*args, **kwargs)
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/model_executor/layers/fused_moe/fused_marlin_moe.py", line 204, in fused_marlin_moe
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return torch.sum(intermediate_cache3.view(*intermediate_cache3.shape),
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] torch.AcceleratorError: CUDA error: an illegal memory access was encountered
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654]
- (Worker_PP3 pid=512071) ERROR 09-09 20:29:10 [multiproc_executor.py:654]
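The Worker_PP3 trace bottoms out in the torch.sum over intermediate_cache3 inside fused_marlin_moe, but CUDA reports kernel faults asynchronously, so this frame may not be the kernel that actually faulted. A minimal sketch of a rerun with synchronous launches, following the hint in the error message itself (the model path and serve flags below are placeholders for whatever command produced this log):

  # Force synchronous kernel launches so the reported Python frame lines up with the faulting kernel.
  CUDA_LAUNCH_BLOCKING=1 vllm serve <model-path> <same-serve-flags-as-before>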
- (Worker_PP0 pid=512068) INFO 09-09 20:29:10 [gpu_worker.py:276] Available KV cache memory: 15.07 GiB
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] WorkerProc hit an exception.
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] Traceback (most recent call last):
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/v1/executor/multiproc_executor.py", line 649, in worker_busy_loop
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] output = func(*args, **kwargs)
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return func(*args, **kwargs)
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/v1/worker/gpu_worker.py", line 244, in determine_available_memory
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] self.model_runner.profile_run()
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/v1/worker/gpu_model_runner.py", line 2847, in profile_run
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] = self._dummy_run(self.max_num_tokens, is_profile=True)
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return func(*args, **kwargs)
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/v1/worker/gpu_model_runner.py", line 2624, in _dummy_run
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] outputs = self.model(
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return self._call_impl(*args, **kwargs)
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return forward_call(*args, **kwargs)
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/model_executor/models/glm4_moe.py", line 672, in forward
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] hidden_states = self.model(input_ids, positions, intermediate_tensors,
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/compilation/decorators.py", line 223, in __call__
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return self.forward(*args, **kwargs)
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/model_executor/models/glm4_moe.py", line 449, in forward
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] hidden_states, residual = layer(positions, hidden_states, residual)
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return self._call_impl(*args, **kwargs)
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return forward_call(*args, **kwargs)
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/model_executor/models/glm4_moe.py", line 376, in forward
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] hidden_states = self.mlp(hidden_states)
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return self._call_impl(*args, **kwargs)
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return forward_call(*args, **kwargs)
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/model_executor/models/glm4_moe.py", line 190, in forward
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] final_hidden_states = self.experts(
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return self._call_impl(*args, **kwargs)
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return forward_call(*args, **kwargs)
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/model_executor/layers/fused_moe/layer.py", line 1615, in forward
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] fused_output = torch.ops.vllm.moe_forward(
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/torch/_ops.py", line 1243, in __call__
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return self._op(*args, **kwargs)
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/model_executor/layers/fused_moe/layer.py", line 1881, in moe_forward
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return self.forward_impl(hidden_states, router_logits)
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/model_executor/layers/fused_moe/layer.py", line 1772, in forward_impl
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] final_hidden_states = self.quant_method.apply(
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py", line 929, in apply
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return torch.ops.vllm.fused_marlin_moe(
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/torch/_ops.py", line 1243, in __call__
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return self._op(*args, **kwargs)
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/model_executor/layers/fused_moe/fused_marlin_moe.py", line 204, in fused_marlin_moe
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return torch.sum(intermediate_cache3.view(*intermediate_cache3.shape),
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] torch.AcceleratorError: CUDA error: an illegal memory access was encountered
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654]
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] Traceback (most recent call last):
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/v1/executor/multiproc_executor.py", line 649, in worker_busy_loop
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] output = func(*args, **kwargs)
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return func(*args, **kwargs)
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/v1/worker/gpu_worker.py", line 244, in determine_available_memory
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] self.model_runner.profile_run()
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/v1/worker/gpu_model_runner.py", line 2847, in profile_run
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] = self._dummy_run(self.max_num_tokens, is_profile=True)
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return func(*args, **kwargs)
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/v1/worker/gpu_model_runner.py", line 2624, in _dummy_run
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] outputs = self.model(
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return self._call_impl(*args, **kwargs)
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return forward_call(*args, **kwargs)
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/model_executor/models/glm4_moe.py", line 672, in forward
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] hidden_states = self.model(input_ids, positions, intermediate_tensors,
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/compilation/decorators.py", line 223, in __call__
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return self.forward(*args, **kwargs)
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/model_executor/models/glm4_moe.py", line 449, in forward
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] hidden_states, residual = layer(positions, hidden_states, residual)
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return self._call_impl(*args, **kwargs)
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return forward_call(*args, **kwargs)
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/model_executor/models/glm4_moe.py", line 376, in forward
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] hidden_states = self.mlp(hidden_states)
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return self._call_impl(*args, **kwargs)
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return forward_call(*args, **kwargs)
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/model_executor/models/glm4_moe.py", line 190, in forward
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] final_hidden_states = self.experts(
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return self._call_impl(*args, **kwargs)
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return forward_call(*args, **kwargs)
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/model_executor/layers/fused_moe/layer.py", line 1615, in forward
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] fused_output = torch.ops.vllm.moe_forward(
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/torch/_ops.py", line 1243, in __call__
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return self._op(*args, **kwargs)
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/model_executor/layers/fused_moe/layer.py", line 1881, in moe_forward
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return self.forward_impl(hidden_states, router_logits)
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/model_executor/layers/fused_moe/layer.py", line 1772, in forward_impl
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] final_hidden_states = self.quant_method.apply(
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py", line 929, in apply
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return torch.ops.vllm.fused_marlin_moe(
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/torch/_ops.py", line 1243, in __call__
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return self._op(*args, **kwargs)
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] File "/home/ubuntuai/vllm_source/vllm/model_executor/layers/fused_moe/fused_marlin_moe.py", line 204, in fused_marlin_moe
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] return torch.sum(intermediate_cache3.view(*intermediate_cache3.shape),
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] torch.AcceleratorError: CUDA error: an illegal memory access was encountered
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654]
- (Worker_PP2 pid=512070) ERROR 09-09 20:29:10 [multiproc_executor.py:654]
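Worker_PP2 fails at exactly the same line of fused_marlin_moe.py as Worker_PP3, which points at the fused Marlin MoE kernel (or the sizing of its intermediate caches) rather than a single faulty GPU. If a synchronous rerun still lands here, an out-of-bounds access can be confirmed with NVIDIA's memcheck tool; a sketch, assuming compute-sanitizer is installed and the heavy slowdown is acceptable for a one-off run:

  # memcheck reports illegal reads/writes at the offending kernel; --target-processes all
  # is needed because vLLM spawns its workers as separate processes.
  compute-sanitizer --tool memcheck --target-processes all vllm serve <model-path> <same-serve-flags-as-before>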
- (EngineCore_DP0 pid=511959) ERROR 09-09 20:29:11 [core.py:718] EngineCore failed to start.
- (EngineCore_DP0 pid=511959) ERROR 09-09 20:29:11 [core.py:718] Traceback (most recent call last):
- (EngineCore_DP0 pid=511959) ERROR 09-09 20:29:11 [core.py:718] File "/home/ubuntuai/vllm_source/vllm/v1/engine/core.py", line 709, in run_engine_core
- (EngineCore_DP0 pid=511959) ERROR 09-09 20:29:11 [core.py:718] engine_core = EngineCoreProc(*args, **kwargs)
- (EngineCore_DP0 pid=511959) ERROR 09-09 20:29:11 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (EngineCore_DP0 pid=511959) ERROR 09-09 20:29:11 [core.py:718] File "/home/ubuntuai/vllm_source/vllm/v1/engine/core.py", line 505, in __init__
- (EngineCore_DP0 pid=511959) ERROR 09-09 20:29:11 [core.py:718] super().__init__(vllm_config, executor_class, log_stats,
- (EngineCore_DP0 pid=511959) ERROR 09-09 20:29:11 [core.py:718] File "/home/ubuntuai/vllm_source/vllm/v1/engine/core.py", line 91, in __init__
- (EngineCore_DP0 pid=511959) ERROR 09-09 20:29:11 [core.py:718] self._initialize_kv_caches(vllm_config)
- (EngineCore_DP0 pid=511959) ERROR 09-09 20:29:11 [core.py:718] File "/home/ubuntuai/vllm_source/vllm/v1/engine/core.py", line 183, in _initialize_kv_caches
- (EngineCore_DP0 pid=511959) ERROR 09-09 20:29:11 [core.py:718] self.model_executor.determine_available_memory())
- (EngineCore_DP0 pid=511959) ERROR 09-09 20:29:11 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (EngineCore_DP0 pid=511959) ERROR 09-09 20:29:11 [core.py:718] File "/home/ubuntuai/vllm_source/vllm/v1/executor/abstract.py", line 84, in determine_available_memory
- (EngineCore_DP0 pid=511959) ERROR 09-09 20:29:11 [core.py:718] return self.collective_rpc("determine_available_memory")
- (EngineCore_DP0 pid=511959) ERROR 09-09 20:29:11 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (EngineCore_DP0 pid=511959) ERROR 09-09 20:29:11 [core.py:718] File "/home/ubuntuai/vllm_source/vllm/v1/executor/multiproc_executor.py", line 257, in collective_rpc
- (EngineCore_DP0 pid=511959) ERROR 09-09 20:29:11 [core.py:718] result = result.result()
- (EngineCore_DP0 pid=511959) ERROR 09-09 20:29:11 [core.py:718] ^^^^^^^^^^^^^^^
- (EngineCore_DP0 pid=511959) ERROR 09-09 20:29:11 [core.py:718] File "/usr/lib/python3.12/concurrent/futures/_base.py", line 456, in result
- (EngineCore_DP0 pid=511959) ERROR 09-09 20:29:11 [core.py:718] return self.__get_result()
- (EngineCore_DP0 pid=511959) ERROR 09-09 20:29:11 [core.py:718] ^^^^^^^^^^^^^^^^^^^
- (EngineCore_DP0 pid=511959) ERROR 09-09 20:29:11 [core.py:718] File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
- (EngineCore_DP0 pid=511959) ERROR 09-09 20:29:11 [core.py:718] raise self._exception
- (EngineCore_DP0 pid=511959) ERROR 09-09 20:29:11 [core.py:718] File "/usr/lib/python3.12/concurrent/futures/thread.py", line 59, in run
- (EngineCore_DP0 pid=511959) ERROR 09-09 20:29:11 [core.py:718] result = self.fn(*self.args, **self.kwargs)
- (EngineCore_DP0 pid=511959) ERROR 09-09 20:29:11 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (EngineCore_DP0 pid=511959) ERROR 09-09 20:29:11 [core.py:718] File "/home/ubuntuai/vllm_source/vllm/v1/executor/multiproc_executor.py", line 243, in get_response
- (EngineCore_DP0 pid=511959) ERROR 09-09 20:29:11 [core.py:718] raise RuntimeError(
- (EngineCore_DP0 pid=511959) ERROR 09-09 20:29:11 [core.py:718] RuntimeError: Worker failed with error 'CUDA error: an illegal memory access was encountered
- (EngineCore_DP0 pid=511959) ERROR 09-09 20:29:11 [core.py:718] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
- (EngineCore_DP0 pid=511959) ERROR 09-09 20:29:11 [core.py:718] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
- (EngineCore_DP0 pid=511959) ERROR 09-09 20:29:11 [core.py:718] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
- (EngineCore_DP0 pid=511959) ERROR 09-09 20:29:11 [core.py:718] ', please check the stack trace above for the root cause
- (EngineCore_DP0 pid=511959) ERROR 09-09 20:29:12 [multiproc_executor.py:149] Worker proc VllmWorker-1 died unexpectedly, shutting down executor.
- (EngineCore_DP0 pid=511959) Process EngineCore_DP0:
- (EngineCore_DP0 pid=511959) Traceback (most recent call last):
- (EngineCore_DP0 pid=511959) File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
- (EngineCore_DP0 pid=511959) self.run()
- (EngineCore_DP0 pid=511959) File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
- (EngineCore_DP0 pid=511959) self._target(*self._args, **self._kwargs)
- (EngineCore_DP0 pid=511959) File "/home/ubuntuai/vllm_source/vllm/v1/engine/core.py", line 722, in run_engine_core
- (EngineCore_DP0 pid=511959) raise e
- (EngineCore_DP0 pid=511959) File "/home/ubuntuai/vllm_source/vllm/v1/engine/core.py", line 709, in run_engine_core
- (EngineCore_DP0 pid=511959) engine_core = EngineCoreProc(*args, **kwargs)
- (EngineCore_DP0 pid=511959) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (EngineCore_DP0 pid=511959) File "/home/ubuntuai/vllm_source/vllm/v1/engine/core.py", line 505, in __init__
- (EngineCore_DP0 pid=511959) super().__init__(vllm_config, executor_class, log_stats,
- (EngineCore_DP0 pid=511959) File "/home/ubuntuai/vllm_source/vllm/v1/engine/core.py", line 91, in __init__
- (EngineCore_DP0 pid=511959) self._initialize_kv_caches(vllm_config)
- (EngineCore_DP0 pid=511959) File "/home/ubuntuai/vllm_source/vllm/v1/engine/core.py", line 183, in _initialize_kv_caches
- (EngineCore_DP0 pid=511959) self.model_executor.determine_available_memory())
- (EngineCore_DP0 pid=511959) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (EngineCore_DP0 pid=511959) File "/home/ubuntuai/vllm_source/vllm/v1/executor/abstract.py", line 84, in determine_available_memory
- (EngineCore_DP0 pid=511959) return self.collective_rpc("determine_available_memory")
- (EngineCore_DP0 pid=511959) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (EngineCore_DP0 pid=511959) File "/home/ubuntuai/vllm_source/vllm/v1/executor/multiproc_executor.py", line 257, in collective_rpc
- (EngineCore_DP0 pid=511959) result = result.result()
- (EngineCore_DP0 pid=511959) ^^^^^^^^^^^^^^^
- (EngineCore_DP0 pid=511959) File "/usr/lib/python3.12/concurrent/futures/_base.py", line 456, in result
- (EngineCore_DP0 pid=511959) return self.__get_result()
- (EngineCore_DP0 pid=511959) ^^^^^^^^^^^^^^^^^^^
- (EngineCore_DP0 pid=511959) File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
- (EngineCore_DP0 pid=511959) raise self._exception
- (EngineCore_DP0 pid=511959) File "/usr/lib/python3.12/concurrent/futures/thread.py", line 59, in run
- (EngineCore_DP0 pid=511959) result = self.fn(*self.args, **self.kwargs)
- (EngineCore_DP0 pid=511959) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (EngineCore_DP0 pid=511959) File "/home/ubuntuai/vllm_source/vllm/v1/executor/multiproc_executor.py", line 243, in get_response
- (EngineCore_DP0 pid=511959) raise RuntimeError(
- (EngineCore_DP0 pid=511959) RuntimeError: Worker failed with error 'CUDA error: an illegal memory access was encountered
- (EngineCore_DP0 pid=511959) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
- (EngineCore_DP0 pid=511959) For debugging consider passing CUDA_LAUNCH_BLOCKING=1
- (EngineCore_DP0 pid=511959) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
- (EngineCore_DP0 pid=511959) ', please check the stack trace above for the root cause
- (APIServer pid=511717) Traceback (most recent call last):
- (APIServer pid=511717) File "/home/ubuntuai/vllm_source/.venv/bin/vllm", line 8, in <module>
- (APIServer pid=511717) sys.exit(main())
- (APIServer pid=511717) ^^^^^^
- (APIServer pid=511717) File "/home/ubuntuai/vllm_source/vllm/entrypoints/cli/main.py", line 54, in main
- (APIServer pid=511717) args.dispatch_function(args)
- (APIServer pid=511717) File "/home/ubuntuai/vllm_source/vllm/entrypoints/cli/serve.py", line 50, in cmd
- (APIServer pid=511717) uvloop.run(run_server(args))
- (APIServer pid=511717) File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/uvloop/__init__.py", line 109, in run
- (APIServer pid=511717) return __asyncio.run(
- (APIServer pid=511717) ^^^^^^^^^^^^^^
- (APIServer pid=511717) File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
- (APIServer pid=511717) return runner.run(main)
- (APIServer pid=511717) ^^^^^^^^^^^^^^^^
- (APIServer pid=511717) File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
- (APIServer pid=511717) return self._loop.run_until_complete(task)
- (APIServer pid=511717) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (APIServer pid=511717) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
- (APIServer pid=511717) File "/home/ubuntuai/vllm_source/.venv/lib/python3.12/site-packages/uvloop/__init__.py", line 61, in wrapper
- (APIServer pid=511717) return await main
- (APIServer pid=511717) ^^^^^^^^^^
- (APIServer pid=511717) File "/home/ubuntuai/vllm_source/vllm/entrypoints/openai/api_server.py", line 1941, in run_server
- (APIServer pid=511717) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
- (APIServer pid=511717) File "/home/ubuntuai/vllm_source/vllm/entrypoints/openai/api_server.py", line 1961, in run_server_worker
- (APIServer pid=511717) async with build_async_engine_client(
- (APIServer pid=511717) ^^^^^^^^^^^^^^^^^^^^^^^^^^
- (APIServer pid=511717) File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
- (APIServer pid=511717) return await anext(self.gen)
- (APIServer pid=511717) ^^^^^^^^^^^^^^^^^^^^^
- (APIServer pid=511717) File "/home/ubuntuai/vllm_source/vllm/entrypoints/openai/api_server.py", line 179, in build_async_engine_client
- (APIServer pid=511717) async with build_async_engine_client_from_engine_args(
- (APIServer pid=511717) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (APIServer pid=511717) File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
- (APIServer pid=511717) return await anext(self.gen)
- (APIServer pid=511717) ^^^^^^^^^^^^^^^^^^^^^
- (APIServer pid=511717) File "/home/ubuntuai/vllm_source/vllm/entrypoints/openai/api_server.py", line 221, in build_async_engine_client_from_engine_args
- (APIServer pid=511717) async_llm = AsyncLLM.from_vllm_config(
- (APIServer pid=511717) ^^^^^^^^^^^^^^^^^^^^^^^^^^
- (APIServer pid=511717) File "/home/ubuntuai/vllm_source/vllm/utils/__init__.py", line 1587, in inner
- (APIServer pid=511717) return fn(*args, **kwargs)
- (APIServer pid=511717) ^^^^^^^^^^^^^^^^^^^
- (APIServer pid=511717) File "/home/ubuntuai/vllm_source/vllm/v1/engine/async_llm.py", line 205, in from_vllm_config
- (APIServer pid=511717) return cls(
- (APIServer pid=511717) ^^^^
- (APIServer pid=511717) File "/home/ubuntuai/vllm_source/vllm/v1/engine/async_llm.py", line 129, in __init__
- (APIServer pid=511717) self.engine_core = EngineCoreClient.make_async_mp_client(
- (APIServer pid=511717) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (APIServer pid=511717) File "/home/ubuntuai/vllm_source/vllm/v1/engine/core_client.py", line 102, in make_async_mp_client
- (APIServer pid=511717) return AsyncMPClient(*client_args)
- (APIServer pid=511717) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (APIServer pid=511717) File "/home/ubuntuai/vllm_source/vllm/v1/engine/core_client.py", line 767, in __init__
- (APIServer pid=511717) super().__init__(
- (APIServer pid=511717) File "/home/ubuntuai/vllm_source/vllm/v1/engine/core_client.py", line 446, in __init__
- (APIServer pid=511717) with launch_core_engines(vllm_config, executor_class,
- (APIServer pid=511717) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- (APIServer pid=511717) File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
- (APIServer pid=511717) next(self.gen)
- (APIServer pid=511717) File "/home/ubuntuai/vllm_source/vllm/v1/engine/utils.py", line 729, in launch_core_engines
- (APIServer pid=511717) wait_for_engine_startup(
- (APIServer pid=511717) File "/home/ubuntuai/vllm_source/vllm/v1/engine/utils.py", line 782, in wait_for_engine_startup
- (APIServer pid=511717) raise RuntimeError("Engine core initialization failed. "
- (APIServer pid=511717) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
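The API server exits only because engine-core startup failed; the root cause is the illegal memory access raised during the KV-cache profiling dummy run, not an out-of-memory condition (Worker_PP0 had just reported 15.07 GiB of free KV cache memory). After a crash like this it is worth confirming the devices and driver are still healthy before retrying; a sketch using standard tools:

  nvidia-smi                         # all GPUs should still be visible and not show ERR
  sudo dmesg | grep -iE 'xid|nvrm'   # driver-level (Xid) errors logged around the crash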