thinclient@skynet:~/llm-server$ source start-vllm.sh
INFO 11-17 17:06:48 api_server.py:585] vLLM API server version 0.6.4.post1
INFO 11-17 17:06:48 api_server.py:586] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/mnt/weights/Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', chat_template_text_format='string', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=3500, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.97, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False)
INFO 11-17 17:06:48 api_server.py:175] Multiprocessing frontend to use ipc:///tmp/2d91cf6b-8f1b-484d-bc8a-d572527b8366 for IPC Path.
INFO 11-17 17:06:48 api_server.py:194] Started engine process with PID 24
INFO 11-17 17:07:08 config.py:1861] Downcasting torch.float32 to torch.float16.
INFO 11-17 17:07:11 config.py:350] This model supports multiple tasks: {'generate', 'embedding'}. Defaulting to 'generate'.
WARNING 11-17 17:07:11 config.py:428] gguf quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 11-17 17:07:11 config.py:1020] Defaulting to use ray for distributed inference
WARNING 11-17 17:07:11 arg_utils.py:1075] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information.
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 643, in <module>
    uvloop.run(run_server(args))
  File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in run
    return __asyncio.run(
           ^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
           ^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 609, in run_server
    async with build_async_engine_client(args) as engine_client:
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 113, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 197, in build_async_engine_client_from_engine_args
    engine_config = engine_args.create_engine_config()
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/arg_utils.py", line 1143, in create_engine_config
    return VllmConfig(
           ^^^^^^^^^^^
  File "<string>", line 15, in __init__
  File "/usr/local/lib/python3.12/dist-packages/vllm/config.py", line 2135, in __post_init__
    self.quant_config = VllmConfig._get_quantization_config(
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/config.py", line 2088, in _get_quantization_config
    capability_tuple = current_platform.get_device_capability()
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/platforms/cuda.py", line 114, in get_device_capability
    major, minor = get_physical_device_capability(physical_device_id)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/platforms/cuda.py", line 46, in wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/platforms/cuda.py", line 56, in get_physical_device_capability
    handle = pynvml.nvmlDeviceGetHandleByIndex(device_id)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/pynvml.py", line 2437, in nvmlDeviceGetHandleByIndex
    _nvmlCheckReturn(ret)
  File "/usr/local/lib/python3.12/dist-packages/pynvml.py", line 979, in _nvmlCheckReturn
    raise NVMLError(ret)
pynvml.NVMLError_InvalidArgument: Invalid Argument
INFO 11-17 17:07:11 config.py:1861] Downcasting torch.float32 to torch.float16.
INFO 11-17 17:07:15 config.py:350] This model supports multiple tasks: {'generate', 'embedding'}. Defaulting to 'generate'.
WARNING 11-17 17:07:15 config.py:428] gguf quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 11-17 17:07:15 config.py:1020] Defaulting to use ray for distributed inference
WARNING 11-17 17:07:15 arg_utils.py:1075] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information.
Process SpawnProcess-1:
ERROR 11-17 17:07:15 engine.py:366] Invalid Argument
ERROR 11-17 17:07:15 engine.py:366] Traceback (most recent call last):
ERROR 11-17 17:07:15 engine.py:366]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 357, in run_mp_engine
ERROR 11-17 17:07:15 engine.py:366]     engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
ERROR 11-17 17:07:15 engine.py:366]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 11-17 17:07:15 engine.py:366]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 114, in from_engine_args
ERROR 11-17 17:07:15 engine.py:366]     engine_config = engine_args.create_engine_config()
ERROR 11-17 17:07:15 engine.py:366]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 11-17 17:07:15 engine.py:366]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/arg_utils.py", line 1143, in create_engine_config
ERROR 11-17 17:07:15 engine.py:366]     return VllmConfig(
ERROR 11-17 17:07:15 engine.py:366]            ^^^^^^^^^^^
ERROR 11-17 17:07:15 engine.py:366]   File "<string>", line 15, in __init__
ERROR 11-17 17:07:15 engine.py:366]   File "/usr/local/lib/python3.12/dist-packages/vllm/config.py", line 2135, in __post_init__
ERROR 11-17 17:07:15 engine.py:366]     self.quant_config = VllmConfig._get_quantization_config(
ERROR 11-17 17:07:15 engine.py:366]                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 11-17 17:07:15 engine.py:366]   File "/usr/local/lib/python3.12/dist-packages/vllm/config.py", line 2088, in _get_quantization_config
ERROR 11-17 17:07:15 engine.py:366]     capability_tuple = current_platform.get_device_capability()
ERROR 11-17 17:07:15 engine.py:366]                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 11-17 17:07:15 engine.py:366]   File "/usr/local/lib/python3.12/dist-packages/vllm/platforms/cuda.py", line 114, in get_device_capability
ERROR 11-17 17:07:15 engine.py:366]     major, minor = get_physical_device_capability(physical_device_id)
ERROR 11-17 17:07:15 engine.py:366]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 11-17 17:07:15 engine.py:366]   File "/usr/local/lib/python3.12/dist-packages/vllm/platforms/cuda.py", line 46, in wrapper
ERROR 11-17 17:07:15 engine.py:366]     return fn(*args, **kwargs)
ERROR 11-17 17:07:15 engine.py:366]            ^^^^^^^^^^^^^^^^^^^
ERROR 11-17 17:07:15 engine.py:366]   File "/usr/local/lib/python3.12/dist-packages/vllm/platforms/cuda.py", line 56, in get_physical_device_capability
ERROR 11-17 17:07:15 engine.py:366]     handle = pynvml.nvmlDeviceGetHandleByIndex(device_id)
ERROR 11-17 17:07:15 engine.py:366]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 11-17 17:07:15 engine.py:366]   File "/usr/local/lib/python3.12/dist-packages/pynvml.py", line 2437, in nvmlDeviceGetHandleByIndex
ERROR 11-17 17:07:15 engine.py:366]     _nvmlCheckReturn(ret)
ERROR 11-17 17:07:15 engine.py:366]   File "/usr/local/lib/python3.12/dist-packages/pynvml.py", line 979, in _nvmlCheckReturn
ERROR 11-17 17:07:15 engine.py:366]     raise NVMLError(ret)
ERROR 11-17 17:07:15 engine.py:366] pynvml.NVMLError_InvalidArgument: Invalid Argument
Traceback (most recent call last):
  File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 368, in run_mp_engine
    raise e
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 357, in run_mp_engine
    engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 114, in from_engine_args
    engine_config = engine_args.create_engine_config()
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/arg_utils.py", line 1143, in create_engine_config
    return VllmConfig(
           ^^^^^^^^^^^
  File "<string>", line 15, in __init__
  File "/usr/local/lib/python3.12/dist-packages/vllm/config.py", line 2135, in __post_init__
    self.quant_config = VllmConfig._get_quantization_config(
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/config.py", line 2088, in _get_quantization_config
    capability_tuple = current_platform.get_device_capability()
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/platforms/cuda.py", line 114, in get_device_capability
    major, minor = get_physical_device_capability(physical_device_id)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/platforms/cuda.py", line 46, in wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/platforms/cuda.py", line 56, in get_physical_device_capability
    handle = pynvml.nvmlDeviceGetHandleByIndex(device_id)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/pynvml.py", line 2437, in nvmlDeviceGetHandleByIndex
    _nvmlCheckReturn(ret)
  File "/usr/local/lib/python3.12/dist-packages/pynvml.py", line 979, in _nvmlCheckReturn
    raise NVMLError(ret)
pynvml.NVMLError_InvalidArgument: Invalid Argument
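
Both the frontend process and the spawned engine process die at the same call: while building the quantization config, vLLM asks NVML for a device handle (pynvml.nvmlDeviceGetHandleByIndex) to read the GPU's compute capability, and NVML answers Invalid Argument. NVML returns that error when the index it is handed does not map to a device it can actually see, so the usual suspects are fewer GPUs visible to the driver/container than the launch assumes (tensor_parallel_size=2 here), a CUDA_VISIBLE_DEVICES value naming indices NVML cannot resolve, or a mismatched NVML/driver library. The following standalone check is not part of the original paste; it is a minimal sketch, run in the same environment as the server, using only standard nvidia-ml-py calls (pynvml is the module the traceback above already ends in):

import os
import pynvml  # same module the traceback above ends in

pynvml.nvmlInit()
try:
    count = pynvml.nvmlDeviceGetCount()
    print(f"NVML sees {count} device(s)")  # tensor_parallel_size=2 needs at least 2
    print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
    for i in range(count):
        # This is exactly the call that raised NVMLError_InvalidArgument above.
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        name = pynvml.nvmlDeviceGetName(handle)
        major, minor = pynvml.nvmlDeviceGetCudaComputeCapability(handle)
        print(f"  index {i}: {name}, compute capability {major}.{minor}")
finally:
    pynvml.nvmlShutdown()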
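
If that script reports fewer devices than the launch expects, or CUDA_VISIBLE_DEVICES names an index outside the reported range, the mismatch alone reproduces the NVMLError_InvalidArgument above. Two quick cross-checks, both assumptions about this setup rather than facts from the log: nvidia-smi -L from the same shell should list the same devices NVML enumerated, and retrying the launch with tensor parallel size 1 isolates whether the second GPU slot is the problem.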