- 2024-02-16 22:09:46.804 [dj6at93clwgje7] [info] raise ValueError(
- 2024-02-16 22:09:46.804 [dj6at93clwgje7] [info] ValueError: The model's max seq len (32768) is larger than the maximum number of tokens that can be stored in KV cache (24144). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
- 2024-02-16 22:09:48.834 [dj6at93clwgje7] [info] engine.py :212 2024-02-17 03:09:48,834 Using local model at /models/huggingface-cache/hub/models--mistralai--Mistral-7B-v0.1/snapshots/26bca36bde8333b5d7f72e9ed20ccda6a618af24
- 2024-02-16 22:09:48.835 [dj6at93clwgje7] [info] engine.py :43 2024-02-17 03:09:48,835 vLLM config: {'model': '/models/huggingface-cache/hub/models--mistralai--Mistral-7B-v0.1/snapshots/26bca36bde8333b5d7f72e9ed20ccda6a618af24', 'revision': None, 'download_dir': None, 'quantization': None, 'load_format': 'auto', 'dtype': 'auto', 'tokenizer': '/models/huggingface-cache/hub/models--mistralai--Mistral-7B-v0.1/snapshots/26bca36bde8333b5d7f72e9ed20ccda6a618af24', 'tokenizer_revision': None, 'disable_log_stats': True, 'disable_log_requests': True, 'trust_remote_code': False, 'gpu_memory_utilization': 0.95, 'max_parallel_loading_workers': 24, 'max_model_len': None, 'tensor_parallel_size': 1}
- 2024-02-16 22:09:48.879 [dj6at93clwgje7] [info] INFO 02-17 03:09:48 llm_engine.py:72] Initializing an LLM engine with config: model='/models/huggingface-cache/hub/models--mistralai--Mistral-7B-v0.1/snapshots/26bca36bde8333b5d7f72e9ed20ccda6a618af24', tokenizer='/models/huggingface-cache/hub/models--mistralai--Mistral-7B-v0.1/snapshots/26bca36bde8333b5d7f72e9ed20ccda6a618af24', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, enforce_eager=False, seed=0)
- 2024-02-16 22:09:59.747 [dj6at93clwgje7] [info] INFO 02-17 03:09:59 llm_engine.py:316] # GPU blocks: 1509, # CPU blocks: 2048
- 2024-02-16 22:09:59.747 [dj6at93clwgje7] [info] engine.py :194 2024-02-17 03:09:59,747 Error initializing vLLM engine: The model's max seq len (32768) is larger than the maximum number of tokens that can be stored in KV cache (24144). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
- 2024-02-16 22:09:59.747 [dj6at93clwgje7] [info] Traceback (most recent call last):
- 2024-02-16 22:09:59.747 [dj6at93clwgje7] [info] File "/src/handler.py", line 5, in <module>
- 2024-02-16 22:09:59.747 [dj6at93clwgje7] [info] vllm_engine = vLLMEngine()
- 2024-02-16 22:09:59.747 [dj6at93clwgje7] [info] File "/src/engine.py", line 45, in __init__
- 2024-02-16 22:09:59.747 [dj6at93clwgje7] [info] self.llm = self._initialize_llm() if engine is None else engine
- 2024-02-16 22:09:59.747 [dj6at93clwgje7] [info] File "/src/engine.py", line 195, in _initialize_llm
- 2024-02-16 22:09:59.747 [dj6at93clwgje7] [info] raise e
- 2024-02-16 22:09:59.747 [dj6at93clwgje7] [info] File "/src/engine.py", line 192, in _initialize_llm
- 2024-02-16 22:09:59.747 [dj6at93clwgje7] [info] return AsyncLLMEngine.from_engine_args(AsyncEngineArgs(**self.config))
- 2024-02-16 22:09:59.747 [dj6at93clwgje7] [info] File "/vllm-installation/vllm/engine/async_llm_engine.py", line 617, in from_engine_args
- 2024-02-16 22:09:59.747 [dj6at93clwgje7] [info] engine = cls(parallel_config.worker_use_ray,
- 2024-02-16 22:09:59.747 [dj6at93clwgje7] [info] File "/vllm-installation/vllm/engine/async_llm_engine.py", line 321, in __init__
- 2024-02-16 22:09:59.747 [dj6at93clwgje7] [info] self.engine = self._init_engine(*args, **kwargs)
- 2024-02-16 22:09:59.747 [dj6at93clwgje7] [info] File "/vllm-installation/vllm/engine/async_llm_engine.py", line 366, in _init_engine
- 2024-02-16 22:09:59.747 [dj6at93clwgje7] [info] return engine_class(*args, **kwargs)
- 2024-02-16 22:09:59.747 [dj6at93clwgje7] [info] File "/vllm-installation/vllm/engine/llm_engine.py", line 112, in __init__
- 2024-02-16 22:09:59.747 [dj6at93clwgje7] [info] self._init_cache()
- 2024-02-16 22:09:59.747 [dj6at93clwgje7] [info] File "/vllm-installation/vllm/engine/llm_engine.py", line 325, in _init_cache
- 2024-02-16 22:09:59.747 [dj6at93clwgje7] [info] raise ValueError(
- 2024-02-16 22:09:59.747 [dj6at93clwgje7] [info] ValueError: The model's max seq len (32768) is larger than the maximum number of tokens that can be stored in KV cache (24144). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
- 2024-02-16 22:15:27.699 [2akn5byerrxpel] [info] INFO 02-17 03:15:20 llm_engine.py:316] # GPU blocks: 2214, # CPU blocks: 2048
- 2024-02-16 22:15:27.699 [2akn5byerrxpel] [info] INFO 02-17 03:15:22 model_runner.py:625] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
- 2024-02-16 22:15:27.699 [2akn5byerrxpel] [info] INFO 02-17 03:15:22 model_runner.py:629] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
- 2024-02-16 22:15:27.699 [2akn5byerrxpel] [info] INFO 02-17 03:15:27 model_runner.py:689] Graph capturing finished in 4 secs.
- 2024-02-16 22:15:28.531 [2akn5byerrxpel] [info] --- Starting Serverless Worker | Version 1.5.3 ---
- 2024-02-16 22:15:28.531 [2akn5byerrxpel] [info] Finished running generator.
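The numbers in this log are consistent with a simple capacity check: vLLM sizes the KV cache in fixed blocks, and the default block size is 16 tokens per block (configurable via the `block_size` engine argument — an assumption here, since this log never prints it). Under that assumption, the figures line up exactly: a minimal sketch, not vLLM's actual code.

```python
# Sketch of the KV-cache arithmetic behind the ValueError in this log,
# assuming vLLM's default block_size of 16 tokens per KV-cache block.

BLOCK_SIZE = 16  # assumed vLLM default; configurable via `block_size`

def kv_cache_tokens(num_gpu_blocks: int, block_size: int = BLOCK_SIZE) -> int:
    """Max tokens the GPU KV cache can hold: each block stores `block_size` tokens."""
    return num_gpu_blocks * block_size

MAX_MODEL_LEN = 32768  # Mistral-7B-v0.1's max_seq_len, from the engine-config line

# Failed run on dj6at93clwgje7: 1509 GPU blocks.
print(kv_cache_tokens(1509))                    # 24144, the number in the ValueError
print(kv_cache_tokens(1509) >= MAX_MODEL_LEN)   # False -> engine refuses to start

# Later run on 2akn5byerrxpel: 2214 GPU blocks.
print(kv_cache_tokens(2214))                    # 35424
print(kv_cache_tokens(2214) >= MAX_MODEL_LEN)   # True -> engine starts
```

As the error message itself suggests, one way to resolve the failure is to raise `gpu_memory_utilization` (already 0.95 in this config, so there is little headroom) and the other is to cap the context, e.g. setting `'max_model_len'` in the worker's vLLM config to a value no larger than the reported capacity (24144 here) instead of leaving it `None`.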