- 2024-02-16 22:09:46.804 [dj6at93clwgje7] [info] raise ValueError(
- 2024-02-16 22:09:46.804 [dj6at93clwgje7] [info] ValueError: The model's max seq len (32768) is larger than the maximum number of tokens that can be stored in KV cache (24144). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
- 2024-02-16 22:09:48.834 [dj6at93clwgje7] [info] engine.py :212 2024-02-17 03:09:48,834 Using local model at /models/huggingface-cache/hub/models--mistralai--Mistral-7B-v0.1/snapshots/26bca36bde8333b5d7f72e9ed20ccda6a618af24
- 2024-02-16 22:09:48.835 [dj6at93clwgje7] [info] engine.py :43 2024-02-17 03:09:48,835 vLLM config: {'model': '/models/huggingface-cache/hub/models--mistralai--Mistral-7B-v0.1/snapshots/26bca36bde8333b5d7f72e9ed20ccda6a618af24', 'revision': None, 'download_dir': None, 'quantization': None, 'load_format': 'auto', 'dtype': 'auto', 'tokenizer': '/models/huggingface-cache/hub/models--mistralai--Mistral-7B-v0.1/snapshots/26bca36bde8333b5d7f72e9ed20ccda6a618af24', 'tokenizer_revision': None, 'disable_log_stats': True, 'disable_log_requests': True, 'trust_remote_code': False, 'gpu_memory_utilization': 0.95, 'max_parallel_loading_workers': 24, 'max_model_len': None, 'tensor_parallel_size': 1}
- 2024-02-16 22:09:48.879 [dj6at93clwgje7] [info] INFO 02-17 03:09:48 llm_engine.py:72] Initializing an LLM engine with config: model='/models/huggingface-cache/hub/models--mistralai--Mistral-7B-v0.1/snapshots/26bca36bde8333b5d7f72e9ed20ccda6a618af24', tokenizer='/models/huggingface-cache/hub/models--mistralai--Mistral-7B-v0.1/snapshots/26bca36bde8333b5d7f72e9ed20ccda6a618af24', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, enforce_eager=False, seed=0)
- 2024-02-16 22:09:59.747 [dj6at93clwgje7] [info] INFO 02-17 03:09:59 llm_engine.py:316] # GPU blocks: 1509, # CPU blocks: 2048
- 2024-02-16 22:09:59.747 [dj6at93clwgje7] [info] engine.py :194 2024-02-17 03:09:59,747 Error initializing vLLM engine: The model's max seq len (32768) is larger than the maximum number of tokens that can be stored in KV cache (24144). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
- 2024-02-16 22:09:59.747 [dj6at93clwgje7] [info] Traceback (most recent call last):
- 2024-02-16 22:09:59.747 [dj6at93clwgje7] [info] File "/src/handler.py", line 5, in <module>
- 2024-02-16 22:09:59.747 [dj6at93clwgje7] [info] vllm_engine = vLLMEngine()
- 2024-02-16 22:09:59.747 [dj6at93clwgje7] [info] File "/src/engine.py", line 45, in __init__
- 2024-02-16 22:09:59.747 [dj6at93clwgje7] [info] self.llm = self._initialize_llm() if engine is None else engine
- 2024-02-16 22:09:59.747 [dj6at93clwgje7] [info] File "/src/engine.py", line 195, in _initialize_llm
- 2024-02-16 22:09:59.747 [dj6at93clwgje7] [info] raise e
- 2024-02-16 22:09:59.747 [dj6at93clwgje7] [info] File "/src/engine.py", line 192, in _initialize_llm
- 2024-02-16 22:09:59.747 [dj6at93clwgje7] [info] return AsyncLLMEngine.from_engine_args(AsyncEngineArgs(**self.config))
- 2024-02-16 22:09:59.747 [dj6at93clwgje7] [info] File "/vllm-installation/vllm/engine/async_llm_engine.py", line 617, in from_engine_args
- 2024-02-16 22:09:59.747 [dj6at93clwgje7] [info] engine = cls(parallel_config.worker_use_ray,
- 2024-02-16 22:09:59.747 [dj6at93clwgje7] [info] File "/vllm-installation/vllm/engine/async_llm_engine.py", line 321, in __init__
- 2024-02-16 22:09:59.747 [dj6at93clwgje7] [info] self.engine = self._init_engine(*args, **kwargs)
- 2024-02-16 22:09:59.747 [dj6at93clwgje7] [info] File "/vllm-installation/vllm/engine/async_llm_engine.py", line 366, in _init_engine
- 2024-02-16 22:09:59.747 [dj6at93clwgje7] [info] return engine_class(*args, **kwargs)
- 2024-02-16 22:09:59.747 [dj6at93clwgje7] [info] File "/vllm-installation/vllm/engine/llm_engine.py", line 112, in __init__
- 2024-02-16 22:09:59.747 [dj6at93clwgje7] [info] self._init_cache()
- 2024-02-16 22:09:59.747 [dj6at93clwgje7] [info] File "/vllm-installation/vllm/engine/llm_engine.py", line 325, in _init_cache
- 2024-02-16 22:09:59.747 [dj6at93clwgje7] [info] raise ValueError(
- 2024-02-16 22:09:59.747 [dj6at93clwgje7] [info] ValueError: The model's max seq len (32768) is larger than the maximum number of tokens that can be stored in KV cache (24144). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
- 2024-02-16 22:15:27.699 [2akn5byerrxpel] [info] INFO 02-17 03:15:20 llm_engine.py:316] # GPU blocks: 2214, # CPU blocks: 2048
- 2024-02-16 22:15:27.699 [2akn5byerrxpel] [info] INFO 02-17 03:15:22 model_runner.py:625] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
- 2024-02-16 22:15:27.699 [2akn5byerrxpel] [info] INFO 02-17 03:15:22 model_runner.py:629] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
- 2024-02-16 22:15:27.699 [2akn5byerrxpel] [info] INFO 02-17 03:15:27 model_runner.py:689] Graph capturing finished in 4 secs.
- 2024-02-16 22:15:28.531 [2akn5byerrxpel] [info] --- Starting Serverless Worker | Version 1.5.3 ---
- 2024-02-16 22:15:28.531 [2akn5byerrxpel] [info] Finished running generator.
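The numbers in this log are consistent with a simple capacity check: vLLM sizes the KV cache in fixed blocks, and the default block size is 16 tokens per block (configurable via the `block_size` engine argument — an assumption here, since this log never prints it). Under that assumption, the figures line up exactly: a minimal sketch, not vLLM's actual code.

```python
# Sketch of the KV-cache arithmetic behind the ValueError in this log,
# assuming vLLM's default block_size of 16 tokens per KV-cache block.

BLOCK_SIZE = 16  # assumed vLLM default; configurable via `block_size`

def kv_cache_tokens(num_gpu_blocks: int, block_size: int = BLOCK_SIZE) -> int:
    """Max tokens the GPU KV cache can hold: each block stores `block_size` tokens."""
    return num_gpu_blocks * block_size

MAX_MODEL_LEN = 32768  # Mistral-7B-v0.1's max_seq_len, from the engine-config line

# Failed run on dj6at93clwgje7: 1509 GPU blocks.
print(kv_cache_tokens(1509))                    # 24144, the number in the ValueError
print(kv_cache_tokens(1509) >= MAX_MODEL_LEN)   # False -> engine refuses to start

# Later run on 2akn5byerrxpel: 2214 GPU blocks.
print(kv_cache_tokens(2214))                    # 35424
print(kv_cache_tokens(2214) >= MAX_MODEL_LEN)   # True -> engine starts
```

As the error message itself suggests, one way to resolve the failure is to raise `gpu_memory_utilization` (already 0.95 in this config, so there is little headroom) and the other is to cap the context, e.g. setting `'max_model_len'` in the worker's vLLM config to a value no larger than the reported capacity (24144 here) instead of leaving it `None`.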