Lissanro

DeepSeek-V3-0324-GGUF-UD-Q4_K_R4 with -rtr option

Apr 11th, 2025
taskset -c 0-63 /home/lissanro/pkgs/ik_llama.cpp/build/bin/llama-server --model /home/lissanro/neuro/DeepSeek-V3-0324-GGUF-UD-Q4_K_R4-163840seq/DeepSeek-V3-0324-GGUF-UD-Q4_K_R4.gguf --ctx-size 81920 --n-gpu-layers 62 --tensor-split 25,25,25,25 -mla 2 -fa -ctk q8_0 -amb 2048 -rtr -fmoe --override-tensor "ffn_down_exps=CPU, ffn_up_exps=CPU, gate_exps=CPU" --threads 64 --host 0.0.0.0 --port 5000
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 4 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 2: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 3: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
INFO [ main] build info | tid="127320811921408" timestamp=1744368818 build=3630 commit="5f44f4b3"
INFO [ main] system info | tid="127320811921408" timestamp=1744368818 n_threads=64 n_threads_batch=-1 total_threads=128 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
llama_model_loader: loaded meta data with 46 key-value pairs and 1025 tensors from /home/lissanro/neuro/DeepSeek-V3-0324-GGUF-UD-Q4_K_R4-163840seq/DeepSeek-V3-0324-GGUF-UD-Q4_K_R4.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = deepseek2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = DeepSeek V3 0324 BF16
llama_model_loader: - kv 3: general.quantized_by str = Unsloth
llama_model_loader: - kv 4: general.size_label str = 256x20B
llama_model_loader: - kv 5: general.license str = mit
llama_model_loader: - kv 6: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 7: deepseek2.block_count u32 = 61
llama_model_loader: - kv 8: deepseek2.context_length u32 = 163840
llama_model_loader: - kv 9: deepseek2.embedding_length u32 = 7168
llama_model_loader: - kv 10: deepseek2.feed_forward_length u32 = 18432
llama_model_loader: - kv 11: deepseek2.attention.head_count u32 = 128
llama_model_loader: - kv 12: deepseek2.attention.head_count_kv u32 = 128
llama_model_loader: - kv 13: deepseek2.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 14: deepseek2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 15: deepseek2.expert_used_count u32 = 8
llama_model_loader: - kv 16: deepseek2.leading_dense_block_count u32 = 3
llama_model_loader: - kv 17: deepseek2.vocab_size u32 = 129280
llama_model_loader: - kv 18: deepseek2.attention.q_lora_rank u32 = 1536
llama_model_loader: - kv 19: deepseek2.attention.kv_lora_rank u32 = 512
llama_model_loader: - kv 20: deepseek2.attention.key_length u32 = 192
llama_model_loader: - kv 21: deepseek2.attention.value_length u32 = 128
llama_model_loader: - kv 22: deepseek2.expert_feed_forward_length u32 = 2048
llama_model_loader: - kv 23: deepseek2.expert_count u32 = 256
llama_model_loader: - kv 24: deepseek2.expert_shared_count u32 = 1
llama_model_loader: - kv 25: deepseek2.expert_weights_scale f32 = 2.500000
llama_model_loader: - kv 26: deepseek2.expert_weights_norm bool = true
llama_model_loader: - kv 27: deepseek2.expert_gating_func u32 = 2
llama_model_loader: - kv 28: deepseek2.rope.dimension_count u32 = 64
llama_model_loader: - kv 29: deepseek2.rope.scaling.type str = yarn
llama_model_loader: - kv 30: deepseek2.rope.scaling.factor f32 = 40.000000
llama_model_loader: - kv 31: deepseek2.rope.scaling.original_context_length u32 = 4096
llama_model_loader: - kv 32: deepseek2.rope.scaling.yarn_log_multiplier f32 = 0.100000
llama_model_loader: - kv 33: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 34: tokenizer.ggml.pre str = deepseek-v3
llama_model_loader: - kv 35: tokenizer.ggml.tokens arr[str,129280] = ["<|begin▁of▁sentence|>", "<�...
llama_model_loader: - kv 36: tokenizer.ggml.token_type arr[i32,129280] = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 37: tokenizer.ggml.merges arr[str,127741] = ["Ġ t", "Ġ a", "i n", "Ġ Ġ", "h e...
llama_model_loader: - kv 38: tokenizer.ggml.bos_token_id u32 = 0
llama_model_loader: - kv 39: tokenizer.ggml.eos_token_id u32 = 1
llama_model_loader: - kv 40: tokenizer.ggml.padding_token_id u32 = 1
llama_model_loader: - kv 41: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 42: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 43: tokenizer.chat_template str = {% if not add_generation_prompt is de...
llama_model_loader: - kv 44: general.quantization_version u32 = 2
llama_model_loader: - kv 45: general.file_type u32 = 214
llama_model_loader: - type f32: 361 tensors
llama_model_loader: - type q4_K: 306 tensors
llama_model_loader: - type q6_K: 184 tensors
llama_model_loader: - type q4_k_r4: 147 tensors
llama_model_loader: - type q6_k_r4: 27 tensors
llm_load_vocab: special tokens cache size = 818
llm_load_vocab: token to piece cache size = 0.8223 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = deepseek2
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 129280
llm_load_print_meta: n_merges = 127741
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 163840
llm_load_print_meta: n_embd = 7168
llm_load_print_meta: n_layer = 61
llm_load_print_meta: n_head = 128
llm_load_print_meta: n_head_kv = 128
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 192
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 24576
llm_load_print_meta: n_embd_v_gqa = 16384
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 18432
llm_load_print_meta: n_expert = 256
llm_load_print_meta: n_expert_used = 8
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = yarn
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 0.025
llm_load_print_meta: n_ctx_orig_yarn = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 671B
llm_load_print_meta: model ftype = Q4_K_R4
llm_load_print_meta: model params = 671.026 B
llm_load_print_meta: model size = 377.065 GiB (4.827 BPW)
llm_load_print_meta: repeating layers = 375.872 GiB (4.825 BPW, 669.173 B parameters)
llm_load_print_meta: general.name = DeepSeek V3 0324 BF16
llm_load_print_meta: BOS token = 0 '<|begin▁of▁sentence|>'
llm_load_print_meta: EOS token = 1 '<|end▁of▁sentence|>'
llm_load_print_meta: PAD token = 1 '<|end▁of▁sentence|>'
llm_load_print_meta: LF token = 131 'Ä'
llm_load_print_meta: max token length = 256
llm_load_print_meta: n_layer_dense_lead = 3
llm_load_print_meta: n_lora_q = 1536
llm_load_print_meta: n_lora_kv = 512
llm_load_print_meta: n_ff_exp = 2048
llm_load_print_meta: n_expert_shared = 1
llm_load_print_meta: expert_weights_scale = 2.5
llm_load_print_meta: expert_weights_norm = 1
llm_load_print_meta: expert_gating_func = sigmoid
llm_load_print_meta: rope_yarn_log_mul = 0.1000
llm_load_tensors: ggml ctx size = 2.12 MiB
Tensor blk.3.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.3.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.3.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.4.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.4.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.4.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.5.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.5.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.5.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.6.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.6.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.6.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.7.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.7.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.7.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.8.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.8.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.8.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.9.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.9.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.9.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.10.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.10.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.10.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.11.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.11.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.11.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.12.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.12.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.12.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.13.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.13.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.13.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.14.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.14.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.14.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.15.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.15.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.15.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.16.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.16.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.16.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.17.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.17.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.17.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.18.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.18.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.18.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.19.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.19.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.19.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.20.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.20.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.20.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.21.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.21.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.21.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.22.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.22.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.22.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.23.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.23.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.23.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.24.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.24.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.24.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.25.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.25.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.25.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.26.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.26.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.26.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.27.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.27.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.27.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.28.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.28.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.28.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.29.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.29.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.29.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.30.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.30.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.30.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.31.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.31.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.31.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.32.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.32.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.32.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.33.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.33.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.33.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.34.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.34.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.34.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.35.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.35.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.35.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.36.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.36.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.36.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.37.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.37.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.37.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.38.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.38.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.38.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.39.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.39.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.39.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.40.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.40.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.40.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.41.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.41.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.41.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.42.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.42.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.42.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.43.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.43.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.43.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.44.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.44.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.44.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.45.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.45.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.45.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.46.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.46.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.46.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.47.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.47.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.47.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.48.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.48.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.48.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.49.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.49.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.49.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.50.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.50.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.50.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.51.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.51.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.51.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.52.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.52.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.52.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.53.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.53.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.53.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.54.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.54.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.54.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.55.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.55.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.55.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.56.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.56.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.56.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.57.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.57.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.57.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.58.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.58.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.58.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.59.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.59.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.59.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.60.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.60.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.60.ffn_up_exps.weight buffer type overriden to CPU
llm_load_tensors: offloading 61 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 62/62 layers to GPU
llm_load_tensors: CPU buffer size = 375732.00 MiB
llm_load_tensors: CUDA_Host buffer size = 497.11 MiB
llm_load_tensors: CUDA0 buffer size = 2869.57 MiB
llm_load_tensors: CUDA1 buffer size = 2097.14 MiB
llm_load_tensors: CUDA2 buffer size = 2236.95 MiB
llm_load_tensors: CUDA3 buffer size = 2682.31 MiB
....................................................................................................
============ llm_load_tensors: need to compute 61 wk_b tensors
Computed blk.0.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.1.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.2.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.3.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.4.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.5.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.6.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.7.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.8.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.9.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.10.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.11.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.12.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.13.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.14.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.15.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0
Computed blk.16.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA1
Computed blk.17.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA1
Computed blk.18.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA1
Computed blk.19.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA1
Computed blk.20.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA1
Computed blk.21.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA1
Computed blk.22.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA1
Computed blk.23.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA1
Computed blk.24.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA1
Computed blk.25.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA1
Computed blk.26.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA1
Computed blk.27.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA1
Computed blk.28.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA1
Computed blk.29.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA1
Computed blk.30.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA1
Computed blk.31.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA2
Computed blk.32.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA2
Computed blk.33.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA2
Computed blk.34.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA2
Computed blk.35.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA2
Computed blk.36.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA2
Computed blk.37.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA2
Computed blk.38.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA2
Computed blk.39.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA2
Computed blk.40.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA2
Computed blk.41.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA2
Computed blk.42.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA2
Computed blk.43.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA2
Computed blk.44.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA2
Computed blk.45.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA2
Computed blk.46.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA2
Computed blk.47.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA3
Computed blk.48.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA3
Computed blk.49.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA3
Computed blk.50.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA3
Computed blk.51.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA3
Computed blk.52.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA3
Computed blk.53.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA3
Computed blk.54.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA3
Computed blk.55.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA3
Computed blk.56.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA3
Computed blk.57.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA3
Computed blk.58.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA3
Computed blk.59.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA3
Computed blk.60.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA3
  371. llama_new_context_with_model: n_ctx = 81920
  372. llama_new_context_with_model: n_batch = 2048
  373. llama_new_context_with_model: n_ubatch = 512
  374. llama_new_context_with_model: flash_attn = 1
  375. llama_new_context_with_model: mla_attn = 2
  376. llama_new_context_with_model: attn_max_b = 2048
  377. llama_new_context_with_model: fused_moe = 1
  378. llama_new_context_with_model: ser = -1, 0
  379. llama_new_context_with_model: freq_base = 10000.0
  380. llama_new_context_with_model: freq_scale = 0.025
  381. llama_kv_cache_init: layer 0: n_embd_head_qk_rope = 64, kv_lora_rank = 512
  382. llama_kv_cache_init: layer 1: n_embd_head_qk_rope = 64, kv_lora_rank = 512
  383. llama_kv_cache_init: layer 2: n_embd_head_qk_rope = 64, kv_lora_rank = 512
  384. llama_kv_cache_init: layer 3: n_embd_head_qk_rope = 64, kv_lora_rank = 512
  385. llama_kv_cache_init: layer 4: n_embd_head_qk_rope = 64, kv_lora_rank = 512
  386. llama_kv_cache_init: layer 5: n_embd_head_qk_rope = 64, kv_lora_rank = 512
  387. llama_kv_cache_init: layer 6: n_embd_head_qk_rope = 64, kv_lora_rank = 512
  388. llama_kv_cache_init: layer 7: n_embd_head_qk_rope = 64, kv_lora_rank = 512
  389. llama_kv_cache_init: layer 8: n_embd_head_qk_rope = 64, kv_lora_rank = 512
  390. llama_kv_cache_init: layer 9: n_embd_head_qk_rope = 64, kv_lora_rank = 512
  391. llama_kv_cache_init: layer 10: n_embd_head_qk_rope = 64, kv_lora_rank = 512
  392. llama_kv_cache_init: layer 11: n_embd_head_qk_rope = 64, kv_lora_rank = 512
  393. llama_kv_cache_init: layer 12: n_embd_head_qk_rope = 64, kv_lora_rank = 512
  394. llama_kv_cache_init: layer 13: n_embd_head_qk_rope = 64, kv_lora_rank = 512
  395. llama_kv_cache_init: layer 14: n_embd_head_qk_rope = 64, kv_lora_rank = 512
  396. llama_kv_cache_init: layer 15: n_embd_head_qk_rope = 64, kv_lora_rank = 512
  397. llama_kv_cache_init: layer 16: n_embd_head_qk_rope = 64, kv_lora_rank = 512
  398. llama_kv_cache_init: layer 17: n_embd_head_qk_rope = 64, kv_lora_rank = 512
  399. llama_kv_cache_init: layer 18: n_embd_head_qk_rope = 64, kv_lora_rank = 512
  400. llama_kv_cache_init: layer 19: n_embd_head_qk_rope = 64, kv_lora_rank = 512
  401. llama_kv_cache_init: layer 20: n_embd_head_qk_rope = 64, kv_lora_rank = 512
  402. llama_kv_cache_init: layer 21: n_embd_head_qk_rope = 64, kv_lora_rank = 512
  403. llama_kv_cache_init: layer 22: n_embd_head_qk_rope = 64, kv_lora_rank = 512
  404. llama_kv_cache_init: layer 23: n_embd_head_qk_rope = 64, kv_lora_rank = 512
  405. llama_kv_cache_init: layer 24: n_embd_head_qk_rope = 64, kv_lora_rank = 512
  406. llama_kv_cache_init: layer 25: n_embd_head_qk_rope = 64, kv_lora_rank = 512
  407. llama_kv_cache_init: layer 26: n_embd_head_qk_rope = 64, kv_lora_rank = 512
  408. llama_kv_cache_init: layer 27: n_embd_head_qk_rope = 64, kv_lora_rank = 512
  409. llama_kv_cache_init: layer 28: n_embd_head_qk_rope = 64, kv_lora_rank = 512
  410. llama_kv_cache_init: layer 29: n_embd_head_qk_rope = 64, kv_lora_rank = 512
  411. llama_kv_cache_init: layer 30: n_embd_head_qk_rope = 64, kv_lora_rank = 512
  412. llama_kv_cache_init: layer 31: n_embd_head_qk_rope = 64, kv_lora_rank = 512
  413. llama_kv_cache_init: layer 32: n_embd_head_qk_rope = 64, kv_lora_rank = 512
  414. llama_kv_cache_init: layer 33: n_embd_head_qk_rope = 64, kv_lora_rank = 512
  415. llama_kv_cache_init: layer 34: n_embd_head_qk_rope = 64, kv_lora_rank = 512
  416. llama_kv_cache_init: layer 35: n_embd_head_qk_rope = 64, kv_lora_rank = 512
  417. llama_kv_cache_init: layer 36: n_embd_head_qk_rope = 64, kv_lora_rank = 512
  418. llama_kv_cache_init: layer 37: n_embd_head_qk_rope = 64, kv_lora_rank = 512
  419. llama_kv_cache_init: layer 38: n_embd_head_qk_rope = 64, kv_lora_rank = 512
  420. llama_kv_cache_init: layer 39: n_embd_head_qk_rope = 64, kv_lora_rank = 512
  421. llama_kv_cache_init: layer 40: n_embd_head_qk_rope = 64, kv_lora_rank = 512
  422. llama_kv_cache_init: layer 41: n_embd_head_qk_rope = 64, kv_lora_rank = 512
  423. llama_kv_cache_init: layer 42: n_embd_head_qk_rope = 64, kv_lora_rank = 512
  424. llama_kv_cache_init: layer 43: n_embd_head_qk_rope = 64, kv_lora_rank = 512
  425. llama_kv_cache_init: layer 44: n_embd_head_qk_rope = 64, kv_lora_rank = 512
  426. llama_kv_cache_init: layer 45: n_embd_head_qk_rope = 64, kv_lora_rank = 512
  427. llama_kv_cache_init: layer 46: n_embd_head_qk_rope = 64, kv_lora_rank = 512
  428. llama_kv_cache_init: layer 47: n_embd_head_qk_rope = 64, kv_lora_rank = 512
  429. llama_kv_cache_init: layer 48: n_embd_head_qk_rope = 64, kv_lora_rank = 512
  430. llama_kv_cache_init: layer 49: n_embd_head_qk_rope = 64, kv_lora_rank = 512
  431. llama_kv_cache_init: layer 50: n_embd_head_qk_rope = 64, kv_lora_rank = 512
  432. llama_kv_cache_init: layer 51: n_embd_head_qk_rope = 64, kv_lora_rank = 512
  433. llama_kv_cache_init: layer 52: n_embd_head_qk_rope = 64, kv_lora_rank = 512
  434. llama_kv_cache_init: layer 53: n_embd_head_qk_rope = 64, kv_lora_rank = 512
  435. llama_kv_cache_init: layer 54: n_embd_head_qk_rope = 64, kv_lora_rank = 512
  436. llama_kv_cache_init: layer 55: n_embd_head_qk_rope = 64, kv_lora_rank = 512
  437. llama_kv_cache_init: layer 56: n_embd_head_qk_rope = 64, kv_lora_rank = 512
  438. llama_kv_cache_init: layer 57: n_embd_head_qk_rope = 64, kv_lora_rank = 512
  439. llama_kv_cache_init: layer 58: n_embd_head_qk_rope = 64, kv_lora_rank = 512
  440. llama_kv_cache_init: layer 59: n_embd_head_qk_rope = 64, kv_lora_rank = 512
  441. llama_kv_cache_init: layer 60: n_embd_head_qk_rope = 64, kv_lora_rank = 512
  442. llama_kv_cache_init: CUDA0 KV buffer size = 765.01 MiB
  443. llama_kv_cache_init: CUDA1 KV buffer size = 717.19 MiB
  444. llama_kv_cache_init: CUDA2 KV buffer size = 765.01 MiB
  445. llama_kv_cache_init: CUDA3 KV buffer size = 669.38 MiB
  446. llama_new_context_with_model: KV self size = 2916.56 MiB, c^KV (q8_0): 2916.56 MiB, kv^T: not used
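The "KV self size = 2916.56 MiB" figure follows directly from the per-layer lines above. A minimal sketch of the arithmetic, assuming that with `-mla` only the compressed c^KV is cached (kv_lora_rank + n_embd_head_qk_rope elements per token per layer) and that `-ctk q8_0` stores it as q8_0 blocks (34 bytes per 32 elements):

```python
# Sketch (assumption, not ik_llama.cpp source): reproduce the logged MLA KV
# cache size from the parameters visible in the log above.
n_layers = 61                     # layers 0..60 in the kv_cache_init lines
n_ctx = 81920                     # --ctx-size 81920
kv_lora_rank = 512
n_embd_head_qk_rope = 64
q8_0_bytes_per_elem = 34 / 32     # q8_0 block: 32 int8 values + one fp16 scale

elems_per_token_layer = kv_lora_rank + n_embd_head_qk_rope   # 576
total_bytes = elems_per_token_layer * n_ctx * n_layers * q8_0_bytes_per_elem
print(f"KV self size = {total_bytes / 2**20:.2f} MiB")
# -> KV self size = 2916.56 MiB, matching the log line above
```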
  447. llama_new_context_with_model: CUDA_Host output buffer size = 0.99 MiB
  448. llama_new_context_with_model: pipeline parallelism enabled (n_copies=4)
  449. llama_new_context_with_model: CUDA0 compute buffer size = 14443.01 MiB
  450. llama_new_context_with_model: CUDA1 compute buffer size = 14930.04 MiB
  451. llama_new_context_with_model: CUDA2 compute buffer size = 15378.04 MiB
  452. llama_new_context_with_model: CUDA3 compute buffer size = 14482.05 MiB
  453. llama_new_context_with_model: CUDA_Host compute buffer size = 4147.80 MiB
  454. llama_new_context_with_model: graph nodes = 8245
  455. llama_new_context_with_model: graph splits = 121
  456. INFO [ init] initializing slots | tid="127320811921408" timestamp=1744369114 n_slots=1
  457. INFO [ init] new slot | tid="127320811921408" timestamp=1744369114 id_slot=0 n_ctx_slot=81920
  458. INFO [ main] model loaded | tid="127320811921408" timestamp=1744369114
  459. INFO [ main] chat template | tid="127320811921408" timestamp=1744369114 chat_example="You are a helpful assistant\n\n<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>" built_in=true
  460. INFO [ main] HTTP server listening | tid="127320811921408" timestamp=1744369114 n_threads_http="127" port="5000" hostname="0.0.0.0"
  461. INFO [ update_slots] all slots are idle | tid="127320811921408" timestamp=1744369114
  462. INFO [ launch_slot_with_task] slot is processing task | tid="127320811921408" timestamp=1744369120 id_slot=0 id_task=0
  463. INFO [ update_slots] kv cache rm [p0, end) | tid="127320811921408" timestamp=1744369120 id_slot=0 id_task=0 p0=0
  464. INFO [ print_timings] prompt eval time = 8754.64 ms / 572 tokens ( 15.31 ms per token, 65.34 tokens per second) | tid="127320811921408" timestamp=1744369176 id_slot=0 id_task=0 t_prompt_processing=8754.639 n_prompt_tokens_processed=572 t_token=15.305312937062936 n_tokens_second=65.33678887273365
  465. INFO [ print_timings] generation eval time = 46791.42 ms / 341 runs ( 137.22 ms per token, 7.29 tokens per second) | tid="127320811921408" timestamp=1744369176 id_slot=0 id_task=0 t_token_generation=46791.423 n_decoded=341 t_token=137.2182492668622 n_tokens_second=7.287660390238612
  466. INFO [ print_timings] total time = 55546.06 ms | tid="127320811921408" timestamp=1744369176 id_slot=0 id_task=0 t_prompt_processing=8754.639 t_token_generation=46791.423 t_total=55546.062000000005
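The tokens-per-second figures in the `print_timings` lines are simply `n_tokens / (time_ms / 1000)`; a quick check against the first request's numbers (572 prompt tokens in 8754.639 ms, 341 generated tokens in 46791.423 ms):

```python
# Sketch: recompute the throughput figures from the raw timings in the log.
def tok_per_sec(n_tokens: int, t_ms: float) -> float:
    """Tokens per second from a token count and elapsed milliseconds."""
    return n_tokens / (t_ms / 1000.0)

print(f"prompt eval: {tok_per_sec(572, 8754.639):.2f} t/s")    # -> 65.34
print(f"generation:  {tok_per_sec(341, 46791.423):.2f} t/s")   # -> 7.29
```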
  467. INFO [ update_slots] slot released | tid="127320811921408" timestamp=1744369176 id_slot=0 id_task=0 n_ctx=81920 n_past=912 n_system_tokens=0 n_cache_tokens=912 truncated=false
  468. INFO [ update_slots] all slots are idle | tid="127320811921408" timestamp=1744369176
  469. INFO [ log_server_request] request | tid="127306284662784" timestamp=1744369176 remote_addr="127.0.0.1" remote_port=48812 status=200 method="POST" path="/completion" params={}
  470. INFO [ update_slots] all slots are idle | tid="127320811921408" timestamp=1744369176
  471. INFO [ launch_slot_with_task] slot is processing task | tid="127320811921408" timestamp=1744369183 id_slot=0 id_task=343
  472. INFO [ update_slots] we have to evaluate at least 1 token to generate logits | tid="127320811921408" timestamp=1744369183 id_slot=0 id_task=343
  473. INFO [ update_slots] kv cache rm [p0, end) | tid="127320811921408" timestamp=1744369183 id_slot=0 id_task=343 p0=571
  474. INFO [ print_timings] prompt eval time = 152.29 ms / 1 tokens ( 152.29 ms per token, 6.57 tokens per second) | tid="127320811921408" timestamp=1744369220 id_slot=0 id_task=343 t_prompt_processing=152.292 n_prompt_tokens_processed=1 t_token=152.292 n_tokens_second=6.566333096945342
  475. INFO [ print_timings] generation eval time = 36683.23 ms / 274 runs ( 133.88 ms per token, 7.47 tokens per second) | tid="127320811921408" timestamp=1744369220 id_slot=0 id_task=343 t_token_generation=36683.233 n_decoded=274 t_token=133.88041240875913 n_tokens_second=7.469352551341372
  476. INFO [ print_timings] total time = 36835.53 ms | tid="127320811921408" timestamp=1744369220 id_slot=0 id_task=343 t_prompt_processing=152.292 t_token_generation=36683.233 t_total=36835.525
  477. INFO [ log_server_request] request | tid="127306274177024" timestamp=1744369220 remote_addr="127.0.0.1" remote_port=40122 status=200 method="POST" path="/completion" params={}
  478. INFO [ update_slots] slot released | tid="127320811921408" timestamp=1744369220 id_slot=0 id_task=343 n_ctx=81920 n_past=845 n_system_tokens=0 n_cache_tokens=845 truncated=false
  479. INFO [ update_slots] all slots are idle | tid="127320811921408" timestamp=1744369220
  480. ^CINFO [ update_slots] all slots are idle | tid="127320811921408" timestamp=1744369231