./llama-cli -m /media/user/data/DSQ3/DeepSeek-V3-Q3_K_M/DeepSeek-V3-Q3_K_M-00001-of-00008.gguf --prompt "List the instructions to make honeycomb candy" -t 56 --no-context-shift --n-gpu-layers 25
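
For anyone who would rather drive this from Python, roughly the same run can be set up with the llama-cpp-python bindings. This is only a sketch under a couple of assumptions: that the bindings are installed with CUDA support, and that pointing them at the first shard of the split GGUF picks up the remaining shards the same way llama-cli does here. The parameter names mirror the CLI flags above; there is no direct equivalent shown for --no-context-shift.

# Hypothetical Python equivalent of the llama-cli invocation above,
# using llama-cpp-python (pip install llama-cpp-python). Paths and
# parameters are copied from the command line / log, not verified here.
from llama_cpp import Llama

llm = Llama(
    model_path="/media/user/data/DSQ3/DeepSeek-V3-Q3_K_M/DeepSeek-V3-Q3_K_M-00001-of-00008.gguf",
    n_gpu_layers=25,  # mirrors --n-gpu-layers 25
    n_threads=56,     # mirrors -t 56
    n_ctx=4096,       # matches the n_ctx reported later in this log
)

out = llm("List the instructions to make honeycomb candy", max_tokens=512)
print(out["choices"][0]["text"])
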
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 3 CUDA devices:
Device 0: NVIDIA A100-SXM-64GB, compute capability 8.0, VMM: yes
Device 1: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
Device 2: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
build: 4425 (6369f867) with cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file: using device CUDA0 (NVIDIA A100-SXM-64GB) - 64274 MiB free
llama_model_load_from_file: using device CUDA1 (NVIDIA RTX A6000) - 48400 MiB free
llama_model_load_from_file: using device CUDA2 (NVIDIA RTX A6000) - 48400 MiB free
llama_model_loader: additional 7 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 51 key-value pairs and 1025 tensors from /media/user/data/DSQ3/DeepSeek-V3-Q3_K_M/DeepSeek-V3-Q3_K_M-00001-of-00008.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = deepseek2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = DeepSeek V3 Bf16
llama_model_loader: - kv 3: general.size_label str = 256x20B
llama_model_loader: - kv 4: general.base_model.count u32 = 1
llama_model_loader: - kv 5: general.base_model.0.name str = DeepSeek V3
llama_model_loader: - kv 6: general.base_model.0.version str = V3
llama_model_loader: - kv 7: general.base_model.0.organization str = Deepseek Ai
llama_model_loader: - kv 8: general.base_model.0.repo_url str = https://huggingface.co/deepseek-ai/De...
llama_model_loader: - kv 9: deepseek2.block_count u32 = 61
llama_model_loader: - kv 10: deepseek2.context_length u32 = 163840
llama_model_loader: - kv 11: deepseek2.embedding_length u32 = 7168
llama_model_loader: - kv 12: deepseek2.feed_forward_length u32 = 18432
llama_model_loader: - kv 13: deepseek2.attention.head_count u32 = 128
llama_model_loader: - kv 14: deepseek2.attention.head_count_kv u32 = 128
llama_model_loader: - kv 15: deepseek2.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 16: deepseek2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 17: deepseek2.expert_used_count u32 = 8
llama_model_loader: - kv 18: general.file_type u32 = 12
llama_model_loader: - kv 19: deepseek2.leading_dense_block_count u32 = 3
llama_model_loader: - kv 20: deepseek2.vocab_size u32 = 129280
llama_model_loader: - kv 21: deepseek2.attention.q_lora_rank u32 = 1536
llama_model_loader: - kv 22: deepseek2.attention.kv_lora_rank u32 = 512
llama_model_loader: - kv 23: deepseek2.attention.key_length u32 = 192
llama_model_loader: - kv 24: deepseek2.attention.value_length u32 = 128
llama_model_loader: - kv 25: deepseek2.expert_feed_forward_length u32 = 2048
llama_model_loader: - kv 26: deepseek2.expert_count u32 = 256
llama_model_loader: - kv 27: deepseek2.expert_shared_count u32 = 1
llama_model_loader: - kv 28: deepseek2.expert_weights_scale f32 = 2.500000
llama_model_loader: - kv 29: deepseek2.expert_weights_norm bool = true
llama_model_loader: - kv 30: deepseek2.expert_gating_func u32 = 2
llama_model_loader: - kv 31: deepseek2.rope.dimension_count u32 = 64
llama_model_loader: - kv 32: deepseek2.rope.scaling.type str = yarn
llama_model_loader: - kv 33: deepseek2.rope.scaling.factor f32 = 40.000000
llama_model_loader: - kv 34: deepseek2.rope.scaling.original_context_length u32 = 4096
llama_model_loader: - kv 35: deepseek2.rope.scaling.yarn_log_multiplier f32 = 0.100000
llama_model_loader: - kv 36: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 37: tokenizer.ggml.pre str = deepseek-v3
llama_model_loader: - kv 38: tokenizer.ggml.tokens arr[str,129280] = ["<|begin▁of▁sentence|>", "<�...
llama_model_loader: - kv 39: tokenizer.ggml.token_type arr[i32,129280] = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 40: tokenizer.ggml.merges arr[str,127741] = ["Ġ t", "Ġ a", "i n", "Ġ Ġ", "h e...
llama_model_loader: - kv 41: tokenizer.ggml.bos_token_id u32 = 0
llama_model_loader: - kv 42: tokenizer.ggml.eos_token_id u32 = 1
llama_model_loader: - kv 43: tokenizer.ggml.padding_token_id u32 = 1
llama_model_loader: - kv 44: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 45: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 46: tokenizer.chat_template str = {% if not add_generation_prompt is de...
llama_model_loader: - kv 47: general.quantization_version u32 = 2
llama_model_loader: - kv 48: split.no u16 = 0
llama_model_loader: - kv 49: split.count u16 = 8
llama_model_loader: - kv 50: split.tensors.count i32 = 1025
llama_model_loader: - type f32: 361 tensors
llama_model_loader: - type q3_K: 483 tensors
llama_model_loader: - type q4_K: 177 tensors
llama_model_loader: - type q5_K: 3 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 818
llm_load_vocab: token to piece cache size = 0.8223 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = deepseek2
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 129280
llm_load_print_meta: n_merges = 127741
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 163840
llm_load_print_meta: n_embd = 7168
llm_load_print_meta: n_layer = 61
llm_load_print_meta: n_head = 128
llm_load_print_meta: n_head_kv = 128
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 192
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 24576
llm_load_print_meta: n_embd_v_gqa = 16384
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 18432
llm_load_print_meta: n_expert = 256
llm_load_print_meta: n_expert_used = 8
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = yarn
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 0.025
llm_load_print_meta: n_ctx_orig_yarn = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 671B
llm_load_print_meta: model ftype = Q3_K - Medium
llm_load_print_meta: model params = 671.03 B
llm_load_print_meta: model size = 297.27 GiB (3.81 BPW)
llm_load_print_meta: general.name = DeepSeek V3 Bf16
llm_load_print_meta: BOS token = 0 '<|begin▁of▁sentence|>'
llm_load_print_meta: EOS token = 1 '<|end▁of▁sentence|>'
llm_load_print_meta: EOT token = 1 '<|end▁of▁sentence|>'
llm_load_print_meta: PAD token = 1 '<|end▁of▁sentence|>'
llm_load_print_meta: LF token = 131 'Ä'
llm_load_print_meta: FIM PRE token = 128801 '<|fim▁begin|>'
llm_load_print_meta: FIM SUF token = 128800 '<|fim▁hole|>'
llm_load_print_meta: FIM MID token = 128802 '<|fim▁end|>'
llm_load_print_meta: EOG token = 1 '<|end▁of▁sentence|>'
llm_load_print_meta: max token length = 256
llm_load_print_meta: n_layer_dense_lead = 3
llm_load_print_meta: n_lora_q = 1536
llm_load_print_meta: n_lora_kv = 512
llm_load_print_meta: n_ff_exp = 2048
llm_load_print_meta: n_expert_shared = 1
llm_load_print_meta: expert_weights_scale = 2.5
llm_load_print_meta: expert_weights_norm = 1
llm_load_print_meta: expert_gating_func = sigmoid
llm_load_print_meta: rope_yarn_log_mul = 0.1000
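
As an aside, the "3.81 BPW" figure reported above is just the quantized file size divided by the parameter count; a quick sketch of that arithmetic, using the values copied from llm_load_print_meta:

# Check the bits-per-weight figure from the model summary above.
size_gib = 297.27      # llm_load_print_meta: model size
params = 671.03e9      # llm_load_print_meta: model params
bpw = size_gib * 1024**3 * 8 / params
print(round(bpw, 2))   # ~3.81
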
llm_load_tensors: offloading 25 repeating layers to GPU
llm_load_tensors: offloaded 25/62 layers to GPU
llm_load_tensors: CUDA0 model buffer size = 52145.17 MiB
llm_load_tensors: CUDA1 model buffer size = 41716.14 MiB
llm_load_tensors: CUDA2 model buffer size = 36501.62 MiB
llm_load_tensors: CPU_Mapped model buffer size = 42134.38 MiB
llm_load_tensors: CPU_Mapped model buffer size = 41716.14 MiB
llm_load_tensors: CPU_Mapped model buffer size = 41716.14 MiB
llm_load_tensors: CPU_Mapped model buffer size = 41716.14 MiB
llm_load_tensors: CPU_Mapped model buffer size = 6760.53 MiB
....................................................................................................
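
As a sanity check on the split above, the per-device model buffers reported by llm_load_tensors add up to the 297.27 GiB model size printed earlier; a minimal sketch of that arithmetic, with the values copied from this log:

# Sum the llm_load_tensors buffer sizes (MiB) reported above.
gpu_buffers = [52145.17, 41716.14, 36501.62]                       # CUDA0..CUDA2
cpu_buffers = [42134.38, 41716.14, 41716.14, 41716.14, 6760.53]    # CPU_Mapped
total_mib = sum(gpu_buffers) + sum(cpu_buffers)
print(round(total_mib / 1024, 2))  # ~297.27 GiB, matching llm_load_print_meta
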
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_ctx_per_seq = 4096
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 0.025
llama_new_context_with_model: n_ctx_per_seq (4096) < n_ctx_train (163840) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 61, can_shift = 0
llama_kv_cache_init: CUDA0 KV buffer size = 3200.00 MiB
llama_kv_cache_init: CUDA1 KV buffer size = 2560.00 MiB
llama_kv_cache_init: CUDA2 KV buffer size = 2240.00 MiB
llama_kv_cache_init: CPU KV buffer size = 11520.00 MiB
llama_new_context_with_model: KV self size = 19520.00 MiB, K (f16): 11712.00 MiB, V (f16): 7808.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.49 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 3630.00 MiB
llama_new_context_with_model: CUDA1 compute buffer size = 1186.00 MiB
llama_new_context_with_model: CUDA2 compute buffer size = 1186.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 88.01 MiB
llama_new_context_with_model: graph nodes = 5025
llama_new_context_with_model: graph splits = 675 (with bs=512), 5 (with bs=1)
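
The "KV self size = 19520.00 MiB" figure above follows from values printed earlier in this log (n_ctx = 4096, n_layer = 61, n_embd_k_gqa = 24576, n_embd_v_gqa = 16384, f16 cache entries at 2 bytes each); a minimal sketch of that calculation:

# Reproduce the KV cache size from the metadata reported earlier.
n_ctx, n_layer = 4096, 61
n_embd_k_gqa, n_embd_v_gqa = 24576, 16384
bytes_per_elem = 2  # type_k = type_v = 'f16'

k_mib = n_ctx * n_embd_k_gqa * n_layer * bytes_per_elem / 2**20
v_mib = n_ctx * n_embd_v_gqa * n_layer * bytes_per_elem / 2**20
print(k_mib, v_mib, k_mib + v_mib)  # 11712.0 7808.0 19520.0

With all 128 KV heads materialized at f16, that works out to roughly 4.8 MiB of cache per token of context, which is why the cache is nearly 20 GiB at only 4096 context.
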
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 56

system_info: n_threads = 56 (n_threads_batch = 56) / 112 | CUDA : ARCHS = 800,860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |

sampler seed: 2556559617
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 1

List the instructions to make honeycomb candy.

To make honeycomb candy, follow these instructions:

1. Prepare the ingredients: Gather 1 cup of granulated sugar, 1/4 cup of honey, 1/4 cup of water, 1 teaspoon of baking soda, and a candy thermometer.

2. Line a baking sheet: Line a baking sheet with parchment paper or a silicone baking mat to prevent sticking.

3. Combine sugar, honey, and water: In a medium-sized saucepan, combine the sugar, honey, and water. Stir gently to ensure the sugar is moistened.

4. Heat the mixture: Place the saucepan over medium heat and attach the candy thermometer to the side. Heat the mixture without stirring until it reaches 300°F (150°C), which is the hard crack stage.

5. Add baking soda: Once the mixture reaches the desired temperature, quickly remove the saucepan from the heat and add the baking soda. Stir gently but thoroughly to incorporate the baking soda, which will cause the mixture to foam and expand.

6. Pour onto the baking sheet: Immediately pour the foamy mixture onto the prepared baking sheet. Spread it out evenly using a spatula, being careful not to deflate the bubbles.

7. Let it cool: Allow the honeycomb candy to cool and harden completely at room temperature. This may take about 1-2 hours.

8. Break into pieces: Once the candy is completely cooled and hardened, break it into smaller, bite-sized pieces using your hands or a knife.

9. Store or enjoy: Store the honeycomb candy in an airtight container at room temperature or enjoy it right away. It is best consumed within a few days for optimal texture and flavor. [end of text]


llama_perf_sampler_print: sampling time = 29.60 ms / 352 runs ( 0.08 ms per token, 11892.70 tokens per second)
llama_perf_context_print: load time = 24553.27 ms
llama_perf_context_print: prompt eval time = 536.69 ms / 9 tokens ( 59.63 ms per token, 16.77 tokens per second)
llama_perf_context_print: eval time = 38243.46 ms / 342 runs ( 111.82 ms per token, 8.94 tokens per second)
llama_perf_context_print: total time = 38871.66 ms / 351 tokens
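
The perf counters at the end translate directly into throughput; a small sketch recomputing the tokens-per-second figures from the printed timings:

# Recompute throughput from the llama_perf timings above.
prompt_ms, prompt_tokens = 536.69, 9
eval_ms, eval_tokens = 38243.46, 342

print(round(prompt_tokens / (prompt_ms / 1000), 2))  # ~16.77 prompt tokens/s
print(round(eval_tokens / (eval_ms / 1000), 2))      # ~8.94 generated tokens/s

So with 25 of the model's layers offloaded across the three GPUs and the rest running from CPU-mapped memory, this run generates just under 9 tokens per second.
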