./llama-cli -m /media/user/data/DSQ3/DeepSeek-V3-Q3_K_M/DeepSeek-V3-Q3_K_M-00001-of-00008.gguf --prompt "List the instructions to make honeycomb candy" -t 56 --no-context-shift --n-gpu-layers 25
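
Note on the flags: `--n-gpu-layers 25` offloads 25 of the model's 61 repeating layers to the three GPUs and leaves the rest in CPU-mapped memory. A rough way to choose that number is to divide total free VRAM by the average per-layer weight size, keeping headroom for the KV cache and compute buffers. A minimal back-of-the-envelope sketch in Python — the per-GPU headroom figure is an assumption, everything else is taken from the log below:

```python
# Rough layer-count estimate for --n-gpu-layers; values from the log below.
free_vram_mib = 64274 + 48400 + 48400       # CUDA0 + CUDA1 + CUDA2 free VRAM
model_size_mib = 297.27 * 1024              # Q3_K_M weights: 297.27 GiB
per_layer_mib = model_size_mib / 61         # ~4990 MiB per layer on average

headroom_mib = 3 * 6144                     # assumed KV-cache/compute headroom per GPU
n_gpu_layers = int((free_vram_mib - headroom_mib) / per_layer_mib)
print(n_gpu_layers)                         # ~28; the run above uses a safer 25
```
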
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 3 CUDA devices:
Device 0: NVIDIA A100-SXM-64GB, compute capability 8.0, VMM: yes
Device 1: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
Device 2: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
build: 4425 (6369f867) with cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file: using device CUDA0 (NVIDIA A100-SXM-64GB) - 64274 MiB free
llama_model_load_from_file: using device CUDA1 (NVIDIA RTX A6000) - 48400 MiB free
llama_model_load_from_file: using device CUDA2 (NVIDIA RTX A6000) - 48400 MiB free
llama_model_loader: additional 7 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 51 key-value pairs and 1025 tensors from /media/user/data/DSQ3/DeepSeek-V3-Q3_K_M/DeepSeek-V3-Q3_K_M-00001-of-00008.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = deepseek2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = DeepSeek V3 Bf16
llama_model_loader: - kv 3: general.size_label str = 256x20B
llama_model_loader: - kv 4: general.base_model.count u32 = 1
llama_model_loader: - kv 5: general.base_model.0.name str = DeepSeek V3
llama_model_loader: - kv 6: general.base_model.0.version str = V3
llama_model_loader: - kv 7: general.base_model.0.organization str = Deepseek Ai
llama_model_loader: - kv 8: general.base_model.0.repo_url str = https://huggingface.co/deepseek-ai/De...
llama_model_loader: - kv 9: deepseek2.block_count u32 = 61
llama_model_loader: - kv 10: deepseek2.context_length u32 = 163840
llama_model_loader: - kv 11: deepseek2.embedding_length u32 = 7168
llama_model_loader: - kv 12: deepseek2.feed_forward_length u32 = 18432
llama_model_loader: - kv 13: deepseek2.attention.head_count u32 = 128
llama_model_loader: - kv 14: deepseek2.attention.head_count_kv u32 = 128
llama_model_loader: - kv 15: deepseek2.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 16: deepseek2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 17: deepseek2.expert_used_count u32 = 8
llama_model_loader: - kv 18: general.file_type u32 = 12
llama_model_loader: - kv 19: deepseek2.leading_dense_block_count u32 = 3
llama_model_loader: - kv 20: deepseek2.vocab_size u32 = 129280
llama_model_loader: - kv 21: deepseek2.attention.q_lora_rank u32 = 1536
llama_model_loader: - kv 22: deepseek2.attention.kv_lora_rank u32 = 512
llama_model_loader: - kv 23: deepseek2.attention.key_length u32 = 192
llama_model_loader: - kv 24: deepseek2.attention.value_length u32 = 128
llama_model_loader: - kv 25: deepseek2.expert_feed_forward_length u32 = 2048
llama_model_loader: - kv 26: deepseek2.expert_count u32 = 256
llama_model_loader: - kv 27: deepseek2.expert_shared_count u32 = 1
llama_model_loader: - kv 28: deepseek2.expert_weights_scale f32 = 2.500000
llama_model_loader: - kv 29: deepseek2.expert_weights_norm bool = true
llama_model_loader: - kv 30: deepseek2.expert_gating_func u32 = 2
llama_model_loader: - kv 31: deepseek2.rope.dimension_count u32 = 64
llama_model_loader: - kv 32: deepseek2.rope.scaling.type str = yarn
llama_model_loader: - kv 33: deepseek2.rope.scaling.factor f32 = 40.000000
llama_model_loader: - kv 34: deepseek2.rope.scaling.original_context_length u32 = 4096
llama_model_loader: - kv 35: deepseek2.rope.scaling.yarn_log_multiplier f32 = 0.100000
llama_model_loader: - kv 36: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 37: tokenizer.ggml.pre str = deepseek-v3
llama_model_loader: - kv 38: tokenizer.ggml.tokens arr[str,129280] = ["<|begin▁of▁sentence|>", "<�...
llama_model_loader: - kv 39: tokenizer.ggml.token_type arr[i32,129280] = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 40: tokenizer.ggml.merges arr[str,127741] = ["Ġ t", "Ġ a", "i n", "Ġ Ġ", "h e...
llama_model_loader: - kv 41: tokenizer.ggml.bos_token_id u32 = 0
llama_model_loader: - kv 42: tokenizer.ggml.eos_token_id u32 = 1
llama_model_loader: - kv 43: tokenizer.ggml.padding_token_id u32 = 1
llama_model_loader: - kv 44: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 45: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 46: tokenizer.chat_template str = {% if not add_generation_prompt is de...
llama_model_loader: - kv 47: general.quantization_version u32 = 2
llama_model_loader: - kv 48: split.no u16 = 0
llama_model_loader: - kv 49: split.count u16 = 8
llama_model_loader: - kv 50: split.tensors.count i32 = 1025
llama_model_loader: - type f32: 361 tensors
llama_model_loader: - type q3_K: 483 tensors
llama_model_loader: - type q4_K: 177 tensors
llama_model_loader: - type q5_K: 3 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 818
llm_load_vocab: token to piece cache size = 0.8223 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = deepseek2
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 129280
llm_load_print_meta: n_merges = 127741
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 163840
llm_load_print_meta: n_embd = 7168
llm_load_print_meta: n_layer = 61
llm_load_print_meta: n_head = 128
llm_load_print_meta: n_head_kv = 128
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 192
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 24576
llm_load_print_meta: n_embd_v_gqa = 16384
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 18432
llm_load_print_meta: n_expert = 256
llm_load_print_meta: n_expert_used = 8
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = yarn
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 0.025
llm_load_print_meta: n_ctx_orig_yarn = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 671B
llm_load_print_meta: model ftype = Q3_K - Medium
llm_load_print_meta: model params = 671.03 B
llm_load_print_meta: model size = 297.27 GiB (3.81 BPW)
llm_load_print_meta: general.name = DeepSeek V3 Bf16
llm_load_print_meta: BOS token = 0 '<|begin▁of▁sentence|>'
llm_load_print_meta: EOS token = 1 '<|end▁of▁sentence|>'
llm_load_print_meta: EOT token = 1 '<|end▁of▁sentence|>'
llm_load_print_meta: PAD token = 1 '<|end▁of▁sentence|>'
llm_load_print_meta: LF token = 131 'Ä'
llm_load_print_meta: FIM PRE token = 128801 '<|fim▁begin|>'
llm_load_print_meta: FIM SUF token = 128800 '<|fim▁hole|>'
llm_load_print_meta: FIM MID token = 128802 '<|fim▁end|>'
llm_load_print_meta: EOG token = 1 '<|end▁of▁sentence|>'
llm_load_print_meta: max token length = 256
llm_load_print_meta: n_layer_dense_lead = 3
llm_load_print_meta: n_lora_q = 1536
llm_load_print_meta: n_lora_kv = 512
llm_load_print_meta: n_ff_exp = 2048
llm_load_print_meta: n_expert_shared = 1
llm_load_print_meta: expert_weights_scale = 2.5
llm_load_print_meta: expert_weights_norm = 1
llm_load_print_meta: expert_gating_func = sigmoid
llm_load_print_meta: rope_yarn_log_mul = 0.1000
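
The "3.81 BPW" in the model size line above is just the quantized file size divided by the parameter count; a quick check:

```python
# Sanity check of the reported bits-per-weight (BPW).
size_bits = 297.27 * (1024**3) * 8   # model size: 297.27 GiB in bits
params = 671.03e9                    # 671.03 B parameters
print(size_bits / params)            # ~3.805 -> matches the printed "3.81 BPW"
```
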
llm_load_tensors: offloading 25 repeating layers to GPU
llm_load_tensors: offloaded 25/62 layers to GPU
llm_load_tensors: CUDA0 model buffer size = 52145.17 MiB
llm_load_tensors: CUDA1 model buffer size = 41716.14 MiB
llm_load_tensors: CUDA2 model buffer size = 36501.62 MiB
llm_load_tensors: CPU_Mapped model buffer size = 42134.38 MiB
llm_load_tensors: CPU_Mapped model buffer size = 41716.14 MiB
llm_load_tensors: CPU_Mapped model buffer size = 41716.14 MiB
llm_load_tensors: CPU_Mapped model buffer size = 41716.14 MiB
llm_load_tensors: CPU_Mapped model buffer size = 6760.53 MiB
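
As a consistency check, the GPU and CPU-mapped buffers above should add up to the 297.27 GiB model size printed earlier (values copied from the log):

```python
cuda_mib = [52145.17, 41716.14, 36501.62]                     # CUDA0..CUDA2
cpu_mib = [42134.38, 41716.14, 41716.14, 41716.14, 6760.53]   # CPU_Mapped splits
print((sum(cuda_mib) + sum(cpu_mib)) / 1024)                  # ~297.27 GiB
```
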
....................................................................................................
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_ctx_per_seq = 4096
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 0.025
llama_new_context_with_model: n_ctx_per_seq (4096) < n_ctx_train (163840) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 61, can_shift = 0
llama_kv_cache_init: CUDA0 KV buffer size = 3200.00 MiB
llama_kv_cache_init: CUDA1 KV buffer size = 2560.00 MiB
llama_kv_cache_init: CUDA2 KV buffer size = 2240.00 MiB
llama_kv_cache_init: CPU KV buffer size = 11520.00 MiB
llama_new_context_with_model: KV self size = 19520.00 MiB, K (f16): 11712.00 MiB, V (f16): 7808.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.49 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 3630.00 MiB
llama_new_context_with_model: CUDA1 compute buffer size = 1186.00 MiB
llama_new_context_with_model: CUDA2 compute buffer size = 1186.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 88.01 MiB
llama_new_context_with_model: graph nodes = 5025
llama_new_context_with_model: graph splits = 675 (with bs=512), 5 (with bs=1)
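
The KV cache sizes follow directly from the hyperparameters printed above: 61 layers, a 4096-token context, and f16 (2-byte) K/V entries of width n_embd_k_gqa = 24576 and n_embd_v_gqa = 16384. A small sketch reproducing the printed figures:

```python
# Reproduce the KV cache sizes from the printed hyperparameters.
n_ctx, n_layer, bytes_f16 = 4096, 61, 2
n_embd_k_gqa, n_embd_v_gqa = 24576, 16384    # n_head_kv * head dim: 128*192, 128*128

k_mib = n_ctx * n_layer * n_embd_k_gqa * bytes_f16 / 1024**2
v_mib = n_ctx * n_layer * n_embd_v_gqa * bytes_f16 / 1024**2
print(k_mib, v_mib, k_mib + v_mib)           # 11712.0 7808.0 19520.0 MiB
```
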
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 56
system_info: n_threads = 56 (n_threads_batch = 56) / 112 | CUDA : ARCHS = 800,860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
sampler seed: 2144914929
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
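
Each stage in the sampler chain filters or reweights the logits before the final `dist` draw. As an illustration only (not llama.cpp's actual code), a minimal sketch of the min-p stage with the `min_p = 0.050` setting above, which drops tokens whose probability falls below 5% of the most likely token's:

```python
import numpy as np

def min_p_filter(logits: np.ndarray, min_p: float = 0.05) -> np.ndarray:
    """Illustrative min-p step: mask tokens with prob < min_p * top prob."""
    probs = np.exp(logits - logits.max())    # softmax, numerically stable
    probs /= probs.sum()
    return np.where(probs >= min_p * probs.max(), logits, -np.inf)
```
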
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 1

List the instructions to make honeycomb candy.
To make honeycomb candy, also known as cinder toffee or hokey pokey, you'll need the following ingredients and instructions:

### Ingredients:
- 1 cup granulated sugar
- 1/4 cup light corn syrup (or golden syrup)
- 1 tablespoon water
- 2 teaspoons baking soda
- 1 teaspoon vanilla extract (optional)

### Instructions:
1. **Prepare Your Pan:**
   - Line a baking sheet with parchment paper or a silicone baking mat. Have it ready before you start cooking.
2. **Combine Sugar, Corn Syrup, and Water:**
   - In a medium-sized heavy-bottomed saucepan, combine the sugar, corn syrup, and water. Stir gently to mix.
3. **Heat the Mixture:**
   - Place the saucepan over medium heat. Stir occasionally until the sugar has completely dissolved. Once dissolved, stop stirring and let the mixture come to a boil.
4. **Boil the Syrup:**
   - Allow the syrup to boil without stirring. Insert a candy thermometer if you have one, and cook until the temperature reaches 300°F (150°C), which is the hard crack stage. If you don’t have a thermometer, you can test by dropping a small amount of the syrup into a bowl of cold water; it should form hard, brittle threads.
5. **Remove from Heat:**
   - Once the syrup reaches the correct temperature, immediately remove the saucepan from the heat.
6. **Add Baking Soda and Vanilla:**
   - Quickly add the baking soda and vanilla extract (if using) to the syrup. Be careful as the mixture will foam up significantly. Stir gently to combine.
7. **Pour onto Prepared Pan:**
   - Immediately pour the foamy mixture onto your prepared baking sheet. Do not spread it out; the mixture will spread on its own.
8. **Cool and Set:**
   - Allow the honeycomb candy to cool completely at room temperature. This will take about 1-2 hours.
9. **Break into Pieces:**
   - Once the candy has cooled and hardened, break it into pieces using your hands or a knife.
10. **Store:**
    - Store the honeycomb candy in an airtight container to keep it crisp. It can be enjoyed as is or used as a topping for desserts.

### Tips:
- Be careful when working with hot sugar syrup as it can cause severe burns.
- The baking soda is crucial for creating the honeycomb texture, so make sure it’s fresh and active.
- If you want to add a chocolate coating, you can melt some chocolate and dip the pieces in it before letting them set.

Enjoy your homemade honeycomb candy! [end of text]

llama_perf_sampler_print: sampling time = 48.19 ms / 569 runs ( 0.08 ms per token, 11808.65 tokens per second)
llama_perf_context_print: load time = 23622.38 ms
llama_perf_context_print: prompt eval time = 543.71 ms / 9 tokens ( 60.41 ms per token, 16.55 tokens per second)
llama_perf_context_print: eval time = 63513.76 ms / 559 runs ( 113.62 ms per token, 8.80 tokens per second)
llama_perf_context_print: total time = 64207.58 ms / 568 tokens
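
The tokens-per-second figures in the perf lines are simply runs divided by wall time; for example, the 8.80 t/s generation speed:

```python
# Generation speed from the eval line above.
eval_ms, eval_runs = 63513.76, 559
print(eval_runs / (eval_ms / 1000.0))   # ~8.80 tokens per second
```
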