- ./llama-cli -m /media/user/data/DSQ3/DeepSeek-V3-Q3_K_M/DeepSeek-V3-Q3_K_M-00001-of-00008.gguf --prompt "List the instructions to make honeycomb candy" -t 56 --no-context-shift --n-gpu-layers 25
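For reference, the flags in that invocation map directly onto the log output below. A minimal sketch of the same launch driven from Python (hypothetical wrapper, assuming llama-cli sits in the current directory):

```python
import subprocess

cmd = [
    "./llama-cli",
    # first shard of the 8-part GGUF; the remaining 7 splits are picked up automatically
    "-m", "/media/user/data/DSQ3/DeepSeek-V3-Q3_K_M/DeepSeek-V3-Q3_K_M-00001-of-00008.gguf",
    "--prompt", "List the instructions to make honeycomb candy",
    "-t", "56",              # CPU threads (system_info below reports n_threads = 56)
    "--no-context-shift",    # disable context shifting (the KV cache init below shows can_shift = 0)
    "--n-gpu-layers", "25",  # offload 25 layers to the GPUs (llm_load_tensors reports 25/62 offloaded)
]
subprocess.run(cmd, check=True)
```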
- ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
- ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
- ggml_cuda_init: found 3 CUDA devices:
- Device 0: NVIDIA A100-SXM-64GB, compute capability 8.0, VMM: yes
- Device 1: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
- Device 2: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
- build: 4425 (6369f867) with cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 for x86_64-linux-gnu
- main: llama backend init
- main: load the model and apply lora adapter, if any
- llama_model_load_from_file: using device CUDA0 (NVIDIA A100-SXM-64GB) - 64274 MiB free
- llama_model_load_from_file: using device CUDA1 (NVIDIA RTX A6000) - 48400 MiB free
- llama_model_load_from_file: using device CUDA2 (NVIDIA RTX A6000) - 48400 MiB free
- llama_model_loader: additional 7 GGUFs metadata loaded.
- llama_model_loader: loaded meta data with 51 key-value pairs and 1025 tensors from /media/user/data/DSQ3/DeepSeek-V3-Q3_K_M/DeepSeek-V3-Q3_K_M-00001-of-00008.gguf (version GGUF V3 (latest))
- llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
- llama_model_loader: - kv 0: general.architecture str = deepseek2
- llama_model_loader: - kv 1: general.type str = model
- llama_model_loader: - kv 2: general.name str = DeepSeek V3 Bf16
- llama_model_loader: - kv 3: general.size_label str = 256x20B
- llama_model_loader: - kv 4: general.base_model.count u32 = 1
- llama_model_loader: - kv 5: general.base_model.0.name str = DeepSeek V3
- llama_model_loader: - kv 6: general.base_model.0.version str = V3
- llama_model_loader: - kv 7: general.base_model.0.organization str = Deepseek Ai
- llama_model_loader: - kv 8: general.base_model.0.repo_url str = https://huggingface.co/deepseek-ai/De...
- llama_model_loader: - kv 9: deepseek2.block_count u32 = 61
- llama_model_loader: - kv 10: deepseek2.context_length u32 = 163840
- llama_model_loader: - kv 11: deepseek2.embedding_length u32 = 7168
- llama_model_loader: - kv 12: deepseek2.feed_forward_length u32 = 18432
- llama_model_loader: - kv 13: deepseek2.attention.head_count u32 = 128
- llama_model_loader: - kv 14: deepseek2.attention.head_count_kv u32 = 128
- llama_model_loader: - kv 15: deepseek2.rope.freq_base f32 = 10000.000000
- llama_model_loader: - kv 16: deepseek2.attention.layer_norm_rms_epsilon f32 = 0.000001
- llama_model_loader: - kv 17: deepseek2.expert_used_count u32 = 8
- llama_model_loader: - kv 18: general.file_type u32 = 12
- llama_model_loader: - kv 19: deepseek2.leading_dense_block_count u32 = 3
- llama_model_loader: - kv 20: deepseek2.vocab_size u32 = 129280
- llama_model_loader: - kv 21: deepseek2.attention.q_lora_rank u32 = 1536
- llama_model_loader: - kv 22: deepseek2.attention.kv_lora_rank u32 = 512
- llama_model_loader: - kv 23: deepseek2.attention.key_length u32 = 192
- llama_model_loader: - kv 24: deepseek2.attention.value_length u32 = 128
- llama_model_loader: - kv 25: deepseek2.expert_feed_forward_length u32 = 2048
- llama_model_loader: - kv 26: deepseek2.expert_count u32 = 256
- llama_model_loader: - kv 27: deepseek2.expert_shared_count u32 = 1
- llama_model_loader: - kv 28: deepseek2.expert_weights_scale f32 = 2.500000
- llama_model_loader: - kv 29: deepseek2.expert_weights_norm bool = true
- llama_model_loader: - kv 30: deepseek2.expert_gating_func u32 = 2
- llama_model_loader: - kv 31: deepseek2.rope.dimension_count u32 = 64
- llama_model_loader: - kv 32: deepseek2.rope.scaling.type str = yarn
- llama_model_loader: - kv 33: deepseek2.rope.scaling.factor f32 = 40.000000
- llama_model_loader: - kv 34: deepseek2.rope.scaling.original_context_length u32 = 4096
- llama_model_loader: - kv 35: deepseek2.rope.scaling.yarn_log_multiplier f32 = 0.100000
- llama_model_loader: - kv 36: tokenizer.ggml.model str = gpt2
- llama_model_loader: - kv 37: tokenizer.ggml.pre str = deepseek-v3
- llama_model_loader: - kv 38: tokenizer.ggml.tokens arr[str,129280] = ["<|begin▁of▁sentence|>", "<�...
- llama_model_loader: - kv 39: tokenizer.ggml.token_type arr[i32,129280] = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
- llama_model_loader: - kv 40: tokenizer.ggml.merges arr[str,127741] = ["Ġ t", "Ġ a", "i n", "Ġ Ġ", "h e...
- llama_model_loader: - kv 41: tokenizer.ggml.bos_token_id u32 = 0
- llama_model_loader: - kv 42: tokenizer.ggml.eos_token_id u32 = 1
- llama_model_loader: - kv 43: tokenizer.ggml.padding_token_id u32 = 1
- llama_model_loader: - kv 44: tokenizer.ggml.add_bos_token bool = true
- llama_model_loader: - kv 45: tokenizer.ggml.add_eos_token bool = false
- llama_model_loader: - kv 46: tokenizer.chat_template str = {% if not add_generation_prompt is de...
- llama_model_loader: - kv 47: general.quantization_version u32 = 2
- llama_model_loader: - kv 48: split.no u16 = 0
- llama_model_loader: - kv 49: split.count u16 = 8
- llama_model_loader: - kv 50: split.tensors.count i32 = 1025
- llama_model_loader: - type f32: 361 tensors
- llama_model_loader: - type q3_K: 483 tensors
- llama_model_loader: - type q4_K: 177 tensors
- llama_model_loader: - type q5_K: 3 tensors
- llama_model_loader: - type q6_K: 1 tensors
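The per-type tensor counts above add up to the 1025 tensors announced by split.tensors.count; a trivial check:

```python
tensor_counts = {"f32": 361, "q3_K": 483, "q4_K": 177, "q5_K": 3, "q6_K": 1}
assert sum(tensor_counts.values()) == 1025   # matches split.tensors.count in the metadata
```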
- llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
- llm_load_vocab: special tokens cache size = 818
- llm_load_vocab: token to piece cache size = 0.8223 MB
- llm_load_print_meta: format = GGUF V3 (latest)
- llm_load_print_meta: arch = deepseek2
- llm_load_print_meta: vocab type = BPE
- llm_load_print_meta: n_vocab = 129280
- llm_load_print_meta: n_merges = 127741
- llm_load_print_meta: vocab_only = 0
- llm_load_print_meta: n_ctx_train = 163840
- llm_load_print_meta: n_embd = 7168
- llm_load_print_meta: n_layer = 61
- llm_load_print_meta: n_head = 128
- llm_load_print_meta: n_head_kv = 128
- llm_load_print_meta: n_rot = 64
- llm_load_print_meta: n_swa = 0
- llm_load_print_meta: n_embd_head_k = 192
- llm_load_print_meta: n_embd_head_v = 128
- llm_load_print_meta: n_gqa = 1
- llm_load_print_meta: n_embd_k_gqa = 24576
- llm_load_print_meta: n_embd_v_gqa = 16384
- llm_load_print_meta: f_norm_eps = 0.0e+00
- llm_load_print_meta: f_norm_rms_eps = 1.0e-06
- llm_load_print_meta: f_clamp_kqv = 0.0e+00
- llm_load_print_meta: f_max_alibi_bias = 0.0e+00
- llm_load_print_meta: f_logit_scale = 0.0e+00
- llm_load_print_meta: n_ff = 18432
- llm_load_print_meta: n_expert = 256
- llm_load_print_meta: n_expert_used = 8
- llm_load_print_meta: causal attn = 1
- llm_load_print_meta: pooling type = 0
- llm_load_print_meta: rope type = 0
- llm_load_print_meta: rope scaling = yarn
- llm_load_print_meta: freq_base_train = 10000.0
- llm_load_print_meta: freq_scale_train = 0.025
- llm_load_print_meta: n_ctx_orig_yarn = 4096
- llm_load_print_meta: rope_finetuned = unknown
- llm_load_print_meta: ssm_d_conv = 0
- llm_load_print_meta: ssm_d_inner = 0
- llm_load_print_meta: ssm_d_state = 0
- llm_load_print_meta: ssm_dt_rank = 0
- llm_load_print_meta: ssm_dt_b_c_rms = 0
- llm_load_print_meta: model type = 671B
- llm_load_print_meta: model ftype = Q3_K - Medium
- llm_load_print_meta: model params = 671.03 B
- llm_load_print_meta: model size = 297.27 GiB (3.81 BPW)
- llm_load_print_meta: general.name = DeepSeek V3 Bf16
- llm_load_print_meta: BOS token = 0 '<|begin▁of▁sentence|>'
- llm_load_print_meta: EOS token = 1 '<|end▁of▁sentence|>'
- llm_load_print_meta: EOT token = 1 '<|end▁of▁sentence|>'
- llm_load_print_meta: PAD token = 1 '<|end▁of▁sentence|>'
- llm_load_print_meta: LF token = 131 'Ä'
- llm_load_print_meta: FIM PRE token = 128801 '<|fim▁begin|>'
- llm_load_print_meta: FIM SUF token = 128800 '<|fim▁hole|>'
- llm_load_print_meta: FIM MID token = 128802 '<|fim▁end|>'
- llm_load_print_meta: EOG token = 1 '<|end▁of▁sentence|>'
- llm_load_print_meta: max token length = 256
- llm_load_print_meta: n_layer_dense_lead = 3
- llm_load_print_meta: n_lora_q = 1536
- llm_load_print_meta: n_lora_kv = 512
- llm_load_print_meta: n_ff_exp = 2048
- llm_load_print_meta: n_expert_shared = 1
- llm_load_print_meta: expert_weights_scale = 2.5
- llm_load_print_meta: expert_weights_norm = 1
- llm_load_print_meta: expert_gating_func = sigmoid
- llm_load_print_meta: rope_yarn_log_mul = 0.1000
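The reported 3.81 BPW is consistent with the size and parameter count printed above; a quick check (taking GiB as 2^30 bytes):

```python
size_gib, params = 297.27, 671.03e9
bpw = size_gib * 2**30 * 8 / params
print(round(bpw, 2))   # 3.81, matching "model size = 297.27 GiB (3.81 BPW)"
```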
- llm_load_tensors: offloading 25 repeating layers to GPU
- llm_load_tensors: offloaded 25/62 layers to GPU
- llm_load_tensors: CUDA0 model buffer size = 52145.17 MiB
- llm_load_tensors: CUDA1 model buffer size = 41716.14 MiB
- llm_load_tensors: CUDA2 model buffer size = 36501.62 MiB
- llm_load_tensors: CPU_Mapped model buffer size = 42134.38 MiB
- llm_load_tensors: CPU_Mapped model buffer size = 41716.14 MiB
- llm_load_tensors: CPU_Mapped model buffer size = 41716.14 MiB
- llm_load_tensors: CPU_Mapped model buffer size = 41716.14 MiB
- llm_load_tensors: CPU_Mapped model buffer size = 6760.53 MiB
- ....................................................................................................
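Summing the model buffers shows how the ~297 GiB of weights end up split between VRAM and mapped host memory with --n-gpu-layers 25 (a rough tally, assuming the buffers above account for all weights):

```python
gpu_mib = [52145.17, 41716.14, 36501.62]                     # CUDA0 / CUDA1 / CUDA2 model buffers
cpu_mib = [42134.38, 41716.14, 41716.14, 41716.14, 6760.53]  # CPU_Mapped model buffers
total = sum(gpu_mib) + sum(cpu_mib)
print(total / 1024)          # ~297.27 GiB, matching the reported model size
print(sum(gpu_mib) / total)  # ~0.43 -> roughly 43% of the weights live in VRAM
# 130362.93 MiB of the 161074 MiB reported free across the 3 GPUs,
# leaving headroom for the KV cache and compute buffers allocated below
print(sum(gpu_mib), 64274 + 48400 + 48400)
```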
- llama_new_context_with_model: n_seq_max = 1
- llama_new_context_with_model: n_ctx = 4096
- llama_new_context_with_model: n_ctx_per_seq = 4096
- llama_new_context_with_model: n_batch = 2048
- llama_new_context_with_model: n_ubatch = 512
- llama_new_context_with_model: flash_attn = 0
- llama_new_context_with_model: freq_base = 10000.0
- llama_new_context_with_model: freq_scale = 0.025
- llama_new_context_with_model: n_ctx_per_seq (4096) < n_ctx_train (163840) -- the full capacity of the model will not be utilized
- llama_kv_cache_init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 61, can_shift = 0
- llama_kv_cache_init: CUDA0 KV buffer size = 3200.00 MiB
- llama_kv_cache_init: CUDA1 KV buffer size = 2560.00 MiB
- llama_kv_cache_init: CUDA2 KV buffer size = 2240.00 MiB
- llama_kv_cache_init: CPU KV buffer size = 11520.00 MiB
- llama_new_context_with_model: KV self size = 19520.00 MiB, K (f16): 11712.00 MiB, V (f16): 7808.00 MiB
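The 19,520 MiB KV cache follows directly from the dimensions printed in llm_load_print_meta; a minimal sketch of that arithmetic (f16 = 2 bytes per element):

```python
n_layer, n_ctx, bytes_f16 = 61, 4096, 2
n_embd_k_gqa, n_embd_v_gqa = 24576, 16384   # per-token K and V widths from llm_load_print_meta
k_mib = n_layer * n_ctx * n_embd_k_gqa * bytes_f16 / 2**20
v_mib = n_layer * n_ctx * n_embd_v_gqa * bytes_f16 / 2**20
print(k_mib, v_mib, k_mib + v_mib)          # 11712.0 7808.0 19520.0, matching the log
```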
- llama_new_context_with_model: CPU output buffer size = 0.49 MiB
- llama_new_context_with_model: CUDA0 compute buffer size = 3630.00 MiB
- llama_new_context_with_model: CUDA1 compute buffer size = 1186.00 MiB
- llama_new_context_with_model: CUDA2 compute buffer size = 1186.00 MiB
- llama_new_context_with_model: CUDA_Host compute buffer size = 88.01 MiB
- llama_new_context_with_model: graph nodes = 5025
- llama_new_context_with_model: graph splits = 675 (with bs=512), 5 (with bs=1)
- common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
- common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
- main: llama threadpool init, n_threads = 56
- system_info: n_threads = 56 (n_threads_batch = 56) / 112 | CUDA : ARCHS = 800,860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
- sampler seed: 2556559617
- sampler params:
- repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
- dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
- top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
- mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
- sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
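At these settings the penalties, DRY, typical-p and XTC stages are effectively no-ops (multipliers of 1.0 / 0.0), so the chain reduces to top-k, top-p, min-p and temperature before the final draw. A simplified NumPy sketch of that reduced chain, for illustration only (not the llama.cpp code):

```python
import numpy as np

def softmax(x):
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()

def sample(logits, top_k=40, top_p=0.95, min_p=0.05, temp=0.8, rng=None):
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64)
    # top-k: keep the k highest-logit candidates
    idx = np.argsort(logits)[::-1][:top_k]
    probs = softmax(logits[idx])
    # top-p: smallest prefix (by descending probability) whose cumulative mass reaches top_p
    order = np.argsort(probs)[::-1]
    cutoff = int(np.searchsorted(np.cumsum(probs[order]), top_p)) + 1
    idx, probs = idx[order[:cutoff]], probs[order[:cutoff]]
    probs /= probs.sum()
    # min-p: drop candidates whose probability falls below min_p * best probability
    idx = idx[probs >= min_p * probs.max()]
    # temperature on the surviving logits, then one draw from the renormalized distribution
    final = softmax(logits[idx] / temp)
    return int(rng.choice(idx, p=final))

next_token = sample(np.random.randn(129280))   # vocab size taken from the metadata above
```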
- generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 1
- List the instructions to make honeycomb candy.
- To make honeycomb candy, follow these instructions:
- 1. Prepare the ingredients: Gather 1 cup of granulated sugar, 1/4 cup of honey, 1/4 cup of water, 1 teaspoon of baking soda, and a candy thermometer.
- 2. Line a baking sheet: Line a baking sheet with parchment paper or a silicone baking mat to prevent sticking.
- 3. Combine sugar, honey, and water: In a medium-sized saucepan, combine the sugar, honey, and water. Stir gently to ensure the sugar is moistened.
- 4. Heat the mixture: Place the saucepan over medium heat and attach the candy thermometer to the side. Heat the mixture without stirring until it reaches 300°F (150°C), which is the hard crack stage.
- 5. Add baking soda: Once the mixture reaches the desired temperature, quickly remove the saucepan from the heat and add the baking soda. Stir gently but thoroughly to incorporate the baking soda, which will cause the mixture to foam and expand.
- 6. Pour onto the baking sheet: Immediately pour the foamy mixture onto the prepared baking sheet. Spread it out evenly using a spatula, being careful not to deflate the bubbles.
- 7. Let it cool: Allow the honeycomb candy to cool and harden completely at room temperature. This may take about 1-2 hours.
- 8. Break into pieces: Once the candy is completely cooled and hardened, break it into smaller, bite-sized pieces using your hands or a knife.
- 9. Store or enjoy: Store the honeycomb candy in an airtight container at room temperature or enjoy it right away. It is best consumed within a few days for optimal texture and flavor. [end of text]
- llama_perf_sampler_print: sampling time = 29.60 ms / 352 runs ( 0.08 ms per token, 11892.70 tokens per second)
- llama_perf_context_print: load time = 24553.27 ms
- llama_perf_context_print: prompt eval time = 536.69 ms / 9 tokens ( 59.63 ms per token, 16.77 tokens per second)
- llama_perf_context_print: eval time = 38243.46 ms / 342 runs ( 111.82 ms per token, 8.94 tokens per second)
- llama_perf_context_print: total time = 38871.66 ms / 351 tokens
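The throughput figures are simply token counts divided by wall-clock time; a quick verification of the decode and prompt-processing rates:

```python
print(342 / (38243.46 / 1000))   # ~8.94 tokens/s generation (eval)
print(9 / (536.69 / 1000))       # ~16.77 tokens/s prompt processing
print(352 / (29.60 / 1000))      # ~11892 samples/s in the sampler itself
```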