- ***
- Welcome to KoboldCpp - Version 1.87.4
- For command line arguments, please refer to --help
- ***
- Auto Selected CUDA Backend...
- WARNING: Admin was set without selecting an admin directory. Admin cannot be used.
- Setting process to Higher Priority - Use Caution
- High Priority for Windows Set: Priority.NORMAL_PRIORITY_CLASS to Priority.REALTIME_PRIORITY_CLASS
- Initializing dynamic library: koboldcpp_cublas.dll
- ==========
- Namespace(admin=False, admindir='', adminpassword='', analyze='', benchmark='stdout', blasbatchsize=2048, blasthreads=8, chatcompletionsadapter=None, cli=False, config=None, contextsize=20480, debugmode=0, defaultgenamt=512, draftamount=8, draftgpulayers=999, draftgpusplit=None, draftmodel=None, embeddingsmodel='', exportconfig='', exporttemplate='', failsafe=False, flashattention=True, forceversion=0, foreground=False, gpulayers=14, highpriority=True, hordeconfig=None, hordegenlen=0, hordekey='', hordemaxctx=0, hordemodelname='', hordeworkername='', host='', ignoremissing=False, launch=False, lora=None, mmproj=None, model=[], model_param='G:/_Ai/gemma-3-12b-it-q4_0_s.gguf', moeexperts=-1, multiplayer=False, multiuser=1, noavx2=False, noblas=False, nobostoken=False, nocertify=False, nofastforward=False, nommap=False, nomodel=False, noshift=False, onready='', password=None, port=5001, port_param=5001, preloadstory=None, prompt='', promptlimit=100, quantkv=0, quiet=False, remotetunnel=False, ropeconfig=[1.0, 10000.0], savedatafile=None, sdclamped=0, sdclipg='', sdclipl='', sdconfig=None, sdlora='', sdloramult=1.0, sdmodel='', sdnotile=False, sdquant=False, sdt5xxl='', sdthreads=3, sdvae='', sdvaeauto=False, showgui=False, skiplauncher=False, smartcontext=False, ssl=None, tensor_split=None, threads=8, ttsgpu=False, ttsmaxlen=4096, ttsmodel='', ttsthreads=0, ttswavtokenizer='', unpack='', useclblast=None, usecpu=False, usecublas=['normal', '0', 'mmq'], usemlock=True, usemmap=False, usevulkan=None, version=False, visionmaxres=1024, websearch=False, whispermodel='')
- ==========
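
Note: the Namespace dump above is KoboldCpp echoing its parsed arguments. The non-default fields correspond roughly to a launch command like the one below. This is a hedged reconstruction from the dump, not the exact invocation used; the flag names are real KoboldCpp options, and --benchmark without an argument maps to benchmark='stdout':

    koboldcpp.exe --model G:/_Ai/gemma-3-12b-it-q4_0_s.gguf --usecublas normal 0 mmq --gpulayers 14 --contextsize 20480 --blasbatchsize 2048 --threads 8 --flashattention --highpriority --usemlock --benchmark
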
- Loading Text Model: G:\_Ai\gemma-3-12b-it-q4_0_s.gguf
- The reported GGUF Arch is: gemma3
- Arch Category: 8
- ---
- Identified as GGUF model.
- Attempting to Load...
- ---
- Using Custom RoPE scaling (scale:1.000, base:10000.0).
- System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | AMX_INT8 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
- ---
- Initializing CUDA/HIP, please wait, the following step may take a few minutes for first launch...
- ---
- ggml_cuda_init: found 1 CUDA devices:
- Device 0: NVIDIA GeForce RTX 3070 Ti, compute capability 8.6, VMM: yes
- llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3070 Ti) - 7002 MiB free
- llama_model_loader: loaded meta data with 39 key-value pairs and 626 tensors from G:\_Ai\gemma-3-12b-it-q4_0_s.gguf (version GGUF V3 (latest))
- print_info: file format = GGUF V3 (latest)
- print_info: file type = Q4_0
- print_info: file size = 6.41 GiB (4.68 BPW)
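
The reported size and bit rate are self-consistent: 6.41 GiB spread over the 11.77 B parameters reported further down works out to the stated 4.68 bits per weight. A minimal check (inputs taken from the log):

    # 6.41 GiB over 11.77 B params ~= 4.68 bits per weight (BPW)
    file_bytes = 6.41 * 1024**3                  # GiB -> bytes
    n_params = 11.77e9
    print(round(file_bytes * 8 / n_params, 2))   # -> 4.68
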
- init_tokenizer: initializing tokenizer for type 1
- load: control-looking token: 106 '<end_of_turn>' was not control-type; this is probably a bug in the model. its type will be overridden
- load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
- load: special tokens cache size = 5
- load: token to piece cache size = 1.9446 MB
- print_info: arch = gemma3
- print_info: vocab_only = 0
- print_info: n_ctx_train = 131072
- print_info: n_embd = 3840
- print_info: n_layer = 48
- print_info: n_head = 16
- print_info: n_head_kv = 8
- print_info: n_rot = 256
- print_info: n_swa = 1024
- print_info: n_swa_pattern = 6
- print_info: n_embd_head_k = 256
- print_info: n_embd_head_v = 256
- print_info: n_gqa = 2
- print_info: n_embd_k_gqa = 2048
- print_info: n_embd_v_gqa = 2048
- print_info: f_norm_eps = 0.0e+00
- print_info: f_norm_rms_eps = 1.0e-06
- print_info: f_clamp_kqv = 0.0e+00
- print_info: f_max_alibi_bias = 0.0e+00
- print_info: f_logit_scale = 0.0e+00
- print_info: f_attn_scale = 6.2e-02
- print_info: n_ff = 15360
- print_info: n_expert = 0
- print_info: n_expert_used = 0
- print_info: causal attn = 1
- print_info: pooling type = 0
- print_info: rope type = 2
- print_info: rope scaling = linear
- print_info: freq_base_train = 1000000.0
- print_info: freq_scale_train = 0.125
- print_info: n_ctx_orig_yarn = 131072
- print_info: rope_finetuned = unknown
- print_info: ssm_d_conv = 0
- print_info: ssm_d_inner = 0
- print_info: ssm_d_state = 0
- print_info: ssm_dt_rank = 0
- print_info: ssm_dt_b_c_rms = 0
- print_info: model type = 12B
- print_info: model params = 11.77 B
- print_info: general.name = n/a
- print_info: vocab type = SPM
- print_info: n_vocab = 262144
- print_info: n_merges = 0
- print_info: BOS token = 2 '<bos>'
- print_info: EOS token = 1 '<eos>'
- print_info: EOT token = 106 '<end_of_turn>'
- print_info: UNK token = 3 '<unk>'
- print_info: PAD token = 0 '<pad>'
- print_info: LF token = 248 '<0x0A>'
- print_info: EOG token = 1 '<eos>'
- print_info: EOG token = 106 '<end_of_turn>'
- print_info: max token length = 93
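
Several of the print_info values above are derived quantities; in particular the grouped-query attention (GQA) numbers follow from the head counts. A minimal consistency check (inputs from the block above):

    # GQA identities, inputs from the print_info block above
    n_head, n_head_kv, head_dim = 16, 8, 256   # head_dim = n_embd_head_k
    print(n_head // n_head_kv)                 # n_gqa        -> 2
    print(n_head_kv * head_dim)                # n_embd_k_gqa -> 2048, likewise for V
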
- load_tensors: loading model tensors, this can take a while... (mmap = false)
- load_tensors: relocated tensors: 1 of 627
- load_tensors: offloading 14 repeating layers to GPU
- load_tensors: offloaded 14/49 layers to GPU
- load_tensors: CPU model buffer size = 787.50 MiB
- load_tensors: CUDA_Host model buffer size = 4877.54 MiB
- load_tensors: CUDA0 model buffer size = 1684.15 MiB
- .................................................................................
- load_all_data: using async uploads for device CUDA0, buffer type CUDA0, backend CUDA0
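
With 14 of 48 repeating layers offloaded, the CUDA0 weight buffer of 1684.15 MiB implies roughly 120 MiB of weights per layer, which gives a rough way to budget --gpulayers on this card. This is only a sketch: the KV cache and compute buffers also grow with offloaded layers and context, so the real per-layer VRAM cost is higher.

    # Rough per-layer weight cost implied by the offload split above
    cuda0_weights_mib = 1684.15
    offloaded_layers = 14
    print(cuda0_weights_mib / offloaded_layers)   # ~120.3 MiB per repeating layer
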
- llama_context: constructing llama_context
- llama_context: n_seq_max = 1
- llama_context: n_ctx = 20600
- llama_context: n_ctx_per_seq = 20600
- llama_context: n_batch = 2048
- llama_context: n_ubatch = 2048
- llama_context: causal_attn = 1
- llama_context: flash_attn = 1
- llama_context: freq_base = 10000.0
- llama_context: freq_scale = 1
- llama_context: n_ctx_per_seq (20600) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
- set_abort_callback: call
- llama_context: CPU output buffer size = 1.00 MiB
- llama_context: n_ctx = 20600
- llama_context: n_ctx = 20736 (padded)
- init: kv_size = 20736, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 48, can_shift = 1
- init: CPU KV buffer size = 5508.00 MiB
- init: CUDA0 KV buffer size = 2268.00 MiB
- llama_context: KV self size = 7776.00 MiB, K (f16): 3888.00 MiB, V (f16): 3888.00 MiB
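
The KV cache figure is exactly what the padded context and GQA widths predict, and the CPU/CUDA0 split tracks the 14/48 layer offload. Reproducing the numbers (all inputs from the log lines above):

    # KV self size = n_layer * kv_size * (K width + V width) * 2 bytes (f16)
    n_layer, kv_size = 48, 20736
    kv_width = 2048 + 2048                           # n_embd_k_gqa + n_embd_v_gqa
    total_mib = n_layer * kv_size * kv_width * 2 / 2**20
    print(total_mib)                                 # -> 7776.0 MiB, as reported
    print(total_mib * 14 / 48, total_mib * 34 / 48)  # -> 2268.0 (CUDA0), 5508.0 (CPU)
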
- llama_context: enumerating backends
- llama_context: backend_ptrs.size() = 2
- llama_context: max_nodes = 65536
- llama_context: worst-case: n_tokens = 2048, n_seqs = 1, n_outputs = 0
- llama_context: reserving graph for n_tokens = 2048, n_seqs = 1
- llama_context: reserving graph for n_tokens = 1, n_seqs = 1
- llama_context: reserving graph for n_tokens = 2048, n_seqs = 1
- llama_context: CUDA0 compute buffer size = 2865.50 MiB
- llama_context: CUDA_Host compute buffer size = 356.02 MiB
- llama_context: graph nodes = 1833
- llama_context: graph splits = 514 (with bs=2048), 3 (with bs=1)
- Load Text Model OK: True
- Embedded KoboldAI Lite loaded.
- Embedded API docs loaded.
- ======
- Active Modules: TextGeneration
- Inactive Modules: ImageGeneration VoiceRecognition MultimodalVision NetworkMultiplayer ApiKeyPassword WebSearchProxy TextToSpeech VectorEmbeddings AdminControl
- Enabled APIs: KoboldCppApi OpenAiApi OllamaApi
- Running benchmark (Not Saved)...
- Processing Prompt [BLAS] (20380 / 20380 tokens)
- Generating (100 / 100 tokens)
- [19:17:57] CtxLimit:20480/20480, Amt:100/100, Init:0.04s, Process:23.61s (863.01T/s), Generate:22.62s (4.42T/s), Total:46.24s
- Benchmark Completed - v1.87.4 Results:
- ======
- Flags: NoAVX2=False Threads=8 HighPriority=True Cublas_Args=['normal', '0', 'mmq'] Tensor_Split=None BlasThreads=8 BlasBatchSize=2048 FlashAttention=True KvCache=0
- Timestamp: 2025-04-13 16:17:57.513868+00:00
- Backend: koboldcpp_cublas.dll
- Layers: 14
- Model: gemma-3-12b-it-q4_0_s
- MaxCtx: 20480
- GenAmount: 100
- -----
- ProcessingTime: 23.615s
- ProcessingSpeed: 863.01T/s
- GenerationTime: 22.625s
- GenerationSpeed: 4.42T/s
- TotalTime: 46.240s
- Output: 0 0 0
- -----
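
The reported speeds are plain tokens-over-wall-time and reproduce from the figures above:

    # Throughput check from the benchmark results above
    print(round(20380 / 23.615, 2))   # processing -> 863.01 T/s
    print(round(100 / 22.625, 2))     # generation -> 4.42 T/s
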
- Server was not started, main function complete. Idling.
- ===
- Press ENTER key to exit.