- ***
- Welcome to KoboldCpp - Version 1.87.4
- For command line arguments, please refer to --help
- ***
- Auto Selected CUDA Backend...
- WARNING: Admin was set without selecting an admin directory. Admin cannot be used.
- Setting process to Higher Priority - Use Caution
- High Priority for Windows Set: Priority.NORMAL_PRIORITY_CLASS to Priority.REALTIME_PRIORITY_CLASS
- Initializing dynamic library: koboldcpp_cublas.dll
- ==========
- Namespace(admin=False, admindir='', adminpassword='', analyze='', benchmark='stdout', blasbatchsize=2048, blasthreads=8, chatcompletionsadapter=None, cli=False, config=None, contextsize=20480, debugmode=0, defaultgenamt=512, draftamount=8, draftgpulayers=999, draftgpusplit=None, draftmodel=None, embeddingsmodel='', exportconfig='', exporttemplate='', failsafe=False, flashattention=True, forceversion=0, foreground=False, gpulayers=14, highpriority=True, hordeconfig=None, hordegenlen=0, hordekey='', hordemaxctx=0, hordemodelname='', hordeworkername='', host='', ignoremissing=False, launch=False, lora=None, mmproj=None, model=[], model_param='G:/_Ai/gemma-3-12b-it-q4_0_s.gguf', moeexperts=-1, multiplayer=False, multiuser=1, noavx2=False, noblas=False, nobostoken=False, nocertify=False, nofastforward=False, nommap=False, nomodel=False, noshift=False, onready='', password=None, port=5001, port_param=5001, preloadstory=None, prompt='', promptlimit=100, quantkv=0, quiet=False, remotetunnel=False, ropeconfig=[1.0, 10000.0], savedatafile=None, sdclamped=0, sdclipg='', sdclipl='', sdconfig=None, sdlora='', sdloramult=1.0, sdmodel='', sdnotile=False, sdquant=False, sdt5xxl='', sdthreads=3, sdvae='', sdvaeauto=False, showgui=False, skiplauncher=False, smartcontext=False, ssl=None, tensor_split=None, threads=8, ttsgpu=False, ttsmaxlen=4096, ttsmodel='', ttsthreads=0, ttswavtokenizer='', unpack='', useclblast=None, usecpu=False, usecublas=['normal', '0', 'mmq'], usemlock=True, usemmap=False, usevulkan=None, version=False, visionmaxres=1024, websearch=False, whispermodel='')
- ==========
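
Note: the Namespace dump above is KoboldCpp echoing its parsed arguments. The non-default fields correspond roughly to a launch command like the one below. This is a hedged reconstruction from the dump, not the exact invocation used; the flag names are real KoboldCpp options, and --benchmark without an argument maps to benchmark='stdout':

    koboldcpp.exe --model G:/_Ai/gemma-3-12b-it-q4_0_s.gguf --usecublas normal 0 mmq --gpulayers 14 --contextsize 20480 --blasbatchsize 2048 --threads 8 --flashattention --highpriority --usemlock --benchmark
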
- Loading Text Model: G:\_Ai\gemma-3-12b-it-q4_0_s.gguf
- The reported GGUF Arch is: gemma3
- Arch Category: 8
- ---
- Identified as GGUF model.
- Attempting to Load...
- ---
- Using Custom RoPE scaling (scale:1.000, base:10000.0).
- System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | AMX_INT8 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
- ---
- Initializing CUDA/HIP, please wait, the following step may take a few minutes for first launch...
- ---
- ggml_cuda_init: found 1 CUDA devices:
- Device 0: NVIDIA GeForce RTX 3070 Ti, compute capability 8.6, VMM: yes
- llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3070 Ti) - 7002 MiB free
- llama_model_loader: loaded meta data with 39 key-value pairs and 626 tensors from G:\_Ai\gemma-3-12b-it-q4_0_s.gguf (version GGUF V3 (latest))
- print_info: file format = GGUF V3 (latest)
- print_info: file type = Q4_0
- print_info: file size = 6.41 GiB (4.68 BPW)
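
The reported size and bit rate are self-consistent: 6.41 GiB spread over the 11.77 B parameters reported further down works out to the stated 4.68 bits per weight. A minimal check (inputs taken from the log):

    # 6.41 GiB over 11.77 B params ~= 4.68 bits per weight (BPW)
    file_bytes = 6.41 * 1024**3                  # GiB -> bytes
    n_params = 11.77e9
    print(round(file_bytes * 8 / n_params, 2))   # -> 4.68
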
- init_tokenizer: initializing tokenizer for type 1
- load: control-looking token: 106 '<end_of_turn>' was not control-type; this is probably a bug in the model. its type will be overridden
- load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
- load: special tokens cache size = 5
- load: token to piece cache size = 1.9446 MB
- print_info: arch = gemma3
- print_info: vocab_only = 0
- print_info: n_ctx_train = 131072
- print_info: n_embd = 3840
- print_info: n_layer = 48
- print_info: n_head = 16
- print_info: n_head_kv = 8
- print_info: n_rot = 256
- print_info: n_swa = 1024
- print_info: n_swa_pattern = 6
- print_info: n_embd_head_k = 256
- print_info: n_embd_head_v = 256
- print_info: n_gqa = 2
- print_info: n_embd_k_gqa = 2048
- print_info: n_embd_v_gqa = 2048
- print_info: f_norm_eps = 0.0e+00
- print_info: f_norm_rms_eps = 1.0e-06
- print_info: f_clamp_kqv = 0.0e+00
- print_info: f_max_alibi_bias = 0.0e+00
- print_info: f_logit_scale = 0.0e+00
- print_info: f_attn_scale = 6.2e-02
- print_info: n_ff = 15360
- print_info: n_expert = 0
- print_info: n_expert_used = 0
- print_info: causal attn = 1
- print_info: pooling type = 0
- print_info: rope type = 2
- print_info: rope scaling = linear
- print_info: freq_base_train = 1000000.0
- print_info: freq_scale_train = 0.125
- print_info: n_ctx_orig_yarn = 131072
- print_info: rope_finetuned = unknown
- print_info: ssm_d_conv = 0
- print_info: ssm_d_inner = 0
- print_info: ssm_d_state = 0
- print_info: ssm_dt_rank = 0
- print_info: ssm_dt_b_c_rms = 0
- print_info: model type = 12B
- print_info: model params = 11.77 B
- print_info: general.name = n/a
- print_info: vocab type = SPM
- print_info: n_vocab = 262144
- print_info: n_merges = 0
- print_info: BOS token = 2 '<bos>'
- print_info: EOS token = 1 '<eos>'
- print_info: EOT token = 106 '<end_of_turn>'
- print_info: UNK token = 3 '<unk>'
- print_info: PAD token = 0 '<pad>'
- print_info: LF token = 248 '<0x0A>'
- print_info: EOG token = 1 '<eos>'
- print_info: EOG token = 106 '<end_of_turn>'
- print_info: max token length = 93
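
Several of the print_info values above are derived quantities; in particular the grouped-query attention (GQA) numbers follow from the head counts. A minimal consistency check (inputs from the block above):

    # GQA identities, inputs from the print_info block above
    n_head, n_head_kv, head_dim = 16, 8, 256   # head_dim = n_embd_head_k
    print(n_head // n_head_kv)                 # n_gqa        -> 2
    print(n_head_kv * head_dim)                # n_embd_k_gqa -> 2048, likewise for V
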
- load_tensors: loading model tensors, this can take a while... (mmap = false)
- load_tensors: relocated tensors: 1 of 627
- load_tensors: offloading 14 repeating layers to GPU
- load_tensors: offloaded 14/49 layers to GPU
- load_tensors: CPU model buffer size = 787.50 MiB
- load_tensors: CUDA_Host model buffer size = 4877.54 MiB
- load_tensors: CUDA0 model buffer size = 1684.15 MiB
- .................................................................................
- load_all_data: using async uploads for device CUDA0, buffer type CUDA0, backend CUDA0
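
With 14 of 48 repeating layers offloaded, the CUDA0 weight buffer of 1684.15 MiB implies roughly 120 MiB of weights per layer, which gives a rough way to budget --gpulayers on this card. This is only a sketch: the KV cache and compute buffers also grow with offloaded layers and context, so the real per-layer VRAM cost is higher.

    # Rough per-layer weight cost implied by the offload split above
    cuda0_weights_mib = 1684.15
    offloaded_layers = 14
    print(cuda0_weights_mib / offloaded_layers)   # ~120.3 MiB per repeating layer
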
- llama_context: constructing llama_context
- llama_context: n_seq_max = 1
- llama_context: n_ctx = 20600
- llama_context: n_ctx_per_seq = 20600
- llama_context: n_batch = 2048
- llama_context: n_ubatch = 2048
- llama_context: causal_attn = 1
- llama_context: flash_attn = 1
- llama_context: freq_base = 10000.0
- llama_context: freq_scale = 1
- llama_context: n_ctx_per_seq (20600) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
- set_abort_callback: call
- llama_context: CPU output buffer size = 1.00 MiB
- llama_context: n_ctx = 20600
- llama_context: n_ctx = 20736 (padded)
- init: kv_size = 20736, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 48, can_shift = 1
- init: CPU KV buffer size = 5508.00 MiB
- init: CUDA0 KV buffer size = 2268.00 MiB
- llama_context: KV self size = 7776.00 MiB, K (f16): 3888.00 MiB, V (f16): 3888.00 MiB
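
The KV cache figure is exactly what the padded context and GQA widths predict, and the CPU/CUDA0 split tracks the 14/48 layer offload. Reproducing the numbers (all inputs from the log lines above):

    # KV self size = n_layer * kv_size * (K width + V width) * 2 bytes (f16)
    n_layer, kv_size = 48, 20736
    kv_width = 2048 + 2048                           # n_embd_k_gqa + n_embd_v_gqa
    total_mib = n_layer * kv_size * kv_width * 2 / 2**20
    print(total_mib)                                 # -> 7776.0 MiB, as reported
    print(total_mib * 14 / 48, total_mib * 34 / 48)  # -> 2268.0 (CUDA0), 5508.0 (CPU)
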
- llama_context: enumerating backends
- llama_context: backend_ptrs.size() = 2
- llama_context: max_nodes = 65536
- llama_context: worst-case: n_tokens = 2048, n_seqs = 1, n_outputs = 0
- llama_context: reserving graph for n_tokens = 2048, n_seqs = 1
- llama_context: reserving graph for n_tokens = 1, n_seqs = 1
- llama_context: reserving graph for n_tokens = 2048, n_seqs = 1
- llama_context: CUDA0 compute buffer size = 2865.50 MiB
- llama_context: CUDA_Host compute buffer size = 356.02 MiB
- llama_context: graph nodes = 1833
- llama_context: graph splits = 514 (with bs=2048), 3 (with bs=1)
- Load Text Model OK: True
- Embedded KoboldAI Lite loaded.
- Embedded API docs loaded.
- ======
- Active Modules: TextGeneration
- Inactive Modules: ImageGeneration VoiceRecognition MultimodalVision NetworkMultiplayer ApiKeyPassword WebSearchProxy TextToSpeech VectorEmbeddings AdminControl
- Enabled APIs: KoboldCppApi OpenAiApi OllamaApi
- Running benchmark (Not Saved)...
- Processing Prompt [BLAS] (20380 / 20380 tokens)
- Generating (100 / 100 tokens)
- [19:17:57] CtxLimit:20480/20480, Amt:100/100, Init:0.04s, Process:23.61s (863.01T/s), Generate:22.62s (4.42T/s), Total:46.24s
- Benchmark Completed - v1.87.4 Results:
- ======
- Flags: NoAVX2=False Threads=8 HighPriority=True Cublas_Args=['normal', '0', 'mmq'] Tensor_Split=None BlasThreads=8 BlasBatchSize=2048 FlashAttention=True KvCache=0
- Timestamp: 2025-04-13 16:17:57.513868+00:00
- Backend: koboldcpp_cublas.dll
- Layers: 14
- Model: gemma-3-12b-it-q4_0_s
- MaxCtx: 20480
- GenAmount: 100
- -----
- ProcessingTime: 23.615s
- ProcessingSpeed: 863.01T/s
- GenerationTime: 22.625s
- GenerationSpeed: 4.42T/s
- TotalTime: 46.240s
- Output: 0 0 0
- -----
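
The reported speeds are plain tokens-over-wall-time and reproduce from the figures above:

    # Throughput check from the benchmark results above
    print(round(20380 / 23.615, 2))   # processing -> 863.01 T/s
    print(round(100 / 22.625, 2))     # generation -> 4.42 T/s
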
- Server was not started, main function complete. Idling.
- ===
- Press ENTER key to exit.