  1. ***
  2. Welcome to KoboldCpp - Version 1.87.4
  3. For command line arguments, please refer to --help
  4. ***
  5. Auto Selected CUDA Backend...
  6.  
  7.  
  8. WARNING: Admin was set without selecting an admin directory. Admin cannot be used.
  9.  
  10. Setting process to Higher Priority - Use Caution
  11. High Priority for Windows Set: Priority.NORMAL_PRIORITY_CLASS to Priority.REALTIME_PRIORITY_CLASS
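As an aside, the priority switch logged above can be reproduced with psutil on Windows; this is only an illustrative sketch, not necessarily how KoboldCpp implements --highpriority, and REALTIME priority deserves the same caution the log itself advises:

    # Minimal sketch (Windows-only): raise the current process to REALTIME priority.
    import psutil

    proc = psutil.Process()                        # the current process
    print(proc.nice())                             # e.g. psutil.NORMAL_PRIORITY_CLASS
    proc.nice(psutil.REALTIME_PRIORITY_CLASS)      # needs elevated rights; may starve other processes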
  12. Initializing dynamic library: koboldcpp_cublas.dll
  13. ==========
  14. Namespace(admin=False, admindir='', adminpassword='', analyze='', benchmark='stdout', blasbatchsize=2048, blasthreads=8, chatcompletionsadapter=None, cli=False, config=None, contextsize=20480, debugmode=0, defaultgenamt=512, draftamount=8, draftgpulayers=999, draftgpusplit=None, draftmodel=None, embeddingsmodel='', exportconfig='', exporttemplate='', failsafe=False, flashattention=True, forceversion=0, foreground=False, gpulayers=14, highpriority=True, hordeconfig=None, hordegenlen=0, hordekey='', hordemaxctx=0, hordemodelname='', hordeworkername='', host='', ignoremissing=False, launch=False, lora=None, mmproj=None, model=[], model_param='G:/_Ai/gemma-3-12b-it-q4_0_s.gguf', moeexperts=-1, multiplayer=False, multiuser=1, noavx2=False, noblas=False, nobostoken=False, nocertify=False, nofastforward=False, nommap=False, nomodel=False, noshift=False, onready='', password=None, port=5001, port_param=5001, preloadstory=None, prompt='', promptlimit=100, quantkv=0, quiet=False, remotetunnel=False, ropeconfig=[1.0, 10000.0], savedatafile=None, sdclamped=0, sdclipg='', sdclipl='', sdconfig=None, sdlora='', sdloramult=1.0, sdmodel='', sdnotile=False, sdquant=False, sdt5xxl='', sdthreads=3, sdvae='', sdvaeauto=False, showgui=False, skiplauncher=False, smartcontext=False, ssl=None, tensor_split=None, threads=8, ttsgpu=False, ttsmaxlen=4096, ttsmodel='', ttsthreads=0, ttswavtokenizer='', unpack='', useclblast=None, usecpu=False, usecublas=['normal', '0', 'mmq'], usemlock=True, usemmap=False, usevulkan=None, version=False, visionmaxres=1024, websearch=False, whispermodel='')
  15. ==========
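The Namespace dump above corresponds roughly to the launch command sketched below. This is a reconstruction, not the command actually used: flag names mirror the argparse keys and should be checked against `koboldcpp --help` before reuse.

    # Hedged reconstruction of the launch settings from the Namespace dump above.
    cmd = [
        "koboldcpp.exe",                 # assumed executable name on Windows
        "--model", "G:/_Ai/gemma-3-12b-it-q4_0_s.gguf",
        "--contextsize", "20480",
        "--gpulayers", "14",
        "--threads", "8",
        "--blasthreads", "8",
        "--blasbatchsize", "2048",
        "--usecublas", "normal", "0", "mmq",
        "--flashattention",
        "--highpriority",
        "--usemlock",
        "--benchmark",                   # benchmark='stdout': run the built-in benchmark, print only
    ]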
  16. Loading Text Model: G:\_Ai\gemma-3-12b-it-q4_0_s.gguf
  17.  
  18. The reported GGUF Arch is: gemma3
  19. Arch Category: 8
  20.  
  21. ---
  22. Identified as GGUF model.
  23. Attempting to Load...
  24. ---
  25. Using Custom RoPE scaling (scale:1.000, base:10000.0).
  26. System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | AMX_INT8 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
  27. ---
  28. Initializing CUDA/HIP, please wait, the following step may take a few minutes for first launch...
  29. ---
  30. ggml_cuda_init: found 1 CUDA devices:
  31. Device 0: NVIDIA GeForce RTX 3070 Ti, compute capability 8.6, VMM: yes
  32. llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3070 Ti) - 7002 MiB free
  33. llama_model_loader: loaded meta data with 39 key-value pairs and 626 tensors from G:\_Ai\gemma-3-12b-it-q4_0_s.gguf (version GGUF V3 (latest))
      print_info: file format = GGUF V3 (latest)
  34. print_info: file type = Q4_0
  35. print_info: file size = 6.41 GiB (4.68 BPW)
  36. init_tokenizer: initializing tokenizer for type 1
  37. load: control-looking token: 106 '<end_of_turn>' was not control-type; this is probably a bug in the model. its type will be overridden
      load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
  38. load: special tokens cache size = 5
  39. load: token to piece cache size = 1.9446 MB
  40. print_info: arch = gemma3
  41. print_info: vocab_only = 0
  42. print_info: n_ctx_train = 131072
  43. print_info: n_embd = 3840
  44. print_info: n_layer = 48
  45. print_info: n_head = 16
  46. print_info: n_head_kv = 8
  47. print_info: n_rot = 256
  48. print_info: n_swa = 1024
  49. print_info: n_swa_pattern = 6
  50. print_info: n_embd_head_k = 256
  51. print_info: n_embd_head_v = 256
  52. print_info: n_gqa = 2
  53. print_info: n_embd_k_gqa = 2048
  54. print_info: n_embd_v_gqa = 2048
  55. print_info: f_norm_eps = 0.0e+00
  56. print_info: f_norm_rms_eps = 1.0e-06
  57. print_info: f_clamp_kqv = 0.0e+00
  58. print_info: f_max_alibi_bias = 0.0e+00
  59. print_info: f_logit_scale = 0.0e+00
  60. print_info: f_attn_scale = 6.2e-02
  61. print_info: n_ff = 15360
  62. print_info: n_expert = 0
  63. print_info: n_expert_used = 0
  64. print_info: causal attn = 1
  65. print_info: pooling type = 0
  66. print_info: rope type = 2
  67. print_info: rope scaling = linear
  68. print_info: freq_base_train = 1000000.0
  69. print_info: freq_scale_train = 0.125
  70. print_info: n_ctx_orig_yarn = 131072
  71. print_info: rope_finetuned = unknown
  72. print_info: ssm_d_conv = 0
  73. print_info: ssm_d_inner = 0
  74. print_info: ssm_d_state = 0
  75. print_info: ssm_dt_rank = 0
  76. print_info: ssm_dt_b_c_rms = 0
  77. print_info: model type = 12B
  78. print_info: model params = 11.77 B
  79. print_info: general.name = n/a
  80. print_info: vocab type = SPM
  81. print_info: n_vocab = 262144
  82. print_info: n_merges = 0
  83. print_info: BOS token = 2 '<bos>'
  84. print_info: EOS token = 1 '<eos>'
  85. print_info: EOT token = 106 '<end_of_turn>'
  86. print_info: UNK token = 3 '<unk>'
  87. print_info: PAD token = 0 '<pad>'
  88. print_info: LF token = 248 '<0x0A>'
  89. print_info: EOG token = 1 '<eos>'
  90. print_info: EOG token = 106 '<end_of_turn>'
  91. print_info: max token length = 93
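Several of the reported values can be cross-checked against one another; the sketch below just redoes that arithmetic with numbers taken from the dump above:

    # Consistency checks on the print_info dump.
    n_head, n_head_kv, n_embd_head_k = 16, 8, 256
    assert n_head // n_head_kv == 2                  # n_gqa = 2
    assert n_head_kv * n_embd_head_k == 2048         # n_embd_k_gqa / n_embd_v_gqa

    file_gib, params = 6.41, 11.77e9
    print(file_gib * 1024**3 * 8 / params)           # ~4.68 bits per weight, as reported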
  92. load_tensors: loading model tensors, this can take a while... (mmap = false)
  93. load_tensors: relocated tensors: 1 of 627
  94. load_tensors: offloading 14 repeating layers to GPU
  95. load_tensors: offloaded 14/49 layers to GPU
  96. load_tensors: CPU model buffer size = 787.50 MiB
  97. load_tensors: CUDA_Host model buffer size = 4877.54 MiB
  98. load_tensors: CUDA0 model buffer size = 1684.15 MiB
  99. ..........................................................load_all_data: using async uploads for device CUDA0, buffer type CUDA0, backend CUDA0
  100. .......................
  101. llama_context: constructing llama_context
  102. llama_context: n_seq_max = 1
  103. llama_context: n_ctx = 20600
  104. llama_context: n_ctx_per_seq = 20600
  105. llama_context: n_batch = 2048
  106. llama_context: n_ubatch = 2048
  107. llama_context: causal_attn = 1
  108. llama_context: flash_attn = 1
  109. llama_context: freq_base = 10000.0
  110. llama_context: freq_scale = 1
  111. llama_context: n_ctx_per_seq (20600) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
  112. set_abort_callback: call
  113. llama_context: CPU output buffer size = 1.00 MiB
  114. llama_context: n_ctx = 20600
  115. llama_context: n_ctx = 20736 (padded)
  116. init: kv_size = 20736, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 48, can_shift = 1
  117. init: CPU KV buffer size = 5508.00 MiB
  118. init: CUDA0 KV buffer size = 2268.00 MiB
  119. llama_context: KV self size = 7776.00 MiB, K (f16): 3888.00 MiB, V (f16): 3888.00 MiB
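The KV-cache numbers are internally consistent: 48 layers of f16 K and V over the padded kv_size of 20736 (20600 rounded up, apparently to a multiple of 256) give exactly the reported totals, and the CPU/CUDA0 split matches the 14 offloaded layers.

    # KV cache size check: kv_size * (K width + V width) * 2 bytes (f16) per layer.
    n_layer, n_offloaded, kv_size = 48, 14, 20736
    n_embd_k_gqa = n_embd_v_gqa = 2048
    bytes_f16 = 2

    per_layer = kv_size * (n_embd_k_gqa + n_embd_v_gqa) * bytes_f16 / 1024**2
    print(per_layer)                             # 162.0 MiB per layer
    print(per_layer * n_layer)                   # 7776.0 MiB total (3888 K + 3888 V)
    print(per_layer * n_offloaded)               # 2268.0 MiB on CUDA0
    print(per_layer * (n_layer - n_offloaded))   # 5508.0 MiB on CPU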
  120. llama_context: enumerating backends
  121. llama_context: backend_ptrs.size() = 2
  122. llama_context: max_nodes = 65536
  123. llama_context: worst-case: n_tokens = 2048, n_seqs = 1, n_outputs = 0
  124. llama_context: reserving graph for n_tokens = 2048, n_seqs = 1
  125. llama_context: reserving graph for n_tokens = 1, n_seqs = 1
  126. llama_context: reserving graph for n_tokens = 2048, n_seqs = 1
  127. llama_context: CUDA0 compute buffer size = 2865.50 MiB
  128. llama_context: CUDA_Host compute buffer size = 356.02 MiB
  129. llama_context: graph nodes = 1833
  130. llama_context: graph splits = 514 (with bs=2048), 3 (with bs=1)
  131. Load Text Model OK: True
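Summing the CUDA0 buffers reported during loading shows why only 14 of the 48 repeating layers were offloaded on a card with 7002 MiB reported free:

    # Approximate CUDA0 VRAM budget from the buffers reported above (MiB).
    model_buf   = 1684.15   # load_tensors: CUDA0 model buffer
    kv_buf      = 2268.00   # init: CUDA0 KV buffer (14 offloaded layers)
    compute_buf = 2865.50   # llama_context: CUDA0 compute buffer
    print(model_buf + kv_buf + compute_buf)   # ~6817.65 MiB of the 7002 MiB free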
  132. Embedded KoboldAI Lite loaded.
  133. Embedded API docs loaded.
  134. ======
  135. Active Modules: TextGeneration
  136. Inactive Modules: ImageGeneration VoiceRecognition MultimodalVision NetworkMultiplayer ApiKeyPassword WebSearchProxy TextToSpeech VectorEmbeddings AdminControl
  137. Enabled APIs: KoboldCppApi OpenAiApi OllamaApi
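This particular run never starts the server (see the final lines), but with OpenAiApi listed as enabled, a normal launch on port 5001 would also accept OpenAI-style requests; a minimal sketch, assuming the usual /v1/chat/completions route:

    # Chat request against a normally launched KoboldCpp instance (not this benchmark-only run).
    import json, urllib.request

    payload = {
        "model": "gemma-3-12b-it-q4_0_s",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 100,
    }
    req = urllib.request.Request(
        "http://localhost:5001/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    print(json.load(urllib.request.urlopen(req))["choices"][0]["message"]["content"])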
  138.  
  139. Running benchmark (Not Saved)...
  140.  
  141. Processing Prompt [BLAS] (20380 / 20380 tokens)
  142. Generating (100 / 100 tokens)
  143. [19:17:57] CtxLimit:20480/20480, Amt:100/100, Init:0.04s, Process:23.61s (863.01T/s), Generate:22.62s (4.42T/s), Total:46.24s
  144. Benchmark Completed - v1.87.4 Results:
  145. ======
  146. Flags: NoAVX2=False Threads=8 HighPriority=True Cublas_Args=['normal', '0', 'mmq'] Tensor_Split=None BlasThreads=8 BlasBatchSize=2048 FlashAttention=True KvCache=0
  147. Timestamp: 2025-04-13 16:17:57.513868+00:00
  148. Backend: koboldcpp_cublas.dll
  149. Layers: 14
  150. Model: gemma-3-12b-it-q4_0_s
  151. MaxCtx: 20480
  152. GenAmount: 100
  153. -----
  154. ProcessingTime: 23.615s
  155. ProcessingSpeed: 863.01T/s
  156. GenerationTime: 22.625s
  157. GenerationSpeed: 4.42T/s
  158. TotalTime: 46.240s
  159. Output: 0 0 0
  160. -----
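The reported speeds follow directly from the token counts and timings; re-derived below:

    # Re-derive the benchmark speeds from the logged counts and timings.
    max_ctx, gen_amount = 20480, 100
    prompt_tokens = max_ctx - gen_amount            # 20380, as in the progress line

    processing_time, generation_time = 23.615, 22.625
    print(prompt_tokens / processing_time)          # ~863.0 T/s prompt processing
    print(gen_amount / generation_time)             # ~4.42 T/s generation
    print(processing_time + generation_time)        # ~46.24 s, matching TotalTime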
  161. Server was not started, main function complete. Idling.
  162. ===
  163. Press ENTER key to exit.