  1. time=2025-07-19T17:01:40.559+02:00 level=INFO source=routes.go:1235 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:C:\\Users\\Haldi\\.ollama\\models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:]"
  2. time=2025-07-19T17:01:40.560+02:00 level=INFO source=images.go:476 msg="total blobs: 0"
  3. time=2025-07-19T17:01:40.560+02:00 level=INFO source=images.go:483 msg="total unused blobs removed: 0"
  4. time=2025-07-19T17:01:40.561+02:00 level=INFO source=routes.go:1288 msg="Listening on [::]:11434 (version 0.9.6)"
  5. time=2025-07-19T17:01:40.561+02:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
  6. time=2025-07-19T17:01:40.561+02:00 level=INFO source=gpu_windows.go:167 msg=packages count=1
  7. time=2025-07-19T17:01:40.561+02:00 level=INFO source=gpu_windows.go:214 msg="" package=0 cores=16 efficiency=0 threads=32
  8. time=2025-07-19T17:01:41.144+02:00 level=INFO source=amd_windows.go:127 msg="unsupported Radeon iGPU detected skipping" id=0 total="18.0 GiB"
  9. time=2025-07-19T17:01:41.146+02:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-ebed2943-db15-05b0-424c-c3b47e1679a0 library=cuda variant=v12 compute=8.9 driver=12.8 name="NVIDIA GeForce RTX 4090" total="24.0 GiB" available="22.5 GiB"
  10. [GIN] 2025/07/19 - 17:02:25 | 200 | 0s | 127.0.0.1 | GET "/"
  11. [GIN] 2025/07/19 - 17:02:25 | 200 | 0s | 127.0.0.1 | GET "/"
  12. [GIN] 2025/07/19 - 17:02:25 | 404 | 0s | 127.0.0.1 | GET "/favicon.ico"
  13. [GIN] 2025/07/19 - 17:18:53 | 200 | 0s | 192.168.1.2 | GET "/api/version"
  14. [GIN] 2025/07/19 - 17:18:58 | 200 | 529.5µs | 192.168.1.2 | GET "/api/tags"
  15. [GIN] 2025/07/19 - 17:18:58 | 200 | 0s | 192.168.1.2 | GET "/api/ps"
  16. [GIN] 2025/07/19 - 17:18:59 | 200 | 0s | 192.168.1.2 | GET "/api/tags"
  17. [GIN] 2025/07/19 - 17:18:59 | 200 | 0s | 192.168.1.2 | GET "/api/ps"
  18. [GIN] 2025/07/19 - 17:19:04 | 200 | 0s | 192.168.1.2 | GET "/api/tags"
  19. [GIN] 2025/07/19 - 17:19:11 | 400 | 0s | 192.168.1.2 | POST "/api/pull"
  20. time=2025-07-19T17:24:16.826+02:00 level=INFO source=download.go:177 msg="downloading f2dc41fa964b in 27 1 GB part(s)"
  21. time=2025-07-19T17:28:41.677+02:00 level=INFO source=download.go:177 msg="downloading 53d74de0d84c in 1 84 B part(s)"
  22. time=2025-07-19T17:28:43.099+02:00 level=INFO source=download.go:177 msg="downloading 43070e2d4e53 in 1 11 KB part(s)"
  23. time=2025-07-19T17:28:44.471+02:00 level=INFO source=download.go:177 msg="downloading ed11eda7790d in 1 30 B part(s)"
  24. time=2025-07-19T17:28:45.826+02:00 level=INFO source=download.go:177 msg="downloading deae14c19dac in 1 486 B part(s)"
  25. [GIN] 2025/07/19 - 17:29:10 | 200 | 0s | 192.168.1.2 | GET "/api/tags"
  26. [GIN] 2025/07/19 - 17:29:11 | 200 | 4m56s | 192.168.1.2 | POST "/api/pull"
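
(Editor's note: the 400 on the first POST /api/pull at 17:19:11, followed by the successful pull that completed above in 4m56s, suggests the first request body was rejected. A minimal sketch of pulling a model over the HTTP API is shown below. The port comes from the "Listening on [::]:11434" line; the tag name is an assumption, since this log only shows blob digests. Assumes the third-party 'requests' package.)

import json
import requests  # assumption: the 'requests' package is installed

OLLAMA_URL = "http://localhost:11434"          # port taken from the "Listening on [::]:11434" line above
MODEL = "mixtral:8x7b-instruct-v0.1-q4_0"      # assumption: the actual tag is not visible in this log

# /api/pull streams newline-delimited JSON status objects until the download finishes.
with requests.post(f"{OLLAMA_URL}/api/pull", json={"model": MODEL}, stream=True) as resp:
    resp.raise_for_status()                    # a 400 here, like the one at 17:19:11, means the server rejected the request
    for line in resp.iter_lines():
        if line:
            status = json.loads(line)
            print(status.get("status"), status.get("completed"), status.get("total"))
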
  27. [GIN] 2025/07/19 - 17:29:45 | 200 | 521.4µs | 192.168.1.2 | GET "/api/version"
  28. [GIN] 2025/07/19 - 17:29:46 | 200 | 548.3µs | 192.168.1.2 | GET "/api/tags"
  29. [GIN] 2025/07/19 - 17:29:46 | 200 | 0s | 192.168.1.2 | GET "/api/ps"
  30. time=2025-07-19T17:30:01.377+02:00 level=INFO source=server.go:135 msg="system memory" total="47.1 GiB" free="29.6 GiB" free_swap="29.1 GiB"
  31. time=2025-07-19T17:30:01.378+02:00 level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=33 layers.offload=22 layers.split="" memory.available="[19.4 GiB]" memory.gpu_overhead="0 B" memory.required.full="27.1 GiB" memory.required.partial="19.2 GiB" memory.required.kv="512.0 MiB" memory.required.allocations="[19.2 GiB]" memory.weights.total="24.6 GiB" memory.weights.repeating="24.5 GiB" memory.weights.nonrepeating="102.6 MiB" memory.graph.full="296.0 MiB" memory.graph.partial="830.0 MiB"
  32. llama_model_loader: loaded meta data with 40 key-value pairs and 323 tensors from C:\Users\Haldi\.ollama\models\blobs\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 (version GGUF V3 (latest))
  33. llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
  34. llama_model_loader: - kv 0: general.architecture str = llama
  35. llama_model_loader: - kv 1: general.type str = model
  36. llama_model_loader: - kv 2: general.name str = Mixtral 8x7B Instruct v0.1
  37. llama_model_loader: - kv 3: general.version str = v0.1
  38. llama_model_loader: - kv 4: general.finetune str = Instruct
  39. llama_model_loader: - kv 5: general.basename str = Mixtral
  40. llama_model_loader: - kv 6: general.size_label str = 8x7B
  41. llama_model_loader: - kv 7: general.license str = apache-2.0
  42. llama_model_loader: - kv 8: general.base_model.count u32 = 1
  43. llama_model_loader: - kv 9: general.base_model.0.name str = Mixtral 8x7B v0.1
  44. llama_model_loader: - kv 10: general.base_model.0.version str = v0.1
  45. llama_model_loader: - kv 11: general.base_model.0.organization str = Mistralai
  46. llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/mistralai/Mixt...
  47. llama_model_loader: - kv 13: general.languages arr[str,5] = ["fr", "it", "de", "es", "en"]
  48. llama_model_loader: - kv 14: llama.block_count u32 = 32
  49. llama_model_loader: - kv 15: llama.context_length u32 = 32768
  50. llama_model_loader: - kv 16: llama.embedding_length u32 = 4096
  51. llama_model_loader: - kv 17: llama.feed_forward_length u32 = 14336
  52. llama_model_loader: - kv 18: llama.attention.head_count u32 = 32
  53. llama_model_loader: - kv 19: llama.attention.head_count_kv u32 = 8
  54. llama_model_loader: - kv 20: llama.rope.freq_base f32 = 1000000.000000
  55. llama_model_loader: - kv 21: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
  56. llama_model_loader: - kv 22: llama.expert_count u32 = 8
  57. llama_model_loader: - kv 23: llama.expert_used_count u32 = 2
  58. llama_model_loader: - kv 24: general.file_type u32 = 2
  59. llama_model_loader: - kv 25: llama.vocab_size u32 = 32000
  60. llama_model_loader: - kv 26: llama.rope.dimension_count u32 = 128
  61. llama_model_loader: - kv 27: tokenizer.ggml.model str = llama
  62. llama_model_loader: - kv 28: tokenizer.ggml.pre str = default
  63. llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
  64. llama_model_loader: - kv 30: tokenizer.ggml.scores arr[f32,32000] = [-1000.000000, -1000.000000, -1000.00...
  65. llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,32000] = [3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
  66. llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 1
  67. llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 2
  68. llama_model_loader: - kv 34: tokenizer.ggml.unknown_token_id u32 = 0
  69. llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = true
  70. llama_model_loader: - kv 36: tokenizer.ggml.add_eos_token bool = false
  71. llama_model_loader: - kv 37: tokenizer.chat_template str = {%- if messages[0]['role'] == 'system...
  72. llama_model_loader: - kv 38: tokenizer.ggml.add_space_prefix bool = false
  73. llama_model_loader: - kv 39: general.quantization_version u32 = 2
  74. llama_model_loader: - type f32: 97 tensors
  75. llama_model_loader: - type q4_0: 161 tensors
  76. llama_model_loader: - type q8_0: 64 tensors
  77. llama_model_loader: - type q6_K: 1 tensors
  78. print_info: file format = GGUF V3 (latest)
  79. print_info: file type = Q4_0
  80. print_info: file size = 24.63 GiB (4.53 BPW)
  81. load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
  82. load: special tokens cache size = 3
  83. load: token to piece cache size = 0.1637 MB
  84. print_info: arch = llama
  85. print_info: vocab_only = 1
  86. print_info: model type = ?B
  87. print_info: model params = 46.70 B
  88. print_info: general.name = Mixtral 8x7B Instruct v0.1
  89. print_info: vocab type = SPM
  90. print_info: n_vocab = 32000
  91. print_info: n_merges = 0
  92. print_info: BOS token = 1 '<s>'
  93. print_info: EOS token = 2 '</s>'
  94. print_info: UNK token = 0 '<unk>'
  95. print_info: LF token = 13 '<0x0A>'
  96. print_info: EOG token = 2 '</s>'
  97. print_info: max token length = 48
  98. llama_model_load: vocab only - skipping tensors
  99. time=2025-07-19T17:30:01.412+02:00 level=INFO source=server.go:438 msg="starting llama server" cmd="C:\\Users\\Haldi\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --model C:\\Users\\Haldi\\.ollama\\models\\blobs\\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 --ctx-size 4096 --batch-size 512 --n-gpu-layers 22 --threads 16 --no-mmap --parallel 1 --port 51719"
  100. time=2025-07-19T17:30:01.416+02:00 level=INFO source=sched.go:483 msg="loaded runners" count=1
  101. time=2025-07-19T17:30:01.416+02:00 level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
  102. time=2025-07-19T17:30:01.416+02:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server error"
  103. time=2025-07-19T17:30:01.447+02:00 level=INFO source=runner.go:815 msg="starting go runner"
  104. ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
  105. ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
  106. ggml_cuda_init: found 1 CUDA devices:
  107. Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  108. load_backend: loaded CUDA backend from C:\Users\Haldi\AppData\Local\Programs\Ollama\lib\ollama\ggml-cuda.dll
  109. load_backend: loaded CPU backend from C:\Users\Haldi\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-icelake.dll
  110. time=2025-07-19T17:30:09.855+02:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
  111. time=2025-07-19T17:30:09.856+02:00 level=INFO source=runner.go:874 msg="Server listening on 127.0.0.1:51719"
  112. time=2025-07-19T17:30:09.934+02:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
  113. llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4090) - 22994 MiB free
  114. llama_model_loader: loaded meta data with 40 key-value pairs and 323 tensors from C:\Users\Haldi\.ollama\models\blobs\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 (version GGUF V3 (latest))
  115. llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
  116. llama_model_loader: - kv 0: general.architecture str = llama
  117. llama_model_loader: - kv 1: general.type str = model
  118. llama_model_loader: - kv 2: general.name str = Mixtral 8x7B Instruct v0.1
  119. llama_model_loader: - kv 3: general.version str = v0.1
  120. llama_model_loader: - kv 4: general.finetune str = Instruct
  121. llama_model_loader: - kv 5: general.basename str = Mixtral
  122. llama_model_loader: - kv 6: general.size_label str = 8x7B
  123. llama_model_loader: - kv 7: general.license str = apache-2.0
  124. llama_model_loader: - kv 8: general.base_model.count u32 = 1
  125. llama_model_loader: - kv 9: general.base_model.0.name str = Mixtral 8x7B v0.1
  126. llama_model_loader: - kv 10: general.base_model.0.version str = v0.1
  127. llama_model_loader: - kv 11: general.base_model.0.organization str = Mistralai
  128. llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/mistralai/Mixt...
  129. llama_model_loader: - kv 13: general.languages arr[str,5] = ["fr", "it", "de", "es", "en"]
  130. llama_model_loader: - kv 14: llama.block_count u32 = 32
  131. llama_model_loader: - kv 15: llama.context_length u32 = 32768
  132. llama_model_loader: - kv 16: llama.embedding_length u32 = 4096
  133. llama_model_loader: - kv 17: llama.feed_forward_length u32 = 14336
  134. llama_model_loader: - kv 18: llama.attention.head_count u32 = 32
  135. llama_model_loader: - kv 19: llama.attention.head_count_kv u32 = 8
  136. llama_model_loader: - kv 20: llama.rope.freq_base f32 = 1000000.000000
  137. llama_model_loader: - kv 21: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
  138. llama_model_loader: - kv 22: llama.expert_count u32 = 8
  139. llama_model_loader: - kv 23: llama.expert_used_count u32 = 2
  140. llama_model_loader: - kv 24: general.file_type u32 = 2
  141. llama_model_loader: - kv 25: llama.vocab_size u32 = 32000
  142. llama_model_loader: - kv 26: llama.rope.dimension_count u32 = 128
  143. llama_model_loader: - kv 27: tokenizer.ggml.model str = llama
  144. llama_model_loader: - kv 28: tokenizer.ggml.pre str = default
  145. llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
  146. llama_model_loader: - kv 30: tokenizer.ggml.scores arr[f32,32000] = [-1000.000000, -1000.000000, -1000.00...
  147. llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,32000] = [3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
  148. llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 1
  149. llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 2
  150. llama_model_loader: - kv 34: tokenizer.ggml.unknown_token_id u32 = 0
  151. llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = true
  152. llama_model_loader: - kv 36: tokenizer.ggml.add_eos_token bool = false
  153. llama_model_loader: - kv 37: tokenizer.chat_template str = {%- if messages[0]['role'] == 'system...
  154. llama_model_loader: - kv 38: tokenizer.ggml.add_space_prefix bool = false
  155. llama_model_loader: - kv 39: general.quantization_version u32 = 2
  156. llama_model_loader: - type f32: 97 tensors
  157. llama_model_loader: - type q4_0: 161 tensors
  158. llama_model_loader: - type q8_0: 64 tensors
  159. llama_model_loader: - type q6_K: 1 tensors
  160. print_info: file format = GGUF V3 (latest)
  161. print_info: file type = Q4_0
  162. print_info: file size = 24.63 GiB (4.53 BPW)
  163. load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
  164. load: special tokens cache size = 3
  165. load: token to piece cache size = 0.1637 MB
  166. print_info: arch = llama
  167. print_info: vocab_only = 0
  168. print_info: n_ctx_train = 32768
  169. print_info: n_embd = 4096
  170. print_info: n_layer = 32
  171. print_info: n_head = 32
  172. print_info: n_head_kv = 8
  173. print_info: n_rot = 128
  174. print_info: n_swa = 0
  175. print_info: n_swa_pattern = 1
  176. print_info: n_embd_head_k = 128
  177. print_info: n_embd_head_v = 128
  178. print_info: n_gqa = 4
  179. print_info: n_embd_k_gqa = 1024
  180. print_info: n_embd_v_gqa = 1024
  181. print_info: f_norm_eps = 0.0e+00
  182. print_info: f_norm_rms_eps = 1.0e-05
  183. print_info: f_clamp_kqv = 0.0e+00
  184. print_info: f_max_alibi_bias = 0.0e+00
  185. print_info: f_logit_scale = 0.0e+00
  186. print_info: f_attn_scale = 0.0e+00
  187. print_info: n_ff = 14336
  188. print_info: n_expert = 8
  189. print_info: n_expert_used = 2
  190. print_info: causal attn = 1
  191. print_info: pooling type = 0
  192. print_info: rope type = 0
  193. print_info: rope scaling = linear
  194. print_info: freq_base_train = 1000000.0
  195. print_info: freq_scale_train = 1
  196. print_info: n_ctx_orig_yarn = 32768
  197. print_info: rope_finetuned = unknown
  198. print_info: ssm_d_conv = 0
  199. print_info: ssm_d_inner = 0
  200. print_info: ssm_d_state = 0
  201. print_info: ssm_dt_rank = 0
  202. print_info: ssm_dt_b_c_rms = 0
  203. print_info: model type = 8x7B
  204. print_info: model params = 46.70 B
  205. print_info: general.name = Mixtral 8x7B Instruct v0.1
  206. print_info: vocab type = SPM
  207. print_info: n_vocab = 32000
  208. print_info: n_merges = 0
  209. print_info: BOS token = 1 '<s>'
  210. print_info: EOS token = 2 '</s>'
  211. print_info: UNK token = 0 '<unk>'
  212. print_info: LF token = 13 '<0x0A>'
  213. print_info: EOG token = 2 '</s>'
  214. print_info: max token length = 48
  215. load_tensors: loading model tensors, this can take a while... (mmap = false)
  216. load_tensors: offloading 22 repeating layers to GPU
  217. load_tensors: offloaded 22/33 layers to GPU
  218. load_tensors: CUDA_Host model buffer size = 7999.43 MiB
  219. load_tensors: CUDA0 model buffer size = 17218.44 MiB
  220. llama_context: constructing llama_context
  221. llama_context: n_seq_max = 1
  222. llama_context: n_ctx = 4096
  223. llama_context: n_ctx_per_seq = 4096
  224. llama_context: n_batch = 512
  225. llama_context: n_ubatch = 512
  226. llama_context: causal_attn = 1
  227. llama_context: flash_attn = 0
  228. llama_context: freq_base = 1000000.0
  229. llama_context: freq_scale = 1
  230. llama_context: n_ctx_per_seq (4096) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
  231. llama_context: CPU output buffer size = 0.14 MiB
  232. llama_kv_cache_unified: kv_size = 4096, type_k = 'f16', type_v = 'f16', n_layer = 32, can_shift = 1, padding = 32
  233. llama_kv_cache_unified: CUDA0 KV buffer size = 352.00 MiB
  234. llama_kv_cache_unified: CPU KV buffer size = 160.00 MiB
  235. llama_kv_cache_unified: KV self size = 512.00 MiB, K (f16): 256.00 MiB, V (f16): 256.00 MiB
  236. llama_context: CUDA0 compute buffer size = 397.00 MiB
  237. llama_context: CUDA_Host compute buffer size = 16.01 MiB
  238. llama_context: graph nodes = 1574
  239. llama_context: graph splits = 124 (with bs=512), 3 (with bs=1)
  240. time=2025-07-19T17:30:25.218+02:00 level=INFO source=server.go:637 msg="llama runner started in 23.80 seconds"
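
(Editor's note: the offload line at 17:30:01 shows why only 22 of 33 layers went to the GPU: the full model needs ~27.1 GiB but only ~19.4 GiB of VRAM was available. Below is a rough back-of-the-envelope check of that split using the numbers from that line; it is not Ollama's actual memory accounting.)

# Rough sanity check of the partial-offload numbers reported at 17:30:01.
GIB = 1024**3
MIB = 1024**2

available      = 19.4 * GIB    # memory.available
weights_repeat = 24.5 * GIB    # memory.weights.repeating, spread over 32 repeating layers
weights_nonrep = 102.6 * MIB   # memory.weights.nonrepeating
kv_total       = 512.0 * MIB   # memory.required.kv for all 32 layers
graph_partial  = 830.0 * MIB   # memory.graph.partial

per_layer = (weights_repeat + kv_total) / 32          # ~0.78 GiB of weights + KV cache per layer
budget = available - graph_partial - weights_nonrep   # what is left for repeating layers
print(int(budget // per_layer))                        # ~23 with this crude model; the scheduler's
                                                       # more conservative accounting settled on 22
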
  241. [GIN] 2025/07/19 - 17:30:26 | 200 | 25.6488028s | 192.168.1.2 | POST "/api/chat"
  242. [GIN] 2025/07/19 - 17:30:31 | 200 | 4.0439405s | 192.168.1.2 | POST "/api/chat"
  243. [GIN] 2025/07/19 - 17:30:32 | 200 | 1.2118845s | 192.168.1.2 | POST "/api/chat"
  244. [GIN] 2025/07/19 - 17:30:34 | 200 | 2.4645466s | 192.168.1.2 | POST "/api/chat"
  245. [GIN] 2025/07/19 - 17:31:11 | 200 | 18.1656705s | 192.168.1.2 | POST "/api/chat"
  246. [GIN] 2025/07/19 - 17:31:16 | 200 | 5.6065391s | 192.168.1.2 | POST "/api/chat"
  247. [GIN] 2025/07/19 - 17:31:36 | 200 | 11.1392618s | 192.168.1.2 | POST "/api/chat"
  248. [GIN] 2025/07/19 - 17:31:42 | 200 | 5.6420976s | 192.168.1.2 | POST "/api/chat"
  249. [GIN] 2025/07/19 - 17:32:17 | 200 | 12.0526902s | 192.168.1.2 | POST "/api/chat"
  250. [GIN] 2025/07/19 - 17:32:22 | 200 | 5.281443s | 192.168.1.2 | POST "/api/chat"
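
(Editor's note: the POST /api/chat requests between 17:30 and 17:32 are ordinary chat completions against the loaded runner. A minimal non-streaming sketch of such a request follows; the model tag is again an assumption, and the prompt is illustrative only.)

import requests  # assumption: the 'requests' package is installed

OLLAMA_URL = "http://localhost:11434"
MODEL = "mixtral:8x7b-instruct-v0.1-q4_0"   # assumption: tag not visible in this log

payload = {
    "model": MODEL,
    "messages": [{"role": "user", "content": "Summarize what Mixtral 8x7B is in one sentence."}],
    "stream": False,   # request a single blocking response instead of a token stream
}
resp = requests.post(f"{OLLAMA_URL}/api/chat", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["message"]["content"])
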
  251. [GIN] 2025/07/19 - 17:36:05 | 200 | 0s | 127.0.0.1 | GET "/"
  252. [GIN] 2025/07/19 - 17:36:05 | 200 | 0s | 127.0.0.1 | GET "/"
  253. [GIN] 2025/07/19 - 17:36:05 | 404 | 0s | 127.0.0.1 | GET "/favicon.ico"
  254. [GIN] 2025/07/19 - 17:36:19 | 200 | 520.4µs | 192.168.1.1 | HEAD "/"
  255. [GIN] 2025/07/19 - 17:36:38 | 200 | 0s | 192.168.1.1 | GET "/"
  256. [GIN] 2025/07/19 - 17:36:47 | 200 | 0s | 192.168.1.200 | GET "/"
  257. [GIN] 2025/07/19 - 17:36:47 | 404 | 0s | 192.168.1.200 | GET "/favicon.ico"
  258. time=2025-07-19T17:38:10.309+02:00 level=INFO source=server.go:135 msg="system memory" total="47.1 GiB" free="32.1 GiB" free_swap="30.2 GiB"
  259. time=2025-07-19T17:38:10.310+02:00 level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=33 layers.offload=22 layers.split="" memory.available="[19.4 GiB]" memory.gpu_overhead="0 B" memory.required.full="27.1 GiB" memory.required.partial="19.2 GiB" memory.required.kv="512.0 MiB" memory.required.allocations="[19.2 GiB]" memory.weights.total="24.6 GiB" memory.weights.repeating="24.5 GiB" memory.weights.nonrepeating="102.6 MiB" memory.graph.full="296.0 MiB" memory.graph.partial="830.0 MiB"
  260. llama_model_loader: loaded meta data with 40 key-value pairs and 323 tensors from C:\Users\Haldi\.ollama\models\blobs\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 (version GGUF V3 (latest))
  261. llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
  262. llama_model_loader: - kv 0: general.architecture str = llama
  263. llama_model_loader: - kv 1: general.type str = model
  264. llama_model_loader: - kv 2: general.name str = Mixtral 8x7B Instruct v0.1
  265. llama_model_loader: - kv 3: general.version str = v0.1
  266. llama_model_loader: - kv 4: general.finetune str = Instruct
  267. llama_model_loader: - kv 5: general.basename str = Mixtral
  268. llama_model_loader: - kv 6: general.size_label str = 8x7B
  269. llama_model_loader: - kv 7: general.license str = apache-2.0
  270. llama_model_loader: - kv 8: general.base_model.count u32 = 1
  271. llama_model_loader: - kv 9: general.base_model.0.name str = Mixtral 8x7B v0.1
  272. llama_model_loader: - kv 10: general.base_model.0.version str = v0.1
  273. llama_model_loader: - kv 11: general.base_model.0.organization str = Mistralai
  274. llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/mistralai/Mixt...
  275. llama_model_loader: - kv 13: general.languages arr[str,5] = ["fr", "it", "de", "es", "en"]
  276. llama_model_loader: - kv 14: llama.block_count u32 = 32
  277. llama_model_loader: - kv 15: llama.context_length u32 = 32768
  278. llama_model_loader: - kv 16: llama.embedding_length u32 = 4096
  279. llama_model_loader: - kv 17: llama.feed_forward_length u32 = 14336
  280. llama_model_loader: - kv 18: llama.attention.head_count u32 = 32
  281. llama_model_loader: - kv 19: llama.attention.head_count_kv u32 = 8
  282. llama_model_loader: - kv 20: llama.rope.freq_base f32 = 1000000.000000
  283. llama_model_loader: - kv 21: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
  284. llama_model_loader: - kv 22: llama.expert_count u32 = 8
  285. llama_model_loader: - kv 23: llama.expert_used_count u32 = 2
  286. llama_model_loader: - kv 24: general.file_type u32 = 2
  287. llama_model_loader: - kv 25: llama.vocab_size u32 = 32000
  288. llama_model_loader: - kv 26: llama.rope.dimension_count u32 = 128
  289. llama_model_loader: - kv 27: tokenizer.ggml.model str = llama
  290. llama_model_loader: - kv 28: tokenizer.ggml.pre str = default
  291. llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
  292. llama_model_loader: - kv 30: tokenizer.ggml.scores arr[f32,32000] = [-1000.000000, -1000.000000, -1000.00...
  293. llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,32000] = [3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
  294. llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 1
  295. llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 2
  296. llama_model_loader: - kv 34: tokenizer.ggml.unknown_token_id u32 = 0
  297. llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = true
  298. llama_model_loader: - kv 36: tokenizer.ggml.add_eos_token bool = false
  299. llama_model_loader: - kv 37: tokenizer.chat_template str = {%- if messages[0]['role'] == 'system...
  300. llama_model_loader: - kv 38: tokenizer.ggml.add_space_prefix bool = false
  301. llama_model_loader: - kv 39: general.quantization_version u32 = 2
  302. llama_model_loader: - type f32: 97 tensors
  303. llama_model_loader: - type q4_0: 161 tensors
  304. llama_model_loader: - type q8_0: 64 tensors
  305. llama_model_loader: - type q6_K: 1 tensors
  306. print_info: file format = GGUF V3 (latest)
  307. print_info: file type = Q4_0
  308. print_info: file size = 24.63 GiB (4.53 BPW)
  309. load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
  310. load: special tokens cache size = 3
  311. load: token to piece cache size = 0.1637 MB
  312. print_info: arch = llama
  313. print_info: vocab_only = 1
  314. print_info: model type = ?B
  315. print_info: model params = 46.70 B
  316. print_info: general.name = Mixtral 8x7B Instruct v0.1
  317. print_info: vocab type = SPM
  318. print_info: n_vocab = 32000
  319. print_info: n_merges = 0
  320. print_info: BOS token = 1 '<s>'
  321. print_info: EOS token = 2 '</s>'
  322. print_info: UNK token = 0 '<unk>'
  323. print_info: LF token = 13 '<0x0A>'
  324. print_info: EOG token = 2 '</s>'
  325. print_info: max token length = 48
  326. llama_model_load: vocab only - skipping tensors
  327. time=2025-07-19T17:38:10.332+02:00 level=INFO source=server.go:438 msg="starting llama server" cmd="C:\\Users\\Haldi\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --model C:\\Users\\Haldi\\.ollama\\models\\blobs\\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 --ctx-size 4096 --batch-size 512 --n-gpu-layers 22 --threads 16 --no-mmap --parallel 1 --port 52038"
  328. time=2025-07-19T17:38:10.334+02:00 level=INFO source=sched.go:483 msg="loaded runners" count=1
  329. time=2025-07-19T17:38:10.334+02:00 level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
  330. time=2025-07-19T17:38:10.335+02:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server error"
  331. time=2025-07-19T17:38:10.378+02:00 level=INFO source=runner.go:815 msg="starting go runner"
  332. ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
  333. ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
  334. ggml_cuda_init: found 1 CUDA devices:
  335. Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  336. load_backend: loaded CUDA backend from C:\Users\Haldi\AppData\Local\Programs\Ollama\lib\ollama\ggml-cuda.dll
  337. load_backend: loaded CPU backend from C:\Users\Haldi\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-icelake.dll
  338. time=2025-07-19T17:38:10.453+02:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
  339. time=2025-07-19T17:38:10.454+02:00 level=INFO source=runner.go:874 msg="Server listening on 127.0.0.1:52038"
  340. llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4090) - 22994 MiB free
  341. llama_model_loader: loaded meta data with 40 key-value pairs and 323 tensors from C:\Users\Haldi\.ollama\models\blobs\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 (version GGUF V3 (latest))
  342. llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
  343. llama_model_loader: - kv 0: general.architecture str = llama
  344. llama_model_loader: - kv 1: general.type str = model
  345. llama_model_loader: - kv 2: general.name str = Mixtral 8x7B Instruct v0.1
  346. llama_model_loader: - kv 3: general.version str = v0.1
  347. llama_model_loader: - kv 4: general.finetune str = Instruct
  348. llama_model_loader: - kv 5: general.basename str = Mixtral
  349. llama_model_loader: - kv 6: general.size_label str = 8x7B
  350. llama_model_loader: - kv 7: general.license str = apache-2.0
  351. llama_model_loader: - kv 8: general.base_model.count u32 = 1
  352. llama_model_loader: - kv 9: general.base_model.0.name str = Mixtral 8x7B v0.1
  353. llama_model_loader: - kv 10: general.base_model.0.version str = v0.1
  354. llama_model_loader: - kv 11: general.base_model.0.organization str = Mistralai
  355. llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/mistralai/Mixt...
  356. llama_model_loader: - kv 13: general.languages arr[str,5] = ["fr", "it", "de", "es", "en"]
  357. llama_model_loader: - kv 14: llama.block_count u32 = 32
  358. llama_model_loader: - kv 15: llama.context_length u32 = 32768
  359. llama_model_loader: - kv 16: llama.embedding_length u32 = 4096
  360. llama_model_loader: - kv 17: llama.feed_forward_length u32 = 14336
  361. llama_model_loader: - kv 18: llama.attention.head_count u32 = 32
  362. llama_model_loader: - kv 19: llama.attention.head_count_kv u32 = 8
  363. llama_model_loader: - kv 20: llama.rope.freq_base f32 = 1000000.000000
  364. llama_model_loader: - kv 21: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
  365. llama_model_loader: - kv 22: llama.expert_count u32 = 8
  366. llama_model_loader: - kv 23: llama.expert_used_count u32 = 2
  367. llama_model_loader: - kv 24: general.file_type u32 = 2
  368. llama_model_loader: - kv 25: llama.vocab_size u32 = 32000
  369. llama_model_loader: - kv 26: llama.rope.dimension_count u32 = 128
  370. llama_model_loader: - kv 27: tokenizer.ggml.model str = llama
  371. llama_model_loader: - kv 28: tokenizer.ggml.pre str = default
  372. llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
  373. llama_model_loader: - kv 30: tokenizer.ggml.scores arr[f32,32000] = [-1000.000000, -1000.000000, -1000.00...
  374. llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,32000] = [3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
  375. llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 1
  376. llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 2
  377. llama_model_loader: - kv 34: tokenizer.ggml.unknown_token_id u32 = 0
  378. llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = true
  379. llama_model_loader: - kv 36: tokenizer.ggml.add_eos_token bool = false
  380. llama_model_loader: - kv 37: tokenizer.chat_template str = {%- if messages[0]['role'] == 'system...
  381. llama_model_loader: - kv 38: tokenizer.ggml.add_space_prefix bool = false
  382. llama_model_loader: - kv 39: general.quantization_version u32 = 2
  383. llama_model_loader: - type f32: 97 tensors
  384. llama_model_loader: - type q4_0: 161 tensors
  385. llama_model_loader: - type q8_0: 64 tensors
  386. llama_model_loader: - type q6_K: 1 tensors
  387. print_info: file format = GGUF V3 (latest)
  388. print_info: file type = Q4_0
  389. print_info: file size = 24.63 GiB (4.53 BPW)
  390. time=2025-07-19T17:38:10.586+02:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
  391. load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
  392. load: special tokens cache size = 3
  393. load: token to piece cache size = 0.1637 MB
  394. print_info: arch = llama
  395. print_info: vocab_only = 0
  396. print_info: n_ctx_train = 32768
  397. print_info: n_embd = 4096
  398. print_info: n_layer = 32
  399. print_info: n_head = 32
  400. print_info: n_head_kv = 8
  401. print_info: n_rot = 128
  402. print_info: n_swa = 0
  403. print_info: n_swa_pattern = 1
  404. print_info: n_embd_head_k = 128
  405. print_info: n_embd_head_v = 128
  406. print_info: n_gqa = 4
  407. print_info: n_embd_k_gqa = 1024
  408. print_info: n_embd_v_gqa = 1024
  409. print_info: f_norm_eps = 0.0e+00
  410. print_info: f_norm_rms_eps = 1.0e-05
  411. print_info: f_clamp_kqv = 0.0e+00
  412. print_info: f_max_alibi_bias = 0.0e+00
  413. print_info: f_logit_scale = 0.0e+00
  414. print_info: f_attn_scale = 0.0e+00
  415. print_info: n_ff = 14336
  416. print_info: n_expert = 8
  417. print_info: n_expert_used = 2
  418. print_info: causal attn = 1
  419. print_info: pooling type = 0
  420. print_info: rope type = 0
  421. print_info: rope scaling = linear
  422. print_info: freq_base_train = 1000000.0
  423. print_info: freq_scale_train = 1
  424. print_info: n_ctx_orig_yarn = 32768
  425. print_info: rope_finetuned = unknown
  426. print_info: ssm_d_conv = 0
  427. print_info: ssm_d_inner = 0
  428. print_info: ssm_d_state = 0
  429. print_info: ssm_dt_rank = 0
  430. print_info: ssm_dt_b_c_rms = 0
  431. print_info: model type = 8x7B
  432. print_info: model params = 46.70 B
  433. print_info: general.name = Mixtral 8x7B Instruct v0.1
  434. print_info: vocab type = SPM
  435. print_info: n_vocab = 32000
  436. print_info: n_merges = 0
  437. print_info: BOS token = 1 '<s>'
  438. print_info: EOS token = 2 '</s>'
  439. print_info: UNK token = 0 '<unk>'
  440. print_info: LF token = 13 '<0x0A>'
  441. print_info: EOG token = 2 '</s>'
  442. print_info: max token length = 48
  443. load_tensors: loading model tensors, this can take a while... (mmap = false)
  444. load_tensors: offloading 22 repeating layers to GPU
  445. load_tensors: offloaded 22/33 layers to GPU
  446. load_tensors: CUDA_Host model buffer size = 7999.43 MiB
  447. load_tensors: CUDA0 model buffer size = 17218.44 MiB
  448. llama_context: constructing llama_context
  449. llama_context: n_seq_max = 1
  450. llama_context: n_ctx = 4096
  451. llama_context: n_ctx_per_seq = 4096
  452. llama_context: n_batch = 512
  453. llama_context: n_ubatch = 512
  454. llama_context: causal_attn = 1
  455. llama_context: flash_attn = 0
  456. llama_context: freq_base = 1000000.0
  457. llama_context: freq_scale = 1
  458. llama_context: n_ctx_per_seq (4096) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
  459. llama_context: CPU output buffer size = 0.14 MiB
  460. llama_kv_cache_unified: kv_size = 4096, type_k = 'f16', type_v = 'f16', n_layer = 32, can_shift = 1, padding = 32
  461. llama_kv_cache_unified: CUDA0 KV buffer size = 352.00 MiB
  462. llama_kv_cache_unified: CPU KV buffer size = 160.00 MiB
  463. llama_kv_cache_unified: KV self size = 512.00 MiB, K (f16): 256.00 MiB, V (f16): 256.00 MiB
  464. llama_context: CUDA0 compute buffer size = 397.00 MiB
  465. llama_context: CUDA_Host compute buffer size = 16.01 MiB
  466. llama_context: graph nodes = 1574
  467. llama_context: graph splits = 124 (with bs=512), 3 (with bs=1)
  468. time=2025-07-19T17:38:25.868+02:00 level=INFO source=server.go:637 msg="llama runner started in 15.53 seconds"
  469. [GIN] 2025/07/19 - 17:38:31 | 200 | 21.6694484s | 192.168.1.2 | POST "/api/chat"
  470. [GIN] 2025/07/19 - 17:38:45 | 200 | 13.088288s | 192.168.1.2 | POST "/api/chat"
  471. [GIN] 2025/07/19 - 17:38:50 | 200 | 5.1806868s | 192.168.1.2 | POST "/api/chat"
  472. [GIN] 2025/07/19 - 17:39:57 | 200 | 2.9611066s | 192.168.1.2 | POST "/api/chat"
  473. [GIN] 2025/07/19 - 17:40:08 | 200 | 11.6257541s | 192.168.1.2 | POST "/api/chat"
  474. [GIN] 2025/07/19 - 17:40:15 | 200 | 6.8390283s | 192.168.1.2 | POST "/api/chat"
  475. [GIN] 2025/07/19 - 17:40:53 | 200 | 2.8219905s | 192.168.1.2 | POST "/api/chat"
  476. [GIN] 2025/07/19 - 17:41:01 | 200 | 7.9993163s | 192.168.1.2 | POST "/api/chat"
  477. [GIN] 2025/07/19 - 17:41:06 | 200 | 4.4385367s | 192.168.1.2 | POST "/api/chat"
  478. [GIN] 2025/07/19 - 17:41:08 | 200 | 1.0207ms | 192.168.1.1 | GET "/"
  479. [GIN] 2025/07/19 - 17:41:30 | 200 | 2.7923974s | 192.168.1.2 | POST "/api/chat"
  480. [GIN] 2025/07/19 - 17:41:48 | 200 | 17.8480399s | 192.168.1.2 | POST "/api/chat"
  481. [GIN] 2025/07/19 - 17:41:53 | 200 | 5.5996437s | 192.168.1.2 | POST "/api/chat"
  482. [GIN] 2025/07/19 - 17:42:17 | 200 | 2.7939619s | 192.168.1.2 | POST "/api/chat"
  483. [GIN] 2025/07/19 - 17:42:30 | 200 | 12.5119819s | 192.168.1.2 | POST "/api/chat"
  484. [GIN] 2025/07/19 - 17:42:35 | 200 | 5.0069482s | 192.168.1.2 | POST "/api/chat"
  485. [GIN] 2025/07/19 - 17:43:36 | 200 | 0s | 192.168.1.2 | GET "/api/tags"
  486. [GIN] 2025/07/19 - 17:43:36 | 200 | 0s | 192.168.1.2 | GET "/api/ps"
  487. [GIN] 2025/07/19 - 17:43:37 | 200 | 0s | 192.168.1.2 | GET "/api/version"
  488. [GIN] 2025/07/19 - 17:43:38 | 200 | 524.1µs | 192.168.1.2 | GET "/api/tags"
  489. [GIN] 2025/07/19 - 17:43:38 | 200 | 0s | 192.168.1.2 | GET "/api/ps"
  490. [GIN] 2025/07/19 - 17:43:39 | 200 | 521.3µs | 192.168.1.2 | GET "/api/tags"
  491. [GIN] 2025/07/19 - 17:43:39 | 200 | 0s | 192.168.1.2 | GET "/api/ps"
  492. [GIN] 2025/07/19 - 17:43:47 | 200 | 0s | 192.168.1.2 | GET "/api/version"
  493. [GIN] 2025/07/19 - 17:43:48 | 200 | 522µs | 192.168.1.2 | GET "/api/tags"
  494. [GIN] 2025/07/19 - 17:43:48 | 200 | 0s | 192.168.1.2 | GET "/api/ps"
  495. [GIN] 2025/07/19 - 17:43:49 | 200 | 531.7µs | 192.168.1.2 | GET "/api/tags"
  496. [GIN] 2025/07/19 - 17:43:49 | 200 | 0s | 192.168.1.2 | GET "/api/ps"
  497. [GIN] 2025/07/19 - 17:43:59 | 200 | 0s | 192.168.1.2 | GET "/api/version"
  498. [GIN] 2025/07/19 - 17:44:29 | 200 | 522.9µs | 192.168.1.2 | GET "/api/tags"
  499. [GIN] 2025/07/19 - 17:44:29 | 200 | 0s | 192.168.1.2 | GET "/api/ps"
  500. [GIN] 2025/07/19 - 17:45:49 | 200 | 32.1349125s | 192.168.1.2 | POST "/api/chat"
  501. [GIN] 2025/07/19 - 17:45:56 | 200 | 6.3929667s | 192.168.1.2 | POST "/api/chat"
  502. [GIN] 2025/07/19 - 17:45:59 | 200 | 2.6024057s | 192.168.1.2 | POST "/api/chat"
  503. [GIN] 2025/07/19 - 17:46:01 | 200 | 2.8657871s | 192.168.1.2 | POST "/api/chat"
  504. [GIN] 2025/07/19 - 17:46:06 | 200 | 3.824311s | 192.168.1.2 | POST "/api/chat"
  505. [GIN] 2025/07/19 - 17:46:08 | 200 | 0s | 192.168.1.1 | GET "/"
  506. [GIN] 2025/07/19 - 17:46:39 | 200 | 33.1099085s | 192.168.1.2 | POST "/api/chat"
  507. [GIN] 2025/07/19 - 17:46:50 | 200 | 11.3300074s | 192.168.1.2 | POST "/api/chat"
  508. [GIN] 2025/07/19 - 17:47:07 | 200 | 524.3µs | 192.168.1.2 | GET "/api/tags"
  509. [GIN] 2025/07/19 - 17:47:07 | 200 | 0s | 192.168.1.2 | GET "/api/ps"
  510. [GIN] 2025/07/19 - 17:47:10 | 200 | 1.0018ms | 192.168.1.2 | GET "/api/tags"
  511. [GIN] 2025/07/19 - 17:47:10 | 200 | 0s | 192.168.1.2 | GET "/api/ps"
  512. [GIN] 2025/07/19 - 17:47:10 | 200 | 2.6068ms | 192.168.1.2 | POST "/api/generate"
  513. [GIN] 2025/07/19 - 17:47:10 | 200 | 518.4µs | 192.168.1.2 | GET "/api/tags"
  514. [GIN] 2025/07/19 - 17:47:10 | 200 | 0s | 192.168.1.2 | GET "/api/ps"
  515. time=2025-07-19T17:47:18.676+02:00 level=INFO source=server.go:135 msg="system memory" total="47.1 GiB" free="32.5 GiB" free_swap="29.3 GiB"
  516. time=2025-07-19T17:47:18.676+02:00 level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=33 layers.offload=21 layers.split="" memory.available="[18.6 GiB]" memory.gpu_overhead="0 B" memory.required.full="27.1 GiB" memory.required.partial="18.4 GiB" memory.required.kv="512.0 MiB" memory.required.allocations="[18.4 GiB]" memory.weights.total="24.6 GiB" memory.weights.repeating="24.5 GiB" memory.weights.nonrepeating="102.6 MiB" memory.graph.full="296.0 MiB" memory.graph.partial="830.0 MiB"
  517. llama_model_loader: loaded meta data with 40 key-value pairs and 323 tensors from C:\Users\Haldi\.ollama\models\blobs\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 (version GGUF V3 (latest))
  518. llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
  519. llama_model_loader: - kv 0: general.architecture str = llama
  520. llama_model_loader: - kv 1: general.type str = model
  521. llama_model_loader: - kv 2: general.name str = Mixtral 8x7B Instruct v0.1
  522. llama_model_loader: - kv 3: general.version str = v0.1
  523. llama_model_loader: - kv 4: general.finetune str = Instruct
  524. llama_model_loader: - kv 5: general.basename str = Mixtral
  525. llama_model_loader: - kv 6: general.size_label str = 8x7B
  526. llama_model_loader: - kv 7: general.license str = apache-2.0
  527. llama_model_loader: - kv 8: general.base_model.count u32 = 1
  528. llama_model_loader: - kv 9: general.base_model.0.name str = Mixtral 8x7B v0.1
  529. llama_model_loader: - kv 10: general.base_model.0.version str = v0.1
  530. llama_model_loader: - kv 11: general.base_model.0.organization str = Mistralai
  531. llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/mistralai/Mixt...
  532. llama_model_loader: - kv 13: general.languages arr[str,5] = ["fr", "it", "de", "es", "en"]
  533. llama_model_loader: - kv 14: llama.block_count u32 = 32
  534. llama_model_loader: - kv 15: llama.context_length u32 = 32768
  535. llama_model_loader: - kv 16: llama.embedding_length u32 = 4096
  536. llama_model_loader: - kv 17: llama.feed_forward_length u32 = 14336
  537. llama_model_loader: - kv 18: llama.attention.head_count u32 = 32
  538. llama_model_loader: - kv 19: llama.attention.head_count_kv u32 = 8
  539. llama_model_loader: - kv 20: llama.rope.freq_base f32 = 1000000.000000
  540. llama_model_loader: - kv 21: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
  541. llama_model_loader: - kv 22: llama.expert_count u32 = 8
  542. llama_model_loader: - kv 23: llama.expert_used_count u32 = 2
  543. llama_model_loader: - kv 24: general.file_type u32 = 2
  544. llama_model_loader: - kv 25: llama.vocab_size u32 = 32000
  545. llama_model_loader: - kv 26: llama.rope.dimension_count u32 = 128
  546. llama_model_loader: - kv 27: tokenizer.ggml.model str = llama
  547. llama_model_loader: - kv 28: tokenizer.ggml.pre str = default
  548. llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
  549. llama_model_loader: - kv 30: tokenizer.ggml.scores arr[f32,32000] = [-1000.000000, -1000.000000, -1000.00...
  550. llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,32000] = [3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
  551. llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 1
  552. llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 2
  553. llama_model_loader: - kv 34: tokenizer.ggml.unknown_token_id u32 = 0
  554. llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = true
  555. llama_model_loader: - kv 36: tokenizer.ggml.add_eos_token bool = false
  556. llama_model_loader: - kv 37: tokenizer.chat_template str = {%- if messages[0]['role'] == 'system...
  557. llama_model_loader: - kv 38: tokenizer.ggml.add_space_prefix bool = false
  558. llama_model_loader: - kv 39: general.quantization_version u32 = 2
  559. llama_model_loader: - type f32: 97 tensors
  560. llama_model_loader: - type q4_0: 161 tensors
  561. llama_model_loader: - type q8_0: 64 tensors
  562. llama_model_loader: - type q6_K: 1 tensors
  563. print_info: file format = GGUF V3 (latest)
  564. print_info: file type = Q4_0
  565. print_info: file size = 24.63 GiB (4.53 BPW)
  566. load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
  567. load: special tokens cache size = 3
  568. load: token to piece cache size = 0.1637 MB
  569. print_info: arch = llama
  570. print_info: vocab_only = 1
  571. print_info: model type = ?B
  572. print_info: model params = 46.70 B
  573. print_info: general.name = Mixtral 8x7B Instruct v0.1
  574. print_info: vocab type = SPM
  575. print_info: n_vocab = 32000
  576. print_info: n_merges = 0
  577. print_info: BOS token = 1 '<s>'
  578. print_info: EOS token = 2 '</s>'
  579. print_info: UNK token = 0 '<unk>'
  580. print_info: LF token = 13 '<0x0A>'
  581. print_info: EOG token = 2 '</s>'
  582. print_info: max token length = 48
  583. llama_model_load: vocab only - skipping tensors
  584. time=2025-07-19T17:47:18.699+02:00 level=INFO source=server.go:438 msg="starting llama server" cmd="C:\\Users\\Haldi\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --model C:\\Users\\Haldi\\.ollama\\models\\blobs\\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 --ctx-size 4096 --batch-size 512 --n-gpu-layers 21 --threads 16 --no-mmap --parallel 1 --port 52520"
  585. time=2025-07-19T17:47:18.701+02:00 level=INFO source=sched.go:483 msg="loaded runners" count=1
  586. time=2025-07-19T17:47:18.701+02:00 level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
  587. time=2025-07-19T17:47:18.702+02:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server error"
  588. time=2025-07-19T17:47:18.746+02:00 level=INFO source=runner.go:815 msg="starting go runner"
  589. ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
  590. ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
  591. ggml_cuda_init: found 1 CUDA devices:
  592. Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  593. load_backend: loaded CUDA backend from C:\Users\Haldi\AppData\Local\Programs\Ollama\lib\ollama\ggml-cuda.dll
  594. load_backend: loaded CPU backend from C:\Users\Haldi\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-icelake.dll
  595. time=2025-07-19T17:47:18.822+02:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
  596. time=2025-07-19T17:47:18.822+02:00 level=INFO source=runner.go:874 msg="Server listening on 127.0.0.1:52520"
  597. llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4090) - 22994 MiB free
  598. llama_model_loader: loaded meta data with 40 key-value pairs and 323 tensors from C:\Users\Haldi\.ollama\models\blobs\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 (version GGUF V3 (latest))
  599. llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
  600. llama_model_loader: - kv 0: general.architecture str = llama
  601. llama_model_loader: - kv 1: general.type str = model
  602. llama_model_loader: - kv 2: general.name str = Mixtral 8x7B Instruct v0.1
  603. llama_model_loader: - kv 3: general.version str = v0.1
  604. llama_model_loader: - kv 4: general.finetune str = Instruct
  605. llama_model_loader: - kv 5: general.basename str = Mixtral
  606. llama_model_loader: - kv 6: general.size_label str = 8x7B
  607. llama_model_loader: - kv 7: general.license str = apache-2.0
  608. llama_model_loader: - kv 8: general.base_model.count u32 = 1
  609. llama_model_loader: - kv 9: general.base_model.0.name str = Mixtral 8x7B v0.1
  610. llama_model_loader: - kv 10: general.base_model.0.version str = v0.1
  611. llama_model_loader: - kv 11: general.base_model.0.organization str = Mistralai
  612. llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/mistralai/Mixt...
  613. llama_model_loader: - kv 13: general.languages arr[str,5] = ["fr", "it", "de", "es", "en"]
  614. llama_model_loader: - kv 14: llama.block_count u32 = 32
  615. llama_model_loader: - kv 15: llama.context_length u32 = 32768
  616. llama_model_loader: - kv 16: llama.embedding_length u32 = 4096
  617. llama_model_loader: - kv 17: llama.feed_forward_length u32 = 14336
  618. llama_model_loader: - kv 18: llama.attention.head_count u32 = 32
  619. llama_model_loader: - kv 19: llama.attention.head_count_kv u32 = 8
  620. llama_model_loader: - kv 20: llama.rope.freq_base f32 = 1000000.000000
  621. llama_model_loader: - kv 21: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
  622. llama_model_loader: - kv 22: llama.expert_count u32 = 8
  623. llama_model_loader: - kv 23: llama.expert_used_count u32 = 2
  624. llama_model_loader: - kv 24: general.file_type u32 = 2
  625. llama_model_loader: - kv 25: llama.vocab_size u32 = 32000
  626. llama_model_loader: - kv 26: llama.rope.dimension_count u32 = 128
  627. llama_model_loader: - kv 27: tokenizer.ggml.model str = llama
  628. llama_model_loader: - kv 28: tokenizer.ggml.pre str = default
  629. llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
  630. llama_model_loader: - kv 30: tokenizer.ggml.scores arr[f32,32000] = [-1000.000000, -1000.000000, -1000.00...
  631. llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,32000] = [3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
  632. llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 1
  633. llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 2
  634. llama_model_loader: - kv 34: tokenizer.ggml.unknown_token_id u32 = 0
  635. llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = true
  636. llama_model_loader: - kv 36: tokenizer.ggml.add_eos_token bool = false
  637. llama_model_loader: - kv 37: tokenizer.chat_template str = {%- if messages[0]['role'] == 'system...
  638. llama_model_loader: - kv 38: tokenizer.ggml.add_space_prefix bool = false
  639. llama_model_loader: - kv 39: general.quantization_version u32 = 2
  640. llama_model_loader: - type f32: 97 tensors
  641. llama_model_loader: - type q4_0: 161 tensors
  642. llama_model_loader: - type q8_0: 64 tensors
  643. llama_model_loader: - type q6_K: 1 tensors
  644. print_info: file format = GGUF V3 (latest)
  645. print_info: file type = Q4_0
  646. print_info: file size = 24.63 GiB (4.53 BPW)
  647. load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
  648. load: special tokens cache size = 3
  649. load: token to piece cache size = 0.1637 MB
  650. print_info: arch = llama
  651. print_info: vocab_only = 0
  652. print_info: n_ctx_train = 32768
  653. print_info: n_embd = 4096
  654. print_info: n_layer = 32
  655. print_info: n_head = 32
  656. print_info: n_head_kv = 8
  657. print_info: n_rot = 128
  658. print_info: n_swa = 0
  659. print_info: n_swa_pattern = 1
  660. print_info: n_embd_head_k = 128
  661. print_info: n_embd_head_v = 128
  662. print_info: n_gqa = 4
  663. print_info: n_embd_k_gqa = 1024
  664. print_info: n_embd_v_gqa = 1024
  665. print_info: f_norm_eps = 0.0e+00
  666. print_info: f_norm_rms_eps = 1.0e-05
  667. print_info: f_clamp_kqv = 0.0e+00
  668. print_info: f_max_alibi_bias = 0.0e+00
  669. print_info: f_logit_scale = 0.0e+00
  670. print_info: f_attn_scale = 0.0e+00
  671. print_info: n_ff = 14336
  672. print_info: n_expert = 8
  673. print_info: n_expert_used = 2
  674. print_info: causal attn = 1
  675. print_info: pooling type = 0
  676. print_info: rope type = 0
  677. print_info: rope scaling = linear
  678. print_info: freq_base_train = 1000000.0
  679. print_info: freq_scale_train = 1
  680. print_info: n_ctx_orig_yarn = 32768
  681. print_info: rope_finetuned = unknown
  682. print_info: ssm_d_conv = 0
  683. print_info: ssm_d_inner = 0
  684. print_info: ssm_d_state = 0
  685. print_info: ssm_dt_rank = 0
  686. print_info: ssm_dt_b_c_rms = 0
  687. print_info: model type = 8x7B
  688. print_info: model params = 46.70 B
  689. print_info: general.name = Mixtral 8x7B Instruct v0.1
  690. print_info: vocab type = SPM
  691. print_info: n_vocab = 32000
  692. print_info: n_merges = 0
  693. print_info: BOS token = 1 '<s>'
  694. print_info: EOS token = 2 '</s>'
  695. print_info: UNK token = 0 '<unk>'
  696. print_info: LF token = 13 '<0x0A>'
  697. print_info: EOG token = 2 '</s>'
  698. print_info: max token length = 48
  699. load_tensors: loading model tensors, this can take a while... (mmap = false)
  700. time=2025-07-19T17:47:18.952+02:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
  701. load_tensors: offloading 21 repeating layers to GPU
  702. load_tensors: offloaded 21/33 layers to GPU
  703. load_tensors: CUDA_Host model buffer size = 8782.09 MiB
  704. load_tensors: CUDA0 model buffer size = 16435.78 MiB
  705. llama_context: constructing llama_context
  706. llama_context: n_seq_max = 1
  707. llama_context: n_ctx = 4096
  708. llama_context: n_ctx_per_seq = 4096
  709. llama_context: n_batch = 512
  710. llama_context: n_ubatch = 512
  711. llama_context: causal_attn = 1
  712. llama_context: flash_attn = 0
  713. llama_context: freq_base = 1000000.0
  714. llama_context: freq_scale = 1
  715. llama_context: n_ctx_per_seq (4096) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
  716. llama_context: CPU output buffer size = 0.14 MiB
  717. llama_kv_cache_unified: kv_size = 4096, type_k = 'f16', type_v = 'f16', n_layer = 32, can_shift = 1, padding = 32
  718. llama_kv_cache_unified: CUDA0 KV buffer size = 336.00 MiB
  719. llama_kv_cache_unified: CPU KV buffer size = 176.00 MiB
  720. llama_kv_cache_unified: KV self size = 512.00 MiB, K (f16): 256.00 MiB, V (f16): 256.00 MiB
  721. llama_context: CUDA0 compute buffer size = 397.00 MiB
  722. llama_context: CUDA_Host compute buffer size = 16.01 MiB
  723. llama_context: graph nodes = 1574
  724. llama_context: graph splits = 136 (with bs=512), 3 (with bs=1)
  725. time=2025-07-19T17:47:34.234+02:00 level=INFO source=server.go:637 msg="llama runner started in 15.53 seconds"
  726. [GIN] 2025/07/19 - 17:47:44 | 200 | 25.4584598s | 83.77.231.178 | POST "/api/generate"
  727. [GIN] 2025/07/19 - 17:47:48 | 200 | 4.2552416s | 83.77.231.178 | POST "/api/generate"
  728. [GIN] 2025/07/19 - 17:47:58 | 200 | 2.8626601s | 83.77.231.178 | POST "/api/generate"
  729. time=2025-07-19T17:48:02.545+02:00 level=INFO source=server.go:135 msg="system memory" total="47.1 GiB" free="32.7 GiB" free_swap="29.5 GiB"
  730. time=2025-07-19T17:48:02.546+02:00 level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=33 layers.offload=21 layers.split="" memory.available="[18.6 GiB]" memory.gpu_overhead="0 B" memory.required.full="26.9 GiB" memory.required.partial="18.3 GiB" memory.required.kv="294.5 MiB" memory.required.allocations="[18.3 GiB]" memory.weights.total="24.6 GiB" memory.weights.repeating="24.5 GiB" memory.weights.nonrepeating="102.6 MiB" memory.graph.full="183.9 MiB" memory.graph.partial="826.6 MiB"
  731. llama_model_loader: loaded meta data with 40 key-value pairs and 323 tensors from C:\Users\Haldi\.ollama\models\blobs\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 (version GGUF V3 (latest))
  732. llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
  733. llama_model_loader: - kv 0: general.architecture str = llama
  734. llama_model_loader: - kv 1: general.type str = model
  735. llama_model_loader: - kv 2: general.name str = Mixtral 8x7B Instruct v0.1
  736. llama_model_loader: - kv 3: general.version str = v0.1
  737. llama_model_loader: - kv 4: general.finetune str = Instruct
  738. llama_model_loader: - kv 5: general.basename str = Mixtral
  739. llama_model_loader: - kv 6: general.size_label str = 8x7B
  740. llama_model_loader: - kv 7: general.license str = apache-2.0
  741. llama_model_loader: - kv 8: general.base_model.count u32 = 1
  742. llama_model_loader: - kv 9: general.base_model.0.name str = Mixtral 8x7B v0.1
  743. llama_model_loader: - kv 10: general.base_model.0.version str = v0.1
  744. llama_model_loader: - kv 11: general.base_model.0.organization str = Mistralai
  745. llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/mistralai/Mixt...
  746. llama_model_loader: - kv 13: general.languages arr[str,5] = ["fr", "it", "de", "es", "en"]
  747. llama_model_loader: - kv 14: llama.block_count u32 = 32
  748. llama_model_loader: - kv 15: llama.context_length u32 = 32768
  749. llama_model_loader: - kv 16: llama.embedding_length u32 = 4096
  750. llama_model_loader: - kv 17: llama.feed_forward_length u32 = 14336
  751. llama_model_loader: - kv 18: llama.attention.head_count u32 = 32
  752. llama_model_loader: - kv 19: llama.attention.head_count_kv u32 = 8
  753. llama_model_loader: - kv 20: llama.rope.freq_base f32 = 1000000.000000
  754. llama_model_loader: - kv 21: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
  755. llama_model_loader: - kv 22: llama.expert_count u32 = 8
  756. llama_model_loader: - kv 23: llama.expert_used_count u32 = 2
  757. llama_model_loader: - kv 24: general.file_type u32 = 2
  758. llama_model_loader: - kv 25: llama.vocab_size u32 = 32000
  759. llama_model_loader: - kv 26: llama.rope.dimension_count u32 = 128
  760. llama_model_loader: - kv 27: tokenizer.ggml.model str = llama
  761. llama_model_loader: - kv 28: tokenizer.ggml.pre str = default
  762. llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
  763. llama_model_loader: - kv 30: tokenizer.ggml.scores arr[f32,32000] = [-1000.000000, -1000.000000, -1000.00...
  764. llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,32000] = [3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
  765. llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 1
  766. llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 2
  767. llama_model_loader: - kv 34: tokenizer.ggml.unknown_token_id u32 = 0
  768. llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = true
  769. llama_model_loader: - kv 36: tokenizer.ggml.add_eos_token bool = false
  770. llama_model_loader: - kv 37: tokenizer.chat_template str = {%- if messages[0]['role'] == 'system...
  771. llama_model_loader: - kv 38: tokenizer.ggml.add_space_prefix bool = false
  772. llama_model_loader: - kv 39: general.quantization_version u32 = 2
  773. llama_model_loader: - type f32: 97 tensors
  774. llama_model_loader: - type q4_0: 161 tensors
  775. llama_model_loader: - type q8_0: 64 tensors
  776. llama_model_loader: - type q6_K: 1 tensors
  777. print_info: file format = GGUF V3 (latest)
  778. print_info: file type = Q4_0
  779. print_info: file size = 24.63 GiB (4.53 BPW)
  780. load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
  781. load: special tokens cache size = 3
  782. load: token to piece cache size = 0.1637 MB
  783. print_info: arch = llama
  784. print_info: vocab_only = 1
  785. print_info: model type = ?B
  786. print_info: model params = 46.70 B
  787. print_info: general.name = Mixtral 8x7B Instruct v0.1
  788. print_info: vocab type = SPM
  789. print_info: n_vocab = 32000
  790. print_info: n_merges = 0
  791. print_info: BOS token = 1 '<s>'
  792. print_info: EOS token = 2 '</s>'
  793. print_info: UNK token = 0 '<unk>'
  794. print_info: LF token = 13 '<0x0A>'
  795. print_info: EOG token = 2 '</s>'
  796. print_info: max token length = 48
  797. llama_model_load: vocab only - skipping tensors
  798. time=2025-07-19T17:48:02.579+02:00 level=INFO source=server.go:438 msg="starting llama server" cmd="C:\\Users\\Haldi\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --model C:\\Users\\Haldi\\.ollama\\models\\blobs\\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 --ctx-size 2356 --batch-size 512 --n-gpu-layers 21 --threads 16 --no-mmap --parallel 1 --port 52531"
  799. time=2025-07-19T17:48:02.583+02:00 level=INFO source=sched.go:483 msg="loaded runners" count=1
  800. time=2025-07-19T17:48:02.583+02:00 level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
  801. time=2025-07-19T17:48:02.584+02:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server error"
  802. time=2025-07-19T17:48:02.665+02:00 level=INFO source=runner.go:815 msg="starting go runner"
  803. ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
  804. ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
  805. ggml_cuda_init: found 1 CUDA devices:
  806. Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  807. load_backend: loaded CUDA backend from C:\Users\Haldi\AppData\Local\Programs\Ollama\lib\ollama\ggml-cuda.dll
  808. load_backend: loaded CPU backend from C:\Users\Haldi\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-icelake.dll
  809. time=2025-07-19T17:48:02.746+02:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
  810. time=2025-07-19T17:48:02.747+02:00 level=INFO source=runner.go:874 msg="Server listening on 127.0.0.1:52531"
  811. llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4090) - 22994 MiB free
  812. llama_model_loader: loaded meta data with 40 key-value pairs and 323 tensors from C:\Users\Haldi\.ollama\models\blobs\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 (version GGUF V3 (latest))
  813. llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
  814. llama_model_loader: - kv 0: general.architecture str = llama
  815. llama_model_loader: - kv 1: general.type str = model
  816. llama_model_loader: - kv 2: general.name str = Mixtral 8x7B Instruct v0.1
  817. llama_model_loader: - kv 3: general.version str = v0.1
  818. llama_model_loader: - kv 4: general.finetune str = Instruct
  819. llama_model_loader: - kv 5: general.basename str = Mixtral
  820. llama_model_loader: - kv 6: general.size_label str = 8x7B
  821. llama_model_loader: - kv 7: general.license str = apache-2.0
  822. llama_model_loader: - kv 8: general.base_model.count u32 = 1
  823. llama_model_loader: - kv 9: general.base_model.0.name str = Mixtral 8x7B v0.1
  824. llama_model_loader: - kv 10: general.base_model.0.version str = v0.1
  825. llama_model_loader: - kv 11: general.base_model.0.organization str = Mistralai
  826. llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/mistralai/Mixt...
  827. llama_model_loader: - kv 13: general.languages arr[str,5] = ["fr", "it", "de", "es", "en"]
  828. llama_model_loader: - kv 14: llama.block_count u32 = 32
  829. llama_model_loader: - kv 15: llama.context_length u32 = 32768
  830. llama_model_loader: - kv 16: llama.embedding_length u32 = 4096
  831. llama_model_loader: - kv 17: llama.feed_forward_length u32 = 14336
  832. llama_model_loader: - kv 18: llama.attention.head_count u32 = 32
  833. llama_model_loader: - kv 19: llama.attention.head_count_kv u32 = 8
  834. llama_model_loader: - kv 20: llama.rope.freq_base f32 = 1000000.000000
  835. llama_model_loader: - kv 21: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
  836. llama_model_loader: - kv 22: llama.expert_count u32 = 8
  837. llama_model_loader: - kv 23: llama.expert_used_count u32 = 2
  838. llama_model_loader: - kv 24: general.file_type u32 = 2
  839. llama_model_loader: - kv 25: llama.vocab_size u32 = 32000
  840. llama_model_loader: - kv 26: llama.rope.dimension_count u32 = 128
  841. llama_model_loader: - kv 27: tokenizer.ggml.model str = llama
  842. llama_model_loader: - kv 28: tokenizer.ggml.pre str = default
  843. llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
  844. llama_model_loader: - kv 30: tokenizer.ggml.scores arr[f32,32000] = [-1000.000000, -1000.000000, -1000.00...
  845. llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,32000] = [3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
  846. llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 1
  847. llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 2
  848. llama_model_loader: - kv 34: tokenizer.ggml.unknown_token_id u32 = 0
  849. llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = true
  850. llama_model_loader: - kv 36: tokenizer.ggml.add_eos_token bool = false
  851. llama_model_loader: - kv 37: tokenizer.chat_template str = {%- if messages[0]['role'] == 'system...
  852. llama_model_loader: - kv 38: tokenizer.ggml.add_space_prefix bool = false
  853. llama_model_loader: - kv 39: general.quantization_version u32 = 2
  854. llama_model_loader: - type f32: 97 tensors
  855. llama_model_loader: - type q4_0: 161 tensors
  856. llama_model_loader: - type q8_0: 64 tensors
  857. llama_model_loader: - type q6_K: 1 tensors
  858. print_info: file format = GGUF V3 (latest)
  859. print_info: file type = Q4_0
  860. print_info: file size = 24.63 GiB (4.53 BPW)
  861. time=2025-07-19T17:48:02.835+02:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
  862. load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
  863. load: special tokens cache size = 3
  864. load: token to piece cache size = 0.1637 MB
  865. print_info: arch = llama
  866. print_info: vocab_only = 0
  867. print_info: n_ctx_train = 32768
  868. print_info: n_embd = 4096
  869. print_info: n_layer = 32
  870. print_info: n_head = 32
  871. print_info: n_head_kv = 8
  872. print_info: n_rot = 128
  873. print_info: n_swa = 0
  874. print_info: n_swa_pattern = 1
  875. print_info: n_embd_head_k = 128
  876. print_info: n_embd_head_v = 128
  877. print_info: n_gqa = 4
  878. print_info: n_embd_k_gqa = 1024
  879. print_info: n_embd_v_gqa = 1024
  880. print_info: f_norm_eps = 0.0e+00
  881. print_info: f_norm_rms_eps = 1.0e-05
  882. print_info: f_clamp_kqv = 0.0e+00
  883. print_info: f_max_alibi_bias = 0.0e+00
  884. print_info: f_logit_scale = 0.0e+00
  885. print_info: f_attn_scale = 0.0e+00
  886. print_info: n_ff = 14336
  887. print_info: n_expert = 8
  888. print_info: n_expert_used = 2
  889. print_info: causal attn = 1
  890. print_info: pooling type = 0
  891. print_info: rope type = 0
  892. print_info: rope scaling = linear
  893. print_info: freq_base_train = 1000000.0
  894. print_info: freq_scale_train = 1
  895. print_info: n_ctx_orig_yarn = 32768
  896. print_info: rope_finetuned = unknown
  897. print_info: ssm_d_conv = 0
  898. print_info: ssm_d_inner = 0
  899. print_info: ssm_d_state = 0
  900. print_info: ssm_dt_rank = 0
  901. print_info: ssm_dt_b_c_rms = 0
  902. print_info: model type = 8x7B
  903. print_info: model params = 46.70 B
  904. print_info: general.name = Mixtral 8x7B Instruct v0.1
  905. print_info: vocab type = SPM
  906. print_info: n_vocab = 32000
  907. print_info: n_merges = 0
  908. print_info: BOS token = 1 '<s>'
  909. print_info: EOS token = 2 '</s>'
  910. print_info: UNK token = 0 '<unk>'
  911. print_info: LF token = 13 '<0x0A>'
  912. print_info: EOG token = 2 '</s>'
  913. print_info: max token length = 48
  914. load_tensors: loading model tensors, this can take a while... (mmap = false)
  915. load_tensors: offloading 21 repeating layers to GPU
  916. load_tensors: offloaded 21/33 layers to GPU
  917. load_tensors: CUDA_Host model buffer size = 8782.09 MiB
  918. load_tensors: CUDA0 model buffer size = 16435.78 MiB
  919. llama_context: constructing llama_context
  920. llama_context: n_seq_max = 1
  921. llama_context: n_ctx = 2356
  922. llama_context: n_ctx_per_seq = 2356
  923. llama_context: n_batch = 512
  924. llama_context: n_ubatch = 512
  925. llama_context: causal_attn = 1
  926. llama_context: flash_attn = 0
  927. llama_context: freq_base = 1000000.0
  928. llama_context: freq_scale = 1
  929. llama_context: n_ctx_per_seq (2356) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
  930. llama_context: CPU output buffer size = 0.14 MiB
  931. llama_kv_cache_unified: kv_size = 2368, type_k = 'f16', type_v = 'f16', n_layer = 32, can_shift = 1, padding = 32
  932. llama_kv_cache_unified: CUDA0 KV buffer size = 194.25 MiB
  933. llama_kv_cache_unified: CPU KV buffer size = 101.75 MiB
  934. llama_kv_cache_unified: KV self size = 296.00 MiB, K (f16): 148.00 MiB, V (f16): 148.00 MiB
  935. llama_context: CUDA0 compute buffer size = 393.63 MiB
  936. llama_context: CUDA_Host compute buffer size = 12.63 MiB
  937. llama_context: graph nodes = 1574
  938. llama_context: graph splits = 136 (with bs=512), 3 (with bs=1)
  939. time=2025-07-19T17:48:18.367+02:00 level=INFO source=server.go:637 msg="llama runner started in 15.78 seconds"
  940. time=2025-07-19T17:48:18.374+02:00 level=WARN source=runner.go:128 msg="truncating input prompt" limit=2356 prompt=2713 keep=5 new=2356
  941. [GIN] 2025/07/19 - 17:48:35 | 200 | 33.2707893s | 83.77.231.178 | POST "/api/generate"
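The WARN line above shows a 2713-token prompt being truncated to the runner's --ctx-size of 2356. A client can ask for a larger window per call through the options.num_ctx field of /api/generate; when the value changes, the scheduler reloads the runner with a matching --ctx-size, which is consistent with the varying --ctx-size values seen across this log. A minimal sketch in Python, assuming the local endpoint from this log; the model tag and prompt are placeholders, not taken from the log:

# Minimal sketch: request a larger context so long prompts are not truncated.
# The model tag and prompt below are placeholders, not taken from this log.
import requests

resp = requests.post(
    "http://127.0.0.1:11434/api/generate",
    json={
        "model": "mixtral:8x7b-instruct-v0.1-q4_0",  # placeholder tag
        "prompt": "...",                              # the long prompt
        "stream": False,
        "options": {"num_ctx": 8192},                 # ask for an 8k window
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["response"])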
  942. time=2025-07-19T17:48:38.930+02:00 level=INFO source=server.go:135 msg="system memory" total="47.1 GiB" free="32.6 GiB" free_swap="29.3 GiB"
  943. time=2025-07-19T17:48:38.930+02:00 level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=33 layers.offload=21 layers.split="" memory.available="[18.6 GiB]" memory.gpu_overhead="0 B" memory.required.full="26.8 GiB" memory.required.partial="18.2 GiB" memory.required.kv="256.0 MiB" memory.required.allocations="[18.2 GiB]" memory.weights.total="24.6 GiB" memory.weights.repeating="24.5 GiB" memory.weights.nonrepeating="102.6 MiB" memory.graph.full="164.0 MiB" memory.graph.partial="826.0 MiB"
  944. llama_model_loader: loaded meta data with 40 key-value pairs and 323 tensors from C:\Users\Haldi\.ollama\models\blobs\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 (version GGUF V3 (latest))
  945. llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
  946. llama_model_loader: - kv 0: general.architecture str = llama
  947. llama_model_loader: - kv 1: general.type str = model
  948. llama_model_loader: - kv 2: general.name str = Mixtral 8x7B Instruct v0.1
  949. llama_model_loader: - kv 3: general.version str = v0.1
  950. llama_model_loader: - kv 4: general.finetune str = Instruct
  951. llama_model_loader: - kv 5: general.basename str = Mixtral
  952. llama_model_loader: - kv 6: general.size_label str = 8x7B
  953. llama_model_loader: - kv 7: general.license str = apache-2.0
  954. llama_model_loader: - kv 8: general.base_model.count u32 = 1
  955. llama_model_loader: - kv 9: general.base_model.0.name str = Mixtral 8x7B v0.1
  956. llama_model_loader: - kv 10: general.base_model.0.version str = v0.1
  957. llama_model_loader: - kv 11: general.base_model.0.organization str = Mistralai
  958. llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/mistralai/Mixt...
  959. llama_model_loader: - kv 13: general.languages arr[str,5] = ["fr", "it", "de", "es", "en"]
  960. llama_model_loader: - kv 14: llama.block_count u32 = 32
  961. llama_model_loader: - kv 15: llama.context_length u32 = 32768
  962. llama_model_loader: - kv 16: llama.embedding_length u32 = 4096
  963. llama_model_loader: - kv 17: llama.feed_forward_length u32 = 14336
  964. llama_model_loader: - kv 18: llama.attention.head_count u32 = 32
  965. llama_model_loader: - kv 19: llama.attention.head_count_kv u32 = 8
  966. llama_model_loader: - kv 20: llama.rope.freq_base f32 = 1000000.000000
  967. llama_model_loader: - kv 21: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
  968. llama_model_loader: - kv 22: llama.expert_count u32 = 8
  969. llama_model_loader: - kv 23: llama.expert_used_count u32 = 2
  970. llama_model_loader: - kv 24: general.file_type u32 = 2
  971. llama_model_loader: - kv 25: llama.vocab_size u32 = 32000
  972. llama_model_loader: - kv 26: llama.rope.dimension_count u32 = 128
  973. llama_model_loader: - kv 27: tokenizer.ggml.model str = llama
  974. llama_model_loader: - kv 28: tokenizer.ggml.pre str = default
  975. llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
  976. llama_model_loader: - kv 30: tokenizer.ggml.scores arr[f32,32000] = [-1000.000000, -1000.000000, -1000.00...
  977. llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,32000] = [3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
  978. llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 1
  979. llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 2
  980. llama_model_loader: - kv 34: tokenizer.ggml.unknown_token_id u32 = 0
  981. llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = true
  982. llama_model_loader: - kv 36: tokenizer.ggml.add_eos_token bool = false
  983. llama_model_loader: - kv 37: tokenizer.chat_template str = {%- if messages[0]['role'] == 'system...
  984. llama_model_loader: - kv 38: tokenizer.ggml.add_space_prefix bool = false
  985. llama_model_loader: - kv 39: general.quantization_version u32 = 2
  986. llama_model_loader: - type f32: 97 tensors
  987. llama_model_loader: - type q4_0: 161 tensors
  988. llama_model_loader: - type q8_0: 64 tensors
  989. llama_model_loader: - type q6_K: 1 tensors
  990. print_info: file format = GGUF V3 (latest)
  991. print_info: file type = Q4_0
  992. print_info: file size = 24.63 GiB (4.53 BPW)
  993. load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
  994. load: special tokens cache size = 3
  995. load: token to piece cache size = 0.1637 MB
  996. print_info: arch = llama
  997. print_info: vocab_only = 1
  998. print_info: model type = ?B
  999. print_info: model params = 46.70 B
  1000. print_info: general.name = Mixtral 8x7B Instruct v0.1
  1001. print_info: vocab type = SPM
  1002. print_info: n_vocab = 32000
  1003. print_info: n_merges = 0
  1004. print_info: BOS token = 1 '<s>'
  1005. print_info: EOS token = 2 '</s>'
  1006. print_info: UNK token = 0 '<unk>'
  1007. print_info: LF token = 13 '<0x0A>'
  1008. print_info: EOG token = 2 '</s>'
  1009. print_info: max token length = 48
  1010. llama_model_load: vocab only - skipping tensors
  1011. time=2025-07-19T17:48:38.955+02:00 level=INFO source=server.go:438 msg="starting llama server" cmd="C:\\Users\\Haldi\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --model C:\\Users\\Haldi\\.ollama\\models\\blobs\\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 --ctx-size 1849 --batch-size 512 --n-gpu-layers 21 --threads 16 --no-mmap --parallel 1 --port 52542"
  1012. time=2025-07-19T17:48:38.958+02:00 level=INFO source=sched.go:483 msg="loaded runners" count=1
  1013. time=2025-07-19T17:48:38.958+02:00 level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
  1014. time=2025-07-19T17:48:38.958+02:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server error"
  1015. time=2025-07-19T17:48:39.004+02:00 level=INFO source=runner.go:815 msg="starting go runner"
  1016. ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
  1017. ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
  1018. ggml_cuda_init: found 1 CUDA devices:
  1019. Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  1020. load_backend: loaded CUDA backend from C:\Users\Haldi\AppData\Local\Programs\Ollama\lib\ollama\ggml-cuda.dll
  1021. load_backend: loaded CPU backend from C:\Users\Haldi\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-icelake.dll
  1022. time=2025-07-19T17:48:39.081+02:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
  1023. time=2025-07-19T17:48:39.081+02:00 level=INFO source=runner.go:874 msg="Server listening on 127.0.0.1:52542"
  1024. llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4090) - 22994 MiB free
  1025. llama_model_loader: loaded meta data with 40 key-value pairs and 323 tensors from C:\Users\Haldi\.ollama\models\blobs\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 (version GGUF V3 (latest))
  1026. llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
  1027. llama_model_loader: - kv 0: general.architecture str = llama
  1028. llama_model_loader: - kv 1: general.type str = model
  1029. llama_model_loader: - kv 2: general.name str = Mixtral 8x7B Instruct v0.1
  1030. llama_model_loader: - kv 3: general.version str = v0.1
  1031. llama_model_loader: - kv 4: general.finetune str = Instruct
  1032. llama_model_loader: - kv 5: general.basename str = Mixtral
  1033. llama_model_loader: - kv 6: general.size_label str = 8x7B
  1034. llama_model_loader: - kv 7: general.license str = apache-2.0
  1035. llama_model_loader: - kv 8: general.base_model.count u32 = 1
  1036. llama_model_loader: - kv 9: general.base_model.0.name str = Mixtral 8x7B v0.1
  1037. llama_model_loader: - kv 10: general.base_model.0.version str = v0.1
  1038. llama_model_loader: - kv 11: general.base_model.0.organization str = Mistralai
  1039. llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/mistralai/Mixt...
  1040. llama_model_loader: - kv 13: general.languages arr[str,5] = ["fr", "it", "de", "es", "en"]
  1041. llama_model_loader: - kv 14: llama.block_count u32 = 32
  1042. llama_model_loader: - kv 15: llama.context_length u32 = 32768
  1043. llama_model_loader: - kv 16: llama.embedding_length u32 = 4096
  1044. llama_model_loader: - kv 17: llama.feed_forward_length u32 = 14336
  1045. llama_model_loader: - kv 18: llama.attention.head_count u32 = 32
  1046. llama_model_loader: - kv 19: llama.attention.head_count_kv u32 = 8
  1047. llama_model_loader: - kv 20: llama.rope.freq_base f32 = 1000000.000000
  1048. llama_model_loader: - kv 21: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
  1049. llama_model_loader: - kv 22: llama.expert_count u32 = 8
  1050. llama_model_loader: - kv 23: llama.expert_used_count u32 = 2
  1051. llama_model_loader: - kv 24: general.file_type u32 = 2
  1052. llama_model_loader: - kv 25: llama.vocab_size u32 = 32000
  1053. llama_model_loader: - kv 26: llama.rope.dimension_count u32 = 128
  1054. llama_model_loader: - kv 27: tokenizer.ggml.model str = llama
  1055. llama_model_loader: - kv 28: tokenizer.ggml.pre str = default
  1056. llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
  1057. llama_model_loader: - kv 30: tokenizer.ggml.scores arr[f32,32000] = [-1000.000000, -1000.000000, -1000.00...
  1058. llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,32000] = [3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
  1059. llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 1
  1060. llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 2
  1061. llama_model_loader: - kv 34: tokenizer.ggml.unknown_token_id u32 = 0
  1062. llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = true
  1063. llama_model_loader: - kv 36: tokenizer.ggml.add_eos_token bool = false
  1064. llama_model_loader: - kv 37: tokenizer.chat_template str = {%- if messages[0]['role'] == 'system...
  1065. llama_model_loader: - kv 38: tokenizer.ggml.add_space_prefix bool = false
  1066. llama_model_loader: - kv 39: general.quantization_version u32 = 2
  1067. llama_model_loader: - type f32: 97 tensors
  1068. llama_model_loader: - type q4_0: 161 tensors
  1069. llama_model_loader: - type q8_0: 64 tensors
  1070. llama_model_loader: - type q6_K: 1 tensors
  1071. print_info: file format = GGUF V3 (latest)
  1072. print_info: file type = Q4_0
  1073. print_info: file size = 24.63 GiB (4.53 BPW)
  1074. load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
  1075. load: special tokens cache size = 3
  1076. load: token to piece cache size = 0.1637 MB
  1077. print_info: arch = llama
  1078. print_info: vocab_only = 0
  1079. print_info: n_ctx_train = 32768
  1080. print_info: n_embd = 4096
  1081. print_info: n_layer = 32
  1082. print_info: n_head = 32
  1083. print_info: n_head_kv = 8
  1084. print_info: n_rot = 128
  1085. print_info: n_swa = 0
  1086. print_info: n_swa_pattern = 1
  1087. print_info: n_embd_head_k = 128
  1088. print_info: n_embd_head_v = 128
  1089. print_info: n_gqa = 4
  1090. print_info: n_embd_k_gqa = 1024
  1091. print_info: n_embd_v_gqa = 1024
  1092. print_info: f_norm_eps = 0.0e+00
  1093. print_info: f_norm_rms_eps = 1.0e-05
  1094. print_info: f_clamp_kqv = 0.0e+00
  1095. print_info: f_max_alibi_bias = 0.0e+00
  1096. print_info: f_logit_scale = 0.0e+00
  1097. print_info: f_attn_scale = 0.0e+00
  1098. print_info: n_ff = 14336
  1099. print_info: n_expert = 8
  1100. print_info: n_expert_used = 2
  1101. print_info: causal attn = 1
  1102. print_info: pooling type = 0
  1103. print_info: rope type = 0
  1104. print_info: rope scaling = linear
  1105. print_info: freq_base_train = 1000000.0
  1106. print_info: freq_scale_train = 1
  1107. print_info: n_ctx_orig_yarn = 32768
  1108. print_info: rope_finetuned = unknown
  1109. print_info: ssm_d_conv = 0
  1110. print_info: ssm_d_inner = 0
  1111. print_info: ssm_d_state = 0
  1112. print_info: ssm_dt_rank = 0
  1113. print_info: ssm_dt_b_c_rms = 0
  1114. print_info: model type = 8x7B
  1115. print_info: model params = 46.70 B
  1116. print_info: general.name = Mixtral 8x7B Instruct v0.1
  1117. print_info: vocab type = SPM
  1118. print_info: n_vocab = 32000
  1119. print_info: n_merges = 0
  1120. print_info: BOS token = 1 '<s>'
  1121. print_info: EOS token = 2 '</s>'
  1122. print_info: UNK token = 0 '<unk>'
  1123. print_info: LF token = 13 '<0x0A>'
  1124. print_info: EOG token = 2 '</s>'
  1125. print_info: max token length = 48
  1126. load_tensors: loading model tensors, this can take a while... (mmap = false)
  1127. time=2025-07-19T17:48:39.208+02:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
  1128. load_tensors: offloading 21 repeating layers to GPU
  1129. load_tensors: offloaded 21/33 layers to GPU
  1130. load_tensors: CUDA_Host model buffer size = 8782.09 MiB
  1131. load_tensors: CUDA0 model buffer size = 16435.78 MiB
  1132. llama_context: constructing llama_context
  1133. llama_context: n_seq_max = 1
  1134. llama_context: n_ctx = 1849
  1135. llama_context: n_ctx_per_seq = 1849
  1136. llama_context: n_batch = 512
  1137. llama_context: n_ubatch = 512
  1138. llama_context: causal_attn = 1
  1139. llama_context: flash_attn = 0
  1140. llama_context: freq_base = 1000000.0
  1141. llama_context: freq_scale = 1
  1142. llama_context: n_ctx_per_seq (1849) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
  1143. llama_context: CPU output buffer size = 0.14 MiB
  1144. llama_kv_cache_unified: kv_size = 1856, type_k = 'f16', type_v = 'f16', n_layer = 32, can_shift = 1, padding = 32
  1145. llama_kv_cache_unified: CUDA0 KV buffer size = 152.25 MiB
  1146. llama_kv_cache_unified: CPU KV buffer size = 79.75 MiB
  1147. llama_kv_cache_unified: KV self size = 232.00 MiB, K (f16): 116.00 MiB, V (f16): 116.00 MiB
  1148. llama_context: CUDA0 compute buffer size = 405.00 MiB
  1149. llama_context: CUDA_Host compute buffer size = 11.63 MiB
  1150. llama_context: graph nodes = 1574
  1151. llama_context: graph splits = 136 (with bs=512), 3 (with bs=1)
  1152. time=2025-07-19T17:48:54.738+02:00 level=INFO source=server.go:637 msg="llama runner started in 15.78 seconds"
  1153. [GIN] 2025/07/19 - 17:49:01 | 200 | 23.5846212s | 83.77.231.178 | POST "/api/generate"
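The load_tensors lines above report 21 of 33 layers offloaded, with 16435.78 MiB in the CUDA0 model buffer and 8782.09 MiB kept on the host. One way to confirm the split at runtime is the /api/ps endpoint (already queried by the client earlier in this log), which reports size and size_vram per loaded model. A minimal sketch, assuming the local endpoint:

# Minimal sketch: report how much of each loaded model is resident in VRAM,
# mirroring the partial GPU offload shown by load_tensors above.
import requests

resp = requests.get("http://127.0.0.1:11434/api/ps", timeout=10)
resp.raise_for_status()

for m in resp.json().get("models", []):
    size = m.get("size", 0)        # total bytes occupied by the model
    vram = m.get("size_vram", 0)   # bytes resident on the GPU
    if size:
        print(f"{m['name']}: {vram / size:.0%} in VRAM "
              f"({vram / 2**30:.1f} GiB of {size / 2**30:.1f} GiB)")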
  1154. time=2025-07-19T17:49:06.396+02:00 level=INFO source=server.go:135 msg="system memory" total="47.1 GiB" free="32.6 GiB" free_swap="29.4 GiB"
  1155. time=2025-07-19T17:49:06.396+02:00 level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=33 layers.offload=21 layers.split="" memory.available="[18.6 GiB]" memory.gpu_overhead="0 B" memory.required.full="26.9 GiB" memory.required.partial="18.3 GiB" memory.required.kv="300.9 MiB" memory.required.allocations="[18.3 GiB]" memory.weights.total="24.6 GiB" memory.weights.repeating="24.5 GiB" memory.weights.nonrepeating="102.6 MiB" memory.graph.full="187.1 MiB" memory.graph.partial="826.7 MiB"
  1156. llama_model_loader: loaded meta data with 40 key-value pairs and 323 tensors from C:\Users\Haldi\.ollama\models\blobs\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 (version GGUF V3 (latest))
  1157. llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
  1158. llama_model_loader: - kv 0: general.architecture str = llama
  1159. llama_model_loader: - kv 1: general.type str = model
  1160. llama_model_loader: - kv 2: general.name str = Mixtral 8x7B Instruct v0.1
  1161. llama_model_loader: - kv 3: general.version str = v0.1
  1162. llama_model_loader: - kv 4: general.finetune str = Instruct
  1163. llama_model_loader: - kv 5: general.basename str = Mixtral
  1164. llama_model_loader: - kv 6: general.size_label str = 8x7B
  1165. llama_model_loader: - kv 7: general.license str = apache-2.0
  1166. llama_model_loader: - kv 8: general.base_model.count u32 = 1
  1167. llama_model_loader: - kv 9: general.base_model.0.name str = Mixtral 8x7B v0.1
  1168. llama_model_loader: - kv 10: general.base_model.0.version str = v0.1
  1169. llama_model_loader: - kv 11: general.base_model.0.organization str = Mistralai
  1170. llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/mistralai/Mixt...
  1171. llama_model_loader: - kv 13: general.languages arr[str,5] = ["fr", "it", "de", "es", "en"]
  1172. llama_model_loader: - kv 14: llama.block_count u32 = 32
  1173. llama_model_loader: - kv 15: llama.context_length u32 = 32768
  1174. llama_model_loader: - kv 16: llama.embedding_length u32 = 4096
  1175. llama_model_loader: - kv 17: llama.feed_forward_length u32 = 14336
  1176. llama_model_loader: - kv 18: llama.attention.head_count u32 = 32
  1177. llama_model_loader: - kv 19: llama.attention.head_count_kv u32 = 8
  1178. llama_model_loader: - kv 20: llama.rope.freq_base f32 = 1000000.000000
  1179. llama_model_loader: - kv 21: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
  1180. llama_model_loader: - kv 22: llama.expert_count u32 = 8
  1181. llama_model_loader: - kv 23: llama.expert_used_count u32 = 2
  1182. llama_model_loader: - kv 24: general.file_type u32 = 2
  1183. llama_model_loader: - kv 25: llama.vocab_size u32 = 32000
  1184. llama_model_loader: - kv 26: llama.rope.dimension_count u32 = 128
  1185. llama_model_loader: - kv 27: tokenizer.ggml.model str = llama
  1186. llama_model_loader: - kv 28: tokenizer.ggml.pre str = default
  1187. llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
  1188. llama_model_loader: - kv 30: tokenizer.ggml.scores arr[f32,32000] = [-1000.000000, -1000.000000, -1000.00...
  1189. llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,32000] = [3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
  1190. llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 1
  1191. llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 2
  1192. llama_model_loader: - kv 34: tokenizer.ggml.unknown_token_id u32 = 0
  1193. llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = true
  1194. llama_model_loader: - kv 36: tokenizer.ggml.add_eos_token bool = false
  1195. llama_model_loader: - kv 37: tokenizer.chat_template str = {%- if messages[0]['role'] == 'system...
  1196. llama_model_loader: - kv 38: tokenizer.ggml.add_space_prefix bool = false
  1197. llama_model_loader: - kv 39: general.quantization_version u32 = 2
  1198. llama_model_loader: - type f32: 97 tensors
  1199. llama_model_loader: - type q4_0: 161 tensors
  1200. llama_model_loader: - type q8_0: 64 tensors
  1201. llama_model_loader: - type q6_K: 1 tensors
  1202. print_info: file format = GGUF V3 (latest)
  1203. print_info: file type = Q4_0
  1204. print_info: file size = 24.63 GiB (4.53 BPW)
  1205. load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
  1206. load: special tokens cache size = 3
  1207. load: token to piece cache size = 0.1637 MB
  1208. print_info: arch = llama
  1209. print_info: vocab_only = 1
  1210. print_info: model type = ?B
  1211. print_info: model params = 46.70 B
  1212. print_info: general.name = Mixtral 8x7B Instruct v0.1
  1213. print_info: vocab type = SPM
  1214. print_info: n_vocab = 32000
  1215. print_info: n_merges = 0
  1216. print_info: BOS token = 1 '<s>'
  1217. print_info: EOS token = 2 '</s>'
  1218. print_info: UNK token = 0 '<unk>'
  1219. print_info: LF token = 13 '<0x0A>'
  1220. print_info: EOG token = 2 '</s>'
  1221. print_info: max token length = 48
  1222. llama_model_load: vocab only - skipping tensors
  1223. time=2025-07-19T17:49:06.421+02:00 level=INFO source=server.go:438 msg="starting llama server" cmd="C:\\Users\\Haldi\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --model C:\\Users\\Haldi\\.ollama\\models\\blobs\\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 --ctx-size 2407 --batch-size 512 --n-gpu-layers 21 --threads 16 --no-mmap --parallel 1 --port 52548"
  1224. time=2025-07-19T17:49:06.424+02:00 level=INFO source=sched.go:483 msg="loaded runners" count=1
  1225. time=2025-07-19T17:49:06.424+02:00 level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
  1226. time=2025-07-19T17:49:06.424+02:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server error"
  1227. time=2025-07-19T17:49:06.466+02:00 level=INFO source=runner.go:815 msg="starting go runner"
  1228. ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
  1229. ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
  1230. ggml_cuda_init: found 1 CUDA devices:
  1231. Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  1232. load_backend: loaded CUDA backend from C:\Users\Haldi\AppData\Local\Programs\Ollama\lib\ollama\ggml-cuda.dll
  1233. load_backend: loaded CPU backend from C:\Users\Haldi\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-icelake.dll
  1234. time=2025-07-19T17:49:06.580+02:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
  1235. time=2025-07-19T17:49:06.580+02:00 level=INFO source=runner.go:874 msg="Server listening on 127.0.0.1:52548"
  1236. llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4090) - 22994 MiB free
  1237. time=2025-07-19T17:49:06.675+02:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
  1238. llama_model_loader: loaded meta data with 40 key-value pairs and 323 tensors from C:\Users\Haldi\.ollama\models\blobs\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 (version GGUF V3 (latest))
  1239. llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
  1240. llama_model_loader: - kv 0: general.architecture str = llama
  1241. llama_model_loader: - kv 1: general.type str = model
  1242. llama_model_loader: - kv 2: general.name str = Mixtral 8x7B Instruct v0.1
  1243. llama_model_loader: - kv 3: general.version str = v0.1
  1244. llama_model_loader: - kv 4: general.finetune str = Instruct
  1245. llama_model_loader: - kv 5: general.basename str = Mixtral
  1246. llama_model_loader: - kv 6: general.size_label str = 8x7B
  1247. llama_model_loader: - kv 7: general.license str = apache-2.0
  1248. llama_model_loader: - kv 8: general.base_model.count u32 = 1
  1249. llama_model_loader: - kv 9: general.base_model.0.name str = Mixtral 8x7B v0.1
  1250. llama_model_loader: - kv 10: general.base_model.0.version str = v0.1
  1251. llama_model_loader: - kv 11: general.base_model.0.organization str = Mistralai
  1252. llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/mistralai/Mixt...
  1253. llama_model_loader: - kv 13: general.languages arr[str,5] = ["fr", "it", "de", "es", "en"]
  1254. llama_model_loader: - kv 14: llama.block_count u32 = 32
  1255. llama_model_loader: - kv 15: llama.context_length u32 = 32768
  1256. llama_model_loader: - kv 16: llama.embedding_length u32 = 4096
  1257. llama_model_loader: - kv 17: llama.feed_forward_length u32 = 14336
  1258. llama_model_loader: - kv 18: llama.attention.head_count u32 = 32
  1259. llama_model_loader: - kv 19: llama.attention.head_count_kv u32 = 8
  1260. llama_model_loader: - kv 20: llama.rope.freq_base f32 = 1000000.000000
  1261. llama_model_loader: - kv 21: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
  1262. llama_model_loader: - kv 22: llama.expert_count u32 = 8
  1263. llama_model_loader: - kv 23: llama.expert_used_count u32 = 2
  1264. llama_model_loader: - kv 24: general.file_type u32 = 2
  1265. llama_model_loader: - kv 25: llama.vocab_size u32 = 32000
  1266. llama_model_loader: - kv 26: llama.rope.dimension_count u32 = 128
  1267. llama_model_loader: - kv 27: tokenizer.ggml.model str = llama
  1268. llama_model_loader: - kv 28: tokenizer.ggml.pre str = default
  1269. llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
  1270. llama_model_loader: - kv 30: tokenizer.ggml.scores arr[f32,32000] = [-1000.000000, -1000.000000, -1000.00...
  1271. llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,32000] = [3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
  1272. llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 1
  1273. llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 2
  1274. llama_model_loader: - kv 34: tokenizer.ggml.unknown_token_id u32 = 0
  1275. llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = true
  1276. llama_model_loader: - kv 36: tokenizer.ggml.add_eos_token bool = false
  1277. llama_model_loader: - kv 37: tokenizer.chat_template str = {%- if messages[0]['role'] == 'system...
  1278. llama_model_loader: - kv 38: tokenizer.ggml.add_space_prefix bool = false
  1279. llama_model_loader: - kv 39: general.quantization_version u32 = 2
  1280. llama_model_loader: - type f32: 97 tensors
  1281. llama_model_loader: - type q4_0: 161 tensors
  1282. llama_model_loader: - type q8_0: 64 tensors
  1283. llama_model_loader: - type q6_K: 1 tensors
  1284. print_info: file format = GGUF V3 (latest)
  1285. print_info: file type = Q4_0
  1286. print_info: file size = 24.63 GiB (4.53 BPW)
  1287. load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
  1288. load: special tokens cache size = 3
  1289. load: token to piece cache size = 0.1637 MB
  1290. print_info: arch = llama
  1291. print_info: vocab_only = 0
  1292. print_info: n_ctx_train = 32768
  1293. print_info: n_embd = 4096
  1294. print_info: n_layer = 32
  1295. print_info: n_head = 32
  1296. print_info: n_head_kv = 8
  1297. print_info: n_rot = 128
  1298. print_info: n_swa = 0
  1299. print_info: n_swa_pattern = 1
  1300. print_info: n_embd_head_k = 128
  1301. print_info: n_embd_head_v = 128
  1302. print_info: n_gqa = 4
  1303. print_info: n_embd_k_gqa = 1024
  1304. print_info: n_embd_v_gqa = 1024
  1305. print_info: f_norm_eps = 0.0e+00
  1306. print_info: f_norm_rms_eps = 1.0e-05
  1307. print_info: f_clamp_kqv = 0.0e+00
  1308. print_info: f_max_alibi_bias = 0.0e+00
  1309. print_info: f_logit_scale = 0.0e+00
  1310. print_info: f_attn_scale = 0.0e+00
  1311. print_info: n_ff = 14336
  1312. print_info: n_expert = 8
  1313. print_info: n_expert_used = 2
  1314. print_info: causal attn = 1
  1315. print_info: pooling type = 0
  1316. print_info: rope type = 0
  1317. print_info: rope scaling = linear
  1318. print_info: freq_base_train = 1000000.0
  1319. print_info: freq_scale_train = 1
  1320. print_info: n_ctx_orig_yarn = 32768
  1321. print_info: rope_finetuned = unknown
  1322. print_info: ssm_d_conv = 0
  1323. print_info: ssm_d_inner = 0
  1324. print_info: ssm_d_state = 0
  1325. print_info: ssm_dt_rank = 0
  1326. print_info: ssm_dt_b_c_rms = 0
  1327. print_info: model type = 8x7B
  1328. print_info: model params = 46.70 B
  1329. print_info: general.name = Mixtral 8x7B Instruct v0.1
  1330. print_info: vocab type = SPM
  1331. print_info: n_vocab = 32000
  1332. print_info: n_merges = 0
  1333. print_info: BOS token = 1 '<s>'
  1334. print_info: EOS token = 2 '</s>'
  1335. print_info: UNK token = 0 '<unk>'
  1336. print_info: LF token = 13 '<0x0A>'
  1337. print_info: EOG token = 2 '</s>'
  1338. print_info: max token length = 48
  1339. load_tensors: loading model tensors, this can take a while... (mmap = false)
  1340. load_tensors: offloading 21 repeating layers to GPU
  1341. load_tensors: offloaded 21/33 layers to GPU
  1342. load_tensors: CUDA_Host model buffer size = 8782.09 MiB
  1343. load_tensors: CUDA0 model buffer size = 16435.78 MiB
  1344. llama_context: constructing llama_context
  1345. llama_context: n_seq_max = 1
  1346. llama_context: n_ctx = 2407
  1347. llama_context: n_ctx_per_seq = 2407
  1348. llama_context: n_batch = 512
  1349. llama_context: n_ubatch = 512
  1350. llama_context: causal_attn = 1
  1351. llama_context: flash_attn = 0
  1352. llama_context: freq_base = 1000000.0
  1353. llama_context: freq_scale = 1
  1354. llama_context: n_ctx_per_seq (2407) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
  1355. llama_context: CPU output buffer size = 0.14 MiB
  1356. llama_kv_cache_unified: kv_size = 2432, type_k = 'f16', type_v = 'f16', n_layer = 32, can_shift = 1, padding = 32
  1357. llama_kv_cache_unified: CUDA0 KV buffer size = 199.50 MiB
  1358. llama_kv_cache_unified: CPU KV buffer size = 104.50 MiB
  1359. llama_kv_cache_unified: KV self size = 304.00 MiB, K (f16): 152.00 MiB, V (f16): 152.00 MiB
  1360. llama_context: CUDA0 compute buffer size = 393.75 MiB
  1361. llama_context: CUDA_Host compute buffer size = 12.76 MiB
  1362. llama_context: graph nodes = 1574
  1363. llama_context: graph splits = 136 (with bs=512), 3 (with bs=1)
  1364. time=2025-07-19T17:49:22.204+02:00 level=INFO source=server.go:637 msg="llama runner started in 15.78 seconds"
  1365. time=2025-07-19T17:49:22.208+02:00 level=WARN source=runner.go:128 msg="truncating input prompt" limit=2407 prompt=2909 keep=5 new=2407
  1366. [GIN] 2025/07/19 - 17:49:36 | 200 | 30.7193165s | 83.77.231.178 | POST "/api/generate"
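The offload line for this cycle reports memory.required.full of 26.9 GiB against 18.6 GiB available, which is why only 21 of 33 layers land on the GPU. A rough back-of-the-envelope check from the logged numbers (not Ollama's exact accounting): the ~24.5 GiB of repeating weights spread over 32 transformer layers gives roughly 0.77 GiB per layer, so about 21 layers fit once the graph, KV slices and non-repeating weights are budgeted.

# Rough estimate only, derived from the figures in the offload line above.
repeating_weights_gib = 24.5   # memory.weights.repeating
n_repeating_layers = 32        # llama.block_count
per_layer_gib = repeating_weights_gib / n_repeating_layers

offloaded = 21                 # layers.offload reported above
print(f"~{per_layer_gib:.2f} GiB/layer; "
      f"{offloaded} layers ≈ {offloaded * per_layer_gib:.1f} GiB on the GPU")
# ~0.77 GiB/layer; 21 layers ≈ 16.1 GiB, consistent with the
# 16435.78 MiB CUDA0 model buffer reported by load_tensors.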
  1367. time=2025-07-19T17:49:40.715+02:00 level=INFO source=server.go:135 msg="system memory" total="47.1 GiB" free="32.5 GiB" free_swap="29.4 GiB"
  1368. time=2025-07-19T17:49:40.715+02:00 level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=33 layers.offload=21 layers.split="" memory.available="[18.6 GiB]" memory.gpu_overhead="0 B" memory.required.full="26.9 GiB" memory.required.partial="18.3 GiB" memory.required.kv="274.4 MiB" memory.required.allocations="[18.3 GiB]" memory.weights.total="24.6 GiB" memory.weights.repeating="24.5 GiB" memory.weights.nonrepeating="102.6 MiB" memory.graph.full="173.5 MiB" memory.graph.partial="826.3 MiB"
  1369. llama_model_loader: loaded meta data with 40 key-value pairs and 323 tensors from C:\Users\Haldi\.ollama\models\blobs\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 (version GGUF V3 (latest))
  1370. llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
  1371. llama_model_loader: - kv 0: general.architecture str = llama
  1372. llama_model_loader: - kv 1: general.type str = model
  1373. llama_model_loader: - kv 2: general.name str = Mixtral 8x7B Instruct v0.1
  1374. llama_model_loader: - kv 3: general.version str = v0.1
  1375. llama_model_loader: - kv 4: general.finetune str = Instruct
  1376. llama_model_loader: - kv 5: general.basename str = Mixtral
  1377. llama_model_loader: - kv 6: general.size_label str = 8x7B
  1378. llama_model_loader: - kv 7: general.license str = apache-2.0
  1379. llama_model_loader: - kv 8: general.base_model.count u32 = 1
  1380. llama_model_loader: - kv 9: general.base_model.0.name str = Mixtral 8x7B v0.1
  1381. llama_model_loader: - kv 10: general.base_model.0.version str = v0.1
  1382. llama_model_loader: - kv 11: general.base_model.0.organization str = Mistralai
  1383. llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/mistralai/Mixt...
  1384. llama_model_loader: - kv 13: general.languages arr[str,5] = ["fr", "it", "de", "es", "en"]
  1385. llama_model_loader: - kv 14: llama.block_count u32 = 32
  1386. llama_model_loader: - kv 15: llama.context_length u32 = 32768
  1387. llama_model_loader: - kv 16: llama.embedding_length u32 = 4096
  1388. llama_model_loader: - kv 17: llama.feed_forward_length u32 = 14336
  1389. llama_model_loader: - kv 18: llama.attention.head_count u32 = 32
  1390. llama_model_loader: - kv 19: llama.attention.head_count_kv u32 = 8
  1391. llama_model_loader: - kv 20: llama.rope.freq_base f32 = 1000000.000000
  1392. llama_model_loader: - kv 21: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
  1393. llama_model_loader: - kv 22: llama.expert_count u32 = 8
  1394. llama_model_loader: - kv 23: llama.expert_used_count u32 = 2
  1395. llama_model_loader: - kv 24: general.file_type u32 = 2
  1396. llama_model_loader: - kv 25: llama.vocab_size u32 = 32000
  1397. llama_model_loader: - kv 26: llama.rope.dimension_count u32 = 128
  1398. llama_model_loader: - kv 27: tokenizer.ggml.model str = llama
  1399. llama_model_loader: - kv 28: tokenizer.ggml.pre str = default
  1400. llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
  1401. llama_model_loader: - kv 30: tokenizer.ggml.scores arr[f32,32000] = [-1000.000000, -1000.000000, -1000.00...
  1402. llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,32000] = [3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
  1403. llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 1
  1404. llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 2
  1405. llama_model_loader: - kv 34: tokenizer.ggml.unknown_token_id u32 = 0
  1406. llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = true
  1407. llama_model_loader: - kv 36: tokenizer.ggml.add_eos_token bool = false
  1408. llama_model_loader: - kv 37: tokenizer.chat_template str = {%- if messages[0]['role'] == 'system...
  1409. llama_model_loader: - kv 38: tokenizer.ggml.add_space_prefix bool = false
  1410. llama_model_loader: - kv 39: general.quantization_version u32 = 2
  1411. llama_model_loader: - type f32: 97 tensors
  1412. llama_model_loader: - type q4_0: 161 tensors
  1413. llama_model_loader: - type q8_0: 64 tensors
  1414. llama_model_loader: - type q6_K: 1 tensors
  1415. print_info: file format = GGUF V3 (latest)
  1416. print_info: file type = Q4_0
  1417. print_info: file size = 24.63 GiB (4.53 BPW)
  1418. load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
  1419. load: special tokens cache size = 3
  1420. load: token to piece cache size = 0.1637 MB
  1421. print_info: arch = llama
  1422. print_info: vocab_only = 1
  1423. print_info: model type = ?B
  1424. print_info: model params = 46.70 B
  1425. print_info: general.name = Mixtral 8x7B Instruct v0.1
  1426. print_info: vocab type = SPM
  1427. print_info: n_vocab = 32000
  1428. print_info: n_merges = 0
  1429. print_info: BOS token = 1 '<s>'
  1430. print_info: EOS token = 2 '</s>'
  1431. print_info: UNK token = 0 '<unk>'
  1432. print_info: LF token = 13 '<0x0A>'
  1433. print_info: EOG token = 2 '</s>'
  1434. print_info: max token length = 48
  1435. llama_model_load: vocab only - skipping tensors
  1436. time=2025-07-19T17:49:40.739+02:00 level=INFO source=server.go:438 msg="starting llama server" cmd="C:\\Users\\Haldi\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --model C:\\Users\\Haldi\\.ollama\\models\\blobs\\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 --ctx-size 2195 --batch-size 512 --n-gpu-layers 21 --threads 16 --no-mmap --parallel 1 --port 52573"
  1437. time=2025-07-19T17:49:40.741+02:00 level=INFO source=sched.go:483 msg="loaded runners" count=1
  1438. time=2025-07-19T17:49:40.741+02:00 level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
  1439. time=2025-07-19T17:49:40.741+02:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server error"
  1440. time=2025-07-19T17:49:40.817+02:00 level=INFO source=runner.go:815 msg="starting go runner"
  1441. ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
  1442. ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
  1443. ggml_cuda_init: found 1 CUDA devices:
  1444. Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  1445. load_backend: loaded CUDA backend from C:\Users\Haldi\AppData\Local\Programs\Ollama\lib\ollama\ggml-cuda.dll
  1446. load_backend: loaded CPU backend from C:\Users\Haldi\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-icelake.dll
  1447. time=2025-07-19T17:49:40.893+02:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
  1448. time=2025-07-19T17:49:40.894+02:00 level=INFO source=runner.go:874 msg="Server listening on 127.0.0.1:52573"
  1449. llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4090) - 22994 MiB free
  1450. llama_model_loader: loaded meta data with 40 key-value pairs and 323 tensors from C:\Users\Haldi\.ollama\models\blobs\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 (version GGUF V3 (latest))
  1451. llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
  1452. llama_model_loader: - kv 0: general.architecture str = llama
  1453. llama_model_loader: - kv 1: general.type str = model
  1454. llama_model_loader: - kv 2: general.name str = Mixtral 8x7B Instruct v0.1
  1455. llama_model_loader: - kv 3: general.version str = v0.1
  1456. llama_model_loader: - kv 4: general.finetune str = Instruct
  1457. llama_model_loader: - kv 5: general.basename str = Mixtral
  1458. llama_model_loader: - kv 6: general.size_label str = 8x7B
  1459. llama_model_loader: - kv 7: general.license str = apache-2.0
  1460. llama_model_loader: - kv 8: general.base_model.count u32 = 1
  1461. llama_model_loader: - kv 9: general.base_model.0.name str = Mixtral 8x7B v0.1
  1462. llama_model_loader: - kv 10: general.base_model.0.version str = v0.1
  1463. llama_model_loader: - kv 11: general.base_model.0.organization str = Mistralai
  1464. llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/mistralai/Mixt...
  1465. llama_model_loader: - kv 13: general.languages arr[str,5] = ["fr", "it", "de", "es", "en"]
  1466. llama_model_loader: - kv 14: llama.block_count u32 = 32
  1467. llama_model_loader: - kv 15: llama.context_length u32 = 32768
  1468. llama_model_loader: - kv 16: llama.embedding_length u32 = 4096
  1469. llama_model_loader: - kv 17: llama.feed_forward_length u32 = 14336
  1470. llama_model_loader: - kv 18: llama.attention.head_count u32 = 32
  1471. llama_model_loader: - kv 19: llama.attention.head_count_kv u32 = 8
  1472. llama_model_loader: - kv 20: llama.rope.freq_base f32 = 1000000.000000
  1473. llama_model_loader: - kv 21: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
  1474. llama_model_loader: - kv 22: llama.expert_count u32 = 8
  1475. llama_model_loader: - kv 23: llama.expert_used_count u32 = 2
  1476. llama_model_loader: - kv 24: general.file_type u32 = 2
  1477. llama_model_loader: - kv 25: llama.vocab_size u32 = 32000
  1478. llama_model_loader: - kv 26: llama.rope.dimension_count u32 = 128
  1479. llama_model_loader: - kv 27: tokenizer.ggml.model str = llama
  1480. llama_model_loader: - kv 28: tokenizer.ggml.pre str = default
  1481. llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
  1482. llama_model_loader: - kv 30: tokenizer.ggml.scores arr[f32,32000] = [-1000.000000, -1000.000000, -1000.00...
  1483. llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,32000] = [3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
  1484. llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 1
  1485. llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 2
  1486. llama_model_loader: - kv 34: tokenizer.ggml.unknown_token_id u32 = 0
  1487. llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = true
  1488. llama_model_loader: - kv 36: tokenizer.ggml.add_eos_token bool = false
  1489. llama_model_loader: - kv 37: tokenizer.chat_template str = {%- if messages[0]['role'] == 'system...
  1490. llama_model_loader: - kv 38: tokenizer.ggml.add_space_prefix bool = false
  1491. llama_model_loader: - kv 39: general.quantization_version u32 = 2
  1492. llama_model_loader: - type f32: 97 tensors
  1493. llama_model_loader: - type q4_0: 161 tensors
  1494. llama_model_loader: - type q8_0: 64 tensors
  1495. llama_model_loader: - type q6_K: 1 tensors
  1496. print_info: file format = GGUF V3 (latest)
  1497. print_info: file type = Q4_0
  1498. print_info: file size = 24.63 GiB (4.53 BPW)
  1499. load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
  1500. load: special tokens cache size = 3
  1501. load: token to piece cache size = 0.1637 MB
  1502. print_info: arch = llama
  1503. print_info: vocab_only = 0
  1504. print_info: n_ctx_train = 32768
  1505. print_info: n_embd = 4096
  1506. print_info: n_layer = 32
  1507. print_info: n_head = 32
  1508. print_info: n_head_kv = 8
  1509. print_info: n_rot = 128
  1510. print_info: n_swa = 0
  1511. print_info: n_swa_pattern = 1
  1512. print_info: n_embd_head_k = 128
  1513. print_info: n_embd_head_v = 128
  1514. print_info: n_gqa = 4
  1515. print_info: n_embd_k_gqa = 1024
  1516. print_info: n_embd_v_gqa = 1024
  1517. print_info: f_norm_eps = 0.0e+00
  1518. print_info: f_norm_rms_eps = 1.0e-05
  1519. print_info: f_clamp_kqv = 0.0e+00
  1520. print_info: f_max_alibi_bias = 0.0e+00
  1521. print_info: f_logit_scale = 0.0e+00
  1522. print_info: f_attn_scale = 0.0e+00
  1523. print_info: n_ff = 14336
  1524. print_info: n_expert = 8
  1525. print_info: n_expert_used = 2
  1526. print_info: causal attn = 1
  1527. print_info: pooling type = 0
  1528. print_info: rope type = 0
  1529. print_info: rope scaling = linear
  1530. print_info: freq_base_train = 1000000.0
  1531. print_info: freq_scale_train = 1
  1532. print_info: n_ctx_orig_yarn = 32768
  1533. print_info: rope_finetuned = unknown
  1534. print_info: ssm_d_conv = 0
  1535. print_info: ssm_d_inner = 0
  1536. print_info: ssm_d_state = 0
  1537. print_info: ssm_dt_rank = 0
  1538. print_info: ssm_dt_b_c_rms = 0
  1539. print_info: model type = 8x7B
  1540. print_info: model params = 46.70 B
  1541. print_info: general.name = Mixtral 8x7B Instruct v0.1
  1542. print_info: vocab type = SPM
  1543. print_info: n_vocab = 32000
  1544. print_info: n_merges = 0
  1545. print_info: BOS token = 1 '<s>'
  1546. print_info: EOS token = 2 '</s>'
  1547. print_info: UNK token = 0 '<unk>'
  1548. print_info: LF token = 13 '<0x0A>'
  1549. print_info: EOG token = 2 '</s>'
  1550. print_info: max token length = 48
  1551. load_tensors: loading model tensors, this can take a while... (mmap = false)
  1552. time=2025-07-19T17:49:40.992+02:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
  1553. load_tensors: offloading 21 repeating layers to GPU
  1554. load_tensors: offloaded 21/33 layers to GPU
  1555. load_tensors: CUDA_Host model buffer size = 8782.09 MiB
  1556. load_tensors: CUDA0 model buffer size = 16435.78 MiB
  1557. llama_context: constructing llama_context
  1558. llama_context: n_seq_max = 1
  1559. llama_context: n_ctx = 2195
  1560. llama_context: n_ctx_per_seq = 2195
  1561. llama_context: n_batch = 512
  1562. llama_context: n_ubatch = 512
  1563. llama_context: causal_attn = 1
  1564. llama_context: flash_attn = 0
  1565. llama_context: freq_base = 1000000.0
  1566. llama_context: freq_scale = 1
  1567. llama_context: n_ctx_per_seq (2195) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
  1568. llama_context: CPU output buffer size = 0.14 MiB
  1569. llama_kv_cache_unified: kv_size = 2208, type_k = 'f16', type_v = 'f16', n_layer = 32, can_shift = 1, padding = 32
  1570. llama_kv_cache_unified: CUDA0 KV buffer size = 181.12 MiB
  1571. llama_kv_cache_unified: CPU KV buffer size = 94.88 MiB
  1572. llama_kv_cache_unified: KV self size = 276.00 MiB, K (f16): 138.00 MiB, V (f16): 138.00 MiB
  1573. llama_context: CUDA0 compute buffer size = 393.31 MiB
  1574. llama_context: CUDA_Host compute buffer size = 12.32 MiB
  1575. llama_context: graph nodes = 1574
  1576. llama_context: graph splits = 136 (with bs=512), 3 (with bs=1)
  1577. time=2025-07-19T17:49:56.527+02:00 level=INFO source=server.go:637 msg="llama runner started in 15.79 seconds"
  1578. time=2025-07-19T17:49:56.532+02:00 level=WARN source=runner.go:128 msg="truncating input prompt" limit=2195 prompt=2315 keep=5 new=2195
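
The WARN above means the rendered prompt (2315 tokens) did not fit the runner's context window (2195), so it was cut down while preserving the first keep=5 tokens. A rough Python illustration of that kind of head-plus-tail truncation; the actual logic lives in Ollama's runner.go, so treat this purely as an approximation:

  # Illustrates "limit=2195 prompt=2315 keep=5 new=2195": keep the first `keep`
  # tokens, then the most recent tokens that still fit within the limit.
  def truncate(tokens, limit, keep):
      if len(tokens) <= limit:
          return tokens
      return tokens[:keep] + tokens[len(tokens) - (limit - keep):]

  print(len(truncate(list(range(2315)), limit=2195, keep=5)))  # 2195
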
  1579. [GIN] 2025/07/19 - 17:50:05 | 200 | 25.1645168s | 83.77.231.178 | POST "/api/generate"
  1580. time=2025-07-19T17:50:09.600+02:00 level=INFO source=server.go:135 msg="system memory" total="47.1 GiB" free="32.3 GiB" free_swap="29.3 GiB"
  1581. time=2025-07-19T17:50:09.601+02:00 level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=33 layers.offload=21 layers.split="" memory.available="[18.8 GiB]" memory.gpu_overhead="0 B" memory.required.full="26.9 GiB" memory.required.partial="18.3 GiB" memory.required.kv="297.0 MiB" memory.required.allocations="[18.3 GiB]" memory.weights.total="24.6 GiB" memory.weights.repeating="24.5 GiB" memory.weights.nonrepeating="102.6 MiB" memory.graph.full="185.1 MiB" memory.graph.partial="826.7 MiB"
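
This offload line is also where the recurring 21/33 split comes from: 24.5 GiB of repeating weights over 32 transformer layers works out to roughly 784 MiB per layer, and only 21 such layers fit in the ~18.8 GiB the scheduler treats as available once the KV cache, compute graph and non-repeating tensors are budgeted. A quick, approximate consistency check (the layers are not all exactly the same size):

  # ~784 MiB of weights per repeating layer; 21 layers on the GPU lines up with
  # the "CUDA0 model buffer size = 16435.78 MiB" reported by load_tensors.
  per_layer_mib = (24.5 * 1024) / 32
  print(round(21 * per_layer_mib))   # ~16464 MiB offloaded to CUDA0
  print(round(11 * per_layer_mib))   # ~8624 MiB kept on the CPU (plus non-repeating tensors)
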
  1582. llama_model_loader: loaded meta data with 40 key-value pairs and 323 tensors from C:\Users\Haldi\.ollama\models\blobs\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 (version GGUF V3 (latest))
  1583. llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
  1584. llama_model_loader: - kv 0: general.architecture str = llama
  1585. llama_model_loader: - kv 1: general.type str = model
  1586. llama_model_loader: - kv 2: general.name str = Mixtral 8x7B Instruct v0.1
  1587. llama_model_loader: - kv 3: general.version str = v0.1
  1588. llama_model_loader: - kv 4: general.finetune str = Instruct
  1589. llama_model_loader: - kv 5: general.basename str = Mixtral
  1590. llama_model_loader: - kv 6: general.size_label str = 8x7B
  1591. llama_model_loader: - kv 7: general.license str = apache-2.0
  1592. llama_model_loader: - kv 8: general.base_model.count u32 = 1
  1593. llama_model_loader: - kv 9: general.base_model.0.name str = Mixtral 8x7B v0.1
  1594. llama_model_loader: - kv 10: general.base_model.0.version str = v0.1
  1595. llama_model_loader: - kv 11: general.base_model.0.organization str = Mistralai
  1596. llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/mistralai/Mixt...
  1597. llama_model_loader: - kv 13: general.languages arr[str,5] = ["fr", "it", "de", "es", "en"]
  1598. llama_model_loader: - kv 14: llama.block_count u32 = 32
  1599. llama_model_loader: - kv 15: llama.context_length u32 = 32768
  1600. llama_model_loader: - kv 16: llama.embedding_length u32 = 4096
  1601. llama_model_loader: - kv 17: llama.feed_forward_length u32 = 14336
  1602. llama_model_loader: - kv 18: llama.attention.head_count u32 = 32
  1603. llama_model_loader: - kv 19: llama.attention.head_count_kv u32 = 8
  1604. llama_model_loader: - kv 20: llama.rope.freq_base f32 = 1000000.000000
  1605. llama_model_loader: - kv 21: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
  1606. llama_model_loader: - kv 22: llama.expert_count u32 = 8
  1607. llama_model_loader: - kv 23: llama.expert_used_count u32 = 2
  1608. llama_model_loader: - kv 24: general.file_type u32 = 2
  1609. llama_model_loader: - kv 25: llama.vocab_size u32 = 32000
  1610. llama_model_loader: - kv 26: llama.rope.dimension_count u32 = 128
  1611. llama_model_loader: - kv 27: tokenizer.ggml.model str = llama
  1612. llama_model_loader: - kv 28: tokenizer.ggml.pre str = default
  1613. llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
  1614. llama_model_loader: - kv 30: tokenizer.ggml.scores arr[f32,32000] = [-1000.000000, -1000.000000, -1000.00...
  1615. llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,32000] = [3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
  1616. llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 1
  1617. llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 2
  1618. llama_model_loader: - kv 34: tokenizer.ggml.unknown_token_id u32 = 0
  1619. llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = true
  1620. llama_model_loader: - kv 36: tokenizer.ggml.add_eos_token bool = false
  1621. llama_model_loader: - kv 37: tokenizer.chat_template str = {%- if messages[0]['role'] == 'system...
  1622. llama_model_loader: - kv 38: tokenizer.ggml.add_space_prefix bool = false
  1623. llama_model_loader: - kv 39: general.quantization_version u32 = 2
  1624. llama_model_loader: - type f32: 97 tensors
  1625. llama_model_loader: - type q4_0: 161 tensors
  1626. llama_model_loader: - type q8_0: 64 tensors
  1627. llama_model_loader: - type q6_K: 1 tensors
  1628. print_info: file format = GGUF V3 (latest)
  1629. print_info: file type = Q4_0
  1630. print_info: file size = 24.63 GiB (4.53 BPW)
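
The BPW figure in the file-size line is simply the file size in bits divided by the parameter count, a quick sanity check for this mixed q4_0/q8_0 quantization:

  # 24.63 GiB over 46.70 B parameters -> 4.53 bits per weight
  GiB = 1024**3
  print(round(24.63 * GiB * 8 / 46.70e9, 2))   # 4.53
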
  1631. load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
  1632. load: special tokens cache size = 3
  1633. load: token to piece cache size = 0.1637 MB
  1634. print_info: arch = llama
  1635. print_info: vocab_only = 1
  1636. print_info: model type = ?B
  1637. print_info: model params = 46.70 B
  1638. print_info: general.name = Mixtral 8x7B Instruct v0.1
  1639. print_info: vocab type = SPM
  1640. print_info: n_vocab = 32000
  1641. print_info: n_merges = 0
  1642. print_info: BOS token = 1 '<s>'
  1643. print_info: EOS token = 2 '</s>'
  1644. print_info: UNK token = 0 '<unk>'
  1645. print_info: LF token = 13 '<0x0A>'
  1646. print_info: EOG token = 2 '</s>'
  1647. print_info: max token length = 48
  1648. llama_model_load: vocab only - skipping tensors
  1649. time=2025-07-19T17:50:09.626+02:00 level=INFO source=server.go:438 msg="starting llama server" cmd="C:\\Users\\Haldi\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --model C:\\Users\\Haldi\\.ollama\\models\\blobs\\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 --ctx-size 2376 --batch-size 512 --n-gpu-layers 21 --threads 16 --no-mmap --parallel 1 --port 52594"
  1650. time=2025-07-19T17:50:09.629+02:00 level=INFO source=sched.go:483 msg="loaded runners" count=1
  1651. time=2025-07-19T17:50:09.629+02:00 level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
  1652. time=2025-07-19T17:50:09.629+02:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server error"
  1653. time=2025-07-19T17:50:09.670+02:00 level=INFO source=runner.go:815 msg="starting go runner"
  1654. ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
  1655. ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
  1656. ggml_cuda_init: found 1 CUDA devices:
  1657. Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  1658. load_backend: loaded CUDA backend from C:\Users\Haldi\AppData\Local\Programs\Ollama\lib\ollama\ggml-cuda.dll
  1659. load_backend: loaded CPU backend from C:\Users\Haldi\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-icelake.dll
  1660. time=2025-07-19T17:50:09.764+02:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
  1661. time=2025-07-19T17:50:09.765+02:00 level=INFO source=runner.go:874 msg="Server listening on 127.0.0.1:52594"
  1662. llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4090) - 22994 MiB free
  1663. llama_model_loader: loaded meta data with 40 key-value pairs and 323 tensors from C:\Users\Haldi\.ollama\models\blobs\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 (version GGUF V3 (latest))
  1664. llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
  1665. llama_model_loader: - kv 0: general.architecture str = llama
  1666. llama_model_loader: - kv 1: general.type str = model
  1667. llama_model_loader: - kv 2: general.name str = Mixtral 8x7B Instruct v0.1
  1668. llama_model_loader: - kv 3: general.version str = v0.1
  1669. llama_model_loader: - kv 4: general.finetune str = Instruct
  1670. llama_model_loader: - kv 5: general.basename str = Mixtral
  1671. llama_model_loader: - kv 6: general.size_label str = 8x7B
  1672. llama_model_loader: - kv 7: general.license str = apache-2.0
  1673. llama_model_loader: - kv 8: general.base_model.count u32 = 1
  1674. llama_model_loader: - kv 9: general.base_model.0.name str = Mixtral 8x7B v0.1
  1675. llama_model_loader: - kv 10: general.base_model.0.version str = v0.1
  1676. llama_model_loader: - kv 11: general.base_model.0.organization str = Mistralai
  1677. llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/mistralai/Mixt...
  1678. llama_model_loader: - kv 13: general.languages arr[str,5] = ["fr", "it", "de", "es", "en"]
  1679. llama_model_loader: - kv 14: llama.block_count u32 = 32
  1680. llama_model_loader: - kv 15: llama.context_length u32 = 32768
  1681. llama_model_loader: - kv 16: llama.embedding_length u32 = 4096
  1682. llama_model_loader: - kv 17: llama.feed_forward_length u32 = 14336
  1683. llama_model_loader: - kv 18: llama.attention.head_count u32 = 32
  1684. llama_model_loader: - kv 19: llama.attention.head_count_kv u32 = 8
  1685. llama_model_loader: - kv 20: llama.rope.freq_base f32 = 1000000.000000
  1686. llama_model_loader: - kv 21: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
  1687. llama_model_loader: - kv 22: llama.expert_count u32 = 8
  1688. llama_model_loader: - kv 23: llama.expert_used_count u32 = 2
  1689. llama_model_loader: - kv 24: general.file_type u32 = 2
  1690. llama_model_loader: - kv 25: llama.vocab_size u32 = 32000
  1691. llama_model_loader: - kv 26: llama.rope.dimension_count u32 = 128
  1692. llama_model_loader: - kv 27: tokenizer.ggml.model str = llama
  1693. llama_model_loader: - kv 28: tokenizer.ggml.pre str = default
  1694. llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
  1695. llama_model_loader: - kv 30: tokenizer.ggml.scores arr[f32,32000] = [-1000.000000, -1000.000000, -1000.00...
  1696. llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,32000] = [3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
  1697. llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 1
  1698. llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 2
  1699. llama_model_loader: - kv 34: tokenizer.ggml.unknown_token_id u32 = 0
  1700. llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = true
  1701. llama_model_loader: - kv 36: tokenizer.ggml.add_eos_token bool = false
  1702. llama_model_loader: - kv 37: tokenizer.chat_template str = {%- if messages[0]['role'] == 'system...
  1703. llama_model_loader: - kv 38: tokenizer.ggml.add_space_prefix bool = false
  1704. llama_model_loader: - kv 39: general.quantization_version u32 = 2
  1705. llama_model_loader: - type f32: 97 tensors
  1706. llama_model_loader: - type q4_0: 161 tensors
  1707. llama_model_loader: - type q8_0: 64 tensors
  1708. llama_model_loader: - type q6_K: 1 tensors
  1709. print_info: file format = GGUF V3 (latest)
  1710. print_info: file type = Q4_0
  1711. print_info: file size = 24.63 GiB (4.53 BPW)
  1712. load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
  1713. load: special tokens cache size = 3
  1714. load: token to piece cache size = 0.1637 MB
  1715. print_info: arch = llama
  1716. print_info: vocab_only = 0
  1717. print_info: n_ctx_train = 32768
  1718. print_info: n_embd = 4096
  1719. print_info: n_layer = 32
  1720. print_info: n_head = 32
  1721. print_info: n_head_kv = 8
  1722. print_info: n_rot = 128
  1723. print_info: n_swa = 0
  1724. print_info: n_swa_pattern = 1
  1725. print_info: n_embd_head_k = 128
  1726. print_info: n_embd_head_v = 128
  1727. print_info: n_gqa = 4
  1728. print_info: n_embd_k_gqa = 1024
  1729. print_info: n_embd_v_gqa = 1024
  1730. print_info: f_norm_eps = 0.0e+00
  1731. print_info: f_norm_rms_eps = 1.0e-05
  1732. print_info: f_clamp_kqv = 0.0e+00
  1733. print_info: f_max_alibi_bias = 0.0e+00
  1734. print_info: f_logit_scale = 0.0e+00
  1735. print_info: f_attn_scale = 0.0e+00
  1736. print_info: n_ff = 14336
  1737. print_info: n_expert = 8
  1738. print_info: n_expert_used = 2
  1739. print_info: causal attn = 1
  1740. print_info: pooling type = 0
  1741. print_info: rope type = 0
  1742. print_info: rope scaling = linear
  1743. print_info: freq_base_train = 1000000.0
  1744. print_info: freq_scale_train = 1
  1745. print_info: n_ctx_orig_yarn = 32768
  1746. print_info: rope_finetuned = unknown
  1747. print_info: ssm_d_conv = 0
  1748. print_info: ssm_d_inner = 0
  1749. print_info: ssm_d_state = 0
  1750. print_info: ssm_dt_rank = 0
  1751. print_info: ssm_dt_b_c_rms = 0
  1752. print_info: model type = 8x7B
  1753. print_info: model params = 46.70 B
  1754. print_info: general.name = Mixtral 8x7B Instruct v0.1
  1755. print_info: vocab type = SPM
  1756. print_info: n_vocab = 32000
  1757. print_info: n_merges = 0
  1758. print_info: BOS token = 1 '<s>'
  1759. print_info: EOS token = 2 '</s>'
  1760. print_info: UNK token = 0 '<unk>'
  1761. print_info: LF token = 13 '<0x0A>'
  1762. print_info: EOG token = 2 '</s>'
  1763. print_info: max token length = 48
  1764. load_tensors: loading model tensors, this can take a while... (mmap = false)
  1765. time=2025-07-19T17:50:09.880+02:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
  1766. load_tensors: offloading 21 repeating layers to GPU
  1767. load_tensors: offloaded 21/33 layers to GPU
  1768. load_tensors: CUDA_Host model buffer size = 8782.09 MiB
  1769. load_tensors: CUDA0 model buffer size = 16435.78 MiB
  1770. llama_context: constructing llama_context
  1771. llama_context: n_seq_max = 1
  1772. llama_context: n_ctx = 2376
  1773. llama_context: n_ctx_per_seq = 2376
  1774. llama_context: n_batch = 512
  1775. llama_context: n_ubatch = 512
  1776. llama_context: causal_attn = 1
  1777. llama_context: flash_attn = 0
  1778. llama_context: freq_base = 1000000.0
  1779. llama_context: freq_scale = 1
  1780. llama_context: n_ctx_per_seq (2376) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
  1781. llama_context: CPU output buffer size = 0.14 MiB
  1782. llama_kv_cache_unified: kv_size = 2400, type_k = 'f16', type_v = 'f16', n_layer = 32, can_shift = 1, padding = 32
  1783. llama_kv_cache_unified: CUDA0 KV buffer size = 196.88 MiB
  1784. llama_kv_cache_unified: CPU KV buffer size = 103.12 MiB
  1785. llama_kv_cache_unified: KV self size = 300.00 MiB, K (f16): 150.00 MiB, V (f16): 150.00 MiB
  1786. llama_context: CUDA0 compute buffer size = 393.69 MiB
  1787. llama_context: CUDA_Host compute buffer size = 12.69 MiB
  1788. llama_context: graph nodes = 1574
  1789. llama_context: graph splits = 136 (with bs=512), 3 (with bs=1)
  1790. time=2025-07-19T17:50:25.662+02:00 level=INFO source=server.go:637 msg="llama runner started in 16.03 seconds"
  1791. time=2025-07-19T17:50:25.666+02:00 level=WARN source=runner.go:128 msg="truncating input prompt" limit=2376 prompt=2825 keep=5 new=2376
  1792. [GIN] 2025/07/19 - 17:50:42 | 200 | 33.7636898s | 83.77.231.178 | POST "/api/generate"
  1793. time=2025-07-19T17:50:48.767+02:00 level=INFO source=server.go:135 msg="system memory" total="47.1 GiB" free="32.5 GiB" free_swap="29.6 GiB"
  1794. time=2025-07-19T17:50:48.767+02:00 level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=33 layers.offload=21 layers.split="" memory.available="[18.8 GiB]" memory.gpu_overhead="0 B" memory.required.full="26.8 GiB" memory.required.partial="18.2 GiB" memory.required.kv="256.0 MiB" memory.required.allocations="[18.2 GiB]" memory.weights.total="24.6 GiB" memory.weights.repeating="24.5 GiB" memory.weights.nonrepeating="102.6 MiB" memory.graph.full="164.0 MiB" memory.graph.partial="826.0 MiB"
  1795. llama_model_loader: loaded meta data with 40 key-value pairs and 323 tensors from C:\Users\Haldi\.ollama\models\blobs\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 (version GGUF V3 (latest))
  1796. llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
  1797. llama_model_loader: - kv 0: general.architecture str = llama
  1798. llama_model_loader: - kv 1: general.type str = model
  1799. llama_model_loader: - kv 2: general.name str = Mixtral 8x7B Instruct v0.1
  1800. llama_model_loader: - kv 3: general.version str = v0.1
  1801. llama_model_loader: - kv 4: general.finetune str = Instruct
  1802. llama_model_loader: - kv 5: general.basename str = Mixtral
  1803. llama_model_loader: - kv 6: general.size_label str = 8x7B
  1804. llama_model_loader: - kv 7: general.license str = apache-2.0
  1805. llama_model_loader: - kv 8: general.base_model.count u32 = 1
  1806. llama_model_loader: - kv 9: general.base_model.0.name str = Mixtral 8x7B v0.1
  1807. llama_model_loader: - kv 10: general.base_model.0.version str = v0.1
  1808. llama_model_loader: - kv 11: general.base_model.0.organization str = Mistralai
  1809. llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/mistralai/Mixt...
  1810. llama_model_loader: - kv 13: general.languages arr[str,5] = ["fr", "it", "de", "es", "en"]
  1811. llama_model_loader: - kv 14: llama.block_count u32 = 32
  1812. llama_model_loader: - kv 15: llama.context_length u32 = 32768
  1813. llama_model_loader: - kv 16: llama.embedding_length u32 = 4096
  1814. llama_model_loader: - kv 17: llama.feed_forward_length u32 = 14336
  1815. llama_model_loader: - kv 18: llama.attention.head_count u32 = 32
  1816. llama_model_loader: - kv 19: llama.attention.head_count_kv u32 = 8
  1817. llama_model_loader: - kv 20: llama.rope.freq_base f32 = 1000000.000000
  1818. llama_model_loader: - kv 21: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
  1819. llama_model_loader: - kv 22: llama.expert_count u32 = 8
  1820. llama_model_loader: - kv 23: llama.expert_used_count u32 = 2
  1821. llama_model_loader: - kv 24: general.file_type u32 = 2
  1822. llama_model_loader: - kv 25: llama.vocab_size u32 = 32000
  1823. llama_model_loader: - kv 26: llama.rope.dimension_count u32 = 128
  1824. llama_model_loader: - kv 27: tokenizer.ggml.model str = llama
  1825. llama_model_loader: - kv 28: tokenizer.ggml.pre str = default
  1826. llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
  1827. llama_model_loader: - kv 30: tokenizer.ggml.scores arr[f32,32000] = [-1000.000000, -1000.000000, -1000.00...
  1828. llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,32000] = [3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
  1829. llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 1
  1830. llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 2
  1831. llama_model_loader: - kv 34: tokenizer.ggml.unknown_token_id u32 = 0
  1832. llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = true
  1833. llama_model_loader: - kv 36: tokenizer.ggml.add_eos_token bool = false
  1834. llama_model_loader: - kv 37: tokenizer.chat_template str = {%- if messages[0]['role'] == 'system...
  1835. llama_model_loader: - kv 38: tokenizer.ggml.add_space_prefix bool = false
  1836. llama_model_loader: - kv 39: general.quantization_version u32 = 2
  1837. llama_model_loader: - type f32: 97 tensors
  1838. llama_model_loader: - type q4_0: 161 tensors
  1839. llama_model_loader: - type q8_0: 64 tensors
  1840. llama_model_loader: - type q6_K: 1 tensors
  1841. print_info: file format = GGUF V3 (latest)
  1842. print_info: file type = Q4_0
  1843. print_info: file size = 24.63 GiB (4.53 BPW)
  1844. load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
  1845. load: special tokens cache size = 3
  1846. load: token to piece cache size = 0.1637 MB
  1847. print_info: arch = llama
  1848. print_info: vocab_only = 1
  1849. print_info: model type = ?B
  1850. print_info: model params = 46.70 B
  1851. print_info: general.name = Mixtral 8x7B Instruct v0.1
  1852. print_info: vocab type = SPM
  1853. print_info: n_vocab = 32000
  1854. print_info: n_merges = 0
  1855. print_info: BOS token = 1 '<s>'
  1856. print_info: EOS token = 2 '</s>'
  1857. print_info: UNK token = 0 '<unk>'
  1858. print_info: LF token = 13 '<0x0A>'
  1859. print_info: EOG token = 2 '</s>'
  1860. print_info: max token length = 48
  1861. llama_model_load: vocab only - skipping tensors
  1862. time=2025-07-19T17:50:48.794+02:00 level=INFO source=server.go:438 msg="starting llama server" cmd="C:\\Users\\Haldi\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --model C:\\Users\\Haldi\\.ollama\\models\\blobs\\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 --ctx-size 2035 --batch-size 512 --n-gpu-layers 21 --threads 16 --no-mmap --parallel 1 --port 52671"
  1863. time=2025-07-19T17:50:48.799+02:00 level=INFO source=sched.go:483 msg="loaded runners" count=1
  1864. time=2025-07-19T17:50:48.799+02:00 level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
  1865. time=2025-07-19T17:50:48.800+02:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server error"
  1866. time=2025-07-19T17:50:48.853+02:00 level=INFO source=runner.go:815 msg="starting go runner"
  1867. ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
  1868. ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
  1869. ggml_cuda_init: found 1 CUDA devices:
  1870. Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  1871. load_backend: loaded CUDA backend from C:\Users\Haldi\AppData\Local\Programs\Ollama\lib\ollama\ggml-cuda.dll
  1872. load_backend: loaded CPU backend from C:\Users\Haldi\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-icelake.dll
  1873. time=2025-07-19T17:50:48.931+02:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
  1874. time=2025-07-19T17:50:48.932+02:00 level=INFO source=runner.go:874 msg="Server listening on 127.0.0.1:52671"
  1875. llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4090) - 22994 MiB free
  1876. llama_model_loader: loaded meta data with 40 key-value pairs and 323 tensors from C:\Users\Haldi\.ollama\models\blobs\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 (version GGUF V3 (latest))
  1877. llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
  1878. llama_model_loader: - kv 0: general.architecture str = llama
  1879. llama_model_loader: - kv 1: general.type str = model
  1880. llama_model_loader: - kv 2: general.name str = Mixtral 8x7B Instruct v0.1
  1881. llama_model_loader: - kv 3: general.version str = v0.1
  1882. llama_model_loader: - kv 4: general.finetune str = Instruct
  1883. llama_model_loader: - kv 5: general.basename str = Mixtral
  1884. llama_model_loader: - kv 6: general.size_label str = 8x7B
  1885. llama_model_loader: - kv 7: general.license str = apache-2.0
  1886. llama_model_loader: - kv 8: general.base_model.count u32 = 1
  1887. llama_model_loader: - kv 9: general.base_model.0.name str = Mixtral 8x7B v0.1
  1888. llama_model_loader: - kv 10: general.base_model.0.version str = v0.1
  1889. llama_model_loader: - kv 11: general.base_model.0.organization str = Mistralai
  1890. llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/mistralai/Mixt...
  1891. llama_model_loader: - kv 13: general.languages arr[str,5] = ["fr", "it", "de", "es", "en"]
  1892. llama_model_loader: - kv 14: llama.block_count u32 = 32
  1893. llama_model_loader: - kv 15: llama.context_length u32 = 32768
  1894. llama_model_loader: - kv 16: llama.embedding_length u32 = 4096
  1895. llama_model_loader: - kv 17: llama.feed_forward_length u32 = 14336
  1896. llama_model_loader: - kv 18: llama.attention.head_count u32 = 32
  1897. llama_model_loader: - kv 19: llama.attention.head_count_kv u32 = 8
  1898. llama_model_loader: - kv 20: llama.rope.freq_base f32 = 1000000.000000
  1899. llama_model_loader: - kv 21: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
  1900. llama_model_loader: - kv 22: llama.expert_count u32 = 8
  1901. llama_model_loader: - kv 23: llama.expert_used_count u32 = 2
  1902. llama_model_loader: - kv 24: general.file_type u32 = 2
  1903. llama_model_loader: - kv 25: llama.vocab_size u32 = 32000
  1904. llama_model_loader: - kv 26: llama.rope.dimension_count u32 = 128
  1905. llama_model_loader: - kv 27: tokenizer.ggml.model str = llama
  1906. llama_model_loader: - kv 28: tokenizer.ggml.pre str = default
  1907. llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
  1908. llama_model_loader: - kv 30: tokenizer.ggml.scores arr[f32,32000] = [-1000.000000, -1000.000000, -1000.00...
  1909. llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,32000] = [3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
  1910. llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 1
  1911. llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 2
  1912. llama_model_loader: - kv 34: tokenizer.ggml.unknown_token_id u32 = 0
  1913. llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = true
  1914. llama_model_loader: - kv 36: tokenizer.ggml.add_eos_token bool = false
  1915. llama_model_loader: - kv 37: tokenizer.chat_template str = {%- if messages[0]['role'] == 'system...
  1916. llama_model_loader: - kv 38: tokenizer.ggml.add_space_prefix bool = false
  1917. llama_model_loader: - kv 39: general.quantization_version u32 = 2
  1918. llama_model_loader: - type f32: 97 tensors
  1919. llama_model_loader: - type q4_0: 161 tensors
  1920. llama_model_loader: - type q8_0: 64 tensors
  1921. llama_model_loader: - type q6_K: 1 tensors
  1922. print_info: file format = GGUF V3 (latest)
  1923. print_info: file type = Q4_0
  1924. print_info: file size = 24.63 GiB (4.53 BPW)
  1925. load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
  1926. load: special tokens cache size = 3
  1927. load: token to piece cache size = 0.1637 MB
  1928. print_info: arch = llama
  1929. print_info: vocab_only = 0
  1930. print_info: n_ctx_train = 32768
  1931. print_info: n_embd = 4096
  1932. print_info: n_layer = 32
  1933. print_info: n_head = 32
  1934. print_info: n_head_kv = 8
  1935. print_info: n_rot = 128
  1936. print_info: n_swa = 0
  1937. print_info: n_swa_pattern = 1
  1938. print_info: n_embd_head_k = 128
  1939. print_info: n_embd_head_v = 128
  1940. print_info: n_gqa = 4
  1941. print_info: n_embd_k_gqa = 1024
  1942. print_info: n_embd_v_gqa = 1024
  1943. print_info: f_norm_eps = 0.0e+00
  1944. print_info: f_norm_rms_eps = 1.0e-05
  1945. print_info: f_clamp_kqv = 0.0e+00
  1946. print_info: f_max_alibi_bias = 0.0e+00
  1947. print_info: f_logit_scale = 0.0e+00
  1948. print_info: f_attn_scale = 0.0e+00
  1949. print_info: n_ff = 14336
  1950. print_info: n_expert = 8
  1951. print_info: n_expert_used = 2
  1952. print_info: causal attn = 1
  1953. print_info: pooling type = 0
  1954. print_info: rope type = 0
  1955. print_info: rope scaling = linear
  1956. print_info: freq_base_train = 1000000.0
  1957. print_info: freq_scale_train = 1
  1958. print_info: n_ctx_orig_yarn = 32768
  1959. print_info: rope_finetuned = unknown
  1960. print_info: ssm_d_conv = 0
  1961. print_info: ssm_d_inner = 0
  1962. print_info: ssm_d_state = 0
  1963. print_info: ssm_dt_rank = 0
  1964. print_info: ssm_dt_b_c_rms = 0
  1965. print_info: model type = 8x7B
  1966. print_info: model params = 46.70 B
  1967. print_info: general.name = Mixtral 8x7B Instruct v0.1
  1968. print_info: vocab type = SPM
  1969. print_info: n_vocab = 32000
  1970. print_info: n_merges = 0
  1971. print_info: BOS token = 1 '<s>'
  1972. print_info: EOS token = 2 '</s>'
  1973. print_info: UNK token = 0 '<unk>'
  1974. print_info: LF token = 13 '<0x0A>'
  1975. print_info: EOG token = 2 '</s>'
  1976. print_info: max token length = 48
  1977. load_tensors: loading model tensors, this can take a while... (mmap = false)
  1978. time=2025-07-19T17:50:49.051+02:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
  1979. load_tensors: offloading 21 repeating layers to GPU
  1980. load_tensors: offloaded 21/33 layers to GPU
  1981. load_tensors: CUDA_Host model buffer size = 8782.09 MiB
  1982. load_tensors: CUDA0 model buffer size = 16435.78 MiB
  1983. llama_context: constructing llama_context
  1984. llama_context: n_seq_max = 1
  1985. llama_context: n_ctx = 2035
  1986. llama_context: n_ctx_per_seq = 2035
  1987. llama_context: n_batch = 512
  1988. llama_context: n_ubatch = 512
  1989. llama_context: causal_attn = 1
  1990. llama_context: flash_attn = 0
  1991. llama_context: freq_base = 1000000.0
  1992. llama_context: freq_scale = 1
  1993. llama_context: n_ctx_per_seq (2035) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
  1994. llama_context: CPU output buffer size = 0.14 MiB
  1995. llama_kv_cache_unified: kv_size = 2048, type_k = 'f16', type_v = 'f16', n_layer = 32, can_shift = 1, padding = 32
  1996. llama_kv_cache_unified: CUDA0 KV buffer size = 168.00 MiB
  1997. llama_kv_cache_unified: CPU KV buffer size = 88.00 MiB
  1998. llama_kv_cache_unified: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB
  1999. llama_context: CUDA0 compute buffer size = 405.00 MiB
  2000. llama_context: CUDA_Host compute buffer size = 12.01 MiB
  2001. llama_context: graph nodes = 1574
  2002. llama_context: graph splits = 136 (with bs=512), 3 (with bs=1)
  2003. time=2025-07-19T17:51:04.834+02:00 level=INFO source=server.go:637 msg="llama runner started in 16.03 seconds"
  2004. time=2025-07-19T17:51:04.837+02:00 level=WARN source=runner.go:128 msg="truncating input prompt" limit=2035 prompt=2044 keep=5 new=2035
  2005. [GIN] 2025/07/19 - 17:51:08 | 200 | 0s | 192.168.1.1 | GET "/"
  2006. [GIN] 2025/07/19 - 17:51:15 | 200 | 27.1912518s | 83.77.231.178 | POST "/api/generate"
  2007. time=2025-07-19T17:51:20.364+02:00 level=INFO source=server.go:135 msg="system memory" total="47.1 GiB" free="32.3 GiB" free_swap="28.9 GiB"
  2008. time=2025-07-19T17:51:20.364+02:00 level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=33 layers.offload=21 layers.split="" memory.available="[18.5 GiB]" memory.gpu_overhead="0 B" memory.required.full="26.8 GiB" memory.required.partial="18.2 GiB" memory.required.kv="256.0 MiB" memory.required.allocations="[18.2 GiB]" memory.weights.total="24.6 GiB" memory.weights.repeating="24.5 GiB" memory.weights.nonrepeating="102.6 MiB" memory.graph.full="164.0 MiB" memory.graph.partial="826.0 MiB"
  2009. llama_model_loader: loaded meta data with 40 key-value pairs and 323 tensors from C:\Users\Haldi\.ollama\models\blobs\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 (version GGUF V3 (latest))
  2010. llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
  2011. llama_model_loader: - kv 0: general.architecture str = llama
  2012. llama_model_loader: - kv 1: general.type str = model
  2013. llama_model_loader: - kv 2: general.name str = Mixtral 8x7B Instruct v0.1
  2014. llama_model_loader: - kv 3: general.version str = v0.1
  2015. llama_model_loader: - kv 4: general.finetune str = Instruct
  2016. llama_model_loader: - kv 5: general.basename str = Mixtral
  2017. llama_model_loader: - kv 6: general.size_label str = 8x7B
  2018. llama_model_loader: - kv 7: general.license str = apache-2.0
  2019. llama_model_loader: - kv 8: general.base_model.count u32 = 1
  2020. llama_model_loader: - kv 9: general.base_model.0.name str = Mixtral 8x7B v0.1
  2021. llama_model_loader: - kv 10: general.base_model.0.version str = v0.1
  2022. llama_model_loader: - kv 11: general.base_model.0.organization str = Mistralai
  2023. llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/mistralai/Mixt...
  2024. llama_model_loader: - kv 13: general.languages arr[str,5] = ["fr", "it", "de", "es", "en"]
  2025. llama_model_loader: - kv 14: llama.block_count u32 = 32
  2026. llama_model_loader: - kv 15: llama.context_length u32 = 32768
  2027. llama_model_loader: - kv 16: llama.embedding_length u32 = 4096
  2028. llama_model_loader: - kv 17: llama.feed_forward_length u32 = 14336
  2029. llama_model_loader: - kv 18: llama.attention.head_count u32 = 32
  2030. llama_model_loader: - kv 19: llama.attention.head_count_kv u32 = 8
  2031. llama_model_loader: - kv 20: llama.rope.freq_base f32 = 1000000.000000
  2032. llama_model_loader: - kv 21: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
  2033. llama_model_loader: - kv 22: llama.expert_count u32 = 8
  2034. llama_model_loader: - kv 23: llama.expert_used_count u32 = 2
  2035. llama_model_loader: - kv 24: general.file_type u32 = 2
  2036. llama_model_loader: - kv 25: llama.vocab_size u32 = 32000
  2037. llama_model_loader: - kv 26: llama.rope.dimension_count u32 = 128
  2038. llama_model_loader: - kv 27: tokenizer.ggml.model str = llama
  2039. llama_model_loader: - kv 28: tokenizer.ggml.pre str = default
  2040. llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
  2041. llama_model_loader: - kv 30: tokenizer.ggml.scores arr[f32,32000] = [-1000.000000, -1000.000000, -1000.00...
  2042. llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,32000] = [3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
  2043. llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 1
  2044. llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 2
  2045. llama_model_loader: - kv 34: tokenizer.ggml.unknown_token_id u32 = 0
  2046. llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = true
  2047. llama_model_loader: - kv 36: tokenizer.ggml.add_eos_token bool = false
  2048. llama_model_loader: - kv 37: tokenizer.chat_template str = {%- if messages[0]['role'] == 'system...
  2049. llama_model_loader: - kv 38: tokenizer.ggml.add_space_prefix bool = false
  2050. llama_model_loader: - kv 39: general.quantization_version u32 = 2
  2051. llama_model_loader: - type f32: 97 tensors
  2052. llama_model_loader: - type q4_0: 161 tensors
  2053. llama_model_loader: - type q8_0: 64 tensors
  2054. llama_model_loader: - type q6_K: 1 tensors
  2055. print_info: file format = GGUF V3 (latest)
  2056. print_info: file type = Q4_0
  2057. print_info: file size = 24.63 GiB (4.53 BPW)
  2058. load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
  2059. load: special tokens cache size = 3
  2060. load: token to piece cache size = 0.1637 MB
  2061. print_info: arch = llama
  2062. print_info: vocab_only = 1
  2063. print_info: model type = ?B
  2064. print_info: model params = 46.70 B
  2065. print_info: general.name = Mixtral 8x7B Instruct v0.1
  2066. print_info: vocab type = SPM
  2067. print_info: n_vocab = 32000
  2068. print_info: n_merges = 0
  2069. print_info: BOS token = 1 '<s>'
  2070. print_info: EOS token = 2 '</s>'
  2071. print_info: UNK token = 0 '<unk>'
  2072. print_info: LF token = 13 '<0x0A>'
  2073. print_info: EOG token = 2 '</s>'
  2074. print_info: max token length = 48
  2075. llama_model_load: vocab only - skipping tensors
  2076. time=2025-07-19T17:51:20.393+02:00 level=INFO source=server.go:438 msg="starting llama server" cmd="C:\\Users\\Haldi\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --model C:\\Users\\Haldi\\.ollama\\models\\blobs\\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 --ctx-size 1879 --batch-size 512 --n-gpu-layers 21 --threads 16 --no-mmap --parallel 1 --port 52736"
  2077. time=2025-07-19T17:51:20.399+02:00 level=INFO source=sched.go:483 msg="loaded runners" count=1
  2078. time=2025-07-19T17:51:20.399+02:00 level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
  2079. time=2025-07-19T17:51:20.399+02:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server error"
  2080. time=2025-07-19T17:51:20.441+02:00 level=INFO source=runner.go:815 msg="starting go runner"
  2081. ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
  2082. ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
  2083. ggml_cuda_init: found 1 CUDA devices:
  2084. Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  2085. load_backend: loaded CUDA backend from C:\Users\Haldi\AppData\Local\Programs\Ollama\lib\ollama\ggml-cuda.dll
  2086. load_backend: loaded CPU backend from C:\Users\Haldi\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-icelake.dll
  2087. time=2025-07-19T17:51:20.517+02:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
  2088. time=2025-07-19T17:51:20.517+02:00 level=INFO source=runner.go:874 msg="Server listening on 127.0.0.1:52736"
  2089. llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4090) - 22994 MiB free
  2090. llama_model_loader: loaded meta data with 40 key-value pairs and 323 tensors from C:\Users\Haldi\.ollama\models\blobs\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 (version GGUF V3 (latest))
  2091. llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
  2092. llama_model_loader: - kv 0: general.architecture str = llama
  2093. llama_model_loader: - kv 1: general.type str = model
  2094. llama_model_loader: - kv 2: general.name str = Mixtral 8x7B Instruct v0.1
  2095. llama_model_loader: - kv 3: general.version str = v0.1
  2096. llama_model_loader: - kv 4: general.finetune str = Instruct
  2097. llama_model_loader: - kv 5: general.basename str = Mixtral
  2098. llama_model_loader: - kv 6: general.size_label str = 8x7B
  2099. llama_model_loader: - kv 7: general.license str = apache-2.0
  2100. llama_model_loader: - kv 8: general.base_model.count u32 = 1
  2101. llama_model_loader: - kv 9: general.base_model.0.name str = Mixtral 8x7B v0.1
  2102. llama_model_loader: - kv 10: general.base_model.0.version str = v0.1
  2103. llama_model_loader: - kv 11: general.base_model.0.organization str = Mistralai
  2104. llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/mistralai/Mixt...
  2105. llama_model_loader: - kv 13: general.languages arr[str,5] = ["fr", "it", "de", "es", "en"]
  2106. llama_model_loader: - kv 14: llama.block_count u32 = 32
  2107. llama_model_loader: - kv 15: llama.context_length u32 = 32768
  2108. llama_model_loader: - kv 16: llama.embedding_length u32 = 4096
  2109. llama_model_loader: - kv 17: llama.feed_forward_length u32 = 14336
  2110. llama_model_loader: - kv 18: llama.attention.head_count u32 = 32
  2111. llama_model_loader: - kv 19: llama.attention.head_count_kv u32 = 8
  2112. llama_model_loader: - kv 20: llama.rope.freq_base f32 = 1000000.000000
  2113. llama_model_loader: - kv 21: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
  2114. llama_model_loader: - kv 22: llama.expert_count u32 = 8
  2115. llama_model_loader: - kv 23: llama.expert_used_count u32 = 2
  2116. llama_model_loader: - kv 24: general.file_type u32 = 2
  2117. llama_model_loader: - kv 25: llama.vocab_size u32 = 32000
  2118. llama_model_loader: - kv 26: llama.rope.dimension_count u32 = 128
  2119. llama_model_loader: - kv 27: tokenizer.ggml.model str = llama
  2120. llama_model_loader: - kv 28: tokenizer.ggml.pre str = default
  2121. llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
  2122. llama_model_loader: - kv 30: tokenizer.ggml.scores arr[f32,32000] = [-1000.000000, -1000.000000, -1000.00...
  2123. llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,32000] = [3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
  2124. llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 1
  2125. llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 2
  2126. llama_model_loader: - kv 34: tokenizer.ggml.unknown_token_id u32 = 0
  2127. llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = true
  2128. llama_model_loader: - kv 36: tokenizer.ggml.add_eos_token bool = false
  2129. llama_model_loader: - kv 37: tokenizer.chat_template str = {%- if messages[0]['role'] == 'system...
  2130. llama_model_loader: - kv 38: tokenizer.ggml.add_space_prefix bool = false
  2131. llama_model_loader: - kv 39: general.quantization_version u32 = 2
  2132. llama_model_loader: - type f32: 97 tensors
  2133. llama_model_loader: - type q4_0: 161 tensors
  2134. llama_model_loader: - type q8_0: 64 tensors
  2135. llama_model_loader: - type q6_K: 1 tensors
  2136. print_info: file format = GGUF V3 (latest)
  2137. print_info: file type = Q4_0
  2138. print_info: file size = 24.63 GiB (4.53 BPW)
  2139. load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
  2140. load: special tokens cache size = 3
  2141. load: token to piece cache size = 0.1637 MB
  2142. print_info: arch = llama
  2143. print_info: vocab_only = 0
  2144. print_info: n_ctx_train = 32768
  2145. print_info: n_embd = 4096
  2146. print_info: n_layer = 32
  2147. print_info: n_head = 32
  2148. print_info: n_head_kv = 8
  2149. print_info: n_rot = 128
  2150. print_info: n_swa = 0
  2151. print_info: n_swa_pattern = 1
  2152. print_info: n_embd_head_k = 128
  2153. print_info: n_embd_head_v = 128
  2154. print_info: n_gqa = 4
  2155. print_info: n_embd_k_gqa = 1024
  2156. print_info: n_embd_v_gqa = 1024
  2157. print_info: f_norm_eps = 0.0e+00
  2158. print_info: f_norm_rms_eps = 1.0e-05
  2159. print_info: f_clamp_kqv = 0.0e+00
  2160. print_info: f_max_alibi_bias = 0.0e+00
  2161. print_info: f_logit_scale = 0.0e+00
  2162. print_info: f_attn_scale = 0.0e+00
  2163. print_info: n_ff = 14336
  2164. print_info: n_expert = 8
  2165. print_info: n_expert_used = 2
  2166. print_info: causal attn = 1
  2167. print_info: pooling type = 0
  2168. print_info: rope type = 0
  2169. print_info: rope scaling = linear
  2170. print_info: freq_base_train = 1000000.0
  2171. print_info: freq_scale_train = 1
  2172. print_info: n_ctx_orig_yarn = 32768
  2173. print_info: rope_finetuned = unknown
  2174. print_info: ssm_d_conv = 0
  2175. print_info: ssm_d_inner = 0
  2176. print_info: ssm_d_state = 0
  2177. print_info: ssm_dt_rank = 0
  2178. print_info: ssm_dt_b_c_rms = 0
  2179. print_info: model type = 8x7B
  2180. print_info: model params = 46.70 B
  2181. print_info: general.name = Mixtral 8x7B Instruct v0.1
  2182. print_info: vocab type = SPM
  2183. print_info: n_vocab = 32000
  2184. print_info: n_merges = 0
  2185. print_info: BOS token = 1 '<s>'
  2186. print_info: EOS token = 2 '</s>'
  2187. print_info: UNK token = 0 '<unk>'
  2188. print_info: LF token = 13 '<0x0A>'
  2189. print_info: EOG token = 2 '</s>'
  2190. print_info: max token length = 48
  2191. load_tensors: loading model tensors, this can take a while... (mmap = false)
  2192. time=2025-07-19T17:51:20.650+02:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
  2193. load_tensors: offloading 21 repeating layers to GPU
  2194. load_tensors: offloaded 21/33 layers to GPU
  2195. load_tensors: CUDA_Host model buffer size = 8782.09 MiB
  2196. load_tensors: CUDA0 model buffer size = 16435.78 MiB
  2197. llama_context: constructing llama_context
  2198. llama_context: n_seq_max = 1
  2199. llama_context: n_ctx = 1879
  2200. llama_context: n_ctx_per_seq = 1879
  2201. llama_context: n_batch = 512
  2202. llama_context: n_ubatch = 512
  2203. llama_context: causal_attn = 1
  2204. llama_context: flash_attn = 0
  2205. llama_context: freq_base = 1000000.0
  2206. llama_context: freq_scale = 1
  2207. llama_context: n_ctx_per_seq (1879) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
  2208. llama_context: CPU output buffer size = 0.14 MiB
  2209. llama_kv_cache_unified: kv_size = 1888, type_k = 'f16', type_v = 'f16', n_layer = 32, can_shift = 1, padding = 32
  2210. llama_kv_cache_unified: CUDA0 KV buffer size = 154.88 MiB
  2211. llama_kv_cache_unified: CPU KV buffer size = 81.12 MiB
  2212. llama_kv_cache_unified: KV self size = 236.00 MiB, K (f16): 118.00 MiB, V (f16): 118.00 MiB
  2213. llama_context: CUDA0 compute buffer size = 405.00 MiB
  2214. llama_context: CUDA_Host compute buffer size = 11.69 MiB
  2215. llama_context: graph nodes = 1574
  2216. llama_context: graph splits = 136 (with bs=512), 3 (with bs=1)
  2217. time=2025-07-19T17:51:36.181+02:00 level=INFO source=server.go:637 msg="llama runner started in 15.78 seconds"
  2218. [GIN] 2025/07/19 - 17:51:46 | 200 | 26.5624173s | 83.77.231.178 | POST "/api/generate"
  2219. [GIN] 2025/07/19 - 17:51:46 | 200 | 0s | 83.77.231.178 | GET "/api/ps"
  2220. time=2025-07-19T17:51:50.753+02:00 level=INFO source=server.go:135 msg="system memory" total="47.1 GiB" free="32.2 GiB" free_swap="28.9 GiB"
  2221. time=2025-07-19T17:51:50.754+02:00 level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=33 layers.offload=21 layers.split="" memory.available="[18.6 GiB]" memory.gpu_overhead="0 B" memory.required.full="26.8 GiB" memory.required.partial="18.2 GiB" memory.required.kv="260.4 MiB" memory.required.allocations="[18.2 GiB]" memory.weights.total="24.6 GiB" memory.weights.repeating="24.5 GiB" memory.weights.nonrepeating="102.6 MiB" memory.graph.full="166.3 MiB" memory.graph.partial="826.1 MiB"
  2222. llama_model_loader: loaded meta data with 40 key-value pairs and 323 tensors from C:\Users\Haldi\.ollama\models\blobs\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 (version GGUF V3 (latest))
  2223. llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
  2224. llama_model_loader: - kv 0: general.architecture str = llama
  2225. llama_model_loader: - kv 1: general.type str = model
  2226. llama_model_loader: - kv 2: general.name str = Mixtral 8x7B Instruct v0.1
  2227. llama_model_loader: - kv 3: general.version str = v0.1
  2228. llama_model_loader: - kv 4: general.finetune str = Instruct
  2229. llama_model_loader: - kv 5: general.basename str = Mixtral
  2230. llama_model_loader: - kv 6: general.size_label str = 8x7B
  2231. llama_model_loader: - kv 7: general.license str = apache-2.0
  2232. llama_model_loader: - kv 8: general.base_model.count u32 = 1
  2233. llama_model_loader: - kv 9: general.base_model.0.name str = Mixtral 8x7B v0.1
  2234. llama_model_loader: - kv 10: general.base_model.0.version str = v0.1
  2235. llama_model_loader: - kv 11: general.base_model.0.organization str = Mistralai
  2236. llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/mistralai/Mixt...
  2237. llama_model_loader: - kv 13: general.languages arr[str,5] = ["fr", "it", "de", "es", "en"]
  2238. llama_model_loader: - kv 14: llama.block_count u32 = 32
  2239. llama_model_loader: - kv 15: llama.context_length u32 = 32768
  2240. llama_model_loader: - kv 16: llama.embedding_length u32 = 4096
  2241. llama_model_loader: - kv 17: llama.feed_forward_length u32 = 14336
  2242. llama_model_loader: - kv 18: llama.attention.head_count u32 = 32
  2243. llama_model_loader: - kv 19: llama.attention.head_count_kv u32 = 8
  2244. llama_model_loader: - kv 20: llama.rope.freq_base f32 = 1000000.000000
  2245. llama_model_loader: - kv 21: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
  2246. llama_model_loader: - kv 22: llama.expert_count u32 = 8
  2247. llama_model_loader: - kv 23: llama.expert_used_count u32 = 2
  2248. llama_model_loader: - kv 24: general.file_type u32 = 2
  2249. llama_model_loader: - kv 25: llama.vocab_size u32 = 32000
  2250. llama_model_loader: - kv 26: llama.rope.dimension_count u32 = 128
  2251. llama_model_loader: - kv 27: tokenizer.ggml.model str = llama
  2252. llama_model_loader: - kv 28: tokenizer.ggml.pre str = default
  2253. llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
  2254. llama_model_loader: - kv 30: tokenizer.ggml.scores arr[f32,32000] = [-1000.000000, -1000.000000, -1000.00...
  2255. llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,32000] = [3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
  2256. llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 1
  2257. llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 2
  2258. llama_model_loader: - kv 34: tokenizer.ggml.unknown_token_id u32 = 0
  2259. llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = true
  2260. llama_model_loader: - kv 36: tokenizer.ggml.add_eos_token bool = false
  2261. llama_model_loader: - kv 37: tokenizer.chat_template str = {%- if messages[0]['role'] == 'system...
  2262. llama_model_loader: - kv 38: tokenizer.ggml.add_space_prefix bool = false
  2263. llama_model_loader: - kv 39: general.quantization_version u32 = 2
  2264. llama_model_loader: - type f32: 97 tensors
  2265. llama_model_loader: - type q4_0: 161 tensors
  2266. llama_model_loader: - type q8_0: 64 tensors
  2267. llama_model_loader: - type q6_K: 1 tensors
  2268. print_info: file format = GGUF V3 (latest)
  2269. print_info: file type = Q4_0
  2270. print_info: file size = 24.63 GiB (4.53 BPW)
  2271. load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
  2272. load: special tokens cache size = 3
  2273. load: token to piece cache size = 0.1637 MB
  2274. print_info: arch = llama
  2275. print_info: vocab_only = 1
  2276. print_info: model type = ?B
  2277. print_info: model params = 46.70 B
  2278. print_info: general.name = Mixtral 8x7B Instruct v0.1
  2279. print_info: vocab type = SPM
  2280. print_info: n_vocab = 32000
  2281. print_info: n_merges = 0
  2282. print_info: BOS token = 1 '<s>'
  2283. print_info: EOS token = 2 '</s>'
  2284. print_info: UNK token = 0 '<unk>'
  2285. print_info: LF token = 13 '<0x0A>'
  2286. print_info: EOG token = 2 '</s>'
  2287. print_info: max token length = 48
  2288. llama_model_load: vocab only - skipping tensors
  2289. time=2025-07-19T17:51:50.778+02:00 level=INFO source=server.go:438 msg="starting llama server" cmd="C:\\Users\\Haldi\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --model C:\\Users\\Haldi\\.ollama\\models\\blobs\\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 --ctx-size 2083 --batch-size 512 --n-gpu-layers 21 --threads 16 --no-mmap --parallel 1 --port 52782"
  2290. time=2025-07-19T17:51:50.781+02:00 level=INFO source=sched.go:483 msg="loaded runners" count=1
  2291. time=2025-07-19T17:51:50.781+02:00 level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
  2292. time=2025-07-19T17:51:50.781+02:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server error"
  2293. time=2025-07-19T17:51:50.817+02:00 level=INFO source=runner.go:815 msg="starting go runner"
  2294. ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
  2295. ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
  2296. ggml_cuda_init: found 1 CUDA devices:
  2297. Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  2298. load_backend: loaded CUDA backend from C:\Users\Haldi\AppData\Local\Programs\Ollama\lib\ollama\ggml-cuda.dll
  2299. load_backend: loaded CPU backend from C:\Users\Haldi\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-icelake.dll
  2300. time=2025-07-19T17:51:50.900+02:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
  2301. time=2025-07-19T17:51:50.900+02:00 level=INFO source=runner.go:874 msg="Server listening on 127.0.0.1:52782"
  2302. llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4090) - 22994 MiB free
  2303. llama_model_loader: loaded meta data with 40 key-value pairs and 323 tensors from C:\Users\Haldi\.ollama\models\blobs\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 (version GGUF V3 (latest))
  2304. llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
  2305. llama_model_loader: - kv 0: general.architecture str = llama
  2306. llama_model_loader: - kv 1: general.type str = model
  2307. llama_model_loader: - kv 2: general.name str = Mixtral 8x7B Instruct v0.1
  2308. llama_model_loader: - kv 3: general.version str = v0.1
  2309. llama_model_loader: - kv 4: general.finetune str = Instruct
  2310. llama_model_loader: - kv 5: general.basename str = Mixtral
  2311. llama_model_loader: - kv 6: general.size_label str = 8x7B
  2312. llama_model_loader: - kv 7: general.license str = apache-2.0
  2313. llama_model_loader: - kv 8: general.base_model.count u32 = 1
  2314. llama_model_loader: - kv 9: general.base_model.0.name str = Mixtral 8x7B v0.1
  2315. llama_model_loader: - kv 10: general.base_model.0.version str = v0.1
  2316. llama_model_loader: - kv 11: general.base_model.0.organization str = Mistralai
  2317. llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/mistralai/Mixt...
  2318. llama_model_loader: - kv 13: general.languages arr[str,5] = ["fr", "it", "de", "es", "en"]
  2319. llama_model_loader: - kv 14: llama.block_count u32 = 32
  2320. llama_model_loader: - kv 15: llama.context_length u32 = 32768
  2321. llama_model_loader: - kv 16: llama.embedding_length u32 = 4096
  2322. llama_model_loader: - kv 17: llama.feed_forward_length u32 = 14336
  2323. llama_model_loader: - kv 18: llama.attention.head_count u32 = 32
  2324. llama_model_loader: - kv 19: llama.attention.head_count_kv u32 = 8
  2325. llama_model_loader: - kv 20: llama.rope.freq_base f32 = 1000000.000000
  2326. llama_model_loader: - kv 21: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
  2327. llama_model_loader: - kv 22: llama.expert_count u32 = 8
  2328. llama_model_loader: - kv 23: llama.expert_used_count u32 = 2
  2329. llama_model_loader: - kv 24: general.file_type u32 = 2
  2330. llama_model_loader: - kv 25: llama.vocab_size u32 = 32000
  2331. llama_model_loader: - kv 26: llama.rope.dimension_count u32 = 128
  2332. llama_model_loader: - kv 27: tokenizer.ggml.model str = llama
  2333. llama_model_loader: - kv 28: tokenizer.ggml.pre str = default
  2334. llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
  2335. llama_model_loader: - kv 30: tokenizer.ggml.scores arr[f32,32000] = [-1000.000000, -1000.000000, -1000.00...
  2336. llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,32000] = [3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
  2337. llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 1
  2338. llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 2
  2339. llama_model_loader: - kv 34: tokenizer.ggml.unknown_token_id u32 = 0
  2340. llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = true
  2341. llama_model_loader: - kv 36: tokenizer.ggml.add_eos_token bool = false
  2342. llama_model_loader: - kv 37: tokenizer.chat_template str = {%- if messages[0]['role'] == 'system...
  2343. llama_model_loader: - kv 38: tokenizer.ggml.add_space_prefix bool = false
  2344. llama_model_loader: - kv 39: general.quantization_version u32 = 2
  2345. llama_model_loader: - type f32: 97 tensors
  2346. llama_model_loader: - type q4_0: 161 tensors
  2347. llama_model_loader: - type q8_0: 64 tensors
  2348. llama_model_loader: - type q6_K: 1 tensors
  2349. print_info: file format = GGUF V3 (latest)
  2350. print_info: file type = Q4_0
  2351. print_info: file size = 24.63 GiB (4.53 BPW)
  2352. load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
  2353. load: special tokens cache size = 3
  2354. load: token to piece cache size = 0.1637 MB
  2355. print_info: arch = llama
  2356. print_info: vocab_only = 0
  2357. print_info: n_ctx_train = 32768
  2358. print_info: n_embd = 4096
  2359. print_info: n_layer = 32
  2360. print_info: n_head = 32
  2361. print_info: n_head_kv = 8
  2362. print_info: n_rot = 128
  2363. print_info: n_swa = 0
  2364. print_info: n_swa_pattern = 1
  2365. print_info: n_embd_head_k = 128
  2366. print_info: n_embd_head_v = 128
  2367. print_info: n_gqa = 4
  2368. print_info: n_embd_k_gqa = 1024
  2369. print_info: n_embd_v_gqa = 1024
  2370. print_info: f_norm_eps = 0.0e+00
  2371. print_info: f_norm_rms_eps = 1.0e-05
  2372. print_info: f_clamp_kqv = 0.0e+00
  2373. print_info: f_max_alibi_bias = 0.0e+00
  2374. print_info: f_logit_scale = 0.0e+00
  2375. print_info: f_attn_scale = 0.0e+00
  2376. print_info: n_ff = 14336
  2377. print_info: n_expert = 8
  2378. print_info: n_expert_used = 2
  2379. print_info: causal attn = 1
  2380. print_info: pooling type = 0
  2381. print_info: rope type = 0
  2382. print_info: rope scaling = linear
  2383. print_info: freq_base_train = 1000000.0
  2384. print_info: freq_scale_train = 1
  2385. print_info: n_ctx_orig_yarn = 32768
  2386. print_info: rope_finetuned = unknown
  2387. print_info: ssm_d_conv = 0
  2388. print_info: ssm_d_inner = 0
  2389. print_info: ssm_d_state = 0
  2390. print_info: ssm_dt_rank = 0
  2391. print_info: ssm_dt_b_c_rms = 0
  2392. print_info: model type = 8x7B
  2393. print_info: model params = 46.70 B
  2394. print_info: general.name = Mixtral 8x7B Instruct v0.1
  2395. print_info: vocab type = SPM
  2396. print_info: n_vocab = 32000
  2397. print_info: n_merges = 0
  2398. print_info: BOS token = 1 '<s>'
  2399. print_info: EOS token = 2 '</s>'
  2400. print_info: UNK token = 0 '<unk>'
  2401. print_info: LF token = 13 '<0x0A>'
  2402. print_info: EOG token = 2 '</s>'
  2403. print_info: max token length = 48
  2404. load_tensors: loading model tensors, this can take a while... (mmap = false)
  2405. time=2025-07-19T17:51:51.033+02:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
  2406. load_tensors: offloading 21 repeating layers to GPU
  2407. load_tensors: offloaded 21/33 layers to GPU
  2408. load_tensors: CUDA_Host model buffer size = 8782.09 MiB
  2409. load_tensors: CUDA0 model buffer size = 16435.78 MiB
  2410. llama_context: constructing llama_context
  2411. llama_context: n_seq_max = 1
  2412. llama_context: n_ctx = 2083
  2413. llama_context: n_ctx_per_seq = 2083
  2414. llama_context: n_batch = 512
  2415. llama_context: n_ubatch = 512
  2416. llama_context: causal_attn = 1
  2417. llama_context: flash_attn = 0
  2418. llama_context: freq_base = 1000000.0
  2419. llama_context: freq_scale = 1
  2420. llama_context: n_ctx_per_seq (2083) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
  2421. llama_context: CPU output buffer size = 0.14 MiB
  2422. llama_kv_cache_unified: kv_size = 2112, type_k = 'f16', type_v = 'f16', n_layer = 32, can_shift = 1, padding = 32
  2423. llama_kv_cache_unified: CUDA0 KV buffer size = 173.25 MiB
  2424. llama_kv_cache_unified: CPU KV buffer size = 90.75 MiB
  2425. llama_kv_cache_unified: KV self size = 264.00 MiB, K (f16): 132.00 MiB, V (f16): 132.00 MiB
  2426. llama_context: CUDA0 compute buffer size = 393.13 MiB
  2427. llama_context: CUDA_Host compute buffer size = 12.13 MiB
  2428. llama_context: graph nodes = 1574
  2429. llama_context: graph splits = 136 (with bs=512), 3 (with bs=1)
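
The KV-cache figures in the block above follow directly from the hyperparameters printed earlier in this log. A minimal sketch of the arithmetic, assuming the usual layout of one f16 K vector and one f16 V vector of n_head_kv * head_dim per layer per cache slot:

    # Minimal sketch of the "KV self size = 264.00 MiB" figure above; values are
    # read from this log, the one-K-plus-one-V f16 layout per slot is an assumption.
    kv_size   = 2112   # llama_kv_cache_unified: kv_size
    n_layer   = 32     # print_info: n_layer
    n_head_kv = 8      # print_info: n_head_kv
    head_dim  = 128    # print_info: n_embd_head_k / n_embd_head_v
    f16_bytes = 2

    k_bytes = kv_size * n_layer * n_head_kv * head_dim * f16_bytes
    v_bytes = k_bytes
    print(k_bytes / 2**20, (k_bytes + v_bytes) / 2**20)   # 132.0 MiB, 264.0 MiB
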
  2430. time=2025-07-19T17:52:06.565+02:00 level=INFO source=server.go:637 msg="llama runner started in 15.78 seconds"
  2431. [GIN] 2025/07/19 - 17:52:15 | 200 | 25.1089685s | 83.77.231.178 | POST "/api/generate"
  2432. time=2025-07-19T17:52:19.894+02:00 level=INFO source=server.go:135 msg="system memory" total="47.1 GiB" free="32.4 GiB" free_swap="29.2 GiB"
  2433. time=2025-07-19T17:52:19.895+02:00 level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=33 layers.offload=21 layers.split="" memory.available="[18.6 GiB]" memory.gpu_overhead="0 B" memory.required.full="26.9 GiB" memory.required.partial="18.3 GiB" memory.required.kv="277.5 MiB" memory.required.allocations="[18.3 GiB]" memory.weights.total="24.6 GiB" memory.weights.repeating="24.5 GiB" memory.weights.nonrepeating="102.6 MiB" memory.graph.full="175.1 MiB" memory.graph.partial="826.4 MiB"
  2434. llama_model_loader: loaded meta data with 40 key-value pairs and 323 tensors from C:\Users\Haldi\.ollama\models\blobs\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 (version GGUF V3 (latest))
  2435. llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
  2436. llama_model_loader: - kv 0: general.architecture str = llama
  2437. llama_model_loader: - kv 1: general.type str = model
  2438. llama_model_loader: - kv 2: general.name str = Mixtral 8x7B Instruct v0.1
  2439. llama_model_loader: - kv 3: general.version str = v0.1
  2440. llama_model_loader: - kv 4: general.finetune str = Instruct
  2441. llama_model_loader: - kv 5: general.basename str = Mixtral
  2442. llama_model_loader: - kv 6: general.size_label str = 8x7B
  2443. llama_model_loader: - kv 7: general.license str = apache-2.0
  2444. llama_model_loader: - kv 8: general.base_model.count u32 = 1
  2445. llama_model_loader: - kv 9: general.base_model.0.name str = Mixtral 8x7B v0.1
  2446. llama_model_loader: - kv 10: general.base_model.0.version str = v0.1
  2447. llama_model_loader: - kv 11: general.base_model.0.organization str = Mistralai
  2448. llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/mistralai/Mixt...
  2449. llama_model_loader: - kv 13: general.languages arr[str,5] = ["fr", "it", "de", "es", "en"]
  2450. llama_model_loader: - kv 14: llama.block_count u32 = 32
  2451. llama_model_loader: - kv 15: llama.context_length u32 = 32768
  2452. llama_model_loader: - kv 16: llama.embedding_length u32 = 4096
  2453. llama_model_loader: - kv 17: llama.feed_forward_length u32 = 14336
  2454. llama_model_loader: - kv 18: llama.attention.head_count u32 = 32
  2455. llama_model_loader: - kv 19: llama.attention.head_count_kv u32 = 8
  2456. llama_model_loader: - kv 20: llama.rope.freq_base f32 = 1000000.000000
  2457. llama_model_loader: - kv 21: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
  2458. llama_model_loader: - kv 22: llama.expert_count u32 = 8
  2459. llama_model_loader: - kv 23: llama.expert_used_count u32 = 2
  2460. llama_model_loader: - kv 24: general.file_type u32 = 2
  2461. llama_model_loader: - kv 25: llama.vocab_size u32 = 32000
  2462. llama_model_loader: - kv 26: llama.rope.dimension_count u32 = 128
  2463. llama_model_loader: - kv 27: tokenizer.ggml.model str = llama
  2464. llama_model_loader: - kv 28: tokenizer.ggml.pre str = default
  2465. llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
  2466. llama_model_loader: - kv 30: tokenizer.ggml.scores arr[f32,32000] = [-1000.000000, -1000.000000, -1000.00...
  2467. llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,32000] = [3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
  2468. llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 1
  2469. llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 2
  2470. llama_model_loader: - kv 34: tokenizer.ggml.unknown_token_id u32 = 0
  2471. llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = true
  2472. llama_model_loader: - kv 36: tokenizer.ggml.add_eos_token bool = false
  2473. llama_model_loader: - kv 37: tokenizer.chat_template str = {%- if messages[0]['role'] == 'system...
  2474. llama_model_loader: - kv 38: tokenizer.ggml.add_space_prefix bool = false
  2475. llama_model_loader: - kv 39: general.quantization_version u32 = 2
  2476. llama_model_loader: - type f32: 97 tensors
  2477. llama_model_loader: - type q4_0: 161 tensors
  2478. llama_model_loader: - type q8_0: 64 tensors
  2479. llama_model_loader: - type q6_K: 1 tensors
  2480. print_info: file format = GGUF V3 (latest)
  2481. print_info: file type = Q4_0
  2482. print_info: file size = 24.63 GiB (4.53 BPW)
  2483. load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
  2484. load: special tokens cache size = 3
  2485. load: token to piece cache size = 0.1637 MB
  2486. print_info: arch = llama
  2487. print_info: vocab_only = 1
  2488. print_info: model type = ?B
  2489. print_info: model params = 46.70 B
  2490. print_info: general.name = Mixtral 8x7B Instruct v0.1
  2491. print_info: vocab type = SPM
  2492. print_info: n_vocab = 32000
  2493. print_info: n_merges = 0
  2494. print_info: BOS token = 1 '<s>'
  2495. print_info: EOS token = 2 '</s>'
  2496. print_info: UNK token = 0 '<unk>'
  2497. print_info: LF token = 13 '<0x0A>'
  2498. print_info: EOG token = 2 '</s>'
  2499. print_info: max token length = 48
  2500. llama_model_load: vocab only - skipping tensors
  2501. time=2025-07-19T17:52:19.917+02:00 level=INFO source=server.go:438 msg="starting llama server" cmd="C:\\Users\\Haldi\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --model C:\\Users\\Haldi\\.ollama\\models\\blobs\\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 --ctx-size 2220 --batch-size 512 --n-gpu-layers 21 --threads 16 --no-mmap --parallel 1 --port 52810"
  2502. time=2025-07-19T17:52:19.920+02:00 level=INFO source=sched.go:483 msg="loaded runners" count=1
  2503. time=2025-07-19T17:52:19.921+02:00 level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
  2504. time=2025-07-19T17:52:19.921+02:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server error"
  2505. time=2025-07-19T17:52:19.964+02:00 level=INFO source=runner.go:815 msg="starting go runner"
  2506. ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
  2507. ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
  2508. ggml_cuda_init: found 1 CUDA devices:
  2509. Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  2510. load_backend: loaded CUDA backend from C:\Users\Haldi\AppData\Local\Programs\Ollama\lib\ollama\ggml-cuda.dll
  2511. load_backend: loaded CPU backend from C:\Users\Haldi\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-icelake.dll
  2512. time=2025-07-19T17:52:20.047+02:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
  2513. time=2025-07-19T17:52:20.047+02:00 level=INFO source=runner.go:874 msg="Server listening on 127.0.0.1:52810"
  2514. llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4090) - 22994 MiB free
  2515. llama_model_loader: loaded meta data with 40 key-value pairs and 323 tensors from C:\Users\Haldi\.ollama\models\blobs\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 (version GGUF V3 (latest))
  2516. llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
  2517. llama_model_loader: - kv 0: general.architecture str = llama
  2518. llama_model_loader: - kv 1: general.type str = model
  2519. llama_model_loader: - kv 2: general.name str = Mixtral 8x7B Instruct v0.1
  2520. llama_model_loader: - kv 3: general.version str = v0.1
  2521. llama_model_loader: - kv 4: general.finetune str = Instruct
  2522. llama_model_loader: - kv 5: general.basename str = Mixtral
  2523. llama_model_loader: - kv 6: general.size_label str = 8x7B
  2524. llama_model_loader: - kv 7: general.license str = apache-2.0
  2525. llama_model_loader: - kv 8: general.base_model.count u32 = 1
  2526. llama_model_loader: - kv 9: general.base_model.0.name str = Mixtral 8x7B v0.1
  2527. llama_model_loader: - kv 10: general.base_model.0.version str = v0.1
  2528. llama_model_loader: - kv 11: general.base_model.0.organization str = Mistralai
  2529. llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/mistralai/Mixt...
  2530. llama_model_loader: - kv 13: general.languages arr[str,5] = ["fr", "it", "de", "es", "en"]
  2531. llama_model_loader: - kv 14: llama.block_count u32 = 32
  2532. llama_model_loader: - kv 15: llama.context_length u32 = 32768
  2533. llama_model_loader: - kv 16: llama.embedding_length u32 = 4096
  2534. llama_model_loader: - kv 17: llama.feed_forward_length u32 = 14336
  2535. llama_model_loader: - kv 18: llama.attention.head_count u32 = 32
  2536. llama_model_loader: - kv 19: llama.attention.head_count_kv u32 = 8
  2537. llama_model_loader: - kv 20: llama.rope.freq_base f32 = 1000000.000000
  2538. llama_model_loader: - kv 21: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
  2539. llama_model_loader: - kv 22: llama.expert_count u32 = 8
  2540. llama_model_loader: - kv 23: llama.expert_used_count u32 = 2
  2541. llama_model_loader: - kv 24: general.file_type u32 = 2
  2542. llama_model_loader: - kv 25: llama.vocab_size u32 = 32000
  2543. llama_model_loader: - kv 26: llama.rope.dimension_count u32 = 128
  2544. llama_model_loader: - kv 27: tokenizer.ggml.model str = llama
  2545. llama_model_loader: - kv 28: tokenizer.ggml.pre str = default
  2546. llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
  2547. llama_model_loader: - kv 30: tokenizer.ggml.scores arr[f32,32000] = [-1000.000000, -1000.000000, -1000.00...
  2548. llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,32000] = [3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
  2549. llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 1
  2550. llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 2
  2551. llama_model_loader: - kv 34: tokenizer.ggml.unknown_token_id u32 = 0
  2552. llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = true
  2553. llama_model_loader: - kv 36: tokenizer.ggml.add_eos_token bool = false
  2554. llama_model_loader: - kv 37: tokenizer.chat_template str = {%- if messages[0]['role'] == 'system...
  2555. llama_model_loader: - kv 38: tokenizer.ggml.add_space_prefix bool = false
  2556. llama_model_loader: - kv 39: general.quantization_version u32 = 2
  2557. llama_model_loader: - type f32: 97 tensors
  2558. llama_model_loader: - type q4_0: 161 tensors
  2559. llama_model_loader: - type q8_0: 64 tensors
  2560. llama_model_loader: - type q6_K: 1 tensors
  2561. print_info: file format = GGUF V3 (latest)
  2562. print_info: file type = Q4_0
  2563. print_info: file size = 24.63 GiB (4.53 BPW)
  2564. load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
  2565. load: special tokens cache size = 3
  2566. load: token to piece cache size = 0.1637 MB
  2567. print_info: arch = llama
  2568. print_info: vocab_only = 0
  2569. print_info: n_ctx_train = 32768
  2570. print_info: n_embd = 4096
  2571. print_info: n_layer = 32
  2572. print_info: n_head = 32
  2573. print_info: n_head_kv = 8
  2574. print_info: n_rot = 128
  2575. print_info: n_swa = 0
  2576. print_info: n_swa_pattern = 1
  2577. print_info: n_embd_head_k = 128
  2578. print_info: n_embd_head_v = 128
  2579. print_info: n_gqa = 4
  2580. print_info: n_embd_k_gqa = 1024
  2581. print_info: n_embd_v_gqa = 1024
  2582. print_info: f_norm_eps = 0.0e+00
  2583. print_info: f_norm_rms_eps = 1.0e-05
  2584. print_info: f_clamp_kqv = 0.0e+00
  2585. print_info: f_max_alibi_bias = 0.0e+00
  2586. print_info: f_logit_scale = 0.0e+00
  2587. print_info: f_attn_scale = 0.0e+00
  2588. print_info: n_ff = 14336
  2589. print_info: n_expert = 8
  2590. print_info: n_expert_used = 2
  2591. print_info: causal attn = 1
  2592. print_info: pooling type = 0
  2593. print_info: rope type = 0
  2594. print_info: rope scaling = linear
  2595. print_info: freq_base_train = 1000000.0
  2596. print_info: freq_scale_train = 1
  2597. print_info: n_ctx_orig_yarn = 32768
  2598. print_info: rope_finetuned = unknown
  2599. print_info: ssm_d_conv = 0
  2600. print_info: ssm_d_inner = 0
  2601. print_info: ssm_d_state = 0
  2602. print_info: ssm_dt_rank = 0
  2603. print_info: ssm_dt_b_c_rms = 0
  2604. print_info: model type = 8x7B
  2605. print_info: model params = 46.70 B
  2606. print_info: general.name = Mixtral 8x7B Instruct v0.1
  2607. print_info: vocab type = SPM
  2608. print_info: n_vocab = 32000
  2609. print_info: n_merges = 0
  2610. print_info: BOS token = 1 '<s>'
  2611. print_info: EOS token = 2 '</s>'
  2612. print_info: UNK token = 0 '<unk>'
  2613. print_info: LF token = 13 '<0x0A>'
  2614. print_info: EOG token = 2 '</s>'
  2615. print_info: max token length = 48
  2616. load_tensors: loading model tensors, this can take a while... (mmap = false)
  2617. time=2025-07-19T17:52:20.172+02:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
  2618. load_tensors: offloading 21 repeating layers to GPU
  2619. load_tensors: offloaded 21/33 layers to GPU
  2620. load_tensors: CUDA_Host model buffer size = 8782.09 MiB
  2621. load_tensors: CUDA0 model buffer size = 16435.78 MiB
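
The buffer split above is consistent with the weight figures reported elsewhere in this log: the 8782 MiB host buffer plus the 16436 MiB CUDA0 buffer add up to the 24.63 GiB file size, and with roughly 24.5 GiB of repeating weights spread over 32 blocks (about 784 MiB per block), 21 offloaded blocks land close to the CUDA0 figure. A rough sketch, assuming a uniform per-block weight size (it is only approximately uniform in practice):

    # Rough check of the CUDA_Host / CUDA0 split above; numbers are taken from
    # this log, the uniform per-block weight size is an assumption.
    host_mib, gpu_mib   = 8782.09, 16435.78
    repeating_mib       = 24.5 * 1024    # memory.weights.repeating (scheduler offload line)
    n_blocks, offloaded = 32, 21         # llama.block_count, "offloaded 21/33 layers"

    print((host_mib + gpu_mib) / 1024)            # ~24.63 GiB, matches "file size"
    print(offloaded * repeating_mib / n_blocks)   # ~16464 MiB, close to the CUDA0 buffer
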
  2622. [GIN] 2025/07/19 - 17:52:25 | 200 | 0s | 83.77.231.178 | GET "/api/ps"
  2623. llama_context: constructing llama_context
  2624. llama_context: n_seq_max = 1
  2625. llama_context: n_ctx = 2220
  2626. llama_context: n_ctx_per_seq = 2220
  2627. llama_context: n_batch = 512
  2628. llama_context: n_ubatch = 512
  2629. llama_context: causal_attn = 1
  2630. llama_context: flash_attn = 0
  2631. llama_context: freq_base = 1000000.0
  2632. llama_context: freq_scale = 1
  2633. llama_context: n_ctx_per_seq (2220) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
  2634. llama_context: CPU output buffer size = 0.14 MiB
  2635. llama_kv_cache_unified: kv_size = 2240, type_k = 'f16', type_v = 'f16', n_layer = 32, can_shift = 1, padding = 32
  2636. llama_kv_cache_unified: CUDA0 KV buffer size = 183.75 MiB
  2637. llama_kv_cache_unified: CPU KV buffer size = 96.25 MiB
  2638. llama_kv_cache_unified: KV self size = 280.00 MiB, K (f16): 140.00 MiB, V (f16): 140.00 MiB
  2639. llama_context: CUDA0 compute buffer size = 393.38 MiB
  2640. llama_context: CUDA_Host compute buffer size = 12.38 MiB
  2641. llama_context: graph nodes = 1574
  2642. llama_context: graph splits = 136 (with bs=512), 3 (with bs=1)
  2643. time=2025-07-19T17:52:35.956+02:00 level=INFO source=server.go:637 msg="llama runner started in 16.04 seconds"
  2644. time=2025-07-19T17:52:35.961+02:00 level=WARN source=runner.go:128 msg="truncating input prompt" limit=2220 prompt=2311 keep=5 new=2220
  2645. [GIN] 2025/07/19 - 17:52:44 | 200 | 25.2718447s | 83.77.231.178 | POST "/api/generate"
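
The truncation warning above (limit=2220, prompt=2311, keep=5) means the 2311-token prompt no longer fits the 2220-token context, so the runner keeps the first keep tokens and the most recent part of the prompt and drops the middle. A hypothetical illustration of that shape of truncation (not Ollama's actual source, just the behaviour the numbers suggest):

    # Hypothetical keep-head-plus-tail truncation consistent with
    #   "truncating input prompt" limit=2220 prompt=2311 keep=5 new=2220
    def truncate(tokens, limit, keep):
        if len(tokens) <= limit:
            return tokens
        return tokens[:keep] + tokens[-(limit - keep):]

    prompt = list(range(2311))
    out = truncate(prompt, limit=2220, keep=5)
    print(len(out), out[:5], out[5])   # 2220, first 5 tokens kept, then token 96 onward
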
  2646. time=2025-07-19T17:52:49.080+02:00 level=INFO source=server.go:135 msg="system memory" total="47.1 GiB" free="32.3 GiB" free_swap="29.1 GiB"
  2647. time=2025-07-19T17:52:49.080+02:00 level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=33 layers.offload=21 layers.split="" memory.available="[18.6 GiB]" memory.gpu_overhead="0 B" memory.required.full="26.9 GiB" memory.required.partial="18.3 GiB" memory.required.kv="272.5 MiB" memory.required.allocations="[18.3 GiB]" memory.weights.total="24.6 GiB" memory.weights.repeating="24.5 GiB" memory.weights.nonrepeating="102.6 MiB" memory.graph.full="172.5 MiB" memory.graph.partial="826.3 MiB"
  2648. llama_model_loader: loaded meta data with 40 key-value pairs and 323 tensors from C:\Users\Haldi\.ollama\models\blobs\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 (version GGUF V3 (latest))
  2649. llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
  2650. llama_model_loader: - kv 0: general.architecture str = llama
  2651. llama_model_loader: - kv 1: general.type str = model
  2652. llama_model_loader: - kv 2: general.name str = Mixtral 8x7B Instruct v0.1
  2653. llama_model_loader: - kv 3: general.version str = v0.1
  2654. llama_model_loader: - kv 4: general.finetune str = Instruct
  2655. llama_model_loader: - kv 5: general.basename str = Mixtral
  2656. llama_model_loader: - kv 6: general.size_label str = 8x7B
  2657. llama_model_loader: - kv 7: general.license str = apache-2.0
  2658. llama_model_loader: - kv 8: general.base_model.count u32 = 1
  2659. llama_model_loader: - kv 9: general.base_model.0.name str = Mixtral 8x7B v0.1
  2660. llama_model_loader: - kv 10: general.base_model.0.version str = v0.1
  2661. llama_model_loader: - kv 11: general.base_model.0.organization str = Mistralai
  2662. llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/mistralai/Mixt...
  2663. llama_model_loader: - kv 13: general.languages arr[str,5] = ["fr", "it", "de", "es", "en"]
  2664. llama_model_loader: - kv 14: llama.block_count u32 = 32
  2665. llama_model_loader: - kv 15: llama.context_length u32 = 32768
  2666. llama_model_loader: - kv 16: llama.embedding_length u32 = 4096
  2667. llama_model_loader: - kv 17: llama.feed_forward_length u32 = 14336
  2668. llama_model_loader: - kv 18: llama.attention.head_count u32 = 32
  2669. llama_model_loader: - kv 19: llama.attention.head_count_kv u32 = 8
  2670. llama_model_loader: - kv 20: llama.rope.freq_base f32 = 1000000.000000
  2671. llama_model_loader: - kv 21: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
  2672. llama_model_loader: - kv 22: llama.expert_count u32 = 8
  2673. llama_model_loader: - kv 23: llama.expert_used_count u32 = 2
  2674. llama_model_loader: - kv 24: general.file_type u32 = 2
  2675. llama_model_loader: - kv 25: llama.vocab_size u32 = 32000
  2676. llama_model_loader: - kv 26: llama.rope.dimension_count u32 = 128
  2677. llama_model_loader: - kv 27: tokenizer.ggml.model str = llama
  2678. llama_model_loader: - kv 28: tokenizer.ggml.pre str = default
  2679. llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
  2680. llama_model_loader: - kv 30: tokenizer.ggml.scores arr[f32,32000] = [-1000.000000, -1000.000000, -1000.00...
  2681. llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,32000] = [3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
  2682. llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 1
  2683. llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 2
  2684. llama_model_loader: - kv 34: tokenizer.ggml.unknown_token_id u32 = 0
  2685. llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = true
  2686. llama_model_loader: - kv 36: tokenizer.ggml.add_eos_token bool = false
  2687. llama_model_loader: - kv 37: tokenizer.chat_template str = {%- if messages[0]['role'] == 'system...
  2688. llama_model_loader: - kv 38: tokenizer.ggml.add_space_prefix bool = false
  2689. llama_model_loader: - kv 39: general.quantization_version u32 = 2
  2690. llama_model_loader: - type f32: 97 tensors
  2691. llama_model_loader: - type q4_0: 161 tensors
  2692. llama_model_loader: - type q8_0: 64 tensors
  2693. llama_model_loader: - type q6_K: 1 tensors
  2694. print_info: file format = GGUF V3 (latest)
  2695. print_info: file type = Q4_0
  2696. print_info: file size = 24.63 GiB (4.53 BPW)
  2697. load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
  2698. load: special tokens cache size = 3
  2699. load: token to piece cache size = 0.1637 MB
  2700. print_info: arch = llama
  2701. print_info: vocab_only = 1
  2702. print_info: model type = ?B
  2703. print_info: model params = 46.70 B
  2704. print_info: general.name = Mixtral 8x7B Instruct v0.1
  2705. print_info: vocab type = SPM
  2706. print_info: n_vocab = 32000
  2707. print_info: n_merges = 0
  2708. print_info: BOS token = 1 '<s>'
  2709. print_info: EOS token = 2 '</s>'
  2710. print_info: UNK token = 0 '<unk>'
  2711. print_info: LF token = 13 '<0x0A>'
  2712. print_info: EOG token = 2 '</s>'
  2713. print_info: max token length = 48
  2714. llama_model_load: vocab only - skipping tensors
  2715. time=2025-07-19T17:52:49.106+02:00 level=INFO source=server.go:438 msg="starting llama server" cmd="C:\\Users\\Haldi\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --model C:\\Users\\Haldi\\.ollama\\models\\blobs\\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 --ctx-size 2180 --batch-size 512 --n-gpu-layers 21 --threads 16 --no-mmap --parallel 1 --port 52922"
  2716. time=2025-07-19T17:52:49.108+02:00 level=INFO source=sched.go:483 msg="loaded runners" count=1
  2717. time=2025-07-19T17:52:49.108+02:00 level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
  2718. time=2025-07-19T17:52:49.109+02:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server error"
  2719. time=2025-07-19T17:52:49.148+02:00 level=INFO source=runner.go:815 msg="starting go runner"
  2720. ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
  2721. ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
  2722. ggml_cuda_init: found 1 CUDA devices:
  2723. Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  2724. load_backend: loaded CUDA backend from C:\Users\Haldi\AppData\Local\Programs\Ollama\lib\ollama\ggml-cuda.dll
  2725. load_backend: loaded CPU backend from C:\Users\Haldi\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-icelake.dll
  2726. time=2025-07-19T17:52:49.231+02:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
  2727. time=2025-07-19T17:52:49.231+02:00 level=INFO source=runner.go:874 msg="Server listening on 127.0.0.1:52922"
  2728. llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4090) - 22994 MiB free
  2729. llama_model_loader: loaded meta data with 40 key-value pairs and 323 tensors from C:\Users\Haldi\.ollama\models\blobs\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 (version GGUF V3 (latest))
  2730. llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
  2731. llama_model_loader: - kv 0: general.architecture str = llama
  2732. llama_model_loader: - kv 1: general.type str = model
  2733. llama_model_loader: - kv 2: general.name str = Mixtral 8x7B Instruct v0.1
  2734. llama_model_loader: - kv 3: general.version str = v0.1
  2735. llama_model_loader: - kv 4: general.finetune str = Instruct
  2736. llama_model_loader: - kv 5: general.basename str = Mixtral
  2737. llama_model_loader: - kv 6: general.size_label str = 8x7B
  2738. llama_model_loader: - kv 7: general.license str = apache-2.0
  2739. llama_model_loader: - kv 8: general.base_model.count u32 = 1
  2740. llama_model_loader: - kv 9: general.base_model.0.name str = Mixtral 8x7B v0.1
  2741. llama_model_loader: - kv 10: general.base_model.0.version str = v0.1
  2742. llama_model_loader: - kv 11: general.base_model.0.organization str = Mistralai
  2743. llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/mistralai/Mixt...
  2744. llama_model_loader: - kv 13: general.languages arr[str,5] = ["fr", "it", "de", "es", "en"]
  2745. llama_model_loader: - kv 14: llama.block_count u32 = 32
  2746. llama_model_loader: - kv 15: llama.context_length u32 = 32768
  2747. llama_model_loader: - kv 16: llama.embedding_length u32 = 4096
  2748. llama_model_loader: - kv 17: llama.feed_forward_length u32 = 14336
  2749. llama_model_loader: - kv 18: llama.attention.head_count u32 = 32
  2750. llama_model_loader: - kv 19: llama.attention.head_count_kv u32 = 8
  2751. llama_model_loader: - kv 20: llama.rope.freq_base f32 = 1000000.000000
  2752. llama_model_loader: - kv 21: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
  2753. llama_model_loader: - kv 22: llama.expert_count u32 = 8
  2754. llama_model_loader: - kv 23: llama.expert_used_count u32 = 2
  2755. llama_model_loader: - kv 24: general.file_type u32 = 2
  2756. llama_model_loader: - kv 25: llama.vocab_size u32 = 32000
  2757. llama_model_loader: - kv 26: llama.rope.dimension_count u32 = 128
  2758. llama_model_loader: - kv 27: tokenizer.ggml.model str = llama
  2759. llama_model_loader: - kv 28: tokenizer.ggml.pre str = default
  2760. llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
  2761. llama_model_loader: - kv 30: tokenizer.ggml.scores arr[f32,32000] = [-1000.000000, -1000.000000, -1000.00...
  2762. llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,32000] = [3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
  2763. llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 1
  2764. llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 2
  2765. llama_model_loader: - kv 34: tokenizer.ggml.unknown_token_id u32 = 0
  2766. llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = true
  2767. llama_model_loader: - kv 36: tokenizer.ggml.add_eos_token bool = false
  2768. llama_model_loader: - kv 37: tokenizer.chat_template str = {%- if messages[0]['role'] == 'system...
  2769. llama_model_loader: - kv 38: tokenizer.ggml.add_space_prefix bool = false
  2770. llama_model_loader: - kv 39: general.quantization_version u32 = 2
  2771. llama_model_loader: - type f32: 97 tensors
  2772. llama_model_loader: - type q4_0: 161 tensors
  2773. llama_model_loader: - type q8_0: 64 tensors
  2774. llama_model_loader: - type q6_K: 1 tensors
  2775. print_info: file format = GGUF V3 (latest)
  2776. print_info: file type = Q4_0
  2777. print_info: file size = 24.63 GiB (4.53 BPW)
  2778. load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
  2779. load: special tokens cache size = 3
  2780. load: token to piece cache size = 0.1637 MB
  2781. print_info: arch = llama
  2782. print_info: vocab_only = 0
  2783. print_info: n_ctx_train = 32768
  2784. print_info: n_embd = 4096
  2785. print_info: n_layer = 32
  2786. print_info: n_head = 32
  2787. print_info: n_head_kv = 8
  2788. print_info: n_rot = 128
  2789. print_info: n_swa = 0
  2790. print_info: n_swa_pattern = 1
  2791. print_info: n_embd_head_k = 128
  2792. print_info: n_embd_head_v = 128
  2793. print_info: n_gqa = 4
  2794. print_info: n_embd_k_gqa = 1024
  2795. print_info: n_embd_v_gqa = 1024
  2796. print_info: f_norm_eps = 0.0e+00
  2797. print_info: f_norm_rms_eps = 1.0e-05
  2798. print_info: f_clamp_kqv = 0.0e+00
  2799. print_info: f_max_alibi_bias = 0.0e+00
  2800. print_info: f_logit_scale = 0.0e+00
  2801. print_info: f_attn_scale = 0.0e+00
  2802. print_info: n_ff = 14336
  2803. print_info: n_expert = 8
  2804. print_info: n_expert_used = 2
  2805. print_info: causal attn = 1
  2806. print_info: pooling type = 0
  2807. print_info: rope type = 0
  2808. print_info: rope scaling = linear
  2809. print_info: freq_base_train = 1000000.0
  2810. print_info: freq_scale_train = 1
  2811. print_info: n_ctx_orig_yarn = 32768
  2812. print_info: rope_finetuned = unknown
  2813. print_info: ssm_d_conv = 0
  2814. print_info: ssm_d_inner = 0
  2815. print_info: ssm_d_state = 0
  2816. print_info: ssm_dt_rank = 0
  2817. print_info: ssm_dt_b_c_rms = 0
  2818. print_info: model type = 8x7B
  2819. print_info: model params = 46.70 B
  2820. print_info: general.name = Mixtral 8x7B Instruct v0.1
  2821. print_info: vocab type = SPM
  2822. print_info: n_vocab = 32000
  2823. print_info: n_merges = 0
  2824. print_info: BOS token = 1 '<s>'
  2825. print_info: EOS token = 2 '</s>'
  2826. print_info: UNK token = 0 '<unk>'
  2827. print_info: LF token = 13 '<0x0A>'
  2828. print_info: EOG token = 2 '</s>'
  2829. print_info: max token length = 48
  2830. load_tensors: loading model tensors, this can take a while... (mmap = false)
  2831. time=2025-07-19T17:52:49.360+02:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
  2832. load_tensors: offloading 21 repeating layers to GPU
  2833. load_tensors: offloaded 21/33 layers to GPU
  2834. load_tensors: CUDA_Host model buffer size = 8782.09 MiB
  2835. load_tensors: CUDA0 model buffer size = 16435.78 MiB
  2836. llama_context: constructing llama_context
  2837. llama_context: n_seq_max = 1
  2838. llama_context: n_ctx = 2180
  2839. llama_context: n_ctx_per_seq = 2180
  2840. llama_context: n_batch = 512
  2841. llama_context: n_ubatch = 512
  2842. llama_context: causal_attn = 1
  2843. llama_context: flash_attn = 0
  2844. llama_context: freq_base = 1000000.0
  2845. llama_context: freq_scale = 1
  2846. llama_context: n_ctx_per_seq (2180) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
  2847. llama_context: CPU output buffer size = 0.14 MiB
  2848. llama_kv_cache_unified: kv_size = 2208, type_k = 'f16', type_v = 'f16', n_layer = 32, can_shift = 1, padding = 32
  2849. llama_kv_cache_unified: CUDA0 KV buffer size = 181.12 MiB
  2850. llama_kv_cache_unified: CPU KV buffer size = 94.88 MiB
  2851. llama_kv_cache_unified: KV self size = 276.00 MiB, K (f16): 138.00 MiB, V (f16): 138.00 MiB
  2852. llama_context: CUDA0 compute buffer size = 393.31 MiB
  2853. llama_context: CUDA_Host compute buffer size = 12.32 MiB
  2854. llama_context: graph nodes = 1574
  2855. llama_context: graph splits = 136 (with bs=512), 3 (with bs=1)
  2856. time=2025-07-19T17:53:05.143+02:00 level=INFO source=server.go:637 msg="llama runner started in 16.03 seconds"
  2857. time=2025-07-19T17:53:05.147+02:00 level=WARN source=runner.go:128 msg="truncating input prompt" limit=2180 prompt=2279 keep=5 new=2180
  2858. [GIN] 2025/07/19 - 17:53:21 | 200 | 33.5639499s | 83.77.231.178 | POST "/api/generate"
  2859. time=2025-07-19T17:53:25.838+02:00 level=INFO source=server.go:135 msg="system memory" total="47.1 GiB" free="32.6 GiB" free_swap="29.5 GiB"
  2860. time=2025-07-19T17:53:25.839+02:00 level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=33 layers.offload=21 layers.split="" memory.available="[18.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="27.0 GiB" memory.required.partial="18.4 GiB" memory.required.kv="460.1 MiB" memory.required.allocations="[18.4 GiB]" memory.weights.total="24.6 GiB" memory.weights.repeating="24.5 GiB" memory.weights.nonrepeating="102.6 MiB" memory.graph.full="269.3 MiB" memory.graph.partial="829.2 MiB"
  2861. llama_model_loader: loaded meta data with 40 key-value pairs and 323 tensors from C:\Users\Haldi\.ollama\models\blobs\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 (version GGUF V3 (latest))
  2862. llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
  2863. llama_model_loader: - kv 0: general.architecture str = llama
  2864. llama_model_loader: - kv 1: general.type str = model
  2865. llama_model_loader: - kv 2: general.name str = Mixtral 8x7B Instruct v0.1
  2866. llama_model_loader: - kv 3: general.version str = v0.1
  2867. llama_model_loader: - kv 4: general.finetune str = Instruct
  2868. llama_model_loader: - kv 5: general.basename str = Mixtral
  2869. llama_model_loader: - kv 6: general.size_label str = 8x7B
  2870. llama_model_loader: - kv 7: general.license str = apache-2.0
  2871. llama_model_loader: - kv 8: general.base_model.count u32 = 1
  2872. llama_model_loader: - kv 9: general.base_model.0.name str = Mixtral 8x7B v0.1
  2873. llama_model_loader: - kv 10: general.base_model.0.version str = v0.1
  2874. llama_model_loader: - kv 11: general.base_model.0.organization str = Mistralai
  2875. llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/mistralai/Mixt...
  2876. llama_model_loader: - kv 13: general.languages arr[str,5] = ["fr", "it", "de", "es", "en"]
  2877. llama_model_loader: - kv 14: llama.block_count u32 = 32
  2878. llama_model_loader: - kv 15: llama.context_length u32 = 32768
  2879. llama_model_loader: - kv 16: llama.embedding_length u32 = 4096
  2880. llama_model_loader: - kv 17: llama.feed_forward_length u32 = 14336
  2881. llama_model_loader: - kv 18: llama.attention.head_count u32 = 32
  2882. llama_model_loader: - kv 19: llama.attention.head_count_kv u32 = 8
  2883. llama_model_loader: - kv 20: llama.rope.freq_base f32 = 1000000.000000
  2884. llama_model_loader: - kv 21: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
  2885. llama_model_loader: - kv 22: llama.expert_count u32 = 8
  2886. llama_model_loader: - kv 23: llama.expert_used_count u32 = 2
  2887. llama_model_loader: - kv 24: general.file_type u32 = 2
  2888. llama_model_loader: - kv 25: llama.vocab_size u32 = 32000
  2889. llama_model_loader: - kv 26: llama.rope.dimension_count u32 = 128
  2890. llama_model_loader: - kv 27: tokenizer.ggml.model str = llama
  2891. llama_model_loader: - kv 28: tokenizer.ggml.pre str = default
  2892. llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
  2893. llama_model_loader: - kv 30: tokenizer.ggml.scores arr[f32,32000] = [-1000.000000, -1000.000000, -1000.00...
  2894. llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,32000] = [3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
  2895. llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 1
  2896. llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 2
  2897. llama_model_loader: - kv 34: tokenizer.ggml.unknown_token_id u32 = 0
  2898. llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = true
  2899. llama_model_loader: - kv 36: tokenizer.ggml.add_eos_token bool = false
  2900. llama_model_loader: - kv 37: tokenizer.chat_template str = {%- if messages[0]['role'] == 'system...
  2901. llama_model_loader: - kv 38: tokenizer.ggml.add_space_prefix bool = false
  2902. llama_model_loader: - kv 39: general.quantization_version u32 = 2
  2903. llama_model_loader: - type f32: 97 tensors
  2904. llama_model_loader: - type q4_0: 161 tensors
  2905. llama_model_loader: - type q8_0: 64 tensors
  2906. llama_model_loader: - type q6_K: 1 tensors
  2907. print_info: file format = GGUF V3 (latest)
  2908. print_info: file type = Q4_0
  2909. print_info: file size = 24.63 GiB (4.53 BPW)
  2910. load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
  2911. load: special tokens cache size = 3
  2912. load: token to piece cache size = 0.1637 MB
  2913. print_info: arch = llama
  2914. print_info: vocab_only = 1
  2915. print_info: model type = ?B
  2916. print_info: model params = 46.70 B
  2917. print_info: general.name = Mixtral 8x7B Instruct v0.1
  2918. print_info: vocab type = SPM
  2919. print_info: n_vocab = 32000
  2920. print_info: n_merges = 0
  2921. print_info: BOS token = 1 '<s>'
  2922. print_info: EOS token = 2 '</s>'
  2923. print_info: UNK token = 0 '<unk>'
  2924. print_info: LF token = 13 '<0x0A>'
  2925. print_info: EOG token = 2 '</s>'
  2926. print_info: max token length = 48
  2927. llama_model_load: vocab only - skipping tensors
  2928. time=2025-07-19T17:53:25.864+02:00 level=INFO source=server.go:438 msg="starting llama server" cmd="C:\\Users\\Haldi\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --model C:\\Users\\Haldi\\.ollama\\models\\blobs\\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 --ctx-size 3681 --batch-size 512 --n-gpu-layers 21 --threads 16 --no-mmap --parallel 1 --port 52946"
  2929. time=2025-07-19T17:53:25.866+02:00 level=INFO source=sched.go:483 msg="loaded runners" count=1
  2930. time=2025-07-19T17:53:25.866+02:00 level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
  2931. time=2025-07-19T17:53:25.866+02:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server error"
  2932. time=2025-07-19T17:53:25.915+02:00 level=INFO source=runner.go:815 msg="starting go runner"
  2933. ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
  2934. ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
  2935. ggml_cuda_init: found 1 CUDA devices:
  2936. Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  2937. load_backend: loaded CUDA backend from C:\Users\Haldi\AppData\Local\Programs\Ollama\lib\ollama\ggml-cuda.dll
  2938. load_backend: loaded CPU backend from C:\Users\Haldi\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-icelake.dll
  2939. time=2025-07-19T17:53:25.996+02:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
  2940. time=2025-07-19T17:53:25.997+02:00 level=INFO source=runner.go:874 msg="Server listening on 127.0.0.1:52946"
  2941. llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4090) - 22994 MiB free
  2942. llama_model_loader: loaded meta data with 40 key-value pairs and 323 tensors from C:\Users\Haldi\.ollama\models\blobs\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 (version GGUF V3 (latest))
  2943. llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
  2944. llama_model_loader: - kv 0: general.architecture str = llama
  2945. llama_model_loader: - kv 1: general.type str = model
  2946. llama_model_loader: - kv 2: general.name str = Mixtral 8x7B Instruct v0.1
  2947. llama_model_loader: - kv 3: general.version str = v0.1
  2948. llama_model_loader: - kv 4: general.finetune str = Instruct
  2949. llama_model_loader: - kv 5: general.basename str = Mixtral
  2950. llama_model_loader: - kv 6: general.size_label str = 8x7B
  2951. llama_model_loader: - kv 7: general.license str = apache-2.0
  2952. llama_model_loader: - kv 8: general.base_model.count u32 = 1
  2953. llama_model_loader: - kv 9: general.base_model.0.name str = Mixtral 8x7B v0.1
  2954. llama_model_loader: - kv 10: general.base_model.0.version str = v0.1
  2955. llama_model_loader: - kv 11: general.base_model.0.organization str = Mistralai
  2956. llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/mistralai/Mixt...
  2957. llama_model_loader: - kv 13: general.languages arr[str,5] = ["fr", "it", "de", "es", "en"]
  2958. llama_model_loader: - kv 14: llama.block_count u32 = 32
  2959. llama_model_loader: - kv 15: llama.context_length u32 = 32768
  2960. llama_model_loader: - kv 16: llama.embedding_length u32 = 4096
  2961. llama_model_loader: - kv 17: llama.feed_forward_length u32 = 14336
  2962. llama_model_loader: - kv 18: llama.attention.head_count u32 = 32
  2963. llama_model_loader: - kv 19: llama.attention.head_count_kv u32 = 8
  2964. llama_model_loader: - kv 20: llama.rope.freq_base f32 = 1000000.000000
  2965. llama_model_loader: - kv 21: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
  2966. llama_model_loader: - kv 22: llama.expert_count u32 = 8
  2967. llama_model_loader: - kv 23: llama.expert_used_count u32 = 2
  2968. llama_model_loader: - kv 24: general.file_type u32 = 2
  2969. llama_model_loader: - kv 25: llama.vocab_size u32 = 32000
  2970. llama_model_loader: - kv 26: llama.rope.dimension_count u32 = 128
  2971. llama_model_loader: - kv 27: tokenizer.ggml.model str = llama
  2972. llama_model_loader: - kv 28: tokenizer.ggml.pre str = default
  2973. llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
  2974. llama_model_loader: - kv 30: tokenizer.ggml.scores arr[f32,32000] = [-1000.000000, -1000.000000, -1000.00...
  2975. llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,32000] = [3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
  2976. llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 1
  2977. llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 2
  2978. llama_model_loader: - kv 34: tokenizer.ggml.unknown_token_id u32 = 0
  2979. llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = true
  2980. llama_model_loader: - kv 36: tokenizer.ggml.add_eos_token bool = false
  2981. llama_model_loader: - kv 37: tokenizer.chat_template str = {%- if messages[0]['role'] == 'system...
  2982. llama_model_loader: - kv 38: tokenizer.ggml.add_space_prefix bool = false
  2983. llama_model_loader: - kv 39: general.quantization_version u32 = 2
  2984. llama_model_loader: - type f32: 97 tensors
  2985. llama_model_loader: - type q4_0: 161 tensors
  2986. llama_model_loader: - type q8_0: 64 tensors
  2987. llama_model_loader: - type q6_K: 1 tensors
  2988. print_info: file format = GGUF V3 (latest)
  2989. print_info: file type = Q4_0
  2990. print_info: file size = 24.63 GiB (4.53 BPW)
  2991. load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
  2992. load: special tokens cache size = 3
  2993. load: token to piece cache size = 0.1637 MB
  2994. print_info: arch = llama
  2995. print_info: vocab_only = 0
  2996. print_info: n_ctx_train = 32768
  2997. print_info: n_embd = 4096
  2998. print_info: n_layer = 32
  2999. print_info: n_head = 32
  3000. print_info: n_head_kv = 8
  3001. print_info: n_rot = 128
  3002. print_info: n_swa = 0
  3003. print_info: n_swa_pattern = 1
  3004. print_info: n_embd_head_k = 128
  3005. print_info: n_embd_head_v = 128
  3006. print_info: n_gqa = 4
  3007. print_info: n_embd_k_gqa = 1024
  3008. print_info: n_embd_v_gqa = 1024
  3009. print_info: f_norm_eps = 0.0e+00
  3010. print_info: f_norm_rms_eps = 1.0e-05
  3011. print_info: f_clamp_kqv = 0.0e+00
  3012. print_info: f_max_alibi_bias = 0.0e+00
  3013. print_info: f_logit_scale = 0.0e+00
  3014. print_info: f_attn_scale = 0.0e+00
  3015. print_info: n_ff = 14336
  3016. print_info: n_expert = 8
  3017. print_info: n_expert_used = 2
  3018. print_info: causal attn = 1
  3019. print_info: pooling type = 0
  3020. print_info: rope type = 0
  3021. print_info: rope scaling = linear
  3022. print_info: freq_base_train = 1000000.0
  3023. print_info: freq_scale_train = 1
  3024. print_info: n_ctx_orig_yarn = 32768
  3025. print_info: rope_finetuned = unknown
  3026. print_info: ssm_d_conv = 0
  3027. print_info: ssm_d_inner = 0
  3028. print_info: ssm_d_state = 0
  3029. print_info: ssm_dt_rank = 0
  3030. print_info: ssm_dt_b_c_rms = 0
  3031. print_info: model type = 8x7B
  3032. print_info: model params = 46.70 B
  3033. print_info: general.name = Mixtral 8x7B Instruct v0.1
  3034. print_info: vocab type = SPM
  3035. print_info: n_vocab = 32000
  3036. print_info: n_merges = 0
  3037. print_info: BOS token = 1 '<s>'
  3038. print_info: EOS token = 2 '</s>'
  3039. print_info: UNK token = 0 '<unk>'
  3040. print_info: LF token = 13 '<0x0A>'
  3041. print_info: EOG token = 2 '</s>'
  3042. print_info: max token length = 48
  3043. load_tensors: loading model tensors, this can take a while... (mmap = false)
  3044. time=2025-07-19T17:53:26.118+02:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
  3045. load_tensors: offloading 21 repeating layers to GPU
  3046. load_tensors: offloaded 21/33 layers to GPU
  3047. load_tensors: CUDA_Host model buffer size = 8782.09 MiB
  3048. load_tensors: CUDA0 model buffer size = 16435.78 MiB
  3049. [GIN] 2025/07/19 - 17:53:30 | 200 | 0s | 83.77.231.178 | GET "/"
  3050. [GIN] 2025/07/19 - 17:53:30 | 404 | 0s | 83.77.231.178 | GET "/favicon.ico"
  3051. llama_context: constructing llama_context
  3052. llama_context: n_seq_max = 1
  3053. llama_context: n_ctx = 3681
  3054. llama_context: n_ctx_per_seq = 3681
  3055. llama_context: n_batch = 512
  3056. llama_context: n_ubatch = 512
  3057. llama_context: causal_attn = 1
  3058. llama_context: flash_attn = 0
  3059. llama_context: freq_base = 1000000.0
  3060. llama_context: freq_scale = 1
  3061. llama_context: n_ctx_per_seq (3681) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
  3062. llama_context: CPU output buffer size = 0.14 MiB
  3063. llama_kv_cache_unified: kv_size = 3712, type_k = 'f16', type_v = 'f16', n_layer = 32, can_shift = 1, padding = 32
  3064. llama_kv_cache_unified: CUDA0 KV buffer size = 304.50 MiB
  3065. llama_kv_cache_unified: CPU KV buffer size = 159.50 MiB
  3066. llama_kv_cache_unified: KV self size = 464.00 MiB, K (f16): 232.00 MiB, V (f16): 232.00 MiB
  3067. llama_context: CUDA0 compute buffer size = 396.25 MiB
  3068. llama_context: CUDA_Host compute buffer size = 15.26 MiB
  3069. llama_context: graph nodes = 1574
  3070. llama_context: graph splits = 136 (with bs=512), 3 (with bs=1)
  3071. time=2025-07-19T17:53:41.650+02:00 level=INFO source=server.go:637 msg="llama runner started in 15.78 seconds"
  3072. time=2025-07-19T17:53:41.655+02:00 level=WARN source=runner.go:128 msg="truncating input prompt" limit=3681 prompt=7678 keep=5 new=3681
  3073. [GIN] 2025/07/19 - 17:53:51 | 200 | 26.5249814s | 83.77.231.178 | POST "/api/generate"
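
After each of the POST /api/generate requests above, the runner is restarted with a different --ctx-size (2220, 2180, 3681, 2084), which is consistent with the client sending a per-request context size; num_ctx is one of the options the generate endpoint accepts, and changing it forces a model reload. A minimal request of that shape (host, model tag, prompt, and num_ctx below are placeholders or assumptions, not values taken from this log):

    # Hypothetical client call; "mixtral" is an assumed tag for the blob in this
    # log, host and prompt are placeholders. Changing options.num_ctx between
    # requests leads to a runner restart with a matching --ctx-size.
    import requests

    resp = requests.post(
        "http://localhost:11434/api/generate",   # placeholder host
        json={
            "model": "mixtral",                   # assumed model tag
            "prompt": "Hello",                    # placeholder
            "stream": False,
            "options": {"num_ctx": 2220},         # per-request context size
        },
        timeout=300,
    )
    print(resp.json()["response"])
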
  3074. time=2025-07-19T17:53:56.175+02:00 level=INFO source=server.go:135 msg="system memory" total="47.1 GiB" free="32.5 GiB" free_swap="29.5 GiB"
  3075. time=2025-07-19T17:53:56.175+02:00 level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=33 layers.offload=21 layers.split="" memory.available="[18.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="26.8 GiB" memory.required.partial="18.2 GiB" memory.required.kv="260.5 MiB" memory.required.allocations="[18.2 GiB]" memory.weights.total="24.6 GiB" memory.weights.repeating="24.5 GiB" memory.weights.nonrepeating="102.6 MiB" memory.graph.full="166.3 MiB" memory.graph.partial="826.1 MiB"
  3076. llama_model_loader: loaded meta data with 40 key-value pairs and 323 tensors from C:\Users\Haldi\.ollama\models\blobs\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 (version GGUF V3 (latest))
  3077. llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
  3078. llama_model_loader: - kv 0: general.architecture str = llama
  3079. llama_model_loader: - kv 1: general.type str = model
  3080. llama_model_loader: - kv 2: general.name str = Mixtral 8x7B Instruct v0.1
  3081. llama_model_loader: - kv 3: general.version str = v0.1
  3082. llama_model_loader: - kv 4: general.finetune str = Instruct
  3083. llama_model_loader: - kv 5: general.basename str = Mixtral
  3084. llama_model_loader: - kv 6: general.size_label str = 8x7B
  3085. llama_model_loader: - kv 7: general.license str = apache-2.0
  3086. llama_model_loader: - kv 8: general.base_model.count u32 = 1
  3087. llama_model_loader: - kv 9: general.base_model.0.name str = Mixtral 8x7B v0.1
  3088. llama_model_loader: - kv 10: general.base_model.0.version str = v0.1
  3089. llama_model_loader: - kv 11: general.base_model.0.organization str = Mistralai
  3090. llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/mistralai/Mixt...
  3091. llama_model_loader: - kv 13: general.languages arr[str,5] = ["fr", "it", "de", "es", "en"]
  3092. llama_model_loader: - kv 14: llama.block_count u32 = 32
  3093. llama_model_loader: - kv 15: llama.context_length u32 = 32768
  3094. llama_model_loader: - kv 16: llama.embedding_length u32 = 4096
  3095. llama_model_loader: - kv 17: llama.feed_forward_length u32 = 14336
  3096. llama_model_loader: - kv 18: llama.attention.head_count u32 = 32
  3097. llama_model_loader: - kv 19: llama.attention.head_count_kv u32 = 8
  3098. llama_model_loader: - kv 20: llama.rope.freq_base f32 = 1000000.000000
  3099. llama_model_loader: - kv 21: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
  3100. llama_model_loader: - kv 22: llama.expert_count u32 = 8
  3101. llama_model_loader: - kv 23: llama.expert_used_count u32 = 2
  3102. llama_model_loader: - kv 24: general.file_type u32 = 2
  3103. llama_model_loader: - kv 25: llama.vocab_size u32 = 32000
  3104. llama_model_loader: - kv 26: llama.rope.dimension_count u32 = 128
  3105. llama_model_loader: - kv 27: tokenizer.ggml.model str = llama
  3106. llama_model_loader: - kv 28: tokenizer.ggml.pre str = default
  3107. llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
  3108. llama_model_loader: - kv 30: tokenizer.ggml.scores arr[f32,32000] = [-1000.000000, -1000.000000, -1000.00...
  3109. llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,32000] = [3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
  3110. llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 1
  3111. llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 2
  3112. llama_model_loader: - kv 34: tokenizer.ggml.unknown_token_id u32 = 0
  3113. llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = true
  3114. llama_model_loader: - kv 36: tokenizer.ggml.add_eos_token bool = false
  3115. llama_model_loader: - kv 37: tokenizer.chat_template str = {%- if messages[0]['role'] == 'system...
  3116. llama_model_loader: - kv 38: tokenizer.ggml.add_space_prefix bool = false
  3117. llama_model_loader: - kv 39: general.quantization_version u32 = 2
  3118. llama_model_loader: - type f32: 97 tensors
  3119. llama_model_loader: - type q4_0: 161 tensors
  3120. llama_model_loader: - type q8_0: 64 tensors
  3121. llama_model_loader: - type q6_K: 1 tensors
  3122. print_info: file format = GGUF V3 (latest)
  3123. print_info: file type = Q4_0
  3124. print_info: file size = 24.63 GiB (4.53 BPW)
  3125. load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
  3126. load: special tokens cache size = 3
  3127. load: token to piece cache size = 0.1637 MB
  3128. print_info: arch = llama
  3129. print_info: vocab_only = 1
  3130. print_info: model type = ?B
  3131. print_info: model params = 46.70 B
  3132. print_info: general.name = Mixtral 8x7B Instruct v0.1
  3133. print_info: vocab type = SPM
  3134. print_info: n_vocab = 32000
  3135. print_info: n_merges = 0
  3136. print_info: BOS token = 1 '<s>'
  3137. print_info: EOS token = 2 '</s>'
  3138. print_info: UNK token = 0 '<unk>'
  3139. print_info: LF token = 13 '<0x0A>'
  3140. print_info: EOG token = 2 '</s>'
  3141. print_info: max token length = 48
  3142. llama_model_load: vocab only - skipping tensors
  3143. time=2025-07-19T17:53:56.197+02:00 level=INFO source=server.go:438 msg="starting llama server" cmd="C:\\Users\\Haldi\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --model C:\\Users\\Haldi\\.ollama\\models\\blobs\\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 --ctx-size 2084 --batch-size 512 --n-gpu-layers 21 --threads 16 --no-mmap --parallel 1 --port 52982"
  3144. time=2025-07-19T17:53:56.201+02:00 level=INFO source=sched.go:483 msg="loaded runners" count=1
  3145. time=2025-07-19T17:53:56.201+02:00 level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
  3146. time=2025-07-19T17:53:56.201+02:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server error"
  3147. time=2025-07-19T17:53:56.240+02:00 level=INFO source=runner.go:815 msg="starting go runner"
  3148. ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
  3149. ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
  3150. ggml_cuda_init: found 1 CUDA devices:
  3151. Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  3152. load_backend: loaded CUDA backend from C:\Users\Haldi\AppData\Local\Programs\Ollama\lib\ollama\ggml-cuda.dll
  3153. load_backend: loaded CPU backend from C:\Users\Haldi\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-icelake.dll
  3154. time=2025-07-19T17:53:56.321+02:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
  3155. time=2025-07-19T17:53:56.322+02:00 level=INFO source=runner.go:874 msg="Server listening on 127.0.0.1:52982"
  3156. llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4090) - 22994 MiB free
  3157. llama_model_loader: loaded meta data with 40 key-value pairs and 323 tensors from C:\Users\Haldi\.ollama\models\blobs\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 (version GGUF V3 (latest))
  3158. llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
  3159. llama_model_loader: - kv 0: general.architecture str = llama
  3160. llama_model_loader: - kv 1: general.type str = model
  3161. llama_model_loader: - kv 2: general.name str = Mixtral 8x7B Instruct v0.1
  3162. llama_model_loader: - kv 3: general.version str = v0.1
  3163. llama_model_loader: - kv 4: general.finetune str = Instruct
  3164. llama_model_loader: - kv 5: general.basename str = Mixtral
  3165. llama_model_loader: - kv 6: general.size_label str = 8x7B
  3166. llama_model_loader: - kv 7: general.license str = apache-2.0
  3167. llama_model_loader: - kv 8: general.base_model.count u32 = 1
  3168. llama_model_loader: - kv 9: general.base_model.0.name str = Mixtral 8x7B v0.1
  3169. llama_model_loader: - kv 10: general.base_model.0.version str = v0.1
  3170. llama_model_loader: - kv 11: general.base_model.0.organization str = Mistralai
  3171. llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/mistralai/Mixt...
  3172. llama_model_loader: - kv 13: general.languages arr[str,5] = ["fr", "it", "de", "es", "en"]
  3173. llama_model_loader: - kv 14: llama.block_count u32 = 32
  3174. llama_model_loader: - kv 15: llama.context_length u32 = 32768
  3175. llama_model_loader: - kv 16: llama.embedding_length u32 = 4096
  3176. llama_model_loader: - kv 17: llama.feed_forward_length u32 = 14336
  3177. llama_model_loader: - kv 18: llama.attention.head_count u32 = 32
  3178. llama_model_loader: - kv 19: llama.attention.head_count_kv u32 = 8
  3179. llama_model_loader: - kv 20: llama.rope.freq_base f32 = 1000000.000000
  3180. llama_model_loader: - kv 21: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
  3181. llama_model_loader: - kv 22: llama.expert_count u32 = 8
  3182. llama_model_loader: - kv 23: llama.expert_used_count u32 = 2
  3183. llama_model_loader: - kv 24: general.file_type u32 = 2
  3184. llama_model_loader: - kv 25: llama.vocab_size u32 = 32000
  3185. llama_model_loader: - kv 26: llama.rope.dimension_count u32 = 128
  3186. llama_model_loader: - kv 27: tokenizer.ggml.model str = llama
  3187. llama_model_loader: - kv 28: tokenizer.ggml.pre str = default
  3188. llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
  3189. llama_model_loader: - kv 30: tokenizer.ggml.scores arr[f32,32000] = [-1000.000000, -1000.000000, -1000.00...
  3190. llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,32000] = [3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
  3191. llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 1
  3192. llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 2
  3193. llama_model_loader: - kv 34: tokenizer.ggml.unknown_token_id u32 = 0
  3194. llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = true
  3195. llama_model_loader: - kv 36: tokenizer.ggml.add_eos_token bool = false
  3196. llama_model_loader: - kv 37: tokenizer.chat_template str = {%- if messages[0]['role'] == 'system...
  3197. llama_model_loader: - kv 38: tokenizer.ggml.add_space_prefix bool = false
  3198. llama_model_loader: - kv 39: general.quantization_version u32 = 2
  3199. llama_model_loader: - type f32: 97 tensors
  3200. llama_model_loader: - type q4_0: 161 tensors
  3201. llama_model_loader: - type q8_0: 64 tensors
  3202. llama_model_loader: - type q6_K: 1 tensors
  3203. print_info: file format = GGUF V3 (latest)
  3204. print_info: file type = Q4_0
  3205. print_info: file size = 24.63 GiB (4.53 BPW)
  3206. load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
  3207. load: special tokens cache size = 3
  3208. load: token to piece cache size = 0.1637 MB
  3209. print_info: arch = llama
  3210. print_info: vocab_only = 0
  3211. print_info: n_ctx_train = 32768
  3212. print_info: n_embd = 4096
  3213. print_info: n_layer = 32
  3214. print_info: n_head = 32
  3215. print_info: n_head_kv = 8
  3216. print_info: n_rot = 128
  3217. print_info: n_swa = 0
  3218. print_info: n_swa_pattern = 1
  3219. print_info: n_embd_head_k = 128
  3220. print_info: n_embd_head_v = 128
  3221. print_info: n_gqa = 4
  3222. print_info: n_embd_k_gqa = 1024
  3223. print_info: n_embd_v_gqa = 1024
  3224. print_info: f_norm_eps = 0.0e+00
  3225. print_info: f_norm_rms_eps = 1.0e-05
  3226. print_info: f_clamp_kqv = 0.0e+00
  3227. print_info: f_max_alibi_bias = 0.0e+00
  3228. print_info: f_logit_scale = 0.0e+00
  3229. print_info: f_attn_scale = 0.0e+00
  3230. print_info: n_ff = 14336
  3231. print_info: n_expert = 8
  3232. print_info: n_expert_used = 2
  3233. print_info: causal attn = 1
  3234. print_info: pooling type = 0
  3235. print_info: rope type = 0
  3236. print_info: rope scaling = linear
  3237. print_info: freq_base_train = 1000000.0
  3238. print_info: freq_scale_train = 1
  3239. print_info: n_ctx_orig_yarn = 32768
  3240. print_info: rope_finetuned = unknown
  3241. print_info: ssm_d_conv = 0
  3242. print_info: ssm_d_inner = 0
  3243. print_info: ssm_d_state = 0
  3244. print_info: ssm_dt_rank = 0
  3245. print_info: ssm_dt_b_c_rms = 0
  3246. print_info: model type = 8x7B
  3247. print_info: model params = 46.70 B
  3248. print_info: general.name = Mixtral 8x7B Instruct v0.1
  3249. print_info: vocab type = SPM
  3250. print_info: n_vocab = 32000
  3251. print_info: n_merges = 0
  3252. print_info: BOS token = 1 '<s>'
  3253. print_info: EOS token = 2 '</s>'
  3254. print_info: UNK token = 0 '<unk>'
  3255. print_info: LF token = 13 '<0x0A>'
  3256. print_info: EOG token = 2 '</s>'
  3257. print_info: max token length = 48
  3258. load_tensors: loading model tensors, this can take a while... (mmap = false)
  3259. time=2025-07-19T17:53:56.452+02:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
  3260. load_tensors: offloading 21 repeating layers to GPU
  3261. load_tensors: offloaded 21/33 layers to GPU
  3262. load_tensors: CUDA_Host model buffer size = 8782.09 MiB
  3263. load_tensors: CUDA0 model buffer size = 16435.78 MiB
  3264. llama_context: constructing llama_context
  3265. llama_context: n_seq_max = 1
  3266. llama_context: n_ctx = 2084
  3267. llama_context: n_ctx_per_seq = 2084
  3268. llama_context: n_batch = 512
  3269. llama_context: n_ubatch = 512
  3270. llama_context: causal_attn = 1
  3271. llama_context: flash_attn = 0
  3272. llama_context: freq_base = 1000000.0
  3273. llama_context: freq_scale = 1
  3274. llama_context: n_ctx_per_seq (2084) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
  3275. llama_context: CPU output buffer size = 0.14 MiB
  3276. llama_kv_cache_unified: kv_size = 2112, type_k = 'f16', type_v = 'f16', n_layer = 32, can_shift = 1, padding = 32
  3277. llama_kv_cache_unified: CUDA0 KV buffer size = 173.25 MiB
  3278. llama_kv_cache_unified: CPU KV buffer size = 90.75 MiB
  3279. llama_kv_cache_unified: KV self size = 264.00 MiB, K (f16): 132.00 MiB, V (f16): 132.00 MiB
  3280. llama_context: CUDA0 compute buffer size = 393.13 MiB
  3281. llama_context: CUDA_Host compute buffer size = 12.13 MiB
  3282. llama_context: graph nodes = 1574
  3283. llama_context: graph splits = 136 (with bs=512), 3 (with bs=1)
  3284. time=2025-07-19T17:54:11.983+02:00 level=INFO source=server.go:637 msg="llama runner started in 15.78 seconds"
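
Note: the startup block above is internally consistent, and the figures can be reproduced from the logged parameters. With kv_size = 2112 (n_ctx 2084 padded up to a multiple of 32), 32 layers, n_embd_k_gqa = 1024 (128 per head × 8 KV heads) and f16 cache entries, the K and V caches come to 132 MiB each, split across the 21 GPU and 11 CPU layers exactly as logged, and the 24.63 GiB file over 46.70 B parameters gives the reported 4.53 bits per weight. A small Python sanity check (not part of the log; all constants are copied from the print_info / llama_kv_cache_unified lines above):

# KV cache: kv_size * n_layer * n_embd_k_gqa * sizeof(f16), once for K and once for V
kv_size, n_layer, n_embd_kgqa, f16_bytes = 2112, 32, 1024, 2
k_mib = kv_size * n_layer * n_embd_kgqa * f16_bytes / 2**20
print(k_mib, 2 * k_mib)                            # 132.0 MiB each, 264.0 MiB total
print(2 * k_mib * 21 / 32, 2 * k_mib * 11 / 32)    # 173.25 MiB (CUDA0) / 90.75 MiB (CPU)

# Quantized file size vs parameter count -> bits per weight
print(24.63 * 2**30 * 8 / 46.70e9)                 # ~4.53 BPW, as reported
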
  3285. [GIN] 2025/07/19 - 17:54:20 | 200 | 25.2196376s | 83.77.231.178 | POST "/api/generate"
  3286. time=2025-07-19T17:54:24.889+02:00 level=INFO source=server.go:135 msg="system memory" total="47.1 GiB" free="32.6 GiB" free_swap="29.4 GiB"
  3287. time=2025-07-19T17:54:24.889+02:00 level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=33 layers.offload=21 layers.split="" memory.available="[18.5 GiB]" memory.gpu_overhead="0 B" memory.required.full="26.9 GiB" memory.required.partial="18.3 GiB" memory.required.kv="316.5 MiB" memory.required.allocations="[18.3 GiB]" memory.weights.total="24.6 GiB" memory.weights.repeating="24.5 GiB" memory.weights.nonrepeating="102.6 MiB" memory.graph.full="195.2 MiB" memory.graph.partial="827.0 MiB"
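
This offload record also explains the 21/33 split chosen for the next runner: fully offloading the model would need 26.9 GiB of VRAM against 18.5 GiB considered available, so the scheduler falls back to a partial offload whose full estimate for 21 layers (weights plus KV share, the 827 MiB partial graph and other overheads) is the logged 18.3 GiB, just under budget. A rough back-of-the-envelope check in Python (approximate only; Ollama's estimator in server.go accounts for more overheads than this sketch):

# ~784 MiB of repeating weights per transformer block; 21 blocks lands near the CUDA0 buffer
repeating_gib, n_blocks = 24.5, 32                 # memory.weights.repeating / llama.block_count
per_block_mib = repeating_gib * 1024 / n_blocks
print(per_block_mib)                               # 784.0 MiB per block
print(21 * per_block_mib)                          # ~16464 MiB, close to the logged 16435.78 MiB
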
  3288. llama_model_loader: loaded meta data with 40 key-value pairs and 323 tensors from C:\Users\Haldi\.ollama\models\blobs\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 (version GGUF V3 (latest))
  3289. llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
  3290. llama_model_loader: - kv 0: general.architecture str = llama
  3291. llama_model_loader: - kv 1: general.type str = model
  3292. llama_model_loader: - kv 2: general.name str = Mixtral 8x7B Instruct v0.1
  3293. llama_model_loader: - kv 3: general.version str = v0.1
  3294. llama_model_loader: - kv 4: general.finetune str = Instruct
  3295. llama_model_loader: - kv 5: general.basename str = Mixtral
  3296. llama_model_loader: - kv 6: general.size_label str = 8x7B
  3297. llama_model_loader: - kv 7: general.license str = apache-2.0
  3298. llama_model_loader: - kv 8: general.base_model.count u32 = 1
  3299. llama_model_loader: - kv 9: general.base_model.0.name str = Mixtral 8x7B v0.1
  3300. llama_model_loader: - kv 10: general.base_model.0.version str = v0.1
  3301. llama_model_loader: - kv 11: general.base_model.0.organization str = Mistralai
  3302. llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/mistralai/Mixt...
  3303. llama_model_loader: - kv 13: general.languages arr[str,5] = ["fr", "it", "de", "es", "en"]
  3304. llama_model_loader: - kv 14: llama.block_count u32 = 32
  3305. llama_model_loader: - kv 15: llama.context_length u32 = 32768
  3306. llama_model_loader: - kv 16: llama.embedding_length u32 = 4096
  3307. llama_model_loader: - kv 17: llama.feed_forward_length u32 = 14336
  3308. llama_model_loader: - kv 18: llama.attention.head_count u32 = 32
  3309. llama_model_loader: - kv 19: llama.attention.head_count_kv u32 = 8
  3310. llama_model_loader: - kv 20: llama.rope.freq_base f32 = 1000000.000000
  3311. llama_model_loader: - kv 21: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
  3312. llama_model_loader: - kv 22: llama.expert_count u32 = 8
  3313. llama_model_loader: - kv 23: llama.expert_used_count u32 = 2
  3314. llama_model_loader: - kv 24: general.file_type u32 = 2
  3315. llama_model_loader: - kv 25: llama.vocab_size u32 = 32000
  3316. llama_model_loader: - kv 26: llama.rope.dimension_count u32 = 128
  3317. llama_model_loader: - kv 27: tokenizer.ggml.model str = llama
  3318. llama_model_loader: - kv 28: tokenizer.ggml.pre str = default
  3319. llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
  3320. llama_model_loader: - kv 30: tokenizer.ggml.scores arr[f32,32000] = [-1000.000000, -1000.000000, -1000.00...
  3321. llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,32000] = [3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
  3322. llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 1
  3323. llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 2
  3324. llama_model_loader: - kv 34: tokenizer.ggml.unknown_token_id u32 = 0
  3325. llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = true
  3326. llama_model_loader: - kv 36: tokenizer.ggml.add_eos_token bool = false
  3327. llama_model_loader: - kv 37: tokenizer.chat_template str = {%- if messages[0]['role'] == 'system...
  3328. llama_model_loader: - kv 38: tokenizer.ggml.add_space_prefix bool = false
  3329. llama_model_loader: - kv 39: general.quantization_version u32 = 2
  3330. llama_model_loader: - type f32: 97 tensors
  3331. llama_model_loader: - type q4_0: 161 tensors
  3332. llama_model_loader: - type q8_0: 64 tensors
  3333. llama_model_loader: - type q6_K: 1 tensors
  3334. print_info: file format = GGUF V3 (latest)
  3335. print_info: file type = Q4_0
  3336. print_info: file size = 24.63 GiB (4.53 BPW)
  3337. load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
  3338. load: special tokens cache size = 3
  3339. load: token to piece cache size = 0.1637 MB
  3340. print_info: arch = llama
  3341. print_info: vocab_only = 1
  3342. print_info: model type = ?B
  3343. print_info: model params = 46.70 B
  3344. print_info: general.name = Mixtral 8x7B Instruct v0.1
  3345. print_info: vocab type = SPM
  3346. print_info: n_vocab = 32000
  3347. print_info: n_merges = 0
  3348. print_info: BOS token = 1 '<s>'
  3349. print_info: EOS token = 2 '</s>'
  3350. print_info: UNK token = 0 '<unk>'
  3351. print_info: LF token = 13 '<0x0A>'
  3352. print_info: EOG token = 2 '</s>'
  3353. print_info: max token length = 48
  3354. llama_model_load: vocab only - skipping tensors
  3355. time=2025-07-19T17:54:24.914+02:00 level=INFO source=server.go:438 msg="starting llama server" cmd="C:\\Users\\Haldi\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --model C:\\Users\\Haldi\\.ollama\\models\\blobs\\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 --ctx-size 2532 --batch-size 512 --n-gpu-layers 21 --threads 16 --no-mmap --parallel 1 --port 52986"
  3356. time=2025-07-19T17:54:24.917+02:00 level=INFO source=sched.go:483 msg="loaded runners" count=1
  3357. time=2025-07-19T17:54:24.917+02:00 level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
  3358. time=2025-07-19T17:54:24.917+02:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server error"
  3359. time=2025-07-19T17:54:24.964+02:00 level=INFO source=runner.go:815 msg="starting go runner"
  3360. ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
  3361. ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
  3362. ggml_cuda_init: found 1 CUDA devices:
  3363. Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  3364. load_backend: loaded CUDA backend from C:\Users\Haldi\AppData\Local\Programs\Ollama\lib\ollama\ggml-cuda.dll
  3365. load_backend: loaded CPU backend from C:\Users\Haldi\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-icelake.dll
  3366. time=2025-07-19T17:54:25.070+02:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
  3367. time=2025-07-19T17:54:25.070+02:00 level=INFO source=runner.go:874 msg="Server listening on 127.0.0.1:52986"
  3368. time=2025-07-19T17:54:25.168+02:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
  3369. llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4090) - 22994 MiB free
  3370. llama_model_loader: loaded meta data with 40 key-value pairs and 323 tensors from C:\Users\Haldi\.ollama\models\blobs\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 (version GGUF V3 (latest))
  3371. llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
  3372. llama_model_loader: - kv 0: general.architecture str = llama
  3373. llama_model_loader: - kv 1: general.type str = model
  3374. llama_model_loader: - kv 2: general.name str = Mixtral 8x7B Instruct v0.1
  3375. llama_model_loader: - kv 3: general.version str = v0.1
  3376. llama_model_loader: - kv 4: general.finetune str = Instruct
  3377. llama_model_loader: - kv 5: general.basename str = Mixtral
  3378. llama_model_loader: - kv 6: general.size_label str = 8x7B
  3379. llama_model_loader: - kv 7: general.license str = apache-2.0
  3380. llama_model_loader: - kv 8: general.base_model.count u32 = 1
  3381. llama_model_loader: - kv 9: general.base_model.0.name str = Mixtral 8x7B v0.1
  3382. llama_model_loader: - kv 10: general.base_model.0.version str = v0.1
  3383. llama_model_loader: - kv 11: general.base_model.0.organization str = Mistralai
  3384. llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/mistralai/Mixt...
  3385. llama_model_loader: - kv 13: general.languages arr[str,5] = ["fr", "it", "de", "es", "en"]
  3386. llama_model_loader: - kv 14: llama.block_count u32 = 32
  3387. llama_model_loader: - kv 15: llama.context_length u32 = 32768
  3388. llama_model_loader: - kv 16: llama.embedding_length u32 = 4096
  3389. llama_model_loader: - kv 17: llama.feed_forward_length u32 = 14336
  3390. llama_model_loader: - kv 18: llama.attention.head_count u32 = 32
  3391. llama_model_loader: - kv 19: llama.attention.head_count_kv u32 = 8
  3392. llama_model_loader: - kv 20: llama.rope.freq_base f32 = 1000000.000000
  3393. llama_model_loader: - kv 21: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
  3394. llama_model_loader: - kv 22: llama.expert_count u32 = 8
  3395. llama_model_loader: - kv 23: llama.expert_used_count u32 = 2
  3396. llama_model_loader: - kv 24: general.file_type u32 = 2
  3397. llama_model_loader: - kv 25: llama.vocab_size u32 = 32000
  3398. llama_model_loader: - kv 26: llama.rope.dimension_count u32 = 128
  3399. llama_model_loader: - kv 27: tokenizer.ggml.model str = llama
  3400. llama_model_loader: - kv 28: tokenizer.ggml.pre str = default
  3401. llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
  3402. llama_model_loader: - kv 30: tokenizer.ggml.scores arr[f32,32000] = [-1000.000000, -1000.000000, -1000.00...
  3403. llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,32000] = [3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
  3404. llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 1
  3405. llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 2
  3406. llama_model_loader: - kv 34: tokenizer.ggml.unknown_token_id u32 = 0
  3407. llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = true
  3408. llama_model_loader: - kv 36: tokenizer.ggml.add_eos_token bool = false
  3409. llama_model_loader: - kv 37: tokenizer.chat_template str = {%- if messages[0]['role'] == 'system...
  3410. llama_model_loader: - kv 38: tokenizer.ggml.add_space_prefix bool = false
  3411. llama_model_loader: - kv 39: general.quantization_version u32 = 2
  3412. llama_model_loader: - type f32: 97 tensors
  3413. llama_model_loader: - type q4_0: 161 tensors
  3414. llama_model_loader: - type q8_0: 64 tensors
  3415. llama_model_loader: - type q6_K: 1 tensors
  3416. print_info: file format = GGUF V3 (latest)
  3417. print_info: file type = Q4_0
  3418. print_info: file size = 24.63 GiB (4.53 BPW)
  3419. load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
  3420. load: special tokens cache size = 3
  3421. load: token to piece cache size = 0.1637 MB
  3422. print_info: arch = llama
  3423. print_info: vocab_only = 0
  3424. print_info: n_ctx_train = 32768
  3425. print_info: n_embd = 4096
  3426. print_info: n_layer = 32
  3427. print_info: n_head = 32
  3428. print_info: n_head_kv = 8
  3429. print_info: n_rot = 128
  3430. print_info: n_swa = 0
  3431. print_info: n_swa_pattern = 1
  3432. print_info: n_embd_head_k = 128
  3433. print_info: n_embd_head_v = 128
  3434. print_info: n_gqa = 4
  3435. print_info: n_embd_k_gqa = 1024
  3436. print_info: n_embd_v_gqa = 1024
  3437. print_info: f_norm_eps = 0.0e+00
  3438. print_info: f_norm_rms_eps = 1.0e-05
  3439. print_info: f_clamp_kqv = 0.0e+00
  3440. print_info: f_max_alibi_bias = 0.0e+00
  3441. print_info: f_logit_scale = 0.0e+00
  3442. print_info: f_attn_scale = 0.0e+00
  3443. print_info: n_ff = 14336
  3444. print_info: n_expert = 8
  3445. print_info: n_expert_used = 2
  3446. print_info: causal attn = 1
  3447. print_info: pooling type = 0
  3448. print_info: rope type = 0
  3449. print_info: rope scaling = linear
  3450. print_info: freq_base_train = 1000000.0
  3451. print_info: freq_scale_train = 1
  3452. print_info: n_ctx_orig_yarn = 32768
  3453. print_info: rope_finetuned = unknown
  3454. print_info: ssm_d_conv = 0
  3455. print_info: ssm_d_inner = 0
  3456. print_info: ssm_d_state = 0
  3457. print_info: ssm_dt_rank = 0
  3458. print_info: ssm_dt_b_c_rms = 0
  3459. print_info: model type = 8x7B
  3460. print_info: model params = 46.70 B
  3461. print_info: general.name = Mixtral 8x7B Instruct v0.1
  3462. print_info: vocab type = SPM
  3463. print_info: n_vocab = 32000
  3464. print_info: n_merges = 0
  3465. print_info: BOS token = 1 '<s>'
  3466. print_info: EOS token = 2 '</s>'
  3467. print_info: UNK token = 0 '<unk>'
  3468. print_info: LF token = 13 '<0x0A>'
  3469. print_info: EOG token = 2 '</s>'
  3470. print_info: max token length = 48
  3471. load_tensors: loading model tensors, this can take a while... (mmap = false)
  3472. load_tensors: offloading 21 repeating layers to GPU
  3473. load_tensors: offloaded 21/33 layers to GPU
  3474. load_tensors: CUDA_Host model buffer size = 8782.09 MiB
  3475. load_tensors: CUDA0 model buffer size = 16435.78 MiB
  3476. llama_context: constructing llama_context
  3477. llama_context: n_seq_max = 1
  3478. llama_context: n_ctx = 2532
  3479. llama_context: n_ctx_per_seq = 2532
  3480. llama_context: n_batch = 512
  3481. llama_context: n_ubatch = 512
  3482. llama_context: causal_attn = 1
  3483. llama_context: flash_attn = 0
  3484. llama_context: freq_base = 1000000.0
  3485. llama_context: freq_scale = 1
  3486. llama_context: n_ctx_per_seq (2532) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
  3487. llama_context: CPU output buffer size = 0.14 MiB
  3488. llama_kv_cache_unified: kv_size = 2560, type_k = 'f16', type_v = 'f16', n_layer = 32, can_shift = 1, padding = 32
  3489. llama_kv_cache_unified: CUDA0 KV buffer size = 210.00 MiB
  3490. llama_kv_cache_unified: CPU KV buffer size = 110.00 MiB
  3491. llama_kv_cache_unified: KV self size = 320.00 MiB, K (f16): 160.00 MiB, V (f16): 160.00 MiB
  3492. llama_context: CUDA0 compute buffer size = 394.00 MiB
  3493. llama_context: CUDA_Host compute buffer size = 13.01 MiB
  3494. llama_context: graph nodes = 1574
  3495. llama_context: graph splits = 136 (with bs=512), 3 (with bs=1)
  3496. time=2025-07-19T17:54:40.954+02:00 level=INFO source=server.go:637 msg="llama runner started in 16.04 seconds"
  3497. time=2025-07-19T17:54:40.959+02:00 level=WARN source=runner.go:128 msg="truncating input prompt" limit=2532 prompt=3176 keep=5 new=2532
  3498. [GIN] 2025/07/19 - 17:54:53 | 200 | 28.9310929s | 83.77.231.178 | POST "/api/generate"
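
The WARN line above shows the request's 3176-token prompt being cut down to the runner's 2532-token window (the first 5 tokens plus the tail of the prompt are kept; the middle is dropped), so part of the prompt never reached the model. If the full prompt matters, the context window can be raised per request with the documented num_ctx option on /api/generate, at the cost of a larger KV cache. A minimal sketch (assumes the Mixtral blob in this log was pulled under the tag "mixtral"; adjust the name and window size to your setup):

import json, urllib.request

payload = {
    "model": "mixtral",                     # assumed model tag for the blob in this log
    "prompt": "…your long prompt here…",
    "stream": False,
    "options": {"num_ctx": 8192},           # context window large enough for a ~3.2k-token prompt
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
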
  3499. time=2025-07-19T17:54:57.959+02:00 level=INFO source=server.go:135 msg="system memory" total="47.1 GiB" free="32.4 GiB" free_swap="29.5 GiB"
  3500. time=2025-07-19T17:54:57.960+02:00 level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=33 layers.offload=21 layers.split="" memory.available="[18.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="26.8 GiB" memory.required.partial="18.2 GiB" memory.required.kv="260.6 MiB" memory.required.allocations="[18.2 GiB]" memory.weights.total="24.6 GiB" memory.weights.repeating="24.5 GiB" memory.weights.nonrepeating="102.6 MiB" memory.graph.full="166.4 MiB" memory.graph.partial="826.1 MiB"
  3501. llama_model_loader: loaded meta data with 40 key-value pairs and 323 tensors from C:\Users\Haldi\.ollama\models\blobs\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 (version GGUF V3 (latest))
  3502. llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
  3503. llama_model_loader: - kv 0: general.architecture str = llama
  3504. llama_model_loader: - kv 1: general.type str = model
  3505. llama_model_loader: - kv 2: general.name str = Mixtral 8x7B Instruct v0.1
  3506. llama_model_loader: - kv 3: general.version str = v0.1
  3507. llama_model_loader: - kv 4: general.finetune str = Instruct
  3508. llama_model_loader: - kv 5: general.basename str = Mixtral
  3509. llama_model_loader: - kv 6: general.size_label str = 8x7B
  3510. llama_model_loader: - kv 7: general.license str = apache-2.0
  3511. llama_model_loader: - kv 8: general.base_model.count u32 = 1
  3512. llama_model_loader: - kv 9: general.base_model.0.name str = Mixtral 8x7B v0.1
  3513. llama_model_loader: - kv 10: general.base_model.0.version str = v0.1
  3514. llama_model_loader: - kv 11: general.base_model.0.organization str = Mistralai
  3515. llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/mistralai/Mixt...
  3516. llama_model_loader: - kv 13: general.languages arr[str,5] = ["fr", "it", "de", "es", "en"]
  3517. llama_model_loader: - kv 14: llama.block_count u32 = 32
  3518. llama_model_loader: - kv 15: llama.context_length u32 = 32768
  3519. llama_model_loader: - kv 16: llama.embedding_length u32 = 4096
  3520. llama_model_loader: - kv 17: llama.feed_forward_length u32 = 14336
  3521. llama_model_loader: - kv 18: llama.attention.head_count u32 = 32
  3522. llama_model_loader: - kv 19: llama.attention.head_count_kv u32 = 8
  3523. llama_model_loader: - kv 20: llama.rope.freq_base f32 = 1000000.000000
  3524. llama_model_loader: - kv 21: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
  3525. llama_model_loader: - kv 22: llama.expert_count u32 = 8
  3526. llama_model_loader: - kv 23: llama.expert_used_count u32 = 2
  3527. llama_model_loader: - kv 24: general.file_type u32 = 2
  3528. llama_model_loader: - kv 25: llama.vocab_size u32 = 32000
  3529. llama_model_loader: - kv 26: llama.rope.dimension_count u32 = 128
  3530. llama_model_loader: - kv 27: tokenizer.ggml.model str = llama
  3531. llama_model_loader: - kv 28: tokenizer.ggml.pre str = default
  3532. llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
  3533. llama_model_loader: - kv 30: tokenizer.ggml.scores arr[f32,32000] = [-1000.000000, -1000.000000, -1000.00...
  3534. llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,32000] = [3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
  3535. llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 1
  3536. llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 2
  3537. llama_model_loader: - kv 34: tokenizer.ggml.unknown_token_id u32 = 0
  3538. llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = true
  3539. llama_model_loader: - kv 36: tokenizer.ggml.add_eos_token bool = false
  3540. llama_model_loader: - kv 37: tokenizer.chat_template str = {%- if messages[0]['role'] == 'system...
  3541. llama_model_loader: - kv 38: tokenizer.ggml.add_space_prefix bool = false
  3542. llama_model_loader: - kv 39: general.quantization_version u32 = 2
  3543. llama_model_loader: - type f32: 97 tensors
  3544. llama_model_loader: - type q4_0: 161 tensors
  3545. llama_model_loader: - type q8_0: 64 tensors
  3546. llama_model_loader: - type q6_K: 1 tensors
  3547. print_info: file format = GGUF V3 (latest)
  3548. print_info: file type = Q4_0
  3549. print_info: file size = 24.63 GiB (4.53 BPW)
  3550. load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
  3551. load: special tokens cache size = 3
  3552. load: token to piece cache size = 0.1637 MB
  3553. print_info: arch = llama
  3554. print_info: vocab_only = 1
  3555. print_info: model type = ?B
  3556. print_info: model params = 46.70 B
  3557. print_info: general.name = Mixtral 8x7B Instruct v0.1
  3558. print_info: vocab type = SPM
  3559. print_info: n_vocab = 32000
  3560. print_info: n_merges = 0
  3561. print_info: BOS token = 1 '<s>'
  3562. print_info: EOS token = 2 '</s>'
  3563. print_info: UNK token = 0 '<unk>'
  3564. print_info: LF token = 13 '<0x0A>'
  3565. print_info: EOG token = 2 '</s>'
  3566. print_info: max token length = 48
  3567. llama_model_load: vocab only - skipping tensors
  3568. time=2025-07-19T17:54:57.984+02:00 level=INFO source=server.go:438 msg="starting llama server" cmd="C:\\Users\\Haldi\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --model C:\\Users\\Haldi\\.ollama\\models\\blobs\\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 --ctx-size 2085 --batch-size 512 --n-gpu-layers 21 --threads 16 --no-mmap --parallel 1 --port 52996"
  3569. time=2025-07-19T17:54:57.987+02:00 level=INFO source=sched.go:483 msg="loaded runners" count=1
  3570. time=2025-07-19T17:54:57.987+02:00 level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
  3571. time=2025-07-19T17:54:57.987+02:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server error"
  3572. time=2025-07-19T17:54:58.028+02:00 level=INFO source=runner.go:815 msg="starting go runner"
  3573. ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
  3574. ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
  3575. ggml_cuda_init: found 1 CUDA devices:
  3576. Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  3577. load_backend: loaded CUDA backend from C:\Users\Haldi\AppData\Local\Programs\Ollama\lib\ollama\ggml-cuda.dll
  3578. load_backend: loaded CPU backend from C:\Users\Haldi\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-icelake.dll
  3579. time=2025-07-19T17:54:58.111+02:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
  3580. time=2025-07-19T17:54:58.111+02:00 level=INFO source=runner.go:874 msg="Server listening on 127.0.0.1:52996"
  3581. llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4090) - 22994 MiB free
  3582. llama_model_loader: loaded meta data with 40 key-value pairs and 323 tensors from C:\Users\Haldi\.ollama\models\blobs\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 (version GGUF V3 (latest))
  3583. llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
  3584. llama_model_loader: - kv 0: general.architecture str = llama
  3585. llama_model_loader: - kv 1: general.type str = model
  3586. llama_model_loader: - kv 2: general.name str = Mixtral 8x7B Instruct v0.1
  3587. llama_model_loader: - kv 3: general.version str = v0.1
  3588. llama_model_loader: - kv 4: general.finetune str = Instruct
  3589. llama_model_loader: - kv 5: general.basename str = Mixtral
  3590. llama_model_loader: - kv 6: general.size_label str = 8x7B
  3591. llama_model_loader: - kv 7: general.license str = apache-2.0
  3592. llama_model_loader: - kv 8: general.base_model.count u32 = 1
  3593. llama_model_loader: - kv 9: general.base_model.0.name str = Mixtral 8x7B v0.1
  3594. llama_model_loader: - kv 10: general.base_model.0.version str = v0.1
  3595. llama_model_loader: - kv 11: general.base_model.0.organization str = Mistralai
  3596. llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/mistralai/Mixt...
  3597. llama_model_loader: - kv 13: general.languages arr[str,5] = ["fr", "it", "de", "es", "en"]
  3598. llama_model_loader: - kv 14: llama.block_count u32 = 32
  3599. llama_model_loader: - kv 15: llama.context_length u32 = 32768
  3600. llama_model_loader: - kv 16: llama.embedding_length u32 = 4096
  3601. llama_model_loader: - kv 17: llama.feed_forward_length u32 = 14336
  3602. llama_model_loader: - kv 18: llama.attention.head_count u32 = 32
  3603. llama_model_loader: - kv 19: llama.attention.head_count_kv u32 = 8
  3604. llama_model_loader: - kv 20: llama.rope.freq_base f32 = 1000000.000000
  3605. llama_model_loader: - kv 21: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
  3606. llama_model_loader: - kv 22: llama.expert_count u32 = 8
  3607. llama_model_loader: - kv 23: llama.expert_used_count u32 = 2
  3608. llama_model_loader: - kv 24: general.file_type u32 = 2
  3609. llama_model_loader: - kv 25: llama.vocab_size u32 = 32000
  3610. llama_model_loader: - kv 26: llama.rope.dimension_count u32 = 128
  3611. llama_model_loader: - kv 27: tokenizer.ggml.model str = llama
  3612. llama_model_loader: - kv 28: tokenizer.ggml.pre str = default
  3613. llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
  3614. llama_model_loader: - kv 30: tokenizer.ggml.scores arr[f32,32000] = [-1000.000000, -1000.000000, -1000.00...
  3615. llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,32000] = [3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
  3616. llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 1
  3617. llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 2
  3618. llama_model_loader: - kv 34: tokenizer.ggml.unknown_token_id u32 = 0
  3619. llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = true
  3620. llama_model_loader: - kv 36: tokenizer.ggml.add_eos_token bool = false
  3621. llama_model_loader: - kv 37: tokenizer.chat_template str = {%- if messages[0]['role'] == 'system...
  3622. llama_model_loader: - kv 38: tokenizer.ggml.add_space_prefix bool = false
  3623. llama_model_loader: - kv 39: general.quantization_version u32 = 2
  3624. llama_model_loader: - type f32: 97 tensors
  3625. llama_model_loader: - type q4_0: 161 tensors
  3626. llama_model_loader: - type q8_0: 64 tensors
  3627. llama_model_loader: - type q6_K: 1 tensors
  3628. print_info: file format = GGUF V3 (latest)
  3629. print_info: file type = Q4_0
  3630. print_info: file size = 24.63 GiB (4.53 BPW)
  3631. load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
  3632. load: special tokens cache size = 3
  3633. load: token to piece cache size = 0.1637 MB
  3634. print_info: arch = llama
  3635. print_info: vocab_only = 0
  3636. print_info: n_ctx_train = 32768
  3637. print_info: n_embd = 4096
  3638. print_info: n_layer = 32
  3639. print_info: n_head = 32
  3640. print_info: n_head_kv = 8
  3641. print_info: n_rot = 128
  3642. print_info: n_swa = 0
  3643. print_info: n_swa_pattern = 1
  3644. print_info: n_embd_head_k = 128
  3645. print_info: n_embd_head_v = 128
  3646. print_info: n_gqa = 4
  3647. print_info: n_embd_k_gqa = 1024
  3648. print_info: n_embd_v_gqa = 1024
  3649. print_info: f_norm_eps = 0.0e+00
  3650. print_info: f_norm_rms_eps = 1.0e-05
  3651. print_info: f_clamp_kqv = 0.0e+00
  3652. print_info: f_max_alibi_bias = 0.0e+00
  3653. print_info: f_logit_scale = 0.0e+00
  3654. print_info: f_attn_scale = 0.0e+00
  3655. print_info: n_ff = 14336
  3656. print_info: n_expert = 8
  3657. print_info: n_expert_used = 2
  3658. print_info: causal attn = 1
  3659. print_info: pooling type = 0
  3660. print_info: rope type = 0
  3661. print_info: rope scaling = linear
  3662. print_info: freq_base_train = 1000000.0
  3663. print_info: freq_scale_train = 1
  3664. print_info: n_ctx_orig_yarn = 32768
  3665. print_info: rope_finetuned = unknown
  3666. print_info: ssm_d_conv = 0
  3667. print_info: ssm_d_inner = 0
  3668. print_info: ssm_d_state = 0
  3669. print_info: ssm_dt_rank = 0
  3670. print_info: ssm_dt_b_c_rms = 0
  3671. print_info: model type = 8x7B
  3672. print_info: model params = 46.70 B
  3673. print_info: general.name = Mixtral 8x7B Instruct v0.1
  3674. print_info: vocab type = SPM
  3675. print_info: n_vocab = 32000
  3676. print_info: n_merges = 0
  3677. print_info: BOS token = 1 '<s>'
  3678. print_info: EOS token = 2 '</s>'
  3679. print_info: UNK token = 0 '<unk>'
  3680. print_info: LF token = 13 '<0x0A>'
  3681. print_info: EOG token = 2 '</s>'
  3682. print_info: max token length = 48
  3683. load_tensors: loading model tensors, this can take a while... (mmap = false)
  3684. time=2025-07-19T17:54:58.238+02:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
  3685. load_tensors: offloading 21 repeating layers to GPU
  3686. load_tensors: offloaded 21/33 layers to GPU
  3687. load_tensors: CUDA_Host model buffer size = 8782.09 MiB
  3688. load_tensors: CUDA0 model buffer size = 16435.78 MiB
  3689. llama_context: constructing llama_context
  3690. llama_context: n_seq_max = 1
  3691. llama_context: n_ctx = 2085
  3692. llama_context: n_ctx_per_seq = 2085
  3693. llama_context: n_batch = 512
  3694. llama_context: n_ubatch = 512
  3695. llama_context: causal_attn = 1
  3696. llama_context: flash_attn = 0
  3697. llama_context: freq_base = 1000000.0
  3698. llama_context: freq_scale = 1
  3699. llama_context: n_ctx_per_seq (2085) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
  3700. llama_context: CPU output buffer size = 0.14 MiB
  3701. llama_kv_cache_unified: kv_size = 2112, type_k = 'f16', type_v = 'f16', n_layer = 32, can_shift = 1, padding = 32
  3702. llama_kv_cache_unified: CUDA0 KV buffer size = 173.25 MiB
  3703. llama_kv_cache_unified: CPU KV buffer size = 90.75 MiB
  3704. llama_kv_cache_unified: KV self size = 264.00 MiB, K (f16): 132.00 MiB, V (f16): 132.00 MiB
  3705. llama_context: CUDA0 compute buffer size = 393.13 MiB
  3706. llama_context: CUDA_Host compute buffer size = 12.13 MiB
  3707. llama_context: graph nodes = 1574
  3708. llama_context: graph splits = 136 (with bs=512), 3 (with bs=1)
  3709. time=2025-07-19T17:55:14.022+02:00 level=INFO source=server.go:637 msg="llama runner started in 16.04 seconds"
  3710. [GIN] 2025/07/19 - 17:55:23 | 200 | 26.3996882s | 83.77.231.178 | POST "/api/generate"
  3711. [GIN] 2025/07/19 - 17:56:08 | 200 | 0s | 192.168.1.1 | GET "/"
  3712. [GIN] 2025/07/19 - 18:01:08 | 200 | 0s | 192.168.1.1 | GET "/"
  3713. [GIN] 2025/07/19 - 18:06:08 | 200 | 0s | 192.168.1.1 | GET "/"
  3714. time=2025-07-19T18:07:46.341+02:00 level=INFO source=server.go:135 msg="system memory" total="47.1 GiB" free="32.6 GiB" free_swap="31.0 GiB"
  3715. time=2025-07-19T18:07:46.342+02:00 level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=33 layers.offload=23 layers.split="" memory.available="[20.1 GiB]" memory.gpu_overhead="0 B" memory.required.full="26.8 GiB" memory.required.partial="19.8 GiB" memory.required.kv="260.6 MiB" memory.required.allocations="[19.8 GiB]" memory.weights.total="24.6 GiB" memory.weights.repeating="24.5 GiB" memory.weights.nonrepeating="102.6 MiB" memory.graph.full="166.4 MiB" memory.graph.partial="826.1 MiB"
  3716. llama_model_loader: loaded meta data with 40 key-value pairs and 323 tensors from C:\Users\Haldi\.ollama\models\blobs\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 (version GGUF V3 (latest))
  3717. llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
  3718. llama_model_loader: - kv 0: general.architecture str = llama
  3719. llama_model_loader: - kv 1: general.type str = model
  3720. llama_model_loader: - kv 2: general.name str = Mixtral 8x7B Instruct v0.1
  3721. llama_model_loader: - kv 3: general.version str = v0.1
  3722. llama_model_loader: - kv 4: general.finetune str = Instruct
  3723. llama_model_loader: - kv 5: general.basename str = Mixtral
  3724. llama_model_loader: - kv 6: general.size_label str = 8x7B
  3725. llama_model_loader: - kv 7: general.license str = apache-2.0
  3726. llama_model_loader: - kv 8: general.base_model.count u32 = 1
  3727. llama_model_loader: - kv 9: general.base_model.0.name str = Mixtral 8x7B v0.1
  3728. llama_model_loader: - kv 10: general.base_model.0.version str = v0.1
  3729. llama_model_loader: - kv 11: general.base_model.0.organization str = Mistralai
  3730. llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/mistralai/Mixt...
  3731. llama_model_loader: - kv 13: general.languages arr[str,5] = ["fr", "it", "de", "es", "en"]
  3732. llama_model_loader: - kv 14: llama.block_count u32 = 32
  3733. llama_model_loader: - kv 15: llama.context_length u32 = 32768
  3734. llama_model_loader: - kv 16: llama.embedding_length u32 = 4096
  3735. llama_model_loader: - kv 17: llama.feed_forward_length u32 = 14336
  3736. llama_model_loader: - kv 18: llama.attention.head_count u32 = 32
  3737. llama_model_loader: - kv 19: llama.attention.head_count_kv u32 = 8
  3738. llama_model_loader: - kv 20: llama.rope.freq_base f32 = 1000000.000000
  3739. llama_model_loader: - kv 21: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
  3740. llama_model_loader: - kv 22: llama.expert_count u32 = 8
  3741. llama_model_loader: - kv 23: llama.expert_used_count u32 = 2
  3742. llama_model_loader: - kv 24: general.file_type u32 = 2
  3743. llama_model_loader: - kv 25: llama.vocab_size u32 = 32000
  3744. llama_model_loader: - kv 26: llama.rope.dimension_count u32 = 128
  3745. llama_model_loader: - kv 27: tokenizer.ggml.model str = llama
  3746. llama_model_loader: - kv 28: tokenizer.ggml.pre str = default
  3747. llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
  3748. llama_model_loader: - kv 30: tokenizer.ggml.scores arr[f32,32000] = [-1000.000000, -1000.000000, -1000.00...
  3749. llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,32000] = [3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
  3750. llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 1
  3751. llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 2
  3752. llama_model_loader: - kv 34: tokenizer.ggml.unknown_token_id u32 = 0
  3753. llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = true
  3754. llama_model_loader: - kv 36: tokenizer.ggml.add_eos_token bool = false
  3755. llama_model_loader: - kv 37: tokenizer.chat_template str = {%- if messages[0]['role'] == 'system...
  3756. llama_model_loader: - kv 38: tokenizer.ggml.add_space_prefix bool = false
  3757. llama_model_loader: - kv 39: general.quantization_version u32 = 2
  3758. llama_model_loader: - type f32: 97 tensors
  3759. llama_model_loader: - type q4_0: 161 tensors
  3760. llama_model_loader: - type q8_0: 64 tensors
  3761. llama_model_loader: - type q6_K: 1 tensors
  3762. print_info: file format = GGUF V3 (latest)
  3763. print_info: file type = Q4_0
  3764. print_info: file size = 24.63 GiB (4.53 BPW)
  3765. load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
  3766. load: special tokens cache size = 3
  3767. load: token to piece cache size = 0.1637 MB
  3768. print_info: arch = llama
  3769. print_info: vocab_only = 1
  3770. print_info: model type = ?B
  3771. print_info: model params = 46.70 B
  3772. print_info: general.name = Mixtral 8x7B Instruct v0.1
  3773. print_info: vocab type = SPM
  3774. print_info: n_vocab = 32000
  3775. print_info: n_merges = 0
  3776. print_info: BOS token = 1 '<s>'
  3777. print_info: EOS token = 2 '</s>'
  3778. print_info: UNK token = 0 '<unk>'
  3779. print_info: LF token = 13 '<0x0A>'
  3780. print_info: EOG token = 2 '</s>'
  3781. print_info: max token length = 48
  3782. llama_model_load: vocab only - skipping tensors
  3783. time=2025-07-19T18:07:46.365+02:00 level=INFO source=server.go:438 msg="starting llama server" cmd="C:\\Users\\Haldi\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --model C:\\Users\\Haldi\\.ollama\\models\\blobs\\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 --ctx-size 2085 --batch-size 512 --n-gpu-layers 23 --threads 16 --no-mmap --parallel 1 --port 53414"
  3784. time=2025-07-19T18:07:46.368+02:00 level=INFO source=sched.go:483 msg="loaded runners" count=1
  3785. time=2025-07-19T18:07:46.368+02:00 level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
  3786. time=2025-07-19T18:07:46.368+02:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server error"
  3787. time=2025-07-19T18:07:46.402+02:00 level=INFO source=runner.go:815 msg="starting go runner"
  3788. ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
  3789. ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
  3790. ggml_cuda_init: found 1 CUDA devices:
  3791. Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  3792. load_backend: loaded CUDA backend from C:\Users\Haldi\AppData\Local\Programs\Ollama\lib\ollama\ggml-cuda.dll
  3793. load_backend: loaded CPU backend from C:\Users\Haldi\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-icelake.dll
  3794. time=2025-07-19T18:07:46.478+02:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
  3795. time=2025-07-19T18:07:46.479+02:00 level=INFO source=runner.go:874 msg="Server listening on 127.0.0.1:53414"
  3796. llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4090) - 22994 MiB free
  3797. llama_model_loader: loaded meta data with 40 key-value pairs and 323 tensors from C:\Users\Haldi\.ollama\models\blobs\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 (version GGUF V3 (latest))
  3798. llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
  3799. llama_model_loader: - kv 0: general.architecture str = llama
  3800. llama_model_loader: - kv 1: general.type str = model
  3801. llama_model_loader: - kv 2: general.name str = Mixtral 8x7B Instruct v0.1
  3802. llama_model_loader: - kv 3: general.version str = v0.1
  3803. llama_model_loader: - kv 4: general.finetune str = Instruct
  3804. llama_model_loader: - kv 5: general.basename str = Mixtral
  3805. llama_model_loader: - kv 6: general.size_label str = 8x7B
  3806. llama_model_loader: - kv 7: general.license str = apache-2.0
  3807. llama_model_loader: - kv 8: general.base_model.count u32 = 1
  3808. llama_model_loader: - kv 9: general.base_model.0.name str = Mixtral 8x7B v0.1
  3809. llama_model_loader: - kv 10: general.base_model.0.version str = v0.1
  3810. llama_model_loader: - kv 11: general.base_model.0.organization str = Mistralai
  3811. llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/mistralai/Mixt...
  3812. llama_model_loader: - kv 13: general.languages arr[str,5] = ["fr", "it", "de", "es", "en"]
  3813. llama_model_loader: - kv 14: llama.block_count u32 = 32
  3814. llama_model_loader: - kv 15: llama.context_length u32 = 32768
  3815. llama_model_loader: - kv 16: llama.embedding_length u32 = 4096
  3816. llama_model_loader: - kv 17: llama.feed_forward_length u32 = 14336
  3817. llama_model_loader: - kv 18: llama.attention.head_count u32 = 32
  3818. llama_model_loader: - kv 19: llama.attention.head_count_kv u32 = 8
  3819. llama_model_loader: - kv 20: llama.rope.freq_base f32 = 1000000.000000
  3820. llama_model_loader: - kv 21: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
  3821. llama_model_loader: - kv 22: llama.expert_count u32 = 8
  3822. llama_model_loader: - kv 23: llama.expert_used_count u32 = 2
  3823. llama_model_loader: - kv 24: general.file_type u32 = 2
  3824. llama_model_loader: - kv 25: llama.vocab_size u32 = 32000
  3825. llama_model_loader: - kv 26: llama.rope.dimension_count u32 = 128
  3826. llama_model_loader: - kv 27: tokenizer.ggml.model str = llama
  3827. llama_model_loader: - kv 28: tokenizer.ggml.pre str = default
  3828. llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
  3829. llama_model_loader: - kv 30: tokenizer.ggml.scores arr[f32,32000] = [-1000.000000, -1000.000000, -1000.00...
  3830. llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,32000] = [3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
  3831. llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 1
  3832. llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 2
  3833. llama_model_loader: - kv 34: tokenizer.ggml.unknown_token_id u32 = 0
  3834. llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = true
  3835. llama_model_loader: - kv 36: tokenizer.ggml.add_eos_token bool = false
  3836. llama_model_loader: - kv 37: tokenizer.chat_template str = {%- if messages[0]['role'] == 'system...
  3837. llama_model_loader: - kv 38: tokenizer.ggml.add_space_prefix bool = false
  3838. llama_model_loader: - kv 39: general.quantization_version u32 = 2
  3839. llama_model_loader: - type f32: 97 tensors
  3840. llama_model_loader: - type q4_0: 161 tensors
  3841. llama_model_loader: - type q8_0: 64 tensors
  3842. llama_model_loader: - type q6_K: 1 tensors
  3843. print_info: file format = GGUF V3 (latest)
  3844. print_info: file type = Q4_0
  3845. print_info: file size = 24.63 GiB (4.53 BPW)
  3846. load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
  3847. load: special tokens cache size = 3
  3848. load: token to piece cache size = 0.1637 MB
  3849. print_info: arch = llama
  3850. print_info: vocab_only = 0
  3851. print_info: n_ctx_train = 32768
  3852. print_info: n_embd = 4096
  3853. print_info: n_layer = 32
  3854. print_info: n_head = 32
  3855. print_info: n_head_kv = 8
  3856. print_info: n_rot = 128
  3857. print_info: n_swa = 0
  3858. print_info: n_swa_pattern = 1
  3859. print_info: n_embd_head_k = 128
  3860. print_info: n_embd_head_v = 128
  3861. print_info: n_gqa = 4
  3862. print_info: n_embd_k_gqa = 1024
  3863. print_info: n_embd_v_gqa = 1024
  3864. print_info: f_norm_eps = 0.0e+00
  3865. print_info: f_norm_rms_eps = 1.0e-05
  3866. print_info: f_clamp_kqv = 0.0e+00
  3867. print_info: f_max_alibi_bias = 0.0e+00
  3868. print_info: f_logit_scale = 0.0e+00
  3869. print_info: f_attn_scale = 0.0e+00
  3870. print_info: n_ff = 14336
  3871. print_info: n_expert = 8
  3872. print_info: n_expert_used = 2
  3873. print_info: causal attn = 1
  3874. print_info: pooling type = 0
  3875. print_info: rope type = 0
  3876. print_info: rope scaling = linear
  3877. print_info: freq_base_train = 1000000.0
  3878. print_info: freq_scale_train = 1
  3879. print_info: n_ctx_orig_yarn = 32768
  3880. print_info: rope_finetuned = unknown
  3881. print_info: ssm_d_conv = 0
  3882. print_info: ssm_d_inner = 0
  3883. print_info: ssm_d_state = 0
  3884. print_info: ssm_dt_rank = 0
  3885. print_info: ssm_dt_b_c_rms = 0
  3886. print_info: model type = 8x7B
  3887. print_info: model params = 46.70 B
  3888. print_info: general.name = Mixtral 8x7B Instruct v0.1
  3889. print_info: vocab type = SPM
  3890. print_info: n_vocab = 32000
  3891. print_info: n_merges = 0
  3892. print_info: BOS token = 1 '<s>'
  3893. print_info: EOS token = 2 '</s>'
  3894. print_info: UNK token = 0 '<unk>'
  3895. print_info: LF token = 13 '<0x0A>'
  3896. print_info: EOG token = 2 '</s>'
  3897. print_info: max token length = 48
  3898. load_tensors: loading model tensors, this can take a while... (mmap = false)
  3899. time=2025-07-19T18:07:46.618+02:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
  3900. load_tensors: offloading 23 repeating layers to GPU
  3901. load_tensors: offloaded 23/33 layers to GPU
  3902. load_tensors: CUDA_Host model buffer size = 7216.77 MiB
  3903. load_tensors: CUDA0 model buffer size = 18001.09 MiB
  3904. llama_context: constructing llama_context
  3905. llama_context: n_seq_max = 1
  3906. llama_context: n_ctx = 2085
  3907. llama_context: n_ctx_per_seq = 2085
  3908. llama_context: n_batch = 512
  3909. llama_context: n_ubatch = 512
  3910. llama_context: causal_attn = 1
  3911. llama_context: flash_attn = 0
  3912. llama_context: freq_base = 1000000.0
  3913. llama_context: freq_scale = 1
  3914. llama_context: n_ctx_per_seq (2085) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
  3915. llama_context: CPU output buffer size = 0.14 MiB
  3916. llama_kv_cache_unified: kv_size = 2112, type_k = 'f16', type_v = 'f16', n_layer = 32, can_shift = 1, padding = 32
  3917. llama_kv_cache_unified: CUDA0 KV buffer size = 189.75 MiB
  3918. llama_kv_cache_unified: CPU KV buffer size = 74.25 MiB
  3919. llama_kv_cache_unified: KV self size = 264.00 MiB, K (f16): 132.00 MiB, V (f16): 132.00 MiB
  3920. llama_context: CUDA0 compute buffer size = 393.13 MiB
  3921. llama_context: CUDA_Host compute buffer size = 12.13 MiB
  3922. llama_context: graph nodes = 1574
  3923. llama_context: graph splits = 112 (with bs=512), 3 (with bs=1)
  3924. time=2025-07-19T18:07:52.881+02:00 level=INFO source=server.go:637 msg="llama runner started in 6.51 seconds"
  3925. [GIN] 2025/07/19 - 18:08:00 | 200 | 14.3169454s | 100.107.36.63 | POST "/api/generate"
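
With more VRAM free at 18:07 (20.1 GiB vs 18.5 GiB earlier), the scheduler now offloads 23 of 33 layers, the CUDA0 model buffer grows to 18001 MiB, and the runner comes up in 6.5 s instead of ~16 s; the KV split again follows the layer ratio (264 MiB × 23/32 = 189.75 MiB on CUDA0, × 9/32 = 74.25 MiB on CPU). If the offload count should not drift between requests as free memory fluctuates, it can be pinned with the documented num_gpu option (the setting behind the runner's --n-gpu-layers flag), using the same request shape as the example above; a hedged sketch with the same assumed "mixtral" tag:

import json, urllib.request

payload = {
    "model": "mixtral",                              # assumed model tag, as above
    "prompt": "Why is the sky blue?",
    "stream": False,
    "options": {"num_gpu": 23, "num_ctx": 4096},     # pin offloaded layers and context window
}
req = urllib.request.Request("http://localhost:11434/api/generate",
                             data=json.dumps(payload).encode(),
                             headers={"Content-Type": "application/json"})
print(json.loads(urllib.request.urlopen(req).read())["response"])
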
  3926. time=2025-07-19T18:08:25.644+02:00 level=INFO source=server.go:135 msg="system memory" total="47.1 GiB" free="32.4 GiB" free_swap="31.2 GiB"
  3927. time=2025-07-19T18:08:25.645+02:00 level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=33 layers.offload=23 layers.split="" memory.available="[20.2 GiB]" memory.gpu_overhead="0 B" memory.required.full="26.9 GiB" memory.required.partial="19.8 GiB" memory.required.kv="275.8 MiB" memory.required.allocations="[19.8 GiB]" memory.weights.total="24.6 GiB" memory.weights.repeating="24.5 GiB" memory.weights.nonrepeating="102.6 MiB" memory.graph.full="174.2 MiB" memory.graph.partial="826.3 MiB"
  3928. llama_model_loader: loaded meta data with 40 key-value pairs and 323 tensors from C:\Users\Haldi\.ollama\models\blobs\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 (version GGUF V3 (latest))
  3929. llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
  3930. llama_model_loader: - kv 0: general.architecture str = llama
  3931. llama_model_loader: - kv 1: general.type str = model
  3932. llama_model_loader: - kv 2: general.name str = Mixtral 8x7B Instruct v0.1
  3933. llama_model_loader: - kv 3: general.version str = v0.1
  3934. llama_model_loader: - kv 4: general.finetune str = Instruct
  3935. llama_model_loader: - kv 5: general.basename str = Mixtral
  3936. llama_model_loader: - kv 6: general.size_label str = 8x7B
  3937. llama_model_loader: - kv 7: general.license str = apache-2.0
  3938. llama_model_loader: - kv 8: general.base_model.count u32 = 1
  3939. llama_model_loader: - kv 9: general.base_model.0.name str = Mixtral 8x7B v0.1
  3940. llama_model_loader: - kv 10: general.base_model.0.version str = v0.1
  3941. llama_model_loader: - kv 11: general.base_model.0.organization str = Mistralai
  3942. llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/mistralai/Mixt...
  3943. llama_model_loader: - kv 13: general.languages arr[str,5] = ["fr", "it", "de", "es", "en"]
  3944. llama_model_loader: - kv 14: llama.block_count u32 = 32
  3945. llama_model_loader: - kv 15: llama.context_length u32 = 32768
  3946. llama_model_loader: - kv 16: llama.embedding_length u32 = 4096
  3947. llama_model_loader: - kv 17: llama.feed_forward_length u32 = 14336
  3948. llama_model_loader: - kv 18: llama.attention.head_count u32 = 32
  3949. llama_model_loader: - kv 19: llama.attention.head_count_kv u32 = 8
  3950. llama_model_loader: - kv 20: llama.rope.freq_base f32 = 1000000.000000
  3951. llama_model_loader: - kv 21: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
  3952. llama_model_loader: - kv 22: llama.expert_count u32 = 8
  3953. llama_model_loader: - kv 23: llama.expert_used_count u32 = 2
  3954. llama_model_loader: - kv 24: general.file_type u32 = 2
  3955. llama_model_loader: - kv 25: llama.vocab_size u32 = 32000
  3956. llama_model_loader: - kv 26: llama.rope.dimension_count u32 = 128
  3957. llama_model_loader: - kv 27: tokenizer.ggml.model str = llama
  3958. llama_model_loader: - kv 28: tokenizer.ggml.pre str = default
  3959. llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
  3960. llama_model_loader: - kv 30: tokenizer.ggml.scores arr[f32,32000] = [-1000.000000, -1000.000000, -1000.00...
  3961. llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,32000] = [3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
  3962. llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 1
  3963. llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 2
  3964. llama_model_loader: - kv 34: tokenizer.ggml.unknown_token_id u32 = 0
  3965. llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = true
  3966. llama_model_loader: - kv 36: tokenizer.ggml.add_eos_token bool = false
  3967. llama_model_loader: - kv 37: tokenizer.chat_template str = {%- if messages[0]['role'] == 'system...
  3968. llama_model_loader: - kv 38: tokenizer.ggml.add_space_prefix bool = false
  3969. llama_model_loader: - kv 39: general.quantization_version u32 = 2
  3970. llama_model_loader: - type f32: 97 tensors
  3971. llama_model_loader: - type q4_0: 161 tensors
  3972. llama_model_loader: - type q8_0: 64 tensors
  3973. llama_model_loader: - type q6_K: 1 tensors
  3974. print_info: file format = GGUF V3 (latest)
  3975. print_info: file type = Q4_0
  3976. print_info: file size = 24.63 GiB (4.53 BPW)
  3977. load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
  3978. load: special tokens cache size = 3
  3979. load: token to piece cache size = 0.1637 MB
  3980. print_info: arch = llama
  3981. print_info: vocab_only = 1
  3982. print_info: model type = ?B
  3983. print_info: model params = 46.70 B
  3984. print_info: general.name = Mixtral 8x7B Instruct v0.1
  3985. print_info: vocab type = SPM
  3986. print_info: n_vocab = 32000
  3987. print_info: n_merges = 0
  3988. print_info: BOS token = 1 '<s>'
  3989. print_info: EOS token = 2 '</s>'
  3990. print_info: UNK token = 0 '<unk>'
  3991. print_info: LF token = 13 '<0x0A>'
  3992. print_info: EOG token = 2 '</s>'
  3993. print_info: max token length = 48
  3994. llama_model_load: vocab only - skipping tensors
  3995. time=2025-07-19T18:08:25.669+02:00 level=INFO source=server.go:438 msg="starting llama server" cmd="C:\\Users\\Haldi\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --model C:\\Users\\Haldi\\.ollama\\models\\blobs\\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 --ctx-size 2206 --batch-size 512 --n-gpu-layers 23 --threads 16 --no-mmap --parallel 1 --port 53424"
  3996. time=2025-07-19T18:08:25.672+02:00 level=INFO source=sched.go:483 msg="loaded runners" count=1
  3997. time=2025-07-19T18:08:25.672+02:00 level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
  3998. time=2025-07-19T18:08:25.673+02:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server error"
  3999. time=2025-07-19T18:08:25.725+02:00 level=INFO source=runner.go:815 msg="starting go runner"
  4000. ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
  4001. ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
  4002. ggml_cuda_init: found 1 CUDA devices:
  4003. Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  4004. load_backend: loaded CUDA backend from C:\Users\Haldi\AppData\Local\Programs\Ollama\lib\ollama\ggml-cuda.dll
  4005. load_backend: loaded CPU backend from C:\Users\Haldi\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-icelake.dll
  4006. time=2025-07-19T18:08:25.801+02:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
  4007. time=2025-07-19T18:08:25.802+02:00 level=INFO source=runner.go:874 msg="Server listening on 127.0.0.1:53424"
  4008. llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4090) - 22994 MiB free
  4009. llama_model_loader: loaded meta data with 40 key-value pairs and 323 tensors from C:\Users\Haldi\.ollama\models\blobs\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 (version GGUF V3 (latest))
  4010. llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
  4011. llama_model_loader: - kv 0: general.architecture str = llama
  4012. llama_model_loader: - kv 1: general.type str = model
  4013. llama_model_loader: - kv 2: general.name str = Mixtral 8x7B Instruct v0.1
  4014. llama_model_loader: - kv 3: general.version str = v0.1
  4015. llama_model_loader: - kv 4: general.finetune str = Instruct
  4016. llama_model_loader: - kv 5: general.basename str = Mixtral
  4017. llama_model_loader: - kv 6: general.size_label str = 8x7B
  4018. llama_model_loader: - kv 7: general.license str = apache-2.0
  4019. llama_model_loader: - kv 8: general.base_model.count u32 = 1
  4020. llama_model_loader: - kv 9: general.base_model.0.name str = Mixtral 8x7B v0.1
  4021. llama_model_loader: - kv 10: general.base_model.0.version str = v0.1
  4022. llama_model_loader: - kv 11: general.base_model.0.organization str = Mistralai
  4023. llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/mistralai/Mixt...
  4024. llama_model_loader: - kv 13: general.languages arr[str,5] = ["fr", "it", "de", "es", "en"]
  4025. llama_model_loader: - kv 14: llama.block_count u32 = 32
  4026. llama_model_loader: - kv 15: llama.context_length u32 = 32768
  4027. llama_model_loader: - kv 16: llama.embedding_length u32 = 4096
  4028. llama_model_loader: - kv 17: llama.feed_forward_length u32 = 14336
  4029. llama_model_loader: - kv 18: llama.attention.head_count u32 = 32
  4030. llama_model_loader: - kv 19: llama.attention.head_count_kv u32 = 8
  4031. llama_model_loader: - kv 20: llama.rope.freq_base f32 = 1000000.000000
  4032. llama_model_loader: - kv 21: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
  4033. llama_model_loader: - kv 22: llama.expert_count u32 = 8
  4034. llama_model_loader: - kv 23: llama.expert_used_count u32 = 2
  4035. llama_model_loader: - kv 24: general.file_type u32 = 2
  4036. llama_model_loader: - kv 25: llama.vocab_size u32 = 32000
  4037. llama_model_loader: - kv 26: llama.rope.dimension_count u32 = 128
  4038. llama_model_loader: - kv 27: tokenizer.ggml.model str = llama
  4039. llama_model_loader: - kv 28: tokenizer.ggml.pre str = default
  4040. llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
  4041. llama_model_loader: - kv 30: tokenizer.ggml.scores arr[f32,32000] = [-1000.000000, -1000.000000, -1000.00...
  4042. llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,32000] = [3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
  4043. llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 1
  4044. llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 2
  4045. llama_model_loader: - kv 34: tokenizer.ggml.unknown_token_id u32 = 0
  4046. llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = true
  4047. llama_model_loader: - kv 36: tokenizer.ggml.add_eos_token bool = false
  4048. llama_model_loader: - kv 37: tokenizer.chat_template str = {%- if messages[0]['role'] == 'system...
  4049. llama_model_loader: - kv 38: tokenizer.ggml.add_space_prefix bool = false
  4050. llama_model_loader: - kv 39: general.quantization_version u32 = 2
  4051. llama_model_loader: - type f32: 97 tensors
  4052. llama_model_loader: - type q4_0: 161 tensors
  4053. llama_model_loader: - type q8_0: 64 tensors
  4054. llama_model_loader: - type q6_K: 1 tensors
  4055. print_info: file format = GGUF V3 (latest)
  4056. print_info: file type = Q4_0
  4057. print_info: file size = 24.63 GiB (4.53 BPW)
  4058. time=2025-07-19T18:08:25.923+02:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
  4059. load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
  4060. load: special tokens cache size = 3
  4061. load: token to piece cache size = 0.1637 MB
  4062. print_info: arch = llama
  4063. print_info: vocab_only = 0
  4064. print_info: n_ctx_train = 32768
  4065. print_info: n_embd = 4096
  4066. print_info: n_layer = 32
  4067. print_info: n_head = 32
  4068. print_info: n_head_kv = 8
  4069. print_info: n_rot = 128
  4070. print_info: n_swa = 0
  4071. print_info: n_swa_pattern = 1
  4072. print_info: n_embd_head_k = 128
  4073. print_info: n_embd_head_v = 128
  4074. print_info: n_gqa = 4
  4075. print_info: n_embd_k_gqa = 1024
  4076. print_info: n_embd_v_gqa = 1024
  4077. print_info: f_norm_eps = 0.0e+00
  4078. print_info: f_norm_rms_eps = 1.0e-05
  4079. print_info: f_clamp_kqv = 0.0e+00
  4080. print_info: f_max_alibi_bias = 0.0e+00
  4081. print_info: f_logit_scale = 0.0e+00
  4082. print_info: f_attn_scale = 0.0e+00
  4083. print_info: n_ff = 14336
  4084. print_info: n_expert = 8
  4085. print_info: n_expert_used = 2
  4086. print_info: causal attn = 1
  4087. print_info: pooling type = 0
  4088. print_info: rope type = 0
  4089. print_info: rope scaling = linear
  4090. print_info: freq_base_train = 1000000.0
  4091. print_info: freq_scale_train = 1
  4092. print_info: n_ctx_orig_yarn = 32768
  4093. print_info: rope_finetuned = unknown
  4094. print_info: ssm_d_conv = 0
  4095. print_info: ssm_d_inner = 0
  4096. print_info: ssm_d_state = 0
  4097. print_info: ssm_dt_rank = 0
  4098. print_info: ssm_dt_b_c_rms = 0
  4099. print_info: model type = 8x7B
  4100. print_info: model params = 46.70 B
  4101. print_info: general.name = Mixtral 8x7B Instruct v0.1
  4102. print_info: vocab type = SPM
  4103. print_info: n_vocab = 32000
  4104. print_info: n_merges = 0
  4105. print_info: BOS token = 1 '<s>'
  4106. print_info: EOS token = 2 '</s>'
  4107. print_info: UNK token = 0 '<unk>'
  4108. print_info: LF token = 13 '<0x0A>'
  4109. print_info: EOG token = 2 '</s>'
  4110. print_info: max token length = 48
  4111. load_tensors: loading model tensors, this can take a while... (mmap = false)
  4112. load_tensors: offloading 23 repeating layers to GPU
  4113. load_tensors: offloaded 23/33 layers to GPU
  4114. load_tensors: CUDA_Host model buffer size = 7216.77 MiB
  4115. load_tensors: CUDA0 model buffer size = 18001.09 MiB
  4116. llama_context: constructing llama_context
  4117. llama_context: n_seq_max = 1
  4118. llama_context: n_ctx = 2206
  4119. llama_context: n_ctx_per_seq = 2206
  4120. llama_context: n_batch = 512
  4121. llama_context: n_ubatch = 512
  4122. llama_context: causal_attn = 1
  4123. llama_context: flash_attn = 0
  4124. llama_context: freq_base = 1000000.0
  4125. llama_context: freq_scale = 1
  4126. llama_context: n_ctx_per_seq (2206) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
  4127. llama_context: CPU output buffer size = 0.14 MiB
  4128. llama_kv_cache_unified: kv_size = 2208, type_k = 'f16', type_v = 'f16', n_layer = 32, can_shift = 1, padding = 32
  4129. llama_kv_cache_unified: CUDA0 KV buffer size = 198.38 MiB
  4130. llama_kv_cache_unified: CPU KV buffer size = 77.62 MiB
  4131. llama_kv_cache_unified: KV self size = 276.00 MiB, K (f16): 138.00 MiB, V (f16): 138.00 MiB
  4132. llama_context: CUDA0 compute buffer size = 393.31 MiB
  4133. llama_context: CUDA_Host compute buffer size = 12.32 MiB
  4134. llama_context: graph nodes = 1574
  4135. llama_context: graph splits = 112 (with bs=512), 3 (with bs=1)
  4136. time=2025-07-19T18:08:30.181+02:00 level=INFO source=server.go:637 msg="llama runner started in 4.51 seconds"
  4137. time=2025-07-19T18:08:30.185+02:00 level=WARN source=runner.go:128 msg="truncating input prompt" limit=2206 prompt=2254 keep=5 new=2206
  4138. [GIN] 2025/07/19 - 18:08:36 | 200 | 12.0389404s | 100.107.36.63 | POST "/api/generate"
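Editor's note: the KV-cache figures in the load above can be reproduced from the logged parameters alone. Below is a minimal back-of-the-envelope check, assuming only the values printed by llama_kv_cache_unified and load_tensors (kv_size=2208, n_layer=32, n_embd_k_gqa=n_embd_v_gqa=1024, f16 cache, 23 of 32 repeating layers on the GPU); this is plain arithmetic on what the log reports, not Ollama or llama.cpp code.

kv_size    = 2208   # padded context ("llama_kv_cache_unified: kv_size = 2208")
n_layer    = 32     # llama.block_count
n_embd_kv  = 1024   # n_embd_k_gqa == n_embd_v_gqa
bytes_f16  = 2      # type_k = type_v = 'f16'
gpu_layers = 23     # "offloaded 23/33 layers to GPU" (23 repeating layers)

k_mib = kv_size * n_layer * n_embd_kv * bytes_f16 / 2**20
print(k_mib)                             # 138.0   -> "K (f16): 138.00 MiB"
print(2 * k_mib)                         # 276.0   -> "KV self size = 276.00 MiB"
print(2 * k_mib * gpu_layers / n_layer)  # 198.375 -> "CUDA0 KV buffer size = 198.38 MiB"
print(2 * k_mib * (n_layer - gpu_layers) / n_layer)  # 77.625 -> "CPU KV buffer size = 77.62 MiB"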
  4139. time=2025-07-19T18:09:01.925+02:00 level=INFO source=server.go:135 msg="system memory" total="47.1 GiB" free="32.5 GiB" free_swap="31.2 GiB"
  4140. time=2025-07-19T18:09:01.925+02:00 level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=33 layers.offload=23 layers.split="" memory.available="[20.1 GiB]" memory.gpu_overhead="0 B" memory.required.full="26.9 GiB" memory.required.partial="19.8 GiB" memory.required.kv="315.1 MiB" memory.required.allocations="[19.8 GiB]" memory.weights.total="24.6 GiB" memory.weights.repeating="24.5 GiB" memory.weights.nonrepeating="102.6 MiB" memory.graph.full="194.5 MiB" memory.graph.partial="826.9 MiB"
  4141. llama_model_loader: loaded meta data with 40 key-value pairs and 323 tensors from C:\Users\Haldi\.ollama\models\blobs\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 (version GGUF V3 (latest))
  4142. llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
  4143. llama_model_loader: - kv 0: general.architecture str = llama
  4144. llama_model_loader: - kv 1: general.type str = model
  4145. llama_model_loader: - kv 2: general.name str = Mixtral 8x7B Instruct v0.1
  4146. llama_model_loader: - kv 3: general.version str = v0.1
  4147. llama_model_loader: - kv 4: general.finetune str = Instruct
  4148. llama_model_loader: - kv 5: general.basename str = Mixtral
  4149. llama_model_loader: - kv 6: general.size_label str = 8x7B
  4150. llama_model_loader: - kv 7: general.license str = apache-2.0
  4151. llama_model_loader: - kv 8: general.base_model.count u32 = 1
  4152. llama_model_loader: - kv 9: general.base_model.0.name str = Mixtral 8x7B v0.1
  4153. llama_model_loader: - kv 10: general.base_model.0.version str = v0.1
  4154. llama_model_loader: - kv 11: general.base_model.0.organization str = Mistralai
  4155. llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/mistralai/Mixt...
  4156. llama_model_loader: - kv 13: general.languages arr[str,5] = ["fr", "it", "de", "es", "en"]
  4157. llama_model_loader: - kv 14: llama.block_count u32 = 32
  4158. llama_model_loader: - kv 15: llama.context_length u32 = 32768
  4159. llama_model_loader: - kv 16: llama.embedding_length u32 = 4096
  4160. llama_model_loader: - kv 17: llama.feed_forward_length u32 = 14336
  4161. llama_model_loader: - kv 18: llama.attention.head_count u32 = 32
  4162. llama_model_loader: - kv 19: llama.attention.head_count_kv u32 = 8
  4163. llama_model_loader: - kv 20: llama.rope.freq_base f32 = 1000000.000000
  4164. llama_model_loader: - kv 21: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
  4165. llama_model_loader: - kv 22: llama.expert_count u32 = 8
  4166. llama_model_loader: - kv 23: llama.expert_used_count u32 = 2
  4167. llama_model_loader: - kv 24: general.file_type u32 = 2
  4168. llama_model_loader: - kv 25: llama.vocab_size u32 = 32000
  4169. llama_model_loader: - kv 26: llama.rope.dimension_count u32 = 128
  4170. llama_model_loader: - kv 27: tokenizer.ggml.model str = llama
  4171. llama_model_loader: - kv 28: tokenizer.ggml.pre str = default
  4172. llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
  4173. llama_model_loader: - kv 30: tokenizer.ggml.scores arr[f32,32000] = [-1000.000000, -1000.000000, -1000.00...
  4174. llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,32000] = [3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
  4175. llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 1
  4176. llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 2
  4177. llama_model_loader: - kv 34: tokenizer.ggml.unknown_token_id u32 = 0
  4178. llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = true
  4179. llama_model_loader: - kv 36: tokenizer.ggml.add_eos_token bool = false
  4180. llama_model_loader: - kv 37: tokenizer.chat_template str = {%- if messages[0]['role'] == 'system...
  4181. llama_model_loader: - kv 38: tokenizer.ggml.add_space_prefix bool = false
  4182. llama_model_loader: - kv 39: general.quantization_version u32 = 2
  4183. llama_model_loader: - type f32: 97 tensors
  4184. llama_model_loader: - type q4_0: 161 tensors
  4185. llama_model_loader: - type q8_0: 64 tensors
  4186. llama_model_loader: - type q6_K: 1 tensors
  4187. print_info: file format = GGUF V3 (latest)
  4188. print_info: file type = Q4_0
  4189. print_info: file size = 24.63 GiB (4.53 BPW)
  4190. load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
  4191. load: special tokens cache size = 3
  4192. load: token to piece cache size = 0.1637 MB
  4193. print_info: arch = llama
  4194. print_info: vocab_only = 1
  4195. print_info: model type = ?B
  4196. print_info: model params = 46.70 B
  4197. print_info: general.name = Mixtral 8x7B Instruct v0.1
  4198. print_info: vocab type = SPM
  4199. print_info: n_vocab = 32000
  4200. print_info: n_merges = 0
  4201. print_info: BOS token = 1 '<s>'
  4202. print_info: EOS token = 2 '</s>'
  4203. print_info: UNK token = 0 '<unk>'
  4204. print_info: LF token = 13 '<0x0A>'
  4205. print_info: EOG token = 2 '</s>'
  4206. print_info: max token length = 48
  4207. llama_model_load: vocab only - skipping tensors
  4208. time=2025-07-19T18:09:01.949+02:00 level=INFO source=server.go:438 msg="starting llama server" cmd="C:\\Users\\Haldi\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --model C:\\Users\\Haldi\\.ollama\\models\\blobs\\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 --ctx-size 2521 --batch-size 512 --n-gpu-layers 23 --threads 16 --no-mmap --parallel 1 --port 53432"
  4209. time=2025-07-19T18:09:01.952+02:00 level=INFO source=sched.go:483 msg="loaded runners" count=1
  4210. time=2025-07-19T18:09:01.952+02:00 level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
  4211. time=2025-07-19T18:09:01.952+02:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server error"
  4212. time=2025-07-19T18:09:02.007+02:00 level=INFO source=runner.go:815 msg="starting go runner"
  4213. ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
  4214. ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
  4215. ggml_cuda_init: found 1 CUDA devices:
  4216. Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  4217. load_backend: loaded CUDA backend from C:\Users\Haldi\AppData\Local\Programs\Ollama\lib\ollama\ggml-cuda.dll
  4218. load_backend: loaded CPU backend from C:\Users\Haldi\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-icelake.dll
  4219. time=2025-07-19T18:09:02.093+02:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
  4220. time=2025-07-19T18:09:02.094+02:00 level=INFO source=runner.go:874 msg="Server listening on 127.0.0.1:53432"
  4221. llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4090) - 22994 MiB free
  4222. time=2025-07-19T18:09:02.203+02:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
  4223. llama_model_loader: loaded meta data with 40 key-value pairs and 323 tensors from C:\Users\Haldi\.ollama\models\blobs\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 (version GGUF V3 (latest))
  4224. llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
  4225. llama_model_loader: - kv 0: general.architecture str = llama
  4226. llama_model_loader: - kv 1: general.type str = model
  4227. llama_model_loader: - kv 2: general.name str = Mixtral 8x7B Instruct v0.1
  4228. llama_model_loader: - kv 3: general.version str = v0.1
  4229. llama_model_loader: - kv 4: general.finetune str = Instruct
  4230. llama_model_loader: - kv 5: general.basename str = Mixtral
  4231. llama_model_loader: - kv 6: general.size_label str = 8x7B
  4232. llama_model_loader: - kv 7: general.license str = apache-2.0
  4233. llama_model_loader: - kv 8: general.base_model.count u32 = 1
  4234. llama_model_loader: - kv 9: general.base_model.0.name str = Mixtral 8x7B v0.1
  4235. llama_model_loader: - kv 10: general.base_model.0.version str = v0.1
  4236. llama_model_loader: - kv 11: general.base_model.0.organization str = Mistralai
  4237. llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/mistralai/Mixt...
  4238. llama_model_loader: - kv 13: general.languages arr[str,5] = ["fr", "it", "de", "es", "en"]
  4239. llama_model_loader: - kv 14: llama.block_count u32 = 32
  4240. llama_model_loader: - kv 15: llama.context_length u32 = 32768
  4241. llama_model_loader: - kv 16: llama.embedding_length u32 = 4096
  4242. llama_model_loader: - kv 17: llama.feed_forward_length u32 = 14336
  4243. llama_model_loader: - kv 18: llama.attention.head_count u32 = 32
  4244. llama_model_loader: - kv 19: llama.attention.head_count_kv u32 = 8
  4245. llama_model_loader: - kv 20: llama.rope.freq_base f32 = 1000000.000000
  4246. llama_model_loader: - kv 21: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
  4247. llama_model_loader: - kv 22: llama.expert_count u32 = 8
  4248. llama_model_loader: - kv 23: llama.expert_used_count u32 = 2
  4249. llama_model_loader: - kv 24: general.file_type u32 = 2
  4250. llama_model_loader: - kv 25: llama.vocab_size u32 = 32000
  4251. llama_model_loader: - kv 26: llama.rope.dimension_count u32 = 128
  4252. llama_model_loader: - kv 27: tokenizer.ggml.model str = llama
  4253. llama_model_loader: - kv 28: tokenizer.ggml.pre str = default
  4254. llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
  4255. llama_model_loader: - kv 30: tokenizer.ggml.scores arr[f32,32000] = [-1000.000000, -1000.000000, -1000.00...
  4256. llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,32000] = [3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
  4257. llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 1
  4258. llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 2
  4259. llama_model_loader: - kv 34: tokenizer.ggml.unknown_token_id u32 = 0
  4260. llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = true
  4261. llama_model_loader: - kv 36: tokenizer.ggml.add_eos_token bool = false
  4262. llama_model_loader: - kv 37: tokenizer.chat_template str = {%- if messages[0]['role'] == 'system...
  4263. llama_model_loader: - kv 38: tokenizer.ggml.add_space_prefix bool = false
  4264. llama_model_loader: - kv 39: general.quantization_version u32 = 2
  4265. llama_model_loader: - type f32: 97 tensors
  4266. llama_model_loader: - type q4_0: 161 tensors
  4267. llama_model_loader: - type q8_0: 64 tensors
  4268. llama_model_loader: - type q6_K: 1 tensors
  4269. print_info: file format = GGUF V3 (latest)
  4270. print_info: file type = Q4_0
  4271. print_info: file size = 24.63 GiB (4.53 BPW)
  4272. load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
  4273. load: special tokens cache size = 3
  4274. load: token to piece cache size = 0.1637 MB
  4275. print_info: arch = llama
  4276. print_info: vocab_only = 0
  4277. print_info: n_ctx_train = 32768
  4278. print_info: n_embd = 4096
  4279. print_info: n_layer = 32
  4280. print_info: n_head = 32
  4281. print_info: n_head_kv = 8
  4282. print_info: n_rot = 128
  4283. print_info: n_swa = 0
  4284. print_info: n_swa_pattern = 1
  4285. print_info: n_embd_head_k = 128
  4286. print_info: n_embd_head_v = 128
  4287. print_info: n_gqa = 4
  4288. print_info: n_embd_k_gqa = 1024
  4289. print_info: n_embd_v_gqa = 1024
  4290. print_info: f_norm_eps = 0.0e+00
  4291. print_info: f_norm_rms_eps = 1.0e-05
  4292. print_info: f_clamp_kqv = 0.0e+00
  4293. print_info: f_max_alibi_bias = 0.0e+00
  4294. print_info: f_logit_scale = 0.0e+00
  4295. print_info: f_attn_scale = 0.0e+00
  4296. print_info: n_ff = 14336
  4297. print_info: n_expert = 8
  4298. print_info: n_expert_used = 2
  4299. print_info: causal attn = 1
  4300. print_info: pooling type = 0
  4301. print_info: rope type = 0
  4302. print_info: rope scaling = linear
  4303. print_info: freq_base_train = 1000000.0
  4304. print_info: freq_scale_train = 1
  4305. print_info: n_ctx_orig_yarn = 32768
  4306. print_info: rope_finetuned = unknown
  4307. print_info: ssm_d_conv = 0
  4308. print_info: ssm_d_inner = 0
  4309. print_info: ssm_d_state = 0
  4310. print_info: ssm_dt_rank = 0
  4311. print_info: ssm_dt_b_c_rms = 0
  4312. print_info: model type = 8x7B
  4313. print_info: model params = 46.70 B
  4314. print_info: general.name = Mixtral 8x7B Instruct v0.1
  4315. print_info: vocab type = SPM
  4316. print_info: n_vocab = 32000
  4317. print_info: n_merges = 0
  4318. print_info: BOS token = 1 '<s>'
  4319. print_info: EOS token = 2 '</s>'
  4320. print_info: UNK token = 0 '<unk>'
  4321. print_info: LF token = 13 '<0x0A>'
  4322. print_info: EOG token = 2 '</s>'
  4323. print_info: max token length = 48
  4324. load_tensors: loading model tensors, this can take a while... (mmap = false)
  4325. load_tensors: offloading 23 repeating layers to GPU
  4326. load_tensors: offloaded 23/33 layers to GPU
  4327. load_tensors: CUDA_Host model buffer size = 7216.77 MiB
  4328. load_tensors: CUDA0 model buffer size = 18001.09 MiB
  4329. llama_context: constructing llama_context
  4330. llama_context: n_seq_max = 1
  4331. llama_context: n_ctx = 2521
  4332. llama_context: n_ctx_per_seq = 2521
  4333. llama_context: n_batch = 512
  4334. llama_context: n_ubatch = 512
  4335. llama_context: causal_attn = 1
  4336. llama_context: flash_attn = 0
  4337. llama_context: freq_base = 1000000.0
  4338. llama_context: freq_scale = 1
  4339. llama_context: n_ctx_per_seq (2521) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
  4340. llama_context: CPU output buffer size = 0.14 MiB
  4341. llama_kv_cache_unified: kv_size = 2528, type_k = 'f16', type_v = 'f16', n_layer = 32, can_shift = 1, padding = 32
  4342. llama_kv_cache_unified: CUDA0 KV buffer size = 227.12 MiB
  4343. llama_kv_cache_unified: CPU KV buffer size = 88.88 MiB
  4344. llama_kv_cache_unified: KV self size = 316.00 MiB, K (f16): 158.00 MiB, V (f16): 158.00 MiB
  4345. llama_context: CUDA0 compute buffer size = 393.94 MiB
  4346. llama_context: CUDA_Host compute buffer size = 12.94 MiB
  4347. llama_context: graph nodes = 1574
  4348. llama_context: graph splits = 112 (with bs=512), 3 (with bs=1)
  4349. time=2025-07-19T18:09:06.459+02:00 level=INFO source=server.go:637 msg="llama runner started in 4.51 seconds"
  4350. time=2025-07-19T18:09:06.462+02:00 level=WARN source=runner.go:128 msg="truncating input prompt" limit=2521 prompt=3125 keep=5 new=2521
  4351. [GIN] 2025/07/19 - 18:09:19 | 200 | 18.661376s | 100.107.36.63 | POST "/api/generate"
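Editor's note: a quick consistency check (illustrative only) that the two model buffers reported by load_tensors account for the whole GGUF file when --no-mmap is in effect, i.e. the weights not offloaded to CUDA0 are held in host RAM rather than memory-mapped.

cuda_host_mib = 7216.77    # "CUDA_Host model buffer size"
cuda0_mib     = 18001.09   # "CUDA0 model buffer size"
print((cuda_host_mib + cuda0_mib) / 1024)   # ~24.63 GiB, matching "file size = 24.63 GiB"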
  4352. time=2025-07-19T18:09:50.430+02:00 level=INFO source=server.go:135 msg="system memory" total="47.1 GiB" free="32.6 GiB" free_swap="31.2 GiB"
  4353. time=2025-07-19T18:09:50.430+02:00 level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=33 layers.offload=23 layers.split="" memory.available="[20.0 GiB]" memory.gpu_overhead="0 B" memory.required.full="26.8 GiB" memory.required.partial="19.8 GiB" memory.required.kv="256.0 MiB" memory.required.allocations="[19.8 GiB]" memory.weights.total="24.6 GiB" memory.weights.repeating="24.5 GiB" memory.weights.nonrepeating="102.6 MiB" memory.graph.full="164.0 MiB" memory.graph.partial="826.0 MiB"
  4354. llama_model_loader: loaded meta data with 40 key-value pairs and 323 tensors from C:\Users\Haldi\.ollama\models\blobs\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 (version GGUF V3 (latest))
  4355. llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
  4356. llama_model_loader: - kv 0: general.architecture str = llama
  4357. llama_model_loader: - kv 1: general.type str = model
  4358. llama_model_loader: - kv 2: general.name str = Mixtral 8x7B Instruct v0.1
  4359. llama_model_loader: - kv 3: general.version str = v0.1
  4360. llama_model_loader: - kv 4: general.finetune str = Instruct
  4361. llama_model_loader: - kv 5: general.basename str = Mixtral
  4362. llama_model_loader: - kv 6: general.size_label str = 8x7B
  4363. llama_model_loader: - kv 7: general.license str = apache-2.0
  4364. llama_model_loader: - kv 8: general.base_model.count u32 = 1
  4365. llama_model_loader: - kv 9: general.base_model.0.name str = Mixtral 8x7B v0.1
  4366. llama_model_loader: - kv 10: general.base_model.0.version str = v0.1
  4367. llama_model_loader: - kv 11: general.base_model.0.organization str = Mistralai
  4368. llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/mistralai/Mixt...
  4369. llama_model_loader: - kv 13: general.languages arr[str,5] = ["fr", "it", "de", "es", "en"]
  4370. llama_model_loader: - kv 14: llama.block_count u32 = 32
  4371. llama_model_loader: - kv 15: llama.context_length u32 = 32768
  4372. llama_model_loader: - kv 16: llama.embedding_length u32 = 4096
  4373. llama_model_loader: - kv 17: llama.feed_forward_length u32 = 14336
  4374. llama_model_loader: - kv 18: llama.attention.head_count u32 = 32
  4375. llama_model_loader: - kv 19: llama.attention.head_count_kv u32 = 8
  4376. llama_model_loader: - kv 20: llama.rope.freq_base f32 = 1000000.000000
  4377. llama_model_loader: - kv 21: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
  4378. llama_model_loader: - kv 22: llama.expert_count u32 = 8
  4379. llama_model_loader: - kv 23: llama.expert_used_count u32 = 2
  4380. llama_model_loader: - kv 24: general.file_type u32 = 2
  4381. llama_model_loader: - kv 25: llama.vocab_size u32 = 32000
  4382. llama_model_loader: - kv 26: llama.rope.dimension_count u32 = 128
  4383. llama_model_loader: - kv 27: tokenizer.ggml.model str = llama
  4384. llama_model_loader: - kv 28: tokenizer.ggml.pre str = default
  4385. llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
  4386. llama_model_loader: - kv 30: tokenizer.ggml.scores arr[f32,32000] = [-1000.000000, -1000.000000, -1000.00...
  4387. llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,32000] = [3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
  4388. llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 1
  4389. llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 2
  4390. llama_model_loader: - kv 34: tokenizer.ggml.unknown_token_id u32 = 0
  4391. llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = true
  4392. llama_model_loader: - kv 36: tokenizer.ggml.add_eos_token bool = false
  4393. llama_model_loader: - kv 37: tokenizer.chat_template str = {%- if messages[0]['role'] == 'system...
  4394. llama_model_loader: - kv 38: tokenizer.ggml.add_space_prefix bool = false
  4395. llama_model_loader: - kv 39: general.quantization_version u32 = 2
  4396. llama_model_loader: - type f32: 97 tensors
  4397. llama_model_loader: - type q4_0: 161 tensors
  4398. llama_model_loader: - type q8_0: 64 tensors
  4399. llama_model_loader: - type q6_K: 1 tensors
  4400. print_info: file format = GGUF V3 (latest)
  4401. print_info: file type = Q4_0
  4402. print_info: file size = 24.63 GiB (4.53 BPW)
  4403. load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
  4404. load: special tokens cache size = 3
  4405. load: token to piece cache size = 0.1637 MB
  4406. print_info: arch = llama
  4407. print_info: vocab_only = 1
  4408. print_info: model type = ?B
  4409. print_info: model params = 46.70 B
  4410. print_info: general.name = Mixtral 8x7B Instruct v0.1
  4411. print_info: vocab type = SPM
  4412. print_info: n_vocab = 32000
  4413. print_info: n_merges = 0
  4414. print_info: BOS token = 1 '<s>'
  4415. print_info: EOS token = 2 '</s>'
  4416. print_info: UNK token = 0 '<unk>'
  4417. print_info: LF token = 13 '<0x0A>'
  4418. print_info: EOG token = 2 '</s>'
  4419. print_info: max token length = 48
  4420. llama_model_load: vocab only - skipping tensors
  4421. time=2025-07-19T18:09:50.454+02:00 level=INFO source=server.go:438 msg="starting llama server" cmd="C:\\Users\\Haldi\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --model C:\\Users\\Haldi\\.ollama\\models\\blobs\\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 --ctx-size 1913 --batch-size 512 --n-gpu-layers 23 --threads 16 --no-mmap --parallel 1 --port 53442"
  4422. time=2025-07-19T18:09:50.457+02:00 level=INFO source=sched.go:483 msg="loaded runners" count=1
  4423. time=2025-07-19T18:09:50.457+02:00 level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
  4424. time=2025-07-19T18:09:50.457+02:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server error"
  4425. time=2025-07-19T18:09:50.507+02:00 level=INFO source=runner.go:815 msg="starting go runner"
  4426. ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
  4427. ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
  4428. ggml_cuda_init: found 1 CUDA devices:
  4429. Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  4430. load_backend: loaded CUDA backend from C:\Users\Haldi\AppData\Local\Programs\Ollama\lib\ollama\ggml-cuda.dll
  4431. load_backend: loaded CPU backend from C:\Users\Haldi\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-icelake.dll
  4432. time=2025-07-19T18:09:50.584+02:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
  4433. time=2025-07-19T18:09:50.584+02:00 level=INFO source=runner.go:874 msg="Server listening on 127.0.0.1:53442"
  4434. llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4090) - 22994 MiB free
  4435. llama_model_loader: loaded meta data with 40 key-value pairs and 323 tensors from C:\Users\Haldi\.ollama\models\blobs\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 (version GGUF V3 (latest))
  4436. llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
  4437. llama_model_loader: - kv 0: general.architecture str = llama
  4438. llama_model_loader: - kv 1: general.type str = model
  4439. llama_model_loader: - kv 2: general.name str = Mixtral 8x7B Instruct v0.1
  4440. llama_model_loader: - kv 3: general.version str = v0.1
  4441. llama_model_loader: - kv 4: general.finetune str = Instruct
  4442. llama_model_loader: - kv 5: general.basename str = Mixtral
  4443. llama_model_loader: - kv 6: general.size_label str = 8x7B
  4444. llama_model_loader: - kv 7: general.license str = apache-2.0
  4445. llama_model_loader: - kv 8: general.base_model.count u32 = 1
  4446. llama_model_loader: - kv 9: general.base_model.0.name str = Mixtral 8x7B v0.1
  4447. llama_model_loader: - kv 10: general.base_model.0.version str = v0.1
  4448. llama_model_loader: - kv 11: general.base_model.0.organization str = Mistralai
  4449. llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/mistralai/Mixt...
  4450. llama_model_loader: - kv 13: general.languages arr[str,5] = ["fr", "it", "de", "es", "en"]
  4451. llama_model_loader: - kv 14: llama.block_count u32 = 32
  4452. llama_model_loader: - kv 15: llama.context_length u32 = 32768
  4453. llama_model_loader: - kv 16: llama.embedding_length u32 = 4096
  4454. llama_model_loader: - kv 17: llama.feed_forward_length u32 = 14336
  4455. llama_model_loader: - kv 18: llama.attention.head_count u32 = 32
  4456. llama_model_loader: - kv 19: llama.attention.head_count_kv u32 = 8
  4457. llama_model_loader: - kv 20: llama.rope.freq_base f32 = 1000000.000000
  4458. llama_model_loader: - kv 21: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
  4459. llama_model_loader: - kv 22: llama.expert_count u32 = 8
  4460. llama_model_loader: - kv 23: llama.expert_used_count u32 = 2
  4461. llama_model_loader: - kv 24: general.file_type u32 = 2
  4462. llama_model_loader: - kv 25: llama.vocab_size u32 = 32000
  4463. llama_model_loader: - kv 26: llama.rope.dimension_count u32 = 128
  4464. llama_model_loader: - kv 27: tokenizer.ggml.model str = llama
  4465. llama_model_loader: - kv 28: tokenizer.ggml.pre str = default
  4466. llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
  4467. llama_model_loader: - kv 30: tokenizer.ggml.scores arr[f32,32000] = [-1000.000000, -1000.000000, -1000.00...
  4468. llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,32000] = [3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
  4469. llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 1
  4470. llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 2
  4471. llama_model_loader: - kv 34: tokenizer.ggml.unknown_token_id u32 = 0
  4472. llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = true
  4473. llama_model_loader: - kv 36: tokenizer.ggml.add_eos_token bool = false
  4474. llama_model_loader: - kv 37: tokenizer.chat_template str = {%- if messages[0]['role'] == 'system...
  4475. llama_model_loader: - kv 38: tokenizer.ggml.add_space_prefix bool = false
  4476. llama_model_loader: - kv 39: general.quantization_version u32 = 2
  4477. llama_model_loader: - type f32: 97 tensors
  4478. llama_model_loader: - type q4_0: 161 tensors
  4479. llama_model_loader: - type q8_0: 64 tensors
  4480. llama_model_loader: - type q6_K: 1 tensors
  4481. print_info: file format = GGUF V3 (latest)
  4482. print_info: file type = Q4_0
  4483. print_info: file size = 24.63 GiB (4.53 BPW)
  4484. load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
  4485. time=2025-07-19T18:09:50.708+02:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
  4486. load: special tokens cache size = 3
  4487. load: token to piece cache size = 0.1637 MB
  4488. print_info: arch = llama
  4489. print_info: vocab_only = 0
  4490. print_info: n_ctx_train = 32768
  4491. print_info: n_embd = 4096
  4492. print_info: n_layer = 32
  4493. print_info: n_head = 32
  4494. print_info: n_head_kv = 8
  4495. print_info: n_rot = 128
  4496. print_info: n_swa = 0
  4497. print_info: n_swa_pattern = 1
  4498. print_info: n_embd_head_k = 128
  4499. print_info: n_embd_head_v = 128
  4500. print_info: n_gqa = 4
  4501. print_info: n_embd_k_gqa = 1024
  4502. print_info: n_embd_v_gqa = 1024
  4503. print_info: f_norm_eps = 0.0e+00
  4504. print_info: f_norm_rms_eps = 1.0e-05
  4505. print_info: f_clamp_kqv = 0.0e+00
  4506. print_info: f_max_alibi_bias = 0.0e+00
  4507. print_info: f_logit_scale = 0.0e+00
  4508. print_info: f_attn_scale = 0.0e+00
  4509. print_info: n_ff = 14336
  4510. print_info: n_expert = 8
  4511. print_info: n_expert_used = 2
  4512. print_info: causal attn = 1
  4513. print_info: pooling type = 0
  4514. print_info: rope type = 0
  4515. print_info: rope scaling = linear
  4516. print_info: freq_base_train = 1000000.0
  4517. print_info: freq_scale_train = 1
  4518. print_info: n_ctx_orig_yarn = 32768
  4519. print_info: rope_finetuned = unknown
  4520. print_info: ssm_d_conv = 0
  4521. print_info: ssm_d_inner = 0
  4522. print_info: ssm_d_state = 0
  4523. print_info: ssm_dt_rank = 0
  4524. print_info: ssm_dt_b_c_rms = 0
  4525. print_info: model type = 8x7B
  4526. print_info: model params = 46.70 B
  4527. print_info: general.name = Mixtral 8x7B Instruct v0.1
  4528. print_info: vocab type = SPM
  4529. print_info: n_vocab = 32000
  4530. print_info: n_merges = 0
  4531. print_info: BOS token = 1 '<s>'
  4532. print_info: EOS token = 2 '</s>'
  4533. print_info: UNK token = 0 '<unk>'
  4534. print_info: LF token = 13 '<0x0A>'
  4535. print_info: EOG token = 2 '</s>'
  4536. print_info: max token length = 48
  4537. load_tensors: loading model tensors, this can take a while... (mmap = false)
  4538. load_tensors: offloading 23 repeating layers to GPU
  4539. load_tensors: offloaded 23/33 layers to GPU
  4540. load_tensors: CUDA_Host model buffer size = 7216.77 MiB
  4541. load_tensors: CUDA0 model buffer size = 18001.09 MiB
  4542. llama_context: constructing llama_context
  4543. llama_context: n_seq_max = 1
  4544. llama_context: n_ctx = 1913
  4545. llama_context: n_ctx_per_seq = 1913
  4546. llama_context: n_batch = 512
  4547. llama_context: n_ubatch = 512
  4548. llama_context: causal_attn = 1
  4549. llama_context: flash_attn = 0
  4550. llama_context: freq_base = 1000000.0
  4551. llama_context: freq_scale = 1
  4552. llama_context: n_ctx_per_seq (1913) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
  4553. llama_context: CPU output buffer size = 0.14 MiB
  4554. llama_kv_cache_unified: kv_size = 1920, type_k = 'f16', type_v = 'f16', n_layer = 32, can_shift = 1, padding = 32
  4555. llama_kv_cache_unified: CUDA0 KV buffer size = 172.50 MiB
  4556. llama_kv_cache_unified: CPU KV buffer size = 67.50 MiB
  4557. llama_kv_cache_unified: KV self size = 240.00 MiB, K (f16): 120.00 MiB, V (f16): 120.00 MiB
  4558. llama_context: CUDA0 compute buffer size = 405.00 MiB
  4559. llama_context: CUDA_Host compute buffer size = 11.76 MiB
  4560. llama_context: graph nodes = 1574
  4561. llama_context: graph splits = 112 (with bs=512), 3 (with bs=1)
  4562. time=2025-07-19T18:09:54.716+02:00 level=INFO source=server.go:637 msg="llama runner started in 4.26 seconds"
  4563. [GIN] 2025/07/19 - 18:10:02 | 200 | 12.9362782s | 100.107.36.63 | POST "/api/generate"
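Editor's note: the "4.53 BPW" figure in the print_info lines follows directly from the logged file size and parameter count. A rough reproduction, assuming GiB means 2**30 bytes:

file_gib = 24.63    # "file size = 24.63 GiB"
params   = 46.70e9  # "model params = 46.70 B"
print(round(file_gib * 2**30 * 8 / params, 2))   # 4.53 bits per weight (Q4_0 plus some q8_0/q6_K tensors)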
  4564. time=2025-07-19T18:10:39.725+02:00 level=INFO source=server.go:135 msg="system memory" total="47.1 GiB" free="32.4 GiB" free_swap="30.7 GiB"
  4565. time=2025-07-19T18:10:39.726+02:00 level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=33 layers.offload=22 layers.split="" memory.available="[19.8 GiB]" memory.gpu_overhead="0 B" memory.required.full="26.9 GiB" memory.required.partial="19.1 GiB" memory.required.kv="321.6 MiB" memory.required.allocations="[19.1 GiB]" memory.weights.total="24.6 GiB" memory.weights.repeating="24.5 GiB" memory.weights.nonrepeating="102.6 MiB" memory.graph.full="197.8 MiB" memory.graph.partial="827.0 MiB"
  4566. llama_model_loader: loaded meta data with 40 key-value pairs and 323 tensors from C:\Users\Haldi\.ollama\models\blobs\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 (version GGUF V3 (latest))
  4567. llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
  4568. llama_model_loader: - kv 0: general.architecture str = llama
  4569. llama_model_loader: - kv 1: general.type str = model
  4570. llama_model_loader: - kv 2: general.name str = Mixtral 8x7B Instruct v0.1
  4571. llama_model_loader: - kv 3: general.version str = v0.1
  4572. llama_model_loader: - kv 4: general.finetune str = Instruct
  4573. llama_model_loader: - kv 5: general.basename str = Mixtral
  4574. llama_model_loader: - kv 6: general.size_label str = 8x7B
  4575. llama_model_loader: - kv 7: general.license str = apache-2.0
  4576. llama_model_loader: - kv 8: general.base_model.count u32 = 1
  4577. llama_model_loader: - kv 9: general.base_model.0.name str = Mixtral 8x7B v0.1
  4578. llama_model_loader: - kv 10: general.base_model.0.version str = v0.1
  4579. llama_model_loader: - kv 11: general.base_model.0.organization str = Mistralai
  4580. llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/mistralai/Mixt...
  4581. llama_model_loader: - kv 13: general.languages arr[str,5] = ["fr", "it", "de", "es", "en"]
  4582. llama_model_loader: - kv 14: llama.block_count u32 = 32
  4583. llama_model_loader: - kv 15: llama.context_length u32 = 32768
  4584. llama_model_loader: - kv 16: llama.embedding_length u32 = 4096
  4585. llama_model_loader: - kv 17: llama.feed_forward_length u32 = 14336
  4586. llama_model_loader: - kv 18: llama.attention.head_count u32 = 32
  4587. llama_model_loader: - kv 19: llama.attention.head_count_kv u32 = 8
  4588. llama_model_loader: - kv 20: llama.rope.freq_base f32 = 1000000.000000
  4589. llama_model_loader: - kv 21: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
  4590. llama_model_loader: - kv 22: llama.expert_count u32 = 8
  4591. llama_model_loader: - kv 23: llama.expert_used_count u32 = 2
  4592. llama_model_loader: - kv 24: general.file_type u32 = 2
  4593. llama_model_loader: - kv 25: llama.vocab_size u32 = 32000
  4594. llama_model_loader: - kv 26: llama.rope.dimension_count u32 = 128
  4595. llama_model_loader: - kv 27: tokenizer.ggml.model str = llama
  4596. llama_model_loader: - kv 28: tokenizer.ggml.pre str = default
  4597. llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
  4598. llama_model_loader: - kv 30: tokenizer.ggml.scores arr[f32,32000] = [-1000.000000, -1000.000000, -1000.00...
  4599. llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,32000] = [3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
  4600. llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 1
  4601. llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 2
  4602. llama_model_loader: - kv 34: tokenizer.ggml.unknown_token_id u32 = 0
  4603. llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = true
  4604. llama_model_loader: - kv 36: tokenizer.ggml.add_eos_token bool = false
  4605. llama_model_loader: - kv 37: tokenizer.chat_template str = {%- if messages[0]['role'] == 'system...
  4606. llama_model_loader: - kv 38: tokenizer.ggml.add_space_prefix bool = false
  4607. llama_model_loader: - kv 39: general.quantization_version u32 = 2
  4608. llama_model_loader: - type f32: 97 tensors
  4609. llama_model_loader: - type q4_0: 161 tensors
  4610. llama_model_loader: - type q8_0: 64 tensors
  4611. llama_model_loader: - type q6_K: 1 tensors
  4612. print_info: file format = GGUF V3 (latest)
  4613. print_info: file type = Q4_0
  4614. print_info: file size = 24.63 GiB (4.53 BPW)
  4615. load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
  4616. load: special tokens cache size = 3
  4617. load: token to piece cache size = 0.1637 MB
  4618. print_info: arch = llama
  4619. print_info: vocab_only = 1
  4620. print_info: model type = ?B
  4621. print_info: model params = 46.70 B
  4622. print_info: general.name = Mixtral 8x7B Instruct v0.1
  4623. print_info: vocab type = SPM
  4624. print_info: n_vocab = 32000
  4625. print_info: n_merges = 0
  4626. print_info: BOS token = 1 '<s>'
  4627. print_info: EOS token = 2 '</s>'
  4628. print_info: UNK token = 0 '<unk>'
  4629. print_info: LF token = 13 '<0x0A>'
  4630. print_info: EOG token = 2 '</s>'
  4631. print_info: max token length = 48
  4632. llama_model_load: vocab only - skipping tensors
  4633. time=2025-07-19T18:10:39.750+02:00 level=INFO source=server.go:438 msg="starting llama server" cmd="C:\\Users\\Haldi\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --model C:\\Users\\Haldi\\.ollama\\models\\blobs\\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 --ctx-size 2573 --batch-size 512 --n-gpu-layers 22 --threads 16 --no-mmap --parallel 1 --port 53489"
  4634. time=2025-07-19T18:10:39.753+02:00 level=INFO source=sched.go:483 msg="loaded runners" count=1
  4635. time=2025-07-19T18:10:39.753+02:00 level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
  4636. time=2025-07-19T18:10:39.753+02:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server error"
  4637. time=2025-07-19T18:10:39.808+02:00 level=INFO source=runner.go:815 msg="starting go runner"
  4638. ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
  4639. ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
  4640. ggml_cuda_init: found 1 CUDA devices:
  4641. Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  4642. load_backend: loaded CUDA backend from C:\Users\Haldi\AppData\Local\Programs\Ollama\lib\ollama\ggml-cuda.dll
  4643. load_backend: loaded CPU backend from C:\Users\Haldi\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-icelake.dll
  4644. time=2025-07-19T18:10:39.893+02:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
  4645. time=2025-07-19T18:10:39.893+02:00 level=INFO source=runner.go:874 msg="Server listening on 127.0.0.1:53489"
  4646. time=2025-07-19T18:10:40.004+02:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
  4647. llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4090) - 22994 MiB free
  4648. llama_model_loader: loaded meta data with 40 key-value pairs and 323 tensors from C:\Users\Haldi\.ollama\models\blobs\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 (version GGUF V3 (latest))
  4649. llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
  4650. llama_model_loader: - kv 0: general.architecture str = llama
  4651. llama_model_loader: - kv 1: general.type str = model
  4652. llama_model_loader: - kv 2: general.name str = Mixtral 8x7B Instruct v0.1
  4653. llama_model_loader: - kv 3: general.version str = v0.1
  4654. llama_model_loader: - kv 4: general.finetune str = Instruct
  4655. llama_model_loader: - kv 5: general.basename str = Mixtral
  4656. llama_model_loader: - kv 6: general.size_label str = 8x7B
  4657. llama_model_loader: - kv 7: general.license str = apache-2.0
  4658. llama_model_loader: - kv 8: general.base_model.count u32 = 1
  4659. llama_model_loader: - kv 9: general.base_model.0.name str = Mixtral 8x7B v0.1
  4660. llama_model_loader: - kv 10: general.base_model.0.version str = v0.1
  4661. llama_model_loader: - kv 11: general.base_model.0.organization str = Mistralai
  4662. llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/mistralai/Mixt...
  4663. llama_model_loader: - kv 13: general.languages arr[str,5] = ["fr", "it", "de", "es", "en"]
  4664. llama_model_loader: - kv 14: llama.block_count u32 = 32
  4665. llama_model_loader: - kv 15: llama.context_length u32 = 32768
  4666. llama_model_loader: - kv 16: llama.embedding_length u32 = 4096
  4667. llama_model_loader: - kv 17: llama.feed_forward_length u32 = 14336
  4668. llama_model_loader: - kv 18: llama.attention.head_count u32 = 32
  4669. llama_model_loader: - kv 19: llama.attention.head_count_kv u32 = 8
  4670. llama_model_loader: - kv 20: llama.rope.freq_base f32 = 1000000.000000
  4671. llama_model_loader: - kv 21: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
  4672. llama_model_loader: - kv 22: llama.expert_count u32 = 8
  4673. llama_model_loader: - kv 23: llama.expert_used_count u32 = 2
  4674. llama_model_loader: - kv 24: general.file_type u32 = 2
  4675. llama_model_loader: - kv 25: llama.vocab_size u32 = 32000
  4676. llama_model_loader: - kv 26: llama.rope.dimension_count u32 = 128
  4677. llama_model_loader: - kv 27: tokenizer.ggml.model str = llama
  4678. llama_model_loader: - kv 28: tokenizer.ggml.pre str = default
  4679. llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
  4680. llama_model_loader: - kv 30: tokenizer.ggml.scores arr[f32,32000] = [-1000.000000, -1000.000000, -1000.00...
  4681. llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,32000] = [3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
  4682. llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 1
  4683. llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 2
  4684. llama_model_loader: - kv 34: tokenizer.ggml.unknown_token_id u32 = 0
  4685. llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = true
  4686. llama_model_loader: - kv 36: tokenizer.ggml.add_eos_token bool = false
  4687. llama_model_loader: - kv 37: tokenizer.chat_template str = {%- if messages[0]['role'] == 'system...
  4688. llama_model_loader: - kv 38: tokenizer.ggml.add_space_prefix bool = false
  4689. llama_model_loader: - kv 39: general.quantization_version u32 = 2
  4690. llama_model_loader: - type f32: 97 tensors
  4691. llama_model_loader: - type q4_0: 161 tensors
  4692. llama_model_loader: - type q8_0: 64 tensors
  4693. llama_model_loader: - type q6_K: 1 tensors
  4694. print_info: file format = GGUF V3 (latest)
  4695. print_info: file type = Q4_0
  4696. print_info: file size = 24.63 GiB (4.53 BPW)
  4697. load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
  4698. load: special tokens cache size = 3
  4699. load: token to piece cache size = 0.1637 MB
  4700. print_info: arch = llama
  4701. print_info: vocab_only = 0
  4702. print_info: n_ctx_train = 32768
  4703. print_info: n_embd = 4096
  4704. print_info: n_layer = 32
  4705. print_info: n_head = 32
  4706. print_info: n_head_kv = 8
  4707. print_info: n_rot = 128
  4708. print_info: n_swa = 0
  4709. print_info: n_swa_pattern = 1
  4710. print_info: n_embd_head_k = 128
  4711. print_info: n_embd_head_v = 128
  4712. print_info: n_gqa = 4
  4713. print_info: n_embd_k_gqa = 1024
  4714. print_info: n_embd_v_gqa = 1024
  4715. print_info: f_norm_eps = 0.0e+00
  4716. print_info: f_norm_rms_eps = 1.0e-05
  4717. print_info: f_clamp_kqv = 0.0e+00
  4718. print_info: f_max_alibi_bias = 0.0e+00
  4719. print_info: f_logit_scale = 0.0e+00
  4720. print_info: f_attn_scale = 0.0e+00
  4721. print_info: n_ff = 14336
  4722. print_info: n_expert = 8
  4723. print_info: n_expert_used = 2
  4724. print_info: causal attn = 1
  4725. print_info: pooling type = 0
  4726. print_info: rope type = 0
  4727. print_info: rope scaling = linear
  4728. print_info: freq_base_train = 1000000.0
  4729. print_info: freq_scale_train = 1
  4730. print_info: n_ctx_orig_yarn = 32768
  4731. print_info: rope_finetuned = unknown
  4732. print_info: ssm_d_conv = 0
  4733. print_info: ssm_d_inner = 0
  4734. print_info: ssm_d_state = 0
  4735. print_info: ssm_dt_rank = 0
  4736. print_info: ssm_dt_b_c_rms = 0
  4737. print_info: model type = 8x7B
  4738. print_info: model params = 46.70 B
  4739. print_info: general.name = Mixtral 8x7B Instruct v0.1
  4740. print_info: vocab type = SPM
  4741. print_info: n_vocab = 32000
  4742. print_info: n_merges = 0
  4743. print_info: BOS token = 1 '<s>'
  4744. print_info: EOS token = 2 '</s>'
  4745. print_info: UNK token = 0 '<unk>'
  4746. print_info: LF token = 13 '<0x0A>'
  4747. print_info: EOG token = 2 '</s>'
  4748. print_info: max token length = 48
  4749. load_tensors: loading model tensors, this can take a while... (mmap = false)
  4750. load_tensors: offloading 22 repeating layers to GPU
  4751. load_tensors: offloaded 22/33 layers to GPU
  4752. load_tensors: CUDA_Host model buffer size = 7999.43 MiB
  4753. load_tensors: CUDA0 model buffer size = 17218.44 MiB
  4754. llama_context: constructing llama_context
  4755. llama_context: n_seq_max = 1
  4756. llama_context: n_ctx = 2573
  4757. llama_context: n_ctx_per_seq = 2573
  4758. llama_context: n_batch = 512
  4759. llama_context: n_ubatch = 512
  4760. llama_context: causal_attn = 1
  4761. llama_context: flash_attn = 0
  4762. llama_context: freq_base = 1000000.0
  4763. llama_context: freq_scale = 1
  4764. llama_context: n_ctx_per_seq (2573) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
  4765. llama_context: CPU output buffer size = 0.14 MiB
  4766. llama_kv_cache_unified: kv_size = 2592, type_k = 'f16', type_v = 'f16', n_layer = 32, can_shift = 1, padding = 32
  4767. llama_kv_cache_unified: CUDA0 KV buffer size = 222.75 MiB
  4768. llama_kv_cache_unified: CPU KV buffer size = 101.25 MiB
  4769. llama_kv_cache_unified: KV self size = 324.00 MiB, K (f16): 162.00 MiB, V (f16): 162.00 MiB
  4770. llama_context: CUDA0 compute buffer size = 394.06 MiB
  4771. llama_context: CUDA_Host compute buffer size = 13.07 MiB
  4772. llama_context: graph nodes = 1574
  4773. llama_context: graph splits = 124 (with bs=512), 3 (with bs=1)
  4774. time=2025-07-19T18:10:56.035+02:00 level=INFO source=server.go:637 msg="llama runner started in 16.28 seconds"
  4775. time=2025-07-19T18:10:56.040+02:00 level=WARN source=runner.go:128 msg="truncating input prompt" limit=2573 prompt=2966 keep=5 new=2573
  4776. [GIN] 2025/07/19 - 18:11:08 | 200 | 0s | 192.168.1.1 | GET "/"
  4777. [GIN] 2025/07/19 - 18:11:10 | 200 | 31.4398242s | 100.107.36.63 | POST "/api/generate"
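Annotation: the WARN line a few entries up shows a 2966-token prompt being cut down to the runner's 2573-token limit before this /api/generate call completed. If the full prompt is needed, the context can be raised per request through the options field of /api/generate (num_ctx is a documented Ollama option). A hedged sketch in Python; the host and model tag are placeholders, not taken from this log:

# Hedged sketch: ask for a larger context so a ~3000-token prompt is not truncated.
# Assumptions: server reachable at this address, model tag "mixtral:8x7b" (placeholder),
# and enough free VRAM/RAM for the larger KV cache.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",   # adjust to your server address
    json={
        "model": "mixtral:8x7b",             # placeholder tag
        "prompt": "...",                     # the long prompt that was being truncated
        "stream": False,
        "options": {"num_ctx": 8192},        # comfortably above the 2966-token prompt
    },
    timeout=600,
)
print(resp.json()["response"])

Note that a larger num_ctx grows the KV cache, so the scheduler may offload fewer layers to the GPU on the next load.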
  4778. time=2025-07-19T18:11:43.020+02:00 level=INFO source=server.go:135 msg="system memory" total="47.1 GiB" free="32.6 GiB" free_swap="31.1 GiB"
  4779. time=2025-07-19T18:11:43.021+02:00 level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=33 layers.offload=23 layers.split="" memory.available="[20.1 GiB]" memory.gpu_overhead="0 B" memory.required.full="26.9 GiB" memory.required.partial="19.8 GiB" memory.required.kv="279.4 MiB" memory.required.allocations="[19.8 GiB]" memory.weights.total="24.6 GiB" memory.weights.repeating="24.5 GiB" memory.weights.nonrepeating="102.6 MiB" memory.graph.full="176.1 MiB" memory.graph.partial="826.4 MiB"
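Annotation: the offload line above is the scheduler's fit calculation. The full model (memory.required.full = 26.9 GiB) does not fit in the 20.1 GiB reported as available, but the partial plan (memory.required.partial = 19.8 GiB, 23 of 33 layers) does, which is why layers.offload=23. A rough restatement of that decision using only the figures from this log line; this is not Ollama's exact algorithm, just the comparison it logs:

# Hedged reading of the scheduler line above (server.go:175), figures in GiB as logged.
available        = 20.1   # memory.available
required_full    = 26.9   # memory.required.full    -> all 33 layers on GPU
required_partial = 19.8   # memory.required.partial -> 23 of 33 layers on GPU

if required_full <= available:
    print("full offload fits: 33/33 layers")
elif required_partial <= available:
    print("partial offload: 23/33 layers (matches layers.offload=23)")
else:
    print("model would run CPU-only")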
  4780. llama_model_loader: loaded meta data with 40 key-value pairs and 323 tensors from C:\Users\Haldi\.ollama\models\blobs\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 (version GGUF V3 (latest))
  4781. llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
  4782. llama_model_loader: - kv 0: general.architecture str = llama
  4783. llama_model_loader: - kv 1: general.type str = model
  4784. llama_model_loader: - kv 2: general.name str = Mixtral 8x7B Instruct v0.1
  4785. llama_model_loader: - kv 3: general.version str = v0.1
  4786. llama_model_loader: - kv 4: general.finetune str = Instruct
  4787. llama_model_loader: - kv 5: general.basename str = Mixtral
  4788. llama_model_loader: - kv 6: general.size_label str = 8x7B
  4789. llama_model_loader: - kv 7: general.license str = apache-2.0
  4790. llama_model_loader: - kv 8: general.base_model.count u32 = 1
  4791. llama_model_loader: - kv 9: general.base_model.0.name str = Mixtral 8x7B v0.1
  4792. llama_model_loader: - kv 10: general.base_model.0.version str = v0.1
  4793. llama_model_loader: - kv 11: general.base_model.0.organization str = Mistralai
  4794. llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/mistralai/Mixt...
  4795. llama_model_loader: - kv 13: general.languages arr[str,5] = ["fr", "it", "de", "es", "en"]
  4796. llama_model_loader: - kv 14: llama.block_count u32 = 32
  4797. llama_model_loader: - kv 15: llama.context_length u32 = 32768
  4798. llama_model_loader: - kv 16: llama.embedding_length u32 = 4096
  4799. llama_model_loader: - kv 17: llama.feed_forward_length u32 = 14336
  4800. llama_model_loader: - kv 18: llama.attention.head_count u32 = 32
  4801. llama_model_loader: - kv 19: llama.attention.head_count_kv u32 = 8
  4802. llama_model_loader: - kv 20: llama.rope.freq_base f32 = 1000000.000000
  4803. llama_model_loader: - kv 21: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
  4804. llama_model_loader: - kv 22: llama.expert_count u32 = 8
  4805. llama_model_loader: - kv 23: llama.expert_used_count u32 = 2
  4806. llama_model_loader: - kv 24: general.file_type u32 = 2
  4807. llama_model_loader: - kv 25: llama.vocab_size u32 = 32000
  4808. llama_model_loader: - kv 26: llama.rope.dimension_count u32 = 128
  4809. llama_model_loader: - kv 27: tokenizer.ggml.model str = llama
  4810. llama_model_loader: - kv 28: tokenizer.ggml.pre str = default
  4811. llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
  4812. llama_model_loader: - kv 30: tokenizer.ggml.scores arr[f32,32000] = [-1000.000000, -1000.000000, -1000.00...
  4813. llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,32000] = [3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
  4814. llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 1
  4815. llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 2
  4816. llama_model_loader: - kv 34: tokenizer.ggml.unknown_token_id u32 = 0
  4817. llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = true
  4818. llama_model_loader: - kv 36: tokenizer.ggml.add_eos_token bool = false
  4819. llama_model_loader: - kv 37: tokenizer.chat_template str = {%- if messages[0]['role'] == 'system...
  4820. llama_model_loader: - kv 38: tokenizer.ggml.add_space_prefix bool = false
  4821. llama_model_loader: - kv 39: general.quantization_version u32 = 2
  4822. llama_model_loader: - type f32: 97 tensors
  4823. llama_model_loader: - type q4_0: 161 tensors
  4824. llama_model_loader: - type q8_0: 64 tensors
  4825. llama_model_loader: - type q6_K: 1 tensors
  4826. print_info: file format = GGUF V3 (latest)
  4827. print_info: file type = Q4_0
  4828. print_info: file size = 24.63 GiB (4.53 BPW)
  4829. load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
  4830. load: special tokens cache size = 3
  4831. load: token to piece cache size = 0.1637 MB
  4832. print_info: arch = llama
  4833. print_info: vocab_only = 1
  4834. print_info: model type = ?B
  4835. print_info: model params = 46.70 B
  4836. print_info: general.name = Mixtral 8x7B Instruct v0.1
  4837. print_info: vocab type = SPM
  4838. print_info: n_vocab = 32000
  4839. print_info: n_merges = 0
  4840. print_info: BOS token = 1 '<s>'
  4841. print_info: EOS token = 2 '</s>'
  4842. print_info: UNK token = 0 '<unk>'
  4843. print_info: LF token = 13 '<0x0A>'
  4844. print_info: EOG token = 2 '</s>'
  4845. print_info: max token length = 48
  4846. llama_model_load: vocab only - skipping tensors
  4847. time=2025-07-19T18:11:43.045+02:00 level=INFO source=server.go:438 msg="starting llama server" cmd="C:\\Users\\Haldi\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --model C:\\Users\\Haldi\\.ollama\\models\\blobs\\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 --ctx-size 2235 --batch-size 512 --n-gpu-layers 23 --threads 16 --no-mmap --parallel 1 --port 53552"
  4848. time=2025-07-19T18:11:43.048+02:00 level=INFO source=sched.go:483 msg="loaded runners" count=1
  4849. time=2025-07-19T18:11:43.048+02:00 level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
  4850. time=2025-07-19T18:11:43.048+02:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server error"
  4851. time=2025-07-19T18:11:43.102+02:00 level=INFO source=runner.go:815 msg="starting go runner"
  4852. ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
  4853. ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
  4854. ggml_cuda_init: found 1 CUDA devices:
  4855. Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  4856. load_backend: loaded CUDA backend from C:\Users\Haldi\AppData\Local\Programs\Ollama\lib\ollama\ggml-cuda.dll
  4857. load_backend: loaded CPU backend from C:\Users\Haldi\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-icelake.dll
  4858. time=2025-07-19T18:11:43.175+02:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
  4859. time=2025-07-19T18:11:43.176+02:00 level=INFO source=runner.go:874 msg="Server listening on 127.0.0.1:53552"
  4860. llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4090) - 22994 MiB free
  4861. llama_model_loader: loaded meta data with 40 key-value pairs and 323 tensors from C:\Users\Haldi\.ollama\models\blobs\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 (version GGUF V3 (latest))
  4862. llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
  4863. llama_model_loader: - kv 0: general.architecture str = llama
  4864. llama_model_loader: - kv 1: general.type str = model
  4865. llama_model_loader: - kv 2: general.name str = Mixtral 8x7B Instruct v0.1
  4866. llama_model_loader: - kv 3: general.version str = v0.1
  4867. llama_model_loader: - kv 4: general.finetune str = Instruct
  4868. llama_model_loader: - kv 5: general.basename str = Mixtral
  4869. llama_model_loader: - kv 6: general.size_label str = 8x7B
  4870. llama_model_loader: - kv 7: general.license str = apache-2.0
  4871. llama_model_loader: - kv 8: general.base_model.count u32 = 1
  4872. llama_model_loader: - kv 9: general.base_model.0.name str = Mixtral 8x7B v0.1
  4873. llama_model_loader: - kv 10: general.base_model.0.version str = v0.1
  4874. llama_model_loader: - kv 11: general.base_model.0.organization str = Mistralai
  4875. llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/mistralai/Mixt...
  4876. llama_model_loader: - kv 13: general.languages arr[str,5] = ["fr", "it", "de", "es", "en"]
  4877. llama_model_loader: - kv 14: llama.block_count u32 = 32
  4878. llama_model_loader: - kv 15: llama.context_length u32 = 32768
  4879. llama_model_loader: - kv 16: llama.embedding_length u32 = 4096
  4880. llama_model_loader: - kv 17: llama.feed_forward_length u32 = 14336
  4881. llama_model_loader: - kv 18: llama.attention.head_count u32 = 32
  4882. llama_model_loader: - kv 19: llama.attention.head_count_kv u32 = 8
  4883. llama_model_loader: - kv 20: llama.rope.freq_base f32 = 1000000.000000
  4884. llama_model_loader: - kv 21: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
  4885. llama_model_loader: - kv 22: llama.expert_count u32 = 8
  4886. llama_model_loader: - kv 23: llama.expert_used_count u32 = 2
  4887. llama_model_loader: - kv 24: general.file_type u32 = 2
  4888. llama_model_loader: - kv 25: llama.vocab_size u32 = 32000
  4889. llama_model_loader: - kv 26: llama.rope.dimension_count u32 = 128
  4890. llama_model_loader: - kv 27: tokenizer.ggml.model str = llama
  4891. llama_model_loader: - kv 28: tokenizer.ggml.pre str = default
  4892. llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
  4893. llama_model_loader: - kv 30: tokenizer.ggml.scores arr[f32,32000] = [-1000.000000, -1000.000000, -1000.00...
  4894. llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,32000] = [3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
  4895. llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 1
  4896. llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 2
  4897. llama_model_loader: - kv 34: tokenizer.ggml.unknown_token_id u32 = 0
  4898. llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = true
  4899. llama_model_loader: - kv 36: tokenizer.ggml.add_eos_token bool = false
  4900. llama_model_loader: - kv 37: tokenizer.chat_template str = {%- if messages[0]['role'] == 'system...
  4901. llama_model_loader: - kv 38: tokenizer.ggml.add_space_prefix bool = false
  4902. llama_model_loader: - kv 39: general.quantization_version u32 = 2
  4903. llama_model_loader: - type f32: 97 tensors
  4904. llama_model_loader: - type q4_0: 161 tensors
  4905. llama_model_loader: - type q8_0: 64 tensors
  4906. llama_model_loader: - type q6_K: 1 tensors
  4907. print_info: file format = GGUF V3 (latest)
  4908. print_info: file type = Q4_0
  4909. print_info: file size = 24.63 GiB (4.53 BPW)
  4910. load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
  4911. load: special tokens cache size = 3
  4912. load: token to piece cache size = 0.1637 MB
  4913. print_info: arch = llama
  4914. print_info: vocab_only = 0
  4915. print_info: n_ctx_train = 32768
  4916. print_info: n_embd = 4096
  4917. print_info: n_layer = 32
  4918. print_info: n_head = 32
  4919. print_info: n_head_kv = 8
  4920. print_info: n_rot = 128
  4921. print_info: n_swa = 0
  4922. print_info: n_swa_pattern = 1
  4923. print_info: n_embd_head_k = 128
  4924. print_info: n_embd_head_v = 128
  4925. print_info: n_gqa = 4
  4926. print_info: n_embd_k_gqa = 1024
  4927. print_info: n_embd_v_gqa = 1024
  4928. print_info: f_norm_eps = 0.0e+00
  4929. print_info: f_norm_rms_eps = 1.0e-05
  4930. print_info: f_clamp_kqv = 0.0e+00
  4931. print_info: f_max_alibi_bias = 0.0e+00
  4932. print_info: f_logit_scale = 0.0e+00
  4933. print_info: f_attn_scale = 0.0e+00
  4934. print_info: n_ff = 14336
  4935. print_info: n_expert = 8
  4936. print_info: n_expert_used = 2
  4937. print_info: causal attn = 1
  4938. print_info: pooling type = 0
  4939. print_info: rope type = 0
  4940. print_info: rope scaling = linear
  4941. print_info: freq_base_train = 1000000.0
  4942. print_info: freq_scale_train = 1
  4943. print_info: n_ctx_orig_yarn = 32768
  4944. print_info: rope_finetuned = unknown
  4945. print_info: ssm_d_conv = 0
  4946. print_info: ssm_d_inner = 0
  4947. print_info: ssm_d_state = 0
  4948. print_info: ssm_dt_rank = 0
  4949. print_info: ssm_dt_b_c_rms = 0
  4950. print_info: model type = 8x7B
  4951. print_info: model params = 46.70 B
  4952. print_info: general.name = Mixtral 8x7B Instruct v0.1
  4953. print_info: vocab type = SPM
  4954. print_info: n_vocab = 32000
  4955. print_info: n_merges = 0
  4956. print_info: BOS token = 1 '<s>'
  4957. print_info: EOS token = 2 '</s>'
  4958. print_info: UNK token = 0 '<unk>'
  4959. print_info: LF token = 13 '<0x0A>'
  4960. print_info: EOG token = 2 '</s>'
  4961. print_info: max token length = 48
  4962. load_tensors: loading model tensors, this can take a while... (mmap = false)
  4963. time=2025-07-19T18:11:43.304+02:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
  4964. load_tensors: offloading 23 repeating layers to GPU
  4965. load_tensors: offloaded 23/33 layers to GPU
  4966. load_tensors: CUDA_Host model buffer size = 7216.77 MiB
  4967. load_tensors: CUDA0 model buffer size = 18001.09 MiB
  4968. llama_context: constructing llama_context
  4969. llama_context: n_seq_max = 1
  4970. llama_context: n_ctx = 2235
  4971. llama_context: n_ctx_per_seq = 2235
  4972. llama_context: n_batch = 512
  4973. llama_context: n_ubatch = 512
  4974. llama_context: causal_attn = 1
  4975. llama_context: flash_attn = 0
  4976. llama_context: freq_base = 1000000.0
  4977. llama_context: freq_scale = 1
  4978. llama_context: n_ctx_per_seq (2235) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
  4979. llama_context: CPU output buffer size = 0.14 MiB
  4980. llama_kv_cache_unified: kv_size = 2240, type_k = 'f16', type_v = 'f16', n_layer = 32, can_shift = 1, padding = 32
  4981. llama_kv_cache_unified: CUDA0 KV buffer size = 201.25 MiB
  4982. llama_kv_cache_unified: CPU KV buffer size = 78.75 MiB
  4983. llama_kv_cache_unified: KV self size = 280.00 MiB, K (f16): 140.00 MiB, V (f16): 140.00 MiB
  4984. llama_context: CUDA0 compute buffer size = 393.38 MiB
  4985. llama_context: CUDA_Host compute buffer size = 12.38 MiB
  4986. llama_context: graph nodes = 1574
  4987. llama_context: graph splits = 112 (with bs=512), 3 (with bs=1)
  4988. time=2025-07-19T18:11:47.813+02:00 level=INFO source=server.go:637 msg="llama runner started in 4.76 seconds"
  4989. time=2025-07-19T18:11:47.816+02:00 level=WARN source=runner.go:128 msg="truncating input prompt" limit=2235 prompt=2503 keep=5 new=2235
  4990. [GIN] 2025/07/19 - 18:12:02 | 200 | 20.2151959s | 100.107.36.63 | POST "/api/generate"
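Annotation: this POST took about 20 s end-to-end. The per-request breakdown (prompt evaluation vs. generation) is returned in the final JSON object of /api/generate, so throughput can be computed client-side. A hedged sketch; the field names (eval_count, eval_duration, and so on) are the ones Ollama documents for the non-streaming response, and the host and model tag are placeholders as before:

# Hedged sketch: derive tokens/second from the final /api/generate response object.
# Assumption: durations are reported in nanoseconds, per the Ollama API docs.
import requests

r = requests.post(
    "http://localhost:11434/api/generate",   # adjust to your server address
    json={"model": "mixtral:8x7b", "prompt": "Why is the sky blue?", "stream": False},
    timeout=600,
).json()

prompt_tps = r["prompt_eval_count"] / (r["prompt_eval_duration"] / 1e9)
gen_tps    = r["eval_count"]        / (r["eval_duration"]        / 1e9)
print(f"prompt eval: {prompt_tps:.1f} tok/s, generation: {gen_tps:.1f} tok/s")
print(f"total wall time: {r['total_duration'] / 1e9:.1f} s")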
  4991. time=2025-07-19T18:12:39.591+02:00 level=INFO source=server.go:135 msg="system memory" total="47.1 GiB" free="32.6 GiB" free_swap="31.2 GiB"
  4992. time=2025-07-19T18:12:39.591+02:00 level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=33 layers.offload=23 layers.split="" memory.available="[20.1 GiB]" memory.gpu_overhead="0 B" memory.required.full="26.9 GiB" memory.required.partial="19.8 GiB" memory.required.kv="339.2 MiB" memory.required.allocations="[19.8 GiB]" memory.weights.total="24.6 GiB" memory.weights.repeating="24.5 GiB" memory.weights.nonrepeating="102.6 MiB" memory.graph.full="206.9 MiB" memory.graph.partial="827.3 MiB"
  4993. llama_model_loader: loaded meta data with 40 key-value pairs and 323 tensors from C:\Users\Haldi\.ollama\models\blobs\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 (version GGUF V3 (latest))
  4994. llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
  4995. llama_model_loader: - kv 0: general.architecture str = llama
  4996. llama_model_loader: - kv 1: general.type str = model
  4997. llama_model_loader: - kv 2: general.name str = Mixtral 8x7B Instruct v0.1
  4998. llama_model_loader: - kv 3: general.version str = v0.1
  4999. llama_model_loader: - kv 4: general.finetune str = Instruct
  5000. llama_model_loader: - kv 5: general.basename str = Mixtral
  5001. llama_model_loader: - kv 6: general.size_label str = 8x7B
  5002. llama_model_loader: - kv 7: general.license str = apache-2.0
  5003. llama_model_loader: - kv 8: general.base_model.count u32 = 1
  5004. llama_model_loader: - kv 9: general.base_model.0.name str = Mixtral 8x7B v0.1
  5005. llama_model_loader: - kv 10: general.base_model.0.version str = v0.1
  5006. llama_model_loader: - kv 11: general.base_model.0.organization str = Mistralai
  5007. llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/mistralai/Mixt...
  5008. llama_model_loader: - kv 13: general.languages arr[str,5] = ["fr", "it", "de", "es", "en"]
  5009. llama_model_loader: - kv 14: llama.block_count u32 = 32
  5010. llama_model_loader: - kv 15: llama.context_length u32 = 32768
  5011. llama_model_loader: - kv 16: llama.embedding_length u32 = 4096
  5012. llama_model_loader: - kv 17: llama.feed_forward_length u32 = 14336
  5013. llama_model_loader: - kv 18: llama.attention.head_count u32 = 32
  5014. llama_model_loader: - kv 19: llama.attention.head_count_kv u32 = 8
  5015. llama_model_loader: - kv 20: llama.rope.freq_base f32 = 1000000.000000
  5016. llama_model_loader: - kv 21: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
  5017. llama_model_loader: - kv 22: llama.expert_count u32 = 8
  5018. llama_model_loader: - kv 23: llama.expert_used_count u32 = 2
  5019. llama_model_loader: - kv 24: general.file_type u32 = 2
  5020. llama_model_loader: - kv 25: llama.vocab_size u32 = 32000
  5021. llama_model_loader: - kv 26: llama.rope.dimension_count u32 = 128
  5022. llama_model_loader: - kv 27: tokenizer.ggml.model str = llama
  5023. llama_model_loader: - kv 28: tokenizer.ggml.pre str = default
  5024. llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
  5025. llama_model_loader: - kv 30: tokenizer.ggml.scores arr[f32,32000] = [-1000.000000, -1000.000000, -1000.00...
  5026. llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,32000] = [3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
  5027. llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 1
  5028. llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 2
  5029. llama_model_loader: - kv 34: tokenizer.ggml.unknown_token_id u32 = 0
  5030. llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = true
  5031. llama_model_loader: - kv 36: tokenizer.ggml.add_eos_token bool = false
  5032. llama_model_loader: - kv 37: tokenizer.chat_template str = {%- if messages[0]['role'] == 'system...
  5033. llama_model_loader: - kv 38: tokenizer.ggml.add_space_prefix bool = false
  5034. llama_model_loader: - kv 39: general.quantization_version u32 = 2
  5035. llama_model_loader: - type f32: 97 tensors
  5036. llama_model_loader: - type q4_0: 161 tensors
  5037. llama_model_loader: - type q8_0: 64 tensors
  5038. llama_model_loader: - type q6_K: 1 tensors
  5039. print_info: file format = GGUF V3 (latest)
  5040. print_info: file type = Q4_0
  5041. print_info: file size = 24.63 GiB (4.53 BPW)
  5042. load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
  5043. load: special tokens cache size = 3
  5044. load: token to piece cache size = 0.1637 MB
  5045. print_info: arch = llama
  5046. print_info: vocab_only = 1
  5047. print_info: model type = ?B
  5048. print_info: model params = 46.70 B
  5049. print_info: general.name = Mixtral 8x7B Instruct v0.1
  5050. print_info: vocab type = SPM
  5051. print_info: n_vocab = 32000
  5052. print_info: n_merges = 0
  5053. print_info: BOS token = 1 '<s>'
  5054. print_info: EOS token = 2 '</s>'
  5055. print_info: UNK token = 0 '<unk>'
  5056. print_info: LF token = 13 '<0x0A>'
  5057. print_info: EOG token = 2 '</s>'
  5058. print_info: max token length = 48
  5059. llama_model_load: vocab only - skipping tensors
  5060. time=2025-07-19T18:12:39.617+02:00 level=INFO source=server.go:438 msg="starting llama server" cmd="C:\\Users\\Haldi\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --model C:\\Users\\Haldi\\.ollama\\models\\blobs\\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 --ctx-size 2714 --batch-size 512 --n-gpu-layers 23 --threads 16 --no-mmap --parallel 1 --port 53589"
  5061. time=2025-07-19T18:12:39.620+02:00 level=INFO source=sched.go:483 msg="loaded runners" count=1
  5062. time=2025-07-19T18:12:39.620+02:00 level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
  5063. time=2025-07-19T18:12:39.620+02:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server error"
  5064. time=2025-07-19T18:12:39.676+02:00 level=INFO source=runner.go:815 msg="starting go runner"
  5065. ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
  5066. ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
  5067. ggml_cuda_init: found 1 CUDA devices:
  5068. Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  5069. load_backend: loaded CUDA backend from C:\Users\Haldi\AppData\Local\Programs\Ollama\lib\ollama\ggml-cuda.dll
  5070. load_backend: loaded CPU backend from C:\Users\Haldi\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-icelake.dll
  5071. time=2025-07-19T18:12:39.757+02:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
  5072. time=2025-07-19T18:12:39.758+02:00 level=INFO source=runner.go:874 msg="Server listening on 127.0.0.1:53589"
  5073. time=2025-07-19T18:12:39.871+02:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
  5074. llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4090) - 22994 MiB free
  5075. llama_model_loader: loaded meta data with 40 key-value pairs and 323 tensors from C:\Users\Haldi\.ollama\models\blobs\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 (version GGUF V3 (latest))
  5076. llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
  5077. llama_model_loader: - kv 0: general.architecture str = llama
  5078. llama_model_loader: - kv 1: general.type str = model
  5079. llama_model_loader: - kv 2: general.name str = Mixtral 8x7B Instruct v0.1
  5080. llama_model_loader: - kv 3: general.version str = v0.1
  5081. llama_model_loader: - kv 4: general.finetune str = Instruct
  5082. llama_model_loader: - kv 5: general.basename str = Mixtral
  5083. llama_model_loader: - kv 6: general.size_label str = 8x7B
  5084. llama_model_loader: - kv 7: general.license str = apache-2.0
  5085. llama_model_loader: - kv 8: general.base_model.count u32 = 1
  5086. llama_model_loader: - kv 9: general.base_model.0.name str = Mixtral 8x7B v0.1
  5087. llama_model_loader: - kv 10: general.base_model.0.version str = v0.1
  5088. llama_model_loader: - kv 11: general.base_model.0.organization str = Mistralai
  5089. llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/mistralai/Mixt...
  5090. llama_model_loader: - kv 13: general.languages arr[str,5] = ["fr", "it", "de", "es", "en"]
  5091. llama_model_loader: - kv 14: llama.block_count u32 = 32
  5092. llama_model_loader: - kv 15: llama.context_length u32 = 32768
  5093. llama_model_loader: - kv 16: llama.embedding_length u32 = 4096
  5094. llama_model_loader: - kv 17: llama.feed_forward_length u32 = 14336
  5095. llama_model_loader: - kv 18: llama.attention.head_count u32 = 32
  5096. llama_model_loader: - kv 19: llama.attention.head_count_kv u32 = 8
  5097. llama_model_loader: - kv 20: llama.rope.freq_base f32 = 1000000.000000
  5098. llama_model_loader: - kv 21: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
  5099. llama_model_loader: - kv 22: llama.expert_count u32 = 8
  5100. llama_model_loader: - kv 23: llama.expert_used_count u32 = 2
  5101. llama_model_loader: - kv 24: general.file_type u32 = 2
  5102. llama_model_loader: - kv 25: llama.vocab_size u32 = 32000
  5103. llama_model_loader: - kv 26: llama.rope.dimension_count u32 = 128
  5104. llama_model_loader: - kv 27: tokenizer.ggml.model str = llama
  5105. llama_model_loader: - kv 28: tokenizer.ggml.pre str = default
  5106. llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
  5107. llama_model_loader: - kv 30: tokenizer.ggml.scores arr[f32,32000] = [-1000.000000, -1000.000000, -1000.00...
  5108. llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,32000] = [3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
  5109. llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 1
  5110. llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 2
  5111. llama_model_loader: - kv 34: tokenizer.ggml.unknown_token_id u32 = 0
  5112. llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = true
  5113. llama_model_loader: - kv 36: tokenizer.ggml.add_eos_token bool = false
  5114. llama_model_loader: - kv 37: tokenizer.chat_template str = {%- if messages[0]['role'] == 'system...
  5115. llama_model_loader: - kv 38: tokenizer.ggml.add_space_prefix bool = false
  5116. llama_model_loader: - kv 39: general.quantization_version u32 = 2
  5117. llama_model_loader: - type f32: 97 tensors
  5118. llama_model_loader: - type q4_0: 161 tensors
  5119. llama_model_loader: - type q8_0: 64 tensors
  5120. llama_model_loader: - type q6_K: 1 tensors
  5121. print_info: file format = GGUF V3 (latest)
  5122. print_info: file type = Q4_0
  5123. print_info: file size = 24.63 GiB (4.53 BPW)
  5124. load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
  5125. load: special tokens cache size = 3
  5126. load: token to piece cache size = 0.1637 MB
  5127. print_info: arch = llama
  5128. print_info: vocab_only = 0
  5129. print_info: n_ctx_train = 32768
  5130. print_info: n_embd = 4096
  5131. print_info: n_layer = 32
  5132. print_info: n_head = 32
  5133. print_info: n_head_kv = 8
  5134. print_info: n_rot = 128
  5135. print_info: n_swa = 0
  5136. print_info: n_swa_pattern = 1
  5137. print_info: n_embd_head_k = 128
  5138. print_info: n_embd_head_v = 128
  5139. print_info: n_gqa = 4
  5140. print_info: n_embd_k_gqa = 1024
  5141. print_info: n_embd_v_gqa = 1024
  5142. print_info: f_norm_eps = 0.0e+00
  5143. print_info: f_norm_rms_eps = 1.0e-05
  5144. print_info: f_clamp_kqv = 0.0e+00
  5145. print_info: f_max_alibi_bias = 0.0e+00
  5146. print_info: f_logit_scale = 0.0e+00
  5147. print_info: f_attn_scale = 0.0e+00
  5148. print_info: n_ff = 14336
  5149. print_info: n_expert = 8
  5150. print_info: n_expert_used = 2
  5151. print_info: causal attn = 1
  5152. print_info: pooling type = 0
  5153. print_info: rope type = 0
  5154. print_info: rope scaling = linear
  5155. print_info: freq_base_train = 1000000.0
  5156. print_info: freq_scale_train = 1
  5157. print_info: n_ctx_orig_yarn = 32768
  5158. print_info: rope_finetuned = unknown
  5159. print_info: ssm_d_conv = 0
  5160. print_info: ssm_d_inner = 0
  5161. print_info: ssm_d_state = 0
  5162. print_info: ssm_dt_rank = 0
  5163. print_info: ssm_dt_b_c_rms = 0
  5164. print_info: model type = 8x7B
  5165. print_info: model params = 46.70 B
  5166. print_info: general.name = Mixtral 8x7B Instruct v0.1
  5167. print_info: vocab type = SPM
  5168. print_info: n_vocab = 32000
  5169. print_info: n_merges = 0
  5170. print_info: BOS token = 1 '<s>'
  5171. print_info: EOS token = 2 '</s>'
  5172. print_info: UNK token = 0 '<unk>'
  5173. print_info: LF token = 13 '<0x0A>'
  5174. print_info: EOG token = 2 '</s>'
  5175. print_info: max token length = 48
  5176. load_tensors: loading model tensors, this can take a while... (mmap = false)
  5177. load_tensors: offloading 23 repeating layers to GPU
  5178. load_tensors: offloaded 23/33 layers to GPU
  5179. load_tensors: CUDA_Host model buffer size = 7216.77 MiB
  5180. load_tensors: CUDA0 model buffer size = 18001.09 MiB
  5181. llama_context: constructing llama_context
  5182. llama_context: n_seq_max = 1
  5183. llama_context: n_ctx = 2714
  5184. llama_context: n_ctx_per_seq = 2714
  5185. llama_context: n_batch = 512
  5186. llama_context: n_ubatch = 512
  5187. llama_context: causal_attn = 1
  5188. llama_context: flash_attn = 0
  5189. llama_context: freq_base = 1000000.0
  5190. llama_context: freq_scale = 1
  5191. llama_context: n_ctx_per_seq (2714) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
  5192. llama_context: CPU output buffer size = 0.14 MiB
  5193. llama_kv_cache_unified: kv_size = 2720, type_k = 'f16', type_v = 'f16', n_layer = 32, can_shift = 1, padding = 32
  5194. llama_kv_cache_unified: CUDA0 KV buffer size = 244.38 MiB
  5195. llama_kv_cache_unified: CPU KV buffer size = 95.62 MiB
  5196. llama_kv_cache_unified: KV self size = 340.00 MiB, K (f16): 170.00 MiB, V (f16): 170.00 MiB
  5197. llama_context: CUDA0 compute buffer size = 394.31 MiB
  5198. llama_context: CUDA_Host compute buffer size = 13.32 MiB
  5199. llama_context: graph nodes = 1574
  5200. llama_context: graph splits = 112 (with bs=512), 3 (with bs=1)
  5201. time=2025-07-19T18:12:44.130+02:00 level=INFO source=server.go:637 msg="llama runner started in 4.51 seconds"
  5202. time=2025-07-19T18:12:44.134+02:00 level=WARN source=runner.go:128 msg="truncating input prompt" limit=2714 prompt=3489 keep=5 new=2714
  5203. [GIN] 2025/07/19 - 18:12:53 | 200 | 14.7450462s | 100.107.36.63 | POST "/api/generate"
  5204. [GIN] 2025/07/19 - 18:13:17 | 200 | 635.6µs | 192.168.1.2 | GET "/api/tags"
  5205. [GIN] 2025/07/19 - 18:13:17 | 200 | 0s | 192.168.1.2 | GET "/api/ps"
  5206. [GIN] 2025/07/19 - 18:13:17 | 200 | 0s | 192.168.1.2 | GET "/api/version"
  5207. [GIN] 2025/07/19 - 18:13:18 | 200 | 511.7µs | 192.168.1.2 | GET "/api/tags"
  5208. [GIN] 2025/07/19 - 18:13:18 | 200 | 0s | 192.168.1.2 | GET "/api/ps"
  5209. [GIN] 2025/07/19 - 18:13:21 | 200 | 557.6µs | 192.168.1.2 | GET "/api/tags"
  5210. [GIN] 2025/07/19 - 18:13:21 | 200 | 0s | 192.168.1.2 | GET "/api/ps"
  5211. [GIN] 2025/07/19 - 18:13:30 | 200 | 1.0038ms | 192.168.1.2 | GET "/api/tags"
  5212. [GIN] 2025/07/19 - 18:13:30 | 200 | 0s | 192.168.1.2 | GET "/api/ps"
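Annotation: the repeated GET /api/tags and GET /api/ps requests above are a client polling for installed models and for what is currently loaded. The same information can be fetched directly; a hedged sketch (endpoint paths as logged above; the response fields such as size_vram are assumptions based on Ollama's documented /api/ps output):

# Hedged sketch: query the two endpoints seen in the GIN log lines above.
import requests

base = "http://localhost:11434"   # adjust to the server address

tags = requests.get(f"{base}/api/tags", timeout=10).json()
print("installed:", [m["name"] for m in tags.get("models", [])])

ps = requests.get(f"{base}/api/ps", timeout=10).json()
for m in ps.get("models", []):
    vram  = m.get("size_vram", 0)
    total = m.get("size", 0)
    print(f"loaded: {m['name']}  {vram / 2**30:.1f} GiB in VRAM of {total / 2**30:.1f} GiB total")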
  5213. time=2025-07-19T18:13:43.589+02:00 level=INFO source=server.go:135 msg="system memory" total="47.1 GiB" free="32.1 GiB" free_swap="30.6 GiB"
  5214. time=2025-07-19T18:13:43.590+02:00 level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=33 layers.offload=23 layers.split="" memory.available="[20.0 GiB]" memory.gpu_overhead="0 B" memory.required.full="26.9 GiB" memory.required.partial="19.8 GiB" memory.required.kv="328.1 MiB" memory.required.allocations="[19.8 GiB]" memory.weights.total="24.6 GiB" memory.weights.repeating="24.5 GiB" memory.weights.nonrepeating="102.6 MiB" memory.graph.full="201.2 MiB" memory.graph.partial="827.1 MiB"
  5215. llama_model_loader: loaded meta data with 40 key-value pairs and 323 tensors from C:\Users\Haldi\.ollama\models\blobs\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 (version GGUF V3 (latest))
  5216. llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
  5217. llama_model_loader: - kv 0: general.architecture str = llama
  5218. llama_model_loader: - kv 1: general.type str = model
  5219. llama_model_loader: - kv 2: general.name str = Mixtral 8x7B Instruct v0.1
  5220. llama_model_loader: - kv 3: general.version str = v0.1
  5221. llama_model_loader: - kv 4: general.finetune str = Instruct
  5222. llama_model_loader: - kv 5: general.basename str = Mixtral
  5223. llama_model_loader: - kv 6: general.size_label str = 8x7B
  5224. llama_model_loader: - kv 7: general.license str = apache-2.0
  5225. llama_model_loader: - kv 8: general.base_model.count u32 = 1
  5226. llama_model_loader: - kv 9: general.base_model.0.name str = Mixtral 8x7B v0.1
  5227. llama_model_loader: - kv 10: general.base_model.0.version str = v0.1
  5228. llama_model_loader: - kv 11: general.base_model.0.organization str = Mistralai
  5229. llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/mistralai/Mixt...
  5230. llama_model_loader: - kv 13: general.languages arr[str,5] = ["fr", "it", "de", "es", "en"]
  5231. llama_model_loader: - kv 14: llama.block_count u32 = 32
  5232. llama_model_loader: - kv 15: llama.context_length u32 = 32768
  5233. llama_model_loader: - kv 16: llama.embedding_length u32 = 4096
  5234. llama_model_loader: - kv 17: llama.feed_forward_length u32 = 14336
  5235. llama_model_loader: - kv 18: llama.attention.head_count u32 = 32
  5236. llama_model_loader: - kv 19: llama.attention.head_count_kv u32 = 8
  5237. llama_model_loader: - kv 20: llama.rope.freq_base f32 = 1000000.000000
  5238. llama_model_loader: - kv 21: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
  5239. llama_model_loader: - kv 22: llama.expert_count u32 = 8
  5240. llama_model_loader: - kv 23: llama.expert_used_count u32 = 2
  5241. llama_model_loader: - kv 24: general.file_type u32 = 2
  5242. llama_model_loader: - kv 25: llama.vocab_size u32 = 32000
  5243. llama_model_loader: - kv 26: llama.rope.dimension_count u32 = 128
  5244. llama_model_loader: - kv 27: tokenizer.ggml.model str = llama
  5245. llama_model_loader: - kv 28: tokenizer.ggml.pre str = default
  5246. llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
  5247. llama_model_loader: - kv 30: tokenizer.ggml.scores arr[f32,32000] = [-1000.000000, -1000.000000, -1000.00...
  5248. llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,32000] = [3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
  5249. llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 1
  5250. llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 2
  5251. llama_model_loader: - kv 34: tokenizer.ggml.unknown_token_id u32 = 0
  5252. llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = true
  5253. llama_model_loader: - kv 36: tokenizer.ggml.add_eos_token bool = false
  5254. llama_model_loader: - kv 37: tokenizer.chat_template str = {%- if messages[0]['role'] == 'system...
  5255. llama_model_loader: - kv 38: tokenizer.ggml.add_space_prefix bool = false
  5256. llama_model_loader: - kv 39: general.quantization_version u32 = 2
  5257. llama_model_loader: - type f32: 97 tensors
  5258. llama_model_loader: - type q4_0: 161 tensors
  5259. llama_model_loader: - type q8_0: 64 tensors
  5260. llama_model_loader: - type q6_K: 1 tensors
  5261. print_info: file format = GGUF V3 (latest)
  5262. print_info: file type = Q4_0
  5263. print_info: file size = 24.63 GiB (4.53 BPW)
  5264. load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
  5265. load: special tokens cache size = 3
  5266. load: token to piece cache size = 0.1637 MB
  5267. print_info: arch = llama
  5268. print_info: vocab_only = 1
  5269. print_info: model type = ?B
  5270. print_info: model params = 46.70 B
  5271. print_info: general.name = Mixtral 8x7B Instruct v0.1
  5272. print_info: vocab type = SPM
  5273. print_info: n_vocab = 32000
  5274. print_info: n_merges = 0
  5275. print_info: BOS token = 1 '<s>'
  5276. print_info: EOS token = 2 '</s>'
  5277. print_info: UNK token = 0 '<unk>'
  5278. print_info: LF token = 13 '<0x0A>'
  5279. print_info: EOG token = 2 '</s>'
  5280. print_info: max token length = 48
  5281. llama_model_load: vocab only - skipping tensors
  5282. time=2025-07-19T18:13:43.614+02:00 level=INFO source=server.go:438 msg="starting llama server" cmd="C:\\Users\\Haldi\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --model C:\\Users\\Haldi\\.ollama\\models\\blobs\\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 --ctx-size 2625 --batch-size 512 --n-gpu-layers 23 --threads 16 --no-mmap --parallel 1 --port 53682"
  5283. time=2025-07-19T18:13:43.617+02:00 level=INFO source=sched.go:483 msg="loaded runners" count=1
  5284. time=2025-07-19T18:13:43.617+02:00 level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
  5285. time=2025-07-19T18:13:43.617+02:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server error"
  5286. time=2025-07-19T18:13:43.667+02:00 level=INFO source=runner.go:815 msg="starting go runner"
  5287. ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
  5288. ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
  5289. ggml_cuda_init: found 1 CUDA devices:
  5290. Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  5291. load_backend: loaded CUDA backend from C:\Users\Haldi\AppData\Local\Programs\Ollama\lib\ollama\ggml-cuda.dll
  5292. load_backend: loaded CPU backend from C:\Users\Haldi\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-icelake.dll
  5293. time=2025-07-19T18:13:43.749+02:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
  5294. time=2025-07-19T18:13:43.750+02:00 level=INFO source=runner.go:874 msg="Server listening on 127.0.0.1:53682"
  5295. llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4090) - 22994 MiB free
  5296. llama_model_loader: loaded meta data with 40 key-value pairs and 323 tensors from C:\Users\Haldi\.ollama\models\blobs\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 (version GGUF V3 (latest))
  5297. llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
  5298. llama_model_loader: - kv 0: general.architecture str = llama
  5299. llama_model_loader: - kv 1: general.type str = model
  5300. llama_model_loader: - kv 2: general.name str = Mixtral 8x7B Instruct v0.1
  5301. llama_model_loader: - kv 3: general.version str = v0.1
  5302. llama_model_loader: - kv 4: general.finetune str = Instruct
  5303. llama_model_loader: - kv 5: general.basename str = Mixtral
  5304. llama_model_loader: - kv 6: general.size_label str = 8x7B
  5305. llama_model_loader: - kv 7: general.license str = apache-2.0
  5306. llama_model_loader: - kv 8: general.base_model.count u32 = 1
  5307. llama_model_loader: - kv 9: general.base_model.0.name str = Mixtral 8x7B v0.1
  5308. llama_model_loader: - kv 10: general.base_model.0.version str = v0.1
  5309. llama_model_loader: - kv 11: general.base_model.0.organization str = Mistralai
  5310. llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/mistralai/Mixt...
  5311. llama_model_loader: - kv 13: general.languages arr[str,5] = ["fr", "it", "de", "es", "en"]
  5312. llama_model_loader: - kv 14: llama.block_count u32 = 32
  5313. llama_model_loader: - kv 15: llama.context_length u32 = 32768
  5314. llama_model_loader: - kv 16: llama.embedding_length u32 = 4096
  5315. llama_model_loader: - kv 17: llama.feed_forward_length u32 = 14336
  5316. llama_model_loader: - kv 18: llama.attention.head_count u32 = 32
  5317. llama_model_loader: - kv 19: llama.attention.head_count_kv u32 = 8
  5318. llama_model_loader: - kv 20: llama.rope.freq_base f32 = 1000000.000000
  5319. llama_model_loader: - kv 21: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
  5320. llama_model_loader: - kv 22: llama.expert_count u32 = 8
  5321. llama_model_loader: - kv 23: llama.expert_used_count u32 = 2
  5322. llama_model_loader: - kv 24: general.file_type u32 = 2
  5323. llama_model_loader: - kv 25: llama.vocab_size u32 = 32000
  5324. llama_model_loader: - kv 26: llama.rope.dimension_count u32 = 128
  5325. llama_model_loader: - kv 27: tokenizer.ggml.model str = llama
  5326. llama_model_loader: - kv 28: tokenizer.ggml.pre str = default
  5327. llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
  5328. llama_model_loader: - kv 30: tokenizer.ggml.scores arr[f32,32000] = [-1000.000000, -1000.000000, -1000.00...
  5329. llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,32000] = [3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
  5330. llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 1
  5331. llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 2
  5332. llama_model_loader: - kv 34: tokenizer.ggml.unknown_token_id u32 = 0
  5333. llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = true
  5334. llama_model_loader: - kv 36: tokenizer.ggml.add_eos_token bool = false
  5335. llama_model_loader: - kv 37: tokenizer.chat_template str = {%- if messages[0]['role'] == 'system...
  5336. llama_model_loader: - kv 38: tokenizer.ggml.add_space_prefix bool = false
  5337. llama_model_loader: - kv 39: general.quantization_version u32 = 2
  5338. llama_model_loader: - type f32: 97 tensors
  5339. llama_model_loader: - type q4_0: 161 tensors
  5340. llama_model_loader: - type q8_0: 64 tensors
  5341. llama_model_loader: - type q6_K: 1 tensors
  5342. print_info: file format = GGUF V3 (latest)
  5343. print_info: file type = Q4_0
  5344. print_info: file size = 24.63 GiB (4.53 BPW)
  5345. time=2025-07-19T18:13:43.868+02:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
  5346. load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
  5347. load: special tokens cache size = 3
  5348. load: token to piece cache size = 0.1637 MB
  5349. print_info: arch = llama
  5350. print_info: vocab_only = 0
  5351. print_info: n_ctx_train = 32768
  5352. print_info: n_embd = 4096
  5353. print_info: n_layer = 32
  5354. print_info: n_head = 32
  5355. print_info: n_head_kv = 8
  5356. print_info: n_rot = 128
  5357. print_info: n_swa = 0
  5358. print_info: n_swa_pattern = 1
  5359. print_info: n_embd_head_k = 128
  5360. print_info: n_embd_head_v = 128
  5361. print_info: n_gqa = 4
  5362. print_info: n_embd_k_gqa = 1024
  5363. print_info: n_embd_v_gqa = 1024
  5364. print_info: f_norm_eps = 0.0e+00
  5365. print_info: f_norm_rms_eps = 1.0e-05
  5366. print_info: f_clamp_kqv = 0.0e+00
  5367. print_info: f_max_alibi_bias = 0.0e+00
  5368. print_info: f_logit_scale = 0.0e+00
  5369. print_info: f_attn_scale = 0.0e+00
  5370. print_info: n_ff = 14336
  5371. print_info: n_expert = 8
  5372. print_info: n_expert_used = 2
  5373. print_info: causal attn = 1
  5374. print_info: pooling type = 0
  5375. print_info: rope type = 0
  5376. print_info: rope scaling = linear
  5377. print_info: freq_base_train = 1000000.0
  5378. print_info: freq_scale_train = 1
  5379. print_info: n_ctx_orig_yarn = 32768
  5380. print_info: rope_finetuned = unknown
  5381. print_info: ssm_d_conv = 0
  5382. print_info: ssm_d_inner = 0
  5383. print_info: ssm_d_state = 0
  5384. print_info: ssm_dt_rank = 0
  5385. print_info: ssm_dt_b_c_rms = 0
  5386. print_info: model type = 8x7B
  5387. print_info: model params = 46.70 B
  5388. print_info: general.name = Mixtral 8x7B Instruct v0.1
  5389. print_info: vocab type = SPM
  5390. print_info: n_vocab = 32000
  5391. print_info: n_merges = 0
  5392. print_info: BOS token = 1 '<s>'
  5393. print_info: EOS token = 2 '</s>'
  5394. print_info: UNK token = 0 '<unk>'
  5395. print_info: LF token = 13 '<0x0A>'
  5396. print_info: EOG token = 2 '</s>'
  5397. print_info: max token length = 48
  5398. load_tensors: loading model tensors, this can take a while... (mmap = false)
  5399. load_tensors: offloading 23 repeating layers to GPU
  5400. load_tensors: offloaded 23/33 layers to GPU
  5401. load_tensors: CUDA_Host model buffer size = 7216.77 MiB
  5402. load_tensors: CUDA0 model buffer size = 18001.09 MiB
  5403. [GIN] 2025/07/19 - 18:13:52 | 200 | 518.5µs | 192.168.1.2 | GET "/api/tags"
  5404. [GIN] 2025/07/19 - 18:13:52 | 200 | 0s | 192.168.1.2 | GET "/api/ps"
  5405. llama_context: constructing llama_context
  5406. llama_context: n_seq_max = 1
  5407. llama_context: n_ctx = 2625
  5408. llama_context: n_ctx_per_seq = 2625
  5409. llama_context: n_batch = 512
  5410. llama_context: n_ubatch = 512
  5411. llama_context: causal_attn = 1
  5412. llama_context: flash_attn = 0
  5413. llama_context: freq_base = 1000000.0
  5414. llama_context: freq_scale = 1
  5415. llama_context: n_ctx_per_seq (2625) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
  5416. llama_context: CPU output buffer size = 0.14 MiB
  5417. llama_kv_cache_unified: kv_size = 2656, type_k = 'f16', type_v = 'f16', n_layer = 32, can_shift = 1, padding = 32
  5418. llama_kv_cache_unified: CUDA0 KV buffer size = 238.62 MiB
  5419. llama_kv_cache_unified: CPU KV buffer size = 93.38 MiB
  5420. llama_kv_cache_unified: KV self size = 332.00 MiB, K (f16): 166.00 MiB, V (f16): 166.00 MiB
  5421. llama_context: CUDA0 compute buffer size = 394.19 MiB
  5422. llama_context: CUDA_Host compute buffer size = 13.19 MiB
  5423. llama_context: graph nodes = 1574
  5424. llama_context: graph splits = 112 (with bs=512), 3 (with bs=1)
  5425. time=2025-07-19T18:13:58.898+02:00 level=INFO source=server.go:637 msg="llama runner started in 15.28 seconds"
  5426. time=2025-07-19T18:13:58.903+02:00 level=WARN source=runner.go:128 msg="truncating input prompt" limit=2625 prompt=3240 keep=5 new=2625
  5427. [GIN] 2025/07/19 - 18:14:14 | 200 | 31.14405s | 100.107.36.63 | POST "/api/generate"
  5428. time=2025-07-19T18:15:50.453+02:00 level=INFO source=server.go:135 msg="system memory" total="47.1 GiB" free="32.6 GiB" free_swap="31.2 GiB"
  5429. time=2025-07-19T18:15:50.453+02:00 level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=33 layers.offload=23 layers.split="" memory.available="[20.2 GiB]" memory.gpu_overhead="0 B" memory.required.full="26.9 GiB" memory.required.partial="19.8 GiB" memory.required.kv="313.9 MiB" memory.required.allocations="[19.8 GiB]" memory.weights.total="24.6 GiB" memory.weights.repeating="24.5 GiB" memory.weights.nonrepeating="102.6 MiB" memory.graph.full="193.8 MiB" memory.graph.partial="826.9 MiB"
  5430. llama_model_loader: loaded meta data with 40 key-value pairs and 323 tensors from C:\Users\Haldi\.ollama\models\blobs\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 (version GGUF V3 (latest))
  5431. llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
  5432. llama_model_loader: - kv 0: general.architecture str = llama
  5433. llama_model_loader: - kv 1: general.type str = model
  5434. llama_model_loader: - kv 2: general.name str = Mixtral 8x7B Instruct v0.1
  5435. llama_model_loader: - kv 3: general.version str = v0.1
  5436. llama_model_loader: - kv 4: general.finetune str = Instruct
  5437. llama_model_loader: - kv 5: general.basename str = Mixtral
  5438. llama_model_loader: - kv 6: general.size_label str = 8x7B
  5439. llama_model_loader: - kv 7: general.license str = apache-2.0
  5440. llama_model_loader: - kv 8: general.base_model.count u32 = 1
  5441. llama_model_loader: - kv 9: general.base_model.0.name str = Mixtral 8x7B v0.1
  5442. llama_model_loader: - kv 10: general.base_model.0.version str = v0.1
  5443. llama_model_loader: - kv 11: general.base_model.0.organization str = Mistralai
  5444. llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/mistralai/Mixt...
  5445. llama_model_loader: - kv 13: general.languages arr[str,5] = ["fr", "it", "de", "es", "en"]
  5446. llama_model_loader: - kv 14: llama.block_count u32 = 32
  5447. llama_model_loader: - kv 15: llama.context_length u32 = 32768
  5448. llama_model_loader: - kv 16: llama.embedding_length u32 = 4096
  5449. llama_model_loader: - kv 17: llama.feed_forward_length u32 = 14336
  5450. llama_model_loader: - kv 18: llama.attention.head_count u32 = 32
  5451. llama_model_loader: - kv 19: llama.attention.head_count_kv u32 = 8
  5452. llama_model_loader: - kv 20: llama.rope.freq_base f32 = 1000000.000000
  5453. llama_model_loader: - kv 21: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
  5454. llama_model_loader: - kv 22: llama.expert_count u32 = 8
  5455. llama_model_loader: - kv 23: llama.expert_used_count u32 = 2
  5456. llama_model_loader: - kv 24: general.file_type u32 = 2
  5457. llama_model_loader: - kv 25: llama.vocab_size u32 = 32000
  5458. llama_model_loader: - kv 26: llama.rope.dimension_count u32 = 128
  5459. llama_model_loader: - kv 27: tokenizer.ggml.model str = llama
  5460. llama_model_loader: - kv 28: tokenizer.ggml.pre str = default
  5461. llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
  5462. llama_model_loader: - kv 30: tokenizer.ggml.scores arr[f32,32000] = [-1000.000000, -1000.000000, -1000.00...
  5463. llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,32000] = [3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
  5464. llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 1
  5465. llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 2
  5466. llama_model_loader: - kv 34: tokenizer.ggml.unknown_token_id u32 = 0
  5467. llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = true
  5468. llama_model_loader: - kv 36: tokenizer.ggml.add_eos_token bool = false
  5469. llama_model_loader: - kv 37: tokenizer.chat_template str = {%- if messages[0]['role'] == 'system...
  5470. llama_model_loader: - kv 38: tokenizer.ggml.add_space_prefix bool = false
  5471. llama_model_loader: - kv 39: general.quantization_version u32 = 2
  5472. llama_model_loader: - type f32: 97 tensors
  5473. llama_model_loader: - type q4_0: 161 tensors
  5474. llama_model_loader: - type q8_0: 64 tensors
  5475. llama_model_loader: - type q6_K: 1 tensors
  5476. print_info: file format = GGUF V3 (latest)
  5477. print_info: file type = Q4_0
  5478. print_info: file size = 24.63 GiB (4.53 BPW)
  5479. load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
  5480. load: special tokens cache size = 3
  5481. load: token to piece cache size = 0.1637 MB
  5482. print_info: arch = llama
  5483. print_info: vocab_only = 1
  5484. print_info: model type = ?B
  5485. print_info: model params = 46.70 B
  5486. print_info: general.name = Mixtral 8x7B Instruct v0.1
  5487. print_info: vocab type = SPM
  5488. print_info: n_vocab = 32000
  5489. print_info: n_merges = 0
  5490. print_info: BOS token = 1 '<s>'
  5491. print_info: EOS token = 2 '</s>'
  5492. print_info: UNK token = 0 '<unk>'
  5493. print_info: LF token = 13 '<0x0A>'
  5494. print_info: EOG token = 2 '</s>'
  5495. print_info: max token length = 48
  5496. llama_model_load: vocab only - skipping tensors
  5497. time=2025-07-19T18:15:50.480+02:00 level=INFO source=server.go:438 msg="starting llama server" cmd="C:\\Users\\Haldi\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --model C:\\Users\\Haldi\\.ollama\\models\\blobs\\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 --ctx-size 2511 --batch-size 512 --n-gpu-layers 23 --threads 16 --no-mmap --parallel 1 --port 53709"
  5498. time=2025-07-19T18:15:50.483+02:00 level=INFO source=sched.go:483 msg="loaded runners" count=1
  5499. time=2025-07-19T18:15:50.483+02:00 level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
  5500. time=2025-07-19T18:15:50.483+02:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server error"
  5501. time=2025-07-19T18:15:50.536+02:00 level=INFO source=runner.go:815 msg="starting go runner"
  5502. ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
  5503. ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
  5504. ggml_cuda_init: found 1 CUDA devices:
  5505. Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  5506. load_backend: loaded CUDA backend from C:\Users\Haldi\AppData\Local\Programs\Ollama\lib\ollama\ggml-cuda.dll
  5507. load_backend: loaded CPU backend from C:\Users\Haldi\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-icelake.dll
  5508. time=2025-07-19T18:15:50.615+02:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
  5509. time=2025-07-19T18:15:50.615+02:00 level=INFO source=runner.go:874 msg="Server listening on 127.0.0.1:53709"
  5510. time=2025-07-19T18:15:50.733+02:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
  5511. llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4090) - 22994 MiB free
  5512. llama_model_loader: loaded meta data with 40 key-value pairs and 323 tensors from C:\Users\Haldi\.ollama\models\blobs\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 (version GGUF V3 (latest))
  5513. llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
  5514. llama_model_loader: - kv 0: general.architecture str = llama
  5515. llama_model_loader: - kv 1: general.type str = model
  5516. llama_model_loader: - kv 2: general.name str = Mixtral 8x7B Instruct v0.1
  5517. llama_model_loader: - kv 3: general.version str = v0.1
  5518. llama_model_loader: - kv 4: general.finetune str = Instruct
  5519. llama_model_loader: - kv 5: general.basename str = Mixtral
  5520. llama_model_loader: - kv 6: general.size_label str = 8x7B
  5521. llama_model_loader: - kv 7: general.license str = apache-2.0
  5522. llama_model_loader: - kv 8: general.base_model.count u32 = 1
  5523. llama_model_loader: - kv 9: general.base_model.0.name str = Mixtral 8x7B v0.1
  5524. llama_model_loader: - kv 10: general.base_model.0.version str = v0.1
  5525. llama_model_loader: - kv 11: general.base_model.0.organization str = Mistralai
  5526. llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/mistralai/Mixt...
  5527. llama_model_loader: - kv 13: general.languages arr[str,5] = ["fr", "it", "de", "es", "en"]
  5528. llama_model_loader: - kv 14: llama.block_count u32 = 32
  5529. llama_model_loader: - kv 15: llama.context_length u32 = 32768
  5530. llama_model_loader: - kv 16: llama.embedding_length u32 = 4096
  5531. llama_model_loader: - kv 17: llama.feed_forward_length u32 = 14336
  5532. llama_model_loader: - kv 18: llama.attention.head_count u32 = 32
  5533. llama_model_loader: - kv 19: llama.attention.head_count_kv u32 = 8
  5534. llama_model_loader: - kv 20: llama.rope.freq_base f32 = 1000000.000000
  5535. llama_model_loader: - kv 21: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
  5536. llama_model_loader: - kv 22: llama.expert_count u32 = 8
  5537. llama_model_loader: - kv 23: llama.expert_used_count u32 = 2
  5538. llama_model_loader: - kv 24: general.file_type u32 = 2
  5539. llama_model_loader: - kv 25: llama.vocab_size u32 = 32000
  5540. llama_model_loader: - kv 26: llama.rope.dimension_count u32 = 128
  5541. llama_model_loader: - kv 27: tokenizer.ggml.model str = llama
  5542. llama_model_loader: - kv 28: tokenizer.ggml.pre str = default
  5543. llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
  5544. llama_model_loader: - kv 30: tokenizer.ggml.scores arr[f32,32000] = [-1000.000000, -1000.000000, -1000.00...
  5545. llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,32000] = [3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
  5546. llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 1
  5547. llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 2
  5548. llama_model_loader: - kv 34: tokenizer.ggml.unknown_token_id u32 = 0
  5549. llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = true
  5550. llama_model_loader: - kv 36: tokenizer.ggml.add_eos_token bool = false
  5551. llama_model_loader: - kv 37: tokenizer.chat_template str = {%- if messages[0]['role'] == 'system...
  5552. llama_model_loader: - kv 38: tokenizer.ggml.add_space_prefix bool = false
  5553. llama_model_loader: - kv 39: general.quantization_version u32 = 2
  5554. llama_model_loader: - type f32: 97 tensors
  5555. llama_model_loader: - type q4_0: 161 tensors
  5556. llama_model_loader: - type q8_0: 64 tensors
  5557. llama_model_loader: - type q6_K: 1 tensors
  5558. print_info: file format = GGUF V3 (latest)
  5559. print_info: file type = Q4_0
  5560. print_info: file size = 24.63 GiB (4.53 BPW)
  5561. load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
  5562. load: special tokens cache size = 3
  5563. load: token to piece cache size = 0.1637 MB
  5564. print_info: arch = llama
  5565. print_info: vocab_only = 0
  5566. print_info: n_ctx_train = 32768
  5567. print_info: n_embd = 4096
  5568. print_info: n_layer = 32
  5569. print_info: n_head = 32
  5570. print_info: n_head_kv = 8
  5571. print_info: n_rot = 128
  5572. print_info: n_swa = 0
  5573. print_info: n_swa_pattern = 1
  5574. print_info: n_embd_head_k = 128
  5575. print_info: n_embd_head_v = 128
  5576. print_info: n_gqa = 4
  5577. print_info: n_embd_k_gqa = 1024
  5578. print_info: n_embd_v_gqa = 1024
  5579. print_info: f_norm_eps = 0.0e+00
  5580. print_info: f_norm_rms_eps = 1.0e-05
  5581. print_info: f_clamp_kqv = 0.0e+00
  5582. print_info: f_max_alibi_bias = 0.0e+00
  5583. print_info: f_logit_scale = 0.0e+00
  5584. print_info: f_attn_scale = 0.0e+00
  5585. print_info: n_ff = 14336
  5586. print_info: n_expert = 8
  5587. print_info: n_expert_used = 2
  5588. print_info: causal attn = 1
  5589. print_info: pooling type = 0
  5590. print_info: rope type = 0
  5591. print_info: rope scaling = linear
  5592. print_info: freq_base_train = 1000000.0
  5593. print_info: freq_scale_train = 1
  5594. print_info: n_ctx_orig_yarn = 32768
  5595. print_info: rope_finetuned = unknown
  5596. print_info: ssm_d_conv = 0
  5597. print_info: ssm_d_inner = 0
  5598. print_info: ssm_d_state = 0
  5599. print_info: ssm_dt_rank = 0
  5600. print_info: ssm_dt_b_c_rms = 0
  5601. print_info: model type = 8x7B
  5602. print_info: model params = 46.70 B
  5603. print_info: general.name = Mixtral 8x7B Instruct v0.1
  5604. print_info: vocab type = SPM
  5605. print_info: n_vocab = 32000
  5606. print_info: n_merges = 0
  5607. print_info: BOS token = 1 '<s>'
  5608. print_info: EOS token = 2 '</s>'
  5609. print_info: UNK token = 0 '<unk>'
  5610. print_info: LF token = 13 '<0x0A>'
  5611. print_info: EOG token = 2 '</s>'
  5612. print_info: max token length = 48
  5613. load_tensors: loading model tensors, this can take a while... (mmap = false)
  5614. load_tensors: offloading 23 repeating layers to GPU
  5615. load_tensors: offloaded 23/33 layers to GPU
  5616. load_tensors: CUDA_Host model buffer size = 7216.77 MiB
  5617. load_tensors: CUDA0 model buffer size = 18001.09 MiB
  5618. llama_context: constructing llama_context
  5619. llama_context: n_seq_max = 1
  5620. llama_context: n_ctx = 2511
  5621. llama_context: n_ctx_per_seq = 2511
  5622. llama_context: n_batch = 512
  5623. llama_context: n_ubatch = 512
  5624. llama_context: causal_attn = 1
  5625. llama_context: flash_attn = 0
  5626. llama_context: freq_base = 1000000.0
  5627. llama_context: freq_scale = 1
  5628. llama_context: n_ctx_per_seq (2511) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
  5629. llama_context: CPU output buffer size = 0.14 MiB
  5630. llama_kv_cache_unified: kv_size = 2528, type_k = 'f16', type_v = 'f16', n_layer = 32, can_shift = 1, padding = 32
  5631. llama_kv_cache_unified: CUDA0 KV buffer size = 227.12 MiB
  5632. llama_kv_cache_unified: CPU KV buffer size = 88.88 MiB
  5633. llama_kv_cache_unified: KV self size = 316.00 MiB, K (f16): 158.00 MiB, V (f16): 158.00 MiB
  5634. llama_context: CUDA0 compute buffer size = 393.94 MiB
  5635. llama_context: CUDA_Host compute buffer size = 12.94 MiB
  5636. llama_context: graph nodes = 1574
  5637. llama_context: graph splits = 112 (with bs=512), 3 (with bs=1)
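
Note on the KV-cache figures above: with 32 layers, 1024 K/V channels per layer (n_embd_k_gqa / n_embd_v_gqa) and f16 storage at 2 bytes per element, a kv_size of 2528 slots works out to exactly the 158.00 MiB per K and per V that the log reports. A minimal sketch of that arithmetic (Python; all values copied from the log lines above):

    # Reproduce the llama_kv_cache_unified sizes reported above.
    # f16 storage = 2 bytes per element.
    def kv_cache_mib(kv_size: int, n_layer: int, n_embd_gqa: int, bytes_per_elem: int = 2) -> float:
        return kv_size * n_layer * n_embd_gqa * bytes_per_elem / (1024 ** 2)

    k_mib = kv_cache_mib(kv_size=2528, n_layer=32, n_embd_gqa=1024)  # 158.0 MiB
    v_mib = kv_cache_mib(kv_size=2528, n_layer=32, n_embd_gqa=1024)  # 158.0 MiB
    print(k_mib + v_mib)  # 316.0 MiB, matching "KV self size = 316.00 MiB"
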
  5638. time=2025-07-19T18:15:55.240+02:00 level=INFO source=server.go:637 msg="llama runner started in 4.76 seconds"
  5639. time=2025-07-19T18:15:55.243+02:00 level=WARN source=runner.go:128 msg="truncating input prompt" limit=2511 prompt=2835 keep=5 new=2511
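
The WARN line above shows the runner trimming a 2835-token prompt to the 2511-token context window while preserving the first 5 tokens. A hypothetical sketch of that behaviour, assuming (as the keep/new fields suggest) that the leading tokens are kept and the rest of the window is filled with the most recent tokens - an illustration only, not Ollama's actual implementation:

    # Hypothetical illustration of the "truncating input prompt" warning:
    # keep the first `keep` tokens, then take the most recent tokens so the
    # total length equals `limit`; everything in the middle is dropped.
    def truncate_prompt(tokens: list[int], limit: int, keep: int) -> list[int]:
        if len(tokens) <= limit:
            return tokens
        return tokens[:keep] + tokens[-(limit - keep):]

    prompt = list(range(2835))
    print(len(truncate_prompt(prompt, limit=2511, keep=5)))  # 2511, matching new=2511
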
  5640. [GIN] 2025/07/19 - 18:16:08 | 200 | 0s | 192.168.1.1 | GET "/"
  5641. [GIN] 2025/07/19 - 18:16:10 | 200 | 20.3360164s | 100.107.36.63 | POST "/api/generate"
  5642. time=2025-07-19T18:16:11.200+02:00 level=INFO source=server.go:135 msg="system memory" total="47.1 GiB" free="32.6 GiB" free_swap="31.3 GiB"
  5643. time=2025-07-19T18:16:11.200+02:00 level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=33 layers.offload=23 layers.split="" memory.available="[20.1 GiB]" memory.gpu_overhead="0 B" memory.required.full="26.8 GiB" memory.required.partial="19.8 GiB" memory.required.kv="256.0 MiB" memory.required.allocations="[19.8 GiB]" memory.weights.total="24.6 GiB" memory.weights.repeating="24.5 GiB" memory.weights.nonrepeating="102.6 MiB" memory.graph.full="164.0 MiB" memory.graph.partial="826.0 MiB"
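
The msg=offload line above records the scheduler's memory estimate for the reload that follows (ctx-size 2027): 24.5 GiB of repeating weights across 32 blocks is roughly 0.77 GiB per layer, so with about 20.1 GiB of VRAM reported available only 23 of the 33 layers fit on the GPU. A back-of-the-envelope check using only numbers from the log (the fixed overhead here is simply backed out of memory.required.partial; the real calculation in server.go is more detailed):

    # Rough check of the layer-offload decision (all inputs from the msg=offload line).
    repeating_gib = 24.5                      # memory.weights.repeating
    n_blocks = 32                             # layers.model = 33 incl. the output layer
    per_layer_gib = repeating_gib / n_blocks  # ~0.77 GiB per block
    overhead_gib = 19.8 - 23 * per_layer_gib  # ~2.2 GiB, backed out of memory.required.partial
    available_gib = 20.1                      # memory.available

    for layers in (23, 24):
        need = layers * per_layer_gib + overhead_gib
        print(layers, round(need, 1), need <= available_gib)
    # 23 -> ~19.8 GiB fits; 24 -> ~20.6 GiB would exceed the available 20.1 GiB.
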
  5644. llama_model_loader: loaded meta data with 40 key-value pairs and 323 tensors from C:\Users\Haldi\.ollama\models\blobs\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 (version GGUF V3 (latest))
  5645. llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
  5646. llama_model_loader: - kv 0: general.architecture str = llama
  5647. llama_model_loader: - kv 1: general.type str = model
  5648. llama_model_loader: - kv 2: general.name str = Mixtral 8x7B Instruct v0.1
  5649. llama_model_loader: - kv 3: general.version str = v0.1
  5650. llama_model_loader: - kv 4: general.finetune str = Instruct
  5651. llama_model_loader: - kv 5: general.basename str = Mixtral
  5652. llama_model_loader: - kv 6: general.size_label str = 8x7B
  5653. llama_model_loader: - kv 7: general.license str = apache-2.0
  5654. llama_model_loader: - kv 8: general.base_model.count u32 = 1
  5655. llama_model_loader: - kv 9: general.base_model.0.name str = Mixtral 8x7B v0.1
  5656. llama_model_loader: - kv 10: general.base_model.0.version str = v0.1
  5657. llama_model_loader: - kv 11: general.base_model.0.organization str = Mistralai
  5658. llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/mistralai/Mixt...
  5659. llama_model_loader: - kv 13: general.languages arr[str,5] = ["fr", "it", "de", "es", "en"]
  5660. llama_model_loader: - kv 14: llama.block_count u32 = 32
  5661. llama_model_loader: - kv 15: llama.context_length u32 = 32768
  5662. llama_model_loader: - kv 16: llama.embedding_length u32 = 4096
  5663. llama_model_loader: - kv 17: llama.feed_forward_length u32 = 14336
  5664. llama_model_loader: - kv 18: llama.attention.head_count u32 = 32
  5665. llama_model_loader: - kv 19: llama.attention.head_count_kv u32 = 8
  5666. llama_model_loader: - kv 20: llama.rope.freq_base f32 = 1000000.000000
  5667. llama_model_loader: - kv 21: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
  5668. llama_model_loader: - kv 22: llama.expert_count u32 = 8
  5669. llama_model_loader: - kv 23: llama.expert_used_count u32 = 2
  5670. llama_model_loader: - kv 24: general.file_type u32 = 2
  5671. llama_model_loader: - kv 25: llama.vocab_size u32 = 32000
  5672. llama_model_loader: - kv 26: llama.rope.dimension_count u32 = 128
  5673. llama_model_loader: - kv 27: tokenizer.ggml.model str = llama
  5674. llama_model_loader: - kv 28: tokenizer.ggml.pre str = default
  5675. llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
  5676. llama_model_loader: - kv 30: tokenizer.ggml.scores arr[f32,32000] = [-1000.000000, -1000.000000, -1000.00...
  5677. llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,32000] = [3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
  5678. llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 1
  5679. llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 2
  5680. llama_model_loader: - kv 34: tokenizer.ggml.unknown_token_id u32 = 0
  5681. llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = true
  5682. llama_model_loader: - kv 36: tokenizer.ggml.add_eos_token bool = false
  5683. llama_model_loader: - kv 37: tokenizer.chat_template str = {%- if messages[0]['role'] == 'system...
  5684. llama_model_loader: - kv 38: tokenizer.ggml.add_space_prefix bool = false
  5685. llama_model_loader: - kv 39: general.quantization_version u32 = 2
  5686. llama_model_loader: - type f32: 97 tensors
  5687. llama_model_loader: - type q4_0: 161 tensors
  5688. llama_model_loader: - type q8_0: 64 tensors
  5689. llama_model_loader: - type q6_K: 1 tensors
  5690. print_info: file format = GGUF V3 (latest)
  5691. print_info: file type = Q4_0
  5692. print_info: file size = 24.63 GiB (4.53 BPW)
  5693. load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
  5694. load: special tokens cache size = 3
  5695. load: token to piece cache size = 0.1637 MB
  5696. print_info: arch = llama
  5697. print_info: vocab_only = 1
  5698. print_info: model type = ?B
  5699. print_info: model params = 46.70 B
  5700. print_info: general.name = Mixtral 8x7B Instruct v0.1
  5701. print_info: vocab type = SPM
  5702. print_info: n_vocab = 32000
  5703. print_info: n_merges = 0
  5704. print_info: BOS token = 1 '<s>'
  5705. print_info: EOS token = 2 '</s>'
  5706. print_info: UNK token = 0 '<unk>'
  5707. print_info: LF token = 13 '<0x0A>'
  5708. print_info: EOG token = 2 '</s>'
  5709. print_info: max token length = 48
  5710. llama_model_load: vocab only - skipping tensors
  5711. time=2025-07-19T18:16:11.224+02:00 level=INFO source=server.go:438 msg="starting llama server" cmd="C:\\Users\\Haldi\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --model C:\\Users\\Haldi\\.ollama\\models\\blobs\\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 --ctx-size 2027 --batch-size 512 --n-gpu-layers 23 --threads 16 --no-mmap --parallel 1 --port 53712"
  5712. time=2025-07-19T18:16:11.226+02:00 level=INFO source=sched.go:483 msg="loaded runners" count=1
  5713. time=2025-07-19T18:16:11.226+02:00 level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
  5714. time=2025-07-19T18:16:11.227+02:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server error"
  5715. time=2025-07-19T18:16:11.280+02:00 level=INFO source=runner.go:815 msg="starting go runner"
  5716. ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
  5717. ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
  5718. ggml_cuda_init: found 1 CUDA devices:
  5719. Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  5720. load_backend: loaded CUDA backend from C:\Users\Haldi\AppData\Local\Programs\Ollama\lib\ollama\ggml-cuda.dll
  5721. load_backend: loaded CPU backend from C:\Users\Haldi\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-icelake.dll
  5722. time=2025-07-19T18:16:11.355+02:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
  5723. time=2025-07-19T18:16:11.356+02:00 level=INFO source=runner.go:874 msg="Server listening on 127.0.0.1:53712"
  5724. llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4090) - 22994 MiB free
  5725. llama_model_loader: loaded meta data with 40 key-value pairs and 323 tensors from C:\Users\Haldi\.ollama\models\blobs\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 (version GGUF V3 (latest))
  5726. llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
  5727. llama_model_loader: - kv 0: general.architecture str = llama
  5728. llama_model_loader: - kv 1: general.type str = model
  5729. llama_model_loader: - kv 2: general.name str = Mixtral 8x7B Instruct v0.1
  5730. llama_model_loader: - kv 3: general.version str = v0.1
  5731. llama_model_loader: - kv 4: general.finetune str = Instruct
  5732. llama_model_loader: - kv 5: general.basename str = Mixtral
  5733. llama_model_loader: - kv 6: general.size_label str = 8x7B
  5734. llama_model_loader: - kv 7: general.license str = apache-2.0
  5735. llama_model_loader: - kv 8: general.base_model.count u32 = 1
  5736. llama_model_loader: - kv 9: general.base_model.0.name str = Mixtral 8x7B v0.1
  5737. llama_model_loader: - kv 10: general.base_model.0.version str = v0.1
  5738. llama_model_loader: - kv 11: general.base_model.0.organization str = Mistralai
  5739. llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/mistralai/Mixt...
  5740. llama_model_loader: - kv 13: general.languages arr[str,5] = ["fr", "it", "de", "es", "en"]
  5741. llama_model_loader: - kv 14: llama.block_count u32 = 32
  5742. llama_model_loader: - kv 15: llama.context_length u32 = 32768
  5743. llama_model_loader: - kv 16: llama.embedding_length u32 = 4096
  5744. llama_model_loader: - kv 17: llama.feed_forward_length u32 = 14336
  5745. llama_model_loader: - kv 18: llama.attention.head_count u32 = 32
  5746. llama_model_loader: - kv 19: llama.attention.head_count_kv u32 = 8
  5747. llama_model_loader: - kv 20: llama.rope.freq_base f32 = 1000000.000000
  5748. llama_model_loader: - kv 21: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
  5749. llama_model_loader: - kv 22: llama.expert_count u32 = 8
  5750. llama_model_loader: - kv 23: llama.expert_used_count u32 = 2
  5751. llama_model_loader: - kv 24: general.file_type u32 = 2
  5752. llama_model_loader: - kv 25: llama.vocab_size u32 = 32000
  5753. llama_model_loader: - kv 26: llama.rope.dimension_count u32 = 128
  5754. llama_model_loader: - kv 27: tokenizer.ggml.model str = llama
  5755. llama_model_loader: - kv 28: tokenizer.ggml.pre str = default
  5756. llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
  5757. llama_model_loader: - kv 30: tokenizer.ggml.scores arr[f32,32000] = [-1000.000000, -1000.000000, -1000.00...
  5758. llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,32000] = [3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
  5759. llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 1
  5760. llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 2
  5761. llama_model_loader: - kv 34: tokenizer.ggml.unknown_token_id u32 = 0
  5762. llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = true
  5763. llama_model_loader: - kv 36: tokenizer.ggml.add_eos_token bool = false
  5764. llama_model_loader: - kv 37: tokenizer.chat_template str = {%- if messages[0]['role'] == 'system...
  5765. llama_model_loader: - kv 38: tokenizer.ggml.add_space_prefix bool = false
  5766. llama_model_loader: - kv 39: general.quantization_version u32 = 2
  5767. llama_model_loader: - type f32: 97 tensors
  5768. llama_model_loader: - type q4_0: 161 tensors
  5769. llama_model_loader: - type q8_0: 64 tensors
  5770. llama_model_loader: - type q6_K: 1 tensors
  5771. print_info: file format = GGUF V3 (latest)
  5772. print_info: file type = Q4_0
  5773. print_info: file size = 24.63 GiB (4.53 BPW)
  5774. load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
  5775. load: special tokens cache size = 3
  5776. load: token to piece cache size = 0.1637 MB
  5777. print_info: arch = llama
  5778. print_info: vocab_only = 0
  5779. print_info: n_ctx_train = 32768
  5780. print_info: n_embd = 4096
  5781. print_info: n_layer = 32
  5782. print_info: n_head = 32
  5783. print_info: n_head_kv = 8
  5784. print_info: n_rot = 128
  5785. print_info: n_swa = 0
  5786. print_info: n_swa_pattern = 1
  5787. print_info: n_embd_head_k = 128
  5788. print_info: n_embd_head_v = 128
  5789. print_info: n_gqa = 4
  5790. print_info: n_embd_k_gqa = 1024
  5791. print_info: n_embd_v_gqa = 1024
  5792. print_info: f_norm_eps = 0.0e+00
  5793. print_info: f_norm_rms_eps = 1.0e-05
  5794. print_info: f_clamp_kqv = 0.0e+00
  5795. print_info: f_max_alibi_bias = 0.0e+00
  5796. print_info: f_logit_scale = 0.0e+00
  5797. print_info: f_attn_scale = 0.0e+00
  5798. print_info: n_ff = 14336
  5799. print_info: n_expert = 8
  5800. print_info: n_expert_used = 2
  5801. print_info: causal attn = 1
  5802. print_info: pooling type = 0
  5803. print_info: rope type = 0
  5804. print_info: rope scaling = linear
  5805. print_info: freq_base_train = 1000000.0
  5806. print_info: freq_scale_train = 1
  5807. print_info: n_ctx_orig_yarn = 32768
  5808. print_info: rope_finetuned = unknown
  5809. print_info: ssm_d_conv = 0
  5810. print_info: ssm_d_inner = 0
  5811. print_info: ssm_d_state = 0
  5812. print_info: ssm_dt_rank = 0
  5813. print_info: ssm_dt_b_c_rms = 0
  5814. print_info: model type = 8x7B
  5815. print_info: model params = 46.70 B
  5816. print_info: general.name = Mixtral 8x7B Instruct v0.1
  5817. print_info: vocab type = SPM
  5818. print_info: n_vocab = 32000
  5819. print_info: n_merges = 0
  5820. print_info: BOS token = 1 '<s>'
  5821. print_info: EOS token = 2 '</s>'
  5822. print_info: UNK token = 0 '<unk>'
  5823. print_info: LF token = 13 '<0x0A>'
  5824. print_info: EOG token = 2 '</s>'
  5825. print_info: max token length = 48
  5826. load_tensors: loading model tensors, this can take a while... (mmap = false)
  5827. time=2025-07-19T18:16:11.478+02:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
  5828. load_tensors: offloading 23 repeating layers to GPU
  5829. load_tensors: offloaded 23/33 layers to GPU
  5830. load_tensors: CUDA_Host model buffer size = 7216.77 MiB
  5831. load_tensors: CUDA0 model buffer size = 18001.09 MiB
  5832. llama_context: constructing llama_context
  5833. llama_context: n_seq_max = 1
  5834. llama_context: n_ctx = 2027
  5835. llama_context: n_ctx_per_seq = 2027
  5836. llama_context: n_batch = 512
  5837. llama_context: n_ubatch = 512
  5838. llama_context: causal_attn = 1
  5839. llama_context: flash_attn = 0
  5840. llama_context: freq_base = 1000000.0
  5841. llama_context: freq_scale = 1
  5842. llama_context: n_ctx_per_seq (2027) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
  5843. llama_context: CPU output buffer size = 0.14 MiB
  5844. llama_kv_cache_unified: kv_size = 2048, type_k = 'f16', type_v = 'f16', n_layer = 32, can_shift = 1, padding = 32
  5845. llama_kv_cache_unified: CUDA0 KV buffer size = 184.00 MiB
  5846. llama_kv_cache_unified: CPU KV buffer size = 72.00 MiB
  5847. llama_kv_cache_unified: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB
  5848. llama_context: CUDA0 compute buffer size = 405.00 MiB
  5849. llama_context: CUDA_Host compute buffer size = 12.01 MiB
  5850. llama_context: graph nodes = 1574
  5851. llama_context: graph splits = 112 (with bs=512), 3 (with bs=1)
  5852. time=2025-07-19T18:16:15.736+02:00 level=INFO source=server.go:637 msg="llama runner started in 4.51 seconds"
  5853. [GIN] 2025/07/19 - 18:16:23 | 200 | 13.0084194s | 100.107.36.63 | POST "/api/generate"
  5854. time=2025-07-19T18:17:45.516+02:00 level=INFO source=server.go:135 msg="system memory" total="47.1 GiB" free="32.3 GiB" free_swap="30.9 GiB"
  5855. time=2025-07-19T18:17:45.517+02:00 level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=33 layers.offload=23 layers.split="" memory.available="[20.2 GiB]" memory.gpu_overhead="0 B" memory.required.full="26.9 GiB" memory.required.partial="19.8 GiB" memory.required.kv="328.1 MiB" memory.required.allocations="[19.8 GiB]" memory.weights.total="24.6 GiB" memory.weights.repeating="24.5 GiB" memory.weights.nonrepeating="102.6 MiB" memory.graph.full="201.2 MiB" memory.graph.partial="827.1 MiB"
  5856. llama_model_loader: loaded meta data with 40 key-value pairs and 323 tensors from C:\Users\Haldi\.ollama\models\blobs\sha256-f2dc41fa964b42bfe34e9fb09c0acdcfbfd6e52f1332930b4eacc9d6ad1c6cd2 (version GGUF V3 (latest))
  5857. llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
  5858. llama_model_loader: - kv 0: general.architecture str = llama
  5859. llama_model_loader: - kv 1: general.type str = model
  5860. llama_model_loader: - kv 2: general.name str = Mixtral 8x7B Instruct v0.1
  5861. llama_model_loader: - kv 3: general.version str = v0.1
  5862. llama_model_loader: - kv 4: general.finetune str = Instruct
  5863. llama_model_loader: - kv 5: general.basename str = Mixtral
  5864. llama_model_loader: - kv 6: general.size_label str = 8x7B
  5865. llama_model_loader: - kv 7: general.license str = apache-2.0
  5866. llama_model_loader: - kv 8: general.base_model.count u32 = 1
  5867. llama_model_loader: - kv 9: general.base_model.0.name str = Mixtral 8x7B v0.1
  5868. llama_model_loader: - kv 10: general.base_model.0.version str = v0.1
  5869. llama_model_loader: - kv 11: general.base_model.0.organization str = Mistralai
  5870. llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/mistralai/Mixt...
  5871. llama_model_loader: - kv 13: general.languages arr[str,5] = ["fr", "it", "de", "es", "en"]
  5872. llama_model_loader: - kv 14: llama.block_count u32 = 32
  5873. llama_model_loader: - kv 15: llama.context_length u32 = 32768
  5874. llama_model_loader: - kv 16: llama.embedding_length u32 = 4096
  5875. llama_model_loader: - kv 17: llama.feed_forward_length u32 = 14336
  5876. llama_model_loader: - kv 18: llama.attention.head_count u32 = 32
  5877. llama_model_loader: - kv 19: llama.attention.head_count_kv u32 = 8
  5878. llama_model_loader: - kv 20: llama.rope.freq_base f32 = 1000000.000000
  5879. llama_model_loader: - kv 21: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
  5880. llama_model_loader: - kv 22: llama.expert_count u32 = 8
  5881. llama_model_loader: - kv 23: llama.expert_used_count u32 = 2
  5882. llama_model_loader: - kv 24: general.file_type u32 = 2
  5883. llama_model_loader: - kv 25: llama.vocab_size u32 = 32000
  5884. llama_model_loader: - kv 26: llama.rope.dimension_count u32 = 128
  5885. llama_model_loader: - kv 27: tokenizer.ggml.model str = llama
  5886. llama_model_loader: - kv 28: tokenizer.ggml.pre str = default
  5887. llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
  5888. llama_model_loader: - kv 30: tokenizer.ggml.scores arr[f32,32000] = [-1000.000000, -1000.000000, -1000.00...
  5889. llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,32000] = [3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
  5890. llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 1
  5891. llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 2
  5892. llama_model_loader: - kv 34: tokenizer.ggml.unknown_token_id u32 = 0
  5893. llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = true
  5894. llama_model_loader: - kv 36: tokenizer.ggml.add_eos_token bool = false
  5895. llama_model_loader: - kv 37: tokenizer.chat_template str = {%- if messages[0]['role'] == 'system...
  5896. llama_model_loader: - kv 38: tokenizer.ggml.add_space_prefix bool = false
  5897. llama_model_loader: - kv 39: general.quantization_version u32 = 2
  5898. llama_model_loader: - type f32: 97 tensors
  5899. llama_model_loader: - type q4_0: 161 tensors
  5900. llama_model_loader: - type q8_0: 64 tensors
  5901. llama_model_loader: - type q6_K: 1 tensors
  5902. print_info: file format = GGUF V3 (latest)
  5903. print_info: file type = Q4_0
  5904. print_info: file size = 24.63 GiB (4.53 BPW)
  5905. load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
  5906. load: special tokens cache size = 3
  5907. load: token to piece cache size = 0.1637 MB
  5908. print_info: arch = llama
  5909. print_info: vocab_only = 1
  5910. print_info: model type = ?B
  5911. print_info: model params = 46.70 B
  5912. print_info: general.name = Mixtral 8x7B Instruct v0.1
  5913. print_info: vocab type = SPM
  5914. print_info: n_vocab = 32000
  5915. print_info: n_merges = 0
  5916. print_info: BOS token = 1 '<s>'
  5917. print_info: EOS token = 2 '</s>'
  5918. print_info: UNK token = 0 '<unk>'
  5919. print_info: LF token = 13 '<0x0A>'
  5920. print_info: EOG token = 2 '</s>'
  5921. print_info: max token length = 48
  5922. llama_model_load: vocab only - skipping tensors
  5923.  