- I currently have Backyard.AI app version 0.26.2 with the Experimental backend enabled.
- Max Model Context is set to 8k.
- [2024-08-12 10:52:43.619] [info] Spawning new model process...
- [2024-08-12 10:52:43.620] [info] Entered init()
- [2024-08-12 10:52:43.620] [info] Dispatching action "SPAWN"
- [2024-08-12 10:52:43.620] [info] Handling side effects after entering state "spawning-px"
- [2024-08-12 10:52:45.370] [info] Starting gpu detection for cublas-12.1.0
- [2024-08-12 10:52:45.478] [info] Finished gpu detection for cublas-12.1.0 after 108 ms
- [2024-08-12 10:52:45.478] [info] Using free VRAM if available
- [2024-08-12 10:52:45.479] [info] Fetched GPU and available vRAM {
- cardName: 'NVIDIA GeForce RTX 4060 Ti',
- maxUsableVRamMiB: 15647.671875,
- gpuDeviceInfo: { index: 0, type: 'cublas' }
- }
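Side note: the probe above only needs the card name and the free VRAM on the device. A minimal sketch of how such a query can be done with nvidia-smi (an illustration only, not Backyard.AI's actual implementation; the field names simply mirror the log object above):

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const execFileAsync = promisify(execFile);

interface GpuInfo {
  cardName: string;
  maxUsableVRamMiB: number;
  gpuDeviceInfo: { index: number; type: string };
}

// Query the card name and free VRAM via nvidia-smi; assumes an NVIDIA GPU
// and nvidia-smi on PATH.
async function detectGpu(index = 0): Promise<GpuInfo> {
  const { stdout } = await execFileAsync("nvidia-smi", [
    "--query-gpu=name,memory.free",
    "--format=csv,noheader,nounits",
    `--id=${index}`,
  ]);
  const [name, freeMiB] = stdout.trim().split(", ");
  return {
    cardName: name,
    maxUsableVRamMiB: Number(freeMiB),
    gpuDeviceInfo: { index, type: "cublas" },
  };
}
```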
- [2024-08-12 10:52:45.480] [info] Running auto kv quant detection.
- [2024-08-12 10:52:45.480] [info] Found 23 layers: {
- maxLayers: 28,
- ctxAdjustment: 4,
- kvCacheSize: 540,
- printedVRam: 892.6299999999999,
- scratchBufferSize: 372.06,
- estimatedScratchBufferSize: 372.06,
- vRamBudget: 15447.671875,
- isNotCLBlast: true,
- vRamPerLayer: 500.5699999999999,
- vRamForLayerMiB: 15236.998499999996
- }
- [2024-08-12 10:52:45.481] [info] Found 23/28 layers for {"k":"f16","v":"f16"}
- [2024-08-12 10:52:45.481] [info] Found 24 layers: {
- maxLayers: 28,
- ctxAdjustment: 4,
- kvCacheSize: 540,
- printedVRam: 892.6299999999999,
- scratchBufferSize: 372.06,
- estimatedScratchBufferSize: 372.06,
- vRamBudget: 15447.671875,
- isNotCLBlast: true,
- vRamPerLayer: 500.5699999999999,
- vRamForLayerMiB: 14901.596999999996
- }
- [2024-08-12 10:52:45.481] [info] Found 24/28 layers for {"k":"q8_0","v":"q8_0"}
- [2024-08-12 10:52:45.481] [info] Found 25 layers: {
- maxLayers: 28,
- ctxAdjustment: 4,
- kvCacheSize: 540,
- printedVRam: 892.6299999999999,
- scratchBufferSize: 372.06,
- estimatedScratchBufferSize: 372.06,
- vRamBudget: 15447.671875,
- isNotCLBlast: true,
- vRamPerLayer: 500.5699999999999,
- vRamForLayerMiB: 14946.820499999996
- }
- [2024-08-12 10:52:45.481] [info] Found 25/28 layers for {"k":"q4_0","v":"q4_0"}
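The three passes above rerun the same fit with progressively cheaper KV-cache quantizations (f16 → q8_0 → q4_0), which is why the layer count climbs from 23 to 25. A hypothetical sketch of the general shape of such a budget calculation (the exact formula is not shown in the log, and fields like ctxAdjustment suggest extra adjustments this sketch does not reproduce):

```typescript
interface LayerBudgetInput {
  vRamBudgetMiB: number;        // usable VRAM minus a safety margin
  kvCacheSizeMiB: number;       // KV cache cost at the chosen quantization
  scratchBufferSizeMiB: number; // compute scratch buffer
  printedVRamMiB: number;       // VRAM already claimed elsewhere
  vRamPerLayerMiB: number;      // weight cost of one transformer layer
  maxLayers: number;            // total offloadable layers (here 28)
}

// Hypothetical layer fit: subtract fixed costs, divide by per-layer cost,
// cap at the model's layer count.
function fitGpuLayers(b: LayerBudgetInput): number {
  const free =
    b.vRamBudgetMiB - b.kvCacheSizeMiB - b.scratchBufferSizeMiB - b.printedVRamMiB;
  return Math.min(b.maxLayers, Math.max(0, Math.floor(free / b.vRamPerLayerMiB)));
}
```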
- [2024-08-12 10:52:45.481] [info] Spawning model px: { gpuLayers: 25, isAtMaxLayers: false }
- [2024-08-12 10:52:45.481] [info] Spawning llama server process...
- [2024-08-12 10:52:46.038] [info] Parsing GGUF model header took 541 ms
- [2024-08-12 10:52:46.039] [info] Detected model architecture: deepseek2
- [2024-08-12 10:52:46.039] [info] Rope params: {
- ropeFreqBase: 10000,
- ropeFreqScale: 1,
- finetuneContextLength: 163840,
- ctxSize: 8192
- }
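The rope values pass through unchanged because the requested context (8192) is far below the finetuned context length (163840), so no frequency scaling is needed. A sketch of that decision, assuming simple linear rope scaling as the fallback (an assumption; the log does not show the actual rule):

```typescript
// Hypothetical rope-scale decision: only compress frequencies when the
// requested context window exceeds what the model was finetuned for.
function ropeFreqScale(ctxSize: number, finetuneContextLength: number): number {
  if (ctxSize <= finetuneContextLength) return 1; // here: 8192 <= 163840 → 1
  return finetuneContextLength / ctxSize;         // linear scaling fallback
}
```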
- [2024-08-12 10:52:46.039] [info] {
- model: 'mradermacher__DeepSeek-V2-Lite-Chat-i1-GGUF__DeepSeek-V2-Lite-Chat.i1-Q6_K.gguf',
- llamaBin: 'llama-cpp-binaries\\windows\\cublas-12.1.0\\v0.25.28\\noavx\\backyard.exe',
- flags: [
- '--host',
- '127.0.0.1',
- '--port',
- '62240',
- '--model',
- 'D:\\AI\\character\\models\\mradermacher__DeepSeek-V2-Lite-Chat-i1-GGUF__DeepSeek-V2-Lite-Chat.i1-Q6_K.gguf',
- '--ctx-size',
- '8192',
- '--rope-freq-base',
- '10000',
- '--rope-freq-scale',
- '1',
- '--batch-size',
- '512',
- '--log-disable',
- '--flash-attn',
- '--cache-type-k',
- 'q4_0',
- '--cache-type-v',
- 'q4_0',
- '--mlock',
- '--n-gpu-layers',
- '25'
- ]
- }
- [2024-08-12 10:52:46.040] [info] Attempting to start llama process { CUDA_VISIBLE_DEVICES: '0' }
- [2024-08-12 10:52:46.042] [info] Spawned llama process, pid: 40228 GPU Acceleration: 25
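The flag array maps one-to-one onto a child-process spawn, with CUDA_VISIBLE_DEVICES pinning the server to the detected GPU (device index 0). A minimal Node.js sketch of that launch (illustrative; the binary path and flags are copied from the log, and the model path is truncated for brevity):

```typescript
import { spawn } from "node:child_process";

// Spawn the bundled llama.cpp-based server with the flags from the log.
const llama = spawn(
  "llama-cpp-binaries\\windows\\cublas-12.1.0\\v0.25.28\\noavx\\backyard.exe",
  [
    "--host", "127.0.0.1",
    "--port", "62240",
    "--model", "D:\\AI\\character\\models\\...Q6_K.gguf", // truncated for brevity
    "--ctx-size", "8192",
    "--flash-attn",
    "--cache-type-k", "q4_0",
    "--cache-type-v", "q4_0", // this pairing is what ultimately fails below
    "--n-gpu-layers", "25",
  ],
  { env: { ...process.env, CUDA_VISIBLE_DEVICES: "0" } }
);

llama.stderr.on("data", (chunk) => console.error(String(chunk)));
llama.on("exit", (code) => console.log(`llama server exited with code ${code}`));
```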
- [2024-08-12 10:52:46.042] [info] Dispatching action "SPAWN_DONE"
- [2024-08-12 10:52:46.042] [info] Finished init()
- [2024-08-12 10:52:46.042] [info] Handling side effects after entering state "starting-llama"
- [2024-08-12 10:52:46.042] [info] Starting llama server...
- [2024-08-12 10:52:48.581] [info] Tried to cancel when not streaming
- [2024-08-12 10:52:51.695] [info] Unsupported file format. Try a different quantization for this model, or toggle the Experimental backend in the Advanced settings.
- [2024-08-12 10:52:51.695] [info]
- ___STDERR___
- llama_model_loader: loaded meta data with 49 key-value pairs and 377 tensors from D:\AI\character\models\mradermacher__DeepSeek-V2-Lite-Chat-i1-GGUF__DeepSeek-V2-Lite-Chat.i1-Q6_K.gguf (version GGUF V3 (latest))
- llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
- llama_model_loader: - kv 0: general.architecture str = deepseek2
- llama_model_loader: - kv 1: general.name str = DeepSeek-V2-Lite-Chat
- llama_model_loader: - kv 2: deepseek2.block_count u32 = 27
- llama_model_loader: - kv 3: deepseek2.context_length u32 = 163840
- llama_model_loader: - kv 4: deepseek2.embedding_length u32 = 2048
- llama_model_loader: - kv 5: deepseek2.feed_forward_length u32 = 10944
- llama_model_loader: - kv 6: deepseek2.attention.head_count u32 = 16
- llama_model_loader: - kv 7: deepseek2.attention.head_count_kv u32 = 16
- llama_model_loader: - kv 8: deepseek2.rope.freq_base f32 = 10000.000000
- llama_model_loader: - kv 9: deepseek2.attention.layer_norm_rms_epsilon f32 = 0.000001
- llama_model_loader: - kv 10: deepseek2.expert_used_count u32 = 6
- llama_model_loader: - kv 11: general.file_type u32 = 18
- llama_model_loader: - kv 12: deepseek2.leading_dense_block_count u32 = 1
- llama_model_loader: - kv 13: deepseek2.vocab_size u32 = 102400
- llama_model_loader: - kv 14: deepseek2.attention.kv_lora_rank u32 = 512
- llama_model_loader: - kv 15: deepseek2.attention.key_length u32 = 192
- llama_model_loader: - kv 16: deepseek2.attention.value_length u32 = 128
- llama_model_loader: - kv 17: deepseek2.expert_feed_forward_length u32 = 1408
- ............
- llm_load_print_meta: n_lora_q = 0
- llm_load_print_meta: n_lora_kv = 512
- llm_load_print_meta: n_ff_exp = 1408
- llm_load_print_meta: n_expert_shared = 2
- llm_load_print_meta: expert_weights_scale = 1.0
- llm_load_print_meta: rope_yarn_log_mul = 0.0707
- ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
- ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
- ggml_cuda_init: found 1 CUDA devices:
- Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
- llm_load_tensors: ggml ctx size = 0.32 MiB
- llm_load_tensors: offloading 25 repeating layers to GPU
- llm_load_tensors: offloaded 25/28 layers to GPU
- llm_load_tensors: CPU buffer size = 3424.92 MiB
- llm_load_tensors: CUDA0 buffer size = 12514.25 MiB
- ......................................................................................
- llama_new_context_with_model: flash_attn requires n_embd_head_k == n_embd_head_v - forcing off
- llama_new_context_with_model: V cache quantization requires flash_attn
- llama_init_from_gpt_params: error: failed to create context with model 'D:\AI\character\models\mradermacher__DeepSeek-V2-Lite-Chat-i1-GGUF__DeepSeek-V2-Lite-Chat.i1-Q6_K.gguf'
- ___STDERR___
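Despite the generic "Unsupported file format" message, the stderr above shows the real failure chain: the deepseek2 architecture uses different key and value head sizes (attention.key_length = 192 vs attention.value_length = 128 in the metadata), so llama.cpp force-disables flash attention; a quantized V cache (--cache-type-v q4_0) in turn requires flash attention, so context creation fails. The likely workaround is to launch without V-cache quantization for this architecture. A hypothetical guard sketching that check (not Backyard.AI's code; the head sizes come straight from the GGUF metadata above):

```typescript
// Hypothetical guard: skip KV-cache quantization when flash attention cannot
// be enabled (llama.cpp forces it off when K/V head dims differ, as with
// deepseek2's 192/128 split seen in this log).
interface ModelHeadDims { keyLength: number; valueLength: number }

function cacheTypeFlags(dims: ModelHeadDims, kvQuant: string): string[] {
  const flashAttnOk = dims.keyLength === dims.valueLength;
  if (!flashAttnOk) {
    // Quantized V cache requires flash-attn; fall back to the default f16 cache.
    return [];
  }
  return ["--flash-attn", "--cache-type-k", kvQuant, "--cache-type-v", kvQuant];
}
```

With this model, cacheTypeFlags({ keyLength: 192, valueLength: 128 }, "q4_0") returns an empty list, so the server would start with the default f16 KV cache instead of failing.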
- [2024-08-12 10:52:51.696] [error] Unexpected error initializing server: Error: Unsupported file format. Try a different quantization for this model, or toggle the Experimental backend in the Advanced settings.
- at ChildProcess.<anonymous> (C:\Users\progmars\AppData\Local\faraday\app-0.26.2\resources\app.asar\dist\server\main.js:1095:13476)
- at ChildProcess.emit (node:events:519:28)
- at ChildProcess.emit (node:domain:488:12)
- at ChildProcess._handle.onexit (node:internal/child_process:294:12)
- [2024-08-12 10:52:51.697] [info] Successfully terminated server px and removed listeners.
- [2024-08-12 10:52:51.698] [info] Dispatching action "ERROR"
- [2024-08-12 10:52:51.699] [info] Handling side effects after entering state "error"