- ***
- Welcome to KoboldCpp - Version 1.68
- Attempting to use CuBLAS library for faster prompt ingestion. A compatible CuBLAS will be required.
- Initializing dynamic library: koboldcpp_cublas.so
- ==========
- Namespace(model=None, model_param='/mnt/Orlando/gguf/Meta-Llama-3-70B-Instruct-Q4_K_M.gguf', port=5001, port_param=5001, host='', launch=False, config=None, threads=9, usecublas=['rowsplit', 'mmq'], usevulkan=None, useclblast=None, noblas=False, contextsize=8192, gpulayers=999, tensor_split=[8.0, 10.0, 5.0], ropeconfig=[0.0, 10000.0], blasbatchsize=2048, blasthreads=9, lora=None, noshift=False, nommap=False, usemlock=False, noavx2=False, debugmode=0, skiplauncher=False, onready='', benchmark='stdout', multiuser=0, remotetunnel=False, highpriority=False, foreground=False, preloadstory='', quiet=False, ssl=None, nocertify=False, mmproj='', password=None, ignoremissing=False, chatcompletionsadapter='', flashattention=True, quantkv=1, forceversion=0, smartcontext=False, hordemodelname='', hordeworkername='', hordekey='', hordemaxctx=0, hordegenlen=0, sdmodel='', sdthreads=0, sdclamped=0, sdvae='', sdvaeauto=False, sdquant=False, sdlora='', sdloramult=1.0, whispermodel='', hordeconfig=None, sdconfig=None)
- ==========
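The Namespace dump above is argparse output, so the original invocation can be reconstructed from it. A rough sketch of an equivalent launch (flag spellings are from koboldcpp's CLI of this era; verify against `python koboldcpp.py --help` for your version):

```python
import subprocess

# Approximate relaunch of the run captured above, rebuilt from the parsed
# Namespace. model=None with model_param set means the model path was given
# positionally. benchmark='stdout' corresponds to a bare --benchmark flag.
cmd = [
    "python", "koboldcpp.py",
    "/mnt/Orlando/gguf/Meta-Llama-3-70B-Instruct-Q4_K_M.gguf",
    "--port", "5001",
    "--threads", "9",
    "--usecublas", "rowsplit", "mmq",
    "--contextsize", "8192",
    "--gpulayers", "999",              # offload everything; clamped to the real layer count
    "--tensor_split", "8", "10", "5",  # ~35%/43%/22% of layers across three GPUs (8+10+5 = 23)
    "--blasbatchsize", "2048",
    "--blasthreads", "9",
    "--flashattention",
    "--quantkv", "1",                  # quantized KV cache (requires flash attention)
    "--benchmark",
]
subprocess.run(cmd, check=True)
```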
- Loading model: /mnt/Orlando/gguf/Meta-Llama-3-70B-Instruct-Q4_K_M.gguf
- The reported GGUF Arch is: llama
- ---
- Identified as GGUF model: (ver 6)
- Attempting to Load...
- ---
- Using automatic RoPE scaling. If the model has customized RoPE settings, they will be used directly instead!
- System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
- Applying Tensor Split...
- Automatic RoPE Scaling: Using (scale:1.000, base:500000.0).
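The scale/base pair follows from the requested context: 8192 tokens does not exceed Llama 3's trained context, so no stretching is applied and the model's native base of 500000 is kept. A minimal sketch of the usual NTK-aware base-adjustment heuristic (the exact formula koboldcpp applies internally is an assumption here):

```python
def ntk_rope_base(trained_base: float, trained_ctx: int,
                  target_ctx: int, head_dim: int = 128) -> float:
    """NTK-aware RoPE scaling: stretch the frequency base, not the positions."""
    ratio = max(1.0, target_ctx / trained_ctx)
    return trained_base * ratio ** (head_dim / (head_dim - 2))

print(ntk_rope_base(500_000.0, 8192, 8192))   # 500000.0, matching the log line above
print(ntk_rope_base(500_000.0, 8192, 16384))  # ~1.0e6, what a 16k extension would use
```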
- Processing Prompt [BLAS] (0 / 8092 tokens)
- Processing Prompt [BLAS] (2048 / 8092 tokens)
- Processing Prompt [BLAS] (4096 / 8092 tokens)
- Processing Prompt [BLAS] (6144 / 8092 tokens)
- Processing Prompt [BLAS] (8092 / 8092 tokens)
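Those checkpoints track the BLAS batch size: with blasbatchsize=2048, the 8092-token prompt is ingested in three full chunks plus a short final one. The chunk arithmetic (not koboldcpp's actual loop) reproduces the counter exactly:

```python
# Reproduce the progress checkpoints above: ingestion advances in
# blasbatchsize-sized steps, with a partial final chunk (6144..8092).
prompt_tokens, batch = 8092, 2048

done = 0
while done < prompt_tokens:
    print(f"Processing Prompt [BLAS] ({done} / {prompt_tokens} tokens)")
    done = min(done + batch, prompt_tokens)
print(f"Processing Prompt [BLAS] ({done} / {prompt_tokens} tokens)")
# Prints: 0, 2048, 4096, 6144, 8092
```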
- Generating (1 / 100 tokens)
- Generating (2 / 100 tokens)
- Generating (3 / 100 tokens)
- ...
- Generating (99 / 100 tokens)
- Generating (100 / 100 tokens)
- CtxLimit: 8192/8192, Process:70.86s (8.8ms/T = 114.20T/s), Generate:19.92s (199.2ms/T = 5.02T/s), Total:90.78s (1.10T/s)
- Load Text Model OK: True
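The totals in that line are internally consistent; dividing token counts by the reported wall times reproduces each speed:

```python
# Sanity-check the timing line: tokens / seconds should match the reported
# rates. Prompt was 8092 tokens, generation was 100 tokens (8092+100 = 8192).
prompt_tokens, gen_tokens = 8092, 100
process_s, generate_s = 70.858, 19.923

print(f"process:  {prompt_tokens / process_s:.2f} T/s")  # ~114.20
print(f"generate: {gen_tokens / generate_s:.2f} T/s")    # ~5.02
total_s = process_s + generate_s
print(f"total:    {gen_tokens / total_s:.2f} T/s")       # ~1.10 (counts generated tokens only)
```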
- Embedded KoboldAI Lite loaded.
- Embedded API docs loaded.
- Starting Kobold API on port 5001 at http://localhost:5001/api/
- Starting OpenAI Compatible API on port 5001 at http://localhost:5001/v1/
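Once the server is idling on port 5001, either endpoint accepts requests. A minimal sketch against the OpenAI-compatible route (the prompt and sampler values are arbitrary placeholders; "koboldcpp" as the model name is nominal, since the server serves whichever model it loaded):

```python
import requests

# Minimal completion request against koboldcpp's OpenAI-compatible endpoint.
# The payload follows the standard OpenAI completions schema.
resp = requests.post(
    "http://localhost:5001/v1/completions",
    json={
        "model": "koboldcpp",
        "prompt": "The quick brown fox",
        "max_tokens": 32,
        "temperature": 0.7,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```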
- Running benchmark (Not Saved)...
- Benchmark Completed - v1.68 Results:
- ======
- Flags: NoAVX2=False Threads=9 HighPriority=False NoBlas=False Cublas_Args=['rowsplit', 'mmq'] Tensor_Split=[8.0, 10.0, 5.0] BlasThreads=9 BlasBatchSize=2048 FlashAttention=True KvCache=1
- Timestamp: 2024-06-24 23:00:48.688551+00:00
- Backend: koboldcpp_cublas.so
- Layers: 999
- Model: Meta-Llama-3-70B-Instruct-Q4_K_M
- MaxCtx: 8192
- GenAmount: 100
- -----
- ProcessingTime: 70.858s
- ProcessingSpeed: 114.20T/s
- GenerationTime: 19.923s
- GenerationSpeed: 5.02T/s
- TotalTime: 90.781s
- Output:
- -----
- Server was not started, main function complete. Idling.