- # Multimodal and GGUF
- ## Projectors and LLMs:
- - KoboldCPP uses projectors to enable vision functionality. Projectors are named with the scheme 'mmproj' and should not be heavily quantized. Since they are usually not very large (under 1 GB), it is best to leave them at F16. The quantization levels of the LLM and the projector do not have to match.
- - A CLIP model is usually composed of a vision encoder and a text encoder. The vision encoder takes the features of an image and converts them into embeddings; the text encoder does the same with text. Combined, they can classify images by comparing them against given words, or check whether a description matches an image, which is useful for things like search and image generation. However, a plain CLIP model is not capable of generating text the way an LLM can.
- - The way vision models work with KoboldCPP is by taking a CLIP model, usually a vision transformer (ViT), and replacing or supplementing the text encoder with an LLM. Because the two are trained together, the LLM is able to generate text while accessing the embeddings it shares with the vision model.
- - The bridge between them takes the form of a projector. This projector is highly modular: it can be swapped between LLM weights, but it cannot move across model architectures! You cannot attach a Vicuna projector to a Llama 3 language model, or even a Vicuna 7B projector to a Vicuna 13B model! However, given how active the fine-tuning community is, you usually have a great number of models to choose from for a given projector. A toy sketch of what the projector does follows this list.
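- - To make the bridge concrete, here is a toy sketch (not KoboldCPP's actual code) of what a projector does: it maps the vision encoder's patch embeddings into the LLM's embedding space so that image 'tokens' can sit alongside text tokens. The dimensions and the two-layer MLP shape are illustrative assumptions, loosely modeled on LLaVA-style projectors.
```python
# Toy illustration of a vision projector. All sizes are hypothetical:
# 1024-dim ViT patch embeddings, a 4096-dim LLM embedding space, and
# 576 patches (a 336x336 image cut into 14x14-pixel patches).
import torch
import torch.nn as nn

vision_dim, llm_dim, num_patches = 1024, 4096, 576

class Projector(nn.Module):
    """A small MLP that bridges the vision encoder and the language model."""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_embeddings):
        return self.mlp(patch_embeddings)

patches = torch.randn(1, num_patches, vision_dim)  # output of the ViT
image_tokens = Projector()(patches)                # now shaped like LLM embeddings
print(image_tokens.shape)                          # torch.Size([1, 576, 4096])
```
- - This is also why a projector cannot cross architectures: it is trained to emit vectors in one specific model's embedding space, so a Vicuna 7B projector produces embeddings that a Llama 3 model (or a Vicuna 13B model, with its wider hidden size) cannot interpret.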
- ## How it Works on the API Level:
- 1. Image files are converted to a text representation of their binary data called base64. Base64 was developed as a way to send binary files over text-only mediums, such as embedding a zip file in the body of an email.
- 2. The base64 text is sent to the KoboldCPP backend over the API as an item in a JSON array called 'images'; up to four images can be in this array. The same JSON payload also carries the prompt as a string in the prompt field, the sampler settings, and other optional values the client can specify (a sketch of such a request follows this list).
- 3. The image data is placed after the memory, the context, and the prompt, so if these are long, the image may lose relevance to the model.
- 4. The image is decoded by KoboldCPP and sent to the CLIP encoder. This processes the image by turning it into an array of RGB values, then resizing it and segmenting it into portions as specified by the model. It is commonly chopped into 'patches' of 14x14 pixels, each roughly the equivalent of an image 'token'; for example, a 336x336 input with 14x14 patches yields 24x24 = 576 patches.
- 5. These patches are then turned into embeddings, which are high-dimensional vectors of numbers. The LLM and the image projector have been trained to share the same vector space, so the vision side can hand these embeddings to the language model, which 'sees' them the same way it grasps ideas expressed in language.
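- Putting these steps together, here is a minimal sketch of a client request, assuming a local KoboldCPP instance on the default port (5001), the /api/v1/generate endpoint, and the fields described above (a 'prompt' string and an 'images' array of base64 strings). The exact endpoint, field names, and response shape can vary between versions, so check your KoboldCPP API documentation rather than treating this as authoritative.
```python
# Minimal sketch: send one image plus a prompt to a local KoboldCPP
# instance. Assumes the default port and a KoboldAI-style response
# ({"results": [{"text": ...}]}); adjust for your version's API docs.
import base64
import json
import urllib.request

# Step 1: binary image -> base64 text so it can travel inside JSON.
with open("photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

# Step 2: build the JSON payload: prompt, images array (up to four),
# and whatever sampler settings the client wants to specify.
payload = {
    "prompt": "Describe this image.\n",
    "images": [image_b64],
    "max_length": 200,
    "temperature": 0.7,
}

req = urllib.request.Request(
    "http://localhost:5001/api/v1/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.loads(resp.read())
    print(result["results"][0]["text"])
```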
- ## FAQ:
- ### How do I know which projector can work with which model?
- The same way you know which prompt template to use for a fine-tune: either you are told, you figure out what the base model is, or you just try it and see whether it works. Do not despair, though, for I have included a cheat sheet below to get you started.
- ### I am getting nonsense or repetitive generations. What is happening?
- Your prompt is bad, your model and projector fit together well enough not to crash KoboldCPP but not well enough to actually work, or your sampler settings are wrong. Even if you send the model an image composed of random pixels, it should still produce coherent generations when asked about it (I have tried it).
- ### The model is ignoring the image!
- KoboldCPP attaches the image after the memory, context, and prompt. If those are too long, the image gets lost and the model forgets about it.
- ### It refuses to see what is actually in the image!
- Yeah, if your image has people in it touching each other's bathing suit areas, it tends to either ignore that, ignore the image completely, or mention an 'intimate moment'. This happens regardless of which language model is attached to the projector. Don't like it? Put your 'energy' to use and train one that doesn't do that.
- ### I have a question about something not addressed in this document.
- You have reached the limits of my expertise. For fear of being a charlatan, I will answer no further inquiries.
- ## Cheat Sheet
- These are the primary vision systems available as GGUF that work in KoboldCPP, along with their paired LLMs, context sizes, and vision encoder or input pixel size. Some entries may be wrong because I am working from trial and error and memory; feel free to add corrections and additions. A launch sketch showing how to pair a model with its projector follows the list:
- - Llava Phi
- - Phi-3 Mini 4k 4B (4096) (clip-vit-large-patch14-336)
- - Xtuner
- - Llama 3 8B (8192) (336)
- - Llava 1.5
- - Mistral 7B (vit-large336-custom)
- - Phi-2 3B
- - Unhinged
- - Llama 3.1 8B
- - Llava 1.6 (Llava Next)
- - Nous-Hermes-Yi 34B (4096) (vit-large336-custom)
- - Vicuna 1.5 7B, 13B (4096) (vit-large336-custom)
- - Mistral 7B (32768) (vit-large336-custom)
- - Llama 3 8B (8192) (clip-vit-large-patch14-336)
- - Llava Next Video
- - Mistral 7B (4096) (vit-large336-custom)
- - MobileVLM
- - MobileLlama 1.4B (?) (?)
- - MobileVLM 2.1
- - MobileLlama 1.4B (2048) (clip-vit-large-patch14-336)
- - MiniCPM 2.5
- - Llama 3 8B (8192) (image encoder for MiniCPM-V) (448px)
- - MiniCPM 2.6
- - Qwen2 7B (32768) (image encoder for MiniCPM-V) (448px)
- - ShareGPT4V
- - Vicuna 7B, 13B (4096) (vit-large336-l12)
- - Bunny V
- - Phi-3 Mini 4K 4B
- - Llama 3 8B
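- To use any of these pairings, point KoboldCPP at both the quantized language model and its matching mmproj file when launching. The sketch below uses hypothetical filenames for a Llava 1.6 Vicuna 7B pairing; verify the flags against `python koboldcpp.py --help` for your build.
```python
# Hypothetical launch sketch: pair a quantized LLM GGUF with its
# matching mmproj projector. Filenames are placeholders.
import subprocess

subprocess.run([
    "python", "koboldcpp.py",
    "--model", "llava-v1.6-vicuna-7b.Q4_K_M.gguf",  # quantized LLM weights
    "--mmproj", "mmproj-vicuna-7b-f16.gguf",        # projector left at F16
    "--contextsize", "4096",                        # per the cheat sheet entry
])
```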