Llama Blocks

A work-in-progress model weight layer subset system. The idea is to let the user change which layers are used from a model file, and how those layers are arranged in blocks (and, if required, the output of blocks can be merged before going as the input to further blocks). To be of real use, the system would need custom model files created for it that contain more layers than would typically be used at a time. So the model file could include the layers from two or more fine tunes, and the block definition file (a simple text file) would define which layers to use and how (if required) to merge block outputs. I'm not sure how useful the block output merge feature is, because compared to a static merge of model weights it requires all the selected layer weights to be processed before the outputs of the blocks of layers are merged.

So the main uses of this system that I see are:
1. Allowing different versions of merges like Undi's 20b merges (with the inverted etc. versions) to be a single model file with all the weight layers from the combined versions. The definition file would then define which layers to use and in what order. So instead of 4 versions of a merge with different orders and blocks of weights in them, it would be a slightly larger single file of the model weights and then 4 (or however many) definition files.
2. Allowing the user to select how different fine tunes are merged. So instead of a static merge, a model file could include the weights (or a subset of the weights) from two or more fine tunes. The definition file would state which layers to use (split into blocks) and how to merge the output of the blocks.
3. Allowing the use of larger models on systems that are near the lower end of the required RAM for that model. For example, I have found that I can run 33B llama models (Q3) on a 16GB system, but I sometimes encounter disk swapping. I have also found that if I cut out a few of the layers (no more than 4 or 5) spread across the model, the output doesn't seem to be noticeably worse, and it cuts memory usage enough to run in 16GB without disk access. (A sketch of a definition file that does this is at the end of this note.)

How To use:
It is very much a work in progress.
The code is https://pastebin.com/JB5fzpT0
Only the llama.cpp file is changed, so you need to replace the original llama.cpp file with that version and then rebuild.
It is based on https://github.com/ggerganov/llama.cpp/tree/51a7cf5c6e490b2f51c82daa76c4ca4f8d845826 (the latest commit at the time of writing).
Then you need to create a definition file. Currently the definition filename is hardcoded (really need to change that) to layerBlocks.txt, so create an empty text file with that name. The format of the file is simple (but currently rigid due to basic parsing).
In the file you define blocks, which are either blocks of layers or merge blocks.

So a simple definition file that uses all 40 layers of a 13b model in a single block would be:

WeightBlock; 0; 40; 0; -1
TotalLayers; 40

The first line defines the single layer block and states that it starts at layer 0 and ends at layer 40 (well, actually 39), and that the weight layers are also going to start from layer 0. It also says that the input to that block is -1, which is the token input. The TotalLayers line gives the total number of layers the assembled model will use.

So the format is: BlockType; StartLayerNumber; EndLayerNumber; StartWeightLayer; InputFromBlockNumber

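For reference, here is a minimal C++ sketch of how a WeightBlock line in that format maps onto fields. It is not the actual parsing code from the paste above (the struct and function names are made up for illustration); it just shows one straightforward way to split on the semicolons:

#include <sstream>
#include <string>

// One parsed definition file line (names here are illustrative only).
struct BlockDef {
    std::string type;        // "WeightBlock" or "MergeBlock"
    int start_layer;         // first graph layer of the block
    int end_layer;           // one past the last graph layer (so 0; 40 means layers 0 to 39)
    int start_weight_layer;  // model file layer the block's weights start from
    int input_block;         // block number providing the input, -1 = token input
};

static BlockDef parse_weight_block_line(const std::string & line) {
    BlockDef def;
    std::stringstream ss(line);
    std::string field;
    std::getline(ss, field, ';'); def.type = field;
    std::getline(ss, field, ';'); def.start_layer = std::stoi(field);        // stoi skips the leading space
    std::getline(ss, field, ';'); def.end_layer = std::stoi(field);
    std::getline(ss, field, ';'); def.start_weight_layer = std::stoi(field);
    std::getline(ss, field, ';'); def.input_block = std::stoi(field);
    return def;
}

// e.g. parse_weight_block_line("WeightBlock; 20; 40; 40; 0") gives
// {type="WeightBlock", start_layer=20, end_layer=40, start_weight_layer=40, input_block=0}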
Now the StartWeightLayer allows a block to use the weights from a different layer of the model. So if a model had more layers than were going to be used (say 60 layers, with some from a different fine tune), then the definition file might be:

WeightBlock; 0; 20; 0; -1
WeightBlock; 20; 40; 40; 0
TotalLayers; 40

This would still define 40 total layers, split into two blocks. The first block has its layers numbered 0 to 19 and uses the weights from model file layers 0 to 19, while the second block has its layers numbered 20 to 39 but uses the weights from model file layers 40 to 59. Note also that the second block defines its input as coming from block 0 (the first block).

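Put another way (reading the mapping off the example above), a layer n inside a block uses the weights from model file layer StartWeightLayer + (n - StartLayerNumber). So graph layer 25 in the second block above takes its weights from model file layer 40 + (25 - 20) = 45.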
The second type of block is the merge block. Currently it only supports a simple scale-add merge of two blocks. So take the above model file: say it had layers 0 to 39 from one fine tune and then layers 20 to 39 from a second fine tune. In the above example, we used the first 20 layers from the first fine tune and then the last 20 layers from the other fine tune. However, with a merge block we could have used all 60 layers:

WeightBlock; 0; 20; 0; -1
WeightBlock; 20; 40; 20; 0
WeightBlock; 40; 60; 40; 0
MergeBlock; 1; 0.5; 2; 0.5
TotalLayers; 60

So in this example, we have three weight blocks. The first two blocks just use the first 40 layers of the model file, like a normal 13b model would. The third block is a bit different in that it also takes its input from the first block (block 0), just like the second block did. So the model graph runs through the first 20 layers, then runs the output from them through two separate subgraphs, and then in the merge block the outputs from those subgraph sections are merged back together. I need to do more testing to see if this gives any different output compared to a static merge of the model weights. However, what it does allow is for the user to set the merge percentage from each sub block.

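Concretely, with the 0.5 scale factors above, the merge block output is just 0.5 * (output of block 1) + 0.5 * (output of block 2), applied element-wise to the hidden state coming out of each sub block.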
The merge block format is: BlockType; First Merge Block Number; First Merge Block ScaleFactor; Second Merge Block Number; Second Merge Block ScaleFactor. So in this example the two blocks are merged 50:50.

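Finally, going back to use 3 at the top (squeezing a big model into less RAM), here is a sketch of the kind of definition file I mean. It assumes a 60-layer model (a 33B llama) and skips 4 of the model file's weight layers (14, 29, 44 and 59) by jumping over them with StartWeightLayer, giving a 56-layer graph. The exact layers to drop are just for illustration; the point is the shape of the file:

WeightBlock; 0; 14; 0; -1
WeightBlock; 14; 28; 15; 0
WeightBlock; 28; 42; 30; 1
WeightBlock; 42; 56; 45; 2
TotalLayers; 56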