AightBits

MoE Merge Misconceptions

Jan 22nd, 2024 (edited)
Clearing Up Misconceptions about Sparse Mixture of Experts Models and Merges

With the sparse MoE architecture, a router and multiple smaller expert networks are trained simultaneously on the same training data, allowing computational work to be distributed among them. During inference, the router determines which experts are best suited to handle specific portions of the input based on their training and specialization, then aggregates their outputs to form a cohesive response. This dynamic allocation of work enables MoE models to handle complex queries more efficiently and effectively than traditional dense models.

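To make the architecture concrete, here is a minimal sketch (in PyTorch-style Python) of a sparse MoE feed-forward layer with top-2 routing. The class name, layer sizes, and expert structure are illustrative assumptions, not taken from any particular model.

    # Minimal sketch of a sparse MoE feed-forward layer with top-2 routing.
    # Dimensions and names are illustrative assumptions, not from a specific model.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SparseMoELayer(nn.Module):
        def __init__(self, hidden_dim=512, ffn_dim=2048, num_experts=8, top_k=2):
            super().__init__()
            self.top_k = top_k
            # The router (gate) is a small linear layer trained jointly with the experts.
            self.router = nn.Linear(hidden_dim, num_experts)
            # Each "expert" is just a feed-forward block, not a separate domain model.
            self.experts = nn.ModuleList([
                nn.Sequential(nn.Linear(hidden_dim, ffn_dim), nn.GELU(), nn.Linear(ffn_dim, hidden_dim))
                for _ in range(num_experts)
            ])

        def forward(self, x):                      # x: (num_tokens, hidden_dim)
            logits = self.router(x)                # (num_tokens, num_experts)
            weights, indices = logits.topk(self.top_k, dim=-1)
            weights = F.softmax(weights, dim=-1)   # normalize over the chosen experts only
            out = torch.zeros_like(x)
            # Each token's output is a weighted combination of its top-k experts.
            for slot in range(self.top_k):
                for e, expert in enumerate(self.experts):
                    mask = indices[:, slot] == e
                    if mask.any():
                        out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
            return out

Only the top-k experts actually run for each token, which is where the computational savings over a dense layer of equivalent total size come from.
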
There is a misconception, even among the more knowledgeable members of the hobbyist AI community, that each expert in an MoE model is trained separately on a particular area of knowledge or function, such as language, math, logic, facts, or creative content.

In reality, the router and every expert are trained together on the same training data. During training, each expert learns a combination of shared, general knowledge and more specific, related pieces of knowledge. Specialization does not take the form of the separate domains we tend to imagine, and there is considerable overlap between what each expert learns. It is more like a classroom with one teacher and multiple students, all learning the same material, with each student gaining general knowledge while also concentrating on particular areas of interest or strength.

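To make the "trained together" point concrete, here is a hedged sketch of joint training using the SparseMoELayer sketch above: the router and all experts are parameters of one model, and every batch sends gradients to both. The data here is a random stand-in, not a real training set.

    # Sketch of joint training: router and experts are updated together from the same batches.
    # The SparseMoELayer class comes from the earlier sketch; the data is a random stand-in.
    import torch
    import torch.nn.functional as F

    model = SparseMoELayer(hidden_dim=512)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # covers the router AND the experts

    for step in range(1000):
        x = torch.randn(64, 512)        # stand-in for token hidden states
        target = torch.randn(64, 512)   # stand-in for the training signal
        loss = F.mse_loss(model(x), target)
        optimizer.zero_grad()
        loss.backward()                 # gradients flow to the router and to every selected expert
        optimizer.step()

Real MoE training also adds an auxiliary load-balancing loss so the router spreads tokens across experts instead of collapsing onto a few; that detail is omitted here for brevity.
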
Later, when training is complete, these graduated students make up the model, with the teacher (router) knowing which students learned more about (or became experts in) specific concepts and knowledge. During a task, or prompt, the teacher chooses two (or more) of the most appropriate students for each problem (token) and then considers and integrates their input before the final decision is made about which token to produce.

It’s important to realize this is an oversimplification. While some degree of specialization does occur, it is more subtle and less distinct than the analogy might imply. In practice, the specialization in MoE models is often more about efficiency in processing certain types of patterns or data than about developing discrete domain expertise. This is why one can't simply remove experts and replace them with ones that were not involved in the original lessons: their knowledge won't complement the other experts, and the teacher won't know which experts to consult.

Using tools such as mergekit, some hobbyists have begun to experiment with what can be called “FrankenMoEs,” an evolution of “Frankenmerges.” When merging dense models, weights (the learned parameters that encode an LLM's associations) are combined, giving the end product a blend of the characteristics of the original source models. Doing the same with a sparse MoE model isn't so simple, however: swapping experts in and out of this unified whole severely degrades the final product and negates many of the benefits of MoE.

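As a hypothetical illustration of why swapping experts is not a neutral operation, the snippet below copies one expert's weights from an unrelated donor model into a trained MoE layer. The names again follow the SparseMoELayer sketch above rather than any real checkpoint layout.

    # Hypothetical illustration of a naive "FrankenMoE"-style expert swap.
    trained_moe = SparseMoELayer(hidden_dim=512)   # imagine this one was trained jointly
    donor = SparseMoELayer(hidden_dim=512)         # an unrelated model trained elsewhere

    # Overwrite expert 3 with the donor's expert 3. The tensors are shape-compatible,
    # so nothing errors out...
    trained_moe.experts[3].load_state_dict(donor.experts[3].state_dict())

    # ...but trained_moe.router still encodes routing decisions learned against the
    # ORIGINAL expert 3, so tokens sent there now land in a module the rest of the
    # network never co-adapted to.

This is also why FrankenMoE tooling typically has to construct a new gate heuristically (for example, from hidden states of example prompts, or at random) rather than reuse a router that was trained jointly with the original experts.
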
In this case, we can look at it like a coach with a team of players. The coach and the players are trained together on the same game, gaining overlapping general knowledge as well as knowledge of their specific positions. Because of this, the coach knows each player's strengths and weaknesses and how to allocate them. However, if the coach and/or one or more players are suddenly swapped out for random people from completely different sports, the coach no longer knows which players to put in which positions, and the players do not know how to play by the same rules, lacking both the necessary shared knowledge and a meaningful specialty. In this example, even if the coach learns the strengths and weaknesses of these new, random players (retraining), the players still won't fully complement each other, as they come from different sports.

In conclusion, the experts of an MoE model are not distinct entities (or separate models), nor are they intended to hold specific domain knowledge; rather, the expert structure is how the architecture handles the distribution of training data. MoE is better understood as a method of dividing the same training across several experts, reducing a model's computational demand and increasing performance by grouping related pieces of data, not as a way of dividing knowledge into specific domains. Additionally, MoE models allow for more efficient scaling of model size and complexity, enabling the training of larger models than would be feasible with a traditional, dense architecture.

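To put a rough number on the efficiency claim, here is a back-of-the-envelope calculation with made-up layer sizes: a top-2-of-8 MoE layer stores every expert's parameters but only runs two experts per token.

    # Back-of-the-envelope comparison of stored vs. active parameters per MoE layer.
    # All sizes are made-up round numbers for illustration.
    hidden_dim, ffn_dim = 4096, 14336
    num_experts, top_k = 8, 2

    expert_params = 2 * hidden_dim * ffn_dim   # up- and down-projection weights (simplified)
    stored = num_experts * expert_params       # what must be kept in memory
    active = top_k * expert_params             # what actually runs for each token

    print(f"stored per layer: {stored / 1e6:.0f}M parameters")
    print(f"active per token: {active / 1e6:.0f}M parameters ({active / stored:.0%} of stored)")

Total capacity grows with the number of experts while per-token compute stays roughly fixed by the top-k setting, which is the scaling advantage described above.
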
Key points:
* True MoE experts are trained together, not separately
* Experts form an integrated whole, not separate entities with distinct domain knowledge
* Expert specialization emerges through training experience, not by designation
* The router is an integral part of the complete system, trained simultaneously with the experts