Clearing Up Misconceptions about Sparse Mixture of Experts Models and Merges

AightBits
Jan 26th, 2024 (edited)

This article is written for hobbyists and non-AI professionals who are already exploring local Large Language Model merging. It specifically aims to explain the concerns with using tools like mergekit to simply swap out or merge the experts of a Mixture of Experts model. It assumes a basic knowledge of generative transformer-based AI.
Background

The last few years have seen rapid growth in the development of generative transformer-based Large Language Model (LLM) technology. Many professionals and technology enthusiasts are already familiar with OpenAI’s ChatGPT, Google’s Bard, and Anthropic’s Claude, as well as models that can be run locally on consumer hardware, such as Meta’s Llama (currently Llama 2), Stanford’s Alpaca, LMSYS Org’s Vicuna, and Mistral AI’s Mistral.

Until recently, users running local LLMs had a choice between smaller, faster, less resource-intensive but less capable models like Mistral 7B and Llama 2 13B, and larger, slower, resource-intensive but more powerful models like Llama 2 70B. Now Mistral’s latest offering, Mixtral, attempts to bridge the gap between these two extremes using an architecture known as Mixture of Experts (MoE), which is capable of generating better output than smaller models while requiring less expensive hardware than its larger counterparts.

Although this implementation is recent, the concept of MoE isn’t new: the first paper on the subject, 'Adaptive Mixtures of Local Experts', was published in 1991. This architecture involves training a special gating model, or router, and several discrete models, known as experts, simultaneously and on the same dataset. This allows for the efficient allocation of computational resources: each discrete but connected “expert” model learns both shared general knowledge and its own specific, uniquely grouped knowledge, while the router selects the appropriate experts to activate for each token and then combines their outputs to generate a unified response. MoE models can therefore draw on more information while activating only the most appropriate collections of weights, or experts, improving efficiency compared to traditional dense models.
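To make the routing step concrete, here is a minimal sketch of a sparse MoE layer in PyTorch. It is illustrative only: the layer sizes, the choice of two active experts per token, and the SimpleMoELayer class are assumptions for this example, not Mixtral’s actual implementation. A linear gate scores every expert slot for each token, only the top-scoring experts are run, and their outputs are combined using the gate’s probabilities.

```python
# Minimal sparse MoE layer sketch (illustrative; not Mixtral's actual code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoELayer(nn.Module):
    def __init__(self, hidden_size=64, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # The router ("gate") is trained jointly with the experts.
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)
        # Each expert is an ordinary feed-forward block.
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_size, 4 * hidden_size),
                nn.GELU(),
                nn.Linear(4 * hidden_size, hidden_size),
            )
            for _ in range(num_experts)
        ])

    def forward(self, x):                                       # x: (num_tokens, hidden_size)
        scores = self.gate(x)                                   # (num_tokens, num_experts)
        weights, chosen = torch.topk(scores, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)                    # per-token mixing weights
        out = torch.zeros_like(x)
        for slot, expert in enumerate(self.experts):
            token_idx, k_idx = torch.where(chosen == slot)      # tokens routed to this slot
            if token_idx.numel() == 0:
                continue                                        # unselected experts stay inactive
            out[token_idx] += weights[token_idx, k_idx].unsqueeze(-1) * expert(x[token_idx])
        return out

tokens = torch.randn(5, 64)              # hidden states for five tokens
print(SimpleMoELayer()(tokens).shape)    # torch.Size([5, 64])
```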

Misconceptions

It may sound as though each expert is trained in a distinct knowledge area, such as language, math, logic, facts, or creative content, or that each expert in an MoE model is trained separately on specific data. This is a common misconception, even within the hobbyist AI community.
In reality, both the router and the experts are trained concurrently on the same data. Each expert learns a mix of general data and specific, related information. Their specialization is not in separate domains as commonly perceived, and there is considerable (and in fact necessary) overlap in their learning. This is akin to a classroom where all students learn the same material, but each picks up more in particular areas of interest or strength.

After training, the teacher (router) and all of the graduated students (experts) form the model. The router (now in the role of a team leader) knows which experts have gained expertise in particular areas. During inference, the router selects the most appropriate experts for each token and integrates their output before deciding on the final result.
However, this is an oversimplification. Specialization in MoE models often relates more to processing efficiency for certain data patterns than to discrete domain expertise. Removing and replacing experts with ones not involved in the original training disrupts this synergy: their knowledge doesn't complement that of the existing experts, and the router can no longer effectively determine the best experts to consult.
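This mismatch can be made concrete with a few lines of code. The sketch below (an illustration with made-up layer sizes, not any real model's weights) swaps the module behind one expert slot for an MLP the gate never trained with. The gate's learned routing scores are indexed by slot position, so they are completely unaffected by the swap and keep sending tokens to a slot whose behavior they no longer describe.

```python
# Sketch of why swapping an expert breaks router/expert synergy (illustrative only).
import torch
import torch.nn as nn

hidden, n_experts = 64, 4

def make_mlp():
    return nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.GELU(), nn.Linear(4 * hidden, hidden))

gate = nn.Linear(hidden, n_experts, bias=False)          # learned jointly with the experts
experts = nn.ModuleList([make_mlp() for _ in range(n_experts)])

# 'FrankenMoE'-style swap: drop in an MLP standing in for an unrelated fine-tune.
experts[2] = make_mlp()

# The gate still scores slot 2 exactly as it learned to during training,
# even though the expert behind that slot no longer behaves as it expects.
token = torch.randn(1, hidden)
print(gate(token).softmax(dim=-1))                       # routing probabilities unchanged by the swap
```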

Merges and MoE Models

Merging two or more different models to blend their generative knowledge, or weights, has become popular, especially within the hobbyist AI community. The same approach is now being applied to Mixtral, but it is still early days and some don’t understand the implications of such merges.

Some hobbyists are experimenting with 'FrankenMoEs,' an evolution of 'Frankenmerges,' using tools like mergekit. Merging dense models involves combining their weights (the learned associations of data in a Large Language Model), thus mixing characteristics of the source models. This process is far more complicated with sparse MoE models, however: replacing experts within the unified structure significantly degrades the overall model and undermines the advantages of the MoE architecture.
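For contrast, the sketch below shows roughly what a simple linear merge of two dense checkpoints does: every shared parameter tensor is interpolated between the parents. This is a hedged illustration of the general idea behind dense merges, not mergekit's actual code, and the file names are placeholders. With a sparse MoE, the experts and the router form a jointly trained unit, so splicing expert weights from unrelated models into that structure is a very different, and far more disruptive, operation.

```python
# Illustrative linear merge of two dense checkpoints (not mergekit's implementation).
import torch

def linear_merge(state_dict_a, state_dict_b, alpha=0.5):
    """Interpolate every shared parameter: alpha * A + (1 - alpha) * B."""
    merged = {}
    for name, tensor_a in state_dict_a.items():
        tensor_b = state_dict_b[name]                    # assumes identical architectures
        merged[name] = alpha * tensor_a + (1.0 - alpha) * tensor_b
    return merged

# Usage sketch (placeholder file names):
# sd_a = torch.load("model_a.bin", map_location="cpu")
# sd_b = torch.load("model_b.bin", map_location="cpu")
# torch.save(linear_merge(sd_a, sd_b, alpha=0.5), "merged_model.bin")
```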

Consider the analogy of a sports coach training a team. The coach and players, trained together, possess overlapping knowledge and specific skills for their positions. If the coach and some players are suddenly replaced with individuals from different sports, the cohesion is lost. The new coach wouldn't know how best to allocate players, and the players would be unfamiliar with the game's rules. Even if the coach adapts to the new players' strengths and weaknesses, the team won't function as cohesively as before.

Like the teacher and student analogy, this too is an oversimplification meant to help convey the collaborative aspect of MoE models. In reality, the router functions more like a gatekeeper or selector, deciding which experts are best suited for the task at hand based on their training and specialties.

Conclusion

Experts in an MoE model are not separate entities with distinct domain knowledge. Instead, the MoE architecture optimizes how training data is distributed among the various experts. This approach is not about dividing knowledge into specific domains but about grouping related data to reduce computational demands and enhance performance. MoE models also facilitate more efficient scaling of model size and complexity, enabling the training of larger models than would be feasible with traditional, dense architectures.

Key Points:

• An MoE router and its experts are trained together on the same datasets, not separately or on different datasets.
• The experts form an integral, cohesive unit rather than being separate entities with distinct domain knowledge.
• Expert specialization emerges from training, not from predetermined designation.
• The router is a critical component of the system, trained in tandem with the experts.