Mixture-of-Experts (MoE): DeepSeek’s simplified model explained

DeepSeek shook the AI world! The app uses an architecture that changes how AI models are trained and run, making them much cheaper and more efficient. But before looking into how this model works, note that Mixture-of-Experts is not a new concept. Microsoft’s Z-code translation API uses an MoE architecture to support a massive number of model parameters while maintaining computational efficiency. (For those who are wondering: Project Z-code is a component of Microsoft’s larger XYZ-code initiative to combine AI models for text, vision, audio, and language. Z-code supports the creation of AI systems that can speak, see, hear, and understand. (Source: Microsoft)) The point is that this architecture is already in use; it simply got the hype with DeepSeek.

DeepSeek employs a “Mixture-of-Experts” (MoE) architecture, which activates only a small portion of the model’s parameters at any given time, leading to efficient use of computing power. Imagine a company with a team of specialists, each an expert in a different area. When a complex project comes in, instead of everyone working on everything, a smart manager (let’s call them the “router”) decides which experts are best suited for each part of the project. The beauty of this system is that it’s efficient: not every expert needs to work on every task, saving time and resources.

How does “Mixture-of-Experts” (MoE) architecture work?

Key Components:

  1. Experts: Multiple specialized sub-networks or layers, each trained to handle specific aspects of a task or data subset.
  2. Gating Network/Router: A selector mechanism that dynamically routes input data to the most relevant experts based on the input’s characteristics.
  3. Sparse Activation: A method where only a subset of experts is activated for each input, optimizing computational efficiency.
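
To make the three components concrete, here is a minimal sketch of an MoE layer in PyTorch. It is an illustrative toy, not DeepSeek’s actual implementation: the expert sizes, the linear router, and the top-k value are all assumptions chosen for readability.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model=64, d_hidden=256, num_experts=4, top_k=2):
        super().__init__()
        # Experts: small independent feed-forward sub-networks
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])
        # Gating network / router: scores every expert for a given input
        self.router = nn.Linear(d_model, num_experts)
        self.top_k = top_k

    def forward(self, x):                          # x: (batch, d_model)
        scores = self.router(x)                    # (batch, num_experts)
        topk_scores, topk_idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(topk_scores, dim=-1)   # weights over the chosen experts only
        out = torch.zeros_like(x)
        # Sparse activation: each row is processed by only its top_k experts
        for slot in range(self.top_k):
            idx = topk_idx[:, slot]                # expert chosen in this slot, per row
            w = weights[:, slot].unsqueeze(-1)     # that expert's mixing weight
            for e, expert in enumerate(self.experts):
                mask = idx == e
                if mask.any():
                    out[mask] += w[mask] * expert(x[mask])
        return out
```

Each input row is scored by the router, only its top-k experts actually run, and their outputs are blended by the softmax weights, which is the sparse activation described above.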

Input = Problem statement/prompt

Process =

  • Divide the problem statement/prompt into simpler, manageable parts
  • The router assigns each part to the most relevant expert(s)
  • The activated experts generate their outputs

Final Output = weighted combination of the activated experts’ outputs
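
Continuing the sketch above (same hypothetical MoELayer and toy dimensions), this flow maps directly onto code: the prompt is represented as parts (token-like embeddings here), the router picks experts per part, and the result is the weighted combination of the activated experts’ outputs.

```python
import torch  # repeated so this snippet runs on its own, given MoELayer defined above

moe = MoELayer(d_model=64, d_hidden=256, num_experts=4, top_k=2)
prompt_parts = torch.randn(8, 64)   # stand-in for 8 encoded pieces of a prompt
output = moe(prompt_parts)          # (8, 64): weighted mix of each part's activated experts
```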

How MoE works (in layman’s terms)

Imagine you’re organizing a party with various tasks to handle. Instead of one person trying to do everything, you have a team of specialists, each great at specific jobs. Here’s how a Mixture of Experts (MoE) would work in this scenario:

  1. Party Planner (Gating Network):
    This is like a smart manager who knows each team member’s strengths and assigns tasks accordingly.
  2. Experts (the Party Planning Team):
    • Chef (Expert 1)
    • DJ (Expert 2)
    • Decorator (Expert 3)
    • Guest list/invitation Coordinator (Expert 4)
  3. How it works:
    • When a task comes up, the Party Planner quickly decides who’s best suited for it.
    • For a food-related task, the Chef is activated.
    • For music choices, the DJ takes charge.
    • If it’s about decorations, the Decorator steps in.
    • The Coordinator handles guest-related issues.
  4. Efficiency:
    • Not everyone works on every task.
    • Only the most suitable experts are called upon for each specific job.
    • This approach saves time and ensures each task is handled by the best person for it.

This party planning example illustrates how MoE models work in AI, with specialized experts handling different aspects of a problem and a smart system (gating network) deciding which experts to use for each task.

MoE vs Traditional AI Architecture

Aspect | Mixture of Experts (MoE) | Traditional AI
Architecture | Multiple specialized subnetworks (experts) with a gating network | Single, monolithic network
Computation | Conditional computation, activating only relevant experts | All parameters active for every input
Scalability | Enhanced scalability by adding or adjusting experts | Limited by computational resources
Efficiency | Improved efficiency through selective expert activation | Fixed computational cost regardless of input complexity
Task Handling | Better at handling complex, diverse tasks | Generalist approach to all tasks
Resource Usage | Optimizes resources by activating only relevant experts | All parameters used for every task
Training Complexity | More complex training process | Simpler training process
Interpretability | Can be challenging due to dynamic expert selection | Often easier to interpret and understand
Specialization | Experts focus on specific aspects or subtasks | Generalist approach to problem-solving
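
A quick back-of-the-envelope calculation shows why the “Computation” and “Efficiency” rows differ. The numbers below are purely illustrative assumptions, not DeepSeek’s real configuration; the point is that an MoE layer stores all expert parameters but only computes with the top-k experts it activates per token.

```python
# Illustrative numbers only (assumptions, not DeepSeek's real sizes).
def active_fraction(num_experts: int, top_k: int) -> float:
    """Fraction of expert parameters a single token actually uses under top-k routing."""
    return top_k / num_experts

params_per_expert = 100_000_000            # hypothetical 100M-parameter experts
num_experts, top_k = 16, 2

total_params = num_experts * params_per_expert    # what you have to store
active_params = top_k * params_per_expert         # what one token computes with

print(f"total expert params : {total_params:,}")                          # 1,600,000,000
print(f"active per token    : {active_params:,}")                         # 200,000,000
print(f"active fraction     : {active_fraction(num_experts, top_k):.1%}")  # 12.5%
```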

Besides all of the advantages in efficiency and cost reduction that the MoE model brings, there are some risks and challenges involved in implementing it, including:

  • risk of under-utilization of experts if the router keeps activating the same few experts, leading to disproportionate load balancing (a common mitigation is sketched after this list)
  • higher complexity as the number of experts grows and the system scales
  • communication cost between experts
  • fine-tuning and training the experts; there may be no one-size-fits-all recipe, as the experts are diverse
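
For the first point, a common mitigation in the MoE literature (popularized by Switch Transformer-style models; whether DeepSeek uses exactly this form is not claimed here) is an auxiliary load-balancing loss that nudges the router to spread tokens across experts. A minimal sketch:

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, top1_idx: torch.Tensor) -> torch.Tensor:
    """router_logits: (tokens, num_experts); top1_idx: (tokens,) expert chosen per token."""
    num_experts = router_logits.shape[-1]
    probs = F.softmax(router_logits, dim=-1)
    # f: fraction of tokens actually dispatched to each expert
    f = torch.bincount(top1_idx, minlength=num_experts).float() / top1_idx.numel()
    # p: mean router probability assigned to each expert
    p = probs.mean(dim=0)
    # Minimized when both f and p are uniform across experts
    return num_experts * torch.sum(f * p)

# Toy check: roughly balanced routing gives a loss near 1.0,
# while routing every token to one expert pushes it toward num_experts.
logits = torch.randn(32, 4)
loss = load_balancing_loss(logits, logits.argmax(dim=-1))
```

The loss stays close to 1 when routing is balanced and approaches the number of experts when the router collapses onto a single expert, so adding a small multiple of it to the training objective discourages under-utilization.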

The AI landscape is ever evolving! This could be the way forward to building efficient models. DeepSeek is already being integrated into other AI apps like Perplexity, which is no surprise. Collectively, these models are revolutionizing the world and the future of AI.
