Research

Mixtral of Experts

A technical report on the Mixtral-8x7B model.

13B active params per token (47B total)

Mixtral-8x7B-instruct: SFT+DPO

architecture

The output of the expert layer for an input $x$ is given by:

$$\sum_{i=0}^{n-1} G(x)_i \cdot E_i(x)$$

where $G(x)_i$ is the $i$-th coordinate of the gating network's $n$-dimensional output and $E_i(x)$ is the output of the $i$-th expert.

Multiple ways to implement the gating network:

  • “Unified scaling laws for routed language models”
  • “Dselect-k: Differentiable selection in the mixture of experts with applications to multi-task learning”
  • “Mixture-of-experts with expert choice routing”
  • Softmax over the Top-K logits of a linear layer (the approach Mixtral uses): $G(x) := \mathrm{Softmax}(\mathrm{TopK}(x \cdot W_g))$, where $\mathrm{TopK}(\ell)_i := \ell_i$ if $\ell_i$ is among the top-$K$ coordinates of the logits $\ell \in \mathbb{R}^n$ and $-\infty$ otherwise.

Here the value of $K$ (the number of experts used per token) is a hyperparameter. The total parameter count grows with the number of experts $n$, while the number of parameters used for processing each token (the active parameter count) grows with $K$ (up to $n$). Mixtral uses $n = 8$ experts with $K = 2$.
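
To make the total-vs-active distinction concrete, a rough back-of-the-envelope sketch for a single MoE FFN layer. The dimensions are Mixtral-8x7B's published sizes (dim 4096, FFN hidden dim 14336, 32 layers); the helper function is only an illustration and ignores the shared attention and embedding parameters:

def moe_ffn_params(dim: int, hidden_dim: int, n_experts: int, top_k: int):
    """Rough parameter count for one MoE FFN layer (router + experts only)."""
    expert_params = 3 * dim * hidden_dim      # SwiGLU expert: w1, w2, w3
    gate_params = dim * n_experts             # router: Linear(dim, n_experts, bias=False)
    total = n_experts * expert_params + gate_params          # grows with n
    active_per_token = top_k * expert_params + gate_params   # grows with K
    return total, active_per_token

# Mixtral-8x7B sizes: dim=4096, hidden_dim=14336, n=8 experts, K=2
total, active = moe_ffn_params(4096, 14336, 8, 2)   # ~1.41B total vs ~0.35B active for this layer

Summed over Mixtral's 32 layers and adding the shared attention and embedding parameters, this roughly lines up with the 47B total / 13B active figures above.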

The MoE layer replaces the FFN layer in the transformer model. Mixtral uses SwiGLU as the expert function $E_i(x)$ and sets $K = 2$, i.e. each token is routed to 2 SwiGLU sub-blocks. The output for an input token $x$:

$$y = \sum_{i=0}^{n-1} \mathrm{Softmax}(\mathrm{TopK}(x \cdot W_g))_i \cdot \mathrm{SwiGLU}_i(x)$$

Implementation of the MoE layer in the Mistral codebase:

from typing import List

import torch
import torch.nn.functional as F
from torch import nn


class MoeLayer(nn.Module):
    def __init__(self, experts: List[nn.Module], gate: nn.Module, moe_args: MoeArgs):
        super().__init__()
        assert len(experts) > 0
        self.experts = nn.ModuleList(experts)
        self.gate = gate
        self.args = moe_args

    def forward(self, inputs: torch.Tensor):
        # Router logits per token: (num_tokens, num_experts)
        gate_logits = self.gate(inputs)
        # Keep the top-K experts per token, then renormalize their scores with a softmax
        weights, selected_experts = torch.topk(gate_logits, self.args.num_experts_per_tok)
        weights = F.softmax(weights, dim=1, dtype=torch.float).to(inputs.dtype)
        results = torch.zeros_like(inputs)
        # Each expert processes only the tokens routed to it; outputs are summed with their gate weights
        for i, expert in enumerate(self.experts):
            batch_idx, nth_expert = torch.where(selected_experts == i)
            results[batch_idx] += weights[batch_idx, nth_expert, None] * expert(
                inputs[batch_idx]
            )
        return results
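
A quick smoke test of the layer above. The MoeArgs definition below is a stand-in inferred from how self.args is used (the real class lives in the Mistral codebase), and the linear experts are toy placeholders for the SwiGLU blocks:

from dataclasses import dataclass

@dataclass
class MoeArgs:                      # stand-in; mirrors the fields the layer accesses
    num_experts: int
    num_experts_per_tok: int

dim, n_experts = 16, 8
moe = MoeLayer(
    experts=[nn.Linear(dim, dim, bias=False) for _ in range(n_experts)],  # toy experts
    gate=nn.Linear(dim, n_experts, bias=False),
    moe_args=MoeArgs(num_experts=n_experts, num_experts_per_tok=2),
)
tokens = torch.randn(5, dim)        # 5 tokens, already flattened to (num_tokens, dim)
out = moe(tokens)                   # same shape: (5, 16); each token mixes its top-2 experts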

And it replaces the FFN in the transformer block:

class TransformerBlock(nn.Module):
    def __init__(self, args: ModelArgs):
        ...
        # Use a sparse MoE layer in place of the dense FFN when MoE is configured
        if args.moe is not None:
            self.feed_forward = MoeLayer(
                experts=[FeedForward(args=args) for _ in range(args.moe.num_experts)],
                gate=nn.Linear(args.dim, args.moe.num_experts, bias=False),
                moe_args=args.moe,
            )
        else:
            self.feed_forward = FeedForward(args=args)
        ...
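
For reference, the FeedForward module used as each expert is a SwiGLU block; a minimal sketch in the spirit of the Mistral reference implementation (the w1/w2/w3 names and the dim/hidden_dim fields of ModelArgs are assumptions):

class FeedForward(nn.Module):
    def __init__(self, args: ModelArgs):
        super().__init__()
        self.w1 = nn.Linear(args.dim, args.hidden_dim, bias=False)  # gate projection
        self.w2 = nn.Linear(args.hidden_dim, args.dim, bias=False)  # down projection
        self.w3 = nn.Linear(args.dim, args.hidden_dim, bias=False)  # up projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: silu(x @ W1) * (x @ W3), then project back down with W2
        return self.w2(F.silu(self.w1(x)) * self.w3(x))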

deployment

  • MegaBlocks: specialized kernels that cast the FFN ops of the MoE layer as large sparse matrix multiplications, naturally handling experts that receive different numbers of tokens.
  • Model Parallelism / Expert Parallelism (EP): MoE layers can be distributed across multiple GPUs; EP places whole experts on different devices and needs load balancing to avoid overloading some GPUs.
  • Leveraging routing locality: the proportion of consecutive tokens assigned to the same experts is significantly higher than random, especially in higher layers, which can be exploited for expert caching and speculative loading when offloading (see the sketch after this list).
    • “Fast inference of mixture-of-experts language models with offloading.”
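
A hypothetical helper (not from the paper or the codebase) showing one simple way to quantify that repetition rate from the router's selections, e.g. the selected_experts tensor produced inside MoeLayer.forward for one sequence at one layer:

def consecutive_assignment_rate(selected_experts: torch.Tensor) -> float:
    # selected_experts: (seq_len, K) expert indices for one sequence at one layer.
    prev, curr = selected_experts[:-1], selected_experts[1:]
    # A position counts as "repeated" if any of its K experts was also used by the previous token.
    repeated = (curr.unsqueeze(-1) == prev.unsqueeze(-2)).any(dim=-1).any(dim=-1)
    return repeated.float().mean().item()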