Mixtral of Experts
A technical report on the Mixtral-8x7B model.
13B active params per token (~47B total)
Mixtral-8x7B-Instruct: fine-tuned with SFT + DPO
architecture
The output of the expert layer for an input $x$ is given by:

$$y = \sum_{i=0}^{n-1} G(x)_i \cdot E_i(x)$$

where $G(x)_i$ is the $i$-th output of the gating network and $E_i(x)$ is the output of the $i$-th expert. Wherever $G(x)_i = 0$, the expert $E_i$ does not need to be computed at all.
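A minimal sketch of this weighted sum with a dense softmax gate (the toy expert shapes and module names below are illustrative, not Mixtral's):

import torch
from torch import nn

dim, n_experts = 16, 4
experts = nn.ModuleList(
    nn.Sequential(nn.Linear(dim, 32), nn.SiLU(), nn.Linear(32, dim))
    for _ in range(n_experts)
)                                                # toy experts E_i
gate = nn.Linear(dim, n_experts, bias=False)     # gating network producing G(x)

x = torch.randn(5, dim)                          # 5 tokens
G = torch.softmax(gate(x), dim=-1)               # (5, n_experts), one weight per expert
y = sum(G[:, i:i + 1] * experts[i](x) for i in range(n_experts))  # y = sum_i G(x)_i * E_i(x)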
Multiple ways to implement the gating network:
- “Unified scaling laws for routed language models”
- “Dselect-k: Differentiable selection in the mixture of experts with applications to multi-task learning”
- “Mixture-of-experts with expert choice routing”
- Softmax over the Top-K logits of a linear layer (the option used in Mixtral):

$$G(x) := \mathrm{Softmax}(\mathrm{TopK}(x \cdot W_g))$$

where $\mathrm{TopK}(\ell)_i := \ell_i$ if $\ell_i$ is among the top-$K$ coordinates of the logits $\ell \in \mathbb{R}^n$, and $-\infty$ otherwise.
Here the value of $K$ (the number of experts used per token) is a hyperparameter. The total parameter count grows with the number of experts $n$, while the number of parameters used for processing each token (the active parameter count) grows with $K$ (up to $n$).
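A back-of-the-envelope sketch of this split for Mixtral-8x7B. The config values are assumptions of this sketch (the publicly released Mixtral-8x7B config), not numbers quoted from the report, and norms/gates are ignored:

# rough parameter count: total grows with n_experts, active grows with k
dim, hidden, n_layers, vocab = 4096, 14336, 32, 32000
n_experts, k = 8, 2
n_heads, n_kv_heads, head_dim = 32, 8, 128

expert = 3 * dim * hidden                                               # SwiGLU: w1, w2, w3
attn = 2 * dim * n_heads * head_dim + 2 * dim * n_kv_heads * head_dim   # wq, wo + wk, wv (GQA)
embed = 2 * vocab * dim                                                 # input embeddings + output head

total = n_layers * (n_experts * expert + attn) + embed                  # all experts counted
active = n_layers * (k * expert + attn) + embed                         # only k experts per token
print(f"total ~ {total / 1e9:.1f}B, active ~ {active / 1e9:.1f}B")      # ~46.7B vs ~12.9B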
The MoE layer replaces the FFN layer in the transformer model. Mixtral uses SwiGLU blocks as the experts $E_i$ and sets $K = 2$, i.e. each token is routed to 2 SwiGLU sub-blocks. The output for a token $x$ is:

$$y = \sum_{i=0}^{n-1} \mathrm{Softmax}(\mathrm{Top2}(x \cdot W_g))_i \cdot \mathrm{SwiGLU}_i(x)$$
Implementation of the MoE layer in the Mistral codebase:
from typing import List

import torch
import torch.nn.functional as F
from torch import nn

class MoeLayer(nn.Module):
    def __init__(self, experts: List[nn.Module], gate: nn.Module, moe_args: MoeArgs):
        super().__init__()
        self.experts = nn.ModuleList(experts)
        self.gate = gate
        self.args = moe_args  # MoeArgs: small dataclass with num_experts and num_experts_per_tok

    def forward(self, inputs: torch.Tensor):
        gate_logits = self.gate(inputs)  # one routing logit per expert for each token
        # keep only the top-K experts per token; softmax is taken over the selected logits
        weights, selected_experts = torch.topk(gate_logits, self.args.num_experts_per_tok)
        weights = F.softmax(weights, dim=1, dtype=torch.float).to(inputs.dtype)
        results = torch.zeros_like(inputs)
        # each expert processes only the tokens routed to it, scaled by its gate weight
        for i, expert in enumerate(self.experts):
            batch_idx, nth_expert = torch.where(selected_experts == i)
            results[batch_idx] += weights[batch_idx, nth_expert, None] * expert(
                inputs[batch_idx]
            )
        return results
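A quick usage sketch; the MoeArgs stand-in and toy experts below are illustrative, not the codebase's FeedForward (define or import MoeArgs before the class above, since its type hint references it):

from dataclasses import dataclass

@dataclass
class MoeArgs:                       # minimal stand-in for the codebase's MoeArgs
    num_experts: int
    num_experts_per_tok: int

dim = 16
args = MoeArgs(num_experts=4, num_experts_per_tok=2)
moe = MoeLayer(
    experts=[nn.Sequential(nn.Linear(dim, 32), nn.SiLU(), nn.Linear(32, dim))
             for _ in range(args.num_experts)],   # toy experts instead of SwiGLU FeedForward
    gate=nn.Linear(dim, args.num_experts, bias=False),
    moe_args=args,
)
tokens = torch.randn(5, dim)         # (num_tokens, dim); each token picks its own 2 experts
out = moe(tokens)                    # same shape as the input: (5, 16)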
This MoeLayer replaces the FFN in the transformer block:
class TransformerBlock(nn.Module):
    def __init__(self, args: ModelArgs):
        ...
        # use an MoE layer in place of the dense FFN when MoE is configured
        if args.moe is not None:
            self.feed_forward = MoeLayer(
                experts=[FeedForward(args=args) for _ in range(args.moe.num_experts)],
                gate=nn.Linear(args.dim, args.moe.num_experts, bias=False),
                moe_args=args.moe,
            )
        else:
            self.feed_forward = FeedForward(args=args)
        ...
deployment
- MegaBlocks: specialized GPU kernels that cast the FFN operations of the MoE layer as large sparse matrix multiplications.
- Model Parallelism / Expert Parallelism: the experts of an MoE layer can be distributed across multiple GPUs, each GPU hosting a subset of the experts. EP needs load balancing so no single GPU is overloaded.
- Leveraging routing locality: consecutive tokens are assigned to the same experts significantly more often than chance, especially at higher layers, which can be exploited to cache or prefetch experts when offloading them (a toy cache sketch follows this list).
- “Fast inference of mixture-of-experts language models with offloading.”
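A toy sketch of exploiting that locality when experts are offloaded to CPU: keep a small LRU set of experts on the GPU and move an expert over only on a cache miss. This is only an illustration of the idea, not the implementation from the offloading paper:

from collections import OrderedDict
from typing import List

from torch import nn

class ExpertCache:
    """Keep at most `capacity` experts on the GPU; the rest stay offloaded on the CPU."""

    def __init__(self, experts: List[nn.Module], capacity: int, device: str = "cuda"):
        self.experts = [e.cpu() for e in experts]   # all experts start on the CPU
        self.capacity = capacity
        self.device = device
        self.resident = OrderedDict()               # ids of experts currently on GPU, by recency

    def get(self, expert_id: int) -> nn.Module:
        if expert_id in self.resident:              # hit: likely, since consecutive tokens
            self.resident.move_to_end(expert_id)    # tend to reuse the same experts
        else:                                       # miss: evict the least recently used expert
            if len(self.resident) >= self.capacity:
                lru_id, _ = self.resident.popitem(last=False)
                self.experts[lru_id].cpu()
            self.experts[expert_id].to(self.device)  # nn.Module.to() moves params in place
            self.resident[expert_id] = True
        return self.experts[expert_id]

In the MoeLayer loop above, `expert(inputs[batch_idx])` would then become `cache.get(i)(inputs[batch_idx])`.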