Research

Mixtral of Experts

A technical report on the Mixtral-8x7B model.

13B active params per token (47B total)

Mixtral-8x7B-instruct: SFT+DPO

architecture

The output of the expert layer for an input $x$ is given by:

$$\sum_{i=0}^{n-1} G(x)_i \cdot E_i(x)$$

where $G(x)_i$ is the $i$-th coordinate of the gating network's $n$-dimensional output and $E_i(x)$ is the output of the $i$-th expert.

Multiple ways to implement the gating network:

  • “Unified scaling laws for routed language models”
  • “Dselect-k: Differentiable selection in the mixture of experts with applications to multi-task learning”
  • “Mixture-of-experts with expert choice routing”
  • Softmax over the Top-K logits of a linear layer (the approach Mixtral uses): $G(x) := \mathrm{Softmax}(\mathrm{TopK}(x \cdot W_g))$, where $\mathrm{TopK}(\ell)_i := \ell_i$ if $\ell_i$ is among the top-$K$ coordinates of the logits $\ell \in \mathbb{R}^n$ and $-\infty$ otherwise.

Here the value of $K$ (the number of experts used per token) is a hyperparameter. The total parameter count grows with the number of experts $n$, while the number of parameters used for processing each token (the active parameter count) grows with $K$ (up to $n$). Mixtral uses $n = 8$ experts with $K = 2$.
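
To make the total-vs-active distinction concrete, a rough back-of-the-envelope sketch for a single MoE FFN layer. The dimensions are Mixtral-8x7B's published sizes (dim 4096, FFN hidden dim 14336, 32 layers); the helper function is only an illustration and ignores the shared attention and embedding parameters:

def moe_ffn_params(dim: int, hidden_dim: int, n_experts: int, top_k: int):
    """Rough parameter count for one MoE FFN layer (router + experts only)."""
    expert_params = 3 * dim * hidden_dim      # SwiGLU expert: w1, w2, w3
    gate_params = dim * n_experts             # router: Linear(dim, n_experts, bias=False)
    total = n_experts * expert_params + gate_params          # grows with n
    active_per_token = top_k * expert_params + gate_params   # grows with K
    return total, active_per_token

# Mixtral-8x7B sizes: dim=4096, hidden_dim=14336, n=8 experts, K=2
total, active = moe_ffn_params(4096, 14336, 8, 2)   # ~1.41B total vs ~0.35B active for this layer

Summed over Mixtral's 32 layers and adding the shared attention and embedding parameters, this roughly lines up with the 47B total / 13B active figures above.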

The MoE layer replaces the FFN layer in the transformer model. Mixtral uses SwiGLU as the expert function $E_i(x)$ and sets $K = 2$, i.e. each token is routed to 2 SwiGLU sub-blocks. The output for an input token $x$:

$$y = \sum_{i=0}^{n-1} \mathrm{Softmax}(\mathrm{TopK}(x \cdot W_g))_i \cdot \mathrm{SwiGLU}_i(x)$$

Implementation of the MoE layer in the Mistral codebase:

from typing import List

import torch
import torch.nn.functional as F
from torch import nn


class MoeLayer(nn.Module):
    def __init__(self, experts: List[nn.Module], gate: nn.Module, moe_args: MoeArgs):
        super().__init__()
        assert len(experts) > 0
        self.experts = nn.ModuleList(experts)
        self.gate = gate
        self.args = moe_args

    def forward(self, inputs: torch.Tensor):
        # Router logits per token: (num_tokens, num_experts)
        gate_logits = self.gate(inputs)
        # Keep the top-K experts per token, then renormalize their scores with a softmax
        weights, selected_experts = torch.topk(gate_logits, self.args.num_experts_per_tok)
        weights = F.softmax(weights, dim=1, dtype=torch.float).to(inputs.dtype)
        results = torch.zeros_like(inputs)
        # Each expert processes only the tokens routed to it; outputs are summed with their gate weights
        for i, expert in enumerate(self.experts):
            batch_idx, nth_expert = torch.where(selected_experts == i)
            results[batch_idx] += weights[batch_idx, nth_expert, None] * expert(
                inputs[batch_idx]
            )
        return results
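
A quick smoke test of the layer above. The MoeArgs definition below is a stand-in inferred from how self.args is used (the real class lives in the Mistral codebase), and the linear experts are toy placeholders for the SwiGLU blocks:

from dataclasses import dataclass

@dataclass
class MoeArgs:                      # stand-in; mirrors the fields the layer accesses
    num_experts: int
    num_experts_per_tok: int

dim, n_experts = 16, 8
moe = MoeLayer(
    experts=[nn.Linear(dim, dim, bias=False) for _ in range(n_experts)],  # toy experts
    gate=nn.Linear(dim, n_experts, bias=False),
    moe_args=MoeArgs(num_experts=n_experts, num_experts_per_tok=2),
)
tokens = torch.randn(5, dim)        # 5 tokens, already flattened to (num_tokens, dim)
out = moe(tokens)                   # same shape: (5, 16); each token mixes its top-2 experts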

And it replaces the FFN in the transformer block:

class TransformerBlock(nn.Module):
    def __init__(self, args: ModelArgs):
        ...
        # Use a sparse MoE layer in place of the dense FFN when MoE is configured
        if args.moe is not None:
            self.feed_forward = MoeLayer(
                experts=[FeedForward(args=args) for _ in range(args.moe.num_experts)],
                gate=nn.Linear(args.dim, args.moe.num_experts, bias=False),
                moe_args=args.moe,
            )
        else:
            self.feed_forward = FeedForward(args=args)
        ...
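
For reference, the FeedForward module used as each expert is a SwiGLU block; a minimal sketch in the spirit of the Mistral reference implementation (the w1/w2/w3 names and the dim/hidden_dim fields of ModelArgs are assumptions):

class FeedForward(nn.Module):
    def __init__(self, args: ModelArgs):
        super().__init__()
        self.w1 = nn.Linear(args.dim, args.hidden_dim, bias=False)  # gate projection
        self.w2 = nn.Linear(args.hidden_dim, args.dim, bias=False)  # down projection
        self.w3 = nn.Linear(args.dim, args.hidden_dim, bias=False)  # up projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: silu(x @ W1) * (x @ W3), then project back down with W2
        return self.w2(F.silu(self.w1(x)) * self.w3(x))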

deployment

  • MegaBlocks: specialized kernels that cast the FFN ops of the MoE layer as large sparse matrix multiplications, naturally handling experts that receive different numbers of tokens.
  • Model Parallelism / Expert Parallelism (EP): MoE layers can be distributed across multiple GPUs; EP places whole experts on different devices and needs load balancing to avoid overloading some GPUs.
  • Leveraging routing locality: the proportion of consecutive tokens assigned to the same experts is significantly higher than random, especially in higher layers, which can be exploited for expert caching and speculative loading when offloading (see the sketch after this list).
    • “Fast inference of mixture-of-experts language models with offloading.”
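
A hypothetical helper (not from the paper or the codebase) showing one simple way to quantify that repetition rate from the router's selections, e.g. the selected_experts tensor produced inside MoeLayer.forward for one sequence at one layer:

def consecutive_assignment_rate(selected_experts: torch.Tensor) -> float:
    # selected_experts: (seq_len, K) expert indices for one sequence at one layer.
    prev, curr = selected_experts[:-1], selected_experts[1:]
    # A position counts as "repeated" if any of its K experts was also used by the previous token.
    repeated = (curr.unsqueeze(-1) == prev.unsqueeze(-2)).any(dim=-1).any(dim=-1)
    return repeated.float().mean().item()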