Zhengyan Zhang${}^{1,2}$, Yankai Lin${}^{3}$, Zhiyuan Liu${}^{1,2,4,5\dagger}$,
Peng Li${}^{3,6}$, Maosong Sun${}^{1,2,4,5,7\dagger}$, Jie Zhou${}^{3}$
${}^{1}$Dept. of Comp. Sci. & Tech., Institute for AI, Tsinghua University, Beijing, China
${}^{2}$Beijing National Research Center for Information Science and Technology
${}^{3}$Pattern Recognition Center, WeChat AI, Tencent Inc.
${}^{4}$International Innovation Center of Tsinghua University, Shanghai, China
${}^{5}$Beijing Academy of Artificial Intelligence
${}^{6}$Institute for AI Industry Research (AIR), Tsinghua University, China
${}^{7}$Jiangsu Collaborative Innovation Center for Language Ability, Xuzhou, China
[email protected] {liuzy,sms}@tsinghua.edu.cn
${}^{\dagger}$ Corresponding authors
Part of the work was done while Peng Li was working at Tencent.
Section Summary: Researchers have found that certain components of advanced AI language models, called feed-forward networks, hold much of the models' knowledge about language and facts, but how they work internally remains mysterious. A new study reveals that these networks rarely use all of their parts for most inputs, much like the efficient, sparse wiring of the human brain, prompting an experiment to divide them into specialized "experts" and add routers that select only the needed ones, a process dubbed MoEfication. This approach keeps a model's performance nearly intact while using just 10 to 30 percent of the feed-forward parameters, roughly doubling inference speed and offering new ways to understand the model's inner workings.
Recent work has shown that feed-forward networks (FFNs) in pre-trained Transformers are a key component, storing various linguistic and factual knowledge. However, the computational patterns of FFNs are still unclear. In this work, we study the computational patterns of FFNs and observe that most inputs only activate a tiny fraction of the neurons in FFNs. This phenomenon is similar to the sparsity of the human brain, which drives research on the functional partitions of the human brain. To verify whether functional partitions also emerge in FFNs, we propose to convert a model into its MoE version with the same parameters, namely MoEfication. Specifically, MoEfication consists of two phases: (1) splitting the parameters of FFNs into multiple functional partitions as experts, and (2) building expert routers to decide which experts will be used for each input. Experimental results show that MoEfication can conditionally use only 10% to 30% of FFN parameters while maintaining over 95% of the original performance for different models on various downstream tasks. Besides, MoEfication brings two advantages: (1) it significantly reduces the FLOPs of inference, i.e., a 2x speedup with 25% of FFN parameters, and (2) it provides a fine-grained perspective from which to study the inner mechanism of FFNs. The source code of this paper can be obtained from https://github.com/thunlp/MoEfication.
Executive Summary: Researchers at Tsinghua University and Tencent have long noted the success of Transformer models in processing natural language, powering tools like chatbots and translation systems. These models rely heavily on feed-forward networks (FFNs), which make up most of the model's parameters and store key knowledge about language and facts. However, little is known about how FFNs actually compute results, creating a gap in understanding that limits improvements in model efficiency and interpretability. This matters now as these models grow larger and more resource-intensive, straining computing power and costs for real-world applications like search engines and virtual assistants.
The document aims to explore whether FFNs exhibit sparse activation patterns—where only a small fraction of neurons fire for any given input—and if this leads to natural functional partitions, much like specialized regions in the human brain. It tests converting standard Transformer models into equivalent Mixture-of-Experts (MoE) versions, where FFNs act as a collection of specialized sub-networks that activate selectively.
To do this, the authors developed a process called MoEfication, applied to established models like T5 and BERT. In the first step, they analyzed activation data from fine-tuned models to group neurons that fire together into separate "experts," effectively splitting the FFN parameters without retraining. In the second step, they built simple routing mechanisms to select the most relevant experts for each input during use. They drew on data from thousands of examples in standard language benchmarks, covering tasks like question answering and text classification, over several months of computation on typical hardware.
The main findings confirm sparse activation: for instance, in a 700-million-parameter model, 90% of inputs activated fewer than 5% of FFN neurons. MoEfication successfully partitioned these neurons into experts, allowing the model to use just 10% to 30% of the original FFN parameters while retaining over 95% of performance on benchmarks like GLUE (for general language understanding) and question-answering tasks. With 25% of parameters active, inference speed doubled on standard processors (CPUs) and increased by 20% on graphics cards (GPUs). This setup also revealed routing patterns, showing how inputs consistently favored certain experts for specific linguistic functions.
These results imply that Transformers inherently learn efficient, brain-like divisions in their FFNs, enabling major efficiency gains without losing capability. For organizations deploying these models, this could cut energy use and inference time in half, reducing costs and environmental impact while maintaining accuracy—unlike prior work that assumed uniform activation across all parameters. It challenges the need for full model activation, opening doors to faster, greener AI systems and deeper insights into model behavior for safer, more reliable applications.
Based on these outcomes, leaders should prioritize MoEfication or similar techniques in new Transformer deployments to achieve quick speedups, starting with pilot integrations in high-volume tasks like search or summarization. Further, teams could analyze routing patterns from MoEfied models to refine future architectures. If scaling to very large models, options include balancing expert count for 1.5x to 2x speed gains versus minimal performance trade-offs, weighing hardware specifics.
Key limitations include reliance on models with ReLU activation (a common but not universal choice) and testing only on English-language benchmarks, so results may vary in multilingual or domain-specific settings. Confidence in the core findings is high, backed by consistent performance across multiple tasks and models, though caution is advised for untested scenarios like real-time interactive systems where edge cases might arise. Additional validation on diverse datasets would strengthen applicability.
Section Summary: Recent years have seen huge advances in Transformer-based language models, but while researchers have studied their attention mechanisms extensively, they've largely overlooked the feed-forward networks, which make up most of the model's parameters and act like knowledge-storing memory. This paper examines these networks and discovers that they show sparse activation, where only a small percentage of neurons light up for any given input, much like the efficient sparsity in the human brain, raising questions about whether specialized functional groups form within them. To explore this, the authors propose "MoEfication," a method to transform Transformers into mixtures of experts by splitting and routing the networks, which experiments show can maintain over 95% performance using just 10-30% of the original parameters, while speeding up processing and enabling deeper insights into how these models work.
Recent years have witnessed the great success of Transformer-based pre-trained language models (PLMs) (Devlin et al., 2019; Brown et al., 2020; Han et al., 2021), attracting many efforts to interpret the inner mechanisms of Transformers (Manning et al., 2020; Kovaleva et al., 2019). However, most of these works focus on the attention mechanism but ignore the feed-forward networks (FFNs), which constitute nearly two-thirds of the model parameters. Although recent work has shown that FFNs can be viewed as memory networks storing large amounts of knowledge (Geva et al., 2021; Dai et al., 2021), the computational patterns of FFNs are still unclear.
In this work, we study the activation patterns of FFNs in Transformer models and find a phenomenon of sparse activation, i.e., only a tiny fraction of neurons are activated for a single input. For example, when we perform inference on a fine-tuned T5-Large model (Raffel et al., 2020) with 700 million parameters, 90% of inputs activate less than 5% of the neurons[^1]. This phenomenon is similar to the sparsity in the human brain (Olshausen and Field, 1996; Gross, 2002), which drives research on the functional partitions of the human brain (Garey, 1999). Inspired by this observation, we raise a further question: do functional partitions also emerge in artificial neural models, i.e., the FFNs of pre-trained Transformers?
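The sparsity measurement above can be illustrated with a minimal numpy sketch: count the fraction of FFN neurons whose ReLU pre-activation is positive, per input token. Note this toy uses random weights (where roughly half the neurons fire); the paper's sparsity emerges only in trained checkpoints, and `activation_ratio` is an illustrative name, not the authors' code.

```python
import numpy as np

def activation_ratio(hidden, W_in, b_in):
    """Fraction of FFN neurons with positive pre-activation (ReLU > 0) per token."""
    pre_act = hidden @ W_in + b_in          # (num_tokens, d_ff)
    return (pre_act > 0).mean(axis=-1)      # per-token activation ratio

# Toy example with random weights; a real study would feed hidden states
# from a fine-tuned T5/BERT checkpoint through its actual FFN weights.
rng = np.random.default_rng(0)
d_model, d_ff, num_tokens = 16, 64, 100
hidden = rng.standard_normal((num_tokens, d_model))
W_in = rng.standard_normal((d_model, d_ff))
b_in = rng.standard_normal(d_ff)

ratios = activation_ratio(hidden, W_in, b_in)
# With random Gaussian weights this mean is near 0.5 (not sparse);
# the reported <5% ratios come from trained models.
print(ratios.mean())
```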
To investigate this problem, we explore whether a Transformer can be converted into an equivalent Mixture-of-Experts (MoE) model (Bengio, 2013), which regards different functional partitions in FFNs as different experts that are conditionally activated. Specifically, we propose MoEfication to discover the functional partitions (experts) in FFNs and build routers for selecting experts. It consists of two phases. (1) Expert Construction: split a whole feed-forward layer into multiple experts. The goal is to group neurons that are often activated simultaneously into the same expert network. (2) Expert Selection: for each input, select the experts that contain as many activated neurons as possible to approximate the original results.
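The two phases can be sketched as follows, assuming recorded binary activation data. This is a simplified stand-in, not the authors' implementation: it clusters neurons by their co-activation profiles with a few hand-rolled k-means iterations (the paper explores several splitting strategies), and the function names `split_into_experts` and `select_experts` are illustrative.

```python
import numpy as np

def split_into_experts(activations, num_experts, seed=0):
    """Phase 1 (Expert Construction): group neurons that often fire together.

    activations: (num_tokens, d_ff) binary matrix of observed activations.
    Returns an expert label per neuron, via k-means on activation profiles.
    """
    rng = np.random.default_rng(seed)
    profiles = activations.T.astype(float)   # (d_ff, num_tokens), one row per neuron
    centers = profiles[rng.choice(len(profiles), num_experts, replace=False)]
    for _ in range(10):                      # a few Lloyd iterations suffice here
        dists = ((profiles[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = dists.argmin(1)
        for k in range(num_experts):
            if (labels == k).any():
                centers[k] = profiles[labels == k].mean(0)
    return labels

def select_experts(pre_act, labels, num_experts, top_k):
    """Phase 2 (Expert Selection): pick experts covering the most activated neurons."""
    scores = np.bincount(labels[pre_act > 0], minlength=num_experts)
    return np.argsort(scores)[::-1][:top_k]

# Toy usage on synthetic sparse activations.
rng = np.random.default_rng(1)
acts = rng.standard_normal((200, 32)) > 1.0     # ~16% of neurons active per token
labels = split_into_experts(acts, num_experts=4)
chosen = select_experts(rng.standard_normal(32), labels, num_experts=4, top_k=2)
```

In practice the paper builds learned routers that predict expert scores directly from the input hidden state, so that expert selection does not require computing the full FFN first.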
In the experiments, we evaluate MoEfication on two typical kinds of downstream tasks, the GLUE and QA benchmarks (Wang et al., 2019; Rajpurkar et al., 2016; Lai et al., 2017), using T5 and BERT (Raffel et al., 2020; Devlin et al., 2019). Experimental results verify that the FFNs in Transformers can be converted into mixtures of experts, allowing us to use only 10% to 30% of FFN parameters while maintaining over 95% of the original performance, which verifies that pre-trained Transformers also learn functional partitions in their FFNs. Besides, MoEfication brings two advantages: (1) it can significantly speed up the inference of Transformers, as using 25% of FFN parameters brings a 2x speedup on CPU and a 1.2x speedup on GPU; (2) we can study MoEfied models to interpret the inner mechanism of FFNs at a fine-grained level. In this work, we study their routing patterns and hope these findings can help future work on the design and training of MoE models.

[^1]: T5 uses ReLU as the activation function. We treat the neurons having positive outputs as activated neurons.
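The source of the speedup is that a MoEfied layer computes only the FFN columns belonging to the selected experts. The numpy sketch below (illustrative names, not the released code) shows the key property: whenever the skipped experts' neurons would have been inactive anyway, the sparse computation matches the dense one exactly, while touching only a fraction of the weights. Here we force that condition with large negative biases for demonstration.

```python
import numpy as np

def ffn_dense(x, W_in, b_in, W_out):
    """Standard FFN: ReLU(x W_in + b_in) W_out over all d_ff neurons."""
    return np.maximum(x @ W_in + b_in, 0.0) @ W_out

def ffn_moefied(x, W_in, b_in, W_out, neuron_expert, selected):
    """Compute only the neurons of the selected experts, cutting FFN
    FLOPs roughly in proportion to the number of experts skipped."""
    mask = np.isin(neuron_expert, selected)
    h = np.maximum(x @ W_in[:, mask] + b_in[mask], 0.0)
    return h @ W_out[mask]

rng = np.random.default_rng(0)
d_model, d_ff = 8, 16
W_in = rng.standard_normal((d_model, d_ff))
b_in = rng.standard_normal(d_ff)
W_out = rng.standard_normal((d_ff, d_model))
neuron_expert = np.repeat(np.arange(4), 4)   # 4 experts, 4 neurons each

# Force experts 2 and 3 to be inactive, so selecting {0, 1} is lossless here.
b_in[neuron_expert >= 2] = -100.0

x = rng.standard_normal(d_model)
y_dense = ffn_dense(x, W_in, b_in, W_out)
y_moe = ffn_moefied(x, W_in, b_in, W_out, neuron_expert, [0, 1])
print(np.allclose(y_dense, y_moe))  # True: the skipped neurons were inactive
```

On real inputs the router's selection is only approximately lossless, which is why the paper reports retaining over 95% (rather than 100%) of the original performance.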