The Mixture-of-Experts (MoE) technique has gained popularity as a way to scale large language models (LLMs) while keeping computational costs manageable. However, current MoE architectures are limited to a relatively small number of experts. The classic transformer architecture behind LLMs alternates attention layers with feedforward (FFW) layers, and the FFW layers account for a large share of the model’s parameters. This creates a bottleneck for scaling transformers, because a dense FFW layer’s computational footprint grows in direct proportion to its size.
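
As a rough illustration of that scaling (the dimensions below are made up for the example, not figures from the paper), a dense FFW block consists of two weight matrices, so both its parameter count and its per-token compute grow linearly with the hidden width:

```python
# Back-of-the-envelope sketch with hypothetical transformer dimensions.
d_model, d_ff = 4096, 16384            # model width and FFW hidden width (illustrative)
ffw_params = 2 * d_model * d_ff        # up-projection + down-projection weights
ffw_flops_per_token = 2 * ffw_params   # roughly 2 FLOPs per weight per token
print(f"params: {ffw_params / 1e6:.0f}M, FLOPs/token: {ffw_flops_per_token / 1e6:.0f}M")
```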

In a new paper, Google DeepMind introduces Parameter Efficient Expert Retrieval (PEER), a groundbreaking architecture that addresses the challenges of scaling MoE models to millions of experts. PEER replaces the fixed router used in traditional MoE architectures with a learned index, enabling efficient routing of input data to a vast pool of experts. By first running a fast initial computation to shortlist candidate experts and then selecting and activating only the top ones, PEER can handle a large number of experts without compromising speed.
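
This shortlist-then-select idea builds on product-key retrieval. The sketch below is a minimal, non-authoritative illustration of that technique (the function name, tensor shapes, and sizes are our own choices, not the paper’s): by scoring two small sub-key tables and combining their shortlists, a router can address on the order of a million experts without scoring each one individually.

```python
import torch

def product_key_topk(query, sub_keys_1, sub_keys_2, k):
    """Score two sets of n sub-keys instead of all n*n experts, then combine
    the per-half shortlists, so search cost is O(n + k*k) rather than O(n*n)."""
    q1, q2 = query.chunk(2, dim=-1)                 # split the query in half
    s1 = q1 @ sub_keys_1.T                          # (n,) scores for the first half
    s2 = q2 @ sub_keys_2.T                          # (n,) scores for the second half
    top1_val, top1_idx = s1.topk(k)                 # shortlist from each half
    top2_val, top2_idx = s2.topk(k)
    # Cartesian product of the two shortlists -> k*k candidate experts
    cand_scores = top1_val[:, None] + top2_val[None, :]
    cand_ids = top1_idx[:, None] * sub_keys_2.shape[0] + top2_idx[None, :]
    best = cand_scores.flatten().topk(k).indices
    return cand_ids.flatten()[best], cand_scores.flatten()[best]

# Toy usage: 1024 sub-keys per half address 1024 * 1024 ≈ 1M experts.
n, d, k = 1024, 256, 16
query = torch.randn(d)
sub_keys_1 = torch.randn(n, d // 2)
sub_keys_2 = torch.randn(n, d // 2)
expert_ids, expert_scores = product_key_topk(query, sub_keys_1, sub_keys_2, k)
```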

Increasing the granularity of an MoE model, that is, the number of experts it contains, has been shown to yield performance gains. High-granularity MoE models can also learn new knowledge more efficiently and adapt better to continuous data streams. PEER’s design uses tiny experts with a single neuron in the hidden layer, which allows hidden neurons to be shared among experts for better knowledge transfer and parameter efficiency.
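
A hedged sketch of what such single-neuron experts could look like in code (the sizes, variable names, and the random stand-ins for retrieved indices and router scores are ours, not the paper’s): each expert is one down-projection row and one up-projection row, so activating k of them assembles a small k-neuron MLP for each token.

```python
import torch
import torch.nn.functional as F

# Illustrative sizes: each expert i is a single hidden neuron, i.e. a
# down-projection row u_i and an up-projection row v_i.
num_experts, d_model, k = 65_536, 256, 16
down = torch.randn(num_experts, d_model) * 0.02          # u_i: d_model -> 1
up = torch.randn(num_experts, d_model) * 0.02            # v_i: 1 -> d_model

x = torch.randn(d_model)                                 # one token's hidden state
idx = torch.randint(0, num_experts, (k,))                # stand-in for retrieved expert ids
scores = torch.randn(k)                                  # stand-in for router scores

hidden = F.gelu(down[idx] @ x)                           # (k,) one activation per expert
output = (F.softmax(scores, dim=-1) * hidden) @ up[idx]  # (d_model,) weighted combination
```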

PEER uses a multi-head retrieval approach, similar to the multi-head attention mechanism in transformer models, to route input tokens to experts efficiently. A PEER layer can be added to an existing transformer model or used to replace an FFW layer, contributing to a more streamlined and efficient model. Because only a small number of tiny experts are activated per token, PEER reduces the number of active parameters in the MoE layer, which lowers computation and activation memory consumption during pre-training and inference.
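
To make the analogy concrete, here is a minimal, non-authoritative sketch of a multi-head expert-retrieval layer standing in where an FFW block would sit. For brevity it scores every expert key directly rather than using product keys, and all class and parameter names are illustrative rather than taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyPEERLayer(nn.Module):
    """Sketch of a multi-head retrieval layer: each head picks k single-neuron
    experts from a shared pool and the head outputs are summed, loosely
    analogous to multi-head attention."""
    def __init__(self, d_model=256, num_experts=4096, heads=4, k=8):
        super().__init__()
        self.heads, self.k = heads, k
        self.query = nn.Linear(d_model, heads * d_model)          # one query per head
        self.keys = nn.Parameter(torch.randn(num_experts, d_model) * 0.02)
        self.down = nn.Embedding(num_experts, d_model)            # u_i rows
        self.up = nn.Embedding(num_experts, d_model)              # v_i rows

    def forward(self, x):                                         # x: (batch, d_model)
        out = torch.zeros_like(x)
        q = self.query(x).view(x.shape[0], self.heads, -1)
        for h in range(self.heads):
            scores = q[:, h] @ self.keys.T                        # (batch, num_experts)
            top_s, top_i = scores.topk(self.k, dim=-1)            # (batch, k)
            hid = F.gelu(torch.einsum('bkd,bd->bk', self.down(top_i), x))
            gate = F.softmax(top_s, dim=-1) * hid
            out = out + torch.einsum('bk,bkd->bd', gate, self.up(top_i))
        return out

# Usage: slot the layer in where a transformer block's FFW would go (shapes illustrative).
layer = TinyPEERLayer()
tokens = torch.randn(32, 256)
y = layer(tokens)                                                 # (32, 256)
```

Only the k selected rows of the expert tables are touched per head and token, which is where the reduction in active parameters comes from.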

The researchers evaluated PEER against transformer models with dense feedforward layers and against other MoE architectures. PEER models achieved a better performance-compute tradeoff, reaching lower perplexity than the baselines at the same computational budget. Increasing the number of experts in a PEER model reduced perplexity further, underscoring the scalability and efficiency of the architecture.

The development of Parameter Efficient Expert Retrieval (PEER) marks a significant advance for large language models. By making Mixture-of-Experts (MoE) architectures far more scalable and efficient, PEER offers a compelling alternative to the dense feedforward layers in transformer models. The findings challenge the assumption that MoE models are practical only with a small number of experts, showing that routing can scale to millions of them. With its innovative design and superior performance-compute tradeoff, PEER could substantially reduce the cost and complexity of training and serving very large language models.
