Zhihu Hot List • 2024-04-07 16:10
Answer by 何宜晖

Google DeepMind introduced Mixture-of-Depths, a method that dynamically allocates compute across different inputs.

1/ Mixture-of-Depths (MoD) is a novel approach that lets individual tokens skip entire transformer blocks (attention + MLP) via the residual connection, rather than being dropped from the sequence entirely. In spirit it adapts the classic stochastic-depth idea to language models, but with learned per-token routing instead of random layer dropping.
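
To make the mechanism concrete, here is a rough PyTorch sketch of how such a block could look (my own illustration under simple assumptions, not the paper's code): a tiny linear router scores every token, only the top-k tokens per sequence are processed by the block, and the rest ride the residual stream untouched. Names like `MoDBlock` and `capacity` are placeholders.

```python
import torch
import torch.nn as nn

class MoDBlock(nn.Module):
    """Wraps any transformer block; only a fraction of tokens pass through it."""

    def __init__(self, d_model: int, block: nn.Module, capacity: float = 0.125):
        super().__init__()
        self.block = block            # e.g. an attention + MLP block mapping (b, k, d) -> (b, k, d)
        self.router = nn.Linear(d_model, 1)
        self.capacity = capacity      # fraction of tokens processed at this layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, s, d = x.shape
        k = max(1, int(self.capacity * s))
        scores = self.router(x).squeeze(-1)                  # (b, s) router score per token
        topk = scores.topk(k, dim=-1).indices                # (b, k) tokens this layer "chooses"
        batch_idx = torch.arange(b, device=x.device).unsqueeze(-1)
        selected = x[batch_idx, topk]                        # (b, k, d)
        gate = torch.sigmoid(scores[batch_idx, topk]).unsqueeze(-1)
        processed = selected + gate * self.block(selected)   # gated residual update
        out = x.clone()
        out[batch_idx, topk] = processed                     # unselected tokens skip the block entirely
        return out

# Hypothetical usage with a plain MLP as the wrapped block:
# layer = MoDBlock(512, nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512)))
```

Multiplying the block output by a function of the router score is what keeps the router on the gradient path; the sigmoid gate above is one reasonable choice, not necessarily the paper's exact formulation.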

2/ The paper takes the clever "expert choice" routing mechanism from the MoE literature and adapts it for causal language modeling. This enables smarter allocation of compute across tokens and layers during autoregressive generation. ⚙️
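
The subtle part is that top-k over a whole sequence is non-causal: at decode time you cannot see the scores of future tokens. As I read the paper, one proposed fix is a small auxiliary predictor trained to guess whether a token would land in the top-k, so the routing decision becomes per-token at inference. A minimal sketch of that idea, with all names illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def aux_router_loss(predictor: nn.Module, x: torch.Tensor, topk_idx: torch.Tensor) -> torch.Tensor:
    """predictor maps (b, s, d) -> (b, s, 1) logits; topk_idx is the (b, k) set chosen during training."""
    b, s, _ = x.shape
    labels = torch.zeros(b, s, device=x.device)
    labels.scatter_(1, topk_idx, 1.0)                 # 1 if the token was selected at this layer
    logits = predictor(x).squeeze(-1)                 # per-token decision, no look-ahead needed
    return F.binary_cross_entropy_with_logits(logits, labels)

# At inference a token goes through the block whenever predictor(token) > 0,
# which keeps autoregressive decoding strictly causal.
```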

3/ Combining Mixture-of-Depths with Mixture-of-Experts (MoE) into a "MoDE" model yields even better results, showing the synergy between these two techniques.
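
Loosely, one way to picture a MoDE layer is MoD routing wrapped around an MoE feed-forward: tokens selected at this depth are then dispatched to experts. The toy top-1 MoE below is purely illustrative (dense compute, sparse selection) and reuses the `MoDBlock` sketch above.

```python
import torch
import torch.nn as nn

class ToyMoE(nn.Module):
    def __init__(self, d_model: int, n_experts: int = 4):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_experts)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        probs = self.gate(x).softmax(dim=-1)                 # (b, s, n_experts)
        top = probs.argmax(dim=-1)                           # (b, s) expert id per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = (top == i).unsqueeze(-1)                  # which tokens chose expert i
            out = out + mask * expert(x)
        weight = probs.max(dim=-1, keepdim=True).values      # gate weight of the chosen expert
        return weight * out

# Hypothetical MoDE layer: depth routing outside, expert routing inside.
mode_layer = MoDBlock(d_model=512, block=ToyMoE(512), capacity=0.125)
```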

4/ Scaling law analysis suggests MoD's benefits increase with model size, though experiments only went up to 3B params. Excited to see if this holds true for even larger language models!

Overall, Mixture-of-Depths offers an efficient and dynamic way to allocate compute in transformers, paving the way for more flexible and scalable language models. Definitely a paper worth reading!