Abstract
Recent advances in attention-free sequence models rely on convolutions as alternatives to the attention operator at the core of Transformers. In particular, long convolution sequence models have achieved state-of-the-art performance in many domains, but incur a significant cost during auto-regressive inference workloads: like attention-based models, they naively require a full pass over (or caching of activations for) the input sequence for each generated token. In this paper, we seek to enable O(1) compute and memory cost per token in any pre-trained long convolution architecture, reducing the memory footprint and increasing throughput during generation. Concretely, our method consists of extracting low-dimensional linear state-space models from each convolution layer, building on rational interpolation and model-order reduction techniques. We further introduce architectural improvements to convolution-based layers such as Hyena: by weight-tying the filters across channels into heads, we achieve higher pre-training quality and reduce the number of filters to be distilled. The resulting model achieves 10× higher throughput than Transformers and 1.5× higher than Hyena at 1.3B parameters, without any loss in quality after distillation.
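The distillation idea described above is to replace each long convolution filter with a small linear state-space recurrence whose impulse response matches the filter, so generation advances a fixed-size state instead of re-reading the full sequence. The sketch below is a minimal illustration of that idea and not the paper's actual pipeline: it uses a Ho-Kalman-style Hankel factorization (a classic model-order-reduction technique) as a stand-in for the rational interpolation and reduction methods the paper builds on, and the function names, state dimension, and example filter are hypothetical.

```python
import numpy as np

def distill_filter_to_ssm(h, d):
    """Fit a d-dimensional linear SSM (A, B, C, D) whose impulse response
    approximates the long convolution filter h[0..L-1].

    Simplified stand-in: Ho-Kalman factorization of the Hankel matrix of
    Markov parameters, not the rational-interpolation pipeline of the paper.
    """
    D = h[0]                     # feedthrough term h[0]
    markov = h[1:]               # Markov parameters h[k] = C A^{k-1} B, k >= 1
    n = len(markov) // 2
    H = np.array([[markov[i + j] for j in range(n)] for i in range(n)])
    U, S, Vt = np.linalg.svd(H)
    sqrt_S = np.diag(np.sqrt(S[:d]))
    O = U[:, :d] @ sqrt_S        # observability factor [C; CA; CA^2; ...]
    R = sqrt_S @ Vt[:d, :]       # controllability factor [B, AB, A^2 B, ...]
    C = O[0, :]
    B = R[:, 0]
    A = np.linalg.pinv(O[:-1, :]) @ O[1:, :]   # shift-invariance of O
    return A, B, C, D

def recurrent_generate(A, B, C, D, inputs):
    """Auto-regressive filtering with O(d^2) time and O(d) memory per token,
    independent of the sequence length."""
    s = np.zeros(A.shape[0])
    outputs = []
    for u in inputs:
        outputs.append(C @ s + D * u)   # y_t = C s_t + D u_t
        s = A @ s + B * u               # s_{t+1} = A s_t + B u_t
    return np.array(outputs)

# Hypothetical example: a sum of damped oscillations (true order 8),
# distilled to an 8-dimensional state-space recurrence.
L, d = 1024, 8
t = np.arange(L)
h = sum(np.exp(-a * t) * np.cos(w * t)
        for a, w in [(0.01, 0.1), (0.02, 0.3), (0.005, 0.05), (0.03, 0.7)])
A, B, C, D = distill_filter_to_ssm(h, d)

u = np.random.randn(L)
y_conv = np.convolve(u, h)[:L]              # reference: full causal convolution
y_rec = recurrent_generate(A, B, C, D, u)   # constant cost per token
print(np.max(np.abs(y_conv - y_rec)))       # small residual if the fit is good
```

In a pre-trained model, such a fit would presumably be performed once per filter (or per head, given the weight-tying described above), with the resulting recurrences used in place of the full convolutions only at generation time.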
| Original language | English |
|---|---|
| Title of host publication | Advances in Neural Information Processing Systems 36 (NeurIPS 2023) |
| Subtitle of host publication | [Proceedings] |
| Editors | A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, S. Levine |
| Publisher | NeurIPS |
| Pages | 1-45 |
| Number of pages | 45 |
| ISBN (Electronic) | 9781713899921 |
| Publication status | Published - 2023 |
| Event | 37th Conference on Neural Information Processing Systems, NeurIPS 2023, New Orleans, United States, 10 Dec 2023 → 16 Dec 2023 |
Conference
| Conference | 37th Conference on Neural Information Processing Systems, NeurIPS 2023 |
|---|---|
| Country/Territory | United States |
| City | New Orleans |
| Period | 10/12/23 → 16/12/23 |
Bibliographical note
Publisher Copyright: © 2023 Neural Information Processing Systems Foundation. All rights reserved.