Publications
Evaluating Deep Learning Recommendation Model Training Scalability with the Dynamic Opera Network
Abstract
Deep learning is commonly used to make personalized recommendations to users for a wide variety of activities. However, deep learning recommendation model (DLRM) training is increasingly dominated by all-to-all and many-to-many communication patterns. While a wide variety of algorithms exist to efficiently overlap communication and computation for many collective operations, these patterns remain strictly limited by network bottlenecks. We propose co-designing DLRM training with the recently proposed Opera network, which is designed to avoid multiple network hops using time-varying source-to-destination circuits. Using measurements from state-of-the-art NVIDIA A100 GPUs, we simulate DLRM training on networks ranging from 16 to 1024 nodes and demonstrate up to 1.79× improvement using Opera compared with equivalent fat-tree networks. We identify important parameters affecting …
- Date: April 22, 2024
- Authors: Connor Imes, Andrew Rittenbach, Peng Xie, Dong In D Kang, John Paul Walters, Stephen P Crago
- Book: Proceedings of the 4th Workshop on Machine Learning and Systems
- Pages: 169–175