Publications

Evaluating Deep Learning Recommendation Model Training Scalability with the Dynamic Opera Network

Abstract

Deep learning is commonly used to make personalized recommendations to users for a wide variety of activities. However, deep learning recommendation model (DLRM) training is increasingly dominated by all-to-all and many-to-many communication patterns. While a wide variety of algorithms exist to efficiently overlap communication and computation for many collective operations, these patterns remain strictly limited by network bottlenecks. We propose co-designing DLRM training with the recently proposed Opera network, which is designed to avoid multiple network hops using time-varying source-to-destination circuits. Using measurements from state-of-the-art NVIDIA A100 GPUs, we simulate DLRM training on networks ranging from 16 to 1024 nodes and demonstrate up to a 1.79× improvement using Opera compared with equivalent fat-tree networks. We identify important parameters affecting …

Date
April 22, 2024
Authors
Connor Imes, Andrew Rittenbach, Peng Xie, Dong In D Kang, John Paul Walters, Stephen P Crago
Book
Proceedings of the 4th Workshop on Machine Learning and Systems
Pages
169–175