Distributed matrix computations — matrix-matrix or matrix-vector multiplications — are well-recognized to suffer from the problem of stragglers (slow or failed worker nodes). A number of algorithms have been developed to mitigate the straggler issue, and several of them, based on MDS codes, are optimal in terms of straggler resilience. However they suffer from numerical problems, i.e., there is a blow-up of round-off errors in the decoded result owing to the high condition numbers of the corresponding decoding matrices. We develop a convolutional code-based approach to solve this problem that removes these limitations. It has excellent numerical robustness, and can be decoded using a fast peeling decoder that only involves add/subtract operations. Its numerical robustness can be theoretically quantified by deriving a computable upper bound on the worst case condition number over all possible decoding matrices by drawing connections with the properties of large block Toeplitz matrices. Moreover, Several practical matrix computation scenarios involve sparse matrices. MDS codes typically require dense linear combinations of submatrices of the original matrices which destroy the inherent sparsity of those input matrices. This is problematic as it results in significantly higher worker computation times. We present schemes by imposing constraints on the level of coding that is required in generating the encoded submatrices. This significantly reduces the worker computation time as compared to previous approaches and also allows us to leverage partial computation by stragglers. Exhaustive numerical experiments on Amazon Web Services (AWS) clusters support our findings.
Anindya Bijoy Das is a PhD candidate at the department of Electrical and Computer Engineering in Iowa State University, Ames, Iowa. He got his M. Eng. in 2018 from Iowa State University, and since then he has been working under the supervision of Professor Dr. Aditya Ramamoorthy on straggler mitigation in distributed computations. Specifically he has developed different novel schemes based on coding theory which can improve the numerical stability and enhance the speed of distributed computation in presence of stragglers. These methods can be directly incorporated into the framework of deep learning or optimization methods. His minor is Mathematics, and he utilizes several ideas from mathematics to prove the necessary theorems in his research. Before starting his graduate study in 2016, he completed his BSc in Electrical and Electronic Engineering from Bangladesh University of Engineering and Technology in 2014. He worked on automatic detection of epilepsy and epileptic zones, where he utilized machine learning to classify the practical EEG data. He published his works in reputed forums like IEEE Transactions on Information Theory, IEEE Signal Processing Magazine etc. Beside his research, he had been a lecturer in Electrical and Computer Engineering in Presidency University, Bangladesh. He has also got the teaching excellence award in Iowa State University for his outstanding performance as a teaching assistant.