Seminars and Events
Interview Seminar by Anindya Bijoy Das, Distributed matrix computations
Event Details
Distributed matrix computations — matrix-matrix or matrix-vector multiplications — are well-recognized to suffer from the problem of stragglers (slow or failed worker nodes). A number of algorithms have been developed to mitigate the straggler issue, and several of them, based on MDS codes, are optimal in terms of straggler resilience. However they suffer from numerical problems, i.e., there is a blow-up of round-off errors in the decoded result owing to the high condition numbers of the corresponding decoding matrices. We develop a convolutional code-based approach to solve this problem that removes these limitations. It has excellent numerical robustness, and can be decoded using a fast peeling decoder that only involves add/subtract operations. Its numerical robustness can be theoretically quantified by deriving a computable upper bound on the worst case condition number over all possible decoding matrices by drawing connections with the properties of large block Toeplitz matrices. Moreover, Several practical matrix computation scenarios involve sparse matrices. MDS codes typically require dense linear combinations of submatrices of the original matrices which destroy the inherent sparsity of those input matrices. This is problematic as it results in significantly higher worker computation times. We present schemes by imposing constraints on the level of coding that is required in generating the encoded submatrices. This significantly reduces the worker computation time as compared to previous approaches and also allows us to leverage partial computation by stragglers. Exhaustive numerical experiments on Amazon Web Services (AWS) clusters support our findings.