The tensor train (TT) format has proved useful in many areas of computational science and engineering, in particular for solving high-dimensional partial differential equations and the associated control, optimization, and uncertainty quantification problems. Although the TT format allows enormous reductions in computational complexity and memory demands, real-world problems often still require the use of massively parallel computing resources. The basic operations in TT-based algorithms do not inherently parallelize well, and their performance is often limited by communication.
In this talk, we present efficient and scalable parallel algorithms for performing basic mathematical operations on low-rank tensors represented in the TT format. We consider algorithms for addition, elementwise multiplication, computing norms and inner products, orthogonalization, and rounding (rank truncation). These are the kernel operations for applications such as iterative Krylov solvers that exploit the TT structure. The parallel algorithms are designed for distributed-memory computation, and we use a data distribution and strategy that parallelizes the computations for each individual TT core. We analyze the computation and communication costs of the proposed algorithms to show their scalability, and we present numerical experiments that demonstrate their efficiency on both shared-memory and distributed-memory parallel systems. For example, we observe better single-core performance than the existing MATLAB TT-Toolbox in rounding a 2 GB TT tensor, and our implementation achieves a 34-fold speedup using all 40 cores of a single node. We also show nearly linear parallel scaling on larger TT tensors, to more than 10,000 cores, for all mathematical operations.
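To make two of the kernel operations named above concrete, the following is a minimal serial sketch in Python with NumPy. It is not the parallel implementation presented in the talk. It assumes a TT tensor is stored as a list of three-way cores of shape (r_{k-1}, n_k, r_k) with r_0 = r_d = 1, and the function names (tt_random, tt_full, tt_add, tt_inner, tt_norm) are hypothetical, chosen only for illustration.

```python
import numpy as np

def tt_random(dims, ranks, seed=0):
    """Random TT tensor: list of cores with shape (r_{k-1}, n_k, r_k)."""
    rng = np.random.default_rng(seed)
    r = [1, *ranks, 1]  # boundary ranks are 1
    return [rng.standard_normal((r[k], n, r[k + 1])) for k, n in enumerate(dims)]

def tt_full(cores):
    """Contract all cores into the dense tensor (for small checks only)."""
    full = cores[0]
    for core in cores[1:]:
        full = np.tensordot(full, core, axes=([-1], [0]))
    return full.squeeze(axis=(0, -1))  # drop the two boundary rank-1 modes

def tt_add(a, b):
    """Sum of two TT tensors; the TT ranks of the result add up."""
    d = len(a)
    if d == 1:
        return [a[0] + b[0]]
    out = []
    for k, (ca, cb) in enumerate(zip(a, b)):
        ra0, n, ra1 = ca.shape
        rb0, _, rb1 = cb.shape
        if k == 0:
            core = np.concatenate([ca, cb], axis=2)  # side by side
        elif k == d - 1:
            core = np.concatenate([ca, cb], axis=0)  # stacked
        else:
            core = np.zeros((ra0 + rb0, n, ra1 + rb1))  # block diagonal
            core[:ra0, :, :ra1] = ca
            core[ra0:, :, ra1:] = cb
        out.append(core)
    return out

def tt_inner(a, b):
    """Inner product, contracting one pair of cores at a time."""
    w = np.ones((1, 1))
    for ca, cb in zip(a, b):
        w = np.einsum("ij,ink,jnl->kl", w, ca, cb)
    return w[0, 0]

def tt_norm(a):
    """Frobenius norm computed from the inner product."""
    return np.sqrt(tt_inner(a, a))

x = tt_random([3, 4, 5], [2, 3], seed=1)
y = tt_random([3, 4, 5], [3, 2], seed=2)
s = tt_add(x, y)
print([c.shape for c in s])
```

Because addition concatenates cores, the ranks of the sum are the sums of the input ranks. This is why rounding (rank truncation) is needed after repeated additions in an iterative solver.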
Joint work with Hussam Al Daas and Grey Ballard.