Algorithms have two costs: arithmetic and communication, i.e., moving data between
levels of a memory hierarchy or between processors over a network. Communication costs
(measured in time or energy per operation) already greatly exceed arithmetic costs,
and the gap is growing over time following technological trends. Thus our goal is to
design algorithms that minimize communication.
We survey recent results in this area, and give more detail on three results with architectural implications:
(1) We show that some algorithms can achieve perfect strong scaling (in time and energy),
subject to topology requirements on the interconnect network, while others cannot.
(2) Since writes can be more expensive than reads in some emerging nonvolatile memory
technologies, we extend our communication lower bounds and optimal algorithms to
independently minimize writes and reads; in some cases it is possible to perform asymptotically
fewer writes than reads, but in other cases it is not.
(3) The next IEEE 754 Floating Point Standard is addressing the increasing demand for
bitwise reproducibility in numerical computation by including new instructions intended
to make floating point summation associative at low cost, so that algorithms, including
much of linear algebra, can be bitwise reproducible independent of data layout, the
number of processors, etc. We show how our communication lower bounds and
optimal algorithms can be extended to include reproducibility.
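As a small illustration (ours, not the mechanism proposed for the standard or in the paper), floating-point summation is not associative, so the result of a parallel sum can depend on data layout and processor count. One simple way to recover order-independence is to pre-round every term to a common power-of-two granularity so that all additions are exact; practical schemes such as Demmel and Nguyen's use several such bins to retain accuracy:

```python
import math

def reproducible_sum(xs):
    """Order-independent floating-point sum (illustrative single-bin sketch,
    trading accuracy for reproducibility). Every term is pre-rounded to a
    multiple of a common granularity g chosen so that all partial sums are
    exactly representable; exact additions commute, so any summation order
    gives the bit-identical result."""
    m = max((abs(x) for x in xs), default=0.0)
    if m == 0.0:
        return 0.0
    n = len(xs)
    # Choose g = 2^e so that n * m / g < 2^53: all rounded terms and partial
    # sums are then integer multiples of g below 2^53, hence exact.
    e = math.ceil(math.log2(n * m)) - 52
    g = math.ldexp(1.0, e)
    total = 0.0
    for x in xs:
        total += round(x / g) * g  # rounding and scaling by 2^e are exact
    return total
```

For example, `(1e16 + 1.0) + (-1e16)` evaluates to `0.0` in double precision while `(1e16 + (-1e16)) + 1.0` evaluates to `1.0`, whereas `reproducible_sum` returns the same value for every ordering of those three terms.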
Finally, we outline the mathematical ideas, based on recent extensions of the
Hölder-Brascamp-Lieb inequality, that extend these results to general algorithms that
can be described as nested loops accessing arrays.
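The canonical instance of such a nested-loop computation is classical matrix multiplication. The sketch below (our illustration, not code from the paper) contrasts the naive triple loop with its blocked form; for a fast memory of size M, the known lower bound on data movement is Omega(n^3 / sqrt(M)) words, and blocking with b on the order of sqrt(M) attains it:

```python
def matmul_naive(A, B, n):
    """Reference loop nest: C[i][j] += A[i][k] * B[k][j]."""
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

def matmul_blocked(A, B, n, b):
    """Blocked (tiled) version: the three b-by-b tiles touched by the inner
    loops fit in fast memory, cutting data movement by a factor of about b
    relative to the naive nest while performing the same arithmetic."""
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, b):
        for jj in range(0, n, b):
            for kk in range(0, n, b):
                for i in range(ii, min(ii + b, n)):
                    for j in range(jj, min(jj + b, n)):
                        for k in range(kk, min(kk + b, n)):
                            C[i][j] += A[i][k] * B[k][j]
    return C
```

Only the order in which data is touched changes, not the set of operations, which is why communication can be reduced without increasing arithmetic.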