The problem: you can't tell how each routine's results are going to be used, so you miss the opportunity to select operand distributions and operator implementations which would avoid communication later.
To solve this we use delayed evaluation to capture the control-flow of user programs at runtime. Now we know the context in which values are used, we can propagate data placement constraints backwards. To make this work, we have to optimise really fast, so we have formulated the problem carefully using affine functions to represent data placement and constraints.
To further reduce the overheads, we detect when an equivalent DAG re-occurs, and re-use the result of optimising it. We also optimise incrementally, and perform further optimisation iterations if a potentially sub-optimal execution plan is re-used repeatedly.
Preliminary performance result were presented and there followed a discussion of extensions and applications of the work.