Optimus extends parts of the compiler and runtime of DryadLinQ in order to provide optimisations of the execution plan graph (EPG). An EPG is normally a static data structure that represents the computations and dataflow in a directed acyclic graph; the distributed execution engine uses the EPG to distributed tasks among other things.
Optimus collects statistics (the statistic function can be user-defined) in order to optimise the EPG. From the architectural point of view, there is a master node that receives these statistics and the worker nodes generate them, performing some aggregation of the data collected to keep the messages as small as possible. Optimus permits the developer to code rewriting functions that will be called by the runtime rewritting engine when looking for optimisations in the EPG.
Before a function is called, the node collects statistics based on input data, sends it to the rewriting engine which updates the EPG. Afterwards, the rewriting engine can choose the best function based on those runtime statistics. The following steps show a simplified workflow for Optimus:
- Before a node performs a computation, collect runtime statistics (based on the input to the computation)
- Send statistic to rewriting engine
- The rewriting engine optimises the EPG and re-wires the function that the nodes will execute on a per node basis – this solves problems with skew data 3.1. The re-wiring may be running a new computation that calls again the rewriting engine to further optimise the new data set before proceeding to running the function.
- Perform computation and go back to step 1, if necessary
In the paper, they give details on how they solved dynamic data partitioning, MapReduce jobs, joins and iterative computations.