StarPU Handbook
When the application allocates data, whenever possible it should use the function starpu_malloc(), which will ask CUDA or OpenCL to make the allocation itself and pin the corresponding allocated memory. This is needed to permit asynchronous data transfers, i.e. to allow data transfers to overlap with computations. Otherwise, CUDA or OpenCL reverts to synchronous transfers, and the trace will show that the DriverCopyAsync state takes a lot of time.
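As a sketch (the vector name and size NX are illustrative), allocating a pinned buffer and registering it could look like:

```c
float *vector;
starpu_data_handle_t vector_handle;

/* Pinned allocation: lets CUDA/OpenCL transfer this buffer asynchronously. */
starpu_malloc((void **)&vector, NX * sizeof(float));

/* Register it, with its home on memory node 0 (main memory). */
starpu_vector_data_register(&vector_handle, 0, (uintptr_t)vector, NX, sizeof(float));
```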
By default, StarPU leaves replicates of data wherever they were used, in case they will be re-used by other tasks, thus saving the data transfer time. When some task modifies some data, all the other replicates are invalidated, and only the processing unit which ran that task will have a valid replicate of the data. If the application knows that this data will not be re-used by further tasks, it should advise StarPU to immediately replicate it to a desired list of memory nodes (given through a bitmask). This can be understood as the write-through mode of CPU caches.
Calling starpu_data_set_wt_mask() with bit 0 of the write-through bitmask set will for instance request to always automatically transfer a replicate into the main memory (node 0). Setting the write-through mask to ~0U will request to always automatically broadcast the updated data to all memory nodes, which can also be useful to make sure all memory nodes always have a copy of the data, so that it is never evicted when memory gets scarce.
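For a handle named `handle` (an assumed name), these two situations would be requested as follows:

```c
/* Always push an up-to-date replicate to main memory (node 0): bit 0 set. */
starpu_data_set_wt_mask(handle, 1<<0);

/* Or broadcast every update to all memory nodes. */
starpu_data_set_wt_mask(handle, ~0U);
```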
Implicit data dependency computation can become expensive if a lot of tasks access the same piece of data. If no dependency is required on some piece of data (e.g. because it is only accessed in read-only mode, or because write accesses are actually commutative), use the function starpu_data_set_sequential_consistency_flag() to disable implicit dependencies on that data.
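For instance, for a handle that tasks only ever read (the handle name is assumed):

```c
/* No implicit dependencies will be computed for this data. */
starpu_data_set_sequential_consistency_flag(handle, 0);
```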
In the same vein, accumulation of results in the same data can become a bottleneck. Using the access mode STARPU_REDUX makes it possible to optimize such accumulation (see Data Reduction). To a lesser extent, the flag STARPU_COMMUTE keeps the bottleneck, but at least allows the accumulation to happen in any order.
Applications often need a piece of data just for temporary results. In such a case, registration can be made without an initial value; for instance, this produces a vector data:
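A minimal sketch (NX and the element type are illustrative); passing -1 as the memory node and a zero pointer registers the vector without any initial value:

```c
starpu_data_handle_t handle;

/* No home node (-1) and no initial buffer: StarPU allocates on demand. */
starpu_vector_data_register(&handle, -1, (uintptr_t)NULL, NX, sizeof(float));
```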
StarPU will then allocate the actual buffer only when it is actually needed, e.g. directly on the GPU without allocating in main memory.
In the same vein, once the temporary results are not useful any more, the data should be thrown away. If the handle is not to be reused, it can be unregistered:
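Assuming the handle is named `handle`:

```c
starpu_data_unregister_submit(handle);
```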
actual unregistration will be done after all tasks working on the handle terminate.
If the handle is to be reused, instead of unregistering it, it can simply be invalidated:
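Again assuming the handle is named `handle`:

```c
starpu_data_invalidate_submit(handle);
```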
the buffers containing the current value will then be freed, and reallocated only when another task writes some value to the handle.
The scheduling policies heft, dmda and pheft perform data prefetch (see STARPU_PREFETCH): as soon as a scheduling decision is taken for a task, requests are issued to transfer its required data to the target processing unit, if needed, so that when the processing unit actually starts the task, its data will hopefully be already available and it will not have to wait for the transfer to finish.
The application may want to perform some manual prefetching, for several reasons such as excluding initial data transfers from performance measurements, or setting up an initial statically-computed data distribution on the machine before submitting tasks, which will thus guide StarPU toward an initial task distribution (since StarPU will try to avoid further transfers).
This can be achieved by giving the function starpu_data_prefetch_on_node() the handle and the desired target memory node.
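For instance, to request that a handle be transferred to memory node `node` ahead of task submission (the handle and node names are assumed):

```c
/* Third parameter: 1 for an asynchronous prefetch, 0 to wait for completion. */
starpu_data_prefetch_on_node(handle, node, 0);
```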
An existing piece of data can be partitioned into sub-parts to be used by different tasks.
The task submission then uses the function starpu_data_get_sub_data() to retrieve the sub-handles to be passed as tasks parameters.
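A sketch of partitioning a vector into `nslices` blocks and submitting one task per block (handle, codelet and count names are assumed):

```c
/* Split the vector into nslices contiguous blocks. */
struct starpu_data_filter f =
{
    .filter_func = starpu_vector_filter_block,
    .nchildren = nslices
};
starpu_data_partition(vector_handle, &f);

/* Submit one task per sub-vector: depth 1, index i. */
for (i = 0; i < nslices; i++)
    starpu_task_insert(&cl, STARPU_RW, starpu_data_get_sub_data(vector_handle, 1, i), 0);
```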
Partitioning can be applied several times, see examples/basic_examples/mult.c and examples/filters/.
Wherever the whole piece of data is already available, the partitioning will be done in-place, i.e. without allocating new buffers but just using pointers inside the existing copy. This is particularly important to be aware of when using OpenCL, where the kernel parameters are not pointers, but handles. The kernel thus needs to also be passed the offset within the OpenCL buffer, and has to shift from the pointer passed by the OpenCL driver:
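A sketch for the vector interface (the kernel and argument names are illustrative). On the host side, both the device handle and the offset are passed to the kernel:

```c
/* Host side: pass both the cl_mem handle and the offset to the kernel. */
cl_mem buffer = (cl_mem) STARPU_VECTOR_GET_DEV_HANDLE(buffers[0]);
cl_ulong offset = STARPU_VECTOR_GET_OFFSET(buffers[0]);
clSetKernelArg(kernel, 0, sizeof(buffer), &buffer);
clSetKernelArg(kernel, 1, sizeof(offset), &offset);
```

and on the device side, the kernel shifts the pointer before using it:

```c
/* Device side: apply the offset before accessing the data. */
__kernel void vector_kernel(__global float *val, unsigned long offset, int nx)
{
    val = (__global float *)((__global char *)val + offset);
    /* ... work on val[0..nx-1] ... */
}
```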
StarPU provides various interfaces and filters for matrices, vectors, etc., but applications can also write their own data interfaces and filters, see examples/interface and examples/filters/custom_mf for examples.
In various cases, some piece of data is used to accumulate intermediate results: for instance, the dot product of a vector, maximum/minimum finding, the histogram of a photograph, etc. When these results are produced across the whole machine, it would not be efficient to accumulate them in only one place, as that would incur a data transfer for each contribution as well as contention on that single buffer.
StarPU provides an access mode STARPU_REDUX which optimizes this case: it will allocate a buffer on each memory node, and accumulate intermediate results there. When the data is eventually accessed in the normal mode STARPU_R, StarPU will collect the intermediate results into a single buffer.
For this to work, the user has to use the function starpu_data_set_reduction_methods() to declare how to initialize these buffers, and how to assemble partial results.
For instance, cg uses that to optimize its dot product: it first defines the codelets for initialization and reduction, and attaches them as reduction methods for its handle dtq_handle:
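A sketch of what this looks like (the codelet and function names follow the cg example but are reproduced from memory, so treat them as illustrative):

```c
/* Initialization codelet: zeroes a per-node accumulator. */
struct starpu_codelet bzero_variable_cl =
{
    .cpu_funcs = { bzero_variable_cpu },
    .cuda_funcs = { bzero_variable_cuda },
    .nbuffers = 1,
};

/* Reduction codelet: adds one partial result into another. */
struct starpu_codelet accumulate_variable_cl =
{
    .cpu_funcs = { accumulate_variable_cpu },
    .cuda_funcs = { accumulate_variable_cuda },
    .nbuffers = 2,
};

starpu_data_set_reduction_methods(dtq_handle, &accumulate_variable_cl, &bzero_variable_cl);
```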
dtq_handle can now be used in mode STARPU_REDUX for the dot products with partitioned vectors:
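For instance (the block count and sub-handle layout are assumed):

```c
for (b = 0; b < nblocks; b++)
    starpu_task_insert(&dot_kernel_cl,
                       STARPU_REDUX, dtq_handle,
                       STARPU_R, starpu_data_get_sub_data(v1, 1, b),
                       STARPU_R, starpu_data_get_sub_data(v2, 1, b),
                       0);
```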
During registration, we have here provided NULL, i.e. there is no initial value to be taken into account during reduction. StarPU will thus only take into account the contributions from the dot_kernel_cl tasks. Also, it will not allocate any memory for dtq_handle before the dot_kernel_cl tasks are ready to run.
If another dot product has to be performed, one could unregister dtq_handle and re-register it. But one can also call starpu_data_invalidate_submit() with the parameter dtq_handle, which will clear all data from the handle, thus resetting it back to its initial register(NULL) status.
The example cg also uses reduction for the blocked gemv kernel, leading to yet more relaxed dependencies and more parallelism.
STARPU_REDUX can also be passed to starpu_mpi_task_insert() in the MPI case. That will however not produce any MPI communication, but just pass STARPU_REDUX to the underlying starpu_task_insert(). It is up to the application to call starpu_mpi_redux_data(), which posts tasks that will reduce the partial results among MPI nodes into the MPI node which owns the data. For instance, some hypothetical application collects partial results into data res, then uses it for other computation, before looping again with a new reduction:
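A sketch of such a loop (the codelet names init_cl, work_cl and use_cl, and the loop bounds, are illustrative):

```c
for (iter = 0; iter < NITER; iter++)
{
    /* Reset the accumulator. */
    starpu_mpi_task_insert(MPI_COMM_WORLD, &init_cl, STARPU_W, res, 0);

    /* Contribute partial results from the various MPI nodes. */
    for (i = 0; i < N; i++)
        starpu_mpi_task_insert(MPI_COMM_WORLD, &work_cl, STARPU_REDUX, res, 0);

    /* Reduce the partial results onto the node which owns res. */
    starpu_mpi_redux_data(MPI_COMM_WORLD, res);

    /* Use the reduced value. */
    starpu_mpi_task_insert(MPI_COMM_WORLD, &use_cl, STARPU_R, res, 0);
}
```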
There are two kinds of temporary buffers: temporary data, which just passes results from one task to another, and scratch data, which is needed only internally by tasks.
Data can sometimes be entirely produced by a task, and entirely consumed by another task, without the need for other parts of the application to access it. In such a case, registration can be done without prior allocation, by using the special memory node number -1 and passing a zero pointer. StarPU will actually allocate memory only when the task creating the content gets scheduled, and destroy it on unregistration.
In addition to that, it can be tedious for the application to have to unregister the data, since it will not use its content anyway. The unregistration can be done lazily by using the function starpu_data_unregister_submit(), which will record that no more tasks accessing the handle will be submitted, so that it can be freed as soon as the last task accessing it is over.
The following code exemplifies both points: it registers the temporary data, submits three tasks accessing it, and records the data for automatic unregistration.
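A sketch, with illustrative codelet and size names:

```c
starpu_data_handle_t tmp_handle;

/* Temporary data: no home node, no initial buffer. */
starpu_vector_data_register(&tmp_handle, -1, (uintptr_t)NULL, n, sizeof(float));

starpu_task_insert(&produce_cl, STARPU_W, tmp_handle, 0);
starpu_task_insert(&compute_cl, STARPU_RW, tmp_handle, 0);
starpu_task_insert(&summarize_cl, STARPU_R, tmp_handle, STARPU_W, result_handle, 0);

/* Free the handle as soon as the last task above has completed. */
starpu_data_unregister_submit(tmp_handle);
```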
The application may also want to see the temporary data initialized on the fly before being used by the task. This can be done by using starpu_data_set_reduction_methods() to set an initialization codelet (no redux codelet is needed).
Some kernels sometimes need temporary data to achieve the computations, i.e. a workspace. The application could allocate it at the start of the codelet function and free it at the end, but that would be costly. It could also allocate one buffer per worker (similarly to How To Initialize A Computation Library Once For Each Worker?), but that would make them systematic and permanent. A more optimized way is to use the data access mode STARPU_SCRATCH, as exemplified below, which provides per-worker buffers without content consistency.
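A sketch (sizes and codelet names are assumed): the same workspace handle is passed to every task with mode STARPU_SCRATCH.

```c
starpu_data_handle_t workspace;

/* Scratch data: allocated per worker, content not kept consistent. */
starpu_vector_data_register(&workspace, -1, (uintptr_t)NULL, WORKSPACE_NX, sizeof(float));

for (i = 0; i < N; i++)
    starpu_task_insert(&compute_cl,
                       STARPU_R, input_handle[i],
                       STARPU_SCRATCH, workspace,
                       STARPU_W, output_handle[i],
                       0);
```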
StarPU will make sure that the buffer is allocated before executing the task, and make this allocation per-worker: for CPU workers, notably, each worker has its own buffer. This means that each task submitted above will actually have its own workspace, which will actually be the same for all tasks running one after the other on the same worker. Also, if for instance GPU memory becomes scarce, StarPU will notice that it can free such buffers easily, since the content does not matter.
The example examples/pi uses scratch data for some temporary buffers.
It may be interesting to represent the same piece of data using two different data structures: one that would only be used on CPUs, and one that would only be used on GPUs. This can be done by using the multiformat interface. StarPU will be able to convert data from one data structure to the other when needed. Note that the scheduler dmda is the only one optimized for this interface. The user must provide StarPU with conversion codelets:
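A sketch, close to examples/basic_examples/multiformat.c (the conversion codelets cpu_to_cuda_cl and cuda_to_cpu_cl have to be defined by the application; the element sizes are illustrative):

```c
struct starpu_multiformat_data_interface_ops format_ops =
{
    .cuda_elemsize = 2 * sizeof(float),
    .cpu_to_cuda_cl = &cpu_to_cuda_cl,
    .cuda_to_cpu_cl = &cuda_to_cpu_cl,
    .cpu_elemsize = 2 * sizeof(float),
};

/* Register NX objects living in array_of_structs, home node 0. */
starpu_multiformat_data_register(&handle, 0, &array_of_structs, NX, &format_ops);
```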
Kernels can be written almost as for any other interface. Note that STARPU_MULTIFORMAT_GET_CPU_PTR shall only be used for CPU kernels. CUDA kernels must use STARPU_MULTIFORMAT_GET_CUDA_PTR, and OpenCL kernels must use STARPU_MULTIFORMAT_GET_OPENCL_PTR. STARPU_MULTIFORMAT_GET_NX may be used in any kind of kernel.
A full example may be found in examples/basic_examples/multiformat.c.
Let's define a new data interface to manage complex numbers.
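Following examples/interface/complex_interface.c, the interface structure could look like this (field names reproduced from memory, so treat them as illustrative):

```c
/* Per-node layout of a complex-number vector. */
struct starpu_complex_interface
{
    double *real;      /* pointer to the real parts */
    double *imaginary; /* pointer to the imaginary parts */
    int nx;            /* number of elements */
};
```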
Registering such a data to StarPU is easily done using the function starpu_data_register(). The last parameter of the function, interface_complex_ops, will be described below.
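A sketch of the registration (assuming a struct starpu_complex_interface with real, imaginary and nx fields; variable names are illustrative):

```c
double real = 45.0;
double imaginary = 12.0;
struct starpu_complex_interface complex = { &real, &imaginary, 1 };

starpu_data_handle_t handle;
/* Home node 0 (main memory), described by interface_complex_ops. */
starpu_data_register(&handle, 0, &complex, &interface_complex_ops);
```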
Different operations need to be defined for a data interface through the type starpu_data_interface_ops. We only define here the basic operations needed to run simple applications. The source code for the different functions can be found in the file examples/interface/complex_interface.c.
Functions need to be defined to access the different fields of the complex interface from a StarPU data handle.
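For instance, a getter for the real part could look like this (a sketch, assuming a struct starpu_complex_interface with a real field):

```c
double *starpu_complex_get_real(starpu_data_handle_t handle)
{
    /* Look up the interface copy living on node 0 (main memory). */
    struct starpu_complex_interface *complex_interface =
        (struct starpu_complex_interface *) starpu_data_get_interface_on_node(handle, 0);
    return complex_interface->real;
}
```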
Similar functions need to be defined to access the different fields of the complex interface from a void * pointer to be used within codelet implementations.
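These are conveniently provided as macros (a sketch, assuming the same struct starpu_complex_interface layout):

```c
#define STARPU_COMPLEX_GET_REAL(interface) \
        (((struct starpu_complex_interface *)(interface))->real)
#define STARPU_COMPLEX_GET_IMAGINARY(interface) \
        (((struct starpu_complex_interface *)(interface))->imaginary)
#define STARPU_COMPLEX_GET_NX(interface) \
        (((struct starpu_complex_interface *)(interface))->nx)
```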
Complex data interfaces can then be registered to StarPU and used by codelets:
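A sketch of a codelet consuming such data (assuming a struct starpu_complex_interface with real, imaginary and nx fields; a real implementation would typically go through accessor macros):

```c
void display_complex_codelet(void *descr[], void *cl_arg)
{
    /* descr[0] points to the data interface chosen for this buffer. */
    struct starpu_complex_interface *complex =
        (struct starpu_complex_interface *) descr[0];
    int i;

    for (i = 0; i < complex->nx; i++)
        printf("%f + %f i\n", complex->real[i], complex->imaginary[i]);
}
```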
The whole code for this complex data interface is available in the directory examples/interface/.
When executing a task on a GPU for instance, StarPU would normally copy all the needed data for the task into the embedded memory of the GPU. It may however happen that the task kernel would rather have some of the data kept in the main memory instead of copied to the GPU, a pivoting vector for instance. This can be achieved by setting the starpu_codelet::specific_nodes flag to 1, and then filling the starpu_codelet::nodes array (or starpu_codelet::dyn_nodes when starpu_codelet::nbuffers is greater than STARPU_NMAXBUFS) with the node numbers to which data should be copied, or -1 to let StarPU copy it to the memory node where the task will be executed. For instance, with the following codelet:
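A sketch of such a codelet (the kernel name is assumed):

```c
struct starpu_codelet cl =
{
    .cuda_funcs = { kernel },
    .nbuffers = 2,
    .modes = { STARPU_RW, STARPU_RW },
    .specific_nodes = 1,
    /* First buffer stays in main memory (node 0);
       -1 means "the memory node where the task executes". */
    .nodes = { 0, -1 },
};
```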
the first data of the task will be kept in the main memory, while the second data will be copied to the CUDA GPU as usual.