StarPU Handbook
 All Data Structures Files Functions Variables Typedefs Enumerations Enumerator Macros Modules Pages
Check List When Performance Are Not There

TODO: improve!

Simply encapsulating application kernels into tasks already permits to seamlessly support CPU and GPUs at the same time. To achieve good performance, we give below a list of features which should be checked.

Data Related Features That May Improve Performance

link to Data Management

link to Data Prefetch

Task Related Features That May Improve Performance

link to Task Granularity

link to Task Submission

link to Task Priorities

Scheduling Related Features That May Improve Performance

link to Task Scheduling Policy

link to Task Distribution Vs Data Transfer

link to Power-based Scheduling

link to Static Scheduling

CUDA-specific Optimizations

Due to CUDA limitations, StarPU will have a hard time overlapping its own communications and the codelet computations if the application does not use a dedicated CUDA stream for its computations instead of the default stream, which synchronizes all operations of the GPU. StarPU provides one by the use of starpu_cuda_get_local_stream() which can be used by all CUDA codelet operations to avoid this issue. For instance:

func <<<grid,block,0,starpu_cuda_get_local_stream()>>> (foo, bar);
cudaStreamSynchronize(starpu_cuda_get_local_stream());

Unfortunately, some CUDA libraries do not have stream variants of kernels. That will lower the potential for overlapping.

Calling starpu_cublas_init() makes StarPU already do appropriate calls for the CUBLAS library. Some libraries like Magma may however change the current stream, one then has to call cublasSetKernelStream(starpu_cuda_get_local_stream()); at the beginning of the codelet to make sure that CUBLAS is really using the proper stream.

If the kernel can be made to only use this local stream or other self-allocated streams, i.e. the whole kernel submission can be made asynchronous, then one should enable asynchronous execution of the kernel. That means setting the flag STARPU_CUDA_ASYNC in the corresponding field starpu_codelet::cuda_flags, and dropping the cudaStreamSynchronize() call at the end of the cuda_func function, so that it returns immediately after having queued the kernel to the local stream. That way, StarPU will be able to submit and complete data transfers while kernels are executing, instead of only at each kernel submission. The kernel just has to make sure that StarPU can use the local stream to synchronize with the kernel startup and completion.

Using the flag STARPU_CUDA_ASYNC also permits to enable concurrent kernel execution, on cards which support it (Kepler and later, notably). This is enabled by setting the environment variable STARPU_NWORKER_PER_CUDA to the number of kernels to execute concurrently. This is useful when kernels are small and do not feed the whole GPU with threads to run.

OpenCL-specific Optimizations

If the kernel can be made to only use the StarPU-provided command queue or other self-allocated queues, i.e. the whole kernel submission can be made asynchronous, then one should enable asynchronous execution of the kernel. This means setting the flag STARPU_OPENCL_ASYNC in the corresponding field starpu_codelet::opencl_flags and dropping the clFinish() and starpu_opencl_collect_stats() calls at the end of the kernel, so that it returns immediately after having queued the kernel to the provided queue. That way, StarPU will be able to submit and complete data transfers while kernels are executing, instead of only at each kernel submission. The kernel just has to make sure that StarPU can use the command queue it has provided to synchronize with the kernel startup and completion.

Detection Stuck Conditions

It may happen that for some reason, StarPU does not make progress for a long period of time. Reason are sometimes due to contention inside StarPU, but sometimes this is due to external reasons, such as stuck MPI driver, or CUDA driver, etc.

export STARPU_WATCHDOG_TIMEOUT=10000 (STARPU_WATCHDOG_TIMEOUT)

allows to make StarPU print an error message whenever StarPU does not terminate any task for 10ms. In addition to that,

export STARPU_WATCHDOG_CRASH=1 (STARPU_WATCHDOG_CRASH)

raises SIGABRT in that condition, thus allowing to catch the situation in gdb. It can also be useful to type "handle SIGABRT nopass" in gdb to be able to let the process continue, after inspecting the state of the process.

How to limit memory per node

TODO

Talk about STARPU_LIMIT_CUDA_devid_MEM, STARPU_LIMIT_CUDA_MEM, STARPU_LIMIT_OPENCL_devid_MEM, STARPU_LIMIT_OPENCL_MEM and STARPU_LIMIT_CPU_MEM

starpu_memory_get_total()

starpu_memory_get_available()

Performance Model Calibration

Most schedulers are based on an estimation of codelet duration on each kind of processing unit. For this to be possible, the application programmer needs to configure a performance model for the codelets of the application (see Performance Model Example for instance). History-based performance models use on-line calibration. StarPU will automatically calibrate codelets which have never been calibrated yet, and save the result in $STARPU_HOME/.starpu/sampling/codelets. The models are indexed by machine name. To share the models between machines (e.g. for a homogeneous cluster), use export STARPU_HOSTNAME=some_global_name (STARPU_HOSTNAME). To force continuing calibration, use export STARPU_CALIBRATE=1 (STARPU_CALIBRATE). This may be necessary if your application has not-so-stable performance. StarPU will force calibration (and thus ignore the current result) until 10 (_STARPU_CALIBRATION_MINIMUM) measurements have been made on each architecture, to avoid badly scheduling tasks just because the first measurements were not so good. Details on the current performance model status can be obtained from the command starpu_perfmodel_display: the -l option lists the available performance models, and the -s option permits to choose the performance model to be displayed. The result looks like:

$ starpu_perfmodel_display -s starpu_slu_lu_model_11
performance model for cpu_impl_0
# hash    size     flops         mean          dev           n
914f3bef  1048576  0.000000e+00  2.503577e+04  1.982465e+02  8
3e921964  65536    0.000000e+00  5.527003e+02  1.848114e+01  7
e5a07e31  4096     0.000000e+00  1.717457e+01  5.190038e+00  14
...

Which shows that for the LU 11 kernel with a 1MiB matrix, the average execution time on CPUs was about 25ms, with a 0.2ms standard deviation, over 8 samples. It is a good idea to check this before doing actual performance measurements.

A graph can be drawn by using the tool starpu_perfmodel_plot:

$ starpu_perfmodel_plot -s starpu_slu_lu_model_11
4096 16384 65536 262144 1048576 4194304 
$ gnuplot starpu_starpu_slu_lu_model_11.gp
$ gv starpu_starpu_slu_lu_model_11.eps
starpu_starpu_slu_lu_model_11.png

If a kernel source code was modified (e.g. performance improvement), the calibration information is stale and should be dropped, to re-calibrate from start. This can be done by using export STARPU_CALIBRATE=2 (STARPU_CALIBRATE).

Note: history-based performance models get calibrated only if a performance-model-based scheduler is chosen.

The history-based performance models can also be explicitly filled by the application without execution, if e.g. the application already has a series of measurements. This can be done by using starpu_perfmodel_update_history(), for instance:

static struct starpu_perfmodel perf_model = {
.symbol = "my_perfmodel",
};
struct starpu_codelet cl = {
.cuda_funcs = { cuda_func1, cuda_func2 },
.nbuffers = 1,
.modes = {STARPU_W},
.model = &perf_model
};
void feed(void) {
struct my_measure *measure;
struct starpu_task task;
task.cl = &cl;
for (measure = &measures[0]; measure < measures[last]; measure++) {
starpu_vector_data_register(&handle, -1, 0, measure->size, sizeof(float));
task.handles[0] = handle;
starpu_perfmodel_update_history(&perf_model, &task,
STARPU_CUDA_DEFAULT + measure->cudadev, 0,
measure->implementation, measure->time);
}
}

Measurement has to be provided in milliseconds for the completion time models, and in Joules for the energy consumption models.

Profiling

A quick view of how many tasks each worker has executed can be obtained by setting export STARPU_WORKER_STATS=1 (STARPU_WORKER_STATS). This is a convenient way to check that execution did happen on accelerators, without penalizing performance with the profiling overhead.

A quick view of how much data transfers have been issued can be obtained by setting export STARPU_BUS_STATS=1 (STARPU_BUS_STATS).

More detailed profiling information can be enabled by using export STARPU_PROFILING=1 (STARPU_PROFILING) or by calling starpu_profiling_status_set() from the source code. Statistics on the execution can then be obtained by using export STARPU_BUS_STATS=1 and export STARPU_WORKER_STATS=1 . More details on performance feedback are provided in the next chapter.