User Manual¶
Storage Module¶
The storage module is one of the main modules in GridTools. Its main purpose is to provide the means to access, view, and modify data. The details of the hardware are hidden from the user, but the user is responsible for specifying which storage is needed. Different Backends are available for different kinds of storage, see Selecting the Backend. The following subsections explain the basic elements of the storage module.
The storage has the notion of a halo. This in turn allows the user of the storage to distinguish between the compute
domain and the full domain, where the latter includes the halos. The halo is symmetric in the storage class, that is,
the number of halo points at the two ends of a dimension is the same. Different dimensions may have different numbers of
halo points. The main reason for including halos in the storage is to enforce the alignment of the storage in the inner
region to improve the performance of stencil operations.
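The size bookkeeping this implies can be sketched in plain C++ (a standalone illustration of the semantics, not GridTools code; full_domain is a hypothetical helper):

```cpp
#include <array>

// With a symmetric halo of halo[d] points at each end of dimension d, the
// full domain adds 2 * halo[d] points to the compute domain in that dimension.
constexpr std::array<int, 3> full_domain(std::array<int, 3> compute,
                                         std::array<int, 3> halo) {
    return {compute[0] + 2 * halo[0],
            compute[1] + 2 * halo[1],
            compute[2] + 2 * halo[2]};
}
```

For example, an 8 x 10 x 10 compute domain with one halo point at each end of the i-dimension is stored as a 10 x 10 x 10 full domain.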
Storage Info¶
The storage info can be seen as a metadata object that keeps information about a storage: alignment, data layout in memory, halo areas, dimensionality, size of the storage, etc. The object must be created before a storage can be allocated. The storage info object can be shared among different storages if they have the same metadata. The right storage info type can be retrieved with storage_traits<Backend>. The supported backends are described in Backends.
As an example, we can retrieve the default three-dimensional storage info type with size 10 x 10 x 10 for the CUDA backend, with no halos and identifier 0 as follows:
#include <gridtools/storage/storage_traits_cuda.hpp>
namespace gt = gridtools;
using backend_t = gt::backend::cuda; // or gt::backend::mc or gt::backend::x86
using storage_info_t = gt::storage_traits<backend_t>::select_storage_info<0 /* id */, 3 /* dims */, gt::halo<0, 0, 0>>;
storage_info_t si{10, 10, 10};
Be aware that the identifier used in select_storage_info must uniquely determine the type of the storage info and the actual instantiation with sizes.
The traits class provides several ways to select a storage info type:

select_storage_info<Id, Dims, gt::halo<...>> returns the default storage info for a certain backend.

select_custom_layout_storage_info<Id, gt::layout<...>, gt::halo<...>> lets you specify the exact layout map (see below) of the storage. Keep in mind that the default layout should have the best performance on the target architecture.

select_special_storage_info<Id, gt::selector<...>, gt::halo<...>> lets you select a default storage info for a certain backend, but you can mask certain dimensions. For example, gt::selector<0, 0, 1> will result in a three-dimensional storage where all dimensions except the k-direction are masked. The layout will be optimized for the chosen backend.

select_storage_info_align<Id, Dims, gt::halo<...>, Alignment> lets you select a default storage info for a certain backend, but with a non-default alignment requirement.

select_custom_layout_storage_info_align<Id, gt::layout<...>, gt::halo<...>, Alignment> lets you specify the exact layout of the storage and the alignment requirement.

select_special_storage_info_align<Id, gt::selector<...>, gt::halo<...>, Alignment> lets you mask certain dimensions and specify the alignment requirement.
Fig. 3 shows a depiction of the storage_info compile-time data.

Fig. 3 Storage info and associated compile-time data.¶
Id: A unique identifier for the storage info type.

Note

For each storage_info type you should use only one instantiation. The mapping between a storage and the run-time information in the storage_info has to be done at compile time via the index. Thus GridTools cannot distinguish the storages by the run-time sizes passed to the storage_info. If you want to instantiate multiple storage_info with the same halo, alignment, and layout but with different dimensionality, you must use a different ID.

Layout Map: Information about the memory layout. The layout_map template takes a permutation of the values from 0 to N - 1, where N is the number of dimensions of the storage. The values indicate the order of the dimensions by decreasing strides. For instance, a C array X[i][j][k] layout would be equivalent to layout_map<0, 1, 2>. The dimension with stride 1 has the highest index, so in this case, the k-stride is unit stride. A Fortran style array X[i][j][k] layout would be equivalent to layout_map<2, 1, 0>, meaning that the i-stride is unit stride (thus, the first index of the layout_map is 2).

There is also the possibility to mask dimensions. This means that the storage appears as n-dimensional but the masked dimensions are ignored. For instance, a storage_info with layout_map<1, -1, 0> describes a 3-dimensional storage where the i-stride is unit stride and the j dimension is masked. In this case the storage is allocated as a two-dimensional array, but it behaves as a three-dimensional array. Accessing the array at (i, j, k) always returns the element at (i, 0, k). This kind of storage can be used to implement oriented planes in stencils.

The following example creates a storage info for the cuda backend. It has a masked dimension in J (i.e., it represents an I-K-plane) and the I-dimension is unit stride.

#include <gridtools/storage/storage_traits_cuda.hpp>
namespace gt = gridtools;
using backend_t = gt::backend::cuda;
using storage_info_t = gt::storage_traits<backend_t>::select_custom_layout_storage_info<0 /* id */, gt::layout<1, -1, 0>, gt::halo<0, 0, 0>>;
storage_info_t si{10, 10, 10};
Selector: Selectors in the storage traits allow masking dimensions while automatically using the optimal layout for the chosen backend. Dimensions in the selector that are set to 0 will be masked. For example, we can create a storage which is masked in J (again an I-K-plane), whose layout is optimal for the cuda backend:

#include <gridtools/storage/storage_traits_cuda.hpp>
namespace gt = gridtools;
using backend_t = gt::backend::cuda;
using storage_info_t = gt::storage_traits<backend_t>::select_special_storage_info<0 /* id */, gt::selector<1, 0, 1>, gt::halo<0, 0, 0>>;
storage_info_t si{10, 10, 10};
Alignment: Information about the alignment in terms of number of elements. There is the possibility to provide information about how the data points should be aligned in memory. The alignment is enforced on the first element in the compute domain of the dimension with stride one. This provides a huge performance gain for some architectures (e.g., GPUs). The storage module combines the alignment, layout, and halo information in order to align the non-halo data points of the stride-one dimension in memory. The legacy x86 Backend uses no alignment (alignment<1>) by default. The CUDA Backend uses a default alignment of 32 data elements (alignment<32>), the MC Backend defaults to alignment<8>.

Halo: The halo information has to be passed as type information to the storage info. The reason for this is that the proper alignment can only be computed with given halo information. The storage info object provides aligned data points (non-halo points) for the stride-1 dimension. The halo information is given as follows: halo<Sizes...>, where each size is the halo size at both ends of the corresponding dimension, in terms of elements. E.g., halo<2, 4, 0> is a halo of size 2 in directions I+ and I-, a halo of size 4 in directions J+ and J-, and no halo in K.

Note that the size passed to the constructor of the storage info includes the halo. The following example will create a storage info with halo 1 in the i-direction and no halos in the other directions. The allocated storage is 10 x 10 x 10, but the compute domain is only 8 x 10 x 10. The element (1, 0, 0) is the first element of the compute domain and will be aligned.

#include <gridtools/storage/storage_traits_cuda.hpp>
namespace gt = gridtools;
using backend_t = gt::backend::cuda;
using storage_info_t = gt::storage_traits<backend_t>::select_storage_info<0 /* id */, 3 /* dims */, gt::halo<1, 0, 0>>;
storage_info_t si{10, 10, 10};
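The amount of leading padding this implies can be sketched as follows (an illustration of the idea, not the GridTools implementation; leading_padding is a hypothetical helper):

```cpp
// With `halo` elements before the compute domain in the stride-1 dimension
// and an alignment of `alignment` elements, the allocation starts with this
// many padding elements so the first non-halo element lands on an aligned index.
constexpr int leading_padding(int halo, int alignment) {
    return (alignment - halo % alignment) % alignment;
}
```

For the example above (halo<1, 0, 0> with the CUDA default alignment<32>) this gives 31 leading padding elements, so the halo element sits at index 31 and the first compute-domain element (1, 0, 0) lands at index 32, which is 32-element aligned.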
A visual representation of the storage info is given in Fig. 4. The allocated memory starts at the bottom. The allocation starts with some padding (due to the alignment requirements). The compute domain (yellow) is surrounded by a halo region (green).

Fig. 4 Example storage info.¶
Interface: A storage_info object provides methods for querying the meta data:

template <uint_t D> uint_t total_length() const: retrieve the total number of data points in dimension D

template <int D> int stride() const: retrieve the stride in dimension D

const array<uint_t, ndims> &total_lengths() const: return the array of total numbers of data points in each direction

const array<uint_t, ndims> &strides() const: return the array of (aligned) strides

template <uint_t D> uint_t begin() const: retrieve the position of the first non-halo point in dimension D

template <uint_t D> uint_t end() const: retrieve the position of the last non-halo point in dimension D

template <uint_t D> uint_t total_begin() const: retrieve the position of the first point (can also be a halo point) in dimension D (always 0)

template <uint_t D> uint_t total_end() const: retrieve the position of the last point (can also be a halo point) in dimension D
Data Store¶
Once the storage_info object is created, a data_store can be created. A data_store keeps together the storage_info object, a data type, and the actual memory allocation (see also Fig. 5). The main purpose of the data_store is to provide means for synchronizing, keeping consistency, and cleaning up memory. A data_store does a shallow copy when being copied, which means that the underlying allocated storage is not copied and original and copy will point to the same data.
When a data store is used on an accelerator, it has two copies of the data, one copy resides in the memory of the host system, and a second copy resides in the memory of the accelerator (target system).

Fig. 5 Data Store and associated compile-time data.¶
The type of a Data Store combines a storage_info type and the underlying type. Given a storage_info_t, the proper type for the data store can be retrieved using storage_traits<Backend>::data_store_t:
using data_store_t = storage_traits<Backend>::data_store_t<double, storage_info_t>;
Example: The following code snippets show how Data Stores can be created. First, the user has to decide whether the memory management is done externally or not. If it is done externally, the Data Store will neither allocate nor deallocate the memory. The standard use case is a managed Data Store. The Data Store can be initialized with a value or a lambda, and can optionally be named by passing an additional string.
using storage_info_t = storage_traits<backend::mc>::select_storage_info<0 /* id */, 2 /* dims */, gt::halo<1, 0>>;
using data_store_t = storage_traits<backend::mc>::data_store_t<double, storage_info_t>;
storage_info_t si(6, 3); // 4 data points + 2 halo points in I, 3 data points and 0 halo points in J
// standard use cases
data_store_t ds1(si); // create a data store without a name (will allocate memory internally)
data_store_t ds2(si, "ds2"); // create a data store with a name (will allocate memory internally)
data_store_t ds3(si, 1.0, "ds3"); // create a named and value initialized data store
data_store_t ds4(si, [](int i, int j) { return i+j; }, "ds4"); // create a named and lambda initialized data store
// copying a data store
ds2 = ds1; // ds2 will deallocate the previously allocated memory and will point to the same data as ds1.
// external pointer use case
extern double* external_ptr;
data_store_t ds_ext(si, external_ptr); // create a data store that is not managing the memory
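The shallow-copy semantics can be sketched with a reference-counted handle (toy_data_store is a hypothetical type for illustration, not part of GridTools):

```cpp
#include <cstddef>
#include <memory>

// Copying a data store copies a shared handle to the allocation, not the
// data itself, so original and copy alias the same memory; the allocation
// is released when the last owner goes away.
struct toy_data_store {
    std::shared_ptr<double[]> data;
    explicit toy_data_store(std::size_t n) : data(new double[n]()) {}
};
```

After `toy_data_store b = a;`, writing `a.data[3] = 42.0;` is visible through `b.data[3]` as well, mirroring the behavior of `ds2 = ds1` above.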
Interface:
The data_store object provides the following methods:

void reset(): reset the data_store; may deallocate memory

bool valid() const: check if the underlying storage info and storage are valid

const array<uint_t, ndims> &total_lengths() const: return the array of total numbers of data points in each direction

const array<uint_t, ndims> &strides() const: return the array of (aligned) strides

void sync() const: synchronize the copies on the host and the target

reactivate_target_write_views: re-enables read-write target views (see Data View)

reactivate_host_write_views: re-enables read-write host views (see Data View)

std::string const &name() const: retrieve the name of the storage
Note
The Data Store cannot be used to modify or access the data. In order to do so we use the view concept, which is explained next.
Data View¶
The view is a lightweight object to access and modify the data stored in a data_store. Views can be generated for both host and target. Views can be created in read-only or read-write mode. Read-only views can be beneficial because they don't trigger synchronizations with the device. These functions are used to create views:
auto host_view = make_host_view(data_store);
auto host_view_ro = make_host_view<access_mode::read_only>(data_store);
auto target_view = make_target_view(data_store);
auto target_view_ro = make_target_view<access_mode::read_only>(data_store);
A read-write view requests exclusive ownership on the data store, meaning that a read-write view on the host puts the data store in a state such that no view (neither read-only nor read-write) can be created or used on target.
auto host_view = make_host_view(data_store);
// creating another host view is ok (read-write and read-only)
auto host_view_ro = make_host_view<access_mode::read_only>(data_store);
auto another_host_view = make_host_view(data_store);
// creating any target view will fail without sync
// FAILS: auto target_view = make_target_view<access_mode::read_only>(data_store);
data_store.sync();
// now a target view can be created
auto target_view = make_target_view(data_store);
A read-only view has shared, non-exclusive ownership of the data store: a read-only view on the host can only be created and used if no read-write view exists on the target. Read-only views can co-exist on host and target. When a read-only view exists on the host and a read-write view is created on the target, you can safely use the read-write view, but the read-only view cannot be used until the data store is synchronized.
// read_only views can co-exist on host and target
auto host_view_ro = make_host_view<access_mode::read_only>(data_store);
auto target_view_ro = make_target_view<access_mode::read_only>(data_store);
// creating a read-write view on the target will invalidate the host view
auto target_view = make_target_view(data_store);
// host_views cannot be created or used again until sync is called
data_store.sync();
// now we can create and use host views again
auto another_host_view_ro = make_host_view<access_mode::read_only>(data_store);
In order to end a view, the data store must be synchronized explicitly by calling data_store.sync(). The destructor
of a view will not end the view. Note that synchronizing a data store will re-enable read-only views, but it will not
re-enable read-write views. Those can be re-enabled by calling data_store.reactivate_host_write_views() (which will
invalidate read-only views on the target) and data_store.reactivate_target_write_views() (which will invalidate
read-only views on the host).
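The ownership and synchronization rules above can be summarized in a toy state model (a deliberate simplification with hypothetical names, not the GridTools implementation):

```cpp
// Each side (host/target) has a "clean" flag: a read-write view on one side
// makes the other side stale, and sync() makes both sides clean again.
struct toy_sync_state {
    bool host_clean = true, target_clean = true;

    bool host_views_usable() const { return host_clean; }
    bool target_views_usable() const { return target_clean; }
    void write_on_host() { target_clean = false; }    // target copy is now stale
    void write_on_target() { host_clean = false; }    // host copy is now stale
    void sync() { host_clean = target_clean = true; } // copies consistent again
};
```

This captures the pattern in the examples above: after writing through a host view, target views are unusable until sync(), and vice versa.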
Example:
Example: The following example shows the instantiation of a data_store and a corresponding host and target view. The host view
can be used when the data is modified on the host. Target views can be accessed from the target only if it has a separate
memory space (e.g., GPU). If target and host share the memory space, host and target views are identical.
...
template <typename View>
__global__ void kernel(View s) {
s(0, 0) = s(0, 10); // copy the value at (0, 10), i.e. 3.1415, to (0, 0)
}
...
// instantiate a data_store
using storage_info_t = storage_traits<backend::cuda>::select_storage_info<0, 2, gt::halo<0, 0>>;
using data_store_t = storage_traits<backend::cuda>::data_store_t<double, storage_info_t>;
storage_info_t si(20, 20);
data_store_t ds(si);
// create a view to ds (data_store)
auto host_view = make_host_view(ds); // read write view
// set (0, 10) to 3.1415
host_view(0, 10) = 3.1415;
// synchronize the data store
ds.sync();
// create a target view
auto device_view_ds = make_target_view(ds); // read write view
// call kernel
kernel<<<1, 1>>>(device_view_ds);
// synchronize the data store
ds.sync();
// reactivate the host view
ds.reactivate_host_write_views();
// expect (0, 0) == (0, 10)
EXPECT_EQ(host_view(0, 10), host_view(0, 0));
Interface:
A data_view can be created with make_host_view and make_target_view:

#include <gridtools/storage/data_store.hpp>
auto view1 = make_host_view<Access>(data_store);
auto view2 = make_target_view<Access>(data_store);

Access is either access_mode::read_only or access_mode::read_write (defined in storage/common/definitions.hpp). view1 can be used by a host thread to access data; view2 is a view that can be used from the target system to access data.
The data_view object has the following public API:

const array<uint_t, ndims> &total_lengths() const: return the array of total numbers of data points in each direction

const array<uint_t, ndims> &strides() const: return the array of (aligned) strides

template <typename... Coords> data_t& operator()(Coords... c) const: used to access elements. E.g., view(0, 0, 2) will return the third element.
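The element access of operator() can be sketched as a dot product of coordinates and strides (an illustration of the semantics; linear_offset is a hypothetical helper, not GridTools API):

```cpp
#include <array>
#include <cstddef>

// operator()(c0, c1, c2) maps the coordinates to a linear memory offset via
// the storage's strides: offset = c0*s0 + c1*s1 + c2*s2.
std::ptrdiff_t linear_offset(std::array<int, 3> coords,
                             std::array<std::ptrdiff_t, 3> strides) {
    std::ptrdiff_t offset = 0;
    for (int d = 0; d < 3; ++d)
        offset += coords[d] * strides[d];  // coordinate times stride, summed
    return offset;
}
```

With strides {100, 10, 1}, view(0, 0, 2) reads offset 2, i.e. the third element, as stated above.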
At any point, the consistency of a view with a data store can be checked with
auto view = make_host_view<Access>(data_store);
check_consistency(data_store, view);
The function bool check_consistency(DataStore const&, DataView const&) returns true if the given view can be used
to access and modify the data. If it returns false, the user needs to call data_store.sync() or one of the reactivate
functions.
Stencil Operators¶
Stencil operators are the GridTools equivalent of functors in regular C++ code. They are assumed to have no side effects and no state (which is why they are marked as static). As functions, they have an interface and an implementation. The interface informs both the caller, on the order and types of arguments that have to be passed to it, and the implementation, on the names and types of the symbols available to it.
The stencil operator specifies the computation to be performed in each grid point of the Iteration Space of the stencil Computation. In the implementation, a point of the Iteration Space at which the stencil operator is called is referred to as Iteration Point.
A stencil operator is a class, or a struct, with the following public properties:
A list of Accessor types that are associated to the data fields the stencil operator will access in its implementation.

A param_list listing all the Accessor types defined above. They are created using make_param_list.

A set of static template member functions named apply, also referred to as Apply-Methods. These functions should be annotated with GT_FUNCTION, which ensures that the functions can be run on GPU and that they are inlined.
The user may add additional static functions for internal usage within the stencil operator.
See the Example for a concrete usage of the syntax of the stencil operators.
Accessor Type¶
There are two kinds of Accessors:
Regular accessors indicate an access to a regular Data Store of a Grid.

Global accessors indicate that the data referred to does not participate in the iteration and always points to the same read-only datum to be used in the operator, which is held in a Global Parameter.
Accessors are defined as follows:
// regular accessors
using name = accessor<I, intent, [extent = extent<>, N]>;
// global accessor
using name = global_accessor<I>;
where

name is the name associated to the Accessor and will be used in the implementation of the stencil operator. The name of an accessor is only used within the stencil.

I is an integral index. The indices of the Accessors in a given stencil operator must be in the range from 0 to N - 1, where N is the number of Accessors used by the stencil operator. No index can appear twice. If these rules are not followed, the compilation fails. For global accessors, the index is the only argument.

intent indicates the type of access the stencil operator makes to the data associated to the Accessor. Possible values are

intent::in to specify read-only access

intent::inout to specify read-write access. The extent for i and j for inout must be made of all zeros (see next points)

Alternatively, the abbreviations in_accessor or inout_accessor can be used respectively.

using read_only_accessor = in_accessor<I, [extent, N]>;
using read_write_accessor = inout_accessor<I, [extent, N]>;
extent defines the maximum offsets at which the implementation will access data around the Iteration Point. They are defined as follows:

extent<i_minus = 0, i_plus = 0, j_minus = 0, j_plus = 0, k_minus = 0, k_plus = 0>;

An extent takes three pairs of numbers, one pair for each dimension of the iteration space. The first number of the pair must be non-positive and indicates the maximum offset in the direction of decreasing indices (also called minus direction). The second number must be non-negative and indicates the maximum offset in the direction of increasing indices (also called plus direction). Numbers can be omitted and default to zero.

Example:

using my_accessor = inout_accessor<I, extent<-1, 1, 0, 2>>;

This accessor guarantees that at most one element in negative and positive i-direction will be accessed (i.e., we will never access field(i + 2)). Further, it guarantees that in j-direction no elements in negative and at most two elements in positive direction will be accessed. In k-direction, the field is not accessed with any offset.

Note that extent<> is a valid extent and indicates that the field is always accessed at the iteration point.

Note

Behaviour is undefined if a field is accessed at offsets outside the box defined by extent. Under certain configurations this might lead to erroneous results. Extents bigger than the ones actually accessed by the implementation will potentially result in performance loss.

N identifies the number of dimensions of the Data Store. By default this value is set to 3.
Accessor Aliases¶
An Accessor alias is a regular Accessor which has an offset predefined at compile-time. For instance,
we might have a 4-dimensional Data Store, where the last dimension refers to the 3 components u, v, w of a vector
field, e.g. the wind speed. This vector field is accessed via an Accessor called vel
using vel = in_accessor<0, extent<>, 4>;
For convenience we can introduce an alias to the w
component
using w = alias<vel, dimension<4>>::set<2>;
The line above sets the fourth offset at compile-time to the value 2, so that we have the following equivalency:
w() == vel(0, 0, 0, 2)
This considerably lightens the notation in complicated expressions. Note that you can still access the other dimensions with an offset by using the alias, for example:
dimension<1> i;
dimension<4> q;
w(i + 1) == vel(i + 1, q + 2)
Parameter list¶
The parameter list param_list
is defined as follows:
using param_list = make_param_list<accessors...>;
where accessors is a comma-separated list of all the Accessors specified before. For example:
using in_ = in_accessor<0>;
using out_ = inout_accessor<1>;
using param_list = make_param_list<in_, out_>;
Note
Specifying the parameter list is mandatory because C++ cannot infer what types have been defined as accessors.
Apply-Method¶
A stencil operator can have several apply-methods, defining the functors to be applied at different Vertical Intervals. An apply method takes at most two arguments:

The first argument is a templated parameter, usually called eval, that holds internal information.

The second argument specifies the Vertical Interval to which the functor is applied. If the grid is not created with an axis, but only with a vertical size, this argument can be skipped and the stencil is applied to the whole axis. Whenever the grid is created using an axis, the user should specify a vertical interval for each method.
Example:
template <typename Eval>
GT_FUNCTION static
void apply(Eval const& eval, region);
Within an apply
-method, data bound to the accessors can be accessed through the eval
argument.
eval(accessor_name())
Note
Make sure to add the parentheses after the accessor.
The previous syntax will evaluate the accessor at the iteration point. Values can be accessed at offsets relative to the evaluation point by passing a sequence of integral indices to the accessor:
eval(accessor_name(1, 0, -1))
This accesses an element at an offset of 1 in the first dimension (plus direction) of the Iteration Point, and
an offset of 1 in the minus direction in the third dimension. A way to think of it is to consider the point of
evaluation as a triplet i, j and k; those offsets are added to the current index coordinates to
identify the actual value to access. The evaluation returns a reference to the value for accessors with
inout-intent, and a const reference for accessors with in-intent.
The next example calculates the average of two neighboring values in the first dimension and assigns it to the output field:
eval(out()) = (eval(in()) + eval(in(1, 0, 0))) / 2;
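The semantics of that one-liner correspond to the following plain loop, reduced to one dimension (a sketch for illustration, not GridTools code; average_stencil is a hypothetical helper):

```cpp
#include <cstddef>
#include <vector>

// Each output point is the average of the input at the iteration point and
// at offset +1; reading at extent<0, 1> is why the output has one fewer
// valid point than the input. Assumes in.size() >= 1.
std::vector<double> average_stencil(const std::vector<double>& in) {
    std::vector<double> out(in.size() - 1);
    for (std::size_t i = 0; i + 1 < in.size(); ++i)
        out[i] = (in[i] + in[i + 1]) / 2.0;  // eval(in()) and eval(in(1, 0, 0))
    return out;
}
```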
Note
Writing into non-zero offsets is considered bad practice and can lead to incorrect computations. This may be forbidden in the future.
Stencil operators must not have any horizontal dependencies within themselves. That means: if a stencil operator writes into a field, it must not access this field with a non-zero horizontal offset.

eval(out()) = eval(in(0, 1, 0)); // ok, if this stage does not write to in
eval(out()) = eval(out(0, 1, 0)); // undefined!
eval(out()) = eval(out(0, 0, 1)); // undefined, if execution policy is parallel (see stencil composition sections)
Example¶
struct flx_function {
using out = inout_accessor<0>;
using in = in_accessor<1, extent<0, 1, 0, 0>>;
using lap = in_accessor<2, extent<0, 1, 0, 0>>;
using param_list = make_param_list<out, in, lap>;
template <typename Evaluation>
GT_FUNCTION static void apply(Evaluation const &eval, full_interval) {
eval(out()) = eval(lap(1, 0, 0)) - eval(lap(0, 0, 0));
if (eval(out()) * (eval(in(1, 0, 0)) - eval(in(0, 0, 0))) > 0) {
eval(out()) = 0.;
}
}
};
Expressions¶
Todo
Should we remove the expressions namespace? Why should we forbid this functionality in certain cases?
Multiple calls to eval can be merged into one when the expressions namespace is imported. This is possible because
calculations with accessors produce expressions that can be evaluated. Readability can be greatly improved, but it might
have some negative impact on compilation time.
template <typename Evaluation>
GT_FUNCTION static void apply(Evaluation const &eval) {
using namespace expressions;
eval(out()) = eval(lap(1, 0, 0) - lap(0, 0, 0));
if (eval(out() * (in(1, 0, 0) - in(0, 0, 0))) > 0) {
eval(out()) = 0.;
}
}
The expressions
-namespace has overloads for common operations on accessors, namely +
, -
, *
, /
,
pow<>
. Using those operations with accessors creates an expression that can be evaluated using eval
. An
extensive example of its usage, demonstrating its effectiveness, can be found in the shallow water implementation.
Note that those expressions can also be evaluated lazily. This provides a way to reuse expressions in your code:
using namespace expressions;
constexpr auto cond = out() * (in(1, 0, 0) - in(0, 0, 0));
template <typename Evaluation>
GT_FUNCTION static void apply(Evaluation &eval) {
eval(out()) = eval(lap(1, 0, 0) - lap(0, 0, 0));
if (eval(cond) > 0) {
eval(out()) = 0.;
}
}
Execution Model¶
Stencil operations are executed in a three-dimensional Iteration Space. The first two dimensions of the iteration space, usually referred to as the I and J dimensions, identify the horizontal dimensions. There is no prescription on the order in which stencil operators at different points of the same IJ plane are executed. Stencil operators in the third dimension of the iteration space, usually referred to as the K or vertical dimension, can have a prescribed order of execution. There are three execution policies for the K dimension:
forward: The computation at index k in the vertical dimension is executed after index k - 1 for all points in the horizontal plane;
backward: The computation at index k in the vertical dimension is executed after index k + 1 for all points in the horizontal plane;
parallel: No order is specified and execution can happen concurrently.
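The forward policy can be sketched with a plain loop (an illustration, not GridTools code; forward_sum is a hypothetical helper). Because level k - 1 is complete before level k starts, a stage may consume its own output from the level below:

```cpp
#include <cstddef>
#include <vector>

// Forward policy sketch: level k runs after level k - 1, so this stage can
// accumulate a running sum along a column. The backward policy would run the
// same loop from the top level downward.
void forward_sum(std::vector<double>& column) {
    for (std::size_t k = 1; k < column.size(); ++k)
        column[k] += column[k - 1];  // column[k - 1] is guaranteed complete
}
```

Under the parallel policy no such dependency is allowed, since the levels may execute concurrently.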
An execution strategy for a computation is indicated by execute::order()
,
where order is one of the K dimension execution orders: forward
, backward
, parallel
.
The parallel execution policy allows tuning the degree of parallelism (in the CUDA backend) by selecting
execute::parallel_block<vertical_block_size>, where vertical_block_size describes the dimension of the CUDA block
in the vertical direction (a too-high value, e.g. each k-level in a separate CUDA thread, might result in too little
instruction-level parallelism and therefore in non-optimal performance).
Note
The option to specify the vertical block size will most likely be moved to the backend in the future, see Selecting the Backend.
More concretely, a multistage is a list of stages (implemented with stencil operators) to be executed with a certain execution policy. A computation combines several multistages, and will execute one multistage after the other.
For each IJ plane, the stages of a multistage will be executed strictly one after the other. This means that a stage can assume that the previous stage has been applied to the whole IJ plane before it is executed. The user can explicitly create independent stages that don't require this restriction.
If the execution policy is parallel, a stage cannot impose any assumptions on which stages are applied before in another IJ plane.
If the execution policy is forward, it is guaranteed that if a stage is executed at index k, then all stages of the multistage were already applied to the same column with smaller k, and that no stages of the multistage were already applied to the indices in the same column with larger k.

If the execution policy is backward, it is guaranteed that if a stage is executed at index k, then all stages of the multistage were already applied to the same column with larger k, and that no stages of the multistage were already applied to the indices in the same column with smaller k.
Access restrictions¶
The execution model imposes restrictions on how accessors can be evaluated. The following restrictions apply:
Accessors can be read at any offset at any time, if the fields bound to the accessor are read-only within a whole multistage.

A stage may write to a field through an accessor. In this stage and in all later stages in the same multistage, this field must not be read anymore with any non-zero offsets, except with k-offsets in the case of a non-parallel policy. For example, you must not access such a field with eval(field(-1, -1, -1)).
Stencil Composition¶
Preparing Arguments: Storage Placeholders¶
To allow naming the arguments that are passed to stencil computations at compile time, GridTools uses storage placeholders. A storage placeholder is a pair of an integer identifier and a storage type (a GridTools Data Store type).
For example
using p_in = arg<0, data_store_t>;
using p_out = arg<1, data_store_t>;
defines two placeholders named p_in and p_out. Note that the identifier, that is, the first argument passed to arg, must be unique in the computation using it.
A second placeholder type for temporary data exists that can be defined in exactly the same way:
using p_tmp = tmp_arg<2, data_store_t>;
The difference between temporary and non-temporary placeholders is that non-temporary placeholders are bound to a user-allocated data store, while this is not needed for temporary placeholders: GridTools takes care of allocating suitable storage for them. The storage duration of a temporary starts with the first stage of the first multistage and ends with the last stage of the last multistage.
Note
If you define multiple placeholders with the same ID, they cannot be distinguished by GridTools, even if they have different names or storage types! However, an arg and a tmp_arg can share an identifier and still represent different placeholders.
The placeholders can be used as arguments to certain GridTools functions. There are two use cases: first, they define which stages are called with which arguments, e.g., in
auto stage = make_stage<some_stage>(p_in(), p_tmp(), p_out());
they define that during execution the stage some_stage
will be called with the fields bound to the placeholders at
that point (see further below for documentation of the make_stage function). The second use case is binding run-time
values to previously declared arguments, e.g.,
computation.run(p_in() = input_data, p_out() = output_data);
runs a computation on input_data and output_data. Storages must only be assigned to arg placeholders, all data for the tmp_arg arguments is managed by GridTools.
Defining the Iteration Space: the Grid¶
The Stencil Operators describe operations on a single Iteration Point.
The Iteration Space defines on which points the operator should be applied. In this section
we cover how to define the Iteration Space with the grid
object.
GridTools offers a set of functions which ease the construction of the grid
:
- grid make_grid(int size_i, int size_j, int size_k)¶
The simplest Iteration Space will iterate the cube defined by the intervals [0, size_i-1], [0, size_j-1], [0, size_k-1]. This function must only be used if all stages used within the computation have zero extents.
- grid make_grid(halo_descriptor halo_i, halo_descriptor halo_j, int size_z)¶
For finer control of the iteration space, a Halo Descriptor can be passed for the horizontal directions (I and J). The 3rd and 4th arguments of the halo_descriptor define the start and the end point of the Iteration Space. Note that the minus (first argument) and plus (second argument) of the halo_descriptor should be larger than the maximum extent of the whole computation.
Note
The last argument of the halo descriptor (total length) is not used in the grid. This will be improved in the future.
auto grid = make_grid({3, 3, 10, 20, 30}, {3, 3, 10, 20, 30}, 10);
This example will create a grid. The iteration space in i and j will be [10, 20] (including 20!). The computation is required not to access data outside of [7, 23]. The iteration space in k is [0, 9].
- grid make_grid(int size_i, int size_j, Axis axis)¶
The vertical axis needs to be passed to the grid when using several vertical regions. The axis can be constructed by passing it the size of each of the vertical regions. Details follow in the coming sections.
- grid make_grid(halo_descriptor halo_i, halo_descriptor halo_j, Axis axis)¶
See the explanations in the other overloads.
Vertical Regions and Vertical Boundary Conditions¶
The GridTools Execution Model allows the execution to be sequential in the vertical dimension (k). Additionally, GridTools
offers the possibility to split the vertical dimension into vertical regions,
where stencils can perform different operations. Typical applications of this pattern are models which define
terrain-following coordinates close to the earth surface and flat-coordinates in upper
levels of the atmosphere. Another use-case are vertical boundary-conditions which can
be directly integrated into the stencil operation.
Note
In the following we will distinguish two concepts: first, Vertical Regions are non-overlapping subsets of the vertical Iteration Space with run-time defined sizes; second, Vertical Intervals (or just Intervals) are compile-time defined descriptions from one vertical level to another (not every vertical level can be selected, see below).
Default Interval¶
In simple applications, where all vertical levels should be treated equally, GridTools allows using a default Vertical Interval which covers the full vertical region. In this and only this case the apply methods of the stencil operators should be defined without specifying a Vertical Interval (see Section Apply-Methods) and the iteration space should be created using one of the simple constructors in the previous section (namely, either make_grid(int, int, int) or make_grid(halo_descriptor, halo_descriptor, int)).
Defining Vertical Intervals¶
GridTools allows splitting the full vertical iteration space into regions. The number of vertical regions needs to be specified at compile time, while the size of each region can be defined at run time.
For defining a computation with more than a single vertical region we need to define an axis
first.
using my_axis_t = axis<N>;
where N describes the number of vertical regions.
At runtime the axis is instantiated with the sizes of each region,
my_axis_t my_axis{N0, N1, N2, ...};
where Nx is the size of region x. With our axis object we can now generate a grid with one of the following signatures:
grid make_grid(int size_i, int size_j, axis<N> my_axis)
grid make_grid(halo_descriptor halo_i, halo_descriptor halo_j, axis<N> my_axis)
Each region already defines a Vertical Interval which can be queried from the axis by
using first_interval = my_axis_t::get_interval<0>;
using second_interval = my_axis_t::get_interval<1>;
using full_interval = my_axis_t::full_interval;
...
Note that the Vertical Intervals are compile-time objects, i.e. C++ types. These Vertical Intervals are used to define which Apply-Method overload of the stencil operator should be used during the iteration.
A Vertical Interval provides meta-functions which allow defining modified Vertical Intervals:

interval::first_level, which is the Interval (a C++ type) describing the first level of the Interval

interval::last_level, which is the Interval describing the last level of the Interval

interval::modify<begin, end>, which is an Interval extended (begin < 0) or shrunk (begin > 0) at the beginning of the Interval and extended (end > 0) or shrunk (end < 0) at the end of the Interval

interval::shift<value>, which is the Interval shifted by value, i.e. it is a shortcut for modify<value, value>
Examples:
using axis_t = axis<2>; // axis with 2 vertical regions
axis_t my_axis(5, 10); // iteration space spans 5 + 10 levels
using first_interval = axis_t::get_interval<0>; // interval [0, 4]
using second_interval = axis_t::get_interval<1>; // [5, 14]
using full_interval = axis_t::full_interval; // [0, 14]
using first_level_only_interval = full_interval::first_level; // [0]
using last_level_only_interval = full_interval::last_level; // [14]
using a_middle_level_interval = second_interval::first_level; // [5]
using a_middle_interval = a_middle_level_interval::modify<-1, 0>; // [4, 5]
using a_middle_interval2 = a_middle_interval::shift<1>; // [5, 6]
Note
Only two levels around a vertical region can be addressed in this way. This can be changed by using the method described in the next section.
Advanced Functionality for Vertical Intervals¶
The axis type has an additional template parameter to change a default restriction. A custom limit can be set as

axis<NIntervals, axis_config::offset_limit<OffsetLimit>>

where offset_limit (default: 2) defines how many levels around each vertical region can be addressed by specialized Apply-methods. Increasing this value can have negative implications on compile time.
Composition of Stencils¶
The main components of GridTools provide the capability of composing different Stages. A stage is the application of a single Stencil Operator to an Iteration Space. The ability to fuse multiple stages allows the GridTools library to improve the memory locality of the computation by taking advantage of the producer-consumer relations between them.
Note
When composing stencils, each output data field must be written only once. Failing to do so will end up in undefined behavior. We plan to introduce compile time checks to catch the violation of this condition.
The result of a composition is a Multi-Stage. For reasons of uniformity, a Multi-Stage with a single Stage is still called a multi-stage, since the composition is generally assumed to take a list of stages, even if there is just one.
Note
This decision puts the emphasis on the fact that composition is considered the main feature of GridTools. If your application is made of a single simple stencil, or of stencils that cannot be composed (see below), then GridTools may not be the right solution for you.
GridTools allows multi-stage computations to be composed. The final composition, with the addition of a Grid, is simply called a GridTools Computation.
Initially we need to specify the stages. A stage is specified by indicating a Stencil Operator and some Placeholders to its arguments. The placeholders are aliases to the actual arguments and are used to compute the data dependencies of the computation independently of the actual data fields that will be accessed by the computation. The syntax for specifying a stage uses a helper function called make_stage:
auto stage = make_stage<operator>(plc0(), plc1(), ...);
where operator is the stencil operator of the stage and plc0, plc1, … are the placeholders. The number and the intents of the placeholders depend on the Stencil Operator.
A Multi-Stage is a concatenation of stages, plus the indication of the Vertical Execution Order to be used for all the stages.
The following example demonstrates how to create a multistage for a diffusion operator.
auto multi_stage = make_multistage(
execute::forward(),
make_stage<lap_operator>(p_lap(), p_in()),
make_stage<flx_operator>(p_flx(), p_in(), p_lap()),
make_stage<fly_operator>(p_fly(), p_in(), p_lap()),
make_stage<out_operator>(p_out(), p_in(), p_flx(), p_fly())
);
Note
It is not possible to make a Multi-Stage with stages of different Vertical Execution Orders.
The data-dependence analysis of GridTools will determine the data flow and the Extents at which each data field will be accessed. This information is then passed to the architecture specific backend for the execution.
Before the computation can be executed, a Grid, specifying the Iteration Space, must be provided.
Creating a complete stencil for horizontal diffusion looks as follows:
auto horizontal_diffusion = make_computation<backend_t>(
grid,
make_multistage(
execute::forward(),
make_stage<lap_operator>(p_lap(), p_in()),
make_stage<flx_operator>(p_flx(), p_in(), p_lap()),
make_stage<fly_operator>(p_fly(), p_in(), p_lap()),
make_stage<out_operator>(p_out(), p_in(), p_flx(), p_fly())
)
);
where backend_t
specifies the Backend that will execute the computation.
The details about dependencies between stages and multistages are described in Execution Model. If two stages have no dependencies on each other, they don't need to be synchronized, and the user can mark them as independent. In the case of the horizontal diffusion, the fluxes in x and y have no dependency on each other, thus make_independent can be used:
auto horizontal_diffusion = make_computation<backend_t>(
grid,
make_multistage(
execute::forward(),
make_stage<lap_operator>(p_lap(), p_in()),
make_independent(
make_stage<flx_operator>(p_flx(), p_in(), p_lap()),
make_stage<fly_operator>(p_fly(), p_in(), p_lap())),
make_stage<out_operator>(p_out(), p_in(), p_flx(), p_fly())
)
);
make_independent does not have an impact on the data-dependency analysis, but potentially only on the execution schedule.
A computation can also have several multistages. The general signature is as follows:
make_computation<backend_t>(grid, multi_stage0, other_multi_stages...);
A computation can incorporate multiple multi-stage computations, and they may have different execution policies. This is very useful for implementing computations that require two vertical sweeps, one ascending and one descending, but it has to be used with care. Before explaining the rules for using this feature, we provide a simple example of the syntax in the vertical advection example:
auto vertical_advection = make_computation<backend_t>(
grid,
make_multistage(
execute::forward(),
make_stage<forward_op>(p_utens_stage(), p_wcon())
),
make_multistage(
execute::backward(),
make_stage<backward_op>(p_utens_stage(), p_u_pos())
)
);
This option is particularly useful on multi-core machines with caches, since the backend can actively pass information between the two multi-stages, thus substantially improving performance.
To execute a computation, the run method has to be called. Data has to be assigned to each non-temporary Placeholder.
horizontal_diffusion.run(p_out() = out_data, p_in() = in_data);
Note
No data can be assigned to temporary fields (i.e., tmp_arg Placeholders). Storage for all temporary arguments is managed by GridTools.
In case the stencil is always run on the same data fields, you can also assign those fields to the respective placeholders directly in the make_computation call, i.e. at definition time instead of run time, e.g.:
auto horizontal_diffusion = make_computation<backend_t>(
grid,
p_out() = out_data, // data assignment happens here
p_in() = in_data, // and here
make_multistage(
execute::forward(),
make_stage<lap_operator>(p_lap(), p_in()),
make_independent(
make_stage<flx_operator>(p_flx(), p_in(), p_lap()),
make_stage<fly_operator>(p_fly(), p_in(), p_lap())),
make_stage<out_operator>(p_out(), p_in(), p_flx(), p_fly())
)
);
horizontal_diffusion.run(); // no data has to be specified here
You can mix the two patterns, that is, assign data to some placeholders at
definition time and to some at run time. But each placeholder needs exactly one
data assignment, either inside make_computation
or inside the run
method. It is therefore not possible to override definition-time assignments
present in make_computation
at run time in the run
method.
There are other details that pertain to Placeholders, Grids and other GridTools constructs that can greatly improve the performance of computations, especially Software-Managed Caches.
Selecting the Backend¶
One of the key concepts of GridTools is portability between different target architectures.
Stencil operators are written and composed in an architecture-independent way and then instantiated
for a given backend. The backend is a tag type with the following possible values:

backend::cuda: a GPU-enabled backend for NVIDIA GPUs

backend::mc: a backend for modern CPUs with long vector length

backend::x86: a legacy CPU backend with a focus on caching for vertical stencils, likely to be removed in the future
Currently we recommend one of the following two backends for optimal performance:
using backend_t = backend::cuda;
for GPUs or
using backend_t = backend::mc;
for modern CPUs or Xeon Phis.
Advanced Functionality¶
Stencil Functions¶
Stencil functions offer the possibility to call Stencil Operators from other Stencil Operators. To use stencil functions, the header
#include <gridtools/stencil_composition/stencil_functions.hpp>
needs to be included. Two variants are available: function calls, which return a value, and procedural calls, which have side effects on the given arguments (Accessors).
Function Calls: call<>¶
The basic syntax for function calls is as follows
auto x = call<stencil_operator, vertical_interval, return_value_type>::with(eval, accessors...);
Let us describe the template parameters first:
stencil_operator is the operator to be called.

vertical_interval is the Vertical Interval where the operator will be applied (it can be empty if the stencil operator has an apply method without a vertical region).

return_value_type is the type of the return value of the function call. The return_value_type will be automatically deduced from the first accessor if not specified explicitly.
The context object eval
has to be passed as the first argument to with
, followed by the
Accessors which are arguments of the operator.
Note that the first accessor in the stencil operator must be an artificial accessor of type inout_accessor. This argument must not be passed to the function; instead, it is the return value of the function.
Example:
struct lap_function {
using out = inout_accessor<0>; // artificial first accessor
using in = in_accessor<1, extent<-1, 1, -1, 1>>;
using param_list = make_param_list<out, in>;
template <typename Evaluation>
GT_FUNCTION static void apply(Evaluation eval) {
eval(out()) = 4. * eval(in()) -
(eval(in(-1, 0)) + eval(in(0, -1)) + eval(in(0, 1)) + eval(in(1, 0)));
}
};
// later:
auto ret = call<lap_function>::with(eval, param()); // only one parameter
This function calculates the Laplacian of a field. Note that the function is called with only one parameter, because the first accessor (out) is the artificial accessor representing the return value.
Note
If you pass a vertical interval to call
, a matching apply method needs to exist in the called operator.
The Accessors which are passed in the function call can have offsets in the usual way. Additionally, the whole operator can be shifted to be executed on a different grid point by specifying a relative location with at:
call<...>::at<offset_i, offset_j, offset_k>::with(...);
For example, you can evaluate the laplacian at the next k-level as follows:
call<lap_function>::at<0, 0, 1>::with(eval, param());
Procedure Calls: call_proc<>¶
Procedural calls work in the same way as function calls, but all Accessors which are inout in the operator can be modified.
call_proc<stencil_operator, vertical_interval>::with(eval, accessors...);
Also offsets work the same way as for function calls. Using at
with call_proc
is not recommended and support
might be dropped in the future.
You can call the laplacian above with the following syntax:
call_proc<lap_function>::with(eval, lap(), param());
Software-Managed Caches¶
Software-Managed Caches are syntax elements used to describe the data reuse patterns of stencil computations. They are an essential functionality of GridTools for delivering an efficient implementation of memory-bound codes, since the library uses this information to allocate cached fields in a fast on-chip scratch-pad memory.
In computing architectures like NVIDIA GPUs, where the use of the different on-chip memory hierarchy must be explicitly declared using the CUDA programming model, the use of Software-Managed Caches of GridTools increases the data locality of stencil algorithms and provides a significant performance speedup.
While the library is capable of exploiting several on-chip memory layers (like texture cache, const cache, shared memory, and registers of NVIDIA GPUs) the GridTools language is abstracting these underlying memory layers and exposes syntax elements that are computing architecture agnostic.
Therefore the Software-Managed Cache syntax should be used by the user to describe only data reuse patterns, and not type of on-chip memory that should be exploited (which is a decision delegated to the computing architecture backend of the library).
An example of the syntax for caching certain fields of a computation is shown below
auto comp = make_computation<backend_t>(
grid,
make_multistage(
execute::parallel(),
define_caches(cache<cache_type::ij, cache_io_policy::local>(p_f1(), p_f2())),
make_stage<lap_function>(p_f1(), p_f2(), p_in()),
make_stage<lap_function>(p_out(), p_f1(), p_f2())
)
);
The Cache DSL elements are enclosed in a define_caches construct, which accepts any number of cache constructs. At the same time, each cache construct can specify multiple fields that share the same access pattern.
Note
It is important to note that the cache specifications prescribe the behavior of the library: if a Cache is specified, a Cache will be used. In the rare case of using too many Caches, a decrease in performance might be observed due to saturation of available resources.
The cache
construct adheres to the following syntax:
cache<cache_type, cache_io_policy>(p_args...)
where p_args...
is a list of placeholders for which the specified caching
should be used.
Full examples of Cache usage can be found in the source code, in the horizontal diffusion and vertical advection examples.
We now describe the details of each element of the cache
constructs.
Cache Type¶
The cache_type describes the type of access pattern present in our stencil for the field being cached. Its value can be one of the following (we indicate the basic means of implementation on GPUs, so that the user can understand the amount of resources involved):
cache_type::ij: cache data fields whose access pattern lies in the ij-plane, i.e. only offsets of the type i ± X or j ± Y are allowed (the GPU backend will cache these fields in shared memory). It is undefined behaviour to access data with k-offsets.

cache_type::k: cache data fields whose access pattern is restricted to the k-direction, i.e. only offsets of the type k ± Z are allowed (the GPU backend will cache these fields in registers). It is undefined behaviour to access data with offsets in the i or j direction.
Cache Policy¶
The cache_policy specifies a synchronization policy between the data in the Cache and the data in main memory. A scratch-pad can be used to hold temporary computations that do not require data persistency across multiple stencils. Often, however, the data being cached is already present in main memory fields. In this case, the Software-Managed Caches of GridTools give the possibility to specify a cache policy that synchronizes the main memory with the cached field.
The possible values are:
cache_io_policy::fill: fill the scratch-pad buffer with data from the main memory field before use.

cache_io_policy::flush: after the execution of the stencil operators, the data in the Cache is written back into the main memory fields.

cache_io_policy::fill_and_flush: the combination of fill and flush.

cache_io_policy::local: the scratch-pad data is not persistent and only available within the scope of a multi-stage. Local caches are only meaningful in connection with temporary arguments.

Fig. 6 Representation of an implementation for a cache<cache_type::k, cache_io_policy::fill_and_flush> that is used within a stencil with Extent <-2, 1> in the vertical dimension, implemented as a ring-buffer with 4 levels (in order to accommodate all possible offsetted accesses). The three operations are triggered automatically by the library for a fill_and_flush Cache when the vertical loop transitions from level 9 to level 10.¶
Expandable Parameters¶
Expandable parameters implement a “single stencil multiple storages” pattern. They are useful when we have a vector of storages which have the same storage info, and we want to perform the same operation with all of them (a typical situation when implementing e.g. time differentiation schemes). Normally this could be achieved by creating a loop and running multiple computations, but this solution would be inefficient. A more efficient solution is provided through the expandable parameters API.
The user must collect the storage pointers in a std::vector
std::vector<storage_t> list = {
storage1, storage2, storage3, storage4, storage5, storage6, storage7, storage8};
This std::vector
is then used as a storage type with no differences with respect to
the regular storages.
The implementation requires the user to specify an integer expand_factor
when defining the computation.
The optimal value for expand_factor
might need to be tuned for different backends.
auto comp = make_expandable_computation<backend_t>(
    expand_factor<4>,
    grid_,
    make_multistage(execute::forward(), make_stage<functor>(p())));
comp.run(p() = list);
The vector of storages is then partitioned into chunks of expand_factor size (plus a remainder). Each chunk is unrolled within a computation, and for each chunk a different computation is instantiated. The remainder elements are then processed one by one.
Summing up, the only differences with respect to the case without expandable parameters are:

make_expandable_computation has to be used instead of make_computation;

an expand_factor has to be passed to make_expandable_computation, defining the size of the chunks of expandable parameters to be unrolled in each computation;

a std::vector of storage pointers has to be used instead of a single storage.
All the rest is managed by GridTools: the user is not exposed to the complexity of the unrolling, can reuse the code when the expand factor changes, and can resize the expandable parameters vector dynamically, for instance by adding or removing elements.
Global Parameters and Accessors¶
Global accessors access read-only data which is independent of the current Iteration Point. For this reason, intent, Extents and offsets cannot be specified for a global accessor. Here the term global means that the data is the same for the whole Iteration Space. An example can be a constant scalar parameter that you want to pass to the functor, or a user-defined struct containing various configuration options.
The API allows the user to define an arbitrary object to act as a Global Parameter as long as it is trivially copyable.
To create a Global Parameter from a user-defined object, we pass the object to make_global_parameter()
auto my_global_parameter = make_global_parameter<backend_t>(my_object);
Note that the object is read-only only during the execution of a computation, but can be modified in between executions with
update_global_parameter<backend_t>(my_global_parameter, new_value);
The Accessor associated with a global parameter must be a global accessor
using global_acc = global_accessor<0>;
Calling eval on the global accessor returns the user-defined data structure. Supposing that this data structure contains a user function called ordinal, returning an integer, we can write in the apply method
auto ordinal_ = eval(global_acc()).ordinal();
Note
All the member functions defined in the user-defined data structure must be labeled with GT_FUNCTION
, in order for
them to be callable from devices.
Boundary Conditions¶
Introduction¶
The boundary condition module in GridTools is designed following the principle that boundary conditions can be arbitrarily complex, so we want the user to be able to specify any boundary condition code to set up their problems.
Preliminaries¶
One main concept that is needed for the boundary condition is the one of direction.
In a 3D regular grid, which is where this implementation of the boundary condition library applies, we associate a 3D axis system, and the cell indices (i, j, k) naturally lie on it. With this axis system the concept of "vector" can be defined to indicate distances and directions. Direction is the one thing we need here. Instead of using unitary vectors to indicate directions, as is usually the case for euclidean spaces, we use vectors whose components are -1, 0, and 1. For example, \((1, 1, 1)\) is the direction indicated by the unit vector \((1, 1, 1)/\sqrt3\). If we take the center of a 3D grid, then we can define 26 different directions \(\{(i, j, k): i, j, k \in \{-1, 0, 1\}\}\setminus\{(0, 0, 0)\}\) that identify the different faces, edges and corners of the cube to which the grid is topologically analogous.
The main idea is that a boundary condition class specializes operator() on a direction, or a subset of directions, and then performs the user-specified computation on the boundaries in those directions.
The user can define their own boundary condition classes and perform
specific computation in each direction. For this reason GridTools provides
a direction
type which can take three direction values, that are
indicated as minus_
, plus_
and zero_
, which are values of an
enum
called sign
.
Boundary Condition Class¶
The boundary condition class is a regular class which needs to be copy constructible, and whose member functions should be decorated with the GT_FUNCTION keyword to enable accelerators. It must not contain references to data that may not be available on the target device where the boundary conditions are applied.
The boundary condition class provides overloads of operator() which take as first argument a direction object, then a number of Data Stores that are the inputs, and finally three integer values that will contain the coordinate indices of the cell being iterated on.
All overloads must have the same number of arguments: the first argument is the direction for which the overload applies, then the list of Data Views that will be accessed by the boundary class, and finally three integers that contain the indices of the element being accessed in the call.
It is standard practice to let the view types be template arguments. For instance, here is a class that applies a copy-boundary condition (copy the second view into the first one) for all directions except those for which the third component is minus_:
struct example_bc {
double value;
GT_FUNCTION
example_bc(double v) : value(v) {}
template <typename Direction, typename DataField0, typename DataField1>
GT_FUNCTION void operator()(Direction,
DataField0 &data_field0, DataField1 const &data_field1,
unsigned i, unsigned j, unsigned k) const
{
data_field0(i, j, k) = data_field1(i, j, k);
}
template <sign I, sign J, typename DataField0, typename DataField1>
GT_FUNCTION void operator()(direction<I, J, minus_>,
DataField0 &data_field0, DataField1 const &,
unsigned i, unsigned j, unsigned k) const
{
data_field0(i, j, k) = value;
}
};
operator() of the boundary class is called by the library on the 26 directions, receiving the data values that correspond to each direction. In the previous example, each direction in which the third component is minus_ will select the specialized overload, while all other directions select the first implementation.
Boundary Condition Application¶
To apply the above boundary conditions class to the data fields, we need to construct the boundary object, but also to specify the Halo regions. The Halo regions are specified using Halo Descriptors.
To do this we need an array of Halo Descriptors initialized with the Halo information of the data fields.
Note
The fifth number, namely the total length, in the Halo Descriptor is not used by the boundary condition application module, but we kept the same implementation we had in other parts of GridTools to avoid introducing too many concepts. This will be improved in the future.
array<halo_descriptor, 3> halos;
halos[0] = halo_descriptor(1, 1, 1, d1 - 2, d1);
halos[1] = halo_descriptor(1, 1, 1, d2 - 2, d2);
halos[2] = halo_descriptor(1, 1, 1, d3 - 2, d3);
After this is done we can apply the boundary condition by, as in this
example, constructing the boundary object and applying it to the data
fields. The number of data fields to pass is equal to the number of
fields the operator()
overloads of the boundary class require.
boundary<example_bc, backend_t>(halos, example_bc(42)).apply(out_s, in_s);
As can be noted, the backend
is also needed to select the proper
implementation of the boundary application algorithm (see Backend). out_s
and
in_s
are the two data fields passed to the application. The fact
that the first is the output and second is the input derives from the
signature of the overloads of operator()
, and it is user defined.
Boundary Predication¶
Predication is an additional feature to control the boundary application. The predicate type has to be specified as a template argument of the boundary class, and an instantiated object of that type passed as the third argument of the boundary class constructor, as in the following example:
boundary<direction_bc_input<uint_t>, backend_t, predicate_t>
(halos, direction_bc_input<uint_t>(42), predicate_t{}).apply(out_s, in_s);
The predicate must obey a fixed interface: it has to accept a direction object as argument, so that the user can, at runtime, disable some operator() overloads. This can be very useful when the user is running on a domain-decomposed parallel domain, and only the global boundaries need to be updated with the boundary condition application, while the rest should have their Halos updated from the neighbors.
Provided Boundary Conditions¶
GridTools provides a few boundary application classes for some common cases. They are:
copy_boundary, to copy the boundary of the last field of the argument list of apply into the other ones;
template <class T> value_boundary, to set the boundary to a value for all the data fields provided;
zero_boundary, to set the boundary to the default-constructed value type of the data fields (usually a zero) for the input fields.
Halo Exchanges¶
Introduction¶
The communication module in GridTools is dubbed GCL. It is a low-level halo-update interface for 3D fields: it takes 3D arrays of some type, together with descriptions of the halos, and performs the exchanges in a scalable way.
It is low-level because the requirements from which it was initially designed demanded easy interoperability with C and Fortran, so the API takes pointers and sizes. The sizes are specified by halo_descriptors, which are loosely inspired by the BLAS description of matrix dimensions. A new, more modern set of interfaces is being implemented to serve more general cases, such as higher dimensions and other grids.
We first start with some preliminaries and then discuss the main interfaces.
Preliminaries¶
Processor Grid¶
The processor grid is a concept that describes a 3D lattice of computing elements (you may think of those as MPI tasks). They are identified by tuples of indices. This naturally maps to a 3D decomposition of a data field.
Layout Map¶
The communication layer needs two Layout Maps: one for describing the data, and one for the processor grid. For the user, the dimensions of the data are always indicated as first, second, and third (or i, j, k); it is the Layout Map that indicates the stride order, as in the following example:
// i, j, k
layout_map<1, 0, 2>
This Layout Map indicates that the first dimension in the data (i) is the second in the decreasing stride order, while the second (j) has the biggest stride, and the last dimension (k) is the one with stride 1. The largest strides are associated with the smaller indices, so that layout_map<0, 1, 2> corresponds to a C layout, while layout_map<2, 1, 0> corresponds to a Fortran layout.
The second Layout Map template argument in the Halo Exchange pattern is the map between data coordinates and processor grid coordinates. The following layout specification
layout_map<1, 0, 2>
would mean: the first dimension of the data matches the second dimension of the computing grid, the second dimension of the data matches the first dimension of the processing grid, and the third matches the third. This is rarely different from layout_map<0, 1, 2>, so it can generally be ignored, but we give an example to clarify its meaning.
Suppose the processor grid (domain decomposition sizes) has size PIxPJx1, and we want to say that the first dimension of the data ‘extends’ along (or corresponds to) the first dimension of the computing grid. Consider a 2x1 process grid, with the first dimension of the data being the rows (i) and the second the columns (j). In this case we are assuming a distribution like the one in Fig. 7.

Fig. 7 Example data distribution among two processes.¶
In this case the map between data and the processor grid is:
layout_map<0, 1, 2> // The 3rd dimension stride is 1
On the other hand, specifying
layout_map<1, 0, 2>
for this map would imply a layout/distribution like the one in Fig. 8.

Fig. 8 Example data distribution among two processes.¶
Here the second dimension of the data corresponds to the first dimension of the processor grid. Again, the ordering of the data coordinates is the one the user chooses as the logical order in the application, not the increasing stride order.
Halo Descriptor¶
Given a dimension of the data (array), the communication module requires the user to describe it using the halo_descriptor class, which takes five integers. This class identifies the data that needs to be exchanged.
Consider a dimension which has minus halo lines on one side and plus halo lines on the other (minus and plus indicate the sides close to index 0 and to the last index of the dimension, respectively). The beginning of the inner region is marked by begin and its ending by end. The end is inclusive, meaning that the index specified by it is part of the inner region. One additional value is necessary, the total_length, which must be at least end - begin + 1 + minus + plus. This parameter is the equivalent of the “leading dimension” in BLAS. With these five numbers we can describe arbitrary dimensions, with padding on the left and on the right, such as the example in Fig. 9.
The interface for specifying a halo descriptor is fairly simple, where the name of arguments should be self-explanatory:
halo_descriptor(uint_t minus, uint_t plus, uint_t begin, uint_t end, uint_t total_length)
Todo
annotate example with minus plus begin and end

Fig. 9 Example halo descriptor with one halo point on the left and two on the right.¶
GCL Communication Module¶
Now we are ready to describe the Halo Exchange pattern objects. The first one is halo_exchange_dynamic_ut. The ut suffix stands for uniform types, meaning that the data fields this object will manage must all store the same value type, which is declared at instantiation time. The domain decomposition goes up to three dimensions, and the data to be exchanged is contained in 3-dimensional arrays (lower dimensions can be handled by setting the missing dimensions to 1). Being designed for three-dimensional data, the layout maps have three elements (refer to Storage Info for more information).
The type of the object is defined as in this example:
using pattern_type = halo_exchange_dynamic_ut<layout_map<0, 1, 2>,
layout_map<0, 1, 2>, value_type, gcl_cpu>;
The template arguments are:
the layout of the data;
the mapping between the data dimensions and the processing grid, as described above (leave it as layout_map<0, 1, 2> if in doubt);
the type of the values to be exchanged;
the place where the data lives and for which the code is optimized. The options for this argument are gcl_gpu, gcl_cpu and gcl_mc (not supported yet).
The Halo Exchange object can be instantiated as:
pattern_type he(pattern_type::grid_type::period_type(true, false, true), CartComm);
Here period_type indicates whether the corresponding dimensions are periodic or not, and CartComm is the MPI communicator describing the computing grid.
After the object has been instantiated, the user registers the halos for each of the three dimensions, passing the five numbers described above (0 is the first dimension):
he.add_halo<0>(minus0, plus0, begin0, end0, len0);
he.add_halo<1>(minus1, plus1, begin1, end1, len1);
he.add_halo<2>(minus2, plus2, begin2, end2, len2);
When registration is done, a setup function must be called before running data exchanges. The argument of the setup function is the maximum number of data arrays that the pattern will exchange in a single step. In this example we set it to 3, so exchanging more than 3 fields will lead to a runtime error. Be aware that setting a larger number of supported fields leads to larger memory allocations. The code looks like:
he.setup(3);
Now we are ready to exchange the data, by passing (up to) three pointers to the data to pack, then calling exchange and then unpack into the destination data, as in the following example:
he.pack(array0, array1, array2);
he.start_exchange();
he.wait();
he.unpack(array0, array1, array2);
Alternatively, the pointers can be put in a std::vector<value_type*> so that the code would look like:
he.pack(vector_of_pointers);
he.start_exchange();
he.wait();
he.unpack(vector_of_pointers);
An alternative pattern supporting different element types is:
using pattern_type = halo_exchange_generic<layout_map<0, 1, 2>, arch_type>;
Now the Layout Map in the type is the mapping of dimensions to the computing grid (the number of dimensions is 3, so the layout map has three elements), and arch_type is either gcl_gpu, gcl_cpu or gcl_mc (not supported yet).
The construction of the object is identical to the previous one, but the set-up is somewhat more complex now, since we have to indicate the maximum sizes and the number of fields we will exchange using this object.
array<halo_descriptor, 3> halo_dsc;
halo_dsc[0] = halo_descriptor(H1, H1, H1, DIM1 + H1 - 1, DIM1 + 2 * H1);
halo_dsc[1] = halo_descriptor(H2, H2, H2, DIM2 + H2 - 1, DIM2 + 2 * H2);
halo_dsc[2] = halo_descriptor(H3, H3, H3, DIM3 + H3 - 1, DIM3 + 2 * H3);
he.setup(4, // maximum number of fields
field_on_the_fly<int, layoutmap, pattern_type::traits>(nullptr, halo_dsc),
sizeof(biggest_type_to_be_used)); // Estimates the sizes
The halo descriptors above indicate the largest arrays the user will exchange, while the field_on_the_fly specifies a type and layout (and mandatory traits). Neither the type nor the layout has any effect here; the traits are important, and the halos are essential. With this pattern, the user also needs to indicate the size of the largest value type they will exchange.
When using the pattern, each data field should be wrapped into a field_on_the_fly object, such as:
field_on_the_fly<value_type1, layoutmap1, pattern_type::traits> field1(
ptr1, halo_dsc1);
field_on_the_fly<value_type2, layoutmap2, pattern_type::traits> field2(
ptr2, halo_dsc2);
field_on_the_fly<value_type3, layoutmap3, pattern_type::traits> field3(
ptr3, halo_dsc3);
Now each field can have a different type, layout, and halo descriptors. The exchange happens very similarly to before:
he.pack(field1, field2, field3);
he.exchange();
he.unpack(field1, field2, field3);
The interface accepting a std::vector also works for this pattern (in case all the fields have the same type).
Distributed Boundary Conditions¶
Design Principles:¶
When using expandable parameters, the user may want to apply BCs and perform communications on a subset of the Data Stores collected in these data representations. For this reason the interface for applying distributed boundary conditions takes single Data Stores only.
The user may want to apply different BCs to the same Data Store at different times during an execution, so the binding between BCs and Data Stores should be done at member-function level, not at class level, in order to remove the need for instantiating heavy objects like Halo-update objects.
The same holds for the Data Stores to be exchanged: we need to plug in the Data Stores at the last minute, before doing the packing/unpacking and the boundary apply. The requirement imposed by the underlying communication layer is that the number of data fields to be exchanged must be less than or equal to the maximum number of data fields specified at construction time.
The Halo Exchange patterns are quite heavy objects, so they have to be constructed once and passed around as references. The setup needs to be executed only once to prevent memory leaks.
The Halo information for communication could be derived from a storage_info class, but there may be cases in which separate Halo information is provided, and different storage_infos (with different indices, for instance) may have the same communication requirements (for instance in case of implicit staggering). For this reason the halo_descriptor is passed explicitly to the distributed boundary construction interface.
The value_type should be passed as an additional template parameter to the distributed boundaries interfaces. The value_type is used to compute the sizes of the buffers and the data movement operations needed by communication.
Communication Traits¶
Communication traits help the distributed boundary condition interface customize itself to the needs of the user. A general communication traits class is available in distributed_boundaries/comm_traits.hpp. The traits required by the distributed boundaries interface, as provided by GridTools, are listed below.
template <typename StorageType, typename Arch>
struct comm_traits {
using proc_layout = gridtools::layout_map<...>; // Layout of the processing grid to relate the data layout to the distribution of data
using proc_grid_type = MPI_3D_process_grid_t; // Type of the computing grid
using comm_arch_type = Arch; // Architecture for the communication pattern
using compute_arch = backend::...; // Architecture of the stencil/boundary condition backend
static constexpr int version = gridtools::packing_version::...; // Packing/Unpacking version
using data_layout = typename StorageType::storage_info_t::layout_t; // Layout of data
using value_type = typename StorageType::data_t; // Value Type
};
Binding Boundaries and Communication¶
GridTools provides a facility for applying boundary conditions, and the distributed boundaries interface uses this facility underneath. The boundary application in GridTools accepts specific boundary classes that specify how to deal with boundaries in different directions, and predicates to deal with periodicity and domain decomposition. The latter are handled by the distributed boundary interfaces (refer to the boundary condition interfaces for more details).
The distributed boundaries interface requires the user to specify which Data Stores require communication and which also require boundary conditions, and in the latter case which boundary functions to use.
This is specified through the function bind_bc, which has the following signature:
unspecified_type x = bind_bc(boundary_class, data_stores, ...);
The number of Data Stores is dictated by boundary_class::operator(), which is user defined (or provided by GridTools). The Data Stores specified in the function call will be passed to the boundary_class and also used in the Halo-update operations.
However, some data fields used in boundary conditions may be read-only, and should not be passed to the Halo-update operation, both to avoid unnecessary work and to limit the amount of memory used by the Halo-update layer. For this reason the data_stores passed to bind_bc can actually be std::placeholders. Only the actual data_stores specified in the bind_bc call will be passed to the communication layer. To bind the placeholders to actual data_stores, the user must bind them using .associate(data_stores...), with the same mechanism used in std::bind, as in the following example, in which data_store c is associated with placeholder _1:
using namespace std::placeholders;
bind_bc(copy_boundary{}, b, _1).associate(c);
This example copies the boundary of c into b, and performs the halo exchange for b. The halo exchange will not be executed on c, which is read-only in this call.
If halo exchanges should be applied to both fields, and the boundary of c should be copied into b, both fields should be passed directly to bind_bc:
bind_bc(copy_boundary{}, b, c);
Distributed Boundaries¶
The distributed boundaries class takes the communication traits as a template argument. In the next example we use the communication traits class provided by GridTools; communication_arch is one of the GCL specifiers of where the data accessed by a Halo Exchange object resides.
using dbs_t = distributed_boundaries<comm_traits<storage_type, communication_arch>>;
During construction, more information about the Halo structure is required; here we use the usual Halo Descriptors. The user also needs to indicate which dimensions are periodic (refer to GCL Communication Module for more information); this is done using another GridTools facility, the boollist. Finally, to let the library compute the right amount of memory to allocate beforehand, the maximum number of fields to be exchanged in one call has to be specified. The code showing an example of how to do this follows:
halo_descriptor di{halo_sx0, halo_dx0, begin0, end0, len0};
halo_descriptor dj{halo_sx1, halo_dx1, begin1, end1, len1};
halo_descriptor dk{halo_sx2, halo_dx2, begin2, end2, len2};
array<halo_descriptor, 3> halos{di, dj, dk};
boollist<3> periodicity{b0, b1, b2}; // b0, b1, b2 are booleans. If true it will indicate that the corresponding dimension is periodic across the grid of processors.
int max_ds = 4; // maximum number of data stores to be used in a halo_update operation
dbs_t dbs{halos, periodicity, max_ds, MPI_COMMUNICATOR};
The halos here describe the local Data Store sizes, which are usually the tiles of a domain-decomposed global domain with global boundaries. The idea is to apply the boundary conditions to the global boundaries while performing Halo updates for the halo regions between sub-domains.
The distributed_boundaries object allows the user to query the properties of the grid of processes, for instance the coordinates of the current process and the sizes of the computing grid.
int pi, pj, pk;
dist_boundaries.proc_grid().coords(pi, pj, pk); // Coordinates of current process
int PI, PJ, PK;
dist_boundaries.proc_grid().dims(PI, PJ, PK); // Sizes of the current grid of processes
When invoking the boundary application and Halo-update operations, the user calls the exchange member function of distributed_boundaries. The arguments of exchange are either Data Stores or bind_bc objects, which associate a boundary condition with Data Stores. The Data Stores passed directly to exchange have their halos updated according to the halo and periodicity information specified at distributed_boundaries construction. Arguments created with bind_bc are updated as mentioned above: halo exchanges are applied only to the fields passed directly to bind_bc, not to those passed through associate.
Next, we show a complete example where two boundaries are applied: a fixed value on Data Store a, and a copy_boundary to copy the value of Data Store c into Data Store b (refer to GCL Communication Module). The halos of Data Store c will not be exchanged; this field serves as the source of data for the copy_boundary. Three fields will have their halos updated by the next example, namely a, b and d:
dist_boundaries.exchange(bind_bc(value_boundary<double>{3.14}, a), bind_bc(copy_boundary{}, b, _1).associate(c), d);
An additional facility is provided as an alternative to the exchange method: boundary_only skips the Halo updates altogether. The code to use it is identical to the previous example, barring the function name:
dist_boundaries.boundary_only(bind_bc(value_boundary<double>{3.14}, a), bind_bc(copy_boundary{}, b, _1).associate(c), d);
This function will not do any halo exchange, but only update the boundaries of a and b. Passing d is possible, but redundant, since no boundary condition is associated with it.
Interfacing to other programming languages¶
GridTools provides an easy macro interface to generate bindings to C and Fortran. Suppose the user wants to export the function add_impl:
int add_impl(int l, int r) {
return l + r;
}
The macros GT_EXPORT_* provide ways to generate bindings to functions; the different flavours of these macros are explained below. The macro generates a wrapper around the function add_impl, called add, and registers it to be exported.
#include <gridtools/c_bindings/export.hpp>
GT_EXPORT_BINDING_2(add, add_impl);
The user can generate a C header and a Fortran module matching this header by adding a call to gt_add_bindings_library in their CMake project:
find_package(GridTools REQUIRED)
list(APPEND CMAKE_MODULE_PATH "${GridTools_MODULE_PATH}")
include(gt_bindings)
gt_add_bindings_library(add_lib SOURCES add.cpp)
This will generate a library add_lib which contains the exported symbol add, and a target add_lib_declarations that generates the files add_lib.h and add_lib.f90, containing the bindings that can be used from C and Fortran.
The C header contains the exported function (boilerplate code removed):
int add(int, int);
The generated Fortran module contains the corresponding declaration:
module add_lib
implicit none
interface
integer(c_int) function add(arg0, arg1) bind(c)
use iso_c_binding
integer(c_int), value :: arg0
integer(c_int), value :: arg1
end function
end interface
end
Exporting functions with no array-type arguments¶
There exist various flavours of these macros. Functions which are non-templated or fully specialized can be exported with GT_EXPORT_BINDING, for example:
int add_impl(int, int) { return 0; }
GT_EXPORT_BINDING_2(add, add_impl);
template <typename T>
T add_impl(T, T) { return {}; }
GT_EXPORT_BINDING(2, add, add_impl<int>);
All macros exist in two flavours: either the number of arguments is part of the macro name (GT_EXPORT_BINDING_2 stands for two arguments), or it is passed as the first argument to the generic GT_EXPORT_BINDING. The named flavours exist for up to 9 arguments.
Note that GT_EXPORT_BINDING_X requires a name and a function pointer as its arguments. A lambda cannot be passed as a function pointer; thus, the types of the arguments cannot be deduced. In such cases, the function can be exported with GT_EXPORT_BINDING_WITH_SIGNATURE_X, which additionally takes the function type as an argument:
GT_EXPORT_BINDING_WITH_SIGNATURE_2(add, int(int, int), [](int l, int r) { return l + r; });
Templated functions can be exported for a given set of specializations using GT_EXPORT_GENERIC_BINDING_X. In addition to the function name and the function pointer, it takes a list of overloads for which the bindings are generated:
GT_EXPORT_GENERIC_BINDING(2, add, add_impl, (int, int)(double, int));
In the generated Fortran module, generic bindings will produce an interface combining the different overloads:
interface
integer(c_int) function add_f0(arg0, arg1) bind(c)
use iso_c_binding
integer(c_int), value :: arg0
integer(c_int), value :: arg1
end function
real(c_double) function add_f1(arg0, arg1) bind(c)
use iso_c_binding
real(c_double), value :: arg0
integer(c_int), value :: arg1
end function
end interface
interface add
procedure add_f0, add_f1
end interface
Complex types¶
Only a limited set of types can be passed from Fortran / C through the C bindings interface to C++, namely integral and floating point types, booleans and pointers to those types.
Array references, GridTools storages, and any type that is fortran_array_bindable appear as gt_fortran_array_descriptor in the C bindings. This structure allows the user to describe the data that needs to be passed to C++.
It is possible to write bindings to functions that accept or return other types. During the generation process, these are replaced with pointers to the type gt_handle.
std::vector<int> make_vector_impl() { return {}; }
void use_vector_impl(std::vector<int>) {}
GT_EXPORT_BINDING_0(make_vector, make_vector_impl);
GT_EXPORT_BINDING_1(use_vector, use_vector_impl);
The code above will generate the following signatures in the C-header:
gt_handle* make_vector();
void use_vector(gt_handle*);
The user needs to make sure that the types that stand behind gt_handle match, otherwise an exception will be thrown.
Exporting functions with array-type arguments to Fortran¶
Special macros exist to export functions that take array-like arguments to Fortran. While the normal macros export such arguments as gt_fortran_array_descriptor, the “wrapped” macros create additional wrappers around the functions that fill in the descriptor structures themselves.
void dummy_impl(int (&a)[2][2]) {}
GT_EXPORT_BINDING_WRAPPED_1(dummy, dummy_impl);
The function dummy_impl takes a reference to an array. When exporting this function with GT_EXPORT_BINDING_WRAPPED_X, an additional wrapper is generated in the Fortran bindings:
module add_lib
implicit none
interface
subroutine dummy_impl(arg0) bind(c, name="dummy")
use iso_c_binding
use array_descriptor
type(gt_fortran_array_descriptor) :: arg0
end subroutine
end interface
contains
subroutine dummy(arg0)
use iso_c_binding
use array_descriptor
integer(c_int), dimension(:,:), target :: arg0
type(gt_fortran_array_descriptor) :: descriptor0
descriptor0%rank = 2
descriptor0%type = 1
descriptor0%dims = reshape(shape(arg0), &
shape(descriptor0%dims), (/0/))
descriptor0%data = c_loc(arg0(lbound(arg0, 1),lbound(arg0, 2)))
call dummy_impl(descriptor0)
end subroutine
end
This allows calling the Fortran subroutine dummy in a convenient way:
integer(c_int), dimension(:, :) :: some_array
call dummy(some_array)
The bindings will take care that the rank matches, and the size of the array is inferred automatically.
Wrapped variants of the other macros exist as well and behave analogously, namely GT_EXPORT_BINDING_WITH_SIGNATURE_WRAPPED and GT_EXPORT_BINDING_GENERIC_WRAPPED.
Data types need to be fortran_array_wrappable in order to be compatible with these macros. Natively, only C arrays and fortran_array_adapter are fortran_array_wrappable. The latter is an adapter between Fortran arrays and GridTools storages, which allows the user to pass a Fortran array to a C++ function, where it can then be transformed into a GridTools storage.
#include <gridtools/interface/fortran_array_adapter.hpp>
using storage_info_t = storage_traits<Backend>::storage_info_t<0, 3>;
using data_store_t = storage_traits<Backend>::data_store_t<double, storage_info_t>;
void modify_array_impl(fortran_array_adapter<data_store_t> inout) {
data_store_t data_store{storage_info_t{10, 10, 10}};
transform(data_store, inout);
// use data_store
transform(inout, data_store);
}
GT_EXPORT_BINDING_WRAPPED_1(modify_array, modify_array_impl)
CMake usage¶
A call to gt_add_bindings_library generates the libraries and the headers. By default, the C header file and the Fortran file are written directly into the source tree. This choice was made to improve building in cross-build environments, because the process cannot rely on generated binaries being executable on the host system. The output folders can be overridden by setting FORTRAN_OUTPUT_DIR and C_OUTPUT_DIR.
By default, the name of the generated Fortran module is set to the name of the library. A different name can be set with FORTRAN_MODULE_NAME.