# Summing the entries of a matrix using a stencil with Thrust

*Posted on 22 April 2013*

The task sounds simple: sum every element according to a stencil in a two-dimensional layout. Ideally it should be done on the GPU, provided the input is already sitting in the device memory. This operation occurs, for instance, in agent-based modelling, when the agent collects information about its environment. The stencil in that case is a circle with a radius that corresponds to the agents' horizon:

The idea would be to avoid hand-tuning CUDA reduction kernels, and rely on Thrust to save work. The sum we need is a kind of inner product of the stencil and the submatrix tile at a given offset:

```
thrust::inner_product(tile.begin(), tile.end(),
stencil.begin(), 0);
```

Yet, given Thrust's fundamentally one-dimensional focus, it proves to be
a complicated task to accomplish. The difficulty arises that we need to
extract a 2D sub-array from a 2D matrix. We need to define a new class
tile, with its own `begin()`

and `end()`

functions that can serve as input
for the above call. The constructor simply fills in the values for its
private variables that keep track of the tile size and the leading
dimension of the matrix:

```
tiled_range(Iterator first, Iterator last,
data_type tile_size_x, data_type tile_size_y,
data_type leading_dimension)
: first(first), last(last),
tile_size_x(tile_size_x), tile_size_y(tile_size_y),
leading_dimension(leading_dimension) {}
```

To get the iterator right for the `tiled_range`

class, we will need a
counting, a transform, and a permutation iterator, with the latter being
the iterator type for our class. To keep things short, these are
typedef-ed.

```
typedef typename thrust::counting_iterator<data_type>
CountingIterator;
typedef typename thrust::transform_iterator<tile_functor, CountingIterator>
TransformIterator;
typedef typename thrust::permutation_iterator<Iterator,TransformIterator>
PermutationIterator;
```

The permuation iterator takes two arguments:

- An iterator to the range V on which the permutation will be applied
- The reindexing scheme that defines how the elements of V will be permuted.

The first one will be the entire matrix, whereas the "permutation" will be a new transform iterator that ensures only the tile elements are returned. The transform iterator in turn takes two parameters:

- An
`Iterator`

pointing to the input to this`transform_iterator`

's`AdaptableUnaryFunction`

. - An
`AdaptableUnaryFunction`

used to transform the objects pointed to by`x`

.

The iterator will be a counting operator starting at zero. The unary function maps the linear counter to the correct index of the matrix based on the tile parameters and the leading dimension of the matrix. The functor itself is fairly simple:

```
struct tile_functor : public thrust::unary_function<data_type,data_type>
{
data_type tile_size_x;
data_type leading_dimension;
tile_functor(data_type tile_size_x, data_type leading_dimension)
: tile_size_x(tile_size_x), leading_dimension(leading_dimension) {}
__host__ __device__
data_type operator()(const data_type& i) const
{
int x=i % tile_size_x;
int y=i / tile_size_x;
return leading_dimension*y + x;
}
};
```

With all components prepared, the `begin()`

function becomes

```
PermutationIterator begin(void) const
{
return PermutationIterator(first, TransformIterator(CountingIterator(0),
tile_functor(tile_size_x, leading_dimension)));
}
```

The `end()`

function is again very simple:

```
PermutationIterator end(void) const
{
return begin() + tile_size_y * tile_size_x;
}
```

A complete example is given here, which should just compile with nvcc. The main problem is that the permutation iterator has dubious efficiency in accessing the GPU memory, and accesses are probably not as coalesced as they should be. The usage example mentioned in the first paragraph uses small stencil at random locations, which is pretty hard on the GPU. Speedups around 8x are still apparent on Fermi hardware compared to a single core, which is not great, but it is a workable solution in a hybrid scenario where other CPU cores can be kept busy with OpenMP or MPI.