To implement multi-GPU stencil computation using Accelerate

Sat Jul 9 14:01:13 BST 2011

Hello,
I'm trying to benchmark Accelerate by applying some computationally
heavy functions to a large array. In short, I'd like to generate code
like this:

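// Real (e.g. float) and iteration are assumed to be defined elsewhere.
// Applies the logistic map x -> 4x(1-x) 'iteration' times in one call.
__device__ Real f (Real x){
  for (int i = 0; i < iteration; ++i) {
    x = 4*x*(1-x);
  }
  return x;
}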

What I've tried is this:

https://github.com/nushio3/accelerate-test/blob/411048b78dfc0c95be0e9b610c37f3e47dfda7b7/step04/Cell.hs

However, the CUDA_PROFILE output shows that each map is launched as a
separate CUDA kernel, so I cannot get good benchmark results (no more
than 20 GFLOPS in single precision on an M2050).

On the other hand, if I try the following expression:

(foldl (.) id $ replicate iteration (\x -> 4*x*(1-x)))

Accelerate tries to expand every term. Because x appears twice in
4*x*(1-x), the size of the expression grows exponentially with
'iteration', and I quickly get a stack overflow.
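
To make the growth concrete, here is a minimal sketch in plain Haskell. It
does not use Accelerate: the Expr type and the names step, iterated and size
are hypothetical, introduced only for this illustration, and it assumes the
term is walked as a tree with no sharing of repeated subterms.

```haskell
module Main where

-- Toy expression tree. This is NOT Accelerate's real AST; it is a
-- hypothetical stand-in used only to count nodes.
data Expr
  = X                 -- the input variable
  | Lit Integer       -- a numeric literal
  | Add Expr Expr
  | Sub Expr Expr
  | Mul Expr Expr
  deriving (Show, Eq)

-- Overload arithmetic so that ordinary numeric code builds a tree.
instance Num Expr where
  (+)         = Add
  (-)         = Sub
  (*)         = Mul
  fromInteger = Lit
  negate e    = Sub (Lit 0) e
  abs         = error "abs: not needed for this sketch"
  signum      = error "signum: not needed for this sketch"

-- One step of the map, written as in the post.
step :: Num a => a -> a
step x = 4*x*(1-x)

-- The composition from the post: step applied n times.
iterated :: Num a => Int -> a -> a
iterated n = foldl (.) id (replicate n step)

-- Count nodes, visiting every occurrence of a subterm separately.
size :: Expr -> Integer
size X         = 1
size (Lit _)   = 1
size (Add a b) = 1 + size a + size b
size (Sub a b) = 1 + size a + size b
size (Mul a b) = 1 + size a + size b

main :: IO ()
main = mapM_ report [1 .. 10]
  where
    report n = putStrLn (show n ++ " iterations: "
                         ++ show (size (iterated n X)) ++ " nodes")
```

Each step adds five nodes and duplicates its argument, so in this toy model
size(n) = 2*size(n-1) + 5 with size(0) = 1: the count roughly doubles with
every extra iteration.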

Any good ideas?

-- 
MURANUSHI Takayuki
The Hakubi Center, Kyoto University : http://www.hakubi.kyoto-u.ac.jp/