To implement multi-GPU stencil computation using Accelerate
Takayuki Muranushi
nushio at yukawa.kyoto-u.ac.jp
Sat Jul 9 14:01:13 BST 2011
Hello,
I'm trying to benchmark Accelerate by applying some
computationally heavy functions to a large array. In short, I'd like
to generate code like this:
__device__ Real f (Real x) {
    for (int i = 0; i < iteration; ++i) {
        x = 4*x*(1-x);
    }
    return x;
}
What I've tried is this:
https://github.com/nushio3/accelerate-test/blob/411048b78dfc0c95be0e9b610c37f3e47dfda7b7/step04/Cell.hs
However, looking at the CUDA_PROFILE output, each map is launched as a
separate CUDA kernel, so I cannot get good benchmark results (no more
than 20 GFLOPS in single precision on an M2050).
On the other hand, if I try the following expression:
    (foldl (.) id $ replicate iteration (\x -> 4*x*(1-x)))
Accelerate tries to expand every term. Since x appears twice in the
body, the size of the expression grows exponentially with 'iteration',
and I quickly get a stack overflow.
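
To illustrate the blow-up, here is a minimal sketch in plain Haskell. It
does not use Accelerate at all: the Expr type and the names size and
unrolled are hypothetical, and it assumes the embedded expression is
traversed as a tree, with no sharing of the two occurrences of x.

```haskell
-- Hypothetical toy AST, standing in for an embedded scalar expression.
-- This is NOT Accelerate's real Exp type.
data Expr
  = Var
  | Lit Integer
  | Add Expr Expr
  | Sub Expr Expr
  | Mul Expr Expr

-- Overloaded arithmetic builds syntax trees instead of computing numbers.
instance Num Expr where
  (+) = Add
  (-) = Sub
  (*) = Mul
  fromInteger = Lit
  negate e = Sub (Lit 0) e
  abs = error "abs: not needed for this sketch"
  signum = error "signum: not needed for this sketch"

-- Number of nodes seen by a traversal that ignores sharing.
size :: Expr -> Integer
size Var = 1
size (Lit _) = 1
size (Add a b) = 1 + size a + size b
size (Sub a b) = 1 + size a + size b
size (Mul a b) = 1 + size a + size b

-- The composed function from above, applied to a symbolic variable.
unrolled :: Int -> Expr
unrolled n = foldl (.) id (replicate n (\x -> 4*x*(1-x))) Var

main :: IO ()
main = mapM_ report [1, 2, 3, 10, 20]
  where
    report n = putStrLn (show n ++ " iterations: "
                         ++ show (size (unrolled n)) ++ " nodes")
```

Each application turns a tree of s nodes into one of 2*s + 5 nodes, so
the counts for 1, 2 and 3 iterations are 7, 19 and 43, and the size
roughly doubles with every further iteration.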
Any good ideas?
--
MURANUSHI Takayuki
The Hakubi Center, Kyoto University : http://www.hakubi.kyoto-u.ac.jp/