static code generation and open Accs

Mon Aug 23 22:09:27 EDT 2010

Hi Manuel and Stephen,

We also share Stephen's requirement.  We want to invoke an Accelerate computation with new input data multiple times with minimal overhead.

I also agree that the principle of Manuel's option (2) below is the best way forward.  However, having experimented, I think a number of additional issues need to be considered.

The front-end AST conversion and the lookup of a cached version of the code are slow, even excluding compilation and code generation.  With the current version of the front-end and CUDA back-end, these two steps take ~170ms.  With the proposed variant of 'run', won't this penalty still be incurred each time the variant is called?  I believe there is lots of scope for speeding this up by optimising the language front-end and CUDA back-end.  However, my guess is that the penalty will always be an issue for applications that repeat the computation at a high frequency.

I guess there needs to be a way of bypassing the overhead of AST conversion and code lookup - so that there is a way of directly invoking the computation again with new 'input data'.

Is there a way to achieve this?
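
To make the question concrete, here is a minimal sketch of the shape such an interface might take. It uses a toy expression type rather than Accelerate's real AST; 'Exp', 'prepare' and 'step' are hypothetical names, and 'prepare' only stands in for the expensive conversion and lookup.

```haskell
-- Toy expression type standing in for an Accelerate program (hypothetical).
data Exp
  = Input            -- the array supplied on each invocation
  | AddC Float Exp   -- add a constant to every element
  | Scale Float Exp  -- multiply every element by a constant

-- Stand-in for front-end conversion plus cache lookup: the expression is
-- walked once and a plain Haskell function over the input comes back.
prepare :: Exp -> ([Float] -> [Float])
prepare Input       = id
prepare (AddC c e)  = let f = prepare e in map (+ c) . f
prepare (Scale k e) = let f = prepare e in map (* k) . f

-- The traversal is paid for once, when 'step' is first evaluated.
-- Later calls only apply the resulting function to new input data.
step :: [Float] -> [Float]
step = prepare (Scale 2 (AddC 1 Input))
```

With this shape, repeated calls such as 'step frame1' and 'step frame2' would not repeat the traversal.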

PS: We are also really interested in a C backend too!

Rami

On 24/08/2010, at 12:13 AM, Stephen Blackheath [to Accelerate] wrote:

Manuel,

What do you think?

I think that's all mighty fine!  I understand everything you said, and I
see how it all fits together.  The Template Haskell is a great idea!

One small detail is that on our cross-compiler, TH doesn't work.  This
is to do with how I compiled it, but it is possible that GHC might need
some persuading before it will run Template Haskell when compiled as a
cross compiler.  We should use TH in the general case, but I can bypass
it for my application.

My priority is completing the game, so some time soon I am going to
decide whether to contribute a C back end to accelerate, and then use
that, or hand-code the C.  I know which one would be more fun, but I
have to be hard-nosed about it.  Time is not the only consideration.
Code quality is also important, so accelerate is very attractive indeed.

Later I may add support for VecLib like we discussed, but I'd start with
plain C.

If and when I come to it, I feel like I've got enough background now to
start ripping into it and sending you patches, with plenty of questions
about the details of course, so thank you very much for your email.

Steve

On 23/08/10 11:31 PM, Manuel M T Chakravarty wrote:
Steve,

Stephen Blackheath [to Accelerate]:
I mentioned to you at AusHac2010 that I've got some Haskell code in
our video game that needs speeding up, and accelerate is looking
like an attractive option to avoid having to write some C.

Since our platforms are mobile ones (and therefore no compiler at
runtime), I'm considering writing a static C back end for
accelerate, so I took a look at what would be needed.

The generated code would need to be a C function that takes
arguments, and so the accelerate AST would have to represent a
lambda expression. Assuming I'm reading the code right, it looks
like I can do this easily by adding the arguments to the
environment, then binding an Avar to the argument names at the
Haskell level.  Primitive values can be passed in as scalars.

However, the accelerate language code could not have the current
type of Acc a ~ OpenAcc () a, because that doesn't allow the
environment type required by Avar.  The obvious way to fix this is
to change every function in the accelerate language to have type
OpenAcc aenv a instead of Acc a.  But - I can't imagine you would
want to do this.
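
The typing problem described above can be illustrated with a much reduced model. This is a sketch under assumptions: the constructors below are a cut-down imitation of an environment-indexed AST, not the real Accelerate definitions, and 'Use' and 'MapInc' are toy operations restricted to lists of Float.

```haskell
{-# LANGUAGE GADTs #-}

-- Typed de Bruijn index into an environment of array arguments.
data Idx env t where
  ZeroIdx :: Idx (env, t) t
  SuccIdx :: Idx env t -> Idx (env, s) t

-- Array computations that may refer to variables in 'aenv'.
data OpenAcc aenv a where
  Avar   :: Idx aenv a -> OpenAcc aenv a
  Use    :: [Float] -> OpenAcc aenv [Float]
  MapInc :: OpenAcc aenv [Float] -> OpenAcc aenv [Float]

-- A closed computation has an empty environment, so it cannot contain
-- an 'Avar' that refers to a function argument.
type Acc a = OpenAcc () a

-- Run-time environment holding one value per variable.
data Val env where
  Empty :: Val ()
  Push  :: Val env -> t -> Val (env, t)

prj :: Idx env t -> Val env -> t
prj ZeroIdx     (Push _ v)   = v
prj (SuccIdx i) (Push env _) = prj i env

eval :: OpenAcc aenv a -> Val aenv -> a
eval (Avar i)   env = prj i env
eval (Use xs)   _   = xs
eval (MapInc a) env = map (+ 1) (eval a env)

-- A one-argument "function": the argument is variable 0 of the environment.
incArg :: OpenAcc ((), [Float]) [Float]
incArg = MapInc (Avar ZeroIdx)
```

Here 'incArg' type checks only because its environment type is '((), [Float])'; it could not be given the type 'Acc [Float]'.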

I don't think that this is necessary.

As far as the front-end of Accelerate is concerned, there is not
really a need for array computations to be closed.  Similar to the
existing functions D.A.A.Smart.convertFun1 and
D.A.A.Smart.convertFun2, we could have support for converting
array-valued functions (over Acc) — and by using a type class, we
could avoid having an awkward family of conversion functions, but
just have one (overloaded) class method.
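
As a rough sketch of the type class idea, with a toy 'Acc' that merely carries a string (the class name 'Convertible' and the method 'convert' are made up for illustration and are not part of Accelerate):

```haskell
{-# LANGUAGE FlexibleInstances #-}

-- Toy stand-in for an array computation: just its printed form.
newtype Acc a = Acc String

-- One overloaded method instead of convertFun1, convertFun2, and so on.
-- The Int is a counter used to invent fresh argument names.
class Convertible f where
  convert :: Int -> f -> String

-- Base case: a computation with no remaining arguments.
instance Convertible (Acc a) where
  convert _ (Acc body) = body

-- Inductive case: bind one array argument, then convert the rest.
instance Convertible r => Convertible (Acc a -> r) where
  convert n f =
    "\\a" ++ show n ++ " -> " ++ convert (n + 1) (f (Acc ("a" ++ show n)))

-- Example array-valued function of two arguments.
plus :: Acc [Float] -> Acc [Float] -> Acc [Float]
plus (Acc x) (Acc y) = Acc ("zipWith (+) " ++ x ++ " " ++ y)
```

The same call, 'convert 0', then works for functions of any number of array arguments, because instance resolution peels off one argument at a time.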

The CUDA backend couldn't translate such functions in its current
version.  However, as you were planning to have a new backend anyway,
that might not matter.  Nevertheless, it might also make sense to
find a more general solution that could be used with all backends.
(Then, we might be able to use both the C backend and the CUDA
backend dynamically as well as statically.)  The latter might be
achieved in either of two ways:

(1) The actual CUDA backend code D.A.A.CUDA.Compile.compileAcc takes
an OpenAcc anyway.  So, we could think about changing the interface.
One disadvantage is that using a type class to avoid a family of run
functions complicates the interface.

(2) Given a function 'foo :: Vector Float ->  Acc (Vector Float)' and
two invocations 'foo vec1' and 'foo vec2' in the same program, the
CUDA backend doesn't really generate the code for 'foo' twice.
Instead, it does actually abstract 'foo' out, compiles that once, and
caches the result.  On the second invocation of 'foo', the cached
binary is used.  We would only need to arrange for that cached code
to be generated at compile time.

I'm in favour of Alternative (2).  The caching works by identifying
all occurrences of the 'use' function in the array expression.  The
compilation process abstracts over all of them, generating binary
code that is effectively a function over all 'use'd arrays.  We could
exploit that by introducing, in addition to 'run', a 'precompile'
function for each backend.  The function 'precompile' would have the
same signature as 'run' and it would behave in the same way with two
exceptions: (1) it wouldn't attempt to transfer the 'use'd arrays
from the host to the device (in the case of CUDA) and (2) it wouldn't
actually invoke the generated code.
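
The caching scheme and the proposed 'precompile' can be modelled in a few lines. This is only a toy model under stated assumptions: 'Exp', 'Skeleton', 'Kernel' and 'Cache' are hypothetical, the "compiled code" is an ordinary Haskell function, and the cache is passed around explicitly instead of living inside a backend.

```haskell
-- Toy array language (hypothetical, not Accelerate's real AST).
data Exp
  = Use [Float]     -- embeds a host array, like 'use'
  | AddC Float Exp  -- add a constant to every element
  | ZipAdd Exp Exp  -- element-wise sum of two arrays

-- The expression with every 'use'd array erased; this is the cache key.
data Skeleton = SUse | SAddC Float Skeleton | SZipAdd Skeleton Skeleton
  deriving (Eq, Show)

skeleton :: Exp -> Skeleton
skeleton (Use _)      = SUse
skeleton (AddC c e)   = SAddC c (skeleton e)
skeleton (ZipAdd a b) = SZipAdd (skeleton a) (skeleton b)

-- The 'use'd arrays, in left-to-right order.
inputs :: Exp -> [[Float]]
inputs (Use xs)     = [xs]
inputs (AddC _ e)   = inputs e
inputs (ZipAdd a b) = inputs a ++ inputs b

-- "Binary code": a function over all 'use'd arrays.
type Kernel = [[Float]] -> [Float]

compile :: Skeleton -> Kernel
compile sk = fst . go sk
  where
    go :: Skeleton -> [[Float]] -> ([Float], [[Float]])
    go SUse (x : rest)  = (x, rest)
    go SUse []          = error "compile: too few input arrays"
    go (SAddC c s) xs   = let (r, rest) = go s xs in (map (+ c) r, rest)
    go (SZipAdd a b) xs =
      let (ra, rest1) = go a xs
          (rb, rest2) = go b rest1
      in (zipWith (+) ra rb, rest2)

type Cache = [(Skeleton, Kernel)]

-- Like 'run', but neither reads the 'use'd arrays nor executes the kernel.
precompile :: Cache -> Exp -> Cache
precompile cache e =
  case lookup sk cache of
    Just _  -> cache
    Nothing -> (sk, compile sk) : cache
  where sk = skeleton e

-- Returns the result, the updated cache, and whether the cache was hit.
run :: Cache -> Exp -> ([Float], Cache, Bool)
run cache e =
  case lookup sk cache of
    Just k  -> (k (inputs e), cache, True)
    Nothing -> let k = compile sk in (k (inputs e), (sk, k) : cache, False)
  where sk = skeleton e
```

In this model, a 'run' that follows a 'precompile' of the same expression shape finds the kernel in the cache and only has to supply the new arrays.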

Even in applications that don't have to statically generate code,
'precompile' will be useful to populate the cache (e.g., in a game,
you don't want to take the hit of the initial compile during the
actual game play, but you want to move that into the startup phase).

To precompile, we don't even need to make up an unused array argument.
We can issue 'precompile (foo undefined)'; the 'undefined' array
value won't be touched, thanks to laziness.  Finally, to generate
code during *compile* time, rather than at start up, we can use
Template Haskell.  (This should amount to a rather simple use of TH
if I'm not overlooking anything.)
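
The laziness argument can be checked on a toy model. In this sketch 'Acc', 'codeFor' and 'precompiled' are hypothetical stand-ins, and the Template Haskell step is left out; the point is only that code generation inspects the shape of the expression and never the 'use'd array.

```haskell
-- Toy expression type (hypothetical, not Accelerate's real AST).
data Acc
  = Use [Float]  -- embeds a host array
  | Inc Acc      -- add one to every element

-- "Code generation" depends only on the shape of the expression;
-- the payload of 'Use' is never inspected.
codeFor :: Acc -> String
codeFor (Use _) = "in0"
codeFor (Inc a) = "(" ++ codeFor a ++ " + 1)"

foo :: [Float] -> Acc
foo vec = Inc (Use vec)

-- Safe despite 'undefined': the array is never forced.
precompiled :: String
precompiled = codeFor (foo undefined)
```

Evaluating 'precompiled' does not raise an exception, because matching on 'Use _' leaves the argument unevaluated.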

Summary
~~~~~~~
I'd prefer a general mechanism to separate code
generation from the first invocation of an Accelerate computation.
We can achieve that by introducing 'precompile', a variant of 'run',
that omits actual code execution.  In combination with TH, this
enables static code generation.  We could use that approach uniformly
in the CUDA backend and in a C backend (as well as other backends).

What do you think?

Manuel

