
more comments Jeffrey Mark Siskind 26 Apr 2020 00:04 UTC

Here is another desideratum, more in the category of wishful thinking or future
research.

It is fairly straightforward to compile a certain particular subset of Scheme
into CUDA or XLA and have it run efficiently on GPUs. That compilation process
does not require sophisticated static analysis or source-code transformation.
That subset includes shared arrays like those in the SRFI, with lower and
upper bounds, and the addition of strides. It also includes suitable map and
fold operations. The map and fold operations need to be able to nest. But the
only primitives that ultimately get passed in are scalar arithmetic
operations like + - * / max min sqrt log exp sin cos atan, possibly the
comparisons < = > <= >=, and boolean operations like and, or, and not. The
code should not have any recursion; it should be possible to trivially
inline everything. There should be no aggregates except the arrays being
operated on by map and fold, and thus no consing and no gc. There should be
no control flow except that which is encoded in the map and fold
operations, so no if, except possibly simple ifs that can be encoded as
conditional moves.
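
To make that concrete, here is a minimal sketch in plain Scheme of the kind
of kernel the sublanguage would admit: matrix multiplication written only
with nested map and fold over scalar + and *. Vectors of vectors stand in
for the SRFI's arrays (a real version would take the bounds and strides
from the array objects themselves), and vector-tabulate and fold-range are
ad hoc helper names of my own, not anything from the SRFI.

(define (vector-tabulate n f)          ; build a vector from an index function
  (let ((v (make-vector n)))
    (do ((i 0 (+ i 1))) ((= i n) v)
      (vector-set! v i (f i)))))

(define (fold-range f init n)          ; fold f over the index range [0, n)
  (let loop ((i 0) (acc init))
    (if (= i n) acc (loop (+ i 1) (f acc i)))))

;; C = A x B, where A is m-by-k and B is k-by-n.  Everything inlines down
;; to scalar + and *: no recursion in user code, no consing beyond the
;; result arrays, no control flow outside the map/fold combinators.
(define (mat-mul A B m k n)
  (vector-tabulate m                   ; map over the rows of C
    (lambda (i)
      (vector-tabulate n               ; map over the columns of C
        (lambda (j)
          (fold-range                  ; fold over the shared axis
            (lambda (acc l)
              (+ acc (* (vector-ref (vector-ref A i) l)
                        (vector-ref (vector-ref B l) j))))
            0 k))))))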

It is fairly straightforward to formally specify (variants of) this
sublanguage. And it is fairly straightforward to translate them to CUDA or
XLA. There are variants like whether the map and fold operations can operate
on subsets of the dimensions/axes. And whether they can handle padding and
interpolation.
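
For instance, the axis variant just amounts to saying which dimension the
fold consumes and which it maps over. A sketch, reusing the ad hoc helpers
from the previous sketch (the if below selects the axis when the kernel is
built, not inside the inner loop):

;; Fold f over one axis of a rank-2 array: axis 1 folds across each row,
;; axis 0 folds down each column.
(define (fold-axis f init A rows cols axis)
  (if (= axis 1)
      (vector-tabulate rows            ; one result per row
        (lambda (i)
          (fold-range
            (lambda (acc j) (f acc (vector-ref (vector-ref A i) j)))
            init cols)))
      (vector-tabulate cols            ; one result per column
        (lambda (j)
          (fold-range
            (lambda (acc i) (f acc (vector-ref (vector-ref A i) j)))
            init rows)))))

;; (fold-axis + 0 A rows cols 1)  =>  vector of row sums
;; (fold-axis + 0 A rows cols 0)  =>  vector of column sums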

But the key issue is whether the sublanguage can be made sufficiently
expressive to allow formulation of all of the operations in (Py)Torch and
cuDNN. Currently, the way (Py)Torch is implemented is that there is a (huge)
library of handwritten CUDA code that is linked in via FFI. And that code is
not compositional. For example, one cannot compose LSTM with convolution to
make CNN(LSTM), i.e., convolving an LSTM over a sequence of images. It is
thus very un-Schemely. It would be really nice if the entire code base of
handwritten (Py)Torch operations could be replaced with code written in a
higher-level language in a compositional framework. But it is crucial that it
be possible to compile the code into runnable code that is as efficient as the
handwritten low-level (Py)Torch CUDA code.
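
As a rough sketch of what that compositionality could look like, assuming a
scan combinator alongside map and fold (its recursion, like fold's, lives in
the combinator rather than in user code), and with conv2d-kernel, lstm-step,
initial-state, and frames as placeholders for kernels and data that would
themselves be written in the same map/fold vocabulary, not real library
calls:

;; scan: a fold that also keeps its intermediate states -- the natural
;; combinator for running a recurrent step over a sequence.
(define (scan f init xs)
  (if (null? xs)
      '()
      (let ((next (f init (car xs))))
        (cons next (scan f next (cdr xs))))))

;; One reading of the CNN(LSTM) example above: convolve every frame, then
;; thread the recurrent state through the convolved sequence.  Both pieces
;; are just more map/fold code, so composing them is ordinary function
;; composition rather than another handwritten CUDA kernel:
;;
;;   (scan lstm-step initial-state (map conv2d-kernel frames))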

    Jeff (http://engineering.purdue.edu/~qobi)