On Sat, Nov 9, 2024 at 5:51 AM Bradley Lucier <xxxxxx@purdue.edu> wrote:
On 11/6/24 19:26, Alex Shinn wrote:
>
> On Tue, Nov 5, 2024 at 12:07 PM Bradley Lucier <xxxxxx@purdue.edu> wrote:
>
>     Can you give an example where a broadcast just amounts to a reshape?  I
>     can't think of one.
>
> Any time the broadcast is just dropping a trivial dimension, it's a reshape.
> This does come up fairly often.

[...]
I don't see that dropping a trivial dimension is a possibility with this
definition.  Adding one is, but that means the other array arguments
have a trivial dimension as their leading axis, which would likely not
happen very often.

Sorry, obviously I meant adding, not dropping, and it happens more
often than you think.
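
Roughly, something like this (just a sketch, not code from my
library, using the SRFI 231 names as I understand them; v and
v-as-row are illustrative names):

    (import (scheme base) (srfi 231))

    ;; v has shape [3].
    (define v
      (array-copy (make-array (make-interval (vector 3))
                              (lambda (i) (* i i)))))

    ;; Broadcasting v against an array of shape [1 3] only needs a
    ;; leading trivial axis, which is a plain reshape -- no element
    ;; is duplicated and no index map becomes many-to-one.
    (define v-as-row
      (specialized-array-reshape v (make-interval (vector 1 3))))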

> Unfortunately, this doesn't support all use cases.  The most common
> (and most important) operation is matrix multiplication, which is not a map.

I don't see how matrix multiplication can be written using array
broadcasting.  Perhaps that's not what you mean.

If you have a deep neural network, most of the operations are
matrix multiplications, along with activation functions, which
tend to sum along an axis.  The backpropagation step then applies
automatic differentiation to matrix multiplications whose inputs
may have trivial dimensions and require broadcasting.
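
To give a rough idea (a generic textbook layer, not a trace of my
code): a dense layer computes Y = matmul(X, W) + b with X of shape
[m k], W of shape [k n] and b of shape [n], so the addition already
broadcasts b along the first axis; in the backward pass the gradient
with respect to b is the upstream [m n] gradient summed back along
that same axis, which is where the trivial dimensions and axis sums
come from.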

Sorry, I don't have time to spell out a full example right now; just
understand that when using automatic differentiation you end up
with many, sometimes unintuitive, combinations of operations.

> where `fast-array+` is currently BLAS but I plan to move to CUDA.
> The point here is that the fast path relies on the arguments being
> normal arrays - trying to move the broadcasting logic into the fast
> path would slow it down and defeat the purpose.

I don't see the need to broadcast the arrays before using the fast path,
and, indeed, I don't see how the BLAS routines that you use to implement
fast-array+ even give the correct answer if you have a
nontrivially broadcast argument.

I don't see how you couldn't see this :)

The entire point of my broadcasting implementation is that it
provides an otherwise normal specialized array, for which you
can obtain the needed coefficients to resolve any multi-index
to an offset.  That's all BLAS needs to work.  The only special
thing about it is that some axes may be broadcast, which means the
coefficients for those axes are zero.
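
Concretely, in generic stride notation (not the exact names from my
code), a two-dimensional specialized array resolves a multi-index as

    offset(i, j) = base + c0*i + c1*j

and a row broadcast along the first axis is just the case c0 = 0, so
every i resolves into the same row of the underlying storage.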

[Technically, the BLAS design doesn't allow zero coefficients,
but there's no reason it couldn't, and I intend to remove this
restriction when moving to CUDA: instead of iterating from the
lowest to the highest resulting offset, you iterate over the
multi-index itself.]
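
What I mean by iterating over the index itself is roughly this
(a sketch over plain vectors and coefficients, not my actual CUDA
plan; add2d! is an illustrative name):

    (import (scheme base))

    ;; Elementwise add of two 2-d views given their backing vectors
    ;; and affine coefficients; a broadcast axis is simply an axis
    ;; whose coefficient is zero, and this loop never notices.
    (define (add2d! m n
                    dst dst-base dst-c0 dst-c1
                    a a-base a-c0 a-c1
                    b b-base b-c0 b-c1)
      (do ((i 0 (+ i 1))) ((= i m))
        (do ((j 0 (+ j 1))) ((= j n))
          (vector-set!
           dst (+ dst-base (* dst-c0 i) (* dst-c1 j))
           (+ (vector-ref a (+ a-base (* a-c0 i) (* a-c1 j)))
              (vector-ref b (+ b-base (* b-c0 i) (* b-c1 j))))))))

Walking from the lowest to the highest resulting offset, by
contrast, visits each offset once, while a broadcast maps many
multi-indices to the same offset.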
 
I understand the desire to do things that don't impede fast
implementations; I tried to do that with the sample implementation code
to move array elements, relying on memmove in the best case.  But I
don't see yet how you think array broadcasting interacts with that.

We don't need to argue about this.  I will continue to use
my array-broadcast implementation, which can be written as
an add-on library using specialized-array-share and just
happens to break the one-to-one rule.  I was sharing this
with others because you were listing numpy compatibility
and noted the lack of broadcasting, and I think this is a
particularly elegant and efficient way to achieve
broadcasting.
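
The shape of the trick is roughly this (a sketch, not my actual
interface; broadcast-rows is an illustrative name, and it assumes
the source array's single axis starts at 0):

    (import (scheme base) (srfi 231))

    ;; Share a one-dimensional array of shape [n] as an [m n] array
    ;; by ignoring the row index.  The index map is many-to-one,
    ;; which is exactly the one-to-one rule being broken; it works
    ;; as long as the implementation doesn't enforce that rule, and
    ;; the result should be treated as read-only.
    (define (broadcast-rows a m)
      (let ((n (interval-upper-bound (array-domain a) 0)))
        (specialized-array-share
         a
         (make-interval (vector m n))
         (lambda (i j) (values j)))))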

If you want to encourage implementations to enforce the
one-to-one rule, then my array-broadcast is no longer
portable, but that won't stop me from using it.

--
Alex