On Sat, Nov 9, 2024 at 5:51 AM Bradley Lucier <xxxxxx@purdue.edu> wrote:
On 11/6/24 19:26, Alex Shinn wrote:
>
> On Tue, Nov 5, 2024 at 12:07 PM Bradley Lucier <xxxxxx@purdue.edu> wrote:
>
>     Can you give an example where a broadcast just amounts to a reshape?  I
>     can't think of one.
>
> Any time the broadcast is just dropping a trivial dimension, it's a reshape.
> This does come up fairly often.

[...]
I don't see that dropping a trivial dimension is a possibility with this
definition.  Adding one is, but that means the other array arguments
have a trivial dimension as their leading axis, which would likely not
happen very often.

Sorry, obviously I meant adding, not dropping, and it happens more
often than you think.
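
Roughly, something like this (just a sketch, not code from my
library, using the SRFI 231 names as I understand them; v and
v-as-row are illustrative names):

    (import (scheme base) (srfi 231))

    ;; v has shape [3].
    (define v
      (array-copy (make-array (make-interval (vector 3))
                              (lambda (i) (* i i)))))

    ;; Broadcasting v against an array of shape [1 3] only needs a
    ;; leading trivial axis, which is a plain reshape -- no element
    ;; is duplicated and no index map becomes many-to-one.
    (define v-as-row
      (specialized-array-reshape v (make-interval (vector 1 3))))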

> Unfortunately, this doesn't support all use cases.  The most common
> (and most important) operation is matrix multiplication, which is not a map.

I don't see how matrix multiplication can be written using array
broadcasting.  Perhaps that's not what you mean.

If you have a deep neural network, most of the operations are
matrix multiplications, along with activation functions, which
tend to sum along an axis.  The backpropagation step then applies
automatic differentiation to matrix multiplications whose inputs
may have trivial dimensions and require broadcasting.
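
To give a rough idea (a generic textbook layer, not a trace of my
code): a dense layer computes Y = matmul(X, W) + b with X of shape
[m k], W of shape [k n] and b of shape [n], so the addition already
broadcasts b along the first axis; in the backward pass the gradient
with respect to b is the upstream [m n] gradient summed back along
that same axis, which is where the trivial dimensions and axis sums
come from.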

Sorry, I don't have time to spell out a full example right now; just
understand that when using automatic differentiation you end up
with many, sometimes unintuitive, combinations of operations.

> where `fast-array+` is currently BLAS but I plan to move to CUDA.
> The point here is that the fast path relies on the arguments being
> normal arrays - trying to move the broadcasting logic into the fast
> path would slow it down and defeat the purpose.

I don't see the need to broadcast the arrays before using the fast path,
and, indeed, I don't see how the BLAS routines that you use to implement
fast-array+ even give the correct answer if you have a
nontrivially broadcast argument.

I don't see how you couldn't see this :)

The entire point of my broadcasting implementation is that it
provides an otherwise normal specialized array, for which you
can obtain the needed coefficients to resolve any multi-index
to an offset.  That's all BLAS needs to work.  The only special
thing about it is that some axes may be broadcast, which means the
coefficients for those axes are zero.
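
Concretely, in generic stride notation (not the exact names from my
code), a two-dimensional specialized array resolves a multi-index as

    offset(i, j) = base + c0*i + c1*j

and a row broadcast along the first axis is just the case c0 = 0, so
every i resolves into the same row of the underlying storage.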

[Technically, the BLAS design doesn't allow zero coefficients,
but there's no reason it couldn't, and I intend to remove this
restriction when moving to CUDA: instead of iterating from the
lowest to the highest resulting offset, you iterate over the
multi-index itself.]
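
What I mean by iterating over the index itself is roughly this
(a sketch over plain vectors and coefficients, not my actual CUDA
plan; add2d! is an illustrative name):

    (import (scheme base))

    ;; Elementwise add of two 2-d views given their backing vectors
    ;; and affine coefficients; a broadcast axis is simply an axis
    ;; whose coefficient is zero, and this loop never notices.
    (define (add2d! m n
                    dst dst-base dst-c0 dst-c1
                    a a-base a-c0 a-c1
                    b b-base b-c0 b-c1)
      (do ((i 0 (+ i 1))) ((= i m))
        (do ((j 0 (+ j 1))) ((= j n))
          (vector-set!
           dst (+ dst-base (* dst-c0 i) (* dst-c1 j))
           (+ (vector-ref a (+ a-base (* a-c0 i) (* a-c1 j)))
              (vector-ref b (+ b-base (* b-c0 i) (* b-c1 j))))))))

Walking from the lowest to the highest resulting offset, by
contrast, visits each offset once, while a broadcast maps many
multi-indices to the same offset.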
 
I understand the desire to do things that don't impede fast
implementations; I tried to do that with the sample implementation code
to move array elements, relying on memmove in the best case.  But I
don't see yet how you think array broadcasting interacts with that.

We don't need to argue about this.  I will continue to use
my array-broadcast implementation, which can be written as
an add-on library using specialized-array-share and just
happens to break the one-to-one rule.  I was sharing this
with others because you were listing numpy compatibility
and noted the lack of broadcasting, and I think this is a
particularly elegant and efficient way to achieve
broadcasting.
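
The shape of the trick is roughly this (a sketch, not my actual
interface; broadcast-rows is an illustrative name, and it assumes
the source array's single axis starts at 0):

    (import (scheme base) (srfi 231))

    ;; Share a one-dimensional array of shape [n] as an [m n] array
    ;; by ignoring the row index.  The index map is many-to-one,
    ;; which is exactly the one-to-one rule being broken; it works
    ;; as long as the implementation doesn't enforce that rule, and
    ;; the result should be treated as read-only.
    (define (broadcast-rows a m)
      (let ((n (interval-upper-bound (array-domain a) 0)))
        (specialized-array-share
         a
         (make-interval (vector m n))
         (lambda (i j) (values j)))))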

If you want to encourage implementations to enforce the
one-to-one rule, then my array-broadcast is no longer
portable, but that won't stop me from using it.

--
Alex