Re: Proposal to reduce the number of argument to array-for-each, array-every, and array-any Bradley Lucier 06 Dec 2025 02:43 UTC

On 12/2/25 19:01, Alex Shinn wrote:
> For one example:
>
> ;; generalized dot product
> (define (array-dot a b)
>    (let ((sum 0))
>      (array-for-each (lambda (ai bi) (set! sum (+ sum (* ai bi)))) a b)
>      sum))
>
> Here if a and b are both specialized arrays, array-for-each can use
> more efficient indexing, without the need for intermediary multi-indices.
> Even without that optimization I think the intermediary map, if not inlined,
> would hurt the loop performance, but it may be worth benchmarking.

I implemented this idea as

(define %%use-fast-path #t)
(set! %%use-fast-path %%use-fast-path)

(define (%%array-for-each f array arrays)
   (if (and %%use-fast-path
            (or (null? arrays) (null? (cdr arrays)))
            (every (lambda (A)
                     (and (specialized-array? A) (%%array-packed? A)))
                   (cons array arrays)))
       (case (length arrays)
         ((0) (let ((base_0 (apply (%%array-indexer array)
                                   (%%interval-lower-bounds->list
(%%array-domain array))))
                    (getter_0 (storage-class-getter
(%%array-storage-class array)))
                    (body_0 (%%array-body array))
                    (number-of-elements (%%interval-volume
(%%array-domain array))))
                (do ((elements-remaining number-of-elements (fx-
elements-remaining 1))
                     (i_0 base_0 (fx+ i_0 1)))
                    ((eqv? elements-remaining 0) (void))
                  (f (getter_0 body_0 i_0)))))
         ((1) (let ((base_0 (apply (%%array-indexer array)
                                   (%%interval-lower-bounds->list
(%%array-domain array))))
                    (base_1 (apply (%%array-indexer (car arrays))
                                   (%%interval-lower-bounds->list
(%%array-domain (car arrays)))))
                    (getter_0 (storage-class-getter
(%%array-storage-class array)))
                    (getter_1 (storage-class-getter
(%%array-storage-class (car arrays))))
                    (body_0 (%%array-body array))
                    (body_1 (%%array-body (car arrays)))
                    (number-of-elements (%%interval-volume
(%%array-domain array))))
                (do ((elements-remaining number-of-elements (fx-
elements-remaining 1))
                     (i_0 base_0 (fx+ i_0 1))
                     (i_1 base_1 (fx+ i_1 1)))
                    ((eqv? elements-remaining 0) (void))
                  (f (getter_0 body_0 i_0)
                     (getter_1 body_1 i_1))))))
       (%%interval-for-each
(%%specialize-function-applied-to-array-getters f array arrays)
                            (%%array-domain array))))

I added the flag %%use-fast-path to easily turn on and off the
specialized code.

I wrote the following simple benchmark:

(declare (standard-bindings)
          (extended-bindings)
          (block)
          (not safe))

(define (dot-product a b)
   (array-fold-left fl+ 0. (array-map fl* a b)))

(define (dot-product-2 a b)
   (let ((sum 0.))
     (array-for-each (lambda (ai bi) (set! sum (fl+ sum (fl* ai bi)))) a b)
     sum))

(define (dot-product-3 a b)
   (array-fold-left (lambda (sum a b) (fl+ sum (fl* a b))) 0. a b))

and compiled everything with Github head Gambit on my 10-year-old Linux box.

Here are some times:

 > (time (dot-product a b))
(time (dot-product a b))
     0.041890 secs real time
     0.041791 secs cpu time (0.041629 user, 0.000162 system)
     no collections
     656 bytes allocated
     no minor faults
     no major faults
     150424893 cpu cycles
2000000.
 > (set! %%use-fast-path #t)
 > (time (dot-product-2  a b))
(time (dot-product-2 a b))
     0.021725 secs real time
     0.021642 secs cpu time (0.021611 user, 0.000031 system)
     no collections
     576 bytes allocated
     1 minor fault
     no major faults
     77963570 cpu cycles
2000000.
 > (set! %%use-fast-path #f)
 > (time (dot-product-2  a b))
(time (dot-product-2 a b))
     0.029797 secs real time
     0.029708 secs cpu time (0.029680 user, 0.000028 system)
     no collections
     464 bytes allocated
     no minor faults
     no major faults
     106990493 cpu cycles
2000000.
 > (time (dot-product-3  a b))
(time (dot-product-3 a b))
     0.072196 secs real time
     0.072079 secs cpu time (0.059903 user, 0.012176 system)
     no collections
     96000560 bytes allocated
     11719 minor faults
     no major faults
     259275183 cpu cycles
2000000.

Lessons I take from this:

1.  Adding the simplified indexing you suggest to the sample
implementation cuts CPU time for your example dot product code from
0.029708 to 0.021725 seconds.  The new code can be generated from a
macro, so it may be worth the change.

2.  I was lazy in implementing multi-array array-fold-left, generating a
lot of intermediate argument lists, and taking more time than necessary.

3.  Gambit's "immediate flonum" representation really does eliminate
flonum boxing for many "common" flonums (at least all the ones used in
this example, all flonums returned from (random-real), ...).