This optimisation is conceptually very similar to how `concat` is fused into `scatter`. This is important for algorithms that produce many points to be reduced per item. A fairly minimal example of something that should fuse but currently does not is:
```futhark
entry test n =
  let (is, vs) = tabulate n (\i -> ((i, (i+1) % n), (1f32, 1f32))) |> unzip
  let unzip_concat as = unzip as |> uncurry concat
  in hist (+) 0f32 n (unzip_concat is) (unzip_concat vs)
```
Here the lack of fusion is not too problematic, as the number of items to reduce scales with the size of the array, but once the number of items to reduce per element grows beyond that, there can be some really bad scaling issues, as `(is, vs)` needs to be manifested due to the lack of fusion. Unlike `scatter`, where the number of points to be scattered is usually less than or equal to the size of the destination array, in `reduce_by_index` this is often not the case.
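For illustration, here is a NumPy sketch of the same pattern (the function names `unfused` and `fused` are made up for this sketch, and `np.add.at` stands in for the generalised histogram): the unfused version manifests the concatenated index and value arrays in memory before reducing, while the fused version accumulates each contribution as it is produced.

```python
import numpy as np

def unfused(n):
    # Each item i produces two (index, value) contributions,
    # mirroring the Futhark example above.
    i = np.arange(n)
    is_pairs = np.stack([i, (i + 1) % n], axis=1)      # shape (n, 2)
    vs_pairs = np.ones((n, 2), dtype=np.float32)
    # The concatenated arrays are manifested in memory ...
    flat_is = is_pairs.T.reshape(-1)  # like `unzip |> uncurry concat`
    flat_vs = vs_pairs.T.reshape(-1)
    hist = np.zeros(n, dtype=np.float32)
    np.add.at(hist, flat_is, flat_vs)  # ... before being reduced
    return hist

def fused(n):
    # With concat fused into the histogram, contributions are
    # accumulated directly; no intermediate arrays are manifested.
    hist = np.zeros(n, dtype=np.float32)
    i = np.arange(n)
    np.add.at(hist, i, np.float32(1))
    np.add.at(hist, (i + 1) % n, np.float32(1))
    return hist
```

Both compute the same result, but the unfused version allocates intermediate arrays whose size scales with the total number of contributions rather than with the input.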
I have some code that transfers point-value pairs onto a coarser grid using polynomial interpolation, and this produces many items to reduce per point transferred, especially in higher dimensions. This is necessary for implementing Particle Mesh Ewald (PME) summation, which is a common method in Molecular Dynamics (MD), and more generally for solving linear partial differential equations with complex geometry.
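To give a sense of the per-point expansion, here is a NumPy sketch (the function `spread_point` and its tent-shaped weights are placeholders invented for this illustration; real PME uses cubic B-spline weights) of spreading one 3D point onto its 4x4x4 neighbourhood of grid bins:

```python
import numpy as np

def spread_point(point, grid_shape):
    """Spread one 3D point onto a 4x4x4 block of grid bins.

    Illustrative only: uses a placeholder tent-like weight instead
    of the cubic B-spline used in real PME.
    """
    base = np.floor(point).astype(np.int64) - 1  # corner of 4x4x4 block
    offs = np.arange(4)
    frac = point - np.floor(point)
    # Per-axis weights, normalised so each axis sums to 1.
    w_ax = [np.maximum(0.0, 1.0 - np.abs(offs - 1 - frac[d]) / 2.0)
            for d in range(3)]
    w_ax = [w / w.sum() for w in w_ax]
    # Outer product of the three axes: 4*4*4 = 64 weights per point.
    weights = np.einsum('i,j,k->ijk', *w_ax)
    ii, jj, kk = np.meshgrid(offs, offs, offs, indexing='ij')
    idx = np.stack([(base[0] + ii) % grid_shape[0],
                    (base[1] + jj) % grid_shape[1],
                    (base[2] + kk) % grid_shape[2]], axis=-1)
    return idx.reshape(-1, 3), weights.reshape(-1)

# One input point expands to 64 (index, weight) pairs, so without
# fusion the manifested intermediate is ~64x the input point data.
idx, w = spread_point(np.array([2.3, 5.7, 1.1]), (16, 16, 16))
```

Every point produces 64 (index, value) contributions to be reduced, which is exactly the situation where the intermediate arrays dwarf both the input and the destination grid.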
As an example of the scaling issues, here are the results for 3D cubic interpolation, where each point is reduced into 4x4x4 bins with different weights. The grid size is kept constant; only the number of points to be reduced is changed.
```
grid_test.fut:cubic_bench (no tuning file):
[3][100000]i32:        4475μs (95% CI: [ 4464.6,  4510.0])
[3][1000000]i32:      40286μs (95% CI: [39761.9, 42289.3])
[3][10000000]i32:  69413528μs |███ | 3 runs ^C
```
The smaller systems are also slower than expected, but as the array that needs to be manifested grows, the function becomes unusably slow. In the worst case here, the intermediate data, which would be fused away, is far larger than the input data or the result grid.