Support Julia 1.13 with fix for @device_functions macro#3031
KSepetanc wants to merge 17 commits into JuliaGPU:master
Well, I didn't really ask for a duplicate PR. I suggested to either merge them into CUDA.jl as two separate, sequential PRs, or – if you want to merge them as a single PR into CUDA.jl – create such a PR. I don't care either way; using two separate PRs seems simpler, but I leave that choice up to you.
The way I see it, this is the second option, i.e. a single PR with both changes.
Codecov Report
❌ Patch coverage is
Additional details and impacted files

| Coverage Diff | master | #3031 | +/- |
|---|---|---|---|
| Coverage | 89.46% | 89.35% | -0.12% |
| Files | 148 | 148 | |
| Lines | 13047 | 13044 | -3 |
| Hits | 11673 | 11655 | -18 |
| Misses | 1374 | 1389 | +15 |
CUDA.jl Benchmarks
| Benchmark suite | Current: aa08bb6 | Previous: 7a27d77 | Ratio |
|---|---|---|---|
| latency/precompile | 43987599741.5 ns | 44455759835 ns | 0.99 |
| latency/ttfp | 13112185336 ns | 13140153243 ns | 1.00 |
| latency/import | 3768904795 ns | 3755312424 ns | 1.00 |
| integration/volumerhs | 9441187.5 ns | 9442840 ns | 1.00 |
| integration/byval/slices=1 | 145953 ns | 145598 ns | 1.00 |
| integration/byval/slices=3 | 423132 ns | 422554 ns | 1.00 |
| integration/byval/reference | 144068 ns | 143811 ns | 1.00 |
| integration/byval/slices=2 | 284567 ns | 284011 ns | 1.00 |
| integration/cudadevrt | 102730.5 ns | 102397 ns | 1.00 |
| kernel/indexing | 13556 ns | 13434 ns | 1.01 |
| kernel/indexing_checked | 14190 ns | 13908 ns | 1.02 |
| kernel/occupancy | 656.4909090909091 ns | 644.5636363636364 ns | 1.02 |
| kernel/launch | 2165.4 ns | 2090.3 ns | 1.04 |
| kernel/rand | 14651 ns | 14479 ns | 1.01 |
| array/reverse/1d | 19132 ns | 18661 ns | 1.03 |
| array/reverse/2dL_inplace | 66383 ns | 66252 ns | 1.00 |
| array/reverse/1dL | 69365 ns | 68893 ns | 1.01 |
| array/reverse/2d | 20812.5 ns | 21087 ns | 0.99 |
| array/reverse/1d_inplace | 10603.166666666668 ns | 10503.833333333332 ns | 1.01 |
| array/reverse/2d_inplace | 11726 ns | 11399.5 ns | 1.03 |
| array/reverse/2dL | 72872 ns | 73163 ns | 1.00 |
| array/reverse/1dL_inplace | 66310 ns | 66146 ns | 1.00 |
| array/copy | 18407 ns | 18502.5 ns | 0.99 |
| array/iteration/findall/int | 145156.5 ns | 146476.5 ns | 0.99 |
| array/iteration/findall/bool | 130342 ns | 130795 ns | 1.00 |
| array/iteration/findfirst/int | 83970.5 ns | 84133 ns | 1.00 |
| array/iteration/findfirst/bool | 81352 ns | 81624.5 ns | 1.00 |
| array/iteration/scalar | 66441 ns | 65804 ns | 1.01 |
| array/iteration/logical | 196855 ns | 198187.5 ns | 0.99 |
| array/iteration/findmin/1d | 84288 ns | 86504 ns | 0.97 |
| array/iteration/findmin/2d | 116696 ns | 117154 ns | 1.00 |
| array/reductions/reduce/Int64/1d | 38815 ns | 41088.5 ns | 0.94 |
| array/reductions/reduce/Int64/dims=1 | 42194.5 ns | 52190.5 ns | 0.81 |
| array/reductions/reduce/Int64/dims=2 | 59158 ns | 59179 ns | 1.00 |
| array/reductions/reduce/Int64/dims=1L | 87315 ns | 87126 ns | 1.00 |
| array/reductions/reduce/Int64/dims=2L | 84580 ns | 84418.5 ns | 1.00 |
| array/reductions/reduce/Float32/1d | 34125.5 ns | 34001 ns | 1.00 |
| array/reductions/reduce/Float32/dims=1 | 40912 ns | 39890 ns | 1.03 |
| array/reductions/reduce/Float32/dims=2 | 56426.5 ns | 55899 ns | 1.01 |
| array/reductions/reduce/Float32/dims=1L | 51790 ns | 51535 ns | 1.00 |
| array/reductions/reduce/Float32/dims=2L | 70031 ns | 69798 ns | 1.00 |
| array/reductions/mapreduce/Int64/1d | 38957 ns | 40980.5 ns | 0.95 |
| array/reductions/mapreduce/Int64/dims=1 | 51415 ns | 41741 ns | 1.23 |
| array/reductions/mapreduce/Int64/dims=2 | 58881.5 ns | 59036 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=1L | 87459 ns | 87134 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=2L | 84594 ns | 84427 ns | 1.00 |
| array/reductions/mapreduce/Float32/1d | 33985 ns | 33457 ns | 1.02 |
| array/reductions/mapreduce/Float32/dims=1 | 39625 ns | 48711 ns | 0.81 |
| array/reductions/mapreduce/Float32/dims=2 | 56741 ns | 55941 ns | 1.01 |
| array/reductions/mapreduce/Float32/dims=1L | 51514 ns | 51352 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=2L | 69443.5 ns | 68956 ns | 1.01 |
| array/broadcast | 20582 ns | 20251 ns | 1.02 |
| array/copyto!/gpu_to_gpu | 10606.333333333334 ns | 10684.333333333334 ns | 0.99 |
| array/copyto!/cpu_to_gpu | 213329 ns | 214898 ns | 0.99 |
| array/copyto!/gpu_to_cpu | 284108 ns | 281876 ns | 1.01 |
| array/accumulate/Int64/1d | 117753 ns | 118336 ns | 1.00 |
| array/accumulate/Int64/dims=1 | 79494 ns | 79780 ns | 1.00 |
| array/accumulate/Int64/dims=2 | 155312.5 ns | 155968.5 ns | 1.00 |
| array/accumulate/Int64/dims=1L | 1698684 ns | 1694089 ns | 1.00 |
| array/accumulate/Int64/dims=2L | 960326.5 ns | 960949 ns | 1.00 |
| array/accumulate/Float32/1d | 100444 ns | 100823 ns | 1.00 |
| array/accumulate/Float32/dims=1 | 76182 ns | 76350 ns | 1.00 |
| array/accumulate/Float32/dims=2 | 144330.5 ns | 144365 ns | 1.00 |
| array/accumulate/Float32/dims=1L | 1584804 ns | 1584729 ns | 1.00 |
| array/accumulate/Float32/dims=2L | 656644 ns | 656302 ns | 1.00 |
| array/construct | 1334.6 ns | 1283.1 ns | 1.04 |
| array/random/randn/Float32 | 36132 ns | 36610 ns | 0.99 |
| array/random/randn!/Float32 | 30242 ns | 30335 ns | 1.00 |
| array/random/rand!/Int64 | 27134.5 ns | 26934 ns | 1.01 |
| array/random/rand!/Float32 | 8286.333333333334 ns | 8186.666666666667 ns | 1.01 |
| array/random/rand/Int64 | 36919 ns | 30201.5 ns | 1.22 |
| array/random/rand/Float32 | 12506 ns | 12396 ns | 1.01 |
| array/permutedims/4d | 55908 ns | 52729 ns | 1.06 |
| array/permutedims/2d | 52565 ns | 52645 ns | 1.00 |
| array/permutedims/3d | 52683 ns | 53080 ns | 0.99 |
| array/sorting/1d | 2735353.5 ns | 2736443 ns | 1.00 |
| array/sorting/by | 3305047.5 ns | 3305811 ns | 1.00 |
| array/sorting/2d | 1068187 ns | 1071655.5 ns | 1.00 |
| cuda/synchronization/stream/auto | 983.6470588235294 ns | 1034.5263157894738 ns | 0.95 |
| cuda/synchronization/stream/nonblocking | 8144.2 ns | 7705.9 ns | 1.06 |
| cuda/synchronization/stream/blocking | 823.6262626262626 ns | 784.4516129032259 ns | 1.05 |
| cuda/synchronization/context/auto | 1133.7 ns | 1133.5 ns | 1.00 |
| cuda/synchronization/context/nonblocking | 8208.9 ns | 7594.6 ns | 1.08 |
| cuda/synchronization/context/blocking | 905.6326530612245 ns | 885.6792452830189 ns | 1.02 |
This comment was automatically generated by workflow using github-action-benchmark.
Closes #3019.
@eschnett asked me to open a new PR duplicating his #3020, but with the fix for the @device_functions macro included. He could not test whether the fix works, because I had submitted it as a PR against his fork, which has no CI infrastructure.