
feat[cuda]: export arrays to ArrowDeviceArray #6253

Open

a10y wants to merge 28 commits into develop from aduffy/c-device-iface

Conversation

@a10y (Contributor) commented Feb 2, 2026

This PR adds scaffolding for the first part of implementing export of Vortex Arrays directly to Arrow C Data Device Interface arrays, and implements the actual export for StructArray and PrimitiveArray.

Setup

Arrow defines a standard for exchanging data in-process with native code, the C Data Interface, which represents an Arrow array as a struct of raw pointers.

There is an (experimental, but mostly stable) extension called the C Data Device Interface which includes extra information such as the GPU device type and device ID.

cudf knows how to import data exported this way via its from_arrow_device_column function, which produces a column_view.

We replicate the ArrowArray and ArrowDeviceArray structs from the spec.

These are very similar to the existing FFI_ArrowArray and FFI_ArrowSchema structs from the arrow-data and arrow-schema crates, except that those use arrow Bytes and real host pointers, whereas we need CUDA device pointers and BufferHandles.
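
For reference, a minimal sketch of what `#[repr(C)]` mirrors of the two structs can look like. Field names and layout follow the Arrow C Data Device Interface spec; everything else (visibility, how `private_data` is managed) is illustrative and may differ from the PR's actual definitions:

```rust
use std::os::raw::c_void;

/// Mirror of `ArrowArray` from the Arrow C Data Interface.
#[repr(C)]
pub struct ArrowArray {
    pub length: i64,
    pub null_count: i64,
    pub offset: i64,
    pub n_buffers: i64,
    pub n_children: i64,
    pub buffers: *mut *const c_void,
    pub children: *mut *mut ArrowArray,
    pub dictionary: *mut ArrowArray,
    /// Nullable function pointer; `None` means the array has been released.
    pub release: Option<unsafe extern "C" fn(*mut ArrowArray)>,
    pub private_data: *mut c_void,
}

/// Mirror of `ArrowDeviceArray` from the (experimental) C Data Device Interface.
#[repr(C)]
pub struct ArrowDeviceArray {
    pub array: ArrowArray,
    pub device_id: i64,
    /// `ArrowDeviceType`; the spec assigns `ARROW_DEVICE_CUDA = 2`.
    pub device_type: i32,
    /// For CUDA devices, a pointer to a `cudaEvent_t`, or NULL if the data is
    /// already safe to consume on any stream.
    pub sync_event: *mut c_void,
    /// Reserved by the spec for future expansion; must be zeroed.
    pub reserved: [i64; 3],
}
```

The `Option<unsafe extern "C" fn(...)>` encoding is the idiomatic Rust spelling of a nullable C function pointer, which is how the interface signals a released array.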

Testing

Rather than polluting the repo/CI with fetching and building cudf, I made a new repo https://github.com/vortex-data/cudf-test-harness/

The harness loads the cdylib, calls its export_array() function, imports the result into cudf, calls some simple functions, then frees the array by calling the embedded release function pointer. This step is now hooked up to CI and adds only ~1m to the CUDA test jobs.
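
To make the harness's calling convention concrete, here is a hypothetical stub of such a cdylib entry point, assuming the struct mirrors sketched above. Only the name `export_array` comes from the PR text; the signature and body are illustrative guesses (a real export fills in CUDA device buffers from a canonicalized Vortex array rather than an empty one):

```rust
use std::os::raw::c_void;
use std::ptr;

// Assumes the `ArrowArray` / `ArrowDeviceArray` mirrors sketched in Setup.

unsafe extern "C" fn release_stub(array: *mut ArrowArray) {
    // Per the C Data Interface contract, a release callback marks the struct
    // as released by nulling out `release` itself (and freeing any resources
    // held in `private_data`, of which this stub has none).
    if !array.is_null() {
        (*array).release = None;
    }
}

/// Hypothetical stand-in for the entry point the harness dlopens. This stub
/// hands back an empty but releasable array.
#[no_mangle]
pub unsafe extern "C" fn export_array(out: *mut ArrowDeviceArray) {
    out.write(ArrowDeviceArray {
        array: ArrowArray {
            length: 0,
            null_count: 0,
            offset: 0,
            n_buffers: 0,
            n_children: 0,
            buffers: ptr::null_mut(),
            children: ptr::null_mut(),
            dictionary: ptr::null_mut(),
            release: Some(release_stub),
            private_data: ptr::null_mut(),
        },
        device_id: 0,
        device_type: 2,              // ARROW_DEVICE_CUDA
        sync_event: ptr::null_mut(), // NULL: nothing pending to synchronize on
        reserved: [0; 3],
    });
}
```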

@a10y a10y force-pushed the aduffy/c-device-iface branch from 9d62079 to 18f8a3d Compare February 2, 2026 20:05
@codspeed-hq bot commented Feb 3, 2026

Merging this PR will degrade performance by 33.69%

⚡ 3 improved benchmarks
❌ 1 regressed benchmark
✅ 1134 untouched benchmarks
⏩ 1265 skipped benchmarks¹

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
| --- | --- | --- | --- | --- |
| Simulation | chunked_bool_into_canonical[(1000, 10)] | 41.8 µs | 63 µs | -33.69% |
| Simulation | patched_take_10k_contiguous_not_patches | 1.5 ms | 1.2 ms | +18.49% |
| Simulation | patched_take_10k_random | 2.4 ms | 2.1 ms | +10.85% |
| Simulation | patched_take_10k_adversarial | 2.3 ms | 2 ms | +10.69% |

Comparing aduffy/c-device-iface (9824f29) with develop (5648546)

Open in CodSpeed

Footnotes

  1. 1265 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, archive them on CodSpeed to remove them from the performance reports.

@a10y a10y force-pushed the aduffy/c-device-iface branch from 3a8ddb6 to a2d1c95 Compare February 3, 2026 17:19
@a10y a10y added the changelog/feature A new feature label Feb 3, 2026
@a10y a10y changed the title WIP: initial work toward export to device interface feat[cuda]: export arrays to ArrowDeviceArray Feb 3, 2026
@a10y a10y force-pushed the aduffy/c-device-iface branch 2 times, most recently from 497d765 to 79b7afc Compare February 3, 2026 17:59
@a10y a10y marked this pull request as ready for review February 3, 2026 18:19
a10y added 10 commits February 3, 2026 13:28
@a10y a10y force-pushed the aduffy/c-device-iface branch from 74549dc to 930f824 Compare February 3, 2026 18:28
@a10y a10y force-pushed the aduffy/c-device-iface branch from 930f824 to b489712 Compare February 3, 2026 18:29
a10y added 2 commits February 3, 2026 13:30
@a10y a10y requested review from 0ax1 and joseph-isaacs February 3, 2026 18:49
a10y and others added 2 commits February 3, 2026 16:06
@0ax1 (Contributor) left a comment:

Couple of questions / remarks

// We need the children to be held across await points.
let mut children = Vec::with_capacity(fields.len());

for field in fields.iter() {
@0ax1 (Contributor):

How about something along the lines of:

use futures::future::join_all;

let futures = fields
    .iter()
    .map(|field| {
        let field = field.clone();
        async move {
            let cuda_field = field.execute_cuda(ctx).await?;
            export_canonical(cuda_field, ctx).await
        }
    })
    .collect::<Vec<_>>();

let results = join_all(futures).await;

If fields live on different streams, they run in parallel.

@a10y (Contributor Author):

Are fields going to live on different streams? If so, things might get a little weird with the SyncEvent stuff.

@a10y (Contributor Author):

Specifically: the contract for the returned ArrowDeviceArray is that its sync_event field, if not NULL, contains a pointer to a cudaEvent_t, which AFAIU can only be recorded on one stream at a time.

Similarly, the CudaExecutionCtx is linked to only one global stream right now. So assuming StructArray::execute_cuda(ctx) gets called, all of its fields should be executed with the same context, and thus the same stream.

@0ax1 (Contributor):

You can make a stream wait for another one but that’s a complexity we should figure out how to avoid.
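
For context on what that would involve: cross-stream waiting in CUDA goes through an event recorded on the producing stream. A minimal sketch against the raw CUDA runtime API (the FFI declarations are written out by hand here purely for illustration; real code would use whatever CUDA bindings the crate already has):

```rust
use std::os::raw::c_void;

// Illustrative declarations for two CUDA runtime calls; the handle types
// (`cudaEvent_t`, `cudaStream_t`) are left opaque, and `cudaError_t` is
// returned as i32 (0 == cudaSuccess).
extern "C" {
    fn cudaEventRecord(event: *mut c_void, stream: *mut c_void) -> i32;
    fn cudaStreamWaitEvent(stream: *mut c_void, event: *mut c_void, flags: u32) -> i32;
}

/// Make `consumer` wait until all work queued on `producer` before `event`
/// was recorded has finished, without blocking the host.
unsafe fn chain_streams(
    producer: *mut c_void,
    consumer: *mut c_void,
    event: *mut c_void,
) -> i32 {
    // Capture the producer stream's current position...
    let rc = cudaEventRecord(event, producer);
    if rc != 0 {
        return rc;
    }
    // ...and have the consumer stream wait on it (0 == cudaEventWaitDefault).
    cudaStreamWaitEvent(consumer, event, 0)
}
```

This is the coupling being flagged: every producer/consumer stream pair needs an event threaded between them, on top of the single sync_event the exported array can carry.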

@0ax1 (Contributor) commented Feb 4, 2026:

I think we have at most one stream per file during a scan right now (looking at gpu_scan in vortex-file/src/tests)? The stream is retrieved from the stream pool, which is capped at 4 (DEFAULT_STREAM_POOL_CAPACITY).

> that’s a complexity we should figure out how to avoid.

We should write this down in the code docs somewhere. The scenario where this can bite us is if we want to saturate a GPU with one large file, as there'd be no stream-level parallelism in that case.

}

unsafe impl Send for ArrowDeviceArray {}
unsafe impl Sync for ArrowDeviceArray {}
Contributor:

We can drop the Sync impl I think.

pub(crate) struct CanonicalDeviceArrayExport;

#[async_trait]
impl ExportDeviceArray for CanonicalDeviceArrayExport {
Contributor:

Would this logic ever differ from the arrow export with a different execute function?

@a10y (Contributor Author):

I'm not sure I understand the question.

@joseph-isaacs (Contributor):

Could we share logic between this exporter and the arrow export if we could pass around a different executor?

a10y added 4 commits February 4, 2026 11:59
@a10y (Contributor Author) commented Feb 4, 2026

I'm not sure what you mean by passing around a different exporter, but no, I don't think there's much shared logic between this and plain execute_arrow.

@a10y a10y force-pushed the aduffy/c-device-iface branch from 861e17c to c479579 Compare February 4, 2026 19:49
@a10y a10y force-pushed the aduffy/c-device-iface branch from c479579 to 0b523c1 Compare February 4, 2026 19:58
a10y added 2 commits February 4, 2026 15:11
}

/// See `VortexCudaStream::move_to_device`.
pub fn move_to_device<T: DeviceRepr + Debug + Send + Sync + 'static>(
Contributor:

👌

children.push(arrow_field);
}

let mut private_data = PrivateData::new(vec![None], children, ctx)?;
@0ax1 (Contributor) commented Feb 5, 2026:

If we await all the fields in the lines above to completion, why do we need the additional sync event? This should only be relevant if we want to hand out an ArrowDeviceArray whose canonicalization might not have completed yet and is still processing asynchronously (C-style), right?

a10y added 3 commits February 4, 2026 21:38
@a10y a10y force-pushed the aduffy/c-device-iface branch from 79fd7f4 to a3185ae Compare February 5, 2026 03:37
a10y added 3 commits February 4, 2026 22:43
This reverts commit 75279b5.
@a10y a10y force-pushed the aduffy/c-device-iface branch from 1ceb4f4 to 9824f29 Compare February 5, 2026 03:51

Labels

changelog/feature A new feature


4 participants