
feat[cuda]: export arrays to ArrowDeviceArray #6253

Open

a10y wants to merge 28 commits into develop from aduffy/c-device-iface

Conversation

@a10y (Contributor) commented Feb 2, 2026

This PR adds scaffolding for the first part of implementing export of Vortex Arrays directly to Arrow C Data Device Interface arrays, and implements the actual export for StructArray and PrimitiveArray.

Setup

Arrow defines a standard for exchanging data in-process with native code, the C Data Interface, which represents an Arrow array as a struct of raw pointers.

There is an (experimental, but mostly stable) extension called the C Data Device Interface which includes extra information such as the GPU device type and device ID.

cudf knows how to import data exported this way via its from_arrow_device_column function, which produces a column_view.

We replicate the ArrowArray and ArrowDeviceArray structs from the spec.

These are very similar to the existing FFI_ArrowArray and FFI_ArrowSchema structs from the arrow-data and arrow-schema crates, except that those use arrow Bytes and real host pointers, whereas we need CUDA device pointers and BufferHandles.
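
For reference, a minimal sketch of what `#[repr(C)]` mirrors of the two structs can look like. Field names and layout follow the Arrow C Data Device Interface spec; everything else (visibility, how `private_data` is managed) is illustrative and may differ from the PR's actual definitions:

```rust
use std::os::raw::c_void;

/// Mirror of `ArrowArray` from the Arrow C Data Interface.
#[repr(C)]
pub struct ArrowArray {
    pub length: i64,
    pub null_count: i64,
    pub offset: i64,
    pub n_buffers: i64,
    pub n_children: i64,
    pub buffers: *mut *const c_void,
    pub children: *mut *mut ArrowArray,
    pub dictionary: *mut ArrowArray,
    /// Nullable function pointer; `None` means the array has been released.
    pub release: Option<unsafe extern "C" fn(*mut ArrowArray)>,
    pub private_data: *mut c_void,
}

/// Mirror of `ArrowDeviceArray` from the (experimental) C Data Device Interface.
#[repr(C)]
pub struct ArrowDeviceArray {
    pub array: ArrowArray,
    pub device_id: i64,
    /// `ArrowDeviceType`; the spec assigns `ARROW_DEVICE_CUDA = 2`.
    pub device_type: i32,
    /// For CUDA devices, a pointer to a `cudaEvent_t`, or NULL if the data is
    /// already safe to consume on any stream.
    pub sync_event: *mut c_void,
    /// Reserved by the spec for future expansion; must be zeroed.
    pub reserved: [i64; 3],
}
```

The `Option<unsafe extern "C" fn(...)>` encoding is the idiomatic Rust spelling of a nullable C function pointer, which is how the interface signals a released array.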

Testing

Rather than polluting the repo/CI with fetching and building cudf, I made a new repo https://github.com/vortex-data/cudf-test-harness/

The harness loads the cdylib, calls its export_array() function, imports the result into cudf, calls some simple functions, then frees the array by calling the embedded release function pointer. This step is now hooked up to CI and adds only ~1m to the CUDA test jobs.
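
To make the harness's calling convention concrete, here is a hypothetical stub of such a cdylib entry point, assuming the struct mirrors sketched above. Only the name `export_array` comes from the PR text; the signature and body are illustrative guesses (a real export fills in CUDA device buffers from a canonicalized Vortex array rather than an empty one):

```rust
use std::os::raw::c_void;
use std::ptr;

// Assumes the `ArrowArray` / `ArrowDeviceArray` mirrors sketched in Setup.

unsafe extern "C" fn release_stub(array: *mut ArrowArray) {
    // Per the C Data Interface contract, a release callback marks the struct
    // as released by nulling out `release` itself (and freeing any resources
    // held in `private_data`, of which this stub has none).
    if !array.is_null() {
        (*array).release = None;
    }
}

/// Hypothetical stand-in for the entry point the harness dlopens. This stub
/// hands back an empty but releasable array.
#[no_mangle]
pub unsafe extern "C" fn export_array(out: *mut ArrowDeviceArray) {
    out.write(ArrowDeviceArray {
        array: ArrowArray {
            length: 0,
            null_count: 0,
            offset: 0,
            n_buffers: 0,
            n_children: 0,
            buffers: ptr::null_mut(),
            children: ptr::null_mut(),
            dictionary: ptr::null_mut(),
            release: Some(release_stub),
            private_data: ptr::null_mut(),
        },
        device_id: 0,
        device_type: 2,              // ARROW_DEVICE_CUDA
        sync_event: ptr::null_mut(), // NULL: nothing pending to synchronize on
        reserved: [0; 3],
    });
}
```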

@a10y a10y force-pushed the aduffy/c-device-iface branch from 9d62079 to 18f8a3d Compare February 2, 2026 20:05
@codspeed-hq bot commented Feb 3, 2026

Merging this PR will degrade performance by 33.69%

⚡ 3 improved benchmarks
❌ 1 regressed benchmark
✅ 1134 untouched benchmarks
⏩ 1265 skipped benchmarks¹

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
| --- | --- | --- | --- | --- |
| Simulation | chunked_bool_into_canonical[(1000, 10)] | 41.8 µs | 63 µs | -33.69% |
| Simulation | patched_take_10k_contiguous_not_patches | 1.5 ms | 1.2 ms | +18.49% |
| Simulation | patched_take_10k_random | 2.4 ms | 2.1 ms | +10.85% |
| Simulation | patched_take_10k_adversarial | 2.3 ms | 2 ms | +10.69% |

Comparing aduffy/c-device-iface (9824f29) with develop (5648546)

Open in CodSpeed

Footnotes

  1. 1265 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, archive them on CodSpeed to remove them from the performance reports.

@a10y a10y force-pushed the aduffy/c-device-iface branch from 3a8ddb6 to a2d1c95 Compare February 3, 2026 17:19
@a10y a10y added the changelog/feature A new feature label Feb 3, 2026
@a10y a10y changed the title WIP: initial work toward export to device interface feat[cuda]: export arrays to ArrowDeviceArray Feb 3, 2026
@a10y a10y force-pushed the aduffy/c-device-iface branch 2 times, most recently from 497d765 to 79b7afc Compare February 3, 2026 17:59
@a10y a10y marked this pull request as ready for review February 3, 2026 18:19
a10y added 10 commits February 3, 2026 13:28
@a10y a10y force-pushed the aduffy/c-device-iface branch from 74549dc to 930f824 Compare February 3, 2026 18:28
@a10y a10y force-pushed the aduffy/c-device-iface branch from 930f824 to b489712 Compare February 3, 2026 18:29
a10y added 2 commits February 3, 2026 13:30
@a10y a10y requested review from 0ax1 and joseph-isaacs February 3, 2026 18:49
a10y and others added 2 commits February 3, 2026 16:06
@0ax1 (Contributor) left a comment:

Couple of questions / remarks

// We need the children to be held across await points.
let mut children = Vec::with_capacity(fields.len());

for field in fields.iter() {
@0ax1 (Contributor):

How about something along the lines of:

use futures::future::join_all;

let futures = fields
    .iter()
    .map(|field| {
        let field = field.clone();
        async move {
            let cuda_field = field.execute_cuda(ctx).await?;
            export_canonical(cuda_field, ctx).await
        }
    })
    .collect::<Vec<_>>();

let results = join_all(futures).await;

If fields live on different streams, they run in parallel.

@a10y (Contributor Author):

Are fields going to live on different streams? If so, things might get a little weird with the SyncEvent stuff.

@a10y (Contributor Author):

Specifically: the contract for the returned ArrowDeviceArray is that its sync_event field, if not NULL, contains a pointer to a cudaEvent_t, which AFAIU can only be recorded on one stream at a time.

Similarly, the CudaExecutionCtx is linked to only one global stream right now. So assuming StructArray::execute_cuda(ctx) gets called, all of its fields should be executed with the same context, and thus the same stream.

@0ax1 (Contributor):

You can make a stream wait for another one but that’s a complexity we should figure out how to avoid.
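
For context on what that would involve: cross-stream waiting in CUDA goes through an event recorded on the producing stream. A minimal sketch against the raw CUDA runtime API (the FFI declarations are written out by hand here purely for illustration; real code would use whatever CUDA bindings the crate already has):

```rust
use std::os::raw::c_void;

// Illustrative declarations for two CUDA runtime calls; the handle types
// (`cudaEvent_t`, `cudaStream_t`) are left opaque, and `cudaError_t` is
// returned as i32 (0 == cudaSuccess).
extern "C" {
    fn cudaEventRecord(event: *mut c_void, stream: *mut c_void) -> i32;
    fn cudaStreamWaitEvent(stream: *mut c_void, event: *mut c_void, flags: u32) -> i32;
}

/// Make `consumer` wait until all work queued on `producer` before `event`
/// was recorded has finished, without blocking the host.
unsafe fn chain_streams(
    producer: *mut c_void,
    consumer: *mut c_void,
    event: *mut c_void,
) -> i32 {
    // Capture the producer stream's current position...
    let rc = cudaEventRecord(event, producer);
    if rc != 0 {
        return rc;
    }
    // ...and have the consumer stream wait on it (0 == cudaEventWaitDefault).
    cudaStreamWaitEvent(consumer, event, 0)
}
```

This is the coupling being flagged: every producer/consumer stream pair needs an event threaded between them, on top of the single sync_event the exported array can carry.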

@0ax1 (Contributor) commented Feb 4, 2026:

I think we have at most one stream per file during a scan right now (looking at gpu_scan in vortex-file/src/tests)? The stream is retrieved from the stream pool, which is capped at 4 (DEFAULT_STREAM_POOL_CAPACITY).

> that’s a complexity we should figure out how to avoid.

We should write this down in the code docs somewhere. The scenario where this can bite us is if we want to saturate a GPU with one large file, as there'd be no stream-level parallelism in that case.

}

unsafe impl Send for ArrowDeviceArray {}
unsafe impl Sync for ArrowDeviceArray {}
Contributor:

We can drop the Sync impl I think.

pub(crate) struct CanonicalDeviceArrayExport;

#[async_trait]
impl ExportDeviceArray for CanonicalDeviceArrayExport {
Contributor:

Would this logic ever differ from the arrow export with a different execute function?

@a10y (Contributor Author):

I'm not sure I understand the question.

@joseph-isaacs (Contributor):

Could we share logic between this exporter and the arrow export if we could pass around a different executor?

a10y added 4 commits February 4, 2026 11:59
@a10y (Contributor Author) commented Feb 4, 2026

I'm not sure what you mean by passing around a different exporter, but no, I don't think there's much shared logic between this and plain execute_arrow.

@a10y a10y force-pushed the aduffy/c-device-iface branch from 861e17c to c479579 Compare February 4, 2026 19:49
@a10y a10y force-pushed the aduffy/c-device-iface branch from c479579 to 0b523c1 Compare February 4, 2026 19:58
a10y added 2 commits February 4, 2026 15:11
}

/// See `VortexCudaStream::move_to_device`.
pub fn move_to_device<T: DeviceRepr + Debug + Send + Sync + 'static>(
Contributor:

👌

children.push(arrow_field);
}

let mut private_data = PrivateData::new(vec![None], children, ctx)?;
@0ax1 (Contributor) commented Feb 5, 2026:

If we await all the fields in the lines above to completion, why do we need the additional sync event? This should only be relevant if we want to hand out an ArrowDeviceArray whose canonicalization might not have completed yet and is still processing asynchronously (C-style), right?

a10y added 3 commits February 4, 2026 21:38
@a10y a10y force-pushed the aduffy/c-device-iface branch from 79fd7f4 to a3185ae Compare February 5, 2026 03:37
a10y added 3 commits February 4, 2026 22:43
This reverts commit 75279b5.
@a10y a10y force-pushed the aduffy/c-device-iface branch from 1ceb4f4 to 9824f29 Compare February 5, 2026 03:51

Labels

changelog/feature A new feature


4 participants