
[FEA] Provide an API to list all files in a remote directory #813

@kingcrimsontianyu

Description


PDS-H benchmarks at higher scale factors usually store hundreds of files under a data directory (e.g. directory lineitem) rather than in a single large data file (e.g. file lineitem.parquet). While cuDF can handle directories on local file systems via Python, it cannot do so for remote directories. The current workaround is for users to enumerate every remote file in the benchmark by hand, which is cumbersome.

We need to investigate which endpoint types in KvikIO (S3, S3 presigned, WebHDFS) can support file listing and how to implement it via libcurl. Ideally, we want an interface similar to:

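// Can dispatch to a polymorphic function of the same name in the endpoint.
// The default for `recursive` (true or false) is still to be decided; false is a placeholder.
std::vector<std::string> kvikio::RemoteHandle::listdir(std::string const& url, bool recursive = false);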
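For the S3 endpoint, one possible starting point is the ListObjectsV2 REST call, which returns an XML document with one `<Key>` element per object. The sketch below only illustrates the request URL and a naive extraction of keys from a response body; `make_list_url` and `extract_tag_values` are hypothetical helpers, not KvikIO functions, and the libcurl transfer and request signing are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of KvikIO): build the query URL for an S3
// ListObjectsV2 request. Assumes `bucket_url` has no trailing slash and that
// `prefix` needs no percent-encoding.
std::string make_list_url(std::string const& bucket_url,
                          std::string const& prefix,
                          bool recursive)
{
  std::string url = bucket_url + "/?list-type=2&prefix=" + prefix;
  // With a "/" delimiter (percent-encoded as %2F), S3 groups deeper keys into
  // CommonPrefixes instead of listing them, giving a non-recursive listing.
  if (!recursive) { url += "&delimiter=%2F"; }
  return url;
}

// Hypothetical helper: collect the text of every <tag>...</tag> element.
// This is a naive string scan; a real implementation would use an XML parser
// and handle entity escapes.
std::vector<std::string> extract_tag_values(std::string const& xml, std::string const& tag)
{
  assert(!tag.empty());
  std::string const open  = "<" + tag + ">";
  std::string const close = "</" + tag + ">";
  std::vector<std::string> values;
  std::size_t pos = 0;
  while ((pos = xml.find(open, pos)) != std::string::npos) {
    std::size_t const begin = pos + open.size();
    std::size_t const end   = xml.find(close, begin);
    if (end == std::string::npos) { break; }  // Unterminated element: stop scanning.
    values.push_back(xml.substr(begin, end - begin));
    pos = end + close.size();
  }
  return values;
}
```

Calling `extract_tag_values(body, "Key")` on a response body would yield the object keys. A single ListObjectsV2 response holds at most 1000 keys, so a complete listing would also need to follow `NextContinuationToken` while `IsTruncated` is true.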

A related and equally necessary feature is checking whether a given URL points to a remote file or a directory.
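For WebHDFS, the file-or-directory check could plausibly be built on the `GETFILESTATUS` operation, whose JSON response carries a `type` field with the value `FILE` or `DIRECTORY`. The sketch below classifies such a response body; `RemotePathType` and `classify_webhdfs_status` are hypothetical names, and the naive string scan stands in for a real JSON parser.

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type, not part of KvikIO.
enum class RemotePathType { File, Directory, Unknown };

// Hypothetical helper: inspect the body of a WebHDFS GETFILESTATUS response
// and report whether the path is a file or a directory. It looks at the first
// "type" field only; a real implementation would parse the JSON properly.
RemotePathType classify_webhdfs_status(std::string const& json)
{
  std::string const key     = "\"type\"";
  std::size_t const key_pos = json.find(key);
  if (key_pos == std::string::npos) { return RemotePathType::Unknown; }

  // Locate the quoted string value that follows the colon after the key.
  std::size_t const colon = json.find(':', key_pos + key.size());
  if (colon == std::string::npos) { return RemotePathType::Unknown; }
  std::size_t const begin = json.find('"', colon);
  if (begin == std::string::npos) { return RemotePathType::Unknown; }
  std::size_t const end = json.find('"', begin + 1);
  if (end == std::string::npos) { return RemotePathType::Unknown; }

  std::string const value = json.substr(begin + 1, end - begin - 1);
  if (value == "DIRECTORY") { return RemotePathType::Directory; }
  if (value == "FILE") { return RemotePathType::File; }
  return RemotePathType::Unknown;
}
```

S3 has no true directories, so an equivalent check there would probably have to rely on listing objects under a key prefix rather than on a status field.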
