Explore bridges between CWL and Dask #908
Description
There is great potential for overlap between (Dask Clusters + Argo Workflows) and (Docker/Kubernetes + CWL), given EOEPCA and the tools that can leverage different HPC/Clouds. Because Dask ties in well with multiple data representations used in scientific work (xarray, numpy, pandas, dataframe, scikit, etc.), it is often a good (or more easily accessible?) option for users who work within the same Python environment where they manipulate data. However, Dask lacks the workflow annotation layer that CWL provides.
If we could define some helpers/converters between them (can they be natively represented, or is simply embedding the Python script enough?), we could lower the adoption/introduction bar of CWL concepts (notably for "simpler" use cases that do not employ advanced Dask features).
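As a rough illustration of the "simply embed the Python script" option, the sketch below wraps a small Dask script in a CWL `CommandLineTool` description built as a plain Python dict and serialized to JSON (CWL documents may be written as JSON as well as YAML). The image name `example.org/dask-env:latest`, the file names `script.py` and `result.txt`, and the helper `make_tool` are hypothetical placeholders, not existing Weaver or Dask artifacts.

```python
import json

# Python script to embed. It builds a tiny Dask graph with `dask.delayed`
# and only runs inside the container, where Dask is assumed to be installed.
SCRIPT = '''\
import sys
import dask

@dask.delayed
def square(v):
    return v * v

values = [int(a) for a in sys.argv[1:]]
total = dask.delayed(sum)([square(v) for v in values])
print(total.compute())
'''


def make_tool(script: str, image: str) -> dict:
    """Describe a CWL CommandLineTool that stages `script` and runs it."""
    return {
        "cwlVersion": "v1.2",
        "class": "CommandLineTool",
        "baseCommand": ["python", "script.py"],
        "requirements": {
            # Container expected to provide Python, Dask and other dependencies.
            "DockerRequirement": {"dockerPull": image},
            # Writes the embedded script into the working directory.
            "InitialWorkDirRequirement": {
                "listing": [{"entryname": "script.py", "entry": script}]
            },
        },
        "inputs": {
            "values": {"type": "int[]", "inputBinding": {"position": 1}},
        },
        "outputs": {"result": {"type": "stdout"}},
        "stdout": "result.txt",
    }


# Hypothetical image name, used only for illustration.
tool = make_tool(SCRIPT, "example.org/dask-env:latest")
document = json.dumps(tool, indent=2)
print(document)
```

With this approach, CWL only sees a single opaque step: the Dask graph inside the script is not visible to the workflow engine, which is the trade-off the native-representation question is about.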
To Do
- see if Dask can be integrated with CWL / `cwltool` using advanced notations for special runtime requirements; examples, but not limited to:
  - Mapping CWL DAG vs Dask Collection Interface and Dask Specification?
  - CWL `ResourceRequirement` vs Dask HPC resources
  - GPU `cwltool:CUDARequirement` vs Dask GPU
  - CWL Workflow vs Dask Task Graph
  - CWL Workflow `scatter` vs ??? (i.e.: how to detect where Dask fans in/out the "apps")
  - CWL Workflow `steps` vs Dask `delayed` (notably for `@dask.delayed` decorated functions)
- otherwise, see if running Dask as a `CommandLineTool` with an embedded Python script is enough
  (e.g.: https://pavics-weaver.readthedocs.io/en/latest/package.html#script-application)
- Ensure some control over HPC/Clouds (i.e.: if Dask mappings are provided, they won't magically gain access to compute locations)
  - how to establish, or limit if needed, where data can be sent?
  - should it be considered at all, or let it happen if allowed, and fail execution otherwise?
- How to provide other Dask dependencies (e.g.: if the code relies on a certain data-specific or compute-specific package)
  - Role of the `DockerRequirement` to provide an appropriate env?
  - Offer pre-made ones?
- Visualization of the Dask Graph (in contrast to [Feature] Graph image rendering of contained CWL #213)
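To make the `steps` and `scatter` items in the list above more concrete, here is a minimal sketch of one possible mapping from a workflow description to a dict shaped like the Dask graph specification (keys mapped to literals, lists of keys, or `(callable, *args)` tuples). The `workflow` structure is a heavily simplified, hypothetical stand-in for a CWL Workflow (steps call Python functions rather than `CommandLineTool`s), and `to_dask_graph`, `registry`, and `resolve` are illustrative names, not existing APIs.

```python
from typing import Any, Callable, Dict

# Hypothetical, simplified stand-in for a CWL Workflow: each step names a
# Python callable and wires inputs to a workflow input or "step_id/out".
workflow = {
    "inputs": {"values": [1, 2, 3]},
    "steps": {
        "square": {"run": "square", "in": {"v": "values"}, "scatter": "v"},
        "total": {"run": "total", "in": {"items": "square/out"}},
    },
}

# Maps the "run" names above to actual Python callables.
registry: Dict[str, Callable[..., Any]] = {
    "square": lambda v: v * v,
    "total": lambda items: sum(items),
}


def to_dask_graph(wf: dict, funcs: Dict[str, Callable[..., Any]]) -> dict:
    """Build a Dask-style graph dict from the simplified workflow."""
    graph: Dict[str, Any] = dict(wf["inputs"])  # inputs become literal entries
    for step_id, step in wf["steps"].items():
        func = funcs[step["run"]]
        sources = list(step["in"].items())  # [(parameter, source key), ...]
        scatter = step.get("scatter")
        out_key = f"{step_id}/out"
        if scatter is None:
            # Plain step: one task whose arguments are keys of other entries.
            graph[out_key] = (func, *(src for _, src in sources))
            continue
        # Scatter (fan-out): one task per item. This sketch assumes the
        # scattered source is a workflow input whose length is known up front.
        items = wf["inputs"][step["in"][scatter]]
        keys = []
        for i, item in enumerate(items):
            key = f"{step_id}-{i}"
            args = (item if name == scatter else src for name, src in sources)
            graph[key] = (func, *args)
            keys.append(key)
        graph[out_key] = keys  # fan-in: a list of keys resolves to a list
    return graph


def resolve(graph: dict, value: Any) -> Any:
    """Tiny evaluator for the subset of the graph spec used here."""
    if isinstance(value, str) and value in graph:
        return resolve(graph, graph[value])
    if isinstance(value, list):
        return [resolve(graph, v) for v in value]
    if isinstance(value, tuple) and value and callable(value[0]):
        return value[0](*(resolve(graph, a) for a in value[1:]))
    return value


graph = to_dask_graph(workflow, registry)
print(resolve(graph, "total/out"))
```

If Dask is installed, the same dict should also be executable with `dask.threaded.get(graph, "total/out")` in place of the local `resolve` helper. The sketch also shows the open question behind the `scatter` item: the fan-out width has to be known when the graph is built, which is not the case when a step scatters over another step's output.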
References
Extra Docs:
- https://docs.dask.org/en/stable/deploying.html
- https://docs.dask.org/en/stable/user-interfaces.html
- https://cloudprovider.dask.org/en/latest/
- https://eoepca.readthedocs.io/projects/processing/en/latest/design/processing-runner/dask/
- https://eoepca.readthedocs.io/projects/processing/en/latest/design/processing-runner/argo/
- https://duke-gcb.github.io/calrissian/
- https://toil.readthedocs.io/en/latest/
Issues: