[Enh]: nw.scan_ndjson as a function, passing narwhals schema #3505
Description
We would like to learn about your use case. For example, if this feature is needed to adopt Narwhals in an open source project, could you please enter the link to it below?
tldr: I would love a nw.scan_ndjson function that can lazily read ndjson, backed by pyspark, and pass a narwhals schema.
df = nw.scan_ndjson(path, backend=pyspark, schema=nw.Schema(...))

I am running Spark jobs in Azure ML, at Microsoft, and I'm trying to migrate my code from "hardcoded Spark v3.5" to Narwhals.
Narwhals is alluring, since it is quite difficult to write well-tested code with pyspark. My thinking is that by writing my AML Components using Narwhals, I should be able to quite easily expose the same code backed by polars as a "CommandComponent" (single-node job), or backed by Spark for distributed computing with the SparkComponent.
Then I can easily test my code on conventional machines, and quickly switch to spark when I need it to handle massive data.
A lot of the data we need to parse is served from various APIs as ndjson, which is essentially a sequence of JSON documents, one per line of the file, where each line becomes one row.
When I parse these with spark or polars, it is very valuable to be explicit about the schema ahead of time, since some columns might be inferred as "Null" datatype if all the row values used for inferring the dtype for a given column are null.
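To illustrate the inference problem, here is a minimal stdlib-only sketch. The sample data and the `infer_types` helper are made up for this issue and are not how spark or polars infer types internally; the point is only that a column whose sampled values are all null carries no type information.

```python
import json

# Hypothetical ndjson payload: one JSON object per line.
# The "note" column is null in every sampled row.
NDJSON = '{"id": 1, "note": null}\n{"id": 2, "note": null}\n'


def infer_types(text: str) -> dict[str, str]:
    """Naively infer one type name per column from ndjson text."""
    seen: dict[str, set[str]] = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        for key, value in json.loads(line).items():
            kinds = seen.setdefault(key, set())
            if value is not None:
                kinds.add(type(value).__name__)
    # A column with no non-null values has nothing to infer from.
    return {
        key: "null" if not kinds else "/".join(sorted(kinds))
        for key, kinds in seen.items()
    }


inferred = infer_types(NDJSON)
print(inferred)
```

Here "note" can only be reported as null, which is why passing the intended dtype up front matters.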
At the moment, I think I have to (?) type the schemas for each backend separately, since I don't see how I can pass the nw.Schema object to pyspark, or convert it into something pyspark can read.
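As a stopgap, something like the following might work. It is a rough sketch, not an existing Narwhals API: `to_spark_ddl` and the `_SPARK_DDL` table are hypothetical names, the mapping covers only a few simple dtypes, and it relies on the dtype class names (`Int64`, `String`, ...) matching the keys below. It builds a DDL string, which pyspark's `DataFrameReader.schema` accepts in place of a `StructType`.

```python
from collections.abc import Mapping

# Assumed mapping from Narwhals dtype class names to Spark SQL type names.
# Deliberately partial: nested and parametrised dtypes are not handled.
_SPARK_DDL = {
    "Int32": "INT",
    "Int64": "BIGINT",
    "Float32": "FLOAT",
    "Float64": "DOUBLE",
    "String": "STRING",
    "Boolean": "BOOLEAN",
    "Date": "DATE",
}


def to_spark_ddl(schema: Mapping[str, object]) -> str:
    """Render a name -> dtype mapping as a Spark DDL schema string."""
    parts = []
    for name, dtype in schema.items():
        # Accept both dtype instances (Int64()) and dtype classes (Int64).
        cls = dtype if isinstance(dtype, type) else type(dtype)
        if cls.__name__ not in _SPARK_DDL:
            raise NotImplementedError(f"no Spark mapping for {cls.__name__}")
        # Column names with special characters would need backtick quoting.
        parts.append(f"{name} {_SPARK_DDL[cls.__name__]}")
    return ", ".join(parts)


# Intended usage (needs narwhals, pyspark and a SparkSession named `spark`):
#   schema = nw.Schema({"id": nw.Int64(), "name": nw.String()})
#   native = spark.read.schema(to_spark_ddl(schema)).json(path)
#   df = nw.from_native(native)
```

This still means maintaining a dtype mapping by hand, which is the part I would hope nw.scan_ndjson could take care of.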
Thus, I find myself desiring a nw.scan_ndjson that can lazily read ndjson, backed by pyspark, and pass a narwhals schema.
Please describe the purpose of the new feature or describe the problem to solve.
See previous text block
Suggest a solution if possible.
No response
If you have tried alternatives, please describe them below.
No response
Additional information that may help us understand your needs.
No response