[Enh]: nw.scan_ndjson as a function, passing narwhals schema #3505

@thomasaarholt

Description

We would like to learn about your use case. For example, if this feature is needed to adopt Narwhals in an open source project, could you please enter the link to it below?

tl;dr: I would love a nw.scan_ndjson function that can lazily read ndjson, backed by pyspark, and that accepts a narwhals schema.

df = nw.scan_ndjson(path, backend=pyspark, schema=nw.Schema(...))

I am running spark jobs in Azure ML, at Microsoft, and I'm trying to migrate my code from "hardcoded spark v3.5" to narwhals.
Narwhals is alluring, since it is quite difficult to write well-tested code with pyspark. My thinking is that by writing my AML Components using narwhals, I should be able to quite easily expose the code backed by polars as a "CommandComponent" (single-node job), or backed by Spark for distributed computing with the SparkComponent.

Then I can easily test my code on conventional machines, and quickly switch to spark when I need it to handle massive data.

A lot of the data we need to parse is served from various APIs as ndjson, which is essentially a sequence of JSON objects, one per line of the file, with each object corresponding to one row.

When I parse these with spark or polars, it is very valuable to be explicit about the schema ahead of time, since some columns might be inferred as "Null" datatype if all the row values used for inferring the dtype for a given column are null.
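To illustrate the inference problem, here is a minimal sketch using only the Python standard library. The `infer_dtypes` helper and the sample data are hypothetical and are not how spark or polars infer types internally; they only show why a column whose sampled values are all null cannot be given a useful dtype without an explicit schema.

```python
import json

# Two ndjson rows where the "note" column is null in every sampled row.
NDJSON = '{"id": 1, "note": null}\n{"id": 2, "note": null}\n'


def infer_dtypes(text):
    """Naively infer one dtype name per column from ndjson text (hypothetical helper)."""
    dtypes = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        for key, value in json.loads(line).items():
            if value is None:
                # Nothing to learn from a null; remember the column as "Null"
                # unless a real type has already been seen.
                dtypes.setdefault(key, "Null")
            else:
                dtypes[key] = type(value).__name__
    return dtypes


inferred = infer_dtypes(NDJSON)

# An explicit schema overrides whatever inference came up with.
explicit = {"id": "int", "note": "str"}
resolved = {**inferred, **explicit}
```

Here `inferred` leaves `note` as `"Null"`, while `resolved` carries the intended type, which is the behaviour an explicit schema argument is meant to guarantee.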

At the moment, I think I have to (?) type the schemas separately for each backend, since I don't see how I can pass the nw.Schema object to pyspark, or convert it into something pyspark can read.
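One possible workaround, sketched under assumptions: keep a single backend-neutral mapping of column names to dtype names and derive a Spark DDL schema string from it. The `SPARK_DDL_TYPES` table and the `to_spark_ddl` helper below are hypothetical and not part of narwhals or pyspark; the dtype names are assumed to mirror narwhals dtype class names, and only a handful of types are covered.

```python
# Hypothetical mapping from narwhals-style dtype names to Spark SQL DDL type names.
SPARK_DDL_TYPES = {
    "Int32": "INT",
    "Int64": "BIGINT",
    "Float64": "DOUBLE",
    "String": "STRING",
    "Boolean": "BOOLEAN",
    "Date": "DATE",
}


def to_spark_ddl(schema):
    """Build a Spark DDL schema string from {column name: dtype name} (hypothetical helper)."""
    parts = []
    for name, dtype_name in schema.items():
        if dtype_name not in SPARK_DDL_TYPES:
            raise ValueError(f"no Spark mapping for dtype {dtype_name!r} (column {name!r})")
        # Backticks quote the column name in Spark SQL DDL.
        parts.append(f"`{name}` {SPARK_DDL_TYPES[dtype_name]}")
    return ", ".join(parts)


ddl = to_spark_ddl({"id": "Int64", "note": "String"})

# With a live SparkSession named `spark` (not created here), the string could
# then be passed to the reader, which accepts a DDL-formatted schema:
# df = spark.read.schema(ddl).json(path)
```

This still means maintaining a mapping by hand, which is the duplication a built-in `nw.scan_ndjson(..., schema=...)` would remove.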

Thus, I find myself wanting a nw.scan_ndjson that can lazily read ndjson, backed by pyspark, and that accepts a narwhals schema.

Please describe the purpose of the new feature or describe the problem to solve.

See previous text block

Suggest a solution if possible.

No response

If you have tried alternatives, please describe them below.

No response

Additional information that may help us understand your needs.

No response
