[SPARK-57457][SQL] Support nanosecond-precision timestamp types in the CSV datasource (v1 and v2)#56818

Closed
vinodkc wants to merge 1 commit into apache:master from vinodkc:spark-57457-nanosecond-csv

@vinodkc vinodkc commented Jun 26, 2026

What changes were proposed in this pull request?

This PR adds nanosecond-precision timestamp support (TIMESTAMP_NTZ(p) and TIMESTAMP_LTZ(p)) to the CSV datasource, for both the v1 (CSVFileFormat) and v2 (CSVTable) paths.

Specifically:

  • Parser (UnivocityParser): adds TimestampNTZNanosType and TimestampLTZNanosType cases that delegate to the existing parseWithoutTimeZoneNanos / parseNanos formatter methods.
  • Generator (UnivocityGenerator): adds the corresponding write-path cases that delegate to formatWithoutTimeZoneNanos / formatNanos.
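
As a rough illustration of the symmetric read/write delegation described above, here is a minimal, stdlib-only Python sketch. It is not Spark code: the function names, the `PARSERS`/`FORMATTERS` tables, the fixed `SESSION_ZONE`, and the choice to represent a value as an integer count of nanoseconds are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone

NANOS_PER_SECOND = 1_000_000_000
EPOCH = datetime(1970, 1, 1)
# Hypothetical session time zone used by the LTZ converters in this sketch.
SESSION_ZONE = timezone(timedelta(hours=2))


def parse_ntz_nanos(text: str) -> int:
    """Parse 'YYYY-MM-DD HH:MM:SS[.fraction]' as a wall-clock nanosecond count."""
    head, _, fraction = text.partition(".")
    seconds = (datetime.strptime(head, "%Y-%m-%d %H:%M:%S") - EPOCH) // timedelta(seconds=1)
    # Pad or cut the fraction to exactly nine digits (nanoseconds).
    nanos = int(fraction.ljust(9, "0")[:9]) if fraction else 0
    return seconds * NANOS_PER_SECOND + nanos


def format_ntz_nanos(value: int) -> str:
    """Inverse of parse_ntz_nanos, always writing nine fractional digits."""
    seconds, nanos = divmod(value, NANOS_PER_SECOND)
    head = (EPOCH + timedelta(seconds=seconds)).strftime("%Y-%m-%d %H:%M:%S")
    return f"{head}.{nanos:09d}"


def parse_ltz_nanos(text: str) -> int:
    """Interpret the text in SESSION_ZONE and return nanoseconds since the UTC epoch."""
    offset = SESSION_ZONE.utcoffset(None) // timedelta(seconds=1)
    return parse_ntz_nanos(text) - offset * NANOS_PER_SECOND


def format_ltz_nanos(value: int) -> str:
    """Render a UTC-based nanosecond count as local time in SESSION_ZONE."""
    offset = SESSION_ZONE.utcoffset(None) // timedelta(seconds=1)
    return format_ntz_nanos(value + offset * NANOS_PER_SECOND)


# One entry per type on each side, mirroring the paired parser/generator cases.
PARSERS = {"TimestampNTZNanosType": parse_ntz_nanos, "TimestampLTZNanosType": parse_ltz_nanos}
FORMATTERS = {"TimestampNTZNanosType": format_ntz_nanos, "TimestampLTZNanosType": format_ltz_nanos}


def round_trip(type_name: str, text: str) -> str:
    """Read a CSV cell and write it back using the converters for one type."""
    return FORMATTERS[type_name](PARSERS[type_name](text))
```

Keeping one parser and one formatter per type is what makes a round trip lossless in this sketch: `round_trip` returns its input unchanged whenever the text carries all nine fractional digits.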

Why are the changes needed?

CSV rejected nanos timestamp types in its datasource capability checks and lacked the conversions to round-trip them, so these columns could not be written or read through CSV.

Does this PR introduce any user-facing change?

Yes. Users can write and read TimestampNTZNanosType(p) / TimestampLTZNanosType(p) (p in 7..9) with CSV.

How was this patch tested?

  • CsvFunctionsSuite: updated the existing from_csv nanosecond timestamp test, which now asserts successful parsing and the correctly truncated value rather than expecting an UNSUPPORTED_DATATYPE exception.

  • FileBasedDataSourceSuite: new end-to-end round-trip test covering both the v1 and v2 source paths, precisions 7–9, and both TimestampNTZNanosType and TimestampLTZNanosType, verifying that a DataFrame written to CSV and read back with a matching schema produces identical results.
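
To make the "truncated value" in the from_csv test concrete, the following Python sketch shows one way a nine-digit fraction could be reduced to a precision p in 7..9. The helper `truncate_to_precision` is hypothetical and only illustrates the arithmetic. It does not reproduce Spark's implementation, and its handling of negative inputs (rounding toward negative infinity) is an assumption.

```python
def truncate_to_precision(nanos: int, precision: int) -> int:
    """Keep `precision` fractional-second digits of a nanosecond count."""
    if not 7 <= precision <= 9:
        raise ValueError("precision must be in 7..9")
    step = 10 ** (9 - precision)  # 100 ns, 10 ns, or 1 ns
    # Python's % is floor-based, so negative inputs round toward negative infinity.
    return nanos - nanos % step


fraction = 123_456_789  # the nine digits after the decimal point
for p in (7, 8, 9):
    # Prints 123456700, 123456780, and 123456789 for p = 7, 8, 9.
    print(p, f"{truncate_to_precision(fraction, p):09d}")
```

Under this model a value survives a write and read-back only if both sides use the same p, which is consistent with the round-trip test reading back with a matching schema.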

Was this patch authored or co-authored using generative AI tooling?

Yes, Generated-by: Claude Code (Sonnet 4.6) was used to assist with this patch.

@MaxGekk MaxGekk left a comment

0 blocking, 0 non-blocking, 0 nits. LGTM — a minimal, faithful extension of the CSV datasource to nanosecond TIMESTAMP_NTZ(p) / TIMESTAMP_LTZ(p) (p ∈ 7..9), mirroring the micro-precision converters and dropping the capability gate on both v1 (CSVFileFormat) and v2 (CSVTable).

Verification

The read and write paths are symmetric (parseWithoutTimeZoneNanos/parseNanos on read, formatWithoutTimeZoneNanos/formatNanos on write), all four formatter methods exist with matching signatures over TimestampNanosVal, and both capability gates are updated (the prior "not supported" test correctly drops CSV). Two deliberate behaviors I checked and cleared: the nanos LTZ case omits the micro TimestampType legacy-parse fallback (appropriate: nanos has no Spark 1.x/2.0 legacy data, and micro TimestampNTZType has none either), and the 3-digit default timestamp format truncates sub-millisecond digits for all timestamp types (pre-existing, not a nanos regression; tests use explicit formats). Coverage is solid: FileBasedDataSourceSuite round-trips v1+v2 × NTZ/LTZ × precisions 7–9 (write → read-back → checkAnswer), and the updated from_csv test validates the read path against an independently-authored string with precision truncation.
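
The point about the 3-digit default format can be illustrated with a small Python sketch. It models only the fractional-second part of a timestamp. `render_fraction` and `read_fraction` are hypothetical helpers, not Spark functions, and the sketch assumes that digits beyond the pattern width are simply dropped on write.

```python
def render_fraction(nanos_of_second: int, digits: int) -> str:
    """Render the fractional second with a fixed number of digits, dropping the rest."""
    return f"{nanos_of_second:09d}"[:digits]


def read_fraction(text: str) -> int:
    """Read a rendered fraction back into nanoseconds, padding missing digits with zeros."""
    return int(text.ljust(9, "0"))


original = 123_456_789

# A 3-digit pattern keeps only milliseconds, so the value read back differs.
three = read_fraction(render_fraction(original, 3))

# A 9-digit pattern keeps every digit, so the round trip is lossless.
nine = read_fraction(render_fraction(original, 9))

print(three == original, nine == original)
```

This is why a round-trip check on nanosecond columns needs an explicit format with enough fractional digits rather than the default.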

@HyukjinKwon — could you take a look as well, since you've been close to the CSV/Univocity datasource paths?

MaxGekk commented Jun 27, 2026

+1, LGTM. Merging to master/4.x.
Thank you, @vinodkc and @dongjoon-hyun @HyukjinKwon for review.

@MaxGekk MaxGekk closed this in ab78cb5 Jun 27, 2026
MaxGekk pushed a commit that referenced this pull request Jun 27, 2026
…e CSV datasource (v1 and v2)

### What changes were proposed in this pull request?

This PR adds nanosecond-precision timestamp support (`TIMESTAMP_NTZ(p)` and `TIMESTAMP_LTZ(p)`) to the `CSV` datasource, for both the v1 (`CSVFileFormat`) and v2 (`CSVTable`) paths.

Specifically:
- Parser (`UnivocityParser`): adds `TimestampNTZNanosType` and `TimestampLTZNanosType` cases that delegate to the existing `parseWithoutTimeZoneNanos` / `parseNanos` formatter methods.
- Generator (`UnivocityGenerator`): adds the corresponding write-path cases that delegate to `formatWithoutTimeZoneNanos` / `formatNanos`.

### Why are the changes needed?

`CSV` rejected nanos timestamp types in its datasource capability checks and lacked the conversions to round-trip them, so these columns could not be written or read through CSV.

### Does this PR introduce _any_ user-facing change?

Yes. Users can write and read `TimestampNTZNanosType(p)` / `TimestampLTZNanosType(p)` (p in 7..9) with CSV.

### How was this patch tested?

- `CsvFunctionsSuite`: updated the existing from_csv nanosecond timestamp test, which now asserts successful parsing and the correctly truncated value rather than expecting an UNSUPPORTED_DATATYPE exception.

- `FileBasedDataSourceSuite`: new end-to-end round-trip test covering both the v1 and v2 source paths, precisions 7–9, and both TimestampNTZNanosType and TimestampLTZNanosType, verifying that a DataFrame written to CSV and read back with a matching schema produces identical results.

### Was this patch authored or co-authored using generative AI tooling?

Yes, Generated-by: Claude Code (Sonnet 4.6) was used to assist with this patch.

Closes #56818 from vinodkc/spark-57457-nanosecond-csv.

Authored-by: Vinod KC <vinod.kc.in@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit ab78cb5)
Signed-off-by: Max Gekk <max.gekk@gmail.com>