[SPARK-57457][SQL] Support nanosecond-precision timestamp types in the CSV datasource (v1 and v2)#56818
MaxGekk left a comment:
0 blocking, 0 non-blocking, 0 nits. LGTM — a minimal, faithful extension of the CSV datasource to nanosecond TIMESTAMP_NTZ(p) / TIMESTAMP_LTZ(p) (p ∈ 7..9), mirroring the micro-precision converters and dropping the capability gate on both v1 (CSVFileFormat) and v2 (CSVTable).
Verification
The read/write paths are symmetric (parseWithoutTimeZoneNanos/parseNanos ↔ formatWithoutTimeZoneNanos/formatNanos), all four formatter methods exist with matching signatures over TimestampNanosVal, and both capability gates are updated (the prior "not supported" test correctly drops CSV).

Two deliberate behaviors I checked and cleared:
- The nanos LTZ case omits the micro TimestampType legacy-parse fallback. This is appropriate: nanos has no Spark 1.x/2.0 legacy data, and micro TimestampNTZType has no such fallback either.
- The 3-digit default timestamp format truncates sub-millisecond digits for all timestamp types. This is pre-existing behavior, not a nanos regression; the tests use explicit formats.

Coverage is solid: FileBasedDataSourceSuite round-trips v1+v2 × NTZ/LTZ × precisions 7–9 (write → read-back → checkAnswer), and the updated from_csv test validates the read path against an independently-authored string with precision truncation.
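The point about the 3-digit default format can be illustrated outside Spark. The following is a minimal sketch using plain `java.time`, not the PR's code: the class name `FractionPatternSketch`, its helper methods, and the two pattern strings are hypothetical and chosen only for illustration. It assumes a pattern with three fraction letters (`SSS`) stands in for the default format, and a nine-letter pattern stands in for an explicit nanosecond format such as the tests would supply.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Illustrative only: not Spark's formatter classes, just java.time patterns.
final class FractionPatternSketch {
    // Hypothetical stand-ins for a 3-digit default and an explicit 9-digit format.
    static final String MILLIS_PATTERN = "yyyy-MM-dd'T'HH:mm:ss.SSS";
    static final String NANOS_PATTERN = "yyyy-MM-dd'T'HH:mm:ss.SSSSSSSSS";

    // Format a zone-less timestamp with the given pattern.
    static String format(LocalDateTime ts, String pattern) {
        return DateTimeFormatter.ofPattern(pattern).format(ts);
    }

    // Write the value as text, then read it back with the same pattern.
    static LocalDateTime roundTrip(LocalDateTime ts, String pattern) {
        DateTimeFormatter formatter = DateTimeFormatter.ofPattern(pattern);
        return LocalDateTime.parse(formatter.format(ts), formatter);
    }

    public static void main(String[] args) {
        LocalDateTime ts = LocalDateTime.of(2024, 1, 2, 3, 4, 5, 123_456_789);

        // Three fraction letters keep only the millisecond digits.
        System.out.println(format(ts, MILLIS_PATTERN)); // 2024-01-02T03:04:05.123
        // Nine fraction letters keep every nanosecond digit.
        System.out.println(format(ts, NANOS_PATTERN));  // 2024-01-02T03:04:05.123456789

        // Only the 9-digit pattern survives a write/read round trip unchanged.
        System.out.println(roundTrip(ts, MILLIS_PATTERN).equals(ts)); // false
        System.out.println(roundTrip(ts, NANOS_PATTERN).equals(ts));  // true
    }
}
```

This is why a round-trip test for precisions 7 to 9 needs an explicit format: with a 3-digit fraction the digits below the millisecond are already gone from the written text before the reader sees it.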
@HyukjinKwon — could you take a look as well, since you've been close to the CSV/Univocity datasource paths?
+1, LGTM. Merging to master/4.x.
…e CSV datasource (v1 and v2)

### What changes were proposed in this pull request?

This PR adds nanosecond-precision timestamp support (`TIMESTAMP_NTZ(p)` and `TIMESTAMP_LTZ(p)`) to the `CSV` datasource, for both the v1 (`CSVFileFormat`) and v2 (`CSVTable`) paths.

Specifically:
- Parser (`UnivocityParser`): adds `TimestampNTZNanosType` and `TimestampLTZNanosType` cases that delegate to the existing `parseWithoutTimeZoneNanos` / `parseNanos` formatter methods.
- Generator (`UnivocityGenerator`): adds the corresponding write-path cases that delegate to `formatWithoutTimeZoneNanos` / `formatNanos`.

### Why are the changes needed?

`CSV` rejected nanos timestamp types in its datasource capability checks and lacked the conversions to round-trip them, so these columns could not be written or read through CSV.

### Does this PR introduce _any_ user-facing change?

Yes. Users can write and read `TimestampNTZNanosType(p)` / `TimestampLTZNanosType(p)` (p in 7..9) with CSV.

### How was this patch tested?

- `CsvFunctionsSuite`: updated the existing from_csv nanosecond timestamp test. The test now asserts successful parsing and the correctly truncated value rather than expecting an UNSUPPORTED_DATATYPE exception.
- `FileBasedDataSourceSuite`: new end-to-end round-trip test covering both v1 and v2 source paths, precisions (7–9), and both TimestampNTZNanosType and TimestampLTZNanosType, verifying that a DataFrame written to CSV and read back with a matching schema produces identical results.

### Was this patch authored or co-authored using generative AI tooling?

Yes, Generated-by: Claude Code (Sonnet 4.6) was used to assist with this patch.

Closes #56818 from vinodkc/spark-57457-nanosecond-csv.

Authored-by: Vinod KC <vinod.kc.in@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit ab78cb5)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
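The "truncated value" that the from_csv test asserts can be pictured with a small sketch. This is not the PR's implementation: `NanosPrecisionSketch`, `truncateNanos`, and `parseWithPrecision` are hypothetical names, and the sketch assumes that reading a 9-digit fraction into a precision-p column (p in 7..9) drops the digits beyond p rather than rounding them, which is how the description's wording reads.

```java
import java.time.LocalDateTime;

// Illustrative only: models precision truncation with plain java.time values.
final class NanosPrecisionSketch {
    // Keep the first `precision` fractional digits of a nano-of-second value.
    static int truncateNanos(int nanoOfSecond, int precision) {
        if (precision < 7 || precision > 9) {
            throw new IllegalArgumentException("precision must be in 7..9: " + precision);
        }
        int unit = 1;
        for (int i = precision; i < 9; i++) {
            unit *= 10; // 100 for p=7, 10 for p=8, 1 for p=9
        }
        return nanoOfSecond - (nanoOfSecond % unit);
    }

    // Parse an ISO-8601 local timestamp and truncate it to the given precision.
    static LocalDateTime parseWithPrecision(String text, int precision) {
        LocalDateTime full = LocalDateTime.parse(text);
        return full.withNano(truncateNanos(full.getNano(), precision));
    }

    public static void main(String[] args) {
        String text = "2024-01-02T03:04:05.123456789";
        for (int p = 7; p <= 9; p++) {
            System.out.println("p=" + p + " -> " + parseWithPrecision(text, p));
        }
    }
}
```

Under that assumption, the same input string yields a fraction of `.1234567` at p=7, `.12345678` at p=8, and `.123456789` at p=9, so an expected value written by hand for a given precision has to be truncated the same way.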