[SPARK-57459][SQL] Support nanosecond-precision timestamp types in the Avro datasource (v1 and v2) #56825
Open
MaxGekk wants to merge 4 commits into
…e Avro datasource (v1 and v2)

### What changes were proposed in this pull request?

This PR adds read and write support for the nanosecond-capable timestamp types `TIMESTAMP_NTZ(p)` and `TIMESTAMP_LTZ(p)` (p in 7-9) to the Avro datasource (v1 `AvroFileFormat` and v2 `AvroTable`), reaching parity with the microsecond `TimestampType` / `TimestampNTZType`, and removes the SPARK-57166 rejection guardrail.

- `SchemaConverters`: map `TimestampLTZNanosType` / `TimestampNTZNanosType` to the Avro `timestamp-nanos` / `local-timestamp-nanos` logical types (available in the bundled Avro 1.12.1), carrying the precision via the `spark.sql.catalyst.type` property; the reverse direction maps them back, defaulting to nanosecond precision for files written by external tools.
- `AvroSerializer`: pack the internal `(epochMicros, nanosWithinMicro)` value into epoch-nanoseconds (Long), surfacing out-of-range values as `DATETIME_OVERFLOW`.
- `AvroDeserializer`: unpack epoch-nanoseconds via floorDiv/floorMod and truncate the sub-microsecond digits to the declared precision.
- `AvroUtils.supportsDataType`: drop the `AnyTimestampNanoType` rejection so the types are supported by both the v1 and v2 paths.

Nanosecond timestamps are always proleptic Gregorian, so they are exempt from datetime rebasing, matching the Parquet path.

### Why are the changes needed?

To extend nanosecond-precision timestamp support (umbrella SPARK-56822) to the Avro datasource so it can read and write `TIMESTAMP_NTZ(p)` / `TIMESTAMP_LTZ(p)` with p in 7-9.

### Does this PR introduce _any_ user-facing change?

Yes. With `spark.sql.timestampNanosTypes.enabled=true`, columns of type `TIMESTAMP_NTZ(7-9)` / `TIMESTAMP_LTZ(7-9)` can now be written to and read from Avro files (previously rejected with `UNSUPPORTED_DATA_TYPE_FOR_DATASOURCE`).

### How was this patch tested?

Added tests in `AvroSuite` (round-trip for precisions 7-9 on v1/v2, external-reader unit-correctness, reading a plain Avro file without the catalyst-type property, and write overflow) and ran `AvroV1Suite` / `AvroV2Suite` plus the Avro serde and logical-type suites.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Cursor 2.1, Claude Opus 4.8
Introduce `DateTimeUtils.epochNanosToTimestampNanos(epochNanos, precision)` as the inverse of `timestampNanosToEpochNanos`, and use it from both the Avro deserializer and the Parquet `TimestampNanosParquetOps` converter, removing the duplicated floorDiv/floorMod + precision-truncation logic. Add unit tests for the new helper in `DateTimeUtilsSuite`.
…ared DateTimeUtils helper

Consolidate the encode-with-overflow wrapper (try `timestampNanosToEpochNanos`, catch `ArithmeticException` -> `DATETIME_OVERFLOW` naming the sink) that was duplicated in `AvroSerializer` (sink="Avro") and `TimestampNanosParquetOps` (sink="Parquet INT64") into a single `DateTimeUtils.timestampNanosToEpochNanos(value, isNtz, sink)` overload, mirroring the decode-path consolidation (`epochNanosToTimestampNanos`). Behavior is unchanged.

Co-authored-by: Isaac

…d overflow helper

The prior commit hoisted the encode-with-overflow wrapper out of `TimestampNanosParquetOps` into `DateTimeUtils.timestampNanosToEpochNanos(value, isNtz, sink)`; update the suite's two packing/overflow tests to call the shared helper with sink="Parquet INT64". The asserted behavior is unchanged (combine result and `DATETIME_OVERFLOW` condition).

Co-authored-by: Isaac
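The shared encode and decode helpers described in the commits above can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: `TimestampNanos`, `DateTimeOverflow`, and `NanosCodec` are hypothetical stand-ins, and the truncation rule (drop `9 - precision` trailing decimal digits of the nanosecond part) is an assumption about what "truncate to the declared precision" means.

```scala
// Hypothetical stand-in for the internal (epochMicros, nanosWithinMicro) value.
final case class TimestampNanos(epochMicros: Long, nanosWithinMicro: Int)

// Hypothetical stand-in for the DATETIME_OVERFLOW error condition; it names the sink.
final class DateTimeOverflow(val sink: String, cause: Throwable)
  extends RuntimeException(s"DATETIME_OVERFLOW while writing epoch-nanoseconds to $sink", cause)

object NanosCodec {
  private val NanosPerMicro = 1000L

  // Pack into one epoch-nanoseconds Long, turning arithmetic overflow into a
  // sink-naming error. Note: this naive form also rejects the few values whose
  // intermediate product overflows even though the final sum would fit.
  def encode(ts: TimestampNanos, sink: String): Long =
    try {
      Math.addExact(
        Math.multiplyExact(ts.epochMicros, NanosPerMicro),
        ts.nanosWithinMicro.toLong)
    } catch {
      case e: ArithmeticException => throw new DateTimeOverflow(sink, e)
    }

  // Unpack with floor semantics so pre-1970 values keep a non-negative
  // nanosecond part, then truncate to the declared precision (7, 8, or 9).
  def decode(epochNanos: Long, precision: Int): TimestampNanos = {
    require(precision >= 7 && precision <= 9, s"unsupported precision: $precision")
    val micros = Math.floorDiv(epochNanos, NanosPerMicro)
    val nanos = Math.floorMod(epochNanos, NanosPerMicro).toInt
    val step = precision match {
      case 7 => 100 // keep hundreds of nanoseconds
      case 8 => 10  // keep tens of nanoseconds
      case _ => 1   // keep every nanosecond
    }
    TimestampNanos(micros, nanos - nanos % step)
  }
}
```

Under these assumptions, `NanosCodec.decode(-1L, 9)` yields `TimestampNanos(-1L, 999)`, which encodes back to `-1L`; plain `/` and `%` would instead give `(0, -1)`.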
@uros-b @stevomitric Could you review this PR, please?
### What changes were proposed in this pull request?

Umbrella: SPARK-56822 (Timestamps with nanosecond precision).

This PR adds read and write support for the nanosecond-capable timestamp types `TIMESTAMP_NTZ(p)` and `TIMESTAMP_LTZ(p)` (`p` in 7-9) to the Avro datasource (v1 `AvroFileFormat` and v2 `AvroTable`), reaching parity with the microsecond `TimestampType` / `TimestampNTZType`, and removes the SPARK-57166 rejection guardrail.

- `SchemaConverters`: map `TimestampLTZNanosType` / `TimestampNTZNanosType` to the Avro `timestamp-nanos` / `local-timestamp-nanos` logical types (available in the bundled Avro 1.12.1, on a `long` storing epoch-nanoseconds), carrying the fractional-second precision via the `spark.sql.catalyst.type` property. The reverse direction maps these logical types back, defaulting to nanosecond precision (9) for files written by external tools that lack the property.
- `AvroSerializer`: pack the internal `(epochMicros, nanosWithinMicro)` value into a single epoch-nanoseconds `Long` (`DateTimeUtils.timestampNanosToEpochNanos`), surfacing values outside the signed-int64 epoch-nanos range (~1677-09-21 .. 2262-04-11) as a `DATETIME_OVERFLOW` error.
- `AvroDeserializer`: unpack epoch-nanoseconds via `floorDiv` / `floorMod` and truncate the sub-microsecond digits to the declared precision.
- `AvroUtils.supportsDataType`: drop the `AnyTimestampNanoType` rejection so the types are accepted by both the v1 and v2 write/read paths.

Like the Parquet path, nanosecond timestamps are always proleptic Gregorian and are therefore exempt from datetime rebasing.
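The approximate range quoted for signed-int64 epoch-nanoseconds can be checked with plain `java.time`. The snippet below is only an illustration; `EpochNanosRange` and `toInstant` are hypothetical names and are not part of Spark or of this PR.

```scala
import java.time.Instant

object EpochNanosRange {
  private val NanosPerSecond = 1000000000L

  // Convert an epoch-nanoseconds Long to an Instant using floor semantics,
  // so that negative (pre-1970) values get a non-negative nanosecond part.
  def toInstant(epochNanos: Long): Instant =
    Instant.ofEpochSecond(
      Math.floorDiv(epochNanos, NanosPerSecond),
      Math.floorMod(epochNanos, NanosPerSecond))

  // The two ends of the range that a single Long of epoch-nanoseconds can hold.
  val earliest: Instant = toInstant(Long.MinValue) // 1677-09-21T00:12:43.145224192Z
  val latest: Instant = toInstant(Long.MaxValue)   // 2262-04-11T23:47:16.854775807Z
}
```

Timestamps outside these two instants cannot be represented as a single epoch-nanoseconds `Long`, which is why the write path has to report an overflow instead of wrapping around.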
### Why are the changes needed?

To extend nanosecond-precision timestamp support (umbrella SPARK-56822) to the Avro datasource so it can read and write `TIMESTAMP_NTZ(p)` / `TIMESTAMP_LTZ(p)` with `p` in 7-9, matching the existing microsecond timestamp behavior and the Parquet/ORC nanosecond support.

### Does this PR introduce _any_ user-facing change?
Yes. With `spark.sql.timestampNanosTypes.enabled=true`, columns of type `TIMESTAMP_NTZ(7-9)` / `TIMESTAMP_LTZ(7-9)` can now be written to and read from Avro files. Previously such columns were rejected with `UNSUPPORTED_DATA_TYPE_FOR_DATASOURCE`. This is a change within the unreleased master/branch only.

### How was this patch tested?
Added tests in `AvroSuite`:

- round-trip for precisions 7-9 on v1/v2;
- external-reader unit-correctness: read the written file with `GenericDatumReader` and assert the stored epoch-nanoseconds and the logical-type name;
- reading a plain Avro file without the `spark.sql.catalyst.type` property (defaults to nanosecond precision);
- write overflow, surfaced as `DATETIME_OVERFLOW`.

Ran `AvroV1Suite` / `AvroV2Suite` (new tests pass on both) plus `AvroSerdeSuite`, `AvroV1/V2LogicalTypeSuite`, and `AvroCatalystDataConversionSuite` (no regressions), and `sql/avro` scalastyle.

### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Cursor 2.1, Claude Opus 4.8