[SPARK-57459][SQL] Support nanosecond-precision timestamp types in the Avro datasource (v1 and v2)#56825

Open
MaxGekk wants to merge 4 commits into apache:master from MaxGekk:nanos-avro

MaxGekk commented Jun 26, 2026

Member

What changes were proposed in this pull request?

Umbrella: SPARK-56822 (Timestamps with nanosecond precision).

This PR adds read and write support for the nanosecond-capable timestamp types TIMESTAMP_NTZ(p) and TIMESTAMP_LTZ(p) (p in 7-9) to the Avro datasource (v1 AvroFileFormat and v2 AvroTable), reaching parity with the microsecond TimestampType / TimestampNTZType, and removes the SPARK-57166 rejection guardrail.

  • SchemaConverters: map TimestampLTZNanosType / TimestampNTZNanosType to the Avro timestamp-nanos / local-timestamp-nanos logical types (available in the bundled Avro 1.12.1; both annotate a long that stores epoch-nanoseconds), carrying the fractional-second precision via the spark.sql.catalyst.type property. The reverse direction maps these logical types back to the Spark types and defaults to nanosecond precision (9) for files written by external tools, which lack the property.
  • AvroSerializer: pack the internal (epochMicros, nanosWithinMicro) value into a single epoch-nanoseconds Long (DateTimeUtils.timestampNanosToEpochNanos), surfacing values outside the signed-int64 epoch-nanos range (~1677-09-21 .. 2262-04-11) as a DATETIME_OVERFLOW error.
  • AvroDeserializer: unpack epoch-nanoseconds via floorDiv / floorMod and truncate the sub-microsecond digits to the declared precision.
  • AvroUtils.supportsDataType: drop the AnyTimestampNanoType rejection so the types are accepted by both the v1 and v2 write/read paths.
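The packing and unpacking arithmetic described in the bullets above can be sketched in plain Java. This is a minimal illustration, not the code in this PR: the class and method names (`EpochNanosRoundTrip`, `pack`, `unpackMicros`, `unpackNanosWithinMicro`) are hypothetical, and the real logic lives in `DateTimeUtils`. It shows why `floorDiv` / `floorMod` are needed for pre-epoch values, where plain `/` and `%` would round toward zero.

```java
final class EpochNanosRoundTrip {
    static final long NANOS_PER_MICRO = 1_000L;

    // Pack (epochMicros, nanosWithinMicro) into a single epoch-nanoseconds long.
    // multiplyExact / addExact throw ArithmeticException on int64 overflow.
    // Note: this simple form also rejects the few values whose intermediate
    // product overflows even though the final sum would fit (near Long.MIN_VALUE).
    public static long pack(long epochMicros, int nanosWithinMicro) {
        long scaled = Math.multiplyExact(epochMicros, NANOS_PER_MICRO);
        return Math.addExact(scaled, (long) nanosWithinMicro);
    }

    // floorDiv rounds toward negative infinity, so pre-epoch values stay consistent.
    public static long unpackMicros(long epochNanos) {
        return Math.floorDiv(epochNanos, NANOS_PER_MICRO);
    }

    // floorMod always yields a value in [0, 999] for a positive divisor.
    public static int unpackNanosWithinMicro(long epochNanos) {
        return (int) Math.floorMod(epochNanos, NANOS_PER_MICRO);
    }

    public static void main(String[] args) {
        // One nanosecond before the epoch: micros = -1, nanos-within-micro = 999.
        long epochNanos = pack(-1L, 999);
        System.out.println(epochNanos);                         // -1
        System.out.println(unpackMicros(epochNanos));           // -1
        System.out.println(unpackNanosWithinMicro(epochNanos)); // 999
        // Plain division would lose the round trip for negative values:
        System.out.println(epochNanos / NANOS_PER_MICRO);       // 0
        System.out.println(epochNanos % NANOS_PER_MICRO);       // -1
    }
}
```

With truncating division, the value one nanosecond before the epoch would decode to microsecond 0 with a negative remainder, which is not a valid `(epochMicros, nanosWithinMicro)` pair.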

Like the Parquet path, nanosecond timestamps are always proleptic Gregorian and are therefore exempt from datetime rebasing.

Why are the changes needed?

To extend nanosecond-precision timestamp support (umbrella SPARK-56822) to the Avro datasource so it can read and write TIMESTAMP_NTZ(p) / TIMESTAMP_LTZ(p) with p in 7-9, matching the existing microsecond timestamp behavior and the Parquet/ORC nanosecond support.

Does this PR introduce any user-facing change?

Yes. With spark.sql.timestampNanosTypes.enabled=true, columns of type TIMESTAMP_NTZ(7-9) / TIMESTAMP_LTZ(7-9) can now be written to and read from Avro files. Previously such columns were rejected with UNSUPPORTED_DATA_TYPE_FOR_DATASOURCE. The change affects only the unreleased master branch.

How was this patch tested?

Added tests in AvroSuite:

  • round-trip for precisions 7-9 for both NTZ and LTZ, across the v1 and v2 sources, including nulls and inferred-schema precision preservation;
  • external-reader unit-correctness: decode the written file with a plain Avro GenericDatumReader and assert the stored epoch-nanoseconds and the logical-type name;
  • reading a plain Avro file produced without the spark.sql.catalyst.type property (defaults to nanosecond precision);
  • writing an out-of-range value fails loudly with DATETIME_OVERFLOW.

Ran AvroV1Suite / AvroV2Suite (new tests pass on both) plus AvroSerdeSuite, AvroV1/V2LogicalTypeSuite, and AvroCatalystDataConversionSuite (no regressions), as well as scalastyle for the sql and avro modules.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Cursor 2.1, Claude Opus 4.8

MaxGekk added 4 commits June 26, 2026 22:32
…e Avro datasource (v1 and v2)

### What changes were proposed in this pull request?
This PR adds read and write support for the nanosecond-capable timestamp types
`TIMESTAMP_NTZ(p)` and `TIMESTAMP_LTZ(p)` (p in 7-9) to the Avro datasource (v1
`AvroFileFormat` and v2 `AvroTable`), reaching parity with the microsecond
`TimestampType` / `TimestampNTZType`, and removes the SPARK-57166 rejection
guardrail.

- `SchemaConverters`: map `TimestampLTZNanosType` / `TimestampNTZNanosType` to the
  Avro `timestamp-nanos` / `local-timestamp-nanos` logical types (available in the
  bundled Avro 1.12.1), carrying the precision via the `spark.sql.catalyst.type`
  property; the reverse direction maps them back, defaulting to nanosecond
  precision for files written by external tools.
- `AvroSerializer`: pack the internal `(epochMicros, nanosWithinMicro)` value into
  epoch-nanoseconds (Long), surfacing out-of-range values as `DATETIME_OVERFLOW`.
- `AvroDeserializer`: unpack epoch-nanoseconds via floorDiv/floorMod and truncate
  the sub-microsecond digits to the declared precision.
- `AvroUtils.supportsDataType`: drop the `AnyTimestampNanoType` rejection so the
  types are supported by both the v1 and v2 paths.

Nanosecond timestamps are always proleptic Gregorian, so they are exempt from
datetime rebasing, matching the Parquet path.

### Why are the changes needed?
To extend nanosecond-precision timestamp support (umbrella SPARK-56822) to the
Avro datasource so it can read and write `TIMESTAMP_NTZ(p)` / `TIMESTAMP_LTZ(p)`
with p in 7-9.

### Does this PR introduce _any_ user-facing change?
Yes. With `spark.sql.timestampNanosTypes.enabled=true`, columns of type
`TIMESTAMP_NTZ(7-9)` / `TIMESTAMP_LTZ(7-9)` can now be written to and read from
Avro files (previously rejected with `UNSUPPORTED_DATA_TYPE_FOR_DATASOURCE`).

### How was this patch tested?
Added tests in `AvroSuite` (round-trip for precisions 7-9 on v1/v2, external-reader
unit-correctness, reading a plain Avro file without the catalyst-type property, and
write overflow) and ran `AvroV1Suite` / `AvroV2Suite` plus the Avro serde and
logical-type suites.

### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Cursor 2.1, Claude Opus 4.8

Introduce `DateTimeUtils.epochNanosToTimestampNanos(epochNanos, precision)` as
the inverse of `timestampNanosToEpochNanos`, and use it from both the Avro
deserializer and the Parquet `TimestampNanosParquetOps` converter, removing the
duplicated floorDiv/floorMod + precision-truncation logic. Add unit tests for the
new helper in `DateTimeUtilsSuite`.
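A decode helper of the kind this commit describes could look roughly like the Java sketch below. It is an assumption-laden illustration rather than the actual `epochNanosToTimestampNanos`: the class `PrecisionTruncation`, its method names, and the `long[]` return shape are hypothetical, and it assumes that "truncating to precision p" means zeroing the decimal digits beyond the p-th fractional-second digit.

```java
final class PrecisionTruncation {
    // Zero out sub-microsecond digits beyond the declared precision (7..9).
    // nanosWithinMicro is expected in [0, 999]: precision 9 keeps all three
    // digits, 8 drops the last one, 7 drops the last two.
    public static int truncate(int nanosWithinMicro, int precision) {
        if (precision < 7 || precision > 9) {
            throw new IllegalArgumentException("precision must be in 7..9: " + precision);
        }
        if (nanosWithinMicro < 0 || nanosWithinMicro > 999) {
            throw new IllegalArgumentException("nanosWithinMicro out of range: " + nanosWithinMicro);
        }
        int step = 1;
        for (int i = precision; i < 9; i++) {
            step *= 10; // 1 for p=9, 10 for p=8, 100 for p=7
        }
        return nanosWithinMicro - (nanosWithinMicro % step);
    }

    // Split epoch-nanoseconds into {epochMicros, truncated nanosWithinMicro}.
    public static long[] decode(long epochNanos, int precision) {
        long micros = Math.floorDiv(epochNanos, 1_000L);
        int nanos = (int) Math.floorMod(epochNanos, 1_000L);
        return new long[] { micros, truncate(nanos, precision) };
    }

    public static void main(String[] args) {
        long epochNanos = 1_700_000_000_123_456_789L;
        for (int p = 7; p <= 9; p++) {
            long[] parts = decode(epochNanos, p);
            System.out.println("p=" + p + " micros=" + parts[0] + " nanos=" + parts[1]);
        }
    }
}
```

Because `floorMod` already yields a non-negative remainder, the truncation only ever operates on a value between 0 and 999, for pre-epoch timestamps as well.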
…ared DateTimeUtils helper

Consolidate the encode-with-overflow wrapper (try timestampNanosToEpochNanos,
catch ArithmeticException -> DATETIME_OVERFLOW naming the sink) that was
duplicated in AvroSerializer (sink="Avro") and TimestampNanosParquetOps
(sink="Parquet INT64") into a single DateTimeUtils.timestampNanosToEpochNanos
(value, isNtz, sink) overload, mirroring the decode-path consolidation
(epochNanosToTimestampNanos). Behavior is unchanged.

Co-authored-by: Isaac
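
The encode-with-overflow pattern consolidated in this commit (try the exact conversion, translate `ArithmeticException` into an overflow error that names the sink) might be sketched in Java as follows. Everything here is illustrative: `OverflowWrapper`, `DatetimeOverflowException`, and the message text are hypothetical stand-ins for Spark's `DATETIME_OVERFLOW` error, and the `isNtz` parameter of the real overload is omitted for brevity.

```java
final class OverflowWrapper {
    // Hypothetical stand-in for Spark's DATETIME_OVERFLOW error condition.
    static final class DatetimeOverflowException extends RuntimeException {
        DatetimeOverflowException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    // Convert to epoch-nanoseconds, reporting int64 overflow with the sink name
    // (for example "Avro" or "Parquet INT64") so the error says where the write failed.
    public static long toEpochNanos(long epochMicros, int nanosWithinMicro, String sink) {
        try {
            long scaled = Math.multiplyExact(epochMicros, 1_000L);
            return Math.addExact(scaled, (long) nanosWithinMicro);
        } catch (ArithmeticException e) {
            throw new DatetimeOverflowException(
                "DATETIME_OVERFLOW: timestamp does not fit into epoch-nanoseconds for " + sink, e);
        }
    }

    public static void main(String[] args) {
        // Largest representable instant: Long.MAX_VALUE epoch-nanoseconds.
        long maxMicros = Long.MAX_VALUE / 1_000L;
        System.out.println(toEpochNanos(maxMicros, 807, "Avro") == Long.MAX_VALUE); // true
        try {
            toEpochNanos(maxMicros, 808, "Avro"); // one nanosecond past the range
        } catch (DatetimeOverflowException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Keeping the sink as a parameter lets one helper serve several writers while preserving a sink-specific error message.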
…d overflow helper

The prior commit hoisted the encode-with-overflow wrapper out of
TimestampNanosParquetOps into DateTimeUtils.timestampNanosToEpochNanos
(value, isNtz, sink); update the suite's two packing/overflow tests to call
the shared helper with sink="Parquet INT64". Behavior asserted is unchanged
(combine result and DATETIME_OVERFLOW condition).

Co-authored-by: Isaac

MaxGekk commented Jun 27, 2026

Member Author

@uros-b @stevomitric Could you review this PR, please?
