[SPARK-57590][SQL] Read and infer Parquet schema from tar archives by akshatshenoi-db · Pull Request #56828 · apache/spark

akshatshenoi-db · 2026-06-26T23:57:28Z

What changes were proposed in this pull request?

This extends the archive-read feature to Parquet. When spark.sql.files.archive.reader.enabled is set, file-based data sources can read files packed in tar archives (.tar / .tar.gz / .tgz), treating each archive entry as if it were a separate file during both scan and schema inference. Support has landed incrementally per format: CSV (SPARK-57135 read, SPARK-57321 infer), JSON (SPARK-57419), text (SPARK-57478), XML (SPARK-57479), and Avro (SPARK-57481).

All of those formats are streaming: each entry is parsed from a bounded InputStream through the shared ArchiveReader, so nothing is unpacked to disk. Parquet cannot use that path -- it is a random-access format that reads a trailing footer, so an entry must be a complete, seekable file. This PR adds Parquet support by unpacking entries to local temp files one at a time:

ArchiveReader gains localizeEntries (materialize each kept entry to a file under a dir, one at a time) and readLocalizedEntries (the random-access counterpart to readEntries): each kept entry is unpacked to a temp file, read with the plain JVM reader, and the reader and temp file are released before the next entry opens. The temp dir is removed on task completion; FileScanRDD closes the (Closeable) entry iterator -- releasing the current reader and the archive stream -- so an abandoned read (e.g. a LIMIT) does not leak.
ParquetFileFormat: isSplitable returns false for archives (one split per archive); the per-file read is factored into readSingleFile and reused for each archive entry; archive entries are read with the plain JVM reader, and input_file_name() / _metadata.file_path continue to report the archive path, not the temp file.
Schema inference: archive entries' footers are read driver-side alongside any loose files, folding one entry into the merged schema at a time; only the first entry is unpacked when mergeSchema = false. A corrupt archive is skipped under ignoreCorruptFiles; a missing archive is governed by ignoreMissingFiles (a FileNotFoundException is not silently dropped under ignoreCorruptFiles, matching FileScanRDD).
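
The one-entry-at-a-time localization described above can be illustrated outside Spark. The following is a minimal sketch in Python using only the standard library, not the PR's Scala code: read_footer_length and read_localized_entries are hypothetical names chosen for the example, and the only Parquet detail relied on is that a file ends with a 4-byte little-endian footer length followed by the PAR1 magic. It shows why a seekable local file is needed and how a generator can release the temp file, the archive handle, and the temp dir even when the consumer stops early.

```python
import os
import shutil
import struct
import tarfile
import tempfile

MAGIC = b"PAR1"


def read_footer_length(path):
    """Read the footer length from the last 8 bytes of a Parquet-style file."""
    with open(path, "rb") as f:
        # Seeking from the end needs a real, seekable file; a forward-only
        # archive entry stream cannot do this without buffering the entry.
        f.seek(-8, os.SEEK_END)
        tail = f.read(8)
    if tail[4:] != MAGIC:
        raise ValueError(f"{path}: missing trailing PAR1 magic")
    return struct.unpack("<I", tail[:4])[0]


def read_localized_entries(archive_path, read_one, keep=lambda name: True):
    """Yield read_one(local_path) for each kept entry, one temp file at a time.

    Hypothetical analogue of readLocalizedEntries, for illustration only.
    """
    with tempfile.TemporaryDirectory() as tmp_dir, tarfile.open(archive_path) as tar:
        for index, member in enumerate(tar):
            if not member.isfile() or not keep(member.name):
                continue
            local_path = os.path.join(tmp_dir, f"entry-{index}")
            # Unpack exactly one entry to disk.
            with tar.extractfile(member) as src, open(local_path, "wb") as dst:
                shutil.copyfileobj(src, dst)
            try:
                yield read_one(local_path)
            finally:
                # Runs on normal advance, on error, and on generator close,
                # so at most one localized entry exists at any time.
                os.remove(local_path)
```

Calling close() on the generator (or letting a surrounding with-block do it) plays the role the description assigns to FileScanRDD closing the Closeable entry iterator: the current temp file, the archive handle, and the temp dir are all released without draining the remaining entries.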

V2 data sources are intentionally untouched -- archive dispatch lives only in the V1 FileFormat path.
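
The schema-inference rules listed above (fold one entry at a time, stop after the first entry when mergeSchema = false, and keep ignoreCorruptFiles and ignoreMissingFiles separate) can be sketched as follows. This is an illustrative Python model under stated assumptions, not Spark's implementation: schemas are plain dicts of column name to type name, CorruptFileError, merge_schemas and infer_schema are hypothetical names, and read_footers is assumed to be a lazy iterator that yields one schema per archive entry (or a single schema for a loose file).

```python
class CorruptFileError(Exception):
    """Hypothetical stand-in for an unreadable archive or footer."""


def merge_schemas(left, right):
    """Union two {column: type} schemas; a type clash is a merge conflict."""
    merged = dict(left)
    for name, dtype in right.items():
        if name in merged and merged[name] != dtype:
            raise ValueError(
                f"cannot merge schemas: column {name!r} is {merged[name]} and {dtype}"
            )
        merged[name] = dtype
    return merged


def infer_schema(paths, read_footers, merge_schema,
                 ignore_corrupt=False, ignore_missing=False):
    """Fold entry schemas into one result, one entry at a time."""
    result = None
    for path in paths:
        try:
            for schema in read_footers(path):
                result = schema if result is None else merge_schemas(result, schema)
                if not merge_schema:
                    # Because read_footers is lazy, no later entry is unpacked.
                    return result
        except FileNotFoundError:
            # Checked first, so ignore_corrupt alone never hides a missing archive.
            if not ignore_missing:
                raise
        except (CorruptFileError, OSError):
            if not ignore_corrupt:
                raise
    return result
```

Matching FileNotFoundError before the broader handler is what keeps the two options independent in this model: a missing path is skipped only when ignore_missing is set, regardless of ignore_corrupt.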

Why are the changes needed?

Parquet is the most common columnar format, and packing many small Parquet part-files into a single tar archive is a natural way to ship them. Completing the archive-read series for Parquet lets users scan and infer schemas from those archives with the same gated, per-entry semantics already available for CSV/JSON/text/XML/Avro, without first unpacking the archive to a directory.

Does this PR introduce any user-facing change?

Yes, gated by spark.sql.files.archive.reader.enabled (default false). When enabled, Parquet files packed in tar archives can be read and their schema inferred; with the flag off (the default) there is no behavior change.

How was this patch tested?

New ParquetTarArchiveReadSuite (Parquet bound to the shared ArchiveReadSuiteBase over tar containers via ParquetArchiveReadBase), covering archive-vs-directory read parity, vectorized and row-based readers, input_file_name() reporting the archive path, an abandoned LIMIT read, and differing-field reads. The shared ArchiveReadSuiteBase inference/read parity tests run for Parquet as well.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code


akshatshenoi-db added 2 commits June 26, 2026 23:55

[SPARK-57590][SQL] Read and infer Parquet schema from tar archives

703f43c

[SPARK-57590][SQL] Address review: use CANNOT_MERGE_SCHEMAS for archi…

4bd11d0

…ve merge conflicts and cover mergeSchema inference

akshatshenoi-db wants to merge 2 commits into apache:master from akshatshenoi-db:archive-parquet
