
[SPARK-57590][SQL] Read and infer Parquet schema from tar archives #56828

Open
akshatshenoi-db wants to merge 2 commits into apache:master from akshatshenoi-db:archive-parquet

What changes were proposed in this pull request?

This extends the archive-read feature to Parquet. When spark.sql.files.archive.reader.enabled is set, file-based data sources can read files packed in tar archives (.tar / .tar.gz / .tgz), treating each archive entry as if it were a separate file during both scan and schema inference. Support has landed incrementally per format: CSV (SPARK-57135 read, SPARK-57321 infer), JSON (SPARK-57419), text (SPARK-57478), XML (SPARK-57479), and Avro (SPARK-57481).

All of those formats are streaming: each entry is parsed from a bounded InputStream through the shared ArchiveReader, so nothing is unpacked to disk. Parquet cannot use that path -- it is a random-access format that reads a trailing footer, so an entry must be a complete, seekable file. This PR adds Parquet support by unpacking entries to local temp files one at a time:
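To make the random-access requirement concrete, here is a minimal Python sketch of the first step any Parquet reader has to take: seek to the end of the file and read the 8-byte tail (a 4-byte little-endian footer length followed by the `PAR1` magic). It is not code from this PR. The helper name `read_footer_length` is hypothetical, the sketch assumes an unencrypted footer, and it does not parse the Thrift metadata itself.

```python
import io
import struct

PARQUET_MAGIC = b"PAR1"


def read_footer_length(f) -> int:
    """Return the footer (file metadata) length stored in the last 8 bytes.

    Hypothetical helper for illustration only; `f` must be a seekable
    binary file object.
    """
    # Seeking relative to the end is what a forward-only archive entry
    # stream cannot offer without buffering the whole entry first.
    f.seek(-8, io.SEEK_END)
    tail = f.read(8)
    if len(tail) != 8 or tail[4:] != PARQUET_MAGIC:
        raise ValueError("not a Parquet file: missing trailing PAR1 magic")
    (footer_len,) = struct.unpack("<I", tail[:4])
    return footer_len
```

Because the footer offset is only known once the end of the entry is reachable, a bounded forward-only stream over a tar entry is not enough, which is the motivation for unpacking each entry to a local file.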

  • ArchiveReader gains localizeEntries, which materializes each kept entry to a file under a given directory one at a time, and readLocalizedEntries, the random-access counterpart to readEntries: each kept entry is unpacked to a temp file and read with the plain JVM reader, and both the reader and the temp file are released before the next entry opens. The temp dir is removed on task completion. FileScanRDD closes the (Closeable) entry iterator, releasing the current reader and the archive stream, so an abandoned read (e.g. a LIMIT) does not leak.
  • ParquetFileFormat: isSplitable returns false for archives (one split per archive); the per-file read is factored into readSingleFile and reused for each archive entry; archive entries are read with the plain JVM reader, and input_file_name() / _metadata.file_path report the archive path, not the temp file.
  • Schema inference: archive entries' footers are read driver-side alongside any loose files, folding one entry into the merged schema at a time; only the first entry is unpacked when mergeSchema = false. A corrupt archive is skipped under ignoreCorruptFiles; a missing archive is governed by ignoreMissingFiles (a FileNotFoundException is not silently dropped under ignoreCorruptFiles, matching FileScanRDD).
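The one-entry-at-a-time lifecycle described in the list above can be sketched in a few lines of Python using only the standard library. This is an illustration of the pattern, not the Scala implementation in this PR: the function name `read_localized_entries` and the `read_file` callback are hypothetical, and the generator's `finally` blocks stand in for the task-completion and iterator-close hooks.

```python
import os
import shutil
import tarfile
import tempfile
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")


def read_localized_entries(
    archive_path: str, read_file: Callable[[str], T]
) -> Iterator[T]:
    """Unpack each regular-file entry to a temp file, read it, then release it.

    Hypothetical sketch: at most one unpacked entry exists on disk at a time.
    """
    tmp_dir = tempfile.mkdtemp(prefix="archive-entries-")
    try:
        # "r:*" lets tarfile detect plain .tar as well as .tar.gz / .tgz.
        with tarfile.open(archive_path, mode="r:*") as tar:
            for index, member in enumerate(tar):
                if not member.isfile():
                    continue  # skip directories, links, and other non-files
                local_path = os.path.join(tmp_dir, f"entry-{index}")
                try:
                    with tar.extractfile(member) as src, open(local_path, "wb") as dst:
                        shutil.copyfileobj(src, dst)
                    yield read_file(local_path)
                finally:
                    # Release the temp file before the next entry is unpacked,
                    # including when the consumer stops early.
                    if os.path.exists(local_path):
                        os.remove(local_path)
    finally:
        # Runs on normal exhaustion and when the generator is closed early.
        shutil.rmtree(tmp_dir, ignore_errors=True)
```

Calling `close()` on the generator after consuming only the first result models the abandoned LIMIT case: the current temp file, the archive stream, and the temp directory are all released. The same shape would also let schema inference stop after the first entry when schemas are not being merged.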

V2 data sources are intentionally untouched -- archive dispatch lives only in the V1 FileFormat path.

Why are the changes needed?

Parquet is the most common columnar format, and packing many small Parquet part-files into a single tar archive is a natural way to ship them. Completing the archive-read series for Parquet lets users scan and infer schemas from those archives with the same gated, per-entry semantics already available for CSV/JSON/text/XML/Avro, without first unpacking the archive to a directory.

Does this PR introduce any user-facing change?

Yes, gated by spark.sql.files.archive.reader.enabled (default false). When enabled, Parquet files packed in tar archives can be read and their schema inferred; with the flag off (the default) there is no behavior change.
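As a rough usage sketch in PySpark, enabling the flag and reading an archive might look like the following. Several things here are assumptions rather than facts from this PR: that the flag can be set on a running session through `spark.conf.set`, that the archive path is passed straight to `spark.read.parquet`, and the helper name `read_parquet_archive` itself. It also requires a Spark build that includes this change.

```python
def read_parquet_archive(spark, archive_path):
    """Enable the archive reader and load a tar archive of Parquet files.

    Hypothetical helper: `spark` is expected to be a pyspark.sql.SparkSession.
    """
    # Config key taken from the PR description; it defaults to false.
    # Assumption: the key can be changed at runtime on an existing session.
    spark.conf.set("spark.sql.files.archive.reader.enabled", "true")
    # Assumption: the archive is addressed like any other input path.
    return spark.read.parquet(archive_path)
```

With the flag left at its default, the same call would follow the existing code path, as the description states there is no behavior change.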

How was this patch tested?

New ParquetTarArchiveReadSuite (Parquet bound to the shared ArchiveReadSuiteBase over tar containers via ParquetArchiveReadBase), covering archive-vs-directory read parity, vectorized and row-based readers, input_file_name() reporting the archive path, an abandoned LIMIT read, and differing-field reads. The shared ArchiveReadSuiteBase inference/read parity tests run for Parquet as well.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code
