[SPARK-57590][SQL] Read and infer Parquet schema from tar archives #56828
Open
akshatshenoi-db wants to merge 2 commits into
…ve merge conflicts and cover mergeSchema inference
What changes were proposed in this pull request?
This extends the archive-read feature to Parquet. When spark.sql.files.archive.reader.enabled is set, file-based data sources can read files packed in tar archives (.tar/.tar.gz/.tgz), treating each archive entry as if it were a separate file during both scan and schema inference. Support has landed incrementally per format: CSV (SPARK-57135 read, SPARK-57321 infer), JSON (SPARK-57419), text (SPARK-57478), XML (SPARK-57479), and Avro (SPARK-57481).
All of those formats are streaming: each entry is parsed from a bounded InputStream through the shared ArchiveReader, so nothing is unpacked to disk. Parquet cannot use that path -- it is a random-access format that reads a trailing footer, so an entry must be a complete, seekable file. This PR adds Parquet support by unpacking entries to local temp files one at a time:
- ArchiveReader gains localizeEntries (materialize each kept entry to a file under a dir, one at a time) and readLocalizedEntries (the random-access counterpart to readEntries): each kept entry is unpacked to a temp file, read with the plain JVM reader, and the reader and temp file are released before the next entry opens. The temp dir is removed on task completion; FileScanRDD closes the (Closeable) entry iterator -- releasing the current reader and the archive stream -- so an abandoned read (e.g. a LIMIT) does not leak.
- ParquetFileFormat: isSplitable returns false for archives (one split per archive); the per-file read is factored into readSingleFile and reused for each archive entry; archive entries are read with the plain JVM reader, and input_file_name() / _metadata.file_path stay the archive path, not the temp file.
- mergeSchema = false. A corrupt archive is skipped under ignoreCorruptFiles; a missing archive is governed by ignoreMissingFiles (a FileNotFoundException is not silently dropped under ignoreCorruptFiles, matching FileScanRDD).
V2 data sources are intentionally untouched -- archive dispatch lives only in the V1 FileFormat path.
Why are the changes needed?
Parquet is the most common columnar format, and packing many small Parquet part-files into a single tar archive is a natural way to ship them. Completing the archive-read series for Parquet lets users scan and infer schemas from those archives with the same gated, per-entry semantics already available for CSV/JSON/text/XML/Avro, without first unpacking the archive to a directory.
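The unpack-one-entry-at-a-time approach described under "What changes were proposed" can be sketched outside Spark. The Python below is only an illustration of the pattern, not the PR's Scala code: read_localized_entries, read_footer_length, fake_parquet, and build_tar are hypothetical names invented here, and the "Parquet" payloads are stand-ins that carry only the real trailer layout (a 4-byte little-endian footer length followed by the PAR1 magic). It streams a tar archive, copies each kept entry to a temp file so the footer can be reached with a seek, and deletes the temp file before the next entry is opened. Closing the generator early, as an abandoned LIMIT read would, also removes the temp directory.

```python
import io
import os
import shutil
import struct
import tarfile
import tempfile

PARQUET_MAGIC = b"PAR1"


def read_footer_length(path):
    """Random-access read: the last 8 bytes are <footer length, LE uint32><PAR1>."""
    with open(path, "rb") as f:
        f.seek(-8, os.SEEK_END)  # requires a seekable file, not a tar stream
        tail = f.read(8)
    if tail[4:] != PARQUET_MAGIC:
        raise ValueError("missing trailing PAR1 magic")
    return struct.unpack("<I", tail[:4])[0]


def read_localized_entries(archive_stream, keep, read_entry):
    """Hypothetical analogue of the PR's readLocalizedEntries (assumption).

    Yields (entry_name, read_entry(local_path)) for each kept entry, with at
    most one entry materialized on disk at any time.
    """
    tmp_dir = tempfile.mkdtemp(prefix="archive-entries-")
    try:
        # "r|*" reads the archive as a forward-only stream (plain or compressed).
        with tarfile.open(fileobj=archive_stream, mode="r|*") as tar:
            for member in tar:
                if not member.isfile() or not keep(member.name):
                    continue
                local_path = os.path.join(tmp_dir, "entry.tmp")
                with open(local_path, "wb") as dst:
                    shutil.copyfileobj(tar.extractfile(member), dst)
                try:
                    yield member.name, read_entry(local_path)
                finally:
                    os.remove(local_path)  # released before the next entry opens
    finally:
        # Runs on normal completion and when the generator is closed early.
        shutil.rmtree(tmp_dir, ignore_errors=True)


def fake_parquet(footer):
    """Stand-in payload with a Parquet-style header and trailer (not real Parquet)."""
    return PARQUET_MAGIC + b"data" + footer + struct.pack("<I", len(footer)) + PARQUET_MAGIC


def build_tar(entries):
    """Build an in-memory .tar.gz from (name, payload) pairs."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, payload in entries:
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    return buf


archive = build_tar([
    ("part-0.parquet", fake_parquet(b"abc")),
    ("_SUCCESS", b""),
    ("part-1.parquet", fake_parquet(b"hello")),
])
results = list(read_localized_entries(
    archive, lambda name: name.endswith(".parquet"), read_footer_length))
print(results)  # [('part-0.parquet', 3), ('part-1.parquet', 5)]
```

The generator's finally blocks play the role the description assigns to the Closeable entry iterator: whoever stops consuming early is expected to close it, and closing releases the current temp file, the archive stream, and the temp directory.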
Does this PR introduce any user-facing change?
Yes, gated by spark.sql.files.archive.reader.enabled (default false). When enabled, Parquet files packed in tar archives can be read and their schema inferred; with the flag off (the default) there is no behavior change.
How was this patch tested?
New ParquetTarArchiveReadSuite (Parquet bound to the shared ArchiveReadSuiteBase over tar containers via ParquetArchiveReadBase), covering archive-vs-directory read parity, vectorized and row-based readers, input_file_name() reporting the archive path, an abandoned LIMIT read, and differing-field reads. The shared ArchiveReadSuiteBase inference/read parity tests run for Parquet as well.
Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code