Commit dbe16d6

Copilot, lewing, rzikm, and Copilot authored
TarReader: implement GNU sparse format 1.0 (PAX) (#125283)
`TarReader` was not handling GNU sparse format 1.0 PAX entries, causing ~46% of entries from bsdtar-created archives (e.g., .NET SDK tarballs built on macOS/APFS) to expose internal placeholder paths like `GNUSparseFile.0/real-file.dll`, incorrect sizes, and corrupted extracted content.

## Changes

Added read-only support for GNU sparse format 1.0 (PAX). When `TarReader` encounters the PAX extended attributes `GNU.sparse.major=1` and `GNU.sparse.minor=0`, it resolves the real file name from `GNU.sparse.name`, reports the expanded size from `GNU.sparse.realsize`, and wraps the raw data stream with `GnuSparseStream`, which presents the expanded virtual file content (zeros for holes, packed data at the correct offsets).

The sparse map embedded in the data section is parsed **lazily** on first `Read`, so `_dataStream` remains unconsumed during entry construction. This allows `TarWriter.WriteEntry` to round-trip the condensed sparse data correctly for both seekable and non-seekable source archives.

Older GNU sparse formats (0.0, 0.1) and write support are not addressed.

Additional correctness and robustness improvements based on code review:

- `GnuSparseStream` now overrides `DisposeAsync` to properly await async disposal of the underlying raw stream.
- `TarHeader.Read` now throws `InvalidDataException` if `GNU.sparse.realsize` is negative, consistent with validation of the regular `_size` field.
- Segment validation uses overflow-safe arithmetic (`offset > _realSize || length > _realSize - offset`).
- `FindSegmentFromCurrent` uses binary search (O(log n)) for backward seeks, preserving the O(1) amortized forward scan for the common sequential-read case.
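As an illustration of the expansion described above, here is a minimal, eager sketch. It assumes the GNU sparse 1.0 layout in which the data section begins with newline-delimited decimal numbers (an entry count, then offset/length pairs), zero-padded to a 512-byte boundary and followed by the packed data. `SparseSketch`, `ParseMap`, and `Expand` are hypothetical names used only here; the `GnuSparseStream` added by this commit is lazy and streaming, and this sketch does not reproduce it.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper for illustration only; not the GnuSparseStream type from this commit.
public static class SparseSketch
{
    private const int BlockSize = 512;

    // Reads one newline-terminated, non-negative decimal number and counts the bytes consumed.
    private static long ReadDecimalLine(Stream s, ref long consumed)
    {
        long value = 0;
        int digits = 0;
        while (true)
        {
            int b = s.ReadByte();
            if (b < 0) throw new InvalidDataException("Truncated sparse map.");
            consumed++;
            if (b == '\n') break;
            if (b < '0' || b > '9') throw new InvalidDataException("Non-numeric sparse map value.");
            value = checked(value * 10 + (b - '0')); // throws OverflowException on absurdly large values
            digits++;
        }
        if (digits == 0) throw new InvalidDataException("Empty sparse map value.");
        return value;
    }

    // Parses the map and leaves the stream positioned at the start of the packed data.
    public static List<(long Offset, long Length)> ParseMap(Stream data, long realSize)
    {
        long consumed = 0;
        long count = ReadDecimalLine(data, ref consumed);
        var segments = new List<(long Offset, long Length)>();
        for (long i = 0; i < count; i++)
        {
            long offset = ReadDecimalLine(data, ref consumed);
            long length = ReadDecimalLine(data, ref consumed);
            // Overflow-safe bounds check: never computes offset + length.
            if (offset > realSize || length > realSize - offset)
                throw new InvalidDataException("Sparse segment exceeds the real size.");
            segments.Add((offset, length));
        }
        // Skip the zero padding up to the next 512-byte block boundary.
        long padding = (BlockSize - consumed % BlockSize) % BlockSize;
        for (long i = 0; i < padding; i++)
        {
            if (data.ReadByte() < 0) throw new InvalidDataException("Truncated sparse map padding.");
        }
        return segments;
    }

    // Eagerly expands the whole file into memory; suitable for small inputs only.
    public static byte[] Expand(Stream data, long realSize)
    {
        List<(long Offset, long Length)> segments = ParseMap(data, realSize);
        byte[] result = new byte[realSize]; // holes stay zero-filled
        foreach ((long offset, long length) in segments)
        {
            // Packed data is stored back to back, in segment order.
            data.ReadExactly(result, (int)offset, (int)length);
        }
        return result;
    }
}
```

Calling `SparseSketch.Expand(stream, realSize)` on a data section positioned at the start of the map returns the fully expanded content. A streaming implementation would instead serve reads segment by segment, which is what allows the raw stream to stay unconsumed until the first `Read`.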
```csharp
// Before: entry.Name == "GNUSparseFile.0/dotnet.dll", entry.Length == 512
// After:  entry.Name == "dotnet.dll",                 entry.Length == 1048576
using var reader = new TarReader(archiveStream);
TarEntry entry = reader.GetNextEntry();
entry.DataStream.ReadExactly(content); // correctly expanded virtual file
```

## Testing

All existing tests pass. New `TarReader.SparseFile.Tests.cs` covers:

- Parameterized sparse layouts (single segment, holes, multiple segments, all-holes) × `copyData` × sync/async
- Corrupted sparse map handling (non-numeric values, truncated maps, buffer overflow) × sync/async
- Negative `GNU.sparse.realsize` value throws `InvalidDataException` (sync and async). The test helper `WriteSparseEntry` omits `GNU.sparse.realsize` from the `PaxTarEntry` constructor's attribute dictionary (to avoid constructor-level validation) and instead injects it via reflection into the internal `TarHeader.ExtendedAttributes` dictionary after construction, so the archive can be built while ensuring `TarReader.GetNextEntry()` is the one that throws
- Wrong sparse version detection (missing minor, wrong major)
- Seekable random access, partial reads, advance-past-entry correctness
- Round-trip copy through `TarWriter` with seekable/non-seekable source × `copyData`
- Sparse layout scenarios tested against real `golang_tar` test data files (`pax-nil-sparse-data.tar`, `pax-nil-sparse-hole.tar`, `pax-sparse-big.tar`) from the `System.Formats.Tar.TestData` NuGet package, plus programmatically constructed archives for additional coverage
- Test code refactored to eliminate duplication: `AdvancePastEntry_DoesNotCorruptNextEntry` and `CopySparseEntryToNewArchive_PreservesExpandedContent` now share archive construction helpers (`WriteSparseEntry`, `BuildSparseArchive`, `BuildRawSparseArchive`) with the rest of the test suite

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: lewing <24063+lewing@users.noreply.github.com>
Co-authored-by: Larry Ewing <lewing@microsoft.com>
Co-authored-by: rzikm <32671551+rzikm@users.noreply.github.com>
Co-authored-by: Radek Zikmund <r.zikmund.rz@gmail.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1 parent 1653a0f commit dbe16d6

File tree

7 files changed: +1195 -2 lines changed

src/libraries/System.Formats.Tar/src/System.Formats.Tar.csproj

Lines changed: 1 addition & 0 deletions
@@ -36,6 +36,7 @@
     <Compile Include="System\Formats\Tar\TarWriterOptions.cs" />
     <Compile Include="System\Formats\Tar\SubReadStream.cs" />
     <Compile Include="System\Formats\Tar\SeekableSubReadStream.cs" />
+    <Compile Include="System\Formats\Tar\GnuSparseStream.cs" />
     <Compile Include="$(CommonPath)DisableRuntimeMarshalling.cs" Link="Common\DisableRuntimeMarshalling.cs" />
     <Compile Include="$(CommonPath)System\IO\Archiving.Utils.cs" Link="Common\System\IO\Archiving.Utils.cs" />
     <Compile Include="$(CommonPath)System\IO\PathInternal.cs" Link="Common\System\IO\PathInternal.cs" />
