Commit dbe16d6
TarReader: implement GNU sparse format 1.0 (PAX) (#125283)
`TarReader` was not handling GNU sparse format 1.0 PAX entries, causing
~46% of entries from bsdtar-created archives (e.g., .NET SDK tarballs
built on macOS/APFS) to expose internal placeholder paths like
`GNUSparseFile.0/real-file.dll`, incorrect sizes, and corrupted
extracted content.
## Changes
Added read-only support for GNU sparse format 1.0 (PAX). When
`TarReader` encounters PAX extended attributes `GNU.sparse.major=1` and
`GNU.sparse.minor=0`, it resolves the real file name from
`GNU.sparse.name`, reports the expanded size from `GNU.sparse.realsize`,
and wraps the raw data stream with `GnuSparseStream` which presents the
expanded virtual file content (zeros for holes, packed data at correct
offsets).
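The expansion described above can be illustrated with a minimal sketch. This is not the actual `GnuSparseStream`: the class `SparseExpandSketch` and its `Expand` method are hypothetical names, and the sketch materializes the whole file in memory instead of streaming it. It assumes the segments are already validated, sorted, and stored back to back in the packed data.

```csharp
using System;

// Hypothetical helper for illustration only; the real implementation is a Stream wrapper.
public static class SparseExpandSketch
{
    // Rebuilds the expanded file from the packed data and a list of
    // (offset, length) segments. Bytes outside every segment are holes.
    public static byte[] Expand(byte[] packed, (long Offset, long Length)[] segments, long realSize)
    {
        // New arrays are zero-filled, so holes already read as zeros.
        byte[] expanded = new byte[realSize];
        long packedPos = 0;

        foreach ((long offset, long length) in segments)
        {
            // Copy the next packed run to its position in the expanded file.
            Array.Copy(packed, packedPos, expanded, offset, length);
            packedPos += length;
        }

        return expanded;
    }
}
```

For example, packed bytes `{ 1, 2, 3, 4 }` with segments `(2, 2)` and `(6, 2)` and a real size of 10 expand to `0, 0, 1, 2, 0, 0, 3, 4, 0, 0`.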
The sparse map embedded in the data section is parsed **lazily** on
first `Read`, so `_dataStream` remains unconsumed during entry
construction. This allows `TarWriter.WriteEntry` to round-trip the
condensed sparse data correctly for both seekable and non-seekable
source archives.
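As a rough sketch of what lazy parsing involves: in GNU sparse format 1.0 the data section starts with newline-terminated decimal numbers (a segment count, then offset/length pairs), padded to the next 512-byte block. The names `SparseMapSketch`, `ReadMap`, and `Defer` below are hypothetical and do not reflect the internal code in this commit. Using `Lazy<T>` to defer the parse is likewise an assumption made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical sketch; not the parser added by this commit.
public static class SparseMapSketch
{
    private const int BlockSize = 512;

    // Defers parsing until .Value is first read, leaving the stream untouched until then.
    public static Lazy<(long Offset, long Length)[]> Defer(Stream data) =>
        new Lazy<(long Offset, long Length)[]>(() => ReadMap(data));

    public static (long Offset, long Length)[] ReadMap(Stream data)
    {
        long consumed = 0;
        long count = ReadDecimalLine(data, ref consumed);
        var segments = new List<(long Offset, long Length)>();

        for (long i = 0; i < count; i++)
        {
            long offset = ReadDecimalLine(data, ref consumed);
            long length = ReadDecimalLine(data, ref consumed);
            segments.Add((offset, length));
        }

        // The map is padded to a block boundary; the packed data starts after it.
        long padding = (BlockSize - consumed % BlockSize) % BlockSize;
        for (long i = 0; i < padding; i++)
        {
            if (data.ReadByte() < 0) throw new InvalidDataException("Truncated sparse map padding.");
        }

        return segments.ToArray();
    }

    private static long ReadDecimalLine(Stream data, ref long consumed)
    {
        long value = 0;
        int digits = 0;
        while (true)
        {
            int b = data.ReadByte();
            if (b < 0) throw new InvalidDataException("Truncated sparse map.");
            consumed++;
            if (b == '\n')
            {
                if (digits == 0) throw new InvalidDataException("Empty sparse map field.");
                return value;
            }
            if (b < '0' || b > '9') throw new InvalidDataException("Non-numeric sparse map field.");
            value = checked(value * 10 + (b - '0')); // throws OverflowException on absurdly long numbers
            digits++;
        }
    }
}
```

Because `Defer` only captures the stream, nothing is read until the map is needed, which mirrors the reason the data stream stays unconsumed during entry construction.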
Older GNU sparse formats (0.0, 0.1) and write support are not addressed.
Additional correctness and robustness improvements based on code review:
- `GnuSparseStream` now overrides `DisposeAsync` to properly await async
disposal of the underlying raw stream.
- `TarHeader.Read` now throws `InvalidDataException` if
`GNU.sparse.realsize` is negative, consistent with validation of the
regular `_size` field.
- Segment validation uses overflow-safe arithmetic (`offset > _realSize
|| length > _realSize - offset`).
- `FindSegmentFromCurrent` uses binary search (O(log n)) for backward
seeks, preserving the O(1) amortized forward scan for the common
sequential-read case.
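The last two items can be sketched as follows. The class `SparseSegmentSketch` and both method names are hypothetical. The sketch shows only the binary-search half of the lookup, not the forward scan kept for sequential reads. The extra non-negativity checks in `IsValidSegment` are an assumption added so the example stands alone.

```csharp
using System;

// Hypothetical sketch; not the code from this commit.
public static class SparseSegmentSketch
{
    // Checks that [offset, offset + length) fits inside the real size without
    // ever computing offset + length, which could overflow a long.
    public static bool IsValidSegment(long offset, long length, long realSize) =>
        offset >= 0 && length >= 0 && !(offset > realSize || length > realSize - offset);

    // Binary search over sorted, non-overlapping, already validated segments.
    // Returns the index of the segment containing position, or the bitwise
    // complement of the index of the next segment when position is in a hole.
    public static int FindSegment((long Offset, long Length)[] segments, long position)
    {
        int lo = 0;
        int hi = segments.Length - 1;
        while (lo <= hi)
        {
            int mid = lo + (hi - lo) / 2;
            (long offset, long length) = segments[mid];
            if (position < offset)
            {
                hi = mid - 1;
            }
            else if (position >= offset + length) // safe: the segment was validated
            {
                lo = mid + 1;
            }
            else
            {
                return mid;
            }
        }
        return ~lo;
    }
}
```

With a naive `offset + length > realSize` check, an offset near `long.MaxValue` would wrap around to a negative sum and pass validation. The subtraction form avoids that because `realSize - offset` cannot overflow once `offset <= realSize` is established.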
```csharp
// Before: entry.Name == "GNUSparseFile.0/dotnet.dll", entry.Length == 512
// After: entry.Name == "dotnet.dll", entry.Length == 1048576
using var reader = new TarReader(archiveStream); // archiveStream: an open tar Stream
TarEntry? entry = reader.GetNextEntry(); // null at end of archive
if (entry?.DataStream is Stream data)
{
    byte[] content = new byte[entry.Length];
    data.ReadExactly(content); // correctly expanded virtual file
}
```
## Testing
All existing tests pass. New `TarReader.SparseFile.Tests.cs` covers:
- Parameterized sparse layouts (single segment, holes, multiple
segments, all-holes) × `copyData` × sync/async
- Corrupted sparse map handling (non-numeric values, truncated maps,
buffer overflow) × sync/async
- Negative `GNU.sparse.realsize` value throws `InvalidDataException`
(sync and async). The test helper `WriteSparseEntry` omits
`GNU.sparse.realsize` from the `PaxTarEntry` constructor's attribute
dictionary to avoid constructor-level validation, then injects it via
reflection into the internal `TarHeader.ExtendedAttributes` dictionary
after construction. The archive can therefore be built, and
`TarReader.GetNextEntry()` is the call that throws
- Wrong sparse version detection (missing minor, wrong major)
- Seekable random access, partial reads, advance-past-entry correctness
- Round-trip copy through TarWriter with seekable/non-seekable source ×
copyData
- Sparse layout scenarios tested against real `golang_tar` test data
files (`pax-nil-sparse-data.tar`, `pax-nil-sparse-hole.tar`,
`pax-sparse-big.tar`) from the `System.Formats.Tar.TestData` NuGet
package, plus programmatically constructed archives for additional
coverage
- Test code refactored to eliminate duplication:
`AdvancePastEntry_DoesNotCorruptNextEntry` and
`CopySparseEntryToNewArchive_PreservesExpandedContent` now share archive
construction helpers (`WriteSparseEntry`, `BuildSparseArchive`,
`BuildRawSparseArchive`) with the rest of the test suite
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: lewing <24063+lewing@users.noreply.github.com>
Co-authored-by: Larry Ewing <lewing@microsoft.com>
Co-authored-by: rzikm <32671551+rzikm@users.noreply.github.com>
Co-authored-by: Radek Zikmund <r.zikmund.rz@gmail.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1 parent 1653a0f commit dbe16d6
File tree
7 files changed, +1195 -2 lines changed
- src/libraries/System.Formats.Tar
  - src
    - System/Formats/Tar
  - tests
    - TarReader