branch-4.1: [fix](iceberg) Avoid dict reads on mixed-encoding position delete files #61759 #62036

Open: github-actions[bot] wants to merge 1 commit into branch-4.1 from auto-pick-61759-branch-4.1

github-actions bot commented Apr 2, 2026

Cherry-picked from #61759

…es (#61759)

### What problem does this PR solve?

Iceberg Parquet position delete files currently treat the `file_path`
column as dictionary-encoded whenever the column chunk has a dictionary
page. That check is too loose: Parquet allows mixed encodings in the
same column chunk, so a chunk can contain both dictionary-encoded and
plain-encoded data pages.

When that happens, Doris builds a `ColumnDictI32` for `file_path`, but
the plain decoder later calls `insert_many_strings()`, which fails with:

`Method insert_many_strings is not supported for ColumnDictionary`
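
To make the failure mode concrete, here is a minimal toy model of the mismatch. The types `ToyColumn`, `ToyDictColumn`, `ToyStringColumn` and the helper `decode_plain_page` are hypothetical stand-ins written for illustration; they are not the Doris classes and do not reproduce their real interfaces. The sketch only shows that a plain-encoded page hands raw strings to the destination column, which a code-only dictionary column cannot accept.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Toy stand-ins (assumption): NOT the Doris column classes.
struct ToyColumn {
    virtual ~ToyColumn() = default;
    // By default a column supports neither insertion path.
    virtual void insert_codes(const std::vector<int32_t>&) {
        throw std::logic_error("insert_codes is not supported");
    }
    virtual void insert_many_strings(const std::vector<std::string>&) {
        throw std::logic_error("insert_many_strings is not supported");
    }
};

// Dictionary-backed column: stores int32 dictionary codes only.
struct ToyDictColumn : ToyColumn {
    std::vector<int32_t> codes;
    void insert_codes(const std::vector<int32_t>& c) override {
        codes.insert(codes.end(), c.begin(), c.end());
    }
};

// Ordinary string column: stores the decoded values themselves.
struct ToyStringColumn : ToyColumn {
    std::vector<std::string> values;
    void insert_many_strings(const std::vector<std::string>& v) override {
        values.insert(values.end(), v.begin(), v.end());
    }
};

// A plain-encoded data page always produces raw strings, regardless of
// which column type was chosen up front. Returns false if the destination
// column rejects them.
inline bool decode_plain_page(ToyColumn& dst, const std::vector<std::string>& page) {
    try {
        dst.insert_many_strings(page);
        return true;
    } catch (const std::logic_error&) {
        return false;
    }
}
```

In this toy model, decoding a plain page into `ToyDictColumn` fails while decoding the same page into `ToyStringColumn` succeeds, which mirrors why the column type must be chosen from the encodings of all data pages and not from the mere presence of a dictionary page.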

This PR fixes the issue by using dictionary-backed decoding for
Iceberg position delete `file_path` columns only when the Parquet
column chunk is fully dictionary-encoded. Mixed-encoding chunks now fall
back to normal string columns.
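
A rough sketch of what a "fully dictionary-encoded" check could look like is shown below. It is not the code in this PR: `ColumnChunkMeta`, `PageEncodingStats`, and `is_fully_dictionary_encoded` are hypothetical simplified mirrors of the Parquet column metadata fields (`encodings`, `encoding_stats`). The fallback path for metadata without `encoding_stats` is an assumption and deliberately conservative: any encoding that is neither a dictionary encoding nor a level encoding (`RLE`, `BIT_PACKED`) disables the dictionary path.

```cpp
#include <vector>

// Hypothetical, simplified mirrors of Parquet metadata (assumption).
enum class Encoding { PLAIN, PLAIN_DICTIONARY, RLE, BIT_PACKED, DELTA_BYTE_ARRAY, RLE_DICTIONARY };
enum class PageType { DATA_PAGE, INDEX_PAGE, DICTIONARY_PAGE, DATA_PAGE_V2 };

struct PageEncodingStats {
    PageType page_type;
    Encoding encoding;
    int count;
};

struct ColumnChunkMeta {
    bool has_dictionary_page = false;
    std::vector<Encoding> encodings;
    bool has_encoding_stats = false;  // encoding_stats is optional in Parquet
    std::vector<PageEncodingStats> encoding_stats;
};

inline bool is_dict_encoding(Encoding e) {
    return e == Encoding::PLAIN_DICTIONARY || e == Encoding::RLE_DICTIONARY;
}

// RLE and BIT_PACKED are used for repetition/definition levels.
inline bool is_level_encoding(Encoding e) {
    return e == Encoding::RLE || e == Encoding::BIT_PACKED;
}

inline bool is_fully_dictionary_encoded(const ColumnChunkMeta& meta) {
    if (!meta.has_dictionary_page) return false;

    if (meta.has_encoding_stats) {
        // Precise path: every data page must use a dictionary encoding.
        bool saw_data_page = false;
        for (const auto& s : meta.encoding_stats) {
            bool is_data = s.page_type == PageType::DATA_PAGE ||
                           s.page_type == PageType::DATA_PAGE_V2;
            if (!is_data || s.count <= 0) continue;
            saw_data_page = true;
            if (!is_dict_encoding(s.encoding)) return false;
        }
        return saw_data_page;
    }

    // Conservative fallback (assumption): without per-page stats, reject the
    // chunk if any non-dictionary, non-level encoding is listed.
    bool saw_dict = false;
    for (Encoding e : meta.encodings) {
        if (is_dict_encoding(e)) {
            saw_dict = true;
        } else if (!is_level_encoding(e)) {
            return false;
        }
    }
    return saw_dict;
}
```

A caller would build the dictionary-backed column only when this predicate returns true and use a normal string column otherwise. The conservative fallback trades some missed dictionary reads for never hitting the unsupported insert path.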

It also adds BE unit coverage for:
- fully dictionary-encoded parquet metadata
- mixed dictionary/plain parquet metadata
- parquet metadata without `encoding_stats` but with non-dictionary encodings

github-actions bot requested a review from yiguolei as a code owner April 2, 2026 06:00

Thearas commented Apr 2, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

dataroaring closed this Apr 2, 2026
dataroaring reopened this Apr 2, 2026

Thearas commented Apr 2, 2026

run buildall
