Document / Add an example of preserving dictionary encoding when reading parquet #9095

@alamb

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
This has come up several times, most recently on the arrow mailing list:

https://lists.apache.org/thread/5kg3q0y4cqzl16q6vrvkxlw0yxmk4241

Discussing how to expose dictionary data may lead to multiple overlapping
considerations, long discussions and perhaps format and API changes. So we
hope that there could be some loopholes or small change that could
potentially unblock such optimization without going into a large design/API
space. For instance:

  1. Can we introduce a hint to ParquetReader which will produce
    DictionaryArray for the given column instead of a concrete array
    (StringViewArray in our case)?
  2. When doing late materialization, maybe we can extend ArrowPredicate,
    so that it first instructs Parquet reader that it wants to get encoded
    dictionaries first, and once they are supplied, return another predicate
    that will be applied to encoded data. E.g., "x = some_value" translates to
    "x_encoded = index".

@tustvold pointed out:

What you are requesting is already supported in parquet-rs. In
particular if you request a UTF8 or Binary DictionaryArray for the
column it will decode the column preserving the dictionary encoding. You
can override the embedded arrow schema, if any, using
ArrowReaderOptions::with_schema [1]. Provided you don't read RecordBatch
across row groups and therefore across dictionaries, which the async
reader doesn't, this should never materialize the dictionary. FWIW the
ViewArray decoders will also preserve the dictionary encoding,
however, the dictionary encoded nature is less explicit in the resulting
arrays.

The API docs do have an example, but it shows how to read an i32 column as a timestamp rather than how to preserve dictionary encoding:
https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.ArrowReaderOptions.html#method.with_schema

Describe the solution you'd like
I would like these features to be better documented:

  1. An example showing how to override the schema of the parquet reader to keep the Dictionary encoding

The example should mention that the dictionary encoding is preserved even when the original data was not dictionary encoded

It woul

Describe alternatives you've considered

Additional context

Labels: documentation, enhancement, parquet