
[QDP] Streaming & Large-Data Support for All Encodings — Roadmap #993

@400Ping


Summary

This roadmap defines the implementation path for streaming and large-data support across all QDP encodings: adding IQP / IQP-Z to the Parquet streaming pipeline, introducing additional input formats (e.g. chunked NumPy, HDF5), and completing documentation and baselines so that "encode from file" is a first-class workflow. It is scoped to be comparable in impact to Roadmap #969, but focuses on feature coverage and the input ecosystem rather than pipeline performance.

Motivation

  • Gap: encode_from_parquet() currently supports only amplitude, angle, and basis. IQP and IQP-Z have kernels and in-memory encode() / encode_batch(), but no streaming path from Parquet or other large files.
  • Goal: Enable all encodings (including IQP) to use the existing dual-stream pipeline from Parquet and, in later phases, from other large-data sources.
  • Non-overlap with [QDP] User Python API interface features and refactoring — Roadmap #969: #969 addresses pipeline performance (observability, chunk/pool tuning, event-based buffer reuse); this roadmap addresses which encodings can stream and which sources data is read from (streaming encodings + input formats + docs).

Phase 1: IQP / IQP-Z streaming encoding

Deliverables

  • Support encode_from_parquet(path, num_qubits, "iqp" | "iqp-z") so that large Parquet files are processed through the existing dual-stream pipeline.
  • Unit and integration tests; optional small throughput benchmark for Parquet + IQP.

Implementation outline

  1. Add IQP ChunkEncoder in qdp-core/src/encoding/ (following the pattern of amplitude/angle/basis):
    • Implement ChunkEncoder: validate_sample_size, needs_staging_copy, init_state, encode_chunk.
    • IQP full: sample_size = num_qubits + num_qubits*(num_qubits-1)/2; IQP-Z: num_qubits.
    • Reuse kernel calls and length checks from qdp-core/src/gpu/encodings/iqp.rs.
  2. Wire into encode_from_parquet() in encoding/mod.rs: add branches for "iqp" and "iqp-z" calling stream_encode with the appropriate IQP encoder variant.
  3. Tests: Reuse logic from tests/iqp_encoding.rs; add integration test that reads a small Parquet file and runs stream encode for IQP/IQP-Z.
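
The outline above can be sketched roughly as follows. The trait and method names follow the issue text, but the signatures, the `IqpVariant` enum, and the `encoder_for` dispatch helper are illustrative assumptions, not the actual qdp-core definitions:

```rust
// Illustrative sketch only: modeled on the issue text, not the real qdp-core API.

#[derive(Clone, Copy)]
enum IqpVariant {
    Full, // single-qubit terms plus all pairwise interaction terms
    Z,    // diagonal-only variant: one feature per qubit
}

// Hypothetical stand-in for the ChunkEncoder trait mentioned in the outline.
trait ChunkEncoder {
    /// Number of f64 features each sample must provide.
    fn sample_size(&self, num_qubits: usize) -> usize;

    /// Reject chunks whose length is not a multiple of the sample size
    /// (the kind of length check validate_sample_size would perform).
    fn validate_chunk(&self, num_qubits: usize, chunk_len: usize) -> Result<(), String> {
        let s = self.sample_size(num_qubits);
        if s == 0 || chunk_len % s != 0 {
            Err(format!("chunk length {chunk_len} is not a multiple of sample size {s}"))
        } else {
            Ok(())
        }
    }
}

struct IqpEncoder {
    variant: IqpVariant,
}

impl ChunkEncoder for IqpEncoder {
    fn sample_size(&self, n: usize) -> usize {
        match self.variant {
            // IQP full: num_qubits + num_qubits*(num_qubits-1)/2, per the outline.
            IqpVariant::Full => n + n * (n - 1) / 2,
            // IQP-Z: one value per qubit.
            IqpVariant::Z => n,
        }
    }
}

/// Sketch of the branches Phase 1 would add inside encode_from_parquet().
fn encoder_for(name: &str) -> Option<Box<dyn ChunkEncoder>> {
    match name {
        "iqp" => Some(Box::new(IqpEncoder { variant: IqpVariant::Full })),
        "iqp-z" => Some(Box::new(IqpEncoder { variant: IqpVariant::Z })),
        _ => None, // amplitude/angle/basis are handled by the existing branches
    }
}

fn main() {
    let full = encoder_for("iqp").unwrap();
    let z = encoder_for("iqp-z").unwrap();
    assert_eq!(full.sample_size(4), 10); // 4 + 4*3/2
    assert_eq!(z.sample_size(4), 4);
    assert!(full.validate_chunk(4, 30).is_ok()); // 3 samples of size 10
    assert!(full.validate_chunk(4, 25).is_err());
    println!("ok");
}
```

The point of the sketch is that once an IQP encoder satisfies the same trait as amplitude/angle/basis, the existing `stream_encode` path needs no IQP-specific changes beyond the two dispatch branches.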

Key files: qdp-core/src/encoding/mod.rs, new or extended encoding/iqp.rs (streaming), qdp-core/src/gpu/encodings/iqp.rs (existing), qdp-core/tests/.


Phase 2: Additional input formats (streaming readers)

Deliverables

  • At least one large-data–friendly streaming reader implemented and plugged into the encoding pipeline.
  • Candidates (from readers README Future Enhancements): chunked NumPy (large .npy), or HDF5.

Implementation outline

  1. Implement a new reader satisfying StreamingDataReader in qdp-core/src/reader.rs (read_chunk(&mut self, buffer: &mut [f64]) -> Result<usize>).
  2. Integrate with encoding: extend encode_from_* to accept the new reader, or select the reader by path/extension, so that stream_encode can consume data from the new source.
  3. Tests and docs: Unit tests for the new reader; at least one end-to-end test (e.g. amplitude or IQP from the new format). Update qdp/docs/readers/README.md.
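
A minimal sketch of the reader contract described in step 1. The trait name and the `read_chunk` signature follow the issue text; the in-memory `SliceReader` is a hypothetical stand-in for a future chunked-NumPy or HDF5 reader:

```rust
// Illustrative sketch: not the actual qdp-core/src/reader.rs definitions.

// Hypothetical stand-in for StreamingDataReader.
trait StreamingDataReader {
    /// Fill `buffer` with up to buffer.len() values and return how many
    /// were written. A return of 0 signals end of stream.
    fn read_chunk(&mut self, buffer: &mut [f64]) -> Result<usize, String>;
}

/// Toy source that serves a flat f64 slice in chunks, the way a chunked
/// .npy or HDF5 reader would serve data without loading the whole file.
struct SliceReader {
    data: Vec<f64>,
    pos: usize,
}

impl StreamingDataReader for SliceReader {
    fn read_chunk(&mut self, buffer: &mut [f64]) -> Result<usize, String> {
        let remaining = self.data.len() - self.pos;
        let n = remaining.min(buffer.len());
        buffer[..n].copy_from_slice(&self.data[self.pos..self.pos + n]);
        self.pos += n;
        Ok(n)
    }
}

fn main() {
    let mut reader = SliceReader { data: (0..10).map(|i| i as f64).collect(), pos: 0 };
    let mut buf = [0.0f64; 4];
    let mut total = 0;
    // Drain the stream the way stream_encode would: loop until read_chunk
    // reports 0 values.
    loop {
        let n = reader.read_chunk(&mut buf).unwrap();
        if n == 0 {
            break;
        }
        total += n;
    }
    assert_eq!(total, 10);
    println!("drained {total} values");
}
```

Because the pipeline only sees this trait, any format that can fill an `f64` buffer incrementally plugs in without touching the encoders.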

Key files: qdp-core/src/readers/, qdp-core/src/reader.rs, qdp/docs/readers/README.md.


Phase 3: Baselines and documentation

Deliverables

  • A reproducible benchmark flow (or at least a documented throughput methodology) for "large file + all encodings (including IQP)".
  • Complete Getting Started and Examples for QDP (currently TODO in the docs), making “encode from file” a first-class documented workflow.

Implementation outline

  1. Benchmark: Define and document a small workflow (e.g. in qdp-python/benchmark/ or qdp/docs/) for Parquet + amplitude/angle/basis/iqp; align with [QDP] User Python API interface features and refactoring — Roadmap #969 Phase 2 baseline methodology where useful.
  2. Docs:
    • Getting Started: Install, minimal example, typical encode / encode_from_parquet usage (including IQP).
    • Examples: 2–3 full examples (e.g. in-memory amplitude, Parquet + IQP, DLPack → PyTorch).
    • Optionally: short API summary in the QDP API doc.
  3. Relationship to [QDP] User Python API interface features and refactoring — Roadmap #969: Reuse Phase 2 observability/baseline flow if available, to avoid duplicate tooling.

Phase order and dependencies

  • Phase 1 is independent; only depends on the current pipeline and IQP kernels.
  • Phase 2 builds on the same stream_encode interface (can be parallelized with Phase 1 once reader integration is agreed).
  • Phase 3 can be done in parallel with Phase 1/2; the “large file + IQP” benchmark is most meaningful after Phase 1 is merged.

Suggested order: Land Phase 1 first, then Phase 2; Phase 3 docs can start early, with benchmark steps finalized after Phase 1.


Alternatives considered

  • Only document current behavior: Does not address the missing IQP streaming path or additional formats.
  • Single big PR: Phased approach allows incremental review and reduces risk.

Additional context

  • IQP kernel and GPU encoding already exist: qdp-kernels/src/iqp.cu, qdp-core/src/gpu/encodings/iqp.rs, and qdp-core/tests/iqp_encoding.rs.
  • Streaming pipeline and ChunkEncoder are in qdp-core/src/encoding/ (amplitude, angle, basis); encode_from_parquet is in encoding/mod.rs.
  • Readers design: qdp/docs/readers/README.md.
