shared: improve FCS parsing and preview metadata by Austin-s-h · Pull Request #4858 · quiltdata/quilt

Austin-s-h · 2026-04-26T12:57:28Z

Summary

Shared-layer FCS parsing improvements: switch from fcsparser to flowio, add multi-level fallback, and emit a Vega-Lite scatterplot spec for downstream renderers.

This PR is scoped to:

- lambdas/shared/
- py-shared/

Sibling / dependent PRs

This work was originally bundled in one large PR; it has been split into three independent PRs that may land in any order:

| PR | Layer | Files |
| --- | --- | --- |
| #4858 (this) | Shared lambda parsing + `vegaLite` spec generation | `lambdas/shared/`, `py-shared/` |
| #4859 | Consumer lambdas (`indexer`, `preview`, `tabular_preview`, `thumbnail`) updated to use the new shared layer | `lambdas/{indexer,preview,tabular_preview,thumbnail}/` |
| #4860 | Catalog frontend renders the scatter chart when `vegaLite` is present | `catalog/app/components/Preview/Fcs.*` |

Each downstream PR is a no-op without the others (renderer guards on !!vegaLite; consumers gracefully handle absent fields), so any landing order works.

Changes

- switch shared FCS parsing from fcsparser to flowio
- add multi-level fallback: full parse → metadata-only via FlowData(only_text=True) → raw TEXT segment
- generate vegaLite scatter spec from numeric event data
- filter NaN/Inf before serialization (addresses Greptile P1 from initial review)
- accurate "Showing N events" / "Downsampled to N events" subtitle (addresses Greptile P2)
- extend tests to cover NaN/Inf filtering, downsampling determinism, and raw TEXT fallback
- update shared dependency locks (Python 3.13 compat already on this branch)
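
The multi-level fallback listed above can be pictured as an ordered chain of parsers, where each level runs only if the previous one raised. The following is a minimal sketch under stated assumptions, not the code in `preview.py`: `parse_with_fallback`, the three stand-in parser functions, and the `parse_level` / `warnings` keys are all hypothetical names chosen for illustration.

```python
from typing import Any, Callable, Iterable

Parser = Callable[[bytes], dict[str, Any]]


def parse_with_fallback(payload: bytes, parsers: Iterable[tuple[str, Parser]]) -> dict[str, Any]:
    """Try each parser in order and return the first successful result.

    The result is tagged with the name of the level that produced it, and
    failures from earlier levels are recorded as warnings.
    """
    warnings: list[str] = []
    for name, parser in parsers:
        try:
            result = parser(payload)
        except Exception as exc:  # broad on purpose: each level can fail differently
            warnings.append(f"{name}: {exc}")
            continue
        return {**result, "parse_level": name, "warnings": warnings}
    raise ValueError("all FCS parse levels failed: " + "; ".join(warnings))


# Hypothetical stand-ins for the three levels named in the PR description.
def full_parse(payload: bytes) -> dict[str, Any]:
    raise ValueError("FCS data is truncated or malformed")


def metadata_only(payload: bytes) -> dict[str, Any]:
    return {"metadata": {"$TOT": "0"}}


def raw_text_segment(payload: bytes) -> dict[str, Any]:
    return {"metadata": {}}


LEVELS = [
    ("full", full_parse),
    ("metadata_only", metadata_only),
    ("raw_text", raw_text_segment),
]
result = parse_with_fallback(b"", LEVELS)
```

Here the full parse fails, so `result` comes from the metadata-only level and carries one warning describing why the first level was skipped.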

Validation

- `cd lambdas/shared && uv run pytest tests/test_preview.py`
- See #4859 for end-to-end consumer-lambda validation

Deployment context

Tracked in deployment#2395 (stack/unstable2) and rolled out via deployment#2394 (FCS end-to-end bump).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Austin-s-h · 2026-04-26T13:26:47Z

@drernie I think this slice should be ready!

drernie · 2026-04-27T05:17:44Z

Thanks so much! I will ask Engineering to review it ASAP, hopefully this week.

…ltdata/quilt

- Add workflow_dispatch to deploy-catalog.yaml so we can build catalog from this branch (it brings the FCS UI changes from #4858).
- Repin t4-lambda-shared and quilt-shared in lambdas/indexer/pyproject.toml from the Austin-s-h fork SHA to the quiltdata/quilt SHA carrying the P1 fix. The earlier indexer build failed because the fork SHA isn't reachable in our build environment.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

codecov · 2026-04-27T16:54:01Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 45.79%. Comparing base (cea497e) to head (85e2f1d).

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #4858      +/-   ##
==========================================
+ Coverage   45.69%   45.79%   +0.09%     
==========================================
  Files         829      829              
  Lines       33532    33590      +58     
  Branches     5698     5698              
==========================================
+ Hits        15323    15381      +58     
  Misses      16212    16212              
  Partials     1997     1997

Flag	Coverage Δ
api-python	`93.14% <ø> (ø)`
catalog	`19.55% <ø> (ø)`
lambda	`96.70% <100.00%> (+0.07%)`	⬆️
py-shared	`98.18% <ø> (ø)`


The catalog frontend FCS wiring (loaders/Fcs.js, renderers/Fcs.jsx, types.js, plus their .spec tests) is moving to a sibling PR so this PR stays scoped to the shared lambda layer. Coordinates with the new catalog FCS PR (to be opened on quiltdata/quilt) which contains exactly these 5 files. The frontend renderer guards on !!vegaLite, so either side can land first without regression. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

uv.lock was regenerated with the previous version bump from 0.1.0 → 0.1.1 (in 20d03ba) but lost the numpy entry — the [es] extra still declares numpy<2 and the test group depends on quilt-shared[es], so test-py-shared CI failed with "lockfile needs to be updated." Regenerate via 'uv lock' in py-shared/. Adds numpy 1.26.4. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

This PR updates the shared-layer FCS preview pipeline to use flowio instead of fcsparser, adds robust parsing fallbacks (full parse → metadata-only → raw TEXT segment), and emits a Vega-Lite scatterplot spec in the preview metadata for downstream renderers.

Changes:

- Replace FCS parsing implementation with flowio and add multi-stage fallback parsing.
- Generate and attach a Vega-Lite scatterplot spec (vegaLite) for numeric event data, with downsampling and invalid-value filtering.
- Update dependencies/locks and expand shared-layer tests to cover the new FCS behavior.
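
To make the scatter-spec step concrete, here is a hedged, standard-library-only sketch of the three ideas named above: dropping non-finite values, deterministic downsampling, and an accurate subtitle. It is not the PR's `_build_fcs_scatter_spec` (which works on pandas/numpy data); the function name `build_scatter_spec`, its parameters, and the `max_points` / `seed` defaults are assumptions made for illustration.

```python
import json
import math
import random


def build_scatter_spec(rows, x, y, max_points=1000, seed=0):
    """Return a Vega-Lite scatter spec for two numeric channels.

    rows: list of dicts mapping channel name -> numeric value.
    Non-finite values are dropped because NaN/Infinity are not valid JSON.
    """
    finite = [
        {x: float(r[x]), y: float(r[y])}
        for r in rows
        if math.isfinite(r[x]) and math.isfinite(r[y])
    ]
    if len(finite) > max_points:
        # A fixed seed keeps the sample identical across invocations.
        finite = random.Random(seed).sample(finite, max_points)
        subtitle = f"Downsampled to {len(finite)} events"
    else:
        subtitle = f"Showing {len(finite)} events"
    return {
        "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
        "title": {"text": f"{y} vs {x}", "subtitle": subtitle},
        "data": {"values": finite},
        "mark": "point",
        "encoding": {
            "x": {"field": x, "type": "quantitative"},
            "y": {"field": y, "type": "quantitative"},
        },
    }


events = [
    {"FSC-A": 1.0, "SSC-A": 2.0},
    {"FSC-A": float("nan"), "SSC-A": 3.0},
    {"FSC-A": 4.0, "SSC-A": float("inf")},
    {"FSC-A": 5.0, "SSC-A": 6.0},
]
spec = build_scatter_spec(events, "FSC-A", "SSC-A")
# allow_nan=False makes json.dumps raise if a NaN/Inf slipped through.
payload = json.dumps(spec, allow_nan=False)
```

Serializing with `allow_nan=False` is a cheap way to assert in tests that the filter really ran before the spec leaves the lambda.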

Reviewed changes

Copilot reviewed 4 out of 6 changed files in this pull request and generated 4 comments.

Show a summary per file

| File | Description |
| --- | --- |
| `lambdas/shared/src/t4_lambda_shared/preview.py` | New `flowio`-based FCS parser, TEXT-segment fallback parser, and Vega-Lite scatter spec generation. |
| `lambdas/shared/tests/test_preview.py` | Adds/updates tests for plotting, metadata-only fallback, downsampling determinism, invalid-value filtering, and raw TEXT parsing. |
| `lambdas/shared/pyproject.toml` | Switch preview extra from `fcsparser` to `flowio` and add explicit `numpy` constraint. |
| `lambdas/shared/uv.lock` | Lockfile updates reflecting `flowio` adoption and dependency changes. |
| `py-shared/pyproject.toml` | Bumps `quilt-shared` version and adds `numpy` to an optional dependency group. |
| `py-shared/uv.lock` | Lockfile updates reflecting `quilt-shared` version bump and dependency additions. |

Copilot · 2026-04-28T00:11:46Z

+    expected_values = fd.event_count * fd.channel_count
+    values = list(fd.events)
+    if len(values) < expected_values:
+        raise ValueError('FCS data is truncated or malformed')
+
+    rows = [
+        values[offset:offset + fd.channel_count]
+        for offset in range(0, expected_values, fd.channel_count)
+    ]
+    data = pandas.DataFrame(rows, columns=channel_names)
+


_parse_fcs_flowio_full materializes all events into a Python list and then builds a second full rows list via slicing before creating the DataFrame. For large FCS files this doubles memory usage and can easily exhaust Lambda memory/disk, even though downstream rendering only needs a downsampled subset. Consider constructing an array directly (e.g., numpy.fromiter(..., count=expected_values).reshape(event_count, channel_count)) or iterating in chunks, and avoid the intermediate rows list.

drernie · 2026-04-28

Fixed in 85e2f1d.

Replaced the two-list intermediate (list(fd.events) + chunked rows comprehension) with a single preallocated numpy.fromiter(..., count=expected_values) and a zero-copy .reshape(event_count, channel_count) before handing it to pandas. For the 8M-float example case the peak intermediate footprint drops from ~450 MB of Python floats to ~64 MB of contiguous float64 — about a 7× reduction.

The truncation guard is preserved: if fd.events yields fewer items than event_count * channel_count, numpy.fromiter raises and we re-raise with the original FCS data is truncated or malformed message. 12/12 shared tests pass.
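
The fix described in this reply can be sketched in a few lines. This is an illustration under stated assumptions rather than the code in `_parse_fcs_flowio_full`: the helper name `events_to_matrix` is hypothetical, and the pandas wrapping step is left out so the example depends only on NumPy.

```python
import numpy


def events_to_matrix(events, event_count, channel_count):
    """Stream a flat iterable of floats into an (events x channels) array."""
    expected_values = event_count * channel_count
    try:
        # count= lets NumPy allocate the output buffer once, up front.
        flat = numpy.fromiter(events, dtype=float, count=expected_values)
    except ValueError as exc:
        # Raised when the iterable yields fewer than `count` items.
        raise ValueError("FCS data is truncated or malformed") from exc
    # Reshaping a contiguous 1-D array returns a view, not a copy.
    return flat.reshape(event_count, channel_count)


matrix = events_to_matrix(iter(range(6)), event_count=3, channel_count=2)
```

The same arithmetic gives the 64 MB figure quoted above: 8,000,000 values at 8 bytes per float64 is 64,000,000 bytes in one contiguous buffer.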

Austin-s-h · 2026-04-28

I consider you a 7x better web-developer!! What is your favorite way to catch these sorts of things early? I cover in tests but I never have memory complexity readout.

Three findings from copilot-pull-request-reviewer on this PR:

1. _split_fcs_text_tokens: drop the `if token` filter on the return list. Empty tokens are valid in FCS TEXT segments (e.g., a key with empty value); filtering them out can shift key/value alignment in the zip pairing and produce incorrect metadata.
2. _build_fcs_scatter_spec: replace `Series.map(isfinite)` with `numpy.isfinite(series.to_numpy())`. The vectorized numpy version avoids per-element Python calls, a significant speedup for large FCS event tables before downsampling.
3. py-shared[es]: remove the `numpy < 2` constraint. py-shared has no numpy imports anywhere; the constraint was added in the original split commit but was never used. Removing it keeps ES installs lightweight. lambdas/shared still declares its own numpy>=1.26,<3.

Verified: 12/12 lambdas/shared tests pass; 36/36 py-shared tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
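
The first finding is easier to see with a toy segment. The sketch below is a simplified illustration, not the PR's `_split_fcs_text_tokens`: `split_text_tokens` and `pair_tokens` are hypothetical helpers, the first character is assumed to be the delimiter, and escaped (doubled) delimiters are deliberately ignored.

```python
def split_text_tokens(text, keep_empty=True):
    """Split a delimited TEXT-style segment into tokens.

    The first character is treated as the delimiter. Escaped (doubled)
    delimiters are deliberately not handled in this sketch.
    """
    delimiter, body = text[0], text[1:]
    if body.endswith(delimiter):
        body = body[:-1]  # drop only the single trailing delimiter
    tokens = body.split(delimiter)
    return tokens if keep_empty else [t for t in tokens if t]


def pair_tokens(tokens):
    """Pair tokens as key, value, key, value, ..."""
    return dict(zip(tokens[0::2], tokens[1::2]))


# $COM has an empty value in this made-up segment.
segment = "/$TOT/100/$COM//$PAR/2/"
kept = pair_tokens(split_text_tokens(segment, keep_empty=True))
filtered = pair_tokens(split_text_tokens(segment, keep_empty=False))
```

With empty tokens kept, `$COM` maps to an empty string and `$PAR` maps to `2`. With the filter applied, every token after the empty value shifts by one position, so `$COM` is paired with `$PAR` and the real `$PAR` entry disappears.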

…doubling)

Addresses Copilot finding on quilt#4858 about `_parse_fcs_flowio_full` materializing event data twice.

## Problem

The previous implementation built two full copies of the event data in memory before constructing the DataFrame:

1. `values = list(fd.events)`: flat Python list of every float in every channel of every event. For an N-event × C-channel file that's N×C Python float objects (~28 bytes each).
2. `rows = [values[offset:offset + C] for offset in range(0, N×C, C)]`: same data again, now sliced into N row-lists of C floats each.

Only after both lists exist does pandas build the DataFrame. For a moderate FCS file (1M events × 8 channels = 8M floats), this peaks at roughly 450 MB of intermediate Python objects, which is easy to OOM inside the preview Lambda's memory budget, even though the eventual DataFrame and downsampled scatter spec are far smaller.

## Fix

Stream `fd.events` directly into a single preallocated `numpy.ndarray` of exactly the expected size, then reshape it (no copy) into the event × channel matrix that pandas wraps:

    values = numpy.fromiter(fd.events, dtype=float, count=expected_values)
    data = pandas.DataFrame(values.reshape(N, C), columns=channel_names)

`numpy.fromiter(..., count=expected_values)` preallocates the array once. `reshape` is a metadata-only view, not a copy. Result: one contiguous float64 buffer (~64 MB for the 8M-float example, ~7× smaller than before) plus the DataFrame's view of it.

The truncation guard is preserved: if `fd.events` yields fewer than `expected_values` items, `numpy.fromiter` raises `ValueError`, which is re-raised with the original message so callers see the same "FCS data is truncated or malformed" error.

## Validation

- 12/12 lambdas/shared/tests/test_preview.py pass (incl. the FCS parsing, downsampling, NaN/Inf, and TEXT-segment fallback cases).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

drernie · 2026-04-28T03:46:54Z

@Austin-s-h heads-up — this PR has been narrowed and amended on your branch since your last push. Quick summary:

Scope change

- Catalog FCS frontend wiring (catalog/app/components/Preview/Fcs.*) was extracted into a sibling PR: quilt#4860. This PR is now lambdas/shared/ + py-shared/ only, which is much easier to review.
- Consumer lambda updates remain in quilt#4859, now rebased on top of this branch.

Commits added on top of 20d03bae

- dd96f259: drop catalog FCS frontend wiring (moved to Catalog: render FCS Vega scatter from preview vegaLite spec #4860)
- 3564ff37: regenerate py-shared/uv.lock (the version bump in 20d03bae had dropped the numpy entry, breaking test-py-shared CI)
- 38fd78f4: address 3 Copilot findings: preserve empty FCS TEXT tokens; vectorized numpy.isfinite in scatter spec; remove unused numpy<2 from py-shared[es]
- 85e2f1d1: address Copilot's memory-doubling finding in _parse_fcs_flowio_full via numpy.fromiter + zero-copy reshape (~7× peak-memory reduction for large files)

Status

- CI: ✅ all 44 checks green
- Greptile: all P1 + P2 addressed (your 20d03bae covered most; the rest are in the commits above)
- Copilot: all 4 inline comments replied to with fix-commit references
- Mergeable, awaiting human review

Let me know if any of the changes I pushed look wrong — happy to revert/adjust. Thanks for the original work, this is a great improvement to FCS preview.

drernie · 2026-04-29T04:25:01Z

Claude Code, lots of context, and repeated yelling. :-)

On Tue, Apr 28, 2026 at 16:54, Austin Hovland wrote (in lambdas/shared/src/t4_lambda_shared/preview.py):

> I consider you a 7x better web-developer!! What is your favorite way to catch these sorts of things early? I cover in tests but I never have memory complexity readout.

The h5ad handler intermittently fails with h5py's "truncated file" OSError when the underlying HTTP read tears mid-transfer. The condition is non-deterministic — the same file/URL succeeds on a fresh attempt — so a single bounded retry hides the flap. Structural errors (ValueError / KeyError from a genuinely malformed h5ad) still go straight to the error envelope. Also include the (query-stripped) URL in the warning log so post-hoc debugging can distinguish a transient transport hiccup from a file that is actually broken. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
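
The retry policy in this commit message can be sketched as follows. This is a minimal illustration and not the handler in `tabular_preview`: `read_with_one_retry`, `strip_query`, and the `flaky_read` stand-in are hypothetical names, and the real code reads through h5py rather than an injected callable.

```python
import logging
from urllib.parse import urlsplit, urlunsplit

logger = logging.getLogger(__name__)


def strip_query(url):
    """Drop the query string and fragment so signed-URL parameters are not logged."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def read_with_one_retry(read, url):
    """Call read(url); retry exactly once if it raises OSError.

    ValueError and KeyError are not caught, so structural errors
    propagate to the caller immediately.
    """
    try:
        return read(url)
    except OSError as exc:
        logger.warning("torn read for %s, retrying once: %s", strip_query(url), exc)
        return read(url)


calls = []


def flaky_read(url):
    # Hypothetical reader: fails on the first call, succeeds on the second.
    calls.append(url)
    if len(calls) == 1:
        raise OSError("truncated file")
    return {"ok": True}


result = read_with_one_retry(flaky_read, "https://example.com/test.h5ad?X-Amz-Signature=abc")
```

A second `OSError` is not caught, so a file that is persistently unreadable still reaches the error envelope after one extra attempt.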

Picks up the h5ad torn-read retry + URL logging from PR quiltdata/quilt:fix/h5ad-retry-torn-read → Austin-s-h#10 (targets preview-lambda-shared / #4858), propagated through:

- pr-4858-fcs-shared (07b4b3c fix)
- pr-4859-fcs-consumers (2a299bc merge)
- 260427-catalog-fcs-vega (dee7002 merge)

New tabular_preview tip: dee7002. Closes the intermittent "Unexpected Error" on h5ad previews observed against tf-dev-alexion (e.g. test/preview-h5ad/test.h5ad).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

fix(tabular_preview): retry once on torn h5ad read; log URL on failure

shared: split preview lambda base

42de6e8

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Austin-s-h mentioned this pull request Apr 26, 2026

Preview lambda 7.3.0 #4857

Closed

8 tasks

greptile-apps Bot reviewed Apr 26, 2026

Comment thread lambdas/shared/src/t4_lambda_shared/preview.py

fix: Refactor FCS scatter plot handling and improve test coverage

20d03ba

drernie mentioned this pull request Apr 27, 2026

Catalog: render FCS Vega scatter from preview vegaLite spec #4860

Open

drernie changed the title ~~shared: improve FCS parsing and preview metadata~~ shared: improve FCS parsing and preview metadata Apr 27, 2026

drernie mentioned this pull request Apr 27, 2026

preview: update consumers for shared FCS and PDF changes #4859

Open

drernie requested a review from Copilot April 28, 2026 00:08

Copilot started reviewing on behalf of drernie April 28, 2026 00:09 View session

Copilot AI reviewed Apr 28, 2026

drernie and others added 2 commits April 27, 2026 17:46

Merge branch 'master' into preview-lambda-shared

596b14a

This was referenced May 15, 2026

Preview Lambda Improvements #4813

Open

fix(preview): close #4813 gaps — text fallback, h5ad error/matrix preview, binary sniff #4901

Merged

drernie and others added 4 commits May 15, 2026 09:22

fix(preview): close quiltdata#4813 gaps — text fallback, h5ad error/m…

fb95e3f

…atrix preview, binary sniff (quiltdata#4901) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Merge branch 'quiltdata:preview-lambda-shared' into preview-lambda-sh…

5e1c4e6

…ared

fix(preview): reduce lambda cold-start runtime imports

0969638

drernie mentioned this pull request May 22, 2026

fix(tabular_preview): retry once on torn h5ad read; log URL on failure Austin-s-h/quilt#10

Merged

2 tasks

Merge branch 'master' into preview-lambda-shared

a78eaf3

Merge pull request #10 from quiltdata/fix/h5ad-retry-torn-read

85adcd1

fix(tabular_preview): retry once on torn h5ad read; log URL on failure

Austin-s-h mentioned this pull request May 26, 2026

Modernize uv packaging and local catalog workflows Austin-s-h/quilt#12

Closed

6 tasks

Austin-s-h wants to merge 13 commits into quiltdata:master from Austin-s-h:preview-lambda-shared
