shared: improve FCS parsing and preview metadata#4858
Conversation
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
@drernie I think this slice should be ready! |
|
Thanks so much! I will ask Engineering to review it ASAP, hopefully this week. |
…ltdata/quilt - Add workflow_dispatch to deploy-catalog.yaml so we can build catalog from this branch (it brings the FCS UI changes from #4858). - Repin t4-lambda-shared and quilt-shared in lambdas/indexer/pyproject.toml from the Austin-s-h fork SHA to the quiltdata/quilt SHA carrying the P1 fix. The earlier indexer build failed because the fork SHA isn't reachable in our build environment. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #4858 +/- ##
==========================================
+ Coverage 45.69% 45.79% +0.09%
==========================================
Files 829 829
Lines 33532 33590 +58
Branches 5698 5698
==========================================
+ Hits 15323 15381 +58
Misses 16212 16212
Partials 1997 1997
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
The catalog frontend FCS wiring (loaders/Fcs.js, renderers/Fcs.jsx, types.js, plus their .spec tests) is moving to a sibling PR so this PR stays scoped to the shared lambda layer. Coordinates with the new catalog FCS PR (to be opened on quiltdata/quilt) which contains exactly these 5 files. The frontend renderer guards on !!vegaLite, so either side can land first without regression. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
uv.lock was regenerated with the previous version bump from 0.1.0 → 0.1.1 (in 20d03ba) but lost the numpy entry — the [es] extra still declares numpy<2 and the test group depends on quilt-shared[es], so test-py-shared CI failed with "lockfile needs to be updated." Regenerate via 'uv lock' in py-shared/. Adds numpy 1.26.4. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR updates the shared-layer FCS preview pipeline to use flowio instead of fcsparser, adds robust parsing fallbacks (full parse → metadata-only → raw TEXT segment), and emits a Vega-Lite scatterplot spec in the preview metadata for downstream renderers.
Changes:
- Replace FCS parsing implementation with
flowioand add multi-stage fallback parsing. - Generate and attach a Vega-Lite scatterplot spec (
vegaLite) for numeric event data, with downsampling and invalid-value filtering. - Update dependencies/locks and expand shared-layer tests to cover the new FCS behavior.
Reviewed changes
Copilot reviewed 4 out of 6 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
lambdas/shared/src/t4_lambda_shared/preview.py |
New flowio-based FCS parser, TEXT-segment fallback parser, and Vega-Lite scatter spec generation. |
lambdas/shared/tests/test_preview.py |
Adds/updates tests for plotting, metadata-only fallback, downsampling determinism, invalid-value filtering, and raw TEXT parsing. |
lambdas/shared/pyproject.toml |
Switch preview extra from fcsparser to flowio and add explicit numpy constraint. |
lambdas/shared/uv.lock |
Lockfile updates reflecting flowio adoption and dependency changes. |
py-shared/pyproject.toml |
Bumps quilt-shared version and adds numpy to an optional dependency group. |
py-shared/uv.lock |
Lockfile updates reflecting quilt-shared version bump and dependency additions. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| expected_values = fd.event_count * fd.channel_count | ||
| values = list(fd.events) | ||
| if len(values) < expected_values: | ||
| raise ValueError('FCS data is truncated or malformed') | ||
|
|
||
| rows = [ | ||
| values[offset:offset + fd.channel_count] | ||
| for offset in range(0, expected_values, fd.channel_count) | ||
| ] | ||
| data = pandas.DataFrame(rows, columns=channel_names) | ||
|
|
There was a problem hiding this comment.
_parse_fcs_flowio_full materializes all events into a Python list and then builds a second full rows list via slicing before creating the DataFrame. For large FCS files this doubles memory usage and can easily exhaust Lambda memory/disk, even though downstream rendering only needs a downsampled subset. Consider constructing an array directly (e.g., numpy.fromiter(..., count=expected_values).reshape(event_count, channel_count)) or iterating in chunks, and avoid the intermediate rows list.
There was a problem hiding this comment.
Fixed in 85e2f1d.
Replaced the two-list intermediate (list(fd.events) + chunked rows comprehension) with a single preallocated numpy.fromiter(..., count=expected_values) and a zero-copy .reshape(event_count, channel_count) before handing it to pandas. For the 8M-float example case the peak intermediate footprint drops from ~450 MB of Python floats to ~64 MB of contiguous float64 — about a 7× reduction.
The truncation guard is preserved: if fd.events yields fewer items than event_count * channel_count, numpy.fromiter raises and we re-raise with the original FCS data is truncated or malformed message. 12/12 shared tests pass.
There was a problem hiding this comment.
I consider you a 7x better web-developer!! What is your favorite way to catch these sorts of things early? I cover in tests but I never have memory complexity readout.
Three findings from copilot-pull-request-reviewer on this PR: 1. _split_fcs_text_tokens: drop the `if token` filter on the return list. Empty tokens are valid in FCS TEXT segments (e.g., a key with empty value); filtering them out can shift key/value alignment in the zip pairing and produce incorrect metadata. 2. _build_fcs_scatter_spec: replace `Series.map(isfinite)` with `numpy.isfinite(series.to_numpy())`. The vectorized numpy version avoids per-element Python calls — significant speedup for large FCS event tables before downsampling. 3. py-shared[es]: remove the `numpy < 2` constraint. py-shared has no numpy imports anywhere; the constraint was added in the original split commit but was never used. Removing it keeps ES installs lightweight. lambdas/shared still declares its own numpy>=1.26,<3. Verified: 12/12 lambdas/shared tests pass; 36/36 py-shared tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…doubling)
Addresses Copilot finding on quilt#4858 about
`_parse_fcs_flowio_full` materializing event data twice.
## Problem
The previous implementation built two full copies of the event data
in memory before constructing the DataFrame:
1. `values = list(fd.events)` — flat Python list of every float in
every channel of every event. For an N-event × C-channel file
that's N×C Python float objects (~28 bytes each).
2. `rows = [values[offset:offset + C] for offset in range(0, N×C, C)]`
— same data again, now sliced into N row-lists of C floats each.
Only after both lists exist does pandas build the DataFrame.
For a moderate FCS file (1M events × 8 channels = 8M floats), this
peaks at roughly 450 MB of intermediate Python objects — easy to OOM
inside the preview Lambda's memory budget, even though the eventual
DataFrame and downsampled scatter spec are far smaller.
## Fix
Stream `fd.events` directly into a single preallocated `numpy.ndarray`
of exactly the expected size, then reshape it (no copy) into the
event × channel matrix that pandas wraps:
values = numpy.fromiter(fd.events, dtype=float, count=expected_values)
data = pandas.DataFrame(values.reshape(N, C), columns=channel_names)
`numpy.fromiter(..., count=expected_values)` preallocates the array
once. `reshape` is a metadata-only view, not a copy. Result: one
contiguous float64 buffer (~64 MB for the 8M-float example, ~7×
smaller than before) plus the DataFrame's view of it.
The truncation guard is preserved: if `fd.events` yields fewer than
`expected_values` items, `numpy.fromiter` raises `ValueError`, which
is re-raised with the original message so callers see the same
"FCS data is truncated or malformed" error.
## Validation
- 12/12 lambdas/shared/tests/test_preview.py pass (incl. the FCS
parsing, downsampling, NaN/Inf, and TEXT-segment fallback cases).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
@Austin-s-h heads-up — this PR has been narrowed and amended on your branch since your last push. Quick summary: Scope change
Commits added on top of
Status
Let me know if any of the changes I pushed look wrong — happy to revert/adjust. Thanks for the original work, this is a great improvement to FCS preview. |
|
Claude Code, lots of context, and repeated yelling. :-)
…On Tue, Apr 28, 2026 at 16:54 Austin Hovland ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In lambdas/shared/src/t4_lambda_shared/preview.py
<#4858 (comment)>:
> + expected_values = fd.event_count * fd.channel_count
+ values = list(fd.events)
+ if len(values) < expected_values:
+ raise ValueError('FCS data is truncated or malformed')
+
+ rows = [
+ values[offset:offset + fd.channel_count]
+ for offset in range(0, expected_values, fd.channel_count)
+ ]
+ data = pandas.DataFrame(rows, columns=channel_names)
+
I consider you a 7x better web-developer!! What is your favorite way to
catch these sorts of things early? I cover in tests but I never have memory
complexity readout.
—
Reply to this email directly, view it on GitHub
<#4858 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAAE2T6UCI4L5FKQQY4EHP34YFACRAVCNFSM6AAAAACYG2Z2GOVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHM2DCOJTGI3DOMBZG4>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
…atrix preview, binary sniff (quiltdata#4901) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The h5ad handler intermittently fails with h5py's "truncated file" OSError when the underlying HTTP read tears mid-transfer. The condition is non-deterministic — the same file/URL succeeds on a fresh attempt — so a single bounded retry hides the flap. Structural errors (ValueError / KeyError from a genuinely malformed h5ad) still go straight to the error envelope. Also include the (query-stripped) URL in the warning log so post-hoc debugging can distinguish a transient transport hiccup from a file that is actually broken. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Picks up the h5ad torn-read retry + URL logging from PR quiltdata/quilt:fix/h5ad-retry-torn-read → Austin-s-h#10 (targets preview-lambda-shared / #4858), propagated through: - pr-4858-fcs-shared (07b4b3c fix) - pr-4859-fcs-consumers (2a299bc merge) - 260427-catalog-fcs-vega (dee7002 merge) New tabular_preview tip: dee7002. Closes the intermittent "Unexpected Error" on h5ad previews observed against tf-dev-alexion (eg. test/preview-h5ad/test.h5ad). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fix(tabular_preview): retry once on torn h5ad read; log URL on failure
Summary
Shared-layer FCS parsing improvements: switch from
fcsparsertoflowio, add multi-level fallback, and emit a Vega-Lite scatterplot spec for downstream renderers.This PR is scoped to:
lambdas/shared/py-shared/Sibling / dependent PRs
This work was originally bundled in one large PR; it has been split into three independent PRs that may land in any order:
vegaLitespec generationlambdas/shared/,py-shared/indexer,preview,tabular_preview,thumbnail) updated to use the new shared layerlambdas/{indexer,preview,tabular_preview,thumbnail}/vegaLiteis presentcatalog/app/components/Preview/Fcs.*Each downstream PR is a no-op without the others (renderer guards on
!!vegaLite; consumers gracefully handle absent fields), so any landing order works.Changes
fcsparsertoflowioFlowData(only_text=True)→ raw TEXT segmentvegaLitescatter spec from numeric event dataValidation
cd lambdas/shared && uv run pytest tests/test_preview.pyDeployment context
Tracked in deployment#2395 (
stack/unstable2) and rolled out via deployment#2394 (FCS end-to-end bump).