Skip to content

shared: improve FCS parsing and preview metadata#4858

Open
Austin-s-h wants to merge 13 commits into
quiltdata:masterfrom
Austin-s-h:preview-lambda-shared
Open

shared: improve FCS parsing and preview metadata#4858
Austin-s-h wants to merge 13 commits into
quiltdata:masterfrom
Austin-s-h:preview-lambda-shared

Conversation

@Austin-s-h
Copy link
Copy Markdown
Collaborator

@Austin-s-h Austin-s-h commented Apr 26, 2026

Summary

Shared-layer FCS parsing improvements: switch from fcsparser to flowio, add multi-level fallback, and emit a Vega-Lite scatterplot spec for downstream renderers.

This PR is scoped to:

  • lambdas/shared/
  • py-shared/

Sibling / dependent PRs

This work was originally bundled in one large PR; it has been split into three independent PRs that may land in any order:

PR Layer Files
#4858 (this) Shared lambda parsing + vegaLite spec generation lambdas/shared/, py-shared/
#4859 Consumer lambdas (indexer, preview, tabular_preview, thumbnail) updated to use the new shared layer lambdas/{indexer,preview,tabular_preview,thumbnail}/
#4860 Catalog frontend renders the scatter chart when vegaLite is present catalog/app/components/Preview/Fcs.*

Each downstream PR is a no-op without the others (renderer guards on !!vegaLite; consumers gracefully handle absent fields), so any landing order works.

Changes

  • switch shared FCS parsing from fcsparser to flowio
  • add multi-level fallback: full parse → metadata-only via FlowData(only_text=True) → raw TEXT segment
  • generate vegaLite scatter spec from numeric event data
  • filter NaN/Inf before serialization (addresses Greptile P1 from initial review)
  • accurate "Showing N events" / "Downsampled to N events" subtitle (addresses Greptile P2)
  • extend tests to cover NaN/Inf filtering, downsampling determinism, and raw TEXT fallback
  • update shared dependency locks (Python 3.13 compat already on this branch)

Validation

  • cd lambdas/shared && uv run pytest tests/test_preview.py
  • See #4859 for end-to-end consumer-lambda validation

Deployment context

Tracked in deployment#2395 (stack/unstable2) and rolled out via deployment#2394 (FCS end-to-end bump).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@Austin-s-h Austin-s-h mentioned this pull request Apr 26, 2026
8 tasks
Comment thread lambdas/shared/src/t4_lambda_shared/preview.py
@Austin-s-h
Copy link
Copy Markdown
Collaborator Author

@drernie I think this slice should be ready!

@drernie
Copy link
Copy Markdown
Member

drernie commented Apr 27, 2026

Thanks so much! I will ask Engineering to review it ASAP, hopefully this week.

drernie added a commit that referenced this pull request Apr 27, 2026
…ltdata/quilt

- Add workflow_dispatch to deploy-catalog.yaml so we can build catalog from
  this branch (it brings the FCS UI changes from #4858).
- Repin t4-lambda-shared and quilt-shared in lambdas/indexer/pyproject.toml
  from the Austin-s-h fork SHA to the quiltdata/quilt SHA carrying the P1
  fix. The earlier indexer build failed because the fork SHA isn't reachable
  in our build environment.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 27, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 45.79%. Comparing base (cea497e) to head (85e2f1d).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4858      +/-   ##
==========================================
+ Coverage   45.69%   45.79%   +0.09%     
==========================================
  Files         829      829              
  Lines       33532    33590      +58     
  Branches     5698     5698              
==========================================
+ Hits        15323    15381      +58     
  Misses      16212    16212              
  Partials     1997     1997              
Flag Coverage Δ
api-python 93.14% <ø> (ø)
catalog 19.55% <ø> (ø)
lambda 96.70% <100.00%> (+0.07%) ⬆️
py-shared 98.18% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

The catalog frontend FCS wiring (loaders/Fcs.js, renderers/Fcs.jsx,
types.js, plus their .spec tests) is moving to a sibling PR so this
PR stays scoped to the shared lambda layer.

Coordinates with the new catalog FCS PR (to be opened on quiltdata/quilt)
which contains exactly these 5 files. The frontend renderer guards on
!!vegaLite, so either side can land first without regression.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@drernie drernie changed the title shared: improve FCS parsing and preview metadata shared: improve FCS parsing and preview metadata Apr 27, 2026
uv.lock was regenerated with the previous version bump from 0.1.0 →
0.1.1 (in 20d03ba) but lost the numpy entry — the [es] extra still
declares numpy<2 and the test group depends on quilt-shared[es], so
test-py-shared CI failed with "lockfile needs to be updated."

Regenerate via 'uv lock' in py-shared/. Adds numpy 1.26.4.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the shared-layer FCS preview pipeline to use flowio instead of fcsparser, adds robust parsing fallbacks (full parse → metadata-only → raw TEXT segment), and emits a Vega-Lite scatterplot spec in the preview metadata for downstream renderers.

Changes:

  • Replace FCS parsing implementation with flowio and add multi-stage fallback parsing.
  • Generate and attach a Vega-Lite scatterplot spec (vegaLite) for numeric event data, with downsampling and invalid-value filtering.
  • Update dependencies/locks and expand shared-layer tests to cover the new FCS behavior.

Reviewed changes

Copilot reviewed 4 out of 6 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
lambdas/shared/src/t4_lambda_shared/preview.py New flowio-based FCS parser, TEXT-segment fallback parser, and Vega-Lite scatter spec generation.
lambdas/shared/tests/test_preview.py Adds/updates tests for plotting, metadata-only fallback, downsampling determinism, invalid-value filtering, and raw TEXT parsing.
lambdas/shared/pyproject.toml Switch preview extra from fcsparser to flowio and add explicit numpy constraint.
lambdas/shared/uv.lock Lockfile updates reflecting flowio adoption and dependency changes.
py-shared/pyproject.toml Bumps quilt-shared version and adds numpy to an optional dependency group.
py-shared/uv.lock Lockfile updates reflecting quilt-shared version bump and dependency additions.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread lambdas/shared/src/t4_lambda_shared/preview.py Outdated
Comment on lines +150 to +160
expected_values = fd.event_count * fd.channel_count
values = list(fd.events)
if len(values) < expected_values:
raise ValueError('FCS data is truncated or malformed')

rows = [
values[offset:offset + fd.channel_count]
for offset in range(0, expected_values, fd.channel_count)
]
data = pandas.DataFrame(rows, columns=channel_names)

Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_parse_fcs_flowio_full materializes all events into a Python list and then builds a second full rows list via slicing before creating the DataFrame. For large FCS files this doubles memory usage and can easily exhaust Lambda memory/disk, even though downstream rendering only needs a downsampled subset. Consider constructing an array directly (e.g., numpy.fromiter(..., count=expected_values).reshape(event_count, channel_count)) or iterating in chunks, and avoid the intermediate rows list.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 85e2f1d.

Replaced the two-list intermediate (list(fd.events) + chunked rows comprehension) with a single preallocated numpy.fromiter(..., count=expected_values) and a zero-copy .reshape(event_count, channel_count) before handing it to pandas. For the 8M-float example case the peak intermediate footprint drops from ~450 MB of Python floats to ~64 MB of contiguous float64 — about a 7× reduction.

The truncation guard is preserved: if fd.events yields fewer items than event_count * channel_count, numpy.fromiter raises and we re-raise with the original FCS data is truncated or malformed message. 12/12 shared tests pass.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I consider you a 7x better web-developer!! What is your favorite way to catch these sorts of things early? I cover in tests but I never have memory complexity readout.

Comment thread lambdas/shared/src/t4_lambda_shared/preview.py Outdated
Comment thread py-shared/pyproject.toml Outdated
drernie and others added 2 commits April 27, 2026 17:46
Three findings from copilot-pull-request-reviewer on this PR:

1. _split_fcs_text_tokens: drop the `if token` filter on the return
   list. Empty tokens are valid in FCS TEXT segments (e.g., a key with
   empty value); filtering them out can shift key/value alignment in
   the zip pairing and produce incorrect metadata.

2. _build_fcs_scatter_spec: replace `Series.map(isfinite)` with
   `numpy.isfinite(series.to_numpy())`. The vectorized numpy version
   avoids per-element Python calls — significant speedup for large
   FCS event tables before downsampling.

3. py-shared[es]: remove the `numpy < 2` constraint. py-shared has no
   numpy imports anywhere; the constraint was added in the original
   split commit but was never used. Removing it keeps ES installs
   lightweight. lambdas/shared still declares its own numpy>=1.26,<3.

Verified: 12/12 lambdas/shared tests pass; 36/36 py-shared tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…doubling)

Addresses Copilot finding on quilt#4858 about
`_parse_fcs_flowio_full` materializing event data twice.

## Problem

The previous implementation built two full copies of the event data
in memory before constructing the DataFrame:

  1. `values = list(fd.events)` — flat Python list of every float in
     every channel of every event. For an N-event × C-channel file
     that's N×C Python float objects (~28 bytes each).

  2. `rows = [values[offset:offset + C] for offset in range(0, N×C, C)]`
     — same data again, now sliced into N row-lists of C floats each.

Only after both lists exist does pandas build the DataFrame.

For a moderate FCS file (1M events × 8 channels = 8M floats), this
peaks at roughly 450 MB of intermediate Python objects — easy to OOM
inside the preview Lambda's memory budget, even though the eventual
DataFrame and downsampled scatter spec are far smaller.

## Fix

Stream `fd.events` directly into a single preallocated `numpy.ndarray`
of exactly the expected size, then reshape it (no copy) into the
event × channel matrix that pandas wraps:

    values = numpy.fromiter(fd.events, dtype=float, count=expected_values)
    data = pandas.DataFrame(values.reshape(N, C), columns=channel_names)

`numpy.fromiter(..., count=expected_values)` preallocates the array
once. `reshape` is a metadata-only view, not a copy. Result: one
contiguous float64 buffer (~64 MB for the 8M-float example, ~7×
smaller than before) plus the DataFrame's view of it.

The truncation guard is preserved: if `fd.events` yields fewer than
`expected_values` items, `numpy.fromiter` raises `ValueError`, which
is re-raised with the original message so callers see the same
"FCS data is truncated or malformed" error.

## Validation

- 12/12 lambdas/shared/tests/test_preview.py pass (incl. the FCS
  parsing, downsampling, NaN/Inf, and TEXT-segment fallback cases).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@drernie
Copy link
Copy Markdown
Member

drernie commented Apr 28, 2026

@Austin-s-h heads-up — this PR has been narrowed and amended on your branch since your last push. Quick summary:

Scope change

  • Catalog FCS frontend wiring (catalog/app/components/Preview/Fcs.*) was extracted into a sibling PR: quilt#4860. This PR is now lambdas/shared/ + py-shared/ only — much easier to review.
  • Consumer lambda updates remain in quilt#4859, now rebased on top of this branch.

Commits added on top of 20d03bae

  • dd96f259 — drop catalog FCS frontend wiring (moved to Catalog: render FCS Vega scatter from preview vegaLite spec #4860)
  • 3564ff37 — regenerate py-shared/uv.lock (the version bump in 20d03bae had dropped the numpy entry, breaking test-py-shared CI)
  • 38fd78f4 — address 3 Copilot findings: preserve empty FCS TEXT tokens; vectorized numpy.isfinite in scatter spec; remove unused numpy<2 from py-shared[es]
  • 85e2f1d1 — address Copilot's memory-doubling finding in _parse_fcs_flowio_full via numpy.fromiter + zero-copy reshape (~7× peak-memory reduction for large files)

Status

  • CI: ✅ all 44 checks green
  • Greptile: all P1 + P2 addressed (your 20d03bae covered most; the rest are in the commits above)
  • Copilot: all 4 inline comments replied to with fix-commit references
  • Mergeable, awaiting human review

Let me know if any of the changes I pushed look wrong — happy to revert/adjust. Thanks for the original work, this is a great improvement to FCS preview.

@drernie
Copy link
Copy Markdown
Member

drernie commented Apr 29, 2026 via email

drernie and others added 4 commits May 15, 2026 09:22
…atrix preview, binary sniff (quiltdata#4901)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The h5ad handler intermittently fails with h5py's "truncated file" OSError
when the underlying HTTP read tears mid-transfer. The condition is
non-deterministic — the same file/URL succeeds on a fresh attempt — so a
single bounded retry hides the flap. Structural errors (ValueError /
KeyError from a genuinely malformed h5ad) still go straight to the
error envelope.

Also include the (query-stripped) URL in the warning log so post-hoc
debugging can distinguish a transient transport hiccup from a file
that is actually broken.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
drernie added a commit that referenced this pull request May 22, 2026
Picks up the h5ad torn-read retry + URL logging from PR
quiltdata/quilt:fix/h5ad-retry-torn-read → Austin-s-h#10
(targets preview-lambda-shared / #4858), propagated through:
  - pr-4858-fcs-shared (07b4b3c fix)
  - pr-4859-fcs-consumers (2a299bc merge)
  - 260427-catalog-fcs-vega (dee7002 merge)

New tabular_preview tip: dee7002.

Closes the intermittent "Unexpected Error" on h5ad previews observed
against tf-dev-alexion (eg. test/preview-h5ad/test.h5ad).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fix(tabular_preview): retry once on torn h5ad read; log URL on failure
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants