
Windowed impute strategy#276

Merged
pdimens merged 19 commits into main from windowed_impute
Apr 16, 2026

Conversation

@pdimens (Owner) commented Apr 10, 2026

This significantly modifies the impute workflow to use a buffered window strategy as the default. The region strategy is still available, as is the legacy do-all-of-it strategy. This is now all contained within the --strategy/-s option. Checks and validations have been updated to match, as have the impute stitch_collate reports. Fixes #275
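
For readers unfamiliar with Click custom parameter types, here is a minimal sketch, assuming only Click's public ParamType API, of how a strategy value of this shape can be validated. It is an illustration, not harpy's actual ImputeStrategy implementation (which lives in harpy/common/cli_filetypes.py and is discussed in the review below).

import click

class ImputeStrategy(click.ParamType):
    '''Accepts "all", "window:<size>", or "contig:start-end" (hypothetical sketch).'''
    name = "impute_strategy"

    def convert(self, value, param, ctx):
        text = value.lower()
        if text == "all":
            return text
        if text.startswith("window:"):
            size = text.split(":", 1)[1]
            if not size.isdigit() or int(size) <= 0:
                self.fail(f"window size must be a positive integer, got {size!r}", param, ctx)
            return text
        # Anything else is expected to be a contig:start-end region.
        contig, sep, coords = value.partition(":")
        start, dash, end = coords.partition("-")
        if not (sep and dash and contig and start.isdigit() and end.isdigit()):
            self.fail(f"expected 'all', 'window:<size>', or 'contig:start-end', got {value!r}", param, ctx)
        if int(start) < 1 or int(start) >= int(end):
            self.fail(f"region start must be >= 1 and less than end, got {value!r}", param, ctx)
        return value

A command would then declare the option as, e.g., @click.option("-s", "--strategy", type=ImputeStrategy()); the class name and messages above are assumptions for illustration.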

Summary by CodeRabbit

  • New Features

    • Flexible imputation modes: window-based, region-based, and whole-genome ("all") with buffer/window sizing; imputation outputs and reports are now region-scoped.
    • Interactive image viewer for browsing plots; notebooks switched to the new viewer and CSV-backed table rendering.
    • CI adds an always-run impute job producing consolidated imputeAll outputs.
  • Bug Fixes

    • Better detection/reporting for missing input files.
    • Region bounds validated and auto-clamped to contig limits; contig discovery refined.
  • Chores

    • Bumped tool versions and added image-processing dependency.

@coderabbitai (Bot) commented Apr 10, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Replaces region-based imputation with a strategy-driven interface (--strategy supporting window:, contig:start-end, or all), implements window/region generation and region-scoped Snakemake rules with region-aware merging, updates VCF contig handling/validation, adds interactive image-viewer report utilities, and updates CI/environments/dependencies.

Changes

Cohort / File(s) Summary
CLI & Impute command
harpy/commands/impute.py, harpy/common/cli_filetypes.py
Add --strategy (new ImputeStrategy Click type) and --buffer; refactor impute() signature and parsing to accept window:, contig:start-end, or all; adapt Snakemake inputs/params and reporting text.
Snakemake: impute workflow
harpy/snakefiles/impute.smk
Add makewindows, Contig, and extract_contig_info; run STITCH per (contig,region); introduce merge_regions and merge_contigs; rename/replace rules to be region-scoped and update outputs/logging paths.
VCF validation & contigs
harpy/validation/vcf.py, harpy/validation/impute_parameters.py
Replace biallelic-contig list with contigs dict (name→length) and add get_contigs(); change validate_region() signature/behavior (clamp end to contig length, remove buffer return); minor impute-params log text update.
Reporting utilities & notebooks
harpy/report/components.py, harpy/report/utilities.py, harpy/report/notebooks/*
Add embed_image() and image_viewer() (interactive image selector), add extract_metric() utility; replace many notebook inline table/image logic with ITable, image_viewer, and extract_metric; update kernelspec metadata.
CI, environments, deps
.github/workflows/tests.yml, harpy/common/environments.py, pixi.toml
CI: switch impute invocations to strategy (-s) and add an always-run imputeAll step; bump samtools/bcftools and add tabix; pixi: bump htslib/bcftools/samtools and add pillow (note altered pandas line).
Report/table rendering replacement
harpy/report/components.py, many harpy/report/notebooks/*.ipynb
Remove/replace standard_itable usage with ITable(...).render() and change output filename conventions to .csv where applicable.
Miscellaneous small edits
harpy/common/printing.py, harpy/common/file_ops.py, harpy/common/summaries.py, harpy/validation/*.py
Docstring wording tweak; classify "Missing input files" as source-of-error; small summary text adjustments; signature/return updates noted above.

Sequence Diagram(s)

sequenceDiagram
    participant CLI as CLI (harpy impute)
    participant Parser as Strategy Parser (ImputeStrategy)
    participant VCF as VCF Validator
    participant RegionGen as Window/Region Generator
    participant Snake as Snakemake Workflow
    participant Stitch as STITCH
    participant Merger as VCF Merger

    CLI->>Parser: parse --strategy (window:|contig:start-end|all) and --buffer
    Parser->>VCF: request contigs (get_contigs) if needed
    VCF-->>Parser: contigs dict (name → length)
    Parser->>RegionGen: if window -> makewindows(contigs, window, buffer)
    RegionGen-->>Snake: list of per-contig regions / region strings
    Snake->>Stitch: launch per-(contig,region) jobs (--regionStart/--regionEnd/--buffer)
    Stitch-->>Snake: region-scoped VCF.gz + indices + plots
    Snake->>Merger: merge_regions() → contig-level BCFs
    Merger->>Merger: merge_contigs() → final paramset BCF/CSI

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • Refine quarto #204: Touches report configuration and the impute Snakemake workflow; overlaps with notebook/report/component and snakefile changes.
  • add assembly #146: Modifies imputation workflow internals and wiring; related to the window/region refactor in snakefiles.
🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 64.00%, which is below the required threshold of 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Out of Scope Changes check ❓ Inconclusive Most changes directly support windowed imputation (impute.py, impute.smk, cli_filetypes.py, vcf.py). Some changes appear tangential: notebook refactoring (ITable/image_viewer replacements), tool version updates (samtools/bcftools/tabix), and summary changes. These may be preparatory or unrelated improvements but don't contradict the windowed imputation objective. Clarify whether notebook refactoring and tool version updates are strictly necessary for windowed imputation or represent separate improvements bundled into this PR.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Windowed impute strategy' directly and clearly summarizes the primary change: introducing a windowed strategy for the impute command as the default approach.
Linked Issues check ✅ Passed The PR implements all key objectives from issue #275: adds buffered window strategy support [#275], enables per-window STITCH parallelization in the Snakemake workflow [#275], integrates windowed execution natively without external scripts [#275], and exposes strategy selection via --strategy/-s CLI option [#275].


@coderabbitai (Bot) left a comment

Actionable comments posted: 7

🧹 Nitpick comments (3)
harpy/common/environments.py (1)

59-60: LGTM! The tabix dependency addition is necessary.

The addition of bioconda::tabix is correct and required, as the impute workflow explicitly invokes tabix to index VCF files (as shown in harpy/snakefiles/impute.smk:166).

Optional: Consider pinning the tabix version for reproducibility

For better reproducibility and to prevent potential issues from future tabix updates, consider specifying a version constraint:

         "impute" : [
             "bioconda::r-stitch>=1.8.4",
-            "bioconda::tabix"
+            "bioconda::tabix=1.11"
         ],

Note: While this file shows mixed practices (some dependencies are pinned, others aren't), version pinning helps ensure consistent behavior across environments.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@harpy/common/environments.py` around lines 59 - 60, The dependency list now
includes the unpinned package identifier "bioconda::tabix"; update that entry to
a version-pinned spec (e.g., change the "bioconda::tabix" list item to a
concrete version constraint matching the version used in CI or your environment)
so the environment is reproducible—locate and replace the "bioconda::tabix"
string in the dependency list in environments.py (the same entry referenced by
the impute workflow) with a pinned version constraint.
harpy/report/components.py (2)

232-235: Rename input here.

This shadows the Python builtin and is what Ruff is complaining about. A small rename keeps notebook debugging less confusing.

Small cleanup
 def extract_metric(x: list[str], param: str):
     '''Convenience function to find the relevnant sections of the bcftools.stats file and return a table'''
-    input = "".join(s for s in x if s.startswith(param))
-    return pd.read_table(StringIO(input), sep = "\t", header = None)
+    matched_text = "".join(s for s in x if s.startswith(param))
+    return pd.read_table(StringIO(matched_text), sep = "\t", header = None)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@harpy/report/components.py` around lines 232 - 235, The function
extract_metric contains a local variable named input that shadows the Python
builtin; rename that variable (e.g., matched_text or input_str) in the
extract_metric function so the call StringIO(input) and the comprehension
"".join(...) use the new name, preserving behavior and return value while
avoiding builtin shadowing; update both the assignment and the StringIO(...)
reference inside extract_metric.

165-230: stitch_image_viewer() now duplicates image_viewer() almost verbatim.

The notebook callers already pass sortkey, recursive=True, and thing_to_select="region" directly into image_viewer(), so keeping a second encoder/template means future fixes have to land twice. Consider making this a thin wrapper instead.

Refactor sketch
-def stitch_image_viewer(label: str, dir: str, pattern: str, scale: float = 1.0):
-    '''
-    Create a javascript image viewer with a file picker for the different chromosomal regions.
-    Images are embedded (thus can safely have their original files deleted) and can be down-scaled.
-    '''
-    imgfmt = "image/png" if "png" in pattern else "image/jpg"
-    options_parts = []
-    paths = sorted(
-        Path(dir).rglob(pattern),
-        key=lambda p: int(p.parents[1].name.split('-')[0])
-    )
-    uid = uuid.uuid4().hex[:8]
-    images_dict = {}
-    for p in paths:
-        pname = p.parents[1].name
-        options_parts.append(f'<option value="{pname}">{pname}</option>')
-        image = PImage.open(p)
-        if scale < 1:
-            image = image.resize((int(image.size[0] * scale), int(image.size[1] * scale)), PImage.LANCZOS)
-        buf = BytesIO()
-        image.save(buf, format="PNG" if "png" in pattern else "jpeg")
-        buf.seek(0)
-        encoded = base64.b64encode(buf.read()).decode()
-        images_dict[pname] = f"data:{imgfmt};base64,{encoded}"
-
-    images_js = json.dumps(images_dict)
-    options_html = "\n".join(options_parts)
-
-    html = f"""..."""
-    display(HTML(html))
+def stitch_image_viewer(label: str, image_dir: str, pattern: str, scale: float = 1.0):
+    return image_viewer(
+        label,
+        image_dir,
+        pattern,
+        sortkey=lambda p: int(p.parents[1].name.split('-')[0]),
+        thing_to_select="region",
+        recursive=True,
+        scale=scale,
+    )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@harpy/report/components.py` around lines 165 - 230, stitch_image_viewer
duplicates image_viewer; replace its full implementation with a thin wrapper
that delegates to image_viewer to avoid duplicated encoder/template logic.
Specifically, change stitch_image_viewer(label: str, dir: str, pattern: str,
scale: float = 1.0) so it simply calls image_viewer(...) passing through label,
dir, pattern and scale and forwarding any additional needed args (e.g., sortkey,
recursive=True, thing_to_select="region" where callers expect them) so all
encoding/HTML generation remains centralized in image_viewer.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@harpy/commands/impute.py`:
- Around line 65-84: The code treats region as a bool, causing
workflow.param(region, "region") to write True instead of the validated region
string; change the logic to keep a boolean flag (e.g., is_region) and capture
the validated region string from vcffile.validate_region(strategy) (e.g.,
validated_region = vcffile.validate_region(strategy)), then pass
validated_region to workflow.param(validated_region, "region") when is_region is
True; also ensure window is still parsed from strategy when it
startswith("window:") and that workflow.param(window, "window-size") remains
used when appropriate.

In `@harpy/common/cli_filetypes.py`:
- Around line 281-289: The validator currently allows zero/negative window sizes
and invalid region bounds; update the window handling and region validation in
the block that checks pieces[0] == "window" and the else branch that parses
parts = pieces[1].split("-"): for the "window" branch ensure pieces[1] parses to
an integer > 0 (use int(...) or isInt and then int) and call self.fail with the
existing message if <= 0; for the region branch ensure both bounds parse as
integers, start > 0, end > start (use self.isInt on both and compare ints) and
call self.fail with the region error message when those checks fail so
zero/negative and reversed coordinates are rejected before downstream use.

In `@harpy/report/components.py`:
- Around line 95-99: embed_image currently opens the file with PImage.open(x),
resizes and saves back to the same path (overwriting the source); change it to
resize the image in memory and embed that image data so the source file isn't
modified. In the embed_image function, after PImage.open(x) perform the resize
into an in-memory buffer (e.g., BytesIO) and return an IPython.display.Image
constructed from that in-memory bytes with embed=True instead of using
Image(filename=x), leaving the original file untouched and preventing cumulative
degradation.
- Around line 108-123: The dropdown keys are collapsing because pname is set to
p.parents[1].name which is identical for flat directories, causing
images_dict[pname] to be overwritten; change the selector key to a unique file
identifier (e.g., use p.name or a relative path like str(p.relative_to(dir)))
and use that same unique key when appending to options_parts and populating
images_dict so each image entry is unique (update references to pname in the
block that builds paths, options_parts, images_dict and images_js).

In `@harpy/snakefiles/impute.smk`:
- Around line 178-200: The rule merge_regions currently drops the {contig}
wildcard in its region-mode outputs while still referencing wc.contig in the
input lambda and log, breaking wildcard inference and downstream contig_report;
change merge_regions to always emit the contig-level outputs
"{paramset}/contigs/{contig}/{contig}.bcf" and
"{paramset}/contigs/{contig}/{contig}.bcf.csi" (so wc.contig stays inferred and
the input lambda and log can keep using wc.contig), and then add a tiny
finalization rule (e.g., finalize_paramset_bcf) that runs only for the
single-region case to create the archive-level "{paramset}/{paramset}.bcf" and
"{paramset}/{paramset}.bcf.csi" (by copying/concatenating or symlinking the
contig BCFs and creating indexes) so contig_report and other consumers continue
to get the contig-level path and the paramset-level artifacts still exist when
needed.
- Around line 54-57: The region parsing incorrectly splits twice on different
separators causing start to include the contig (e.g., "3L:3000"); update the
logic that builds contigs[...]/Contig by first splitting region on ":" into cntg
and coords (e.g., cntg, coords = region.split(":", 1)), then split coords on "-"
into start and end (e.g., start, end = coords.split("-", 1)), and convert start
and end to int before constructing Contig(cntg, int(end), start=int(start)) so
int() receives only numeric strings; update the code that sets cntg and
contigs[...] accordingly.

In `@harpy/validation/vcf.py`:
- Around line 154-172: The region parsing currently accepts nonsensical ranges
(e.g., start 0 or start > end); update the logic around contig, startpos, endpos
in the method that splits region so you validate ordering and bounds: after
computing startpos/endpos but before using/returning them, ensure startpos is >=
1 and startpos <= endpos (and keep the existing endpos > contig adjustment), and
on invalid ranges call self.print.validation(False) and self.print.error with a
clear message and guidance (similar style to the existing contig-missing error
referencing self.biallelic_file), otherwise call self.print.validation(True) and
return the validated/adjusted contig, startpos, endpos.
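
Taken together, the impute.smk and vcf.py comments above prescribe the same parsing discipline: split on ":" before "-", and bound-check before clamping. A minimal plain-Python sketch of that logic (the contigs dict is a hypothetical stand-in for the name→length mapping returned by get_contigs()):

contigs = {"3L": 24_543_557}  # hypothetical contig -> length mapping

def parse_region(region: str) -> tuple[str, int, int]:
    # Split on ":" first, then on "-", so int() only ever sees numeric strings.
    cntg, coords = region.split(":", 1)
    start_text, end_text = coords.split("-", 1)
    start, end = int(start_text), int(end_text)
    if start < 1 or start > end:
        raise ValueError(f"invalid range in {region!r}: start must be >= 1 and <= end")
    # Clamp the end coordinate to the contig length, as validate_region() does.
    return cntg, start, min(end, contigs[cntg])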

---

Nitpick comments:
In `@harpy/common/environments.py`:
- Around line 59-60: The dependency list now includes the unpinned package
identifier "bioconda::tabix"; update that entry to a version-pinned spec (e.g.,
change the "bioconda::tabix" list item to a concrete version constraint matching
the version used in CI or your environment) so the environment is
reproducible—locate and replace the "bioconda::tabix" string in the dependency
list in environments.py (the same entry referenced by the impute workflow) with
a pinned version constraint.

In `@harpy/report/components.py`:
- Around line 232-235: The function extract_metric contains a local variable
named input that shadows the Python builtin; rename that variable (e.g.,
matched_text or input_str) in the extract_metric function so the call
StringIO(input) and the comprehension "".join(...) use the new name, preserving
behavior and return value while avoiding builtin shadowing; update both the
assignment and the StringIO(...) reference inside extract_metric.
- Around line 165-230: stitch_image_viewer duplicates image_viewer; replace its
full implementation with a thin wrapper that delegates to image_viewer to avoid
duplicated encoder/template logic. Specifically, change
stitch_image_viewer(label: str, dir: str, pattern: str, scale: float = 1.0) so
it simply calls image_viewer(...) passing through label, dir, pattern and scale
and forwarding any additional needed args (e.g., sortkey, recursive=True,
thing_to_select="region" where callers expect them) so all encoding/HTML
generation remains centralized in image_viewer.
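
The components.py inline comment above asks embed_image() to resize in memory instead of overwriting the source file. A minimal sketch of that idea, assuming Pillow and IPython.display (not the repository's exact code):

from io import BytesIO
from PIL import Image as PImage
from IPython.display import Image

def embed_image(path: str, scale: float = 1.0) -> Image:
    img = PImage.open(path)
    if scale < 1:
        # Downscale a copy in memory; the file on disk is never rewritten.
        img = img.resize((int(img.size[0] * scale), int(img.size[1] * scale)), PImage.LANCZOS)
    fmt = "PNG" if path.lower().endswith(".png") else "JPEG"
    buf = BytesIO()
    img.save(buf, format=fmt)
    # Embed the bytes directly so the original image can be safely deleted.
    return Image(data=buf.getvalue(), format=fmt.lower(), embed=True)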

📥 Commits

Reviewing files that changed from the base of the PR and between 0770bb2 and a27ded2.

📒 Files selected for processing (13)
  • .github/workflows/tests.yml
  • harpy/commands/impute.py
  • harpy/common/cli_filetypes.py
  • harpy/common/environments.py
  • harpy/common/file_ops.py
  • harpy/common/printing.py
  • harpy/report/components.py
  • harpy/report/notebooks/preproc_stats.ipynb
  • harpy/report/notebooks/stitch_collate.ipynb
  • harpy/snakefiles/impute.smk
  • harpy/validation/impute_parameters.py
  • harpy/validation/vcf.py
  • pixi.toml

@coderabbitai (Bot) left a comment

Actionable comments posted: 5

♻️ Duplicate comments (2)
harpy/report/components.py (1)

115-127: ⚠️ Potential issue | 🟠 Major

images_dict keys can still be overwritten when labels collide.

At Line 117/126, pname is both display label and dictionary key. If two files resolve to the same label, one silently replaces the other.

Proposed fix
-    for p in paths:
-        pname = option_key(p)
-        options_parts.append(f'<option value="{pname}">{pname}</option>')
+    for p in paths:
+        display_name = option_key(p)
+        key = str(p.relative_to(image_dir))
+        options_parts.append(f'<option value="{key}">{display_name}</option>')
@@
-        images_dict[pname] = f"data:{imgfmt};base64,{encoded}"
+        images_dict[key] = f"data:{imgfmt};base64,{encoded}"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@harpy/report/components.py` around lines 115 - 127, The code currently uses
pname (from option_key) as both the option label and the images_dict key,
allowing collisions to silently overwrite entries; modify the loop that builds
options_parts and images_dict (referencing option_key, paths, pname,
options_parts, images_dict) to ensure unique dictionary keys by disambiguating
duplicate pname values (for example by appending a numeric suffix or unique
index when a pname already exists in images_dict) and use that unique key for
both the <option value="..."> entry and the images_dict key so the displayed
label remains unchanged while stored keys are distinct.
harpy/commands/impute.py (1)

65-70: ⚠️ Potential issue | 🟠 Major

Preserve the validated region string instead of a boolean.

This still drops the clamped (contig, start, end) returned by validate_region(), so Snakemake sees the original strategy value and the workflow summary reports Target Region: True instead of the actual region.

🔧 Minimal fix
-    region = strategy.lower() != "all" and not strategy.lower().startswith("window:")
-    window = None
-    if region:
-        vcffile.validate_region(strategy)
+    region = None
+    window = None
+    if strategy.lower() != "all" and not strategy.lower().startswith("window:"):
+        contig, startpos, endpos = vcffile.validate_region(strategy)
+        region = f"{contig}:{startpos}-{endpos}"
     elif strategy.lower().startswith("window:"):
-        window = int(strategy.split(":")[1])
+        window = int(strategy.split(":", 1)[1])
@@
     workflow.param(grid_size, "grid-size")
     if region:
-        workflow.param(strategy, "region")
+        workflow.param(region, "region")
     elif window:
         workflow.param(window, "window-size")

Also applies to: 79-80, 89-95

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@harpy/commands/impute.py` around lines 65 - 70, The code currently sets
region as a boolean and calls vcffile.validate_region(strategy) without
preserving its return value; change this so region is assigned the validated
region string/tuple returned by vcffile.validate_region(strategy) (e.g., region
= vcffile.validate_region(strategy)) and only treat it as None when not
provided, and keep the window handling (window = int(strategy.split(":")[1]))
unchanged; apply the same fix to the other occurrences where validate_region is
called (the similar blocks referenced around the validate_region calls at the
later branches) so Snakemake receives the actual validated region instead of
True/False.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@harpy/report/components.py`:
- Around line 118-155: The HTML injection risk comes from unescaped filenames
and selector text being interpolated into the template (see options_parts.append
usage building option elements from pname and the interpolations of label and
thing_to_select into the HTML). Fix by HTML-escaping all user/file-derived
strings before inserting into attributes/text: escape pname when building
options (both value and inner text), and escape label and thing_to_select when
interpolating into the template (use html.escape(..., quote=True) or
equivalent). Ensure the escaped strings are used for options_html and in the
f-string template so no raw filenames or labels are rendered unescaped.
- Around line 101-102: The image saving and the returned Image metadata disagree
on format: image.save(..., format = "PNG" if "png" in x.lower() else "jpeg") may
save as JPEG but the return always uses format="png"; update the logic in the
block around image.save and the return Image(...) to compute a single format
variable (e.g., format_str = "PNG" if "png" in x.lower() else "JPEG") and use
that same variable for both image.save(buf, format=format_str) and return
Image(data=buf.getvalue(), format=format_str.lower() or format_str as required
by Image), ensuring consistent case/values expected by the Image constructor.

In `@harpy/report/utilities.py`:
- Around line 7-10: The extract_metric helper should avoid calling pd.read_table
on empty input and should match section tokens more strictly: in function
extract_metric, change the selection filter to only include lines that start
with the exact section token plus a tab (e.g. s.startswith(f"{param}\t")) and
after building selectiontext check if it's empty/blank (selectiontext.strip() ==
"") and if so return an empty pandas.DataFrame instead of calling pd.read_table;
otherwise proceed to parse with pd.read_table(StringIO(selectiontext), sep="\t",
header=None).

In `@harpy/snakefiles/impute.smk`:
- Around line 181-199: The rule currently uses find on the filesystem to build
{output.filelist} which can pick up stale files and ignores the deterministic
input order; instead, use the declared Snakemake inputs (the lambda/collect
expression producing the per-region VCFs via contigs[wc.contig].regions and ext)
by adding a named input list (e.g., vcf_files) and write {output.filelist} from
that list inside the shell command (or via a run block) so you concatenate
exactly the scheduled files in the intended order; apply the same change to the
duplicate block mentioned for lines 203-219, and reference the same wildcards
({wildcards.paramset}, {wildcards.contig}) and outputs (output.filelist,
output.bcf) when writing the filelist.

In `@harpy/validation/vcf.py`:
- Around line 167-172: Update the user-facing error text in the region
validation to reflect the current biallelic threshold used by
find_biallelic_contigs(): change the phrase "at least 2 biallelic SNPs" to "at
least 5 biallelic SNPs" (or, if a threshold constant/property exists,
interpolate that value instead) in the error constructed in the validation block
that prints "missing contig" so the message in the contig check matches
find_biallelic_contigs()'s criterion.
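
A sketch of the stricter extract_metric() the utilities.py comment above describes, with exact token matching and a guard for empty sections (pandas raises EmptyDataError on an empty buffer):

from io import StringIO
import pandas as pd

def extract_metric(lines: list[str], param: str) -> pd.DataFrame:
    # Match the section token exactly, so e.g. "SN\t" does not also capture "SNP" lines.
    selectiontext = "".join(s for s in lines if s.startswith(f"{param}\t"))
    if not selectiontext.strip():
        return pd.DataFrame()
    return pd.read_table(StringIO(selectiontext), sep="\t", header=None)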

---

Duplicate comments:
In `@harpy/commands/impute.py`:
- Around line 65-70: The code currently sets region as a boolean and calls
vcffile.validate_region(strategy) without preserving its return value; change
this so region is assigned the validated region string/tuple returned by
vcffile.validate_region(strategy) (e.g., region =
vcffile.validate_region(strategy)) and only treat it as None when not provided,
and keep the window handling (window = int(strategy.split(":")[1])) unchanged;
apply the same fix to the other occurrences where validate_region is called (the
similar blocks referenced around the validate_region calls at the later
branches) so Snakemake receives the actual validated region instead of
True/False.

In `@harpy/report/components.py`:
- Around line 115-127: The code currently uses pname (from option_key) as both
the option label and the images_dict key, allowing collisions to silently
overwrite entries; modify the loop that builds options_parts and images_dict
(referencing option_key, paths, pname, options_parts, images_dict) to ensure
unique dictionary keys by disambiguating duplicate pname values (for example by
appending a numeric suffix or unique index when a pname already exists in
images_dict) and use that unique key for both the <option value="..."> entry and
the images_dict key so the displayed label remains unchanged while stored keys
are distinct.

📥 Commits

Reviewing files that changed from the base of the PR and between a27ded2 and 29b09ee.

📒 Files selected for processing (8)
  • harpy/commands/impute.py
  • harpy/common/cli_filetypes.py
  • harpy/report/components.py
  • harpy/report/notebooks/bcftools_stats.ipynb
  • harpy/report/notebooks/stitch_collate.ipynb
  • harpy/report/utilities.py
  • harpy/snakefiles/impute.smk
  • harpy/validation/vcf.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • harpy/report/notebooks/stitch_collate.ipynb

@coderabbitai (Bot) left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
harpy/commands/impute.py (1)

84-85: Inconsistent case handling for "all" strategy.

Lines 65 and 70 use strategy.lower() for comparisons, but line 84 uses strategy != "all" without .lower(). While this currently works (since ImputeStrategy enforces lowercase "all"), the inconsistency could become a bug if the validator ever allows case-insensitive input.

Suggested fix for consistency
-    if strategy != "all":
+    if strategy.lower() != "all":
         workflow.param(buffer, "buffer")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@harpy/commands/impute.py` around lines 84 - 85, The comparison against the
"all" strategy is inconsistent: change the check that currently reads strategy
!= "all" to use a normalized form (e.g., strategy.lower() != "all") so it
matches the other comparisons that call strategy.lower(); update the conditional
around workflow.param(buffer, "buffer") to use strategy.lower() for consistency
with earlier checks (referencing the strategy variable and the
workflow.param(buffer, "buffer") call in impute.py and the ImputeStrategy
usage).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@harpy/common/cli_filetypes.py`:
- Around line 286-291: The two self.fail(...) calls that only pass a message
should be updated to include the Click parameter and context objects (i.e., call
self.fail(message, param, ctx)) to match the other failures; locate the failing
calls that validate window strategy and region format (the self.fail invocations
near the pieces/parts checks in the method that uses self.isInt) and change them
to pass the same param and ctx variables used by the surrounding self.fail
usages.

---

Nitpick comments:
In `@harpy/commands/impute.py`:
- Around line 84-85: The comparison against the "all" strategy is inconsistent:
change the check that currently reads strategy != "all" to use a normalized form
(e.g., strategy.lower() != "all") so it matches the other comparisons that call
strategy.lower(); update the conditional around workflow.param(buffer, "buffer")
to use strategy.lower() for consistency with earlier checks (referencing the
strategy variable and the workflow.param(buffer, "buffer") call in impute.py and
the ImputeStrategy usage).

📥 Commits

Reviewing files that changed from the base of the PR and between 29b09ee and 66b7fcc.

📒 Files selected for processing (2)
  • harpy/commands/impute.py
  • harpy/common/cli_filetypes.py

@coderabbitai (Bot) left a comment

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
harpy/report/notebooks/validate_fastq.ipynb (1)

104-144: ⚠️ Potential issue | 🟠 Major

Incomplete migration will cause runtime errors.

Two issues in the table rendering:

  1. Line 110: The ITable instance is created but .render() is never called, so it won't display anything.

  2. Lines 123-143: The cell still calls standard_itable(...) and references itables.JavascriptFunction, but neither standard_itable nor itables are imported. This will raise NameError at runtime.

You should either:

  • Remove the old standard_itable cell entirely and add .render() to the new ITable call, or
  • Replace the old cell's logic with the new ITable approach
Proposed fix (option 1: use ITable with render)

Remove lines 113-145 (the old standard_itable cell) and update the new cell:

-        "tbl = ITable(data, \"validate.fastq.csv\")"
+        "ITable(data, \"validate.fastq.csv\", fixedcols=1).render()"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@harpy/report/notebooks/validate_fastq.ipynb` around lines 104 - 144, The
ITable instance is created but never rendered and the old standard_itable cell
references missing names (standard_itable, itables.JavascriptFunction) causing
runtime NameError; fix by removing the legacy standard_itable cell and call
.render() on the ITable object (i.e., replace the current "tbl = ITable(data,
\"validate.fastq.csv\")" cell with a cell that invokes tbl.render()), or
alternatively replace the legacy cell body with the new ITable.render()
approach; if you choose to keep standard_itable, import the missing symbols
(standard_itable and itables) before using them.
harpy/report/notebooks/validate_bam.ipynb (1)

121-161: ⚠️ Potential issue | 🟠 Major

Incomplete migration will cause runtime errors.

Same issue as validate_fastq.ipynb:

  1. Line 127: ITable instance created but .render() not called - nothing will display.

  2. Lines 140-160: The cell calls standard_itable(...) and itables.JavascriptFunction, but neither is imported. This will raise NameError.

Proposed fix

Remove lines 130-162 (the old standard_itable cell) and update the new cell:

-        "tbl = ITable(data, \"validate.bam.csv\", fixedcols = 1)"
+        "ITable(data, \"validate.bam.csv\", fixedcols=1).render()"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@harpy/report/notebooks/validate_bam.ipynb` around lines 121 - 161, The ITable
created via ITable(data, "validate.bam.csv", fixedcols = 1) is never rendered
and the later cell references standard_itable and itables.JavascriptFunction
which are not imported; remove the old standard_itable cell, call tbl.render()
immediately after creating the ITable (i.e., invoke ITable(...).render() or add
tbl.render()), and update the remaining cell to either import standard_itable
and itables (so standard_itable(..., coldefs=[{"createdCell":
itables.JavascriptFunction(...)}]) resolves) or replace
itables.JavascriptFunction with the correct, already-imported helper; ensure all
referenced symbols (ITable, tbl.render, standard_itable,
itables.JavascriptFunction) are present via import or replaced so no NameError
occurs.
♻️ Duplicate comments (1)
harpy/snakefiles/impute.smk (1)

217-224: ⚠️ Potential issue | 🔴 Critical

Same issue: file list needs one file per line.

Apply the same fix as merge_regions to ensure bcftools concat -f receives properly formatted input.

🐛 Proposed fix
     shell:
         """
-        echo {input} > {output.filelist}
+        printf '%s\n' {input} > {output.filelist}
         {{
             bcftools concat --threads {threads} -f {output.filelist} |
             bcftools sort -Ob -o {output.bcf} --write-index
         }}  2> {log}
         """
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@harpy/snakefiles/impute.smk` around lines 217 - 224, The shell block in the
rule using bcftools concat (the echo {input} > {output.filelist} and bcftools
concat -f {output.filelist} invocation) writes all input paths on one line;
change it to write one filename per line (as done in merge_regions) so bcftools
concat -f gets a properly formatted filelist—use a printf/join-with-newline
approach to emit each {input} item on its own line before calling bcftools
concat, keeping the existing {output.bcf}, {output.filelist}, {log} and
{threads} symbols.
🧹 Nitpick comments (3)
harpy/report/notebooks/align_stats.ipynb (2)

718-718: Minor style inconsistency: positional vs keyword argument.

This call passes fixedcols as a positional argument (1), while other ITable calls in this PR use the keyword form (fixedcols=1). Consider using fixedcols=1 for consistency.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@harpy/report/notebooks/align_stats.ipynb` at line 718, The ITable(...) call
uses a positional argument for fixedcols causing style inconsistency; change the
call ITable(contig_avg, f"{samplename}.align.depth.csv", 1).render() to pass
fixedcols as a keyword (fixedcols=1) so it matches other ITable usages and keeps
argument order explicit.

732-732: Same style note as above.

Uses positional 1 for fixedcols instead of keyword fixedcols=1.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@harpy/report/notebooks/align_stats.ipynb` at line 732, The call to ITable
uses a positional argument for fixedcols (passing 1) which is inconsistent with
the rest of the notebook; change the call ITable(outliers.sort_values('Read
Depth', ascending=False).head(100), f"{samplename}.align.outliers.csv",
1).render() to use the keyword argument fixedcols=1 so it becomes ITable(...,
fixedcols=1).render(), ensuring consistency and clarity.
harpy/report/notebooks/stitch_collate.ipynb (1)

288-288: Consider adding defensive handling for the sortkey lambda.

The sortkey lambda lambda p: int(p.parents[1].name.split('-')[0]) assumes the grandparent directory name contains a - delimiter and the first part is numeric. If the directory structure doesn't match (e.g., unexpected naming or missing -), this will raise ValueError or IndexError.

If the directory structure is guaranteed by the workflow, this is fine. Otherwise, consider wrapping with a try-except or validating the format.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@harpy/report/notebooks/stitch_collate.ipynb` at line 288, The sort key lambda
used in the image_viewer call (image_viewer(..., sortkey = lambda p:
int(p.parents[1].name.split('-')[0]), ...)) can raise ValueError/IndexError for
unexpected directory names; replace the inline lambda with a small defensive
helper (e.g. safe_sortkey) that attempts to parse
int(p.parents[1].name.split('-')[0]) inside a try/except and returns a sensible
fallback (like 0 or the original name) on failure, then pass
sortkey=safe_sortkey to image_viewer to avoid crashes on malformed directory
names.
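
A sketch of the defensive sort key suggested above, assuming pathlib.Path inputs and grandparent directories named like start-end (e.g. 3000-52000); malformed names sort first instead of raising:

from pathlib import Path

def safe_sortkey(p: Path) -> int:
    try:
        # Sort by the numeric start coordinate of the region directory name.
        return int(p.parents[1].name.split("-")[0])
    except (ValueError, IndexError):
        return 0

Passing sortkey=safe_sortkey to image_viewer() then tolerates directories that do not follow the start-end convention.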
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@harpy/common/summaries.py`:
- Around line 146-157: The code reads PARAMETERS.get("window", None) but the
imputation workflow passes the value under "window-size", so the window variable
is always None and the elif window: branch never runs; update the lookup in
summaries.py to use PARAMETERS.get("window-size", None) (affecting the window
variable used in the region/regiontext logic that builds regiontext for region,
buffer and window branches) so the buffer is correctly appended when windowed
imputation is used.

In `@harpy/snakefiles/impute.smk`:
- Around line 178-192: In the rule merge_regions, the two input entries (the
first lambda collecting .vcf.gz.tbi paths and the second vcf = lambda collecting
.vcf.gz paths) are missing a separating comma; edit the rule merge_regions input
block to insert a comma after the first collect(...) lambda so the second named
input vcf = lambda wc: collect(...) is a separate list item, ensuring correct
Python/Snakemake syntax for the input declarations that reference
contigs[wc.contig].regions.
- Around line 197-204: The current shell block writes {input.vcf} into
{output.filelist} with echo which produces a single space-separated line but
bcftools concat -f expects one file path per line; update the shell command in
the rule (the shell: block that creates {output.filelist} and then runs
`bcftools concat -f {output.filelist}`) to emit one filename per line (e.g., use
printf '%s\n' or replace spaces with newlines) so each input path is on its own
line before calling bcftools concat.
- Around line 18-19: In the impute() method, change the PARAMETERS lookup key
from "window" to "window-size" (replace self.PARAMETERS.get("window", None) with
self.PARAMETERS.get("window-size", None)) so the parameter name matches what
impute() is passed by the Snakefile/command; update any local variable usage
(e.g., window) as needed to use the retrieved value.

In `@pixi.toml`:
- Line 20: The pixi.toml has a corrupted dependency entry for the pandas key
(duplicate/concatenated text). Edit the pixi.toml entry for "pandas" to remove
the duplicated fragment so the line reads a single valid TOML assignment (e.g.,
pandas = ">=2.3.3,<3"), ensuring proper quoting and no duplicated key; after
fixing, run a TOML validator or the project's dependency parser to confirm the
manifest parses cleanly.
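
The impute.smk comments above reduce to the same requirement as the earlier printf fix: bcftools concat -f expects one path per line, not a single space-separated line. A plain-Python illustration (the file names are hypothetical):

vcf_files = ["contigs/3L/3000-52000.vcf.gz", "contigs/3L/50000-99000.vcf.gz"]
with open("concat.list", "w") as listfile:
    # One path per line, trailing newline included.
    listfile.write("\n".join(vcf_files) + "\n")

Writing the list from the rule's declared inputs, rather than shelling out to find, also guarantees the concatenation order matches what Snakemake scheduled.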

---

Outside diff comments:
In `@harpy/report/notebooks/validate_bam.ipynb`:
- Around line 121-161: The ITable created via ITable(data, "validate.bam.csv",
fixedcols = 1) is never rendered and the later cell references standard_itable
and itables.JavascriptFunction which are not imported; remove the old
standard_itable cell, call tbl.render() immediately after creating the ITable
(i.e., invoke ITable(...).render() or add tbl.render()), and update the
remaining cell to either import standard_itable and itables (so
standard_itable(..., coldefs=[{"createdCell": itables.JavascriptFunction(...)}])
resolves) or replace itables.JavascriptFunction with the correct,
already-imported helper; ensure all referenced symbols (ITable, tbl.render,
standard_itable, itables.JavascriptFunction) are present via import or replaced
so no NameError occurs.

In `@harpy/report/notebooks/validate_fastq.ipynb`:
- Around line 104-144: The ITable instance is created but never rendered and the
old standard_itable cell references missing names (standard_itable,
itables.JavascriptFunction) causing runtime NameError; fix by removing the
legacy standard_itable cell and call .render() on the ITable object (i.e.,
replace the current "tbl = ITable(data, \"validate.fastq.csv\")" cell with a
cell that invokes tbl.render()), or alternatively replace the legacy cell body
with the new ITable.render() approach; if you choose to keep standard_itable,
import the missing symbols (standard_itable and itables) before using them.

---

Duplicate comments:
In `@harpy/snakefiles/impute.smk`:
- Around line 217-224: The shell block in the rule using bcftools concat (the
echo {input} > {output.filelist} and bcftools concat -f {output.filelist}
invocation) writes all input paths on one line; change it to write one filename
per line (as done in merge_regions) so bcftools concat -f gets a properly
formatted filelist—use a printf/join-with-newline approach to emit each {input}
item on its own line before calling bcftools concat, keeping the existing
{output.bcf}, {output.filelist}, {log} and {threads} symbols.

---

Nitpick comments:
In `@harpy/report/notebooks/align_stats.ipynb`:
- Line 718: The ITable(...) call uses a positional argument for fixedcols
causing style inconsistency; change the call ITable(contig_avg,
f"{samplename}.align.depth.csv", 1).render() to pass fixedcols as a keyword
(fixedcols=1) so it matches other ITable usages and keeps argument order
explicit.
- Line 732: The call to ITable uses a positional argument for fixedcols (passing
1) which is inconsistent with the rest of the notebook; change the call
ITable(outliers.sort_values('Read Depth', ascending=False).head(100),
f"{samplename}.align.outliers.csv", 1).render() to use the keyword argument
fixedcols=1 so it becomes ITable(..., fixedcols=1).render(), ensuring
consistency and clarity.

In `@harpy/report/notebooks/stitch_collate.ipynb`:
- Line 288: The sort key lambda used in the image_viewer call (image_viewer(...,
sortkey = lambda p: int(p.parents[1].name.split('-')[0]), ...)) can raise
ValueError/IndexError for unexpected directory names; replace the inline lambda
with a small defensive helper (e.g. safe_sortkey) that attempts to parse
int(p.parents[1].name.split('-')[0]) inside a try/except and returns a sensible
fallback (like 0 or the original name) on failure, then pass
sortkey=safe_sortkey to image_viewer to avoid crashes on malformed directory
names.

📥 Commits

Reviewing files that changed from the base of the PR and between 67dae84 and 8e5ecdb.

📒 Files selected for processing (18)
  • harpy/common/cli_filetypes.py
  • harpy/common/summaries.py
  • harpy/report/components.py
  • harpy/report/notebooks/align_bxstats.ipynb
  • harpy/report/notebooks/align_stats.ipynb
  • harpy/report/notebooks/bcftools_stats.ipynb
  • harpy/report/notebooks/hapcut.ipynb
  • harpy/report/notebooks/impute.ipynb
  • harpy/report/notebooks/preproc_stats.ipynb
  • harpy/report/notebooks/qc_bx_stats.ipynb
  • harpy/report/notebooks/stitch_collate.ipynb
  • harpy/report/notebooks/sv.ipynb
  • harpy/report/notebooks/validate_bam.ipynb
  • harpy/report/notebooks/validate_fastq.ipynb
  • harpy/report/utilities.py
  • harpy/snakefiles/impute.smk
  • harpy/validation/vcf.py
  • pixi.toml
🚧 Files skipped from review as they are similar to previous changes (3)
  • harpy/report/notebooks/preproc_stats.ipynb
  • harpy/common/cli_filetypes.py
  • harpy/report/components.py

@pdimens merged commit b48bccc into main Apr 16, 2026
17 checks passed
@pdimens deleted the windowed_impute branch April 16, 2026 17:17


Development

Successfully merging this pull request may close these issues.

Parallelize harpy impute runs across genome windows
