Integrate AlphaGenome for Non-Coding Regulatory predictions #6
- Add `AlphaGenomeConnector` in `src/hg_dt/models/alphagenome.py` to interface with the DeepMind client.
- Add `src/hg_dt/analyze/deltas.py` to compute quantitative deltas for accessibility, contact maps, and expression, and identify distal loops and silenced elements.
- Update `src/tad_boundaries.py` with `compute_insulation_delta` and `compare_tad_boundaries`.
- Update `src/polymer_sim.py` with `polymer_from_contact_map` function to simulate 3D polymer from a direct contact matrix.
- Add integration test for *MYC* enhancer deletion simulation.
- Fix broken/flaky tests in `test_polymer_sim.py` and `test_tad_boundaries.py`.

Co-authored-by: AkeBoss-tech <69588353+AkeBoss-tech@users.noreply.github.com>
Pull request overview
This PR adds initial AlphaGenome integration primitives (API connector + delta computations) and extends existing TAD/polymer tooling and tests to support analyzing regulatory/3D organizational consequences from predicted tracks/contact maps.
Changes:
- Added an `AlphaGenomeConnector` wrapper (optional dependency) plus delta/loop/silencing utilities for interpreting AlphaGenome outputs.
- Extended polymer simulation with a `polymer_from_contact_map()` wrapper and adjusted tests for determinism.
- Added insulation-delta and TAD boundary comparison helpers; updated TAD boundary tests for NaN/None handling and single-boundary behavior.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `src/hg_dt/models/alphagenome.py` | Adds AlphaGenome SDK connector wrapper and output selection mapping. |
| `src/hg_dt/analyze/deltas.py` | Introduces helper functions to compute track/contact deltas and detect loops/silencing. |
| `src/tad_boundaries.py` | Adds insulation delta + TAD boundary comparison helpers; adjusts TAD interval calling for safer grouping/chromsize fallback. |
| `src/polymer_sim.py` | Adds `polymer_from_contact_map()` to run polymer simulation directly from a contact matrix. |
| `tests/test_alphagenome_integration.py` | New integration-style test using a mock AlphaGenome connector and MYC enhancer deletion scenario. |
| `tests/test_tad_boundaries.py` | Fixes boundary type set comparison around None vs NaN; adjusts single-boundary test call. |
| `tests/test_polymer_sim.py` | Seeds NumPy global RNG to make mocked cooler matrices reproducible across calls. |
| `tests/conftest.py` | Updates sample restraints fixture data. |
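
The None-vs-NaN set comparison mentioned for `tests/test_tad_boundaries.py` is not shown in this overview. As a minimal sketch of why such a comparison can be flaky (the helper name `normalize_missing` and the sample labels are hypothetical, not taken from the PR): two separately created NaN floats never compare equal, so sets containing them differ even when they look identical.

```python
import math


def normalize_missing(values):
    """Return a set in which every float NaN is replaced by None."""
    return {
        None if isinstance(v, float) and math.isnan(v) else v
        for v in values
    }


# Two separately created NaN objects are neither identical nor equal,
# so sets holding them do not compare equal.
a = {"strong", float("nan")}
b = {"strong", float("nan")}

print(a == b)                                        # False
print(normalize_missing(a) == normalize_missing(b))  # True
```

Mapping missing values to a single sentinel before comparing is one way to make such an assertion deterministic; the PR may have used a different approach.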

    def __array__(self):
        return self.values

`MockTrack.__array__` is missing the optional `dtype` parameter that NumPy may pass when converting via `np.array(...)`/`np.asarray(...)`. This can raise `TypeError: __array__() takes 1 positional argument but 2 were given` and break the test. Update `__array__` to accept `dtype=None` and return `np.asarray(self.values, dtype=dtype)` (or delegate to `self.values.__array__(dtype)` when available).

Suggested change:

    - def __array__(self):
    -     return self.values
    + def __array__(self, dtype=None):
    +     return np.asarray(self.values, dtype=dtype)
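
For illustration, here is a minimal, self-contained sketch of a `__array__` implementation that tolerates the arguments NumPy may pass. The class body is an assumption (the real `MockTrack` is not shown in this thread), and the extra `copy=None` parameter is an addition beyond the suggestion, included because NumPy 2.x may pass it as a keyword.

```python
import numpy as np


class MockTrack:
    """Stand-in for a track object that wraps a NumPy array (hypothetical body)."""

    def __init__(self, values):
        self.values = np.asarray(values)

    def __array__(self, dtype=None, copy=None):
        # NumPy may pass dtype (and, in NumPy 2.x, copy) during conversion.
        if copy:
            return np.array(self.values, dtype=dtype, copy=True)
        return np.asarray(self.values, dtype=dtype)


track = MockTrack([1, 2, 3])
as_float = np.asarray(track, dtype=np.float64)  # exercises the dtype argument
print(as_float.dtype)
```

With the original zero-argument signature, the `np.asarray(track, dtype=np.float64)` call is the kind of conversion that can trigger the `TypeError` described above.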

    with np.errstate(divide='ignore', invalid='ignore'):
        mean_ins = np.nanmean(ins)
        if mean_ins > 0:
            ins_score = np.log2(ins / mean_ins)

`compute_insulation_delta` can produce -inf/+inf values when `ins[i]` is 0 (e.g., at edges) and `mean_ins > 0`, due to `np.log2(ins / mean_ins)`. These infinities can later propagate into downstream consumers (e.g., `insulation_to_backbone_stiffness`), yielding NaNs. Consider adding a small pseudocount / clipping to keep `ins` strictly positive before the log, or explicitly replace non-finite values after the log.

Suggested change:

    - ins_score = np.log2(ins / mean_ins)
    + safe_ins = np.clip(ins, np.finfo(float).tiny, None)
    + ins_score = np.log2(safe_ins / mean_ins)
    + ins_score = np.where(np.isfinite(ins_score), ins_score, 0.0)
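
The two remedies named in the comment behave differently, which a small sketch can make concrete. The function names and the toy `ins` array below are assumptions for illustration, not code from the PR: masking non-finite values replaces the zero bin with a neutral score, whereas clipping to `np.finfo(float).tiny` yields a finite but very large negative score.

```python
import numpy as np


def log2_ratio_naive(ins):
    """log2(ins / mean) with no protection against zeros."""
    mean_ins = np.nanmean(ins)
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.log2(ins / mean_ins)


def log2_ratio_masked(ins, fill=0.0):
    """Replace non-finite scores (e.g. -inf from zero bins) with `fill`."""
    score = log2_ratio_naive(ins)
    return np.where(np.isfinite(score), score, fill)


def log2_ratio_clipped(ins):
    """Clip to the smallest positive normal float before taking the log."""
    mean_ins = np.nanmean(ins)
    safe_ins = np.clip(ins, np.finfo(float).tiny, None)
    return np.log2(safe_ins / mean_ins)


ins = np.array([0.0, 1.0, 2.0, 1.0])  # toy profile with a zero at the edge
naive = log2_ratio_naive(ins)         # first entry is -inf
masked = log2_ratio_masked(ins)       # first entry becomes 0.0
clipped = log2_ratio_clipped(ins)     # first entry is finite but extreme
print(naive, masked, clipped)
```

Because the clipped score is already finite, a subsequent `np.isfinite` mask would leave it unchanged; whether an extreme negative value or a neutral fill is preferable depends on what the downstream consumers expect.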

    chrom_len = clr.chromsizes.get(chrom_name, 100_000_000)  # Fallback to a large number if chromsizes missing or if testing with mock
    sorted_bounds = chrom_bounds.sort_values('start')

    positions = sorted_bounds[['start', 'end', 'boundary_class']].values.tolist()

The `chrom_len` fallback is hard-coded to `100_000_000`, which can create extremely large first/last TAD intervals when chromsizes are missing; these are then likely discarded by the `max_tad_length_bp` filter (default 3Mb), resulting in an empty output despite having boundaries. If chromsizes are unavailable, consider deriving a bounded `chrom_len` from the boundary positions (e.g., last boundary end + `max_tad_length_bp`) and/or skipping first/last TAD generation in that case.

Suggested change:

    - chrom_len = clr.chromsizes.get(chrom_name, 100_000_000)  # Fallback to a large number if chromsizes missing or if testing with mock
    - sorted_bounds = chrom_bounds.sort_values('start')
    - positions = sorted_bounds[['start', 'end', 'boundary_class']].values.tolist()
    + sorted_bounds = chrom_bounds.sort_values('start')
    + positions = sorted_bounds[['start', 'end', 'boundary_class']].values.tolist()
    + last_boundary_end = int(positions[-1][1]) if positions else 0
    + chrom_len = clr.chromsizes.get(chrom_name)
    + if chrom_len is None:
    +     # When chromsizes are unavailable (e.g. mocks/tests), avoid creating
    +     # artificially huge edge TADs that will be discarded by the
    +     # max_tad_length_bp filter.
    +     chrom_len = last_boundary_end + max_tad_length_bp
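
The fallback logic in the suggestion can be isolated into a small helper for testing. This is a sketch under stated assumptions: the helper name `bounded_chrom_len` is hypothetical, a plain `dict` stands in for `clr.chromsizes`, the coordinates are made up, and the 3 Mb default is taken from the comment above rather than from the source file.

```python
def bounded_chrom_len(chromsizes, chrom_name, positions, max_tad_length_bp=3_000_000):
    """Return the known chromosome length, or a bounded estimate if it is missing.

    positions: list of (start, end, boundary_class) entries.
    """
    chrom_len = chromsizes.get(chrom_name)
    if chrom_len is not None:
        return int(chrom_len)
    # No chromsizes entry: extend one maximum TAD length past the last boundary
    # so the trailing interval can still pass the max_tad_length_bp filter.
    last_boundary_end = max((int(end) for _, end, _ in positions), default=0)
    return last_boundary_end + max_tad_length_bp


positions = [(1_000_000, 1_010_000, "strong"), (2_000_000, 2_010_000, "weak")]

print(bounded_chrom_len({}, "chr8", positions))                    # 5010000
print(bounded_chrom_len({"chr8": 10_000_000}, "chr8", positions))  # 10000000
```

Using `max(...)` over all boundary ends, instead of `positions[-1][1]`, is a small deviation from the suggestion that avoids relying on the list being sorted.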

    def predict_sequence(
        self,
        sequence: str,
        organism="HUMAN",
        requested_outputs: Optional[List[str]] = None
    ):
        """
        Predict all relevant tracks for a 1 Mb sequence.

        Args:
            sequence: 1 Mb DNA sequence
            organism: "HUMAN" or "MOUSE"
            requested_outputs: list of strings (e.g. ['ATAC', 'CHIP_TF', 'CHIP_HISTONE', 'CAGE', 'CONTACT_MAPS'])
        """
        if requested_outputs is None:
            requested_outputs = [
                dna_client.OutputType.ATAC,
                dna_client.OutputType.CHIP_TF,
                dna_client.OutputType.CHIP_HISTONE,
                dna_client.OutputType.CAGE,
                dna_client.OutputType.CONTACT_MAPS
            ]
        else:
            # Map string to enum
            output_map = {
                'ATAC': dna_client.OutputType.ATAC,
                'CHIP_TF': dna_client.OutputType.CHIP_TF,
                'CHIP_HISTONE': dna_client.OutputType.CHIP_HISTONE,
                'CAGE': dna_client.OutputType.CAGE,
                'CONTACT_MAPS': dna_client.OutputType.CONTACT_MAPS,
                'RNA_SEQ': dna_client.OutputType.RNA_SEQ
            }
            requested_outputs = [output_map[req] for req in requested_outputs if req in output_map]

`requested_outputs` is typed/documented as `List[str]`, but the default value passed is a list of `dna_client.OutputType` enums. If a caller also passes enums, the current string-to-enum mapping will drop them (producing an empty/partial `requested_outputs`). Consider accepting `List[Union[str, dna_client.OutputType]]` and passing through enum values unchanged (and update the docstring accordingly).
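
One way to accept both strings and enum members is sketched below. The `OutputType` class here is a local stand-in that only mirrors the member names in the snippet; it is not the real `dna_client.OutputType`, and `normalize_outputs` is a hypothetical helper. Note one deliberate difference from the reviewed code: unknown names raise an error instead of being dropped silently.

```python
from enum import Enum
from typing import Iterable, List, Union


class OutputType(Enum):
    """Local stand-in for dna_client.OutputType (names copied from the snippet)."""
    ATAC = "ATAC"
    CHIP_TF = "CHIP_TF"
    CHIP_HISTONE = "CHIP_HISTONE"
    CAGE = "CAGE"
    CONTACT_MAPS = "CONTACT_MAPS"
    RNA_SEQ = "RNA_SEQ"


def normalize_outputs(requested: Iterable[Union[str, OutputType]]) -> List[OutputType]:
    """Map strings to enum members and pass enum members through unchanged."""
    normalized = []
    for req in requested:
        if isinstance(req, OutputType):
            normalized.append(req)
        elif isinstance(req, str) and req in OutputType.__members__:
            normalized.append(OutputType[req])
        else:
            raise ValueError(f"Unknown output type: {req!r}")
    return normalized


mixed = normalize_outputs(["ATAC", OutputType.CAGE, "CONTACT_MAPS"])
print([m.name for m in mixed])  # ['ATAC', 'CAGE', 'CONTACT_MAPS']
```

Raising on unknown names surfaces typos early; if silently skipping them is the intended behavior, the `else` branch could log and continue instead.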
This PR integrates AlphaGenome models to predict regulatory and 3D organizational consequences of DNA modifications.
It fulfills the requirements for "Work Order 02: AlphaGenome Integration & 3D Organization Delta" by building out the AlphaGenome API connector, the Track Delta Engine, refactoring TAD boundary computations, extending the 3D polymer simulation engine, and validating via an integration test of a mock MYC enhancer deletion.
PR created automatically by Jules for task 3754893517514412865 started by @AkeBoss-tech