Integrate AlphaGenome for Non-Coding Regulatory predictions #6

Open
AkeBoss-tech wants to merge 1 commit into setup-hg38-env-1687678791183735257 from alphagenome-integration-3754893517514412865

Conversation

@AkeBoss-tech
Owner

This PR integrates AlphaGenome models to predict regulatory and 3D organizational consequences of DNA modifications.

It fulfills the requirements for "Work Order 02: AlphaGenome Integration & 3D Organization Delta" by building out the AlphaGenome API connector, the Track Delta Engine, refactoring TAD boundary computations, extending the 3D polymer simulation engine, and validating via an integration test of a mock MYC enhancer deletion.


PR created automatically by Jules for task 3754893517514412865 started by @AkeBoss-tech

- Add AlphaGenomeConnector in `src/hg_dt/models/alphagenome.py` to interface with the DeepMind client.
- Add `src/hg_dt/analyze/deltas.py` to compute quantitative deltas for accessibility, contact maps, and expression, and identify distal loops and silenced elements.
- Update `src/tad_boundaries.py` with `compute_insulation_delta` and `compare_tad_boundaries`.
- Update `src/polymer_sim.py` with `polymer_from_contact_map` function to simulate 3D polymer from a direct contact matrix.
- Add integration test for *MYC* enhancer deletion simulation.
- Fix broken/flaky tests in `test_polymer_sim.py` and `test_tad_boundaries.py`.

Co-authored-by: AkeBoss-tech <69588353+AkeBoss-tech@users.noreply.github.com>
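
As a rough illustration of the kind of quantity a track delta computation produces, the sketch below takes reference and edited predictions and returns a per-bin difference and a log2 fold change. It is not the code in `src/hg_dt/analyze/deltas.py`: the function name `track_delta` and the `pseudocount` parameter are assumptions made for this example.

```python
import numpy as np


def track_delta(ref, alt, pseudocount=1e-6):
    """Hypothetical helper: per-bin difference and log2 fold change (alt vs. ref).

    Works on 1D tracks (e.g. accessibility) and 2D contact maps alike,
    as long as both inputs have the same shape.
    """
    ref = np.asarray(ref, dtype=float)
    alt = np.asarray(alt, dtype=float)
    if ref.shape != alt.shape:
        raise ValueError("ref and alt must have the same shape")

    diff = alt - ref
    # The pseudocount keeps the ratio finite where a bin is zero in either input.
    log2fc = np.log2((alt + pseudocount) / (ref + pseudocount))
    return diff, log2fc


diff, log2fc = track_delta([1.0, 2.0, 4.0], [1.0, 1.0, 8.0])
```

A deletion changes the sequence length, so real inputs would first need to be aligned to a common coordinate system; that step is outside this sketch.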
@google-labs-jules
Contributor

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down.

I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with @jules. You can find this option in the Pull Request section of your global Jules UI settings. You can always switch back!

New to Jules? Learn more at jules.google/docs.


For security, I will only act on instructions from the user who triggered this task.

Copilot AI review requested due to automatic review settings April 5, 2026 04:41
@chatgpt-codex-connector

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.


Copilot AI left a comment


Pull request overview

This PR adds initial AlphaGenome integration primitives (API connector + delta computations) and extends existing TAD/polymer tooling and tests to support analyzing regulatory/3D organizational consequences from predicted tracks/contact maps.

Changes:

  • Added an AlphaGenomeConnector wrapper (optional dependency) plus delta/loop/silencing utilities for interpreting AlphaGenome outputs.
  • Extended polymer simulation with a polymer_from_contact_map() wrapper and adjusted tests for determinism.
  • Added insulation-delta and TAD boundary comparison helpers; updated TAD boundary tests for NaN/None handling and single-boundary behavior.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| `src/hg_dt/models/alphagenome.py` | Adds AlphaGenome SDK connector wrapper and output selection mapping. |
| `src/hg_dt/analyze/deltas.py` | Introduces helper functions to compute track/contact deltas and detect loops/silencing. |
| `src/tad_boundaries.py` | Adds insulation delta + TAD boundary comparison helpers; adjusts TAD interval calling for safer grouping/chromsize fallback. |
| `src/polymer_sim.py` | Adds `polymer_from_contact_map()` to run polymer simulation directly from a contact matrix. |
| `tests/test_alphagenome_integration.py` | New integration-style test using a mock AlphaGenome connector and MYC enhancer deletion scenario. |
| `tests/test_tad_boundaries.py` | Fixes boundary type set comparison around None vs NaN; adjusts single-boundary test call. |
| `tests/test_polymer_sim.py` | Seeds NumPy global RNG to make mocked cooler matrices reproducible across calls. |
| `tests/conftest.py` | Updates sample restraints fixture data. |

Comment on lines +23 to +24
    def __array__(self):
        return self.values
Copilot AI Apr 5, 2026

MockTrack.__array__ is missing the optional dtype parameter that NumPy may pass when converting via np.array(...)/np.asarray(...). This can raise TypeError: __array__() takes 1 positional argument but 2 were given and break the test. Update __array__ to accept dtype=None and return np.asarray(self.values, dtype=dtype) (or delegate to self.values.__array__(dtype) when available).

Suggested change
    - def __array__(self):
    -     return self.values
    + def __array__(self, dtype=None):
    +     return np.asarray(self.values, dtype=dtype)
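
A minimal, self-contained sketch of the suggested fix follows. The `MockTrack` class here is a reconstruction for illustration only, not the class from the test file. The extra `copy` keyword is an addition beyond the suggestion, included because NumPy 2 may also pass `copy` to `__array__`.

```python
import numpy as np


class MockTrack:
    """Illustrative stand-in for the test's mock track object."""

    def __init__(self, values):
        self.values = np.asarray(values)

    def __array__(self, dtype=None, copy=None):
        # NumPy may pass dtype (and, in NumPy 2, copy) during conversion.
        arr = np.asarray(self.values, dtype=dtype)
        return arr.copy() if copy else arr


track = MockTrack([1, 2, 3])
as_float = np.asarray(track, dtype=float)  # conversion with an explicit dtype
as_default = np.array(track)               # conversion without a dtype
```

With this signature, both conversions go through the same method regardless of whether a dtype is requested.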

Comment thread src/tad_boundaries.py
    with np.errstate(divide='ignore', invalid='ignore'):
        mean_ins = np.nanmean(ins)
        if mean_ins > 0:
            ins_score = np.log2(ins / mean_ins)
Copilot AI Apr 5, 2026

compute_insulation_delta can produce -inf/+inf values when ins[i] is 0 (e.g., at edges) and mean_ins > 0, due to np.log2(ins / mean_ins). These infinities can later propagate into downstream consumers (e.g., insulation_to_backbone_stiffness), yielding NaNs. Consider adding a small pseudocount / clipping to keep ins strictly positive before the log, or explicitly replace non-finite values after the log.

Suggested change
    - ins_score = np.log2(ins / mean_ins)
    + safe_ins = np.clip(ins, np.finfo(float).tiny, None)
    + ins_score = np.log2(safe_ins / mean_ins)
    + ins_score = np.where(np.isfinite(ins_score), ins_score, 0.0)
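
The second option mentioned above (replacing non-finite values after the log) can be sketched on its own as follows. The function name `log2_insulation` and the `fill_value` parameter are hypothetical and not part of `src/tad_boundaries.py`. One point that may be worth checking in the clipping variant: clipping to `np.finfo(float).tiny` keeps the result finite, so the `np.isfinite` filter would leave a very large negative score in place rather than replacing it.

```python
import numpy as np


def log2_insulation(ins, fill_value=0.0):
    """Hypothetical helper: log2(ins / mean) with non-finite results replaced."""
    ins = np.asarray(ins, dtype=float)
    if ins.size == 0 or np.all(np.isnan(ins)):
        return np.full_like(ins, fill_value)

    mean_ins = np.nanmean(ins)
    if mean_ins <= 0:
        # No meaningful normalization is possible; return a neutral track.
        return np.full_like(ins, fill_value)

    with np.errstate(divide="ignore", invalid="ignore"):
        score = np.log2(ins / mean_ins)

    # Zeros (log -> -inf) and NaNs are mapped to fill_value.
    return np.where(np.isfinite(score), score, fill_value)


scores = log2_insulation([0.0, 1.0, 2.0, 1.0])
```

Whether 0.0 is the right fill value depends on how downstream consumers such as `insulation_to_backbone_stiffness` interpret a neutral score; masking with NaN and handling it explicitly is another option.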

Comment thread src/tad_boundaries.py
Comment on lines +336 to +339
    chrom_len = clr.chromsizes.get(chrom_name, 100_000_000)  # Fallback to a large number if chromsizes missing or if testing with mock
    sorted_bounds = chrom_bounds.sort_values('start')

    positions = sorted_bounds[['start', 'end', 'boundary_class']].values.tolist()
Copilot AI Apr 5, 2026

The chrom_len fallback is hard-coded to 100_000_000, which can create extremely large first/last TAD intervals when chromsizes are missing; these are then likely discarded by the max_tad_length_bp filter (default 3Mb), resulting in an empty output despite having boundaries. If chromsizes are unavailable, consider deriving a bounded chrom_len from the boundary positions (e.g., last boundary end + max_tad_length_bp) and/or skipping first/last TAD generation in that case.

Suggested change
    - chrom_len = clr.chromsizes.get(chrom_name, 100_000_000) # Fallback to a large number if chromsizes missing or if testing with mock
    - sorted_bounds = chrom_bounds.sort_values('start')
    - positions = sorted_bounds[['start', 'end', 'boundary_class']].values.tolist()
    + sorted_bounds = chrom_bounds.sort_values('start')
    + positions = sorted_bounds[['start', 'end', 'boundary_class']].values.tolist()
    + last_boundary_end = int(positions[-1][1]) if positions else 0
    + chrom_len = clr.chromsizes.get(chrom_name)
    + if chrom_len is None:
    +     # When chromsizes are unavailable (e.g. mocks/tests), avoid creating
    +     # artificially huge edge TADs that will be discarded by the
    +     # max_tad_length_bp filter.
    +     chrom_len = last_boundary_end + max_tad_length_bp
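
The fallback logic can also be isolated into a small function that is easy to test without a cooler object. This is a sketch under assumptions: `resolve_chrom_len` is a hypothetical name, `chromsizes` is modeled as a plain mapping, and the 3 Mb default mirrors the `max_tad_length_bp` default mentioned in the comment.

```python
from typing import Iterable, Mapping


def resolve_chrom_len(
    chromsizes: Mapping[str, int],
    chrom_name: str,
    boundary_ends: Iterable[int],
    max_tad_length_bp: int = 3_000_000,
) -> int:
    """Hypothetical helper: chromosome length with a bounded fallback."""
    chrom_len = chromsizes.get(chrom_name)
    if chrom_len is not None:
        return int(chrom_len)

    # No recorded size: bound the last TAD by the largest boundary end plus
    # the maximum allowed TAD length, so it can survive the length filter.
    last_end = max(boundary_ends, default=0)
    return int(last_end) + max_tad_length_bp


known = resolve_chrom_len({"chr8": 145_138_636}, "chr8", [1_000_000, 5_000_000])
fallback = resolve_chrom_len({}, "chr8", [1_000_000, 5_000_000])
```

Here `known` uses the recorded size, while `fallback` is derived from the boundaries, so the trailing interval is at most `max_tad_length_bp` long.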

Comment on lines +27 to +60
    def predict_sequence(
        self,
        sequence: str,
        organism="HUMAN",
        requested_outputs: Optional[List[str]] = None
    ):
        """
        Predict all relevant tracks for a 1 Mb sequence.

        Args:
            sequence: 1 Mb DNA sequence
            organism: "HUMAN" or "MOUSE"
            requested_outputs: list of strings (e.g. ['ATAC', 'CHIP_TF', 'CHIP_HISTONE', 'CAGE', 'CONTACT_MAPS'])
        """
        if requested_outputs is None:
            requested_outputs = [
                dna_client.OutputType.ATAC,
                dna_client.OutputType.CHIP_TF,
                dna_client.OutputType.CHIP_HISTONE,
                dna_client.OutputType.CAGE,
                dna_client.OutputType.CONTACT_MAPS
            ]
        else:
            # Map string to enum
            output_map = {
                'ATAC': dna_client.OutputType.ATAC,
                'CHIP_TF': dna_client.OutputType.CHIP_TF,
                'CHIP_HISTONE': dna_client.OutputType.CHIP_HISTONE,
                'CAGE': dna_client.OutputType.CAGE,
                'CONTACT_MAPS': dna_client.OutputType.CONTACT_MAPS,
                'RNA_SEQ': dna_client.OutputType.RNA_SEQ
            }
            requested_outputs = [output_map[req] for req in requested_outputs if req in output_map]

Copilot AI Apr 5, 2026

requested_outputs is typed/documented as List[str], but the default value passed is a list of dna_client.OutputType enums. If a caller also passes enums, the current string-to-enum mapping will drop them (producing an empty/partial requested_outputs). Consider accepting List[Union[str, dna_client.OutputType]] and passing through enum values unchanged (and update the docstring accordingly).
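
One way to accept both strings and enum members is sketched below. To keep the example runnable without the AlphaGenome SDK, it uses a local `OutputType` enum as a stand-in for `dna_client.OutputType`. The function name `normalize_outputs` is hypothetical, and raising on unknown names (rather than silently dropping them, as the current code does) is a design choice of this sketch.

```python
from enum import Enum
from typing import List, Optional, Sequence, Union


class OutputType(Enum):
    """Stand-in for dna_client.OutputType, for illustration only."""
    ATAC = "ATAC"
    CHIP_TF = "CHIP_TF"
    CHIP_HISTONE = "CHIP_HISTONE"
    CAGE = "CAGE"
    CONTACT_MAPS = "CONTACT_MAPS"
    RNA_SEQ = "RNA_SEQ"


DEFAULT_OUTPUTS = [
    OutputType.ATAC,
    OutputType.CHIP_TF,
    OutputType.CHIP_HISTONE,
    OutputType.CAGE,
    OutputType.CONTACT_MAPS,
]


def normalize_outputs(
    requested: Optional[Sequence[Union[str, OutputType]]] = None,
) -> List[OutputType]:
    """Hypothetical helper: resolve a mix of names and enum members."""
    if requested is None:
        return list(DEFAULT_OUTPUTS)

    resolved = []
    for item in requested:
        if isinstance(item, OutputType):
            resolved.append(item)  # enum members pass through unchanged
            continue
        try:
            resolved.append(OutputType[item])  # look up by member name
        except KeyError:
            raise ValueError(f"Unknown output type: {item!r}") from None
    return resolved


mixed = normalize_outputs(["ATAC", OutputType.CAGE])
```

Failing loudly on an unrecognized name makes a typo such as `'CONTACT_MAP'` visible immediately instead of producing a shorter request list.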
